Informatics team finds simple rules that explain universal laws of written text

Similarity Cloud for 'mac' vs 'pc'

Alessandro Flammini and Filippo Menczer, along with M. Ángeles Serrano from the University of Barcelona, have authored a paper entitled “Modeling Statistical Properties of Written Text,” published in PLoS ONE. The paper introduces and validates a generative model that explains, from simple rules, the simultaneous emergence of statistical patterns observed in written text across many languages. It focuses on the well-known Zipf’s law of word frequencies, along with additional patterns such as Heaps’ law of word diversity, the bursty nature of rare words, and similarity among documents. Through their model, the researchers found a connection between word burstiness and the topicality of text. They also identify dynamic word ranking and memory across documents as key mechanisms that explain the organization of written text. The semantic similarity between topics, one of the features the model aims to explain, is visualized by the Similarity Cloud, an online tool developed by computer science graduate student Mark Meiss. The model and the paper's findings could lead to improved techniques for identifying key terms that capture the topics of a Web page, which is crucial for matching search queries to relevant results and ads.
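For readers who want to see the two named laws in data, here is a minimal sketch of how each could be measured on a tokenized text. It is not the generative model from the paper. The function names `rank_frequency` and `vocabulary_growth` and the toy sentence are assumptions made for illustration. Zipf's law concerns how word counts fall off with frequency rank, and Heaps' law concerns how the number of distinct words grows with text length.

```python
from collections import Counter


def rank_frequency(tokens):
    """Word counts sorted from most to least frequent (input for a Zipf plot)."""
    return [count for _, count in Counter(tokens).most_common()]


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (input for a Heaps plot)."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


# Toy text, far too short to show either law; real tests need a large corpus.
tokens = "the cat sat on the mat and the dog sat on the log".split()
freqs = rank_frequency(tokens)
growth = vocabulary_growth(tokens)
```

On a large corpus, plotting `freqs` against rank and `growth` against position, both on logarithmic axes, is the usual way to check whether the curves look roughly like straight lines, as the two laws predict.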