Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf’s law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf’s law for word frequencies, in this project we focus on burstiness, Heaps’ law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts, and we identify dynamic word ranking and memory across documents as key mechanisms explaining the nontrivial organization of written text.
This research has been published in PLoS ONE and can have practical applications in text mining. For example, improved algorithms to detect the topic of a document could lead to better contextual ad matching techniques. The Similarity Cloud illustrates the shared topicality between two documents or collections, which is captured in our model. See the press release.
Prior work explored models for the growth of the Web’s link graph that take into account the content of pages. A 2002 paper in PNAS studied the relationship between the similarity among pages and the probability that they are linked to each other. A 2004 paper in PNAS proposed a model incorporating these findings to reproduce not only the link structure but also the distribution of similarity between linked pages, given a background distribution of similarity among any two pages. Combining the text and link generation models will give us a framework to understand the entire process by which information is created and linked on the Web.
We analyzed three data sets:
- The Industry Sector data set is available from Andrew McCallum’s data repository.
- Our crawl of 150,000 Web pages (459MB) sampled from the Open Directory Project (ODP) in 2007. The crawl contains 15 directories, one per top-level category, each holding 10,000 random HTML pages.
- Our set of 100,000 Wikipedia topic pages (38MB) randomly sampled from an English Wikipedia dump in 2007. Built by Jacob Ratkiewicz, this is a single file where topic term vectors are represented in sequence, separated by empty lines. Each topic is a sequence of lines, one line per word. Each line has two columns separated by spaces. The first column is a term ID, the second is the number of times the corresponding term occurs in the topic.
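The Wikipedia file layout described above (one "term ID, count" pair per line, with topics separated by empty lines) can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the function name `read_topic_vectors` is hypothetical, term IDs are kept as strings because the description does not say whether they are numeric, and the inline sample is made up purely to illustrate the layout.

```python
from typing import Dict, Iterable, Iterator


def read_topic_vectors(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield one {term_id: count} mapping per topic.

    Assumes the layout described in the text: one whitespace-separated
    "term_id count" pair per line, topics separated by empty lines.
    """
    current: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # An empty line closes the current topic (if any).
            if current:
                yield current
                current = {}
            continue
        term_id, count = line.split()
        # Accumulate in case a term ID is repeated within one topic.
        current[term_id] = current.get(term_id, 0) + int(count)
    if current:
        # The last topic may not be followed by an empty line.
        yield current


# Made-up sample with two topics, for illustration only.
sample = "12 3\n40 1\n\n12 1\n7 5\n"
topics = list(read_topic_vectors(sample.splitlines()))

# Document length (total tokens) and vocabulary size (distinct terms),
# the two quantities related by Heaps' law.
lengths = [sum(t.values()) for t in topics]
vocab_sizes = [len(t) for t in topics]
```

For the real file, the same function could be applied to an open file handle, for example `read_topic_vectors(open(path))`, where `path` is wherever the downloaded file was saved.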
For each data set we removed HTML/Wiki markup, then conflated the remaining terms using the Porter Stemming Algorithm.
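The preprocessing step can be approximated as follows. This is a rough sketch, not the pipeline we actually ran: the helper names `strip_html` and `tokenize_and_stem` are hypothetical, only HTML markup is handled (Wiki markup is not), the tokenizer keeps lowercase ASCII letter runs as a simplifying assumption, and the stemmer is passed in as a function so that any Porter implementation (for instance `nltk.stem.PorterStemmer().stem`, if NLTK is installed) can be plugged in.

```python
import re
from html.parser import HTMLParser
from typing import Callable, List


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the visible text of an HTML string, with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize_and_stem(text: str,
                      stem: Callable[[str], str] = lambda w: w) -> List[str]:
    """Lowercase, split into letter runs, and conflate terms with `stem`.

    The default `stem` is the identity; pass a Porter stemmer's stem
    function here to conflate terms as described in the text.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens]


page = "<p>Hello <b>world</b></p><script>var x = 1;</script>"
terms = tokenize_and_stem(strip_html(page))
```

Keeping the stemmer as a parameter means the markup removal and tokenization can be checked on their own before a real Porter implementation is swapped in.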