Web Traffic Analysis & Modeling

The Web traffic analysis group studies the structure and dynamics of the networks formed by HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows us to examine large volumes of traffic data while minimizing biases associated with other data sources. It also gives us valuable referrer information that we can use to reconstruct the subset of the Web graph actually traversed by users.
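As a rough illustration of how referrer information can yield a traversed subgraph, the sketch below counts clicks along each (referrer, target) edge. It is only a minimal sketch, not the group's pipeline: the input format (pairs of URL strings), the function names, and the choice to aggregate pages by host name are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def host(url):
    """Return the lowercased host part of a URL (hypothetical aggregation level)."""
    return urlsplit(url).netloc.lower()


def traffic_graph(requests):
    """Count clicks per directed (referrer host, target host) edge.

    `requests` is an iterable of (referrer_url, target_url) pairs.
    Requests with an empty referrer carry no edge information and are skipped.
    """
    edges = Counter()
    for referrer, target in requests:
        if referrer:
            edges[(host(referrer), host(target))] += 1
    return edges


# Made-up example clicks, for illustration only.
clicks = [
    ("http://a.example/", "http://b.example/p"),
    ("http://a.example/q", "http://b.example/r"),
    ("", "http://c.example/"),
    ("http://b.example/p", "http://c.example/"),
]
graph = traffic_graph(clicks)
```

The resulting edge weights are click counts, so the graph contains only links that users actually followed, not every link present on a page.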

We study the structure and dynamics of this Web subgraph with the goal of developing a better understanding of user behavior online and creating more realistic models of Web traffic. The potential applications of this analysis include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.

Among our more intriguing findings to date are that server traffic (as measured by number of clicks) and site popularity (as measured by distinct users) both follow distributions so broad that they lack any well-defined mean. Actual Web traffic turns out to violate three assumptions of the random surfer model: users don’t start from any page at random, they don’t follow outgoing links with equal probability, and their probability of jumping is dependent on their current location. Search engines appear to be directly responsible for a smaller share of Web traffic than often supposed. These results were presented at WSDM2008 (paper | talk).
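For reference, the three assumptions of the random surfer model can be made concrete with a small simulation. This is a generic sketch of the model as it is usually stated, not code from our study; the toy link graph, the function name, and the jump probability of 0.15 are illustrative assumptions.

```python
import random
from collections import Counter


def random_surfer(links, steps, jump_prob=0.15, seed=0):
    """Simulate a random surfer and return visit counts per page.

    `links` maps each page to a list of pages it links to; every
    linked page must also appear as a key.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    # Assumption 1: the surfer starts from a page chosen uniformly at random.
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        outgoing = links[current]
        # Assumption 3: the jump probability is the same on every page.
        # Pages without outgoing links always trigger a jump.
        if not outgoing or rng.random() < jump_prob:
            current = rng.choice(pages)
        else:
            # Assumption 2: each outgoing link is equally likely to be followed.
            current = rng.choice(outgoing)
    return visits


# Hypothetical four-page site, for illustration only.
toy_links = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home", "news"],
    "about": [],
}
visits = random_surfer(toy_links, steps=1000)
```

Measured traffic departs from each of the three commented choices: starting pages are not uniform, links are not followed with equal probability, and the tendency to jump depends on the current page.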

Our most recent study (presented at Hypertext 2009) examined the conventional notion of a Web session as a sequence of requests terminated by an inactivity timeout. Such a definition turns out to yield statistics that depend primarily on the timeout value selected, and we find that no particular choice of that value is better justified than another. For that reason, we have proposed logical sessions defined by the target and referrer URLs present in a user’s Web requests. We are currently working on improved models of Web surfing that are able to recreate not only the broad distribution of traffic, but also the basic statistics of logical sessions. Late-breaking results were presented at WSDM2009.
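The contrast between the two session definitions can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not the procedure from the paper: the `Request` record, both function names, and the rule of attaching a request to the session that most recently requested its referrer are hypothetical choices for the example.

```python
from collections import namedtuple

# Hypothetical record for one request by a single user; time is in seconds.
Request = namedtuple("Request", ["time", "referrer", "target"])


def timeout_sessions(requests, timeout):
    """Split time-ordered requests wherever the idle gap exceeds `timeout`."""
    sessions = []
    for req in requests:
        if sessions and req.time - sessions[-1][-1].time <= timeout:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions


def logical_sessions(requests):
    """Group time-ordered requests by following referrer URLs.

    A request joins the session that most recently requested its referrer.
    A request with no referrer, or with a referrer not seen before,
    starts a new session.
    """
    sessions = []
    latest = {}  # URL -> index of the session that most recently requested it
    for req in requests:
        idx = latest.get(req.referrer) if req.referrer else None
        if idx is None:
            sessions.append([])
            idx = len(sessions) - 1
        sessions[idx].append(req)
        latest[req.target] = idx
    return sessions


# Made-up example: two interleaved browsing threads with a long pause.
reqs = [
    Request(0, None, "a.example/"),
    Request(10, "a.example/", "a.example/x"),
    Request(20, None, "b.example/"),
    Request(2000, "a.example/x", "a.example/y"),
    Request(2010, "b.example/", "b.example/z"),
]
by_timeout = timeout_sessions(reqs, timeout=1800)
by_referrer = logical_sessions(reqs)
```

In this example the timeout rule mixes the two threads and cuts both at the pause, and changing the timeout changes the number of sessions. The referrer rule keeps each thread together regardless of timing.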

Animations are available to visualize the session trees in actual traffic data or as predicted by our ABC model (paper coming soon). The trees you see will tend to be the more interesting examples rather than the more common, boring type containing only a couple of pages.

Project Participants

Mark Meiss

Bruno Gonçalves

Fil Menczer, PI

Sandro Flammini

Jose Ramasco

Alex Vespignani

Santo Fortunato

Support

Mark Meiss is supported by the Advanced Network Management Laboratory, which is one of the Pervasive Technology Labs established at Indiana University with the assistance of the Lilly Endowment.
This research is also supported in part by the National Science Foundation under awards 0348940, 0513650, and 0705676.
This research is also supported in part by the Institute for Information Infrastructure Protection (I3P) research program. The I3P is managed by Dartmouth College and supported under Award Number 2003-TK-TX-0003 from the U.S. DHS, Science and Technology Directorate.

Opinions, findings, conclusions, recommendations or points of view of this group are those of the authors and do not necessarily represent the official position of the U.S. Department of Homeland Security, Science and Technology Directorate, I3P, National Science Foundation, or Indiana University.