Scholarometer is becoming a more mature tool. The idea behind Scholarometer — crowdsourcing scholarly data — was presented at the Web Science 2010 Conference in Raleigh, North Carolina, along with some promising preliminary results. Recently added functionality includes a Chrome version, percentile calculations for all impact measures, export of bibliographic data in various standard formats, and heuristics to determine reliable tags and detect ambiguous names. Next up: an API to share annotation and impact data, and an interactive visualization of the interdisciplinary network.
The IEEE Spectrum piece Real-Time Search Stumbles Out of the Gate discusses the recent integration of real-time search features, such as Twitter and other microblog entries, into major search engines. Professor Filippo Menczer, CNetS associate director, comments in the article on the challenges posed by real-time search. Here is an excerpt of the interview:
IU’s Menczer suggests that with all this user-generated content, the environment is more complex than the one Google’s PageRank algorithm had to deal with. While search used to be about relationships between pages, he explains, now it’s about relationships between “people, tags, Web pages, ratings, votes, and direct social links… It may not be that page A points to page B but rather that user John follows Mary and replies to the tweet of Jane and retweets it.” That makes it “a more complicated ecosystem,” he says, “but a very rich one,” and search engines will need “more sophisticated ways to extract data from these relationships” […]
CNetS graduate student Diep Thi Hoang and associate director Filippo Menczer have developed a tool (called Scholarometer, previously Tenurometer in beta version) for evaluating the impact of scholars in their field. Scholarometer uses the h-index, which combines the scholarly output with the influence of the work, but adds the universal h-index proposed by Radicchi et al. to compare the impact of research in different disciplines. This is enabled by a social mechanism in which users of the tool collaborate to tag the disciplines of the scholars. “We have computer scientists, physicists, social scientists, people from many different backgrounds, who publish in lots of different areas,” says Menczer. However, the various communities have different citation methods and different publishing traditions, making it difficult to compare the influence of a sociologist and a computer scientist, for example. The universal h-index controls for differences in the publishing traditions, as well as the amount of research scholars in various fields have to produce to make an impact. Menczer is especially excited about the potential to help show how the disciplines are merging into one another. More from Inside Higher Ed… (Also picked up by ACM TechNews and CACM.)
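To make the two metrics concrete, here is a minimal sketch of the h-index and a simplified version of the universal h-index. The citation counts and the field average are made up for illustration, and the rescaling shown is a simplification of the full normalization in Radicchi et al. (which also accounts for field productivity):

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

def universal_h_index(citations, field_avg):
    """Simplified universal h-index: rescale each paper's citations by the
    average citations per paper in its discipline, then apply the same
    h-index rule to the rescaled counts. field_avg is a hypothetical
    discipline-level statistic, not a real published value."""
    rescaled = sorted((c / field_avg for c in citations), reverse=True)
    h = 0
    for rank, c in enumerate(rescaled, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

papers = [45, 30, 22, 10, 6, 3, 1]       # citation counts for one scholar
print(h_index(papers))                    # -> 5
print(universal_h_index(papers, 8.0))     # -> 2 with this illustrative average
```

Dividing by the field's average citation rate is what lets the tool compare, say, a sociologist with a computer scientist: a paper that is heavily cited *for its field* counts more than one that merely benefits from a high-citation discipline.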
Fil Menczer is one of the organizers of Hypertext 2009, the 20th ACM Conference on Hypertext and Hypermedia. The conference will be held June 29-July 1 at the Villa Gualino Convention Centre, on the hills overlooking Torino, Italy. Hypertext is the main venue for high-quality peer-reviewed research on “linking.” The Web, the Semantic Web, Web 2.0, and social networks are all manifestations of the success of the link. With a 70% increase in submissions, Hypertext 2009 will have a strong and diverse technical program covering all research concerning links: their semantics, their presentation, their applications, as well as the knowledge that can be derived from their analysis and their effects on society. The conference will also feature demos, posters, a student research competition, four workshops, and keynotes by Lada Adamic and Ricardo Baeza-Yates.
We study the structure and dynamics of Web traffic networks based on data from HTTP requests made by users at Indiana University. Gathering anonymized requests directly from the network rather than relying on server logs and browser instrumentation allows us to examine large volumes of traffic data while minimizing biases associated with other data sources. It also gives us valuable referrer information that we can use to reconstruct the subset of the Web graph actually traversed by users.
Our Web traffic (click) dataset is available!
Our goal is to develop a better understanding of user behavior online and to create more realistic models of Web traffic. The potential applications of this analysis include improved designs for networks, sites, and server software; more accurate forecasting of traffic trends; classification of sites based on the patterns of activity they inspire; and improved ranking algorithms for search results.
Among our more intriguing findings are that server traffic (as measured by number of clicks) and site popularity (as measured by distinct users) both follow distributions so broad that they lack any well-defined mean. Actual Web traffic turns out to violate three assumptions of the random surfer model: users don’t start from any page at random, they don’t follow outgoing links with equal probability, and their probability of jumping is dependent on their current location. Search engines appear to be directly responsible for a smaller share of Web traffic than often supposed. These results were presented at WSDM2008 (paper | talk).
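The three violated assumptions are easiest to see when the random surfer model is written down with each assumption as an explicit, swappable choice. The toy graph and parameter values below are illustrative only, not drawn from the paper's data:

```python
import random

def random_surfer(graph, steps, jump_prob=0.15, seed=42):
    """Classic random-surfer simulation over graph: dict page -> out-links.
    The three assumptions that empirical traffic violates are marked."""
    rng = random.Random(seed)
    pages = list(graph)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)                    # assumption 1: start page chosen
    for _ in range(steps):                      #   uniformly at random
        visits[page] += 1
        out = graph[page]
        if not out or rng.random() < jump_prob: # assumption 3: jump probability
            page = rng.choice(pages)            #   independent of current page
        else:
            page = rng.choice(out)              # assumption 2: outgoing links
    return visits                               #   followed with equal probability

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
traffic = random_surfer(toy_web, steps=10_000)
print(traffic)
```

Replacing the uniform choices above with empirical distributions (popular start pages, biased link selection, page-dependent jump rates) is the kind of refinement the traffic data motivates.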
Another paper (also here; presented at Hypertext 2009) examined the conventional notion of a Web session as a sequence of requests terminated by an inactivity timeout. Such a definition turns out to yield statistics dependent primarily on the timeout value selected, which we find to be arbitrary. For that reason, we have proposed logical sessions defined by the target and referrer URLs present in a user’s Web requests.
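The referrer-based definition can be sketched in a few lines: each request joins the session that already contains its referrer URL, and a request with an unseen referrer starts a new session. The field names and sample log below are hypothetical, not the actual dataset format:

```python
def logical_sessions(requests):
    """requests: timestamp-ordered dicts with 'url' and 'referrer' keys.
    Returns a list of sessions; each request is attached to the session
    containing its referrer, so no inactivity timeout is needed."""
    sessions = []
    session_of = {}                  # url -> index of the session containing it
    for req in requests:
        ref = req.get("referrer")
        if ref in session_of:        # referrer already seen: same session
            idx = session_of[ref]
        else:                        # unknown referrer: a new session root
            idx = len(sessions)
            sessions.append([])
        sessions[idx].append(req)
        session_of[req["url"]] = idx
    return sessions

log = [
    {"url": "a.com/1", "referrer": None},
    {"url": "b.com/x", "referrer": None},        # a parallel session
    {"url": "a.com/2", "referrer": "a.com/1"},   # continues the first session
    {"url": "b.com/y", "referrer": "b.com/x"},
]
print(len(logical_sessions(log)))  # -> 2
```

Note how two interleaved browsing threads are separated correctly here, whereas any timeout-based definition would merge them into one session or split them arbitrarily.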
Inspired by these findings, we designed a model of Web surfing able to recreate not only the broad distribution of traffic, but also the basic statistics of logical sessions. Late-breaking results were presented at WSDM2009. Our final report on the ABC model was presented at WAW 2010.
|Mark Meiss was supported by the Advanced Network Management Laboratory, one of the Pervasive Technology Labs established at Indiana University with the assistance of the Lilly Endowment.|
|This research was also supported in part by the National Science Foundation under awards 0348940, 0513650, and 0705676.|
|This research was also supported in part by the Institute for Information Infrastructure Protection research program. The I3P is managed by Dartmouth College and supported under Award Number 2003-TK-TX-0003 from the U.S. DHS, Science and Technology Directorate.|
Opinions, findings, conclusions, recommendations or points of view of this group are those of the authors and do not necessarily represent the official position of the U.S. Department of Homeland Security, Science and Technology Directorate, I3P, National Science Foundation, or Indiana University.
The Web Dynamics group works to build a better understanding of how the Web, Wikipedia, and similar large information networks grow and change over their lifetime. Of particular interest is how nodes in these networks gain popularity.
Our preliminary work has painted a picture of the Web as a place in which popularity is very dynamic and unpredictable. Surges in popularity for topics are similar to earthquakes and avalanches in terms of their unpredictability — both in when they will happen and on what scale. However, we find that spikes in popularity are often correlated with events in the news — as evidenced by positive correlation between Google Trends data and traffic to bursty Wikipedia topics. Work on this project has been presented at SocialCom 2010 Symposium on Social Intelligence and Networking (SIN-10). A review of these issues with an emphasis on the modeling problem was also published in Physical Review Letters.
Further research is focused on modeling — and predicting — popularity bursts, as well as exploration of other networks and sources of data, such as Twitter.
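The kind of correlation check mentioned above can be sketched as a plain Pearson correlation between a search-interest time series and page-view counts for the same topic. The two series below are invented for illustration and stand in for real Google Trends and Wikipedia traffic data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

trends    = [10, 12, 11, 90, 40, 15, 12]               # e.g. weekly search interest
pageviews = [1000, 1100, 1050, 8000, 3500, 1300, 1150] # matching article traffic
print(round(pearson(trends, pageviews), 3))            # close to 1: the burst
                                                       # appears in both series
```

A shared burst like the one at position four drives the coefficient toward 1 even when the baselines differ by orders of magnitude, which is why such spikes are good evidence of news-driven attention.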
The goal of Sixearch (carl.cs.indiana.edu/6S) is to provide an open-source platform for developing a context-aware, personalized, peer-to-peer (P2P) distributed information retrieval system. The application currently supports scalable collaborative Web search.
Sixearch uses the idea of modeling neighbor nodes by their content, without assuming the presence of special directory hubs. As shown on the left, each peer is both a (limited) directory hub and a content provider; it has its own topical crawler, guided by its user’s information content, and its own local search engine. Peer communication is built on the JXTA platform. When a user submits a query, it is first matched against the local engine and then routed to neighbor peers to obtain more results. Ideally, intelligent collaboration between the peers should lead to the emergence of a clustered network topology. While traditional search engines such as Google and Yahoo provide access to very large document collections, the Sixearch P2P Web search application provides a complementary way for users to actively and collaboratively share their own document collections. Moreover, the Sixearch framework allows traditional search engines to be included naturally as peers; such peers would quickly emerge as reliable, trustworthy, and general authority nodes.
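The local-first, then-route query flow can be sketched as follows. The class and method names are hypothetical, the matching is naive substring search, and a simple TTL stands in for Sixearch's learned routing, so this is an illustration of the flow rather than the actual Sixearch API:

```python
class Peer:
    """Toy peer: a local document collection plus known neighbor peers."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents   # local collection: list of strings
        self.neighbors = []          # peers discovered/learned over time

    def local_search(self, query):
        # Stand-in for the peer's local search engine.
        return [d for d in self.documents if query.lower() in d.lower()]

    def search(self, query, ttl=2):
        """Answer from the local index first, then forward the query to
        neighbors until the hop budget (TTL) is exhausted."""
        results = {(self.name, d) for d in self.local_search(query)}
        if ttl > 0:
            for peer in self.neighbors:
                results |= peer.search(query, ttl - 1)
        return results

a = Peer("A", ["P2P search overview", "crawler notes"])
b = Peer("B", ["distributed P2P routing"])
a.neighbors.append(b)
print(a.search("p2p"))   # hits from both the local index and the neighbor
```

In the real system, which neighbors receive a query is learned from past interactions, which is what drives the clustered topology described above.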
The right figure displays a screenshot of the queries being sent among peers. Peer interactions are visualized by an applet. The area of each node is proportional to the size of its Web index. The edges represent the queries exchanged between two peers. The connectivity of each peer is an indirect measure of centrality, authority, and/or reliability of the peer as learned by the other peers.
Our work on Sixearch has been published in AAAI Magazine (preprint), and presented at Hypertext 2009 (demo), ACM SAC2009 (paper), RIAO2007 (demo), ACM CIKM P2PIR2006 (paper), WTAS2005 (paper), WWW2005 (poster), and WWW2004 (poster). Unfortunately, we have also been the victims of shameless plagiarism.
Visit Sixearch to learn more and download the application or contribute to it!
Members & Collaborators
|The Sixearch project is based upon work supported by the National Science Foundation under awards IIS-0133124 and IIS-0348940 (CAREER: Scalable Search Engines Via Adaptive Topic-Driven Crawlers).|
|Recently, the Sixearch project received the 2007 IBM UIMA award for 6S: A Collaborative Web Search Network.|
Networks & agents Network
NaN is a research group exploring the modeling, simulation, and analysis of complex social and information networks, adaptive agents, and social computing systems. We especially focus on social media and the Web as complex techno-social networks in which we leave abundant traces of our activities: what we do, what we are interested in, whom we talk to, what knowledge we acquire and contribute. Our research spans from modeling the dynamic processes that occur online (how information networks grow and evolve, how individual and collective traffic patterns emerge, how attention bursts are generated and shaped by social and search tools) to designing tools that mine the Web for better search, navigation, management, and recommendation (where ‘better’ means more intelligent, autonomous, robust, personalized, contextual, scalable, adaptive, and so on). We have ongoing collaborations with colleagues at the ISI Foundation, Yahoo Labs, and MoBS Lab.
Active NaN projects
Archived NaN projects
A PNAS paper on Growing And Navigating The Small World Web By Local Content was announced in press releases by PNAS News and UIowa. A radio interview for the program Science in Action was broadcast by BBC World Service (QuickTime | Flash | MP3). The paper received coverage in Technology Research News, ACM TechNews, Complexity Digest, Insight, @-web, Ascribe, Boston.com, E4, and ResearchBuzz.
Web pages cluster by content type (TRN News)