Talk by Alessio Cardillo


Speaker: Alessio Cardillo, École Polytechnique Fédérale de Lausanne

Title: Automatic identification of relevant concepts in scientific publications

Date: 02/10/2017

Time: 12:15pm

Room: Informatics East 322

Abstract: Recently, scientists have devoted considerable effort to studying the organization and evolution of science by exploiting the textual information contained in articles, such as keywords and terms extracted from the title and abstract. However, only a few studies focus on the analysis of the core of an article, i.e., its body. Access to the whole text of documents makes it possible, instead, to study the organization of scientific knowledge using networks of similarity between articles based on their entire content.

I use the concepts extracted from the documents/articles available within the ScienceWISE platform to build the network of similarity between them. However, such a network possesses a remarkably high link density (36%). As a consequence, attempts to associate groups of documents (communities) with a given topic have limited success. The reason is that not all concepts are equally informative, and some may not be useful for discriminating between articles. The presence of "generic concepts" gives rise to spurious similarities that are responsible for a large number of connections in the system.
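As a rough illustration of how one generic concept can inflate link density, here is a minimal Python sketch. It is not the ScienceWISE pipeline: the abstract does not state the similarity measure, so the sketch assumes that two articles are linked when they share at least one concept and that the link weight is the number of shared concepts. The function names and the toy corpus are hypothetical.

```python
from itertools import combinations


def similarity_links(doc_concepts):
    """Return {(doc_a, doc_b): weight} for every pair sharing a concept.

    Assumption: weight = number of shared concepts (illustrative only).
    """
    links = {}
    for a, b in combinations(sorted(doc_concepts), 2):
        shared = doc_concepts[a] & doc_concepts[b]
        if shared:
            links[(a, b)] = len(shared)
    return links


def link_density(n_docs, n_links):
    """Fraction of all possible undirected pairs that are linked."""
    possible = n_docs * (n_docs - 1) // 2
    return n_links / possible if possible else 0.0


# Toy corpus (made up): "energy" plays the role of a generic concept.
docs = {
    "d1": {"entropy", "network", "energy"},
    "d2": {"galaxy", "energy"},
    "d3": {"protein", "energy"},
    "d4": {"protein", "folding"},
}

links = similarity_links(docs)
density = link_density(len(docs), len(links))

# Same corpus with the generic concept removed.
filtered = {d: c - {"energy"} for d, c in docs.items()}
filtered_links = similarity_links(filtered)
filtered_density = link_density(len(filtered), len(filtered_links))
```

In this toy corpus, removing the single generic concept leaves only the pair of articles that share "protein" connected, which is the kind of sparsification the abstract describes.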

To get rid of such concepts, I will introduce a method to gauge their relevance using an information-theoretic approach. The significance of a concept $c$ is encoded by the distance between its maximum entropy, $S_{\max}$, and its observed entropy, $S_c$. After removing the concepts whose entropy lies within a certain distance of the maximum, I rebuild the similarity network and analyze its community structure (topics). The consequences are twofold: both the number of links and the noise in the strength of similarities between articles decrease. Hence, the filtered network displays a better-defined community structure, in which each community contains articles related to a specific topic. Finally, the method can be applied to any kind of document, and it also works in a coarse-grained mode: because it can identify the relevant concepts for a given set of articles, it allows the study of a document corpus at different scales.
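The filtering step can be sketched in Python under explicit assumptions. The abstract does not define $S_c$ or $S_{\max}$, so this sketch assumes that $S_c$ is the Shannon entropy of the distribution of a concept's occurrences across the $N$ documents and that $S_{\max} = \ln N$, the entropy of a perfectly uniform spread. The speaker's actual definitions may differ. The names `concept_entropy`, `filter_concepts`, and `min_gap` are hypothetical, and the counts are made up.

```python
import math


def concept_entropy(counts):
    """Shannon entropy (in nats) of a concept's occurrences over documents."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def filter_concepts(occurrences, n_docs, min_gap):
    """Keep concepts whose entropy is at least `min_gap` below the maximum.

    occurrences: {concept: [count in doc 0, ..., count in doc n_docs - 1]}
    Assumption: S_max = ln(n_docs), i.e. the uniform-spread entropy.
    Returns {concept: S_max - S_c} for the concepts that are kept.
    """
    s_max = math.log(n_docs)
    kept = {}
    for concept, counts in occurrences.items():
        gap = s_max - concept_entropy(counts)
        if gap >= min_gap:
            kept[concept] = gap
    return kept


# Toy data (made up): "energy" is spread evenly over all four documents,
# while "protein" is concentrated in two of them.
occurrences = {
    "energy": [3, 3, 3, 3],
    "protein": [0, 0, 5, 5],
}
relevant = filter_concepts(occurrences, n_docs=4, min_gap=0.5)
```

Under these assumptions, a concept that appears evenly everywhere has an entropy close to the maximum and is discarded, while a concept concentrated in a few documents sits well below the maximum and is kept. The threshold `min_gap` is a free parameter of the sketch.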

Bio: Alessio Cardillo is currently a postdoctoral research fellow at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. His research focuses on the analysis of the structure of networked systems, such as urban mobility and street patterns, scientific collaborations, collections of documents, and multiplex networks. He is also interested in the emergence of collective behaviours, such as cooperation or synchronization, by means of coevolutionary dynamics.