Date: May 7, 2013

Speaker: Henry Anaya (NLP&IR-UNED)

Venue: Room 1.03, ETSI Informática, UNED

Abstract:

The ever-increasing availability of text documents has made it more and more challenging for information systems to manage and retrieve the information contained in large text collections according to users' information needs. However, it is not always easy, or even possible, for users to formulate such needs precisely. For example, users may not be familiar with the vocabulary that defines their actual needs, or they may simply want a broad summary of the collection to guide their searches. For this reason, there is great interest in developing new tools for analyzing and summarizing these collections according to their topics, i.e., the main subject themes that run through their documents.

Traditional clustering and Probabilistic Topic Modeling (PTM) are unsupervised learning techniques that have been widely used to discover topics in documents. Clustering methods aim to generate document groups, or clusters, each representing a different topic, whereas PTM approaches represent topics by learning a set of word distributions capable of generating each document in a collection.

However, as pointed out in several works, traditional clustering and PTM approaches are not enough to properly discover and describe the topics contained in a text collection. First, the obtained clusters or distributions do not necessarily correspond to actual topics of interest. That is, they do not always correlate with human judgements, and so do not always provide end users with semantically coherent and meaningful topics that summarize the content of the collection. Topics are actually clusters or probability distributions of words that, while statistically significant, are sometimes difficult for humans to interpret and explain, since the information they convey is in many cases unrelated to any subject heading. Second, clustering methods do not provide descriptions that summarize the clusters' contents (so that users can judge their relevance), and the descriptions provided by PTM approaches are currently limited to listing the most probable (frequent) terms under each distribution.

In this context, this talk presents two approaches to topic discovery that focus on two issues: (1) how to discover the semantically coherent and meaningful (interpretable, subject-heading-like) topics contained in a text collection, and (2) how to simultaneously provide an appropriate description for each topic so that humans can easily judge its relevance.