Information Organization and Retrieval with Collaboratively Generated Content

The proliferation of ubiquitous Internet access enables millions of Web users to collaborate online on a variety of activities. Many of these activities result in the construction of large repositories of knowledge, either as their primary aim (e.g., Wikipedia) or as a by-product (e.g., Yahoo! Answers). In this tutorial, we will discuss organizing and exploiting collaboratively generated content (CGC) for information organization and retrieval. Specifically, we intend to cover two complementary areas of the problem: (1) using such content as a powerful enabling resource for knowledge-enriched, intelligent representations and new information retrieval algorithms, and (2) developing supporting technologies for extracting, filtering, and organizing collaboratively created content.

The unprecedented amounts of information in CGC enable new, knowledge-rich approaches to information access, which are significantly more powerful than the conventional word-based methods. Considerable progress has been made in this direction over the last few years. Examples include explicit manipulation of human-defined concepts and their use to augment the bag of words (cf. Explicit Semantic Analysis), using large-scale taxonomies of topics from Wikipedia or the Open Directory Project to construct additional class-based features, or using Wikipedia for better word sense disambiguation.

However, the quality and comprehensiveness of collaboratively created content vary significantly, and for this resource to be useful, a significant amount of preprocessing, filtering, and organization is necessary. Consequently, new methods for analyzing CGC and the corresponding user interactions are required to effectively harness the resulting knowledge. Thus, not only can content repositories be used to improve IR methods, but the reverse is also possible: better information extraction methods can be used to automatically collect more knowledge or to verify the contributed content. This natural connection between modeling the generation process of CGC and effectively using the accumulated knowledge suggests covering both areas together in a single tutorial.

The intended audience of the tutorial includes IR researchers and graduate students who would like to learn about recent advances and research opportunities in working with collaboratively generated content. The emphasis of the proposed tutorial will be on comparing the existing approaches and presenting practical techniques that IR practitioners can use in their research. We also plan to cover open research challenges and to survey available resources (software tools and data) for getting started in this research field.


Dr. Eugene Agichtein is an Assistant Professor in the Math & Computer Science Department at Emory University. He is a founder of the Emory Intelligent Information Access Laboratory (IRLab). Eugene's research expertise is in information access and retrieval, in particular understanding and modeling user interactions in web search and social media to improve information access and discovery. He has published extensively on web information retrieval and on information extraction from text and the web. Some of Eugene's research publications were recognized with the "Best Student Paper" award at the ICDE 2003 conference and the "Best Paper Award" at the SIGMOD 2006 conference. Eugene is also actively involved in the IR and Web Search research community, and will serve as the Program Co-chair of the WSDM 2012 conference in Seattle. He has served as a Senior Program Committee member (area chair) of the SIGIR 2007, 2008, 2009, 2010, and 2011 conferences and the ICWSM 2010 and 2011 conferences, as area chair for Information Retrieval for the HLT 2010 conference, and as a Tutorials Co-Chair for the Annual Meeting of the Association for Computational Linguistics (ACL 2008). Eugene has also served on Program Committees of the SIGIR, AAAI, KDD, ACL, EMNLP, ICDE, COLING, WWW, WSDM, and HLT conferences. Directly related to the tutorial topic, Dr. Agichtein co-founded and co-chaired the first three workshops on Search in Social Media: SSM 2008 at CIKM 2008 in Napa Valley, California; SSM 2009 at SIGIR 2009 in Boston; and SSM 2010 at WSDM 2010 in New York. These workshops drew large participation from both academic and industrial researchers. Eugene presented tutorials at AAAI 2010 and WWW 2010 on "Modelling Searcher Intent and Behavior", and has given invited lectures on related topics at major research labs and web search engines including Google, Microsoft, Yahoo, eBay, and Yandex.
Eugene has previously presented a popular tutorial on Scalable Information Extraction and Integration at the ACM KDD 2006 conference, and an invited SIGKDD webcast on "Scalable Information Extraction" in 2007.

Dr. Evgeniy Gabrilovich is a Senior Research Scientist and Manager of the NLP & IR Group at Yahoo! Research. His research interests include information retrieval, machine learning, and computational linguistics. Recently, he organized a workshop on feature generation and selection for information retrieval at SIGIR 2010, workshops on the synergy between user-contributed knowledge and research in AI at IJCAI 2009 and AAAI 2008, and workshops on information retrieval for advertising at SIGIR 2009 and 2008. Evgeniy has served as a Senior PC member or Area Chair at SIGIR, AAAI, IJCAI, EMNLP, and ICWSM, and has also served on the Program Committees of WWW, WSDM, SIGIR, CIKM, AAAI, ACL, EMNLP, HLT, COLING, and JCDL. Evgeniy is a recipient of the Karen Sparck Jones Award for his contributions to natural language processing and information retrieval. Evgeniy earned his MSc and PhD degrees in Computer Science from the Technion - Israel Institute of Technology. In his PhD thesis, he developed a methodology for using large-scale repositories of world knowledge (e.g., all the knowledge available in Wikipedia) to enhance text representation beyond the bag of words. Evgeniy presented tutorials on computational advertising at SIGIR 2010, CIKM 2009, IJCAI 2009 (invited), EC 2008, and ACL-HLT 2008 (invited). He also presented invited keynote talks at ECIR 2011, Canadian AI 2009, and the Workshop on Social Web Search and Mining held at CIKM 2009. Evgeniy has also given multiple talks at major research labs in academia and industry, including Microsoft, Google, HP Labs, and Yandex.