Ted Pedersen

Language Independent Methods of Clustering Similar Contexts (with applications)

Dr. Ted Pedersen (Ph.D. Southern Methodist University, 1998) is an Associate Professor of Computer Science at the University of Minnesota, Duluth. He has done research in nearly all aspects of word sense disambiguation, including supervised, unsupervised, and knowledge-intensive methods. He has also investigated methods of identifying collocations in large corpora and of measuring semantic similarity and relatedness among concepts. He is the recipient of a National Science Foundation (USA) CAREER award.

Methods that identify similar (but not identical) units of text have wide potential application. For example, Web search results can be better organized by grouping together pages with related and similar content. Email can be automatically foldered and categorized by finding which messages are similar to each other. Word senses can be discovered by clustering multiple contexts that use a particular ambiguous word.

This course will introduce a language independent methodology for identifying similar contexts based on lexical features. The course will explore the use of first and second order co-occurrence vectors for representing contexts, and introduce methods of dimensionality reduction that lower the noise and computational complexity associated with these large feature spaces. A number of different clustering methods will be discussed, as will various methods of evaluating the quality of the clustering results. Finally, the course will explore methods of automatically generating descriptive labels for clusters.
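
The difference between first and second order representations can be illustrated with a small sketch. This is not the SenseClusters implementation; the function names, the toy corpus, the window size, and the choice to average word vectors are all assumptions made for illustration. A first order vector records which feature words occur in the context itself, while a second order vector here averages the co-occurrence vectors of the words in the context, so two contexts that share no words can still look similar.

```python
from collections import Counter, defaultdict
from math import sqrt


def tokenize(text):
    # Whitespace tokenization keeps the sketch language independent.
    return text.lower().split()


def first_order_vector(context, features):
    """Count how often each feature word occurs in the context itself."""
    counts = Counter(tokenize(context))
    return [counts[f] for f in features]


def cooccurrence_matrix(corpus, features, window=2):
    """Map each corpus word to its co-occurrence counts with the feature words."""
    index = {f: i for i, f in enumerate(features)}
    matrix = defaultdict(lambda: [0] * len(features))
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in index:
                    matrix[word][index[tokens[j]]] += 1
    return matrix


def second_order_vector(context, matrix, n_features):
    """Average the co-occurrence vectors of the context words (an assumption)."""
    rows = [matrix[w] for w in tokenize(context) if w in matrix]
    if not rows:
        return [0.0] * n_features
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy data, chosen only to make the contrast visible.
corpus = [
    "the doctor treated the patient",
    "the physician treated the patient",
    "the bank approved the loan",
]
features = ["treated", "patient", "approved", "loan"]
matrix = cooccurrence_matrix(corpus, features)

a, b = "doctor arrived", "physician arrived"
first = cosine(first_order_vector(a, features), first_order_vector(b, features))
second = cosine(second_order_vector(a, matrix, len(features)),
                second_order_vector(b, matrix, len(features)))
print("first order similarity:", first)
print("second order similarity:", second)
```

In this toy setting the two contexts contain none of the feature words, so their first order vectors carry no information, whereas their second order vectors coincide because "doctor" and "physician" co-occur with the same feature word in the corpus.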

This is a full-day course featuring 3 hours of lecture and 3 hours of laboratory work. The laboratory work will be based on the freely available packages SenseClusters, the Ngram Statistics Package, and WordNet::Similarity.

 

Outline of Course:

 

Foundations (3 hours)

 

            - identifying lexical features

                        - reviewing measures of association

                        - reviewing statistical tests of significance

 

            - context representation

                        - first order features

                        - second order features

 

            - dimensionality reduction (for contexts)

                        - singular value decomposition (SVD)

                        - multidimensional scaling (MDS)

 

            - clustering techniques

                        - agglomerative

                        - hybrid methods

 

            - evaluation techniques

                        - comparisons to gold standard

                        - measures of purity and entropy in clusters

 

Applications (3 hours)

 

            - name discrimination

            - word sense identification

            - identifying sets of related words/synonyms

            - ontology creation

            - email topic organization/automatic foldering
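
The purity and entropy measures named in the outline under evaluation techniques can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code shipped with SenseClusters: the function names, the representation of clusters as lists of context identifiers, and the `gold` mapping from identifier to gold standard class are all hypothetical.

```python
from collections import Counter
from math import log2


def purity(clusters, gold):
    """Fraction of contexts that carry the majority gold class of their cluster."""
    total = sum(len(members) for members in clusters)
    majority = sum(
        max(Counter(gold[m] for m in members).values())
        for members in clusters
        if members
    )
    return majority / total


def entropy(clusters, gold):
    """Size-weighted average of per-cluster class entropy in bits (0 is best)."""
    total = sum(len(members) for members in clusters)
    result = 0.0
    for members in clusters:
        if not members:
            continue
        counts = Counter(gold[m] for m in members)
        size = len(members)
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        result += (size / total) * h
    return result


# Hypothetical example: four contexts, two clusters, two gold senses.
gold = {0: "sense_a", 1: "sense_b", 2: "sense_b", 3: "sense_b"}
clusters = [[0, 1], [2, 3]]
print("purity:", purity(clusters, gold))
print("entropy:", entropy(clusters, gold))
```

Higher purity and lower entropy both indicate clusters that agree more closely with the gold standard. Both measures can be inflated by producing many small clusters, so they are usually read together with the number of clusters.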

 

The practical session will use the SenseClusters web interface at:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

The system can also be installed locally under Linux, where it will run faster than the web interface. Students can install the SenseClusters package, which is available at http://senseclusters.sourceforge.net

CDs containing all of the software will be available during the summer school.

 

Additional recommendations regarding useful tools and background readings (these are not prerequisites):

 

1) SenseClusters: http://senseclusters.sourceforge.net

 

Good background papers:

Name Discrimination by Clustering Similar Contexts (Pedersen, Purandare, and Kulkarni) - Appears in the Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics, February 13-19, 2005, Mexico City.

http://www.d.umn.edu/~tpederse/Pubs/cicling2005.pdf

 

Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces (Purandare and Pedersen) - Appears in the Proceedings of the Conference on Computational Natural Language Learning (CoNLL), May 6-7, 2004, Boston, MA.

http://www.d.umn.edu/~tpederse/Pubs/conll04-purandarep.pdf

 

2) Ngram Statistics Package: http://www.d.umn.edu/~tpederse/nsp.html

 

A good background paper:

The Design, Implementation, and Use of the Ngram Statistics Package (Banerjee and Pedersen) - Appears in the Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, February 17-21, 2003, Mexico City.

http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf

 

3) WordNet::Similarity: http://wn-similarity.sourceforge.net/

 

A good background paper:

Extended Gloss Overlaps as a Measure of Semantic Relatedness (Banerjee and Pedersen) - Appears in the Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, August 9-15, 2003, Acapulco, Mexico. 

http://www.d.umn.edu/~tpederse/Pubs/ijcai03.pdf