A Large Subcategorization Lexicon for English Verbs

VALEX is a new large valency (subcategorization) lexicon for English verbs which is suitable for (statistical) natural language processing (NLP), linguistic and psycholinguistic use. The lexicon was developed by members of the Natural Language and Information Processing Group at the University of Cambridge Computer Laboratory. It is freely available under the GNU General Public License.

VALEX includes subcategorization frame (SCF) and frequency information for 6,397 English verbs. It assumes a classification of 163 SCF types (Briscoe, 2000), a superset of those found in the ANLT and COMLEX Syntax dictionaries. The SCFs abstract over specific lexically governed particles and prepositions and over specific predicate selectional preferences, but include some derived semi-predictable bounded dependency constructions, such as particle and dative movement. The lexicon provides a lexical entry for each verb and SCF combination. It includes 212,741 entries in total, about 33 per verb on average.
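As a rough illustration of the "one entry per verb and SCF combination" organisation described above, the following Python sketch groups a toy set of entries by verb and checks the reported per-verb average. The entry layout, the verbs and the SCF numbers are hypothetical placeholders chosen for illustration; they are not the VALEX file format or the actual SCF classification.

```python
from collections import defaultdict

# Figures reported on this page.
TOTAL_ENTRIES = 212_741
TOTAL_VERBS = 6_397


def mean_entries_per_verb(total_entries: int, total_verbs: int) -> float:
    """Average number of (verb, SCF) entries per verb."""
    return total_entries / total_verbs


# Hypothetical toy data: one entry per (verb, SCF id) pair with a corpus count.
# The SCF ids below are arbitrary placeholders, not real VALEX frame numbers.
toy_entries = {
    ("give", 1): 120,
    ("give", 2): 45,
    ("give", 3): 5,
    ("sleep", 4): 60,
    ("sleep", 5): 3,
}


def scfs_by_verb(entries):
    """Group (verb, scf_id) -> count entries into verb -> {scf_id: count}."""
    grouped = defaultdict(dict)
    for (verb, scf_id), count in entries.items():
        grouped[verb][scf_id] = count
    return dict(grouped)


if __name__ == "__main__":
    print(round(mean_entries_per_verb(TOTAL_ENTRIES, TOTAL_VERBS), 2))
    for verb, frames in scfs_by_verb(toy_entries).items():
        print(verb, len(frames), "SCFs")
```

Dividing the reported totals gives a little over 33 entries per verb, in line with the average quoted above.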

For a detailed description of the lexicon and how it was constructed see: Anna Korhonen, Yuval Krymolowski and Ted Briscoe (2006). ‘A Large Subcategorization Lexicon for Natural Language Processing Applications’. In Proceedings of the 5th International Conference on Language Resources and Evaluation. Genova, Italy. PDF.

VALEX differs from other existing valency lexicons in the following ways:

  • It was acquired automatically from five large corpora (both British and American) and the Web. The corpus data (consisting of 15.9M sentences in total) were processed using a recent version (Korhonen, 2002) of the comprehensive subcategorization acquisition system of Briscoe and Carroll (1997).
  • Because the lexicon was acquired automatically, it contains some noise (i.e. incorrect SCF entries and inaccurate frequencies). It therefore comes with software that can be used to remove noise, improve the quality of the automatically acquired SCF distributions and/or create sub-lexicons suitable for different purposes. Four sub-lexicons, created by running the software with the best-performing options, are also provided. They are more accurate than the basic lexicon and can be readily employed for tasks that require better accuracy.
  • The lexicon includes statistical information about the frequencies and relative frequencies of SCFs in corpus data. This makes it particularly suitable for statistical NLP use.
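To make the relationship between SCF frequencies, relative frequencies and noise filtering concrete, here is a minimal Python sketch. It assumes a simple relative-frequency cutoff purely for illustration; this is not the method implemented by the VALEX software, whose options are described in its own documentation. The function names, the frame labels and the 0.05 threshold are all hypothetical.

```python
def relative_frequencies(counts):
    """Convert raw SCF counts for one verb into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {scf: count / total for scf, count in counts.items()}


def filter_by_threshold(counts, threshold=0.05):
    """Drop SCFs whose relative frequency is below `threshold`,
    then renormalise the remaining SCFs so they sum to 1."""
    rel = relative_frequencies(counts)
    kept = {scf: count for scf, count in counts.items() if rel[scf] >= threshold}
    return relative_frequencies(kept)


if __name__ == "__main__":
    # Hypothetical counts for a single verb; frame labels are placeholders.
    counts = {"FRAME_A": 90, "FRAME_B": 8, "FRAME_C": 2}
    print(relative_frequencies(counts))
    print(filter_by_threshold(counts, threshold=0.05))
```

A cutoff of this kind trades recall for precision: rare but genuine frames are discarded along with the noise, which is one reason a user might prefer different sub-lexicons for different tasks.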

Download, copyright notice and feedback

The First Release of the lexicon (July 2006) includes the following materials:

  • The description of the 163 SCF types in the lexicon
  • The large automatically acquired (unfiltered, noisy) subcategorization lexicon
  • Software which can be used to filter out noisy SCFs from the large lexicon, improve the quality of automatically acquired SCF distributions, and build sub-lexicons suitable for different purposes
  • Four sub-lexicons, created using the software, which are more accurate than the basic noisy lexicon and can be readily employed by users who prefer not to run the software themselves
  • Documentation which explains the different sub-lexicon options provided by the software and evaluates their accuracy
DOWNLOAD VALEX

Copyright © 2006 Anna Korhonen and Ted Briscoe, University of Cambridge. VALEX is free software under the terms of the GNU General Public License.

Please acknowledge the use of the lexicon and the related materials in any publications by providing a reference to the Korhonen, Krymolowski and Briscoe (2006) paper and to this page.

We would be pleased to receive comments on the materials provided here. Please contact us with any feedback, questions or suggestions you may have.

References

Bran Boguraev and Ted Briscoe. 1987. Large lexicons for natural language processing: utilising the grammar coding system of the Longman Dictionary of Contemporary English. Computational Linguistics 13(3-4): 203–218. PDF

Ted Briscoe. 2000. Dictionary and System Subcategorisation Code Mappings. Unpublished manuscript, University of Cambridge Computer Laboratory. Included in the download materials.

Ted Briscoe. 2001. From Dictionary to Corpus to Self-Organizing Dictionary: Learning Valency Associations in the Face of Variation and Change. In Proceedings of Corpus Linguistics. Lancaster University, UK. PDF

Ted Briscoe and John Carroll. 1997. Automatic Extraction of Subcategorization from Corpora. In Proceedings of the Fifth Conference on Applied Natural Language Processing. Washington, DC. PDF

Ralph Grishman, Catherine Macleod and Adam Meyers. 1994. Comlex syntax: building a computational lexicon. In Proceedings of the 15th International Conference on Computational Linguistics. Kyoto, Japan. PS

Anna Korhonen and Ted Briscoe. 2004. Extended Lexical-Semantic Classification of English Verbs. In Proceedings of the HLT/NAACL Workshop on Computational Lexical Semantics, Boston, MA. PDF

Anna Korhonen, Genevieve Gorrell and Diana McCarthy. 2000. Statistical Filtering and Subcategorization Frame Acquisition. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. Hong Kong. PDF

Anna Korhonen. 2002. Subcategorization Acquisition. Ph.D. thesis published as Technical Report UCAM-CL-TR-530. Computer Laboratory, University of Cambridge. PDF

Anna Korhonen and Yuval Krymolowski. 2002. On the Robustness of Entropy-Based Similarity Measures in Evaluation of Subcategorization Acquisition Systems. In Proceedings of the Sixth Conference on Natural Language Learning. Taipei, Taiwan. PDF

Anna Korhonen, Yuval Krymolowski and Ted Briscoe. 2006. A Large Subcategorization Lexicon for Natural Language Processing Applications. In Proceedings of the 5th International Conference on Language Resources and Evaluation. Genova, Italy. PDF

Anna Korhonen, Yuval Krymolowski and Zvika Marx. 2003. Clustering Polysemic Subcategorization Frame Distributions Semantically. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. 64–71. PDF

Anna Korhonen and Judita Preiss. 2003. Improving Subcategorization Acquisition using Word Sense Disambiguation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. 48–55. PDF

Beth Levin. 1993. English Verb Classes and Alternations. University of Chicago Press.