NLP Consultancy

Adjective–noun dataset

This dataset of adjective–noun (AN) combinations is extracted from the parsed version of the publicly available CLC FCE Dataset. The error coding is used to divide the set into two subsets: correctly used ANs, and ANs annotated as errors due to an inappropriate choice of adjective and/or noun. For ANs that are used correctly in some contexts and incorrectly in others, the most frequent annotation from the CLC is used. The dataset contains 4681 correct and 530 incorrect combinations.
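
As an illustration of the labelling rule for ANs that appear with both annotations, the following is a minimal sketch and not the code used to build the dataset. The function name, the (adjective, noun, label) tuple layout and the label strings are assumptions made for this example; how ties were resolved in the original extraction is not stated, so here a tie simply goes to the label seen first.

```python
from collections import Counter, defaultdict


def majority_labels(annotations):
    """Map each (adjective, noun) pair to its most frequent annotation.

    `annotations` is an iterable of (adjective, noun, label) tuples, one per
    occurrence in the learner corpus (hypothetical input layout).
    """
    counts = defaultdict(Counter)
    for adjective, noun, label in annotations:
        counts[(adjective, noun)][label] += 1
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the label encountered first (an assumption).
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


if __name__ == "__main__":
    # Toy occurrences, invented for illustration only.
    occurrences = [
        ("big", "mistake", "correct"),
        ("big", "mistake", "incorrect"),
        ("big", "mistake", "correct"),
        ("strong", "rain", "incorrect"),
    ]
    for pair, label in majority_labels(occurrences).items():
        print(pair, label)
```

With this rule every AN receives exactly one label, which is consistent with the dataset being reported as two disjoint subsets.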

The set of ANs is further divided into corpus-attested and corpus-unattested examples, using a parsed version of the British National Corpus (BNC) as the reference, with the frequency threshold set to 3 occurrences in the corpus.
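
The attested/unattested split can be sketched as a simple frequency lookup. This is an illustrative sketch only: the function name and the dictionary of reference counts are hypothetical, and it assumes that "threshold set to 3" means an AN counts as attested when it occurs at least 3 times in the reference corpus.

```python
def split_by_attestation(pairs, reference_counts, threshold=3):
    """Split AN pairs into corpus-attested and corpus-unattested lists.

    `reference_counts` maps (adjective, noun) lemma pairs to their frequency
    in the reference corpus (hypothetical input layout). A pair is treated
    as attested when its count is >= threshold (an assumption).
    """
    attested, unattested = [], []
    for pair in pairs:
        if reference_counts.get(pair, 0) >= threshold:
            attested.append(pair)
        else:
            unattested.append(pair)
    return attested, unattested


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    counts = {("big", "mistake"): 12, ("strong", "rain"): 1}
    pairs = [("big", "mistake"), ("strong", "rain"), ("deep", "regard")]
    attested, unattested = split_by_attestation(pairs, counts)
    print("attested:", attested)
    print("unattested:", unattested)
```

Pairs missing from the reference counts are treated as having frequency 0 and therefore fall into the unattested group.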

Both the CLC FCE Dataset and the BNC are lemmatised, tagged and parsed using the RASP system (Briscoe et al., 2006; Andersen et al., 2008).


The Dataset is released for non-commercial research and educational purposes under the following licence agreement:

  1. By downloading this dataset and licence, this licence agreement is entered into, effective this date, between you, the Licensee, and the University of Cambridge, the Licensor.
  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.
  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes.
  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.
  5. The Licensee shall acknowledge use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the following publication: Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben, ‘A New Dataset and Method for Automatically Grading ESOL Texts’, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
  7. The Licensor grants the Licensee this right to use the licensed dataset ‘as is’. Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever.
  8. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.


You may download the Adjective–noun Dataset if you agree to the licence.