iLexIR
NLP Consultancy
TAP
The TAP classifier has been tightly integrated with the RASP toolkit, so it is easy to run experiments to find the optimal set of feature types and instances for a particular classification task, whether this be at the document, passage, sentence or (sub)sentence level. However, in many real-world applications it is not possible to train a classifier in a fully supervised fashion because the data is only partially or noisily labelled.
A significant element of the research undertaken with the toolkit has been to explore the use of bootstrapping and other semi-supervised techniques to circumvent the need for large quantities of well-annotated training data. In areas such as anonymisation (Medlock, 2006) and biomedical named entity recognition (Vlachos et al., 2006) we have been successful in bootstrapping accurate classifiers from text automatically annotated with RASP.
Features
Standard text classification adopts the ‘bag of words’ (BoW) model, in which a document is treated as an unstructured multiset of terms and information about word position or syntactic structure is ignored. This approach works well for document topic classification but less well for sentiment or genre classification, or for (sub)sentential classification tasks such as named entity recognition, anonymisation, or distinguishing speculative from non-speculative assertions (e.g. Medlock, 2006).
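To make the limitation concrete, the following minimal sketch builds a bag of words as a multiset and shows that two sentences with opposite meanings receive identical representations. The regular-expression tokenizer and the function name bag_of_words are assumptions made for illustration; they are not part of RASP or TAP.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Return the multiset of lowercased word tokens in `text`.

    The regex tokenizer is a simplifying assumption for this sketch.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


first = bag_of_words("The dog bit the man")
second = bag_of_words("The man bit the dog")

# Word order is discarded, so the two sentences are indistinguishable.
print(first == second)  # True
print(first["the"])     # 2
```

Because the representation records only term counts, any distinction that depends on who did what to whom is lost, which is one reason sentence-level tasks tend to need richer features.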
The RASP toolkit makes available a range of features beyond BoW, based on morphological analysis (lemmas, stems), part of speech tags, and word co-occurrences mediated by grammatical relations rather than by adjacency or windowing. These additional feature types can be made available to machine learning classifiers, and during training the classifier can select the feature instances of these types that are effective for a given classification task, for use at run time.
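As a rough sketch of what such feature types might look like once extracted, the example below turns a hand-annotated sentence into lemma, part of speech and grammatical relation features. Everything here is hypothetical: the Token class, the tag names, the relation labels and the feature string format are invented for illustration and do not reproduce RASP's actual output format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str
    lemma: str
    pos: str  # illustrative tag names, not RASP's tagset


def extract_features(tokens, relations) -> Counter:
    """Count lemma, POS and grammatical relation features.

    `relations` is a list of (label, head_lemma, dependent_lemma) triples.
    """
    feats = Counter()
    for tok in tokens:
        feats[f"lemma={tok.lemma}"] += 1
        feats[f"pos={tok.pos}"] += 1
    for label, head, dep in relations:
        feats[f"gr={label}({head},{dep})"] += 1
    return feats


# Hand-written analysis of "The dog bit the man" (not parser output).
tokens = [
    Token("The", "the", "DET"),
    Token("dog", "dog", "NOUN"),
    Token("bit", "bite", "VERB"),
    Token("the", "the", "DET"),
    Token("man", "man", "NOUN"),
]
relations = [("subj", "bite", "dog"), ("obj", "bite", "man")]

feats = extract_features(tokens, relations)
print(feats["lemma=bite"])         # 1
print(feats["gr=subj(bite,dog)"])  # 1
```

Unlike plain term counts, the relation features change if the subject and object are swapped, so a classifier that selects them can separate sentences that share the same words.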
Performance
The timed aggregate perceptron (TAP) classifier (Medlock, forthcoming) is a highly scalable linear classifier which has been shown to outperform SVMs and Bayesian logistic regression (BLR) on topic and other text classification tasks. It achieved better classification accuracy than either of these popular alternatives while training in near-linear time.
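The TAP algorithm itself is specified in the cited work and is not reproduced here. As a point of reference only, the sketch below shows an ordinary mistake-driven perceptron over sparse features, the basic kind of linear classifier that perceptron variants build on. The function names, the toy data and the stopping rule are assumptions made for this illustration and should not be read as the TAP training procedure.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=10):
    """Train a plain perceptron (not TAP).

    `examples` is a list of (features, label) pairs, where `features`
    maps feature names to values and `label` is +1 or -1.
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for feats, label in examples:
            score = bias + sum(weights[f] * v for f, v in feats.items())
            if label * score <= 0:  # misclassified or on the boundary
                for f, v in feats.items():
                    weights[f] += label * v
                bias += label
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: stop early
            break
    return weights, bias


def predict(weights, bias, feats):
    score = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1


# Toy, linearly separable data: +1 = sport, -1 = finance.
examples = [
    ({"goal": 1, "match": 1}, 1),
    ({"stock": 1, "market": 1}, -1),
    ({"match": 1, "team": 1}, 1),
    ({"market": 1, "shares": 1}, -1),
]
weights, bias = train_perceptron(examples)
print(predict(weights, bias, {"team": 1, "goal": 1}))     # 1
print(predict(weights, bias, {"shares": 1, "stock": 1}))  # -1
```

In this sketch each pass over the data costs time proportional to the number of non-zero feature values, which is part of why perceptron-style learners are attractive for large sparse text collections.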
With near-linear training time, a classifier trained on the entire Reuters RCV1 corpus of around 800,000 news stories (Lewis et al., 2004) divided into 103 classes could be built in around 3.5 hours of CPU time, as opposed to around 20 hours for the SVM or 50 hours for BLR. This is a significant advantage for real-world applications, where shorter training times allow vital experimentation with feature generation and selection, as well as frequent retraining as data accumulates.
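The relative training costs implied by these figures can be worked out directly. The short calculation below uses only the approximate CPU hours quoted above; the function name training_time_ratio is an assumption made for illustration.

```python
TAP_HOURS = 3.5  # approximate CPU hours quoted for TAP on RCV1


def training_time_ratio(other_hours: float, tap_hours: float = TAP_HOURS) -> float:
    """Return how many times longer the other classifier took to train."""
    return other_hours / tap_hours


for name, hours in {"SVM": 20.0, "BLR": 50.0}.items():
    print(f"{name}: about {training_time_ratio(hours):.1f}x the TAP training time")
# SVM: about 5.7x the TAP training time
# BLR: about 14.3x the TAP training time
```

Since the quoted timings are themselves approximate, these ratios are best read as rough orders of magnitude rather than precise benchmarks.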