We have rights to the patented TAP software.

The timed aggregate perceptron (TAP) classifier is a highly scalable linear machine learning classifier.

RASP Compatible

TAP tightly integrates with the RASP toolkit so that it is easy to find the optimal set of feature types and instances for a particular classification task.

The Problem

In real-world applications, it is often not possible to train a classifier in a fully supervised fashion because data is only partially or noisily labelled.

Our Solution

We have successfully bootstrapped accurate classifiers, circumventing the need for large quantities of well-annotated training data.

A significant element of the research undertaken with the toolkit has been the exploration of bootstrapping and other semi-supervised techniques for exactly this purpose.

In areas such as anonymisation (Medlock, 2006) and biomedical named entity recognition (Vlachos et al., 2006), we have bootstrapped accurate classifiers from text automatically annotated with RASP.
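
To give a flavour of the general idea, the sketch below shows a basic self-training loop in Python: a classifier is trained on a small labelled seed set, its most confident predictions on unlabelled data are added back as pseudo-labelled examples, and the model is retrained. This is an illustrative simplification rather than the exact procedure used in the papers cited above; the classifier is assumed to expose a scikit-learn-style fit() and decision_function() interface, and labels are assumed to be binary (0/1).

    import numpy as np

    def self_train(clf, X_lab, y_lab, X_unlab, rounds=5, top_k=100):
        """Generic self-training (bootstrapping) loop -- an illustration only,
        not the procedure of Medlock (2006) or Vlachos et al. (2006).

        clf is assumed to follow a scikit-learn-style interface with fit()
        and decision_function(); X_lab/X_unlab are feature matrices and
        y_lab holds binary labels (0/1).
        """
        X_train, y_train = X_lab.copy(), y_lab.copy()
        pool = X_unlab.copy()
        for _ in range(rounds):
            if len(pool) == 0:
                break
            clf.fit(X_train, y_train)
            scores = clf.decision_function(pool)           # signed confidence
            chosen = np.argsort(-np.abs(scores))[:top_k]   # most confident examples
            pseudo = (scores[chosen] > 0).astype(int)      # pseudo-labels
            X_train = np.vstack([X_train, pool[chosen]])
            y_train = np.concatenate([y_train, pseudo])
            pool = np.delete(pool, chosen, axis=0)         # remove from the pool
        clf.fit(X_train, y_train)
        return clf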

Standard ‘Bag of Words’ (BoW) models

Standard text classification adopts the ‘bag of words’ (BoW) model in which a document is treated as an unstructured multiset of terms and information about word position or syntactic structure is ignored.

This approach works well for document topic classification, but less well for sentiment or genre classification, or for (sub)sentential classification tasks such as named entity recognition, anonymisation, or (non-)speculative assertion identification (e.g. Medlock, 2006).
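
As a minimal illustration of the BoW representation, the Python sketch below treats a document as an unordered multiset of lower-cased tokens; the whitespace tokenisation is a simplifying assumption, not the tokenisation performed by RASP.

    from collections import Counter

    def bag_of_words(text):
        """Represent a document as an unordered multiset of lower-cased terms.
        Whitespace tokenisation only; a real pipeline would use a proper tokeniser.
        """
        return Counter(text.lower().split())

    doc = "The markets rallied as the markets reopened"
    print(bag_of_words(doc))
    # Counter({'the': 2, 'markets': 2, 'rallied': 1, 'as': 1, 'reopened': 1})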

The RASP Solution

The RASP toolkit makes available a range of features beyond BoW, based on morphological analysis (lemmas, stems), part of speech tags, and word co-occurrences mediated by grammatical relations rather than by adjacency or windowing.

These additional feature types can be made available to TAP; during training, the classifier selects the feature instances of each type that are effective for a given classification task and applies them at run time.
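
As a rough sketch of how such features might look to a classifier, the Python below encodes lemma, part-of-speech, and grammatical-relation features as string-valued feature instances and filters them with a simple document-frequency cut-off. The naming scheme and the cut-off are illustrative assumptions only; they do not reproduce the RASP output format or TAP's own training-time selection.

    from collections import Counter

    def extract_features(tokens, lemmas, pos_tags, grs):
        """Encode several RASP-style feature types as string-valued instances.

        tokens, lemmas and pos_tags are parallel lists; grs is a list of
        (relation, head_lemma, dependent_lemma) triples.  The "w=", "l=",
        "p=" and "gr=" prefixes are an illustrative convention.
        """
        feats = Counter()
        for w, l, p in zip(tokens, lemmas, pos_tags):
            feats[f"w={w.lower()}"] += 1              # bag-of-words term
            feats[f"l={l}"] += 1                      # lemma
            feats[f"p={p}"] += 1                      # part-of-speech tag
        for rel, head, dep in grs:
            feats[f"gr={rel}:{head}:{dep}"] += 1      # GR-mediated co-occurrence
        return feats

    def select_features(per_doc_feats, min_df=2):
        """Keep feature instances that occur in at least min_df documents
        (a simple document-frequency cut-off, not TAP's selection criterion)."""
        df = Counter()
        for feats in per_doc_feats:
            df.update(set(feats))
        return {f for f, count in df.items() if count >= min_df}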

Fast & Accurate

TAP is a highly scalable linear classifier that achieves better classification accuracy than support vector machines (SVMs) and Bayesian logistic regression (BLR) on topic and other text classification tasks.

TAP also trains in linear time: on the entire Reuters RCV1 corpus of around 800,000 news stories divided into 103 classes (Lewis et al., 2004), training takes roughly 3.5 hours. This is a significant advantage for real-world applications, as it allows experimental feature selection and frequent retraining as data is accumulated.
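
For illustration only, the Python sketch below shows a plain perceptron over sparse feature dictionaries. It is not the (patented) TAP algorithm, but it shows why training a linear classifier of this kind scales: each epoch makes a single pass over the training examples, and each update touches only the active features of one example, so training time grows linearly with the size of the corpus.

    from collections import defaultdict

    def train_perceptron(data, epochs=5):
        """Plain perceptron on sparse feature dicts -- an illustration of
        linear-time training, not the TAP algorithm itself.

        data is a list of (features, label) pairs, where features maps
        feature name -> value and label is +1 or -1.
        """
        w = defaultdict(float)
        for _ in range(epochs):
            for feats, y in data:                      # one pass per epoch
                score = sum(w[f] * v for f, v in feats.items())
                if y * score <= 0:                     # misclassified: update
                    for f, v in feats.items():
                        w[f] += y * v
        return w

    def predict(w, feats):
        return 1 if sum(w.get(f, 0.0) * v for f, v in feats.items()) > 0 else -1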