ANC2Go: A Web Service for Customized Corpus Delivery
Timothy Brown, Vassar College ’16 and Prof. Nancy Ide
Language data labeled with annotations identifying linguistic features such as part of speech, syntax, and sense information is heavily used for research and development in natural language processing (NLP). Large corpora with linguistic annotations are mined to create language models based on statistics derived from the data, and these models drive NLP software in tasks such as speech recognition and generation, information retrieval, question answering, and machine translation.
A major obstacle to the use and reuse of large annotated corpora has been the lack of a common format to represent the information. This in turn has made it difficult for researchers to use annotations produced at another site with their own software; that is, the data and software are very often not interoperable.
The ANC project has created several large corpora with extensive linguistic annotations and made them freely available for use by the NLP community. To overcome the interoperability problem, we have developed a web service called ANC2Go. It allows users to choose both the set of texts they are interested in and the desired annotations from among those in our corpora, and to receive the results in one of several commonly used formats, which can then be input directly to most NLP software.
This URSI project focused on the development of the ANC2Go application, specifically the design and implementation of its user interface. The original interface suffered from several usability problems, most notably that it would break when multiple corpora were added. We designed and built an entirely new interface from scratch, which allowed a complete re-envisioning of how the tool looks and how users interact with it. In addition, the complexity of the codebase was reduced dramatically, making maintenance easier and allowing new features to be added quickly in the future.