The Benefits of Crowdsourcing: Generating Better Estimates of Noun Phrases from Crowdsourced Data
Reno J. Kriz, Vassar College ’16 and Prof. Nancy Ide
One of the ultimate goals of computer science is for people to be able to communicate with computers in ordinary human languages. A crucial step toward this goal is a computer’s ability to recognize where the parts of a sentence begin and end. Our project deals with generating more accurate gold-standard labels for the boundaries of noun phrases in texts from the American National Corpus. In previous work, a small number of trained annotators labeled the noun phrases in these texts to generate the gold-standard labels. However, recent research has found that labels obtained through a probabilistic analysis of crowdsourced data provide higher-accuracy gold-standard labels at a much lower cost than the traditional approach. Because of this finding, we used crowdsourced labels from Amazon Mechanical Turk (AMT), a service provided by Amazon that allowed us to upload texts for many people around the world to annotate. We then implemented an Expectation Maximization algorithm to find the maximum likelihood estimates for the boundaries of the noun phrases. Using these estimates, we generated new gold-standard labels for the noun phrases in these texts. Once we ascertain that this process works as intended, we will move on to producing labels for other parts of sentences, such as verb phrases.
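To make the Expectation Maximization step concrete, the following is a minimal sketch and not the project's actual implementation. It assumes a simplified "one-coin" annotator model in which each crowd worker has a single unknown accuracy, and each item is a token position with a binary label (1 if a noun phrase boundary falls there, 0 otherwise). The function name `em_boundary_labels`, the worker and token identifiers, and the 0.5 decision threshold are all hypothetical choices made for illustration.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def em_boundary_labels(votes, n_iter=50, eps=1e-6):
    """Estimate P(true label = 1) for each item from noisy binary votes.

    votes: {item_id: {worker_id: 0 or 1}}; every item needs at least one vote.
    Returns (posterior, accuracy), where posterior[item] is the estimated
    probability that the item is a boundary and accuracy[worker] is the
    estimated probability that the worker labels an item correctly.
    """
    # Initialise with the soft majority vote: the fraction of 1-votes per item.
    posterior = {i: sum(v.values()) / len(v) for i, v in votes.items()}
    accuracy = {}
    for _ in range(n_iter):
        # M-step: re-estimate the class prior and each worker's accuracy,
        # weighting every vote by the current posterior of its item.
        prior = sum(posterior.values()) / len(posterior)
        prior = min(max(prior, eps), 1 - eps)
        agree, count = {}, {}
        for i, v in votes.items():
            for w, label in v.items():
                p_agree = posterior[i] if label == 1 else 1.0 - posterior[i]
                agree[w] = agree.get(w, 0.0) + p_agree
                count[w] = count.get(w, 0) + 1
        accuracy = {w: min(max(agree[w] / count[w], eps), 1 - eps)
                    for w in agree}
        # E-step: recompute each item's posterior from the worker accuracies.
        for i, v in votes.items():
            log_odds = math.log(prior / (1 - prior))
            for w, label in v.items():
                weight = math.log(accuracy[w] / (1 - accuracy[w]))
                log_odds += weight if label == 1 else -weight
            posterior[i] = _sigmoid(log_odds)
    return posterior, accuracy


# Hypothetical toy data: workers A, B, C are reliable; D is consistently wrong.
example_votes = {
    "tok0": {"A": 1, "B": 1, "C": 1, "D": 0},
    "tok1": {"A": 0, "B": 0, "C": 0, "D": 1},
    "tok2": {"A": 1, "B": 1, "C": 1, "D": 0},
    "tok3": {"A": 0, "B": 0, "C": 0, "D": 1},
}
example_posterior, example_accuracy = em_boundary_labels(example_votes)
# Threshold the posteriors to obtain hard gold-standard labels.
gold = {tok: int(p > 0.5) for tok, p in example_posterior.items()}
```

In this sketch, a worker whose estimated accuracy falls below 0.5 receives a negative weight, so their votes count as evidence for the opposite label. A fuller model, such as one with a separate confusion matrix per worker, would follow the same alternation between estimating worker reliability and estimating the true labels.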