Section Overview
Natural language applications like machine translation, question answering, and summarization are currently forced to depend on impoverished text models such as bags of words or n-grams, even though the decisions they make ought to be based on the meanings of those words in context. This lack of semantics causes problems throughout these applications. Misinterpreting an ambiguous word leads to failures in data extraction, incorrect alignments in translation, and ambiguous language models. Incorrect coreference resolution results in missed information (because a connection is not made) or incorrectly conflated information (because of false connections). A richer semantic representation is badly needed.
The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute to produce such a resource. It aims to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology, and coreference). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. Over the course of the five-year program, our current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic.
Our plan is to make this resource available to the natural language research community so that decoders for these phenomena can be trained to generate the same structure in new documents. Experience over the years has shown that annotation quality is crucial if the data is to be used for training machine learning algorithms. We will therefore ensure that each layer of annotation in OntoNotes has at least 90% inter-annotator agreement. Our pilot studies have shown that predicate structure, word sense, ontology linking, and coreference can all be annotated rapidly and with better than 90% consistency.
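To make the 90% target concrete, here is a minimal sketch of one way agreement between two annotators could be computed. It assumes simple observed agreement (the fraction of items given the same label). The text above does not say which agreement measure the project uses, so the function name, the sense labels, and the measure itself are illustrative assumptions.

```python
def observed_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label.

    Assumption: plain observed agreement, with no correction for chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one labelled item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical sense labels for ten instances of one word.
annotator_1 = ["s1", "s1", "s2", "s1", "s3", "s2", "s1", "s1", "s2", "s1"]
annotator_2 = ["s1", "s1", "s2", "s1", "s3", "s2", "s1", "s2", "s2", "s1"]

agreement = observed_agreement(annotator_1, annotator_2)
meets_target = agreement >= 0.90  # threshold taken from the text above
```

In this made-up example the annotators disagree on one item out of ten, so the layer would sit exactly at the 90% threshold.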
This level of semantic representation goes far beyond the entity and relation types currently targeted in the ACE program, since every concept in the text will be indexed, not just 100 pre-specified types. For example, consider this sentence: "The founder of Pakistan's nuclear program, Abdul Qadeer Khan, has admitted that he transferred nuclear technology to Iran, Libya, and North Korea". In addition to the names, each of the nouns "founder", "program", and "technology" would be assigned a word sense and linked to an appropriate ontology node. The propositional connection signaled by "founder" between Khan and the program would also be marked. The verbs "admit" and "transfer" would have their word sense and argument structures identified and be linked to their equivalent ontology nodes. One argument of "admit" is "he", which would be connected by coreference to Khan, and the other is the entire transfer clause. The verb "transfer", in turn, has "he/Khan" as the agent, the technology as the item transferred, and the three nations Iran, Libya, and North Korea as the destination of the transfer. A graphical view of the representation is shown below:
[Figure: The OntoNotes Representation]
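The layers described for the example sentence can also be sketched as a small data structure. This is an illustration only, not the OntoNotes file format. The class name, the sense labels, and the role names ("speaker", "content", "agent", "item", "destination") are hypothetical and chosen to mirror the prose description above.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One predicate with a sense label and role-labelled arguments."""
    predicate: str
    sense: str  # hypothetical sense label, not a real inventory entry
    arguments: dict = field(default_factory=dict)  # role name -> argument


# Coreference layer: mentions of the same entity share one chain.
# The first mention in each chain is treated as the canonical one.
coref_chains = {
    "khan": ["Abdul Qadeer Khan", "he"],
}

transfer = Proposition(
    predicate="transfer",
    sense="transfer.SENSE_A",
    arguments={
        "agent": "he",
        "item": "nuclear technology",
        "destination": ["Iran", "Libya", "North Korea"],
    },
)

admit = Proposition(
    predicate="admit",
    sense="admit.SENSE_A",
    # The second argument is the entire transfer clause.
    arguments={"speaker": "he", "content": transfer},
)


def resolve(mention, chains):
    """Return the canonical mention for a coreferent mention, if any."""
    for mentions in chains.values():
        if mention in mentions:
            return mentions[0]
    return mention


who_transferred = resolve(transfer.arguments["agent"], coref_chains)
```

Here the coreference layer lets the pronoun "he" in the agent slot resolve to "Abdul Qadeer Khan", while the nesting of `transfer` inside `admit` reflects that the whole transfer clause is an argument of "admit".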
Significant breakthroughs that change large sections of the field occur from time to time in Human Language Technology. The Penn Treebank in the late 1980s transformed parsing, and the statistical paradigm similarly transformed MT and other applications in the early 1990s. We believe that OntoNotes has the potential to be a breakthrough of this magnitude: it will be the first semantic resource of this size ever produced. As we have seen with the Treebank and WordNet, a publicly available resource unleashes an enormous amount of work internationally on algorithms and on the automated creation of semantic resources in numerous other domains and genres. We believe that this new level of semantic modeling will empower semantics-enabled applications to break the current accuracy barriers in transcription, translation, and question answering, fundamentally changing the nature of human language technology.



