2007 North American Computational Linguistics Olympiad
Some of the material on this page consists of modified Wikipedia content, provided here under the terms of the GNU Free Documentation License.
Information about Language Technology
Language technology, often called Human Language Technology (HLT), has Computational Linguistics (CL) and Speech Technology at its core and also covers many of their application-oriented aspects. It is closely connected to Computer Science and Linguistics.
Language Technology Areas
Here are the general language technology areas:
- Machine Translation
- Information Retrieval and Extraction
- Natural Language Processing
- Question Answering
- Computational Biology
- Speech Recognition
- Speech Synthesis
- Speaker Identification and Verification
- Dialogue Systems
Machine Translation
Machine Translation, sometimes referred to by the acronym MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. It is considered a very challenging problem, in part because of the large variability in the structures of the 6,000 languages of the world. At its most basic level, MT performs simple substitution of atomic words in one natural language for words in another. Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms. Subcategories of MT include Dictionary-based MT, Statistical MT, Example-based MT, and Interlingual MT.
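The word-for-word substitution described above can be sketched in a few lines of Python. This is only a toy illustration, not a real MT system: the dictionary `TOY_DICTIONARY` and the function name `translate_word_for_word` are made up for this example, and the English-to-French entries were chosen by hand.

```python
# Toy English-to-French dictionary; the entries are illustrative assumptions.
TOY_DICTIONARY = {
    "the": "le",
    "cat": "chat",
    "black": "noir",
    "sleeps": "dort",
}


def translate_word_for_word(sentence, dictionary):
    """Replace each known word with its dictionary entry.

    Words missing from the dictionary are passed through unchanged.
    """
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


print(translate_word_for_word("The black cat sleeps", TOY_DICTIONARY))
# prints: le noir chat dort
```

The output keeps the English word order, whereas French would place the adjective after the noun ("le chat noir dort"). Differences of this kind are one reason simple substitution is not enough and corpus-based techniques are used.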
Some Groups and Researchers in the Area:
- Kevin Knight, ISI. http://www.isi.edu/~knight/ See Knight's paper on translating from Arcturan to Centauri: http://www.isi.edu/natural-language/mt/aimag97.ps
- Jaime Carbonell, http://www.cs.cmu.edu/~jgc/
Information Retrieval and Extraction
Information Retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet, the World Wide Web, or intranets, for text, sound, images, or data. Data retrieval, document retrieval, information retrieval, and text retrieval are commonly confused, however, and each has its own body of literature, theory, praxis, and technologies. Like most nascent fields, IR is interdisciplinary, drawing on computer science, mathematics, library science, information science, cognitive psychology, linguistics, statistics, and physics.
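As a rough illustration of document retrieval, the sketch below builds an inverted index (a map from each word to the documents that contain it) and answers boolean AND queries. It is a minimal example under simple assumptions: whitespace tokenization, no ranking, and a hand-written document collection `DOCS`. All names in it are hypothetical.

```python
from collections import defaultdict

# Tiny hand-written collection; ids and texts are illustrative assumptions.
DOCS = {
    1: "speech recognition converts speech to words",
    2: "machine translation converts text between languages",
    3: "speech synthesis produces speech from text",
}


def build_inverted_index(documents):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_inverted_index(DOCS)
print(sorted(search(index, "speech")))       # prints: [1, 3]
print(sorted(search(index, "speech text")))  # prints: [3]
```

Real IR systems add steps this sketch leaves out, such as stemming, stop-word handling, and ranking results by relevance.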
Information Extraction (IE) is a type of information retrieval whose goal is to automatically extract structured or semistructured information from unstructured machine-readable documents. It is a sub-discipline of language engineering, a branch of computer science.
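To make the idea of extracting structured information from unstructured text concrete, here is a small sketch that turns lines of the form "Name (Role), Affiliation" into records. The sample lines are taken from the organizing committee list on this page; the pattern and the function name `extract_committee_record` are assumptions made for this example, and practical IE systems rely on far more robust methods than a single regular expression.

```python
import re

# Pattern for "Name (Role), Affiliation"; an illustrative assumption,
# not a general-purpose extractor.
PATTERN = re.compile(
    r"(?P<name>[^()]+?)\s*\((?P<role>[^()]+)\),\s*(?P<affiliation>.+)"
)


def extract_committee_record(line):
    """Return a dict with name, role, and affiliation, or None if no match."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


lines = [
    "Lori Levin (General Chair), Carnegie Mellon University",
    "Dragomir R. Radev (Program Chair), University of Michigan",
    "This line has no recognizable structure",
]
for line in lines:
    print(extract_committee_record(line))
```

The first two lines yield records with three fields each, and the third yields `None` because it does not fit the pattern.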
Some Groups and Researchers in the Area:
- James Allan, http://ciir.cs.umass.edu/~allan/
- Jamie Callan, http://www.cs.cmu.edu/~callan/
- W. Bruce Croft: http://ciir.cs.umass.edu/personnel/croft.html
Natural Language Processing
Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
Some Groups and Researchers in the Area:
- Stanford NLP group: http://nlp.stanford.edu/
Question Answering
Question Answering (QA) is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection), the system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines. QA research attempts to deal with a wide range of question types, including fact, list, definition, How, Why, hypothetical, semantically-constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web. Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Open-domain question answering deals with questions about nearly everything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Some Groups and Researchers in the Area:
- Dan Moldovan and colleagues at the University of Texas http://www.hlt.utdallas.edu/~moldovan/
- demo QA system over the Web from Language Computer Corporation: http://www.languagecomputer.com/
- QA competition: TREC. http://trec.nist.gov/
Computational Biology
Computational Biology is an interdisciplinary field that applies the techniques of computer science and applied mathematics to problems inspired by biology. Major fields that use computational biology techniques include:
- Bioinformatics, which applies algorithms and statistical techniques to biological datasets that typically consist of large numbers of DNA, RNA, or protein sequences. Examples of specific techniques include sequence alignment, which is used for both sequence database searching and for comparison of homologous sequences; gene finding; and prediction of gene expression. (The term computational biology is sometimes used as a synonym for bioinformatics.)
- Computational genomics, a field within genomics which studies the genomes of cells and organisms by high-throughput genome sequencing that requires extensive post-processing known as genome assembly, and which uses DNA microarray technologies to perform statistical analyses on the genes expressed in individual cell types.
- Systems biology, which aims to model large-scale biological interaction networks (also known as the interactome), often using differential equations.
- Protein structure prediction and structural genomics, which attempt to systematically produce accurate structural models for three-dimensional protein structures that have not been solved experimentally.
- Computational biochemistry and biophysics, which make extensive use of structural modeling and simulation methods such as molecular dynamics and Monte Carlo-inspired Boltzmann sampling methods in an attempt to elucidate the kinetics and thermodynamics of protein functions.
Some Groups and Researchers in the Area:
- Brown University http://www.brown.edu/Research/CCMB/
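Sequence alignment, mentioned above as a core bioinformatics technique, can be illustrated with a short dynamic-programming sketch in the style of the Needleman-Wunsch global alignment algorithm. The scoring values (match +1, mismatch -1, gap -1) and the function name `global_alignment_score` are assumptions chosen for this example; real tools use substitution matrices and more refined gap penalties.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of two sequences.

    score[i][j] holds the best score for aligning the first i symbols
    of seq_a with the first j symbols of seq_b.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            gap_in_b = score[i - 1][j] + gap  # symbol of seq_a aligned to a gap
            gap_in_a = score[i][j - 1] + gap  # symbol of seq_b aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)

    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # prints: 7
print(global_alignment_score("GATTACA", "GATACA"))   # prints: 5
```

In the second call the best alignment matches six symbols and inserts one gap, which gives 6 - 1 = 5 under the assumed scoring.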
Speech Recognition
Speech Recognition (in many contexts also known as "automatic speech recognition", "computer speech recognition", or, erroneously, "voice recognition") is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. Speech recognition applications that have emerged in recent years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), and preparation of structured documents (e.g., a radiology report).
Some Groups and Researchers in the Area:
- Carnegie Mellon University, http://www.cs.cmu.edu/~robust/
- MIT, http://www.rle.mit.edu/speech/
Speech Synthesis
Speech Synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. Synthesized speech can also be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output. The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1980s.
Some Groups and Researchers in the Area:
- Carnegie Mellon University, http://www.speech.cs.cmu.edu/speech/
- Alan W. Black, http://www.cs.cmu.edu/~awb/
Speaker Identification and Verification
Speaker Verification, or Voice Authentication, is a type of speaker recognition. It is the problem of verifying a person's identity solely by their voice. It can be used in security applications where a voice print replaces typed passwords and PINs, so that the voice itself authenticates the user.
Speaker Identification is also a type of speaker recognition. It is the problem of identifying a person solely by their voice. It can be used for purposes such as police investigations.
Some Groups and Researchers in the Area:
- Microsoft Research Speech Technology group http://research.microsoft.com/srg/
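The difference between the two tasks described above can be sketched as follows: verification compares a voice sample against one stored voice print and accepts or rejects it, while identification picks the best match among all enrolled speakers. This is a hedged toy example. It assumes each voice print is already a small numeric feature vector, uses cosine similarity with an arbitrary threshold of 0.9, and the speaker names and values in `ENROLLED` are invented.

```python
import math

# Invented voice prints: speaker name -> feature vector (assumption).
ENROLLED = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.2, 0.8, 0.5],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(claimed_name, sample, enrolled, threshold=0.9):
    """Verification: does the sample match the claimed speaker's voice print?"""
    return cosine_similarity(enrolled[claimed_name], sample) >= threshold


def identify(sample, enrolled):
    """Identification: which enrolled speaker is most similar to the sample?"""
    return max(enrolled, key=lambda name: cosine_similarity(enrolled[name], sample))


sample = [0.85, 0.15, 0.35]
print(verify("alice", sample, ENROLLED))  # prints: True
print(verify("bob", sample, ENROLLED))    # prints: False
print(identify(sample, ENROLLED))         # prints: alice
```

Verification answers a yes/no question about one claimed identity, so it needs a threshold; identification always returns one of the enrolled speakers, even when none is a good match, unless a rejection rule is added.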
Dialogue Systems
A Dialogue System is a computer system intended to converse with a human. Dialogue systems have employed text, speech, graphics, haptics, gestures, face configurations, body positions, emotions, and other modes for communicative intent on both the input and output channel.
Some Groups and Researchers in the Area:
- Alexander I. Rudnicky, http://www.cs.cmu.edu/~air/
Olympiad Locations
- Pittsburgh area (hosted by Carnegie Mellon University) contact: Lori Levin, lsl cs.cmu.edu
- Philadelphia area (hosted by U. of Pennsylvania) contact: Mitch Marcus, mitch cis.upenn.edu
- Boston area (hosted by Brandeis University, Cambridge) contact: James Pustejovsky, boston.olympiad gmail.com
- Ithaca area (hosted by Cornell University) contact: Claire Cardie, cardie cs.cornell.edu
- Online participation contact: Dragomir R. Radev, radev umich.edu
Organizing Committee
- Lori Levin (General Chair), Carnegie Mellon University
- Thomas Payne (General Chair), University of Oregon
- Dragomir R. Radev (Program Chair), University of Michigan
- William Lewis (Outreach Chair), University of Washington
- James Pustejovsky (Sponsorship Chair), Brandeis University
- Barbara Di Eugenio (Follow-up Chair), University of Illinois at Chicago



