Tilburg University
TREVI
The TREVI project fell under the Esprit (Information Technology) programme of the European Union.

TREVI (Text Retrieval and Enrichment for Vital Information) aimed to offer a solution to the problem of information overflow, i.e. the difficulty experienced by both small and large companies in extracting useful information from large amounts of data coming from the numerous electronic textual information services available at local or global level (Internet, proprietary networks, subscription services, World Wide Web, etc.).

The key result of the TREVI project was a set of software tools (the TREVI Toolkit) representing a substantial improvement in the flexible management of distributed textual information sources.
The TREVI Toolkit does not rely on simple text-based search tools, but rather combines concept-based search and active data mining techniques to enrich online input text streams.

The Infolab was responsible for the design and implementation of the two lexicons of the system (English and Spanish), containing both morphosyntactic and semantic information. A large part of Infolab's task involved building and extracting the domain lexicons needed to customize the system for specific applications.


Results

CentER-AR has created a lexicon containing syntactic and semantic information. The syntactic information covers the English and Spanish languages; the semantic information is language independent (and linked to the English and Spanish data).

The lexicon data can be accessed through different storage mechanisms: flat text files (for a main-memory database), an RDBMS (Oracle), and an advanced object-oriented database system (a custom system written in Java on top of ObjectStore).

In TREVI, the lexicon supports the text parser, a smart 'search engine' that analyzes sentences to better 'understand' texts. The parser uses the lexicon's syntactic and semantic information to make decisions, for example to determine the infinitive of an inflected verb form, or to deduce the correct meaning of a word that may have more than one meaning (like 'bank'). To support such a system we provide a powerful API for querying the database at a high level, accessible through CORBA. On top of that there is an intuitive graphical interface for showing and querying the data. We have also developed various data-input tools.

Contents: The TREVI Lexicon contains several kinds of data (for which we provide some statistics in the accompanying tables):

1. Syntactic Word Information

Building on the licensed data sets from CELEX and ARIES, the database contains information on all the forms (verb inflections, noun plural and singular) in which words can appear in the English and Spanish languages.

English terms (CELEX): 30,000 nouns, 8,500 verbs and 4,000 adverbs/adjectives
Spanish terms (ARIES): 20,000 nouns, 6,000 verbs and 10,000 adverbs/adjectives
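
To illustrate how stored word forms can support a parser that needs the infinitive of an inflected verb, here is a minimal Python sketch. It is not the TREVI API: the row layout, the sample entries and the names `FORMS`, `build_index` and `lemmas` are all assumptions made for this example, and the sample rows are not taken from CELEX or ARIES.

```python
from collections import defaultdict

# Hypothetical sample rows: (inflected form, lemma, part of speech, features).
# Illustrative only; these are not actual CELEX or ARIES records.
FORMS = [
    ("gave", "give", "verb", "past"),
    ("gives", "give", "verb", "3sg-present"),
    ("given", "give", "verb", "past-participle"),
    ("banks", "bank", "noun", "plural"),
    ("banks", "bank", "verb", "3sg-present"),
]


def build_index(rows):
    """Map each inflected form to every (lemma, pos, features) analysis."""
    index = defaultdict(list)
    for form, lemma, pos, features in rows:
        index[form.lower()].append((lemma, pos, features))
    return dict(index)


def lemmas(index, form, pos=None):
    """Return the distinct lemmas of a form, optionally filtered by part of speech."""
    analyses = index.get(form.lower(), [])
    return sorted({lemma for lemma, p, _ in analyses if pos is None or p == pos})


INDEX = build_index(FORMS)

if __name__ == "__main__":
    print(lemmas(INDEX, "gave", pos="verb"))
    print(INDEX["banks"])
```

A form such as 'banks' keeps both its noun and its verb analysis, so the choice between them is left to the caller (in TREVI's case, the parser).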

2. Semantic Conceptual Network

The lexicon contains a semantic network of concepts. Its structure uses the same principles as WordNet. It consists of:
  • Concepts, placeholders for elements that have a certain meaning
  • Links between concepts and terms. The terms denote the concepts in a certain language
  • Relationships between concepts. The relationships are links between concepts, and the network they form gives meaning to the concepts. The lexicon can support many different types of relationships; in the current database the emphasis is on hierarchical and part-of relationships.
The TREVI lexicon is smaller than WordNet, and its network has been built with more care. Moreover, the network is divided into clear layers, and conceptual networks of third-party domains can be attached (linked) to the base lexicon. The lexicon consists of three logical layers:
  • Top Level Ontology (TLO), a (small) set of 65 categorizing concepts.
  • Basic Level Ontology (BLO), a set of concepts from day-to-day life, which are shared by everybody.
  • Various domain specific ontologies which may be linked to the BLO.
BLO concepts: 4,800
BLO concept links to English words: 6,500
BLO concept links to Spanish words: 6,800
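
The structure described above (concepts, term links per language, typed relationships, and layers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names, the relation label "is-a", the language codes and the sample concepts are invented for the example and do not reflect the actual TREVI schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str
    layer: str                                      # "TLO", "BLO" or a domain name
    terms: dict = field(default_factory=dict)       # language -> list of terms
    relations: dict = field(default_factory=dict)   # relation type -> target ids


class Lexicon:
    """Hypothetical in-memory concept network (not the TREVI implementation)."""

    def __init__(self):
        self.concepts = {}

    def add(self, concept_id, layer):
        self.concepts[concept_id] = Concept(concept_id, layer)

    def link_term(self, concept_id, language, term):
        self.concepts[concept_id].terms.setdefault(language, []).append(term)

    def relate(self, source_id, relation, target_id):
        self.concepts[source_id].relations.setdefault(relation, []).append(target_id)

    def concepts_for(self, language, term):
        """All concepts that a term denotes in the given language."""
        return [c.concept_id for c in self.concepts.values()
                if term in c.terms.get(language, [])]

    def ancestors(self, concept_id):
        """Follow hierarchical ('is-a') links upward, nearest ancestor first."""
        seen = []
        queue = list(self.concepts[concept_id].relations.get("is-a", []))
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self.concepts[current].relations.get("is-a", []))
        return seen


# Illustrative sample data only.
lex = Lexicon()
lex.add("entity", "TLO")
lex.add("institution", "BLO")
lex.add("bank#finance", "BLO")
lex.add("bank#river", "BLO")
lex.relate("institution", "is-a", "entity")
lex.relate("bank#finance", "is-a", "institution")
lex.relate("bank#river", "is-a", "entity")
lex.link_term("bank#finance", "en", "bank")
lex.link_term("bank#finance", "es", "banco")
lex.link_term("bank#river", "en", "bank")
lex.link_term("bank#river", "es", "orilla")
```

Because terms are attached to concepts per language, the English word 'bank' maps to two concepts in this sample while each Spanish term maps to one, which is the kind of information a parser needs before it can choose a meaning.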

We have created two domain lexicons. In the first, the large body of WordNet data was semi-automatically connected to the BLO so that the lexicon can recognize a larger number of words. The second is a handmade domain lexicon of concepts and terms used to better categorize the indexed articles of one of our customers, Reuters. Both domain-specific lexicons are linked (hierarchically) to the BLO, which allows information specified for BLO concepts, such as frames, to be used for the concepts in those domains.

WordNet concepts: 26,500
WordNet concept links to CELEX words: 35,000
Reuters concepts: 1,900
Reuters terms and expressions: 2,200
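
One way to picture how a hierarchical link to the BLO lets a domain concept reuse BLO information is an upward walk that stops at the first concept carrying a frame. The sketch below assumes a single parent per concept and uses made-up concepts and frame contents; `PARENT`, `SEMANTIC_FRAMES` and `inherited_frame` are hypothetical names, not part of TREVI.

```python
# Hypothetical hierarchy: child concept -> parent concept (None marks the top).
PARENT = {
    "high-jump": "jump",   # illustrative domain-specific concept
    "jump": "move",        # illustrative BLO concepts
    "move": None,
}

# Hypothetical semantic frames, specified only at the BLO level.
SEMANTIC_FRAMES = {
    "jump": {"agent": "living thing"},
}


def inherited_frame(concept, parent=PARENT, frames=SEMANTIC_FRAMES):
    """Return (owner, frame) for the nearest concept on the path upward that has a frame."""
    visited = set()
    while concept is not None and concept not in visited:
        if concept in frames:
            return concept, frames[concept]
        visited.add(concept)          # guards against accidental cycles
        concept = parent.get(concept)
    return None


if __name__ == "__main__":
    print(inherited_frame("high-jump"))
```

The real lexicon may allow several parents per concept; the single-parent walk is a simplification chosen to keep the example short.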

3. Verb Frame Information

Verb frames are structures that indicate how verbs may be used in sentences. In the TREVI lexicon, we have manually determined a large set of both semantic and syntactic frames for the basic concepts and terms (in the BLO). Because of how the lexicon is structured, these frames can also be used to derive information about verbs that are in the domain-specific lexicons.

We have determined the semantic frame of over 700 concepts. Semantic frames can be used to determine what kind of concepts can be used in combination with a given concept, e.g. 'jump' requires a 'living thing'. Furthermore, we have determined the syntactic frame of over 1,100 English verbs and 1,125 Spanish verbs. Syntactic frames can be used, for example, to determine how many terms can be used in combination with a certain verb (e.g. 'give' in English has a giver, a receiver, and an item that is given, so a maximum of three related terms in a sentence).

The semantic and syntactic frames are linked (like concepts and denoting words are linked).
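
As a rough sketch of how linked frames might be represented and checked, the Python below models a syntactic frame as an ordered list of roles and a semantic frame as restrictions on those roles. Everything here is an assumption for illustration: the class names, the role labels, the `CATEGORIES` table and the restriction placed on 'give' are invented, and only the 'jump' and 'give' examples come from the text above.

```python
from dataclasses import dataclass


@dataclass
class SemanticFrame:
    concept: str
    restrictions: dict   # role -> category that the filler must belong to

    def satisfies(self, role, filler, categories):
        """True if the filler meets the restriction on this role (or there is none)."""
        required = self.restrictions.get(role)
        return required is None or categories.get(filler) == required


@dataclass
class SyntacticFrame:
    verb: str
    roles: tuple         # terms that may accompany the verb
    concept: str         # link to the concept that owns the semantic frame

    def within_limit(self, n_terms):
        """True if the number of related terms does not exceed the frame's roles."""
        return n_terms <= len(self.roles)


# Illustrative category table: term -> category (hypothetical data).
CATEGORIES = {"dog": "living thing", "stone": "thing"}

JUMP_SEM = SemanticFrame("jump", {"agent": "living thing"})
GIVE_SEM = SemanticFrame("give", {"giver": "living thing"})   # assumed restriction
GIVE_SYN = SyntacticFrame("give", ("giver", "receiver", "item"), concept="give")

# The link between the two kinds of frame is kept through the concept id.
SEMANTIC_BY_CONCEPT = {frame.concept: frame for frame in (JUMP_SEM, GIVE_SEM)}

if __name__ == "__main__":
    print(JUMP_SEM.satisfies("agent", "dog", CATEGORIES))
    print(GIVE_SYN.within_limit(4))
```

Looking up `SEMANTIC_BY_CONCEPT[GIVE_SYN.concept]` is one simple way to move from a verb's syntactic frame to the semantic frame of the concept it denotes.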


There is also an article available (PostScript). It appeared in the proceedings of the IJCAI 97 conference.
