The TREVI project fell under the Esprit (Information Technology) programme
of the European Union.
TREVI (Text Retrieval and Enrichment for Vital Information) aimed to
offer a solution to the problem of information overflow, i.e. the
difficulty experienced by both small and large companies in extracting
useful information from large amounts of data coming from the numerous
electronic textual information services available at local or global level
(Internet, proprietary networks, subscription services, World Wide Web,
etc.).
The key result of the TREVI project was a set of software tools (the TREVI
Toolkit) representing a substantial improvement in the flexible management
of distributed textual information sources.
The TREVI Toolkit does not rely on simple text-based search tools, but
rather combines concept-based search and active data mining techniques to
enrich online input text streams.
The Infolab was responsible for the design and implementation of the two
lexicons of the system (English and Spanish), containing both
morphosyntactic and semantic information. A large part of Infolab's task
involved building and extracting the domain lexicons needed to customize
the system for specific applications.
Results
CentER-AR has created a lexicon containing syntactic and semantic
information. The syntactic information covers the English and Spanish
languages; the semantic information is language independent (and linked to
the English and Spanish data).
The lexicon data can be accessed through different storage mechanisms:
flat text files (for a main-memory database), an RDBMS (Oracle), and an
advanced object-oriented database system (a custom system written in Java on
top of ObjectStore).
In TREVI, the lexicon supports the text parser (a smart 'search engine'
that analyzes sentences to better 'understand' texts) in making decisions
based on syntactic and semantic information, e.g. determining the infinitive
of an inflected verb, or deducing the correct meaning of a word that has
more than one meaning (like 'bank'). To support such a system we provide a
high-level API for querying the database (accessible through CORBA). On top
of that there is a graphical interface to show and query the data. We have
also developed various data-input tools.
Contents: The TREVI Lexicon contains several kinds of data (for which
we provide some statistics in the accompanying tables):
1. Syntactic Word Information
Building on the licensed data sets from CELEX and ARIES, the database
contains information on all the forms (verb inflections, noun plural and
singular) in which words can appear in the English and Spanish languages.
| English terms (CELEX) | 30000 nouns, 8500 verbs and 4000 adverbs/adjectives |
| Spanish terms (ARIES) | 20000 nouns, 6000 verbs and 10000 adverbs/adjectives |
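The kind of lookup this form data makes possible (for example, finding the infinitive of an inflected verb) can be sketched as an index from surface forms to base forms. This is only a minimal Python illustration: the record layout, the function names, and the handful of sample rows are assumptions, and the actual CELEX and ARIES records are much richer.

```python
from collections import defaultdict

# Hypothetical sample rows: (language, surface form, base form, part of speech).
# These are illustrative only and are not taken from the CELEX or ARIES data.
SAMPLE_FORMS = [
    ("en", "gave", "give", "verb"),
    ("en", "given", "give", "verb"),
    ("en", "banks", "bank", "noun"),
    ("en", "banks", "bank", "verb"),
    ("es", "dio", "dar", "verb"),
]


def build_form_index(rows):
    """Index every surface form by (language, lower-cased form)."""
    index = defaultdict(list)
    for language, form, base, pos in rows:
        index[(language, form.lower())].append((base, pos))
    return index


def base_forms(index, language, form, pos=None):
    """Return the distinct base forms of a surface form, optionally
    restricted to one part of speech."""
    result = []
    for base, entry_pos in index.get((language, form.lower()), []):
        if (pos is None or pos == entry_pos) and base not in result:
            result.append(base)
    return result


form_index = build_form_index(SAMPLE_FORMS)
```

With this sketch, `base_forms(form_index, "en", "gave", pos="verb")` returns the infinitive `give`, and an unknown form simply yields an empty list.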
2. Semantic Conceptual Network
The lexicon contains a semantic network of concepts. Its structure uses
the same principles as Wordnet. It consists of:
- Concepts, placeholders for elements that have a certain meaning
- Links between concepts and terms. The terms denote the concepts in a
certain language
- Relationships between concepts. The relationships are links between
concepts, and the network they form gives meaning to the concepts. The
lexicon can support many different types of relationships; in the current
database the emphasis is on the hierarchical and part-of relationships.
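The three ingredients listed above (concepts, links from concepts to terms, and relationships between concepts) can be sketched as a small in-memory structure. This is an illustrative Python sketch, not the TREVI implementation; the class names, the relation labels `is_a` and `part_of`, and the sample concepts are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str
    terms: dict = field(default_factory=dict)      # language -> denoting terms
    relations: dict = field(default_factory=dict)  # relation type -> target ids


class ConceptNetwork:
    """Hypothetical container for language-independent concepts."""

    def __init__(self):
        self.concepts = {}

    def add_concept(self, concept_id):
        return self.concepts.setdefault(concept_id, Concept(concept_id))

    def link_term(self, concept_id, language, term):
        self.add_concept(concept_id).terms.setdefault(language, []).append(term)

    def relate(self, source_id, relation, target_id):
        self.add_concept(target_id)
        source = self.add_concept(source_id)
        source.relations.setdefault(relation, []).append(target_id)

    def concepts_for_term(self, language, term):
        """All concepts that a term can denote in the given language."""
        return [c.concept_id for c in self.concepts.values()
                if term in c.terms.get(language, [])]

    def ancestors(self, concept_id, relation="is_a"):
        """Concepts reachable by repeatedly following one relation type."""
        found, stack = [], [concept_id]
        while stack:
            current = stack.pop()
            for parent in self.concepts[current].relations.get(relation, []):
                if parent not in found:
                    found.append(parent)
                    stack.append(parent)
        return found


# Illustrative data only: two senses of the ambiguous English word 'bank'.
net = ConceptNetwork()
net.link_term("financial_institution", "en", "bank")
net.link_term("financial_institution", "es", "banco")
net.link_term("river_bank", "en", "bank")
net.link_term("river_bank", "es", "orilla")
net.relate("financial_institution", "is_a", "organization")
net.relate("river_bank", "is_a", "landform")
net.relate("landform", "is_a", "physical_object")
net.relate("river_bank", "part_of", "river")
```

In this sketch the English term 'bank' maps to two concepts, while each Spanish term maps to one; the relationships around each concept are what a parser could use to tell the senses apart.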
Compared with Wordnet, the TREVI lexicon is smaller and its network is
created with more care. Moreover, the network is divided into clear layers,
and it is possible to attach (link) conceptual networks of third-party
domains to the base lexicon. The lexicon consists of three logical layers:
- Top Level Ontology (TLO), a (small) set of 65 categorizing
concepts.
- Basic Level Ontology (BLO), a set of concepts from day to day life,
which are shared among everybody.
- Various domain specific ontologies which may be linked to the BLO.
| BLO Concepts | 4800 |
| BLO concept links to English words | 6500 |
| BLO concept links to Spanish words | 6800 |
We have created two domain lexicons. In the first, the large amount of
Wordnet data was semi-automatically connected to the BLO so that the
lexicon can recognize a larger number of words. The second is a handmade
domain lexicon of concepts and terms used to better categorize the indexed
articles of one of our customers, Reuters. Both domain-specific lexicons
are linked (hierarchically) to the BLO, which allows information specified
for the BLO, such as frames, to be used for the concepts in those domains.
| Wordnet Concepts | 26500 |
| Wordnet concept links to CELEX words | 35000 |

| Reuters Concepts | 1900 |
| Reuters Terms and Expressions | 2200 |
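How a hierarchical link lets a domain concept reuse information specified on a BLO concept can be sketched as a walk up the hierarchy. The sketch below is an assumption-laden Python illustration: the concept identifiers, the `reuters:` and `blo:` prefixes, and the frame contents are hypothetical and do not come from the TREVI database.

```python
# Hypothetical hierarchical links: child concept -> parent concept.
PARENT = {
    "reuters:acquisition": "blo:buy",
    "blo:buy": "blo:get",
}

# In this sketch, frames are specified only on BLO concepts.
# The frame payload is a placeholder, not real TREVI data.
FRAMES = {
    "blo:buy": {"roles": ["buyer", "goods"]},
}


def inherited_frame(concept_id, parent=PARENT, frames=FRAMES):
    """Return (owner concept, frame) for the nearest concept in the
    hierarchy that has a frame, or None if no ancestor has one."""
    seen = set()  # guards against accidental cycles in the links
    current = concept_id
    while current is not None and current not in seen:
        if current in frames:
            return current, frames[current]
        seen.add(current)
        current = parent.get(current)
    return None
```

Here `inherited_frame("reuters:acquisition")` finds the frame attached to `blo:buy`, so the domain lexicon itself does not need to carry frame data.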
3. Verb Frame Information
Verb frames are structures that indicate how verbs may be used in sentences.
In the TREVI lexicon, we have manually determined a large set of both
semantic and syntactic frames for the basic concepts and terms (in the BLO).
Because of how the lexicon is structured, these frames can also be used to
derive information about verbs in the domain-specific lexicons.
We have determined the semantic frame of over 700 concepts. Semantic
frames can be used to determine what kind of concepts can be used in
combination with a given concept, e.g. 'jump' (a 'living thing').
Furthermore, we have determined the syntactic frame of over 1100 English
verbs and 1125 Spanish verbs. Syntactic frames can be used, for example, to
determine how many terms must be used in combination with a certain verb
(e.g. 'give' in the English language has a presenter, a receiver, and an
item that is given, so a maximum of three related terms in a sentence).
The semantic and syntactic frames are linked (like concepts and denoting
words are linked).
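The two frame types can be illustrated with the examples from the text: 'jump' combined with a 'living thing', and 'give' with its three roles. The Python sketch below is only an illustration under stated assumptions; the dictionaries, the function names, and the tiny hierarchy ('dog', 'animal', 'stone', 'object') are hypothetical and do not reflect how TREVI stores frames.

```python
# Hypothetical hierarchy: concept -> more general concept.
IS_A = {
    "dog": "animal",
    "animal": "living thing",
    "stone": "object",
}

# Semantic frame: the category required of what accompanies the concept.
SEMANTIC_FRAMES = {"jump": "living thing"}

# Syntactic frame: the roles an English verb can take in a sentence.
SYNTACTIC_FRAMES = {"give": ("presenter", "receiver", "item")}


def is_kind_of(concept, category, is_a=IS_A):
    """True if `category` is the concept itself or one of its ancestors."""
    seen = set()
    current = concept
    while current is not None and current not in seen:
        if current == category:
            return True
        seen.add(current)
        current = is_a.get(current)
    return False


def satisfies_semantic_frame(verb_concept, subject_concept):
    """Check the subject against the verb concept's semantic frame.
    Concepts without a frame impose no restriction in this sketch."""
    required = SEMANTIC_FRAMES.get(verb_concept)
    return required is None or is_kind_of(subject_concept, required)


def max_related_terms(verb):
    """Number of roles in the verb's syntactic frame, or None if unknown."""
    roles = SYNTACTIC_FRAMES.get(verb)
    return None if roles is None else len(roles)
```

Under these assumptions a 'dog' can 'jump' but a 'stone' cannot, and 'give' allows at most three related terms.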
There is also an article available (PostScript). It
appeared in the proceedings of the IJCAI 97 conference.