
Term Extraction from Unrestricted Text

NODALIDA-95

Antti Arppe

1. Introduction

Lingsoft has been developing an aid for terminology work based on NPtool, a program that detects English noun phrases (NP's). NPtool is linguistically based, using Two-level Morphology (TWOL) and Constraint Grammar (CG), both developed at the Department of General Linguistics at Helsinki University. NPtool takes as input English text in ASCII format. As the bulk of terminology consists of NP's (80-99% depending on the source), fishing for NP's as a starting point for term extraction can be considered a justifiable approach. In this respect, NPtool is very promising: on texts containing altogether some 20,000 words, it has reached a recall rate of 98.5-100% and a precision rate of 95-98%. A thorough account of the inner workings and theoretical background of NPtool is given by Atro Voutilainen in [1].

2. Enhancing NPtool for Term Extraction

By itself, NPtool only gives a list of the maximal length NP's extracted from a given text, each classified as either a sure or an unsure finding (preceded by ok: or ?: respectively). For the purposes of terminology work, NPtool has been enhanced at Lingsoft with new features. The examples given below are based on a text by Stephen Hawking on cosmology.

2.1. Subsets of NP's

All acceptable permutations of NP's and their subsets are extracted from a text. That is, instead of presenting only the maximal length NP, all the acceptable subsets of this NP are also given. The aim is to present a terminologist with all the possibilities, though this choice approximately triples the number of listed NP's. This gives the following set:
    Premodifier-1 + ... + Premodifier-n + Head
    Premodifier-2 + ... + Premodifier-n + Head
    Premodifier-n + Head
    Premodifiers + Head + Postmodifiers of which all subsets with one or
        more premodifier or postmodifier dropped off are given
    Head + Postmodifier-1 + ... + Postmodifier-n
    Head + Postmodifier-1 + ... + Postmodifier-(n-1)
    Head + Postmodifier-1
    Head
    NP's in Postmodifiers
Premodifiers include mainly adjectives, determiners, and non-head nouns. Premodifiers can consist of coordinating phrases. Postmodifiers accepted by NPtool are, on the other hand, prepositional phrases that unambiguously postmodify a preceding nominal head, such as preverbal NP-PP sequences. Currently, postmodifiers accepted by NPtool do not include postmodifying non-finite clausal constructions, such as participles or infinitives. In recent tests, however, their omission has not substantially affected the recall value of NPtool. Thus, in addition to the maximal length NP:
    exact form of the correct theory of quantum gravity
the following phrases are also extracted (initial determiners are left out):
    exact form of the correct theory
    exact form
    form
    form of the correct theory of quantum gravity
    form of the correct theory
    correct theory of quantum gravity
    correct theory
    theory of quantum gravity
    theory
    quantum gravity
    gravity
Actually, NPtool supports many other possibilities of relaxing or extending the pattern used to extract NP's. For example, the number of postmodifiers can be limited. Extracting relatively long NP's can in any case be useful especially for (machine) translation, where the translation of a complex NP consisting of a chain of NP's can be significantly different from the translations of its constituent NP's.
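The subset pattern of this section can be sketched in a few lines of Python. This is only an illustration, not NPtool's implementation: it assumes the NP has already been analysed into a chain of simple NP's joined by prepositions, and the names SimpleNP and np_subsets are hypothetical.

```python
from typing import List, NamedTuple, Optional


class SimpleNP(NamedTuple):
    """One link in a chain of NP's joined by prepositions (hypothetical structure)."""
    premodifiers: List[str]            # adjectives and non-head nouns, in text order
    head: str
    preposition: Optional[str] = None  # attaches this NP to the preceding one
    determiner: Optional[str] = None   # kept inside postmodifiers, dropped phrase-initially


def np_subsets(chain: List[SimpleNP]) -> List[str]:
    """List the maximal NP and its subsets for a chain of prepositionally linked NP's."""
    phrases = []
    for i, first in enumerate(chain):
        # Drop premodifiers from the left, one at a time, down to the bare head.
        for k in range(len(first.premodifiers) + 1):
            start = first.premodifiers[k:] + [first.head]
            # Keep zero or more of the following postmodifiers, dropping from the right.
            for j in range(i, len(chain)):
                words = list(start)
                for np in chain[i + 1 : j + 1]:
                    words.append(np.preposition)
                    if np.determiner:
                        words.append(np.determiner)
                    words.extend(np.premodifiers)
                    words.append(np.head)
                phrases.append(" ".join(words))
    return phrases


# The example NP from this section, analysed by hand (an assumption of the sketch).
chain = [
    SimpleNP(["exact"], "form"),
    SimpleNP(["correct"], "theory", preposition="of", determiner="the"),
    SimpleNP(["quantum"], "gravity", preposition="of"),
]
for phrase in np_subsets(chain):
    print(phrase)
```

For the example NP this yields the maximal phrase together with the eleven subsets listed above. Limiting the number of postmodifiers, as mentioned in the text, would amount to bounding the inner loop over j.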

2.2. Generic modifiers

NP's beginning with so-called generic modifiers are removed from the list of NP's. This does not, however, lower recall, since all the subsets of any given NP are extracted. Generic modifiers include determiners, adjectives, and prefix phrases, such as:
    characteristic
    particular
    sheer
    mere
    kind of
    notion of
    so-called
    some
    one
These words and phrases are often used as 'fillers' in text to soften an expression. Removing them decreases the number of listed NP's by 1-5%, depending on the domain and type of text and on the contents of the list, which lessens the subsequent manual work. The contents of this 'stop list' can be revised according to the characteristics of the domain and type of text.
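As a rough illustration, the stop-list filter could look as follows in Python. The stop list is the sample given above; the function names and the candidate phrases in the usage example are invented for this sketch, and matching on whole words at the start of the phrase is an assumption rather than a description of the actual implementation.

```python
GENERIC_MODIFIERS = (
    "characteristic", "particular", "sheer", "mere",
    "kind of", "notion of", "so-called", "some", "one",
)


def starts_with_generic_modifier(phrase, stop_list=GENERIC_MODIFIERS):
    """True if the phrase begins with a stop-list entry, compared word by word."""
    words = phrase.lower().split()
    for entry in stop_list:
        entry_words = entry.split()
        if words[: len(entry_words)] == entry_words:
            return True
    return False


def remove_generic(phrases, stop_list=GENERIC_MODIFIERS):
    """Keep only the phrases that do not begin with a generic modifier."""
    return [p for p in phrases if not starts_with_generic_modifier(p, stop_list)]


# Hypothetical candidates: each NP is listed together with its shorter subset.
candidates = ["particular theory", "theory", "kind of singularity", "singularity"]
print(remove_generic(candidates))
```

Because the subsets of every NP are listed separately, the phrases without the generic modifier survive the filter, which is why recall is not affected.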

2.3. Frequencies

The number of occurrences of each NP in the text is counted. Justeson and Katz have emphasized frequency as a criterion in the selection of good candidate terms [2].

2.4. Grouping NP's

In addition to being listed in order of frequency, NP's can also be arranged into groups according to their grammatical heads and their frequencies. This is to a great extent similar to the approach used by Ken Church and Ido Dagan in [3]. (Church and Dagan did not, however, extract NP's with postmodifiers.) Thus, NP's are given in a hierarchic manner, with the frequency of the head determining the position of the whole group with the same head, and the frequency of each individual NP determining the position within this group. This enables a terminologist to determine whether a simple head, a longer NP with the same head, or both, is the actual term. This grouping can be done according to any grammatical head in an NP, of which the following options have been implemented:
  1. the main grammatical head of the phrase (uncertainty principle of quantum mechanics; head: principle)
  2. the last head in the phrase (uncertainty principle of quantum mechanics; head: mechanics)
  3. all the heads in the phrase (uncertainty principle of quantum mechanics; heads: principle, mechanics)
The last option duplicates the occurrences of phrases with multiple heads. For example, with the heads principle and mechanics, the following groupings are presented:
    ===
    ok: 9 principle
    ok: 6 uncertainty principle
    ok: 3 uncertainty principle of quantum mechanics
    ok: 3 principle of quantum mechanics
    ===
    ok: 5 mechanics
    ok: 5 quantum mechanics
    ok: 3 uncertainty principle of quantum mechanics
    ok: 3 principle of quantum mechanics
    ===
As can be observed, the head mechanics never occurs without the nominal premodifier quantum. Based on this evidence it could be concluded that their combination quantum mechanics is the actual term, at least in the text in question. The same conclusion could also be drawn for uncertainty principle.
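The grouping by all heads can be sketched as follows. This is a minimal illustration, assuming that each NP already comes with its frequency and its list of heads, and that the frequency of a head is the frequency of the bare head as an NP of its own; the function name and the tuple layout are hypothetical.

```python
def group_by_head(nps):
    """nps: list of (phrase, frequency, heads) tuples; returns the listing as lines."""
    frequency = {phrase: freq for phrase, freq, _ in nps}
    groups = {}
    for phrase, freq, heads in nps:
        for head in heads:  # a phrase with several heads is duplicated into each group
            groups.setdefault(head, []).append((phrase, freq))
    # Groups are ordered by the frequency of the bare head (0 if it never occurs alone).
    ordered_heads = sorted(groups, key=lambda head: -frequency.get(head, 0))
    lines = ["==="]
    for head in ordered_heads:
        # The sort is stable, so phrases with equal frequency keep their input order.
        for phrase, freq in sorted(groups[head], key=lambda item: -item[1]):
            lines.append(f"ok: {freq} {phrase}")
        lines.append("===")
    return lines


nps = [
    ("principle", 9, ["principle"]),
    ("uncertainty principle", 6, ["principle"]),
    ("uncertainty principle of quantum mechanics", 3, ["principle", "mechanics"]),
    ("principle of quantum mechanics", 3, ["principle", "mechanics"]),
    ("mechanics", 5, ["mechanics"]),
    ("quantum mechanics", 5, ["mechanics"]),
]
print("\n".join(group_by_head(nps)))
```

With the frequencies from the example, this reproduces the listing shown above. Grouping by the main head or by the last head only would correspond to passing a single head per phrase.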

2.5. Normalizing NP's

The NP's can also be normalized by flattening their syntactic structure. This procedure aims at linking the different surface variations of the same basic term. It is motivated by the tendency, especially in modern American English, to prefer preposed modifying nouns and NP's over postmodifying prepositional phrases. Thus, one might find in a text both the following phrases, which are essentially surface representatives of the same term:
    uncertainty principle of quantum mechanics
    quantum mechanics uncertainty principle
The normalization procedure works by moving a postmodifying prepositional phrase, stripped of its preposition and determiners, to a position between the premodifying adjectives and possessive nouns on the one hand, and the modified head with its nominal premodifiers other than possessive nouns (all bracketed in the examples) on the other. In NP's with multiple heads, this procedure is performed recursively. Thus, with longer phrases such as:
    exact form of the correct theory of quantum gravity
    present expansion phase of the universe
    god's choice of initial condition
the following transformation takes place:
    exact correct quantum gravity theory form
    present universe expansion phase
    god's initial condition choice
The procedure can be visualized in Figure 2. The same figure also shows the flattening of syntactic structure. Single evaluated texts, however, appear quite consistent in their NP formation patterns. If dual surface forms exist in a text for terms, these can easily be detected by grouping the NP's according to the main grammatical heads of the NP's.
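The recursive flattening can be sketched as follows, assuming each NP has already been analysed into premodifying adjectives and possessive nouns, nominal premodifiers, a head, and at most one postmodifying NP. The NP class and the normalize function are hypothetical names for this illustration and are not part of NPtool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NP:
    head: str
    adjectives: List[str] = field(default_factory=list)         # adjectives and possessive nouns
    noun_premodifiers: List[str] = field(default_factory=list)  # non-possessive nominal premodifiers
    postmodifier: Optional["NP"] = None  # NP of the postmodifying PP, preposition and determiner dropped


def normalize(np: NP) -> List[str]:
    """Flatten an NP: adjectives, then the normalized postmodifier, then nouns and head."""
    words = list(np.adjectives)
    if np.postmodifier is not None:
        words += normalize(np.postmodifier)  # recursion handles NP's with multiple heads
    words += np.noun_premodifiers
    words.append(np.head)
    return words


# "exact form of the correct theory of quantum gravity", analysed by hand
phrase = NP("form", ["exact"], [],
            NP("theory", ["correct"], [],
               NP("gravity", [], ["quantum"])))
print(" ".join(normalize(phrase)))
```

Applied to the three example phrases, the sketch gives the normalized forms listed above.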

2.6. Presenting NP's in Their Context

It is possible to set up a format for presenting extracted candidate terms in their contexts using NPtool's internal, intermediary results.

3. Opportunities for Terminology Work and Evaluation of Benefits

The enhanced version of NPtool can be used for several aspects of terminology work:
  1. Constructing term lists from scratch using existing corpora in different organizations or domains in general.
  2. Updating terminology by comparing existing term lists to lists of NP's extracted, grouped, and normalized with the enhanced version of NPtool.
  3. Constructing bilingual corpora of terms and fixed phrases and their translation equivalents, when similar noun phrase extraction tools are available for other languages than English.
Evaluation of the benefits of using the enhanced version of NPtool for terminology work can be done from two points of view. The most important is how much more efficient its use can make terminology work. The second is how many of the extracted phrases are found to be actual terms. With the near perfect rates of recall and precision, automated extraction of term candidates from text most certainly increases the efficiency, consistency, and completeness of terminology work compared to manual methods. Unfortunately, we have not yet been able to use the enhanced version of NPtool in real situations of terminology work, so we cannot give exact figures. Results showing a more than twofold increase in efficiency with Church and Dagan's similar tool can, however, be considered indicative of the existing potential [3].

As to the proportion of actual terms in the candidate phrases, the truth (or termness in this case) depends eventually on the judgement of the terminologist or other end user. What is considered a good term depends greatly on the use to which the terms or phrases are put. The classical definition of a term as a universally established concept is most certainly a stricter criterion than the need in computer-aided translation for simply spotting recurrent phrases with a special meaning for some organization. Table 1 gives the results of one evaluation of candidate terms extracted with the enhanced version of NPtool from a text dealing with translation. As can be seen, a clear minority of candidate terms were eventually chosen as actual terms. Dividing the candidate terms between those with a single occurrence and those with multiple occurrences, however, gives somewhat different results. For candidate terms with multiple occurrences, the proportion of those judged correct choices is much higher than for all candidate terms. On the other hand, for candidate terms with a single occurrence, the situation is quite the opposite. Frequency would seem to correlate positively with termness, which is in agreement with the findings of Justeson and Katz [2].

Table 1: Evaluation of Candidate Terms Extracted from a Text on Translation (12225 words)

Statistics                          Number  Terms  Question marks  Non-terms
Candidate phrases                     2656    21%       57%           22%
Candidate phrases, frequency = 1      1972    17%       44%           50%
Candidate phrases, frequency >= 2      684    36%       50%           14%

4. Term Characteristics

It would be of great help in terminology extraction if there were some features which would differentiate between terms and non-terms. One such feature could be the surface syntactic patterns of the candidate terms. Table 2 shows the most common surface syntactic patterns of candidate terms judged to be actual terms in the aforementioned sample text. As can be seen, out of 34 different surface syntactic patterns of terms, the first eight account for slightly over 90 percent of all terms. Limiting the extraction of candidate terms to these eight patterns could be one method of cutting down on the subsequent manual evaluation work. One has to note, however, that the precision of these patterns varies from roughly 15 to 40 percent, amounting to an overall precision of 27 percent. This means that well over half of the candidate terms extracted with the aforementioned eight patterns would be 'garbage'.

Table 2: Surface Syntactic Patterns of Terms (Sample of 558 term types representing 34 different patterns, approximately half of which have only a single representative term)

Pattern  Percentage of   Cumulative  Precision
         All Term Types    Recall
A N           27%           27%         29%
N N           17%           44%         40%
N             16%           60%         16%
N P N         11%           71%         29%
A N N          8%           79%         42%
N P A N        5%           84%         31%
N P N N        4%           88%         33%
A N P N        3%           91%         24%
I N            1%           92%         15%
A A N          1%           93%         21%
N P N P N      1%           94%         26%
N N N          1%           95%          8%
Others         5%          100%          -

N = noun, A = adjective, P = preposition, I = ing-form of a verb
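The cumulative recall column is simply a running sum of the percentage column, which makes it easy to see how many patterns are needed for a given level of recall. The following sketch uses the rounded percentages of Table 2; the function names are chosen for this illustration only.

```python
from itertools import accumulate

# (pattern, percentage of all term types), in the order of Table 2
TABLE_2 = [
    ("A N", 27), ("N N", 17), ("N", 16), ("N P N", 11),
    ("A N N", 8), ("N P A N", 5), ("N P N N", 4), ("A N P N", 3),
    ("I N", 1), ("A A N", 1), ("N P N P N", 1), ("N N N", 1),
]


def cumulative_recall(rows):
    """Running sum of the per-pattern shares, i.e. the cumulative recall column."""
    return list(accumulate(share for _, share in rows))


def patterns_needed(rows, target):
    """Number of top patterns needed to reach the target recall, or None if not reached."""
    for count, total in enumerate(cumulative_recall(rows), start=1):
        if total >= target:
            return count
    return None


print(cumulative_recall(TABLE_2))
print(patterns_needed(TABLE_2, 90))
```

With the figures of Table 2, a recall of 90 percent is reached with the first eight patterns, as noted in the text.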
Table 2 did not include the frequencies of individual terms representing a particular syntactic pattern. The results are somewhat different if these are taken into account (Table 3). The precision figures in Table 3 indicate the probability that a random NP with a certain pattern is a term when nothing is known of its overall frequency in the text. One conclusion is that the shortest terms, single nouns and two-word compounds, are used most, since they represent slightly over 80 percent of all occurrences of terms. Single nouns alone account for almost half of all term occurrences in the text. While these single nouns have only a 34 percent probability of being a term, two-noun compounds have a 73 percent probability of being a term.

Table 3: Surface Syntactic Patterns of Terms (Sample of 3137 term tokens representing 34 different patterns, of which 14 have only a single occurrence)

Pattern  Percentage of  Cumulative  Precision
          term tokens     Recall
N             45%           45%         34%
N N           19%           64%         73%
A N           17%           81%         52%
N P N          4%           85%         38%
N P N N        3%           88%         71%
A N N          3%           91%         54%
N P A N        2%           93%         54%
I N            1%           94%         38%
Others         6%          100%          -

N = noun, A = adjective, P = preposition, I = ing-form of a verb

If high precision were the main objective, the patterns at the top of the list would again be different (Table 4). As could be expected, the longer phrases dominate this list, the single noun being as far down as position 20 of 34. The cumulative recall of the ten patterns with the highest precision, however, is only 72 percent. Moreover, the precision drops below 50 percent already after the first two patterns.

Table 4: Surface Syntactic Patterns of Terms with Highest Precision (Sample of 558 term types representing 34 different patterns)

Pattern        Precision   Percentage of  Cumulative
                            Term Types      Recall
A I N            67%           0.7%           0.7%
N P N P N P N    50%           0.2%           0.9%
A N N            42%           7.7%           8.6%
N N              40%          17.2%          25.8%
N P N N          33%           3.8%          29.6%
I N P N N        33%           0.2%          29.8%
A A N P N        33%           0.2%          30.0%
N P A N          31%           4.7%          34.7%
A N              29%          26.7%          61.4%
N P N            28%          10.6%          72.0%
Others            -             28%         100.0%

N = noun, A = adjective, P = preposition, I = ing-form of a verb

References

[1] Atro Voutilainen. 1993. "NPtool. A detector of English noun phrases". In: Proceedings of the Workshop on Very Large Corpora, Columbus, Ohio: Ohio State University, June 22, 1993.

[2] John Justeson and Slava Katz. 1993. "Technical terminology: some linguistic properties and an algorithm for identification in text". Technical Report RC 18906, IBM Research Division.

[3] Ken Church and Ido Dagan. 1994. "Termight: Identifying and Translating Technical Terminology". In: Fourth Conference on Applied Natural Language Processing, Stuttgart, Germany: Association for Computational Linguistics, October 13-15, 1994.

