Premodifier-1 + ... + Premodifier-n + Head
Premodifier-2 + ... + Premodifier-n + Head
Premodifier-n + Head
Premodifiers + Head + Postmodifiers, of which all subsets with one or
more premodifiers or postmodifiers dropped off are given:
Head + Postmodifier-1 + ... + Postmodifier-n
Head + Postmodifier-1 + ... + Postmodifier-(n-1)
Head + Postmodifier-1
Head
NP's in Postmodifiers
Premodifiers include mainly adjectives, determiners, and non-head
nouns. Premodifiers can consist of coordinating phrases. Postmodifiers
accepted by NPtool are, on the other hand, prepositional phrases
that unambiguously postmodify a preceding nominal head, such as
preverbal NP-PP sequences. Currently, postmodifiers accepted by
NPtool do not include postmodifying non-finite clausal
constructions, such as participles or infinitives. In recent tests,
however, their omission has not substantially affected the recall
value of NPtool. Thus, in addition to the maximal length
NP:
exact form of the correct theory of quantum gravity
the following phrases are also extracted (initial determiners are
left out):
exact form of the correct theory
exact form
form
form of the correct theory of quantum gravity
form of the correct theory
correct theory of quantum gravity
correct theory
theory of quantum gravity
theory
quantum gravity
gravity
NPtool also supports many other ways of relaxing or extending the
pattern used to extract NP's. For example, the number of
postmodifiers can be limited. Extracting relatively long NP's can in
any case be useful, especially for (machine) translation, where the
translation of a complex NP consisting of a chain of NP's can differ
significantly from the translations of its constituent NP's.
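The enumeration illustrated above with "exact form of the correct theory of quantum gravity" can be sketched as follows. This is only an illustration, not NPtool's implementation: the function name sub_phrases, the segment representation (preposition, premodifiers, head) and the small determiner set are assumptions made for the example.

```python
# Hypothetical sketch: enumerate the sub-NPs of a chain of NP segments.
DETERMINERS = {"the", "a", "an"}  # assumed, deliberately minimal


def sub_phrases(segments):
    """Each segment is (preposition, premodifiers, head).

    The preposition of the first segment is None.
    """
    phrases = []
    for i, (_, premods, head) in enumerate(segments):
        # Initial determiners are left out of the extracted phrase.
        k = 0
        while k < len(premods) and premods[k] in DETERMINERS:
            k += 1
        content = premods[k:]
        # Drop zero or more premodifiers from the left.
        for start in range(len(content) + 1):
            # Keep all trailing postmodifiers first, then fewer and fewer.
            for end in range(len(segments) - 1, i - 1, -1):
                words = content[start:] + [head]
                for prep, inner_premods, inner_head in segments[i + 1:end + 1]:
                    words += [prep] + inner_premods + [inner_head]
                phrases.append(" ".join(words))
    return phrases


chain = [
    (None, ["exact"], "form"),
    ("of", ["the", "correct"], "theory"),
    ("of", ["quantum"], "gravity"),
]
for phrase in sub_phrases(chain):
    print(phrase)
```

For this chain the sketch yields the maximal NP plus the eleven shorter phrases listed above. Limiting the number of postmodifiers would amount to capping the range of the inner loop.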
characteristic
particular
sheer
mere
kind of
notion of
so-called
some
one
These words and phrases are often used as 'fillers' in text to
soften an expression. Removing them decreases the number of listed
NP's by 1-5%, depending on the domain and type of text, and the
contents of the list, which certainly lessens the subsequent manual
work. The contents of this 'stop list' can be revised according to the
characteristics of the domain and type of text.
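A stop list of this kind could be applied roughly as sketched below. The text does not say whether the filler is stripped from a phrase or the whole phrase is discarded; the sketch assumes that leading fillers are stripped and that the resulting duplicates are merged, which is one way the list of NP's could shrink. The function names are hypothetical.

```python
# Hypothetical sketch: strip leading filler expressions, then merge duplicates.
STOP_EXPRESSIONS = [
    "characteristic", "particular", "sheer", "mere",
    "kind of", "notion of", "so-called", "some", "one",
]
STOP_WORD_LISTS = [expr.split() for expr in STOP_EXPRESSIONS]


def strip_fillers(phrase):
    """Remove stop expressions from the start of a phrase."""
    words = phrase.split()
    changed = True
    while changed:
        changed = False
        for stop in STOP_WORD_LISTS:
            n = len(stop)
            # Never strip the phrase down to nothing.
            if len(words) > n and words[:n] == stop:
                words = words[n:]
                changed = True
    return " ".join(words)


def filter_candidates(phrases):
    """Apply strip_fillers and keep the first occurrence of each result."""
    seen, kept = set(), []
    for phrase in phrases:
        stripped = strip_fillers(phrase)
        if stripped not in seen:
            seen.add(stripped)
            kept.append(stripped)
    return kept


print(filter_candidates(["mere form", "form", "kind of theory"]))
```

Revising the stop list for a new domain then only means editing STOP_EXPRESSIONS.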
===
ok: 9 principle
ok: 6 uncertainty principle
ok: 3 uncertainty principle of quantum mechanics
ok: 3 principle of quantum mechanics
===
ok: 5 mechanics
ok: 5 quantum mechanics
ok: 3 uncertainty principle of quantum mechanics
ok: 3 principle of quantum mechanics
===
As can be observed, the head mechanics never occurs without the
nominal premodifier quantum. Based on this evidence it could be
concluded that their combination quantum mechanics is the actual
term, at least in the text in question. The same conclusion could
also be drawn for uncertainty principle.
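The reasoning above can be made explicit by comparing the frequency of a bare head with the frequency of each two-word phrase built on it. The sketch below uses the counts from the listing; the function name premodifier_shares and the restriction to two-word phrases are assumptions made for illustration.

```python
# Counts taken from the listing above.
counts = {
    "principle": 9,
    "uncertainty principle": 6,
    "uncertainty principle of quantum mechanics": 3,
    "principle of quantum mechanics": 3,
    "mechanics": 5,
    "quantum mechanics": 5,
}


def premodifier_shares(counts, head):
    """Share of the head's occurrences covered by each 'premodifier head' phrase."""
    total = counts[head]
    shares = {}
    for phrase, n in counts.items():
        words = phrase.split()
        if len(words) == 2 and words[1] == head:
            shares[phrase] = n / total
    return shares


print(premodifier_shares(counts, "mechanics"))
print(premodifier_shares(counts, "principle"))
```

A share of 1.0 means the head never occurs without that premodifier, as with quantum mechanics; where to set a cut-off for lower shares is left to the terminologist.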
uncertainty principle of quantum mechanics
quantum mechanics uncertainty principle
The normalization procedure works by moving a postmodifying
prepositional phrase to a position between the premodifying
adjectives and possessive nouns on the one side, and the modified
head with its nominal premodifiers other than possessive nouns (all
bracketed in the examples) on the other. In NP's with multiple heads,
this procedure is performed recursively. Thus, with longer phrases
such as:
exact form of the correct theory of quantum gravity
present expansion phase of the universe
god's choice of initial condition
the following
transformation takes place:
exact correct quantum gravity theory form
present universe expansion phase
god's initial condition choice
The procedure is visualized in Figure 2. The same figure also shows
the flattening of syntactic structure. Individual evaluated texts,
however, appear quite consistent in their NP formation patterns. If
dual surface forms of a term exist in a text, they can easily be
detected by grouping the NP's according to their main grammatical
heads.
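The normalization procedure described above can be sketched as a short recursive function. The NP class and its field names are hypothetical, and the sketch assumes that the preposition and determiner of the postmodifying phrase are dropped, as in the examples.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NP:
    """Hypothetical NP representation used only for this sketch."""
    head: str
    adjectives: List[str] = field(default_factory=list)  # adjectives and possessive nouns
    nouns: List[str] = field(default_factory=list)       # nominal premodifiers
    post: Optional["NP"] = None  # NP inside the postmodifying prepositional phrase


def normalize(np: NP) -> List[str]:
    """Place the normalized postmodifier between the adjectives and the nouns."""
    inner = normalize(np.post) if np.post is not None else []
    return np.adjectives + inner + np.nouns + [np.head]


phrase = NP(
    "form",
    adjectives=["exact"],
    post=NP("theory", adjectives=["correct"], post=NP("gravity", nouns=["quantum"])),
)
print(" ".join(normalize(phrase)))
```

Because the postmodifier is normalized before it is inserted, chains of any depth are flattened in a single pass.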
As to the proportion of actual terms in the candidate phrases, the truth - or termness in this case - ultimately depends on the judgement of the terminologist or other end user. What is considered a good term depends greatly on the use to which the terms or phrases are put. The classical definition of a term as a universally established concept is certainly a stricter criterion than the need in computer-aided translation for simply spotting recurrent phrases with a special meaning for some organization. Table 1 gives the results of one evaluation of candidate terms extracted with the enhanced version of NPtool from a text dealing with translation. As can be seen, a clear minority of candidate terms were eventually chosen as actual terms. Dividing the candidate terms between those with a single occurrence and those with multiple occurrences, however, gives somewhat different results. For candidate terms with multiple occurrences, the proportion of those judged correct choices is much higher than for all candidate terms. For candidate terms with a single occurrence, on the other hand, the situation is quite the opposite. Frequency would thus seem to correlate positively with termness, which is in agreement with the findings of Justeson and Katz [2].
Table 1: Evaluation of Candidate Terms Extracted from a Text on Translation (12225 words)
Statistics                           Number   Terms   Question marks   Non-terms
Candidate Phrases                    2656     21%     57%              22%
Candidate Phrases: frequency = 1     1972     17%     44%              50%
Candidate Phrases: frequency >= 2    684      36%     50%              14%
Table 2: Surface Syntactic Patterns of Terms (Sample of 558 term types representing 34 different patterns, approximately half of which have only a single representative term)
Pattern      Percentage of All Term Types   Cumulative Recall   Precision
A N          27%                            27%                 29%
N N          17%                            44%                 40%
N            16%                            60%                 16%
N P N        11%                            71%                 29%
A N N        8%                             79%                 42%
N P A N      5%                             84%                 31%
N P N N      4%                             88%                 33%
A N P N      3%                             91%                 24%
I N          1%                             92%                 15%
A A N        1%                             93%                 21%
N P N P N    1%                             94%                 26%
N N N        1%                             95%                 8%
Others       5%                             100%                -
N = noun, A = adjective, P = preposition, I = ing-form of a verb
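The cumulative recall column of Table 2 is simply the running sum of the per-pattern percentages, taken in the order in which the patterns are listed. A minimal sketch, with the figures copied from the table and a hypothetical function name:

```python
from itertools import accumulate

# (pattern, percentage of all term types), as listed in Table 2.
TABLE2_SHARES = [
    ("A N", 27), ("N N", 17), ("N", 16), ("N P N", 11),
    ("A N N", 8), ("N P A N", 5), ("N P N N", 4), ("A N P N", 3),
    ("I N", 1), ("A A N", 1), ("N P N P N", 1), ("N N N", 1),
]


def cumulative_recall(shares):
    """Pair each pattern with the running total of the shares up to it."""
    totals = accumulate(share for _, share in shares)
    return [(pattern, total) for (pattern, _), total in zip(shares, totals)]


for pattern, total in cumulative_recall(TABLE2_SHARES):
    print(f"{pattern:<10} {total}%")
```

Read this way, the table shows how many patterns an extractor has to accept to reach a given recall, for instance the first four patterns for 71 percent.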
Table 2 did not include the frequencies of individual terms
representing a particular syntactic pattern. The results are somewhat
different if these are taken into account (Table 3). The precision
figures in Table 3 basically give the probability that a random NP
with a certain pattern is a term, if nothing is known about its
overall frequency in the text. One conclusion is that the shortest
terms, single nouns and two-word compounds, are used most, since they
represent just over 80 percent of all occurrences of terms. Single
nouns alone account for almost half of all term occurrences in the
text. While these single nouns have only a 33 percent probability of
being a term, two-noun compounds have a 73 percent probability of
being a term.
Table 3: Surface Syntactic Patterns of Terms (Sample of 3137 term tokens representing 34 different patterns, of which 14 have only a single occurrence)
Pattern      Percentage of Term Tokens   Cumulative Recall   Precision
N            45%                         45%                 34%
N N          19%                         64%                 73%
A N          17%                         81%                 52%
N P N        4%                          85%                 38%
N P N N      3%                          88%                 71%
A N N        3%                          91%                 54%
N P A N      2%                          93%                 54%
I N          1%                          94%                 38%
Others       6%                          100%                -
N = noun, A = adjective, P = preposition, I = ing-form of a verb
If high precision were the main objective, the patterns at the top of
the list would again be different (Table 4). As could be expected,
the longer phrases dominate this list, the single noun being as far
behind as position 20 of 34. The cumulative recall of the ten
patterns with the highest precision, however, is only 72 percent.
Moreover, the precision drops below 50 percent already after the
first two patterns.
Table 4: Surface Syntactic Patterns of Terms with Highest Precision (Sample of 558 term types representing 34 different patterns)
Pattern          Precision   Percentage of Term Types   Cumulative Recall
A I N            67%         0.7%                       0.7%
N P N P N P N    50%         0.2%                       0.9%
A N N            42%         7.7%                       8.6%
N N              40%         17.2%                      25.8%
N P N N          33%         3.8%                       29.6%
I N P N N        33%         0.2%                       29.8%
A A N P N        33%         0.2%                       30.0%
N P A N          31%         4.7%                       34.7%
A N              29%         26.7%                      61.4%
N P N            28%         10.6%                      72.0%
Others           -           28%                        100.0%
N = noun, A = adjective, P = preposition, I = ing-form of a verb
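The trade-off shown in Table 4 can be explored by asking how much recall remains when only patterns above a given precision are accepted. The sketch below copies the ten listed rows of the table; the function name and the idea of a fixed precision threshold are assumptions made for illustration.

```python
# (pattern, precision %, percentage of term types), as listed in Table 4.
TABLE4 = [
    ("A I N", 67, 0.7), ("N P N P N P N", 50, 0.2), ("A N N", 42, 7.7),
    ("N N", 40, 17.2), ("N P N N", 33, 3.8), ("I N P N N", 33, 0.2),
    ("A A N P N", 33, 0.2), ("N P A N", 31, 4.7), ("A N", 29, 26.7),
    ("N P N", 28, 10.6),
]


def recall_at_precision(rows, min_precision):
    """Sum the term-type shares of all patterns at or above the threshold."""
    total = sum(share for _, precision, share in rows if precision >= min_precision)
    return round(total, 1)  # the table reports one decimal


for threshold in (50, 40, 28):
    print(threshold, recall_at_precision(TABLE4, threshold))
```

With these figures, requiring at least 50 percent precision leaves under one percent of the term types, which illustrates why precision alone is a poor selection criterion here.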
[2] John Justeson and Slava Katz. 1993. Technical terminology: some linguistic properties and an algorithm for identification in text. Technical Report RC 18906, IBM Research Division.
[3] Ken Church and Ido Dagan. 1994. Termight: identifying and translating technical terminology. In: Fourth Conference on Applied Natural Language Processing, Stuttgart, Germany, October 13-15, 1994. Association for Computational Linguistics.