CROSS-LANGUAGE INFORMATION RETRIEVAL
ARIST CHAPTER
DOUGLAS W. OARD & ANNE R. DIEKEMA

INTRODUCTION

	This chapter reviews research and practice in Cross-Language
Information Retrieval (CLIR) that seeks to support the process of
finding documents written in one natural language (e.g., English or
Portuguese) with automated systems that can accept queries expressed
in other languages.  With the globalization of the economy and the
continued internationalization of the Internet, CLIR is becoming an
increasingly important capability that facilitates the effective
exchange of information.  For retrospective retrieval, CLIR allows
users to state queries in their native language and then retrieve
documents in any supported language.  This can simplify searching by
multilingual users and, if translation resources are limited, can
allow monolingual searchers to allocate those resources to the most
promising documents.  In selective dissemination applications, CLIR
allows monolingual users to specify a profile using words from one
language and then use that profile to identify promising documents in
many languages.  Adaptive filtering systems that seek to learn
profiles automatically can use CLIR to process training documents that
may not be in the same language as the documents that later must be
selected.

This review uses the term "documents" fairly broadly, since CLIR can
be applied to a variety of modalities including character coded text,
scanned images of printed pages, and recordings of human speech.
Similarly, supporting the process of finding documents should be
construed broadly as well, including both fully automated functions
and capabilities that support productive human-system interaction.
CLIR also appears in the literature as multilingual information
retrieval (HULL & GREFENSTETTE), and as translingual information
retrieval (CARBONELL ET AL.), but all work conforming to the
definition stated above is described in this chapter as CLIR for
consistency.

The first reported work on CLIR was the development of the
International Road Research Documentation system that used a
controlled vocabulary thesaurus with aligned indexing terms in
English, French and German (PIGUR).  PEVZNER (1969, 1972) also
implemented a Boolean exact match text retrieval system, translating a
Russian thesaurus into English.  SALTON (1970, 1973) conducted some
smaller studies, augmenting the SMART system with hand-constructed
bilingual term lists.  By the mid-1970's it had been established that
systems built using these techniques could achieve performance across
languages on a par with their within-language performance.  Commercial
acceptance soon followed, and by 1977 ILJON was able to identify four
multilingual text retrieval systems operating in Europe.
Standardization quickly emerged as an important issue.  In 1978 the
International Organization for Standardization formally adopted ISO Standard
5964 on the construction of multilingual thesauri (INTERNATIONAL
ORGANIZATION FOR STANDARDIZATION), and that standard has remained
unmodified since 1985.

	Multilingual thesauri do not, however, completely solve the
CLIR problem.  DUBOIS identified three factors that motivate the
search for other techniques: cost, currency and usability.  First,
indexing and maintenance costs limit the scalability of
thesaurus-based systems, although some automated tools are able to
assist with these tasks.  Second, thesauri in production applications
often lag somewhat behind the current use of terminology because new
words enter human languages each year.  But perhaps the most serious
limitation of thesaurus-based techniques is that untrained users seem
to have difficulty exploiting their capabilities.  Searching free text
is the obvious alternative to use of a controlled vocabulary, and
LANDAUER & LITTMAN (1990, 1991) were the first to explore the potential
for free text CLIR.  Extending an automatic technique for reducing the
effect of vocabulary differences on retrieval effectiveness, they
sought to partially overcome the systematic vocabulary differences
that result from choosing a different language. RADWAN & FLUHR began
work in 1991 on an alternative technique that was based on translating
the queries using manually encoded translation knowledge. Although
much progress has been made since that time, these two early
explorations of broad-coverage free text CLIR defined the two dominant
themes that still guide research and practice: corpus-based and
knowledge-based approaches.

Scope

	This review brings together historical and contemporary
research on automated techniques for cross-language retrieval of
written and spoken text, both for retrospective retrieval and for
selective dissemination.  The review does not cover gestural languages
such as American Sign Language, nor does it address
language-independent techniques for recommending documents based
either on ratings assigned by other users or on hypertext links.  This
is the first ARIST review of CLIR, but ERES previously reviewed
international information transfer and METOYER-DURAN reviewed work on
transfer of information across language barriers in a domestic
context.  Other surveys have addressed CLIR with more limited scope.
PIGUR described early work on CLIR, with particular emphasis on
developments in the former Soviet Union.  FLUHR provided an overview
of modern approaches, and OARD (1997b) provided a more recent
overview.  JONES & JAMES reviewed the field with particular attention
to cross-language speech retrieval, and OARD & DORR (1996) produced
the most extensive survey to date.
	
Organization

	The review begins with an examination of the literature on
user needs for CLIR.  The main part of the chapter then follows the
retrieval system model shown in Figure 1, adapted from OARD (1997c).
Each section highlights the unique requirements imposed on one or more
stages of that model in cross-language retrieval applications.  The
matching stage is covered in somewhat greater detail, reflecting
the treatment in the literature.  Evaluation techniques are then
described, and the review concludes with some observations regarding
future research directions.

 
Figure 1. Information retrieval system model.


USER NEEDS

	MEADOWS cited a number of studies that together suggest that
in the early 1970's about half of the world's scientific literature
was published in English.  WELLISCH observed that English-language
secondary sources (e.g., indexing and abstracting services) add to
this total.  Several massive translation efforts also contribute to
the availability of information in English.  World Translations Index,
for example, lists 269 journals for which cover-to-cover translations
(mostly into English) of every issue are prepared, and hundreds more
that are selectively translated on a regular basis (INTERNATIONAL
TRANSLATIONS CENTRE).  STUDEMAN reported that the U.S.  Foreign
Broadcast Information Service translated 200 million words in a single
year from over 3,500 publications in 55 languages.  So English does
serve, to some degree, as what WELLISCH called the "lingua franca of
information retrieval tools."

	Despite these efforts, much of the world's information is not
available in English.  HUTCHENS ET AL. found that about a third of the
researchers at the University of Sheffield (United Kingdom) suspected
that they had failed to learn of relevant work in a non-English
speaking country.  It turned out that a similar proportion had in fact
discovered foreign language work that would have been more useful had
it turned up earlier.  MEADOWS found corroborating evidence for this
problem on a larger scale, noting that researchers writing in English
tend to over-cite other work in English and to under-use foreign
language work, when compared to the linguistic distribution of
scholarly writing in their field.

Discovering documents in a foreign language is, of course, only part
of the problem.  GOLDSTEIN found that between 20% and 45% of
electrical engineers in Mexico encountered documents in unfamiliar
languages at least once each month, and WOOD (1967) and ELLEN obtained
similar results for a broad range of disciplines in the United
Kingdom.  WOOD (1974) offered some insight into the assistance that
may be needed, reporting that over half the researchers requesting
full-text translations from the British Library felt that summary
translations of the results along with translations of figure and
table captions were sufficient to provide the information that they
required.

The recent growth of the global Internet has focused increased
attention on the need for information exchange across linguistic
barriers.  A 1997 study by the INTERNET SOCIETY & ALIS TECHNOLOGIES,
for example, found that 12% of World Wide Web pages that were randomly
selected contained material in one of fourteen languages other than
English.  With upwards of 100 million web pages already indexed by the
largest web search engines, this translates into an enormous potential
demand for CLIR services.  Projections during periods of exponential
growth are always subject to question, but PIONEER CONSULTING
estimates that by the year 2002 electronic collaborations will produce
over 500 million messages per day that cross national borders.
 
DOCUMENT PREPROCESSING

Documents exist in many modalities, including character-coded text,
printed pages and recorded speech.  Each modality can, in turn, have
several alternate representations.  Character coded text can be
expressed using different character sets, and a single character set
may have alternative encodings.  Some encoding schemes include
alternate representations for the same character, and popular usage
can introduce similar complications (e.g., accents on upper case
Spanish characters are typically present in some countries but omitted
in others).  Similarly, printed pages may be available digitally in a
number of formats, including page description languages or page
images.  For recorded speech, the speech rate can vary, a variety of
accents may be present, and technical characteristics such as encoding
and compression schemes can affect the fidelity of a recording.  One
goal of document preprocessing is to reduce this range of possible
representations to a consistent character-coded text representation
for each language that is present in a document.

Before such a representation can be constructed, the languages present
in a document must be identified.  This may be known a priori, it may
be coded using a markup convention, or it may need to be determined
from the contents of the document.  GREFENSTETTE compared two
automatic language identification techniques for ten Western European
languages.  A technique based on the observed predominance of
three-letter sequences (character trigrams) performed well, correctly
classifying more than 93% of the evaluation sentences that contained
at least six words and at least 99.8% of those that contained sixteen
or more words.  KIKUI integrated similar techniques with automatic
character set detection for World Wide Web documents.  ZISSMAN has
shown that fairly accurate automatic spoken language identification is
also possible, correctly classifying 89% of all 45-second speech
samples as one of eleven languages.  LEE ET AL. achieved correct
language identification among six languages in 95% of scanned page
images, a task complicated somewhat by the need for
language-independent skew detection and by the variety of character
fonts that might be used for each language.
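
The character trigram idea can be illustrated with a minimal sketch.
This is not GREFENSTETTE's implementation: the scoring rule (summed log
relative frequencies with a fixed floor for unseen trigrams), the
function names, and the toy training text are all assumptions made for
illustration, and a usable system would be trained on far larger
samples for each language.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return overlapping three-character sequences from lowercased text."""
    # Collapse whitespace and pad both ends so word edges form trigrams too.
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train_profiles(samples):
    """Build a relative-frequency trigram profile for each language."""
    profiles = {}
    for language, text in samples.items():
        counts = Counter(char_trigrams(text))
        total = sum(counts.values())
        profiles[language] = {gram: n / total for gram, n in counts.items()}
    return profiles


def identify_language(text, profiles, floor=1e-6):
    """Pick the language whose profile best explains the text's trigrams.

    `floor` is an assumed probability for trigrams never seen in training.
    """
    grams = char_trigrams(text)

    def score(profile):
        return sum(math.log(profile.get(gram, floor)) for gram in grams)

    return max(profiles, key=lambda language: score(profiles[language]))


# Hypothetical toy training text, far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the cat sleeps in the house",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso "
               "y luego el gato duerme en la casa",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft die katze im haus",
}
PROFILES = train_profiles(SAMPLES)

print(identify_language("the dog sleeps in the house", PROFILES))
```

Longer inputs contribute more trigrams to the score, which is
consistent with the reported improvement in accuracy as sentence length
grows.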

Once the modality, language and encoding of each document are known,
indexing features must be identified.  In English, the most common
indexing features for character coded text are word stems formed by
automatic suffix removal.  The utility of automatic suffix removal
algorithms varies by language, with some languages exhibiting more
productive morphology than English (and thus realizing a greater
benefit from suffix removal) and other languages lacking any
morphological variation at all.  Construction of a sophisticated
stemmer for a new language might require considerable effort, but
BUCKLEY ET AL. (1994) developed a useful Spanish stemmer in less than
one day by manually identifying common suffix patterns.  Short phrases
that are detected by word cooccurrence, syntactic parsing, or
dictionary lookup are sometimes indexed as well.  Languages that
permit fairly free construction of compositional compounds (e.g.,
German) make phrase detection straightforward, but compound splitting
is then needed to identify constituent words within a compound term.
For example, the German compound Kraftwagenfuehrerschein consists of
Kraftwagen (motor vehicle) and Fuehrerschein (driver's license).
WECHSLER ET AL. described a fairly effective compound splitting
technique based on longest substring matching that requires only a
list of terms for the language in question.  Users might search for an
entire phrase or compound, or they may search only for one of the
constituent terms, so it is common to use both the phrase (or
compound) and its constituent terms as indexing features.  Some
languages lack explicit word boundaries altogether in their written
form (e.g., Chinese), introducing an extreme version of the compound
splitting problem known as segmentation.  GUO summarized prior work on
Chinese segmentation and compared a number of techniques based on
longest substring matching.  WILKENSON has reported, however, that
overlapping two-character sequences (character bigrams) provided
indexing features for Chinese that were about as effective as those
discovered using dictionary-based longest substring matching.
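
Both kinds of indexing feature discussed above can be sketched in a few
lines.  The splitter below is a simple longest-match-first procedure
with backtracking over a term list.  It is offered as an illustration
of the general idea rather than as the technique of WECHSLER ET AL.,
and the term list, the minimum constituent length, and the function
names are assumptions.

```python
def split_compound(word, lexicon, min_len=3):
    """Split `word` into known terms, trying the longest prefix first.

    Returns the list of constituents, or None if no complete split exists.
    `min_len` (an assumed parameter) rejects very short constituents.
    """
    word = word.lower()
    if not word:
        return []
    # Try the longest candidate prefix first, backing off to shorter ones.
    for end in range(len(word), min_len - 1, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


def indexing_features(word, lexicon):
    """Index the compound itself together with its constituent terms."""
    whole = word.lower()
    parts = split_compound(word, lexicon) or []
    return [whole] + [part for part in parts if part != whole]


def char_bigrams(text):
    """Overlapping two-character sequences, usable without any term list."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical term list for the example compound.
LEXICON = {"kraft", "wagen", "kraftwagen", "fuehrer", "schein", "fuehrerschein"}

print(indexing_features("Kraftwagenfuehrerschein", LEXICON))
print(char_bigrams("信息检索"))
```

Preferring the longest match keeps Kraftwagen intact instead of
splitting it further into Kraft and Wagen.  The bigram function needs
no linguistic resources at all, which is what makes it attractive for
unsegmented text.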

Character recognition errors make the accurate identification of
indexing features even more challenging when processing scanned page
images.  Optical character recognition systems typically depend on
extensive training using manually assembled examples, and accurate
systems are presently available for only a limited set of languages.
Furthermore, recognition accuracy degrades rapidly when presented with
poor reproductions or handwritten manuscripts.  SMEATON & SPITZ sought
to minimize these limitations while enhancing indexing speed by
constructing indexing features for English words using encoded
character shape groups (e.g., "b" and "h" might be assigned the same
shape code) rather than individual character codes, and COOPER applied
a similar technique to Thai.  Retrieval effectiveness suffered
significantly in English when compared to character coded text, but
less of an adverse effect was apparent in Thai.

Recorded speech poses an even greater challenge, both with respect to
speed and accuracy.  Speech typically lacks explicit boundary markers
between words, so the problem resembles the segmentation problem in
languages such as Chinese, and variations in pronunciation, speaking
rate, and the fidelity of the recording make speech recognition vastly
more complex.  Speech recognition systems trained on manually prepared
time-synchronized transcripts can produce useful indexing features,
but the needed training material is available for only a limited set
of languages, recognition accuracy degrades when presented with
applications for which the training material was not representative,
and present processing speeds limit the size of the collections that
can be indexed.  SHERIDAN ET AL. (1997) used overlapping three-phoneme
sequences (phone trigrams) as indexing features for recorded German
speech in an attempt to overcome these limitations.  Their initial
results were disappointing, but NG & ZUE have found that phone
trigrams can offer a viable alternative to word-based indexing for
spoken documents in English.

QUERY FORMULATION

	TAYLOR observed that users must compromise their information
needs to match the perceived capabilities of available information
systems when creating queries.  Information retrieval systems seek to
support this process by providing facilities for query specification
and through incorporation of query refinement techniques such as
relevance feedback.  Users with little exposure to controlled
vocabulary searching, for example, often find that formulation of
effective queries using a printed thesaurus is difficult.  Such users
might benefit from a query interface that depicts the available
indexing terms and their relationships in their preferred language.
LI ET AL. developed such a system for English and Japanese, using
versions of the INSPEC thesaurus in each language.  Although no user
study results were reported for multilingual applications, SMITH &
POLLITT performed a qualitative assessment of a monolingual version of
the same system.

The fully automatic query translation techniques described in the next
section can be viewed as one type of support for query formulation in
free text CLIR systems, but more interactive approaches have also been
implemented.  The QUILT system described by DAVIS & OGDEN (1997), for
example, optionally displayed the Spanish translation of English query
terms.  A user who is able to read Spanish might thus be able to
recognize erroneous translations, even if they lacked the fluency
necessary to form effective queries without assistance.  If so, they
could then switch to a monolingual mode and enter the correct Spanish
terms.  YAMABANA ET AL. implemented a more sophisticated approach in
which candidate translations of each term were displayed immediately,
along with retranslations of each candidate back into the query
language.  Users unable to read the candidate translations could
quickly skim the retranslations, and an alternate candidate could be
chosen if necessary.  An Internet demonstration of the READWARE system
from Management Information Technologies, Inc. illustrated a further
extension of this approach that could accommodate several languages
simultaneously.  READWARE depicted known senses of every query term
using one near-synonym in the query language for each sense and
allowed the user to designate the intended senses.  For each selected
word sense, a set of near-synonyms in English, German and French was
passed on to the matching stage as a multilingual query.  Together
these three approaches illustrate a range of alternative ways of
balancing capability with interface complexity.

MATCHING

 
Figure 2. Matching strategies

Matching Strategies	

Broadly stated, information retrieval systems construct
representations of the documents and the information need and then
match those representations to identify documents that are most likely
to satisfy the need.  In what MALONE ET AL. called "content-based"
techniques, the representations are constructed from terms (e.g.,
stems, words, phrases, or character n-grams) that appear in the
documents and the queries.  Techniques for matching representations
constructed from different vocabularies thus form a central component
of CLIR systems.  FURNAS ET AL. observed that information retrieval
systems suffer from a vocabulary problem that results in part from
variability in word usage.  CLIR is simply an extreme case of this
problem in which the words are selected from nearly disjoint
vocabularies.  Four general approaches to cross-language matching have
emerged in CLIR: cognate matching, query translation, document
translation, and interlingual techniques.

	Cognate matching.  Cognate matching essentially automates the
process by which readers might try to guess the meaning of an
unfamiliar term based on similarities in spelling or pronunciation.  A
simple version of cognate matching in which untranslatable terms are
retained unchanged is often used in CLIR systems to match proper nouns
and technical terminology (BALLESTEROS & CROFT, 1997; GEY & CHEN; DAVIS
& OGDEN, 1998; ELKATEB & FLUHR; HULL & GREFENSTETTE; KRAAIJ &
HIEMSTRA).  DAVIS extended this technique using fuzzy matching to
discover Spanish cognates for English words that did not appear in a
bilingual dictionary.  BUCKLEY ET AL. (1998) applied a more
sophisticated approach, creating equivalence classes for letter
sequences with similar sounds (e.g., "c," "k," and "qu" share an
equivalence class).  Since the translation knowledge is embedded
directly in the matching scheme, cognate matching can be used in
isolation.  Most often, however, cognate matching is combined with
other cross-language matching approaches.
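
A rough sketch may help to make equivalence-class cognate matching
concrete.  Only the "c," "k," and "qu" class is taken from the
description above.  The other classes, the similarity threshold, and
the use of Python's difflib for fuzzy comparison are assumptions
chosen for illustration and should not be read as the method of
BUCKLEY ET AL. or of DAVIS.

```python
import difflib

# Letter sequences in one class are rewritten to a shared code.  Only the
# first class comes from the text; the others are hypothetical.  Within a
# class, longer sequences are listed first so that "qu" wins over "q...".
EQUIVALENCE_CLASSES = [
    (("qu", "c", "k"), "K"),
    (("ph", "f"), "F"),
    (("y", "i"), "I"),
]


def sound_key(word):
    """Rewrite a word so that similar-sounding sequences share one code."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        for sequences, code in EQUIVALENCE_CLASSES:
            match = next((s for s in sequences if word.startswith(s, i)), None)
            if match:
                key.append(code)
                i += len(match)
                break
        else:
            # No class applies at this position: keep the letter as it is.
            key.append(word[i])
            i += 1
    return "".join(key)


def is_cognate(word_a, word_b, threshold=0.8):
    """Fuzzy comparison of sound keys; `threshold` is an assumed cutoff."""
    ratio = difflib.SequenceMatcher(None, sound_key(word_a), sound_key(word_b)).ratio()
    return ratio >= threshold


print(sound_key("quiosco"), sound_key("kiosk"))
print(is_cognate("kiosk", "quiosco"), is_cognate("kiosk", "perro"))
```

Because the correspondence rules live entirely in the key function, no
bilingual dictionary is consulted, which is why cognate matching can be
used in isolation.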

	Query translation. Query translation is a more general
strategy in which the query (or some internal representation of the
query) is automatically converted into every supported language. Query
translation is relatively efficient and can be done on the fly.  The
principal limitation of query translation is that queries are often
short and short queries provide little context for disambiguation.
Homonymous words (those with more than one distinct meaning) produce
undesirable matches even in monolingual retrieval (KROVETZ & CROFT).
Translation ambiguity compounds this problem, potentially introducing
additional terms that are themselves homonymous.  For this reason,
controlling translation ambiguity is a central issue in the design of
effective query translation techniques.  Phrases typically exhibit
less translation ambiguity than single words, and the literature
suggests that phrase recognition strategies can substantially improve
retrieval effectiveness.  BALLESTEROS & CROFT (1997) observed
beneficial effects from manual translation of phrases identified
through syntactic analysis, and both RADWAN & FLUHR and KRAAIJ &
HIEMSTRA explored techniques for automatically choosing an appropriate
word order for phrases in which the constituent words had been
translated separately.  HULL & GREFENSTETTE investigated the effect of
noncompositional phrases that cannot be reconstructed from
translations of the constituent terms and found an additional benefit.

Document translation. Document translation is just the opposite of
query translation, automatically converting all of the documents (or
their representations) into each supported query language.  Documents
typically provide more context than queries, so more effective
strategies to limit the effect of translation ambiguity may be
possible.  Another potential advantage is that selected documents can
be presented to the user for examination without on-demand translation
(KRAAIJ).  On the other hand, massive translation can be an expensive
undertaking, and the costs are even greater if several query languages
must be supported.  As a result, relatively few experiments have
compared document translation with query translation (OARD ET AL.),
and ERBACH ET AL. suggested using document translation only for small
collections in limited domains.

Interlingual techniques.  Interlingual techniques convert both the
query and the documents into a unified language-independent
representation.  Controlled vocabulary techniques based on
multilingual thesauri are the most common examples of this approach.
Because each controlled vocabulary term typically corresponds to
exactly one concept, terms from any language may be used to index
documents or to form queries.  HLAVA ET AL. described a technique for
partially automating the assignment of indexing terms to documents in
several languages.  Some fully automated interlingual techniques have
also been implemented.  Latent semantic indexing (LANDAUER & LITTMAN,
1990, 1991; DUMAIS ET AL.; BERRY & YOUNG; REHDER ET AL.) and the
generalized vector space model (CARBONELL ET AL.) both use a document
aligned training corpus to learn a mapping from one or more languages
into a language-neutral representation.  Document and query
representations from either language can be mapped into this space,
allowing similarity measures to be computed both within and across
languages.


Figure 3. Sources of translation knowledge


Sources of Translation Knowledge

	Each of the four matching approaches to CLIR depends on some
form of translation knowledge.  That knowledge may be encoded manually
or extracted automatically from corpora, and CLIR techniques may
exploit translation knowledge in more than one form.  The literature
typically refers to techniques using manually encoded translation
knowledge as knowledge-based approaches.  Techniques using
translation knowledge from corpora are
referred to as corpus-based techniques. The correspondence rules used
for cognate matching represent one form of manually encoded
translation knowledge.  Three other manually encoded sources of
translation knowledge have been applied to CLIR: ontologies, machine
translation lexicons, and bilingual dictionaries.  Three types of
corpora have also been used: document-aligned corpora, sentence and
term aligned corpora, and unaligned corpora.  This section considers
each of the six sources of translation knowledge in turn.

Ontologies.  Ontologies are structures that encode domain knowledge by
specifying relationships between concepts.  Thesauri are ontologies
that are designed specifically to support information retrieval.  At
present multilingual thesauri are the dominant sources of translation
knowledge in operational CLIR systems.  Thesauri can support both
controlled vocabulary and free-text retrieval, providing insight into
hierarchical relationships (broader terms, narrower terms),
synonymy, and more general associations (related terms).  Such
relationships can help experienced users define better queries by
enhancing their understanding of the structure of knowledge for the
topic being searched.  The European Parliament's multilingual EUROVOC
thesaurus is one example of a multilingual thesaurus.  A common
approach to creating a multilingual thesaurus is to translate an
existing monolingual thesaurus, and KALACHKINA provides algorithms to
deal with terms that lack direct translations.  SOERGEL, however,
cautions against merely translating an existing thesaurus since the
expression of concepts in the original language will then dominate the
conceptual structure.  General-purpose ontologies such as WordNet
(MILLER) are emerging as alternatives to traditional thesauri because
their broader coverage permits use of sophisticated knowledge
structures in broader domains than has heretofore been possible.  By
encoding additional relationships such as "part-whole" and "kind-of,"
WordNet explicitly captures a broader range of structural knowledge
than traditional thesauri.  The EuroWordNet project is developing a
multilingual ontology resembling WordNet with components in Dutch,
English, Italian and Spanish that are linked by an "interlingual
index."  CLIR support is a specific design goal of the project, and
GILARRANZ ET AL. (1997a, 1997b) have described how EuroWordNet might
be used to support a query translation strategy.  Other projects
(e.g., GermaNet described by HAMP & FELDWEG) are extending these ideas
to other languages.

Bilingual dictionaries.  Machine-readable bilingual dictionaries have
been widely used to support query translation strategies (BALLESTEROS
& CROFT, 1997; GEY & CHEN; DAVIS; DAVIS & OGDEN, 1998; FLUHR ET AL.;
HULL & GREFENSTETTE; KRAAIJ & HIEMSTRA; KWOK; NGUYEN ET AL.; YAMABANA
ET AL.).  Bilingual dictionaries are typically designed for human use,
so translations of individual terms are often augmented with examples
showing how those terms could be used in context.  It would be
difficult to extract generalizations from those examples that could be
used automatically, so machine readable dictionaries are typically
processed manually or automatically to reduce them to a bilingual term
list, perhaps with additional information such as part-of-speech.  In
essence, dictionary-based translation consists of looking up each
query term in the resulting bilingual term list and selecting the
appropriate translation equivalents.  The simplest way of using such a
bilingual term list is to select every known translation for each
term, and that approach is often used as a baseline in
dictionary-based CLIR evaluations.  Both RADWAN & FLUHR and DAVIS have
shown that limiting the translations to those with the same
part-of-speech (e.g., noun or verb) can improve retrieval
effectiveness, and KRAAIJ & HIEMSTRA experimented with the use of
preferred translations that were noted in their dictionary.  OARD ET
AL.  demonstrated that arbitrarily choosing a single translation can
be just as good (by the average precision measure), apparently because
on balance as many queries are helped as are hurt.  HULL explored the
ability of structured queries to further limit translation ambiguity,
implementing a weighted Boolean matching strategy that exploited the
observation that correct translations are more likely to cooccur than
incorrect translations.  Dictionary-based CLIR can suffer from limited
dictionary coverage, inaccuracies during automatic construction of the
bilingual term list, and incorrect selection of the appropriate
translation equivalents (BALLESTEROS & CROFT, 1997; FLUHR ET AL.;
GAUSSIER ET AL.; HULL & GREFENSTETTE; NGUYEN ET AL.), but it is
sufficiently efficient and effective to be useful in many
applications.
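
The translation selection strategies described above can be compared
in a small sketch.  The English-to-Spanish term list, the
part-of-speech tags supplied with the query, and the function names are
hypothetical.  Taking the first listed entry stands in for an arbitrary
single choice, and untranslatable terms are retained unchanged, as in
the simple form of cognate matching.

```python
# Hypothetical bilingual term list: term -> [(translation, part_of_speech)].
TERM_LIST = {
    "bank": [("banco", "noun"), ("orilla", "noun"), ("depositar", "verb")],
    "light": [("luz", "noun"), ("ligero", "adj"), ("encender", "verb")],
    "loan": [("préstamo", "noun"), ("prestar", "verb")],
}


def translate_term(term, pos=None, first_only=False):
    """Look up one query term and select its translation equivalents."""
    entries = TERM_LIST.get(term.lower())
    if entries is None:
        # Not in the term list: keep the term unchanged (e.g., proper nouns).
        return [term]
    if pos is not None:
        same_pos = [entry for entry in entries if entry[1] == pos]
        # Fall back to every translation if no entry carries the given tag.
        entries = same_pos or entries
    translations = [translation for translation, _ in entries]
    return translations[:1] if first_only else translations


def translate_query(tagged_query, use_pos=False, first_only=False):
    """Translate a query given as (term, part_of_speech) pairs."""
    translated = []
    for term, pos in tagged_query:
        translated.extend(translate_term(term, pos if use_pos else None, first_only))
    return translated


QUERY = [("bank", "noun"), ("loan", "noun"), ("Sheffield", "noun")]

print(translate_query(QUERY))                   # every known translation
print(translate_query(QUERY, use_pos=True))     # same part of speech only
print(translate_query(QUERY, first_only=True))  # one translation per term
```

Even with part-of-speech filtering, "bank" keeps both "banco" and
"orilla," which shows why further disambiguation (for example, using
cooccurrence among candidate translations) remains of interest.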

Machine translation lexicons.  Machine translation systems are
becoming fairly widely available, although machine-readable
dictionaries still cover a greater number of language pairs (KRAAIJ).
Machine translation systems encode translation knowledge in a
"lexicon" that contains the information needed for automatic analysis,
translation and generation of natural language.  One goal of natural
language analysis is to disambiguate terms in ways that can limit
translation ambiguity, and the lexicon is often designed to provide
information that is useful for this purpose.  The most straightforward
way to apply a machine translation lexicon to CLIR is to simply use
the machine translation system to translate either the queries or the
document collection.  Queries are rarely provided as well formed
sentences, however, so the effectiveness of this approach may be
limited in query translation applications (HULL & GREFENSTETTE;
KRAAIJ).  Machine translation systems necessarily choose a single
preferred translation for each term, and ERBACH ET AL. have observed
that such a singular choice might adversely affect retrieval
effectiveness.  Examples of the use of machine translation for query
and document translation can be found in OARD & HACKETT.

	Document aligned corpora.  Document aligned corpora are
document collections in which useful relationships between sets of
documents in different languages are known.  Parallel corpora are made
up of translation equivalent sets, each containing a document and one
or more translations.  Comparable collections, on the other hand, are
typically separately authored but related by topical content.  Aligned
document sets in comparable corpora may contain one or more documents
in each language (PETERS & PICCHI; SHERIDAN & BALLERINI).  The basic
strategy for using document aligned corpora is to represent each term
using the pattern of aligned sets in which that term occurs and then
to construct language-neutral representations of documents in any
supported language using the resulting term representations.
Techniques from linear algebra are typically used to compute and
manipulate these term representations.  When the language of each
document is known, each term is typically tagged with a language
marker in order to avoid undesired conflation of different concepts in
other languages.  CARBONELL ET AL. implemented one such technique, the
Generalized Vector Space Model (GVSM), using a parallel corpus.
Latent Semantic Indexing (LSI) extends this approach by conflating
terms that have similar representations, often increasing recall
without adversely affecting precision.  Both parallel corpora
(LANDAUER & LITTMAN 1990, 1991; DUMAIS ET AL.) and comparable corpora
(REHDER ET AL.) have been used with LSI.  BERRY & YOUNG found that the
effectiveness of LSI could be improved by using an aligned corpus of
short passages rather than one formed from longer documents.  Although
LSI is sometimes more effective than GVSM, the term conflation step is
computationally intensive (CARBONELL ET AL.).
SHERIDAN & BALLERINI and MATEEV ET AL. investigated an alternative
approach, building a bilingual term list for query translation using
term representations computed from a comparable corpus of news stories
that was aligned using classification codes, publication dates and
cognates.  They found the terms in each language that were most
similar to each query term (using a vector similarity measure) and
then used several of the most similar terms as the translated query.
While LSI uses more sophisticated techniques to conflate similar
terms, SHERIDAN & BALLERINI's technique is more efficient.
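
The basic strategy described above can be illustrated with a minimal
sketch.  It represents each term by the pattern of aligned sets in
which it occurs and then, for a query term, picks the most similar
terms in the other language using cosine similarity.  The toy corpus,
the function names, and the choice of raw counts with cosine
similarity are assumptions made for illustration; they are not taken
from any of the systems cited here.

```python
import math
from collections import defaultdict

def term_vectors(aligned_sets, lang):
    """Map each term in `lang` to {aligned_set_index: occurrence count}."""
    vectors = defaultdict(lambda: defaultdict(int))
    for idx, aligned in enumerate(aligned_sets):
        for term in aligned[lang].split():
            vectors[term][idx] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(i, 0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def translate_term(term, source_vectors, target_vectors, k=2):
    """Return up to k target terms whose occurrence pattern is most similar."""
    if term not in source_vectors:
        return []
    scored = [(cosine(source_vectors[term], vec), t)
              for t, vec in target_vectors.items()]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for score, t in scored[:k] if score > 0]

# Invented toy corpus: each entry is one aligned set, with the text of
# each language concatenated into a single string.
ALIGNED = [
    {"en": "bank interest rate", "fr": "banque taux interet"},
    {"en": "river bank flood", "fr": "rive fleuve inondation"},
    {"en": "interest rate rise", "fr": "taux interet hausse"},
]
EN = term_vectors(ALIGNED, "en")
FR = term_vectors(ALIGNED, "fr")

print(translate_term("rate", EN, FR, k=2))
```

With so little data, "rate" is mapped to both "interet" and "taux"
because the two French terms occur in exactly the same aligned sets.
This is the kind of conflation that larger corpora and the dimension
reduction used in LSI are meant to handle more gracefully.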

Sentence and term aligned corpora.  Comparable corpora can be aligned
only to the document level, but many individual sentences in parallel
corpora can be aligned automatically using dynamic programming
techniques.  DAVIS used a sentence-aligned parallel corpus directly to
augment dictionary based query translation without substantial
improvement over a simpler dictionary-based technique.  OARD (1996,
1997a) used sentence alignments as a basis for aligning individual
terms, but again found that knowledge based techniques (in this case,
machine translation) were more effective when the corpus based
technique was required to extract translation knowledge from one
collection and then apply it to another.  In those experiments, a set
of sentence aligned translations of United Nations documents was used
as a source of translation knowledge, and a monolingual collection of
Spanish newswire articles was used for evaluation.  CARBONELL ET
AL. implemented a similar approach, evaluating retrieval effectiveness
on a portion of the same corpus from which translation knowledge had
been extracted.  Under those conditions, the sentence aligned corpus
that was used to produce term alignments outperformed every other
technique they tried.  OARD (1997a) saw a similar improvement when
comparing the same-collection performance of LSI with the performance
of the same algorithm when trained on a different collection.  It thus
appears that document and sentence aligned techniques may be most
useful when the needed alignments are known within some portion of the
same collection from which retrieval is desired.  Although such a
situation may exist in a few applications (e.g., if translations are
being made routinely, but they are not available immediately), this
factor is likely to somewhat circumscribe the utility of techniques
based on document and sentence aligned corpora.
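
As a hedged illustration of how term alignments might be derived from
sentence alignments, the sketch below scores term pairs by how often
they appear in aligned sentence pairs.  The Dice coefficient is used
here only because it is simple; it is a stand-in, not the alignment
method used in the experiments cited above, and the sentence pairs
are invented.

```python
from collections import Counter
from itertools import product

def dice_term_alignments(sentence_pairs):
    """Score (source_term, target_term) pairs with the Dice coefficient,
    counting each term at most once per sentence."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_terms, tgt_terms = set(src.split()), set(tgt.split())
        src_freq.update(src_terms)
        tgt_freq.update(tgt_terms)
        pair_freq.update(product(src_terms, tgt_terms))
    return {
        (s, t): 2.0 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }

def best_translation(term, scores):
    """Return the highest-scoring target term for `term`, or None."""
    candidates = [(score, t) for (s, t), score in scores.items() if s == term]
    if not candidates:
        return None
    # Highest score wins; ties are broken alphabetically.
    return min(candidates, key=lambda c: (-c[0], c[1]))[1]

# Invented sentence-aligned pairs (English, Spanish without accents).
PAIRS = [
    ("the council met", "el consejo se reunio"),
    ("the council voted", "el consejo voto"),
    ("the assembly voted", "la asamblea voto"),
    ("a council met", "un consejo se reunio"),
]
SCORES = dice_term_alignments(PAIRS)

print(best_translation("council", SCORES))
print(best_translation("voted", SCORES))
```

Scores learned this way reflect the vocabulary of the aligned corpus,
which is one way to see why applying them to a different collection
can be less effective.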

Unaligned corpora.  A representative monolingual document collection
is, of course, available in any application of CLIR to
retrospective retrieval.  Such collections are often assembled for
filtering applications as well because they provide useful collection
frequency statistics.  When representative documents in more than one
language are present in (or can be added to) such a collection, the
collection itself can be used in conjunction with a bilingual term
list as an additional source of translation knowledge even if a priori
document alignments are not known.  BALLESTEROS & CROFT (1997) applied
fully automatic passage-level pseudo-relevance feedback using the
query language portion of their unaligned corpus to refine the query
representation.  By augmenting the original query with terms appearing
in top-ranked passages, monolingual pseudo-relevance feedback often
improves recall without a significant adverse effect on precision.
They then applied dictionary based query translation to produce a
version of the query in the desired language, followed by fully
automatic passage-level pseudo-relevance feedback using the portion of
the unaligned corpus containing documents in that language.  When
applied individually, each pseudo-relevance feedback step improved
CLIR effectiveness, and the combination outperformed either step
alone.  KROVETZ & CROFT and SANDERSON have shown that ranked retrieval
techniques tend to reinforce the appropriate interpretation of words
that admit more than one interpretation.  Viewed in this light, the
first pseudo-relevance feedback step serves to limit the adverse
effect of translation ambiguity by including additional terms that are
related to the original query terms.  YAMABANA ET AL. sought to
achieve the same result more directly.  For each query term, they
identified one related term in the unaligned corpus that often
appeared in a sentence with the query term.  They then selected the
candidate translation of the query term that most often appeared in
the same sentence as some possible translation of the related term.
YAMABANA ET AL. obtained
some improvement in translation accuracy using this technique, but
they did not evaluate the effect of that improvement on retrieval
effectiveness.  PICCHI & PETERS have proposed a similar technique that
exploits more context by considering the possible translations of
groups of words surrounding each query term in the unaligned corpus.
Although techniques based on unaligned corpora appear promising,
SHERIDAN ET AL. (1997) failed to find any improvement when using
languages and collections different from those used by BALLESTEROS &
CROFT.  It thus appears that the nature of the unaligned corpus and/or
the way in which additional context-revealing terms are chosen can
substantially affect the results.
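
The two-step feedback strategy can be sketched as follows, under
strong simplifying assumptions: passages are ranked by a plain count
of query-term occurrences, expansion terms are the most frequent new
terms in the top-ranked passages, and the dictionary keeps every
known translation.  The passages, the dictionary, and all function
names are invented; none of this reproduces the system used by
BALLESTEROS & CROFT.

```python
from collections import Counter

def rank(query_terms, passages):
    """Rank passages by the number of query-term occurrences,
    dropping passages that match no query term."""
    scored = [(sum(p.split().count(t) for t in query_terms), i)
              for i, p in enumerate(passages)]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [passages[i] for score, i in ordered if score > 0]

def expand(query_terms, passages, top_passages=2, new_terms=2):
    """Pseudo-relevance feedback: add the most frequent new terms
    found in the top-ranked passages."""
    counts = Counter()
    for passage in rank(query_terms, passages)[:top_passages]:
        counts.update(t for t in passage.split() if t not in query_terms)
    ordered = sorted(counts.items(), key=lambda c: (-c[1], c[0]))
    return list(query_terms) + [t for t, _ in ordered[:new_terms]]

def translate(query_terms, dictionary):
    """Dictionary-based translation that keeps every known translation
    and silently drops terms missing from the dictionary."""
    return [t for term in query_terms for t in dictionary.get(term, [])]

def clir_query(query_terms, src_passages, tgt_passages, dictionary):
    pre = expand(query_terms, src_passages)    # feedback before translation
    translated = translate(pre, dictionary)
    return expand(translated, tgt_passages)    # feedback after translation

# Invented unaligned collection: English and Spanish passages.
EN_PASSAGES = ["bank loan interest", "bank loan credit", "river bank flood"]
ES_PASSAGES = ["banco prestamo interes hipoteca", "banco prestamo hipoteca",
               "orilla rio"]
DICTIONARY = {"bank": ["banco", "orilla"], "loan": ["prestamo"]}

print(clir_query(["loan"], EN_PASSAGES, ES_PASSAGES, DICTIONARY))
```

In this toy run the inappropriate translation "orilla" survives in
the final query, but the terms added after translation come from
passages about lending, so the expanded query still leans toward the
intended sense.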
	 
SELECTION, EXAMINATION AND DELIVERY

As MARCHIONINI has observed, searching and browsing are complementary
activities.  Automated systems apply rather simple techniques to
enormous volumes of information, while humans can effectively exploit
quite sophisticated selection heuristics on fairly small sets.  One
important goal of the user interface is to expose the information on
which users can base these decisions.  Retrieval systems containing
full text typically support two browsing strategies: selection of
documents from a list of promising candidates identified by the
system, and detailed examination of individual documents.

Support for selection presents unique challenges when the documents
are written in an unfamiliar language.  Monolingual selection
interfaces typically present document titles along with some
information about the source of the material and when it was produced.
Occasionally some form of summary, such as the first few lines or some
individual words automatically extracted from the document, is also
presented.  Conversion of names and dates using simple transliteration
schemes is relatively straightforward, but title translation is more
complex.  Translation of titles using fully automatic machine
translation is a possibility, but titles rarely form the sort of
well-formed linguistic expressions that typical machine translation
systems are optimized for.  KIKUI ET AL. reported that choosing the
most
common candidate translation (using a monolingual corpus) and then
reordering the terms using some simple rules produced usable
translations of English web page titles into Japanese.  RESNIK
evaluated an alternative strategy for translating brief listings into
English, displaying as many as three alternative translations when
faced with translation ambiguity.  Using a decision theoretic measure,
they found that such translations were more effective than a naive
Bayesian classifier, but not as effective as monolingual selection.

	Support for examination poses an even greater challenge.
Several companies market translation software that is compatible with
popular web browsers, and proxy translation servers are becoming
available on the Internet.  Typical machine translation systems are
not yet fast enough to keep up with interactive selection and
scrolling behavior, however, so interactive searching is inhibited to
some extent when query translation is used.  Approaches based on
advance translation of every document avoid this problem, but the time
and expense involved limit application of those techniques.  Rapid
word-by-word translation like that explored by KIKUI ET AL. and RESNIK
could in principle be used with query translation, but the utility of
such techniques for examining relatively long documents in a CLIR
system has not yet been explored.  Traditional abstracting services
such as INSPEC have adopted a more parsimonious approach, manually
preparing abstracts for every document in the supported query language
(usually English) regardless of the abstracted documents' language.
FRANZEN & KARLGREN (1997) proposed automating this process by
translating brief extracts or summaries as an alternative to
translating entire documents on demand, but research on cross-language
summarization is just beginning.

 	The ultimate delivery of selected documents in a usable form
may be a somewhat more tractable problem than support for interactive
examination if adequate time for translation can be allowed when
arranging for delivery.  O'HAGAN provided an overview of the
translation industry and observed that globally interconnected
networks will make it possible to marshal worldwide translation
resources upon demand.  Although fully automatic machine translation
can presently only produce high quality translations in very limited
subject areas, O'HAGAN suggested that a robust and responsive
translation infrastructure could be built using machine assisted human
translation.  The human effort involved will likely make delivery the
most expensive component on a per-document basis, so effective
recognition of the most promising documents using the query
formulation, matching, selection and examination stages is
particularly important.

EVALUATION

	Experimental evaluation of CLIR systems poses unique
challenges because the languages covered by the translation resources
must match the languages covered by the evaluation resources.  The
situation is further complicated when alternative techniques that
require different translation resources are compared.  A CLIR test
collection thus consists of a set of documents in one or more
languages, a set of queries in a language or languages different from
that of the documents, relevance judgments for each query-document
pair, and translation resources such as dictionaries, bilingual
corpora, or cognate matching rules.

	LANDAUER & LITTMAN (1990) developed a simple evaluation
technique known as mate finding for use with document-aligned corpora.
Mate finding is a variation on known item retrieval, a classic
evaluation strategy in which the rank assigned to a unique item that
is known to be relevant to the query is used as the measure of
effectiveness.  LANDAUER & LITTMAN (1990, 1991) partitioned an
English-French parallel collection, extracting translation knowledge
from one part and using the other part for evaluation.  Each English
document was then used as a query, and statistics describing the rank
of the known French translation for each document were presented.
CARBONELL ET AL. found that mate retrieval was less able to
discriminate among fairly good techniques than more traditional
strategies in which recall and precision were reported, but mate
retrieval remains useful as a simple strategy for identifying
promising CLIR techniques when more sophisticated evaluation resources
are not available.
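
A minimal sketch of mate finding follows.  It assumes that each
document has already been mapped to a language-neutral vector by some
CLIR technique, and it uses an unnormalized dot product as the
similarity measure; the vectors and names are invented for
illustration.

```python
def overlap(query_vec, doc_vec):
    """Unnormalized dot product between two equal-length vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def mate_ranks(queries, candidates, similarity):
    """For query i, return the 1-based rank of candidates[i], its known mate."""
    ranks = []
    for i, query in enumerate(queries):
        scores = [similarity(query, c) for c in candidates]
        # Count candidates scoring strictly higher than the mate,
        # so ties are resolved in the mate's favor.
        ranks.append(1 + sum(1 for s in scores if s > scores[i]))
    return ranks

# Invented language-neutral vectors; EN_VECS[i] and FR_VECS[i] are mates.
EN_VECS = [(1, 0, 0), (0, 1, 0), (0, 1, 1)]
FR_VECS = [(1, 0, 0), (0, 1, 1), (0, 0, 1)]

RANKS = mate_ranks(EN_VECS, FR_VECS, overlap)
AT_RANK_ONE = sum(1 for r in RANKS if r == 1) / len(RANKS)
print(RANKS, AT_RANK_ONE)
```

Summary statistics such as the fraction of mates retrieved at rank
one can then be reported for each technique being compared.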

	RADWAN & FLUHR used French translations of the 1,398 abstracts
in the English Cranfield collection to compute precision-recall graphs
and an average precision measure.  DAVIS & DUNNING adopted an
alternate strategy, manually translating Spanish topic descriptions
into English and then using those topic descriptions to construct
English queries to retrieve Spanish newswire articles from the Text
REtrieval Conference (TREC).  Manual translation of queries is now a
widely used evaluation strategy because it permits existing test
collections to be inexpensively extended to any language pair for
which translation resources are available.  Because manual translation
requires the application of human judgment, evaluation collections
constructed in this way exhibit some variability based on the
terminology chosen by a particular translator.  But if a standard set
of translations is agreed upon, such a strategy offers a meaningful
basis for selecting between alternative CLIR techniques.

 	There are, however, some applications for which manual query
translation would not produce an adequate test collection.
Corpus-based techniques, for example, may not perform well on
collections that differ markedly from the corpora on which they were
trained.  There presently is no widely accepted metric for reporting
the similarity of two corpora, so same-corpus (i.e., best case)
evaluations are typically performed using a held-back portion of the
corpus.  CARBONELL ET AL. produced a test collection in this way by
exhaustively performing 33,630 relevance judgments for a portion of a
parallel collection of English and Spanish documents.  This produced a
test collection that was about the same size as the Cranfield
collection used by RADWAN & FLUHR, but with the added characteristic
that the remainder of the parallel corpus was available for the
extraction of translation knowledge.  SHERIDAN & BALLERINI also built
a test collection from a document aligned corpus, but they developed a
genre-specific strategy for newswire articles.  By constructing
queries for unpredicted events and ending their search three days
after the event (which produced a different collection size for each
query) they cut down the number of relevance judgments considerably.
Newswire stories are fairly readily available in character-coded form,
so this evaluation strategy may provide an economical alternative for
many applications.

	Evaluation of adaptive filtering techniques that learn to
select documents in one language based on user reactions to documents
in other languages imposes further requirements on an evaluation
collection because a third partition of the evaluation collection may
be needed.  OARD (1997a) constructed such a collection using
monolingual test collections in English and Spanish for which four
topic descriptions were closely aligned and a parallel corpus of
English and Spanish documents for which no relevance judgments were
needed.  In addition to the adaptive filtering evaluation, some
indication of the degree of similarity between one of the monolingual
test collections and the parallel corpus was also obtained.

	Relatively large document collections are needed to accurately
reflect the performance of IR systems in large-scale applications, and
the potential need to subdivide the collection two or three ways
exacerbates the situation.  Obtaining statistical significance will
often require more queries for query translation experiments than for
monolingual experiments on the same collection because uneven
translation accuracy introduces an additional source of variation.
And collections covering a wide range of languages and modalities will
be needed to assess the effect of variations in morphology, word
boundary marking, and recognition accuracy.  At present the TREC CLIR
collection described by MATEEV ET AL. and SCHÄUBLE & SHERIDAN is
the most comprehensive step in that direction.  Using an approach
known as pooled relevance assessment, relevance judgments for about
100,000 newswire articles in each of three languages (English, French
and German) were developed by judging documents selected using several
different retrieval techniques.  The documents are not translations of
each other, but they are drawn from the same genre and time frame and
SHERIDAN ET AL. (1998) have automatically identified some possible
alignments between some of the French and German documents in the
collection.
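
Pooled relevance assessment can be sketched in a few lines.  The pool
for a query is the union of the top-ranked documents from each
participating system, and only pooled documents are judged.  The pool
depth, run names, and document identifiers below are invented, and
treating unjudged documents as not relevant is stated here as an
assumption rather than as a description of the TREC CLIR procedure.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from each system's ranked list."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool

def judged_relevant(pool, assessor_says_relevant):
    """Keep only pooled documents the assessor marked relevant;
    documents outside the pool are never shown to the assessor."""
    return {doc for doc in pool if assessor_says_relevant(doc)}

# Invented ranked lists from two hypothetical systems for one query.
RUNS = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d1"],
}
POOL = build_pool(RUNS, depth=2)
RELEVANT = judged_relevant(POOL, lambda doc: doc in {"d2", "d3"})

print(sorted(POOL), sorted(RELEVANT))
```

Here "d3" is never judged because no system ranked it within the pool
depth, which shows how the completeness of pooled judgments depends
on the diversity of the contributing techniques.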

Some insight into the contribution of alternative translation
techniques can be obtained by comparing CLIR results with the
effectiveness of a similar monolingual technique on the same
collection.  When expressed as a percentage of monolingual
effectiveness, reported values typically range from around 50% for
unconstrained dictionary based query translation to 75% or so for more
sophisticated techniques.  Direct comparisons are difficult, however,
because the monolingual reference technique is often different,
parameter variations can introduce additional variations even when the
reference technique is nominally the same, the effect of differing
collections on relative effectiveness is not well characterized, and
different effectiveness measures may have been used.  HULL &
GREFENSTETTE reported precision averaged over several fixed numbers of
documents to characterize high precision interactive searching, while
BALLESTEROS & CROFT (1997) reported precision averaged over the full
range of recall values.  Relative performance figures can help
identify particularly promising techniques, however, and then the most
promising techniques can be subjected to a more rigorous side by side
comparison.
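
The comparison with a monolingual baseline can be made concrete with
a small sketch.  It computes non-interpolated average precision and
precision at a fixed document cutoff for one query, then expresses
the cross-language score as a percentage of the monolingual score.
Non-interpolated average precision is used as a simple stand-in and
is not necessarily the exact measure reported in the studies cited
above; the rankings and judgments are invented.

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision for a single query."""
    hits, total = 0, 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, cutoff):
    """Fraction of the top `cutoff` documents that are relevant."""
    return sum(1 for doc in ranked[:cutoff] if doc in relevant) / cutoff

def percent_of_monolingual(clir_score, mono_score):
    """Cross-language effectiveness as a percentage of the monolingual score."""
    return 100.0 * clir_score / mono_score

# Invented judgments and rankings for one query.
RELEVANT = {"d1", "d3"}
MONO_RANKING = ["d1", "d3", "d2", "d4"]
CLIR_RANKING = ["d2", "d1", "d4", "d3"]

MONO_AP = average_precision(MONO_RANKING, RELEVANT)
CLIR_AP = average_precision(CLIR_RANKING, RELEVANT)
print(percent_of_monolingual(CLIR_AP, MONO_AP))
```

In practice the scores would be averaged over many queries before the
ratio is taken, and the ratio is only meaningful when both runs use
the same collection, judgments, and effectiveness measure.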

RESEARCH DIRECTIONS

	Nearly three decades of research on and practice of controlled
vocabulary techniques for CLIR and eight years of research on free
text techniques have produced a wide array of useful techniques, but
more remains to be done.  Existing research on user needs for CLIR,
for example, addresses the deliberate dissemination of information
well but the impact of ubiquitous networking and the resulting trend
towards flattened organizational structures has yet to be addressed.
Some issues, such as the impact of networked communications on the
translation infrastructure supporting ultimate use of selected
documents, have implications for both controlled vocabulary and free
text CLIR.  But free text techniques are still relatively new, and it
is there that many of the open research questions are to be found.

Important research issues are found in each stage of the model shown
in Figure 1.  The distinction between user-assisted and fully
automatic query translation is rather sharply drawn at present, with
users either being offered the opportunity to help resolve translation
ambiguity for every term or for none of them.  More sophisticated
strategies might retain much of the benefit of user-assisted
translation while avoiding unnecessary allocation of user effort and
screen space to that task.  Present document preprocessing systems are
typically language specific, often using hand-built components for
tasks such as character set conversion, compound splitting, and
stemming.  The development of easily configured tools for such tasks
would make the addition of further languages far more tractable.
The matching stage has received a great deal of attention, but
cognate matching has only recently been investigated carefully.
Further work on additional language pairs and strategies for combining
cognate matching with other techniques appear to be the natural next
steps.  The importance of selection, examination and delivery for CLIR
system design is now beginning to be recognized, but much remains to
be done.  It is not yet clear, for example, whether rapid translation
of the entire text or automatic generation of translated summaries
will provide the best support for examination, and answering that
question may require the development of new evaluation techniques.
Other evaluation issues also require attention.  Perhaps most
importantly, it will not be possible to accurately characterize the
performance of document and sentence aligned corpus-based techniques
in practical applications without some way to measure the degree of
difference between the corpus from which the translation knowledge is
extracted and the collection from which retrieval is desired.

As CLIR has matured, increasingly integrated approaches have been
investigated.  Dictionary based query translation has been improved
using unaligned corpora (BALLESTEROS & CROFT 1996, 1997), and term
aligned corpora have been refined by seeding the alignments using a
bilingual dictionary (YANG ET AL. 1997).  Fully automatic query
translation techniques are being augmented with user assisted query
translation.  This trend will likely continue, encompassing other
components and techniques as productive interactions are discovered.
GACHOT ET AL., for example, have observed that closer coupling between
machine translation and matching techniques might be helpful because
additional linguistic information would be available.  Ultimately the
distinctions that have been drawn in this chapter between separate
components and different techniques may be as useful for explaining
how they are coupled as for how they are different.

CONCLUSION

Controlled vocabulary CLIR techniques are now widely deployed, and
free text systems for practical applications are beginning to appear.
Although monolingual free text retrieval is still more effective than
CLIR, several useful CLIR techniques are known.  Query
translation, document translation, interlingual techniques and cognate
matching provide a range of alternatives that can be tailored to
specific applications.  Document preprocessing strategies have been
developed for scanned page images and recorded speech, but character
coded text remains the most easily processed format.  Interactive
applications pose additional challenges, since users may not have the
language skills that would be needed to select and examine documents
in their original language.  Additional opportunities are present as
well, however, since the user can help refine translation knowledge
that is extracted from dictionaries, bilingual corpora, or other
sources.  Evaluation poses additional challenges that the recent
development of the TREC CLIR test collection has begun to address.

Many modern information systems support only a single language, but
that limitation will likely become increasingly untenable in an era of
ubiquitous global networks and vast international information flows.
Cross-language information retrieval is one component of the
technological infrastructure that will help make the World Wide Web a
truly worldwide resource, and it will undoubtedly find widespread
application in other parts of the information industry as well.
Although much remains to be done, the techniques that have been
developed and the ways in which they have been applied provide useful
signposts for developers who wish to begin exploring the
opportunities that cross-language information retrieval presents.

BIBLIOGRAPHY

ALLAN, J., CALLAN, J., CROFT, W. B., BALLESTEROS, L., BYRD, D., SWAN,
R., & XU, J. 1998. INQUERY Does Battle with TREC-6. In: Proceedings of
the Sixth Text Retrieval Conference (TREC-6), National Institute of
Standards and Technology (NIST), Gaithersburg, MD.
http://trec.nist.gov

ATA, B. M. A., MOHD, T., SEMBOK, T., & YUSOFF, M. 1995. SISDOM : a
multilingual document retrieval system. Asian Libraries, 1995; 4(3):
37--46

AUSTIN, D. 1977. Progress Towards Standard Guidelines for the
Construction of Multilingual Thesauri. In: Third European Congress on
Information Systems and Networks (Vol. 1): Verlag
Dokumentation. 1977. pp. 341--402.

BALLESTEROS, L., & CROFT, W. B. 1996. Dictionary Methods for
Cross-Lingual Information Retrieval. In: R. R. Wagner & H. Thoma
(Eds.) New York: Springer.1996.  pp. 791--801. ISBN: 354061656X. Also
appeared in Lecture Notes in Computer Science, ISSN: 0302-9743 1996,
issue 1134.  http://ciir.cs.umass.edu/info/psfiles/irpubs/ir.html

BALLESTEROS, L., & CROFT, W. B. 1997. Phrasal Translation and Query
Expansion Techniques for Cross-Language Information Retrieval.  In:
Proceedings of the 20th International ACM SIGIR Conference on Research
and Development in Information Retrieval. 1997. pp. ?

BENKING, H., & KAMPFFMEYER, U. 1992. Harmonization of Environmental
Meta-Information with a Thesaurus-based multi-lingual and
multi-medial Information System.  In: A. Zygielbaum (Ed.), AIP
Conference Proceedings 283, Earth and Space Science Information
Systems: American Institute of Physics. 1992. pp. 688--695.

BERRY, M., & YOUNG, P. 1995. Using Latent Semantic Indexing for
Multilanguage Information Retrieval. Computers and the Humanities,
1995; 29(6): 413-429. ISSN 0010-4817.

BLAKE, P. 1992. The MenUSE System for Multilingual Assisted Access to
Online Databases, in the context of current EC funded
projects. On-line Review, 1992; 16(3): 139-146. June. ISSN 0309-314X.

BUCKLEY, C., MITRA, M., WALZ, J., & CARDIE, C. 1998. Using Clustering
and SuperConcepts Within SMART: TREC 6. In: Proceedings of the Sixth
Text Retrieval Conference (TREC-6), National Institute of Standards
and Technology (NIST), Gaithersburg, MD.  http://trec.nist.gov

BUCKLEY, C., SALTON, G., ALLAN, J., & SINGHAL, A. 1994. Automatic
Query Expansion Using SMART: TREC 3. In: Harman, D. K. (Ed.), Overview
of the Third Text REtrieval Conference (TREC-3), pp. 69-80. National
Institute of Standards and Technology (NIST), Gaithersburg, MD.
http://www-nlpir.nist.gov/TREC/trec3.papers/cornall.new.ps

CARBONELL, J., YANG, Y., FREDERKING, R., BROWN, R. D., GENG, Y., &
LEE, D. 1997. Translingual Information Retrieval: A Comparative
Evaluation.  In: Proceedings of the Fifteenth International Joint
Conference on Artificial Intelligence. pp. ??

CHACHRA, V. 1993. Subject Access in an Automated Multithesaurus and
Multilingual Environment. In: S. McCallum & M. Ertel (Eds.), 2nd
Satellite Meeting on Automated Systems for Access to Multilingual and
Multiscript Library Materials : Saur.1993. pp.  63--76. ISBN:
3598217978; Also appeared in IFLA PUBLICATIONS 1994, Vol. 70.

CHMIELEWSKA-GORCZYCA, E., & STRUK, W. 1994. Translating Multilingual
Thesauri. In: P. Stanucikova & I. Dahlberg (Eds.), 1st European
Conference on Environmental Knowledge Organization and Information
Management.  Frankfurt: Indeks Verlag.1994. pp. 150--155. ISBN:
3886726002 3886726010; also appeared in Knowledge Organization in
Subject Areas ISSN 0946-9389, 1994, vol. 1.

COOPER, D. 1997. How to Read Less and Know More: Approximate OCR for
Thai. In: Belkin, N., Narasimhalu, D. & Willett, P. (Eds.),
Proceedings of the 20th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval: ACM
SIGIR.1997. pp. 216-225. ISBN 0-89791-836-3.

D'OLIER, J. H. 1977. Multilingualism in Scientific and Technical
Documentation.  International Forum on Information and Documentation,
1977; 2(4): 20--24

DAVIS, M. 1997. New Experiments in Cross-Language Text Retrieval at
NMSU's Computing Research Lab. In: D. K. Harman (Ed.), The Fifth Text
REtrieval Conference (TREC-5). National Institute of Standards and
Technology (NIST), Gaithersburg, MD.
http://crl.nmsu.edu/users/madavis/Site/Book2/trec5.ps

DAVIS, M. & DUNNING, T. 1995. A TREC Evaluation of Query Translation
Methods for Multi-Lingual Text Retrieval. In: Harman, D. K. (Ed.) The
Fourth Text Retrieval Conference (TREC-4). National Institute of
Standards and Technology (NIST), Gaithersburg, MD.
http://trec.nist.gov,
http://crl.nmsu.edu/users/madavis/Site/Book2/trec4.ps

DAVIS, M. W., & OGDEN, W. C. 1997. Implementing cross-language text
retrieval systems for large-scale text collections and the world wide
web, AAAI Symposium on Cross-Language Text and Speech Retrieval :
American Association for Artificial
Intelligence. 1997. pp. 2-10. ISBN: 1-57735-040-5; Technical Report:
SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

DAVIS, M. & OGDEN, W. 1998. Free Resources and Advanced Alignment for
Cross-Language Text Retrieval. In: Proceedings of the Sixth Text
Retrieval Conference (TREC-6), National Institute of Standards and
Technology (NIST), Gaithersburg, MD.  http://trec.nist.gov

DEFENSE ADVANCED RESEARCH PROJECTS AGENCY 1996. Tipster Text
Program. Morgan Kaufmann

DUBOIS, C. P. R. 1987. Free Text vs. Controlled Vocabulary: A
Reassessment. Online Review, Vol. 11, No. 4, pp. 243-253.

DUCLOY, J. 1996. Tools and Techniques for Digital Libraries. ERCIM
News, 1996; 27.
http://www-ercim.inria.fr/publication/Ercim_News/enw27/ducloy.html

DUMAIS, S. T., LETSCHE, T. A., LITTMAN, M. L., & LANDAUER, T. K. 1997.
Automatic Cross-Language Retrieval Using Latent Semantic Indexing,
AAAI Symposium on Cross-Language Text and Speech Retrieval : American
Association for Artificial Intelligence.  1997. pp. 15-21. ISBN:
1-57735-040-5; Technical Report: SS-97-05.
http://www.clis.umd.edu/dlrg/filter/sss/papers/

ELKATEB, F. & FLUHR, C. 1998. EMIR at the CLIR Track of TREC 6. In:
Proceedings of the Sixth Text Retrieval Conference (TREC-6), National
Institute of Standards and Technology (NIST), Gaithersburg, MD.
http://trec.nist.gov

ELLEN, SANDRA R. 1979. Survey of Foreign Language Problems Facing the
Research Worker. Interlending Review, Vol. 7, No. 2, pp. 31-41, April.

ERES, B. K. 1989. International Information Issues. In: Williams,
M. (Ed.) Annual Review of Information Science and Technology, Vol. 24,
p. 3

ERBACH, G., NEUMANN, G., & USZKOREIT, H. 1997. MULINEX Multilingual
Indexing Navigation and Editing Extensions for the World-Wide Web,
AAAI Symposium on Cross-Language Text and Speech: American Association
for Artificial Intelligence. 1997. pp. 22-28. ISBN: 1-57735-040-5;
Technical Report: SS-97-05.
http://www.clis.umd.edu/dlrg/filter/sss/papers/

EVANS, D. A., HANDERSON, S. K., MONARCH, I. A., PEREIRO, J., DELON,
L., & HERSH, W. R. 1991. Mapping Vocabularies Using "Latent
Semantics" (CMU-LCL-91-1): Carnegie Mellon University, Laboratory
for Computational Linguistics

FLUHR, C., & RADWAN, K. 1993. Fulltext Databases as Lexical Semantic
Knowledge for Multilingual Interrogation and Machine Translation. In:
P. Brezillon & V. Stefanuk (Eds.), Proceedings of the East-West
Conference on Artificial Intelligence (EWAIC '93) Moscow: Association
for Artificial Intelligence of Russia, ICSTI.1993. pp. 124--128.

FLUHR, C. 1995. Multilingual Information Retrieval. In: R. A. Cole,
J. Mariani, H.  Uszkoreit, A. Zaenen, & V. Zue (Eds.), Survey of the
State of the Art in Human Language Technology : Center for Spoken
Language Understanding, Oregon Graduate Institute.1995. pp. 391--305.
http://www.cse.ogi.edu/CSLU/HLTsurvey/ch8node7.html

FLUHR, C., SCHMIT, D., ELKATEB, F., ORTET, P., & GURTNER, K. 1997.
Multilingual Database and Crosslingual Interrogation in a Real
Internet Application, AAAI Symposium on Cross-Language Text and Speech
Retrieval : American Association for Artificial
Intelligence.1997. pp. 32-36. ISBN: 1-57735-040-5; Technical Report:
SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

FRANZEN, K., & KARLGREN, J. 1997. Project Presentation REPTILE
Retrieval Extraction Presentation and Translation using Language
Engineering, AAAI Symposium on Cross-Language Text and Speech
Retrieval : American Association for Artificial
Intelligence.1997. pp. 37-39. ISBN: 1-57735-040-5; Technical Report:
SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

FREDERKING, R., MITAMURA, T., NYBERG, E., & CARBONELL, J. 1997.
Translingual Information Access, AAAI Symposium on Cross-Language Text
and Speech Retrieval : American Association for Artificial
Intelligence.1997. pp. 40-48. ISBN: 1-57735-040-5; Technical Report:
SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

FURNAS, G. W., LANDAUER, T. K., GOMEZ, L. M., & DUMAIS,
S. 1987. The Vocabulary Problem in Human-system
Communication. Communications of the Association for Computing
Machinery. Vol. 30, No. 11, pp. 964-971.

GACHOT, D. A., LANGE, E., & YANG, J. 1998. The SYSTRAN NLP Browser: An
Application of Machine Translation Technology in Multilingual
Information Retrieval.  In: G. Grefenstette, (Ed.), Cross Language
Information Retrieval: Kluwer Academic. pp.  ??. ISBN:0-7923-8122-X,
pp. ??.  http://www.rxrc.xerox.com/research/mltt/DMHead/CLIR/

GAUSSIER, E., GREFENSTETTE, G., HULL, D. A., & SCHULZE,
B. M. 1998. Xerox TREC-6 Site Report: Cross Language Text
Retrieval. In: Proceedings of the Sixth Text Retrieval Conference
(TREC-6), National Institute of Standards and Technology (NIST),
Gaithersburg, MD.  http://trec.nist.gov

GEY, F. & CHEN, A. 1998. Phrase Discovery for Cross-Language Retrieval
at TREC 6.  In: Proceedings of the Sixth Text Retrieval Conference
(TREC-6), National Institute of Standards and Technology (NIST),
Gaithersburg, MD.  http://trec.nist.gov

GIBB, J. M., & PHILLIPS, E. 1977. Scientific and Technical Publishing
in a Multilingual Society.  In: Third European Congress on Information
Systems and Networks 1977. pp.  13--27.

GILARRANZ, J., GONZALO, J., & VERDEJO, F. 1997a. An approach to
Conceptual Text Retrieval Using the EuroWordNet Multilingual Semantic
Database, AAAI Symposium on Cross-Language Text and Speech Retrieval:
American Association for Artificial
Intelligence.1997. pp. 49-55. ISBN: 1-57735-040-5; Technical Report:
SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

GILARRANZ, J., GONZALO, J., & VERDEJO, F. 1997b. Language-Independent
Text Retrieval with the EuroWordNet Multilingual Semantic Database.
In: Second Workshop on Multilinguality in the Software Industry: The
AI Contribution.1997. pp. ?
http://www.iit.nrcps.ariadne-t.gr/~costass/mulsaic97.html

GOLDSTEIN, E. S. 1985. The Use of Technical Information by Engineers
of the Electrical Sector of Mexico. Unpublished Doctoral Dissertation,
University of California, Los Angeles.

GREFENSTETTE, G. 1995. Comparing Two Language Identification
Schemes. In: Proceedings of the 3rd International Conference on the
Statistical Analysis of Textual Data.
http://www.rxrc.xerox.com/researc/mltt/Tools/guesser.html

GREFENSTETTE, G. 1998. Cross Language Information Retrieval: Kluwer
Academic.  ISBN:0-7923-8122-X

GUO, J. 1997. A Comparative Study on Sentence Tokenization Generation
Schemes.  In review for journal publishing, January, 1997.
http://sunzi.iss.nus.sg:1996/guojin/papers/

HAMP, B. & FELDWEG, H. GermaNet - A Lexical-Semantic Net for German.
http://www.sfs.nphil.uni-tuebingen.de/isd/english.html

HAYASHI, Y., KIKUI, G. I., & SUSAKI, S. 1997. TITAN: A
Cross-Linguistic Search Engine for the WWW, AAAI Symposium on
Cross-Language Text and Speech Retrieval: American Association for
Artificial Intelligence.1997. pp. 56-62. ISBN: 1-57735-040-5;
Technical Report: SS-97-05.
http://www.clis.umd.edu/dlrg/filter/sss/papers/

HLAVA, M. M. K., HAINEBACH, R., BELONOGOV, G., & KUZNETSOV, B. 1997.
Cross-Language Retrieval - English/ Russian/ French, AAAI Symposium on
Cross-Language Text and Speech Retrieval: American Association for
Artificial Intelligence.1997. pp. 63-83. ISBN: 1-57735-040-5;
Technical Report: SS-97-05.
http://www.clis.umd.edu/dlrg/filter/sss/papers/

HULL, D. A., & GREFENSTETTE, G. 1996. Querying Across Languages: A
Dictionary-based Approach to Multilingual Information Retrieval. In:
Proceedings of the 19th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval: ACM
SIGIR.1996. pp. ??
http://www.xerox.fr/people/grenoble/hull/papers/sigir96.ps

HULL, D. A. 1997. Using Structured Queries for Disambiguation in
Cross-Language Information Retrieval, AAAI Symposium on Cross-Language
Text and Speech Retrieval: American Association for Artificial
Intelligence.1997. pp. 84-98. ISBN: 1-57735-040-5; Technical Report:
SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

HUTCHINS, W. J., PARGETER, L. J., & SAUNDERS, W. L. 1971. The Language
Barrier. Sheffield: University of Sheffield Postgraduate School of
Librarianship and Information Science

ILJON, A. 1977. Scientific and technical data bases in a multilingual
society. On-Line Review, 1977; 1(2): 133-136

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION (ISO). 1985.
Guidelines for the establishment and development of multilingual
thesauri: ISO English version. ISO 5964-1985 (E) distributed by the
American National Standards Institute.

INTERNATIONAL TRANSLATIONS CENTRE. 1996. Word Translations
Index. Delft, The Netherlands, International Translations
Centre. Vol. 10, #9. ISSN 0259-8264.

INTERNET SOCIETY & ALIS TECHNOLOGIES. 1997. Web Languages Hit Parade.
http://www.isoc.org:8080/palmares.en.html

JONES, G. J. F., & JAMES, D. A. 1997. A Critical Review of
State-of-the-Art Technologies for Cross-Language Speech Retrieval,
AAAI Symposium on Cross-Language Text and Speech Retrieval: American
Association for Artificial Intelligence.1997. pp. 99-110. ISBN:
1-57735-040-5; Technical Report: SS-97-05.
http://www.clis.umd.edu/dlrg/filter/sss/papers/

KALACHKINA, S. Y. 1987. Algorithmic Determination of Descriptor
Equivalents in Different Natural Languages. Automatic Documentation
and Mathematical Linguistics, 1987; 21(4): 21--29. English translation
from Russian.

KARATZOGLOU, M. 1997. TRANSLIB (LIB/3-3038). Patras, Greece: Knowledge
S.A.

KIKUI, G. 1996. Identifying the Coding System and Language of On-line
Documents on the Internet. In: Sixteenth International Conference on
Computational Linguistics (COLING). International Committee on
Computational Linguistics.
http://isserv.tas.ntt.jp/chisho/paper/9608KikuiCOLING.ps

KIKUI, G., HAYASHI, Y. & SUZAKI, S. 1996. Cross-lingual Information
Retrieval on the WWW. In: Proceedings of the First Workshop on
Multilinguality in Software Engineering: The AI Contribution
(MULSAIC). European Coordinating Committee for Artificial
Intelligence.
http://isserv.tas.ntt.jp/chisho/paper/9608KikuiMULSAIC.ps.Z

KRAAIJ, W. 1997. Multilingual Functionality in the TwentyOne Project,
AAAI Symposium on Cross-Language Text and Speech Retrieval: American
Association for Artificial Intelligence.1997. pp. 127-132. ISBN:
1-57735-040-5; Technical Report: SS-97-05.
http://www.clis.umd.edu/dlrg/filter/sss/papers/

KRAAIJ, W. & HIEMSTRA, D. TREC6 Working Notes: Baseline Tests for
Cross Language Retrieval with the Twenty-One System. In: TREC6 working
notes. National Institute of Standards and Technology (NIST),
Gaithersburg, MD.

KROVETZ, R. & CROFT, B. 1992. Lexical Ambiguity and Information
Retrieval. ACM Transactions on Information Systems, Vol. 10, No. 2,
pp. 115-141.

KWOK, K. L. 1997. Evaluation of an English-Chinese Cross-Lingual
Retrieval Experiment, AAAI Symposium on Cross-Language Text and Speech
Retrieval : American Association for Artificial
Intelligence.1997. pp. 133-137. ISBN: 1-57735-040-5; Technical
Report: SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

LANDAUER, T. K., & LITTMAN, M. L. 1990. Fully Automatic Cross-Language
Document Retrieval Using Latent Semantic Indexing, Proceedings of the
Sixth Annual Conference of the UW Centre for the New Oxford English
Dictionary and Text Research Waterloo, Ontario: UW Centre for the New
OED and Text Research.1990. pp.
31--38. http://www.cs.duke.edu/~mlittman/docs/x-lang.ps

LANDAUER, T. K., & LITTMAN, M. L. 1991. A statistical method for
language-independent representation of the topical content of text
segments, Proceedings of the Eleventh International Conference: Expert
Systems and Their Applications (Vol.  8) Avignon
France.1991. pp. 77--85.

LEBOWITZ, A. I., ZWART, R. P., & SCHMID, H. 1991. Multilingual
Indexing and Retrieval in Bibliographic Systems: The AGRIS
Experience. Quarterly Bulletin of the International Association of
Agricultural Librarians and Documentalists, 1991; 36(3): 187-192

LEE, D., NOHL, C. R. & BAIRD, H. ? Language Identification in Complex,
Unoriented, and Degraded Document Images.

LI, C. S., POLLITT, A. S., & SMITH, M. P. 1992. Multilingual MenUSE -
A Japanese front-end for searching English Language databases and vice
versa. In: T. McEnery & C.  Paice (Eds.), 14th Information Retrieval
Colloquium.  New York: Springer-Verlag.1992.  pp. ?? ISBN: 3540198083,
0387198083.

LIN, C.-H., & CHEN, H. 1996. An Automatic Indexing and Neural Network
Approach to Concept Retrieval and Classification of Multilingual
(Chinese-English) Documents. IEEE Transactions on Systems Man and
Cybernetics, 1996; 26(1): 75--88.
http://ai.bpa.arizona.edu/papers/chinese93/chinese93.html.

LOGINOV, B. R., & V'YUGIN, V. V. 1989. Automated Maintenance of a
Bilingual Medical Thesaurus on a Microcomputer. Automatic
Documentation and Mathematical Linguistics, 1989; 23(2):
72--75. English translation from Russian.

LOUKACHEVITCH, N. V. 1997. Knowledge Representation for Multilingual
Text Categorization, AAAI Symposium on Cross-Language Text and Speech
Retrieval : American Association for Artificial
Intelligence.1997. pp. 138-142. ISBN: 1-57735-040-5; Technical
Report: SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

MALONE, T. W., GRANT, K. R., TURBAK, F. A., BROBST, S. A., & COHEN,
M. D. 1987.  Intelligent Information Sharing Systems. Communications
of the ACM. Vol. 30, no. 5, pp. 390-402.

MARCHIONINI, G. 1995. Information Seeking in Electronic
Environments. Cambridge University Press.

MATEEV, B., MUNTEANU, E., SHERIDAN, P., WECHSLER, M., & SCHÄUBLE,
P. 1998. ETH TREC-6: Routing, Chinese, Cross-Language and Spoken
Document Retrieval. In: Proceedings of the Sixth Text Retrieval
Conference (TREC-6), National Institute of Standards and Technology
(NIST), Gaithersburg, MD.

MEADOWS, A. J. 1974. Communication in Science. London: Butterworths

METOYER-DURAN, CHERYL. 1993. Information Gatekeepers. In: Annual
Review of Information Science and Technology. Medford, NJ: American
Society for Information Science, pp. 111-150. Vol. 28, chapter 3.

MILLER, G. 1990. WordNet: An On-line Lexical Database. International
Journal of Lexicography, Vol. 3, no. 4. Special Issue.

NELSON, P. 1991. Breaching the Language Barrier: Experimentation with
Japanese to English Machine Translation. In: D. I. Raitt (Ed.), 15th
International Online Information Meeting Proceedings: Learned
Information.1991. pp. 21--33.

NEVILLE, H. H. 1970. Feasibility study of a scheme for reconciling
thesauri covering a common subject. Journal of Documentation, 1970;
26(4): 313--336.

NEVILLE, H. H. 1975. Alternatives to conventional multilingual
thesauri (British Library Research and Development Report 5265 HC)

NG, K. & ZUE, V. W. Phonetic Recognition for Spoken Document
Retrieval.  Proceedings of IEEE Int. Conf. Acoustics, Speech, and
Signal Processing, 1998.

NGUYEN, V. B. H., WILKINSON, R., & ZOBEL, J. 1997. Cross-Language
Retrieval in English and Vietnamese, AAAI Symposium on Cross-Language
Text and Speech Retrieval : American Association for Artificial
Intelligence.1997. pp. 143-145. ISBN: 1-57735-040-5; Technical
Report: SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

OARD, D. W. 1996. Adaptive Vector Space Text Filtering for Monolingual
and Cross-Language Applications. Unpublished PhD Dissertation,
University of Maryland, College Park.

OARD, D. W. 1997a. Adaptive Filtering of Multilingual Document
Streams. In: Fifth RIAO Conference on Computer Assisted Information
Searching on the Internet. 1997. pp.  ?
http://www.glue.umd.edu/dlrg/~oard/research.html

OARD, D. W. 1997b. Alternative Approaches for Cross-Language Text
Retrieval, AAAI Symposium on Cross-Language Text and Speech Retrieval
: American Association for Artificial
Intelligence.1997. pp. 154-162. ISBN: 1-57735-040-5; Technical Report:
SS-97-05.  http://www.glue.umd.edu/~oard/research.html

OARD, D. W. 1997c. Serving Users in Many Languages : Cross-Language
Information Retrieval for Digital Libraries. D-Lib Magazine. Vol. ?,
No? Dec .  http://www.dlib.org

OARD, D. W., & DORR, B. J. 1996. A Survey of Multilingual Text
Retrieval (CS-TR-3615): University of Maryland, Institute for
Advanced Computer Studies.
http://www.glue.umd.edu/~oard/research.html

OARD, D. W., & DORR, B. J. 1998. Evaluating Cross-Language Text
Filtering Effectiveness. In: G. Grefenstette (Ed.), Cross Language
Information Retrieval: Kluwer Academic. pp.  ??. ISBN:0-7923-8122-X.
http://www.glue.umd.edu/~oard/research.html

OARD, D. W., DORR, B. J., HACKETT, P. G., & KATSOVA, M. 1998. A
Comparative Study of Knowledge-Based Approaches for Cross-Language
Information Retrieval.  Institute for Advanced Computer Studies,
University of Maryland. CS-TR-3897.

OARD, D. W. & HACKETT, P. 1998. Document Translation for
Cross-Language Text Retrieval at the University of Maryland. In:
Proceedings of the Sixth Text Retrieval Conference (TREC-6), National
Institute of Standards and Technology (NIST), Gaithersburg, MD.
http://trec.nist.gov

OFFICE FOR OFFICIAL PUBLICATIONS OF THE EUROPEAN COMMUNITIES.
1995. Thesaurus EUROVOC Volume 3: Multilingual version. Luxembourg.

O'HAGAN, M. 1996. The Coming Industry of Teletranslation. Clevedon:
Multilingual Matters. ISBN 1-85359-326-5.

PASANEN-TUOMAINEN, I. 1991. Analysis of Subject Searching in the
TENTTU Books Database. In: J. K. Lucker (Ed.), Proceedings of the 14th
Biennial Conference of IATUL (Vol. 1): International Association of
Technological University Libraries.1991.  pp. 72--77.

PASHCHENKO, N. A., KALACHKINA, S. Y., MATSAK, N. M., & PIGUR, V. A.
1982. Basic Principles for Creating Multilanguage Information
Retrieval Thesauri (Experience with implementing GOST
7.24-80). Automatic Documentation and Mathematical Linguistics, 1982;
16(3): 30--36. English translation from Russian.

PELISSIER, D., & ARTUR, O. 1986. The Multilingual Evolution of PASCAL,
10th International Online Information Meeting : Learned
Information.1986. pp. 113--121.

PETERS, C., & PICCHI, E. 1997. Using Linguistic Tools and Resources in
Cross-Language Retrieval, AAAI Symposium on Cross-Language Text and
Speech Retrieval : American Association for Artificial
Intelligence.1997. pp. 179-188. ISBN: 1-57735-040-5; Technical
Report: SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

PEVZNER, B. R. 1969. Automatic Translation of English Text to the
Language of the Pusto-Nepusto-2 System. Automatic Documentation and
Mathematical Linguistics, 1969; 3(4): 40--48. English translation from
Russian.

PEVZNER, B. R. 1972. Comparative Evaluation of the Operation of the
Russian and English Variants of the "Pusto-Nepusto-2" System.
Automatic Documentation and Mathematical Linguistics, 1972; 6(2):
71--74. English translation from Russian.

PICCHI, E., & PETERS, C. 1996. Cross Language Information Retrieval: A
System for Comparable Corpus Querying. In: G. Grefenstette,
A. Smeaton, & P. Sheridan (Eds.), Workshop on Cross-Linguistic
Information Retrieval : ACM SIGIR.1996. pp. 24--33.
http://www.rxrc.xerox.com/research/mltt/DMHead/CLIR/

PIGUR, V. A. 1979. Multilanguage Information-Retrieval Systems:
Integration Levels and Language Support. Automatic Documentation and
Mathematical Linguistics, 1979; 13(1): 36--46. English translation
from Russian.

PIONEER CONSULTING. 1997. Pioneer Forecast: International E-mail
Growth. The Pioneer Report, 1997; 1(aug): 3.
http://www.pionerconsutling.com

POLLITT, A. S., ELLIS, G. P., SMITH, M. P., GREGORY, M. R., LI, C. S.,
& ZANGENBERG, H. 1993. A Common Query Interface for Multilingual
Document Retrieval from Databases of the European Community
Institutions. In: D. I. Raitt & B.  Jeapes (Eds.), 17th International
Meeting on Online Information : Learned
Information.1993. pp. 47--61. ISBN: 0904933857

POLLITT, A. S., & ELLIS, G. P. 1993. Multilingual access to document
databases, 21st Annual Conference Canadian Society of Information
Science .1993. pp. 128-140.

RADWAN, K. 1994. Vers l'Accès Multilingue en Langage Naturel aux
Bases de Données Textuelles. Unpublished PhD, Université de
Paris-Sud, Centre d'Orsay.

RADWAN, K., & FLUHR, C. 1995. Textual database lexicon used as a
filter to resolve semantic ambiguity applications on multilingual
information retrieval, 4th Annual Symposium on Document Analysis and
Information Retrieval: University of Nevada.1995. pp. 121-136.

READWARE http://?

REHDER, B., LITTMAN, M., DUMAIS, S., & LANDAUER, T. 1998. Automatic
3-Language Cross-Language Information Retrieval with Latent Semantic
Indexing. In: Proceedings of the Sixth Text Retrieval Conference
(TREC-6), National Institute of Standards and Technology (NIST),
Gaithersburg, MD.

RESNIK, P. 1997. Evaluating Multilingual Gisting of Web Pages, AAAI
Symposium on Cross-Language Text and Speech Retrieval : American
Association for Artificial Intelligence.1997. pp. 189-195. ISBN:
1-57735-040-5; Technical Report: SS-97-05.
http://www.clis.umd.edu/dlrg/filter/sss/papers/

RIDDLE, J. N. 1992. FBIS Requirements and Capabilities. In: First
International Symposium on National Security and National
Competitiveness. Open Source Solutions (OSS),
pp. 264-271. http://www.oss.net

RIGBY, M. 1981. Automation and the UDC 1948--1980. (2 ed.). The Hague:
Federation Internationale de Documentation (FID)

ROLLAND-THOMAS, P., & MERCURE, G. E. 1989. Subject Access in a
Bilingual Online Catalog. Cataloging and Classification Quarterly,
1989; 10(1/2): 141--163

ROLLING, L. 1975. Multilingual systems: Survey of the European
scene. In: V.  Horsnell (Ed.), Report of a Workshop on Multilingual
Systems.1975. pp. 4--5.  British Library Research and Development
Report 5265 HC

SALTON, G. 1970. Automatic Processing of Foreign Language
Documents. Journal of the American Society for Information Science,
1970; 21(3): 187--194

SALTON, G. 1973. Experiments in Multi-Lingual Information
Retrieval. Information Processing Letters, 1973; 2(1): 6--11. TR
72-154.  http://cs-tr.cs.cornell.edu

SANDERSON, M. 1994. Word Sense Disambiguation and Information
Retrieval.  In: Croft, B. & van Rijsbergen, K. (Eds.), Proceedings of
the Seventeenth Annual International ACM-SIGIR Conference on Research
and Development in Information
Retrieval. pp. 142-151. Springer-Verlag.
http://www.dcs.gla.ac.uk/ir/papers/Postscript/sanderson94b.ps.gz

SCHÄUBLE, P. & SHERIDAN, P. 1998. Cross-Language Information
Retrieval (CLIR) Track Overview. In: Proceedings of the Sixth Text
REtrieval Conference (TREC-6), National Institute of Standards and
Technology (NIST), Gaithersburg, MD.

SEMTURS, F. 1978. STAIRS/TLS - A System for "Free Text" and
"Descriptor" Searching. In: E. H. Brenner (Ed.) Vol. 15: American
Society for Information Science.1978. pp. 295--298.

SHERIDAN, P. & BALLERINI, J. P. 1996. Experiments in Multilingual
Information Retrieval Using the SPIDER System. In: H. P. Frei (Ed.),
Proceedings of the 19th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval. pp. 58-66. ISBN:
0897917928, 3891919999.
http://www-ir.inf.ethz.ch/Public-Web-Pages/sheridan/papers/SIGIR96.ps

SHERIDAN, P., BALLERINI, J. P., & SCHÄUBLE, P. 1998. Building a
Large Multilingual Test Collection from Comparable News Documents. In:
G. Grefenstette, (Ed.): Cross Language Information Retrieval.  Kluwer
Academic. pp. ??. ISBN: 0-7923-8122-X.
http://www.rxrc.xerox.com/research/mltt/DMHead/CLIR/

SHERIDAN, P., & SCHÄUBLE, P. 1997. Cross-Language Information
Retrieval in a Multilingual Legal Domain, Proceedings of the First
European Conference on Research and Advanced Technology for Digital
Libraries, Pisa, Italy. pp. 253-268.
http://www-ir.inf.ethz.ch/Public-Web-Pages/sheridan/papers/sheridan.html

SHERIDAN, P., WECHSLER, M., & SCHÄUBLE, P. 1997. Cross-Language
Speech Retrieval: Establishing a Baseline Performance. In:
N. J. Belkin, A. D. Narasimhalu, & P.  Willett (Eds.), Proceedings of
the 20th International ACM SIGIR Conference on Research and
Development in Information Retrieval. pp. 99-109. ISBN: 0897918363
http://www-ir.inf.ethz.ch/Public-Web-Pages/sheridan/

SMEATON, A. F. & SPITZ, A. L. 1997. Using Character Shape Coding for
Information Retrieval. In: Proceedings of the Fourth International
Conference on Document Analysis and Recognition, ICDAR'97, Ulm,
Germany, IEEE Computer Society, pp. 974-978.
http://www.compapp.dcu.ie/~asmeaton/pubs-list.html

SMITH, M. P. & POLLITT, A. S. 1992. An Evaluation of Concept
Translation Through Menu Navigation in the MenUSE Intermediary
System. In: McEnery, T. & Paice, C.  (Eds.), Proceedings of 14th
Information Retrieval Colloquium (BCS).  University of Lancaster,
pp. 38-54.

SOERGEL, D. 1997. Multilingual Thesauri in Cross-Language Text and
Speech Retrieval. In: AAAI Symposium on Cross-Language Text and Speech
Retrieval : American Association for Artificial
Intelligence.1997. pp. 197-216. ISBN: 1-57735-040-5; Technical
Report: SS-97-05.  http://www.clis.umd.edu/dlrg/filter/sss/papers/

STAMATATOS, E., MICHOS, S., PATELODIMOU, C., & FAKOTAKIS, N. 1997.
TRANSLIB : An Advanced Tool for Supporting Multilingual Access to
Library Catalogues. In: Second Workshop on Multilinguality in the
Software Industry: The AI Contribution: International Joint Conference
on Artificial Intelligence.1997. pp.
http://www.iit.nrcps.ariadne-t.gr/~costass/mulsaic97.html

STEGENTRITT, E. 1994. German Analysis: Morpho-Syntax Within the
Framework of the Free-Text Retrieval Project E.M.I.R. Saarbrücken,
Germany: AQ-Verlag

STUDEMAN, W. 1992. Teaching the Giant to Dance: Contradictions and
Opportunities in Open Source within the Intelligence Community. In:
Proceedings of the First International Symposium on National Security
and National Competitiveness. pp. 82-92 dec. Vol. 2.
http://www.oss.net

SUZUKI, M., & HASHIMOTO, K. 1996. Enhancing Source Text for WWW
Distribution. In: S. H. Myaeng (Ed.), Proceedings of the Workshop on
Information Retrieval with Oriental Languages : Korea Research &
Development Information Center.  1996. pp.  51--56.

SYNELLIS, C. 1995. TRANSLIB User Survey Report (TRANSLIB Technical
Report): University of Patras Central Library

TAYLOR, R. S. 1962. The Process of Asking Questions. American
Documentation, Vol.  13, no. 4, pp. 391-396.

UNESCO. 1971. Guidelines for Establishment and Development of
Multilingual Scientific and Technical Thesauri for Information
Retrieval. Paris, France: UNESCO report number: SC/WS/501.

VOLODIN, K. I., GUL'NITSKII, L. L., MAKSAKOVA, R. N., PARKHOMENKO, V.
F., POZHARISKII, I. F., FEDOTOVA, L. V., & YAKOVLEVA,
N. I. 1991. Bilingual Indexing of Geological Documents. Automatic
Documentation and Mathematical Linguistics, 1991; 25(6):
43--45. English translation from Russian.

WECHSLER, M., SHERIDAN, P., & SCHÄUBLE, P. 1997. Multi-Language
Text Indexing for Internet Retrieval.  In: Fifth RIAO Conference on
Computer-Assisted Information Searching on the Internet. pp. ??
http://www-ir.inf.ethz.ch/Public-Web-Pages/sheridan/

WEIGAND, H. 1997. A Multilingual Ontology-based Lexicon for News
Filtering --- The TREVI Project, IJCAI Workshop on Ontologies and
Multilingual NLP : International Joint Conference on Artificial
Intelligence.1997. pp.  ??  http://crl.nmsu.edu/Events/IJCAI/

WELLISCH, H. 1973. Linguistic and Semantic Problems in the Use of
English-Language Information Services in Non-English-Speaking
Countries, International Library Review, vol. 5, no. 2, pp. 147-162.

WHITNEY, G. 1990. Language Distribution in Databases: An Analysis and
Evaluation, Metuchen, NJ: Scarecrow Press. ISBN 0-8108-2323-3.

WILKINSON, R. 1997. Chinese Document Retrieval at TREC-6. In: Harman,
D. K.  (Ed.) The Sixth Text Retrieval Conference (TREC-6). National
Institute of Standards and Technology (NIST), Gaithersburg, MD.
http://trec.nist.gov/pubs/trec6/t6_proceedings.html

WOOD, D. N. 1967. The Foreign-Language Problem Facing Scientists and
Technologists in the United Kingdom : Report of a Recent
Survey. Journal of Documentation. Vol. 23, no. 2, pp. 117-130.

WOOD, D. N. 1974. Access to Information in Foreign Languages -- An
Experiment. BLL Review, Vol. 2, #1, pp. 12-14.

YAMABANA, K., MURAKI, K., DOI, S., & KAMEI, S.-I. 1998. A Language
Conversion Front-End for Cross-Linguistic Information Retrieval. In:
G. Grefenstette (Ed.), Cross Language Information Retrieval: Kluwer
Academic. pp. ??. ISBN: 0-7923-8122-X.
http://www.rxrc.xerox.com/research/mltt/DMHead/CLIR/

YANG, Y., BROWN, R. D., FREDERKING, R. E., CARBONELL, J. G., GENG, Y.,
& LEE, D. 1997. Bilingual-corpus Based Approaches to Translingual
Information Retrieval, Second Workshop on Multilinguality in the
Software Industry: The AI Contribution: International Joint Conference
on Artificial Intelligence.1997. pp. ??
http://www.iit.nrcps.ariadne-t.gr/~costass/mulsaic97.html

ZISSMAN, M. A. 1996. Comparison of Four Approaches to Automatic
Language Identification of Telephone Speech, IEEE Trans. Speech and
Audio Proc., SAP-4(1), pp.  31-44.