TECLÍN


Technology for minority languages using the Bible


Linguists at the University of Copenhagen (Denmark) have developed language technology for 100 languages at once, both minority and dominant, using Bible verses and Wikipedia articles.

09/10/2015

Bible. Photo: Dave Bullock. Source: Wikipedia.

“When we develop automatic translation systems and search engines, we generally feed the computer large volumes of text, which contain information about the function and meaning of words. For historical reasons, these texts have been primarily newspaper articles in English and other dominant languages. We do not have access to similar texts in smaller languages such as Faroese, Welsh, Galician and Irish, or even the African language Yoruba, which is spoken by 28 million people”, says Professor Anders Søgaard, of the University of Copenhagen, in a press release reported by DAIL Software.

Søgaard and his colleagues chose to look for texts that had been translated into many languages, so that knowledge about the grammar of dominant languages could be transferred to minority languages.

“The Bible has been translated into more than 1,500 languages, including most minority languages. The translations are extremely conservative. The verses are organised in a uniform way in all languages, which means that we can build suitable computer models even for minority languages with just 200 pages of the Bible”, explains Søgaard.
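Because verses carry the same reference in every translation, the reference itself can serve as an alignment key. The following is a minimal sketch of that idea, assuming each translation is stored as a mapping from a (book, chapter, verse) reference to its text. The function name `align_verses` and the toy fragments are illustrative assumptions, not the researchers' actual pipeline.

```python
# Toy fragments keyed by (book, chapter, verse); real data would cover
# many more verses and many more languages.
english = {
    ("John", 1, 1): "In the beginning was the Word",
    ("John", 1, 2): "The same was in the beginning with God",
}
german = {
    ("John", 1, 1): "Im Anfang war das Wort",
    ("John", 1, 3): "Alle Dinge sind durch dasselbe gemacht",
}


def align_verses(source, target):
    """Pair the verses whose reference exists in both translations."""
    shared = sorted(source.keys() & target.keys())
    return [(ref, source[ref], target[ref]) for ref in shared]


pairs = align_verses(english, german)
for ref, src_text, tgt_text in pairs:
    print(ref, "|", src_text, "|", tgt_text)
```

Only verses present in both translations are paired; the rest are skipped rather than guessed at. The resulting sentence pairs are the kind of parallel text from which knowledge about a well-resourced language could then be projected onto a minority language.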

Wikipedia

Wikipedia, the online encyclopaedia written by its users, is a useful source for researchers, who use its texts to develop linguistic resources. It contains more than 35 million articles, but what makes it interesting for researchers is that at least 129 languages are represented by more than 10,000 articles each, and many of those articles cover the same concepts and topics.

“This allows us to build what we call an ‘inverted index’, which means that we use the concept an article is about to describe the words that the article uses to describe it”, explains Søgaard. “If the word ‘glasses’ appears in the English Wikipedia entry on Harry Potter, and the German word ‘Brille’ appears in the equivalent German entry, it is quite likely that the two words will be represented in a similar way in our models for automatic translation systems.” The advantage of this model is that it can be applied to 100 different languages at the same time, including many languages that have so far had limited access to technological resources.
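The inverted-index idea can be illustrated with a small sketch. It assumes that articles on the same topic in different languages share one concept identifier, and it represents each word by the concepts in which it occurs. The toy articles, the concept identifiers and the function names are all hypothetical, and cosine similarity is used here only as a simple way to compare words; the article does not say which measure the researchers use.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical concept IDs, each linking articles on the same topic.
ARTICLES = {
    "harry_potter": {"en": "harry wears round glasses",
                     "de": "harry trägt eine runde brille"},
    "optics": {"en": "lenses and glasses bend light",
               "de": "linsen und brille brechen licht"},
    "football": {"en": "players kick the ball",
                 "de": "spieler treten den ball"},
}


def build_inverted_index(articles):
    """Map each (language, word) to the concepts it occurs in, with counts."""
    index = defaultdict(dict)
    for concept, texts in articles.items():
        for lang, text in texts.items():
            for token in text.lower().split():
                counts = index[(lang, token)]
                counts[concept] = counts.get(concept, 0) + 1
    return index


def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(count * b[concept] for concept, count in a.items() if concept in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(index, word_a, word_b):
    """Compare two (language, word) pairs through their shared concepts."""
    return cosine(index.get(word_a, {}), index.get(word_b, {}))


index = build_inverted_index(ARTICLES)
print(similarity(index, ("en", "glasses"), ("de", "brille")))
print(similarity(index, ("en", "glasses"), ("de", "ball")))
```

In this toy data “glasses” and “Brille” occur in exactly the same concepts, so their similarity is close to 1, while “glasses” and the German “Ball” share no concept and score 0. Because words are described by language-independent concepts, the same index can hold any number of languages at once.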


