Difficult Languages Challenging Machine Translation

There are more than seven thousand languages in the world, of which at least four thousand are written. But only 100 or so can be handled by machine translation tools like Google Translate. Promising new research is now under way to help us communicate in the rest.
Suppose you come across a message containing information that could help save a person’s life. The problem is that you do not understand a single word of it and, worse, you do not even know which of the world’s thousands of languages it is written in. What do you do?
If the message were written in French or Spanish, the problem would be solved by typing it into a machine translation engine, which would give you a clear answer in English immediately. But many languages still defy machine translation, including languages spoken by millions of people, such as Wolof, Luganda, Twi and Ewe in Africa. This is because the algorithms behind these engines learn from human translations, analyzing millions of words of translated text to improve their accuracy.
Such texts are plentiful in some languages, such as English, French and Spanish, thanks to the many human translators working at multilingual institutions such as the Canadian Parliament, the United Nations and the European Union, which produce huge quantities of translated documents. The European Parliament alone produced 1.37 billion words in 23 languages over ten years.
Other languages, however widely spoken, are not translated in anything like this volume, so little translated text exists for them. That is why they are known as low-resource, or under-resourced, languages. Training data for artificial intelligence in these languages often comes from religious publications, such as the Bible, which has been translated into many languages. But this material is not enough to train systems to translate accurately across different fields.
Google Translate lets people communicate in 108 languages, while the Bing translator, developed by Microsoft, covers about 70. But more than seven thousand languages are spoken in the world, and at least four thousand of them have writing systems.
This language barrier may stand in the way of anyone who needs to quickly gather accurate information, such as intelligence agencies.

“The more an individual becomes interested in understanding the world, the greater the need for access to data that is not written in English,” says Carl Rubino, a program director at the Intelligence Advanced Research Projects Activity (IARPA), the research arm of US intelligence. “Economic and political stability, the coronavirus outbreak, climate change: all of these challenges are, in essence, multilingual.”
It can take many years to train a translator or intelligence analyst in a new language, and even then they may not have gained enough experience to perform the task assigned to them. “There are more than 500 languages spoken in Nigeria alone, for example,” says Rubino. “Even our most world-renowned experts on that country may understand only a few of them.”
IARPA is funding research to develop a machine translation system that can find, translate and summarize any written or spoken information in an under-resourced language.
The project takes the form of a search engine in which the user types a query in English, for example, and is immediately shown a list of English summaries of documents translated from the foreign language. Clicking on one of these summaries brings up the full translated document. Competing teams of computer science researchers are taking part in the project, and large portions of their work have already been published.
Kathleen McKeown, a computer scientist at Columbia University who leads one of the competing teams, sees the goal of the project as making it easier for people from different cultures to interact and to learn more about one another’s cultures.
The research teams use artificial neural networks, a form of artificial intelligence that mimics some aspects of human thinking. Neural network models have transformed language processing in recent years. Instead of just memorizing words and sentences, these networks learn their meanings. They can infer from context that different words may express the same concept, even when they look unrelated on the surface.
However, these models usually need to analyze millions of texts in the language they are learning. The researchers in this project are trying to develop models that can learn a language from much smaller amounts of data. After all, humans do not need to read years’ worth of official documents to learn a language.
“When people learn a language, they only need to read a tiny fraction of the data that today’s machine translation systems need in order to be trained,” says Regina Barzilay, a computer scientist at the Massachusetts Institute of Technology. “That’s why we’re trying to develop a new generation of machine translation systems that produce accurate translations without needing this huge amount of data.”
Each research team is made up of groups of specialists, each tackling one part of the system. Key components such as automatic search, speech recognition, translation and text summarization have been adapted to work with under-resourced languages.
Since 2017, the teams have focused on eight different languages, including Swahili, Tagalog, Somali and Kazakh.

The teams have succeeded in gathering written and spoken material in under-resourced languages from websites, in the form of articles, forums and videos. This material is available online thanks to users around the world who post content in their native languages.
“If you want information in the Somali language, you will find hundreds of millions of words,” says Scott Miller, a computer scientist at the University of Southern California who is involved in the project. “You can find large amounts of text in almost any language now on the Internet.”
But these texts are mostly monolingual: Somali articles, for example, do not come with an English translation. Even so, Miller says, neural network models can be pre-trained on many languages by analyzing texts that are each written in only one language.
During training, the neural networks are thought to learn the properties and structures of language, which they then draw on when translating. “Nobody knows exactly what structures these models learn,” Miller says. “There are millions of parameters.”
After this pre-training on many languages, the neural network models learn to translate from one language to another with the help of a small amount of translated text, perhaps a few hundred thousand words in the language to be learned alongside their equivalents in other languages.
The multilingual search engine would then be able to search through both spoken and written information, though not without difficulty. Speech recognition and speech-to-text technology usually struggles with sounds, names and places that it has not encountered before.
Peter Bell, a speech technology specialist at the University of Edinburgh who is part of one of the teams, gives the example of a country relatively unknown in the West where a politician has been assassinated. Finding this politician’s name in audio clips would be difficult.
Bell gets around this problem by going back to the transcripts of the audio clips and looking for words that appear unclear because the system had not encountered them before. One of those words might turn out to be the name of the obscure politician.
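The idea of scanning a transcript for words the system has never seen can be illustrated with a minimal sketch. It assumes the recognizer’s vocabulary is available as a plain set of lowercase words. The function name `flag_unknown_words` and the sample data are hypothetical, and the article does not describe how Bell’s team actually implements this step.

```python
import re
from typing import List, Set


def flag_unknown_words(transcript: str, known_vocabulary: Set[str]) -> List[str]:
    """Return transcript words missing from the vocabulary, in order, without repeats."""
    # Split the transcript into lowercase alphabetic tokens (Unicode letters allowed).
    tokens = re.findall(r"[^\W\d_]+", transcript.lower())
    seen = set()
    flagged = []
    for token in tokens:
        if token not in known_vocabulary and token not in seen:
            seen.add(token)
            flagged.append(token)
    return flagged


# Hypothetical vocabulary and transcript, for illustration only.
vocabulary = {"the", "minister", "spoke", "in", "capital"}
transcript = "The minister Abdiweli spoke in the capital"
candidates = flag_unknown_words(transcript, vocabulary)
print(candidates)
```

In this toy case the only flagged word is the unfamiliar name, which an analyst could then check by hand.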
After finding and translating the information, the search engine summarizes it for the user. But in the process of summarizing, neural networks can make mistakes, which computer scientists call “hallucinations.”

Let’s say you are looking for a news report about protesters who stormed a building on Monday, but the summary you are shown says they stormed it on Thursday. This happens because, when summarizing a report, neural network models draw on the millions of pages they analyzed during training. Those texts may contain many examples of protesters storming buildings on Thursdays, so the network expected the same to be true of the latest example.
Neural network models may also insert dates or numbers of their own into a summary, which is another form of “hallucination.”
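A crude way to catch this kind of inserted detail is to compare the weekdays and numbers in a summary with those in the source. The sketch below is only an illustration under that assumption, not a method described by the researchers. The names `unsupported_details` and `WEEKDAYS` are hypothetical.

```python
import re
from typing import List, Set

WEEKDAYS = {
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
}


def _details(text: str) -> Set[str]:
    """Collect the weekday names and digit sequences that appear in a text."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return {t for t in tokens if t.isdigit() or t in WEEKDAYS}


def unsupported_details(source: str, summary: str) -> List[str]:
    """Return weekdays and numbers found in the summary but not in the source."""
    return sorted(_details(summary) - _details(source))


# Hypothetical report and summary, for illustration only.
source = "Protesters stormed the building on Monday, and 12 people were arrested."
summary = "Protesters stormed the building on Thursday; 12 arrests."
flags = unsupported_details(source, summary)
print(flags)
```

Here the number 12 is supported by the source, while the weekday in the summary is not, so only the weekday is flagged.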
“Neural network models are so powerful that they can memorize a great deal of language and add words that are not present in the source,” says Mirella Lapata, a computer scientist at the University of Edinburgh.
Lapata avoids this problem by extracting key words from each document instead of having the machine summarize it in full sentences, which prevents the neural models from adding information of their own.
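The appeal of keyword extraction is that every word shown to the user is copied from the document itself. A minimal sketch of that property follows, using simple word frequency as a stand-in, since the article does not say how Lapata’s team selects key words. The function `extract_keywords` and the small `STOPWORDS` list are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List

# A deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "on", "in", "of", "and", "to", "was", "were"}


def extract_keywords(document: str, top_k: int = 5) -> List[str]:
    """Return the most frequent non-stopword tokens from the document."""
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Every returned word comes from the document, so nothing can be invented.
    return [word for word, _ in counts.most_common(top_k)]


# Hypothetical report, for illustration only.
report = (
    "Protesters stormed the building on Monday. "
    "Protesters left the building on Monday night."
)
keywords = extract_keywords(report, top_k=3)
print(keywords)
```

Because the output is restricted to words found in the report, a day that never appears in the text cannot show up in the result.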
The project also includes a team working on languages that died out thousands of years ago. Material in these ancient languages is undoubtedly scarce, and in some cases only fragments of text survive. Experts use these languages as a testing ground for new techniques that may later be applied to modern, under-resourced languages.
Jiaming Luo, a PhD student at the Massachusetts Institute of Technology, and his team have developed algorithms that can detect which modern languages are descended from ancient ones. The team feeds the algorithms basic information about these languages and an overview of how they change.
Working from very little information, the neural network model found that the ancient Ugaritic language of the Near East is closely related to Hebrew, and that Iberian, an ancient European language, is closer to Basque than to other European languages.
Barzilay says, “The reliance on large quantities of translated documents is one of the weaknesses of the system. Therefore, producing effective technological tools, whether for decipherment or for translating less widely used languages, will help advance the field of machine translation.”
The teams have developed prototype multilingual search engines and improved them as new languages were added. “These technological tools are revolutionizing the ways in which analysts collect data from texts written in foreign languages, as they will allow analysts who only speak English to analyze data that they were previously unable to read or understand,” Rubino says.
Speakers of under-resourced languages also participate in this project, as they need important information written in foreign languages, not for the purpose of espionage, but to improve the quality of daily life.
David Ifeoluwa Adelani, a PhD student in computer science at Saarland University in Germany, who comes from Nigeria and is a native speaker of Yoruba, says: “When the coronavirus broke out, we urgently needed to translate essential health advice into many languages, including under-resourced languages.”
Adelani is building a Yoruba-to-English database as part of the non-profit “Breaking the Language Barrier Between Multilingual Speakers of Africa” project. He and his team added film scripts, news, literary works and public talks translated into Yoruba to the database, and used it to improve the accuracy of a neural network model that had previously been trained on religious texts, such as Jehovah’s Witnesses publications.
In parallel with these efforts, members of communities in Africa are helping to build databases in other African languages, such as Ewe, Fon, Twi and Luganda.
Perhaps the day will come when we all use multilingual search engines in our daily lives, discovering information from all over the world at the click of a button. But for now, if you want to understand texts in one of the under-resourced languages, you have little choice but to learn that language, and perhaps join the multilingual teams who are building databases to improve machine translation tools and techniques.