Varin's Blog

varinsikka.github.io


Why we're a long way from accurate machine translation

I recently read a paper evaluating how well various LLMs translate various languages. The results seemed pretty interesting overall, and they sent me down a rabbit hole of trying to understand exactly how LLMs do machine translation.

Basically, inputs (the text you want to translate) are first "encoded" into tokens. Tokens are roughly the smallest divisible parts of speech (if you're a linguistics nerd, these are basically morphemes, though in practice tokenizers split text by frequency statistics, so the pieces only approximate morphemes). Words are generally one token, such as "eat", "drink", "person", "tree", "food", but can also be more than one (as in "eat-ing", "tree-s", "push-ed", "anti-dis-establish-ment-arian-ism", with tokens separated by hyphens).
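To make the splitting step concrete, here's a toy tokenizer: a greedy longest-match over a hand-picked vocabulary. This is just a sketch; real tokenizers (BPE, WordPiece, and friends) learn their subword vocabulary from data rather than from a linguist's morpheme list, so their splits only loosely resemble morphemes.

```python
# Toy greedy longest-match tokenizer over a hand-picked vocabulary.
# Real tokenizers learn these pieces from corpus statistics instead.
VOCAB = {"anti", "dis", "establish", "ment", "arian", "ism",
         "eat", "ing", "tree", "s", "push", "ed"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("eating"))                        # ['eat', 'ing']
print(tokenize("antidisestablishmentarianism"))  # ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```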

After the text is encoded into these tokens, it's decoded into the corresponding text of the target language. This is usually more complicated than it sounds: the neural network needs to determine both what the words in the target language should be and how they should be glued together, all while trying to keep the meaning the same using only the input it was given.
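To see why decoding is more than substitution, here's a deliberately naive word-for-word "translator" built on a tiny hypothetical English-to-Spanish dictionary (the dictionary and example are my own, purely for illustration). It gets the words right but the gluing wrong:

```python
# Naive token-by-token substitution with a toy dictionary. Spanish puts
# the adjective after the noun ("la mesa verde"), which a per-word lookup
# can never produce -- the decoder has to handle word order and agreement.
EN_TO_ES = {"the": "la", "green": "verde", "table": "mesa"}

def naive_translate(sentence: str) -> str:
    return " ".join(EN_TO_ES.get(w, w) for w in sentence.split())

print(naive_translate("the green table"))  # 'la verde mesa' -- wrong order;
# a real decoder needs to output 'la mesa verde' instead.
```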

While the approach sounds pretty elegant, it rests on a critical assumption that makes it impractical for languages that don't conform to it: it assumes that all languages mark essentially the same set of morphemes. It's a fairly easy assumption to make, so let's look at some languages that break it.

Starting with two simple examples: Russian and Vietnamese. You've probably seen a lot of talk about how perceptions of color can vary across languages, and both of these languages are examples of this. In English, the basic color words include red, orange, yellow, green, blue, purple, black, white, brown, pink, and so on. Vietnamese, however, doesn't distinguish between "blue" and "green"; both are "xanh". This isn't a problem if you're translating from English to Vietnamese, since whether you're describing something as "blue" or "green", it translates to "xanh" either way. It is a problem in the other direction, though. Say you're translating "bàn đó xanh" ("that table is blue/green"; if this translation is wrong it was Google Translate's fault i swear) into English, and the table in question is actually blue. On its own, an LLM has no way of knowing whether the table is blue or green; it can't just look at the table itself and see what color it is. That makes accurately translating this simple, merely 3-word phrase entirely impossible without connecting the model to some computer vision system that can identify the "table" mentioned and check its color.
Russian is in a similar situation: it has one word for lighter shades of blue ("goluboy") and another for darker ones ("siniy"). If you're translating "that pen is blue" into Russian, the LLM doesn't know whether your pen is a lighter shade of blue or a darker one, and now we're in the same situation as with Vietnamese.
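The core problem in both cases is that the color mapping is many-to-one, so it has no clean inverse. A tiny sketch (the lookup table is my own toy example, not how an LLM actually represents any of this):

```python
# English -> Vietnamese color words are many-to-one, so the forward
# direction is a plain lookup but the inverse returns a *set* of
# candidates -- without outside context there's no way to pick one.
EN_TO_VI = {"blue": "xanh", "green": "xanh", "red": "đỏ"}

def to_vietnamese(color: str) -> str:
    return EN_TO_VI[color]

def to_english(color: str) -> set[str]:
    return {en for en, vi in EN_TO_VI.items() if vi == color}

print(to_vietnamese("blue"))   # 'xanh' -- forward direction is fine
print(to_english("xanh"))      # {'blue', 'green'} -- ambiguous going back
```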

Now to Arizona for a more extreme example: Navajo. Navajo is (generally speaking) an agglutinative language, which basically means it really likes to "glue" those tokens/morphemes mentioned earlier together into words rather than leave them separate as English does. Navajo also has a B-52 Stratofortress's max cargo capacity worth of grammatical constructs that are not marked at all in English, such as the independent "durative", "conclusive", "semelfactive", "distributive", "reversative", and "cursive" verb aspects (plus more), none of which (to my knowledge) have any way of being marked with their own tokens in English. In addition, Navajo verb tokens take different forms depending on the shape of the object being interacted with; so "give" in "I give them hay" would be "níłjool", but "give" in "I give them a cigarette" would be "nítįįh". On top of all that, when tokens are glued together in the language, they may surface in a slightly different form that obscures their presence; for example, "di'nisbąąs" can be tokenized into "di-'a-ni-sh-ł-bąąs".

All of this makes it really hard to recover, from an English text, tokens that Navajo requires, since English has no real equivalent of these grammatical features; and even where it does, they aren't used the way Navajo uses them. For example, the verb in "I am giving them water" could take the momentaneous, continuative, semelfactive, conative, or cursive aspect, and since English makes none of these distinctions, the LLM can't accurately determine which one to use without further context; the same applies to knowing the shape of the object in question. Going the other way, extracting tokens out of Navajo words for translation into other languages becomes harder when they surface differently depending on how they're combined (e.g. the tokens "-sh-ł-" surfacing as "s" in the previous example of "di'nisbąąs").
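The object-shape problem boils down to a lookup that English simply gives you no key for. Here's a sketch using the two stems from the examples above (the class names are my own rough glosses, not a complete inventory of Navajo classificatory verbs):

```python
# The stem of Navajo "give" depends on the handling class of the object.
# Stems come from the hay/cigarette examples in the post; the class labels
# are my own rough glosses for illustration.
GIVE_STEMS = {
    "non-compact matter": "níłjool",   # e.g. hay
    "slender stiff object": "nítįįh",  # e.g. a cigarette
}

OBJECT_CLASS = {"hay": "non-compact matter", "cigarette": "slender stiff object"}

def navajo_give_stem(obj: str) -> str:
    # English "give" carries no shape information, so this lookup is
    # exactly the context a translator has to supply from outside the text.
    cls = OBJECT_CLASS.get(obj)
    if cls is None:
        raise ValueError(f"shape class of {obj!r} unknown -- can't pick a stem")
    return GIVE_STEMS[cls]

print(navajo_give_stem("hay"))        # níłjool
print(navajo_give_stem("cigarette"))  # nítįįh
```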

There are probably many more examples of issues like this when translating the world's languages with LLMs and tokenization. One of the biggest revolves around color perception, as in the first example, but some languages have no numbers, some have no relative directions (left/right, up/down, (for/back)ward), some distinguish third-person obviative from third-person proximate subjects (in other words, third-person referents who are further removed from a conversation get different pronouns than those who are closer to it), and more. Basically, given just how context-dependent language can be, without some way to give an LLM machine translator the context surrounding a phrase before translating it, accurate translation isn't going to be feasible for a lot of languages anytime soon.