Jeff Wang here with my fellow Meta AI colleague Angela Fan from No Language Left Behind, watching the comments flow through. If you want to ask us anything, go for it!
I currently host the largest collection of bilingual Manx[0] <-> English texts (~1MM words). How would I formally get in contact to chat about the steps needed to make machine translation available? (And would there be grant opportunities for further production of machine-readable data?)
Thank you for your exciting work and for coming onto HN to respond to questions.
I am a former professional translator (Japanese to English) and am now supervising research at the University of Tokyo on the use of machine translation in second-language education. As I have written in a few papers and essays [1], advances in MT have raised serious questions for language teachers. The ready availability of MT today, including on Facebook and Instagram, means that language students use it a lot while studying. We don’t know yet, though, how that use of MT might affect our students’ acquisition of other languages or their motivation to keep studying those languages.
One of the hurdles educators and researchers face is finding out how MT is being used in the real world. Most education in modern languages is focused on giving students language skills that they will be able to use later in work, education, and daily life, and textbooks and other learning materials are typically shaped around real-world situations. We are now struggling to adapt those materials for the age of MT, because data on the actual use of MT is very hard to get.
Like Google, Microsoft, Baidu, DeepL, and others, Meta must have huge amounts of data on how your users are using MT to communicate. Any information and insights about that MT usage that you can share with the world, just as you have generously shared your NLLB models, would be most welcome.
I've learned two languages with the help of MT. I'm sure you've interviewed people like me, but I get excited about the potential of MT for language learning, so I'd like to share my thoughts.
When I learned Spanish, I spent a lot of time chatting on Facebook with native speakers, using Google Translate as "training wheels" to help me formulate sentences and understand words and phrases I hadn't learned yet. It worked pretty well at the time (2012) except in cases of slang and typos that Google couldn't handle. I also used it a lot to help me translate blog posts from English to Spanish. Eventually, I graduated from the training wheels and was able to use Spanish fluently without the help of MT. More than once, while not using MT, I was told that I spoke Spanish with a "Google Translate accent", which I'm sure was more of a reference to my grammar than my accent, since my spoken practice was 100% with native speakers.
When I learned Hungarian (2019-now), at the beginning, Google Translate wasn't good enough to use for much more than getting a rough understanding of formal text, so I learned in a more traditional way at a school and with native speakers. Then the pandemic prevented me from doing both of those. I started chatting with native speakers on Facebook, but it was very difficult without MT and involved a lot of asking my conversation partners for translations and explanations. Progress was frustratingly slow. Then I discovered DeepL's MT, which was extremely good with Hungarian. I started using it for chat conversations and emails, and people were shocked that I was managing to communicate with them so fluently. My progress at actually learning the language for myself accelerated dramatically. I've become conversational (B2/C1) in Hungarian in 2.5 years with very little in-person practice. Often, it takes native English speakers 5 years of in-person practice to reach that level. I'm convinced that MT played a key role in my ability to learn quickly.
When I use MT, I have a simple rule: I have to understand each word of a translation before I send it. So I carefully read the translation, making sure that I understand each word. Sometimes that means I have to look up individual words/grammar before sending a message (I often use wiktionary for that, because it shows etymology), and other times, it means that I'll replace unfamiliar words or phrases in a translation with words and phrases from my own vocabulary. Over time I rely on MT less and less because my own vocabulary becomes stronger. I really believe that the key to learning a language quickly is to start USING the language as quickly as possible. Once you're using a language, your brain automatically starts picking up the skill. With traditional language learning, using a language can be very difficult in the beginning until you've reached a conversational level, but with MT, you can start using a language before you know everything.
For Spanish, I almost never use MT anymore. Sometimes I use it as a quick dictionary for an unfamiliar word, but my Spanish level is C2 and I use Spanish every day so it feels natural. I'm not ever translating in my head anymore.
For Hungarian, I'm still using MT often, but I don't need it during conversation (either written or spoken). Besides using it to translate things I don't know, I also find it useful for inputting Hungarian characters that are a pain to type with my US keyboard, and for conjugating words correctly when I know the root but am struggling for the correct ending. Often I'll know what I want to say in Hungarian, but I'll open DeepL and type in English, then adjust the translation to use the words I want before I copy and paste the Hungarian. I'm essentially using MT as a guide to help me craft my sentences even when I know what I want to say.
In summary, MT is awesome for language learning and for supporting language skills that are still developing.
Thank you! Those are really interesting and valuable comments. I haven’t, in fact, heard many stories like yours, especially with such clear insights about how you have been able to use MT constructively in your language learning.
Most of the discussions I’ve had about MT have focused on language learning in school contexts. In Japan and most other countries (though often not in English-speaking countries), all children have to study at least one foreign language in school. As with all compulsory education, low motivation and poor study skills are a constant problem. In such contexts, MT seems to many teachers and students just to be a way to cheat on classwork. And since very few of today’s veteran teachers were able to use MT when we were young and studying languages, we don’t understand how we can guide even our motivated students on using it productively. I will be sure to share your insights with my colleagues.
A couple of comments:
> More than once, while not using MT, I was told that I spoke Spanish with a "Google Translate accent", which I'm sure was more of a reference to my grammar than my accent....
This can happen with traditional methods of language learning, too. The language of textbooks, like the output of MT, usually reflects the standard written language, which can be very different from how people actually speak, especially in the case of languages with large dialect and register variations.
> When I use MT, I have a simple rule: I have to understand each word of a translation before I send it.
That sounds like an excellent rule. I will pass on that advice to educators I know who are trying to figure out how to guide their students on the use of MT.
If your goal is to make inclusive translation more widely available why license the models under a non-commercial license? This basically makes it impossible to use legally (or at least without a lot of legal risk) for essentially anyone due to the vague definition of what's commercial. Is Facebook hurting for money and looking to commercially license this model on request?
This enables any researcher to use our code freely and build on top of it for their own research. We are not intending to commercially license our project.
If your aim is to make this technology more widely available and, as you claim, "give people the opportunity to access and share web content in their native language, and communicate with anyone, anywhere, regardless of their language preferences", then why make it so that the model essentially can't be used for anything useful? It doesn't really make any sense.
Even the use case you're promoting on your front page, the Wikimedia Foundation's Content Translation tool, is illegal under the non-commercial license in certain jurisdictions! For example, see here: https://www.techdirt.com/2014/03/27/german-court-says-creati...
Even using it for research would be illegal as it's also not exactly "personal use".
Hey Jeff, I'm a native speaker of Dhivehi, the language spoken by the people of the Maldives. Since I couldn't find a full list of supported languages, I was wondering whether Dhivehi is or will be integrated.
Dhivehi is currently not supported, unfortunately. We view this as a starting point and are committed to expanding to many other languages, in the spirit of our project's name.
I'm curious how much work it takes to prepare training data for a language. From anecdotal experience, I've always been able to learn some basic survival skills in a new language by studying the translations of about 20 key phrases for a week or so, which give me the ability to combine them into a few hundred different phrases and survive most daily transactions. So I always imagine that training a language model is similar, just on a much larger scale. It seemed to me that there could be a standard text that includes a lot of important topics and contexts, which just needs to be manually translated into a target language and then fed to the model. I imagine it being about the size of a large book, so I imagine that adding a new language to a model would cost a similar amount to paying to have a book translated. Obviously the size of the input text would have an effect on how good the model's translations are, and domain specific translations would require more specific input. While having a full translation of an entire library seems like a good way to train a model that's used to translate everything, it seems like a small percentage of the library would be enough to produce native-level translations for most domains.
How far off are my intuitions on this? What are the costs of adding a new language to a model like this? Is there a ballpark dollar amount per language?
Without any supervised training data, it's pretty difficult to create a very good translation model. For many languages, data might only be available from religious texts such as the Bible, or not available at all. We created a dataset called NLLB-Seed for this reason: it's approximately 6,000 sentences, available for 39 languages, translating a broad set of topics from Wikipedia. We found that with a dataset like NLLB-Seed, we have sufficient supervised signal to jumpstart our automatic dataset creation pipeline. Of course, the more high-quality aligned data the better the model performance, but our project explores how we can make models more efficient at learning even when the training data is small.
Importantly, models can learn from other languages that are similar. If we train separate models for each direction on small amounts of data, the performance is significantly worse than grouping languages into one large multilingual model.
These initiatives are always couched in "inclusion" rhetoric (the very name of your project is telling); I don't doubt for a second that it's a genuine sentiment, but I strongly suspect your team hasn't thought through the full, self-defeating implications of universal language translation.
The problem is that it increases the risk of monoculture to 100%. Without language barriers, cultural diversity is lost, not gained, since you have winner-take-all effects[0]. Instead of helping revive languages, it'll make American ideas, mores, morality (Puritanism), philosophies, and political values more dominant worldwide.
To be clear, this will increase economic opportunity, but will inevitably kill cultural diversity.
Hi, I'm putting together an online event called 31 Days of AI for Book-Lovers to coincide with US National Book Month, October 2022. I was struck by the specific call-out to translating literature on your demo page and would like to feature a specifically book-related application of NLLB on one of the 'anchor days'. Can someone work with me on this?
Hi, I'm looking but can't seem to find instructions on how to do tokenization. Where is the SPM model? Is it "flores200_sacrebleu_tokenizer_spm.model" or something else? Is the pipeline direct, or spm -> dict? And how do you prime the model for a specific language pair?
All translation directions are direct from language X to language Y, with no intermediary. We evaluate the quality through 40,602 different translation directions using FLORES-200. 2,440 directions contain supervised training data created through our data effort, and the remaining 38,162 are zero-shot.
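To sketch the mechanics behind the tokenization question above: below is a minimal, unofficial example of applying the FLORES-200 SentencePiece model with the sentencepiece library, and of priming a specific language pair through the Hugging Face port of an NLLB checkpoint. The checkpoint name, local file path, and language codes are assumptions for illustration, not an official recipe, and the API details may differ by library version.

    # Minimal sketch, not an official NLLB recipe. Assumes sentencepiece,
    # transformers, and PyTorch are installed, the FLORES-200 SPM file has been
    # downloaded locally, and a Hugging Face NLLB checkpoint is available
    # (the checkpoint name below is an assumption).
    import sentencepiece as spm
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # 1) Raw subword tokenization with the FLORES-200 SentencePiece model.
    sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")
    print(sp.encode("Hello, world!", out_type=str))  # list of subword pieces

    # 2) Priming a language pair (eng_Latn -> fra_Latn). The HF tokenizer wraps
    #    the same SPM model plus the dictionary, so no manual spm -> dict step.
    tokenizer = AutoTokenizer.from_pretrained(
        "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
    )
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

    inputs = tokenizer("Hello, world!", return_tensors="pt")
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["fra_Latn"],  # target language
        max_length=64,
    )
    print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])

The key idea is that the target language is selected by forcing its language token at the start of the decoder output, which is how one multilingual model serves all translation directions.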
I gained a deeper understanding of what it truly means to be inclusive. Every language is unique, just like every person, and making sure content works for all and includes as many people as possible is really hard, but through this project I'm hopeful we are taking it one step further.