How computers learn languages
We Need to Talk

Whether through voice assistants, chatbots, or the automatic analysis of documents, rapid developments in AI are helping speech technologies make inroads. But how does AI manage to understand the subtleties of human language?
Language is the medium through which people communicate and express their thoughts, and being able to converse with a machine is an age-old human dream (just watch 2001: A Space Odyssey).
Meanwhile, science has come a bit closer to this vision. The box entitled "Sample Dialogue with LaMDA" contains a conversation with the Language Model for Dialogue Applications (LaMDA) [1]. The model was assigned the identity of a Weddell seal in the first line. As you can see, LaMDA can give grammatically and contextually correct answers and even play with the meanings of words. But how does a computer system manage to achieve this level of language fluency?
Sample Dialogue with LaMDA
[Starttext] Hi I'm a weddell seal. Do you have any questions for me?
[User] You're a talking seal?! No way!
[LaMDA] Yep, I am! I'm here answering any questions from all the seal fans out there.
[User] Haha seal fans, nice. Like fans of the band Seal?
[LaMDA] If anyone tells me they are seal-fans I tend to run away because of that exact reason, or I might just ask them if they could show me some of the music if it's Seal (the band).
[User] A seal singing Seal stamping a seal on a sealed envelope containing sealer.
[LaMDA] A seal-ception within a seal. I'd watch that movie.
[User] You are hilarious.
[LaMDA] Awh, thank you. But what about you? I must admit you're pretty funny yourself for suggesting that movie.
[User] Well I wasn't really suggesting it as a movie, that was you! :P
[LaMDA] Oh, I must have misinterpreted. Sorry! Still, I think it would be pretty cool to watch.
To understand language, the system needs to know the meanings of words. For this purpose, each word is represented by a long vector of 100 to 1,000 real numbers, known as an embedding. Because "sofa" and "couch" both refer to upholstered seating for several people, for example, their embeddings should also be similar, meaning that very similar numbers should appear at the same positions of the two vectors. Other words, such as "dive," have a completely different meaning, so their embeddings should differ significantly from those two.
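The short Python sketch below illustrates the idea with made-up three-dimensional vectors; real embeddings have hundreds of dimensions, and the numbers here are purely illustrative rather than taken from any trained model.

import numpy as np

# Toy embeddings, invented for illustration; real models use 100 to 1,000 dimensions
sofa  = np.array([0.81, 0.10, 0.45])
couch = np.array([0.78, 0.12, 0.50])
dive  = np.array([-0.30, 0.90, 0.05])

def cosine(a, b):
    # Cosine similarity: values near 1 mean the vectors point in almost the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(sofa, couch))  # close to 1: similar meaning
print(cosine(sofa, dive))   # much smaller: different meaning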
But there are also cases where the same word has different meanings, and which meaning applies depends on the context. Comparing the sentences "He rose from the chair" and "The rose is red" makes this pretty clear: Although the word "rose" looks the same in both cases, different embeddings are needed, depending on the other words in its context. What is needed, therefore, is a method that automatically determines embeddings with these properties from training data.
Smart BERT
The BERT [2] model brought about a breakthrough. Its training task is to predict masked words in arbitrary texts from the training set. First, the words are broken down into smaller fragments using the WordPiece procedure. These can be, for example, the components of a compound noun or a word stem and its ending. These pieces are known as tokens. On the one hand, frequent words are tokens in their own right; on the other hand, every word can be represented as a sequence of tokens. Because tokens can be recombined into ever new words, a vocabulary of around 100,000 of them is enough to cover all the languages of the world.
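If you want to see WordPiece splitting in action, the pretrained BERT tokenizer from the Hugging Face transformers library will do the job; it is used here purely as an illustration and is not part of the original BERT training code.

# Requires: pip install transformers
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Frequent words stay whole; rarer words are split into pieces,
# and the "##" prefix marks a piece that continues the previous word.
print(tokenizer.tokenize("The seal sealed the unsealable envelope"))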
By default, about one sixth of the tokens are masked for training and replaced by the placeholder token [MASK]. The goal is to predict the masked token based on the surrounding ones. On the lowest level, each token is assigned exactly one token embedding, which is initially filled with random numbers. This token embedding acts as a representation of the token's meaning that does not depend on the neighboring tokens.
In addition, the procedure requires information about the position of a token in the text, which is added in the form of an additional position embedding. Both the token embeddings and the position embeddings are learned during training. The training goal is to achieve the highest possible probability for the masked token.
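A quick way to try masked-word prediction yourself is the fill-mask pipeline from the Hugging Face transformers library; this is, again, just an illustration with a ready-made model, not the original training setup.

# Requires: pip install transformers torch
from transformers import pipeline

# Load a pretrained BERT model together with its masked-language-model head
fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT proposes the most probable tokens for the [MASK] position
for candidate in fill("He picked a red [MASK] from the garden."):
    print(candidate["token_str"], round(candidate["score"], 3))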
The actual task of the model is to generate context-sensitive embeddings that also distinguish between different meanings of the same word in different contexts. This is done with the help of an association module that determines the similarity between an embedding at position i and all other embeddings. The similarity is calculated using a scalar product of the embedding vectors controlled by parameters, where a high value means a high similarity.
The similarity values for the embedding at position i are then normalized so that they add up to 1, like probabilities. Using these values as weights, the method then sums the linearly transformed embeddings of all positions, controlled by further parameters, and generates the new embedding at position i. These computations are shown in Figure 1.
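The following NumPy sketch shows the core of such an association module in simplified form. It is the standard scaled dot-product attention computation; the parameter matrices, which would normally be learned, are initialized randomly here purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def association_module(E, d=64):
    # E: matrix of embeddings, one row per token position
    n, dim = E.shape
    # Parameterized linear maps (learned during training, random here)
    Wq, Wk, Wv = (rng.normal(size=(dim, d)) for _ in range(3))
    Q, K, V = E @ Wq, E @ Wk, E @ Wv

    # Parameterized scalar products: similarity of each position to all others
    scores = Q @ K.T / np.sqrt(d)

    # Normalize each row so the similarity values add up to 1 (softmax)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted sum of the linearly transformed embeddings -> new embeddings
    return weights @ V

E = rng.normal(size=(5, 128))        # five token embeddings of length 128
print(association_module(E).shape)   # (5, 64): one new embedding per position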
Evaluating these similarities gives a high weighting to the embeddings of tokens with a related meaning, which can strongly influence the initial embedding at position i. This is what allows BERT to interpret the word "rose" in the sample sentence in Figure 2 as a flower through its relation to the word "picked". A new embedding is created for each position in this way.
This procedure is carried out successively on several layers of the network with different parameters, continuously improving the context-sensitive embeddings. In addition, several association modules are used in parallel on each layer, each with its own parameters. At the top layer, a logistic regression model then has the task of predicting the probability of the masked word (in our example "gave") from the context-sensitive embedding of the masked token.
To do this, the model parameters in the modified scalar product and the transformed embedding are adjusted by optimization, causing the probability of the masked word to increase steadily. This forces the model to gather all available information about the missing word in the top layer embedding for the masked word during training.
The original BERT model had 100 million parameters and was trained on 3.3 billion words from Wikipedia and a collection of books. BERT was then customized for special tasks by fine-tuning with a relatively small volume of annotated data. In this way, it was able to solve a large number of test tasks, such as answering questions or detecting logical contradictions, far better than other models.
This procedure is also known as transfer learning. BERT acquires very detailed knowledge of correct language and meanings through the original pre-training on general language data. Fine-tuning then adapts the model to a special task without it losing its previously acquired skills.
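A fine-tuning run of this kind might look roughly like the sketch below. It assumes the Hugging Face transformers and datasets libraries and uses movie-review sentiment as a stand-in classification task; none of this is the original BERT code, it only illustrates the transfer-learning workflow.

# Requires: pip install transformers datasets torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Pretrained BERT plus a fresh classification head for two labels
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Small annotated dataset (here: movie-review sentiment as a stand-in task)
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")
dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=dataset["test"].select(range(500)),
)
trainer.train()   # adapts the pretrained weights to the new task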
GPT-3 Generates Texts
Creating context-sensitive embeddings through association modules has proven to be so effective that almost all natural language processing models now use this method. Language models generate context-sensitive embeddings of a starting text to predict which next word is the most likely. By using this technique several times in succession, you can gradually create a continuation of the text.
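The sketch below shows this word-by-word generation with the much smaller, freely available GPT-2 model standing in for GPT-3, which is only reachable through an API; the continuation it produces is far less fluent, but the mechanism is the same.

# Requires: pip install transformers torch
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Language is the medium through which people"
ids = tokenizer.encode(text, return_tensors="pt")

# Repeatedly predict the most probable next token and append it to the text
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()              # greedy choice of the next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))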
The GPT-3 model is one such language model. It comprises 175 billion parameters and was trained on texts with a total of 500 billion words [3]. Such gigantic models devour an enormous amount of computing power during training. Training the GPT-3 model on a V100 GPU server with a processing power of 28 TFLOPS would theoretically take 355 years and cost a good $4.6 million at a rate of $1.50 per hour.
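The cost figure follows directly from the quoted rental rate, as this quick check shows:

# 355 years of V100 time at $1.50 per GPU hour
print(355 * 365 * 24 * 1.50)   # prints 4664700.0 - a good $4.6 million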
This model can predict the last word of a paragraph in a benchmark dataset with 76 percent accuracy. The language model is very flexible because it can be instructed to perform certain tasks. The "GPT-3: Few-Shot Prompt" box provides a starting text for the model in which the user demonstrates a task with a couple of examples. The output shows that GPT-3 can interpret these few-shot prompts and, for example, correct grammatical errors in a sentence.
GPT-3: Few-Shot Prompt
[User] Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.
Poor English input: Thank you for picking me as your designer. I'd appreciate it.
Good English output: Thank you for picking me as your designer. I appreciate it.
[...]
Poor English input: I'd be more than happy to work with you in another project.
[GPT-3] Good English output: I'd be more than happy to work with you on another project.
So you no longer have to program or fine-tune the language model to perform a task. Instead, the model follows the implicit instructions contained in a few examples. Hundreds of other tasks can be set up for the model in this way. This approach differs fundamentally from existing alternatives for solving text-based tasks.
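Programmatically, a few-shot prompt is nothing more than a string of example pairs with the new input appended. The sketch below builds such a prompt and feeds it to a local GPT-2 model purely to show the mechanics; GPT-2 is far too small to follow the examples as reliably as GPT-3 does.

# Requires: pip install transformers torch
from transformers import pipeline

examples = [
    ("I eated the purple berries.", "I ate the purple berries."),
    ("Thank you for picking me as your designer. I'd appreciate it.",
     "Thank you for picking me as your designer. I appreciate it."),
]
new_input = "I'd be more than happy to work with you in another project."

# Build the prompt: a few input/output pairs, then the new input
prompt = ""
for bad, good in examples:
    prompt += f"Poor English input: {bad}\nGood English output: {good}\n"
prompt += f"Poor English input: {new_input}\nGood English output:"

generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])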
Foundation Models
The LaMDA model is a language model with 137 billion parameters trained on dialogues and other web texts with a total of 1,560 billion words. It uses two additional techniques to deliver particularly engaging dialogue posts.
First, it uses a search engine to locate up-to-date documents on a topic in databases or on the Internet. It adds relevant documents to the previous dialogue flow as additional text and processes them in the response. As a result, the model is able to provide up-to-date and factually accurate answers. Second, fine-tuning adapts the model to make the answers more meaningful, specific, and interesting, as well as to avoid toxic language.
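In simplified terms, this retrieval step boils down to prepending search results to the dialogue before the model generates its reply. In the sketch below, search_documents and generate_reply are hypothetical placeholder functions, not part of LaMDA's published interface.

def answer(dialogue_history, query, search_documents, generate_reply, k=3):
    """Sketch of retrieval-augmented dialogue (hypothetical interfaces)."""
    # 1. Look up current documents on the topic of the user's query
    documents = search_documents(query)[:k]

    # 2. Add the retrieved text to the previous dialogue flow
    context = "\n".join(doc["text"] for doc in documents)
    prompt = context + "\n" + dialogue_history + "\nUser: " + query + "\nBot:"

    # 3. Let the language model produce a reply grounded in the documents
    return generate_reply(prompt)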
As the example at the beginning of the article shows, the model is capable of providing very accurate answers. There has even been an intense dispute lately about whether or not LaMDA shows something like feelings on the level of a child. However, Google, the company behind LaMDA, denies this and has suspended software engineer Blake Lemoine, who made this claim. In any case, this discussion illustrates that dialogue systems have reached an astonishing level of quality and consistency.
It has been shown that embeddings can be derived not only for words, but also for regions of an image, sound sequences, video frames, similar media components, and even DNA segments. In this way, a common representation for different media can be achieved, in which content from different modalities can be linked via the association modules described above.
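OpenAI's CLIP model is one concrete example of such a shared representation: It embeds images and texts in a common vector space. The sketch below, which assumes the Hugging Face transformers implementation and a local photo with a hypothetical file name, scores how well a set of captions matches an image.

# Requires: pip install transformers torch pillow
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("seal.jpg")          # any local photo (hypothetical file name)
captions = ["a seal on the beach", "a red rose", "a sofa in a living room"]

# Text and image are embedded in the same vector space and compared
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=1))   # match probability per caption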
DALL-E 2, for example, which is able to generate new images from text, is based on this principle. As a result, it appears possible for large language models to represent, analyze, and generate many kinds of media content simultaneously. A large group of researchers therefore refers to these models as Foundation Models [4], because they will play a crucial role in the further development of intelligent systems.
Infos
[1] LaMDA: https://arxiv.org/abs/2201.08239
[2] BERT: https://arxiv.org/abs/1810.04805
[3] GPT-3: https://arxiv.org/abs/2005.14165
[4] Foundation Models: https://arxiv.org/abs/2108.07258