How computers learn languages
We Need to Talk

Whether through voice assistants, chatbots, or the automatic analysis of documents, rapid developments in AI are helping speech technologies make inroads. But how does AI manage to understand the subtleties of human language?
Language is the medium through which people communicate and express their thoughts. Communicating with a machine in natural language is a long-standing dream of mankind (just watch 2001: A Space Odyssey).
Meanwhile, science has come a bit closer to this vision. The box entitled "Sample Dialogue with LaMDA" contains a conversation with the Language Model for Dialogue Applications (LaMDA) [1]. In the first line, the model was assigned the identity of a Weddell seal. As you can see, LaMDA gives grammatically and contextually correct answers and even plays with the meanings of words. But how does a computer system manage to achieve this kind of language fluency?
Sample Dialogue with LaMDA
[Start text] Hi I'm a weddell seal. Do you have any questions for me?
[User] You're a talking seal?! No way!
[LaMDA] Yep, I am! I'm here answering any questions from all the seal fans out there.
[User] Haha seal fans, nice. Like fans of the band Seal?
[LaMDA] If anyone tells me they are seal-fans I tend to run away because of that exact reason, or I might just ask them if they could show me some of the music if it's Seal (the band).
[User] A seal singing Seal stamping a seal on a sealed envelope containing sealer.
[LaMDA] A seal-ception within a seal. I'd watch that movie.
[User] You are hilarious.
[LaMDA] Awh, thank you. But what about you? I must admit you're pretty funny yourself for suggesting that movie.
[User] Well I wasn't really suggesting it as a movie, that was you! :P
[LaMDA] Oh, I must have misinterpreted. Sorry! Still, I think it would be pretty cool to watch.
To understand language, the system needs to know the meaning of words. For this purpose, each word is represented by a long vector of 100 to 1,000 real numbers, known as an embedding. Because "sofa" and "couch" both refer to upholstered seating furniture for several people, for example, their embeddings should also be similar: Very similar numbers should appear at the same positions of the two vectors. Words with a very different meaning, such as "dive," should in turn receive very different embeddings.
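In practice, the similarity of two embeddings is usually measured with the cosine of the angle between the vectors. The following sketch uses tiny, made-up three-dimensional vectors purely for illustration; real embeddings have hundreds of dimensions and come from a trained model.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # values near 1 mean very similar, values near 0 mean unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up toy embeddings, for illustration only
sofa = np.array([0.81, 0.62, 0.10])
couch = np.array([0.79, 0.65, 0.12])
dive = np.array([-0.20, 0.15, 0.95])

print(cosine_similarity(sofa, couch))   # close to 1: related meaning
print(cosine_similarity(sofa, dive))    # much smaller: unrelated meaning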
But there are also cases where the same word has different meanings; which meaning comes into play depends on the context. Comparing the sentences "He rose from the chair" and "The rose is red" makes this pretty clear: Although the word "rose" looks the same in both cases, different embeddings are needed, depending on the other words in its context. This requires a method that automatically determines embeddings with these properties from training data.
Smart BERT
The BERT model [2] brought about a breakthrough. Its training task is to predict masked words in arbitrary texts of the training set. First of all, the words are broken down into smaller word fragments using the WordPiece procedure; these can be the components of a compound noun, for example, or a word stem and its ending. These pieces are known as tokens. Frequent words are tokens in their own right, and every other word can be assembled from tokens. Because the tokens can be combined into ever new words, around 100,000 of them are enough to cover all the languages in the world.
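The effect is easy to observe with an existing WordPiece vocabulary. This sketch assumes the Hugging Face transformers library and the pretrained "bert-base-uncased" vocabulary, which are not part of the article's setup but freely available:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Frequent words remain single tokens; rare or compound words are split
# into smaller pieces, which are marked with a leading "##".
print(tokenizer.tokenize("A seal singing Seal on a sealed envelope"))
print(tokenizer.vocab_size)   # size of the WordPiece vocabulary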
By default, about one sixth of the tokens are masked for training and replaced by the special [MASK] token. The goal is to predict the masked token from the surrounding ones. Each token is assigned exactly one token embedding at the lowest level, which is initially filled with random numbers. This token embedding acts as a representation of the token's meaning that does not depend on the neighboring tokens.
In addition, the procedure requires information about the position of a token in the text, which is added in the form of an additional position embedding. Both the token embeddings and the position embeddings are learned during training. The training goal is to assign the highest possible probability to the masked token.
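A quick way to see this objective in action is to let an already trained BERT model fill in a masked token. The sketch below assumes the Hugging Face transformers library and the "bert-base-uncased" checkpoint; it shows inference with a finished model, not the training itself.

from transformers import pipeline

# Ask a pretrained BERT model for the most probable replacements of [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("He rose from the [MASK] and left the room."):
    print(candidate["token_str"], round(candidate["score"], 3))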
The actual task of the model is to generate context-sensitive embeddings that also distinguish between different meanings of the same word in different contexts. This is done with the help of an association module that determines the similarity between the embedding at position i and all other embeddings. The similarity is calculated as a parameter-controlled scalar product of the embedding vectors, where a high value means high similarity.
The similarity values for the embedding at position i are then normalized so that they add up to 1 and can be read as probabilities. Using these values as weights, the method sums the embeddings of all positions, each linearly transformed as a function of further parameters, and thus generates the new embedding at position i. These computations are shown in Figure 1.
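In the research literature this association module is known as (self-)attention. The following is a minimal NumPy sketch of the computation just described; the parameter matrices would normally be learned during training and are simply random here.

import numpy as np

def association_module(E, Wq, Wk, Wv):
    # One association (self-attention) step over the embeddings E (tokens x dim)
    Q, K, V = E @ Wq, E @ Wk, E @ Wv              # parameter-controlled transforms
    scores = Q @ K.T / np.sqrt(K.shape[1])        # scalar-product similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True) # normalize: each row sums to 1
    return weights @ V                            # weighted sum -> new embeddings

rng = np.random.default_rng(0)
dim = 8                                           # toy embedding size
E = rng.normal(size=(5, dim))                     # embeddings of 5 tokens
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
print(association_module(E, Wq, Wk, Wv).shape)    # (5, 8): one new embedding per token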
Evaluating similarity leads to a high weighting of embeddings for tokens with a related meaning, which can strongly influence the initial embedding at position i. This then allows BERT to interpret the word "rose" in the sample sentence in Figure 2 as a flower through its relation to the word "picked". A new embedding is created for each position in this way.
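This context dependence can be checked directly with a pretrained BERT model: The top-layer vector for "rose" comes out differently depending on its neighbors. A sketch, again assuming the transformers library and PyTorch:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    # Return the top-layer, context-sensitive embedding of `word` in `sentence`
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

rose_verb = embedding_of("he rose from the chair", "rose")
rose_flower = embedding_of("the rose is red", "rose")

# The same word gets clearly different vectors in the two contexts
print(torch.cosine_similarity(rose_verb, rose_flower, dim=0).item())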
This procedure is carried out successively on several layers of the network with different parameters, continuously improving the context-sensitive embeddings. In addition, several association modules are used in parallel on each layer, each with its own parameters. At the top layer, a logistic regression model then has the task of predicting the probability of the masked word (in our example "gave") from the context-sensitive embedding of the token.
To do this, the model parameters in the modified scalar product and in the embedding transformations are adjusted by optimization so that the probability of the masked word increases steadily. This forces the model to gather all available information about the missing word in the top-layer embedding at the masked position during training.
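A single step of this training loop could look roughly as follows. The sketch assumes PyTorch and the BertForMaskedLM class from the transformers library and uses one toy sentence; real training runs over billions of words.

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

inputs = tokenizer("She picked a rose in the garden.", return_tensors="pt")
input_ids = inputs["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
pos = tokens.index("rose")                     # position of the token to mask

labels = torch.full_like(input_ids, -100)      # -100: ignore these positions in the loss
labels[0, pos] = input_ids[0, pos]             # true token as the training target
input_ids[0, pos] = tokenizer.mask_token_id    # replace the token with [MASK]

loss = model(**inputs, labels=labels).loss     # negative log-probability of "rose"
loss.backward()                                # compute gradients for all parameters
optimizer.step()                               # one optimization step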
The original BERT model had around 100 million parameters and was trained on 3.3 billion words from Wikipedia and a collection of books. BERT was then customized for special tasks by fine-tuning with a relatively small volume of annotated data. In this way, it was able to solve a large number of test tasks, such as answering questions or detecting logical contradictions, far better than previous models.
This procedure is also known as transfer learning. BERT acquires very detailed knowledge of correct language and meanings through the original pre-training on general language data. Fine-tuning then adapts the model to a special task without it losing its previously acquired skills.
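As an illustration of fine-tuning, the sketch below puts a small classification head on top of a pretrained BERT and adjusts it with a toy batch of annotated sentences. It assumes PyTorch and the BertForSequenceClassification class from the transformers library; the two example sentences and their labels are made up.

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["The couch is wonderfully comfortable.", "This sofa is falling apart."]
labels = torch.tensor([1, 0])                  # toy sentiment labels: 1 positive, 0 negative

batch = tokenizer(texts, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss      # classification loss on the annotated data
loss.backward()
optimizer.step()                               # pretrained knowledge is kept and only
                                               # nudged toward the new task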
GPT-3 Generates Texts
Creating context-sensitive embeddings through association modules has proven to be so effective that almost all natural language processing models now use this method. Language models generate context-sensitive embeddings of a starting text in order to predict the most likely next word. By applying this step several times in succession, you can gradually create a continuation of the text.
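This generation loop can be sketched with the freely available GPT-2 model, a much smaller predecessor of GPT-3, again via the transformers library; each pass predicts one more token, which is appended to the text.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer.encode("The old dream of talking to machines", return_tensors="pt")
for _ in range(20):                        # generate 20 tokens, one at a time
    with torch.no_grad():
        logits = model(ids).logits         # scores for every possible next token
    next_id = logits[0, -1].argmax()       # greedily pick the most likely one
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))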
The GPT-3 model is one such language model. It comprises 175 billion parameters and was trained on texts with a total of 500 billion words [3]. Such gigantic models devour an enormous amount of computing power during training: Training the GPT-3 model on a V100 GPU server with a processing power of 28 TFLOPS would theoretically take 355 years and cost a good $4.6 million at a rate of $1.50 per hour.
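A quick back-of-the-envelope check, using only the figures quoted above, confirms the price tag:

gpu_hours = 355 * 365 * 24        # 355 years of V100 time, expressed in hours
cost = gpu_hours * 1.50           # at a rate of $1.50 per hour
print(gpu_hours, cost)            # roughly 3.1 million hours and about $4.7 million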
This model can predict the last word of a paragraph in a benchmark dataset with 76 percent accuracy. The language model is very flexible because it can be instructed to perform specific tasks. The "GPT-3: Few-Shot Prompt" box provides a starting text for the model in which the user gives instructions in the form of a couple of examples. The output shows that GPT-3 can interpret these few-shot prompts and, for example, correct grammatical errors in a sentence.
GPT-3: Few-Shot Prompt
[User] Poor English input: I eated the purple berries.
Good English output: I ate the purple berries.
Poor English input: Thank you for picking me as your designer. I'd appreciate it.
Good English output: Thank you for picking me as your designer. I appreciate it.
[...]
Poor English input: I'd be more than happy to work with you in another project.
[GPT-3] Good English output: I'd be more than happy to work with you on another project.
So you no longer have to program or fine-tune the language model to perform a task. Instead, the model follows the implicit instructions given by a few examples. Hundreds of other tasks can be set for the model in this way. This approach is completely different from previous alternatives for solving textual tasks.
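To illustrate how such a prompt is assembled programmatically, the sketch below feeds the few-shot examples from the box to the freely available GPT-2 model via the transformers library. GPT-2 is only a stand-in here; its corrections will be far less reliable than GPT-3's.

from transformers import pipeline

# Few-shot prompt assembled from the examples in the box above
prompt = (
    "Poor English input: I eated the purple berries.\n"
    "Good English output: I ate the purple berries.\n"
    "Poor English input: I'd be more than happy to work with you in another project.\n"
    "Good English output:"
)

generator = pipeline("text-generation", model="gpt2")
result = generator(prompt, max_new_tokens=20)[0]["generated_text"]
print(result[len(prompt):])      # only show the model's continuation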
Foundation Models
The LaMDA model is a language model with 137 billion parameters trained on dialogues and other web texts with a total of 1,560 billion words. It uses two additional techniques to deliver particularly engaging dialogue contributions.
First, it uses a search engine to locate up-to-date documents on a topic in databases or on the Internet. Relevant documents are added to the previous dialogue flow as additional text and processed in the response. As a result, the model is able to provide up-to-date and factually accurate answers. Second, fine-tuning adapts the model so that its answers become more meaningful, specific, and interesting, and so that it avoids toxic language.
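The retrieval step can be sketched as follows; search_documents() is a purely hypothetical helper standing in for whatever search engine or database lookup the system actually uses, and language_model stands for any text-generation function.

def search_documents(query):
    # Hypothetical helper: a real system would query a search engine or a
    # database here and return relevant text snippets.
    return ["(retrieved snippet with up-to-date facts about the topic)"]

def answer(dialogue, language_model):
    # Prepend the retrieved documents to the dialogue so the model can ground
    # its reply in current, factual text instead of relying only on training data.
    documents = "\n".join(search_documents(dialogue[-1]))
    prompt = documents + "\n" + "\n".join(dialogue) + "\nLaMDA:"
    return language_model(prompt)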
As the example at the beginning of the article shows, the model is capable of providing very accurate answers. There has even been an intense dispute recently about whether or not LaMDA shows something like feelings on the level of a child. Google denies this and has suspended software engineer Blake Lemoine, who made the claim. In any case, the discussion illustrates that dialogue systems have reached an astonishing level of quality and consistency.
It has been shown that embeddings can be derived not only for words, but also for regions of an image, sound sequences, video frames, similar media components, and even DNA segments. In this way, one can achieve a common representation for different media, with which content from different modalities can be linked via the tried-and-tested association modules.
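A sketch of this idea with the openly available CLIP model, which maps images and captions into a shared embedding space (CLIP is used here only as an accessible example and is not one of the models discussed above):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("seal.jpg")     # any local photo; the file name is just a placeholder
captions = ["a seal on the ice", "a red rose", "a comfortable sofa"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image   # similarity of the image to each caption
print(scores.softmax(dim=1))                    # probabilities over the three captions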
DALL-E 2, for example, which can generate new images from a text description, is based on this principle. As a result, it appears possible to represent, analyze, and generate many kinds of media content with large language models. A large group of researchers therefore refers to these models as Foundation Models [4], because they will play a crucial role in the further development of intelligent systems.
Infos
[1] LaMDA: https://arxiv.org/abs/2201.08239
[2] BERT: https://arxiv.org/abs/1810.04805
[3] GPT-3: https://arxiv.org/abs/2005.14165
[4] Foundation Models: https://arxiv.org/abs/2108.07258