Developing an AI system with Markov chains
Sure-Fire Success
Markov chains model systems that jump from state to state with predetermined probabilities, but can they help write new columns like this one after learning from previously written articles?
Hard to believe, but true: This edition of my programming column marks its 20-year anniversary. Lately, I've been diving into artificial intelligence topics, a field that is increasingly replacing traditional jobs, but I wonder if a machine could possibly one day replace authors like me? An AI system would certainly have a number of advantages over a human writer, because a robot would no doubt deliver the manuscript on time every time, to the amazement of my editors, who are not fond of my bad habit of watching deadlines swoosh by. Also, I could afford to work less hard and have more time to kick back, relax, and spend my days pursuing my surfing hobby at various Pacific beaches (Figure 1).
Moody like the Weather
An algorithm that writes columns automatically could be implemented as a so-called Markov chain, named after the Russian mathematician Andrei Markov. Markov chains are stochastic processes with different states that change according to predetermined probabilities.
The weather forecasting model shown in Figure 2, for example, says the chance of a sunny day following a "Sun" state is 65 percent. At 30 percent, the weather will change to the "Clouds" state, and at 5 percent, it goes directly to the "Rain" state. The probabilities apply only to the current state; a change to the next state does not require knowledge of any event in the past, so the probability of a transition is very easy to calculate.
In a transition matrix such as that shown in Figure 2, the computer only needs to select the row with the current state, such as the third row for the Clouds state (C), to find out that there is a 25 percent chance of the weather being sunny, a 25 percent chance of rain, and a 50 percent chance of skies remaining cloudy. Equipped with corresponding random generators, the model jumps back and forth between states, and a later analysis of where this Monte Carlo method leads eventually allows conclusions to be drawn about the most likely weather in the future.
Sentences from the Machine
Such "random walks" can predict not only the weather, they can generate sequences of words that sound like they were written by a human author. For this purpose, a program first analyzes the word sequences used in an extensive body of text. It notes which words follow one another in a sentence, and pieces together a new text from learned phrases. For example, it might realize that, in corpus, three given words are sometimes followed by word A and sometimes by word B. When generating new prose, the algorithm will roll an imaginary die each time to determine whether the three initial words are followed by A or B, according to the probabilities in which they occur in the original text.
To start off, the algorithm separates the body of text into words (tokens) and then analyzes it step by step with a rolling window of words. The first (n – 1) words of the window determine the state of the machine at the current time, and the last word in the window is the value the machine assumes during the transition to the next state.
Rolling Window
For example, for a window size of n=3, from the text
Humpty Dumpty sat on a wall,
Humpty Dumpty had a big fall.
it extracts the sequences "Humpty Dumpty sat," "Dumpty sat on," "sat on a," and so on. The first two words (Humpty Dumpty) respectively determine the current state of the machine; the following word ("sat") is the next state value. In the second part of the nursery rhyme, the algorithm stumbles again on the "Humpty Dumpty" state. However, it sees not the word "sat," but rather the word "had" following the current state. When it later creates random text, the algorithm now has two options: It can either follow up with "sat," or it can choose "had," and it will do so according to calculated random values. The result is text that shows slight human qualities because it copies word sequences, but occasionally performs crazy jumps between contexts, which can sound either confusing or funny.
The oldest Programming Snapshot in Linux Pro Magazine dates back to 2003 (although the German edition started in 1997), and Listing 1 [1] reads all raw manuscripts cobbled together of every single issue since then in the file all.txt
. The file name is first passed to the generate_from()
function in line 68. The statemap_gen
function starting in line 37 then splits the file into tokens with token_feed()
and passes on the resulting list of the window()
function from line 26, which accepts n=4 as the window width. As explained previously, the status of the Markov chain in this case is made up of the three last words; the fourth is part of the result state.
Listing 1
randomtext.py
To prepare the tokenizer of the nltk
package, it needs to be installed with pip3 install nltk
. Moreover, the tokenizer dataset punkt still needs to be downloaded separately in a Python shell (Figure 3). The tokenizer interprets punctuation marks, such as commas or colons (the regular expression in line 7 defines them all), as individual tokens. This approach later increases the readability and entertainment value of the random text.
Figure 4 shows how the algorithm generates a state file of the Markov chain from the nursery rhyme and the window length n=3. Note how the statemap generated offers both "sat" and "had" as a follow-up state for "Humpty Dumpty":
('Humpty', 'Dumpty'): ['sat', 'had']
The Python code shows some of the spice of the language, such as the yield()
operator in the token_feed()
function in line 16: So that Python can iterate over the components of an object using a for
loop, it must be of type Iterable
. These objects do not spill out their elements in one go, but they pass them through one piece at a time with each call to yield()
.
The advantage of this process is that the script does not need to store all of the data in a construct in memory; rather, it can get away with only calculating the data when needed. Therefore, an object can contain an infinite number of elements; as long as they are never needed all at once, the illusion stays alive despite limited memory.
Python also provides a variety of data structures. Double-ended queues, called deque
, as used in the window()
function in lines 26-35, can be modified really fast if the code limits itself to adding or removing elements to or from the head or tail of the queue. The only flaw is that the caller of the window()
function cannot manipulate the deque
object returned, or the algorithm gets confused.
For this reason, line 41 uses key=list(state)
to copy the elements of the deque
object into a Python list, which the statemap_gen()
function can modify to its heart's content. It interprets the first (n – 1) elements as a Markov state, and the last item as a possible next state.
In line 43, key
is a Python tuple that can no longer be modified. The dictionary statemap
maps such tuples to one or more follow-up state values. In this way, the algorithm can later look up possible follow-up state values quickly and select one randomly with random.choice()
in line 60, leading the text on unpredictable paths and imbuing the algorithm with human-like traits.
The result can be seen in Figure 5. However, it demonstrates that the applied Markov chain procedure does not automatically provide exciting new magazine articles. The grammar is still lacking enormously, and the content doesn't really make any sense. This column will probably have to be created manually for the foreseeable future.
Infos
- Listing for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/205/
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Rhino Linux Announces Latest "Quick Update"
If you prefer your Linux distribution to be of the rolling type, Rhino Linux delivers a beautiful and reliable experience.
-
Plasma Desktop Will Soon Ask for Donations
The next iteration of Plasma has reached the soft feature freeze for the 6.2 version and includes a feature that could be divisive.
-
Linux Market Share Hits New High
For the first time, the Linux market share has reached a new high for desktops, and the trend looks like it will continue.
-
LibreOffice 24.8 Delivers New Features
LibreOffice is often considered the de facto standard office suite for the Linux operating system.
-
Deepin 23 Offers Wayland Support and New AI Tool
Deepin has been considered one of the most beautiful desktop operating systems for a long time and the arrival of version 23 has bolstered that reputation.
-
CachyOS Adds Support for System76's COSMIC Desktop
The August 2024 release of CachyOS includes support for the COSMIC desktop as well as some important bits for video.
-
Linux Foundation Adopts OMI to Foster Ethical LLMs
The Open Model Initiative hopes to create community LLMs that rival proprietary models but avoid restrictive licensing that limits usage.
-
Ubuntu 24.10 to Include the Latest Linux Kernel
Ubuntu users have grown accustomed to their favorite distribution shipping with a kernel that's not quite as up-to-date as other distros but that changes with 24.10.
-
Plasma Desktop 6.1.4 Release Includes Improvements and Bug Fixes
The latest release from the KDE team improves the KWin window and composite managers and plenty of fixes.
-
Manjaro Team Tests Immutable Version of its Arch-Based Distribution
If you're a fan of immutable operating systems, you'll be thrilled to know that the Manjaro team is working on an immutable spin that is now available for testing.