Putting free digital assistants to the test

Friend and Helper

Lead Image © iofoto, 123RF.com

Article from Issue 192/2016

Researchers from the University of Michigan have built an intelligent personal assistant akin to Siri and Cortana from free components. Although the Sirius Project focuses on the server load created by digital assistant software, we are interested in the usability of Sirius and its successor Lucida.

What does the chief engineer of a spaceship in the 23rd century do to operate a computer from the 20th century? He picks up the mouse and says, "Hello, computer" (Star Trek IV: The Voyage Home, Paramount Pictures, 1986). During his journey through time, Montgomery "Scotty" Scott nonetheless had to hit the keys eventually.

Owners of modern smartphones, on the other hand, can go a long way with "OK, Google," "Hey, Siri," or "Hey, Cortana"; the speech assistants understand many questions or instructions formulated in everyday language. You can only guess how many algorithms are behind the proprietary marvels.

Things are quite different with the open source intelligent personal assistant Sirius [1], which was developed in 2015 by the research group Clarity Lab at the University of Michigan [2]. The software, published under the BSD license, bundles together the free speech recognition systems CMU Sphinx [3] (PocketSphinx and Sphinx4), Kaldi [4], image recognition based on OpenCV [5], the question-answering system OpenEphyra [6], and UC Berkeley's deep learning framework Caffe [7]. A Wikipedia dump forms the basis for OpenEphyra's data corpus. With aid from all of these components, Sirius is in a position to answer typed or spoken questions and to recognize objects in images (Figure 1).

Figure 1: Components of the Sirius intelligent personal assistant (based on an image at the Clarity Lab website [1]).

The developers at Clarity Lab formulated the aim of the software in an abstract [8] for the Sirius tutorial that took place during the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-20). They proceed from the assumption that the demand for intelligent personal assistants (IPAs) will increase in the future and ask what server architectures will have to look like to handle the workload of these programs. Because of a lack of open source IPAs to calculate the load, they developed Sirius so they could represent the resource requirements realistically.

How does Sirius fare in practical use? Is the program a suitable helper on the Linux desktop? The test team considered these questions and carefully examined Sirius and its successor, Lucida [9]. They installed the software on Ubuntu 14.04 and Ubuntu 16.04, used the Sirius speech recognition, tested its question-answering system, and scrutinized its image recognition abilities. Lucida is not yet as far along. So far, its demo version only offers a simple question-and-answer game, which the testing team briefly exercised.

Ready-to-Assemble Kit

The Clarity Lab website offers a download that includes the Sirius application, Sirius Suite, and the web front-end server; the Sirius Suite alone with a Caffe snapshot; and the Wikipedia dump for the question-answering system [10].

After unpacking the Sirius archive, you switch to the sirius-1.0.1/sirius-application directory. A few scripts here import the software expected by Sirius, load components from the Internet, and compile and install them. The scripts are written for Ubuntu 14.04; if you use this somewhat older LTS version (that is nevertheless supported until 2019), you should enter the following four commands:

sudo ./get-dependencies.sh
sudo ./get-opencv.sh
sudo ./get-kaldi.sh
./compile-sirius-servers.sh

If you use the current Ubuntu 16.04, open the get-dependencies.sh script in a text editor beforehand and comment out the entry that adds the external FFmpeg repository (ppa:kirillshkrogalev/ffmpeg-next). The external package source is no longer necessary because FFmpeg is in the official Xenial repositories.
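The adjustment for Ubuntu 16.04 can also be scripted. The following is a minimal sketch, assuming the PPA name appears on a single line of get-dependencies.sh; the function name disable_ppa is hypothetical, and the untouched script is kept as a backup with the suffix .orig.

```shell
# Hypothetical helper: comment out every line of FILE that mentions a PPA.
# Assumption: the PPA entry sits on one line; the original is saved as FILE.orig.
disable_ppa() {
    file=$1
    ppa=$2
    cp "$file" "$file.orig" || return 1
    # index() matches the PPA name literally; lines that are already comments stay as they are.
    awk -v ppa="$ppa" '
        index($0, ppa) && $0 !~ /^[ \t]*#/ { print "# " $0; next }
        { print }
    ' "$file.orig" > "$file"
}

# Usage, inside sirius-1.0.1/sirius-application:
# disable_ppa get-dependencies.sh ppa:kirillshkrogalev/ffmpeg-next
```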

Next, execute the first three commands, but before you call ./compile-sirius-servers.sh, create a symbolic link named /usr/bin/libtool that points to /usr/bin/libtoolize, because the Kaldi makefile searches for the libtool binary.
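As a sketch of the symbolic link step, the helper below reads the instruction as "the name libtool should resolve to the existing libtoolize." The function name link_libtool and its optional directory argument are hypothetical; by default it works on /usr/bin, which requires root privileges.

```shell
# Hypothetical helper: make the name "libtool" available by linking it to libtoolize.
# Assumption: BINDIR defaults to /usr/bin; an existing libtool is left untouched.
link_libtool() {
    bindir=${1:-/usr/bin}
    if [ -e "$bindir/libtool" ]; then
        echo "libtool already present in $bindir, nothing to do"
        return 0
    fi
    # ln -s TARGET LINKNAME: the new name libtool points at the existing libtoolize.
    ln -s "$bindir/libtoolize" "$bindir/libtool"
}

# Equivalent one-liner on the test system:
# sudo ln -s /usr/bin/libtoolize /usr/bin/libtool
```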

A fast Internet connection is an advantage, because the scripts download a whole host of software. With the OpenCV download, around 3GB of data are copied onto the disk; Kaldi takes up 2GB. The Sirius archive itself is 470MB in size, and the Wikipedia dump encompasses some 11GB. When completely installed, Sirius and its components occupy around 25GB of disk space.
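Given these figures, it can be worth checking the free disk space before starting the scripts. The sketch below is an assumption-laden convenience, not part of Sirius: the function name has_free_gb is hypothetical, and the 25GB threshold in the usage line is taken from the total stated above.

```shell
# Hypothetical helper: succeed if the file system holding DIR has at least NEED_GB free.
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Usage:
# has_free_gb "$HOME" 25 || echo "Less than 25GB free, Sirius may not fit"
```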

The scripts that bring the speech recognition, image recognition, and question-answering system into the arena are in the sirius-application/run-scripts directory with start at the beginning of their file names. All three components are implemented as server services. The scripts you use to direct your requests to the servers are also found here with test in their file names.

Good Listener

In their first attempt, the test team fed a few of the WAV files stored in the sirius-application/inputs/questions directory to Sirius automatic speech recognition (ASR) and started the ASR server in a terminal in succession with one of the three available back ends (Kaldi, PocketSphinx, and Sphinx4):

./start-asr-server.sh kaldi
./start-asr-server.sh pocketsphinx
./start-asr-server.sh sphinx4

The team then called the sirius-asr-test.sh script in a second terminal together with a question (Provided) and saw the result from Sirius (Figure 2). Sometimes it worked well, sometimes only after waiting a while, and sometimes not at all; the communication with Sphinx4 on Ubuntu 16.04 failed completely. For comparison, the test team recorded the sentences themselves (Recorded) with a microphone and sent them to all three back ends. Using five example sentences, Table 1 shows what Kaldi, PocketSphinx, and Sphinx4 understood.
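Sending a whole directory of recordings to a back end is easier with a small loop. This is a minimal sketch, assuming the test script accepts the path to a WAV file as its only argument; the function name transcribe_all is hypothetical.

```shell
# Hypothetical helper: pass every WAV file in DIR to an ASR test script.
# Assumption: SCRIPT takes the WAV path as its only argument.
transcribe_all() {
    script=$1
    dir=$2
    for wav in "$dir"/*.wav; do
        [ -e "$wav" ] || continue    # skip the literal pattern if nothing matches
        printf '== %s\n' "$wav"
        "$script" "$wav"
    done
}

# Usage, from sirius-application/run-scripts with an ASR server already running:
# transcribe_all ./sirius-asr-test.sh ../inputs/questions
```

Running the loop once per back end yields the raw material for a comparison like the one in Table 1.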

Table 1: Sirius ASR Back Ends

| Question | Source | Kaldi | PocketSphinx | Sphinx4 |
|---|---|---|---|---|
| Who invented the telegraph? | Provided | who invented the telegraph | who invented the telegraph | who invented the telegraph |
| | Recorded | we went at the telegraph | we're going to the telegraph | with only scowled |
| Where is the Louvre Museum located? | Provided | where is the liberal museum love the change yeah | where is the liver uneasy and located | where's the louvre museum located |
| | Recorded | where was the little free museums okay tent | where is the u. over a museum located | london back while passengers are |
| Where did John Lennon die? | Provided | where do you john lennon dot | where did john lennon got | where did john lennon died |
| | Recorded | when it it's john lennon die | where did john lennon die | only after all how often run |
| What is the population of France? | Provided | what is the population of france | what is the population of forms | what is the population of france |
| | Recorded | uh what is the population of france | what is the population of trunks | in a half and unload newark crown |
| What is the speed of light? | Provided | which is the speed of light | what is the speed of light | what is the speed of light |
| | Recorded | well just the speed of flights | what does the speed of light | the injury to half moon last |

Figure 2: After WAV files are sent to Sirius ASR, you see what was understood by the back end.

The quality of speech recognition is very patchy: With the WAV files provided, only the Sphinx4 back end worked almost flawlessly. With the testers' own recordings, on the other hand, correctly recognized sentences remained a rare exception. The developers may have trained their speech recognition libraries primarily with the files they enclosed, which are spoken with an American accent throughout. Sphinx4 in particular was unable to cope with the test team's own recordings (in British English with a German accent); the other engines at least recognized individual words.
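To put a number on how often a back end gets a sentence right, the transcript can be compared with the spoken question after normalizing case and punctuation. The sketch below uses a strict exact-match check, which is an illustrative simplification and not the word error rate usually reported for speech recognition; the function names normalize and exact_match are hypothetical.

```shell
# Hypothetical helpers for scoring a transcript against the spoken question.
normalize() {
    # Lowercase, keep only letters, digits, spaces, and apostrophes, then squeeze spaces.
    printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -cd "[:alnum:] '" | tr -s ' '
}

exact_match() {
    # Succeeds only if both sentences are identical after normalization.
    [ "$(normalize "$1")" = "$(normalize "$2")" ]
}

# Usage:
# exact_match "What is the speed of light?" "what is the speed of light" && echo "recognized"
```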

Audio quality is unlikely to explain the lack of understanding, because a decent microphone was used. As a spot check, the testers also recorded their sentences with a headset that has a different frequency response, and these recordings still delivered inferior results. By comparison, the Google and Apple speech recognition engines on the test team's smartphones recognized almost all of the questions.

Answer Me

If the digital assistant understands a question, it would be great if it could answer it as well. The Sirius developers employ the question-answering system OpenEphyra [6] for this step.

A Wikipedia dump without semantic distinctions serves as the data corpus. The developers created this with Indri [11], a search engine specialized for large text corpora. You can download the Wikipedia knowledge database from the Sirius download page and extract it into the sirius-application/question-answer directory.

Now start the QA server with the start-qa-server.sh script from the sirius-application/run-scripts directory. On the Ubuntu 16.04 test machine, this did not work right away; the server only started after a call to ant (which processes the XML build files for OpenEphyra and its documentation files) in the sirius-application/question-answer directory. If you receive an insufficient threads configured warning, you can work around it with a simple hack: Comment out this line in the sirius-application/question-answer/src/info/ephyra/OpenEphyraServer.java file:

con1.setThreadPool(new QueuedThreadPool(NTHREADS));

After taking care of this problem, you must call up the compile-sirius-servers.sh script once more and restart the QA server.
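The hack can be applied without opening an editor. This is a minimal sketch, assuming the setThreadPool call sits on a single line as quoted above; the function name comment_out_threadpool is hypothetical, and the untouched file is kept with the suffix .orig.

```shell
# Hypothetical helper: turn the setThreadPool call into a Java line comment.
# Assumption: the call occupies one line; the original is saved as FILE.orig.
comment_out_threadpool() {
    file=$1
    cp "$file" "$file.orig" || return 1
    # Insert "// " in front of the statement while preserving its indentation.
    sed 's|^\([[:space:]]*\)\(con1\.setThreadPool(.*\)$|\1// \2|' "$file.orig" > "$file"
}

# Usage, inside sirius-application/question-answer:
# comment_out_threadpool src/info/ephyra/OpenEphyraServer.java
```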

Now you can ask questions in a second terminal; for example:

./sirius-qa-test.sh "what is the speed of light"

After a confirmation that the question has come through, a message appears stating that the question has gone to the server. After a short wait, the answer pops up in the terminal (Figure 3).

Figure 3: Once the OpenEphyra server is running in one terminal window, you can enter your question in another and receive your answer there, as well.

Because spoken and typed questions are both possible, it would be great if you could combine these. That is no problem with Sirius; you simply start the ASR service along with the QA server and use the following script for communication:

./sirius-asr-qa-test.sh ../inputs/real/who.is.the.current.president.of.the.united.states.wav

How long the analysis takes depends on the ASR back end. Even after the ASR service has successfully transcribed the question, the QA service still requires some time to find the answer, so patience is needed.
