Full-text search with Solr, Xapian, and Sphinx

Tracker Dogs

Article from Issue 153/2013

Author(s): Fernando Doglio

Full-text search engines like Solr, Xapian, and Sphinx make the daily data chaos on your hard disk searchable – and they even cooperate with relational databases.

Creating a list of 10 websites that discuss the latest Ubuntu release is simple: just use Google or another one of the popular web search engines. But if you host an information-packed website yourself and want to offer your own search function for it, you need a full-text search tool. Full-text search engines have other benefits for the user and developer. If you are building a custom application or DVD, for instance, you might want to include a full-text search tool to put important information at the user's fingertips. Full-text search delves the depths of random or systematically arranged data for one or more search terms. You will want the search results sorted by relevance, and you will want the results in a split second.

Luckily, admins and developers need not reinvent the wheel: Solr, Xapian, and Sphinx are open source projects that index and analyze data. But how do you define data? You can roughly distinguish two states in which the search engines find information: structured and unstructured.

Structured data has a fixed, predefined structure that allows it to be easily recognized, categorized, and processed with the help of applications. The most common form of structured data is a relational database, with data organized in rows and columns that, in turn, are connected in the form of tables. In contrast to this, unstructured data lacks a data model. Such data sets are often so ambiguous that a program cannot simply process them because the data, facts, and figures are totally mixed. Unstructured data is the domain of search engines that can at least arrange the chaotic data semantically.

Although both structured and unstructured data are both important, structured data offers several challenges for the search engine. For instance, the search engine must interact with a database engine to gain access to the data (see the box titled "No Connector?"). Also, websites and other custom web service infrastructures increasingly rely on back-end database systems for storing and organizing critical information.

No Connector?

Clearly, searching a database is more complicated with no database connector. In this case, the users themselves must ensure that the search engine treats each database entry as an individual document and indexes it as if it were unstructured data. Additionally, you need to create a new document for any subsequent entries in the database.

On top of this, you have to program the search manually to interact with the database. Although it is possible to work around the constraints, you have to consider a few factors – for example, the size of the database and whether it resides in a distributed environment.

In this article, I describe the Solr, Xapian, and Sphinx full-text search tools, paying particular attention to how to handle data stored in relational databases – with the spotlight on SQL variants.

Indexing and Search

Search engines in the wild exist in the form of libraries, which can be integrated into existing code or run as standalone services. These standalone services provide APIs for developers that let them tap into the search engine with their own programs.

Basically, search engines follow two aims: Indexing prepares the data so that it is easily accessible, and the search process compiles the information in a meaningful way. If you look up the term index in the dictionary, you will see that indexing means making a list of things. Making a list is the first step in the process.

Solr, Sphinx, and Xapian rely on indexers to sift through the full inventory of data and apply different algorithms to create a searchable form (see the box titled "Stop Words and Stem Formation"). Grossly simplified, the search engine isolates key terms from the mass of data and stores them in the form of trees, binary trees, or hash tables in index files that also remember the relative position of the terms in the original documents.

Stop Words and Stem Formation

Every language includes words that do not add any significance to the body of text in which they appear; the indexer usually discards these words. In English, these stop words include the definite article (the), indefinite articles (a, an), and conjunctions (and, or).

Stem formation is a method that reverts declined or conjugated words to their stems. The associated algorithm assigns, for example, both "went" and "gone" to the stem "go."

When it comes to indexing information stored in a database, the index file generated by the search engine usually contains at least two things: the fields that are searched by the user, and the ID of the entry in the database, which makes it possible to reconstruct the record acquired by the search engine if necessary.

If multiple tables in a database are linked, the search engines must first understand how the individual tables are constructed, how they are related, and what information the full-text search needs when indexing in the database. When it finally comes to the actual data search, the search engine looks in the index file for the search terms and returns the IDs of the matches and the data sets created for them.

Some full-text search engines, whether they be servers or libraries, come with native support for interacting with popular database services, which allows automated indexing.

Indexing alone is useless if no search function can interpret the generated indices in a meaningful way. If you specify one or more keywords for a search, a search function investigates the indices created earlier and returns relevant results [1]. Because indexing removes redundant and meaningless words, the search proceeds much more quickly than if a computer were to search the original documents.

Reorganizing the data allows the user to track down an article, titled Many ways to find a cat by entering, for example, I'm looking for my kitty.

Apache Solr

Solr [2] is an open source, full-text search server by the Apache Software Foundation (the current version is 4.3.0) based on the Lucene search library [3], which also belongs to the Apache Software Foundation. Solr thus automatically assimilates new features when the Lucene project updates its software. Companies and institutions that use Solr include Netflix and SourceForge, but also NASA and the White House. As early as May 2011, Solr and Lucene decided to merge.

Even though Lucene provides the search results, Solr comes with some interesting extra features. These features include not only match highlighting, which visually highlights the sites in the search results, but also near-real-time indexing, which ensures that recently indexed data is almost immediately available for searching. It also indexes and parses rich documents (e.g., DOC, HTML, and PDF files) using Apache Tika [4], integrates a geodetic data feature, and has an extensible plugin architecture.

Solr's API lets you communicate with the server in almost any programming language because it uses REST-like standards (see the "REST" box) and stores information in XML, JSON, and CSV files.

REST

The term Representational State Transfer (REST) was first introduced in a doctoral dissertation by Roy Fielding. Among other things, the REST concept specifies that all components of dynamically generated web pages should be accessible via URIs.

Because it is written in Java and already looks back on a few years of existence, Solr copes well with structured data [5]. The server puts users in a position to index almost any database if they do two things: install the correct JDBC driver and configure some XML files. The first step becomes complicated if you do not use a standard database engine. However, if you rely on the widespread MySQL database [6], you can simply install the libmysql-java package on Ubuntu 13.04 and make a few manual adjustments.

Solr itself can be installed using solr-common; in addition to this, you need solr-tomcat as a matching server. A data import handler then takes care of tapping into the database. To use the import handler in Ubuntu, the admin first needs to define it in the /etc/solr/conf/solrconfig.xml file (Listing 1, line 1). At the same time, this file expects a link to the configuration file (data-config.xml, line 3), which you can store, for example, in /home/username.

Listing 1

solrconfig.xml

This configuration file (Listing 2) needs to explain how Solr can access the data in the database (line 5), which XML fields and columns it should read, and whether it should change, delete, or add fields (lines 9 to 12). If necessary, different databases can be tapped by using the name field [7]. In this example, the Solr schema file (/etc/solr/conf/schema.xml) additionally needs the id, name, and desc (line 11), which the database user can set up after reading the schema [8], before modifying the data-config.xml file.

Listing 2

data-config.xml

The next step is to ensure that the appropriate JDBC driver ends up in the /usr/share/solr/web/WEB-INF/lib directory in the form of a .jar file. In the case of MySQL, you can discover this by entering

sudo dpkg -L mysqldata-java

then linking the driver. In the last step, a command imports the data into Solr; a simple HTTP call to the appropriate port specification is all you need:

http://localhost:8080/solr/dataimport?command=full-import

The command can optionally include a =delta-import, if you simply need to update an existing database.

Xapian

Unlike Solr, Xapian [9] is not an independent server that runs separately from the applications that use its services. Instead, it is a C++ library that can be integrated into applications as a search function. The project includes bindings for other programming languages, including Perl, Python, PHP, Java, Tcl, C#, Ruby, Lua, and Erlang.

Xapian users include, for example, the weekly newspaper "Die Zeit" and the Debian project, which uses the search function for the mailing list. The packages for Ubuntu 13.04 are xapian-omega, xapian-tools, and libxapian22.

Xapian stores the data collected during indexing in a standard database that is used in the search step. Xapian also features a weighted probability search: Important words are assigned more weight than unimportant ones, so that more relevant documents are likely to move to the top of the results.

The user can search for exact terms (Figure 1), or for a specific number of words that either appear sorted or random in the search text. Xapian also corrects search keys entered by the user on the basis of its own search index and supports data indexing during a search by immediately integrating new documents into the search results.

Figure 1: The Xapian search page, running under Ubuntu on an Apache server, investigating some previously indexed test documents.

Xapian provides an interesting answer to the problem of searching databases. By default, it does not include support for database indexing. However, the Xapian project also includes a server engine named Omega [10] that uses the Xapian libraries as a basis. On Ubuntu, Omega resides in the xapian-omega package and installs an Apache server. A guide [11] explains how to search some test data.

The link between Xapian and Omega is similar to that between Solr and Lucene. Omega allows users to index databases, including all SQL databases and other database systems supported by the Perl DBI module. To be more specific, the Xapian/Omega team searches MySQL, PostgreSQL, SQLite, Oracle, DB2, MS SQL, LDAP, and ODBC.

For convenience, I'll assume you have a customer table with the fields id (primary key), customerName (varchar 255), and address (text), which belongs to the MySQL database myDB, and that you want to index this table. For Xapian, you need to convert the database format to text. To do so, export the database username and associated password at the command line:

export DBUSER=<DB-username>;
export DBPASSWORD='<DB-Password>';

In the next step, run a script named dbi2omega; Ubuntu 13.04 puts this script in /usr/share/doc/xapian-omega/examples/dbi2omega. The script transfers the contents of the table to a text file:

dbi2omega myDB customer >> customer.txt

In the next step, the Xapian user must create a script (Listing 3) that tells Xapian which fields to include – in this example, the script is named fields.script. In this way, the search software understands that the id field is a unique field, that it needs to index the customerName field, which has a five times greater relevance than the address field, which also ends up in the index [12].

Listing 3

fields.script

On Ubuntu, you then run scriptindex, which resides in the path /usr/bin/, to generate the example database:

scriptindex --stemmer=german <ExampleDB> fields.script customer.txt

The --stemmer option helps with stem formation; Xapian's algorithm assumes English as the default language. For another language (e.g., German), you would optionally specify german, as shown in the example.

1 2 Next »

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

Canonical Releases Ubuntu 24.04

Gnome , Linux , open source , Ubuntu

After a brief pause because of the XZ vulnerability, Ubuntu 24.04 is now available for install.
Linux Servers Targeted by Akira Ransomware

Enterprise Linux , Linux , ransomware , Security

A group of bad actors who have already extorted $42 million have their sights set on the Linux platform.
TUXEDO Computers Unveils Linux Laptop Featuring AMD Ryzen CPU

Games , Hardware , laptop , Linux

This latest release is the first laptop to include the new CPU from Ryzen and Linux preinstalled.
XZ Gets the All-Clear

Arch Linux , Fedora , Linux , open source , Security , Ubuntu

The back door xz vulnerability has been officially reverted for Fedora 40 and versions 38 and 39 were never affected.
Canonical Collaborates with Qualcomm on New Venture

Artificial Inte... , Linux , open source , Security , Ubuntu

This new joint effort is geared toward bringing Ubuntu and Ubuntu Core to Qualcomm-powered devices.
Kodi 21.0 Open-Source Entertainment Hub Released

audio , Multimedia , Music , open source , streaming video , Video

After a year of development, the award-winning Kodi cross-platform, media center software is now available with many new additions and improvements.
Linux Usage Increases in Two Key Areas

Games , Linux , open source , Steam

If market share is your thing, you'll be happy to know that Linux is on the rise in two areas that, if they keep climbing, could have serious meaning for Linux's future.
Vulnerability Discovered in xz Libraries

Fedora , Linux , malware , Security

An urgent alert for Fedora 40 has been posted and users should pay attention.
Canonical Bumps LTS Support to 12 years

Linux , open source , Operating Systems , Ubuntu

If you're worried that your Ubuntu LTS release won't be supported long enough to last, Canonical has a surprise for you in the form of 12 years of security coverage.
Fedora 40 Beta Released Soon

Fedora , Gnome , open source , Plasma , Wayland

With the official release of Fedora 40 coming in April, it's almost time to download the beta and see what's new.

Full-text search with Solr, Xapian, and Sphinx

Tracker Dogs

Indexing and Search

Apache Solr

Xapian

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

News

Canonical Releases Ubuntu 24.04

Linux Servers Targeted by Akira Ransomware

TUXEDO Computers Unveils Linux Laptop Featuring AMD Ryzen CPU

XZ Gets the All-Clear

Canonical Collaborates with Qualcomm on New Venture

Kodi 21.0 Open-Source Entertainment Hub Released

Linux Usage Increases in Two Key Areas

Vulnerability Discovered in xz Libraries

Canonical Bumps LTS Support to 12 years

Fedora 40 Beta Released Soon

Full-text search with Solr, Xapian, and Sphinx

Tracker Dogs

Indexing and Search

Apache Solr

Xapian

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters Find Linux and Open Source Jobs Subscribe to our ADMIN Newsletters

Support Our Work

News

Tag Cloud

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters