Full-text search with Solr, Xapian, and Sphinx

Tracker Dogs

Full-text search engines like Solr, Xapian, and Sphinx make the daily data chaos on your hard disk searchable – and they even cooperate with relational databases.

Creating a list of 10 websites that discuss the latest Ubuntu release is simple: just use Google or another of the popular web search engines. But if you host an information-packed website yourself and want to offer your own search function for it, you need a full-text search tool. Full-text search engines have other benefits for the user and developer. If you are building a custom application or DVD, for instance, you might want to include a full-text search tool to put important information at the user's fingertips. Full-text search delves into the depths of randomly or systematically arranged data for one or more search terms. You will want the search results sorted by relevance, and you will want the results in a split second.

Luckily, admins and developers need not reinvent the wheel: Solr, Xapian, and Sphinx are open source projects that index and analyze data. But how do you define data? You can roughly distinguish two states in which the search engines find information: structured and unstructured.

Structured data has a fixed, predefined structure that allows it to be easily recognized, categorized, and processed with the help of applications. The most common form of structured data is a relational database, with data organized in rows and columns that together form tables, which in turn can be linked to one another. In contrast, unstructured data lacks a data model. Such data sets are often so ambiguous that a program cannot simply process them because the data, facts, and figures are totally mixed. Unstructured data is the domain of search engines that can at least arrange the chaotic data semantically.

Although both structured and unstructured data are important, structured data poses several challenges for the search engine. For instance, the search engine must interact with a database engine to gain access to the data (see the box titled "No Connector?"). Also, websites and other custom web service infrastructures increasingly rely on back-end database systems for storing and organizing critical information.

No Connector?

Clearly, searching a database is more complicated with no database connector. In this case, the users themselves must ensure that the search engine treats each database entry as an individual document and indexes it as if it were unstructured data. Additionally, you need to create a new document for any subsequent entries in the database.

On top of this, you have to program the search manually to interact with the database. Although it is possible to work around the constraints, you have to consider a few factors – for example, the size of the database and whether it resides in a distributed environment.

In this article, I describe the Solr, Xapian, and Sphinx full-text search tools, paying particular attention to how to handle data stored in relational databases – with the spotlight on SQL variants.

Indexing and Search

Search engines in the wild exist either as libraries that can be integrated into existing code or as standalone services. These standalone services provide APIs that let developers tap into the search engine from their own programs.

Basically, search engines follow two aims: Indexing prepares the data so that it is easily accessible, and the search process compiles the information in a meaningful way. If you look up the term index in the dictionary, you will see that indexing means making a list of things. Making a list is the first step in the process.

Solr, Sphinx, and Xapian rely on indexers to sift through the full inventory of data and apply different algorithms to create a searchable form (see the box titled "Stop Words and Stem Formation"). Grossly simplified, the search engine isolates key terms from the mass of data and stores them in the form of trees, binary trees, or hash tables in index files that also remember the relative position of the terms in the original documents.

Stop Words and Stem Formation

Every language includes words that do not add any significance to the body of text in which they appear; the indexer usually discards these words. In English, these stop words include the definite article (the), indefinite articles (a, an), and conjunctions (and, or).

Stem formation is a method that reverts declined or conjugated words to their stems. The associated algorithm assigns, for example, both "went" and "gone" to the stem "go."

When it comes to indexing information stored in a database, the index file generated by the search engine usually contains at least two things: the fields that are searched by the user, and the ID of the entry in the database, which makes it possible to reconstruct the record acquired by the search engine if necessary.

If multiple tables in a database are linked, the search engines must first understand how the individual tables are constructed, how they are related, and what information the full-text search needs when indexing in the database. When it finally comes to the actual data search, the search engine looks in the index file for the search terms and returns the IDs of the matches and the data sets created for them.

Some full-text search engines, whether they be servers or libraries, come with native support for interacting with popular database services, which allows automated indexing.

Indexing alone is useless if no search function can interpret the generated indices in a meaningful way. If you specify one or more keywords for a search, a search function investigates the indices created earlier and returns relevant results [1]. Because indexing removes redundant and meaningless words, the search proceeds much more quickly than if a computer were to search the original documents.

Reorganizing the data allows the user to track down an article titled "Many ways to find a cat" by entering, for example, "I'm looking for my kitty."

Apache Solr

Solr [2] is an open source full-text search server from the Apache Software Foundation (the current version is 4.3.0). It is based on the Lucene search library [3], which also belongs to the Apache Software Foundation; the two projects merged their development as early as May 2011, so Solr automatically assimilates new features when the Lucene project updates its software. Companies and institutions that use Solr include Netflix and SourceForge, but also NASA and the White House.

Even though Lucene provides the actual search functionality, Solr comes with some interesting extra features. These include match highlighting, which visually marks the matching passages in the search results, and near-real-time indexing, which ensures that recently indexed data is almost immediately available for searching. Solr also indexes and parses rich documents (e.g., DOC, HTML, and PDF files) using Apache Tika [4], integrates geospatial search, and has an extensible plugin architecture.

Solr's API lets you communicate with the server in almost any programming language because it uses REST-like standards (see the "REST" box) and stores information in XML, JSON, and CSV files.
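A quick query illustrates the principle. Assuming the default Tomcat setup on port 8080 described later, a plain HTTP GET such as

http://localhost:8080/solr/select?q=ubuntu&wt=json

returns the matching documents as JSON; swap wt=json for wt=xml or wt=csv to change the output format.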

REST

The term Representational State Transfer (REST) was first introduced in a doctoral dissertation by Roy Fielding. Among other things, the REST concept specifies that all resources of dynamically generated web pages should be addressable via URIs.

Because it is written in Java and already has a few years of development behind it, Solr copes well with structured data [5]. The server enables users to index almost any database if they do two things: install the correct JDBC driver and configure some XML files. The first step becomes complicated if you do not use a standard database engine. However, if you rely on the widespread MySQL database [6], you can simply install the libmysql-java package on Ubuntu 13.04 and make a few manual adjustments.

Solr itself can be installed using solr-common; in addition to this, you need solr-tomcat as a matching server. A data import handler then takes care of tapping into the database. To use the import handler in Ubuntu, the admin first needs to define it in the /etc/solr/conf/solrconfig.xml file (Listing 1, line 1). At the same time, this file expects a link to the configuration file (data-config.xml, line 3), which you can store, for example, in /home/username.

Listing 1

solrconfig.xml

 
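A minimal version of this handler definition, following the standard DataImportHandler setup (the config path is an example), might look like this:

01 <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
02   <lst name="defaults">
03     <str name="config">/home/username/data-config.xml</str>
04   </lst>
05 </requestHandler>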

This configuration file (Listing 2) needs to explain how Solr can access the data in the database (line 5), which fields and columns it should read, and whether it should change, delete, or add fields (lines 9 to 12). If necessary, different databases can be tapped by using the name field [7]. In this example, the Solr schema file (/etc/solr/conf/schema.xml) additionally needs the id, name, and desc fields (line 11), which you can set up after reading the schema documentation [8] and before modifying the data-config.xml file.

Listing 2

data-config.xml

 
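A plausible minimal configuration, based on the DataImportHandler examples [7] (database name, credentials, and table are placeholders; the line numbering only approximates the references in the text):

01 <dataConfig>
02   <dataSource type="JdbcDataSource"
03               driver="com.mysql.jdbc.Driver"
04               url="jdbc:mysql://localhost/myDB"
05               user="user"
06               password="password"/>
07   <document>
08     <entity name="item"
09             query="SELECT id, name, desc FROM item">
10       <field column="id" name="id"/>
11       <field column="name" name="name"/>
12       <field column="desc" name="desc"/>
13     </entity>
14   </document>
15 </dataConfig>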

The next step is to ensure that the appropriate JDBC driver ends up in the /usr/share/solr/web/WEB-INF/lib directory in the form of a .jar file. In the case of MySQL, you can discover the driver's location by entering

sudo dpkg -L libmysql-java

and then create a symbolic link to the driver. In the last step, a command imports the data into Solr; a simple HTTP call to the appropriate port specification is all you need:

http://localhost:8080/solr/dataimport?command=full-import

If you simply need to update an existing index, replace full-import with delta-import in the command (i.e., ?command=delta-import).

Xapian

Unlike Solr, Xapian [9] is not an independent server that runs separately from the applications that use its services. Instead, it is a C++ library that can be integrated into applications as a search function. The project includes bindings for other programming languages, including Perl, Python, PHP, Java, Tcl, C#, Ruby, Lua, and Erlang.

Xapian users include, for example, the weekly newspaper "Die Zeit" and the Debian project, which uses the search function for its mailing lists. The packages for Ubuntu 13.04 are xapian-omega, xapian-tools, and libxapian22.

Xapian stores the data collected during indexing in a database of its own, which it then uses in the search step. Xapian also features a weighted probabilistic search: Important words are assigned more weight than unimportant ones, so that more relevant documents are likely to move to the top of the results.

The user can search for exact terms (Figure 1) or for a specific number of words that appear in the text either in a given order or in any order. Xapian also corrects search keys entered by the user on the basis of its own search index and supports data indexing during a search by immediately integrating new documents into the search results.

Figure 1: The Xapian search page, running under Ubuntu on an Apache server, investigating some previously indexed test documents.

Xapian provides an interesting answer to the problem of searching databases. By default, it does not include support for database indexing. However, the Xapian project also includes a server engine named Omega [10] that uses the Xapian libraries as a basis. On Ubuntu, Omega resides in the xapian-omega package, which sets up an Apache server to host it. A guide [11] explains how to search some test data.

The link between Xapian and Omega is similar to that between Solr and Lucene. Omega allows users to index databases, including all SQL databases and other database systems supported by the Perl DBI module. To be more specific, the Xapian/Omega combination can search MySQL, PostgreSQL, SQLite, Oracle, DB2, MS SQL, LDAP, and ODBC.

For convenience, I'll assume you have a customer table with the fields id (primary key), customerName (varchar 255), and address (text), which belongs to the MySQL database myDB, and that you want to index this table. For Xapian, you need to convert the database format to text. To do so, export the database username and associated password at the command line:

export DBUSER=<DB-username>;
export DBPASSWORD='<DB-Password>';

In the next step, run a script named dbi2omega; Ubuntu 13.04 puts this script in /usr/share/doc/xapian-omega/examples/dbi2omega. The script transfers the contents of the table to a text file:

dbi2omega myDB customer >> customer.txt

In the next step, the Xapian user must create a script (Listing 3) that tells Xapian which fields to include – in this example, the script is named fields.script. In this way, the search software understands that the id field is unique, and that it needs to index the customerName field with five times the relevance of the address field, which also ends up in the index [12].

Listing 3

fields.script

 
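A plausible version for the customer table, using the scriptindex action syntax [12] (the weight value matches the relevance factor mentioned above):

id : boolean=Q unique=Q field=id
customerName : weight=5 index field=customerName
address : index field=address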

On Ubuntu, you then run scriptindex, which resides in the path /usr/bin/, to generate the example database:

scriptindex --stemmer=german <ExampleDB> fields.script customer.txt

The --stemmer option helps with stem formation; Xapian's algorithm assumes English as the default language. For another language (e.g., German), you would optionally specify german, as shown in the example.
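To check the results, you can query the new index directly from the command line with the quest utility from the xapian-tools package installed earlier (a quick test, assuming the database directory from the scriptindex call above):

quest -d <ExampleDB> "kitty"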

Sphinx

Like Solr, Sphinx [13] is a full-text search server, currently at version 2.0.8. Unlike Solr, however, Sphinx does not include a REST API for language-independent access. Instead, developers have three options for working with this open source search solution: First, you can easily communicate through the Sphinx API. The search engine developers offer official Sphinx clients for PHP, Java, and Python; unofficial Sphinx clients for Perl, Ruby, and C++ are also available.

Second, a pluggable storage engine for MySQL called SphinxSE allows the MySQL server to send large amounts of data to the searchd search daemon. Third, an SQL-style query language, SphinxQL, can unearth information using the same approaches as in a MySQL database. SphinxQL saves you the detour through the Sphinx API by talking to the search daemon directly over the MySQL protocol, which is why data retrieval is faster.
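For example, if searchd is set up to speak the MySQL protocol (typically a listen = 9306:mysql41 line in sphinx.conf), any stock MySQL client can issue SphinxQL queries against an index – here, the index created later in this article:

mysql -h 127.0.0.1 -P 9306
SELECT * FROM my_sphinx_index WHERE MATCH('ubuntu') LIMIT 10;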

One of Sphinx's features is its ability to index collections of full text in real time. Sphinx natively integrates SQL databases, can group and sort search results, and can apply arithmetic functions such as MIN, MAX, SUM, and AVG when searching. Additionally, the search is very fast; Sphinx indexes 10 to 15MB of data per second with only one CPU core.

Sphinx's better-known customers include craigslist (with 300 million searches per day), Tumblr, and phpBB. Anyone wanting to use Sphinx as the search server in combination with Ubuntu needs to install the sphinxsearch package.

When it comes to indexing data, Sphinx considers everything to be a document with the same attributes and fields. It handles SQL data in a similar way, considering the rows to be documents and the columns to be fields or attributes. A data source driver converts the source data from different types of databases to documents. By default, Sphinx includes drivers for MySQL, PostgreSQL, MS SQL, and ODBC. Sphinx also includes a generic driver named xmlpipe2 for XML files.
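For data that does not reside in a database, a minimal xmlpipe2 stream might look like this (the field names are examples):

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="title"/>
<sphinx:field name="description"/>
</sphinx:schema>
<sphinx:document id="1">
<title>Many ways to find a cat</title>
<description>Tips for tracking down your kitty.</description>
</sphinx:document>
</sphinx:docset>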

To index a MySQL table with the fields id, title, and description [14], Sphinx users still need to configure many settings up front. For instance, you need to clarify where the information comes from, where the database server is running, and how Sphinx can access it.

Then the question arises as to where to store the index; the indexer must also be set up so that it searches the database and forwards the information. Finally, the searchd search daemon needs to be configured.

Listings 4 and 5 show sections from the Sphinx configuration file (sphinx.conf). Fortunately, you can reference a template on Ubuntu:

sudo cp /etc/sphinxsearch/sphinx.conf.sample /etc/sphinxsearch/sphinx.conf

Listing 4

Database Connection in sphinx.conf

 
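A plausible source section for the MySQL example (host, credentials, and table name are placeholders):

source my_source
{
  type      = mysql
  sql_host  = localhost
  sql_user  = <DB-username>
  sql_pass  = <DB-password>
  sql_db    = myDB
  sql_port  = 3306
  sql_query = SELECT id, title, description FROM documents
}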

Listing 5

Database Index Configuration

 
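The matching index section could look like this (the index name and path reappear in the indexer call below):

index my_sphinx_index
{
  source = my_source
  path   = /var/lib/sphinxsearch/data/my_sphinx_index
}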


Although the template file contains hundreds of entries, it is commented extensively and can thus be adapted to your needs (Figure 2).

Figure 2: This detailed configuration file helps connect Sphinx to databases.

The id field selected by the sql_query entry must be a positive integer that appears exactly once in the database. An auto-incremented integer key is the ideal choice.

The admin still needs to ensure that Sphinx is running as a service. In the /etc/default/sphinxsearch file, set the START=no parameter to START=yes and then start the service:

sudo service sphinxsearch start

After these adjustments, you just need one command to start indexing on Ubuntu. It relies on the indexer tool that Ubuntu automatically installs in /usr/bin/ along with the Sphinx package:

indexer my_sphinx_index -c /etc/sphinxsearch/sphinx.conf

The final index is stored as my_sphinx_index in the /var/lib/sphinxsearch directory. For further details on the database connection and the possible options, check out the Sphinx documentation [15].

Conclusions

Of the three open source projects introduced in this article, Sphinx left the best impression when it came to indexing databases; it was apparently developed with this task in mind, at least in terms of usability and configuration.

However, users of all three candidates need to know in advance exactly which fields they want to index in the database and to specify these fields explicitly. It is not enough to install a database dump tool upstream of the search engine and hope that the right results come out at the end. Things become really complicated when search engines do not include SQL support – this means manual work for the admin: laboriously extracting the data from the tables with scripts.

Solr, on the other hand, is recommended when it comes to NoSQL indexing (a database type not discussed here). Thanks to its REST-like API, Solr offers language-independent compatibility.

Infos

  1. More on accuracy and hit rates in the search context: https://en.wikipedia.org/wiki/Negative_predictive_value
  2. Solr search engine: http://lucene.apache.org/solr/
  3. Solr-Lucene substructure: http://lucene.apache.org
  4. Apache Tika Content Analysis Tool: http://tika.apache.org
  5. Solr with database connection: http://wiki.apache.org/solr/DIHQuickStart
  6. MySQL project: http://www.mysql.com/
  7. MySQL import into Solr: http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml
  8. Configuring Solr's schema.xml : http://wiki.apache.org/solr/SchemaXml
  9. Xapian search library: http://xapian.org
  10. Omega Project: http://xapian.org/docs/omega/
  11. Omega in action: http://trac.xapian.org/wiki/OmegaExample
  12. Xapian scriptindex: http://xapian.org/docs/omega/scriptindex.html
  13. Sphinx search engine: http://sphinxsearch.com
  14. Configuring MySQL for Sphinx: http://astellar.com/2011/12/replacing-mysql-full-text-search-with-sphinx/
  15. Detailed Sphinx documentation: http://sphinxsearch.com/wiki/doku.php?id=sphinx_docs#sphinxconf_options_reference