Full-text search with Solr, Xapian, and Sphinx

Sphinx

Like Solr, Sphinx is [13] a full-text search server, currently at version 2.0.8. Unlike Solr, Sphinx does not include a REST API for language diagnostics. However, developers have three options for working with this open source search solution: First, you can easily communicate with the Sphinx API. The search engine developers already offer a number of clients for Sphinx, which exist for the languages PHP, Java, and Python; unofficial Sphinx clients for Perl, Ruby, and C++ are also available.

Second, a dockable storage engine for MySQL called Sphinx SE allows the MySQL server to send large amounts of data to the searchd search daemon. Third, an SQL-style scripting language, Sphinx QL, can unearth information using the same approaches as in a MySQL database. Sphinx QL saves the detour of having to use the Sphinx API and talk directly to MySQL, which is why data retrieval is faster.

One of Sphinx's features is its ability to index collections of full text in real time. Sphinx natively integrates SQL databases, can group and sort search results, and can insert arithmetic functions such as MIN, MAX, SUM, and AVG when searching. Additionally, the search is very fast; Sphinx indexes 10 to 15MB of data per second with only one CPU core.

Sphinx's better-known customers include craigslist (with 300 million searches per day), Tumblr, and phpBB. Anyone wanting to use Sphinx as the search server in combination with Ubuntu needs to install the sphinxsearch package.

When it comes to indexing data, Sphinx considers everything to be a document with the same attributes and fields. It handles SQL data in a similar way, considering the rows to be documents and the columns to be fields or attributes. A data source driver converts the source data from different types of databases to documents. By default, Sphinx includes drivers for MySQL, PostgreSQL, MS SQL, and ODBC. Sphinx also includes a generic driver named xmlpipe2 for XML files.

To index a MySQL table with the fields id, title, and description [14], Sphinx users still need to configure many settings up front. For instance, you need to clarify where the information comes from, where the database server is running, and how Sphinx can access it.

Then the question arises as to where to store the index; the indexer must also be set up so that it searches the database and forwards the information. The searchd search daemon finally configures Sphinx.

Listings 4 and 5 show sections from the Sphinx configuration file (sphinx.conf). Fortunately, you can reference a template on Ubuntu:

Listing 4

Database Connection in sphinx.conf

Listing 5

Database Index Configuration

sudo cp /etc/sphinxsearch/sphinx.conf.sample /etc/sphinxsearch/sphinx.conf

Although the template file contains hundreds of entries, it is commented extensively and can thus be adapted to your needs (Figure 2).

Figure 2: This detailed configuration file helps connect Sphinx to databases.

The id field next to the sql_query entry must point to a positive integer that appears exactly once in the database. An auto-incremented integer key is the ideal choice.

The admin still needs to ensure that Sphinx is running as a service. In the /etc/default/sphinxsearch file, set the START=no parameter to START=yes and then start the service:

sudo service sphinxsearch start

After these adjustments, you just need a command to start indexing on Ubuntu. This relies on the indexer that Ubuntu automatically installs in /usr/bin/ when installing the Sphinx package.

indexer my_sphinx_index -c /etc/sphinxsearch/sphinx.conf

The final index is stored as my_sphinx_index in the /var/lib/sphinxsearch directory. For further details on the database connection and the possible options, check out the Sphinx documentation [15].

Conclusions

Of the three open source projects introduced in this article, Sphinx left the best impression when it came to indexing databases; it was apparently developed with this task in mind, at least in terms of usability and configuration.

However, users of all three candidates need to know in advance exactly which fields they want to index in the database and to specify these fields explicitly. It is not enough to install a database dump tool upstream of the search engine and hope that the right results come out at the end. Things become really complicated when search engines do not include SQL support – this means manual work for the admin: laboriously extracting the data from the tables with scripts.

Solr is otherwise recommended when it comes to NoSQL indexing (a database type not discussed here). Thanks to its REST-like API, Solr offers language-independent compatibility.

Infos

More on accuracy and hit rates in the search context: https://en.wikipedia.org/wiki/Negative_predictive_value
Solr search engine: http://lucene.apache.org/solr/
Solr-Lucene substructure: http://lucene.apache.org
Apache Tika Content Analysis Tool: http://tika.apache.org
Solr with database connection: http://wiki.apache.org/solr/DIHQuickStart
MySQL project: http://www.mysql.com/
MySQL import into Solr: http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml
Configuring Solr's schema.xml : http://wiki.apache.org/solr/SchemaXml
Xapian search library: http://xapian.org
Omega Project: http://xapian.org/docs/omega/
Omega in action: http://trac.xapian.org/wiki/OmegaExample
Xapian scriptindex: http://xapian.org/docs/omega/scriptindex.html
Sphinx search engine: http://sphinxsearch.com
Configuring MySQL for Sphinx: http://astellar.com/2011/12/replacing-mysql-full-text-search-with-sphinx/
Detailed Sphinx documentation: http://sphinxsearch.com/wiki/doku.php?id=sphinx_docs#sphinxconf_options_reference

« Previous 1 2

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

Linux Servers Targeted by Akira Ransomware

Enterprise Linux , Linux , ransomware , Security

A group of bad actors who have already extorted $42 million have their sights set on the Linux platform.
TUXEDO Computers Unveils Linux Laptop Featuring AMD Ryzen CPU

Games , Hardware , laptop , Linux

This latest release is the first laptop to include the new CPU from Ryzen and Linux preinstalled.
XZ Gets the All-Clear

Arch Linux , Fedora , Linux , open source , Security , Ubuntu

The back door xz vulnerability has been officially reverted for Fedora 40 and versions 38 and 39 were never affected.
Canonical Collaborates with Qualcomm on New Venture

Artificial Inte... , Linux , open source , Security , Ubuntu

This new joint effort is geared toward bringing Ubuntu and Ubuntu Core to Qualcomm-powered devices.
Kodi 21.0 Open-Source Entertainment Hub Released

audio , Multimedia , Music , open source , streaming video , Video

After a year of development, the award-winning Kodi cross-platform, media center software is now available with many new additions and improvements.
Linux Usage Increases in Two Key Areas

Games , Linux , open source , Steam

If market share is your thing, you'll be happy to know that Linux is on the rise in two areas that, if they keep climbing, could have serious meaning for Linux's future.
Vulnerability Discovered in xz Libraries

Fedora , Linux , malware , Security

An urgent alert for Fedora 40 has been posted and users should pay attention.
Canonical Bumps LTS Support to 12 years

Linux , open source , Operating Systems , Ubuntu

If you're worried that your Ubuntu LTS release won't be supported long enough to last, Canonical has a surprise for you in the form of 12 years of security coverage.
Fedora 40 Beta Released Soon

Fedora , Gnome , open source , Plasma , Wayland

With the official release of Fedora 40 coming in April, it's almost time to download the beta and see what's new.
New Pentesting Distribution to Compete with Kali Linux

Linux , open source , Tools , Ubuntu

SnoopGod is now available for your testing needs

Full-text search with Solr, Xapian, and Sphinx

Sphinx

Conclusions

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

News

Linux Servers Targeted by Akira Ransomware

TUXEDO Computers Unveils Linux Laptop Featuring AMD Ryzen CPU

XZ Gets the All-Clear

Canonical Collaborates with Qualcomm on New Venture

Kodi 21.0 Open-Source Entertainment Hub Released

Linux Usage Increases in Two Key Areas

Vulnerability Discovered in xz Libraries

Canonical Bumps LTS Support to 12 years

Fedora 40 Beta Released Soon

New Pentesting Distribution to Compete with Kali Linux

Full-text search with Solr, Xapian, and Sphinx

Sphinx

Conclusions

Buy this article as PDF

Buy Linux Magazine

Related content

Subscribe to our Linux Newsletters Find Linux and Open Source Jobs Subscribe to our ADMIN Newsletters

Support Our Work

News

Tag Cloud

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters