Full-text search with Solr, Xapian, and Sphinx
Sphinx
Like Solr, Sphinx is [13] a full-text search server, currently at version 2.0.8. Unlike Solr, Sphinx does not include a REST API for language diagnostics. However, developers have three options for working with this open source search solution: First, you can easily communicate with the Sphinx API. The search engine developers already offer a number of clients for Sphinx, which exist for the languages PHP, Java, and Python; unofficial Sphinx clients for Perl, Ruby, and C++ are also available.
Second, a dockable storage engine for MySQL called Sphinx SE allows the MySQL server to send large amounts of data to the searchd
search daemon. Third, an SQL-style scripting language, Sphinx QL, can unearth information using the same approaches as in a MySQL database. Sphinx QL saves the detour of having to use the Sphinx API and talk directly to MySQL, which is why data retrieval is faster.
One of Sphinx's features is its ability to index collections of full text in real time. Sphinx natively integrates SQL databases, can group and sort search results, and can insert arithmetic functions such as MIN
, MAX
, SUM
, and AVG
when searching. Additionally, the search is very fast; Sphinx indexes 10 to 15MB of data per second with only one CPU core.
Sphinx's better-known customers include craigslist (with 300 million searches per day), Tumblr, and phpBB. Anyone wanting to use Sphinx as the search server in combination with Ubuntu needs to install the sphinxsearch
package.
When it comes to indexing data, Sphinx considers everything to be a document with the same attributes and fields. It handles SQL data in a similar way, considering the rows to be documents and the columns to be fields or attributes. A data source driver converts the source data from different types of databases to documents. By default, Sphinx includes drivers for MySQL, PostgreSQL, MS SQL, and ODBC. Sphinx also includes a generic driver named xmlpipe2
for XML files.
To index a MySQL table with the fields id
, title
, and description
[14], Sphinx users still need to configure many settings up front. For instance, you need to clarify where the information comes from, where the database server is running, and how Sphinx can access it.
Then the question arises as to where to store the index; the indexer must also be set up so that it searches the database and forwards the information. The searchd
search daemon finally configures Sphinx.
Listings 4 and 5 show sections from the Sphinx configuration file (sphinx.conf
). Fortunately, you can reference a template on Ubuntu:
Listing 4
Database Connection in sphinx.conf
Listing 5
Database Index Configuration
sudo cp /etc/sphinxsearch/sphinx.conf.sample /etc/sphinxsearch/sphinx.conf
Although the template file contains hundreds of entries, it is commented extensively and can thus be adapted to your needs (Figure 2).
The id
field next to the sql_query
entry must point to a positive integer that appears exactly once in the database. An auto-incremented integer key is the ideal choice.
The admin still needs to ensure that Sphinx is running as a service. In the /etc/default/sphinxsearch
file, set the START=no
parameter to START=yes
and then start the service:
sudo service sphinxsearch start
After these adjustments, you just need a command to start indexing on Ubuntu. This relies on the indexer that Ubuntu automatically installs in /usr/bin/
when installing the Sphinx package.
indexer my_sphinx_index -c /etc/sphinxsearch/sphinx.conf
The final index is stored as my_sphinx_index
in the /var/lib/sphinxsearch
directory. For further details on the database connection and the possible options, check out the Sphinx documentation [15].
Conclusions
Of the three open source projects introduced in this article, Sphinx left the best impression when it came to indexing databases; it was apparently developed with this task in mind, at least in terms of usability and configuration.
However, users of all three candidates need to know in advance exactly which fields they want to index in the database and to specify these fields explicitly. It is not enough to install a database dump tool upstream of the search engine and hope that the right results come out at the end. Things become really complicated when search engines do not include SQL support – this means manual work for the admin: laboriously extracting the data from the tables with scripts.
Solr is otherwise recommended when it comes to NoSQL indexing (a database type not discussed here). Thanks to its REST-like API, Solr offers language-independent compatibility.
Infos
- More on accuracy and hit rates in the search context: https://en.wikipedia.org/wiki/Negative_predictive_value
- Solr search engine: http://lucene.apache.org/solr/
- Solr-Lucene substructure: http://lucene.apache.org
- Apache Tika Content Analysis Tool: http://tika.apache.org
- Solr with database connection: http://wiki.apache.org/solr/DIHQuickStart
- MySQL project: http://www.mysql.com/
- MySQL import into Solr: http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml
- Configuring Solr's
schema.xml
: http://wiki.apache.org/solr/SchemaXml - Xapian search library: http://xapian.org
- Omega Project: http://xapian.org/docs/omega/
- Omega in action: http://trac.xapian.org/wiki/OmegaExample
- Xapian scriptindex: http://xapian.org/docs/omega/scriptindex.html
- Sphinx search engine: http://sphinxsearch.com
- Configuring MySQL for Sphinx: http://astellar.com/2011/12/replacing-mysql-full-text-search-with-sphinx/
- Detailed Sphinx documentation: http://sphinxsearch.com/wiki/doku.php?id=sphinx_docs#sphinxconf_options_reference
« Previous 1 2
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Armbian 24.11 Released with Expanded Hardware Support
If you've been waiting for Armbian to support OrangePi 5 Max and Radxa ROCK 5B+, the wait is over.
-
SUSE Renames Several Products for Better Name Recognition
SUSE has been a very powerful player in the European market, but it knows it must branch out to gain serious traction. Will a name change do the trick?
-
ESET Discovers New Linux Malware
WolfsBane is an all-in-one malware that has hit the Linux operating system and includes a dropper, a launcher, and a backdoor.
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.
-
New Steam Client Ups the Ante for Linux
The latest release from Steam has some pretty cool tricks up its sleeve.
-
Gnome OS Transitioning Toward a General-Purpose Distro
If you're looking for the perfectly vanilla take on the Gnome desktop, Gnome OS might be for you.
-
Fedora 41 Released with New Features
If you're a Fedora fan or just looking for a Linux distribution to help you migrate from Windows, Fedora 41 might be just the ticket.