An indexing search engine with Nutch and Solr
Go Find It

Lead Image © Dmitry Naumov, 123RF.com
Build your own search engine using Apache's Nutch web crawler and Solr search platform.
CMS, wikis, text files … modern companies store important data in many different places, and that data must be accessible down to the tiniest detail through a single search. Commercial software vendors such as Google [1] offer tools that will index the data and store the index on an external server. But many organizations prefer to keep control of the search capabilities – for security and privacy reasons, but also to add flexibility and promote innovation and customization.
A handy constellation of open source tools from the Apache project will help you build your own search index for the assorted documents and data on your network: Nutch, Solr, Apache, and Lucene.
Nutch [2] is a powerful web crawler, and Apache Solr [3] is a search engine based on Apache Lucene [4]. You can combine Nutch with Solr to create a complete search engine – a miniature Google, if you like.
The Nutch crawler uses HTTP and FTP to discover information. If you want Nutch to inspect your local files, you need to store the files on an HTTP or FTP server and point to the directories you want Nutch to crawl. Nutch fetches the data, and Solr then indexes it and makes it searchable. Solr depends on the Apache Lucene search libraries, is written in Java, and requires a Java servlet container. The Jetty servlet container is installed by default, but many users prefer a more robust solution such as Apache Tomcat. (See the "A Note of Caution" box for more info.)
A Note of Caution
The crawler indexes data accessible to the daemons associated with the process. Depending on your security system, the search results could be more than you would want non-privileged users to see, so you might need to adjust your configuration to rule out access to highly secure files and directories.
This workshop shows how to build your own search engine on an Ubuntu 14.04.2 LTS system.
Installing the Components
On Ubuntu, Solr is available from the package sources; you only need to install Nutch manually (Listing 1, lines 1-4). Then back up Solr's default XML schema and replace it with the file supplied by Nutch (lines 6 and 7).
Listing 1
Installing Solr and Nutch
apt-get install solr-tomcat
wget http://www.eu.apache.org/dist/nutch/1.9/apache-nutch-1.9-bin.tar.gz
tar vfx apache-nutch-1.9-bin.tar.gz
mv apache-nutch-1.9 /opt/nutch

mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.orig
cp /opt/nutch/conf/schema.xml /etc/solr/conf/schema.xml
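Before moving on, it can help to confirm that the steps in Listing 1 left the files where the rest of this workshop expects them. The following is a minimal sketch and not part of the original listing: the check_install function name is hypothetical, and its default paths are simply the ones used in Listing 1.

```shell
# Hypothetical helper: confirm that the files from Listing 1 are in place.
# Both arguments are optional and default to the paths used in Listing 1.
check_install() {
    nutch_home="${1:-/opt/nutch}"
    solr_conf="${2:-/etc/solr/conf}"
    status=0

    # The Nutch binary distribution ships its launcher script as bin/nutch.
    [ -x "$nutch_home/bin/nutch" ] || {
        echo "missing: $nutch_home/bin/nutch"; status=1; }

    # The original Solr schema should have been preserved as a backup.
    [ -f "$solr_conf/schema.xml.orig" ] || {
        echo "missing: $solr_conf/schema.xml.orig"; status=1; }

    # The active schema should be identical to the one supplied by Nutch.
    cmp -s "$nutch_home/conf/schema.xml" "$solr_conf/schema.xml" || {
        echo "schema.xml differs from the Nutch copy"; status=1; }

    return "$status"
}
```

Calling check_install without arguments right after Listing 1 should print nothing and return 0. Once you start editing the active schema.xml, the last check will report a difference, which is expected.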
By default, the server does not save the content of pages or documents it finds. When it re-indexes, it transfers all the contents again. If you want to enable caching, edit the /etc/solr/conf/schema.xml configuration file and change stored="false" to stored="true" in the following line:
<field name="content" type="text" stored="false" indexed="true"/>
Then restart Tomcat by typing service tomcat6 restart.
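If you would rather script the schema change than edit the file by hand, something like the following sketch could work. The enable_content_storage function name is hypothetical, and the sketch assumes the content field is declared on a single line, as in the schema line quoted above.

```shell
# Hypothetical helper: switch the "content" field to stored="true".
# Assumes the <field name="content" .../> declaration sits on one line.
enable_content_storage() {
    schema="$1"

    # Keep a copy of the file as it was before the change.
    cp "$schema" "$schema.bak" || return 1

    # Rewrite only lines that declare the content field;
    # every other field keeps its stored="false" setting.
    sed '/<field name="content"/s/stored="false"/stored="true"/' \
        "$schema.bak" > "$schema"
}
```

On the system described here, you would call enable_content_storage /etc/solr/conf/schema.xml as root and then restart Tomcat with service tomcat6 restart.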
Configuring the Nutch Crawler
Although you can control the crawler's default behavior with the /opt/nutch/conf/nutch-default.xml file, it makes more sense to customize the /opt/nutch/conf/nutch-site.xml file with site-specific details.
The example in Listing 2 shows how you can configure the name of the HTTP agent. This name will appear in the web server's logfiles.
Listing 2
nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Company Search Agent</value>
  </property>
</configuration>
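The nutch-default.xml file contains various settings that control the crawler's behavior. To override one of them, add a property with the same name to nutch-site.xml. For example, setting file.content.ignored to false tells Nutch to keep the content of the files it fetches rather than discard it: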
<property>
  <name>file.content.ignored</name>
  <value>false</value>
</property>
Additionally, you need Nutch to remove any documents that the users have deleted in the meantime from the search engine's database:
<property>
  <name>db.update.purge.404</name>
  <value>true</value>
</property>
The fetcher.server.delay property keeps the search engine from overloading a server with requests. On a local network, however, with few servers and clients compared with the Internet, the five-second default delay between two requests to the same server leaves an unnecessarily large number of threads idle, which slows down the search engine. A value of 0.0 disables the delay:
<property>
  <name>fetcher.server.delay</name>
  <value>0.0</value>
</property>
On a local network, it makes sense to disable the delay in this way and to restore a non-zero value only if problems occur.
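A rough calculation shows why the delay matters on a small network. The sketch below assumes that all pages live on a single server and that the crawler sends that server only one request at a time; the min_wait_seconds function name and the page count are made up for illustration.

```shell
# Hypothetical helper: lower bound on the time a crawl spends waiting
# when it fetches PAGES documents from one server and pauses DELAY
# (whole seconds) between consecutive requests to that server.
min_wait_seconds() {
    pages="$1"
    delay="$2"
    # One pause between each pair of consecutive requests.
    echo $(( (pages - 1) * delay ))
}

min_wait_seconds 10000 5   # prints 49995
```

With the five-second default, 10,000 pages on one intranet server would mean nearly 14 hours of waiting alone, whereas a delay of 0 removes that lower bound entirely.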
Large Documents
On the Internet, it is sometimes useful to index large documents, but you need to be careful not to let the crawler get hung up on a gigantic tome with no useful information. Nutch provides a family of content.limit parameters (file.content.limit, http.content.limit, and ftp.content.limit) that set the maximum size, in bytes, of the content the crawler processes (Listing 3). You can also limit the length of the document title, say, to achieve a more informative view in the search results; this value is in characters, not bytes:
<property>
  <name>indexer.max.title.length</name>
  <value>150</value>
</property>
Another useful variable, fetcher.threads.fetch, defines the number of concurrent threads reading content. Lowering http.timeout reduces the time a thread waits before a request times out.
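The two properties just mentioned follow the same pattern as the other nutch-site.xml entries. As a sketch, a small shell function can print such entries for pasting into the <configuration> element; the nutch_property name is hypothetical, and the values 20 and 5000 are illustrative assumptions rather than recommendations. Nutch's default configuration documents http.timeout as a value in milliseconds.

```shell
# Hypothetical helper: print one <property> entry for nutch-site.xml.
# The value is inserted as is, so it must not contain XML special
# characters such as < or &.
nutch_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' \
        "$1" "$2"
}

# Illustrative values only (assumptions, not taken from the article).
nutch_property fetcher.threads.fetch 20
nutch_property http.timeout 5000
```

Generating entries this way keeps the element layout consistent when you are trying out several values.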
Listing 3
File Lengths
<property>
  <name>file.content.limit</name>
  <value>131072</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>131072</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>131072</value>
</property>
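To see which documents a limit like the one in Listing 3 would affect, you can list the files that exceed it before you crawl. This is a minimal sketch: the find_oversized function name is hypothetical, and the directory you pass in is whatever document tree your HTTP or FTP server exposes.

```shell
# Hypothetical helper: list files whose size exceeds a content limit.
# The limit is given in bytes and defaults to the value in Listing 3.
find_oversized() {
    dir="$1"
    limit="${2:-131072}"
    # -size +Nc matches regular files larger than N bytes.
    find "$dir" -type f -size "+${limit}c"
}
```

If find_oversized reports many files that matter to your users, raising the limits is probably a better choice than accepting incomplete content for those documents.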