An indexing search engine with Nutch and Solr
Indexed
If you want to prevent a search engine server from accessing the Internet, you can define a very short HTTP timeout using a firewall. The search engine will find external URLs in the documents, but its attempts to resolve the URLs will fail if you set a sufficiently short timeout. The crawler will therefore find external URLs, but not reach them and thus not add them to the database.
For a cleaner approach, you can use the regex-urlfilter.txt file in Nutch's conf directory. The regex-urlfilter.txt file lets you define exceptions. (Nutch already has some default rules that prevent the crawler from reading unnecessary files such as CSS files or images.)
The following rule
-^(http|https)://www.wikipedia.com
stops Nutch from following links to http://www.wikipedia.com. It makes even more sense to define a whitelist and thus only permit individual servers:
+^(http|https)://intranet.company.local
Or you can specify multiple addresses using regular expressions:
+^http://([a-z0-9\-A-Z]*\.)*company.local/([a-z0-9\-A-Z]*\/)*
The important thing is to correct the last line, which defines the general policy:
# accept anything else
+.
and replace it with:
# deny anything else
-.
You need to deny unspecified traffic in order for the list to serve as a whitelist.
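The way such a rule list is evaluated can be sketched in a few lines of JavaScript. This is a simplified model, not Nutch's own code: it assumes that the first matching rule decides and that a URL matching no rule is rejected, and it uses JavaScript's RegExp as a stand-in for the Java regex engine. The function names are made up for illustration.

```javascript
// Simplified model of a regex-urlfilter.txt rule list (not Nutch's code).
// Assumption: the first matching rule decides; no match means "reject".
// Each rule is a sign ('+' accept, '-' reject) followed by a regular expression.
const rules = [
  '+^(http|https)://intranet.company.local',
  '-.', // deny anything else
];

// Split a rule line into its sign and its pattern.
function parseRule(line) {
  return { accept: line[0] === '+', pattern: new RegExp(line.slice(1)) };
}

// Return true if the URL passes the filter.
function isAllowed(url, ruleLines) {
  for (const rule of ruleLines.map(parseRule)) {
    if (rule.pattern.test(url)) {
      return rule.accept;
    }
  }
  return false;
}

console.log(isAllowed('http://intranet.company.local/wiki/', rules)); // true
console.log(isAllowed('http://www.wikipedia.com/', rules));           // false
```

Trying candidate URLs against a model like this before a long crawl can reveal a rule that is too strict or too permissive.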
On Your Marks
Everything is now set up, and the search can start. All you need to do is tell the crawler where to begin its journey. Create a urls subdirectory below /opt/nutch/ and, in it, a seed file that contains the start URLs:
mkdir /opt/nutch/urls
echo "http://intranetserver.company.local" > /opt/nutch/urls/seed.txt
The seed file can contain any number of URLs, one per line.
Then start the crawler:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/jre/
/opt/bin/crawl /opt/nutch/urls/ /opt/nutch/IntranetCrawler/ http://localhost:8080/solr/ 10
The first parameter in the crawl command specifies the directory containing the seed.txt file.
Nutch runs fetcher processes that load and parse the discovered content. The second argument, /opt/nutch/IntranetCrawler, specifies the directory in which Nutch stores this content. The third argument is the address of the Solr server, including the port, to which Nutch sends the results.
The number 10 at the end states the number of crawler runs. Depending on the pages it finds and the search depth, it can take some time for the command to complete. For initial tests, you might prefer a value of 1 or 2.
When the fetcher downloads and parses the results, it typically finds more links to more content. These links end up in its link database. On the next run, the crawler reads these URLs too and hands them over to the fetcher processes. This to-do list with links for the crawler grows very quickly at the start, because the crawler can only process a certain amount of content during each run.
Nutch breaks down the discovered links into segments, which it processes one by one. A segment only contains a certain number of links; for this reason, it can happen that the crawler creates new segments immediately after the first one that it is processing. The new links found in this process end up in new segments.
The fact that the script has stopped running does not mean the crawler has found all the content; some content may have been saved for fetcher processes later on.
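A toy model makes it easier to see why a finished script can still leave work behind. The following sketch is not how Nutch is implemented. The link graph, the segment size, and the function name are invented for illustration; each round takes a limited batch of unfetched links, fetches them, and adds newly discovered links to the list of known URLs.

```javascript
// Toy model of the crawl cycle (invented data, not Nutch's implementation).
const linkGraph = {
  '/': ['/a', '/b', '/c'],
  '/a': ['/a1', '/a2'],
  '/b': ['/b1'],
  '/c': [],
  '/a1': [], '/a2': [], '/b1': [],
};

function crawl(seeds, rounds, segmentSize) {
  const known = new Set(seeds); // every URL discovered so far
  const fetched = new Set();    // URLs already downloaded and parsed
  for (let round = 1; round <= rounds; round++) {
    // Generate: pick at most segmentSize URLs that are known but not yet fetched.
    const segment = [...known].filter((u) => !fetched.has(u)).slice(0, segmentSize);
    if (segment.length === 0) break; // nothing left to do
    // Fetch and parse: record the outgoing links of each page in the segment.
    for (const url of segment) {
      fetched.add(url);
      for (const link of linkGraph[url] || []) known.add(link);
    }
    console.log(`round ${round}: fetched ${segment.length}, pending ${known.size - fetched.size}`);
  }
  return { fetched: [...fetched], pending: [...known].filter((u) => !fetched.has(u)) };
}

// Two rounds with two URLs per segment leave four known URLs unfetched.
const result = crawl(['/'], 2, 2);
console.log(result.pending);
```

In this model, the pending list only empties once the number of rounds is large enough for the segment size, which mirrors the trade-off behind the last argument of the crawl command.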
After the first run, however, the content should all be available for displaying in the Solr web interface: http://<Searchserver>:8080/solr/admin/ (see also Figure 1).
How often the crawler runs (every night, once a month, or on the weekend) depends on the data volume and your need for up-to-date results. Keep in mind that Nutch only finds data to which a link points, whether from a website or from an indexed document. For the crawler, unlinked documents effectively do not exist unless they show up in FTP or HTTP directory listings.
Querying with jQuery
Admins typically integrate the Solr search directly into an existing intranet portal. Solr provides an API for this purpose that manages database access and returns the search results. DIY queries with jQuery are a useful solution. Listing 4 shows the HTML code for a simple website with jQuery scripts (see Figure 2).
Listing 4
jQuery Query
<html>
<head>
  <title>Example Search</title>
</head>
<body>
  <h3>Simple Search Engine</h3>
  Search: <input id="query" />
  <button id="search">Search</button> (Example: "content:foobar"; "url:bar"; "title:foo")
  <hr/>
  <div id="results">
  </div>
  <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
  <script>
    // Callback that Solr wraps around the JSON response (see json.wrf below)
    function on_data(data) {
      $('#results').empty();
      var docs = data.response.docs;
      $.each(docs, function(i, item) {
        // Show at most the first 400 characters; a document may lack content
        var content = item.content || '';
        var contentpart = content.length > 400 ? content.substring(0, 400) : content;
        $('#results').prepend($(
          '<strong>' + item.title + '</strong><br/>' +
          '<a href="' + item.url + '" target="_blank">' + item.url + '</a>' +
          '<br/><div style="font-size:80%;">' + contentpart + '</div><hr/>'));
      });
      var total = 'Found pages: ' + docs.length + '<hr/>';
      $('#results').prepend('<div>' + total + '</div>');
    }

    // Send the query to Solr; an empty search field is ignored
    function on_search() {
      var query = $('#query').val();
      if (query.length == 0) {
        return;
      }
      var solrServer = 'http://SEARCHSERVER:8080/solr';
      var url = solrServer + '/select/?q=' + encodeURIComponent(query) +
        '&version=2.2&start=0&rows=50&indent=on&wt=json&callback=?&json.wrf=on_data';
      $.getJSON(url);
    }

    function on_ready() {
      $('#search').click(on_search);
      /* Hook enter to search */
      $('body').keypress(function(e) {
        if (e.keyCode == 13) {
          on_search();
        }
      });
    }
    $(document).ready(on_ready);
  </script>
</body>
</html>
Figure 3 shows the server's response in the XML code transferred – a simple example with primitive search requests against the Solr back end.
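Listing 4 requests JSON (wt=json) and reads the hits from data.response.docs. The following sketch works on a hand-written sample in that shape; all field values are invented, and the summarize function is a hypothetical helper. It also reads numFound, the total number of matches in the index, which can be larger than the number of documents returned when the rows parameter limits the result page.

```javascript
// Hand-written sample in the shape of a Solr JSON response (values invented).
const sample = {
  responseHeader: { status: 0, QTime: 3 },
  response: {
    numFound: 128,
    start: 0,
    docs: [
      { title: 'Linux on the intranet', url: 'http://intranet.company.local/linux', content: 'x'.repeat(1000) },
      { url: 'http://intranet.company.local/untitled' }, // no title, no content
    ],
  },
};

// Hypothetical helper: reduce a response to what a results page needs.
function summarize(data, maxLen = 400) {
  const { numFound, docs } = data.response;
  return {
    total: numFound,    // all matches in the index, not just this page
    shown: docs.length, // limited by the rows parameter
    items: docs.map((doc) => ({
      title: doc.title || doc.url, // fall back to the URL if the title is missing
      url: doc.url,
      snippet: (doc.content || '').substring(0, maxLen),
    })),
  };
}

const summary = summarize(sample);
console.log(`${summary.shown} of ${summary.total} results`); // 2 of 128 results
```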
In the search query, the user has many options. You can rank the page title higher than the content. The statement content:(linux) title:(linux)^1.5 gives a match in the title one and a half times more weight than a match that is found in the document body. On the other hand, you can search for pages that contain the word "Linux" without the word "Debian." In this case, you might still want to give the title preferential treatment:
content:(linux -debian) title:(linux -debian)^1.5
Logical ANDs are easily achieved with a simple plus sign; OR-ing is the default; thus, content:(+linux +debian) searches for Linux and Debian.
Without the plus sign, the Solr-Nutch duo would return any document containing Linux or Debian. Quotes let you search for exact phrases, as in content:("Linux Live USB Stick").
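A point-and-click front end would have to assemble such query strings from form fields. The following is a minimal sketch under stated assumptions: buildQuery and buildSelectUrl are hypothetical helpers rather than part of Solr or any client library, and the terms are assumed to be plain words without characters that have a special meaning in the query syntax.

```javascript
// Hypothetical helper: search body and title, giving title matches more weight.
// Assumes plain words; special query characters are not escaped here.
function buildQuery({ must = [], mustNot = [], phrase = null, titleBoost = 1.5 }) {
  const terms = [
    ...must.map((t) => `+${t}`),    // "+term" is required (AND)
    ...mustNot.map((t) => `-${t}`), // "-term" is excluded
  ];
  if (phrase) terms.push(`"${phrase}"`); // quotes search for the exact phrase
  const clause = terms.join(' ');
  return `content:(${clause}) title:(${clause})^${titleBoost}`;
}

// Hypothetical helper: request URL for the select handler used in Listing 4.
function buildSelectUrl(solrServer, query, rows = 50) {
  return `${solrServer}/select/?q=${encodeURIComponent(query)}&start=0&rows=${rows}&wt=json`;
}

const q = buildQuery({ must: ['linux'], mustNot: ['debian'] });
console.log(q); // content:(+linux -debian) title:(+linux -debian)^1.5
console.log(buildSelectUrl('http://SEARCHSERVER:8080/solr', q));
```

Passing the query through encodeURIComponent matters here, because an unencoded plus sign in a URL would reach the server as a space.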
These simple forms of user interaction give you a good idea of the potential this option offers for jQuery and web programmers. What you will need for a professional-looking search engine front end is forms and point-and-click queries, as well as input validation and the ability to highlight the search key in the results.
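Input validation and highlighting can start out small. The sketch below is one possible approach, not a finished front end: the function names, the 200-character limit, and the use of mark tags are assumptions. It escapes indexed text before it reaches the page, so that markup found in crawled documents is displayed rather than interpreted, and it wraps each occurrence of the search word in a highlight tag.

```javascript
// Escape characters that have a special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Trim the input and reject empty or overlong queries (200 is an arbitrary limit).
function validateQuery(input) {
  const query = String(input || '').trim();
  return query.length > 0 && query.length <= 200 ? query : null;
}

// Wrap each case-insensitive occurrence of word in <mark> tags.
// The text is split on the raw word first and escaped piece by piece,
// so the highlighting never cuts into an HTML entity.
function highlight(text, word) {
  if (!word) return escapeHtml(text);
  const escapedWord = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp('(' + escapedWord + ')', 'gi');
  return String(text)
    .split(pattern) // odd indexes hold the matches
    .map((part, i) => (i % 2 === 1 ? '<mark>' + escapeHtml(part) + '</mark>' : escapeHtml(part)))
    .join('');
}

console.log(highlight('Linux <b>Live</b> USB Stick for linux fans', 'linux'));
// <mark>Linux</mark> &lt;b&gt;Live&lt;/b&gt; USB Stick for <mark>linux</mark> fans
```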
Do-It-Yourself Search Engine
Infos
- Google Search: https://www.google.com/work/search/
- Nutch: https://nutch.apache.org
- Solr: https://lucene.apache.org/solr/
- Lucene: https://lucene.apache.org