Smart research using Elasticsearch

Fine-Tuning

The min_term_freq parameter sets a threshold for selecting words from the reference document in a more_like_this query. With the default value of 2, a word must occur at least twice in the reference document to make its way into the list of words against which other documents are later compared. The second parameter, max_query_terms, caps the number of words from that list that the algorithm actually uses in the query.
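To make the interplay of the two parameters more tangible, here is a minimal Python sketch. It is not how Elasticsearch is implemented: the select_terms() helper is a hypothetical, simplified stand-in that ranks words by raw count, whereas the real engine weighs terms by tf-idf [4]. The build_mlt_query() helper is likewise hypothetical and only assembles a more_like_this request body as a dictionary; the field name "content" and the sample text are assumptions for illustration.

```python
import json
import re
from collections import Counter


def select_terms(text, min_term_freq=2, max_query_terms=25):
    """Simplified term selection: ranks by raw count, not by tf-idf."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    # Keep only words that reach the min_term_freq threshold.
    frequent = [(w, n) for w, n in counts.items() if n >= min_term_freq]
    # Highest count first; ties are broken alphabetically for stable output.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    # Cap the list at max_query_terms entries.
    return [w for w, _ in frequent[:max_query_terms]]


def build_mlt_query(like_text, fields, min_term_freq=2, max_query_terms=25):
    """Assemble a more_like_this request body as a plain dict."""
    return {
        "query": {
            "more_like_this": {
                "fields": fields,          # hypothetical field names
                "like": like_text,
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


if __name__ == "__main__":
    doc = "search engine finds text; the engine ranks text by term weight"
    # Only "engine" and "text" occur twice, so only they pass the threshold.
    print(select_terms(doc, min_term_freq=2, max_query_terms=2))
    print(json.dumps(build_mlt_query(doc, ["content"]), indent=2))
```

Raising min_term_freq shrinks the candidate list, while lowering max_query_terms trims it after ranking. In the sample text, a threshold of 3 would leave no words at all to compare.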

If you want to find out about other methods for fine-tuning the search engine, I recommend the O'Reilly book on the topic [1]. It explains how to work with Elasticsearch using examples, provides tips for scaling in clusters, and takes a look behind the scenes, where the Apache Lucene search engine is at work.

Infos

  1. Gormley, Clinton, and Zachary Tong. Elasticsearch: The Definitive Guide. O'Reilly, 2015.
  2. Elasticsearch: https://www.elastic.co
  3. "Perl: Elasticsearch" by Mike Schilli, Linux Magazine, issue 162, pg. 66, 2014: http://www.linux-magazine.com/Issues/2014/162/Perl-Elasticsearch/(language)/eng-US
  4. Tf-idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
  5. More Like This Query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
  6. Listings for this article: ftp://ftp.linux-magazine.com/pub/listings/magazine/182/

The Author

Mike Schilli works as a software engineer in the San Francisco Bay Area. He can be contacted at mschilli@perlmeister.com. Mike's homepage can be found at http://perlmeister.com.
