Indexing and searching text with Lucene
Smart Search
Even state-of-the-art computers need to use clever methods to process ever-increasing amounts of document data. The open source Lucene framework uses inverted indexing for fast searches of document collections.
Nowadays, almost any commercially available hard drive can store more text than a whole library. In the digital world, a traditional system such as a card catalog or a knowledgeable librarian is no longer adequate to help find the right shelf. Even software equivalents such as find or zgrep are not always fast enough to track down a particular piece of information among gigabytes or terabytes of data.
The science that deals with this type of search problem is called information retrieval. Computer scientists have developed sophisticated methods for tracking down files that users don’t even know exist. The free Java library Lucene implements some of these methods. Doug Cutting published an early version of Lucene in 1999. Two years later, the project, which carries the middle name of Cutting’s wife, came under the auspices of the Apache Software Foundation when it joined the Apache Jakarta Project. Version 4.0 of Lucene has been available since October 2012. The index file structures are backward compatible, so existing indexes remain readable after the transition from 3.6 to 4.0. Over the years, Lucene has become one of the most widely used solutions for indexing and searching text. (See the box titled “Lucene In All Its Facets.”)
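To make the idea of inverted indexing more concrete, the following is a minimal sketch in plain Java. It does not use Lucene and does not reflect Lucene's actual file formats or API. The class name InvertedIndexSketch, its methods, and the simplistic tokenizer are all assumptions made for illustration. The sketch only shows the basic principle: rather than scanning every document for a term, the index maps each term to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index (hypothetical class, not part of Lucene):
// maps each term to the sorted set of document IDs that contain it.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Simplistic analyzer (an assumption): lowercase the text and
    // split it on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Index one document: record its ID under every term it contains.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up a single term; this is a map lookup, not a scan of all documents.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene is a Java library");
        index.add(1, "A card catalog helps find the right shelf");
        index.add(2, "Lucene builds an inverted index");
        System.out.println(index.search("lucene")); // prints [0, 2]
    }
}
```

The cost of building the map is paid once at indexing time; each later query is a lookup whose cost does not grow with the total amount of text, which is the property that tools such as find or zgrep lack.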