Hyper Estraier

Searching the Scene

Hendrik Weimer

2006-04-03

One way to appreciate the sheer genius behind Google's search engine is to try indexing and searching a reasonable amount of data yourself. In terms of speed and accessibility, every search application has to compete with the king of the hill. This goes for both desktop and web search solutions, and Hyper Estraier plays in both markets.

Hyper Estraier has a lot in common with the famous giant. Both use a full inverted index to process search queries. The query syntax is very similar as well, although Hyper Estraier offers a few more features, such as support for queries with regular expressions. Unfortunately, it is not possible to restrict the allowed query types, and you probably don't want a regexp implementation lying open to attack from the whole world. Some fine-tuning via an ACL mechanism would be really helpful here.
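
To illustrate what an inverted index is, here is a minimal Python sketch. It is a toy model and not Hyper Estraier's actual implementation: the function names `tokenize`, `build_index` and `search` are made up for this example, and the index simply maps each word to the set of documents containing it, so a multi-word query becomes a set intersection.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing all words of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start with the postings of the first word, then narrow them down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    docs = {
        "a": "Searching the scene",
        "b": "The mail spool",
        "c": "mail search scene",
    }
    index = build_index(docs)
    print(sorted(search(index, "mail scene")))
```

Building the mapping is the expensive part, while answering a query only touches the entries for the words in the query. That is why lookups stay fast even when the index is large.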

The most time-consuming task is creating the inverted index from scratch. Of course, the larger the amount of data, the longer it takes. As a rough estimate, indexing a 20 MB mail spool takes about twenty minutes on an average PC. There are no real limits on how large the index may get, but Hyper Estraier probably won't scale as well as Google's secret algorithms. To keep the indexer from swapping itself to death, the maximum size of the cache used during indexing can be adjusted.

Besides indexing HTML documents and plain text files, Hyper Estraier natively supports e-mails stored as MIME messages as well. If you require additional formats, you can call an external filter program to index anything that can somehow be converted to text or HTML. If meta information about a document, such as its author or title, should be preserved, a special document format can be used to feed this information into the index database.
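
As a rough sketch of how such a document could be generated, the following Python helper writes meta information as attribute lines, followed by an empty line and the body text. The layout (`@name=value` lines, then a blank line, then the text) is an assumption based on the "document draft" format described in the Hyper Estraier user's guide and should be checked against the installed version. The helper name `make_draft` and the example URI are hypothetical.

```python
def make_draft(uri, text, **attrs):
    """Render a document in an attribute-header-plus-body layout.

    Assumed layout (verify against the Hyper Estraier user's guide):
    one "@name=value" line per attribute, an empty line, then the body.
    """
    def one_line(value):
        # Attribute values must not span several lines.
        return " ".join(str(value).split())

    lines = ["@uri=" + one_line(uri)]
    for name, value in sorted(attrs.items()):
        lines.append("@{}={}".format(name, one_line(value)))
    lines.append("")  # empty line separates attributes from the body
    # Blank body lines and surrounding whitespace are dropped for simplicity.
    lines.extend(line.strip() for line in text.splitlines() if line.strip())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_draft("file:///tmp/a.txt", "Hello\n\nWorld",
                     title="Greeting", author="Me"), end="")
```

An external filter would print text like this for each converted file, so that the author and title end up in the index alongside the body.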

After documents change, the index needs to be updated. This is done with a special command, which runs much faster than building the index from scratch. Removing deleted documents from the index has to be done separately. To improve performance, the index should be optimized from time to time. All these tasks can be automated with cron jobs.
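
The maintenance cycle described above could be wrapped in a small script that cron calls, for example once per night. The sketch below assumes the `estcmd` command-line tool with its `gather`, `purge` and `optimize` subcommands. All option flags (for example, for skipping unmodified files) are deliberately left out and should be taken from the documentation of the installed version. The index name `casket` and the directory `/var/mail` are placeholders.

```python
import subprocess


def maintenance_commands(index_dir, source_dir):
    """Return the three maintenance steps as argument lists.

    Assumption: estcmd provides the subcommands gather, purge and optimize.
    """
    return [
        ["estcmd", "gather", index_dir, source_dir],  # add new or changed documents
        ["estcmd", "purge", index_dir],               # drop documents that no longer exist
        ["estcmd", "optimize", index_dir],            # compact the index
    ]


def run_maintenance(index_dir, source_dir, dry_run=True):
    """Run the steps in order; with dry_run=True only print them."""
    for cmd in maintenance_commands(index_dir, source_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True aborts the cycle if one step fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_maintenance("casket", "/var/mail", dry_run=True)
```

The dry run is the default so that the commands can be inspected before the script is allowed to touch a real index.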

Unless you are happy with a command-line tool, Hyper Estraier requires a web server to perform searches; there is no stand-alone GUI application. A CGI script helps to set up a user interface. The server running the CGI script doesn't have to be the same machine that holds the actual data, though. In contrast to other projects like ht://Dig or Swish-e, the script can be easily customized with templates. Other ways to access the index are a library with bindings for several languages and a separate Apache module.

Searching is extremely fast: even a search through an index of more than 10,000 documents takes much less than a second. If that is still insufficient, Hyper Estraier offers a peer-to-peer architecture that allows the index database to be distributed across multiple servers.

Some other nice features, such as full Unicode compliance and support for multiple languages, round off the picture. Currently, Hyper Estraier is the best search solution that is suitable both for providing a search frontend for a website and for indexing a large number of documents on a single computer.

Hyper Estraier
Version: 1.0.6
Homepage: http://hyperestraier.sourceforge.net/
License: LGPL
Distributions: □ Debian stable, ■ Debian unstable, □ Fedora, □ Mandriva, □ Suse, ■ Ubuntu
Rating: 85
Pros:
  • Customizable with templates
  • Google-like query syntax
  • Regexp queries
Cons:
  • Web server required
  • Poor query ACLs

Copyright 2006–2008 OS Reviews. This document is available under the terms of the GNU Free Documentation License. See the licensing terms for further details.