Apache Solr and Nutch

I have not posted for quite a long time. I've been busy with the new project I am now on. As the sweet fruit of that labor, I would like to share some of the knowledge I have gained in this new venture.

I was assigned to the global IT projects together with two colleagues. There are a lot of new technologies we need to cope with, and because we deal with the company's most critical departments, marketing and sales, every detail is scrutinized and studied thoroughly. Quality and reliability of the system are the top priorities of the development. Part of the work is server configuration and optimization, which supports the system's search management and site traffic. The company's standard tools for this are Solr and Nutch, which work together well, so we ended up studying both.

A brief description:

Apache Solr - an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Apache Tomcat. [Wikipedia.org]

Apache Nutch - open source web-search software. It is built on Lucene Java and adds web-specific components, such as a crawler, a link-graph database, and parsers for HTML and other document formats. [Lucene.apache.org]

Nutch is coded completely in the Java programming language, but its data is written in language-independent formats.
Nutch has a highly modular architecture that allows developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering.
The fetcher ("robot" or "web crawler") has been written from scratch solely for this project. [Wikipedia.org]

In general, Solr and Nutch work well together because both grew out of the Apache Lucene project. Solr is responsible for the search side of the system: it holds the index and returns the matching results. When a user searches for an item the first time, Solr caches the query and its results so that repeated searches come back faster. It also has a good API that returns results as JSON or XML. Besides caching, Solr has the following features:

• Uses the Lucene library for full-text search
• Faceted navigation
• JSON, XML, PHP, Ruby, Python and custom Java binary output formats over HTTP
• HTML administration interface
• Replication to other Solr servers
• Extensible through plugins
• Distributed Search
• Caching
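
To make the "output formats over HTTP" point a bit more concrete, here is a minimal Python sketch of how a client might build a Solr query URL and read the JSON that comes back. The host, port, and core name (`mycore`) are assumptions, as are the field names `title` and `category`; older Solr releases use a slightly different URL path, so check your own install. The sample response is hand-written in the shape Solr returns, not real output.

```python
import json
from urllib.parse import urlencode

# Assumed values: adjust host, port, and core name to your own install.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_query_url(text, facet_field=None, rows=10):
    """Build a Solr /select URL that asks for a JSON response."""
    params = [("q", text), ("wt", "json"), ("rows", rows)]
    if facet_field:
        # Faceted navigation: ask Solr to count hits per field value.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def extract_docs(raw_json):
    """Return (total hits, list of documents) from a Solr JSON response."""
    body = json.loads(raw_json)["response"]
    return body["numFound"], body["docs"]


# Hand-written sample in the shape Solr returns; not real output.
SAMPLE_RESPONSE = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 1, "start": 0,
              "docs": [{"id": "page-1", "title": "Pricing"}]}}
"""

url = build_query_url("title:pricing", facet_field="category")
total, docs = extract_docs(SAMPLE_RESPONSE)
print(url)
print(total, docs[0]["title"])
```

In a real application you would fetch `url` with an HTTP client and pass the response body to `extract_docs`. Changing `wt` to `xml` asks Solr for XML instead.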

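On the other hand, Nutch acts as the crawler, much like the "Google crawler" and the crawlers of other search engines (I'm totally a big fan of Google. *grin). Nutch does not run when the user hits the search button. It crawls the web application ahead of time, following links from page to page, parses each page it fetches, and sends the content and its location to Solr for indexing. When the user then searches, Solr looks the item up in that index and returns its details if it is found.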
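
The division of labor between a crawler and a search index can be sketched with a toy example. This is not how Nutch or Solr are implemented; it is a small in-memory illustration, and the `SITE` dictionary, its pages, and the function names are all made up for the purpose. The point is that crawling and indexing happen as a batch step up front, and a search only consults the index.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, outgoing links).
SITE = {
    "/": ("welcome to our shop", ["/products", "/contact"]),
    "/products": ("solr powered product search", ["/"]),
    "/contact": ("contact the sales team", []),
    "/orphan": ("no page links here", []),
}


def crawl(seed):
    """Breadth-first crawl from the seed URL, visiting each page once."""
    seen, queue, docs = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        docs.append({"id": url, "content": text})
        for link in links:
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc in docs:
        for word in doc["content"].split():
            index.setdefault(word, set()).add(doc["id"])
    return index


def search(index, word):
    """Answer a query from the index alone; no crawling happens here."""
    return sorted(index.get(word.lower(), set()))


index = build_index(crawl("/"))   # batch step, done ahead of time
print(search(index, "search"))    # query step, done per user request
print(search(index, "links"))     # "/orphan" was never reached
```

The page `/orphan` never shows up in results because no crawled page links to it, which is the same reason a real crawler can miss pages that are not reachable from its seed URLs.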

To incorporate these in your web application, download the latest releases of Solr and Nutch, install them on your server, and follow the configuration guidelines for each. Voilà! Your system will have its own search engine with no sweat! *wink
