
Apache Solr and Nutch

I have not posted for quite a long time. I’ve been busy with the new project I am now on. As a sweet fruit of that labor, I would like to share some of the knowledge I have gained in this new venture.


I was assigned to the global IT projects together with two colleagues. There are a lot of new technologies that we need to cope with, and because we are dealing with the critical, prized departments of the company, marketing and sales, every detail is scrutinized and studied thoroughly. Quality and reliability of the system are the top priorities of the development. Part of the work is server configuration and optimization, which supports the system's search management and site traffic. The standard tools used by the company are Solr and Nutch, which work together pretty well, so we ended up studying these technologies.


A brief description:


Apache Solr - is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Apache Tomcat. [Wikipedia.org]


NUTCH - is open source web-search software. It was built on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, and parsers for HTML and other document formats. [Lucene.apache.org]


It is coded completely in the Java programming language, but data is written in language-independent formats.
Nutch has a highly modular architecture allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering.
The fetcher (“robot” or “web crawler”) has been written from scratch solely for this project. [Wikipedia.org]


In general, Solr and Nutch work well together, since both come from the Apache Lucene family of projects and build on the Lucene library. Solr is responsible for the search-and-return capability of the system. When a user searches for an item, Solr runs the query and caches the search and its results, so repeated searches can be answered faster. It has a good API that returns results in formats such as JSON and XML. Its main features are:

• Uses the Lucene library for full-text search
• Faceted navigation
• JSON, XML, PHP, Ruby, Python and custom Java binary output formats over HTTP
• HTML administration interface
• Replication to other Solr servers
• Extensible through plugins
• Distributed Search
• Caching
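As a rough illustration of the JSON output and faceting features listed above, here is a minimal Python sketch that builds a Solr select URL and reads a JSON reply. The host `http://localhost:8983/solr`, the core name `pages`, the field names and the sample reply are all assumptions made for illustration; the exact URL path depends on your Solr version and setup. No request is actually sent, so the sketch runs without a Solr server.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, core, query, facet_field=None, rows=10):
    """Build a Solr /select URL that asks for JSON output (wt=json)."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Faceted navigation: ask Solr to count hits per value of one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def parse_select_response(body):
    """Return (total hits, documents, facet counts) from a Solr JSON reply."""
    data = json.loads(body)
    response = data["response"]
    facets = {}
    facet_fields = data.get("facet_counts", {}).get("facet_fields", {})
    for field, flat in facet_fields.items():
        # Default JSON facet layout is a flat list: [value, count, value, count, ...]
        facets[field] = dict(zip(flat[0::2], flat[1::2]))
    return response["numFound"], response["docs"], facets


# Hand-written sample shaped like a Solr reply (hypothetical data, not real output).
SAMPLE_BODY = json.dumps({
    "responseHeader": {"status": 0, "QTime": 1},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "http://example.com/rooms", "title": "Rooms"},
            {"id": "http://example.com/rates", "title": "Rates"},
        ],
    },
    "facet_counts": {"facet_fields": {"site": ["example.com", 2]}},
})

url = build_select_url("http://localhost:8983/solr", "pages", "title:hotel", "site")
total, docs, facets = parse_select_response(SAMPLE_BODY)
print(url)
print(total, [d["title"] for d in docs], facets)
```

In a real application you would fetch `url` with an HTTP client and pass the response body to `parse_select_response`.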

On the other hand, Nutch acts as the crawler, similar to the “Google crawler” and other search engine crawlers (I'm totally a big fan of Google. *grin). Nutch does not run when the user hits the search button. It crawls the pages of the web application ahead of time, parses them, and sends their content and locations to Solr to be indexed. When a search is made, Solr looks up the item in that index and returns its details if it is found.
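To make the crawler role more concrete, here is a toy Python sketch of the fetch, parse and index flow. It is not Nutch and does not use any Nutch API: the pages live in an in-memory dictionary (`SITE`) instead of being fetched over HTTP, and the document fields `id`, `title` and `content` are assumed names that would have to match your own Solr schema. The last function only builds the JSON body that could be posted to a Solr update handler.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect the title, visible text and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            self.text_parts.append(data.strip())


def crawl(pages, seed, max_pages=10):
    """Breadth-first walk over `pages`, a dict of URL -> HTML standing in for HTTP fetches."""
    queue, seen, docs = [seed], {seed}, []
    while queue and len(docs) < max_pages:
        url = queue.pop(0)
        html = pages.get(url)
        if html is None:
            continue  # unknown page: a real crawler would record a fetch failure
        parser = PageParser()
        parser.feed(html)
        docs.append({"id": url, "title": parser.title,
                     "content": " ".join(parser.text_parts)})
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links against the page URL
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return docs


def to_solr_update_body(docs):
    """Serialize crawled documents as a JSON array for a Solr update request."""
    return json.dumps(docs)


# Hypothetical two-page site used only for this example.
SITE = {
    "http://example.com/": "<html><head><title>Home</title></head>"
                           "<body><p>Welcome</p><a href='/rooms'>Rooms</a></body></html>",
    "http://example.com/rooms": "<html><head><title>Rooms</title></head>"
                                "<body><p>Sea view rooms</p></body></html>",
}

crawled = crawl(SITE, "http://example.com/")
print(to_solr_update_body(crawled))
```

Nutch does the same kind of work at a much larger scale, with its own fetcher, link database and parser plugins.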


To incorporate these in your web application, download the latest releases of Solr and Nutch, install them on your server, and follow the configuration guidelines for each. Voilà! Your system will have its own search engine with no sweat! *wink
