
Apache Solr and Nutch

I have not posted for quite a long time. I’ve been busy with the new project I am now on. As a sweet fruit of that labor, I would like to share some of the knowledge I have gained in this new venture.


I was assigned to the global IT projects together with two colleagues. There are a lot of new technologies that we need to cope with, and because we are dealing with the company's critical, prized departments, marketing and sales, every detail is scrutinized and studied thoroughly. Quality and reliability of the system are the top priorities of the development. Part of the work is server configuration and optimization, which supports the system's search management and site hits. The company's standard tools for this are Solr and Nutch, which work together pretty well, so we ended up studying these technologies.


A brief description:


Apache Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Apache Tomcat. [Wikipedia.org]


Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats. [Lucene.apache.org]


It is coded completely in the Java programming language, but data is written in language-independent formats.
Nutch has a highly modular architecture allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering.
The fetcher (“robot” or “web crawler”) has been written from scratch solely for this project. [Wikipedia.org]


In general, Solr and Nutch work together harmoniously because both come from the Apache Lucene family of projects. Solr is responsible for the search capability of the system: it indexes the content and returns the results for each query. When a user repeats a search, Solr can serve the results from its caches instead of computing them again. It has a great HTTP API that returns results in formats such as JSON and XML. Along with caching, Solr has the following features:

• Uses the Lucene library for full-text search
• Faceted navigation
• JSON, XML, PHP, Ruby, Python and custom Java binary output formats over HTTP
• HTML administration interface
• Replication to other Solr servers
• Extensible through plugins
• Distributed Search
• Caching
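To make the HTTP and JSON side of this a bit more concrete, here is a minimal sketch in Python of building a query URL for Solr's select handler and reading a JSON response. The host, port, core name (`mycore`), and the `title` and `category` fields are assumptions made up for illustration, and the sample response is hand-written rather than real Solr output.

```python
import json
from urllib.parse import urlencode

# Assumed values for illustration: host, port and core name are hypothetical.
SOLR_BASE = "http://localhost:8983/solr/mycore"

def build_select_url(query, facet_field=None, rows=10):
    """Build a URL for Solr's /select handler that asks for JSON output."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Faceted navigation: ask Solr to count matches per value of a field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return SOLR_BASE + "/select?" + urlencode(params)

def extract_titles(response_text):
    """Pull the 'title' field out of each document in a Solr JSON response."""
    data = json.loads(response_text)
    return [doc.get("title") for doc in data["response"]["docs"]]

# A hand-written sample shaped like a Solr JSON response (not real output).
SAMPLE_RESPONSE = json.dumps({
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "1", "title": "Solr basics"},
            {"id": "2", "title": "Nutch basics"},
        ],
    },
})

url = build_select_url("title:basics", facet_field="category")
titles = extract_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

In a real application the URL would be fetched from a running Solr server instead of using the sample string, and the field names would come from your own schema.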

On the other hand, Nutch acts as the crawler, similar to the “Google crawler” and other search engine crawlers (I'm totally a big fan of Google. *grin). Nutch does not run when the user hits the search button. It crawls the pages of the web application ahead of time, parses them, and passes their content and locations to Solr for indexing, so that Solr can find and return them when a search comes in.
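The division of labor between a crawler and a search index can be sketched with a toy example in Python. This is only an illustration of the idea, not how Nutch or Solr are implemented: the pages live in a made-up in-memory dictionary (`FAKE_SITE`) instead of being fetched over HTTP, and the "index" is a plain dictionary of words.

```python
from html.parser import HTMLParser

# A tiny in-memory "site" standing in for real HTTP fetches (hypothetical pages).
FAKE_SITE = {
    "/": '<a href="/products">Products</a> <a href="/about">About</a> Welcome home',
    "/products": '<a href="/">Home</a> Search servers and crawlers',
    "/about": "We build search tools",
}

class LinkAndTextParser(HTMLParser):
    """Collect outgoing links and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def crawl(start):
    """Crawler side: follow links breadth-first and return parsed documents."""
    seen, queue, docs = {start}, [start], []
    while queue:
        url = queue.pop(0)
        parser = LinkAndTextParser()
        parser.feed(FAKE_SITE[url])
        parser.close()  # flush any text still buffered by the parser
        docs.append({"id": url, "content": " ".join(parser.text)})
        for link in parser.links:
            if link in FAKE_SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return docs

def build_index(docs):
    """Index side: map each lowercase word to the set of page ids containing it."""
    index = {}
    for doc in docs:
        for word in doc["content"].lower().split():
            index.setdefault(word, set()).add(doc["id"])
    return index

def search(index, word):
    """Query side: look the word up in the prebuilt index; no crawling happens here."""
    return sorted(index.get(word.lower(), set()))

docs = crawl("/")
index = build_index(docs)
print(search(index, "search"))
```

The point of the sketch is the ordering: crawling and indexing happen first, and a search only consults the index that was already built.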


To incorporate these in your web application, download the latest releases of Solr and Nutch, install them on your server, and follow the configuration guidelines for each. Voilà! Your system will have its own search engine with no sweat! *wink
