Skip to main content

Apache Solr and Nutch

I have not post for quite a long time since my last post. I’ve been busy doing the new project I am now into. As a sweet fruit of labor, I would like to share some knowledge I have gained in this new venture.


I was assigned to the global IT projects together with two more colleagues. A lot of new technologies that we need to cope up with, especially dealing with the critical and gem departments of the company - marketing and sales, every detail is scrutinized and case studied thoroughly. Quality and reliability of the system is the top most importance of the development. A portion of everything done is the server configuration and optimization; this is to aid the system for the search management and site hits. The standard tools used by the company are Solr and Nutch working together pretty well, thus we end up studying these technologies.


A brief description:


Apache Solr - is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Apache Tomcat. [Wikipedia.org]


NUTCH - is open source web-search software. It was built on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, and parsers for HTML and other document formats. [Lucene.apache.org]


It is coded completely in the Java programming language, but data is written in language-independent formats.
Nutch has a highly modular architecture allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering.
The fetcher (”robot” or “web crawler”) has been written from scratch solely for this project. [Wikipedia.org]


In general, Solr and Nutch harmoniously work together as it is both in-house software of Lucene project. Solr generally is responsible for the search and return capability of the system. If the user first searched an item, Solr technology is the one responsible for caching the search and the returned results. It has great API for JSON, XML as the format of the returned results. Other than caching, Solr have also these following features :

• Uses the Lucene library for full-text search
• Faceted navigation
• JSON, XML, PHP, Ruby, Python and custom Java binary output formats over HTTP
• HTML administration interface
• Replication to other Solr servers
• Extensible through plugins
• Distributed Search
• Caching

On the other hand, Nutch acts as the crawler, a mimic of the “goolge crawler” and other search engine crawler (im totaly a big fun of google. *grin). After hitting the search button, nutch would look for the search item into the entire web application and indexed its location and return its details to Solr if it is found.


To incorporate these in your web application, download the latest release of Solr and Nutch, and install it to your server and follow the configuration guidelines for each. Viola! Your system would have its own search engine with no sweat! *wink

Comments

Popular posts from this blog

Cross-Site Scripting - (HACK) a way out

I came across to this discovery when i was affronted with the problem of trying to communicate TWO sites on different domain/server and some parameters need to be passed. As you may have “Googled” it, you can’t do outright javaScript function call from your site to a partner site since it resides on different domains. It’s a violation of the Cross-Site Scripting W3C standards for it is highly probable for site security breach. To make the picture clearer, here’s the scenario. Take for an example your site caters a hotel reservation and you have a partner seller that also maintains a site for marketing. If you want to maximize you potential sales, you’ll opt to let your partner embed/include your reservation site somewhere in their site. See the diagram now? Partner site e.g resides in www.marketing.com and your site in www.hotelreservation.com , without directly accessing your site, a customer must be able to get a hotel reservation on you part...

Search Engine Optimization (SEO)

Often we focus our site development mainly on its GUI (Graphical User Interface) and content, although this is very critical and needed, yet we tend to neglect adding some essentials that also need to be considered and included in order to optimize the site search ability. We want surfers to find and read our sites especially if we want to market something, thus it is important to make sure that our site link will be included in the list of search results on any search engines. The question is how will you do it? How will Google, Yahoo, msn etc, will find your site be indexed and included in their search results? There are a lot of things to learn and take into consideration if you want to optimize your site. Before proceeding, I’ll provide primary reference for you to get started before jumping into SEO. Read through this and decide if you need to reconstruct your site or not. Google Webmaster Site – This will acquaint you to search engines basic concepts. Webmaster Guidel...

Reviewing the BASICS

Right knowledge in OOP makes your code and implementation more reliable. This can be achieved by correct coding practice and methodology. However, there will come a time that you’ll be affront with confusion of concepts and implementation of what OOP objects to use in right and efficient way. You’ll be stuck if you’re not well equipped. In this article, I’m hoping to aid newbies in this context, familiarize you with the basics of OOP that i think you need to know to get all throughout a smooth coding spree. Here are some common discussions found in development forums that you might find useful resource. Static vs Non-Static Static members : Act like global variables. There is only one instance of it in the entire program execution. They can be used with the class name directly and shared by all objects of a class. Example on referring a static method or data member. class A { static int firstInt; static int secondInt; static void i...