
Apache Solr and Nutch

I have not posted for quite a long time. I’ve been busy with the new project I am now on. As a sweet fruit of that labor, I would like to share some of what I have learned in this new venture.


I was assigned to the global IT projects together with two colleagues. There are a lot of new technologies we need to cope with, and because we deal with the company's critical "gem" departments, marketing and sales, every detail is scrutinized and studied thoroughly. Quality and reliability of the system are the top priorities of the development. Part of the work is server configuration and optimization, which supports the system's search management and site traffic. The company's standard tools are Solr and Nutch, which work together pretty well, so we ended up studying these technologies.


A brief description:


Apache Solr - an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Apache Tomcat. [Wikipedia.org]


Apache Nutch - open source web-search software. It was built on Lucene Java, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats. [Lucene.apache.org]


Nutch is coded completely in the Java programming language, but data is written in language-independent formats.
Nutch has a highly modular architecture, allowing developers to create plugins for the following activities: media-type parsing, data retrieval, querying and clustering.
The fetcher (“robot” or “web crawler”) has been written from scratch solely for this project. [Wikipedia.org]


In general, Solr and Nutch work together harmoniously, since both come from the Apache Lucene project. Solr is responsible for the search-and-return side of the system: it indexes the content, answers queries, and caches queries and their results, so a repeated search can be served from the cache. It has a great API that returns results in formats such as JSON and XML. Besides caching, Solr also has the following features:

• Uses the Lucene library for full-text search
• Faceted navigation
• JSON, XML, PHP, Ruby, Python and custom Java binary output formats over HTTP
• HTML administration interface
• Replication to other Solr servers
• Extensible through plugins
• Distributed Search
• Caching
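
To make the "output formats over HTTP" and "faceted navigation" items a bit more concrete, here is a minimal sketch of how a client might build a Solr search URL and read the JSON that comes back. The host, port, core name (`mycore`) and the `category` field are my own assumptions for illustration, and the sample response is hand-written, not real server output. The `q`, `wt`, `rows`, `facet` and `facet.field` parameters are standard Solr query parameters.

```python
import json
from urllib.parse import urlencode

# Assumption: a local Solr with a core named "mycore" (hypothetical).
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_query_url(text, facet_field=None, rows=10):
    """Build a Solr select URL that asks for JSON output."""
    params = [("q", text), ("wt", "json"), ("rows", rows)]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def parse_response(body):
    """Return (total hits, documents, facet counts) from a Solr JSON body."""
    data = json.loads(body)
    response = data["response"]
    facets = {}
    fields = data.get("facet_counts", {}).get("facet_fields", {})
    for field, flat in fields.items():
        # Default JSON layout is a flat list: [value, count, value, count, ...]
        facets[field] = dict(zip(flat[0::2], flat[1::2]))
    return response["numFound"], response["docs"], facets


# Hand-written sample body in the shape Solr returns (illustrative data only).
SAMPLE_BODY = json.dumps({
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "/rooms", "category": "hotel"},
            {"id": "/rates", "category": "hotel"},
        ],
    },
    "facet_counts": {"facet_fields": {"category": ["hotel", 2, "resort", 0]}},
})

if __name__ == "__main__":
    print(build_query_url("hotel reservation", facet_field="category"))
    print(parse_response(SAMPLE_BODY))
```

In a real application you would fetch the URL over HTTP and pass the response body to `parse_response`; the sketch leaves the network call out so it can run without a Solr server.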

On the other hand, Nutch acts as the crawler, much like the “Google crawler” and other search engine crawlers (I'm totally a big fan of Google. *grin). Nutch does not run when the user hits the search button; it crawls the pages of the web application ahead of time, parses them, and passes their content and location to Solr for indexing. When a user then searches, Solr looks the item up in that index and returns its details if it is found.
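
The crawl-then-index idea can be sketched in a few lines. This is only a toy, not how Nutch is implemented: the "site" is a hypothetical in-memory dictionary, the `id` and `content` field names depend on your Solr schema, and the host, port and core name are assumptions. It walks the links breadth-first, collects one document per page, and prepares (without sending) a JSON POST to Solr's update handler.

```python
import json
from collections import deque
from urllib.request import Request

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
PAGES = {
    "/": ("Welcome to our hotel", ["/rooms", "/contact"]),
    "/rooms": ("Rooms and rates", ["/", "/contact"]),
    "/contact": ("Contact us", ["/"]),
}

# Assumption: a local Solr with a core named "mycore" (hypothetical).
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def crawl(start):
    """Breadth-first walk over PAGES, returning one document per page reached."""
    seen, queue, docs = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        text, links = PAGES[url]
        # Field names "id" and "content" are assumptions about the schema.
        docs.append({"id": url, "content": text})
        for link in links:
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return docs


def build_update_request(docs):
    """Prepare (but do not send) a JSON POST to Solr's update handler."""
    return Request(
        SOLR_UPDATE_URL,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_update_request(crawl("/"))
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Nutch does all of this at a much larger scale, with real fetching, parsing plugins and a link database; the point here is only the division of labor: the crawler gathers documents, and Solr indexes and serves them.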


To incorporate these into your web application, download the latest releases of Solr and Nutch, install them on your server, and follow the configuration guidelines for each. Voilà! Your system will have its own search engine with no sweat! *wink
