Nutch vs Solr vs Nutch+Solr - hadoop

A related question on Stack Overflow exists, but it was asked six and a half years ago. A lot has changed since then, especially in Nutch. Basically, I have two questions.
How does Nutch compare to Solr?
In what circumstances do we need to integrate the two for crawling, and why is that better? How would it differ from using either of them standalone (or with Hadoop)?

At the current stage, Nutch is only responsible for crawling the web, meaning it visits a web page, extracts the content, finds more links and repeats the process (I'm skipping a lot of complicated stuff in between, but hopefully you get the idea).
The last stage of the crawling process is to store the data in your backend (ES/Solr are the supported data storages on the 1.x branch). This is where Solr comes into play: after Nutch has completed its work, you need to store the data somewhere so you can execute queries on top of it, and that is Solr's job.
Some time ago Nutch included the ability to write the inverted index itself (as explained in the question), but the decision (also some time ago) was to deprecate this in favor of using Solr/ES (or any other storage that you can write an indexer plugin for). Right now the indexing plugins are pluggable, and you can write a plugin for any data storage that you want.
Summary: Nutch is a crawler and Solr is the search engine where Nutch stores the data that is crawled.

Nutch and Solr are two different things. Nutch just crawls the web and parses the contents of the web pages, while Solr is responsible for indexing, i.e. storing the contents crawled by Nutch when Solr is integrated with Nutch.
You need to integrate Solr with Nutch when you have to retrieve and store data while crawling the web. If you don't have to store or index anything, then you don't need Solr. Solr is useful when you want to store the data Nutch crawls and then perform a search on that data.
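To make the division of labour concrete: once Nutch has pushed its crawled documents into Solr, search queries go to Solr alone. A minimal SolrJ sketch of that search step (the core name nutch and the field names url/title/content are assumptions that depend on the schema your Nutch indexer writes to):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SearchCrawledContent {
        public static void main(String[] args) throws Exception {
            // Core name "nutch" is an assumption; use whatever core Nutch indexes into.
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build()) {
                // Query the fields Nutch typically fills (title, content, url); adjust to your schema.
                SolrQuery query = new SolrQuery("content:hadoop");
                query.setFields("url", "title");
                query.setRows(10);

                QueryResponse response = solr.query(query);
                for (SolrDocument doc : response.getResults()) {
                    System.out.println(doc.getFieldValue("title") + " -> " + doc.getFieldValue("url"));
                }
            }
        }
    }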

Related

Read documents with Elastic Search

I have an information retrieval assignment where I have to use Elasticsearch to generate some indexing/ranking. I was able to download Elasticsearch and it's now running on http://localhost:9200/, but how do I read every document stored in my folder called 'data'?
Elasticsearch is just a search engine. In order to make your docs and files searchable, you need to load them, extract all relevant data, and index it into Elasticsearch.
Apache Tika is a solution for extracting the data out of the files. Write a file system crawler using Tika, then use the REST API to index the data.
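A minimal sketch of that approach: walk the local 'data' folder, extract plain text with the Tika facade, and POST each document to Elasticsearch over its REST API. The index name docs and the field names are assumptions, and the hand-rolled JSON escaping is only there to keep the example dependency-free:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    import org.apache.tika.Tika;

    public class DataFolderIndexer {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            HttpClient http = HttpClient.newHttpClient();

            // Collect every regular file under ./data.
            List<Path> files;
            try (Stream<Path> stream = Files.walk(Path.of("data"))) {
                files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            for (Path file : files) {
                // Tika extracts plain text from PDF, Word, HTML, ...
                String text = tika.parseToString(file.toFile());
                String json = "{\"filename\":\"" + escape(file.getFileName().toString())
                        + "\",\"content\":\"" + escape(text) + "\"}";

                // Index name "docs" is an assumption; Elasticsearch creates it on first write.
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:9200/docs/_doc"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(json))
                        .build();
                HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(file + " -> " + response.statusCode());
            }
        }

        // Very small JSON string escaper; a real implementation should use a JSON library.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"")
                    .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
        }
    }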
If you don't want to reinvent the wheel, have a look at the FSCrawler project. Here is a blogpost describing how to solve the task you are facing.
Good luck!

How to integrate AEM with ElasticSearch?

I have been through all the sites currently available that cover AEM & Elasticsearch, but could not find anything specific about integrating the two.
Requirement: to create site search functionality for publish which will bring out all the results related to a particular keyword. Currently we are using the default AEM site search functionality, which is very slow, and thus we want to migrate it to ES. There is very little documentation available on integrating the two, so we are struggling with it. We mainly have to do this in Java.
That's because your question is very vague. You have not specified what it is that you are trying to achieve. Do you want the search results on the AEM publish side to be served by Elasticsearch, or do you want all your content (even on the AEM author instance) to be indexed? There are multiple patterns, hence it is not possible to provide a general answer. There are multiple ways you can integrate:
1) Write custom replication agents in AEM to push content to ES (a rough sketch of this idea follows the list).
2) Create a workflow which can be triggered with launchers whenever a node is added/modified. I would suggest you refrain from this and consider option 1 instead, as this will trigger too many workflow instances and will impact overall performance.
3) You can write crawlers to crawl your AEM publish instance & index the content in ES.
4) You can write code which runs in ES (a "river" in ES terminology) to fetch the content from AEM & index it.
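As a rough sketch of the push-on-replication idea from option 1 (implemented here as an OSGi listener for replication events rather than a full custom replication agent): the Elasticsearch endpoint, the index name aem-pages and the one-field JSON payload are assumptions; a real integration would load the activated page through a ResourceResolver and map its properties to your index schema.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.event.Event;
    import org.osgi.service.event.EventConstants;
    import org.osgi.service.event.EventHandler;

    import com.day.cq.replication.ReplicationAction;
    import com.day.cq.replication.ReplicationActionType;

    // Sketch: push activated pages to Elasticsearch whenever a replication event fires.
    @Component(service = EventHandler.class,
               property = EventConstants.EVENT_TOPIC + "=" + ReplicationAction.EVENT_TOPIC)
    public class ElasticsearchReplicationListener implements EventHandler {

        private final HttpClient http = HttpClient.newHttpClient();

        @Override
        public void handleEvent(Event event) {
            ReplicationAction action = ReplicationAction.fromEvent(event);
            if (action == null || action.getType() != ReplicationActionType.ACTIVATE) {
                return;
            }
            // Only the path is indexed here; a real integration would serialize the page content.
            String json = "{\"path\":\"" + action.getPath() + "\"}";
            try {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:9200/aem-pages/_doc"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(json))
                        .build();
                http.send(request, HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                // Log and swallow; indexing failures should never break replication.
            }
        }
    }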
Here is a complete implementation of Apache Solr, Elasticsearch and Apache Lucene with AEM 6.5 - https://github.com/tadijam64/search-engines-comparison
There is a detailed explanation of how every search engine works and how it is integrated with AEM, explained step by step in six write-ups here.
It's an old repo but it may help you with the integration:
https://github.com/viveksachdeva/elasticsearch-cq
I know this is an old question, but I had the same problem and came up with a new implementation you can find on GitHub:
https://github.com/deveth0/elasticsearch-aem
The usage is quite easy: you have to include several bundles and then configure which Elasticsearch instance to use.
Upon page activation, AEM triggers a replication agent that pushes the data to Elasticsearch.
For more detailed information, have a look at my blog.

crawler + elasticsearch integration

I wasn't able to find out how to crawl a website and index the data into Elasticsearch. I managed to do that with the combination Nutch+Solr, and since Nutch should be able to export data directly to Elasticsearch from version 1.8 onwards (source), I tried to use Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest solution to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.
Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good How To section, including creating the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), metadata extraction, ...
Might be worth having a look at the elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since the River plugins have been deprecated, it may be worth having a look at ManifoldCF or Norconex Collectors.
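If you really only need the simplest possible pipeline and no framework at all, a tiny hand-rolled crawler can also work. A hedged sketch (jsoup for fetching/parsing and the index name web are assumptions; it stays on a single host and ignores robots.txt, politeness delays and error handling):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TinyCrawler {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            Deque<String> frontier = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            frontier.add("https://example.com/");              // seed URL - an assumption

            int indexed = 0;
            while (!frontier.isEmpty() && indexed < 50) {      // hard page limit for the sketch
                String url = frontier.poll();
                if (!seen.add(url)) continue;

                Document page = Jsoup.connect(url).get();      // fetch and parse the HTML
                String json = "{\"url\":\"" + url + "\",\"title\":\"" + escape(page.title())
                        + "\",\"body\":\"" + escape(page.body().text()) + "\"}";

                // Index name "web" is an assumption; Elasticsearch creates it on first write.
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:9200/web/_doc"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(json))
                        .build();
                http.send(request, HttpResponse.BodyHandlers.discarding());
                indexed++;

                // Follow same-host links only, to keep the crawl bounded.
                for (Element link : page.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (next.startsWith("https://example.com/")) {
                        frontier.add(next);
                    }
                }
            }
        }

        // Very small JSON string escaper; a real implementation should use a JSON library.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"")
                    .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
        }
    }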
You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
When working with big volumes of data, Hadoop provides all the power to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process involves the use of a Hadoop cluster (EMR in this example) running the Cascading application that indexes the JSON metadata directly into Elasticsearch.
Cascading source code is also available to understand how to handle the data ingestion in Elasticsearch.

How to implement faster search in website using Apache Solr?

I want to use Apache Solr on my website in order to make search faster.
I need Java code to index data from a MySQL database so that I can perform faster searches.
So can anybody please tell me how to implement this?
You can start by looking at the Solr DataImportHandler.
This will enable you to index data from the DB into Solr.
You would need to tune Solr's configuration to get good performance, though, and that depends on how much data you have and what kind.
If you specifically want to use Java code to add data to the index, you should use the SolrJ client. For the specific case of adding data, focus on the Adding Data to Solr section. However, as @Jayendra pointed out, you can use other means than just Java, like the DataImportHandler, to load data into Solr. Also, please refer to the Integrating Solr page for a list of additional Solr client/language bindings.
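If you do go the SolrJ route, a minimal sketch of pulling rows out of MySQL over JDBC and pushing them into Solr could look like the following (the core name, table, column names and JDBC credentials are assumptions):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class MysqlToSolr {
        public static void main(String[] args) throws Exception {
            // Core "products" and the products table/columns are assumptions - match them to your schema.
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
                 Connection db = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/shop", "user", "password");
                 Statement stmt = db.createStatement();
                 ResultSet rows = stmt.executeQuery("SELECT id, name, description FROM products")) {

                while (rows.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rows.getString("id"));
                    doc.addField("name", rows.getString("name"));
                    doc.addField("description", rows.getString("description"));
                    solr.add(doc);
                }
                solr.commit();   // make the newly added documents searchable
            }
        }
    }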

What is the best components stack for building distributed log aggregator (like Splunk)?

I'm trying to find the best components I could use to build something similar to Splunk in order to aggregate logs from a large number of servers in a computing grid. It should also be distributed, because I have gigs of logs every day and no single machine will be able to store the logs.
I'm particularly interested in something that will work with Ruby and will work on Windows and latest Solaris (yeah, I got a zoo).
I see architecture as:
Log crawler (Ruby script).
Distributed log storage.
Distributed search engine.
Lightweight front end.
The log crawler and the distributed search engine are already settled: logs will be parsed by a Ruby script and Elasticsearch will be used to index the log messages. The front end is also very easy to choose - Sinatra.
My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.
MongoDB was rejected because it doesn't work on Solaris.
CouchDB doesn't support sharding (smartproxy is required to make it work but this is something I don't want to even try).
Cassandra works great but it's just a disk space hog, and it requires running autobalance every day to spread the load between Cassandra nodes.
HDFS looked promising, but the FileSystem API is Java-only and JRuby was a pain.
HBase looked like the best solution around, but deploying and monitoring it is just a disaster - in order to start HBase I need to start HDFS first, check that it started without problems, then start HBase and check it as well, and then start the REST service and check it too.
So I'm stuck. Something tells me HDFS or HBase are the best things to use as log storage, but HDFS only works smoothly with Java and HBase is just a deployment/monitoring nightmare.
Can anyone share their thoughts or experience building similar systems using the components I described above, or with something completely different?
I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.
For more, see my answer to a similar question on Quora.
With regards to Java and HDFS: using a tool like BeanShell, you can interact with the HDFS store via scripts instead of compiled Java.
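For comparison, the plain Java side of the HDFS FileSystem API is quite small; a hedged sketch of writing a log file to HDFS (the namenode address and target path are assumptions):

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsLogWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Namenode address is an assumption - use your cluster's fs.defaultFS.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/logs/app.log"))) {
                out.write("one sample log line\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }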
