crawler + elasticsearch integration - elasticsearch

I could not work out how to crawl a website and index the data into Elasticsearch. I managed to do this with Nutch + Solr, and since Nutch should be able to export data directly to Elasticsearch from version 1.8 onwards (source), I tried Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest way to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.

Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good How To section, including creating the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), meta data extraction, ...
Might be worth having a look at the elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since the River plugins have been deprecated, it may be worth having a look at ManifoldCF or Norconex Collectors.

You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
When working with big volumes of data, Hadoop provides all the power to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process involves a Hadoop cluster (EMR in this example) running the Cascading application that indexes the JSON metadata directly into Elasticsearch.
The Cascading source code is also available and shows how to handle the data ingestion into Elasticsearch.

Related

Read documents with Elastic Search

I have an information retrieval assignment where I have to use Elasticsearch to generate some indexing/ranking. I was able to download Elasticsearch and it's now running on http://localhost:9200/, but how do I read every document stored in my folder called 'data'?
Elasticsearch is just a search engine. In order to get your docs and files searchable, you need to load them, extract all relevant data and load into elasticsearch.
Apache Tika is a solution for extracting the data out of the files. Write a file system crawler using Tika, then use the REST API to index the data.
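As a rough illustration of that idea, here is a minimal Python sketch that walks a folder and builds one index request per file. It is only a sketch under assumptions: the index name "docs" is made up, the `/_doc/` endpoint assumes a reasonably recent Elasticsearch version, and the extraction step just reads files as plain text where a real crawler would call Tika.

```python
import json
from pathlib import Path
from urllib.parse import quote
from urllib.request import Request

ES_URL = "http://localhost:9200"  # address mentioned in the question
INDEX = "docs"                    # hypothetical index name


def iter_documents(folder):
    """Yield (doc_id, document) for every regular file under `folder`."""
    root = Path(folder)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Placeholder extraction: plain text only. A Tika-based
            # extractor would go here for PDF, Office files and so on.
            text = path.read_text(encoding="utf-8", errors="replace")
            rel = path.relative_to(root).as_posix()
            yield rel, {"path": rel, "content": text}


def build_index_request(doc_id, document, es_url=ES_URL, index=INDEX):
    """Build (but do not send) a PUT request that indexes one document."""
    # The relative path is used as the document id, so it is URL-encoded.
    url = f"{es_url}/{index}/_doc/{quote(doc_id, safe='')}"
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})
```

To actually send the documents you would loop over `iter_documents("data")` and pass each request from `build_index_request` to `urllib.request.urlopen`.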
If you don't want to reinvent the wheel, have a look at the FSCrawler project. Here is a blog post describing how to solve the task you are facing.
Good luck!

Ambari Hadoop/Spark and Elasticsearch SSL Integration

I have a Hadoop/Spark cluster set up via Ambari (HDP-2.6.2.0). Now that I have my cluster running, I want to feed some data into it. We have an Elasticsearch cluster on premise (version 5.6). I want to set up the ES-Hadoop Connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/doc-sections.html) that Elastic provides so I can dump some data from Elastic to HDFS.
I grabbed the ZIP file with the JARS and followed the directions on a blog post at CERN:
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
So far, this seems reasonable, but I have some questions:
We have SSL/TLS set up on our Elasticsearch cluster, so when I perform a query, I obviously get an error using the example on the blog. What do I need to do on my Hadoop/Spark side and on the Elastic side to make this communication work?
I read that I need to add those JARs to the Spark classpath. Is there a rule of thumb as to where I should put them on my cluster? I assume on one of my Spark client nodes, but I am not sure. Also, once I put them there, is there a way to add them to the classpath so that all of my nodes / client nodes have the same classpath? Maybe something in Ambari provides that?
Basically, what I am looking for is to be able to perform a query to ES from Spark that triggers a job that tells ES to push "X" amount of data to my HDFS. Based on what I can read on the Elastic site, this is how I think it should work, but I am really confused by the documentation. It's lacking and has confused both me and my Elastic team. Can someone provide some clear directions or some clarity around what I need to do to set this up?
For the project setup part of the question you can take a look at
https://github.com/zouzias/elasticsearch-spark-example
which is a project template integrating Elasticsearch with Spark.
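For the SSL part, a hedged sketch of the connector settings that are usually involved is shown below. The setting keys (`es.net.ssl`, `es.net.ssl.truststore.location` and so on) are ES-Hadoop configuration names; the helper function, host name, paths and credentials are all made up for illustration. Note that with the connector it is Spark that reads from Elasticsearch and then writes to HDFS, rather than ES pushing data.

```python
def es_hadoop_ssl_options(nodes, port=9200, truststore=None,
                          truststore_pass=None, user=None, password=None):
    """Return ES-Hadoop connector options for an SSL-enabled cluster.

    The function itself is a hypothetical helper; only the option
    keys come from the ES-Hadoop configuration.
    """
    opts = {
        "es.nodes": nodes,
        "es.port": str(port),
        "es.net.ssl": "true",  # talk HTTPS to the cluster
    }
    if truststore is not None:
        # Truststore holding the CA/certificate of the ES cluster,
        # given as a URI and readable from every executor.
        opts["es.net.ssl.truststore.location"] = truststore
    if truststore_pass is not None:
        opts["es.net.ssl.truststore.pass"] = truststore_pass
    if user is not None:
        # Only needed if the cluster requires HTTP basic authentication.
        opts["es.net.http.auth.user"] = user
    if password is not None:
        opts["es.net.http.auth.pass"] = password
    return opts
```

In PySpark the dictionary could then be passed as `spark.read.format("org.elasticsearch.spark.sql").options(**opts).load("my-index")`, followed by a normal write to an HDFS path. As for the classpath, passing the connector JAR with `spark-submit --jars` ships it to the executors, so it does not have to be copied onto every node by hand.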

Nutch v Solr v Nutch+Solr

A related question on Stack Overflow exists, but it was asked six and a half years ago. A lot has changed since then, especially in Nutch. Basically I have two questions.
How do we compare Nutch to Solr?
In what circumstances do we need to integrate the two, and why is that better for crawling? How would it differ from using either of them in standalone mode (or with Hadoop)?
At the current stage, Nutch is only responsible for crawling the web: it visits a web page, extracts the content, finds more links and repeats the process (I'm skipping a lot of complicated stuff in between, but hopefully you get the idea).
The last stage of the crawling process is to store the data in your backend (ES and Solr are the supported data stores on the 1.x branch). This is where Solr comes into play: after Nutch has completed its work, you need to store the data somewhere so that you can run queries on it. This is Solr's job.
Some time ago Nutch included the ability to write the inverted index itself (as explained in the question), but the decision (also some time ago) was to deprecate this in favor of using Solr/ES (or any other storage that you can write an indexer plugin for). Right now the indexers are pluggable, and you can write a plugin for any data store you want.
Summary: Nutch is a crawler and Solr is the search engine where Nutch stores the data that is crawled.
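To make the split of responsibilities concrete, here is a toy Python sketch of the loop described above (visit, extract, find links, repeat) with the storage step left as a callback. This is not how Nutch is implemented; `fetch` and `index` are hypothetical callables standing in for the HTTP fetcher and for the Solr/ES indexer.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collect link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(seed, fetch, index, max_pages=100):
    """Breadth-first crawl from `seed`; returns the number of pages indexed.

    fetch(url) returns HTML or None; index(doc) stores one document
    (the part that Solr or Elasticsearch would play).
    """
    seen = {seed}
    queue = deque([seed])
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched, skip it
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        index({"url": url, "content": " ".join(parser.text)})
        crawled += 1
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled
```

Everything inside `crawl` is the crawler's concern; whatever `index` does with the documents (and how they are queried later) is the search engine's.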
Nutch and Solr are two different things. Nutch just crawls the web and parses the contents of the web pages, while Solr is responsible for indexing, i.e. storing the content crawled by Nutch when the two are integrated.
You need to integrate Solr with Nutch when you have to retrieve and store data while crawling the web. If you don't have to store or index anything, then you don't need Solr. Solr is useful when you want to store the data Nutch crawls and then perform a search on the data.

How to integrate AEM with ElasticSearch?

I have been through all the sites I could find that cover AEM and Elasticsearch, but could not find anything specific about integrating the two.
Requirement: create site search functionality on publish that returns all results related to a particular keyword. Currently we are using the default AEM site search functionality, which is very slow, so we want to migrate to ES. There is very little documentation available on integrating the two, so we are struggling with it. We mainly have to do this in Java.
That's because your question is very vague. You have not specified what you are trying to achieve. Do you want the search results on the AEM publish side to be served by Elasticsearch, or do you want all your content (even on AEM author) to be indexed? There are multiple patterns, so it is not possible to give a general answer. There are several ways you can integrate:
1) Write custom replication agents in AEM to push content to ES.
2) Create a workflow that is triggered by launchers whenever a node is added or modified. I would suggest avoiding this and considering option 1 instead, as it will trigger too many workflow instances and hurt overall performance.
3) Write crawlers to crawl your AEM publish instance and index the content in ES.
4) Write code that runs inside ES (a "river" in ES terminology) to fetch the content from AEM and index it.
Here is a complete implementation of Apache Solr, Elasticsearch and Apache Lucene with AEM 6.5: https://github.com/tadijam64/search-engines-comparison
There is a detailed explanation of how each search engine works and how it is integrated with AEM, explained step by step in six write-ups here
It's an old repo, but it may help you with the integration:
https://github.com/viveksachdeva/elasticsearch-cq
I know, this is an old question but I had the same problem and came up with a new implementation you can find on github:
https://github.com/deveth0/elasticsearch-aem
The usage is quite easy: you have to include several bundles and then configure which Elasticsearch instance to use.
Upon Page-Activation AEM triggers a Replication Agent that pushes the data to Elasticsearch.
For more detailed information, have a look at my blog

Elastic search with Google Big Query

I have event logs loaded into Elasticsearch and I visualise them using Kibana. My event logs are actually stored in a Google BigQuery table. Currently I dump the JSON files to a Google bucket and download them to a local drive. Then, using Logstash, I move the JSON files from the local drive into Elasticsearch.
Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, I understand that there is an output connector that sends data from Elasticsearch to Google BigQuery, but not vice versa. I am wondering whether I should upload the JSON files to a Kubernetes cluster and then establish a connection between the cluster and Elasticsearch.
Any help in this regard would be appreciated.
Although this solution may be a little complex, I suggest using the Google Storage Connector with ES-Hadoop. Both are very mature and used in production by many large companies.
Logstash over a lot of pods on Kubernetes will be very expensive and - I think - not a very nice, resilient and scalable approach.
Apache Beam has connectors for BigQuery and Elasticsearch. I would definitely do this with Dataflow, so you don't need to implement a complex ETL and staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (if performance is important, take a look at "BigQueryIO Read vs fromQuery") and load it into Elasticsearch using ElasticsearchIO.write().
See this example of how to read data from BigQuery in Dataflow:
https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java
Elasticsearch indexing:
https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer
UPDATED 2019-06-24
The BigQuery Storage API was released earlier this year. It improves the parallelism of data extraction from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.
From the documentation
The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
I have recently worked on a similar pipeline. The workflow I would suggest is to use either the Google storage connector mentioned above or some other method to read your JSON files into a Spark job. You should be able to quickly and easily transform your data, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.
You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
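Whichever runner you pick, the transform step boils down to turning each exported row into an index action. As a rough plain-Python sketch (no Spark or Beam involved), assuming the export is newline-delimited JSON and that the target is the Elasticsearch bulk API, it could look like this. The index name and the `event_id` field in the usage note are made up.

```python
import json


def rows_to_bulk(ndjson_lines, index, id_field=None):
    """Convert newline-delimited JSON rows into an Elasticsearch _bulk body.

    Each row becomes two lines: an action line naming the target index
    (and optionally the document id) followed by the row itself.
    """
    out = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:  # skip blank lines in the export
            continue
        row = json.loads(line)
        action = {"index": {"_index": index}}
        if id_field is not None and id_field in row:
            # Reusing a stable id makes re-runs overwrite instead of duplicate.
            action["index"]["_id"] = str(row[id_field])
        out.append(json.dumps(action))
        out.append(json.dumps(row))
    # The bulk API expects the body to end with a newline.
    return "\n".join(out) + "\n" if out else ""
```

For example, `rows_to_bulk(open("export.json"), "events", id_field="event_id")` would give a body that can be POSTed to the `_bulk` endpoint with the `application/x-ndjson` content type.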
As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and ElasticSearch
More information here in a blog post by elastic.co
Further documentation and step by step process by google

Resources