How to integrate AEM with Elasticsearch?

I have been through all the sites currently available on AEM & Elasticsearch, but could not find anything that specifically covers integrating the two.
Requirement: create site-search functionality for the publish tier that returns all results related to a particular keyword. Currently we are using the default AEM site search, which is very slow, so we want to migrate it to ES. There is very little documentation on integrating the two, so we are struggling with it. We mainly have to do this in Java.

That's because your question is very vague. You have not specified what it is you are trying to achieve. Do you want the search results on the AEM publish side to be served by Elasticsearch, or do you want all your content (even on the AEM author tier) to be indexed? There are multiple patterns, so it is not possible to give a general answer. There are multiple ways you can integrate:
1) Write custom replication agents in AEM to push content to ES.
2) Create a workflow triggered by launchers whenever a node is added/modified. I would suggest you refrain from this and consider option 1 instead, as it will spawn too many workflow instances and impact overall performance.
3) Write crawlers to crawl your AEM publish tier and index the content in ES.
4) Write code that runs inside ES (a "river" in ES terminology) to fetch the content from AEM and index it.
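To make option 1 concrete, here is a minimal sketch of the push step such a custom replication agent would perform: serialize the activated page to JSON and index it over HTTP. The index name (`pages`), field names, and host are assumptions for illustration, not anything AEM or ES mandates; a real agent would extract the page content via the JCR/Sling APIs and use proper JSON escaping.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of option 1: the transport step of a custom
// replication agent that pushes an activated page to Elasticsearch.
public class EsPushSketch {

    // Build the JSON document for one page (naive escaping, for brevity only).
    static String buildDocument(String path, String title, String text) {
        return String.format(
            "{\"path\":\"%s\",\"title\":\"%s\",\"content\":\"%s\"}",
            path, title, text);
    }

    // PUT the document to ES; the document id is derived from the page path.
    static int index(String esBase, String id, String json) throws IOException {
        HttpURLConnection con =
            (HttpURLConnection) new URL(esBase + "/pages/_doc/" + id).openConnection();
        con.setRequestMethod("PUT");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            os.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return con.getResponseCode();
    }

    public static void main(String[] args) {
        String doc = buildDocument("/content/site/en/home", "Home", "Welcome");
        System.out.println(doc);
        // index("http://localhost:9200", "content-site-en-home", doc);
    }
}
```

On deactivation the same agent would issue a DELETE for the corresponding document id, so the index stays in sync with the publish tier.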

Here is a complete implementation of Apache Solr, Elasticsearch and Apache Lucene with AEM 6.5: https://github.com/tadijam64/search-engines-comparison
There is a detailed explanation of how each search engine works and how it is integrated with AEM, explained step by step in six write-ups here

It's an old repo, but it may help you with the integration:
https://github.com/viveksachdeva/elasticsearch-cq

I know this is an old question, but I had the same problem and came up with a new implementation you can find on GitHub:
https://github.com/deveth0/elasticsearch-aem
The usage is quite easy: you include several bundles and then configure which Elasticsearch instance to use.
Upon page activation, AEM triggers a replication agent that pushes the data to Elasticsearch.
For more detailed information, have a look at my blog

Related

Is there an application client for Elasticsearch 6.4.3 (similar to DBeaver)?

I tried to view my node data from an application client (like DBeaver), but I didn't find any information about that. Has someone found a way to connect DBeaver to this version, or to see the data with a similar application?
I believe what you are looking for is a GUI for Elasticsearch.
The industry typically refers to the Elasticsearch stack as the ELK stack, and I believe what you are looking for is the K part of it, which is Kibana.
I'm not sure if you are asking about SQL support, but if you are thinking of making use of SQL you can check the Elasticsearch SQL plugin.
Another widely used client application for Elasticsearch is Grafana. There are others available too (e.g. Splunk, Graylog, Loggly), but I believe Kibana and Grafana are the best bet.
Hope this helps!
Actually no. I'm using Elasticsearch as a database in different deployments, and I don't want to maintain a Kibana instance (I prefer to see all the data in a tool like DBeaver).

Ideas on making a Java Application with Nutch/Elasticsearch and Kibana

I have an idea for an application: build a search engine using Nutch, ES and Kibana. Nutch for crawling, ES for indexing and Kibana for the visualisation.
Currently I have all the programs working and can use them successfully in the terminal. My question is: is it possible to make a Java application which incorporates Nutch, ES and Kibana all in one?
My idea for the application is that it will accept a URL for Nutch to crawl; after crawling it will accept terms to index; finally, it will produce a visualisation page of the data with Kibana.
Any pointers on how to do this?
Why do you want to have them as a single application? ES and Kibana are services and are meant to run continuously. If you used StormCrawler (see comment above), that would be another continuous service. All you'd need to do is build a UI that sends the URLs to a queue.
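The suggested architecture can be sketched in a few lines: the UI layer only drops seed URLs onto a queue, and the crawler service consumes them independently. The in-memory queue below stands in for whatever broker you would actually run (RabbitMQ, Kafka, etc.); the class and method names are illustrative only.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch: the UI submits seed URLs; the crawler service drains them.
// A real deployment would replace the in-memory queue with a message broker.
public class UrlQueueSketch {
    private final BlockingQueue<String> seeds = new LinkedBlockingQueue<>();

    public void submit(String url) {   // called by the UI layer
        seeds.offer(url);
    }

    public String nextSeed() {         // polled by the crawler service
        return seeds.poll();           // null when no seed is waiting
    }

    public static void main(String[] args) {
        UrlQueueSketch q = new UrlQueueSketch();
        q.submit("https://example.org");
        System.out.println(q.nextSeed()); // prints https://example.org
    }
}
```

This keeps each long-running service (crawler, ES, Kibana) independent, which is exactly why bundling them into one Java application buys you little.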

Ambari Hadoop/Spark and Elasticsearch SSL Integration

I have a Hadoop/Spark cluster set up via Ambari (HDP 2.6.2.0). Now that I have my cluster running, I want to feed some data into it. We have an Elasticsearch cluster on premises (version 5.6). I want to set up the ES-Hadoop connector (https://www.elastic.co/guide/en/elasticsearch/hadoop/current/doc-sections.html) that Elastic provides so I can dump some data from Elastic to HDFS.
I grabbed the ZIP file with the JARS and followed the directions on a blog post at CERN:
https://db-blog.web.cern.ch/blog/prasanth-kothuri/2016-05-integrating-hadoop-and-elasticsearch-%E2%80%93-part-2-%E2%80%93-writing-and-querying
So far this seems reasonable, but I have some questions:
We have SSL/TLS set up on our Elasticsearch cluster, so when I perform a query I obviously get an error using the example from the blog. What do I need to do on my Hadoop/Spark side and on the Elastic side to make this communication work?
I read that I need to add those JARs to the Spark classpath. Is there a rule of thumb as to where I should put them on my cluster? I assume one of my Spark client nodes, but I am not sure. Also, once I put them there, is there a way to add them to the classpath so that all of my nodes/client nodes have the same classpath? Maybe something in Ambari provides that?
Basically, what I am looking for is to perform a query against ES from Spark that triggers a job telling ES to push "X" amount of data to my HDFS. Based on what I can read on the Elastic site, this is how I think it should work, but I am really confused by the documentation. It is lacking and has confused both me and my Elastic team. Can someone provide some clear directions or some clarity around what I need to do to set this up?
For the project-setup part of the question you can take a look at
https://github.com/zouzias/elasticsearch-spark-example
which is a project template integrating Elasticsearch with Spark.
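For the TLS part of the question, ES-Hadoop has dedicated `es.net.ssl.*` configuration keys. A sketch of the settings you would copy onto your `SparkConf` (via `conf.set(key, value)`) is below; the hostname, truststore path and password are placeholders, and the truststore is assumed to contain the CA certificate that signed your Elasticsearch nodes' certificates.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of ES-Hadoop TLS settings, kept as a plain map so it is
// framework-independent; in a real job each entry goes onto the SparkConf
// before the SparkContext is created. All values are placeholders.
public class EsSparkSslSettings {
    static Map<String, String> esSslSettings() {
        Map<String, String> s = new LinkedHashMap<>();
        s.put("es.nodes", "es-node-1.example.com");   // placeholder ES host
        s.put("es.port", "9200");
        s.put("es.net.ssl", "true");                  // enable TLS
        s.put("es.net.ssl.truststore.location", "file:///etc/pki/es-truststore.jks");
        s.put("es.net.ssl.truststore.pass", "changeit"); // placeholder password
        return s;
    }

    public static void main(String[] args) {
        esSslSettings().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

As for JAR placement: adding the ES-Hadoop JAR via `--jars` on `spark-submit` ships it to every executor, which avoids having to keep the classpath in sync across nodes by hand.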

Nutch v Solr v Nutch+Solr

A related question exists on Stack Overflow, but it was asked six and a half years ago. A lot has changed since then, especially in Nutch. Basically I have two questions.
How do we compare Nutch to Solr?
In what circumstances do we need to integrate the two, and why is that better? How would it differ from using either of them standalone (or with Hadoop)?
At the current stage Nutch is only responsible for crawling the web, meaning: visit a web page, extract the content, find more links and repeat the process (I'm skipping a lot of complicated stuff in between, but hopefully you get the idea).
The last stage of the crawling process is to store the data in your backend (ES/Solr are the supported data storages on the 1.x branch). This is the step where Solr comes into play: after Nutch has completed its work, you need to store the data somewhere to be able to execute queries on top of it. That is Solr's job.
Some time ago Nutch included the ability to write the inverted index itself (as explained in the question), but the decision (also some time ago) was to deprecate this in favour of using Solr/ES (or any other storage you can write an indexer plugin for). Right now the indexing plugins are pluggable, and you can write a plugin for any data storage you want.
Summary: Nutch is a crawler, and Solr is the search engine where Nutch stores the data it crawls.
Nutch and Solr are two different things. Nutch just crawls the web and parses the contents of the web pages, while Solr is responsible for indexing, i.e. storing the contents crawled by Nutch once Solr is integrated with it.
You need to integrate Solr with Nutch when you have to retrieve and store data while crawling the web. If you don't have to store or index anything, you don't need Solr. Solr is useful when you want to store the data Nutch crawls and then search over it.
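The crawl cycle described above (fetch a page, parse it, discover more links, repeat, then hand the documents to the indexer) can be shown in a toy form. This sketch uses an in-memory "web" instead of HTTP so it is self-contained; in reality Nutch runs this loop at scale and Solr/ES receives the parsed documents.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of the crawl cycle: fetch, parse, enqueue new links,
// repeat until the frontier is empty. A real pipeline would also send
// each (url, content) pair to the indexer (Solr/ES).
public class CrawlCycleSketch {
    static final Pattern LINK = Pattern.compile("href=\"([^\"]+)\"");

    static Set<String> crawl(Map<String, String> web, String seed) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(List.of(seed));
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;          // already fetched
            String html = web.getOrDefault(url, ""); // "fetch" the page
            Matcher m = LINK.matcher(html);           // "parse": extract links
            while (m.find()) frontier.add(m.group(1));
            // here a real pipeline would hand (url, html) to the indexer
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
            "/a", "<a href=\"/b\">b</a>",
            "/b", "<a href=\"/a\">a</a>");
        System.out.println(crawl(web, "/a")); // visits /a then /b
    }
}
```

The separation of concerns is visible even here: the loop knows nothing about indexing, which is exactly why Nutch delegates that step to Solr/ES through its pluggable indexers.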

crawler + elasticsearch integration

I wasn't able to find out how to crawl a website and index the data into Elasticsearch. I managed to do it with the combination Nutch + Solr, and since Nutch should be able to export data directly to Elasticsearch from version 1.8 (source), I tried to use Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest solution to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.
Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good how-to section, including creating the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), metadata extraction, ...
It might be worth having a look at the Elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since river plugins have been deprecated, it may be worth having a look at ManifoldCF or the Norconex Collectors instead.
You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
when working with big volumes of data, Hadoop provides the power to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process involves a Hadoop cluster (EMR in this example) running a Cascading application that indexes the JSON metadata directly into Elasticsearch.
The Cascading source code is also available, so you can see how the data ingestion into Elasticsearch is handled.
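Whatever framework drives the parallel ingestion, what each worker ultimately produces is an Elasticsearch bulk-API payload: newline-delimited JSON alternating an action line with a document line. A sketch of that batching step is below; the index name and documents are placeholders.

```java
import java.util.List;

// Sketch of building an ES _bulk request body (NDJSON). Each Hadoop
// worker/reducer can batch its documents like this and POST the result
// to http://<es-host>:9200/_bulk with Content-Type application/x-ndjson.
public class BulkPayloadSketch {
    static String bulkBody(String index, List<String> jsonDocs) {
        StringBuilder sb = new StringBuilder();
        for (String doc : jsonDocs) {
            sb.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            sb.append(doc).append('\n');  // bulk bodies are newline-delimited
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(bulkBody("commoncrawl",
            List.of("{\"url\":\"http://example.org\"}")));
    }
}
```

Batching documents this way (rather than indexing one at a time) is what makes the Hadoop-parallelized ingestion efficient, since each HTTP round trip carries many documents.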
