elasticsearch integration with mahout - hadoop

I would like to use Mahout to do some predictive analysis on data stored in elasticsearch to find similar documents or to recommend other records based on records that have been tagged with certain criteria.
I plan to create a Mahout cluster; however, does Elasticsearch have to sit within a Hadoop cluster to provide this functionality? Would I need to run es-hadoop? Or is there another way for Mahout to see the data in Elasticsearch?
Would running es-hadoop have any impact on speed compared to just Elasticsearch?

Recently I found a project, an Elasticsearch plugin, which can be used to build a recommendation engine on data indexed in Elasticsearch. Take a look at it:
https://github.com/hadashiA/elasticsearch-flavor

Mahout does not need to sit on the same machines as Elasticsearch, but it can. The new Mahout has legacy implementations of row and item similarity based on Hadoop MapReduce, but these will eventually be deprecated in favor of the newer Spark implementations, which have been in the code since Mahout 0.10.0 (the current release is 0.11.0).
There is a full-blown recommender integration of Mahout's Spark code with Elasticsearch in PredictionIO's Universal Recommender. See the docs for Mahout and PIO here:
http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
https://github.com/PredictionIO/template-scala-parallel-universal-recommendation
As to using Elasticsearch's es-hadoop, the Universal Recommender uses the Spark implementation of it, and I'd say that is the best way to go, because it's optimized for distributed calculations. However, there is no requirement to use it.
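For a sense of what the Spark side of es-hadoop looks like, here is a minimal Java sketch that reads an index into a Spark RDD; the node address and the index/type names are placeholders, not something prescribed by Mahout or the Universal Recommender:

    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

    public class EsToSpark {
        public static void main(String[] args) {
            // Point Spark at the Elasticsearch cluster (address is an assumption).
            SparkConf conf = new SparkConf()
                    .setAppName("es-to-spark")
                    .set("es.nodes", "localhost:9200");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Each document becomes a (document id, field map) pair.
            JavaPairRDD<String, Map<String, Object>> docs =
                    JavaEsSpark.esRDD(sc, "myindex/mytype");   // placeholder index/type

            System.out.println("documents read: " + docs.count());
            sc.stop();
        }
    }

From an RDD like this, Mahout's Spark-based similarity code can consume the data without Elasticsearch having to live inside the Hadoop cluster.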

Related

logstash vs spark streaming and storm

I am working on building a distributed real-time cluster system to supervise and analyze a network. I did some research on the internet and came up with a few technologies:
for real-time processing: Logstash, Storm, and Spark Streaming
for storage: elasticsearch
for analysis: Apache Spark over Hadoop (I will use ES-Hadoop to connect with Elasticsearch)
for data visualization: kibana, D3js, c3js
However, Logstash is not mentioned as often as Spark Streaming and Storm. I found the following architecture, presented in the picture below, on the internet:
I have two questions:
I don't understand why Logstash is not often mentioned as a real-time processing system like Spark Streaming and Storm. What are the main reasons? I have been using it and it is very powerful.
Regarding the analysis part, can I use the machine learning libraries in that configuration?
Logstash is not a clustered stream-processing system. It is simply a JVM-based process. The latest version supports an on-disk buffer, but it does not have nearly the same delivery guarantees as Spark or Storm. Take a look at http://storm.apache.org/releases/1.0.3/Guaranteeing-message-processing.html
Yes, but I'm not sure why you would use Elastic to store the data first. Why not HDFS -> Spark ML -> Elastic? The main things to think about here are managing models, training, and testing.
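A minimal sketch of that HDFS -> Spark ML -> Elastic flow, in Java, might look like the following; the HDFS path, index name, ES address, and the choice of KMeans are all illustrative assumptions, not part of the question:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

    public class HdfsSparkMlToEs {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("hdfs-sparkml-es")
                    .set("es.nodes", "localhost:9200");          // assumed ES address
            JavaSparkContext sc = new JavaSparkContext(conf);

            // 1. Read raw feature vectors from HDFS (placeholder path, CSV of doubles).
            JavaRDD<Vector> features = sc.textFile("hdfs:///data/features.csv")
                    .map(line -> Vectors.dense(
                            Arrays.stream(line.split(","))
                                  .mapToDouble(Double::parseDouble)
                                  .toArray()));

            // 2. Train a model with MLlib (KMeans with k=5, 20 iterations, as an example).
            KMeansModel model = KMeans.train(features.rdd(), 5, 20);

            // 3. Push scored results into Elasticsearch so Kibana can visualize them.
            JavaRDD<Map<String, Object>> scored = features.map(v -> {
                Map<String, Object> doc = new HashMap<>();
                doc.put("cluster", model.predict(v));
                doc.put("vector", v.toArray());
                return doc;
            });
            JavaEsSpark.saveToEs(scored, "network/clusters");    // placeholder index/type

            sc.stop();
        }
    }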

How to connect elasticsearch to apache spark streaming or storm?

We are building a real-time big data tool with open-source tools. Our main goal is to supervise and analyze a network by getting logs from a Kafka server in real time. We saw in tutorials that we have to divide our tool into two sections, Analytics and Supervision, as shown below.
For the supervision section we chose Elasticsearch and Logstash.
Regarding the analytics section, my team and I are comparing Apache Spark Streaming and Apache Storm in order to use one of them with Elasticsearch. Despite the fact that Apache Storm is a true real-time data processing tool and faster than Apache Spark Streaming, it does not provide machine learning libraries as Apache Spark does. That's why we are thinking of choosing Apache Spark. The Elastic website indicates that there is a connector, ES-Hadoop, to connect an Elasticsearch database to a Hadoop ecosystem, as we can see in the figure below.
However, we are a little bit confused by this picture, because it shows only Spark SQL and not all the Spark frameworks (MLlib, Spark Streaming, ...). We made some assumptions and came up with two possible final architectures. We only wanted to know whether they are technically correct and whether we are heading in the wrong direction.
With Apache Spark Streaming:
With Apache Storm:
Both of your architectural diagrams are OK. Keep in mind that Spark Streaming will not work in this scenario. ES-Hadoop provides you with easy-access APIs to get data from, and put data into, Elastic. It also provides the methods to get the data into the Spark framework (RDDs), or into DataFrames in the case of Spark SQL. Once the data is in the framework, all the ML libraries can be applied to it for machine learning or analytics generation. Elastic is not capable of streaming data, so Spark Streaming in the strict sense is not possible. So in the diagram, the optional arrow to HDFS and then on to Spark Streaming can be removed, and the arrow should just point to HDFS. My concern, however, would be running MLlib algorithms on the data in real time and expecting real-time performance. A typical use case might be to do model generation offline and use the model in real time for analysis.
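As a concrete illustration, here is a minimal Java sketch of reading an index through es-hadoop's Spark SQL integration (using Spark 2.x's SparkSession); the address, index, and field names are placeholder assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class EsSparkSql {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("es-spark-sql")
                    .config("es.nodes", "localhost:9200")      // assumed ES address
                    .getOrCreate();

            // Load an index as a DataFrame via the es-hadoop data source.
            Dataset<Row> logs = spark.read()
                    .format("org.elasticsearch.spark.sql")
                    .load("logs/event");                       // placeholder index/type

            // Once the data is in a DataFrame, MLlib or any other Spark
            // library can consume it, as described above.
            logs.printSchema();
            logs.groupBy("status").count().show();             // "status" is a placeholder field
        }
    }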

crawler + elasticsearch integration

I wasn't able to find out how to crawl a website and index the data into Elasticsearch. I managed to do that with the combination Nutch + Solr, and since Nutch should be able to export data directly to Elasticsearch from version 1.8 (source), I tried to use Nutch again. Nevertheless, I didn't succeed. After trying to invoke
$ bin/nutch elasticindex
I get:
Error: Could not find or load main class elasticindex
I don't insist on using Nutch. I would just need the simplest way to crawl websites and index them into Elasticsearch. The problem is that I wasn't able to find any step-by-step tutorial, and I'm quite new to these technologies.
So the question is: what would be the simplest way to integrate a crawler with Elasticsearch? If possible, I would be grateful for a step-by-step solution.
Did you have a look at the River Web plugin? https://github.com/codelibs/elasticsearch-river-web
It provides a good How To section, covering creation of the required indexes, scheduling (based on Quartz), authentication (basic and NTLM are supported), metadata extraction, ...
Might be worth having a look at the elasticsearch river plugins overview as well: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html#river
Since the River plugins have been deprecated, it may be worth having a look at ManifoldCF or Norconex Collectors.
You can evaluate indexing Common Crawl metadata into Elasticsearch using Hadoop:
When working with big volumes of data, Hadoop provides all the power to parallelize the data ingestion.
Here is an example that uses Cascading to index directly into Elasticsearch:
http://blogs.aws.amazon.com/bigdata/post/TxC0CXZ3RPPK7O/Indexing-Common-Crawl-Metadata-on-Amazon-EMR-Using-Cascading-and-Elasticsearch
The process involves the use of a Hadoop cluster (EMR on this example) running the Cascading application that indexes the JSON metadata directly into Elasticsearch.
The Cascading source code is also available, so you can see how the data ingestion into Elasticsearch is handled.
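As a rough sketch of what such a Cascading application can look like, here is a minimal Java flow using es-hadoop's EsTap; it assumes line-delimited JSON metadata already sitting on HDFS, and the host, paths, and index names are placeholders:

    import java.util.Properties;

    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import org.elasticsearch.hadoop.cascading.EsTap;

    public class CrawlMetadataToEs {
        public static void main(String[] args) {
            // Source: JSON metadata files on HDFS (placeholder path).
            Tap in = new Hfs(new TextLine(), "hdfs:///commoncrawl/metadata");

            // Sink: an Elasticsearch index, via es-hadoop's Cascading integration.
            Tap out = new EsTap("localhost", 9200, "crawl/metadata");

            Properties props = new Properties();
            // Tell es-hadoop the input lines are already JSON documents.
            props.setProperty("es.input.json", "true");

            new HadoopFlowConnector(props)
                    .connect(in, out, new Pipe("index-crawl-metadata"))
                    .complete();
        }
    }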

How to run mahout from command line with KNN based Item Recommender?

I'm new to mahout and still trying to figure things out.
I'm trying to run a KNN-based recommender that runs on a Hadoop cluster (a distributed recommender). I'm using Mahout 0.8, in which KNN is deprecated, but it is still usable (at least when I call it from Java code).
I have several questions:
Is it true that there are basically two Mahout implementations?
distributed (runs from the command line)
non-distributed (runs from a jar file)
Assuming (1) is correct, does Mahout support running a KNN-based recommender from the command line? Can someone give me a direction for doing it?
Assuming (1) is wrong, how can I build a recommender in Java (I'm using Eclipse) that runs on a Hadoop cluster (distributed)?
Thanks!
KNN is being deprecated because it is being replaced with the item-based and user-based cooccurrence recommenders and the ALS-WR recommender, which are better and more modern.
Yes, but not all code has a CLI interface. For the most part, the CLI jobs in Mahout are Hadoop/distributed jobs that produce files in HDFS as output. These can be run from jar files with your own code wrapping them, just as you must do with the local/non-distributed/non-Hadoop versions, which do not have a CLI. The in-memory recommenders require you to pass in a user ID to get recommendations, so you have to write code to do that. The Hadoop versions do have a CLI, since they precalculate all recommendations for all users and put them in files; you'll probably insert those into your DB or serve them up some other way.
No, to my knowledge only the user-based, item-based, and ALS-WR recommenders are supported from the command line (for example, the recommenditembased job). This runs the Hadoop/distributed version of the recommenders. It can work on a single machine, of course, even using the local filesystem, since Hadoop can be set up that way.
For the in-memory recommenders, just write your driver code and run it in Eclipse; since Hadoop is not involved, it works fine. If you want to use the Hadoop versions, set up Hadoop on your dev machine to run locally against the local filesystem, and everything will work fine in Eclipse. Once you have things debugged, move it to your Hadoop cluster. You can also debug remotely on the cluster, but that is another question altogether.
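For example, a minimal in-memory (Taste) driver might look like this; the input file, the choice of LogLikelihoodSimilarity, and the user ID are placeholders:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class InMemoryRecommenderDriver {
        public static void main(String[] args) throws Exception {
            // CSV of userID,itemID,preference triples (placeholder file name).
            DataModel model = new FileDataModel(new File("ratings.csv"));

            // Item-based recommender; LogLikelihood is one of several similarities.
            ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
            GenericItemBasedRecommender recommender =
                    new GenericItemBasedRecommender(model, similarity);

            // As noted above, you pass in a user ID to get recommendations.
            List<RecommendedItem> recs = recommender.recommend(1L, 10);
            for (RecommendedItem rec : recs) {
                System.out.println(rec.getItemID() + " : " + rec.getValue());
            }
        }
    }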
The latest thing in Mahout recommenders is one that is trained in the background using Hadoop, with the output then indexed by Solr. You query Solr with items the user has expressed a preference for; there is no need to precalculate all recommendations for all users, since they are returned from a Solr query in near real time. This is in Mahout 1.0-SNAPSHOT's mahout/examples/ or here: https://github.com/pferrel/solr-recommender
BTW, this code is being integrated into Mahout 1.0 and moved to run on Spark instead of Hadoop, so even the training step will be much, much faster.
Update:
I've clarified what can be run from the CLI above.

How does es-hadoop (ElasticSearch-Hadoop) do Hadoop?

How does es-hadoop enable Hadoop analytics if it is merely a Hadoop connector to HDFS?
I am assuming you are referring to this project. In that case, the ES-Hadoop project has two sides: an ES plugin for HDFS, which is used for creating index snapshots, and various utilities that can be used within MapReduce, Hive, Pig, Spark, etc., for interacting with Elasticsearch.
For example, it is possible to bulk load ES documents from HBase using MapReduce via the EsOutputFormat output format. It is also possible to use MapReduce to read from ES through a similar mechanism (EsInputFormat).
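A minimal job setup for the write path, following the es-hadoop configuration style, might look like this; the node address, target index, input path, and the absence of a custom mapper are simplifying assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.elasticsearch.hadoop.mr.EsOutputFormat;

    public class BulkLoadToEs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("es.nodes", "localhost:9200");      // assumed ES address
            conf.set("es.resource", "myindex/mytype");   // placeholder target index/type

            Job job = Job.getInstance(conf, "bulk-load-to-es");
            job.setJarByClass(BulkLoadToEs.class);

            // A mapper (not shown) would emit MapWritable values that
            // EsOutputFormat turns into indexed documents.
            job.setOutputFormatClass(EsOutputFormat.class);
            job.setMapOutputValueClass(MapWritable.class);
            job.setSpeculativeExecution(false);          // recommended when writing to ES

            FileInputFormat.addInputPath(job, new Path("hdfs:///input"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }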
ES is also an analytical tool, with query capabilities and a visualization tool, Kibana. So, with this integration, Elastic's claim is that you can analyze the data in HDFS using Elasticsearch.