What can I do with Hadoop and Elasticsearch together?

I'm reading about using Hadoop with Elasticsearch, and I'm confused about how it works.
I guess in this case Elasticsearch is a substitute for HDFS/HBase, so I could write Hadoop jobs to get and process data in Elasticsearch.
Is this correct?
If yes, does this work for Hive and Pig too?

You can use Elasticsearch with Hadoop as:
Input and output for MapReduce
Input (storage) for Hive and Pig
Write and read directly in ElasticSearch with Cascading
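As a hedged illustration of the Hive case, the sketch below builds the kind of external-table statement that Elasticsearch for Apache Hadoop uses to expose an index to Hive. The storage handler class and the es.resource / es.nodes property names come from that library; the table name, columns, index name and host are made-up placeholders, and the Python helper is only a convenience for assembling the statement.

```python
def hive_es_table_ddl(table, columns, es_resource, es_nodes="localhost:9200"):
    """Return a Hive DDL string for an external table backed by Elasticsearch.

    table, columns, es_resource and es_nodes are illustrative placeholders.
    """
    # Render the column list as "name TYPE, name TYPE, ..."
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'\n"
        f"TBLPROPERTIES('es.resource' = '{es_resource}', 'es.nodes' = '{es_nodes}');"
    )


if __name__ == "__main__":
    # Hypothetical table and index names, for demonstration only.
    print(hive_es_table_ddl("artists", [("id", "BIGINT"), ("name", "STRING")], "radio/artists"))
```

Running the generated statement in Hive requires the elasticsearch-hadoop jar to be available to the Hive session; queries against the table are then served from the Elasticsearch index.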

Related

Integration of Hadoop and Solr

According to my research, I can integrate Hadoop and Solr. I have downloaded and installed both of them, but couldn't integrate them with each other. I also couldn't find a proper tutorial for this purpose.
I use Ubuntu 14.04.02, Apache Hadoop 2.6.0 and Solr 5.2.1.
How can I integrate Hadoop and Solr on my machine?
Note: I installed Hadoop as a single node. Also, I am a complete beginner with these concepts.
You can use Solr with Hadoop in two ways:
document based
using Lily indexers with HBase
So if you want a document that is present in HDFS to be indexed by Solr,
you need to follow these steps:
Step A.
solrctl --zk zookeeper_server:port/solr --solr solr-server:port/solr instancedir --generate <path of collection>/collection_name
Edit /collection_name/conf/schema.xml so that it declares the fields present in the data to be indexed.
solrctl --zk zookeeper_server:port/solr --solr solr-server:port/solr instancedir --create <collection_name> <path of collection>/collection_name
solrctl --zk zookeeper_server:port/solr --solr solr-server:port/solr collection --create <collection_name> -s <num_of_solr_shard> -r <num_of_solr_replication>
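You can give any number to <num_of_solr_shard> and <num_of_solr_replication>, but
<num_of_solr_shard> * <num_of_solr_replication> <= number of Solr nodes in the cluster.
E.g. if you have 7 nodes, you can have 3,2 or 2,3 as needed.
So for your case (a single node) it would be 1 and 1.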
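To make the sizing rule concrete (shards multiplied by replicas should not exceed the number of Solr nodes, as this answer suggests), here is a small sketch that enumerates the combinations allowed for a given cluster size. The function name is hypothetical, and the rule is the answer's rule of thumb rather than something Solr enforces in this exact form.

```python
def valid_layouts(num_nodes):
    """List (shards, replicas) pairs with shards * replicas <= num_nodes."""
    return [
        (shards, replicas)
        for shards in range(1, num_nodes + 1)
        # Largest replica count that keeps shards * replicas within num_nodes
        for replicas in range(1, num_nodes // shards + 1)
    ]


if __name__ == "__main__":
    print(valid_layouts(1))            # single-node install
    print((3, 2) in valid_layouts(7))  # the 7-node example from the answer
```

For a single-node install the only pair it yields is 1 shard and 1 replica, which matches the values suggested for this question.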
Step B.
Once the collection has been created, data can be indexed with the following command:
curl http://solr-server:port/solr/<collection_name>/update/csv --data-binary @<path_of_data_file_in_linux> -H 'Content-type:text/plain; charset=utf-8'
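For readers who prefer scripting the upload, the curl call above can be approximated in Python with the standard library. This is only a sketch: the host, port 8983, collection name and sample CSV rows are placeholders, and the helper builds the request without sending it so it can be inspected first.

```python
import urllib.request


def build_csv_update_request(solr_base_url, collection, csv_bytes):
    """Build (but do not send) a POST request to a collection's CSV update handler."""
    url = f"{solr_base_url}/{collection}/update/csv"
    return urllib.request.Request(
        url,
        data=csv_bytes,
        headers={"Content-type": "text/plain; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host, port and collection; the CSV rows are made-up sample data.
    req = build_csv_update_request(
        "http://solr-server:8983/solr", "collection_name", b"id,name\n1,example\n"
    )
    print(req.get_method(), req.full_url)
    # To send it against a running Solr instance: urllib.request.urlopen(req)
```

In practice the bytes would come from reading the data file in binary mode, which is what curl does when given a file for --data-binary.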
If you want to index HBase data, follow Step A to create the Solr collection, then use the Lily Indexer (key-value indexer) to create an indexer on HBase; after that, the data can be seen in Solr as XML or JSON.
I would recommend reading about Cloudera Search (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_introducing.html).
This is basically an open source project by Cloudera integrating Hadoop and Solr.

Is it possible to put elasticsearch indexed data into hdfs?

Can Elasticsearch indexed data be put into HDFS? I'm not sure about this, so I thought I'd get an expert view on it.
I'm not certain exactly what you are looking for. If you want to back up/restore data into HDFS from Elasticsearch, the answer is the Hadoop HDFS snapshot/restore plugin for Elasticsearch:
https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs
This allows you to back up and restore data from ES into HDFS.
If, on the other hand, you want to run MapReduce jobs in Hadoop that access Elasticsearch data, the answer is Elasticsearch for Apache Hadoop:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html
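For the backup route, a rough sketch of the two REST calls involved might look like the following, assuming the HDFS repository plugin is installed on the Elasticsearch nodes. The _snapshot endpoints and the "hdfs" repository type with uri and path settings follow the snapshot API as I understand it; the host names, repository name, snapshot name and paths are placeholders. The helpers only build the requests and do not send them.

```python
import json
import urllib.request


def register_hdfs_repo_request(es_url, repo_name, hdfs_uri, hdfs_path):
    """Build a PUT request that registers an HDFS snapshot repository."""
    body = {"type": "hdfs", "settings": {"uri": hdfs_uri, "path": hdfs_path}}
    return urllib.request.Request(
        f"{es_url}/_snapshot/{repo_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_snapshot_request(es_url, repo_name, snapshot_name):
    """Build a PUT request that starts a snapshot in the given repository."""
    return urllib.request.Request(
        f"{es_url}/_snapshot/{repo_name}/{snapshot_name}", method="PUT"
    )


if __name__ == "__main__":
    # Placeholder cluster addresses and names, for illustration only.
    repo = register_hdfs_repo_request(
        "http://localhost:9200", "hdfs_backup", "hdfs://namenode:8020", "es/snapshots"
    )
    snap = create_snapshot_request("http://localhost:9200", "hdfs_backup", "snapshot_1")
    print(repo.get_method(), repo.full_url)
    print(snap.get_method(), snap.full_url)
    # Send with urllib.request.urlopen(...) once the cluster and plugin are in place.
```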

Hadoop and Elasticsearch integration

I would like to integrate Apache Hadoop and Elasticsearch for my project, where the data will be stored in HDFS and at the same time will also be available in Elasticsearch.
Any changes I make to the data in Elasticsearch should also be reflected and stored in HDFS. So basically I am looking for a bidirectional flow between Elasticsearch and HDFS. I have tried searching the web for useful documentation but didn't find anything helpful. Any help in this regard would be appreciated.
Thanks
See Elasticsearch for Hadoop and MapReduce. You can index the data from HDFS into ES in real time for new, modified and deleted data. But propagating changes from ES back into HDFS may not be possible, because HDFS is based on a write-once-read-many architecture (see HDFS Design).

Elasticsearch and Hive work together

I see that Hive and Elasticsearch are almost equivalent, except that Elasticsearch supports near-real-time queries. Moreover, Elasticsearch can run independently to store and analyze data. So why do people use both Hive and Elasticsearch on Hadoop?
Hive and Elasticsearch are two really different tools.
Hive is a SQL-to-Hadoop translator that lets you interact with virtually any data source (including Elasticsearch) using SQL, via SerDes. Hive can also store data in HDFS. Hive is really good at batch processing.
Elasticsearch is a distributed faceted search engine; it is very good at quickly retrieving data from millions of documents. It can also be used to make some simple calculations using facets.
Hive and ES are complementary: people use Hive to process data and ES to deliver data and insights.

Lucene query from Hadoop PIG jobs

I have a couple of thousand customer names, alternative names, business names, etc. indexed in Lucene indexes (the indexes are not stored in HDFS).
I have a massive amount (>100M) of person data in HDFS and I want to scan the person data against the Lucene indexes. I am currently using Pig for data processing from HDFS.
I am trying to find out whether it is possible to run a Pig job that extracts data and, in parallel, performs queries against the Lucene indexes (maybe by using a custom-written UDF). I can't work out how the local Lucene indexes would be loaded and shared within Pig jobs (after the Lucene query I need the matched document IDs if a match is found).
Is this possible using Pig, or do I need to write custom MapReduce jobs for this? Any other suggestions?
Thanks.
You definitely need UDFs for that - elephant-bird's lucene loader is a good starting point.
Check it out at https://github.com/kevinweil/elephant-bird/tree/master/pig
