Integration of Hadoop and Solr - hadoop

According to my research, I can integrate Hadoop and Solr. I have downloaded and installed both of them, but I couldn't integrate them with each other, and I couldn't find a proper tutorial for this purpose.
I use Ubuntu 14.04.02, Apache Hadoop 2.6.0 and Solr 5.2.1.
How can I integrate Hadoop and Solr on my machine?
Note: I installed Hadoop as a single node. Also, I am a complete beginner with these concepts.

You can use Solr with Hadoop in two ways:
1. document based (indexing documents stored in HDFS)
2. using the Lily indexer with HBase
So if you want a document that is present in HDFS to be indexed by Solr, you need to follow these steps:
Step A.
solrctl --zk zookeeper_server:port/solr --solr solr-server:port/solr instancedir --generate <path of collection>/collection_name
Edit <path of collection>/collection_name/conf/schema.xml with the fields/attributes that are present in the data to be indexed.
solrctl --zk zookeeper_server:port/solr --solr solr-server:port/solr instancedir --create <collection_name> <path of collection>/collection_name
solrctl --zk zookeeper_server:port/solr --solr solr-server:port/solr collection --create <collection_name> -s <num_of_solr_shard> -r <num_of_solr_replication>
You can pass any numbers for <num_of_solr_shard> and <num_of_solr_replication>, but
* shards x replicas <= number of Solr nodes in the cluster
e.g. if you have 7 nodes, you can use 3 and 2, or 2 and 3, as per need.
So for your single-node case it would be 1 and 1.
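For a single-node setup like yours, the commands could look roughly like this (a sketch assuming Cloudera Search's solrctl with ZooKeeper on localhost:2181 and Solr on localhost:8983; the collection name hdfs_collection is just a placeholder):
solrctl --zk localhost:2181/solr --solr localhost:8983/solr instancedir --generate $HOME/hdfs_collection
# edit $HOME/hdfs_collection/conf/schema.xml to match the fields in your data
solrctl --zk localhost:2181/solr --solr localhost:8983/solr instancedir --create hdfs_collection $HOME/hdfs_collection
solrctl --zk localhost:2181/solr --solr localhost:8983/solr collection --create hdfs_collection -s 1 -r 1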
Step B.
Once the collection has been created, data can be indexed with the following command:
curl http://solr-server:port/solr/<collection_name>/update/csv --data-binary @<path_of_data_file_in_linux> -H 'Content-type:text/plain; charset=utf-8'
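For example, on a single node it could look like this (host, collection and file name are assumed; commit=true is added so the documents become searchable right away):
curl 'http://localhost:8983/solr/hdfs_collection/update/csv?commit=true' --data-binary @/home/user/data.csv -H 'Content-type:text/plain; charset=utf-8'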
If you want to index HBase data, follow Step A to create the Solr collection, then use the Lily indexer (the HBase key-value indexer) to create an indexer on HBase; after that the data can be seen in Solr as XML or JSON.

I would recommend you read about Cloudera Search (http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_introducing.html).
It is basically an open-source project by Cloudera that integrates Hadoop and Solr.

Related

Adding cluster to existing elastic search in elk

Currently I have an existing setup:
1. Elasticsearch
2. Logstash
3. Kibana
I have existing data on them.
Now I have set up an ELK cluster with 3 master nodes, 5 data nodes and 3 client nodes.
But I am not sure how I can get the existing data into them.
Is it possible that, if I make the existing ES node a data node and attach it to the cluster, the data will get replicated to the other data nodes as well, so that I can then take that node offline?
Option 1
How about just trying it with fewer nodes first? It is not hard to test whether this is supported: set up one node, feed in some data, then add one more node and configure them as a cluster to see if the data gets synchronized.
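For instance, after the second node has joined, you can check whether the shards were actually replicated (a sketch; host and port are assumed):
curl 'http://localhost:9200/_cluster/health?pretty'
curl 'http://localhost:9200/_cat/shards?v'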
Option 2
Another option is to use an Elasticsearch migration tool like https://github.com/taskrabbit/elasticsearch-dump; basically, you set up a clean cluster and migrate all the data from your old node into it.
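A rough sketch of what that migration could look like with elasticsearch-dump (old-node, new-cluster and my_index are placeholders; repeat per index):
elasticdump --input=http://old-node:9200/my_index --output=http://new-cluster:9200/my_index --type=mapping
elasticdump --input=http://old-node:9200/my_index --output=http://new-cluster:9200/my_index --type=data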

Is it possible to put elasticsearch indexed data into hdfs?

Can Elasticsearch-indexed data be put into HDFS? I'm not sure about this, so I thought I'd get an expert view on it.
Not certain exactly what you are looking for. If you want to back up/restore data from Elasticsearch into HDFS, the answer is the Hadoop HDFS snapshot/restore plugin for Elasticsearch:
https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs
This allows you to back up and restore data from ES into HDFS.
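A rough sketch of how that could be used (the host names, the repository name hdfs_repo and the HDFS path are assumptions): register an HDFS snapshot repository, then take a snapshot.
curl -XPUT 'http://es-node:9200/_snapshot/hdfs_repo' -d '{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020",
    "path": "/backups/elasticsearch"
  }
}'
curl -XPUT 'http://es-node:9200/_snapshot/hdfs_repo/snapshot_1?wait_for_completion=true'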
If, on the other hand, you want to run MapReduce jobs in Hadoop that access Elasticsearch data, the answer is Elasticsearch for Apache Hadoop:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/index.html

hadoop and elasticsearch integration

I would like to integrate Apache Hadoop and Elasticsearch for my project, where the data will be stored in HDFS and at the same time be available in Elasticsearch.
Any changes I make to the data in Elasticsearch should also be reflected and stored in HDFS, so basically I am looking for a bidirectional flow between Elasticsearch and HDFS. I have tried searching the web for useful documentation but didn't find anything helpful. Any help in this regard would be really appreciated.
Thanks
See Elasticsearch for Hadoop and Map Reduce. You can index data from HDFS into ES in real time for new, modified and deleted data. But going from ES back into HDFS may not be possible, because HDFS is based on a write-once-read-many architecture (see HDFS Design).

Analytics + Full text search - Big data

I need to implement a system which derives analytics/insights from data (text only) and can also handle complex search queries.
So I have shortlisted Solr (search) and Hadoop (analytics). I am unable to decide which base I should start with. Can we integrate an HDFS cluster with Solr? I will mainly be dealing with aggregation queries, and the data will not update frequently.
I know this question is too broad and general. I just need an expert's opinion on this matter.
Look at Cloudera Search and this
Cloudera Search = SOLR + Hadoop
Using Cloudera Search, you can query the data in Hadoop or HBase using SOLR.
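For example, once Cloudera Search has indexed your data, you query it with plain Solr requests (host, port and collection name are placeholders):
curl 'http://solr-server:8983/solr/my_collection/select?q=*:*&wt=json&rows=10'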

How to build distribute search base on hadoop and lucene

I'm preparing to build a distributed search module with Lucene and Hadoop but feel confused about a few things:
As we know, HDFS is a distributed file system; when I put a file into HDFS, the file will be divided into several blocks and stored on different slave machines in the cluster. But if I use Lucene to write an index on HDFS, I want to see the index on each machine; how can this be achieved?
I have read some of hadoop/contrib/index and some Katta, but I don't understand the idea of the "shards, looks like part of the index": is a shard stored on the local disk of one machine, or is it a single directory distributed across the cluster?
Thanks in advance
- As for your question 1:
You can implement the Lucene Directory interface to make it work with Hadoop and let Hadoop handle the files you submit to it. You could also provide your own implementations of IndexWriter and IndexReader and use your Hadoop client to write and read the index. This way you have more control over the format of the index you write. You can "see" or access the index on each machine via your Lucene/Hadoop implementation.
- As for your question 2:
A shard is a subset of the index. When you run your query, all shards are processed at the same time and the results of the index search on all shards are combined. On each machine of your cluster you will have a part of your index: a shard. So a part of the index will be stored on a local machine, but it will appear to you as a single index distributed across the cluster.
I can also suggest you check out the distributed search of SolrCloud, or here.
It runs on Lucene as the indexing/search engine and already enables you to have a clustered index. It also provides an API for submitting the files to index and for querying the index. Maybe it is sufficient for your use case.
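As a rough illustration (assuming a running SolrCloud node on localhost:8983 and an already-uploaded configset; the collection name and shard/replica counts are placeholders), creating a clustered index is a single Collections API call:
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=2&replicationFactor=1&collection.configName=my_configs'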
