Elasticsearch and Hive work together - hadoop

I see that Hive and Elasticsearch are almost equivalent, except that Elasticsearch supports near-real-time queries. Moreover, Elasticsearch can run independently to store and analyze data. So why do people use both Hive and Elasticsearch on Hadoop?

Hive and Elasticsearch are two really different tools.
Hive is a SQL layer on top of Hadoop: it translates SQL into Java MapReduce jobs and, through SerDes, can interact with virtually any data source (including Elasticsearch). Hive can also store its data on HDFS. Hive is really good at batch processing.
Elasticsearch is a distributed faceted search engine; it is very good at quickly retrieving data from millions of documents. It can also be used for simple calculations using facets.
Hive and ES are complementary: people use Hive to process data and ES to deliver data and insights.
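As an illustration of that split, here is a minimal, hedged sketch of the "process in Hadoop, deliver in ES" hand-off using the es-hadoop connector's EsOutputFormat. The input path, the sales/summary index, and the localhost:9200 endpoint are assumptions; the input is expected to be newline-delimited JSON produced by an earlier Hive/MapReduce batch job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class HdfsToEs {

    // Identity mapper: each input line is assumed to already be one JSON document.
    public static class JsonMapper extends Mapper<Object, Text, NullWritable, Text> {
        @Override
        protected void map(Object key, Text line, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(NullWritable.get(), line);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("es.nodes", "localhost:9200");   // assumed ES endpoint
        conf.set("es.resource", "sales/summary"); // hypothetical index/type
        conf.set("es.input.json", "yes");         // values are raw JSON strings
        // Writes go to an external system, so speculative tasks would duplicate documents.
        conf.setBoolean("mapreduce.map.speculative", false);

        Job job = Job.getInstance(conf, "hdfs-to-es");
        job.setJarByClass(HdfsToEs.class);
        job.setMapperClass(JsonMapper.class);
        job.setNumReduceTasks(0);                 // map-only: just ship the JSON out
        job.setOutputFormatClass(EsOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/data/hive-output")); // hypothetical path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In practice you would let Hive do the heavy joins and aggregations first, then run a small job like this to push the results into ES for serving.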

Related

Solr HBase search engine

I need to use SolrCloud as the search engine on top of HBase and HDFS for searching a very large number of documents.
Currently these docs are in different data sources. I am confused about whether Solr should search, index, and store these docs within itself, or whether Solr should be used only for indexing while the docs and their metadata reside in the HBase/HDFS layer.
I have tried to find out how a Solr/HBase integration works best (i.e., what should be done at the Solr level and what at the Hadoop level), but in vain. Has anyone done this kind of big-data search before and can give some pointers? Thanks
Solr provides fast search via its inverted indexes: you index documents to Solr and it builds the indexes, deciding how to do so based on how you have defined schema.xml. The indexes and the stored field values can be kept on HDFS (based on your config in solrconfig.xml).
With respect to HBase, you can run queries directly against HBase without using Solr. SolrBase is one available Solr/HBase integration; also have a look at Lily.
A good design is to search in Solr, quickly get the IDs of the matching records, and then, if needed, fetch the full records from HBase. Make sure the entire data set is in HBase and only the necessary fields are indexed in Solr. Needless to say, Solr and HBase must be kept in sync; one ready-made framework for that is the NGDATA HBase Indexer.
Solr works wonders for counts, grouped counts, and stats. Once you have those numbers and their IDs, HBase can take over: given a row key (the ID), HBase gives you low-latency lookups, which suits web applications well.
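A hedged sketch of that search-then-fetch pattern, using SolrJ and the HBase client. The Solr core name, HBase table name, column family, and field names are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class SearchThenFetch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                     "http://localhost:8983/solr/documents").build();  // hypothetical core
             Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = hbase.getTable(TableName.valueOf("documents"))) {

            // 1. Search in Solr, asking only for the id field (the HBase row key).
            SolrQuery q = new SolrQuery("body:contract");
            q.setFields("id");
            q.setRows(100);

            // 2. For each hit, fetch the full record from HBase by row key.
            for (SolrDocument hit : solr.query(q).getResults()) {
                String id = (String) hit.getFieldValue("id");
                Result row = table.get(new Get(Bytes.toBytes(id)));
                byte[] body = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"));
                System.out.println(id + " -> " + (body == null ? "(missing)" : new String(body)));
            }
        }
    }
}
```

The design choice here is that Solr stores only what it needs to rank and identify documents, while HBase remains the system of record.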

Indexing logs with es-hadoop

I am new to Elasticsearch and want to index my website logs, which are stored on HDFS, for fast querying.
I have a well-structured pipeline that runs a script every 20 minutes to ingest the data into HDFS.
I want to integrate Elasticsearch with it so that it also indexes these logs based on particular field(s), thereby giving faster query results when using Spark SQL.
So, my question is: can I index my data based on particular field(s) only?
Also, my logs are saved in the Avro file format. Does ES provide a way to directly index Avro-serialized data, or do I need to convert it into some other format?
Thank you in advance.
I would suggest you look at the Elasticsearch, Logstash and Kibana (ELK) stack, which should be good enough to fulfill your requirement. Putting the data on HDFS and then using ES would be additional overhead.
Instead, you can use Logstash to pump data into ES, index only the fields you wish to query, and build simple dashboards in less than 10 minutes of work. Take a look at this tutorial for a step-by-step guide.
http://hadooptutorials.co.in/tutorials/elasticsearch/log-analytics-using-elasticsearch-logstash-kibana.html
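On the "particular field(s) only" part: Elasticsearch lets the index mapping control which fields are indexed (searchable). Below is a minimal, hedged sketch using the low-level Java REST client against an assumed ES 7.x at localhost:9200; the weblogs index and its field names are made up for illustration. (As for Avro: to my knowledge ES does not ingest Avro directly, so you would convert records to JSON first, e.g. in your pipeline or via Logstash.)

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class CreateLogIndex {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            Request req = new Request("PUT", "/weblogs");  // hypothetical index name
            // Only "url" and "status" are indexed (searchable); "raw" is kept in
            // _source for display but not indexed, so it adds no search overhead.
            req.setJsonEntity(
                  "{\"mappings\":{\"properties\":{"
                + "\"url\":{\"type\":\"keyword\"},"
                + "\"status\":{\"type\":\"integer\"},"
                + "\"raw\":{\"type\":\"text\",\"index\":false}}}}");
            client.performRequest(req);
        }
    }
}
```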

Analytics + Full text search - Big data

I need to implement a system that derives analytics/insights from text-only data and can also run complex search queries.
So I have shortlisted Solr (search) and Hadoop (analytics), but I am unable to decide which to use as the base. Can an HDFS cluster be integrated with Solr? I will mainly be dealing with aggregation queries, and the data will not update frequently.
I know this question is too broad and general. I just need an expert's opinion on this matter.
Look at Cloudera Search.
Cloudera Search = Solr + Hadoop.
Using Cloudera Search, you can query the data in Hadoop or HBase using Solr.
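Since your workload is mostly aggregations, note that Solr's faceting gives you counts and grouped counts without fetching documents. A minimal, hedged SolrJ sketch, with a hypothetical articles core and category field:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetCounts {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles").build()) {  // hypothetical core
            SolrQuery q = new SolrQuery("text:hadoop");
            q.setRows(0);                 // we only want the aggregates, not the docs
            q.setFacet(true);
            q.addFacetField("category");  // hypothetical indexed field
            QueryResponse resp = solr.query(q);
            for (FacetField.Count c : resp.getFacetField("category").getValues()) {
                System.out.println(c.getName() + ": " + c.getCount());
            }
        }
    }
}
```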

Lucene query from Hadoop Pig jobs

I have a couple of thousand customer names, alternative names, business names, etc. indexed in Lucene indexes (the indexes are not stored on HDFS).
I have a massive amount (>100M records) of person data in HDFS, and I want to scan that person data against the Lucene indexes; I currently use Pig for processing the data from HDFS.
I am trying to find out whether it is possible to run a Pig job that extracts data and, in parallel, queries the Lucene indexes (perhaps using a custom UDF). I cannot see how the local Lucene indexes would be loaded and shared within Pig jobs (after a Lucene query, I need the matched document IDs if a match is found).
Is this possible using Pig? Or do I need to write custom MapReduce jobs for this? Any other suggestions?
Thanks.
You definitely need a UDF for that; elephant-bird's Lucene loader is a good starting point, and a hand-rolled sketch is shown after the link below.
Check it out at https://github.com/kevinweil/elephant-bird/tree/master/pig
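If you prefer to roll your own, here is a hedged sketch of such a UDF. It lazily opens a local Lucene index (which you would ship to every node, e.g. via the distributed cache) and returns a bag of matched document IDs. The index path and the name/id field names are assumptions.

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class LuceneMatchUDF extends EvalFunc<DataBag> {
    private IndexSearcher searcher;  // opened lazily, once per task JVM

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        if (searcher == null) {
            // Hypothetical local path: the index must be present on every node.
            searcher = new IndexSearcher(DirectoryReader.open(
                    FSDirectory.open(Paths.get("/tmp/person-index"))));
        }
        String name = (String) input.get(0);
        try {
            QueryParser parser = new QueryParser("name", new StandardAnalyzer());
            ScoreDoc[] hits = searcher.search(
                    parser.parse(QueryParser.escape(name)), 10).scoreDocs;
            DataBag bag = BagFactory.getInstance().newDefaultBag();
            for (ScoreDoc hit : hits) {
                Document doc = searcher.doc(hit.doc);
                Tuple t = TupleFactory.getInstance().newTuple(1);
                t.set(0, doc.get("id"));  // assumes a stored "id" field
                bag.add(t);
            }
            return bag;
        } catch (org.apache.lucene.queryparser.classic.ParseException e) {
            throw new IOException(e);
        }
    }
}
```

From Pig you would REGISTER the jar, DEFINE the UDF, and call it in a FOREACH over the person relation, keeping only records whose returned bag is non-empty.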

Lucene and Hadoop

I am using Lucene to provide indexing and searching over text files. Can I use HDFS to store the index files?
You have the tasks in the wrong order: instead of thinking about where to use Hadoop, first think about what you need to implement your project. If you then see that you need Hadoop, it will become obvious where and how to use it.
One tip: most probably you need neither Hadoop nor even Lucene itself. Solr, a search server built on top of Lucene, now has a distributed setup specifically designed for indexing and searching; Nutch can be used as a front end for Solr to crawl the web; and Tika can help you parse all types of offline files.
Lucene comes into the picture once all your data is ready in the form of Lucene documents.
It looks like you already know Lucene. The purpose of Hadoop is to break a big task into small chunks. A first use of Hadoop could be to gather data: each Hadoop node can keep collecting data and creating Lucene documents.
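For reference, creating those Lucene documents is just a few lines. This hedged sketch writes to a local directory, since stock Lucene's FSDirectory expects random-access local storage rather than HDFS (which is part of why the distributed Solr setups mentioned above exist). The path and field names are made up.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/text-index")),  // local disk, not HDFS
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // The path is stored verbatim; the body is tokenized and searchable.
            doc.add(new StringField("path", "notes.txt", Field.Store.YES));
            doc.add(new TextField("body", "contents of the text file", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```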
