Analytics + full-text search - big data - Hadoop

I need to implement a system that derives analytics/insights from data (text only) and can also handle complex search queries.
So I have shortlisted Solr (search) and Hadoop (analytics), but I am unable to decide which one to use as the base. Can an HDFS cluster be integrated with Solr? I will mainly be dealing with aggregation queries, and the data will not update frequently.
I know this question is too broad and general; I just need an expert's opinion on the matter.

Look at Cloudera Search.
Cloudera Search = Solr + Hadoop.
Using Cloudera Search, you can query the data in Hadoop or HBase through Solr.
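For the aggregation-style queries mentioned in the question, a minimal SolrJ sketch might look like the following (the endpoint, collection name, and field names are illustrative assumptions, and the SolrJ 6+ client API is assumed):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class AggregationQuery {
        public static void main(String[] args) throws Exception {
            // Endpoint and collection name are assumptions; adjust for your cluster.
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/texts").build();
            SolrQuery q = new SolrQuery("body:analytics");
            q.setRows(0);              // only the aggregates, not the documents
            q.setFacet(true);
            q.addFacetField("author"); // per-author counts, computed inside Solr
            QueryResponse rsp = solr.query(q);
            for (FacetField.Count c : rsp.getFacetField("author").getValues()) {
                System.out.println(c.getName() + " -> " + c.getCount());
            }
            solr.close();
        }
    }

Since the data rarely changes, facet counts like these are cheap to serve straight from the index, without a separate aggregation job.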

Related

Solr HBase search engine

I need to use SolrCloud as the search engine on top of HBase and HDFS for searching a very large number of documents.
Currently these docs live in different data sources. I am confused about whether Solr should search, index, and store these docs within itself, or whether Solr should be used only for indexing, with the docs and their metadata residing in the HBase/HDFS layer.
I have tried to find out how a Solr-HBase integration works best (i.e., what should be done at the Solr level and what at the Hadoop level), but in vain. Has anyone done this kind of big-data search before and can give some pointers? Thanks
Solr provides fast search via its indexes; it uses inverted indexes for this. You index documents into Solr and it builds the indexes. Based on how you have defined schema.xml, Solr decides how the indexes have to be created. The indexes and the stored field values can be kept on HDFS, depending on your configuration in solrconfig.xml.
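As a rough illustration of the indexing side, here is a minimal SolrJ sketch (endpoint, collection, and field names are assumptions; the fields must be defined in your schema.xml). Whether the resulting index lives on local disk or on HDFS is decided by the directoryFactory configured in solrconfig.xml (solr.HdfsDirectoryFactory for HDFS):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexOneDoc {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/records").build();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "row-0001");      // unique key from schema.xml
            doc.addField("title", "Hello Solr"); // indexed per its schema.xml definition
            solr.add(doc);
            solr.commit();  // make the document searchable
            solr.close();
        }
    }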
With respect to HBase, you can run your queries directly on HBase without having to use Solr. Solbase is one available Solr-HBase integration. Also have a look at Lily.
A good design that is commonly followed is: search for things in Solr, quickly get the IDs of the matching records, and then, if needed, fetch the entire records from HBase. You need to make sure that the entire data is in HBase and that only as much as is needed for searching is indexed in Solr. Needless to say, Solr and HBase should be kept in sync; one ready-made framework for that is NGDATA's HBase Indexer.
Solr works wonders for counts, grouped counts, and stats. Once you have those numbers and their IDs, HBase can take over: once you have the row key (the ID), HBase gives you low-latency lookups, which suits web applications well. The pattern is sketched below.
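A minimal sketch of that search-then-fetch pattern, assuming a Solr collection and an HBase table that share the same id as unique key/row key (all names, column families, and endpoints are hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class SearchThenFetch {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                         "http://localhost:8983/solr/records").build();
                 Connection hbase = ConnectionFactory.createConnection(
                         HBaseConfiguration.create());
                 Table table = hbase.getTable(TableName.valueOf("records"))) {
                // 1. Solr: find matches fast, but return only the row keys.
                SolrQuery q = new SolrQuery("body:hadoop");
                q.setFields("id");
                for (SolrDocument d : solr.query(q).getResults()) {
                    byte[] rowKey = Bytes.toBytes((String) d.getFieldValue("id"));
                    // 2. HBase: low-latency point get of the full record.
                    Result row = table.get(new Get(rowKey));
                    byte[] body = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("body"));
                    System.out.println(new String(body, "UTF-8"));
                }
            }
        }
    }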

Hadoop and Elasticsearch integration

I would like to integrate Apache Hadoop and Elasticsearch for my project, so that data stored in HDFS is also available in Elasticsearch.
Any changes I make to the data in Elasticsearch should also be reflected and stored in HDFS, so basically I am looking for a bidirectional flow between Elasticsearch and HDFS. I have searched the web for useful documentation but didn't find anything helpful. Any help in this regard would be appreciated.
Thanks
See Elasticsearch for Hadoop (es-hadoop) and its MapReduce integration. You can index the data from HDFS into ES in near real time, covering new, modified, and deleted data. The other direction, from ES into HDFS, may not be possible, because HDFS is built on a write-once-read-many architecture (see the HDFS design documentation).
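For the HDFS-to-ES direction, a map-only MapReduce job with the elasticsearch-hadoop connector is one common shape. A minimal sketch (index name, input path, and endpoint are assumptions; the input is one JSON document per line):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.elasticsearch.hadoop.mr.EsOutputFormat;

    public class HdfsToEs {
        // Pass-through mapper: each input line is already one JSON document.
        public static class JsonMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(NullWritable.get(), line);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("es.nodes", "localhost:9200"); // ES endpoint (assumption)
            conf.set("es.resource", "docs/doc");    // target index/type (assumption)
            conf.set("es.input.json", "yes");       // values are raw JSON strings
            conf.setBoolean("mapreduce.map.speculative", false); // avoid double writes

            Job job = Job.getInstance(conf, "hdfs-to-es");
            job.setJarByClass(HdfsToEs.class);
            job.setMapperClass(JsonMapper.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(EsOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);               // map-only indexing job
            FileInputFormat.addInputPath(job, new Path("/data/json"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }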

Elasticsearch and Hive work together

I see that Hive and Elasticsearch are almost equivalent, except that Elasticsearch supports near-real-time queries. Moreover, Elasticsearch can run independently to store and analyze data. So why do people use both Hive and Elasticsearch on Hadoop?
Hive and Elasticsearch are two really different tools.
Hive translates SQL into Hadoop jobs, letting you interact with virtually any data source through SQL (including Elasticsearch) by means of SerDes and storage handlers. Hive can also store data on HDFS. Hive is really good at batch processing.
Elasticsearch is a distributed, faceted search engine; it is very good at quickly retrieving data from among millions of documents. It can also be used for some simple calculations using facets.
Hive and ES are complementary: people use Hive to process data and ES to deliver data and insights.
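To make that division of labor concrete, here is a hypothetical sketch (table, index, column, and endpoint names are all assumptions) that registers an Elasticsearch-backed external table in Hive through the elasticsearch-hadoop storage handler and pushes the result of a Hive batch job into it:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveToEs {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement st = conn.createStatement()) {
                // External table backed by an ES index: Hive reads and writes
                // it like any other table via the es-hadoop storage handler.
                st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS insights "
                         + "(ts BIGINT, msg STRING) "
                         + "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' "
                         + "TBLPROPERTIES ('es.resource' = 'insights/entry', "
                         + "'es.nodes' = 'localhost:9200')");
                // Heavy batch processing stays in Hive; the result lands in ES,
                // where it can be served with near-real-time queries.
                st.execute("INSERT OVERWRITE TABLE insights "
                         + "SELECT ts, msg FROM raw_logs WHERE level = 'ERROR'");
            }
        }
    }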

How to integrate Hadoop, SOLR and Impala?

I am looking for an example of, or guidance on, how to use Hadoop, Solr, and Impala together. I know how to use Impala and Hadoop, but I also want to use the power of Solr to make queries run faster. I have explored the web pretty extensively but could not find anything that would put me into action.
Solr and Impala complement each other in that Solr can help you understand your data's structure; in other words, it is useful for initial discovery. From that point on, you can use that knowledge to write Impala queries that target one facet of the data or another.
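A hypothetical sketch of that discovery-then-drill-down flow: facet in Solr to see how the data breaks down, then aggregate the interesting slice in Impala. Impala speaks the HiveServer2 wire protocol, so the Hive JDBC driver against port 21050 is used here; all names and endpoints are assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;

    public class DiscoverThenQuery {
        public static void main(String[] args) throws Exception {
            // 1. Discovery: facet on a field in Solr to see the value distribution.
            HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/docs").build();
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0).setFacet(true).addFacetField("category");
            FacetField categories = solr.query(q).getFacetField("category");
            String top = categories.getValues().get(0).getName(); // most frequent
            solr.close();

            // 2. Targeted analytics: run an Impala aggregation over that slice only.
            try (Connection c = DriverManager.getConnection(
                     "jdbc:hive2://localhost:21050/;auth=noSasl");
                 Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT category, COUNT(*) FROM docs "
                   + "WHERE category = '" + top + "' GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }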

Lucene and Hadoop

I am using Lucene to provide indexing and searching over text files. Can I use HDFS to store the index files?
You have the tasks in the wrong order: instead of thinking about where to use Hadoop, first think about what you need to implement your project. If you then see that you need Hadoop, it will become obvious where and how to use it.
One tip: most probably you need neither Hadoop nor even Lucene itself. Solr, a search server built on top of Lucene, now has a distributed setup (SolrCloud) that is specifically designed for indexing and searching; Nutch can be used as a front end to Solr to crawl the web; and Tika can help you parse all kinds of offline file formats.
Lucene comes into the picture once all your data is ready in the form of Lucene documents.
It looks like you know Lucene already. The purpose of Hadoop is to break a big task into small chunks. I think a first use of Hadoop could be to gather the data: each Hadoop node can keep collecting data and creating Lucene documents, as in the sketch below.
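A minimal sketch of that per-node indexing step with the Lucene 5+ API (paths and field names are assumptions). Note that stock Lucene directories target the local filesystem; storing the index directly on HDFS is exactly what Solr's HdfsDirectoryFactory was built for:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class BuildIndex {
        public static void main(String[] args) throws Exception {
            // Local-filesystem directory; Lucene does not write to HDFS out of the box.
            Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
            try (IndexWriter writer = new IndexWriter(
                     dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("path", "/data/file1.txt", Field.Store.YES));
                doc.add(new TextField("body", "contents of the text file", Field.Store.NO));
                writer.addDocument(doc);  // one Lucene document per collected file
            }
        }
    }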
