Lucene and Hadoop

I am using Lucene to provide indexing and searching on text files. Can I use HDFS for storing the index files?

You have the tasks in the wrong order: instead of asking where to use Hadoop, first ask what you need in order to implement your project. If it then turns out that you need Hadoop, it will become obvious where and how to use it.
One tip: most probably you need neither Hadoop nor even Lucene itself. Solr, a search server built on top of Lucene, now has a distributed setup (SolrCloud) which is specifically designed for indexing and searching; Nutch can be used as a front-end for Solr to crawl the web; and Tika can help you parse all kinds of offline file formats.
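As an aside, here is a minimal sketch of the Tika piece, assuming the tika-parsers dependency is on the classpath (the input path is hypothetical):

```java
import java.io.File;
import org.apache.tika.Tika;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika(); // facade that auto-detects the file type
        // Hypothetical input file; Tika handles PDF, Word, HTML, and many other formats
        String text = tika.parseToString(new File("/data/report.pdf"));
        System.out.println(text);
    }
}
```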

Lucene comes into the picture once all your data is ready in the form of Lucene documents.
It looks like you know Lucene already. The purpose of Hadoop is to break a big task into small chunks. A first use of Hadoop could be to gather the data: each Hadoop node can keep collecting data and creating Lucene documents.
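For reference, a rough local sketch of what "creating Lucene documents" looks like (the index path and field names are illustrative, not from the question):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexTextFile {
    public static void main(String[] args) throws Exception {
        // Local on-disk index; note that IndexWriter assumes Directory semantics
        // (locking, random access) that plain HDFS does not provide out of the box.
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "doc-1", Field.Store.YES));  // exact-match key
            doc.add(new TextField("body", "contents of the text file", Field.Store.NO)); // analyzed text
            writer.addDocument(doc);
        }
    }
}
```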

Solr HBase search engine

I need to use SolrCloud as the search engine on top of HBase and HDFS for searching a very large number of documents.
Currently these docs are in different data sources. I am getting confused about whether Solr should search, index, and store these docs within itself, or whether Solr should only be used for indexing, with the docs and their metadata residing in the HBase/HDFS layer.
I have tried to find out how the Solr and HBase integration works best (meaning what should be done at the Solr level and what at the Hadoop level), but in vain. Has anyone done this kind of big-data search before who can give some pointers? Thanks
Solr provides fast search via its indexes; it uses inverted indexes for this. So you index documents to Solr, and it creates the indexes. Based on how you have defined schema.xml, Solr decides how the indexes have to be created. The indexes and the field values can be stored in HDFS, based on your configuration in solrconfig.xml.
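For example, a sketch of the relevant solrconfig.xml fragment (the namenode address and HDFS path are placeholders, not from the question):

```xml
<!-- Keep the Solr index on HDFS instead of the local disk -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
</directoryFactory>

<indexConfig>
  <lockType>hdfs</lockType>
</indexConfig>
```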
With respect to HBase, you can run your queries directly on HBase without having to use Solr. SolrBase is one available Solr and HBase integration. Also have a look at Lily.
A good design is to search in Solr, quickly get the IDs of the matching records, and then, if needed, fetch the entire records from HBase. You need to make sure that the entire data set is in HBase and that only sufficient data is indexed. Needless to say, Solr and HBase should be kept in sync. One ready-made framework for this is the NGDATA hbase-indexer.
Solr works wonders for counts, grouped counts, and stats. Once you have those numbers and their IDs, HBase can take over: with a row key (the ID) in hand, HBase gives you low-latency reads, which suits web applications well.
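A minimal sketch of that search-then-fetch pattern with SolrJ and the HBase client (the collection name, table name, and field names are invented for illustration):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class SolrThenHbase {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/records").build();
             Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = hbase.getTable(TableName.valueOf("records"))) {
            // 1. Ask Solr only for the row keys of matching records
            SolrQuery query = new SolrQuery("body:hadoop");
            query.setFields("id"); // index enough to search, nothing more
            for (SolrDocument match : solr.query(query).getResults()) {
                // 2. Fetch the full record from HBase by row key (low-latency point read)
                Get get = new Get(Bytes.toBytes((String) match.getFieldValue("id")));
                Result row = table.get(get);
                System.out.println(row);
            }
        }
    }
}
```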

Elasticsearch vs Cassandra vs Elasticsearch with Cassandra

I am learning NoSQL and evaluating options for one of my client's requirements. I have gone through various resources before asking this question (as a person with little knowledge of NoSQL). My requirements:
I need to store and read data at a fast rate.
It must be fully fail-safe and easily scalable.
It must support searching through the data for analytics.
I ended up with a shortlist: Cassandra and Elasticsearch.
What I understand is that Cassandra is a perfect NoSQL storage solution for me, as I can write and read data using indexes. Where it fails, or could fail, is analytics: if, in the future, I want to get data from from_date to to_date, or slice it in more ways for analytics, that will be hard unless I design the data model properly with long-term foresight, which is quite difficult in an ever-changing world.
Elasticsearch, on the other hand, is best at indexing (backed by Lucene) and can search the data given arbitrary query text. But does it work the same way if I want to retrieve data from from_date to to_date (I expect it might)? The real question is: is it a search engine, or is it also a perfect NoSQL data store like Cassandra? If the latter, why do we still need Cassandra?
If these two live in different worlds, please explain how! And how do we combine them to get a more effective solution?
One of our applications uses data that is stored in both Cassandra and Elasticsearch. We use Cassandra to access those records whenever we can, with the data duplicated into query tables designed to serve specific application-side requests. For searches more liberal than our query tables can support, Elasticsearch performs that functionality nicely.
We have asked that same question of ourselves: "Why don't we just get everything from Elasticsearch?"
The answer is that Elasticsearch was designed to be a search engine, not a persistent data store. Sometimes Elasticsearch loses writes, and schema changes are difficult to make without blowing everything away and reloading. For those reasons, I have written jobs that keep Elasticsearch in sync with our Cassandra cluster. A fairly recent discussion on Quora about this topic raised similar points.
That being said, Elasticsearch works great as a search engine, and Cassandra works great as a scalable, high-performance datastore. But querying data is different from searching for data. There are times when we need one or the other, and a combination of the two works well for our application. It may (or may not) work well for yours.
As for analytics, I have had some success using the Spark Cassandra Connector to serve more complex OLAP queries.
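For example, a hedged sketch using the connector's DataFrame API (the keyspace and table names are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraOlap {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cassandra-olap")
                .config("spark.cassandra.connection.host", "127.0.0.1") // connector setting
                .getOrCreate();

        // Load a Cassandra table as a DataFrame via the connector's data source
        Dataset<Row> events = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "analytics") // placeholder keyspace
                .option("table", "events")       // placeholder table
                .load();

        // An OLAP-style aggregation that would be awkward in plain CQL
        events.groupBy("event_type").count().show();
    }
}
```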
Edit 20200421
I've written a newer answer to a similar question:
ElasticSearch vs. ElasticSearch+Cassandra
Cassandra + Lucene is a great option. There are several initiatives in this area, for example:
Stratio's Cassandra Lucene Index - derived from Stratio Cassandra, this is a plugin for Apache Cassandra that extends its index functionality. (https://github.com/Stratio/cassandra-lucene-index)
Stratio Cassandra - a native integration with Apache Lucene; very interesting. (https://github.com/Stratio/stratio-cassandra) This project has been discontinued in favour of Stratio's Cassandra Lucene Index.
Tuplejump Stargate - similar to Stratio Cassandra, but less active. (https://github.com/tuplejump/stargate-core)
DSE Search by DataStax - allows using Cassandra with Apache Solr, but it is a proprietary option. (http://www.datastax.com/what-we-offer/products-services/datastax-enterprise)
After working on this problem myself, I have realized that NoSQL databases like Cassandra are good when you want to preserve your data schema with reliable write operations and don't need the indexing operations that Elasticsearch offers. Elasticsearch is good when you trust your schema, want indexed data, and will do far more reads than writes.
My case was data analytics, so I kept a lot of my lattices in Elasticsearch, since later I wanted to traverse the data extensively to decide what my next step should be. I would have used Cassandra if I had expected a lot of changes to the data schema in my analytics pipelines.
There are also many nice visualization tools, like Kibana, that you can use to present your data with good graphics. Maybe I am lazy, but they are very good-looking and they helped me.
Storing data in a combination of Cassandra and Elasticsearch gives you the most functionality: it lets you look up records in key-value tables, and also lets you search the data in indexes.
The combination gives you a lot of flexibility, which is ideal for your application.
Elassandra is a combined solution of Cassandra + Elasticsearch. It uses Elasticsearch to index the data and Cassandra as the data store. I'm not sure about the performance, but according to this article its performance is good.
If your application needs a search feature, Elassandra is the best open-source option. DSE Search is available, but it's expensive.
We developed an application that used Elasticsearch and Cassandra.
The same data was stored in Cassandra and indexed into Elasticsearch.
Our application's UI had features like search, aggregations, data export, and so on.
The back-end microservices continuously received huge amounts of data (on Kafka topics) and stored it in Cassandra. Once the data was stored in Cassandra, the services made sure it was indexed into Elasticsearch.
Cassandra acted as the "source of truth" for Elasticsearch. In cases where reindexing of the ES index was required, we queried Cassandra and reindexed the data into ES.
This solution helped us, as it was very easy to scale and the searches and aggregations were much faster.
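A skeletal version of that reindexing flow, assuming the DataStax 3.x Java driver and the Elasticsearch high-level REST client (the keyspace, table, and index names are invented; a real job would batch with a bulk API):

```java
import java.util.HashMap;
import java.util.Map;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ReindexJob {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("app");
             RestHighLevelClient es = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Cassandra is the source of truth: stream every row back out...
            for (Row row : session.execute("SELECT id, payload FROM documents")) {
                Map<String, Object> source = new HashMap<>();
                source.put("payload", row.getString("payload"));
                // ...and reindex it into Elasticsearch under the same id
                es.index(new IndexRequest("documents")
                                .id(row.getString("id"))
                                .source(source),
                        RequestOptions.DEFAULT);
            }
        }
    }
}
```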
Cassandra is great at retrieving data by ID. I don't know much about secondary-index performance, but I doubt it's as fast as Elasticsearch. Elasticsearch certainly wins when it comes to full-text search functionality (text analysis, relevancy scoring, etc.).
Cassandra wins on update performance, too. Elasticsearch supports updates, but an update is really a reindex plus a soft delete in an atomic operation.
Cassandra has a very nice replication model (if you need to be extra fail-safe). Elasticsearch is OK too; I'm not in the camp that says ES is particularly unreliable (it has issues sometimes, like all software).
Elasticsearch also has aggregations for real-time analytics. And because searches are so fast, analytics on a subset of data will be fast, too.
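For example, here is a hedged sketch of the from_date-to-to_date query asked about above, combined with an aggregation, using the high-level REST client (the index and field names are illustrative):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class RangeAndAggregate {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            SearchSourceBuilder source = new SearchSourceBuilder()
                    // date-range filter: the from_date/to_date case from the question
                    .query(QueryBuilders.rangeQuery("created").gte("2020-01-01").lte("2020-12-31"))
                    // real-time analytics: count documents per category
                    .aggregation(AggregationBuilders.terms("by_category").field("category"));
            SearchResponse response = es.search(
                    new SearchRequest("events").source(source), RequestOptions.DEFAULT);
            System.out.println(response);
        }
    }
}
```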
If your requirements are satisfied well enough by one of them (as it seems ES would be here), I would just use that one. If you have requirements from both worlds, then you can either:
use one of them and work around the downsides (for example, you may be able to handle many updates with Elasticsearch, but with more shards and more hardware), or
use both and make sure they stay in sync.
Since Elasticsearch is built on the Lucene index, storing your index in Elasticsearch performs best for retrieving the data, compared to indexing in Cassandra itself.
If your requirements are not about real-time retrieval, you can also use Elasticsearch as a NoSQL database. There are concerns that Elasticsearch loses writes and that schema changes are difficult, but if your volume of data is not too big, you can easily use Elasticsearch as a search engine with the best indexing, and at the same time as a NoSQL database. There are several ways to guard against those problems. I have worked on schema changes in Elasticsearch: if your data structure is consistent, schema changes will not create any issues.
As a supporter of both Elasticsearch and Solr: I have worked with both search engines, and my experience is that both can be used fluently if you configure them correctly.
The only con I can think of is if you are targeting real-time results and can't compromise on a few milliseconds of delay in your response; in that case it's better to take the help of other NoSQL databases like Cassandra or Couchbase.
Cassandra with Solr works better than Cassandra with Elasticsearch.

How to integrate Hadoop, SOLR and Impala?

I am looking for an example of, or guidance on, how to use Hadoop, SOLR and Impala together. I know how to use Impala and Hadoop, but I also want to use the power of SOLR to make queries run faster. I have explored the web pretty extensively but could not find anything that would put me into action.
Solr and Impala complement each other in that Solr can help you understand your data's structure; in other words, it is useful for initial discovery. From that point on, you can use that knowledge to write Impala queries that target one facet of the data or another.

How can I copy Hadoop data to SOLR

I have a SOLR search which uses a Lucene index as its backend.
I also have some data in Hadoop that I would like to use.
How do I copy this data into SOLR?
Upon googling, the only links I can find tell me how to use an HDFS index instead of a local index in SOLR.
I don't want to read the index directly from Hadoop; I want to copy the data to SOLR and read it from there.
How do I copy it? And it would be great if there were some incremental copy mechanism.
If you have a standalone Solr instance, you could face some scaling issues, depending on the volume of data.
I am assuming high volume, given that you are using Hadoop/HDFS. In that case, you might need to look at SolrCloud.
As for reading from HDFS, here is a tutorial from LucidImagination that addresses this issue and recommends the use of Behemoth.
You might also want to look at the Katta project, which claims to integrate with Hadoop and provide near-real-time read access to large datasets. The architecture is illustrated here.
EDIT 1
Solr has an open ticket for this. Support for HDFS is scheduled for Solr 4.9. You can apply the patch if you feel like it.
You cannot just copy custom data to Solr; you need to index* it. Your data may have any type and format (free text, XML, JSON, or even binary data). To use it with Solr, you need to create documents (flat maps with key/value pairs as fields) and add them to Solr. Take a look at this simple curl-based example.
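In SolrJ terms, a minimal sketch of adding a document (the collection URL and field names are placeholders):

```java
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddToSolr {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            // A Solr document is a flat map of field names to values
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "A record pulled out of HDFS");
            solr.add(doc);  // send the document to Solr
            solr.commit();  // make it visible to searches
        }
    }
}
```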
Note that reading data from HDFS is a different question. For Solr, it doesn't matter where you are reading data from, as long as you provide it with documents.
Storing the index on the local disk or in HDFS is also a different question. If you expect your index to be really large, you can configure Solr to use HDFS; otherwise you can keep the default properties and use the local disk.
* - "Indexing" is the common term for adding documents to Solr, but in fact adding documents to Solr's internal storage and indexing them (making fields searchable) are two distinct things and can be configured separately.

Analytics + Full text search - Big data

I need to implement a system that derives analytics/insights from data (text only) and can also answer complex search queries.
So I have shortlisted Solr (search) and Hadoop (analytics). I am unable to decide which base to start with. Can we integrate an HDFS cluster with Solr? I will mainly be dealing with aggregation queries, and the data will not update frequently.
I know this question is too broad and general; I just need an expert's opinion on this matter.
Look at Cloudera Search and this
Cloudera Search = SOLR + Hadoop
Using Cloudera Search, you can query the data in Hadoop or HBase using SOLR.
