I just started digging into Elasticsearch and Hadoop, and I am a bit lost about these two concepts. I found that Elasticsearch is 'always' (probably biased by my limited knowledge) discussed together with the Hadoop ecosystem (HDFS, Spark, HBase, Hive, etc.). At first I thought Elasticsearch was part of the Hadoop ecosystem, but it looks like I was wrong.
If I have the task of implementing a search engine, it seems enough to have only Elasticsearch for indexing and storing the data. Would there be any reason to also leverage Hadoop for this task? And if we use both HDFS and Elasticsearch to store the data, does that mean the data would be physically duplicated in two formats (one for HDFS and one for Elasticsearch)?
Elasticsearch is a distributed, full-text search engine that works on its own. If you want to use it as a search engine, you can use it standalone. There is no direct relation between Elasticsearch and Hadoop, but you can use them together: if you are already using Hadoop and want to add search capabilities to your data, you can index your data into Elasticsearch and query it from Hadoop. There is a product for exactly that purpose: ES-Hadoop.
Elasticsearch's strength is search: if all you want to do is implement a search engine, you can stick with that. Where the power of something like Spark and/or Hadoop comes in is when you need to do large aggregations or calculations on result sets on the order of ~100k records or more. This is where Elasticsearch will slow down (depending on your cluster sizing and specifications). For advanced analytics, aggregations, and machine-learning tasks, I would leverage Spark (for its speed), do those jobs there, and feed the output back to Elasticsearch to visualize it with Kibana or some other utility.
Related
In a high-availability environment, how can these technologies replicate Lucene data? How could I replicate my Lucene directories, given that today I do not use any such technologies?
That question is probably too broad to answer in a useful way, but in general you have two options:
Index the document to a master node, then replicate the index files that have changed to all other nodes. These are usually known as master/slave setups. The first versions of Solr used rsync to do this - that way Solr didn't have to know anything about replication itself. Later versions used HTTP to replicate the index files instead. If you already have a Lucene index that you want to make available on more nodes, this is the easiest solution that doesn't require fundamental changes to your project.
Distribute the document that's going to be added to the index to all known replicas of that index/shard. The indexing process happens on each node, and the document is distributed to the node before it has been added to the index. This is (simplified) what happens when Solr runs in cloud / cluster mode (and is what ES does as well IIRC). There's also transaction logs etc. involved here to make it more resilient to failure across nodes.
So either distribute the updates themselves or distribute the updated index.
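To make the second option concrete: both SolrCloud and Elasticsearch route each document to a shard by hashing its ID, then forward the write to every replica of that shard. A minimal Python sketch of that routing idea follows; the node names and the use of md5 are illustrative only (Elasticsearch actually uses murmur3, and the real systems also handle transaction logs, retries, and failover):

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard by hashing the document ID (md5 here for illustration)."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_shards

def distribute(doc_id: str, replicas_by_shard: dict) -> list:
    """Return the nodes that must index this document: every replica of its shard."""
    shard = shard_for(doc_id, len(replicas_by_shard))
    return replicas_by_shard[shard]

# Hypothetical cluster: two shards, each with replicas on two nodes.
cluster = {0: ["node-a", "node-b"], 1: ["node-c", "node-d"]}
targets = distribute("doc-42", cluster)
```

The point is that the document itself (not the index files) travels to every replica, and each node runs its own indexing process, which is why this approach tolerates node failures better than copying index files around.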
We've been working with Elasticsearch 2.x for quite a while. Everything meets our requirements perfectly except for one weak point: the performance of writing/indexing to the Elasticsearch cluster is not very good.
In our case, we have an 8-node ES cluster, and the indices we are putting into ES are ~100 fields wide. The indexing rate is around 50,000 documents per minute, which is way too slow for our scenario. We've tried all the tuning methods recommended by www.elastic.co. The fastest way we've found is to construct the JSON payloads as files, then dump them into ES using the bulk API. But still, the indexing pace is just too slow.
I've seen the ES-Hadoop connector, and Elasticsearch also has Spark support where you can use saveToES() to save an RDD to ES. I suspect they all use the ES bulk API underneath. Can anyone share some experience with them? What is the fastest way of writing indices in Elasticsearch?
No matter what third-party tool you use outside ES, everything needs to use the ES ways of putting data in. Spark, Logstash, and your own app all need to use the bulk or index API in one way or another. There's no backdoor magic here.
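For reference, the bulk API that everything funnels through takes a newline-delimited (NDJSON) body of alternating action and source lines. A minimal standard-library sketch of building such a payload is below; the index name and documents are made up, and a real client (e.g. the official client's bulk helpers) additionally handles chunking and retries for you:

```python
import json

def build_bulk_body(index: str, docs: list) -> str:
    """Build an NDJSON bulk body: one action line, then one source line per doc."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline

body = build_bulk_body("events", [("1", {"user": "kim"}), ("2", {"user": "lee"})])
# POST this body to /_bulk with Content-Type: application/x-ndjson
```

Batching many documents per bulk request (and sending bulks concurrently from several workers) is usually where the indexing throughput gains come from, not from which tool constructs the payload.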
I am working with Elasticsearch and Kibana for analyzing data.
Does anybody know whether it is possible to cluster data based on ... (whatever) in Elasticsearch or Kibana?
Clustering, classification, or grouping.
For example, like machine learning: give it some samples, and then it can understand the trend for other data.
Thanks
Elasticsearch-Hadoop contains an Elasticsearch Spark connector. You could run a Spark job over Elasticsearch data and use Spark MLlib for the machine-learning part.
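To illustrate what "clustering" means in this context: an algorithm such as k-means groups points by their distance to cluster centers. Spark MLlib ships a distributed KMeans for doing this at scale; the toy single-machine sketch below shows the same idea in pure Python on made-up 1-D data, with fixed starting centers for determinism:

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data: assign each point to its nearest center, then recenter."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups, around 1 and around 10.
points = [0.9, 1.1, 1.0, 9.8, 10.2, 10.0]
centers, clusters = kmeans_1d(points, [0.0, 5.0])
```

With Spark MLlib you would feed vectors built from Elasticsearch documents into its KMeans instead, and write the cluster assignments back to ES so Kibana can visualize them.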
We in our organization are trying to develop some competency around big data, Hadoop, and the related ecosystem.
We are thinking of doing a proof of concept whose objective would be to store, index, and search a large set of PDF files, email documents, and Word documents. First of all, I would like to know: is this a big data use case?
If it is, is it a Hadoop use case? And if so, which technologies should we go for?
We tried storing PDFs in HDFS and successfully created Lucene indexes via mapper jobs in parallel, storing the indexes in each data node's local temporary directory.
But we are not sure whether we are doing it the right way or how to make it a proper big data/Hadoop use case, and we are struggling to decide on a tech stack: Hadoop, a NoSQL DB, Elasticsearch, SOLR, etc.
Our objective is to do a proof of concept around searching a large set of documents in different formats, and we want to use Hadoop if possible. Can anybody please point us in the right direction?
Thanks
If you are not going to do any analysis on the files stored in HDFS, Hadoop might not be the right choice for you. If you have unstructured or semi-structured data and want to crunch it into tables for future analysis, you can use HDFS with Hive/Pig to extract it. You might not need NoSQL unless you want high availability or strong consistency, which in your case I don't think you do.
I have started creating my new search application. In my earlier application I used Apache Solr, and now I want to know which is better in terms of performance and usability.
Personally, I want to see a performance benchmark of Elasticsearch and Solr. If there are other alternatives, suggestions are most welcome.
Disclaimer: I work at elasticsearch.com
I would just say: give Elasticsearch a try. I think that after a few hours (minutes?), you will have formed an opinion.
Start 2, 3, or 4 nodes, and you will see how nicely things are rebalanced.
About performance, I'd say that Elasticsearch will give you constant query throughput even while you are doing massive indexing operations.
I have used both quite a bit, and much prefer ElasticSearch. The API is more flexible and accessible. It is easier to get started with. Replication happens automatically by default. In general all the defaults are easier to work with. Everything generally works out of the box (safe defaults) and you only need to tune what you find needs to work better.
I have not worked much with SOLR 4, only with 3.x. Once I switched I never looked back, but I hear that there are many improvements in 4 with regard to replication and clustering that make it a viable competitor.
With regard to performance, I think they are generally comparable, as they both rely on Lucene. That is also why there is a lack of valid benchmarks making this general comparison. That said, there are certainly use cases where one will perform better than the other.
If you look at usage trends, while there are many more people currently using SOLR, it is in decline, and that decline correlates strongly with the rise in Elasticsearch users, which is very much on the upswing. As Dadoonet said, give ElasticSearch a try; it won't take long, and you won't want to use SOLR again.
UPDATE
I just spent two weeks on a client site consulting on a SOLR Cloud installation. I am now much more familiar with the updates to SOLR and can say quite confidently that I still prefer ElasticSearch, although it seems SOLR has some momentum again.
ElasticSearch is hands down more elastic. That is, having an elastic cluster where nodes come and go, or even where you just need to add nodes, is much, much easier in ElasticSearch than in SOLR. Anyone who tells you it is easy in SOLR has not done it in ElasticSearch. An ElasticSearch node will automatically join a cluster and assume an active role in it, taking over serving the available shards and replicas. Over the last week I decommissioned a 2-node cluster, replacing it with two new nodes: I simply added the 2 new nodes and, one at a time, marked the other two nodes as non-data nodes. Once the shard migration completed, I decommissioned the old nodes. I had set minimum_master_nodes = 2 ((2/2)+1) and had no issue with split brain.
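The minimum_master_nodes arithmetic above is the standard quorum formula, floor(N/2) + 1 over the number of master-eligible nodes; requiring more than half of them to agree is what prevents split brain. As a quick sketch:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum: strictly more than half of the master-eligible nodes must be present."""
    return master_eligible // 2 + 1
```

So 2 master-eligible nodes need a quorum of 2, 3 need 2, 4 need 3, and 5 need 3; two disjoint halves of the cluster can never both reach quorum.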
During the same week, I had to add a node to a SOLR cluster. The process was poorly documented, especially considering the changes from 4.1 to 4.3 and the mishmash of existing documentation, much of which, based on old versions of SOLR, says you can't even do it. I finally found documentation that clarified the process: it requires manually adding a core to the collection and then adding replicas to existing shards within the cluster. Finally, you manually decommission the redundant shards on some other node. At some point the new node may become master for one of those shards, but not immediately.
With SOLR, if you do not have sufficient shards to distribute, you can just add replicas, or you can go through a shard split to create two new shards. Again, this is a poorly documented feature, but it is functionality that does not exist in ElasticSearch. You must split and then remove the original shard, something none of the documentation clearly explains.
SolrCloud has a couple of other advantages as well if you are integrating with Hadoop. If you are indexing data in HDFS or HBase, there are now both MapReduce and real-time methods of ingesting data into SOLR. This provides some real power to your Big Data platform and allows you to do full-text search over data that is otherwise barely accessible.
While you can index Hadoop data into ElasticSearch, the implementation is not as clean as the SolrCloud/Cloudera Search implementations. Having MapReduce directly build the shards is a far superior solution with significant performance benefits. Reducers talking directly to a cluster works, but it is not the same. I do not know whether anything similar to the Lily connector for HBase exists for ElasticSearch; if not, I may look into writing one. That connector allows indexing directly from the HBase replication logs.
So, in summary, there are certainly situations where either is beneficial. If you are looking for tight integration with Hadoop, SOLR (Cloudera Search specifically) is a good option. If you are looking for ease in managing an elastic cluster, ElasticSearch will be a much better option. For me, I'll continue with my hacky Hadoop integrations to make things work with ElasticSearch until something better emerges.