I am working with Elasticsearch and Kibana for analyzing data.
Does anybody know whether it is possible to cluster data based on ...(whatever) in Elasticsearch or Kibana?
Clustering, classification, or grouping.
For example, something like machine learning: give it some samples, and it can then recognize the trend in the rest of the data.
Thanks
Elasticsearch Hadoop (ES-Hadoop) contains an Elasticsearch Spark connector. You could run a Spark job on Elasticsearch data and use Spark MLlib for the machine learning part.
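As a minimal sketch of that pipeline, assuming a hypothetical "metrics" index with numeric cpu and memory fields and the elasticsearch-spark connector on the classpath (the index name, field names, cluster address, and k are all placeholders to adapt):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.elasticsearch.spark._ // adds esRDD to SparkContext

    object EsKMeansSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("es-kmeans")
          .set("es.nodes", "localhost") // point at your ES cluster
        val sc = new SparkContext(conf)

        // Each document comes back as (docId, Map[fieldName -> value]).
        val docs = sc.esRDD("metrics")

        // Turn two (hypothetical) numeric fields into MLlib feature vectors.
        val features = docs.values.map { fields =>
          Vectors.dense(
            fields("cpu").toString.toDouble,
            fields("memory").toString.toDouble)
        }.cache()

        // Cluster into 3 groups over at most 20 iterations; tune for your data.
        val model = KMeans.train(features, 3, 20)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }

KMeans is just one example here; any MLlib algorithm can consume the same RDD of vectors.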
Related
Why is the Elasticsearch ingest pipeline not efficient for high volumes of data?
What are the factors that make ingestion inefficient? I mean, if a tool is built solely to optimise searching, shouldn't that same tool ship with a powerful ingestion pipeline?
How does Logstash solve this issue?
I am working with Apache Zeppelin for data visualization, and I am able to visualize data from different sources such as HAWQ.
Now the requirement is to visualize data on a near real-time basis; e.g., I need to visualize a table in HAWQ every 15 seconds. I don't see such a feature in Zeppelin. If anyone has implemented this kind of use case on Zeppelin, please share some pointers.
I just started digging into Elasticsearch and Hadoop, and I am a bit lost about these two concepts. I found Elasticsearch is 'always' (probably biased by my limited knowledge) mentioned together with the Hadoop ecosystem (HDFS, Spark, HBase, Hive, etc.). At first, I thought Elasticsearch was part of the Hadoop ecosystem, but it looks like I was wrong.
If I have the task of implementing a search engine, it seems enough to have only Elasticsearch for indexing and storing the data. Then is there any reason to leverage Hadoop for this task? If we use both HDFS and Elasticsearch to store the data, does this mean the data would be physically stored twice, in two formats (one for HDFS and one for Elasticsearch)?
Elasticsearch is a distributed, full-text search engine. It works on its own. If you want to use it as a search engine, you can use it standalone. There is no direct relation between Elasticsearch and Hadoop, but you can use them together. If you are already using Hadoop and want to add search capabilities over your data, you can index your data into Elasticsearch and query it from Hadoop. There is a product for that purpose: ES-Hadoop.
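As a rough illustration of that round trip using the connector's Spark API (the "products" index, the documents, and the query string below are made up for the example):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._ // adds saveToEs and esRDD

    val conf = new SparkConf()
      .setAppName("es-hadoop-roundtrip")
      .set("es.nodes", "localhost:9200") // point at your ES cluster
    val sc = new SparkContext(conf)

    // Index documents produced on the Hadoop side into a hypothetical
    // "products" index.
    val products = Seq(
      Map("name" -> "laptop", "price" -> 999),
      Map("name" -> "monitor", "price" -> 249))
    sc.makeRDD(products).saveToEs("products")

    // Query it back: the search runs inside Elasticsearch, and only the
    // matching documents are shipped to Spark.
    sc.esRDD("products", "?q=name:laptop").collect().foreach(println)

Both saveToEs and esRDD come from the implicit conversions in the org.elasticsearch.spark package.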
Elasticsearch's strength is search. If all you want to do is implement a search engine, you can stick with that. Where the power of something like Spark and/or Hadoop comes in is when you need to do large aggregations or calculations over records or result sets on the order of ~100k or more; this is where Elasticsearch will slow down (depending on your cluster sizing and specifications). For advanced analytics, aggregations, and machine learning tasks, I would leverage Spark (for its speed), do those jobs there, and feed the output back to Elasticsearch to visualize it with Kibana or some other utility.
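To sketch that division of labor, assuming a large "events" index with user, timestamp, and duration fields and a "daily-summary" output index for Kibana (all of those names are assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.elasticsearch.spark.sql._ // adds saveToEs to DataFrames

    val spark = SparkSession.builder()
      .appName("heavy-aggregation")
      .config("es.nodes", "localhost:9200") // point at your ES cluster
      .getOrCreate()

    // Pull the (hypothetical) large "events" index into Spark and do the
    // heavy aggregation there rather than in Elasticsearch.
    val events = spark.read
      .format("org.elasticsearch.spark.sql")
      .load("events")

    val summary = events
      .groupBy(col("user"), to_date(col("timestamp")).as("day"))
      .agg(count("*").as("hits"), avg(col("duration")).as("avg_duration"))

    // Feed the much smaller result back to Elasticsearch so Kibana can
    // visualize it.
    summary.saveToEs("daily-summary")

The point is that Elasticsearch only has to store and serve documents, while the expensive group-by runs in Spark.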
From the Elasticsearch for Hadoop documentation:
Whenever possible, elasticsearch-hadoop shares the Elasticsearch cluster information with Hadoop to facilitate data co-location. In practice, this means whenever data is read from Elasticsearch, the source nodes IPs are passed on to Hadoop to optimize task execution. If co-location is desired/possible, hosting the Elasticsearch and Hadoop clusters within the same rack will provide significant network savings.
Does this mean that, ideally, an Elasticsearch node should be co-located with every DataNode in the Hadoop cluster, or am I misreading this?
You may find this joint presentation by Elasticsearch and Hortonworks useful in answering this question:
http://www.slideshare.net/hortonworks/hortonworks-elastic-searchfinal
You'll note that on slides 33 and 34 they show multiple architectures - one where the ES nodes are co-located on the Hadoop nodes and another where you have separate clusters. The first option clearly gives you the best co-location of data which is very important for managing Hadoop performance. The second approach allows you to tune each separately and scale them independently.
I don't know that you can say one approach is better than the other, as there are clearly tradeoffs. Running on the same nodes minimizes data access latency, at the expense of isolation and the ability to tune each cluster separately.
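To make the tradeoff concrete, the elasticsearch-hadoop connector configuration is about the only thing that changes between the two layouts. A sketch (the host names are placeholders; es.nodes, es.port, and es.nodes.wan.only are standard connector settings):

    import org.apache.spark.SparkConf

    // Layout 1: ES data nodes co-located with the Spark/Hadoop workers.
    // The connector discovers the nodes and reads shards locally.
    val colocated = new SparkConf()
      .set("es.nodes", "localhost")
      .set("es.port", "9200")

    // Layout 2: a separate Elasticsearch cluster. Node discovery is turned
    // off so all traffic goes through the declared endpoint, which is the
    // usual setup when the clusters are scaled and tuned independently.
    val separate = new SparkConf()
      .set("es.nodes", "es-cluster.example.com:9200") // hypothetical host
      .set("es.nodes.wan.only", "true")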
As we have learnt, Hadoop is meant for batch processing of data. If we want to do some trending based on the results produced by Hadoop MapReduce jobs, what is the best way? How can we retrieve MapReduce results for trending?
Can HBase be used here? If so, does HBase have full filtering and aggregation capabilities over the data stored in it?
Thanks
MRK
While there is no perfect solution in the Hadoop world for this problem, there are a few approaches to solving this kind of problem:
a) Produce an "on-demand data mart" using MR, load it into an RDBMS, and run your queries in real time. This can work if the data subset is much smaller than the whole data set.
b) Use an MPP database integrated with Hadoop. For example, Greenplum HD has an MPP database pre-integrated with Hadoop.
c) Use a more lightweight processing framework: Spark. It will have much lower latency, but expect your data sets to be comparable in size to available RAM.
You probably want to look at Hive.
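For instance, if the MapReduce output is registered as a Hive table, trending becomes a plain SQL query. A sketch using Spark's Hive support, to stay consistent with the examples above (the page_views table and its columns are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("trending")
      .enableHiveSupport() // reads tables from the Hive metastore
      .getOrCreate()

    // Assuming the MR job's output is a Hive table page_views(page, views, dt):
    // top pages over the last 7 days.
    val trending = spark.sql("""
      SELECT page, SUM(views) AS total_views
      FROM page_views
      WHERE dt >= date_sub(current_date(), 7)
      GROUP BY page
      ORDER BY total_views DESC
      LIMIT 10
    """)
    trending.show()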