Ingesting data in elasticsearch from hdfs , cluster setup and usage - hadoop

I am setting up a spark cluster. I have hdfs data nodes and spark master nodes on same instances.
Current setup is
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
All instances are same, 16gig dual core (unfortunately).
I have 3 more machines, again same specs.
Now I have three options
1. Just deploy es on these 3 machines. The cluster will look like
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
3-elasticsearch nodes
Deploy es master on 1, extend spark and hdfs and es on all other.
Cluster will look like
1-master (spark and hdfs)
1-master elasticsearch
8-spark workers, hdfs data nodes, es data nodes
My application is heavily use spark for joins, ml etc but we are looking for search capabilities. Search we definitely not needed realtime and a refresh interval of upto 30 minutes is even good with us.
At the same time spark cluster has other long running task apart from es indexing.
The solution need not to be one of above, I am open with experimentation if some one suggest. It would be handy for other dev's also once concluded.
Also I am trying with es hadoop, es-spark project but I felt ingestion is very slow if I do 3 dedicated nodes, its like 0.6 million records/minute.

The optimal approach here mostly depends on your network bandwidth and whether or not it's the bottleneck in your operation in my opinion.
I would just check whether my network links are saturated via say
iftop -i any or similar and check if that is the case. If you see data rates close to the physical capacity of your network, then you could try and run hdfs + spark on the same machines that run ES to save the network round trip and speed things up.
If network turns out not to be the bottleneck here, I would look into the way Spark and HDFS are deployed next.
Are your using all the RAM available (Java Xmx set high enough?, Spark memory limits? Yarn memory limits if Spark is deployed via Yarn?)
Also you should check whether ES or Spark is the bottleneck here, in all likelihood it's ES. Maybe you could spawn additional ES instances, 3 ES nodes feeding 6 spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio, fewer Spark executors and more ES capacity. ES is likely a lot slower at providing the data than HDFS is at writing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated, if so put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)

Related

Cassandra vs HDFS to store analytics data

we have an Apache Spark cluster that analyse data stored in HDFS (.parquet).
The solution is optimal in terms of performance but it's not disaster safe as we would like, indeed, HDFS architecture has a single point of failure (the namenode) even using two namenode (you just have 2 point of failure but it's not enough).
To improve our cluster fault tolerance we would like to move to another data store solution like Cassandra.
Questions are:
With Cassandra as datastore is Spark able to leverage on DataLocality as it do with HDFS?
How this change can affect the performance?
Thanks
Matteo
There's article about data locality, spark and Cassandra, so yes, it is possible:
https://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
I didn't done any performance checks with Spark on HDFS vs Cassandra, and i believe it will vary depending on different workflows, but since Netflix and Microsoft using Cassandra with Spark, i believe performance is acceptable in most cases, and probably is a trade-off between data ingestion speed, existence/nonexistence of ETL and speed of the analytical process.
About hadoop single point of failure - If you will run Cassandra with replication factor 3 and consistency level quorum, you will get same 2 nodes down that will make data unavailable :) , keep it in mind.
And maybe consider MapR hadoop distribution, they've tried to solve namenode problem.

Cloudera 5.4.4 Cluster - Getting aggregate usage metrics

I would like to collect aggregate usage metrics from a Cloudera 5.4.4 Hadoop cluster. Some of the metrics in my mind are as below:
Average CPU utilization of the cluster per day/ per week
Top n longest running jobs/queries on Hadoop
Top n users who use the cluster most (by utilization, by number of submitted jobs)
Cluster disk usage vs disk capacity
Cluster disk usage growth over time
Are there any APIs/resources/tools etc that I could use for starting with this? I don't think I am entirely sure of where to begin from. Any starting point would be greatly appreciated. Also, please do share your experience with cluster usage metrics, if you have had any.
Thanks in advance!
Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage, network usage. You can see Ganglia in action at UC Berkeley Grid.
Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you may easily see the number of bytes written by a particular HDSF datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, time spent in garbage collection and many, many others.
ref- http://hakunamapdata.com/ganglia-configuration-for-a-small-hadoop-cluster-and-some-troubleshooting/
I hope this link (here) may provide some details for 2 and 3.

Datastax hadoop nodes basics

I'm trying to set up some hadoop nodes along with some cassandra nodes in my datastax enterprise cluster. Two things are not clear to me at this point. One, how many hadoop nodes do I need? Is it the same number of cassandra nodes? Does the data still live on the cassandra nodes? Second--the tutorials mention that I should have vnodes disabled on the hadoop nodes. Can I still use vnodes on the cassandra nodes in that cluster? Thank you.
In Datastax Enterprise you run Hadoop on nodes that are also running Cassandra. The most common deployment is to make two datacenters (logical groupings of nodes.) One Datacenter is devoted to analytics and contains your machines which run Hadoop and C* at the same time, the other datacenter is C* only and servers the OLTP function of your cluster. The C* processes on the Analytics nodes are connected to the rest of your cluster (like any other C* node) and receives updates when mutations are written so it is eventually consistent with the rest of your database. The data lives both on these nodes and on the other nodes in your cluster. Again most folks end up having a replication pattern with NetworkTopologyStrategy which specifies several replicas in their C* only DC and a single replica in their Analytics DC but your usecase may differ. The number of nodes does not have to be equal in the two datacenters.
For your second question, yes you can have Vnodes enabled in the C* only datacenter. In addition if your batch jobs are of a signficantly large enough size you could also run vnodes in your analytics datacenterr with only a slight performance hit. Again this is completely based on your use case. If you want many faster shorter analytics jobs you do NOT want vnodes enabled in your Analytics datacenter.

Elasticsearch on Hadoop - Should ES nodes be Colocated with Hadoop DataNodes?

From the Elasticsearch for Hadoop documentation:
Whenever possible, elasticsearch-hadoop shares the Elasticsearch
cluster information with Hadoop to facilitate data co-location. In
practice, this means whenever data is read from Elasticsearch, the
source nodes IPs are passed on to Hadoop to optimize task execution.
If co-location is desired/possible, hosting the Elasticsearch and
Hadoop clusters within the same rack will provide significant network
savings.
Does this mean to say that ideally an Elasticsearch node should be colocated with every DataNode on the Hadoop cluster, or am I misreading this?
You may find this joint presentation by Elasticsearch and Hortonworks useful in answering this question:
http://www.slideshare.net/hortonworks/hortonworks-elastic-searchfinal
You'll note that on slides 33 and 34 they show multiple architectures - one where the ES nodes are co-located on the Hadoop nodes and another where you have separate clusters. The first option clearly gives you the best co-location of data which is very important for managing Hadoop performance. The second approach allows you to tune each separately and scale them independently.
I don't know that you can say one approach is better than the other as there are clearly tradeoffs. Running on the same node clearly minimizes data access latency at the expense of a loss of isolation and ability to tune each cluster separately.

1 big Hadoop and Hbase cluster vs 1 Hadoop cluster + 1 Hbase cluster

Hadoop will run a lot of jobs by reading data from Hbase and writing data to
Hbase. Suppose I have 100 nodes, then there are two ways that I can build my Hadoop/Hbase
cluster:
100 nodes hadoop & hbase cluster (1 big Hadoop&Hbase)
Separate the Database(Hbase), then we have two clusters:
60 nodes Hadoop cluster and 40 nodes Hbase cluster (1 Hadoop + 1 Hbase)
which option is better? Why?
Thanks.
I would say option 2 is better.My reasoning - even though your requirement is mostly of running lots of mapreduce jobs to read and write data out of hbase, there are a lot of things going behind scene for hbase to optimise those reads and write for your submitted jobs. Hmaster will have to do load balancing often , unless your region keys are perfectly balanced. Table hotspotting can be there. For Regionserver, there will be major-compactions and if your jvm skills are not that good, then occasionally Stop the World garbage collection can happen. All the regions may start splitting at the same time. Your regionserver can go down and so on. Moot point is - tuning hbase needs time. If you have just one node dedicated for hbase then probability of aforementioned problems are higher. It's always better to have more than one node, so all the performance pressure doesn't apply to just one node. And by the way , scoring point of hbase is it's inherently distributed nature, you wouldn't want to kill it. All said, you can experiment on the ratio of nodes between hadoop and hbase- May be 70:30 or 80:20. Mileage may vary according to your application requirements.
The main reason to separate HBase and Hadoop is when they have different usage scenarios - i.e. HBAse does random read-write in low latency and Hadoop does sequential batches. In this case the different access patterns can interfere with each other and it can be better to separate the clusters.
If you're just using HBase in batch mode you can use the same cluster (and probably rethink using HBase since it is slower than raw hadoop in batch).
Note that you would need to tune HBase along the lines mentioned by Chandra Kant regardless of the path you take

Resources