Cloudera 5.4.4 Cluster - Getting aggregate usage metrics

I would like to collect aggregate usage metrics from a Cloudera 5.4.4 Hadoop cluster. Some of the metrics I have in mind are:
1. Average CPU utilization of the cluster per day / per week
2. Top n longest-running jobs/queries on Hadoop
3. Top n users who use the cluster most (by utilization, by number of submitted jobs)
4. Cluster disk usage vs. disk capacity
5. Cluster disk usage growth over time
Are there any APIs, resources, or tools that I could use to get started? I am not sure where to begin, so any starting point would be greatly appreciated. Also, please share your experience with cluster usage metrics, if you have any.
Thanks in advance!

Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage, network usage. You can see Ganglia in action at UC Berkeley Grid.
Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you can easily see the number of bytes written by a particular HDFS DataNode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, the time spent in garbage collection, and many others.

I hope this link (here) may provide some details for 2 and 3.
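For items 2 and 3, the aggregation itself is simple once job history has been exported from whatever source the cluster provides. The following is only a minimal sketch: the record layout and the field names `job_id`, `user` and `runtime_s` are hypothetical and do not correspond to any specific Cloudera or Hadoop API.

```python
import heapq
from collections import Counter, defaultdict

# Hypothetical job records; in practice these would be exported from
# the cluster's job history, in whatever form it is available.
jobs = [
    {"job_id": "job_1", "user": "alice", "runtime_s": 5400},
    {"job_id": "job_2", "user": "bob", "runtime_s": 120},
    {"job_id": "job_3", "user": "alice", "runtime_s": 900},
    {"job_id": "job_4", "user": "carol", "runtime_s": 7200},
]


def top_longest_jobs(records, n):
    """Return the n records with the largest runtime, longest first."""
    return heapq.nlargest(n, records, key=lambda r: r["runtime_s"])


def top_users_by_job_count(records, n):
    """Return (user, job_count) pairs for the n most active submitters."""
    return Counter(r["user"] for r in records).most_common(n)


def top_users_by_runtime(records, n):
    """Return (user, total_runtime_s) pairs for the n heaviest users."""
    totals = defaultdict(int)
    for r in records:
        totals[r["user"]] += r["runtime_s"]
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])
```

Total runtime is used here as a stand-in for "utilization"; a real report would more likely weight jobs by the CPU or memory they actually consumed.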


Cassandra vs HDFS to store analytics data

We have an Apache Spark cluster that analyses data stored in HDFS (as .parquet files).
The solution is optimal in terms of performance, but it is not as disaster-safe as we would like: the HDFS architecture has a single point of failure (the NameNode), and even with two NameNodes you just have two points of failure, which is not enough.
To improve our cluster's fault tolerance, we would like to move to another data store such as Cassandra.
The questions are:
With Cassandra as the data store, is Spark able to take advantage of data locality as it does with HDFS?
How would this change affect performance?
There's an article about data locality, Spark and Cassandra, so yes, it is possible.
I haven't done any performance comparison of Spark on HDFS vs. Cassandra, and I believe the result will vary with the workflow. But since Netflix and Microsoft use Cassandra with Spark, I believe performance is acceptable in most cases, and it is probably a trade-off between data ingestion speed, the presence or absence of ETL, and the speed of the analytical process.
About the Hadoop single point of failure: if you run Cassandra with replication factor 3 and consistency level QUORUM, you end up in a similar position, because two nodes down will make data unavailable :) Keep that in mind.
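The arithmetic behind that remark: a quorum is a majority of the replicas, so with replication factor 3 a quorum read or write needs 2 replicas and survives the loss of only one. A minimal sketch of the calculation (the function names are illustrative, not part of any Cassandra client):

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """How many replicas of a piece of data can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


if __name__ == "__main__":
    for rf in (3, 5):
        print(rf, quorum(rf), tolerated_replica_failures(rf))
```

With replication factor 5 the quorum is 3, so two replicas can be lost, at the cost of more storage and more write traffic.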
Also, maybe consider the MapR Hadoop distribution; they have tried to solve the NameNode problem.

Ingesting data into Elasticsearch from HDFS: cluster setup and usage

I am setting up a Spark cluster. I have the HDFS nodes and the Spark nodes on the same instances.
The current setup is:
1-master (Spark and HDFS)
6-Spark workers and HDFS data nodes
All instances are the same: 16 GB RAM, dual core (unfortunately).
I have 3 more machines, again with the same specs.
Now I have three options:
1. Just deploy ES on these 3 machines. The cluster will look like:
1-master (Spark and HDFS)
6-Spark workers and HDFS data nodes
3-Elasticsearch nodes
2. Deploy the ES master on 1 machine, and extend Spark, HDFS and ES across all the others. The cluster will look like:
1-master (Spark and HDFS)
1-master Elasticsearch
8-Spark workers, HDFS data nodes, ES data nodes
My application uses Spark heavily for joins, ML, etc., but we are also looking for search capabilities. We definitely do not need search to be real-time; a refresh interval of up to 30 minutes is fine for us.
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution need not be one of the above; I am open to experimentation if someone has a suggestion. Once settled, it would be handy for other developers as well.
I am also trying the es-hadoop / es-spark project, but ingestion feels very slow with 3 dedicated nodes: about 0.6 million records/minute.
In my opinion, the optimal approach here mostly depends on your network bandwidth and whether or not it is the bottleneck in your operation.
I would first check whether the network links are saturated, using say iftop -i any or similar. If you see data rates close to the physical capacity of your network, then you could try running HDFS + Spark on the same machines that run ES to save the network round trip and speed things up.
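Before measuring, a rough back-of-the-envelope estimate can show whether the reported ingestion rate could plausibly saturate a link at all. In the sketch below only the 0.6 million records/minute figure comes from the question; the 1 KB average record size and the 1 Gbit/s link capacity are assumptions chosen purely for illustration.

```python
def ingest_mbit_per_s(records_per_minute: float, avg_record_bytes: float) -> float:
    """Convert an ingestion rate into megabits per second on the wire (payload only)."""
    bytes_per_s = records_per_minute / 60.0 * avg_record_bytes
    return bytes_per_s * 8 / 1_000_000


def link_utilization(mbit_per_s: float, link_capacity_mbit: float) -> float:
    """Fraction of the link capacity used by the given data rate."""
    return mbit_per_s / link_capacity_mbit


# 0.6 million records/minute is from the question;
# 1,000 bytes per record and a 1,000 Mbit/s link are ASSUMED values.
rate = ingest_mbit_per_s(600_000, 1_000)
utilization = link_utilization(rate, 1_000)

if __name__ == "__main__":
    print(f"{rate:.1f} Mbit/s, {utilization:.0%} of link capacity")
```

Under these assumptions the payload is far below link capacity, which would point away from the network; much larger records or a slower link would change that picture, so the measurement is still worth doing.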
If network turns out not to be the bottleneck here, I would look into the way Spark and HDFS are deployed next.
Are you using all the RAM available? (Is the Java Xmx set high enough? What are the Spark memory limits? The YARN memory limits, if Spark is deployed via YARN?)
You should also check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances, since 3 ES nodes paired with 6 Spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio: fewer Spark executors and more ES capacity. ES is likely a lot slower at indexing the data than HDFS is at providing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated; if so, put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)

Elasticsearch on Hadoop - Should ES nodes be Colocated with Hadoop DataNodes?

From the Elasticsearch for Hadoop documentation:
Whenever possible, elasticsearch-hadoop shares the Elasticsearch
cluster information with Hadoop to facilitate data co-location. In
practice, this means whenever data is read from Elasticsearch, the
source nodes IPs are passed on to Hadoop to optimize task execution.
If co-location is desired/possible, hosting the Elasticsearch and
Hadoop clusters within the same rack will provide significant network savings.
Does this mean to say that ideally an Elasticsearch node should be colocated with every DataNode on the Hadoop cluster, or am I misreading this?
You may find this joint presentation by Elasticsearch and Hortonworks useful in answering this question:
You'll note that on slides 33 and 34 they show multiple architectures - one where the ES nodes are co-located on the Hadoop nodes and another where you have separate clusters. The first option clearly gives you the best co-location of data which is very important for managing Hadoop performance. The second approach allows you to tune each separately and scale them independently.
I don't know that you can say one approach is better than the other as there are clearly tradeoffs. Running on the same node clearly minimizes data access latency at the expense of a loss of isolation and ability to tune each cluster separately.

1 big Hadoop and HBase cluster vs 1 Hadoop cluster + 1 HBase cluster

Hadoop will run a lot of jobs that read data from HBase and write data to HBase. Suppose I have 100 nodes; then there are two ways I can build my Hadoop/HBase cluster:
1. A single 100-node Hadoop & HBase cluster (1 big Hadoop & HBase)
2. Separate the database (HBase), so that we have two clusters: a 60-node Hadoop cluster and a 40-node HBase cluster (1 Hadoop + 1 HBase)
Which option is better? Why?
I would say option 2 is better. My reasoning: even though your requirement is mostly to run lots of MapReduce jobs that read from and write to HBase, a lot goes on behind the scenes for HBase to optimise those reads and writes for your submitted jobs. The HMaster will have to do load balancing often, unless your region keys are perfectly balanced. Table hotspotting can occur. On the RegionServers there will be major compactions, and if your JVM tuning skills are not that good, occasional stop-the-world garbage collection pauses can happen. All the regions may start splitting at the same time. A RegionServer can go down, and so on.
The point is that tuning HBase takes time. If you have just one node dedicated to HBase, the probability of the aforementioned problems is higher. It's always better to have more than one node, so that all the performance pressure doesn't fall on a single node. And by the way, HBase's strong point is its inherently distributed nature; you wouldn't want to kill that. All that said, you can experiment with the ratio of nodes between Hadoop and HBase, maybe 70:30 or 80:20. Mileage may vary according to your application requirements.
The main reason to separate HBase and Hadoop is when they have different usage scenarios, i.e. HBase serves low-latency random reads and writes while Hadoop runs sequential batches. In this case the different access patterns can interfere with each other, and it can be better to separate the clusters.
If you're just using HBase in batch mode, you can use the same cluster (and you should probably rethink using HBase, since it is slower than raw Hadoop in batch).
Note that you would need to tune HBase along the lines mentioned by Chandra Kant regardless of the path you take.

How to use HBase and Hadoop to serve live traffic AND perform analytics? (Single cluster vs separate clusters?)

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.
Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.
I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.
Now my question is: how are those two different use cases reconciled (serving live traffic and doing batch analytics)? Do organizations use systems to write all data in two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and others do only analytics?
What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data onto the nodes that are used for analytics (raising the replication factor above the default of 3 probably makes sense in such a scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to a different rack, so this seems to mesh well with what I'm describing.
The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.
Does that kind of solution make sense? I am thinking it could be simpler to have one cluster than two, but would this be significantly riskier? Are there known cases of companies using an HBase cluster to serve live traffic while also running batch analytics jobs on it?
I'd love to get your opinions on this :) !
EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?
Your approach has a few problems. Even in rack-aware mode, if you have more than a few racks, I don't see how you can guarantee that your data will be replicated onto those nodes. If you lose one of your "live" nodes, you will be under-replicated for a while and won't have access to that data.
HBase is greedy in terms of resources and I've found it doesn't play well with others (in terms of memory and CPU) in high load situations. You mention, too, that heavy analytics can impact live performance, which is also true.
In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest that, instead of having hybrid nodes, you just dedicate some nodes to your Hadoop cluster and some nodes to your HBase cluster. The network transfer between the two will be quite snappy.
Just my personal experience so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!
I think this kind of solution might make sense, since MR is mostly CPU-intensive and HBase is a memory-hungry beast. What we do need is to arrange resource management properly. I think it is possible in the following way:
a) CPU. We can define the maximum number of MR mapper/reducer slots per node, and assuming that each mapper is single-threaded, we can limit the CPU consumption of MR. The rest will go to HBase.
b) Memory. We can limit the memory for mappers and reducers and give the rest to HBase.
c) I think we cannot properly manage HDFS bandwidth sharing, but I do not think it should be a problem for HBase, since disk operations are not on its critical path.
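Points a) and b) amount to a simple per-node budget: cap the MR slots and the heap per task, and whatever is left goes to HBase. A minimal sketch of that calculation follows; the function name, the 2 GB reserved for the operating system, and the node sizes in the example are all hypothetical values, not recommendations.

```python
def hbase_budget(node_ram_gb, node_cores, map_slots, reduce_slots,
                 task_heap_gb, os_reserved_gb=2.0):
    """Estimate the RAM and cores left for HBase on a node shared with MR.

    Assumes one single-threaded task per slot, as in point a) above.
    """
    task_slots = map_slots + reduce_slots
    mr_ram_gb = task_slots * task_heap_gb
    ram_left = node_ram_gb - os_reserved_gb - mr_ram_gb
    cores_left = node_cores - task_slots
    if ram_left < 0 or cores_left < 0:
        raise ValueError("MR slots exceed the capacity of the node")
    return {"hbase_ram_gb": ram_left, "hbase_cores": cores_left}


# Hypothetical node: 64 GB RAM, 16 cores, 6 map + 2 reduce slots, 2 GB heap per task.
budget = hbase_budget(64, 16, map_slots=6, reduce_slots=2, task_heap_gb=2)

if __name__ == "__main__":
    print(budget)
```

This only covers CPU and memory; as point c) notes, disk and HDFS bandwidth are not captured by such a budget.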
