Throughput for Kafka, Spark, Elasticsearch Stack on GCP/Dataproc

I'm working on a research project where I installed a complete data analysis pipeline on Google Cloud Platform. We estimate unique visitors per URL in real time using HyperLogLog on Spark. I used Dataproc to set up the Spark cluster. One goal of this work is to measure the throughput of the architecture depending on the cluster size. The Spark cluster has three nodes (the minimal configuration).
The data stream is simulated with custom data generators written in Java, using the Kafka producer API. The architecture looks as follows:
Data generators -> Kafka -> Spark Streaming -> Elasticsearch.
The problem is: as I increase the number of events produced per second by my data generators beyond ~1000 events/s, the input rate in my Spark job suddenly collapses and begins to vary a lot.
As you can see in the screenshot from the Spark Web UI, the processing times and scheduling delays remain consistently short while the input rate goes down.
Screenshot from Spark Web UI
I tested it with a completely trivial Spark job that only does a simple mapping, to rule out causes like slow Elasticsearch writes or problems with the job itself. Kafka also seems to receive and forward all the events correctly.
Furthermore, I experimented with the Spark configuration parameters
spark.streaming.kafka.maxRatePerPartition and spark.streaming.receiver.maxRate,
with the same result.
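For reference, here is a sketch of how such rate limits might be passed at submit time (the jar name and values are placeholders, not recommendations; note that spark.streaming.kafka.maxRatePerPartition applies to the direct Kafka approach, while spark.streaming.receiver.maxRate only affects receiver-based streams):

```shell
# Illustrative spark-submit invocation; jar name and values are placeholders.
spark-submit \
  --conf spark.streaming.kafka.maxRatePerPartition=1000 \
  --conf spark.streaming.receiver.maxRate=2000 \
  --conf spark.streaming.backpressure.enabled=true \
  streaming-job.jar
```

spark.streaming.backpressure.enabled (available since Spark 1.5) lets Spark adapt the ingest rate automatically instead of relying on a fixed cap.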
Does anybody have an idea what is going wrong here? It really seems to be caused by the Spark job or Dataproc... but I'm not sure. All CPU and memory utilization seems to be okay.
EDIT: Currently I have two Kafka partitions on that topic (placed on one machine). But even with only one partition, Kafka should handle more than 1500 events/s. The problem was also present with one partition at the beginning of my experiments. I use the direct approach with no receivers, so Spark reads concurrently from the topic with two worker nodes.
EDIT 2: I found out what causes this bad throughput. I forgot to mention one component in my architecture. I use one central Flume agent to log all the events from my simulator instances via log4j over netcat. This Flume agent is the cause of the performance problem! I changed the log4j configuration to use asynchronous loggers (https://logging.apache.org/log4j/2.x/manual/async.html) via the disruptor. I scaled the Flume agent up to more CPU cores and RAM and changed the channel to a file channel. But it still performs badly. No effect... any other ideas how to tune Flume performance?
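For what it's worth, a sketch of what such a Log4j 2 setup might look like (host, port, and logger names here are placeholders; the all-async mode additionally requires the LMAX Disruptor jar on the classpath and -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector, while the per-logger variant below works without the context selector):

```xml
<!-- log4j2.xml sketch: per-logger async via <AsyncLogger>; names are placeholders -->
<Configuration status="warn">
  <Appenders>
    <!-- Socket appender shipping events to the central Flume agent -->
    <Socket name="Flume" host="flume-agent" port="44444" protocol="TCP">
      <PatternLayout pattern="%m%n"/>
    </Socket>
  </Appenders>
  <Loggers>
    <!-- includeLocation="false" avoids expensive stack walks on the hot path -->
    <AsyncLogger name="simulator" level="info" includeLocation="false">
      <AppenderRef ref="Flume"/>
    </AsyncLogger>
    <Root level="warn">
      <AppenderRef ref="Flume"/>
    </Root>
  </Loggers>
</Configuration>
```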

Hard to say given the sparse information. I would suspect a memory issue: at some point, the servers may even start swapping. So, check the JVM memory utilization and swapping activity on all servers. Elasticsearch should be capable of handling ~15,000 records/second with little tweaking. Check the free and committed RAM on the servers.

As I mentioned before, CPU and RAM utilization are totally fine. I found a "magic limit"; it seems to be exactly 1500 events per second. As soon as I exceed this limit, the input rate immediately begins to wobble.
The mysterious thing is that processing times and scheduling delays stay constant. So one can rule out backpressure effects, right?
The only thing I can guess is a technical limit with GCP/Dataproc... I didn't find any hints in the Google documentation.
Some other ideas?

Related

Spark Jobs on Yarn | Performance Tuning & Optimization

What is the best way to optimize Spark jobs deployed on a YARN-based cluster?
I'm looking for changes based on configuration, not at the code level. Mine is a classic design-level question: what approach should be used to optimize jobs developed with Spark Streaming or Spark SQL?
There is a myth that Big Data is magic and that your code will work like a dream once deployed to a Big Data cluster. Every newbie has the same belief :) There is also a misconception that the configurations given on web blogs will work fine for every problem.
There is no shortcut to optimizing or tuning jobs on Hadoop without understanding your cluster deeply. But following the approach below, I'm certain you'll be able to optimize your job within a couple of hours.
I prefer to apply a pure scientific approach to optimizing jobs. The following steps can serve as a baseline:
Understand the block size configured on the cluster.
Check the maximum memory limit available per container/executor.
Understand the vCores available to the cluster.
Optimize the data rate, specifically in the case of Spark Streaming real-time jobs. (This is the trickiest part in Spark Streaming.)
Consider the GC settings while optimizing.
There is always room for optimization at the code level; that needs to be considered as well.
Control the block size optimally based on the cluster configuration from step 1 and the data rate. In Spark it can be calculated as batchInterval / blockInterval.
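To make the batchInterval / blockInterval relation above concrete, a quick sketch with illustrative numbers (200 ms is the documented default for spark.streaming.blockInterval; this applies to receiver-based streams, where each block becomes one partition, and hence one task, of the batch's RDD):

```python
# Blocks (and hence tasks) generated per batch = batchInterval / blockInterval.
batch_interval_ms = 2000  # illustrative batch interval
block_interval_ms = 200   # spark.streaming.blockInterval default

blocks_per_batch = batch_interval_ms // block_interval_ms
print(blocks_per_batch)  # 10 partitions per batch with these values
```

Too few blocks per batch leaves cores idle; too many creates task-scheduling overhead, which is why the two intervals need to be tuned together.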
Now come the most important steps. The knowledge I'm sharing here is specific to real-time use cases like Spark Streaming and Spark SQL with Kafka.
First of all, you need to know at what number of messages/records per batch your job works best. Then you can cap the rate at that particular number and start configuration-based experiments to optimize the job. That is what I did, and I was able to resolve a performance issue while keeping high throughput.
I read some of the parameters from the Spark configuration documentation, checked their impact on my jobs, made a grid of them, and ran the same job with five different configuration versions. Within three experiments I was able to optimize my job. The configuration highlighted in green in that grid was the magic formula for my job's optimization.
Although the same parameters might be very helpful for similar use cases, these parameters obviously don't cover everything.
Assuming that the application works, i.e. memory configuration is taken care of and we have at least one successful run of the application, I usually look for underutilization of the executors and try to minimize it. Here are common questions worth asking to find opportunities for improving cluster/executor utilization:
How much of the work is done in the driver vs. the executors? Note that while the main Spark application thread is in the driver, the executors are killing time.
Does your application have more tasks per stage than the number of cores? If not, those cores will not be doing anything during that stage.
Are your tasks uniform, i.e. not skewed? Since Spark moves computation from stage to stage (except for some stages that can run in parallel), it is possible for most of your tasks to complete while the stage is still running, because one skewed task is still held up.
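The second question above comes down to simple arithmetic; a sketch with made-up numbers:

```python
# If a stage has fewer tasks than the cluster has cores, the surplus cores
# sit idle for the duration of the stage.
total_cores = 16       # illustrative: 4 executors x 4 cores each
tasks_in_stage = 10    # illustrative task count for one stage

idle_cores = max(0, total_cores - tasks_in_stage)
print(idle_cores)  # 6 cores doing nothing during this stage
```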
Shameless plug (I'm the author): Sparklens (https://github.com/qubole/sparklens) can answer these questions for you, automatically.
Some things are not specific to the application itself. Say your application has to shuffle lots of data: pick machines with better disks and network. Partition your data to avoid full data scans. Use columnar formats like Parquet or ORC to avoid fetching data for columns you don't need all the time. The list is pretty long, and some problems are known but don't have good solutions yet.

Tools to test a BigData Pipeline end to end?

I have this pipeline: Webserver+rsyslog->Kafka->Logstash->ElasticSearch->Kibana
I have found these tools to help test my pipeline:
Generate webserver load by spinning up jmeter EC2 instances with jmeter-ec2
Generate load on Kafka and help graph throughput with Sangrenel
I am wondering if anyone had any other suggestions for testing components or end-to-end testing? Thanks.
Great question! I am looking for something similar but may settle on a simple home solution.
Set up a Storm cluster with bolts writing data to Kafka. One thing to watch out for is the id/key, so that your messages are distributed across multiple partitions. The point of Storm is to have a distributed set of publishers. As an alternative to Storm, you can run multiple producers with, say, a KafkaAppender.
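On the id/key point: Kafka's default partitioner hashes the message key (murmur2) modulo the partition count, so a constant key pins every message to one partition while varied keys spread the load. A rough sketch of the idea (plain Python hash() stands in for murmur2 here, so this only approximates Kafka's actual placement):

```python
def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's default key-hash partitioner (which uses murmur2).
    return hash(key) % num_partitions

# A constant key sends every message to the same partition...
same = {partition_for(b"constant-id", 6) for _ in range(100)}
# ...while distinct keys spread messages across the partitions.
spread = {partition_for(("host-%d" % i).encode(), 6) for i in range(100)}
print(len(same), len(spread))
```

With a constant key, a load test exercises only one partition, which can make Kafka look much slower than it is.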
Once you know your Kafka performance, connect Logstash to the loaded topic and let it drain as fast as possible. You may find some useful information with KafkaManager or by connecting to JMX (there are many tools for that).
The simplest way to monitor Elasticsearch is Marvel.
The performance of Kibana depends on the amount of data your query returns, but the smallest refresh interval is still 5 seconds.
In my experience, Logstash performance depends on data size and grok complexity. The performance of Elasticsearch is mostly about cluster size and shard/template configuration. The fastest component in your setup will always be Kafka (bounded by ack and ZooKeeper settings).
Also, if you control data generation, you can compare the time a record was generated with the @timestamp set by Logstash and measure the lag.
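A minimal sketch of that lag measurement, assuming both the generator and Logstash emit ISO-8601 UTC timestamps (the field values below are made up):

```python
from datetime import datetime

ISO_FMT = "%Y-%m-%dT%H:%M:%S.%f"

def lag_seconds(generated_at: str, logstash_timestamp: str) -> float:
    """Seconds between a record's own generation time and the @timestamp
    Logstash assigned to it; both strings assumed ISO-8601 UTC."""
    t_gen = datetime.strptime(generated_at, ISO_FMT)
    t_ls = datetime.strptime(logstash_timestamp, ISO_FMT)
    return (t_ls - t_gen).total_seconds()

print(lag_seconds("2016-05-01T12:00:00.000", "2016-05-01T12:00:01.500"))  # 1.5
```

Plotting this lag over time while ramping up the generators shows exactly where the pipeline starts falling behind.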

How to profile map reduce jobs on HBase

I have a MapReduce job which runs over an HBase table. It scans the HBase table after applying some scan filters and does some processing.
The job is taking a long time, definitely much longer than expected, and the performance deterioration feels exponential (i.e., the first 90% of the mappers complete much faster than the rest, and after about 98% it seems stuck in eternity, like the limbo in the movie Inception).
At a high level there should be no reason for this uneven performance, since each row in the scan is expected to behave similarly, and the downstream service should have similar SLAs for every row of the HBase table.
How do I debug and profile this job? Are there any tools out there that would help me instrument the system and pinpoint the misbehaving component?
There are a couple of ways to monitor and debug jobs like this.
The first is to look at the logs for the RegionServers, DataNodes, and TaskTrackers and try to find any error messages. The JobTracker also contains a per-task breakdown of performance; you can check whether any tasks are failing or getting killed, along with messages as to why. That's the easiest, most straightforward place to start.
In my experience, slow MapReduce jobs with HBase indicate uneven key distribution across your regions. For TableInputFormats, the default split is one mapper per region; if one of your regions contains an uneven number of the rows you are accessing, or if a particular RegionServer hosts several regions being read by several mappers, that could cause slowdowns on the machine because of disk contention or network I/O.
For profiling the RegionServers, you can take a look at JProfiler; it is mentioned in the HBase wiki as the profiler they use. I've never used it, but it does have a probe for HBase. Standard CPU load via uptime or top, and I/O wait from iostat, would also let you identify which machines are slowing things down.
If you don't want to run a profiling tool, you can monitor the RegionServer web UI and check whether you have a lot of RPC requests queued up or taking a long time; this is available in an easily parseable JSON format. That would allow you to pinpoint slowdowns in the particular regions your job is processing.
Network I/O could also be a contributing factor. If you are running the MapReduce cluster separately from the HBase cluster, then all of the data has to be shipped to the TaskTrackers, which may be saturating your network. Standard network monitoring tools can be used here.
Another problem could simply be the Scanner itself: turning on cache blocks generally hurts performance during MR jobs in my experience, because of the high cache churn when rows are typically read only once. Also, filters attached to Scanners are applied server-side, so complex filtering may cause higher latency there.

Streaming data access and latency in hadoop applications

I am very new to Hadoop and am going through the book 'Hadoop: The Definitive Guide'.
What is the meaning of streaming data access in Hadoop, and why do we say latency is high in Hadoop applications? Can anyone please explain? Thanks in advance.
OK, let me try. "Streaming data access" implies that instead of reading data in packets or chunks, data is read continuously at a constant bit rate, just like water from a tap. The application starts reading data from the start of a file and keeps reading it sequentially, without random seeks.
Coming to the second part of your question: latency is said to be high in Hadoop applications because the initial few seconds are spent on activities like job submission, resource distribution, split creation, mapper creation, etc.
HTH
Regarding latency, I can say that the completion time is always more than 30 seconds, even if you are working with KBs of data. I don't know exactly why it takes so long, but this time goes to initialization, e.g. creating the job, determining which part of the data is going to be processed by which worker, and so on.
So, if you are going to work on a small amount of data, less than gigabytes, then don't go for Hadoop; just use your PC. Hadoop is only good for big data.
It refers to the fact that HDFS operations are read-intensive as opposed to write-intensive. In a typical scenario, the source data you would use for analysis is loaded into HDFS only when it is up to date, to ensure you have the latest data set.
During analysis, a copy of the original data (in almost its entire form) is made. Your MapReduce operation will then be invoked on the copied data.
As you can see, this is different from the usual relationship between storage and processing. In normal operation (think of your PC/Mac) you ideally want files to open quickly, which means low latency, and you keep file sizes small to make that feasible.
Since HDFS is geared towards working with petabytes (millions of GBs), latency will be high, but in contrast it is realistically possible to work with very large data sets much more easily.

How to use HBase and Hadoop to serve live traffic AND perform analytics? (Single cluster vs separate clusters?)

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.
Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.
I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.
Now my question is: how are those two different use cases reconciled (serving live traffic and doing batch analytics)? Do organizations use systems to write all data in two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and others do only analytics?
What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data into nodes that are used for analytics (increasing the replication higher than the default 3 probably makes sense in such scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to different racks, so this seems to mesh well with what I'm describing.
The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.
Does that kind of solution make sense? I am thinking it could be simpler to have one cluster than two, but would it be significantly riskier? Are there known cases of companies using an HBase cluster to serve live traffic while also running batch analytics jobs on it?
I'd love to get your opinions on this :) !
Thanks.
EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?
--
Felix
Your approach has a few problems. Even in rack-aware mode, if you have more than a few racks I don't see how you can guarantee your data will be replicated onto the right nodes. If you lose one of your "live" nodes, you will be under-replicated for a while and won't have access to that data.
HBase is greedy in terms of resources, and I've found it doesn't play well with others (in terms of memory and CPU) in high-load situations. You mention, too, that heavy analytics can impact live performance, which is also true.
In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest instead of having hybrid nodes, just dedicate some nodes to your Hadoop cluster and some nodes to your Hbase cluster. The network transfer between the two will be quite snappy.
Just my personal experience so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!
I think this kind of solution might make sense, since MR is mostly CPU-intensive and HBase is a memory-hungry beast. What we need is to properly arrange resource management. I think it is possible in the following way:
a) CPU. We can define the maximum number of MR mapper/reducer slots per node, and assuming each mapper is single-threaded, we can limit the CPU consumption of MR. The rest will go to HBase.
b) Memory. We can limit memory for mappers and reducers and give the rest to HBase.
c) I think we cannot properly manage HDFS bandwidth sharing, but I don't think it should be a problem for HBase, since disk operations are not on its critical path.
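A sketch of what (a) and (b) might look like in an MRv1-era mapred-site.xml; the values are placeholders to be sized against whatever you want to leave for the RegionServer:

```xml
<!-- mapred-site.xml fragment (MRv1 slots); values are illustrative only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value><!-- a) cap concurrent map slots per node; remaining cores go to HBase -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value><!-- a) cap concurrent reduce slots per node -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value><!-- b) cap per-task heap; leftover RAM goes to the RegionServer -->
</property>
```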
