Tools to test a BigData Pipeline end to end?


I have this pipeline: Webserver+rsyslog->Kafka->Logstash->ElasticSearch->Kibana
I have found these tools to help test my pipeline:
Generate webserver load by spinning up jmeter EC2 instances with jmeter-ec2
Generate load on Kafka and help graph throughput with Sangrenel
I am wondering if anyone had any other suggestions for testing components or end-to-end testing? Thanks.

Great question! I am looking for something similar but may settle on a simple home solution.
Set up a Storm cluster with bolts writing data to Kafka. One thing to watch out for is the id/key, so that your messages are distributed across multiple partitions. The reason for Storm is to have a distributed set of publishers. As an alternative to Storm, you can have multiple producers with, let's say, KafkaAppender.
Once you know your Kafka performance, connect Logstash to the loaded topic and let it drain as fast as possible. You may find some useful information with KafkaManager or by connecting to JMX (there are many tools for that).
The simplest way to monitor Elasticsearch is Marvel.
Performance of Kibana depends on the amount of data your query returns, but the smallest refresh interval is still 5 sec.
In my experience, Logstash performance will depend on data size and grok complexity. Elasticsearch performance is mostly a matter of cluster size and shard/template configuration. The fastest component in your setup will always be Kafka (bounded by ack and Zookeeper settings).
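On the earlier point about keys and partitions, here is a minimal sketch of why the key matters. It assumes a hash-mod partitioner and uses CRC32 as a stand-in hash (Kafka's own default partitioner uses a different hash, so the exact partition numbers will differ); the key names are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Stand-in hash (CRC32) for illustration only; this is NOT Kafka's
    actual partitioner, just the same hash-mod idea.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_spread(keys, num_partitions):
    """Count how many messages land on each partition for the given keys."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical load-test keys: one constant key vs. one key per simulated host.
constant_key = ["webserver"] * 1000
varied_keys = [f"host-{i}" for i in range(1000)]

print("constant key:", partition_spread(constant_key, 4))
print("varied keys: ", partition_spread(varied_keys, 4))
```

With a constant key every message ends up on a single partition, so a load test would only exercise one partition (and one consumer); varied keys spread the load.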
Also, if you control data generation, you can compare the time each record was generated with the @timestamp set by Logstash and measure the lag.
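A rough sketch of that lag measurement, assuming each event carries a generator-side field (called generated_at here, a hypothetical name) alongside Logstash's @timestamp, that both are ISO-8601 strings, and that the clocks on both machines are synchronized. The sample records are made up for illustration.

```python
from datetime import datetime
from statistics import median


def parse_iso(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def lag_seconds(record: dict) -> float:
    """Seconds between record generation and the Logstash @timestamp."""
    generated = parse_iso(record["generated_at"])  # hypothetical field name
    indexed = parse_iso(record["@timestamp"])
    return (indexed - generated).total_seconds()


# Made-up sample events, e.g. as they might be pulled back from the index.
records = [
    {"generated_at": "2016-03-01T12:00:00.000Z", "@timestamp": "2016-03-01T12:00:00.250Z"},
    {"generated_at": "2016-03-01T12:00:01.000Z", "@timestamp": "2016-03-01T12:00:02.500Z"},
    {"generated_at": "2016-03-01T12:00:02.000Z", "@timestamp": "2016-03-01T12:00:06.000Z"},
]

lags = sorted(lag_seconds(r) for r in records)
print("min lag (s):   ", lags[0])
print("median lag (s):", median(lags))
print("max lag (s):   ", lags[-1])
```

This only works if @timestamp reflects when Logstash saw the event; if a date filter overwrites it with the original event time, the lag would have to be taken from another field.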

Related

Distributed Spark and HDFS Cluster with 6 to 7 Nodes hardware configuration

I am planning to spin up a development cluster for an Infrastructure Monitoring application that does trend analysis. I plan to build it using Spark for analysing failure trends and Cassandra for storing both the incoming data and the analysed data.
Consider collecting performance metrics from around 25000 machines/servers (probably a set of the same applications on different servers). I am expecting performance metrics at a rate of 2MB/sec from each machine, which I am planning to push into a Cassandra table with timestamp and server as the primary key and the application, along with some important metrics, as the clustering key. I will be running a Spark job on top of this stored information for failure trend analysis of the performance metrics.
Coming to the question: how many nodes (machines), and of what configuration in terms of CPU and memory, do I need to kick-start my cluster for the above scenario?
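As a rough sanity check on these figures, the ingest volume can be worked out directly. This is only back-of-the-envelope arithmetic: it takes the 2MB/sec figure as really being per machine, uses decimal units, and the replication factor of 3 is an assumption, not something stated above.

```python
MACHINES = 25_000
MB_PER_SEC_PER_MACHINE = 2
SECONDS_PER_DAY = 86_400
REPLICATION_FACTOR = 3  # assumption: not stated in the question

# Aggregate ingest rate across all machines.
ingest_mb_per_sec = MACHINES * MB_PER_SEC_PER_MACHINE
ingest_gb_per_sec = ingest_mb_per_sec / 1_000

# Daily volume, before and after replication (1 TB = 1,000,000 MB).
raw_tb_per_day = ingest_mb_per_sec * SECONDS_PER_DAY / 1_000_000
replicated_tb_per_day = raw_tb_per_day * REPLICATION_FACTOR

print(f"ingest rate:        {ingest_gb_per_sec:.0f} GB/s")
print(f"raw volume:         {raw_tb_per_day:.0f} TB/day")
print(f"replicated volume:  {replicated_tb_per_day:.0f} TB/day")
```

If the per-machine rate is actually lower (for example 2 KB/sec), the same calculation scales down accordingly, so it is worth confirming the unit before sizing nodes.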

Cassandra needs a well planned out data model for things to run well. It is very much worth spending time planning things out at this stage before you have a large data set and find out you probably would have done better re-arranging the data model!
The "general" rule of thumb is that you shape your model to the queries, while taking care to avoid things like really large rows, large deletes, batches and the like, which can carry big performance penalties.
The docs give a good start on planning and testing you would probably find useful. I would also recommend using the Cassandra stress tool. You can use it to push performance tests into your Cassandra cluster to check latencies and any performance problems. You can use your own schema too which I personally think is super-useful!
If you are using cloud-based hardware like AWS then it's relatively easy to scale up / down and see what works best for you. You don't need to throw big hardware at Cassandra; it's easier to scale horizontally than vertically.
I'm assuming you are pulling the data back into a separate Spark cluster for the analytics side too, so these nodes would be running plain Cassandra (lower hardware specs). If, however, you are using the Datastax Enterprise version (where you can run nodes in Spark "mode") then you will need beefier hardware to handle the additional load from Spark driver programs, executors and the like. Another good docs link is the DSE hardware recommendations.

Architecture of a real time streaming job

I am working on a streaming application using Spark Streaming, and I want to index my data into Elasticsearch.
My analysis:
I can push data directly from Spark to Elasticsearch, but I feel that in this case both components will be tightly coupled.
If this is a spark core job, we can write that output to HDFS and use logstash to get the data from HDFS and push it to elastic search.
Solution according to me:
I can push the data from Spark Streaming to Kafka and from Kafka we can read that data using Logstash and push to ES.
Please suggest.

First of all, it is great that you have thought through the different approaches.
There are a few questions which you should ask before coming to a good design:
Timelines? Spark -> ES is a breeze and is recommended if you are starting on a PoC.
Operational bandwidth? Introducing more components will increase operational concerns. From my personal experience, making sure your Spark Streaming job is stable is itself a time-consuming job. You want to add Kafka as well, so you need to spend more time getting the monitoring and other ops concerns right.
Scale? If it is going to handle more scale, having a persistent message bus can help absorb back-pressure and still scale pretty well.
If I had the time and were dealing with large scale, Spark Streaming -> Kafka -> ES looks to be the best bet. This way, when your ES cluster is unstable, you still have the option of Kafka replay.
I am a little hazy on Kafka -> HDFS -> ES, as there could be performance implications on adding a batch layer in between the Source and Sink. Also honestly, I am not aware of how good logstash is with HDFS, so can't really comment.
Tight coupling is an oft-discussed subject. There are people who argue against it citing reusability concerns, but there are also people who argue for it, as sometimes it can create a simpler design and make the whole system easier to reason about. Also, talk about premature optimisations :) We have had successes with Spark -> ES directly at a moderate scale of data inflow. So don't discount the power of a simpler design just like that :)

Throughput for Kafka, Spark, Elasticsearch Stack on GCP/Dataproc

I'm working on a research project where I installed a complete data analysis pipeline on Google Cloud Platform. We estimate unique visitors per URL in real-time using HyperLogLog on Spark. I used Dataproc to set up the Spark Cluster. One goal of this work is to measure the throughput of the architecture depending on the cluster size. The Spark cluster has three nodes (minimal configuration)
A data stream is simulated with my own data generators written in Java, which use the Kafka producer API. The architecture looks as follows:
Data generators -> Kafka -> Spark Streaming -> Elasticsearch.
The problem is: as I increase the number of events produced per second by my data generators and it goes beyond ~1000 events/s, the input rate in my Spark job suddenly collapses and begins to vary a lot.
As you can see on the screenshot from the Spark Web UI, the processing times and scheduling delays stay consistently short, while the input rate goes down.
Screenshot from Spark Web UI
I tested it with a completely simple Spark job which only does a simple mapping, to exclude causes like slow Elasticsearch writes or problems with the job itself. Kafka also seems to receive and send all the events correctly.
Furthermore I experimented with the Spark configuration parameters:
spark.streaming.kafka.maxRatePerPartition and spark.streaming.receiver.maxRate
with the same result.
Does anybody have some ideas what goes wrong here? It really seems to be up to the Spark Job or Dataproc... but I'm not sure. All CPU and memory utilizations seem to be okay.
EDIT: Currently I have two Kafka partitions on that topic (placed on one machine). But I think Kafka should manage more than 1500 events/s even with only one partition. The problem was also there with one partition at the beginning of my experiments. I use the direct approach with no receivers, so Spark reads from the topic concurrently with two worker nodes.
EDIT 2: I found out what causes this bad throughput. I forgot to mention one component in my architecture. I use one central Flume agent to log all the events from my simulator instances via log4j via netcat. This Flume agent is the cause of the performance problem! I changed the log4j configuration to use asynchronous loggers (https://logging.apache.org/log4j/2.x/manual/async.html) via disruptor. I scaled the Flume agent up to more CPU cores and RAM and changed the channel to a file channel. But performance is still bad. No effect... any other ideas how to tune Flume performance?
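For reference on the two rate-limit settings named above: with the direct (receiver-less) Kafka approach, spark.streaming.kafka.maxRatePerPartition is the one that applies, while spark.streaming.receiver.maxRate only affects receiver-based streams. A small sketch of the ceiling the per-partition setting imposes; the numeric values are hypothetical examples, not the settings used in this experiment.

```python
def max_input_rate(max_rate_per_partition: int, num_partitions: int) -> int:
    """Upper bound on records/second read by a direct Kafka stream."""
    return max_rate_per_partition * num_partitions


def max_records_per_batch(max_rate_per_partition: int,
                          num_partitions: int,
                          batch_interval_s: int) -> int:
    """Upper bound on records pulled into one micro-batch."""
    return max_input_rate(max_rate_per_partition, num_partitions) * batch_interval_s


# Hypothetical values: 1000 records/s per partition, 2 partitions, 5 s batches.
rate_cap = max_input_rate(1000, 2)
batch_cap = max_records_per_batch(1000, 2, 5)

print("max input rate (records/s):", rate_cap)
print("max records per batch:     ", batch_cap)
```

If the observed input rate flattens well below such a cap, as it did here, the limit is more likely upstream of Spark than in these settings.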

Hard to say given the sparse information. I would suspect a memory issue: at some point, the servers may even start swapping. So check the JVM memory utilization and swapping activity on all servers. Elasticsearch should be capable of handling ~15,000 records/second with little tweaking. Check the free and committed RAM on the servers.

As I mentioned before, CPU and RAM utilization are totally fine. I found a "magic limit"; it seems to be exactly 1500 events per second. As soon as I exceed this limit, the input rate immediately begins to wobble.
The mysterious thing is that processing times and scheduling delays stay constant. So one can exclude backpressure effects, right?
The only thing I can guess is a technical limit with GCP/Dataproc... I didn't find any hints in the Google documentation.
Some other ideas?

How do the Flowfiles get distributed across the cluster nodes?

For example, if I have a GetFile processor that I have designated to be isolated, how do the flow files coming from that processor get distributed across the cluster nodes?
Is there any additional work / processors that need to be added?

In Apache NiFi today the question of load balancing across the cluster has two main answers. First, you must consider how data gets to the cluster in the first place. Second, once it is in the cluster, do you need to rebalance?
For getting data into the cluster it is important that you select protocols which are themselves scalable in nature. Protocols which offer queuing semantics are good for this, whereas protocols which do not offer queuing semantics are problematic. As an example of one with queuing semantics, think JMS queues or Kafka or some HTTP APIs. Those are great because one or more clients can pull from them in a queue fashion and thus spread the load. An example of a protocol which does not offer such behavior would be GetFile or GetSFTP and so on. These are problematic because the client(s) have to share state about which data they see to pull. To address even these protocols we've moved to a model of 'ListSFTP' and 'FetchSFTP', where ListSFTP occurs on one node in the cluster (the primary node) and then uses the Site-to-Site feature of NiFi to load balance to the rest of the cluster; each node then gets its share of work and does FetchSFTP to actually pull the data. The same pattern is offered for HDFS now as well.
In describing that pattern I also mentioned Site-to-Site. This is how two NiFi clusters can share data, which is great for Inter-Site and Intra-Site distribution needs. It also works well for spreading load within the same cluster. For this you simply send the data to the same cluster, and NiFi then takes care of load balancing, fail-over, and detection of new and removed nodes.
So there are great options already. That said, we can do more, and in the future we plan to offer a way for you to indicate on a connection that it should be auto-load-balanced; it will then do behind the scenes what I've described.
Thanks
Joe

Here is an updated answer, which is even simpler with newer versions of NiFi. I am running Apache NiFi 1.8.0 here.
The approach I found is to use a processor on the primary node that emits flow files to be consumed via a load-balanced connection.
For example, use one of the List* processors and, under "Scheduling", set its "Execution" to run on the primary node.
This should feed into the next processor. Select the connection between them and set its "Load Balance Strategy".
You can read more about the feature in its design document.
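To illustrate what a load balance strategy on a connection does conceptually, here is a small simulation of two distribution styles: round robin and partitioning by an attribute. This is only an illustration of the idea, not NiFi's implementation; the node names, the flow file dictionaries and the use of CRC32 as the hash are all assumptions.

```python
import zlib
from itertools import cycle


def round_robin(flowfiles, nodes):
    """Hand flow files to nodes in turn so each node gets an even share."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


def partition_by_attribute(flowfiles, nodes, attribute):
    """Send flow files with the same attribute value to the same node."""
    assignment = {node: [] for node in nodes}
    for flowfile in flowfiles:
        value = str(flowfile[attribute]).encode("utf-8")
        node = nodes[zlib.crc32(value) % len(nodes)]  # stand-in hash
        assignment[node].append(flowfile)
    return assignment


# Hypothetical listing produced by a List* processor on the primary node.
nodes = ["node-1", "node-2", "node-3"]
listing = [{"filename": f"file-{i}.log", "customer": f"c{i % 2}"} for i in range(6)]

for node, files in round_robin(listing, nodes).items():
    print(node, [f["filename"] for f in files])
```

Round robin gives the most even spread, while partitioning by attribute is the better fit when related flow files must be processed on the same node.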

Best technology stack for aggregation across various properties

We are developing a platform which models the flow of entities across a graph. The system has to answer questions such as: how many entities with given properties are sitting at a given node on the graph, what is the inflow on a node, what is the outflow on a node, etc. Flow data is fed to the system as a stream. We are thinking of breaking the flow data into time buckets (say 5 mins), pre-computing various aggregates against different properties, and storing the aggregates in DynamoDB to serve queries.
With regards to this we are evaluating the following options:
EMR: Put flow data in AWS -S3/DynamoDB run a Map Reduce/hive job
Putting recent data into AWS- RDS, computing the aggregates via sql
Akka: It is a framework to build distributed applications via Actors and message passing.
If anyone has worked on a similar use case or has used any of the above technologies, please let me know which approach would be the best fit for our use case.
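The time-bucket idea described above can be sketched independently of which of these technologies runs it. This is a minimal in-memory version; the event fields (ts, src, dst, kind) and the sample data are hypothetical, and a real system would write the resulting counters to a store such as DynamoDB rather than keep them in dictionaries.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5-minute buckets, as in the question


def bucket_start(epoch_seconds: int) -> int:
    """Round a timestamp down to the start of its time bucket."""
    return epoch_seconds - (epoch_seconds % BUCKET_SECONDS)


def aggregate(events):
    """Count inflow and outflow per (bucket, node, property value)."""
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for event in events:
        bucket = bucket_start(event["ts"])
        outflow[(bucket, event["src"], event["kind"])] += 1
        inflow[(bucket, event["dst"], event["kind"])] += 1
    return dict(inflow), dict(outflow)


# Hypothetical flow events: an entity of some kind moves from src to dst at ts.
events = [
    {"ts": 10, "src": "A", "dst": "B", "kind": "red"},
    {"ts": 299, "src": "A", "dst": "B", "kind": "red"},
    {"ts": 300, "src": "B", "dst": "C", "kind": "red"},
    {"ts": 310, "src": "A", "dst": "B", "kind": "blue"},
]

inflow, outflow = aggregate(events)
print("inflow: ", inflow)
print("outflow:", outflow)
```

Queries such as "inflow of red entities at node B between two times" then reduce to summing the pre-computed counters for the buckets in that range.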

I have used EMR to process data in S3... it works pretty well. And the best part is that you can spin up Hadoop clusters of various sizes to fit the workload.
You may want to look into Storm for stream processing.
I am also collecting a list of big-data tools here: http://hadoopilluminated.com/hadoop_book/Bigdata_Ecosystem.html

The final solution employed AWS Redshift. The driving reason was the requirement for high-speed data ingestion, which Redshift provides via the COPY command.
Hadoop is built to store data efficiently; however, it does not guarantee a sub-second SLA for ingestion, nor does it provide an SLA for when the data will be available for MR jobs. This was the main reason we did not go with EMR or Hadoop in general.
