Nifi: Why need multiple Provenance Repositories? - apache-nifi

What is the main purpose for needing multiple Provenance repositroys for a Nifi cluster setup.
Documentation from Apache doesn't cover the need for multiple, but feel it would be needed for multiple clusters and high throughput.

Related

How can I shard request using Aeron Cluster

I'd like to understand the capabilities of Aeron Clusters with respect to sharing requests across different back-end cluster application instances. I am thinking of something similar to partitions in Kafka where distinct back-end consumer processes the workload in independent processes. There should be a partition key which defines how to find the partition, or it could be a consumer provided hash, etc.
I read this article but it was not much help https://aeroncookbook.com/aeron-cluster/on-sharding/
So far I have only been reading the documentation and the API documents.
I also read the aeoroncookbook site: https://aeroncookbook.com/aeron-cluster/on-sharding/
Could someone provide an example of this if it is possible? The cookbook does not really do much good here because it imposes a similar problem but with dependencies between the shards.
Aeron Cluster does not directly support sharding. Its primary goal is redundant copies of the same data across multiple nodes. Sharding would need to be something that layered on via your own application logic. An approach would be to run multiple clusters and utilize a key to partition data across the clusters, then within your client application run multiple cluster clients (one for each cluster) and select the approach client based on the data that you are interacting with.

Distributed Spark and HDFS Cluster with 6 to 7 Nodes hardware configuration

I am planning to spin my development cluster for trend analysis for Infrastructure Monitoring application which I am planning to build using Spark for analysing failure trend and Cassandra for storing incoming data and analysed data.
Consider collecting performance matrix from around 25000 machines/servers (probably set of same application on different servers). I am expecting performance matrix of size 2MB/sec from each machine, which I am planning to push into Cassandra table having timestamp, server as primary key and application along with some important matrix as clustering key. I will be running Spark job on top of this stored information for performance matrix failure trend analysis.
Comming to the question, How many nodes (machines) and of what configuration in terms of CPU and Memory do I need to kick start my cluster considering above scenario.
Cassandra needs a well planned out data model for things to run well. It is very much worth spending time planning things out at this stage before you have a large data set and find out you probably would have done better re-arranging the data model!
The "general" rule of thumb is you shape your model to the queries, while paying attention to avoiding things like really large rows, large deletes, batches and such the like which can have big performance penalties.
The docs give a good start on planning and testing you would probably find useful. I would also recommend using the Cassandra stress tool. You can use it to push performance tests into your Cassandra cluster to check latencies and any performance problems. You can use your own schema too which I personally think is super-useful!
If you are using cloud based hardware like AWS then its relatively easy to scale up / down and see what works best for you. You dont need to throw big hardware at Cassandra, its easier to scale horizontally than vertically.
I'm assuming you are pulling back the data into a separate spark cluster for the analytics side too so these nodes would be running plain Cassandra (less hardware specs). If however you are using the Datastax Enterprise version (where you can run nodes in spark "mode") then you will need more beefier hardware with the additional load you need for spark driver programs, executors and such the like. Another good docs link is the DSE hardware recommendations

Tools to test a BigData Pipeline end to end?

I have this pipeline: Webserver+rsyslog->Kafka->Logstash->ElasticSearch->Kibana
I have found these tools to help test my pipeline:
Generate webserver load by spinning up jmeter EC2 instances with jmeter-ec2
Generate load on Kafka and help graph throughput with Sangrenel
I am wondering if anyone had any other suggestions for testing components or end-to-end testing? Thanks.
Great question! I am looking for something similar but may settle on a simple home solution.
Set up Storm cluster with bolts writing data to Kafka. One thing to watch out for is the id/key so your messages are distributed across multiple partitions. The reason for Storm is to have distributed set of publishers. As alternative to Storm, you can have multiple producers with lets say KafkaAppender
Once you know your Kafka performance, connect Logstash to loaded topic and let it drain as fast as possible. You may find some useful information with KafkaManager or connecting to JMX (many tools for that)
Simplest way to monitor Elastic is Marvel
Performance of Kibana depends on amount of data your query returns but the smallest interval is still 5 sec.
In my experience, logstash performance will depend on data size and grok complexity. The performance of Elastic is mostly cluster size, shard/template configuration. The fastest component in your setup will always be Kafka (bounded by ack and Zookeeper settings)
Also, if you control data generation, you may compare time of record generated vs #timestamp of logstash and measure lagging.

How do the Flowfiles get distributed across the cluster nodes?

For example, if I have a GetFile processor that I have designated to be isolated, how do the flow files coming from that processor get distributed across the cluster nodes?
Is there any additional work / processors that need to be added?
In Apache NiFi today the question of load balancing across the cluster has two main answers. First, you must consider how data gets to the cluster in the first place. Second, once it is in the cluster do you need to rebalance.
For getting data into the cluster it is important that you select protocols which are themselves scalable in nature. Protocols which offer queuing semantics are good for this whereas protocols which do not offer queuing semantics are problematic. As an example of one with queueing semantics think JMS queues or Kafka or some HTTP APIs. Those are great because one or more clients can pull from them in a queue fashion and thus spread the load. An example of a protocol which does not offer such behavior would bet GetFile or GetSFTP and so on. These are problematic because the client(s) have to share state about which data they see to pull. To address even these protocols we've moved to a model of 'ListSFTP' and 'FetchSFTP' where ListSFTP occurs on one node in the cluster (primary node) and then it uses Site-to-Site feature of NiFi to load balance to the rest of the cluster then each node gets its share of work and does FetchSFTP to actually pull the data. The same pattern is offered for HDFS now as well.
In describing that pattern I also mentioned Site-to-Site. This is how two nifi clusters can share data which is great for Inter-site and Instra-Site distribution needs. It also works well for spreading load within the same cluster. For this you simply send the data to the same cluster and NiFi takes care then of load balancing and fail-over and detection of new nodes and removed nodes.
So there are great options already. That said we can do more and in the future we plan to offer a way for you to on a connection indicate it should be auto-load-balanced and then it will behind the scenes do what I've described.
Thanks
Joe
Here is an updated answer, that works even simpler with newer versions of NiFi. I am running Apache NiFi 1.8.0 here.
The approach I found here is to use a processor on the primary node, that will emit flow files to be consumed via a load balanced connection.
For example, use one of the List* processors, in "Scheduling" set its "Execution" to run on the primary node.
This should feed into the next processor. Select the connection and set its "Load Balance Strategy".
You can read more about the feature in its design document.

Best technology stack for aggregation across various properties

We are working on developing a platform which models flow of entities across a graph. The system has to answer questions of the kind how many entities having these properties are sitting at a given node on the graph , what is the inflow on a node, outflow on a node etc. Flow data is fed to the system in a stream. We are thinking of breaking the flow data in time buckets(say 5 mins) and pre-compute various aggregates against different properties and storing the aggregates in DynamoDB to serve queries.
With regards to this we are evaluating the following options:
EMR: Put flow data in AWS -S3/DynamoDB run a Map Reduce/hive job
Putting recent data into AWS- RDS, computing the aggregates via sql
Akka: It is a framework to build distributed applications via Actors
and Message passing.
If anyone has worked on similar usecase or has used any of the above technologies, please let me know what approach would be best fit for our use case.
I have used EMR to process data in S3... works pretty well. And the best part is you can spin up hadoop clusters of various sizes that fit the work load.
you may want to look into Storm for stream processing
I am also collecting a list of big-data tools here: http://hadoopilluminated.com/hadoop_book/Bigdata_Ecosystem.html
The final solution employed AWS Redshift, the driving reason was the requirement of high speed data ingestion, which Redshift provides via the COPY command.
Hadoop is built to store the data efficiently, however it does not gurantees a sub-second sla for ingestion, neither does it provide an SLA for when the data will be available for MR jobs, this was the main reason we did not go with EMR or Hadoop in general.

Resources