Big Data ingestion - Flafka use cases - hadoop

I have seen that the Big Data community is very keen on using Flafka (Flume + Kafka) in many ways for data ingestion, but I still don't really understand why.
A simple example I developed to better understand this is to ingest Twitter data and move it to multiple sinks (HDFS, Storm, HBase).
I have implemented the ingestion part in the following two ways:
(1) A plain Kafka Java producer with multiple consumers.
(2) Flume agent #1 (Twitter source + Kafka sink) | (potential) Flume agent #2 (Kafka source + multiple sinks).
I haven't really seen any difference in the complexity of developing either solution (this is not a production system, so I can't comment on performance). The only thing I found online is that a good use case for Flafka is data from multiple sources that needs aggregating in one place before being consumed in different places.
Can someone explain why I would use Flume+Kafka over plain Kafka or plain Flume?

People usually combine Flume and Kafka because Flume has a great (and battle-tested) set of connectors (HDFS, Twitter, HBase, etc.) and Kafka brings resilience. Kafka also helps distribute Flume events between nodes.
EDIT:
Kafka replicates the log for each topic's partitions across a
configurable number of servers (you can set this replication factor on
a topic-by-topic basis). This allows automatic failover to these
replicas when a server in the cluster fails so messages remain
available in the presence of failures. -- https://kafka.apache.org/documentation#replication
Thus, once Flume has delivered a message to Kafka, your data is protected against the failure of a broker (provided the topic's replication factor is greater than 1). NB: you can integrate Kafka with Flume at every stage of your ingestion (i.e. Kafka can be used as a source, a channel, and a sink).

Related

Kafka Connect- Modifying records before writing into sink

I have installed Kafka Connect using confluent-4.0.0.
Using the HDFS connector, I am able to save Avro records received from a Kafka topic to Hive.
I would like to know if there is any way to modify the records before writing them to the HDFS sink.
My requirement is to make small modifications to the values of the record, for example performing arithmetic operations on integers or manipulating strings.
Please suggest a way to achieve this, if there is one.
You have several options.
Single Message Transforms, which you can see in action here. Great for light-weight changes as messages pass through Connect. Configuration-file based, and extensible using the provided API if there's not an existing transform that does what you want.
See the discussion here on when SMTs are suitable for a given requirement.
KSQL is a streaming SQL engine for Kafka. You can use it to modify your streams of data before sending them to HDFS. See this example here.
KSQL is built on the Kafka Streams API, which is a Java library and gives you the power to transform your data as much as you'd like. Here's an example.
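To make the Single Message Transform idea concrete: conceptually, a transform is a function that takes one record and returns one (possibly modified) record, and Connect applies a chain of them to each message on its way to the sink. The sketch below is plain Python, not the Kafka Connect API (real transforms are written in Java or configured on the connector). The record layout and the field names `price` and `name` are made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]  # hypothetical stand-in for a record value


def scale_price(record: Record) -> Record:
    """Arithmetic on an integer field: convert a price to cents."""
    out = dict(record)  # copy so the input record is left untouched
    out["price"] = record["price"] * 100
    return out


def upper_name(record: Record) -> Record:
    """String manipulation: trim and upper-case a text field."""
    out = dict(record)
    out["name"] = record["name"].strip().upper()
    return out


def apply_chain(
    records: Iterable[Record],
    transforms: Iterable[Callable[[Record], Record]],
) -> Iterator[Record]:
    """Apply each transform, in order, to every record (one in, one out)."""
    chain = list(transforms)
    for record in records:
        for transform in chain:
            record = transform(record)
        yield record
```

If a change fits this one-record-in, one-record-out shape, an SMT is probably enough. Anything that needs joins, aggregation, or state across messages is a better fit for KSQL or Kafka Streams.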
Take a look at Kafka Connect transformations [1] & [2]. You can build a custom transformation library and use it in your connector.
[1] http://kafka.apache.org/documentation.html#connect_transforms
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-66%3A+Single+Message+Transforms+for+Kafka+Connect

Clustered NIFI, Only one node is working

I'm using NiFi in clustered mode with two nodes, and I have noticed that only one node does all the work.
Any idea why that is, and how can I make nifi2 do some of the processing of the dataflow?
It depends how data is coming in to your cluster. It is up to you as the data flow designer to create an approach that allows the data to be partitioned across your cluster for processing.
See this post for an overview of strategies to do this:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

How do I add a custom monitoring feature in my Spark application?

I am developing a Spark application. The application takes data from a Kafka queue and processes it. After processing, it stores the data in an HBase table.
Now I want to monitor some performance attributes, such as:
Total count of input and output records. (Not all records will be persisted to HBase; some of the data may be filtered out during processing.)
Average processing time per message.
Average time taken to persist the messages.
I need to collect this information and send it to a different Kafka queue for monitoring.
The monitoring should not add a significant delay to the processing.
Please suggest some ideas for this.
Thanks.

Sticking stream data to a specific worker

We are trying to replace Apache Storm with Apache Spark Streaming.
In Storm, we partitioned the stream based on "Customer ID" so that messages with a given range of customer IDs are routed to the same bolt (worker).
We do this because each worker caches customer details (from the DB).
So we split the stream into 4 partitions, and each bolt (worker) handles 1/4 of the entire range.
I have seen comparisons of Spark and Storm that describe this as a limitation of Spark.
I am hoping there is a solution to this in Spark Streaming.
When using Kafka, one way to address this problem is to partition your data on the producer side. As you have probably seen, Kafka messages have a key, and you can use that key to distribute the data among partitions.
Using the Kafka receiver, you create one receiver per partition. When the Streaming job starts, the receivers are distributed over several executors.
This means that every executor (JVM) receives data only for the partitions it has been assigned. As a result, the same ID goes to the same executor for the lifetime of the receiver, which enables effective local caching as intended in the question.
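A minimal sketch of the key-based routing described above, in plain Python with no Kafka client involved. It assumes the customer ID is used as the message key and that there are 4 partitions, as in the question. The CRC32 hash is only an illustrative stand-in: Kafka's default partitioner uses a different hash, but the property that matters is the same, namely that a given key always maps to the same partition.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # matches the four-way split described in the question


def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash.

    CRC32 is an illustrative choice, not the hash Kafka itself uses.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions


def route(messages):
    """Group (customer_id, payload) pairs by the partition their key maps to."""
    by_partition = defaultdict(list)
    for customer_id, payload in messages:
        by_partition[partition_for(customer_id)].append((customer_id, payload))
    return dict(by_partition)
```

Because the mapping depends only on the key, every message for one customer lands in one partition, and therefore on whichever executor reads that partition. Note that hashing spreads IDs across partitions rather than keeping contiguous ID ranges together; if you really need ranges, you would have to supply your own partitioning logic on the producer.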

XML data via API to Land in Hadoop

We are receiving huge amounts of XML data via an API. In order to handle this large data set, we are planning to process it in Hadoop.
We need your help in understanding how to bring the data into Hadoop efficiently. What tools are available? Is it possible to bring this data in in real time?
Please provide your input.
Thanks for your help.
Since you are receiving huge amounts of data, the appropriate way, IMHO, would be to use an aggregation tool like Flume. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data into your Hadoop cluster from different types of sources.
You can easily write custom sources based on your needs to collect the data. You might find this link helpful to get started. It presents a custom Flume source designed to connect to the Twitter Streaming API and ingest tweets in raw JSON format into HDFS. You could try something similar for your XML data.
You might also want to have a look at Apache Chukwa, which does the same thing.
HTH
Flume, Scribe & Chukwa are the tools that can accomplish the above task. However, Flume is the most widely used of the three. Flume has strong reliability and failover mechanisms. Flume also has commercial support available from Cloudera, while the other two do not.
If your only objective is for the data to land in HDFS, you can keep writing the XML responses to disk following some convention such as data-2013-08-05-01.xml and write a daily (or hourly) cron job to import the XML data into HDFS. Running Flume will be overkill if you don't need streaming capabilities. From your question, it is not immediately obvious why you need Hadoop. Do you need to run MR jobs?
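A minimal sketch of that file-naming convention and the import step, assuming one file per hour. The directories `/var/spool/xml` and `/data/xml` used in the note below are hypothetical placeholders. The function only builds the `hdfs dfs -put` command; it does not run it.

```python
from datetime import datetime
from pathlib import PurePosixPath


def hourly_filename(ts: datetime) -> str:
    """Build a name such as data-2013-08-05-01.xml for the hour containing ts."""
    return ts.strftime("data-%Y-%m-%d-%H.xml")


def hdfs_put_command(local_dir: str, hdfs_dir: str, ts: datetime) -> list:
    """Return the argument list that copies one hourly file into HDFS."""
    local_path = str(PurePosixPath(local_dir) / hourly_filename(ts))
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]
```

An hourly cron job could then call something like `subprocess.run(hdfs_put_command("/var/spool/xml", "/data/xml", previous_hour), check=True)` for the hour that has just closed, so that it never uploads a file that is still being written.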
You will want to convert the data into Avro or a serialization format of your choice (Protocol Buffers, for example) for processing. Once you have a schema that matches the structure of the data, the Hadoop ecosystem is much more helpful in processing it.
Hadoop was originally found most useful for taking one-line entries from log files and structuring / processing the data from there. XML is already structured and requires more processing power to get it into a Hadoop-friendly format.
A more basic solution would be to chunk the XML data and process it using Wukong (Ruby streaming) or a Python alternative. Since you are network bound by the third-party API, a streaming solution might be more flexible and just as fast in the end for your needs.
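As a rough illustration of the Python route, the sketch below chunks an XML stream into one tab-separated line per record, which is the line-oriented shape that streaming jobs handle well. It uses only the standard library. The element names (`feed`, `record`, `id`, `name`) are invented for the example, and it assumes flat records whose field values contain no tabs or newlines.

```python
import io
import xml.etree.ElementTree as ET


def records_to_lines(xml_stream, record_tag="record"):
    """Yield one tab-separated line per <record_tag> element in the stream."""
    # iterparse reads incrementally, so the whole document never has to
    # be loaded into memory at once.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == record_tag:
            fields = [(child.text or "").strip() for child in elem]
            yield "\t".join(fields)
            elem.clear()  # drop the record's children once it has been emitted


if __name__ == "__main__":
    sample = (
        b"<feed>"
        b"<record><id>1</id><name>a</name></record>"
        b"<record><id>2</id><name>b</name></record>"
        b"</feed>"
    )
    for line in records_to_lines(io.BytesIO(sample)):
        print(line)
```

The same function can read from `sys.stdin.buffer`, so it could serve as the first stage of a streaming pipeline that turns each API response into lines before anything else touches the data.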
