Sticking stream data to specific working - spark-streaming

We are trying to replace Apache Storm with Apache Spark streaming.
In storm; we partitioned stream based on "Customer ID" so that msgs with a range of "customer IDs" will be routed to same bolt (worker).
We do this because each worker will cache customer details (from DB).
So we split into 4 partitions and each bolt (worker) will have 1/4 of the entire range.
I did see comparison Spark and Storm; and this being limitation on Spark.
I am hoping we have a solution to this in Spark Streaming

When using Kafka, one way to address this problem is to partition your data at the producer side. As you probably have seen, Kafka messages have a key, and you may use that key to partition the data among partitions.
Using the Kafka receiver, you create one receiver per partition. Upon start of the Streaming job, the receivers will be distributed over several executors.
This means that every executor (JVM) will be receiving data for only the partitions it's got assigned. This results on the same id going to the same executor for the lifetime of the receiver, and enables effective local caching as intended in the question.

Related

How do I get two topics that have the same partition key and the number of partitions land on the same consumer within a kafka streams application

I am trying to create a Kafka Streams service where
I am trying to initialize a cache in a processor, that will then be updated by consuming messages with a topic say "nodeStateChanged" for a partition key lets say locationId.
I need to check the node state when I consume another topic lets say "Report" again keyed by the same locationId. Effectively I am joining with the table created by nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged fall on the same instance as the Report topic so that the lookup for a location is possible when a new report is recieved. Do 1 and 2 need to be created by the same topology or it okay to create two seperate topologies that share the same APPLICATION_ID_CONFIG.
You don't need to do anything. Kafka Streams will always co-partition topics. Ie, if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks and each task is processing corresponding partitions, ie, task 0 processes partitions zero of both input topics, task 1 processes partitions one of both input topics, etc.

How to send messages in order from Apache Spark to Kafka topic

I have a use case where event information about sensors is inserted continuously in MySQL. We need to send this information with some processing in a Kafka topic every 1 or 2 minutes.
I am using Spark to send this information to Kafka topic and for maintaining CDC in Phoenix table.I am using a Cron job to run spark job every 1 minute.
The issue I am currently facing is message ordering, I need to send these messages in ascending timestamp to end the system Kafka topic (which has 1 partition). But most of the message ordering is lost due to more than 1 spark DataFrame partition sends information concurrently to Kafka topic.
Currently as a workaround I am re-partitioning my DataFrame in 1, in order to maintain the messages ordering, but this is not a long term solution as I am losing spark distributed computing.
If you guys have any better solution design around this please suggest.
I am able to achieve message ordering as per ascending timestamp to some extend by reparations my data with the keys and by applying sorting within a partition.
val pairJdbcDF = jdbcTable.map(row => ((row.getInt(0), row.getString(4)), s"${row.getInt(0)},${row.getString(1)},${row.getLong(2)},${row. /*getDecimal*/ getString(3)},${row.getString(4)}"))
.toDF("Asset", "Message")
val repartitionedDF = pairJdbcDF.repartition(getPartitionCount, $"Asset")
.select($"Message")
.select(expr("(split(Message, ','))[0]").cast("Int").as("Col1"),
expr("(split(Message, ','))[1]").cast("String").as("TS"),
expr("(split(Message, ','))[2]").cast("Long").as("Col3"),
expr("(split(Message, ','))[3]").cast("String").as("Col4"),
expr("(split(Message, ','))[4]").cast("String").as("Value"))
.sortWithinPartitions($"TS", $"Value")

Big Data ingestion - Flafka use cases

I have seen that the Big Data community is very hot in using Flafka in many ways for data ingestion but I haven't really gotten why yet.
A simple example I have developed to better understand this is to ingest Twitter data and move them to multiple sinks(HDFS, Storm, HBase).
I have done the implementation for the ingestion part in the following two ways:
(1) Plain Kafka Java Producer with multiple consumers (2) Flume agent #1 (Twitter source+Kafka sink) | (potential) Flume agent #2(Kafka source+multiple sinks). I haven't really seen any difference in the complexity of developing any of these solutions(not a production system I can't comment on performance) - only what I found online is that a good use case for Flafka would be for data from multiple sources that need aggregating in one place before getting consumed in different places.
Can someone explain why would I use Flume+Kafka over plain Kafka or plain Flume?
People usually combine Flume and Kafka, because Flume has a great (and battle-tested) set of connectors (HDFS, Twitter, HBase, etc.) and Kafka brings resilience. Also, Kafka helps distributing Flume events between nodes.
EDIT:
Kafka replicates the log for each topic's partitions across a
configurable number of servers (you can set this replication factor on
a topic-by-topic basis). This allows automatic failover to these
replicas when a server in the cluster fails so messages remain
available in the presence of failures. -- https://kafka.apache.org/documentation#replication
Thus, as soon as Flume gets the message to Kafka, you have a guarantee that your data won't be lost. NB: you can integrate Kafka with Flume at every stage of your ingestion (ie. Kafka can be used as a source, channel and sink, too).

How do I add a custom monitoring feature in my Spark application?

I am developing a Spark application. The application takes data from Kafka queue and processes that data. After processing it stores data in Hbase table.
Now I want to monitor some of the performance attributed such as,
Total count of input and output records.(Not all records will be persisted to Hbase, some of the data may be filtered out in processing)
Average processing time per message
Average time taken to persist the messages.
I need to collect this information and send it to a different Kafka queue for monitoring.
Considering that the monitoring should not incur a significant delay in the processing.
Please suggest some ideas for this.
Thanks.

Storm bolt doesn't guarantee to process the records in order they receive?

I had a storm topology that reads records from kafka, extracts timestamp present in the record, and does a lookup on hbase table, apply business logic, and then updates the hbase table with latest values in the current record!!
I have written a custom hbase bolt extending BaseRichBolt, where, the code, does a lookup on the hbase table and apply some business logic on the message that has been read from kafka, and then updates the hbase table with latest data!
The problem i am seeing is, some times, the bolt is receiving/processing the records in a jumbled order, due to which my application is thinking that a particular record is already processed, and ignoring the record!!! Application is not processing a serious amount of records due to this!!
For Example:
suppose there are two records that are read from kafka, one record belongs to 10th hour and second records belongs to 11th hour...
My custom HBase bolt, processing the 11th hour record first... then reading/processing the 10th hour record later!! Because, 11th hour record is processed first, application is assuming 10th record is already processed and ignoring the 10th hour record from processing!!
Can someone pls help me understand, why my custom hbase bolt is not processing the records in order it receive ?
should i have to mention any additional properties to ensure, the bolt processes the records in the order it receives ? what are possible alternatives i can try to fix this ?
FYI, i am using field grouping for hbase bolt, thru which i want to ensure, all the records of a particular user goes into same task!! Nevertheless to mention, thinking field grouping might causing the issue, reduces the no.of tasks for my custom hbase bolt to 1 task, still the same issue!!
Wondering why hbase bolt is not reading/processing records in the order it receives !!! Please someone help me with your thoughts!!
Thanks a lot.
Kafka doesn't provide order of messages in multiple partition.
So theres no orderring when you read messages. To avoid that, you need to create kafka topic with a single partition, but you will loose parallelism advantage.
Kafka guarantees ordering by partition not by topic. Partitioning really serves two purposes in Kafka:
It balances data and request load over brokers
It serves as a way to divvy up processing among consumer processes while allowing local state and preserving order within the partition.
For a given use case you may care about only #2. Please consider using Partitioner as part of you Producer using ProducerConfig.PARTITIONER_CLASS_CONFIG. The default Java Producer in .9 will try to level messages across all available partitions. https://github.com/apache/kafka/blob/6eacc0de303e4d29e083b89c1f53615c1dfa291e/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
You can create your own with something like this:
return hash(key)%num_partitions

Resources