Background:
We use flink to consume data from kafka and write to HDFS in hour partitions. And We don't set the watermarks because we want all the record consumed.
so we get directories like
/data/app1/day=2019-01-01/hour=01/
/data/app1/day=2019-01-01/hour=02/
...
/data/app1/day=2019-01-01/hour=23/
Other jobs need our data as input, so the data are not intact if something some late data may even arrive several hours later, which leads to incorrect result for downstream jobs.
So my question is how to guarantee or detect the data integrity in flink?
Related
I've created a simple NiFi pipeline that reads a stream of data from a Kafka topic (using ConsumeKafka) and writes it to the HDFS (using PutHDFS). Currently, I'm seeing lots of small files being created on the HDFS. A new file is created about once a second, some with only one or two records.
I want fewer, larger files to be written to the HDFS.
I have the following settings in ConsumeKafka:
Message Demarcator = <new line>
Max Poll Records = 10000
Max Uncommitted Time = 20s
In the past I've used Flume instead of Nifi, and it has batchSize and batchDurationMillis, which allow me to tweak how big HDFS files are. It seems like ConsumeKafka in Nifi is missing a batchDurationMillis equivalent.
What's the solution in NiFi?
Using the Message Demarcator and Max Poll Records is the correct approach to get multiple messages per flow file. You may want to slow down the ConsumeKafka processor by adjusting the Run Schedule (on the scheduling tab) from 0 sec which means run as fast as possible, to something like 1 second or whatever makes sense for you to grab more data.
Even with the above, you would likely still want to stick a MergeContent processor before PutHDFS, and merge together flow files based on size so that you can wait til you have the appropriate amount of data before writing to HDFS.
How to use MergeContent will depend on the type of data you are merging... If you have Avro, there is a specific merge strategy for Avro. If you have JSON you can merge them one after another, or you can wrap them with a header, footer, and demarcator to make a valid JSON array.
I am developing a Spark application. The application takes data from Kafka queue and processes that data. After processing it stores data in Hbase table.
Now I want to monitor some of the performance attributed such as,
Total count of input and output records.(Not all records will be persisted to Hbase, some of the data may be filtered out in processing)
Average processing time per message
Average time taken to persist the messages.
I need to collect this information and send it to a different Kafka queue for monitoring.
Considering that the monitoring should not incur a significant delay in the processing.
Please suggest some ideas for this.
Thanks.
I am trying to use spark streaming to deal with some order stream, I have some previous computed features for maybe a buyer_id for order in the stream.
I need to get these features while the Spark Streaming is running.
Now, I stored the buyer_id features in a hive table and load it into and RDD and
val buyerfeatures = loadBuyerFeatures()
orderstream.transform(rdd => rdd.leftOuterJoin(buyerfeatures))
to get the pre-computed features.
another way to deal with this is maybe save the features in to a hbase table. and fire a get on every buyer_id.
which one is better ? or maybe I can solve this in another way.
From my short experience:
Loading the necessary data for the computation should be done BEFORE starting the streaming context:
If you are loading inside a DStream operation, this operation will be repeated at each Batch Inteverval time.
If you load each time from Hive, you should seriously consider overhead costs and possible problems during data transfer.
So, if your data is already computed and "small" enough, load it at the beginning of the program in a Broadcast variable or,even better, in a final variable. Either this, or create an RDD before the DStream and keep it as reference (which looks like what you are doing now), although remember to cache it (always if you have enough space).
If you actually do need to read it at streaming time (maybe you receive your query key from the stream), then try to do it once in a foreachPartition and save it in a local variable.
I had a storm topology that reads records from kafka, extracts timestamp present in the record, and does a lookup on hbase table, apply business logic, and then updates the hbase table with latest values in the current record!!
I have written a custom hbase bolt extending BaseRichBolt, where, the code, does a lookup on the hbase table and apply some business logic on the message that has been read from kafka, and then updates the hbase table with latest data!
The problem i am seeing is, some times, the bolt is receiving/processing the records in a jumbled order, due to which my application is thinking that a particular record is already processed, and ignoring the record!!! Application is not processing a serious amount of records due to this!!
For Example:
suppose there are two records that are read from kafka, one record belongs to 10th hour and second records belongs to 11th hour...
My custom HBase bolt, processing the 11th hour record first... then reading/processing the 10th hour record later!! Because, 11th hour record is processed first, application is assuming 10th record is already processed and ignoring the 10th hour record from processing!!
Can someone pls help me understand, why my custom hbase bolt is not processing the records in order it receive ?
should i have to mention any additional properties to ensure, the bolt processes the records in the order it receives ? what are possible alternatives i can try to fix this ?
FYI, i am using field grouping for hbase bolt, thru which i want to ensure, all the records of a particular user goes into same task!! Nevertheless to mention, thinking field grouping might causing the issue, reduces the no.of tasks for my custom hbase bolt to 1 task, still the same issue!!
Wondering why hbase bolt is not reading/processing records in the order it receives !!! Please someone help me with your thoughts!!
Thanks a lot.
Kafka doesn't provide order of messages in multiple partition.
So theres no orderring when you read messages. To avoid that, you need to create kafka topic with a single partition, but you will loose parallelism advantage.
Kafka guarantees ordering by partition not by topic. Partitioning really serves two purposes in Kafka:
It balances data and request load over brokers
It serves as a way to divvy up processing among consumer processes while allowing local state and preserving order within the partition.
For a given use case you may care about only #2. Please consider using Partitioner as part of you Producer using ProducerConfig.PARTITIONER_CLASS_CONFIG. The default Java Producer in .9 will try to level messages across all available partitions. https://github.com/apache/kafka/blob/6eacc0de303e4d29e083b89c1f53615c1dfa291e/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
You can create your own with something like this:
return hash(key)%num_partitions
I am writing a spark streaming app with online streaming data compared to basic data which i broadcast into each computing node. However, since the basic data is updated daily, i need to update the broadcasted variable daily too. The basic data resides on hdfs.
Is there a way to do this? The update is not related to any online streaming results, just say at 12:00 am everyday. Moreover, if there is such a way, will the updating process block spark streaming computing jobs?
Refer to the last answer in the thread you referred. Summary - instead of sending the data, send the caching code to update data at the needed interval
Create CacheLookup object that updates daily#12 am
Wrap that in Broadcast variable
Use CacheLookup as part of streaming logic