AWS Hive + Kinesis on EMR: understanding checkpointing

I have an AWS Kinesis stream and I created an external table in Hive pointing at it. I then created a DynamoDB table for the checkpoints, and in my Hive query I set the following properties as described here:
set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=my_dynamodb_table;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=my_logical_name;
set kinesis.checkpoint.iteration.no=0;
I have the following questions:
Do I always have to start with iteration.no set to 0?
Does this always start from the beginning of the stream (oldest Kinesis record about to be evicted)?
Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?
To re-execute the script on the same data, is it enough to re-run the query with the same iteration number?
If I execute a select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?
Given the DynamoDB checkpoint entry:
{"startSeqNo":"1234",
"endSeqNo":"5678",
"closed":false}
What's the meaning of the closed field?
Are sequence numbers incremental, and is there a relation between the start and end (e.g. end - start = number of records read)?
I noticed that sometimes there is only the endSeqNo (no startSeqNo); how should I interpret that?
I know that it's a lot of questions, but I could not find these answers in the documentation.

Check out the Kinesis documentation and the Kinesis Storage Handler Readme, which contain answers to many of your questions.
Do I always have to start with iteration.no set to 0?
Yes, unless you are doing some advanced logic which requires you to skip a known or already processed part of the stream
Does this always start from the beginning of the stream (oldest Kinesis record about to be evicted)?
Yes
Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?
This is handled by the Hive script, since it queries all data in the Kinesis stream at each run
To re-execute the script on the same data, is it enough to re-run the query with the same iteration number?
As Kinesis data is a 24-hour time window, the data has (possibly) changed since your last query, so you would probably want to query all records again in the Hive job
If I execute a select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?
Yes, you would expect the results to change as the stream changes
Given the DynamoDB checkpoint entry:
What's the meaning of the closed field?
Although this is an internal detail of the Kinesis Storage Handler, I believe this indicates whether the shard is a parent shard, i.e. whether it is open and accepting new data or closed and no longer accepting new data into the shard. If you have scaled your stream up or down, parent shards exist for 24 hours and contain the data written before you scaled; however, no new data will be inserted into these shards.
Are sequence numbers incremental, and is there a relation between the start and end (e.g. end - start = number of records read)?
The only guidance Amazon provides on this is that sequence numbers generally increase over time.
I noticed that sometimes there is only the endSeqNo (no startSeqNo); how should I interpret that?
This means the shard is open and still accepting new data (not a parent shard)
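If you want to look at those fields yourself, you can list the entries in the checkpoint table directly. A minimal sketch with the AWS SDK for Java v1, assuming the table name from the question and the default credentials/region configuration:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;
import java.util.Map;

public class InspectCheckpoints {
    public static void main(String[] args) {
        // Uses the default credential provider chain and region settings
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Scan the checkpoint table maintained by the Kinesis Storage Handler
        ScanResult result = client.scan(new ScanRequest().withTableName("my_dynamodb_table"));

        // Each item holds the checkpoint state for one shard/iteration
        for (Map<String, AttributeValue> item : result.getItems()) {
            System.out.println(item);
        }
    }
}

Each printed item should show the startSeqNo/endSeqNo/closed attributes discussed above, which makes it easier to correlate checkpoint entries with open and closed shards.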

Related

Inconsistent streaming behavior with Logstash (ELK)

I have an index with several flat fields and several nested fields. I am trying to stream info from SQL Server through Logstash into a nested field by a specific Id.
When I stream the data for only one Id, it passes fully and successfully without any problem. On the other hand, when I try to stream the data for more than one Id, the info that is inserted into the index is partial for some reason.
Note: The query is sorted by id.
Moreover, different attempts at streaming the data load different amounts of information.
For example, suppose the full info contains 15 rows. In one try only 2 rows are obtained, but in another try 14 rows are obtained, seemingly completely arbitrarily.
Does anyone have any idea what can cause this strange behavior? I would be happy for any help.
Thanks!
This is because of the Logstash execution model, where several workers can work in parallel and your events might be processed by different worker threads.
If you want consistent loading behavior, you need to execute your pipeline with a single worker (-w 1 on the command line).

NiFi record counts

I am getting files from a remote server using NiFi; my files are as follows:
timestamp (ms), nodeID,value
12345,x,12.4
12346,x,12.7
12348,x,13.4
12356,x,13.6
12355,y,12.0
Right now I just get, fetch, and split the lines and send them to Kafka. Beforehand, though, I need to apply a checksum approach to my records and aggregate them based on timestamp. What I need to do is add an additional column to my content and count the records based on aggregated timestamps, for example aggregating per 10 milliseconds and per nodeID:
timestamp (ms), nodeID,value, counts
12345,x,12.4,3
12346,x,12.7,3
12348,x,13.4,3
12356,x,13.6,1
12355,y,12.0,1
How can I do the above process in NiFi? I am totally new to NiFi but need to add the above functionality to my NiFi flow. I am currently using the NiFi process below.
This may not answer your question directly, but you should consider refactoring your flow to use the "record" processors. It would greatly simplify things and would probably get you closer to being able to do the aggregation.
The idea is to not split up the records, and instead process them in place. Given your current flow, the 4 processors after FetchSFTP would likely change to a single ConvertRecord processor that converts CSV to JSON. You would first need to define a simple Avro schema for your data.
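For reference, a simple Avro schema for the CSV above could look like this (the record and field names are just an assumption based on your header row):

{
  "type": "record",
  "name": "NodeReading",
  "fields": [
    { "name": "timestamp", "type": "long" },
    { "name": "nodeID", "type": "string" },
    { "name": "value", "type": "double" }
  ]
}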
Once you have the record processing set up, you might be able to use PartitionRecord to partition the records by the node id, and then from there the missing piece would be how to count by the timestamps.
Some additional resources...
https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
https://www.slideshare.net/BryanBende/apache-nifi-record-processing

Reading an AWS DynamoDB Stream

I want to do an incremental DynamoDB backup to S3 using DynamoDB Streams. I have a Lambda that reads the DynamoDB stream and writes files to S3. In order to mark already-read shards, I log the ExclusiveStartShardId into a configuration file.
What I do is:
Describe the stream (using the logged ExclusiveStartShardId)
Get stream's shards
For all shards that are CLOSED (i.e. have an EndingSequenceNumber) I do the following:
Get a shard iterator for that shard (shardIteratorType: 'TRIM_HORIZON')
Iterate through the shard and fetch records until NextShardIterator becomes null
The problem here is that I read only closed shards, so in order to get new records I must wait an undetermined amount of time for a shard to be closed.
It seems that the last shard is usually in the OPEN state (it has NO EndingSequenceNumber). If I remove the check for EndingSequenceNumber from the pseudo code above, I end up with an infinite loop, because when I hit the last shard NextShardIterator is always present. I also cannot simply check whether zero items were fetched, because there could be "gaps" in the shard.
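In AWS SDK for Java v1 terms, the steps above look roughly like this (a sketch; the stream ARN is a placeholder, and readLoggedShardId/writeToS3 stand in for my configuration-file and S3 logic):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreams;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBStreamsClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;
import java.util.List;

public class IncrementalBackup {
    public static void main(String[] args) {
        AmazonDynamoDBStreams streams = AmazonDynamoDBStreamsClientBuilder.defaultClient();
        String streamArn = "arn:aws:dynamodb:..."; // placeholder

        // 1. Describe the stream, starting after the last shard already processed
        DescribeStreamResult describe = streams.describeStream(new DescribeStreamRequest()
                .withStreamArn(streamArn)
                .withExclusiveStartShardId(readLoggedShardId()));

        // 2. Get the stream's shards
        for (Shard shard : describe.getStreamDescription().getShards()) {
            // 3. Only process CLOSED shards (those with an EndingSequenceNumber);
            //    the last, OPEN shard is skipped, which is exactly the problem
            if (shard.getSequenceNumberRange().getEndingSequenceNumber() == null) {
                continue;
            }

            // Get a TRIM_HORIZON iterator for this shard
            String iterator = streams.getShardIterator(new GetShardIteratorRequest()
                    .withStreamArn(streamArn)
                    .withShardId(shard.getShardId())
                    .withShardIteratorType(ShardIteratorType.TRIM_HORIZON))
                    .getShardIterator();

            // Fetch records until NextShardIterator becomes null
            while (iterator != null) {
                GetRecordsResult records = streams.getRecords(new GetRecordsRequest()
                        .withShardIterator(iterator));
                writeToS3(records.getRecords());
                iterator = records.getNextShardIterator();
            }
        }
    }

    private static String readLoggedShardId() { return null; } // stub: read from the config file
    private static void writeToS3(List<Record> records) { }    // stub: serialize and upload
}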
In this tutorial numChanges is used in order to stop the infinite loop http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.LowLevel.Walkthrough.html#Streams.LowLevel.Walkthrough.Step5
What is the best approach in this situation?
I also found a similar question: Reading data from dynamodb streams. Unfortunately, I could not find the answer to my question there.
Why not attach the DynamoDB stream as an event source for your Lambda function? Then Lambda will take care of polling the stream and calling your function when necessary. See this for details.
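As a rough sketch of what that could look like with the aws-lambda-java-events library (the S3 writing itself is left as a stub):

import com.amazonaws.services.dynamodbv2.model.StreamRecord;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

public class IncrementalBackupHandler implements RequestHandler<DynamodbEvent, Void> {
    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        // Lambda polls the stream for you and invokes this handler with batches of
        // records, so there is no shard or iterator bookkeeping on your side
        for (DynamodbStreamRecord record : event.getRecords()) {
            String eventName = record.getEventName(); // INSERT, MODIFY or REMOVE
            writeToS3(eventName, record.getDynamodb());
        }
        return null;
    }

    private void writeToS3(String eventName, StreamRecord change) {
        // stub: serialize the keys/new image/sequence number and write the change to S3
    }
}

With the stream configured as the function's event source, keeping track of what has already been read is also handled by Lambda.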

Storm bolt doesn't guarantee processing of records in the order they are received?

I have a Storm topology that reads records from Kafka, extracts the timestamp present in the record, does a lookup on an HBase table, applies business logic, and then updates the HBase table with the latest values from the current record.
I have written a custom HBase bolt extending BaseRichBolt, where the code does a lookup on the HBase table, applies some business logic to the message that has been read from Kafka, and then updates the HBase table with the latest data.
The problem I am seeing is that sometimes the bolt receives/processes the records in a jumbled order, due to which my application thinks that a particular record has already been processed and ignores the record. The application fails to process a serious number of records because of this.
For Example:
Suppose there are two records read from Kafka; one record belongs to the 10th hour and the second belongs to the 11th hour.
My custom HBase bolt processes the 11th-hour record first, then reads/processes the 10th-hour record later. Because the 11th-hour record is processed first, the application assumes the 10th-hour record has already been processed and excludes it from processing.
Can someone please help me understand why my custom HBase bolt is not processing the records in the order it receives them?
Do I have to set any additional properties to ensure the bolt processes the records in the order it receives them? What alternatives can I try to fix this?
FYI, I am using fields grouping for the HBase bolt, through which I want to ensure that all the records of a particular user go to the same task. Thinking fields grouping might be causing the issue, I even reduced the number of tasks for my custom HBase bolt to 1, but I still see the same issue.
I am wondering why the HBase bolt is not reading/processing records in the order it receives them. Please help me with your thoughts.
Thanks a lot.
Kafka doesn't guarantee the order of messages across multiple partitions.
So there is no ordering when you read messages. To avoid that, you need to create a Kafka topic with a single partition, but you will lose the advantage of parallelism.
Kafka guarantees ordering by partition not by topic. Partitioning really serves two purposes in Kafka:
It balances data and request load over brokers
It serves as a way to divvy up processing among consumer processes while allowing local state and preserving order within the partition.
For a given use case you may care about only #2. Please consider using a custom Partitioner as part of your Producer, via ProducerConfig.PARTITIONER_CLASS_CONFIG. The default Java producer in 0.9 will try to spread messages across all available partitions. https://github.com/apache/kafka/blob/6eacc0de303e4d29e083b89c1f53615c1dfa291e/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
You can create your own with something like this:
return hash(key)%num_partitions
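Fleshed out, a custom partitioner along those lines might look like the following (a sketch against the 0.9+ Java client; it assumes a non-null key, e.g. the user id):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class UserKeyPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // hash(key) % num_partitions, masked so the result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}

You would register it on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, UserKeyPartitioner.class.getName()); records with the same key (e.g. the same user id) then always land on the same partition, which preserves their relative order.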

Perform actions before end of the micro-batch in Spark Streaming

Is there a way to perform some action at the end of each micro-batch of a DStream in Spark Streaming? My aim is to compute the number of events processed by Spark. Spark Streaming gives me some numbers, but the average also seems to include zero values (as some micro-batches are empty).
For example, I collect some statistics and want to send them to my server, but the object that collects the data only exists during a certain batch and is initialized from scratch for the next batch. I would love to be able to call my "finish" method before the batch is done and the object is gone. Otherwise I lose the data that has not been sent to my server.
Maybe you can use StreamingListener:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
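If wiring a listener into your job is awkward, another common pattern for per-batch work is foreachRDD, which runs on the driver once per micro-batch and gives you a natural place for a per-batch "finish" step. A minimal sketch in Java (sendToServer is a stand-in for your own statistics object):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaDStream;

public class BatchStats {
    // Call this while building the streaming job, before starting the context
    static void attach(JavaDStream<String> events) {
        events.foreachRDD((JavaRDD<String> rdd, Time batchTime) -> {
            // This body runs on the driver at every batch interval
            long count = rdd.count();
            if (count > 0) { // skip empty micro-batches
                sendToServer(batchTime.milliseconds(), count);
            }
        });
    }

    private static void sendToServer(long batchTimeMs, long count) {
        // stub: push the per-batch statistics to your server
    }
}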
