RocksDB filename meaning used by Kafka Streams - apache-kafka-streams

Under /tmp/streams-my-application-id I found the files that RocksDB uses. My intention was to check their size with du -h.
Looking at the names, I'm curious what they mean. I suppose they are related to the Kafka Streams tasks and the partitions.
Does the prefix (0 or 1) refer to the topic, and the suffix to the partition?
This Kafka Streams app simply joins two topics with a KStream-KTable join; one topic is repartitioned and reduced into the KTable.
8,0K ./0_2
8,0K ./0_1
3,1M ./1_2/rocksdb/KSTREAM-REDUCE-STATE-STORE-0000000002
3,1M ./1_2/rocksdb
3,1M ./1_2
3,0M ./1_0/rocksdb/KSTREAM-REDUCE-STATE-STORE-0000000002
3,0M ./1_0/rocksdb
3,1M ./1_0
3,0M ./1_1/rocksdb/KSTREAM-REDUCE-STATE-STORE-0000000002
3,0M ./1_1/rocksdb
3,0M ./1_1
8,0K ./0_0

The directory names are derived from the sub-topology number and the partition number.
A Kafka Streams application is usually split into several sub-topologies (sub-topology 0, 1, 2, etc.). For stateful transformations, the state store directories use that numbering to build the directory name, as shown below:
<sub-topology-number>_<partition_number>
So the first number is the sub-topology and the second is the partition number:
8,0K ./0_2 // directory
8,0K ./0_1 // directory
3,1M ./1_2/rocksdb/KSTREAM-REDUCE-STATE-STORE-0000000002
And the name KSTREAM-REDUCE-STATE-STORE-0000000002 has the format
<Processor Node Type>-<Processor Node number>
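To make the naming scheme concrete, here is a minimal sketch that splits the directory names from the question into their two parts. The helper name parse_task_dir and the TaskDir type are made up for illustration; they are not part of Kafka Streams.

```python
from typing import NamedTuple


class TaskDir(NamedTuple):
    sub_topology: int
    partition: int


def parse_task_dir(name: str) -> TaskDir:
    """Split '<sub-topology-number>_<partition_number>' into its two numbers."""
    sub_topology, partition = name.split("_", 1)
    return TaskDir(int(sub_topology), int(partition))


# Directory names taken from the du output in the question.
dirs = ["0_2", "0_1", "1_2", "1_0", "1_1", "0_0"]
for d in sorted(dirs):
    task = parse_task_dir(d)
    print(f"{d}: sub-topology {task.sub_topology}, partition {task.partition}")
```

Read this way, the listing shows two sub-topologies with three partitions each, and only sub-topology 1 holds a RocksDB store.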

Related

How do I make two topics that have the same partition key and the same number of partitions land on the same consumer within a Kafka Streams application

I am trying to create a Kafka Streams service where:
1. I initialize a cache in a processor, which is then updated by consuming messages from a topic, say "nodeStateChanged", with a partition key, say locationId.
2. I need to check the node state when I consume another topic, say "Report", again keyed by the same locationId. Effectively I am joining with the table created by nodeStateChanged.
How do I ensure that all the updates for nodeStateChanged land on the same instance as the Report topic, so that the lookup for a location is possible when a new report is received? Do 1 and 2 need to be created by the same topology, or is it okay to create two separate topologies that share the same APPLICATION_ID_CONFIG?
You don't need to do anything. Kafka Streams will always co-partition topics. That is, if you have a sub-topology that reads from multiple topics with N partitions each, you get N tasks, and each task processes the corresponding partitions: task 0 processes partition 0 of both input topics, task 1 processes partition 1 of both input topics, etc.
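The task layout described above can be sketched as a plain mapping. This is only an illustration of the idea, not Kafka Streams code; the function name assign_tasks and the partition count of 3 are assumptions.

```python
from typing import Dict, List, Tuple


def assign_tasks(topics: List[str], num_partitions: int) -> Dict[int, List[Tuple[str, int]]]:
    """Task p gets partition p of every input topic of the sub-topology."""
    return {p: [(topic, p) for topic in topics] for p in range(num_partitions)}


# Topic names from the question; 3 partitions is a hypothetical value.
tasks = assign_tasks(["nodeStateChanged", "Report"], 3)
for task_id, partitions in tasks.items():
    print(task_id, partitions)
```

Because both topics are keyed by locationId and have the same number of partitions, a given locationId ends up in the same partition number of both topics, and therefore in the same task.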

Multiple data collectors for a job without duplicating records in StreamSets

I have a directory consisting of multiple files, which is shared across multiple data collectors. I have a job that processes those files and writes them to the destination. Because the number of records is huge, I want to run the job on multiple data collectors, but when I tried, I got duplicate entries in my destination. Is there a way to achieve this without duplicating the records? Thanks
You can use Kafka for this. For example:
Create one pipeline that reads the file names and sends them to a Kafka topic via a Kafka producer.
Create a pipeline with a Kafka consumer as the origin and set its consumer group property. This pipeline reads the file names and processes the files.
Now you can run multiple pipelines with a Kafka consumer in the same consumer group. Kafka will then balance the messages within the consumer group by itself, and you will not get duplicates.
Also set the 'acks' = 'all' property on the Kafka producer, so that file-name messages are not lost on the way (this setting guards against lost messages rather than duplicates).
With this scheme you can run up to as many collectors as your Kafka topic has partitions.
Hope it will help you.
Copying my answer from Ask StreamSets:
At present there is no way to automatically partition directory contents across multiple data collectors.
You could run similar pipelines on multiple data collectors and manually partition the data in the origin using different character ranges in the File Name Pattern configurations. For example, if you had two data collectors, and your file names were distributed across the alphabet, the first instance might process [a-m]* and the second [n-z]*.
One way to do this would be by setting File Name Pattern to a runtime parameter - for example ${FileNamePattern}. You would then set the value for the pattern in the pipeline's parameters tab, or when starting the pipeline via the CLI, API, UI or Control Hub.
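As a rough illustration of the manual partitioning idea, the sketch below checks which collector would pick up a given file when each one uses a different character range. It uses Python's own glob matching, which only approximates how a File Name Pattern is evaluated; the collector names and file names are hypothetical.

```python
from fnmatch import fnmatchcase

# One pattern per data collector, as in the [a-m]* / [n-z]* example above.
PATTERNS = {"collector-1": "[a-m]*", "collector-2": "[n-z]*"}


def owners(file_name: str) -> list:
    """Return the collectors whose pattern matches the file name."""
    return [name for name, pattern in PATTERNS.items() if fnmatchcase(file_name, pattern)]


for f in ["alpha.csv", "mike.csv", "november.csv", "zulu.csv", "9-misc.csv"]:
    print(f, "->", owners(f))
```

A check like this also makes gaps visible: a file starting with a digit or an uppercase letter matches neither range, so the ranges need to cover every name that can occur.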

Kafka Streams Custom processing

I have a requirement to process huge files, and we may end up processing multiple files in parallel.
Each row in a file would be processed according to a rule specific to that file.
Once processing is complete, we would generate an output file from the processed records.
One option I have thought of: each message pushed to the broker contains the row data + the rule to be applied + a correlation ID (an identifier for that particular file).
I plan to use Kafka Streams and create a topology with a processor that receives the rule with the message, processes it, and sinks it.
However (I am new to Kafka Streams, so I may be wrong):
The messages will not be processed in sequential order, as we are processing multiple files in tandem (which is fine, because I have no requirement for ordering, and moreover I want to keep it decoupled). But then how would I bring it to a logical closure, i.e. how would my processor know that all the records of a file have been processed?
Do I need to maintain the records (correlation ID, number of records, etc.) in something like Ignite? I am unsure about that, though.
I guess you could set aside a special key/value record that is sent to the topic at the end of the file to signify the closure of the file.
Say the record has a unique key such as -1, which signifies the EOF.
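A minimal sketch of that end-of-file marker idea, in plain Python rather than Kafka Streams code. It assumes (beyond what the answer says) that every message value carries the correlation ID and that the marker also carries the file's total row count, so the file can still be closed if the marker overtakes some rows. The names FileTracker, correlation_id and row_count are hypothetical.

```python
EOF_KEY = -1  # special key that marks the end of a file, as suggested above


class FileTracker:
    """Tracks per-file progress and reports when a file is complete."""

    def __init__(self):
        self.seen = {}      # correlation id -> rows processed so far
        self.expected = {}  # correlation id -> row count announced by the EOF marker

    def on_message(self, key, value):
        """Return the correlation id if this message completes a file, else None."""
        cid = value["correlation_id"]
        if key == EOF_KEY:
            self.expected[cid] = value["row_count"]
        else:
            self.seen[cid] = self.seen.get(cid, 0) + 1
        if cid in self.expected and self.seen.get(cid, 0) == self.expected[cid]:
            return cid
        return None


tracker = FileTracker()
first = tracker.on_message(0, {"correlation_id": "file-1", "row": "a"})
marker = tracker.on_message(EOF_KEY, {"correlation_id": "file-1", "row_count": 2})
last = tracker.on_message(1, {"correlation_id": "file-1", "row": "b"})
print(first, marker, last)
```

In a real topology the two dictionaries would live in a state store rather than in memory, which would cover the bookkeeping the question considered keeping in an external system.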

Filtering AVRO Data from 2 datasets

Use-case:
I have two datasets (file sets): Machine (parent) and Alerts (child).
Their data is stored in two Avro files, machine.avro and alert.avro.
The Alert schema has a machineId column of type int.
How can I filter data from Machine when the filter also depends on Alert (one-to-many)?
E.g. get all machines that have an alert time between two timestamps.
An example with source code would be a great help.
Thanks in advance.
I got an answer in another thread:
Mapping through two data sets with Hadoop
Posting the comments from that thread:
According to the documentation, the MapReduce framework includes the following steps:
1. Map
2. Sort/Partition
3. Combine (optional)
4. Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer only needs to output the key and the updated value from Set B if it exists; otherwise, it outputs the key and the original value from Set A. To distinguish between values from Set A and Set B, try setting a flag on the output value from the Mapper.
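The tagged reduce-side join described above can be sketched in a few lines of plain Python. This simulates the map, sort/partition and reduce steps in memory and is not Hadoop code; the function names and sample records are made up for illustration.

```python
from collections import defaultdict


def map_tagged(records, tag):
    """Mapper: flag every value with the set it came from ('A' or 'B')."""
    for key, value in records:
        yield key, (tag, value)


def shuffle(pairs):
    """Sort/partition step: group all tagged values by key, in key order."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return dict(sorted(groups.items()))


def reduce_join(key, tagged_values):
    """Reducer: prefer the updated value from Set B, else the original from Set A."""
    b_values = [v for tag, v in tagged_values if tag == "B"]
    if b_values:
        return key, b_values[-1]
    a_values = [v for tag, v in tagged_values if tag == "A"]
    return key, a_values[0]


set_a = [(1, "original-1"), (2, "original-2"), (3, "original-3")]
set_b = [(2, "updated-2")]
pairs = list(map_tagged(set_a, "A")) + list(map_tagged(set_b, "B"))
joined = [reduce_join(key, values) for key, values in shuffle(pairs).items()]
print(joined)
```

For the machine/alert case, the machineId would play the role of the key, and the reducer would apply the time-range filter to the alert values before deciding whether to emit the machine.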

HBase bulk load usage

I am trying to import some HDFS data to an already existing HBase table.
The table I have was created with 2 column families, and with all the default settings that HBase comes with when creating a new table.
The table is already filled up with a large volume of data, and it has 98 online regions.
Its row keys have the following form (simplified version):
2-CHARS_ID + 6-DIGIT-NUMBER + 3 X 32-CHAR-MD5-HASH.
Example of key: IP281113ec46d86301568200d510f47095d6c99db18630b0a23ea873988b0fb12597e05cc6b30c479dfb9e9d627ccfc4c5dd5fef.
The data I want to import is on HDFS, and I am using a Map-Reduce process to read it. I emit Put objects from my mapper, which correspond to each line read from the HDFS files.
The existing data has keys which will all start with "XX181113".
The job is configured with:
HFileOutputFormat.configureIncrementalLoad(job, hTable)
Once I start the process, I see it configured with 98 reducers (equal to the online regions the table has), but the issue is that 4 reducers got 100% of the data split among them, while the rest did nothing.
As a result, I see only 4 folder outputs, which have a very large size.
Are these files corresponding to 4 new regions which I can then import to the table? And if so, why only 4, while 98 reducers get created?
Reading HBase docs
In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
confused me even more as to why I get this behaviour.
Thanks!
The number of reducers that actually receive data doesn't depend on the number of regions in the table, but on how the data is split across regions (each region contains a range of keys). Since you mention that all your new data starts with the same prefix, it likely fits into only a few regions.
You can pre-split your table so that the new data is divided among more regions.
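To see why a shared prefix concentrates the load, here is a small sketch of range partitioning in the spirit of a total-order partitioner: each key goes to the reducer whose key range contains it. It is plain Python, not HBase or Hadoop code, and the split points and row keys are hypothetical, shortened values.

```python
from bisect import bisect_right
from collections import Counter


def reducer_for(row_key: str, split_points: list) -> int:
    """Index of the key range (and thus the reducer) that contains row_key."""
    return bisect_right(split_points, row_key)


# Hypothetical sorted region boundaries; 7 split points give 8 ranges/reducers.
split_points = ["AA500000", "IP200000", "IP281113", "IP600000",
                "XX100000", "XX181113a", "XX900000"]

# Hypothetical new row keys that all share the prefix "XX181113".
new_keys = ["XX1811130a1b", "XX1811133c4d", "XX181113b5e6", "XX181113ff00"]

load = Counter(reducer_for(key, split_points) for key in new_keys)
print(f"{len(load)} of {len(split_points) + 1} reducers receive data: {dict(load)}")
```

Every other reducer still starts, because there is one per region, but its key range contains none of the new keys, which matches the 4-out-of-98 picture in the question.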
