Has anyone faced a situation where NiFi omits some records when doing an incremental fetch with a maximum-value column? We have noticed that, for the column we use as the maximum value, some records are inserted later than their column value suggests. Is there any way to investigate such a situation?
A pagination system works with a total record count and an offset that specifies where to start paginating.
How can I get the total record count and the current offset in Kafka from the console?
Kafka doesn't paginate. A topic is a sequential log of events.
However, your consumer group has an initial or stored offset, and on the next poll it will read up to max.poll.records records, i.e. the next "page" after that offset.
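For illustration, here is a minimal consumer sketch that treats max.poll.records as the page size and the committed group offset as the page cursor; the broker address, group id, and topic name are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class PagedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pager");                   // placeholder group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");             // the "page size"
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder topic
            // Each poll returns at most max.poll.records, i.e. one "page",
            // starting at the group's committed offset.
            ConsumerRecords<String, String> page = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : page) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
            consumer.commitSync(); // advance the "cursor" to the start of the next page
        }
    }
}
```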
If you want to count the number of records in a non-compacted topic, you can use the GetOffsetShell tool to query the earliest and latest offsets of each partition and then take the difference. For a compacted topic there are gaps in those numbers, and the only reliable way to count records is to consume the entire topic.
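If you would rather do this programmatically than with GetOffsetShell, a rough equivalent with the Java AdminClient (assuming a 3.x kafka-clients dependency; broker address and topic name are placeholders) is to fetch the earliest and latest offset per partition and sum the differences:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class TopicRecordCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        String topic = "my-topic";                                               // placeholder topic

        try (Admin admin = Admin.create(props)) {
            // Discover the topic's partitions.
            List<TopicPartition> partitions = admin.describeTopics(List.of(topic))
                    .allTopicNames().get().get(topic).partitions().stream()
                    .map(p -> new TopicPartition(topic, p.partition()))
                    .collect(Collectors.toList());

            Map<TopicPartition, OffsetSpec> earliestSpec = partitions.stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.earliest()));
            Map<TopicPartition, OffsetSpec> latestSpec = partitions.stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));

            var earliest = admin.listOffsets(earliestSpec).all().get();
            var latest = admin.listOffsets(latestSpec).all().get();

            // latest - earliest per partition; exact for a non-compacted topic,
            // only an upper bound for a compacted one.
            long count = partitions.stream()
                    .mapToLong(tp -> latest.get(tp).offset() - earliest.get(tp).offset())
                    .sum();
            System.out.println("Record count: " + count);
        }
    }
}
```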
It is common to require ordering within a partition of a given Kafka topic; that is, messages with the same key should go to the same partition. Now, if I want to add a new partition to a running topic, how can I do that and keep the consistency?
To my understanding, the default partitioning strategy is to hash the key modulo the number of partitions. When the number of partitions changes (e.g. from 4 to 5), some messages might end up in a different partition than earlier messages with the same key.
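For example, assuming the default murmur2-based partitioner of the Java producer (the key below is just a made-up example), the target partition can shift when the partition count grows:

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;

public class PartitionShiftDemo {
    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8); // example key

        int before = Utils.toPositive(Utils.murmur2(key)) % 4;   // 4 partitions
        int after  = Utils.toPositive(Utils.murmur2(key)) % 5;   // after adding a 5th

        // For many keys these two values differ, so messages produced after the
        // resize can land on a different partition than older messages with the
        // same key, breaking per-key ordering across the boundary.
        System.out.printf("partition before=%d, after=%d%n", before, after);
    }
}
```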
I can imagine implementing consistent hashing to customize the partitioning behavior, but it might be too intrusive.
Or I could just stop all producers until all messages are consumed, then add the new partition and restart all producers.
Any better ideas?
As you said, when you increase the number of partitions in a topic, you will definitely lose the ordering of messages with the same key.
If you try to implement a customized partitioner to have a consistent assignment of a key to a partition, you wouldn't really use the new partition(s).
I would create a new topic with the desired amount of partitions and let the producer write into that new topic. As soon as the consumers of the old topic have processed all messages (i.e. consumer lag = 0) you could let the consumers read from the new topic.
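To make the trade-off of the custom-partitioner route concrete, here is a sketch of a partitioner that keeps hashing against the original partition count (4 is an assumption for illustration): existing keys keep their old partitions, but, as said above, the newly added partitions never receive keyed records.

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

import java.util.Map;

// Sketch only: preserves the key-to-partition mapping from before the resize
// by ignoring the current partition count, at the cost of never using the
// partitions that were added later.
public class PinnedCountPartitioner implements Partitioner {

    private static final int ORIGINAL_PARTITIONS = 4; // assumption: the old partition count

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        if (keyBytes == null) {
            return 0; // simplification; real code should spread unkeyed records
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % ORIGINAL_PARTITIONS;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

You would register it via the producer's partitioner.class config, but given the wasted partitions, the new-topic migration described above is usually the cleaner path.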
We have real-time data coming into our system, and we have online queries that we need to serve. In order to serve these online queries faster, we are doing some pre-processing of the data.
Now my question is how to preprocess this online real-time data. There should be a way for me to figure out whether the data has already been processed or not. To make this distinction, I have the following approaches:
1. I can have a flag which says whether the data is processed or unprocessed, based on which I can decide whether to process it.
2. I can have a column family where I insert the data with a TTL, plus a topic in a message bus like Kafka which gives me the row identifier in Cassandra, so that I can process that row in Cassandra.
3. I can have a column family per day, plus a topic in a message bus like Kafka which gives me the row identifier in the corresponding column family.
4. I can have a keyspace per day, plus a topic in a message bus like Kafka which gives me the row identifier in the corresponding column family.
I read somewhere that if the number of deletions increases, then the number of tombstones increases and results in slow query times. Now I am confused about which of the above four approaches to choose, or whether there is a better way to solve this.
According to the DataStax blog, the third option might be a better fit:
Cassandra Anti-patterns
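As a rough sketch of that third option (one table per day, with the row identifier then published to a Kafka topic for the pre-processing job), using the DataStax Java driver; the keyspace name, table prefix, and schema are assumptions:

```java
import com.datastax.oss.driver.api.core.CqlSession;

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

public class PerDayTableWriter {
    public static void main(String[] args) {
        // Assumes a local Cassandra node and an existing keyspace named "events_ks".
        try (CqlSession session = CqlSession.builder().withKeyspace("events_ks").build()) {

            String suffix = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE); // e.g. 20240115
            String table = "raw_events_" + suffix; // one table ("column family") per day

            session.execute("CREATE TABLE IF NOT EXISTS " + table
                    + " (id uuid PRIMARY KEY, payload text)");

            UUID rowId = UUID.randomUUID();
            session.execute("INSERT INTO " + table + " (id, payload) VALUES (?, ?)",
                    rowId, "raw event payload");

            // rowId would then be produced to a Kafka topic so the pre-processing
            // job knows which row to pick up.
            System.out.println("Wrote row " + rowId + " to " + table);
        }
    }
}
```

Dropping an entire day's table once it is no longer needed avoids per-row deletes, and therefore the tombstone build-up you mention.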
Please tell me how HBase partitions a table across regionservers.
For example, let's say my row keys are integers from 0 to 10M and I have 10 regionservers.
Does this mean that the first regionserver will store all rows with keys 0 - 1M, the second 1M - 2M, the third 2M - 3M, ..., and the tenth 9M - 10M?
I would like my row key to be a timestamp, but in case most queries apply to the latest dates, all queries would be processed by only one regionserver; is that true?
Or maybe this data would be spread differently?
Or can I somehow create more regions than I have region servers, so that (following the example above) server 1 would hold keys 0 - 0.5M and 3M - 3.5M? That way my data would be spread more evenly. Is this possible?
update
I just found that there's an option hbase.hregion.max.filesize; do you think this will solve my problem?
WRT partitioning, you can read Lars' blog post on HBase's architecture or Google's Bigtable paper, which HBase "clones".
If your row key is only a timestamp, then yes the region with the biggest keys will always be hit with new requests (since a region is only served by a single region server).
Do you want to use timestamps in order to do short scans? If so, consider salting your keys (search Google for how Mozilla did it with Socorro).
Can you prefix the timestamp with any ID? For example, if you only request data for specific users, then prefix the ts with that user ID and it will give you a much better load distribution.
If not, then use UUIDs or anything else that will randomly distribute your keys.
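As an illustration of the salting/prefixing idea (the table name, column family, user ID, and bucket count below are made up for the example), an HBase client write could look like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedWrite {

    private static final int BUCKETS = 16; // assumed number of salt buckets

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) { // example table

            String userId = "user-42";               // example ID used as the prefix
            long ts = System.currentTimeMillis();

            // Prefixing the timestamp with a salt (or a user ID) spreads writes
            // across regions instead of always hitting the region with the
            // highest keys.
            int salt = (userId.hashCode() & 0x7fffffff) % BUCKETS;
            byte[] rowKey = Bytes.toBytes(String.format("%02d-%s-%d", salt, userId, ts));

            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
            table.put(put);
        }
    }
}
```

The trade-off is that a time-range scan now needs one scan per salt bucket (or per ID prefix) instead of a single contiguous scan.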
About hbase.hregion.max.filesize
Setting the maxfilesize on that table (which you can do with the shell) doesn't mean that each region will be exactly X MB (where X is the value you set). Let's say your row keys are all timestamps, which means each new row key is bigger than the previous one; it will therefore always be inserted into the region with the empty end key (the last one). At some point one of that region's files will grow bigger than maxfilesize (through compactions), and the region will be split around the middle. The lower keys end up in their own region, the higher keys in another one. But since your new row key is always bigger than the previous ones, you will only ever write to that newest region (and so on).
tl;dr: even if you had more than 1,000 regions, with this schema the region with the biggest row keys would always get the writes, which means that the region server hosting it becomes a bottleneck.
The option hbase.hregion.max.filesize, which is 256MB by default, sets the maximum region size; after reaching this limit, the region is split. This means that my data will be stored in multiple regions of 256MB, and possibly one smaller one.
So
I would like my row key to be a timestamp, but in case most queries apply to the latest dates, all queries would be processed by only one regionserver; is that true?
This is not true, because the latest data will also be split into regions of 256MB and stored on different regionservers.