Designing a composite rowkey for HBase

I am trying to create an HBase table with the following structure.
**rowkey** |**CF1**
(customerid,txtimestamp)|customerid,amount
I want to query records by customerid for a given time range.
My rowkey uses the customer id in reverse order followed by the transaction timestamp.
Long reversedId = Long.valueOf(new StringBuilder(customerid).reverse().toString());
byte[] rowKey = Bytes.add(Bytes.toBytes(reversedId), Bytes.toBytes(txtimestamp.getTime()));
How do I design the row key so that the data gets split across 4 region servers?
Is there an efficient row key design method?

You don't need to reverse customer_id; it serves no purpose here.
If you want to split all data across 4 regions, you can prefix all keys with values 0-3, for example:
int partition = (int) (customer_id % 4); // assumes customer_id is a non-negative long
byte[] rowKey = Bytes.add(
Bytes.toBytes(String.valueOf(partition)),
Bytes.toBytes(String.valueOf(customer_id)),
Bytes.toBytes(txTimestamp.getTime())
);
In this case you need to create your table with split keys using this HBaseAdmin method
public void createTable(final HTableDescriptor desc, byte [][] splitKeys)
Split keys would be :
byte[][] splitKeys = new byte[3][];
splitKeys[0] = "1".getBytes();
splitKeys[1] = "2".getBytes();
splitKeys[2] = "3".getBytes();
So all keys starting with 0 go to the first region, keys starting with 1 go to the second region, and so on.
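To make the key layout concrete, here is a minimal sketch that uses only the JDK, so it runs without an HBase dependency. The class name `SaltedRowKey`, the constant `NUM_BUCKETS` and the helper `regionOf` are hypothetical names made up for this illustration; in real HBase code `Bytes.add` and `Bytes.toBytes` would do the concatenation, and the region lookup is done by HBase itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch only: plain JDK stand-in for the Bytes.add(...) key layout.
final class SaltedRowKey {
    static final int NUM_BUCKETS = 4; // assumption: one salt bucket per region

    // Salt bucket in [0, NUM_BUCKETS); floorMod keeps it non-negative.
    static int bucketOf(long customerId) {
        return (int) Math.floorMod(customerId, (long) NUM_BUCKETS);
    }

    // Layout: salt digit (ASCII) | customer id (decimal string) | timestamp (8 bytes, big-endian).
    static byte[] rowKey(long customerId, long txTimestampMillis) {
        byte[] salt = String.valueOf(bucketOf(customerId)).getBytes(StandardCharsets.UTF_8);
        byte[] id = String.valueOf(customerId).getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(salt.length + id.length + Long.BYTES)
                .put(salt).put(id).putLong(txTimestampMillis).array();
    }

    // Unsigned lexicographic byte comparison, the order in which row keys are sorted.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Index of the region a key falls into, given sorted split keys.
    static int regionOf(byte[] key, byte[][] splitKeys) {
        int region = 0;
        for (byte[] split : splitKeys) {
            if (compare(key, split) >= 0) {
                region++;
            }
        }
        return region;
    }

    public static void main(String[] args) {
        byte[][] splitKeys = {
            "1".getBytes(StandardCharsets.UTF_8),
            "2".getBytes(StandardCharsets.UTF_8),
            "3".getBytes(StandardCharsets.UTF_8)
        };
        for (long customerId : new long[] {8L, 9L, 10L, 11L}) {
            byte[] key = rowKey(customerId, System.currentTimeMillis());
            System.out.println("customer " + customerId + " -> region " + regionOf(key, splitKeys));
        }
    }
}
```

Note that a query for one customer has to recompute the same salt from the customer id before building the scan range, otherwise the rows cannot be found.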

Related

Spring batch Remote partitioning : Pushing Huge data in kafka during partition

I have implemented Spring Batch remote partitioning. Now I have to push 10 billion ids, divided into partitions. The ids will be fetched from Elastic and put into partitions, which in turn will be pushed to Kafka.
@Override
public Map<String, ExecutionContext> partition(int gridSize) {
Map<String, ExecutionContext> map = new HashMap<>(gridSize);
AtomicInteger partitionNumber = new AtomicInteger(1);
try {
for(int i=0;i<n;i++){
List<Integer> ids = //fetch id from elastic
map.put("partition" + partitionNumber.getAndIncrement(), context);
}
System.out.println("Partitions Created");
} catch (IOException e) {
e.printStackTrace();
}
return map;
}
I cannot fetch all ids and put them in the map at once, otherwise I will run out of memory. I want a batch of ids to be pushed to the queue before the next ids are fetched.
Can this be done with Spring Batch?
If you want to use partitioning, you have to find a way to partition the input dataset with a given key. Without a partition key, you can't really use partitioning (with or without Spring Batch).
If your IDs are defined by a sequence that can be divided into partitions, you don't have to fetch 10 billion IDs, partition them and put each partition (i.e. all IDs of each partition) in the execution context of workers. Instead, find the max ID, create ranges of IDs and assign them to distinct workers. For example:
Partition 1: 0 - 10000
Partition 2: 10001 - 20000
etc
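A rough sketch of that range computation, using only the JDK. The class `IdRangePartitions` and its `ranges` method are hypothetical names for illustration; in a real Spring Batch `Partitioner` each pair of bounds would be stored in an `ExecutionContext` (two longs per partition) instead of a `long[]`, so no worker context ever holds the IDs themselves.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: splits an ID sequence into contiguous ranges, one per worker.
final class IdRangePartitions {

    // Splits [minId, maxId] into at most gridSize non-overlapping ranges.
    // Each value is {startInclusive, endInclusive}.
    static Map<String, long[]> ranges(long minId, long maxId, int gridSize) {
        Map<String, long[]> result = new LinkedHashMap<>();
        long total = maxId - minId + 1;
        long size = (total + gridSize - 1) / gridSize; // ceiling division
        long start = minId;
        for (int i = 1; start <= maxId; i++) {
            long end = Math.min(start + size - 1, maxId);
            result.put("partition" + i, new long[] {start, end});
            start = end + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: IDs run from 0 to 9,999,999,999 and there are 1000 workers.
        Map<String, long[]> parts = ranges(0L, 9_999_999_999L, 1000);
        long[] first = parts.get("partition1");
        System.out.println(parts.size() + " partitions, first = " + first[0] + " - " + first[1]);
    }
}
```

Each worker then fetches only the IDs inside its own range, so memory use on the manager side stays constant regardless of the total number of IDs.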
If your IDs are not defined by a sequence and cannot be partitioned by range, then you need to find another key (or a composite key) that allows you to partition the data on other criteria. Otherwise, (remote) partitioning is not an option for you.

How to understand part and partition of ClickHouse?

I see that clickhouse created multiple directories for each partition key.
Documentation says the directory name format is: partition name, minimum number of data block, maximum number of data block and chunk level. For example, the directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think partition is split into different parts.
Am I right?
Parts are pieces of a table that store rows. One part = one folder with columns.
Partitions are virtual entities. They have no physical representation, but you can say that certain parts belong to the same partition.
SELECT does not care about partitions.
SELECT is not aware of partitioning keys.
It does not need to be, because each part has special files named minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of these columns in this part.
These minmax_ values are also kept in memory, in the list of parts (a C++ vector).
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked in-memory part list and found 0 parts because minmax_A.idx = (1,1) and this select needed (555, 555).
CH does not store partitioning key values.
So for example toYYYYMM(today()) = 202002 but this 202002 is not stored in a part or anywhere.
minmax_B.idx stores (18302, 18302) (2020-02-10 == select toInt16(today()))
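The part selection described above boils down to a range-overlap test per part. The sketch below illustrates that idea only; it is not ClickHouse code, and the `Part` class and `select` method are hypothetical stand-ins for the in-memory part list. It also checks the day number quoted for 2020-02-10.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: min/max based part pruning for a single column.
final class MinMaxPruning {

    // Hypothetical stand-in for one part's in-memory min/max entry.
    static final class Part {
        final String name;
        final long min;
        final long max;

        Part(String name, long min, long max) {
            this.name = name;
            this.min = min;
            this.max = max;
        }
    }

    // Keep only the parts whose [min, max] range overlaps the queried range [lo, hi].
    static List<String> select(List<Part> parts, long lo, long hi) {
        List<String> selected = new ArrayList<>();
        for (Part p : parts) {
            if (p.max >= lo && p.min <= hi) {
                selected.add(p.name);
            }
        }
        return selected;
    }

    // One part, matching the example: column A has min = max = 1.
    static List<Part> sampleParts() {
        return Arrays.asList(new Part("1-202002_1_1_0", 1, 1));
    }

    public static void main(String[] args) {
        System.out.println(select(sampleParts(), 555, 555)); // A = 555: no part qualifies
        System.out.println(select(sampleParts(), 1, 1));     // A = 1: the part qualifies
        // Dates are stored as days since 1970-01-01.
        System.out.println(LocalDate.of(2020, 2, 10).toEpochDay());
    }
}
```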
In my case, I used groupArray() and arrayEnumerate() for ranking in Populate. I assumed that Populate would run the query over the new data per partition (in my case: toStartOfDay(Date)). The total sum of the newly inserted data is correct, but groupArray() does not work correctly.
I think this happens because, when a part is inserted, CH applies groupArray() and ranks each part immediately, and only then merges the parts of a partition, so I don't get the correct final result from groupArray() and arrayEnumerate().
In summary, Merge
[groupArray(part_1) + groupArray(part_2)] is different from
groupArray(Partition)
with
Partition=part_1 + part_2
The solution I tried is to insert the new data as a single block, i.e. to use groupArray() to reduce the new data to fewer rows than max_insert_block_size=1048576. That worked correctly, but it is hard to insert one day of new data as a single part, because populating a full day (almost 150-200 million rows) uses too much memory.
Is there another way to use Populate with groupArray() for newly inserted data, such as forcing CH to apply POPULATE per partition, after all the parts have been merged into one partition, rather than per part?

How to change HBase table scan results order

I am trying to copy specific data from one HBase table to another, which requires scanning the table for rowkeys only and parsing a specific value out of each one. It works fine, but I noticed that the results are returned in ascending sort order, in this case alphabetically. Is there a way to specify a reverse order, or perhaps an order by insert timestamp?
Scan scan = new Scan();
scan.setMaxResultSize(1000);
scan.setFilter(new FirstKeyOnlyFilter());
ResultScanner scanner = TestHbaseTable.getScanner(scan);
for(Result r : scanner){
System.out.println(Bytes.toString(r.getRow()));
String rowKey = Bytes.toString(r.getRow());
if(rowKey.startsWith("dm.") || rowKey.startsWith("bk.") || rowKey.startsWith("rt.")) {
continue;
} else if(rowKey.startsWith("yt")) {
List<String> ytresult = Arrays.asList(rowKey.split("\\s*\\.\\s*")); // dot escaped so it matches a literal '.', assuming '.' is the delimiter
.....
This table is huge so I would prefer to skip to the rows I actually need. Appreciate any help here.
Have you tried setReversed(true) on the Scan? Keep in mind that in this case your start row has to be the logical END of your rowkey range, and the scan proceeds 'upwards' from there.
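The start/stop semantics of a reversed scan can be pictured with a sorted set from the JDK. This is an analogy only, not HBase code: `ReversedScanSketch`, `sampleKeys` and `reversedRange` are hypothetical names, and the sketch assumes the start row is inclusive and the stop row exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Sketch only: a sorted set stands in for the rowkey-ordered table.
final class ReversedScanSketch {

    static NavigableSet<String> sampleKeys() {
        return new TreeSet<>(Arrays.asList("bk.1", "dm.1", "yt.a", "yt.b", "yt.c", "zz.1"));
    }

    // Walk from startRow (inclusive) down to stopRow (exclusive), largest key first.
    // For a reversed scan the start row is the LARGER key of the range.
    static List<String> reversedRange(NavigableSet<String> keys, String startRow, String stopRow) {
        return new ArrayList<>(keys.subSet(stopRow, false, startRow, true).descendingSet());
    }

    public static void main(String[] args) {
        System.out.println(reversedRange(sampleKeys(), "yt.c", "yt.a"));
    }
}
```

Restricting the range to the "yt" prefix in this way also avoids reading the rows that would otherwise be skipped one by one.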

Sending Items to specific partitions

I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD.
Let's say I have two RDDs of key-value pairs:
val a:RDD[(Int, Foo)]
val b:RDD[(Int, Foo)]
val aStructure = a.reduceByKey(//reduce into large data structure)
b.mapPartitions{
iter =>
val usefulItem = aStructure(samePartitionKey)
iter.map(//process iterator)
}
How could I set up the partitioning so that the specific data structure I need is present for the mapPartitions call, without the extra overhead of sending over all values (which would happen if I used a broadcast variable)?
One thought I have had is to store the objects in HDFS, but I'm not sure whether that would be a suboptimal solution.
Another thought I am currently exploring is whether I can create a custom Partition or Partitioner that could hold the data structure (although that might get too complicated and become problematic).
Thank you for your help!
edit:
Pangea makes a very good point that I should offer some more specifics. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a mapPartitions over the RDD of vectors, where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition, and doing a join would give me a lot of copies of that index.
val vectors:RDD[(Int, SparseVector)]
val invertedIndexes:RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions{
iter =>
val invIndex = invertedIndexes(samePartitionKey)
iter.map(invIndex.calculateSimilarity(_))
}
A Partitioner is a function that, given a generic element, will return in which partition it belongs. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I understand your question correctly, you want the data to be partitioned while doing the reduce.
See the example:
// create example data
val a = sc.parallelize(List((1,1), (1,2), (2,3), (2,4)))
// create a simple sample partitioner: 2 partitions, one for odd and
// one for even key.hashCode. You should put your partitioning logic here.
// floorMod keeps the result in [0, numPartitions) even for negative hash codes
val p = new Partitioner { def numPartitions: Int = 2; def getPartition(key: Any): Int = Math.floorMod(key.hashCode, 2) }
// your reduceByKey function. Sample: just add
val f = (a:Int,b:Int) => a + b
val rdd = a.reduceByKey(p, f)
// here your rdd will be partitioned the way you want with the number
// of partitions you want
rdd.partitions.size
res8: Int = 2
rdd.map() .. // go on with your processing
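As a plain-JDK illustration of what such a partitioner computes (no Spark involved), the sketch below assigns keys to partitions by hash code. `PartitionerSketch`, `getPartition` and `assign` are hypothetical names. It uses `Math.floorMod` because the `%` operator returns a negative result for negative hash codes, which is not a valid partition index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: hash-based assignment of keys to partitions.
final class PartitionerSketch {

    // Always returns a value in [0, numPartitions), even for negative hash codes.
    static int getPartition(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Groups keys by the partition they are assigned to.
    static Map<Integer, List<Object>> assign(List<?> keys, int numPartitions) {
        Map<Integer, List<Object>> byPartition = new TreeMap<>();
        for (Object key : keys) {
            byPartition.computeIfAbsent(getPartition(key, numPartitions), p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        System.out.println(-3 % 2);                               // remainder keeps the sign of the dividend
        System.out.println(getPartition(-3, 2));                  // floorMod stays non-negative
        System.out.println(assign(Arrays.asList(1, 2, 3, 4, -3), 2));
    }
}
```

Because the assignment depends only on the key, two datasets partitioned with the same function place equal keys in the same partition index.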

Hadoop Buffering vs Streaming

Could someone please explain the difference between streaming and buffering in Hadoop?
Here is the context, which I read in the Hive documentation:
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
In a reduce-side join, the values from multiple tables are usually tagged so that the reducer can tell which table each value came from.
Consider a case of two tables:
On each reduce call, the mixed values associated with both tables are iterated.
During iteration, the values for one of the tags/tables are stored locally in an ArrayList. (This is buffering.)
The rest of the values are streamed through: whenever a value for the other tag/table is seen, the values of the first tag are fetched from the saved ArrayList, and the two tagged values are joined and written to the output collector.
Contrast this with the opposite case: if the values of the larger table were kept in the ArrayList, it could cause an OOM error once the list outgrows the memory of the container's JVM.
void reduce(TextPair key, Iterator<TextPair> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Buffer for the table1 values
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // Tag identifying table1
    Text table1Tag = key.getSecond();
    TextPair value = null;
    while (values.hasNext()) {
        value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            // Copy the value: Hadoop reuses the same object across iterations
            table1Values.add(new Text(value.getFirst()));
        } else {
            // Streamed row of the other table: join it with every buffered table1 value
            for (Text val : table1Values) {
                output.collect(key.getFirst(), new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}
You can use the below hint to specify which of the joined tables would be streamed on reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Hadoop Streaming in general refers to using custom-made Python or shell scripts to perform your map-reduce logic (for example, via the Hive TRANSFORM keyword).
Hadoop buffering, in this context, refers to the phase in a map-reduce job of a Hive query with a join in which records are read into the reducers after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables come last: it helps optimize the implementation of joins in Hive.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, it must take records from multiple tables, sort them by the join key, and then collate them in the proper order. The reducer reads them grouped by table, so it has to see the groups from each of the tables, and only once all tables have been seen can it start processing them. The groups from the first tables need to be buffered (kept in memory) because they cannot be processed until the last table is seen. The last table can be streamed (each row is processed as it is read), since the groups from the other tables are already in memory and the join can start.
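The difference can be reduced to a few lines of plain Java with no Hadoop types. This is a sketch under stated assumptions, not Hive's implementation: `StreamedJoin`, `joinGroup` and the tags "a" and "b" are hypothetical, and the rows for one join key are assumed to arrive with all rows of the buffered table first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch only: join the rows of one join key, buffering table "a" and streaming table "b".
final class StreamedJoin {

    // Each row is {tag, value}. Rows tagged "a" are assumed to come before rows tagged "b".
    static List<String> joinGroup(Iterator<String[]> taggedRows) {
        List<String> buffered = new ArrayList<>(); // table "a": kept in memory
        List<String> output = new ArrayList<>();
        while (taggedRows.hasNext()) {
            String[] row = taggedRows.next();
            if (row[0].equals("a")) {
                buffered.add(row[1]);
            } else {
                // Table "b" is streamed: each row is joined and then forgotten.
                for (String a : buffered) {
                    output.add(a + "\t" + row[1]);
                }
            }
        }
        return output;
    }

    static List<String> demo() {
        List<String[]> rows = Arrays.asList(
            new String[] {"a", "a1"},
            new String[] {"a", "a2"},
            new String[] {"b", "b1"},
            new String[] {"b", "b2"});
        return joinGroup(rows.iterator());
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

Memory use grows with the number of "a" rows per key only, which is why the largest table should be the streamed one.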
