Different Columns for Each Row in HBase?

In my HBase table, each row may have different columns than other rows. For example:
1-1040 cf:s1
1-1040 cf:s2
1-1043 cf:s2
2-1040 cf:s5
2-1045 cf:s99
3-1040 cf:s75
3-1042 cf:s135
As seen above, each row has different columns than other rows. So, when I run a scan query like this:
scan 'tb', {COLUMNS=>'cf:s2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
I want to get the cf:s2 values using the above query. Does having different columns in each row cause any performance issues?
Another option:
1-1040-s1 cf:value
1-1040-s2 cf:value
1-1043-s2 cf:value
2-1040-s5 cf:value
2-1045-s99 cf:value
3-1040-s75 cf:value
3-1042-s135 cf:value
In this option, when I want to get the s2 values between 1-1040 and 1-1044, I run this query:
scan 'tb', {STARTROW=>'1-1040-s2', ENDROW=>'1-1044', FILTER=>"RowFilter(=, 'substring:s2')"}
Which option gives better read performance when I want to get the s2 values?

HBase stores all cells of a given column family together in the same set of files (HFiles), so within the scanned row range the scan has to run over all key-value pairs of that family, even if you apply a filter. This is true of both ways you suggest for storing the data.
For optimal performance of this specific scan, you should consider storing your s2 data in a different column family. Under the hood, HBase will then store your data in the following way:
One file:
1-1040 cf1:s1
2-1040 cf1:s5
2-1045 cf1:s99
3-1040 cf1:s75
3-1042 cf1:s135
Another file:
1-1040 cf2:s2
1-1043 cf2:s2
Then you can run a scan over just cf2, and HBase will only read data containing s2, making the operation much faster.
scan 'tb', {COLUMNS => 'cf2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
It's recommended to have only two or three column families per table, so you shouldn't implement this if you also want to run this query for s5, s75, etc. In that case, your composite rowkey option is better, as HBase only needs to look at the rowkey and not at the column qualifiers.
It depends on exactly which queries you'll be running, and how often you'll be running them. A separate column family is the fastest way for you to get the values associated with s2, but it might not be the fastest for other queries.
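The composite-rowkey option relies on HBase keeping rows in lexicographic key order, so a start/stop row pair selects a contiguous slice. The following is a minimal pure-Python sketch of that idea, not HBase API: the sorted list stands in for the table, and `range_scan` is a hypothetical helper that uses a suffix check (stricter than the substring comparator in the question).

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for an HBase table: row keys kept in sorted order.
row_keys = sorted([
    "1-1040-s1", "1-1040-s2", "1-1043-s2",
    "2-1040-s5", "2-1045-s99", "3-1040-s75", "3-1042-s135",
])

def range_scan(keys, start_row, stop_row, suffix):
    """Return keys in [start_row, stop_row) that end with the given suffix."""
    lo = bisect_left(keys, start_row)   # first key >= start_row
    hi = bisect_left(keys, stop_row)    # stop row is exclusive
    return [k for k in keys[lo:hi] if k.endswith(suffix)]

s2_rows = range_scan(row_keys, "1-1040", "1-1044", "-s2")
print(s2_rows)
```

The start key matters here: "1-1040s2" sorts after "1-1040-s2" because '-' precedes 's', so a scan starting at "1-1040s2" would skip that row.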


Sorting after Repartitioning PySpark Dataframe

We have a giant file which we repartitioned according to one column, for example, say it is STATE. Now it seems like after repartitioning, the data cannot be sorted completely. We are trying to save our final file as a text file but instead of the first state listed being Alabama, now California shows up first. OrderBy doesn't seem to have an effect after running the repartition.
df = df.repartition(100, ['STATE_NAME'])\
.sortWithinPartitions('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID')
I can't find a clear statement in the documentation about this, only this hint for pyspark.sql.DataFrame.repartition:
The resulting DataFrame is hash partitioned.
repartition doesn't put the rows in any specific (in particular, alphabetical) order, not even if they were ordered previously; it only groups them. That .sortWithinPartitions imposes no global order is no surprise given the name: the sorting happens only within each partition, not across partitions. You can try .sort instead.
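To see why the output can start with California, here is a small pure-Python simulation (not PySpark itself). The sample rows, the CRC32-based `hash_partition` helper, and the partition count are all assumptions made for illustration; Spark uses its own hash function.

```python
from zlib import crc32

# Hypothetical sample rows: (STATE_NAME, CUSTOMER_ID, ROW_ID)
rows = [
    ("California", 2, 1), ("Alabama", 1, 2), ("Texas", 3, 1),
    ("Alabama", 1, 1), ("California", 1, 1), ("Nevada", 5, 1),
]

def hash_partition(rows, num_partitions):
    """Group rows by a deterministic hash of the state (stand-in for repartition)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[crc32(row[0].encode()) % num_partitions].append(row)
    return parts

# Sorting each partition separately mimics sortWithinPartitions.
partitions = [sorted(p) for p in hash_partition(rows, 3)]
# Writing the partitions out one after another gives this order.
concatenated = [row for p in partitions for row in p]
# A global sort (what .sort / .orderBy does) orders across partitions.
globally_sorted = sorted(rows)

print(concatenated)
print(globally_sorted)
```

Each partition is sorted and each state lands in exactly one partition, but which partition comes first depends on the hash, not on the alphabet. In PySpark, the equivalent of the last step would be df.sort('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID') applied as the final transformation.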

Efficiently scanning on composite row key in hbase

I have my hbase table structured as follows:
Is there any way I can efficiently check if the first part of the row key exists in the HBase table? I do not want to retrieve the records; I just want to check if a1, a2, a3 exist.
If you are doing this via a Scan, then you can operate only on row keys, without loading any columns by adding the following filters to your Scan:
However if you are doing this via a get, then I think you'd have to specify at least one column. If I remember correctly, an error will be thrown if you haven't added any columns to your get.
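Checking whether a row-key prefix exists comes down to seeking to the first key at or after the prefix and inspecting only that key. The sketch below shows this in pure Python under stated assumptions: the key layout "<prefix>_<suffix>", the sample keys, and the `prefix_exists` helper are hypothetical, since the table structure is not shown above. In HBase itself, key-only scans are usually built with KeyOnlyFilter and FirstKeyOnlyFilter; whether those are the filters the answer meant is an assumption.

```python
from bisect import bisect_left

# Hypothetical composite row keys of the form "<prefix>_<suffix>", kept sorted.
row_keys = sorted(["a1_x", "a1_y", "a3_x", "b7_z"])

def prefix_exists(keys, prefix):
    """Seek to the first key >= prefix and check only that one key."""
    i = bisect_left(keys, prefix)
    return i < len(keys) and keys[i].startswith(prefix)

# Include the separator so that "a1" does not accidentally match "a10_...".
found = {p: prefix_exists(row_keys, p + "_") for p in ["a1", "a2", "a3"]}
print(found)
```

Because the keys are sorted, no record values are read and at most one key is examined per prefix.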

How can I skip HBase rows that are missing specific column family?

For example, an HBase table has columnFamilyA, columnFamilyB and columnFamilyC. For some rows, columnFamilyA does not have any column in it. I would like to scan the table and return only the rows that have at least one column in columnFamilyA.
What kind of filter should I use? I checked SingleColumnValueFilter, but it seems to work only with a specific column rather than a column family. I need all rows where columnFamilyA contains at least one column. Not just the data in columnFamilyA, but the entire row.
If you need only the data from columnFamilyA, you can use the addFamily method on Get or Scan objects.
Or you can do it in two passes: first scan only columnFamilyA to collect the row keys, then get the full rows for the keys returned by the first scan.
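The two-pass approach can be sketched in pure Python as follows. This is an illustration rather than HBase client code: the nested dict stands in for the table, and `scan_family` and `get_rows` are hypothetical helpers playing the roles of a family-restricted Scan and a batch of Gets.

```python
# Hypothetical in-memory table: row key -> {family: {qualifier: value}}
table = {
    "r1": {"columnFamilyA": {"q1": "a"}, "columnFamilyB": {"q1": "b"}},
    "r2": {"columnFamilyB": {"q1": "b"}, "columnFamilyC": {"q1": "c"}},
    "r3": {"columnFamilyA": {"q2": "a"}, "columnFamilyC": {"q1": "c"}},
}

def scan_family(table, family):
    """First pass: yield row keys that have at least one cell in the family."""
    for key in sorted(table):
        if table[key].get(family):
            yield key

def get_rows(table, keys):
    """Second pass: fetch the complete rows for the given keys."""
    return {key: table[key] for key in keys}

full_rows = get_rows(table, scan_family(table, "columnFamilyA"))
print(sorted(full_rows))
```

The first pass touches only columnFamilyA, so rows without any cell in that family never show up; the second pass then returns every family for the surviving keys.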

Store nested entity in Hbase and read it as rows in hive

My requirement is to write a nested entity (an array of POJO objects) from Java to HBase and to read the elements as individual records in Hive.
That is, when writing from Java, it's just a single string (the array). But in Hive, the array represents the table as a whole, so Hive should expose the individual elements of the array as individual records.
Any help on this will be appreciated.
Perhaps you should take a look at Hive UDTFs such as explode. Depending on what you store and what you need to retrieve, they may work for you, but be aware that they have some important limitations:
No other expressions are allowed in SELECT: SELECT pageid, explode(adid_list) AS myCol... is not supported
UDTFs can't be nested: SELECT explode(explode(adid_list)) AS myCol... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported: SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
If standard UDTFs don't fit your case and you're in the mood, you can also do this:
Store each item of your array as a json string in a different column: i0, i1, i2 ... iN
Write your own UDTF function to process each row columns and emit 1 row per column.
IMHO, I'd just write one row per element of the array, appending the index of each array item to the rowkey. It will be faster when processing the data and you'll have far fewer headaches. You shouldn't worry about writing billions of rows if that's the case.
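The one-row-per-element layout could look like the following sketch. It only builds the (rowkey, value) pairs and does not talk to HBase; the `rows_for_array` helper, the "-" separator, the six-digit zero padding, and the sample data are assumptions for illustration. Zero-padding the index keeps the elements in array order, because row keys are compared lexicographically.

```python
import json

def rows_for_array(base_key, items, width=6):
    """Build one (rowkey, json_value) pair per array element."""
    return [
        # Zero-padded index so that element 10 sorts after element 9.
        (f"{base_key}-{i:0{width}d}", json.dumps(item, sort_keys=True))
        for i, item in enumerate(items)
    ]

rows = rows_for_array("order42", [{"sku": "A"}, {"sku": "B"}])
for key, value in rows:
    print(key, value)
```

Each pair can then be written as its own row, and Hive sees one record per array element without needing a UDTF.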

Hadoop map-reduce : Order of records while grouping

I have a record in each line of input and each record has around 10 fields. First, I group the records by three fields (field1, field2, field3) thus one mapper/reducer is responsible for one unique group (based on the three fields). Within each group, I sort the records based on another integer field timestamp and I tag each record in the group with the same tag aTag by adding another field.
Let's say that in mapper#1, I tag a sorted group as aTag and in mapper#2, I tag another group (a different group, because I initially grouped the records based on the three fields) with the same tag aTag.
Now, if I group the records based on the tag field (i.e., grouping the groups in different mappers), I notice that the ordering within each group is no longer preserved. I was expecting that, since each mapper has a group with all records having the same tag, grouping by the tag name should just involve getting the relevant groups from other mappers and concatenating them without re-ordering each individual group.
Is it because I am trying to store the records in gzip format and hence it tries to re-order the records for better compression? Also I would like to know how to preserve the order after grouping by the tag name.
It seems that you are trying to implement the sort step of MapReduce yourself in local memory, but the framework then completely ignores what you did and re-sorts the items in each group anyway. The proper way to fix this is to specify a comparator on the keys, so that within each partition the merged input to the reducer is ordered according to that comparison function. This means that:
You don't have to do the sorting yourself
You don't run out of memory on one machine trying to sort a really large group.
It seems in your case that you'd want to add timestamp to the set of keys, tell it to partition on the first three keys, and tell it to sort on the timestamp.
For more information, see Where is Sort used in MapReduce phase and why?
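The composite-key idea (often called a secondary sort) can be simulated in plain Python as below. This is only a sketch of the ordering logic, not Hadoop code: the sample records and the `group_key` and `sort_key` helpers are hypothetical stand-ins for the partitioner/grouping comparator and the sort comparator.

```python
from itertools import groupby

# Hypothetical records: (field1, field2, field3, timestamp, payload)
records = [
    ("a", "x", "1", 30, "r3"), ("b", "y", "2", 5, "r4"),
    ("a", "x", "1", 10, "r1"), ("a", "x", "1", 20, "r2"),
]

def group_key(rec):
    """Partitioning and grouping use only the first three fields."""
    return rec[:3]

def sort_key(rec):
    """Sorting uses the group fields plus the timestamp."""
    return rec[:4]

# The shuffle sorts by the full composite key...
shuffled = sorted(records, key=sort_key)
# ...and the reducer sees one group per distinct (field1, field2, field3),
# with the values already ordered by timestamp.
groups = {k: [r[4] for r in g] for k, g in groupby(shuffled, key=group_key)}
print(groups)
```

No group is ever sorted in memory by user code; the timestamp order inside each group falls out of the key ordering.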
