Efficiently scanning on composite row key in hbase - hadoop

I have my hbase table structured as follows:
a1:b1
a1:b2
a2:b1
a3:b2
Is there any way I can efficiently check whether the first part of the row key exists in the hbase table? I do not want to retrieve the records; I just want to check whether a1, a2, a3 exist.

If you are doing this via a Scan, then you can operate only on row keys, without loading any columns by adding the following filters to your Scan:
KeyOnlyFilter
FirstKeyOnlyFilter
However, if you are doing this via a Get, you need the complete row key (for example a1:b1), so a Get cannot test for a prefix such as a1. Note that a Get with no columns added does not throw an error; it simply returns every column of the row.
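Because row keys are stored in sorted order, a prefix check can be done by seeking to the prefix and looking only at the first key found there. The sketch below models that idea in plain Python over an in-memory sorted list; `ROW_KEYS` and `prefix_exists` are hypothetical names used for illustration and are not part of the HBase API.

```python
from bisect import bisect_left

# Stand-in for the table: HBase keeps row keys in lexicographic order.
ROW_KEYS = sorted(["a1:b1", "a1:b2", "a2:b1", "a3:b2"])

def prefix_exists(sorted_keys, prefix):
    """Return True if at least one key starts with prefix.

    Models a scan that starts at the prefix and stops after one row:
    the first key >= prefix either carries the prefix or no key does.
    """
    i = bisect_left(sorted_keys, prefix)
    return i < len(sorted_keys) and sorted_keys[i].startswith(prefix)

for p in ["a1:", "a2:", "a3:", "a4:"]:
    print(p, prefix_exists(ROW_KEYS, p))
```

In a real client, the equivalent would presumably be a prefix-bounded Scan combined with the key-only filters above and stopped after the first result, so no values are transferred.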

Related

Different Columns for Each Row in HBase?

In my HBase table, each row may have different columns than other rows. For example:
ROW COLUMN
1-1040 cf:s1
1-1040 cf:s2
1-1043 cf:s2
2-1040 cf:s5
2-1045 cf:s99
3-1040 cf:s75
3-1042 cf:s135
As seen above, each row has different columns than other rows. So, when I run a scan query like this:
scan 'tb', {COLUMNS=>'cf:s2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
I want to get the cf:s2 values using the above query. But does any performance issue arise because each row has different columns?
Another option:
ROW COLUMN
1-1040-s1 cf:value
1-1040-s2 cf:value
1-1043-s2 cf:value
2-1040-s5 cf:value
2-1045-s99 cf:value
3-1040-s75 cf:value
3-1042-s135 cf:value
In this option, when I want to get the s2 values between 1-1040 and 1-1044, I run this query:
scan 'tb', {STARTROW=>'1-1040-s2', ENDROW=>'1-1044', FILTER=>"RowFilter(=, 'substring:s2')"}
When I want to get the s2 values, which option gives better read performance?
HBase stores all records for a given column family in the same set of files, so the scan has to run over every key-value pair in the requested row range, even if you apply a filter. This is true of both ways you suggest for storing the data.
For optimal performance of this specific scan, you should consider storing your s2 data in a different column family. Under the hood, HBase will store your data in the following way:
One file:
1-1040 cf1:s1
2-1040 cf1:s5
2-1045 cf1:s99
3-1040 cf1:s75
3-1042 cf1:s135
Another file:
1-1040 cf2:s2
1-1043 cf2:s2
Then you can run a scan over just cf2, and HBase will only read the files containing s2, making the operation much faster. Note that the row keys in this layout do not contain s2, so the start row is just 1-1040:
scan 'tb', {COLUMNS => 'cf2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
Considerations:
It's recommended to have only two or three column families per table, so you shouldn't implement this if you also want to run this query for s5, s75 etc. In that case, your composite rowkey option is better, as HBase only needs to look at the rowkey and not at column qualifiers.
It depends on exactly which queries you'll be running, and how often you'll be running them. This is the fastest way for you to get values associated with s2, but might not be fastest for other queries.
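The point that a filter narrows what is returned, not what is read, can be seen in a small in-memory model of the composite rowkey layout. This is only a sketch in plain Python: `ROWS` and `range_scan` are hypothetical names, and the sorted list stands in for the sorted row keys of a single column family.

```python
from bisect import bisect_left

# Composite row keys from the second layout, kept sorted as HBase would.
ROWS = sorted([
    "1-1040-s1", "1-1040-s2", "1-1043-s2", "2-1040-s5",
    "2-1045-s99", "3-1040-s75", "3-1042-s135",
])

def range_scan(sorted_keys, start, stop, substring):
    """Model a scan over [start, stop) with a substring row filter.

    Returns (examined, matched): every key in the range is examined,
    but only keys containing the substring are returned to the caller.
    """
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    examined = sorted_keys[lo:hi]
    matched = [k for k in examined if substring in k]
    return examined, matched

examined, matched = range_scan(ROWS, "1-1040", "1-1044", "s2")
print(len(examined), "keys examined;", len(matched), "keys returned")
```

Here the start and stop rows bound the work, while the substring filter only discards rows after they have been looked at.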

How to Generate/Create a Unique ID to Database Rows

I am using the Text file input step in Pentaho Data Integration to load rows into my database. I need to create a unique ID for each row so I can identify duplicates later on in my transformation. I tried to create an ID by concatenating 3 columns into one, but some rows will always be the same because of how the file is generated. I do have "true" duplicates, so it's been hard to identify them separately. Is there any other way of identifying each row so I can make it my primary key and avoid duplicates?
Thank you!
If your problem is non-unique rows, identify them with the Memory Group By step: set the grouping fields and don't specify any aggregate function. Once you have the unique rows, assign each of them a sequence, and voilà!
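The group-then-number idea can be sketched outside Pentaho as follows. This is a plain Python model, not PDI code; `assign_ids`, the field names, and the sample rows are made up for illustration.

```python
def assign_ids(rows, key_fields):
    """Collapse rows that agree on key_fields and number the survivors.

    Mirrors a group-by with no aggregates followed by a sequence:
    the first occurrence of each key is kept and gets the next id.
    """
    seen = {}
    unique_rows = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen[key] = len(seen) + 1  # sequence starts at 1
            unique_rows.append({**row, "id": seen[key]})
    return unique_rows

# Hypothetical input: the first and third rows are true duplicates.
rows = [
    {"name": "ann", "city": "oslo", "amount": 10},
    {"name": "bob", "city": "rome", "amount": 20},
    {"name": "ann", "city": "oslo", "amount": 10},
]
result = assign_ids(rows, ["name", "city", "amount"])
for r in result:
    print(r)
```

If every input row must stay distinguishable, including true duplicates, numbering the rows before grouping is the alternative to consider.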

Sorting Issue After Table Render in Laravel DataTables as a Service Implementation

I have implemented laravel dataTable as a service.
The first two columns are the actual id and name, so I am able to sort them asc/desc after the table renders.
But the next few columns render after performing a few calculations, i.e. these values are not fetched directly from any column; they are computed.
I am unable to sort the columns where calculations were performed, and I get an error. I know it is looking for that particular column, e.g. outstanding_amount, which I don't have in the DB; it is an amount calculated from two or more columns that are in other tables.
Any suggestions on how to overcome this issue?
It looks like you're trying to sort by values that aren't columns, but calculated values.
So the main issue here is to give Eloquent/MySQL the data it needs to do the sorting.
// You might need to do some joins first
->addSelect(DB::raw('your_calc as outstanding_amount'))
->orderBy('outstanding_amount') // asc can be omitted as this is the default
// Alternative: if you don't need the sorted-by value in the result
// Don't forget any joins you might need
->orderByRaw('your_calc_for_outstanding_amount ASC')
For SQL functions it'll work as follows:
->addSelect(DB::raw('COUNT(products.id) as product_count'))
->orderByRaw('COUNT(products.id) DESC');
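What the query builder calls above boil down to is SQL that selects the calculated expression under an alias and orders by that alias. The sketch below shows this at the SQL level using Python's built-in sqlite3 module rather than Laravel or MySQL; the `invoices` and `payments` tables and their columns are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE invoices (id INTEGER PRIMARY KEY, name TEXT, total REAL);
CREATE TABLE payments (invoice_id INTEGER, amount REAL);
INSERT INTO invoices VALUES (1, 'alpha', 100), (2, 'beta', 250), (3, 'gamma', 80);
INSERT INTO payments VALUES (1, 40), (2, 250), (3, 10), (3, 20);
""")

# outstanding_amount is not a stored column: it is computed in the SELECT
# and the alias is then usable in ORDER BY.
rows = conn.execute("""
SELECT i.id, i.name,
       i.total - COALESCE(SUM(p.amount), 0) AS outstanding_amount
FROM invoices i
LEFT JOIN payments p ON p.invoice_id = i.id
GROUP BY i.id
ORDER BY outstanding_amount DESC
""").fetchall()

for row in rows:
    print(row)
```

Once the calculated value is part of the query itself, the database can sort on it like any other column.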

How can I skip HBase rows that are missing specific column family?

For example, an HBase table has columnFamilyA, columnFamilyB and columnFamilyC; for some rows, columnFamilyA does not have any column in it. I would like to scan the table and return only the rows that have at least one column in columnFamilyA.
What kind of filter should I use? I checked SingleColumnValueFilter, but it seems to work only with a specific column rather than a column family. I need all rows where columnFamilyA contains at least one column: not just the data in columnFamilyA, but the entire row.
If you need only the data from columnFamilyA, you can use the addFamily method on Get or Scan objects.
Or you can do a scan of a scan: first scan only columnFamilyA to find the row keys that have data there, then fetch the full rows for those keys.
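The two-pass approach can be modelled in a few lines. The sketch below uses a nested Python dict in place of the table (row key, then family, then qualifier); `TABLE`, `scan_family` and `get_rows` are hypothetical names for illustration, not HBase client calls.

```python
# Stand-in table: row key -> column family -> qualifier -> value.
TABLE = {
    "row1": {"columnFamilyA": {"q1": "x"}, "columnFamilyB": {"q": "b1"}},
    "row2": {"columnFamilyB": {"q": "b2"}, "columnFamilyC": {"q": "c2"}},
    "row3": {"columnFamilyA": {"q2": "y"}, "columnFamilyC": {"q": "c3"}},
}

def scan_family(table, family):
    """First pass: yield keys of rows with at least one cell in family."""
    for key in sorted(table):
        if table[key].get(family):
            yield key

def get_rows(table, keys):
    """Second pass: fetch the complete row for each key."""
    return {k: table[k] for k in keys}

full_rows = get_rows(TABLE, scan_family(TABLE, "columnFamilyA"))
print(sorted(full_rows))
```

The first pass touches only one family, and the second pass returns the entire row, including families B and C, for the keys that survived.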

Altova Mapforce: Joining XML Input and conditional SQL Join using two tables

I'm trying to get the following done: Using Altova Mapforce, I use an XML file with schema as a source. I want to map it to exactly the same output, but only add data to one field.
The value of the field (it's Tax) is determined using a two table SQL join with a WHERE clause over both tables. The tables are joined using foreign keys, the relation is recognized by Mapforce.
The first field of the WHERE clause comes from the first table (a header-type table), the second and third fields from the second table (a lines-type table).
However, I cannot seem to create the logical and correct equivalent of what I am describing here. I've tried it using complex AND constructions where it then inserts the one field I would need multiple times. I've tried WHERE clauses but they fail as they never supply both tables at the same time and there seems to be no way to use a pre-specified JOINing of two tables as a source. The WHERE clause then recognizes only the fields from the first table, not the second one.
Is there an example for this? Joining two (or more) tables, using WHERE to determine the exact row, then using a value from that row?
Best wishes.
