How can I skip HBase rows that are missing a specific column family?

For example, an HBase table has columnFamilyA, columnFamilyB and columnFamilyC, and for some rows columnFamilyA does not contain any columns. I would like to scan the table and return only the rows that have at least one column in columnFamilyA.
What kind of filter should I use? I checked SingleColumnValueFilter, but it seems to work only with a specific column rather than a whole column family. I need all rows where columnFamilyA contains at least one column: not just the data in columnFamilyA, but the entire row.

If you need only the data from columnFamilyA, you can use the addFamily method on the Get or Scan object.

Or you can do a scan of a scan: first scan for the columnFamilyA columns to collect the matching row keys, then fetch the full rows for the keys returned by the first scan.
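A minimal sketch of that two-pass idea using the HappyBase client (the host and table names are hypothetical); since HappyBase returns each row as a dict keyed by b'family:qualifier', the family check itself is a pure function:

```python
def has_family(row_data, family):
    """Return True if any column in a HappyBase-style row dict belongs to `family`."""
    prefix = family.encode() + b':'
    return any(col.startswith(prefix) for col in row_data)

# Against a live cluster the two passes would look roughly like this
# (hypothetical connection details):
#   import happybase
#   conn = happybase.Connection('hbase-host')
#   table = conn.table('my_table')
#   keys = [key for key, _ in table.scan(columns=[b'columnFamilyA'])]
#   full_rows = table.rows(keys)  # entire rows, all families

# Offline check against HappyBase-shaped data:
sample = {b'columnFamilyA:c1': b'v1', b'columnFamilyB:c2': b'v2'}
print(has_family(sample, 'columnFamilyA'))  # True
print(has_family({b'columnFamilyB:c2': b'v2'}, 'columnFamilyA'))  # False
```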

Related

Different Columns for Each Row in HBase?

In my HBase table, each row may have different columns than other rows. For example:
ROW COLUMN
1-1040 cf:s1
1-1040 cf:s2
1-1043 cf:s2
2-1040 cf:s5
2-1045 cf:s99
3-1040 cf:s75
3-1042 cf:s135
As seen above, each row has different columns than the other rows. So I run a scan query like this:
scan 'tb', {COLUMNS=>'cf:s2', STARTROW=>'1-1040', ENDROW=>'1-1044'}
I want to get the cf:s2 values using the above query. But does any performance issue occur because each row has different columns?
Another option:
ROW COLUMN
1-1040-s1 cf:value
1-1040-s2 cf:value
1-1043-s2 cf:value
2-1040-s5 cf:value
2-1045-s99 cf:value
3-1040-s75 cf:value
3-1042-s135 cf:value
In this option, when I want to get the s2 values between 1-1040 and 1-1044, I run this query:
scan 'tb', {STARTROW=>'1-1040s2', ENDROW=>'1-1044', FILTER=>"RowFilter(=, 'substring:s2')"}
When I want to get s2 values, which option is better in read performance?
HBase stores all records for a given column family in the same file, and so the scan has to run over all key-value pairs, even if you apply a filter. This is true of both ways you suggest for storing the data.
For optimal performance of this specific scan, you should consider storing your s2 data in a different column family. Under the hood, HBase will store your data in the following way:
One file:
1-1040 cf1:s1
2-1040 cf1:s5
2-1045 cf1:s99
3-1040 cf1:s75
3-1042 cf1:s135
Another file:
1-1040 cf2:s2
1-1043 cf2:s2
Then you can run a scan over just cf2, and HBase will only read data containing s2, making the operation much faster.
scan 'tb', {COLUMNS => 'cf2', STARTROW=>'1-1040s2', ENDROW=>'1-1044'}
Considerations:
It's recommended to have only two or three column families per table, so you shouldn't implement this if you want to run the same query for s5, s75, etc. In that case, your composite-rowkey option is better, as HBase only needs to look at the row key, and not at the column qualifiers.
It depends on exactly which queries you'll be running, and how often you'll be running them. This is the fastest way for you to get values associated with s2, but might not be fastest for other queries.
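The composite-rowkey option works because HBase scans row keys lexicographically from STARTROW (inclusive) to ENDROW (exclusive); a quick offline illustration in Python, using row keys from the example above:

```python
def in_scan_range(key, start_row, end_row):
    """Mimic HBase's lexicographic STARTROW (inclusive) / ENDROW (exclusive) bounds."""
    return start_row <= key < end_row

keys = ['1-1040-s1', '1-1040-s2', '1-1043-s2', '2-1040-s5', '3-1042-s135']
print([k for k in keys if in_scan_range(k, '1-1040', '1-1044')])
# ['1-1040-s1', '1-1040-s2', '1-1043-s2']
```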

How to Generate/Create a Unique ID to Database Rows

So, I am using the text file input step in Pentaho Data Integration to load rows into my database. I need to create a unique ID for each row so I can identify duplicates later on in my transformation. I tried to create an ID by concatenating 3 columns into one, but some rows will always be the same due to how the file is generated. I do have "true" duplicates, so it's been hard getting them identified separately. Is there any other way of identifying each row so I can make it my primary key and avoid duplicates?
Thank you!
If your problem is non-unique rows, identify them with a Memory Group By step: set your grouping criteria and don't specify an aggregate function. After isolating the unique rows, assign them a sequence and voilà!
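The same idea, grouping identical rows and then numbering them, sketched in plain Python (the column names are illustrative; in PDI this would be a Memory Group By step followed by an Add sequence step):

```python
def add_sequence(rows, key_cols):
    """Assign a per-group sequence number so that (key columns + seq) is unique,
    even for 'true' duplicate rows."""
    counters = {}
    out = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        counters[key] = counters.get(key, 0) + 1
        out.append({**row, 'seq': counters[key]})
    return out

rows = [{'a': 1, 'b': 'x'}, {'a': 1, 'b': 'x'}, {'a': 2, 'b': 'y'}]
print(add_sequence(rows, ['a', 'b']))
# [{'a': 1, 'b': 'x', 'seq': 1}, {'a': 1, 'b': 'x', 'seq': 2}, {'a': 2, 'b': 'y', 'seq': 1}]
```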

HBase : Confusion between COLUMN and FILTER (SingleColumnValueFilter)

I have installed HBase and I have access to the command shell.
I have a table with 2 column families, like this:
create 'arbres', 'emplacement', 'propriete'
This request works fine :
scan 'arbres',{FILTER=>"SingleColumnValueFilter('emplacement', 'lieu_adresse', =,'binary:VOIE INCONNUE')", COLUMNS=>['emplacement'], COLUMN=>15}
But this second one lists all rows, without filtering:
scan 'arbres',{FILTER=>"SingleColumnValueFilter('emplacement', 'lieu_adresse', =,'binary:VOIE INCONNUE')", COLUMNS=>['propriete'], COLUMN=>15}
I don't understand why, and I can't find the reason in the documentation.
Please can you explain the reason a little?
Regards
The second command filters on a column family and column that you are not accessing.
The filter push-down requires the filtered columns to be accessed, meaning the column family and column mentioned in the filter should also appear in COLUMNS=>[]. For example:
scan 'arbres',{FILTER=>"SingleColumnValueFilter('emplacement', 'lieu_adresse', =,'binary:VOIE INCONNUE')", COLUMNS=>['emplacement','propriete'], COLUMN=>15}
The reason one would have two different column families is to make access easier and lightweight, since each column family is stored in its own file.

Adding Index To A Column Having Flag Values

I am a novice at tuning Oracle queries, so I need help.
If I have a SQL query like:
select a.ID,a.name.....
from a,b,c
where a.id=b.id
and ....
and b.flag='Y';
then will adding an index to the FLAG column of table b help tune the query in any way? The FLAG column has only 2 values, Y and N.
With a standard btree index, the SQL engine can find the row or rows in the index for the specified value quickly due to its binary structure, then use the physical address (the rowid) stored in the index to access the desired row in a second hop. It's like looking in the index of a book to find the page number. So that is:
Go to index with the key value you want to look up.
The index tells you the physical address in the table.
Go straight to that physical address.
That is nice and quick for something like a unique customer ID. It's still OK for something nonunique, like a customer ID in a table of orders, although the database has to go through the index entries and for each one go to the indicated address. That can still be faster than slogging through the entire table from top to bottom.
But for a column with only two distinct values, you can see that it is going to be more work going through all of the index entries for 'Y' for example, and for each one going to the indicated location in the table, than it would be to just forget the index and scan the whole table in one shot.
That's unless the values are unevenly distributed. If there are a million Y rows and ten N rows then an index will help you find those N rows fast but be no use for Y.
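A back-of-the-envelope illustration of that trade-off in Python (the row counts are made-up assumptions, and the per-row costs are deliberately crude):

```python
def cheaper_to_use_index(matching_rows, total_rows):
    """Rough model: an index range scan costs ~2 touches per matching row
    (index entry + table row lookup), while a full scan touches every row once."""
    index_cost = 2 * matching_rows
    full_scan_cost = total_rows
    return index_cost < full_scan_cost

# Evenly distributed flag: half the rows match 'Y'; the index loses.
print(cheaper_to_use_index(500_000, 1_000_000))  # False

# Skewed flag: only ten 'N' rows; the index wins for 'N'.
print(cheaper_to_use_index(10, 1_000_000))  # True
```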
Adding an index to a column with only 2 values normally isn't very useful, because Oracle might just as well do a full table scan.
From your query it looks like it would be more useful to have an index on id, because that would help with the join a.id=b.id.
If you really want to get into tuning then learn to use "explain plan", as that will give you some indication of how much work Oracle needs to do for a query. Add (or remove) an index, then rerun the explain plan.

HappyBase - Is there an equivalent of find_one or scan_one?

All the rows in a particular HBase table that I am making a UI for happen to have the same columns, and will for the foreseeable future. I would like my HTML data-visualizer application to simply query a single random row, take note of the column names, and put that list of column names into a variable to refer to throughout the program.
I didn't see any equivalent to find_one or scan_one in the docs for HappyBase.
What is the best way to accomplish this?
This will fetch only the first row (the scan generator yields (row_key, row_data) tuples):
row_key, row_data = next(table.scan(limit=1))
Additionally, you can specify a filter string to avoid retrieving the values, which is only worthwhile if your values are large and you're performing this query often.
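A sketch of the whole idea with HappyBase (the connection details are hypothetical; KeyOnlyFilter is the kind of server-side filter string meant above, stripping values before they are sent back). Extracting the names from a returned row dict is pure Python:

```python
def column_names(row_data):
    """Turn a HappyBase row dict's b'family:qualifier' keys into a sorted list of names."""
    return sorted(col.decode() for col in row_data)

# Against a live cluster (hypothetical host/table names):
#   import happybase
#   conn = happybase.Connection('hbase-host')
#   table = conn.table('my_table')
#   _key, data = next(table.scan(limit=1, filter='KeyOnlyFilter()'))
#   cols = column_names(data)

# Offline check against HappyBase-shaped data:
sample = {b'cf:s2': b'', b'cf:s1': b''}
print(column_names(sample))  # ['cf:s1', 'cf:s2']
```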
I use the 'limit' option.
Here is my HappyBase Python code:
import happybase

pool1 = happybase.ConnectionPool(size=2, host='172.00.00.01')
with pool1.connection() as conn1:
    hTable = conn1.table('HBastTableHere')
    for rowKey, rowData in hTable.scan(limit=1):
        print(rowData)
You can create a Scanner object without specifying the start row (so that it starts at the first row in the table) and limit the scan to one row. You will then get the first row only.
From the HBase shell, the command looks like this:
scan 'table_name', {LIMIT => 1}
I don't know HappyBase, but I think you should be able to do the same there.
