How to retrieve specific rows from a table in HBase?

I have a table in HBase whose row key is "user_name" + "id", for example "username123".
I want to retrieve all rows for a specific user_name. For example, if I have rows with keys "john1", "john2", ..., I want to retrieve all of john's rows.
How can I do it?

Use PrefixFilter. For the Java API, the answer is here: Hbase Java API: Retrieving all rows that match a Partial Row Key
In the HBase shell, use PrefixFilter too:
scan 'tablename', {FILTER => "(PrefixFilter ('username'))"}
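For reference, a minimal Java sketch of the same prefix scan (the table name is a placeholder; the prefix "john" is the example from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Scan all rows whose key starts with "john".
Configuration conf = HBaseConfiguration.create();
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("tablename"))) {
    Scan scan = new Scan();
    scan.setFilter(new PrefixFilter(Bytes.toBytes("john")));
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getRow()));
        }
    }
}

Newer client versions also offer Scan#setRowPrefixFilter, which additionally sets the scan's start row so the scan does not have to begin at the start of the table.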

Related

HBase Need to export data from one cluster and import it to another with slight modification in row key

I am trying to export data from the HBase table 'mytable' for rowkeys starting with 'abc'.
scan 'mytable', {ROWPREFIXFILTER => 'abc'}
The exported data needs to be imported into another cluster with the rowkey prefix changed from 'abc' to 'def'.
Old Data:
hbase(main):002:0> scan 'mytable', {ROWPREFIXFILTER => 'abc'}
ROW COLUMN+CELL
abc-6535523 column=track:aid, timestamp=1339121507633, value=some stream/pojos
New Data: (In another cluster)
hbase(main):002:0> get 'mytable', 'def-6535523'
ROW COLUMN+CELL
def-6535523 column=track:aid, timestamp=1339121507633, value=some stream/pojos
Only part of the row key needs to be modified; all other data must stay the same.
I tried bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name file:///tmp/db_dump/, but Export has no provision to specify a start row and end row, and I don't know how to import the dump with a changed rowkey.
Also, is there anything built into HBase/Hadoop to achieve this?
Please help.
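One hedged approach, assuming the dataset is small enough for a client-side copy: scan the source table with the prefix, rewrite each key, and put the cells (keeping their original timestamps) into the target cluster. A minimal Java sketch, with the connection setup and names as assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Copy every row with prefix "abc" into the target cluster with prefix "def",
// keeping families, qualifiers, values and timestamps unchanged.
Configuration srcConf = HBaseConfiguration.create(); // configured for the source cluster
Configuration dstConf = HBaseConfiguration.create(); // configured for the target cluster
try (Connection src = ConnectionFactory.createConnection(srcConf);
     Connection dst = ConnectionFactory.createConnection(dstConf);
     Table in  = src.getTable(TableName.valueOf("mytable"));
     Table out = dst.getTable(TableName.valueOf("mytable"))) {
    Scan scan = new Scan();
    scan.setRowPrefixFilter(Bytes.toBytes("abc"));
    try (ResultScanner scanner = in.getScanner(scan)) {
        for (Result result : scanner) {
            String newKey = "def" + Bytes.toString(result.getRow()).substring("abc".length());
            Put put = new Put(Bytes.toBytes(newKey));
            for (Cell cell : result.rawCells()) {
                put.addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
                              cell.getTimestamp(), CellUtil.cloneValue(cell));
            }
            out.put(put);
        }
    }
}

For large tables the same rekeying loop can be moved into a MapReduce job (in the spirit of CopyTable), but the client-side version is the simplest starting point.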

Is there any order of columns while creating a Hive table that needs to be partitioned dynamically?

I am trying to load an RDBMS table into Hive. I need to partition the table dynamically based on a column's data. The schema of the Greenplum table is below:
forecast_id:bigint
period_year:numeric(15,0)
period_num:numeric(15,0)
period_name:character varying(15)
drm_org:character varying(10)
ledger_id:bigint
currency_code:character varying(15)
source_system_name:character varying(30)
source_record_type:character varying(30)
xx_last_update_log_id:integer
xx_data_hash_code:character varying(32)
xx_data_hash_id:bigint
xx_pk_id:bigint
When I checked the schema of the same table in Hive (where it is replicated), I ran describe extended tablename and got the schema below:
forecast_id bigint
period_year bigint
period_num bigint
period_name string
drm_org string
ledger_id bigint
currency_code string
source_record_type string
xx_last_update_log_id int
xx_data_hash_code string
xx_data_hash_id bigint
xx_pk_id bigint
source_system_name String
So I asked my lead why the column source_system_name comes at the end in the Hive table, and I got the answer: "The columns that are used to partition the Hive table dynamically come at the end of the table."
Is it true that the columns on which a Hive table is dynamically partitioned must come at the end of the schema?
Yes, the order of the columns matters when you use dynamic partitioning in Hive. You can find more details here. From the documentation:
In INSERT ... SELECT ... queries, the dynamic partition columns must
be specified last among the columns in the SELECT statement and in the
same order in which they appear in the PARTITION() clause.
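As an illustration of the rule (the staging table forecast_staging and the Hive table name are assumptions), an INSERT into a table partitioned by source_system_name would list that column last in the SELECT:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE forecast PARTITION (source_system_name)
SELECT forecast_id, period_year, period_num, period_name, drm_org,
       ledger_id, currency_code, source_record_type,
       xx_last_update_log_id, xx_data_hash_code, xx_data_hash_id,
       xx_pk_id,
       source_system_name   -- dynamic partition column goes last
FROM forecast_staging;

That also explains the Hive schema above: since source_system_name is the dynamic partition column, describe lists it after the regular columns.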

hbase rowkey filter on multiple values

I have created an HBase table whose rowkey is a combination of multiple column values.
My rowkeys in HBase look like this:
'123456~raja~ram~45000~mgr~20170116'
'123456~raghu~rajan~65000~mgr~20150106'
I am trying to filter as I would in SQL:
select * from table
where deptid = 123456 and name = 'rajan'
How can I express (or) and (and) conditions?
I am using the code below to filter:
scan 'tablename', {FILTER => (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("123456"))) && (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("rajan")))}
If I use the same code with the values swapped, I get different results:
scan 'tablename', {FILTER => (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("rajan"))) && (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("123456")))}
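A hedged note on why the two scans differ: in the HBase shell, && is plain JRuby boolean logic, not filter composition, so each scan above ends up applying only the second RowFilter. To actually combine both conditions, one option is the shell's filter string language, which supports AND (and OR) between filters:

scan 'tablename', {FILTER => "RowFilter(=, 'substring:123456') AND RowFilter(=, 'substring:rajan')"}

In the Java API the equivalent is a FilterList with Operator.MUST_PASS_ALL (or MUST_PASS_ONE for OR semantics) wrapping the two RowFilters.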

How transfer a Table from HBase to Hive?

How can I transfer an HBase table into Hive correctly?
You can read what I tried before in this question:
How insert overwrite table in hive with diffrent where clauses?
(I made one table to import all the data. The problem is that the data is still in rows, not in columns. So I made three tables for news, social and all, each with a specific where clause, and then ran two joins on them to get the result table. That is six tables in all, which is not really performant!)
To sum up my problem: in HBase there are column families which are saved as rows, like this:
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a data structure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is an approach you can use.
Use the HBase storage handler to create the table in Hive.
Example script:
CREATE TABLE hbase_table_1(key string, value string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into a Hive external table.
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
I am grouping the data by name. The resultant output of the above query will be:
verpassen ["all,1","social,0","news,1"]
Then I wrote a custom mapper which takes that column as input and emits the values:
FROM (SELECT '["all,1","social,0","news,1"]' input FROM TESTTABLE GROUP BY name) d
MAP d.input USING 'python test.py' AS all, social, news
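The mapper itself is not shown here; a minimal sketch of what test.py might look like (the parsing is an assumption based on the collect_set output shown above):

#!/usr/bin/env python
# Minimal sketch of the custom mapper: Hive's MAP clause feeds each input
# row on stdin, e.g. ["all,1","social,0","news,1"], and expects the emitted
# columns back on stdout, tab-separated, in the order all, social, news.
import sys

for line in sys.stdin:
    vals = {}
    for pair in line.strip().strip('[]').split('","'):
        key, value = pair.strip('"').split(',')
        vals[key] = value
    print('\t'.join([vals.get('all', ''), vals.get('social', ''), vals.get('news', '')]))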
Alternatively, you can use the output to insert into another table which has the columns name, all, social, news.
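A hedged alternative that avoids the custom mapper entirely, assuming the Hive-side columns are name, type and val as in the query above, is a conditional-aggregation pivot:

SELECT name,
       MAX(CASE WHEN type = 'news'   THEN val END) AS news,
       MAX(CASE WHEN type = 'social' THEN val END) AS social,
       MAX(CASE WHEN type = 'all'    THEN val END) AS `all`  -- `all` is a reserved word in Hive
FROM TESTTABLE
GROUP BY name;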
Hope this helps

Hive Hadoop : Need to LOAD data into a table based on conditions on the input file

I am new to Hadoop/Hive and have just started basic querying in Hive.
I have an input text file with a large number of fields per line. The format of the file is something like this:
1;23;0;;;;1;3;2;1;1;4;5;6;;;;
1;43;6;;;;1;3;2;1;1;4;5;5;;;;
1;53;7;;;;1;3;2;1;1;4;5;2;;;;
(Each integer before a ";" has a meaning, which I intend to map to a column in the Hive table - each line contains about 400 fields.)
To insert this, I created a table "test" using the following query:
CREATE TABLE test (field1 INT, field2 INT, field3 INT, field4 INT, ... field390 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY "\073";
And I load my text file with the records using LOAD ("\073" is the octal escape for the ';' delimiter):
LOAD DATA LOCAL INPATH '/tmp/test.txt'
OVERWRITE INTO TABLE test;
For now, all fields up to the 50th are inserted into the table accurately; after that I have mismatches.
In my input format, the 50th field in test.txt is an INT that decides how many of the following fields to take.
Example:
50th field: 2 -> Hive has to take the next 2*10 INT field values and insert them into the table.
50th field: 1 -> Hive has to take the next 1*10 INT field values and insert them into the table; the remaining 10 fields can be set to NULL.
(The maximum value of the 50th field is 2, so I have reserved 2*10 fields for this in the table.)
After the 50th + 2*10 fields, the data should be read normally, in the same sequence as before the 50th field.
Is there a way to put a condition on the input so that the data gets inserted into Hive accordingly?
Help would be appreciated. I need a solution that does not require pre-processing test.txt before loading it into the table.
I have tried to answer it at http://www.knowbigdata.com/page/hive-hadoop-need-load-data-table-based-conditions-input-file#comment-85
Does it make sense?
You can use a WHERE clause in Hive.
First load the data into a raw Hive table or onto HDFS, then create the final table and populate it with a query that applies the WHERE clause. For example:
SELECT * FROM table_reference
WHERE name LIKE "%venu%";
Resource: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select
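As a reduced sketch of that two-step idea applied to this question's layout (the target table test_small, its columns, and the exact field positions are assumptions for illustration; the real table has about 390 columns): load each line raw, split it on ';', and use CASE expressions on the control field to decide which columns to fill and where the trailing fields start:

CREATE TABLE test_raw (line STRING);

LOAD DATA LOCAL INPATH '/tmp/test.txt'
OVERWRITE INTO TABLE test_raw;

-- split() yields a 0-based array, so the 50th field is f[49] (the control field).
-- If it is 1, only f[50]..f[59] belong to the variable block; if 2, f[50]..f[69].
INSERT OVERWRITE TABLE test_small
SELECT CAST(f[0] AS INT)  AS field1,
       CAST(f[49] AS INT) AS ctrl,
       CASE WHEN CAST(f[49] AS INT) >= 1 THEN CAST(f[50] AS INT) END AS group1_val1,
       CASE WHEN CAST(f[49] AS INT) >= 2 THEN CAST(f[60] AS INT) END AS group2_val1,
       -- the fields after the variable block shift with ctrl, so index them conditionally too
       CASE WHEN CAST(f[49] AS INT) = 2 THEN CAST(f[70] AS INT) ELSE CAST(f[60] AS INT) END AS tail1
FROM (SELECT split(line, ';') AS f FROM test_raw) t;

This keeps everything in Hive with no pre-processing of test.txt, at the cost of a long but mechanical SELECT list.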
