hbase rowkey filter on multiple values - hadoop

I have created an Hbase table rowkey with combination of multiple column values.
my data of rowkey in hbase look like below.
'123456~raja~ram~45000~mgr~20170116'
'123456~raghu~rajan~65000~mgr~20150106'
i am trying to filter condition just like in sql
as
select * from table
where deptid =123456 and name='rajan'
how can i do (or) and (and) conditions.
i am using below code to filter the condition
scan 'tablename', {FILTER => (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("123456"))) && (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("rajan")))}
if i use same code with swapping values i am getting different results
scan 'tablename', {FILTER => (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("rajan"))) && (org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),SubstringComparator.new("123456")))}

Related

how to make max function in hive query to ignore _HIVE_DEFAULT_PARTITION__

I have a view which uses max to show the latest partition (which is of format 2021-01, 2021-02, 2021-03, 2021-04). The hive table has _HIVE_DEFAULT_PARTITION__ too.
When we run the query in Impala, max on partitions gives the correct value of 2021-04 ignoring _HIVE_DEFAULT_PARTITION__ but the same do not work when we run the query in Hive as it returns _HIVE_DEFAULT_PARTITION__
Is there any way to make Hive query ignore the default partition if exists while returning max on that column?
You can filter it:
select max(partition_col) from your_table where partition_col != "__HIVE_DEFAULT_PARTITION__"
If you do not need data in __HIVE_DEFAULT_PARTITION__, you can drop it:
ALTER TABLE your_table DROP PARTITION (partition_col='__HIVE_DEFAULT_PARTITION__');
Transforming __HIVE_DEFAULT_PARTITION__ to NULL can be a solution if with max(partition_col) you want to aggregate something else and do not want to excluse __HIVE_DEFAULT_PARTITION__ partition:
select max(case when partition_col = "__HIVE_DEFAULT_PARTITION__" then NULL else partition_col end) as max_partition_col,
--aggregate something else including HIVE_DEFAULT_PARTITION
from your_table

updating a table using hive

Right now I run the following Hive query
CREATE TABLE dwo_analysis.exp_shown AS
SELECT
MIN(sc.date_time) as first_shown_time,
SUBSTR(sc.post_evar12,1,24) as guid,
sc.post_evar238 as experiment_name,
sc.post_evar239 as variant_name
FROM test
WHERE report_suite='adbemmarvelweb.prod'
AND date >= DATE_SUB(CURRENT_DATE,90) AND date < DATE_SUB(CURRENT_DATE, 2)
AND post_prop5 = 'experiment:standard:authenticated:shown'
AND post_evar238 NOT LIKE 'control%'
AND post_evar238 <> ''
AND post_evar239 <> ''
The table test is large. I would like to optimize this query by running it once, and every other time updating the table by getting the last 2 days of data and adding it to the table.
so basically run the above query once and every time run it again but with the condition
WHERE click_date >= DATE_SUB(CURRENT_DATE, 2) AND click_date < DATE_SUB(CURRENT_DATE)
How do I update the table using hive to populate the the rows as mentioned in the condition above?
First, your queries would be quicker if the Hive table were partitioned based on date. Your create table statement isn't inserting into any partitions, therefore I suspect your table is not partitioned. It would also be quicker if the source data were Parquet/ORC
In any case, you can overwrite the table for a date range like so
INSERT OVERWRITE TABLE dwo_analysis.exp_shown
SELECT * FROM test
WHERE click_date
BETWEEN DATE_SUB(CURRENT_DATE, 2) AND CURRENT_DATE;

How to retrieve information from specific rows from table in Hbase?

I have a table in Hbase and the key of this table is "user_name" + "id" for example ("username123").
I want to retrieve all rows for specific user_name for example (if i have some rows with key "john1","john2"..., i want to retrieve all rows for john)
How can i do it ?
Use PrefixFilter. For Java API answer is here Hbase Java API: Retrieving all rows that match a Partial Row Key
In HBase shell PrefixFilter too:
scan 'tablename', {FILTER => "(PrefixFilter ('username'))"}

How transfer a Table from HBase to Hive?

How can I tranfer a HBase table into Hive correctly?
What I tried before can you read in this question
How insert overwrite table in hive with diffrent where clauses?
( I made one table to import all data. The problem here is that data is still in rows and not in columns. So I made 3 tables for news, social and all with a specific where clause. After that I made 2 Joins on the tables which is giving me the result table. So I had 6 Tables at all which is not really performant!)
to sum my problem up : In HBase are column familys which are saved as rows like this.
count verpassen news 1
count verpassen social 0
count verpassen all 1
What I want to achieve in Hive is a datastructure like this:
name news social all
verpassen 1 0 1
How am I supposed to do this?
Below is the approach use can use.
use hbase storage handler to create the table in hive
example script
CREATE TABLE hbase_table_1(key string, value string) STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:val")
TBLPROPERTIES ("hbase.table.name" = "test");
I loaded the sample data you have given into hive external table.
select name,collect_set(concat_ws(',',type,val)) input from TESTTABLE
group by name ;
i am grouping the data by name.The resultant output for the above query will be
Now i wrote a custom mapper which takes the input as input parameter and emits the values.
from (select '["all,1","social,0","news,1"]' input from TESTTABLE group by name) d MAP d.input Using 'python test.py' as
all,social,news
alternatively you can use the output to insert into another table which has column names name,all,social,news
Hope this helps

how to pass parameter to oracle update statement from csv file and excluding null values from csv

I have a situation where I have following csv file(say file.csv) with following data:
AcctId,Name,OpenBal,closingbal
1,abc,1000,
2,,0,
3,xyz,,
4,,,
how can I loop through this file using unix shell and say for example for column $2 (Name) , I want to get all occurances of Name column accept null values and pass it to for example following oracle query with single quotes '','' format?
select * from account
where name in (collection of values from csv file column name
but excluding null values)
and openbal in
and same thing for column 3 (collection of values from csv file column Openbal
but excluding null values)
and same thing for column 4 (collection of values from csv file column
closingbal but excluding null values)
In short what I want is pass the csv column values as input parameter to oracle sql query and update query too ? but again I dont want to include null values in it. If a column is entirely null for all rows I want to exclude it too?
Not sure why you'd want to loop through this file in a unix shell script: perhaps because you can't think of any better approach? Anyway, I'm going to skip that and offer a pure Oracle solution.
We can expose data in CSV files to the database using external tables. These are like regular tables except their data comes from files in OS directories on the database server (rather than the database's storage). Find out more.
Given this approach it is easy to write the query you want. I suggest using sub-query factoring to select from the external table once.
with cte as ( select name, openbal, closingbal
from your_external_tab )
select *
from account a
where a.name in ( select cte.name from cte )
and a.openbal in ( select cte.openbal from cte )
and a.closingbal in ( select cte.closingbal from cte )
The behaviour of the IN clause is to exclude NULL from consideration.
Incidentally, that will return a different (larger) result set from this:
select a.*
from account a
, your_external_table e
where a.name = e.name
and a.openbal= e.openbal
and a.closingbal = e.closingbal

Resources