update query in hive/hbase - hadoop

I have already created a table in HBase using Hive:
hive> CREATE TABLE hbase_table_emp(id int, name string, role string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cf1:role")
TBLPROPERTIES ("hbase.table.name" = "emp");
and created another table to load data into it:
hive> create table testemp(id int, name string, role string) row format delimited fields terminated by '\t';
hive> load data local inpath '/home/user/sample.txt' into table testemp;
and finally inserted the data into the HBase table:
hive> insert overwrite table hbase_table_emp select * from testemp;
hive> select * from hbase_table_emp;
OK
123 Ram TeamLead
456 Silva Member
789 Krishna Member
Time taken: 0.160 seconds, Fetched: 3 row(s)
The table looks like this in HBase:
hbase(main):002:0> scan 'emp'
ROW COLUMN+CELL
123 column=cf1:name, timestamp=1422540225254, value=Ram
123 column=cf1:role, timestamp=1422540225254, value=TeamLead
456 column=cf1:name, timestamp=1422540225254, value=Silva
456 column=cf1:role, timestamp=1422540225254, value=Member
789 column=cf1:name, timestamp=1422540225254, value=Krishna
789 column=cf1:role, timestamp=1422540225254, value=Member
3 row(s) in 2.1230 seconds
Now I'm trying to update a value in this table.
For example, I want to change the "role" of "Ram" from "TeamLead" to "Member".
Which query should I use?

Assuming that you are trying to overwrite the previous value, from the HBase shell you can run the following:
put 'emp', '123', 'cf1:role', 'Member', 1422540225254
It's important to use the same timestamp as the previous entry if your goal is to overwrite.
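You can then verify the overwrite from the HBase shell with a get, e.g.:
get 'emp', '123', 'cf1:role'
which should now show value=Member for that cell.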

You can do it with Hive since v0.14:
INSERT INTO TABLE hbase_table_emp VALUES (123, null, "Member");
You've got to provide the key you want to update and null values on the fields you don't want to touch (the storage handler simply skips null columns, so the existing cells are preserved). Yeah, it's odd, like having to compile and run a MapReduce job for a single update, but it's also odd to expect Hive+HBase to work like a regular RDBMS and provide full ACID support in the process :)
For updating data I'd stick to the APIs provided by HBase (Stargate, Thrift, the native Java API, or even the HBase shell) and use Hive just for mass imports and data analysis.
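For example, a single-cell update from the HBase shell needs no timestamp at all; the new cell simply gets a newer timestamp and becomes the current version:
hbase(main):003:0> put 'emp', '123', 'cf1:role', 'Member'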

Related

Unable to map the HBase row key in Hive table effectively

I have a HBase table where the rowkey looks like this.
08:516485815:2013 1
06:260070837:2014 1
00:338289200:2014 1
I create a Hive link table using the below query.
create external table hb
(key string,value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,e:-1")
tblproperties("hbase.table.name"="hbaseTable");
When I query the table I get the below result
select * from hb;
08:516485815 1
06:260070837 1
00:338289200 1
This is very strange to me. Why is the SerDe not able to map the whole content of the HBase key? The Hive table is missing everything after the second ':'.
Has anybody faced a similar kind of issue?
I tried recreating your scenario on HBase 1.1.2 and Hive 1.2.1000; it works as expected and I am able to get the whole row key from Hive.
hbase> create 'hbaseTable','e'
hbase> put 'hbaseTable','08:516485815:2013','e:-1','1'
hbase> scan 'hbaseTable'
ROW COLUMN+CELL
08:516485815:2013 column=e:-1, timestamp=1519675029451, value=1
1 row(s) in 0.0160 seconds
With 08:516485815:2013 as the row key, I created the Hive table:
hive> create external table hb
(key string,value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,e:-1")
tblproperties("hbase.table.name"="hbaseTable");
hive> select * from hb;
+--------------------+-----------+--+
| hb.key | hb.value |
+--------------------+-----------+--+
| 08:516485815:2013 | 1 |
+--------------------+-----------+--+
Can you make sure your HBase table's row key actually contains the data after the second ':'?
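A quick way to check this from the HBase shell (a minimal sketch against your table):
hbase> scan 'hbaseTable', {LIMIT => 3}
If the row keys printed by the scan are already truncated, the problem is in the data itself rather than in the Hive mapping.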

Insert into bucketed table produces empty table

I'm trying to do an insert into a bucketed table. When I run the query everything looks fine and the job counters report some bytes written. There are no errors in the Hive logs either.
But when I look into the table I have nothing :(
CREATE TABLE test(
test_date string,
test_id string,
test_title string,)
CLUSTERED BY (
text_date)
INTO 100 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION
'hdfs://myserver/data/hive/databases/test.db/test'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transactional' = 'true')
INSERT INTO test.test
SELECT 'test_date', 'test_id', 'test_title' from test2.green
Result
Ended Job = job_148140234567_254152
Loading data to table test.test
Table test.test stats: [numFiles=100, numRows=1601822, totalSize=9277056, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 Reduce: 100 Cumulative CPU: 423.34 sec
HDFS Read: 148450105
HDFS Write: 9282219
SUCCESS
hive> select * from test.test limit 2;
OK
Time taken: 0.124 seconds
hive>
Is this query really working? You have an extra comma in this line:
test_title string,)
Also, the column text_date isn't in your column definition. Maybe you meant test_date?
CLUSTERED BY (text_date)
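A corrected version of the DDL might look like this (a sketch, assuming test_date is the intended bucketing column):
CREATE TABLE test(
test_date string,
test_id string,
test_title string)
CLUSTERED BY (test_date)
INTO 100 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION 'hdfs://myserver/data/hive/databases/test.db/test'
TBLPROPERTIES ('skip.header.line.count'='1', 'transactional'='true');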

Alter hive table add or drop column

I have an ORC table in Hive and I want to drop a column from this table:
ALTER TABLE table_name drop col_name;
but I am getting the following exception:
Error occurred executing hive query: OK FAILED: ParseException line 1:35 mismatched input 'user_id1' expecting PARTITION near 'drop' in drop partition statement
Can anyone help me or give me an idea how to do this? Note: I am using Hive 0.14.
You cannot drop a column directly from a table using the command ALTER TABLE table_name drop col_name;
The only way to drop a column is to use the REPLACE COLUMNS command. Let's say I have a table emp with id, name and dept columns, and I want to drop the id column of table emp. So list all the columns you want to keep in the REPLACE COLUMNS clause. The command below will drop the id column from the emp table.
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
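You can confirm the new layout afterwards:
hive> DESCRIBE emp;
-- id is no longer listed; only name and dept remain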
There is also a "dumb" way of achieving the end goal: create a new table without the column(s) you don't want. Hive's regex column matching makes this rather easy.
Here is what I would do:
-- make a copy of the old table
ALTER TABLE table RENAME TO table_to_dump;
-- make the new table without the columns to be deleted
CREATE TABLE table AS
SELECT `(col_to_remove_1|col_to_remove_2)?+.+`
FROM table_to_dump;
-- dump the table
DROP TABLE table_to_dump;
If the table in question is not too big, this should work just fine.
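One caveat: on Hive 0.13 and later the backtick regex in the SELECT only works when quoted identifiers are disabled, so you may need to run this first:
SET hive.support.quoted.identifiers=none;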
Suppose you have an external table, viz. organization.employee, as below (not including TBLPROPERTIES):
hive> show create table organization.employee;
OK
CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string,
`updated_by` string,
`updated_date` timestamp)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
You want to remove the updated_by and updated_date columns from the table. Follow these steps:
create a temp table replica of organization.employee as:
hive> create table organization.employee_temp as select * from organization.employee;
drop the main table organization.employee.
hive> drop table organization.employee;
remove the underlying data from HDFS (need to come out of hive shell)
[nameet#ip-80-108-1-111 myfile]$ hadoop fs -rm hdfs://getnamenode/apps/hive/warehouse/organization.db/employee/*
create the table with removed columns as required:
hive> CREATE EXTERNAL TABLE `organization.employee`(
`employee_id` bigint,
`employee_name` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://getnamenode/apps/hive/warehouse/organization.db/employee'
insert the original records back into original table.
hive> insert into organization.employee
select employee_id, employee_name from organization.employee_temp;
finally drop the temp table created
hive> drop table organization.employee_temp;
ALTER TABLE emp REPLACE COLUMNS( name string, dept string);
The above statement can only change the schema of a table, not the data. A solution to this problem is to copy the data into a new table:
INSERT INTO <new_table> SELECT <selective columns> FROM <old_table>;
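For example, with the emp table from the earlier answer (a sketch assuming emp_new was already created with only the name and dept columns):
hive> INSERT INTO TABLE emp_new SELECT name, dept FROM emp;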
ALTER TABLE is not yet supported for non-native tables; i.e. what you get with CREATE TABLE when a STORED BY clause is specified.
check this https://cwiki.apache.org/confluence/display/Hive/StorageHandlers
After a lot of mistakes, and in addition to the explanations above, I would add these simpler answers.
Case 1: Add new column named new_column
ALTER TABLE schema.table_name
ADD COLUMNS (new_column INT COMMENT 'new number column');
Case 2: Rename a column new_column to no_of_days
ALTER TABLE schema.table_name
CHANGE new_column no_of_days INT;
Note that when renaming, both columns should be of the same datatype, like INT above.
For an external table it's simple and easy.
Just fetch the existing CREATE TABLE schema, drop the table, edit the schema, and finally create the table again with the new schema.
Example table: aparup_test.tbl_schema_change; we will drop the column id.
Steps:
------------- show create table to fetch schema ------------------
spark.sql("""
show create table aparup_test.tbl_schema_change
""").show(100,False)
output:
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP, id BIGINT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- drop table --------------------------------
spark.sql("""
drop table aparup_test.tbl_schema_change
""").show(100,False)
------------- edit create table schema by dropping column "id"------------------
spark.sql("""
CREATE EXTERNAL TABLE aparup_test.tbl_schema_change(name STRING, time_details TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'gs://aparup_test/tbl_schema_change'
TBLPROPERTIES (
'parquet.compress' = 'snappy'
)
""")
------------- sync up table schema with parquet files ------------------
spark.sql("""
msck repair table aparup_test.tbl_schema_change
""").show(100,False)
==================== DONE =====================================
Even the query below is working for me:
Alter table tbl_name drop col_name

hive external partitioned table

First I created a Hive external table partitioned by code and date:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/old_work/XYZ';
and then I executed an insert overwrite on this table, taking data from another table:
INSERT OVERWRITE TABLE XYZ PARTITION (CODE,DATE)
SELECT
*
FROM TEMP_XYZ;
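(Aside: for a fully dynamic partition insert like this one, Hive typically requires dynamic partitioning to be enabled first:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;)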
and after that I counted the number of records in Hive:
select count(*) from XYZ;
It showed me 1000 records are there.
Then I renamed/moved the location '/old_work/XYZ' to '/new_work/XYZ', dropped the XYZ table, and created it again pointing the location to the new directory, '/new_work/XYZ':
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ
(
ID STRING,
SAL BIGINT,
NAME STRING
)
PARTITIONED BY (CODE INT,DATE STRING)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat"
OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"
LOCATION '/new_work/XYZ';
But then when I execute select count(*) from XYZ in Hive, it shows 0 records.
I think I missed something. Please help me on this!
You need not drop the table and re-create it the second time.
As soon as you move or rename the external HDFS location of the table, just do this:
msck repair table <table_name>
In your case the count was 0 because the Hive metastore wasn't updated with the new partition paths.
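So in this case (a minimal sketch using the table from the question):
hive> msck repair table XYZ;
hive> select count(*) from XYZ;
MSCK REPAIR TABLE scans the table's HDFS location and registers the partitions it finds in the metastore, which is exactly what the freshly re-created table was missing.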

Log file into Hive

I have a log file "sample.log" which looks like below:
41 Texas 2000
42 Louisiana4 3211
43 Texas 5000
22 Iowa 4998p
In the log file the first column is the id, the second the state name, and the third the amount. If you look at the state name it has Louisiana4, and the sales total has 4998p. How can I cleanse this so I can insert it into Hive (using Python or another way)? Could you please show the steps?
I want to insert into Hive table tblSample:
Table schema is:
CREATE TABLE tblSample(
id int,
state string,
sales int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/cloudera/Staging'
;
To load data into the Hive table I could do:
load data local inpath '/home/cloudera/sample.log' into table tblSample;
Thank you!
You could load the data as is into a Hive table and then use UDFs to cleanse the data and load it into another table. This would be far more efficient than Python, as it will run as a MapReduce job.
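For example (a minimal sketch, assuming a raw staging table tblStaging with all-string columns loaded from sample.log, and the tblSample target from the question):
hive> INSERT OVERWRITE TABLE tblSample
    > SELECT cast(id AS int),
    >        regexp_replace(state, "[0-9]", ""),
    >        cast(regexp_replace(sales, "[^0-9]", "") AS int)
    > FROM tblStaging;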
I would rather store the data as it is and do the cleansing while fetching it. That would be much simpler; no external code is required. For example:
hive> CREATE TABLE tblSample(
> id string,
> state string,
> sales string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/user/cloudera/Staging';
hive> select regexp_replace(state, "[0-9]", ""), regexp_replace(sales, "[a-z]", "") from tblSample;
HTH
