How to retrieve data from a specific bucket in hive - hadoop

I created a table in hive
create table HiveMB
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
stored as orc TBLPROPERTIES ('transactional'='true') ;
where my input file looks like
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
and the data went into 3 buckets by department.
When I examined the warehouse, there were 3 buckets:
Found 3 items
-rwxr-xr-x 3 aibladmin hadoop 252330 2014-11-28 14:46 /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00000
-rwxr-xr-x 3 aibladmin hadoop 100421 2014-11-28 14:45 /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00001
-rwxr-xr-x 3 aibladmin hadoop 313047 2014-11-28 14:46 /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00002
How will I be able to retrieve one such bucket?
When I did a -cat, it is not in human-readable format, showing something like
`J�lj�(��rwNj��[��Y���gR�� \�B�Q_Js)�6 �st�A�6�ixt� R �
ޜ�KT� e����IL Iԋ� ł2�2���I�Y��FC8 /2�g� ����� > ������q�D � b�` `�`���89$ $$ ����I��y|#޿
%\���� �&�ɢ`a~ � S �$�l�:y���K $�$����X�X��)Ě���U*��
6. �� �cJnf� KHjr�ć����� ��(p` ��˻_1s �5ps1: 1:I4L\��u
How can I see the data stored in each bucket?
My file is in CSV format, not ORC, so as a workaround I did this.
But I am not able to view the data in the buckets; it is not in human-readable format.

I am uploading an ORC screenshot which was produced from these Hive queries:
create table stackOverFlow
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
row format delimited
fields terminated by ',';
load data local inpath '/home/ravi/stack_file.txt'
overwrite into table stackOverFlow;
and
create table stackOverFlow6
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
row format delimited
fields terminated by ','
stored as orc tblproperties ("orc.compress"="ZLIB");
insert overwrite table stackOverFlow6 select * from stackOverFlow;
The generated ORC result file for the above Hive queries is shown in the screenshot.

create table HiveMB1
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
row format delimited
fields terminated by ',';
load data local inpath '/home/user17/Data/hive.txt'
overwrite into table HiveMB1;
create table HiveMB2
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
row format delimited
fields terminated by ',';
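A hedged note (an assumption about older Hive releases, not part of the original answer): bucketing is only honoured during the insert if enforcement is switched on first, so the following setting may be needed before the insert below; newer Hive versions enforce it by default:
set hive.enforce.bucketing = true;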
insert overwrite table HiveMB2 select * from HiveMB1 ;
user17@BG17:~$ hadoop dfs -ls /user/hive/warehouse/hivemb2
Found 3 items
-rw-r--r-- 1 user17 supergroup 22 2014-12-01 15:52 /user/hive/warehouse/hivemb2/000000_0
-rw-r--r-- 1 user17 supergroup 44 2014-12-01 15:53 /user/hive/warehouse/hivemb2/000001_0
-rw-r--r-- 1 user17 supergroup 43 2014-12-01 15:53 /user/hive/warehouse/hivemb2/000002_0
user17@BG17:~$ hadoop dfs -cat /user/hive/warehouse/hivemb2/000000_0
2,Gokul,Admin,50000,B
user17@BG17:~$ hadoop dfs -cat /user/hive/warehouse/hivemb2/000001_0
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
user17@BG17:~$ hadoop dfs -cat /user/hive/warehouse/hivemb2/000002_0
1,Anne,Admin,50000,A
3,Janet,Sales,60000,A
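As a side note (not part of the original answer), if the goal is simply to read the rows of one bucket without touching the HDFS paths directly, Hive's TABLESAMPLE clause can address buckets by number; a minimal sketch against the bucketed table above:
select * from HiveMB2 tablesample(bucket 2 out of 3 on Department) s;
When the table is bucketed on the sampled column, Hive can prune the scan down to that single bucket file.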

Your table:
> create table HiveMB
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
stored as orc TBLPROPERTIES ('transactional'='true') ;
You have chosen ORC as your table format, which means Hive compresses the actual data and stores it in compressed form.

You can inspect the ORC file for a bucket with the command:
hive --orcfiledump [path-to-the-bucket]
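For example, against one of the delta bucket files shown in the question (and, on Hive versions that support it, the -d flag is supposed to dump the actual rows rather than just the file metadata):
hive --orcfiledump /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00000
hive --orcfiledump -d /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00000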

Related

Unable to map the HBase row key in HIve table effectively

I have an HBase table where the rowkey looks like this:
08:516485815:2013 1
06:260070837:2014 1
00:338289200:2014 1
I created a Hive external table linked to it using the query below.
create external table hb
(key string,value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,e:-1")
tblproperties("hbase.table.name"="hbaseTable");
When I query the table I get the result below:
select * from hb;
08:516485815 1
06:260070837 1
00:338289200 1
This is very strange to me. Why is the SerDe not able to map the whole content of the HBase key? The Hive table is missing everything after the second ':'.
Has anybody faced a similar kind of issue?
I tried recreating your scenario on HBase 1.1.2 and Hive 1.2.1000; it works as expected and I am able to get the whole rowkey from Hive.
hbase> create 'hbaseTable','e'
hbase> put 'hbaseTable','08:516485815:2013','e:-1','1'
hbase> scan 'hbaseTable'
ROW COLUMN+CELL
08:516485815:2013 column=e:-1, timestamp=1519675029451, value=1
1 row(s) in 0.0160 seconds
Since I have 08:516485815:2013 as the rowkey, I created the Hive table:
hive> create external table hb
(key string,value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping"=":key,e:-1")
tblproperties("hbase.table.name"="hbaseTable");
hive> select * from hb;
+--------------------+-----------+--+
| hb.key | hb.value |
+--------------------+-----------+--+
| 08:516485815:2013 | 1 |
+--------------------+-----------+--+
Can you make sure your HBase table rowkeys actually have data after the second ':'?
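A hedged Hive-side check (just a sketch, not part of the original answer): look for rowkeys that do not contain two ':' separators at all, which would make them appear truncated:
select key from hb where key not like '%:%:%' limit 10;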

Copy partitioned ORC data files to another external partitioned ORC table

Issue: incorrect row count after copying partitioned folders with ORC files into another external partitioned ORC table.
I have this employee table in the dev schema. This table is an external partitioned ORC table.
CREATE EXTERNAL TABLE dev.employee(
empid string,
empname string,
update_gmt_ts timestamp)
PARTITIONED BY (
partition_upd_gmt_ts string)
stored as orc
location '/dev/employee';
I have ORC data files inside these partitioned folders.
hdfs dfs -ls /dev/employee
drwxr-xr-x - user1 group1 0 2017-02-08 10:25 /dev/employee/partition_upd_gmt_ts=201609
drwxr-xr-x - user1 group1 0 2017-02-08 10:24 /dev/employee/partition_upd_gmt_ts=201610
When I execute this query
select count(*) from dev.employee where 1=1;
1000 -- correct rowcount
I have another replica of the employee table in the prod schema. This is also an external partitioned ORC table. I want to push the same data into that table as well.
CREATE EXTERNAL TABLE prod.employee(
empid string,
empname string,
update_gmt_ts timestamp)
PARTITIONED BY (
partition_upd_gmt_ts string)
stored as orc
location '/prod/employee';
So I did an HDFS copy:
hdfs dfs -cp /dev/employee/* /prod/employee/
The data got copied.
hdfs dfs -ls /prod/employee
drwxr-xr-x - user1 group1 0 2017-02-08 10:25 /prod/employee/partition_upd_gmt_ts=201609
drwxr-xr-x - user1 group1 0 2017-02-08 10:24 /prod/employee/partition_upd_gmt_ts=201610
But when I executed the count query, I got zero rows.
Could you please help me understand why I am not getting the same row count of 1000?
select count(*) from prod.employee where 1=1;
0 -- wrong rowcount
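A hedged guess at the cause: hdfs dfs -cp copies the files, but it does not register the new partition directories in the prod metastore, so Hive still sees zero partitions. A sketch of registering them against the table above:
msck repair table prod.employee;
-- or, adding them one by one:
alter table prod.employee add partition (partition_upd_gmt_ts='201609');
alter table prod.employee add partition (partition_upd_gmt_ts='201610');
select count(*) from prod.employee where 1=1;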

Using Spark/Scala, I use saveAsTextFile() to HDFS, but hiveql("select count(*) from ...") returns 0

I created an external table as follows...
hive -e "create external table temp_db.temp_table (a char(10), b int) PARTITIONED BY (PART_DATE VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/work/temp_db/temp_table'"
And I use saveAsTextFile() with Scala in IntelliJ IDEA as follows...
itemsRdd.map(_.makeTsv).saveAsTextFile("hdfs://work/temp_db/temp_table/2016/07/19")
So the file (fields terminated by '\t') was in /work/temp_db/temp_table/2016/07/19.
hadoop fs -ls /work/temp_db/temp_table/2016/07/19/part-00000 <- data file
But when I checked with HiveQL, there is no data, as shown below.
hive -e "select count(*) from temp_db.temp_table" -> 0.
hive -e "select * from temp_db.temp_table limit 5" -> 0 rows fetched.
Please help me figure out what to do. Thanks.
You are saving to the wrong location from Spark. A partition directory name follows the pattern part_col_name=part_value.
In Spark: save the file to a directory part_date=2016%2F07%2F19 under the temp_table dir:
itemsRdd.map(_.makeTsv)
.saveAsTextFile("hdfs://work/temp_db/temp_table/part_date=2016%2F07%2F19")
Add partitions: you will need to add the partition so that the Hive table's metadata is updated (the partition dir we created from Spark is in the key=value format Hive expects):
alter table temp_table add partition (PART_DATE='2016/07/19');
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/temp_table/part*|awk '{print $NF}'
/user/hive/warehouse/temp_table/part_date=2016%2F07%2F19/part-00000
/user/hive/warehouse/temp_table/part_date=2016-07-19/part-00000
Query the partitioned data:
hive> alter table temp_table add partition (PART_DATE='2016/07/19');
OK
Time taken: 0.16 seconds
hive> select * from temp_table where PART_DATE='2016/07/19';
OK
test1 123 2016/07/19
Time taken: 0.219 seconds, Fetched: 1 row(s)
hive> select * from temp_table;
OK
test1 123 2016/07/19
test1 123 2016-07-19
Time taken: 0.199 seconds, Fetched: 2 row(s)
For an everyday process: you can run the Spark job like this. Just add the partition right after saveAsTextFile(); also note the s prefix on the interpolated strings, which is needed to pass a variable into the Hive SQL from Spark:
val format = new java.text.SimpleDateFormat("yyyy/MM/dd")
val date = format.format(new java.util.Date())
itemsRDD.saveAsTextFile(s"/user/hive/warehouse/temp_table/part_date=$date")
val hive = new HiveContext(sc)
hive.sql(s"alter table temp_table add partition (PART_DATE='$date')")
NOTE: Add the partition after saving the file, or else Spark will throw a 'directory already exists' exception, since Hive creates the directory (if it does not exist) when adding the partition.
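A hedged tweak (my assumption, not stated in the answer above): if the URL-encoded part_date=2016%2F07%2F19 directory name is undesirable, formatting the date with dashes (e.g. yyyy-MM-dd) on the Spark side makes the directory name and the partition value match exactly, as the second directory listing above already shows:
alter table temp_table add partition (PART_DATE='2016-07-19');
select * from temp_table where PART_DATE='2016-07-19';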

Store Pig output as Ctrl-A delimited output for import into Hive?

How do I store Pig output as Ctrl-A delimited output for loading into Hive?
To get the expected result you can follow the process mentioned below.
Store your relation using the command below:
STORE <Relation> INTO '<file_path>' USING PigStorage('\u0001');
Expose a Hive table referring to the generated file:
hive>CREATE EXTERNAL TABLE TEMP(
c1 INT,
c2 INT,
c3 INT,
c4 INT
.....
)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '<file_path>';
If the output file is present in a local Linux directory, then create the table:
hive>CREATE TABLE TEMP(
c1 INT,
c2 INT,
c3 INT,
c4 INT
.....
)
ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
and load the data into the table:
hive> load data local inpath '<file_path>' into table temp;
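Before exposing the table, you can optionally confirm the delimiter really is Ctrl-A; a small hedged check (cat -v renders the 0x01 bytes as ^A):
hadoop fs -cat '<file_path>/part-m-00000' | cat -v | head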
Can you try like this?
STORE <OutpuRelation> INTO '<Outputfile>' USING PigStorage('\u0001');
Example:
input.txt
1,2,3,4
5,6,7,8
9,10,11,12
PigScript:
A = LOAD 'input.txt' USING PigStorage(',');
STORE A INTO 'out' USING PigStorage('\u0001');
Output:
1^A2^A3^A4
5^A6^A7^A8
9^A10^A11^A12
UPDATE:
The above Pig script output is stored in a file named 'part-m-00000', and I tried loading this file into Hive. Everything works fine and I didn't see any issue.
hive> create table test_hive(f1 INT,f2 INT,f3 INT,f4 INT);
OK
Time taken: 0.154 seconds
hive> load data local inpath 'part-m-00000' overwrite into table test_hive;
OK
Time taken: 0.216 seconds
hive> select *from test_hive;
OK
1 2 3 4
5 6 7 8
9 10 11 12
Time taken: 0.076 seconds
hive>

Log file into Hive

I have a log file "sample.log" which looks like this:
41 Texas 2000
42 Louisiana4 3211
43 Texas 5000
22 Iowa 4998p
In the log file, the first column is the id, the second is the state name, and the third is the amount. As you can see, the state name column has 'Louisiana4' and the sales total has '4998p'. How can I cleanse the data so I can insert it into Hive (using Python or some other way)? Could you please show the steps?
I want to insert it into the Hive table tblSample.
The table schema is:
CREATE TABLE tblSample(
id int,
state string,
sales int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/cloudera/Staging'
;
To load the data into the Hive table I could do:
load data local inpath '/home/cloudera/sample.log' into table tblSample;
Thank you!
You could load the data as-is into a Hive table and then use UDFs to cleanse the data and load it into another table. This would be far more efficient than Python, as it will run as MapReduce.
I would rather store the data as it is and do the cleansing while fetching it. It would be much simpler, and no external code is required. For example:
hive> CREATE TABLE tblSample(
> id string,
> state string,
> sales string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/user/cloudera/Staging';
hive> select regexp_replace(state, "[0-9]", ""), regexp_replace(sales, "[a-z]", "") from tblSample;
HTH
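And if you do want a permanently cleansed, typed copy (the first suggestion above), a hedged sketch, assuming a hypothetical new table name tblSampleClean:
CREATE TABLE tblSampleClean(
  id int,
  state string,
  sales int);
INSERT OVERWRITE TABLE tblSampleClean
SELECT cast(id as int),
       regexp_replace(state, "[0-9]", ""),
       cast(regexp_replace(sales, "[a-z]", "") as int)
FROM tblSample;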
