Confirming compression is enabled for specific Hive tables

Confirming compression is enabled for specific Hive tables - hadoop

I need to benchmark a series of tables, some compressed and some not. I compress by setting:
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
...and using INSERT OVERWRITE to populate a table. Is there a way to confirm via the command line (similar to DESCRIBE EXTENDED) that output compression is enabled for a particular table?

When you exec describe formatted orc_with_compress_setting_table seeing something like this:
Compressed: No and feeling uncomfortable.
Here is the answer:
The Compressed field is not a reliable indicator of whether the table contains compressed data. It typically always shows No, because the compression settings only apply during the session that loads data and are not stored persistently with the table metadata.
from: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_describe.html

Following will show the path to the table files in hdfs
desc formatted <tablename>
hive> desc formatted cobtest;
OK
col_name data_type comment
col_name data_type comment
batch_sk int None
geo_cd string None
env_cd string None
Detailed Table Information
Database: default
Owner: steve
CreateTime: Thu Nov 21 23:36:37 PST 2013
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:9000/user/hive/warehouse/cobtest
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 1
numPartitions 0
numRows 0
rawDataSize 0
totalSize 473
transient_lastDdlTime 1385105797
Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.203 seconds, Fetched: 34 row(s)
You can then do dfs -lsr and you will notice by the file extensions if they are compressed.
hive> dfs -lsr hdfs://localhost:9000/user/hive/warehouse/cobtest ;
-rw-r--r-- 1 steve supergroup 473 2013-11-21 23:36 /user/hive/warehouse/cobtest/UDFPafCobIndTest.input**.tsv**
UPDATE SNAPPY is the CODEC for compression. You still need to add the following:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

Related

HIVE - SQL insert record

In HIVE SQL , i ran same query in two different cloudera version. Cloudera VM 5.10 is not causing any issue. But another version cloudera -CDH-5.1.0-1.cdh5.1.0.p0.53 is throwing error.
hive> select * from t;
OK
Time taken: 1.803 seconds
hive> insert into table t values (1);
NoViableAltException(26#[])
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:713)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:35992)
at org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:33510)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:33389)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:33169)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1284)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:983)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:434)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:352)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:995)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1038)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:921)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:790)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:623)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:20 cannot recognize input near 'values' '(' '1' in select clause
hive>
Any idea? which version have to choose for my studies. Please advise.

In older versions of CDH like CDH-5.1 insert record is not supported but in new versions of CDH it is supported feature.
So instead of trying insert into values,try with Load data statement
If your file is in Local then
hive> LOAD DATA LOCAL INPATH '<local-path-tofile>' INTO TABLE t;
If your file is in HDFS then
hive> LOAD DATA INPATH 'hdfs_file_or_directory_path' INTO TABLE t;
For more details refer to:-
https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations

What you are trying is right, make sure you have one column in your table. if you multiple columns you need have the values for all the fields. checkout the below image
The version I tried in :
[cloudera#quickstart ~]$ hadoop version
Hadoop 2.6.0-cdh5.5.0
Subversion http://github.com/cloudera/hadoop -r fd21232cef7b8c1f536965897ce20f50b83ee7b2
Compiled by jenkins on 2015-11-09T20:37Z
Compiled with protoc 2.5.0
From source with checksum 98e07176d1787150a6a9c087627562c
This command was run using /usr/jars/hadoop-common-2.6.0-cdh5.5.0.jar

#roh
hive> describe formatted t;
OK
# col_name data_type comment
id int None
# Detailed Table Information
Database: default
Owner: learnhadoop
CreateTime: Fri Feb 23 21:25:15 IST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://hadoopmasters:8020/user/hive/warehouse/t
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1519401315\
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputForm at
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.111 seconds, Fetched: 26 row(s)
hive>

Copy partitioned ORC data files to another external partitioned ORC table

Issue : incorrect row count after copying partitioned folders with ORC
files into another external partitioned ORC table
I have this employee table in dev schema . This table is a external partitioned ORC table.
CREATE EXTERNAL TABLE dev.employee(
empid string,
empname string,
update_gmt_ts timestamp)
PARTITIONED BY (
partition_upd_gmt_ts string)
stored as orc
location '/dev/employee';
I have orc data files inside these partitioned folder.
hdfs dfs -ls /dev/employee
drwxr-xr-x - user1 group1 0 2017-02-08 10:25 /dev/employee/partition_upd_gmt_ts=201609
drwxr-xr-x - user1 group1 0 2017-02-08 10:24 /dev/employee/partition_upd_gmt_ts=201610
When I execute this query
select count(*) from dev.employee where 1=1;
1000 -- correct rowcount
I have another table replica of employee table in prod schema.This is also a external partitioned ORC table.I want to push the same data into that table also.
CREATE EXTERNAL TABLE prod.employee(
empid string,
empname string,
update_gmt_ts timestamp)
PARTITIONED BY (
partition_upd_gmt_ts string)
stored as orc
location '/prod/employee';
so I did a hdfs copy
hdfs dfs -cp /dev/employee/* /prod/employee/
Data got copied .
hdfs dfs -ls /prod/employee
drwxr-xr-x - user1 group1 0 2017-02-08 10:25 /prod/employee/partition_upd_gmt_ts=201609
drwxr-xr-x - user1 group1 0 2017-02-08 10:24 /prod/employee/partition_upd_gmt_ts=201610
But when I executed the count query , I got zero rows.
Could you please help me why I am not getting the same 1000 as row
count.
select count(*) from prod.employee where 1=1;
0 -- wrong rowcount

Insert into bucketed table produces empty table

I`m trying to do insert into bucketed table. When I run the query everything looks fine and I see in reports some amount of wrote bytes. No any errors in Hive logs also.
But when I look into table I have nothing :(
CREATE TABLE test(
test_date string,
test_id string,
test_title string,)
CLUSTERED BY (
text_date)
INTO 100 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION
'hdfs://myserver/data/hive/databases/test.db/test'
TBLPROPERTIES (
'skip.header.line.count'='1',
'transactional' = 'true')
INSERT INTO test.test
SELECT 'test_date', 'test_id', 'test_title' from test2.green
Result
Ended Job = job_148140234567_254152
Loading data to table test.test
Table test.teststats: [numFiles=100, numRows=1601822, totalSize=9277056, rawDataSize=0]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 6 Reduce: 100 Cumulative CPU: 423.34 sec
HDFS Read: 148450105
HDFS Write: 9282219
SUCCESS
hive> select * from test.test limit 2;
OK
Time taken: 0.124 seconds
hive>

Is this query really working? You have extra comma after in line
test_title string,)
also coulmn text_date isnt in your you column definition. May be you meant test_date?
CLUSTERED BY (text_date)

Using spark/scala, I use saveAsTextFile() to HDFS, but hiveql("select count(*) from...) return 0

I created external table as follows...
hive -e "create external table temp_db.temp_table (a char(10), b int) PARTITIONED BY (PART_DATE VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/work/temp_db/temp_table'"
And I use saveAsTextFile() with scala in IntelliJ IDEA as follows...
itemsRdd.map(_.makeTsv).saveAsTextFile("hdfs://work/temp_db/temp_table/2016/07/19")
So the file(fields terminated by '\t') was in the /work/temp_db/temp_table/2016/07/19.
hadoop fs -ls /work/temp_db/temp_table/2016/07/19/part-00000 <- data file..
But, I checked with hiveql, there are no datas as follows.
hive -e "select count(*) from temp_db.temp_table" -> 0.
hive -e "select * from temp_db.temp_table limit 5" -> 0 rows fetched.
Help me what to do. Thanks.

you are saving at wrong location from spark. Partition dir name follows part_col_name=part_value.
In Spark: save file at directory part_date=2016%2F07%2F19 under temp_table dir
itemsRdd.map(_.makeTsv)
.saveAsTextFile("hdfs://work/temp_db/temp_table/part_date=2016%2F07%2F19")
add partitions: You will need to add partition that should update hive table's metadata (partition dir we have created from spark as hive expected key=value format)
alter table temp_table add partition (PART_DATE='2016/07/19');
[cloudera#quickstart ~]$ hadoop fs -ls /user/hive/warehouse/temp_table/part*|awk '{print $NF}'
/user/hive/warehouse/temp_table/part_date=2016%2F07%2F19/part-00000
/user/hive/warehouse/temp_table/part_date=2016-07-19/part-00000
query partitioned data:
hive> alter table temp_table add partition (PART_DATE='2016/07/19');
OK
Time taken: 0.16 seconds
hive> select * from temp_table where PART_DATE='2016/07/19';
OK
test1 123 2016/07/19
Time taken: 0.219 seconds, Fetched: 1 row(s)
hive> select * from temp_table;
OK
test1 123 2016/07/19
test1 123 2016-07-19
Time taken: 0.199 seconds, Fetched: 2 row(s)
For Everyday process: you can run saprk job like this - just add partitions right after saveAsTextFile(), aslo note the s in alter statement. it is need to pass variable in hive sql from spark:
val format = new java.text.SimpleDateFormat("yyyy/MM/dd")
vat date = format.format(new java.util.Date())
itemsRDD.saveAsTextFile("/user/hive/warehouse/temp_table/part=$date")
val hive = new HiveContext(sc)
hive.sql(s"alter table temp_table add partition (PART_DATE='$date')")
NOTE: Add partition after saving the file or else spark will throw directory already exist exception as hive creates dir (if not exist) when adding partition.

How to retrieve data from a specific bucket in hive

I created a table in hive
create table HiveMB
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
stored as orc TBLPROPERTIES ('transactional'='true') ;
where my file format is like
1,Anne,Admin,50000,A
2,Gokul,Admin,50000,B
3,Janet,Sales,60000,A
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
and the data went into 3 buckets for department.
When I examined the warehouse , there are 3 buckets
Found 3 items
-rwxr-xr-x 3 aibladmin hadoop 252330 2014-11-28 14:46 /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00000
-rwxr-xr-x 3 aibladmin hadoop 100421 2014-11-28 14:45 /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00001
-rwxr-xr-x 3 aibladmin hadoop 313047 2014-11-28 14:46 /user/hive/warehouse/hivemb/delta_0000012_0000012/bucket_00002
How will I be able to retrieve 1 such bucket.
When I did a -cat, It is not in human readable format.
showing something like
`J�ǉ�(��rwNj��[��Y���gR�� \�B�Q_Js)�6 �st�A�6�ixt� R �
ޜ�KT� e����IL Iԋ� ł2�2���I�Y��FC8 /2�g� ����� > ������q�D � b�` `�`���89$ $$ ����I��y|#޿
%\���� �&�ɢ`a~ � S �$�l�:y���K $�$����X�X��)Ě���U*��
6. �� �cJnf� KHjr�ć����� ��(p` ��˻_1s �5ps1: 1:I4L\��u
How can I able to see the data stored into each bucket?
And my file is in csv format not ORC so as a workaround I did this
But I am not able to view data in buckets. That is not in human readable format.

i am uploading orc screen shot which was produce from this hive queries:
create table stackOverFlow
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
row format delimited
fields terminated by ',';
load data local inpath '/home/ravi/stack_file.txt'
overwrite into table stackOverFlow;
and
create table stackOverFlow6
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
row format delimited
fields terminated by ','
stored as orc tblproperties ("orc.compress"="ZLIB");
insert overwrite table stackOverFlow6 select * from stackOverFlow;
generated ORC result file for above hive queries:

create table HiveMB1
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
row format delimited
fields terminated by ',';
load data local inpath '/home/user17/Data/hive.txt'
overwrite into table HiveMB1;
create table HiveMB2
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
row format delimited
fields terminated by ',';
insert overwrite table HiveMB2 select * from HiveMB1 ;
user17#BG17:~$ hadoop dfs -ls /user/hive/warehouse/hivemb2
Found 3 items
-rw-r--r-- 1 user17 supergroup 22 2014-12-01 15:52 /user/hive/warehouse/hivemb2/000000_0
-rw-r--r-- 1 user17 supergroup 44 2014-12-01 15:53 /user/hive/warehouse/hivemb2/000001_0
-rw-r--r-- 1 user17 supergroup 43 2014-12-01 15:53 /user/hive/warehouse/hivemb2/000002_0
user17#BG17:~$ hadoop dfs -cat /user/hive/warehouse/hivemb2/000000_0
2,Gokul,Admin,50000,B
user17#BG17:~$ hadoop dfs -cat /user/hive/warehouse/hivemb2/000001_0
4,Hari,Admin,50000,C
5,Sanker,Admin,50000,C
user17#BG17:~$ hadoop dfs -cat /user/hive/warehouse/hivemb2/000002_0
1,Anne,Admin,50000,A
3,Janet,Sales,60000,A

your table:
> create table HiveMB
(EmployeeID Int,FirstName String,Designation String,Salary Int,Department String)
clustered by (Department) into 3 buckets
stored as orc TBLPROPERTIES ('transactional'='true') ;
you are chosen table as a ORC format, which means it compresses actual the data and stores the compressed data.

You can see the orc format for a bucket by the command :
hive --orcfiledump [path-to-the-bucket]

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Confirming compression is enabled for specific Hive tables - hadoop

Related

HIVE - SQL insert record

Copy partitioned ORC data files to another external partitioned ORC table

Insert into bucketed table produces empty table

Using spark/scala, I use saveAsTextFile() to HDFS, but hiveql("select count(*) from...) return 0

How to retrieve data from a specific bucket in hive

Categories

Resources