I am creating a table skeleton using the table property
TBLPROPERTIES('PARQUET.COMPRESSION'='SNAPPY')
(as the files are in Parquet format) and setting a few parameters before creating the table:
set hive.exec.dynamic.partition.mode=nonstrict;
set parquet.enable.dictionary=false;
set hive.plan.serialization.format=javaXML;
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
set avro.output.codec=snappy;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
add jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1168.923/lib/sentry/lib/hive-metastore.jar;
Still, the table is not getting compressed. Could you please let me know why the table is not being compressed?
Thanks in advance for your inputs.
The solution is to use TBLPROPERTIES ('parquet.compression'='SNAPPY') in the DDL (and the case matters) instead of TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY').
You can also achieve the compression by setting the following property in Hive:
set parquet.compression=SNAPPY;
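For example, a minimal DDL sketch (the table and column names here are hypothetical):
CREATE TABLE my_parquet_table (
  customer_id BIGINT,
  name        STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');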
Your Parquet table is probably compressed; you're just not seeing it directly. In Parquet files, the compression is baked into the format. Instead of the whole file being compressed, individual segments are compressed with the specified algorithm. Thus a compressed Parquet file looks, from the outside, the same as an uncompressed one (there is normally no suffix such as .gz, as you cannot decompress these files with the usual tools).
Having the compression baked into the format is one of the many advantages of Parquet. It keeps the files (Hadoop-)splittable regardless of the compression algorithm, and it enables fast access to specific segments of a file without decompressing the whole file. When a query engine processes a query on top of Parquet files, it often only needs to read the small, uncompressed file metadata, determine which segments are relevant for the query, and then decompress only those sections.
Set the parameters below, and after that perform the following steps:
SET parquet.compression=SNAPPY;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
CREATE TABLE Test_Parquet (
customerID int, name string, ..etc
) STORED AS PARQUET LOCATION '';
INSERT INTO Test_Parquet SELECT * FROM Test;
If not, how do I identify a Parquet table with Snappy compression versus one without Snappy compression?
describe formatted tableName
Note: you will always see the compression as NO, because the compression format is not stored in the table's metadata. The best way is to run dfs -ls -R on the table location and inspect the data files for the compression.
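For example, a quick check from the Hive CLI might look like this (the table name and warehouse path are hypothetical); DESCRIBE FORMATTED shows the SerDe, location, and table properties but not the codec, while dfs -ls -R lists the actual data files:
DESCRIBE FORMATTED my_parquet_table;
dfs -ls -R /user/hive/warehouse/my_parquet_table;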
Note: currently the default compression for Impala tables is Snappy.
If your issue isn't resolved after these steps, please post all the steps you are performing.
I see this mistake being made several times; here is what needs to be done (this will only work for Hive, NOT with Spark):
OLD PROPERTY:
TBLPROPERTIES('PARQUET.COMPRESSION'='SNAPPY')
CORRECT PROPERTY:
TBLPROPERTIES('PARQUET.COMPRESS'='SNAPPY')
I have recently created some tables stored as Parquet files with Snappy compression and used the following commands:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
Related
I want to compress a table with Parquet compression in Impala. Is there any method to compress that table, as there are thousands of files in HDFS for that particular table?
Parquet is an encoding, not a compression format; Snappy is a compression format commonly used with Parquet.
It's not clear what your original file types are, but typically simply running an INSERT OVERWRITE query will cause the files to get re-collected and "compacted" into a smaller number of files.
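As a rough sketch of that approach in Impala (the table names and codec choice below are assumptions, not taken from the question):
-- Impala query option controlling the codec used when writing Parquet
SET COMPRESSION_CODEC=snappy;
CREATE TABLE sales_parquet LIKE sales STORED AS PARQUET;
-- rewrites the data as Snappy-compressed Parquet files, compacting the many small files
INSERT OVERWRITE TABLE sales_parquet SELECT * FROM sales;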
I'm trying to read a large gzip file into Hive through the Spark runtime to convert it into SequenceFile format, and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, the same as it does for text files.
Is there a way to change the number of mappers for a gzip file being read? or should I choose another format like parquet?
I'm stuck currently.
The problem is that my log file is JSON-like data saved in txt format and then gzip-ed, so for reading I used org.apache.spark.sql.json.
The examples I have seen that show converting data into SequenceFile use simple delimiters, as in CSV format.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it as something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As a gzip text file is not splittable, only one mapper will be launched; you have to choose other data formats if you want to use more than one mapper.
If you have huge JSON files and want to save storage on HDFS, use bzip2 compression to compress your JSON files on HDFS. You can query the bzip2-compressed JSON files from Hive without modifying anything.
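As a hedged sketch of that idea (the SerDe jar path, table name, and HDFS location below are assumptions; Hive decompresses bzip2 text transparently, and bzip2 is splittable, so multiple mappers can read it):
-- only needed if the HCatalog JSON SerDe is not already on the classpath
ADD JAR /path/to/hive-hcatalog-core.jar;
CREATE EXTERNAL TABLE logs_json (
  id   BIGINT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/logs_json_bz2/';
-- queries read the bzip2-compressed JSON files in place
SELECT id, name FROM logs_json LIMIT 10;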
I am trying to insert data into a dynamically partitioned table, which is creating lots of small files. I have set the Hive properties below, but I still see small files in the partition folders; neither the size-per-task nor the average-file-size setting seems to be working for me, as the files in the partition folders are larger than the size per task I specified.
Any help will be greatly appreciated
hive.merge.mapfiles=true;
hive.merge.mapredfiles=true;
hive.merge.size.per.task=10000;
hive.merge.smallfiles.avgsize=100;
Your example sets the average size to 100 bytes, which would create a lot of small files and is most likely being ignored because the files are already larger than that. Try increasing this value to an average of 128 MB (134217728), which should on average increase the size of the files being merged after your job completes.
set hive.merge.smallfiles.avgsize = 134217728;
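For reference, a sketch of the full set of merge settings with more typical values (the size thresholds below are illustrative assumptions, not required values):
-- merge small files produced by map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- aim for roughly 256 MB per merged file
SET hive.merge.size.per.task=268435456;
-- trigger the merge when the average output file size is below 128 MB
SET hive.merge.smallfiles.avgsize=134217728;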
This can happen when you execute multiple inserts into a single Hive table; a single insert can result in one or more files under the HDFS location.
I have managed this situation by executing the command below, which will compact the table and merge all the files into one (or into bigger ones).
There is one restriction, though: you cannot have indexes on your Hive tables when executing the merge command.
I have also tested this from Spark SQL over ORC files (1.5.2) and it works fine.
ALTER TABLE schema.table PARTITION (month = '01') CONCATENATE
Hope it helps
Working with small files in Hive is a common problem, and it can also be resolved by using CombineHiveInputFormat as the input format. Also use ORC files by default:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
This will help the Hive job run faster for the given small files.
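For example, the combining behaviour is usually tuned together with the split-size limits (the values below are illustrative assumptions):
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- upper bound of a combined split (~256 MB)
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
-- lower bound of a combined split (~128 MB)
SET mapreduce.input.fileinputformat.split.minsize=134217728;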
From what I have googled around and found, there are ways of creating an ORC table using Hive, but I want an ORC file on which I can run my custom MapReduce job.
Also, please let me know whether the file created by Hive under the warehouse directory for my ORC table is a table file of ORC and not an actual ORC file I can use, like: /user/hive/warehouse/tbl_orc/000000_0
[Wrap-up of the discussion]
- A Hive table is mapped to an HDFS directory (or a list of directories, if the table is partitioned).
- All files in that directory use the same SerDe (ORC, Parquet, AVRO, Text, etc.) and have the same column set; all together, they contain all the data available for that table.
- Each file in that directory is the result of a previous MapReduce job -- either a Hive INSERT, a Pig dataset saved via HCatalog, a Spark dataset saved via HiveContext, or any custom job that happens to drop a file there, hopefully compliant with the table's SerDe and schema (retrieved via the MetastoreClient Java API, or via the HCatalog API, whatever).
- Note that a single job with 3 reducers will probably create 3 new files (and maybe 1 empty file + 1 small file + 1 big file!); and a job with 24 mappers and no reducer will create 24 files, unless some kind of "merge small files" post-processing step is enabled.
- Note also that most file names give absolutely no information about the way the file is encoded internally; they are just sequence numbers (i.e. the 5th job to add 12 files will typically create files 000004_0 to 000004_11).
All in all, processing an ORC fileset with a Java MapReduce program should be very similar to processing a text fileset. You just have to provide the correct SerDe and the correct field mapping -- I think the compression algorithm is explicit in the files, so the SerDe handles it auto-magically at read time. Just remember that ORC files are not splittable at the record level but at the stripe level (a stripe is a bunch of records stored in columnar format with tokenization and optional compression).
Of course, that will not give you access to ORC advanced features such as vectorization or stripe pruning (somewhat similar to "smart scan" in Oracle Exadata).
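To make the file layout concrete, here is a minimal Hive sketch (the table and source names are hypothetical); each file it produces under the table directory (e.g. 000000_0) is a plain ORC file that a custom MapReduce job can read, for instance with OrcInputFormat:
CREATE TABLE tbl_orc (id BIGINT, name STRING) STORED AS ORC;
INSERT INTO TABLE tbl_orc SELECT id, name FROM some_source_table;
-- from the Hive CLI, list the data files created by the insert
dfs -ls /user/hive/warehouse/tbl_orc;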
I need to enable block compression for SequenceFile data. Below is the table, which will be stored as a SequenceFile.
create table lip_data_quality
( buyer_id bigint,
total_chkout bigint,
total_errpds bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/b_apdpds/lip-data-quality'
;
And for the above table, I am getting the data in compressed form by enabling these commands:
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
So my question is: is that all I need to enable BLOCK compression with SequenceFile, or is there anything else I need to do? I was following this Hadoop article.
Any suggestion will be appreciated.
Update:
I am loading the data into the above table by putting everything in an .hql file and running that .hql file from the shell command prompt, changing the partition date each time I run it.
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
insert overwrite table lip_data_quality partition (dt='20120712')
SELECT query here which will give the output for the above table.
That should be fine then. You can also verify it by looking at the files on HDFS. After your load there should be a directory in HDFS named /user/hive/warehouse/lip_data_quality/dt=20120712. If you run
hadoop fs -cat
on one of the files in that folder, you should be able to see the header of the file, which will give you basic info about the file.
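For example, from the Hive CLI (the partition path follows the LOCATION clause of the DDL above, so adjust it for your setup):
dfs -ls /apps/hdmi-technology/b_apdpds/lip-data-quality/dt=20120712;
Dumping the first bytes of one of those files with hadoop fs -cat shows the SEQ magic header, which names the key/value classes and the compression codec in use.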
Set the below properties before submitting the job (shown here with the org.apache.hadoop.mapreduce.Job API):
job.getConfiguration().set("mapred.output.compress", "true");
job.getConfiguration().set("mapred.output.compression.type", "BLOCK");
job.getConfiguration().set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.DefaultCodec");
Instead of DefaultCodec, one can use org.apache.hadoop.io.compress.LzoCodec.