I want to compress a table using Parquet compression in Impala. Is there a way to compress the table, given that it has thousands of files in HDFS?
Parquet is an encoding, not a compression format. Snappy is a compression format commonly used with Parquet.
It's not clear what your original file types are, but typically just running an INSERT OVERWRITE TABLE query will cause the files to be rewritten and "compacted" into a smaller number of files.
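As a rough sketch of that compaction step in Impala: the table names sales_raw and sales_compacted are hypothetical, and the example assumes an unpartitioned source table. COMPRESSION_CODEC is the Impala query option that controls the codec used for Parquet data files written by the session.

```sql
-- Impala sketch; sales_raw and sales_compacted are hypothetical table names.
-- Assumes sales_raw is not partitioned (a partitioned target would need a PARTITION clause).

-- Write Parquet data files compressed with Snappy.
SET COMPRESSION_CODEC=snappy;

-- Create an empty Parquet table with the same columns as the source.
CREATE TABLE sales_compacted LIKE sales_raw STORED AS PARQUET;

-- Rewrite the data; the many small source files are read and written out again
-- as a smaller number of Parquet files.
INSERT OVERWRITE TABLE sales_compacted SELECT * FROM sales_raw;

-- Check the number of files and their total size.
SHOW TABLE STATS sales_compacted;
```

How many files come out depends on the data volume and on how many nodes take part in the insert, so a single file is not guaranteed.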
I am creating one table skeleton using the table properties as
TBLPROPERTIES('PARQUET.COMPRESSION'='SNAPPY')
(as the files are in parquet format) and setting few of the parameters before creating the table as :
set hive.exec.dynamic.partition.mode=nonstrict;
set parquet.enable.dictionary=false;
set hive.plan.serialization.format=javaXML;
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
set avro.output.codec=snappy;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
add jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1168.923/lib/sentry/lib/hive-metastore.jar;
Still, the table is not getting compressed. Could you please tell me why the table is not being compressed?
Thanks in advance for your input.
The solution is to use TBLPROPERTIES ('parquet.compression'='SNAPPY') in the DDL (the case matters) instead of TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY').
You can also enable the compression by setting the following property in Hive:
set parquet.compression=SNAPPY
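For illustration, a minimal Hive DDL using the lowercase property might look like the sketch below. The table and column names (events_parquet, events_text, id, name) are hypothetical and not taken from the question.

```sql
-- Hive sketch; events_parquet and events_text are hypothetical tables.
CREATE TABLE events_parquet (
  id   BIGINT,
  name STRING
)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Data written by this insert should pick up the codec from the table property.
INSERT OVERWRITE TABLE events_parquet
SELECT id, name FROM events_text;

-- Confirm that the property was recorded on the table.
SHOW TBLPROPERTIES events_parquet;
```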
Your Parquet table is probably compressed, but you are not seeing that directly. In Parquet files, the compression is baked into the format. Instead of the whole file being compressed, individual segments are compressed using the specified algorithm. Thus a compressed Parquet file will look from the outside the same as an uncompressed one (normally they don't carry a suffix like ordinary compressed files do (e.g. .gz), as you cannot decompress them using the usual tools).
Having the compression baked into the format is one of the many advantages of Parquet. It makes the files (Hadoop-)splittable independently of the compression algorithm, and it enables fast access to specific segments of the file without the need to decompress the whole file. For a query engine processing a query on top of Parquet files, this means that it often only needs to read the small, uncompressed file metadata (stored in the file footer), see which segments are relevant for the query, and then decompress only those segments.
Set the parameters below, and then perform the following steps:
SET parquet.compression=SNAPPY;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
CREATE TABLE Test_Parquet (
customerID int, name string, ..etc
) STORED AS PARQUET LOCATION '';
INSERT INTO Test_Parquet SELECT * FROM Test;
If not, how do I tell a Parquet table with Snappy compression apart from a Parquet table without it?
describe formatted tableName
Note: you will always see the compression reported as NO, because the compression format is not stored in
the metadata of the table. The best way is to run dfs -ls -R on the table location and look at the files themselves to check for compression.
Note: Snappy is currently the default compression for Parquet tables in Impala.
If your issue is still not resolved after these steps, please post all the steps you are performing.
I have seen this mistake made several times. Here is what needs to be done (this works only for Hive, NOT for Spark):
OLD PROPERTY :
TBLPROPERTIES('PARQUET.COMPRESSION'='SNAPPY')
CORRECT PROPERTY :
TBLPROPERTIES('PARQUET.COMPRESS'='SNAPPY')
I have recently created some tables stored as Parquet files with Snappy compression, using the following commands:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.intermediate.compression.type=BLOCK;
I have a partitioned ORC table in Hive. After loading the table with all possible partitions, I get multiple ORC files on HDFS, i.e. each partition directory on HDFS contains several ORC files. For one use case, I need to combine all the ORC files under each partition into a single big ORC file.
Can someone suggest a way to combine these multiple ORC files (belonging to each partition) into a single big ORC file?
I've tried creating a new non-partitioned ORC table from the partitioned table. It does reduce the number of files, but not to a single file.
PS: Creating a table from another one is entirely a map task, so setting the number of reducers to 1 with 'set mapred.reduce.tasks=1;' doesn't help.
Thanks
You can use the CONCATENATE command to combine the small ORC files. This can be done at the table level as well as at the partition level.
The syntax, as per the ORC documentation:
users can request an efficient merge of small ORC files together by
issuing a CONCATENATE command on their table or partition. The files
will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
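As a concrete sketch of that syntax: the table names logs_orc and logs_orc_flat and the partition column dt below are hypothetical, chosen only for illustration.

```sql
-- Hive sketch; logs_orc, logs_orc_flat and the dt partition column are hypothetical.

-- Merge the small ORC files of a single partition.
ALTER TABLE logs_orc PARTITION (dt='2016-01-01') CONCATENATE;

-- For an unpartitioned ORC table, leave out the PARTITION clause.
ALTER TABLE logs_orc_flat CONCATENATE;
```

The statement has to be issued once per partition, and a single output file is not guaranteed; it may take more than one run to reduce the file count further.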
I'm trying to read a large gzip file into Hive through the Spark runtime
to convert it into SequenceFile format,
and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, the same as it does for text files.
Is there a way to change the number of mappers for a gzip file being read, or should I choose another format like Parquet?
I'm currently stuck.
The problem is that my log file is JSON-like data saved in txt format and then gzipped, so for reading it I used org.apache.spark.sql.json.
The examples I have seen that convert data into SequenceFile use simple delimited input such as CSV.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it as something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
A gzipped text file is not splittable, so only one mapper will be launched. You
have to choose another data format if you want to use more than one
mapper.
If the JSON files are huge and you want to save storage on HDFS, use bzip2
compression for the JSON files on HDFS. You can query .bz2 JSON
files from Hive without modifying anything.
I have two scripts which parse data from raw logs and write it into ORC tables in Hive. One script creates more columns and the other fewer. Both tables are partitioned by a date field.
As a result, I have ORC tables with different file sizes.
The table with the larger number of columns consists of many small files (~4 MB per file inside each partition), and the table with fewer columns consists of a few large files (~250 MB per file inside each partition).
I suppose this happens because of the stripe.size setting in ORC, but I don't know how to check the stripe size of an existing table. Commands like "show create" and "describe" don't reveal any custom settings, which suggests that the stripe size for both tables should be 256 MB.
I'm looking for advice on how to check stripe.size for an existing ORC table,
or an explanation of how the file size inside ORC tables depends on the data in those tables.
P.S. It matters later, when I read from those tables with MapReduce and there are only a few reducers for the tables with big files.
Try the Hive ORC File Dump Utility (hive --orcfiledump <path_to_orc_file>); its output includes the file's stripe information.
From what I have found by googling, there are ways of creating an ORC table using Hive, but I want an ORC file on which I can run my custom MapReduce job.
Also, please let me know whether the file created by Hive under the warehouse directory for my ORC table is a table file in ORC format and not an actual ORC file I can use, e.g. /user/hive/warehouse/tbl_orc/000000_0
[Wrap-up of the discussion]
- a Hive table is mapped to an HDFS directory (or a list of directories, if the table is partitioned)
- all files in that directory use the same SerDe (ORC, Parquet, Avro, Text, etc.) and have the same column set; all together, they contain all the data available for that table
- each file in that directory is the result of a previous MapReduce job: either a Hive INSERT, a Pig dataset saved via HCatalog, a Spark dataset saved via HiveContext, or any custom job that happens to drop a file there, hopefully compliant with the table SerDe and schema (retrieved via the MetastoreClient Java API, or via the HCatalog API, whatever)
- note that a single job with 3 reducers will probably create 3 new files (and maybe 1 empty file + 1 small file + 1 big file!); and a job with 24 mappers and no reducer will create 24 files, unless some kind of "merge small files" post-processing step is enabled
- note also that most file names give absolutely no information about the way the file is encoded internally; they are just sequence numbers (e.g. the 5th job to add 12 files will typically create files 000004_0 to 000004_11)
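For reference, the "merge small files" post-processing step mentioned above is controlled in Hive by session settings along the following lines. The size values shown are illustrative, not recommendations from the discussion.

```sql
-- Hive sketch; the size thresholds are illustrative values only.

-- Merge small output files at the end of map-only jobs.
SET hive.merge.mapfiles=true;
-- Merge small output files at the end of map-reduce jobs.
SET hive.merge.mapredfiles=true;
-- Trigger the merge step when the average output file size is below this many bytes.
SET hive.merge.smallfiles.avgsize=16000000;
-- Target size, in bytes, of the merged files.
SET hive.merge.size.per.task=256000000;
```

These settings apply to jobs run after they are set in the session; files already sitting in the table directory are not rewritten.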
All in all, processing an ORC fileset with a Java MapReduce program should be very similar to processing a Text fileset. You just have to provide the correct SerDe and the correct field mapping. I think that the compression algorithm is explicit in the files, so the SerDe handles it auto-magically at read time. Just remember that ORC files are not splittable at record level, but at stripe level (a stripe is a bunch of records stored in columnar format with tokenization and optional compression).
Of course, that will not give you access to ORC advanced features such as vectorization or stripe pruning (somewhat similar to "smart scan" in Oracle Exadata).