Insert into Parquet file generates 512 MB files. How to generate a 1 GB file? - hadoop

I am testing the Parquet file format and inserting data into a Parquet file using an Impala external table.
The following parameters may affect the Parquet file size:
NUM_NODES: 1
PARQUET_COMPRESSION_CODEC: none
PARQUET_FILE_SIZE: 1073741824
I am using the following insert statement to write into the Parquet file:
INSERT INTO TABLE parquet_test.parquetTable
PARTITION (pkey=X)
SELECT col1, col2, col3 FROM map_impala_poc.textTable where col1%100=X;
I want to generate files of approximately 1 GB and partition the data accordingly, so that each partition holds a little less than 1 GB of data in Parquet format. However, this insert operation never produces a single file larger than 512 MB: it writes 512 MB of data into one file, then creates another file and writes the rest of the data there. What can be done to write all the data into a single file?

Try setting the Parquet file size in the same session in which you execute the query:
set PARQUET_FILE_SIZE=1g;
INSERT INTO TABLE parquet_test.parquetTable ...
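For completeness, a minimal sketch of a full impala-shell session, reusing the table names and the X partition placeholder from the question. The query options only take effect when set in the same session that runs the INSERT, and 1g is shorthand for 1073741824 bytes:
-- Hedged sketch: query options apply only to the session in which they are set.
set PARQUET_FILE_SIZE=1073741824;    -- or: set PARQUET_FILE_SIZE=1g;
set PARQUET_COMPRESSION_CODEC=none;
set NUM_NODES=1;                     -- single-node execution: one writer, so each partition's data goes to as few files as possible
INSERT INTO TABLE parquet_test.parquetTable
PARTITION (pkey=X)
SELECT col1, col2, col3 FROM map_impala_poc.textTable WHERE col1%100=X;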

Related

Hive insert overwrite directory split records into equal file sizes

I am using a Hive external table to dump data as JSON. The dump files look fine. However, the files written by Hive vary in size, ranging from around 400 MB to 7 GB. I want files with a fixed maximum size (say 1 GB), but I am unable to achieve this. Please help!
My Query:
INSERT OVERWRITE DIRECTORY '/myhdfs/location'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe'
select * from MY_EXTERNAL_TABLE;
Hive Version: Hive 1.1.0-cdh5.14.2
Hadoop Version: Hadoop 2.6.0-cdh5.14.2
Set the bytes-per-reducer limit and add DISTRIBUTE BY (this will trigger a reducer step), using some evenly distributed column or column list:
set hive.exec.reducers.bytes.per.reducer=1000000000;
INSERT OVERWRITE DIRECTORY '/myhdfs/location'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe'
select * from MY_EXTERNAL_TABLE distribute by <column or col list here>;
Check also this answer: https://stackoverflow.com/a/55375261/2700344
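As a concrete usage sketch, with a hypothetical, evenly distributed column named event_date standing in for whatever column or column list fits your data:
-- Hedged sketch: cap each reducer's input at ~1 GB and force a reduce stage with DISTRIBUTE BY.
set hive.exec.reducers.bytes.per.reducer=1000000000;
INSERT OVERWRITE DIRECTORY '/myhdfs/location'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe'
SELECT * FROM MY_EXTERNAL_TABLE
DISTRIBUTE BY event_date;  -- event_date is hypothetical; substitute your own column or column list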

Hive queries on external S3 table very slow

We have our dataset in S3 (Parquet files) in the format below, with the data divided into multiple Parquet files based on row number:
data1_1000000.parquet
data1000001_2000000.parquet
data2000001_3000000.parquet
...
I created a Hive table on top of it using:
CREATE EXTERNAL TABLE parquet_hive (
foo string
) STORED AS PARQUET
LOCATION 's3://myBucket/myParquet/';
In total there are 22,000 Parquet files and the folder is nearly 300 GB. When I run a count query on this table in Hive, it takes 6 hours to return the result, which is nearly 7 billion records. How can we make it faster? Can I create a partition or an index on the table, or is this how long it usually takes when pulling data from S3? Can anyone advise what is wrong here?
Thanks.
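For reference, partitioning is the usual lever here. A minimal sketch, assuming a hypothetical dt column exists in the data and the files can be rewritten under per-partition prefixes (neither is true of the layout in the question as written):
-- Hedged sketch: a partitioned table lets Hive prune to the partitions a query touches
-- instead of listing and scanning all 22,000 objects. 'dt' and the location are hypothetical.
CREATE EXTERNAL TABLE parquet_hive_partitioned (
  foo string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://myBucket/myParquetPartitioned/';
The partitions would then need to be registered, for example with MSCK REPAIR TABLE, after the data is rewritten under per-partition prefixes.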

Convert data from gzip to sequenceFile format using Hive on spark

I'm trying to read a large gzip file into Hive through the Spark runtime and convert it into SequenceFile format, and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, just as it does for text files.
Is there a way to change the number of mappers for a gzip file being read, or should I choose another format like Parquet?
I'm currently stuck.
The problem is that my log file is JSON-like data saved in text format and then gzipped, so for reading I used org.apache.spark.sql.json.
The examples I have seen of converting data into SequenceFile use simple delimiters, as in CSV format.
I used to execute this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it as something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
Because a gzipped text file is not splittable, only one mapper will be launched; you have to choose another data format if you want to use more than one mapper.
If the JSON files are huge and you want to save storage on HDFS, use bzip2 compression to compress them on HDFS. bzip2 is splittable, and you can query the bzip2-compressed JSON files from Hive without modifying anything.
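One way to do the bzip2 recompression from Hive itself, rather than with shell tools, is to write the data back out through a text-stored staging table with bzip2 output compression. A minimal sketch; table_1_text is the gzip-backed table from the question's INSERT statements, and table_1_bz2 is a hypothetical table:
-- Hedged sketch: rewrite the gzip-backed data as bzip2-compressed text, which is splittable,
-- so later jobs can use more than one mapper. table_1_bz2 is hypothetical.
CREATE TABLE table_1_bz2 (
  ID BIGINT,
  NAME STRING
)
STORED AS TEXTFILE;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
INSERT OVERWRITE TABLE table_1_bz2 SELECT id, name FROM table_1_text;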

Hive. Check stripe size for existing ORC storage

I have two scripts which parse data from raw logs and write it into ORC tables in Hive. One script creates more columns and the other fewer. Both tables are partitioned by a date field.
As a result I have ORC tables with files of different sizes.
The table with the larger number of columns consists of many small files (~4 MB per file inside each partition), and the table with fewer columns consists of a few large files (~250 MB per file inside each partition).
I suppose this happens because of the stripe.size setting in ORC, but I don't know how to check the stripe size for an existing table. Commands like "show create" and "describe" don't reveal any custom settings, which suggests the stripe size for these tables should be the default of 256 MB.
I'm looking for advice on how to check stripe.size for an existing ORC table, or an explanation of how the file size inside ORC tables depends on the data in those tables.
P.S. It matters later when I'm reading from these tables with MapReduce, because there is a small number of reducers for the tables with big files.
Try the Hive ORC File Dump Utility.
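A minimal invocation sketch (the path is a placeholder; point it at one of the ORC files inside a partition directory). The dump output lists each stripe with its size, so you can see the effective stripe size:
hive --orcfiledump /apps/hive/warehouse/your_db.db/your_table/dt=2019-01-01/000000_0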

Hive gzip file decompression

I have loaded a bunch of .gz files into HDFS, and when I create a raw table on top of them I see strange behavior when counting the number of rows. Comparing the result of count(*) from the gz table versus the uncompressed table shows roughly an 85% difference: the table backed by the gz-compressed files has fewer records. Has anyone seen this?
CREATE EXTERNAL TABLE IF NOT EXISTS test_gz(
col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
LOCATION '/data/raw/test_gz'
;
select count(*) from test_gz; -- result: 1,123,456
select count(*) from test;    -- result: 7,720,109
I was able to resolve this issue. Somehow the gzip files were not getting fully decompressed in map/reduce jobs (Hive or custom Java map/reduce). The MapReduce job would read only about 450 MB of the gzip file and write the data out to HDFS without fully reading the 3.5 GB .gz file. Strange: no errors at all!
Since the files were compressed on another server, I decompressed them manually and re-compressed them on the Hadoop client server. After that, I uploaded the newly compressed 3.5 GB .gz file to HDFS, and then Hive was able to count all the records by reading the whole file.
Marcin
