Hive gzip file decompression - hadoop

I have loaded a bunch of .gz files into HDFS, and when I create a raw table on top of them I see strange behavior when counting the number of rows. Comparing the result of count(*) from the gz table against the uncompressed table shows a difference of roughly 85%. The table backed by the gz-compressed files has fewer records. Has anyone seen this?
CREATE EXTERNAL TABLE IF NOT EXISTS test_gz(
col1 string, col2 string, col3 string)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n'
LOCATION '/data/raw/test_gz'
;
select count(*) from test_gz;   -- result: 1,123,456
select count(*) from test;      -- result: 7,720,109

I was able to resolve this issue. Somehow the gzip files were not being fully decompressed in map/reduce jobs (Hive or custom Java MapReduce). The MapReduce job would read only about ~450 MB of the gzip file and write the data out to HDFS without ever reading the full 3.5 GB file. Strangely, there were no errors at all!
Since the files were compressed on another server, I decompressed them manually and re-compressed them on the Hadoop client server. After that, I uploaded the newly compressed 3.5 GB file to HDFS, and Hive was then able to count all the records by reading the whole file.
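For anyone hitting the same thing, below is a rough sketch of the check-and-recompress steps described above (file names are illustrative, not the actual ones):
#!/bin/bash
# Verify the archive is intact; a truncated or corrupt gzip can silently yield partial data.
gzip -t data_file.gz && echo "archive OK" || echo "archive corrupt/truncated"
# Decompress and re-compress on the Hadoop client, checking the record count on the way.
gunzip -c data_file.gz > data_file
wc -l data_file
gzip -c data_file > data_file_recompressed.gz
# Re-upload into the table location and let Hive count again.
hdfs dfs -put -f data_file_recompressed.gz /data/raw/test_gz/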
Marcin

Related

Hive external table location in google cloud storage is ignoring subdirectories

I have a bunch of large csv.gz files in Google Cloud Storage that we got from an external source. We need to bring them into BigQuery so we can start querying, but BigQuery cannot directly ingest gzipped CSV files larger than 4 GB. So I decided to convert these files into Parquet format and then load them into BigQuery.
Let's take the example of the websites.csv.gz file, which is under the path gs://<BUCKET-NAME>/websites/websites.csv.gz.
For this, I wrote a Hive script as below:
CREATE EXTERNAL TABLE websites (
col1 string,
col2 string,
col3 string,
col4 string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 'gs://<BUCKET-NAME>/websites/'
TBLPROPERTIES ('skip.header.line.count'='1');
msck repair TABLE websites;
CREATE EXTERNAL TABLE par_websites (
col1 string,
col2 string,
col3 string,
col4 string
) STORED AS PARQUET LOCATION 'gs://<BUCKET-NAME>/websites/par_websites/';
INSERT OVERWRITE TABLE par_websites
SELECT *
FROM websites;
This works well and creates a new folder par_websites in the specified location gs://<BUCKET-NAME>/websites/par_websites/, with one parquet file inside it.
But when the websites.csv.gz file is in a subfolder, e.g. gs://<BUCKET-NAME>/data/websites/, and I update the script so that the read and write locations are gs://<BUCKET-NAME>/data/websites/ and gs://<BUCKET-NAME>/data/websites/par_websites, it does not work at all. Hive does not seem to read from gs://<BUCKET-NAME>/data/websites/websites.csv.gz, and instead of creating a par_websites folder inside gs://<BUCKET-NAME>/data/websites, it creates a new folder gs://<BUCKET-NAME>/websites/par_websites with no parquet file inside.
Why is that, and how can I make Hive read from and write to subfolders?
It turned out Hive was caching my previous table definitions, so when I updated the script it was still using the older versions rather than the new locations.
Once I changed the table names and ran it again, everything worked.
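For completeness, a sketch of what the adjusted script might look like with fresh table names and the subfolder locations (the _v2 names and exact layout are illustrative):
hive -e "
CREATE EXTERNAL TABLE websites_v2 (
  col1 string, col2 string, col3 string, col4 string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://<BUCKET-NAME>/data/websites/'
TBLPROPERTIES ('skip.header.line.count'='1');

CREATE EXTERNAL TABLE par_websites_v2 (
  col1 string, col2 string, col3 string, col4 string
) STORED AS PARQUET
LOCATION 'gs://<BUCKET-NAME>/data/websites/par_websites/';

INSERT OVERWRITE TABLE par_websites_v2 SELECT * FROM websites_v2;
"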

Convert data from gzip to sequenceFile format using Hive on spark

I'm trying to read a large gzip file into Hive through the Spark runtime to convert it into SequenceFile format, and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, the same as it does for text files.
Is there a way to change the number of mappers for a gzip file being read? Or should I choose another format like Parquet?
I'm stuck currently.
The problem is that my log file is JSON-like data saved in txt format and then gzipped, so for reading I used org.apache.spark.sql.json.
The examples I have seen of converting data into SequenceFile use simple delimiters, as in CSV format.
Previously I executed this query:
create TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it as something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
-- load the gzipped text file directly (note: LOAD DATA only moves the file into the
-- table directory; it does not convert it to SequenceFile)
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
-- or convert by selecting from a text-backed staging table
INSERT OVERWRITE TABLE table_1 SELECT id, name from table_1_text;
INSERT INTO TABLE table_1 SELECT id, name from table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As a gzip text file is not splittable, only one mapper will be launched; you have to choose other data formats if you want to use more than one mapper.
If you have huge JSON files and want to save storage on HDFS, use bzip2 compression to compress your JSON files on HDFS. You can query .bzip2 JSON files from Hive without modifying anything.
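For example, a rough sketch of recompressing the gzipped JSON log as bzip2 and pointing the same JSON reader at it (paths and the new table name are illustrative):
#!/bin/bash
# Pull the gzip down, recompress it with the splittable bzip2 codec, and put it back on HDFS.
hdfs dfs -get dir_to/file_name.txt.gz .
gunzip -c file_name.txt.gz | bzip2 -c > file_name.txt.bz2
hdfs dfs -put file_name.txt.bz2 dir_to/
# The same JSON-based table definition can then read the .bz2 file with multiple mappers.
spark-sql -e "CREATE TABLE table_1_bz2 USING org.apache.spark.sql.json OPTIONS (path 'dir_to/file_name.txt.bz2');"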

EMR Hadoop processing whole S3 file

I have a bunch of small (1KB to 1MB) text files stored in Amazon S3 that I would like to process using Amazon EMR's Hadoop.
Each record given to the mapper needs to contain the entire contents of a text file as well as some way to determine the filename, so I cannot use the default TextInputFormat.
What is the best way to accomplish this? Is there anything else I can do (like copying files from S3 to hdfs) to increase performance?
I had the same issue. Please refer to the following questions.
s3fs on Amazon EMR: Will it scale for approx 100million small files?
Write 100 million files to s3
Too many open files in EMR
If you don't have any large files but do have a lot of files, it's sufficient to use the s3cmd get --recursive s3://<url> . command. After retrieving the files onto the EMR instance, you can create tables with Hive. For example, you can load whole files with the LOAD DATA statement, with partitions.
sample
This is sample code:
#!/bin/bash
s3cmd get --recursive s3://your.s3.name .
# create table with partitions
hive -e "SET mapred.input.dir.recursive=true; DROP TABLE IF EXISTS import_s3_data;"
hive -e "CREATE TABLE import_s3_data( rawdata string )
PARTITIONED BY (tier1 string, tier2 string, tier3 string);"
LOAD_SQL=""
# collect files as array
FILES=(`find . -name \*.txt -print`)
for FILE in ${FILES[@]}
do
DIR_INFO=(`echo ${FILE##./} | tr -s '/' ' '`)
T1=${DIR_INFO[0]}
T2=${DIR_INFO[1]}
T3=${DIR_INFO[2]}
LOAD_SQL="${LOAD_SQL} LOAD DATA LOCAL INPATH '${FILE}' INTO TABLE
import_s3_data PARTITION (tier1 = '${T1}', tier2 = '${T2}', tier3 = '${T3}');"
done
hive -e "${LOAD_SQL}"
other options
I think there are some other options for retrieving small S3 data:
S3DistCp ... it merges small files into larger ones so that Hadoop can handle them (see the sketch after this list).
Hive - External Tables ... create an external table referring to S3 storage. However, this has almost the same performance as using s3cmd get. It might be more effective when there are many large raw or gzipped files on S3.
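For the S3DistCp option, this is roughly the kind of command meant (bucket name, paths, and target size are illustrative):
# Merge many small S3 text files into larger files on HDFS before processing them.
s3-dist-cp --src s3://your.s3.name/logs/ \
           --dest hdfs:///input/merged/ \
           --groupBy '.*(\.txt)' \
           --targetSize 128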
The best approach, according to me, would be to create an external table on the CSV files and load it into another table stored again in an S3 bucket in Parquet format. You will not have to write any script in that case, just a few SQL queries.
CREATE EXTERNAL TABLE databasename.CSV_EXT_Module(
recordType BIGINT,
servedIMSI BIGINT,
ggsnAddress STRING,
chargingID BIGINT,
...
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://module/input/csv_files/'
TBLPROPERTIES ("skip.header.line.count"="1");
The above table will only be an external table mapped to the csv file.
Create another table on top of it if you want the query to run faster:
CREATE TABLE databasename.RAW_Module
STORED AS PARQUET
LOCATION 's3://module/raw/parquet_files/'
AS SELECT
recordType,
servedIMSI,
ggsnAddress,
chargingID,
...
regexp_extract(INPUT__FILE__NAME,'(.*)/(.*)',2) as filename
FROM databasename.CSV_EXT_Module;
Adjust the regexp_extract pattern to pull out the part of the input file name you need.
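For example, a quick way to check what that pattern extracts from a sample path (the literal path below is just an illustration):
# Should print file1.csv: group 2 of '(.*)/(.*)' is everything after the last '/'.
hive -e "SELECT regexp_extract('s3://module/input/csv_files/file1.csv', '(.*)/(.*)', 2);"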

how to preprocess the data and load into hive

I completed my Hadoop course and now I want to work on Hadoop. I want to know the workflow from data ingestion through to visualizing the data.
I am aware of how the ecosystem components work, and I have built a Hadoop cluster with 8 datanodes and 1 namenode:
1 namenode -- ResourceManager, NameNode, SecondaryNameNode, Hive
8 datanodes -- DataNode, NodeManager
I want to know the following things:
I got the data as .tar files, and the first 4 lines of each file contain a description. I am a little confused about how to process this type of data.
1.a Can I directly process the data given that these are tar files? If yes, how do I remove the data in the first four lines, or do I need to untar the files and remove the first 4 lines manually?
1.b I want to process this data using Hive.
Please suggest me how to do that.
Thanks in advance.
Can I directly process the data as these are tar files?
Yes, see the solution below.
If yes, how do I remove the data in the first four lines?
Starting with Hive v0.13.0, there is a table property, tblproperties ("skip.header.line.count"="1"), set while creating a table, that tells Hive the number of header rows to ignore. To ignore the first four lines: tblproperties ("skip.header.line.count"="4").
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
tblproperties("skip.header.line.count"="4");  -- skip the 4 description lines when reading the raw text files
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/tmp/test.tar' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
To view the data:
select * from raw_sequence
Reference: Compressed Data Storage
Follow the steps below to achieve your goal:
Copy the data (i.e. the tar file) to the client system where Hadoop is installed.
Untar the file, manually remove the description lines, and save the data locally.
Create the metadata (i.e. the table) in Hive based on the description.
E.g.: if the description contains emp_id, emp_no, etc., then create the table in Hive using this information. Also make note of the field separator used in the data file and use the corresponding field separator in the CREATE TABLE query. Assuming the file contains two columns separated by a comma, below is the syntax to create the table in Hive.
Create table tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ','
Since the data is in a structured format, you can load it into the Hive table using the command below.
LOAD DATA LOCAL INPATH '/LOCALFILEPATH' INTO TABLE TABLENAME;
The local data will now be copied to HDFS and loaded into the Hive table.
Finally, you can query the Hive table using SELECT * FROM TABLENAME;
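Putting those steps together, a rough sketch under the two-column, comma-separated assumption above (paths and the table name are illustrative):
#!/bin/bash
# 1-2: untar locally and strip the 4 description lines from each extracted file
mkdir -p /local/path/extracted
tar -xf /local/path/data.tar -C /local/path/extracted
for f in /local/path/extracted/*; do
  tail -n +5 "$f" > "$f.clean" && mv "$f.clean" "$f"
done
# 3-4: create the table and load the cleaned files into it
hive -e "Create table if not exists tablename (emp_id int, emp_no int)
Row Format Delimited
Fields Terminated by ',';
LOAD DATA LOCAL INPATH '/local/path/extracted/' INTO TABLE tablename;"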

Is it possible to import data into Hive table without copying the data

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.
Can I avoid having all my text data stored twice?
EDIT: I load it via the following command
LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Then, I can find the exact same file in:
/user/hive/warehouse/sandbox.db/test/day=20130220
I assumed it was copied.
use an external table:
CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';
If you want to use partitioning with an external table, you will be responsible for managing the partition directories.
The location specified must be an HDFS directory.
If you drop an external table, Hive WILL NOT delete the source data.
If you want to manage your raw files yourself, use external tables. If you want Hive to do it, then let Hive store them inside its warehouse path.
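As a quick illustration of the drop behavior (using the table and location from the example above), dropping the external table leaves the raw files untouched:
hive -e "DROP TABLE sandbox.test;"
hdfs dfs -ls /user/logs/   # the original log files are still present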
Instead of copying data directly to HDFS from your Java application, I would keep those files on the local file system and import them into HDFS via Hive using the following command.
LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')
Notice the LOCAL keyword.
You can use the ALTER TABLE ... ADD PARTITION statement to avoid data duplication:
create external table if not exists TestTable (testcol string) PARTITIONED BY (year INT, month INT, day INT) row format delimited fields terminated by ',';
ALTER TABLE TestTable ADD PARTITION (year='2014', month='2', day='17') LOCATION 'hdfs://localhost:8020/data/2014/2/17/';
Hive (at least when running in true cluster mode) cannot refer to external files on the local file system. Hive can automatically import the files during table creation or a load operation. The reason is that Hive runs MapReduce jobs internally to extract the data. MapReduce reads from HDFS, writes back to HDFS, and runs in distributed mode, so if a file is stored on the local file system it cannot be used by the distributed infrastructure.
