Unexpected elements in compressed data imported into Hive - hadoop

I am trying to load a compressed TXT file into Hive. The operation completes without any error, but the resulting table contains some unexpected characters at the beginning. Why does this happen?
More info about compressed data storage in Hive: https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
# cat test.txt
tab1 tab2 tab3
tab4 tab5 tab6
tab7 tab8 tab9
# tar -cvzf test.gz test.txt
test.txt
# cat hiveQuery.hql
CREATE TABLE raw (col1 STRING,col2 STRING,col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
;
LOAD DATA LOCAL INPATH '/test.gz' INTO TABLE raw;
# hive -f hiveQuery.hql
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.4.0.0-169/0/hive-log4j.properties
OK
Time taken: 6.936 seconds
Loading data to table default.raw
Table default.raw stats: [numFiles=1, totalSize=145]
OK
# hive -e "select * from raw"
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.4.0.0-169/0/hive-log4j.properties
OK
test.txt 0000644 0000000 0000000 00000000055 13120243734 011273 0 ustar root root tab1 tab2 tab3
tab4 tab5 tab6
tab7 tab8 tab9
NULL NULL

The tar format puts its own header information (file name, permissions, owner) in front of the file contents, which is what shows up at the beginning of your table.
Compress your file with gzip instead and it will work fine.
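For example, a minimal sketch of the gzip-based flow, reusing the paths from the question (adjust them to wherever the file actually lives):
# gzip -c test.txt > test.gz        # a plain gzip stream, no tar headers
# hive -e "LOAD DATA LOCAL INPATH '/test.gz' INTO TABLE raw"
# hive -e "SELECT * FROM raw"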

Related

Getting NULL values after loading data into Hive tables from an online dataset

I am trying to load data from an online dataset into my Hive table using the Hue interface, but I am getting NULL values.
Here's my dataset:
https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Here's my code:
CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
Here's how I loaded the data:
LOAD DATA LOCAL INPATH '/home/hadoop/aisles.csv' INTO TABLE aisles;
My workarounds, but no go:
FIELDS TERMINATED BY ','
FIELDS TERMINATED BY '\t'
FIELDS TERMINATED BY ''
FIELDS TERMINATED BY ' '
Also tried removing LINES TERMINATED BY '\n'
This is how I downloaded the data:
[hadoop#ip-172-31-76-58 ~]$ wget -O aisles.csv "https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv"
--2020-10-14 23:50:06-- https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=aisles.csv
Resolving www.kaggle.com (www.kaggle.com)... 35.244.233.98
Connecting to www.kaggle.com (www.kaggle.com)|35.244.233.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘aisles.csv’
I checked the location of the table I created; this is what it says:
hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles
I tried browsing the directory to see where the file was saved:
[hadoop#ip-172-31-76-58 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt - arjiesaenz hadoop 0 2020-10-15 00:57 /user/hive/warehouse/aisles
So I tried to change my load script like this:
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
But I got an error:
Error while compiling statement: FAILED: SemanticException line 6:61 Invalid path ''/user/hive/warehouse/aisles.csv'': No files matching path hdfs://ip-172-31-76-58.ec2.internal:8020/user/hive/warehouse/aisles.csv
Hopefully someone can help me pinpoint the problem with my code.
Thanks.
I tried the same on my Hadoop cluster. The code worked without any issues.
Here's my execution snippet:
hive> CREATE TABLE IF NOT EXISTS AISLES (aisles_id INT, aisles STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE
> tblproperties("skip.header.line.count"="1");
OK
Time taken: 0.034 seconds
hive> load data inpath '/user/hirwuser1448/aisles.csv' into table AISLES;
Loading data to table revisit.aisles
Table revisit.aisles stats: [numFiles=1, totalSize=2603]
OK
Time taken: 0.183 seconds
hive> select * from AISLES limit 10;
OK
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
Time taken: 0.038 seconds, Fetched: 10 row(s)
I think you need to cross-check whether your dataset aisles.csv is actually at the HDFS location and not just stored in a local directory.
The problem is with your LOAD command.
LOAD DATA INPATH '/user/hive/warehouse/aisles.csv' INTO TABLE aisles;
I see you tried browsing the directory to find the saved file. Do you see aisles.csv under that directory? If the file is there, you are giving the wrong path in your LOAD command; otherwise the file isn't there at all.
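For example, a minimal check-and-load sketch, assuming the CSV is still in your local home directory (the HDFS staging path /tmp/aisles.csv is only an example):
hdfs dfs -ls /user/hive/warehouse/aisles          # see what the table directory actually contains
hdfs dfs -put ~/aisles.csv /tmp/aisles.csv        # stage the file on HDFS first
hive -e "LOAD DATA INPATH '/tmp/aisles.csv' INTO TABLE aisles;"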
I found a workaround: I downloaded the dataset, uploaded it to an Amazon S3 bucket, and used the S3 path in the LOAD command.
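Something like the following, assuming the cluster can resolve s3:// paths (the bucket name is a placeholder):
LOAD DATA INPATH 's3://your-bucket/aisles.csv' INTO TABLE aisles;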

Presto Query HIVE Table Exception: Failed to list directory

I'm new to Presto. I have two machines running Presto 0.160: one is the coordinator, the other is a worker. I want to query a table in Hive. I can run "show tables" and "desc tablename", but when I run "select * from tablename" this exception occurs: "Query 20170728_123013_00011_q4s3a failed: Failed to list directory: hdfs://cdh-test/user/hive/warehouse/employee_hive"
presto> desc hive.default.employee_hive;
Column | Type | Comment
-------------+---------+---------
eid | integer |
name | varchar |
salary | varchar |
destination | varchar |
(4 rows)
Query 20170728_123001_00010_q4s3a, FINISHED, 2 nodes
Splits: 2 total, 2 done (100.00%)
0:00 [4 rows, 268B] [40 rows/s, 2.68KB/s]
presto> select * from hive.default.employee_hive;
Query 20170728_123013_00011_q4s3a, FAILED, 1 node
Splits: 1 total, 0 done (0.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20170728_123013_00011_q4s3a failed: Failed to list directory: hdfs://cdh-test/user/hive/warehouse/employee_hive
Here is my configuration for hive catalog:
connector.name=hive-cdh4
hive.metastore.uri=thrift://***:9083
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
where am I wrong?
The path that the table is stored on needs to exist on HDFS for Presto to open it successfully. From the path it appears your table is an "internal" hive table, meaning hive should have created the path itself. Since it hasn't, you could create it yourself using a command similar to hdfs dfs -mkdir hdfs://cdh-test/user/hive/warehouse/employee_hive, although the exact command depends on your HDFS set up.
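A minimal sketch of that check-and-create step (the exact paths and any required flags depend on your HDFS setup):
hdfs dfs -ls hdfs://cdh-test/user/hive/warehouse/employee_hive      # fails if the directory is missing
hdfs dfs -mkdir -p hdfs://cdh-test/user/hive/warehouse/employee_hive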
You can't access the Hadoop directory directly. I assume you created the table as a textfile, so it is stored in the internal directory of the respective user.
Just create the table as an external table instead and you will be able to access it via Presto:
Create External Table tablename (columnames datatypes) row format delimited fields terminated by '\t' stored as textfile;
load data inpath 'Your_hadoop_directory' into table tablename;
Otherwise, create an internal table, load it into an external ORC table, and access that via Presto:
Create Table tablename (columnames datatypes) row format delimited fields terminated by '\t' stored as textfile;
load data inpath 'Your_hadoop_directory' into table tablename;
Create external Table tablename (columnames datatypes) STORED AS ORC;
insert into orc_tablename select * from internal_tablename;
I solved the above issue by creating an ORC table.

multi file insert from hive table not working?

Hi, I have 200 GB of data in one of my Hive tables backed by HBase.
I have to create 142 different files out of that table; currently I am trying with 3 files only.
I want all the queries to run in parallel at the same time.
I was trying a multi-file insert from the Hive table but I am getting a parse exception.
This is the query I was trying:
FROM hbase_table_FinancialLineItem
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FinancialLineItem/Japan.txt'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
select * from hbase_table_FinancialLineItem WHERE FilePartition='Japan'
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FinancialLineItem/SelfSourcedPrivate.txt'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
select * from hbase_table_FinancialLineItem WHERE FilePartition='SelfSourcedPrivate'
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FinancialLineItem/ThirdPartyPrivate.txt'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
select * from hbase_table_FinancialLineItem WHERE FilePartition='ThirdPartyPrivate';
And after running this I was getting the error below:
FAILED: ParseException line 7:9 missing EOF at 'from' near '*'
I think the parse error comes from repeating the source table inside each SELECT. In Hive's multi-insert syntax the table is named once in the leading FROM hbase_table_FinancialLineItem clause, so each branch should read SELECT * WHERE ... without its own FROM.
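As a sketch, the corrected multi-insert would keep the directories and filters from the question and look something like this:
FROM hbase_table_FinancialLineItem
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FinancialLineItem/Japan.txt'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT * WHERE FilePartition='Japan'
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FinancialLineItem/SelfSourcedPrivate.txt'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT * WHERE FilePartition='SelfSourcedPrivate'
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FinancialLineItem/ThirdPartyPrivate.txt'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT * WHERE FilePartition='ThirdPartyPrivate';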

Create temporary file to load in Hive table using stdout redirection

I would like to create a script to load a TSV file into Hive.
However, since the .tsv file contains a header,
I first have to create a temporary file without it.
In my script.hql I have the following:
DROP TABLE metadata IF EXISTS ;
CREATE TABLE metadata (
id INT,
value STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
! tail -n +2 metadata.tsv > tmp_metadata.tsv ;
LOAD DATA LOCAL 'tmp_metadata.tsv' INTO metadata ;
The problem is that Hive complains about the > that should redirect the output to the new file, and therefore the script fails.
How can I fix this?
1. Create a new shell script named script.sh and add this to it:
#!/bin/sh
tail -n +2 metadata.tsv > tmp_metadata.tsv
hive -v -f ./script.hql
2. Instead of your script.hql:
DROP TABLE metadata IF EXISTS ;
CREATE TABLE metadata (id INT,value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE ;
! tail -n +2 metadata.tsv > tmp_metadata.tsv ;
LOAD DATA LOCAL 'tmp_metadata.tsv' INTO metadata ;
Add this to your script.hql:
DROP TABLE IF EXISTS metadata;
CREATE TABLE metadata (
id INT,
value STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
LOAD DATA LOCAL INPATH 'tmp_metadata.tsv' INTO TABLE metadata ;
DROP TABLE metadata IF EXISTS ; is not correct. Change it to DROP TABLE IF EXISTS metadata;
and
LOAD DATA LOCAL 'tmp_metadata.tsv' INTO metadata ; is not correct. Change it to LOAD DATA LOCAL INPATH 'tmp_metadata.tsv' INTO TABLE metadata ;
3. Now, change the permissions on your shell script and execute it:
sudo chmod 777 script.sh
./script.sh

add date time from flat file name cloudera

I started an EC2 cluster on Amazon to install Cloudera. I got it installed and configured, and loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files is as follows:
projectcode, pagename, pageviews, bytes
The files are named as follows:
pagecounts-20090430-230000.gz
(the name encodes the date, 20090430, and the time, 230000)
When loading the data from HDFS into Impala, I do it like this:
CREATE EXTERNAL TABLE wikiPgvws
(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';
One thing I missed is the date and time of each file. The directory:
/user/hdfs
contains multiple pagecount files associated with different dates and times. How can one pull that information and store it in a column when loading into Impala?
I think the thing you are missing is the concept of partitions. If you define the table as partitioned, the data can be divided into different partitions based on the timestamp in the file name. I was able to work around this in Hive; I hope you can adapt it (if needed) for Impala, as the query syntax is the same.
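For example, a rough sketch of that partitioned layout in Hive DDL; the table name wikiPgvws_part, the partition column dt, and the LOCATION path are made up for illustration:
CREATE EXTERNAL TABLE wikiPgvws_part (
  project_code STRING,
  page_name STRING,
  page_views INT,
  page_bytes INT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';

-- register one file's worth of data under the timestamp taken from its name, e.g. pagecounts-20090430-230000
ALTER TABLE wikiPgvws_part ADD PARTITION (dt='20090430-230000')
LOCATION '/user/hdfs/20090430-230000/';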
For me, this problem could not be solved using Hive alone, so I mixed bash with Hive scripting and it works fine for me. This is how I wrapped it up:
1. Create the table wikiPgvws with a partition.
2. Create a table wikiTmp with the same fields as wikiPgvws, but without partitions.
3. For each file:
   i. Load the data into wikiTmp.
   ii. grep the timestamp from the file name.
   iii. Use sed to replace the placeholders in a predefined hql script file that loads the data into the actual table, then run it.
4. Drop the table wikiTmp and remove tmp.hql.
The script is as follows :
#!/bin/bash
hive -e "CREATE EXTERNAL TABLE wikiPgvws(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
PARTITIONED BY(dts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE";
hive -e "CREATE TABLE wikiTmp(
project_code varchar(100),
page_name varchar(1000),
page_views int,
page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE"
for fileName in $(hadoop fs -ls /user/hdfs/bounty/pagecounts-*.txt | grep -Po '(?<=\s)(/user.*$)')
do
echo "currentFile :$fileName"
dst=$(echo "$fileName" | grep -oE '[0-9]{8}-[0-9]{6}')
echo "currentStamp $dst"
sed "s!sourceFile!'$fileName'!" t.hql > tmp.hql
sed -i "s!targetPartition!$dst!" tmp.hql
hive -f tmp.hql
done
hive -e "DROP TABLE wikiTmp"
rm -f tmp.hql
The hql script consists of just two lines :
LOAD DATA INPATH sourceFile OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = 'targetPartition') SELECT w.* FROM wikiTmp w;
Epilogue:
Check whether options equivalent to hive -e and hive -f are available in Impala; without them, this script is of no use to you. Also, the grep commands that fetch the file name and the timestamp need to be adjusted to your table location and stamp pattern. This is just one way to show how the job can be done; I couldn't find another.
Enhancement
If everything works well, consider merging the first two DDLs into another script to make it look cleaner. I'm not sure whether hql script arguments can be used to define partition values, but you could look into them as a replacement for sed.
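If your Hive version supports variable substitution, one possible replacement for the sed step is to pass the values in with --hiveconf; the variable names srcfile and dts below are only illustrative, and I haven't verified this on every Hive release. t.hql would become:
LOAD DATA INPATH '${hiveconf:srcfile}' OVERWRITE INTO TABLE wikiTmp;
INSERT OVERWRITE TABLE wikiPgvws PARTITION (dts = '${hiveconf:dts}') SELECT w.* FROM wikiTmp w;
and the loop would call it as:
hive --hiveconf srcfile="$fileName" --hiveconf dts="$dst" -f t.hql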
