Unable to load data from multiple level directories into Hive table - hadoop

I created a table the following way
CREATE TABLE `default.tmptbl` (id int, name string) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'escapeChar'='\\','quoteChar'='\"','separatorChar'=',');
And I have data in HDFS that have been structured in the following way:
/app/tmptbl/
    /DIR1
        /file1.csv
        /file2.csv
    /DIR2
        /file3.csv
        /file4.csv
I tried to load the data using the following command:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
LOAD DATA INPATH '/app/tmptbl/' INTO TABLE `default.tmptbl`;
However I get the following error:
FAILED: SemanticException Line 1:17 Invalid path ''/app/tmptbl/'': source contains directory: /app/tmptbl/dir1
I don't know why setting mapred.input.dir.recursive=true and hive.mapred.supports.subdirectories=true didn't make it load data recursively from the sub-directories. Am I missing anything?
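Per the Hive Language Manual, LOAD DATA accepts a directory of files but does not descend into nested sub-directories; those settings affect query-time input splits, not the LOAD statement. A common workaround, sketched against the layout above, is to load each leaf directory separately:

```sql
-- LOAD DATA does not recurse into sub-directories, so load each leaf directory
LOAD DATA INPATH '/app/tmptbl/DIR1' INTO TABLE `default.tmptbl`;
LOAD DATA INPATH '/app/tmptbl/DIR2' INTO TABLE `default.tmptbl`;
```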

Related

Error Copying data from HDFS to External Table In Hive

I am trying to insert data from HDFS into an external table in Hive, but I am getting the error below.
Error :
Usage: java FsShell [-put <localsrc> ... <dst>]
Command failed with exit code = 255
Command
hive> !hadoop fs -put /myfolder/logs/pv_ext/2013/08/11/log/data/Sacramentorealestatetransactions.csv
> ;
Edited:
file location : /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv
table location : hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
I am in the Hive shell, executing this command:
!hadoop fs -put /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
and getting this error:
put: File /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv does not exist.
Command failed with exit code = 255
Please share your suggestions.
Thanks
Here are two methods to load data into the external Hive table.
Method 1:
a) Get the location of the HDFS folder for the Hive external table.
hive> desc formatted mytable;
b) Note the value of the Location property in the output. Say it is hdfs:///hive-data/mydata
c) Then, put the file from local disk to HDFS
$ hadoop fs -put /location/of/data/file.csv hdfs:///hive-data/mydata
Method 2:
a) Load data via this Hive command
hive > LOAD DATA LOCAL INPATH '/location/of/data/file.csv' INTO TABLE mytable;
One more method: change the Hive table location.
alter table table_name set location='hdfs://your_data/folder';
This method may work better for you.
First, create the table in Hive:
hive> CREATE EXTERNAL TABLE IF NOT EXISTS mytable(myid INT, a1 STRING, a2 STRING....)
row format delimited fields terminated by '\t' stored as textfile
LOCATION 'hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data';
Load the data from HDFS into the Hive table:
hive> LOAD DATA INPATH '/yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv' INTO TABLE mytable;
NOTE: If you load data from HDFS into Hive (INPATH without LOCAL), the files are moved, not copied, from the HDFS
location into the Hive table's location. So the data won't be available at the original HDFS location afterwards.
Check if the data loaded successfully.
hive> SELECT * FROM mytable;

How to load multiple files into a table in Hive?

There is a directory which contains multiple files yet to be analyzed, for example, file1, file2, file3.
I want to
load data inpath 'path/to/*' overwrite into table demo
instead of
load data inpath 'path/to/file1' overwrite into table demo
load data inpath 'path/to/file2' overwrite into table demo
load data inpath 'path/to/file3' overwrite into table demo.
However, it just doesn't work. Are there any easier ways to implement this?
1.
load data inpath is an HDFS metadata operation.
The only thing it does is move files from their current location to the table location.
And again, "moving" (unlike "copying") is a metadata operation, not a data operation.
2.
If the OVERWRITE keyword is used then the contents of the target table
(or partition) will be deleted and replaced by the files referred to
by filepath; otherwise the files referred by filepath will be added to
the table.
Language Manual DML-Loading files into tables
3.
load data inpath 'path/to/file1' into table demo;
load data inpath 'path/to/file2' into table demo;
load data inpath 'path/to/file3' into table demo;
or
load data inpath 'path/to/file?' into table demo;
or
dfs -mv path/to/file? ...{path to demo}.../demo
or (from bash)
hdfs dfs -mv path/to/file? ...{path to demo}.../demo
Creating a Hive table with the directory as the LOCATION parameter will automatically read all the files in that location.
for example:
CREATE [EXTERNAL] TABLE db.tbl(
  column1 string,
  column2 int ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY (delimiter)
LINES TERMINATED BY '\n'
LOCATION '/path/to/'  -- do not point to a specific file; point to the directory
Hive will automatically parse all data within the folder and "force feed" it to the table you created.
As long as all files in that path are in the same format, you are good to go.
1) Directory contains three files
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall1.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:54 /hallfolder/hall2.csv
2) Enable recursive directory reading:
SET mapred.input.dir.recursive=true;
3) hive>
load data inpath '/hallfolder/*' into table alltable;
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
filepath can be:
- a relative path, such as project/data1
- an absolute path, such as /user/hive/project/data1
- a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/project/data1
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.

Compressed JSON data in a Hive external table throws an exception at query time

I created an external table by following the steps below:
Hive > ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
Hive > set hive.exec.compress.output=true;
Hive > set mapred.output.compress=true;
Hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
Hive> set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Hive > CREATE EXTERNAL TABLE Json (id BIGINT, created_at STRING, source STRING, favorited BOOLEAN) ROW FORMAT SERDE "com.cloudera.hive.serde.JSONSerDe"
LOCATION '/user/cloudera/jsonGZ';
I compressed my JSON file by executing the command below:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.5.0.jar -Dmap.output.compress=true -Dmap.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec -input /user/cloudera/json/ -output /user/cloudera/jsonGZ
Then, when I run select * from json;, I get this error:
OK Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.map.JsonMappingException: Can not deserialize instance of java.util.LinkedHashMap out of VALUE_NUMBER_INT token at ...
I also created another table using "org.apache.hive.hcatalog.data.JsonSerDe":
Hive > ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
Hive > CREATE EXTERNAL TABLE Json1 (id BIGINT, created_at STRING, source STRING, favorited BOOLEAN) ROW FORMAT SERDE "org.apache.hive.hcatalog.data.JsonSerDe"
LOCATION '/user/cloudera/jsonGZ';
Then, when I run select * from json1;, I get this error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected" after using "org.apache.hive.hcatalog.core (hive-hcatalog-core-0.13.0.jar)"
Am I missing something? How can I resolve these errors?
Just gzip your files and put them as is (*.gz) into the table location.
gzip filename
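For example, a minimal sketch of the workflow (the sample file name and record are hypothetical; the SerDe expects one JSON object per line):

```shell
# create a one-record JSON file (one object per line, as the SerDe expects)
echo '{"id":1,"created_at":"now","source":"web","favorited":false}' > sample.json
# gzip compresses in place, producing sample.json.gz
gzip -f sample.json
# sanity-check the archive; Hive reads .gz text files transparently
gzip -t sample.json.gz && echo "gzip ok"
```

Then upload the compressed file into the table location, e.g. with hdfs dfs -put sample.json.gz /user/cloudera/jsonGZ/, and query the table as usual.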

Creating hive table: no files matching path file... but the file exist in the path

I'm trying to create a Hive ORC table using a file stored in HDFS.
I have a "partsupp.tbl" file where each line has the format below:
1|25002|8076|993.49|ven ideas. quickly even packages print. pending multipliers must have to are fluff|
I create a hive table like this:
create table if not exists partsupp (PS_PARTKEY BIGINT,
PS_SUPPKEY BIGINT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DOUBLE,
PS_COMMENT STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
;
Now I'm trying to load the data from the .tbl file into the table like this:
LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp
But I'm getting this issue:
No files matching path file:/tables/partsupp/partsupp.tbl
But the file exists in HDFS...
LOCAL signifies that file is present on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.
So in this case, use following query:
LOAD DATA INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp
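One caveat: since the target table is STORED AS ORC, moving a pipe-delimited text file into it with LOAD DATA will still fail when the table is queried, because the file is not in ORC format. A common pattern (a sketch; the staging table name is hypothetical) is to load into a text-format staging table and convert with INSERT ... SELECT:

```sql
-- staging table matching the pipe-delimited .tbl layout
CREATE TABLE partsupp_staging (PS_PARTKEY BIGINT, PS_SUPPKEY BIGINT,
  PS_AVAILQTY INT, PS_SUPPLYCOST DOUBLE, PS_COMMENT STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp_staging;

-- Hive rewrites the rows into ORC as it inserts into the target table
INSERT INTO TABLE partsupp SELECT * FROM partsupp_staging;
```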

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
Until now I have used Hive to create the database on Hadoop, and I access it via localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal; it gives me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest me some way to resolve it?
I suppose the data is initially on the local file system.
So a simple workflow could be: load the data from the local file system into HDFS, create a Hive table over it, and then load the data into the Hive table.
Step 1:
// put in HDFS
$~ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
// check files
$~ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
// display table structure
describe mytable;
Step 3:
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
// simple hive statement to fetch top 10 records
SELECT * FROM mytable limit 10;
You should use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into Hive tables.
If you don't specify LOCAL, the LOAD command assumes the given file path is an HDFS location.
Please refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
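A side-by-side sketch of the two forms (the paths are hypothetical):

```sql
-- LOCAL: file on the client machine's local filesystem; Hive copies it into the table
LOAD DATA LOCAL INPATH '/home/user/data/file.csv' INTO TABLE mytable;

-- no LOCAL: file already in HDFS; Hive moves it into the table's location
LOAD DATA INPATH '/user/hive/staging/file.csv' INTO TABLE mytable;
```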
