How to load multiple files into a table in Hive? - hadoop

There is a directory which contains multiple files yet to be analyzed, for example, file1, file2, file3.
I want to
load data inpath 'path/to/*' overwrite into table demo
instead of
load data inpath 'path/to/file1' overwrite into table demo
load data inpath 'path/to/file2' overwrite into table demo
load data inpath 'path/to/file3' overwrite into table demo.
However, it just doesn't work. Are there any easier ways to implement this?

1.
load data inpath is an HDFS metadata operation.
The only thing it does is move files from their current location to the table location.
And again, "moving" (unlike "copying") is a metadata operation, not a data operation.
2.
If the OVERWRITE keyword is used then the contents of the target table
(or partition) will be deleted and replaced by the files referred to
by filepath; otherwise the files referred by filepath will be added to
the table.
Language Manual DML-Loading files into tables
3.
load data inpath 'path/to/file1' into table demo;
load data inpath 'path/to/file2' into table demo;
load data inpath 'path/to/file3' into table demo;
or
load data inpath 'path/to/file?' into table demo;
or
dfs -mv path/to/file? ...{path to demo}.../demo
or (from bash)
hdfs dfs -mv path/to/file? ...{path to demo}.../demo
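If you want to convince yourself that LOAD DATA only moves files, list the source directory and the table directory before and after the load. A small sketch; the warehouse path below is only an assumption (check the real one with describe formatted demo):
hdfs dfs -ls /path/to/                      # file1..file3 are here before the load
hive -e "load data inpath 'path/to/file?' into table demo;"
hdfs dfs -ls /path/to/                      # the matched files are gone from the source
hdfs dfs -ls /user/hive/warehouse/demo      # ...and now sit under the table location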

Creating a Hive table with the path as the LOCATION parameter will make Hive automatically read all the files in that location.
For example:
CREATE [EXTERNAL] TABLE db.tbl(
column1 string,
column2 int ...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY (delimiter)
LINES TERMINATED BY '\n'
LOCATION '/path/to/';  -- DO NOT POINT TO A SPECIFIC FILE, POINT TO THE DIRECTORY
Hive will automatically parse all data within the folder and "force feed" it to the table you created.
As long as all files in that path are in the same format, you are good to go.
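For instance, a filled-in sketch of that template; the table name demo_ext, the columns, and the comma delimiter are only assumptions about what file1, file2 and file3 contain:
CREATE EXTERNAL TABLE demo_ext (
  column1 string,
  column2 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/path/to/';

-- every file already sitting in /path/to/ (file1, file2, file3) is queryable right away
SELECT COUNT(*) FROM demo_ext;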

1) Directory contains three files
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:53 /hallfolder/hall1.csv
-rw-r--r-- 1 hadoop supergroup 125 2017-05-15 17:54 /hallfolder/hall2.csv
2) Enable this setting:
SET mapred.input.dir.recursive=true;
3) hive>
load data inpath '/hallfolder/*' into table alltable;
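A quick way to confirm that all three files were picked up is to count the rows and list the table directory afterwards; the warehouse path below is only an assumption (check it with describe formatted alltable):
hive> SELECT COUNT(*) FROM alltable;
hive> dfs -ls /user/hive/warehouse/alltable;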

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
filepath can be:
- a relative path, such as project/data1
- an absolute path, such as /user/hive/project/data1
- a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/project/data1
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.
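Because filepath may be a directory, the whole folder from the original question can be loaded in one statement; a sketch, assuming /path/to/ holds only the files meant for the table:
LOAD DATA INPATH '/path/to/' OVERWRITE INTO TABLE demo;
-- Hive moves every file inside /path/to/ into the table location;
-- note that in older Hive versions the directory must not contain subdirectories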

Related

Two separate tables on Hadoop from two files in an HDFS directory

I am trying to build two Hadoop tables from one HDFS directory.
I'd like a table file1 from the file 1.tsv and another table file2 from the file 2.tsv, but both sit inside the same HDFS directory /tmp/ip.
# create hdfs directory
hadoop fs -mkdir /tmp/ip
# put my two tsv files
hadoop fs -put /tmp/data/1.tsv tmp/ip/
hadoop fs -put /tmp/data/2.tsv tmp/ip/
Now in Hive's CLI
--in Hive CLI to build table
CREATE EXTERNAL TABLE IF NOT EXISTS file1
(id STRING,Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- failed solution because there's two files
LOCATION 'tmp/ip';
-- failed solution but don't understand why
LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1
Regarding the failed solution:
-- failed solution but don't understand why LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1 failed.
This is failing because of the keyword LOCAL: Hive is looking for the file on the local file system. Try it without LOCAL:
LOAD DATA INPATH 'tmp/ip/1.tsv' INTO TABLE file1;
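To build the second table from the same directory, the same pattern applies; a sketch assuming file2 was created like file1 (and note that the hadoop fs -put commands above used the relative path tmp/ip/, so check with hadoop fs -ls where the files actually landed):
LOAD DATA INPATH '/tmp/ip/2.tsv' INTO TABLE file2;
-- each LOAD moves only the named file, so both tables can be fed from the same directory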

When we Load data into Hive table from HDFS, it deletes the file from source directory(HDFS)

When we load data into a Hive table from HDFS, it deletes the file from the source directory (HDFS). Is there a way we can keep the file in the source directory and still load the data into the Hive table?
I used the below query:
LOAD DATA INPATH 'source_file_path' insert INTO TABLE TABLENAME;
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Use the hadoop fs -cp or hdfs dfs -cp command to copy (not move) the files:
hadoop fs -cp [source_file_path] [table_location_path]
or
hdfs dfs -cp [source_file_path] [table_location_path]
Use the describe formatted tablename command to check the table location path.
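Putting the two steps together, a minimal sketch with placeholder paths:
hive> describe formatted mytable;          -- read the Location: value from the output
$ hadoop fs -cp /data/incoming/source_file.csv /user/hive/warehouse/mytable/
hive> SELECT COUNT(*) FROM mytable;        -- the copied file is visible immediately; the source stays in place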

Error Copying data from HDFS to External Table In Hive

I am trying to insert data from HDFS into an external table in Hive, but I'm getting the below error.
Error :
Usage: java FsShell [-put <localsrc> ... <dst>]
Command failed with exit code = 255
Command
hive> !hadoop fs -put /myfolder/logs/pv_ext/2013/08/11/log/data/Sacramentorealestatetransactions.csv
> ;
Edited:
file location : /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv
table location : hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
I am in Hive, executing the command:
!hadoop fs -put /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
getting error :
put: File /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv does not exist.
Command failed with exit code = 255
Please share your suggestion.
Thanks
Here are two methods to load data into the external Hive table.
Method 1:
a) Get the location of the HDFS folder for the Hive external table.
hive> desc formatted mytable;
b) Note the value for the Location property in output. Say, it is hdfs:///hive-data/mydata
c) Then, put the file from local disk to HDFS
$ hadoop fs -put /location/of/data/file.csv hdfs:///hive-data/mydata
Method 2:
a) Load data via this Hive command
hive > LOAD DATA LOCAL INPATH '/location/of/data/file.csv' INTO TABLE mytable;
One more method: change the Hive table location:
alter table table_name set location='hdfs://your_data/folder';
This method may suit you better.
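For example, a sketch of that approach; the directory is taken from the question's file location, and the table must be EXTERNAL if you do not want Hive to manage (and potentially delete) the data:
ALTER TABLE mytable SET LOCATION 'hdfs://sandbox:8020/yapstone/logs/pv_ext/somedatafor_7_11';
DESCRIBE FORMATTED mytable;   -- Location should now point at the folder that already holds the CSV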
You need to create a table in Hive first.
hive> CREATE EXTERNAL TABLE IF NOT EXISTS mytable(myid INT, a1 STRING, a2 STRING....)
row format delimited fields terminated by '\t' stored as textfile LOCATION
'hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data';
Load data from HDFS into the Hive table.
hive> LOAD DATA INPATH '/yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv' INTO TABLE mytable;
NOTE: If you load data from HDFS into Hive (INPATH), the data is moved from the HDFS location into Hive, so it won't be available at the original HDFS location afterwards.
Check if the data loaded successfully.
hive> SELECT * FROM mytable;

Creating hive table: no files matching path file... but the file exist in the path

I'm trying to create a Hive ORC table using a file stored in HDFS.
I have a file "partsupp.tbl" where each line has the format below:
1|25002|8076|993.49|ven ideas. quickly even packages print. pending multipliers must have to are fluff|
I create a hive table like this:
create table if not exists partsupp (PS_PARTKEY BIGINT,
PS_SUPPKEY BIGINT,
PS_AVAILQTY INT,
PS_SUPPLYCOST DOUBLE,
PS_COMMENT STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY")
;
Now I'm trying to load the data from the .tbl file into the table like this:
LOAD DATA LOCAL INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp
But I'm getting this error:
No files matching path file:/tables/partsupp/partsupp.tbl
But the file exists in HDFS...
LOCAL signifies that the file is present on the local file system. If LOCAL is omitted, Hive looks for the file in HDFS.
So in this case, use following query:
LOAD DATA INPATH '/tables/partsupp/partsupp.tbl' INTO TABLE partsupp
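If you are not sure which file system a path refers to, checking both sides first avoids this error; a quick sketch:
ls /tables/partsupp/partsupp.tbl             # local disk: this is what LOAD DATA LOCAL INPATH reads
hdfs dfs -ls /tables/partsupp/partsupp.tbl   # HDFS: this is what plain LOAD DATA INPATH reads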

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
Until now I have used Hive to create the database on Hadoop, and I am accessing it via localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal, as it gives me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest a way to resolve it?
I suppose the data is initially in the local file system.
So a simple workflow could be: load the data from local into the Hadoop file system (HDFS), create a Hive table over it, and then load the data into the Hive table.
Step 1:
# put the files in HDFS
$~ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
# check the files
$~ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
-- display table structure
describe mytable;
Step 3:
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
-- simple Hive statement to fetch the top 10 records
SELECT * FROM mytable limit 10;
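The three steps can also be chained from the shell; a rough sketch that reuses the table definition above (the paths and the mytable name are placeholders):
# Step 1: copy the local CSVs into HDFS
hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
# Steps 2 and 3: create the table and load the whole HDFS directory into it
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS mytable (Year int, name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
LOAD DATA INPATH '/path/to/your/HDFS_directory' OVERWRITE INTO TABLE mytable;
SELECT * FROM mytable LIMIT 10;"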
You should use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into Hive tables.
If you don't specify LOCAL, the load command will look for the given file path in HDFS.
Please refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
