Error Copying data from HDFS to External Table In Hive - hadoop

I am trying to insert data from HDFS into an external table in Hive, but I get the error below.
Error:
Usage: java FsShell [-put <localsrc> ... <dst>]
Command failed with exit code = 255
Command
hive> !hadoop fs -put /myfolder/logs/pv_ext/2013/08/11/log/data/Sacramentorealestatetransactions.csv
> ;
Edited:
file location : /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv
table location : hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
From the Hive prompt, I ran this command:
!hadoop fs -put /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
and got this error:
put: File /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv does not exist.
Command failed with exit code = 255
Please share your suggestion.
Thanks

Here are two methods to load data into the external Hive table.
Method 1:
a) Get the location of the HDFS folder for the Hive external table.
hive> desc formatted mytable;
b) Note the value for the Location property in output. Say, it is hdfs:///hive-data/mydata
c) Then, put the file from local disk to HDFS
$ hadoop fs -put /location/of/data/file.csv hdfs:///hive-data/mydata
Method 2:
a) Load data via this Hive command
hive> LOAD DATA LOCAL INPATH '/location/of/data/file.csv' INTO TABLE mytable;
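Steps a) to c) of Method 1 can be scripted. The sketch below is only an illustration: it assumes the Hive CLI prints the location on a row whose first field is "Location:" (Beeline formats its output differently), and the helper name table_location is made up for this example.

```shell
# Print the Location value from `desc formatted` output read on stdin.
# Assumes the Hive CLI layout: a row whose first field is "Location:".
table_location() {
    awk '$1 == "Location:" { print $2; exit }'
}

# Hypothetical usage (needs hive and hadoop on PATH, so it is left commented out):
#   loc=$(hive -S -e 'desc formatted mytable' | table_location)
#   hadoop fs -put /location/of/data/file.csv "$loc"
```

Keeping the parsing in its own function means it can be checked against saved output before it is run on a cluster.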

One more method: change the Hive table's location so that it points at the folder that already holds the data:
ALTER TABLE table_name SET LOCATION 'hdfs://your_data/folder';

This method may work better for you.
First, create an external table in Hive.
hive> CREATE EXTERNAL TABLE IF NOT EXISTS mytable(myid INT, a1 STRING, a2 STRING....)
row format delimited fields terminated by '\t' stored as textfile LOCATION
'hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data';
Then load the data from HDFS into the Hive table.
hive> LOAD DATA INPATH '/yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv' INTO TABLE mytable;
NOTE: LOAD DATA INPATH (without LOCAL) moves the file from its HDFS location into the table's
location, so the file will no longer be at the original HDFS path afterwards.
Finally, check that the data loaded successfully.
hive> SELECT * FROM mytable;

Related

Two separate tables on Hadoop from two files in HDFS directory

I am trying to build two Hadoop tables from one HDFS directory.
So I'd like table file1 from file 1.tsv and another table file2 from file 2.tsv. But both are inside one HDFS directory /tmp/ip.
# create hdfs directory
hadoop fs -mkdir /tmp/ip
# put my two tsv files
hadoop fs -put /tmp/data/1.tsv tmp/ip/
hadoop fs -put /tmp/data/2.tsv tmp/ip/
Now in Hive's CLI
--in Hive CLI to build table
CREATE EXTERNAL TABLE IF NOT EXISTS file1
(id STRING,Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- failed solution because there's two files
LOCATION 'tmp/ip';
-- failed solution but don't understand why
LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1
Regarding the failed LOAD DATA statement:
LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1 fails because of the LOCAL keyword, which makes Hive look for the file on the local file system. The file is in HDFS, so try it without LOCAL:
LOAD DATA INPATH 'tmp/ip/1.tsv' INTO TABLE file1;
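On the LOCATION attempt: an external table's LOCATION is a directory, and Hive reads every file in that directory, so a single directory holding both files cannot back two different tables. One common approach is to give each table its own subdirectory. The sketch below only prints the DDL. The subdirectory names /tmp/ip/file1 and /tmp/ip/file2 and the helper name external_table_ddl are assumptions made for illustration.

```shell
# Print DDL for an external table over one HDFS directory.
# The column list is copied from the question; the directory layout is an assumption.
external_table_ddl() {
    table=$1
    location=$2
    cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS $table
(id STRING, Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '$location';
EOF
}

# Hypothetical layout, one directory per table (commented out: needs a cluster):
#   hadoop fs -mkdir -p /tmp/ip/file1 /tmp/ip/file2
#   hadoop fs -put /tmp/data/1.tsv /tmp/ip/file1/
#   hadoop fs -put /tmp/data/2.tsv /tmp/ip/file2/
external_table_ddl file1 /tmp/ip/file1
external_table_ddl file2 /tmp/ip/file2
```

The printed statements can be saved to a file and run with hive -f.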

When we Load data into Hive table from HDFS, it deletes the file from source directory(HDFS)

When we load data into a Hive table from HDFS, it deletes the file from the source directory (HDFS). Is there a way to keep the file in the source directory and load the data into the Hive table as well?
I used the query below:
LOAD DATA INPATH 'source_file_path' insert INTO TABLE TABLENAME;
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Use the hadoop fs -cp or hdfs dfs -cp command to copy (not move) files:
hadoop fs -cp [source_file_path] [table_location_path]
or
hdfs dfs -cp [source_file_path] [table_location_path]
Use the describe formatted tablename command to check the table location path.
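As a rough sketch of the copy approach, the function below wraps hadoop fs -cp so that the source file stays where it is. The function name copy_into_table and the HADOOP_CMD variable are made up for this example. The variable exists only so the call can be dry-run with echo on a machine without Hadoop.

```shell
# Command used to talk to HDFS; overridable so the function can be dry-run.
HADOOP_CMD=${HADOOP_CMD:-hadoop}

# Copy (not move) a file into a table's directory, leaving the source in place.
copy_into_table() {
    source_file_path=$1
    table_location_path=$2
    "$HADOOP_CMD" fs -cp "$source_file_path" "$table_location_path"
}

# Example call, commented out because it needs a running cluster
# (both paths are placeholders):
#   copy_into_table /data/in/file.csv hdfs:///hive-data/mydata
```

For an unpartitioned table, files copied into the table location are picked up by later queries without a separate LOAD DATA step.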

How can I read a Hive table stored in a .orc file?

I have a .orc file. Is there a way to convert it to a .csv file, or is there another way to read the tables within this file?
Hive has native ORC support, so you can read it directly via Hive.
Illustration:
(Say, the file is named myfile.orc)
Upload file to HDFS
hadoop fs -mkdir hdfs:///my_table_orc_file
hadoop fs -put myfile.orc hdfs:///my_table_orc_file
Create a Hive table on it
(Update column definitions to match the data)
CREATE EXTERNAL TABLE `my_table_orc`(
`col1` string,
`col2` string)
STORED AS ORC
LOCATION
'hdfs:///my_table_orc_file';
Query it
select * from my_table_orc;
You can also read the contents of an ORC file with the following command:
hive --orcfiledump -d <path_of_orc_file_in_hdfs>
It prints the rows as JSON.

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
So far I have used Hive to create the database on Hadoop, and I access it at localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal. It gives me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest a way to resolve it?
I assume the data is initially in the local file system.
A simple workflow would be: copy the data from the local file system to HDFS, create a Hive table, and then load the data into the table.
Step 1:
# put in HDFS
$ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
# check files
$ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
-- display table structure
describe mytable;
Step 3:
-- the files are already in HDFS after Step 1, so LOCAL is not used here
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
-- simple hive statement to fetch top 10 records
SELECT * FROM mytable limit 10;
Use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into Hive tables.
If you don't specify LOCAL, the load command looks up the given path in HDFS.
See the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
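To make the LOCAL distinction concrete, here is a small sketch that builds the statement for either case. The function name load_stmt and the path /home/me/data.csv are made up for illustration. Only the two LOAD DATA forms themselves come from the Hive manual.

```shell
# Build a LOAD DATA statement; the first argument is "local" or "hdfs".
load_stmt() {
    source_kind=$1
    path=$2
    table=$3
    case $source_kind in
        local) printf "LOAD DATA LOCAL INPATH '%s' INTO TABLE %s;\n" "$path" "$table" ;;
        hdfs)  printf "LOAD DATA INPATH '%s' INTO TABLE %s;\n" "$path" "$table" ;;
        *)     echo "source kind must be local or hdfs" >&2; return 1 ;;
    esac
}

# File on the machine running the Hive client (hypothetical path): LOCAL.
load_stmt local /home/me/data.csv mytable
# Path already in HDFS: no LOCAL.
load_stmt hdfs /path/to/your/HDFS_directory mytable
```

Either printed statement can then be passed to hive -e.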

Apache hive MSCK REPAIR TABLE new partition not added

I am new to Apache Hive. While working with an external partitioned table, if I add a new partition directly in HDFS, the new partition is not added after running MSCK REPAIR TABLE. Below is the code I tried:
-- creating external table
hive> create external table factory(name string, empid int, age int) partitioned by(region string)
> row format delimited fields terminated by ',';
--Detailed Table Information
Location: hdfs://localhost.localdomain:8020/user/hive/warehouse/factory
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
transient_lastDdlTime 1438579844
-- creating directory in HDFS to load data for table factory
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- Table data
cat factory1.txt
emp1,500,40
emp2,501,45
emp3,502,50
cat factory2.txt
EMP10,200,25
EMP11,201,27
EMP12,202,30
-- copying from local to HDFS
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory1.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory2.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'
-- Altering table to update in the metastore
hive> alter table factory add partition(region='southregion') location '/user/hive/testing/testing1/factory2';
hive> alter table factory add partition(region='northregion') location '/user/hive/testing/testing1/factory1';
hive> select * from factory;
OK
emp1 500 40 northregion
emp2 501 45 northregion
emp3 502 50 northregion
EMP10 200 25 southregion
EMP11 201 27 southregion
EMP12 202 30 southregion
Next, I created a new file, factory3.txt, to add as a new partition for the table factory:
cat factory3.txt
user1,100,25
user2,101,27
user3,102,30
-- creating the path and copying table data
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'
Then I ran the query below to update the metastore with the new partition:
MSCK REPAIR TABLE factory;
The table does not return the contents of the new factory3 partition. Where am I making a mistake while adding the partition for table factory?
However, if I run the alter command, the new partition data shows up.
hive> alter table factory add partition(region='eastregion') location '/user/hive/testing/testing1/factory3';
Why is the MSCK REPAIR TABLE command not working?
For MSCK to work, the partition directories must follow the naming convention /partition_name=partition_value/. For example, in the root directory of the table:
# hadoop fs -ls /user/hive/root_of_table/*
/user/hive/root_of_table/day=20200101/data1.parq
/user/hive/root_of_table/day=20200101/data2.parq
/user/hive/root_of_table/day=20200102/data3.parq
/user/hive/root_of_table/day=20200102/data4.parq
When you run msck repair table <tablename>, the partitions day=20200101 and day=20200102 will be added automatically.
You have to put the data in a directory named 'region=eastregion' inside the table's location directory:
$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
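The directory naming rule can be sketched as a small helper. The function name partition_dir is made up for this example, and the script only prints the commands (a dry run) rather than running them. The table root is the Location reported in the question.

```shell
# Build the directory MSCK REPAIR TABLE looks for: <table root>/<key>=<value>
partition_dir() {
    table_root=${1%/}   # drop one trailing slash, if any
    printf '%s/%s=%s\n' "$table_root" "$2" "$3"
}

root='hdfs://localhost.localdomain:8020/user/hive/warehouse/factory'
dir=$(partition_dir "$root" region eastregion)

# Dry run: print the commands instead of executing them.
echo "hadoop fs -mkdir '$dir'"
echo "hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' '$dir'"
echo "hive -e 'MSCK REPAIR TABLE factory;'"
```

The partitions added earlier with ALTER TABLE ... LOCATION keep working, because their locations are stored explicitly in the metastore. MSCK only discovers directories under the table root that follow the key=value pattern.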
