Hive: Multiple files in one partition - hadoop

Hive: Can I add partition with few locations?
For example, will the following query work?
alter table data
add partition (year = 2013, month = 11, day = 18)
LOCATION '/path1/a.avro,/path2/b.avro..';

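Yes, a partition can contain multiple files, but not by listing them in LOCATION: LOCATION takes a single directory, not a comma-separated list of files. If the partition already exists in Hive (as an HDFS directory), you don't need to run any Hive ALTER commands. Just use hadoop fs -put ..
For example, say you have a partitioned Hive table test (partitioned by dt) with the partition directory: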
/user/hive/warehouse/test/dt=20131216
with files:
/user/hive/warehouse/test/dt=20131216/1.avro
/user/hive/warehouse/test/dt=20131216/2.avro
Now, if you have a new Avro file, 3.avro, just run the hadoop fs -put command to copy it into that partition directory, and Hive will see the new file automatically.
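As a rough sketch of the step above, the partition directory can be derived from the table location and the partition value, and the new file is then put into it. The helper name partition_dir and the echo dry run are illustrative assumptions; the paths are the ones from the example.

```shell
#!/bin/sh
# Build the HDFS directory for one partition of a Hive table.
# partition_dir is a hypothetical helper, not part of Hadoop or Hive.
partition_dir() {
    # $1 = table location, $2 = partition column, $3 = partition value
    printf '%s/%s=%s\n' "$1" "$2" "$3"
}

target=$(partition_dir /user/hive/warehouse/test dt 20131216)

# Dry run: print the command instead of running it, so the sketch
# works without a cluster. Run the printed command to upload for real.
cmd="hadoop fs -put 3.avro $target/"
echo "$cmd"
```

This only covers a partition Hive already knows about. A directory created by hand for a brand-new partition value still has to be registered, for example with ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE.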

Related

When we Load data into Hive table from HDFS, it deletes the file from source directory(HDFS)

When we load data into a Hive table from HDFS, it deletes the file from the source directory (HDFS). Is there a way to keep the file in the source directory and still load the data into the Hive table?
I used the query below:
LOAD DATA INPATH 'source_file_path' INTO TABLE TABLENAME;
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Use the hadoop fs -cp or hdfs dfs -cp command to copy (not move) files:
hadoop fs -cp [source_file_path] [table_location_path]
or
hdfs dfs -cp [source_file_path] [table_location_path]
Use the describe formatted tablename command to check the table location path.

Error Copying data from HDFS to External Table In Hive

I am trying to insert data from HDFS into an external table in Hive, but I get the error below.
Error :
Usage: java FsShell [-put <localsrc> ... <dst>]
Command failed with exit code = 255
Command
hive> !hadoop fs -put /myfolder/logs/pv_ext/2013/08/11/log/data/Sacramentorealestatetransactions.csv
> ;
Edited:
file location : /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv
table location : hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
I am in the Hive shell, executing this command:
!hadoop fs -put /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
and getting this error:
put: File /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv does not exist.
Command failed with exit code = 255
Please share your suggestion.
Thanks
Here are two methods to load data into the external Hive table.
Method 1:
a) Get the location of the HDFS folder for the Hive external table.
hive> desc formatted mytable;
b) Note the value for the Location property in output. Say, it is hdfs:///hive-data/mydata
c) Then, put the file from local disk to HDFS
$ hadoop fs -put /location/of/data/file.csv hdfs:///hive-data/mydata
Method 2:
a) Load data via this Hive command
hive> LOAD DATA LOCAL INPATH '/location/of/data/file.csv' INTO TABLE mytable;
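Steps a) and b) of Method 1 could be scripted roughly as below. The sample line stands in for real desc formatted output, and its exact layout is an assumption to verify against your Hive version; the path is the one from step b).

```shell
#!/bin/sh
# Sample line standing in for the output of
#   hive -S -e 'desc formatted mytable'
# The layout (label, padding, value) is an assumption.
sample='Location:            hdfs:///hive-data/mydata'

# Pick the second whitespace-separated field of the Location row.
location=$(printf '%s\n' "$sample" | awk '$1 == "Location:" { print $2 }')
echo "$location"

# With a live cluster this would become (not run here):
#   location=$(hive -S -e 'desc formatted mytable' | awk '$1 == "Location:" { print $2 }')
#   hadoop fs -put /location/of/data/file.csv "$location"
```

Note that hadoop fs -put reads its source from the local file system. If the source file is already in HDFS, as in the question, -put reports that it does not exist; hadoop fs -cp or hadoop fs -mv is the HDFS-to-HDFS equivalent.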
One more method: change the Hive table location.
ALTER TABLE table_name SET LOCATION 'hdfs://your_data/folder';
This method may suit you better.
First, create a table in Hive.
hive> CREATE EXTERNAL TABLE IF NOT EXISTS mytable(myid INT, a1 STRING, a2 STRING....)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION
'hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data';
Then load the data from HDFS into the Hive table.
hive> LOAD DATA INPATH '/yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv' INTO TABLE mytable;
NOTE: If you load data from HDFS into Hive (INPATH without LOCAL), the data is moved from its HDFS
location into the table's location, so it will no longer be available at the original HDFS path.
Check if the data loaded successfully.
hive> SELECT * FROM mytable;

Timestamp partitioning in Hive

I am trying to create a timestamp-based partition in Hive, but Hive is creating a date-based partition instead. Below is my code. Could someone please help?
cat test1.sh
dat=`date +'%Y%m%d %H:%m:%S'`
hive -f load.hql -hiveconf file_load_timestamp=$dat;
cat load.hql
INSERT OVERWRITE table perm.test partition(file_load_timestamp='${hiveconf:dat}')
SELECT a,b FROM work.temp;
The HDFS path is being created like this: dt=20180102/
I expect it to be created like this: dt=20180102 103455/
When I tried the '%Y%m%d_%H:%m:%S' format, it worked as expected, but I need a space between the date and the time.
To create a folder in HDFS with a space in its name, you need to escape the space with a backslash:
hadoop fs -mkdir test\ 123
This creates a folder in HDFS named test 123.
Similarly, Hive keeps each partition in a folder named after the partition value. That is why the date format %Y%m%d\ %H%M%S helps create a folder with a space in its name (note %M for minutes; %m is the month).
Below is tested and working:
INSERT OVERWRITE table person_details1 partition(datelocal='20180102\ 200128') select * from person_details;
datelocal is a STRING column.
Edited: I executed the code; the version below works:
hduser@Amit:~$ cat test1.sh
#!/bin/sh
dat=`date +'%Y%m%d\ %H%M%S'`
hive -f load.hql -hiveconf datelocal="$dat";
hduser@Amit:~$ cat load.hql
INSERT OVERWRITE table amit.person_details1 partition(datelocal='${hiveconf:datelocal}') select * from amit.person_details;
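The truncated partition value in the question looks consistent with shell word splitting: an unquoted variable containing a space is passed to hive as two arguments. The sketch below illustrates that effect without needing Hive; count_args is a hypothetical helper and the timestamp is a fixed sample value.

```shell
#!/bin/sh
# count_args is a hypothetical helper that reports how many
# arguments a command would receive.
count_args() { echo "$#"; }

dat='20180102 103455'   # fixed sample value so the result is reproducible

# Unquoted: the shell splits on the space, so "103455" becomes a
# separate argument and the hiveconf value is cut down to the date.
unquoted=$(count_args -hiveconf file_load_timestamp=$dat)

# Quoted: the whole timestamp stays in one argument.
quoted=$(count_args -hiveconf file_load_timestamp="$dat")

echo "unquoted: $unquoted arguments, quoted: $quoted arguments"

# In date formats, %M is minutes and %m is the month.
now=$(date +'%Y%m%d %H%M%S')
echo "$now"
```

Quoting "$dat" on the hive command line, as the working script does, is what keeps the date and time together in one value.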

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
Until now I have used Hive to create the database on Hadoop, and I access it at localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal. It gives me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest me some way to resolve it?
I suppose initially the data is in the Local file system.
So a simple workflow could be: load data from local to hadoop file system(HDFS), create a hive table over it and then load the data in hive table.
Step 1:
# put in HDFS
$ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
# check files
$ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
-- display table structure
describe mytable;
Step 3:
-- the files are already in HDFS after Step 1, so LOCAL is not used here
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
-- simple Hive statement to fetch the first 10 records
SELECT * FROM mytable LIMIT 10;
You should use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into Hive tables.
If you don't specify LOCAL, the load command looks up the given file path in HDFS.
Please refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
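A small sketch of the LOCAL distinction described above: the same source path is loaded with or without LOCAL depending on where the file lives. load_stmt is a hypothetical helper that only prints the statement; the paths and table name are placeholders taken from this thread.

```shell
#!/bin/sh
# load_stmt is a hypothetical helper that prints the LOAD DATA
# statement for a source on the local disk or already in HDFS.
load_stmt() {
    # $1 = "local" or "hdfs", $2 = source path, $3 = table name
    if [ "$1" = "local" ]; then
        printf "LOAD DATA LOCAL INPATH '%s' INTO TABLE %s;\n" "$2" "$3"
    else
        printf "LOAD DATA INPATH '%s' INTO TABLE %s;\n" "$2" "$3"
    fi
}

local_stmt=$(load_stmt local /local_path/file.csv mytable)
hdfs_stmt=$(load_stmt hdfs /path/to/your/HDFS_directory mytable)

echo "$local_stmt"
echo "$hdfs_stmt"

# To run one of them against a cluster (not run here):
#   hive -e "$local_stmt"
```

Without LOCAL the source files are moved out of their HDFS directory into the table's location.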

How to get the hive table output or text file in hdfs on which hive table created to .CSV format.

There is one constraint on the cluster I'm working on: nothing can be taken out of the cluster to a Linux box.
The files the Hive tables are built on are in sequence file format or text format.
I need to convert those files to CSV format without writing them out to a Linux box. Alternatively, I could create a table from the existing table that is STORED AS CSVfile, if that is possible (I'm not sure it is).
I have tried a lot of things, but couldn't do it without writing the output to a Linux box. Any help is appreciated.
You can create another hive table like this:
CREATE TABLE hivetable_csv ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' as
select * from hivetable;
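Then copy the table contents into a single file in a new directory (the source path assumes the default warehouse location for hivetable_csv):
hadoop fs -cat /user/hive/warehouse/hivetable_csv/* | hadoop fs -put - /user/username/hivetable.csv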
Alternatively, you can also try
hadoop fs -cp
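To see what the cat-and-put pipeline does, here is a local stand-in that uses plain files instead of HDFS. A table directory usually holds several part files, and concatenating them yields one CSV. The file names and rows are made-up sample data, and the local cat only mimics hadoop fs -cat ... | hadoop fs -put -.

```shell
#!/bin/sh
# Local stand-in for a Hive table directory with two part files.
# File names and contents are made-up sample data.
work=$(mktemp -d)
mkdir "$work/hivetable_csv"
printf '1,alice\n' > "$work/hivetable_csv/000000_0"
printf '2,bob\n'   > "$work/hivetable_csv/000001_0"

# Concatenate every part file into one CSV, as the
# hadoop fs -cat ... | hadoop fs -put - ... pipeline does inside HDFS.
cat "$work/hivetable_csv"/* > "$work/hivetable.csv"

merged=$(cat "$work/hivetable.csv")
line_count=$(wc -l < "$work/hivetable.csv" | tr -d ' ')
echo "$merged"

# Clean up the temporary directory.
rm -rf "$work"
```

Because the data is streamed through a pipe between two HDFS commands, nothing is written to the local disk in the real pipeline.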
