How to convert a Hive table's output, or the HDFS text file a Hive table is built on, to .CSV format - hadoop

There is one constraint on the cluster I'm working with: nothing can be taken off the cluster onto a Linux box.
The files the Hive tables are built on are in sequence file format or text format.
I need to convert those files to CSV format without pulling them out to a Linux box. If possible, I'd also like to create a table from an existing table that is STORED AS a CSV file (I'm not sure whether that can be done).
I have tried a lot of things, but couldn't do it without outputting to a Linux box. Any help is appreciated.

You can create another hive table like this:
CREATE TABLE hivetable_csv
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
AS SELECT * FROM hivetable;
Then concatenate the new table's files into a single CSV file elsewhere in HDFS:
hadoop fs -cat /user/hive/warehouse/hivetable_csv/* | hadoop fs -put - /user/username/hivetable.csv
Alternatively, you can copy the table's files to another HDFS location with
hadoop fs -cp
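A minimal sketch of that variant, assuming the new table sits in the default Hive warehouse location and an assumed destination directory of /user/username/csv_out:
# copy the CSV-delimited files of hivetable_csv to another HDFS directory (no data leaves the cluster)
hadoop fs -mkdir -p /user/username/csv_out
hadoop fs -cp /user/hive/warehouse/hivetable_csv/* /user/username/csv_out/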

Related

Two separate tables on Hadoop from two files in an HDFS directory

I am trying to build two Hadoop tables from one HDFS directory.
So I'd like table file1 built from 1.tsv and another table file2 built from 2.tsv, but both files are inside one HDFS directory, /tmp/ip.
# create hdfs directory
hadoop fs -mkdir /tmp/ip
# put my two tsv files
hadoop fs -put /tmp/data/1.tsv tmp/ip/
hadoop fs -put /tmp/data/2.tsv tmp/ip/
Now in Hive's CLI
--in Hive CLI to build table
CREATE EXTERNAL TABLE IF NOT EXISTS file1
(id STRING,Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- failed solution because there's two files
LOCATION 'tmp/ip';
-- failed solution but don't understand why
LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1
Regarding failed solution:
-- failed solution but don't understand why LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1 failed.
This fails because of the LOCAL keyword: with LOCAL, Hive looks for the file on the local file system rather than in HDFS. Try it without LOCAL:
LOAD DATA INPATH 'tmp/ip/1.tsv' INTO TABLE file1;
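Since LOCATION points at a directory (and every file in it), a common way to get one table per file is to give each file its own HDFS directory and point each external table at it. A minimal sketch, where the /tmp/ip/file1 and /tmp/ip/file2 directory names are assumptions:
# one directory per table
hadoop fs -mkdir -p /tmp/ip/file1 /tmp/ip/file2
hadoop fs -put /tmp/data/1.tsv /tmp/ip/file1/
hadoop fs -put /tmp/data/2.tsv /tmp/ip/file2/
Then in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS file1 (id STRING, Code STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/tmp/ip/file1';
-- repeat with LOCATION '/tmp/ip/file2' for table file2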

How can I read a Hive table from a .orc file?

I have a .orc file. Is there a way to convert it to a .csv file, or is there another way to read the tables within this file?
Hive has native ORC support, so you can read it directly via Hive.
Illustration:
(Say, the file is named myfile.orc)
Upload file to HDFS
hadoop fs -mkdir hdfs:///my_table_orc_file
hadoop fs -put myfile.orc hdfs:///my_table_orc_file
Create a Hive table on it
(Update column definitions to match the data)
CREATE EXTERNAL TABLE `my_table_orc`(
`col1` string,
`col2` string)
STORED AS ORC
LOCATION
'hdfs:///my_table_orc_file';
Query it
select * from my_table_orc;
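If the goal is specifically a .csv, one way to stay inside HDFS is to write the table back out with comma delimiters. A sketch, where the /tmp/my_table_csv output directory is an assumption and ROW FORMAT on INSERT OVERWRITE DIRECTORY needs a reasonably recent Hive version:
INSERT OVERWRITE DIRECTORY '/tmp/my_table_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM my_table_orc;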
You can also dump the content of an ORC file with the following command:
hive --orcfiledump -d <path_of_orc_file_in_hdfs>
It returns the content as JSON.

Timestamp partitioning in Hive

I am trying to create a timestamp-based partition in Hive, but Hive keeps creating a date-based partition instead. Below is my code. Could someone please help?
cat test1.sh
dat=`date +'%Y%m%d %H:%m:%S'`
hive -f load.hql -hiveconf file_load_timestamp=$dat;
cat load.hql
INSERT OVERWRITE table perm.test partition(file_load_timestamp='${hiveconf:dat}')
SELECT a,b FROM work.temp;
The HDFS path is getting created as dt=20180102/ but I expect it to be created as dt=20180102 103455/.
When I tried the format %Y%m%d_%H:%m:%S it works as expected, but I need a space between the date and the time.
To create a folder name in HDFS with a space in it, the space has to be escaped with \:
hadoop fs -mkdir test\ 123
creates a folder in HDFS named test 123.
Similarly, Hive keeps each partition in a folder named after the partition value. That's why using the date format %Y%m%d\ %H%M%S helps create a folder with the space in it.
Below is tested and working:
INSERT OVERWRITE table person_details1 partition(datelocal='20180102\ 200128') select * from person_details;
datelocal is String
Edit: I executed the code; below is the working version:
hduser@Amit:~$ cat test1.sh
#!/bin/sh
dat=`date +'%Y%m%d\ %H%M%S'`
hive -f load.hql -hiveconf datelocal="$dat";
hduser@Amit:~$ cat load.hql
INSERT OVERWRITE table amit.person_details1 partition(datelocal='${hiveconf:datelocal}') select * from amit.person_details;
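To check that the partition really picked up the space in its name, one can list the partitions in Hive and the underlying directory in HDFS (the warehouse path below assumes the default warehouse layout):
SHOW PARTITIONS amit.person_details1;
hadoop fs -ls /user/hive/warehouse/amit.db/person_details1/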

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
Until now I have used Hive to create the database on Hadoop, and I am accessing it at localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal, as it is giving me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest a way to resolve it?
I assume the data initially sits on the local file system.
So a simple workflow could be: copy the data from the local file system to HDFS, create a Hive table over it, and then load the data into the Hive table.
Step 1:
# put the files into HDFS
$~ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
# check the files
$~ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
-- display table structure
describe mytable;
Step 3:
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
-- simple Hive statement to fetch the top 10 records
SELECT * FROM mytable LIMIT 10;
You should use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into Hive tables.
If you don't specify LOCAL, the load command assumes the given file path is an HDFS location.
Please refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables
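A minimal sketch contrasting the two forms, reusing the mytable example above (the file names are assumptions):
-- file sits on the local file system of the machine running the Hive client
LOAD DATA LOCAL INPATH '/local_path/records.csv' INTO TABLE mytable;
-- file already sits in HDFS; it is moved into the table's directory
LOAD DATA INPATH '/path/to/your/HDFS_directory/records.csv' INTO TABLE mytable;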

Data retention in Hadoop HDFS

We have a Hadoop cluster with over 100TB data in HDFS. I want to delete data older than 13 weeks in certain Hive tables.
Are there any tools or way I can achieve this?
Thank you
To delete data older than a certain time frame, you have a few options.
First, if the Hive table is partitioned by date, you could simply DROP the partitions within Hive and remove their underlying directories.
Second option would be to run an INSERT to a new table, filtering out the old data using a datestamp (if available). This is likely not a good option since you have 100TB of data.
A third option would be to recursively list the data directories for your Hive tables: hadoop fs -lsr /path/to/hive/table. This outputs a list of the files and their modification dates. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than you want to keep, run hadoop fs -rm <file> on it.
A fourth option would be to grab a copy of the FSImage: curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image. Then turn it into a text file: hadoop oiv -i hdfs.image -o hdfs.txt. The text file will contain a text representation of HDFS, the same as what hadoop fs -ls ... would return.
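As a sketch of the first option (the table name, the dt partition column, and the retention cutoff are assumptions for illustration):
# compute the cutoff (13 weeks ago) and drop older date partitions
cutoff=$(date -d '13 weeks ago' +%Y%m%d)
hive -e "ALTER TABLE mydb.mytable DROP IF EXISTS PARTITION (dt < '${cutoff}');"
# for EXTERNAL tables this removes only the metadata, so the underlying
# directories still have to be deleted, e.g. hadoop fs -rm -r /path/to/hive/table/dt=20170101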
