How can I read a Hive table in a .orc file? - hadoop

I have a .orc file. Is there a way to convert it to a .csv file, or is there another way to read the tables within this file?

Hive has native ORC support, so you can read it directly via Hive.
Illustration:
(Say, the file is named myfile.orc)
Upload file to HDFS
hadoop fs -mkdir hdfs:///my_table_orc_file
hadoop fs -put myfile.orc hdfs:///my_table_orc_file
Create a Hive table on it
(Update column definitions to match the data)
CREATE EXTERNAL TABLE `my_table_orc` (
  `col1` string,
  `col2` string)
STORED AS ORC
LOCATION 'hdfs:///my_table_orc_file';
Query it
select * from my_table_orc;
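If the goal is an actual .csv file rather than just querying, one option is to export the query result with Hive itself. A minimal sketch, assuming the two-column table above and a hypothetical local output directory /tmp/my_table_csv (ROW FORMAT on INSERT OVERWRITE DIRECTORY needs Hive 0.11 or later, and the result is written as one or more files inside that directory, not a single .csv):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/my_table_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM my_table_orc;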

You can read the content of an ORC file using the following command:
hive --orcfiledump -d <path_of_orc_file_in_hdfs>
It will return the content as JSON.
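Without -d, the same tool dumps only the file's metadata (schema, stripe and column statistics), which can help you work out the column definitions for the CREATE TABLE above. A sketch reusing the path from the illustration:
hive --orcfiledump hdfs:///my_table_orc_file/myfile.orc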

Related

Two separate tables on Hadoop from two files in one HDFS directory

I am trying to build two Hadoop tables from one HDFS directory.
I'd like a table file1 from file 1.tsv and another table file2 from file 2.tsv, but both are inside the same HDFS directory, /tmp/ip.
# create hdfs directory
hadoop fs -mkdir /tmp/ip
# put my two tsv files
hadoop fs -put /tmp/data/1.tsv tmp/ip/
hadoop fs -put /tmp/data/2.tsv tmp/ip/
Now in Hive's CLI
--in Hive CLI to build table
CREATE EXTERNAL TABLE IF NOT EXISTS file1
(id STRING,Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
-- failed solution because there's two files
LOCATION 'tmp/ip';
-- failed solution but don't understand why
LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1
Regarding the failed solution:
LOAD DATA LOCAL INPATH 'tmp/ip/1.tsv' INTO TABLE file1
This fails because of the LOCAL keyword: with it, Hive looks for the file on the local file system. Try it without LOCAL:
LOAD DATA INPATH 'tmp/ip/1.tsv' INTO TABLE file1;
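For completeness, a minimal sketch of the full sequence, under the assumption that both .tsv files share the same two columns as file1 and that file1 was created without the LOCATION clause:
CREATE EXTERNAL TABLE IF NOT EXISTS file2
(id STRING, Code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH 'tmp/ip/1.tsv' INTO TABLE file1;
LOAD DATA INPATH 'tmp/ip/2.tsv' INTO TABLE file2;
Note that LOAD DATA INPATH moves each file out of tmp/ip into the corresponding table's warehouse directory.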

When we load data into a Hive table from HDFS, it deletes the file from the source directory (HDFS)

When we load data into a Hive table from HDFS, it deletes the file from the source directory (HDFS). Is there a way to keep the file in the source directory and still load the data into the Hive table?
I used the query below:
LOAD DATA INPATH 'source_file_path' INTO TABLE TABLENAME;
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Use the hadoop fs -cp or hdfs dfs -cp command to copy (not move) the files:
hadoop fs -cp [source_file_path] [table_location_path]
or
hdfs dfs -cp [source_file_path] [table_location_path]
Use the describe formatted <tablename> command to check the table's location path.
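Putting those together, a sketch with a hypothetical table sales and a hypothetical source file /data/incoming/sales.csv:
hive> describe formatted sales;    (note the Location value, say hdfs:///user/hive/warehouse/sales)
$ hadoop fs -cp /data/incoming/sales.csv /user/hive/warehouse/sales/
$ hadoop fs -ls /data/incoming/    (the source file is still there after the copy)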

Error Copying data from HDFS to External Table In Hive

I am trying to insert data from HDFS into an external table in Hive, but I am getting the error below.
Error :
Usage: java FsShell [-put <localsrc> ... <dst>]
Command failed with exit code = 255
Command
hive> !hadoop fs -put /myfolder/logs/pv_ext/2013/08/11/log/data/Sacramentorealestatetransactions.csv
> ;
Edited:
file location : /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv
table location : hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
I am in Hive, executing the command
!hadoop fs -put /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data
and getting the error:
put: File /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv does not exist.
Command failed with exit code = 255
Please share your suggestion.
Thanks
Here are two methods to load data into the external Hive table.
Method 1:
a) Get the location of the HDFS folder for the Hive external table.
hive> desc formatted mytable;
b) Note the value of the Location property in the output. Say it is hdfs:///hive-data/mydata
c) Then, put the file from local disk to HDFS
$ hadoop fs -put /location/of/data/file.csv hdfs:///hive-data/mydata
Method 2:
a) Load data via this Hive command
hive > LOAD DATA LOCAL INPATH '/location/of/data/file.csv' INTO TABLE mytable;
One more method: change the Hive table location.
alter table table_name set location='hdfs://your_data/folder';
This method may work better for you.
You need to create a table in Hive:
hive> CREATE EXTERNAL TABLE IF NOT EXISTS mytable(myid INT, a1 STRING, a2 STRING....)
row format delimited fields terminated by '\t' stored as textfile LOCATION
'hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data';
Load data from HDFS into the Hive table:
hive> LOAD DATA INPATH '/yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv' INTO TABLE mytable;
NOTE: If you load data from HDFS into Hive with INPATH, the data will be moved from the HDFS
location into Hive, so it won't be available at the original HDFS location afterwards.
Check if the data loaded successfully.
hive> SELECT * FROM mytable;
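Also worth noting: hadoop fs -put expects a local source path (as the usage message in the error shows), so if Sacramentorealestatetransactions.csv is already sitting in HDFS, which the "does not exist" message suggests, an HDFS-to-HDFS copy is what is needed. A sketch using the paths from the question:
hadoop fs -cp /yapstone/logs/pv_ext/somedatafor_7_11/Sacramentorealestatetransactions.csv hdfs://sandbox:8020/yapstone/logs/pv_ext/2013/08/11/log/data/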

Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone please suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
Until now I have used Hive to create the database on Hadoop, and I am accessing it at localhost:50070. But I am unable to load data from a CSV file into Hadoop from the terminal, as it gives me this error:
FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs
Can anyone suggest a way to resolve it?
I suppose the data is initially in the local file system.
So a simple workflow could be: load the data from the local file system into the Hadoop file system (HDFS), create a Hive table over it, and then load the data into the Hive table.
Step 1:
// put in HDFS
$~ hadoop fs -put /local_path/file_pattern* /path/to/your/HDFS_directory
// check files
$~ hadoop fs -ls /path/to/your/HDFS_directory
Step 2:
CREATE EXTERNAL TABLE if not exists mytable (
Year int,
name string
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as TEXTFILE;
// display table structure
describe mytable;
Step 3:
LOAD DATA INPATH '/path/to/your/HDFS_directory'
OVERWRITE INTO TABLE mytable;
// simple hive statement to fetch top 10 records
SELECT * FROM mytable limit 10;
You should use LOAD DATA LOCAL INPATH <local-file-path> to load files from a local directory into Hive tables.
If you don't specify LOCAL, the load command will look for the given file path in HDFS instead.
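For example, if the CSV actually sits on the local disk, the load could look like this (the directory is taken from the error message in the question; the file name is hypothetical):
LOAD DATA LOCAL INPATH '/usr/local/hadoop/share/hadoop/hdfs/mydata.csv' INTO TABLE mytable;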
Please refer to the link below:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Loadingfilesintotables

How to get the Hive table output, or the text file in HDFS on which the Hive table is created, into .CSV format

There is one condition with the cluster I'm working on: nothing can be taken off the cluster to a Linux box.
The files on which the Hive tables are built are in sequence file or text format.
I need to change those files to CSV format without outputting them to the Linux box, and, if possible, create a table from an existing table that can be STORED AS a CSV file (I'm not sure if I can do that).
I have tried a lot of things, but couldn't do it without outputting to the Linux box. Any help is appreciated.
You can create another hive table like this:
CREATE TABLE hivetable_csv ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' as
select * from hivetable;
Then merge the table's files into a single CSV file elsewhere in HDFS:
hadoop fs -cat /user/hive/warehouse/hivetable_csv/* | hadoop fs -put - /user/username/hivetable.csv
Alternatively, you can also try
hadoop fs -cp
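Another option that keeps everything inside HDFS is to write the query result straight to an HDFS directory as comma-delimited text. A sketch (ROW FORMAT on INSERT OVERWRITE DIRECTORY needs Hive 0.11 or later; the output directory is hypothetical):
INSERT OVERWRITE DIRECTORY '/user/username/hivetable_csv_out'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM hivetable;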
