Load several files into HIVE table - hadoop

Look I'm trying to analyze too many files into just one HIVE table. Key insights, I'm working with json files and the tables structure is :
CREATE EXTERNAL TABLE test1
(
STATIONS ARRAY<STRING>,
SCHEMESUSPENDED STRING,
TIMELOAD TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/andres/hive/bixihistorical/';
I need to load around 50 files with the same structure all of them. I have tried things like:
LOAD DATA INPATH '/user/andres/datasets/bixi2017/*.json'
OVERWRITE INTO TABLE test1;
LOAD DATA INPATH '/user/andres/datasets/bixi2017/*'
OVERWRITE INTO TABLE test1;
LOAD DATA INPATH '/user/andres/datasets/bixi2017/'
OVERWRITE INTO TABLE test1;
Any of those above have worked, any idea guys about how should I go thru?
thanks so much

Make sure folder contains only that files which needs to be loaded into Hive table.
CREATE EXTERNAL TABLE test1
(
STATIONS ARRAY<STRING>,
SCHEMESUSPENDED STRING,
TIMELOAD TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/andres/hive/bixihistorical/';
LOAD DATA INPATH '/user/andres/datasets/bixi2017/'
OVERWRITE INTO TABLE test1;

I'm so so .... Well, I just remember that you can create just an external table stored in the same folder all files with the same structure are located. So , in that way I will load whole records in one shoot.
> CREATE EXTERNAL TABLE bixi_his
> (
> STATIONS ARRAY<STRUCT<id: INT,s:STRING,n:string,st:string,b:string,su:string,m:string,lu:string,lc:string,bk:string,bl:string,la:float,lo:float,da:int,dx:int,ba:int,bx:int>>,
> SCHEMESUSPENDED STRING,
> TIMELOAD BIGINT
> )
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> LOCATION '/user/ingenieroandresangel/datasets/bixi2017/';
thanks

Related

Hive: Does hive support partitioning and bucketing while usiing external tables

On using PARTITIONED BY or CLUSTERED BY keywords while creating Hive tables,
hive would create separate files corresponding to each partition or bucket. But for external tables is this still valid. As my understanding is data files corresponding to external files are not managed by hive. So does hive create additional files corresponding to each partition or bucket and move corresponding data in to these files.
Edit - Adding details.
Few extracts from "Hadoop: Definitive Guide" - "Chapter 17: Hive"
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested sub directories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
The above table was obviously a managed table, so hive had the ownership of data and created a directory structure for each partition as in the above tree structure.
In case of external table
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by same set of load operations -
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will hive handle these partitions. As for external tables with out partition, hive will simply point to the data file and fetch any query result by parsing the data file. But in case of loading data in to a partitioned external table, where are the partitions created.
Hope fully in hive warehouse? Can someone support or clarify this?
Suppose partitioning on date as this is a common thing to do.
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
, date String
)
PARTITIONED BY (date String)
LOCATION '/user/location/wanted/';
Then add all your partitions;
ALTER TABLE mytable ADD PARTITION( date = '2017-07-27' );
ALTER TABLE mytable ADD PARTITION( date = '2017-07-28' );
So on and so forth.
Finally you can add your data in the proper location. You will have an external partitioned file.
There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next you have to run a MSCK command (metastore consistency check)
msck repair table database.table
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.

Insert xml file on hdfs to Hive Parquet Table

tI have a gzipped 3GBs xml file that I want to map to Hive parquet table.
I'm using xml serde for parsing that file to temporary external table and than I'm using INSERT to insert this data to hive parquet table (I want this data to by placed on Hive table, not create interface to xml file on HDFS).
I came up with this script:
CREATE TEMPORARY EXTERNAL TABLE temp_table (someData1 INT, someData2 STRING, someData3 ARRAY<STRING>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.someData1" ="someXpath1/text()",
"column.xpath.someData2"="someXpath2/text()",
"column.xpath.someData3"="someXpath3/text()",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 'hdfs://locationToGzippedXmlFile'
TBLPROPERTIES (
"xmlinput.start"="<MyItem>",
"xmlinput.end"="</MyItem>"
);
CREATE TABLE parquet_table
STORED AS Parquet
AS select * from temp_table
Main point of this is that I want to have optimized way to access the data. I don't want to parse xml every query instead parse whole file once and put the result into parquet table. And running the script above is taking unlimited amount of time additionally in log's i can see that only 1 mapper is used.
I don't really know if it's the correct approach (maybe it's possible to do that with partitions?)
BTW I'm using Hue with cloudera.

Hive partition folder changes after import to another table

I have a Hive table TEST with this configuration:
create external table if not exists TEST (
ID bigint,
ACTIVITY_ID string,
BATCH_NBR
)
PARTITIONED BY (year INT, month INT, day INT)
CLUSTERED BY (BATCH_NBR) into 20 buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/lake/hive/test';
And I have Hive files in this location which I can easily load into Hive table and it works.
/user/lake/hive/test/2013/01/01/part-r-00001
Now if I create another table STORE and insert some data from this TEST table, folder structures are getting changes for the Test table. I was expecting after loading the same data, location for the STORE table will have something like this:
/user/core/store/2014/07/03/batch123231.1313
But the above location changed to this:
/user/core/store/year=2013/month=01/day=01/
I'm using insert overwrite table STORE select * from TEST; query for loading STORE table from TEST.
How can I load that table and preserve the same folder structure in destination?
Internal table in Hive will follow their own/default folder structure in /apps/hive/warehouse folder and will not preserve folder structure if the data is loaded from an external Hive table. I was using internal table for "Store", so it was not working as expected.

Loading Data from a .txt file to Table Stored as ORC in Hive

I have a data file which is in .txt format. I am using the file to load data into Hive tables. When I load the file in a table like
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS TEXTFILE;
the data is loaded correctly using
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
and I can run a SELECT * FROM test_details_txt; on the table in Hive.
However If I try to load the data in a table that is
CREATE TABLE test_details_txt(
visit_id INT,
store_id SMALLINT) STORED AS ORC;
I receive the following error on trying to run a SELECT:
Failed with exception java.io.IOException:java.io.IOException: Malformed ORC file hdfs://master:6000/user/hive/warehouse/test.db/transaction_details/test_details.txt. Invalid postscript.
While loading the data using above LOAD statement I do not receive any error or exception.
Is there anything else that needs to be done while using the LOAD DATA IN PATH.. command to store data into an ORC table?
LOAD DATA just copies the files to hive datafiles. Hive does not do any transformation while loading data into tables.
So, in this case the input file /home/user/test_details.txt needs to be in ORC format if you are loading it into an ORC table.
A possible workaround is to create a temporary table with STORED AS TEXT, then LOAD DATA into it, and then copy data from this table to the ORC table.
Here is an example:
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
-- Copy to ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;
Steps:
First create a table using stored as TEXTFILE  (i.e default or in
whichever format you want to create table)
Load data into text table.
Create table using stored as ORC as select * from text_table;
Select * from orc table.
Example:
CREATE TABLE text_table(line STRING);
LOAD DATA 'path_of_file' OVERWRITE INTO text_table;
CREATE TABLE orc_table STORED AS ORC AS SELECT * FROM text_table;
SELECT * FROM orc_table; /*(it can now be read)*/
Since Hive does not do any transformation to our input data, the format needs to be the same: either the file should be in ORC format, or we can load data from a text file to a text table in Hive.
ORC file is a binary file format, so you can not directly load text files into ORC tables.
ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%. As a result the speed of data processing also increases. ORC shows better performance than Text, Sequence and RC file formats.
An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.
First you need to create one normal table as textFile, load your data into the textFile table and then you can use insert overwrite query to write your data into ORC file.
create table table_name1 (schema of the table) row format delimited by ',' | stored as TEXTFILE
create table table_name2 (schema of the table) row format delimited by ',' | stored as ORC
load data local inpath ‘path of your file’ into table table_name1;(loading data from a local system)
INSERT OVERWRITE TABLE table_name2 SELECT * FROM table_name1;
Now all your data will be stored in an ORC file.
The similar procedure is applied to all the binary file formats i.e., Sequence files, RC files and Parquet files in Hive.
You can refer to the below link for more details.
https://acadgild.com/blog/file-formats-in-apache-hive/
Steps to load data into ORC file format in hive
1.Create one normal table using textFile format
2.Load the data normally into this table
3.Create one table with the schema of the expected results of your normal hive table using stored as orcfile
4.Insert overwrite query to copy the data from textFile table to orcfile table
Refer the blog to learn the handson of how to load data into all file formats in hive
Load data into all file formats in hive

Hive loading in partitioned table

I have a log file in HDFS, values are delimited by comma. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file to Hive table which has columns "timestamp","action" and partitioned by "userid","deviceid". How can I ask Hive to take that last 2 columns in log file as partition for table? All examples e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');" require definition of partitions in the script, but I want partitions to set up automatically from HDFS file.
The one solution is to create intermediate non-partitioned table with all that 4 columns, populate it from file and then make an INSERT into first_table PARTITION (userid,deviceid) select from intermediate_table timestamp,action,userid,deviceid; but that is and additional task and we will have 2 very similiar tables.. Or we should create external table as intermediate.
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
Load data simply copies data, it doesn't read it so it cannot figure out what to partition
Would suggest that you load data into an intermediate table first (or using an external table pointing to all the files) and then letting partition dynamic insert to kick in to load it into a partitioned table
As mentioned in #Denny Lee's answer, we need to involve a staging table(invites_stg)
managed or external and then INSERT from staging table to partitioned table(invites in this case).
Make sure we have these two properties set to:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally insert to invites,
INSERT OVERWRITE TABLE India PARTITION (STATE) SELECT COL's FROM invites_stg;
Refer this link for help: http://www.edupristine.com/blog/hive-partitions-example
I worked this very same scenario, but instead, what we did is create separate HDFS data files for each partition you need to load.
Since our data is coming from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into their corresponding partition file. Afterwards, it is just a matter of building the script using the Partition from the HDFS file name.
How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
5. Instruct hive to dynamically load partitions
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Resources