Mapping HDFS directory with .tsv files to Hive

I have data in HDFS in .tsv format and need to load it into a Hive table. I need some help.
The data in HDFS is laid out like this:
/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03
Note: Data is loaded into the HDFS directory /ad_data/raw/reg_logs daily and hourly.
There are 3 .tsv files in this HDFS directory:
funel1.tsv
funel2.tsv
funel3.tsv
Each .tsv file has 3 tab-separated columns, with data like:
2344 -39 223
2344 -23 443
2394 -43 98
2377 -12 33
...
...
I want to create a Hive table with 3 columns, id int, region_code int and count int, exactly as in HDFS. If possible I would also like to remove the negative sign in the Hive table, but that is not a big deal.
I created a Hive table with this schema (please correct me if I am wrong):
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/raw/reg_logs';
All I want to do is get the data from HDFS into Hive. I do not want to use "load data inpath '..' into table reg_logs" because I do not want to load the data manually every day. I just want to point the Hive table at the HDFS directory so that it picks up each day's data automatically.
How can I achieve this? Please correct my Hive table schema if needed, and tell me how to get the data there.
==
2nd part:
I want to create another table, reg_logs_org, which would be populated from reg_logs. I need to put everything from reg_logs into reg_logs_org except the hour column.
The schema I created is:
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs_org (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/reg_logs_org';
Insert data into reg_logs_org from reg_logs:
insert overwrite table reg_logs_org
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code
error message:
FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'reg_logs_org'
==
Thank you,
Rio

You're very close. The last step is that you need to add the partition information to Hive's metastore. Hive stores the location of every partition individually, and it does not automatically find new partitions. There are two ways to add the partitions:
Every hour, do an add partition statement:
alter table reg_logs add partition(utc_date='2014-06-11', utc_hour='03')
location '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03';
Every hour (or less frequently) do a table repair. This scans the root table location for any partitions it has not yet added.
msck repair table reg_logs;
The first approach is a bit more painful, but more efficient. The second approach is easy, but does a full scan of all partitions every time.
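A minimal sketch of the first approach, assuming the hourly directory already exists in HDFS when the statement runs. IF NOT EXISTS makes the statement safe to re-run from a scheduled job; how the job is scheduled (cron, Oozie, or similar) is an assumption about your setup and is left out here.

```sql
-- Register one hourly partition; IF NOT EXISTS makes re-runs harmless.
ALTER TABLE reg_logs ADD IF NOT EXISTS
  PARTITION (utc_date='2014-06-11', utc_hour='03')
  LOCATION '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03';

-- Confirm which partitions the metastore now knows about.
SHOW PARTITIONS reg_logs;

-- Filtering on the partition columns limits the read to that hour's directory.
SELECT id, region_code, `count`
FROM reg_logs
WHERE utc_date = '2014-06-11' AND utc_hour = '03'
LIMIT 10;
```

Until a directory has been registered this way (or picked up by msck repair), queries simply return no rows for that hour.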
Edit: second half of question:
You just need to add some syntax for inserting into a table using dynamic partitions. In general, it is:
insert overwrite table [table] partition([partition column])
select ...
Or in your case:
insert overwrite table reg_logs_org partition(utc_date)
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code
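A hedged, fuller sketch of the same insert. Because utc_date is the only partition column and it is dynamic, Hive's strict mode would reject the statement, so the two settings below are usually needed. The abs() call is optional and is only there because the question asked about dropping the negative sign on region_code; the backticks around count are a precaution, since it is also a function name.

```sql
-- Allow inserts in which every partition column is dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column (utc_date) must be the last column selected.
INSERT OVERWRITE TABLE reg_logs_org PARTITION (utc_date)
SELECT id,
       abs(region_code) AS region_code,  -- optional: strips the negative sign
       sum(`count`)     AS `count`,
       utc_date
FROM reg_logs
GROUP BY utc_date, id, abs(region_code);
```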

Related

Hive: Does Hive support partitioning and bucketing while using external tables

When the PARTITIONED BY or CLUSTERED BY keywords are used while creating Hive tables, Hive creates separate files corresponding to each partition or bucket. But is this still valid for external tables? My understanding is that the data files of external tables are not managed by Hive. So does Hive create additional files for each partition or bucket and move the corresponding data into these files?
Edit - Adding details.
A few extracts from "Hadoop: The Definitive Guide", "Chapter 17: Hive":
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
When we load data into a partitioned table, the partition values are specified explicitly:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
At the filesystem level, partitions are simply nested sub directories of the table directory.
After loading a few more files into the logs table, the directory structure might look like this:
The above table was obviously a managed table, so hive had the ownership of data and created a directory structure for each partition as in the above tree structure.
In case of external table
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
Followed by same set of load operations -
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
How will Hive handle these partitions? For external tables without partitions, Hive simply points to the data file and answers any query by parsing it. But when loading data into a partitioned external table, where are the partitions created?
Hopefully in the Hive warehouse? Can someone confirm or clarify this?

Suppose you partition on date, as this is a common thing to do.
-- date is the partition column, so it must not also appear in the column list.
-- It is a reserved word in recent Hive versions, hence the backticks.
CREATE EXTERNAL TABLE mydatabase.mytable (
var1 double
, var2 INT
)
PARTITIONED BY (`date` String)
LOCATION '/user/location/wanted/';
Then add all your partitions:
ALTER TABLE mydatabase.mytable ADD PARTITION (`date` = '2017-07-27');
ALTER TABLE mydatabase.mytable ADD PARTITION (`date` = '2017-07-28');
And so on.
Finally you can add your data in the proper location (for example /user/location/wanted/date=2017-07-27/). You will have an external partitioned table.

There is an easy way to do this.
Create your External Hive table first.
CREATE EXTERNAL TABLE database.table (
id integer,
name string
)
PARTITIONED BY (country String)
LOCATION 'xxxx';
Next you have to run a MSCK command (metastore consistency check)
msck repair table database.table
This command will recover all partitions that are available in your path and update the metastore. Now, if you run your query against your table, data from all partitions will be retrieved.

hive first column to consider in partition table

When creating a partitioned table in Hive, is it mandatory to always choose the last column as the partition column?
If I choose the 1st column as the partition column, I can't filter the data. Is there any way to use the first column for partitioning?

In Hive, if you want to partition a table, you have to define the partition column when you create the table. Then, while populating the table, you need to specify the partition as follows:
"INSERT INTO partitioned_table PARTITION(status) SELECT id , name, status from temp_tbl "
This way you can partition based on the last column only. If you want to partition on the basis of the first column, you have to write a MapReduce job for that. That is the only option available.

I guess the problem you are facing is that you already have a table "source" in your local system or HDFS and you want to upload it to a partitioned table, with the first column of the source as the Hive partition column. As the source file does not have headers, I guess we cannot do anything if we try to upload the file directly into the Hive destination folder. The only alternative I know is to create a non-partitioned table in Hive whose structure is exactly the same as the source file, upload the source data to the non-partitioned table first, and then copy the data from the non-partitioned table to the partitioned table.
Suppose the partitioned table is like this:
create table source(eid int, ename string, esal int) partitioned by (dept string);
Your non-partitioned table, where you upload the data, is like this:
create table nopart(dept string, esal int, ename string, eid int);
Then you use a dynamic partition insert:
insert overwrite table source partition(dept) select eid,ename,esal,dept from nopart;
The order of the columns in the select is the only point here: the partition column goes last.

creating partition in external table in hive

I have successfully created and added dynamic partitions in an internal table in Hive, using the following steps:
1- created a source table
2- loaded data from local into the source table
3- created another table with partitions - partition_table
4- inserted the data into this table from the source table, which created all the partitions dynamically
My question is: how do I do this with an external table? I have read many articles on this, but I am confused: do I have to specify the path to the already existing partitions when creating partitions for an external table?
example:
Step 1:
create external table table1 ( name string, age int, height int)
location 'path/to/dataFile/in/HDFS';
Step 2:
alter table table1 add partition(age)
location 'path/to/already/existing/partition'
I am not sure how to proceed with partitioning external tables. Can somebody please help with a step-by-step description?
Thanks in advance!

Yes, you have to tell Hive explicitly which field is your partition field.
Suppose you have the following HDFS directory on which you want to create an external table:
/path/to/dataFile/
Let's say this directory already has data stored (partitioned) by department as follows:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
Each of these directories has a bunch of files, where each file contains the actual comma-separated data for the fields name, age and height,
e.g.
/path/to/dataFile/dept1/file1.txt
/path/to/dataFile/dept1/file2.txt
Now let's create external table on this:
Step 1. Create external table:
CREATE EXTERNAL TABLE testdb.table1(name string, age int, height int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
Step 2. Add partitions:
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
Done. Run a select query once to verify that the data loaded successfully.
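One possible way to do that check, as a sketch: list the partitions the metastore has registered, then count rows per department. It assumes the three ALTER TABLE statements above have been run.

```sql
-- Should list dept=dept1, dept=dept2 and dept=dept3.
SHOW PARTITIONS testdb.table1;

-- Row count per partition; a missing or zero-row department points at a
-- wrong LOCATION or an empty directory.
SELECT dept, count(*) AS row_count
FROM testdb.table1
GROUP BY dept;

-- Spot-check that the fields are parsed into the right columns.
SELECT name, age, height
FROM testdb.table1
WHERE dept = 'dept1'
LIMIT 5;
```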

1. Set the properties below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
2. Create the external partitioned table:
create external table table1 (name string, height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
3. Insert data into the partitioned table from the source table.
Basically, the process is the same. It's just that you create an external partitioned table and provide the HDFS path under which Hive will create and store the partitions.
Hope this helps.

The proper way to do it.
Create the table and mention it is partitioned.
create external table table1 (name string, height int)
partitioned by (age int)
stored as ****(your format)
location 'path/to/dataFile/in/HDFS';
Now you have to refresh the partitions in the hive metastore.
msck repair table table1
This will take care of loading all your partitions into the hive metastore.
You can use msck repair table at any point during your process to have the metastore updated.

Follow the below steps:
Create a temporary/source table:
create table source_table(name string,age int,height int) row format delimited fields terminated by ',';
Use the delimiter from your file instead of ','.
Load data into the source table:
load data inpath 'path/to/dataFile/in/HDFS' into table source_table;
Create the external table with a partition:
create external table external_dynamic(name string,height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
Set dynamic partition mode to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict;
Load data into the partitioned external table from the source table (the partition column must come last in the select):
insert into table external_dynamic partition(age)
select name, height, age from source_table;
That's it.
You can check the partitions information using
show partitions external_dynamic;
You can even check if it is an external table or not using
describe formatted external_dynamic;
An external table is a type of Hive table whose data is not moved to the Hive warehouse. That means that even if you drop the table, the data still persists and you will always get the latest data, which is not the case with a managed table.

Is it possible to have multiple hive tables represented within the same HDFS directory structure?

Is it possible to have multiple hive tables represented within the same HDFS directory structure? In other words, is there a way to have multiple hive tables pointing to same/overlapping HDFS paths?
Here is my situation:
I have a table named "mytable", located in hdfs:/tables/mytable
CREATE EXTERNAL TABLE mytable
(
id int,
...
[a whole bunch of columns]
...
)
PARTITIONED BY (logname STRING)
STORED AS [I-do-not-know-what-just-yet]
LOCATION 'hdfs:/tables/mytable';
So, HDFS will look like:
hdfs:/tables/mytable/logname=tarzan/....
hdfs:/tables/mytable/logname=jane/....
hdfs:/tables/mytable/logname=whoa/....
Is it possible to have a hive table, named "tarzan", located in hdfs:/tables/mytable/logname=tarzan ? Same with hive table "jane", located in hdfs:/tables/mytable/logname=jane, etc.
The tarzan, jane, whoa, etc sub-tables share some columns (timestamp, ip_address, country, user_id, and some others), but there will also be a lot of columns that they do not have in common.
Is there a way to store this data once in HDFS, and use it for multiple tables as I described above? Furthermore, is there a way to store the data in an efficient way, since many of the tables will have columns that are not in common? Would a file format like RCFILE or PARQUET work in this case?
Thanks so much for any hints or help anyone can provide,

Yes, we can have multiple hive tables with the same underlying HDFS directory.
Example:
Create table emp and load data file file3 into it.
create table emp (id int, name string, salary int)
row format delimited
fields terminated by ',';
-- default location would be used
load data
local inpath '/home/parv/testfiles/file3'
into table emp;
Create another table, mirror, whose location is emp's directory. When you select data from the mirror table, it will be the same as that of the emp table (contents of file3).
create table mirror (id int, name string, salary int)
row format delimited
fields terminated by ','
location 'hdfs:///user/hive/warehouse/parv.db/emp';
Load data into the mirror table. When you select data from either the mirror table or the emp table, it will return the same results (contents of file3 and file4).
load data
local inpath '/home/parv/testfiles/file4'
into table mirror;
Conclusion:
Same data files are shared among both tables emp and mirror.
Note that the HDFS filesystem only shows a data directory for the emp table and not for the mirror table, because mirror's location points at emp's directory. Both tables are present in Hive and so can be queried.
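A variation worth considering, sketched under assumptions: since both tables above are managed, dropping either one also deletes the shared files. Declaring the second table EXTERNAL avoids that. The table name emp_mirror is hypothetical, and the path assumes emp lives in its default directory inside a database called parv.

```sql
-- Second table over the same files, declared EXTERNAL so Hive does not own them.
CREATE EXTERNAL TABLE IF NOT EXISTS emp_mirror (id int, name string, salary int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 'hdfs:///user/hive/warehouse/parv.db/emp';

-- Both queries read the same underlying files.
SELECT count(*) FROM emp;
SELECT count(*) FROM emp_mirror;

-- Removes only the metastore entry; emp's files are left in place.
DROP TABLE emp_mirror;
```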

Answering my own question:
It IS possible to have multiple hive tables represented by the same HDFS directory structure, but for what I am looking to do:
A mytable table partitioned by logname (logname=tarzan, logname=jane, etc...)
A separate table for each logname: A "tarzan" table with only columns used by the tarzan table, and not any other logname, same for the "jane" table, etc
Only represent the data one time in HDFS
A better solution is to have the 1 mytable table, partitioned by logname, AND create views for each logname table, with only the subset of columns needed in each.
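A minimal sketch of one such view. The column list is a guess based on the shared columns named in the question (user_id, ip_address, country); mytable's real columns are elided there, so treat every column name here as a placeholder.

```sql
-- One view per logname, exposing only the columns that logname uses.
CREATE VIEW IF NOT EXISTS tarzan AS
SELECT id, user_id, ip_address, country
FROM mytable
WHERE logname = 'tarzan';

-- Queries against the view read only the logname=tarzan partition,
-- and the data is still stored once, under mytable.
SELECT country, count(*) AS events
FROM tarzan
GROUP BY country;
```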

Yes, you could point multiple tables to the same location on HDFS. However, Hive doesn't support dynamic columns.
Is there a reason you can't just have 3 different tables? This would allow you to have different schemas (columns) for each.
--Brandon

Hive loading in partitioned table

I have a log file in HDFS, values are delimited by comma. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file into a Hive table which has the columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last 2 columns in the log file as the partitions for the table? All examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');", require the partitions to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
One solution is to create an intermediate non-partitioned table with all 4 columns, populate it from the file and then do an INSERT into first_table PARTITION (userid,deviceid) select timestamp,action,userid,deviceid from intermediate_table; but that is an additional task and we will have 2 very similar tables. Or should we create an external table as the intermediate?

Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
Load data simply copies data; it doesn't read it, so it cannot figure out what to partition.
I would suggest that you load the data into an intermediate table first (or use an external table pointing to all the files) and then let a dynamic partition insert kick in to load it into the partitioned table.
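The suggestion can be sketched as follows for the log format in the question. The table names logs_stg and logs_part and the staging path are hypothetical, and the first column is called ts here to avoid the reserved word timestamp.

```sql
-- Staging table: external, so the log files stay where they are.
CREATE EXTERNAL TABLE IF NOT EXISTS logs_stg (
  ts       STRING,
  action   STRING,
  userid   STRING,
  deviceid STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/myname/logs_stg';  -- hypothetical directory holding the log file

-- Target table: partition columns are declared separately from data columns.
CREATE TABLE IF NOT EXISTS logs_part (
  ts     STRING,
  action STRING
)
PARTITIONED BY (userid STRING, deviceid STRING);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns go last, in the order given in the PARTITION clause.
INSERT OVERWRITE TABLE logs_part PARTITION (userid, deviceid)
SELECT ts, action, userid, deviceid
FROM logs_stg;
```

Partitioning by user and device can produce a very large number of partitions; if the insert fails on a partition limit, hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode are the settings to look at.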

As mentioned in @Denny Lee's answer, we need to involve a staging table (invites_stg), managed or external, and then INSERT from the staging table into the partitioned table (invites in this case).
Make sure these two properties are set:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally insert into invites:
INSERT OVERWRITE TABLE invites PARTITION (ds) SELECT <columns>, ds FROM invites_stg;
Refer this link for help: http://www.edupristine.com/blog/hive-partitions-example

I worked on this very same scenario, but what we did instead was create separate HDFS data files for each partition we needed to load.
Since our data is coming from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into their corresponding partition file. Afterwards, it is just a matter of building the script using the Partition from the HDFS file name.

How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');

CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
5. Instruct hive to dynamically load partitions
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
