Hive-based data warehousing task - add sequence number to records - hadoop

I have a use case wherein I need to implement SQL-based data warehousing activities using Hive.
The software generates a bunch of csv files. When they are transformed into a SQL table, a unique id called a session id is assigned to each csv file as it is loaded. Say each csv file has 3 columns; the SQL table then has four columns, where the first column represents the session id. That is, values from the first csv file are written into the SQL table with session id '1', values from the second csv file are appended with session id '2', and so on.
In Hive,
I stored these csv files in an HDFS directory and want to create one Hive table with an additional column that represents the session id. I am not sure how to do it. Any help or clue will be highly appreciated.

Try the approaches below:
Using a random session id:
Create an external table on top of the source dataset:
create external table staging (a string, b string, c string) location 'xyz';
Assign a unique id to each row:
insert into table destination select reflect("java.util.UUID", "randomUUID") as session_id, s.* from staging s;
Using a sequence number as the session id:
Create an external table on top of the source dataset:
create external table staging (a string, b string, c string) location 'xyz';
First-time data load: create the helper table and seed it, otherwise the join below returns no rows:
CREATE TABLE IF NOT EXISTS max_session_id (session_id int);
INSERT INTO max_session_id VALUES (0);
Append a sequence id to each record:
insert into table destination
select cast(coalesce(t.session_id, 0) + row_number() over () as int) as session_id, t1.*
from max_session_id t cross join staging t1;
Maintain the max session id in a separate table, refreshed after each load:
DROP TABLE IF EXISTS tmp_max_session_id;
CREATE TABLE tmp_max_session_id AS SELECT COALESCE(MAX(session_id), 0) AS session_id FROM destination;
INSERT OVERWRITE TABLE max_session_id SELECT * FROM tmp_max_session_id;
If you want to tag the same session id per file, then add each file as a partition. You may store the reflect("java.util.UUID", "randomUUID") value or the max_session_id in a separate table, and when adding the partition, use the newly generated session_id as the partition value.
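A minimal sketch of that per-file variant (table names and paths here are illustrative, not from the question):
-- destination partitioned by session_id; each csv file's directory becomes one partition
create external table destination_part (a string, b string, c string)
partitioned by (session_id int)
location '/warehouse/destination_part';
-- for each new file: pick the next id (e.g. from max_session_id), place the
-- file under a matching directory, then register it as a partition
alter table destination_part add partition (session_id=4)
location '/warehouse/destination_part/session_id=4';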

Related

How to create table in Hive with specific column values from another table

I am new to Hive and have some problems. I tried to find an answer here and on other sites, but with no luck... I also tried many different queries that came to mind, also without success.
I have my source table and I want to create a new table like this.
Where:
id would be an auto-incremented number for each distinct county, and the primary key
counties would be the distinct names of the counties (from the source table)
You could follow this approach:
A CTAS (Create Table As Select)
With your example, this CTAS could work:
CREATE TABLE t_county
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE AS
WITH t AS (
  -- take DISTINCT first, then number the rows; otherwise ROW_NUMBER()
  -- makes every row unique and DISTINCT removes nothing
  SELECT DISTINCT county
  FROM counties)
SELECT ROW_NUMBER() OVER () AS id, county
FROM t;
You cannot have primary or foreign keys in Hive the way you do on RDBMSs like Oracle or MySQL, because Hive is schema-on-read instead of schema-on-write like Oracle, so you cannot enforce constraints of any kind in Hive.
I cannot give you the exact answer, because you are supposed to try it by yourself first and then, if you have a problem or a doubt, come here and tell us. But what I can tell you is that you can use the INSERT statement to populate a new table using data from another table, i.e.:
create table CARS (name string);
insert into table CARS select x from TABLE_2;
You can also use INSERT OVERWRITE if you want to replace all of the existing data in that table (CARS).
So, the operation will be
CREATE TABLE ==> INSERT OPERATION (OVERWRITE?) + QUERY OPERATION
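For example, a sketch of the overwrite variant with the same hypothetical tables:
-- replaces all existing rows in CARS instead of appending
insert overwrite table CARS select x from TABLE_2;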
Hive is not an RDBMS, so there is no concept of a primary or foreign key.
But you can add a generated unique-id column in Hive. Please try:
Create table new_table as
select reflect("java.util.UUID", "randomUUID") as id, county from my_source_table;
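If you need numeric sequential ids rather than UUID strings, a ROW_NUMBER() sketch (same hypothetical table names) is closer to a classic auto-increment:
create table new_table as
select row_number() over () as id, county
from (select distinct county from my_source_table) t;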

Create partitioned table from non partitioned table

Suppose I have an internal, non-partitioned ORC table in Hive:
CREATE TABLE IF NOT EXISTS non_partitioned_table(
id STRING,
company STRING,
city STRING,
country STRING
)
STORED AS ORC;
Is it possible to somehow create a partitioned Parquet table via a CREATE TABLE LIKE statement, like this?
create partitioned_table PARTITION ON (date STRING) like non_partitioned_table;
alter table partitioned_table SET FILEFORMAT PARQUET;
This create statement doesn't work.
So basically I need to add a column and make the table partitioned by this column. I know that I can create the table through a plain CREATE TABLE statement, but I want to do it with CREATE TABLE LIKE and then alter it somehow.
Your table doesn't have a date column to begin with, so you're going to have to make a new one.
You might be able to ALTER TABLE non_partitioned_table ADD PARTITION, but I haven't tried that myself. If you want to try it, I would suggest the partition location be outside of the existing HDFS directory.
Anyways, the CREATE-TABLE-LIKE DDL does not support PARTITIONED BY:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path];
You need to copy the schema from DESCRIBE TABLE on the first table, then write a new DDL that adds PARTITIONED BY and, optionally, STORED AS. (SET FILEFORMAT PARQUET doesn't convert the existing data in place.)
Then, if you want the data in the new table, you need to INSERT OVERWRITE TABLE, as sketched below.
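Putting that together, a sketch of the full sequence; the partition value is a hypothetical constant, since non_partitioned_table has no date column to select from:
-- same columns, plus a date partition, stored as Parquet
CREATE TABLE IF NOT EXISTS partitioned_table (
  id STRING,
  company STRING,
  city STRING,
  country STRING
)
PARTITIONED BY (`date` STRING)
STORED AS PARQUET;
-- copy the data into a static partition; '2019-01-01' is a placeholder value
INSERT OVERWRITE TABLE partitioned_table PARTITION (`date`='2019-01-01')
SELECT id, company, city, country FROM non_partitioned_table;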

hive first column to consider in partition table

When creating a partitioned table in Hive, is it mandatory to always choose the last column as the partition column?
If I choose the 1st column as the partition, I can't filter the data; is there any way to choose the first column for the partition?
In Hive, if you want to partition a table, you have to define the partition column at table creation time, and while populating the data into the table you need to specify it as follows:
INSERT INTO partitioned_table PARTITION(status) SELECT id, name, status FROM temp_tbl;
In this way you can partition based on the last column only. If you want to partition on the basis of the first column, you have to write a MapReduce job for that; that is the only option available.
I guess the problem you are facing is that you already have the "source" table on your local system or HDFS, and you want to upload it to a partitioned table with the first column of the source table as the partition column in Hive. As the source table does not have headers, I guess we cannot do anything here if we try to directly upload the file into the Hive destination folder. The only alternative way I know is to create a non-partitioned table in Hive whose structure is exactly the same as the source file, upload the source data into that non-partitioned table first, and then copy the data from the non-partitioned table into the partitioned table.
Suppose the partitioned table is like this:
create table source(eid int, ename string, esal int) partitioned by (dept string);
Your non-partitioned table, where you upload the data, is like this:
create table nopart(dept string, esal int, ename string, eid int);
Then you use the dynamic partition insert command:
insert overwrite table source partition(dept) select eid, ename, esal, dept from nopart;
The order of the columns in the select is the only point here: the partition column must come last.
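Note that a dynamic partition insert like the one above generally requires these settings first:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;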

creating partition in external table in hive

I have successfully created and added dynamic partitions to an internal table in Hive, i.e. by using the following steps:
1. Created a source table
2. Loaded data from local into the source table
3. Created another table with partitions - partition_table
4. Inserted the data into this table from the source table, resulting in creation of all the partitions dynamically
My question is, how do I perform this with an external table? I read so many articles on this, but I am confused: do I have to specify the path to the already existing partitions when creating partitions for an external table?
example:
Step 1:
create external table table1 ( name string, age int, height int)
location 'path/to/dataFile/in/HDFS';
Step 2:
alter table table1 add partition(age)
location 'path/to/already/existing/partition'
I am not sure how to proceed with partitioning in external tables. Can somebody please help by giving a step-by-step description of the same?
Thanks in advance!
Yes, you have to tell Hive explicitly what your partition field is.
Consider that you have the following HDFS directory on which you want to create an external table:
/path/to/dataFile/
Let's say this directory already has data stored (partitioned) department-wise as follows:
/path/to/dataFile/dept1
/path/to/dataFile/dept2
/path/to/dataFile/dept3
Each of these directories has a bunch of files, where each file contains actual comma-separated data for the fields, say name, age, height.
e.g.
/path/to/dataFile/dept1/file1.txt
/path/to/dataFile/dept1/file2.txt
Now let's create an external table on this:
Step 1. Create external table:
CREATE EXTERNAL TABLE testdb.table1(name string, age int, height int)
PARTITIONED BY (dept string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/dataFile/';
Step 2. Add partitions:
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept1') LOCATION '/path/to/dataFile/dept1';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept2') LOCATION '/path/to/dataFile/dept2';
ALTER TABLE testdb.table1 ADD PARTITION (dept='dept3') LOCATION '/path/to/dataFile/dept3';
Done; run a select query once to verify that the data loaded successfully.
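For example:
SHOW PARTITIONS testdb.table1;
SELECT * FROM testdb.table1 WHERE dept = 'dept1' LIMIT 5;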
1. Set the properties below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
2. Create the external partitioned table:
create external table table1 ( name string, height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
3. Insert data into the partitioned table from the source table, as sketched below.
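A sketch of step 3, assuming a hypothetical source_table(name string, age int, height int); the partition column must come last in the select:
insert into table table1 partition(age)
select name, height, age from source_table;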
Basically, the process is the same; it's just that you create an external partitioned table and provide the HDFS path for the table, under which it will create and store the partitions.
Hope this helps.
The proper way to do it:
Create the table and declare it as partitioned (the partition column must not be repeated in the column list).
create external table table1 ( name string, height int)
partitioned by (age int)
stored as ****(your format)
location 'path/to/dataFile/in/HDFS';
Now you have to refresh the partitions in the Hive metastore:
msck repair table table1;
This will take care of loading all your partitions into the Hive metastore.
You can use msck repair table at any point during your process to keep the metastore updated.
Follow the steps below:
Create a temporary table/source table:
create table source_table(name string, age int, height int) row format delimited fields terminated by ',';
Use the delimiter as in your file instead of ','.
Load data into the source table:
load data local inpath 'path/to/dataFile/in/HDFS' into table source_table;
Create the external table with a partition column:
create external table external_dynamic_partitions(name string,height int)
partitioned by (age int)
location 'path/to/dataFile/in/HDFS';
Set dynamic partition mode to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict;
Load data into the partitioned external table from the source table (the partition column must come last in the select):
insert into table external_dynamic_partitions partition(age)
select name, height, age from source_table;
That's it.
You can check the partition information using
show partitions external_dynamic_partitions;
You can even check whether it is an external table or not using
describe formatted external_dynamic_partitions;
An external table is a type of table in Hive where the data is not moved into the Hive warehouse. That means even if you delete the table, the data still persists and you will always get the latest data, which is not the case with a managed table.
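To make the difference concrete, a small sketch (table names are illustrative):
-- managed: DROP TABLE removes metadata and the data files
create table managed_t (id int);
-- external: DROP TABLE removes only the metadata; files under the location stay in HDFS
create external table external_t (id int)
location '/data/external_t';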

Mapping HDFS directory with .tsv files to Hive

I have data in HDFS in .tsv format. I need to load it into a Hive table. I need some help.
Data in HDFS looks like:
/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03
Note: data is loaded into the HDFS directory /ad_data/raw/reg_logs daily and hourly.
There are 3 .tsv files in this HDFS directory:
funel1.tsv
funel2.tsv
funel3.tsv
Each .tsv file has 3 tab-separated columns, with data like:
2344 -39 223
2344 -23 443
2394 -43 98
2377 -12 33
...
...
I want to create a Hive schema with 3 columns: id int, region_code int, and count int, exactly as in HDFS. If possible I want to remove that negative sign in the Hive table, but it's not a big deal.
I created a Hive table with this schema (please correct me if I am wrong):
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/raw/reg_logs';
All I want to do is copy data from HDFS to Hive. I do not want to use "load data inpath '..' into table reg_logs" because I do not want to manually load data every day. I just want to point the Hive table at the HDFS directory so it picks up each day's data automatically.
How can I achieve this? Please correct my Hive table schema if needed, and the way to get the data there.
==
2nd part:
I want to create another table, reg_logs_org, which would be populated from reg_logs. I need to put everything from reg_logs into reg_logs_org besides the hour column.
The schema I created is:
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs_org (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/reg_logs_org';
Insert data into reg_logs_org from reg_logs:
insert overwrite table reg_logs_org
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code
error message:
FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'reg_logs_org'
==
Thank you,
Rio
You're very close. The last step is that you need to add the partition information to Hive's metastore. Hive stores the location of every partition individually, and it does not automatically find new partitions. There are two ways to add the partitions:
Every hour, do an add partition statement:
alter table reg_logs add partition(utc_date='2014-06-11', utc_hour='03')
location '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03';
Every hour (or less frequently) do a table repair. This scans the root table location for any partitions it has not yet added.
msck repair table reg_logs;
The first approach is a bit more painful, but more efficient. The second approach is easy, but does a full scan of all partitions every time.
Edit: second half of question:
You just need to add some syntax for inserting into a table using dynamic partitions. In general, it is:
insert overwrite table [table_name] partition([partition_column])
select ...
Or in your case:
insert overwrite table reg_logs_org partition(utc_date)
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code
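One prerequisite worth noting: a dynamic partition insert like this fails in strict mode, so you may need to run these settings first:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;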
