Rules to be followed before creating a Hive partitioned table - hadoop

As part of my requirement, I have to create a new Hive table and insert into it programmatically. To do that, I have the following DDL to create a Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName String,
ssn String,
hiveCount String,
sapCount String,
countDifference String,
percentDifference String,
sap_UpdTms String,
hive_UpdTms String)
COMMENT 'This table contains record count of corresponding tables of all the source systems present on Hive & SAP'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '';
Inserting data into a partition of a Hive table is something I can handle with an insert query from the program. In the DDL above, I haven't added a "PARTITIONED BY" column before creating the table, as I am not totally clear on the rules for partitioning a Hive table. A couple of rules I know are:
When inserting data from a query, the partition column should be the last one in the select.
The PARTITIONED BY column shouldn't be an existing column in the table.
Could anyone let me know if there are any other rules for partitioning a Hive table?
Also, in my case, we run the program twice a day to insert data into the table, and each run can produce 8k to 10k records. I am thinking of adding a PARTITIONED BY column for the current date (just "mm/dd/yyyy") and inserting it from the code.
Is there a better way to implement the partitioning idea for my requirement, if adding a date in String format is not recommended?

What you mentioned is fine, but I would recommend the yyyyMMdd format because it sorts better and is less ambiguous than seeing 03/05 and not knowing which is the day and which is the month.
If you want to run it twice a day, and you care about the time the job runs, then use PARTITIONED BY (dt STRING, hour STRING).
Also, don't use STORED AS TEXTFILE. Use Parquet or ORC instead.
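For illustration, a minimal sketch of the DDL with those suggestions applied (the dt/hour names follow the answer above, ORC is one of the two suggested formats, and the COMMENT and LOCATION clauses are omitted here):
CREATE EXTERNAL TABLE IF NOT EXISTS countData (
tableName STRING,
ssn STRING,
hiveCount STRING,
sapCount STRING,
countDifference STRING,
percentDifference STRING,
sap_UpdTms STRING,
hive_UpdTms STRING)
PARTITIONED BY (dt STRING, hour STRING)
STORED AS ORC;
Each insert then supplies the dt (e.g. '20140611') and hour values, either statically in the PARTITION clause or as the last columns of the select.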

Related

hive first column to consider in partition table

When creating a partitioned table in Hive, is it mandatory to always choose the last column as the partition column?
If I choose the 1st column as the partition column, I can't filter the data. Is there any way to choose the first column for partitioning?
In Hive, if you want to partition a table, you have to define the partition column at table creation time, and while populating the table you need to specify it as follows:
"INSERT INTO partitioned_table PARTITION(status) SELECT id, name, status FROM temp_tbl"
This way you can only partition on the last column of the select. If you want to partition on the basis of the first column of the source data, you would have to write a MapReduce job for that; that is the only option available.
I guess the problem you are facing is that you already have a source table in your local file system or HDFS and you want to upload it into a partitioned table, with the first column of the source data as the partition column in Hive. Since the source file has no headers, we cannot do anything by directly uploading the file into the Hive destination folder. The only alternative I know is: create a non-partitioned table in Hive whose structure exactly matches the source file, upload the source data into that non-partitioned table first, and then copy the data from the non-partitioned table into the partitioned table.
Suppose your partitioned table is like this:
create table source(eid int, ename string, esal int) partitioned by (dept string);
and the non-partitioned table where you upload the data is like this:
create table nopart(dept string, esal int, ename string, eid int);
Then you use a dynamic-partition insert:
insert overwrite table source partition(dept) select eid, ename, esal, dept from nopart;
The order of the columns in the select is the only point here: the partition column must come last.
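Note that a dynamic-partition insert like this (where the partition value is not given statically) generally requires dynamic partitioning to be enabled for the session first, with the same two settings that appear in the answers further down:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;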

How to query sorted/indexed columns in Impala

I have to do a POC with Hadoop for a database using interactive queries (a ~300 TB log database). I'm trying Impala but I didn't find any way to use sorted or indexed data. I'm a newbie, so I don't even know if it is possible.
How can I query sorted/indexed columns in Impala?
By the way, here is my table's DDL (simplified).
I would like to have fast access on the "column_to_sort" column below.
CREATE TABLE IF NOT EXISTS myTable (
unique_id STRING,
column_to_sort INT,
content STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED AS textfile;

Mapping HDFS directory with .tsv files to Hive

I have data in HDFS in .tsv format and I need to load it into a Hive table. I need some help.
Data in HDFS looks like:
/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03
Note: Data is loaded into the HDFS directory /ad_data/raw/reg_logs daily and hourly.
There are 3 .tsv files in this HDFS directory:
funel1.tsv
funel2.tsv
funel3.tsv
Each .tsv file has 3 columns separated by tabs and contains data like:
2344 -39 223
2344 -23 443
2394 -43 98
2377 -12 33
...
...
I want to create a Hive schema with 3 columns: id int, region_code int, and count int, exactly as in HDFS. If possible I would like to remove that negative sign in the Hive table, but it's not a big deal.
I created a Hive table with this schema (please correct me if I am wrong):
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/raw/reg_logs';
All I want to do is copy the data from HDFS into Hive. I do not want to use "load data inpath '..' into table reg_logs" because I do not want to load data manually every day. I just want to point the Hive table at the HDFS directory so it will pick up the data for each day automatically.
How can I achieve this? Please correct my Hive table schema if needed, and tell me the way to get the data there.
==
2nd part:
I want to create another table, reg_logs_org, which would get populated from reg_logs. I need to put everything from reg_logs into reg_logs_org except the hour column.
The schema I created is:
CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs_org (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/reg_logs_org';
Insert data into reg_logs_org from reg_logs:
insert overwrite table reg_logs_org
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code
error message:
FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'reg_logs_org'
==
Thank you,
Rio
You're very close. The last step is that you need to add the partition information to Hive's metastore. Hive stores the location of every partition individually, and it does not automatically find new partitions. There are two ways to add the partitions:
Every hour, do an add partition statement:
alter table reg_logs add partition(utc_date='2014-06-11', utc_hour='03')
location '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03';
Every hour (or less frequently) do a table repair. This scans the root table location for any partitions it has not yet added.
msck repair table reg_logs;
The first approach is a bit more painful, but more efficient. The second approach is easy, but does a full scan of all partitions every time.
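Either way, a quick sanity check is to list which partitions the metastore currently knows about:
SHOW PARTITIONS reg_logs;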
Edit: second half of question:
You just need to add some syntax for inserting into a table using dynamic partitions. In general, it is:
insert overwrite [table] partition([partition column])
select ...
Or in your case:
insert overwrite table reg_logs_org partition(utc_date)
select id, region_code, sum(count), utc_date
from
reg_logs
group by
utc_date, id, region_code
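One caveat: since the utc_date partition here is purely dynamic, Hive may reject this insert in strict mode, in which case you also need to enable dynamic partitioning for the session first:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;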

Hive: Create Table and Partition By

I have a table with loaded data as following:
create table xyzlogTable (dateC string, hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties(
"input.regex" = "(\\S+)\\t(\\d+):(\\d+):(\\d+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s")
stored as textfile;
load data local inpath '/home/hadoop/hive/xyxlogData/' into table xyzlogTable;
The total row count is more than 3 million. Some queries work fine and some seem to run forever.
After seeing that select and group by queries were taking a long time and sometimes not even returning results, I decided to go for partitioning.
But both of the following statements fail:
create table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string);
FAILED: Error in metadata: AlreadyExistsException(message:Table xyzlogTable already exists)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Alter table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string);
FAILED: Parse Error: line 1:12 cannot recognize input 'xyzlogTable' in alter table statement
Any idea what the problem is?
This is precisely why I prefer using external tables in Hive. The table you created is not external (you used create table instead of create external table). With non-external tables, dropping the table drops both the metadata (name, column names, types, etc.) and the data of the table in HDFS. In contrast, when an external table is dropped, only the metadata is removed; the data in HDFS sticks around.
You have a few options going forward:
If the cost of import is high and the data is not already partitioned: keep this table around, but create a new table, say xyzlogTable_partitioned, that is a partitioned version of this table. You can use dynamic partitioning in Hive to populate this new table (see the sketch after this list).
If the cost of import is high but the data is already partitioned, for example if you already have the data in separate files for each partition in HDFS: create a new partitioned table and have a bash script (or equivalent) move (or copy and later delete, if you are conservative) the files from the HDFS directory corresponding to the un-partitioned table to the directory corresponding to the appropriate partition of the new table.
If import is cheap: drop the entire table, re-create it as a partitioned table, and re-import. Often, when the import process is not aware of the partitioning scheme (in other words, when the import can't push data directly into the appropriate partitions), it's a common pattern to have an unpartitioned table (like the one you already have) as a staging table, and then use a Hive query with dynamic partitioning to populate a new partitioned table that is used in the subsequent queries of the workflow.
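A rough sketch of the first option, assuming the column list from the original table (the new table name is illustrative):
create table xyzlogTable_partitioned (hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string)
partitioned by (dateC string);

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table xyzlogTable_partitioned partition(dateC)
select hours, minutes, seconds, TimeTaken, Method, UriQuery, ProtocolStatus, dateC
from xyzlogTable;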
You should first drop the table that was already created and then create the partitioned table, or change your table name.

Hive loading in partitioned table

I have a log file in HDFS with values delimited by commas. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file into a Hive table which has the columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last 2 columns in the log file as the partition for the table? All the examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');", require the partitions to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
One solution is to create an intermediate non-partitioned table with all 4 columns, populate it from the file, and then do an INSERT INTO first_table PARTITION (userid, deviceid) SELECT timestamp, action, userid, deviceid FROM intermediate_table; but that is an additional task and we would end up with 2 very similar tables. Or we could create an external table as the intermediate one.
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
LOAD DATA simply copies files; it doesn't read them, so it cannot figure out what to partition by.
He suggests loading the data into an intermediate table first (or using an external table pointing to all the files) and then letting a dynamic-partition insert kick in to load it into the partitioned table.
As mentioned in @Denny Lee's answer, we need to involve a staging table (invites_stg), managed or external, and then INSERT from the staging table into the partitioned table (invites in this case).
Make sure these two properties are set:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally, insert into the partitioned table:
INSERT OVERWRITE TABLE India PARTITION (STATE) SELECT COL's FROM invites_stg;
Refer to this link for help: http://www.edupristine.com/blog/hive-partitions-example
I worked through this very same scenario, but what we did instead was create separate HDFS data files for each partition we needed to load.
Since our data was coming from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into the corresponding partition files. Afterwards, it is just a matter of building the script using the partition from the HDFS file name.
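Note that if you write partition directories directly like this, Hive's metastore still won't know about them until they are registered, for example with the repair command mentioned in an earlier answer (the table name here is a placeholder):
MSCK REPAIR TABLE your_table;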
How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');
For reference, here is the India example table from the linked article:
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Instruct Hive to dynamically load partitions:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
