Hadoop and Hive optimisation

I need help with the following scenario:
1) Memo is the source table in Hive.
It has 5493656359 records. Its description is as follows:
load_ts timestamp
memo_ban bigint
memo_id bigint
sys_creation_date timestamp
sys_update_date timestamp
operator_id bigint
application_id varchar(6)
dl_service_code varchar(5)
dl_update_stamp bigint
memo_date timestamp
memo_type varchar(4)
memo_subscriber varchar(20)
memo_system_txt varchar(180)
memo_manual_txt varchar(2000)
memo_source varchar(1)
data_dt string
market_cd string
Partition information:
data_dt string
market_cd string
2)
This is the target table
CREATE TABLE IF NOT EXISTS memo_temprushi (
load_ts TIMESTAMP,
ban BIGINT,
attuid BIGINT,
application VARCHAR(6),
system_note INT,
user_note INT,
note_time INT,
date TIMESTAMP)
PARTITIONED BY (data_dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
3)
This is the initial load statement from source table Memo into
target table memo_temprushi. Loads all records till date 2015-12-14:
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT INTO TABLE memo_temprushi PARTITION (DATA_DT)
SELECT LOAD_TS,MEMO_BAN, OPERATOR_ID, APPLICATION_ID,
CASE WHEN LENGTH(MEMO_SYSTEM_TXT)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(MEMO_MANUAL_TXT)=0 THEN 0 ELSE 1 END,
HOUR(MEMO_DATE), MEMO_DATE, DATA_DT
FROM tlgmob_gold.MEMO WHERE LOAD_TS < DATE('2015-12-15');
4)
For the incremental load I want to insert the rest of the records, i.e. from 2015-12-15 onward. I'm using the following query:
INSERT INTO TABLE memo_temprushi PARTITION (DATA_DT)
SELECT MS.LOAD_TS,MS.MEMO_BAN, MS.OPERATOR_ID, MS.APPLICATION_ID,
CASE WHEN LENGTH(MS.MEMO_SYSTEM_TXT)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(MS.MEMO_MANUAL_TXT)=0 THEN 0 ELSE 1 END,
HOUR(MS.MEMO_DATE), MS.MEMO_DATE, MS.DATA_DT
FROM tlgmob_gold.MEMO MS JOIN (select max(load_ts) max_load_ts from memo_temprushi) mt
ON 1=1
WHERE
ms.load_ts > mt.max_load_ts;
It launches 2 jobs and initially warns that a stage is a cross product.
The first job completes quite quickly, but the second job remains stuck at reduce 33%.
The log shows: [EventFetcher for fetching Map Completion Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: EventFetcher is interrupted.. Returning
It shows that the number of reducers is 1.
I tried to increase the number of reducers with set mapreduce.job.reduces, but it has no effect.
Thanks

You can try this. The join ON 1=1 has no join key, which is likely why all rows end up in a single reducer.
Run "select max(load_ts) max_load_ts from memo_temprushi".
Put the returned value directly into the WHERE condition of the query and remove the join.
If that works, you can write a shell script in which the first query fetches the max value and the second query runs without the join.
Here is a sample shell script:
max_date=$(hive -e "select max(order_date) from orders" 2>/dev/null)
hive -e "select order_date from orders where order_date >= date_sub('$max_date', 7);"
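Applied to the tables in the question, the same two-step idea might look like the sketch below. It assumes the hive CLI is on the PATH and that dynamic partitioning is allowed in the session; the function names build_incremental_insert and run_incremental_load are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: replace the cross join with a literal timestamp looked up beforehand.
# Function names are hypothetical; tables and columns come from the question.

# Print the INSERT statement for all rows loaded after the given timestamp.
build_incremental_insert() {
  local max_ts="$1"
  cat <<SQL
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE memo_temprushi PARTITION (data_dt)
SELECT load_ts, memo_ban, operator_id, application_id,
       CASE WHEN LENGTH(memo_system_txt) = 0 THEN 0 ELSE 1 END,
       CASE WHEN LENGTH(memo_manual_txt) = 0 THEN 0 ELSE 1 END,
       HOUR(memo_date), memo_date, data_dt
FROM tlgmob_gold.memo
WHERE load_ts > CAST('${max_ts}' AS TIMESTAMP);
SQL
}

# Step 1: look up the high-water mark. Step 2: run the join-free insert.
run_incremental_load() {
  local max_ts
  max_ts=$(hive -S -e "SELECT MAX(load_ts) FROM memo_temprushi" 2>/dev/null)
  # An empty target table yields NULL; stop rather than compare against it.
  if [ -z "$max_ts" ] || [ "$max_ts" = "NULL" ]; then
    echo "no max load_ts found; run the initial load first" >&2
    return 1
  fi
  hive -e "$(build_incremental_insert "$max_ts")"
}
```

Calling run_incremental_load performs both steps. Because the filter is now a plain predicate on the source table, there is no cross product left in the plan.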

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data into a partitioned table in Hive.
Here is how I create the table:
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it: LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine up to here. Now I want to create a dynamically partitioned table. First of all, I set these parameters:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Can someone help me, please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this query to check whether the column contains null values:
select count(*) from title_ratings where averageRating is null;
If the count is non-zero, fill in those values and then apply the partitioning again.
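The partition spec must name the partition column (averageRating), not the source table; that is what the "contains non-partition columns" error is complaining about.
The partition column is also stored as the last column of the table, so the select statement has to list the columns in that order.
Please change the order of the columns in the select:
insert into title_ratings_part partition(averageRating)
select
tconst,
numVotes,
averageRating --the partition column must always come last
from title_ratings;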
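For the stated goal of three rating ranges, one option is to partition on a derived bucket column instead of the raw DOUBLE. The sketch below assumes a hypothetical target table title_ratings_bucketed (tconst STRING, averageRating DOUBLE, numVotes INT) declared with PARTITIONED BY (rating_bucket STRING); the bucket labels, the boundary handling at exactly 2 and 4, and the function name are assumptions too.

```shell
# Print HiveQL that derives a rating bucket and uses it as the dynamic
# partition. title_ratings_bucketed and rating_bucket are hypothetical names.
# The derived bucket is the last SELECT expression because Hive takes the
# dynamic partition value from the trailing column.
rating_bucket_insert_sql() {
  cat <<'SQL'
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE title_ratings_bucketed PARTITION (rating_bucket)
SELECT tconst,
       averageRating,
       numVotes,
       CASE WHEN averageRating IS NULL THEN 'unknown'
            WHEN averageRating < 2     THEN 'low'
            WHEN averageRating <= 4    THEN 'mid'
            ELSE 'high'
       END AS rating_bucket
FROM title_ratings;
SQL
}

# Usage, assuming the hive CLI is available:
#   hive -e "$(rating_bucket_insert_sql)"
```

This yields at most four partitions rather than one per distinct rating value.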

Issue with Hive partitioning and bucketing in CDH 5.10 quick VM

I am new to this area and got stuck on a simple issue.
I am loading data into a Hive table (using an insert command from another table, tset1) which is partitioned by udate and bucketed by day.
insert overwrite test1 partition(udate) select id,value,udate,day from tset1;
The issue is that the load puts the wrong value in the partition column. day is the last column in my select, so during the load its value is taken as udate.
How can I force my query to use the right value during the load?
hive (testdb)> create table test1_buk(id int, value string,day int) partitioned by(udate string) clustered by(day) into 5 buckets row format delimited fields terminated by ',' stored as textfile;
hive (testdb)> desc tset1;
OK
col_name data_type comment
id int
value string
udate string
day int
hive (testdb)> desc test1_buk;
OK
col_name data_type comment
id int
value string
day int
udate string
# Partition Information
# col_name data_type comment
udate string
hive (testdb)> select * from test1_buk limit 1;
OK
test1_buk.id test1_buk.value test1_buk.day test1_buk.udate
5 w 2000 10
please help.
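A likely fix, sketched under the assumption that the target is test1_buk as shown in the desc output: with dynamic partitioning Hive matches the select list to the target by position, so the partition column udate has to come last, after day. The hive.enforce.bucketing setting is only relevant on Hive versions before 2.0 (such as the one shipped with CDH 5.10); the function name is hypothetical.

```shell
# Print an INSERT whose column order matches test1_buk: regular columns
# first (id, value, day), then the partition column (udate) last.
test1_buk_insert_sql() {
  cat <<'SQL'
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE test1_buk PARTITION (udate)
SELECT id, value, day, udate
FROM tset1;
SQL
}

# Usage, assuming the hive CLI is available:
#   hive -e "$(test1_buk_insert_sql)"
```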

How to automatically get the current date and time in a column using HIVE

Hey, I have two columns in my Hive table, for example:
c1 : name
c2 : age
Now, while creating the table, I want to declare two more columns that automatically hold the current date and time when the row is loaded.
e.g.: John 24 26/08/2015 11:15
How can this be done?
Hive currently does not support default values in column definitions when creating a table. For the complete syntax, refer to the Hive Create Table specification.
A workaround is to load the data into a temporary table first and then use an insert overwrite table statement to add the current date and time while copying into the main table.
The example below may help:
1. Create a temporary table
create table EmpInfoTmp(name string, age int);
2. Insert data using a file or existing table into the EmpInfoTmp table:
name|age
Alan|28
Sue|32
Martha|26
3. Create a table which will contain your final data:
create table EmpInfo(name string, age tinyint, createDate string, createTime string);
4. Insert data from the temporary table, adding the current date and time as extra columns:
insert overwrite table EmpInfo select name, age, FROM_UNIXTIME( UNIX_TIMESTAMP(), 'dd/MM/yyyy' ), FROM_UNIXTIME( UNIX_TIMESTAMP(), 'HH:mm' ) from EmpInfoTmp;
5. End result would be like this:
name|age|createdate|createtime
Alan|28|26/08/2015|03:56
Martha|26|26/08/2015|03:56
Sue|32|26/08/2015|03:56
Please note that the creation date and time are only accurate if you move the data into the final table as soon as it arrives in the temporary table.
Note: you can't give more than one column a CURRENT_TIMESTAMP default.
Done this way, you can set CURRENT_TIMESTAMP on one column (this DDL is MySQL syntax, not HiveQL):
SQL:
CREATE TABLE IF NOT EXISTS `hive` (
`id` int(11) NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`age` int(11) DEFAULT '0',
`datecreated` timestamp NULL DEFAULT CURRENT_TIMESTAMP
);
Hey, I found a way to do it using a shell script.
Here's how:
echo "$(date +"%Y-%m-%d-%T") $(wc -l /home/hive/landing/$line ) $dir " >> /home/hive/recon/fileinfo.txt
Here I get the date without spaces. In the end I upload the text file to my Hive table.
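The shell idea can also be combined with the staging-table approach: take the date and time once in the shell and pass them into the insert as literals, so every row of one load gets the same stamp. This is only a sketch; it reuses the EmpInfoTmp and EmpInfo tables from the earlier answer, and the function name stamped_insert_sql is hypothetical.

```shell
# Print an INSERT that stamps every row with the given date and time strings.
stamped_insert_sql() {
  local load_date="$1" load_time="$2"
  cat <<SQL
INSERT INTO TABLE EmpInfo
SELECT name, age, '${load_date}', '${load_time}'
FROM EmpInfoTmp;
SQL
}

# Usage, assuming the hive CLI is available:
#   hive -e "$(stamped_insert_sql "$(date +%d/%m/%Y)" "$(date +%H:%M)")"
```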

Excluding the partition field from select queries in Hive

Suppose I have a table definition as follows in Hive (the actual table has around 65 columns):
CREATE EXTERNAL TABLE S.TEST (
COL1 STRING,
COL2 STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\007'
LOCATION 'xxx';
Once the table is created, when I run hive -e "describe s.test", I see extract_date listed as one of the columns of the table. Doing a select * from s.test also returns the extract_date column values. Is it possible to exclude this virtual(?) column when running select queries in Hive?
Set this property:
set hive.support.quoted.identifiers=none;
and run the query as:
SELECT `(extract_date)?+.+` FROM <table_name>;
I tested it and it works fine.
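When this is run through hive -e from a shell, as in the question, the backticks need care: inside double quotes the shell treats them as command substitution. One way around it, sketched here with a hypothetical function name, is to keep the query in a heredoc with a quoted delimiter.

```shell
# Print the query. The heredoc delimiter is quoted ('SQL'), so the shell
# leaves the backticks around the column regex untouched.
select_without_partition_sql() {
  cat <<'SQL'
SET hive.support.quoted.identifiers=none;
SELECT `(extract_date)?+.+` FROM s.test;
SQL
}

# Usage, assuming the hive CLI is available:
#   hive -e "$(select_without_partition_sql)"
```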

Hive partition columns seem to prevent "select distinct"

I have created a table in Hive like this:
CREATE TABLE application_path
(userId STRING, sessId BIGINT, accesstime BIGINT, actionId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Running on this table the query:
SELECT DISTINCT userId FROM application_path;
gives the expected result:
user1#domain.com
user2#domain.com
user3#domain.com
...
Then I've changed the declaration to add a partition:
CREATE TABLE application_path
(sessId BIGINT, accesstime BIGINT, actionId STRING)
PARTITIONED BY(userId STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '#'
STORED AS TEXTFILE;
Now the query SELECT DISTINCT userId... runs for a few seconds as before, but eventually returns nothing.
I've just noticed the syntax:
SHOW PARTITIONS application_path;
but I was wondering if that's the only way to get unique (distinct) values from a partitioning column. The output of SHOW PARTITIONS is not even an exact replacement for what you would get from SELECT DISTINCT, since the column name is prefixed to each row:
hive> show partitions application_path;
OK
userid=user1#domain.com
userid=user2#domain.com
userid=user3#domain.com
...
What's strange to me is that userId can be used in GROUP BY with other columns, like in:
SELECT userId, sessId FROM application_path GROUP BY userId, sessId;
but does not return anything in:
SELECT userId FROM application_path GROUP BY userId;
I experienced the same issue; it will be fixed in Hive 0.10:
https://issues.apache.org/jira/browse/HIVE-2955
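Until that fix is available, the SHOW PARTITIONS output can be post-processed to get the bare values. A minimal sketch, assuming a single partition column (so one column=value pair per line) and values that appear unescaped as in the output above; the function name partition_values is made up.

```shell
# Strip the leading "column=" from each line of SHOW PARTITIONS output.
partition_values() {
  sed 's/^[^=]*=//'
}

# Usage, assuming the hive CLI is available:
#   hive -S -e "SHOW PARTITIONS application_path" | partition_values
```

Each partition appears once in SHOW PARTITIONS, so the result is already distinct.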
