Failed to make a query faster in Hive using reducers and mappers - performance

So I've been working on transferring data from Oracle into Hive in Parquet format.
The table definition looks like this:
create external table if not exists
partitioned by(datee string)
stored as parquet
tblproperties ('hive.exec.compress.output'='true', 'parquet.compression'='SNAPPY')
My question is: how can I tune this query, which currently takes about 4 hours, down to about 5 minutes?
I have already tried putting these combinations of settings before my query:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.enabled=false;
set hive.vectorized.execution.reduce.enabled=false;
set hive.execution.engine=tez;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
But the results are going nowhere.
Any ideas would be appreciated!
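A minimal sketch of how these settings would be combined with the load (table names and the partition value are placeholders, since the actual INSERT is not shown in the question):

set hive.execution.engine=tez;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;

-- placeholder target and source tables; the real load selects the Oracle-sourced data
insert into table my_parquet_table partition (datee='2019-01-01')
select * from my_oracle_staging;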

Related

Retrieving data from oracle table using scala jdbc giving wrong results

I am using Scala JDBC to check whether a partition exists for an Oracle table. It is returning wrong results when an aggregate function like count(*) is used.
I have checked the DB connectivity, and other queries work fine. I tried to extract the value of count(*) using an alias, but it failed. I also tried using getString, but it failed.
import java.sql.DriverManager

Class.forName(jdbcDriver)
val connection = DriverManager.getConnection(jdbcUrl, dbUser, pswd)
val statement = connection.createStatement()
try {
  // Triple-quoted strings do not process escape sequences, so the single
  // quotes around the literals must not be backslash-escaped
  val sqlQuery =
    s"""SELECT COUNT(*) FROM USER_TAB_PARTITIONS
        WHERE TABLE_NAME = '$tableName' AND PARTITION_NAME = '$partitionName'"""
  val resultSet1 = statement.executeQuery(sqlQuery)
  while (resultSet1.next()) {
    val cnt = resultSet1.getInt(1)
    println("Count=" + cnt)
    if (cnt == 0) {
      // Code to add partition and insert data
    } else {
      // Code to insert data in existing partition
    }
  }
} catch {
  case e: Exception => // ...
}
The value of cnt always prints as 0 even though the Oracle partition already exists. Can you please let me know what the error in the code is? Is it giving wrong results because I am using Scala JDBC to get the result of an aggregate function like count(*)? If so, what would the correct code be? I need to use Scala JDBC to check whether the partition already exists in Oracle and then insert data accordingly.
This is just a suggestion, but it might be the solution in your case.
Whenever you query Oracle's metadata tables, always use UPPER or LOWER on both sides of the equals sign.
Oracle converts every object name to upper case and stores it in the metadata, unless you specifically provided a lower-case object name in double quotes when creating it.
So take the following example:
-- 1
CREATE TABLE "My_table_name1" ... -- CASE SENSISTIVE
-- 2
CREATE TABLE My_table_name2 ... -- CASE INSENSITIVE
In the first query we used double quotes, so the name will be stored in Oracle's metadata as a case-sensitive name.
In the second query we did not use double quotes, so the table name will be converted to upper case and stored in Oracle's metadata.
So if you want to write a query against any Oracle metadata that covers both of the above cases, you can use UPPER or LOWER on the column name and the value, as follows:
SELECT * FROM USER_TABLES WHERE UPPER(TABLE_NAME) = UPPER('<YOUR TABLE NAME>');
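Applied to the partition check from the question above, the query would become something like this (a sketch; placeholders are kept in the same style as the example):

SELECT COUNT(*) FROM USER_TAB_PARTITIONS
WHERE UPPER(TABLE_NAME) = UPPER('<YOUR TABLE NAME>')
AND UPPER(PARTITION_NAME) = UPPER('<YOUR PARTITION NAME>');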
Hope this helps you solve the issue.
Cheers!!

How to constrain Hive query file output to always be a single file

I have created a Hive table using the query below, and I insert data into this table on a daily basis using the second query, as shown below.
create EXTERNAL table IF NOT EXISTS DB.efficacy
(
product string,
TP_Silent INT,
TP_Active INT,
server_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://hdfsadlproduction/user/DB/Report/efficacy';
Insert INTO DB.efficacy
select
product,
SUM(CASE WHEN verdict = 'TP_Silent' THEN 1 ELSE 0 END ),
SUM(CASE WHEN verdict = 'TP_Active' THEN 1 ELSE 0 END ) ,
current_date()
from
DB.efficacy_raw
group by
product
;
The issue is that every day, when my insert query executes, it creates a new file in the Hadoop FS. I want each day's query output to be appended to the same single file, but the Hadoop FS contains files in the following manner:
000000_0, 000000_0_copy_1, 000000_0_copy_2
I have used the Hive settings below:
SET hive.execution.engine=mr;
SET tez.queue.name=${queueName};
SET mapreduce.job.queuename=${queueName};
SET mapreduce.map.memory.mb = 8192;
SET mapreduce.reduce.memory.mb = 8192;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.parallel = true;
SET hive.exec.parallel.thread.number = 2;
SET mapreduce.input.fileinputformat.split.maxsize=2048000000;
SET mapreduce.input.fileinputformat.split.minsize=2048000000;
SET mapreduce.job.reduces = 20;
SET hadoop.security.credential.provider.path=jceks://hdfs/user/efficacy/s3-access/efficacy.jceks;
set hive.vectorized.execution.enabled=false;
set hive.enforce.bucketmapjoin=false;
set hive.optimize.bucketmapjoin.sortedmerge=false;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;
I am a beginner in the Hive and Hadoop world, so please excuse me. Any help will be greatly appreciated.
Note: I am using Hadoop 2.7.3.2.5.0.55-1
I didn't see any direct mechanism or Hive setting that automatically merges all the small files at the end of the query. Concatenation of small files is currently not supported for files stored as text.
As per the comment by "leftjoin" on my post, I created the table in ORC format and then used the CONCATENATE Hive command to merge all the small files into a single big file.
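For reference, the CONCATENATE step looks something like this (a sketch; the ORC table name is hypothetical):

ALTER TABLE DB.efficacy_orc CONCATENATE;

This rewrites the table's small ORC files into fewer, larger files without changing the data.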
I then used the Hive query below to export the data from this single big ORC file into a single text file, and was able to complete my task with this exported text file.
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;
Courtesy:- https://community.hortonworks.com/questions/144122/convert-orc-table-data-into-csv.html

Hive Merge Small ORC Files

My input consists of a large number of small ORC files which I would like to merge at the end of every day, and I would like to split the data into 100MB blocks.
My input and output are both on S3, and the environment I am using is EMR.
The Hive parameters I am setting:
set hive.msck.path.validation=ignore;
set hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 8;
SET hive.exec.stagingdir=/tmp/hive/  ;
SET hive.exec.scratchdir=/tmp/hive/ ;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=104857600;
set mapred.reduce.tasks=10;
My Insert Statement:
insert into table dev.orc_convert_zzz_18 partition(event_type) select * from dev.events_part_input_18 where event_type = 'ScreenLoad' distribute by event_type;
Now the problem is: I have around 80 input files which are about 500MB in total, and after this insert statement I was expecting 4 files in S3, but all of these files are getting merged into a single file, which is not the desired output.
Can someone please let me know what's going wrong?
You are using 2 different concepts to control the output files:
partition: it sets the directories
distribute by: it sets the files in each directory
If you just want to have 4 files in each directory, you can distribute by a random number, for example:
insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by Cast((FLOOR(RAND()*4.0)) as INT);
But I would recommend distributing by some column in your data that you might query by; it can improve your query times.
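For instance, distributing by a column that appears in your typical filters might look like this (a sketch; user_id is a hypothetical column):

insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by user_id;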
You can read more about DISTRIBUTE BY in the Hive documentation.

Passing the table header in Hive transform

I am creating a query in Hive to execute an R script. I am using the transform function to pass the table. However, when I receive the table in R it comes without the header. I know that I could create a variable and ask the user to insert the header manually, but I do not want to do that.
I want to do something automatic, and I am considering two options:
1) Figure out a way to pass the table with the header included when using the transform function
2) Save the header in a variable and pass it in transform (I have already tried this in different ways, but instead of passing the result of the query it passes the query string, as seen below)
Here is what I have:
--Name of the origin table
set source_table = categ_table_small;
--Number of clusters
set k = "5";
--Distance to be used in the model
set distance = "euclidean";
--Folder where the results of the model will be saved
set dir_tar = "/output_r";
--Name of the model used in the naming of the files
set model_name ="testeclara_small";
--Samples: integer, number of samples to be drawn from the dataset.
set n_samples = "10";
--sampsize: integer, number of observations in each sample. This formula is suggested by the package. sampsize<-min(nrow(x), 40 + 2 * k)
set sampsize = "50";
--Creating a matrix which will store the sample number and the group of each sample according to the algorithm
CREATE TABLE IF NOT EXISTS medoids_result AS SELECT * FROM categ_table_small;
--In the normal situation you don't have the output label, it means you just have 'x' and do not have 'y', so you need to add one extra column to receive
--the group of each observation
--ALTER TABLE medoids_result ADD COLUMNS (medoid INT);
set result_matrix = medoids_result;
set headerMatrix = show columns in categ_table_small;
--Training query
SET mapreduce.job.name = K medoids Clara- ${hiveconf:source_table};
SET mapreduce.job.reduces=1;
INSERT OVERWRITE TABLE ${hiveconf:result_matrix}
SELECT TRANSFORM ($begin(cols="${hiveconf:source_table}" delimiter= "," excludes="y")$column$end)
USING '/usr/bin/Rscript_10gb /programs_r/du8_dev_1.R ${hiveconf:k}${hiveconf:distance}${hiveconf:dir_tar}${hiveconf:model_name}${hiveconf:n_samples}${hiveconf:sampsize}${hiveconf:headerMatrix}'
AS
(
$begin(table='${hiveconf:result_matrix}') $column$end
)
FROM
(SELECT *
FROM ${hiveconf:source_table}
DISTRIBUTE BY '1'
)t1;
You can add this line
hive -e 'set hive.cli.print.header=true;select * from tablename;'
where tablename refers to your table name.
If you want this to work by default for every table, you need to add
set hive.cli.print.header=true;
as the first line of your $HOME/.hiverc file.

Hive partitioning not working with dynamic variable

If I run
set hivevar:a = 1;
select * from t1 where partition_variable=${a};
Hive only pulls in the records from the appropriate partition.
Alternatively, if I run
set hivevar:b = 6;
set hivevar:c = 5;
set hivevar:a = ${b}-${c};
select * from t1 where partition_variable=${a};
The condition on partition_variable is treated as a predicate rather than a partition filter, and Hive goes through all records in the table.
This is obviously a contrived example, but in my particular use case it is necessary. Is there any way to force Hive to use this for partition pruning?
Thanks in advance.
Is partition_variable the column on which the table is partitioned? It works with the following:
create table newpart
(productOfMonth string)
partitioned by (month int);
hive> select * from newpart;
OK
Cantaloupes 10
Pumpkin 11
set hivevar:lastmonth = 11;
set hivevar:const = 1;
set hivevar:prevmonth = ${lastmonth}-${const};
hive> select * from newpart
> where month = ${prevmonth};
OK
Cantaloupes 10
I was never able to get partitioning to work properly with dynamically generated Hive variables, but a simple workaround was to create a table containing the variables and join on them rather than using them in the WHERE clause.
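A minimal sketch of that workaround, reusing the newpart example above (the parameter table name is hypothetical):

-- materialize the computed value into a one-row table
create table month_params as
select max(month) - 1 as prevmonth from newpart;

-- join on the parameter instead of referencing a hivevar in the WHERE clause
select n.*
from newpart n
join month_params p on n.month = p.prevmonth;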
