Hive Merge Small ORC Files - hadoop

My input consists of a large number of small ORC files, which I would like to merge at the end of every day, splitting the data into 100MB blocks.
My input and output are both on S3, and the environment is EMR.
These are the Hive parameters I am setting:
set hive.msck.path.validation=ignore;
set hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 8;
SET hive.exec.stagingdir=/tmp/hive/  ;
SET hive.exec.scratchdir=/tmp/hive/ ;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=104857600;
set mapred.reduce.tasks=10;
My Insert Statement:
insert into table dev.orc_convert_zzz_18 partition(event_type) select * from dev.events_part_input_18 where event_type = 'ScreenLoad' distribute by event_type;
The problem is that I have around 80 input files totalling about 500MB. After this insert statement I was expecting 4 files in S3, but all of the files are merged into a single file, which is not the desired output.
Can someone please let me know what's going wrong?

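You are using two different concepts to control the output files:
partition: sets the directories
distribute by: sets the files in each directory
Here every selected row has the same event_type, so distributing by event_type sends all rows to a single reducer, which writes a single file. If you just want 4 files in each directory, you can distribute by a random number, for example: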
insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by Cast((FLOOR(RAND()*4.0)) as INT);
However, I would recommend distributing by a column in your data that you are likely to query by, since that can improve your query times.
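The effect of the DISTRIBUTE BY key on the number of output files can be illustrated with a small simulation. This is only a rough Python sketch, not Hive's implementation: it assumes rows are routed to reducers by a hash of the key modulo the reducer count, that each reducer receiving rows writes one file, and that no later merge step runs. The names `reducer_for` and `files_written` are made up for the example.

```python
import random
import zlib

NUM_REDUCERS = 10  # mirrors mapred.reduce.tasks=10 from the question


def reducer_for(key, num_reducers=NUM_REDUCERS):
    """Simplified hash partitioning: stable hash of the key modulo reducer count."""
    if isinstance(key, int):
        hashed = key  # assumption: an integer key hashes to its own value
    else:
        hashed = zlib.crc32(str(key).encode("utf-8"))
    return hashed % num_reducers


def files_written(keys):
    """Each reducer that receives at least one row writes one output file."""
    return len({reducer_for(k) for k in keys})


rng = random.Random(0)
rows = 1000

# DISTRIBUTE BY event_type, with every row filtered to 'ScreenLoad'
by_event_type = ["ScreenLoad"] * rows

# DISTRIBUTE BY CAST(FLOOR(RAND() * 4.0) AS INT)
by_random_bucket = [rng.randrange(4) for _ in range(rows)]

print(files_written(by_event_type))     # 1
print(files_written(by_random_bucket))  # 4
```

With a single distinct key, every row lands on the same reducer no matter how many reducers are configured, which matches the single output file seen in the question.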

Related

failed to make query faster in hive using reducer and mappers

I've been working on transferring data from Oracle into Hive in Parquet format.
The table definition looks like this:
create external table if not exists
partitioned by(datee string)
stored as parquet
tblproperties ('hive.exec.compress.output'='true', 'parquet.compression'='SNAPPY')
My question is how to tune this query, which takes about 4 hours, so that it runs in about 5 minutes.
I have already tried putting these combinations of settings before my query:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.enabled=false;
set hive.vectorized.execution.reduce.enabled=false;
set hive.execution.engine=tez;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
But none of these made any difference.
Any ideas would be appreciated!

How to constrain Hive query output to always be in a single file

I have created a Hive table using the first query below, and I insert data into this table daily using the second query:
create EXTERNAL table IF NOT EXISTS DB.efficacy
(
product string,
TP_Silent INT,
TP_Active INT,
server_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://hdfsadlproduction/user/DB/Report/efficacy';
Insert INTO DB.efficacy
select
product,
SUM(CASE WHEN verdict = 'TP_Silent' THEN 1 ELSE 0 END ),
SUM(CASE WHEN verdict = 'TP_Active' THEN 1 ELSE 0 END ) ,
current_date()
from
DB.efficacy_raw
group by
product
;
The issue is that every day, when my insert query executes, it creates a new file in the Hadoop FS. I want each day's query output to be appended to the same single file, but the Hadoop FS contains files like the following:
000000_0, 000000_0_copy_1, 000000_0_copy_2
I have used the following Hive settings:
SET hive.execution.engine=mr;
SET tez.queue.name=${queueName};
SET mapreduce.job.queuename=${queueName};
SET mapreduce.map.memory.mb = 8192;
SET mapreduce.reduce.memory.mb = 8192;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.parallel = true;
SET hive.exec.parallel.thread.number = 2;
SET mapreduce.input.fileinputformat.split.maxsize=2048000000;
SET mapreduce.input.fileinputformat.split.minsize=2048000000;
SET mapreduce.job.reduces = 20;
SET hadoop.security.credential.provider.path=jceks://hdfs/user/efficacy/s3-access/efficacy.jceks;
set hive.vectorized.execution.enabled=false;
set hive.enforce.bucketmapjoin=false;
set hive.optimize.bucketmapjoin.sortedmerge=false;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;
I am a beginner with Hive and Hadoop, so please excuse any mistakes. Any help will be greatly appreciated.
Note: I am using Hadoop 2.7.3.2.5.0.55-1
I didn't see any direct mechanism or Hive setting that automatically merges all the small files at the end of the query. Concatenation of small files is currently not supported for tables stored as text files.
Following the comment by "leftjoin" on my post, I created the table in ORC format and then used the Hive CONCATENATE command to merge all the small files into a single big file.
I then used the Hive query below to export the data from this single big ORC file into a single text file, and was able to complete my task with the exported text file.
INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;
Courtesy:- https://community.hortonworks.com/questions/144122/convert-orc-table-data-into-csv.html

Passing the table header in Hive transform

I am creating a Hive query that executes an R script, using the TRANSFORM function to pass the table to it. However, when R receives the table, it arrives without the header. I know I could create a variable and ask the user to enter the header manually, but I don't want to do that.
I want this to happen automatically, and I am considering two options:
1) Figure out a way to pass the table with the header included when using the TRANSFORM function
2) Save the header in a variable and pass it to TRANSFORM (I have already tried this in different ways, but instead of passing the result of the query it passes the query string, as seen below)
Here is what I have:
--Name of the origin table
set source_table = categ_table_small;
--Number of clusters
set k = "5";
--Distance to be used in the model
set distance = "euclidean";
--Folder where the results of the model will be saved
set dir_tar = "/output_r";
--Name of the model used in the naming of the files
set model_name ="testeclara_small";
--Samples: integer, number of samples to be drawn from the dataset.
set n_samples = "10";
--sampsize: integer, number of observations in each sample. This formula is suggested by the package. sampsize<-min(nrow(x), 40 + 2 * k)
set sampsize = "50";
--Creating a matrix which will store the sample number and the group of each sample according to the algorithm
CREATE TABLE IF NOT EXISTS medoids_result AS SELECT * FROM categ_table_small;
--In the normal situation you don't have the output label, it means you just have 'x' and do not have 'y', so you need to add one extra column to receive
--the group of each observation
--ALTER TABLE medoids_result ADD COLUMNS (medoid INT);
set result_matrix = medoids_result;
set headerMatrix = show columns in categ_table_small;
--Training query
SET mapreduce.job.name = K medoids Clara- ${hiveconf:source_table};
SET mapreduce.job.reduces=1;
INSERT OVERWRITE TABLE ${hiveconf:result_matrix}
SELECT TRANSFORM ($begin(cols="${hiveconf:source_table}" delimiter= "," excludes="y")$column$end)
USING '/usr/bin/Rscript_10gb /programs_r/du8_dev_1.R ${hiveconf:k}${hiveconf:distance}${hiveconf:dir_tar}${hiveconf:model_name}${hiveconf:n_samples}${hiveconf:sampsize}${hiveconf:headerMatrix}'
AS
(
$begin(table='${hiveconf:result_matrix}') $column$end
)
FROM
(SELECT *
FROM ${hiveconf:source_table}
DISTRIBUTE BY '1'
)t1;
You can add this line:
hive -e 'set hive.cli.print.header=true; select * from tablename;'
where tablename refers to your table name.
If you want this to apply by default to every table, add
set hive.cli.print.header=true;
as the first line of the $HOME/.hiverc file.

Hive partitioning not working with dynamic variable

If I run
set hivevar:a = 1;
select * from t1 where partition_variable=${a};
Hive only pulls in the records from the appropriate partition.
Alternately if I run
set hivevar:b = 6;
set hivevar:c = 5;
set hivevar:a = ${b}-${c};
select * from t1 where partition_variable=${a};
The condition on partition_variable is treated as a predicate rather than a partition, and hive goes through all records in the table.
This is obviously a contrived example, but in my particular use case it is necessary. Is there any way to force Hive to use this for partitioning?
Thanks in advance.
Is the partition variable the column on which the table is partitioned? It works with the following:
create table newpart
(productOfMonth string)
partitioned by (month int);
hive> select * from newpart;
OK
Cantaloupes 10
Pumpkin 11
set hivevar:lastmonth = 11;
set hivevar:const = 1;
set hivevar:prevmonth = ${lastmonth}-${const};
hive> select * from newpart
> where month = ${prevmonth};
OK
Cantaloupes 10
I was never able to get partitioning to work properly with dynamically generated hive variables, but a simple workaround was to create a table containing the variables and join on them rather than using them in the where clause.
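One way to read the difference between the two cases in the question: variable substitution is textual, so ${a} expands to the expression 6-5 rather than to the literal 1. The Python sketch below only mimics that textual expansion and is not Hive's own code; `substitute` is a hypothetical helper. Computing the value before Hive sees it (for example in a wrapper script that passes it with --hivevar) leaves a plain literal in the WHERE clause, though whether that restores partition pruning in a given Hive version is an assumption to verify.

```python
import re

VAR_PATTERN = re.compile(r"\$\{(\w+)\}")


def substitute(text, variables, max_passes=10):
    """Textually expand ${name} references, repeating until nothing changes."""
    for _ in range(max_passes):
        expanded = VAR_PATTERN.sub(
            lambda m: variables.get(m.group(1), m.group(0)), text
        )
        if expanded == text:
            break
        text = expanded
    return text


query = "select * from t1 where partition_variable=${a}"

# Mirrors: set hivevar:b = 6; set hivevar:c = 5; set hivevar:a = ${b}-${c};
variables = {"b": "6", "c": "5", "a": "${b}-${c}"}
print(substitute(query, variables))
# select * from t1 where partition_variable=6-5

# Value computed up front, as a wrapper script might do before calling Hive
precomputed = {"a": str(6 - 5)}
print(substitute(query, precomputed))
# select * from t1 where partition_variable=1
```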

SQLPlus - spooling to multiple files from PL/SQL blocks

I have a query that returns a lot of data into a CSV file. So much, in fact, that Excel can't open it - there are too many rows. Is there a way to control spool to spool to a new file every time 65000 rows have been processed? Ideally, I'd like to have my output in files named in sequence, such as large_data_1.csv, large_data_2.csv, large_data_3.csv, etc...
I could use dbms_output in a PL/SQL block to control how many rows are output, but then how would I switch files, as spool does not seem to be accessible from PL/SQL blocks?
(Oracle 10g)
UPDATE:
I don't have access to the server, so writing files to the server would probably not work.
UPDATE 2:
Some of the fields contain free-form text, including linebreaks, so counting line breaks AFTER the file is written is not as easy as counting records WHILE the data is being returned...
Got a solution, don't know why I didn't think of this sooner...
The basic idea is that the master sqlplus script generates an intermediate script that will split the output to multiple files. Executing the intermediate script will execute multiple queries with different ranges imposed on rownum, and spool to a different file for each query.
set termout off
set serveroutput on
set echo off
set feedback off
variable v_err_count number;
spool intermediate_file.sql
declare
i number := 0;
v_fileNum number := 1;
v_range_start number := 1;
v_range_end number := 1;
k_max_rows constant number := 65536;
begin
dbms_output.enable(10000);
select count(*)
into :v_err_count
from ...
/* You don't need to see the details of the query... */
while i < :v_err_count loop
v_range_start := i+1;
if v_range_start <= :v_err_count then
i := i+k_max_rows;
v_range_end := i;
dbms_output.put_line('set colsep ,
set pagesize 0
set trimspool on
set headsep off
set feedback off
set echo off
set termout off
set linesize 4000
spool large_data_file_'||v_fileNum||'.csv
select data_string
from (select rownum rn, data_object
from
/* Details of query omitted */
)
where rn >= '||v_range_start||' and rn <= '||v_range_end||';
spool off');
v_fileNum := v_fileNum +1;
end if;
end loop;
end;
/
spool off
prompt executing intermediate file
@intermediate_file.sql;
set serveroutput off
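The range arithmetic in the loop above can be checked outside the database. The following Python sketch is an illustration only: it reproduces the rownum windows and file names the intermediate script would contain, and `row_ranges` and `spool_targets` are hypothetical helper names. As in the PL/SQL, the end of the last window is not clamped to the row count, which is harmless because the query simply returns fewer rows.

```python
def row_ranges(total_rows, max_rows=65536):
    """Return (start, end) rownum windows covering rows 1..total_rows."""
    ranges = []
    start = 1
    while start <= total_rows:
        end = start + max_rows - 1  # not clamped, same as the PL/SQL loop
        ranges.append((start, end))
        start = end + 1
    return ranges


def spool_targets(total_rows, max_rows=65536):
    """Pair each rownum window with the file it would be spooled to."""
    return [
        (f"large_data_file_{n}.csv", start, end)
        for n, (start, end) in enumerate(row_ranges(total_rows, max_rows), start=1)
    ]


for name, start, end in spool_targets(150000):
    print(name, start, end)
# large_data_file_1.csv 1 65536
# large_data_file_2.csv 65537 131072
# large_data_file_3.csv 131073 196608
```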
Try this for a pure SQL*Plus solution...
set pagesize 0
set trimspool on
set headsep off
set feedback off
set echo off
set verify off
set timing off
set linesize 4000
DEFINE rows_per_file = 50
-- Create an sql file that will create the individual result files
SET DEFINE OFF
SPOOL c:\temp\generate_one.sql
PROMPT COLUMN which_dynamic NEW_VALUE dynamic_filename
PROMPT
PROMPT SELECT 'c:\temp\run_#'||TO_CHAR( &1, 'fm000' )||'_result.txt' which_dynamic FROM dual
PROMPT /
PROMPT SPOOL &dynamic_filename
PROMPT SELECT *
PROMPT FROM ( SELECT a.*, rownum rnum
PROMPT FROM ( SELECT object_id FROM all_objects ORDER BY object_id ) a
PROMPT WHERE rownum <= ( &2 * 50 ) )
PROMPT WHERE rnum >= ( ( &3 - 1 ) * 50 ) + 1
PROMPT /
PROMPT SPOOL OFF
SPOOL OFF
SET DEFINE &
-- Define variable to hold number of rows
-- returned by the query
COLUMN num_rows NEW_VALUE v_num_rows
-- Find out how many rows there are to be
SELECT COUNT(*) num_rows
FROM ( SELECT LEVEL num_files FROM dual CONNECT BY LEVEL <= 120 );
-- Create a master file with the correct number of sql files
SPOOL c:\temp\run_all.sql
SELECT '@c:\temp\generate_one.sql '||TO_CHAR( num_files )
||' '||TO_CHAR( num_files )
||' '||TO_CHAR( num_files ) file_name
FROM ( SELECT LEVEL num_files
FROM dual
CONNECT BY LEVEL <= CEIL( &v_num_rows / &rows_per_file ) )
/
SPOOL OFF
-- Now run them all
@c:\temp\run_all.sql
Use split on the resulting file.
utl_file is the package you are looking for. You can write a cursor and loop over the rows (writing them out) and when mod(num_rows_written,num_per_file) == 0 it's time to start a new file. It works fine within PL/SQL blocks.
Here's the reference for utl_file:
http://www.adp-gmbh.ch/ora/plsql/utl_file.html
NOTE:
I'm assuming here, that it's ok to write the files out to the server.
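The "start a new file every N rows" logic described above is independent of utl_file. As a rough client-side analogue (a sketch in Python, not a PL/SQL implementation), the same modulo check can drive file rotation while rows are being fetched. The function name `write_chunked_csv` and the sample rows are made up; in practice the rows would come from a database cursor. Because the csv module quotes embedded line breaks, records are counted as they are written rather than by counting lines afterwards.

```python
import csv
import os
import tempfile


def write_chunked_csv(rows, out_dir, rows_per_file=65000, prefix="large_data_"):
    """Write rows to numbered CSV files, starting a new file every rows_per_file rows."""
    paths = []
    handle = None
    writer = None
    try:
        for count, row in enumerate(rows):
            if count % rows_per_file == 0:  # time to start a new file
                if handle is not None:
                    handle.close()
                path = os.path.join(out_dir, f"{prefix}{len(paths) + 1}.csv")
                handle = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(handle)
                paths.append(path)
            writer.writerow(row)
    finally:
        if handle is not None:
            handle.close()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # Free-form text with embedded line breaks, as mentioned in the question
    sample = [(i, f"free text\nline {i}") for i in range(7)]
    created = write_chunked_csv(sample, tmp, rows_per_file=3)
    print([os.path.basename(p) for p in created])
    # ['large_data_1.csv', 'large_data_2.csv', 'large_data_3.csv']
```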
Have you looked at setting up an external data connection in Excel (assuming that the CSV files are only being produced for use in Excel)? You could define an Oracle view that limits the rows returned and also add some parameters in the query to allow the user to further limit the result set. (I've never understood what someone does with 64K rows in Excel anyway).
I feel that this is somewhat of a hack, but you could also use UTL_MAIL and generate attachments to email to your user(s). There's a 32K size limit to the attachments, so you'd have to keep track of the size in the cursor loop and start a new attachment on this basis.
While your question asks how to break the great volume of data into chunks Excel can handle, I would ask if there is any part of the Excel operation that can be moved into SQL (PL/SQL?) that can reduce the volume of data. Ultimately it has to be reduced to be made meaningful to anyone. The database is a great engine to do that work on.
When you have reduced the data to more presentable volumes or even final results, dump it for Excel to make the final presentation.
This is not the answer you were looking for but I think it is always good to ask if you are using the right tool when it is getting difficult to get the job done.

Resources