I am creating a query in Hive to execute an R script. I am using the TRANSFORM function to pass the table; however, when the table arrives in R it comes without the header. I know that I could create a variable and ask the user to enter the header manually, but I do not want to do that.
I want something automatic, so I am considering two options:
1) Figure out a way to pass the table with the header included when using the TRANSFORM function
2) Save the header in a variable and pass it through TRANSFORM (I have already tried this in different ways, but instead of passing the result of the query it passes the query string itself, as seen below)
Here is what I have:
--Name of the origin table
set source_table = categ_table_small;
--Number of clusters
set k = "5";
--Distance to be used in the model
set distance = "euclidean";
--Folder where the results of the model will be saved
set dir_tar = "/output_r";
--Name of the model used in the naming of the files
set model_name ="testeclara_small";
--Samples: integer, number of samples to be drawn from the dataset.
set n_samples = "10";
--sampsize: integer, number of observations in each sample. This formula is suggested by the package. sampsize<-min(nrow(x), 40 + 2 * k)
set sampsize = "50";
--Creating a matrix which will store the sample number and the group of each sample according to the algorithm
CREATE TABLE IF NOT EXISTS medoids_result AS SELECT * FROM categ_table_small;
--In the normal situation you don't have the output label, it means you just have 'x' and do not have 'y', so you need to add one extra column to receive
--the group of each observation
--ALTER TABLE medoids_result ADD COLUMNS (medoid INT);
set result_matrix = medoids_result;
set headerMatrix = show columns in categ_table_small;
--Training query
SET mapreduce.job.name = K medoids Clara- ${hiveconf:source_table};
SET mapreduce.job.reduces=1;
INSERT OVERWRITE TABLE ${hiveconf:result_matrix}
SELECT TRANSFORM ($begin(cols="${hiveconf:source_table}" delimiter= "," excludes="y")$column$end)
USING '/usr/bin/Rscript_10gb /programs_r/du8_dev_1.R ${hiveconf:k} ${hiveconf:distance} ${hiveconf:dir_tar} ${hiveconf:model_name} ${hiveconf:n_samples} ${hiveconf:sampsize} ${hiveconf:headerMatrix}'
AS
(
$begin(table='${hiveconf:result_matrix}') $column$end
)
FROM
(SELECT *
FROM ${hiveconf:source_table}
DISTRIBUTE BY '1'
)t1;
You can run:
hive -e 'set hive.cli.print.header=true;select * from tablename;'
where tablename refers to your table name.
If you want this to be the default for every table, add
set hive.cli.print.header=true;
as the first line of your $HOME/.hiverc file.
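Building on that setting, here is a hedged sketch of option 2 from the question (not from the original post): since set headerMatrix = show columns in ... only stores the query string, you can resolve the header outside Hive in a small shell wrapper and pass the result in as a hiveconf variable. The file name kmedoids_train.hql is hypothetical:
# capture the header line printed by the CLI (limit 1 keeps the output small)
header=$(hive -e 'set hive.cli.print.header=true; select * from categ_table_small limit 1;' | head -1)
# depending on your Hive version, columns may be printed as table.column;
# strip the prefix if your R script expects bare names
hive -hiveconf headerMatrix="$header" -f kmedoids_train.hql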
Related
I must go through the records of a table and display them in multiple textboxes
I am using the table with four different aliases to have four work areas on the same table and four record pointers.
USE Customers ALIAS customers1
USE customers AGAIN ALIAS customers2
USE customers AGAIN ALIAS customers3
USE customers AGAIN ALIAS customers4
Thisform.TxtNomCli.ControlSource = "customers.name"
Thisform.TxtIdent.ControlSource = "customers.identify"
Thisform.TxtAddress.ControlSource = "customers.address"
Thisform.TxtTele.ControlSource = "customers.phone"
Thisform.TxtNomCli2.ControlSource = "customers2.name"
Thisform.TxtIdent2.ControlSource = "customers2.identify"
Thisform.TxtDirec2.ControlSource = "customers2.address"
Thisform.TxtTele2.ControlSource = "customers2.phone"
Thisform.TxtNomCli3.ControlSource = "customers3.name"
Thisform.TxtIdent3.ControlSource = "customers3.identify"
Thisform.TxtDirec3.ControlSource = "customers3.address"
Thisform.TxtTele3.ControlSource = "customers3.phone"
Thisform.TxtNomCli4.ControlSource = "customers4.name"
Thisform.TxtIdent4.ControlSource = "customers4.identify"
Thisform.TxtDirec4.ControlSource = "customers4.address"
Thisform.TxtTele4.ControlSource = "customers4.phone"
How do I move through the records so that customers1 is on the first record, customers2 on the second, customers3 on the third, and customers4 on the fourth record of the table?
How do I make each row of textboxes show the corresponding row of the table?
I would SQL SELECT the id plus whatever other fields you need into four cursors:
select id, identifica, nombre, direccion, telefono from customers ;
into cursor customers1 nofilter readwrite
select id, identifica, nombre, direccion, telefono from customers;
into cursor customers2 nofilter readwrite
* repeat for 3 and 4
Then set your ControlSources() to the cursors, not the base table. If you need to update records you can use the id of the modified record in the cursor to update the correct record in the base table.
You could simply use SET RELATION to achieve what you want. However, in your current code you are not really using four aliases: you are reopening the same table with a different alias in the same work area, so you would end up with a single table open under the alias Customers4. To do it correctly, you need to add the "IN 0" clause to your USE commands, i.e.:
USE customers ALIAS customers1
USE customers IN 0 AGAIN ALIAS customers2
USE customers IN 0 AGAIN ALIAS customers3
USE customers IN 0 AGAIN ALIAS customers4
SELECT customers1
SET RELATION TO ;
RECNO()+1 INTO Customers2, ;
RECNO()+2 INTO Customers3, ;
RECNO()+3 INTO Customers4 IN Customers1
With this setup, as you move the pointer in Customers1 it moves in the other three aliases accordingly (note that no order is set).
Having said that, you should think about why you need to do this. Maybe another control, like a grid, is the way to go? Or an array might be a better way to handle it? i.e., with an array:
USE (_samples+'data\customer') ALIAS customers
LOCAL ARRAY laCustomers[4]
LOCAL ix
FOR ix=1 TO 4
GO m.ix
SCATTER NAME laCustomers[m.ix]
ENDFOR
? laCustomers[1].Cust_id, laCustomers[2].Cust_id, laCustomers[3].Cust_id, laCustomers[4].Cust_id
With this approach, you could set your ControlSources to laCustomers[1].Identify, laCustomers[1].name and so on. When saving back to the data, you would go to the related record and do a GATHER. That would be all.
First you need to think about what you really want to do.
I have created a Hive table using the query below, and I insert data into this table daily using the second query shown below.
create EXTERNAL table IF NOT EXISTS DB.efficacy
(
product string,
TP_Silent INT,
TP_Active INT,
server_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://hdfsadlproduction/user/DB/Report/efficacy';
Insert INTO DB.efficacy
select
product,
SUM(CASE WHEN verdict = 'TP_Silent' THEN 1 ELSE 0 END ),
SUM(CASE WHEN verdict = 'TP_Active' THEN 1 ELSE 0 END ) ,
current_date()
from
DB.efficacy_raw
group by
product
;
The issue is that every day when my insert query executes, it creates a new file in the Hadoop FS. I want each day's query output to be appended to the same single file, but the Hadoop FS contains files named in the following manner.
000000_0, 000000_0_copy_1, 000000_0_copy_2
I have used the Hive settings below:
SET hive.execution.engine=mr;
SET tez.queue.name=${queueName};
SET mapreduce.job.queuename=${queueName};
SET mapreduce.map.memory.mb = 8192;
SET mapreduce.reduce.memory.mb = 8192;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.parallel = true;
SET hive.exec.parallel.thread.number = 2;
SET mapreduce.input.fileinputformat.split.maxsize=2048000000;
SET mapreduce.input.fileinputformat.split.minsize=2048000000;
SET mapreduce.job.reduces = 20;
SET hadoop.security.credential.provider.path=jceks://hdfs/user/efficacy/s3-access/efficacy.jceks;
set hive.vectorized.execution.enabled=false;
set hive.enforce.bucketmapjoin=false;
set hive.optimize.bucketmapjoin.sortedmerge=false;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;
I am a beginner in the Hive and Hadoop world, so please excuse me. Any help will be greatly appreciated.
Note: I am using Hadoop 2.7.3.2.5.0.55-1.
I didn't find any direct mechanism or Hive setting that automatically merges all the small files at the end of the query. Concatenation of small files is currently not supported for tables stored as text files.
As per the comment by "leftjoin" on my post, I created the table in ORC format and then used the CONCATENATE Hive command to merge all the small files into a single big file.
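A minimal sketch of that step, assuming the ORC copy of the table is named DB.efficacy_orc (the name is illustrative); ALTER TABLE ... CONCATENATE merges a table's or partition's small ORC files in place:
-- merge the small ORC files of the table in place
ALTER TABLE DB.efficacy_orc CONCATENATE;
-- for a partitioned table, target one partition instead:
-- ALTER TABLE DB.efficacy_orc PARTITION (server_date='2019-01-01') CONCATENATE;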
I then used the Hive query below to export the data from this single big ORC file into a single text file, and was able to complete my task with the exported text file.
hive#INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;
Courtesy: https://community.hortonworks.com/questions/144122/convert-orc-table-data-into-csv.html
My input consists of a large number of small ORC files which I would like to merge at the end of every day, splitting the data into 100 MB blocks.
My input and output are both S3, and the environment is EMR.
These are the Hive parameters I am setting:
set hive.msck.path.validation=ignore;
set hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 8;
SET hive.exec.stagingdir=/tmp/hive/ ;
SET hive.exec.scratchdir=/tmp/hive/ ;
set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=104857600;
set mapred.reduce.tasks=10;
My Insert Statement:
insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad'
distribute by event_type;
Now the problem is: I have around 80 input files totaling about 500 MB, and after this insert statement I was expecting roughly 4 files in S3, but all of them are getting merged into a single file, which is not the desired output.
Can someone please let me know what's going wrong?
You are using two different concepts to control the output files:
partition: sets the directories
distribute by: sets the files within each directory
If you just want four files in each directory, you can distribute by a random number, for example:
insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by Cast((FLOOR(RAND()*4.0)) as INT);
But I would recommend distributing by some column in your data that you often query by; it can improve your query times (see the sketch after this example). You can read more about DISTRIBUTE BY in the Hive documentation.
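For example, a hedged variant of the same insert that distributes by a data column (user_id is a hypothetical column standing in for whatever you commonly filter or join on):
insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad'
-- hypothetical column; rows sharing a user_id land in the same output file
distribute by user_id;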
If I run
set hivevar:a = 1;
select * from t1 where partition_variable=${a};
Hive only pulls in the records from the appropriate partition.
Alternately if I run
set hivevar:b = 6;
set hivevar:c = 5;
set hivevar:a = ${b}-${c};
select * from t1 where partition_variable=${a};
The condition on partition_variable is treated as a predicate rather than used for partition pruning, and Hive scans all records in the table.
This is obviously a contrived example, but in my particular use case it is necessary. Is there any way to force Hive to use this for partition pruning?
Thanks in advance.
Is partition_variable the column on which the table is partitioned? It works with the following:
create table newpart
(productOfMonth string)
partitioned by (month int);
hive> select * from newpart;
OK
Cantaloupes 10
Pumpkin 11
set hivevar:lastmonth = 11;
set hivevar:const = 1;
set hivevar:prevmonth = ${lastmonth}-${const};
hive> select * from newpart
> where month = ${prevmonth};
OK
Cantaloupes 10
I was never able to get partition pruning to work properly with dynamically generated Hive variables, but a simple workaround was to create a table containing the variables and join on it rather than using them in the where clause (a sketch follows).
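A minimal sketch of that workaround, with illustrative names; the one-row vars table holds the computed value, and the join replaces the ${a} predicate (depending on your Hive version, the CTAS may need a dummy FROM clause):
-- store the computed value once
create table vars as select 6 - 5 as a;
-- join on it instead of interpolating a hivevar
select t1.*
from t1
join vars v on t1.partition_variable = v.a;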
I need to update a value in a table with (its own value) - (the corresponding value in another table) for a given id.
I have tried this
UPDATE FULLTABLE
SET FULLTABLE.Balance = FULLTABLE.Balance - AdvBalance.balance
WHERE FULLTABLE.id= AdvBalance.advid;
and this
update fulltable f set f.balance = ( f.balance -
select a.balance from advbalance a where a.advid=f.advertiserID)
The first one throws an "invalid identifier" error; the second one throws a different error.
I am using an Oracle database.
Please suggest a way to do this.
Thanks
You need a subselect to get the value from another table:
UPDATE FULLTABLE ft
SET ft.Balance = ft.Balance - (SELECT ab.balance
FROM AdvBalance ab
WHERE ft.id = ab.advid);
This will fail if the subquery can return more than one row; in that case you must decide how to pick the right value to subtract. Also note that rows in FULLTABLE with no match in AdvBalance will have Balance set to NULL.
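A hedged variant of the same statement that guards against both pitfalls: SUM collapses multiple matches into one row (assuming subtracting their total is the desired behavior), and NVL leaves unmatched rows unchanged instead of nulling them out:
UPDATE FULLTABLE ft
SET ft.Balance = ft.Balance - NVL((SELECT SUM(ab.balance)
                                   FROM AdvBalance ab
                                   WHERE ft.id = ab.advid), 0);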