Select rows except the one that contains the min value in Spark using HiveContext - sparkr

I have a Spark DataFrame that contains timestamps and machine IDs. I wish to remove the row with the lowest timestamp value from each group. I tried the following code:
sqlC <- sparkRHive.init(sc)
ts_df2<- sql(sqlC,"SELECT ts,Machine FROM sdf2 EXCEPT SELECT MIN(ts),Machine FROM sdf2 GROUP BY Machine")
But I get the following error:
16/04/06 06:47:52 ERROR RBackendHandler: sql on 35 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: missing EOF at 'SELECT' near 'EXCEPT'; line 1 pos 35
What is the problem? If HiveContext does not support the EXCEPT keyword, what is an equivalent way of doing the same thing in HiveContext?

The Spark 1.6.1 programming guide lists the supported and unsupported Hive features:
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
I don't see EXCEPT in either category. I saw elsewhere that Hive QL doesn't support EXCEPT, or at least did not at that time.
Hive QL Except clause
Perhaps try building a table of the per-machine minimums and then doing a left outer join against it, as in that answer, along these lines:
SELECT t.ts, t.Machine FROM sdf2 t LEFT OUTER JOIN mins m
ON (t.Machine = m.Machine AND t.ts = m.ts) WHERE m.ts IS NULL;
You can also use the SparkR built-in function except(), though I think you would need to create your mins DataFrame first:
exceptDF <- except(df, df2)
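For instance, a minimal SparkR sketch of that approach, assuming sdf2 is registered as a table in the HiveContext sqlC exactly as in the question:
# Full data and per-machine minimums, then take the set difference.
ts_df   <- sql(sqlC, "SELECT ts, Machine FROM sdf2")
mins_df <- sql(sqlC, "SELECT MIN(ts) AS ts, Machine FROM sdf2 GROUP BY Machine")
result  <- except(ts_df, mins_df)   # rows that are not a per-machine minimum
head(result)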

Related

How to Identify total number of jobs required to execute hive query

Is there a way to identify the total number of jobs required to execute a query?
For example, in the two queries below the number of joins and subqueries is the same, but one query requires 2 jobs whereas the other requires 3.
select t1.item_dim_key hive, t2.item_dim_key as monet
from ext_dist_it_dim_key t1
left outer join (select distinct item_dim_key from PO_ITEM_DIM) t2 on t1.item_dim_key=t2.item_dim_key
where t2.item_dim_key is null;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = 20190208020329_258ee4c0-5819-4842-b479-d549c82a0779
**Total jobs = 3**
hive> select t1.item_dim_key hive, t2.item_dim_key as monet
from (select distinct item_dim_key from PO_ITEM_DIM) t1
left outer join ext_dist_it_dim_key t2 on t1.item_dim_key=t2.item_dim_key
where t2.item_dim_key is null;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = 20190208020624_9ea3dc20-ffc8-4461-9516-7a4770d1dd6b
**Total jobs = 2**
Is it possible to know how many jobs it would take for a query to execute?
What are the parameters required to calculate the number of jobs?
Thanks
Use EXPLAIN; it shows the query execution plan, and only the plan can answer this question for sure. Based on statistics or table (file) sizes, the optimizer can convert some joins to map joins, etc.
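For example, prefixing the query with EXPLAIN prints the stage plan; the number of map-reduce stages in it tells you how many jobs will be launched:
EXPLAIN
select t1.item_dim_key hive, t2.item_dim_key as monet
from ext_dist_it_dim_key t1
left outer join (select distinct item_dim_key from PO_ITEM_DIM) t2 on t1.item_dim_key = t2.item_dim_key
where t2.item_dim_key is null;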

Hive partition pruning on computed column

I have a few tables on Hive and my query is trying to retrieve the data for the past x days. Hive is pruning the partitions when I use a direct date, but is doing a full table scan when using a formula instead.
select *
from f_event
where date_key > 20160101;
scanned partitions..
s3://...key=20160102 [f]
s3://...key=20160103 [f]
s3://...key=20160104 [f]
If I use a formula, say, to get the past 4 weeks of data
select count(*)
from f_event f
where date_key > from_unixtime(unix_timestamp() - 2*7*60*60*24, 'yyyyMMdd');
This is scanning all partitions in the table.
Environment: Hadoop 2.6.0, EMR, Hive on S3, Hive 1.0.0
Hive doesn't trigger partition pruning when the filtering expression contains non-deterministic functions such as unix_timestamp().
A good reason for this was mentioned in the discussion:
Imagine a situation where you had:
WHERE partition_column = f(unix_timestamp()) AND ordinary_column = f(unix_timestamp()).
The right-hand side of the predicate has to be evaluated at map time, whereas you're assuming that the left-hand side should be evaluated at compile time, which means you have two different values of unix_timestamp() floating around, which can only end badly.
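One common workaround (a hedged sketch; the variable name here is hypothetical) is to compute the cutoff outside Hive and substitute it in as a literal, so the partition filter is known at compile time and pruning can kick in:
-- invoked e.g. as: hive -hiveconf cutoff_date=20160110 -f query.hql
select count(*)
from f_event f
where date_key > ${hiveconf:cutoff_date};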

How much data is considered "too large" for a Hive MAPJOIN job?

EDIT: added more file size details, and some other session information.
I have a seemingly straightforward Hive JOIN query that surprisingly requires several hours to run.
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key
WHERE a.keyPart BETWEEN b.startKeyPart AND B.endKeyPart;
I'm trying to determine if the execution time is normal for my dataset and AWS hardware selection, or if I am simply trying to JOIN too much data.
Table A: ~2.2 million rows, 12MB compressed, 81MB raw, 4 files.
Table B: ~245 thousand rows, 6.7MB compressed, 14MB raw, one file.
AWS: emr-4.3.0, running on about 5 m3.2xlarge EC2 instances.
Records from A always match one or more records in B, so logically I see that at most 500 billion rows are generated before they are pruned with the WHERE clause.
4 mappers are allocated for the job, which completes in 6 hours. Is this normal for this type of query and configuration? If not, what should I do to improve it?
I've partitioned B on the JOIN key, which yields 5 partitions, but haven't noticed a significant improvement.
Also, the logs show that the Hive optimizer starts a local map join task, presumably to cache or stream the smaller table:
2016-02-07 02:14:13 Starting to launch local task to process map join; maximum memory = 932184064
2016-02-07 02:14:16 Dump the side-table for tag: 1 with group count: 5 into file: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2016-02-07 02:14:17 Uploaded 1 File to: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (12059634 bytes)
2016-02-07 02:14:17 End of local task; Time Taken: 3.71 sec.
What is causing this job to run slowly? The data set doesn't appear too large, and the "small-table" size is well under the 25MB limit above which Hive disables the MAPJOIN optimization.
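For context, the map-join conversion is governed by settings along these lines (the values shown are the usual defaults and may differ by Hive version):
SET hive.auto.convert.join=true;                -- let Hive convert eligible joins to map joins
SET hive.mapjoin.smalltable.filesize=25000000;  -- "small-table" threshold in bytes, roughly 25MB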
A dump of the EXPLAIN output is copied on PasteBin for reference.
My session enables compression for output and intermediate storage. Could this be the culprit?
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
SET io.seqfile.compression.type=BLOCK;
My solution to this problem is to express the JOIN predicate entirely within the JOIN ON clause, as this is the most efficient way to execute a JOIN in Hive. As for why the original query was slow, I believe that the mappers just need time when scanning the intermediate data set row by row, 100+ billion times.
Because Hive only supports equality expressions in the JOIN ON clause and rejects function calls that use both table aliases as parameters, there is no way to rewrite the original query's BETWEEN clause as an algebraic expression. For example, the following expression is illegal:
-- Only handles exclusive BETWEEN
JOIN b ON a.key = b.key
AND sign(a.keyPart - b.startKeyPart) = 1.0 -- keyPart > startKeyPart
AND sign(a.keyPart - b.endKeyPart) = -1.0 -- keyPart < endKeyPart
I ultimately modified my source data to include every value between startKeyPart and endKeyPart in a Hive ARRAY<BIGINT> data type.
CREATE TABLE LookupTable (
  key BIGINT,
  startKeyPart BIGINT,
  endKeyPart BIGINT,
  keyParts ARRAY<BIGINT>
);
Alternatively, I could have generated this value inline within my queries using a custom Java method; however, the LongStream.rangeClosed() method is only available in Java 8, which is not part of Hive 1.0.0 in AWS emr-4.3.0.
Now that I have the entire key space in an array, I can transform the array to a table using LATERAL VIEW and explode(), rewriting the JOIN as follows.
WITH b AS
(
SELECT key, keyPart, value
FROM LookupTable
LATERAL VIEW explode(keyParts) keyPartsTable AS keyPart
)
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key AND a.keyPart = b.keyPart;
The end result is that the above query takes approximately 3 minutes to complete, when compared with the original 6 hours on the same hardware configuration.

Hive Query dumping issue

I am facing difficulties getting a dump (text file delimited by ^) for a query in Hive for my project: sentiment analysis of the stock market using Twitter.
The query whose output I want to write to HDFS or the local file-system is given below:
hive> select t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
The query produces the correct output, but when I try dumping it to HDFS it gives me an error.
I went through various other dumping solutions on Stack Overflow but was not able to find one that suits my case.
Thanks for your help.
INSERT OVERWRITE DIRECTORY '/path/to/dir'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st
ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));
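If you want the dump on the local file-system rather than HDFS, the LOCAL keyword works the same way (a sketch; the path is just an example):
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tweet_dump'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '^'
SELECT t.cmpname,t.datecol,t.tweet,st.diff FROM tweet t LEFT OUTER JOIN stock st
ON(t.datecol = st.datecol AND lower(t.cmpname) = lower(st.cmpname));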

Hive macro not returning expected results

I am using Hive temporary macros to help with date algebra (finding the first day of the prior month in this case) and I am getting unexpected results.
create temporary macro month1st_sub(dt date)
cast(concat(
case
when month(dt) = 1 then cast(year(dt)-1 as string)
else cast(year(dt) as string)
end,
"-",
case
when month(dt) = 1 then "12"
else cast(month(dt)-1 as string)
end,
"-01"
) as date)
;
When I test this macro against a vars table that contains a single value for max_dt (8-15-2014) with the following query:
select
max_dt,
month1st_sub(cast("2013-1-1" as date)),
month1st_sub(max_dt),
month1st_sub(cast("2013-1-1" as date)),
month1st_sub(cast("2013-4-1" as date)),
month1st_sub(cast("2013-5-1" as date)),
month1st_sub(cast("2013-6-1" as date))
from vars;
I receive the following output:
max_dt _c1 _c2 _c3 _c4 _c5 _c6
2013-08-01 2012-12-01 2013-07-01 2012-12-01 2013-03-01 2013-04-01 2013-07-01
The last returned value, 2013-07-01, should be 2013-05-01. This error is reproducible: if I remove the 6-1 line, the 5-1 line then returns 2013-07-01. The issue appears to always be with the last returned value of a set of macro invocations.
The settings I use are as follows:
set hive.cli.print.header=true;
set mapreduce.input.fileinputformat.split.maxsize=10000000;
set hive.auto.convert.join = true;
set hive.exec.dynamic.partition.mode=nonstrict;
Question 1: Am I doing something wrong? If not, is this an issue with Hive, or is it likely to be an environment issue?
Question 2: Is the temporary macro functionality in Hive trustworthy enough to use, or should I be writing Java UDFs to do this?
Old question, I know.
There are a number of significant bugs in the macro implementation, which should be mostly resolved by 2.1.0. From https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL.
Macros have existed as of Hive 0.12.0.
Bug fixes:
Prior to Hive 1.3.0 and 2.0.0: when a HiveQL macro was used more than once while processing the same row, Hive returned the same result for all invocations even though the arguments were different. (See HIVE-11432.)
Prior to Hive 1.3.0 and 2.0.0: when multiple macros were used while processing the same row, an ORDER BY clause could give wrong results. (See HIVE-12277.)
Prior to Hive 2.1.0: when multiple macros were used while processing the same row, results of the later macros were overwritten by those of the first. (See HIVE-13372.)
I would recommend updating Hive to one of these versions to resolve your problem.
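As an aside, on Hive 1.2 and later the built-in date functions can replace this macro entirely (a sketch, assuming add_months() and trunc() with the 'MM' format are available in your version):
-- first day of the month prior to max_dt, with no macro involved
select trunc(add_months(max_dt, -1), 'MM') from vars;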
