Cross Product In HIVE - hadoop

While running a Hive query on MapReduce, my job gets stuck at a particular stage, and I have no idea why it is running so slowly.
I can't post the whole query, so I will post just part of it. I already have tables called TICKET_V and TICKET_R. Now my query is the following...
INSERT OVERWRITE TABLE TICKET_V
SELECT * FROM CUSTOMER AS A
LEFT OUTER JOIN TICKET_R AS B ON A.TICKET_NO= B.TICKET_NO
LEFT OUTER JOIN TICKET_X AS C ON A.COMPANY_ID = C.COMPANY_ID
WHERE SOME CONDITION
Here the TICKET_R, CUSTOMER and TICKET_X tables have 55M, 20M and 2M rows respectively. Everything runs smoothly and I get a TICKET_V table with 0.8M rows.
Now I run another query, which depends on TICKET_V, as follows...
INSERT OVERWRITE TABLE TICKET_R
SELECT * FROM CUSTOMER_R AS C
LEFT OUTER JOIN TICKET_R AS D ON C.TICKET_NO = D.TICKET_NO
WHERE SOME CONDITION
CUSTOMER_R has around 2M rows.
After running this query on the Hive console, I first got the following warnings:
Warning: Map Join MAPJOIN[57][bigTable=?] in task 'stage-14: MAPRED' is a cross product
Warning: Shuffle Join JOIN[31][table=[table alias names]] in stage 'Stage-2: MAPRED' is a cross product
I don't understand why Hive is doing a cross product in the second query when I have given a join condition, while everything runs well in the first query even though the data size is larger.
If somebody could shed more light on the query, it would be helpful. I am very new to MapReduce, and yes, this problem is from my work.
Edits are welcome...!!
Thanks.

Related

ORA-02019: connection description for remote database not found - left join in a view

I have 3 tables:
table1: id, person_code
table2: id, address, person_code_foreing (same as the one from table1), admission_date_1
table3: id, id_table2, admission_date_2, something
(the tables are fictitious)
I'm trying to make a view that takes info from these 3 tables using left joins. I'm doing it like this because the first table has some records that don't have their person_code in the other tables, but I want those records to be returned by the view as well:
CREATE OR REPLACE VIEW schema.my_view AS
SELECT t1.name, t2.adress, t3.something
from schema.table1@ambient1 t1
left join schema.table2@ambient1 t2
on t1.person_code = t2.person_code_foreing
left join schema.table3@ambient1 t3
on t3.id_table2 = t2.id
and t2.admission_date_1 = t3.admission_date_2;
This view needs to be created in another ambient (ambient2).
I tried using a subquery, where I also need a left join, and this is very confusing because I don't get it: are the subquery and the left join together the big no-no, or just the left join?
Has this happened to anyone?
How did you resolve it?
Thanks a lot.
ORA-02019 indicates that your database link (@ambient1) does not exist, or is not visible to the current user. You can confirm this by checking the ALL_DB_LINKS view, which should list all links to which the user has access:
select owner, db_link from all_db_links;
Also keep in mind that Oracle will perform the joins in the database making the call, not the remote database, so you will almost certainly have to pull the entire contents of all three tables over the network to be written into TEMP for the join and then thrown away, every time you run a query. You will also lose the benefit of any indexes on the data and most likely wind up with full table scans on the temp tables within your local database.
I don't know if this is an option for you, but from a performance perspective and given that it isn't joining with anything in the local database, it would make much more sense to create the view in the remote database and just query that through the database link. That way all of the joins are performed efficiently where the data lives, only the result set is pushed over the network, and your client database SQL becomes much simpler.
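A minimal sketch of that approach, reusing the table and column names from the question (the remote view name my_view_remote is an illustrative choice, not from the original post):

```sql
-- In the remote database (the one ambient1 points at), where the tables live:
CREATE OR REPLACE VIEW schema.my_view_remote AS
SELECT t1.name, t2.adress, t3.something
  FROM schema.table1 t1
  LEFT JOIN schema.table2 t2
    ON t1.person_code = t2.person_code_foreing
  LEFT JOIN schema.table3 t3
    ON t3.id_table2 = t2.id;

-- In ambient2, the local view simply selects through the database link,
-- so only the joined result set ever crosses the network:
CREATE OR REPLACE VIEW schema.my_view AS
SELECT * FROM schema.my_view_remote@ambient1;
```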
I managed to make it work, but apparently ambient2 doesn't like my left join, so I used only a subquery and the (+) operator. This is how it worked:
CREATE OR REPLACE VIEW schema.my_view AS
SELECT t1.name, sub.adress, sub.something
from schema.table1@ambient1 t1,
     (select *
        from schema.table3@ambient1 t3,
             schema.table2@ambient1 t2
       where t3.id_table2 = t2.id(+)) sub
where t1.person_code = sub.person_code_foreing(+)
  and (t1.admission_date_1 = sub.admission_date_2 or t1.admission_date_1 is null);
I tested whether a query using a right join works in ambient2 (with 2 tables created there), and it does, so I thought there was a problem with that ambient.
For me it makes no sense that this kind of join raises that error in my case.
Are the versions different?! I don't know, and I can't find any official documentation about it.
Maybe some of you guys have a clue..
It's a mystery to me :))
Thanks.

Ordering of multiple inner joins in a query to improve performance on the basis of where clause Oracle

I am working in Oracle.
I have more than 2 tables to inner join, and as far as I can tell their ordering matters for query performance.
Below is the query :
SELECT *
FROM A a
INNER JOIN B b
ON a.b_ID=b.id
INNER JOIN C c
ON c.id=a.c_Id
INNER JOIN D d
ON a.d_ID=d.id
INNER JOIN E e
ON e.d_id =d.id
where e.name='abc' AND e.company_name='xyz';
In my case I don't need full table scans of tables A, B, C and D.
I want the predicate filter on name and company_name to be applied first, and only then the inner joins of tables A, B, C and D (in the execution plan).
My question is: is that possible?
Also, if I change the order of the inner joins based on the final WHERE clause, can that improve performance (like the query below)?
select * from
E e INNER JOIN D d
ON e.d_id =d.id
INNER JOIN A a
ON a.d_ID=d.id
INNER JOIN B b
ON a.b_ID=b.id
INNER JOIN C c
ON c.id=a.c_Id
where e.name='abc' AND e.company_name='xyz';
Even after applying this change, I found that on some DB environments the execution plan is the same for the two queries.
Is there any way to order the steps of the execution plan during query execution, e.g. by explicitly specifying the join order?
Thanks
If you really want to go for it, you can use an optimizer hint: https://docs.oracle.com/cd/B12037_01/server.101/b10752/hintsref.htm#5555
But generally I would not recommend it - if the table statistics are up to date, the database should be well able to determine the best execution plan on its own (especially for such a simple query).
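As a hedged sketch of what such a hint looks like (table and column names taken from the question), the LEADING hint asks the optimizer to start the join tree at E, so the name/company_name filter is applied before the other tables are touched:

```sql
SELECT /*+ LEADING(e d a b c) */ *
  FROM E e
 INNER JOIN D d ON e.d_id = d.id
 INNER JOIN A a ON a.d_ID = d.id
 INNER JOIN B b ON a.b_ID = b.id
 INNER JOIN C c ON c.id = a.c_Id
 WHERE e.name = 'abc'
   AND e.company_name = 'xyz';
```

The older /*+ ORDERED */ hint achieves a similar effect by following the FROM-clause order literally.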
Thanks @sers
I also found the solution of using /*+ ORDERED */ before the one you provided, but I just wanted some proof of the increased performance.
So I just executed
explain plan for sql_query;
select plan_table_output from table(dbms_xplan.display('plan_table',null,'typical'));
I cannot show the actual table output (so I am omitting table names), but I will post the other performance factors here:
I know that Oracle can figure out the best plan, but forcing the execution plan helped me improve performance. Also, in my case this is the worst-case scenario.

How much data is considered "too large" for a Hive MAPJOIN job?

EDIT: added more file size details, and some other session information.
I have a seemingly straightforward Hive JOIN query that surprisingly requires several hours to run.
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key
WHERE a.keyPart BETWEEN b.startKeyPart AND b.endKeyPart;
I'm trying to determine if the execution time is normal for my dataset and AWS hardware selection, or if I am simply trying to JOIN too much data.
Table A: ~2.2 million rows, 12MB compressed, 81MB raw, 4 files.
Table B: ~245 thousand rows, 6.7MB compressed, 14MB raw, one file.
AWS: emr-4.3.0, running on about 5 m3.2xlarge EC2 instances.
Records from A always match one or more records in B, so logically up to 500 billion rows are generated before they are pruned by the WHERE clause.
4 mappers are allocated for the job, which completes in 6 hours. Is this normal for this type of query and configuration? If not, what should I do to improve it?
I've partitioned B on the JOIN key, which yields 5 partitions, but haven't noticed a significant improvement.
Also, the logs show that the Hive optimizer starts a local map join task, presumably to cache or stream the smaller table:
2016-02-07 02:14:13 Starting to launch local task to process map join; maximum memory = 932184064
2016-02-07 02:14:16 Dump the side-table for tag: 1 with group count: 5 into file: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2016-02-07 02:14:17 Uploaded 1 File to: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (12059634 bytes)
2016-02-07 02:14:17 End of local task; Time Taken: 3.71 sec.
What is causing this job to run slowly? The data set doesn't appear too large, and the "small-table" size is well under the "small-table" limit of 25MB that triggers the disabling of the MAPJOIN optimization.
A dump of the EXPLAIN output is copied on PasteBin for reference.
My session enables compression for output and intermediate storage. Could this be the culprit?
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
SET io.seqfile.compression.type=BLOCK;
My solution to this problem is to express the JOIN predicate entirely within the JOIN ON clause, as this is the most efficient way to execute a JOIN in Hive. As for why the original query was slow, I believe the mappers simply need time to scan the intermediate data set row by row, 100+ billion times.
Due to Hive only supporting equality expressions in the JOIN ON clause and rejecting function calls that use both table aliases as parameters, there is no way to rewrite the original query's BETWEEN clause as an algebraic expression. For example, the following expression is illegal.
-- Only handles exclusive BETWEEN
JOIN b ON a.key = b.key
AND sign(a.keyPart - b.startKeyPart) = 1.0 -- keyPart > startKeyPart
AND sign(a.keyPart - b.endKeyPart) = -1.0 -- keyPart < endKeyPart
I ultimately modified my source data to include every value between startKeyPart and endKeyPart in a Hive ARRAY<BIGINT> data type.
CREATE TABLE LookupTable (
  key BIGINT,
  startKeyPart BIGINT,
  endKeyPart BIGINT,
  keyParts ARRAY<BIGINT>
);
Alternatively, I could have generated this value inline within my queries using a custom Java method; the LongStream.rangeClosed() method is only available in Java 8, which is not part of Hive 1.0.0 in AWS emr-4.3.0.
Now that I have the entire key space in an array, I can transform the array to a table using LATERAL VIEW and explode(), rewriting the JOIN as follows.
WITH b AS
(
SELECT key, keyPart, value
FROM LookupTable
LATERAL VIEW explode(keyParts) keyPartsTable AS keyPart
)
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key AND a.keyPart = b.keyPart;
The end result is that the above query takes approximately 3 minutes to complete, when compared with the original 6 hours on the same hardware configuration.

Write a nested select statement with a where clause in Hive

I have a requirement to do a nested select within a where clause in a Hive query. A sample code snippet would be as follows:
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible, or am I doing something wrong here? I am getting an error while running the above script.
To further elaborate on what I am trying to do: there is a Cassandra keyspace to which I publish statistics with a timestamp. Periodically (hourly, for example) these stats are summarized using Hive; once summarized, that data is stored separately with the corresponding hour. So when the query runs for the second time (and on consecutive runs), it should only process the new data (i.e. timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate Hive table, and then using that value to filter out the raw stats.
Can this be achieved using Hive?
Subqueries inside a WHERE clause are not supported in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM b);
can be rewritten to:
SELECT a.KEY, a.value
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY);
Looking at the business requirements underlying your question, it occurs to me that you might get more efficient results by partitioning your Hive table by hour. If the data can be written with the hour as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they reach the scale of millions, but this seems like a case that will stay well below that limit.
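A minimal sketch of such a layout, with illustrative table and column names that are not taken from the question:

```sql
-- Raw stats partitioned by hour; each hourly summary run then reads
-- exactly one partition instead of filtering the whole table.
CREATE TABLE stats_raw (
  metric STRING,
  value  DOUBLE
)
PARTITIONED BY (stat_hour STRING);

-- Summarizing a single hour lets partition pruning keep the scan small:
SELECT metric, SUM(value) AS total
FROM stats_raw
WHERE stat_hour = '2016-02-07-02'
GROUP BY metric;
```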
It will work if you write:
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION: >, < and = expect exactly one value on the right-hand side, while here the subquery may return multiple values, which can only be handled with the IN clause.

using multiple left outer joins pl/sql

So I have three tables I am trying to pull data from with the following query:
select tats.machine_interval_id as machine_interval_id,
tats.interval_type as interval_type,
tats.interval_category as interval_category,
ops.opstatemnemonic as operational_state,
nptc.categorytype as idle_category,
tats.interval_duration as interval_duration
from temp_agg_time_summary tats
left outer join operationalstates ops on ops.opstateid=tats.operationalstatenumeric
left outer join nptcategories nptc on nptc.categoryid=tats.categorytypenumeric
The problem I'm having is that whenever there is a non-null value from the nptcategories table, the record is doubled, which in turn throws off calculations later in my packages. I believe the problem has to do with having more than one left outer join in the query. My question may seem fairly simple, but I'm new to PL/SQL, so bear with me.
What I want to know is: how can I use multiple left outer joins in a query without this problem occurring? What would be a better way to structure this query?
Update
Okay, so I found the offending line of code; it is below:
left outer join nptcategories nptc on nptc.categoryid=tats.categorytypenumeric
Also, when using DISTINCT, it removes all of the duplicate records, but will using it cause any problems I am unaware of? Should I focus more on figuring out why the join above does not work properly, or is DISTINCT good enough?
Okay, so after taking Wolf's suggestion, I went in and ran the following code:
select categorytype, count(*)
from nptcategories
group by categorytype
having count(*) > 1;
After running this, I found that somehow there were duplicate records in this table, so this was fixed by removing the duplicates and making the ids unique. This was done by running the following script on the DB:
alter table nptcategories add constraint nptcatidunq unique(categoryid);
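As a defensive alternative (a sketch reusing the query from the question), the lookup side can also be deduplicated inline, so a stray duplicate in nptcategories can never multiply rows again:

```sql
select tats.machine_interval_id,
       nptc.categorytype as idle_category
  from temp_agg_time_summary tats
  left outer join (
       -- collapse any duplicate categoryid rows before joining
       select categoryid, min(categorytype) as categorytype
         from nptcategories
        group by categoryid
  ) nptc on nptc.categoryid = tats.categorytypenumeric;
```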
