Inner join two data sets using Apache Hadoop Pig - hadoop

I have two data sets (1M unique strings and 1B unique strings). I want to know how many strings are common to both sets. What is the most efficient way to get that number using Apache Pig?

You can first join both files like below:
A = LOAD '/joindata1.txt' AS (a1:int,a2:int,a3:int);
B = LOAD '/joindata2.txt' AS (b1:int,b2:int);
X = JOIN A BY a1, B BY b1;
Then you can count the number of rows:
grouped_records = GROUP X ALL;
count_records = FOREACH grouped_records GENERATE COUNT(X);
Does that help with your problem?
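Putting it together for the original question (string keys rather than ints), a minimal Pig sketch could look like this; the file paths are hypothetical, and each file is assumed to hold one string per line:

```
-- hypothetical paths; one string per line in each file
A = LOAD '/set1.txt' AS (s:chararray);
B = LOAD '/set2.txt' AS (t:chararray);
X = JOIN A BY s, B BY t;             -- one row per common string
common = DISTINCT (FOREACH X GENERATE A::s);
grouped = GROUP common ALL;
cnt = FOREACH grouped GENERATE COUNT(common);
DUMP cnt;
```

Since both inputs are stated to contain unique strings, the join already yields exactly one row per common string; the DISTINCT is just a safety net if that assumption ever breaks.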

Your case doesn't fall under replicated, merge, or skewed join, so you have to do a default hash join. In the map phase, each record is annotated with its source; the join key is used as the shuffle key, so records with the same join key go to the same reducer; on the reducer side, the leftmost input is cached in memory and the other input is streamed through to perform the join. You can also improve the join with the usual optimizations: filter out NULL keys before joining, and keep the table with the largest number of tuples per key as the last table in your query.

If the data in both data sets is already sorted, you can use a merge join:
merged = JOIN A BY a1, B BY b1 USING 'merge';
Skewed join: if the data is skewed and you need finer control over the allocation to reducers:
skewedh = JOIN A BY a1, B BY b1 USING 'skewed';

Related

Consecutive JOIN and aliases: order of execution

I am trying to use FULLTEXT search as a preliminary filter before fetching data from another table. Consecutive JOINs follow to further refine the query and to mix-and-match rows (in reality there are up to 6 JOINs of the main table).
The first "filter" returns the IDs of the rows that are useful, so after joining I have a subset to continue with. My issue is performance, however, and my lack of understanding of how the SQL query is executed in SQLite.
SELECT *
FROM mytbl AS t1
JOIN
(SELECT someid
FROM myftstbl
WHERE
myftstbl MATCH 'MATCHME') AS prior
ON
t1.someid = prior.someid
AND t1.othercol = 'somevalue'
JOIN mytbl AS t2
ON
t2.someid = prior.someid
/* Or is this faster? t2.someid = t1.someid */
My thought process for the query above is that first, we retrieve the matched IDs from the myftstbl table and use those to JOIN on the main table t1 to get a sub-selection. Then we again JOIN a duplicate of the main table as t2. The part that I am unsure of is which approach would be faster: using the IDs from the matches, or from t2?
In other words: when I refer to t1.someid inside the second JOIN, does it contain only the someids left after the first JOIN (i.e. only those at the intersection of prior and the rows for which t1.othercol = 'somevalue'), OR does it contain all the someids of the whole original table?
You can assume that all columns are indexed. In fact, when I use one or the other approach, I find with EXPLAIN QUERY PLAN that different indices are being used for each query. So there must be a difference between the two.
The query can be simplified to
SELECT *
FROM mytbl AS t1
JOIN myftstbl USING (someid) -- or ON t1.someid = myftstbl.someid
JOIN mytbl AS t2 USING (someid) -- or ON t1.someid = t2.someid
WHERE myftstbl.{???} MATCH 'MATCHME' -- replace {???} with correct column name
AND t1.othercol = 'somevalue'
P.S. The query logic is not clear to me, so I have kept it as-is.
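To check the actual question (does joining t2 on prior.someid differ from joining on t1.someid?), here is a small sqlite3 experiment. Table and column names follow the question, but for portability the FTS virtual table is replaced by a plain table queried with LIKE, since the point here is join equivalence, not full-text matching:

```python
# Sketch: compare the two join formulations from the question on toy data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytbl (someid INTEGER, othercol TEXT, payload TEXT);
CREATE TABLE myftstbl (someid INTEGER, body TEXT);  -- stand-in for the FTS table
INSERT INTO mytbl VALUES (1, 'somevalue', 'a'), (2, 'somevalue', 'b'), (1, 'other', 'c');
INSERT INTO myftstbl VALUES (1, 'MATCHME here'), (3, 'no match');
""")

q = """
SELECT t1.someid, t1.payload, t2.payload
FROM mytbl AS t1
JOIN (SELECT someid FROM myftstbl WHERE body LIKE '%MATCHME%') AS prior
  ON t1.someid = prior.someid AND t1.othercol = 'somevalue'
JOIN mytbl AS t2 ON {cond}
ORDER BY 1, 2, 3
"""
rows_prior = conn.execute(q.format(cond="t2.someid = prior.someid")).fetchall()
rows_t1 = conn.execute(q.format(cond="t2.someid = t1.someid")).fetchall()

# For an inner-join chain the two conditions are logically equivalent,
# because the first ON clause already forces t1.someid = prior.someid.
assert rows_prior == rows_t1
print(rows_prior)
```

Both formulations must return the same rows; the planner may still pick different indices for each, which is why EXPLAIN QUERY PLAN shows a difference even though the results are identical.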

Pig Latin JOIN error

I am loading two datasets A, B
A= LOAD [datapath]
B= LOAD [datapath]
I want to JOIN all fields of both A and B by the id field. Both A and B have the common field id plus other fields. When I perform the JOIN by id:
AB= JOIN A by id, B by id;
The resulting dataset AB includes two identical columns for the id field, but it should show only one column for id. What am I doing wrong here?
That's the expected behaviour: when joining two datasets, all columns are included (even the ones you are joining by).
If you want to drop a column you can do it with a GENERATE statement, but first you need to know the position of the undesired column.
If that column is, for instance, at position $3 (positions are 0-based):
C = FOREACH AB GENERATE $0, $1, $2, $4, $5 ...;
Edit from the comments
You can also use a GENERATE statement without relying on positions. Example:
C = FOREACH AB GENERATE A::id AS id, A::foo AS foo, B::bar AS bar;

How to efficiently select data from two tables?

I have two tables: A, B.
A has prisoner_id and prisoner_name columns.
B has all other info about prisoners, including a prisoner_name column.
First I select all of the data that I need from B:
WITH prisoner_datas AS
(SELECT prisoner_name, ... FROM B WHERE ...)
Then I want to know the IDs of all my prisoner_datas rows. To do this I need to combine the information by the prisoner_name column, because it's common to both tables.
I did the following
SELECT A.prisoner_id, prisoner_datas.prisoner_name, prisoner_datas. ...,
FROM A, prisoner_datas
WHERE A.prisoner_name = prisoner_datas.prisoner_name
But it works very slowly. How can I improve performance?
Add an index on the prisoner_name join column in the B table. Then the following join should have some performance improvement:
SELECT
A.prisoner_id,
B.prisoner_name -- and other B columns as needed
FROM A
INNER JOIN B
ON A.prisoner_name = B.prisoner_name
Note that I used explicit join syntax here. It isn't required, and the query plan might not change, but it makes the query easier to read. I don't think the CTE changes much; the lack of an index on the join column is the important factor here.
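The advice above can be sketched with sqlite3; the table and column names follow the question, while the cell column and the sample data are hypothetical fillers for "all other info":

```python
# Sketch: add an index on the join column, then run the explicit INNER JOIN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (prisoner_id INTEGER, prisoner_name TEXT);
CREATE TABLE B (prisoner_name TEXT, cell TEXT);  -- cell is a made-up extra column
INSERT INTO A VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO B VALUES ('Alice', 'C1'), ('Carol', 'C2');
CREATE INDEX idx_b_name ON B(prisoner_name);  -- index on the join column
""")

rows = conn.execute("""
    SELECT A.prisoner_id, B.prisoner_name, B.cell
    FROM A
    INNER JOIN B ON A.prisoner_name = B.prisoner_name
""").fetchall()
print(rows)  # only the name present in both tables matches
```

With the index in place the engine can look up matching B rows by prisoner_name instead of scanning B for every row of A, which is where the speedup comes from.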

Can Mapside Join and Reduce side join have different O/P

The below code is in PROD and runs daily; I am trying to optimize it.
I see that set hive.auto.convert.join=FALSE; makes it do a reduce-side join, which runs for 2.5 hours and produces a row count of 2324381 records.
If I set hive.auto.convert.join=TRUE; then it does a map-side join, runs for only 20 minutes, and produces a row count of 5766529 records.
I need to know why the row counts differ. Is this correct? Is it okay that the row counts differ? I was under the impression that the output of the query should remain the same irrespective of which join happens.
The source data remains the same in both cases, and every other condition is the same except for the Hive setting I am changing.
INSERT OVERWRITE TABLE krish
SELECT
s.svcrqst_id,
s.svcrqst_lupdusr_id,
s.svcrqst_lstupd_dts as svcrqst_lupdt,
f.crsr_lupdt,
s.svcrqst_crt_dts,
s.svcrqst_asrqst_ind,
s.svcrtyp_cd,
s.svrstyp_cd,
s.asdplnsp_psuniq_id as psuniq_id,
s.svcrqst_rtnorig_in,
s.cmpltyp_cd,
s.catsrsn_cd,
s.apealvl_cd,
s.cnstnty_cd,
s.svcrqst_vwasof_dt,
f.crsr_master_claim_index,
t.svcrqct_cds,
r.sum_reason_cd,
r.sum_reason
from
table1 s
left outer join
(
select distinct
lpad(trim(i_srtp_sr_sbtyp_cd), 3, '0') as i_srtp_sr_sbtyp_cd,
lpad(trim(i_srtp_sr_typ_cd), 3, '0') as i_srtp_sr_typ_cd,
sum_reason_cd,
sum_reason
from table2
) r
on lpad(trim(s.svcrtyp_cd), 3, '0')=r.i_srtp_sr_typ_cd
and lpad(trim(s.svrstyp_cd), 3, '0')=r.i_srtp_sr_sbtyp_cd
left outer join table3 f
on trim(s.svcrqst_id)=trim(f.crsr_sr_id)
left outer join table4 t
on t.svcrqst_id=s.svcrqst_id
where
( year(s.svcrqst_lstupd_dts)=${hiveconf:YEAR} and month(s.svcrqst_lstupd_dts)=${hiveconf:MONTH} and day(s.svcrqst_lstupd_dts)=${hiveconf:DAY} )
or
( year(f.crsr_lupdt)=${hiveconf:YEAR} and month(f.crsr_lupdt)=${hiveconf:MONTH} and day(f.crsr_lupdt)=${hiveconf:DAY} )
;
After doing some more research with my data, I recreated all my source tables partitioned and bucketed on the same column and reran my HQL.
This time the row counts for the map-side join and the reduce-side join came out the same.
I think that in the previous query, since the data was not partitioned, the map-side and reduce-side joins produced different output.

Multiple table join in hive

I have migrated Teradata tables' data into Hive.
Now I have to build summary tables on top of the imported data. The summary table needs to be built from five source tables.
If I go with joins I'll need to join five tables. Is that possible in Hive, or should I break the query into five parts?
What is the advisable approach for this problem?
Please suggest.
Five-way joins in Hive are of course possible, and also (naturally) likely to be slow to very slow.
You should consider co-partitioning the tables on
identical partition columns
identical number of partitions
Other options include hints. For example, if one of the tables is large and the others are small, you may be able to use the STREAMTABLE hint.
Assuming a is large:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) join d on (d.key = c.key) join e on (e.key = d.key)
Adapted from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins:
All five tables are joined in a single map/reduce job and the values
for a particular value of the key for tables b, c, d, and e are
buffered in the memory in the reducers. Then for each row retrieved
from a, the join is computed with the buffered rows. If the
STREAMTABLE hint is omitted, Hive streams the rightmost table in the
join.
Another hint is MAPJOIN, which is useful for caching small tables in memory.
Assuming a is large and b, c, d, e are small enough to fit in the memory of each mapper:
SELECT /*+ MAPJOIN(b,c,d,e) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
join d on (d.key = c.key) join e on (e.key = d.key)
Yes, you can join multiple tables in a single query. This allows many opportunities for Hive to make optimizations that couldn't be done if you broke it into separate queries.
