Shuffle Join in Hive optimisation and understanding

Warning: Shuffle Join JOIN[38][tables = [a, b]] in Stage 'Stage-2:MAPRED' is a cross product
What does shuffle join mean?
What does JOIN[38][tables = [a, b]] signify? Does index 38 mean something special? Can I use this statement to reach the part of my query which needs optimisation?
PS: I have multiple shuffle joins in my query.

Related

fast and concurrent algorithm of frequency calculation in elixir

I have two big lists whose items vary in length. Each list contains millions of items.
I want to count how often each item of the first list occurs in the second list.
For example:
a = [[c, d], [a, b, e]]
b = [[a, d, c], [e, a, b], [a, d], [c, d, a]]
# expected result of calculate_frequency(a, b) is %{[c, d] => 2, [a, b, e] => 1} Or [{[c, d], 2}, {[a, b, e], 1}]
Due to the large size of the lists, I would like this process to be done concurrently.
So I wrote this function:
def calculate_frequency(items, data_list) do
  items
  |> Task.async_stream(
    fn item ->
      frequency =
        data_list
        |> Enum.reduce(0, fn data_row, acc ->
          if item -- data_row == [] do
            acc + 1
          else
            acc
          end
        end)
      {item, frequency}
    end,
    ordered: false
  )
  |> Enum.reduce([], fn {:ok, merged}, merged_list -> [merged | merged_list] end)
end
But this algorithm is slow. What should I do to make it fast?
PS: The types of the inputs and outputs do not matter; execution speed is what counts.

I'm not sure if this is fast enough, and it certainly isn't concurrent. It's O(m + n), where m is the size of items and n is the size of data_list. I can't find a faster concurrent way, because combining the results of all the sub-processes also takes time.
data_list
|> Enum.reduce(%{}, fn item, counts ->
  Map.update(counts, item, 1, &(&1 + 1))
end)
|> Map.take(items)
FYI, doing things concurrently does not necessarily mean doing things in parallel. If you have only one CPU core, concurrency actually slows things down because one CPU core can only do one thing at a time.

Put one list into a MapSet.
Go through the second list and see whether or not each element is in the MapSet.
This is linear in the lengths of the lists, and both operations should be able to be parallelized.
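A minimal Python sketch of the set-membership idea above, offered as an illustration rather than a tuned implementation. The names `normalize` and `count_matches` are hypothetical, and rows are treated as equal when they hold the same elements in any order. Note that this counts whole-row matches, which is how this answer reads the problem; the subset test in the question's code (`item -- data_row == []`) would need a different check.

```python
from collections import Counter


def normalize(row):
    # Order-insensitive key: sort the elements and freeze them as a tuple
    # so the row can be stored in a set or used as a dict key.
    return tuple(sorted(row))


def count_matches(items, data_list):
    # Build the lookup set once: O(m) for m items.
    wanted = {normalize(item) for item in items}
    counts = Counter()
    # Single pass over the data: O(n) set lookups for n rows.
    for row in data_list:
        key = normalize(row)
        if key in wanted:
            counts[key] += 1
    # Report every requested item, including those never seen.
    return {key: counts[key] for key in wanted}


items = [["c", "d"], ["a", "b", "e"]]
data_list = [["a", "d", "c"], ["e", "a", "b"], ["a", "d"], ["c", "d", "a"]]
result = count_matches(items, data_list)
```

With the sample data from the question, only `["e", "a", "b"]` is a whole-row match, so `("a", "b", "e")` maps to 1 and `("c", "d")` maps to 0.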

I would start by normalizing the data you want to compare so a simple equality check can tell if two items are "equal" as you would define it. Based on your code, I would guess Enum.sort/1 would do the trick, though MapSet.new/1 or a function returning a map may compare faster if it matches your use case.
defp normalize(item) do
  Enum.sort(item)
end
def calculate_frequency(items, data_list) do
  data_list = Enum.map(data_list, &normalize/1)
  items = Enum.map(items, &normalize/1)
end
If you're going to look up most of the frequencies in the data list, I would then calculate all frequencies for the data list up front. Elixir 1.10 introduced Enum.frequencies/1 and Enum.frequencies_by/2, but you could do this with a reduce if desired.
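def calculate_frequency(items, data_list) do
  # frequencies_by/2 applies normalize/1 to each row before counting
  data_frequencies = Enum.frequencies_by(data_list, &normalize/1)
  # Map.new/2 needs {key, value} tuples; use this form if you want the result as a map
  Map.new(items, &{&1, Map.get(data_frequencies, normalize(&1), 0)})
end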
I haven't done any benchmarks on my code or yours. If you were looking to do more asynchronous stuff, you could replace your mapping with Task.async_stream/3, and you could replace your frequencies call with a combination of Stream.chunk_every/2, Task.async_stream/3 (with Enum.frequencies/1 being the function), and Map.merge/3.
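The chunk, count, and merge pipeline described above could look roughly like this in Python. This is a sketch under assumptions: `chunked_frequencies`, `count_chunk`, and the `chunk_size` and `max_workers` defaults are hypothetical names and values, and `Counter.update` stands in for the merge step. A thread pool keeps the example short; in CPython, threads do not run CPU-bound counting in parallel, so a process pool would be the closer analogue for real speedups.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def normalize(row):
    # Same order-insensitive key as Enum.sort/1 in the answer.
    return tuple(sorted(row))


def count_chunk(chunk):
    # Per-chunk frequency table (the Enum.frequencies/1 step).
    return Counter(normalize(row) for row in chunk)


def chunked_frequencies(data_list, chunk_size=2, max_workers=4):
    # Split the data into fixed-size chunks (the chunk_every step).
    chunks = [
        data_list[i:i + chunk_size]
        for i in range(0, len(data_list), chunk_size)
    ]
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Count each chunk concurrently, then merge by adding counts.
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total


data_list = [["a", "d", "c"], ["e", "a", "b"], ["a", "d"], ["c", "d", "a"]]
freq = chunked_frequencies(data_list)
```

The merge cost grows with the number of distinct keys per chunk, which is the overhead the earlier answer warns about.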

Relational Algebra: Natural Join having the same result as Cartesian product

I am trying to understand what the result of performing a natural join
between two relations R and S will be when they have no common attributes.
Following the definition below, I thought the answer might be an empty set:
[Image: natural join definition]
My line of thought was that because the condition in the 'Select' symbol is not met, the projection of all of the attributes won't take place.
When I asked my lecturer about this, he said that the output will be the same as doing a Cartesian product between R and S.
I can't seem to understand why, and would appreciate any help.

Natural join combines a cross product and a selection into one
operation. It performs a selection forcing equality on those
attributes that appear in both relation schemes. Duplicates are
removed as in all relation operations.
There are two special cases:
• If the two relations have no attributes in common, then their
natural join is simply their cross product.
• If the two relations have more than one attribute in common,
then the natural join selects only the rows where all pairs of
matching attributes match.
Notation: $r \bowtie s$
Let r and s be relation instances on schemas R and S
respectively.
The result is a relation on schema R ∪ S which is
obtained by considering each pair of tuples tr from r and ts from s.
If tr and ts have the same value on each of the attributes in R ∩ S, a
tuple t is added to the result, where
– t has the same value as tr on R
– t has the same value as ts on S
Example:
R = (A, B, C, D)
S = (E, B, D)
Result schema = (A, B, C, D, E)
$r \bowtie s$ is defined as:
$\pi_{r.A,\, r.B,\, r.C,\, r.D,\, s.E}(\sigma_{r.B = s.B \,\wedge\, r.D = s.D}(r \times s))$

The definition of the natural join you linked can be broken down as:
1. First take the Cartesian product.
2. Then select only those rows in which attributes of the same name have the same value.
3. Now apply a projection so that all attributes have distinct names.
If the two tables have no attributes with the same name, step 2 has no condition to test and therefore keeps every row, so the result is indeed the Cartesian product.
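The three steps can be sketched in Python, modelling a relation as a list of dicts keyed by attribute name. This is an illustrative toy, not a reference implementation; `natural_join` and the sample relations are hypothetical. The point to notice is that `all(...)` over an empty set of common attributes is true, so every pair survives the selection when the schemas are disjoint.

```python
from itertools import product


def natural_join(r, s):
    result = []
    # Step 1: consider every pair of tuples (the Cartesian product).
    for tr, ts in product(r, s):
        # Step 2: keep the pair only if it agrees on all shared attributes.
        # With no shared attributes, all() over nothing is True.
        common = tr.keys() & ts.keys()
        if all(tr[a] == ts[a] for a in common):
            # Step 3: merge so each attribute name appears once.
            merged = {**tr, **ts}
            if merged not in result:  # relations hold no duplicates
                result.append(merged)
    return result


# No attributes in common: the join equals the Cartesian product.
r_disjoint = [{"A": 1}, {"A": 2}]
s_disjoint = [{"E": "x"}, {"E": "y"}]
joined_disjoint = natural_join(r_disjoint, s_disjoint)

# One attribute in common (B): only matching pairs survive.
r_shared = [{"A": 1, "B": 10}]
s_shared = [{"B": 10, "E": "x"}, {"B": 20, "E": "y"}]
joined_shared = natural_join(r_shared, s_shared)
```

Here `joined_disjoint` has 2 × 2 = 4 rows, while `joined_shared` has the single row where `B` is 10 on both sides.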

Is `[<var> in <distributed variable>]` equivalent to `forall`?

I noticed something in a snippet of code I was given:
var D: domain(2) dmapped Block(boundingBox=Space) = Space;
var A: [D] int;
[a in A] a = a.locale.id;
Is [a in A] a = a.locale.id equivalent to forall a in A do a = a.locale.id?

For the most part, yes. In Chapel, [a in A] expr can be thought of as a shorthand for forall a in A do expr. However, there is a slight difference in that if A does not support parallel iteration, the forall form will generate a compile-time error whereas the [a in A] form will fall back to serial iteration.
With respect to the title of this question, note that this behavior is independent of whether or not A is distributed. For example, you could also write [i in 1..n] rather than forall i in 1..n do even though ranges like 1..n are never distributed in Chapel.
Array types in Chapel, like [D] real, can similarly be read as "for all indices in D, allocate an element of type real."

SWI Prolog usage of aggregation

I created a simple database in SWI-Prolog. My task is to compute how long each department will work, depending on the production plan. I am almost finished, but I don't know how to sum my results. For now I am getting something like this:
department amount
b 20
a 5
c 50
c 30
How can I transform it into this?
b 20
a 5
c 80
My code https://gist.github.com/senioroman4uk/d19fe00848889a84434b

The code provided won't interpret the count predicate on account of a bad format. You should rewrite it as count:- instead of count():-. As far as I know, all zero-ary predicates need to be defined like this.
Second, your count predicate does not collect the results in a list upon which you could operate. Here's how you can change it to collect all department-amount pairs in a list with findall:
count_sum(DepAmounts) :-
    findall((Department, Sum),
            ( productionPlan(FinishedProduct, Amount),
              resultOf(FinishedProduct, Operation),
              executedIn(Operation, Department, Time),
              Sum is Amount * Time
            ),
            DepAmounts).
Then, over that list, you can use something like SWI-Prolog's aggregate:
?- count_sum(L), aggregate(sum(A),L,member((D,A),L),X).
This will yield, on backtracking, the departments in D and the sum of the amounts in X:
D = a,
X = 5 ;
D = b,
X = 20 ;
D = c,
X = 80.
BTW, if I were you I'd replace all the double-quoted strings for department names, operations, etc. with atoms, for simplicity.

You should consider library(aggregate). For instance, if data/2 is the interesting subset of the DB, you get
?- aggregate(set(K-A),aggregate(sum(V),data(K,V),A),L).
L = [a-5, b-20, c-80]
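For readers less familiar with Prolog, the group-and-sum that both answers perform can be sketched in Python. The function name `sum_by_department` is hypothetical, and the rows are the department/amount pairs shown in the question.

```python
from collections import defaultdict


def sum_by_department(rows):
    # Accumulate one running total per department key.
    totals = defaultdict(int)
    for department, amount in rows:
        totals[department] += amount
    return dict(totals)


rows = [("b", 20), ("a", 5), ("c", 50), ("c", 30)]
totals = sum_by_department(rows)
```

The two `c` rows collapse into a single entry of 80, while `a` and `b` keep their original amounts.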

Oracle Left outer join

SELECT
a,
last_note_user,
c,
d,
iso_src
FROM
X
CROSS JOIN Y
CROSS JOIN Z
LEFT OUTER JOIN W
ON W.last_note_user = Z.userid
AND W.user_ten = Y.iso_src
The ANSI query above returns 107 records. When I run the same query without the ANSI syntax, it returns 875 records. The non-ANSI query is below:
SELECT
a,
last_note_user,
c,
d,
iso_src
FROM
X,
Y,
Z,
W
WHERE
W.last_note_user = Z.userid(+)
AND W.user_ten = Y.iso_src(+)
Why is there a difference between the ANSI and the non-ANSI versions of the query?
Please help me out!

Your old-style query has the (+) symbols on the wrong side of the predicate. It should be:
SELECT
a,
last_note_user,
c,
d,
iso_src
FROM
X,
Y,
Z,
W
WHERE
W.last_note_user (+) = Z.userid
AND W.user_ten (+) = Y.iso_src
But I wouldn't use the old-style syntax any more really.
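To see which side the ANSI form treats as optional, here is a small sketch using Python's built-in sqlite3 module. It is an illustration only: SQLite is not Oracle and has no (+) operator, and the table contents are made up. The query keeps the shape of the ANSI version from the question, with W as the optional side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Y (iso_src TEXT);
    CREATE TABLE Z (userid TEXT);
    CREATE TABLE W (last_note_user TEXT, user_ten TEXT);
    INSERT INTO Y VALUES ('US'), ('DE');
    INSERT INTO Z VALUES ('u1'), ('u2');
    INSERT INTO W VALUES ('u1', 'US');
""")

# Every Y x Z combination is kept; W columns are NULL where nothing matches.
rows = conn.execute("""
    SELECT Y.iso_src, Z.userid, W.last_note_user
    FROM Y
    CROSS JOIN Z
    LEFT OUTER JOIN W
      ON W.last_note_user = Z.userid
     AND W.user_ten = Y.iso_src
    ORDER BY Y.iso_src, Z.userid
""").fetchall()
conn.close()
```

All four Y × Z combinations come back, and only the ('US', 'u1') pair carries a non-NULL value from W. In the old Oracle syntax that optional side is the one marked with (+), which is why the marker belongs on the W columns.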

Outer joins written in the old, non-ANSI (+) syntax are ambiguous, so who knows how the query optimizer interprets them.
If the first (ANSI) query is producing the right rows, forget about the non-ANSI version and move on.
