How to achieve UNION ALL in Pig? - hadoop

I have 3 data sets, each with 415 GB of data, from different domains.
I need to union all of them using Pig, but all I can use is the UNION clause, which launches reducers at the end of the job to remove duplicate values.
a = UNION a1, a2;
data = UNION a, a3;
Is there a way to skip the reducer part, since the data is already distinct?

From the docs on UNION:
Use the UNION operator to merge the contents of two or more relations.
The UNION operator:
Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.
Does not eliminate duplicate tuples.
Emphasis is mine. This indicates to me there wouldn't need to be a reducer step to complete the UNION since it doesn't need to remove duplicate rows. Are you sure that the reducer job is a result of the UNION? It could be the result of another operator.
BONUS: You can simplify your example to:
B = UNION a1, a2, a3 ;
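To illustrate the semantics quoted from the docs, here is a minimal sketch in Python (not Pig itself; the function names are hypothetical). A bag union is plain concatenation and needs no grouping, while removing duplicates requires bringing equal tuples together, which is the kind of work a reduce phase does:

```python
from itertools import chain

def bag_union(*relations):
    """Concatenate relations; duplicates are kept, as with a bag union."""
    return list(chain.from_iterable(relations))

def distinct(relation):
    """Drop duplicate tuples; equal tuples must be compared with each other."""
    return list(dict.fromkeys(relation))

# Hypothetical sample relations, each a list of tuples.
a1 = [(1, "x")]
a2 = [(1, "x"), (2, "y")]
a3 = [(3, "z")]

merged = bag_union(a1, a2, a3)   # the duplicate (1, "x") is still present
unique = distinct(merged)        # only here is the duplicate removed
```

If a reducer shows up in the job, it is worth checking whether something like DISTINCT, GROUP, or ORDER follows the UNION in the script.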

Related

Algorithm to simplify an SQL query over many combinations of values

Suppose you have a set of data you want to query from a database table based on two or more columns and you have various combinations of values from those columns to query over.
For example, with these combinations:
A 1
A 2
A 3
A 4
B 2
B 3
B 4
C 2
C 3
C 4
C 5
the simplest automated way to write the query would be
...WHERE column1=A AND column2=1
OR column1=A AND column2=2
OR ...
This is accurate, but verbose.
Another approach would be to choose one list and group by that.
...WHERE column1=A AND column2 IN (1,2,3,4)
OR column1=B AND column2 IN (2,3,4)
OR column1=C AND column2 IN (2,3,4,5)
This can also be pretty verbose if the lists are large.
What I'm looking for is a good, programmatic/algorithmic way to simplify this to something more optimal, like
...WHERE column1 IN (A,B,C) AND column2 IN (2,3,4,5) AND NOT (column1=A AND column2=5)
OR column1=A AND column2=1
Realistically, I guess I'm looking for the shortest string representation, but rather than worry about the lengths of column names or values or the other code in between, I'd be more than satisfied with a way to minimize the number of times that values have to be printed, or something along those lines.
The given example is for two columns, but I would like to find an approach that can generalize to more columns.
It also doesn't have to be perfect. Pretty good is good enough for me.
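As a starting point, the grouping approach described in the question (one IN list per column1 value) can be sketched in Python. This sketch also merges column1 values that share exactly the same column2 list, which is one cheap simplification; it does not attempt the "rectangle minus exceptions" form. The function names are hypothetical, and values are inlined only for readability (real code should use bind parameters):

```python
from collections import defaultdict

def lit(value):
    """Render a value as a naive SQL literal (no escaping; illustration only)."""
    return f"'{value}'" if isinstance(value, str) else str(value)

def predicate(column, values):
    """Render 'col=v' for one value or 'col IN (...)' for several."""
    if len(values) == 1:
        return f"{column}={lit(values[0])}"
    return f"{column} IN ({','.join(lit(v) for v in values)})"

def grouped_where(pairs, col1="column1", col2="column2"):
    """Group pairs by col1, then merge col1 values with identical col2 sets."""
    by_col1 = defaultdict(set)
    for c1, c2 in pairs:
        by_col1[c1].add(c2)
    by_values = defaultdict(list)
    for c1, values in by_col1.items():
        by_values[tuple(sorted(values))].append(c1)
    clauses = [
        f"({predicate(col1, keys)} AND {predicate(col2, list(values))})"
        for values, keys in by_values.items()
    ]
    return " OR ".join(clauses)

pairs = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
         ("B", 2), ("B", 3), ("B", 4),
         ("C", 2), ("C", 3), ("C", 4), ("C", 5)]
where = grouped_where(pairs)
```

For more than two columns, the same grouping could be applied recursively, one column at a time, though that is an untested extension of this sketch.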

Database index that supports arbitrary sort order

Is there a database index type (or a data structure in general, not just a B-tree) that provides efficient enumeration of objects sorted in an arbitrarily customizable order?
To execute a query like the one below efficiently,
select *
from sample
order by column1 asc, column2 desc, column3 asc
offset :offset rows fetch next :pagesize rows only
DBMSes usually require a composite index on the fields mentioned in the "order by" clause, in the same order and with the same asc/desc directions. For example:
create index c1_asc_c2_desc_c3_asc on sample(column1 asc, column2 desc, column3 asc)
The order of the index columns matters, and the index can't be used if the order of columns in the "order by" clause does not match.
To make queries with every possible "order by" clause efficient, we could create an index for every possible combination of sort columns. But this is not feasible, since the number of indexes grows exponentially with the number of sort columns.
Namely, if k is the number of sort columns, then k! is the number of permutations of the sort columns and 2^k is the number of possible combinations of asc/desc directions, so the number of indexes is k!·2^(k-1). (Here we use 2^(k-1) instead of 2^k because we assume that the DBMS is smart enough to use the same index in both forward and reverse directions, but unfortunately this doesn't help much.)
So, I wish to have something like
create universal index c1_c2_c3 on sample(column1, column2, column3)
that would have the same effect as 24 (k=3) plain indexes (covering every "order by"), but consume a reasonable amount of disk/memory space. By reasonable I mean O(k·n), where k is the number of sort columns and n is the number of rows/entries, assuming that an ordinary index consumes O(n). In other words, a universal index with k sort columns should consume approximately as much as k ordinary indexes.
What I want looks to me like a multidimensional index, but when I googled this term I found pages that relate to either
ordinary composite indexes - this is not what I need, for obvious reasons;
spatial structures like k-d trees, quadtrees/octrees, R-trees and so on, which are more suitable for the nearest-neighbor search problem than for sorting.
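To make the index-count estimate in the question concrete, here is a small Python sketch (function names are hypothetical) that computes k!·2^(k-1) and cross-checks it by enumerating every ordering, treating an index and its full reverse as the same index:

```python
from itertools import permutations, product
from math import factorial

def index_count(k):
    """Number of plain indexes needed for k >= 1 sort columns."""
    return factorial(k) * 2 ** (k - 1)

def enumerate_indexes(columns):
    """List every distinct index spec, merging each spec with its reverse."""
    seen = set()
    for perm in permutations(columns):
        for dirs in product(("asc", "desc"), repeat=len(columns)):
            spec = tuple(zip(perm, dirs))
            # The same index scanned backwards serves the flipped ordering.
            flipped = tuple((c, "desc" if d == "asc" else "asc") for c, d in spec)
            seen.add(min(spec, flipped))
    return sorted(seen)

specs = enumerate_indexes(["column1", "column2", "column3"])
print(len(specs), index_count(3))  # both are 24 for k=3
```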

Difference between natural join and simple join on common attribute in algebra

I am confused about the following.
Suppose there are two relations R and S with a common attribute A.
Now, is
(R natural join S) = (R join S with join condition R.A = S.A)?
A natural join returns a single common column A.
Does a simple join return two columns with the same name (A, A), or one common column A, given that relational algebra is defined in terms of set theory?
There's an example of a Natural Join here. As @Renzo says, there are many variants. And SQL is different again. So I'll keep to what Wikipedia shows.
Most important: the join condition applies to all attributes in common between the two arguments. So you need to say "two relations with A being their only common attribute". The only common attribute is DeptName in that Wikipedia example. There can be many common attributes, in general.
Yes, joining means forming tuples in the result by pairing tuples from the arguments that have the same values in the corresponding common attributes. So you have the same value with the same attribute name. It would be pointless to repeat both attributes in the result, because you'd be repeating the values. The example shows there's a single attribute DeptName in the result.
Beware that different dialects of Relational Algebra use different symbols and notations. So the bare bowtie (⋈) for Natural Join can be suffixed with a boolean condition, making a theta-join (θ-join) or equi-join -- see that example. The boolean condition is between differently-named attributes, and might use any comparison operator. So both attribute names and their values appear in the result.
Set theory operations apply because each tuple is a set of name-value pairs. The result tuples are the union of a tuple from each argument -- providing that union is a valid tuple. That is, providing same-named n-v pairs have same value.
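The "union of name-value pairs" view can be sketched in Python by modelling each tuple as a dict. This is only an illustration under that modelling assumption; the relation contents and the function name are hypothetical:

```python
def natural_join(r, s):
    """Pair tuples that agree on all common attributes and union their pairs."""
    result = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()
            # The union of t and u is a valid tuple only if same-named
            # attributes carry the same value.
            if all(t[a] == u[a] for a in common):
                result.append({**t, **u})
    return result

R = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
S = [{"A": 1, "C": "p"}, {"A": 3, "C": "q"}]
joined = natural_join(R, S)  # a single tuple, with attribute A appearing once
```

Because each result tuple is a set of name-value pairs, the attribute A cannot appear twice in it.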

Alternative to ORDER BY in Hive

ORDER BY in Hive uses only a single reducer, so it is inefficient. Is there an alternative to ORDER BY?
Regards,
Ratto
You will probably want to use a combination of DISTRIBUTE BY and SORT BY. DISTRIBUTE BY ensures that all rows with the same key value end up on the same reducer. SORT BY then sorts the data within each reducer.
For example:
SELECT a, b, c
FROM table
DISTRIBUTE BY a
SORT BY a, b
ORDER BY will sort all of the data together, which is why it has to pass through one reducer.
SORT BY should do the trick. This will sort the data within each reducer, so the values for a given key will be in order, but the keys are not guaranteed to be in order. You can use any number of reducers for SORT BY.
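The difference between the two answers can be simulated in plain Python. This is a rough model, not Hive's actual implementation: reducers are modelled as lists, rows are assigned by hashing the key, and the function names are hypothetical:

```python
def distribute_by(rows, key, num_reducers):
    """Send every row to a reducer chosen by hashing its key."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[hash(key(row)) % num_reducers].append(row)
    return partitions

def sort_by(partitions, key):
    """Sort rows within each reducer independently; no global order results."""
    return [sorted(p, key=key) for p in partitions]

# Hypothetical rows of (a, b).
rows = [(3, "c"), (1, "z"), (3, "a"), (2, "m"), (1, "b"), (2, "k")]
parts = sort_by(distribute_by(rows, key=lambda r: r[0], num_reducers=2),
                key=lambda r: (r[0], r[1]))

# ORDER BY, by contrast, is one global sort over all rows.
globally_sorted = sorted(rows)
```

Each list in `parts` is sorted and holds all rows for its keys, but concatenating the lists does not in general give a globally sorted result.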

Union/intersection of 2 sets, where each set is defined by its subsets

Every set is defined as the union of other sets.
For example
A = B union {1,2}
B = C union D
C = {5,6}
D = {5,7}
E = {4}
then A = {1,2,5,6,7}
A union E = {1,2,4,5,6,7}
Are there any efficient algorithms to do that? Suppose the hierarchy of unions can be really deep, and the subsets can change fairly often (but not that much).
I think there should be ways to reduce the number of unions one has to make.
So you have an unchanging hierarchy of unions of changing sets? And you are, like in your example, only interested in the value of one set?
Then flatten the hierarchy. That is, in your example you would once walk through the hierarchy to find the set of changing sets your set is the union of, and store this set.
To avoid recomputing unions whenever a leaf set changes, you could track, for each element, how many leaf sets currently contain it. This count can be updated quickly when a leaf set changes, and the update does not require looking at any unchanged leaf sets. The elements with a count > 0 are then exactly those currently in the union.
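A minimal Python sketch of the flattening and counting idea, using the sets from the question. The representation (a dict of definitions, a dict of leaf sets, and the name "A0" for the literal {1,2}) and the class and function names are assumptions made for illustration:

```python
from collections import Counter

def flatten(definitions, leaves, name):
    """Return the names of all leaf sets that `name` is ultimately a union of."""
    if name in leaves:
        return {name}
    result = set()
    for child in definitions[name]:
        result |= flatten(definitions, leaves, child)
    return result

class CountedUnion:
    """Union of some leaf sets, maintained through per-element counts."""

    def __init__(self, leaves, leaf_names):
        self.leaves = leaves
        self.leaf_names = set(leaf_names)
        self.counts = Counter()
        for leaf in self.leaf_names:
            self.counts.update(leaves[leaf])

    def add(self, leaf, element):
        if element not in self.leaves[leaf]:
            self.leaves[leaf].add(element)
            if leaf in self.leaf_names:
                self.counts[element] += 1

    def remove(self, leaf, element):
        if element in self.leaves[leaf]:
            self.leaves[leaf].remove(element)
            if leaf in self.leaf_names:
                self.counts[element] -= 1
                if self.counts[element] == 0:
                    del self.counts[element]

    def members(self):
        return set(self.counts)

definitions = {"A": ["B", "A0"], "B": ["C", "D"]}
leaves = {"A0": {1, 2}, "C": {5, 6}, "D": {5, 7}, "E": {4}}
a = CountedUnion(leaves, flatten(definitions, leaves, "A"))
print(a.members())  # {1, 2, 5, 6, 7}
a.remove("C", 5)    # 5 is still in D, so it stays in A
```

The sketch assumes all leaf changes go through one `CountedUnion`; sharing the `leaves` dict between several views would need extra bookkeeping.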
Perhaps you're looking for some sort of disjoint set data structure?
Several questions about this case.
First question: how long does this "script"/"program" have to run? If it's not too long, it could well be a good option to simply store previous unions and check that cache before performing a union. Memory isn't that expensive nowadays :).
Second question you should ask: are the elements in a certain order before the union? If they aren't, and a list is accessed more than once, it can be very useful to sort the list first (then you can make a decision when you're only halfway through a list, for example). The merge step of mergesort is an efficient way to combine two ordered lists.
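The merge idea can be sketched in Python as a linear-time union of two sorted lists. The sketch assumes each input is sorted and free of duplicates; the function name is hypothetical:

```python
def merge_union(xs, ys):
    """Union of two sorted, duplicate-free lists in O(len(xs) + len(ys))."""
    out = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            out.append(xs[i])
            i += 1
        elif xs[i] > ys[j]:
            out.append(ys[j])
            j += 1
        else:
            # Element present in both lists: emit it once.
            out.append(xs[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    out.extend(xs[i:])
    out.extend(ys[j:])
    return out

b = merge_union([5, 6], [5, 7])  # C union D from the question
```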

Resources