Hadoop data cross calculation

I have a huge file which contains a list of numbers, say A0, A1, ..., An. I want to compute values for all combinations of Ai and Aj using a complex operation (op). So I want the values
op(A1, A2), op(A1, A3), ..., op(An, An-2), op(An, An-1)
or even more, op(A1, A2, A3), op(A1, A2, A4), ...
My question is: the huge file containing all the numbers is divided into segments across nodes. How can I bring together numbers Ai and Aj that are not stored on the same node?
Thanks
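One common pattern for this kind of all-pairs computation (a general technique, not something given in this thread) is pairwise block replication: split the numbers into B blocks and have the map phase emit each record once for every block pair containing its block, so that any Ai and Aj meet at exactly one reducer no matter which nodes they started on. A single-process Python sketch of the idea, with the map/shuffle/reduce stages simulated in memory:

from collections import defaultdict
from itertools import combinations

def all_pairs(records, num_blocks, op):
    # "Map" stage: records are (block, value); send each record to every
    # reducer key (an unordered block pair) that contains its block.
    groups = defaultdict(list)
    for block, value in records:
        for other in range(num_blocks):
            key = (min(block, other), max(block, other))
            groups[key].append((block, value))
    # "Reduce" stage: each key holds one or two whole blocks, so all
    # cross combinations can be formed locally.
    results = []
    for (b1, b2), vals in groups.items():
        if b1 == b2:
            same = [v for b, v in vals if b == b1]
            results += [op(x, y) for x, y in combinations(same, 2)]
        else:
            left = [v for b, v in vals if b == b1]
            right = [v for b, v in vals if b == b2]
            results += [op(x, y) for x in left for y in right]
    return results

# Two "nodes" (blocks) holding two numbers each; op is a stand-in.
print(all_pairs([(0, 1), (0, 2), (1, 3), (1, 4)], 2, lambda a, b: (a, b)))

Each record is replicated B times, so the shuffle volume grows with the number of blocks; the same idea extends to triples op(Ai, Aj, Ak) by keying on block triples.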

Related

Find variations of numbers

I'm sorry, I don't know what the proper title for this should be, because I don't know what topic this question falls into.
So for example there are 5 people. They want to stay in a hotel. This hotel allows at most 2 lodgers per room and at least 1. That means there are a few possible variations for this.
1-1-1-1-1 (1 room for each person)
1-2-2 (1 person stays alone, the other 4 are divided into 2 rooms)
1-1-1-2 (... and so on)
What is the algorithm to find these variations?
This is a combinatorial question, and the abstract version is typically called balls & bins. A key question is whether the balls are distinguishable. Ditto for the bins.
In your example, the balls are people and the bins are rooms. If the rooms are distinguishable, you'll also need the total number available.
Let's say neither is distinguishable. Then the only question is how many pairs we have, with the options being 0, 1, or 2, so there are 3 solutions.
If people are distinguishable but not rooms (balls but not bins), then we care who is in the pairs. In this case 1-1-1-1-1 has a single solution, 1-1-1-2 has choose(5,2) = 10 solutions (all the ways we can choose who is in the lone pair), and 1-2-2 has choose(5,2) * choose(3,2) / 2 = 10 * 3 / 2 = 15 solutions (choose who is in the first pair, then the second, then divide by 2 to avoid double-counting where the order is reversed). Total solutions: 1 + 10 + 15 = 26.
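As a sanity check on these counts, here is a small brute-force sketch in Python (not part of the original answer) that counts the ways to split distinguishable people into indistinguishable rooms of size 1 or 2:

def count_room_splits(people):
    # Ways to split distinguishable people into unordered groups of 1 or 2.
    if not people:
        return 1
    rest = people[1:]
    total = count_room_splits(rest)  # the first person rooms alone
    for i in range(len(rest)):       # ...or pairs up with someone else
        total += count_room_splits(rest[:i] + rest[i + 1:])
    return total

print(count_room_splits(tuple("abcde")))  # 26 = 1 + 10 + 15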
If people and rooms are both distinguishable, then for each of the solutions above we care which room each person or pair goes in. This will depend on the total number of rooms available. If there are R rooms available, then a solution which uses r rooms where rooms aren't distinguishable will need to be multiplied by R!/(R-r)!.
E.g. 1-1-1-2 has 10 solutions where rooms are indistinguishable. If the hotel has 5 rooms, then we multiply that by 5!/(5-4)! = 120 to get 1200 solutions.
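Python's math module can verify that factor directly (a quick check, not from the original answer):

from math import comb, perm

# 1-1-1-2 with 5 distinguishable people and R = 5 distinguishable rooms:
# choose the pair, then assign the r = 4 occupied rooms in order: R!/(R-r)!
print(comb(5, 2) * perm(5, 4))  # 10 * 120 = 1200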
If the people are a,b,c,d,e and there are 5 rooms numbered 1,2,3,4,5, then the solutions where b+d are paired up, a is in room 1, and c is in room 2 are:
a1, c2, e3, bd4
a1, c2, e3, bd5
a1, c2, e4, bd3
a1, c2, e4, bd5
a1, c2, e5, bd3
a1, c2, e5, bd4
You can treat the above problem like the coin change problem, where you have a sum and a set of coins, and you have to find the number of ways to make the sum using those coins.
Here:
coins = {1,2}
sum = Number of people
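A minimal coin-change DP along these lines, sketched in Python (it counts unordered combinations, so each room-size pattern is counted exactly once):

def count_ways(total, coins=(1, 2)):
    # ways[s] = number of unordered ways to write s as a sum of the coins
    ways = [1] + [0] * total
    for c in coins:
        for s in range(c, total + 1):
            ways[s] += ways[s - c]
    return ways[total]

print(count_ways(5))  # 3 patterns: 1-1-1-1-1, 1-1-1-2, 1-2-2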

Interview question: minimum number of swaps to make couples sit together

This is an interview question, and the problem description is as follows:
There are n couples sitting in a row with 2n seats. Find the minimum number of swaps needed to make everyone sit next to his/her partner. For example, 0 and 1 are a couple, and 2 and 3 are a couple. Originally they are sitting in a row in this order: [2, 0, 1, 3]. The minimum number of swaps is 1, for example swapping 2 with 1.
I know there is a greedy solution for this problem. You just scan the array from left to right. Every time you see an unmatched pair, you swap the first person of the pair into his/her correct position. For example, in the above case, for the pair [2, 0] you directly swap 2 with 1. There is no need to try swapping 0 with 3.
But I don't really understand why this works. One of the proofs I saw goes something like this:
Consider a simple example: 7 1 4 6 2 3 0 5. At the first step we have two choices to match the first couple: swap 7 with 0, or swap 1 with 6. We then get 0 1 4 6 2 3 7 5 or 7 6 4 1 2 3 0 5. Note that the first couple no longer counts. The later part is composed of 4 X 2 3 Y 5 (X=6, Y=7 or X=1, Y=0). Since different couples are unrelated, we don't care whether X Y is the 6-7 pair or the 0-1 pair; they are equivalent! Thus our choice doesn't matter.
I feel that this is very reasonable but not compelling enough. In my opinion we would have to prove that X and Y form a couple in all possible cases, and I don't know how. Can anyone give a hint? Thanks!
I've split the problem into 3 examples. A's are a pair and so are B's in all examples. Note that throughout the examples a match requires that the elements are adjacent and that the first element occupies an index satisfying index % 2 = 0. An array looking like [X A1 A2 ...] does not satisfy this condition, but [X Y A1 A2 ...] does. The examples also never look to the left, because looking to the left of A2 below is the same as looking to the right of A1.
First example
There's an even number of elements between two unmatched pairs:
A1 B1 ..2k.. A2 B2 .. for any number k in {0, 1, 2, ..}, meaning A1 B1 A2 B2 .. is just another case.
Both can be matched in one swap:
A1 A2 ..2k.. B1 B2 .. or B2 B1 ..2k.. A2 A1 ..
Order is not important, so it doesn't matter which pair comes first. Once the pairs are matched, there will be no more swapping involving either pair. Finding A2 based on A1 results in the same number of swaps as finding B2 based on B1.
Second example
There's an odd number of elements between two pairs (2k + the element C):
A1 B1 ..2k.. C A2 B2 D .. (A1 B1 ..2k.. C B2 A2 D .. is identical)
Both cannot be matched in one swap, but as before it doesn't matter which pair comes first, nor whether the matched pair sits at the beginning or in the middle of the array; all these possible swaps are equally valid, and none of them creates more swaps later on:
A1 A2 ..2k .. C B1 B2 D .. or B2 B1 ..2k.. C A2 A1 D .. Note that the last pair is not matched
C B1 ..2k.. A1 A2 B2 D .. or A1 D ..2k.. C A2 B2 B1 .. Here we're not matching the first pair.
The important thing is that in each case only one pair is matched, and none of the elements of that pair will need to be swapped again. The remaining unmatched pair ends up as one of:
..2k.. C B1 B2 D ..
..2k.. C A2 A1 D ..
C B1 ..2k.. B2 D ..
A1 D ..2k.. C A2 ..
They are clearly equivalent in terms of swaps needed to match the remaining A's or B's.
Third example
This is logically identical to the second. Both B1/A2 and A2/B2 can have any number of elements between them. No matter how elements are swapped, only one pair can be matched. m1 and m2 are arbitrary number of elements. Note that elements X and Y are just the elements surrounding B2, and they're only used to illustrate the example:
A1 B1 ..m1.. A2 ..m2.. X B2 Y .. (A1 B1 ..m1.. B2 ..m2.. X A2 Y .. is identical)
Again both pairs cannot be matched in one swap, but it's not important which pair is matched, or where the matched pair position is:
A1 A2 ..m1.. B1 ..m2.. X B2 Y .. or B2 B1 ..m1.. A2 ..m2.. X A1 Y .. Note that the last pair is not matched
A1 X ..m1.. A2 ..m2-1.. B1 B2 Y .. or A1 Y ..m1.. A2 ..m2.. X B2 B1.. depending on position of B2. Here we're not matching the first pair.
Matching the pair around A2 is equivalent, but omitted.
As in the second example, one swap can also match a pair at the beginning or in the middle of the array, but either choice doesn't change the fact that only one pair is matched, nor does it change the remaining number of unmatched pairs.
A little analysis
Keeping in mind that matched pairs drop out of the list of unmatched/problem pairs, each swap shrinks that list by one or two pairs. Since it's not important which pair drops out of the problem, it might as well be the first. In that case we can assume that the pairs to the left of the cursor/current index are all matched, and that we only need to match the first pair, unless it's already matched by coincidence, in which case the cursor simply moves on.
It becomes even clearer if the above examples are viewed with the cursor at the second unmatched pair instead of the first. It still doesn't matter which pairs are swapped; the total number of swaps needed stays the same. So there's no need to try to match pairs in the middle: the resulting number of swaps is the same.
The only time two pairs can be matched with a single swap is the situation in the first example. There is no way to match two pairs in one swap in any other setup. Looking at the results of the swaps in the second and third examples, it also becomes clear that none of the results has any advantage over the others, and that each result becomes a new problem that can be described as one of the three cases (two cases really, because the second and third are equivalent in terms of matchable pairs).
Optimal swapping
There is no way to modify the array to prepare it for more optimal swapping later on. A swap will either match one or two pairs, or it will count as a swap with no matches:
Looking at this: A1 B1 ..2k.. C B2 ... A2 ...
Swap to prepare for optimal swap:
A1 B1 ..2k.. A2 B2 ... C ... no matches
A1 A2 ..2k.. B1 B2 ... C ... two in one
Greedy swap:
B2 B1 ..2k.. C A1 ... A2 ... one
B2 B1 ..2k.. A2 A1 ... C ... one
Un-matching pairs
Pairs already matched will not become unmatched because that would require that:
For A1 B1 ..2k.. C A2 B2 D ..
C is identical to A1 or
D is identical to B1
either of which is impossible.
Likewise with A1 B1 ..m1.. (Z) A2 (V) ..m2.. X B2 Y ..
Or it would require that matched pairs are shifted by one (or any odd number of) index inside the array. That's also not possible, because we only ever swap, so the array elements are never shifted.
[Edited for clarity 4-Mar-2020.]
There is no point doing a swap which does not put (at least) one couple together. To do so would add 1 to the swap count and leave us with the same number of unpaired couples.
So, each time we do a swap, we put one couple together, leaving at most n-1 unpaired couples. Repeating the process, we end up with one final pair, who must by then be a couple. So the worst case is n-1 swaps.
Clearly, we can ignore couples who are already together.
Clearly, where we have two pairs a:B b:A, one swap will create the two couples a:A b:B.
And if we have m pairs a:Q b:A c:B ... q:P -- where the m pairs are a "disjoint subset" (or cycle) of couples, m-1 swaps will put them into couples.
So: the minimum number of swaps is going to be n - s where s is the number of "disjoint subsets" (and s >= 1). [A subset may, of course, contain just one couple.]
Interestingly, there is nothing clever you can do to reduce the number of swaps. Provided every swap creates a couple you will do the minimum number.
If you wanted to arrange each couple in height order as well, things may or may not be more interesting.
FWIW: having shown that you cannot do better than n-1 swaps for each disjoint set of n couples, the trick then is to avoid the O(n^2) search for each swap. That can be done relatively straightforwardly by keeping a vector with one entry per person, giving where they are currently sat. Then in one scan you pick up each person and if you know where their partner is sat, swap down to make a pair, and update the location of the person swapped up.
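To illustrate the n - s formula, here is a short union-find sketch in Python (mine, not the answerer's code) that merges the two couples seated in each seat pair and counts the disjoint subsets:

def min_swaps_by_cycles(row):
    # Couples are (0,1), (2,3), ...; person p belongs to couple p // 2.
    n = len(row) // 2
    parent = list(range(n))

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(0, len(row), 2):  # one edge per pair of adjacent seats
        a, b = find(row[i] // 2), find(row[i + 1] // 2)
        if a != b:
            parent[a] = b

    s = sum(1 for c in range(n) if find(c) == c)  # number of disjoint subsets
    return n - s

print(min_swaps_by_cycles([7, 1, 4, 6, 2, 3, 0, 5]))  # 4 couples, 2 subsets -> 2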
I will swap every even-positioned member if he/she doesn't sit beside his/her partner. Even-positioned here means array indices 1, 3, 5 and so on, i.e. the second seat of each pair.
The couples are [even, odd] pairs, for example [0, 1], [2, 3], [4, 5] and so on.
The loop will look like this:
for(i=1; i<n*2; i+=2) // where n = # of couples
Now we check the (i-1)-th and i-th members. If they are not a couple, we look for the partner of the (i-1)-th member, and once we have found it, we swap it with the i-th member.
For example, say at i=1 we find 6. If the (i-1)-th element is 7, then they form a couple (whereas if the (i-1)-th element is 5, then [5, 6] is not a couple) and we don't need any swap; otherwise we look for the partner of the (i-1)-th element and swap it with the i-th element, so that the (i-1)-th and i-th elements form a couple.
This ensures that we need to check only half of the total members, that is, n.
And for any unmatched couple, we need a linear search from the i-th position through the rest of the array, which is O(2n), i.e. O(n).
So the overall complexity is O(n^2).
In the worst case the minimum number of swaps will be n-1 (and this is also the maximum).
Very straightforward. If you need help to code, let us know.
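Since the answer offers help with code, here is one possible Python version of the greedy scan (a sketch assuming the seating is a permutation of 0..2n-1, so partner(p) = p ^ 1; it keeps the first person of each unmatched pair and swaps the partner in, which by the arguments above costs the same number of swaps):

def min_swaps(row):
    row = list(row)
    pos = {p: i for i, p in enumerate(row)}  # seat of each person
    swaps = 0
    for i in range(0, len(row), 2):
        a, b = row[i], row[i + 1]
        if a ^ 1 != b:               # the pair at seats i, i+1 is unmatched
            j = pos[a ^ 1]           # seat of a's partner
            row[i + 1], row[j] = row[j], row[i + 1]
            pos[b], pos[a ^ 1] = j, i + 1
            swaps += 1
    return swaps

print(min_swaps([2, 0, 1, 3]))              # 1
print(min_swaps([7, 1, 4, 6, 2, 3, 0, 5]))  # 2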

Problem of assigning some values to a set from multiple options algo

I have a problem statement where I have some sets, and each set has some options; a specific number of options from the list needs to be assigned to each set.
An option can be common to multiple sets, but none can be assigned to more than one set. I need an algorithm to achieve this. A rough example is
Set1 has options [100,101,102,103] - 2 needs to be selected,
Set2 has options [101,102,103,104] - 2 needs to be selected,
Set3 has options [99,100,101] - 2 needs to be selected,
so the possible solution is
Set1 gets 100,102
Set2 gets 103,104
Set3 gets 99,101
Can anyone suggest an approach for a generic solution to this problem?
This can be modelled as an instance of the bipartite graph matching problem.
Let A be the numbers which appear in the lists (combined into one set with no duplicates), and let B be the lists themselves, each repeated according to how many elements need to be selected from them. There is an edge from a number a ∈ A to a list b ∈ B whenever the number a is in the list b.
Therefore this problem can be approached using an algorithm which finds a "perfect matching", i.e. a matching which includes all vertices. Wikipedia lists several algorithms which can be used to find matchings as large as possible, including the Ford–Fulkerson algorithm and the Hopcroft–Karp algorithm.
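For illustration, here is a small self-contained Python sketch using Kuhn's augmenting-path algorithm (my encoding of the example, not code from the answer); each list is duplicated once per option it must receive:

def max_bipartite_matching(adj):
    # adj: left vertex -> list of right vertices it may be matched with
    match = {}  # right vertex -> left vertex

    def try_assign(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or try_assign(match[v], seen):
                    match[v] = u
                    return True
        return False

    for u in adj:
        try_assign(u, set())
    return match

slots = {
    ("Set1", 0): [100, 101, 102, 103], ("Set1", 1): [100, 101, 102, 103],
    ("Set2", 0): [101, 102, 103, 104], ("Set2", 1): [101, 102, 103, 104],
    ("Set3", 0): [99, 100, 101],       ("Set3", 1): [99, 100, 101],
}
print(max_bipartite_matching(slots))  # maps each option to a (set, slot)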
Thanks @kaya3, I couldn't remember the exact algorithm, and reminding me that it's a bipartite graph matching problem was really helpful.
But it wasn't giving me the exact solution when I needed n options for each set, so I followed the approach below:
A = [99,100,101,102,103,104]
B = [a1, a2, b1, b2, c1, c2]
# I repeated the instances because I need 2 instances for each and created my
# graph like this
# Now my graph will be like
99 => [c1, c2]
100 => [a1, a2, c1, c2]
101 => [a1, a2, b1, b2, c1, c2]
102 => [a1, a2, b1, b2]
103 => [a1, a2, b1, b2]
104 => [b1, b2]
Now it is giving the correct solution every time; I have tried it with multiple use cases. Repeating the instances was the key.

Pre-processing data for existing compressor

I have an existing "compression" algorithm, which compresses an association into ranges. Something like
type Assoc = [(P,C)]
rangeCompress :: Assoc -> [(P, [(C,C)])]
Read P as "product" and C as "code". The result is a list of products, each associated with a list of code ranges. To find the product associated with a given code, one traverses the compressed data until one finds a range that the given code falls into.
This mechanism works well if consecutive codes are likely to belong to the same product. However, if the codes of different products interleave, they no longer form compact ranges, and I end up with lots of ranges where the upper bound equals the lower bound, and no compression at all.
What I am looking for is a pre-compressor, which looks at the original association and determines a "good enough" transformation of codes, such that the association expressed in terms of transformed codes can be range-compressed into compact ranges. Something like
preCompress :: Assoc -> (C->C)
or more fine-grained (by Product)
preCompress :: Assoc -> P -> (C->C)
In that case a product lookup would have to first transform the code in question and then do the lookup as before. Therefore the transformation must be expressible by a handful of parameters, which would have to be attached to the compressed data, either once for the entire association or by product.
I checked some compression algorithms, but they all seem to focus on reconstructing the original data (which is not strictly needed here), while being totally free about the way they store the compressed data. In my case however, the compressed data must be ranges, only enriched by the parameters of the pre-compression.
Is this a known problem?
Does it have a solution?
Where to look next?
Please note:
I am not primarily interested in restoring the original data
I am primarily interested in the product lookup
The number of codes is apx 7,000,000
The number of products is apx 200
Assuming that for each code there is only one product, you can encode the data as a list (string) of products. Let's pretend we have the following list of products and codes randomly generated from 9 products (or no product) and 10 codes.
[(P6,C2),
(P1,C4),
(P2,C10),
(P3,C9),
(P3,C1),
(P4,C7),
(P6,C8),
(P5,C3),
(P1,C5)]
If we sort them by code we have
[(P3,C1),
(P6,C2),
(P5,C3),
(P1,C4),
(P1,C5),
-- C6 didn't have a product
(P4,C7),
(P6,C8),
(P3,C9),
(P2,C10)]
We can convert this into a string of products + nothing (N). The position in the string determines the code.
{- C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 -}
[ P3, P6, P5, P1, P1, N , P4, P6, P3, P2 ]
A string of symbols in some alphabet (in this case products+nothing) puts us squarely in the realm of well-studied string compression problems.
If we run-length encode this list, we have an encoding similar to the encoding you originally presented. For each range of associations we store a single product+nothing and the run length. We only need a single small integer for the run length instead of two (possibly large) codes for the interval.
{- P3 , P6, P5, P1, P1, N , P4, P6, P3, P2 -}
[ (P3, 1), (P6, 1), (P5, 1), (P1, 2), (N, 1), (P4, 1), (P6, 1), (P3, 1), (P2, 1) ]
We can serialize this to a string of bytes and use any of the existing compression libraries on bytes to perform the actual compression. Some compression libraries, such as the frequently used zlib, are found in the codec category.
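A run-length encoder for such a product string is a one-liner in most languages; a quick Python sketch for illustration:

from itertools import groupby

def run_length_encode(symbols):
    # Collapse runs of equal symbols into (symbol, run length) pairs.
    return [(s, len(list(g))) for s, g in groupby(symbols)]

products = ["P3", "P6", "P5", "P1", "P1", "N", "P4", "P6", "P3", "P2"]
print(run_length_encode(products))
# [('P3', 1), ('P6', 1), ('P5', 1), ('P1', 2), ('N', 1), ('P4', 1), ...]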
Sorted the other way
We'll take the same data from before and sort it by product instead of by code
[(P1,C4),
(P1,C5),
(P2,C10),
(P3,C1),
(P3,C9),
(P4,C7),
(P5,C3),
(P6,C2),
(P6,C8)]
We'd like to allocate new codes to each product so that the codes for a product are always consecutive. We'll just allocate these codes in order.
-- v-- New code --v v -- Old code
[(P1,C1), [(C1,C4)
(P1,C2), (C2,C5),
(P2,C3), (C3,C10),
(P3,C4), (C4,C1),
(P3,C5), (C5,C9),
(P4,C6), (C6,C7),
(P5,C7), (C7,C3),
(P6,C8), (C8,C2),
(P6,C9), (C9,C8)]
We now have two pieces of data to save. We have a (now) one-to-one mapping between products and code ranges and a new one-to-one mapping between new codes and old codes. If we follow the steps from the previous section, we can convert the mapping between new codes and old codes into a list of old codes. The position in the list determines the new code.
{- C1 C2 C3 C4 C5 C6 C7 C8 C9 -} -- new codes
[ C4, C5, C10, C1, C9, C7, C3, C2, C8 ] -- old codes
We can throw any of our existing compression algorithms at this string. Each symbol only occurs at most once in this list so traditional compression mechanisms will not compress this list any more. This isn't significantly different from grouping by the product and storing the list of codes as an array with the product; the new intermediate codes are just (larger) pointers to the start of the array and the length of the array.
[(P1,[C4,C5]),
(P2,[C10]),
(P3,[C1,C9]),
(P4,[C7]),
(P5,[C3]),
(P6,[C2,C8])]
A better representation for the list of old codes is probably the difference between old codes. If the codes already tend to be in consecutive ranges these consecutive ranges can be run-length encoded away.
{- C4, C5, C10, C1, C9, C7, C3, C2, C8 -} -- old codes
[ +4, +1, +5, -9, +8, -2, -4, -1, +6 ] -- old code difference
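Computing that difference list is straightforward; a Python sketch (mine) reproducing the numbers above, with the differences taken against a starting value of 0:

def delta_encode(codes):
    # Store each old code as its difference from the previous one.
    out, prev = [], 0
    for c in codes:
        out.append(c - prev)
        prev = c
    return out

old_codes = [4, 5, 10, 1, 9, 7, 3, 2, 8]
print(delta_encode(old_codes))  # [4, 1, 5, -9, 8, -2, -4, -1, 6]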
I might be tempted to store the difference in product and the difference in codes with the products. This should increase the opportunity for common sub-strings for the final compression algorithm to compress away.
[(+1,[+4,+1]),
(+1,[+5]),
(+1,[-9,+8]),
(+1,[-2]),
(+1,[-4]),
(+1,[-1,+6])]

A travelling salesman that has state

Let's say we have a directed graph. We want to visit every node exactly once by traveling on the edges of this graph. Every node is annotated with one or more tags; some nodes may share tags, and even have the exact same set of tags. As we go along our walk, we are collecting a list of every distinct tag we have encountered - our objective is to find the walk which postpones acquisition of new tags as much as possible.
To restate this as a traveller analogy, let's say that a carpet salesman is trying to decide which supplier he should acquire his carpets from. He makes a list of all the carpet factories in the city, makes an appointment with every factory, and collects samples of the kinds of carpet they make.
Let's say we have 3 factories, producing the following kinds of carpet:
F1: C1, C2, C3
F2: C1, C4
F3: C1, C4, C5
The salesman could take the following routes:
Start at F1, collect C1, C2, C3. Go to F2, collect C4 (since he already has C1). Go to F3, collect C5 (he already has C1 and C4).
Start at F1, collect C1, C2, C3. Go to F3, collect C4 and C5. Go to F2, collect nothing (since it turns out he already has all their carpets).
Start at F2, collect C1, C4. Go to F1, collect C2, C3. Go to F3 and collect C5.
Start at F2, collect C1, C4. Go to F3, collect C5. Go to F1 and collect C2, C3.
Start at F3, collect C1, C4, C5. Go to F1, collect C2, C3. Go to F2, collect nothing.
Start at F3, collect C1, C4, C5. Go to F2, collect nothing. Go to F1, collect C2, C3.
Note how sometimes, the salesman visits a factory even though he knows he has already collected a sample for every kind of carpet they produce. The analogy breaks down here a bit, but let's say he must visit them because it would be rude to not show up for his appointment.
Now, the carpet samples are heavy, and our salesman is traveling on foot. Distance by itself isn't hugely important (assume every edge has cost 1), but he doesn't want to carry around a whole bunch of samples any more than he needs to. So, he needs to plan his trip such that he visits the factories which have a lot of rare carpets (and where he will have to pick up a lot of new samples) last.
For the example paths above, here are the numbers of samples carried on each leg of the journey, and their sum:

Route  Leg 1  Leg 2  Leg 3  Sum
1      0      3      4      7
2      0      3      5      8
3      0      2      4      6
4      0      2      3      5
5      0      3      5      8
6      0      3      3      6
We can see now that route 2 is very bad: first he has to carry 3 samples from F1 to F3, then he has to carry 5 samples from F3 to F2! Instead, he could have gone with route 4: he would first carry 2 samples from F2 to F3, and then 3 samples from F3 to F1.
Also, as shown in the last column, the sum of the samples carried over every edge is a good metric for how many samples he had to carry overall: the number of samples he is carrying can never decrease, so visiting varied factories early on will necessarily inflate the sum, and a low sum is only possible by visiting similar factories with few carpets first.
Is this a known problem? Is there an algorithm to solve it?
Note: I would recommend being careful about making assumptions based on my example problem. I came up with it on the spot, and deliberately kept it small for brevity. It is certain there are many edge cases that it fails to catch.
As the size of the graph is small, we can consider using a bitmask and dynamic programming to solve this problem (similar to how we solve the travelling salesman problem).
Assume that we have 6 cities in total to visit. Then the starting state is 0 and the ending state is 111111b, or 63 in decimal.
At each step, if the state is x, we can easily calculate the number of samples the salesman is carrying, and the cost of going from state x to state y is the number of newly added samples from x to y times the number of unvisited cities.
import java.util.HashSet;
import java.util.Set;

public class CarpetRoute {
    int numberOfCity;          // total number of cities (factories)
    Set<Integer>[] citySample; // carpet kinds available at each city
    Integer[] memo;            // memoised cost per mask; size 1 << numberOfCity

    // Minimum remaining carrying cost, given the set of cities already visited.
    public int cal(int mask) {
        if (mask == (1 << numberOfCity) - 1) { // visited all cities
            return 0;
        }
        if (memo[mask] != null) {
            return memo[mask];
        }
        Set<Integer> sampleSet = new HashSet<>(); // samples collected so far
        int left = 0;                             // number of unvisited cities
        for (int i = 0; i < numberOfCity; i++) {
            if (((1 << i) & mask) != 0) {         // this city was visited
                sampleSet.addAll(citySample[i]);
            } else {
                left++;
            }
        }
        int cost = Integer.MAX_VALUE;
        for (int i = 0; i < numberOfCity; i++) {
            if (((1 << i) & mask) == 0) {
                Set<Integer> added = new HashSet<>(citySample[i]);
                added.removeAll(sampleSet);       // newly added samples at city i
                int dif = added.size();
                // dif * left overstates every route by the same constant
                // (each sample is picked up exactly once), so the minimum
                // is attained by the same route as with dif * (left - 1).
                cost = Math.min(cost, dif * left + cal(mask | (1 << i)));
            }
        }
        return memo[mask] = cost;
    }
}
In the case where there are edges between every pair of nodes, and each carpet is only available at one location, this looks tractable. If you pick up X carpets when there are Y steps to go, then the contribution from this to the final cost is XY. So you need to minimise SUM_i XiYi where Xi is the number of carpets picked up when you have Yi steps to go. You can do this by visiting the factories in increasing order of the number of carpets to be picked up at that factory. If you provide a schedule in which you pick up more carpets at A than B, and you visit A before B, I can improve it by swapping the times at which you visit A and B, so any schedule that does not follow this rule is not optimal.
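A minimal sketch of that exchange argument in Python, under the stated assumptions (complete graph, every carpet available at exactly one factory; the disjoint sample sets below are hypothetical stand-ins, since the original example shares C1 across factories):

def best_route_cost(factory_samples):
    # Visit factories in increasing order of samples picked up there;
    # cost = sum over stops of (samples picked up) * (edges still to walk).
    sizes = sorted(len(s) for s in factory_samples)
    steps = len(sizes) - 1  # edges remaining after the first stop
    return sum(x * (steps - i) for i, x in enumerate(sizes))

print(best_route_cost([{"C2", "C3"}, {"C1"}, {"C4", "C5"}]))  # 1*2 + 2*1 + 2*0 = 4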
