Optimized algorithm to synchronize two arrays

I am looking for an efficient algorithm to synchronize two arrays. Let's say a1 and a2 are two arrays given as input.
a1 - C , C++ , Java , C# , Perl
a2 - C++ , Python , Java , Cw , Haskell
Output 2 arrays:
Output A1: C , C++ , Java
Output A2: Cw , Haskell , Python
Output A1:
1) items common to both arrays
2) items only in A1 and not in A2
Output A2:
items only in a2
Thanks in advance.
Raj

Sort both arrays with an efficient sorting algorithm - complexity O(n log n).
Build the output arrays, initially empty.
Compare the current element a1 of sorted A1 to the current element a2 of sorted A2:
Equal means the item is in both arrays: put a1 into OutputA1 and advance both.
a1 < a2 means a1 is only in A1: put a1 into OutputA1; the next element of sorted A1 now becomes a1.
Else a2 < a1 means a2 is only in A2: put a2 into OutputA2; the next element of sorted A2 now becomes a2.
Do this until you have processed all elements in both sorted arrays - complexity O(n).
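The steps above can be sketched in Python (an illustrative choice, since the question names no language). Note that under the stated rules, OutputA1 holds both the common items and the items only in a1, so for the example inputs it also contains C# and Perl:

```python
def sync(a1, a2):
    """Single merge-style pass over the two sorted inputs.

    out1 collects items common to both plus items only in a1;
    out2 collects items only in a2 (per the question's rules).
    """
    s1, s2 = sorted(a1), sorted(a2)        # O(n log n)
    i = j = 0
    out1, out2 = [], []
    while i < len(s1) and j < len(s2):     # O(n) merge pass
        if s1[i] == s2[j]:                 # in both arrays
            out1.append(s1[i])
            i += 1
            j += 1
        elif s1[i] < s2[j]:                # only in a1
            out1.append(s1[i])
            i += 1
        else:                              # only in a2
            out2.append(s2[j])
            j += 1
    out1.extend(s1[i:])                    # leftovers are only in a1
    out2.extend(s2[j:])                    # leftovers are only in a2
    return out1, out2
```

For the example inputs, `sync(["C", "C++", "Java", "C#", "Perl"], ["C++", "Python", "Java", "Cw", "Haskell"])` returns `(["C", "C#", "C++", "Java", "Perl"], ["Cw", "Haskell", "Python"])`.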


Data structure traversal

Let's say I have package A version 1 and package A version 2; we'll call them A1 and A2 respectively.
If I have a pool of packages: A1, A2, B1, B2, C1, C2, D1, D2
A1 depends on B1; we'll represent this as (A1, (B1)).
A1 also depends on any version of package C ("C1 or C2 satisfies A1"), represented as (A1, (C1, C2)).
Combining A1's deps together, the A1 data structure becomes: (A1, (B1), (C1, C2))
Also, B1 depends on D1: (B1, (D1))
So the A1 structure becomes: (A1, ((B1, (D1))), (C1, C2))
Similarly, the A2 structure is (A2, ((B2, (D2))), (C1, C2))
My question is: how can I select the best candidate among the versions of package A, based on a condition (for example, that the package does not conflict with the currently installed packages)?
by combining A1 and A2: ((A1, ((B1, (D1))), (C1, C2)), (A2, ((B2, (D2))), (C1, C2)))
How can I traverse this data structure?
Start with A1: if it doesn't conflict, check B1; if that doesn't conflict, check D1; if that doesn't conflict, check (C1, C2) and take only one of C1 or C2.
With this I end up selecting (A1, B1, D1, C1).
If A1 or any of its deps does not meet the condition (for example, if B1 conflicts with the installed packages), then drop A1 entirely and move on to check A2, ending up with (A2, B2, D2, C1).
What kind of traversal would that be?
I have been reading about in-order, pre-order, post-order traversal, and wondering if I need to do something similar here.
Assuming you are asking about traversal for the more generic problem rather than this particular instance, I don't think such a traversal exists.
Note that in-order traversal is only applicable to BINARY trees; other kinds of trees have no in-order traversal. If your generic problem has B1, B2, B3, then apparently there wouldn't be a binary-tree representation.
One property of traversal is that the tree holds all the information in itself: when you traverse a tree you never worry about "external information". In your case the tree's information is not complete - you depend on external information to see whether there is a conflict (e.g. that B1 is installed), and this information is never in the tree.
You can use an adjacency list to represent the data:
Suppose the packages are A1, A2, B1, B2, C1, C2.
And A1 depends on B1 and C2, A2 depends on B1 and C1 and C2.
The above data can be represented as
[A1] -> [B1, C2]
[A2] -> [B1, C1, C2]
Use Topological Sorting to get the order of dependencies
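Alternatively, the check-and-fall-back procedure the question describes can be written directly as a pre-order depth-first search with backtracking over alternatives. A minimal Python sketch (the thread has no reference implementation, so the function names and the conflicts set are illustrative):

```python
def resolve(pkg, deps, conflicts):
    """Return the package list selected for pkg, or None if pkg or
    any of its dependency groups cannot be satisfied."""
    if pkg in conflicts:
        return None
    selection = [pkg]
    # Each dependency group is a list of acceptable alternatives,
    # e.g. ["C1", "C2"] means "C1 or C2 satisfies the dependency".
    for group in deps.get(pkg, []):
        for alt in group:
            sub = resolve(alt, deps, conflicts)
            if sub is not None:          # first conflict-free alternative wins
                selection.extend(sub)
                break
        else:
            return None                  # no alternative worked: drop pkg

    return selection

def best_candidate(versions, deps, conflicts):
    """Try each version of the package in order; return the first
    fully conflict-free selection."""
    for v in versions:
        sel = resolve(v, deps, conflicts)
        if sel is not None:
            return sel
    return None

# The dependency structure from the question.
deps = {
    "A1": [["B1"], ["C1", "C2"]],
    "A2": [["B2"], ["C1", "C2"]],
    "B1": [["D1"]],
    "B2": [["D2"]],
}
```

With no conflicts this selects (A1, B1, D1, C1); if B1 conflicts with an installed package, the whole A1 branch is dropped and the result is (A2, B2, D2, C1), matching the walkthrough in the question.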

Pig - How to use a nested for loop in pig to get the list of elements inside a tuple?

I have an intermediate pig structure like
(A, B, (a variable number of Cs))
example:
(a1,b1, (c11,c12))
(a2,b2, (c21))
(a3,b3, (c31,c32, c33))
Now, I want the data in format
(a1, b1, c11)
(a1, b1, c12)
(a2, b2, c21) etc.
How do I go about doing it?
Essentially I want the size of the tuples, and then use this size for running a nested for loop.
Can you try the below approach?
input
a1 b1 (c11,c12)
a2 b2 (c21)
a3 b3 (c31,c32,c33)
PigScript:
A = LOAD 'input' AS(f1,f2,T:(f3:chararray));
B = FOREACH A GENERATE f1,f2,FLATTEN(T);
C = FOREACH B GENERATE f1,f2,FLATTEN(TOKENIZE(T::f3));
DUMP C;
Output:
(a1,b1,c11)
(a1,b1,c12)
(a2,b2,c21)
(a3,b3,c31)
(a3,b3,c32)
(a3,b3,c33)

Efficiently find lowest sum paths

This is a big ask but I'm a bit stuck!
I am wondering if there is a name for this problem, or a similar one.
I am probably overcomplicating finding the solution, but I can't think of a way without a full brute-force exhaustive search (my current implementation). This is not acceptable for the application involved.
I am wondering if there are any ways of simplifying this problem, or implementation strategies I could employ (language/tool choice is open).
Here is a quick description of the problem:
Given n sequences of length k:
a = [0, 1, 1] == [a1, a2, a3]
b = [1, 0, 2] == [b1, b2, b3]
c = [0, 0, 2] == [c1, c2, c3]
find paths of length k through the sequences like so (I'll give examples starting at a1, but hopefully you get the idea; the same paths need to be derived from b1 and c1):
a1 -> a2 -> a3
a1 -> b1 -> b2
a1 -> b1 -> a2
a1 -> b1 -> c1
a1 -> c1 -> c2
a1 -> c1 -> a2
a1 -> c1 -> b1
I want to know, which path(s) are going to have the lowest sum:
a1 -> a2 -> a3 == 2
a1 -> b1 -> b2 == 1
a1 -> b1 -> a2 == 2
a1 -> b1 -> c1 == 1
a1 -> c1 -> c2 == 0
a1 -> c1 -> a2 == 1
a1 -> c1 -> b1 == 1
So in this case, out of the sample a1 -> c1 -> c2 is the lowest.
EDIT:
Sorry, just to clear up the rules for deriving a path.
For example, you can move from node a1 to b2 only if you haven't already exhausted b2 and have exhausted the previous node in that sequence (b1).
An alternative solution using Dynamic Programming
Let's assume the arrays are given as a matrix A such that each row is identical to one of the original arrays. Your matrix will be of size (n+1)x(k+1), and make sure that A[_][0] = 0
Now, use DP to solve it:
f(x,y,z) = min { f(i,y,z-1) | x < i <= n} [union] { f(i+1,0,z) } + A[x][y]
f(_,_,0) = 0
f(n,k,z) = infinity for each z > 0
Idea: in each step you can choose to go to any of the following lines (same column), or go to the next column while decreasing the number of nodes still needed.
Moving to the next column is done via the dummy index A[_][0], without decreasing the number of nodes still needed and without cost, since A[_][0] = 0.
Complexity:
This solution is basically brute force, but by memoizing each already-explored value of f(_,_,_) you only need to fill a matrix of size O(n*k^2), where each cell naively takes O(n) time to compute - but in practice it can be computed iteratively in O(1) per step, because you only need to minimize with the new element in the row (1). This gives you O(n*k^2) - better than brute force.
(1) This is done via min{x1,x2,...,xk} = min{xk, min{x1,...,x(k-1)}}, and we already know min{x1,...,x(k-1)}.
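For comparison, the brute-force baseline can be written quite compactly. Under the rule clarified in the question's edit, the visited nodes always form a prefix of each sequence, so a path's sum depends only on how many elements are taken from each sequence, not on the interleaving order. A hedged Python sketch under that assumption:

```python
def min_path_sum(seqs, k):
    """Minimum sum over all valid length-k paths through the sequences.

    Because a node can only be visited after its predecessor in the same
    sequence, the visited nodes always form a prefix of each sequence, so
    it suffices to enumerate how many elements to take from each one.
    """
    best = [float("inf")]

    def rec(i, remaining, total):
        if i == len(seqs):
            if remaining == 0:
                best[0] = min(best[0], total)
            return
        rec(i + 1, remaining, total)          # take nothing from seqs[i]
        prefix = 0
        for j in range(min(remaining, len(seqs[i]))):
            prefix += seqs[i][j]              # take elements 0..j of seqs[i]
            rec(i + 1, remaining - (j + 1), total + prefix)

    rec(0, k, 0)
    return best[0]
```

For the example above, `min_path_sum([[0, 1, 1], [1, 0, 2], [0, 0, 2]], 3)` returns 0, matching the path a1 -> c1 -> c2.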
You can implement a modified version of the A* algorithm.
1. Copy the matrix and fill the copy with 0s.
2. For each secondary diagonal m, from the last to the first:
3. For each cell n in m:
4. Set the new matrix's cell n = the old matrix's cell n plus min(the cell below n in the new matrix, the cell to the right of n in the new matrix).
Cell 0,0 in the new matrix then holds the cost of the shortest path.
Implement the A* algorithm over the pseudocode above.

Most commonly occurring combination

I have a list of integer arrays, where each array holds some numbers in sorted order. I want to find the most commonly occurring combination (sequence of integers) across all the arrays. For example, if the list of arrays is as follows:
A1 - 1 2 3 5 7 8
A2 - 2 3 5 6 7
A3 - 3 5 7 9
A4 - 1 2 3 7 9
A5 - 3 5 7 10
Here
{3,5,7} - {A1,A3,A5}
{2,3} - {A1,A2,A4}
So we can take {3,5,7} or {2,3} as the most commonly occurring combinations.
Now the algorithm I used is the following:
Find the intersection of each set with all the others and store the resulting set, incrementing a resulting set's occurrence count if it already exists.
For example, find the intersections of all the pairs below:
A1 intersection A2
A1 intersection A3
A1 intersection A4
A1 intersection A5
A2 intersection A3
A2 intersection A4
A2 intersection A5
A3 intersection A4
A3 intersection A5
A4 intersection A5
Here A1 intersection A3 is the same as A3 intersection A5, hence the occurrence count of the set {3,5,7} can be set to 2.
Similarly each resulting set occurrence can be determined.
But this algorithm demands O(n^2) complexity.
Given that each set is sorted, I am pretty sure we can find a better algorithm with O(n) complexity, which I am not able to pen down.
Can anyone suggest an O(n) algorithm for the same?
If you have a sequence of length n, then its prefix of length n-1 occurs at least as often - a degenerate case is the most common character, a sequence of length 1 that occurs at least as often as any longer sequence. Do you have a minimum sequence length you are interested in?
Regardless of this, one idea is to concatenate all of the sequences, separating them by distinct integers that appear nowhere else, and then compute the suffix array (http://en.wikipedia.org/wiki/Suffix_array) in linear time. One pass through the suffix array should allow you to find the most common subsequence of any given length - and such a subsequence cannot cross the gap between two different arrays, because the separating characters are unique, so any sequence containing one is unique. (See also the LCP array: http://en.wikipedia.org/wiki/LCP_array.)
This example in Haskell does not scan intersections. Rather, it lists the sub-sequences of each list and aggregates them into a map indexed by sub-sequence. To look up the most commonly occurring sub-sequences, simply show the entries with the most list indexes. The output is filtered to sub-sequences of length greater than 1, and is a list of tuples showing each sub-sequence and the indexes of the lists in which it appears:
*Main> combFreq [[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]
[([3,5],[4,2,1,0]),([5,7],[4,2,0]),([3,5,7],[4,2,0]),([2,3],[3,1,0]),([7,9],[3,2]),([2,3,5],[1,0]),([1,2,3],[3,0]),([1,2],[3,0])]
import Data.List
import qualified Data.Map as M
import Data.Function (on)

sInt xs = concat $ zipWith (\x y -> zip (subs x) (repeat y)) xs [0..]
    where subs = filter (not . null) . concatMap inits . tails

res xs = foldl' (\x y -> M.insertWith (++) (fst y) [snd y] x) M.empty (sInt xs)

combFreq xs = reverse $ sortBy (compare `on` (length . snd))
            . filter (not . null . drop 1 . snd)
            . filter (not . null . drop 1 . fst)
            . M.toList
            . res $ xs

Algorithms to create a tabular representation of a DAG?

Given a DAG, in which each node belongs to a category, how can this graph be transformed into a table with a column for each category? The transformation doesn't have to be reversible, but should preserve useful information about the structure of the graph; and should be a 'natural' transformation, in the sense that a person looking at the graph and the table should not be surprised by any of the rows. It should also be compact, i.e. have few rows.
For example given a graph of nodes a1,b1,b2,c1 with edges a1->b1, a1->b2, b1->c1, b2->c1 (i.e. a diamond-shaped graph) I would expect to see the following table:
a b c
--------
a1 b1 c1
a1 b2 c1
I've thought about this problem quite a bit, but I'm having trouble coming up with an algorithm that gives intuitive results on certain graphs. Consider the graph a1,b1,c1 with edges a1->c1, b1->c1. I'd like the algorithm to produce this table:
a b c
--------
a1 b1 c1
But maybe it should produce this instead:
a  b  c
--------
a1    c1
a1 b1
I'm looking for creative ideas and insights into the problem. Feel free to vary to simplify or constrain the problem if you think it will help.
Brainstorm away!
Edit:
The transformation should always produce the same set of rows, although the order of rows does not matter.
The table should behave nicely when sorting and filtering with, e.g., Excel. This means that multiple nodes cannot be packed into a single cell of the table - only one node per cell.
What you need is a variation of topological sorting. This is an algorithm that "sorts" graph vertices as if an a---->b edge meant a > b. Since the graph is a DAG there are no cycles, and this > relation is transitive, so at least one sorting order exists.
For your diamond-shaped graph two topological orders exist:
a1 b1 b2 c1
a1 b2 b1 c1
b1 and b2 are not connected, even indirectly; therefore they may be placed in either order.
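This sorting step can be sketched with Kahn's algorithm (Python is used here purely for illustration; the node names are from the diamond example):

```python
from collections import deque

def topo_sort(nodes, edges):
    """Kahn's algorithm: repeatedly emit a vertex with no remaining in-edges."""
    indeg = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:   # all prerequisites emitted
                queue.append(v)
    return order

# The diamond-shaped graph from the question.
order = topo_sort(["a1", "b1", "b2", "c1"],
                  [("a1", "b1"), ("a1", "b2"), ("b1", "c1"), ("b2", "c1")])
```

Which of the two valid orders comes out depends only on how ties among ready vertices (here b1 and b2) are broken.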
After you have sorted the graph, you know an approximation of the order. My proposal is to fill the table in a straightforward way (one vertex per line) and then "compact" it. Perform the sort and take the resulting sequence as output. Fill the table from top to bottom, assigning each vertex to its column:
a  b  c
--------
a1
   b2
   b1
      c1
Now compact the table by walking from top to bottom (and then make a similar pass from bottom to top). On each iteration, you take a closer look at the "current" row (marked with =>) and the "next" row.
If the nodes in a column differ between the current and next rows, do nothing for this column:
from:                   to:
   X  b  c                 X  b  c
   --------                --------
=> X1 .  .                 X1 .  .
   X2 .  .              => X2 .  .
If column X of the next row has no vertex (the table cell is empty) and the current row has vertex X1 there, then you sometimes should fill the empty cell with the vertex from the current row. But not always: you want your table to be logical, don't you? So copy the vertex if and only if there is no edge b--->X1, c--->X1, etc., for any vertex in the current row.
from:                   to:
   X  b  c                 X  b  c
   --------                --------
=> X1 b  c                 X1 b  c
      b1 c1             => X1 b1 c1
(Edit:) After the first (forward) and second (backward) passes, you'll have these tables:
first:          second:
a  b  c         a  b  c
--------        --------
a1              a1 b2 c1
a1 b2           a1 b2 c1
a1 b1           a1 b1 c1
a1 b1 c1        a1 b1 c1
Then, just remove equal rows and you're done:
a b c
--------
a1 b2 c1
a1 b1 c1
And you should get a nice table. O(n^2).
How about compacting all the nodes reachable from one node together in one cell? For example, your first DAG would look like:
a   b        c
---------------
a1  [b1,b2]
    b1       c1
    b2       c1
It sounds like a train system map with stations within zones (a,b,c).
You could be generating a table of all possible routes in one direction. In that case "a1, b1, c1" would seem to imply a1->b1, so don't format it like that if you have only a1->c1 and b1->c1.
You could decide to produce a table by listing the longest routes starting in zone a,
using each edge only once, ending with the short leftover routes. Or allow edges to be reused only if they connect unused edges or extend a route.
In other words, do a depth first search, trying not to reuse edges (reject any path that doesn't include unused edges, and optionally trim used edges at the endpoints).
Here's what I ended up doing:
Find all paths emanating from a node without in-edges. (Could be expensive for some graphs, but works for mine)
Traverse each path to collect a row of values
Compact the rows
Compacting the rows is done as follows.
For each pair of columns x, y:
Construct a map from every value of x to its possible values of y.
Create another map for the entries that have only one distinct value of y, mapping each value of x to its single value of y.
Fill in the blanks using these maps. When filling in a value, check for related blanks that can then be filled.
This gives a very compact output and seems to meet all my requirements.
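Steps 1 and 2 of this approach (enumerate all paths from in-edge-free nodes, then place each path's nodes into their category columns) can be sketched as below; Python and the category function are illustrative choices, not the poster's actual code, and the compaction step is omitted:

```python
def paths_to_rows(edges, category):
    """Enumerate every path from a node with no in-edges down to a sink,
    and turn each path into a row keyed by column (category) name."""
    adj = {}
    targets = set()
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        targets.add(v)
    sources = [u for u in adj if u not in targets]   # nodes with no in-edges

    rows = []

    def walk(node, path):
        path = path + [node]
        if node not in adj:                          # sink reached: emit a row
            rows.append({category(n): n for n in path})
        else:
            for nxt in adj[node]:
                walk(nxt, path)

    for s in sources:
        walk(s, [])
    return rows

# Diamond example; here the category is just the node name's leading letter.
rows = paths_to_rows([("a1", "b1"), ("a1", "b2"), ("b1", "c1"), ("b2", "c1")],
                     lambda n: n[0])
```

For the diamond this yields exactly the two rows (a1, b1, c1) and (a1, b2, c1) from the question's expected table.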
