I have an edge list representing a graph network, like below:
Input: [(A,1),(A,2),(B,1),(B,2),(C,2),(C,3)]
What would be the optimal way to transform it into the following list?
Output: [(A,B,2),(B,C,1),(A,C,1)]. In the Output list, each entry represents two nodes and a similarity measure over the other set of nodes.
The Input list represents the first graph in the figure, and the Output list represents the association among its nodes (graph 2 in the figure).
Here is what I did: I used a self join on the input list and tried to count the entries to calculate the edge values.
But in that case I get lots of redundant entries (because of the join), and it is not effective when I have lots of data.
The self join works like this: (A,1),(B,1) gives (A,1,B), as I am joining on the number node; after that, I have to count the matching results to get the edge value.
We can get the information you want by building, for every number-vertex, a list of the letter-vertices connected to it; then, for every pair of letter-vertices present in the same list, we add 1 to a similarity counter for that pair.
I used a standard dict with the .setdefault method to store the lists of letter-vertices connected to every number-vertex in adjacency_dict.
I used a defaultdict to store the similarity counts in sim_dict. The choice between dict + setdefault or defaultdict is entirely a matter of taste; since there were two dicts in my code, I used one of each to showcase their ease of use. As for my personal taste, I like dict.setdefault when the dict holds mutables (such as lists) and I prefer defaultdict when the dict holds non-mutables (such as ints). Note that since sim_dict is holding counts, using a Counter would also have been appropriate.
To find all possible pairs of letter-vertices present in the same list, I used itertools.combinations along with sorted to make sure that combinations are always in alphabetical order, so that ('A', 'C') cannot be treated as a different pair than ('C', 'A').
import itertools    # combinations
import collections  # defaultdict

edge_list = [('A',1),('A',2),('B',1),('B',2),('C',2),('C',3)]

adjacency_dict = {}
for letter, num in edge_list:
    adjacency_dict.setdefault(num, []).append(letter)
print(adjacency_dict)
# {1: ['A', 'B'], 2: ['A', 'B', 'C'], 3: ['C']}

sim_dict = collections.defaultdict(int)
for letters in adjacency_dict.values():
    for a, b in itertools.combinations(sorted(letters), 2):
        sim_dict[(a, b)] += 1
print(sim_dict)
# defaultdict(<class 'int'>, {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1})

sim_list = [(a, b, s) for (a, b), s in sim_dict.items()]
print(sim_list)
# [('A', 'B', 2), ('A', 'C', 1), ('B', 'C', 1)]
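For completeness, here is a small variant of the counting step using collections.Counter instead of a defaultdict, as mentioned above (assuming the same adjacency_dict):

import collections  # Counter
import itertools

sim_counter = collections.Counter(
    pair
    for letters in adjacency_dict.values()
    for pair in itertools.combinations(sorted(letters), 2))
print(sim_counter)
# Counter({('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 1})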
Relevant documentation:
dict.setdefault;
collections.defaultdict;
collections.Counter;
itertools.combinations.
Related
Given an array of letters and a dict of words, return the maximum number of words from the dict that can be spelled using the letters in the array. Each letter can be used only once, and there may be more than one instance of the same letter in the array.
We can assume that each word in the dict can, on its own, be spelled using the letters. The goal is to return the maximum number of words.
Example 1:
arr = ['a', 'b', 'z', 'z', 'z', 'z']
dict = ['ab', 'azz', 'bzz']
// returns 2 ( for [ 'azz', 'bzz' ])
Example 2:
arr = ['g', 't', 'o', 'g', 'w', 'r', 'd', 'e', 'a', 'b']
dict = ['we', 'bag', 'got', 'word']
// returns 3 ( for ['we', 'bag', 'got'] )
EDIT for clarity to adhere to SO guidelines:
Looking for a solution. I was given this problem during an interview. My solution is below, but it was rejected as too slow.
1.) For each word in dict, w
- Remove w's letters from the arr.
- With the remaining letters, count how many other words could be spelled.
Put that # as w's "score"
2.) With every word "scored", select the word with the highest score,
remove that word and its letters from the input arrays.
3.) Repeat this process until no more words can be spelled from the remaining
set of letters.
This is a fairly generic packing problem with up to 26 resources. If I were trying to solve this problem in practice, I would formulate it as an integer program and apply an integer program solver. Here's an example formulation for the given instance:
maximize x_ab + x_azz + x_bzz
subject to
constraint a: x_ab + x_azz <= 1
constraint b: x_ab + x_bzz <= 1
constraint z: 2 x_azz + 2 x_bzz <= 4
x_ab, x_azz, x_bzz in {0, 1} (or integer >= 0 depending on the exact variant)
The solver will solve the linear relaxation of this program and in the process put a price on each letter indicating how useful it is to make words, which guides the solver quickly to a provably optimal solution on surprisingly large instances (though this is an NP-hard problem for arbitrary-size alphabets, so don't expect much on artificial instances such as those resulting from NP-hardness reductions).
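For illustration, here is a rough sketch of that formulation using the PuLP library (my choice, not the answer's; any MILP interface would do), generalized to one binary variable per word and one constraint per letter:

from collections import Counter
import pulp

arr = ['a', 'b', 'z', 'z', 'z', 'z']
words = ['ab', 'azz', 'bzz']

available = Counter(arr)
prob = pulp.LpProblem("word_packing", pulp.LpMaximize)
x = {w: pulp.LpVariable("x_" + w, cat="Binary") for w in words}

# objective: number of words spelled
prob += pulp.lpSum(x.values())

# one constraint per letter: total usage must not exceed availability
for letter, count in available.items():
    prob += pulp.lpSum(Counter(w)[letter] * x[w] for w in words) <= count

prob.solve()
print([w for w in words if x[w].value() == 1])   # e.g. ['azz', 'bzz']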
I don't know what your interviewer was looking for -- maybe a dynamic program whose states are multisets of unused letters.
One possible dynamic programming recurrence is the following:
WordCount(dict, i, remainingLetterCounts) =
    max(WordCount(dict, i-1, remainingLetterCounts),                                   // skip dict[i]
        1 + WordCount(dict, i-1, remainingLetterCounts minus the letters of dict[i]))  // take dict[i], if it fits
I see it as a multidimensional problem. Was the interviewer impressed by your answer?
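A memoized sketch of this recurrence, assuming each word in dict may be used at most once; the tuple-of-counts state encoding and all names are mine:

from collections import Counter
from functools import lru_cache

def max_words(arr, words):
    letters = sorted(set(arr))
    word_counts = [Counter(w) for w in words]

    @lru_cache(maxsize=None)
    def solve(i, remaining):
        # `remaining` is a tuple of letter counts aligned with `letters`
        if i < 0:
            return 0
        best = solve(i - 1, remaining)                   # skip words[i]
        rem = dict(zip(letters, remaining))
        need = word_counts[i]
        if all(rem.get(c, 0) >= n for c, n in need.items()):
            new_rem = tuple(rem[c] - need[c] for c in letters)
            best = max(best, 1 + solve(i - 1, new_rem))  # take words[i]
        return best

    start = Counter(arr)
    return solve(len(words) - 1, tuple(start[c] for c in letters))

print(max_words(['a', 'b', 'z', 'z', 'z', 'z'], ['ab', 'azz', 'bzz']))  # 2

The number of distinct states is the product of (count + 1) over the letters, so this is only practical when the letter multiplicities stay small.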
Turn the list of letters into a set of letter-occurrence pairs, where the occurrence index is incremented on each repeat of the same letter in the list; e.g. aba becomes the set {a-1, b-1, a-2}.
Translate each word in the dictionary, independently, in a similar manner; so the word coo becomes the set {c-1, o-1, o-2}.
A word is accepted if the set of its letter-occurences is a subset of the set generated from the original list of letters.
For fixed alphabet, and maximum letter frequencies, this could be implemented quite quickly using bitsets, but, again, how fast is fast enough?
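For what it's worth, a minimal sketch of that acceptance test using collections.Counter as the multiset of letter occurrences (the function name is mine):

from collections import Counter

def can_spell(word, letters):
    # the word's letter multiset must be contained in the available multiset
    return not (Counter(word) - Counter(letters))

print(can_spell('azz', ['a', 'b', 'z', 'z', 'z', 'z']))  # True
print(can_spell('abb', ['a', 'b', 'z', 'z', 'z', 'z']))  # False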
There is plenty of literature on the Web about the longest common subsequence problem, but I have a slightly different problem and was wondering if anyone knows of a fast algorithm.
Say, you have a collection of paths:
[1,2,3,4,5,6,7], [2,3,4,9,10], [3,4,6,7], ...
We see that subpath [3,4] is the most common.
Does anyone know of a neat algorithm to find this? In my case there are tens of thousands of paths!
Assuming that a "path" has to encompass at least two elements, then the most common path will obviously have two elements (although there could also be a path with more than two elements that's equally common -- more on this later). So you can just iterate all the lists and count how often each pair of consecutive numbers appears in the different lists and remember those pairs that appear most often. This requires iterating each list once, which is the minimum amount you'd have to do in any case.
If you are interested in the longest most common path, then you can start the same way, finding the most common 2-segment-paths, but additionally to the counts, also record the position of each of those segments (e.g. {(3,4): [2, 1, 0], ...} in your example, the numbers in the list indicating the position of the segment in the different paths). Now, you can take all the most-common length-2-paths and see if for any of those, the next element is also the same for all the occurrences of that path. In this case you have a most-common length-3-path that is equally common as the prior length-2 path (it can not be more common, obviously). You can repeat this for length-4, length-5, etc. until it can no longer be expanded without making the path "less common". This part requires extra work of n*k for each expansion, with n being the number of candidates left and k how often those appear.
(This assumes that frequency beats length, i.e. if there is a length-2 path appearing three times, you prefer this over a length-3 path appearing twice. The same approach can also be used for a different starting length, e.g. requiring at least length-3 paths, without changing the basic algorithm or the complexity.)
Here's a simple example implementation in Python to demonstrate the algorithm. This only goes up to length-3, but could easily be extended to length-4 and beyond with a loop. Also, it does not check any edge-cases (array-out-of-bounds etc.)
# example data
data = [[1, 2,    4, 5, 6, 7,    9],
        [1, 2, 3, 4, 5, 6,    8, 9],
        [1, 2,    4, 5, 6, 7, 8   ]]

# step one: count how often and where each pair appears
from collections import defaultdict
pairs = defaultdict(list)
for i, lst in enumerate(data):
    for k, pair in enumerate(zip(lst, lst[1:])):
        pairs[pair].append((i, k))

# step two: find most common pair and filter
most = max([len(lst) for lst in pairs.values()])
pairs = {k: v for k, v in pairs.items() if len(v) == most}
print(pairs)
# {(1, 2): [(0, 0), (1, 0), (2, 0)], (4, 5): [(0, 2), (1, 3), (2, 2)], (5, 6): [(0, 3), (1, 4), (2, 3)]}

# step three: expand pairs to triplets, triplets to quadruples, etc.
triples = [k + (data[v[0][0]][v[0][1] + 2],)
           for k, v in pairs.items()
           if len(set(data[i][k + 2] for (i, k) in v)) == 1]
print(triples)
# [(4, 5, 6)]
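As a sketch of the extension loop hinted at above, under the same frequency-beats-length assumption; it returns the longest of the equally-most-common paths, and the function name is mine:

from collections import defaultdict

def most_common_path(data):
    # step one/two as above: most common length-2 segments and their positions
    segments = defaultdict(list)
    for i, lst in enumerate(data):
        for k, pair in enumerate(zip(lst, lst[1:])):
            segments[pair].append((i, k))
    most = max(len(v) for v in segments.values())
    segments = {k: v for k, v in segments.items() if len(v) == most}
    # step three, looped: extend while every occurrence agrees on the next element
    length = 2
    while True:
        longer = {}
        for seg, occ in segments.items():
            nxt = {data[i][k + length] if k + length < len(data[i]) else None
                   for i, k in occ}
            if len(nxt) == 1 and None not in nxt:
                longer[seg + (nxt.pop(),)] = occ
        if not longer:
            return list(segments)   # most common path(s) of maximal length
        segments = longer
        length += 1

print(most_common_path([[1,2,4,5,6,7,9], [1,2,3,4,5,6,8,9], [1,2,4,5,6,7,8]]))
# [(4, 5, 6)]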
I need a particular form of 'set' partitioning that is escaping me, as it's not quite partitioning. Or rather, it's the subset of all partitions of a particular list that maintain the original order.
I have a list of n elements [a,b,c,...,n] in a particular order.
I need to get all discrete variations of partitioning that maintains the order.
So, for four elements, the result will be:
[{a,b,c,d}]
[{a,b,c},{d}]
[{a,b},{c,d}]
[{a,b},{c},{d}]
[{a},{b,c,d}]
[{a},{b,c},{d}]
[{a},{b},{c,d}]
[{a},{b},{c},{d}]
I need this for producing all possible groupings of tokens in a list that must maintain their order, for use in a broader pattern matching algorithm.
I've found only one other question that relates to this particular issue here, but it's for Ruby. As I don't know the language (it looks like someone put code in a blender) and I don't particularly feel like learning a language just for the sake of deciphering an algorithm, I feel I'm out of options.
I've tried to work it out mathematically so many times in so many ways it's getting painful. I thought I was getting closer by producing a list of partitions and iterating over it in different ways, but each number of elements required a different 'pattern' for iteration, and I had to tweak them in by hand.
I have no way of knowing just how many elements there could be, and I don't want to put an artificial cap on my processing to limit it just to the sizes I've tweaked together.
You can think of the problem as follows: each of the partitions you want is characterized by an integer between 0 and 2^(n-1) - 1. Each 1 in the binary representation of such a number corresponds to a "partition break" between two consecutive elements, e.g.
a b|c|d e|f
0 1 1 0 1
so the number 01101 corresponds to the partition {a,b},{c},{d,e},{f}. To generate the partition from a known partition number, loop through the list and slice off a new subset whenever the corresponding bit is set.
I can understand your pain reading the fashionable functional-programming-flavored Ruby example. Here's a complete example in Python if that helps.
array = ['a', 'b', 'c', 'd', 'e']
n = len(array)

for partition_index in range(2 ** (n - 1)):
    # current partition, e.g., [['a', 'b'], ['c', 'd', 'e']]
    partition = []
    # used to accumulate the subsets, e.g., ['a', 'b']
    subset = []
    for position in range(n):
        subset.append(array[position])
        # check whether to "break off" a new subset
        if 1 << position & partition_index or position == n - 1:
            partition.append(subset)
            subset = []
    print(partition)
Here's my recursive implementation of the partitioning problem in Python. For me, recursive solutions are always easier to comprehend. You can find more explanation about it here.
# Prints partitions of a set: [1,2] -> [[1],[2]], [[1,2]]
def part(lst, current=[], final=[]):
    if len(lst) == 0:
        if len(current) == 0:
            print(final)
        elif len(current) > 1:
            print([current] + final)
    else:
        part(lst[1:], current + [lst[0]], final[:])
        part(lst[1:], current[:], final + [[lst[0]]])
Since nobody has mentioned the backtracking technique yet, here is a Python solution using backtracking.
def partition(num):
    def backtrack(index, chosen):
        if index == len(num):
            print(chosen)
        else:
            for i in range(index, len(num)):
                # Choose
                cur = num[index:i + 1]
                chosen.append(cur)
                # Explore
                backtrack(i + 1, chosen)
                # Unchoose
                chosen.pop()
    backtrack(0, [])
>>> partition('123')
['1', '2', '3']
['1', '23']
['12', '3']
['123']
Assume we have an array of objects of length N (all objects have the same set of fields).
We also have an array of length N of values of the same type as a certain field of those objects (e.g. an array of numbers representing IDs).
Now we want to sort the array of objects by that field, in the same order as the values appear in the second array.
For example, here are 2 arrays (as in description) and expected result:
A = [ {id: 1, color: "red"}, {id: 2, color: "green"}, {id: 3, color: "blue"} ]
B = [ "green", "blue", "red"]
sortByColorByExample(A, B) ==
[ {id: 2, color: "green"}, {id: 3, color: "blue"}, {id: 1, color: "red"} ]
How can I effectively implement this 'sort-by-example' function? I can't come up with anything better than O(N^2).
This assumes you have a bijection from elements in B to elements in A.
Build a map (say M) from B's elements to their position (O(N))
For each element of A (O(N)), access the map to find where to put it in the sorted array (O(log(N)) with an efficient implementation of the map)
Total complexity: O(NlogN) time and O(N) space
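A minimal sketch of this approach in Python, assuming the values in B are unique and every element of A has a match in B (names are illustrative):

def sort_by_example(A, B, key):
    position = {b: i for i, b in enumerate(B)}    # step 1: value -> target index
    result = [None] * len(A)
    for a in A:                                   # step 2: place each element
        result[position[key(a)]] = a
    return result

A = [{'id': 1, 'color': 'red'}, {'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}]
B = ['green', 'blue', 'red']
print(sort_by_example(A, B, lambda a: a['color']))
# [{'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}, {'id': 1, 'color': 'red'}]

With a hash-based dict the lookups are O(1), so the placement step is O(N) overall rather than O(N log N).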
Suppose we are sorting on an item's colour. Then create a dictionary d that maps each colour to a list of the items in A that have that colour. Then iterate across the colours in the list B, and for each colour c output (and remove) a value from the list d[c]. This runs in O(n) time with O(n) extra space for the dictionary.
Note that you have to decide what to do if A cannot be sorted according to the examples in B: do you raise an error? Choose the order that maximizes the number of matches? Or what?
Anyway, here's a quick implementation in Python:
from collections import defaultdict

def sorted_by_example(A, B, key):
    """Return a list consisting of the elements from the sequence A in the
    order given by the sequence B. The function key takes an element
    of A and returns the value that is used to match elements from B.
    If A cannot be sorted by example, raise IndexError.
    """
    d = defaultdict(list)
    for a in A:
        d[key(a)].append(a)
    return [d[b].pop() for b in B]
>>> A = [{'id': 1, 'color': 'red'}, {'id': 2, 'color': 'green'}, {'id': 3, 'color': 'blue'}]
>>> B = ['green', 'blue', 'red']
>>> from operator import itemgetter
>>> sorted_by_example(A, B, itemgetter('color'))
[{'color': 'green', 'id': 2}, {'color': 'blue', 'id': 3}, {'color': 'red', 'id': 1}]
Note that this approach handles the case where there are multiple identical values in the sequence B, for example:
>>> A = 'proper copper coffee pot'.split()
>>> B = 'ccpp'
>>> ' '.join(sorted_by_example(A, B, itemgetter(0)))
'coffee copper pot proper'
Here when there are multiple identical values in B, we get the corresponding elements in A in reverse order, but this is just an artefact of the implementation: by using a collections.deque instead of a list (and popleft instead of pop), we could arrange to get the corresponding elements of A in the original order, if that were preferred.
Make an array of arrays, call it C of size B.length.
Loop through A. If it has a color of 'green', put it in C[0]; if it has a color of 'blue', put it in C[1]; if it has a color of 'red', put it in C[2].
When you're done go through C, and flatten it out to your original structure.
Wouldn't something along the lines of a merge sort be better? Create B.length arrays, one for each element inside B, and go through A, and place them in the appropriate smaller array then when it's all done merge the arrays together. It should be around O(2n)
Iterate through the first array and make a HashMap from the field value to the list of objects with that value: O(n). (Using a list handles duplicate values of the key field.)
For example, the key 'green' will map to all objects whose field value is green.
Now iterate through the second array, get the list of objects from the HashMap, and append them to an output array: O(k), where k is the number of distinct field values.
The total running time is O(n) but it requires some additional memory in terms of a map and an auxiliary array
In the end you will get the array sorted as per your requirements.
I have a table of items with [ID, ATTR1, ATTR2, ATTR3]. I'd like to select about half of the items, but try to get a random result set that is NOT clustered. In other words, there's a fairly even spread of ATTR1 values, ATTR2 values, and ATTR3 values. This does NOT necessarily represent the data as a whole, in other words, the total table may be generally concentrated on certain attribute values, but I'd like to select a subset with more variety. The attributes are not inter-related, so there's not really a correlation between ATTR1 and ATTR2.
As an example, imagine ATTR1 = "State". I'd like each line item in my subset to be from a different state, even if in the whole set, most of my data is concentrated on a few states. And for this to simultaneously be true of the other 2 attributes, too. (I realize that some tables might not make this possible, but there's enough data that it's unlikely to have no solution)
Any ideas for an efficient algorithm? Thanks! I don't really even know how to search for this :)
(by the way, it's OK if this requires pre-calculation or -indexing on the whole set, so long as I can draw out random varied subsets quickly)
Interesting problem. Since you want about half of the list, how about this:
Create a list of half the values chosen entirely at random. Compute histograms for the value of ATTR1, ATTR2, ATTR3 for each of the chosen items.
:loop
Now randomly pick an item that's in the current list and an item that isn't.
If swapping the new item in (and the old one out) increases the 'entropy' of unique attribute values in the histograms, keep the swap and update the histograms to reflect the change you just made.
Repeat N/2 times, or more depending on how much you want to force it to move towards covering every value rather than being random. You could also use 'simulated annealing' and gradually change the probability to accepting the swap - starting with 'sometimes allow a swap even if it makes it worse' down to 'only swap if it increases variety'.
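A rough sketch of the swap loop (without the simulated-annealing refinement), assuming rows are (ATTR1, ATTR2, ATTR3) tuples; I use the count of distinct values per attribute as a crude stand-in for the 'entropy', and the names are mine:

import random

def variety_score(subset):
    # number of distinct values seen for each of the three attributes
    return sum(len(set(row[i] for row in subset)) for i in range(3))

def varied_half(rows, iterations=1000):
    rows = list(rows)
    random.shuffle(rows)
    half = len(rows) // 2
    chosen, rest = rows[:half], rows[half:]
    for _ in range(iterations):
        i = random.randrange(len(chosen))
        j = random.randrange(len(rest))
        trial = chosen[:i] + chosen[i + 1:] + [rest[j]]
        if variety_score(trial) >= variety_score(chosen):
            chosen[i], rest[j] = rest[j], chosen[i]   # keep the swap
    return chosen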
I don't know (and I hope someone who does will answer). Here's what comes to mind: make up a distribution for MCMC putting the most weight on the subsets with 'variety'.
Assuming the items in your table are indexed by some form of id, I would loop over half the number of items in your table and use a random number generator to pick each id.
IMHO, finding variety is difficult but generating it is easy. So we can generate a variety of combinations and then search the table for records with those combinations. If the table is sorted, then searching also becomes easy.
Sample Python code:
d = {}
d[('a', 0, 'A')] = 0
d[('a', 1, 'A')] = 1
d[('a', 0, 'A')] = 2
d[('b', 1, 'B')] = 3
d[('b', 0, 'C')] = 4
d[('c', 1, 'C')] = 5
d[('c', 0, 'D')] = 6
d[('a', 0, 'A')] = 7
print(d)

attr1 = ['a', 'b', 'c']
attr2 = [0, 1]
attr3 = ['A', 'B', 'C', 'D']

# no of items in
# attr2 < attr1 < attr3
# ;) reason for strange nesting of loops
for z in attr3:
    for x in attr1:
        for y in attr2:
            k = (x, y, z)
            if k in d:
                print('%s->%s' % (k, d[k]))
            else:
                print(k)
Output:
('a', 0, 'A')->7
('a', 1, 'A')->1
('b', 0, 'A')
('b', 1, 'A')
('c', 0, 'A')
('c', 1, 'A')
('a', 0, 'B')
('a', 1, 'B')
('b', 0, 'B')
('b', 1, 'B')->3
('c', 0, 'B')
('c', 1, 'B')
('a', 0, 'C')
('a', 1, 'C')
('b', 0, 'C')->4
('b', 1, 'C')
('c', 0, 'C')
('c', 1, 'C')->5
('a', 0, 'D')
('a', 1, 'D')
('b', 0, 'D')
('b', 1, 'D')
('c', 0, 'D')->6
('c', 1, 'D')
But assuming your table is very big (otherwise why would you need an algorithm? ;)) and the data is fairly uniformly distributed, there will be more hits in a real scenario. In this dummy case there are too many misses, which makes the algorithm look inefficient.
Let's assume that ATTR1, ATTR2, and ATTR3 are independent random variables (over a uniform random item). (If ATTR1, ATTR2, and ATTR3 are only approximately independent, then this sample should be approximately uniform in each attribute.) To sample one item (VAL1, VAL2, VAL3) whose attributes are uniformly distributed, choose VAL1 uniformly at random from the set of values for ATTR1, choose VAL2 uniformly at random from the set of values for ATTR2 over items with ATTR1 = VAL1, choose VAL3 uniformly at random from the set of values for ATTR3 over items with ATTR1 = VAL1 and ATTR2 = VAL2.
To get a sample of distinct items, apply the above procedure repeatedly, deleting each item after it is chosen. Probably the best way to implement this would be a tree. For example, if we have
ID ATTR1 ATTR2 ATTR3
1 a c e
2 a c f
3 a d e
4 a d f
5 b c e
6 b c f
7 b d e
8 b d f
9 a c e
then the tree is, in JavaScript object notation,
{"a": {"c": {"e": [1, 9], "f": [2]},
"d": {"e": [3], "f": [4]}},
"b": {"c": {"e": [5], "f": [6]},
"d": {"e": [7], "f": [8]}}}
Deletion is accomplished recursively. If we sample id 4, then we delete it from its list at the leaf level. This list empties, so we delete the entry "f": [] from tree["a"]["d"]. If we now delete 3, then we delete 3 from its list, which empties, so we delete the entry "e": [] from tree["a"]["d"], which empties tree["a"]["d"], so we delete it in turn. In a good implementation, each item should take time O(# of attributes).
EDIT: For repeated use, reinsert the items into the tree after the whole sample is collected. This doesn't affect the asymptotic running time.
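A compact sketch of this tree using nested dicts keyed by attribute values, with lists of ids at the leaves (names are mine, not from the answer):

import random

def build_tree(items):
    # items: iterable of (id, attr1, attr2, attr3) rows
    tree = {}
    for id_, a1, a2, a3 in items:
        tree.setdefault(a1, {}).setdefault(a2, {}).setdefault(a3, []).append(id_)
    return tree

def sample_and_delete(tree):
    # walk down, choosing a value uniformly at random at each level
    a1 = random.choice(list(tree))
    a2 = random.choice(list(tree[a1]))
    a3 = random.choice(list(tree[a1][a2]))
    ids = tree[a1][a2][a3]
    chosen = ids.pop(random.randrange(len(ids)))
    # prune empty branches so later draws stay uniform over the remaining values
    if not ids:
        del tree[a1][a2][a3]
        if not tree[a1][a2]:
            del tree[a1][a2]
            if not tree[a1]:
                del tree[a1]
    return chosen

items = [(1,'a','c','e'), (2,'a','c','f'), (3,'a','d','e'), (4,'a','d','f'),
         (5,'b','c','e'), (6,'b','c','f'), (7,'b','d','e'), (8,'b','d','f'),
         (9,'a','c','e')]
tree = build_tree(items)
print([sample_and_delete(tree) for _ in range(4)])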
Idea #2.
Compute histograms for each attribute on the original table.
For each item compute its uniqueness score = p(ATTR1) x p(ATTR2) x p(ATTR3) (multiply the probabilities for each attribute value it has).
Sort by uniqueness.
Chose a probability distribution curve for your random numbers ranging from picking only values in the first half of the set (a step function) to picking values evenly over the entire set (a flat line). Maybe a 1/x curve might work well for you in this case.
Pick values from the sorted list using your chosen probability curve.
This allows you to bias it towards more unique values or towards more evenness just by adjusting the probability curve you use to generate the random numbers.
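A loose sketch of the scoring and the biased pick, assuming rows are (ATTR1, ATTR2, ATTR3) tuples and using a simple 1/x-shaped weighting; removal after picking and the exact probability curve are left out, and the names are mine:

import random
from collections import Counter

def rank_by_uniqueness(rows):
    # histograms per attribute over the whole table
    hists = [Counter(row[i] for row in rows) for i in range(3)]
    n = len(rows)
    def score(row):
        # product of attribute-value probabilities; smaller = more unique
        p = 1.0
        for i in range(3):
            p *= hists[i][row[i]] / n
        return p
    return sorted(rows, key=score)

def biased_pick(ranked):
    # 1/x-shaped weighting biases the pick towards the more unique front
    weights = [1.0 / (i + 1) for i in range(len(ranked))]
    return random.choices(ranked, weights=weights, k=1)[0]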
Taking your example, assign every possible 'State' a numeric value (say, between 1 and 9). Do the same for the other attributes.
Now, assuming you don't have more than 10 possible values for each attribute, multiply the value for ATTR3 by 100, ATTR2 by 1,000, and ATTR1 by 10,000. Add up the results and you will end up with something resembling a vague hash of the item. Something like
10,000 * |ATTR1| + 1000 * |ATTR2| + 100 * |ATTR3|
the advantage here is that you know that values between 10000 and 19000 have the same 'State' value; in other words, the first digit represents ATTR1. Same for ATTR2 and the other attributes.
You can sort all values and using something like bucket-sort pick one for each type, checking that the digit you're considering hasn't been picked already.
An example: if your end values are
A: 15,700 = 10,000 * 1 + 1,000 * 5 + 100 * 7
B: 13,400 = 10,000 * 1 + 1,000 * 3 + 100 * 4
C: 13,200 = ...
D: 12,300
E: 11,400
F: 10,900
you know that all your values have the same ATTR1; 2 have the same ATTR2 (that being B and C); and 2 have the same ATTR3 (B, E).
This, of course, assumes I understood correctly what you want to do. It's Saturday night, after all.
PS: yes, I could have used '10' as the first multiplier, but the example would have been messier; and yes, it's clearly a naive example and there are lots of possible optimizations here, which are left as an exercise for the reader.
It's a very interesting problem, for which I can see a number of applications. Notably for testing software: you get many 'main-flow' transactions, but only one is necessary to test that it works, and when selecting you would prefer to get an extremely varied sample.
I don't think you really need a histogram structure, or at least only a binary one (absent/present).
{ ATTR1: [val1, val2], ATTR2: [i,j,k], ATTR3: [1,2,3] }
This is used in fact to generate a list of predicates:
Predicates = [ lambda x: x.attr1 == val1, lambda x: x.attr1 == val2,
lambda x: x.attr2 == i, ...]
This list will contain say N elements.
Now you wish to select K elements from this list. If K is less than N it's fine; otherwise we will duplicate the list i times, so that K <= N*i, with i minimal of course, so i = ceil(K/N) (note that this also works when K <= N, with i == 1).
i = ceil(K/N)
Predz = Predicates * i # python's wonderful
And finally, pick up a predicate there, and look for an element that satisfies it... that's where randomness actually hits and I am less than adequate here.
A few remarks:
if K > N you may be willing to actually select i-1 times each predicate and then select randomly from the list of predicates only to top off your selection. Thus ensuring the over representation of even the least common elements.
the attributes are completely uncorrelated this way; you may want to select on patterns instead, as you could never specifically target the tuple (1,2,3) by selecting only on the third element being 3, so perhaps a refinement would be to group some related attributes together, though it would probably increase the number of predicates generated
for efficiency reasons, you should have the table indexed by the predicate category if you wish to have an efficient select.
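A rough, unindexed sketch of the predicate-based selection described above, assuming rows are (ATTR1, ATTR2, ATTR3) tuples and K is the desired sample size; the names are mine, and the linear scans are exactly what the last remark suggests indexing away:

import math
import random

def varied_sample(rows, K):
    # one equality predicate per (attribute index, value) pair
    values = [sorted(set(r[i] for r in rows)) for i in range(3)]
    predicates = [(i, v) for i in range(3) for v in values[i]]
    reps = math.ceil(K / len(predicates))      # duplicate the list if K > N
    pool = predicates * reps
    random.shuffle(pool)
    chosen, remaining = [], set(range(len(rows)))
    for i, v in pool:
        if len(chosen) == K:
            break
        candidates = [j for j in remaining if rows[j][i] == v]
        if candidates:
            j = random.choice(candidates)
            remaining.remove(j)
            chosen.append(rows[j])
    return chosen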