How To Find All Possible Permutations From A Bag under apache pig - hadoop

I'm trying to find all possible combinations using Apache Pig. I was able to generate the permutations, but I want to eliminate the duplicated pairs. I wrote this code:
A = LOAD 'data' AS f1:chararray;
DUMP A;
('A')
('B')
('C')
B = FOREACH A GENERATE $0 AS v1;
C = FOREACH A GENERATE $0 AS v2;
D = CROSS B, C;
And the result I obtained is:
DUMP D;
('A', 'A')
('A', 'B')
('A', 'C')
('B', 'A')
('B', 'B')
('B', 'C')
('C', 'A')
('C', 'B')
('C', 'C')
but the result I'm trying to obtain is like the one below:
DUMP R;
('A', 'A')
('A', 'B')
('A', 'C')
('B', 'B')
('B', 'C')
('C', 'C')
How can I do this? I would like to avoid comparing the strings, because the same string may occur on more than one line.

You can FILTER D to remove the rows you don't want. For example
A = load 'testdata.txt';
B = foreach A generate $0;
C = Cross A, B;
D = filter C by $0 <= $1;
dump D;
which prints out
(C,C)
(B,C)
(B,B)
(A,C)
(A,B)
(A,A)
when 'testdata.txt' has
A
B
C
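To check what the `$0 <= $1` filter keeps without running Pig, the same logic can be sketched in Python. This is only an illustration, not Pig code: the `values` list stands in for the loaded relation, and de-duplicating the input first with `set` is an assumption about how repeated strings should be handled.

```python
from itertools import combinations_with_replacement, product

# Stand-in for the rows loaded from 'testdata.txt' (assumed sample data).
values = ['A', 'B', 'C']

# Equivalent of CROSS followed by FILTER ... BY $0 <= $1.
filtered = [(x, y) for x, y in product(values, values) if x <= y]

# The same pairs, generated directly from the sorted distinct values.
direct = list(combinations_with_replacement(sorted(set(values)), 2))

print(filtered)
# [('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'B'), ('B', 'C'), ('C', 'C')]
print(filtered == direct)  # True
```

If the input contains repeated strings, removing the duplicates before the cross keeps each pair from appearing more than once.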

Related

Generating the powerset of a multiset

Suppose I have a multiset
{a,a,a,b,c}
from which I can make the following selections:
{}
{a}
{a,a}
{a,a,a}
{a,a,a,b}
{a,a,a,b,c}
{a,a,a,c}
{a,a,b}
{a,a,b,c}
{a,a,c}
{a,b}
{a,b,c}
{a,c}
{b}
{b,c}
{c}
Notice that the number of selections equals 16. The cardinality of a powerset of a multiset, card(P(M)), is defined on OEIS as
card(P(M)) = prod(mult(x) + 1) for all x in M
where mult(x) is the multiplicity of x in M and prod is the product of the terms. So for our example, this would amount to 4 x 2 x 2 = 16.
Let's say, for example, that the multiplicity of these elements is very high:
m(a) = 21
m(b) = 36
m(c) = 44
Then
card(P(M)) = 22 * 37 * 45 = 36630.
But if we were to treat all those elements as distinct - as a set - the cardinality of the powerset would be
card(P(S)) = 2^(21+36+44) = 2535301200456458802993406410752.
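The two counts above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `multiset_powerset_size` is made up for illustration, and `math.prod` assumes Python 3.8 or later.

```python
from math import prod

def multiset_powerset_size(multiplicities):
    """card(P(M)) = product of (mult(x) + 1) over the distinct elements x."""
    return prod(m + 1 for m in multiplicities.values())

small = {'a': 3, 'b': 1, 'c': 1}
large = {'a': 21, 'b': 36, 'c': 44}

print(multiset_powerset_size(small))   # 16
print(multiset_powerset_size(large))   # 36630
# Treating every element as distinct gives 2^n subsets instead.
print(2 ** sum(large.values()))        # 2535301200456458802993406410752
```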
The "standard" solution to this problem is to compute the powerset of the set in which all of the elements are treated as distinct, and then prune the results to remove the duplicates. That's a solution with O(2^n) complexity.
Does a general algorithm for generating a powerset of a multiset with complexity on the order of card(P(M)) exist?
powerset recipe with itertools
What you are asking is usually called the powerset and is available as an itertools recipe, as well as a function in the module more_itertools. See the documentation:
itertools recipe;
more_itertools.powerset.
multiset = ['a', 'a', 'a', 'b', 'c']
#
# USING ITERTOOLS
#
import itertools
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return itertools.chain.from_iterable(itertools.combinations(s, r) for r in range(len(s)+1))
print(list(powerset(multiset)))
# [(), ('a',), ('a',), ('a',), ('b',), ('c',), ('a', 'a'), ('a', 'a'), ('a', 'b'), ('a', 'c'), ('a', 'a'), ('a', 'b'), ('a', 'c'), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a', 'a', 'a'), ('a', 'a', 'b'), ('a', 'a', 'c'), ('a', 'a', 'b'), ('a', 'a', 'c'), ('a', 'b', 'c'), ('a', 'a', 'b'), ('a', 'a', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'a', 'a', 'b'), ('a', 'a', 'a', 'c'), ('a', 'a', 'b', 'c'), ('a', 'a', 'b', 'c'), ('a', 'a', 'b', 'c'), ('a', 'a', 'a', 'b', 'c')]
#
# USING MORE_ITERTOOLS
#
import more_itertools
print(list(more_itertools.powerset(multiset)))
# [(), ('a',), ('a',), ('a',), ('b',), ('c',), ('a', 'a'), ('a', 'a'), ('a', 'b'), ('a', 'c'), ('a', 'a'), ('a', 'b'), ('a', 'c'), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a', 'a', 'a'), ('a', 'a', 'b'), ('a', 'a', 'c'), ('a', 'a', 'b'), ('a', 'a', 'c'), ('a', 'b', 'c'), ('a', 'a', 'b'), ('a', 'a', 'c'), ('a', 'b', 'c'), ('a', 'b', 'c'), ('a', 'a', 'a', 'b'), ('a', 'a', 'a', 'c'), ('a', 'a', 'b', 'c'), ('a', 'a', 'b', 'c'), ('a', 'a', 'b', 'c'), ('a', 'a', 'a', 'b', 'c')]
Powerset of a collections.Counter object
In Python, multisets are usually represented with a collections.Counter rather than with a list. The class collections.Counter is a subclass of dict; it implements dictionaries that map elements to counts, as well as several useful methods such as building a Counter by counting occurrences in a sequence.
Taking the powerset of a Counter is the topic of another question on stackoverflow:
How to generate all the subsets of a Counter?
Although I am not aware of an already-implemented method doing this in standard modules, the answer to that question presents one solution using itertools:
import collections
import itertools
multiset = collections.Counter(['a', 'a', 'a', 'b', 'c'])
# Counter({'a': 3, 'b': 1, 'c': 1})
def powerset(multiset):
    range_items = [[(x, z) for z in range(y + 1)] for x,y in multiset.items()]
    products = itertools.product(*range_items)
    return [{k: v for k, v in pairs if v > 0} for pairs in products]
print(powerset(multiset))
# [{}, {'c': 1}, {'b': 1}, {'b': 1, 'c': 1}, {'a': 1}, {'a': 1, 'c': 1}, {'a': 1, 'b': 1}, {'a': 1, 'b': 1, 'c': 1}, {'a': 2}, {'a': 2, 'c': 1}, {'a': 2, 'b': 1}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3}, {'a': 3, 'c': 1}, {'a': 3, 'b': 1}, {'a': 3, 'b': 1, 'c': 1}]
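When the multiplicities are large, it may be preferable not to build the whole list at once. The following is a sketch of a lazy variant of the same idea: it picks a count between 0 and mult(x) for each distinct element, so it does work proportional to card(P(M)) rather than 2^n. The function name `iter_multiset_powerset` is hypothetical.

```python
from collections import Counter
from itertools import product

def iter_multiset_powerset(multiset):
    """Yield each sub-multiset of a Counter exactly once, as a Counter."""
    elements = list(multiset)
    # For every distinct element, the allowed counts are 0..mult(x).
    count_ranges = [range(multiset[e] + 1) for e in elements]
    for counts in product(*count_ranges):
        yield Counter({e: c for e, c in zip(elements, counts) if c > 0})

subsets = list(iter_multiset_powerset(Counter('aaabc')))
print(len(subsets))  # 16
```

Because it is a generator, the caller can stop early or process one sub-multiset at a time.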
This will give you all the combinations of lst as tuples. Hope this answers your question.
from itertools import combinations
lst = ['a', 'a', 'a', 'b', 'c']
combs = set()
for i in range(len(lst)+1):
    els = [tuple(x) for x in combinations(lst, i)]
    for item in tuple(els):
        combs.add(item)
print(combs)
The best way I like to think of it is that you start out with the empty set and then for each character you are making a choice of either adding it to the current existing sets or not adding it. Since you have 2 choices at each step, the number of total elements in a powerset is 2^n. Implementing this can be done through this Java code:
// requires java.util.ArrayList and java.util.List
public List<List<Integer>> subsets(int[] nums) {
    // list variable contains all of the subsets
    List<List<Integer>> list = new ArrayList<>();
    // add the empty set to start with
    list.add(new ArrayList<Integer>());
    for (int i = 0; i < nums.length; i++) {
        // find current list size
        int length = list.size();
        // Loop through and add the current element to all existing subsets.
        // Represents making the choice of adding the element
        for (int j = 0; j < length; j++) {
            // making a copy of current subset list
            ArrayList<Integer> temp = new ArrayList<>(list.get(j));
            temp.add(nums[i]);
            list.add(temp);
        }
    }
    return list;
}

Ranking algorithm with win-lose records

I am looking for an algorithmic approach to sort elements based on the win-lose records of each combination.
Please take a look at the sample data
('a', 'b') -> (W, L)
('a', 'c') -> (L, W)
('a', 'd') -> (L, W)
('a', 'e') -> (W, L)
('b', 'c') -> (L, W)
('b', 'd') -> (L, W)
('b', 'e') -> (W, L)
('c', 'd') -> (W, L)
('c', 'e') -> (W, L)
('d', 'e') -> (W, L)
Winners should end up toward the right-hand side of the array.
For example:
c wins over all the other elements
d wins over all the other elements except c
...
Desired result, ordered from most losses to most wins:
[e, b, a, d, c]
Is there a keyword or approach I can pursue to solve this problem?
I would go about it by assigning each token (W, L, and optionally a draw D) a value, such as W=2 and L=1, with a draw valued somewhere in between.
Then you would map those values to each player's results and sort accordingly.
So from your example of:
('a', 'b') -> (W, L)
('a', 'c') -> (L, W)
('a', 'd') -> (L, W)
('a', 'e') -> (W, L)
('b', 'c') -> (L, W)
('b', 'd') -> (L, W)
('b', 'e') -> (W, L)
('c', 'd') -> (W, L)
('c', 'e') -> (W, L)
('d', 'e') -> (W, L)
gets mapped to something like this:
('a', 'b') -> (2, 1)
('a', 'c') -> (1, 2)
('a', 'd') -> (1, 2)
('a', 'e') -> (2, 1)
('b', 'c') -> (1, 2)
('b', 'd') -> (1, 2)
('b', 'e') -> (2, 1)
('c', 'd') -> (2, 1)
('c', 'e') -> (2, 1)
('d', 'e') -> (2, 1)
Sum up the values by their key:
a: 6
b: 5
c: 8
d: 7
e: 4
and sort the values starting with lowest:
e: 4
b: 5
a: 6
d: 7
c: 8
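The point-summing steps above can be sketched in Python as follows. The point values W=2 and L=1 follow the mapping shown in the answer; the names `POINTS`, `results`, `totals` and `ranking` are illustrative only.

```python
from collections import defaultdict

POINTS = {'W': 2, 'L': 1}  # values taken from the mapping above

results = {
    ('a', 'b'): ('W', 'L'), ('a', 'c'): ('L', 'W'),
    ('a', 'd'): ('L', 'W'), ('a', 'e'): ('W', 'L'),
    ('b', 'c'): ('L', 'W'), ('b', 'd'): ('L', 'W'),
    ('b', 'e'): ('W', 'L'), ('c', 'd'): ('W', 'L'),
    ('c', 'e'): ('W', 'L'), ('d', 'e'): ('W', 'L'),
}

# Sum the points each player collected over all of its matches.
totals = defaultdict(int)
for (x, y), (rx, ry) in results.items():
    totals[x] += POINTS[rx]
    totals[y] += POINTS[ry]

# Lowest total first; ties are broken alphabetically (an arbitrary choice).
ranking = sorted(totals, key=lambda player: (totals[player], player))
print(ranking)  # ['e', 'b', 'a', 'd', 'c']
```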
I am not a native English speaker, please forgive me if there is any grammatical error.
Record the number of wins and losses for each letter. The letter that has never lost is the largest letter, and the letter that has never won is the smallest letter.
This problem can be transformed into a longest-path problem. You can think of each comparison as an edge of length 1, so that the longest path from the smallest letter to the largest letter represents the result.
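One way to act on this graph view is a topological sort, which is a related technique rather than the longest-path computation itself. The sketch below assumes Python 3.9 or later for `graphlib`, and it assumes the results are consistent (no cycles such as a beats b, b beats c, c beats a); otherwise `static_order` raises `CycleError`.

```python
from graphlib import TopologicalSorter

results = {
    ('a', 'b'): ('W', 'L'), ('a', 'c'): ('L', 'W'),
    ('a', 'd'): ('L', 'W'), ('a', 'e'): ('W', 'L'),
    ('b', 'c'): ('L', 'W'), ('b', 'd'): ('L', 'W'),
    ('b', 'e'): ('W', 'L'), ('c', 'd'): ('W', 'L'),
    ('c', 'e'): ('W', 'L'), ('d', 'e'): ('W', 'L'),
}

sorter = TopologicalSorter()
for (x, y), (rx, _) in results.items():
    winner, loser = (x, y) if rx == 'W' else (y, x)
    # add(node, *predecessors): the loser must appear before the winner.
    sorter.add(winner, loser)

order = list(sorter.static_order())
print(order)  # ['e', 'b', 'a', 'd', 'c']
```

With a complete and consistent set of results, as in the sample data, the topological order is unique. With missing comparisons, several valid orders may exist.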
This assumes that the ordering is based on number of wins per item.
# -*- coding: utf-8 -*-
"""
https://stackoverflow.com/questions/67630173/ranking-algorithm-with-win-lose-records
Created on Fri May 21 19:00:33 2021
#author: Paddy3118
"""
data = {('a', 'b'): ('W', 'L'),
        ('a', 'c'): ('L', 'W'),
        ('a', 'd'): ('L', 'W'),
        ('a', 'e'): ('W', 'L'),
        ('b', 'c'): ('L', 'W'),
        ('b', 'd'): ('L', 'W'),
        ('b', 'e'): ('W', 'L'),
        ('c', 'd'): ('W', 'L'),
        ('c', 'e'): ('W', 'L'),
        ('d', 'e'): ('W', 'L'),
        }
all_items = set()
for (i1, i2) in data.keys():
    all_items |= {i1, i2}  # Finally = {'a', 'b', 'c', 'd', 'e'}
win_counts = {item: 0 for item in all_items}
for (i1, i2), (r1, r2) in data.items():
    if r1 == 'W':
        win_counts[i1] += 1
    else:
        win_counts[i2] += 1
# win_counts = {'a': 2, 'd': 3, 'b': 1, 'e': 0, 'c': 4}
answer = sorted(all_items, key=lambda i: win_counts[i])
print(answer)  # ['e', 'b', 'a', 'd', 'c']

How to find a best way to connect different nodes between different unions?

Description:
There are 1000 unions, and each union contains x nodes (x is a random value between 1 and 100).
Now we can create a connection from one node in union A to another node in union B.
The rule is:
1. one node only accepts just one connection.
2. The connection must be across different unions.
Create as many such connections as possible.
In the end, there may be several nodes left over that cannot be connected because no nodes are available in other unions.
For example:
Union 1: a b c
Union 2: d e
Union 3: f g h i j
If we choose the following connections:
U1.a <-> U2.d
U1.b <-> U2.e
U1.c <-> U3.f
The h i j in union 3 will be left.
But if we use another kind of connection:
U1.a <-> U3.f
U1.b <-> U3.g
U1.c <-> U3.h
U2.d <-> U3.i
U2.e <-> U3.j
Then there will be no nodes left.
So the question is:
How can we design an algorithm that finds the optimal solution, i.e. one that leaves the fewest unconnected nodes?
This resembles the partition problem, where each element in the input multiset is the size of a union. Unlike the general partition problem, a union's nodes may be split across both output lists, so a simple greedy implementation always works; apart from sorting the unions by size, it runs in O(n) time. For example, consider the following input:
Union 1: a a a
Union 2: b b b b b b b b b
Union 3: c c c c c c c c c c
Union 4: d d d d d d d d d d
Union 5: e e
The simple greedy algorithm creates two output lists. For each union (starting with the union that has the most elements), the elements of the union are added to the shorter output list. The result is two lists like this:
c c c c c c c c c c b b b b b b b b b
d d d d d d d d d d a a a e e
The next step is to take some of the items from the end of the longer list and add them to the beginning of the shorter list. In this example two bs are moved:
c c c c c c c c c c b b b b b b b
b b d d d d d d d d d d a a a e e
So, will it always work? Yes, the only exception is when one union contains more than half of the total number of items. In that case, no items are moved from the longer list.
Here's an example implementation in python:
inputList = [['a','a','a'],
             ['b','b','b','b','b','b','b','b','b'],
             ['c','c','c','c','c','c','c','c','c','c'],
             ['d','d','d','d','d','d','d','d','d','d'],
             ['e','e']]
topCount = 0
botCount = 0
topList = []
botList = []
# sort the input in descending order based on length
inputList = sorted(inputList, key=lambda x:len(x), reverse=True)
# greedy partitioning into two output lists
for itemList in inputList:
    if topCount <= botCount:
        topList += itemList
        topCount += len(itemList)
    else:
        botList += itemList
        botCount += len(itemList)
# move some elements from the end of the longer list to the beginning of the shorter list
if topList[0] != topList[-1]:
    if topCount > botCount+1:
        excess = (topCount - botCount) // 2
        botList = topList[-excess:] + botList
        topList = topList[:-excess]
    elif botCount > topCount+1:
        excess = (botCount - topCount) // 2
        topList = botList[-excess:] + topList
        botList = botList[:-excess]
print(topList)
print(botList)
The output is:
['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'c', 'b', 'b', 'b', 'b', 'b', 'b', 'b']
['b', 'b', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'a', 'a', 'a', 'e', 'e']
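The answer stops at the two lists. Presumably the connections are then formed by pairing position i of one list with position i of the other; the sketch below checks that reading against the output shown above. The helper name `pair_up` is hypothetical, and the two lists are copied from the printed output.

```python
# The two output lists from the example run above.
top_list = ['c'] * 10 + ['b'] * 7
bot_list = ['b'] * 2 + ['d'] * 10 + ['a'] * 3 + ['e'] * 2

def pair_up(top, bot):
    """Pair nodes position by position; return the pairs and the leftover count."""
    pairs = list(zip(top, bot))
    leftover = abs(len(top) - len(bot))
    return pairs, leftover

pairs, leftover = pair_up(top_list, bot_list)

# Every connection must join two different unions.
print(all(u != v for u, v in pairs))  # True
print(len(pairs), leftover)           # 17 0
```

Here the labels identify the union only. In a real run each entry would also carry a node identifier.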
This looks like a matching problem where your graph is:
G=(V,E),
V = {all nodes},
E={(u,v) | u and v not in same union}
It can be solved, for example, with the Blossom algorithm.

Why does this get stuck in a constant loop?

I can't understand why this won't work:
EntryRow = input("Which row would you like to book for?")
Upper = EntryRow.upper()
while Upper != 'A' or 'B' or 'C' or 'D' or 'E':
    print("That is not a row")
    EntryRow = input("Which row would you like to book for?")
    Upper = EntryRow.upper()
'!=' has precedence over 'or'. What your code really does is:
while (Upper != 'A') or 'B' or 'C' or 'D' or 'E':
Which is always true.
Try this instead:
while not Upper in ( 'A', 'B', 'C', 'D', 'E' ):
You need to explicitly write out each condition in full and combine them using and:
while Upper != 'A' and Upper != 'B' and ...
The interpreter takes 'B', 'C' and so on to be independent conditions. Each is a non-empty string and therefore truthy, so your while condition is always true.
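The difference between the two conditions can be seen without any user input. This is a small sketch; the function names and the `VALID_ROWS` constant are made up for illustration.

```python
VALID_ROWS = ('A', 'B', 'C', 'D', 'E')

def buggy_condition(upper):
    # Parsed as (upper != 'A') or 'B' or 'C' or ...;
    # the non-empty string 'B' is truthy, so the result is always truthy.
    return upper != 'A' or 'B' or 'C' or 'D' or 'E'

def fixed_condition(upper):
    # True only when the input is not one of the valid rows.
    return upper not in VALID_ROWS

print(bool(buggy_condition('A')))  # True, so the loop never ends
print(fixed_condition('A'))        # False, so the loop stops
print(fixed_condition('Z'))        # True, so the loop asks again
```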
You are using or the wrong way. (See Andrew's answer for the right way).
One possible shortcut is to use a containment check:
while Upper not in ('A', 'B', 'C', 'D', 'E'):
...

Python: What is the right way to modify list elements?

I have this list of tuples:
l = [('a','b'),('c','d'),('e','f')]
And two parameters: a key value, and a new value to modify. For example,
key = 'a'
new_value= 'B' # it means, modify with 'B' the value in tuples where there's an 'a'
I have these two options (both work):
f = lambda t,k,v: t[0] == k and (k,v) or t
new_list = [f(t,key,new_value) for t in l]
print(new_list)
and
new_list = []
for i in range(len(l)):
    elem = l.pop()
    if elem[0] == key:
        new_list.append((key,new_value))
    else:
        new_list.append(elem)
print(new_list)
But I'm new to Python and don't know whether this is the right way.
Can you help me? Thank you!
Here is one solution involving altering the items in-place.
def replace(list_, key, new_value):
    for i, (current_key, current_value) in enumerate(list_):
        if current_key == key:
            list_[i] = (key, new_value)
Or, to append if it's not in there,
def replace_or_append(list_, key, new_value):
    for i, (current_key, current_value) in enumerate(list_):
        if current_key == key:
            list_[i] = (key, new_value)
            break
    else:
        list_.append((key, new_value))
Usage:
>>> my_list = [('a', 'b'), ('c', 'd')]
>>> replace(my_list, 'a', 'B')
>>> my_list
[('a', 'B'), ('c', 'd')]
If you want to create a new list, a list comprehension is easiest.
>>> my_list = [('a', 'b'), ('c', 'd')]
>>> find_key = 'a'
>>> new_value = 'B'
>>> new_list = [(key, new_value if key == find_key else value) for key, value in my_list]
>>> new_list
[('a', 'B'), ('c', 'd')]
And if you wanted it to append if it wasn't there,
>>> if all(key != find_key for key, value in my_list):
...     new_list.append((find_key, new_value))
(Note also I've changed your variable name from l; l is too easily confused with I and 1 and is best avoided. Thus saith PEP8 and I agree with it.)
To create a new list, a list comprehension would do:
In [102]: [(key,'B' if key=='a' else val) for key,val in l]
Out[102]: [('a', 'B'), ('c', 'd'), ('e', 'f')]
To modify the list in place:
l = [('a','b'),('c','d'),('e','f')]
for i,elt in enumerate(l):
    key,val=elt
    if key=='a':
        l[i]=(key,'B')
print(l)
# [('a', 'B'), ('c', 'd'), ('e', 'f')]
To modify existing list just use list assignment, e.g.
>>> l = [('a','b'),('c','d'),('e','f')]
>>> l[0] = ('a','B')
>>> print(l)
[('a', 'B'), ('c', 'd'), ('e', 'f')]
I would usually prefer to create a new list using comprehension, e.g.
[(key, new_value) if x[0] == key else x for x in l]
But, as the first comment has already mentioned, it sounds like you are trying to make a list do something which you should really be using a dict for instead.
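To illustrate the dict suggestion, here is a minimal sketch. It assumes the keys in the list are unique (duplicate keys would be collapsed) and relies on dicts preserving insertion order, which holds in Python 3.7 and later. The function name `replace_value` is hypothetical.

```python
def replace_value(pairs, key, new_value):
    """Return a new list of (key, value) tuples with key set to new_value."""
    mapping = dict(pairs)      # assumes keys are unique
    mapping[key] = new_value   # updates in place, or appends a new key
    return list(mapping.items())

pairs = [('a', 'b'), ('c', 'd'), ('e', 'f')]
print(replace_value(pairs, 'a', 'B'))
# [('a', 'B'), ('c', 'd'), ('e', 'f')]
print(replace_value(pairs, 'z', 'Z'))
# [('a', 'b'), ('c', 'd'), ('e', 'f'), ('z', 'Z')]
```

If the data can stay a dict throughout, the conversion back to a list is unnecessary and the update is a single assignment.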
Here's the approach I would use.
>>> l = [('a','b'),('c','d'),('e','f')]
>>> key = 'a'
>>> new_value= 'B'
>>> for pos in (index for index, (k, v) in enumerate(l) if k == key):
...     l[pos] = (key, new_value)
...     break
... else:
...     l.append((key, new_value))
...
>>> l
[('a', 'B'), ('c', 'd'), ('e', 'f')]
This looks an awful lot like an OrderedDict, though; key-value pairs with preserved ordering. You might want to take a look at that and see if it suits your needs
Edit: Replaced try:...except StopIteration: with for:...break...else: since that might look a bit less weird.
