Opposite of a GROUP in apache pig latin?

Opposite of a GROUP in apache pig latin? - hadoop

Let's say I have the following input in apache pig:
(123, ( (1, 2), (3, 4) ) )
(666, ( (8, 9), (10, 11), (3, 4) ) )
and I want to convert these 2 rows into the following 7 rows:
(123, (1, 2) )
(123, (3, 4) )
(666, (8, 9) )
(666, (10, 11) )
(666, (3, 4) )
i.e. this is sorta 'doing the opposite of a GROUP'. Is this possible in pig latin?

Take a look at FLATTEN. It does what you probably need.
However, using your notation above, it looks like the list of tuples is a tuple. This should be a bag for this to work properly.
Instead of:
(123, ( (1, 2), (3, 4) ) )
(666, ( (8, 9), (10, 11), (3, 4) ) )
You should be representing your data as:
(123, { (1, 2), (3, 4) } )
(666, { (8, 9), (10, 11), (3, 4) } )
Then, once it is this form, you can do:
O = FOREACH grouped GENERATE $0, FLATTEN($1);

Related

Gensim tfidf vector in dense form

I am vectorizing a bunch of documents using Gensim's tfidfmodel. I'd like to take the output so I can dump it into a vector DB and calculate document similarity using the DB. (I'm aware Gensim has the similarities module but would like to do this within a DB.)
Using the example from the docs, here is a sample output:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
print(doc)
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
Am I correct in thinking that I need to convert each document array to a dense array? My assumption is that I cannot take the second element in each tuple of the doc array since the document array omits vocab not present. How do I convert this to a dense array?
Is matutils.corpus2dense(corpus_tfidf,dictionary.num_nnz).T the correct approach?
import gensim.matutils
densematrix = gensim.matutils.corpus2dense(corpus_tfidf,dictionary.num_nnz).T

Sort by key then value which will then be grouped up...pyspark

So I'm trying to sort data in this format...
[((0, 4), 3), ((4, 0), 3), ((1, 6), 1), ((3, 2), 3), ((0, 5), 1)...
Ascending by key and then descending by value. I'm able to achieve this via...
test = test.sortBy(lambda x: (x[0], -x[1]))
which would give me based on shortened version above...
[((0, 4), 3), ((0, 5), 1), ((1, 6), 1), ((3, 2), 3), ((4, 0), 3)...
The problem I'm having is that after the sorting I no longer want the value but do need to retain the sort after grouping the data. So...
test = test.map(lambda x: (x[0][0],x[0][1]))
Gives me...
[(0, 4), (0, 5), (1, 6), (3, 2), (4, 0)...
Which is still in the order I need it but I need the elements to be grouped up by key. I then use this command...
test = test.groupByKey().map(lambda x: (x[0], list(x[1])))
But in the process I lose the sorting. Is there any way retain?

I managed to retain the order by changing the format of the tuple...
test = test.map(lambda x: (x[0][0],(x[0][1],x[1]))
test = test.groupByKey().map(lambda x: (x[0], sorted(list(x[1]), key=lambda x: (x[0],-x[1]))))
[(0, [(4, 3), (5, 1)] ...
which leaves me with the value (2nd element in the tuple) that I want to get rid of but took care of that too...
test = test.map(lambda x: (x[0], [e[0] for e in x[1]]))
Feels a bit hacky but not sure how else it could be done.

Obtaining all groups of n pairs of numbers that pass condition

I have a list with certain combinations between two numbers:
[1 2] [1 4] [1 6] [3 4] [5 6] [3 6] [2 3] [4 5] [2 5]
Now I want to make groups of 3 combinations, where each group contains all six digits once, e.g.:
[1 2] [3 6] [4 5] is valid
[1 4] [2 3] [5 6] is valid
[1 2] [2 3] [5 6] is invalid
Order is not important.
How can I arrive upon a list of all possible groups, without employing a brute forcing algorithm?
The language it is implemented in doesn't matter. Description of an algorithm that could achieve this is enough.

One thing to notice is that there are only finitely many possible pairs of elements you can pick from the set {1,2,3,4,5,6}. Specifically, there are (6P2) = 30 of them if you consider order relevant and (6 choose 2) = 15 if you don't. Even the simple "try all triples" algorithm that runs in cubic time in this case will only have to look at at most (30 choose 3) = 4,060 triples, which is a pretty small number. I doubt that you'd have any problems in practice just doing this.

Here's a recursive function in Python that picks a pair of numbers from a list, and then calls itself with the remaining list:
def pairs(l, picked, ok_pairs):
n = len(l)
for a in range(n-1):
for b in range(a+1,n):
pair = (l[a],l[b])
if pair not in ok_pairs:
continue
if picked and picked[-1][0] > pair[0]:
continue
p = picked+[pair]
if len(l) > 2:
pairs([m for i,m in enumerate(l) if i not in [a, b]], p, ok_pairs)
else:
print p
ok_pairs = set([(1, 2), (1, 4), (1, 6), (3, 4), (5, 6), (3, 6), (2, 3), (4, 5), (2, 5)])
pairs([1,2,3,4,5,6], [], ok_pairs)
The output (of 6 triplets) is:
[(1, 2), (3, 4), (5, 6)]
[(1, 2), (3, 6), (4, 5)]
[(1, 4), (2, 3), (5, 6)]
[(1, 4), (2, 5), (3, 6)]
[(1, 6), (2, 3), (4, 5)]
[(1, 6), (2, 5), (3, 4)]

Here's a version using Python set arithmetic:
pairs = [(1, 2), (1, 4), (1, 6), (3, 4), (5, 6), (3, 6), (2, 3), (4, 5), (2, 5)]
n = len(pairs)
for i in range(n-2):
set1 = set(pairs[i])
for j in range(i+1,n-1):
set2 = set(pairs[j])
if set1 & set2:
continue
for k in range(j+1,n):
set3 = set(pairs[k])
if set1 & set3 or set2 & set3:
continue
print pairs[i], pairs[j], pairs[k]
The output is:
(1, 2) (3, 4) (5, 6)
(1, 2) (3, 6) (4, 5)
(1, 4) (5, 6) (2, 3)
(1, 4) (3, 6) (2, 5)
(1, 6) (3, 4) (2, 5)
(1, 6) (2, 3) (4, 5)

quick sort list of tuple with python

I am trying to do this in its operation algorithm quicksort to sort though the elements of a list of tuples. Or if I have a list of this type [(0,1), (1,1), (2,1), (3,3), (4,2), (5,1), (6,4 )] I want to sort it in function of the second element of each tuple and obtain [(6,4), (3,3), (4,2), (0,1), (1,1), (2,1 ), (5,1)]. I have tried using the following algorithm:
def partition(array, begin, end, cmp):
pivot=array[end][1]
ii=begin
for jj in xrange(begin, end):
if cmp(array[jj][1], pivot):
array[ii], array[jj] = array[jj], array[ii]
ii+=1
array[ii], array[end] = pivot, array[ii]
return ii
enter code hedef sort(array, cmp=lambda x, y: x > y, begin=0, end=None):
if end is None: end = len(array)
if begin < end:
i = partition(array, begin, end-1, cmp)
sort(array, cmp, i+1, end)
sort(array, cmp, begin, i)
The problem is that the result is this: [4, (3, 3), (4, 2), 1, 1, 1, (5, 1)]. What do I have to change to get the correct result ??

Complex sorting patterns in Python are painless. Python's sorting algorithm is state of the art, one of the fastest available in real-world cases. No algorithm design needed.
>>> from operator import itemgetter
>>> l = [(0,1), (1,1), (2,1), (3,3), (4,2), (5,1), (6,4 )]
>>> l.sort(key=itemgetter(1), reverse=True)
>>> l
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
Above, itemgetter returns a function that returns the second element of its argument. Thus the key argument to sort is a function that returns the item on which to sort the list.
Python's sort is stable, so the ordering of elements with equal keys (in this case, the second item of each tuple) is determined by the original order.

Unfortunately the answer from #wkschwartz only works due to the peculiar start ordering of the terms. If the tuple (5, 1) is moved to the beginning of the list then it gives a different answer.
The following (first) method works in that it gives the same result for any initial ordering of the items in the initial list.
Python 3.4.2 |Continuum Analytics, Inc.| (default, Oct 22 2014, 11:51:45) [MSC v
.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> l = [(0,1), (1,1), (2,1), (3,3), (4,2), (5,1), (6,4 )]
>>> sorted(l, key=lambda x: (-x[1], x[0]))
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
>>> from operator import itemgetter
>>> sorted(l, key=itemgetter(1), reverse=True)
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
>>> # but note:
>>> l2 = [(5,1), (1,1), (2,1), (3,3), (4,2), (0,1), (6,4 )]
>>> # Swapped first and sixth elements
>>> sorted(l2, key=itemgetter(1), reverse=True)
[(6, 4), (3, 3), (4, 2), (5, 1), (1, 1), (2, 1), (0, 1)]
>>> sorted(l2, key=lambda x: (-x[1], x[0]))
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
>>>

I have a list of numbers, how to generate all unique k-partitions?

So if I had the numbers [1,2,2,3] and I want k=2 partitions I'd have [1][2,2,3], [1,2][2,3], [2,2][1,3], [2][1,2,3], [3][1,2,2], etc.

See an answer in Python at Code Review.

user3569's solution at Code Review produces five 2-tuples for the test case below, instead of exclusively 3-tuples. However, removing the frozenset() call for the returned tuples leads to the code returning exclusively 3-tuples. The revised code is as follows:
from itertools import chain, combinations
def subsets(arr):
""" Note this only returns non empty subsets of arr"""
return chain(*[combinations(arr,i + 1) for i,a in enumerate(arr)])
def k_subset(arr, k):
s_arr = sorted(arr)
return set([i for i in combinations(subsets(arr),k)
if sorted(chain(*i)) == s_arr])
s = k_subset([2,2,2,2,3,3,5],3)
for ss in sorted(s):
print(len(ss)," - ",ss)
As user3569 says "it runs pretty slow, but is fairly concise".
(EDIT: see below for Knuth's solution)
The output is:
3 - ((2,), (2,), (2, 2, 3, 3, 5))
3 - ((2,), (2, 2), (2, 3, 3, 5))
3 - ((2,), (2, 2, 2), (3, 3, 5))
3 - ((2,), (2, 2, 3), (2, 3, 5))
3 - ((2,), (2, 2, 5), (2, 3, 3))
3 - ((2,), (2, 3), (2, 2, 3, 5))
3 - ((2,), (2, 3, 3), (2, 2, 5))
3 - ((2,), (2, 3, 5), (2, 2, 3))
3 - ((2,), (2, 5), (2, 2, 3, 3))
3 - ((2,), (3,), (2, 2, 2, 3, 5))
3 - ((2,), (3, 3), (2, 2, 2, 5))
3 - ((2,), (3, 5), (2, 2, 2, 3))
3 - ((2,), (5,), (2, 2, 2, 3, 3))
3 - ((2, 2), (2, 2), (3, 3, 5))
3 - ((2, 2), (2, 3), (2, 3, 5))
3 - ((2, 2), (2, 5), (2, 3, 3))
3 - ((2, 2), (3, 3), (2, 2, 5))
3 - ((2, 2), (3, 5), (2, 2, 3))
3 - ((2, 3), (2, 2), (2, 3, 5))
3 - ((2, 3), (2, 3), (2, 2, 5))
3 - ((2, 3), (2, 5), (2, 2, 3))
3 - ((2, 3), (3, 5), (2, 2, 2))
3 - ((2, 5), (2, 2), (2, 3, 3))
3 - ((2, 5), (2, 3), (2, 2, 3))
3 - ((2, 5), (3, 3), (2, 2, 2))
3 - ((3,), (2, 2), (2, 2, 3, 5))
3 - ((3,), (2, 2, 2), (2, 3, 5))
3 - ((3,), (2, 2, 3), (2, 2, 5))
3 - ((3,), (2, 2, 5), (2, 2, 3))
3 - ((3,), (2, 3), (2, 2, 2, 5))
3 - ((3,), (2, 3, 5), (2, 2, 2))
3 - ((3,), (2, 5), (2, 2, 2, 3))
3 - ((3,), (3,), (2, 2, 2, 2, 5))
3 - ((3,), (3, 5), (2, 2, 2, 2))
3 - ((3,), (5,), (2, 2, 2, 2, 3))
3 - ((5,), (2, 2), (2, 2, 3, 3))
3 - ((5,), (2, 2, 2), (2, 3, 3))
3 - ((5,), (2, 2, 3), (2, 2, 3))
3 - ((5,), (2, 3), (2, 2, 2, 3))
3 - ((5,), (2, 3, 3), (2, 2, 2))
3 - ((5,), (3, 3), (2, 2, 2, 2))
Knuth's solution, as implemented by Adeel Zafar Soomro on the same Code Review page can be called as follows if no duplicates are desired:
s = algorithm_u([2,2,2,2,3,3,5],3)
ss = set(tuple(sorted(tuple(tuple(y) for y in x) for x in s)))
I haven't timed it, but Knuth's solution is visibly faster, even for this test case.
However, it returns 63 tuples rather than the 41 returned by user3569's solution. I haven't yet gone through the output closely enough to establish which output is correct.

Here's a version in Haskell:
import Data.List (nub, sort, permutations)
parts 0 = []
parts n = nub $ map sort $ [n] : [x:xs | x <- [1..n`div`2], xs <- parts(n - x)]
partition [] ys result = sort $ map sort result
partition (x:xs) ys result =
partition xs (drop x ys) (result ++ [take x ys])
partitions xs k =
let variations = filter (\x -> length x == k) $ parts (length xs)
in nub $ concat $ map (\x -> mapVariation x (nub $ permutations xs)) variations
where mapVariation variation = map (\x -> partition variation x [])
OUTPUT:
*Main> partitions [1,2,2,3] 2
[[[1],[2,2,3]],[[1,2,3],[2]],[[1,2,2],[3]],[[1,2],[2,3]],[[1,3],[2,2]]]

Python solution:
pip install PartitionSets
Then:
import partitionsets.partition
filter(lambda x: len(x) == k, partitionsets.partition.Partition(arr))
The PartitionSets implementation seems to be pretty fast however it's a pity you can't pass number of partitions as an argument, so you need to filter your k-set partitions from all subset partitions.
You may also want to look at:
similar topic on researchgate.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Opposite of a GROUP in apache pig latin? - hadoop

Related

Gensim tfidf vector in dense form

Sort by key then value which will then be grouped up...pyspark

Obtaining all groups of n pairs of numbers that pass condition

quick sort list of tuple with python

I have a list of numbers, how to generate all unique k-partitions?

Categories

Resources