Generate a set of tuples from one tuple in Pig (Hadoop)

I couldn't find any solution for generating, in Pig, a set of tuples from one tuple according to the following rule:
Input:
((1,2,3),(a,b,c),(aaa,bbb,ccc))
Output:
(1,a,aaa)
(2,b,bbb)
(3,c,ccc)
I suppose TOBAG and FLATTEN should be applied, but it seems too tricky.

In Python, use the zip builtin function and argument unpacking ("star" args):
>>> x = ((1,2,3),('a','b','c'),('aaa','bbb','ccc'))
>>> tuple(zip(*x))
((1, 'a', 'aaa'), (2, 'b', 'bbb'), (3, 'c', 'ccc'))
>>> for y in zip(*x):
...     print(y)
(1, 'a', 'aaa')
(2, 'b', 'bbb')
(3, 'c', 'ccc')

[tuple(original[i] for original in originals) for i in range(len(originals[0]))]
will give you the desired list of tuples if your original tuple of tuples is called originals.

Related

Transformers tokenizer returns overlapping tokens. Is that a bug or am I doing something wrong?

I have been trying to do some token classification using huggingface transformers. I'm seeing instances where the tokenizer returns overlapping tokens. Sometimes (but not always) this will result in the model giving me an entity such that the (start, end) correspond to where the overlap starts and ends, but it lists the entity word as the empty string.
Here is a simple example to illustrate where it returns overlapping tokens:
>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained('xlm-roberta-large-finetuned-conll03-english')
>>> text = "No clue."
>>> tokens = tokenizer(text)
>>> tokens[0].offsets
[(0, 0), (0, 2), (3, 4), (3, 6), (6, 7), (7, 8), (0, 0)]
>>> [text[start:end] for (start,end) in tokens[0].offsets[1:-1]]
['No', 'c', 'clu', 'e', '.']
The examples where the model returns the overlapping character as a named entity are quite a bit longer. I can include them if needed, but shouldn't the tokenizer always return a non-overlapping set of tokens?

Finding all unique combinations of overlapping items?

If I have data that's in the form of a list of tuples:
[(uid, start_time, end_time)]
I'd like to find all unique combinations of uids that overlap in time. Eg, if I had a list like the following:
[(0, 1, 2),
(1, 1.1, 3),
(2, 1.5, 2.5),
(3, 2.5, 4),
(4, 4, 5)]
I'd like to get as output:
[(0,1,2), (1,3), (0,), (1,), (2,), (3,), (4,)]
Is there a faster algorithm for this than the naive brute force?
First, sort your tuples by start time. Keep a heap of active tuples, which has the one with the earliest end time on top.
Then, you move through your sorted list and add tuples to the active set. Doing so, you also check if you need to remove tuples. If so, you can report an interval. In order to avoid duplicate reports, report new intervals only if there has been a new tuple added to the active set since the last report.
Here is some pseudo-code that visualizes the idea:
sort(tuples)
activeTuples := new Heap
bool newInsertAfterLastReport = false
for each tuple in tuples
    while activeTuples is not empty and activeTuples.top.endTime <= tuple.startTime
        // the first tuple from the active set has to be removed
        if newInsertAfterLastReport
            report activeTuples
            newInsertAfterLastReport = false
        activeTuples.pop()
    end while
    activeTuples.insert(tuple)
    newInsertAfterLastReport = true
next
if activeTuples has more than 1 entry
    report activeTuples
With your example data set you get:
data = [(0, 1, 2), (1, 1.1, 3), (2, 1.5, 2.5), (3, 2.5, 4), (4, 4, 5)]
tuple          activeTuples                                newInsertAfterLastReport
-----------------------------------------------------------------------------------
(0, 1, 2)      []                                          false
               [(0, 1, 2)]                                 true
(1, 1.1, 3)    [(0, 1, 2), (1, 1.1, 3)]
(2, 1.5, 2.5)  [(0, 1, 2), (2, 1.5, 2.5), (1, 1.1, 3)]
(3, 2.5, 4)    -> report (0, 1, 2)
               [(2, 1.5, 2.5), (1, 1.1, 3)]                false
               [(1, 1.1, 3)]
               [(1, 1.1, 3), (3, 2.5, 4)]                  true
(4, 4, 5)      -> report (1, 3)                            false
               [(3, 2.5, 4)]
               []
               [(4, 4, 5)]
Actually, I would remove the if activeTuples has more than 1 entry part and always report at the end. This would result in an additional report of (4) because it is not included in any of the previous reports (whereas (0) ... (3) are).
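The pseudo-code above, with this suggested change of always reporting the remaining active set at the end, might be sketched in Python along these lines (`report_overlaps` is a made-up name for illustration):

```python
import heapq

def report_overlaps(intervals):
    """Sweep intervals sorted by start time, keeping a min-heap of active
    intervals ordered by end time. Report the active set just before an
    eviction, but only if something new was added since the last report."""
    intervals = sorted(intervals, key=lambda t: t[1])  # sort by start_time
    active = []            # min-heap of (end_time, uid)
    reports = []
    new_since_report = False
    for uid, start, end in intervals:
        # evict active intervals that end at or before this one's start
        while active and active[0][0] <= start:
            if new_since_report:
                reports.append(tuple(sorted(u for _, u in active)))
                new_since_report = False
            heapq.heappop(active)
        heapq.heappush(active, (end, uid))
        new_since_report = True
    # always report at the end, per the note above
    if active and new_since_report:
        reports.append(tuple(sorted(u for _, u in active)))
    return reports

data = [(0, 1, 2), (1, 1.1, 3), (2, 1.5, 2.5), (3, 2.5, 4), (4, 4, 5)]
print(report_overlaps(data))  # [(0, 1, 2), (1, 3), (4,)]
```

As described, this reports only the maximal groups; the singletons (0,) through (3,) from the question's expected output are subsets of earlier reports.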
I think this can be done in O(n lg n + n o) time where o is the maximum size of your output (o could be n in the worst case).
Build a 3-tuple for each start_time or end_time as follows: the first component is the start_time or end_time of an input tuple, the second component is the id of the input tuple, the third component is whether it's start_time or end_time. Now you have 2n 3-tuples. Sort them in ascending order of the first component.
Now start scanning the list of 3-tuples from the smallest to the largest. Each time a range starts, add its id to a balanced binary search tree (in O(lg o) time), and output the contents of the tree (in O(o)), and each time a range ends, remove its id from the tree (in O(lg o) time).
You also need to take care of the corner cases, e.g., how to deal with equal start and end times either of the same range or of different ranges.
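A sketch of that scan in Python, with a plain set standing in for the balanced BST (average O(1) insert/remove instead of O(lg o)); `scan_overlaps` is a hypothetical name, and equal times are handled by processing end events before start events:

```python
def scan_overlaps(intervals):
    """Build (time, kind, uid) events, sort them, and snapshot the active
    set every time a range starts. kind 0 = end, 1 = start, so at equal
    times an ending range is removed before a starting one is added."""
    events = []
    for uid, start, end in intervals:
        events.append((start, 1, uid))
        events.append((end, 0, uid))
    events.sort()
    active = set()
    snapshots = []
    for _time, kind, uid in events:
        if kind == 1:
            active.add(uid)
            snapshots.append(tuple(sorted(active)))
        else:
            active.discard(uid)
    return snapshots

data = [(0, 1, 2), (1, 1.1, 3), (2, 1.5, 2.5), (3, 2.5, 4), (4, 4, 5)]
print(scan_overlaps(data))  # [(0,), (0, 1), (0, 1, 2), (1, 3), (4,)]
```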

string a = "1,2,3" string b = "4,5,6" required answer c = (1,4), (2,5), (3,6) in Ruby

I have two strings
a = "1,2,3"
b = "4,5,6"
How can I achieve this result: c = "(1,4), (2,5), (3,6)"? I have already tried many solutions but with no success.
Use Array#zip to create an array of pairs.
a.split(',').zip(b.split(',')).map { |x, y| "(#{x},#{y})" }.join(', ')
Use zip instead of product. Here:
puts a.split(',').map(&:to_i).zip(b.split(',').map(&:to_i)).to_s.tr('[]', '()')
# ((1, 4), (2, 5), (3, 6))
Explaining the steps:
a.split(',') splits the string into an array (["1", "2", "3"]).
.map(&:to_i) converts the array elements to integers ([1, 2, 3]).
.zip merges the two arrays in the desired way.
I'm using tr to replace [ and ] with ( and ).
require 'json'
(JSON.load("[#{a}]").zip(JSON.load("[#{b}]")).to_s.tr('[]', '()'))[1...-1]

Python: What is the right way to modify list elements?

I have this list of tuples:
l = [('a','b'),('c','d'),('e','f')]
And two parameters: a key value, and a new value to modify. For example,
key = 'a'
new_value= 'B' # it means, modify with 'B' the value in tuples where there's an 'a'
I have these two options (both work):
f = lambda t,k,v: t[0] == k and (k,v) or t
new_list = [f(t,key,new_value) for t in l]
print new_list
and
new_list = []
for i in range(len(l)):
    elem = l.pop()
    if elem[0] == key:
        new_list.append((key,new_value))
    else:
        new_list.append(elem)
print new_list
But I'm new to Python and don't know if it's right.
Can you help me? Thank you!
Here is one solution involving altering the items in-place.
def replace(list_, key, new_value):
    for i, (current_key, current_value) in enumerate(list_):
        if current_key == key:
            list_[i] = (key, new_value)
Or, to append if it's not in there,
def replace_or_append(list_, key, new_value):
    for i, (current_key, current_value) in enumerate(list_):
        if current_key == key:
            list_[i] = (key, new_value)
            break
    else:
        list_.append((key, new_value))
Usage:
>>> my_list = [('a', 'b'), ('c', 'd')]
>>> replace(my_list, 'a', 'B')
>>> my_list
[('a', 'B'), ('c', 'd')]
If you want to create a new list, a list comprehension is easiest.
>>> my_list = [('a', 'b'), ('c', 'd')]
>>> find_key = 'a'
>>> new_value = 'B'
>>> new_list = [(key, new_value if key == find_key else value) for key, value in my_list]
>>> new_list
[('a', 'B'), ('c', 'd')]
And if you wanted it to append if it wasn't there,
>>> if len(new_list) == len(my_list):
...     new_list.append((find_key, new_value))
(Note also I've changed your variable name from l; l is too easily confused with I and 1 and is best avoided. Thus saith PEP8 and I agree with it.)
To create a new list, a list comprehension would do:
In [102]: [(key,'B' if key=='a' else val) for key,val in l]
Out[102]: [('a', 'B'), ('c', 'd'), ('e', 'f')]
To modify the list in place:
l = [('a','b'),('c','d'),('e','f')]
for i, elt in enumerate(l):
    key, val = elt
    if key == 'a':
        l[i] = (key, 'B')
print(l)
# [('a', 'B'), ('c', 'd'), ('e', 'f')]
To modify existing list just use list assignment, e.g.
>>> l = [('a','b'),('c','d'),('e','f')]
>>> l[0] = ('a','B')
>>> print l
[('a', 'B'), ('c', 'd'), ('e', 'f')]
I would usually prefer to create a new list using comprehension, e.g.
[(key, new_value) if x[0] == key else x for x in l]
But, as the first comment has already mentioned, it sounds like you are trying to make a list do something which you should really be using a dict for instead.
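As a point of comparison, here is what the same update might look like with a dict (assuming keys are unique, which the list-of-tuples version does not enforce):

```python
pairs = [('a', 'b'), ('c', 'd'), ('e', 'f')]
d = dict(pairs)   # {'a': 'b', 'c': 'd', 'e': 'f'}
d['a'] = 'B'      # replace-or-append becomes a single assignment

# On Python 3.7+, dicts preserve insertion order, so round-tripping
# back to a list of tuples keeps the original ordering:
print(list(d.items()))  # [('a', 'B'), ('c', 'd'), ('e', 'f')]
```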
Here's the approach I would use.
>>> l = [('a','b'),('c','d'),('e','f')]
>>> key = 'a'
>>> new_value= 'B'
>>> for pos in (index for index, (k, v) in enumerate(l) if k == key):
...     l[pos] = (key, new_value)
...     break
... else:
...     l.append((key, new_value))
...
>>> l
[('a', 'B'), ('c', 'd'), ('e', 'f')]
This looks an awful lot like an OrderedDict, though; key-value pairs with preserved ordering. You might want to take a look at that and see if it suits your needs.
Edit: Replaced try:...except StopIteration: with for:...break...else: since that might look a bit less weird.

Successive adding of char to get the longest word in the dictionary [closed]

Given a dictionary of words and an initial character, find the longest possible word in the dictionary by successively adding a character to the word. At any given instance the word should be a valid word in the dictionary.
E.g.: a -> at -> cat -> cart -> chart ...
The brute force approach would be to try adding letters to each available index using a depth-first search.
So, starting with 'a', there are two places you can add a new letter. In front or behind the 'a', represented by dots below.
.a.
If you add a 't', there are now three positions.
.a.t.
You can try adding all 26 letters to each available position. The dictionary in this case can be a simple hashtable. If you add a 'z' in the middle, you get 'azt' which would not be in the hashtable so you don't continue down that path in the search.
Edit: Nick Johnson's graph made me curious what a graph of all maximal paths would look like. It's a large (1.6 MB) image here:
http://www.michaelfogleman.com/static/images/word_graph.png
Edit: Here's a Python implementation. The brute-force approach actually runs in a reasonable amount of time (a few seconds, depending on the starting letter).
import heapq

letters = 'abcdefghijklmnopqrstuvwxyz'

def search(words, word, path):
    path.append(word)
    yield tuple(path)
    for i in xrange(len(word)+1):
        before, after = word[:i], word[i:]
        for c in letters:
            new_word = '%s%s%s' % (before, c, after)
            if new_word in words:
                for new_path in search(words, new_word, path):
                    yield new_path
    path.pop()

def load(path):
    result = set()
    with open(path, 'r') as f:
        for line in f:
            word = line.lower().strip()
            result.add(word)
    return result

def find_top(paths, n):
    gen = ((len(x), x) for x in paths)
    return heapq.nlargest(n, gen)

if __name__ == '__main__':
    words = load('TWL06.txt')
    gen = search(words, 'b', [])
    top = find_top(gen, 10)
    for path in top:
        print path
Of course, there will be a lot of ties in the answer. This will print the top N results, measured by length of the final word.
Output for starting letter 'a', using the TWL06 Scrabble dictionary.
(10, ('a', 'ta', 'tap', 'tape', 'taped', 'tamped', 'stamped', 'stampede', 'stampedes', 'stampeders'))
(10, ('a', 'ta', 'tap', 'tape', 'taped', 'tamped', 'stamped', 'stampede', 'stampeder', 'stampeders'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'strangle', 'strangles', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'strangle', 'strangler', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'strangles', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'strangers', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'strangers', 'estrangers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranges', 'estranges', 'estrangers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranger', 'strangler', 'stranglers'))
(10, ('a', 'ta', 'tan', 'tang', 'stang', 'strang', 'strange', 'stranger', 'strangers', 'stranglers'))
And here are the results for each starting letter. Of course, an exception is made that the single starting letter doesn't have to be in the dictionary. Just some 2-letter word that can be formed with it.
(10, ('a', 'ta', 'tap', 'tape', 'taped', 'tamped', 'stamped', 'stampede', 'stampedes', 'stampeders'))
(9, ('b', 'bo', 'bos', 'bods', 'bodes', 'bodies', 'boodies', 'bloodies', 'bloodiest'))
(1, ('c',))
(10, ('d', 'od', 'cod', 'coed', 'coped', 'comped', 'compted', 'competed', 'completed', 'complected'))
(10, ('e', 're', 'rue', 'ruse', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(9, ('f', 'fe', 'foe', 'fore', 'forge', 'forges', 'forgoes', 'forgoers', 'foregoers'))
(10, ('g', 'ag', 'tag', 'tang', 'stang', 'strang', 'strange', 'strangle', 'strangles', 'stranglers'))
(9, ('h', 'sh', 'she', 'shes', 'ashes', 'sashes', 'slashes', 'splashes', 'splashers'))
(11, ('i', 'pi', 'pin', 'ping', 'oping', 'coping', 'comping', 'compting', 'competing', 'completing', 'complecting'))
(7, ('j', 'jo', 'joy', 'joky', 'jokey', 'jockey', 'jockeys'))
(9, ('k', 'ki', 'kin', 'akin', 'takin', 'takins', 'takings', 'talkings', 'stalkings'))
(10, ('l', 'la', 'las', 'lass', 'lassi', 'lassis', 'lassies', 'glassies', 'glassines', 'glassiness'))
(10, ('m', 'ma', 'mas', 'mars', 'maras', 'madras', 'madrasa', 'madrassa', 'madrassas', 'madrassahs'))
(11, ('n', 'in', 'pin', 'ping', 'oping', 'coping', 'comping', 'compting', 'competing', 'completing', 'complecting'))
(10, ('o', 'os', 'ose', 'rose', 'rouse', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(11, ('p', 'pi', 'pin', 'ping', 'oping', 'coping', 'comping', 'compting', 'competing', 'completing', 'complecting'))
(3, ('q', 'qi', 'qis'))
(10, ('r', 're', 'rue', 'ruse', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(10, ('s', 'us', 'use', 'uses', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(10, ('t', 'ti', 'tin', 'ting', 'sting', 'sating', 'stating', 'estating', 'restating', 'restarting'))
(10, ('u', 'us', 'use', 'uses', 'ruses', 'rouses', 'arouses', 'carouses', 'carousels', 'carrousels'))
(1, ('v',))
(9, ('w', 'we', 'wae', 'wake', 'wakes', 'wackes', 'wackest', 'wackiest', 'whackiest'))
(8, ('x', 'ax', 'max', 'maxi', 'maxim', 'maxima', 'maximal', 'maximals'))
(8, ('y', 'ye', 'tye', 'stye', 'styed', 'stayed', 'strayed', 'estrayed'))
(8, ('z', 'za', 'zoa', 'zona', 'zonae', 'zonate', 'zonated', 'ozonated'))
If you want to do this once, I'd do the following (generalized to the problem of starting with a full word):
Take your entire dictionary and throw away anything that does not have a superset of the characters in your target word (let's say it has length m). Then bin the remaining words by length. For each word of length m+1, try dropping each letter and see if that yields your desired word. If not, toss it. Then check each word of length m+2 against the valid set of length m+1, dropping any that can't be reduced. Keep going until you find an empty set; the last thing(s) you found will be the longest.
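A rough Python sketch of that filtering-and-binning idea (the function name and details are my own):

```python
from collections import Counter, defaultdict

def longest_by_reduction(words, start):
    """Keep only words containing every letter of `start` (with multiplicity),
    bin them by length, then grow the valid set one length at a time: a word
    of length m+1 survives iff deleting some letter yields a valid word."""
    need = Counter(start)
    by_len = defaultdict(set)
    for w in words:
        wc = Counter(w)
        if all(wc[ch] >= n for ch, n in need.items()):
            by_len[len(w)].add(w)
    valid, m = {start}, len(start)
    while True:
        survivors = {w for w in by_len[m + 1]
                     if any(w[:i] + w[i+1:] in valid for i in range(len(w)))}
        if not survivors:
            return valid   # the longest reachable word(s)
        valid, m = survivors, m + 1

print(longest_by_reduction(['a', 'at', 'cat', 'cart', 'chart', 'dog'], 'a'))
# {'chart'}
```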
If you want to make this fast to look up, I'd build a suffix-tree-like data structure.
Group all words by length. For each word of length 2, place each of its two characters in a "subword" set, and add that word to each of the characters' "superword" sets. Now you've got a link between all valid words of length 2 and all characters. Do the same with words of length 3 and valid words of length 2. Now you can start anywhere in this hierarchy and do a breadth-first search to find the deepest branch.
Edit: the speed of this solution will depend greatly on the structure of the language, but if we decide to build everything using sets with log(n) performance for all operations (i.e. we use red-black trees or the like), and we have N(m) words of length m, then forming the link between words of length m+1 and m will take approximately (m+1)*m*N(m+1)*log(N(m)) time (taking into account that string compares take linear time in the length of the string). Since we have to do this for all word lengths, the runtime for building the full data structure will be something on the order of
(typical word length)^3 * (dictionary length) * log (dictionary length / word length)
(The initial binning into words of a certain length will take linear time so can be neglected; the actual formula for runtime is complicated because it depends on the distribution of word lengths; for the case where you're doing it from a single word it's even more complicated because it depends on the expected number of longer words that have shorter subwords.)
Assuming you need to do this repeatedly (or you want the answer for every one of the 26 letters), do it backwards:
Load a dictionary, and sort it by length, descending
Establish a mapping, initially empty, between words and (extension, max_len) tuples.
For each word in the sorted list:
If it's already in the mapping, retrieve the max len.
If it's not, set the max len to the word length.
Examine each word produced by deleting a character. If that word is not in the mapping, or our max_len exceeds the max_len of the word already in the mapping, update the mapping with the current word and max_len
Then, to get the chain for a given prefix, simply start with that prefix and repeatedly look it and its extensions up in the dictionary.
Here's the sample Python code:
words = [x.strip().lower() for x in open('/usr/share/dict/words')]
words.sort(key=lambda x: len(x), reverse=True)

word_map = {}  # Maps words to (extension, max_len) tuples

for word in words:
    if word in word_map:
        max_len = word_map[word][1]
    else:
        max_len = len(word)
    for i in range(len(word)):
        new_word = word[:i] + word[i+1:]
        if new_word not in word_map or word_map[new_word][1] < max_len:
            word_map[new_word] = (word, max_len)

# Get a chain for each letter
for term in "abcdefghijklmnopqrstuvwxyz":
    chain = [term]
    while term in word_map:
        term = word_map[term][0]
        chain.append(term)
    print chain
And its output for each letter of the alphabet:
['a', 'ah', 'bah', 'bach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['b', 'ba', 'bac', 'bach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['c', 'ca', 'cap', 'camp', 'campo', 'campho', 'camphor', 'camphory', 'camphoryl', 'camphoroyl']
['d', 'ad', 'cad', 'card', 'carid', 'carida', 'caridea', 'acaridea', 'acaridean']
['e', 'er', 'ser', 'sere', 'secre', 'secret', 'secreto', 'secretor', 'secretory', 'asecretory']
['f', 'fo', 'fot', 'frot', 'front', 'afront', 'affront', 'affronte', 'affronted']
['g', 'og', 'log', 'logy', 'ology', 'oology', 'noology', 'nosology', 'nostology', 'gnostology']
['h', 'ah', 'bah', 'bach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['i', 'ai', 'lai', 'lain', 'latin', 'lation', 'elation', 'delation', 'dealation', 'dealbation']
['j', 'ju', 'jug', 'juga', 'jugal', 'jugale']
['k', 'ak', 'sak', 'sake', 'stake', 'strake', 'straked', 'streaked']
['l', 'la', 'lai', 'lain', 'latin', 'lation', 'elation', 'delation', 'dealation', 'dealbation']
['m', 'am', 'cam', 'camp', 'campo', 'campho', 'camphor', 'camphory', 'camphoryl', 'camphoroyl']
['n', 'an', 'lan', 'lain', 'latin', 'lation', 'elation', 'delation', 'dealation', 'dealbation']
['o', 'lo', 'loy', 'logy', 'ology', 'oology', 'noology', 'nosology', 'nostology', 'gnostology']
['p', 'pi', 'pig', 'prig', 'sprig', 'spring', 'springy', 'springly', 'sparingly', 'sparringly']
['q']
['r', 'ra', 'rah', 'rach', 'brach', 'branch', 'branchi', 'branchia', 'branchiae', 'branchiate', 'abranchiate']
['s', 'si', 'sig', 'spig', 'sprig', 'spring', 'springy', 'springly', 'sparingly', 'sparringly']
['t', 'ut', 'gut', 'gutt', 'gutte', 'guttle', 'guttule', 'guttulae', 'guttulate', 'eguttulate']
['u', 'ut', 'gut', 'gutt', 'gutte', 'guttle', 'guttule', 'guttulae', 'guttulate', 'eguttulate']
['v', 'vu', 'vum', 'ovum']
['w', 'ow', 'low', 'alow', 'allow', 'hallow', 'shallow', 'shallowy', 'shallowly']
['x', 'ox', 'cox', 'coxa', 'coxal', 'coaxal', 'coaxial', 'conaxial']
['y', 'ly', 'loy', 'logy', 'ology', 'oology', 'noology', 'nosology', 'nostology', 'gnostology']
['z', 'za', 'zar', 'izar', 'izard', 'izzard', 'gizzard']
Edit: Given the degree to which branches merge towards the end, I thought it would be interesting to draw a graph to demonstrate this.
An interesting extension of this challenge: It's likely there are several equilength final words for some letters. Which set of chains minimizes the number of final nodes (eg, merges the most letters)?