Gensim tfidf vector in dense form - gensim

I am vectorizing a bunch of documents using Gensim's tfidfmodel. I'd like to take the output so I can dump it into a vector DB and calculate document similarity using the DB. (I'm aware Gensim has the similarities module but would like to do this within a DB.)
Using the example from the docs, here is a sample output:
from gensim import corpora, models

# texts: the tokenized documents from the Gensim tutorial
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
Am I correct in thinking that I need to convert each document array to a dense array? My assumption is that I cannot take the second element in each tuple of the doc array since the document array omits vocab not present. How do I convert this to a dense array?
Is matutils.corpus2dense(corpus_tfidf,dictionary.num_nnz).T the correct approach?
import gensim.matutils
densematrix = gensim.matutils.corpus2dense(corpus_tfidf,dictionary.num_nnz).T
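For reference, here is a minimal sketch of what the dense conversion amounts to, written in plain Python so it does not depend on Gensim. The helper name to_dense and the toy documents are made up for illustration. The key point is that the dense width is the vocabulary size. As far as I understand, the second argument of corpus2dense is num_terms, so len(dictionary) is probably what belongs there, whereas dictionary.num_nnz counts the non-zero entries of the whole bag-of-words matrix and would only pad the result with unused columns.

```python
def to_dense(sparse_doc, num_terms):
    """Expand a list of (term_id, weight) pairs into a fixed-length list.

    Terms that do not occur in the document keep a weight of 0.0.
    """
    vec = [0.0] * num_terms
    for term_id, weight in sparse_doc:
        vec[term_id] = weight
    return vec


# Toy sparse documents (illustrative values, not real TF-IDF output)
docs = [
    [(0, 0.5), (2, 0.5)],
    [(1, 1.0)],
]
num_terms = 3  # vocabulary size; with Gensim this would be len(dictionary)

dense = [to_dense(doc, num_terms) for doc in docs]
print(dense)  # [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
```

Each row then has the same length and the same column order, which is what a vector DB would expect.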


Transformers tokenizer returns overlapping tokens. Is that a bug or am I doing something wrong?

I have been trying to do some token classification using huggingface transformers. I'm seeing instances where the tokenizer returns overlapping tokens. Sometimes (but not always) this will result in the model giving me an entity such that the (start, end) correspond to where the overlap starts and ends, but it lists the entity word as the empty string.
Here is a simple example to illustrate where it returns overlapping tokens:
>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained('xlm-roberta-large-finetuned-conll03-english')
>>> text = "No clue."
>>> tokens = tokenizer(text)
>>> tokens[0].offsets
[(0, 0), (0, 2), (3, 4), (3, 6), (6, 7), (7, 8), (0, 0)]
>>> [text[start:end] for (start,end) in tokens[0].offsets[1:-1]]
['No', 'c', 'clu', 'e', '.']
The examples where the model returns the overlapping character as a named entity are quite a bit longer. I can include them if needed, but shouldn't the tokenizer always return a non-overlapping set of tokens?
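To make the overlap easy to spot in longer inputs, a small checker along these lines might help. It is only a sketch in plain Python: find_overlaps is a hypothetical helper, not part of transformers, and it assumes the offsets are (start, end) character spans in token order, with empty spans such as (0, 0) marking special tokens.

```python
def find_overlaps(offsets):
    """Return consecutive span pairs where the second starts before the first ends."""
    # Drop empty spans, which here stand for special tokens
    spans = [(start, end) for start, end in offsets if end > start]
    return [(a, b) for a, b in zip(spans, spans[1:]) if b[0] < a[1]]


# Offsets copied from the session above
offsets = [(0, 0), (0, 2), (3, 4), (3, 6), (6, 7), (7, 8), (0, 0)]
print(find_overlaps(offsets))  # [((3, 4), (3, 6))]
```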

Sort by key then value which will then be grouped up...pyspark

So I'm trying to sort data in this format...
[((0, 4), 3), ((4, 0), 3), ((1, 6), 1), ((3, 2), 3), ((0, 5), 1)...
Ascending by key and then descending by value. I'm able to achieve this via...
test = test.sortBy(lambda x: (x[0], -x[1]))
which would give me based on shortened version above...
[((0, 4), 3), ((0, 5), 1), ((1, 6), 1), ((3, 2), 3), ((4, 0), 3)...
The problem I'm having is that after the sorting I no longer want the value but do need to retain the sort after grouping the data. So...
test = test.map(lambda x: (x[0][0],x[0][1]))
Gives me...
[(0, 4), (0, 5), (1, 6), (3, 2), (4, 0)...
Which is still in the order I need it but I need the elements to be grouped up by key. I then use this command...
test = test.groupByKey().map(lambda x: (x[0], list(x[1])))
But in the process I lose the sorting. Is there any way to retain it?
I managed to retain the order by changing the format of the tuple...
test = test.map(lambda x: (x[0][0],(x[0][1],x[1])))
test = test.groupByKey().map(lambda x: (x[0], sorted(list(x[1]), key=lambda x: (x[0],-x[1]))))
[(0, [(4, 3), (5, 1)] ...
which leaves me with the value (2nd element in the tuple) that I want to get rid of but took care of that too...
test = test.map(lambda x: (x[0], [e[0] for e in x[1]]))
Feels a bit hacky but not sure how else it could be done.
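For comparison, here is the same group-then-sort-inside-each-group idea as a plain Python sketch, with a list standing in for the RDD. The function name group_sorted is made up for illustration. In PySpark the inner sort could presumably be applied with mapValues after groupByKey, but that part is an assumption and is not shown here.

```python
from collections import defaultdict


def group_sorted(records):
    """records: iterable of ((key, item), value) tuples.

    Groups items by key, orders each group by item ascending and then by
    value descending, and finally drops the value.
    """
    groups = defaultdict(list)
    for (key, item), value in records:
        groups[key].append((item, value))
    return [
        (key, [item for item, _ in sorted(pairs, key=lambda p: (p[0], -p[1]))])
        for key, pairs in sorted(groups.items())
    ]


records = [((0, 4), 3), ((4, 0), 3), ((1, 6), 1), ((3, 2), 3), ((0, 5), 1)]
print(group_sorted(records))
# [(0, [4, 5]), (1, [6]), (3, [2]), (4, [0])]
```

Sorting inside each group after grouping avoids relying on the shuffle to preserve order.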

Finding all unique combinations of overlapping items?

If I have data that's in the form of a list of tuples:
[(uid, start_time, end_time)]
I'd like to find all unique combinations of uids that overlap in time. Eg, if I had a list like the following:
[(0, 1, 2),
(1, 1.1, 3),
(2, 1.5, 2.5),
(3, 2.5, 4),
(4, 4, 5)]
I'd like to get as output:
[(0,1,2), (1,3), (0,), (1,), (2,), (3,), (4,)]
Is there a faster algorithm for this than the naive brute force?
First, sort your tuples by start time. Keep a heap of active tuples, which has the one with the earliest end time on top.
Then, you move through your sorted list and add tuples to the active set. Doing so, you also check if you need to remove tuples. If so, you can report an interval. In order to avoid duplicate reports, report new intervals only if there has been a new tuple added to the active set since the last report.
Here is some pseudo-code that visualizes the idea:
sort(tuples)
activeTuples := new Heap
bool newInsertAfterLastReport = false
for each tuple in tuples
    while activeTuples is not empty and activeTuples.top.endTime <= tuple.startTime
        //the first tuple from the active set has to be removed
        if newInsertAfterLastReport
            report activeTuples
            newInsertAfterLastReport = false
        activeTuples.pop()
    end while
    activeTuples.insert(tuple)
    newInsertAfterLastReport = true
next
if activeTuples has more than 1 entry
    report activeTuples
With your example data set you get:
data = [(0, 1, 2), (1, 1.1, 3), (2, 1.5, 2.5), (3, 2.5, 4), (4, 4, 5)]
tuple            activeTuples                               newInsertAfterLastReport
---------------------------------------------------------------------------------
(0, 1, 2)        []                                         false
                 [(0, 1, 2)]                                true
(1, 1.1, 3)
                 [(0, 1, 2), (1, 1.1, 3)]
(2, 1.5, 2.5)
                 [(0, 1, 2), (2, 1.5, 2.5), (1, 1.1, 3)]
(3, 2.5, 4)      -> report (0, 1, 2)
                 [(2, 1.5, 2.5), (1, 1.1, 3)]               false
                 [(1, 1.1, 3)]
                 [(1, 1.1, 3), (3, 2.5, 4)]                 true
(4, 4, 5)        -> report (1, 3)                           false
                 [(3, 2.5, 4)]
                 []
                 [(4, 4, 5)]
Actually, I would remove the "if activeTuples has more than 1 entry" check and always report at the end. This would result in an additional report of (4) because it is not included in any of the previous reports (whereas (0) ... (3) are).
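The pseudo-code above could be sketched in Python roughly as follows. This is one possible translation rather than a tested library routine: the name overlapping_groups is made up, heapq serves as the heap, and the final report is always emitted, as suggested in the last paragraph. Intervals that merely touch (one ends exactly when the next starts) are treated as non-overlapping, matching the <= in the pseudo-code.

```python
import heapq


def overlapping_groups(intervals):
    """intervals: iterable of (uid, start_time, end_time).

    Returns the reported groups of uids, each as a sorted tuple.
    """
    reports = []
    active = []  # min-heap of (end_time, uid): earliest end on top
    new_insert_after_last_report = False
    for uid, start, end in sorted(intervals, key=lambda t: t[1]):
        # Remove every active interval that ended before this one starts
        while active and active[0][0] <= start:
            if new_insert_after_last_report:
                reports.append(tuple(sorted(u for _, u in active)))
                new_insert_after_last_report = False
            heapq.heappop(active)
        heapq.heappush(active, (end, uid))
        new_insert_after_last_report = True
    if active:  # always report whatever is still active at the end
        reports.append(tuple(sorted(u for _, u in active)))
    return reports


data = [(0, 1, 2), (1, 1.1, 3), (2, 1.5, 2.5), (3, 2.5, 4), (4, 4, 5)]
print(overlapping_groups(data))  # [(0, 1, 2), (1, 3), (4,)]
```

Note that this reports only the maximal groups. The smaller combinations listed in the question, such as (0,) or (1,), are subsets of these.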
I think this can be done in O(n lg n + n o) time where o is the maximum size of your output (o could be n in the worst case).
Build a 3-tuple for each start_time or end_time as follows: the first component is the start_time or end_time of an input tuple, the second component is the id of the input tuple, the third component is whether it's start_time or end_time. Now you have 2n 3-tuples. Sort them in ascending order of the first component.
Now start scanning the list of 3-tuples from the smallest to the largest. Each time a range starts, add its id to a balanced binary search tree (in O(lg o) time), and output the contents of the tree (in O(o)), and each time a range ends, remove its id from the tree (in O(lg o) time).
You also need to take care of the corner cases, e.g., how to deal with equal start and end times either of the same range or of different ranges.
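A rough sketch of this event-based scan, under a few assumptions of mine: a plain Python set replaces the balanced binary search tree (so the per-step costs differ from the bounds above), the name sweep_active_sets is made up, and at equal times an end event is processed before a start event, so touching ranges do not count as overlapping. Zero-length ranges are not handled.

```python
def sweep_active_sets(intervals):
    """intervals: iterable of (uid, start_time, end_time).

    Returns the active set of uids each time a range starts.
    """
    events = []
    for uid, start, end in intervals:
        events.append((start, 1, uid))  # 1 marks a start event
        events.append((end, 0, uid))    # 0 marks an end event; sorts first on ties
    events.sort()

    active = set()
    output = []
    for _, is_start, uid in events:
        if is_start:
            active.add(uid)
            output.append(tuple(sorted(active)))
        else:
            active.discard(uid)
    return output


data = [(0, 1, 2), (1, 1.1, 3), (2, 1.5, 2.5), (3, 2.5, 4), (4, 4, 5)]
print(sweep_active_sets(data))
# [(0,), (0, 1), (0, 1, 2), (1, 3), (4,)]
```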

quick sort list of tuple with python

I am trying to implement the quicksort algorithm to sort the elements of a list of tuples. For example, given a list such as [(0,1), (1,1), (2,1), (3,3), (4,2), (5,1), (6,4)], I want to sort it by the second element of each tuple, in descending order, and obtain [(6,4), (3,3), (4,2), (0,1), (1,1), (2,1), (5,1)]. I have tried the following algorithm:
def partition(array, begin, end, cmp):
    pivot=array[end][1]
    ii=begin
    for jj in xrange(begin, end):
        if cmp(array[jj][1], pivot):
            array[ii], array[jj] = array[jj], array[ii]
            ii+=1
    array[ii], array[end] = pivot, array[ii]
    return ii

def sort(array, cmp=lambda x, y: x > y, begin=0, end=None):
    if end is None: end = len(array)
    if begin < end:
        i = partition(array, begin, end-1, cmp)
        sort(array, cmp, i+1, end)
        sort(array, cmp, begin, i)
The problem is that the result is this: [4, (3, 3), (4, 2), 1, 1, 1, (5, 1)]. What do I have to change to get the correct result?
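A likely cause, judging from the output: the last assignment in partition stores pivot (the integer key) in the list instead of the pivot tuple array[end]. Below is a sketch of a corrected version, ported to Python 3 (range instead of xrange) and renamed quicksort for clarity. Since quicksort is not stable, tuples with equal second elements may come out in a different relative order than in the expected result above.

```python
def partition(array, begin, end, cmp):
    pivot = array[end][1]  # compare on the second element only
    ii = begin
    for jj in range(begin, end):
        if cmp(array[jj][1], pivot):
            array[ii], array[jj] = array[jj], array[ii]
            ii += 1
    # Swap the whole pivot tuple into place, not just its key
    array[ii], array[end] = array[end], array[ii]
    return ii


def quicksort(array, cmp=lambda x, y: x > y, begin=0, end=None):
    if end is None:
        end = len(array)
    if begin < end:
        i = partition(array, begin, end - 1, cmp)
        quicksort(array, cmp, i + 1, end)
        quicksort(array, cmp, begin, i)


items = [(0, 1), (1, 1), (2, 1), (3, 3), (4, 2), (5, 1), (6, 4)]
quicksort(items)
print([t[1] for t in items])  # [4, 3, 2, 1, 1, 1, 1]
```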
Complex sorting patterns in Python are painless. Python's sorting algorithm is state of the art, one of the fastest available in real-world cases. No algorithm design needed.
>>> from operator import itemgetter
>>> l = [(0,1), (1,1), (2,1), (3,3), (4,2), (5,1), (6,4 )]
>>> l.sort(key=itemgetter(1), reverse=True)
>>> l
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
Above, itemgetter returns a function that returns the second element of its argument. Thus the key argument to sort is a function that returns the item on which to sort the list.
Python's sort is stable, so the ordering of elements with equal keys (in this case, the second item of each tuple) is determined by the original order.
Unfortunately, the answer from @wkschwartz only gives the expected result because of the particular initial ordering of the items. If the tuple (5, 1) is moved to the beginning of the list, it gives a different answer.
The first method below gives the same result for any initial ordering of the items in the list.
Python 3.4.2 |Continuum Analytics, Inc.| (default, Oct 22 2014, 11:51:45) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> l = [(0,1), (1,1), (2,1), (3,3), (4,2), (5,1), (6,4 )]
>>> sorted(l, key=lambda x: (-x[1], x[0]))
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
>>> from operator import itemgetter
>>> sorted(l, key=itemgetter(1), reverse=True)
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
>>> # but note:
>>> l2 = [(5,1), (1,1), (2,1), (3,3), (4,2), (0,1), (6,4 )]
>>> # Swapped first and sixth elements
>>> sorted(l2, key=itemgetter(1), reverse=True)
[(6, 4), (3, 3), (4, 2), (5, 1), (1, 1), (2, 1), (0, 1)]
>>> sorted(l2, key=lambda x: (-x[1], x[0]))
[(6, 4), (3, 3), (4, 2), (0, 1), (1, 1), (2, 1), (5, 1)]
>>>

How should I generate the partitions / pairs for the Chinese Postman problem?

I'm working on a program for class that involves solving the Chinese Postman problem. Our assignment only requires us to write a program to solve it for a hard-coded graph but I'm attempting to solve it for the general case on my own.
The part that is giving me trouble is generating the partitions of pairings for the odd vertices.
For example, if I had the following labeled odd vertices in a graph:
1 2 3 4 5 6
I need to find all the possible pairings / partitions I can make with these vertices.
I've figured out I'll have i partitions given:
n = number of odd vertices
k = n / 2
i = ((2k)(2k-1)(2k-2)...(k+1)) / 2^k
So, given the 6 odd vertices above, we know that we need to generate i = 15 partitions.
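As a quick sanity check on that count: the number of ways to split n = 2k labeled vertices into k unordered pairs is (2k)! / (k! * 2^k). A small sketch, where count_pairings is just an illustrative name:

```python
from math import factorial


def count_pairings(n):
    """Number of ways to split n labeled vertices into unordered pairs."""
    if n % 2:
        raise ValueError("n must be even")
    k = n // 2
    # (2k)! / (k! * 2^k); the division is always exact
    return factorial(n) // (factorial(k) * 2 ** k)


print(count_pairings(6))  # 15
```

For 6 vertices this gives 720 / (6 * 8) = 15, matching the figure above.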
The 15 partitions would look like:
1 2   3 4   5 6
1 2   3 5   4 6
1 2   3 6   4 5
...
1 6   ...
Then, for each partition, I take each pair, find the shortest distance between its two vertices, and sum these distances for that partition. The partition with the smallest total distance is selected, and I then double all the edges along the shortest paths between the paired odd vertices (as found in the selected partition).
These represent the edges the postman will have to walk twice.
At first I thought I had worked out an appropriate algorithm for generating these partitions:
Start with all the odd vertices sorted in increasing order:
12 34 56
Select the pair behind the pair that currently has the max vertex:
12 [34] 56
Increase the second digit in this pair by 1. Leave everything to the left of the selected pair the same, and make everything to the right of the selected pair the remaining numbers in the set, sorted in increasing order:
12 35 46
Repeat.
However, this is flawed. For example, I realized that when I reach the end and the selected pair is at the leftmost position, i.e.:
[16] .. ..
The algorithm I worked out will stop in this case, and not generate the rest of the pairs that begin [16], because there is no pair to the left of it to alter.
So, it is back to the drawing board.
Does anyone who has studied this problem before have any tips that can help point me in the right direction for generating these partitions?
You can construct the partitions using a recursive algorithm.
Take the lowest node, in this case node 1. This must be paired with one of the other unpaired nodes (2 to 6). For each of these nodes, pair it with node 1, then find all pairings of the remaining four elements by applying the same algorithm to them.
In Python:
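def get_pairs(s):
    # Yield every way to split the set s into unordered pairs.
    if not s:
        yield []
    else:
        i = min(s)
        # sorted() keeps the output order deterministic
        for j in sorted(s - {i}):
            for r in get_pairs(s - {i, j}):
                yield [(i, j)] + r

for x in get_pairs({1, 2, 3, 4, 5, 6}):
    print(x)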
This generates the following solutions:
[(1, 2), (3, 4), (5, 6)]
[(1, 2), (3, 5), (4, 6)]
[(1, 2), (3, 6), (4, 5)]
[(1, 3), (2, 4), (5, 6)]
[(1, 3), (2, 5), (4, 6)]
[(1, 3), (2, 6), (4, 5)]
[(1, 4), (2, 3), (5, 6)]
[(1, 4), (2, 5), (3, 6)]
[(1, 4), (2, 6), (3, 5)]
[(1, 5), (2, 3), (4, 6)]
[(1, 5), (2, 4), (3, 6)]
[(1, 5), (2, 6), (3, 4)]
[(1, 6), (2, 3), (4, 5)]
[(1, 6), (2, 4), (3, 5)]
[(1, 6), (2, 5), (3, 4)]
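To connect this back to the selection step in the question (pick the partition with the smallest total distance), here is a hedged sketch. The names pairings and cheapest_pairing and the dist table are hypothetical. It assumes the shortest-path distances between odd vertices are already computed and stored as dist[a][b] for a < b. It is brute force, so it is only practical for a small number of odd vertices.

```python
def pairings(vertices):
    """Yield every way to split the given vertices into unordered pairs."""
    vertices = sorted(vertices)
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def cheapest_pairing(vertices, dist):
    """Return the pairing whose summed pair distances are smallest."""
    return min(pairings(vertices), key=lambda p: sum(dist[a][b] for a, b in p))


# Made-up shortest-path distances between four odd vertices (dist[a][b], a < b)
dist = {1: {2: 1, 3: 4, 4: 5}, 2: {3: 2, 4: 6}, 3: {4: 1}}
print(cheapest_pairing([1, 2, 3, 4], dist))  # [(1, 2), (3, 4)]
```

The edges along the shortest paths of the chosen pairs are then the ones to duplicate.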
