how to convert a matrix to BoW format? - gensim

I am trying to convert a matrix to the type that can be received by gensim. AuthorTopic Model, which means I should convert a matrix to a sparse vector. I have already tried several functions in gensim like gensim.matutils.full2sparse and gensim.matutils.any2sparse. But there is something wrong:
my code:
matrix=numpy.array([[1,0 ,1],[0,1,1]])
mycorpus=any2sparse(matrix)
print(matrix)
print(mycorpus)
the output:
[[1 0 1]
[0 1 1]]
[(0, 1.0), (0, 1.0), (1, 0.0), (1, 0.0)] #mycorpus
accoring to the tutorial, mycorpus should be like:
[[(0,1),(2,1)]
[(1,1),(2,1)]]
I have no idea what's wrong. I really appreciate if anyone could give me some advise.

The Gensim AuthorTopicModel docs describe its desired corpus-format as iterable of list of (int, float).
Those int values would be word-ids, and ideally be accompanied by the id2word dict which idntifies which int means which word.
What's the source of your matrix, & do you know if it's the rows or the columns that represent words, and have a mapping of indexes to words? That will drive the conversion.
Also, as the docs mention, "The model is closely related to LdaModel. The AuthorTopicModel class inherits LdaModel, and its usage is thus similar.
Have you reviewed guides to Gensim LDA usage to see how they prepare their corpus, such as the multiple Usage Examples, to see if that helps suggest steps & necessary formats?
Or, is your corpus still available as texts, so you can directly use the examples there as a model to turn the text into the BoW format (rather than your already-processed matrix)?
If you're still having problems, you should expand your question text with more details, especially how the true corpus matrix that you have was created, and which errors you've encountered (& how you triggered them) that convince you things aren't working.

Related

Control print order of matrix terms in Sympy

I have a matrix addition with several terms that I want to display in a Jupyter Notebook. I need the order of terms to match the standard notation - in my case, of linear regression. But, the terms do not, by default, appear in the correct order for my purpose, and I would like to ask how to control the order of display of matrices in a matrix addition (MatAdd) term in Sympy. For example, here we see that Sympy selects a particular order for the terms, that appears to be based on the values in the Matrix.
from sympy import MatAdd, Matrix
A = Matrix([1])
B = Matrix([0])
print(MatAdd(A, B, evaluate=False))
This gives
Matrix([[0]]) + Matrix([[1]])
Notice the matrix terms do not follow the order of defintion or the variable names.
Is there anything I can do to control the print output order of Matrix terms in a MatAdd expression?
You can use init_printing to chose from a few options. In particular, the order keyword should control how things are shown on the screen vs how things are stored in SymPy objects.
Now comes the differences: by setting init_printing(order="none") printers behave differently. I believe this is some bug.
For example, I usually use Latex rendering when using Jupyter Notebook:
from sympy import MatAdd, Matrix, init_printing
init_printing(order="none")
A = Matrix([1])
B = Matrix([0])
add = MatAdd(A, B, evaluate=False)
print(add)
# out: Matrix([[0]]) + Matrix([[1]])
display(add)
# out: [1] + [0]
Here you can see that the latex printer is displaying the elements as they are stored (check add.args), whereas the string printer is not following that convention...

I compare two identical sentences with BLEU NLTK and don't get 1.0. Why?

I’m trying to use the BLEU score from NLTK for quality evaluation of the machine translation. I wanted to check this code with two identical sentences, here I’m using method1 as a Smoothing function because I’m comparing two sentences and not corpora. I set 4-grams and weights 0.25 (1/4). But as a result, I’m getting 0.0088308. What am I doing wrong? Two identical sentences should get a score 1.0. I'm coding on Python 3, Windows 7, in PyCharm.
My code:
import nltk
from nltk import word_tokenize
from nltk.translate.bleu_score import SmoothingFunction
ref = 'You know that it would be untrue You know that I would be a liar If I was to say to you Girl, we couldnt get much higher.'
cand = 'You know that it would be untrue You know that I would be a liar If I was to say to you Girl, we couldnt get much higher.'
smoothie = SmoothingFunction().method1
reference = word_tokenize(ref)
candidate = word_tokenize(cand)
weights = (0.25, 0.25, 0.25, 0.25)
BLEUscore = nltk.translate.bleu_score.sentence_bleu(reference, candidate, weights, smoothing_function=smoothie)
print(BLEUscore)
My result:
0.008830895300928163
Process finished with exit code 0
BLEU allows to compare set of references with a candidate, so if you want to use it you should set the list of lists of sentences as a list of references. In other words, even if you take only one reference it should be a list of lists (in my example reference should be [reference]:
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], candidate, weights, smoothing_function=smoothie)
When I put reference in [] I've got 1.0.

How to efficiently store and manipulate sparse binary matrices in Octave?

I'm trying to manipulate sparse binary matrices in GNU Octave, and it's using way more memory than I expect, and relevant sparse-matrix functions don't behave the way I want them to. I see this question about higher-than-expected sparse-matrix storage in MATLAB, which suggests that this matrix should consume even more memory, but helped explain (only) part of this situation.
For a sparse, binary matrix, I can't figure out any way to get Octave to NOT STORE the array of values (they're always implicitly 1, so need not be stored). Can this be done? Octave always seems to consume memory for a values array.
A trimmed-down example demonstrating the situation: create random sparse matrix, turn it into "binary":
mys=spones(sprandn(1024,1024,.03)); nnz(mys), whos mys
Shows the situation. The consumed size is consistent with the storage mechanism outlined in aforementioned SO answer and expanded below, if spones() creates an array of storage-class double and if all indices are 32-bit (i.e., TotalStorageSize - rowIndices - columnIndices == NumNonZero*sizeof(double) -- unnecessarily storing these values (all 1s as doubles) is over half of the total memory consumed by this 3%-sparse object.
After messing with this (for too long) while composing this question, I discovered some partial workarounds, so I'm going to "self-answer" (only) part of the question for continuity (hopefully), but I didn't figure out an adequate answer to main question:
How do I create an efficiently-stored ("no-/implicit-values") binary matrix in Octave?
Additional background on storage format follows...
The Octave docs say the storage format for sparse matrices uses format Compressed Sparse Column (CSC). This seems to imply storing the following arrays (expanding on aforementioned SO answer, with canonical Yale format labels and tweaks for column-major order):
values (A), number-of-nonzeros (NNZ) entries of storage-class size;
row numbers (IA), NNZ entries of index size (hopefully int64 but maybe int32);
start of each column (JA), number-of-columns-plus-1 entries of index size)
In this case, for binary-only storage, I hope there's a way to completely avoid storing array (A), but I can't figure it out.
Full disclosure: As noted above, as I was composing this question, I discovered a workaround to reduce memory usage, so I'm "self-answering" part of this here, but it still isn't fully satisfying, so I'm still listening for a better actual answer to storage of a sparse binary matrix without a trivial, bloated, unnecessary values array...
To get a binary-like value out of a number-like value and reduce the memory usage in this case, use "logical" storage, created by logical(X). For example, building from above,
logicalmys = logical(mys);
creates a sparse bool matrix, that takes up less memory (1-byte logical rather than 8-byte double for the values array).
Adding more information to the whos information using whos_line_format helps illuminate the situation: The default string includes 5 of the 7 properties (see docs for more). I'm using the format string
whos_line_format(" %a:4; %ln:6; %cs:16:6:1; %rb:12; %lc:8; %e:10; %t:20;\n")
to add display of "elements", and "type" (which is distinct from "class").
With that, whos mys logicalmys shows something like
Attr Name Size Bytes Class Elements Type
==== ==== ==== ===== ===== ======== ====
mys 1024x1024 391100 double 32250 sparse matrix
logicalmys 1024x1024 165350 logical 32250 sparse bool matrix
So this shows a distinction between sparse matrix and sparse bool matrix. However, the total memory consumed by logicalmys is consistent with actually storing an array of NNZ booleans (1-byte) -- That is:
totalMemory minus rowIndices minus columnOffsets leaves NNZ bytes left;
in numbers,
165350 - 32250*4 - 1025*4 == 32250.
So we're still storing 32250 elements, all of which are 1. Further, if you set one of the 1-elements to zero, it reduces the reported storage! For a good time, try: pick a nonzero element, e.g., (42,1), then zero it: logicalmys(42,1) = 0; then whos it!
My hope is that this is correct, and that this clarifies some things for those who might be interested. Comments, corrections, or actual answers welcome!

How to fuzzily search for a dictionary word?

I have read a lot of threads here discussing edit-distance based fuzzy-searches, which tools like Elasticsearch/Lucene provide out of the box, but my problem is a bit different. Suppose I have a dictionary of words, {'cat', 'cot', 'catalyst'}, and a character similarity relation f(x, y)
f(x, y) = 1, if characters x and y are similar
= 0, otherwise
(These "similarities" can be specified by the programmer)
such that, say,
f('t', 'l') = 1
f('a', 'o') = 1
f('f', 't') = 1
but,
f('a', 'z') = 0
etc.
Now if we have a query 'cofatyst', the algorithm should report the following matches:
('cot', 0)
('cat', 0)
('catalyst', 0)
where the number is the 0-based starting index of the match found. I have tried the Aho-Corasick algorithm, and while it works great for exact matching and in the case when a character has relatively less number of "similar" characters, its performance drops exponentially as we increase the number of similar characters for a character. Can anyone point me to a better way of doing this? Fuzziness is an absolute necessity, and it must take in to account character similarities(i.e., not blindly depend on just edit-distances).
One thing to note is that in the wild, the dictionary is going to be really large.
I might try to use the cosine similarity using the position of each character as a feature and mapping the product between features using a match function based on your character relations.
Not a very specific advise, I know, but I hope it helps you.
edited: Expanded answer.
With the cosine similarity, you will compute how similar two vectors are. In your case the normalisation might not make sense. So, what I would do is something very simple (I might be oversimplifying the problem): First, see the matrix of CxC as a dependency matrix with the probability that two characters are related (e.g., P('t' | 'l') = 1). This will also allow you to have partial dependencies to differentiate between perfect and partial matches. After this I will compute, for each position the probability that the letter from each word is not the same (using the complement of P(t_i, t_j)) and then you can just aggregate the results using a sum.
It will count the number of terms that are different for a specific pair of words, and it allows you to define partial dependencies. Furthermore, the implementation is very simple and should scale well. This is why I am not sure if I misunderstood your question.
I am using Fuse JavaScript Library for a project of mine. It is a javascript file which works on JSON dataset. It is quite fast. Have a look at it.
It has implemented a full Bitap algorithm, leveraging a modified version of the Diff, Match & Patch tool by Google(from his site).
The code is simple to understand the algorithm implementation done.

Details of the "New Yale" sparse matrix format?

There's some Netlib code written in Fortran which performs transposes and multiplication on sparse matrices. The library works with Bank-Smith (sort of), "old Yale", and "new Yale" formats.
Unfortunately, I haven't been able to find much detail on "new Yale." I implemented what I think matches the description given in the paper, and I can get and set entries appropriately.
But the results are not correct, leading me to wonder if I've implemented something which matches the description in the paper but is not what the Fortran code expects.
So a couple of questions:
Should row lengths include diagonal entries? e.g., if you have M=[1,1;0,1], it seems that it should look like this:
IJA = [3,4,4,1]
A = [1,1,X,1] // where X=NULL
It seems that if diagonal entries are included in row lengths, you'd get something like this:
IJA = [3,5,6,1]
A = [1,1,X,1]
That doesn't make much sense because IJA[2]=6 should be the size of the IJA/A arrays, but it is what the paper seems to say.
Should the matrices use 1-based indexing?
It is Fortran code after all. Perhaps instead my IJA and A should look like this:
IJA = [4,5,5,2]
A = [1,1,X,1] // still X=NULL
Is there anything else I'm missing?
Yes, that's vague, but I throw that out there in case someone who has messed with this code before would like to volunteer any additional information. Anyone else can feel free to ignore this last question.
I know these questions may seem rather trivial, but I thought perhaps some Fortran folks could provide me with some insight. I'm not used to thinking in a one-based system, and though I've converted the code to C using f2c, it's still written like Fortran.
I can't see how you deduced those vectors from that paper. First the Old Yale format:
M = [7,16;0,-12]
Then, A contains all non-zero values of M in row-form:
A = [7,16,-12]
and IA stores the position in A of the first elements of each row, and JA stores the column indices of all the values in A:
IA = [1,3,4]
JA = [1,2,2]
New format: A has diagonal values first, a zero and then the remaining non-zero elements (I have put | to clarify the seperation between diagonal and non-diagonal) :
A = [7,-12,0 | 16]
IA and JA are combined in IJA, but as far as I can tell from the paper you need to take into account the new ordering of A (I have put | to clarify the seperation between IA and JA):
IJA = [1,2,3 | 2]
So, applied to your case M = [1,1;0,1], I get
A = [1,1,0 | 1]
IJA = [1,2,3 | 2]
first element of the first row is the first in A and the first element of the second row is the second in A, then I put 3 since they say the length of a row is determined by IA(I)-IA(I+1), so I make sure the difference is 1. Then the column indices of the non-zero non-diagonal elements follow, and that is 2.
So, first of all, the reference given in the SMMP paper is possibly not the correct one. I checked it out (the ref) from the library last night. It appears to give the "old Yale" format. It does mention, on pp. 49-50, that the diagonal can be separated out from the rest of the matrix -- but doesn't so much as mention an IJA vector.
I was able to find the format described in the 1992 edition of Numerical Recipes in C on pp. 78-79.
Of course, there is no guarantee that this is the format accepted by the SMMP library from Netlib.
NR seems to have IA giving positions relative to IJA, not relative to JA. The last position in the IA portion gives not the size of the IJA and A vectors, but size-1, because the vectors are indexed starting at 1 (per Fortran standard).
Row lengths do not include non-zero diagonal entries.

Resources