Performing an SVD on tweets. Memory problem - matrix

EDIT: The size of the wordlist is 10-20 times bigger than I wrote down. I simply forgot a zero.
EDIT2: I will have a look into SVDLIBC and also see how to reduce the matrix to its dense version, so that might help too.
I have generated a huge csv file as an output from my pos tagging and stemming. It looks like this:
word1, word2, word3, ..., word150000
person1 1 2 0 1
person2 0 0 1 0
...
person650
It contains the word counts for each person. Like this I am getting characteristic vectors for each person.
I want to run an SVD on this beast, but it seems the matrix is too big to be held in memory to perform the operation. My question is:
Should I reduce the column count by removing words which have a column sum of, for example, 1, meaning they have been used only once? Do I bias the data too much with this approach?
I tried the RapidMiner approach of loading the CSV into a database and then sequentially reading it in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table. If I transpose the data and then re-transpose it on import, it also takes ages...
--> So in general I am asking for advice on how to perform an SVD on such a corpus.

This is a big matrix when stored densely. However, it is only a small sparse matrix.
Using a sparse-matrix SVD algorithm is enough, e.g. here.
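A minimal sketch of that route using SciPy's truncated sparse SVD (the random matrix below is only a stand-in for the real person-by-word counts; in practice you would fill a CSR matrix from the non-zero entries of the CSV):
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Stand-in for the 650 x 150,000 person-by-word count matrix; in practice,
# build a CSR matrix holding only the non-zero counts from the CSV.
counts = sparse_random(650, 150000, density=0.001, format='csr', random_state=0)

# Truncated SVD: compute only the top-k singular triplets, so the dense
# matrix is never materialized in memory.
U, s, Vt = svds(counts, k=100)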

SVD is constrained by your memory size. See:
Folding In: a paper on partial matrix updates.
Apache Mahout: a distributed data-mining library that runs on Hadoop and includes a parallel SVD.

Related

Networkx: How to create the incidence matrix from a huge graph

I am reading a very big graph, specifically wiki-topcats (http://snap.stanford.edu/data/wiki-topcats.html). While I can read and create the graph in acceptable time with:
graph = nx.read_edgelist("C:/.../wiki-topcats.txt", nodetype=int)
I then need to extract the incidence matrix in order to create the line graph (that's my goal), but when I run:
C = nx.incidence_matrix(graph)
I get memory errors, and I cannot find any efficient way of dealing with this, or a way to work around it by creating the incidence matrix from the adjacency matrix. I must work with scipy matrices due to the magnitude of the graph; e.g. this adjacency matrix works:
A = nx.to_scipy_sparse_matrix(graph)
Any help would be greatly appreciated.
NetworkX is clearly not designed to handle big graphs. It is not very performant either (in both memory consumption and speed). Graph databases are meant to solve this problem by storing graphs on a storage device; Neo4j is an example of such a database usable from Python. That being said, the graph you want to process is not so big, and you could try alternative Python libraries that are more efficient, like graph-tool.
Assuming you really want/need to use NetworkX, be aware that the graph takes about 12 GiB of RAM, and many NetworkX functions will return a full graph, resulting in several dozen GiB of RAM allocated, which is a complete waste of resources. Not to mention that most machines do not have that much memory. The graph file takes only about 400 MiB, and compact representations should be able to store it in RAM in less than 100 MiB. Thus, using 12 GiB for it requires >120 times more resources than expected. Not to mention that NetworkX takes a very long time to fill such a memory space.
Loading the file directly into a sparse matrix is much more efficient, assuming the sparse matrix type is carefully chosen. Indeed, a CSR matrix can store the adjacency matrix pretty efficiently, while a DOK matrix is very inefficient (it takes 6 GiB of RAM and is as slow as NetworkX).
Note that np.genfromtxt (which is intended to load this kind of file) goes crazy, using 45-50 GiB of RAM on my machine. This is insane, as it is actually ~200 times bigger than what is strictly needed! Fortunately, df = pd.read_table('wiki-topcats.txt', dtype=np.int32, header=None, sep=' ') from Pandas is able to load the file using less than 400 MiB. Based on that, you can easily convert the dataframe to Numpy with edges = df.to_numpy() and then build a relatively fast CSR matrix.
Note that you can build the (sparse CSR) incidence matrix directly with Numpy. One solution is to use idx = np.arange(len(df), dtype=np.int32) so as to assign an ID to each edge. Then, you directly know all the positions of the 1s in the sparse matrix (m[edges[:,0], idx] and m[edges[:,1], idx]), which should be enough to build it (be careful about possible duplicates that could be due to a node connected to itself). A rough sketch is shown below.
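A hedged sketch of the whole approach (the file path and the self-loop handling are assumptions on my part, not taken from the question):
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Load the edge list with Pandas (far lighter than np.genfromtxt here).
df = pd.read_table('wiki-topcats.txt', dtype=np.int32, header=None, sep=' ')
edges = df.to_numpy()

num_nodes = edges.max() + 1
num_edges = len(edges)

# One column per edge; each column gets a 1 in the rows of its two endpoints.
idx = np.arange(num_edges, dtype=np.int32)
rows = np.concatenate([edges[:, 0], edges[:, 1]])
cols = np.concatenate([idx, idx])
data = np.ones(2 * num_edges, dtype=np.int8)

incidence = csr_matrix((data, (rows, cols)), shape=(num_nodes, num_edges))
# Self-loops produce duplicate (row, col) pairs that get summed; cap them at 1.
incidence.data = np.minimum(incidence.data, 1)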

Random sampling in pyspark with replacement

I have a dataframe df with 9000 unique ids, like:
| id |
|----|
| 1  |
| 2  |
I want to generate a random sample, with replacement, from these 9000 ids 100000 times.
How do I do it in pyspark?
I tried
df.sample(True, 0.5, 100)
But I do not know how to get exactly 100000 rows.
Okay, so first things first. You will probably not be able to get exactly 100,000 in your (over)sample. The reason is that, in order to sample efficiently, Spark uses something called Bernoulli sampling. Basically this means it goes through your RDD and assigns each row a probability of being included. So if you want a 10% sample, each row individually has a 10% chance of being included. This doesn't guarantee the result adds up exactly to the number you want, but it tends to be pretty close for large datasets.
The code would look like this: df.sample(True, 11.11111, 100). This will take a sample of the dataset equal to 11.11111 times the size of the original dataset. Since 11.11111*9,000 ~= 100,000, you will get approximately 100,000 rows.
If you want an exact sample, you have to use df.takeSample(True, 100000). However, the result is not a distributed dataset: this code returns an Array (a very large one). If it can be created in main memory, then do that. Because you require the exact number of IDs, I don't know of a way to do this in a distributed fashion.
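A hedged sketch of both options, assuming a DataFrame df with a single id column as in the question (the seed is arbitrary; takeSample lives on the RDD API, hence df.rdd):
# Approximate oversample: roughly 100,000 rows (fraction = 100000 / 9000).
approx = df.sample(withReplacement=True, fraction=100000 / 9000, seed=100)

# Exact count, but the result is a local list of Rows on the driver,
# not a distributed dataset.
exact_rows = df.rdd.takeSample(True, 100000, seed=100)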

What is the fastest way to compute the F-score for a million annotations?

Imagine you want to predict certain "events" (coded as: 0,1,2,3,...,N) within a finite number of sentences (coded as: 0,1,2,...,S) of a series of papers (coded as 0,1,...,P).
Your machine learning algorithm returns the following file:
paper,position,event
0,0,22
0,12,38
0,15,18
0,23,3
1,1064,25
1,1232,36
...
and you want to compute the F-score based on a similar ground truth data file:
paper,true_position,true_event
0,0,22
0,12,38
0,15,18
0,23,3
1,1064,25
1,1232,36
...
Since you have many papers and millions of those files, what is the fastest way to compute the F-score for each paper?
PS: Notice that nothing guarantees that the two files will have the same number of positions; the ML algorithm might mistakenly identify positions that are not in the ground truth.
As long as the entries in the two files are aligned so that you can directly compare them line by line, I don't see why it would be slow to process millions of rows in O(n) time, even on your laptop.
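For example, a hedged sketch in pandas (not from the answer above; the file names are assumptions) that counts exact (paper, position, event) matches as true positives and derives a per-paper F-score:
import pandas as pd

pred = pd.read_csv('predictions.csv')     # columns: paper, position, event
truth = pd.read_csv('ground_truth.csv')   # columns: paper, true_position, true_event
truth = truth.rename(columns={'true_position': 'position', 'true_event': 'event'})

# True positives: rows present in both files for the same paper.
tp = pred.merge(truth, on=['paper', 'position', 'event']).groupby('paper').size()
n_pred = pred.groupby('paper').size()
n_true = truth.groupby('paper').size()

precision = (tp / n_pred).fillna(0.0)
recall = (tp / n_true).fillna(0.0)
f1 = (2 * precision * recall / (precision + recall)).fillna(0.0)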

Why do Perlin noise algorithms use lookup tables for random numbers

I have been researching noise algorithms for a library I wish to build, and have started with Perlin noise (more accurately Simplex noise; I want to work with arbitrary dimensions, or at least up to 6). Reading Simplex noise demystified helped, but looking through the implementations at the end, I saw a big lookup table named perm.
In the code example, it seems to be used to generate indexes into a set of gradients, but the method seems odd. I assume that the table is just there to provide 1) determinism, and 2) a speed boost.
My question is: does the perm lookup table have any auxiliary meaning or purpose, or is it there for the reasons above? Put another way, is there a specific reason that a pseudo-random number generator is not used, other than performance?
It is a byte array, with values in the range 0 to 255.
You may randomize it if you want; you will probably want to seed the random generator, etc.
The perm table (and the grad table) are used for optimization. They are just lookup tables of precomputed values. You are correct on both points 1) and 2).
Other than performance and portability, there is no reason you couldn't use a PRNG.
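A hedged sketch of the usual trick (not taken from any specific implementation in the linked paper): folding integer lattice coordinates through the permutation table to pick a gradient deterministically, without calling a PRNG per lookup:
import random

# A shuffled permutation of 0..255; seeding it makes the noise reproducible.
perm = list(range(256))
random.Random(42).shuffle(perm)

gradients_2d = [(1, 1), (-1, 1), (1, -1), (-1, -1),
                (1, 0), (-1, 0), (0, 1), (0, -1)]

def gradient_at(ix, iy):
    # Hash the lattice coordinates through the table to get a pseudo-random
    # but repeatable index into the gradient set.
    h = perm[(perm[ix & 255] + iy) & 255]
    return gradients_2d[h % len(gradients_2d)]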

Comparing two large datasets using a MapReduce programming model

Let's say I have two fairly large data sets - the first is called "Base" and contains 200 million tab-delimited rows, and the second is called "MatchSet", which has 10 million tab-delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2) and Match() essentially contains some heuristics for looking at row1 (from MatchSet) and comparing it to row2 (from Base) and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex, i.e. not a simple string match, involving some proprietary methods. Let's say for now that Match(row1, row2) is written in pseudo-code, so implementation in another language is not a problem (though it's in C++ today).
In a linear model, aka program running on one giant processor - we would read each line from MatchSet and each line from Base and compare one to the other using Match() and write out our match stats. For example we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non values to separate files for inspection. Aka, a nested loop of sorts:
for each row1 in MatchSet
{
    for each row2 in Base
    {
        var type = Match(row1, row2);
        switch (type)
        {
            // do something based on type
        }
    }
}
I've started considering Hadoop Streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from hadoop, crunch the data using a mapping function and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I've come to a solution is that I would basically still have to do a 10 million record comparison in parallel across the 200 million records, so (200 million / n nodes) * 10 million iterations per node. Is that the most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than for me to ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (the CDDB1 calculation). Sometimes a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times), but by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results. A toy sketch of this idea follows.
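A toy sketch of the "derive a key, then join" idea (the key function is purely illustrative; a real one would be designed to survive the same variations your Match() heuristics tolerate):
import hashlib
from collections import defaultdict

def record_key(row):
    # Illustrative fingerprint: lowercase, collapse whitespace, then hash.
    normalized = ' '.join(row.lower().split())
    return hashlib.md5(normalized.encode()).hexdigest()

base = ['Foo Bar 1', 'baz qux 2']           # stand-in Base rows
matchset = ['foo bar 1', 'something else']  # stand-in MatchSet rows

# Bucket Base by key, then run the expensive Match() only within a bucket.
buckets = defaultdict(list)
for row in base:
    buckets[record_key(row)].append(row)

for row in matchset:
    candidates = buckets.get(record_key(row), [])
    print(row, '->', candidates)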
Check Section 3.5 - Relational Joins in the paper 'Data-Intensive Text Processing with MapReduce'. I haven't gone into detail, but it might help you.
This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd likely apply to both the single-stream and parallel versions. Examples include blocking, so that you need to do fewer than 200M * 10M computations, or precomputing constant portions of the algorithm for the 10M MatchSet.
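A hedged Hadoop Streaming mapper sketch of that scheme (the file name and the match() placeholder are assumptions; the real Match() is the proprietary C++ logic): each mapper reads a split of Base from stdin, holds the smaller MatchSet in memory, and emits records keyed by match strength so strong/weak/no-match reducers can write separate output files.
import sys

def match(row1, row2):
    # Placeholder for the proprietary Match(row1, row2) heuristics.
    return 'STRONG' if row1 == row2 else 'NO_MATCH'

# Load the MatchSet once per mapper (e.g. shipped to each node with -files).
with open('matchset.tsv') as f:
    matchset = [line.rstrip('\n') for line in f]

for base_line in sys.stdin:
    base_row = base_line.rstrip('\n')
    for match_row in matchset:
        strength = match(match_row, base_row)
        # Keying on strength lets reducers collect strong/weak/no-match outputs.
        print(strength + '\t' + match_row)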
