Parallelizing with Spark - parallel-processing

I have a method A() that compares a pair of 3D protein structures (3D objects). I would like to repeat this method 10,000,000 times, for 10,000,000 pairs of proteins. Each protein description is in its own text file; they are separate.
How can I parallelize the repeated method using Spark?
Thanks for your help.
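A minimal sketch of one common way to do this, assuming the protein files are reachable from every worker (HDFS, S3, or a shared filesystem) and that A() plus a hypothetical load_protein() parser are plain Python functions: parallelize the list of file-path pairs as an RDD and run the comparison inside each task.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("protein-pair-comparison").getOrCreate()
    sc = spark.sparkContext

    def compare_pair(pair):
        # load_protein() is an assumed helper that parses one protein text file;
        # A() is your existing comparison method.
        path_a, path_b = pair
        return (path_a, path_b, A(load_protein(path_a), load_protein(path_b)))

    # 'pairs' is the list of the 10,000,000 (file_a, file_b) path tuples,
    # e.g. read from a pair-list file. Tune numSlices to your cluster size.
    pairs_rdd = sc.parallelize(pairs, numSlices=2000)
    results = pairs_rdd.map(compare_pair)
    results.saveAsTextFile("hdfs:///output/protein_comparisons")

If the pair list itself is too large to build on the driver, it can instead be stored as a text file of path pairs and read with sc.textFile before mapping.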

Related

Efficiently building a dictionary from many DFs (>20gb in total)

I have roughly 25gb stored across thousands of .parquet files. They look like this:
ID | Value
Each ID refers to a unique entity; however, there can be multiple entries for one ID (with the same or different values).
I would like to:
1. Read in the .parquet files efficiently (parallel processing) and convert each to a pandas data frame.
2. Remove duplicate ID-Value pairs in each data frame (I am only interested in unique pairs) to downsize as soon as possible.
3. Build a common representation over ALL .parquet files/data frames that is a dictionary of ID:[Values].
4. Merge the dictionary by Values with another dictionary where the values are keys.
Essentially, I have everything up to and including step 3 implemented. However, building the dictionary currently takes too long because I iterate over each data frame:
The parameter 'df' is the result of step 1, i.e. the converted .parquet file. The function is part of a class (hence the self) and is called inside another function, which is executed using multiprocessing with 8 cores. The dictionary itself is a member variable (shared object) of the class and is built iteratively.
def build_id_map(self, df):
    df = df.drop_duplicates()

    def check_existence(row):
        if row['id'] not in self.id_dict:
            self.id_dict[row['id']] = [row['value']]
        else:
            if row['value'] not in self.id_dict[row['id']]:
                self.id_dict[row['id']].append(row['value'])

    df.apply(check_existence, axis=1)
I am looking for a considerably more efficient way of building this dictionary, as it currently takes roughly 15 seconds per file, which is too much for the large number of files.
Furthermore, I am happy to hear ideas on how to realize point 4 efficiently.
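One possible way to speed this up (a sketch, assuming the columns are literally named 'id' and 'value') is to avoid the per-row apply entirely: let pandas drop duplicates and group in vectorized form, build one small dictionary per file in each worker process, and only merge those dictionaries in the parent. This also avoids writing into a shared member variable from 8 processes.

    import pandas as pd

    def build_id_map(df):
        # One vectorized pass instead of a Python-level loop over rows.
        unique_pairs = df[['id', 'value']].drop_duplicates()
        return unique_pairs.groupby('id')['value'].apply(set).to_dict()

    def merge_maps(global_map, local_map):
        # Merge a per-file dictionary into the global one; sets keep values
        # unique and make membership checks O(1) instead of scanning a list.
        for key, values in local_map.items():
            global_map.setdefault(key, set()).update(values)
        return global_map

    # Hypothetical usage with a multiprocessing pool:
    # id_dict = {}
    # for local in pool.imap_unordered(build_id_map, dataframes):
    #     merge_maps(id_dict, local)
    # id_dict = {k: list(v) for k, v in id_dict.items()}   # back to ID:[Values]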

Random sampling in pyspark with replacement

I have a dataframe df with 9000 unique ids, like:
| id |
| 1  |
| 2  |
I want to generate a random sample with replacement from these 9000 ids, 100,000 times.
How do I do it in PySpark?
I tried
df.sample(True, 0.5, 100)
But I do not know how to get to exactly 100,000 rows.
Okay, so first things first: you will probably not be able to get exactly 100,000 rows in your (over)sample. The reason is that, in order to sample efficiently, Spark uses per-row probabilistic sampling (Bernoulli sampling without replacement, Poisson sampling with replacement). Basically, it goes through your RDD and decides for each row independently whether, and how often, to include it. So if you want a 10% sample, each row individually has a 10% chance of being included; Spark does not check that the result adds up exactly to the number you want, but for large datasets it tends to be pretty close.
The code would look like this: df.sample(True, 11.11111, 100). This takes a sample whose expected size is 11.11111 times the size of the original dataset. Since 11.11111 * 9,000 ≈ 100,000, you will get approximately 100,000 rows.
If you want an exact sample, you have to use df.rdd.takeSample(True, 100000) (takeSample is an RDD method, not a DataFrame method). However, the result is not a distributed dataset: the call returns a (very large) local list on the driver. If it fits in main memory, then do that. But because you require the exact right number of ids, I don't know of a way to do it in a distributed fashion.
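Putting the two options side by side, a minimal sketch (assuming spark is the active SparkSession and df is the dataframe of ids; the exact version collects to the driver):

    # Approximate but distributed: the row count is only close to 100,000.
    approx = df.sample(True, 100000 / 9000.0, 100)

    # Exact but local: returns a list of 100,000 Row objects on the driver.
    exact_rows = df.rdd.takeSample(True, 100000, seed=100)
    exact_df = spark.createDataFrame(exact_rows, schema=df.schema)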

How to split a dataset into two random halves in Weka?

I want to split my dataset into two random halves in Weka.
How can I do it?
I had the same question, and the answer is quite simple. First, randomly shuffle the order of instances with the Weka Randomize filter (unsupervised -> instance), and then split the dataset into two parts. You can find a complete explanation at the link below:
http://cs-people.bu.edu/yingy/intro_to_weka.pdf
You can also do it with filters: first use the Randomize filter to shuffle the dataset, then apply the RemovePercentage filter with 30% and save the result as your test set. Reload the randomized data, apply the same filter again but check the invert box, and you get the other 70%, which you save as the training set.
That way you end up with randomized, properly split training and test sets.
I have an idea, but it does not use the native Weka API. How about using a random number generator? Math.random() generates numbers between 0 and 1.
Suppose that we want to split the dataset into set1 and set2.
for every instance in dataset {
    if Math.random() < 0.5
        put the instance into set1
    else
        put the instance into set2
}
I think this method will generate a similar number of instances for the two subsets. If you want exactly equal quantities, you may add additional conditions to the if-else.
Hope this offers you some inspiration.
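If exactly equal halves matter, a simple alternative to the coin-flip loop is to shuffle an index list and cut it in the middle. A language-agnostic sketch (shown in Python rather than through the Weka API):

    import random

    def split_in_half(instances, seed=42):
        # Shuffle a copy of the instances and split it into two equal halves.
        shuffled = list(instances)
        random.Random(seed).shuffle(shuffled)
        middle = len(shuffled) // 2
        return shuffled[:middle], shuffled[middle:]

    set1, set2 = split_in_half(range(100))
    print(len(set1), len(set2))   # 50 50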

matrix-vector-multiplication with hadoop: vector and matrix in different files

I want to do matrix-vector multiplication with Hadoop. I've got a small working example now: there is only one input file, containing the rows of the matrix, each followed by the vector it is multiplied with. So each map task gets one row plus the vector from this single file.
Now I would like to have two input files: one containing the matrix and the other the vector. But I can't think of a Hadoop way to let the mapper access both files.
What would be the best approach here?
Thanks for your help!
The easiest and most efficient solution is to read the vector into memory in the Mapper directly from HDFS (not as map() input). Presumably it is not so huge that it can't fit in memory. Then, map() only the matrix data by row. As you receive each row, dot it with the vector to produce one element of the output. Emit (index,value) and then construct the vector in the Reducer (if needed).
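A hedged sketch of that idea using Hadoop Streaming with a Python mapper (the file names and input format are assumptions; here the vector file is shipped to every task with -files, i.e. the distributed cache, rather than read from HDFS by hand, which gives the same "vector in memory, matrix rows as map input" layout):

    #!/usr/bin/env python
    # mapper.py -- emits one (row_index, dot_product) pair per matrix row.
    import sys

    # Load the vector once per mapper. 'vector.txt' is shipped with the job, e.g.:
    #   hadoop jar hadoop-streaming.jar -files vector.txt \
    #       -input matrix.txt -output out -mapper mapper.py
    with open("vector.txt") as f:
        vector = [float(x) for x in f.read().split()]

    for line in sys.stdin:
        parts = line.split()
        if not parts:
            continue
        # Assumed input format: row index followed by the row's entries.
        row_index, row = parts[0], [float(x) for x in parts[1:]]
        print("%s\t%s" % (row_index, sum(a * b for a, b in zip(row, vector))))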

Performing an SVD on tweets: memory problem

EDIT: The size of the wordlist is 10-20 times bigger than I wrote down; I simply forgot a zero.
EDIT2: I will have a look into SVDLIBC and also see how to reduce a matrix to its dense version so that might help too.
I have generated a huge CSV file as output from my POS tagging and stemming. It looks like this:
           word1  word2  word3  ...  word150000
person1        1      2      0  ...           1
person2        0      0      1  ...           0
...
person650
It contains the word counts for each person, giving me a characteristic vector for each person.
I want to run an SVD on this beast, but it seems the matrix is too big to be held in memory for the operation. My question is:
Should I reduce the number of columns by removing words whose column sum is, for example, 1, which means they have been used only once? Would I bias the data too much with this approach?
I tried the RapidMiner approach of loading the CSV into a database and then sequentially reading it in batches for processing, as RapidMiner proposes. But MySQL can't store that many columns in a table, and if I transpose the data and then re-transpose it on import, it also takes ages...
So, in general, I am asking for advice on how to perform an SVD on such a corpus.
Stored densely, this is a big matrix; stored sparsely, however, it is only a small matrix.
Using a sparse-matrix SVD algorithm is enough, e.g. here.
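As a concrete example of the sparse route (a sketch using SciPy's truncated SVD rather than SVDLIBC; a 650 x 150,000 count matrix easily fits in memory once stored sparsely, because only the non-zero counts are kept):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    # Build the person-by-word count matrix from (row, col, count) triples
    # collected while parsing the CSV (toy values shown here).
    rows = np.array([0, 0, 1])
    cols = np.array([0, 1, 2])
    counts = np.array([1.0, 2.0, 1.0])
    X = csr_matrix((counts, (rows, cols)), shape=(650, 150000))

    # Truncated SVD: only the k largest singular values/vectors are computed.
    U, s, Vt = svds(X, k=100)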
SVD is constrained by your memory size. See:
Folding In: a paper on partial matrix updates.
Apache Mahout: a distributed data mining library that runs on Hadoop and includes a parallel SVD.
