Avoid creation of Dense Matrix on a large dataset in Python


You might think that this question has already been answered, but I didn't find anything useful on the internet.
I am trying to find similar users in a movies dataset, using Jaccard distance. Each user has a userId from 1 to 1000. For each user, we store the movies they have seen (movieId) and the rating they left. movieIds are integers from 1 to 100,000 and ratings are values from 1 to 10. If a rating is 0, we assume that the user didn't watch that particular movie.
So, our dense matrix would look like this:
          movie1 | movie2 | movie3 | movie4 | .... | movie100000
user1:         5 |      0 |      3 |      2 | .... |           0
user2:         0 |      0 |      1 |      4 | .... |           0
user3:         0 |      0 |      0 |      0 | .... |           1
.....       .... |   .... |   .... |   .... | .... |        ....
user1000:      0 |      2 |      0 |      0 | .... |           0
Notice that because each user has watched only a small fraction of the movies, the matrix will contain a lot of zeros. Also, this matrix is 1000 x 100000, i.e. 10^8 cells. This means that a normal laptop will need a lot of RAM and the algorithm will take a lot of time just to construct the dense matrix. That's why I am trying to avoid this approach. Performance matters a lot.
Is there any way to work around this?
Thanks in advance.
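One way to sidestep the dense matrix, sketched here under the assumption that only "watched / not watched" matters for the Jaccard distance: keep one set of movieIds per user and compare the sets directly. The `ratings` triples and the helper name `jaccard_distance` are made up for illustration (the values mirror the table in the question). The same data could also be stored in a sparse matrix type such as `scipy.sparse.csr_matrix`, which is not shown here.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of movieIds."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two users with no movies at all: treat as identical
    return 1.0 - len(a & b) / union

# Hypothetical (userId, movieId, rating) triples; only non-zero ratings are stored.
ratings = [
    (1, 1, 5), (1, 3, 3), (1, 4, 2),
    (2, 3, 1), (2, 4, 4),
    (3, 100000, 1),
    (1000, 2, 2),
]

# Build one set of watched movieIds per user instead of a 1000 x 100000 array.
watched = {}
for user_id, movie_id, rating in ratings:
    if rating > 0:
        watched.setdefault(user_id, set()).add(movie_id)

# Compare every pair of users; memory use grows with the number of ratings only.
distances = {
    (u, v): jaccard_distance(watched[u], watched[v])
    for u, v in combinations(sorted(watched), 2)
}
print(distances[(1, 2)])
```

Memory here is proportional to the number of stored ratings rather than to users times movies; the pairwise loop is still quadratic in the number of users, which is manageable for 1000 users.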

Related

How to convert layers of many-to-one relationships into input data without losing information?

I apologize if the question doesn't make much sense.
I'm trying to convert data into inputs for a classifier. I have multiple layers of many to one data which I need to "attach" to the top layer. I will try to explain with an example:
Let's say you have a household, each household contains at least one person (represented with categorical data male/female/other), each person has some amount of pets (represented with categorical data dog/cat/rat/etc..). Is it possible to represent this data into one row (for the household) without losing information?
One way I could think of doing it was the count of the amount of data for each category, so a household would have 2 males, 1 female, 2 dogs and 1 cat. Except this loses the information about how the data itself is structured, like if the female has all of the pets, that data doesn't tell you that.
The other way would be structuring each household into a database, so each row is a person containing m/f/o and the amount of each pet, then performing some dimensionality reduction technique to put it all on one row for the household but I'm not sure if this is feasible.
So yeah, any advice would be appreciated.

Have you tried using one-hot encoding to represent each one on a single line? Would this method work for you?
import pandas as pd
df = pd.DataFrame({'household': ["M", "M", "F", "M", "F", "F"],
                   'pets': ["Cat", "Dog", "Cat", "Hamster", "Dog", "Cat"],
                   'amount of pets': [2, 1, 1, 2, 2, 3]})
df = pd.get_dummies(df, columns=["pets"])
print(df)
  household  amount of pets  pets_Cat  pets_Dog  pets_Hamster
0         M               2         1         0             0
1         M               1         0         1             0
2         F               1         1         0             0
3         M               2         0         0             1
4         F               2         0         1             0
5         F               3         1         0             0
print(df.iloc[:1])
  household  amount of pets  pets_Cat  pets_Dog  pets_Hamster
0         M               2         1         0             0
In this way, it is possible to see both the animal owned by the household and its number in a single line. Is this the representation you want?

Algorithm for reading matrices

I need a scalable algorithm that processes an n x m matrix.
E.g. I have a time series of 3 seconds containing the values: 2,1,4.
I need to decompose it into a 3 x 4 matrix, where 3 is the number of elements of the time series and 4 is the maximum value. The resulting matrix would look like this:
1 1 1
1 0 1
0 0 1
0 0 1
Is this a bad solution or is it only considered a data entry problem?
The question is: do I need to distribute the information from each row of the matrix across various elements without losing the values?
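A minimal sketch of the decomposition shown above, assuming the series is non-empty and holds non-negative integers. The function names `decompose` and `recompose` are invented for this example. Each column is one time step, and a cell is 1 when the value at that step exceeds the row's level, so the column sums give back the original values.

```python
def decompose(series):
    """Unary ("thermometer") encoding: one column per sample, one row per level."""
    height = max(series)
    return [[1 if value > level else 0 for value in series]
            for level in range(height)]

def recompose(matrix):
    """Column sums recover the original series, so no value is lost."""
    return [sum(column) for column in zip(*matrix)]

matrix = decompose([2, 1, 4])
for row in matrix:
    print(*row)
```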

How to solve 5 * 5 Cube in efficient easy way

There is a 5*5 cube puzzle named the Happy Cube problem where, for a given set of mats, you need to build a cube.
http://www.mathematische-basteleien.de/cube_its.htm#top
For example, 6 blue mats are given:
From these mats, I need to build a cube:
This way it has 3 more solutions.
So like first cub
For this problem, the easiest approach I could think of is recursive: the cube has 6 positions, and for each position I try every remaining mat and check which ones fit, then recurse to fill the next position. In other words, enumerate the permutations and orientations of the mats and find the combination that fits, so something like a dynamic programming approach.
But I am making loads of mistakes in the recursion, so is there a better, easier approach I can use to solve this?
I made a matrix out of each mat in the diagrams provided, then rotated each one by 90 degrees clockwise 4 times, and anticlockwise as well. I flipped the array and did the same. For each of these orientations I have to repeat the step for the other mats, so again recursion.
0 0 1 0 1
1 1 1 1 1
0 1 1 1 0
1 1 1 1 1
0 1 0 1 1
-------------
0 1 0 1 0
1 1 1 1 0
0 1 1 1 1
1 1 1 1 0
1 1 0 1 1
-------------
1 1 0 1 1
0 1 1 1 1
1 1 1 1 0
0 1 1 1 1
0 1 0 1 0
-------------
1 0 1 0 0
1 1 1 1 1
0 1 1 1 0
1 1 1 1 1
1 1 0 1 0
-------------
1st - block is the Diagram
2nd - rotate clock wise
3rd - rotate anti clockwise
4th - flip
Still struggling to sort out the logic .
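The rotate and flip steps listed above can be sketched in a few lines. This is only the orientation bookkeeping, not a solver, and the helper names `rotate_cw`, `flip` and `orientations` are made up for the example. `FIRST` is the first block from the question.

```python
def rotate_cw(mat):
    """Rotate a square 0/1 matrix by 90 degrees clockwise."""
    return [list(row) for row in zip(*mat[::-1])]

def flip(mat):
    """Mirror the matrix left to right."""
    return [row[::-1] for row in mat]

def orientations(mat):
    """All distinct orientations: 4 rotations of the mat and 4 of its mirror image."""
    seen = []
    for start in (mat, flip(mat)):
        current = start
        for _ in range(4):
            if current not in seen:
                seen.append(current)
            current = rotate_cw(current)
    return seen

FIRST = [
    [0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 1],
]

print(len(orientations(FIRST)))
```

Rotating anticlockwise is the same as rotating clockwise three times, so four clockwise rotations of the mat and of its mirror image already cover every orientation.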

I can't believe this, but I actually wrote a set of scripts back in 2009 to brute-force solutions to this exact problem, for the simple cube case. I just put the code on Github: https://github.com/niklasb/3d-puzzle
Unfortunately the documentation is in German because that's the only language my team understood, but source code comments are in English. In particular, check out the file puzzle_lib.rb.
The approach is indeed just a straightforward backtracking algorithm, which I think is the way to go. I can't really say it's easy though; as far as I remember, the 3-d aspect is a bit challenging. I implemented one optimization: find all symmetries beforehand and only try each unique orientation of a piece. The idea is that the more characteristic the pieces are, the fewer options for placing pieces exist, so we can prune early. In the case of many symmetries, there might be lots of possibilities and we want to inspect only the ones that are unique up to symmetry.
Basically the algorithm works as follows: First, assign a fixed order to the sides of the cube, let's number them 0 to 5 for example. Then execute the following algorithm:
def check_slots():
    for each edge e:
        if slot adjacent to e are filled:
            if the 1-0 patterns of the piece edges (excluding the corners)
               have XOR != 0:
                return false
    if the corners are not "consistent":
        return false
    return true

def backtrack(slot_idx, pieces_left):
    if slot_idx == 6:
        # finished, we found a solution, output it or whatever
        return
    for each piece in pieces_left:
        for each orientation o of piece:
            fill slot slot_idx with piece in orientation o
            if check_slots():
                backtrack(slot_idx + 1, pieces_left \ {piece})
            empty slot slot_idx
The corner consistency is a bit tricky: Either the corner must be filled by exactly one of the adjacent pieces or it must be accessible from a yet unfilled slot, i.e. not cut off by the already assigned pieces.
Of course you can drop some or all of the consistency checks and only check at the end, seeing as there are only 8^6 * 6! possible configurations overall. If you have more than 6 pieces, it becomes more important to prune early.
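To put the 8^6 * 6! bound in concrete numbers, here is the arithmetic spelled out. It assumes 8 orientations per piece (4 rotations times 2 flips) and 6 pieces assigned to 6 slots, which is how the figure above reads.

```python
import math

orientations_per_piece = 8   # 4 rotations x 2 flips (assumed)
pieces = 6                   # one piece per face of the cube

# Every piece picks an orientation, and the pieces are permuted over the slots.
total = orientations_per_piece ** pieces * math.factorial(pieces)
print(total)  # 188743680
```

Roughly 190 million configurations is within reach of a plain brute-force check, which is why pruning only becomes essential with more pieces.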

Re-sort a vector after a small number of elements have been modified

If we have a vector of size N that was previously sorted, and replace up to M elements with arbitrary values (where M is much smaller than N), is there an easy way to re-sort them at lower cost (i.e. generate a sorting network of reduced depth) than a full sort?
For example if N=10 and M=2 the input might be
10 20 30 40 999 60 70 80 90 -1
Note: the indices of the modified elements are not known (until we compare them with the surrounding elements.)
Here is an example where I know the solution because the input size is small and I was able to find it with a brute-force search:
if N = 5 and M is 1, these would be valid inputs:
0 0 0 0 0    0 0 1 0 0    0 1 0 0 0    0 1 1 1 0    1 0 0 1 1    1 1 1 1 0
0 0 0 0 1    0 0 1 0 1    0 1 0 0 1    0 1 1 1 1    1 0 1 1 1    1 1 1 1 1
0 0 0 1 0    0 0 1 1 0    0 1 0 1 1    1 0 0 0 0    1 1 0 1 1
0 0 0 1 1    0 0 1 1 1    0 1 1 0 1    1 0 0 0 1    1 1 1 0 1
For example the input may be 0 1 1 0 1 if the previously sorted vector was 0 1 1 1 1 and the 4th element was modified, but there is no way to form 0 1 0 1 0 as a valid input, because it differs in at least 2 elements from any sorted vector.
This would be a valid sorting network for re-sorting these inputs:
>--*---*-----*-------->
   |   |     |
>--*---|-----|-*---*-->
       |     | |   |
>--*---|-*---*-|---*-->
   |   | |     |
>--*---*-|-----*---*-->
         |         |
>--------*---------*-->
We do not care that this network fails to sort some invalid inputs (e.g. 0 1 0 1 0.)
This network has depth 4, a saving of 1 compared with the general case (a depth of 5 is generally necessary to sort a 5-element vector).
Unfortunately the brute-force approach is not feasible for larger input sizes.
Is there a known method for constructing a network to re-sort a larger vector?
My N values will be on the order of a few hundred, with M not much more than √N.
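The N = 5, M = 1 example can be checked mechanically. The sketch below reads the comparators off the diagram (wires numbered 0 to 4 from top to bottom, smaller value moved to the upper wire), enumerates every 0/1 vector that differs from a sorted vector in at most M positions, and runs the network on each. The names `NETWORK`, `apply_network` and `valid_inputs` are made up for this illustration.

```python
from itertools import product

# The four layers of the diagram, as (upper wire, lower wire) pairs.
NETWORK = [
    [(0, 1), (2, 3)],
    [(0, 3), (2, 4)],
    [(0, 2), (1, 3)],
    [(1, 2), (3, 4)],
]

def apply_network(network, values):
    """Run the compare-exchange layers; the smaller value ends up on the upper wire."""
    v = list(values)
    for layer in network:
        for i, j in layer:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v

def valid_inputs(n, m):
    """0/1 vectors within Hamming distance m of some sorted 0/1 vector."""
    sorted_vectors = [tuple([0] * (n - k) + [1] * k) for k in range(n + 1)]
    return [
        v for v in product((0, 1), repeat=n)
        if any(sum(a != b for a, b in zip(v, s)) <= m for s in sorted_vectors)
    ]

inputs = valid_inputs(5, 1)
print(len(inputs))
print(all(apply_network(NETWORK, v) == sorted(v) for v in inputs))
```

The same check scales poorly as a search procedure, as the question says, but it is a cheap way to validate a candidate network produced by some other construction.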

Ok, I'm posting this as an answer since the comment restriction on length drives me nuts :)
You should try this out:
implement a simple sequential sort working on local memory (insertion sort or something similar). If you don't know how, I can help with that.
have only a single work-item perform the sorting on the chunk of N elements
calculate the maximum size of local memory per work-group (call clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE) and derive the maximum number of work-items per work-group, because with this approach your number of work-items will most likely be limited by the amount of local memory.
This will probably work rather well I suspect, because:
a simple sort may be perfectly fine, especially since the array is already sorted to a large degree
parallelizing for such a small number of items is not worth the trouble (using local memory however is!)
since you're processing billions of such small arrays, you will achieve a great occupancy even if only single work-items process such arrays
Let me know if you have problems with my ideas.
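For reference, the sequential logic of such an insertion sort looks like this. It is a plain Python sketch of the algorithm only, not an OpenCL kernel, and the move counter is added purely to show how little work a nearly sorted input needs. The input is the N=10, M=2 example from the question.

```python
def insertion_sort(values):
    """Return a sorted copy plus the number of element moves that were needed."""
    a = list(values)
    moves = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger elements one slot to the right until key fits.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
            moves += 1
        a[j + 1] = key
    return a, moves

result, moves = insertion_sort([10, 20, 30, 40, 999, 60, 70, 80, 90, -1])
print(result, moves)
```

The number of moves equals the number of inversions in the input, which stays small when only a few elements are out of place.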
EDIT 1:
I just realized I used a technique that may be confusing to others:
My proposal for using local memory is not for synchronization or using multiple work items for a single input vector/array. I simply use it to get a low read/write memory latency. Since we use rather large chunks of memory I fear that using private memory may cause swapping to slow global memory without us realizing it. This also means you have to allocate local memory for each work-item. Each work-item will access its own chunk of local memory and use it for sorting (exclusively).
I'm not sure how good this idea is, but I've read that using too much private memory may cause swapping to global memory and the only way to notice is by looking at the performance (not sure if I'm right about this).

Here is an algorithm which should yield very good sorting networks. Probably not the absolute best network for all input sizes, but hopefully good enough for practical purposes.
1. Store (or have available) pre-computed networks for n < 16.
2. Sort the largest 2^k elements with an optimal network, e.g. bitonic sort for the largest power of 2 less than or equal to n.
3. For the remaining elements, repeat #2 until m < 16, where m is the number of unsorted elements.
4. Use a known optimal network from #1 to sort any remaining elements.
5. Merge sort the smallest and second-smallest sub-lists using a merge sorting network.
6. Repeat #5 until only one sorted list remains.
All of these steps can be done artificially, i.e. ahead of time without touching the data, and the comparisons stored into a master network instead of acting on the data.
It is worth pointing out that the (bitonic) networks from #2 can be run in parallel, and the smaller ones will finish first. This is good, because as they finish, the networks from #5-6 can begin to execute.
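The splitting in steps 2 to 4 can be sketched as follows. It only computes the sizes of the sub-lists, not the networks themselves; the function name `chunk_sizes` and the `small` parameter (the threshold of 16 from step 1) are assumptions made for this example.

```python
def chunk_sizes(n, small=16):
    """Split n into decreasing powers of two, plus one leftover chunk below `small`."""
    sizes = []
    remaining = n
    while remaining >= small:
        power = 1 << (remaining.bit_length() - 1)  # largest power of 2 <= remaining
        sizes.append(power)
        remaining -= power
    if remaining:
        sizes.append(remaining)  # handled by a pre-computed small network
    return sizes

print(chunk_sizes(300))
```

For n = 300 the sub-lists would have sizes 256, 32 and 12; following steps 5 and 6, the 12 and 32 lists are merged first and the result is then merged with the 256 list.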

I'm looking for some pointers here as I don't quite know where to start researching this one.
I have a 2D matrix with 0 or 1 in each cell, such as:
  1 2 3 4
A 0 1 1 0
B 1 1 1 0
C 0 1 0 0
D 1 1 0 0
And I'd like to sort it so it is as "upper triangular" as possible, like so:
  4 3 1 2
B 0 1 1 1
A 0 1 0 1
D 0 0 1 1
C 0 0 0 1
The rows and columns must remain intact, i.e. elements can't be moved individually and can only be swapped "whole".
I understand that there'll probably be pathological cases where a matrix has multiple possible sorted results (i.e. same shape, but differ in the identity of the "original" rows/columns.)
So, can anyone suggest where I might find some starting points for this? An existing library/algorithm would be great, but I'll settle for knowing the name of the problem I'm trying to solve!
I doubt it's a linear algebra problem as such, and maybe there's some kind of image processing technique that's applicable.
Any other ideas aside, my initial guess is just to write a simple insertion sort on the rows, then the columns and iterate that until it stabilises (and hope that detecting the pathological cases isn't too hard.)
More details: Some more information on what I'm trying to do may help clarify. Each row represents a competitor, each column represents a challenge. Each 1 or 0 represents "success" for the competitor on a particular challenge.
By sorting the matrix so all 1s are in the top-right, I hope to then provide a ranking of the intrinsic difficulty of each challenge and a ranking of the competitors (which will take into account the difficulty of the challenges they succeeded at, not just the number of successes.)
Note on accepted answer: I've accepted Simulated Annealing as "the answer" with the caveat that this question doesn't have a right answer. It seems like a good approach, though I haven't actually managed to come up with a scoring function that works for my problem.

An algorithm based upon simulated annealing can handle this sort of thing without too much trouble. It is not great if you have small matrices, which most likely have a fixed solution, but great if your matrices get larger and the problem becomes more difficult.
(However, it also fails your desire that insertions can be done incrementally.)
Preliminaries
Devise a performance function that "scores" a matrix - matrices that are closer to your triangleness should get a better score than those that are less triangle-y.
Devise a set of operations that are allowed on the matrix. Your description was a little ambiguous, but if you can swap rows then one op would be SwapRows(a, b). Another could be SwapCols(a, b).
The Annealing loop
I won't give a full exposition here, but the idea is simple. You perform random transformations on the matrix using your operations. You measure how much "better" the matrix is after the operation (using the performance function before and after the operation). Then you decide whether to commit that transformation. You repeat this process a lot.
Deciding whether to commit the transform is the fun part. Toward the end of the annealing process, you only accept transformations that improve the score of the matrix. But earlier on, in a more chaotic time, you also allow transformations that don't improve the score. In the beginning, the algorithm is "hot" and anything goes. Eventually, the algorithm cools and only good transforms are allowed. If you linearly cool the algorithm, then the choice of whether to accept a transformation is:
public bool ShouldAccept(double cost, double temperature, Random random) {
    return Math.Exp(-cost / temperature) > random.NextDouble();
}
You should read the excellent information contained in Numerical Recipes for more information on this algorithm.
Long story short, you should learn some of these general purpose algorithms. Doing so will allow you to solve large classes of problems that are hard to solve analytically.
Scoring algorithm
This is probably the trickiest part. You will want to devise a scorer that guides the annealing process toward your goal. The scorer should be a continuous function that results in larger numbers as the matrix approaches the ideal solution.
How do you measure the "ideal solution" - triangleness? Here is a naive and easy scorer: For every point, you know whether it should be 1 or 0. Add +1 to the score if the matrix is right, -1 if it's wrong. Here's some code so I can be explicit (not tested! please review!)
int Score(Matrix m) {
    var score = 0;
    for (var r = 0; r < m.NumRows; r++) {
        for (var c = 0; c < m.NumCols; c++) {
            var val = m.At(r, c);
            var shouldBe = (c >= r) ? 1 : 0;
            if (val == shouldBe) {
                score++;
            }
            else {
                score--;
            }
        }
    }
    return score;
}
With this scoring algorithm, a random field of 1s and 0s will give a score of 0. An "opposite" triangle will give the most negative score, and the correct solution will give the most positive score. Diffing two scores will give you the cost.
If this scorer doesn't work for you, then you will need to "tune" it until it produces the matrices you want.
This algorithm is based on the premise that tuning this scorer is much simpler than devising the optimal algorithm for sorting the matrix.
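Putting the pieces together, here is a rough Python sketch of the whole loop, using the naive scorer and the row/column swap operations described above. The parameter values (`steps`, `t_start`, `t_end`), the linear cooling schedule, and the convention that cost is the old score minus the new score are all assumptions chosen for illustration, and they would need tuning on real data. The sketch remembers the best arrangement seen, since the final state of an annealing run is not guaranteed to be the best one visited.

```python
import math
import random

def score(m):
    """+1 for each cell matching the target (1 where c >= r, else 0), -1 otherwise."""
    total = 0
    for r, row in enumerate(m):
        for c, val in enumerate(row):
            should_be = 1 if c >= r else 0
            total += 1 if val == should_be else -1
    return total

def anneal(matrix, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Return (best score, row order, column order) found by swapping rows/columns."""
    rng = random.Random(seed)
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))

    def view():
        return [[matrix[r][c] for c in cols] for r in rows]

    current = score(view())
    best = (current, rows[:], cols[:])
    for step in range(steps):
        temperature = t_start + (t_end - t_start) * step / steps  # linear cooling
        perm = rows if rng.random() < 0.5 else cols
        i, j = rng.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]          # try a swap
        candidate = score(view())
        cost = current - candidate                   # positive when the swap is worse
        if cost <= 0 or math.exp(-cost / temperature) > rng.random():
            current = candidate                      # commit
            if current > best[0]:
                best = (current, rows[:], cols[:])
        else:
            perm[i], perm[j] = perm[j], perm[i]      # undo
    return best

# The 4 x 4 matrix from the question (rows A-D, columns 1-4).
example = [[0, 1, 1, 0], [1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
best_score, row_order, col_order = anneal(example)
print(best_score, row_order, col_order)
```

Recomputing the full score after every swap is wasteful for large matrices; only the two swapped rows or columns change, so an incremental update would be the first optimization to consider.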

I came up with the below algorithm, and it seems to work correctly.
Phase 1: move rows with the most 1s up and columns with the most 1s right.
First the rows. Sort the rows by counting their 1s. We don't care if 2 rows have the same number of 1s.
Now the columns. Sort the cols by counting their 1s. We don't care if 2 cols have the same number of 1s.
Phase 2: repeat phase 1 but with extra criteria, so that we approach the triangular shape.
Criterion for rows: if 2 rows have the same number of 1s, we move up the row that begins with fewer 0s.
Criterion for cols: if 2 cols have the same number of 1s, we move right the col that has fewer 0s at the bottom.
Example:
Phase 1
  1 2 3 4                    1 2 3 4                    4 1 3 2
A 0 1 1 0                  B 1 1 1 0                  B 0 1 1 1
B 1 1 1 0  - sort rows->   A 0 1 1 0  - sort cols->   A 0 0 1 1
C 0 1 0 0                  D 1 1 0 0                  D 0 1 0 1
D 1 1 0 0                  C 0 1 0 0                  C 0 0 0 1
Phase 2
  4 1 3 2                    4 1 3 2
B 0 1 1 1                  B 0 1 1 1
A 0 0 1 1  - sort rows->   D 0 1 0 1  - sort cols->   "completed"
D 0 1 0 1                  A 0 0 1 1
C 0 0 0 1                  C 0 0 0 1
Edit: it turns out that my algorithm doesn't always give proper triangular matrices.
For example:
Phase 1
  1 2 3 4                    1 2 3 4
A 1 0 0 0                  B 0 1 1 1
B 0 1 1 1  - sort rows->   C 0 0 1 1  - sort cols->   "completed"
C 0 0 1 1                  A 1 0 0 0
D 0 0 0 1                  D 0 0 0 1
Phase 2
  1 2 3 4                    1 2 3 4                    2 1 3 4
B 0 1 1 1                  B 0 1 1 1                  B 1 0 1 1
C 0 0 1 1  - sort rows->   C 0 0 1 1  - sort cols->   C 0 0 1 1
A 1 0 0 0                  A 1 0 0 0                  A 0 1 0 0
D 0 0 0 1                  D 0 0 0 1                  D 0 0 0 1
                           (no change)
(*) Perhaps a phase 3 will increase the good results. In that phase we place the rows that start with fewer 0s in the top.

Look for a 1987 paper by Anna Lubiw on "Doubly Lexical Orderings of Matrices".
There is a citation below. The ordering is not identical to what you are looking for, but is pretty close. If nothing else, you should be able to get a pretty good idea from there.
http://dl.acm.org/citation.cfm?id=33385

Here's a starting point:
Convert each row from binary bits into a number
Sort the numbers in descending order.
Then convert each row back to binary.

Basic algorithm:
Determine the row sums and store values. Determine the column sums and store values.
Sort the row sums in ascending order. Sort the column sums in ascending order.
Hopefully, you should have a matrix with as close to an upper-right triangular region as possible.
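A small sketch of sorting by sums, run on the 4 x 4 matrix from the question. One assumption is made loudly here: rows are ordered with the largest sum first (top), while columns are ordered with the smallest sum first (left), because that is the direction that pushes the 1s toward the upper right. The function name `sort_by_sums` and the label lists are made up for the example.

```python
def sort_by_sums(matrix, row_labels, col_labels):
    """Reorder whole rows (most 1s on top) and whole columns (most 1s on the right)."""
    # sorted() is stable, so rows/columns with equal sums keep their relative order.
    row_order = sorted(range(len(matrix)), key=lambda r: -sum(matrix[r]))
    col_order = sorted(range(len(matrix[0])),
                       key=lambda c: sum(row[c] for row in matrix))
    reordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return (reordered,
            [row_labels[r] for r in row_order],
            [col_labels[c] for c in col_order])

matrix = [[0, 1, 1, 0], [1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
result, rows, cols = sort_by_sums(matrix, ["A", "B", "C", "D"], [1, 2, 3, 4])
print(rows, cols)
for row in result:
    print(*row)
```

Ties between equal sums are left in their original order, so this pass alone does not decide between rows such as A and D in the example; that is where a tie-breaking rule or a second pass would come in.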

Treat the rows as binary numbers, with the leftmost column as the most significant bit, and sort them in descending order, top to bottom.
Treat the columns as binary numbers, with the bottommost row as the most significant bit, and sort them in ascending order, left to right.
Repeat until you reach a fixed point. Proof that the algorithm terminates is left as an exercise for the reader.
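The two sorting passes can be sketched like this, again on the matrix from the question. Tuples of bits compare the same way as the binary numbers they spell, so no integer conversion is needed. Since the termination proof is left open above, the sketch adds a `max_rounds` cap as a safety net; that cap and the function name `iterate_sorts` are assumptions of this example.

```python
def iterate_sorts(matrix, row_labels, col_labels, max_rounds=100):
    """Alternate the row sort and the column sort until neither changes the order."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    for _ in range(max_rounds):
        # Rows: leftmost column is the most significant bit, largest row on top.
        new_rows = sorted(rows, key=lambda r: tuple(matrix[r][c] for c in cols),
                          reverse=True)
        # Columns: bottommost row is the most significant bit, smallest column left.
        new_cols = sorted(cols,
                          key=lambda c: tuple(matrix[r][c] for r in reversed(new_rows)))
        if new_rows == rows and new_cols == cols:
            break  # fixed point reached
        rows, cols = new_rows, new_cols
    reordered = [[matrix[r][c] for c in cols] for r in rows]
    return reordered, [row_labels[r] for r in rows], [col_labels[c] for c in cols]

matrix = [[0, 1, 1, 0], [1, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
result, rows, cols = iterate_sorts(matrix, ["A", "B", "C", "D"], [1, 2, 3, 4])
print(rows, cols)
for row in result:
    print(*row)
```

On this input the loop settles after one round of changes, with rows B, D, A, C and columns 4, 1, 3, 2.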
