I have a large matrix of integers that I want to be able to slice and run analytics on. I'm prototyping this with Apache Ignite.
The matrix is 50000 columns x 5 million rows. I want to be able to run the following operations on this matrix:
1. Fetch all data for a single column.
2. Fetch all data for some random subset of rows and columns.
3. Compute a correlation coefficient for one row against every other row.
I'm trying to satisfy 1. and 2. right now, but I can't figure out how to store a matrix. I was thinking of storing the matrix like this:
row1 {
co1: val
co2: val
co3: val
...
co50000: val
}
row2{ ... }
But I'm not sure if I can have complex data types like this in Ignite, or if I can only have a single key:value pair. The documentation is not clear. When I try to insert a dictionary using pyignite (my Java is a little rusty, so I'm sticking to python right now), the data comes back as an array:
>>> test.put('row2', { "col1": 50, "col2":0 })
>>> test.get('cell2')
['gene1', 'gene2']
I'm new to Apache Ignite, but the documentation doesn't seem to detail how to do this, or if it would even be performant.
I think you need to store 5 million key-value pairs, using the row as the key and an array of its 50000 column values as the value.
Better stick to primitive types. Not sure how to map it best to Python.
From a thin client perspective, Ignite caches are flat, not nested. You can put arrays, sequences, dictionaries, or any combination of the above as a value in an Ignite cache, but you cannot traverse values inside the cache afterwards. You can only retrieve the whole value and look into it.
cache.get(row)[column] will work, but it will retrieve the whole row of 50000 elements from the cache as a Python list, and then address the single element in this list. I think in your case it will be sub-optimal.
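To make the cost concrete, here is a minimal sketch of the "one row per key" layout. DictCache is a hypothetical in-memory stand-in with get/put methods, used only so the example runs without a cluster; it is not the pyignite API, and the tiny 3-column rows are made-up data.

```python
class DictCache:
    """Hypothetical stand-in for a flat key-value cache (not the pyignite API)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def fetch_cell(cache, row_key, column):
    # The whole row value is retrieved first; only then is one element picked.
    return cache.get(row_key)[column]


def fetch_column(cache, row_keys, column):
    # Reading a single column has to retrieve every row value.
    return [cache.get(key)[column] for key in row_keys]


cache = DictCache()
cache.put("row1", [10, 20, 30])
cache.put("row2", [40, 50, 60])

cell = fetch_cell(cache, "row2", 1)
column = fetch_column(cache, ["row1", "row2"], 2)
print(cell, column)
```

With 5 million rows of 50000 values each, fetch_column would transfer the entire matrix to read one column, which is why this layout looks sub-optimal for column queries.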
If I got your question right, JSON-oriented databases (like MongoDB or PostgreSQL's JSONB) have the features you describe. Don't know if they are fast enough for data analysis though.
Related
I have two Spark dataframes DFa and DFb, they have same schema, ('country', 'id', 'price', 'name').
DFa has around 610 million rows,
DFb has 3000 million rows.
Now I want to find all rows from DFa and DFb that have same id, where id looks like "A6195A55-ACB4-48DD-9E57-5EAF6A056C80".
It's a SQL inner join, but when I run a Spark SQL inner join, one task gets killed because its container uses too much memory and causes a Java heap memory error. My cluster has limited resources, and tuning the YARN and Spark configuration is not a feasible option.
Is there any other solution to deal with this? A non-Spark solution is also acceptable if the runtime is reasonable.
More generally, can anyone suggest algorithms and solutions for finding common elements in two very large datasets?
First compute 64-bit hashes of your ids. Comparisons will be a lot faster on the hashes than on the string ids.
My basic idea is:
Build a hash table from DFa.
As you compute the hashes for DFb, you do a lookup in the table. If there's nothing there then drop the entry (no match). If you get a hit compare the actual IDs to make sure you don't get a false positive.
The complexity is O(N). Without knowing how many overlaps you expect, this is the best you can do, since you might have to output everything if it all matches.
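A minimal sketch of this hash-table approach, assuming rows are plain dicts with an "id" field. The hash64 helper and its use of SHA-256 are illustrative choices, not something the answer prescribes; any 64-bit hash would do.

```python
import hashlib
from collections import defaultdict


def hash64(record_id: str) -> int:
    # First 8 bytes of SHA-256 as a 64-bit integer (an assumed hash choice).
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_join(rows_a, rows_b):
    """Yield (row_a, row_b) pairs whose 'id' fields are equal."""
    # Build phase: one pass over the smaller dataset.
    table = defaultdict(list)
    for row in rows_a:
        table[hash64(row["id"])].append(row)

    # Probe phase: one pass over the larger dataset.
    for row_b in rows_b:
        for row_a in table.get(hash64(row_b["id"]), ()):
            # Compare the real ids to rule out hash collisions.
            if row_a["id"] == row_b["id"]:
                yield row_a, row_b


rows_a = [{"id": "A1", "price": 1}, {"id": "B2", "price": 2}]
rows_b = [{"id": "B2", "price": 5}, {"id": "C3", "price": 7}]
matches = list(hash_join(rows_a, rows_b))
print(matches)
```

Each row is hashed and touched once, which is where the O(N) bound comes from.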
The naive implementation would use about 6GB of RAM for the table (assuming 80% occupancy and a flat hash table storing one 64-bit hash per DFa row), but you can do better.
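The 6GB figure can be reproduced with a quick calculation, assuming the table holds only one 8-byte hash for each of the 610 million DFa rows and is kept 80% full:

```python
ROWS_IN_DFA = 610_000_000   # row count from the question
BYTES_PER_HASH = 8          # one 64-bit hash per entry
OCCUPANCY = 0.8             # assumed load factor of the flat table

slots = ROWS_IN_DFA / OCCUPANCY
table_bytes = slots * BYTES_PER_HASH
table_gb = table_bytes / 1e9
print(f"{table_gb:.1f} GB")
```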
Since we already have the hash, we only need to know whether it exists. So you only need one bit to mark that, which reduces the memory usage by a lot (you need 64x less memory per entry, but you need to lower the occupancy). However, this is not a common data structure, so you'll need to implement it.
But there's something even better and more compact: a Bloom filter. It will introduce some more false positives, but we had to double check anyway because we didn't trust the hash, so it's not a big downside. The best part is that you should be able to find libraries for it already available.
So everything together it looks like this:
Compute hashes from DFa and build a Bloom filter.
Compute hashes from DFb and check them against the Bloom filter. If you get a match, look up the ID in DFa to make sure it's a real match and add it to the result.
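The two steps above can be sketched as follows. This is a toy Bloom filter written for illustration, not a library API: the class, the double-hashing scheme, and the default sizes are all assumptions, and in practice an existing library would be the better choice. Note that ids_a is iterated twice (once to build the filter, once to confirm candidates).

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def bloom_join(ids_a, ids_b, num_bits=1 << 16, num_hashes=4):
    """Return the ids present in both collections, sorted."""
    bloom = BloomFilter(num_bits, num_hashes)
    for record_id in ids_a:
        bloom.add(record_id)
    # The filter may report false positives, never false negatives.
    candidates = {record_id for record_id in ids_b if record_id in bloom}
    # Second pass over A confirms the candidates against the real ids.
    return sorted(record_id for record_id in ids_a if record_id in candidates)


ids_a = ["id-1", "id-2", "id-3"]
ids_b = ["id-3", "id-9", "id-1"]
common = bloom_join(ids_a, ids_b)
print(common)
```

The candidate set is small when the overlap is small, so the confirmation pass needs far less memory than a full hash table of DFa.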
This is a typical use case in any big data environment. You can use map-side joins, where the smaller table is cached and broadcast to all the executors.
You can read more about broadcast joins here: Broadcast-Joins
In a KNN-like algorithm we need to load the model data into a cache to predict the records.
Here is the example for KNN.
So if the model is a large file, say 1 or 2 GB, will we be able to load it into the distributed cache?
Example:
In order to predict one outcome, we need to find the distance between that single record and all the records in the model result, and find the minimum distance. So we need to have the model result at hand. If it is a large file, it cannot be loaded into the distributed cache for finding the distance.
One way is to split/partition the model result into several files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome.
How can we partition the file and perform the operation on these partitions?
ie 1 record <Distance> file1,file2,....filen
2nd record <Distance> file1,file2,...filen
This is what I have come up with.
Is there any other way?
Any pointers would help me.
I think the way you partition the data mainly depends on your data itself.
Given that you have a model with a bunch of rows, and that you want to find the k closest ones to your input data, the trivial solution is to compare them one by one. This can be slow because it means going through 1-2GB of data millions of times (I assume you have large numbers of records that you want to classify, otherwise you don't need Hadoop).
That is why you need to prune your model efficiently (your partitioning) so that you can compare only those rows that are most likely to be the closest. This is a hard problem and requires knowledge of the data you operate on.
Additional tricks that you can use to squeeze out performance are:
Pre-sorting the input data so that the input items that will be compared against the same partition come together. Again, this depends on the data you operate on.
Using random access indexed files (like Hadoop's Map files) to find the data faster and cache it.
In the end it may actually be easier to store your model in a Lucene index, so you can achieve the effects of partitioning by looking up the index. Pre-sorting the data is still helpful there.
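The partition-then-merge idea from the question can be sketched as below. This is an in-memory illustration only: partitions are plain Python lists of (features, label) pairs standing in for the split model files, Euclidean distance is an assumed metric, and the function names are made up for the example.

```python
import heapq
import math
from collections import Counter


def nearest_in_partition(query, partition, k):
    """Return the k smallest (distance, label) pairs within one partition."""
    scored = ((math.dist(query, features), label)
              for features, label in partition)
    return heapq.nsmallest(k, scored)


def predict(query, partitions, k):
    # Each partition is scanned independently, as a mapper would do.
    local_results = [nearest_in_partition(query, part, k)
                     for part in partitions]
    # Merge step (the reducer): keep the global k nearest, then majority vote.
    global_nearest = heapq.nsmallest(
        k, (pair for result in local_results for pair in result))
    votes = Counter(label for _, label in global_nearest)
    return votes.most_common(1)[0][0]


partitions = [
    [((0.0, 0.0), "a"), ((0.1, 0.2), "a")],
    [((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((0.2, 0.1), "a")],
]
label_near_origin = predict((0.0, 0.1), partitions, 3)
label_near_five = predict((5.0, 5.0), partitions, 3)
print(label_near_origin, label_near_five)
```

Because every partition only reports its own k best candidates, the merge step handles at most k times the number of partitions, regardless of the model size. Pruning, as described above, would reduce how many partitions have to be scanned at all.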
How do I join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records based on a common key, add them to, say, a HashMap in the reducer, and then take a cross product. (e.g. Join of two datasets in Mapreduce/Hadoop)
This solution is very good and works for the majority of cases, but my case is rather different. I am dealing with data that has billions of records, and taking a cross product of two sets is impossible because in many cases the HashMap will end up holding a few million objects. So I encounter a heap space error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is any solution that can help me avoid this issue.
Don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If it exists, merge the A records into the existing value (B elements). If not, create a key and add the A elements as the value.
When running on a line of type B, look for the key in Cassandra. If it exists, merge the B records into the existing value (A elements). If not, create a key and add the B elements as the value.
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
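A rough sketch of that merge logic, with a plain Python dict standing in for the external key-value store. No Cassandra API is used here; the entry layout with separate "A" and "B" lists and both function names are assumptions made for the example.

```python
from itertools import product


def merge_into_store(store, records):
    """records: iterable of (source, key, payload), source being 'A' or 'B'."""
    for source, key, payload in records:
        # Look up the key; create an empty entry if it does not exist yet.
        entry = store.setdefault(key, {"A": [], "B": []})
        entry[source].append(payload)


def joined_rows(store):
    for key, entry in store.items():
        # Cross product per key, produced lazily one pair at a time.
        for a, b in product(entry["A"], entry["B"]):
            yield key, a, b


store = {}
merge_into_store(store, [
    ("A", "k1", "a1"),
    ("B", "k1", "b1"),
    ("B", "k1", "b2"),
    ("A", "k2", "a2"),  # no B record, so it produces no joined row
])
result = list(joined_rows(store))
print(result)
```

The point of moving the state into a store is that the accumulated records live outside the reducer's heap; whether the per-key values stay manageable still depends on the data.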
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key (even if there is no data skew), you can create artificial keys and spread the key distribution. This makes sure that each reducer gets fewer records than it otherwise would. For example, if you prefix "1" to 50% of the records with key "K1" and "2" to the other 50%, you end up with half the records on one reducer (1K1) and the other half on another (2K1).
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
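A minimal sketch of the artificial-key idea, run in memory rather than on a cluster. It assumes one side is split across salted keys at random while the other side is copied to every salt so that no match is lost; that replication step and all function names are assumptions of this sketch, not a description of Pig's internals.

```python
import random
from collections import defaultdict


def salt_large_side(records, num_salts, rng):
    # Each record of the heavy side goes to one randomly chosen salted key.
    for key, value in records:
        yield (rng.randrange(num_salts), key), value


def replicate_other_side(records, num_salts):
    # The other side is copied to every salt so each group sees all of it.
    for key, value in records:
        for salt in range(num_salts):
            yield (salt, key), value


def salted_join(left, right, num_salts=2, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(lambda: ([], []))
    for salted_key, value in salt_large_side(left, num_salts, rng):
        groups[salted_key][0].append(value)
    for salted_key, value in replicate_other_side(right, num_salts):
        groups[salted_key][1].append(value)
    # Each group corresponds to what a single reducer call would receive.
    return sorted(
        (key, left_value, right_value)
        for (_, key), (lefts, rights) in groups.items()
        for left_value in lefts
        for right_value in rights
    )


left = [("K1", i) for i in range(4)]
right = [("K1", "x"), ("K1", "y")]
joined = salted_join(left, right)
print(joined)
```

The output is the same as an unsalted join; only the grouping changes, so each group holds roughly 1/num_salts of the heavy side.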
In D3 I need to visualize loading lab samples into plastic 2D plates of 8 rows x 12 columns or similar. Sometimes I load a row at a time, sometimes a column at a time, occasionally I load flat 1D 0..95, or other orderings. Should the base D3 data() structure nest rows in columns (or vice versa) or should I keep it one dimensional?
Representing the data optimized for columns (columns[rows[]]) makes code complex when loading by rows, and vice versa. Representing it flat [0..95] is universal but it requires calculating all row and column references for 2D modes. I'd rather reference all orderings out of a common base but so far it's a win-lose proposition. I lean toward 1D flat and doing the math. Is there a win-win? Is there a way to parameterize or invert the ordering and have it optimized for both ways?
I believe in your case the best implementation would be an Associative array (specifically, a hash table implementation of it). Keys would be coordinates and values would be your stored data. Depending on your programming language you would need to handle keys in one way or another.
Example:
[0,0,0] -> someData(1,2,3,4,5)
[0,0,1] -> someData(4,2,4,6,2)
[0,0,2] -> someData(2,3,2,1,5)
Using a simple associative array would give you great insertion and read speeds, but the code would become a mess if some complex selection of data blocks is required. In that case, using a database could be reasonable (though slower than a hashmap implementation of an associative array). It would allow you to query specific data in batches. For example, you could get a whole row (or several rows) of data using one simple query:
SELECT * FROM data WHERE x=1 AND y=2 ORDER BY z ASC
Or, let's say selecting a 2x2x2 cube from the middle of 3d data:
SELECT * FROM data WHERE x >= 5 AND x <= 6 AND y >= 10 AND y <= 11 AND z >= 3 AND z <= 4 ORDER BY x ASC, y ASC, z ASC
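To try queries of this shape without setting up a server, here is a small sketch using Python's built-in sqlite3 module. The table layout (x, y, z, value), the 3x3x3 sample data, and the query ranges are assumptions chosen to keep the example tiny.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (x INTEGER, y INTEGER, z INTEGER, value INTEGER, "
    "PRIMARY KEY (x, y, z))"
)
# Fill a 3x3x3 grid; the value simply encodes the coordinates.
cells = [(x, y, z, x * 100 + y * 10 + z)
         for x in range(3) for y in range(3) for z in range(3)]
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", cells)

# One line of cells along z for fixed x and y.
line = conn.execute(
    "SELECT z, value FROM data WHERE x = 1 AND y = 2 ORDER BY z ASC"
).fetchall()

# A 2x2x2 cube selected by coordinate ranges.
cube = conn.execute(
    "SELECT x, y, z FROM data "
    "WHERE x BETWEEN 0 AND 1 AND y BETWEEN 1 AND 2 AND z BETWEEN 1 AND 2 "
    "ORDER BY x ASC, y ASC, z ASC"
).fetchall()
print(line)
print(len(cube))
```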
EDIT:
On second thought, if the size of the dimensions won't change during runtime, you should go with a 1-dimensional array and do all the math yourself, as it is the fastest solution. If you initialize a 3-dimensional array as an array of arrays of arrays, every read/write to an element requires 2 additional hops in memory to find the required address. However, a function like:
int pos(int w, int h, int x, int y, int z) { return z*w*h + y*w + x; } // w, h: dimensions; x, y, z: position
would be inlined by most compilers and be pretty fast.
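Applied to the 8 x 12 plate from the question, the same index math looks roughly like this in Python. It assumes row-major order for the flat 0..95 layout, and the helper names are made up for the sketch.

```python
ROWS, COLS = 8, 12  # plate size from the question


def to_index(row, col, cols=COLS):
    # Row-major: all cells of row 0 first, then row 1, and so on.
    return row * cols + col


def to_row_col(index, cols=COLS):
    # Inverse of to_index.
    return divmod(index, cols)


def row_indices(row, cols=COLS):
    return [to_index(row, col, cols) for col in range(cols)]


def column_indices(col, rows=ROWS, cols=COLS):
    return [to_index(row, col, cols) for row in range(rows)]


print(to_index(1, 0), to_row_col(95))
print(column_indices(0))
```

With these helpers the data stays flat, and loading by row or by column is just a different list of indices into the same array.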
I have a very large and very sparse matrix, composed of only 0s and 1s. I then basically handle (row-column) pairs. I have at most 10k pairs per row/column.
My needs are the following:
Parallel insertion of (row-column) pairs
Quick retrieval of an entire row or column
Quick querying the existence of a (row-column) pair
A Ruby client if possible
Are there existing databases suited to this kind of constraint?
If not, what would get me the best performance:
A SQL database, with a table like this:
row(indexed) | column(indexed) (but the indexes would have to be constantly refreshed)
A NoSQL key-value store, with two tables like this:
row => columns ordered list
column => rows ordered list
(but with parallel insertion of elements to the lists)
Something else
Thanks for your help!
A sparse 0/1 matrix sounds to me like an adjacency matrix, which is used to represent a graph. Based on that, it is possible that you are trying to solve some graph problem and a graph database would suit your needs.
Graph databases, like Neo4J, are very good for fast traversal of the graph, because retrieving the neighbors of a vertex takes O(number of neighbors of that vertex), so it is not related to the number of vertices in the whole graph. Neo4J is also transactional, so parallel insertion is not a problem. You can use the REST API wrapper in MRI Ruby, or a JRuby library for more seamless integration.
On the other hand, if you are trying to analyze the connections in the graph, and it would be enough to do that analysis once in a while and just make the results available, you could try your luck with a framework for graph processing based on Google Pregel. It's a little bit like Map-Reduce, but aimed toward graph processing. There are already several open source implementations of that paper.
However, if a graph database or graph processing framework does not suit your needs, I recommend taking a look at HBase, which is an open-source, column-oriented data store based on Google BigTable. Its data model is in fact very similar to what you described (a sparse matrix), it has row-level transactions, and it does not require you to retrieve the whole row just to check whether a certain pair exists. There are some Ruby libraries for that database, but I imagine that it would be safer to use JRuby instead of MRI for interacting with it.
If your matrix is really sparse (i.e. the nodes only have a few interconnections) then you would get reasonably efficient storage from a RDBMS such as Oracle, PostgreSQL or SQL Server. Essentially you would have a table with two fields (row, col) and an index or key each way.
Set up the primary key one way round (depending on whether you mostly query by row or column) and make another index on the fields the other way round. This will only store data where a connection exists, and it will be proportional to the number of edges in the graph.
The indexes will allow you to efficiently retrieve either a row or column, and will always be in sync.
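A small sketch of this two-index layout, using Python's built-in sqlite3 module in place of the databases named above. The table name cells, the column names r and c, and the sample pairs are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Primary key one way round (row, column)...
conn.execute("CREATE TABLE cells (r INTEGER, c INTEGER, PRIMARY KEY (r, c))")
# ...and a second index the other way round (column, row).
conn.execute("CREATE INDEX cells_by_col ON cells (c, r)")

pairs = [(0, 5), (0, 9), (3, 5), (7, 2)]
conn.executemany("INSERT INTO cells VALUES (?, ?)", pairs)

# Retrieve an entire row or an entire column.
row_0 = [c for (c,) in conn.execute(
    "SELECT c FROM cells WHERE r = ? ORDER BY c", (0,))]
col_5 = [r for (r,) in conn.execute(
    "SELECT r FROM cells WHERE c = ? ORDER BY r", (5,))]

# Check whether a single (row, column) pair exists.
exists = conn.execute(
    "SELECT 1 FROM cells WHERE r = ? AND c = ?", (3, 5)).fetchone() is not None
missing = conn.execute(
    "SELECT 1 FROM cells WHERE r = ? AND c = ?", (3, 9)).fetchone() is not None
print(row_0, col_5, exists, missing)
```

Only the pairs that exist are stored, and the database keeps both indexes in sync on every insert.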
If you have 10,000 nodes and 10 connections per node, the database will only have 100,000 entries. 100 edges per node will give 1,000,000 entries, and so on. For sparse connectivity this should be fairly efficient.
A back-of-fag-packet estimate
This table will essentially have a row and column field. If the clustered index goes (row, column, value) then the other covering index would go (column, row, value). If the additions and deletions were random (i.e. not batched by row or column), the I/O would be approximately double that for just the table.
If you batched the inserts by row or column then you would get less I/O on one of the indexes as the records are physically located together in one of the indexes. If the matrix really is sparse then this adjacency list representation is by far the most compact way to store it, which will be much faster than storing it as a 2D array.
A 10,000 x 10,000 matrix with a 64-bit value would take 800MB plus the row index. Updating one value would require writing at least 80k (the whole row). You could optimise writes by rows if your data can be grouped by rows on inserts. If the inserts are realtime and random, then you will write out an 80k row for each insert.
In practice, these writes would have some efficiency because they would all be written out in a mostly contiguous area, depending on how your NoSQL platform physically stores its data.
I don't know how sparse your connectivity is, but if each node had an average of 100 connections, then you would have 1,000,000 records. This would be approximately 16 bytes per row (Int4 row, Int4 column, Double value) plus a few bytes overhead for both the clustered table and covering index. This structure would take around 32MB + a little overhead to store.
Updating a single record on a row or column would cause two single disk block writes (8k, in practice a segment) for random access, assuming the inserts aren't row or column ordered.
Adding 1 million randomly ordered entries to the array representation would result in approximately 80GB of writes + a little overhead. Adding 1m entries to the adjacency list representation would result in approximately 32MB of writes (16GB in practice because the whole block will be written for each index leaf node), plus a little overhead.
For that level of connectivity (10,000 nodes, 100 edges per node) the adjacency list will be more efficient in storage space, and probably in I/O as well. You will get some optimisation from the platform, so some sort of benchmark might be appropriate to see which is faster in practice.
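The figures in this estimate can be reproduced with a few lines of arithmetic. The constants come from the text above; treating "8k" as 8,000 bytes is a rounding assumption that matches the quoted totals.

```python
NODES = 10_000
EDGES_PER_NODE = 100
VALUE_BYTES = 8          # 64-bit value in the dense array
RECORD_BYTES = 16        # Int4 row + Int4 column + Double value
BLOCK_BYTES = 8_000      # "8k" disk block, rounded as in the text
INSERTS = 1_000_000

dense_total = NODES * NODES * VALUE_BYTES          # whole 2D array
dense_row = NODES * VALUE_BYTES                    # one row of the array
edges = NODES * EDGES_PER_NODE                     # adjacency list records
list_total = edges * RECORD_BYTES * 2              # table + covering index
dense_writes = INSERTS * dense_row                 # one row rewritten per insert
list_writes_logical = INSERTS * RECORD_BYTES * 2   # bytes actually changed
list_writes_blocks = INSERTS * BLOCK_BYTES * 2     # one block per index

print(dense_total / 1e6, "MB dense array")
print(list_total / 1e6, "MB adjacency list")
print(dense_writes / 1e9, "GB dense writes")
print(list_writes_blocks / 1e9, "GB adjacency list block writes")
```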