Representing 2D data optimized by row vs by column vs flat - data-structures

In D3 I need to visualize loading lab samples into plastic 2D plates of 8 rows x 12 columns or similar. Sometimes I load a row at a time, sometimes a column at a time, occasionally I load flat 1D 0..95, or other orderings. Should the base D3 data() structure nest rows in columns (or vice versa), or should I keep it one-dimensional?
Representing the data optimized for columns (an array of columns, each holding an array of rows) makes code complex when loading by rows, and vice versa. Representing it flat [0..95] is universal, but it requires calculating all row and column references for the 2D modes. I'd rather derive all orderings from a common base, but so far it's a win-lose proposition. I lean toward 1D flat and doing the math. Is there a win-win? Is there a way to parameterize or invert the ordering and have it optimized both ways?

I believe in your case the best implementation would be an associative array (specifically, a hash-table implementation of one). Keys would be coordinates and values would be your stored data. Depending on your programming language, you would need to handle keys in one way or another.
Example:
[0,0,0] -> someData(1,2,3,4,5)
[0,0,1] -> someData(4,2,4,6,2)
[0,0,2] -> someData(2,3,2,1,5)
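In the 2D plate case, a minimal sketch of that mapping could look like this (Rust; the Sample payload and its contents are hypothetical):
use std::collections::HashMap;

// Hypothetical per-well payload; substitute whatever you actually store.
#[derive(Debug, Clone)]
struct Sample {
    values: Vec<f64>,
}

fn main() {
    // Keys are (row, column) coordinates; values are the stored data.
    let mut plate: HashMap<(usize, usize), Sample> = HashMap::new();

    plate.insert((0, 0), Sample { values: vec![1.0, 2.0, 3.0] });
    plate.insert((0, 1), Sample { values: vec![4.0, 2.0, 4.0] });

    // Lookup by coordinate is O(1) on average.
    if let Some(sample) = plate.get(&(0, 1)) {
        println!("{:?}", sample.values);
    }
}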
Using a simple associative array would give you great insertion and read speeds, but the code would become a mess if some complex selection of data blocks is required. In that case, using a database could be reasonable (though slower than a hash-map implementation of an associative array). It would allow you to query specific data in batches. For example, you could get a whole row (or several rows) of data using one simple query:
SELECT * FROM data WHERE x=1 AND y=2 ORDER BY z ASC
Or, say, selecting a 2x2x2 cube from the middle of 3D data:
SELECT * FROM data WHERE x >= 5 AND x <= 6 AND y >= 10 AND y <= 11 AND z >= 3 AND z <= 4 ORDER BY x ASC, y ASC, z ASC
EDIT:
On second thought, if the size of the dimensions won't change during runtime, you should go with a one-dimensional array and do all the math yourself, as it is the fastest solution. If you initialize a three-dimensional array as an array of arrays of arrays, every read/write of an element requires two additional hops in memory to find the required address. However, writing a function like:
int pos(int w, int h, int x, int y, int z) { return z*w*h + y*w + x; }  // w, h: dimensions; x, y, z: position
would be inlined by most compilers and is pretty fast.
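Applied to the 8 x 12 plate in the original question, a minimal sketch of the same flat, row-major idea with row and column helpers might look like this (Rust purely as an illustration; the Plate type and its contents are hypothetical):
// Hypothetical flat, row-major plate; the index math mirrors the pos() helper above.
struct Plate<T> {
    rows: usize,
    cols: usize,
    wells: Vec<T>, // length rows * cols, index = row * cols + col
}

impl<T: Clone + Default> Plate<T> {
    fn new(rows: usize, cols: usize) -> Self {
        Plate { rows, cols, wells: vec![T::default(); rows * cols] }
    }

    fn index(&self, row: usize, col: usize) -> usize {
        row * self.cols + col
    }

    // Load one well by (row, col).
    fn set(&mut self, row: usize, col: usize, value: T) {
        let i = self.index(row, col);
        self.wells[i] = value;
    }

    // A whole row, in order (cheap: a contiguous slice).
    fn row(&self, row: usize) -> &[T] {
        &self.wells[row * self.cols..(row + 1) * self.cols]
    }

    // A whole column, in order (a strided walk over the same flat buffer).
    fn column(&self, col: usize) -> impl Iterator<Item = &T> {
        self.wells.iter().skip(col).step_by(self.cols)
    }
}

fn main() {
    let mut plate: Plate<u32> = Plate::new(8, 12);
    plate.set(2, 5, 42);
    assert_eq!(plate.row(2)[5], 42);
    assert_eq!(plate.column(5).nth(2), Some(&42));
}
Rows come back as contiguous slices and columns as a strided walk, so both orderings are derived from one flat base without duplicating the data.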

Related

custom cache alignment in rust

How can I optimize the performance of my RowMatrix struct in Rust for large number of rows?
I have a matrix defined in row-major form using a struct in Rust as follows:
pub struct RowMatrix {
    data: Vec<[usize; 8]>,
    width: usize,
}
Each row is broken down into an array of 8 elements and stacked one after the other in the data vector. For example, if the width is 64, then, the first 8 elements in the vector represent the first row, the next 8 elements represent the second row, and so on.
I need to perform operations on individual arrays belonging to two separate rows of this matrix at the same index. For example, if I want to perform an operation on the 2nd array segment of the 1st and 10th row, I would pick the 2nd and 74th elements from the data vector respectively. The array elements will always be from the same array segment.
This operation is performed a number of times with different row pairs, and when the number of rows in the matrix is small, I don't see any performance issues. However, when the number of rows is large, I'm seeing a significant degradation in performance, which I attribute to frequent cache misses.
Is there a way to custom-align my struct along the cache line to reduce cache misses without changing the struct definition? I want to control the layout of elements in memory at a fine-grained level, like keeping elements that are 8 elements apart together in cache (if 64 is the width of the matrix).
I used the repr(align(x)) attribute to specify the alignment of the struct, but I don't think it's helping: it still keeps the array elements in a sequential fashion, and for a big matrix the respective elements might not be in the cache.
#[repr(align)] can only affect the items stored in the struct itself (the Vec pointer, length and capacity, plus your width). Since Vec is little more than a pointer to the data, the layout behind it is entirely dictated by its implementation, and there is no way for you to directly affect it. So "without changing the struct definition" it's not possible to change the layout. You can, however, create a custom Vec-like type or manage the memory yourself directly in the RowMatrix.
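As one hedged sketch of the "manage the memory yourself" direction (hypothetical names, not the asker's existing struct): store the data segment-major instead of row-major, so the segments with the same index across all rows sit next to each other in memory, which is exactly the access pattern described above:
// Hypothetical alternative layout: segment-major storage. All rows' segment 0
// come first, then all rows' segment 1, and so on, so two rows' segments with
// the same index are only a few cache lines apart.
pub struct SegmentMajorMatrix {
    data: Vec<[usize; 8]>,
    rows: usize,
}

impl SegmentMajorMatrix {
    pub fn new(rows: usize, width: usize) -> Self {
        assert_eq!(width % 8, 0);
        let segments_per_row = width / 8;
        SegmentMajorMatrix { data: vec![[0usize; 8]; rows * segments_per_row], rows }
    }

    // Position of (row, segment) in the segment-major buffer.
    fn index(&self, row: usize, segment: usize) -> usize {
        segment * self.rows + row
    }

    // The 8-element segments at `segment` for two different rows; when the rows
    // are near each other, the two segments are close together in memory.
    pub fn segment_pair(&self, row_a: usize, row_b: usize, segment: usize) -> (&[usize; 8], &[usize; 8]) {
        (&self.data[self.index(row_a, segment)], &self.data[self.index(row_b, segment)])
    }
}
Whether this actually helps depends on the real access pattern, so it is worth benchmarking against the row-major layout.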

What is the relative performance of 1 geometry column vs 4 decimals in Sql Server 2008?

I need to represent the dimensions of a rectangular piece of surface in a SQL Server 2008 database. I will need to perform queries based on the distance between different points and on the total area of the surface.
Will my performance be better using a geometry datatype or 4 decimal columns? Why?
If the geometry datatype is unnecessary in this situation, what amount of complexity in the geometrical shape would be required for using the geometry datatype to make sense?
I have not used the geometry datatype, and have never had reason to read up on it. Even so, it seems to me that if you're just doing basic arithmetic on a simple geometric object, the mundane old SQL datatypes should be quite efficient, particularly if you toss in some computed columns for frequently used calculations.
For example:
--DROP TABLE MyTable
CREATE TABLE MyTable
(
X1 decimal not null
,Y1 decimal not null
,X2 decimal not null
,Y2 decimal not null
,Area as abs((X2-X1) * (Y2-Y1))
,XLength as abs((X2 - X1))
,YLength as abs((Y2 - Y1))
,Diagonal as sqrt(power(abs((X2 - X1)), 2) + power(abs((Y2 - Y1)), 2))
)
INSERT MyTable values (1,1,4,5)
INSERT MyTable values (4,5,1,1)
INSERT MyTable values (0,0,3,3)
SELECT * from MyTable
Ugly calculations, but they won't be performed unless and until they are actually referenced (or unless you choose to index them). I have no statistics, but performing the same operations via the geometry datatype probably means accessing rarely used mathematical subroutines, possibly embedded in system CLR assemblies, and I just can't see that being significantly faster than the bare-bones SQL arithmetic routines.
I just took a look in BOL at the geometry datatype. (a) Zounds! (b) Cool! Check out the entries under "geometry Data Type Method Reference" (online here, but you want to look at the expanded treeview under this entry). If that's the kind of functionality you'll be needing, by all means use the geometry data type, but for simple processing I'd stick with the knucklescraper datatypes.
The geometry data types are more complex than simple decimals, so there is bound to be some overhead. But they do provide functions that calculate the distance between two points, and I would assume these have been optimised. The question might be: if you implemented the distance-between-points logic yourself, would that take longer than having the data in the appropriate format in the first place?
As with every DB question, it may come down to the ratio of inserts vs. selects/calculations.
The geometry datatype is spatial and decimal isn't.
Spatial vs. Non-spatial Data
Spatial data includes location, shape, size, and orientation.
For example, consider a particular square:
its center (the intersection of its diagonals) specifies its location
its shape is a square
the length of one of its sides specifies its size
the angle its diagonals make with, say, the x-axis specifies its orientation.
Spatial data includes spatial relationships. For example, the arrangement of ten bowling pins is spatial data.
Non-spatial data (also called attribute or characteristic data) is that information which is independent of all geometric considerations.
For example, a person's height, mass, and age are non-spatial data because they are independent of the person's location.
It's interesting to note that, while mass is non-spatial data, weight is spatial data in the sense that something's weight is very much dependent on its location!
It is possible to ignore the distinction between spatial and non-spatial data. However, there are fundamental differences between them:
spatial data are generally multi-dimensional and autocorrelated.
non-spatial data are generally one-dimensional and independent.
These distinctions put spatial and non-spatial data into different philosophical camps with far-reaching implications for conceptual, processing, and storage issues.
For example, sorting is perhaps the most common and important non-spatial data processing function that is performed.
It is not obvious how to even sort locational data such that all points end up "nearby" their nearest neighbors.
These distinctions justify a separate consideration of spatial and non-spatial data models. This unit limits its attention to the latter unless otherwise specified.
Here's some more if you're interested:
http://www.ncgia.ucsb.edu/giscc/units/u045/u045_f.html
Here's a link I found about benchmarking spatial data warehouses: http://hpc.ac.upc.edu/Talks/dir08/T000327/paper.pdf

Database to store sparse matrix

I have a very large and very sparse matrix, composed of only 0s and 1s. I then basically handle (row-column) pairs. I have at most 10k pairs per row/column.
My needs are the following:
Parallel insertion of (row-column) pairs
Quick retrieval of an entire row or column
Quick querying the existence of a (row-column) pair
A Ruby client if possible
Are there existing databases adapted for these kind of constraints?
If not, what would get me the best performance :
A SQL database, with a table like this:
row(indexed) | column(indexed) (but the indexes would have to be constantly refreshed)
A NoSQL key-value store, with two tables like this:
row => columns ordered list
column => rows ordered list
(but with parallel insertion of elements to the lists)
Something else
Thanks for your help!
A sparse 0/1 matrix sounds to me like an adjacency matrix, which is used to represent a graph. Based on that, it is possible that you are trying to solve some graph problem and a graph database would suit your needs.
Graph databases, like Neo4j, are very good for fast traversal of the graph, because retrieving the neighbors of a vertex takes O(number of neighbors of that vertex), so it is not related to the number of vertices in the whole graph. Neo4j is also transactional, so parallel insertion is not a problem. You can use the REST API wrapper in MRI Ruby, or a JRuby library for more seamless integration.
On the other hand, if you are trying to analyze the connections in the graph, and it would be enough to do that analysis once in a while and just make the results available, you could try your luck with a framework for graph processing based on Google Pregel. It's a little bit like Map-Reduce, but aimed toward graph processing. There are already several open source implementations of that paper.
However, if a graph database or graph-processing framework does not suit your needs, I recommend taking a look at HBase, which is an open-source, column-oriented data store based on Google BigTable. Its data model is in fact very similar to what you described (a sparse matrix), it has row-level transactions, and it does not require you to retrieve a whole row just to check whether a certain pair exists. There are some Ruby libraries for that database, but I imagine it would be safer to use JRuby instead of MRI for interacting with it.
If your matrix is really sparse (i.e. the nodes only have a few interconnections) then you would get reasonably efficient storage from a RDBMS such as Oracle, PostgreSQL or SQL Server. Essentially you would have a table with two fields (row, col) and an index or key each way.
Set up the primary key one way round (depending on whether you mostly query by row or by column) and make another index on the fields the other way round. This will only store data where a connection exists, and it will be proportional to the number of edges in the graph.
The indexes will allow you to efficiently retrieve either a row or column, and will always be in sync.
If you have 10,000 nodes and 10 connections per node, the database will only have 100,000 entries. 100 edges per node will give 1,000,000 entries, and so on. For sparse connectivity this should be fairly efficient.
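The same "one table, indexed both ways" idea can be sketched in memory as well (hypothetical Rust types; two ordered sets stand in for the clustered index and the covering index):
use std::collections::BTreeSet;

// Hypothetical in-memory analogue of the two-index table: one ordered set keyed
// (row, col) and one keyed (col, row), kept in sync on every insert.
struct SparseMatrix {
    by_row: BTreeSet<(u32, u32)>, // (row, col)
    by_col: BTreeSet<(u32, u32)>, // (col, row)
}

impl SparseMatrix {
    fn new() -> Self {
        SparseMatrix { by_row: BTreeSet::new(), by_col: BTreeSet::new() }
    }

    fn insert(&mut self, row: u32, col: u32) {
        self.by_row.insert((row, col));
        self.by_col.insert((col, row));
    }

    // Existence check for a single (row, col) pair.
    fn contains(&self, row: u32, col: u32) -> bool {
        self.by_row.contains(&(row, col))
    }

    // All column indices set in a given row: a range scan on the (row, col)
    // ordering, like `WHERE row = ?` on the clustered index.
    fn row(&self, row: u32) -> Vec<u32> {
        self.by_row.range((row, 0)..=(row, u32::MAX)).map(|&(_, c)| c).collect()
    }

    // All row indices set in a given column, via the second ordering.
    fn column(&self, col: u32) -> Vec<u32> {
        self.by_col.range((col, 0)..=(col, u32::MAX)).map(|&(_, r)| r).collect()
    }
}
Existence checks are single lookups, and whole-row or whole-column retrieval is a range scan on one of the two orderings, which mirrors what the database indexes give you.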
A back-of-fag-packet estimate
This table will essentially have a row and a column field. If the clustered index goes (row, column, value), then the other covering index would go (column, row, value). If the additions and deletions were random (i.e. not batched by row or column), the I/O would be approximately double that for just the table.
If you batched the inserts by row or column, you would get less I/O on one of the indexes, as those records are physically located together. If the matrix really is sparse, then this adjacency list representation is by far the most compact way to store it, and much faster than storing it as a 2D array.
A 10,000 x 10,000 matrix with 64-bit values would take 800MB plus the row index. Updating one value would require a write of at least 80k (writing out the whole row). You could optimise writes by row if your data can be grouped by rows on insert. If the inserts are realtime and random, then you will write out an 80k row for each insert.
In practice, these writes would have some efficiency because they would all be written out in a mostly contiguous area, depending on how your NoSQL platform physically stored its data.
I don't know how sparse your connectivity is, but if each node had an average of 100 connections, then you would have 1,000,000 records. This would be approximately 16 bytes per row (Int4 row, Int4 column, Double value) plus a few bytes overhead for both the clustered table and covering index. This structure would take around 32MB + a little overhead to store.
Updating a single record on a row or column would cause two single disk block writes (8k, in practice a segment) for random access, assuming the inserts aren't row or column ordered.
Adding 1 million randomly ordered entries to the array representation would result in approximately 80GB of writes + a little overhead. Adding 1m entries to the adjacency list representation would result in approximately 32MB of writes (16GB in practice because the whole block will be written for each index leaf node), plus a little overhead.
For that level of connectivity (10,000 nodes, 100 edges per node) the adjacency list will be more efficient in storage space, and probably in I/O as well. You will get some optimisation from the platform, so some sort of benchmark might be appropriate to see which is faster in practice.

Invert a LUT (lookup table)

I am writing some color management code, and I am dealing with LUTs (look up tables).
I can read the color profile LUT and convert my values... but how can I do the inverse operation? Is there a good algorithm to generate the 'inverse' of a LUT?
If your LUT is a given, the simplest method is to find the closest entry to any given color value. You can accelerate this computation by a variety of methods; for example, you can build a k-d tree out of your LUT entries and use it to eliminate most of the comparisons an exhaustive check would require.
However, this will tend to result in a "posterized" image, since smooth areas in your image will shift abruptly from one entry to the next. You can avoid this by taking your pixels in (quasi-)random order, picking the best fit from your LUT, and pushing the difference between the pixel value and the chosen entry back onto the nearby pixels which haven't already been chosen.
There are a variety of ways to do this last, but they all result in a dithering effect that generally makes better use (for imaging purposes) of the available LUT entries than the simple, per-pixel operation can.
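As a much-simplified sketch of the error-diffusion idea (1D and grayscale, processed sequentially rather than in the quasi-random order described above; the "LUT" is just a short list of representable gray levels):
// Quantize a row of gray values to the nearest LUT entry, pushing the residual
// error onto the next, not-yet-chosen pixel.
fn quantize_with_diffusion(pixels: &mut [f32], lut: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(pixels.len());
    for i in 0..pixels.len() {
        let value = pixels[i];
        // Closest LUT entry to the (error-adjusted) pixel value.
        let nearest = lut
            .iter()
            .copied()
            .min_by(|a, b| (a - value).abs().partial_cmp(&(b - value).abs()).unwrap())
            .expect("LUT must not be empty");
        out.push(nearest);

        // Diffuse the difference between the pixel and the chosen entry.
        if i + 1 < pixels.len() {
            pixels[i + 1] += value - nearest;
        }
    }
    out
}

fn main() {
    let mut row = vec![0.2, 0.4, 0.6, 0.8];
    let lut = [0.0, 0.5, 1.0];
    println!("{:?}", quantize_with_diffusion(&mut row, &lut));
}
A real implementation would work in 2D and in a randomized order, and would use a nearest-neighbor structure (such as the k-d tree mentioned above) instead of a linear scan over the LUT.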
Yes, you can usually invert a lookup table efficiently (linear time), assuming that the function is a bijection. If your lookup table maps two different keys to the same value, then there is no direct way to invert the table because you would end up needing to have a value that maps to two different keys. If you're okay with this that's fine, though it may call into question why you're trying to build the reverse map.
If you know that every value is unique, you can build an inverse lookup table as follows. First, create a data structure to hold the mapping from values to keys - perhaps a hash table, or a balanced binary tree, or a raw array if the values are small integers. Next, iterate over each key/value pair from the lookup table, then insert the mapping value → key into the new lookup table. This can be done in linear time plus the time required to insert the values into the new container.
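A minimal sketch of that construction, assuming integer keys and values and using a hash map for the inverse (any associative container works):
use std::collections::HashMap;

// Build the inverse of a lookup table, assuming every value is unique.
// Returns None if two keys map to the same value (the table is not a bijection).
fn invert_lut(lut: &HashMap<u32, u32>) -> Option<HashMap<u32, u32>> {
    let mut inverse = HashMap::with_capacity(lut.len());
    for (&key, &value) in lut {
        // A duplicate value means the mapping cannot be inverted directly.
        if inverse.insert(value, key).is_some() {
            return None;
        }
    }
    Some(inverse)
}

fn main() {
    let lut: HashMap<u32, u32> = [(0, 10), (1, 20), (2, 30)].into_iter().collect();
    let inverse = invert_lut(&lut).expect("LUT values should be unique");
    assert_eq!(inverse.get(&20), Some(&1));
}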

Random distribution of data

How do I distribute a small amount of data in a random order in a much larger volume of data?
For example, I have several thousand lines of 'real' data, and I want to insert a dozen or two lines of control data in a random order throughout the 'real' data.
Now I am not trying to ask how to use random number generators; I am asking a statistical question. I know how to generate random numbers, but my question is how I ensure that the data is inserted in a random order while at the same time being fairly evenly scattered through the file.
If I just rely on generating random numbers there is a possibility (albeit a very small one) that all my control data, or at least clumps of it, will be inserted within a fairly narrow selection of 'real' data. What is the best way to stop this from happening?
To phrase it another way, I want to insert control data throughout my real data without there being a way for a third party to calculate which rows are control and which are real.
Update: I have made this a 'community wiki' so if anyone wants to edit my question so it makes more sense then go right ahead.
Update: Let me try an example (I do not want to make this language or platform dependent as it is not a coding question, it is a statistical question).
I have 3000 rows of 'real' data (this amount will change from run to run, depending on the amount of data the user has).
I have 20 rows of 'control' data (again, this will change depending on the number of control rows the user wants to use, anything from zero upwards).
I now want to insert these 20 'control' rows roughly after every 150 rows of 'real' data (3000/20 = 150). However, I do not want it to be exactly that regular, as I do not want the control rows to be identifiable simply based on their location in the output data.
Therefore I do not mind some of the 'control' rows being clumped together or for there to be some sections with very few or no 'control' rows at all, but generally I want the 'control' rows fairly evenly distributed throughout the data.
There's always a possibility that they get close to each other if you do it truly at random :)
But What I would do is:
You have N rows of real data and x of control data
To get the index of the row at which you should insert the i-th control row, I'd use N/(x+1) * i + r, where r is some random number, different for each of the control rows and small compared to N/x. Choose any way of determining r; it can be either a gaussian or even a flat distribution. i is the index of the control row, so 1 <= i <= x.
This way you can be sure that you avoid condensation of your control rows in one single place. You can also be sure that they won't be at regular distances from each other.
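A minimal sketch of that placement rule, assuming the rand crate and a flat distribution for r (the jitter bound is an arbitrary choice, just "small compared to N/x"):
use rand::Rng; // assumes the `rand` crate (0.8-style API)

// Indices (into the real data) at which to insert the x control rows, following
// the rule N/(x+1) * i + r with a small, flat-distributed jitter r.
fn control_row_positions(n: usize, x: usize) -> Vec<usize> {
    let mut rng = rand::thread_rng();
    let spacing = n as f64 / (x + 1) as f64;
    let jitter = spacing / 4.0;
    (1..=x)
        .map(|i| {
            let r = rng.gen_range(-jitter..jitter);
            (spacing * i as f64 + r).round().clamp(0.0, n as f64) as usize
        })
        .collect()
}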
Here's my thought: why don't you just loop through the existing rows and "flip a coin" for each row to decide whether you will insert random data there?
for (int i = 0; i < numberOfExistingRows; i++)
{
    double r = random();  // assuming random() returns a value in [0, 1)
    if (r > 0.5)
    {
        InsertRandomData();
    }
}
This should give you a nice random distribution throughout the data.
Going with the 3000 real data rows and 20 control rows for the following example (I'm better with examples than with English):
If you were to spread the 20 control rows as evenly as possible between the 3000 real data rows, you'd insert one at every 150th real data row.
So pick that number, 150, for the next insertion index.
a) Generate a random number between 0 and 150 and subtract it from the insertion index
b) Insert the control row there.
c) Increase insertion index by 150
d) Repeat at step a)
Of course this is a very crude algorithm and it needs a few improvements :)
If the real data is large or much larger than the control data, just generate interarrival intervals for your control data.
So pick a random interval, copy out that many lines of real data, insert control data, repeat until finished. How to pick that random interval?
I'd recommend using a gaussian deviate with the mean set to the real data size divided by the control data size (the real data size could be estimated if necessary, rather than measured or assumed known). Set the standard deviation of this gaussian based on how much "spread" you're willing to tolerate: a smaller stddev means a more tightly concentrated distribution and tighter adherence to uniform spacing, a larger stddev means a more spread-out distribution and looser adherence to uniform spacing.
Now what about the first and last sections of the file? That is: what about an insertion of control data at the very beginning or very end? One thing you can do is to come up with special-case estimates for these... but a nice trick is as follows: start your "index" into the real data at minus half the gaussian mean and generate your first deviate. Don't output any real data until your "index" into the real data is legit.
A symmetric trick at the end of the data should also work quite well: simply keep generating deviates until you reach an "index" at least half the gaussian mean beyond the end of the real data. If the index just before this was off the end, generate data at the end.
You want to look at more than just statistics: it's helpful in developing an algorithm for this sort of thing to look at rudimentary queueing theory. See Wikipedia or the Turing Omnibus, which has a nice, short chapter on the subject titled "Simulation".
Also: in some circumstances non-gaussian distributions, particularly the Poisson distribution, give better, more natural results for this sort of thing. The algorithm outlined above still applies, using half the mean of whatever distribution seems right.
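Assuming the rand and rand_distr crates are available, a minimal sketch of the gaussian-interarrival approach (hypothetical names; the half-mean head start is the trick described above):
use rand::thread_rng;
use rand_distr::{Distribution, Normal}; // assumes the `rand` and `rand_distr` crates

// Gap (in real-data rows) before the next control row: a gaussian deviate with
// mean = real rows / control rows and a tunable spread, clamped to at least 1.
fn next_gap(mean: f64, std_dev: f64) -> usize {
    let normal = Normal::new(mean, std_dev).expect("std_dev must be finite and non-negative");
    let gap = normal.sample(&mut thread_rng());
    gap.max(1.0).round() as usize
}

fn main() {
    let real_rows = 3000.0;
    let control_rows = 20.0;
    let mean = real_rows / control_rows; // 150 for the example in the question

    // Start half a mean "before" the data, as suggested above, so an insertion
    // near the very beginning of the file is possible.
    let mut index: f64 = -mean / 2.0;
    let mut positions = Vec::new();
    while index < real_rows {
        index += next_gap(mean, mean / 4.0) as f64;
        if index >= 0.0 && index < real_rows {
            positions.push(index as usize);
        }
    }
    println!("{:?}", positions);
}
The standard deviation (mean / 4.0 here) is the "spread" knob described above; the number of control rows produced will vary slightly from run to run, which is part of what keeps their positions unpredictable.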

Resources