Apache Spark - what is the best data structure for three-dimensional data? - data-structures

I am working on an application with a huge amount of different three-dimensional data. Each 3-dimensional block is relatively small (around 100 x 100 x 1000), but there will likely be millions of these objects. Now I wonder if anyone has experience dealing with such data in Breeze. Although I could use nested data structures, such as a matrix of vectors, it is important to be able to address single values of the structure by index (x, y, z). Is it better to define my own structure like Point3d(x, y, z), where x, y and z are themselves vectors, or to use predefined Breeze classes like DenseMatrix? My question is how performance is affected by these alternatives.
Thanks for your replies
Rolf-Dieter

In my experience, for performance, the simpler the object the better. That means using only primitive types, no nested objects, etc. Simple objects are faster to serialize and are smaller, so you can pack more of them into memory.
In your case, I think one 9-element tuple is better than three 3-element tuples:
(x1, x2, x3, y1, y2, y3, z1, z2, z3)
is better than
((x1, x2, x3), (y1, y2, y3), (z1, z2, z3))
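To make the indexing point concrete, here is a minimal sketch (in Python/NumPy purely for illustration; the same row-major layout carries over to a flat Scala Array[Double] or a Breeze DenseVector) showing how a single flat buffer of primitives still supports addressing by (x, y, z):

```python
import numpy as np

# Illustrative sketch: store each 100 x 100 x 1000 block as one flat array of
# primitives and compute the (x, y, z) offset explicitly.
NX, NY, NZ = 100, 100, 1000

def flat_index(x, y, z, ny=NY, nz=NZ):
    """Row-major offset of element (x, y, z) in a flat 1-D buffer."""
    return (x * ny + y) * nz + z

block = np.zeros(NX * NY * NZ, dtype=np.float64)  # one flat primitive buffer
block[flat_index(3, 7, 42)] = 1.5                 # write a single value
value = block[flat_index(3, 7, 42)]               # read it back by (x, y, z)
```

A flat primitive array like this is also about the cheapest thing for Spark's serializers to handle, which is essentially the point about serialization cost above.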

Related

Integrating a 2D function with a line of singularities

How can I numerically integrate Exp[-Abs[x1]-Abs[x2]]/Abs[x1-x2] for x1 and x2 from -Infinity to Infinity?
What strategy should be adopted, Monte Carlo? The Vegas algorithm (I tried implementations in Python and Mathematica) might work, but I cannot find a package that lets me exclude a set of points, as Mathematica's NIntegrate does (here: Exclusions -> x1 == x2). The latter, however, gives different answers depending on the strategy (without error messages, and with integral errors only going up to 1.1, using https://mathematica.stackexchange.com/questions/75426/obtaining-an-nintegrate-error-estimate). I also transform to tangent space using x -> Tan[alpha] to compress the integration limits to +-Pi/2.
Thank you.
[Figure: density plot of the function over x1 and x2 (in tangent space)]
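For what it is worth, here is a rough Monte Carlo sketch of the tangent-space strategy described above: plain uniform sampling over (-Pi/2, Pi/2)^2 with a small exclusion band around x1 == x2. The band introduces an eps-dependent bias and the variance near the singularity is high, which is exactly where an importance-sampling scheme such as Vegas would help; this is only an illustration of the setup, not a validated answer.

```python
import numpy as np

# Rough sketch: estimate the integral of Exp[-Abs[x1]-Abs[x2]]/Abs[x1-x2]
# over the plane by substituting x_i = tan(a_i), sampling (a1, a2) uniformly
# on (-pi/2, pi/2)^2, and discarding samples within eps of the line x1 == x2.
rng = np.random.default_rng(0)
n, eps = 1_000_000, 1e-6

a = rng.uniform(-np.pi / 2, np.pi / 2, size=(n, 2))
x1, x2 = np.tan(a[:, 0]), np.tan(a[:, 1])

denom = np.abs(x1 - x2)
mask = denom > eps                              # exclusion band around x1 == x2

f = np.zeros(n)
f[mask] = np.exp(-np.abs(x1[mask]) - np.abs(x2[mask])) / denom[mask]
# Jacobian of the tan substitution: dx1 dx2 = sec^2(a1) sec^2(a2) da1 da2.
f[mask] /= np.cos(a[mask, 0]) ** 2 * np.cos(a[mask, 1]) ** 2

volume = np.pi ** 2                             # area of the (a1, a2) domain
estimate = volume * f.mean()
stderr = volume * f.std() / np.sqrt(n)
print(f"integral ~ {estimate:.4f} +/- {stderr:.4f}")
```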

Interpolation algorithm

I have a question regarding how to do interpolation in a case like the following:
There are basically two sets of data, "o" and "*". In any given case one of them is known, and I am trying to get the other by interpolation. There are some assumptions/conditions, listed below:
p1, p2, p3, ... are the positions; p12, p23, ... are the values for the intervals between them. The same goes for d1, d2, d3, ... and d12, d23, ....
both o and * are distributed along a common axis (the x axis in this case)
both o and * are equidistantly spaced, meaning
p2 - p1 = p3 - p2 = ...
and
d2 - d1 = d3 - d2 = ...
all positions (p1, p2, p3, ..., d1, d2, d3, ...) are known; one set of interval values is known (e.g. p12 and p23), the other is unknown (e.g. d12 and d23).
One example:
If p12 and p23 are known and we want to calculate d23, d34 and d45, we simply take the contribution of each known value, weighted by the length of its overlap with the target interval in the other data set.
I am just wondering: in the computer-science sense, is there an efficient interpolation algorithm for this particular setup? My intuition is that because all the data are equidistantly spaced, there should be some sort of simplification/acceleration that can be done. Can anyone point me towards some literature? Thanks a lot.
What you're trying to do is take a known set of points, use that to interpolate a function, and then evaluate that interpolated function at another set of points.
This is a huge topic. You can make your function piecewise linear, piecewise polynomial, a Fourier series, wavelet-based, and so on; it all comes down to what kind of underlying function you think you are trying to represent, and that depends on your underlying problem.
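As a concrete illustration of the piecewise-linear option, here is a minimal sketch. The grids and values are made up, and it assumes each interval value such as p12 is attached to the midpoint of its interval:

```python
import numpy as np

# Minimal piecewise-linear sketch: the names p, d and the numbers below are
# placeholders for the equidistant grids in the question.
p = np.array([0.0, 2.0, 4.0, 6.0, 8.0])        # positions p1..p5 (equidistant)
p_vals = np.array([1.0, 3.0, 2.0, 5.0])        # known interval values p12, p23, ...

d = np.array([0.5, 2.5, 4.5, 6.5, 8.5])        # positions d1..d5 (equidistant)

p_mid = 0.5 * (p[:-1] + p[1:])                 # midpoints of the p-intervals
d_mid = 0.5 * (d[:-1] + d[1:])                 # midpoints of the d-intervals

# Linear interpolation of the known values onto the unknown grid's midpoints.
d_vals = np.interp(d_mid, p_mid, p_vals)
print(d_vals)                                  # estimates for d12, d23, ...
```

Because both grids are uniform, the interpolation weights repeat in a regular pattern from one target interval to the next, which is the sort of simplification the question hints at: the weights can be computed once and reused.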

Importance of a Random Variable using Entropy or other method

I have a two-dimensional random vector x = [x1, x2]^T with a known joint probability density function (PDF). The PDF is non-Gaussian and the two entries of the random vector are statistically dependent. I need to show that, for example, x1 is more important than x2 in terms of the amount of information it carries. Is there a classical solution to this problem? Can I show that, for example, n% of the total information carried by x is in x1 and the remaining (100 - n)% is carried by x2?
I assume that the standard way of measuring the amount of information is by calculating the entropy. Any clues?
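One possible starting point (a sketch under my own assumptions, not a canonical recipe): discretize the known joint PDF on a grid and compare the marginal entropies H(x1) and H(x2) together with the mutual information I(x1; x2), keeping in mind that the absolute entropy values depend on the bin width.

```python
import numpy as np

# Sketch: discretize a known joint PDF and compare marginal entropies and
# mutual information. joint_pdf below is a made-up, non-Gaussian, dependent
# placeholder; substitute your own density.
def joint_pdf(x1, x2):
    return np.exp(-np.abs(x1) - np.abs(x2) - 0.5 * np.abs(x1 * x2))

grid = np.linspace(-10, 10, 801)
dx = grid[1] - grid[0]
X1, X2 = np.meshgrid(grid, grid, indexing="ij")

P = joint_pdf(X1, X2) * dx * dx
P /= P.sum()                                   # discrete joint pmf on the grid

p1 = P.sum(axis=1)                             # marginal of x1
p2 = P.sum(axis=0)                             # marginal of x2

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))             # Shannon entropy, in bits

H1, H2, H12 = entropy(p1), entropy(p2), entropy(P.ravel())
I = H1 + H2 - H12                              # mutual information I(x1; x2)
print(H1, H2, I)
```

Whether a statement like "n% of the information is in x1" is meaningful depends on how you attribute the shared part I(x1; x2) between the two entries; that attribution is a modelling choice rather than something entropy decides for you.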

Is there a standard approach to find related/similar objects?

Suppose I have a set of entities (for example people with their physical characteristics) and I want to find, for a given entity X, all entities related (or similar) to it, for some definition of similarity.
I can easily find such entities along one dimension (all people whose height is within a certain threshold of X's height), but is there some approach I can use to find similar entities considering more than one attribute?
It is going to depend on what you define as similarity, but you can use the same approach you take for 1D in any dimension, with a small generalization. Assuming each element is represented as a vector, you measure the distance between two vectors x, y as d = |x - y| and accept/reject depending on this d and some threshold.
Here, the minus operator is element-wise vector subtraction:
(a1,a2,...,an)-(b1,b2,...,bn)=(a1-b1,a2-b2,...,an-bn)
and the absolute value is again for vectors:
|(a1,a2,...,an)| = sqrt(a1^2 + a2^2 + ... + an^2).
It is easy to see that this is a generalization of your 1D example: applying the same approach to vectors with a single element reduces to exactly what you already do.
The downside of this approach is that (0, 0, 0, ..., 0, 10^20) and (0, 0, 0, ..., 0) will be very far away from each other, which might or might not be what you are after; in that case you might need a different distance metric, but that really depends on what exactly you are after.
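A minimal sketch of that thresholding, with (made-up) attributes and a per-attribute standardization step added as one way to blunt the downside just mentioned:

```python
import numpy as np

# Illustrative entities: the attributes, values, and threshold are made up.
people = np.array([
    [180.0, 75.0, 30.0],    # height (cm), weight (kg), age (years)
    [178.0, 72.0, 29.0],
    [160.0, 95.0, 55.0],
])
x = np.array([179.0, 74.0, 31.0])              # the query entity X

# Standardize columns so no single attribute dominates the distance.
mu, sigma = people.mean(axis=0), people.std(axis=0)
z_people, z_x = (people - mu) / sigma, (x - mu) / sigma

d = np.linalg.norm(z_people - z_x, axis=1)     # Euclidean distance |x - y|
threshold = 1.0
similar = np.where(d <= threshold)[0]          # indices of "similar" entities
print(d, similar)
```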

Algorithm to choose multiple discrete parameters based on input vector

I am faced with the following problem: given a point in k-dimensional space, choose a set of discrete parameters to maximize the probability of a positive (binary) outcome. I have training examples of the same form, for example:
   point        parameters   good?
   ----------   ----------   -----
1) x1 x2 x3     p1 p2 p3     NO
2) x1 x2 x3     p1 p2 p3     YES
3) x1 x2 x3     p1 p2 p3     YES
...etc.
All parameters are free variables, and there is an arbitrary number of them (k is also arbitrary). I have considered:
Generate a clustering of the points, tune the parameters for each cluster, and then associate each new point with a cluster.
Develop a model to predict each parameter separately.
Both have major drawbacks. I was wondering if there is a more systematic approach to this (it seems like a common enough problem). Can anyone point me towards some relevant reading or an algorithm?
Thanks, and I apologize in advance if this is the wrong place to ask these kinds of questions.
This is a classic classification (data mining) problem and it's up to you to pick which algorithm to use. The most common approaches are:
KNN (k-nearest-neighbor)
Bayes classifier
SVM (support vector machine)
Decision trees
You should read up on them and decide which one is best for your problem; unfortunately, there is no single 'best' approach for all domains and data.
Another simple technique you haven't mentioned is k-nearest neighbours - find the nearest positive point in k-dimensional space to your input point and copy its choice of parameters.
If you knew or could find out more about what the k-dimensional space or parameters actually mean, you might be able to use this knowledge to construct a good model.
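A minimal sketch of that nearest-positive-neighbour idea (the arrays below are made-up stand-ins for the training table in the question):

```python
import numpy as np

# 1-NN lookup restricted to the rows with a YES outcome: find the nearest
# positive training point and copy its parameter choice.
points = np.array([[0.1, 0.9, 0.3],            # x1 x2 x3 per training example
                   [0.8, 0.2, 0.5],
                   [0.4, 0.4, 0.7]])
params = np.array([[1, 0, 2],                  # p1 p2 p3 per training example
                   [0, 1, 1],
                   [2, 2, 0]])
good = np.array([False, True, True])           # the YES/NO outcome column

def suggest_parameters(query):
    """Copy the parameters of the nearest training point with a YES outcome."""
    pos_points, pos_params = points[good], params[good]
    d = np.linalg.norm(pos_points - query, axis=1)
    return pos_params[np.argmin(d)]

print(suggest_parameters(np.array([0.5, 0.3, 0.6])))
```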