How to compute a distance between two byte arrays?

I need to calculate the distance between two byte arrays of the same length. In particular, I am looking for an approach that yields a distance with the following properties:
if the two arrays are very similar to each other, then the distance should be very small;
otherwise, the distance should be very large.
Basically, I'm looking for a way to measure the difference between two arrays.
UPDATE: As suggested, I provide the following additional information about the content of a byte array. A sequence of bytes contains the features of an image, so an image is divided into small regions, and some color information is measured for each region (each byte encodes information relating to a single region): when a bit is set within a byte, then it means that a given feature is present within the region.
Therefore, given two sequences of bytes, I would like to compare using a suitable distance measure. I read about Bhattacharyya distance, but I do not know how to apply it in this case, so I was wondering if there were other distance measures to compare two byte arrays.

You can use the Euclidean distance for this. Basically, you add up the squares of the differences between each pair of corresponding elements in your arrays and take the square root of that sum.
However, there are other metrics that could suit your data better, for example Pearson correlation, cosine similarity, Hamming distance, etc.

In order of complexity:
an L1 distance = Sum | xi - yi |
or a (squared) L2 distance = Sum | xi - yi |^2
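As a minimal sketch of the measures named in these answers, the following Python computes the L1, Euclidean (L2) and Hamming distances between two equal-length byte sequences. The function names are illustrative, not from any library. Hamming distance is included on the assumption that each bit is an independent feature flag, as the question's update describes.

```python
import math


def _check_lengths(a: bytes, b: bytes) -> None:
    # All three distances below are only defined for equal-length inputs.
    if len(a) != len(b):
        raise ValueError("byte arrays must have the same length")


def l1_distance(a: bytes, b: bytes) -> int:
    """Sum of absolute differences between corresponding bytes."""
    _check_lengths(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a: bytes, b: bytes) -> float:
    """Euclidean distance: square root of the sum of squared differences."""
    _check_lengths(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of bit positions at which the two arrays differ."""
    _check_lengths(a, b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

If each bit really marks the presence of a feature, the Hamming distance counts exactly the features on which the two images disagree, whereas L1 and L2 treat each byte as a magnitude.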


Number of subsets whose XOR contains less than two set bits

I have an array A (size <= 10^5) of numbers (<= 10^8), and I need to answer some queries (50000): for L, R, how many subsets of the elements in the range [L, R] have an XOR that has 0 or 1 bits set (that is, zero or a power of 2)? Point modifications to the array are made between the queries, so I can't really do offline processing or use techniques like square root decomposition.
I have an approach where I use DP to calculate for a given range, something on the lines of this:
But this is clearly too slow. This feels like a classical segment tree problem, but I can't work out what data to store at each node so that I can use the left child and right child to compute the answer for a given range.
Yeah, that DP won't be fast enough.
What will be fast enough is applying some linear algebra over GF(2), the Galois field with two elements. Each number can be interpreted as a bit-vector; adding/subtracting vectors is XOR; scalar multiplication isn't really relevant.
The data you need for each segment is (1) how many numbers there are in the segment and (2) a basis for the subspace generated by the numbers in the segment, which will consist of at most 27 numbers because all numbers are less than 2^27. The basis for a one-element segment is just that number if it's nonzero, else the empty set. To find a basis for the span of the union of two bases, use Gaussian elimination and discard the zero vectors.
Given the length of an interval and a basis for it, you can count the number of good subsets using the rank-nullity theorem. Basically, for each target number, use your Gaussian elimination routine to test whether the target number belongs to the subspace. If so, there are 2^(length of interval minus size of basis) subsets. If not, the answer is zero.
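A rough sketch of the per-node data and the counting step might look like this in Python. The segment tree itself is omitted and only the leaf, merge and count operations are shown. The names (`leaf`, `merge`, `count_good_subsets`) and the basis layout, one slot per leading bit, are assumptions for illustration. Note that the empty subset is counted here as having XOR 0; subtract one if the problem excludes it.

```python
BITS = 27  # every value is below 2**27, as noted in the answer


def insert(basis, value):
    """Gaussian elimination step: reduce value and store it if independent.

    basis is a list of BITS slots; slot k holds a vector whose highest
    set bit is k, or 0 if there is none.
    """
    for bit in range(BITS - 1, -1, -1):
        if not (value >> bit) & 1:
            continue
        if basis[bit] == 0:
            basis[bit] = value
            return True
        value ^= basis[bit]
    return False  # value reduced to zero: already in the span


def in_span(basis, target):
    """Test whether target is a XOR of some basis vectors."""
    for bit in range(BITS - 1, -1, -1):
        if (target >> bit) & 1:
            if basis[bit] == 0:
                return False
            target ^= basis[bit]
    return True


def leaf(value):
    """Node for a one-element segment: (length, basis)."""
    basis = [0] * BITS
    insert(basis, value)
    return (1, basis)


def merge(left, right):
    """Combine two child nodes into the node for their union."""
    merged = list(left[1])
    for vector in right[1]:
        if vector:
            insert(merged, vector)
    return (left[0] + right[0], merged)


def count_good_subsets(node):
    """Count subsets whose XOR is 0 or a power of two."""
    length, basis = node
    rank = sum(1 for vector in basis if vector)
    per_target = 2 ** (length - rank)  # rank-nullity: size of each fibre
    targets = [0] + [1 << k for k in range(BITS)]
    return sum(per_target for t in targets if in_span(basis, t))
```

For the elements 1, 2, 3 the basis has rank 2, so each reachable target is hit by 2^(3-2) = 2 subsets; the targets 0, 1 and 2 are reachable, giving 6.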

How does Principal Component Initialization work for determining the weights of the map vectors in Self Organizing Maps?

I have been studying a fundamental SOM initialization method and want to understand exactly how this process, PCI, initializes the weight vectors on the map. My understanding is that for a two-dimensional map, this method looks at the eigenvectors for the two largest eigenvalues of the data matrix and then uses the subspace spanned by these eigenvectors to initialize the map. Does that mean that, to get the initial map weights, this method takes random linear combinations of the two largest eigenvectors? Is there a pattern?
For example, for 40 input data vectors on the map, does the lininit initialization method take combinations a1*[e1] + a2*[e2] where [e1] and [e2] are the two largest eigenvectors and a1 and a2 are random integers ranging from -3 to 3? Or is there a different mechanism? I was looking to make sure I knew exactly how lininit takes the two largest eigenvectors of the input data matrix and uses them to construct the initial weight vectors for the map.
The SOM creates a map that has the neighbourhood relationship between nearby nodes. Random initialisation does not help this process, since the nodes start randomly. Therefore, the idea of using the PCA initialisation is just a shortcut to get the map closer to the final state. This saves a lot of computation.
So how does this work? The first two principal components (PCs) are used. Set the initial weights as linear combinations of the PCs. Rather than using random a1 and a2, the coefficients are set in a range that corresponds to the scale of the principal components.
For example, for a 5x3 map, a1 and a2 can both be in the range (-1, 1) with the relevant number of elements. In other words, for the 5x3 map, a1 = [-1.0 -0.5 0.0 0.5 1.0] and a2 = [-1.0 0.0 1.0], with 5 nodes and 3 nodes, respectively.
Then set each of the weights of nodes. For a rectangular SOM, each node has indices [m, n]. Use the values of a1[m] and a2[n]. Thus, for all m = [1 2 3 4 5] and n = [1 2 3]:
weight[m, n] = a1[m] * e1 + a2[n] * e2
That is how to initialize the weights using the principal components. This makes the initial state globally ordered, so now the SOM algorithm is used to create the local ordering.
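The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `e1` and `e2` are taken as already computed (for example by a PCA routine), the coefficient range is fixed at -1 to 1 as in the 5x3 example, and the helper names `linspace` and `linear_init` are made up here. Indices are 0-based in the code.

```python
def linspace(lo, hi, count):
    """Return count evenly spaced values from lo to hi inclusive."""
    if count == 1:
        return [(lo + hi) / 2.0]
    step = (hi - lo) / (count - 1)
    return [lo + i * step for i in range(count)]


def linear_init(e1, e2, rows, cols):
    """Build weight[m][n] = a1[m] * e1 + a2[n] * e2 for a rows x cols map.

    e1, e2: the first two principal components, as equal-length lists.
    """
    a1 = linspace(-1.0, 1.0, rows)
    a2 = linspace(-1.0, 1.0, cols)
    return [
        [[a1[m] * u + a2[n] * v for u, v in zip(e1, e2)] for n in range(cols)]
        for m in range(rows)
    ]


# Example with hypothetical components in a 3-dimensional input space.
weights = linear_init([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], rows=5, cols=3)
```

Each node then sits on a regular grid inside the plane spanned by the two components, which is what gives the map its global ordering before training starts.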
The Principal Component part of the name is a reference to principal component analysis (PCA).
Here is the idea. You start with data points placed at vectors of many underlying factors. But they may be correlated in your data. So, for example, if you're measuring height, weight, blood pressure, etc, you expect that tall people will weigh more. But what you want to do is replace this with vectors of factors that are not correlated with each other in your data.
So your principal component is a vector of length 1 which is as strongly correlated as possible with the variation in your dataset.
Your secondary component is the vector of length 1 at right angles to the first which is as strongly correlated as possible with the rest of the variation in your data set.
Your tertiary component is the vector of length 1 at right angles to the first two which is as strongly correlated as possible with the rest of the variation in your data set.
And so on.
In practice you may start with many factors, but most of the information is captured in just the first few. For example in the results of intelligence testing the first component is IQ and the second is the difference between how you are at verbal and quantitative reasoning.
How this applies to SOM initialization is that a simple linear model built off of PCA analysis is a pretty good guess for the answer that you're looking for, so starting there reduces how much work you have to do to finish getting the answer.

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid with an array of lamp coordinates. Each lamp provides illumination to every square on their x axis, every square on their y axis, and every square that lies in their diagonal (think of a Queen in chess). Given an array of query coordinates, determine whether that point is illuminated or not. The catch is when checking a query all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity but it's not that fast, exponential in fact. Where should I focus on instead, speed or space? Also, if you have any input as to how you would solve this problem please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both nw-se and ne-sw. The diagonals through this point are both numbered zero. The nw-se diagonals increase per pixel in, e.g., the northeast direction, and decrease (go negative) to the southwest. Similarly, the ne-sw diagonals are numbered increasing in, e.g., the northwest direction and decreasing (negative) to the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather, you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If it has 3 or fewer entries, it's a constant time operation to see if they're all adjacent.
This solution requires the input points to be represented in 4 lists. Since they need to be represented in one list, you can argue that this algorithm requires only a constant factor of space with respect to the input. (I.e. the same sort of cost as mergesort.)
Run time is expected constant per query point for 4 hash table lookups.
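Here is a hedged Python sketch of the four-map idea. It makes several assumptions not stated in the answer: diagonals are numbered simply as x - y and x + y (equivalent to placing the origin at (0, 0) rather than at the first lamp), lamps are stored in sets so duplicates collapse, and switching off the lamps around a query is treated as local to that query rather than persisted. The names `build_maps` and `is_lit` are illustrative.

```python
from collections import defaultdict


def build_maps(lamps):
    """Index lamps by column, row, and both diagonals."""
    x_map = defaultdict(set)
    y_map = defaultdict(set)
    diff_map = defaultdict(set)  # one diagonal family: x - y is constant
    sum_map = defaultdict(set)   # the other family: x + y is constant
    for x, y in lamps:
        x_map[x].add((x, y))
        y_map[y].add((x, y))
        diff_map[x - y].add((x, y))
        sum_map[x + y].add((x, y))
    return x_map, y_map, diff_map, sum_map


def is_lit(maps, qx, qy):
    """True if some lamp on a line through (qx, qy) is not on or adjacent to it."""
    x_map, y_map, diff_map, sum_map = maps
    lines = (
        x_map.get(qx, ()),
        y_map.get(qy, ()),
        diff_map.get(qx - qy, ()),
        sum_map.get(qx + qy, ()),
    )
    for lamps_on_line in lines:
        # At most 3 lamps on one line can be on or next to the query,
        # so a longer list always leaves at least one lamp switched on.
        if len(lamps_on_line) > 3:
            return True
        for lx, ly in lamps_on_line:
            if max(abs(lx - qx), abs(ly - qy)) > 1:
                return True
    return False
```

Each query costs four hash lookups plus at most a dozen comparisons, and the maps hold every lamp four times, matching the constant-factor space estimate above.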
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient and easiest to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in a language with unboxed structures, like C. So a machine with ~32 GB of RAM ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10 core machine ought to do a billion queries in well under a minute.
There is a very easy answer that works:
Create a grid of NxN.
For each lamp, increment the count of every cell that is supposed to be illuminated by that lamp.
For each query, check whether the cell at that query has a value > 0.
For each adjacent cell that holds a lamp, find all the cells it illuminates and reduce their count by 1.
This worked fine but hit the size limit when I tried it on a 10000 x 10000 grid.

Compare two lists of distances (floats) in python

I'm looking to compare two lists of distances (floats) in python. The distances represent how far away my robot is from a wall at different angles. One array is my "best guess" distance array and the other is the array of actual distances. I need to return a number between [0, 1] that represents the similarity between these two lists of floats. The distances match up 1 to 1. That is, the distance at index 0 should be compared to the distance at index 0 in the other array. Right now, for each index, I am dividing the smaller number by the larger number to get a percentage difference. Then I am taking the average of these percentage differences (total percentage difference / number of entries in the array) to get a number between 0 and 1. However, my approach does not seem to be accurate enough. Is there a better algorithm for comparing two ordered lists of floats?
It looks like you need a normalized Euclidean distance between two vectors.
It is simple to calculate and you can read more about it here.
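"Normalized Euclidean distance" can be defined in more than one way, so the following Python is just one possible reading, not necessarily the one the answer links to. It divides the Euclidean distance by the sum of the two vectors' norms, which the triangle inequality bounds by 1, and returns one minus that ratio as a similarity. The function name `similarity` is illustrative.

```python
import math


def similarity(guess, actual):
    """Return a similarity in [0, 1]; 1.0 means the lists are identical."""
    if len(guess) != len(actual):
        raise ValueError("lists must have the same length")
    diff = math.sqrt(sum((g - a) ** 2 for g, a in zip(guess, actual)))
    scale = math.sqrt(sum(g * g for g in guess)) + math.sqrt(
        sum(a * a for a in actual)
    )
    if scale == 0.0:
        return 1.0  # both lists are all zeros, so they match exactly
    # Clamp to guard against tiny floating-point overshoot.
    return max(0.0, 1.0 - diff / scale)
```

Unlike averaging per-index ratios, this weights each index by its absolute error, so one badly wrong long-range reading counts for more than a small error on a short one.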

Difference between observations and variables in Matlab

I'm kind of ashamed to even ask this but here goes. In every Matlab help file where the input matrix is a NxD matrix X Matlab describes the matrix arrangement as
Data, specified as a numeric matrix. The rows of X correspond to
observations, and the columns correspond to variables.
Above taken from help of kmeans
I'm kind of confused as to what Matlab means by observations and variables.
Suppose I have a data matrix composed of 100 images. Each image is represented by a feature vector of size 128 x 1. So are the 100 images my observations and the 128 features my variables, or is it the other way around?
Will my data matrix be of size 128 x 100 or 100 x 128?
Eugene's explanation in a statistical and probability construct is great, but I would like to explain it more in the viewpoint of data analysis and image processing.
Think of an observation as one sample from your data set. In this case, one observation is one image. Each sample has some dimensionality associated with it, that is, a number of variables used to represent it.
For example, if we had a set of 100 2D Cartesian points, the number of observations is 100, while the dimensionality, or the total number of variables used to describe a point, is 2: we have an x coordinate and a y coordinate. As such, in the MATLAB universe, we'd place all of these data points into a single matrix. Each row of the matrix denotes one point in your data set. Therefore, the matrix you would create here is 100 x 2.
Now, go back to your problem. We have 100 images and each image can be expressed by 128 features. This looks suspiciously like you are trying to use SIFT or SURF to represent an image, so think of each image as being described by a 128-dimensional vector, or a histogram with 128 bins. Each feature is one of the dimensions that make up the image's description. Therefore, you would have a 100 x 128 matrix. Each row represents one image, where each image is represented as a 1 x 128 feature vector.
In general, MATLAB's machine learning and data analysis algorithms assume that your matrix is M x N, where M is the total number of points that make up your data set while N is the dimensionality of one such point in your data set. In MATLAB's universe, the total number of observations is equal to the total number of points in your data set, while the total number of features / distinct attributes to represent one sample is the total number of variables.
Observation: One sample from your data set
Variable: One feature / attribute that helps describe an observation or sample in your data set.
Number of observations: Total number of points in your data set
Number of variables: Total number of features / attributes that make up an observation or sample in your data set.
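To make the layout concrete, here is a small language-neutral sketch in Python rather than MATLAB, using plain nested lists. The helper names and the dummy feature values are made up for illustration. It stacks 100 feature vectors of length 128 as rows, and shows the transpose you would need if your data arrived as 128 x 100 instead.

```python
def stack_observations(feature_vectors):
    """Build a matrix with one row per observation, one column per variable."""
    width = len(feature_vectors[0])
    if any(len(vec) != width for vec in feature_vectors):
        raise ValueError("all feature vectors must have the same length")
    return [list(vec) for vec in feature_vectors]


def shape(matrix):
    """Return (number of rows, number of columns)."""
    return (len(matrix), len(matrix[0]))


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


# 100 images, each described by a 128-element feature vector (dummy values).
images = [[float(i)] * 128 for i in range(100)]
X = stack_observations(images)       # 100 x 128: rows are observations
X_wrong_way = transpose(X)           # 128 x 100: would need transposing back
```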
It looks like you are talking about some specific statistical/probabilistic functions. In statistics or probability theory there are some random variables that are results of some kind of measurements/observations over time (or some other dimension). So such a matrix is just a collection of N measurements of D different random variables.
