Benefits of nearest neighbor search with Morton-order?

While working on a simulation of particle interactions, I stumbled across grid indexing in Morton order (Z-order) (Wikipedia link), which is said to provide an efficient nearest-neighbor cell search. The main reason I've read for this is the almost sequential ordering of spatially close cells in memory.
Being in the middle of a first implementation, I cannot wrap my head around how to efficiently implement the algorithm for the nearest neighbors, especially in comparison to a basic uniform grid.
1. Given a cell (x, y) it is trivial to obtain the 8 neighbor cell indices and compute the respective z-index. Although this provides constant access time to the elements, the z-index either has to be calculated or looked up in predefined tables (one per axis, with the results OR'ed together). How can this possibly be more efficient? Is it true that accessing elements in an array A in an order such as A[0] -> A[1] -> A[3] -> A[4] -> ... is more efficient than in an order such as A[1023] -> A[12] -> A[456] -> A[56] -> ...?
2. I expected that there would be a simpler algorithm to find the nearest neighbors in z-order, something along the lines of: find the first cell of the neighbors, then iterate. But this can't be true, as it works nicely only within blocks of 2^4 cells. There are two problems, however: when the cell is not on the boundary, one can easily determine the first cell of the block and iterate through the cells in the block, but one has to check whether each cell is actually a nearest neighbor. Worse is the case when the cell lies on the boundary; then one has to take 2^5 cells into account. What am I missing here? Is there a comparatively simple and efficient algorithm that will do what I need?
The question in point 1 is easily testable, but I'm not very familiar with the underlying instructions that the described access pattern generates and would really like to understand what is going on behind the scenes.
Thanks in advance for any help, references, etc...
EDIT:
Thank you for clarifying point 1! So, with Z-ordering, the cache hit rate is increased on average for neighbor cells, interesting. Is there a way to profile cache hit/miss rates?
Regarding point 2:
I should add that I understand how to build the Morton-ordered array for a point cloud in R^d, where the index i = f(x1, x2, ..., xd) is obtained by bitwise interleaving etc. What I'm trying to understand is whether there is a better way than the following naive ansatz to get the nearest neighbors (here in d=2, "pseudo code"):
// Get the z-indices of cells adjacent to the cell containing (x, y).
// Accessing the contents of the cells is irrelevant here.
point = (x, y)            // (x, y) ∈ R^2
zindex = f(x, y)
(zx, zy) = f^(-1)(zindex) // grid coordinates of the containing cell
nc = [(zx - 1, zy - 1), (zx - 1, zy), (zx - 1, zy + 1),   // neighbor grid
      (zx,     zy - 1),               (zx,     zy + 1),   // coordinates
      (zx + 1, zy - 1), (zx + 1, zy), (zx + 1, zy + 1)]
ni = [f(c[0], c[1]) for c in nc]  // neighbor z-indices
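For concreteness, here is a minimal Python sketch of what f, f^(-1) and the neighbor computation above could look like for a 2-D grid, using the usual bit-interleaving tricks (the function names are just illustrative, and bounds checks for cells on the grid edge are omitted):

def interleave(v):
    # Spread the bits of a 16-bit integer so they occupy the even bit positions.
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def deinterleave(v):
    # Inverse of interleave: collect the even bits back into a 16-bit integer.
    v &= 0x55555555
    v = (v | (v >> 1)) & 0x33333333
    v = (v | (v >> 2)) & 0x0F0F0F0F
    v = (v | (v >> 4)) & 0x00FF00FF
    v = (v | (v >> 8)) & 0x0000FFFF
    return v

def z_index(zx, zy):               # f(x, y), on grid coordinates
    return interleave(zx) | (interleave(zy) << 1)

def grid_coords(zindex):           # f^(-1)
    return deinterleave(zindex), deinterleave(zindex >> 1)

def neighbor_indices(zindex):
    # Z-indices of the 8 cells around the given cell (no boundary handling).
    zx, zy = grid_coords(zindex)
    return [z_index(zx + dx, zy + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]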

In modern multi-level cache-based computer systems, spatial locality is an important factor in optimising access time to data elements.
Put simply, this means that if you access a data element in memory, then accessing another data element that is nearby (has an address close to the first) can be cheaper by several orders of magnitude than accessing a data element that is far away.
When 1-D data is accessed sequentially, as in simple image processing or sound processing, or when iterating over data structures processing each element the same way, then arranging the data elements in memory in that order tends to achieve spatial locality - i.e. since you access element N+1 just after accessing element N, the two elements should be placed next to each other in memory.
Standard C arrays (and many other data structures) have this property.
The point of Morton ordering is to support schemes where data is accessed two-dimensionally instead of one-dimensionally. In other words, after accessing element (x,y), you may go on to access (x+1,y) or (x,y+1) or similar.
The Morton ordering means that (x,y), (x+1,y) and (x,y+1) are near to each other in memory. In a standard C multidimensional array, this is not necessarily the case. For example, in the array myArray[10000][10000], (x,y) and (x,y+1) are 10000 elements apart - too far apart to take advantage of spatial locality.
With a Morton ordering, a standard C array can still be used as a store for the data, but the calculation to work out where (x,y) lives is no longer as simple as store[x+y*rowsize].
To implement your application using Morton ordering, you need to work out how to transform a coordinate (x,y) into the address in the store. In other words, you need a function f(x,y) that can be used to access the store as store[f(x,y)].
It looks like you need to do some more research - follow the links from the Wikipedia page, particularly the ones on the BIGMIN function.

Yes, accessing array elements in order is indeed faster. The CPU loads memory from RAM into cache in chunks (cache lines). If you access sequentially, the CPU can prefetch the next chunk easily, and you won't notice the load time. If you access randomly, it can't. The principle at work is spatial locality of reference: accessing memory near memory you've already accessed is faster.
In your example, when loading A[0], A[1], A[3] and A[4], the processor probably loaded several of those elements in one chunk, making those accesses essentially free. Moreover, if you then go on to access A[5], it can prefetch that chunk while you operate on A[1] and the others, making the load time effectively nothing.
However, if you load A[1023], the processor must load that chunk. Then it must load A[12] - which it hasn't already loaded and thus must fetch as a new chunk. Et cetera, et cetera. I have no idea about the rest of your question, however.
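The effect described above is easy to see with a rough timing sketch like the following (Python/NumPy here; absolute numbers depend entirely on your hardware). As for the profiling question in the edit: on Linux, perf stat -e cache-references,cache-misses or Valgrind's cachegrind can report actual cache hit/miss counts.

import time
import numpy as np

n = 10_000_000
data = np.arange(n, dtype=np.int64)
sequential = np.arange(n)              # 0, 1, 2, ... (cache friendly)
shuffled = np.random.permutation(n)    # the same indices in random order

for name, idx in (("sequential", sequential), ("random", shuffled)):
    start = time.perf_counter()
    total = data[idx].sum()            # gather via the index array, then sum
    print(name, total, time.perf_counter() - start)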

Related

Data structure and algorithms for 1D velocity model using layers?

This is for a geophysical analysis program I am creating. I already have code to do all this, but I am looking for inspiration and ideas (good data structures and algorithms).
What I want to model:
Velocity as a function of depth (z)
The model is built up from multiple layers (<10)
Every layer is accessible by an index going from 0 for the topmost layer to n for the bottommost layer
Every layer has velocity as a linear function of depth (gradient a_k and axis intercept b_k of the kth layer)
Every layer has a top and bottom depth (z_k-1 and z_k)
The model is complete, there is no space between layers. The point directly between two layers belongs to the lower layer
Requirements:
Get velocity at an arbitrary depth within the model. This will be done on the order of 1k to 10k times, so it should be well optimized.
Access to the top and bottom depths, gradients and intercepts of a layer by the layer index
What I have so far:
I have working Python code where every layer is saved as a numpy array with the values of z_k (bottom depth), z_k-1 (top depth), a_k (velocity gradient) and b_k (axis intercept). To evaluate the model at a certain depth, I get the layer index, use that to get the parameters of the layer, and pass them to a function that evaluates the linear velocity gradient.
So you have a piecewise linear dependence, where the z-coordinates of the piece boundaries are irregular, and you want to get the function value at a given z.
Note that there is little point in using binary search for 10 pieces (3-4 rounds of binary search might be slower than 9 simple comparisons).
But what precision do your depth queries have? Note that you could store a lookup table at 1-meter resolution, or even at 1-millimeter resolution - only 10^7 entries - providing O(1) access to any precalculated velocity value.
For a limited number of pieces it is also possible to build a single long formula (involving integer division), but it would probably be slower.
Example for an arbitrary three-piece polyline with break points at 2 and 4.5:
f = f0 + 0.2*int(z/2.0)*(z-2.0) + 0.03*int(z/4.5)*(z-4.5)
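For the lookup itself, here is a small Python/NumPy sketch of the layered scheme described in the question (bottom depths, gradients a_k and intercepts b_k per layer); the class and attribute names are made up for illustration. It uses np.searchsorted for brevity, though, as noted above, a plain comparison loop may be just as fast for fewer than ten layers, and a precalculated table is an option if the query precision is limited.

import numpy as np

class LayeredVelocityModel:
    # v(z) = a[k] * z + b[k] inside layer k; a point exactly on a boundary
    # belongs to the lower (deeper) layer, as stated in the question.
    # No bounds check for depths below the deepest layer.
    def __init__(self, tops, bottoms, gradients, intercepts):
        self.tops = np.asarray(tops, dtype=float)
        self.bottoms = np.asarray(bottoms, dtype=float)
        self.a = np.asarray(gradients, dtype=float)
        self.b = np.asarray(intercepts, dtype=float)

    def layer_index(self, z):
        # First layer whose bottom depth is strictly greater than z.
        return int(np.searchsorted(self.bottoms, z, side="right"))

    def velocity(self, z):
        k = self.layer_index(z)
        return self.a[k] * z + self.b[k]

# Example: two layers, the boundary at depth 10 belongs to the lower layer.
model = LayeredVelocityModel(tops=[0, 10], bottoms=[10, 30],
                             gradients=[0.1, 0.05], intercepts=[1.5, 2.0])
print(model.velocity(10.0))   # 0.05 * 10 + 2.0 = 2.5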

How does Principal Component Initialization work for determining the weights of the map vectors in Self Organizing Maps?

I studied a fundamental SOM initialization and was looking to understand exactly how this process, PCI (Principal Component Initialization), works for initializing the weight vectors on the map. My understanding is that for a two-dimensional map, this initialization method looks at the eigenvectors of the two largest eigenvalues of the data matrix and then uses the subspace spanned by these eigenvectors to initialize the map. Does that mean that, in order to get the initial map weights, this method takes random linear combinations of the two largest eigenvectors? Is there a pattern?
For example, for 40 input data vectors on the map, does the lininit initialization method take combinations a1*[e1] + a2*[e2], where [e1] and [e2] are the two largest eigenvectors and a1 and a2 are random integers ranging from -3 to 3? Or is there a different mechanism? I want to make sure I know exactly how lininit takes the two largest eigenvectors of the input data matrix and uses them to construct the initial weight vectors for the map.
The SOM creates a map that has the neighbourhood relationship between nearby nodes. Random initialisation does not help this process, since the nodes start randomly. Therefore, the idea of using the PCA initialisation is just a shortcut to get the map closer to the final state. This saves a lot of computation.
So how does this work? The first two principal components (PCs) are used. Set the initial weights as linear combinations of the PCs. Rather than using random a1 and a2, the weights are set in a range that corresponds to the scale of the principal components.
For example, for a 5x3 map, a1 and a2 can both be in the range (-1, 1) with the relevant number of elements. In other words, for the 5x3 map, a1 = [-1.0 -0.5 0.0 0.5 1.0] and a2 = [-1.0 0.0 1.0], with 5 nodes and 3 nodes, respectively.
Then set each of the weights of nodes. For a rectangular SOM, each node has indices [m, n]. Use the values of a1[m] and a2[n]. Thus, for all m = [1 2 3 4 5] and n = [1 2 3]:
weight[m, n] = a1[m] * e1 + a2[n] * e2
That is how to initialize the weights using the principal components. This makes the initial state globally ordered, so now the SOM algorithm is used to create the local ordering.
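A rough NumPy sketch of that construction (not the actual SOM Toolbox lininit code, just an illustration of the weight[m, n] = a1[m]*e1 + a2[n]*e2 idea, with the PCs scaled to roughly match the spread of the data):

import numpy as np

def pca_linear_init(data, map_rows, map_cols):
    # data: (n_samples, n_features); returns weights of shape
    # (map_rows, map_cols, n_features).
    mean = data.mean(axis=0)
    centered = data - mean
    # Principal directions via SVD; scale each by (roughly) the data's
    # standard deviation along that direction so the grid spans the data.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    e1 = vt[0] * s[0] / np.sqrt(len(data))
    e2 = vt[1] * s[1] / np.sqrt(len(data))
    a1 = np.linspace(-1, 1, map_rows)   # coefficients along the first PC
    a2 = np.linspace(-1, 1, map_cols)   # coefficients along the second PC
    weights = (a1[:, None, None] * e1[None, None, :]
               + a2[None, :, None] * e2[None, None, :]
               + mean)
    return weights

For the 5x3 map above this gives exactly a1 = [-1, -0.5, 0, 0.5, 1] and a2 = [-1, 0, 1].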
The Principal Component part of the name is a reference to https://en.wikipedia.org/wiki/Principal_component_analysis.
Here is the idea. You start with data points described by vectors of many underlying factors. But those factors may be correlated in your data. So, for example, if you're measuring height, weight, blood pressure, etc., you expect that tall people will weigh more. What you want to do is replace this with vectors of factors that are not correlated with each other in your data.
So your principal component is a vector of length 1 which is as strongly correlated as possible with the variation in your dataset.
Your secondary component is the vector of length 1 at right angles to the first which is as strongly correlated as possible with the rest of the variation in your data set.
Your tertiary component is the vector of length 1 at right angles to the first two which is as strongly correlated as possible with the rest of the variation in your data set.
And so on.
In practice you may start with many factors, but most of the information is captured in just the first few. For example in the results of intelligence testing the first component is IQ and the second is the difference between how you are at verbal and quantitative reasoning.
How this applies to SOM initialization is that a simple linear model built off of PCA analysis is a pretty good guess for the answer that you're looking for, so starting there reduces how much work you have to do to finish getting the answer.

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid and an array of lamp coordinates. Each lamp illuminates every square on its x axis, every square on its y axis, and every square that lies on its diagonals (think of a Queen in chess). Given an array of query coordinates, determine whether each point is illuminated or not. The catch is that when checking a query, all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity, but it's not that fast - exponential, in fact. Where should I focus instead, speed or space? Also, if you have any input as to how you would solve this problem, please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both the nw-se and ne-sw directions. The diagonals through this point are both numbered zero. The nw-se diagonal numbers increase per cell in, say, the northeast direction and decrease (go negative) to the southwest. Similarly, the ne-sw diagonals are numbered increasing in the northwest direction and decreasing (negative) to the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, ne-sw diag #). You don't need to store these explicitly. Rather, you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If a list has only 3 or fewer, it's a constant-time operation to see whether they're all adjacent.
This solution stores the input points in 4 maps of lists. Since they already need to be represented in one list anyway, you can argue that this algorithm requires only a constant factor of extra space with respect to the input (i.e. the same sort of cost as mergesort).
Run time is expected constant per query point for 4 hash table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient, and easiest, to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to stay around 24 bytes per lamppost in an unboxed-structures language like C, so a ~32 GB RAM machine ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well under a minute.
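A compact Python sketch of the four-map idea (the names are illustrative; it follows the interpretation above in which lamps on or adjacent to the query are discarded before the check):

from collections import defaultdict

def build_maps(lamps):
    # Index lamps by row, column and both diagonals.
    x_map, y_map = defaultdict(list), defaultdict(list)
    nwse_map, nesw_map = defaultdict(list), defaultdict(list)
    for (x, y) in lamps:
        x_map[x].append((x, y))
        y_map[y].append((x, y))
        nwse_map[x - y].append((x, y))   # x - y is constant along one diagonal
        nesw_map[x + y].append((x, y))   # x + y is constant along the other
    return x_map, y_map, nwse_map, nesw_map

def is_illuminated(query, maps):
    qx, qy = query
    x_map, y_map, nwse_map, nesw_map = maps
    candidates = (x_map.get(qx, []) + y_map.get(qy, [])
                  + nwse_map.get(qx - qy, []) + nesw_map.get(qx + qy, []))
    # Lamps on or adjacent to the query are switched off before the check.
    live = [(lx, ly) for (lx, ly) in candidates
            if abs(lx - qx) > 1 or abs(ly - qy) > 1]
    return len(live) > 0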
There is a very easy answer which works:
Create a grid of NxN.
For each lamp, increment the count of every cell it illuminates.
For each query, check whether the cell at that query has a value > 0.
Then, for each lamp on or adjacent to the query, find all the cells it illuminates and reduce their counts by 1 (turning it off).
This worked fine, but failed the size limit when trying a 10000 x 10000 grid.

Parabolic knapsack

Let's say I have a parabola. Now I also have a bunch of sticks that are all of the same width (yes, my drawing skills are amazing!). How can I stack these sticks within the parabola such that I minimize the space it uses as much as possible? I believe that this falls under the category of knapsack problems, but this Wikipedia page doesn't appear to bring me closer to a real-world solution. Is this an NP-hard problem?
In this problem we are trying to minimize the amount of area consumed (e.g. the integral), which includes vertical area.
I cooked up a solution in JavaScript using processing.js and HTML5 canvas.
This project should be a good starting point if you want to create your own solution. I added two algorithms: one that sorts the input blocks from largest to smallest and another that shuffles the list randomly. Each item is then placed by starting at the bottom (smallest bucket) and moving up until a bucket has enough space for it.
Depending on the type of input the sort algorithm can give good results in O(n^2). Here's an example of the sorted output.
Here's the insert in order algorithm.
function solve(buckets, input) {
  var buckets_length = buckets.length,
      results = [];
  for (var b = 0; b < buckets_length; b++) {
    results[b] = [];
  }
  input.sort(function(a, b) { return b - a; });
  input.forEach(function(blockSize) {
    var b = buckets_length - 1;
    while (b > 0) {
      if (blockSize <= buckets[b]) {
        results[b].push(blockSize);
        buckets[b] -= blockSize;
        break;
      }
      b--;
    }
  });
  return results;
}
Project on github - https://github.com/gradbot/Parabolic-Knapsack
It's a public repo so feel free to branch and add other algorithms. I'll probably add more in the future as it's an interesting problem.
Simplifying
First I want to simplify the problem. To do that:
I switch the axes and add them to each other; this results in x2 growth.
I assume the parabola is on a closed interval [a, b], where a = 0 and, for this example, b = 3.
Let's say you are given b (the end of the interval) and w (the width of a segment); then you can find the total number of segments as n = Floor[b/w]. In this case there is a trivial way to maximize the Riemann sum, and the function giving the i-th segment height is f(b - (b*i)/(n+1)). Actually this is an assumption and I'm not 100% sure.
A maximized example for 17 segments on the closed interval [0, 3] for the function Sqrt[x] (real values):
The segment heights function in this case is Re[Sqrt[3 - 3*Range[1, 17]/18]], and the values are:
Exact form:
{Sqrt[17/6], 2 Sqrt[2/3], Sqrt[5/2],
Sqrt[7/3], Sqrt[13/6], Sqrt[2],
Sqrt[11/6], Sqrt[5/3], Sqrt[3/2],
2/Sqrt[3], Sqrt[7/6], 1, Sqrt[5/6],
Sqrt[2/3], 1/Sqrt[2], 1/Sqrt[3],
1/Sqrt[6]}
Approximated form:
{1.6832508230603465, 1.632993161855452, 1.5811388300841898,
 1.5275252316519468, 1.4719601443879744, 1.4142135623730951,
 1.35400640077266, 1.2909944487358056, 1.224744871391589,
 1.1547005383792517, 1.0801234497346435, 1, 0.9128709291752769,
 0.816496580927726, 0.7071067811865475, 0.5773502691896258,
 0.4082482904638631}
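The same heights can be reproduced with a few lines of Python, which is a handy check of the f(b - b*i/(n+1)) formula assumed above:

import math

b, n = 3.0, 17
heights = [math.sqrt(b - b * i / (n + 1)) for i in range(1, n + 1)]
print(heights[0], heights[-1])   # first ≈ 1.68325, last ≈ 0.40825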
What you have arrived at is a bin-packing problem, with a partially filled bin.
Finding b
If b is unknown, or our task is to find the smallest possible b under which all the sticks from the initial bunch fit, then we can at least bound the values of b:
lower limit: the b at which the sum of the segment heights equals the sum of the stick heights
upper limit: the b at which the number of segments equals the number of sticks and the longest stick is shorter than the longest segment height
One of the simplest ways to find b is to take a pivot at the midpoint between the lower and upper limits and check whether a solution exists. The pivot then becomes the new upper or lower limit, and you repeat the process until the required precision is met.
When you are looking for b you do not need an exact result, only a suboptimal one, and it would be much faster if you use an efficient heuristic to find a pivot point relatively close to the actual b. The feasibility check itself can also be a simple heuristic, for example:
sort the sticks by length, largest to smallest
put each stick into the first bin it fits in (first-fit decreasing), as sketched below
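Here is a rough Python sketch of that search: bisection on b, with the feasibility check done by first-fit decreasing over the segment heights. The height formula is the one assumed above, f(b - b*i/(n+1)); everything here, including the initial bounds (which would come from the limits listed above), is illustrative rather than exact.

def fits(b, sticks, width, f):
    # Can all sticks be stacked into columns of width `width` that fit
    # under f on [0, b]?  First-fit decreasing heuristic.
    n = int(b // width)                                   # number of columns
    heights = sorted((f(b - b * i / (n + 1)) for i in range(1, n + 1)),
                     reverse=True)
    for stick in sorted(sticks, reverse=True):
        for i, h in enumerate(heights):
            if stick <= h:
                heights[i] -= stick
                break
        else:
            return False                                  # no column had room
    return True

def find_b(sticks, width, f, lo, hi, eps=1e-3):
    # Bisect between the lower and upper limits until the interval is small.
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if fits(mid, sticks, width, f):
            hi = mid
        else:
            lo = mid
    return hi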
This is equivalent to having multiple knapsacks (assuming these blocks are the same 'height', this means there's one knapsack for each 'line'), and is thus an instance of the bin packing problem.
See http://en.wikipedia.org/wiki/Bin_packing
How can I stack these sticks within the parabola such that I am minimizing the (vertical) space it uses as much as possible?
Just deal with it like any other Bin Packing problem. I'd throw meta-heuristics on it (such as tabu search, simulated annealing, ...) since those algorithms aren't problem specific.
For example, I'd start from my Cloud Balance problem (= a form of bin packing) in Drools Planner. If all the sticks have the same height and there's no vertical space between 2 sticks on top of each other, there's not much I'd have to change:
Rename Computer to ParabolicRow. Remove its properties (cpu, memory, bandwidth). Give it a unique level (where 0 is the lowest row). Create a number of ParabolicRows.
Rename Process to Stick
Rename ProcessAssignement to StickAssignment
Rewrite the hard constraints so it checks if there's enough room for the sum of all Sticks assigned to a ParabolicRow.
Rewrite the soft constraints to minimize the highest level of all ParabolicRows.
I'm very sure it is equivalent to bin-packing:
Informal reduction:
Let x be the width of the widest row; make the bins 2x wide and create, for every row, a placeholder element that is 2x - rowWidth wide. That way two placeholder elements cannot be packed into one bin.
To reduce bin packing to parabolic knapsack, you just create placeholder elements of size rowWidth - binSize for all rows that are wider than the needed bin size. Furthermore, add placeholders that fill the whole row for all rows that are narrower than the bin size.
This would obviously mean your problem is NP-hard.
For other ideas look here maybe: http://en.wikipedia.org/wiki/Cutting_stock_problem
Most likely this is the 0-1 knapsack or a bin-packing problem. It is NP-hard, and while I may not fully understand this particular variant and can't explain it to you, you can get good approximate solutions with greedy algorithms. Here is a useful article about it, http://www.developerfusion.com/article/5540/bin-packing, which I used to make my PHP bin-packing class at phpclasses.org.
Props to those who mentioned the fact that the levels could be at varying heights (e.g. assuming the sticks are 1 'thick', level 1 could go from 0.1 units to 1.1 units, or it could go from 0.2 to 1.2 units instead).
You could of course expand the "multiple bin packing" methodology and test arbitrarily small increments (e.g. run the multiple bin-packing methodology with levels starting at 0.0, 0.1, 0.2, ... 0.9) and then choose the best result. But it seems like you would get stuck calculating for an infinite amount of time unless you had some methodology to verify that you had gotten it 'right' (or more precisely, that you had all the 'rows' correct as to what they contained, at which point you could shift them down until they met the edge of the parabola).
Also, the OP did not specify that the sticks had to be laid horizontally - although perhaps the OP implied it with those sweet drawings.
I have no idea how to optimally solve such an issue, but I bet there are certain cases where you could randomly place sticks and then test whether they are 'inside' the parabola, and it would beat any of the methodologies relying only on horizontal rows.
(Consider the case of a narrow parabola that we are trying to fill with 1 long stick.)
I say just throw them all in there and shake them ;)

How to quickly count the number of neighboring voxels?

I have got a 3D grid (voxels), where some of the voxels are filled, and some are not. The 3D grid is sparsely filled, so I have got a set filledVoxels with coordinates (x, y, z) of the filled voxels. What I am trying to find out is, for each filled voxel, how many neighboring voxels are filled too.
Here is an example:
filledVoxels contains the voxels (1, 1, 1), (1, 2, 1), and (1, 3, 1).
Therefore, the neighbor counts are:
(1,1,1) has 1 neighbor
(1,2,1) has 2 neighbors
(1,3,1) has 1 neighbor.
Right now I have this algorithm:
voxelCount = new Map<Voxel, Integer>();
for (voxel v in filledVoxels)
    count = checkAllNeighbors(v, filledVoxels);
    voxelCount[v] = count;
end
checkAllNeighbors() looks up all 26 surrounding voxels. So in total I am doing 26*filledVoxels.size() lookups, which is quite slow.
Is there any way to cut down the number of required lookups? When you look at the above example you can see that I am checking the same voxels several times, so it might be possible to get rid of lookups with some clever caching.
If this helps in any way, the voxels represent a voxelized 3D surface (but there might be holes in it). I usually want to get a list of all voxels that have 5 or 6 neighbors.
You can transform your voxel space into an octree in which every node contains a flag that specifies whether it contains filled voxels at all.
When a node does not contain filled voxels, you don't need to check any of its descendants.
I'd say if each of your lookups is slow (O(size)), you should optimize it by binary search in an ordered list (O(log(size))).
I wouldn't worry much about the constant 26. If you iterate smarter, you could cache something and get 26 down to 10 or so, I think, but unless you have profiled the whole application and found decisively that this is the bottleneck, I would concentrate on something else.
As ilya states, there's not much you can do to get around the 26 neighbor look-ups. You have to make your biggest gains in efficiently identifying whether a given neighbor is filled or not. Given that the brute force solution is essentially O(N^2), you have a lot of possible ground to gain in that area. Since you have to iterate over all filled voxels at least once, I would take an approach similar to the following:
voxelCount = new Map<Voxel, Integer>();
visitedVoxels = new EfficientSpatialDataType();
for (voxel v in filledVoxels)
    for (voxel n in neighbors(v))
        if (visitedVoxels.contains(n))
            voxelCount[v]++;
            voxelCount[n]++;
        end
    next
    visitedVoxels.add(v);
next
For your efficient spatial data type, a kd-tree, as Zifre suggested, might be a good idea. In any case, you're going to want to reduce your search space by binning visited voxels.
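For example, with a plain hash set playing the role of the "efficient spatial data type", the whole thing fits in a few lines of Python (a sketch; each adjacent pair is examined once, mirroring the visitedVoxels idea above):

from collections import defaultdict
from itertools import product

def neighbor_counts(filled_voxels):
    filled = set(filled_voxels)
    counts = defaultdict(int)
    # Half of the 26 neighbor offsets; the opposite half is covered by symmetry.
    half_offsets = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]
    for (x, y, z) in filled:
        counts[(x, y, z)] += 0                 # make isolated voxels show up
        for dx, dy, dz in half_offsets:
            n = (x + dx, y + dy, z + dz)
            if n in filled:
                counts[(x, y, z)] += 1
                counts[n] += 1
    return dict(counts)

# The example from the question:
print(neighbor_counts([(1, 1, 1), (1, 2, 1), (1, 3, 1)]))
# counts: (1, 1, 1) -> 1, (1, 2, 1) -> 2, (1, 3, 1) -> 1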
If you're marching along the voxels one at a time, you can keep a lookup table corresponding to the grid, so that after you've checked a voxel once using IsFullVoxel() you put the value in this grid. For each voxel you're marching in you can check whether its lookup table value is valid, and only call IsFullVoxel() if it isn't.
OTOH it seems like you can't avoid iterating over all neighboring voxels, either using IsFullVoxel() or the LUT. If you had some more a priori information it could help. For instance, if you knew that there were at most x neighboring filled voxels, or you knew that there were at most y neighboring filled voxels in each direction. For instance, if you know you're looking for voxels with 5 to 6 neighbors, you can stop after you've found 7 full neighbors or 22 empty neighbors.
I'm assuming that a function IsFullVoxel() exists that returns true if a voxel is full.
If most of the moves in your iteration were to neighbors, you could reduce your checking by around 25% by not looking back at the ones you just checked before you made the step.
You may find a Z-order curve a useful concept here. It lets you (with certain provisos) keep a sliding window of data around the point you're currently querying, so that when you move to the next point, you don't have to throw away many of the queries you've already performed.
Um, your question is not very clear. I'm assuming you just have a list of the filled points. In that case, this is going to be very slow, because you have to iterate through it (or use some kind of tree structure such as a kd-tree, but this will still be O(log n)).
If you can (i.e. the grid is not too big), just make a 3d array of bools. 26 lookups in a 3d array shouldn't really take that long (and there really is no way to cut down on the number of lookups).
Actually, now that I think of it, you could make it a 3d array of longs (64 bits). Each 64 bit block would hold 64 (4 x 4 x 4) voxels. When you are checking the neighbors of a voxel in the middle of the block, you could do a single 64 bit read (which would be much faster).
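As a sketch of that 64-bit idea (illustrative only): give each voxel in a 4x4x4 block the bit index x + 4*y + 16*z; then counting the filled neighbors of a voxel whose whole neighborhood lies inside the block is one mask plus one popcount.

def block_bit(x, y, z):
    # Bit index of voxel (x, y, z) inside its 4x4x4 block.
    return x + 4 * y + 16 * z

def neighbor_mask(x, y, z):
    # 64-bit mask of the (up to 26) neighbor positions inside the block.
    m = 0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if (dx, dy, dz) != (0, 0, 0):
                    nx, ny, nz = x + dx, y + dy, z + dz
                    if 0 <= nx < 4 and 0 <= ny < 4 and 0 <= nz < 4:
                        m |= 1 << block_bit(nx, ny, nz)
    return m

def neighbors_in_block(block, x, y, z):
    # `block` is the 64-bit occupancy word; voxels on block edges would
    # additionally need bits from the adjacent blocks.
    return bin(block & neighbor_mask(x, y, z)).count("1")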
Is there any way to cut down the number of required lookups?
You will, at minimum, have to perform at least 1 lookup per voxel. Since that's the minimum, then any algorithm which only performs one lookup per voxel will meet your requirement.
One simplistic idea is to initialize an array to hold the count for each voxel, then look at each voxel and increment the neighbors of that voxel in the array.
Pseudo C might look something like this:
#define MAXX 100
#define MAXY 100
#define MAXZ 100

int x, y, z;
char countArray[MAXX][MAXY][MAXZ];

initializeCountArray(MAXX, MAXY, MAXZ); // Set all array elements to 0

for (x = 0; x < MAXX; x++)
    for (y = 0; y < MAXY; y++)
        for (z = 0; z < MAXZ; z++)
            if (VoxelExists(x, y, z))
                incrementNeighbors(x, y, z);
You'll need to write initializeCountArray so it sets all array elements to 0.
More importantly, you'll also need to write incrementNeighbors so that it won't increment outside the array. A slight speed increase here is to only perform the above algorithm on the interior voxels, then do a separate run on all the outside edge voxels with a modified incrementNeighbors routine that understands there won't be neighbors on one side.
This algorithm results in 1 lookup per voxel, and at most 26 byte additions per voxel. If your voxel space is sparse then this will result in very few (relative) additions. If your voxel space is very dense, you might consider reversing the algorithm - initialize the array to the value of 26 for each entry, then decrement the neighbors when a voxel doesn't exist.
The results for a given voxel (ie, how many neighbors do I have?) reside in the array. If you need to know how many neighbors voxel 2,3,5 has, just look at the byte in countArray[2][3][5].
The array will consume 1 byte per voxel. You could use less space, and possibly increase the speed a little bit by packing the bytes.
There are better algorithms if you know details about your data. For instance, a very sparse voxel space will benefit greatly from an octree, where you can skip large blocks of lookups when you already know there are no filled voxels inside. Most of these algorithms, however, would still require at least one lookup per voxel to fill their matrix, but if you are performing several operations then they may benefit more than this one operation.
