CUDA: Send data from GPU to GPU - matrix

I have two GPU cards Tesla C2070 (P2P & UAV support) that I want to Send and Receive data using CUDA.
In GPU A, I have a matrix: a11 a12 a13 a14 a21 a22
a23 a24
In GPU B, I have another matrix:
b11 b12 b13 b14
b21 b22 b23 b24
I can only send contiguous elements as the code below:
int main(void)
{
float *d_a, *d_b;
int N = 4;
int M = 2;
size_t pitch;
cudaSetDevice(0);
cudaMallocPitch(&d_a, &pitch, sizeof(float)*N, M);
cudaDeviceEnablePeerAccess(1, 0);
cudaSetDevice(1);
cudaMallocPitch(&d_b, &pitch, sizeof(float)*N, M);
cudaDeviceEnablePeerAccess(0, 0);
//Initialization for d_a
//Initialization for d_b
//Copy M*N/2 element from d_a to d_b, starting from d_a[1]
cudaMemcpy(&d_b[1], &d_a[1], M*N/2*sizeof(float), cudaMemcpyDefault);
//Print result d_b
}
How to send the last two columns of the matrix from GPU A to GPU B directly, so on GPU B I will get:
b11 b12 a13 a14
b21 b22 a23 a24
Similarly, how to send the first row of the matrix from GPU A to GPU B, so on GPU B I will get:
a11 a12 a13 a14
b21 b22 b23 b24
If I have 1-D array as follow: a1 a2 a3 a4 a5 a6 a7 a8.....
How to send elements 1,4,7,...(every 3 elements) from GPU A to replace the same ones on GPU B?

The API call you need to look at is cudaMemcpy2D. This allows fairly straightforward copying of all or portions of pitched data, and is the natural counterpart of cudaMallocPitch.
If we leave aside the multiGPU aspect of your question for a moment, and just focus on the copying of pitched data (in UVA platforms, how GPU to GPU transfers are handled is basically an implementation detail you don't need to know about), there are only three things required to do what you want:
Use pointer arithmetic to calculate the starting address of source and destination memory
Remember that the pitch of the source and destination memory is always constant (that returned by cudaMallocPitch). Note you should keep a pitch for each pointer you allocate. There is no guarantee that the API will return the same pitch for two different allocations of the same size, this is particularly true if the allocations are not on the same device
Remember that you need to calculate the width of any transfer in bytes, and the number widths is always a count, not a byte value.
Here is a concrete example based off the code you posted which performs copying of a subset of data between two pitched allocations assuming column major order. Note that for brevity, I have encapsulated most of the addressing mechanics in a simple class which can be used on both the host and device. Two 5x10 pitched arrays are allocated, and a 3x3 sub array is copied from one to the other. I have used kernel printf to show the copying action:
#include <cstdio>
struct mat
{
int m, n;
size_t pitch;
char *ptr;
__device__ __host__
mat(int _m, int _n, size_t _pitch, char *_ptr) : m(_m), n(_n), pitch(_pitch), ptr(_ptr) {};
__device__ __host__ float * getptr(int i=0, int j=0) {
float * col = (float*)(ptr + j*pitch);
return col + i;
};
__device__ __host__ float& operator() (int i, int j) {
return *getptr(i,j);
};
__device__ __host__
void print() {
for(int i=0; i<m; i++) {
for(int j=0; j<n; j++) {
printf("%4.f ", (*this)(i,j));
}
printf("\n");
}
};
};
__global__ void printmat(struct mat x) { x.print(); }
int main(void)
{
const int M = 5, N = 10;
const size_t hostpitch = M * sizeof(float);
float *a = new float[M*N], *b = new float[M*N];
mat A(M, N, hostpitch, (char *)(a));
mat B(M, N, hostpitch, (char *)(b));
for(int v=0, j=0; j<N; j++) {
for(int i=0; i<M; i++) {
A(i,j) = (float)v; B(i,j) = (float)(100+v++);
}
}
char *d_a, *d_b;
size_t pitch_a, pitch_b;
cudaMallocPitch((void **)&d_a, &pitch_a, sizeof(float)*M, N);
cudaMallocPitch((void **)&d_b, &pitch_b, sizeof(float)*M, N);
mat Ad(M, N, pitch_a, d_a); mat Bd(M, N, pitch_b, d_b);
cudaMemcpy2D(Ad.getptr(), Ad.pitch, A.getptr(), A.pitch,
A.pitch, A.n, cudaMemcpyHostToDevice);
printmat<<<1,1>>>(Ad);
cudaMemcpy2D(Bd.getptr(), Bd.pitch, B.getptr(), B.pitch,
B.pitch, B.n, cudaMemcpyHostToDevice);
printmat<<<1,1>>>(Bd);
int ci = 3, cj = 3;
cudaMemcpy2D(Ad.getptr(1,1), Ad.pitch, Bd.getptr(1,1), Bd.pitch,
ci*sizeof(float), cj, cudaMemcpyDeviceToDevice);
printmat<<<1,1>>>(Ad); cudaDeviceSynchronize();
return 0;
}
which does this:
>nvcc -m32 -Xptxas="-v" -arch=sm_21 pitched.cu
pitched.cu
tmpxft_00001348_00000000-5_pitched.cudafe1.gpu
tmpxft_00001348_00000000-10_pitched.cudafe2.gpu
pitched.cu
ptxas : info : 0 bytes gmem, 8 bytes cmem[2]
ptxas : info : Compiling entry function '_Z8printmat3mat' for 'sm_21'
ptxas : info : Function properties for _Z8printmat3mat
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 23 registers, 48 bytes cmem[0]
tmpxft_00001348_00000000-5_pitched.cudafe1.cpp
tmpxft_00001348_00000000-15_pitched.ii
>cuda-memcheck a.exe
========= CUDA-MEMCHECK
0 5 10 15 20 25 30 35 40 45
1 6 11 16 21 26 31 36 41 46
2 7 12 17 22 27 32 37 42 47
3 8 13 18 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
100 105 110 115 120 125 130 135 140 145
101 106 111 116 121 126 131 136 141 146
102 107 112 117 122 127 132 137 142 147
103 108 113 118 123 128 133 138 143 148
104 109 114 119 124 129 134 139 144 149
0 5 10 15 20 25 30 35 40 45
1 106 111 116 21 26 31 36 41 46
2 107 112 117 22 27 32 37 42 47
3 108 113 118 23 28 33 38 43 48
4 9 14 19 24 29 34 39 44 49
========= ERROR SUMMARY: 0 errors

Related

Spiral in sampled x-y plane

Let’s say I have the following 3D discretized space, in which the indexes of the samples/nodes are sequential as it is shown in the picture.
Now consider only the horizontal middle layer.
My objective is to find a programmatically and iterative rule/s that allow me to run a spiral (like the image or similar, it can start in any direction) over the mid-layer, starting from node 254, as it is shown on the image:
As you can see in the picture, the yellow crosses show the nodes to be explored. In the first lap these nodes are consecutive while in the second they are separated by 1 node and so on.
I started to solve the problem as follows (pseudocode):
I considered size(y) = y = 13
Size(z) = z = 3
Lap 1:
254 – z * y = 215
254 – z * (y + 1) = 212
254 – z = 251
254 + z * (y - 1) = 290
254 + z * y = 293
254 + z * (y + 1) = 296
254 + z = 257
254 – z * (y – 1) = 218
Lap 2:
254 – 3 * z * y = 137
254 – 3 * z * (y + 2/3) = 131
…
But I think there may be a simpler, more general rule.
each direction has constant index increment:
const int dx = 39;
const int dy = 3;
const int dz = 1;
so to make a spiral you just start from start index and increment in current direction i-times then rotate by 90 deg and do the same ... then increment i and do this until desired size is hit ...
You should also add range checking so your spiral will not go outside your array as that would screw things up. By checking actual x,y,z coordinates. So either compute them in parallel or infer them from ix using modular arithmetics so for example something like (C++):
const int dx = 39;
const int dy = 3;
const int dz = 1;
int cw[4]={-dx,-dy,+dx,+dy}; // CW rotation
int ix=254; // start point (center of spiral)
int dir=0; // direction cw[dir]
int n=5; // size
int i,j,k,x,y,z,a; // temp
for (k=0,i=1;i<=n;i+=k,k^=1,dir++,dir&=3)
for (j=1;j<=i;j++)
{
int a=ix-1;
z = a% 3; a/= 3; // 3 is z-resolution
y = a%13; a/=13; // 13 is y-resolution
x = a;
if ((x>=0)&&(x<13)&&(y>=0)&&(y<13)&&(z>=0)&&(z<3))
{
// here use point ix
// Form1->mm_log->Lines->Add(AnsiString().sprintf("%i (%i,%i,%i) %i",ix,x,y,z,i));
}
ix+=cw[dir];
}
producing this output
ix x,y,z i
254 (6,6,1) 1
215 (5,6,1) 1
212 (5,5,1) 2
251 (6,5,1) 2
290 (7,5,1) 2
293 (7,6,1) 2
296 (7,7,1) 3
257 (6,7,1) 3
218 (5,7,1) 3
179 (4,7,1) 3
176 (4,6,1) 3
173 (4,5,1) 3
170 (4,4,1) 4
209 (5,4,1) 4
248 (6,4,1) 4
287 (7,4,1) 4
326 (8,4,1) 4
329 (8,5,1) 4
332 (8,6,1) 4
335 (8,7,1) 4
338 (8,8,1) 5
299 (7,8,1) 5
260 (6,8,1) 5
221 (5,8,1) 5
182 (4,8,1) 5
143 (3,8,1) 5
140 (3,7,1) 5
137 (3,6,1) 5
134 (3,5,1) 5
131 (3,4,1) 5
In case you want CCW spiral either reverse the cw[] or instead of dir++ do dir--
In case you want to have changeable screw width then you just increment i by the actual width instead of just by one.
Based on #Spektre answer, this code worked for me:
const int x_res = 13;
const int y_res = 13;
const int z_res = 3;
const int dx = 39;
const int dy = 3;
const int dz = 1;
int cw[4]={-dx,-dy,+dx,+dy}; // CW rotation
int ix=254; // start point (center of spiral)
int dir=0; // direction cw[dir]
int n=30; // size
int i,j,k;
cout << ix << endl;
// first "lap" (consecutive nodes)
for (k=0,i=1;i<=2;i+=k,k^=1,dir++,dir&=3)
for (j=1;j<=i;j++)
{
ix+=cw[dir];
cout << ix << endl;
}
i-=1;
int width = 2; //screw width
i+=width;
int dist = 1; //nodes separation
int node_count = 0; //nodes counter
for (k=k,i=i;i<=n;i+=k,k^=width,dir++,dir&=3)
{
if (dir==1)
{
dist+=1;
}
for (j=1;j<=i;j++)
{
ix+=cw[dir];
node_count +=1;
if ((0 < ix) && (ix <= x_res*y_res*z_res))
{
if (node_count == dist)
{
cout << ix << endl;
node_count = 0;
}
}
else return 0;
}
}
return 0;
with this output:
254 215 212 251 290 293 296 257 218 179 140 134 128 206 284 362 368 374 380 302
224 146 68 59 50 83 200 317 434 443 452 461 386 269 152 35

Compressing a vector of positive integers (int32) that have a specific order

I'm trying to compress long vectors (their size ranges from 1 to 100 million elements). The vectors have positive integers with values ranging from 0 to 1 or 100 million (depending on the vector size). Hence, I'm using 32 bit integers to encompass the large numbers but that consumes too much storage.
The vectors have the following characteristic features:
All values are positive integers. Their range grows as the vector size grows.
Values are increasing but smaller numbers do intervene frequently (see the figure below).
None of the values before a specific index are larger than that index (Index starts at zero). For instance, none of the values that occur before the index of 6 are larger than 6. However, smaller values may repeat after that index. This holds true for the entire array.
I'm usually dealing with very long arrays. Hence, as the array length passes 1 million elements, the upcoming numbers are mostly large numbers mixed with previous reoccurring numbers. Shorter numbers usually re-occur more than larger numbers. New Larger numbers are added to the array as you pass through it.
Here is a sample of the values in the array: {initial padding..., 0, 1, 2, 3, 4, 5, 6, 4, 7, 4, 8, 9, 1, 10, ... later..., 1110, 11, 1597, 1545, 1392, 326, 1371, 1788, 541,...}
Here is a plot of a part of the vector:
What do I want? :
Because I'm using 32 bit integers this is wasting a lot of memory since smaller numbers that can be represented with less than 32 bit do repeat too. I want to compress this vector maximally to save memory (Ideally, by a factor of 3 because only a reduction by that amount or more will meet our needs!). What is the best compression algorithm to achieve that? Or is there away to take advantage of the array's characteristic features described above to reversibly convert the numbers in that array to 8 bit integers?
Things that I have tried or considered:
Delta encoding: This doesn't work here because the vector is not always increasing.
Huffman coding: Does not seem to help here since the range of unique numbers in the array is quite large, hence, the encoding table will be a large overhead.
Using variable Int encoding. i.e using 8 bit integers for smaller numbers and 16 bit for larger ones...etc. This has reduced the vector size to size*0.7 (not satisfactory since it doesn't take advantage of the specific characteristics described above)
I'm not quite sure if this method described in the following link is applicable to my data: http://ygdes.com/ddj-3r/ddj-3r_compact.html
I don't quite understand the method but it gives me the encouragement to try similar things because I think there is some order in the data that can be taken to its advantage.
For example, I tried to reassign any number(n) larger than 255 to n-255 so that I can keep the integers in 8 bit realm because I know that no number is larger than 255 before that index. However, I'm not able to distinguish the reassigned numbers with the repeated numbers... so this idea doesn't work unless doing some more tricks to reverse the re-assignments...
Here is the link to the fist 24000 elements of the data for those interested:
data
Any advice or suggestions are deeply appreciated. Thanks a lot in advance.
Edit1:
Here is a plot of the data after delta encoding. As you can see, it doesn't reduce the range!
Edit2:
I was hoping that I could find a pattern in the data that allows me to reversibly change the 32-bit vector to a single 8-bit vector but this seems very unlikely.
I have tried to decompose the 32-bit vector to 4 x 8-bit vectors, hoping that the decomposed vectors lend themselves to compression better.
Below are plots for the 4 vectors. Now their ranges are from 0-255.
What I did was to recursively divide each element in the vectors by 255 and store the reminder into another vector. To reconstruct the original array all I need to do is: ( ( (vec4*255) + vec3 )*255 + vec2 ) *255 + vec1...
As you can see, the last vector is all zeros for the current shown length of the data.. in fact, this should be zeros all the way to 2^24th element. This will be a 25% reduction if my total vector length was less than 16 million elements but since I'm dealing with much longer vectors this has a much smaller impact.
More importantly, the third vector seems also to have some compressible features as its values do increase by 1 after each 65,535 steps.
It does seem that now I can benefit from Huffman coding or variable bit encoding as suggested. Any suggestions that allows me to maximally compress this data are deeply appreciated.
Here I attached a bigger sample of the data if anyone is interested:
https://drive.google.com/file/d/10wO3-1j3NkQbaKTcr0nl55bOH9P-G1Uu/view?usp=sharing
Edit3:
I'm really thankful for all the given answers. I've learnt a lot from them. For those of you who are interested to tinker with a larger set of the data the following link has 11 million elements of a similar dataset (zipped 33MB)
https://drive.google.com/file/d/1Aohfu6II6OdN-CqnDll7DeHPgEDLMPjP/view
Once you unzip the data, you can use the following C++ snippet to read the data into a vector<int32_t>
const char* path = "path_to\compression_int32.txt";
std::vector<int32_t> newVector{};
std::ifstream ifs(path, std::ios::in | std::ifstream::binary);
std::istream_iterator<int32_t> iter{ ifs };
std::istream_iterator<int32_t> end{};
std::copy(iter, end, std::back_inserter(newVector));
It's easy to get better than a factor of two compression on your example data by using property 3, where I have taken property 3 to mean that every value must be less than its index, with the indices starting at 1. Simply use ceiling(log2(i)) bits to store the number at index i (where i starts at 1). For your first example with 24,977 values, that compresses it of 43% of the size of the vector using 32-bit integers.
The number of bits required depends only on the length of the vector, n. The number of bits is:
1 - 2ceiling(log2(n)) + n ceiling(log2(n))
As noted by Falk Hüffner, a simpler approach would be a fixed number of bits for all values of ceiling(log2(n)). A variable number of bits will always be less than that, but not much less than that for large n.
If it is common to have a run of zeros at the start, then compress those with a count. There are only a handful of runs of two or three numbers in the remainder, so run-length encoding won't help except for that initial run of zeros.
Another 2% or so (for large sets) could be shaved off using an arithmetic coding approach, considering each value at index k (indices starting at zero) to be a base k+1 digit of a very large integer. That would take ceiling(log2(n!)) bits.
Here is a plot of the compression ratios of the arithmetic coding, variable bits per sample coding, and fixed bits per sample coding, all ratioed to a representation with 32 bits for every sample (the sequence length is on a log scale):
The arithmetic approach requires multiplication and division on integers the length of the compressed data, which is monumentally slow for large vectors. The code below limits the size of the integers to 64 bits, at some cost to the compression ratio, in exchange for it being very fast. This code will give compression ratios about 0.2% to 0.7% more than arithmetic in the plot above, well below variable bits. The data vector must have the property that each value is non-negative
and that each value is less than its position (positions starting at one).
The compression effectiveness depends only on that property, plus a small reduction if there is an initial run of zeros.
There appears to be a bit more redundancy in the provided examples that this
compression approach does not exploit.
#include <vector>
#include <cmath>
// Append val, as a variable-length integer, to comp. val must be non-negative.
template <typename T>
void write_varint(T val, std::vector<uint8_t>& comp) {
while (val > 0x7f) {
comp.push_back(val & 0x7f);
val >>= 7;
}
comp.push_back(val | 0x80);
}
// Return the variable-length integer at offset off in comp, updating off to
// point after the integer.
template <typename T>
T read_varint(std::vector<uint8_t> const& comp, size_t& off) {
T val = 0, next;
int shift = 0;
for (;;) {
next = comp.at(off++);
if (next > 0x7f)
break;
val |= next << shift;
shift += 7;
}
val |= (next & 0x7f) << shift;
return val;
}
// Given the starting index i >= 1, find the optimal number of values to code
// into 64 bits or less, or up through index n-1, whichever comes first.
// Optimal is defined as the least amount of entropy lost by representing the
// group in an integral number of bits, divided by the number of bits. Return
// the optimal number of values in num, and the number of bits needed to hold
// an integer representing that group in len.
static void group_ar64(size_t i, size_t n, size_t& num, int& len) {
// Analyze all of the permitted groups, starting at index i.
double min = 1.;
uint64_t k = 1; // integer range is 0..k-1
auto j = i + 1;
do {
k *= j;
auto e = log2(k); // entropy of k possible integers
int b = ceil(e); // number of bits to hold 0..k-1
auto loss = (b - e) / b; // unused entropy per bit
if (loss < min) {
num = j - i; // best number of values so far
len = b; // bit length for that number
if (loss == 0.)
break; // not going to get any better
min = loss;
}
} while (j < n && k <= (uint64_t)-1 / ++j);
}
// Compress the data arithmetically coded as an incrementing base integer, but
// with a 64-bit limit on each integer. This puts values into groups that each
// fit in 64 bits, with the least amount of wasted entropy. Also compress the
// initial run of zeros into a count.
template <typename T>
std::vector<uint8_t> compress_ar64(std::vector<T> const& data) {
// Resulting compressed data vector.
std::vector<uint8_t> comp;
// Start with number of values to make the stream self-terminating.
write_varint(data.size(), comp);
if (data.size() == 0)
return comp;
// Run-length code the initial run of zeros. Write the number of contiguous
// zeros after the first one.
size_t i = 1;
while (i < data.size() && data[i] == 0)
i++;
write_varint(i - 1, comp);
// Compress the data into variable-base integers starting at index i, where
// each integer fits into 64 bits.
unsigned buf = 0; // output bit buffer
int bits = 0; // number of bits in buf (0..7)
while (i < data.size()) {
// Find the optimal number of values to code, starting at index i.
size_t num; int len;
group_ar64(i, data.size(), num, len);
// Code num values.
uint64_t code = 0;
size_t k = 1;
do {
code += k * data[i++];
k *= i;
} while (--num);
// Write code using len bits.
if (bits) {
comp.push_back(buf | (code << bits));
code >>= 8 - bits;
len -= 8 - bits;
}
while (len > 7) {
comp.push_back(code);
code >>= 8;
len -= 8;
}
buf = code;
bits = len;
}
if (bits)
comp.push_back(buf);
return comp;
}
// Decompress the result of compress_ar64(), returning the original values.
// Start decompression at offset off in comp. When done, off is updated to
// point just after the compressed data.
template <typename T>
std::vector<T> expand_ar64(std::vector<uint8_t> const& comp, size_t& off) {
// Will contain the uncompressed data to return.
std::vector<T> data;
// Get the number of values.
auto vals = read_varint<size_t>(comp, off);
if (vals == 0)
return data;
// Get the number of zeros after the first one, and write all of them.
auto run = read_varint<size_t>(comp, off) + 1;
auto i = run;
do {
data.push_back(0);
} while (--run);
// Extract the values from the compressed data starting at index i.
unsigned buf = 0; // input bit buffer
int bits = 0; // number of bits in buf (0..7)
while (i < vals) {
// Find the optimal number of values to code, starting at index i. This
// simply repeats the same calculation that was done when compressing.
size_t num; int len;
group_ar64(i, vals, num, len);
// Read len bits into code.
uint64_t code = buf;
while (bits + 8 < len) {
code |= (uint64_t)comp.at(off++) << bits;
bits += 8;
}
len -= bits; // bits to pull from last byte (1..8)
uint64_t last = comp.at(off++); // last byte
code |= (last & ((1 << len) - 1)) << bits;
buf = last >> len; // save remaining bits in buffer
bits = 8 - len;
// Extract num values from code.
do {
i++;
data.push_back(code % i);
code /= i;
} while (--num);
}
// Return the uncompressed data.
return data;
}
Solving every compression problem should begin with an analysis.
I looked at the raw data file containing the first 24976 values. The smallest value is 0 and the largest is 24950. The "slope" of the data is then around 1. However, It should decrease over time, if the maximum is, as told, only 33M#100M values. Assumption of slope=1 is then a bit pessimistic.
As for the distribution,
tr '[,]' '[\n]' <compression.txt | sort -n | uniq -c | sort -nr | head -n256
produces
164 0
131 8
111 1648
108 1342
104 725
103 11
91 1475
90 1446
82 21
82 1355
78 69
76 2
75 12
72 328
71 24
70 614
70 416
70 1608
70 1266
69 22
67 356
67 3
66 1444
65 19
65 1498
65 10
64 2056
64 16
64 1322
64 1182
63 249
63 1335
61 43
60 17
60 1469
59 33
59 3116
58 20
58 1201
57 303
55 5
55 4
55 2559
55 1324
54 1110
53 1984
53 1357
52 807
52 56
52 4321
52 2892
52 1
50 39
50 2475
49 1580
48 664
48 266
47 317
47 1255
46 981
46 37
46 3531
46 23
43 1923
43 1248
41 396
41 2349
40 7
39 6
39 54
39 4699
39 32
38 815
38 2006
38 194
38 1298
38 1284
37 44
37 1550
37 1369
37 1273
36 1343
35 61
35 3991
35 3606
35 1818
35 1701
34 836
34 27
34 264
34 241
34 1306
33 821
33 28
33 248
33 18
33 15
33 1017
32 9
32 68
32 53
32 240
32 1516
32 1474
32 1390
32 1312
32 1269
31 667
31 326
31 263
31 25
31 160
31 1253
30 3365
30 2082
30 18550
30 1185
30 1049
30 1018
29 73
29 487
29 48
29 4283
29 34
29 243
29 1605
29 1515
29 1470
29 1297
29 1183
28 980
28 60
28 302
28 242
28 1959
28 1779
28 161
27 811
27 51
27 36
27 201
27 1270
27 1267
26 979
26 50
26 40
26 3111
26 26
26 2425
26 1807
25 825
25 823
25 812
25 77
25 46
25 217
25 1842
25 1831
25 1534
25 1464
25 1321
24 730
24 66
24 59
24 427
24 355
24 1465
24 1299
24 1164
24 1111
23 941
23 892
23 7896
23 663
23 607
23 556
23 47
23 2887
23 251
23 1776
23 1583
23 1488
23 1349
23 1244
22 82
22 818
22 661
22 42
22 411
22 3337
22 3190
22 3028
22 30
22 2226
22 1861
22 1363
22 1301
22 1262
22 1158
21 74
21 49
21 41
21 376
21 354
21 2156
21 1688
21 162
21 1453
21 1067
21 1053
20 711
20 413
20 412
20 38
20 337
20 2020
20 1897
20 1814
20 17342
20 173
20 1256
20 1160
19 9169
19 83
19 679
19 4120
19 399
19 2306
19 2042
19 1885
19 163
19 1623
19 1380
18 805
18 79
18 70
18 6320
18 616
18 455
18 4381
18 4165
18 3761
18 35
18 2560
18 2004
18 1900
18 1670
18 1546
18 1291
18 1264
18 1181
17 824
17 8129
17 63
17 52
17 5138
as the most frequent 256 values.
It seems some values are inherently more common. When examined, those common values also seem to be distributed all over the data.
I propose the following:
Divide the data into blocks. For each block, send the actual value of the slope, so when coding each symbol we know its maximum value.
Code the common values in a block with statistical coding (Huffman etc.). In this case, the cutoff with an alphabet of 256 would be around 17 occurrences.
For less common values, we reserve a small part of the alphabet for sending the amount of bits in the value.
When we encounter a rare value, its bits are coded without statistical modeling. The topmost bit can be omitted, since we know it's always 1 (unless value is '0').
Usually the range of values to be coded is not a power-of-2. For example, if we have 10 choices, this requires 4 bits to code, but there are 6 unused bit patterns - sometimes we only need 3 bits. The first 6 choices we code directly with 3 bits. If it's 7 or 8, we send an extra bit to indicate if we meant 9 or 10.
Additionally, we could exclude any value that is directly coded from the list of possible values. Otherwise we have two ways to code the same value, which is redundant.
As I suggested in my comment you can represent your data as 8bit. There are simple ways on how to do it efficiently no need for modular arithmetics..
You can use union or pointers for this so for example in C++ if you have:
unsigned int data32[]={0,0,0,...};
unsigned char *data08=data32;
Or you can copy it to 4 BYTE array but that will be slower.
If you have to use modular arithmetics for any reasons then you might want to do it like this:
x &255
(x>> 8)&255
(x>>16)&255
(x>>24)&255
Now I have tried LZW on your new data and the compression ratio result without any data reordering (single LZW) was 81-82% (depending on dictionary size I suggest to use 10bit LZW dictionary) which is not as good as expected. So I reordered the data into 4 arrays (just like you did) so first array has lowest 8bits and last the highest. The results with 12 bit dictionary where:
ratio08: 144%
ratio08: 132%
ratio08: 26%
ratio08: 0%
total: 75%
The results with 10 bit dictionary where:
ratio08: 123%
ratio08: 117%
ratio08: 28%
ratio08: 0%
total: 67%
Showing that LZW is bad for lowest bytes (and with increasing size it will be worse for higher bytes too) So use it only for the higher BYTEs which would improve the compress ratio more.
However I expect huffman should lead to much better results so I computed entropy for your data:
H32 = 5.371071 , H/8 = 0.671384
H08 = 7.983666 , H/8 = 0.997958
H08 = 7.602564 , H/8 = 0.950321
H08 = 1.902525 , H/8 = 0.237816
H08 = 0.000000 , H/8 = 0.000000
total: 54%
meaning naive single huffman encoding would have compress ratio 67% and the separate 4 arrays would lead to 54% which is much better so in your case I would go for huffman encoding. After I implemented it here the result:
[Huffman]
ratio08 = 99.992%
ratio08 = 95.400%
ratio08 = 24.706%
ratio08 = 0.000%
total08 = 55.025%
ratio32 = 67.592%
Which closely matches the estimation by Shannon entropy as expected (not accounting the decoding table) ...
However with very big datasets I expect naive huffman will start to get slightly better than the separate 4x huffman ...
Also note that the result where truncated so those 0% are not zero but something less than 1% ...
[Edit1] 300 000 000 entries estimation
so to simulate the conditions for 300M 32bit numbers of yours I use 16bit numbers sub part of your data with similar "empty space" properties.
log2(300 000 000) = ~28
28/32 * 16 = 14
so I use only 2^14 16bit numbers which should have similar properties as your 300M 32 bit numbers The 8bit Huffman encoding leads to:
ratio08 = 97.980%
ratio08 = 59.534%
total08 = 78.757%
So I estimate 80% ratio between encoded/decoded sizes ~1.25 size reduction.
(Hope I did not screw something up with my assumptions).
The data you are dealing with is "nearly" sorted, so you can use that to great effect with delta encoding.
A simple approach is as follows:
Look for runs of data, denoted by R_i = (v,l,N) where l is the length of the run, N is the bit-depth needed to do delta encoding on the sorted run, and v is the value of the first element of the (sorted) run (needed for delta encoding.) The run itself then just needs to store 2 pieces of information for each entry in the run: the idx of each sorted element in the run and the delta. Note, to store the idx of each sorted element, only log_2(l) bits are needed per idx, where l is the length of the run.
The encoding works by attempting to find the least number of bits to fully encode the run when compared to the number of bytes used in its uncompressed form. In practice, this can be implemented by finding the longest run that is encoded for a fixed number of bytes per element.
To decode, simply decode run-by-run (in order) first decoding the delta coding/compression, then undoing the sort.
Here is some C++ code that computes the compression ratio that can be obtained using this scheme on the data sample you posted. The implementation takes a greedy approach in selecting the runs, it is possible slightly better results are available if a smarter approach is used.
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <queue>
#include "data.h"
template <int _N, int _M> struct T {
constexpr static int N = _N;
constexpr static int M = _M;
uint16_t idx : N;
uint16_t delta : M;
};
template <int _N, int _M>
std::pair<int32_t, int32_t> best_packed_run_stats(size_t idx) {
const int N = 1 << _N;
const int M = 1 << _M;
static std::vector<int32_t> buffer(N);
if (idx + N >= data.size())
return {-1, 0};
std::copy(&data[idx], &data[idx + N], buffer.data());
std::sort(buffer.begin(), buffer.end());
int32_t run_len = 0;
for (size_t i = 1; i < N; ++i, ++run_len) {
auto delta = buffer[i] - buffer[i - 1];
assert(delta >= 0);
if (delta >= M) {
break;
}
}
int32_t savings = run_len * (sizeof(int32_t) - sizeof(T<_N, _M>)) -
1 // 1 byte to store bit-depth
- 2; // 2 bytes to store run length
return {savings, run_len};
}
template <class... Args>
std::vector<std::pair<int32_t, int32_t>> all_runs_stats(size_t idx) {
return {best_packed_run_stats<Args::N, Args::M>(idx)...};
}
int main() {
size_t total_savings = 0;
for (size_t i = 0; i < data.size(); ++i) {
auto runs =
all_runs_stats<T<2, 14>, T<4, 12>, T<8, 8>, T<12, 4>, T<14, 2>>(i);
auto best_value = *std::max_element(runs.begin(), runs.end());
total_savings += best_value.first;
i += best_value.second;
}
size_t uncomp_size = data.size() * sizeof(int32_t);
double comp_ratio =
(uncomp_size - (double)total_savings) / (double)uncomp_size;
printf("uncomp_size: %lu\n", uncomp_size);
printf("compression: %lf\n", comp_ratio);
printf("size: %lu\n", data.size());
}
Note, only certain fixed configurations of 16-bit representations of elements in a run are attempted. Because of this we should expect the best possible compression we can achieve is 50% (i.e. 4 bytes -> 2 bytes.) In reality, there is overhead.
This code when run on the data sample you supplied reports this compression ration:
uncomp_size: 99908
compression: 0.505785
size: 24977
which is very close to the theoretical limit of .5 for this compression algorithm.
Also, note, that this slightly beats out the Shannon entropy estimate reported in another answer.
Edit to address Mark Adler's comment below.
Re-running this compression on the larger data-set provided (compression2.txt) along with comparing to Mark Adler's approach here are the results:
uncomp_size: 2602628
compression: 0.507544
size: 650657
bit compression: 0.574639
Where bit compression is the compression ratio of Mark Adler's approach. As noted by others, compressing the bits of each entry will not scale well for large data, we should expect the ratio to get worse with n.
Meanwhile the delta + sorting compression described above maintains close to its theoretical best of .5.

Algorithm for visiting all grid cells in pseudo-random order that has a guaranteed uniformity at any stage

Context:
I have a hydraulic erosion algorithm that needs to receive an array of droplet starting positions. I also already have a pattern replicating algorithm, so I only need a good pattern to replicate.
The Requirements:
I need an algorism that produces a set of n^2 entries in a set of format (x,y) or [index] that describe cells in an nxn grid (where n = 2^i where i is any positive integer).
(as a set it means that every cell is mentioned in exactly one entry)
The pattern [created by the algorism ] should contain zero to none clustering of "visited" cells at any stage.
The cell (0,0) is as close to (n-1,n-1) as to (1,1), this relates to the definition of clustering
Note
I was/am trying to find solutions through fractal-like patterns built through recursion, but at the time of writing this, my solution is a lookup table of a checkerboard pattern(list of black cells + list of white cells) (which is bad, but yields fewer artifacts than an ordered list)
C, C++, C#, Java implementations (if any) are preferred
You can use a linear congruential generator to create an even distribution across your n×n space. For example, if you have a 64×64 grid, using a stride of 47 will create the pattern on the left below. (Run on jsbin) The cells are visited from light to dark.
That pattern does not cluster, but it is rather uniform. It uses a simple row-wide transformation where
k = (k + 47) mod (n * n)
x = k mod n
y = k div n
You can add a bit of randomness by making k the index of a space-filling curve such as the Hilbert curve. This will yield the pattern on the right. (Run on jsbin)
     
     
You can see the code in the jsbin links.
I have solved the problem myself and just sharing my solution:
here are my outputs for the i between 0 and 3:
power: 0
ordering:
0
matrix visit order:
0
power: 1
ordering:
0 3 2 1
matrix visit order:
0 3
2 1
power: 2
ordering:
0 10 8 2 5 15 13 7 4 14 12 6 1 11 9 3
matrix visit order:
0 12 3 15
8 4 11 7
2 14 1 13
10 6 9 5
power: 3
ordering:
0 36 32 4 18 54 50 22 16 52 48 20 2 38 34 6
9 45 41 13 27 63 59 31 25 61 57 29 11 47 43 15
8 44 40 12 26 62 58 30 24 60 56 28 10 46 42 14
1 37 33 5 19 55 51 23 17 53 49 21 3 39 35 7
matrix visit order:
0 48 12 60 3 51 15 63
32 16 44 28 35 19 47 31
8 56 4 52 11 59 7 55
40 24 36 20 43 27 39 23
2 50 14 62 1 49 13 61
34 18 46 30 33 17 45 29
10 58 6 54 9 57 5 53
42 26 38 22 41 25 37 21
the code:
public static int[] GetPattern(int power, int maxReturnSize = int.MaxValue)
{
int sideLength = 1 << power;
int cellsNumber = sideLength * sideLength;
int[] ret = new int[cellsNumber];
for ( int i = 0 ; i < cellsNumber && i < maxReturnSize ; i++ ) {
// this loop's body can be used for per-request computation
int x = 0;
int y = 0;
for ( int p = power - 1 ; p >= 0 ; p-- ) {
int temp = (i >> (p * 2)) % 4; //2 bits of the index starting from the begining
int a = temp % 2; // the first bit
int b = temp >> 1; // the second bit
x += a << power - 1 - p;
y += (a ^ b) << power - 1 - p;// ^ is XOR
// 00=>(0,0), 01 =>(1,1) 10 =>(0,1) 11 =>(1,0) scaled to 2^p where 0<=p
}
//to index
int index = y * sideLength + x;
ret[i] = index;
}
return ret;
}
I do admit that somewhere along the way the values got transposed, but it does not matter because of how it works.
After doing some optimization I came up with this loop body:
int x = 0;
int y = 0;
for ( int p = 0 ; p < power ; p++ ) {
int temp = ( i >> ( p * 2 ) ) & 3;
int a = temp & 1;
int b = temp >> 1;
x = ( x << 1 ) | a;
y = ( y << 1 ) | ( a ^ b );
}
int index = y * sideLength + x;
(the code assumes that c# optimizer, IL2CPP, and CPP compiler will optimize variables temp, a, b out)

How to rotate a centered hexagonal bitboard?

Consider the following centered hexagonal bitboard representation (padding is in boldface):
56
55 49
54 48 42
53 47 41 35
52 46 40 34 28
45 39 33 27
44 38 32 26 20
37 31 25 19
36 30 24 18 12
29 23 17 11
28 22 16 10 04
21 15 09 03
20 14 08 02 60
13 07 01 59
06 00 58
63 57
56
This representation fits in a 64-bit integer and allows for easy movement in the 6 hexagonal directions by rotating bits 1, 7 or 8 spaces to the right or to the left respectively. If it helps with visualization, you can deform this hexagon into a square:
42 43 44 45 46 47 48
35 36 37 38 39 40 41
28 29 30 31 32 33 34
21 22 23 24 25 26 27
14 15 16 17 18 19 20
07 08 09 10 11 12 13
00 01 02 03 04 05 06
Now, what I want to do is rotate this bitboard 60° clockwise, such that the [45,46,47,38,39,31] triangle becomes the [48,41,34,40,33,32] triangle, etc. How do I do this?
This permutation is kind of a mess, with every relevant bit having a distinct move-distance. The permutation diagram looks like this (top row is output):
That does suggest some approaches though. If we look near the top, every "group" is formed by gathering some bits from the input in ascending order, so it can be done with 7 compress_right operations aka PEXT which is efficient on Intel (not so efficient on AMD so far). What that really comes down to is sampling the vertical columns, so extracting bits with a stride of 8.
So if PEXT is acceptable, it could be done like this (not tested):
uint64_t g0 = _pext_u64(in, 0x8080808);
uint64_t g1 = _pext_u64(in, 0x404040404);
uint64_t g2 = _pext_u64(in, 0x20202020202);
uint64_t g3 = _pext_u64(in, 0x1010101010101);
uint64_t g4 = _pext_u64(in, 0x808080808080);
uint64_t g5 = _pext_u64(in, 0x404040404000);
uint64_t g6 = _pext_u64(in, 0x202020200000);
uint64_t out = g0 | (g1 << 7) | (g2 << 14) | (g3 << 21) |
(g4 << 28) | (g5 << 35) | (g6 << 42);
This permutation is not routable by a butterfly network, but Beneš networks are universal so that will work.
So it can be done with 11 of these permute steps, also known as delta swaps:
word bit_permute_step(word source, word mask, int shift) {
word t;
t = ((source >> shift) ^ source) & mask;
return (source ^ t) ^ (t << shift);
}
There is some choice in how to create the exact masks, but this works:
x = bit_permute_step(x, 0x1001400550054005, 1);
x = bit_permute_step(x, 0x2213223111023221, 2);
x = bit_permute_step(x, 0x01010B020104090E, 4);
x = bit_permute_step(x, 0x002900C400A7007B, 8);
x = bit_permute_step(x, 0x00000A0400002691, 16);
x = bit_permute_step(x, 0x0000000040203CAD, 32);
x = bit_permute_step(x, 0x0000530800001CE0, 16);
x = bit_permute_step(x, 0x000C001400250009, 8);
x = bit_permute_step(x, 0x0C00010403080104, 4);
x = bit_permute_step(x, 0x2012000011100100, 2);
x = bit_permute_step(x, 0x0141040000000010, 1);

How do I make this program work for input >10 for the USACO Training Pages Square Palindromes?

Problem Statement -
Given a number base B (2 <= B <= 20 base 10), print all the integers N (1 <= N <= 300 base 10) such that the square of N is palindromic when expressed in base B; also print the value of that palindromic square. Use the letters 'A', 'B', and so on to represent the digits 10, 11, and so on.
Print both the number and its square in base B.
INPUT FORMAT
A single line with B, the base (specified in base 10).
SAMPLE INPUT
10
OUTPUT FORMAT
Lines with two integers represented in base B. The first integer is the number whose square is palindromic; the second integer is the square itself. NOTE WELL THAT BOTH INTEGERS ARE IN BASE B!
SAMPLE OUTPUT
1 1
2 4
3 9
11 121
22 484
26 676
101 10201
111 12321
121 14641
202 40804
212 44944
264 69696
My code works for all inputs <=10, however, gives me some weird output for inputs >10.
My Code-
#include<iostream>
#include<cstdio>
#include<cmath>
using namespace std;
int baseToBase(int num, int base) //accepts a number in base 10 and the base to be converted into as arguments
{
int result=0, temp=0, i=1;
while(num>0)
{
result = result + (num%base)*pow(10, i);
i++;
num = num/base;
}
result/=10;
return result;
}
long long int isPalin(int n, int base) //checks the palindrome
{
long long int result=0, temp, num=n*n, x=n*n;
num = baseToBase(num, base);
x = baseToBase(x, base);
while(num)
{
temp=num%10;
result = result*10 + temp;
num/=10;
}
if(x==result)
return x;
else
return 0;
}
int main()
{
int base, i, temp;
long long int sq;
cin >> base;
for(i=1; i<=300; i++)
{
temp=baseToBase(i, base);
sq=isPalin(i, base);
if(sq!=0)
cout << temp << " " << sq << endl;
}
return 0;
}
For input = 11, the answer should be
1 1
2 4
3 9
6 33
11 121
22 484
24 565
66 3993
77 5335
101 10201
111 12321
121 14641
202 40804
212 44944
234 53535
While my answer is
1 1
2 4
3 9
6 33
11 121
22 484
24 565
66 3993
77 5335
110 10901
101 10201
111 12321
121 14641
209 40304
202 40804
212 44944
227 50205
234 53535
There is a difference in my output and the required one as 202 shows under 209 and 110 shows up before 101.
Help appreciated, thanks!
a simple example for B = 11 to show error in your base conversion is for i = 10 temp should be A but your code calculates temp = 10. Cause in we have only 10 symbols 0-9 to perfectly show every number in base 10 or lower but for bases greater than that you have to use other symbols to represent a different digit like 'A', 'B' and so on. problem description clearly states that. Hope You will be able to fix your code now by modifying your int baseToBase(int num, int base)function.

Resources