Why do the following two pieces of code produce different results? - caching

I have two similar pieces of code below.
1st code:
unsigned long size = 256*1024*1024;
unsigned long stride = 256;
void *array = (void*)malloc(size);
for (unsigned long off = 0; off < size; off+=stride) {
*(unsigned int*)(array+off) = off+stride;
}
*(unsigned int*)(array+off)=0;
int i=10000000;
struct timeval start, end;
gettimeofday(&start, NULL);
while (i>=1) {
offset = *(unsigned int*)(array+off);
i--;
}
gettimeofday(&end, NULL);
*(volatile unsigned int*)(array+offset);
printf("%.2f\n", (end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec));
2nd code:
unsigned long size = 256*1024*1024;
unsigned long stride = 256;
void *array = (void*)malloc(size);
for (unsigned long off = 0; off < size; off+=stride) {
*(unsigned int*)(array+off) = off+stride;
}
*(unsigned int*)(array+off)=0;
int i=10000000;
struct timeval start, end;
gettimeofday(&start, NULL);
#define ONE offset = *(unsigned int*)(array+off);
#define FIVE ONE ONE ONE ONE ONE
#define TEN FIVE FIVE
#define FIFTY TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY
while (i>=1000) {
HUNDRED
HUNDRED
HUNDRED
HUNDRED
HUNDRED
HUNDRED
HUNDRED
HUNDRED
HUNDRED
HUNDRED
i-=1000;
}
gettimeofday(&end, NULL);
*(volatile unsigned int*)(array+offset);
printf("%.2f\n", (end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec));
Question
The only difference between the two pieces of code is the while loop.
Both measure the elapsed time of the while loop.
The first code reports 779,851,000 ns and the second reports 1,624,344,000 ns (2.1 times larger).
I thought this difference came from L1-i cache misses, so I measured L1-i cache misses with perf.
However, the first code has 34,541 L1-i cache misses and the second has 43,078 (1.2 times larger).
This result cannot fully explain the difference in elapsed time for the while loop.
What causes the big difference between the elapsed times of the two pieces of code?
Is there anything I am missing?
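Both listings derive the elapsed time from two gettimeofday() samples and then pass the integer result to a %.2f conversion, which is undefined behavior in C. A minimal sketch of the same computation with a matching integer type follows. The helper names elapsed_us and time_call_us are hypothetical and do not appear in the question.

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: microseconds elapsed between two gettimeofday()
 * samples, returned as an integer so it can be printed with %lld. */
long long elapsed_us(struct timeval start, struct timeval end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000LL
         + (long long)(end.tv_usec - start.tv_usec);
}

/* Hypothetical wrapper: times one call of fn and returns microseconds. */
long long time_call_us(void (*fn)(void))
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    fn();                       /* code under test */
    gettimeofday(&end, NULL);
    return elapsed_us(start, end);
}
```

The result is in microseconds, so it would be printed with something like printf("%lld us\n", time_call_us(my_loop)), where my_loop is whatever function wraps the while loop.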


Related

Metal Compute function causes GPU timeout error

I am trying to compute the Collatz conjecture for a range of numbers to see how much I can benefit from using the GPU. For some reason, the function seems to fail for integers above one hundred million. I use 64-bit unsigned long for the calculations, so it can't be integer overflow; the largest number reached in the calculations for any integer is well below the maximum representable value for this datatype.
The application is basically Apple's Performing Calculations on a GPU, where the array buffers and fragment function are the only things changed. The basic idea for the function is to pass an array of integers (say from 1 to 1000), where each integer serves as a starting point for a while-loop performing the Collatz calculations for every number until the thread reaches a top set limit, for example, a billion.
kernel void compute_collatz(device const unsigned int *array [[buffer(0)]],
device unsigned int *result [[buffer(1)]],
uint index [[thread_position_in_grid]])
{
const unsigned long arrayLength = (unsigned long)1000;
unsigned long arrayNumber = (unsigned long)array[index];
unsigned long maxNumber = (unsigned long)1000000000 - arrayLength;
while (arrayNumber <= maxNumber) {
unsigned long curentStep = arrayNumber;
while (curentStep != 1) {
if (curentStep % 2 == 0) {curentStep = curentStep / 2;}
else {curentStep = (curentStep * 3) + 1;}
}
arrayNumber += arrayLength;
}
result[index] = (int)arrayNumber;
}
When the function reaches the maximum limit, it stores that value in a different array. This works well when the maximum value is set to one hundred million or less, but only about one third of the array is changed when I try higher values. The program fails with the following error: Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF code 2). I had a similar problem when I tried multithreading on the CPU for the first time, but then the problem was related to pointers. I can't see how that would be the problem here.
I am using Xcode 12.5.1 on macOS 11.5.2 on a MacBook Pro 16 (2016). Any help is much appreciated!
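To get a feeling for how much work a single thread of the kernel above performs, its loop can be mirrored on the CPU. This is only a sketch in plain C, not Metal. The names collatz_steps and thread_work, and the total_steps output parameter, are assumptions added for illustration.

```c
#include <stdint.h>

/* Number of Collatz steps needed to reach 1 from n.
 * n must be >= 1: starting from 0 the loop would never terminate. */
uint64_t collatz_steps(uint64_t n)
{
    uint64_t steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        steps++;
    }
    return steps;
}

/* CPU stand-in for one GPU thread: walks start, start + stride, ... while
 * the value stays <= limit - stride, as in the kernel. Returns the first
 * value past that bound and adds the inner-loop work to *total_steps. */
uint64_t thread_work(uint64_t start, uint64_t stride, uint64_t limit,
                     uint64_t *total_steps)
{
    uint64_t number = start;
    uint64_t max_number = limit - stride;
    while (number <= max_number) {
        *total_steps += collatz_steps(number);
        number += stride;
    }
    return number;
}
```

With a stride of 1000 and a limit of one billion, each thread walks roughly one million starting values inside a single kernel invocation. At one hundred million it is roughly one hundred thousand.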

Cannot understand the metric returned by "perf" regarding the cache-misses

My question is about understanding the Linux perf tool metrics. I made an optimisation related to prefetching/cache misses in my code, which is now faster. However, perf does not show me that (or, more likely, I do not understand what perf shows me).
Taking it back to where it all began. I did an investigation in order to speed up random memory access using prefetch.
Here is what my program does:
It uses two int buffers of the same size
It reads one-by-one all the values of the first buffer
each value is a random index in the second buffer
It reads the value at the index in the second buffer
It sums all the values taken from the second buffer
It does all the previous steps for bigger and bigger
At the end, I print the number of voluntary and involuntary CPU context switches
After my last tunings, my code is the following one:
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#include <sched.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 50000)
#define PADDING 256
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + PADDING) * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
double sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
unsigned int index = indexBuffer[i];
unsigned int value = valueBuffer[index];
double s = sin(value);
sum += s;
}
return (sum);
}
unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer();
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);
gettimeofday(&endTime, NULL);
printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
void testWithPrefetchStep(unsigned short prefetchStep)
{
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
void iterateOnPrefetchSteps()
{
printf("sizeof buffers = %luMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
{
testWithPrefetchStep(prefetchStep);
}
}
void setCpuAffinity(int cpuId)
{
int pid=0;
cpu_set_t mask;
unsigned int len = sizeof(mask);
CPU_ZERO(&mask);
CPU_SET(cpuId,&mask);
sched_setaffinity(pid, len, &mask);
}
int main(int argc, char ** argv)
{
setCpuAffinity(7);
if (argc == 2)
{
testWithPrefetchStep(atoi(argv[1]));
}
else
{
iterateOnPrefetchSteps();
}
}
At the end of my previous Stack Overflow question I thought I had all the elements: to avoid cache misses I made my code prefetch data (using __builtin_prefetch), and my program was faster. Everything looked as normal as possible.
However, I wanted to study it using the Linux perf tool. So I launched a comparison between two executions of my program:
./TestPrefetch 0: here the prefetch is ineffective because it targets the data that is read immediately afterwards (when the data is accessed, it cannot have been loaded into the CPU cache yet). Run duration: 21.346 seconds
./TestPrefetch 1: Here the prefetch is far more efficient because data is fetched one loop-iteration before it is read. Run duration: 12.624 seconds
The perf outputs are the following ones:
$ gcc -O3 TestPrefetch.c -o TestPrefetch -lm && for cpt in 0 1; do echo ; echo "### Step=$cpt" ; sudo perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./TestPrefetch $cpt; done
### Step=0
prefetchStep = 0, Sum = -1107.523504 - Time: 21346278 micro-seconds = 21.346 seconds
Performance counter stats for './TestPrefetch 0':
24387,010283 task-clock (msec) # 1,000 CPUs utilized
97 274 163 155 cycles # 3,989 GHz
59 183 107 508 instructions # 0,61 insn per cycle
425 300 823 cache-references # 17,440 M/sec
249 261 530 cache-misses # 58,608 % of all cache refs
24,387790203 seconds time elapsed
### Step=1
prefetchStep = 1, Sum = -1107.523504 - Time: 12623665 micro-seconds = 12.624 seconds
Performance counter stats for './TestPrefetch 1':
15662,864719 task-clock (msec) # 1,000 CPUs utilized
62 159 134 934 cycles # 3,969 GHz
59 167 595 107 instructions # 0,95 insn per cycle
484 882 084 cache-references # 30,957 M/sec
321 873 952 cache-misses # 66,382 % of all cache refs
15,663437848 seconds time elapsed
Here, I have difficulty understanding why I am faster:
The number of cache-misses is almost the same (I even have a little bit more): I can't understand why and (overall) if so, why am I faster?
what are the cache-references?
what are task-clock and cycles? Do they include the time waiting for data-access in case of cache miss?
I wouldn't trust the perf summary, as it's not very clear what each name represents and which hardware counters they are programmed to follow. The default settings have also been known to count the wrong things (see - https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604)
What could be happening here is that your cache miss counter also counts the prefetch instructions (which may look like loads to the machine, especially as you descend in the cache hierarchy). In that case, having more cache references (lookups) makes sense, and you would expect these requests to be misses (the whole point of a prefetch is to miss...).
Instead of relying on some ambiguous counter, find out the counter IDs and masks for your specific machine that represent demand-read lookups and misses, and see if they improved.
Edit: looking at your numbers again, I see an increase of ~50M accesses, but ~70M misses. It's possible that there are more misses due to cache thrashing caused by the prefetches.
can't understand why and (overall) if so, why am I faster?
Because you execute more instructions per cycle. The old one:
0,61 insn per cycle
and the new one
0,95 insn per cycle
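The two ratios quoted above can be recomputed from the raw counters in the perf output. The sketch below only redoes that arithmetic. The struct perf_run type and the STEP0/STEP1 constants are names introduced for illustration, filled with the numbers from the two runs in the question.

```c
#include <stdio.h>

/* Raw counters copied from one "perf stat" run. */
struct perf_run {
    double cycles;
    double instructions;
    double cache_references;
    double cache_misses;
};

/* Values from the question: prefetchStep = 0 and prefetchStep = 1. */
const struct perf_run STEP0 = { 97274163155.0, 59183107508.0,
                                425300823.0, 249261530.0 };
const struct perf_run STEP1 = { 62159134934.0, 59167595107.0,
                                484882084.0, 321873952.0 };

/* Instructions per cycle ("insn per cycle" in the perf output). */
double ipc(const struct perf_run *r)
{
    return r->instructions / r->cycles;
}

/* Fraction of cache references that missed ("% of all cache refs"). */
double miss_rate(const struct perf_run *r)
{
    return r->cache_misses / r->cache_references;
}

void print_run(const char *label, const struct perf_run *r)
{
    printf("%s: IPC = %.2f, miss rate = %.1f %%\n",
           label, ipc(r), 100.0 * miss_rate(r));
}
```

The instruction count is nearly identical in both runs while the cycle count drops from about 97 billion to about 62 billion, which is where the higher IPC and the shorter run time come from.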
what are the cache-references?
This counts how many times the cache was asked whether it contains the data you were loading or storing.
what are task-clock and cycles? Do they include the time waiting for data-access in case of cache miss?
Yes. But note that on today's processors the core does not necessarily sit idle during such a wait. Instructions are executed out of order and data is usually prefetched, so if the next instruction needs data that is not ready, other independent instructions can be executed in the meantime.
I recently made progress on my perf issues. I discovered a lot of new events, some of which are really interesting.
Regarding the current problem, the following event has to be considered: L1-icache-load-misses
When I monitor my test application with perf under the same conditions as before, I get the following values for this event:
1 202 210 L1-icache-load-misses
against
530 127 L1-icache-load-misses
For the moment, I do not yet understand why cache-misses events are not impacted by prefetches while L1-icache-load-misses are...

Speed up random memory access using prefetch

I am trying to speed up a single program by using prefetches. The purpose of my program is just for test. Here is what it does:
It uses two int buffers of the same size
It reads one-by-one all the values of the first buffer
It reads the value at the index in the second buffer
It sums all the values taken from the second buffer
It does all the previous steps for bigger and bigger
At the end, I print the number of voluntary and involuntary CPU context switches
At first, each value in the first buffer is equal to its own index (cf. function createIndexBuffer in the code just below).
It will be more clear in the code of my program:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 100000)
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = i;
}
return (indexBuffer);
}
unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
unsigned long long sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
unsigned int index = indexBuffer[i];
sum += valueBuffer[index];
}
return (sum);
}
unsigned int computeTimeInMicroSeconds()
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer();
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
unsigned long long sum = computeSum(indexBuffer, valueBuffer);
gettimeofday(&endTime, NULL);
printf("Sum = %llu\n", sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
int main()
{
printf("sizeof buffers = %luMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds();
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
If I launch it, I get the following output:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439813150288855829
Time: 201172 micro-seconds = 0.201 seconds
Quick and fast!!!
As far as I know (I may be wrong), one of the reasons for having such a fast program is that, as I access my two buffers sequentially, data can be prefetched into the CPU cache.
We can make it more complex so that data is (almost) never prefetched into the CPU cache. For example, we can just change the createIndexBuffer function to:
unsigned int * createIndexBuffer()
{
unsigned int * indexBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
Let's try the program once again:
$ gcc TestPrefetch.c -O3 -o TestPrefetch && ./TestPrefetch
sizeof buffers = 1562Mb
Sum = 439835307963131237
Time: 3730387 micro-seconds = 3.730 seconds
More than 18 times slower!!!
We now arrive at my problem. Given the new createIndexBuffer function, I would like to speed up the computeSum function using prefetch:
unsigned long long computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer)
{
unsigned long long sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &indexBuffer[i + 1], 0, 0);
unsigned int index = indexBuffer[i];
sum += valueBuffer[index];
}
return (sum);
}
Of course, I also have to change createIndexBuffer so that it allocates a buffer with one more element.
I relaunch my program: no better! As a prefetch may take longer than one "for" loop iteration, I tried prefetching two elements ahead instead of one:
__builtin_prefetch((char *) &indexBuffer[i + 2], 0, 0);
No better! Two loop iterations? No better. Three? I tried up to 50 (!!!) but I cannot improve the performance of my computeSum function.
I would like help understanding why.
Thank you very much for your help
I believe that the above code is already optimized automatically by the CPU, leaving no room for manual optimization.
1. The main problem is that indexBuffer is accessed sequentially. The hardware prefetcher detects this and prefetches further values automatically, without any need to call prefetch manually. So, during iteration #i, values indexBuffer[i+1], indexBuffer[i+2],... are already in cache. (By the way, there is no need to add an artificial element to the end of the array: memory access errors are silently ignored by prefetch instructions.)
What you really need to do is to prefetch valueBuffer instead:
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + 1]], 0, 0);
2. But adding the above line of code won't help either in such a simple scenario. A memory access costs hundreds of cycles, while an add instruction costs ~1 cycle. Your code already spends 99% of its time in memory accesses. Adding a manual prefetch will make it this one cycle faster and no better.
Manual prefetch would really work well if your math were much heavier (try it), for example an expression with a large number of divisions that are not optimized out (20-30 cycles each) or a call to some math function (log, sin).
3. But even this isn't guaranteed to help. The dependency between loop iterations is very weak; it exists only via the sum variable. This allows the CPU to execute instructions speculatively: it may start fetching valueBuffer[i+1] while still executing the math for valueBuffer[i].
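The valueBuffer prefetch suggested in point 1 could be sketched as below. One caveat: the prefetch instruction itself does not fault, but indexBuffer[i + step] is an ordinary load, so that read has to stay inside the array. This sketch uses a bounds check for that, which is an assumption of the example rather than something taken from the answer. The function name sum_with_prefetch and its parameters are hypothetical, and __builtin_prefetch is a GCC/Clang builtin.

```c
#include <stddef.h>

/* Sums value[index[i]] for i in [0, n), prefetching the value that will be
 * needed `step` iterations later. */
unsigned long long sum_with_prefetch(const unsigned int *index,
                                     const unsigned int *value,
                                     size_t n, size_t step)
{
    unsigned long long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* index[i + step] is a real load and must stay in bounds;
         * only the prefetch itself is free of faults. */
        if (i + step < n)
            __builtin_prefetch(&value[index[i + step]], 0, 0);
        sum += value[index[i]];
    }
    return sum;
}
```

Whether the extra branch costs more than padding the index buffer would is something to measure. The prefetch only changes timing, so the returned sum is the same for every step.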
A prefetch normally fetches a full cache line, which is typically 64 bytes. So the random example always fetches 64 bytes for a 4-byte int: 16 times the data you actually need, which fits very well with the slowdown by a factor of 18. So the code is simply limited by memory throughput and not latency.
Sorry. What I gave you was not the correct version of my code. The correct version is, what you said:
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
However, even with the right version, it is unfortunately not better
Then I adapted my program to try your suggestion using the sin function.
My adapted program is the following one:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/time.h>
#include <math.h>
#define BUFFER_SIZE ((unsigned long) 4096 * 50000)
unsigned int randomUint()
{
int value = rand() % UINT_MAX;
return value;
}
unsigned int * createValueBuffer()
{
unsigned int * valueBuffer = (unsigned int *) malloc(BUFFER_SIZE * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
valueBuffer[i] = randomUint();
}
return (valueBuffer);
}
unsigned int * createIndexBuffer(unsigned short prefetchStep)
{
unsigned int * indexBuffer = (unsigned int *) malloc((BUFFER_SIZE + prefetchStep) * sizeof(unsigned int));
for (unsigned long i = 0 ; i < BUFFER_SIZE ; i++)
{
indexBuffer[i] = rand() % BUFFER_SIZE;
}
return (indexBuffer);
}
double computeSum(unsigned int * indexBuffer, unsigned int * valueBuffer, unsigned short prefetchStep)
{
double sum = 0;
for (unsigned int i = 0 ; i < BUFFER_SIZE ; i++)
{
__builtin_prefetch((char *) &valueBuffer[indexBuffer[i + prefetchStep]], 0, 0);
unsigned int index = indexBuffer[i];
sum += sin(valueBuffer[index]);
}
return (sum);
}
unsigned int computeTimeInMicroSeconds(unsigned short prefetchStep)
{
unsigned int * valueBuffer = createValueBuffer();
unsigned int * indexBuffer = createIndexBuffer(prefetchStep);
struct timeval startTime, endTime;
gettimeofday(&startTime, NULL);
double sum = computeSum(indexBuffer, valueBuffer, prefetchStep);
gettimeofday(&endTime, NULL);
printf("prefetchStep = %d, Sum = %f - ", prefetchStep, sum);
free(indexBuffer);
free(valueBuffer);
return ((endTime.tv_sec - startTime.tv_sec) * 1000 * 1000) + (endTime.tv_usec - startTime.tv_usec);
}
int main()
{
printf("sizeof buffers = %luMb\n", BUFFER_SIZE * sizeof(unsigned int) / (1024 * 1024));
for (unsigned short prefetchStep = 0 ; prefetchStep < 250 ; prefetchStep++)
{
unsigned int timeInMicroSeconds = computeTimeInMicroSeconds(prefetchStep);
printf("Time: %u micro-seconds = %.3f seconds\n", timeInMicroSeconds, (double) timeInMicroSeconds / (1000 * 1000));
}
}
The output is:
$ gcc TestPrefetch.c -O3 -o TestPrefetch -lm && taskset -c 7 ./TestPrefetch
sizeof buffers = 781Mb
prefetchStep = 0, Sum = -1107.523504 - Time: 20895326 micro-seconds = 20.895 seconds
prefetchStep = 1, Sum = 13456.262424 - Time: 12706720 micro-seconds = 12.707 seconds
prefetchStep = 2, Sum = -20179.289469 - Time: 12136174 micro-seconds = 12.136 seconds
prefetchStep = 3, Sum = 12068.302534 - Time: 11233803 micro-seconds = 11.234 seconds
prefetchStep = 4, Sum = 21071.238160 - Time: 10855348 micro-seconds = 10.855 seconds
prefetchStep = 5, Sum = -22648.280105 - Time: 10517861 micro-seconds = 10.518 seconds
prefetchStep = 6, Sum = 22665.381676 - Time: 9205809 micro-seconds = 9.206 seconds
prefetchStep = 7, Sum = 2461.741268 - Time: 11391088 micro-seconds = 11.391 seconds
...
So here, it works better! Honestly, I was almost sure that it would not be better because the cost of the math function is higher than that of the memory access.
If anyone could give me more information about why it is better now, I would appreciate it.
Thank you very much

Binary to decimal (on huge numbers)

I am building a C library for big integers. Basically, I'm looking for a fast algorithm to convert any integer from its binary representation to a decimal one.
I saw JDK's BigInteger.toString() implementation, but it looks quite heavy to me, as it was made to convert the number to any radix (it uses a division for each digit, which should be pretty slow when dealing with thousands of digits).
So if you have any documentation / knowledge to share about it, I would be glad to read it.
EDIT: more details about my question:
Let P be a memory address
Let N be the number of bytes allocated (and set) at P
How can I convert the integer represented by the N bytes at address P (let's say in little endian to make things simpler) to a C string?
Example:
N = 1
P = some random memory address storing '00101010'
out string = "42"
Thanks for your answers anyway
The BigInteger.toString method looks heavy because it does the conversion in chunks.
A trivial algorithm would take the last digit and then divide the whole big integer by the radix until there is nothing left.
One problem with this is that a big integer division is quite expensive, so the number is subdivided into chunks that can be processed with regular integer division (as opposed to BigInt division):
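// Requires: import java.math.BigInteger; assumes bigInt is non-negative.
static String toDecimal(BigInteger bigInt) {
BigInteger chunker = BigInteger.valueOf(1000000000);
StringBuilder sb = new StringBuilder();
do {
// One BigInteger division yields up to nine decimal digits.
int current = bigInt.mod(chunker).intValue();
bigInt = bigInt.divide(chunker);
for (int i = 0; i < 9; i ++) {
sb.append((char) ('0' + current % 10));
current /= 10;
// Stop early on the most significant chunk to avoid leading zeros.
if (current == 0 && bigInt.signum() == 0) {
break;
}
}
} while (bigInt.signum() != 0);
// Digits were appended least significant first.
return sb.reverse().toString();
}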
That said, for a fixed radix, you are probably even better off with porting the "double dabble" algorithm to your needs, as suggested in the comments: https://en.wikipedia.org/wiki/Double_dabble
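Applied to the layout from the question (N little-endian bytes at P), the chunked division by 1,000,000,000 shown in the Java snippet could look roughly like this in C. It is a sketch of that chunking idea, not of double dabble. The function name bytes_to_decimal is hypothetical, and the simple repeated division is quadratic in N, so it only illustrates the idea.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Converts the unsigned integer stored in n little-endian bytes at p into
 * a decimal string. Returns a malloc'ed string (caller frees) or NULL. */
char *bytes_to_decimal(const uint8_t *p, size_t n)
{
    size_t cap = n * 3 + 9;     /* a byte needs < 3 digits; +9 for padding */
    uint8_t *work = malloc(n ? n : 1);
    char *out = malloc(cap + 1);
    if (!work || !out) { free(work); free(out); return NULL; }
    if (n) memcpy(work, p, n);  /* the conversion destroys its input */

    size_t pos = cap;           /* digits are written right to left */
    out[cap] = '\0';
    size_t len = n;
    while (len > 0 && work[len - 1] == 0) len--;   /* drop high zero bytes */

    while (len > 0) {
        /* Divide the whole number by 10^9, most significant byte first.
         * rem < 10^9, so each quotient byte fits in 8 bits. */
        uint64_t rem = 0;
        for (size_t i = len; i-- > 0; ) {
            rem = (rem << 8) | work[i];
            work[i] = (uint8_t)(rem / 1000000000u);
            rem %= 1000000000u;
        }
        for (int d = 0; d < 9; d++) {              /* emit nine digits */
            out[--pos] = (char)('0' + rem % 10);
            rem /= 10;
        }
        while (len > 0 && work[len - 1] == 0) len--;
    }

    while (pos < cap && out[pos] == '0') pos++;    /* strip leading zeros */
    if (pos == cap) out[--pos] = '0';              /* the value was zero */
    memmove(out, out + pos, cap - pos + 1);
    free(work);
    return out;
}
```

For the example in the question, a single byte 00101010 gives the string "42".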
I recently got the challenge to print a big mersenne prime: 2**82589933-1. On my CPU that takes ~40 minutes with apcalc and ~120 minutes with python 2.7. It's a number with 24 million digits and a bit.
Here is my own little C code for the conversion:
// print 2**82589933-1
#include <stdio.h>
#include <math.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
const uint32_t exponent = 82589933;
//const uint32_t exponent = 100;
//outputs 1267650600228229401496703205375
const uint32_t blocks = (exponent + 31) / 32;
const uint32_t digits = (int)(exponent * log(2.0) / log(10.0)) + 10;
uint32_t num[2][blocks];
char out[digits + 1];
// blocks : number of uint32_t in num1 and num2
// num1 : number to convert
// num2 : free space
// out : end of output buffer
void conv(uint32_t blocks, uint32_t *num1, uint32_t *num2, char *out) {
if (blocks == 0) return;
const uint32_t div = 1000000000;
uint64_t t = 0;
for (uint32_t i = 0; i < blocks; ++i) {
t = (t << 32) + num1[i];
num2[i] = t / div;
t = t % div;
}
for (int i = 0; i < 9; ++i) {
*out-- = '0' + (t % 10);
t /= 10;
}
if (num2[0] == 0) {
--blocks;
num2++;
}
conv(blocks, num2, num1, out);
}
int main() {
// prepare number
uint32_t t = exponent % 32;
num[0][0] = (1LLU << t) - 1;
memset(&num[0][1], 0xFF, (blocks - 1) * 4);
// prepare output
memset(out, '0', digits);
out[digits] = 0;
// convert to decimal
conv(blocks, num[0], num[1], &out[digits - 1]);
// output number
char *res = out;
while(*res == '0') ++res;
printf("%s\n", res);
return 0;
}
The conversion is destructive and tail recursive. In each step it divides num1 by 1_000_000_000 and stores the result in num2. The remainder is added to out. Then it calls itself with num1 and num2 switched and often shortened by one (blocks is decremented). out is filled from back to front. You have to allocate it large enough and then strip leading zeroes.
Python seems to be using a similar mechanism for converting big integers to decimal.
Want to do better?
For large numbers like in my case, each division by 1_000_000_000 takes rather long. At a certain size a divide&conquer algorithm does better. In my case the first step would be dividing by 10 ^ 16777216 to split the number into quotient and remainder. Then convert each part separately. Now each part is still big, so split again at 10 ^ 8388608. Recursively keep splitting until the numbers are small enough, say maybe 1024 digits each. Those convert with the simple algorithm above. The right definition of "small enough" would have to be tested; 1024 is just a guess.
While the long division of two big integer numbers is expensive, much more so than a division by 1_000_000_000, the time spent there is then saved because each separate chunk requires far fewer divisions by 1_000_000_000 to convert to decimal.
And if you have split the problem into separate and independent chunks it's only a tiny step away from spreading the chunks out among multiple cores. That would really speed up the conversion another step. It looks like apcalc uses divide&conquer but not multi-threading.
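The shape of the divide&conquer recursion can be shown on a toy scale. In the sketch below a uint64_t stands in for the big integer and a threshold of 4 digits stands in for the guessed 1024, so it demonstrates only the splitting and the zero padding of the low half, not any speed-up. All function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* 10^e for 0 <= e <= 19 (fits in uint64_t). */
uint64_t pow10_u64(int e)
{
    uint64_t p = 1;
    while (e-- > 0) p *= 10;
    return p;
}

/* Writes exactly `width` zero-padded decimal digits of n into out
 * (no terminator). n must be < 10^width. */
void dc_digits(uint64_t n, int width, char *out)
{
    if (width <= 4) {                  /* "small enough": simple algorithm */
        for (int i = width - 1; i >= 0; i--) {
            out[i] = (char)('0' + n % 10);
            n /= 10;
        }
        return;
    }
    int low = width / 2;               /* split at 10^low */
    uint64_t split = pow10_u64(low);
    dc_digits(n / split, width - low, out);          /* high part */
    dc_digits(n % split, low, out + (width - low));  /* low part, padded */
}

/* Converts n to a decimal string; out must hold at least 21 bytes. */
void dc_to_string(uint64_t n, char *out)
{
    char digits[20];                   /* a uint64_t has at most 20 digits */
    dc_digits(n, 20, digits);
    int first = 0;
    while (first < 19 && digits[first] == '0') first++;
    memcpy(out, digits + first, (size_t)(20 - first));
    out[20 - first] = '\0';
}
```

The two recursive calls are independent of each other, which is what would allow the halves to be handed to different cores.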

Strange fseek()/fwrite() performance on MacOS

I have problems with write performance of fseek()/fwrite() on my Mac. I'm operating on large files up to 4 GB of size, tests below were made with a rather small one with only 120 MB. My strategy is as follows:
fopen() a new file on disk
fill the file with zeroes (takes ~3 seconds)
write small blocks of data to random positions (30.000 blocks, 4k each)
The whole procedure takes around 120 seconds.
The write strategy is bound to an image rotation algorithm (see my question here) and unless someone comes up with a faster solution for the rotation problem, I'm not able to change the strategy of using fseek() and then writing 4k or less to the file.
What I am observing is this: The first few thousand fseek()/fwrite() perform quite well, but the performance drops very fast, faster than you would expect from any system cache being filled up. The chart below shows fwrite()s per second vs time in seconds. As you see, after 7 seconds the fseek()/fwrite() rate reaches approx. 200 per second, still going down until it reaches 100 per second at the very end of the process.
In the middle of the process (2 or 3 times), the OS decides to flush file contents to disk which I can see from my console output hanging a few seconds, during that time I have approx. 5 MB/s write on my disk (which isn't that much). After fclose() the system seems to write the whole file, I see 20 MB/s disk activity for a longer period of time.
If I use fflush() every 5.000 fwrite()s, the behaviour doesn't change at all. Putting in fclose()/fopen() to force flushing somehow speeds up the whole thing by approx. 10%.
I did profile the process (screenshot below) and you see, that virtually all time is spent inside fwrite() and fseek() which can be drilled down to __write_nocancel() for both of them.
Completely absurd summary
Imagine the case where my input data fits into my buffers completely and thus I'm able to write my rotated output data linearly without the need to split the write process into fragments. I still use fseek() to position the file pointer, just because the logic of the writing function behaves that way, but the file pointer in this case is set to the same position where it already was. One would expect no performance impact. Wrong.
What is absurd is, if I remove the calls to fseek() for that special case, my function finishes within 2.7 seconds instead of 120 seconds.
Now, after a long foreword, the question is: Why does fseek() have such an impact on performance, even if I seek to the same position? How could I speed it up (by another strategy or other function calls, disabling caching if possible, memory mapped access, ...)?
For reference, here's my code (not tidied up, not optimized, containing lots of debug output):
-(bool)writeRotatedRaw:(TIFF*)tiff toFile:(NSString*)strFile
{
if(!tiff) return NO;
if(!strFile) return NO;
NSLog(@"Starting to rotate '%@'...", strFile);
FILE *f = fopen([strFile UTF8String], "w");
if(!f)
{
NSString *msg = [NSString stringWithFormat:@"Could not open '%@' for writing.", strFile];
NSRunAlertPanel(@"Error", msg, @"OK", nil, nil);
return NO;
}
#define LINE_CACHE_SIZE (1024*1024*256)
int h = [tiff iImageHeight];
int w = [tiff iImageWidth];
int iWordSize = [tiff iBitsPerSample]/8;
int iBitsPerPixel = [tiff iBitsPerSample];
int iLineSize = w*iWordSize;
int iLinesInCache = LINE_CACHE_SIZE / iLineSize;
int iLinesToGo = h, iLinesToRead;
NSLog(@"Creating temporary file");
double time = CACurrentMediaTime();
double lastTime = time;
unsigned char *dummy = calloc(iLineSize, 1);
for(int i=0; i<h; i++) fwrite(dummy, 1, iLineSize, f);
free(dummy);
fclose(f);
f = fopen([strFile UTF8String], "w");
NSLog(@"Created temporary file (%.1f MB) in %.1f seconds", (float)iLineSize*(float)h/1024.0f/1024.0f, CACurrentMediaTime()-time);
fseek(f, 0, SEEK_SET);
lastTime = CACurrentMediaTime();
time = CACurrentMediaTime();
int y=0;
unsigned char *ucRotatedPixels = malloc(iLinesInCache*iWordSize);
unsigned short int *uRotatedPixels = (unsigned short int*)ucRotatedPixels;
unsigned char *ucLineCache = malloc(w*iWordSize*iLinesInCache);
unsigned short int *uLineCache = (unsigned short int*)ucLineCache;
unsigned char *uc;
unsigned int uSizeCounter=0, uMaxSize = iLineSize*h, numfwrites=0, lastwrites=0;
while(iLinesToGo>0)
{
iLinesToRead = iLinesToGo;
if(iLinesToRead>iLinesInCache) iLinesToRead = iLinesInCache;
for(int i=0; i<iLinesToRead; i++)
{
// read as much lines as fit into buffer
uc = [tiff getRawLine:y+i withBitsPerPixel:iBitsPerPixel];
memcpy(ucLineCache+i*iLineSize, uc, iLineSize);
}
for(int x=0; x<w; x++)
{
if(iBitsPerPixel==8)
{
for(int i=0; i<iLinesToRead; i++)
{
ucRotatedPixels[iLinesToRead-i-1] = ucLineCache[i*w+x];
}
fseek(f, w*x+(h-y-1), SEEK_SET);
fwrite(ucRotatedPixels, 1, iLinesToRead, f);
numfwrites++;
uSizeCounter += iLinesToRead;
if(CACurrentMediaTime()-lastTime>1.0)
{
lastTime = CACurrentMediaTime();
NSLog(@"Progress: %.1f %%, x=%d, y=%d, iLinesToRead=%d\t%d", (float)uSizeCounter * 100.0f / (float)uMaxSize, x, y, iLinesToRead, numfwrites);
}
}
else
{
for(int i=0; i<iLinesToRead; i++)
{
uRotatedPixels[iLinesToRead-i-1] = uLineCache[i*w+x];
}
fseek(f, (w*x+(h-y-1))*2, SEEK_SET);
fwrite(uRotatedPixels, 2, iLinesToRead, f);
uSizeCounter += iLinesToRead*2;
if(CACurrentMediaTime()-lastTime>1.0)
{
lastTime = CACurrentMediaTime();
NSLog(@"Progress: %.1f %%, x=%d, y=%d, iLinesToRead=%d\t%d", (float)uSizeCounter * 100.0f / (float)uMaxSize, x, y, iLinesToRead, numfwrites);
}
}
}
y += iLinesInCache;
iLinesToGo -= iLinesToRead;
}
free(ucLineCache);
free(ucRotatedPixels);
fclose(f);
NSLog(@"Finished, %.1f s", (CACurrentMediaTime()-time));
return YES;
}
I'm a bit lost because I do not understand how the system "optimizes" my calls. Any input is appreciated.
Just to somehow close this question, I'll answer it myself and share my solution.
Although I wasn't able to improve the performance of the fseek() calls, I did implement a well-performing workaround. The aim was to avoid fseek() at any cost. I need to write fragments of data to different positions of the target file, but those fragments appear at equal distances and the gaps between them are filled with other fragments written somewhat later in the process, so I split the writing across multiple files. I write to as many files as fragment streams are generated and then, in a last step, re-open all those temporary files, read them in rotation and write the data blocks linearly to the target file. The performance of this is good, reaching approx. 4 seconds for the example given above.
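The final merge step of that workaround might look roughly like the following in plain C. The answer does not show its code, so this is a sketch under assumptions: every stream is taken to hold the same number of equally sized blocks, and the target is taken to interleave them round-robin. The function name merge_streams and its parameters are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the temporary streams in rotation and writes their blocks linearly
 * to target: block 0 of stream 0, block 0 of stream 1, ..., then block 1 of
 * stream 0, and so on. Returns 0 on success, -1 on error. */
int merge_streams(FILE **streams, size_t count, size_t block_size,
                  size_t blocks_per_stream, FILE *target)
{
    unsigned char *buf = malloc(block_size ? block_size : 1);
    if (!buf) return -1;

    /* Start every temporary file from the beginning; this is the only
     * repositioning, everything after it is purely sequential I/O. */
    for (size_t s = 0; s < count; s++)
        rewind(streams[s]);

    for (size_t b = 0; b < blocks_per_stream; b++) {
        for (size_t s = 0; s < count; s++) {
            if (fread(buf, 1, block_size, streams[s]) != block_size ||
                fwrite(buf, 1, block_size, target) != block_size) {
                free(buf);
                return -1;
            }
        }
    }
    free(buf);
    return 0;
}
```

Each temporary file and the target are then only ever read or written front to back, so no fseek() to a computed offset is needed during the merge.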
