Why does Unix block size increase with bigger memory size?

Why does Unix block size increase with bigger memory size? - algorithm

I am profiling binary data which has
increasing Unix block size (one got from stat > Blocks) when the number of events are increased as in the following figure
but the byte distance between events stay constant
I have noticed some changes in other fields of the file which may explain the increasing Unix block size
The unix block size is a dynamic measure.
I am interested in why it is increasing with bigger memory units in some systems.
I have had an idea that it should be constant.
I used different environments to provide the stat output:
Debian Linux 8.1 with its default stat
OSX 10.8.5 with Xcode 6 and its default stat
Greybeard's comment may have the answer to the blocks behaviour:
The stat (1) command used to be a thin CLI to the stat (2) system
call, which used to transfer relevant parts of a file's inode. Pretty
early on, the meaning of the st_blksize member of the C struct
returned by stat (2) was changed to "preferred" blocksize for
efficient file system I/O, which carries well to file systems with
mixed block sizes or non-block oriented allocation.
How can you measure the block size in case (1) and (2) separately?
Why can the Unix block size increase with bigger memory size?

"Stat blocks" is not a block size. It is number of blocks the file consists of. It is obvious that number of blocks is proportional to size. Size of block is constant for most file systems (if not all).

Related

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() noticing the speed drops dramatically at i*4KB. The result is as follow: the Y-axis is the speed(MB/second) and the X-axis is the size of buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the part of 1KB-150KB and 1KB-32KB.
Environment：
CPU : Intel(R) Xeon(R) CPU E5620 # 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Since the performance degradation of these two cases are caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of a cache line.
Here is my code:
void memcpy_speed(unsigned long buf_size, unsigned long iters){
struct timeval start, end;
unsigned char * pbuff_1;
unsigned char * pbuff_2;
pbuff_1 = malloc(buf_size);
pbuff_2 = malloc(buf_size);
gettimeofday(&start, NULL);
for(int i = 0; i < iters; ++i){
memcpy(pbuff_2, pbuff_1, buf_size);
}
gettimeofday(&end, NULL);
printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
free(pbuff_1);
free(pbuff_2);
}
UPDATE
Considering suggestions from #usr, #ChrisW and #Leeor, I redid the test more precisely and the graph below shows the results. The buffer size is from 26KB to 38KB, and I tested it every other 64B(26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 second. The interesting thing is the drop not only occurs exactly in 4KB boundary, but also comes out in 4*i+2 KB, with a much less falling amplitude.
PS
#Leeor offered a way to fill the drop, adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.

Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS would just build the pages to the pagemap, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stores into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up) so you'll see the cache sizes reflect on your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iteration (just make sure you don't time that)
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, i'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet, they can all coexist without overflowing the associativity of the cache. However, everytime you perform an iteration you have a load and a store accessing the same set - i'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iteration to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set, I'm pretty sure there's are a bunch of collisions hiding there.
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).

I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
Usually when I want a performance test like yours I code it as:
// Run in once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat test 3 times to ensure that results are consistent

Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion what you are seeing is the effect of hardware prefetcher not willing to cross page boundary in order not to cause (potentially unnecessary) page faults.

Can the USN Journal of the NTFS file system be bigger than it's declared size?

Hello fellow programmers.
I'm trying to dump the contents of the USN Journal of a NTFS partition using WinIoCtl functions. I have the *USN_JOURNAL_DATA* structure that tells me that it has a maximum size of 512 MB. I have compared that to what fsutil has to say about it and it's the same value.
Now I have to read each entry into a *USN_RECORD* structure. I do this in a for loop that starts at 0 and goes to the journal's maximum size in increments of 4096 (the cluster size).
I read each 4096 bytes in a buffer of the same size and read all the USN_RECORD structures from it.
Everything is going great, file names are correct, timestamps as well, reasons, everything, except I seem to be missing some recent records. I create a new file on the partition, I write something in it and then I delete the file. I run the app again and the record doesn't appear. I find that the record appears only if I keep reading beyond the journal's maximum size. How can that be?
At the moment I'm reading from the start of the Journal's data to the maximum size + the allocation delta (both are values stored in the *USN_JOURNAL_DATA* structure) which I don't believe it's correct and I'm having trouble finding thorough information related to this.
Can someone please explain this? Is there a buffer around the USN Journal that's similar to how the MFT works (meaning it's size halves when disk space is needed for other files)?
What am I doing wrong?

That's the expected behaviour, as documented:
MaximumSize
The target maximum size for the change journal, in bytes. The change journal can grow larger than this value, but it is then truncated at the next NTFS file system checkpoint to less than this value.
Instead of trying to predetermine the size, loop until you reach the end of the data.
If you are using the FSCTL_ENUM_USN_DATA control code, you have reached the end of the data when the error code from DeviceIoControl is ERROR_HANDLE_EOF.
If you are using the FSCTL_READ_USN_JOURNAL control code, you have reached the end of the data when the next USN returned by the driver (the DWORDLONG at the beginning of the output buffer) is the USN you requested (the value of StartUsn in the input buffer). You will need to set the input parameter BytesToWaitFor to zero, otherwise the driver will wait for the specified amount of new data to be added to the journal.

Fortran array memory management

I am working to optimize a fluid flow and heat transfer analysis program written in Fortran. As I try to run larger and larger mesh simulations, I'm running into memory limitation problems. The mesh, though, is not all that big. Only 500,000 cells and small-peanuts for a typical CFD code to run. Even when I request 80 GB of memory for my problem, it's crashing due to insufficient virtual memory.
I have a few guesses at what arrays are hogging up all that memory. One in particular is being allocated to (28801,345600). Correct me if I'm wrong in my calculations, but a double precision array is 8 bits per value. So the size of this array would be 28801*345600*8=79.6 GB?
Now, I think that most of this array ends up being zeros throughout the calculation so we don't need to store them. I think I can change the solution algorithm to only store the non-zero values to work on in a much smaller array. However, I want to be sure that I'm looking at the right arrays to reduce in size. So first, did I correctly calculate the array size above? And second, is there a way I can have Fortran show array sizes in MB or GB during runtime? In addition to printing out the most memory intensive arrays, I'd be interested in seeing how the memory requirements of the code are changing during runtime.

Memory usage is a quite vaguely defined concept on systems with virtual memory. You can have large amounts of memory allocated (large virtual memory size) but only a small part of it actually being actively used (small resident set size - RSS).
Unix systems provide the getrusage(2) system call that returns information about the amount of system resources in use by the calling thread/process/process children. In particular it provides the maxmimum value of the RSS ever reached since the process was started. You can write a simple Fortran callable helper C function that would call getrusage(2) and return the value of the ru_maxrss field of the rusage structure.
If you are running on Linux and don't care about portability, then you may just open and read from /proc/self/status. It is a simple text pseudofile that among other things contains several lines with statistics about the process virtual memory usage:
...
VmPeak: 9136 kB
VmSize: 7896 kB
VmLck: 0 kB
VmHWM: 7572 kB
VmRSS: 6316 kB
VmData: 5224 kB
VmStk: 88 kB
VmExe: 572 kB
VmLib: 1708 kB
VmPTE: 20 kB
...
Explanation of the various fields - here. You are mostly interested in VmData, VmRSS, VmHWM and VmSize. You can open /proc/self/status as a regular file with OPEN() and process it entirely in your Fortran code.
See also what memory limitations are set with ulimit -a and ulimit -aH. You may be exceeding the hard virtual memory size limit. If you are submitting jobs through a distributed resource manager (e.g. SGE/OGE, Torque/PBS, LSF, etc.) check that you request enough memory for the job.

How to calculate Storage when ftping to MainFrame

How can I calculate storage when FTPing to MainFrame? I was told LRECL will always remain '80'. Not sure how I can calculate PRI and SEC dynamically based on the file size...
QUOTE SITE LRECL=80 RECFM=FB CY PRI=100 SEC=100

If the site has SMS, you shouldn't need to, but if you need to calculate the number of tracks is the size of the file in bytes divided by 56,664, or the number of cylinders is the size of the file in bytes divided by 849,960. In either case, you would round up.

Unfortunately IBM's FTP server does not support the newer space allocation specifications in number of records (the JCL parameter AVGREC=U/M/K plus the record length as the first specification in the SPACE parameter).
However, there is an alternative, and that is to fall back on one of the lesser-used SPACE parameters - the blocksize specification. I will assume 3390 disk types for simplicity, and standard data sets.
For fixed-length records, you want to calculate the largest number that will fit in half a track (27994 bytes), because z/OS only supports block sizes up to 32760. Since you are dealing with 80-byte records, that number is 27290. Divide your file size by that number and that will give you the number of blocks. Then in a SITE server command, specify
SITE BLKSIZE=27920 LRECL=80 RECFM=FB BLOCKS=27920 PRI=calculated# SEC=a_little_extra
This is equivalent to SPACE=(27920,(calculated#,a_little_extra)).
z/OS space allocation calculates the number of tracks required and rounds up to the nearest track boundary.
For variable-length records, if your reading application can handle it, always use BLKSIZE=27994. The reason I have the warning about the reading application is that even today there are applications from ISVs that still have strange hard-coded maximum variable length blocks such as 12K.
If you are dealing with PDSEs, always use BLKSIZE=32760 for variable-length and the closest-to-32760 for fixed-length in your specification (32720 for FB/80), but calculate requirements based on BLKSIZE=4096. PDSEs are strange in their underlying layout; the physical records are 4096 bytes, which is because there is some linear data set VSAM code that handles the physical I/O.

memory_limit=80M. what is the maximum image size for imagecreateformjpeg()?

i have a webhosting that gives maximum memory_limit of 80M (i.e. ini_set("memory_limit","80M");).
I'm using photo upload that uses the function imagecreatefromjpeg();
When i upload large images it gives the error
"Fatal error: Allowed memory size of 83886080 bytes exhausted"
What maximum size (in bytes) for the image i can restrict to the users?
or the memory_limit depends on some other factor?

The memory size of 8388608 is 8 Megabytes, not 80. You may want to check whether you can still increase the value somewhere.
Other than that, the general rule for image manipulation is that it will take at least
image width x image height x 3
bytes of memory to load or create an image. (One byte for red, one byte for green, one byte for blue, possibly one more for alpha transparency)
By that rule, a 640 x 480 pixel image will need at least 9.2 Megabytes of space - not including overhead and space occupied by the script itself.
It's impossible to determine a limit on the JPG file size in bytes because JPG is a compressed format with variable compression rates. You will need to go by image resolution and set a limit on that.
If you don't have that much memory available, you may want to look into more efficient methods of doing what you want, e.g. tiling (processing one part of an image at a time) or, if your provider allows it, using an external tool like ImageMagick (which consumes memory as well, but outside the PHP script's memory limit).

Probably your script uses more memory than the just the image itself. Trying debugging your memory consumption.
One quick-and-dirty way is to utilize memory_get_usage and memory_get_usage memory_get_peak_usage on certain points in your code and especially in a custom error_handler and shutdown_function. This can let you know what exact operations causes the memory exhaustion.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio