Compression level using miniz having no effect

I'm evaluating the low-level routines in miniz and testing different compression levels with differently sized inputs. Level 0 always produces slightly larger output (as expected), but all the other levels result in exactly the same size for a given input. Higher levels do get progressively slower (and level 0 behaves differently), so setting a different level clearly has some effect. I can decompress everything back again.
tdefl_compressor compressor;
auto flags = tdefl_create_comp_flags_from_zip_params(level, -15, MZ_DEFAULT_STRATEGY);
auto status = tdefl_init(&compressor, write, &target, flags);
assert(status == TDEFL_STATUS_OKAY);
tdefl_compress_buffer(&compressor, data.data, data.size, TDEFL_FINISH);
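For reference, here's the kind of sanity check I could run to compare levels in isolation, using miniz's one-shot tdefl_compress_mem_to_heap helper instead of my write callback (a sketch; it assumes the whole input fits in memory):
#include <cstddef>
#include <cstdio>
#include "miniz.h"

// Compress the same buffer once per level and print the resulting sizes, so
// the effect of the level passed to tdefl_create_comp_flags_from_zip_params
// can be observed directly, independent of the callback path.
static void dump_level_sizes(const void *data, size_t size)
{
    for (int level = 0; level <= 10; ++level)
    {
        // Same flag construction as above: raw deflate, -15 window bits.
        mz_uint flags = tdefl_create_comp_flags_from_zip_params(level, -15, MZ_DEFAULT_STRATEGY);
        size_t out_len = 0;
        void *out = tdefl_compress_mem_to_heap(data, size, &out_len, flags);
        if (out)
        {
            std::printf("level %2d -> %zu bytes\n", level, out_len);
            mz_free(out);
        }
    }
}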

Related

No direct packet access in BPF program with just CAP_BPF?

Up until Linux 5.8, CAP_SYS_ADMIN was required to load anything but the most basic BPF programs. The recently introduced CAP_BPF is a welcome addition, as it allows software leveraging BPF to run with fewer privileges.
Certain types of BPF programs can access packet data. The pre-4.7 way of doing it is via the bpf_skb_load_bytes() helper. As the verifier got smarter, it became possible to perform "direct packet access", i.e. to access packet bytes by following pointers in the context structure. E.g.:
static const struct bpf_insn prog[] = {
    // BPF_PROG_TYPE_SK_REUSEPORT: gets a pointer to sk_reuseport_md (r1).
    // Get packet data pointer (r2) and ensure length >= 2, goto Drop otherwise
    BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
                offsetof(struct sk_reuseport_md, data)),
    BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
                offsetof(struct sk_reuseport_md, data_end)),
    BPF_MOV64_REG(BPF_REG_4, BPF_REG_2),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, 2),
    BPF_JMP_REG(BPF_JGT, BPF_REG_4, BPF_REG_3, /* Drop: */ +4),
    // Ensure first 2 bytes are 0, goto Drop otherwise
    BPF_LDX_MEM(BPF_H, BPF_REG_4, BPF_REG_2, 0),
    BPF_JMP_IMM(BPF_JNE, BPF_REG_4, 0, /* Drop: */ +2),
    // return SK_PASS
    BPF_MOV32_IMM(BPF_REG_0, SK_PASS),
    BPF_EXIT_INSN(),
    // Drop: return SK_DROP
    BPF_MOV32_IMM(BPF_REG_0, SK_DROP),
    BPF_EXIT_INSN()
};
The accessed bytes must be explicitly checked to be within bounds; the verifier will reject the program otherwise.
The program above loads successfully if the caller bears CAP_SYS_ADMIN. Supposedly, CAP_BPF should suffice as well, but it doesn't (Linux 5.13). Earlier kernels behave similarly. The verifier output follows:
Permission denied
0: (79) r2 = *(u64 *)(r1 +0)
1: (79) r3 = *(u64 *)(r1 +8)
2: (bf) r4 = r2
3: (07) r4 += 2
4: (2d) if r4 > r3 goto pc+4
R3 pointer comparison prohibited
processed 5 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
I understand that arbitrary pointer comparison is restricted, as it reveals kernel memory layout. However, comparing a pointer into the packet data, offset by a certain amount, with a pointer to the packet end is safe.
I'd like to find a way to load the program without granting CAP_SYS_ADMIN.
Is there a way to write bounds checks in a way that doesn't trigger pointer comparison error?
The relevant code is in check_cond_jmp_op(). It looks like one can't get away with pointer comparison, even with the latest kernel version.
If there's no way to write the bounds check in a way that keeps the verifier happy, I wonder if lifting the limitation is on the roadmap.
As a workaround, I can grant CAP_PERFMON on top of CAP_BPF, removing the "embargo" on pointer comparison. The program then loads successfully. I can probably restrict perf_event_open() and other superfluous bits with seccomp. It doesn't feel nice though.
Reproducer.
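For completeness, a minimal sketch of the seccomp part of that workaround (assuming libseccomp is available and that failing perf_event_open() with EPERM is acceptable once the programs are loaded):
#include <errno.h>
#include <seccomp.h> // link with -lseccomp

// Allow everything except perf_event_open(), which fails with EPERM.
// This narrows what the extra CAP_PERFMON can be used for at runtime.
static int restrict_perf_event_open(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (!ctx)
        return -1;

    int rc = seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                              SCMP_SYS(perf_event_open), 0);
    if (rc == 0)
        rc = seccomp_load(ctx);

    seccomp_release(ctx);
    return rc;
}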
To make direct packet accesses in your program, you will need CAP_PERFMON in addition to CAP_BPF. I'm not aware of any way around it.
Why?
Because of Spectre vulnerabilities, someone able to perform arithmetic on unbounded pointers (i.e., all except stack and map value pointers) can read arbitrary memory via speculative out-of-bounds loads.
Such operations thus need to be forbidden for unprivileged users. Allowing CAP_BPF users to perform those operations would essentially give read access to arbitrary memory to CAP_BPF. For those reasons, I doubt this limitation will be lifted in the future.
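As a side note (a sketch of my own, assuming libcap and kernel headers recent enough to define CAP_PERFMON), a process can check at runtime whether it actually holds CAP_PERFMON before attempting to load a program that uses direct packet access:
#include <sys/capability.h> // libcap; link with -lcap (CAP_PERFMON needs 5.8+ headers)

// Returns true if the calling process has CAP_PERFMON in its effective set.
static bool has_cap_perfmon(void)
{
    cap_t caps = cap_get_proc();
    if (!caps)
        return false;

    cap_flag_value_t value = CAP_CLEAR;
    cap_get_flag(caps, CAP_PERFMON, CAP_EFFECTIVE, &value);
    cap_free(caps);
    return value == CAP_SET;
}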

Windows (ReFS,NTFS) file preallocation hint

Assume I have multiple processes writing large files (20 GB+). Each process is writing its own file, and assume that a process writes x MB at a time, then does some processing and writes x MB again, and so on.
What happens is that this write pattern causes the files to become heavily fragmented, since the blocks for the different files get allocated in the order the chunks are written and end up interleaved on the disk.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then set the correct size before it is closed. But an application accessing these files remotely, which is able to parse these in-progress files, then sees zeroes at the end of the file and takes much longer to parse it.
I do not have control over this reading application, so I can't optimize it to take the trailing zeros into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' contig utility, or even implement a custom "defragmenter" which would process my files and consolidate their blocks together.
Another, more drastic solution would be to implement a minifilter driver which would report a "fake" file size.
But obviously both solutions listed above are far from optimal. So I would like to know whether there is a way to give the filesystem a file size hint so that it "reserves" consecutive space on the drive, but still reports the right file size to applications.
Writing larger chunks at a time of course also helps with fragmentation, but it does not solve the issue either.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed I made a small test:
#include <windows.h>
#include <cstdio>
#include <iostream>

int main()
{
    LARGE_INTEGER size;
    LARGE_INTEGER a;
    char buf = 'A';
    DWORD written = 0;
    DWORD tstart;

    std::cout << "creating file\n";
    tstart = GetTickCount();
    HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
    size.QuadPart = 100000000LL;
    SetFilePointerEx(f, size, &a, FILE_BEGIN);
    SetEndOfFile(f);
    printf("file extended, elapsed: %d\n", GetTickCount() - tstart);
    getchar();

    printf("writing 'A' at the end\n");
    tstart = GetTickCount();
    SetFilePointer(f, -1, NULL, FILE_END);
    WriteFile(f, &buf, 1, &written, NULL);
    printf("written: %d bytes, elapsed: %d\n", written, GetTickCount() - tstart);

    CloseHandle(f);
    return 0;
}
When the application is executed and it waits for a keypress after SetEndOfFile I examined the on disc NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated.
When pressing Enter to allow the test to continue (and after waiting for quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated.
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much that answers/comments also provide an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture).
Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
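For example, a minimal sketch (error handling omitted; the handle must have write access and the size here is just a placeholder):
#include <windows.h>

// Reserve clusters for the file without changing EndOfFile. Readers still see
// the real end of file; the extra allocation is released when the handle closes.
void preallocate(HANDLE f)
{
    FILE_ALLOCATION_INFO info = {};
    info.AllocationSize.QuadPart = 100000000LL; // placeholder: ~100 MB
    SetFileInformationByHandle(f, FileAllocationInfo, &info, sizeof(info));
}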
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.
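A sketch of what using it involves (assuming the account actually holds SeManageVolumePrivilege and the file was opened with write access; error handling omitted):
#include <windows.h>

// Enable SeManageVolumePrivilege on the process token, then set the valid data
// length so NTFS skips zero-filling clusters up to that offset.
// WARNING: this can expose stale on-disk data through the file.
bool set_valid_data_length(HANDLE file, LONGLONG length)
{
    HANDLE token;
    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token))
        return false;

    TOKEN_PRIVILEGES tp = {};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    LookupPrivilegeValueW(NULL, L"SeManageVolumePrivilege", &tp.Privileges[0].Luid);
    AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(token);

    return SetFileValidData(file, length) != FALSE;
}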

std::copy runtime_error when working with uint16_t's

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
    lock_guard<mutex> depth_data_lock(depth_mutex);
    uint16_t* depth = static_cast<uint16_t*>(_depth);
    std::copy(depth, depth + depthBufferSize(), depth_buffer.begin()); /// the error
    new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!
The way std::copy works is that you provide start and end points of your input sequence and the location to begin copying to. The end point that you're providing is off the end of your sequence, because your depthBufferSize function is giving an offset in bytes, rather than the number of elements in your sequence.
If you remove the multiply by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
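For example, a sketch based on the code in the question (depth_buffer already holds 640*480 elements):
// Copy one uint16_t per pixel; the count is in elements, not bytes.
std::copy_n(depth, depth_buffer.size(), depth_buffer.begin());

// Equivalent with std::copy:
std::copy(depth, depth + depth_buffer.size(), depth_buffer.begin());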
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.

Atomic operations in ARM

I've been working on an embedded OS for ARM. However, there are a few things I didn't understand about the architecture even after referring to the ARM ARM and the Linux source.
Atomic operations.
The ARM ARM says that load and store instructions are atomic and that their execution is guaranteed to complete before an interrupt handler executes. This is verified by looking at
arch/arm/include/asm/atomic.h:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when I want to manipulate this value atomically using CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc.), which use LDREX and STREX for ARMv7 (my target).
The ARM ARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and the STREX. The thing it does mention is locking the memory bus, which I guess is only helpful for MP systems where there can be more CPUs trying to access the same location at the same time. But for UP (and possibly MP), if a timer interrupt (or IPI for SMP) fires in this small window between LDREX and STREX, the exception handler executes, possibly changes the CPU context and returns to a new task. However, here comes the shocking part: it executes CLREX and hence removes any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than LDR and STR for atomicity on a UP system?
I did read something about an exclusive lock monitor, so I have a possible theory: when the thread resumes and executes the STREX, the monitor causes this call to fail, which can be detected, and the loop can be re-executed using the new value in the process (branch back to LDREX). Am I right here?
The idea behind the load-linked/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
    new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value required some significant computation, one should rewrite the loop as:
do
{
    old_value = *dest;
    new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
    do
    {
        if (__LDREXW(dest) != old_value) return 1; // Failure
    } while (__STREXW(new_value, dest));
    return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason [which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and the __STREXW].
Addendum
An example of a situation where "compute new value based on old" could be complicated would be one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE-1)

void do_update(void)
{
    // The foo_pool_alloc() method should return a slot number in the lower bits and
    // some sort of counter value in the upper bits so that once some particular
    // uint32_t value is returned, that same value will not be returned again unless
    // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
    // the possibility that while one task is performing its update, a second task
    // changes the thing to a new one and releases the old one, and a third task gets
    // given the newly-freed item and changes the thing to that, such that from the
    // point of view of the first task, the thing never changed.)
    uint32_t new_thing = foo_pool_alloc();
    uint32_t old_thing;
    do
    {
        // Capture old reference
        old_thing = foo_current_thing;
        // Compute new thing based on old one
        update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
                     &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
    } while(CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
    foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, got the answer from their website.
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.
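So the small LDREX/STREX loop from the answer above simply retries after such a context switch. Wrapped up as a function (a sketch, assuming the same CMSIS __LDREXW/__STREXW intrinsics are available for the target):
#include <stdint.h>

// Atomically increment *dest. If an interrupt or context switch lands between
// LDREX and STREX, the exclusive monitor is cleared (CLREX in the exception
// return path), STREX returns non-zero, and the loop retries with a freshly
// loaded value.
static inline uint32_t atomic_inc(volatile uint32_t *dest)
{
    uint32_t new_value;
    do
    {
        new_value = __LDREXW(dest) + 1;
    } while (__STREXW(new_value, dest));
    return new_value;
}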

Pixel modifying code runs quick in main app, really slow in Delphi 6 DirectShow filter with other problems

I have a Delphi 6 application that sends bitmaps to a DirectShow DLL in real-time, 25 frames a second. The DirectShow DLL is my code too and is also written in Delphi 6 using the DSPACK DirectShow component suite. I have a simple block of code that goes through each pixel in the bitmap modifying the brightness and contrast of the image, if a certain flag is set, otherwise the bitmap is pushed out the DirectShow DLL unmodified (push source video filter). The code used to be in the main application and then I just moved it into the DirectShow DLL. When it was in the main application it ran fine. I could see the changes in the bitmap as expected. However, now that the code resides in the DirectShow DLL it has the following problems:
When the code block below is active the DirectShow DLL is really slow. I have a quad core i5 and it's really slow. I can also see a big spike in the CPU consumption. In contrast, the very same code running in the main application ran fine on an old single core P4. It did hit the CPU noticeably on that old machine but the video was smooth and there were no problems. The images are only 352 x 288 pixels in size.
I don't see the expected changes to the visible bitmap. I can trace the code in the DirectShow DLL and see the numerical values of each pixel properly altered by the code, but the viewable image in the Graph Edit ActiveMovie window looks completely unchanged.
If I deactivate the code, which I can do in real-time, the ActiveMovie window shows video that is as smooth as glass, perfectly rendered with the CPU barely touched. If I reactivate the code the video is now really choppy, probably showing only 1 to 2 frames a second with a long delay before the first frame is shown, and the CPU spikes. Not completely, but a lot more than I would expect.
I tried compiling the DirectShow DLL with everything on including range checking, overflow checking, etc. and there were no warnings or errors during run-time. I then tried compiling for fastest speed and it still had the exact same problems listed above. Something is really wrong and I can't figure out what. Note, I do indeed lock the canvas before modifying the bitmap and unlock it after I'm done. If it weren't for the "everything on" compilation run I noted above I'd say it felt like an FPU Exception was being raised and silently swallowed with every pixel computation, but as I said, no errors or Exceptions are occurring.
UPDATE: I am putting this here so that the solution, which is embedded in one of Roman R's comments, is plainly visible. The problem was that I was not setting the PixelFormat property to pf24Bit before accessing the ScanLine property. As Roman suggested, not doing this must make the TBitmap code create a temporary copy of the bitmap. As soon as I added the line of code below, the problems went away, both the changes not being visible and the soft page faults. It's an insidious problem because the only thing affected is the pointer you get from the ScanLine property, since (assumption) it points into a temporary copy of the bitmap. That must be why the subsequent TextOut() call still worked, since it operated on the original copy of the bitmap.
clip.PixelFormat := pf24bit; // The missing code line that fixes the problem.
Here's the code block I've been referring to:
function IntToByte(i: Integer): Byte;
begin
  if i > 255 then
    Result := 255
  else if i < 0 then
    Result := 0
  else
    Result := i;
end;

// ---------------------------------------------------------------

procedure brightnessTurboBoost(var clip: TBitmap; rangeExpansionPowerOf2: integer; shiftValue: Byte);
var
  p0: PByte;
  x, y: Integer;
begin
  if (rangeExpansionPowerOf2 = 0) and (shiftValue = 0) then
    exit; // These parameter settings will not change the pixel values.

  for y := 0 to clip.Height - 1 do
  begin
    p0 := clip.scanline[y];
    // Can't just do the whole buffer as a big block of bytes since the
    // individual scan lines may be padded for CPU alignment.
    for x := 0 to (clip.Width - 1) * 3 do
    begin
      if rangeExpansionPowerOf2 >= 1 then
        p0^ := IntToByte((p0^ shl rangeExpansionPowerOf2) + shiftValue)
      else
        p0^ := IntToByte(p0^ + shiftValue);
      Inc(p0);
    end;
  end;
end;
There are a few things to say about this code snippet.
1. First of all, you are using the Scanline property of the TBitmap class. I have not been dealing with Delphi for many years, so I might be wrong about this, but I am under the impression that Scanline is not actually a thin accessor, is it? It might internally be hiding things which can dramatically affect performance, such as "if he wants to access the bits of the image, then we have to first convert it to a DIB before returning pointers". So a thing looking so simple might turn out to be a killer.
2. "if rangeExpansionPowerOf2 >= 1 then" in the inner loop body? You don't really want to do this comparison for every pixel. Either make two separate functions or duplicate the whole loop in two versions, for zero and non-zero rangeExpansionPowerOf2, and do this check only once.
3. "for ... to (clip.Width - 1) * 3 do": I am not really sure that Delphi optimizes the upper boundary evaluation to compute it only once. You might be doing that multiplication three times for every pixel, while you could do it only once for the whole image.
4. For top performance, IntToByte would definitely be implemented with MMX to avoid the ifs and process multiple bytes at once.
Still, as you say the images are only 352x288, I would suspect that #1 is ruining the performance.
