What makes node.js SlowBuffers "slow"?

I'm using node.js to serve some PNG images that are stored in an SQLite database as binary BLOBs. These images are small, on average 9500 bytes.
I'm using the sqlite3 npm package, which appears to return binary BLOB objects as SlowBuffers. My node.js service holds these SlowBuffers in memory to mitigate IO latency, serving them like this:
response.send(slowBuffer);
It appears that SlowBuffer has an interface similar to Buffer; converting to Buffer is trivial:
// Copy the SlowBuffer's contents into a regular Buffer.
var f = function(slowBuffer) {
    var buffer = new Buffer(slowBuffer.length);
    slowBuffer.copy(buffer);
    return buffer;
};
Should I convert these SlowBuffers to Buffers?
Help me understand why they are called "slow" buffers.

These two threads from the node.js mailing lists explain it:
https://groups.google.com/forum/?fromgroups=#!topic/nodejs-dev/jd52ZsVSZNo
https://groups.google.com/forum/?fromgroups=#!topic/nodejs/s1dnFbb-Rj8
Node provides two types of buffer objects. Buffer is implemented in plain JavaScript, while SlowBuffer is implemented by a C++ module. Crossing from the JavaScript environment into a C++ module costs extra CPU time, hence the "slow". Buffer objects are backed by SlowBuffer storage, but their contents can be read and written directly from JavaScript for better performance.
Any Buffer object larger than 8 KB is backed by its own dedicated SlowBuffer, while multiple Buffer objects smaller than 8 KB are sliced out of a single shared SlowBuffer. Small buffers are often used in large quantities, and if each of them were a SlowBuffer in its own right, every access would pay the C++ crossing penalty; slicing them from a shared pool avoids that cost. The pooling idea is sketched below.
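To make the pooling concrete, here is a minimal C++ sketch of the strategy described above: many small logical buffers sliced out of one 8 KB native allocation, so that each small buffer avoids a dedicated native object. This is only an illustration of the idea, not Node's implementation, and all names are invented.

#include <cstddef>
#include <cstdint>
#include <memory>

struct Pool {
    static constexpr std::size_t kPoolSize = 8 * 1024;  // 8 KB, like Node's pool
    std::shared_ptr<std::uint8_t[]> backing{new std::uint8_t[kPoolSize]};
    std::size_t used = 0;
};

struct Slice {                               // plays the role of a small Buffer
    std::shared_ptr<std::uint8_t[]> backing; // shares ownership of the pool
    std::uint8_t* data;
    std::size_t size;
};

Slice allocate(Pool& pool, std::size_t size) {
    if (size > Pool::kPoolSize) {            // large request: dedicated block
        std::shared_ptr<std::uint8_t[]> own(new std::uint8_t[size]);
        return {own, own.get(), size};
    }
    if (pool.used + size > Pool::kPoolSize)  // pool exhausted: start a new one
        pool = Pool{};
    Slice s{pool.backing, pool.backing.get() + pool.used, size};
    pool.used += size;
    return s;
}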
The documentation's note that "this class is primarily for internal use" means that if you want to manage buffers on the server yourself, you can use SlowBuffer directly (to use it in smaller chunks, you have to partition the SlowBuffer yourself). Unless you need that fine-grained control over buffer handling, you should be fine with plain Buffer objects.

Related

Adding a new member in struct sk_buff - any impact on performance?

I need to add one small buffer to the sk_buff structure, either as a separate member or on top of the default skb->cb.
The size will be around 100 bytes, and the concern is performance. Can this cause a performance hit for packet processing? In particular, an sk_buff of this size cannot be loaded in a single cache line; could the cache alignment cause issues?
In an experiment where I simply added 4 more bytes to sk_buff, I saw a 30-50 MB/s performance drop in regular UDP tests.
Any advice?
Yes, it will have an impact on performance!
Besides possible issues with the alignment of the data structure, the main problem comes from higher memory diffusion and/or higher memory bandwidth usage. If the buffer is not completely used, it acts as big padding. This padding hurts performance because cache lines are partly wasted and structure fields are no longer loaded contiguously from main memory (which matters when code traverses many items of the structure but reads only a few fields of each). If, on the other hand, the buffer is fully read and useful, then more data has to be loaded through the memory hierarchy (CPU caches and RAM), which is anything but free.
I advise you to put only a small reference (e.g. an array index or a pointer) to the buffer in the critical sk_buff structure and move the buffer itself into a separate data structure, as sketched below. The benefit is that the buffers can be packed together and the impact of memory diffusion is significantly reduced (if the added buffers are not often used). The downside is an additional indirection/access and the need to keep the separate buffers coherent with their packets.
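A minimal C-style sketch of that layout, with invented names (pkt, extra_data) standing in for the real kernel structures and user-space allocation standing in for the kernel's allocators:

/* Keep the hot packet struct small; reference the cold ~100-byte buffer
 * out of line so it does not dilute the cache lines of the hot fields. */
#include <stdint.h>
#include <stdlib.h>

struct extra_data {
    uint8_t bytes[100];        /* the rarely-used payload, stored separately */
};

struct pkt {                   /* stand-in for sk_buff */
    /* ... hot fields touched on every packet ... */
    struct extra_data *extra;  /* small pointer instead of an inline buffer */
};

/* Allocate the side buffer only for packets that actually need it. */
static int pkt_attach_extra(struct pkt *p)
{
    p->extra = (struct extra_data *)calloc(1, sizeof(*p->extra));
    return p->extra ? 0 : -1;
}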

DirectX11 - Buffer size for instanced vertices with varying instance counts

With DirectX/C++, suppose you are drawing the same model many times to the screen. You can do this with DrawIndexedInstanced(). You need to set the size of the instance buffer when you create it:
D3D11_BUFFER_DESC_instance.ByteWidth = sizeof(struct_with_instance_data) * instance_count;
If the instance_count can vary between a low and high value, is it customary to create the buffer with the max value (max_instance_count)? And only draw what is required.
Wouldn't that permanently use a lot of memory?
Is recreating the buffer a slow solution?
What are good methods?
Thank you.
All methods have pros and cons.
Create with the max count: as you pointed out, you will permanently consume extra memory.
Create with some initial count and grow exponentially: you can hit out-of-memory at runtime, and growing large buffers may cause spikes in the profiler.
There is a third way which may or may not work for you, depending on the application: create a reasonably large fixed-size buffer and, to render more instances, call DrawIndexedInstanced multiple times in a loop, replacing the data in the buffer between calls (see the sketch below). This works well if the source data is generated at runtime from something else; you will need to rework that part to produce fixed-size batches (except the last one) instead of one complete buffer. It won't work if the data in the buffer needs to persist across frames, e.g. if you update it with a compute shader.
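A sketch of that batching loop, assuming an instance buffer created elsewhere with D3D11_USAGE_DYNAMIC and D3D11_CPU_ACCESS_WRITE and sized for kMaxInstances elements; InstanceData, the capacity, and the function name are placeholders, and error handling is omitted:

#include <d3d11.h>
#include <cstring>

struct InstanceData { float world[16]; };   // placeholder per-instance layout

const UINT kMaxInstances = 1024;            // fixed capacity of the buffer

void drawInBatches(ID3D11DeviceContext* ctx, ID3D11Buffer* instanceBuf,
                   const InstanceData* data, UINT count,
                   UINT indexCountPerInstance)
{
    for (UINT first = 0; first < count; first += kMaxInstances) {
        UINT batch = (count - first < kMaxInstances) ? count - first
                                                     : kMaxInstances;

        // WRITE_DISCARD hands back a fresh memory region each time, so the
        // CPU never stalls on the GPU still reading the previous batch.
        D3D11_MAPPED_SUBRESOURCE mapped;
        ctx->Map(instanceBuf, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped);
        std::memcpy(mapped.pData, data + first, batch * sizeof(InstanceData));
        ctx->Unmap(instanceBuf, 0);

        ctx->DrawIndexedInstanced(indexCountPerInstance, batch, 0, 0, 0);
    }
}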

glDrawArrays - guaranteed to copy?

Following scenario:
I have a buffer of vertices in system memory whose address I submit to glDrawArrays via glVertexAttribPointer every frame. The rendering API is OpenGL ES 3.0.
Now my question:
Can I assume, that glDrawArrays will create a full copy of the buffer on every draw call? Or is it possible that it will draw from the buffer directly if I'm on a shared memory platform?
Regards
For all intents and purposes you need to treat it as "it will draw from the buffer directly".
When setting the pointer you do not tell OpenGL the size of the buffer, so at that point nothing is recorded but a pointer, or rather an integer value, since the same call is used with GPU vertex buffers.
The data size is therefore only determined by the draw call, where you say how many vertices to use from the buffer. Even at that point I would not expect OpenGL to copy the data into its internal memory, and if it does, it is at most a temporary cache: the data is not reused across render calls and must be read again each time.
You need to keep the data alive and accessible in your own memory. If it is freed or goes out of scope, the draw call may render garbage or even crash when the memory is no longer accessible. For instance, setting the pointer from a function that allocated the data on its stack is generally a no-no; see the sketch below.
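A short C++ sketch of the lifetime pitfall against OpenGL ES 3.0 (client-side array path, so no VBO bound; all names are illustrative):

#include <GLES3/gl3.h>
#include <vector>

// BAD: the vertex array lives on this function's stack, so the pointer
// recorded by OpenGL dangles as soon as the function returns.
void setVerticesBroken()
{
    GLfloat verts[] = { 0.f, 0.f, 1.f, 0.f, 0.f, 1.f };
    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, verts);
}

// OK: storage whose lifetime outlives every draw call that reads it.
static std::vector<GLfloat> g_verts = { 0.f, 0.f, 1.f, 0.f, 0.f, 1.f };

void setVerticesOk()
{
    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, g_verts.data());
}

void drawFrame()
{
    glEnableVertexAttribArray(0);
    glDrawArrays(GL_TRIANGLES, 0, 3);  // reads from the pointer at draw time
}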
Can I assume that glDrawArrays will create a full copy of the buffer on every draw call?
Graphics drivers try very hard not to copy bulk data buffers - it's horribly slow and energy intensive. The entire point of using buffers rather than client-side arrays is that they can be uploaded to the graphics server once and subsequently just referenced by the draw call without needing a copy or (on desktop) a transfer over PCIe into graphics RAM.
Rendering will behave as if the buffer has been copied (e.g. the output must reflect the state of the buffer at the point the draw call was made, even if it is subsequently modified). However, in most cases no copy is actually needed.
A badly written application can force copies to be taken: for example, if you modify a buffer (e.g. by calling glBufferSubData) immediately after submitting a draw that uses it, the driver may need to create a new version of the buffer, because the original data is likely still referenced by the draw you just queued. Well-written applications pipeline their resource updates so this does not happen, because it is normally fatal for rendering performance; the sketch below contrasts the two patterns.
See https://community.arm.com/graphics/b/blog/posts/mali-performance-6-efficiently-updating-dynamic-resources for a longer explanation.
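A sketch contrasting the copy-forcing update with re-specification ("orphaning") that avoids it; buf is assumed to be a GL_ARRAY_BUFFER created earlier with GL_DYNAMIC_DRAW, and all names are illustrative:

#include <GLES3/gl3.h>

void updateAndDraw(GLuint buf, const void* newData, GLsizeiptr size,
                   GLsizei vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);

    // BAD: in-place update right after the draw. The queued draw still
    // references the old contents, so the driver must ghost (copy) the buffer:
    // glBufferSubData(GL_ARRAY_BUFFER, 0, size, newData);

    // BETTER: re-specify the store. The old storage stays with the queued
    // draw and the new data gets fresh memory, so no copy is forced.
    glBufferData(GL_ARRAY_BUFFER, size, newData, GL_DYNAMIC_DRAW);
}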

Ignite uses more memory than expected

I am using Ignite to build a framework for data calculation. One big problem is that memory usage is higher than expected: data that uses 1 GB of memory outside Ignite uses more than 1.5 GB in the Ignite cache.
I have already turned off backups and copyOnRead. I don't use the query feature, so there is no extra index space, and I have accounted for the extra space used by each cache and cache entry. The total memory usage still doesn't add up.
The data value for each cache entry is a big map containing lists of primitive arrays. Each entry is about 120 MB.
What can be the problem? The data structure or the configuration?
Ignite does introduce some overhead on top of your data, and half a gigabyte doesn't sound too bad to me. I recommend this guide for the details: https://apacheignite.readme.io/docs/capacity-planning
The difference between expected and real memory usage arises from two main points:
Each entry carries a constant overhead, consisting of the objects that support processing entries in a distributed computing environment.
E.g. you can declare an integer local variable; it takes 4 bytes on the stack, but it is hard to make that variable long-lived and accessible from other parts of the program. So you create a new Integer object, which consumes at least 16 bytes (a 300% overhead, isn't it?). Going further, if you want to make this object mutable and safely accessible by multiple threads, you have to create a new AtomicReference and store your object inside it. Total memory consumption is now at least 32 bytes... and so on. Every time we extend an object's functionality we pay additional overhead; there is no other way.
Each entry is stored inside a cache in a special serialized format, so the actual memory footprint of an entry depends on the format used. By default, Ignite uses BinaryMarshaller to convert an object into a byte array, which is stored inside a BinaryObject.
The reason is simple: distributed computing systems continuously exchange entries between nodes, so every entry in a cache must be ready to be transferred as a byte array.
Please read the article; it was recently updated. You can estimate the entry overhead by hand for small entries, but for big entries you should inspect the actual entry stored in the cache as a byte array. Look at the withKeepBinary method.

Better to create a large number of static VBOs for static data up front, or stream data into the VBOs?

I need to render meshes that can be very dense. The vertex data is completely static. Apparently large VBOs are not great for performance (source: http://www.opengl.org/wiki/Vertex_Specification_Best_Practices#Size_of_a_VBO.2FIBO, although unfortunately it fails to link to its own source.) And even if large buffers were OK for performance, the total size of my vertex data sometimes exceeds what I can successfully allocate with glBufferData().
So assuming that I need to break my mesh into smaller VBOs of a few MB each, which of these methods is recommended:
1) Allocate enough buffers at startup to hold all of the mesh data. Rendering is then as simple as binding each buffer one at a time and calling glDrawArrays().
2) Allocate a small fixed pool of buffers at startup. Rendering would require filling up a buffer with a block of triangles, calling glDrawArrays(), fill up another buffer with the next block, call glDrawArrays() again, and so on. So potentially a lot more CPU work.
3) Some other method I'm not thinking of.
Part of my question just comes down to how memory allocation with VBOs works — if I allocate enough small buffers at startup to hold all of the mesh data, am I going to run into the same memory limit that would prevent me from allocating a single buffer large enough to hold all of the data? Or are VBOs integrated with virtual memory such that OpenGL will handle swapping out VBOs when I exceed available graphics memory?
Finally, how much of this is actually implementation-dependent? Are there useful references from AMD/Intel/NVidia that explain best practices for buffer management?
I suggest you continue to allocate as few static buffers, with as many vertices each, as possible. If performance begins to suffer, you need to scale down the quality level (i.e. the number of vertices) of your model.
Actually, using too many SMALL buffers is bad for performance, because each render call involves a certain amount of fixed overhead. The more vertices in a single buffer (within reason), the fewer times that initial setup has to be done.
The reason very large VBOs might not be good is that video cards have limited memory capacity, and swapping large amounts of memory back and forth between system RAM and video RAM is expensive, NOT because another mechanism is better for rendering a very large model. I can't imagine why glBufferData() is failing for you unless your requirements are incredibly huge (vastly larger than 256 MB), or perhaps you are using the USHORT data type for indices but have allocated more than 65535 of them.
The 1 MB - 4 MB suggestion has to do with how you should optimize your models, and it is really somewhat dependent on the generation of video card you are using. Most video cards support HUGE VBOs. I'm betting your problem is elsewhere. If you do split the mesh anyway, a simple chunking scheme like the sketch below works.
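For what it's worth, here is a sketch of option 1 in plain C++/OpenGL: split the static mesh into a few moderately sized VBOs at startup and draw them back to back. It assumes a GL loader is already initialized and a position-only vertex layout; the 4 MB chunk size and all names are illustrative.

#include <GL/glew.h>
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk { GLuint vbo; GLsizei vertexCount; };

// 4 MB worth of xyz vertices, rounded down to a whole number of triangles
// so chunk boundaries fall on primitive boundaries.
constexpr std::size_t kFloatsPerVertex = 3;
constexpr std::size_t kVertsPerChunk =
    (4 * 1024 * 1024 / (kFloatsPerVertex * sizeof(float))) / 3 * 3;

std::vector<Chunk> uploadMesh(const float* verts, std::size_t vertexCount)
{
    std::vector<Chunk> chunks;
    for (std::size_t first = 0; first < vertexCount; first += kVertsPerChunk) {
        std::size_t n = std::min(kVertsPerChunk, vertexCount - first);
        Chunk c{0, static_cast<GLsizei>(n)};
        glGenBuffers(1, &c.vbo);
        glBindBuffer(GL_ARRAY_BUFFER, c.vbo);
        glBufferData(GL_ARRAY_BUFFER, n * kFloatsPerVertex * sizeof(float),
                     verts + first * kFloatsPerVertex, GL_STATIC_DRAW);
        chunks.push_back(c);
    }
    return chunks;
}

void drawMesh(const std::vector<Chunk>& chunks)
{
    glEnableVertexAttribArray(0);
    for (const Chunk& c : chunks) {
        glBindBuffer(GL_ARRAY_BUFFER, c.vbo);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
        glDrawArrays(GL_TRIANGLES, 0, c.vertexCount);
    }
}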
