Can cereal serialization be more performant and deterministic? - c++11

I have been using cereal in highly time-sensitive software where every microsecond counts. My program runs in a loop and serializes a struct on every iteration. The struct contains some STL containers and strings and thus the size can vary between iterations.
I noticed that cereal takes much longer to complete on the very first serialization, and much less time in subsequent serialization attempts. It took approximately 600 microseconds the first time, then averaged 80 microseconds subsequently.
After tracing through the library I haven't been able to determine what is different about the first attempt versus all others. I'm guessing it has to do with parsing my struct or with allocating memory for the stringstream.
I found this post interesting, in particular the recommendation to extend a cereal class to not use streams. I tried to create a version of the BinaryOutputArchive class that used a void* buffer instead of a std::ostream, but have been unsuccessful getting things to compile. I also tried playing with the rdbuf of the stringstream as suggested here but I could not get it to serialize properly.
Does anyone have a recommendation on how to improve cereal's performance, especially on the very first serialization? Or perhaps a way to achieve deterministic latencies? Am I on the right track with my attempts above?
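One direction worth sketching (an assumption-laden illustration, not a confirmed fix for the first-call spike): keep a single preallocated buffer alive across iterations and hand cereal a std::ostream backed by a fixed-capacity streambuf, so neither the stringstream nor its internal buffer has to grow on the hot path. The FixedBuffer class, the Payload struct, and the 1 MiB size below are made up for the example; the only cereal API relied on is BinaryOutputArchive's std::ostream constructor.

    // Sketch only: a fixed-capacity streambuf over a buffer we preallocate once,
    // so the hot loop never grows a stringstream. Assumes the worst-case
    // serialized size fits in the buffer; otherwise the stream goes bad.
    #include <cereal/archives/binary.hpp>
    #include <cereal/types/string.hpp>
    #include <cereal/types/vector.hpp>
    #include <cstddef>
    #include <ostream>
    #include <streambuf>
    #include <string>
    #include <vector>

    struct FixedBuffer : std::streambuf {
        FixedBuffer(char* data, std::size_t size) { setp(data, data + size); }
        std::size_t bytesWritten() const { return static_cast<std::size_t>(pptr() - pbase()); }
        void reset() { setp(pbase(), epptr()); }   // rewind without reallocating
    };

    struct Payload {                               // illustrative stand-in for the real struct
        std::vector<int> values;
        std::string name;
        template <class Archive> void serialize(Archive& ar) { ar(values, name); }
    };

    int main() {
        std::vector<char> storage(1 << 20);        // preallocate once, sized for the worst case
        FixedBuffer buf(storage.data(), storage.size());
        std::ostream os(&buf);

        Payload p{{1, 2, 3}, "example"};
        for (int i = 0; i < 1000; ++i) {           // hot loop: no per-iteration heap growth
            buf.reset();
            cereal::BinaryOutputArchive ar(os);    // cheap: it just wraps the stream
            ar(p);
            // the first buf.bytesWritten() bytes of storage now hold the serialized record
        }
    }

Whether this removes the 600 microsecond spike entirely depends on what else cereal does on first use, but at minimum it takes stringstream buffer growth out of the measurement; if the worst-case size is unknown, a growable streambuf would be needed instead, since the default overflow() here simply fails the stream.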

Related

rxjs performance array vs stream

I'm new to the rxjs world and trying to get my head around it. My understanding is that one of the reasons to use rxjs is to improve performance with large datasets.
I'm trying to measure the speed improvement you could get vs normal arrays' higher-order functions (map, reduce).
I have set up this example here https://jsbin.com/bagoli/edit?js,console
The idea is to generate an array and apply some operators to it, measuring the time spent.
I don't understand why the stream calculation is always slower. Am I missing something?
Thank you for your help.
Your calculateWithStreams function is async and will run in parallel with your Array function, which makes it appear slower. If you run them one at a time, the times are basically the same once you increase the size a bit.
RxJS does of course have some overhead compared to native Arrays, but it makes up for it with lazy evaluation.
Also consider that the improvement isn't just in execution speed, but also memory usage. The Array version will always create a new array and will take up more memory.

Reflection.Emit Performance

Here's a simple question.
Let's say we want to unroll a looping method such as:
public int DoSum1(int n)
{
    int result = 0;
    for (int i = 1; i <= n; i++)
    {
        result += i;
    }
    return result;
}
Into a method performing simple additions only:
public int DoSum2()
{
    return 1+2+3+4+5+6+7+8+9+10+11+12+13+14+15+16+17+18+19+20;
}
http://etutorials.org/Programming/Programming+C.Sharp/Part+III+The+CLR+and+the+.NET+Framework/Chapter+18.+Attributes+and+Reflection/18.3+Reflection+Emit/
Logically, we're going to need code to create DoSum2 in IL at some point.
In this IL generation code we will perform an actual loop with the same iteration count as the unoptimized method.
What's the point of creating a super fast dynamic method if the code required to generate it will use a similar amount of time to execute?
Perhaps you can give an example of when it's worth using Emit in a similar case?
What's the point of creating a super fast dynamic method if the code required to generate it will use a similar amount of time to execute
This isn't really specific to Reflection.Emit, but to runtime code generation in general, so I will answer accordingly.
First, I do not recommend using code generation simply to perform micro-optimizations that compilers normally perform, like loop unrolling. Let the JIT compiler do its job.
Second, you are right in that there is usually little point in generating code that will only execute once. The time required to emit and JIT compile the IL is not insubstantial. You should only bother generating code if it will be executed many times.
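To put rough, purely illustrative numbers on that break-even point: if emitting and JIT-compiling a dynamic method costs on the order of a millisecond, and each call to the generated method saves on the order of a microsecond over the general-purpose path, the generated code only pays for itself after roughly a thousand invocations; below that, the plain version wins.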
Now, there definitely are cases where runtime code generation can prove beneficial. In fact, it's a technique I leverage heavily. I work in an electronic trading environment where it is necessary to process very high volumes of dynamic data. This introduces several concerns, the most significant being memory usage and throughput.
Our trading application needs to keep a lot of data in memory, so the footprint of each record is critical. Dynamic data structures like maps/dictionaries are less efficient than "POCO" classes with optimized field layouts and, depending on the design, may require boxing some values. I avoid this overhead by generating client-side storage classes once the shape of the data is known. In effect, the memory layout is as it would have been had I known the shape of the data at compile time.
Throughput is a major issue as well; (de)serializing dynamic data often involves some additional introspection and extra layers of indirection. Need to serialize a record? OK, first you need to query what the fields are. Then, for each field, you need to determine its type, then select a serializer for that type, and then invoke the serializer. If your data structure has optional fields, you may need to do some additional pre-processing, like figuring out the size of a presence map, and which bits in the presence map correspond to which fields. If you need to process a ton of data, all that overhead becomes a real problem. I avoid this overhead by generating specialized (de)serializers on both the server side and client side. Since the serializers are generated on demand, they can know the exact shape of the data, and read/write that data as efficiently as a hand-optimized serializer. When you have a high volume of data updating at very high frequencies, this can make a huge difference.
Now, keep in mind that we're something of an edge case. Most applications do not have the aggressive memory and throughput requirements that ours has, so runtime code generation isn't necessary. You should only go that route if you really need it, and you have exhausted all other possibilities. Although it can help with performance, generated code can be very difficult to debug and maintain.

Is there a well established incremental algorithm to maintain a history of values with accumulation over specific time frames?

I have practically completed one, but wanted to compare mine with a well-researched and possibly academic algorithm. There may be a library of statistical objects which either directly or in combination solve my particular need.
My system (which I intend to open-source) has a stream of NetFlow data. Rather than storing the data in a database and using SQL functions, I prefer to have a database-free system that maintains a set of statistics, updated for each new flow and scrolled per second (or more often).
My solution involves a single array of uint, used to effectively create a jagged array of sizes [60, 59, 23, 6, ...], representing seconds, minutes, hours, days, weeks, etc.
Each slot contains the total amount of Bytes for that time. So after 60 seconds a single minute statistic is created as Avg(seconds). This of course continues relatively up the time scale.
I roll up time scales, rather than simply keeping thousands of one-second increments, because of:
Memory constraints and the potential to have more statistical nodes; AND
Ideal presentation to users
Given that a flow may be applied to several nodes in a hierarchy of statistics (WAN Link, IP Address, Destination Address, SourcePort-DestinationPort), I calculate the delta once (GenerateDelta) and then simply apply it at every node which is both active and matches the flow meta-data.
A statistic on a given node would be "scrolled", in the following potential cases:
When being read/displayed (via HTTP/JSON AJAX request)
When a delta is being applied (due to relevant flow)
Simply every n-seconds (n is typically 1)
Overall there may be a well-established algorithm for keeping running totals over time (with seconds, minutes...). But failing that, there may also be suitable algorithms for comparison against smaller sub-sections of my code:
GenerateDelta - not likely, as this is specific to breaking down and averaging a flow with a duration over slots in the statistics array.
Scroll - if there were only seconds this would of course be simple, but my solution requires the 60 seconds to be combined into a new minute total every 60 seconds, and so on up the scales (a minimal sketch of this roll-up follows below).
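For concreteness, here is a minimal illustration of the roll-up just described; the slot counts (60/60/24), the averaging rule, and all names are assumptions for the sketch, not the poster's actual code.

    // Illustration of the roll-up: fixed ring buffers per time scale, where
    // filling one scale produces a single averaged entry in the next.
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>

    struct RollingStats {
        std::array<std::uint32_t, 60> seconds{};   // bytes per second
        std::array<std::uint32_t, 60> minutes{};   // bytes per minute (averaged seconds)
        std::array<std::uint32_t, 24> hours{};     // bytes per hour   (averaged minutes)
        std::size_t sec = 0, min = 0, hr = 0;

        void addBytes(std::uint32_t bytes) { seconds[sec] += bytes; }   // apply a delta

        // Called once per second ("scroll"): advance the second slot, and when a
        // full minute (or hour) has elapsed, fold it into the coarser scale.
        void scroll() {
            if (++sec == seconds.size()) {
                sec = 0;
                minutes[min] = average(seconds);
                if (++min == minutes.size()) {
                    min = 0;
                    hours[hr] = average(minutes);
                    if (++hr == hours.size()) hr = 0;
                }
            }
            seconds[sec] = 0;                       // reuse the slot for the new second
        }

    private:
        template <class A>
        static std::uint32_t average(const A& a) {
            std::uint64_t sum = std::accumulate(a.begin(), a.end(), std::uint64_t{0});
            return static_cast<std::uint32_t>(sum / a.size());
        }
    };

The per-second cost stays constant because a coarser scale is only touched when the finer one wraps.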
I do not wish responders to suggest any of their own algorithms, I have already (almost) completed all of my own without any problems and with many performance considerations. And others will likely be able to have a look at my algorithm when I have finished and published as Open Source.
What I do wish to see is any "well established" algorithms for comparison. Perhaps mine will be better, perhaps mine will be worse. Google isn't good at this sort of question, I need your help.
Thanks!
Thanks to a comment from @rici, I found that the "stream statistics" domain is what is required. There are Data Stream Management Systems (DSMS) for dealing with stream statistics. Whereas SQL RDBMS systems store data and generate statistics via SQL queries, a Data Stream Management System enables the processing of a continuous stream of data, given one or more queries.
This paper describes a DSMS as:
Being able to sacrifice quality for qualitative use
Being single pass, because the data is vast
Having Queries treating data as sequences not sets
And more...
This one depicts a diagram of such a DSMS and references the Network Traffic Analysis problem domain.
This paper describes StreamSQL, an SQL-like syntax for defining continuous queries.
Even though proprietary solutions are not accessible, there certainly are well-established algorithms. I can therefore test the performance of my specialised system against general stream query tools.
Several products/prototypes of DSMS can be found in this wiki page; Odysseus in particular is of interest, being Java-based and open source.

Techniques for handling arrays whose storage requirements exceed RAM

I am author of a scientific application that performs calculations on a gridded basis (think finite difference grid computation). Each grid cell is represented by a data object that holds values of state variables and cell-specific constants. Until now, all grid cell objects have been present in RAM at all times during the simulation.
I am running into situations where the people using my code wish to run it with more grid cells than they have available RAM. I am thinking about reworking my code so that information on only a subset of cells is held in RAM at any given time. Unfortunately the grids (or matrices if you prefer) are not sparse, which eliminates a whole class of possible solutions.
Question: I assume that there are libraries out in the wild designed to facilitate this type of data access (i.e. retrieve constants and variables, update variables, store for future reference, wipe memory, move on...). After several hours of searching Google and Stack Overflow, I have found relatively few libraries of this sort.
I am aware of a few options, such as this one from the HSL mathematical library: http://www.hsl.rl.ac.uk/specs/hsl_of01.pdf. I'd prefer to work with something that is open source and is written in Fortran or C. (my code is mostly Fortran 95/2003, with a little C and Python thrown in for good measure!)
I'd appreciate any suggestions regarding available libraries or advice on how to reformulate my problem. Thanks!
Bite the bullet and roll your own?
I deal with too-large data all the time, such as 30,000+ data series of half-hourly data that span decades. Because of the regularity of the data (daylight-saving changeovers are a problem, though), it proved quite straightforward to devise a scheme involving a random-access disc file and procedures ReadDay and WriteDay that take a series number and a day number, with further details because series start and stop at different dates. Thus, a day's data in an array might have been Array(Run,DayNum) but is now ReturnCode = ReadDay(Run,DayNum,Array) and so forth, the return codes indicating presence/absence of that day's data, etc. The key is that a day's data is a convenient and (almost) regular size, and although my program allocates a buffer of one record per series, it runs in ~100MB of memory rather than GB.
Because your array is non-sparse, it is regular. Granted that a grid cell's data are of fixed size, you could devise a random-access disc file with each record holding one cell, or perhaps a row's (or column's) worth of cells, or some other worthwhile blob size. I chose 4,096 bytes/record, as that is the disc file allocation size. Let the computer's operating system and disc storage controller do whatever buffering to real memory they feel up to. Typical execution is then restricted by the speed of data transfer unless the local computation is heavy; I see CPU use of a few percent until data requests start being satisfied from buffers.
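As a rough illustration of such a record-per-block random-access file (the answer's own scheme is Fortran; the Cell layout, record size, and file name below are assumptions for the sketch):

    // Each fixed-size record holds one block of cells at a computed file offset.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Cell {                     // fixed-size per-cell payload (assumed)
        double state[8];
        double constants[8];
    };

    constexpr std::size_t kRecordBytes    = 4096;               // match the disc allocation size
    constexpr std::size_t kCellsPerRecord = kRecordBytes / sizeof(Cell);

    bool readRecord(std::FILE* f, long recordIndex, Cell* out) {
        std::fseek(f, recordIndex * static_cast<long>(kRecordBytes), SEEK_SET);
        return std::fread(out, sizeof(Cell), kCellsPerRecord, f) == kCellsPerRecord;
    }

    bool writeRecord(std::FILE* f, long recordIndex, const Cell* in) {
        std::fseek(f, recordIndex * static_cast<long>(kRecordBytes), SEEK_SET);
        return std::fwrite(in, sizeof(Cell), kCellsPerRecord, f) == kCellsPerRecord;
    }

    int main() {
        std::FILE* f = std::fopen("grid.dat", "w+b");           // backing store for the grid
        if (!f) return 1;
        std::vector<Cell> block(kCellsPerRecord);                // scratch buffer: one record in RAM
        writeRecord(f, 0, block.data());                         // write a block of cells out...
        readRecord(f, 0, block.data());                          // ...and fetch it back later
        std::fclose(f);
    }

Only the scratch buffer lives in RAM; the operating system's file cache does the rest of the buffering, as described above.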
Because Fortran uses the same syntax for functions as for arrays (unlike, say, Pascal), instead of declaring DIMENSION ARRAY(Big,Big) you would remove that and devise FUNCTION ARRAY(i,j), and all read references in your source file stay as they are. Alas, in the absence of a "palindromic" function declaration, assignments of values to your array will have to be done with a different syntax, so you devise a subroutine or similar. Possibly a scratchpad array could be collated, worked upon with convenient syntax, and then written back if changed.

How do I handle the creation/destruction of many objects in memory effectively?

I'm in the process of making a game of my own. One of the goals is to have as many objects within the world as possible. In this game, many objects will need to be created at unpredictable times (for example, a weapon firing will create an object), and once that projectile hits something, the object will need to be destroyed as well (and maybe the thing it hits).
So I was wondering what the best way to handle this in memory is. I've thought about creating a stack or table, adding pointers to those objects there, and creating and destroying the objects on demand; however, what if several hundred (or thousand) objects need to be created or destroyed at once between frames? I want to keep a steady and fluid frame rate, and such a surge in system calls would surely slow it down.
So I've thought I could keep a number of objects in memory so that I can just copy information into them and use them without having to request memory on demand. But how much memory should I try to reserve? Or should I not worry about that as long as the user's computer has enough (presumably they will be focusing on the game and not running a weather simulation in the background)?
What would be the best way of handling this?
Short answer: it depends on the expected lifetime of the objects.
Usually, the methods are combined. Objects that are fairly static and unlikely to be removed or created often (usually players, levels, certain objects in the levels, etc.) are created with the first method you described (a list of objects, an array, a singleton, etc.). The exact method depends on the game and the object being created.
For short-term objects like bullets, particle effects, or, in some games, the enemies themselves, something like the object pool pattern is usually used. A chunk of memory is reserved at the beginning of the game and reused throughout the course of the game for bullets and pretty particle effects. As for "how much memory should I reserve?", the ideal answer is "as little as possible". Unfortunately, it's hard to figure that out sometimes. The best way is to take a guess at how many bullets or whatnot you plan on having on screen at any given time, multiply by two (for when you decide that your bullet-hell shooter doesn't really work too well with only 50 bullets), and then add a little buffer. To make it easier, store that value in an easily understood #define BULLET_MAX 110 so you can change it when the game is closer to done and you can reasonably be sure the value isn't going to fluctuate as much. For extra fun, you can tie the value into a config variable and have the graphics settings affect it.
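A minimal sketch of such a pool (BULLET_MAX, the Bullet fields, and the free-list layout are illustrative assumptions, not a definitive engine design): all bullets live in one preallocated array, and spawning or destroying one is just popping or pushing an index, with no per-frame heap allocation.

    // Object pool sketch: preallocated storage plus a free list of unused slots.
    #include <array>
    #include <cstddef>
    #include <vector>

    #define BULLET_MAX 110                        // tune once the game is closer to done

    struct Bullet {
        float x = 0, y = 0, vx = 0, vy = 0;
        bool active = false;
    };

    class BulletPool {
    public:
        BulletPool() {
            freeList_.reserve(BULLET_MAX);
            for (std::size_t i = BULLET_MAX; i > 0; --i) freeList_.push_back(i - 1);
        }
        Bullet* spawn() {                         // returns nullptr when the pool is exhausted
            if (freeList_.empty()) return nullptr;
            std::size_t i = freeList_.back();
            freeList_.pop_back();
            bullets_[i].active = true;
            return &bullets_[i];
        }
        void despawn(Bullet* b) {                 // recycle the slot; nothing is deleted
            b->active = false;
            freeList_.push_back(static_cast<std::size_t>(b - bullets_.data()));
        }
    private:
        std::array<Bullet, BULLET_MAX> bullets_{};
        std::vector<std::size_t> freeList_;       // indices of inactive slots
    };

The only heap allocation is the free list's reserve in the constructor; during gameplay, spawn and despawn are O(1) and allocation-free, which is what keeps frame times steady.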
In real-time games, where fluidity is critical, developers often allocate a large chunk of memory at the beginning of a level and avoid any allocation/deallocation in the middle of the game.
You can often design the game mechanics so they prevent the game from running out of memory (such as increasing the chance of weapon jamming when the player shoots too much too often).
Ultimately though, test your game on your targeted minimum supported machine; if it's fast enough there, it's fast enough. Don't overcomplicate your code for hypothetical situations.
