Can FlatBuffers still prove beneficial in terms of performance and memory efficiency when we copy the source stream (e.g. from a streambuf in boost.asio) into application-allocated memory in user space? Or does it only perform best with zero copy?
Just curious about this while implementing a boost.asio service using flatbuffers as the protocol.
Any example code for achieving zero copy in the context of FlatBuffers and boost.asio is also welcome.
Please clarify with some relevant reading material or pointers in case you feel my overall understanding of the topic itself is incorrect at the moment.
Thanks in advance.
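For context, here is roughly the pattern I have in mind (a minimal, untested sketch; `Message`, `VerifyMessageBuffer`, and the 4-byte length-prefix framing are placeholders for my own schema and protocol, not anything standard):

```cpp
// Minimal sketch: read a length-prefixed FlatBuffers message from a socket
// into an application-owned buffer and access it in place. "Message" stands
// in for the root table generated from the schema.
#include <boost/asio.hpp>
#include <flatbuffers/flatbuffers.h>
#include <cstdint>
#include <vector>
#include "message_generated.h"  // hypothetical generated header

const Message* read_message(boost::asio::ip::tcp::socket& socket,
                            std::vector<uint8_t>& storage) {
    // Read the length prefix (assumes both peers agree on byte order).
    uint32_t size = 0;
    boost::asio::read(socket, boost::asio::buffer(&size, sizeof(size)));

    // One copy from the socket into user-space storage owned by the caller.
    storage.resize(size);
    boost::asio::read(socket, boost::asio::buffer(storage));

    // Optional, but recommended for untrusted input.
    flatbuffers::Verifier verifier(storage.data(), storage.size());
    if (!VerifyMessageBuffer(verifier)) return nullptr;

    // Accessors read fields directly out of `storage`; nothing is unpacked.
    return flatbuffers::GetRoot<Message>(storage.data());
}
```

My working assumption is that the copy from the socket into `storage` is unavoidable with plain sockets, and that FlatBuffers still avoids any further per-field unpacking after that point; please correct me if that assumption is wrong.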
I'm implementing an algorithm using Halide and comparing it against a hand-tuned (CUDA) version of the same algorithm.
Accelerating the Halide implementation mostly went well, but it is still a bit slower than the hand-tuned version. So I looked at the exact execution time of each Func using nvvp (the NVIDIA Visual Profiler). Doing that, I found that the hand-tuned implementation overlaps the execution of several similar functions, each of which is implemented as a separate Func in the Halide version. CUDA streams are used to achieve this overlap.
I would like to know whether I can exploit the GPU in a similar way in Halide.
Thanks for reading.
Currently the runtime has no support for CUDA streams. It might be possible to replace the runtime with something that can do this, but there is no extra information passed in to control the concurrency. (The runtime is somewhat designed to be replaceable, but there is a bit of a notion of a single queue and full dependency information is not passed down. It may be possible to reconstruct the dependencies from the inputs and outputs, but that starts to be a lot of work to solve a problem the compiler should be solving itself.)
We're talking about how to express such control in the schedule. One possibility is to use the support being prototyped in the async branch to do this, but we haven't totally figured out how to apply this to GPUs. (The basic idea is scheduling a Func async on a GPU would put it on a different stream. We'd need to use GPU synchronization APIs to handle producer/consumer dependencies.) Ultimately this is something we are interested in exploiting, but work needs to be done.
This is a subject that I have never found a suitable answer to, and so I was wondering if the helpful people of Stack Overflow may be able to answer this.
First of all: I'm not asking for a tutorial or anything, merely a discussion because I have not seen much information online about this.
Basically, what I'd like to know is how one designs a new type of partition format, and how it can then be interfaced with the operating system for use?
And better yet, what makes one partition format better than another? Is it performance, security, filename or file-size limits? Or is there more to it?
It's just something I've always wondered about. I'd love to dabble in creating one just for education purposes someday.
OK, although the question is broad, I'll take a stab at it:
Assume that we are talking about a 'filesystem', as opposed to certain 'raw' partition formats such as swap formats.
A filesystem should be able to map low-level OS, BIOS, network, or custom calls into a coherent abstraction of file and folder names that can be used by user applications. So, in your case, a 'partition format' should be something that presents low-level disk sectors and cylinders and their contents as a file-and-folder abstraction.
Along the way, if you can provide features such as less fragmentation, redundant node indexes, journalling to prevent data loss, survival in case of loss of power, working around bad sectors, redundant data, mirroring of hardware, and so on, then it can be considered better than another format that does not provide such features. If you can optimise file sizes to match the usage of disk sectors and clusters while accommodating both very small and very large files, that would be a plus.
Thorough bullet-proof security and testing would be considered essential for any non-experimental use.
To start hacking on your own, work with one of the slightly older filesystems like ext2. You would need considerable build/compile/kernel skills to get going, but nothing monumental.
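To make the sectors-to-files mapping a bit more concrete, here is a purely illustrative sketch of the kind of fixed on-disk header (superblock) a toy filesystem might define. Every field name and size here is invented for illustration and does not correspond to any real format:

```cpp
// Illustrative only: a toy on-disk "superblock" for a hypothetical filesystem.
// A real format (ext2, FAT, ...) defines many more structures: inode tables,
// allocation bitmaps, directory entries, journals, etc.
#include <cstdint>

#pragma pack(push, 1)           // on-disk layout must not depend on compiler padding
struct ToySuperblock {
    uint32_t magic;             // identifies the format on the disk
    uint16_t version;           // format revision, for compatibility checks
    uint16_t block_size_log2;   // block size as a power of two (12 -> 4096 bytes)
    uint64_t total_blocks;      // size of the partition in blocks
    uint64_t free_blocks;       // free-space accounting
    uint64_t inode_table_block; // where the table of file metadata starts
    uint64_t root_dir_inode;    // entry point for the file-and-folder hierarchy
    uint32_t checksum;          // detect a corrupted superblock
};
#pragma pack(pop)

static_assert(sizeof(ToySuperblock) == 4 + 2 + 2 + 8 + 8 + 8 + 8 + 4,
              "on-disk structures must have an exact, fixed size");
```

A driver would read this block from a well-known sector, validate the magic number and checksum, and from there locate the inode table in order to present the file-and-folder abstraction described above.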
It seems that if it were possible to serialize data as the raw memory chunks that its properties and fields are made up of, it ought to be much faster to communicate those objects to another system: the receiving system would only have to allocate memory for the incoming bytes and fix up the reference pointers to point where they should.
Yes, I know that's a little oversimplified and there are probably a plethora of reasons why it's difficult to do (like circular references). But I'm wondering if anyone has tried it and if there is a way to do it possibly with objects that meet certain restrictions?
On the one hand this is probably me just trying to micro-optimize, but on the other hand it really seems like this could be pretty useful in certain scenarios where performance is vital.
Obviously this kind of serialization is going to be faster than JSON any day (XML is slow by definition. In fact, I think that's what the L stands for. It was supposed to be XMS, but because it's so slow they missed the S and ended up with an L). However, I doubt it would beat efficient binary serializations such as Google's Protocol Buffers in real world scenarios.
If your serialized entities hold no references to other entities, and your memory layout on the two sides is exactly the same (same alignment, same order, etc...), you'll earn a little bit of performance by copying the memory buffer once, instead of doing so in chunks. However, the second you have to reconstruct references, memory copying is going to be trivial compared to looking up the referenced object. Copying memory is fast, especially when done in order, minimizing cache misses.
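As a rough sketch of that reference-free case (in C++ terms; the same idea applies elsewhere, and the struct here is just an invented example):

```cpp
// Sketch: "serializing" a reference-free object by copying its raw bytes.
// Only valid for trivially copyable types, and only if sender and receiver
// agree on field order, alignment, padding and endianness.
#include <cstring>
#include <cstdint>
#include <type_traits>
#include <vector>

struct Sample {                 // no pointers or references, fixed-size fields only
    uint32_t id;
    double   value;
    char     tag[8];
};
static_assert(std::is_trivially_copyable<Sample>::value, "raw copy requires this");

std::vector<uint8_t> serialize(const Sample& s) {
    std::vector<uint8_t> buf(sizeof(Sample));
    std::memcpy(buf.data(), &s, sizeof(Sample));   // one bulk copy, no per-field work
    return buf;
}

Sample deserialize(const std::vector<uint8_t>& buf) {
    Sample s;
    std::memcpy(&s, buf.data(), sizeof(Sample));   // again a single copy
    return s;
}
```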
Things like normal memory addresses will completely break across serialization and deserialization. However, if you're clever and careful, you could devise a mechanism by which such a data structure can still be serialized. Maybe translate addresses into offsets in bytes from a base address?
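A small sketch of what that offset translation could look like (the Node type is invented; this is roughly the trick used by relocatable formats and shared-memory offset pointers):

```cpp
// Sketch: replacing a raw pointer with an offset from the buffer base, so the
// serialized blob stays valid at whatever address it is loaded.
#include <cstdint>
#include <cstddef>

struct Node {
    int32_t  value;
    uint32_t next_offset;   // 0 = null, otherwise byte offset of the next Node
};

// Follow an offset "pointer" inside a relocatable buffer.
inline const Node* next(const uint8_t* base, const Node& n) {
    return n.next_offset == 0
        ? nullptr
        : reinterpret_cast<const Node*>(base + n.next_offset);
}

// When building the buffer, addresses are converted to offsets once.
inline uint32_t to_offset(const uint8_t* base, const Node* target) {
    return target == nullptr
        ? 0
        : static_cast<uint32_t>(reinterpret_cast<const uint8_t*>(target) - base);
}
```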
So sometimes I need to write a data structure I can't find on Hackage, or what I find isn't tested or of high enough quality for me to trust, or it's just something I don't want as a dependency. I am reading Okasaki's book right now, and it's quite good at explaining how to design asymptotically fast data structures.
However, I am working specifically with GHC. Constant factors are a big deal for my applications. Memory usage is also a big deal for me. So I have questions specifically about GHC.
In particular:
How to maximize sharing of nodes
How to reduce memory footprint
How to avoid space leaks due to improper strictness/laziness
How to get GHC to produce tight inner loops for important sections of code
I've looked around various places on the web, and I have a vague idea of how to work with GHC, for example, looking at core output, using UNPACK pragmas, and the like. But I'm not sure I get it.
So I popped open my favorite data structures library, containers, and looked at the Data.Sequence module. I can't say I understand a lot of what they're doing to make Seq fast.
The first thing that catches my eye is the definition of FingerTree a. I assume that's just me being unfamiliar with finger trees though. The second thing that catches my eye is all the SPECIALIZE pragmas. I have no idea what's going on here, and I'm very curious, as these are littered all over the code.
Many functions also have an INLINE pragma associated with them. I can guess what that means, but how do I make a judgement call on when to INLINE functions?
Things get really interesting around line ~475, in a section headed 'Applicative Construction'. They define a newtype wrapper to represent the Identity monad, they write their own copy of the strict state monad, and they have a function called applicativeTree which is apparently specialized to the Identity monad, and this specialization increases sharing of the function's output. I have no idea what's going on here. What sorcery is being used to increase sharing?
Anyway, I'm not sure there's much to learn from Data.Sequence. Are there other 'model programs' I can read to gain wisdom? I'd really like to know how to soup up my data structures when I really need them to go faster. One thing in particular is writing data structures that make fusion easy, and how to go about writing good fusion rules.
That's a big topic! Most has been explained elsewhere, so I won't try to write a book chapter right here. Instead:
Real World Haskell, ch 25, "Performance" - discusses profiling, simple specialization and unpacking, reading Core, and some optimizations.
Johan Tibell is writing a lot on this topic:
Computing the size of a data structure
Memory footprints of common data types
Faster persistent structures through hashing
Reasoning about laziness
And some things from here:
Reading GHC Core
How GHC does optimization
Profiling for performance
Tweaking GC settings
General improvements
More on unpacking
Unboxing and strictness
And some other things:
Intro to specialization of code and data
Code improvement flags
applicativeTree is quite fancy, but mainly in a way which has to do with FingerTrees in particular, which are quite a fancy data structure themselves. We had some discussion of the intricacies over at cstheory. Note that applicativeTree is written to work over any Applicative. It just so happens that when it is specialized to Id, it can share nodes in a manner that it otherwise couldn't. You can work through the specialization yourself by inlining the Id methods and seeing what happens. Note that this specialization is used in only one place: the O(log n) replicate function. The fact that the more general function specializes neatly to the constant case is a very clever bit of code reuse, but that's really all.
In general, Sequence teaches more about designing persistent data structures than about all the tricks for eking out performance, I think. Dons' suggestions are of course excellent. I'd also just browse through the source of the really canonical and tuned libs: Map, IntMap, Set, and IntSet in particular. Along with those, it's worth taking a look at Milan's paper on his improvements to containers.
I'm a C# developer, and I use data structures such as List and Dictionary all the time. I'm reading some interview books, and they all seem to suggest that we should know how to implement such data structures as well as how to use them. Do a lot of you share the same viewpoint?
I would say that at a minimum every competent programmer should understand the internals of the most widely used data structures.
By that I mean being able to explain how they work internally, and what complexity guarantees (both time and space) they offer.
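For example, here is the kind of internals I mean for a List-like structure, sketched in C++ (the language doesn't matter; the point is the doubling strategy and why appends end up amortized O(1)):

```cpp
// Illustration: a minimal growable array, the idea behind C#'s List<T> or
// C++'s std::vector. Appending is amortized O(1) because capacity doubles,
// at the cost of up to roughly 2x extra space held in reserve.
#include <cstddef>
#include <memory>
#include <utility>

template <typename T>
class GrowableArray {
    std::unique_ptr<T[]> data_;
    std::size_t size_ = 0, capacity_ = 0;
public:
    void push_back(T value) {
        if (size_ == capacity_) {                       // out of room: double
            std::size_t new_cap = capacity_ ? capacity_ * 2 : 4;
            std::unique_ptr<T[]> bigger(new T[new_cap]);
            for (std::size_t i = 0; i < size_; ++i)     // O(n) move, but rare
                bigger[i] = std::move(data_[i]);
            data_ = std::move(bigger);
            capacity_ = new_cap;
        }
        data_[size_++] = std::move(value);              // the common O(1) path
    }
    std::size_t size() const { return size_; }
    T& operator[](std::size_t i) { return data_[i]; }
};
```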
Yes.
For the same reasons that a C or C++ programmer should be familiar with assembly language; it helps you understand what is going on under the hood, and improves your ability to select the appropriate data structure for your particular programming problem.
In the same vein, you don't have to write a compiler to use your favorite programming language effectively, but you can greatly improve your knowledge of that language by writing a compiler for it.
If you don't know how to implement the data structure how can you possibly say you understand the strengths and weaknesses of the structure in question? As aix mentioned it should be a requirement that you understand the internals of what you are using. I would never trust a mechanic who didn't understand how an engine worked.
It is preferable that you know how to implement these data structures, but you do not need that knowledge in order to be a competent or even effective programmer.
You should have a high-level understanding not only of what they do (obviously), but also of how they do it. That should suffice.
I don't need to know the inner workings of every tool I use to be able to use it effectively. I just need to have a grasp on what it does, which uses it is suited to, and which uses it is not suited to.
The best programmers will know such data structures and all known variations inside out, but then they will also know every little corner of their chosen language / framework as well. They are well above the 'competent' level.