I would like to use an in-memory B-tree. I am considering LMDB and STX.
I would appreciate help in understanding the difference between them, among other things in the context of concurrency (I'm not sure whether STX supports it).
STX implements in-memory (non-persistent) B+ trees as C++ templates; LMDB implements memory-mapped (persistent) B+ trees as a C library. While LMDB can deliver the performance of an in-memory data structure, it is more than that, and if you only need an in-memory container it may be overkill for whatever you're doing. On concurrency: LMDB is designed for it (readers run concurrently without locks via MVCC, with a single writer at a time), while the STX containers, like the STL containers they mirror, provide no internal synchronization, so you must serialize access yourself.
What is the purpose of creating your own linked list, or other data structures like maps, queues, or hash tables, for some programming language, instead of using the built-in ones? Why should I create them myself? Thank you.
Good question! There are several reasons why you might want to do this.
For starters, not all programming languages ship with all the nice data structures that you might want to use. For example, C doesn't have built-in libraries for any data structures (though it does have bsearch and qsort for arrays), so if you want to use a linked list, hash table, etc. in C you need to either build it yourself or use a custom third-party library.
Other languages (say, JavaScript) have built-in support for some but not all types of data structures. There's no native JavaScript support for linked lists or binary search trees, for example. And I'm not aware of any mainstream programming language that has a built-in library for tries, though please let me know if that's not the case!
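To make that concrete, here is a minimal trie sketch in Java (restricted to lowercase a-z for brevity), the kind of structure you would typically have to hand-roll since no mainstream standard library ships one:

    // A minimal trie sketch over lowercase a-z. Illustrative only.
    class Trie {
        private final Trie[] children = new Trie[26];
        private boolean isWord;

        void insert(String word) {
            Trie node = this;
            for (char c : word.toCharArray()) {
                int i = c - 'a';
                if (node.children[i] == null) node.children[i] = new Trie();
                node = node.children[i];
            }
            node.isWord = true;
        }

        boolean contains(String word) {
            Trie node = this;
            for (char c : word.toCharArray()) {
                node = node.children[c - 'a'];
                if (node == null) return false;
            }
            return node.isWord;
        }
    }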
The above examples indicate places where a lack of support, period, for some data structure would require you to write your own. But there are other reasons why you might want to implement your own custom data structures.
A big one is efficiency. Put yourself in the position of someone who has to implement a dynamic array, hash table, and binary search tree for a particular programming language. You can't possibly know what workflows people are going to subject your data structures to. Are they going to do a ton of inserts and deletes, or are they mostly going to be querying things? For example, if you're writing a binary search tree type where insertions and deletions are common, you probably would want to look at something like a red/black tree, but if insertions and deletions are rare then an AVL tree would work a lot better. But you can't know this up front, because you have to write one implementation that stands the test of time and works pretty well for all applications. That might counsel you to pick a "reasonable" choice that works well in many applications, but isn't aggressively performance-tuned for your specific application. Coding up a custom data structure, therefore, might let you take advantage of the particular structure of the problem you're solving.
In some cases, the language specification makes it impossible or difficult to use fast data structures in the standard library's implementation. For example, C++ requires its associative containers to allow deletion and insertion of elements without invalidating any iterators into them. This makes it significantly more challenging / inefficient to implement those containers with types like B-trees, which might actually perform a bit better than regular binary search trees due to cache effects. Similarly, the unordered containers have an interface that assumes chained hashing, which isn't necessarily how you'd want to implement a hash table. That's why, for example, Google provides alternatives to the standard containers that are optimized around custom data structures which don't easily fit into the language framework.
Another reason why libraries might not provide the fastest containers would be challenges in providing a simple interface. For example, cuckoo hashing is a somewhat recent hashing scheme that has excellent performance in practice and guarantees worst-case efficient lookups. But to make cuckoo hashing work, you need the ability to select multiple hash functions for a given data type. Most programming languages have a concept that each data type has "a" hash function (std::hash<T>, Object.hashCode, __hash__, etc.), which isn't compatible with this idea. The languages could in principle require users to write families of hash functions with the idea that there would be many different hashes to pick from per object, but that complicates the logistics of writing your own custom types. Leaving it up to the programmer to write families of hash functions for types that need it then lets the language stay simple.
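To illustrate the point, here is a minimal cuckoo hash set sketch in Java. It is not production code: the two mixing functions h1 and h2 are arbitrary choices for the example, and the point is simply that the structure needs two independent hash functions per key, which a single hashCode-style hook can't supply:

    // Sketch of a cuckoo hash set for ints: two tables, two hash functions.
    // Every key lives either at t1[h1(key)] or t2[h2(key)], so lookups probe
    // at most two slots. The mixing constants are arbitrary illustrative picks.
    class CuckooSet {
        private static final int MAX_KICKS = 32;
        private int capacity = 16;
        private Integer[] t1 = new Integer[capacity];
        private Integer[] t2 = new Integer[capacity];

        private int h1(int x) { return Math.floorMod(x * 0x9E3779B9, capacity); }
        private int h2(int x) { return Math.floorMod((x ^ (x >>> 16)) * 0x85EBCA6B, capacity); }

        boolean contains(int x) {
            return Integer.valueOf(x).equals(t1[h1(x)])
                || Integer.valueOf(x).equals(t2[h2(x)]);
        }

        void insert(int x) {
            if (contains(x)) return;
            Integer cur = x;
            for (int i = 0; i < MAX_KICKS; i++) {
                Integer evicted = t1[h1(cur)];   // claim cur's slot in table 1,
                t1[h1(cur)] = cur;               // kicking out any occupant
                if (evicted == null) return;
                cur = evicted;                   // the evictee moves to table 2
                evicted = t2[h2(cur)];
                t2[h2(cur)] = cur;
                if (evicted == null) return;
                cur = evicted;                   // still homeless: loop again
            }
            rehash();                            // displacement cycle: grow and retry
            insert(cur);
        }

        private void rehash() {
            Integer[] old1 = t1, old2 = t2;
            capacity *= 2;
            t1 = new Integer[capacity];
            t2 = new Integer[capacity];
            for (Integer v : old1) if (v != null) insert(v);
            for (Integer v : old2) if (v != null) insert(v);
        }
    }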
And finally, there's just plain innovation in the space. New data structures get invented all the time, and languages are often slow to grow and change. There's been a bunch of research into new faster binary search trees recently (check out WAVL trees as an example) or new hashing strategies (cuckoo hashing and the "Swiss Table" that Google developed), and language designers and implementers aren't always able to keep pace with them.
So, overall, the answer is a mix of "because you can't assume your favorite data structure will be there" and "because you might be able to get better performance rolling your own implementations."
There's one last reason I can think of, and that's "to learn how the language and the data structure work." Sometimes it's worthwhile building out custom data types just to sharpen your skills, and you'll often find some really clever techniques in data structures when you do!
All of this being said, I wouldn't recommend defaulting to coding your own version of a data structure every time you need one. Library versions are usually a pretty safe bet unless you're looking for extra performance or you're missing some features that you need. But hopefully this gives you a better sense as to why you may want to consider setting aside the default, well-tested tools and building out your own.
Hope this helps!
I was wondering whether interpreters cheat to get better performance. As I understand it, the only real data structure in Scheme is the cons cell.
Obviously, a cons cell is good for building simple data structures like linked lists and trees, but I think it might make the code slower in some cases, for example when you want to access the cadr of an object. It would get worse with a data structure holding many more elements...
That said, maybe Scheme's car and cdr are so efficient that they're not much slower than, say, a register offset in C++.
I was wondering whether it is necessary to implement a special data structure that allocates a native memory block, something similar to using malloc. I'm talking about pure Scheme, not anything related to the FFI.
As I understand it, the only real data structure in Scheme is the cons cell.
That’s not true at all.
R5RS, R6RS, and R7RS Scheme all include vectors as well as pairs/lists, and R6RS and R7RS add bytevectors; vectors and bytevectors are the contiguous memory blocks you allude to.
Also, consider that Scheme is a minimal standard, and individual Scheme implementations tend to provide many more primitives than exist in the standard. These make it possible to do efficient I/O, for example, and many Schemes also provide an FFI to call C code if you want to do something that isn’t natively supported by the Scheme implementation you’re using.
It’s true that linked lists are a relatively poor general-purpose data structure, but they are simple and work alright for iteration from front to back, which is fairly common in functional programming. If you need random access, though, you’re right—using linked lists is a bad idea.
First off, there are many primitive types, many different compound types, and even user-defined types in Scheme.
In C++, the memory model and how values are stored are a crucial part of the standard. In Scheme, you don't have access to the language internals as standard, but implementations can expose them so that a higher percentage of the implementation can be written in Scheme itself.
The standard doesn't dictate how an implementation chooses to store data, so even though many implementations imitate one another by embedding primitive values in the address itself (tagged pointers) and representing every other value as an object on the heap, it doesn't have to be that way. Using pairs as the implementation of vectors (arrays, in C++ terms) would be pushing it, and would make for a very unpopular implementation, if not just a funny prank.
With R6RS you can make your own types; the type system is even extensible through records:
    ; An R6RS record type: a binary-tree node with three immutable fields.
    (define-record-type (node make-node node?)
      (fields (immutable value node-value)
              (immutable left node-left)
              (immutable right node-right)))
node? would be disjoint, and thus it returns #t for no values other than those made with the constructor make-node; and this uses 3 fields instead of two cons cells to store the same data.
Now, C++ perhaps has the edge by default when it comes to storing elements of the same type in an array, but you can work around this in many ways, e.g. with the same trick shown in this video about optimizing Java for memory usage. I would start with good data modeling using records and only worry about performance when it becomes an issue.
The topic of algorithms class today was reimplementing data structures, specifically ArrayList in Java. The fact that you can customize a structure in various ways definitely got me interested, particularly the variations of the add() and iterator.remove() methods.
But is reimplementing and customizing a data structure something of more interest to academics than to real-world programmers? Has anyone reimplemented their own version of a data structure in a commercial application/program, and why did you pick that route over your particular language's implementation?
Knowing how data structures are implemented, and how they can be implemented, is definitely of interest to everyone, not just academics. While you will most likely not reimplement a data structure if the language already provides an implementation with suitable functions and performance characteristics, it is very possible that you will have to create your own data structure by composing other data structures, or that you will need to implement one with slightly different behavior than a well-known data structure. In that case, you certainly will need to know how the original data structure is implemented. Alternatively, you may end up needing a data structure that does not exist, or one that provides behavior similar to an existing data structure but is used in a way that requires it to be optimized for a different set of operations. Again, such a situation would require you to know how to implement (and alter) the data structure, so yes, it is of interest.
Edit
I am not advocating that you reimplement existing data structures! Don't do that. What I'm saying is that the knowledge does have practical application. For example, you may need to create a bidirectional map data structure (which you can implement by composing two unidirectional maps), or you may need to create a stack that keeps track of a variety of statistics (such as min, max, and mean) by using an existing stack data structure with an element type that contains the value as well as those statistics. These are some trivial examples of things you might need to implement in the real world.
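As a sketch of that second example (all names here are illustrative), here is a stack in which each entry carries the running min, max, and sum of everything below it, so every statistic is an O(1) read and pop needs no recomputation:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // A stack whose entries carry running statistics alongside the value.
    class StatsStack {
        private static final class Entry {
            final int value, min, max;
            final long sum;   // running sum of the whole stack up to here
            Entry(int value, int min, int max, long sum) {
                this.value = value; this.min = min; this.max = max; this.sum = sum;
            }
        }

        private final Deque<Entry> stack = new ArrayDeque<>();

        void push(int value) {
            if (stack.isEmpty()) {
                stack.push(new Entry(value, value, value, value));
            } else {
                Entry top = stack.peek();
                stack.push(new Entry(value,
                        Math.min(value, top.min),
                        Math.max(value, top.max),
                        top.sum + value));
            }
        }

        int pop()     { return stack.pop().value; }
        int min()     { return stack.peek().min; }   // statistics of the current
        int max()     { return stack.peek().max; }   // contents, all O(1)
        double mean() { return (double) stack.peek().sum / stack.size(); }
    }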
I have re-implemented some of a language's built-in data structures, functions, and classes on a number of occasions. As an embedded developer, the main reason I would do that is for speed or efficiency. The standard libraries and types were designed to be useful in a variety of situations, but there are many instances where I can create a more specialized version that is custom-tailored to take advantage of the features and limitations of my current platform. If the language doesn't provide a way to open up and modify existing classes (like you can in Ruby, for instance), then re-implementing the class/function/structure can be the only way to go.
For example, one system I worked on used a MIPS CPU that was speedy when working with 32-bit numbers but slower when working with smaller ones. I re-wrote several data structures and functions to use 32-bit integers instead of 16-bit integers, and also specified that the fields be aligned to 32-bit boundaries. The result was a noticeable speed boost in a section of code that was bottlenecking other parts of the software.
That being said, it was not a trivial process. I ended up having to modify every function that used that structure and I ended up having to re-write several standard library functions as well. In this particular instance, the benefits outweighed the work. In the general case, however, it's usually not worth the trouble. There's a big potential for hard-to-debug problems, and it's almost always more work than it looks like. Unless you have specific requirements or restrictions that the existing structures/classes don't meet, I would recommend against re-implementing them.
As Michael mentions, it is indeed useful to know how to re-implement structures even if you never do so. You may find a problem in the future that can be solved by applying the principles and techniques used in existing data structures.
Would you recommend Google Protocol Buffers or Caucho Hessian for a cross-language over-the-wire binary format? Or anything else, for that matter - Facebook Thrift for example?
We use Caucho Hessian because of the reduced integration costs and simplicity. Its performance is very good, so it's perfect for most cases.
For a few apps where cross-language integration is not that important, there's an even faster library called Kryo that can squeeze out even more performance.
Unfortunately it's not that widely used, and its protocol is not quasi-standard like Hessian's.
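For illustration, basic Kryo usage looks roughly like the sketch below (SomeMessage is a hypothetical POJO; the write/read calls are Kryo's standard object serialization API):

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    public class KryoExample {
        // Hypothetical message class; Kryo wants a no-arg constructor by default.
        static class SomeMessage {
            String text; int number;
            SomeMessage() {}
            SomeMessage(String text, int number) { this.text = text; this.number = number; }
        }

        public static void main(String[] args) {
            Kryo kryo = new Kryo();
            kryo.register(SomeMessage.class);   // registration keeps the stream compact

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Output output = new Output(bytes);
            kryo.writeObject(output, new SomeMessage("hello", 42));
            output.close();

            Input input = new Input(new ByteArrayInputStream(bytes.toByteArray()));
            SomeMessage back = kryo.readObject(input, SomeMessage.class);
            input.close();
            System.out.println(back.text + " " + back.number);
        }
    }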
Depends on use case. PB is much more tightly coupled, best used internally with closely-coupled systems; not good for shared/public interfaces (as in to be shared between more than 2 specific systems).
Hessian is a bit more self-descriptive and has nice performance on Java; better than PB in my tests, but I'm sure that depends on the use case. PB seems to have trouble with textual data; perhaps it has been optimized for integer data.
I don't think either is particularly good for public interfaces, but given you want binary format, that is probably not a big problem.
EDIT: Hessian performance is actually not all that good, per the jvm-serializers benchmark. And PB is pretty fast as long as you make sure to enable the option that forces use of the fast code path on Java (optimize_for = SPEED in the .proto file).
And if PB is not good for public interfaces, what is? IMO, open formats like JSON are superior externally, and more often than not fast enough that performance does not matter a lot.
For me, Caucho Hessian is the best.
It is very easy to get started with, and the performance is good. I tested it locally: the latency is about 3 ms; on a LAN you can expect about 10 ms.
With Hessian you don't have to write a separate file to define the model (we are using Java on both ends). It saves a lot of time in development and maintenance.
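As a rough sketch (Order is a hypothetical model class), serializing with Hessian needs nothing beyond a plain Serializable object:

    import com.caucho.hessian.io.Hessian2Input;
    import com.caucho.hessian.io.Hessian2Output;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.Serializable;

    public class HessianExample {
        // Hypothetical model class: no IDL or schema file, just a POJO.
        static class Order implements Serializable {
            String id; int quantity;
        }

        public static void main(String[] args) throws IOException {
            Order order = new Order();
            order.id = "A-1";
            order.quantity = 3;

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Hessian2Output out = new Hessian2Output(bytes);
            out.writeObject(order);      // the model class itself is the contract
            out.close();

            Hessian2Input in = new Hessian2Input(new ByteArrayInputStream(bytes.toByteArray()));
            Order back = (Order) in.readObject();
            System.out.println(back.id + " x" + back.quantity);
        }
    }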
If you need support for interconnecting apps across many languages/platforms, then Hessian is the best. If you use only Java, then Kryo is even faster.
I'm looking into this myself... no good conclusions so far, but I found http://dewpoint.snagdata.com/2008/10/21/google-protocol-buffers/ which summarizes all the options.
Muscle has a binary message transport. Sorry that I can't comment on the others as I haven't tried them.
I tried Google Protocol Buffers. It works with C++/MFC, C#, PHP and more languages (see: http://code.google.com/p/protobuf/wiki/ThirdPartyAddOns) and works really well regardless of transport and disk save/loading.
I would say that Protocol Buffers, Thrift, and Hessian are fairly similar as far as their binary formats are concerned, in that they all provide cross-language serialization support. The inherent serialization might show small performance differences between them (size/space tradeoffs), but this is not the most important thing. Protocol Buffers is certainly a well-performing IDL-defined format with features for extensibility which make it attractive.
HOWEVER, the use of "over-the-wire" in the question implies the use of a communications library. Here Google has provided an interface definition for protobuf RPC, which is equivalent to making a specification where all implementation details are left to the implementer. This is unfortunate because it means there is de facto NO cross-language implementation - unless you can find one among the projects mentioned at http://code.google.com/p/protobuf/wiki/ThirdPartyAddOns. I have seen RPC implementations which support Java and C, or C and C++, or Python and C, etc., but here you just have to find a library which satisfies your concrete requirements and evaluate it, otherwise you're likely to be disappointed. (At least I was disappointed enough to write protobuf-rpc-pro.)
Kryo is a serialization library like protobuf, but Java-only. KryoNet is a Java-only RPC implementation using Kryo messages. So it's not a good choice for cross-language communication.
Today it would seem that ICE (http://www.zeroc.com/) and Thrift, which provide an RPC implementation out of the box, are the best cross-language RPC implementations out there.
Where can I find performance metrics (memory/time) for a non-trivial example of using XSLT (with Xalan) compared to using STX (with Joost)?
There is probably no universal set of benchmarks. For XSLT there is (was?) XSLTMark, but that is for comparing XSLT engines against one another.
There is one page comparing the same transformation written in different transformation languages.
Probably the best option is to model your problem, generate test data and measure the things you are interested in.
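For instance, since Xalan and Joost both plug into the standard JAXP/TrAX API, a rough timing harness could look like the sketch below. The factory class names are the usual ones for Xalan and Joost but should be checked against the versions on your classpath, and the file names are placeholders (Joost expects an .stx template where Xalan expects an .xsl stylesheet):

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class TransformBench {
        public static void main(String[] args) throws Exception {
            // Select the engine via the standard JAXP factory property:
            // Xalan for XSLT, or swap in the Joost line (and an .stx file) for STX.
            System.setProperty("javax.xml.transform.TransformerFactory",
                    "org.apache.xalan.processor.TransformerFactoryImpl");
            // "net.sf.joost.trax.TransformerFactoryImpl" for Joost/STX

            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                    factory.newTransformer(new StreamSource(new File("transform.xsl")));

            long start = System.nanoTime();
            transformer.transform(new StreamSource(new File("input.xml")),
                                  new StreamResult(new File("output.xml")));
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Transformation took " + elapsedMs + " ms");
        }
    }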
I agree in that real answers are best obtained by writing your own benchmark.
For what it's worth, my recollection is that many developers had high hopes for STX to be much faster than XSLT processors; but found the actual performance of implementations to fall short on expectations. Part of the reason may be that XSLT processor implementations are ridiculously well optimized by now, and thus can handle simple transformations very efficiently, all things considered. As such, STX implementations would also need to spend time honing implementation to same degree, to produce significant speed improvements for common transformations.
You really should use your own benchmark to cover the things you use.
But here's one data point (http://www.kindle-maps.com/blog/some-performance-information-on-joost-stx.html): a 1.3 GB XML file (from OpenStreetMap data) containing roughly 1,800,000 nodes was processed with a simple STX template in 3 minutes on a low-end laptop.