Computing percentiles

I'm writing a program that's going to generate a bunch of data. I'd like to find various percentiles over that data.
The obvious way to do this is to store the data in some kind of sorted container. Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
The alternative is to use an unordered container and perform sorting at the end. I don't know if that's going to be any faster. Either way, we're still left with needing a container which offers fast random access. (An array, perhaps...)
Suggestions?
(Another alternative is to build a histogram, rather than keep the entire data set in memory. But since the objective is to compute percentiles extremely accurately, I'm reluctant to go down that route. I also don't know the range of my data until I generate it...)

Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
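Yes, it's your good old Data.Map. See elemAt and the other functions under the «Indexed» category; elemAt fetches the element at a given position in key order in O(log n).
Data.Set did not originally offer these (recent versions of the containers package do include Data.Set.elemAt), but you can always emulate it with Data.Map YourType ().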
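
To make the sort-at-the-end alternative from the question concrete, here is a minimal sketch. It is written in Python rather than Haskell purely for illustration, and it assumes the nearest-rank definition of a percentile (other definitions interpolate between neighbouring values). The name `percentile` is hypothetical.

```python
import math

def percentile(sorted_data, p):
    """Nearest-rank percentile of an already sorted, non-empty sequence.

    Returns the smallest element such that at least p percent of the
    data is less than or equal to it (0 < p <= 100).
    """
    if not sorted_data:
        raise ValueError("percentile of empty data is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    # 1-based rank; multiply before dividing to limit rounding error.
    rank = math.ceil(p * len(sorted_data) / 100)
    return sorted_data[rank - 1]

# Collect in any order, then sort once at the end: O(n log n) in total,
# after which every percentile query is a single O(1) index.
data = [40, 15, 50, 20, 35]
data.sort()

print(percentile(data, 30))   # 20
print(percentile(data, 50))   # 35
print(percentile(data, 100))  # 50
```

Because the result is always an element of the data set, this stays exact no matter what the range of the data turns out to be.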

Related

Checking a given hash is in a very, very long list

I have a list of hashes. Long list. Very long list. I need to check whether a given hash is in that list.
The easiest way is to store the hashes in memory (in a map or a simple array) and check against that. But it will require lots of RAM/SSD/HDD memory. More than the server(s) can handle.
I'm wondering whether there is a trick to do that with reasonable memory usage. Maybe there's an algorithm I'm not familiar with or a special collection?

Three thoughts:
Depending on the structure of these hashes, you may be able to borrow some ideas from the concept of a Rainbow Table to implicitly store some of them.
You could use a trie to compress storage for shared prefixes if you have enough hashes; however, given their length and (likely) uniformity, you won't see terrific savings.
You could split the hash into multiple smaller hashes and use these to implement a Bloom Filter. However, this is a probabilistic test: it can report false positives but never false negatives, so you'll still need the hashes stored somewhere else (or able to be calculated / derived) to confirm a perceived "hit". It may nonetheless let you filter out enough "misses" that a less performant (speed-wise) data structure becomes feasible.
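
The third idea can be sketched roughly as follows. This is a minimal Python illustration, not a tuned implementation: it assumes the stored items are already uniformly distributed digests (SHA-256 here, chosen only for the example), and the class name `BloomFilter` and its parameters are hypothetical.

```python
import hashlib

class BloomFilter:
    """Bloom filter that derives its bit positions by slicing a digest."""

    def __init__(self, num_bits, num_slices):
        self.num_bits = num_bits
        self.num_slices = num_slices          # number of 4-byte slices used
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, digest):
        if len(digest) < 4 * self.num_slices:
            raise ValueError("digest too short for the requested slices")
        # Each 4-byte slice of the digest acts as one smaller hash.
        for i in range(self.num_slices):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, digest):
        for pos in self._positions(digest):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, digest):
        # False means "definitely absent"; True means "possibly present",
        # so a True result still has to be confirmed against the real store.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(digest))

bloom = BloomFilter(num_bits=1 << 20, num_slices=4)
known = hashlib.sha256(b"some stored item").digest()
bloom.add(known)
print(bloom.might_contain(known))  # True
```

The filter here occupies 128 KiB regardless of how long the digests are; the slower authoritative lookup only runs when `might_contain` returns True.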

What is the utility of treap data structure?

I am currently studying advanced data structures and I came across a weird data structure called a treap. I understand what a treap is, but I can't seem to find its utility in a valid use case.
Why should you use such a data structure, and for what types of problems/conditions are treaps best suited?
I find myself much more inclined to use hash maps, min/max heaps, binary search trees or balanced binary search trees, and I can't tell why you would use a treap.

Treaps are easier to implement than most other balanced binary search trees and, more importantly, that makes them easier to modify and maintain in the future if you want to make slight variations on them or change them in some way. They also allow for efficient parallel versions of the set operations Union/Intersect/Difference, which is extremely valuable. Using them simultaneously as a heap and a binary search tree isn't really very handy unless the values you use for priorities are coincidentally really nicely randomly distributed/permuted. I suppose there might be a case where that would be handy, but it seems really unlikely. Data so randomly distributed is usually more like a hash key, which typically isn't useful as ordered data. How often do you want to pull people out in order of their SSNs? I guess it's possible but unlikely.
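
To give a feel for the "easier to implement" claim, here is a minimal Python sketch of treap insertion, assuming random priorities and a max-heap ordering on them. All names are hypothetical, and deletion and the set operations are left out.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()  # random priority keeps the tree balanced in expectation
        self.left = None
        self.right = None

def rotate_right(node):
    left = node.left
    node.left = left.right
    left.right = node
    return left

def rotate_left(node):
    right = node.right
    node.right = right.left
    right.left = node
    return right

def insert(node, key):
    """Insert key as in a plain BST, then rotate to restore heap order."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
        if node.left.priority > node.priority:
            node = rotate_right(node)
    elif key > node.key:
        node.right = insert(node.right, key)
        if node.right.priority > node.priority:
            node = rotate_left(node)
    return node  # duplicate keys are ignored

def in_order(node):
    if node is None:
        return []
    return in_order(node.left) + [node.key] + in_order(node.right)

def heap_ordered(node):
    """True if every parent has a priority >= that of its children."""
    if node is None:
        return True
    for child in (node.left, node.right):
        if child is not None and child.priority > node.priority:
            return False
    return heap_ordered(node.left) and heap_ordered(node.right)

root = None
for key in [5, 1, 9, 3, 7]:
    root = insert(root, key)
print(in_order(root))  # [1, 3, 5, 7, 9]
```

The whole balancing logic is the two priority comparisons in `insert`; there are no colour or height invariants to maintain.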

Iterable O(1) insert and random delete collection

I am looking to implement my own collection class. The characteristics I want are:
Iterable - order is not important
Insertion - either at end or at iterator location, it does not matter
Random Deletion - this is the tricky one. I want to be able to have a reference to a piece of data which is guaranteed to be within the list, and remove it from the list in O(1) time.
I plan on the container only holding custom classes, so I was thinking a doubly linked list that required the components to implement a simple interface (or abstract class).
Here is where I am getting stuck. I am wondering whether it would be better practice to simply have the items in the list hold a reference to their node, or to build the node right into them. I feel like both would be fairly simple, but I am worried about coupling these nodes into a bunch of classes.
I am wondering if anyone has an idea as to how to minimize the coupling, or possibly knows of another data structure that has the characteristics I want.

It'd be hard to beat a hash map.

Take a look at tries.
Apparently they can beat hashtables:
Unlike most other algorithms, tries have the peculiar feature that the time to insert, or to delete or to find is almost identical because the code paths followed for each are almost identical. As a result, for situations where code is inserting, deleting and finding in equal measure tries can handily beat binary search trees or even hash tables, as well as being better for the CPU's instruction and branch caches.
It may or may not fit your usage, but if it does, it's likely one of the best options possible.

In C++, this sounds like the perfect fit for std::unordered_set (that's std::tr1::unordered_set or boost::unordered_set to you if you have an older compiler). It's implemented as a hash set, which has the characteristics you describe.
Here's the interface documentation. Note that the hash containers actually offer two sets of iterators, the usual ones and local ones which only go through one bucket.
Many other languages have "hash sets" as well, certainly Java and C#.
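
As a rough illustration of the hash-set approach in a language other than C++, Python's built-in `set` behaves the same way. This is a small sketch; the `Entity` class is hypothetical and stands in for the custom classes from the question.

```python
class Entity:
    """Stand-in for a custom class; instances hash by identity by default."""
    def __init__(self, name):
        self.name = name

items = set()
a, b, c = Entity("a"), Entity("b"), Entity("c")
for entity in (a, b, c):
    items.add(entity)          # average O(1) insertion

items.remove(b)                # average O(1) removal given only the reference

# Iteration works, in no particular order.
names = sorted(entity.name for entity in items)
print(names)  # ['a', 'c']
```

The stored classes need no node field or special interface, which sidesteps the coupling concern; the trade-off is that the O(1) bounds are average-case rather than worst-case.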

External store for complex collections that can be accessed by Key-Value

Problem
I need a key-value store that can store values of the following form:
DS<DS<E>>
where the data structure DS can be
either a List, SortedSet or an Array
and E can be either a String or byte-array.
It is very expensive to generate this data and so once I put it into the store, I will only perform read queries on it. Essentially it is a complex object cache with no eviction.
Example Application
A (possibly bad, but sufficient to clarify) example of an application is storing tokenized sentences from a document where you need to be able to quickly access the qth word of the pth sentence given documentID. In this case, I would be storing it as a K-V pair as follows:
K - docID
V - List<List<String>>
String word = map.get(docID).get(p).get(q);
I prefer to avoid app-integrated Map solutions (such as EhCache within Java).
I have worked with Redis but it doesn't appear to support the second layer of data-structure complexity. Any other K-V solutions that can help my use case?
Update:
I know that I could serialize/deserialize my object but I was wondering if there is any other solution.

In terms of platform choice you have two options. A full document database will support arbitrarily complex objects, but won't have built-in commands for working with specific data structures. Something like Redis, which does have optimised code for specific data structures, can't support all possible data structures.
You can actually get pretty close with Redis by using ids instead of the nested data structure. DS1<DS2<E>> becomes DS1<int> and DS2<E>, with the int from DS1 and a prefix giving you the key holding DS2.
With this structure you can access any E with only two operations. In some cases you will be able to get that down to a single operation by knowing what the id of DS2 will be for a given query.
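
The id indirection can be sketched as below. To stay self-contained, a plain Python dict stands in for the Redis keyspace, and the key prefixes (`doc:`, `sentence:`) and function names are hypothetical. The sketch also uses predictable string ids rather than integers, so that the single-operation shortcut can be shown too.

```python
# A plain dict stands in for the key-value store; each value is a flat list,
# which is the kind of structure Redis supports natively.
store = {}

def put_document(doc_id, sentences):
    """Store a List<List<String>> as one outer list of ids plus one list per id."""
    sentence_ids = []
    for p, words in enumerate(sentences):
        sentence_id = f"{doc_id}:{p}"                 # predictable id (assumption)
        store[f"sentence:{sentence_id}"] = list(words)
        sentence_ids.append(sentence_id)
    store[f"doc:{doc_id}"] = sentence_ids

def get_word(doc_id, p, q):
    """Two lookups: outer list gives the id, the id gives the inner list."""
    sentence_id = store[f"doc:{doc_id}"][p]        # comparable to a Redis LINDEX
    return store[f"sentence:{sentence_id}"][q]     # a second LINDEX

def get_word_direct(doc_id, p, q):
    """One lookup, possible only because the inner id is known in advance."""
    return store[f"sentence:{doc_id}:{p}"][q]

put_document("d1", [["the", "cat", "sat"], ["it", "purred"]])
print(get_word("d1", 1, 1))         # purred
print(get_word_direct("d1", 0, 2))  # sat
```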

I hesitate to "recommend" it, but one of the only storage engines I know of which handles multi-dimensional data of this sort efficiently is Intersystems Cache. I had to use it at my last job, mostly coding against it using its built-in MUMPS-based language. I would not recommend the native approach, unless you hate yourself or your developers. However, they do have decent Java adapters, and Java appears to be what you're using. I've seen it handle billions of records, efficiently stored in nested binary tree tables. There is no practical limit to the depth (number of dimensions) you can use. However, this is very much a proprietary solution. There is an open-source alternative called GT.M, but I don't know how compatible it is with languages that aren't M or C.

Any Key-Value store supports complex values; you just need to serialize/deserialize the data.
If you want fast retrieval only for specific parts of the data, you could use a more complex Key. In your example this would be:
K - tuple(docID, p, q)
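
A minimal sketch of the composite-key idea, again with a Python dict standing in for the external store; the function name `flatten` is hypothetical.

```python
def flatten(doc_id, sentences):
    """Turn a List<List<String>> into one entry per word, keyed by (docID, p, q)."""
    return {
        (doc_id, p, q): word
        for p, words in enumerate(sentences)
        for q, word in enumerate(words)
    }

store = {}
store.update(flatten("d1", [["the", "cat", "sat"], ["it", "purred"]]))

# A single lookup retrieves the q-th word of the p-th sentence.
print(store[("d1", 1, 0)])  # it
```

In a real store the tuple would have to be encoded as a string or byte key. The price of the single lookup is that fetching a whole sentence now takes one read per word, unless the store supports range scans over keys.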

Efficient storage of external index of strings

Say you have a large collection of n objects on disk, each with a variable-sized string. What are the common practices for efficiently indexing those objects using plain string comparison? Storing the whole strings in the index would be prohibitive in the long run due to size and I/O, but since disks have high latency, storing only references isn't a good idea either.
I've been thinking of using a B-Tree-like design with tries but can't find any database implementation using this approach. In fact, it's hard to find out how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information).
TIA!
EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."

A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.
A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').
A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.
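
The shortening step of a simple prefix B-tree, as described in the 'astronomy'/'azimuth' example, can be sketched like this. It is only the prefix computation, not a B-tree, and the function name is hypothetical.

```python
def distinguishing_prefixes(a, b):
    """Return the shortest prefixes of a and b that tell the two keys apart."""
    if a == b:
        raise ValueError("keys must differ")
    i = 0
    # Skip the characters the two keys share.
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    # Keep the shared part plus the first differing character.
    return a[:i + 1], b[:i + 1]

print(distinguishing_prefixes("astronomy", "azimuth"))  # ('as', 'az')
```

Index pages then hold these short prefixes instead of full strings, so more entries fit per page and fewer disk reads are needed per lookup.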
There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as in-memory hashing).
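
For a rough idea of how extendible hashing works, here is a small in-memory Python sketch. It makes several assumptions: buckets are dicts rather than disk pages, the directory is indexed by the low-order bits of a SHA-1 based pseudo-key, and deletion is omitted. All names are hypothetical.

```python
import hashlib

def pseudo_key(key):
    """Hash a string to a fixed-length (64-bit) integer pseudo-key."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of low-order bits this bucket distinguishes
        self.items = {}

class ExtendibleHash:
    def __init__(self, bucket_size=4):
        self.bucket_size = bucket_size
        self.global_depth = 0
        self.directory = [Bucket(0)]   # always 2 ** global_depth slots

    def _slot(self, key):
        return pseudo_key(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._slot(key)].items.get(key, default)

    def put(self, key, value):
        # Split until the target bucket has room. Keys with identical
        # pseudo-keys would split forever; real systems use overflow pages.
        while True:
            bucket = self.directory[self._slot(key)]
            if key in bucket.items or len(bucket.items) < self.bucket_size:
                bucket.items[key] = value
                return
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # The directory doubles; both halves still share all buckets.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        new_bit = 1 << (bucket.depth - 1)
        # Slots whose newly significant bit is set now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = sibling
        # Redistribute only the entries of the bucket that overflowed.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._slot(k)].items[k] = v

index = ExtendibleHash(bucket_size=4)
for n in range(100):
    index.put(f"object-{n}", n)
print(index.get("object-42"))  # 42
```

A split touches only one bucket, and on disk each lookup costs a directory access plus one bucket read. The price is that keys are no longer stored in sorted order.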
As for real use, major DBMS and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more-specialized products (e.g., Lotus Domino Server).

What are you doing with the objects?
If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. This would be much simpler than implementing a B-tree from scratch and risking bugs in it.
DBMSs also have caching and various other features that might make your life easier.

Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.
If you really want to sort them, Knuth's "The Art of Computer Programming" volume three covers sorting and searching in about as much detail as you're likely to want.
