Data structure for very large binary data - data-structures

I'm building a Genetic Algorithm and I was wondering what's a good data structure to use for encoding the chromosomes (essentially a long sequence of 0s and 1s).
I'm aiming for efficiency in changing bits at random within the chromosomes and performing crossovers between chromosomes. Essentially a lot of copying and changing of bits or sub sequences of bits.
So far I'm just stuck with a plain Boolean array but I feel like there should be a better data structure for efficiently handling very large amounts of binary data.
Any suggestions?

Switching to using int primitives to represent groups of binary values and using bitwise operations and masks to change the groups of binary values could gain you large speed increases depending on how you are manipulating data. You could randomly mutate blocks of genes at a time using randomly generated masks.
An array is hard to beat if you are scanning the whole thing or know the index ahead of time. However, copying sections of the array over other sections can be challenging, but its still reasonably efficient.
If you are more concerned with swapping fixed sized groups of genes, building a 2 level tree with n branches with the groups of genes on each leaf would allow you to swap groups of genes very fast. The groups might not need to be the same size either. If you the genes need to be broken up further into chromosomes, you can add an intermediate level to the tree.

Related

Gomoku Board representation

I'm working on a Gomoku game and I need an efficient data structure to store the boards state,
I've thought about storing it in a 2D array, but I'm sure that there is a more efficient way.
Thanks
In terms of time efficiency, since I believe you'll mainly be doing index lookups, an array would be pretty much the best choice - it supports this lookup in constant time, with a low constant factor.
In terms of space efficiency:
Each square can be either empty, or populated by either player. So there are a maximum of 3 possibilities. For maximum space efficiency, we could store our entire board in base-3 representation, but, since a computer works in binary, we'd need to process the entire board to determine the value of some given square (thus a simply index lookup will take time proportional to the size of the board - if time really isn't a problem, you could consider this). Instead, I'd recommend using 2 bits per square, which would allow us to indicate one of 4 possibilities (the 4th being unused).
Many languages have some sort of bitset implementation, allowing you to work with an array of bits, which would be perfect for the above.
You'd also just want a single bitset (not 2D) as there's usually a bit of memory overhead involved in working with 2D structures. The conversion from 2D to 1D is simple - we could convert the 2D index to 1D with either x*height + y or y*width + x.
Although I'd recommend first being sure that you need to perform this optimization - I believe Gomoku boards are typically small, so even a bulky representation would work perfectly (although some AI techniques make many copies of the board, so, if you're doing that, a minimal representation would make sense).

Data Structure for fast searching

If I have to develop an application for a data grid station of an institute. The purpose of application is to receive the data from data GRID station once in a week between 10 A.M to 10:30 A.M and then store it into a data structure and data is consist of digits only but the numbers could be very long for one entry then which data structure will be the best for given scenario from array, list, linked list, doubly linked list, queue, priority queue, stack, binary search tree, AVL trees, threaded binary tree, heap, sorted sequential array and skip list
I want to store sorted digits. The sorted data can be in ascending or descending order and the main concern is "fast and efficient searching".
From your description I gather that you don't store any other data with the digits or numbers. So basically you want to know if a number is in the set or not.
Fastest way to know this, is to have an array of flags for each number. Let's say you deal with numbers from 1 to 1000. You want to know if number 200 is in the set. Look at position 200 wether the flag is true or false. You see, this is the fastest method, because you only look up one place.
As we are talking about boolean flags here, a bit is sufficient for storage. You would decide wether to store the booleans in bits, bytes, words or whatever, depending on the number of numbers, the available memory and the machine's architecture.
Having said this, you may have to deal with so many numbers that above approach is no more feasible. It would be fastest in theory, but with limited memory, swaps to hard disk, many, many reads from it, other algorithms may prove better. You would have the choice between:
storing the numbers contiguously and perform a binary search on them
storing the numbers in a binary tree
using a hash algorithm
Which of these proves most efficient, again depends on your data and the machine.
It depends what type of searching you want to do. If you just want to know if a number is within your dataset, then a hash will be extremely fast and independent of the size of your dataset. And there is no need to sort, or even any concept of order.
If I may quote Larry Wall, author of Perl:
Doing linear scans over an associative array is like trying to club
someone to death with a loaded Uzi.
(An associative array is synonymous with a hash.)

Could I use a faster data structure than a tree for this?

I have a binary decision tree. It takes inputs as an array of floats, and each branch node splits on an input index and value eventually taking me to a leaf.
I'm performing a massive number of lookups on this tree (about 17% of execution time according to performance analysis (Edit: Having optimised other areas it's now at almost 40%)), and am wondering if I could/should be using a different data structure to improve lookup speed.
Some kind of hash table can't be used, as inputs do not map directly to a leaf node, but I was wondering is anyone had any suggesting as to methods and data-structures I could use in place of the tree (or as well as?) to improve lookup speeds.
Memory is a concern, but less of a concern than speed.
Code is currently written in C#, but obviously any method could be applied.
Edit:
There's a bit too much code to post, but I'll give more detail about the tree.
The tree is generated using information gain calculations, it's not always a 50/50 split, the split value could be any float value. A single input could also be split multiple times increasing the resolution on that input.
I posted a question about performance of the iterator here:
Micro optimisations iterating through a tree in C#
But I think I might need to look at the data structure itself to improve performance further.
I'm aiming for as much performance as possible here. I'm working on a new method of machine learning, and the tree grows itself using a feedback loop. For the process I'm working on, I estimate it'll be running for several months, so a few % saving here and there is massive. The ultimate goal is speed without using too much memory.
If I understand correctly, you have floating point ranges than have to be mapped to a decision. Something like this:
x <= 0.0 : Decision A
0.0 < x <= 0.5 : Decision B
0.5 < x <= 0.6 : Decision C
0.6 < x : Decision D
A binary tree is a pretty good way to handle that. As long as the tree is well balanced and the input values are evenly distributed across the ranges, you can expect O(log2 n) comparisons, where n is the number of possible decisions.
If the tree is not balanced, then you could be doing far more comparisons than necessary. In the worst case: O(n). So I would look at the trees and see how deep they are. If the same tree is used again and again, then the cost spent rebalancing once may be amortized over many lookups.
If the input values are not evenly distributed (and you know that ahead of time), then you might want to special-case the order of the comparisons so that the most common cases are detected early. You can do this by manipulating the tree or by adding special cases in the code before actually checking the tree.
If you've exhausted algorithmic improvements and you still need to optimize, you might look into a data structure with better locality than a general binary tree. For example, you could put the partition boundaries into a contiguous array and perform a binary search on it. (And, if the array isn't too long, you might even try a linear search on the array as it may be friendlier for the cache and the branch prediction.)
Lastly, I'd consider building a coarse index that gives us a headstart into the tree (or array). For example, use a few of the most significant bits of the input value as an index and see if that can cut off the first few layers of the tree. This may help more than you might imagine, as the skipped comparisons probably have a low chance of getting correct branch predictions.
Presuming decisions have a 50/50 chance:
Imagine that you had two binary decisions; possible paths are 00, 01, 10, 11
Imagine instead of tree you had an array with four outcomes; you could turn your array of floats into a binary number which would be index into this array.

Efficiently querying a B+ Tree holding multidimensional data

I have a collection of tuples (x,y) of 64-bit integers that make up my dataset. I have, say, trillions of these tuples; it is not feasible to keep the dataset in memory on any machine on earth. However, it is quite reasonable to store them on disk.
I have an on-disk store (a B+-tree) that allow for the quick, and concurrent, querying of data in a single dimension. However, some of my queries rely on both dimensions.
Query examples:
Find the tuple whose x is greater than or equal than some given value
Find the tuple whose x is as small as possible s.t. it's y is greater than or equal to some given value
Find the tuple whose x is as small as possible s.t. it's y is less than or equal to some given value
Perform maintenance operations (insert some tuple, remove some tuple)
The best bet I have found are Z-order curves but I cannot seem to figure out how to conduct the queries given my two dimensional data-set.
Solutions that are not acceptable include a sequential scan of the data, this could be far too slow.
I think, the most appropriate data structures for your requirements are R-tree and its variants (R*-tree, R+-tree, Hilbert R-tree). R-tree is similar to B+-tree, but also allows multidimensional queries.
Other relevant data structure is Priority Search Tree. It is good for queries like your examples 1 .. 3, but not very efficient if you need frequent updates or on-disk store. For details see this paper or this book: "Handbook of Data Structures and Applications" (Chapter 18.5).
Are you saying you don't know how to query z-order curves? The Wikipedia page describes how you do range searches.
A z-curve divides your space into nested rectangles, where each additional bit in the key divides the space in half. To search for a point:
Start with the largest rectangle that might contain your point.
Recursively:
Create a result set of rectangles
For each rectangle in your set
If the rectangle is a single point, you are done, it is what you are looking for.
Otherwise, divide the rectangle in two (specify one additional bit of the z-curve)
If both halves contain a point
If one half is better
Add that rectangle to your result set of rectangles
Otherwise
Add both rectangles to your result set of rectangles
Otherwise, only one half contains a point
Add that rectangle to your result set of rectangles
Search your result set of rectangles
Worst case performance is bad, of course. You can adjust it by changing how you construct your z-order index.
I'm currently working on designing a data structure which is essentially a 'stacked' B+ tree (or a d+ tree where d is the number of dimensions) for multidimensional data. I believe it would suit your data perfectly and is being designed specifically for your use case.
The basic idea is this:
Each dimension is a B+ tree and is linked to the next dimension's B+ tree. Search through the first dimension normally, once a leaf is reached it contains a pointer to the root of the next B+ tree which belongs to the next dimension. Everything in the second B+ tree belongs to the same x value.
The original plan was to only store the unique values for each dimension along with it's count. This employs a very simple compression algorithm (if you can even call it that) while still allowing for the entire data set to be represented. This 'linked' dimension scheme could allow for extra dimensions to be added later as they are simply added to the stack of B+ trees.
Total insert/search/delete time for 2 dimensions would be something similar to this:
log b(card(x)) + log b(card(y))
where b is the base of each B+ tree and card(x) would be the cardinality of the x dimension.
I hope that makes sense. I'm still working on an implementation, however feel free to use or even augment the idea.
http://fallabs.com/tokyocabinet/
Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.
Tokyo Cabinet is written in the C language, and provided as API of C, Perl, Ruby, Java, and Lua. Tokyo Cabinet is available on platforms which have API conforming to C99 and POSIX. Tokyo Cabinet is a free software licensed under the GNU Lesser General Public License.
may it easy for u to embed?

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to that solved by reservoir sampling, but not the same. I think its also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter, and count them at the end, this would yield an exact result, but would require me to carry a potentially large amount of state with me as I scan through the dataset (ie. all unique elements encountered so far).
Unfortunately in my situation this would require more RAM than is available to me (nothing that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (ie. you can put them in a HashSet if you want to). Typically they will be strings, or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
Nobody has mentioned approximate algorithm designed specifically for this problem, Hyperloglog.

Resources