How can a compact trie be beneficial? - memory-management

I just learned that a compact trie can save memory compared to a regular trie by storing 3 integers as a reference into a string object, instead of actually storing a string in each node. However, I'm still confused about how it can actually save memory using that method.
If a node of a compact trie stores 3 integers and a reference to a String object, doesn't that mean no memory is saved at all if that String object is also stored in memory?
If that is the case, is the compact trie only beneficial when we store the String object on disk?

The compact storage of a compressed trie can be more space efficient if you're already storing the strings for some other purpose.
The compact and non-compact versions will have similar memory usage if you're also counting the storing of the strings. The compact version might even be worse, depending on how many bytes your integers and pointers require.
For a non-compact compressed trie:
Each node would have a string which requires a pointer (let's say 4 bytes) and a length (2 bytes). This gives us 6 bytes.
On top of this we'd need to store the actual string (even if we're already storing these elsewhere).
For a compact compressed trie:
Each node would have 3 integers (2 bytes each). This gives us 6 bytes.
If we're counting storing the strings as well, we'd have the actual strings (which is the same as the above), in addition to the overhead (pointer and length) of those strings.
Given these numbers (if we're counting storing the strings), this version would be worse.
For a non-compressed trie (that is, where we only store one character per node):
Each node directly stores its one character.
There would be no string overhead.
However, if you have long chains, you could have many more nodes with this representation, so it could end up being less time and space efficient (especially given the time cost of jumping between many locations in memory instead of reading a sequential block of memory).
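To make the three layouts concrete, here is a minimal Go sketch of the node types. The compact node's fields (a string index plus a start/end range) are one common encoding of the "3 integers", chosen here for illustration, and real Go pointers and string headers are larger than the 2- and 4-byte figures above.

package trie

// Non-compact compressed trie: each node owns its edge label.
type labelNode struct {
    label    string // string header (pointer + length) plus the label bytes themselves
    children []*labelNode
}

// Compact compressed trie: each node stores 3 integers referring into strings
// that are already stored elsewhere (one common encoding of the "3 integers").
type compactNode struct {
    strIndex   int // which stored string the edge label comes from
    start, end int // half-open range [start, end) within that string
    children   []*compactNode
}

// Non-compressed trie: one character per node, no string overhead.
type charNode struct {
    ch       byte
    children []*charNode
}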

Using a slice instead of list when working with large data volumes in Go

I have a question on the utility of slices in Go. I have just seen Why are lists used infrequently in Go? and Why use arrays instead of slices? but had some questions which I did not see answered there.
In my application:
I read a CSV file containing approx 10 million records, with 23 columns per record.
For each record, I create a struct and put it into a linked list.
Once all records have been read, the rest of the application logic works with this linked list (the processing logic itself is not relevant for this question).
The reason I prefer a list and not a slice is the large amount of contiguous memory an array/slice would need. Also, since I don't know the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Every Go tutorial or article I read seems to suggest that I should use slices and not lists (as a slice can do everything a list can, but do it better somehow). However, I don't see how or why a slice would be more helpful for what I need. Any ideas?
... approx 10 million records, with 23 columns per record ... The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need.
This contiguous memory is its own benefit as well as its own drawback. Let's consider both parts.
(Note that it is also possible to use a hybrid approach: a list of chunks. This seems unlikely to be very worthwhile here though.)
Also, since I don't know the size of the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Clearly, if there are n records, and you allocate and fill in each one once (using a list), this is O(n).
If you use a slice and allocate a single extra slice entry every time, you start with none, grow it to size 1, then copy that one element to a new array of size 2 and fill in item #2, grow it to size 3 and fill in item #3, and so on. The first of the n entries is copied about n times, the second about n-1 times, and so on, for roughly n(n+1)/2 = O(n²) element copies. But if you use a multiplicative expansion technique (which Go's append implementation does), this drops to O(log n) reallocations. Each reallocation copies more bytes, but the total work ends up being O(n), amortized (see Why do dynamic arrays have to geometrically increase their capacity to gain O(1) amortized push_back time complexity?).
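To see that multiplicative expansion in action, here is a small Go program that reports every time append has to grow the backing array; the exact capacities it prints are a runtime implementation detail and vary between Go versions.

package main

import "fmt"

func main() {
    // Watch append grow the backing array multiplicatively rather than
    // one element at a time.
    var s []int
    prevCap := cap(s)
    for i := 0; i < 1_000_000; i++ {
        s = append(s, i)
        if cap(s) != prevCap {
            fmt.Printf("len=%d: cap grew %d -> %d\n", len(s), prevCap, cap(s))
            prevCap = cap(s)
        }
    }
}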
The space used with the slice is obviously O(n). The space used for the linked list approach is O(n) as well (though the records now require at least one forward pointer so you need some extra space per record).
So in terms of the time needed to construct the data, and the space needed to hold the data, it's O(n) either way. You end up with the same total memory requirement. The main difference, at first glance anyway, is that the linked-list approach doesn't require contiguous memory.
So: What do we lose when using contiguous memory, and what do we gain?
What we lose
The thing we lose is obvious. If we already have fragmented memory regions, we might not be able to get a contiguous block of the right size. That is, given:
used: 1 MB (starting at base, ending at base+1M)
free: 1 MB (starting at +1M, ending at +2M)
used: 1 MB (etc)
free: 1 MB
used: 1 MB
free: 1 MB
we have a total of 6 MB, 3 used and 3 free. We can allocate three 1 MB blocks, but we can't allocate one 3 MB block unless we can somehow compact the three "used" regions.
Since Go programs tend to run in virtual memory on large-memory-space machines (virtual sizes of 64 GB or more), this tends not to be a big problem. Of course everyone's situation differs, so if you really are VM-constrained, that's a real concern. (Other languages have compacting GC to deal with this, and a future Go implementation could at least in theory use a compacting GC.)
What we gain
The first gain is also obvious: we don't need pointers in each record. This saves some space; the exact amount depends on the size of the pointers, whether we're using singly linked lists, and so on. Let's just assume two 8-byte pointers, or 16 bytes per record. Multiply by 10 million records and we're looking pretty good here: we've saved 160 MB. (Go's container/list implementation uses a doubly linked list, and on a 64-bit machine, this is the size of the per-element threading needed.)
We gain something less obvious at first, though, and it's huge. Because Go is a garbage-collected language, every pointer is something the GC must examine at various times. The slice approach has zero extra pointers per record; the linked-list approach has two. That means that the GC system can avoid examining the nonexistent 20 million pointers (in the 10 million records).
Conclusion
There are times to use container/list. If your algorithm really calls for a list and is significantly clearer that way, do it that way, unless and until it proves to be a problem in practice. Or, if you have items that can be on some collection of lists—items that are actually shared, but some of them are on the X list and some are on the Y list and some are on both—this calls for a list-style container. But if there's an easy way to express something as either a list or a slice, go for the slice version first. Because slices are built into Go, you also get the type safety / clarity mentioned in the first link (Why are lists used infrequently in Go?).
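For what it's worth, the slice version of the loading loop is short. The file name and the record type below are placeholders (the real struct would have 23 typed fields rather than a raw []string); this is just a sketch of the append-based approach.

package main

import (
    "encoding/csv"
    "io"
    "log"
    "os"
)

// record is a placeholder for the real 23-column struct.
type record struct {
    fields []string
}

func main() {
    f, err := os.Open("data.csv") // hypothetical input file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    var records []record // append grows the backing array geometrically
    for {
        row, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        records = append(records, record{fields: row})
    }
    log.Printf("loaded %d records", len(records))
}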

index and tag switching position

Let us consider the alternative {Index, Tag, Offset}. The usage and the size of each field remain the same; e.g. the index is used to locate a block in the cache, and its bit-length is still determined by the number of cache blocks. The only difference is that we now use the MSB bits for the index, the middle portion for the tag, and the last portion for the offset.
What do you think is the shortcoming of this scheme?
This will work, and if the cache is fully associative it won't matter (there is no index; it's all tag), but if the associativity is limited it will make (far) less effective use of the cache memory. Why?
Consider an object that is sufficiently large to cross a cache block boundary.
When accessing the object, some of its fields will fall in a different cache block than others. How will the cache behave?
When the index is in the middle, then the cache block/line index will change, allowing the cache to store different nearby entities even with limited associativity.
When the index is at the beginning (most significant bits), the tag will have changed between these two addresses, but the index will be the same; thus, there will be a collision at that index, which uses up one of the ways of the set associativity. If the cache were direct mapped (i.e. 1-way set associative), it could thrash badly on repeated access to the same object.
Let's pretend that we have 12-bit address space, and the index, tag, and offset are each 4 bits.
Let's consider an object of four 32-bit integer fields, and that the object is at location 0x248 so that two integer fields, a, b, are at 0x248 and 0x24c and two other integer fields, c, d, are at 0x250 and 0x254.
Consider what happens when we access either a or b followed by c or d followed by a or b again.
If the tag is the high-order hex digit, then the cache index (in the middle) goes from 4 to 5, meaning that even in a direct-mapped cache both the a&b fields and the c&d fields can be in the cache at the same time.
For the same access pattern, if the tag is the middle hex digit and the index the high hex digit, then the cache index doesn't change — it stays at 2.  Thus, on a 1-way set associative cache, accessing fields a or b followed by c or d will evict the a&b fields, which will result in a miss if/when a or b are accessed later.
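A tiny Go program makes the two decompositions explicit for those two addresses (4-bit fields of a 12-bit address, as in the example):

package main

import "fmt"

func main() {
    for _, addr := range []uint16{0x248, 0x250} {
        high := (addr >> 8) & 0xF
        mid := (addr >> 4) & 0xF
        low := addr & 0xF
        // {Tag, Index, Offset}: the index is the middle digit (mid).
        // {Index, Tag, Offset}: the index is the high digit (high).
        fmt.Printf("addr %#x: middle-digit index=%X, high-digit index=%X, offset=%X\n",
            addr, mid, high, low)
    }
}

The middle-digit index changes from 4 to 5, while the high-digit index stays at 2 for both addresses, which is exactly the collision described above.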
So, it really depends on access patterns, but one thing that makes a cache really effective is when the program accesses either something it accessed before or something in the same block as it accessed before.  This happens as we manipulate individual objects, and as we allocate objects that end up being adjacent, and as we repeat accesses to an array (e.g. 2nd loop over an array).
If the index is in the middle, we get more variation as we use different addresses within some block or chunk of memory: in our 12-bit address space example, the index changes every 16 bytes, so adjacent 16-byte blocks can be stored in the cache at the same time.
But if the index is at the beginning, we need to consume more memory before we get to a different index: the index changes only every 256 bytes, so two adjacent 16-byte blocks will often collide.
Our programs and compilers are generally written assuming locality is favored by the cache — and this means that the index should be in the middle and the tag in the high position.
Both tag/index position options offer good locality for addresses in the same block, but one favors adjacent addresses in different blocks more than the other.

Is there an efficient way to store a lookup structure that uses random integer keys?

I need to implement a lookup structure with the following requirements:
Keys are random 128-bit integers
Values are 64-bit
It will be stored on disk
It must be searchable without the entire structure being resident in memory (I intend to memory map the file)
It must be mutable, but writes to disk must be incremental (must not require overwriting the entire structure)
Is there an efficient way to achieve all of this?
Please do not answer, "Don't use UUIDs." I am asking a specific question; changing the requirements changes the question.
Since your keys and values each are a fixed number of bytes, you could implement a hashtable as a file. The first few bytes contain the current number of elements and the current capacity, and then the entries each take up 16 + 8 bytes (if 0 is forbidden as a key) or 1 + 16 + 8 bytes if you need a flag to indicate whether an entry exists or not.
You can hash the key, then use arithmetic to seek to the correct position in the file, then read or write just the entries you need to. To resolve hash collisions, linear probing is probably best, since it minimizes the number of seeks. Since the keys are random, catastrophic collision pileups shouldn't happen, and the hash can simply take the lowest k bits of the key, where the current capacity is 2^k.
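Here is a minimal Go sketch of that slot arithmetic; the 16-byte header and the 1-byte occupancy flag are assumptions chosen to match the byte counts above, not requirements.

package lookup

const (
    headerSize = 16         // element count + capacity, 8 bytes each
    entrySize  = 1 + 16 + 8 // occupancy flag + 128-bit key + 64-bit value
)

// slotOffset returns the byte offset of the slot a key hashes to, taking the
// low k bits of the key as the hash (capacity must be a power of two, 2^k).
func slotOffset(keyLow64 uint64, capacity uint64) int64 {
    slot := keyLow64 & (capacity - 1)
    return int64(headerSize) + int64(slot)*int64(entrySize)
}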
This takes O(n) space, and allows lookups in O(1) average time, and writes in O(1) amortized time. Occasionally, you have to resize the hashtable to increase the capacity on a write; this takes O(n) time on those occasions.
If you need O(1) writes in the worst-case, you could maintain both the old and new hashtables, do lookups in both, and then on each write operation, copy across two entries from the old to the new. If the capacity is always increased by a factor of 2, then this gives non-amortized constant time writes, except for the cost of allocating an empty hashtable of size O(n). If creating an empty file of a particular size is also too slow for a single write operation, then you can amortize empty-file-creation across many writes too.
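A rough sketch of that incremental migration, with in-memory maps standing in for the two on-disk tables (the copy-two-entries-per-write policy matches the doubling argument above):

package lookup

// migratingTable keeps an old and a new table and moves a couple of old
// entries on every write, so no single write has to move everything.
type migratingTable struct {
    oldTab map[[16]byte]uint64
    newTab map[[16]byte]uint64
}

func (t *migratingTable) put(k [16]byte, v uint64) {
    t.newTab[k] = v
    delete(t.oldTab, k)
    moved := 0
    for mk, mv := range t.oldTab { // deleting during range is allowed in Go
        t.newTab[mk] = mv
        delete(t.oldTab, mk)
        moved++
        if moved == 2 {
            break
        }
    }
}

func (t *migratingTable) get(k [16]byte) (uint64, bool) {
    if v, ok := t.newTab[k]; ok {
        return v, true
    }
    v, ok := t.oldTab[k]
    return v, ok
}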

Why can two lines that differ in their address by precisely 65,536 bytes not be stored in the cache at the same time?

I am reading Andrew Tanenbaum, Structured Computer Organization (6th edition, 2012), and I don't understand this passage:
"This mapping scheme puts consecutive memory lines in consecutive cache entries.In fact, up to 64 KB of contiguous data can be stored in the cache.However,two lines that differ in their address by precisely 65,536 bytes or any integral multiple of that number cannot be stored in the cache at the same time (because they have the same Line value).For example, if a program accesses data at location X and next executes an instruction that needs data at location X + 65,536 (or anyother location within the same line), the second instruction will force the cache entry to be reloaded, overwriting what was there.If this happens often enough, itcan result in poor behavior.In fact, the worst-case behavior of a cache is worsethan if there were no cache at all, since each memory operation involves reading in an entire cache line instead of just one word."
Why do they have the same Line value?
This is because of two concepts in cache design. The first is associativity: for every possible input cache-line address (64-byte aligned on a modern x86-64 system) there are only N possible slots in the cache where it may be stored.
The second is a problem much like the one encountered with the hash function used within a hashmap: some scheme has to be used to convert input addresses to slots in the cache. Notice that the book says the cache can hold 64 (binary) kilobytes. 64 KB is 65,536 bytes, and the magical cache-ruining distance in question is ALSO 65,536! So in this case the address-to-cache-slot function is a simple AND (mask) operation, and it appears the author is talking about a 1-way set-associative (direct-mapped) cache, that is, one where each line may only be stored in ONE location inside the cache, leading to the mentioned conflict.
Why would microprocessor designers choose a simple AND function? Well... Because it's simple, mainly. Instead of wasting transistors on more complex logic, a basic operation like AND will suffice.
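To see the "same Line value" directly, you can compute the line index both addresses map to. The 32-byte line size below is an assumption for illustration; the conclusion holds for any line size that divides 64 KB.

package main

import "fmt"

func main() {
    const lineSize = 32               // assumed line size, for illustration only
    const numLines = 65536 / lineSize // 64 KB of cache, direct mapped

    // With numLines a power of two, "% numLines" is the simple AND mask
    // the answer refers to.
    lineOf := func(addr uint32) uint32 { return (addr / lineSize) % numLines }

    x := uint32(0x12345)
    fmt.Println(lineOf(x), lineOf(x+65536)) // same line value: they evict each other
}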

Redis 10x more memory usage than data

I am trying to store a wordlist in redis. The performance is great.
My approach is to make a set called "words" and add each new word via 'sadd'.
When adding a file that's 15.9 MB and contains about a million words, the redis-server process consumes 160 MB of RAM. Why am I using 10x the memory, and is there a better way of approaching this problem?
Well, this is expected of any efficient data store: the words have to be indexed in memory in a dynamic data structure of cells linked by pointers. The size of the structure metadata, the pointers, and the memory allocator's internal fragmentation is the reason why the data takes much more memory than a corresponding flat file.
A Redis set is implemented as a hash table. This includes:
an array of pointers growing geometrically (powers of two)
a second array may be required when incremental rehashing is active
singly linked list cells representing the entries in the hash table (3 pointers, 24 bytes per entry)
Redis object wrappers (one per value) (16 bytes per entry)
actual data themselves (each of them prefixed by 8 bytes for size and capacity)
All the above sizes are given for the 64-bit implementation. Accounting for the memory allocator overhead, it results in Redis taking at least 64 bytes per set item (on top of the data) for a recent version of Redis using the jemalloc allocator (>= 2.4).
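As a rough back-of-the-envelope check (an estimate, not exact accounting): at least 64 bytes of overhead times about a million set members is already around 64 MB, on top of roughly 16 MB of word data plus the 8-byte prefixes, which accounts for well over half of the observed 160 MB before hash-table load factor and fragmentation are even considered.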
Redis provides memory optimizations for some data types, but they do not cover sets of strings. If you really need to optimize memory consumption of sets, there are tricks you can use though. I would not do this for just 160 MB of RAM, but should you have larger data, here is what you can do.
If you do not need the union, intersection, difference capabilities of sets, then you may store your words in hash objects. The benefit is hash objects can be optimized automatically by Redis using zipmap if they are small enough. The zipmap mechanism has been replaced by ziplist in Redis >= 2.6, but the idea is the same: using a serialized data structure which can fit in the CPU caches to get both performance and a compact memory footprint.
To guarantee the hash objects are small enough, the data could be distributed according to some hashing mechanism. Assuming you need to store 1M items, adding a word could be implemented in the following way:
hash it modulo 10000 (done on client side)
HMSET words:[hashnum] [word] 1
Instead of storing:
words => set{ hi, hello, greetings, howdy, bonjour, salut, ... }
you can store:
words:H1 => map{ hi:1, greetings:1, bonjour:1, ... }
words:H2 => map{ hello:1, howdy:1, salut:1, ... }
...
To retrieve or check the existence of a word, it is the same (hash it and use HGET or HEXISTS).
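Here is a small Go sketch of the client-side part; the FNV hash is an arbitrary stand-in for whatever stable client-side hash you pick.

package main

import (
    "fmt"
    "hash/fnv"
)

// bucketKey maps a word to one of 10,000 hash objects, mirroring the
// "hash it modulo 10000 on the client side" step above.
func bucketKey(word string) string {
    h := fnv.New32a()
    h.Write([]byte(word))
    return fmt.Sprintf("words:H%d", h.Sum32()%10000)
}

func main() {
    // For each word, the command to issue is then:
    //   HMSET <bucketKey(word)> <word> 1
    // and membership checks use HEXISTS against the same bucket key.
    fmt.Println(bucketKey("hello"))
}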
With this strategy, significant memory savings can be achieved, provided the modulo of the hash is chosen according to the zipmap configuration (or ziplist for Redis >= 2.6):
# Hashes are encoded in a special way (much more memory efficient) when they
# have at max a given number of elements, and the biggest element does not
# exceed a given threshold. You can configure this limits with the following
# configuration directives.
hash-max-zipmap-entries 512
hash-max-zipmap-value 64
Beware: the names of these parameters have changed with Redis >= 2.6.
Here, modulo 10000 for 1M items means 100 items per hash objects, which will guarantee that all of them are stored as zipmaps/ziplists.
In my experiments, it is better to store your data inside a hash table/dictionary. The best case I reached after a lot of benchmarking was to keep hash objects to no more than 500 keys each.
I tried standard string set/get: for 1 million keys/values, the size was 79 MB. That gets huge if you have big numbers, like 100 million keys, which would use around 8 GB.
I tried hashes to store the same data: for the same million keys/values, the size was a much smaller 16 MB.
Give it a try; if anybody needs the benchmarking code, drop me a mail.
Did you try persisting the database (BGSAVE for example), shutting the server down and getting it back up? Due to fragmentation behavior, when it comes back up and populates its data from the saved RDB file, it might take less memory.
Also: what version of Redis do you work with? Have a look at this blog post - it says that fragmentation has been partially solved as of version 2.4.
