Redis 10x more memory usage than data - performance

I am trying to store a wordlist in redis. The performance is great.
My approach is of making a set called "words" and adding each new word via 'sadd'.
When adding a file thats 15.9 MB and contains about a million words, the redis-server process consumes 160 MB of ram. How come I am using 10x the memory, is there any better way of approaching this problem?

Well this is expected of any efficient data storage: the words have to be indexed in memory in a dynamic data structure of cells linked by pointers. Size of the structure metadata, pointers and memory allocator internal fragmentation is the reason why the data take much more memory than a corresponding flat file.
A Redis set is implemented as a hash table. This includes:
an array of pointers growing geometrically (powers of two)
a second array may be required when incremental rehashing is active
single-linked list cells representing the entries in the hash table (3 pointers, 24 bytes per entry)
Redis object wrappers (one per value) (16 bytes per entry)
actual data themselves (each of them prefixed by 8 bytes for size and capacity)
All the above sizes are given for the 64 bits implementation. Accounting for the memory allocator overhead, it results in Redis taking at least 64 bytes per set item (on top of the data) for a recent version of Redis using the jemalloc allocator (>= 2.4)
Redis provides memory optimizations for some data types, but they do not cover sets of strings. If you really need to optimize memory consumption of sets, there are tricks you can use though. I would not do this for just 160 MB of RAM, but should you have larger data, here is what you can do.
If you do not need the union, intersection, difference capabilities of sets, then you may store your words in hash objects. The benefit is hash objects can be optimized automatically by Redis using zipmap if they are small enough. The zipmap mechanism has been replaced by ziplist in Redis >= 2.6, but the idea is the same: using a serialized data structure which can fit in the CPU caches to get both performance and a compact memory footprint.
To guarantee the hash objects are small enough, the data could be distributed according to some hashing mechanism. Assuming you need to store 1M items, adding a word could be implemented in the following way:
hash it modulo 10000 (done on client side)
HMSET words:[hashnum] [word] 1
Instead of storing:
words => set{ hi, hello, greetings, howdy, bonjour, salut, ... }
you can store:
words:H1 => map{ hi:1, greetings:1, bonjour:1, ... }
words:H2 => map{ hello:1, howdy:1, salut:1, ... }
...
To retrieve or check the existence of a word, it is the same (hash it and use HGET or HEXISTS).
With this strategy, significant memory saving can be done provided the modulo of the hash is
chosen according to the zipmap configuration (or ziplist for Redis >= 2.6):
# Hashes are encoded in a special way (much more memory efficient) when they
# have at max a given number of elements, and the biggest element does not
# exceed a given threshold. You can configure this limits with the following
# configuration directives.
hash-max-zipmap-entries 512
hash-max-zipmap-value 64
Beware: the name of these parameters have changed with Redis >= 2.6.
Here, modulo 10000 for 1M items means 100 items per hash objects, which will guarantee that all of them are stored as zipmaps/ziplists.

As for my experiments, It is better to store your data inside a hash table/dictionary . the best ever case I reached after a lot of benchmarking is to store inside your hashtable data entries that are not exceeding 500 keys.
I tried standard string set/get, for 1 million keys/values, the size was 79 MB. It is very huge in case if you have big numbers like 100 millions which will use around 8 GB.
I tried hashes to store the same data, for the same million keys/values, the size was increasingly small 16 MB.
Have a try in case if anybody needs the benchmarking code, drop me a mail

Did you try persisting the database (BGSAVE for example), shutting the server down and getting it back up? Due to fragmentation behavior, when it comes back up and populates its data from the saved RDB file, it might take less memory.
Also: What version of Redis to you work with? Have a look at this blog post - it says that fragmentation has partially solved as of version 2.4.

Related

Using a slice instead of list when working with large data volumes in Go

I have a question on the utility of slices in Go. I have just seen Why are lists used infrequently in Go? and Why use arrays instead of slices? but had some question which I did not see answered there.
In my application:
I read a CSV file containing approx 10 million records, with 23 columns per record.
For each record, I create a struct and put it into a linked list.
Once all records have been read, the rest of the application logic works with this linked list (the processing logic itself is not relevant for this question).
The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need. Also, since I don't know the size of the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Every Go tutorial or article I read seems to suggest that I should use slices and not lists (as a slice can do everything a list can, but do it better somehow). However, I don't see how or why a slice would be more helpful for what I need? Any ideas from anyone?
... approx 10 million records, with 23 columns per record ... The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need.
This contiguous memory is its own benefit as well as its own drawback. Let's consider both parts.
(Note that it is also possible to use a hybrid approach: a list of chunks. This seems unlikely to be very worthwhile here though.)
Also, since I don't know the size of the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Clearly, if there are n records, and you allocate and fill in each one once (using a list), this is O(n).
If you use a slice, and allocate a single extra slice entry every time, you start with none, grow it to size 1, then copy the 1 to a new array of size 2 and fill in item #2, grow it to size 3 and fill in item #3, and so on. The first of the n entities is copied n times, the second is copied n-1 times, and so on, for n(n+1)/2 = O(n2) copies. But if you use a multiplicative expansion technique—which Go's append implementation does—this drops to O(log n) copies. Each one does copy more bytes though. It ends up being O(n), amortized (see Why do dynamic arrays have to geometrically increase their capacity to gain O(1) amortized push_back time complexity?).
The space used with the slice is obviously O(n). The space used for the linked list approach is O(n) as well (though the records now require at least one forward pointer so you need some extra space per record).
So in terms of the time needed to construct the data, and the space needed to hold the data, it's O(n) either way. You end up with the same total memory requirement. The main difference, at first glace anyway, is that the linked-list approach doesn't require contiguous memory.
So: What do we lose when using contiguous memory, and what do we gain?
What we lose
The thing we lose is obvious. If we already have fragmented memory regions, we might not be able to get a contiguous block of the right size. That is, given:
used: 1 MB (starting at base, ending at base+1M)
free: 1 MB (starting at +1M, ending at +2M)
used: 1 MB (etc)
free: 1 MB
used: 1 MB
free: 1 MB
we have a total of 6 MB, 3 used and 3 free. We can allocate 3 1 MB blocks, but we can't allocate one 3 MB block unless we can somehow compact the three "used" regions.
Since Go programs tend to run in virtual memory on large-memory-space machines (virtual sizes of 64 GB or more), this tends not to be a big problem. Of course everyone's situation differs, so if you really are VM-constrained, that's a real concern. (Other languages have compacting GC to deal with this, and a future Go implementation could at least in theory use a compacting GC.)
What we gain
The first gain is also obvious: we don't need pointers in each record. This saves some space—the exact amount depends on the size of the pointers, whether we're using singly linked lists, and so on. Let's just assume 2 8 byte pointers, or 16 bytes per record. Multiply by 10 million records and we're looking pretty good here: we've saved 160 MBytes. (Go's container/list implementation uses a doubly linked list, and on a 64 bit machine, this is the size of the per-element threading needed.)
We gain something less obvious at first, though, and it's huge. Because Go is a garbage-collected language, every pointer is something the GC must examine at various times. The slice approach has zero extra pointers per record; the linked-list approach has two. That means that the GC system can avoid examining the nonexistent 20 million pointers (in the 10 million records).
Conclusion
There are times to use container/list. If your algorithm really calls for a list and is significantly clearer that way, do it that way, unless and until it proves to be a problem in practice. Or, if you have items that can be on some collection of lists—items that are actually shared, but some of them are on the X list and some are on the Y list and some are on both—this calls for a list-style container. But if there's an easy way to express something as either a list or a slice, go for the slice version first. Because slices are built into Go, you also get the type safety / clarity mentioned in the first link (Why are lists used infrequently in Go?).

Is there an efficient way to store a lookup structure that uses random integer keys?

I need to implement a lookup structure with the following requirements:
Keys are random 128-bit integers
Values are 64-bit
It will be stored on disk
It must be searchable without the entire structure being resident in memory (I intend to memory map the file)
It must be mutable, but writes to disk must be incremental (must not require overwriting the entire structure)
Is there an efficient way to achieve all of this?
Please do not answer, "Don't use UUIDs." I am asking a specific question; changing the requirements changes the question.
Since your keys and values each are a fixed number of bytes, you could implement a hashtable as a file. The first few bytes contain the current number of elements and the current capacity, and then the entries each take up 16 + 8 bytes (if 0 is forbidden as a key) or 1 + 16 + 8 bytes if you need a flag to indicate whether an entry exists or not.
You can hash the key, then use arithmetic to seek to the correct position in the file, then read or write just the entries you need to. To resolve hash collisions, linear probing is probably best to avoid the number of seeks. Since the keys are random, catastrophic collision pileups shouldn't happen, and the hash can simply be to take the lowest k bits of the key, where the current capacity is 2^k.
This takes O(n) space, and allows lookups in O(1) average time, and writes in O(1) amortized time. Occasionally, you have to resize the hashtable to increase the capacity on a write; this takes O(n) time on those occasions.
If you need O(1) writes in the worst-case, you could maintain both the old and new hashtables, do lookups in both, and then on each write operation, copy across two entries from the old to the new. If the capacity is always increased by a factor of 2, then this gives non-amortized constant time writes, except for the cost of allocating an empty hashtable of size O(n). If creating an empty file of a particular size is also too slow for a single write operation, then you can amortize empty-file-creation across many writes too.

Why is two lines that differ in their address by precisely 65,536 bytes cannot be stored in the cache at the same?

I read a book Andrew Tanenbaum - structured computer organization (6th edition) - 2012, and I dont understand it.
"This mapping scheme puts consecutive memory lines in consecutive cache entries.In fact, up to 64 KB of contiguous data can be stored in the cache.However,two lines that differ in their address by precisely 65,536 bytes or any integral multiple of that number cannot be stored in the cache at the same time (because they have the same Line value).For example, if a program accesses data at location X and next executes an instruction that needs data at location X + 65,536 (or anyother location within the same line), the second instruction will force the cache entry to be reloaded, overwriting what was there.If this happens often enough, itcan result in poor behavior.In fact, the worst-case behavior of a cache is worsethan if there were no cache at all, since each memory operation involves reading in an entire cache line instead of just one word."
Why are they have the same Line value?
This is because of two concepts in cache design. First, a concept called associativity in cache design. For every possible input cache-line address (64 byte aligned on a modern x86-64 system) there are only N possible slots in the cache it may access.
The second is the a problem much like what is encountered with the hash function used within a hashmap. Simply put, some scheme has to be used in converting input addresses to slots in the cache. Notice that the book says the cache can hold 64 (presumably imperial) kilobytes. 64 kB is 65,536 bytes, and the magical cache-ruining distance in question is ALSO 65,536! So, in this case the address -> cache slot function is a simple and operation, and it appears the author is talking about a 1-way associativity cache (that is, each line may only be stored in ONE location inside the cache.) Leading to the mentioned conflict.
Why would microprocessor designers choose a simple AND function? Well... Because it's simple, mainly. Instead of wasting transistors on more complex logic, a basic operation like AND will suffice.

What's the maximum number of elements a scheme list can have?

I'm working with Chicken Scheme, I wonder how many elements a list can have.
There is no hard limit – it can have as many as there's room for in memory.
The documentation, under the option -:hmNUMBER, it mentions that there is a default max limit of 2GB heap size, which gives you about 45 million pairs. You can increase these with several options, but the simplest for a set default memory limit is -heap-size. Here is how to double the default:
csc -heap-size 4000M <file>
It says under the documentation for -heap-size that it only uses half of the allocated memory at every given time. It might be using the lonely hearts garbage collection algorithm where when memory is full it moves used memory to the unused segment making the old segment the unused one.

How does the disk seek is faster in column oriented database

I have recently started working on bigqueries, I come to know they are column oriented data base and disk seek is much faster in this type of databases.
Can any one explain me how the disk seek is faster in column oriented database compare to relational db.
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. Number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (while those tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity).
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector and is usually 4kB. So for every sector read, we only have 4 values. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Resources