stxxl sorting of very large file (ubuntu)

I am trying to sort a large file with about a billion records (each containing four integers); the file size will grow beyond 50 GB.
I am testing my code with 400 million records (roughly a 6 GB file). My disk configuration looks like this:
disk=/var/tmp/stxxl,50G,syscall delete
My machine has 16 GB of RAM and 8 physical processors (Intel i7); the stxxl version is 1.4.1. If I run the code with 200 million records, it takes about 5 minutes, but when I run it with 400 million records, it seems to run out of disk space. My questions are:
1) Why does my code run out of disk space when sorting even a 6 GB file? Kindly review it (only a few important lines are attached).
2) Is 5 minutes a reasonable time for my PC to sort 200 million records? If it is, I wonder whether stxxl can sort 5 billion records within a day.
3) Do you think stxxl is a good choice for this sort of problem? I also have access to a cluster with MPI installed.
CODE (inspired by examples/algo/sort_file.cpp and examples/algo/phonebills.cpp):
size_t memory_to_use = 1ul * 1024 * 1024 * 1024;  // 1 GiB of internal memory for the sort

typedef stxxl::vector<my_type, 1, stxxl::lru_pager<8>, block_size> vector_type;
vector_type v;

// fill the external vector from the input stream
std::copy(std::istream_iterator<my_type>(in),
          std::istream_iterator<my_type>(),
          std::back_inserter(v));

// sort the external vector, giving the sorter memory_to_use bytes of internal memory
stxxl::sort(v.begin(), v.end(), Cmp(), memory_to_use);
Each vector element or record is a tuple of four unsigned numbers:
struct my_type
{
    typedef unsigned short key_type;
    typedef std::tuple<key_type, key_type, key_type, key_type> key4tuple;
    ...
};

If you only want to sort, consider using the stxxl::sorter.
It should require only the expected amount of disk space (the total size of your data), and it should sort at a rate of at least ~100 MB/s, depending on your disk(s) and on how expensive comparisons are relative to the data type size.
The stxxl::sort() function does more work and needs extra space, since it writes temporary extra data.
Also see my tutorial video :).
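For reference, here is a minimal sketch of the sorter-based approach, assuming stxxl 1.4's stxxl::sorter interface. The sorter requires the comparator to provide min_value() and max_value() sentinels; the key(), min_value(), max_value() helpers on my_type and the out stream are placeholders for whatever the elided parts of your code provide:

    #include <stxxl/sorter>

    // comparator for the sorter: operator() plus min/max sentinel values
    struct CmpForSorter
    {
        bool operator () (const my_type& a, const my_type& b) const { return a.key() < b.key(); }
        my_type min_value() const { return my_type::min_value(); }  // hypothetical helpers
        my_type max_value() const { return my_type::max_value(); }
    };

    typedef stxxl::sorter<my_type, CmpForSorter> sorter_type;
    sorter_type s(CmpForSorter(), memory_to_use);

    my_type rec;
    while (in >> rec)       // push phase: feed records straight from the input stream
        s.push(rec);

    s.sort();               // switch the sorter to the output phase

    while (!s.empty()) {    // read back the records in sorted order
        out << *s << '\n';
        ++s;
    }

Because the sorter streams records through run formation and merging, it avoids the extra scratch space that sorting an stxxl::vector in place requires.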

Related

How is the disk seek faster in a column-oriented database?

I have recently started working with BigQuery, and I have come to know that it is a column-oriented database and that disk seeks are much faster in this type of database.
Can anyone explain to me how disk seeks are faster in a column-oriented database compared to a relational (row-oriented) database?
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. Number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (while those tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity).
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector and is usually 4kB. So for every sector read, we only have 4 values. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
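To make the row-store versus column-store difference concrete, here is a small illustrative C++ sketch (the Row layout and field names are invented for this example): summing the column out of an array of full 1024-byte records drags each whole record through the disk and caches, while summing a dedicated column array touches only 4 bytes per value.

    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Row store: each record is 1024 bytes, only 4 of which we actually need.
    struct Row {
        int32_t amount;      // the column we want to sum
        char    rest[1020];  // the other 1020 bytes of the record
    };

    int64_t sum_row_store(const std::vector<Row>& rows) {
        int64_t sum = 0;
        for (const Row& r : rows)
            sum += r.amount;             // every value read pulls in a full 1024-byte record
        return sum;
    }

    // Column store: the column is a contiguous array, 1024 values per 4 KB sector.
    int64_t sum_column_store(const std::vector<int32_t>& amounts) {
        return std::accumulate(amounts.begin(), amounts.end(), int64_t(0));
    }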
The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Why 2 way merge sorting is more efficient than one way merge sorting

I am reading about external sorting on Wikipedia, and I need to understand why two-pass merging is more efficient than one-pass merging.
Wiki: However, there is a limitation to single-pass merging. As the number of chunks increases, we divide memory into more buffers, so each buffer is smaller, so we have to make many smaller reads rather than fewer larger ones.
Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks (we read 100 MB / 501 ~ 200 KB from each chunk at a time) take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:
Run the initial chunk-sorting pass as before.
Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
Run a second merge pass to merge the 20 larger sorted chunks.
Could anyone give me a simple example to understand this concept well? I am particularly confused about allocating more buffers in two-pass merging.
The issue is random access overhead: average rotational delay is around 4 ms and average seek time around 9 ms, so say 10 ms average access time, versus an average transfer rate of around 150 megabytes per second. The average access overhead therefore takes about the same time as it does to read or write 1.5 megabytes.
Large I/Os reduce the relative overhead of access time. If data is read or written 10 megabytes at a time, the overhead is reduced to 15%. Using the wiki example of a 100 MB working buffer and 25 chunks to merge, plus one chunk for writes, that's 3.8-megabyte chunks, in which case random access overhead is about 39%.
On most PCs, it's possible to allocate enough memory for a 1 GB working buffer: 26 chunks at around 40 megabytes per chunk reduces random access overhead to about 3.75%, while 50 chunks at about 20 megabytes per chunk leaves random access overhead around 7.5%.
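Those percentages come from a simple model: the per-read overhead is the fixed access time divided by the time spent actually transferring one chunk. A quick check with the figures assumed in this answer (10 ms access time, 150 MB/s transfer rate):

    #include <cstdio>

    // overhead = access time / transfer time for one chunk-sized read
    double access_overhead(double chunk_mb, double access_ms = 10.0, double mb_per_s = 150.0) {
        double transfer_ms = chunk_mb / mb_per_s * 1000.0;
        return access_ms / transfer_ms;
    }

    int main() {
        std::printf("3.85 MB chunks: %.0f%%\n", 100 * access_overhead(3.85)); // ~39%
        std::printf("40 MB chunks:   %.2f%%\n", 100 * access_overhead(40.0)); // ~3.75%
        std::printf("20 MB chunks:   %.1f%%\n", 100 * access_overhead(20.0)); // ~7.5%
    }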
Note that even with 50 chunks, it's still a two-pass sort: the first pass uses the full 1 GB buffer to do an in-memory sort and then writes sorted 1 GB chunks, and the second pass merges all 50 chunks into the sorted file.
The other issue with 50 chunks is that a min-heap type method is needed to keep track of which chunk currently holds the smallest element of the 50, so it can be moved to the output buffer. After an element is moved to the output buffer, a new element from the same chunk is moved into the heap and the heap is re-heapified, which takes about 2 log2(50), or roughly 12, operations. This is better than the simple approach of doing 49 compares to find the smallest of 50 elements on every step of the 50-way merge. Each heap entry is a structure holding the current element, which chunk it came from (file position or file), the number of elements left in the chunk, and so on.
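To illustrate the min-heap bookkeeping described above, here is a small C++ sketch of a k-way merge over already-sorted runs; the one-integer-per-line run files are a made-up format just for the example, and a real external sort would use large per-run block reads as discussed above:

    #include <cstddef>
    #include <fstream>
    #include <queue>
    #include <string>
    #include <vector>

    struct HeapEntry {
        long long   value;   // current element of this run
        std::size_t run;     // which run it came from
        bool operator > (const HeapEntry& o) const { return value > o.value; }
    };

    void kway_merge(const std::vector<std::string>& run_files, const std::string& out_file) {
        std::vector<std::ifstream> runs;
        for (const auto& f : run_files)
            runs.emplace_back(f);

        // min-heap ordered by the current front element of each run
        std::priority_queue<HeapEntry, std::vector<HeapEntry>, std::greater<HeapEntry> > heap;
        for (std::size_t i = 0; i < runs.size(); ++i) {
            long long v;
            if (runs[i] >> v) heap.push({ v, i });
        }

        std::ofstream out(out_file);
        while (!heap.empty()) {
            HeapEntry top = heap.top();  // smallest of the k current elements
            heap.pop();
            out << top.value << '\n';
            long long v;
            if (runs[top.run] >> v)      // refill from the same run: ~2*log2(k) heap operations
                heap.push({ v, top.run });
        }
    }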

How much memory do I need to have for 100 million records

How much memory do I need to load 100 million records into memory? Suppose each record needs 7 bytes. Here is my calculation:
each record = <int> <short> <byte>
4 + 2 + 1 = 7 bytes
needed memory in GB = 7 * 100,000,000 / 1,000,000,000 = 0.7 GB
Do you see any problem with this calculation?
With 100,000,000 records, you need to allow for overhead. Exactly what and how much overhead you'll have will depend on the language.
In C/C++, for example, fields in a structure or class are aligned on specific boundaries. Details may vary depending on the compiler, but in general ints must begin at an address that is a multiple of 4, shorts at a multiple of 2, and chars can begin anywhere.
So assuming your 4 + 2 + 1 means an int, a short, and a char, then if you arrange them in that order the structure holds 7 bytes of data, but the next instance of the structure must begin at (at least) a 4-byte boundary, so you'll get 1 pad byte at the end of each instance. I think, in fact, most C compilers require structs as a whole to begin at an 8-byte boundary, though in this case that doesn't matter.
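A quick way to check the padding claim on your own compiler (the exact sizes are typical, not guaranteed by the standard):

    #include <cstdio>

    struct Record {   // 4 + 2 + 1 = 7 bytes of data ...
        int   a;      // 4 bytes, must start on a 4-byte boundary
        short b;      // 2 bytes
        char  c;      // 1 byte
    };                // ... but sizeof(Record) is typically 8 because of one pad byte

    int main() {
        std::printf("sizeof(Record) = %zu\n", sizeof(Record));                        // usually 8
        std::printf("100,000,000 records = %zu bytes\n", 100000000 * sizeof(Record)); // ~800 MB
    }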
Every time you allocate memory there's some overhead for the allocation block: the allocator has to keep track of how much memory was allocated and sometimes where the next block is. If you allocate all 100,000,000 records as one big "new" or "malloc", this overhead is trivial, but if you allocate each one individually, every record carries that overhead. Exactly how much it is depends on the implementation; on one system I used it was 8 bytes per allocation. In that case each record would need 16 bytes: 8 bytes for the block header, 7 for data, 1 for padding. So it could easily take double what you expect.
Other languages will have different overhead. The easiest thing to do is probably to find out empirically: look up the system call that reports how much memory you're using, check its value, allocate a million instances, check it again, and see the difference.
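For the empirical route, here is a Linux-specific sketch using getrusage (ru_maxrss reports the peak resident set size in kilobytes on Linux); the per-record figure it prints includes both padding and per-allocation overhead:

    #include <cstdio>
    #include <sys/resource.h>   // getrusage, POSIX/Linux only

    struct Record { int a; short b; char c; };

    static long peak_rss_kb() {
        rusage ru {};
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_maxrss;     // peak resident set size, in kB on Linux
    }

    int main() {
        const int n = 1000000;
        long before = peak_rss_kb();
        Record** recs = new Record*[n];
        for (int i = 0; i < n; ++i)
            recs[i] = new Record();              // one separate allocation per record
        long after = peak_rss_kb();
        std::printf("about %ld bytes per record, including overhead\n",
                    (after - before) * 1024L / n);
        for (int i = 0; i < n; ++i) delete recs[i];
        delete[] recs;
    }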
If you really need just 7 bytes per structure, then you are almost right.
For memory measurements, we usually use the factor of 1024, so you would need
700,000,000 bytes / 1024² ≈ 667.57 MiB, or 700,000,000 / 1024³ ≈ 0.652 GiB

How can I approximate the size of a data structure in scala?

I have a query that returns me around 6 million rows, which is too big to process all at once in memory.
Each query is returning a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters, UTF8.
How can I work out the maximum size of one of these tuples, and more generally, how can I approximate the size of a Scala data structure like this?
I've got 6 GB on the machine I'm using. However, the data is being read from the database using scala-query into Scala Lists.
Scala objects follow approximately the same rules as Java objects, so any information on those is accurate. Here is one source, which seems at least mostly right for 32 bit JVMs. (64 bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes extra overhead plus 4 bytes per pointer--but there may be less if the JVM is using compressed pointers, which it does by default now, I think.)
I'll assume a 64 bit machine without compressed pointers (worst case); then a Tuple3 has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes) rounded to the nearest 8, or 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use wrapped versions.). String is 32 bytes, IIRC, plus the array for the data which is 16 plus 2 per character. java.sql.Timestamp needs to store a couple of Longs (I think it is), so that's 32 bytes. All told, it's on the order of 120 bytes plus two per character, which at ~20 characters is ~160 bytes.
Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above has been corrected using this data so it matches; I had several small errors before).
How much memory have you got at your disposal? 6 million instances of a triple is really not very much!
Each reference has an overhead which is either 4 or 8 bytes, dependent on whether you are running 32- or 64-bit (without compressed "oops", although this is the default in JDK7 for heaps under 32Gb).
So your triple has 3 references (there may be extra ones due to specialisation, so you might get 4 refs), your Timestamp is a wrapper (reference) around a long (8 bytes), and your Int will be specialized (i.e. an underlying int), which makes another 4 bytes. The String is 20 x 2 bytes. So you basically have a worst case of well under 100 bytes per row: 10 rows per kB, 10,000 rows per MB. So you can comfortably process your 6 million rows in under 1 GB of heap.
Frankly, I think I've made a mistake here because we process daily several million rows of about twenty fields (including decimals, Strings etc) comfortably in this space.

Redis 10x more memory usage than data

I am trying to store a wordlist in redis. The performance is great.
My approach is to make a set called "words" and add each new word via 'sadd'.
When adding a file that's 15.9 MB and contains about a million words, the redis-server process consumes 160 MB of RAM. How come I am using 10x the memory? Is there a better way of approaching this problem?
Well, this is expected of any efficient data storage: the words have to be indexed in memory in a dynamic data structure of cells linked by pointers. The size of the structure metadata, the pointers, and the memory allocator's internal fragmentation is the reason why the data takes much more memory than a corresponding flat file.
A Redis set is implemented as a hash table. This includes:
an array of pointers growing geometrically (powers of two)
a second array may be required when incremental rehashing is active
singly-linked list cells representing the entries in the hash table (3 pointers, 24 bytes per entry)
Redis object wrappers (one per value) (16 bytes per entry)
the actual data themselves (each prefixed by 8 bytes for size and capacity)
All the above sizes are given for the 64-bit implementation. Accounting for the memory allocator overhead, Redis ends up taking at least 64 bytes per set item (on top of the data) for a recent version of Redis using the jemalloc allocator (>= 2.4).
Redis provides memory optimizations for some data types, but they do not cover sets of strings. If you really need to optimize memory consumption of sets, there are tricks you can use though. I would not do this for just 160 MB of RAM, but should you have larger data, here is what you can do.
If you do not need the union, intersection, difference capabilities of sets, then you may store your words in hash objects. The benefit is hash objects can be optimized automatically by Redis using zipmap if they are small enough. The zipmap mechanism has been replaced by ziplist in Redis >= 2.6, but the idea is the same: using a serialized data structure which can fit in the CPU caches to get both performance and a compact memory footprint.
To guarantee the hash objects are small enough, the data could be distributed according to some hashing mechanism. Assuming you need to store 1M items, adding a word could be implemented in the following way:
hash it modulo 10000 (done on client side)
HMSET words:[hashnum] [word] 1
Instead of storing:
words => set{ hi, hello, greetings, howdy, bonjour, salut, ... }
you can store:
words:H1 => map{ hi:1, greetings:1, bonjour:1, ... }
words:H2 => map{ hello:1, howdy:1, salut:1, ... }
...
To retrieve or check the existence of a word, it is the same (hash it and use HGET or HEXISTS).
With this strategy, significant memory saving can be done provided the modulo of the hash is chosen according to the zipmap configuration (or ziplist for Redis >= 2.6):
# Hashes are encoded in a special way (much more memory efficient) when they
# have at max a given number of elements, and the biggest element does not
# exceed a given threshold. You can configure this limits with the following
# configuration directives.
hash-max-zipmap-entries 512
hash-max-zipmap-value 64
Beware: the names of these parameters have changed with Redis >= 2.6.
Here, modulo 10000 for 1M items means 100 items per hash object, which will guarantee that all of them are stored as zipmaps/ziplists.
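Here is a sketch of the client-side bucketing; the use of std::hash and the command printing are illustrative only (a real client would send these commands through its Redis library of choice):

    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <string>

    // Map each word to one of 10000 small hash objects so that every bucket
    // stays below the zipmap/ziplist thresholds configured above.
    std::string bucket_key(const std::string& word, std::size_t buckets = 10000) {
        std::size_t h = std::hash<std::string>{}(word) % buckets;
        return "words:" + std::to_string(h);
    }

    int main() {
        const char* words[] = { "hi", "hello", "greetings", "howdy" };
        for (const char* w : words) {
            // commands a client would issue for insert and lookup
            std::printf("HMSET %s %s 1\n", bucket_key(w).c_str(), w);
            std::printf("HEXISTS %s %s\n", bucket_key(w).c_str(), w);
        }
    }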
In my experiments, it is better to store your data inside a hash table/dictionary. The best case I reached after a lot of benchmarking is to keep hash objects that do not exceed 500 keys.
I tried standard string set/get: for 1 million keys/values, the size was 79 MB. That becomes huge with big numbers; for example, 100 million keys would use around 8 GB.
I tried hashes to store the same data: for the same million keys/values, the size was a much smaller 16 MB.
Give it a try; if anybody needs the benchmarking code, drop me a mail.
Did you try persisting the database (BGSAVE, for example), shutting the server down and bringing it back up? Due to fragmentation behavior, when it comes back up and populates its data from the saved RDB file, it might take less memory.
Also: which version of Redis do you work with? Have a look at this blog post - it says that fragmentation has been partially solved as of version 2.4.
