How does the disk seek is faster in column oriented database - performance

I have recently started working on bigqueries, I come to know they are column oriented data base and disk seek is much faster in this type of databases.
Can any one explain me how the disk seek is faster in column oriented database compare to relational db.

The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. Number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (while those tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity).
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector and is usually 4kB. So for every sector read, we only have 4 values. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).

The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Related

Scan time of cache, RAM và disk?

Consider a computer system that has cache memory, main memory (RAM), and disk, and OS uses virtual memory. It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM, and 10 msec to access a block of 1000 bytes from the disk. If a book has 1000 pages, each with 50 lines of 80 characters each, How long it will take to electronically scan the text for the case of the master copy being in each of the level as one proceeds down the memory hierarchy(from inboard memory to offline storage)?
If a book has 1000 pages, each with 50 lines of 80 characters each, then the book has 1000 * 50 * 80 = 4000000 characters. We don't know how big a character is (and it could be UTF8 where different characters are different sizes), we don't know how much meta-data there is (e.g. if it's a word-processor file with extra information in a header about page margins, tab stops; plus more data for which font and style, etc) or if there's any extra processing (e.g. compression of pieces within the file).
If we make an unfounded assumption that the file happens to be 4000000 bytes; then we might say that it'll be 4000 blocks (with 1000 bytes per block) on disk.
Then we get into trouble. A CPU can't access data on disk (and can only access data in RAM or cache); so it needs to be loaded into RAM (e.g. by a disk controller) before the CPU can access it.
If it takes the disk controller 10 msec to access a block of 1000 bytes from disk, then we might say it will take at least 10 msec * 4000 = 40000 msec = 40 seconds to read the whole file into disk. However this would be wrong - the disk controller (acting on requests by file system support code) will have to find the file (e.g. read directory info, etc), and that the file may be fragmented so the disk controller will need to read (and then follow) a "list of where the pieces of the file are".
Of course while the CPU is scanning the first part of the file the disk controller can be reading the last part of the file; either because the software is designed to use asynchronous IO or because the OS detected a sequential access pattern and started pre-fetching the file before the program asked for it. In other words; the ideal case is that when the disk controller finishes loading the last block the CPU has already scanned the first 3999 blocks and only has to scan 1 more (and the worst case is that disk controller and CPU don't do anything at the same time it becomes "40 seconds to load the file into RAM plus however long it takes for CPU to scan the data in RAM".
Of course we also don't know things like (e.g.) if the file is actually loaded 1 block at a time (if it's split into 400 transfers with 10 blocks per transfer then the "ideal case" would be worse as CPU would have to scan the last 10 blocks after they're loaded and not just the last one block); of how many reads the disk controller does before a pre-fetcher detects that it's a sequential pattern.
Once the file is in RAM we have more problems.
Specifically; anyone that understands how caches work will know that "It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM" means that when you access one byte in RAM it takes 18 nsec to transfer a cache line (a group of consecutive bytes) from RAM to cache plus 2 nsec to obtain that 1 byte from cache; and then the next byte you access will have already been transferred to cache (as it's part of "group of consecutive bytes") and will only cost 2 nsec.
After the file's data is loaded into RAM by disk controller; because we don't know the cache line size, we don't know how many of software's accesses will take 20 nsec and how many will take 2 nsec.
The final thing is that we don't actually know anything useful about the caches. The general rule is that the larger a cache is the slower it is; and "large enough to contain the entire book (plus the program's code, stack, parts of the OS, etc) but fast enough to have a 2 nsec access times" is possibly an order of magnitude better than any cache that has ever existed. Essentially, the words "the cache" (in the question) can not be taken literally as it would be implausible. If we look at anything that's close to being able to provide a 2 nsec access time we see multiple levels - e.g. a fast and small L1 cache with a slower but larger L2 cache (with an even slower but larger L3 cache). To make sense of the question you must assume that "the cache" meant "the set of caches", and that the L1 cache has an access time of 2 nsec (but is too small to hold the whole file and everything else) and other levels of the cache hierarchy (e.g. L2 cache) have unknown slower access times but may be large enough to hold the whole file (and everything else).
Mostly; if I had to guess; I'd assume that the question was taken from a university course (because universities have a habit of tricking students into paying $$$ for worthless "fictional knowledge").

Why 2 way merge sorting is more efficient than one way merge sorting

I am reading about external sorting from wikipedia, and need to understand why 2 phase merging is more efficient than 1 phase merging.
Wiki : However, there is a limitation to single-pass merging. As the
number of chunks increases, we divide memory into more buffers, so
each buffer is smaller, so we have to make many smaller reads rather
than fewer larger ones.
Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge
pass isn't efficient: the disk seeks required to fill the input
buffers with data from each of the 500 chunks (we read 100MB / 501 ~
200KB from each chunk at a time) take up most of the sort time. Using
two merge passes solves the problem. Then the sorting process might
look like this:
Run the initial chunk-sorting pass as before.
Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
Run a second merge pass to merge the 20 larger sorted chunks.
Could anyone give me a simple example to understand this concept well. I am particularly confused about allocating more buffers in 2 phase merging.
The issue is random access overhead, average rotational delay is around 4 ms, average seek time around 9 ms, so say 10 ms average access time, versus an average transfer rate around 150 mega-bytes per second. So the average access overhead takes about the same time as it does to read or write 1.5 megabytes.
Large I/O's reduce the relative overhead of access time. If data is read or written 10 mega bytes at a time, the overhead is reduced to 15%. Using the wiki example 100MB of working buffer and 25 chunks to merge, plus one chunk for writes, that's 3.8 megabyte chunks, in which case random access overhead is about 39%.
On most PC's, it's possible to allocate enough memory for a 1GB working buffer, so 26 chunks at over 40 mega bytes per chunk, reducing random access overhead to 3.75%. 50 chunks would be about 20 megabytes per chunk, with random access overhead around 7.5%.
Note that even with 50 chunks, it's still a two pass sort, the first pass uses all of the 1gb buffer to do a memory sort, then writes sorted 1gb chunks. The second pass merges all 50 chunks that result in the sorted file.
The other issue with 50 chunks, is that a min heap type method is needed to maintain sorted information in order to determine which element of which chunk is the smallest of the 50 and gets moved to the output buffer. After an element is moved to the output buffer, a new element is moved into the heap and then the heap is re-heapified, which takes about 2 log2(50) operations, or about 12 operations. This is better than the simple approach of doing 49 compares to determine the smallest element from a group of 50 elements when doing the 50 way merge. The heap is composed of structures, the current element, which chunk this is (file position or file), the number of elements left in the chunk, ... .

Why Is a Block in HDFS So Large?

Can somebody explain this calculation and give a lucid explanation?
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.
If we keep the ratio seekTime / transferTime small (close to .01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
This is important since in map reduce jobs we are typically traversing (reading) the whole data set (represented by an HDFS file or folder or set of folders) and doing logic on it, so since we have to spend the full transferTime anyway to get all the data out of the disk, let's try to minimise the time spent doing seeks and read by big chunks, hence the large size of the data blocks.
In more traditional disk access software, we typically do not read the whole data set every time, so we'd rather spend more time doing plenty of seeks on smaller blocks rather than losing time transferring too much data that we won't need.
Since 100mb is divided into 10 blocks you gotta do 10 seeks and transfer rate is (10/100)mb/s for each file.
(10ms*10) + (10/100mb/s)*10 = 1.1 sec. which is greater than 1.01 anyway.
Since 100mb is divided among 10 blocks, each block has 10mb only as it is HDFS. Then it should be 10*10ms + 10mb/(100Mb/s) = 0.1s+ 0.1s = 0.2s and even lesser time.

How much memory do I need to have for 100 million records

How much memory do i need to load 100 million records in to memory. Suppose each record needs 7 bytes. Here is my calculation
each record = <int> <short> <byte>
4 + 2 + 1 = 7 bytes
needed memory in GB = 7 * 100 * 1,000,000 / 1000,000,000 = 0.7 GB
Do you see any problem with this calculation?
With 100,000,000 records, you need to allow for overhead. Exactly what and how much overhead you'll have will depend on the language.
In C/C++, for example, fields in a structure or class are aligned onto specific boundaries. Details may vary depending on the compiler, but in general int's must begin at an address that is a multiple of 4, short's at a multiple of 2, char's can begin anywhere.
So assuming that your 4+2+1 means an int, a short, and a char, then if you arrange them in that order, the structure will take 7 bytes, but at the very minimum the next instance of the structure must begin at a 4-byte boundary, so you'll have 1 pad byte in the middle. I think, in fact, most C compilers require structs as a whole to begin at an 8-byte boundary, though in this case that doesn't matter.
Every time you allocate memory there's some overhead for allocation block. The compiler has to be able to keep track of how much memory was allocated and sometimes where the next block is. If you allocate 100,000,000 records as one big "new" or "malloc", then this overhead should be trivial. But if you allocate each one individually, then each record will have the overhead. Exactly how much that is depends on the compiler, but, let's see, one system I used I think it was 8 bytes per allocation. If that's the case, then here you'd need 16 bytes for each record: 8 bytes for block header, 7 for data, 1 for pad. So it could easily take double what you expect.
Other languages will have different overhead. The easiest thing to do is probably to find out empirically: Look up what the system call is to find out how much memory you're using, then check this value, allocate a million instances, check it again and see the difference.
If you really need just 7 bytes per structure, then you are almost right.
For memory measurements, we usually use the factor of 1024, so you would need
700 000 000 / 1024³ = 667,57 MiB = 0,652 GiB

Redis 10x more memory usage than data

I am trying to store a wordlist in redis. The performance is great.
My approach is of making a set called "words" and adding each new word via 'sadd'.
When adding a file thats 15.9 MB and contains about a million words, the redis-server process consumes 160 MB of ram. How come I am using 10x the memory, is there any better way of approaching this problem?
Well this is expected of any efficient data storage: the words have to be indexed in memory in a dynamic data structure of cells linked by pointers. Size of the structure metadata, pointers and memory allocator internal fragmentation is the reason why the data take much more memory than a corresponding flat file.
A Redis set is implemented as a hash table. This includes:
an array of pointers growing geometrically (powers of two)
a second array may be required when incremental rehashing is active
single-linked list cells representing the entries in the hash table (3 pointers, 24 bytes per entry)
Redis object wrappers (one per value) (16 bytes per entry)
actual data themselves (each of them prefixed by 8 bytes for size and capacity)
All the above sizes are given for the 64 bits implementation. Accounting for the memory allocator overhead, it results in Redis taking at least 64 bytes per set item (on top of the data) for a recent version of Redis using the jemalloc allocator (>= 2.4)
Redis provides memory optimizations for some data types, but they do not cover sets of strings. If you really need to optimize memory consumption of sets, there are tricks you can use though. I would not do this for just 160 MB of RAM, but should you have larger data, here is what you can do.
If you do not need the union, intersection, difference capabilities of sets, then you may store your words in hash objects. The benefit is hash objects can be optimized automatically by Redis using zipmap if they are small enough. The zipmap mechanism has been replaced by ziplist in Redis >= 2.6, but the idea is the same: using a serialized data structure which can fit in the CPU caches to get both performance and a compact memory footprint.
To guarantee the hash objects are small enough, the data could be distributed according to some hashing mechanism. Assuming you need to store 1M items, adding a word could be implemented in the following way:
hash it modulo 10000 (done on client side)
HMSET words:[hashnum] [word] 1
Instead of storing:
words => set{ hi, hello, greetings, howdy, bonjour, salut, ... }
you can store:
words:H1 => map{ hi:1, greetings:1, bonjour:1, ... }
words:H2 => map{ hello:1, howdy:1, salut:1, ... }
...
To retrieve or check the existence of a word, it is the same (hash it and use HGET or HEXISTS).
With this strategy, significant memory saving can be done provided the modulo of the hash is
chosen according to the zipmap configuration (or ziplist for Redis >= 2.6):
# Hashes are encoded in a special way (much more memory efficient) when they
# have at max a given number of elements, and the biggest element does not
# exceed a given threshold. You can configure this limits with the following
# configuration directives.
hash-max-zipmap-entries 512
hash-max-zipmap-value 64
Beware: the name of these parameters have changed with Redis >= 2.6.
Here, modulo 10000 for 1M items means 100 items per hash objects, which will guarantee that all of them are stored as zipmaps/ziplists.
As for my experiments, It is better to store your data inside a hash table/dictionary . the best ever case I reached after a lot of benchmarking is to store inside your hashtable data entries that are not exceeding 500 keys.
I tried standard string set/get, for 1 million keys/values, the size was 79 MB. It is very huge in case if you have big numbers like 100 millions which will use around 8 GB.
I tried hashes to store the same data, for the same million keys/values, the size was increasingly small 16 MB.
Have a try in case if anybody needs the benchmarking code, drop me a mail
Did you try persisting the database (BGSAVE for example), shutting the server down and getting it back up? Due to fragmentation behavior, when it comes back up and populates its data from the saved RDB file, it might take less memory.
Also: What version of Redis to you work with? Have a look at this blog post - it says that fragmentation has partially solved as of version 2.4.

Resources