Random write vs. seek time

I have a very weird question here...
I am trying to write data randomly to a 100 MB file.
Each write is 4 KB, and the random offset is page-aligned (4 KB).
In total I write 1 GB of data at random offsets within the 100 MB file.
If I remove the actual code that writes the data to the disk, the entire operation takes less than a second (say 0.04 sec).
If I keep the code that writes the data, it takes several seconds.
What happens internally during a random write: is the cost the seek time or the write time? From the above scenario it's really confusing to me.
Can anybody explain in depth, please?
The same procedure with sequential offsets writes very fast.
Thank you.

If you're writing all over the file, then the disk (I presume this is on a disk) needs to seek to a new place on every write.
Also, the write speed of hard disks isn't particularly stunning.
Say for the sake of example (taken from a WD Raptor EL150) that we have a 5.9 ms seek time. If you are writing 1 GB randomly everywhere in 4 KB chunks, you're seeking 1,000,000,000 ÷ 4,000 × 0.0059 seconds = a total seeking time of roughly 1,475 seconds (almost 25 minutes)!
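As a back-of-the-envelope check, here is a minimal sketch of that arithmetic (the 5.9 ms seek and the 4,000-byte chunk are just the example values above, not measurements):

#include <cstdio>

int main() {
    // Illustrative numbers from the answer above: 5.9 ms average seek,
    // 1 GB written in ~4 KB chunks at random offsets.
    const double seekTimeS  = 0.0059;         // seconds per random write
    const double totalBytes = 1000000000.0;   // 1 GB
    const double chunkBytes = 4000.0;         // ~4 KB

    double writes     = totalBytes / chunkBytes;   // 250,000 writes
    double totalSeekS = writes * seekTimeS;        // ~1,475 seconds

    std::printf("random writes: %.0f, total seek time: ~%.0f s\n", writes, totalSeekS);
    return 0;
}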

Related

Scan time of cache, RAM and disk?

Consider a computer system that has cache memory, main memory (RAM), and disk, and whose OS uses virtual memory. It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM, and 10 msec to access a block of 1000 bytes from the disk. If a book has 1000 pages, each with 50 lines of 80 characters, how long will it take to electronically scan the text when the master copy is held at each level in turn, proceeding down the memory hierarchy (from inboard memory to offline storage)?
If a book has 1000 pages, each with 50 lines of 80 characters each, then the book has 1000 * 50 * 80 = 4000000 characters. We don't know how big a character is (and it could be UTF8 where different characters are different sizes), we don't know how much meta-data there is (e.g. if it's a word-processor file with extra information in a header about page margins, tab stops; plus more data for which font and style, etc) or if there's any extra processing (e.g. compression of pieces within the file).
If we make an unfounded assumption that the file happens to be 4000000 bytes; then we might say that it'll be 4000 blocks (with 1000 bytes per block) on disk.
Then we get into trouble. A CPU can't access data on disk (and can only access data in RAM or cache); so it needs to be loaded into RAM (e.g. by a disk controller) before the CPU can access it.
If it takes the disk controller 10 msec to access a block of 1000 bytes from disk, then we might say it will take at least 10 msec * 4000 = 40000 msec = 40 seconds to read the whole file into RAM. However this would be wrong - the disk controller (acting on requests by file system support code) will have to find the file first (e.g. read directory info, etc.), and the file may be fragmented, so the disk controller will need to read (and then follow) a "list of where the pieces of the file are".
Of course while the CPU is scanning the first part of the file the disk controller can be reading the last part of the file; either because the software is designed to use asynchronous IO or because the OS detected a sequential access pattern and started pre-fetching the file before the program asked for it. In other words, the ideal case is that when the disk controller finishes loading the last block the CPU has already scanned the first 3999 blocks and only has to scan 1 more; and the worst case is that the disk controller and CPU never do anything at the same time, so it becomes "40 seconds to load the file into RAM plus however long it takes for the CPU to scan the data in RAM".
Of course we also don't know things like whether the file is actually loaded 1 block at a time (if it's split into 400 transfers with 10 blocks per transfer, then the "ideal case" would be worse, as the CPU would have to scan the last 10 blocks after they're loaded and not just the last one block); or how many reads the disk controller does before a pre-fetcher detects that it's a sequential pattern.
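Ignoring those complications for a moment, a naive sketch of the per-level arithmetic implied by the question's figures (under the same unfounded one-byte-per-character assumption) would be:

#include <cstdio>

int main() {
    // Naive arithmetic from the figures given in the question, ignoring the
    // complications above (metadata, fragmentation, overlap, cache lines).
    const double bytes      = 1000.0 * 50 * 80;  // 4,000,000 characters, assumed 1 byte each
    const double blockBytes = 1000.0;
    const double diskBlockS = 0.010;             // 10 msec per 1000-byte block
    const double ramByteS   = 20e-9;             // 20 nsec per byte
    const double cacheByteS = 2e-9;              // 2 nsec per byte

    std::printf("load from disk : %.1f s\n", (bytes / blockBytes) * diskBlockS); // 40 s
    std::printf("scan from RAM  : %.3f s\n", bytes * ramByteS);                  // 0.08 s
    std::printf("scan from cache: %.3f s\n", bytes * cacheByteS);                // 0.008 s
    return 0;
}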
Once the file is in RAM we have more problems.
Specifically; anyone that understands how caches work will know that "It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM" means that when you access one byte in RAM it takes 18 nsec to transfer a cache line (a group of consecutive bytes) from RAM to cache plus 2 nsec to obtain that 1 byte from cache; and then the next byte you access will have already been transferred to cache (as it's part of "group of consecutive bytes") and will only cost 2 nsec.
After the file's data is loaded into RAM by disk controller; because we don't know the cache line size, we don't know how many of software's accesses will take 20 nsec and how many will take 2 nsec.
The final thing is that we don't actually know anything useful about the caches. The general rule is that the larger a cache is the slower it is; and "large enough to contain the entire book (plus the program's code, stack, parts of the OS, etc) but fast enough to have a 2 nsec access time" is possibly an order of magnitude better than any cache that has ever existed. Essentially, the words "the cache" (in the question) cannot be taken literally as it would be implausible. If we look at anything that's close to being able to provide a 2 nsec access time we see multiple levels - e.g. a fast and small L1 cache with a slower but larger L2 cache (with an even slower but larger L3 cache). To make sense of the question you must assume that "the cache" meant "the set of caches", that the L1 cache has an access time of 2 nsec (but is too small to hold the whole file and everything else), and that other levels of the cache hierarchy (e.g. L2 cache) have unknown slower access times but may be large enough to hold the whole file (and everything else).
Mostly; if I had to guess; I'd assume that the question was taken from a university course (because universities have a habit of tricking students into paying $$$ for worthless "fictional knowledge").

How to prevent SD card from creating write delays during logging?

I've been working on an Arduino (ATMega328p) prototype that has to log data during certain events. An LSM6DS33 sensor is used to generate 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I wish to lower the voltage to 3.3 V. According to the Atmel datasheet, this also means that the clock frequency should be lowered to 8 MHz. Assuming everything will run twice as slowly, a measurement cycle would still be possible because ~8000us < 9615us.
After some testing (still 5 V / 16 MHz), however, it occurred to me that every now and then a write cycle would take ~1880us instead of ~550us. I am using the SdFat library to write to and test SD cards (RawWrite example). The following results came in when I tested the card:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As seen, the average time to write is fairly consistent, but sometimes a peak duration of 10x the average occurs! According to the writer of the library, this is because the SD card needs some erase cycles after a certain number of write cycles, which causes a write delay (src: post #18 & #22). This delay, however, pushes the time required for a cycle out of the available 9615us budget, because the total measurement cycle would then be 10672us.
The data I am trying to write, is first put into a string using sprintf:
// Six tab-separated values (int16 range, so at most 6 characters each) plus
// tabs and the terminator can need up to ~42 bytes, so buf must be larger than 20.
char buf[48] = "";
sprintf(buf,"%li\t%li\t%li\t%li\t%li\t%li",rawData[0],rawData[1],rawData[2],rawData[3],rawData[4],rawData[5]);
myLog.println(buf);
This writes the data to a txt file. But at my sample rate, only 21*104 = 2184 B/s would suffice. Lowering the speed of the RawWrite example to 6 KB/s causes the SD card to write without the extended write delay. Yet my code still gets the delays, even though less data is written.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2kB) with variables. Creating two 512 byte buffers - like suggested in the same forum - is not possible for me.
Initially, I used two strings. Merging them into one didn't affect the write speed significantly.
I don't know how to work around the delay, but I experienced more stable and faster write times when writing to a binary file instead of a ".csv" or ".txt" file.
The following link provides a fine script for writing data as a binary struct to the SD card. (There are some small typos in his example, but they are easily fixed.)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not help you with the time variation, but it might minimize the writing time and thus lessen the timing issue.
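For reference, here is a rough sketch of what the binary-struct approach can look like with the SdFat library; the record layout, the file name "log.bin" and the chip-select pin are assumptions for illustration, not taken from the linked post:

#include <SPI.h>
#include <SdFat.h>

// Hypothetical packed record: six 16-bit readings plus a timestamp (16 bytes).
struct LogRecord {
  uint32_t t_ms;
  int16_t  raw[6];
};

SdFat sd;
SdFile logFile;

void setup() {
  // Chip-select pin is board specific; 10 is just a common default.
  if (!sd.begin(10)) { return; }
  logFile.open("log.bin", O_WRITE | O_CREAT | O_AT_END);
}

void logSample(const int16_t *rawData) {
  LogRecord rec;
  rec.t_ms = millis();
  memcpy(rec.raw, rawData, sizeof(rec.raw));
  // 16 bytes per record instead of ~42 bytes of text, and no sprintf cost.
  logFile.write(&rec, sizeof(rec));
}

void loop() {
  // In the real sketch, sampling and logSample() would be driven by the Timer1 interrupt.
}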

How is the disk seek faster in a column-oriented database?

I have recently started working with BigQuery, and I have come to know that it is a column-oriented database and that disk seeks are much faster in this type of database.
Can anyone explain to me how the disk seek is faster in a column-oriented database compared to a relational (row-oriented) database?
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. Number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (while those tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity).
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector and is usually 4kB. So for every sector read, we only have 4 values. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
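A tiny sketch of the same idea in code (sizes scaled down from the example above; the point is how many bytes each layout has to touch to sum a single column):

#include <cstdio>
#include <cstdint>
#include <vector>

// Toy illustration of the layout difference, shrunk from the example above.
struct Row {                 // "row store": one 1024-byte record per row
  int32_t amount;            // the one column we want to sum
  char    padding[1020];     // stand-in for the other 49 columns
};

int main() {
  const std::size_t rows = 100000;

  std::vector<Row>     rowStore(rows);     // ~100 MB touched to sum one column
  std::vector<int32_t> columnStore(rows);  // ~0.4 MB touched for the same sum

  long long sumRow = 0, sumCol = 0;
  for (const Row& r : rowStore) sumRow += r.amount;  // strides 1024 bytes per value
  for (int32_t v : columnStore) sumCol += v;         // contiguous 4-byte values

  std::printf("row-store bytes touched   : %zu\n", rows * sizeof(Row));
  std::printf("column-store bytes touched: %zu\n", rows * sizeof(int32_t));
  std::printf("sums (both zero here)     : %lld %lld\n", sumRow, sumCol);
  return 0;
}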
The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Why is two-pass merging more efficient than one-pass merging?

I am reading about external sorting on Wikipedia, and I need to understand why two-pass merging is more efficient than single-pass merging.
Wiki: However, there is a limitation to single-pass merging. As the number of chunks increases, we divide memory into more buffers, so each buffer is smaller, so we have to make many smaller reads rather than fewer larger ones.
Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks required to fill the input buffers with data from each of the 500 chunks (we read 100MB / 501 ~ 200KB from each chunk at a time) take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:
Run the initial chunk-sorting pass as before.
Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
Run a second merge pass to merge the 20 larger sorted chunks.
Could anyone give me a simple example to understand this concept well? I am particularly confused about allocating more buffers in two-pass merging.
The issue is random access overhead: average rotational delay is around 4 ms and average seek time around 9 ms, so say 10 ms average access time, versus an average transfer rate of around 150 megabytes per second. So the average access overhead takes about the same time as it does to read or write 1.5 megabytes.
Large I/Os reduce the relative overhead of access time. If data is read or written 10 megabytes at a time, the overhead is reduced to 15%. Using the wiki example of a 100 MB working buffer and 25 chunks to merge, plus one chunk for writes, that's 3.8 megabyte chunks, in which case random access overhead is about 39%.
On most PCs, it's possible to allocate enough memory for a 1 GB working buffer, so 26 chunks at over 40 megabytes per chunk, reducing random access overhead to about 3.75%. 50 chunks would be about 20 megabytes per chunk, with random access overhead around 7.5%.
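All of those percentages come from the same ratio; a minimal sketch of the calculation, using the figures above:

#include <cstdio>

// Relative access overhead = (access time * transfer rate) / chunk size,
// using the figures above: ~10 ms per random access, ~150 MB/s transfer,
// i.e. each access "costs" about as much as transferring 1.5 MB.
double overheadPercent(double chunkMB) {
  const double accessS = 0.010;   // seconds per random access
  const double rateMBs = 150.0;   // megabytes per second
  return 100.0 * (accessS * rateMBs) / chunkMB;
}

int main() {
  std::printf("10 MB chunks     : %.1f%%\n", overheadPercent(10.0));        // 15%
  std::printf("100 MB/26 chunks : %.1f%%\n", overheadPercent(100.0 / 26));  // ~39%
  std::printf("40 MB chunks     : %.2f%%\n", overheadPercent(40.0));        // 3.75%
  std::printf("20 MB chunks     : %.1f%%\n", overheadPercent(20.0));        // 7.5%
  return 0;
}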
Note that even with 50 chunks, it's still a two-pass sort: the first pass uses all of the 1 GB buffer to do a memory sort, then writes sorted 1 GB chunks. The second pass merges all 50 chunks into the sorted file.
The other issue with 50 chunks is that a min-heap type method is needed to keep track of which element of which chunk is the smallest of the 50 and gets moved to the output buffer. After an element is moved to the output buffer, a new element is moved into the heap and the heap is re-heapified, which takes about 2 log2(50) operations, or about 12 operations. This is better than the simple approach of doing 49 compares to determine the smallest element of a group of 50 when doing the 50-way merge. Each heap entry is a structure holding the current element, which chunk it came from (file position or file), the number of elements left in the chunk, ... .
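A minimal sketch of such a k-way merge with a min-heap (in-memory vectors stand in for the sorted chunk files of a real external sort):

#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// The heap holds one (value, chunk) entry per chunk, so picking the next
// output element costs O(log k) instead of k-1 comparisons.
struct Entry {
  int value;
  std::size_t chunk;   // which sorted chunk this value came from
  std::size_t next;    // index of the next unread element in that chunk
  bool operator>(const Entry& o) const { return value > o.value; }
};

std::vector<int> kWayMerge(const std::vector<std::vector<int>>& chunks) {
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (std::size_t c = 0; c < chunks.size(); ++c)
    if (!chunks[c].empty()) heap.push({chunks[c][0], c, 1});

  std::vector<int> out;
  while (!heap.empty()) {
    Entry e = heap.top();
    heap.pop();
    out.push_back(e.value);                       // smallest of all chunk heads
    if (e.next < chunks[e.chunk].size())          // refill from the same chunk
      heap.push({chunks[e.chunk][e.next], e.chunk, e.next + 1});
  }
  return out;
}

int main() {
  std::vector<std::vector<int>> chunks = {{1, 4, 9}, {2, 3, 8}, {5, 6, 7}};
  for (int v : kWayMerge(chunks)) std::printf("%d ", v);  // 1 2 3 4 5 6 7 8 9
  std::printf("\n");
  return 0;
}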

Why Is a Block in HDFS So Large?

Can somebody explain this calculation and give a lucid explanation?
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.
If we keep the ratio seekTime / transferTime small (close to .01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
This is important since in MapReduce jobs we are typically traversing (reading) the whole data set (represented by an HDFS file, folder, or set of folders) and doing logic on it. Since we have to spend the full transferTime anyway to get all the data off the disk, let's try to minimise the time spent doing seeks and read in big chunks, hence the large size of the data blocks.
In more traditional disk access software, we typically do not read the whole data set every time, so we'd rather spend more time doing plenty of seeks on smaller blocks rather than losing time transferring too much data that we won't need.
To put numbers on it: if the same 100 MB were stored as ten 10 MB blocks, reading it all would need 10 seeks, so the total time is 10 * 10ms + 100MB / (100MB/s) = 0.1 s + 1 s = 1.1 s, compared with 10ms + 100MB / (100MB/s) = 1.01 s for a single 100 MB block. The smaller the blocks, the larger the share of the total time that goes to seeking.
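Putting the quoted figures (10 ms seek, 100 MB/s transfer) into a quick sketch of both calculations:

#include <cstdio>

int main() {
  const double seekS  = 0.010;   // 10 ms per seek
  const double rate   = 100.0;   // MB/s transfer rate
  const double fileMB = 100.0;

  // Block size at which seek time is 1% of transfer time.
  std::printf("block size for 1%% seek overhead: %.0f MB\n", seekS * rate / 0.01);  // 100 MB

  // Reading 100 MB stored as one block vs. ten 10 MB blocks.
  std::printf("one 100 MB block: %.2f s\n", seekS + fileMB / rate);       // 1.01 s
  std::printf("ten 10 MB blocks: %.2f s\n", 10 * seekS + fileMB / rate);  // 1.10 s
  return 0;
}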
