Scan time of cache, RAM and disk? - memory-management

Consider a computer system that has cache memory, main memory (RAM), and disk, and whose OS uses virtual memory. It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM, and 10 msec to access a block of 1000 bytes from the disk. If a book has 1000 pages, each with 50 lines of 80 characters, how long will it take to electronically scan the text when the master copy is at each level of the memory hierarchy, as one proceeds down the hierarchy (from inboard memory to offline storage)?

If a book has 1000 pages, each with 50 lines of 80 characters, then the book has 1000 * 50 * 80 = 4000000 characters. We don't know how big a character is (it could be UTF-8, where different characters are different sizes), we don't know how much metadata there is (e.g. if it's a word-processor file with extra information in a header about page margins and tab stops, plus more data for fonts and styles, etc.), and we don't know whether there's any extra processing (e.g. compression of pieces within the file).
If we make the unfounded assumption that the file happens to be 4000000 bytes, then we might say that it occupies 4000 blocks (of 1000 bytes each) on disk.
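For what it's worth, the naive textbook arithmetic (which everything below picks apart) looks like this - a quick Python sketch assuming 1 byte per character and no overlap, caching effects or overheads of any kind:

FILE_BYTES = 1000 * 50 * 80            # 4,000,000 characters, assumed 1 byte each
BLOCK_BYTES = 1000

cache_time = FILE_BYTES * 2e-9         # 2 nsec per byte from cache
ram_time = FILE_BYTES * 20e-9          # 20 nsec per byte from RAM
disk_time = (FILE_BYTES // BLOCK_BYTES) * 10e-3   # 10 msec per 1000-byte block

print(f"cache: {cache_time * 1e3:.0f} ms")   # 8 ms
print(f"RAM:   {ram_time * 1e3:.0f} ms")     # 80 ms
print(f"disk:  {disk_time:.0f} s")           # 40 s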
Then we get into trouble. A CPU can't access data on disk (it can only access data in RAM or cache); so the data needs to be loaded into RAM (e.g. by a disk controller) before the CPU can access it.
If it takes the disk controller 10 msec to access a block of 1000 bytes from disk, then we might say it will take at least 10 msec * 4000 = 40000 msec = 40 seconds to read the whole file into RAM. However this would be wrong - the disk controller (acting on requests from file system support code) will have to find the file first (e.g. read directory info, etc.), and the file may be fragmented, so the disk controller will need to read (and then follow) a list of where the pieces of the file are.
Of course while the CPU is scanning the first part of the file the disk controller can be reading the last part of the file; either because the software is designed to use asynchronous IO or because the OS detected a sequential access pattern and started pre-fetching the file before the program asked for it. In other words, the ideal case is that when the disk controller finishes loading the last block the CPU has already scanned the first 3999 blocks and only has to scan 1 more; and the worst case (where the disk controller and CPU never do anything at the same time) becomes "40 seconds to load the file into RAM plus however long it takes the CPU to scan the data in RAM".
Of course we also don't know things like whether the file is actually loaded 1 block at a time (if it's split into 400 transfers of 10 blocks each, then the "ideal case" gets worse because the CPU would have to scan the last 10 blocks after they're loaded, not just the last one), or how many reads the disk controller does before a pre-fetcher detects the sequential pattern.
Once the file is in RAM we have more problems.
Specifically; anyone who understands how caches work will know that "It takes 2 nsec to access a byte from the cache, 20 nsec to access a byte from RAM" means that when you access one byte in RAM it takes 18 nsec to transfer a cache line (a group of consecutive bytes) from RAM to cache plus 2 nsec to obtain that 1 byte from cache; and the next byte you access will already have been transferred to cache (as it's part of the same group of consecutive bytes) and will only cost 2 nsec.
After the file's data is loaded into RAM by the disk controller, because we don't know the cache line size, we don't know how many of the software's accesses will take 20 nsec and how many will take 2 nsec.
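For example, if we pull a cache line size out of thin air (say 64 bytes - the question never states one), the arithmetic would look like this:

# Hypothetical: assume a 64-byte cache line (not given in the question).
# The first byte of each line costs 20 nsec (RAM); the rest cost 2 nsec (cache).
LINE_BYTES = 64
FILE_BYTES = 4_000_000

line_fills = FILE_BYTES // LINE_BYTES     # one 20 nsec access per cache line
cached_hits = FILE_BYTES - line_fills     # the remaining bytes come from cache

scan_ns = line_fills * 20 + cached_hits * 2
print(f"average cost per byte: {scan_ns / FILE_BYTES:.2f} nsec")   # ~2.28 nsec
print(f"scan time once in RAM: {scan_ns / 1e6:.1f} ms")            # ~9.1 ms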
The final thing is that we don't actually know anything useful about the caches. The general rule is that the larger a cache is, the slower it is; and a cache that is "large enough to contain the entire book (plus the program's code, stack, parts of the OS, etc.) but fast enough to have a 2 nsec access time" is possibly an order of magnitude better than any cache that has ever existed. Essentially, the words "the cache" (in the question) cannot be taken literally, as that would be implausible. If we look at anything that comes close to providing a 2 nsec access time we see multiple levels - e.g. a fast and small L1 cache, a slower but larger L2 cache, and an even slower but larger L3 cache. To make sense of the question you must assume that "the cache" means "the set of caches", that the L1 cache has an access time of 2 nsec (but is too small to hold the whole file and everything else), and that the other levels of the cache hierarchy (e.g. L2 cache) have unknown slower access times but may be large enough to hold the whole file (and everything else).
Mostly; if I had to guess; I'd assume that the question was taken from a university course (because universities have a habit of tricking students into paying $$$ for worthless "fictional knowledge").

Related

Writing a full cache line at an uncached address before reading it again on x64

On x64, if you write the contents of a full cache line to a previously uncached address, and then soon after read from that address again, can the CPU avoid having to read the old contents of that address from memory?
Effectively it shouldn't matter what the contents of the memory were previously, because the full cache line's worth of data was completely overwritten. I can understand that if it was a partial cache line write to an uncached address, followed by a read, then it would incur the overhead of having to synchronise with main memory, etc.
Looking at documentation regarding write allocate, write combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically, what happens on normally mapped memory (i.e., WB memory regions) is that the store(s) will add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time while an RFO request is made to pull that line into cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from address 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
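As a toy model of the containment rule described above (not of any particular CPU - as noted, older CPUs failed to forward even some fully contained reads), the check is just interval containment:

# Toy model of the condition discussed above: a load can be forwarded from
# the most recent overlapping store only if the loaded byte range is fully
# contained in the stored byte range.
def can_forward(store_addr, store_size, load_addr, load_size):
    return (store_addr <= load_addr and
            load_addr + load_size <= store_addr + store_size)

print(can_forward(10, 4, 10, 4))   # True  - same address and size
print(can_forward(10, 4, 12, 2))   # True  - smaller read, fully contained
print(can_forward(10, 4, 9, 4))    # False - partial overlap, the load stalls
print(can_forward(10, 4, 10, 8))   # False - read bigger than the write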
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well-covered, with pretty pictures, on stuffedcow and Agner also covers it well in his microarchitecture guide.
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read from the same address under certain conditions. Store forwarding is one clock cycle faster than on previous processors. A memory write followed by a read from the same address takes 4 clock cycles in the best case for operands of 32 or 64 bits, and 5 clock cycles for other operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an operand of any size crosses a cache line boundary, i.e. an address divisible by 64 bytes.
A write followed by a smaller read from the same address has little or no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of 1 - 3 clocks when the read is offset but fully contained in the address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both of the two halves or the four quarters, etc., has little or no penalty. A partial read that does not fit into the halves or quarters can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both written and unwritten bytes, takes approximately 11 clock cycles extra.
The last case, where the read is bigger than the write, is definitely a case where the store forwarding stalls. The quote of 11 cycles probably applies to the case where all of the involved bytes are in L1 - but in the case where some bytes aren't cached at all (your scenario), it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.

How is the disk seek faster in a column oriented database?

I have recently started working on BigQuery; I have come to know that it is a column oriented database and that disk seeks are much faster in this type of database.
Can anyone explain to me how the disk seek is faster in a column oriented database compared to a relational DB?
The big difference is in the way the data is stored on disk.
Let's look at an (over)simplified example:
Suppose we have a table with 50 columns, some are numbers (stored binary) and others are fixed width text - with a total record size of 1024 bytes. The number of rows is around 10 million, which gives a total size of around 10GB - and we're working on a PC with 4GB of RAM. (While such tables are usually stored in separate blocks on disk, we'll assume the data is stored in one big block for simplicity.)
Now suppose we want to sum all the values in a certain column (integers stored as 4 bytes in the record). To do that we have to read an integer every 1024 bytes (our record size).
The smallest amount of data that can be read from disk is a sector, which is usually 4kB. So for every sector read, we only get 4 of the values we need. This also means that in order to sum the whole column, we have to read the whole 10GB file.
In a column store on the other hand, data is stored in separate columns. This means that for our integer column, we have 1024 values in a 4096 byte sector instead of 4! (and sometimes those values can be further compressed) - The total data we need to read now is around 40MB instead of 10GB, and that will also stay in the disk cache for future use.
It gets even better if we look at the CPU cache (assuming data is already cached from disk): one integer every 1024 bytes is far from optimal for the CPU (L1) cache, whereas 1024 integers in one block will speed up the calculation dramatically (those will be in the L1 cache, which is around 50 times faster than a normal memory access).
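Putting rough numbers on the example (same simplified assumptions: 10 million rows of 1024 bytes, a 4-byte integer column, 4kB sectors):

# Rough read volumes for the simplified example above - illustrative only.
ROWS = 10_000_000
ROW_BYTES = 1024
INT_BYTES = 4
SECTOR_BYTES = 4096

row_store_read = ROWS * ROW_BYTES   # one useful value per row, so read it all
col_store_read = ROWS * INT_BYTES   # the column is stored contiguously

print(f"values per sector, row store:    {SECTOR_BYTES // ROW_BYTES}")   # 4
print(f"values per sector, column store: {SECTOR_BYTES // INT_BYTES}")   # 1024
print(f"row store reads    ~{row_store_read / 1e9:.0f} GB")              # ~10 GB
print(f"column store reads ~{col_store_read / 1e6:.0f} MB")              # ~40 MB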
The "disk seek is much faster" is wrong. The real question is "how column oriented databases store data on disk?", and the answer usually is "by sequential writes only" (eg they usually don't update data in place), and that produces less disk seeks, hence the overall speed gain.

Why Is a Block in HDFS So Large?

Can somebody explain this calculation and give a lucid explanation?
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.
If we keep the ratio seekTime / transferTime small (close to .01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
This is important since in MapReduce jobs we are typically traversing (reading) the whole data set (represented by an HDFS file, folder, or set of folders) and doing logic on it. Since we have to spend the full transferTime anyway to get all the data off the disk, let's try to minimise the time spent doing seeks and read in big chunks - hence the large size of the data blocks.
In more traditional disk access software, we typically do not read the whole data set every time, so we'd rather spend more time doing plenty of seeks on smaller blocks rather than losing time transferring too much data that we won't need.
If the 100MB were instead divided into 10 blocks, you would have to do 10 seeks, while the transfer rate stays 100MB/s.
(10ms * 10) + (100MB / 100MB/s) = 1.1 sec, which is greater than the 1.01 sec for a single 100MB block.
Looked at per block: since the 100MB is divided among 10 HDFS blocks, each block holds only 10MB, so reading one block takes 10ms + 10MB/(100MB/s) = 0.01s + 0.1s = 0.11s, and seek time becomes roughly 10% of the read instead of the intended 1%.
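To see where the ~100 MB figure comes from, here is the seek-to-transfer ratio for a few block sizes, using the same 10 ms seek and 100 MB/s transfer rate as the question:

# Seek overhead as a fraction of the total read time per block,
# assuming a 10 ms seek and a 100 MB/s transfer rate (as in the question).
SEEK_S = 0.010
TRANSFER_MB_PER_S = 100

for block_mb in (1, 10, 64, 100, 128):
    total_s = SEEK_S + block_mb / TRANSFER_MB_PER_S
    print(f"{block_mb:4d} MB block: {total_s:.3f} s/block, "
          f"seek is {SEEK_S / total_s:.1%} of the read")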

How does direct mapped cache work?

I am taking a System Architecture course and I have trouble understanding how a direct mapped cache works.
I have looked in several places and they explain it in a different manner which gets me even more confused.
What I cannot understand is what is the Tag and Index, and how are they selected?
The explanation from my lecture is:
"Address divided is into two parts
index (e.g 15 bits) used to address (32k) RAMs directly
Rest of address, tag is stored and compared with incoming tag. "
Where does that tag come from? It cannot be the full address of the memory location in RAM since it renders direct mapped cache useless (when compared with the fully associative cache).
Thank you very much.
Okay. So let's first understand how the CPU interacts with the cache.
There are three layers of memory (broadly speaking): cache (generally made of SRAM chips), main memory (generally made of DRAM chips), and storage (generally magnetic, like hard disks). Whenever the CPU needs data from some particular location, it first searches the cache to see if it is there. The cache lies closest to the CPU in the memory hierarchy, hence its access time is the lowest (and its cost the highest), so if the data the CPU is looking for can be found there, it constitutes a 'hit', and the data is obtained from there for use by the CPU. If it is not there, then the data has to be moved from main memory into the cache before the CPU can access it (the CPU generally interacts only with the cache), which incurs a time penalty.
So to find out whether the data is in the cache or not, various mapping schemes are applied. One of them is the direct mapped cache method. For simplicity, let's assume a memory system with 10 cache locations available (numbered 0 to 9) and 40 main memory locations available (numbered 0 to 39).
There are 40 main memory locations, but only up to 10 can be accommodated in the cache at a time. So, by some means, the incoming request from the CPU needs to be redirected to a cache location. That raises two problems:
How to redirect? Specifically, how to do it in a predictable way which will not change over time?
If the cache location is already filled with some data, the incoming request from the CPU has to determine whether the address it requires data from is the same as the address whose data is stored in that location.
In our simple example, we can redirect by a simple logic. Given that we have to map 40 main memory locations numbered serially from 0 to 39 to 10 cache locations numbered 0 to 9, the cache location for a memory location n can be n%10. So 21 corresponds to 1, 37 corresponds to 7, etc. That becomes the index.
But 37, 17, 7 all correspond to 7. So to differentiate between them, we need the tag. Just as the index is n%10, the tag is int(n/10). So now 37, 17, 7 will have the same index 7, but different tags of 3, 1, 0, etc. That is, the mapping can be completely specified by these two pieces of data - tag and index.
So now if a request comes for address location 29, it will translate to a tag of 2 and an index of 9. The index corresponds to the cache location number, so cache location no. 9 will be queried to see if it contains any data, and if so, whether the associated tag is 2. If yes, it's a cache hit and the data will be fetched from that location immediately. If the location is empty, or the tag is not 2, it means that it contains data corresponding to some other memory address and not 29 (although it will have the same index, which means it holds data from an address like 9, 19, 39, etc.). So it is a cache miss, and the data from location no. 29 in main memory will have to be loaded into the cache at location 9 (with the tag changed to 2, and whatever data was there before evicted), after which it will be fetched by the CPU.
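The same toy scheme (10 slots, index = n % 10, tag = n // 10) written out as a small Python simulation:

# Toy direct mapped cache for the decimal example above:
# 10 cache slots, index = address % 10, tag = address // 10.
SLOTS = 10
cache = [None] * SLOTS                  # each slot holds (tag, data) or None

def access(addr, memory):
    index, tag = addr % SLOTS, addr // SLOTS
    slot = cache[index]
    if slot is not None and slot[0] == tag:
        return "hit", slot[1]
    cache[index] = (tag, memory[addr])  # miss: load block, evict old contents
    return "miss", memory[addr]

memory = {n: f"data{n}" for n in range(40)}
print(access(29, memory))   # ('miss', 'data29') - goes to slot 9 with tag 2
print(access(29, memory))   # ('hit', 'data29')
print(access(39, memory))   # ('miss', 'data39') - same slot 9, tag 3 evicts tag 2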
Let's use an example. A 64 kilobyte cache, with 16 byte cache-lines, has 4096 different cache lines.
You need to break the address down into three different parts.
The lowest bits tell you the byte within a cache line once you get the line back; this part isn't directly used in the cache lookup. (bits 0-3 in this example)
The next bits are used to INDEX the cache. If you think of the cache as a big column of cache lines, the index bits tell you which row you need to look in for your data. (bits 4-15 in this example)
All the other bits are TAG bits. These bits are stored in the tag store for the data you have stored in the cache, and we compare the corresponding bits of the cache request to what we have stored to figure out if the data we are caching is the data being requested.
The number of bits you use for the index is log_base_2(number_of_cache_lines) [it's really the number of sets, but in a direct mapped cache, there are the same number of lines and sets]
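For the 64 KB cache with 16 byte lines, splitting an address into tag/index/offset is just shifting and masking (a small Python sketch of the example above):

# Address breakdown for the example above: 16-byte lines -> 4 offset bits,
# 4096 lines -> 12 index bits, and whatever is left over is the tag.
OFFSET_BITS = 4
INDEX_BITS = 12

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), hex(index), hex(offset))   # 0x1234 0xabc 0xd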
A direct mapped cache is like a table that has rows, also called cache lines, and at least 2 columns: one for the data and one for the tags.
Here is how it works: a read access to the cache takes the middle part of the address, called the index, and uses it as the row number. The data and the tag are looked up at the same time.
Next, the tag needs to be compared with the upper part of the address to decide if the line is from the same address range in memory and is valid. At the same time, the lower part of the address can be used to select the requested data from cache line (I assume a cache line can hold data for several words).
I emphasized that the data access and the tag access+compare happen at the same time, because that is the key to reducing latency (the purpose of a cache). The data path RAM access doesn't need to be two steps.
The advantage is that a read is basically a simple table lookup and a compare.
But it is direct mapped, which means that for every read address there is exactly one place in the cache where this data could be cached. So the disadvantage is that a lot of other addresses map to the same place and may compete for this cache line.
I have found a good book at the library that has offered me the clear explanation I needed and I will now share it here in case some other student stumbles across this thread while searching about caches.
The book is "Computer Architecture - A Quantitative Approach" 3rd edition by Hennesy and Patterson, page 390.
First, keep in mind that the main memory is divided into blocks for the cache.
If the blocks are 64 bytes and we have 1 GB of RAM, the RAM would be divided into 1 GB / 64 B = 16,777,216 blocks (each block the size of one cache block).
From the book:
Where can a block be placed in a cache?
If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The destination block is calculated using this formula: <RAM Block Address> MOD <Number of Blocks in the Cache>
So, let's assume we have 32 blocks of RAM and 8 blocks of cache.
If we want to store block 12 from RAM to the cache, RAM block 12 would be stored into Cache block 4. Why? Because 12 / 8 = 1 remainder 4. The remainder is the destination block.
If a block can be placed anywhere in the cache, the cache is said to be fully associative.
If a block can be placed anywhere in a restricted set of places in the cache, the cache is set associative.
Basically, a set is a group of blocks in the cache. A block is first mapped onto a set and then the block can be placed anywhere inside the set.
The formula is: <RAM Block Address> MOD <Number of Sets in the Cache>
So, let's assume we have 32 blocks of RAM and a cache divided into 4 sets (each set having two blocks, meaning 8 blocks in total). This way set 0 would have blocks 0 and 1, set 1 would have blocks 2 and 3, and so on...
If we want to store RAM block 12 into the cache, the RAM block would be stored in the Cache blocks 0 or 1. Why? Because 12 / 4 = 3 remainder 0. Therefore set 0 is selected and the block can be placed anywhere inside set 0 (meaning block 0 and 1).
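For completeness, the two placement formulas applied to the book's numbers:

# Placement formulas from the examples above: 32 RAM blocks, and a cache of
# either 8 blocks (direct mapped) or 4 sets of 2 blocks (set associative).
ram_block = 12

print("direct mapped   -> cache block", ram_block % 8)   # 4
print("set associative -> set", ram_block % 4)           # 0 (block 0 or 1)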
Now I'll go back to my original problem with the addresses.
How is a block found if it is in the cache?
Each block frame in the cache has an address. Just to make it clear, a block has both address and data.
The block address is divided into multiple pieces: Tag, Index and Offset.
The tag is used to find the block inside the cache, the index only shows the set in which the block is situated (making it quite redundant) and the offset is used to select the data.
By "select the data" I mean that in a cache block there will obviously be more than one memory locations, the offset is used to select between them.
So, if you want to imagine a table, these would be the columns:
TAG | INDEX | OFFSET | DATA 1 | DATA 2 | ... | DATA N
Tag would be used to find the block, index would show in which set the block is, offset would select one of the fields to its right.
I hope that my understanding of this is correct, if it is not please let me know.

Which is faster to process a 1TB file: a single machine or 5 networked machines?

Which is faster to process a 1TB file: a single machine or 5 networked machines? ("To process" refers to finding the single UTF-16 character with the most occurrences in that 1TB file). The rate of data transfer is 1Gbit/sec, the entire 1TB file resides in 1 computer, and each computer has a quad core CPU.
Below is my attempt at the question using an array of longs (with array size of 2^16) to keep track of the character count. This should fit into memory of a single machine, since 2^16 x 2^3 (size of long) = 2^19 = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency times cited by Jeff Dean, and I tried my best to use the best approximations that I knew of. The final answer is:
Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
2) 5 Networked Machines
a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8bits/B x 1s/Gbit
= 8,192s = 137m = 2.3hr
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill other machine's array w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8bits/B x 1s/Gbit
= 2^20B/2 x 2^2 x 2^3 bits/B x 1s/2^30bits
= 2^24/2^30 s = (2^-6)s
e) Time to merge arrays
--> 0 sec (since they can be merged while receiving)
f) Total Time
Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
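For reference, the arithmetic behind those totals (same assumptions as above: 20 ms per sequentially-read MB, a 1 Gbit/s link, and reading/sending treated as sequential rather than overlapped):

# Back-of-envelope version of the totals above.
FILE_MB = 1024 * 1024            # 1 TB
READ_S_PER_MB = 0.020            # 20 ms to read 1 MB sequentially from disk
LINK_GBIT_PER_S = 1.0

single = FILE_MB * READ_S_PER_MB                      # ~20,972 s

# Five machines: the source still reads the whole file, and also ships
# 4/5 of it over the 1 Gbit/s link.
send = 0.8 * (FILE_MB * 8 / 1024) / LINK_GBIT_PER_S   # ~6,554 s
five = single + send

print(f"single machine: {single / 3600:.1f} h")       # ~5.8 h
print(f"five machines:  {five / 3600:.1f} h")         # ~7.6 h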
This is not an answer but just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains 550 Gsyms, and because nothing is said about their expected frequency, you would need a count array of at least 64-bit integers (that is, 8 bytes/element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes, or just 512 KiB, and not the 4 GiB you had miscalculated. It would only take ≈4.3 ms to send this data over a 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but they are not widely supported). Also this array size fits perfectly in the CPU cache.
You have grossly overestimated the time it would take to process the data and extract the frequency and you have also overlooked the fact that it can overlap disk reads. In fact it is so fast to update the frequency array, which resides in the CPU cache, that the computation time is negligible as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive and hence you would still need the full 5.8 hrs to read the data in the single machine case.
In fact, this is an example of the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write speed.
You only need to send 0.8 TB if your source machine is part of the 5.
It may not even make sense sending the data to other machines. Consider this:
In order for the source machine to send the data, it must first hit the disk to read the data into main memory before it sends the data over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.
So under the assumption that loading to CPU cache is much less expensive than disk to memory or data over network (which is true, unless you are dealing with alien hardware), then you are better off just doing it on the source machine, and the only place splitting up the task makes sense is if the "file" is somehow created/populated in a distributed way to start with.
So you should only count the disk read time of a 1 TB file, with a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern is optimal since it is sequential, so you only take one cache miss per piece of data.
The primary point here is that disk is the primary bottleneck which overshadows everything else.

Resources