Seek Time: the amount of time required to move the read/write head from its current position to the desired track.
I am looking for the formula for the average seek time used in disk scheduling algorithms.
How to find average seek time in disk scheduling algorithms?
The first step is to determine the geometry of the device itself. This is difficult. Modern hard disks cannot be described by the old "cylinders, heads, sectors" triplet: the number of sectors per track differs from track to track (more sectors on outer tracks where the circumference is greater, fewer sectors on inner tracks where the circumference is smaller), and all of the information you can get about the drive (from the device itself, or from any firmware or OS API) is a lie intended to keep legacy software happy.
To work around that you need to resort to "benchmarking tactics". Specifically: read LBA sector 0 and then LBA sector 1 and measure the time it took (to establish a baseline for "time taken when both sectors are in the same track"). Then read LBA sector 0 followed by LBA sector N in a loop (with N starting at 2 and increasing), measuring the time each pair takes, comparing it to the previous value, and looking for a noticeably larger increase that indicates you've found the boundary between "track 0" and "track 1". Then repeat this (starting with the first sector in "track 1") to find the boundary between "track 1" and "track 2", and keep repeating it to build an array of "how many sectors on each track". Note that it is not this simple - there are various pitfalls (e.g. physical sectors larger than logical sectors, sectors that are interleaved on the track, bad block replacement, internal caches built into the disk drive, etc.) that need to be taken into account. Of course this will be excessively time consuming (e.g. you don't want to do this for every disk every time an OS boots), so you'll want to obtain the hard disk's identification (manufacturer and model number) and store the auto-detected geometry somewhere, so that you can skip the auto-detection if the geometry for that model of disk was previously stored.
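A sketch of what that boundary-detection loop might look like is below. The read_lba() primitive and the "more than double the baseline" threshold are placeholders assumed for illustration, not a real API:

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Placeholder for a raw single-sector read with caching disabled; on a real
// system this would go through the OS or driver (e.g. O_DIRECT reads).
static void read_lba(uint64_t /*lba*/) {
    // stub: replace with an actual device read
}

// Time a "read sector a, then sector b" pair, in nanoseconds.
static int64_t time_read_pair(uint64_t a, uint64_t b) {
    const auto t0 = std::chrono::steady_clock::now();
    read_lba(a);
    read_lba(b);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Build a "sectors per track" table by looking for jumps in the pair time.
static std::vector<uint64_t> detect_track_sizes(uint64_t total_sectors) {
    std::vector<uint64_t> sectors_per_track;
    const int64_t same_track_time = time_read_pair(0, 1);  // baseline: both in track 0
    uint64_t track_start = 0;

    for (uint64_t n = 2; n < total_sectors; ++n) {
        const int64_t t = time_read_pair(track_start, n);
        // Crude threshold: "noticeably larger" taken to mean more than double.
        if (t > 2 * same_track_time) {
            sectors_per_track.push_back(n - track_start);  // previous track's size
            track_start = n;                               // first sector of next track
        }
    }
    return sectors_per_track;
}
```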
The next step is to use the information about the real geometry (and not the fake geometry), combined with more "benchmarking tactics", to determine performance characteristics. Ideally you'd be trying to find constants for a formula like expected_time = sector_read_time + rotational_latency + distance_between_tracks * head_travel_time + head_settle_time, which could be done as follows (see the sketch after these steps):
Measure the time to read the first sector in the first track followed by sector N in the first track, for every value of N (every sector in the first track); find the minimum time this can take, divide it by 2, and call it sector_read_time.
Measure the time to read the first sector in the first track followed by sector N in the first track, for every value of N (every sector in the first track); find the maximum time this can take, divide it by the number of sectors in the first track, and call it rotational_latency.
Measure the time to read the first sector in track N followed by the first sector in track N+1, with N ranging from 0 to "max_track - 1"; determine the average and call it time0.
Measure the time to read the first sector in track N followed by the first sector in track N+2, with N ranging from 0 to "max_track - 2"; determine the average and call it time1.
Assume head_travel_time = time1 - time0.
Assume head_settle_time = time0 - head_travel_time - sector_read_time.
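Putting those steps together, the final arithmetic might look something like this (a sketch only; it assumes the measurement vectors were collected as described above and glosses over the pitfalls noted next):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Derived performance constants, all in nanoseconds.
struct DiskTimings {
    int64_t sector_read_time;
    int64_t rotational_latency;
    int64_t head_travel_time;
    int64_t head_settle_time;
};

// first_track_pair_times[n] = time to read sector 0 then sector n of track 0
// adjacent_track_times      = times for "first sector of track N, then track N+1"
// skip_one_track_times      = times for "first sector of track N, then track N+2"
DiskTimings derive_timings(const std::vector<int64_t>& first_track_pair_times,
                           const std::vector<int64_t>& adjacent_track_times,
                           const std::vector<int64_t>& skip_one_track_times) {
    DiskTimings t{};

    const int64_t min_pair =
        *std::min_element(first_track_pair_times.begin(), first_track_pair_times.end());
    const int64_t max_pair =
        *std::max_element(first_track_pair_times.begin(), first_track_pair_times.end());
    t.sector_read_time   = min_pair / 2;
    t.rotational_latency = max_pair / (int64_t)first_track_pair_times.size();

    auto average = [](const std::vector<int64_t>& v) {
        int64_t sum = 0;
        for (int64_t x : v) sum += x;
        return sum / (int64_t)v.size();
    };
    const int64_t time0 = average(adjacent_track_times);  // track distance 1
    const int64_t time1 = average(skip_one_track_times);  // track distance 2

    t.head_travel_time = time1 - time0;
    t.head_settle_time = time0 - t.head_travel_time - t.sector_read_time;
    return t;
}

// The estimate the formula above is aiming for.
int64_t expected_time(const DiskTimings& t, int64_t distance_between_tracks) {
    return t.sector_read_time + t.rotational_latency +
           distance_between_tracks * t.head_travel_time + t.head_settle_time;
}
```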
Note that there are various pitfalls with this too (same as before), and (if you work around them) the best you can hope for is a generic estimate (and not an accurate predictor).
Of course this will also be excessively time consuming, and if you're storing the auto-detected geometry somewhere it'd be a good idea to also store the auto-detected performance characteristics in the same place; so that you can skip all of the auto-detection if all of the information for that model of disk was previously stored.
Note that all of the above assumes "stand alone rotating platter hard disk with no caching and no hybrid/flash layer" and will be completely useless for a lot of cases. For some of the other cases (SSD, CD/DVD) you'd need different techniques to auto-detect their geometry and/or characteristics. Then there's things like RAID and virtualisation to complicate things more.
Mostly, it's far too much hassle to bother with in practice.
Instead, just assume that cost = abs(previous_LBA_sector_number - next_LBA_sector_number), and/or let the hard disk work out the optimum order itself (e.g. using Native Command Queuing - see https://en.wikipedia.org/wiki/Native_Command_Queuing ).
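If you do schedule requests yourself, that simple cost model is usually all that's justified. For example, a shortest-seek-first style pick over pending requests might look like this (a sketch; the pending-request list being a plain vector of LBAs is an assumption for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Approximate cost model: just the distance in logical sectors.
static uint64_t cost(uint64_t previous_lba, uint64_t next_lba) {
    return previous_lba > next_lba ? previous_lba - next_lba
                                   : next_lba - previous_lba;
}

// Pick the pending request "closest" to the last one serviced
// (a shortest-seek-time-first style choice using the LBA-distance model).
// Assumes `pending` is non-empty.
static size_t pick_next(const std::vector<uint64_t>& pending, uint64_t previous_lba) {
    size_t best = 0;
    for (size_t i = 1; i < pending.size(); ++i) {
        if (cost(previous_lba, pending[i]) < cost(previous_lba, pending[best]))
            best = i;
    }
    return best;
}
```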
High-performance malloc implementations often implement segregated free lists, that is, each of the more common (smaller) sizes gets its own separate free list.
A first attempt at this could say, below a certain threshold, the size class is just the size divided by 8, rounded up. But actual implementations have more nuance, arranging the recognized size classes on something like an exponential curve (but gentler than simply doubling at each step), e.g. http://jemalloc.net/jemalloc.3.html
I'm trying to figure out how to convert a size to a size class on some such curve. Now, in principle this is not difficult; there are several ways to do it. But to achieve the desired goal of speeding up the common case, it really needs to be fast, preferably only a few instructions.
What's the fastest way to do this conversion?
In the dark ages, when I used to worry about those sorts of things, I just iterated through all the possible sizes starting at the smallest one.
This actually makes a lot of sense, since allocating memory strongly implies work outside of the actual allocation -- like initializing and using that memory -- that is proportional to the allocation size. In all but the smallest allocations, that overhead will swamp whatever you spend to pick a size class.
Only the small ones really need to be fast.
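For example, the "just iterate" approach amounts to little more than this (a sketch; the size-class table here is an arbitrary example chosen for illustration):

```cpp
#include <cstddef>

// Arbitrary example table of size classes, smallest first.
static const size_t kSizeClasses[] = {8, 12, 16, 24, 32, 48, 64, 96, 128,
                                      192, 256, 384, 512, 768, 1024,
                                      1536, 2048, 3072, 4096};

// Linear scan: return the index of the smallest class that fits `size`,
// or -1 if the request is too large for the small-object path.
static int size_to_class(size_t size) {
    for (size_t i = 0; i < sizeof(kSizeClasses) / sizeof(kSizeClasses[0]); ++i) {
        if (size <= kSizeClasses[i]) return (int)i;
    }
    return -1;  // fall back to the large-object allocator
}
```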
Let's assume you want to use all the power-of-two sizes plus the halfway point between each pair, i.e. 8, 12, 16, 24, 32, 48, 64, etc. ... up to 4096.
Check that the value is less than or equal to 4096; I have arbitrarily chosen that as the largest allocatable size for this example.
Take the size of your struct, use the position of the highest set bit times two, and add 1 if the next bit is also set; that gives you an index into the size list. Add one more if the value is higher than the value those two bits alone would give. This should be 5-6 ASM instructions.
So for 26 = 16 + 8 + 2, the set bits are at positions 4, 3 and 1; the index is 4*2 + 1 (the next bit is set) + 1 (26 is larger than 16 + 8 = 24) = 10, so the 10th index is chosen, which is the 32-byte list.
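A sketch of that computation is below. The index labelling, the minimum size of 8, and the use of the GCC/Clang __builtin_clz intrinsic are choices made for illustration; the answer above describes the same trick in terms of raw instructions:

```cpp
#include <cassert>
#include <cstdint>

// Map an allocation size (1..4096) to a size-class index for the classes
// 8, 12, 16, 24, 32, 48, 64, ... 4096. Index layout assumed here:
//   index 2*h     -> size 2^h
//   index 2*h + 1 -> size 2^h + 2^(h-1)
static unsigned size_to_class(uint32_t size) {
    assert(size >= 1 && size <= 4096);
    if (size < 8) size = 8;  // assumed minimum allocation size

    // Position of the highest set bit (x86 would use BSR/LZCNT).
    const unsigned h = 31u - (unsigned)__builtin_clz(size);
    const unsigned next_bit = (size >> (h - 1)) & 1u;  // is the next-lower bit set?
    unsigned index = 2u * h + next_bit;

    // Round up if the size exceeds what those two bits alone represent.
    const uint32_t two_bit_value = (1u << h) | (next_bit << (h - 1));
    if (size > two_bit_value) ++index;
    return index;
}

// Example from the answer above: size_to_class(26) == 10, the 32-byte list.
```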
Your system might require some minimum allocation size.
Also, if you're doing a lot of allocations, consider using a pool allocator that is private to your program, with backup from the system allocator.
I executed a linear search on an array containing all unique elements in the range [1, 10000], sorted in increasing order, using every search value from 1 to 10000, and plotted the runtime against the search value.
Upon closely analysing a zoomed-in version of the plot, I found that the runtime for some larger search values is smaller than for some lower search values, and vice versa.
My best guess is that this phenomenon is related to how data is processed by the CPU using primary memory and cache, but I don't have a firm, quantifiable reason to explain it.
Any hint would be greatly appreciated.
PS: The code was written in C++ and executed on a Linux platform hosted on a virtual machine with 4 vCPUs on Google Cloud. The runtime was measured using the C++ chrono library.
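The exact benchmark code isn't shown; presumably it was something of this shape (a minimal sketch using std::chrono::steady_clock; the array contents, loop structure and output format are assumptions):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Plain linear search; returns the index of `key` or -1 if not found.
static int linear_search(const std::vector<int>& a, int key) {
    for (size_t i = 0; i < a.size(); ++i)
        if (a[i] == key) return (int)i;
    return -1;
}

int main() {
    const int n = 10000;
    std::vector<int> a(n);
    for (int i = 0; i < n; ++i) a[i] = i + 1;  // 1..10000, sorted ascending

    for (int key = 1; key <= n; ++key) {
        const auto t0 = std::chrono::steady_clock::now();
        volatile int idx = linear_search(a, key);  // volatile: keep the call from being optimized away
        const auto t1 = std::chrono::steady_clock::now();
        (void)idx;
        const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("%d %lld\n", key, (long long)ns);  // search value, runtime in ns
    }
    return 0;
}
```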
CPU cache size depends on the CPU model, and there are several cache levels, so your experiment should take all those factors into account. L1 cache is usually 8 KiB, which is about 4 times smaller than your 10000-element array. But I don't think this is cache misses. L2 latency is about 100 ns, which is much smaller than the difference between the lowest line and the second line, which is about 5 µs. I suppose this (the second line-cloud) is contributed by context switching. The longer the task, the more probable it is that a context switch occurs. This is why the cloud on the right side is thicker.
Now for the zoomed-in figure. As Linux is not a real-time OS, its time measurement is not very reliable. IIRC its minimal reporting unit is the microsecond. Now, if a certain task takes exactly 15.45 microseconds, then its reported duration depends on when it started. If the task started at exactly zero on the internal clock, the time reported would be 15 microseconds. If it started when the internal clock was 0.1 microseconds in, then you will get 16 microseconds. What you see on the graph is the approximation of an analogue straight line onto a discrete-valued axis. So the task duration you get is not the actual task duration, but the real value plus the task's start offset within the microsecond (which is uniformly distributed, ~U[0,1]), all rounded to the closest integer value.
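A quick way to see that effect is to simulate the rounding model just described (a toy sketch; the 15.45 µs duration and the uniform start offset are the illustrative values from above):

```cpp
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    // Model: a task really takes 15.45 us, the clock reports whole microseconds
    // (rounded to nearest), and the start offset within the current microsecond
    // is uniform in [0, 1).
    const double true_duration_us = 15.45;
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> offset(0.0, 1.0);

    int reported15 = 0, reported16 = 0;
    for (int i = 0; i < 100000; ++i) {
        const double start = offset(rng);
        const long long reported =
            std::llround(start + true_duration_us) - std::llround(start);
        if (reported == 15) ++reported15; else ++reported16;
    }
    // Roughly 55% of runs report 15 us and 45% report 16 us; the mean stays ~15.45 us.
    std::printf("reported 15 us: %d, reported 16 us: %d\n", reported15, reported16);
    return 0;
}
```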
I've been asked to program a solution for the 2016 ICPC World Finals problem L, as a homework.
I've managed to successfully create and run a program that solves the problem in question. However, I'm curious if there is a better algorithm for solving the problem.
Problem L is as follows:
A set of n hard drives needs to be reformatted. When drive i is reformatted, its capacity will go from a_i GB to b_i GB. All the hard drives are initially full to the brim. In order to store the data from hard drive x while it is being reformatted, you are allowed to buy an extra hard drive. Any drive that has been reformatted can be used to store more data (assuming it has the capacity to do so). The data from hard drive x can be split among other hard drives. Write a program that computes the minimum amount of extra space required in order to reformat the hard drives without losing data.
The algorithm I'm currently using is the same one as in this YouTube video.
The algorithm follows these steps:
1. Separate the hard drives that lose capacity upon being reformatted from those that maintain or increase their capacity upon being reformatted.
2. Sort the new lists by initial capacity of the hard drives. The ones that lose capacity (called the "shrink list") are sorted from biggest to smallest, and the others (called the "grow list") are sorted from smallest to biggest.
3. Starting with the first member of the grow list, check its starting capacity and assign that value to swap (the accumulated space required to successfully swap the data around).
4. After reformatting the hard drive, assign to extra the amount of capacity gained from reformatting the first hard drive (this first value will always be at least 0).
5. Compare the initial capacity of the next hard drive to be reformatted (hard drive x) with swap + extra. If swap + extra is at least equal to the initial capacity of hard drive x, then the hard drive can be reformatted.
6. If swap + extra is smaller than the initial capacity of hard drive x, increase the value of swap until you have enough capacity to make the swap, and reformat drive x.
7. Add to extra the difference between the final and initial capacities of hard drive x, even if it is a negative value.
8. Return to step 5 until all members of the grow list are reformatted.
9. Once all members of the grow list have been reformatted, return to step 5, but now use the shrink list instead.
Is there a better/faster way to solve this problem?
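For comparison, here is a compact sketch of the commonly used greedy for this problem. One detail to note: it sorts the shrink list by final (post-reformat) capacity in decreasing order rather than by initial capacity, which may differ from the steps described above; otherwise it is the same simulation, with the swap/extra bookkeeping folded into a running free-space total plus a running amount of extra space bought. The input format (n followed by n pairs a b) is assumed:

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    int n;
    if (std::scanf("%d", &n) != 1) return 0;

    // Split drives into those that keep/gain capacity and those that shrink.
    std::vector<std::pair<long long, long long>> grow, shrink;  // {a_i, b_i}
    for (int i = 0; i < n; ++i) {
        long long a, b;
        std::scanf("%lld %lld", &a, &b);
        (b >= a ? grow : shrink).push_back({a, b});
    }

    // Grow list: smallest initial capacity first.
    std::sort(grow.begin(), grow.end());
    // Shrink list: largest FINAL capacity first (note: not initial capacity).
    std::sort(shrink.begin(), shrink.end(),
              [](const std::pair<long long, long long>& x,
                 const std::pair<long long, long long>& y) { return x.second > y.second; });

    long long freeSpace = 0;  // space currently available to park data on
    long long extra = 0;      // total extra capacity bought so far (the answer)

    auto process = [&](const std::vector<std::pair<long long, long long>>& drives) {
        for (const auto& d : drives) {
            const long long a = d.first, b = d.second;
            if (freeSpace < a) {         // not enough room to hold this drive's data
                extra += a - freeSpace;  // buy just enough extra space
                freeSpace = a;
            }
            freeSpace += b - a;          // a GB of free space now holds the moved data;
                                         // the reformatted drive contributes b GB
        }
    };
    process(grow);
    process(shrink);

    std::printf("%lld\n", extra);
    return 0;
}
```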
I am using TensorFlow to build a CNN for an image classification experiment, and I found the following phenomenon:
Operation 1: tf.nn.conv2d(x, [3,3,32,32], strides=[1,1,1,1], padding='SAME')
The shape of x is [128,128,32], meaning a convolution with a 3x3 kernel on x where both the input channels and output channels are 32; the total number of multiplications is
3*3*32*32*128*128=150994944
Operation 2: tf.nn.conv2d(x, [3,3,64,64], strides=[1,1,1,1], padding='SAME')
The shape of x is [64,64,64], meaning a convolution with a 3x3 kernel on x where both the input channels and output channels are 64; the total number of multiplications is
3*3*64*64*64*64=150994944
In contrast with operation 1, the feature map size in operation 2 is scaled down by half (per side) and the channel count is doubled. The number of multiplications is the same, so the running time should be the same. But in practice the running time of operation 1 is longer than that of operation 2.
My measurement method is shown below:
Eliminating one convolution of operation 1 reduced the training time for one epoch by 23 seconds, which means the running time of operation 1 is 23 seconds.
Eliminating one convolution of operation 2 reduced the training time for one epoch by 13 seconds, which means the running time of operation 2 is 13 seconds.
The phenomenon reproduces every time.
My GPU is an NVIDIA GTX 980 Ti, and the OS is Ubuntu 16.04.
So here comes the question: why is the running time of operation 1 longer than that of operation 2?
If I had to guess, it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that if you have a tensor of shape [128, 128, 32], the 32 features/channels are stored next to each other, then all of the rows, then all of the columns. https://en.wikipedia.org/wiki/Row-major_order
Accessing closely packed memory is very important to performance, especially on a GPU, which has a large memory bus and is optimized for aligned, in-order memory access. In the case with the larger image you have to skip around the image more and the memory access is more out of order. In case 2 you can do more in-order memory access, which gives you more speed. Multiplications are very fast operations. I bet that with a convolution, memory access is the bottleneck which limits performance.
chasep255's answer is good and probably correct.
Another possibility (or alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that can speed up memory fetches, address mapping, etc) could be producing what you see...
You have basically two things: a stream of X input data and a static filter matrix. In case 1, you have 9*1024 = 9216 static filter elements (3*3*32*32); in case 2 you have 4 times as many (3*3*64*64 = 36864). Both cases have the same total multiplication count, but in case 2 the process is finding more of its data where it expects it (i.e. where it was the last time it was asked for). Net result: fewer memory access stalls, more speed.
Assume we have an HDD with a throughput of 100 MB/s and a seek time of 10 ms.
How long will it take to read 50 MB from the disk?
Friend's answer:
Read Time = Seek Time + 50 MB / Throughput = 10 ms + 500 ms = 510 ms.
My concerns:
First of all, my understanding is that an HDD contains platters, platters contain sectors, and sectors contain tracks. The right sector needs to be brought under the head (i.e. rotational latency), and then the head needs to move between the tracks in that sector (i.e. seek time) to read the data. Correct me if I am wrong here.
Now,
a) Doesn't this answer assume that the data lies in contiguous sectors?
Because if the data were fragmented, the disk would be required to rotate to bring the right sector under the head; shouldn't that add rotational latency as well?
b) Also, does seek time mean the time from when the data was requested to the time when the data was delivered to the CPU, or does it mean the time for the head to move between the tracks within a sector? The definition keeps varying.
c) My answer for worst case:
Assume the sector size is 512 bytes.
Read Time = (50 MB / 512 bytes) * rotational delay + Seek Time + 50 MB / Throughput.
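For concreteness, here is a small sketch that evaluates both estimates with the given numbers. The rotational delay is not given in the question, so the 4.17 ms value below is an assumption (the average rotational latency of a 7200 RPM drive, i.e. half of an 8.33 ms revolution), and 50 MB is taken as 50 * 10^6 bytes:

```cpp
#include <cstdio>

int main() {
    // Given in the question.
    const double throughput_mb_per_s = 100.0;  // MB/s
    const double seek_time_ms = 10.0;          // ms
    const double data_mb = 50.0;               // MB to read (taken as 50 * 10^6 bytes)

    // Friend's answer: one seek plus one contiguous transfer.
    const double transfer_ms = data_mb / throughput_mb_per_s * 1000.0;          // 500 ms
    std::printf("contiguous estimate: %.0f ms\n", seek_time_ms + transfer_ms);  // 510 ms

    // Worst-case formula from the question: every 512-byte sector pays a rotational
    // delay. The 4.17 ms value is an assumed average for a 7200 RPM drive.
    const double sector_bytes = 512.0;
    const double rotational_delay_ms = 4.17;
    const double sectors = data_mb * 1e6 / sector_bytes;  // ~97,656 sectors
    const double worst_ms = sectors * rotational_delay_ms + seek_time_ms + transfer_ms;
    std::printf("worst-case estimate: %.0f ms (about %.1f minutes)\n",
                worst_ms, worst_ms / 60000.0);
    return 0;
}
```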