In the mode 0 timer (13 bit) of 8051, the entire 8 bits of TH and the lower 5 bits if TL are used. Then the highest value possible for the timer should be 0FF1F H but in many sources, it is given as 1FFF H. Why?
1FFF H suggests the maximum value till timer in 13 bit mode can go. In 13 bit mode, significant 13 bits of the counter are : TH[7:0]-TL[7:3].
Hope it helps.
Related
I've being looking for this for a lot of time now (over 3 days) without luck. Maybe one of you guys can tell me how can I solve it.
Consider you have a computer with a 16-bit size address and a byte addressable memory. The cache is 2-way set-associative mapped, write-back policy and a perfect LRU replacement strategy. Cache has an overhead of 4352 bits. What's the size of the block?
Very few resources talk about overhead and the ones I've found only relate it to total cache size. The problem is I only know how to calculate cache size with #blocks or at least with the fields of the address properly defined (which I have not being able to do for this problem since I can't calculate the size of the tag.).
Any help would be appreciated.
So, here's how I read this question:
Overhead bits are the bits that don't count toward the actual data that is being cached. They are bits that track maintenance state of the cache, and help the cache implement hits, write back, and eviction policy. To some way of looking at it, if one byte is being cached (8 bits) how many non-data bits are in the cache to help manage that (or at least for all the actual data bits how many non-data/overhead bits are there).
This is mathematical, so I hope I haven't made an error, but even if I have maybe you can see your way through the reasoning.
Let's derive some additional information:
A write-back policy means the cache needs to store "dirty" information for each data block: dirty is 1-bit: yes, dirty -or- no, clean.
For 2-way set associative cache, a "perfect" LRU algorithm is also 1 bit (yes: first block -or- no: second block) but this 1 bit costs per index position (i.e. per line) — not per block as there are two blocks per index.
What we don't know is if there is a valid bit, which would also be per data block, but most caches I see in coursework have the valid bits, so we might assume they have it.
And lastly, there's the tag bits where tag bits are: however many bits are leftover in the address space bits after accounting for index bits and block offset bits.
So, a formula for overhead might be:
overhead in bits = index positions * (1 x LRU bit + block overhead bits)
where block overhead bits = 2 [ways] * (1 x Dirty bit + 1 x Valid bit + tag bits)
We also know that tag bits = address space bits - index bits - block bits
So, we have:
4352 [overhead in bits] = index positions * (1 + 2 * (2 + tag bits))
-and-
tag bits = address space bits - index bits - block offset bits
-and-
index positions = 2index bits
-and-
We also know that the number of tag, index, and block offset bits has to be an integer (no fractions of bits).
So, we can begin to reduce those two formulas by substituting:
4352 = index positions * (1 + 2 * (2 + address space bits - index bits - block bits)
by reduction also then:
4352 = 2index bits * (1 + 2 * (2 + 16 - index bits - block bits)
Solving for block bits we have:
-((4352/2index bits - 1)/2 - 18 + index bits) = block bits
I don't know how to solve this directly mathematically, given the constraint that the variables must be integers, so, instead of solving directly, simply try/search different values:
If index bits is 7 then by this formula, block bits is fractional, so that doesn't work.
If index bits is 9 then by this formula, block bits is fractional, so that doesn't work.
No other values between 0 and 16 result in an integer number of bits, except:
If index bits is 8 then by this formula, block bits is 2, so:
16 = tag bits + 8 + 2, meaning tag bits is 6, index bits is 8, and block offset is 2.
Since block offset is 2 then block size is 22.
First of all, sorry for bad English since my English skill is not that good...
Before the question, I want to explain my situation to help understanding.
I want to use EEPROM as a kind of counter.
The value of that counter would be increased very frequenty so I should consider endurance problem.
My idea is, write counter value on multiple address alternatively so cell wearing is reduced by N.
for example, if I use 5x area for counting,
Count 1 -> 1 0 0 0 0
Count 2 -> 1 2 0 0 0
Count 3 -> 1 2 3 0 0
Count 4 -> 1 2 3 4 0
Count 5 -> 1 2 3 4 5
Count 6 -> 6 2 3 4 5
...
So cell endurance can be extended by a factor of N.
However, AFAIK, for current NAND flash, data erase/write is done by a group of bytes, called block. So, if all the bytes are within single write/erase block, my method would not work.
So, my main question : Does erase/write operation of EEPROM of PIC is done by a group of bytes? or done by a single word or byte?
For example, if it is done by a group of 8-bytes, then I should make 8-byte offset between each counter value to make my method properly work.
Otherwise, if it is done by a byte or a word, I don't have to consider about spacing/offset.
From datasheet PIC24FJ256GB110 section 5.0:
The user may write program memory data in blocks of 64 instructions
(192 bytes) at a time, and erase program memory in blocks of 512
instructions (1536 bytes) at a time.
However you can overwrite individual block several times if you left the rest of block erased (bits are one) and the privius content rest the same. Remember: you can clear single bit in block only ones.
How much will decerease the data retention after 8 writes in to single FLASH block I don't know!
In the documentation to MMIX machine mmix-doc page 3 paragraph 4:
We use the notation to stand for a number consisting of
consecutive bytes starting at location . (The notation
means that the least significant t bits of k are set to
0, and only the least 64 bits of the resulting address are retained.
...
The notation M2t[k] is just a formal symbolism to express an address divisible by 2t.
This is confirmed just after the definition
All accesses to 2t-byte quantities by MMIX are aligned, in the
sense that the first byte is a multiple of 2t.
Most architectures, specially RISC ones, require a memory access to be aligned, this means that the address must be a multiple of the size accessed.
So, for example, reading a 64 bits word (an octa in MMIX notation) from memory require the address to be divisible by 8 because MMIX memory is byte addressable(1) and there are 8 bytes in an octa.
If all the possible data sizes are power of two we see a pattern emerge:
Multiples of Multiples of Multiples of
2 4 8
0000 0000 0000
0010 0100 1000
0100 1000
0110 1100
1000
1010
1100
1110
Multiples of 2 = 21 have the least bit always set to zero(2), multiples of 4 = 22 have the the two least bits set to zero, multiples of 8 = 23 have the three least bits set to zero and so on.
In general multiples of 2t have the least t bits set to zero.
You can formally prove this by induction over t.
A way to align a 64 bit number (the size of the MMIX address space) is to clear its lower t bits, this can be done by performing an AND operation with a mask of the form
11111...1000...0
\ / \ /
64 - t t
Such mask can be expressed as 264 - 2t.
264 is a big number for an example, lets pretend the address space is only 25.
Lets say we have the address 17h or 10111b in binary and lets say we want to align it to octas.
Octas are 8 bytes, 23 so we need to clear the lower 3 bits and preserve the other 2 bits.
The mask to use is 11000b or 18h in hexadecimal. This number is 25-23 = 32 - 8 = 24 = 18h.
If we perform the boolean AND between 17h and 18h we get 10h which is the aligned address.
This explains the notation k ∧ (264 − 2t) used short after, the "wedge" symbol ∧ is a logic AND.
So this notation just "pictures" the steps necessary to align the address k.
Note that the notation k ∨ (2t − 1) is also introduced, this is the complementary, ∨ is the OR and the whole effect is to have the lower t bits set to 1.
This is the greatest address occupied by an aligned access of size 2t.
The notation itself is used to explain the endianess.
If you wonder why aligned access are important, it has to do with hardware implementation.
Long story short the CPU interface to the memory has a predefined size despite the memory being byte addressable, say 64 bits.
So the CPU access the memory in blocks of 64 bits each one starting at an address multiple of 64 bits (i.e. aligned on 8 bytes).
Accessing an unaligned location may require the CPU to perform two access:
CPU reading an octa at address 2, we need bytes at 2, 3, 4 and 5.
Address 0 1 2 3 4 5 6 7 8 9 A B ...
\ / \ /
A B
CPU read octa at 0 (access A) and octa at 4 (access B), then combines the two reads.
RISC machine tends to avoid this complexity and entirely forbid unaligned access.
(1) Quoting: "If k is any unsigned octabyte, M[k] is a 1-byte
quantity".
(2) 20 = 1 is the only odd power of two, so you can guess that by removing it we only get even numbers.
I was trying to verify the single precision peak performance of a reference GT200 card.
From http://www.realworldtech.com/gt200/9/, we have two facts about GT200 –
The latency of the fastest operation for an SP core is 4 cycles.
The SFU takes 4 cycles too to finish an operation.
Now, each SM has a total of 8 SPs and 2 SFUs, with each SFU having 4 FP multiply units and these SPs and SFUs can work at the same time as they are on two different ports as explained in their SM level diagrams. Each SP can perform MAD operation.
So, we are looking at 8 MAD operations and 8 MUL operations per 4 SP cycles. This gives us 16 + 8 = 24 operations per 4 SP clock cycles as MAD counts as 2 operations. Since 2 SP clock cycle counts as one shader clock, we have 24/2 = 12 operations per shader clock.
For a reference GT200 card, shader clock = 1296 MHz/s.
Thus, the single precision peak performance must be = 1296 MHz/s * 30 SM * 12 operations per shader clock = 466.560 GFLOPS
This is exactly half of the GFLOPS as reported in the specs. So where am I going wrong?
Edit: After Robert’s pointer to the CUDA Programming Guide that says 8MADs/shader clock can be performed in a GT200 SM, I would have to question how latency and throughput relate to each other in this particular SM.
There is a latency of one OP / 4 SP cycles (as pointed out earlier), thus one MAD every 4 SP cycles, right? We have 8 SPs, so it becomes 8 MADs for every 4 SP cycles in an SM.
Since 2 SP cycles form one shader cycle, so we are left with => 8MADs per 2 shader clock cycles
=> 4 MADs per shader clock.
This doesn’t match with the 8MADs/shader clock from the Programming Guide.
So, what am I doing wrong again?
Latency and throughput are not the same thing.
A cc 1.x SM can retire 8 single precision floating point MAD operations on every clock cycle.
This is the correct formula:
1296 MHz(cycle/s) * 30 SM * (8 SP/SM * 2 flop/cycle per SP + 2 SFU/SM * 4 FPU/SFU * 1 flop/cycle per FPU)
= 622080 Mflop/s + 311040 Mflop/s = 933 GFlop/s single precision
From here
EDIT: The 4-cycle latency you're referring to is the latency of a warp (i.e. 32 threads) MAD instruction, as issued to the SM, not the latency of a single MAD operation on a single SP. The FPU in each SP can generate one MAD result per clock, and there are 8 SP's in one SM, so each SM can generate 8 MAD results per clock. Since a warp (32 threads) MAD instruction requires 32 MAD results, it requires 4 total clocks to complete the warp instruction, as issued to the SPs in the SM.
The FPU in the SP can generate one new MAD result per clock. From the standpoint of instruction issue, the fundamental unit is the warp. Therefore a warp MAD instruction requires 4 clocks to complete.
EDIT2: Responding to question below.
Preface: The FPUs in the SFU are not independently schedulable. They only come into play when an instruction is scheduled to the SFUs. There are 4 FPU per SFU, and an SFU instruction requires 16 cycles (since there are 2 SFU/SM) to complete for a warp. If all 4 FPU in both SFUs were fully utilized, that would be 128 (16x4x2) flops produced during the computation of the SFU instruction, in 16 cycles. This is added to the 256 (16x2x8) total flops that could be generated by the "regular" MAD FPUs in the SM during the same time (16 cycles).
Your question seems to be interpreting the observed benchmark result and this statement in the text:
Table III also shows that the throughput for single-precision
floating point multiplication is 11.2 ops/clock, which means
that multiplication can be issued to both the SP and SFU
units. This suggests that each SFU unit is capable of doing
2 multiplications per cycle, twice the throughput of other
(more complex) instructions that map to this unit.
as an indication of either the throughput of the FPUs in the SFU or else the number of FPUs in the SFU. However you are conflating benchmark data with a theoretical number. The SFU has 4 FPU, but this does not mean that all 4 are independently schedulable for arbitrary arithmetic or instruction streams. Seeing all 4 FPU take on a new floating point instruction in a given cycle may require a specific instruction sequence that the authors haven't used.
I need to create a VHDL code for this situation:
**Draw a control circuit that generates a pulse signal with:
fixed working frequency (100 KHz)
variable working cycle
The phase difference should be increased or decreased by the direction of the spin of a rotary control of 8 bits.**
Additional info:
D = t (on) / T
D = working cycle
t (on) = Time the activated signal lasts (rotary control of 8 bits)
T = signal period (constant)
You seem to be wanting to generate a mark:space ratio of between 1:255 and 255:1, so you will need a clock frequency of 256 * 100kHz.
An 8 bit incrementing counter can be left free-running clocked at that rate.
Now have a flop that is SET when the counter overflows from X'FF to X'00 and that CLEARS when the timer value makes the transition from N-1 to N. Where N is the 8 bit value on your duty cycle setting control and controls the width of the mark.
The threshold controlled flop's output is your variable duty cycle 100kHz.