I need to determine the size of a logical volume and print it. GetDiskFreeSpaceEx returns the size as a 64-bit number(?). What can I do with it?
You can do whatever you want with it; however, it's a bit awkward to do calculations with in masm32. You should be able to fill any other data structure which uses 64-bit integers. It is also possible to do some arithmetic on 64 bits, such as division, by loading the value into EDX:EAX (load the first 4 bytes into EAX and the next 4 into EDX). However, beware that overflow is possible here, which needs to be handled or avoided.
If you just want to print out the size of the volume using this function you can just invoke the C run-time library printf function:
invoke crt_printf,chr$("GetDiskFreeSpaceEx, total bytes: %I64d%c"),
dqTotalBytes,10
However, as the manual says, "To determine the total number of bytes on a disk or volume, use IOCTL_DISK_GET_LENGTH_INFO." The previous code only tells you how many bytes are available to the current user.
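For completeness, here is a rough C sketch of the IOCTL_DISK_GET_LENGTH_INFO route the manual points at; the volume path "\\.\C:" is only an example, and opening a volume handle like this usually needs administrator rights:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    /* Example volume; adjust as needed. Opening it typically requires admin rights. */
    HANDLE h = CreateFileA("\\\\.\\C:", GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    GET_LENGTH_INFORMATION info;
    DWORD returned;
    if (DeviceIoControl(h, IOCTL_DISK_GET_LENGTH_INFO, NULL, 0,
                        &info, sizeof(info), &returned, NULL))
        printf("Total bytes: %I64d\n", info.Length.QuadPart);

    CloseHandle(h);
    return 0;
}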
Why is the register length (in bits) that a CPU operates on not dynamically/manually/arbitrarily adjustable? Would it make the computer slower if it was adjustable this way?
Imagine you had an 8-bit integer. If you could adjust the CPU register length to 8 bits, the CPU would only have to go through the first 8 bits instead of extending the 8-bit integer to 64 bits and then going through all 64 bits.
At first I thought you were asking whether it was possible to have a CPU with no definitive register size. That makes no sense, since the number and size of the registers are a physical property of the hardware and cannot be changed.
However, some architectures let the programmer work on a smaller part of a register or pair registers together.
x86 does both, for example: add al, 9 uses only the low 8 bits of the 64-bit rax, and div rbx pairs rdx:rax to form a 128-bit dividend.
The reason this scheme is not more widespread is that it comes with a lot of trade-offs.
More registers mean more bits needed to address them; simply put, longer instructions.
Longer instructions mean lower code density, more complex decoders, and lower performance.
Furthermore, most elementary operations, like the logical ones, addition, and subtraction, already operate on a full register in a single cycle.
Finally, one execution unit can handle only one instruction at a time; we cannot issue eight 8-bit additions to a 64-bit ALU simultaneously.
So there wouldn't be any improvement in either latency or throughput.
Accessing partial registers is useful for the programmer to stretch the number of available registers: for example, if an algorithm works with 16-bit data, the programmer could use a single physical 64-bit register to store four items and operate on them independently (but not in parallel).
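As a rough illustration of that packing idea in C (treat it as a sketch of the bit manipulation itself; whether the compiler actually keeps the value in one physical register is up to the register allocator):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Pack four 16-bit items into one 64-bit value, item 0 in the low bits. */
    uint16_t items[4] = { 100, 200, 300, 400 };
    uint64_t packed = 0;
    for (int i = 0; i < 4; i++)
        packed |= (uint64_t)items[i] << (16 * i);

    /* Update item 2 independently of the others: clear its lane, then insert. */
    uint16_t new2 = 999;
    packed &= ~((uint64_t)0xFFFF << 32);
    packed |= (uint64_t)new2 << 32;

    printf("item 2 is now %u\n", (unsigned)((packed >> 32) & 0xFFFF));
    return 0;
}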
ISAs with variable-length instructions can also benefit from using partial registers, because that usually means smaller immediate values; for example, an instruction that sets a register to a specific value usually has an immediate operand that matches the size of the register being loaded (though RISC ISAs usually sign-extend or zero-extend it).
Architectures like ARM (and presumably others) support half-precision floats. The idea is to do what you were speculating about and what @Margaret explained: with half-precision floats you can pack two float values into a single register, thereby using less bandwidth at the cost of reduced precision.
What is the best practice for accessing a changing 32-bit register (like a counter) through a 16-bit data bus?
I suppose I have to 'freeze' or copy the 32-bit value on a read of the LSB until the MSB is also read, and vice versa on a write, to avoid data corruption if the LSB overflows into the MSB between the two accesses.
Is there a standard approach to this?
As suggested in both the question and Morten's answer, a second register to hold the value at the time of the read of the first half is a common method. In some MCUs this register is shared between multiple devices, meaning you need to either disable interrupts across the two accesses or ensure ISRs don't touch the extra register. Writes are handled similarly, frequently in the opposite order (write the second word to temporary storage, then write the first word to the device, which triggers the device to latch the second word at the same time).
There have also been cases where you just can't access the register atomically. In such cases, you might need to implement additional logic to figure out the true value. An example of such an algorithm, assuming three reads take much less than 1<<15 counter ticks, might be:
earlyMSB = highreg;   /* read the high half first          */
midLSB  = lowreg;     /* then the low half                 */
lateMSB = highreg;    /* then the high half a second time  */
/* If the low half is in its lower range, any wrap has already happened, so the
   later high read is the consistent one; otherwise use the earlier one. */
fullword = ((midLSB<0x8000 ? lateMSB : earlyMSB)<<16) | midLSB;
Other variants might use an overflow flag to signal that the more significant word needs an increment (frequently used to implement that part of the counter in software).
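A common shape of that software-extension approach, sketched in C with made-up register and ISR names (the details depend entirely on the MCU):

#include <stdint.h>

#define TIMER_COUNT (*(volatile uint16_t *)0x40000010) /* hypothetical 16-bit counter register */

static volatile uint16_t high_word; /* upper 16 bits maintained in software */

void timer_overflow_isr(void) /* hooked to the counter's overflow flag/interrupt */
{
    high_word++;
}

uint32_t read_counter32(void)
{
    uint16_t hi1, lo, hi2;
    do { /* retry if an overflow slipped in between the reads */
        hi1 = high_word;
        lo  = TIMER_COUNT;
        hi2 = high_word;
    } while (hi1 != hi2);
    return ((uint32_t)hi1 << 16) | lo;
}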
There is no standard way, but an often-used approach is to make a read of one address return the first 16 bits while the remaining 16 bits are captured at the same time, to be read later from another address.
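From the software side, and with hypothetical register names, that latch-on-first-read scheme looks roughly like this in C:

#include <stdint.h>

/* Hypothetical addresses: reading COUNTER_LO also captures the upper half
   into COUNTER_HI_LATCH in hardware, so the two halves stay consistent. */
#define COUNTER_LO       (*(volatile uint16_t *)0x40000020)
#define COUNTER_HI_LATCH (*(volatile uint16_t *)0x40000022)

uint32_t read_counter(void)
{
    uint16_t lo = COUNTER_LO;       /* this read latches the high half */
    uint16_t hi = COUNTER_HI_LATCH; /* read the value captured above   */
    return ((uint32_t)hi << 16) | lo;
}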
I have created a prototype VM in Java (as it is the language I am the most comfortable with) and I am trying to store the instructions in a bytecode format. I am wondering how I can store values in bytecode, since bytes can only be 0 through 255.
As an example:
push 4752
Push would have the opcode value of 0.
But how could I store 4752? It doesn't fit into a single byte.
I could store values in 4 bytes, therefore allowing them to be 32-bit integers, but then I would have to decide whether to load an opcode (1 byte) or a value (4 bytes). Currently I pass the program as an integer array and the VM loops through the array and executes opcodes. If an opcode requires a value it takes it from the array and then increases the program counter to skip the value, so that it doesn't get executed.
I have tried to work out how virtual machines like the JVM do this, but I was unable to find out.
The JVM has several options that allow a smaller encoding of the cases expected to be more frequent, and thus on average a smaller encoding of methods and classes. Specifically, see the following instructions under
https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5 (or se8, but IIRC none of the basic arithmetic/computation instructions changed between 7 and 8, only one or two of the invoke instructions):
iconst_<i> are individual opcodes that push the specific values "m1" (-1) through 5
bipush pushes the next byte from the instruction stream (sign-extended)
sipush pushes the next two bytes from the instruction stream (sign-extended)
ldc or ldc_w pushes a four-byte value from the constant pool, selected by a one- or two-byte index in the instruction stream
Your example value 4752 fits in two bytes and would use sipush.
To extend your question, long (64-bit or 8-byte) values in the JVM are mostly created by pushing an int and then widening it, or by pushing the value from a long variable or field (or method return). There is one instruction, ldc2_w, to push a 2-cell (8-byte) value from the constant pool, and two special instructions for the frequent cases, lconst_0 and lconst_1, for 0 and 1.
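If it helps, here is a small C sketch of how a byte-oriented interpreter loop might decode a sipush-style operand from a flat byte array; the opcode numbers here are made up, but the two operand bytes are combined big-endian and sign-extended just as the JVM's sipush does:

#include <stdint.h>
#include <stdio.h>

enum { OP_HALT = 0, OP_PUSH8 = 1, OP_PUSH16 = 2 }; /* made-up opcode numbers */

int main(void)
{
    /* "push 4752" encoded as a one-byte opcode plus a big-endian 16-bit operand */
    uint8_t code[] = { OP_PUSH16, 0x12, 0x90, OP_HALT }; /* 0x1290 == 4752 */
    int32_t stack[64];
    int sp = 0, pc = 0;

    for (;;) {
        uint8_t op = code[pc++];
        if (op == OP_HALT) break;
        if (op == OP_PUSH8) {
            stack[sp++] = (int8_t)code[pc++];                      /* sign-extend one byte  */
        } else if (op == OP_PUSH16) {
            int16_t v = (int16_t)((code[pc] << 8) | code[pc + 1]); /* big-endian operand    */
            pc += 2;
            stack[sp++] = v;                                       /* sign-extend two bytes */
        }
    }
    printf("top of stack: %d\n", stack[sp - 1]);
    return 0;
}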
I am studying an old exam in preparation for an upcoming one, and the final questions are about what the title describes. Now, I am familiar with assembly language instructions and I somewhat know what the code means. But what the exam question actually wants me to do is confusing. I would really appreciate it if someone could explain this question.
The question:
I am given a cache memory which has room for 512 bytes and every row is 8 bytes long. The memory is direct-mapped and an "address" is 32 bits long. Also, the cache memory is empty from the start.
After that, I get some instructions and am supposed to explain whether each one results in a cache hit or a cache miss. It should also be assumed that the instructions are all sequential and all data that is added/modified in an instruction still exists for the next instruction.
The instructions I get are
movia r8, 0xBEDA12C4
ldw r10, 0(r8)
ldw r11, 8(r8)
stw r10, 16(r8)
ldw r10, 24(r8)
ldw r18, 32(r8)
Now I would really appreciate if someone could explain the details to me:
The cache-memory has room for a total of 512 bytes. What is this? Is it the total memory the cache is able to store? Also, I heard from somewhere that this is how you calculate rows in cache. For example, 512 bytes of memory and every row is 16 bytes. 512/16 = 32 rows in cache. For this example 512/8 = 64 rows. Which one is it? What does this mean!?
It also states that every row is 16 bytes long. I've seen the example with TAG, ROW, BYTE where they try to illustrate the cache. But how do I understand the 16 bytes per row? At least it doesn't seem to be part of the TAG, ROW, BYTE breakdown. What is this for?
Direct-mapped cache. I understand this somewhat. It's just a big array of slots, in order, which are either empty or not, yeah? I found some information on this here.
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
*Updated link: https://web.archive.org/web/20150213025748/http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Memory/direct.html
Now to the main part. How do I calculate for each instruction if it will be a cache miss or hit? My guess is that the first instruction ought to be a miss, since the question said that the cache memory is empty from the start. The second instruction also must be a cache miss but from this point on I am not sure how to calculate if the instruction generates a cache hit or miss. To be honest, I am not even sure what a hit would be.
I would really appreciate if someone could show me how to calculate each step and how I know whether an instruction creates a cache hit or miss. The instructions we get for calculating this are really confusing. Thank you so much!
Generally you have to look at the cache as a separate memory space with only 512 bytes, addressable, readable, and writable as arrays (rows) of 8 bytes each. If you need byte 2, the row address is 0; you read the whole 8-byte row and select byte 2 from it. If you need byte 8, the row address is 1, and you select byte 0 from that row. Such a small memory has one huge advantage: it is fast. On its own it can hold only the contents of a small part of some larger memory space, for example the first 512 bytes. If you store something at address 1 of the larger memory space, it goes to that smaller memory instead, as row 0, offset 1, internally to that small amount of memory. If you access beyond that, for example address 1000, you have to wait longer. Used this way it would just be memory-mapped "registers", which would actually be faster and better in some cases than a "cache"; unfortunately, processor makers generally won't let you use the cache in that way (probably for marketing and support reasons, to sell other products as a separate market segment at a higher price).
If you add some extra space to each row to store another value, you can keep part of the address there. Without hardware support you could store virtually anything there; that second part is called the tag. Now if you have some address such as 0xFFFFF000, you can read the tag space (assuming you have the means to do so) at row 0: for simplicity and speed the row is obtained from the main-memory address by keeping bits 3..8 (bits 0..2 give the offset within the 8-byte row), and the tag is checked at that row. One bit in the tag may indicate whether anything is stored there at all; the other bits hold the upper part of the main-memory address. When you want to cache something there, you set the bit indicating that the row is not "empty", store the upper bits of the main-memory address in the tag, and copy the 8 bytes from main memory. The next time you read something within that range, you check the tag of the corresponding row first and then decide whether to read from slow main memory or from the smaller but faster memory (the latter being a cache hit).
If you then write to an address that is some multiple of 512 bytes away in main memory (and so maps to the same row), you would have to take the already-mentioned 8-byte row, copy the whole 8 bytes back into main memory, write what you want into the very same cell, and then update the tag with the new address. But you would lose the previous copy of your data in the smaller (but faster) memory area. If you need the previous value again (any of those 8 bytes), you would have to copy it from main memory once more (a cache miss).
The same goes for all the other rows of that "cache" memory. So we have a sequence of checking the cache and reading, writing, or copying data to or from main memory.
That is called 1-way associativity (direct mapping); for 2 ways there would be one more identical array of 512 bytes, which can hold different addresses (still spaced by multiples of 512 in main memory). The tags of those two arrays can be checked simultaneously, and if either array has a copy of that memory range, it can return it instead of reading from main memory. Without tag checking (which costs extra cycles), the "cache" is essentially just a small amount of memory.
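If it helps to see the arithmetic for the exam's accesses, here is a small C sketch that splits each address into offset/row/tag for the parameters stated at the top of the question (512-byte direct-mapped cache with 8-byte rows, hence 64 rows) and tracks hits and misses; it models only the row/tag bookkeeping, not any particular CPU:

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 8   /* bytes per row                */
#define NUM_LINES 64  /* 512 bytes / 8 bytes per row  */

int main(void)
{
    uint32_t base = 0xBEDA12C4;
    uint32_t offsets[] = { 0, 8, 16, 24, 32 };            /* the five ldw/stw accesses */
    uint32_t tags[NUM_LINES];
    int valid[NUM_LINES] = { 0 };                         /* cache starts out empty    */

    for (int i = 0; i < 5; i++) {
        uint32_t addr   = base + offsets[i];
        uint32_t offset = addr % LINE_SIZE;               /* bits 0..2  */
        uint32_t index  = (addr / LINE_SIZE) % NUM_LINES; /* bits 3..8  */
        uint32_t tag    = addr / (LINE_SIZE * NUM_LINES); /* bits 9..31 */

        int hit = valid[index] && tags[index] == tag;
        printf("addr 0x%08X: row %2u, offset %u -> %s\n",
               (unsigned)addr, (unsigned)index, (unsigned)offset, hit ? "hit" : "miss");

        valid[index] = 1;   /* on a miss the row is filled (assuming the store also allocates) */
        tags[index]  = tag;
    }
    return 0;
}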
Sorry in advance if I have some of this wrong. I may edit to correct later if it's not too disruptive.
When multiple variables are declared in adjacent memory, as I understand it, on a very low level, registers are created that encapsulate a number of bytes, commonly 1, 2, 4 or 8. This allows those bit ranges to be binary rotated, as well as treated by the processor as numbers and so mutated with simple mathematics such as add, subtract, multiply and divide.
There may be abstraction reasons for not overlapping these ranges, but as many languages consider instructions to occur in a well-defined sequential order that the coder will be aware of, are there any performance reasons not to overlap one or more of them in adjacent bytes of allocated memory?
For example, in a block of allocated memory where every bit starts as 0, bytes 0 to 3 could be used as an int, as well as bytes 1 to 4. The first could be set to a value before the second range is multiplied by 3.
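To pin that scenario down in C (just a sketch of the memory picture; the overlapping access goes through memcpy to sidestep alignment and strict-aliasing rules, and a real compiler would likely keep such values in registers):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    unsigned char buf[5] = { 0 };    /* five bytes of "allocated memory", all zero      */
    uint32_t a, b;

    a = 0x01020304;                  /* treat bytes 0..3 as one 32-bit value            */
    memcpy(buf + 0, &a, 4);

    memcpy(&b, buf + 1, 4);          /* treat bytes 1..4 as a second, overlapping value */
    b *= 3;
    memcpy(buf + 1, &b, 4);

    for (int i = 0; i < 5; i++)
        printf("%02x ", buf[i]);
    printf("\n");
    return 0;
}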
If there are performance reasons not to, are they outweighed by otherwise having to copy values in and out of completely new variables and perform more complicated processes to achieve certain algorithms that could otherwise be done at a very low level?
There is nothing wrong with this trick when it is done in assembly: optimizers have been routinely making use of knowing where parts of an integer are to save CPU cycles and reduce the size of the code. For example, when a 32-bit integer variable is initialized to a value that fits in only 16 bits, optimizing compilers would replace the instruction that stores a 32-bit value in memory with a faster instruction that stores a 16-bit value to the lower bits of the variable, and clear the upper 16 bits. Moreover, many optimizers would go even further: if a constant is divisible by 2^16, they would store the value divided by 2^16 to the upper 16 bits, and clear lower 16 bits.
Some architectures restrict such manipulations to addresses of certain properties, for example, by requiring all 4-byte memory load/store instructions to be done at addresses divisible by four. These restrictions may reduce applicability of partial-value writing tricks.
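As a concrete, hand-written illustration of the kind of partial store described above (assuming a little-endian target, where the first two bytes of the variable are its low 16 bits):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    uint32_t v = 0;                  /* upper 16 bits are already clear          */
    uint16_t low = 1234;             /* a constant that fits in 16 bits          */

    memcpy(&v, &low, sizeof low);    /* store only the lower half of the 32 bits */

    printf("%u\n", v);               /* prints 1234 on little-endian targets     */
    return 0;
}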