LC-3 left-shift and store left-shifted bits - lc3

I am working on the following code to left shift the value in R0 - which I am sure will work. Also, as R0 is left-shifted, the value of the bit which gets removed should be stored in R2. I am not sure if what I am doing is correct for this.
Also, MASK .FILL x8000 does not seem to work. My LC-3 simulator returns an error. It states "invalid instruction. RTI executed with user mode privilege."
.ORIG x3000
LD R0 X
AND R2 R2 0
LD R3 MASK
LD R1 N
BRZ done
loop
AND R2 R0 R3 ;store leftmost digit of R0 into R2
ADD R0 R0 R0 ;left shift R0
ADD R1 R1 -1
BRP loop
done .FILL x0000
MASK .FILL x8000
X .FILL xFFFF
N .FILL 5 ;amount of times of leftshifts
.END

If you look at the Opcode for RTI:
1000 0000 0000 0000
It is identical to the value stored by "MASK .FILL X8000":
1000 0000 0000 0000
You haven't put a HALT instruction anywhere before MASK, so the program keeps running straight through MASK, X and N as if they were instructions. When it reaches MASK, it decodes x8000 as an RTI instruction, because the two have exactly the same bit pattern.
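To see why the simulator complains, it can help to decode the data words the way the LC-3 does: bits [15:12] select the opcode. The sketch below is a small Python model, not part of any LC-3 tool; the `opcode_name` helper is a hypothetical name chosen for illustration.

```python
# Minimal sketch: map the top 4 bits of an LC-3 word to its opcode mnemonic.
OPCODES = [
    "BR", "ADD", "LD", "ST", "JSR", "AND", "LDR", "STR",
    "RTI", "NOT", "LDI", "STI", "JMP", "(reserved)", "LEA", "TRAP",
]

def opcode_name(word):
    """Return the mnemonic selected by bits [15:12] of a 16-bit word."""
    return OPCODES[(word >> 12) & 0xF]

# The data words from the question, in the order execution would reach them.
for label, word in [("MASK", 0x8000), ("X", 0xFFFF), ("N", 0x0005)]:
    print(f"{label}: x{word:04X} decodes as {opcode_name(word)}")
```

Ending the code with HALT (TRAP x25) before the .FILL lines keeps execution out of the data.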

Related

What does movsbl (%rax, %rcx, 1),%eax and $0xf, %eax do?

movsbl (%rax, %rcx, 1),%eax
and $0xf, %eax
I have:
%rax=93824992274748
%eax=1431693628
%rcx=0
I really don't understand why I get these results:
How does the first instruction give me %eax=97?
Why does the and between the binary representation of 97 and 1111 give me 1?
Bitwise AND compares the bits of both operands. In this case, 97 AND 15:
0110 0001 ;97
0000 1111 ;15
For each column of bits, if both bits in the column are 1, the resulting bit in that column is 1. Otherwise, it's zero.
0110 0001 ;97
0000 1111 ;15
---------------
0000 0001 ;1
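For the first instruction: movsbl loads a single byte from the address %rax + %rcx*1 and sign-extends it to 32 bits. Here is a rough Python model of both instructions. The memory contents are an assumption (the question only reports the result 97, so the byte at that address is taken to be 0x61), and `movsbl` is just an illustrative helper name.

```python
def movsbl(memory, base, index, scale=1):
    """Load one byte from base + index*scale and sign-extend it to 32 bits."""
    byte = memory[base + index * scale]
    signed = byte - 0x100 if byte & 0x80 else byte
    return signed & 0xFFFFFFFF  # the value as it would sit in %eax

# Assumption: the byte at the address held in %rax is 0x61 ('a').
rax, rcx = 93824992274748, 0
memory = {rax: 0x61}

eax = movsbl(memory, rax, rcx)   # movsbl (%rax,%rcx,1),%eax
masked = eax & 0xF               # and $0xf,%eax
print(eax, masked)
```

The old value of %eax (1431693628) does not matter, because the load overwrites the whole register.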
You might be wondering what purpose this serves. There's a lot of things you can do with AND actually, many of which aren't obvious at first glance. It's very helpful to think of one of the operands as your data and the other as a "filter."
For example, let's say we have a function called rand that returns a random 32-bit unsigned integer in %eax every time you call it. Assume that all possible values are equally likely. Now, let's say that we have some function called myEvent, and whether we call it or not shall be based on the outcome of rand. If we want this event to have a 1 in 16 chance of occurring, we can do this:
call rand
and $0xf, %eax
jnz skip
call myEvent
skip:
This works because every multiple of 16 has the bottom 4 bits clear. So those are the only bits we're interested in, and we can use and $0xf, %eax to ignore everything to the left of those 4 bits, since they'll all turn into zeroes after the and. Once we've done the and, we can then see if %eax contains zero. If it does, %eax contained a multiple of 16 prior to the and.
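The 1-in-16 claim can be checked by brute force. This sketch stands in for rand with every 16-bit value exactly once (an assumption made so the count is exact), and `event_fires` is a hypothetical name for the and/jnz test.

```python
def event_fires(value):
    """Mirror of: and $0xf,%eax ; jnz skip (fires only when the low 4 bits are 0)."""
    return (value & 0xF) == 0

values = range(1 << 16)  # stand-in for equally likely rand() results
hits = sum(1 for v in values if event_fires(v))
print(hits, len(values))
```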
Here's another example. This tells you if %eax is odd or even:
and $1, %eax
jnz isOdd
This works because a number is odd if its rightmost binary digit is 1. When n is unsigned, or when the result of n % 2 is only tested against zero, compilers typically emit n & 1 rather than doing an actual division.
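One caveat worth seeing in numbers: with C-style signed remainder, n % 2 can be -1, so n & 1 matches it exactly only for non-negative n, although the "is it odd?" test agrees for every n. A small sketch, where `c_remainder` is a hypothetical helper that imitates C's truncating division:

```python
def c_remainder(n, d):
    """Remainder with C semantics (quotient truncated toward zero)."""
    return n - d * int(n / d)

def is_odd(n):
    """Mirror of: and $1,%eax ; jnz isOdd."""
    return (n & 1) != 0

print(c_remainder(-3, 2), -3 & 1, is_odd(-3))
```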

Why does x3103 have the value x1482 at the end

Edit: The original question is this
Suppose the following LC-3 program is loaded into memory starting at
location x30FF:
x30FF 1110 0010 0000 0001
x3100 0110 0100 0100 0010
x3101 1111 0000 0010 0101
x3102 0001 0100 0100 0001
x3103 0001 0100 1000 0010
If the program is executed, what is the value in R2 at the end of
execution?
x30FF 1110 0010 0000 0001 ; R1 <- PC' + 1 ; R1 <- x3101
x3100 0110 0100 0100 0010 ; R2 <- mem[R1 + 2] ; R2 <- mem[x3103] = x1482
x3101 1111 0000 0010 0101 ; TRAP x25 = HALT
x3102 0001 0100 0100 0001 ; x1441
x3103 0001 0100 1000 0010 ; x1482
The question is what is the content of R2 at the end of the program
In this problem I understand everything until x3100
However I don't understand what mem[R1+2] means and how x3102 has x1441 in Register 2 and how x3103 has the value x1482.
As far I can tell, nothing is loaded into R2 at any point.
Where do x1441 and x1482 come from?
Can somebody explain how R2 has x1482 in it?
Going over the machine language you posted.
The first instruction which is LEA R1, 1 will simply store PC + 1 into R1. Since the PC will be x3100 at the time that instruction is executed x3101 is stored into R1.
The second instruction, which is LDR R2, R1, 2, will take R1's value, add 2, and then load from memory at the resulting address and store the loaded value in R2. R1's value is x3101, and x3101 + 2 is x3103, so whatever is at address x3103 will be stored in R2. Since you posted that x3103 contains x1482, that is what gets stored in R2.
The phrasing mem[R1+2] means to load from memory at the address computed by taking R1's value and adding 2 to it.
From your edit, yeah the x1441 and x1482 appear to just be data.
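The walkthrough above can be replayed with a tiny interpreter. This is only a sketch that understands the three instructions this program uses (LEA, LDR and TRAP x25); the names `run` and `sext` are made up for the example.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

MEMORY = {0x30FF: 0xE201, 0x3100: 0x6442, 0x3101: 0xF025,
          0x3102: 0x1441, 0x3103: 0x1482}

def run(memory, pc):
    regs = [0] * 8
    while True:
        instr = memory[pc]
        pc = (pc + 1) & 0xFFFF            # PC is incremented before execution
        op, dr = instr >> 12, (instr >> 9) & 7
        if op == 0b1110:                  # LEA: DR <- PC + offset9
            regs[dr] = (pc + sext(instr & 0x1FF, 9)) & 0xFFFF
        elif op == 0b0110:                # LDR: DR <- mem[BaseR + offset6]
            base = (instr >> 6) & 7
            regs[dr] = memory[(regs[base] + sext(instr & 0x3F, 6)) & 0xFFFF]
        elif op == 0b1111 and (instr & 0xFF) == 0x25:   # TRAP x25 = HALT
            return regs
        else:
            raise NotImplementedError(f"x{instr:04X}")

regs = run(MEMORY, 0x30FF)
print(f"R1 = x{regs[1]:04X}, R2 = x{regs[2]:04X}")
```

Execution stops at the HALT in x3101, so the words at x3102 and x3103 are never executed; they are only read as data.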

How to perform a circular left shift in LC3

We have been given an assignment for our Computer Systems Fundamentals class. The goal of the program is to display an unsigned integer value in three different ways: Binary, Decimal, and Hexadecimal. I have already completed the part for binary, and the decimal method will simply require division by 10 and printing the results. However, for hexadecimal the professor wants us to implement it using a circular left shift (in order to perform a left rotation).
IE.
0010 1111 0000 1001
+ 0010 1111 0000 1001
--------------------
0101 1110 0001 0010
We have to do this 4x and then apply a mask to clear all bits from [15:4] in order to print out the ascii value of bits [3:0].
Here is my algorithm to solve this program.
;Loop 4 times -->initialization of for loop
;Begin For Loop
;Perform the left shift
;If the value after the left shift is performed has a carry of 1
;then add 1 to this value (the rotation)
;Otherwise if the value after the left shift is performed has a carry of 0
;Continue the value is already rotated
;Get the value of the number after the loop has completed
;Create a new loop that will go through the Digits
;Load R0 with the value of the digit that we land on
;Print that value to the screen
The problem that I am running into is the fact that I do not know how to find the carry bit of the number being shifted though.
For example:0000 0000 0001 1011 ----> bit[4] is the carry bit after the left shift
I don't know how to keep track of that to perform the circular shift.
I tried a mask of
1000 0000 0000 0000
but I don't think it
keeps track of any carry bit that doesn't occur on bit[15].
Help would be greatly appreciated, the book doesn't provide an example, and I can't find any sources online, I know that answering homework questions is generally frowned upon, but I am at my wits end! Its the weekend so I cannot contact the professor, and the assignment is due soon (not last minute, I have been stuck on this since Friday -_- ).
Thank you!
EDIT: MORE INFORMATION
DIGITS is this: DIGITS .STRINGZ "0123456789ABCDEF" ;Digit_String
It will take the value of the of [3:0] after the 4 rotations and will determine which ascii character to print out.
I had never heard of LC3 before, so I do not know the instruction set, but from a quick Google search you have:
and, add, not instructions
N, Z, P flags
2's complement ALU
That is really not much (it lacks the basic instructions), so you have to do everything the roundabout way. However, there are ways to do a cyclic rotation, for example like this:
for a left shift of r0 you can use add r0,r0,r0
to obtain the carry you can use the sign of the value before the shift
In 2's complement, if the MSB is set the value is negative. So you need to set the flags with some neutral arithmetic operation and test the N flag. For example add r0,r0,#0
Another way is, like you guessed, to AND with an MSB mask and test the Z flag, but I am not sure if you can store 1000 0000 0000 0000 bin as a positive number. As I did not see any distinction between signed and unsigned constants, I assume they are all signed, and putting this value in as a positive number may overflow in the compiler/interpreter, most likely making it zero. Try to use #-32768 instead.
Putting it together, with the value in r0:
LD R0,VALUE ; VALUE is a label holding your number
; extract MSB bit to R1
AND R1,R1,#0 ; R1 = 0 (assume no carry)
ADD R0,R0,#0 ; set N/Z/P flags from R0; N means the MSB is 1
BRzp REL0 ; MSB clear (zero or positive): no carry
ADD R1,R1,#1 ; MSB set: carry = 1
REL0
; shift
ADD R0,R0,R0 ; shift left
ADD R0,R0,R1 ; put the carry into bit 0 (cyclic)
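The same idea can be checked outside the simulator. The following Python sketch mirrors the sequence above on a 16-bit value (the function name `rotl16` is made up for the example) and reproduces the 0010 1111 0000 1001 case from the question.

```python
def rotl16(value):
    """Rotate a 16-bit value left by one, the way the LC-3 sequence does."""
    carry = 1 if value & 0x8000 else 0   # sign bit before the shift
    shifted = (value + value) & 0xFFFF   # ADD R0,R0,R0 (the old MSB falls off)
    return shifted + carry               # ADD R0,R0,R1 (bit 0 is free after the shift)

print(f"{rotl16(0x2F09):016b}")
```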
I'm pretty much doing the same assignment. I think I actually have your same class lol. I know people on this site look down on sharing code, but here's the section that you're having trouble with. Take a look at it and try to adapt yours appropriately. Also, I was having the same problem with the mask; try using xF000 as the mask to get the most significant nibble (the top 4 bits).
ST R1, HEX1
ST R7, HEX7
AND R1, R1, 0
AND R7, R7, 0
AND R2, R2, 0
ADD R1, R0, #0
LD R4, HEXMASK
ADD R6, R6, #4; counter 2
OUTERROT
LD R3, INMASK
LD R5, OUTMASK
ST R5, HEXSAVE
LD R5, HEXMASK2
AND R7, R1, R5; R7=MSB
LD R5, HEXSAVE
ADD R2, R2, #4
SHIFT
ADD R1, R1, R1
ADD R2, R2, #-1
BRz EXITSHF
BR SHIFT
EXITSHF
AND R0, R0, 0
ADD R0, R1, 0
AND R1, R1, 0
SHFTR
ST R4, HEXSAVE
AND R4, R4, 0
AND R4, R7, R3
BRz DONOTHING
NOT R1, R1
NOT R5, R5
AND R4, R1, R5
NOT R4, R4
NOT R1, R1
NOT R5, R5
ADD R1, R4, 0
BRz SHFTREX
DONOTHING
LD R4, HEXSAVE
ADD R5, R5, R5
ADD R3, R3, R3
BRz SHFTREX
BR SHFTR
SHFTREX
ADD R0, R1, R0
AND R1, R1, 0
ADD R1, R0, 0
ST R0, HEXSAVE
ST R1, HEXSAVE1
AND R0, R0, 0
AND R0, R1, R4
ST R3, CONV3
AND R3, R3, #0
LEA R3, DIGITS
START
ADD R3, R3, 1
ADD R0, R0, #-1
BRnp START
LDR R0, R3, 0
LD R3, CONV3
TRAP x21
LD R0, HEXSAVE
ADD R6, R6, #-1
BRz HEXEXIT
BR OUTERROT
HEXEXIT
LD R1, HEX1
LD R7, HEX7
RET
;Save Area
HEXMASK .FILL x000F
HEXMASK2 .FILL XF000
INMASK .FILL x1000
OUTMASK .FILL x0001
ASCII .FILL X0030
HEX1 .BLKW 1
HEX7 .BLKW 1
HEXSAVE1 .BLKW 1
HEXSAVE .BLKW 1
CONV3 .BLKW 1
I never specified earlier which mask. There are actually 4 masks to use: 2 for right shifting the MSB, 1 for obtaining the MSB, and 1 for obtaining the bits in B[3:0] at the end. Try to recognize the difference.
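For reference, the overall hexadecimal algorithm from the question (rotate left 4 times, mask bits [3:0], index into DIGITS, repeat for each digit) can be sketched in a few lines of Python. This is a model of the approach, not a translation of the LC-3 code above; `rotl16` and `to_hex` are hypothetical names.

```python
DIGITS = "0123456789ABCDEF"   # same table as the DIGITS .STRINGZ in the question

def rotl16(value):
    """Rotate a 16-bit value left by one bit."""
    carry = (value >> 15) & 1
    return ((value << 1) & 0xFFFF) | carry

def to_hex(value):
    """Print-order hex digits: 4 rotations bring the next nibble into bits [3:0]."""
    out = []
    for _ in range(4):            # four hex digits in a 16-bit word
        for _ in range(4):        # rotate the top nibble down to the bottom
            value = rotl16(value)
        out.append(DIGITS[value & 0x000F])
    return "".join(out)

print(to_hex(0x2F09))
```

Because a rotation loses no bits, the value is back to its original state after all 16 rotations.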

Packing BCD to DPD: How to improve this amd64 assembly routine?

I'm writing a routine to convert between BCD (4 bits per decimal digit) and Densely Packed Decimal (DPD) (10 bits per 3 decimal digits). DPD is further documented (with the suggestion for software to use lookup-tables) on Mike Cowlishaw's web site.
This routine only ever requires the lower 16 bit of the registers it uses, yet for shorter instruction encoding I have used 32 bit instructions wherever possible. Is a speed penalty associated with code like:
mov data,%eax # high 16 bit of data are cleared
...
shl %al
shr %eax
or
and $0x888,%edi # = 0000 a000 e000 i000
imul $0x0490,%di # = aei0 0000 0000 0000
where the alternative to a 16 bit imul would be either a 32 bit imul and a subsequent and or a series of lea instructions and a final and.
The whole code in my routine can be found below. Is there anything in it where performance is worse than it could be due to me mixing word and dword instructions?
.section .text
.type bcd2dpd_mul,#function
.globl bcd2dpd_mul
# convert BCD to DPD with multiplication tricks
# input abcd efgh iklm in edi
.align 8
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jz 1f
and $0x888,%edi # = 0000 a000 e000 i000
imul $0x0490,%di # = aei0 0000 0000 0000
mov %eax,%esi
and $0x66,%esi # q = 0000 0000 0fg0 0kl0
shr $13,%edi # u = 0000 0000 0000 0aei
imul tab-8(,%rdi,4),%si # v = q * tab[u-2][0]
and $0x397,%eax # r = 0000 00bc d00h 0klm
xor %esi,%eax # w = r ^ v
or tab-6(,%rdi,4),%ax # x = w | tab[u-2][1]
and $0x3ff,%eax # = 0000 00xx xxxx xxxx
1: ret
.size bcd2dpd_mul,.-bcd2dpd_mul
.section .rodata
.align 4
tab:
.short 0x0011 ; .short 0x000a
.short 0x0000 ; .short 0x004e
.short 0x0081 ; .short 0x000c
.short 0x0008 ; .short 0x002e
.short 0x0081 ; .short 0x000e
.short 0x0000 ; .short 0x006e
.size tab,.-tab
Improved Code
After applying some suggestions from the answer and comments and some other trickery, here is my improved code.
.section .text
.type bcd2dpd_mul,#function
.globl bcd2dpd_mul
# convert BCD to DPD with multiplication tricks
# input abcd efgh iklm in edi
.align 8
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jnz 1f
ret
.align 8
1: and $0x888,%edi # = 0000 a000 e000 i000
imul $0x49,%edi # = 0ae0 aei0 ei00 i000
mov %eax,%esi
and $0x66,%esi # q = 0000 0000 0fg0 0kl0
shr $8,%edi # = 0000 0000 0ae0 aei0
and $0xe,%edi # = 0000 0000 0000 aei0
movzwl lookup-4(%rdi),%edx
movzbl %dl,%edi
imul %edi,%esi # v = q * tab[u-2][0]
and $0x397,%eax # r = 0000 00bc d00h 0klm
xor %esi,%eax # w = r ^ v
or %dh,%al # = w | tab[u-2][1]
and $0x3ff,%eax # = 0000 00xx xxxx xxxx
ret
.size bcd2dpd_mul,.-bcd2dpd_mul
.section .rodata
.align 4
lookup:
.byte 0x11
.byte 0x0a
.byte 0x00
.byte 0x4e
.byte 0x81
.byte 0x0c
.byte 0x08
.byte 0x2e
.byte 0x81
.byte 0x0e
.byte 0x00
.byte 0x6e
.size lookup,.-lookup
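When experimenting with variations of this routine, a bit-exact model plus an independent reference encoder makes it easy to test all 1000 inputs. The sketch below is such a pair in Python. Note the assumptions: `bcd2dpd_model` follows the improved listing instruction by instruction, while `dpd_reference` uses the DPD encoding table as I understand it from the published definition, written with the question's bit names (abcd efgh iklm); check it against Cowlishaw's page before relying on it.

```python
LOOKUP = bytes([0x11, 0x0A, 0x00, 0x4E, 0x81, 0x0C,
                0x08, 0x2E, 0x81, 0x0E, 0x00, 0x6E])

def bcd2dpd_model(bcd):
    """Model of the improved routine: 12-bit BCD in, 10-bit DPD out."""
    eax = ((bcd & ~0xFF) | ((bcd << 1) & 0xFF)) >> 1    # shl %al ; shr %eax
    if (bcd & 0x880) == 0:                              # fast path: a = e = 0
        return eax
    edi = (((bcd & 0x888) * 0x49) >> 8) & 0xE           # 2 * aei
    mul, orv = LOOKUP[edi - 4], LOOKUP[edi - 3]         # movzwl lookup-4(%rdi)
    esi = (eax & 0x66) * mul                            # v = q * multiplier
    eax = ((eax & 0x397) ^ esi) | orv                   # (r ^ v) | or-pattern
    return eax & 0x3FF

def dpd_reference(d2, d1, d0):
    """Table-driven encoder for three decimal digits (assumed DPD table)."""
    a, b, c, d = [(d2 >> s) & 1 for s in (3, 2, 1, 0)]
    e, f, g, h = [(d1 >> s) & 1 for s in (3, 2, 1, 0)]
    i, k, l, m = [(d0 >> s) & 1 for s in (3, 2, 1, 0)]
    table = {
        (0, 0, 0): (b, c, d, f, g, h, 0, k, l, m),
        (0, 0, 1): (b, c, d, f, g, h, 1, 0, 0, m),
        (0, 1, 0): (b, c, d, k, l, h, 1, 0, 1, m),
        (0, 1, 1): (b, c, d, 1, 0, h, 1, 1, 1, m),
        (1, 0, 0): (k, l, d, f, g, h, 1, 1, 0, m),
        (1, 0, 1): (f, g, d, 0, 1, h, 1, 1, 1, m),
        (1, 1, 0): (k, l, d, 0, 0, h, 1, 1, 1, m),
        (1, 1, 1): (0, 0, d, 1, 1, h, 1, 1, 1, m),
    }
    out = 0
    for bit in table[(a, e, i)]:
        out = (out << 1) | bit
    return out

mismatches = [(x, y, z) for x in range(10) for y in range(10) for z in range(10)
              if bcd2dpd_model((x << 8) | (y << 4) | z) != dpd_reference(x, y, z)]
print(len(mismatches))
```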
TYVM for commenting the code clearly and well, BTW. It made it super easy to figure out what was going on, and where the bits were going. I'd never heard of DPD before, so puzzling it out from uncommented code and the Wikipedia article would have sucked.
The relevant gotchas are:
Avoid 16bit operand size for instructions with immediate constants, on Intel CPUs. (LCP stalls)
avoid reading the full 32 or 64bit register after writing only the low 8 or 16, on Intel pre-IvyBridge. (partial-register extra uop). (IvB still has that slowdown if you modify an upper8 reg like AH, but Haswell removes that too). It's not just an extra uop: the penalty on Core2 is 2 to 3 cycles, according to Agner Fog. I might be measuring it wrong, but it seems a lot less bad on SnB.
See http://agner.org/optimize/ for full details.
Other than that, there's no general problem with mixing in some instructions using the operand-size prefix to make them 16-bit.
You should maybe write this as inline asm, rather than as a called function. You only use a couple registers, and the fast-path case is very few instructions.
I had a look at the code. I didn't look into achieving the same result with significantly different logic, just at optimizing the logic you do have.
Possible code suggestions: Switch the branching so the fast-path has the not-taken branch. Actually, it might make no diff either way in this case, or might improve the alignment of the slow-path code.
.p2align 4,,10 # align to 16, unless we're already in the first 6 bytes of a block of 16
bcd2dpd_mul:
mov %edi,%eax # = 0000 abcd efgh iklm
shl %al # = 0000 abcd fghi klm0
shr %eax # = 0000 0abc dfgh iklm
test $0x880,%edi # fast path for a = e = 0
jnz .Lslow_path
ret
.p2align 4 # Maybe fine-tune this alignment based on how the rest of the code assembles.
.Lslow_path:
...
ret
It's sometimes better to duplicate return instructions than to absolutely minimize code-size. The compare-and-branch in this case is the 4th uop of the function, though, so a taken branch wouldn't have prevented 4 uops from issuing in the first clock cycle, and a correctly-predicted branch would still issue the return on the 2nd clock cycle.
You should use a 32bit imul for the one with the table source. (see next section about aligning the table so reading an extra 2B is ok). 32bit imul is one uop instead of two on Intel SnB-family microarches. The result in the low16 should be the same, since the sign bit can't be set. The upper16 gets zeroed by the final and before ret, and doesn't get used in any way where garbage in the upper16 matters while it's there.
Your imul with an immediate operand is problematic, though.
It causes an LCP stall when decoding on Intel, and it writes the low16 of a register that is later read at full width. Its upper16 would be a problem if not masked off (since it's used as a table index). Its operands are large enough that they will put garbage into the upper16, so it does need to be discarded.
I thought your way of doing it would be optimal for some architectures, but it turns out imul r16,r16,imm16 itself is slower than imul r32,r32,imm32 on every architecture except VIA Nano, AMD K7 (where it's faster than imul32), and Intel P6 (where using it from 32bit / 64bit mode will LCP-stall, and where partial-reg slowdowns are a problem).
On Intel SnB-family CPUs, where imul r16,r16,imm16 is two uops, imul32/movzx would be strictly better, with no downside except code size. On P6-family CPUs (i.e. PPro to Nehalem), imul r16,r16,imm16 is one uop, but those CPUs don't have a uop cache, so the LCP stall is probably critical (except maybe Nehalem calling this in a tight loop, fitting in the 28 uop loop buffer). And for those CPUs, the explicit movzx is probably better from the perspective of the partial-reg stall. Agner Fog says something about there being an extra cycle while the CPU inserts the merging uop, which might mean a cycle where that extra uop is issued alone.
On AMD K8-Steamroller, imul imm16 is 2 m-ops instead of 1 for imul imm32, so imul32/movzx is about equal to imul16 there. They don't suffer from LCP stalls, or from partial-reg problems.
On Intel Silvermont, imul imm16 is 2 uops (with one per 4 clocks throughput), vs. imul imm32 being 1 uop (with one per 1 clock throughput). Same thing on Atom (the in-order predecessor to Silvermont): imul16 is an extra uop and much slower. On most other microarchitectures, throughput isn't worse, just latency.
So if you're willing to increase the code-size in bytes where it will give a speedup, you should use a 32bit imul and a movzwl %di, %edi. On some architectures, this will be about the same speed as the imul imm16, while on others it will be much faster. It might be slightly worse on AMD bulldozer-family, which isn't very good at using both integer execution units at once, apparently, so a 2 m-op instruction for EX1 might be better than two 1 m-op instructions where one of them is still an EX1-only instruction. Benchmark this if you care.
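That the two forms compute the same table index can be checked exhaustively. In this Python sketch, `index_imul16` models the original and/imul16/shr sequence and `index_imul32` models the suggested imul32 plus movzwl; both function names are made up for the example.

```python
def index_imul16(bcd):
    """and $0x888,%edi ; imul $0x0490,%di ; shr $13,%edi"""
    edi = bcd & 0x888
    di = (edi * 0x0490) & 0xFFFF        # a 16-bit imul keeps only the low 16 bits
    edi = (edi & 0xFFFF0000) | di       # the upper half of %edi is left untouched
    return edi >> 13

def index_imul32(bcd):
    """and $0x888,%edi ; imul $0x0490,%edi,%edi ; movzwl %di,%edi ; shr $13,%edi"""
    edi = ((bcd & 0x888) * 0x0490) & 0xFFFFFFFF
    edi &= 0xFFFF                       # movzwl discards the garbage above bit 15
    return edi >> 13

def aei(bcd):
    """The three selector bits, read directly from bits 11, 7 and 3."""
    return ((bcd >> 11) & 1) << 2 | ((bcd >> 7) & 1) << 1 | ((bcd >> 3) & 1)

same = all(index_imul16(x) == index_imul32(x) == aei(x) for x in range(0x1000))
print(same)
```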
Align tab to at least a 32B boundary, so your 32bit imul and or can do a 4B load from any 2B-aligned entry in it without crossing a cache-line boundary. Unaligned accesses have no penalty on all recent CPUs (Nehalem and later, and recent AMD), as long as they don't span two cache lines.
Making the operations that read from the table 32bit avoids the partial-register penalty that Intel CPUs have. AMD CPUs, and Silvermont, don't track partial-registers separately, so even instructions that write-only to the low16 have to wait for the result in the rest of the reg. This stops 16bit insns from breaking dependency chains. Intel P6 and SnB microarch families track partial regs. Haswell does full dual bookkeeping or something, because there's no penalty when merging is needed, like after you shift al, then shift eax. SnB will insert an extra uop there, and there may be a penalty of a cycle or two while it does this. I'm not sure, and haven't tested. However, I don't see a nice way to avoid this.
The shl %al could be replaced with an add %al, %al. That can run on more ports. Probably no difference, since port0/5 (or port0/6 on Haswell and later) probably aren't saturated. They have the same effect on the bits, but set flags differently. Otherwise they could be decoded to the same uop.
changes: split the pext/pdep / vectorize version into a separate answer, partly so it can have its own comment thread.
(I split the BMI2 version into a separate answer, since it could end up totally different)
After seeing what you're doing with that imul/shr to get a table index, I can see where you could use BMI2 pext to replace and/imul/shr, or BMI1 bextr to replace just the shr (allowing use of imul32, instead of imul16, since you'd just extract the bits you want, rather than needing to shift zeros from the upper16). There are AMD CPUs with BMI1, but even steamroller lacks BMI2. Intel introduced BMI1 and BMI2 at the same time with Haswell.
You could maybe process two or four 16bit words at once, with 64bit pext. But not for the whole algorithm: you can't do 4 parallel table lookups. (AVX2 VPGATHERDD is not worth using here.) Actually, you can use pshufb to implement a LUT with indices up to 4bits, see below.
Minor improvement version:
.section .rodata
# This won't assemble, written this way for humans to line up with comments.
extmask_lobits: .long 0b0000 1111 0111 1111 # every bit except e, to match the shl/shr result
extmask_hibits: .long 0b0000 1000 1000 1000
# pext doesn't have an immediate-operand form, but it can take the mask from a memory operand.
# Load these into regs if running in a tight loop.
#### TOTALLY UNTESTED #####
.text
.p2align 4,,10
bcd2dpd_bmi2:
# mov %edi,%eax # = 0000 abcd efgh iklm
# shl %al # = 0000 abcd fghi klm0
# shr %eax # = 0000 0abc dfgh iklm
pext extmask_lobits, %edi, %eax
# = 0000 0abc dfgh iklm
mov %eax, %esi # insn scheduling for 4-issue front-end: Fast-path is 4 fused-domain uops
# And doesn't waste issue capacity when we're taking the slow path. CPUs with mov-elimination won't waste execution units from issuing an extra mov
test $0x880, %edi # fast path for a = e = 0
jnz .Lslow_path
ret
.p2align 4
.Lslow_path:
# 8 uops, including the `ret`: can issue in 2 clocks.
# replaces and/imul/shr
pext extmask_hibits, %edi, %edi #u= 0000 0000 0000 0aei
and $0x66, %esi # q = 0000 0000 0fg0 0kl0
imul tab-8(,%rdi,4), %esi # v = q * tab[u-2][0]
and $0x397, %eax # r = 0000 00bc d00h 0klm
xor %esi, %eax # w = r ^ v
or tab-6(,%rdi,4), %eax # x = w | tab[u-2][1]
and $0x3ff, %eax # = 0000 00xx xxxx xxxx
ret
Of course, if making this an inline-asm, rather than a stand-alone function, you'd change back to the fast path branching to the end, and the slow-path falling through. And you wouldn't waste space with alignment padding mid-function either.
There might be more scope for using pext and/or pdep for more of the rest of the function.
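Without BMI2 hardware at hand, the pext steps can still be checked with a software model. The sketch below is a plain Python emulation of pext (not a real intrinsic). The low-bits mask 0xF7F, meaning every bit except e, is my assumption for what reproduces the shl/shr result; 0x888 is the aei mask from the listing.

```python
def pext(src, mask):
    """Software model of BMI2 pext: gather the bits of src selected by mask."""
    out, pos = 0, 0
    while mask:
        lowest = mask & -mask        # lowest set bit still left in the mask
        if src & lowest:
            out |= 1 << pos
        pos += 1
        mask &= mask - 1             # clear that bit
    return out

def shl_al_shr_eax(bcd):
    """The original two-instruction way of squeezing out bit e."""
    return ((bcd & ~0xFF) | ((bcd << 1) & 0xFF)) >> 1

same = all(pext(x, 0xF7F) == shl_al_shr_eax(x) for x in range(0x1000))
print(same, pext(0x999, 0x888))
```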
I was thinking about how to do even better with BMI2. I think we could get multiple aei selectors from four shorts packed into 64b, then use pdep to deposit them in the low bits of different bytes. Then movq that to a vector register, where you use it as a shuffle control-mask for pshufb to do multiple 4bit LUT lookups.
So we could go from 60 BCD bits to 50 DPD bits at a time. (Use shrd to shift bits between registers to handle loads/stores to byte-addressable memory.)
Actually, 48 BCD bits (4 groups of 12bits each) -> 40 DPD bits is probably a lot easier, because you can unpack that to 4 groups of 16bits in a 64b integer register, using pdep. Dealing with the selectors for 5 groups is fine, you can unpack with pmovzx, but dealing with the rest of the data would require bit-shuffling in vector registers. Not even the slow AVX2 variable-shift insns would make that easy to do. (Although it might be interesting to consider how to implement this without BMI2 at all, for big speedups on CPUs with just SSSE3 (i.e. every relevant CPU) or maybe SSE4.1.)
This also means we can put two clusters of 4 groups into the low and high halves of a 128b register, to get even more parallelism.
As a bonus, 48bits is a whole number of bytes, so reading from a buffer of BCD digits wouldn't require any shrd insns to get the leftover 4 bits from the last 64b into the low 4 for the next. Or two offset pext masks to work when the 4 ignored bits were the low or high 4 of the 64b.... Anyway, I think doing 5 groups at once just isn't worth considering.
Full BMI2 / AVX pshufb LUT version (vectorizable)
The data movement could be:
ignored | group 3 | group 2 | group 1 | group 0
16bits | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm | abcd efgh iklm
3 2 1 | 0
pext -> aei|aei|aei|aei # packed together in the low bits
2 | 1 | 0
pdep -> ... |0000 0000 0000 0aei|0000 0000 0000 0aei # each in a separate 16b word
movq -> xmm vector register.
(Then pinsrq another group of 4 selectors into the upper 64b of the vector reg). So the vector part can handle 2 (or AVX2: 4) of this at once
vpshufb xmm2 -> map each byte to another byte (IMUL table)
vpshufb xmm3 -> map each byte to another byte (OR table)
Get the bits other than `aei` from each group of 3 BCD digits unpacked from 48b to 64b, into separate 16b words:
group 3 | group 2 | group 1 | group 0
pdep(src)-> 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm | 0000 abcd efgh iklm
movq this into a vector reg (xmm1). (And repeat for the next 48b and pinsrq that to the upper64)
VPAND xmm1, mask (to zero aei in each group)
Then use the vector-LUT results:
VPMULLW xmm1, xmm2 -> packed 16b multiply, keeping only the low16 of the result
VPAND xmm1, mask
VPXOR xmm1, something
VPOR xmm1, xmm3
movq / pextrq back to integer regs
pext to pack the bits back together
You don't need the AND 0x3ff or equivalent:
Those bits go away when you pext to pack each 16b down to 10b
shrd or something to pack the 40b results of this into 64b chunks for store to memory.
Or: 32b store, then shift and store the last 8b, but that seems lame
Or: just do 64b stores, overlapping with the previous. So you write 24b of garbage every time. Take care at the very end of the buffer.
Use AVX 3-operand versions of the 128b SSE instructions to avoid needing movdqa to not overwrite the table for pshufb. As long as you never run a 256b AVX instruction, you don't need to mess with vzeroupper. You might as well use the v (VEX) versions of all vector instructions, though, if you use any. Inside a VM, you might be running on a virtual CPU with BMI2 but not AVX support, so it's prob. still a good idea to check both CPU feature flags, rather than assuming AVX if you see BMI2 (even though that's safe for all physical hardware that currently exists).
This is starting to look really efficient. It might be worth doing the mul/xor/and stuff in vector regs, even if you don't have BMI2 pext/pdep to do the bit packing/unpacking. I guess you could use code like the existing non-BMI scalar routine to get selectors, and mask/shift/or could build up the non-selector data into 16b chunks. Or maybe shrd for shifting data from one reg into another?
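The data movement for the 48-bit variant can be prototyped with software models of pext and pdep before writing any assembly. This Python sketch only covers the packing and unpacking steps (no vector part). The mask constants are my own reading of the layout described above, and the input groups are arbitrary sample values.

```python
def pext(src, mask):
    """Software model of BMI2 pext: gather the masked bits into the low bits."""
    out, pos = 0, 0
    while mask:
        lowest = mask & -mask
        if src & lowest:
            out |= 1 << pos
        pos += 1
        mask &= mask - 1
    return out

def pdep(src, mask):
    """Software model of BMI2 pdep: scatter the low bits of src to the mask positions."""
    out, bit = 0, 1
    while mask:
        lowest = mask & -mask
        if src & bit:
            out |= lowest
        bit <<= 1
        mask &= mask - 1
    return out

packed = 0x999123809456                               # four 12-bit BCD groups
unpacked = pdep(packed, 0x0FFF0FFF0FFF0FFF)           # one group per 16-bit word
selectors = pext(packed, 0x888888888888)              # aei|aei|aei|aei
spread = pdep(selectors, 0x0007000700070007)          # one selector per 16-bit word
repacked = pext(unpacked, 0x0FFF0FFF0FFF0FFF)         # inverse of the unpack
print(hex(unpacked), hex(selectors), hex(spread))
```

Packing the finished 10-bit results back down would be the same pext call with a mask of 0x03FF in each 16-bit word.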

How data is stored in memory or register

I'm new to assembly language and am learning it for exams. I am a programmer and have worked in C, C++, Java, and ASP.NET.
I have tasm with win xp.
I want to know How data is stored in memory or register. I want to know the process. I believe it is something like this:
While Entering data, eg. number:
Input Decimal No. -> Converted to Hex -> Store ASCII of hex in registers or memory.
While Fetching data:
ASCII of hex in registers or memory -> Converted to Hex -> Show Decimal No. on monitor.
Is it correct. ? If not, can anybody tell me with simple e.g
Ok, Michael: See the code below where I am trying to add two 1 digit numbers to display 2 digit result, like 6+5=11
Sseg segment stack
ends
code segment
;30h to 39h represent numbers 0-9
MOV BX, '6' ; ASCII CODE OF 6 IS STORED IN BX, equal to 36h
ADD BX, '5' ; ASCII CODE OF 5 (equal to 35h) IS ADDED IN BX, i.e total is 6Bh
Thanks Michael... I accept my mistake....
Ok, so here, BX=006Bh, right ? Does it mean, BL=00 and BH=6B ?
However, If i do so, I can't find out how to show the result 11 ?
Hey Blechdose,
Can you help me with one more problem? I am trying to compare 2 values. If both are the same then dl=1, otherwise dl=0. But the following code displays 0 even when the values are the same. Why is it not jumping?
sseg segment stack
ends
code segment
assume cs:code
mov dl,0
mov ax,5
mov bx,5
cmp ax,bx
jne NotEqual
je equal
NotEqual:
mov dl,0
add dl,30h
mov ah,02h
int 21h
mov ax,4c00h
int 21h
equal: mov dl,1
add dl,30h
mov ah,02h
int 21h
mov ax,4c00h
int 21h
code ends
end NotEqual
end equal
Registers consist of bits. A bit can have the logic value 0 or 1. It is a "logic value" for us, but actually it is represented by some kind of voltage inside the hardware. For example 4-5 Volt is interpreted as "logic 1" and 0-1 Volt as "logic 0". The BX register has 16 of those bits.
Lets say the current content of BX(Base address register) is: 0000000000110110. Because it is very hard to read those long lines of 0s and 1s for humans, we combine every 4 bits to 1 Hexnumber, to get a more readable format to work with. The CPU does not know what a Hex or decimal number is. It can only work with binary code. Okay, let us use a more readable format for our BX register:
0000 0000 0011 0110 (actual BX content)
0 0 3 6 (HEX format for us)
54 (or corresponding decimal value)
When you send this value (36h) to your output terminal, it will interpret this value as an ASCII character. Thus it will display a "6" for the 36h value.
When you want to add 6 + 2 with assembly, you put 0110 (6) and 0010 (2) in the registers. Your assembler TASM is doing the work for you. It allows you to write '6' (ASCII) or 0x6 (hex) or even 6 (decimal) in the asm-sourcecode and will convert that for you into a binary number, which the register accepts. WARNING: '6' will not put the value 6 into the register, but the ASCII-Code for 6. You cannot calculate with that directly.
Example: 6+2=8
mov BX, 6h ; We put 0110 (6) into BX. (actually 0000 0000 0000 0110,
; because BX is 16 Bit, but I will drop those leading 0s)
add BX, 2h ; we add 0010 (2) to 0110 (6). The result 1000 (8) is stored in BX.
add BX, 30h ; we add 00110000 (30h). The result 00111000 (38h) is stored in BX.
; 38h is the ASCII-code, which your terminal output will interpret as '8'
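The difference between a digit's value and its ASCII code is easy to try out in a high-level language. A short Python sketch of the same arithmetic (nothing here is TASM-specific):

```python
# Adding the numeric values, then converting to ASCII, as in the example above.
total = 6 + 2
as_char = chr(total + 0x30)          # 38h is the code for '8'

# Adding the ASCII codes directly gives a code that is not a digit at all.
wrong = ord('6') + ord('5')          # 36h + 35h
print(as_char, hex(wrong), chr(wrong))
```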
When you do a calculation like 6+5 = 11, it will be even more complicated, because you have to convert the result 1011 (11) into 2 ASCII-Digits '1' and '1' (3131h = 00110001 00110001)
After adding 6 (0110) + 5 (0101) = 11 (1011), BX will contain this (without blanks):
0000 0000 0000 1011 (binary)
0 0 0 B (Hex)
11 (decimal)
|__________________|
BX
|________||________|
BH BL
BH is the higher Byte of BX, while BL is the lower byte of BX. In our example BH is 00h, while BL contains 0bh.
To display your summation result on your terminal output, you need to convert it to ASCII code. In this case, you want to display '11'. Thus you need the '1' ASCII character twice. By looking at one of the hundreds of ASCII tables on the internet, you will find out that the code for the '1' ASCII character is 31h. Consequently you need to send 3131h to your terminal:
0011 0001 0011 0001 (binary)
3 1 3 1 (hex)
12593 (decimal)
The trick is to divide your 11 (1011) by 10 with the div instruction. After the division by 10 you get a result and a remainder. You need to convert the remainder into an ASCII digit and save it into a buffer. Then you repeat the process by dividing the result from the last step by 10 again. You need to do this until the result is 0. The digits come out least significant first, so print the buffer in reverse. (Using the div operation is a bit tricky. You have to look that up by yourself.)
binary (decimal):
divide 1011 (11) by 1010 (10):
result: 0001 (1) remainder: 0001 (1) -> convert remainder to ASCII
divide result by 1010 (10) again:
result: 0000 (0) remainder: 0001 (1) -> convert remainder to ASCII
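The divide-by-10 loop looks like this when written out in Python. It is only a sketch of the algorithm, not of the div instruction itself, and `to_decimal_ascii` is a name made up for the example.

```python
def to_decimal_ascii(value):
    """Convert a non-negative integer to its ASCII decimal digits."""
    buffer = []
    while True:
        value, remainder = divmod(value, 10)   # result and remainder of the division
        buffer.append(remainder + 0x30)        # remainder -> ASCII digit
        if value == 0:                         # stop once the result is 0
            break
    return bytes(reversed(buffer))             # digits were produced last-first

print(to_decimal_ascii(6 + 5))
```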

Resources