undefined reference to `__floatundisf' when using hard float (PowerPC) - gcc

I'm building code for PowerPC with hard float and suddenly getting this issue.
I understand that this symbol belongs to gcc's soft-float library. What I don't understand is why it's trying to use that at all, despite my efforts to tell it to use hard float.
make flags:
CFLAGS += -mcpu=750 -mhard-float -ffast-math -fno-math-errno -fsingle-precision-constant -shared -fpic -fno-exceptions -fno-asynchronous-unwind-tables -mrelocatable -fno-builtin -G0 -O3 -I$(GCBASE) -Iinclude -Iinclude/gc -I$(BUILDDIR)
ASFLAGS += -I include -mbroadway -mregnames -mrelocatable --fatal-warnings
LDFLAGS += -nostdlib -mhard-float $(LINKSCRIPTS) -Wl,--nmagic -Wl,--just-symbols=$(GLOBALSYMS)
Code in question:
static void checkTime() {
    u64 ticks = __OSGetSystemTime();
    //note timestamp here is seconds since 2000-01-01
    float secs = ticks / 81000000.0f; //everything says this should be 162m / 4,
                                      //but I only seem to get anything sensible with 162m / 2.
    int days = secs / 86400.0f; //non-leap days
    int years = secs / 31556908.8f; //approximate average
    int yDay = days % 365;
    debugPrintf("Y %d D %d", years, yDay);
}
What more do I need to stop gcc trying to use soft float? Why has it suddenly decided to do that?

Looking at the GCC docs, __floatundisf converts an unsigned 64-bit integer to a single-precision float. If we compile your code* with -O1 and run objdump, we can see that the __floatundisf indeed comes from dividing your u64 by a float:
u64 ticks = __OSGetSystemTime();
20: 48 00 00 01 bl 20 <checkTime+0x20> # Call __OSGetSystemTime
20: R_PPC_PLTREL24 __OSGetSystemTime
//note timestamp here is seconds since 2000-01-01
float secs = ticks / 81000000.0f; //everything says this should be 162m / 4,
24: 48 00 00 01 bl 24 <checkTime+0x24> # Call __floatundisf
24: R_PPC_PLTREL24 __floatundisf
28: 81 3e 00 00 lwz r9,0(r30)
2a: R_PPC_GOT16 .LC0
2c: c0 09 00 00 lfs f0,0(r9) # load the constant 1/81000000
30: ec 21 00 32 fmuls f1,f1,f0 # do the multiplication ticks * 1/81000000
So you're getting it for a u64 / float calculation.
If you convert the u64 to a u32, I also see it going away.
Why is it generated? Looking at the manual for the 750CL, which I'm hoping is largely equivalent to your chip, there's no instruction that will read an 8-byte integer from memory and convert it to a float. (It looks like there isn't one for directly converting a 32-bit integer to a float either: gcc instead inlines a confusing sequence of integer and float manipulation instructions.)
I don't know what the units for __OSGetSystemTime are, but if you can reduce it to a 32-bit integer by throwing away some lower bits, or by doing some tricks with common divisors, you could get rid of the call.
*: Lightly modified to compile on my system.
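If it helps, here's a rough sketch (untested, and not from your code base) of the "throw away some lower bits" idea. The shift amount of 24 and the 81 MHz tick rate are assumptions taken from your snippet; u64, u32, __OSGetSystemTime and debugPrintf are stand-ins for your SDK's definitions:

#include <stdint.h>

/* Hypothetical stand-ins for the SDK types/functions used in the question. */
typedef uint64_t u64;
typedef uint32_t u32;
extern u64 __OSGetSystemTime(void);
extern void debugPrintf(const char *fmt, ...);

static void checkTime(void)
{
    u64 ticks = __OSGetSystemTime();

    /* Drop the low 24 bits so the value fits in 32 bits (good for roughly the
       first ~28 years after the 2000-01-01 epoch at ~81 MHz); only a
       u32 -> float conversion remains, so no __floatundisf call is needed. */
    u32 coarse = (u32)(ticks >> 24);

    /* One coarse unit is 2^24 ticks; fold that factor into the divisor. */
    float secs = coarse * (16777216.0f / 81000000.0f);

    int days  = secs / 86400.0f;      /* non-leap days */
    int years = secs / 31556908.8f;   /* approximate average year */
    int yDay  = days % 365;
    debugPrintf("Y %d D %d", years, yDay);
}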

Understanding branch prediction efficiency

I tried to measure branch prediction cost, so I created a little program.
It creates a small buffer on the stack and fills it with random 0/1 values. I can set the size of the buffer with N. The code repeatedly branches on the same 1<<N random numbers.
Now, I expected that if 1<<N is sufficiently large (say, >100), the branch predictor would not be effective (since it has to predict >100 random numbers). However, these are the results (on a 5820K machine); as N grows, the program becomes slower:
N    time (s)
=============
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9
For reference, if the buffer is initialized with zeros (use the commented-out init), the time is more or less constant; it varies between 1.5 and 1.7 s for N = 8..16.
My question is: can the branch predictor be effective at predicting such a large amount of random numbers? If not, then what's going on here?
(Some more explanation: the code executes 2^32 branches regardless of N, so I expected the code to run at the same speed regardless of N, because the branches cannot be predicted at all. But it seems that if the buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)
Here's the code:
#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) {
    uint64_t s1 = s[0];
    uint64_t s0 = s[1];
    uint64_t result = s0 + s1;
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

int main() {
    uint64_t s[2];
    s[0] = init[0];
    s[1] = init[1];

    uint64_t sum = 0;
#if 1
    const int N = 16;

    unsigned char buffer[1<<N];
    for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

    for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
        for (int j=0; j<1<<N; j++) {
            if (buffer[j]) {
                sum += one;
            }
        }
    }
#else
    for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
        if (next(s)&1) {
            sum += one;
        }
    }
#endif
    std::cout<<sum<<"\n";
}
(The code contains a non-buffered version as well; use #if 0. It runs at around the same speed as the buffered version with N=16.)
Here's the inner-loop disassembly (compiled with clang; it generates the same code for all N between 8 and 16, only the loop count differs, and clang unrolled the loop by a factor of two):
401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>
Branch prediction can indeed be that effective. As Peter Cordes suggested, I checked branch-misses with perf stat. Here are the results:
N   time (s)   cycles           branch-misses (%)      approx-time (s)
===============================================================
8 2.2 9,084,889,375 34,806 ( 0.00) 2.2
9 2.2 9,212,112,830 39,725 ( 0.00) 2.2
10 2.2 9,264,903,090 2,394,253 ( 0.06) 2.2
11 2.2 9,415,103,000 8,102,360 ( 0.19) 2.2
12 2.3 9,876,827,586 27,169,271 ( 0.63) 2.3
13 4.6 19,572,398,825 486,814,972 (11.33) 4.6
14 9.5 39,813,380,461 1,473,662,853 (34.31) 9.5
15 11.6 49,079,798,916 1,915,930,302 (44.61) 11.7
16 12.7 53,216,900,532 2,113,177,105 (49.20) 12.7
20 12.9 54,317,444,104 2,149,928,923 (50.06) 12.9
Note: branch-misses (%) is calculated for 2^32 branches
As you can see, when N<=12, the branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, the branch-misses start to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.
The time taken can be approximated by looking at the time and branch-misses (%) columns: I've added the last column, approx-time, calculated as 2.2+(12.9-2.2)*(branch-misses %)/50, i.e. interpolating linearly between the fully predicted case (2.2 s, ~0% missed) and the fully random case (12.9 s, ~50% missed). As you can see, approx-time matches time (apart from rounding error). So this effect can be explained perfectly by branch prediction.
The original intent was to calculate how many cycles a branch miss costs (in this particular case; for other cases this number can differ):
(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.
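To make the arithmetic above easy to reproduce, here is a tiny standalone sketch (not part of the original measurement; all constants are copied from the table):

#include <cstdio>

int main()
{
    // Rows copied from the table: N=8 (fully predicted) and N=20 (~50% missed).
    double cycles_predicted = 9084889375.0,  misses_predicted = 34806.0;
    double cycles_random    = 54317444104.0, misses_random    = 2149928923.0;

    // Extra cycles per extra branch miss: ~21 cycles.
    std::printf("cycles per miss: %.3f\n",
                (cycles_random - cycles_predicted) /
                (misses_random - misses_predicted));

    // approx-time for the N=13 row: interpolate between 2.2 s (0% missed)
    // and 12.9 s (~50% missed) using its 11.33% miss rate -> ~4.6 s.
    std::printf("approx time (N=13): %.2f s\n",
                2.2 + (12.9 - 2.2) * 11.33 / 50.0);
}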

Extending SRecord to handle crc32_mpeg2?

statement of problem:
I'm working with a Kinetis L series (ARM Cortex M0+) that has a dedicated CRC hardware module. Through trial and error and using this excellent online CRC calculator, I determined that the CRC hardware is configured to compute CRC32_MPEG2.
I'd like to use srec_input (a part of SRecord 1.64) to generate a CRC for a .srec file whose results must match the CRC_MPEG2 computed by the hardware. However, srec's built-in CRC algos (CRC32 and STM32) don't generate the same results as the CRC_MPEG2.
the question:
Is there a straightforward way to extend srec to handle CRC32_MPEG2? My current thought is to fork the srec source tree and extend it, but it seems likely that someone's already been down this path.
Alternatively, is there a way for srec to call an external program? (I didn't see one after a quick scan.) That might do the trick as well.
some details
The parameters of the hardware CRC32 algorithm are:
Input Reflected: No
Output Reflected: No
Polynomial: 0x4C11DB7
Initial Seed: 0xFFFFFFFF
Final XOR: 0x0
To test it, an input string of:
0x10 0xB5 0x06 0x4C 0x23 0x78 0x00 0x2B
0x07 0xD1 0x05 0x4B 0x00 0x2B 0x02 0xD0
should result in a CRC32 value of:
0x938F979A
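For reference, here is a minimal bitwise implementation of CRC-32/MPEG-2 using exactly the parameters listed above (just a cross-checking sketch; it is not part of SRecord or the Freescale library). It should reproduce the 0x938F979A value for the test string:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// CRC-32/MPEG-2: poly 0x04C11DB7, seed 0xFFFFFFFF, no reflection, no final XOR.
static uint32_t crc32_mpeg2(const uint8_t *data, std::size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= static_cast<uint32_t>(data[i]) << 24;        // feed the byte MSB-first
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
    }
    return crc;                                             // final XOR is 0x0
}

int main()
{
    const uint8_t test[] = {
        0x10, 0xB5, 0x06, 0x4C, 0x23, 0x78, 0x00, 0x2B,
        0x07, 0xD1, 0x05, 0x4B, 0x00, 0x2B, 0x02, 0xD0
    };
    std::printf("%08X\n", static_cast<unsigned>(crc32_mpeg2(test, sizeof test)));  // expect 938F979A
}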
what generated the CRC value in the first place?
In response to Mark Adler's well-posed question, the firmware uses the Freescale fsl_crc library to compute the CRC. The relevant code and parameters (mildly edited) follow:
void crc32_update(crc32_data_t *crc32Config, const uint8_t *src, uint32_t lengthInBytes)
{
    crc_config_t crcUserConfigPtr;

    CRC_GetDefaultConfig(&crcUserConfigPtr);

    crcUserConfigPtr.crcBits = kCrcBits32;
    crcUserConfigPtr.seed = 0xffffffff;
    crcUserConfigPtr.polynomial = 0x04c11db7U;
    crcUserConfigPtr.complementChecksum = false;
    crcUserConfigPtr.reflectIn = false;
    crcUserConfigPtr.reflectOut = false;

    CRC_Init(g_crcBase[0], &crcUserConfigPtr);
    CRC_WriteData(g_crcBase[0], src, lengthInBytes);
    crcUserConfigPtr.seed = CRC_Get32bitResult(g_crcBase[0]);

    crc32Config->currentCrc = crcUserConfigPtr.seed;
    crc32Config->byteCountCrc += lengthInBytes;
}
Peter Miller be praised...
It turns out that if you supply enough filters to srec_cat, you can make it do anything! :) In fact, the following arguments produce the correct checksum:
$ srec_cat test.srec -Bit_Reverse -CRC32LE 0x1000 -Bit_Reverse -XOR 0xff -crop 0x1000 0x1004 -Output -HEX_DUMP
00001000: 93 8F 97 9A #....
In other words, bit-reverse the bits going into the CRC32 algorithm, bit-reverse them on the way out, and one's complement them.

What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?

Two different threads within a single process can share a common memory location by reading and/or writing to it.
Usually, such (intentional) sharing is implemented using atomic operations using the lock prefix on x86, which has fairly well-known costs both for the lock prefix itself (i.e., the uncontended cost) and also additional coherence costs when the cache line is actually shared (true or false sharing).
Here I'm interested in producer-consumer costs where a single thread P writes to a memory location, and another thread C reads from the memory location, both using plain reads and writes.
What are the latency and throughput of such an operation when performed on separate cores on the same socket, and, for comparison, when performed on sibling hyperthreads of the same physical core, on recent x86 CPUs?
In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
Okay, I couldn't find any authoritative source, so I figured I'd give it a go myself.
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cstdint>
#include <iostream>

alignas(128) static uint64_t data[SIZE];
alignas(128) static std::atomic<unsigned> shared;
#ifdef EMPTY_PRODUCER
alignas(128) std::atomic<unsigned> unshared;
#endif
alignas(128) static std::atomic<bool> stop_producer;
alignas(128) static std::atomic<uint64_t> elapsed;

static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ (
        "rdtsc"
        : "=a" (l), "=d" (h)
    );
    return ((uint64_t)h << 32) | l;
}

static void * consume(void *)
{
    uint64_t value = 0;
    uint64_t start = rdtsc();

    for (unsigned n = 0; n < LOOPS; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
        }
    }

    elapsed = rdtsc() - start;
    return reinterpret_cast<void*>(value);
}

static void * produce(void *)
{
    do {
#ifdef EMPTY_PRODUCER
        unshared.store(0, std::memory_order_relaxed);
#else
        shared.store(0, std::memory_order_relaxed);
#endif
    } while (!stop_producer);
    return nullptr;
}

int main()
{
    pthread_t consumerId, producerId;
    pthread_attr_t consumerAttrs, producerAttrs;
    cpu_set_t cpuset;

    for (unsigned idx = 0; idx < SIZE; ++idx) { data[idx] = 1; }
    shared = 0;
    stop_producer = false;

    pthread_attr_init(&consumerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(CONSUMER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&consumerAttrs, sizeof(cpuset), &cpuset);

    pthread_attr_init(&producerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(PRODUCER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&producerAttrs, sizeof(cpuset), &cpuset);

    pthread_create(&consumerId, &consumerAttrs, consume, NULL);
    pthread_create(&producerId, &producerAttrs, produce, NULL);

    pthread_attr_destroy(&consumerAttrs);
    pthread_attr_destroy(&producerAttrs);

    pthread_join(consumerId, NULL);
    stop_producer = true;
    pthread_join(producerId, NULL);

    std::cout <<"Elapsed cycles: " <<elapsed <<std::endl;
    return 0;
}
Compile with the following command, replacing defines:
gcc -std=c++11 -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing
Where:
CONSUMER_CPU is the number of the cpu to run consumer thread on.
PRODUCER_CPU is the number of the cpu to run producer thread on.
SIZE is the size of the inner loop (matters for cache)
LOOPS is, well...
Here are the generated loops:
Consumer thread
400cc8: ba 80 24 60 00 mov $0x602480,%edx
400ccd: 0f 1f 00 nopl (%rax)
400cd0: 8b 05 2a 17 20 00 mov 0x20172a(%rip),%eax # 602400 <shared>
400cd6: 48 83 c2 08 add $0x8,%rdx
400cda: 48 03 42 f8 add -0x8(%rdx),%rax
400cde: 48 01 c1 add %rax,%rcx
400ce1: 48 81 fa 80 24 70 00 cmp $0x702480,%rdx
400ce8: 75 e6 jne 400cd0 <_ZL7consumePv+0x20>
400cea: 83 ee 01 sub $0x1,%esi
400ced: 75 d9 jne 400cc8 <_ZL7consumePv+0x18>
Producer thread, with empty loop (no writing to shared):
400c90: c7 05 e6 16 20 00 00 movl $0x0,0x2016e6(%rip) # 602380 <unshared>
400c97: 00 00 00
400c9a: 0f b6 05 5f 16 20 00 movzbl 0x20165f(%rip),%eax # 602300 <stop_producer>
400ca1: 84 c0 test %al,%al
400ca3: 74 eb je 400c90 <_ZL7producePv>
Producer thread, writing to shared:
400c90: c7 05 66 17 20 00 00 movl $0x0,0x201766(%rip) # 602400 <shared>
400c97: 00 00 00
400c9a: 0f b6 05 5f 16 20 00 movzbl 0x20165f(%rip),%eax # 602300 <stop_producer>
400ca1: 84 c0 test %al,%al
400ca3: 74 eb je 400c90 <_ZL7producePv>
The program counts the number of CPU cycles consumed, on the consumer's core, to complete the whole loop. We compare the first producer, which does nothing but burn CPU cycles, to the second producer, which disrupts the consumer by repeatedly writing to shared.
My system has an i5-4210U; that is, 2 cores, 2 threads per core. They are exposed by the kernel as Core#1 → cpu0, cpu2; Core#2 → cpu1, cpu3.
Result without starting the producer at all:
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 n/a 2.11G 1.80G
Results with empty producer. For 1G operations (either 1000*1M or 8000*128k).
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 3 3.20G 3.26G # mono
3 2 2.10G 1.80G # other core
3 1 4.18G 3.24G # same core, HT
As expected, since both threads are CPU hogs and both get a fair share, the producer burning cycles slows down the consumer by about half. That's just CPU contention.
With the producer on cpu#2, as there is no interaction, the consumer runs with no impact from the producer running on another CPU.
With producer on cpu#1, we see hyperthreading at work.
Results with disruptive producer:
CONSUMER PRODUCER cycles for 1M cycles for 128k
3 3 4.26G 3.24G # mono
3 2 22.1 G 19.2 G # other core
3 1 36.9 G 37.1 G # same core, HT
When we schedule both threads on the same logical CPU, there is no impact. Expected again, as the producer's writes remain local, incurring no synchronization cost.
I cannot really explain why I get much worse performance for hyperthreading than for two cores. Advice welcome.
The killer problem is that the core makes speculative reads, which means that each write to the speculatively read address (or, more correctly, to the same cache line) before the read is "fulfilled" forces the CPU to undo the read (at least on x86), which effectively cancels all speculative instructions from that instruction onward.
At some point before the read is retired it gets "fulfilled", i.e. no earlier instruction can fail and there is no longer any reason to reissue, and the CPU can act as if it had executed all the preceding instructions.
Other core example
These two are playing cache ping-pong in addition to cancelling instructions, so this should be worse than the HT version.
Let's start at some point in the process where the cache line with the shared data has just been marked Shared because the Consumer has asked to read it.
The producer now wants to write to the shared data and sends out a request for exclusive ownership of the cache line.
The Consumer receives its cache line, still in the Shared state, and happily reads the value.
The consumer continues to read the shared value until the exclusive request arrives.
At which point the Consumer sends a shared request for the cache line.
At this point the Consumer clears its instructions from the first unfulfilled load instruction of the shared value.
While the Consumer waits for the data it runs ahead speculatively.
So the Consumer can make progress in the period between getting its shared cache line and having it invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely 2, as the CPU has 2 read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied that they can't fail between each other.
Same core HT
Here the two HT shares the core and must share its resources.
The cache line should stay in the Exclusive state all the time, as the two threads share the cache and therefore don't need the cache-coherence protocol.
Now why does it take so many cycles on the HT core? Let's start with the Consumer just having read the shared value.
The next cycle, a write from the Producer occurs.
The Consumer thread detects the write and cancels all its instructions from the first unfulfilled read.
The Consumer re-issues its instructions taking ~5-14 cycles to run again.
Finally the first instruction, which is a read, is issued and executed, as it did not read a speculative value but a correct one, since it is at the front of the queue.
So for every read of the shared value the Consumer is reset.
Conclusion
The different-core version apparently advances so much between each cache ping-pong that it performs better than the HT one.
What would have happened if the CPU waited to see if the value had actually changed?
For the test code, the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was covering the reissue latency.
But if the data had been different, the same problem would arise, except it would be worse for the different-core version, as it would then also have to wait for the cache line and then reissue.
So if the OP can swap some of the roles, letting the timestamp producer read from the shared location and take the performance hit, it would be better.

Unexpected exit code for a C program compiled for 32 bit architecture using gcc

I wrote a simple C program and compiled it for 32 bit architecture.
But when I ran it, I found unexpected results.
#include <stdio.h>

int foo(int n) {
    int sum=0;
    int i;
    if (n <= 1 || n >= 0x1000)
        return n;
    for (i=0; i<= n; i++) {
        sum = sum + i;
    }
    return foo(sum);
}

int main(int argc, char** argv) {
    int n;
    n = foo(200);
    printf("\n\n main about to return %d \n\n", n);
    return n;
}
➜ wbench gcc -o test.elf test.c -m32 -fno-stack-protector -mpreferred-stack-boundary=2 -Wall
➜ wbench ./test.elf
main about to return 20100
➜ wbench echo $?
132
I'm expecting 20100 to be the return value, as printed by the main function.
But, I'm getting 132 as the exit code.
I verified using GDB that 20100 is the value in the eax register when main is about to return.
➜ wbench gdb -q test.elf
gdb-peda$ b *main+44
Breakpoint 1 at 0x8048492
gdb-peda$ r
main about to return 20100
Breakpoint 1, 0x08048492 in main ()
0x8048489 <main+35>: call 0x80482f0 <printf@plt>
0x804848e <main+40>: mov eax,DWORD PTR [ebp-0x4]
0x8048491 <main+43>: leave
=> 0x8048492 <main+44>: ret
0x8048493: xchg ax,ax
gdb-peda$ p/d $eax
$1 = 20100
gdb-peda$ c
[Inferior 1 (process 32172) exited with code 0204]
Warning: not running or target is remote
gdb-peda$ p/d 0204
$2 = 132
I even verified that when control is transferred back to __libc_start_main and the exit function is called, 20100 is passed as the argument to exit().
gdb-peda$ r
main returning 20100
Breakpoint 1, 0x08048492 in main ()
gdb-peda$ finish
=> 0xf7e1ca83 <__libc_start_main+243>: mov DWORD PTR [esp],eax
0xf7e1ca86 <__libc_start_main+246>: call 0xf7e361e0 <exit>
0xf7e1ca8b <__libc_start_main+251>: xor ecx,ecx
gdb-peda$ si
=> 0xf7e1ca86 <__libc_start_main+246>: call 0xf7e361e0 <exit>
0xf7e1ca8b <__libc_start_main+251>: xor ecx,ecx
gdb-peda$ x/wd $esp
0xffffd5c0: 20100
What could possibly be the reason for this ?
I don't think the exit code 132 here has anything to do with SIGILL, because when I changed the hardcoded argument to foo() from 200 to 2, the exit code changed to 172, where the expected value is 26796.
It looks like what you're doing is invalid, as you only have 8 bits to return to the OS.
Assuming you're linking against libc:
When a program exits, it can return to the parent process a small amount of information about the cause of termination, using the exit status. This is a value between 0 and 255 that the exiting process passes as an argument to exit.
As indicated in its documentation here. Also relevant is this line:
Warning: Don’t try to use the number of errors as the exit status. This is actually not very useful; a parent process would generally not care how many errors occurred. Worse than that, it does not work, because the status value is truncated to eight bits. Thus, if the program tried to report 256 errors, the parent would receive a report of 0 errors—that is, success.
20100 decimal is 4E84 hex.
132 decimal is 84 hex.
Your shell is receiving the return value as only 8 bits.
While your program may be returning 20100, the system only keeps the lowest byte, i.e. the return value modulo 256.
So 20100 % 256 = 132.
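A minimal sketch of that masking (not from the program above), using the two values mentioned in the question:

#include <cstdio>

int main()
{
    // Only the low 8 bits of main's return value reach the shell as $?.
    std::printf("%d\n", 20100 & 0xFF);   // 132 (the foo(200) case)
    std::printf("%d\n", 26796 & 0xFF);   // 172 (the foo(2) case)
}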

Invalid operands for binary AND (&)

I have this "assembly" file (containing only directives)
// declare protected region as somewhere within the stack
.equiv prot_start, $stack_top & 0xFFFFFF00 - 0x1400
.equiv prot_end, $stack_top & 0xFFFFFF00 - 0x0C00
Combined with this linker script:
SECTIONS {
"$stack_top" = 0x10000;
}
Assembling produces this output
file.s: Assembler messages:
file.s: Error: invalid operands (*UND* and *ABS* sections) for `&' when setting `prot_start'
file.s: Error: invalid operands (*UND* and *ABS* sections) for `&' when setting `prot_end'
How can I make this work?
Why is it not possible?
You have linked to the GAS docs, but what is the rationale for that inability?
Answer: GAS must communicate operations to the linker through the ELF object file, and the only things that can be conveyed like this are + and - (- is just + with a negative value). So this is a fundamental limitation of the ELF format, and not just laziness on the part of the GAS devs.
When GAS compiles to the object file, a link step will follow, and it is the relocation which will determine the final value of the symbol.
Question: why can + be conveyed, but not &?
Answer: because + is associative: (a + b) + c == a + (b + c), but + and & do not associate with each other: (a & b) + c != a & (b + c).
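A concrete instance of that non-associativity (arbitrary values, just to make the point tangible):

#include <cstdio>

int main()
{
    unsigned a = 0x10, b = 0xF0, c = 1;
    std::printf("0x%X\n", (a & b) + c);   // 0x11
    std::printf("0x%X\n", a & (b + c));   // 0x10 -- different, so '&' cannot
                                          // simply be folded into a relocation addend
}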
Let us see how + is conveyed through the ELF format to convince ourselves that & is not possible.
First learn what relocation is if you are not familiar with it: https://stackoverflow.com/a/30507725/895245
Let's minimize your example with another that would generate the same error:
a: .long s
b: .long s + 0x12345678
/* c: .long s & 1 */
s:
Compile and decompile:
as --32 -o main.o main.S
objdump -dzr main.o
The output contains:
00000000 <a>:
0: 08 00 or %al,(%eax)
0: R_386_32 .text
2: 00 00 add %al,(%eax)
00000004 <b>:
4: 80 56 34 12 adcb $0x12,0x34(%esi)
4: R_386_32 .text
Ignore the disassembly since this is not code, and look just at the symbols, bytes and relocations.
We have two R_386_32 relocations. From the System V ABI for IA-32 (which defines the ELF format), that type of relocation is calculated as:
S + A
where:
S: the value before relocation in the object file.
Value of a before relocation == 08 00 00 00 == 8 in little endian
Value of b before relocation == 80 56 34 12 == 0x12345680 in little endian
A: addend, a field of the relocation entry, here 0 (not shown by objdump), so let's just forget about it.
When relocation happens:
a will be replaced with:
address of text section + 8
There is a + 8 because s: is the 8th byte of the text section, preceded by 2 longs.
b will be replaced with:
address of text section + (0x12345678 + 8)
==
address of text section + 0x12345680
Aha, so that is why 0x12345680 appeared on the object file!
So as we've just seen, it is possible to express + on the ELF file by just adding to the actual offset.
But it would not be possible to express & with this mechanism (or any other that I know of), because we don't know what the address of the text section will be after relocation, so we can't apply & to it.
Darn:
Infix Operators
Infix operators take two arguments, one on either side. Operators have precedence, but operations with equal precedence are performed left to right. Apart from + or -, both arguments must be absolute, and the result is absolute.
