My ultimate goal is to compile rtorrent for my NAS, a Synology DS107+. To familiarize myself with cross compiling, I wanted to compile a hello world first. This already turned out to be a big hurdle for a novice like me.
Executing the program results in a Segmentation Fault. I used gcc-arm-none-eabi to compile it on Ubuntu x86_64.
What tools do I need to compile a program for the target? I also consulted "Cross compile from linux to ARM-ELF (ARM926EJ-S/MT7108)" and needed a workaround, because gcc complained that _exit is not declared.
More detailed information follows:
/proc/cpuinfo
Processor : ARM926EJ-Sid(wb) rev 0 (v5l)
BogoMIPS : 499.71
Features : swp half thumb fastmult vfp edsp
CPU implementer : 0x41
CPU architecture: 5TEJ
CPU variant : 0x0
CPU part : 0x926
CPU revision : 0
Cache type : write-back
Cache clean : cp15 c7 ops
Cache lockdown : format C
Cache format : Harvard
I size : 32768
I assoc : 1
I line length : 32
I sets : 1024
D size : 32768
D assoc : 4
D line length : 32
D sets : 256
Hardware : MV-88fxx81
Revision : 0000
Serial : 0000000000000000
uname -a
Linux DiskStation 2.6.15 #1637 Sat May 4 05:59:19 CST 2013
armv5tejl GNU/Linux synology_88f5281_107+
dmesg (snippets from head that seem informative to me)
Linux version 2.6.15 (root#build2) (gcc version 3.4.3 (CSL 2005Q1B) (Marvell 2006Q3))
#1637 Sat May 4 05:59:19 CST 2013
CPU: ARM926EJ-Sid(wb) [41069260] revision 0 (ARMv5TEJ)
Machine: MV-88fxx81
...
Synology Hareware Version: DS107v20
Memory policy: ECC disabled, Data cache writeback
...
CPU0: D VIVT write-back cache
CPU0: I cache: 32768 bytes, associativity 1, 32 byte lines, 1024 sets
CPU0: D cache: 32768 bytes, associativity 4, 32 byte lines, 256 sets
readelf -h busybox (pulled from the device)
ELF Header:
Magic: 7f 45 4c 46 01 01 01 61 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: ARM
ABI Version: 0
Type: EXEC (Executable file)
Machine: ARM
Version: 0x1
Entry point address: 0x8de0
Start of program headers: 52 (bytes into file)
Start of section headers: 1319468 (bytes into file)
Flags: 0x602, has entry point, GNU EABI, software FP, VFP
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 20
Section header string table index: 19
EDIT: Command line
I used the options -march=armv5te -mtune=arm926ej-s -mno-long-calls -msoft-float -static. I also experimented with leaving out -march or replacing -mtune with -mcpu. Without -static, the shell just reports "not found". I also tried -mfpu=vfp and armel binaries from Debian packages; for example busybox-static, which runs fine on my HTC Desire S (armv7l), also results in a segfault.
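Spelled out, a typical invocation looked roughly like this (hello.c is just a placeholder name for the source shown below):
arm-linux-gnueabi-gcc -march=armv5te -mtune=arm926ej-s -mno-long-calls -msoft-float -static hello.c -o hello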
Using:
arm-linux-gnueabi-gcc → Segmentation fault
arm-none-eabi-gcc → Segmentation fault (or Illegal instruction, but rarely, can't tell when)
Haven't tried CodeBench, yet.
The code I tried is:
#include <stdlib.h>
void _exit (int x) { while (1) {} } // only needed for arm-none-eabi-gcc
int main (int argc, char* argv[]) {
    return 47;
}
Related
I have written a small compiler that uses LLVM (through C++) to produce object files (on a Linux system).
When I link the compiled output with gcc, the program runs fine:
myCompiler source.mylang -o objCode
gcc objCode -o program
./program #runs fine
But if I try to link it with ld, I get a segmentation fault when I run the program:
myCompiler source.mylang -o objCode
ld objCode -e main -o program #ld does not print any error or warning.
./program #Segmentation fault (core dumped)
Here is the llvm code that the compiler outputs (using myLlvmModule->print function):
; ModuleID = 'entryPointModule'
source_filename = "entryPointModule"
define i32 @main() {
entry:
%x = alloca i32
store i32 55, i32* %x
ret i32 0
}
Why does ld fail when gcc succeeds?
I thought that after writing a compiler, the only remaining step would be to call a linker. Is another compiler (such as gcc) necessary?
If yes, why?
If no, how can I get ld to work?
EDIT:
readelf -d of the working binary:
Dynamic section at offset 0xe00 contains 24 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x000000000000000c (INIT) 0x4b8
0x000000000000000d (FINI) 0x684
0x0000000000000019 (INIT_ARRAY) 0x200df0
0x000000000000001b (INIT_ARRAYSZ) 8 (bytes)
0x000000000000001a (FINI_ARRAY) 0x200df8
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x000000006ffffef5 (GNU_HASH) 0x298
0x0000000000000005 (STRTAB) 0x348
0x0000000000000006 (SYMTAB) 0x2b8
0x000000000000000a (STRSZ) 125 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000015 (DEBUG) 0x0
0x0000000000000003 (PLTGOT) 0x200fc0
0x0000000000000007 (RELA) 0x3f8
0x0000000000000008 (RELASZ) 192 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000000000001e (FLAGS) BIND_NOW
0x000000006ffffffb (FLAGS_1) Flags: NOW PIE
0x000000006ffffffe (VERNEED) 0x3d8
0x000000006fffffff (VERNEEDNUM) 1
0x000000006ffffff0 (VERSYM) 0x3c6
0x000000006ffffff9 (RELACOUNT) 3
0x0000000000000000 (NULL) 0x0
the same command for the corrupt binary:
There is no dynamic section in this file.
Your entry point attempts to return through a return address that does not exist on the stack, so execution jumps to a bogus address and the program crashes.
The entry point of a program is not expected to return. It must terminate the process by calling _exit (or a related system call).
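As an illustration (not the exact fix for your build), a minimal entry point that does not rely on the C runtime could look like this, assuming you build it with gcc -nostartfiles so that _start becomes the entry point and libc is still linked in for _exit:

#include <unistd.h>

/* _start is the raw entry point when no CRT startup files are linked.
   It must never return; it ends the process explicitly instead. */
void _start(void) {
    /* ... do the actual work here ... */
    _exit(0);   /* terminate the process instead of returning */
}

With the normal gcc driver none of this is needed: the C runtime's own _start calls your main and then passes its return value to exit, which is why linking with gcc works.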
I got a blue screen of death (BSOD) error on my laptop some time ago. I read online that analyzing the minidump file in "c:\windows\minidump" will help understand the cause behind the BSOD error (and probably point to the culprit driver causing it).
I used this online tool to analyze the error: http://www.osronline.com/page.cfm?name=analyze
It created a report but I do not understand it. If you can make sense of it, please let me know.
Link to online crash analysis report: https://pastebin.com/raw/3Hhq7arw
Dump file location: https://drive.google.com/file/d/0BzNdoGke8tyRZk5YcHBKQV8ycFE/view?usp=sharing
Laptop config: Windows 7, 32 bit
You get a WHEA_UNCORRECTABLE_ERROR (124) bugcheck, which means there is a fatal hardware error:
The WHEA_UNCORRECTABLE_ERROR bug check has a value of 0x00000124. This
bug check indicates that a fatal hardware error has occurred.
Using the !errrec command in WinDbg with the value from parameter 2, I see you have an internal timer issue with your Intel i3-3217U CPU:
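For reference, the invocation is just the extension command with the record address taken from parameter 2 of the bugcheck (the same address that appears in the record header below):
!errrec 8a0a401c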
===============================================================================
Common Platform Error Record # 8a0a401c
-------------------------------------------------------------------------------
Record Id : 01d1ff7437bf4c24
Severity : Fatal (1)
Length : 928
Creator : Microsoft
Notify Type : Machine Check Exception
Timestamp : 8/26/2016 8:32:29 (UTC)
Flags : 0x00000000
===============================================================================
Section 0 : Processor Generic
-------------------------------------------------------------------------------
Descriptor # 8a0a409c
Section # 8a0a4174
Offset : 344
Length : 192
Flags : 0x00000001 Primary
Severity : Fatal
Proc. Type : x86/x64
Instr. Set : x86
Error Type : Micro-Architectural Error
Flags : 0x00
CPU Version : 0x00000000000306a9
Processor ID : 0x0000000000000003
===============================================================================
Section 1 : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor # 8a0a40e4
Section # 8a0a4234
Offset : 536
Length : 128
Flags : 0x00000000
Severity : Fatal
Local APIC Id : 0x0000000000000003
CPU Id : a9 06 03 00 00 08 10 03 - bf e3 ba 3d ff fb eb bf
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
Proc. Info 0 # 8a0a4234
===============================================================================
Section 2 : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor # 8a0a412c
Section # 8a0a42b4
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal
Error : Internal timer (Proc 3 Bank 3)
Status : 0xbe00000000800400
Address : 0x0000000085286f3c
Misc. : 0x0000000000000000
I see that you use the ASUS X550CA Laptop:
BiosVersion = X550CA.217
BiosReleaseDate = 01/23/2014
SystemManufacturer = ASUSTeK COMPUTER INC.
SystemProductName = X550CA
I see you use the older BIOS version .217, so update the BIOS to .300; maybe that fixes the issue. If it doesn't, do a stress test of the CPU with Prime95 and the Intel Processor Diagnostic Tool.
The GPU I have is a GeForce GT 750M, which I found is compute capability 3.0. I downloaded the CUDA code found here: https://github.com/fengChenHPC/word2vec_cbow. Its makefile had the flag -arch=sm_35.
Since my device is compute capability 3.0, I changed the flag to -arch=sm_30. It compiled fine, but when I run the code, I get the following error:
word2vec.cu 449 : unspecified launch failure
word2vec.cu 449 : unspecified launch failure
The error shows up multiple times because there are multiple CPU threads launching the CUDA kernel. Please note that the threads do not use different streams to launch the kernel, so the kernel launches are all in order.
Now, when I leave the flag as it was, i.e. -arch=sm_35, the code runs fine. Can someone please explain why the code won't run when I set the flag to match my device?
Unfortunately your conclusion that the code works when compiled for sm_35 and run on an sm_30 GPU is incorrect. The culprit is this:
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
int *sen, long layer1_size, float *syn0, long hs, float *syn1,
float *expTable, int *vocab_codelen, char *vocab_code,
int *vocab_point, int *table, long table_size,
long vocab_size, float *syn1neg){
int blockSize = 256;
int gridSize = (sentence_length)/(blockSize/32);
size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
//printf("sm size is %d\n", smsize);
//fflush(stdout);
cbow_kernel<1><<<gridSize, blockSize, smsize>>>
(window, negative, alpha, sentence_length, sen,
layer1_size, syn0, syn1, expTable, vocab_codelen,
vocab_code, vocab_point, table, table_size,
vocab_size, syn1neg);
}
Because of incomplete API error checking, this code will fail silently if the kernel launch fails. And the kernel launch does fail if you build for sm_35 and run on sm_30. If you change the code of that function to this (adding kernel launch error checking):
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
int *sen, long layer1_size, float *syn0, long hs, float *syn1,
float *expTable, int *vocab_codelen, char *vocab_code,
int *vocab_point, int *table, long table_size,
long vocab_size, float *syn1neg){
int blockSize = 256;
int gridSize = (sentence_length)/(blockSize/32);
size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
//printf("sm size is %d\n", smsize);
//fflush(stdout);
cbow_kernel<1><<<gridSize, blockSize, smsize>>>
(window, negative, alpha, sentence_length, sen,
layer1_size, syn0, syn1, expTable, vocab_codelen,
vocab_code, vocab_point, table, table_size,
vocab_size, syn1neg);
checkCUDAError( cudaPeekAtLastError() );
}
and compile and run it for sm_35, you should get this on an sm_30 device:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_35 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_35'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 8 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
cbow.cu 114 : invalid device function
i.e. the kernel launch failed because no appropriate device code was found in the CUDA cubin payload of your application. This also answers your earlier question about why the output of this code is incorrect: the analysis kernel simply never runs on your hardware when the code is built with the default options.
If I build this code for sm_30 and run it on a GTX 670 with 2gb of memory (compute capability 3.0), I get this:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_30 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_30'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 12 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
Alpha: 0.000009 Progress: 100.00% Words/thread/sec: 1217.45k
i.e. the code runs correctly to completion without any errors. I can't tell you why you are not able to get the code to run on your hardware, because I cannot reproduce your error on my own hardware. You will need to do some debugging on your own to find the root cause.
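For completeness: checkCUDAError above is the project's own helper. If you need to add one yourself, a minimal sketch could look like the following (the name and exact behaviour here are my guess, not part of the project):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Print the CUDA error text and abort. Call it right after a kernel launch
   with cudaPeekAtLastError(), and after blocking API calls with their
   return value. */
static void checkCUDAError(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}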
As this LINK shows, there is no GeForce GTX 750M.
Yours is either:
GeForce GTX 750 Ti
GeForce GTX 750
or
GeForce GT 750M
If yours is one of the first two then your GPU is Maxwell-based and has Compute Capability = 5.0.
Otherwise, your GPU is Kepler based and has Compute Capability = 3.0.
If you're not sure which GPU you have, first figure it out by running deviceQuery from the NVIDIA samples.
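If you would rather not build the full samples, a few lines against the runtime API report the same information deviceQuery does (a minimal sketch, compiled with nvcc):

#include <stdio.h>
#include <cuda_runtime.h>

/* Print the name and compute capability of every visible CUDA device. */
int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}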
I'm compiling a very simple hello-world one-liner statically on a Debian 7 system on an x86_64 machine with gcc version 4.8.2 (Debian 4.8.2-21):
gcc test.c -static -o test
and I get an executable ELF file that includes the following sections:
[17] .tdata PROGBITS 00000000006b4000 000b4000
0000000000000020 0000000000000000 WAT 0 0 8
[18] .tbss NOBITS 00000000006b4020 000b4020
0000000000000030 0000000000000000 WAT 0 0 8
[19] .init_array INIT_ARRAY 00000000006b4020 000b4020
0000000000000010 0000000000000000 WA 0 0 8
[20] .fini_array FINI_ARRAY 00000000006b4030 000b4030
0000000000000010 0000000000000000 WA 0 0 8
[21] .jcr PROGBITS 00000000006b4040 000b4040
0000000000000008 0000000000000000 WA 0 0 8
[22] .data.rel.ro PROGBITS 00000000006b4060 000b4060
00000000000000e4 0000000000000000 WA 0 0 32
Note that the .tbss section is allocated at addresses 0x6b4020..0x6b4050 (0x30 bytes), so it intersects with the .init_array section at 0x6b4020..0x6b4030 (0x10 bytes), the .fini_array section at 0x6b4030..0x6b4040 (0x10 bytes) and the .jcr section at 0x6b4040..0x6b4048 (8 bytes).
Note that it does not intersect with the sections further on, for example .data.rel.ro, but that's probably only because .data.rel.ro's alignment is 32 and thus it can't be placed any earlier than 0x6b4060.
The resulting file runs fine, but I still don't exactly get how this works. From what I read in the glibc documentation, .tbss is just a .bss section for thread-local storage (i.e. allocated scratch memory, not actually mapped in the file). Is the .tbss section so special that it may overlap other sections? Are .init_array, .fini_array and .jcr so expendable (for example, no longer needed once the TLS-related startup code has run) that they can be overwritten by bss? Or is this some sort of bug?
Basically, what do I get to read and write if I access address 0x6b4020 in my application: the .tbss contents or the .init_array pointers? Why?
The virtual address of .tbss is meaningless, as that section only serves as a template for the TLS storage allocated by the threading implementation in glibc.
The way this virtual address comes about is that .tbss follows .tdata in the default linker script:
...
.gcc_except_table : ONLY_IF_RW { *(.gcc_except_table .gcc_except_table.*) }
/* Thread Local Storage sections */
.tdata : { *(.tdata .tdata.* .gnu.linkonce.td.*) }
.tbss : { *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) }
.preinit_array :
{
PROVIDE_HIDDEN (__preinit_array_start = .);
KEEP (*(.preinit_array))
PROVIDE_HIDDEN (__preinit_array_end = .);
}
.init_array :
{
PROVIDE_HIDDEN (__init_array_start = .);
KEEP (*(SORT(.init_array.*)))
KEEP (*(.init_array))
PROVIDE_HIDDEN (__init_array_end = .);
}
...
therefore its virtual address is simply the virtual address of the preceding section (.tdata) plus the size of that section (possibly with some padding to reach the desired alignment). .init_array (or .preinit_array if present) comes next and its location should be determined the same way, but .tbss is known to be so very special that it is given a deeply hard-coded treatment inside GNU ld:
/* .tbss sections effectively have zero size. */
if ((os->bfd_section->flags & SEC_HAS_CONTENTS) != 0
    || (os->bfd_section->flags & SEC_THREAD_LOCAL) == 0
    || link_info.relocatable)
  dotdelta = TO_ADDR (os->bfd_section->size);
else
  dotdelta = 0; // <----------------
dot += dotdelta;
The link is not relocatable (no ld -r), .tbss has the SEC_THREAD_LOCAL flag set, and it has no contents (NOBITS), therefore the else branch is taken. In other words, no matter how large .tbss is, the linker does not advance the location of the section that follows it (also known as "the dot").
Note also that .tbss sits in a non-loadable ELF segment:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000b1f24 0x00000000000b1f24 R E 200000
LOAD 0x00000000000b2000 0x00000000006b2000 0x00000000006b2000
0x0000000000002288 0x00000000000174d8 RW 200000
NOTE 0x0000000000000158 0x0000000000400158 0x0000000000400158
0x0000000000000044 0x0000000000000044 R 4
TLS 0x00000000000b2000 0x00000000006b2000 0x00000000006b2000 <---+
0x0000000000000020 0x0000000000000060 R 8 |
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000 |
0x0000000000000000 0x0000000000000000 RW 8 |
|
Section to Segment mapping: |
Segment Sections... |
00 .note.ABI-tag ... |
01 .tdata .ctors ... |
02 .note.ABI-tag ... |
03 .tdata .tbss <---------------------------------------------------+
04
This is rather simple if you understand two things:
1) What SHT_NOBITS is
2) What the .tbss section is
SHT_NOBITS means that the section occupies no space inside the file.
Normally, NOBITS sections like .bss are placed after all PROGBITS sections, at the end of the loaded segments.
.tbss is a special section that holds uninitialized thread-local data contributing to the program's memory image. Note carefully: this section must hold unique data for each thread of the program.
Now let's talk about overlapping. There are two places where sections could overlap: inside the binary file and in memory.
1) File offsets:
There is no data to write for this section in the binary. It takes no space inside the file, so the linker starts the next section, .init_array, right where .tbss was declared. You may think of its size not as a real size, but as special bookkeeping information for code like:
if (isTLSSegment) tlsStartAddr += section->memSize();
So it doesn't overlap anything inside the file.
2) Memory offsets:
The .tdata and .tbss sections may be modified at startup time by the dynamic linker performing relocations, but after that the section data is kept around as the initialization image and is not modified anymore. For each thread, including the initial one, new memory is allocated, into which the content of the initialization image is then copied. This ensures that all threads get the same starting conditions.
This is what makes .tbss (and .tdata) so special.
Do not think of their memory offsets as statically known -- they are more like "generation patterns" for per-thread storage. So they also cannot overlap with "normal" memory offsets -- they are processed in a different way.
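A tiny example makes this "template" behaviour visible (my own sketch, not from the paper): one zero-initialized __thread variable ends up in .tbss, and each thread gets its own copy at its own runtime address, so the link-time address of the section is never used directly.

#include <pthread.h>
#include <stdio.h>

__thread int counter;              /* uninitialized TLS -> placed in .tbss */

static void *worker(void *arg)
{
    counter += 1;                  /* touches this thread's private copy */
    printf("copy at %p = %d\n", (void *)&counter, counter);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    worker(NULL);                  /* the main thread has its own copy too */
    return 0;
}

Build with gcc tls.c -o tls -pthread; the two printed addresses differ even though there is only one counter symbol in the ELF file.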
You may consult this paper to learn more.
I am getting unexpected global variable read results when compiling the following code in avr-gcc 4.6.2 for ATmega328:
#include <avr/io.h>
#include <util/delay.h>
#define LED_PORT PORTD
#define LED_BIT 7
#define LED_DDR DDRD
uint8_t latchingFlag;
int main() {
    LED_DDR = 0xFF;
    for (;;) {
        latchingFlag = 1;
        if (latchingFlag == 0) {
            LED_PORT ^= 1 << LED_BIT;  // Toggle the LED
            _delay_ms(100);            // Delay
            latchingFlag = 1;
        }
    }
}
This is the entire code. I would expect the LED toggling to never execute, seeing as latchingFlag is set to 1, however the LED blinks continuously. If latchingFlag is declared local to main() the program executes as expected: the LED never blinks.
The disassembled code doesn't reveal any gotchas that I can see; here's the disassembly of the main loop of the version using the global variable (with the delay routine call commented out; same behavior):
59 .L4:
27:main.cpp **** for (;;) {
60 .loc 1 27 0
61 0026 0000 nop
62 .L3:
28:main.cpp **** latchingFlag=1;
63 .loc 1 28 0
64 0028 81E0 ldi r24,lo8(1)
65 002a 8093 0000 sts latchingFlag,r24
29:main.cpp **** if (latchingFlag==0) {
66 .loc 1 29 0
67 002e 8091 0000 lds r24,latchingFlag
68 0032 8823 tst r24
69 0034 01F4 brne .L4
30:main.cpp **** LED_PORT ^= 1<<LED_BIT; // Toggle the LED
70 .loc 1 30 0
71 0036 8BE2 ldi r24,lo8(43)
72 0038 90E0 ldi r25,hi8(43)
73 003a 2BE2 ldi r18,lo8(43)
74 003c 30E0 ldi r19,hi8(43)
75 003e F901 movw r30,r18
76 0040 3081 ld r19,Z
77 0042 20E8 ldi r18,lo8(-128)
78 0044 2327 eor r18,r19
79 0046 FC01 movw r30,r24
80 0048 2083 st Z,r18
31:main.cpp **** latchingFlag = 1;
81 .loc 1 31 0
82 004a 81E0 ldi r24,lo8(1)
83 004c 8093 0000 sts latchingFlag,r24
27:main.cpp **** for (;;) {
84 .loc 1 27 0
85 0050 00C0 rjmp .L4
The lines 71-80 are responsible for port access: according to the datasheet, PORTD is at address 0x2B, which is decimal 43 (cf. lines 71-74).
The only difference between local/global declaration of the latchingFlag variable is how latchingFlag is accessed: the global variable version uses sts (store direct to data space) and lds (load direct from data space) to access latchingFlag, whereas the local variable version uses ldd (Load Indirect from Data Space to Register) and std (Store Indirect From Register to Data Space) using register Y as the address register (which can be used as a stack pointer, by avr-gcc AFAIK). Here are the relevant lines from the disassembly:
63 002c 8983 std Y+1,r24
65 002e 8981 ldd r24,Y+1
81 004a 8983 std Y+1,r24
The global version also has latchingFlag in the .bss section. I am really not sure what to attribute the different global vs. local variable behavior to. Here's the avr-gcc command line (note -O0):
/usr/local/avr/bin/avr-gcc \
-I. -g -mmcu=atmega328p -O0 \
-fpack-struct \
-fshort-enums \
-funsigned-bitfields \
-funsigned-char \
-D CLOCK_SRC=8000000UL \
-D CLOCK_PRESCALE=8UL \
-D F_CPU="(CLOCK_SRC/CLOCK_PRESCALE)" \
-Wall \
-ffunction-sections \
-fdata-sections \
-fno-exceptions \
-Wa,-ahlms=obj/main.lst \
-Wno-uninitialized \
-c main.cpp -o obj/main.o
With -Os in the compiler flags the loop is gone from the disassembly, but it can be forced back by declaring latchingFlag volatile, in which case the unexpected behavior persists for me.
According to your disassembler listing, the latchingFlag global variable is located at RAM address 0. This address corresponds to the mirrored register r0 and is not a valid RAM address for a global variable.
After a couple of checks and code comparisons in EE chat, I noticed that my version of avr-gcc (4.7.0) stores the value for latchingFlag at 0x0100, whereas Egor Skriptunoff mentioned SRAM address 0 appearing in the OP's assembly listing.
Looking at the OP's disassembly (the avr-objdump version), I noticed that the OP's compiler (4.6.2) stores the latchingFlag value at a different address (specifically 0x060) than my compiler (version 4.7.0), which stores it at address 0x0100.
My advice is to update avr-gcc to at least version 4.7.0. The advantage of 4.7.0 over the latest and greatest available is that the generated code can be compared directly against my findings.
Of course, if 4.7.0 solves the issue, there is no harm in upgrading to a more recent version (if available).
Egor Skriptunoff's suggestion is almost exactly right: the SRAM variable is mapped to the wrong memory address. The latchingFlag variable is not at address 0x0100, which is the first valid SRAM address, but is mapped to 0x060, overlapping the WDTCSR register. This can be seen in disassembly lines like the following one:
lds r24, 0x0060
This line is supposed to load the value of latchingFlag from SRAM, and we can see that location 0x060 is used instead of 0x100.
The problem has to do with a bug in binutils which shows up when two conditions are met:
The linker is invoked with the --gc-sections flag (compiler option -Wl,--gc-sections) to save code space
None of your SRAM variables are initialized to non-zero values (i.e. the .data section is empty)
When both of these conditions are met, the .data section gets removed. When the .data section is missing, the SRAM variables start at address 0x060 instead of 0x100.
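You can check any build for this directly in the disassembly (a quick sanity check; main.elf here is just a placeholder for your linked file):
avr-objdump -d main.elf | grep -E 'lds|sts'
Absolute addresses around 0x0060 for your own variables indicate the bad layout; with a correct layout they start at 0x0100 or above.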
One solution is to reinstall binutils: current versions have this bug fixed. Another solution is to edit your linker scripts: on Ubuntu these are probably in /usr/lib/ldscripts. For the ATmega168/328 the script that needs to be edited is avr5.x, but you should really edit all of them, otherwise you could run into this bug on other AVR parts. The change that needs to be made is the following:
.data : AT (ADDR (.text) + SIZEOF (.text))
{
PROVIDE (__data_start = .) ;
- *(.data)
+ KEEP(*(.data))
So replace the line *(.data) with KEEP(*(.data)). This ensures that the .data section is not discarded, and consequently the SRAM variable addresses start at 0x0100.