Why does the kernel function clear_huge_page still get called? - c++11

My OS is the latest CentOS 7. I set isolcpus and transparent_hugepage=never default_hugepagesz=1G hugepagesz=1G on the kernel command line to isolate CPUs and enable 1 GB huge pages, and I moved all the IRQs that can be bound onto specific CPU cores.
My init function for using huge pages:
bool initSharedMemory(const char* nm, std::size_t bytes) {
    int tid = ftok(nm, 'R');
    if (tid == -1) {
        // ftok reports the failure reason via errno, not its return value
        fprintf(stderr, "ftok error: %s\n", strerror(errno));
        return false;
    }
    mSHMID = shmget(tid, bytes, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
    if (mSHMID < 0) {
        // likewise, shmget sets errno on failure
        fprintf(stderr, "shmget error: %s\n", strerror(errno));
        return false;
    }
    mMemory = (uint8_t*)shmat(mSHMID, nullptr, 0);
    if (mMemory == (uint8_t*)(-1)) {
        fprintf(stderr, "shmat error: %s\n", strerror(errno));
        return false;
    }
    mSize = bytes;
    return true;
}
I run my program under taskset and perf record, and the profile shows some kernel symbols related to huge-page clearing:
99.73% test_stx2 test_stx2 [.] Stx::run [.] Stx::run -
0.03% test_stx2 [kernel.kallsyms] [k] clear_page [k] clear_page_c_e -
0.03% test_stx2 [kernel.kallsyms] [k] clear_huge_page [k] _cond_resched -
0.03% test_stx2 [kernel.kallsyms] [k] _cond_resched [k] clear_huge_page -
0.03% test_stx2 [kernel.kallsyms] [k] clear_huge_page [k] clear_page -
0.02% test_stx2 [kernel.kallsyms] [k] clear_page_c_e [k] clear_huge_page -
0.02% test_stx2 [kernel.kallsyms] [k] clear_huge_page [k] clear_huge_page -
0.02% test_stx2 test_stx2 [.] std::__fill_n_a<unsigned long*, unsigned long, unsigned long> [.] std::__fill_n_a<unsigned long*, unsigned long, unsigned long> -
0.00% test_stx2 [kernel.kallsyms] [.] retint_userspace_restore_args [.] retint_userspace_restore_args -
0.00% test_stx2 [kernel.kallsyms] [k] task_tick_fair [k] task_tick_fair -
0.00% test_stx2 [kernel.kallsyms] [k] irq_exit [k] irq_exit -
0.00% test_stx2 [kernel.kallsyms] [k] zone_statistics [k] zone_statistics -
0.00% test_stx2 [kernel.kallsyms] [k] do_softirq [k] do_softirq -
0.00% test_stx2 [kernel.kallsyms] [k] __update_cpu_load [k] __update_cpu_load -
0.00% test_stx2 [kernel.kallsyms] [k] scheduler_tick [k] scheduler_tick -
0.00% test_stx2 [kernel.kallsyms] [k] trigger_load_balance [k] trigger_load_balance -
0.00% test_stx2 [kernel.kallsyms] [k] __hrtimer_run_queues [k] __hrtimer_run_queues -
0.00% test_stx2 [kernel.kallsyms] [k] run_posix_cpu_timers [k] run_posix_cpu_timers -
0.00% test_stx2 [kernel.kallsyms] [k] zone_statistics [k] __inc_zone_state -
0.00% test_stx2 [kernel.kallsyms] [.] retint_userspace_restore_args [.] irq_return -
0.00% test_stx2 [kernel.kallsyms] [k] __inc_zone_state [k] zone_statistics -
0.00% test_stx2 [kernel.kallsyms] [.] handle_mm_fault [.] handle_mm_fault -
0.00% test_stx2 [kernel.kallsyms] [.] __do_page_fault [.] __do_page_fault -
0.00% test_stx2 [kernel.kallsyms] [k] __x86_indirect_thunk_rax [k] read_tsc
I cannot understand why these clear_page operations occur.

Related

Odd-dim vs even-dim vectors allocated in Julia

Why do odd dimensions seem not to allocate extra memory when initializing Vectors? Example:
julia> for N in 1:10 @time a = [k for k in 1:N] end
0.000002 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 112 bytes)
0.000000 seconds (1 allocation: 112 bytes)
0.000000 seconds (1 allocation: 128 bytes)
0.000000 seconds (1 allocation: 128 bytes)
0.000000 seconds (1 allocation: 144 bytes)
The code deciding on array allocation has several branches. You can see it here. In general, what you see is byte-alignment of allocations, which is governed by two constants defined here.
Now the code is more complicated than this, but in your case, for very small arrays, the alignment is 16 bytes; since an Int64 takes 8 bytes, every two consecutive sizes end up with the same allocated size.
To read more about memory alignment see here.
We can check that if you allocate, e.g., an Int32 vector instead of an Int64 vector, every four consecutive sizes get the same allocation. Similarly, for Int16 every eight sizes share the same allocation:
julia> for N in 1:12 @time a = Int32[k for k in 1:N] end
0.000004 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000001 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000001 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000001 seconds (1 allocation: 96 bytes)
0.000001 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 112 bytes)
0.000000 seconds (1 allocation: 112 bytes)
julia> for N in 1:20 @time a = Int16[k for k in 1:N] end
0.000003 seconds (1 allocation: 64 bytes)
0.000001 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000001 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
(remember that this is for small arrays; for larger arrays Julia uses a different alignment, growing up to 64 bytes)

Interpreting the output of perf.data file using perf report

My C++ application was consuming a lot of CPU cycles, so I profiled it using the Linux perf profiler. The profiler generated a perf.data file, and I ran perf report.
The report is given below.
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 43K of event 'cpu-clock:uhH'
# Event count (approx.): 10966500000
#
# Overhead Command Shared Object Symbol
# ........ ............... ......................... ....................................................................................................................................................................................
#
59.19% ZManager libc-2.23.so [.] __mcount_internal
14.15% ZManager libc-2.23.so [.] _mcount
1.41% ZManager ZManager [.] std::vector<unsigned int, std::allocator<unsigned int> >::~vector
1.17% ZManager ZManager [.] std::vector<unsigned int, std::allocator<unsigned int> >::size
1.16% ZManager ZManager [.] std::vector<unsigned int, std::allocator<unsigned int> >::vector
0.46% ZManager ZManager [.] std::_Vector_base<unsigned int, std::allocator<unsigned int> >::_Vector_impl::_Vector_impl
0.45% ZManager ZManager [.] std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<unsigned int const*, std::vector<unsigned int, std::allocator<unsigned int> > >, unsigned int*>
0.40% ZManager libpthread-2.23.so [.] pthread_mutex_lock
0.39% ZManager ZManager [.] std::__miter_base<__gnu_cxx::__normal_iterator<unsigned int const*, std::vector<unsigned int, std::allocator<unsigned int> > > >
0.38% ZManager libc-2.23.so [.] _int_malloc
0.36% ZManager ZManager [.] std::__niter_base<__gnu_cxx::__normal_iterator<unsigned int const*, std::vector<unsigned int, std::allocator<unsigned int> > > >
0.34% ZManager ZManager [.] std::allocator<unsigned int>::allocator
0.34% ZManager ZManager [.] std::allocator<unsigned int>::~allocator
From the above it is clear that the majority of the CPU time is consumed by some functions in libc (which my program is calling).
It is shown as
59.19% ZManager libc-2.23.so [.] __mcount_internal
14.15% ZManager libc-2.23.so [.] _mcount
Does anyone know what __mcount_internal and _mcount mean?

Cannot interpret memory bandwidth numbers

I have written a benchmark to compute memory bandwidth:
#include <benchmark/benchmark.h>

double sum_array(double* v, long n)
{
    double s = 0;
    for (long i = 0; i < n; ++i) {
        s += v[i];
    }
    return s;
}

void BM_MemoryBandwidth(benchmark::State& state) {
    long n = state.range(0);
    double* v = (double*) malloc(state.range(0)*sizeof(double));
    for (auto _ : state) {
        benchmark::DoNotOptimize(sum_array(v, n));
    }
    free(v);
    state.SetComplexityN(state.range(0));
    state.SetBytesProcessed(int64_t(state.range(0))*int64_t(state.iterations())*sizeof(double));
}
BENCHMARK(BM_MemoryBandwidth)->RangeMultiplier(2)->Range(1<<5, 1<<23)->Complexity(benchmark::oN);
BENCHMARK_MAIN();
I compile with
g++-9 -masm=intel -fverbose-asm -S -g -O3 -ffast-math -march=native --std=c++17 -I/usr/local/include memory_bandwidth.cpp
This produces a bunch of moves from RAM and then some addpd instructions, which perf says are hot, so I go into the generated asm, remove them, and then assemble and link via
$ g++-9 -c memory_bandwidth.s -o memory_bandwidth.o
$ g++-9 memory_bandwidth.o -o memory_bandwidth.x -L/usr/local/lib -lbenchmark -lbenchmark_main -pthread -fPIC
At this point, I get the perf output that I expect: movement of data into xmm registers, increments of the pointer, and a jmp at the end of the loop.
All fine and well up to here. Now here's where things get weird:
I inquire of my hardware what the memory bandwidth is:
$ sudo lshw -class memory
*-memory
description: System Memory
physical id: 3c
slot: System board or motherboard
size: 16GiB
*-bank:1
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
vendor: AMI
physical id: 1
slot: ChannelA-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
So I should be getting at most 8 bytes * 2.4 GHz = 19.2 gigabytes/second.
But instead I get 48 gigabytes/second:
-------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_MemoryBandwidth/32 6.43 ns 6.43 ns 108045392 bytes_per_second=37.0706G/s
BM_MemoryBandwidth/64 11.6 ns 11.6 ns 60101462 bytes_per_second=40.9842G/s
BM_MemoryBandwidth/128 21.4 ns 21.4 ns 32667394 bytes_per_second=44.5464G/s
BM_MemoryBandwidth/256 47.6 ns 47.6 ns 14712204 bytes_per_second=40.0884G/s
BM_MemoryBandwidth/512 86.9 ns 86.9 ns 8057225 bytes_per_second=43.9169G/s
BM_MemoryBandwidth/1024 165 ns 165 ns 4233063 bytes_per_second=46.1437G/s
BM_MemoryBandwidth/2048 322 ns 322 ns 2173012 bytes_per_second=47.356G/s
BM_MemoryBandwidth/4096 636 ns 636 ns 1099074 bytes_per_second=47.9781G/s
BM_MemoryBandwidth/8192 1264 ns 1264 ns 553898 bytes_per_second=48.3047G/s
BM_MemoryBandwidth/16384 2524 ns 2524 ns 277224 bytes_per_second=48.3688G/s
BM_MemoryBandwidth/32768 5035 ns 5035 ns 138843 bytes_per_second=48.4882G/s
BM_MemoryBandwidth/65536 10058 ns 10058 ns 69578 bytes_per_second=48.5455G/s
BM_MemoryBandwidth/131072 20103 ns 20102 ns 34832 bytes_per_second=48.5802G/s
BM_MemoryBandwidth/262144 40185 ns 40185 ns 17420 bytes_per_second=48.6035G/s
BM_MemoryBandwidth/524288 80351 ns 80347 ns 8708 bytes_per_second=48.6171G/s
BM_MemoryBandwidth/1048576 160855 ns 160851 ns 4353 bytes_per_second=48.5699G/s
BM_MemoryBandwidth/2097152 321657 ns 321643 ns 2177 bytes_per_second=48.5787G/s
BM_MemoryBandwidth/4194304 648490 ns 648454 ns 1005 bytes_per_second=48.1915G/s
BM_MemoryBandwidth/8388608 1307549 ns 1307485 ns 502 bytes_per_second=47.8017G/s
BM_MemoryBandwidth_BigO 0.16 N 0.16 N
BM_MemoryBandwidth_RMS 1 % 1 %
What am I misunderstanding about memory bandwidth that has made my calculations come out wrong by more than a factor of 2?
(Also, this is kinda an insane workflow to empirically determine how much memory bandwidth I have. Is there a better way?)
Full asm for sum_array after removing add instructions:
_Z9sum_arrayPdl:
.LVL0:
.LFB3624:
.file 1 "example_code/memory_bandwidth.cpp"
.loc 1 5 1 view -0
.cfi_startproc
.loc 1 6 5 view .LVU1
.loc 1 7 5 view .LVU2
.LBB1545:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 24 is_stmt 0 view .LVU3
test rsi, rsi # n
jle .L7 #,
lea rax, -1[rsi] # tmp105,
cmp rax, 1 # tmp105,
jbe .L8 #,
mov rdx, rsi # bnd.299, n
shr rdx # bnd.299
sal rdx, 4 # tmp107,
mov rax, rdi # ivtmp.311, v
add rdx, rdi # _44, v
pxor xmm0, xmm0 # vect_s_10.306
.LVL1:
.p2align 4,,10
.p2align 3
.L5:
.loc 1 8 9 is_stmt 1 discriminator 2 view .LVU4
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 discriminator 2 view .LVU5
movupd xmm2, XMMWORD PTR [rax] # tmp115, MEM[base: _24, offset: 0B]
add rax, 16 # ivtmp.311,
.loc 1 8 11 discriminator 2 view .LVU6
cmp rax, rdx # ivtmp.311, _44
jne .L5 #,
movapd xmm1, xmm0 # tmp110, vect_s_10.306
unpckhpd xmm1, xmm0 # tmp110, vect_s_10.306
mov rax, rsi # tmp.301, n
and rax, -2 # tmp.301,
test sil, 1 # n,
je .L10 #,
.L3:
.LVL2:
.loc 1 8 9 is_stmt 1 view .LVU7
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 view .LVU8
addsd xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_3
.LVL3:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 5 view .LVU9
inc rax # i
.LVL4:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 24 view .LVU10
cmp rsi, rax # n, i
jle .L1 #,
.loc 1 8 9 is_stmt 1 view .LVU11
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 view .LVU12
addsd xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_6
.LVL5:
.loc 1 8 11 view .LVU13
ret
.LVL6:
.p2align 4,,10
.p2align 3
.L7:
.loc 1 8 11 view .LVU14
.LBE1545:
# example_code/memory_bandwidth.cpp:6: double s = 0;
.loc 1 6 12 view .LVU15
pxor xmm0, xmm0 # <retval>
.loc 1 10 5 is_stmt 1 view .LVU16
.LVL7:
.L1:
# example_code/memory_bandwidth.cpp:11: }
.loc 1 11 1 is_stmt 0 view .LVU17
ret
.p2align 4,,10
.p2align 3
.L10:
.loc 1 11 1 view .LVU18
ret
.LVL8:
.L8:
.LBB1546:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 15 view .LVU19
xor eax, eax # tmp.301
.LBE1546:
# example_code/memory_bandwidth.cpp:6: double s = 0;
.loc 1 6 12 view .LVU20
pxor xmm0, xmm0 # <retval>
jmp .L3 #
.cfi_endproc
.LFE3624:
.size _Z9sum_arrayPdl, .-_Z9sum_arrayPdl
.section .text.startup,"ax",#progbits
.p2align 4
.globl main
.type main, #function
Full output of lshw -class memory:
*-firmware
description: BIOS
vendor: American Megatrends Inc.
physical id: 0
version: 1.90
date: 10/21/2016
size: 64KiB
capacity: 15MiB
capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
*-memory
description: System Memory
physical id: 3c
slot: System board or motherboard
size: 16GiB
*-bank:0
description: [empty]
physical id: 0
slot: ChannelA-DIMM0
*-bank:1
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: CMU16GX4M2A2400C16
vendor: AMI
physical id: 1
serial: 00000000
slot: ChannelA-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:2
description: [empty]
physical id: 2
slot: ChannelB-DIMM0
*-bank:3
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: CMU16GX4M2A2400C16
vendor: AMI
physical id: 3
serial: 00000000
slot: ChannelB-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
Is the CPU relevant here? Well here's the specs:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Pentium(R) CPU G4400 # 3.30GHz
Stepping: 3
CPU MHz: 3168.660
CPU max MHz: 3300.0000
CPU min MHz: 800.0000
BogoMIPS: 6624.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust erms invpcid rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
The data produced by the clang compile is much more intelligible: performance decreases monotonically until it levels off at 19.8 GB/s once the vector gets much larger than cache (benchmark output plot not reproduced here).
From your hardware description it looks like you have two populated DIMM slots, one in each of two channels. This interleaves memory between the two DIMMs, so that memory accesses read from both chips. (One possibility is that bytes 0-7 are in DIMM1 and bytes 8-15 are in DIMM2, but this depends on the hardware implementation.) This doubles the memory bandwidth, because you're accessing two hardware chips instead of one.
Some systems support three or four channels, further increasing the maximum bandwidth.

Slow booting process after adding mem=16M in boot parameters

My linux-3.0 kernel was panicking while booting with ERROR: Failed to allocate 0x1000 bytes below 0x0. So I changed the bootargs and added the boot parameter mem=16M. Now it boots fine, but booting takes a long time. I have also tried higher mem values, but that does not help. Below are the logs:
Machine: KZM9D
arm_add_memory: 0 0x40000000 0x1000000
Memory policy: ECC disabled, Data cache writealloc
bootmem_init: max_low=0x266240, max_high=0x266240
<6>Section 8256 and 8250 (node 0)<c> have a circular dependency on usemap and pgdat allocations
<7>On node 0 totalpages: 0
<7>On node 1 totalpages: 0
<7>On node 2 totalpages: 0
<7>On node 3 totalpages: 0
<7>On node 4 totalpages: 0
<7>On node 5 totalpages: 0
<7>On node 6 totalpages: 0
<7>On node 7 totalpages: 0
high_memory: e0000000
Zone PFN ranges:
Normal 0x00040000 -> 0x00041000
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0: 0x00040000 -> 0x00041000
<7>On node 0 totalpages: 4096
<7> Normal zone: 36 pages used for memmap
<7> Normal zone: 0 pages reserved
<7> Normal zone: 4060 pages, LIFO batch:0
<6>boottime: reserved memory at 0x40002000 size 0x2000
mm_init_owner
<6>PERCPU: Embedded 8 pages/cpu #c087f000 s9824 r8192 d14752 u32768
<7>pcpu-alloc: s9824 r8192 d14752 u32768 alloc=8*4096
<7>pcpu-alloc: [0] 0 [0] 1
build_all_zonelists
Built 1 zonelists in Node order, mobility grouping on. Total pages: 4060
Policy zone: Normal
page_alloc_init
<5>Kernel command line: console=ttyS1,115200n8 root=/dev/nfs ip=9.8.7.6 nfsroot=1.2.3.7:/tftpboot/arm/ rootwait rw mem=16M
parse_early_param
<6>PID hash table entries: 64 (order: -4, 256 bytes)
<6>Dentry cache hash table entries: 2048 (order: 2, 24576 bytes)
<6>Inode-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>Memory: 16MB = 16MB total
<5>Memory: 7824k/7824k available, 8560k reserved, 0K highmem
<5>Virtual kernel memory layout:
vector : 0xffff0000 - 0xffff1000 ( 4 kB)
fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB)
DMA : 0xffc00000 - 0xffe00000 ( 2 MB)
vmalloc : 0xe0800000 - 0xf0000000 ( 248 MB)
lowmem : 0xc0000000 - 0xe0000000 ( 512 MB)
modules : 0xbf000000 - 0xc0000000 ( 16 MB)
.text : 0xc0008000 - 0xc0704024 (7153 kB)
.init : 0xc0705000 - 0xc0740660 ( 238 kB)
.data : 0xc0742000 - 0xc078dc18 ( 304 kB)
.bss : 0xc078dc18 - 0xc07f2950 ( 404 kB)
<6>Preemptible hierarchical RCU implementation.
<6>NR_IRQS:374

msctf/d3d11 crash on exit()

I have an application using DX11.
The debug build works well, but the release build crashes on exit().
The stack:
000007fef697d630()
user32.dll!DispatchHookA() + 0x72 bytes
user32.dll!CallHookWithSEH() + 0x27 bytes
user32.dll!__fnHkINLPMSG() + 0x59 bytes
ntdll.dll!KiUserCallbackDispatcherContinue()
user32.dll!NtUserPeekMessage() + 0xa bytes
user32.dll!PeekMessageW() + 0x89 bytes
msctf.dll!RemovePrivateMessage() + 0x52 bytes
msctf.dll!SYSTHREAD::DestroyMarshalWindow() - 0x1b7a bytes
msctf.dll!TF_UninitThreadSystem() + 0xc4 bytes
msctf.dll!CicFlsCallback() + 0x40 bytes
ntdll.dll!RtlProcessFlsData() + 0x84 bytes
ntdll.dll!LdrShutdownProcess() + 0xa9 bytes
ntdll.dll!RtlExitUserProcess() + 0x90 bytes
msvcr100.dll!doexit(int code=0, int quick=0, int retcaller=0) Line 621 + 0x11 bytes
If I call LoadLibrary("d3d11.dll") before calling exit(), there is no crash.
