Odd-dim vs even-dim vectors allocated in Julia - memory-management

Why don't odd dimensions seem to allocate extra memory when initializing Vectors? Example:
julia> for N in 1:10 @time a = [k for k in 1:N] end
0.000002 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 112 bytes)
0.000000 seconds (1 allocation: 112 bytes)
0.000000 seconds (1 allocation: 128 bytes)
0.000000 seconds (1 allocation: 128 bytes)
0.000000 seconds (1 allocation: 144 bytes)

The code deciding on array allocation has several branches. You can see it here. In general, what you are seeing is byte-alignment of allocations, which is governed by two constants defined here.
Now the code is more complicated than this, but in your case, for very small arrays, the alignment is 16 bytes. Since an Int64 takes 8 bytes, every two consecutive lengths get the same allocated size.
To read more about memory alignment see here.
We can check that if, e.g., you allocate an Int32 vector instead of an Int64 vector, every four lengths get the same allocation. Similarly, for Int16 every eight lengths get the same allocation:
julia> for N in 1:12 @time a = Int32[k for k in 1:N] end
0.000004 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000001 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000001 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000001 seconds (1 allocation: 96 bytes)
0.000001 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 112 bytes)
0.000000 seconds (1 allocation: 112 bytes)
julia> for N in 1:20 @time a = Int16[k for k in 1:N] end
0.000003 seconds (1 allocation: 64 bytes)
0.000001 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 64 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 80 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000001 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
0.000000 seconds (1 allocation: 96 bytes)
(Remember that this holds for small arrays; for larger arrays Julia uses a different alignment, growing up to 64 bytes.)
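The rounding described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Julia's actual allocator: `padded_size` and `run_lengths` are made-up helpers, and the 56-byte `header_bytes` is an assumed overhead picked so that the Int64 and Int32 figures in the listings come out. Only the 16-byte alignment is taken from the answer.

```python
def padded_size(n_elems, elem_bytes, header_bytes=56, align=16):
    """Round header + payload up to the next multiple of `align`.

    header_bytes=56 is an assumption chosen to match the listings above,
    not a value read from the Julia source.
    """
    raw = header_bytes + n_elems * elem_bytes
    return (raw + align - 1) // align * align


def run_lengths(sizes):
    """Count how many consecutive lengths share each allocation size."""
    runs = []
    for s in sizes:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(size, count) for size, count in runs]


int64_sizes = [padded_size(n, 8) for n in range(1, 11)]
int32_sizes = [padded_size(n, 4) for n in range(1, 13)]
print(int64_sizes)
print(run_lengths(int32_sizes))
```

With 16-byte alignment, a run covers 16 / elem_bytes consecutive lengths: 2 for 8-byte elements and 4 for 4-byte elements.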


How to create a mapped device with a specific sector size?

I have implemented my own device mapper target and I am able to create a mapped device with dmsetup create command.
The problem is that the sector size for this device becomes the default 512 bytes, and I would like to change it to 4096 bytes similar to dm-verity targets.
For instance, below is the sector size for a dm-verity device, and fdisk reports 4096 bytes:
$ sudo fdisk -l /dev/mapper/dmv
Disk /dev/mapper/dmv: 8 KiB, 8192 bytes, 2 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Below is the sector size for my own target, and fdisk reports 512 bytes:
$ sudo fdisk -l /dev/mapper/my-target
Disk /dev/mapper/my-target: 8 KiB, 8192 bytes, 16 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
How can I set the sector size for my own device mapper target? I couldn't find where this is done in libdevmapper or cryptsetup source for the dm-verity case.
Cheers!

Why are no dts pts written to my mp4 container

Following up on my (self-answered) question, Muxing AVPackets into mp4 file - revisited, I have to ask: why are no pts/dts values written to the resulting mp4 container?
I examined the container file with the tool MediaInfo.
I observe that only the very first frame contains a value for pts in the container. After that, pts is no longer shown at all, but dts is, with a value of all zeros.
This is the output from MediaInfo for the first 3 frames:
0000A2 slice_layer_without_partitioning (IDR) - 0 (0x0) - Frame 0 - slice_type I - frame_num 0 - DTS 00:00:00.000 - PTS 00:00:00.017 (141867 bytes)
0000A2 Header (5 bytes)
0000A2 zero_byte: 0 (0x00)
0000A3 start_code_prefix_one_3bytes: 1 (0x000001)
0000A6 nal_ref_idc: 3 (0x3) - (2 bits)
0000A6 nal_unit_type: 5 (0x05) - (5 bits)
0000A7 slice_header (3 bytes)
0000A7 first_mb_in_slice: 0 (0x0)
0000A7 slice_type: 7 (0x07) - I
0000A8 pic_parameter_set_id: 0 (0x0)
0000A8 frame_num: 0 (0x0)
0000A8 idr_pic_id: 0 (0x0)
0000A8 no_output_of_prior_pics_flag: No
0000A8 long_term_reference_flag: No
0000A9 slice_qp_delta: -5 (0xFFFFFFFB)
0000AA disable_deblocking_filter_idc: 0 (0x0)
0000AA slice_alpha_c0_offset_div2: 0 (0x0)
0000AA slice_beta_offset_div2: 0 (0x0)
0000AA slice_data (141856 bytes)
0000AA (ToDo): (Data)
022ACD slice_layer_without_partitioning (IDR) - 0 (0x0) - Frame 0 - slice_type I - frame_num 0 - DTS 00:00:00.000 - PTS 00:00:00.017 - first_mb_in_slice 8040 (2248 bytes)
022ACD Header (5 bytes)
022ACD zero_byte: 0 (0x00)
022ACE start_code_prefix_one_3bytes: 1 (0x000001)
022AD1 nal_ref_idc: 3 (0x3) - (2 bits)
022AD1 nal_unit_type: 5 (0x05) - (5 bits)
022AD2 slice_header (6 bytes)
022AD2 first_mb_in_slice: 8040 (0x001F68)
022AD5 slice_type: 7 (0x07) - I
022AD6 pic_parameter_set_id: 0 (0x0)
022AD6 frame_num: 0 (0x0)
022AD6 idr_pic_id: 0 (0x0)
022AD6 no_output_of_prior_pics_flag: No
022AD6 long_term_reference_flag: No
022AD7 slice_qp_delta: -5 (0xFFFFFFFB)
022AD8 disable_deblocking_filter_idc: 0 (0x0)
022AD8 slice_alpha_c0_offset_div2: 0 (0x0)
022AD8 slice_beta_offset_div2: 0 (0x0)
022AD8 slice_data (2237 bytes)
022AD8 (ToDo): (Data)
023395 1 (36212 bytes)
023395 slice_layer_without_partitioning (non-IDR) - 2 (0x2) - Frame 1 - slice_type P - frame_num 1 - DTS 00:00:00.000 (36017 bytes)
023395 Header (5 bytes)
023395 zero_byte: 0 (0x00)
023396 start_code_prefix_one_3bytes: 1 (0x000001)
023399 nal_ref_idc: 3 (0x3) - (2 bits)
023399 nal_unit_type: 1 (0x01) - (5 bits)
02339A slice_header (3 bytes)
02339A first_mb_in_slice: 0 (0x0)
02339A slice_type: 5 (0x5) - P
02339A pic_parameter_set_id: 0 (0x0)
02339A frame_num: 1 (0x1)
02339B num_ref_idx_active_override_flag (0 bytes)
02339B num_ref_idx_active_override_flag: Yes
02339B num_ref_idx_l0_active_minus1: 0 (0x0)
02339B ref_pic_list_modification_flag_l0: No
02339B adaptive_ref_pic_marking_mode_flag: No
02339C cabac_init_idc: 0 (0x0)
02339C slice_qp_delta: -3 (0xFFFFFFFD)
02339C disable_deblocking_filter_idc: 0 (0x0)
02339C slice_alpha_c0_offset_div2: 0 (0x0)
02339D slice_beta_offset_div2: 0 (0x0)
02339D slice_data (36012 bytes)
02339D (ToDo): (Data)
02C046 slice_layer_without_partitioning (non-IDR) - 2 (0x2) - Frame 1 - slice_type P - frame_num 1 - DTS 00:00:00.000 - first_mb_in_slice 8040 (195 bytes)
02C046 Header (5 bytes)
02C046 zero_byte: 0 (0x00)
02C047 start_code_prefix_one_3bytes: 1 (0x000001)
02C04A nal_ref_idc: 3 (0x3) - (2 bits)
02C04A nal_unit_type: 1 (0x01) - (5 bits)
02C04B slice_header (6 bytes)
02C04B first_mb_in_slice: 8040 (0x001F68)
02C04E slice_type: 5 (0x5) - P
02C04E pic_parameter_set_id: 0 (0x0)
02C04E frame_num: 1 (0x1)
02C04F num_ref_idx_active_override_flag (0 bytes)
02C04F num_ref_idx_active_override_flag: Yes
02C04F num_ref_idx_l0_active_minus1: 0 (0x0)
02C04F ref_pic_list_modification_flag_l0: No
02C04F adaptive_ref_pic_marking_mode_flag: No
02C050 cabac_init_idc: 0 (0x0)
02C050 slice_qp_delta: -3 (0xFFFFFFFD)
02C050 disable_deblocking_filter_idc: 0 (0x0)
02C050 slice_alpha_c0_offset_div2: 0 (0x0)
02C051 slice_beta_offset_div2: 0 (0x0)
02C051 slice_data (190 bytes)
02C051 (ToDo): (Data)
02C109 1 (26280 bytes)
02C109 slice_layer_without_partitioning (non-IDR) - 4 (0x4) - Frame 2 - slice_type P - frame_num 2 - DTS 00:00:00.000 (26157 bytes)
02C109 Header (5 bytes)
02C109 zero_byte: 0 (0x00)
02C10A start_code_prefix_one_3bytes: 1 (0x000001)
02C10D nal_ref_idc: 3 (0x3) - (2 bits)
02C10D nal_unit_type: 1 (0x01) - (5 bits)
02C10E slice_header (3 bytes)
02C10E first_mb_in_slice: 0 (0x0)
02C10E slice_type: 5 (0x5) - P
02C10E pic_parameter_set_id: 0 (0x0)
02C10E frame_num: 2 (0x2)
02C10F num_ref_idx_active_override_flag (0 bytes)
02C10F num_ref_idx_active_override_flag: Yes
02C10F num_ref_idx_l0_active_minus1: 0 (0x0)
02C10F ref_pic_list_modification_flag_l0: No
02C10F adaptive_ref_pic_marking_mode_flag: No
02C110 cabac_init_idc: 0 (0x0)
02C110 slice_qp_delta: -2 (0xFFFFFFFE)
02C110 disable_deblocking_filter_idc: 0 (0x0)
02C110 slice_alpha_c0_offset_div2: 0 (0x0)
02C111 slice_beta_offset_div2: 0 (0x0)
02C111 slice_data (26152 bytes)
02C111 (ToDo): (Data)
032736 slice_layer_without_partitioning (non-IDR) - 4 (0x4) - Frame 2 - slice_type P - frame_num 2 - DTS 00:00:00.000 - first_mb_in_slice 8040 (123 bytes)
032736 Header (5 bytes)
032736 zero_byte: 0 (0x00)
032737 start_code_prefix_one_3bytes: 1 (0x000001)
03273A nal_ref_idc: 3 (0x3) - (2 bits)
03273A nal_unit_type: 1 (0x01) - (5 bits)
03273B slice_header (6 bytes)
03273B first_mb_in_slice: 8040 (0x001F68)
03273E slice_type: 5 (0x5) - P
03273E pic_parameter_set_id: 0 (0x0)
03273E frame_num: 2 (0x2)
03273F num_ref_idx_active_override_flag (0 bytes)
03273F num_ref_idx_active_override_flag: Yes
03273F num_ref_idx_l0_active_minus1: 0 (0x0)
03273F ref_pic_list_modification_flag_l0: No
03273F adaptive_ref_pic_marking_mode_flag: No
032740 cabac_init_idc: 0 (0x0)
032740 slice_qp_delta: -2 (0xFFFFFFFE)
032740 disable_deblocking_filter_idc: 0 (0x0)
032740 slice_alpha_c0_offset_div2: 0 (0x0)
032741 slice_beta_offset_div2: 0 (0x0)
032741 slice_data (118 bytes)
032741 (ToDo): (Data)
0327B1 1 (21125 bytes)
It goes on like that, even though I set pts and dts. The values may not be correct yet (I do some calculations like (1 / framerate) * FrameNumber), but I would expect at least some numbers in pts and dts when I set the corresponding fields in the avPacket structure and write it to the file via av_interleaved_write_frame(outFmtCtx, &avPacket).
What could be wrong here?
Edit:
(Please see the comments below for the download link to my test data and source file.)
One thing that bugs me: when I compare the MediaInfo output for my file with that for the file generated by muxing.c, the header of the muxing.c file already reports a duration of 9960 ms, whereas mine reports only 40 ms.
muxing.c also calls avformat_write_header before even a single frame is drawn. Yes, I suppose the header is updated when either av_interleaved_write_frame or av_write_trailer is called, but I do not understand the mechanics behind it at all.
Maybe somebody can enlighten me with some background information of any kind.
Additionally, I think it may be necessary to extract the SPS and PPS from my raw data (preceding the I-slice) and pass them as extra data to the avformat_write_header call. But I cannot figure out whether I have to do that at all and, if so, how to do it.
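As a side note on the (1 / framerate) * FrameNumber calculation mentioned above: FFmpeg expects AVPacket pts/dts as integer tick counts in the stream's time_base, not in seconds, and in integer arithmetic 1 / framerate truncates to zero. Whether that is what happens in the code in question cannot be told from the post, so the Python sketch below only illustrates the arithmetic. The 60 fps rate and the 1/90000 time base are assumptions for illustration, and `frame_to_ticks` is a hypothetical helper, not an FFmpeg function.

```python
from fractions import Fraction


def frame_to_ticks(frame_number, fps, time_base):
    """Convert a frame index to an integer timestamp in time_base units."""
    seconds = Fraction(frame_number, fps)   # exact, no rounding
    return int(seconds / time_base)


fps = 60                          # assumed frame rate
time_base = Fraction(1, 90000)    # assumed stream time base (seconds per tick)

# Integer arithmetic, as (1 / framerate) * FrameNumber would behave in C
# when both operands are integers: every frame gets timestamp 0.
truncated = [(1 // fps) * n for n in range(4)]

# Exact arithmetic expressed in time_base ticks.
ticks = [frame_to_ticks(n, fps, time_base) for n in range(4)]

print(truncated)  # [0, 0, 0, 0]
print(ticks)      # [0, 1500, 3000, 4500]
```

In C the equivalent conversion between time bases is usually done with av_rescale_q rather than by hand.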

How to extract USB device type and its drive letter from ETW

So I'm writing a simple ETW logger to provide a trigger-event state machine that wakes up whenever a new USB device is connected. Using Microsoft Message Analyzer, I managed to trace and receive USB "New USB Device Information" events using the following filter: Microsoft_Windows_USB_USBHUB3.Summary == "New USB Device Information"
However, after examining the packet, I see no way to differentiate between USB mass storage devices and other USB devices (e.g. a camera).
Available values from the trace:
Name Value Bit Offset Bit Length Type
pointerValue 132972247379928 64 64 UInt64
Fid_HubDevice 0x000078F011FC3CC8 0 64 Etw.EtwPointer
pointerValue 132972489227464 0 64 UInt64
Fid_UsbDevice 0x000078F00391EFD8 64 64 Etw.EtwPointer
Fid_PortNumber 1 128 32 UInt32
Fid_DeviceDescription USB Mass Storage Device 160 384 String
Fid_DeviceInterfacePath \??\USB#VID_0781&PID_5567#200602669107DD62F0E0#{a5dcbf10-6530-11d2-901f-00c04fb951ed} 544 1376 String
Fid_DeviceDescriptor fid_DeviceDescriptor{Fid_bLength=18,Fid_bDescriptorType=1,Fid_bcdUSB=512,Fid_bDeviceClass=0,Fid_bDeviceSubClass=0,Fid_bDeviceProtocol=0,Fid_bMaxPacketSize0=64,Fid_idVendor=1921,Fid_idProduct=21863,Fid_bcdDevice=295,Fid_iManufacturer=1,Fid_iProduct=2,Fid_iSerialNumber=3,Fid_bNumConfigurations=1} 1920 144 Microsoft_Windows_USB_USBHUB3.fid_DeviceDescriptor
Fid_bLength 18 1920 8 Byte
Fid_bDescriptorType 1 1928 8 Byte
Fid_bcdUSB 0x0200 1936 16 UInt16
Fid_bDeviceClass 0 1952 8 Byte
Fid_bDeviceSubClass 0 1960 8 Byte
Fid_bDeviceProtocol 0 1968 8 Byte
Fid_bMaxPacketSize0 64 1976 8 Byte
Fid_idVendor 0x0781 1984 16 UInt16
Fid_idProduct 0x5567 2000 16 UInt16
Fid_bcdDevice 0x0127 2016 16 UInt16
Fid_iManufacturer 1 2032 8 Byte
Fid_iProduct 2 2040 8 Byte
Fid_iSerialNumber 3 2048 8 Byte
Fid_bNumConfigurations 1 2056 8 Byte
Fid_ConfigurationDescriptorLength 0x0020 2064 16 UInt16
Fid_ConfigurationDescriptor [9,2,32,0,1,1,0,128,100,9,4,0,0,2,8,6,80,0,7,5,129,2,0,2,0,7,5,2,2,0,2,1] 2080 256 ArrayValue`1
Fid_PdoName \Device\USBPDO-13 2336 288 String
Fid_Suspended 1 2624 8 Byte
Fid_PortPathDepth 1 2632 32 UInt32
Fid_PortPath [1,0,0,0,0,0] 2664 192 ArrayValue`1
Fid_PciBus 0x00000000 2856 32 UInt32
Fid_PciDevice 0x00000014 2888 32 UInt32
Fid_PciFunction 0x00000000 2920 32 UInt32
Fid_PciVendorId 0x00008086 2952 32 UInt32
Fid_PciDeviceId 0x0000A12F 2984 32 UInt32
Fid_PciRevisionId 0x00000031 3016 32 UInt32
Fid_CurrentWdfPowerDeviceState 0x00000005 3048 32 UInt32
Fid_Usb20LpmStatus 0x00000006 3080 32 UInt32
Fid_ControllerParentBusType ControllerParentBusTypePci 3112 32 MapControllerParentBusType
Fid_AcpiVendorId NULL 3144 40 String
Fid_AcpiDeviceId NULL 3184 40 String
Fid_AcpiRevisionId NULL 3224 40 String
Fid_PortFlagAcpiUpcValid 1 3264 8 Byte
Fid_PortConnectorType 255 3272 8 Byte
Fid_UcmConnectorId 0x0000000000000001 3280 64 UInt64
EtwKeywords Keywords{StandardKeywords=WindowsEtwKeywords{EventlogClassic=False,CorrelationHint=False,AuditSuccess=False,AuditFailure=False,SQM=False,WDIDiag=False,WDIContext=False,Reserved=False},Default=True,USBError=False,IRP=False,Power=False,PnP=True,Performance=False,HeadersBusTrace=False,PartialDataBusTrace=False,FullDataBusTrace=False,StateMachine=False,Enumeration=False,VerifyDriver=False,HWVerifyHost=False,HWVerifyHub=False,HWVerifyDevice=False,Rundown=False,Device=False,Hub=False,Compat=False,ControllerCommand=False,MsMeasures=True} Microsoft_Windows_USB_USBHUB3.Keywords
Limitations:
No string comparisons
Must use ETW mechanism
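One string-free way to tell device types apart, offered here only as a sketch, is to look at the raw Fid_ConfigurationDescriptor bytes in the trace. Fid_bDeviceClass is 0, which in USB means the class is declared per interface, and the USB specification assigns interface class 0x08 to mass storage. The Python sketch below walks the descriptor using the standard bLength/bDescriptorType layout. The function and constant names are made up for illustration, and it does not address finding the drive letter.

```python
USB_DT_INTERFACE = 0x04        # bDescriptorType of an interface descriptor
USB_CLASS_MASS_STORAGE = 0x08  # bInterfaceClass for mass storage


def interface_classes(config_descriptor):
    """Return (class, subclass, protocol) for every interface descriptor."""
    data = bytes(config_descriptor)
    found = []
    offset = 0
    while offset + 2 <= len(data):
        length, dtype = data[offset], data[offset + 1]
        if length < 2:
            break  # malformed descriptor, stop instead of looping forever
        if dtype == USB_DT_INTERFACE and length >= 9 and offset + 9 <= len(data):
            found.append((data[offset + 5], data[offset + 6], data[offset + 7]))
        offset += length
    return found


def is_mass_storage(config_descriptor):
    """True if any interface in the descriptor declares the mass storage class."""
    return any(cls == USB_CLASS_MASS_STORAGE
               for cls, _, _ in interface_classes(config_descriptor))


# Fid_ConfigurationDescriptor from the trace above.
trace_bytes = [9, 2, 32, 0, 1, 1, 0, 128, 100, 9, 4, 0, 0, 2, 8, 6, 80, 0,
               7, 5, 129, 2, 0, 2, 0, 7, 5, 2, 2, 0, 2, 1]
print(interface_classes(trace_bytes))  # [(8, 6, 80)]
print(is_mass_storage(trace_bytes))    # True
```

For the trace above this reports class 8, subclass 6, protocol 80 (0x50) on the single interface, using integer comparisons only.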

Measuring memory access time x86

I am trying to measure cached vs. non-cached memory access time, and the results confuse me.
Here is the code:
#include <stdio.h>
#include <string.h>   /* memset */
#include <x86intrin.h>
#include <stdint.h>

#define SIZE 32*1024

char arr[SIZE];

int main()
{
    char *addr;
    unsigned int dummy;
    uint64_t tsc1, tsc2;
    unsigned i;
    volatile char val;

    memset(arr, 0x0, SIZE);
    /* flush every cache line of the array */
    for (addr = arr; addr < arr + SIZE; addr += 64) {
        _mm_clflush((void *) addr);
    }
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");

    /* (1) first pass: array was just flushed */
    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(1) tsc: %llu\n", tsc2 - tsc1);

    /* (2) second pass: array should now be cached */
    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(2) tsc: %llu\n", tsc2 - tsc1);

    return 0;
}
The output:
(1) tsc: 451248
(2) tsc: 449568
I expected that the first value would be much larger, because the caches were invalidated by clflush in case (1).
Info about my CPU (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz) caches:
Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true
Code disassembly between two rdtscp instructions
400614: 0f 01 f9 rdtscp
400617: 89 ce mov %ecx,%esi
400619: 48 8b 4d d8 mov -0x28(%rbp),%rcx
40061d: 89 31 mov %esi,(%rcx)
40061f: 48 c1 e2 20 shl $0x20,%rdx
400623: 48 09 d0 or %rdx,%rax
400626: 48 89 45 c0 mov %rax,-0x40(%rbp)
40062a: c7 45 b4 00 00 00 00 movl $0x0,-0x4c(%rbp)
400631: eb 0d jmp 400640 <main+0x8a>
400633: 8b 45 b4 mov -0x4c(%rbp),%eax
400636: 8a 80 80 10 60 00 mov 0x601080(%rax),%al
40063c: 83 45 b4 01 addl $0x1,-0x4c(%rbp)
400640: 81 7d b4 ff 7f 00 00 cmpl $0x7fff,-0x4c(%rbp)
400647: 76 ea jbe 400633 <main+0x7d>
400649: 48 8d 45 b0 lea -0x50(%rbp),%rax
40064d: 48 89 45 e0 mov %rax,-0x20(%rbp)
400651: 0f 01 f9 rdtscp
It looks like I am missing or misunderstanding something. Could you suggest what?
mov %0, %%al is so slow (one cache line per 64 clocks, or per 32 clocks on Sandybridge specifically (not Haswell or later)) that you might bottleneck on that whether or not your loads are ultimately coming from DRAM or L1D.
Only every 64-th load will miss in cache, because you're taking full advantage of spatial locality with your tiny byte-load loop. If you actually wanted to test how fast the cache can refill after flushing an L1D-sized block, you should use a SIMD movdqa loop, or just byte loads with a stride of 64. (You only need to touch one byte per cache line).
To avoid the false dependency on the old value of RAX, you should use movzbl %0, %eax. This will let Sandybridge and later (or AMD since K8) use their full load throughput of 2 loads per clock to keep the memory pipeline closer to full. Multiple cache misses can be in flight at once: Intel CPU cores have 10 LFBs (line fill buffers) for lines to/from L1D, or 16 Superqueue entries for lines from L2 to off-core. See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. (Many-core Xeon chips have worse single-thread memory bandwidth than desktops/laptops.)
But your bottleneck is far worse than that!
You compiled with optimizations disabled, so your loop uses addl $0x1,-0x4c(%rbp) for the loop counter, which gives you at least a 6-cycle loop-carried dependency chain. (Store/reload store-forwarding latency + 1 cycle for the ALU add.) http://agner.org/optimize/
(Maybe even higher because of resource conflicts for the load port. i7-720 is a Nehalem microarchitecture, so there's only one load port.)
This definitely means your loop doesn't bottleneck on cache misses, and will probably run about the same speed whether you used clflush or not.
Also note that rdtsc counts reference cycles, not core clock cycles, i.e. it will always count at 1.6GHz on your 1.6GHz CPU, regardless of whether the CPU is running slower (powersave) or faster (Turbo). Control for this with a warm-up loop.
You also didn't declare a clobber on eax, so the compiler isn't expecting your code to modify rax. You end up with mov 0x601080(%rax),%al. But gcc reloads rax from memory every iteration, and doesn't use the rax that you modify, so you aren't actually skipping around in memory like you might be if you'd compiled with optimizations.
Hint: use volatile char * if you want to get the compiler to actually load, and not optimize it to fewer wider loads. You don't need inline asm for this.
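The numbers in the question fit this explanation. The following Python sketch only redoes the arithmetic on the two TSC readings, the array size, and the cache line size from the post, to show the per-iteration cost and how few loads can miss at all. The variable and function names are illustrative.

```python
SIZE = 32 * 1024        # bytes in arr, one byte load per iteration
LINE = 64               # cache line size reported for this CPU

tsc_flushed = 451248    # reading (1), after clflush
tsc_warm = 449568       # reading (2), array already cached


def cycles_per_iteration(tsc_delta, iterations=SIZE):
    """Average reference cycles spent per loop iteration."""
    return tsc_delta / iterations


lines_touched = SIZE // LINE            # at most this many loads can miss
miss_fraction = lines_touched / SIZE    # share of all loads

print(round(cycles_per_iteration(tsc_flushed), 2))
print(round(cycles_per_iteration(tsc_warm), 2))
print(lines_touched, miss_fraction)
print((tsc_flushed - tsc_warm) / lines_touched)
```

Both passes cost roughly 13.7 reference cycles per iteration, only 512 of the 32768 loads (1 in 64) touch a new cache line, and the two totals differ by about 3 reference cycles per line touched.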

Slow booting process after adding mem=16M in boot parameters

My linux-3.0 kernel was panicking while booting with the message ERROR: Failed to allocate 0x1000 bytes below 0x0. So I changed the bootargs and added the boot parameter mem=16M. Now it boots fine, but it takes a long time to boot. I have also tried higher mem values, but they do not work. Below are the logs:
`Machine: KZM9D
arm_add_memory: 0 0x40000000 0x1000000
Memory policy: ECC disabled, Data cache writealloc
bootmem_init: max_low=0x266240, max_high=0x266240
<6>Section 8256 and 8250 (node 0)<c> have a circular dependency on usemap and pgdat allocations
<7>On node 0 totalpages: 0
<7>On node 1 totalpages: 0
<7>On node 2 totalpages: 0
<7>On node 3 totalpages: 0
<7>On node 4 totalpages: 0
<7>On node 5 totalpages: 0
<7>On node 6 totalpages: 0
<7>On node 7 totalpages: 0
high_memory: e0000000
Zone PFN ranges:
Normal 0x00040000 -> 0x00041000
Movable zone start PFN for each node
early_node_map[1] active PFN ranges
0: 0x00040000 -> 0x00041000
<7>On node 0 totalpages: 4096
<7> Normal zone: 36 pages used for memmap
<7> Normal zone: 0 pages reserved
<7> Normal zone: 4060 pages, LIFO batch:0
<6>boottime: reserved memory at 0x40002000 size 0x2000
mm_init_owner
<6>PERCPU: Embedded 8 pages/cpu @c087f000 s9824 r8192 d14752 u32768
<7>pcpu-alloc: s9824 r8192 d14752 u32768 alloc=8*4096
<7>pcpu-alloc: [0] 0 [0] 1
build_all_zonelists
Built 1 zonelists in Node order, mobility grouping on. Total pages: 4060
Policy zone: Normal
page_alloc_init
<5>Kernel command line: console=ttyS1,115200n8 root=/dev/nfs ip=9.8.7.6 nfsroot=1.2.3.7:/tftpboot/arm/ rootwait rw mem=16M
parse_early_param
<6>PID hash table entries: 64 (order: -4, 256 bytes)
<6>Dentry cache hash table entries: 2048 (order: 2, 24576 bytes)
<6>Inode-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>Memory: 16MB = 16MB total
<5>Memory: 7824k/7824k available, 8560k reserved, 0K highmem
<5>Virtual kernel memory layout:
vector : 0xffff0000 - 0xffff1000 ( 4 kB)
fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB)
DMA : 0xffc00000 - 0xffe00000 ( 2 MB)
vmalloc : 0xe0800000 - 0xf0000000 ( 248 MB)
lowmem : 0xc0000000 - 0xe0000000 ( 512 MB)
modules : 0xbf000000 - 0xc0000000 ( 16 MB)
.text : 0xc0008000 - 0xc0704024 (7153 kB)
.init : 0xc0705000 - 0xc0740660 ( 238 kB)
.data : 0xc0742000 - 0xc078dc18 ( 304 kB)
.bss : 0xc078dc18 - 0xc07f2950 ( 404 kB)
<6>Preemptible hierarchical RCU implementation.
<6>NR_IRQS:374`
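The figures in this log are at least self-consistent with mem=16M, which a little arithmetic confirms. The Python sketch below uses a simplified, hypothetical `parse_mem` helper loosely modelled on the K/M/G suffixes that the kernel accepts for mem=. The 4 KiB page size is an assumption that matches the PFN range and the totalpages line in the log.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def parse_mem(value):
    """Parse a size such as '16M' into bytes (simplified K/M/G handling)."""
    units = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1], 0) * units[suffix]
    return int(value, 0)


total_bytes = parse_mem("16M")
total_pages = total_bytes // PAGE_SIZE
pfn_pages = 0x00041000 - 0x00040000          # Normal zone PFN range from the log
available_kib = total_bytes // 1024 - 8560   # total minus "8560k reserved"

print(hex(total_bytes), total_pages, pfn_pages, available_kib)
```

The result matches the 0x1000000 size passed to arm_add_memory, the 4096 total pages, and the 7824k reported as available, so roughly half of the 16 MiB is already reserved before user space starts.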
