How to extract USB device type and its drive letter from ETW - winapi

I'm writing a simple ETW logger that drives a trigger-event state machine, waking it whenever a new USB device is connected. Using Microsoft Message Analyzer, I managed to capture the USB "New USB Device Information" traces with the following filter: Microsoft_Windows_USB_USBHUB3.Summary == "New USB Device Information"
However, after examining the event payload, I can't see how to tell USB mass storage devices apart from other USB devices (a camera, for example).
Available values from the trace:
Name Value Bit Offset Bit Length Type
pointerValue 132972247379928 64 64 UInt64
Fid_HubDevice 0x000078F011FC3CC8 0 64 Etw.EtwPointer
pointerValue 132972489227464 0 64 UInt64
Fid_UsbDevice 0x000078F00391EFD8 64 64 Etw.EtwPointer
Fid_PortNumber 1 128 32 UInt32
Fid_DeviceDescription USB Mass Storage Device 160 384 String
Fid_DeviceInterfacePath \??\USB#VID_0781&PID_5567#200602669107DD62F0E0#{a5dcbf10-6530-11d2-901f-00c04fb951ed} 544 1376 String
Fid_DeviceDescriptor fid_DeviceDescriptor{Fid_bLength=18,Fid_bDescriptorType=1,Fid_bcdUSB=512,Fid_bDeviceClass=0,Fid_bDeviceSubClass=0,Fid_bDeviceProtocol=0,Fid_bMaxPacketSize0=64,Fid_idVendor=1921,Fid_idProduct=21863,Fid_bcdDevice=295,Fid_iManufacturer=1,Fid_iProduct=2,Fid_iSerialNumber=3,Fid_bNumConfigurations=1} 1920 144 Microsoft_Windows_USB_USBHUB3.fid_DeviceDescriptor
Fid_bLength 18 1920 8 Byte
Fid_bDescriptorType 1 1928 8 Byte
Fid_bcdUSB 0x0200 1936 16 UInt16
Fid_bDeviceClass 0 1952 8 Byte
Fid_bDeviceSubClass 0 1960 8 Byte
Fid_bDeviceProtocol 0 1968 8 Byte
Fid_bMaxPacketSize0 64 1976 8 Byte
Fid_idVendor 0x0781 1984 16 UInt16
Fid_idProduct 0x5567 2000 16 UInt16
Fid_bcdDevice 0x0127 2016 16 UInt16
Fid_iManufacturer 1 2032 8 Byte
Fid_iProduct 2 2040 8 Byte
Fid_iSerialNumber 3 2048 8 Byte
Fid_bNumConfigurations 1 2056 8 Byte
Fid_ConfigurationDescriptorLength 0x0020 2064 16 UInt16
Fid_ConfigurationDescriptor [9,2,32,0,1,1,0,128,100,9,4,0,0,2,8,6,80,0,7,5,129,2,0,2,0,7,5,2,2,0,2,1] 2080 256 ArrayValue`1
Fid_PdoName \Device\USBPDO-13 2336 288 String
Fid_Suspended 1 2624 8 Byte
Fid_PortPathDepth 1 2632 32 UInt32
Fid_PortPath [1,0,0,0,0,0] 2664 192 ArrayValue`1
Fid_PciBus 0x00000000 2856 32 UInt32
Fid_PciDevice 0x00000014 2888 32 UInt32
Fid_PciFunction 0x00000000 2920 32 UInt32
Fid_PciVendorId 0x00008086 2952 32 UInt32
Fid_PciDeviceId 0x0000A12F 2984 32 UInt32
Fid_PciRevisionId 0x00000031 3016 32 UInt32
Fid_CurrentWdfPowerDeviceState 0x00000005 3048 32 UInt32
Fid_Usb20LpmStatus 0x00000006 3080 32 UInt32
Fid_ControllerParentBusType ControllerParentBusTypePci 3112 32 MapControllerParentBusType
Fid_AcpiVendorId NULL 3144 40 String
Fid_AcpiDeviceId NULL 3184 40 String
Fid_AcpiRevisionId NULL 3224 40 String
Fid_PortFlagAcpiUpcValid 1 3264 8 Byte
Fid_PortConnectorType 255 3272 8 Byte
Fid_UcmConnectorId 0x0000000000000001 3280 64 UInt64
EtwKeywords Keywords{StandardKeywords=WindowsEtwKeywords{EventlogClassic=False,CorrelationHint=False,AuditSuccess=False,AuditFailure=False,SQM=False,WDIDiag=False,WDIContext=False,Reserved=False},Default=True,USBError=False,IRP=False,Power=False,PnP=True,Performance=False,HeadersBusTrace=False,PartialDataBusTrace=False,FullDataBusTrace=False,StateMachine=False,Enumeration=False,VerifyDriver=False,HWVerifyHost=False,HWVerifyHub=False,HWVerifyDevice=False,Rundown=False,Device=False,Hub=False,Compat=False,ControllerCommand=False,MsMeasures=True} Microsoft_Windows_USB_USBHUB3.Keywords
Constraints: no string comparisons, and the solution must use the ETW mechanism.
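One possible starting point, offered as a sketch rather than a confirmed answer: the device descriptor in the trace has Fid_bDeviceClass=0, which in USB means the class is declared per interface, so the class code has to be read from the interface descriptor inside Fid_ConfigurationDescriptor. The snippet below walks the raw bytes copied from the trace and compares bInterfaceClass against 0x08 (mass storage) as an integer, so no string comparison is involved. The function and constant names are made up for illustration, and this covers only the device type, not the drive letter.

```python
# Raw bytes of Fid_ConfigurationDescriptor, copied from the trace above.
CONFIG_DESCRIPTOR = bytes([
    9, 2, 32, 0, 1, 1, 0, 128, 100,   # configuration descriptor (wTotalLength = 32)
    9, 4, 0, 0, 2, 8, 6, 80, 0,       # interface descriptor
    7, 5, 129, 2, 0, 2, 0,            # endpoint 0x81, bulk IN
    7, 5, 2, 2, 0, 2, 1,              # endpoint 0x02, bulk OUT
])

USB_DESC_TYPE_INTERFACE = 0x04   # bDescriptorType of an interface descriptor
USB_CLASS_MASS_STORAGE = 0x08    # bInterfaceClass for mass storage


def interface_classes(config):
    """Return (class, subclass, protocol) for every interface descriptor."""
    found = []
    offset = 0
    while offset + 2 <= len(config):
        length = config[offset]
        desc_type = config[offset + 1]
        if length < 2:
            break  # malformed descriptor, stop instead of looping forever
        if desc_type == USB_DESC_TYPE_INTERFACE and offset + 9 <= len(config):
            # bInterfaceClass, bInterfaceSubClass, bInterfaceProtocol
            found.append(tuple(config[offset + 5:offset + 8]))
        offset += length
    return found


def is_mass_storage(config):
    """True if any interface declares the mass storage class."""
    return any(cls == USB_CLASS_MASS_STORAGE
               for cls, _sub, _proto in interface_classes(config))


print(interface_classes(CONFIG_DESCRIPTOR))
print(is_mass_storage(CONFIG_DESCRIPTOR))
```

For the device in the trace, the single interface reports class 8, subclass 6, protocol 80 (0x50), which are the values used by bulk-only mass storage devices.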


How to explain embedded message binary wire format of protocol buffer?

I'm trying to understand the protocol buffer encoding. When a message is translated to binary (or hexadecimal) form, I can't work out how the embedded message is encoded.
My guess is that it is related to a memory address, but I can't find the exact relationship.
Here is what I've done.
Step 1: I defined two messages in a test.proto file,
syntax = "proto3";
package proto_test;
message Education {
  string college = 1;
}
message Person {
  int32 age = 1;
  string name = 2;
  Education edu = 3;
}
Step 2: Then I generated some Go code,
protoc --go_out=. test.proto
Step 3: Then I checked the encoded form of the message,
p := proto_test.Person{
    Age:  666,
    Name: "Tom",
    Edu: &proto_test.Education{
        College: "SOMEWHERE",
    },
}
var b []byte
out, err := p.XXX_Marshal(b, true)
if err != nil {
    log.Fatalln("fail to marshal with error: ", err)
}
fmt.Printf("hexadecimal format:% x \n", out)
fmt.Printf("binary format:% b \n", out)
which outputs,
hexadecimal format:08 9a 05 12 03 54 6f 6d 1a fd 96 d1 08 0a 09 53 4f 4d 45 57 48 45 52 45
binary format:[ 1000 10011010 101 10010 11 1010100 1101111 1101101 11010 11111101 10010110 11010001 1000 1010 1001 1010011 1001111 1001101 1000101 1010111 1001000 1000101 1010010 1000101]
What I understand is:
08 - tag for field number 1 with the varint wire type (int32)
9a 05 - varint encoding of 666
12 - tag for field number 2 with the length-delimited wire type (string)
03 - length of the string, 3 bytes
54 6f 6d - ASCII for "Tom"
1a - tag for field number 3 with the length-delimited wire type (embedded message)
fd 96 d1 08 - ? (this is what I don't understand)
0a - tag for field number 1 with the length-delimited wire type (string)
09 - length of the string, 9 bytes
53 4f 4d 45 57 48 45 52 45 - ASCII for "SOMEWHERE"
What does fd 96 d1 08 stand for?
It seems that d1 08 is always there, while fd 96 sometimes changes, and I don't know why. Thanks for answering :)
I debugged the marshal process and reported a bug here.
At that location you would expect the number of bytes in the embedded message.
I repeated your experiment in Python:
msg = Person()
msg.age = 666
msg.name = "Tom"
msg.edu.college = "SOMEWHERE"
I got a different result, the one I would expect: a varint stating the size of the embedded message.
0x08
0x9A, 0x05
0x12
0x03
0x54 0x6F 0x6D
0x1A
0x0B <- Equals 11 decimal.
0x0A
0x09
0x53 0x4F 0x4D 0x45 0x57 0x48 0x45 0x52 0x45
Next I deserialized your bytes:
msg2 = Person()
data = bytearray(b'\x08\x9a\x05\x12\x03\x54\x6f\x6d\x1a\xfd\x96\xd1\x08\x0a\x09\x53\x4f\x4d\x45\x57\x48\x45\x52\x45')
msg2.ParseFromString(bytes(data))
print(msg2)
The result of this is perfect:
age: 666
name: "Tom"
edu {
  college: "SOMEWHERE"
}
The conclusion I come to is that Protobuf allows more than one way to encode the same value. I do not know what is done in this case, but one example I do know is a negative 32-bit varint: a positive int32 is encoded in at most five bytes, whereas a negative value is sign-extended to 64 bits and encoded in ten bytes.
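To make the expected layout concrete, here is a minimal hand-rolled encoder for this one message. It is a sketch of the documented wire format (tag = field number << 3 | wire type, base-128 varints, length prefix for embedded messages), not the code that protoc or any Protobuf runtime uses, and the helper names are invented for the example.

```python
WIRE_VARINT = 0   # int32, int64, bool, enum, ...
WIRE_LEN = 2      # string, bytes, embedded messages


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)   # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode a varint starting at pos; return (value, next position)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def tag(field_number, wire_type):
    return encode_varint((field_number << 3) | wire_type)


def length_delimited(field_number, payload):
    """Tag, then payload length as a varint, then the payload itself."""
    return tag(field_number, WIRE_LEN) + encode_varint(len(payload)) + payload


education = length_delimited(1, b"SOMEWHERE")           # Education.college
person = (tag(1, WIRE_VARINT) + encode_varint(666)      # Person.age
          + length_delimited(2, b"Tom")                 # Person.name
          + length_delimited(3, education))             # Person.edu

print(person.hex(" "))
# 08 9a 05 12 03 54 6f 6d 1a 0b 0a 09 53 4f 4d 45 57 48 45 52 45
```

Read as a varint, the four bytes fd 96 d1 08 from the question decode to 18107261 (`decode_varint(bytes.fromhex("fd96d108"))`), which cannot be the length of an 11-byte embedded message.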

What's wrong with my PNG IDAT chunk?

I'm trying to manually construct a simple, 4x1, uncompressed PNG.
So far, I have:
89504E470D0A1A0A // PNG Header (signature), 8 bytes
0000000D // byte length of IHDR chunk contents, 4 bytes, value 13
49484452 // IHDR start - 4 bytes
00000004 // Width 4 bytes }
00000001 // Height 4 bytes }
08 // bit depth 8 = 24/32 bit 1 byte }
06 // color type, 6 - RGBa 1 byte }
00 // compression, 0 = Deflate 1 byte }
00 // filter, 0 = no filter 1 byte }
00 // interlace, 0 = no interlace 1 byte } Total, 13 Bytes
F93C0FCD // CRC of IHDR chunk, 4 bytes
00000013 // byte length of IDAT chunk contents, 4 bytes, value 19
49444154 // IDAT start - 4 bytes
0000 // ZLib 0 compression, 2 bytes }
00 // Filter = 0, 1 bytes }
CC0000FF // Pixel 1, Red-ish, 4 bytes }
00CC00FF // Pixel 2, Green-ish, 4 bytes }
0000CCFF // Pixel 3, Blue-ish, 4 bytes }
CCCCCCCC // Pixel 4, translucent grey, 4 bytes } Total, 19 Bytes
6464C2B0 // CRC of IHDR chunk, 4 bytes
00000000 // byte length of IEND chunk, 4 bytes (value: 0)
49454E44 // IEND start - 4 bytes
AE426082 // CRC of IEND chunk, 4 bytes
I think the issue I'm having is down to the ZLib/Deflate ordering.
I think that I have to include the "Non-compressed blocks format" details from RFC 1951, sec. 3.2.4, but I'm a little unsure as to the interactions. The only examples I can find are for Compressed blocks (understandably!)
So I've now tried:
49444154 // IDAT start - 4 bytes
01 // BFINAL = 1, BTYPE = 00 1 byte }
11EE // LEN & NLEN of data 2 bytes }
00 // Filter = 0, 1 byte }
CC0000FF // Pixel 1, Red-ish, 4 bytes }
00CC00FF // Pixel 2, Green-ish, 4 bytes }
0000CCFF // Pixel 3, Blue-ish, 4 bytes }
CCCCCCCC // Pixel 4, translucent grey, 4 bytes } Total, 19 Bytes
6464C2B0 // CRC of IHDR chunk, 4 bytes
So the whole PNG file is:
89504E47 // PNG Block
0000000D // IHDR Block
00000014 // IDAT Block
00000000 // IEND Block
I'd be really grateful for some pointers as to where the issue lies... or even the PNG data for a working file so that I can reverse-engineer it?
Update 2
Thanks to Mark Adler, I've corrected my newbie errors, and now have functional code that can reproduce the result shown in his answer below, i.e. 4x1 pixel image. From this I can now happily produce a 100x1 image!
However, as a last step, I'd hoped, by tweaking the height field in the IHDR and adding additional non-terminal IDATs, to extend this to say a 4 x 2 image. Unfortunately this doesn't appear to work the way I'd expected.
I now have something like...
89504E47 // PNG Header
0000000D // re calc'ed IHDR with 2 rows
00000002 // changed - now 2 rows
7FA87D63 // CRC of IHDR updated
0000001C // row 1 IDAT, non-terminal
00 // BFINAL = 0, BTYPE = 00
0000001C // row 2, terminal IDAT, as Mark Adler's answer
01 // BFINAL = 1, BTYPE = 00
11EE // LEN & NLEN of data 2 bytes }
is wrong. LEN and NLEN are each 16 bits, not 8 bits, and both are stored least-significant byte first. So that needs to be:
1100EEFF // LEN & NLEN of data 4 bytes }
You also need a zlib wrapper around the deflate data. See RFC 1950.
Lastly you will need to update the CRC of the chunk. (Which has the wrong comment by the way -- it should say CRC of IDAT chunk.)
Thusly repaired:
89504E470D0A1A0A // PNG Header (signature), 8 bytes
0000000D // byte length of IHDR chunk contents, 4 bytes, value 13
49484452 // IHDR start - 4 bytes
00000004 // Width 4 bytes }
00000001 // Height 4 bytes }
08 // bit depth 8 = 24/32 bit 1 byte }
06 // color type, 6 - RGBa 1 byte }
00 // compression, 0 = Deflate 1 byte }
00 // filter, 0 = no filter 1 byte }
00 // interlace, 0 = no interlace 1 byte } Total, 13 Bytes
F93C0FCD // CRC of IHDR chunk, 4 bytes
0000001C // byte length of IDAT chunk contents, 4 bytes, value 28
49444154 // IDAT start - 4 bytes
7801 // zlib Header 2 bytes }
01 // BFINAL = 1, BTYPE = 00 1 byte }
1100EEFF // LEN & NLEN of data 4 bytes }
00 // Filter = 0, 1 byte }
CC0000FF // Pixel 1, Red-ish, 4 bytes }
00CC00FF // Pixel 2, Green-ish, 4 bytes }
0000CCFF // Pixel 3, Blue-ish, 4 bytes }
CCCCCCCC // Pixel 4, translucent grey, 4 bytes }
3d3a0892 // Adler-32 check 4 bytes }
ba0400b4 // CRC of IDAT chunk, 4 bytes
00000000 // byte length of IEND chunk, 4 bytes (value: 0)
49454E44 // IEND start - 4 bytes
AE426082 // CRC of IEND chunk, 4 bytes
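The repaired layout can also be generated rather than typed by hand, which avoids recomputing the checksums manually. The following is a minimal sketch using only Python's standard zlib and struct modules; the helper names chunk and stored_zlib are invented for the example, and it handles only a single stored block (at most 65535 bytes of raw data).

```python
import struct
import zlib


def chunk(chunk_type, data):
    """PNG chunk: big-endian length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def stored_zlib(raw):
    """zlib stream holding raw in one stored (BTYPE = 00) deflate block."""
    if len(raw) > 0xFFFF:
        raise ValueError("a single stored block holds at most 65535 bytes")
    header = b"\x78\x01"                                   # zlib header
    # BFINAL = 1, BTYPE = 00, then LEN and NLEN, both little-endian
    block = b"\x01" + struct.pack("<HH", len(raw), len(raw) ^ 0xFFFF) + raw
    trailer = struct.pack(">I", zlib.adler32(raw) & 0xFFFFFFFF)  # big-endian
    return header + block + trailer


# One scanline: filter byte 0, then four RGBA pixels.
scanline = b"\x00" + bytes.fromhex("CC0000FF 00CC00FF 0000CCFF CCCCCCCC")

# width, height, bit depth, colour type, compression, filter, interlace
ihdr = struct.pack(">IIBBBBB", 4, 1, 8, 6, 0, 0, 0)

png = (b"\x89PNG\r\n\x1a\n"
       + chunk(b"IHDR", ihdr)
       + chunk(b"IDAT", stored_zlib(scanline))
       + chunk(b"IEND", b""))

print(png.hex())
```

Writing png to a file opened in "wb" mode should give a viewable 4x1 image. Regarding the 4x2 attempt in the update: in a PNG, the data of all IDAT chunks is concatenated into a single zlib stream, so a two-row image would pass both filtered scanlines (each with its own filter byte) to stored_zlib as one buffer rather than wrapping each row in its own stream.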

WriteFile to an HID vendor Output report returns 1 because OutputReportByteLength is 0

I am trying to get data into my microcontroller over I2C-HID from Windows. What I have works when I connect it to a Linux host (Raspberry Pi). On Windows 10, however, both WriteFile and HidD_SetOutputReport fail with error 1 (ERROR_INVALID_FUNCTION). I believe this is because the OutputReportByteLength in the HIDP_CAPS structure returned by HidP_GetCaps is 0 (InputReportByteLength is also 0, the same problem). The feature report that I declare alongside these input and output reports has the expected length, and I can get and set feature reports. Why does Windows parse my report descriptor incorrectly? Note that I have tried rearranging the order of the feature, output, and input sections in my descriptor: the feature report always works (byte length 977 in the caps), while the input and output reports always show a byte length of 0.
devHandle = CreateFile(currentInterface,
OPEN_EXISTING, // No special create flags
0, // No special attributes
NULL); // No template file
UINT8 buf[9];
buf[0] = 0x9; // Report ID = 9
success = WriteFile(
device->file, // HANDLE hFile,
buf, // LPCVOID lpBuffer,
sizeof(buf), // DWORD nNumberOfBytesToWrite,
&bytes_written, // LPDWORD lpNumberOfBytesWritten,
Report descriptor
// Decoded Application Collection
06 00FF (GLOBAL) USAGE_PAGE 0xFF00 Vendor-defined
09 01 (LOCAL) USAGE 0xFF000001
A1 01 (MAIN) COLLECTION 0x00000001 Application (Usage=0xFF000001: Page=Vendor-defined, Usage=, Type=)
85 01 (GLOBAL) REPORT_ID 0x01 (1)
09 01 (LOCAL) USAGE 0xFF000001 <-- Warning: Undocumented usage
75 08 (GLOBAL) REPORT_SIZE 0x08 (8) Number of bits per field
96 D003 (GLOBAL) REPORT_COUNT 0x03D0 (976) Number of fields
B1 02 (MAIN) FEATURE 0x00000002 (976 fields x 8 bits) 0=Data 1=Variable 0=Absolute 0=NoWrap 0=Linear 0=PrefState 0=NoNull 0=NonVolatile 0=Bitmap
85 09 (GLOBAL) REPORT_ID 0x09 (9)
09 01 (LOCAL) USAGE 0xFF000001 <-- Warning: Undocumented usage
95 08 (GLOBAL) REPORT_COUNT 0x08 (8) Number of fields
91 02 (MAIN) OUTPUT 0x00000002 (8 fields x 8 bits) 0=Data 1=Variable 0=Absolute 0=NoWrap 0=Linear 0=PrefState 0=NoNull 0=NonVolatile 0=Bitmap
09 01 (LOCAL) USAGE 0xFF000001 <-- Warning: Undocumented usage
95 40 (GLOBAL) REPORT_COUNT 0x40 (64) Number of fields
81 02 (MAIN) INPUT 0x00000002 (64 fields x 8 bits) 0=Data 1=Variable 0=Absolute 0=NoWrap 0=Linear 0=PrefState 0=NoNull 0=NonVolatile 0=Bitmap
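As a cross-check on what the descriptor itself declares, the sketch below parses the short items listed above and adds up the bits per report type. It does not explain why Windows reports 0; it only shows the byte lengths one would expect from the bytes as posted (977 for the feature report, matching the value the question says HidP_GetCaps returns). The function name is invented for the example, long items are not handled, and the extra byte per report reflects that the Windows report lengths include the leading report ID byte.

```python
# Descriptor bytes exactly as listed in the decoded dump above.
DESCRIPTOR = bytes.fromhex(
    "06 00 FF 09 01 A1 01 "
    "85 01 09 01 75 08 96 D0 03 B1 02 "
    "85 09 09 01 95 08 91 02 "
    "09 01 95 40 81 02"
)

ITEM_TYPE_MAIN = 0
ITEM_TYPE_GLOBAL = 1
MAIN_KINDS = {0x8: "input", 0x9: "output", 0xB: "feature"}


def report_byte_lengths(desc):
    """Largest report length in bytes per kind, including the report ID byte."""
    report_size = report_count = 0
    report_id = 0
    bits = {}                                  # (kind, report_id) -> total bits
    i = 0
    while i < len(desc):
        prefix = desc[i]
        size = (0, 1, 2, 4)[prefix & 0x03]     # bSize encodes 0, 1, 2 or 4 bytes
        item_type = (prefix >> 2) & 0x03
        item_tag = prefix >> 4
        value = int.from_bytes(desc[i + 1:i + 1 + size], "little")
        i += 1 + size
        if item_type == ITEM_TYPE_GLOBAL:
            if item_tag == 0x7:
                report_size = value            # REPORT_SIZE, bits per field
            elif item_tag == 0x8:
                report_id = value              # REPORT_ID
            elif item_tag == 0x9:
                report_count = value           # REPORT_COUNT, number of fields
        elif item_type == ITEM_TYPE_MAIN and item_tag in MAIN_KINDS:
            key = (MAIN_KINDS[item_tag], report_id)
            bits[key] = bits.get(key, 0) + report_size * report_count
    lengths = {}
    for (kind, _rid), nbits in bits.items():
        nbytes = (nbits + 7) // 8 + 1          # +1 for the report ID byte
        lengths[kind] = max(lengths.get(kind, 0), nbytes)
    return lengths


print(report_byte_lengths(DESCRIPTOR))
```

By this count the output report for ID 9 is 9 bytes, which agrees with the 9-byte buffer passed to WriteFile in the question, and the input report is 65 bytes.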

Measuring memory access time x86

I am trying to measure cached vs. non-cached memory access time, and the results confuse me.
Here is the code:
#include <stdio.h>
#include <string.h>   // for memset
#include <x86intrin.h>
#include <stdint.h>

#define SIZE 32*1024

char arr[SIZE];

int main()
{
    char *addr;
    unsigned int dummy;
    uint64_t tsc1, tsc2;
    unsigned i;
    volatile char val;

    memset(arr, 0x0, SIZE);
    for (addr = arr; addr < arr + SIZE; addr += 64) {
        _mm_clflush((void *) addr);
    }
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(1) tsc: %llu\n", tsc2 - tsc1);

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(2) tsc: %llu\n", tsc2 - tsc1);

    return 0;
}
the output:
(1) tsc: 451248
(2) tsc: 449568
I expected that the first value would be much larger, because the caches were invalidated by clflush in case (1).
Info about my CPU (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz) caches:
Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true
Code disassembly between two rdtscp instructions
400614: 0f 01 f9 rdtscp
400617: 89 ce mov %ecx,%esi
400619: 48 8b 4d d8 mov -0x28(%rbp),%rcx
40061d: 89 31 mov %esi,(%rcx)
40061f: 48 c1 e2 20 shl $0x20,%rdx
400623: 48 09 d0 or %rdx,%rax
400626: 48 89 45 c0 mov %rax,-0x40(%rbp)
40062a: c7 45 b4 00 00 00 00 movl $0x0,-0x4c(%rbp)
400631: eb 0d jmp 400640 <main+0x8a>
400633: 8b 45 b4 mov -0x4c(%rbp),%eax
400636: 8a 80 80 10 60 00 mov 0x601080(%rax),%al
40063c: 83 45 b4 01 addl $0x1,-0x4c(%rbp)
400640: 81 7d b4 ff 7f 00 00 cmpl $0x7fff,-0x4c(%rbp)
400647: 76 ea jbe 400633 <main+0x7d>
400649: 48 8d 45 b0 lea -0x50(%rbp),%rax
40064d: 48 89 45 e0 mov %rax,-0x20(%rbp)
400651: 0f 01 f9 rdtscp
It looks like I am missing or misunderstanding something. Could you suggest what?
mov %0, %%al is so slow (one cache line per 64 clocks, or per 32 clocks on Sandybridge specifically (not Haswell or later)) that you might bottleneck on that whether or not your loads are ultimately coming from DRAM or L1D.
Only every 64-th load will miss in cache, because you're taking full advantage of spatial locality with your tiny byte-load loop. If you actually wanted to test how fast the cache can refill after flushing an L1D-sized block, you should use a SIMD movdqa loop, or just byte loads with a stride of 64. (You only need to touch one byte per cache line).
To avoid the false dependency on the old value of RAX, you should use movzbl %0, %eax. This will let Sandybridge and later (or AMD since K8) use their full load throughput of 2 loads per clock to keep the memory pipeline closer to full. Multiple cache misses can be in flight at once: Intel CPU cores have 10 LFBs (line fill buffers) for lines to/from L1D, or 16 Superqueue entries for lines from L2 to off-core. See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. (Many-core Xeon chips have worse single-thread memory bandwidth than desktops/laptops.)
But your bottleneck is far worse than that!
You compiled with optimizations disabled, so your loop uses addl $0x1,-0x4c(%rbp) for the loop counter, which gives you at least a 6-cycle loop-carried dependency chain. (Store/reload store-forwarding latency + 1 cycle for the ALU add.)
(Maybe even higher because of resource conflicts for the load port. i7-720 is a Nehalem microarchitecture, so there's only one load port.)
This definitely means your loop doesn't bottleneck on cache misses, and will probably run about the same speed whether you used clflush or not.
Also note that rdtsc counts reference cycles, not core clock cycles, i.e. it will always count at 1.6GHz on your 1.6GHz CPU, regardless of whether the core is running slower (powersave) or faster (Turbo). Control for this with a warm-up loop.
You also didn't declare a clobber on eax, so the compiler isn't expecting your code to modify rax. You end up with mov 0x601080(%rax),%al. But gcc reloads rax from memory every iteration, and doesn't use the rax that you modify, so you aren't actually skipping around in memory like you might be if you'd compiled with optimizations.
Hint: use volatile char * if you want to get the compiler to actually load, and not optimize it to fewer wider loads. You don't need inline asm for this.
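A quick back-of-the-envelope check with the numbers from the question supports this. The figures below are only arithmetic on the two posted TSC readings, the array size, and the 64-byte line size; the variable names are made up for the example.

```python
SIZE = 32 * 1024          # bytes touched per loop, one byte load each
LINE = 64                 # cache line size from the question's CPU info

tsc_flushed = 451248      # run (1), right after clflush
tsc_cached = 449568       # run (2), array already cached

per_load_flushed = tsc_flushed / SIZE
per_load_cached = tsc_cached / SIZE

lines = SIZE // LINE                              # loads that can miss at all
extra_per_line = (tsc_flushed - tsc_cached) / lines

print(f"reference cycles per load, flushed: {per_load_flushed:.2f}")
print(f"reference cycles per load, cached:  {per_load_cached:.2f}")
print(f"cache lines touched: {lines}")
print(f"extra reference cycles per line after flush: {extra_per_line:.2f}")
```

Both runs cost roughly 13.8 reference cycles per load, and only 512 of the 32768 loads can miss, so each cache line has hundreds of cycles of loop overhead in which its miss can be hidden. The flushed run ends up only about 3 reference cycles per line slower.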

ASCII 7x5 side-feeding characters for led modules

I am looking at the code for the font file here:
The code starts like this:
const byte font[][5] = {
{0x00,0x00,0x00,0x00,0x00}, // 0x20 32
{0x00,0x00,0x6f,0x00,0x00}, // ! 0x21 33
{0x00,0x07,0x00,0x07,0x00}, // " 0x22 34
{0x14,0x7f,0x14,0x7f,0x14}, // # 0x23 35
{0x00,0x07,0x04,0x1e,0x00}, // $ 0x24 36
{0x23,0x13,0x08,0x64,0x62}, // % 0x25 37
{0x36,0x49,0x56,0x20,0x50}, // & 0x26 38
{0x00,0x00,0x07,0x00,0x00}, // ' 0x27 39
{0x00,0x1c,0x22,0x41,0x00}, // ( 0x28 40
{0x00,0x41,0x22,0x1c,0x00}, // ) 0x29 41
{0x14,0x08,0x3e,0x08,0x14}, // * 0x2a 42
{0x08,0x08,0x3e,0x08,0x08}, // + 0x2b 43
and so on...
I am very confused as to how this code works - can someone explain it to me please?
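Each character is stored as 5 bytes, one byte per column of the 7x5 grid, read left to right. Within a byte each bit is one pixel of that column and, judging from the glyph data, bit 0 is the top row: the " glyph has 0x07 in two columns, which lights the top three pixels of each. Only the low 7 bits of each byte are used, so 5 of the 40 bits are unused.
When you want to display a character you copy the corresponding 5 byte bitmap for that character to the appropriate memory location. Because the table starts at the space character (0x20), the index is the character code minus 0x20. E.g. to display the character X you would copy the data from font['X' - 0x20].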
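To see the mapping in action, here is a small sketch that prints a glyph as text. It uses only the twelve table rows quoted in the question, assumes bit 0 is the top row (which is what the glyph shapes suggest), and the render function is a made-up helper rather than part of the original font file.

```python
FONT_FIRST_CHAR = 0x20   # the table starts at the space character

# The rows quoted in the question: 5 column bytes per character.
FONT = [
    [0x00, 0x00, 0x00, 0x00, 0x00],  # (space)
    [0x00, 0x00, 0x6F, 0x00, 0x00],  # !
    [0x00, 0x07, 0x00, 0x07, 0x00],  # "
    [0x14, 0x7F, 0x14, 0x7F, 0x14],  # #
    [0x00, 0x07, 0x04, 0x1E, 0x00],  # $
    [0x23, 0x13, 0x08, 0x64, 0x62],  # %
    [0x36, 0x49, 0x56, 0x20, 0x50],  # &
    [0x00, 0x00, 0x07, 0x00, 0x00],  # '
    [0x00, 0x1C, 0x22, 0x41, 0x00],  # (
    [0x00, 0x41, 0x22, 0x1C, 0x00],  # )
    [0x14, 0x08, 0x3E, 0x08, 0x14],  # *
    [0x08, 0x08, 0x3E, 0x08, 0x08],  # +
]


def render(ch, on="#", off="."):
    """Return the 7 text rows of a glyph; each byte is one column, bit 0 on top."""
    columns = FONT[ord(ch) - FONT_FIRST_CHAR]
    rows = []
    for row in range(7):
        rows.append("".join(on if (col >> row) & 1 else off for col in columns))
    return rows


print("\n".join(render("+")))
# .....
# ..#..
# ..#..
# #####
# ..#..
# ..#..
# .....
```

On an LED module the same loop runs in hardware terms: each of the 5 bytes is shifted out as one column, which is why the layout is called side-feeding.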
