I'm trying to manually construct a simple, 4x1, uncompressed PNG.
So far, I have:
89504E47 // PNG Header
0D0A1A0A
0000000D // byte length of IHDR chunk contents, 4 bytes, value 13
49484452 // IHDR start - 4 bytes
00000004 // Width 4 bytes }
00000001 // Height 4 bytes }
08 // bit depth 8 = 24/32 bit 1 byte }
06 // color type, 6 - RGBa 1 byte }
00 // compression, 0 = Deflate 1 byte }
00 // filter, 0 = no filter 1 byte }
00 // interlace, 0 = no interlace 1 byte } Total, 13 Bytes
F93C0FCD // CRC of IHDR chunk, 4 bytes
00000013 // byte length of IDAT chunk contents, 4 bytes, value 19
49444154 // IDAT start - 4 bytes
0000 // ZLib 0 compression, 2 bytes }
00 // Filter = 0, 1 bytes }
CC0000FF // Pixel 1, Red-ish, 4 bytes }
00CC00FF // Pixel 2, Green-ish, 4 bytes }
0000CCFF // Pixel 3, Blue-ish, 4 bytes }
CCCCCCCC // Pixel 4, translucent grey, 4 bytes } Total, 19 Bytes
6464C2B0 // CRC of IHDR chunk, 4 bytes
00000000 // byte length of IEND chunk, 4 bytes (value: 0)
49454E44 // IEND start - 4 bytes
AE426082 // CRC of IEND chunk, 4 bytes
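For anyone checking these values by hand: each chunk CRC is a CRC-32 over the chunk type plus the chunk data (the length field is not included). A minimal C sketch of that check, assuming zlib is available for its crc32():

#include <stdio.h>
#include <zlib.h>   /* crc32(); link with -lz */

int main(void)
{
    /* CRC of the IHDR chunk above: the type "IHDR" followed by its 13 data bytes. */
    const unsigned char ihdr[] = {
        'I', 'H', 'D', 'R',
        0x00, 0x00, 0x00, 0x04,       /* width  = 4 */
        0x00, 0x00, 0x00, 0x01,       /* height = 1 */
        0x08, 0x06, 0x00, 0x00, 0x00  /* depth, colour type, compression, filter, interlace */
    };

    unsigned long crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, ihdr, sizeof ihdr);
    printf("%08lX\n", crc);           /* prints F93C0FCD */
    return 0;
}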
Update
I think the issue I'm having is down to the ZLib/Deflate ordering.
I think I have to include the "Non-compressed blocks" format details from RFC 1951, sec. 3.2.4, but I'm a little unsure how it all interacts. The only examples I can find are for compressed blocks (understandably!).
So I've now tried:
49444154 // IDAT start - 4 bytes
01 // BFINAL = 1, BTYPE = 00 1 byte }
11EE // LEN & NLEN of data 2 bytes }
00 // Filter = 0, 1 byte }
CC0000FF // Pixel 1, Red-ish, 4 bytes }
00CC00FF // Pixel 2, Green-ish, 4 bytes }
0000CCFF // Pixel 3, Blue-ish, 4 bytes }
CCCCCCCC // Pixel 4, translucent grey, 4 bytes } Total, 19 Bytes
6464C2B0 // CRC of IHDR chunk, 4 bytes
So the whole PNG file is:
89504E47 // PNG Block
0D0A1A0A
0000000D // IHDR Block
49484452
00000004
00000001
08060000
00
F93C0FCD
00000014 // IDAT Block
49444154
0111EE00
CC0000FF
00CC00FF
0000CCFF
CCCCCCCC
6464C2B0
00000000 // IEND Block
49454E44
AE426082
I'd be really grateful for some pointers as to where the issue lies... or even the PNG data for a working file so that I can reverse-engineer it?
Update 2
Thanks to Mark Adler, I've corrected my newbie errors and now have functional code that can reproduce the result shown in his answer below, i.e. the 4x1 pixel image. From this I can now happily produce a 100x1 image!
However, as a last step, I'd hoped to extend this to, say, a 4x2 image by tweaking the height field in the IHDR and adding additional non-terminal IDATs. Unfortunately this doesn't appear to work the way I'd expected.
I now have something like...
89504E47 // PNG Header
0D0A1A0A
0000000D // re calc'ed IHDR with 2 rows
49484452
00000004
00000002 // changed - now 2 rows
08
06
00
00
00
7FA87D63 // CRC of IHDR updated
0000001C // row 1 IDAT, non-terminal
49444154
7801
00 // BFINAL = 0, BTYPE = 00
1100EEFF
00
CC0000FF
00CC00FF
0000CCFF
CCCCCCCC
3D3A0892
5D19A623
0000001C // row 2, terminal IDAT, as in Mark Adler's answer
49444154
7801
01 // BFINAL = 1, BTYPE = 00
1100EEFF
00
CC0000FF
00CC00FF
0000CCFF
CCCCCCCC
3D3A0892
BA0400B4
00000000
49454E44
AE426082
This:
11EE // LEN & NLEN of data 2 bytes }
is wrong. LEN and NLEN are both 16 bits, not 8 bits. So that needs to be:
1100EEFF // LEN & NLEN of data 4 bytes }
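If it helps to see where those four bytes come from: LEN is the number of raw bytes in the stored block (1 filter byte + 16 pixel bytes = 17 = 0x0011), NLEN is its one's complement, and both fields are written least-significant byte first. A quick C sketch:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t len  = 17;              /* 1 filter byte + 16 pixel bytes = 0x0011 */
    uint16_t nlen = (uint16_t)~len;  /* one's complement of LEN = 0xFFEE */

    /* Both fields are emitted least-significant byte first. */
    printf("%02X %02X %02X %02X\n",
           (unsigned)(len & 0xFF), (unsigned)(len >> 8),
           (unsigned)(nlen & 0xFF), (unsigned)(nlen >> 8));  /* 11 00 EE FF */
    return 0;
}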
You also need a zlib wrapper around the deflate data. See RFC 1950.
Lastly you will need to update the CRC of the chunk. (Which has the wrong comment by the way -- it should say CRC of IDAT chunk.)
Thusly repaired:
89504E47 // PNG Header
0D0A1A0A
0000000D // byte length of IHDR chunk contents, 4 bytes, value 13
49484452 // IHDR start - 4 bytes
00000004 // Width 4 bytes }
00000001 // Height 4 bytes }
08 // bit depth 8 = 24/32 bit 1 byte }
06 // color type, 6 - RGBa 1 byte }
00 // compression, 0 = Deflate 1 byte }
00 // filter, 0 = no filter 1 byte }
00 // interlace, 0 = no interlace 1 byte } Total, 13 Bytes
F93C0FCD // CRC of IHDR chunk, 4 bytes
0000001C // byte length of IDAT chunk contents, 4 bytes, value 28
49444154 // IDAT start - 4 bytes
7801 // zlib Header 2 bytes }
01 // BFINAL = 1, BTYPE = 00 1 byte }
1100EEFF // LEN & NLEN of data 4 bytes }
00 // Filter = 0, 1 byte }
CC0000FF // Pixel 1, Red-ish, 4 bytes }
00CC00FF // Pixel 2, Green-ish, 4 bytes }
0000CCFF // Pixel 3, Blue-ish, 4 bytes }
CCCCCCCC // Pixel 4, translucent grey, 4 bytes }
3d3a0892 // Adler-32 check 4 bytes }
ba0400b4 // CRC of IDAT chunk, 4 bytes
00000000 // byte length of IEND chunk, 4 bytes (value: 0)
49454E44 // IEND start - 4 bytes
AE426082 // CRC of IEND chunk, 4 bytes
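In case it's useful for cross-checking, here's a minimal C sketch that writes this same 4x1 file byte for byte. It assumes zlib is available for crc32() and adler32() (link with -lz); the output filename test.png is just an example.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Write a 32-bit value big-endian, as PNG requires. */
static void put_u32(FILE *f, unsigned long v)
{
    putc((int)((v >> 24) & 0xFF), f);
    putc((int)((v >> 16) & 0xFF), f);
    putc((int)((v >>  8) & 0xFF), f);
    putc((int)(v & 0xFF), f);
}

/* Write one chunk: length, type, data, then CRC-32 over type + data. */
static void put_chunk(FILE *f, const char *type,
                      const unsigned char *data, unsigned long len)
{
    unsigned long crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, (const unsigned char *)type, 4);
    if (len)
        crc = crc32(crc, data, (unsigned)len);
    put_u32(f, len);
    fwrite(type, 1, 4, f);
    if (len)
        fwrite(data, 1, len, f);
    put_u32(f, crc);
}

int main(void)
{
    static const unsigned char sig[8] = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    static const unsigned char ihdr[13] = {
        0, 0, 0, 4,   /* width 4  */
        0, 0, 0, 1,   /* height 1 */
        8, 6, 0, 0, 0 /* bit depth, colour type, compression, filter, interlace */
    };
    /* One scanline: filter byte 0 followed by four RGBA pixels. */
    static const unsigned char raw[17] = {
        0x00,
        0xCC, 0x00, 0x00, 0xFF,
        0x00, 0xCC, 0x00, 0xFF,
        0x00, 0x00, 0xCC, 0xFF,
        0xCC, 0xCC, 0xCC, 0xCC
    };
    unsigned char idat[2 + 5 + sizeof raw + 4]; /* zlib hdr + stored hdr + raw + Adler-32 */
    unsigned long adler = adler32(adler32(0L, Z_NULL, 0), raw, sizeof raw);
    size_t n = 0;

    idat[n++] = 0x78; idat[n++] = 0x01;  /* zlib header */
    idat[n++] = 0x01;                    /* BFINAL = 1, BTYPE = 00 (stored) */
    idat[n++] = 0x11; idat[n++] = 0x00;  /* LEN  = 17, little-endian */
    idat[n++] = 0xEE; idat[n++] = 0xFF;  /* NLEN = ~LEN */
    memcpy(idat + n, raw, sizeof raw);
    n += sizeof raw;
    idat[n++] = (unsigned char)(adler >> 24);  /* Adler-32, big-endian: 3D 3A 08 92 */
    idat[n++] = (unsigned char)(adler >> 16);
    idat[n++] = (unsigned char)(adler >> 8);
    idat[n++] = (unsigned char)adler;

    FILE *f = fopen("test.png", "wb");
    if (!f)
        return 1;
    fwrite(sig, 1, sizeof sig, f);
    put_chunk(f, "IHDR", ihdr, sizeof ihdr);
    put_chunk(f, "IDAT", idat, n);  /* n = 28 = 0x1C, as in the listing above */
    put_chunk(f, "IEND", NULL, 0);
    fclose(f);
    return 0;
}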
I'm trying to understand the protocol buffer encoding method. When translating a message to binary (or hexadecimal) format, I can't understand how the embedded message is encoded.
I guess it may be related to the memory address, but I can't find the exact relationship.
Here is what I've done.
Step 1: I defined two messages in a test.proto file:
syntax = "proto3";
package proto_test;

message Education {
    string college = 1;
}

message Person {
    int32 age = 1;
    string name = 2;
    Education edu = 3;
}
Step 2: Then I generated some Go code:
protoc --go_out=. test.proto
Step 3: Then I checked the encoded format of the message:
p := proto_test.Person{
    Age:  666,
    Name: "Tom",
    Edu: &proto_test.Education{
        College: "SOMEWHERE",
    },
}
var b []byte
out, err := p.XXX_Marshal(b, true)
if err != nil {
    log.Fatalln("fail to marshal with error: ", err)
}
fmt.Printf("hexadecimal format:% x \n", out)
fmt.Printf("binary format:% b \n", out)
which outputs:
hexadecimal format:08 9a 05 12 03 54 6f 6d 1a fd 96 d1 08 0a 09 53 4f 4d 45 57 48 45 52 45
binary format:[ 1000 10011010 101 10010 11 1010100 1101111 1101101 11010 11111101 10010110 11010001 1000 1010 1001 1010011 1001111 1001101 1000101 1010111 1001000 1000101 1010010 1000101]
What I understand is:
08 - int32 wire type with tag number 1
9a 05 - Varints for 666
12 - string wire type with tag number 2
03 - length of the delimited data, which is 3 bytes
54 6f 6d - ASCII for "Tom"
1a - embedded message wire type with tag number 3
fd 96 d1 08 - ? (here is what I don't understand)
0a - string wire type with tag number 1
09 - length of the delimited data, which is 9 bytes
53 4f 4d 45 57 48 45 52 45 - ascii for "SOMEWHERE"
What does fd 96 d1 08 stand for?
It seems that d1 08 is always there, but fd 96 sometimes changes; I don't know why. Thanks for answering :)
Add
I debugged the marshal process and reported a bug here.
At that location one would expect the number of bytes in the embedded message.
I have repeated your experiment in Python.
msg = Person()
msg.age = 666
msg.name = "Tom"
msg.edu.college = "SOMEWHERE"
I got a different result, the one I would expect: a varint stating the size of the embedded message.
0x08
0x9A, 0x05
0x12
0x03
0x54 0x6F 0x6D
0x1A
0x0B <- equals 11 decimal, the size of the embedded message that follows.
0x0A
0x09
0x53 0x4F 0x4D 0x45 0x57 0x48 0x45 0x52 0x45
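To make the varints in that dump concrete, here's a small C sketch (decode_varint is just an illustrative helper, not anything generated by protoc): each byte carries 7 payload bits, least-significant group first, and the high bit marks "more bytes follow".

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Decode one base-128 varint starting at p; *used gets the byte count. */
static uint64_t decode_varint(const uint8_t *p, size_t *used)
{
    uint64_t value = 0;
    int shift = 0;
    size_t i = 0;
    do {
        value |= (uint64_t)(p[i] & 0x7F) << shift;
        shift += 7;
    } while (p[i++] & 0x80);
    *used = i;
    return value;
}

int main(void)
{
    const uint8_t age[] = {0x9A, 0x05};  /* value of field 1 in the dump above */
    size_t used;
    printf("0x9A 0x05 -> %llu\n",
           (unsigned long long)decode_varint(age, &used));  /* 666 */

    /* A field key is itself a varint: (field_number << 3) | wire_type. */
    printf("key 0x1A -> field %d, wire type %d\n", 0x1A >> 3, 0x1A & 7);  /* 3, 2 */

    /* For wire type 2 the next varint is a byte count: 0x0B above is the
       11 bytes of the embedded Education message (0x0A 0x09 + "SOMEWHERE"). */
    return 0;
}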
Next I deserialized your bytes:
msg2 = Person()
str = bytearray(b'\x08\x9a\x05\x12\x03\x54\x6f\x6d\x1a\xfd\x96\xd1\x08\x0a\x09\x53\x4f\x4d\x45\x57\x48\x45\x52\x45')
msg2.ParseFromString(str)
print(msg2)
The result of this is perfect:
age: 666
name: "Tom"
edu {
college: "SOMEWHERE"
}
The conclusion I come to is that there are different ways of encoding in protobuf. I do not know exactly what is done in this case, but I do know the example of a negative 32-bit varint: a positive 32-bit varint takes at most five bytes, while a negative value is sign-extended to 64 bits and encoded in ten bytes.
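That ten-byte case is easy to reproduce with the wire-format rule alone (nothing library-specific here, just the varint math): int32 values are sign-extended to 64 bits before encoding, so any negative int32 costs the full ten bytes.

#include <stdio.h>
#include <stdint.h>

/* Encode v as a base-128 varint; returns the number of bytes written. */
static int encode_varint(uint64_t v, uint8_t *out)
{
    int n = 0;
    do {
        uint8_t b = v & 0x7F;
        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0x00);  /* high bit = "more bytes follow" */
    } while (v);
    return n;
}

int main(void)
{
    uint8_t buf[10];
    int32_t values[] = {666, 2147483647, -1};
    for (int i = 0; i < 3; i++) {
        /* Sign-extend to 64 bits, as protobuf does for int32 fields. */
        int n = encode_varint((uint64_t)(int64_t)values[i], buf);
        printf("%11d -> %2d byte(s):", values[i], n);
        for (int j = 0; j < n; j++)
            printf(" %02X", buf[j]);
        printf("\n");
    }
    return 0;  /* 666 -> 9A 05; 2147483647 -> 5 bytes; -1 -> 10 bytes */
}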
Suppose we use avr-gcc to compile code which has the following structure:
typedef struct {
    uint8_t bLength;
    uint8_t bDescriptorType;
    int16_t wString[];
} S_string_descriptor;
We initialize it globally like this:
const S_string_descriptor sn_desc PROGMEM = {
    1 + 1 + sizeof L"1234" - 2, 0x03, L"1234"
};
Let's check what is generated from it:
000000ac <__trampolines_end>:
ac: 0a 03 fmul r16, r18
ae: 31 00 .word 0x0031 ; ????
b0: 32 00 .word 0x0032 ; ????
b2: 33 00 .word 0x0033 ; ????
b4: 34 00 .word 0x0034 ; ????
...
So the string content does indeed follow the first two members of the structure, as required.
But if we check sizeof sn_desc, the result is 2.
The variable is defined at compile time, and sizeof is also a compile-time operator. So why doesn't sizeof var show the true size of var? And where is this compiler behavior (i.e., adding arbitrary data to a structure) documented?
sn_desc is a 2-byte pointer into flash. It is meant to be used with LPM et alia in order to retrieve the actual data. There is no way to get the actual size of this data; store it separately.
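For what it's worth, here is one way to act on the "store it separately" advice. It is a sketch only: SN_DESC_SIZE and read_sn_desc are made-up names, and avr-libc's pgm_read_byte() from <avr/pgmspace.h> is assumed for the flash reads.

#include <stdint.h>
#include <avr/pgmspace.h>

typedef struct {
    uint8_t bLength;
    uint8_t bDescriptorType;
    int16_t wString[];
} S_string_descriptor;

/* Keep the real size by hand, since sizeof sn_desc does not cover the string data. */
#define SN_DESC_SIZE (1 + 1 + sizeof L"1234" - 2)

const S_string_descriptor sn_desc PROGMEM = {
    SN_DESC_SIZE, 0x03, L"1234"
};

/* Copy the whole descriptor out of flash into a RAM buffer. */
static void read_sn_desc(uint8_t *dst)
{
    for (uint8_t i = 0; i < SN_DESC_SIZE; i++)
        dst[i] = pgm_read_byte((const uint8_t *)&sn_desc + i);
}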
I'm trying to measure cached / non-cached memory access times and the results are confusing me.
Here is the code:
#include <stdio.h>
#include <string.h>     /* for memset */
#include <x86intrin.h>
#include <stdint.h>

#define SIZE 32*1024

char arr[SIZE];

int main()
{
    char *addr;
    unsigned int dummy;
    uint64_t tsc1, tsc2;
    unsigned i;
    volatile char val;

    memset(arr, 0x0, SIZE);
    for (addr = arr; addr < arr + SIZE; addr += 64) {
        _mm_clflush((void *) addr);
    }
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );
    }
    tsc2 = __rdtscp(&dummy);
    printf("(1) tsc: %llu\n", (unsigned long long)(tsc2 - tsc1));

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );
    }
    tsc2 = __rdtscp(&dummy);
    printf("(2) tsc: %llu\n", (unsigned long long)(tsc2 - tsc1));

    return 0;
}
The output:
(1) tsc: 451248
(2) tsc: 449568
I expected the first value to be much larger, because the caches were invalidated by clflush in case (1).
Info about my CPU (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz) caches:
Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true
Code disassembly between the two rdtscp instructions:
400614: 0f 01 f9 rdtscp
400617: 89 ce mov %ecx,%esi
400619: 48 8b 4d d8 mov -0x28(%rbp),%rcx
40061d: 89 31 mov %esi,(%rcx)
40061f: 48 c1 e2 20 shl $0x20,%rdx
400623: 48 09 d0 or %rdx,%rax
400626: 48 89 45 c0 mov %rax,-0x40(%rbp)
40062a: c7 45 b4 00 00 00 00 movl $0x0,-0x4c(%rbp)
400631: eb 0d jmp 400640 <main+0x8a>
400633: 8b 45 b4 mov -0x4c(%rbp),%eax
400636: 8a 80 80 10 60 00 mov 0x601080(%rax),%al
40063c: 83 45 b4 01 addl $0x1,-0x4c(%rbp)
400640: 81 7d b4 ff 7f 00 00 cmpl $0x7fff,-0x4c(%rbp)
400647: 76 ea jbe 400633 <main+0x7d>
400649: 48 8d 45 b0 lea -0x50(%rbp),%rax
40064d: 48 89 45 e0 mov %rax,-0x20(%rbp)
400651: 0f 01 f9 rdtscp
It looks like I'm missing or misunderstanding something. Could you suggest what's going on?
mov %0, %%al is so slow (one cache line per 64 clocks, or per 32 clocks on Sandybridge specifically (not Haswell or later)) that you might bottleneck on that whether or not your loads are ultimately coming from DRAM or L1D.
Only every 64-th load will miss in cache, because you're taking full advantage of spatial locality with your tiny byte-load loop. If you actually wanted to test how fast the cache can refill after flushing an L1D-sized block, you should use a SIMD movdqa loop, or just byte loads with a stride of 64. (You only need to touch one byte per cache line).
To avoid the false dependency on the old value of RAX, you should use movzbl %0, %eax. This will let Sandybridge and later (or AMD since K8) use their full load throughput of 2 loads per clock to keep the memory pipeline closer to full. Multiple cache misses can be in flight at once: Intel CPU cores have 10 LFBs (line fill buffers) for lines to/from L1D, or 16 Superqueue entries for lines from L2 to off-core. See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. (Many-core Xeon chips have worse single-thread memory bandwidth than desktops/laptops.)
But your bottleneck is far worse than that!
You compiled with optimizations disabled, so your loop uses addl $0x1,-0x4c(%rbp) for the loop counter, which gives you at least a 6-cycle loop-carried dependency chain. (Store/reload store-forwarding latency + 1 cycle for the ALU add.) http://agner.org/optimize/
(Maybe even higher because of resource conflicts for the load port. i7-720 is a Nehalem microarchitecture, so there's only one load port.)
This definitely means your loop doesn't bottleneck on cache misses, and will probably run about the same speed whether you used clflush or not.
Also note that rdtsc counts reference cycles, not core clock cycles, i.e. it will always count at 1.6GHz on your 1.6GHz CPU, regardless of the CPU running slower (powersave) or faster (Turbo). Control for this with a warm-up loop.
You also didn't declare a clobber on eax, so the compiler isn't expecting your code to modify rax. You end up with mov 0x601080(%rax),%al. But gcc reloads rax from memory every iteration, and doesn't use the rax that you modify, so you aren't actually skipping around in memory like you might be if you'd compiled with optimizations.
Hint: use volatile char * if you want to get the compiler to actually load, and not optimize it to fewer wider loads. You don't need inline asm for this.
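Putting those suggestions together, here's a sketch of what the measurement might look like (the function name time_pass is made up; compile with optimization enabled, e.g. gcc -O2): a volatile pointer forces real loads, the stride of 64 touches one byte per cache line, and a warm-up loop runs before anything is timed.

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <x86intrin.h>  /* _mm_clflush, _mm_mfence, __rdtscp */

#define SIZE (32 * 1024)
#define LINE 64

static char arr[SIZE];

/* Touch one byte per cache line; volatile keeps the loads in the optimized build. */
static uint64_t time_pass(void)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    for (volatile char *p = arr; p < arr + SIZE; p += LINE)
        (void)*p;
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main(void)
{
    memset(arr, 0, SIZE);

    /* Warm-up so the CPU has left any low-power state before we time anything. */
    for (int i = 0; i < 1000; i++)
        time_pass();

    /* Flush the whole array, then time a cold pass and a warm pass. */
    for (char *p = arr; p < arr + SIZE; p += LINE)
        _mm_clflush(p);
    _mm_mfence();

    printf("cold: %llu ref cycles\n", (unsigned long long)time_pass());
    printf("warm: %llu ref cycles\n", (unsigned long long)time_pass());
    return 0;
}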
At the bottom of page 264 of CLRS, the authors say that after obtaining r0 = 17612864, the 14 most significant bits of r0 yield the hash value h(k) = 67. I do not understand why this gives 67, since 67 in binary is 1000011, which is only 7 bits.
EDIT
In the textbook:
As an example, suppose we have k = 123456, p = 14, m = 2^14 = 16384, and w = 32. Adapting Knuth's suggestion, we choose A to be the fraction of the form s/2^32 that is closest to (√5 - 1)/2, so that A = 2654435769/2^32. Then k*s = 327706022297664 = 76300 * 2^32 + 17612864, and so r1 = 76300 and r0 = 17612864. The 14 most significant bits of r0 yield the value h(k) = 67.
17612864 = 0x010CC040 =
0000 0001 0000 1100 1100 0000 0100 0000
The most significant 14 bits of that are
0000 0001 0000 11
which is 0x43, which is 67.
Also:
int32_t input = 17612864;
int32_t output = input >> (32 - 14); // 67
In a 32-bit world:
17612864 = 00000001 00001100 11000000 01000000 (binary)
top fourteen bits = 00000001 000011 = 67
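Putting the whole computation into a few lines of C (plain arithmetic only, nothing from the book's pseudocode is assumed):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Multiplication method with w = 32 and p = 14, i.e. m = 2^14. */
    uint32_t k = 123456;
    uint32_t s = 2654435769u;            /* A = s / 2^32, close to (sqrt(5) - 1) / 2 */

    uint64_t ks = (uint64_t)k * s;       /* 327706022297664 */
    uint32_t r1 = (uint32_t)(ks >> 32);  /* 76300 */
    uint32_t r0 = (uint32_t)ks;          /* 17612864 */
    uint32_t h  = r0 >> (32 - 14);       /* the 14 most significant bits of r0 */

    printf("r1 = %u, r0 = %u, h(k) = %u\n",
           (unsigned)r1, (unsigned)r0, (unsigned)h);  /* h(k) = 67 */
    return 0;
}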
I am looking at the code for the font file here:
http://www.openobject.org/opensourceurbanism/Bike_POV_Beta_4
The code starts like this:
const byte font[][5] = {
{0x00,0x00,0x00,0x00,0x00}, // 0x20 32
{0x00,0x00,0x6f,0x00,0x00}, // ! 0x21 33
{0x00,0x07,0x00,0x07,0x00}, // " 0x22 34
{0x14,0x7f,0x14,0x7f,0x14}, // # 0x23 35
{0x00,0x07,0x04,0x1e,0x00}, // $ 0x24 36
{0x23,0x13,0x08,0x64,0x62}, // % 0x25 37
{0x36,0x49,0x56,0x20,0x50}, // & 0x26 38
{0x00,0x00,0x07,0x00,0x00}, // ' 0x27 39
{0x00,0x1c,0x22,0x41,0x00}, // ( 0x28 40
{0x00,0x41,0x22,0x1c,0x00}, // ) 0x29 41
{0x14,0x08,0x3e,0x08,0x14}, // * 0x2a 42
{0x08,0x08,0x3e,0x08,0x08}, // + 0x2b 43
and so on...
I am very confused as to how this code works - can someone explain it to me please?
Thanks,
Majd
Each array of 5 bytes = 40 bits, which maps onto the 7x5 = 35 pixels of the character grid (presumably 5 bits are left unused).
When you want to display a character, you copy the corresponding 5-byte bitmap for that character to the appropriate memory location. E.g. to display the character X you would copy the data from font['X' - 0x20] (the table starts at ASCII 0x20, the space character).
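If it helps to visualise, here's a small stand-alone sketch that prints one glyph. Only the first four table entries are copied in, the Arduino byte type is replaced by unsigned char, and the bit order (bit 0 = top row) is my reading of the '!' entry rather than anything stated in the linked code.

#include <stdio.h>

/* First few entries of the table, starting at ASCII 0x20 (space).
   Each byte is one column of the 5-wide glyph. */
const unsigned char font[][5] = {
    {0x00,0x00,0x00,0x00,0x00}, //   0x20
    {0x00,0x00,0x6f,0x00,0x00}, // ! 0x21
    {0x00,0x07,0x00,0x07,0x00}, // " 0x22
    {0x14,0x7f,0x14,0x7f,0x14}, // # 0x23
};

/* Print one glyph as a 7-row x 5-column grid of '#' and '.'. */
static void show(char c)
{
    const unsigned char *cols = font[c - 0x20];  /* the table starts at space */
    for (int row = 0; row < 7; row++) {
        for (int col = 0; col < 5; col++)
            putchar((cols[col] >> row) & 1 ? '#' : '.');
        putchar('\n');
    }
}

int main(void)
{
    show('#');
    return 0;
}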