Swift 2: find the location of the first byte of a sequence

In NSData object I need to find the location of the first byte of a specific bytes sequence "D0 CF 11 E0 A1 B1 1A E1". How may I do it in Swift 2?
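No answer is recorded here, but there are two usual options: NSData's `rangeOfData(_:options:range:)` method performs exactly this kind of byte-sequence search, or you can scan manually. The search logic can be sketched in Python (the helper name `index_of` is ours; the pattern is the OLE2 / Compound File signature from the question):

```python
# The byte sequence from the question (the OLE2 / Compound File signature)
pattern = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])

def index_of(data: bytes, needle: bytes) -> int:
    """Return the offset of the first byte of `needle` in `data`, or -1."""
    for i in range(len(data) - len(needle) + 1):
        if data[i:i + len(needle)] == needle:
            return i
    return -1

blob = b"1234" + pattern + b"tail"
print(index_of(blob, pattern))  # -> 4 (same result as blob.find(pattern))
```

In Swift 2 the equivalent one-liner is to call `rangeOfData(_:options:range:)` on the NSData object and check the returned NSRange's `location` against `NSNotFound`.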

Related

How to simplify algorithm of pasting addresses of data to immediate fields in machine code?

When developing my assembler I ran into a problem. In assembly language we can define data values (e.g. msg db 'Hi') and use the address of this data anywhere, even above the definition. However, when assembling the code, the assembler doesn't know the address of the data until it has processed the line that defines it.
Of course, the assembler can remember the machine-code addresses where the address of defined data is used, and after processing the code replace the placeholder values at those addresses with the actual address of the data. But if the data ends up at a 2-byte address (e.g. at 01F1), the assembler has to replace 1 byte at each remembered address with the 2 address bytes (01 F1), so the immediate field size changes (imm8 -> imm16) and the assembler must rewrite the instruction at that address (change the w and s bits in the opcode and maybe add the 0x66 prefix). And if the data definition comes after such a rewritten instruction, the instruction's growth shifts the data, so the assembler has to rewrite the immediate field bytes again (incrementing the address value).
Illustration of this algorithm:
The following code:
mov dh, 09h
add dx, msg
;...
msg db 'Hello$'
will be assembled in the following principle:
preparing the code:
Comment : |===> Remember address of this byte (0x0004)
Comment : | ADD DX,MSG |
Address : 0000 0001 |0002 0003 0004| ... 01F1 01F2 01F3 01F4 01F5 01F6
Code : B4 09 | 83 C2 00 | ... 48 65 6C 6C 6F 24
Comment : ---------------- H e l l o $
rewriting code at the remembered addresses:
Comment : |=============|-This address (msg)
Comment : | ADD DX,01F1 | v
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code : B4 09 | 83 C2 F1 01 | ... 48 65 6C 6C 6F 24
Comment : --------------------- H e l l o $
rewriting instruction's opcode 83h -> 81h (10000011b -> 10000001b: bit s=0):
Comment : |=============|-This address (msg)
Comment : | ADD DX,01F1 | v
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code : B4 09 | 81 C2 F1 01 | ... 48 65 6C 6C 6F 24
Comment : --------------------- H e l l o $
write to immediate field the new address of data (0x01F2):
Comment : |=============|-This address (msg)
Comment : | ADD DX,01F2 | v
Address : 0000 0001 |0002 0003 0004 0005| ... 01F2 01F3 01F4 01F5 01F6 01F7
Code : B4 09 | 81 C2 F2 01 | ... 48 65 6C 6C 6F 24
Comment : --------------------- H e l l o $
I think this algorithm is complicated. Is it possible to simplify it?
If the assembler isn't emitting a flat binary (i.e. also being a linker), the assembler has to assume the symbol address might be 2 bytes, because the final absolute address won't be known until link time, after the assembler is done. (So it will just leave space for a 2-byte address and a relocation for the linker to fill it in).
But if you are assembling directly into a flat binary and want to do this optimization, presumably you'd treat it like branch displacements with a start-small algorithm and do multi-pass optimization until everything fits. In fact you'd do this as part of the same passes that look at jmp/jcc rel8 vs. jmp/jcc rel16. (See "Why is the start-small algorithm for branch displacement not optimal?": it is optimal if you don't have things like align 8; otherwise there are corner cases where it does OK but is not optimal.)
These optimization passes are just looping over internal data-structures that represent the code, not actually writing final machine-code at each step. There's no need to calculate or look up the actual opcodes and ModRM encodings until after the last optimization pass.
You just need your optimizer to know rules for instruction size, e.g. that add reg, imm8 is 3 bytes, add reg, imm16 is 4 bytes (except for AX where add ax, imm16 has a 3-byte special encoding, same as add ax, imm8, so add to AX doesn't need to be part of the multi-pass optimization at all, it can just choose an encoding when we reach it after all symbol addresses are known.)
Note that it's much more common to use addresses as immediates for mov which doesn't allow narrow immediates at all (mov reg, imm16 is always 3 bytes). But this optimization is also relevant for disp8 vs. disp16 in addressing modes, e.g. for xor cl, [di + msg] could use reg+disp8 for small addresses so it's worth having this optimization.
So again, your optimizer passes would know that [di + disp8] takes 1 byte after the ModRM, [di + disp16] takes 2 extra.
And [msg] always takes 2 bytes after ModRM, there is no [disp8] encoding. The asm source would need to have a zeroed register if it wanted to take advantage of disp8 for small addresses.
Of course a simplistic or one-pass assembler could always just assume addresses are 16-bit and encode the other parts of the machine code accordingly, only going back to fill in numeric addresses once unresolved symbols are seen. (Or emit relocation info at the end for a linker to do it.)
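The start-small approach described above can be sketched in a few lines. This is a toy model, not a real assembler: the instruction records (text, short size, long size, symbol reference) are hypothetical, and the data address is fixed to match the ADD DX,msg illustration from the question.

```python
# Start-small multi-pass sizing: every symbol-referencing instruction begins in
# its short (imm8) form and is widened only when its target won't fit in a byte.
# Instruction records are hypothetical: (text, short_size, long_size, target).
code = [
    ("mov dh, 09h", 2, 2, None),
    ("add dx, msg", 3, 4, "msg"),
]
base_msg = 0x01F1              # msg address if every instruction stays short
sizes = [short for _, short, _, _ in code]

changed = True
while changed:                 # repeat until no instruction needs to grow
    changed = False
    growth = sum(sizes) - sum(short for _, short, _, _ in code)
    msg = base_msg + growth    # data shifts by however much the code grew
    for i, (_, short, long_, target) in enumerate(code):
        if target == "msg" and msg > 0xFF and sizes[i] != long_:
            sizes[i] = long_   # imm8 -> imm16: the instruction grows
            changed = True

print(sizes, hex(msg))         # -> [2, 4] 0x1f2
```

The second pass sees that the ADD already uses its long form and makes no further changes, so the loop terminates with msg at 0x01F2, matching the final state in the illustration. Only after this loop converges would you emit actual opcode and ModRM bytes.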

Hexdump: Convert between bytes and two-byte decimal

When I use hexdump on a file with no options, I get rows of hexadecimal bytes:
cf fa ed fe 07 00 00 01 03 00 00 80 02 00 00 00
When I used hexdump -d on the same file, that same data is shown in something called two-byte decimal groupings:
64207 65261 00007 00256 00003 32768 00002 00000
So what I'm trying to figure out here is how to convert between these two encodings. cf and fa in decimal are 207 and 250 respectively. How do those numbers get combined to make 64207?
Bonus question: What is the advantage of using these groupings? The octal display uses 16 groupings of three digits, why not use the same thing with the decimal display?
As commented by @georg:
0xfa * 256 + 0xcf == 0xfacf == 64207
The conversion works exactly like this.
So, if you see man hexdump:
-d, --two-bytes-decimal
Two-byte decimal display. Display the input offset in hexadecimal, followed by eight
space-separated, five-column, zero-filled, two-byte units of input data, in unsigned
decimal, per line.
So, for example:
00000f0 64207 65261 00007 00256 00003 32768 00002 00000
Here, 00000f0 is a hexadecimal offset.
Followed by two-byte units of input data, e.g. 64207 in decimal (the first 16 bits, i.e. the first two bytes of the file).
The conversion (in your case):
cf fa ----> two-byte unit (the byte ordering depends on your architecture; x86 is little-endian).
fa * 256 + cf = facf ----> i.e. 0xfacf (after re-ordering)
And the decimal value of 0xfacf is 64207.
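In code terms, hexdump -d simply reads each byte pair as a little-endian unsigned 16-bit integer. A small check using the sample row from the question:

```python
import struct

# The 16 bytes from the plain hexdump of the file
data = bytes.fromhex("cf fa ed fe 07 00 00 01 03 00 00 80 02 00 00 00")

# hexdump -d interprets each byte pair as a little-endian unsigned 16-bit value
words = struct.unpack("<8H", data)
print(" ".join("%05d" % w for w in words))
# -> 64207 65261 00007 00256 00003 32768 00002 00000
```

The output reproduces the hexdump -d line exactly, including the five-digit zero-filled formatting.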
Bonus question: It is a convention to display octal numbers using three digits (unlike hex and decimal), so the octal display uses a three-digit group for each byte.

Why strings are stored in the following way in PE file

I opened a .exe file and I found a string "Premium" was stored in the following way
50 00 72 00 65 00 6D 00 69 00 75 00 6D 00
I just don't know why "00" is appended to each of the characters and what its purpose is.
Thanks,
It's probably a UTF-16 encoding of a Unicode string. Here's an example using Python:
>>> u"Premium".encode("utf16")
'\xff\xfeP\x00r\x00e\x00m\x00i\x00u\x00m\x00'
#         ^    ^    ^    ^    ^    ^    ^
After the byte marker to indicate endianness, you can see alternating letters and null bytes.
\xff\xfe is the byte-order marker; it indicates that the low-order byte of each 16-bit value comes first. (If the high-order byte came first, the byte marker would be \xfe\xff; there's nothing particularly meaningful about which marker means which.)
Each character is then encoded as a 16-bit value. For many values, the UTF-16 encoding is simply the straightforward unsigned 16-bit representation of the Unicode code point. In particular, characters in the 7-bit ASCII range use a null byte as the high-order byte and the ASCII value as the low-order byte.
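Decoding the bytes from the question confirms this. Note that inside the PE file there is no byte-order marker, so the endianness must be given explicitly (PE strings are little-endian):

```python
# The raw bytes found in the .exe
raw = bytes.fromhex("50 00 72 00 65 00 6D 00 69 00 75 00 6D 00")

# No BOM present, so decode as explicit little-endian UTF-16
print(raw.decode("utf-16-le"))  # -> Premium
```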

Understanding .bmp file

I have a .bmp file
I sort of do understand and sort of do not understand. I understand that the first 14 bytes are my Bitmapfileheader. I furthermore do understand that my Bitmapinfoheader contains information about the bitmap as well and is about 40 bytes large (in version 3).
What I do not understand is, how the information is stored in there.
I have this image:
Why is all the color information stored as "FF"? I know that the "00" bytes are "junk bytes". What I do not understand is why everything is "FF"?!
Furthermore, I do not understand what type of "encoding" that is. 42 4D equals "BM". What is that? How can I translate what I see there into colors or letters or numbers?!
What I can read in this case:
Bitmapfileheader:
First 2 bytes: "BM" if it is a .bmp file; 42 4D = "BM".
Next 4 Bytes: Size of the bitmap. BA 01 00 00. Don't know what size that should be.
Next 4 Bytes: Something reserved.
Next 4 Bytes: Offset (did not quite understand that)
Bitmapinfoheader
Next 4 Bytes: Size of the bitmapinfoheader. 6C 00 00 00 here.
Next 4 Bytes: Width of the .bmp. 0A 00 00 00. I know that that must be 10px since I created that file.
Next 4 Bytes: Height of the .bmp. 0A 00 00 00. I know that that must be 10px since I created that file.
Next 2 Bytes: Something from another file format.
Next two Bytes: Color depth. 18 00 00 00. I thought that can only by 1,2,4,8, 16, 24, 32?
The first 2 bytes of information that you see, "42 4D", are what we call the magic number: the signature of the file. 42 4D is the hex notation of 01000010 01001101 in binary.
Most file formats have one: .jpg, .gif, and so on.
Here is an image that illustrates a complete 54-byte BMP header (24-bit BMP):
BMP Header
The total size of the BMP is calculated as the size of the headers + BMP.width * BMP.height * 3 (1 byte each for red, green and blue, in the case of 8 bits of information per channel) + padding (if it exists).
The junk bytes you refer to are the padding; they are needed when the size of each scanline (row) is not a multiple of 4 bytes.
White in hex notation is ffffff: two hex digits per channel for red, green and blue.
In decimal, each channel has the value 255, because 2^8 - 1 = 255.
Hope this clears things up a bit (pun not intended).
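The arithmetic above can be checked against the exact file from the question: 10x10 pixels, 24 bits per pixel, and an info header of 0x6C = 108 bytes (so a later header version, not the 40-byte version 3). A sketch, assuming those values:

```python
width, height, bpp = 10, 10, 24
header_size = 14 + 0x6C            # file header + 108-byte info header (as read)
row = (width * bpp // 8 + 3) & ~3  # each scanline is padded up to a multiple of 4
file_size = header_size + row * height
print(row, hex(file_size))         # -> 32 0x1ba
```

A 10-pixel row needs 30 bytes and is padded to 32, and the total comes out to 0x1BA, which is exactly the size field BA 01 00 00 from the question read as a little-endian 32-bit value.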

Figuring out how to decode obfuscated URL parameters

I have web based system that uses encrypted GET parameters. I need to figure out what encryption is used and create a PHP function to recreate it. Any ideas?
Example URL:
...&watermark=ISpQICAK&width=IypcOysK&height=IypcLykK&...
You haven't provided nearly enough sample data for us to reliably guess even the alphabet used to encode it, much less what structure it might have.
What I can tell, from the three sample values you've provided, is:
There is quite a lot of redundancy in the data — compare e.g. width=IypcOysK and height=IypcLykK (and even watermark=ISpQICAK, though that might be just coincidence). This suggests that the data is neither random nor securely encrypted (which would make it look random).
The alphabet contains a fairly broad range of upper- and lowercase letters, from A to S and from c to y. Assuming that the alphabet consists of contiguous letter ranges, that means a palette of between 42 and 52 possible letters. Of course, we can't tell with any certainty from the samples whether other characters might also be used, so we can't even entirely rule out Base64.
This is not the output of PHP's base_convert function, as I first guessed it might be: that function only handles bases up to 36, and doesn't output uppercase letters.
That, however, is just about all. It would help to see some more data samples, ideally with the plaintext values they correspond to.
Edit: The id parameters you give in the comments are definitely in Base64. Besides the distinctive trailing = signs, they both decode to simple strings of nine printable ASCII characters followed by a line feed (hex 0A):
_Base64___________Hex____________________________ASCII_____
JiJQPjNfT0MtCg== 26 22 50 3e 33 5f 4f 43 2d 0a &"P>3_OC-.
JikwPClUPENICg== 26 29 30 3c 29 54 3c 43 48 0a &)0<)T<CH.
(I've replaced non-printable characters with a . in the ASCII column above.) On the assumption that all the other parameters are Base64 too, let's see what they decode to:
_Base64___Hex________________ASCII_
ISpQICAK 21 2a 50 20 20 0a !*P .
IypcOysK 23 2a 5c 3b 2b 0a #*\;+.
IypcLykK 23 2a 5c 2f 29 0a #*\/).
ISNAICAK 21 23 40 20 20 0a !## .
IyNAPjIK 23 23 40 3e 32 0a ###>2.
IyNAKjAK 23 23 40 2a 30 0a ###*0.
ISggICAK 21 28 20 20 20 0a !( .
IikwICAK 22 29 30 20 20 0a ")0 .
IilAPCAK 22 29 40 3c 20 0a ")#< .
So there's definitely another encoding layer involved, but we can already see some patterns:
All decoded values consist of a constant number of printable ASCII characters followed by a trailing line feed character. This cannot be a coincidence.
Most of the characters are on the low end of the printable ASCII range (hex 20 – 7E). In particular, the lowest printable ASCII character, space = hex 20, is particularly common, especially in the watermark strings.
The strings in each URL resemble each other more than they resemble the corresponding strings from other URLs. (But there are resemblances between URLs too: for example, all the decoded watermark values begin with ! = hex 21.)
In fact, the highest numbered character that occurs in any of the strings is _ = hex 5F, while the lowest (excluding the line feeds) is space = hex 20. Their difference is hex 3F = decimal 63. Coincidence? I think not. I'll guess that the second encoding layer is similar to uuencoding: the data is split into 6-bit groups (as in Base64), and each group is mapped to an ASCII character simply by adding hex 20 to it.
In fact, it looks like the second layer might be uuencoding: the first bytes of each string have the right values to be uuencode length indicators. Let's see what we get if we try to decode them:
_Base64___________UUEnc______Hex________________ASCII___re-UUE____
JiJQPjNfT0MtCg== &"P>3_OC- 0b 07 93 fe f8 cd ...... &"P>3_OC-
JikwPClUPENICg== &)0<)T<CH 25 07 09 d1 c8 e8 %..... &)0<)T<CH
_Base64___UUEnc__Hex_______ASC__re-UUE____
ISpQICAK !*P 2b + !*P``
IypcOysK #*\;+ 2b c6 cb +.. #*\;+
IypcLykK #*\/) 2b c3 c9 +.. #*\/)
ISNAICAK !## 0e . !##``
IyNAPjIK ###>2 0e 07 92 ... ###>2
IyNAKjAK ###*0 0e 02 90 ... ###*0
ISggICAK !( 20 !(```
IikwICAK ")0 25 00 %. ")0``
IilAPCAK ")#< 26 07 &. ")#<`
This is looking good:
Uudecoding and re-encoding the data (using Perl's unpack "u" and pack "u") produces the original string, except that trailing spaces are replaced with ` characters (which falls within acceptable variation between encoders).
The decoded strings are no longer printable ASCII, which suggests that we might be closer to the real data.
The watermark strings are now single characters. In two cases out of three, they're prefixes of the corresponding width and height strings. (In the third case, which looks a bit different, the watermark might perhaps have been added to the other values.)
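Both decoding layers described above can be reproduced with Python's standard library, which is handy for checking further samples (values taken from the question):

```python
import base64
import binascii

def decode(value):
    """Base64-decode, then uudecode the resulting single uuencoded line."""
    line = base64.b64decode(value).rstrip(b"\n")
    return binascii.a2b_uu(line)

print(decode("ISpQICAK"))                  # the watermark sample -> b'+'
print(decode("JikwPClUPENICg==").hex())    # an id sample -> 250709d1c8e8
```

The first byte of each uuencoded line is the length indicator (character value minus hex 20), which is why the decoded strings come out shorter than the printable forms in the tables.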
One more piece of the puzzle — comparing the ID strings and corresponding numeric values you give in the comments, we see that:
The numbers all have six digits. The first two digits of each number are the same.
The uudecoded strings all have six bytes. The first two bytes of each string are the same.
Coincidence? Again, I think not. Let's see what we get if we write the numbers out as ASCII strings, and XOR them with the uudecoded strings:
_Num_____ASCII_hex___________UUDecoded_ID________XOR______________
406747 34 30 36 37 34 37 25 07 09 d1 c8 e8 11 37 3f e6 fc df
405174 34 30 35 31 37 34 25 07 0a d7 cb eb 11 37 3f e6 fc df
405273 34 30 35 32 37 33 25 07 0a d4 cb ec 11 37 3f e6 fc df
What is this 11 37 3f e6 fc df string? I have no idea — it's mostly not printable ASCII — but XORing the uudecoded ID with it yields the corresponding ID number in three cases out of three.
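That XOR relationship can be verified in code, using the three id/number pairs and the constant recovered in the table above:

```python
# uudecoded id bytes keyed by the numeric value they should correspond to
ids = {
    406747: bytes.fromhex("250709d1c8e8"),
    405174: bytes.fromhex("25070ad7cbeb"),
    405273: bytes.fromhex("25070ad4cbec"),
}
key = bytes.fromhex("11373fe6fcdf")  # the unexplained constant from the table

for number, decoded in ids.items():
    recovered = bytes(a ^ b for a, b in zip(decoded, key)).decode("ascii")
    print(number, recovered)  # the recovered string matches the number
```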
More to think about: you've provided two different ID strings for the value 405174: JiJQPjNfT0MtCg== and JikwPCpVXE9LCg==. These decode to 0b 07 93 fe f8 cd and 25 07 0a d7 cb eb respectively, and their XOR is 2e 00 99 29 33 26. The two URLs from which these ID strings came from have decoded watermarks of 0e and 20 respectively, which accounts for the first byte (and the second byte is the same in both, anyway). Where the differences in the remaining four bytes come from is still a mystery to me.
That's going to be difficult. Even if you find the encryption method and keys, the original data is likely salted and the salt is probably varied with each record.
That's the point of encryption.
