Reversing byte input with basic assembly code - algorithm

I'm participating in a CTF where one task is to reverse a row of input bytes using an assembly-ish environment. The input is x bytes long and the last byte is always 0x00. One example would be:
Input 4433221100, output 0011223344
I'm thinking that a loop that loops until it reaches input 00 is a place to start.
Do any of you have a suggestion on how to approach this? I don't need specific code examples, but some advice to point me in the right direction would be great. I only have basic ALU operations, jumps and conditional jumps, storing and reading memory addresses, and some other basic stuff available. All ALU operations are mod 256.

Yes, searching for the 0 byte to find the end / length is one way to start. Depending on where you want the destination, it's possible to copy in the same loop that searches for the end.
If you want to reverse in-place, you need to find the end first (with a separate loop). Then you can load from both ends, store registers to opposite locations, and walk your pointers inward until they cross, standard in-place reverse that you can find examples of anywhere.
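For illustration, a minimal C sketch of that two-pass in-place reverse (a sketch assuming a 0-terminated buffer; this variant leaves the terminator in place, so widen the swap range by one byte if, as in the example output above, the 0 should end up at the front):

#include <stdio.h>

/* Pass 1: find the 0 terminator. Pass 2: swap from both ends inward. */
void reverse_in_place(unsigned char *buf)
{
    unsigned char *end = buf;
    while (*end != 0)              /* walk to the terminator */
        end++;
    end--;                         /* last real byte */
    while (buf < end) {            /* swap ends, walk pointers inward */
        unsigned char tmp = *buf;
        *buf = *end;
        *end = tmp;
        buf++;
        end--;
    }
}

int main(void)
{
    unsigned char data[] = { 0x44, 0x33, 0x22, 0x11, 0x00 };
    reverse_in_place(data);
    for (int i = 0; i < 5; i++)
        printf("%02x", data[i]);   /* prints 1122334400 */
    putchar('\n');
    return 0;
}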
If you want to make a reversed copy into other space, you could do it in one pass over the source (without finding the length first). Store output starting from the end of a buffer, decrementing the output pointer as you increment the read pointer. When you're done, you have a pointer to the start of the reversed copy, which you can pass to an output function. You won't know in advance where you're going to stop, so the buffer needs to be big enough. But since you're just passing the pointer to another function, it's fine that you don't know (until you're done copying) where the start of the reversed copy will be.
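A sketch of that single-pass reversed copy in C (names are mine; the caller supplies a destination buffer known to be big enough and passes a pointer one past its end):

/* Read the source forwards, write backwards from the end of the buffer.
   Returns a pointer to the start of the reversed copy. The 0 terminator
   is not copied; append one if the output needs it. */
unsigned char *reverse_copy(const unsigned char *src, unsigned char *buf_end)
{
    unsigned char *dst = buf_end;
    while (*src != 0)
        *--dst = *src++;   /* decrement write pointer, increment read pointer */
    return dst;
}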
You could still separately find the length and then copy, but that would be pointlessly inefficient.
If you need the reversed copy to start at some known position in another buffer (e.g. to append to another string or array), you would need the length or a pointer to the end before you store anything, so it's a 2-pass operation like reversing in-place.
You can then read the source backwards and write the destination forwards (or "output" each byte 1 at a time to some IO stream). Your loop termination condition could be a down-counter or a pointer compare using a pointer in a register, comparing src against the already-known start of the source or dst against the calculated end of the destination.
Or you can read the source forwards until you reach the position you found for the end, storing in reverse order starting from the calculated end of where the destination should go.
(If your machine is like 6502 and can easily index into a static array, but not easily keep a whole pointer in a register, obviously you'll want to use indices that count from 0. That makes detecting the start even easier, like sub reg, 1 / jnz if subtract already sets flags for a conditional branch to test.)

save your stackpointer in a variable
for each byte of the string:
    push byte onto the stack
    repeat if byte was <> 0
then:
    pull byte from stack
    output byte
    repeat until old_stackpointer is reached
In 6502 assembler this could look like:
tsx
stx OLD_STACKPTR   ; remember the stack pointer before pushing
ldy #$ff
loop:
iny
lda INPUT,y        ; load the next byte (sets the Z flag)
pha                ; push it (pha doesn't touch flags)
bne loop           ; fall through once the 0 byte has been pushed
ldy #$ff
loop2:
iny
pla                ; pop bytes back in reverse order
sta INPUT,y
tsx
cpx OLD_STACKPTR   ; done when the stack pointer is back where it started
bne loop2

Related

How does rsync identify insertions of size that is not divisible by block size?

I'm trying to understand how rsync algorithm sends only changed parts of the file. Here's the situation I can't understand:
Consider this example, where A is the sender's version of the file and B is the receiver's.
A = 1111122233333....
B = 1111133333....
Let's select 5 bytes as the block size for the sake of the example.
As you can see, the only change is that three 2s were inserted into A's file.
My understanding is that after correctly determining that the 1st block is the same for both files, rsync sees that the next blocks are '22233' and '33333', obviously different, proceeds to send that chunk to the receiver, and continues in such a way until the end of the file - all blocks would be different due to that insertion, and it would need to send the whole remaining file over the network (possibly gigabytes).
Is there some way for rsync to resolve this situation?
I took a closer look at rsync's documentation (the part about The Sender), and here's the explanation (how I understood it):
Rsync uses "rolling checksum", thus it is able to compute accumulated checksum at each byte position.
Thus, in our example, when it reads the first block (the first 5 bytes), it has checksum A1-5, which is found in the hash table as a block that is present on the receiver - thus the 1st block doesn't need to be sent.
Then it proceeds to computing the checksum for the next block (A6-10), and doesn't see it among the receiver's blocks. Then it takes the 1st byte after the last matching block, appends it to the non-matching data, and proceeds to compute the A7-11 checksum. After two more 2s are appended to the non-matching data, it computes A9-13, which matches the B6-10 checksum. At this point, it can send 222 to the receiver as inserted data, and skip the block of 3s.
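As a concrete sketch of why checking every byte offset is cheap, here is a weak rolling checksum in C, modeled on the one rsync's documentation describes (names are mine, not rsync's source):

#include <stdint.h>

/* rsync-style weak checksum over a window of len bytes:
   s1 = sum of the bytes, s2 = sum of the running partial sums,
   both naturally mod 2^16 in uint16_t arithmetic. */
typedef struct { uint16_t s1, s2; } rolling_sum;

rolling_sum checksum_block(const uint8_t *p, int len)
{
    rolling_sum r = { 0, 0 };
    for (int i = 0; i < len; i++) {
        r.s1 += p[i];
        r.s2 += r.s1;
    }
    return r;
}

/* Slide the window one byte: drop `out`, take in `in`. O(1) per byte,
   which is what lets rsync try a block match at every offset. */
rolling_sum roll(rolling_sum r, uint8_t out, uint8_t in, int len)
{
    r.s1 = r.s1 - out + in;
    r.s2 = r.s2 - (uint16_t)(len * out) + r.s1;
    return r;
}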

Splitting files with redundancy

I need a cmd tool or an algorithm description to do the following:
Given a file of any size (size < 1GB) I need to split the file into n parts (n may be any reasonable number) and 1 file with metadata.
Metadata is something similar to the "parity" byte in RAID systems, that will allow me to restore the whole file if one part of the file is lost. So the metadata is a kind of 'extra' or redundant info that helps to restore the original file.
The other n parts should not have any extra information.
So far, I have tried the parchive and par2 tools, but they don't implement the idea I described above.
I also tried to get into forward error correction (FEC) and Reed-Solomon codes, but didn't succeed there either.
Thanks for any help.
You can apply something almost identical to RAID 5 (although that's for hard drives, the same logic applies).
So, send the 1st byte to the 1st file, the 2nd to the 2nd, ..., the nth to the nth, the (n+1)th to the 1st, the (n+2)th to the 2nd, etc.
Then just put the parity of each bit across the n files into the last file, i.e.:
1st bit of last file = 1st bit of 1st file
XOR 1st bit of 2nd file
XOR ...
XOR 1st bit of nth file
2nd bit of last file = XOR of 2nd bits of all the other files as above
This will allow you to restore any file (including the 'metadata' file) by just calculating the parity of the rest of the files.
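A minimal sketch of that parity computation in C (assuming the n parts have been padded to equal length; restoring a lost part is the very same XOR, run over the n-1 surviving parts plus the parity file):

#include <stddef.h>

/* XOR the n equal-length parts together; out becomes the 'metadata' file. */
void xor_parity(unsigned char *const parts[], int n,
                unsigned char *out, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char p = 0;
        for (int j = 0; j < n; j++)
            p ^= parts[j][i];      /* bitwise parity of byte i across parts */
        out[i] = p;
    }
}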

Efficient Algorithm for Parsing OpCodes

Let's say I'm writing a virtual machine. I read the program data into an array of bytes. Now I need to loop through those bytes (instructions are two bytes) and instantiate a little class representing each instruction and its arguments.
What would be a fast parsing approach? Here are the two ways I've thought of:
Logically branching by inspecting each bit from the left to the right until I narrowed it down to a particular op code. This would be like a binary search.
Inspecting some programs to come up with a list of opcodes ordered by frequency of use, and then checking for the full opcode in that order.
Note: I will be using bit shifting and masking in C to check, not regexes or string comps or anything high-level like that.
You don't need to parse anything. If this is in C, you make a table of function pointers which has 256 entries in it, one for each possible byte value, then jump to the appropriate function based on the first byte value. If the second byte is significant then a switch statement can be used within the function to handle the second byte. This is how the original Visual Basic interpreter (versions 1-6) worked.
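A minimal sketch of such a dispatch table in C (the opcodes and handler names are made up for illustration):

#include <stdint.h>
#include <stdio.h>

typedef void (*handler)(uint8_t arg);   /* one handler per first-byte value */

static void op_nop(uint8_t arg)  { (void)arg; }
static void op_load(uint8_t arg) { printf("load %u\n", arg); }
static void op_bad(uint8_t arg)  { (void)arg; printf("illegal opcode\n"); }

/* 256-entry table: the first byte of an instruction indexes it directly,
   so there is no bit-by-bit narrowing and no frequency ordering needed. */
static handler dispatch[256] = {
    [0x00] = op_nop,
    [0x01] = op_load,
    /* unfilled entries are NULL and fall back to op_bad below */
};

void run(const uint8_t *code, size_t len)
{
    for (size_t pc = 0; pc + 1 < len; pc += 2) {   /* two-byte instructions */
        handler h = dispatch[code[pc]];
        (h ? h : op_bad)(code[pc + 1]);            /* second byte = argument */
    }
}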

How is an array stored in memory?

In an interest to delve deeper into how memory is allocated and stored, I have written an application that can scan memory address space, find a value, and write out a new value.
I developed a sample application with the end goal of being able to programmatically locate my array and overwrite it with a new sequence of numbers. In this situation, I created a single-dimensional array with 5 elements, e.g.
int[] array = new int[] {8,7,6,5,4};
I ran my application and searched for a sequence of the five numbers above. I was looking for any value that fell between 4 and 8, for a total of 5 numbers in a row. Unfortunately, my sequential numbers within the array matched hundreds of results, as the numbers 4 through 8, in no particular sequence, happened to be next to each other in memory in many situations.
Is there any way to distinguish that a set of numbers within memory represents an array, not simply integers that happen to be next to each other? Is there any way of knowing that if I find a certain value, the matching values following it are those of an array?
I would assume that when I declare int[] array, it's pointing at the first address of my array, which would provide some kind of metadata about what exists in the array, e.g.
0x123456789 metadata, 5 32-bit integers
0x123456789 + 4 "8"
0x123456789 + 8 "7"
0x123456789 + 12 "6"
0x123456789 + 16 "5"
0x123456789 + 20 "4"
Am I way off base?
Debug + Windows + Memory + Memory 1, set the Address field to "array". You'll see this when you switch the view to "4-byte Integer":
0x018416BC 6feb2c84 00000005 00000008 00000007 00000006 00000005 00000004
The first address is the address of the object in the garbage collected heap, plus the part of the object header that's at a negative offset (syncblk index). You cannot guess this value, the GC moves it around. The 2nd hex number is the 'type handle' for the array type (aka method table pointer). You cannot guess this value, type handles are created by the CLR on demand. The 3rd number is the array length. The rest of them are the array element values.
The odds of reliably finding this array back at runtime without a debugger are quite low. There isn't much point in trying.
Don't. The array is stored on the heap and is subject to relocation by the garbage collector. You have to use fixed if you need to make sure the memory is not moved, and even then you should use it only very carefully.
If you are after high-performance arrays, use stackalloc and your scanning scheme will work.
I don't know exactly, but this article seems to suggest that you can get a pointer to your array, with which I would think you can determine the actual address.
Although I see you are using C# and, presumably, .NET, most of your question is in very general terms about memory. Keep in mind that, in the most general sense, all memory is just bits, whether that memory holds an array, strings, or code.
With that in mind, unless you can find tell-tale signs of your current platform's way of allocating different data types, there is no difference between memory that contains arrays, strings, or code.
Also, I wouldn't make any assumptions about whether an array "points" to the first item in the array. Perhaps someone else can address this issue specifically, but I would assume some sort of header is involved.
Memory is not always stored contiguously. If you can ensure that it is, what you are asking is possible.

optimizing byte-pair encoding

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.
The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
The runtime of my trivial implementation was less than stellar, however.
How can this be optimized? Is it possible to do it in a single pass?
This is a summary of my progress so far:
Googling found this little report that links to the original code and cites the source:
Philip Gage, 'A New Algorithm for Data Compression', The C Users Journal, February 1994.
The links to the code on the Dr Dobbs site are broken, but that webpage mirrors them.
That code uses a hash table to track the used digraphs and their counts on each pass over the buffer, so as to avoid recomputing them from scratch each pass.
My test data is enwik8 from the Hutter Prize.
|----------------|-----------------|--------------------------------------------------|
| Implementation | Time (min.secs) | Notes                                            |
|----------------|-----------------|--------------------------------------------------|
| bpev2          | 1.24            | the current version in the large text benchmark  |
| bpe_c          | 1.07            | the original version by Gage, using a hashtable  |
| bpev3          | 0.25            | uses a list, custom sort, less memcpy            |
|----------------|-----------------|--------------------------------------------------|
bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (4, the smallest count at which a substitution gains a byte); this list is sorted and the first substitution is made.
As the substitutions are made, the statistics are updated; typically only around 10 or 20 digraphs change each pass; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than sorting the whole digraph list each pass, since the list is nearly sorted.
The original code moved data between 'tmp' and 'buf' byte buffers; bpev3 just swaps the buffer pointers, which is worth about 10 seconds of runtime alone.
Applying the buffer-swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version; I think the hashtable is of arguable value, and that a list is a better structure for this problem.
It's still multi-pass though, and so it's not a generally competitive algorithm.
If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of its larger block sizes, it performs better than my bpe on enwik9. Also, the performance gap between the hashtable and my lists is much closer - I put that down to the -march=pentiumpro that the LTCB uses.
There are of course occasions where it is suitable and used; Symbian uses it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.
I've done work with optimizing a LZF compression implementation, and some of the same principles I used to improve performance are usable here.
To speed up performance on byte-pair encoding:
Limit the block size to less than 65kB (probably 8-16 kB will be optimal). This guarantees not all bytes will be used, and allows you to hold intermediate processing info in RAM.
Use a hashtable or a simple lookup table indexed by short integer (more RAM, but faster) to hold the counts for byte pairs. There are 65,536 possible 2-byte pairs, and at most BlockSize instances of each (max block size 64k), so each count fits in 16 bits and the whole table takes 128kB (see the sketch after this list).
Allocate and reuse data structures capable of holding a full compression block, replacement table, byte-pair counts, and output bytes in memory. This sounds wasteful of RAM, but when you consider that your block size is small, it's worth it. Your data should be able to sit entirely in CPU L2 or (worst case) L3 cache. This gives a BIG speed boost.
Do one fast pass over the data to collect counts, THEN worry about creating your replacement table.
Pack bytes into integers or short ints whenever possible (applicable mostly to C/C++). A single entry in the counting table can be represented by an integer (16-bit count, plus byte pair).
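A sketch of the counting pass from points 2 and 4 in C (names are mine; this is the flat-table variant, not a hashtable):

#include <stdint.h>
#include <string.h>

/* One pass over the block: count every adjacent byte pair. The pair
   (a,b) indexes a flat 64K-entry table directly, so each update is one
   array access with no hashing. Counts fit in uint16_t because the
   block size is capped below 64KB. */
void count_pairs(const uint8_t *block, size_t len, uint16_t counts[65536])
{
    memset(counts, 0, 65536 * sizeof(uint16_t));
    for (size_t i = 0; i + 1 < len; i++)
        counts[((unsigned)block[i] << 8) | block[i + 1]]++;
}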
Code in JustBasic can be found here complete with input text file.
Just BASIC Files Archive – forum post
EBPE by TomC 02/2014 – Enhanced Byte Pair Encoding
EBPE features two post processes to Byte Pair Encoding
1. Compressing the dictionary (believed to be a novelty)
A dictionary entry is composed of 3 bytes:
AA – the two characters to be replaced (the byte pair)
1 – the single token that replaces them (tokens are unused symbols)
So "AA1" tells us when decoding that every time we see a "1" in the
data file, replace it with "AA".
While long runs of sequential tokens are possible, let’s look at this
8 token example:
AA1BB3CC4DD5EE6FF7GG8HH9
It is 24 bytes long (8 * 3)
The token 2 is not in the dictionary, indicating that it was not an open token to
use - or, to say it another way: the 2 was in the original data.
We can see the last 7 tokens 3,4,5,6,7,8,9 are sequential so any time we
see a sequential run of 4 tokens or more, let’s modify our dictionary to be:
AA1BB3<255>CCDDEEFFGGHH<255>
Where the <255> tells us that the tokens for the byte pairs are implied and
are incremented by 1 more than the last token we saw (3). We increment
by one until we see the next <255> indicating an end of run.
The original dictionary was 24 bytes,
The enhanced dictionary is 20 bytes.
I saved 175 bytes using this enhancement on a text file where tokens
128 to 254 were in sequence, as well as others in general, including
the run created by lowercase pre-processing.
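A hedged sketch of how a decoder might expand that enhanced dictionary, written in C (the format is inferred from the description above; EBPE itself is written in JustBasic, and all names here are mine):

#include <stdint.h>
#include <stddef.h>

/* Build a token -> byte-pair map from the enhanced dictionary. Normal
   entries are 3 bytes (pair, then token); between <255>...<255>, entries
   are bare 2-byte pairs whose tokens are implied, counting up by one
   from the last explicit token. */
void read_dictionary(const uint8_t *dict, size_t n, uint8_t pair[256][2])
{
    uint8_t next_token = 0;
    size_t i = 0;
    while (i < n) {
        if (dict[i] == 255) {                    /* implied-token run */
            i++;
            while (i + 1 < n && dict[i] != 255) {
                next_token++;                    /* last token + 1 */
                pair[next_token][0] = dict[i++];
                pair[next_token][1] = dict[i++];
            }
            i++;                                 /* skip the closing <255> */
        } else {                                 /* explicit 3-byte entry */
            pair[dict[i + 2]][0] = dict[i];
            pair[dict[i + 2]][1] = dict[i + 1];
            next_token = dict[i + 2];
            i += 3;
        }
    }
}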
2. Compressing the data file
Re-using rarely used characters as tokens is nothing new.
After using all of the symbols for compression (except for <255>),
we scan the file and find a single "j" in the file. Let this char do double
duty by:
"<255>j" means this is a literal "j"
"j" is now used as a token for re-compression,
If the j occurred 1 time in the data file, we would need to add 1 <255>
and a 3 byte dictionary entry, so we need to save more than 4 bytes in BPE
for this to be worth it.
If the j occurred 6 times we would need 6 <255> and a 3 byte dictionary
entry so we need to save more than 9 bytes in BPE for this to be worth it.
Depending on if further compression is possible and how many byte pairs remain
in the file, this post process has saved in excess of 100 bytes on test runs.
Note: When decompressing make sure not to decompress every "j".
One needs to look at the prior character to make sure it is not a <255> in order
to decompress. Finally, after all decompression, go ahead and remove the <255>'s
to recreate your original file.
3. What’s next in EBPE?
Unknown at this time
I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, if the new byte-pair (after-replacement) will be good for replacement too or not.
Here are my thoughts at first sight. Maybe you already do, or have already thought of, all this.
I would try the following.
Two adjustable parameters:
The number of occurrences of a byte pair in a chunk of data before considering replacing it. (So that the dictionary doesn't grow faster than the chunk shrinks.)
The number of replacements per pass below which it's probably not worth replacing anymore. (So that the algorithm stops wasting time when there's maybe only 1 or 2% left to gain.)
I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.
I would play with the two parameters a little and see how they influence compression ratio and speed. They should probably change dynamically, according to the length of the chunk to compress (and maybe one or two other things).
Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
Keep us posted if you try something and get interesting results!
Yes, keep us posted.
guarantee?
BobMcGee gives good advice.
However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true.
I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.
If we limit ourselves to 7 bit ASCII text, we have over 127 free bytes available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE.
However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
single pass
It seems like this algorithm can be slightly tweaked in order to do it in one pass.
Assuming 7 bit ASCII plaintext:
Scan over input text, remembering all pairs of bytes that we have seen in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen so far, and copying each byte to the output (with high bit zero).
Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs).
Include in the internal list of byte "pairs" that special byte, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- so the net effect of that other special byte is to represent a triplet.
As phkahler pointed out, that sounds practically the same as LZW.
EDIT:
Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.
Have you seen
"SCZ - Simple Compression Utilities and Library"?
SCZ appears to be a kind of byte pair encoder.
SCZ apparently gives better compression than other byte pair compressors I've seen, because
SCZ doesn't have the "no free bytes" limitation I mentioned above.
If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text),
SCZ can do byte-pair compression, even when the text already includes all 256 bytes.
(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than expanded as a byte pair.
This allows some byte M in the compressed text to do double-duty:
The two bytes EM in the compressed text represent M in the plain text.
The byte M (without a preceding escape byte) in the compressed text represents some byte pair BP in the plain text.
If some byte pair BP occurs many more times than M in the plaintext, then the space saved by representing each BP byte pair as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)
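To make the escape mechanism concrete, here is my guess at one decode round in C (an illustration of the idea only, not SCZ's actual source; all names are mine):

#include <stdint.h>
#include <stddef.h>

/* One decode round. pair[m] holds the two bytes that token m stands for,
   and is_token[m] says whether m was used as a token in this round. */
size_t decode_round(const uint8_t *in, size_t n, uint8_t *out,
                    const uint8_t pair[256][2], const uint8_t is_token[256],
                    uint8_t esc)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == esc && i + 1 < n) {
            out[o++] = in[++i];          /* escaped: next byte is a literal */
        } else if (is_token[in[i]]) {
            out[o++] = pair[in[i]][0];   /* token expands to its byte pair */
            out[o++] = pair[in[i]][1];
        } else {
            out[o++] = in[i];            /* ordinary literal byte */
        }
    }
    return o;
}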
You can also optimize the dictionary so that:
AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 tokens.
Rewrite that as:
AA1<255>BBCCDDEEFFGGHH<255>, where the <255> tells the program that each of the following byte pairs (up to the next <255>) is sequential and incremented by one. This works great for text files and anywhere there are at least 4 sequential tokens.
It saved 175 bytes on a recent test.
Here is a new BPE (http://encode.ru/threads/1874-Alba).
Example compile:
gcc -O1 alba.c -o alba.exe
It's faster than the default.
There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.
The easiest efficient structure is a 2-dimensional array like byte_pair(255,255). Drop the counts in there and modify them as the file compresses.
