What do CONCAT15 and CONCAT412 mean in Ghidra?

I decompiled a file in Ghidra and saw a lot of CONCAT followed by a number in the decompiled output.
What do they mean?

Let me cite the Ghidra Help (F1) first:
CONCAT31(x,y) - Concatenation operator - PIECE
The digit '3' indicates the size of the input operand 'x' in bytes.
The digit '1' indicates the size of the input operand 'y' in bytes.
The parameters 'x' and 'y' hold the values being concatenated.
CONCAT31(0xaabbcc,0xdd) = 0xaabbccdd
Concatenate the bytes in 'x' with the bytes in 'y'. 'x' becomes the most significant bytes, and 'y' the least significant bytes, in the result. So all these "functions" prefixed with CONCAT belong to a set of internal decompiler functions that Ghidra uses to express things that cannot normally be expressed simply in the C-like high-level representation.
CONCAT in particular could be modeled as a left shift of the first argument by the size of the second argument, followed by a bitwise OR of the two parameters. But for humans it's much easier to think of it as "put the two things next to each other".
The numbers following CONCAT only matter if the passed arguments are not the expected sizes, and are probably mainly there to make things more explicit. Concretely, you shouldn't read CONCAT15 as "concat fifteen" but as "concat one five": the first argument is expected to have a size of one byte while the second has a size of five bytes, for a total of six bytes: CONCAT15(0x12, 0x3456789012) is the same as 0x123456789012.
P.S.: CONCAT412 almost certainly means concatenating 4 and 12 bytes, not 41 and 2.
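Here is a quick Python illustration of the shift-and-OR model. This is just a toy sketch of the semantics, not how the decompiler actually implements the operator:

    # A toy model of CONCATxy: x becomes the most significant bytes,
    # y the least significant bytes. Sizes are given in bytes, matching
    # the digit suffixes in the Ghidra output.

    def concat(x: int, y: int, x_size: int, y_size: int) -> int:
        # Mask each argument to its declared size, shift x left by the
        # bit width of y, then OR the two values together.
        x &= (1 << (8 * x_size)) - 1
        y &= (1 << (8 * y_size)) - 1
        return (x << (8 * y_size)) | y

    # CONCAT31(0xaabbcc, 0xdd) from the help text:
    print(hex(concat(0xAABBCC, 0xDD, 3, 1)))      # 0xaabbccdd

    # CONCAT15(0x12, 0x3456789012) from the example above:
    print(hex(concat(0x12, 0x3456789012, 1, 5)))  # 0x123456789012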

Related

How many numbers can we store with 1 bit?

I want to know how many characters or numbers I can store in 1 bit only. It would be more helpful if you could explain it in octal or hexadecimal.
I want to know how many characters or numbers I can store in 1 bit only.
It is not practical to use a single bit to store numbers or characters. However, you could say:
One integer provided that the integer is in the range 0 to 1.
One ASCII character provided that the character is either NUL (0x00) or SOH (0x01).
The bottom line is that a single bit has two states: 0 and 1. Any value domain with more than two values cannot be represented using a single bit.
It would be more helpful if you could explain it in octal or hexadecimal.
That is not relevant to the problem. Octal and hexadecimal are different textual representations for numeric data. They make no difference to the meaning of the numbers, or (in most cases [1]) the way that you represent the numbers in a computer.
[1] The exception is when you are representing numbers as text; e.g. when you represent the number 42 in a text document as the character '4' followed by the character '2'.
A bit is a "binary digit", or a value from a set of size two. If you have one or more bits, the number of distinct values you can represent is 2 raised to the power of the number of bits. So one bit gives 2¹ = 2 values; for example, 3 bits give 2³ = 8 distinct values. The field of mathematics that deals with this kind of counting is called combinatorics.

Base91, how is it calculated?

I've been looking online to find out how basE91 is calculated. I have found resources such as this one, which specifies the characters used for a specific value, but nowhere have I found how to get that value.
I have tried changing the input values into binary and taking chunks of both 6 and 7 bits, but these do not work and I get incorrect output. I do not want code that will do this for me, as I wish to write that myself; I only want to know the process needed to encode a string into basE91.
First, you need to see the input as a bit stream.
Then, read 13 bits from the stream and form an integer value from them. If the value of this integer is lower than or equal to 88, read one additional bit and put it into the 14th bit (the lowest bit being the 1st) of the integer. The maximum value of this integer (let's call it v) is 8192 + 88 = 8280.
Then split v into two indices: i0 = v%91, i1 = v/91. Then use a 91-element character table, and output two characters: table[i0], table[i1].
(Now you can see the reason for 88: for the maximal value, 8280, both i0 and i1 become 90.)
So this process is more complicated than base64, but more space efficient. Furthermore, unlike base64, the size of the output depends a little on the input bytes: an N-length sequence of 0x00 will encode shorter than an N-length sequence of 0xff (where N is a sufficiently large number).
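If it helps to see the bit juggling concretely, here is a rough Python sketch of that encoding loop. The bit order (least significant bit first) and the 91-character table follow the reference basE91 implementation as far as I recall it, so double-check against the reference before relying on this:

    # A rough sketch of the basE91 encoding loop described above.
    # Bits are consumed least-significant-bit first.

    TABLE = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             "abcdefghijklmnopqrstuvwxyz"
             "0123456789"
             '!#$%&()*+,./:;<=>?@[]^_`{|}~"')

    def base91_encode(data: bytes) -> str:
        out = []
        b = 0   # bit accumulator
        n = 0   # number of bits currently in the accumulator
        for byte in data:
            b |= byte << n          # append 8 new bits above the existing ones
            n += 8
            if n > 13:
                v = b & 8191        # take 13 bits
                if v > 88:
                    b >>= 13
                    n -= 13
                else:               # value <= 88: take one extra bit (14 total)
                    v = b & 16383
                    b >>= 14
                    n -= 14
                out.append(TABLE[v % 91])
                out.append(TABLE[v // 91])
        if n:                       # flush whatever bits remain
            out.append(TABLE[b % 91])
            if n > 7 or b > 90:
                out.append(TABLE[b // 91])
        return "".join(out)

    print(base91_encode(b"test"))   # compare against a known basE91 encoder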

Are both of these algorithms valid implementations of LZSS?

I am reverse engineering things and I often stumble upon various decompression algorithms. Most of the time it's LZSS, just like Wikipedia describes it:
Initialize dictionary of size 2^n
While output is less than known output size:
Read flag
If the flag is set, output literal byte (and append it at the end of dictionary)
If the flag is not set:
Read length and look behind position
Transcribe length bytes from the dictionary at the look-behind position to the output, and append them at the end of the dictionary.
The thing is that implementations follow two schools of how to encode the flag. The first one treats the input as a sequence of bits:
(...)
Read flag as one bit
If it's set, read literal byte as 8 unaligned bits
If it's not set, read length and position as n and m unaligned bits
This involves lots of bit shift operations.
The other one saves a little CPU time by using bitwise operations only for flag storage, whereas literal bytes, length and position are derived from aligned input bytes. To achieve this, it breaks the linearity by fetching a few flags in advance. So the algorithm is modified like this:
(...)
Read 8 flags at once by reading one byte. For each of these 8 flags:
If it's set, read literal as aligned byte
If it's not set, read length and position as aligned bytes (deriving the specific values from the fetched bytes involves some bit operations, but it's nowhere as expensive as the first version.)
My question is: are these both valid LZSS implementations, or did I identify these algorithms wrong? Are there any known names for them?
They are effectively variants on LZSS, since both use one bit to decide between a literal and a match. More generally, they are variants on LZ77.
Deflate is also a variant on LZ77, which does not use a whole bit for literal vs. match. Instead deflate has a single code for the combination of literals and lengths, so the code implicitly determines whether the next thing is a literal or a match. A length code is followed by a separate distance code.
lz4 (a specific algorithm, not a family) handles byte alignment in a different way, coding the number of literals, which is necessarily followed by a match. The first byte carries both the number of literals and part of the match length. The literals are byte aligned, as are the two-byte offset that follows the literals and any extra length bytes.
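For reference, here is a rough sketch of the second (flag-byte) variant from the question. The exact field widths are assumptions for illustration only: a 12-bit look-behind distance, a 4-bit length with a minimum match of 3, and distances measured back from the current end of the output rather than into a pre-filled ring buffer:

    # A sketch of the byte-aligned flag variant: one flag byte covers the
    # next 8 items; set bits mean literals, clear bits mean back-references.
    # Field widths (12-bit distance, 4-bit length, minimum match 3) are
    # assumptions; real formats vary.

    def lzss_decompress(data: bytes, out_size: int) -> bytes:
        out = bytearray()
        pos = 0
        while len(out) < out_size:
            flags = data[pos]; pos += 1
            for bit in range(8):
                if len(out) >= out_size:
                    break
                if flags & (1 << bit):                   # literal byte
                    out.append(data[pos]); pos += 1
                else:                                    # back-reference
                    lo, hi = data[pos], data[pos + 1]; pos += 2
                    distance = ((hi & 0xF0) << 4) | lo   # 12-bit distance
                    length = (hi & 0x0F) + 3             # 4-bit length, minimum match 3
                    for _ in range(length):
                        out.append(out[-distance])       # byte-by-byte copy handles overlaps
        return bytes(out)

The first, bit-oriented variant differs only in that the flag, literal, length and distance fields are read from an unaligned bit stream.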

LZ4 compression algorithm explanation

Description from Wikipedia:
The LZ4 algorithm represents the data as a series of sequences. Each sequence begins with a one-byte token that is broken into two 4-bit fields. The first field represents the number of literal bytes that are to be copied to the output. The second field represents the number of bytes to copy from the already decoded output buffer (with 0 representing the minimum match length of 4 bytes). A value of 15 in either of the bitfields indicates that the length is larger and there is an extra byte of data that is to be added to the length. A value of 255 in these extra bytes indicates that yet another byte is to be added. Hence arbitrary lengths are represented by a series of extra bytes containing the value 255. After the token and any extra bytes needed to indicate the literal length comes the string of literals. This is followed by an offset that indicates how far back in the output buffer to begin copying. The extra bytes (if any) of the match length come at the end of the sequence.
I didn't understand that at all! Does anyone have an easy way to understand example?
For example, in the above explanation what is a literal byte and what is a match? How can we have a decoded output buffer when we're just beginning to compress? Length of what?
The explanation here was also impenetrable for me.
A simple example would be nice unless you have a better way of explaining it.
First, read about LZ77, the core approach being used. The text is a description of a particular way to code a series of literals and string matches in the preceding data.
A match is when the next bytes in the uncompressed data occur in the previously decompressed data. Instead of sending those bytes directly, a length and an offset are sent. Then you go offset bytes backwards in the output and copy length bytes to the output.
Yes, you can't have a match at the beginning of the stream. You have to start with literals. (Unless there is a preset dictionary, which is another topic.)
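To make that concrete, here is a rough sketch of decoding a single LZ4 sequence from a raw block. The function names are mine, and error handling is omitted:

    # Decode one LZ4 sequence: token, literal length, literals, offset,
    # match length, then copy the match from the already-decoded output.

    def read_length(src: bytes, pos: int, length: int):
        # A nibble value of 15 means "more follows"; each extra byte is
        # added, and a byte of 255 means yet another byte follows.
        if length == 15:
            while True:
                extra = src[pos]; pos += 1
                length += extra
                if extra != 255:
                    break
        return length, pos

    def decode_sequence(src: bytes, pos: int, out: bytearray) -> int:
        token = src[pos]; pos += 1
        lit_len, pos = read_length(src, pos, token >> 4)     # high nibble: literal count
        out.extend(src[pos:pos + lit_len]); pos += lit_len   # literals are copied verbatim
        if pos >= len(src):
            return pos                                       # the last sequence has no match
        offset = src[pos] | (src[pos + 1] << 8); pos += 2    # little-endian distance
        match_len, pos = read_length(src, pos, token & 0x0F)
        match_len += 4                                       # minimum match length is 4
        for _ in range(match_len):
            out.append(out[-offset])                         # byte-by-byte copy handles overlaps
        return pos

Decoding a whole block is just calling this in a loop until the input is exhausted.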

Encode an array of integers to a short string

Problem:
I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400), containing mostly 0's, some 1's, a few 2's. Although unlikely, it is also possible to have bigger numbers.
For example, here is an array of 360 elements:
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.
Goal:
The goal is to compress an array like this into the shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried:
I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right, and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus have a two), 1 zero, one".
To eliminate the need for delimiters (commas) and thus save more space, I tried to use a single alphanumeric character to denote delta values of 0 to 34 (using 0 to y), while leaving the letter z to mean "35 plus the next character". I think this is called "variable bit" or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5".
That's as far as I got... the resultant string is still very long (it would be about 20 characters long in the above example). I would ideally want something like, 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
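Concretely, the scheme I have so far looks roughly like this sketch:

    # Rough sketch of my current scheme: encode the gap (number of zeroes)
    # before each non-zero value; a gap of 0 means "add one to the previous
    # value"; 'z' is an escape meaning "add 35 and keep reading".

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxy"   # 35 symbols for 0..34

    def encode(values):
        symbols = []
        gap = 0
        for v in values:
            if v == 0:
                gap += 1
                continue
            # one gap entry for the first unit of v, then v - 1 zero gaps,
            # each of which bumps the previous value by one on decode
            for part in [gap] + [0] * (v - 1):
                while part >= 35:
                    symbols.append("z")
                    part -= 35
                symbols.append(ALPHABET[part])
            gap = 0
        return "".join(symbols)

    print(encode([0, 0, 1, 0, 0, 0, 2, 0, 1]))   # "2301"
    print(encode([0] * 40 + [1]))                # "z5"

(Trailing zeroes are simply dropped here, so the decoder would also need to know the array length.)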
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values. (A zero-run length of 0 indicates successive non-zero values.)
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best.
Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion, rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...
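To make the first two steps concrete, here is a small sketch using Elias gamma as the universal code. The choice of gamma codes is just an example; depending on your distribution another universal code, or Huffman/arithmetic coding, will do better:

    # Sketch of RLE + a universal code: alternate "length of zero run" and
    # "non-zero value", each written with an Elias gamma code. Gamma codes
    # start at 1, so run lengths (which can be 0) are shifted up by one.

    def elias_gamma(n: int) -> str:
        # n >= 1: (length - 1) zero bits, then the value in binary
        bits = bin(n)[2:]
        return "0" * (len(bits) - 1) + bits

    def encode(values):
        out = []
        run = 0
        for v in values:
            if v == 0:
                run += 1
            else:
                out.append(elias_gamma(run + 1))   # zero-run length (may be 0)
                out.append(elias_gamma(v))         # the non-zero value itself (>= 1)
                run = 0
        out.append(elias_gamma(run + 1))           # trailing run of zeroes
        return "".join(out)

    bits = encode([0, 0, 1, 0, 0, 0, 2, 0, 1])
    print(bits, len(bits), "bits")   # the bit string, ready to be packed and base64'd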
In your data you have:
14 1s (3.89% of data)
4 2s (1.11%)
1 each of 3, 4 and 5 (0.28% each)
339 0s (94.17%)
Assuming that your numbers are independent of each other and you do not have any other information about them, the total entropy of your data is 0.407 bits per number, that is 146.42 bits overall (about 18.3 bytes). So it is impossible to encode it in 8 bytes, let alone 8 characters.
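For anyone who wants to verify those figures, a quick computation of the per-symbol entropy from the counts above:

    # Shannon entropy of the example array, assuming the values are
    # independent and distributed according to the observed counts.
    import math

    counts = {0: 339, 1: 14, 2: 4, 3: 1, 4: 1, 5: 1}
    total = sum(counts.values())                   # 360 numbers

    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    print(round(entropy, 3))            # ~0.407 bits per number
    print(round(entropy * total, 1))    # ~146.4 bits, i.e. about 18.3 bytes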
