Improve this compression algorithm?

I am compressing 8-bit bytes, and the algorithm works only if the number of unique byte values found in the data is 128 or less.
I take all the unique bytes, and at the start I store a table containing each unique byte once. If there are 120 of them, I store 120 bytes.
Then, instead of storing each item in 8 bits, I store each item in 7 bits, one after another. Those 7 bits contain the item's position in the table.
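Here is a minimal sketch of that scheme (in Python; a real format would also need to record the table length and the item count so the decoder can parse it):

    def compress_7bit(data: bytes) -> bytes:
        # Build the table of unique byte values (at most 128 of them).
        table = sorted(set(data))
        assert len(table) <= 128, "scheme needs 128 or fewer unique bytes"
        index = {b: i for i, b in enumerate(table)}
        out = bytearray(table)           # the table, stored up front
        acc, nbits = 0, 0
        for b in data:
            acc = (acc << 7) | index[b]  # append one 7-bit position code
            nbits += 7
            while nbits >= 8:            # flush whole bytes as they fill
                nbits -= 8
                out.append((acc >> nbits) & 0xFF)
        if nbits:                        # pad the final partial byte
            out.append((acc << (8 - nbits)) & 0xFF)
        return bytes(out)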
Question: how can I avoid storing those 120 bytes at the start, by storing the possible tables in my code?

What you are trying to do is a special case of Huffman coding where you consider only which bytes occur, not their frequency, hence giving each byte a fixed-length code. But you can do better: use the byte frequencies to assign variable-length codes with Huffman coding and get more compression.
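For reference, a compact sketch of computing Huffman code lengths from byte frequencies (a real coder would also emit a canonical code table so the decoder can rebuild the codes):

    import heapq
    from collections import Counter

    def huffman_code_lengths(data: bytes) -> dict[int, int]:
        # Start with one leaf per distinct byte value, weighted by frequency.
        heap = [(n, i, [v]) for i, (v, n) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        lengths = {v: 0 for _, _, leaves in heap for v in leaves}
        tie = len(heap)  # tie-breaker so equal counts never compare the lists
        while len(heap) > 1:
            n1, _, l1 = heapq.heappop(heap)
            n2, _, l2 = heapq.heappop(heap)
            for v in l1 + l2:
                lengths[v] += 1  # each merge adds one bit to these codes
            heapq.heappush(heap, (n1 + n2, tie, l1 + l2))
            tie += 1
        return lengths  # frequent bytes end up with codes shorter than 7 bits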
But if you intend to keep the same algorithm, then consider this:
Don't store the 120 bytes; store 256 bits (32 bytes) where bit i indicates whether byte value i is present. That gives you all the information you need: the decoder reads the bitmap to recover which values occur in the file and reconstructs the same mapping table.
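A sketch of that bitmap round trip (function names are illustrative):

    def make_bitmap(data: bytes) -> bytes:
        # 256-bit presence map: bit i is set iff byte value i occurs.
        bits = 0
        for b in set(data):
            bits |= 1 << b
        return bits.to_bytes(32, "big")

    def table_from_bitmap(bitmap: bytes) -> list[int]:
        # Rebuild the same sorted table of present values on the decode side.
        bits = int.from_bytes(bitmap, "big")
        return [v for v in range(256) if bits & (1 << v)]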

I don't know your exact algorithm, but most likely the answer is that you cannot: the algorithm has to store those values so it can use shorter codes for all the other bytes in the data.
There is one way in which you could avoid writing those 120 bytes: when you know their contents beforehand. For example, when you know that whatever you are going to send will only ever contain those bytes. Then you can make the table known on both sides and skip storing it entirely.

Related

Procedural generation and permutations question

I probably won't pursue this, but I had this idea of generating a procedural universe in the most memory-efficient way possible.
Like in the game Elite, you could use a random number generator based on a seed, so each star system can be represented by a single seed number instead of lists of stats and other info. But if each star system is a 64-bit number, then for the Milky Way's 100 billion stars that is 800 gigabytes of memory. And if you use only 8 bits per star system, you'll only have 256 unique star systems in your game. So my other idea was to have each star system represented by 8 bits, but simply grab the next 7 star systems' bytes in memory and use that combination to form a 64-bit seed. Obviously there would be 7 extra bytes at the end to account for the last star system in memory.
So is there any way to organize the values in these bytes such that every 8-byte window over the entire file covers all 64-bit values (hypothetically) with no repeats? Or is that impossible and I should just accept repeats? Or could I use the address of the byte itself as part of the seed? And how would that work in C? If I have a file of 100 billion bytes, does that actually take up exactly 100 billion bytes in memory, or more, and how are the addresses for those bytes stored? And is accessing large files like that (100 GB+) in a server-client relationship practical? Thank you.
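One hedged sketch of the overlapping-window idea (in Python rather than C, for brevity): mixing the byte's own offset into the seed is one way to make neighbouring windows, which share 7 of their 8 bytes, still diverge. The multiplier is the common golden-ratio mixing constant, not anything from the question:

    import struct

    def system_seed(galaxy: bytes, i: int) -> int:
        # The 8-byte window starting at star system i's byte...
        window = struct.unpack_from("<Q", galaxy, i)[0]
        # ...XORed with a scrambled copy of the offset itself.
        return window ^ ((i * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF)

    galaxy = bytes(range(256)) * 4 + bytes(7)  # toy data plus the 7 tail bytes
    print(hex(system_seed(galaxy, 0)), hex(system_seed(galaxy, 1)))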

Cache calculating block offset and index

I've read several topics about this theme but I could not get the answer. So my question is:
1) How is the block offset calculated?
I want to know not the formula but the concept behind it. As I understand it, it is the number of places within a block where an address can be stored. For example, if there is a block with 8 bytes of storage that has to store 2-byte addresses, is its block offset 2 bits? (So there are 4 places to store an address.)
The block offset is simply calculated as log2(cache_line_size).
The reason is that all systems I know of are byte-addressable, so you need enough bits to index any byte in the block. Although most systems have a word size larger than a single byte, they still support offsets at single-byte granularity, even if that is not the common case.
So for the example you mentioned of an 8-byte block size with a 2-byte word, you would still need 3 bits in order to allow accessing any byte. If you had a system that was not byte-addressable, then you could use just 2 bits for the block offset. But in practice all systems that I know of are byte-addressable.
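To make the arithmetic concrete (a sketch, assuming a power-of-two line size):

    def block_offset_bits(line_size_bytes: int) -> int:
        # Offset bits = log2(cache line size in bytes).
        assert line_size_bytes & (line_size_bytes - 1) == 0, "power of two expected"
        return line_size_bytes.bit_length() - 1

    print(block_offset_bits(8))   # 3 bits to address any byte in an 8-byte block
    print(block_offset_bits(64))  # 6 bits for a typical 64-byte line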

Can the AES algorithm work the same over plain text and over byte sequences?

It's clear how the algorithm handles plain text: the characters' byte values go into the state matrix.
But what about AES encryption of binary files?
How does the algorithm manage files larger than 16 bytes, given that the state is standardized to be 4x4 bytes?
The AES primitive is the basis of constructions that allow encryption/decryption of arbitrary binary streams.
AES-128 takes a 128-bit key and a 128-bit data block and "encrypts" or "decrypts" this block. 128 bit is 16 bytes. Those 16 bytes can be text (e.g. ASCII, one character per byte) or binary data.
A naive implementation would just break a file longer than 16 bytes into groups of 16 bytes and encrypt each of these with the same key. You might also need to "pad" the file to make it a multiple of 16 bytes. The problem with that is that it exposes information about the file, because every time you encrypt the same block with the same key you get the same ciphertext.
There are different ways to build on the AES function to encrypt/decrypt more than 16 bytes securely. For example, you can use CBC mode or counter (CTR) mode.
Counter mode is a little easier to explain, so let's look at that. If AES_e(k, b) encrypts block b with key k, we do not want to re-use the same key to encrypt the same block more than once. So the construction we'll use is something like this:
Calculate AES_e(k, 0), AES_e(k, 1), ..., AES_e(k, n)
Now we can take arbitrary input, break it into 16-byte blocks, and XOR with this sequence. Since the attacker does not know the key, they cannot regenerate this sequence and decode our (longer) message. The XOR is applied bit by bit between the blocks generated above and the cleartext. The receiving side can generate the same sequence, XOR it with the ciphertext, and retrieve the cleartext.
In an application you also want to combine this with some sort of authentication mechanism, so you use something like AES-GCM or AES-CCM.
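A toy sketch of that construction, assuming the third-party cryptography package for the raw AES primitive. Unlike the simplified description above, it prepends a per-message nonce to the counter, which real CTR mode requires; this is for illustration only, not production use:

    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def aes_encrypt_block(key: bytes, block: bytes) -> bytes:
        # ECB on a single 16-byte block is just the raw AES primitive;
        # never use ECB as a mode for real data.
        return Cipher(algorithms.AES(key), modes.ECB()).encryptor().update(block)

    def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
        # Encrypts or decrypts: XOR the data with the keystream
        # AES_e(k, nonce||0), AES_e(k, nonce||1), ...
        out = bytearray()
        for i in range(0, len(data), 16):
            counter = nonce + (i // 16).to_bytes(8, "big")  # 8-byte nonce + 8-byte counter
            keystream = aes_encrypt_block(key, counter)
            out.extend(c ^ k for c, k in zip(data[i:i + 16], keystream))
        return bytes(out)  # zip truncates the last block, so no padding is needed

    key, nonce = bytes(16), bytes(8)           # demo values only
    ct = ctr_xor(key, nonce, b"attack at dawn, 17B")
    print(ctr_xor(key, nonce, ct))             # same call decrypts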
Imagine you have a 17-byte plain text.
The state matrix will be filled with the first 16 bytes, and that block will be encrypted.
The next block will be the 1 byte that is left, and the state matrix will be padded in order to fill the 16 bytes AES needs.
It works just as well with binary files, because AES always operates on byte units. It does not matter whether the data is an ASCII chunk or anything else; everything in a computer is binary in the end. As long as the data is a stream of bytes, it will work fine.
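The answer above doesn't name a padding scheme; PKCS#7 is the usual choice, sketched here for the 17-byte example:

    def pkcs7_pad(data: bytes, block: int = 16) -> bytes:
        # Pad up to a block multiple; every pad byte holds the pad length.
        n = block - (len(data) % block)
        return data + bytes([n]) * n

    padded = pkcs7_pad(b"A" * 17)
    print(len(padded), padded[-1])  # 32 bytes total; 15 pad bytes of value 15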

How can I approximate the size of a data structure in Scala?

I have a query that returns around 6 million rows, which is too many to process all at once in memory.
Each query is returning a Tuple3[String, Int, java.sql.Timestamp]. I know the string is never more than about 20 characters, UTF8.
How can I work out the maximum size of one of these tuples and, more generally, how can I approximate the size of a Scala data structure like this?
I've got 6 GB on the machine I'm using. However, the data is being read from the database using scala-query into Scala Lists.
Scala objects follow approximately the same rules as Java objects, so any information on those is accurate. Here is one source, which seems at least mostly right for 32-bit JVMs. (64-bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes of extra overhead plus 4 bytes per pointer, though there may be less if the JVM is using compressed pointers, which I think it does by default now.)
I'll assume a 64-bit machine without compressed pointers (worst case); then a Tuple3 has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded to the nearest 8, or 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples they take even more space than when you use wrapped versions.) String is 32 bytes, IIRC, plus the array for the data, which is 16 plus 2 per character. java.sql.Timestamp needs to store a couple of Longs (I think), so that's 32 bytes. All told, it's on the order of 120 bytes plus two per character, which at ~20 characters is ~160 bytes.
Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above has been corrected using this data so it matches; I had several small errors before).
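Spelling out that estimate (the per-object figures are this answer's assumptions for a 64-bit JVM without compressed pointers, not measured constants):

    def tuple3_row_estimate(chars: int) -> int:
        tuple3 = 32                   # header + 2 refs + int, rounded to 8
        int_stub = 8                  # extra object for the non-specialized Int
        string = 32                   # the String object itself
        char_array = 16 + 2 * chars   # array header + UTF-16 characters
        timestamp = 32                # a couple of longs plus header
        return tuple3 + int_stub + string + char_array + timestamp

    print(tuple3_row_estimate(20))    # 160, matching the measured figure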
How much memory have you got at your disposal? 6 million instances of a triple is really not very much!
Each reference has an overhead of either 4 or 8 bytes, depending on whether you are running 32- or 64-bit (without compressed "oops", though compressed oops are the default in JDK 7 for heaps under 32 GB).
So your triple has 3 references (there may be extra ones due to specialization, so you might get 4 refs), your Timestamp is a wrapper (reference) around a long (8 bytes), and your Int will be specialized (i.e. an underlying int), so that makes another 4 bytes. The String is 20 x 2 bytes. So you basically have a worst case of well under 100 bytes per row; that's 10 rows per KB, or 10,000 rows per MB. So you can comfortably process your 6 million rows in under 1 GB of heap.
Frankly, I think I've made a mistake here, because we comfortably process several million rows of about twenty fields (including decimals, Strings etc.) daily in this amount of space.

How to calculate Storage when ftping to MainFrame

How can I calculate storage when FTPing to MainFrame? I was told LRECL will always remain '80'. Not sure how I can calculate PRI and SEC dynamically based on the file size...
QUOTE SITE LRECL=80 RECFM=FB CY PRI=100 SEC=100
If the site has SMS, you shouldn't need to; but if you do, the number of tracks is the size of the file in bytes divided by 56,664, and the number of cylinders is the size of the file in bytes divided by 849,960. In either case, round up.
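That arithmetic as a sketch (3390 geometry, always rounding up):

    import math

    TRACK_BYTES = 56_664   # 3390 bytes per track
    CYL_BYTES = 849_960    # 15 tracks per cylinder

    def tracks_needed(file_bytes: int) -> int:
        return math.ceil(file_bytes / TRACK_BYTES)

    def cylinders_needed(file_bytes: int) -> int:
        return math.ceil(file_bytes / CYL_BYTES)

    print(tracks_needed(1_000_000), cylinders_needed(1_000_000))  # 18, 2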
Unfortunately IBM's FTP server does not support the newer space allocation specifications in number of records (the JCL parameter AVGREC=U/M/K plus the record length as the first specification in the SPACE parameter).
However, there is an alternative, and that is to fall back on one of the lesser-used SPACE parameters - the blocksize specification. I will assume 3390 disk types for simplicity, and standard data sets.
For fixed-length records, you want to calculate the largest block size that will fit in half a track (27,994 bytes), because z/OS only supports block sizes up to 32,760. Since you are dealing with 80-byte records, that number is 27,920 (349 records per block). Divide your file size by that number and that will give you the number of blocks. Then in a SITE server command, specify
SITE BLKSIZE=27920 LRECL=80 RECFM=FB BLOCKS=27920 PRI=calculated# SEC=a_little_extra
This is equivalent to SPACE=(27920,(calculated#,a_little_extra)).
z/OS space allocation calculates the number of tracks required and rounds up to the nearest track boundary.
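And the block calculation for FB/80, using the figures above (a sketch; SEC is whatever margin you want on top):

    import math

    HALF_TRACK = 27_994                        # half-track capacity given above
    BLKSIZE = (HALF_TRACK // 80) * 80          # 27920, largest multiple of 80

    def blocks_needed(file_bytes: int) -> int:
        return math.ceil(file_bytes / BLKSIZE)

    print(BLKSIZE, blocks_needed(10_000_000))  # 27920, 359 -> the PRI value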
For variable-length records, if your reading application can handle it, always use BLKSIZE=27994. The reason for the warning about the reading application is that even today there are applications from ISVs that hard-code strange maximum variable-length block sizes, such as 12K.
If you are dealing with PDSEs, always use BLKSIZE=32760 for variable-length records and the closest block size below 32,760 for fixed-length (32,720 for FB/80), but calculate requirements based on BLKSIZE=4096. PDSEs are strange in their underlying layout: the physical records are 4096 bytes, because underneath there is linear data set VSAM code handling the physical I/O.
