Is it possible to detect changes in the base64 encoding of an object to detect the degree of changes in the object.
Suppose I send a document attachment to several users and each makes changes to it and emails back to me, can I use the string distance between original base64 and the received base64s to detect which version has the most changes. Would that be a valid metric?
If not, would there be any other metrics to quantify the deltas?
That would depend entirely on the type of the document you had encoded. If it was a text file, then sure, the base64 encoded difference are probably on a par with the actual changes. However, you may have a format of a file where changes to the contents effectively produce a completely different binary file. An example of this would be a ZIP file.
you should do the same that diff does. Then for example do the metrics on diff fiel size.
In theory, yes, if do a smart diff (detecting inserts, deletions, and modifications).
In practice, no, unless the documents are absolutely plain text. Binary formats can't be meaningfully diff'd.
Base64 packs groups of 3x8 bit values into 4x6. If you change one 8 bit value by one bit, then you'll impact only one of the 6 bit values. If you change by two bits, then you have about a 5/12 chance of hitting one of the other 6 bit values. So if you're counting bits, it is entirely equivalent; otherwise, you will introduce noise depending on the metric you use.
Related
My question is: Are there any standard compression formats which can ensure that a certain delimiter sequence does not occur in the compressed data stream?
We want to design a binary file format, containing chunks of sequential data (3D coordinates + other data, not really important for the question). Each chunk should be compressed using a standard compression format, like GZIP, ZIP, ...
So, the file structure will be like:
FileHeader
ChunkDelimiter Chunk1_Header compress(Chunk1_Data)
ChunkDelimiter Chunk2_Header compress(Chunk2_Data)
...
Use case is the following: The files should be read in splits in Hadoop, so we want to be able to start at an arbitrary byte position in the file, and find the start of the next chunk by looking for the delimiter sequence. -> The delimiter sequence should not occur within the chunks.
I know that we could post-process the compressed data, "escaping" the delimiter sequence in case that it occurs in the compressed output. But we'd better avoid this, since the "reverse escaping" would be required in the decoder, adding complexity.
Some more facts why we chose this file format:
Should be easily readable by third parties -> standard compression algorithm preferred.
Large files; streaming operation: amount of data and number of chunks is not known when starting to write the file -> Difficult to write start-of-chunk byte positions in the header.
I won't answer your question with a compression scheme name but will give you a hint of how other solved the same issue.
Let's give a look at Avro. Basically, they have similar requirements: files must be splitable and each data block can be compressed (you can even choose your compression scheme).
From the Avro Specification we learn that splittability is achieved with the help of a synchronization marker ("Objects are stored in blocks that may be compressed. Syncronization markers are used between blocks to permit efficient splitting of files for MapReduce processing."). We also discover that the synchronization marker is a 16-byte randomly-generated value ("The 16-byte, randomly-generated sync marker for this file.").
How does it solve your issue ? Well, since Martin Kleppmann provided a great answer to this question a few years ago I will just copy paste his message
On 23 January 2013 21:09, Josh Spiegel wrote:
As I understand it, Avro container files contain synchronization markers
every so often to support splitting the file. See:
https://cwiki.apache.org/AVRO/faq.html#FAQ-Whatisthepurposeofthesyncmarkerintheobjectfileformat%3F
(1) Why isn't the synchronization marker the same for every container file?
(i.e. what is the point of generating it randomly every time)
(2) Is it possible, at least in theory, for naturally occurring data to
contain bytes that match the sync marker? If so, would this break
synchronization?
Thanks,
Josh
Because if it was predictable, it would inevitably appear in the actual data sometimes (e.g. imagine the Avro documentation, stating
what the sync marker is, is downloaded by a web crawler and stored in
an Avro data file; then the sync marker will appear in the actual
data). Data may come from malicious sources; making the marker random
makes it unfeasible to exploit.
Possibly, but extremely unlikely. The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data
is about 10^-23. It's more likely that your data center is wiped out
by a meteorite
(http://preshing.com/20110504/hash-collision-probabilities).
If the sync marker appears in your data, it only breaks reading the file if you happen to also seek to that place in the file. If you just
read over it sequentially, nothing happens.
Martin
Link to the Avro mailing list archive
If it works for Avro, it will work for you too.
No. I know of no standard compression format that does not allow any sequence of bits to occur somewhere within. To do otherwise would (slightly) degrade compression, going against the original purpose of a compression format.
The solutions are a) post-process the sequence to use a specified break pattern and insert escapes if the break pattern accidentally appears in the compressed data -- this is guaranteed to work, but you don't like this solution, or b) trust that the universe is not conspiring against you and use a long break pattern whose length assures that it is incredibly unlikely to appear accidentally in all the sequences this is applied to anytime from now until the heat death of the universe.
For b) you can protect somewhat against the universe conspiring against you by selecting a random pattern for each file, and providing the random pattern at the start of the file. For the truly paranoid, you could go even further and generate a new random pattern for each successive break, from the previous pattern.
Note that the universe can conspire against you for a fixed pattern. If you make one of these compressed files with a fixed break pattern, and then you include that file in another compressed archive also using that break pattern, that archive will likely not be able to compress this already compressed file and will simply store it, leaving exposed the same fixed break pattern as is being used by the archive.
Another protection for b) would be to detect the decompression failure of an incorrect break by seeing that the piece before the break does not terminate, and handle that special case by putting that piece and the following piece back together and trying the decompression again. You would also very likely detect this on the following piece as well, with that decompression failing.
I have 2 large text files (csv, to be precise). Both have the exact same content except that the rows in one file are in one order and the rows in the other file are in a different order.
When I compress these 2 files (programmatically, using DotNetZip) I notice that always one of the files is considerably bigger -for example, one file is ~7 MB bigger compared to the other.-
My questions are:
How does the order of data in a text file affect compression and what measures can one take in order to guarantee the best compression ratio? - I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression but I am not familiar with the internals of the different compression algorithms and I'd appreciate a quick explanation on this subject.
Which algorithm handles this sort of scenario better in the sense that would achieve the best average compression regardless of the order of the data?
"How" has already been answered. To answer your "which" question:
The larger the window for matching, the less sensitive the algorithm will be to the order. However all compression algorithms will be sensitive to some degree.
gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
In some sense, it is the measure of the entropy of the file defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but leave the same count of each letter), then it has the exact same data (although a different "meaning"). It is the same data in a different order, and it will not compress as well.
In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally, output a reference to that earlier chunk instead.
It surely helps when similar chunks are close together. The algorithm will only keep a limited amount of look-back data to keep compression speed reasonable. So even if a chunk of data is identical to some other chunk, if that old chunk is too old, it could already be flushed away.
Sure it does. If the input pattern is fixed, there is a 100% chance to predict the character at each position. Given that two parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you'd still need to encode the length, but that's sort of beside the point). If the other party doesn't know the pattern, all you'd need to do is to encode it. Total compression is possible because you can encode an unlimited stream with a finite amount of data.
At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
Finite strings are a little trickier. Since finite strings necessarily contain a fixed number of instances of each character, the probabilities must change once you begin reading off initial tokens. One can read some sort of order into any finite string.
Not sure if this answers your question, but it addresses things a bit more theoretically.
I'm doing calculations and the resultant text file right now has 288012413 lines, with 4 columns. Sample column:
288012413; 4855 18668 5.5677643628300215
the file is nearly 12 GB's.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?
Go ahead and use MySQL database
MSSQL express has a limit of 4GB
MS Access has a limit of 4 GB
So these options are out. I think by using a simple database like mysql or sSQLLite without indexing will be your best bet. It will probably be faster accessing the data using a database anyway and on top of that the file size may be smaller.
Well,
The first column looks suspiciously like a line number - if this is the case then you can probably just get rid of it saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.
I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you can approximatley halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.
You might also want to look into compressing the file if it is just used to storing the results.
Reducing the 4th field to 3 decimal places should reduce the file to around 8GB.
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.
If you are going to use the result as a lookup table, why use ASCII for numeric data? why not define a struct like so:
struct x {
long lineno;
short thing1;
short thing2;
double value;
}
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
well, if the files are that big, and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might possibly do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7ZIP or something like that to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double to quadruple the size of the file vs. ASCII.
Like AShelly, but smaller.
Assuming line #'s are continuous...
struct x {
short thing1;
short thing2;
short value; // you said only 3dp. so store as fixed point n*1000. you get 2 digits left of dp
}
save in binary file.
lseek() read() and write() are your friends.
file will be large(ish) at around 1.7Gb.
The most obvious answer is just "split the data". Put them to different files, eg. 1 mln lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've got a number of answers regarding reducing data size.
Next, why keep the data as text if you have a fixed-sized structure? Store the numbers as binaries - this will reduce the space even more (text format is very redundant).
Finally, DBMS can be your best friend. NoSQL DBMS should work well, though I am not an expert in this area and I dont know which one will hold a trillion of records.
If I were you, I would go with the fixed-sized binary format, where each record occupies the fixed (16-20?) bytes of space. Then even if I keep the data in one file, I can easily determine at which position I need to start reading the file. If you need to do lookup (say by column 1) and the data is not re-generated all the time, then it could be possible to do one-time sorting by lookup key after generation -- this would be slow, but as a one-time procedure it would be acceptable.
Long Ascii String Text may or may not be crushed and compressed into hash kind of ascii "checksum" by using sophisticated mathematical formula/algorithm. Just like air which can be compressed.
To compress megabytes of ascii text into a 128 or so bytes, by shuffling, then mixing new "patterns" of single "bytes" turn by turn from the first to the last. When we are decompressing it, the last character is extracted first, then we just go on decompression using the formula and the sequential keys from the last to the first. The sequential keys and the last and the first bytes must be exactly known, including the fully updated final compiled string, and the total number of bytes which were compressed.
This is the terra compression I was thinking about. Is this possible? Can you explain examples. I am working on this theory and it is my own thought.
In general? Absolutely not.
For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but not in the order of several megabytes into 128 bytes.
Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeon-hole principle...
If you have n possible input strings and m possible compressed strings and m is less than n then two strings must map to the same compressed string. This is called the pigeonhole principle and is the fundemental reason why there is a limit on how much you can compress data.
What you are describing is more like a hash function. Many hash functions are designed so that given a hash of a string it is extremely unlikely that you can find another string that gives the same hash. But there is no way that given a hash you can discover the original string. Even if you are able to reverse the hashing operation to produce a valid input that gives that hash, there are infinitely many other inputs that would give the same hash. You wouldn't know which of them is the "correct" one.
Information theory is the scientific field which addresses questions of this kind. It also provides you the possibility to calculate the minimum amount of bits needed to store a compressed message (with lossless compression). This lower bound is known as the Entropy of the message.
Calculation of the Entropy of a piece of text is possible using a Markov model. Such a model uses information how likely a certain sequence of characters of the alphabet is.
The air analogy is very wrong.
When you compress air you make the molecules come closer to each other, each molecule is given less space.
When you compress data you can not make the bit smaller (unless you put your harddrive in a hydraulic press). The closest you can get of actually making bits smaller is increasing the bandwidth of a network, but that is not compression.
Compression is about finding a reversible formula for calculating data. The "rules" about data compression are like
The algorithm (including any standard start dictionaries) is shared before hand and not included in the compressed data.
All startup parameters must be included in the compressed data, including:
Choice of algorithmic variant
Choice of dictionaries
All compressed data
The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).
To get a feeling of how compression works you may study some examples, like Run length encoding and Lempel Ziv Welch.
You may be thinking of fractal compression which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.
This allows for high compression but is lossy (output is close to input but not exactly the same) and compression can be very slow. Even so, ratios of 170:1 are about the highest achieved at the moment.
This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back in the days when USENET was still interesting.
Seriously, anyone who claims to have a magical compression algorithm that reduces any text megabyte file to a few hundred bytes is either:
a scammer or click-baiter,
someone who doesn't understand basic information theory, or
both.
You can compress test to a certain degree because it doesn't use all the available bits (i.e. a-z and A-Z make up 52 out of 256 values). Repeating patterns allow some intelligent storage (zip).
There is no way to store arbitrary large chunks of text in any fixed length number of bytes.
You can compress air, but you won't remove it's molecules! It's mass keeps the same.
How would you go about generating the unique video URL's that YouTube uses?
Example:
http://www.youtube.com/watch?v=CvUN8qg9lsk
YouTube uses Base64 encoding to generate IDs for each video.Characters involved in generating Ids consists of
(A-Z) + (a-z) + (0-9) + (-) + (_). (64 Characters).
Using Base64 encoding and only up to 11 characters they can generate 73+ Quintilian unique IDs.How much large pool of ID is that?
Well, it's enough for everyone on earth to produce video every single minute for 18000 years.
And they have achieved such huge number by only using 11 characters (64*64*64*64*64*64*64*64*64*64*64) if they need more IDs they will just have to add 1 more character to their IDs.
So when video is uploaded on YouTube they basically randomly select from 73+ Quintilian possibility and see if its already taken or not.if not use it otherwise look for another one.
Refer to this video for detailed explanation.
Using some non-trivial hashing function. The probability of collision is very low, depending on the function, the parameters and the input domain. Keep in mind that cryptographic hashes were specifically designed to have very low collision rates for non-random input (i.e. completely different hashes for two close-but-unequal inputs).
This post by Jeff Attwood is a nice overview of the topic.
And here is an online hash calculator you can play with.
There is no need to use a hash. It is probably just a quasi-random 64 bit value passed through base64 or some equivalent.
By quasi-random, I mean it is just a one-to-one mapping with the counting integers, just shuffled.
For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, then base64 the result. If you did not want people to be able to guess, you might choose a more complex mapping or just pick a random number that is not in the database yet.
Normal base64 would add an equals at the end, but in this case it is implied because the size is known. The character mapping could easily be something besides the standard.
Eli's link to Jeff's article is, in my opinion, irrelevant. URL shortening is not the same thing as presenting an ID to the world. Instead, a nicer way would be to convert your existing integer ID to a different radix.
An example in PHP:
$id = 9999;
//$url_id = base_convert($id, 10, 26+26+10); // PHP doesn't like this
$url_id = base_convert($id, 10, 26+10); // Works, but only digits + lowercase
Sadly, PHP only supports up to base 36 (digits + alphabet). Base 62 would support alphabet in both upper-case and lower-case.
People are talking about these other systems:
Random number/letters - Why? If you want people to not see the next video (id+1), then just make it private. On a website like youtube, where it actively shows any video it has, why bother with random ids?
Hashing an ID - This design concept really stinks. Think about it; so you have an ID guaranteed by your DBM software to be unique, and you hash it (introducing a collision factor)? Give me one reason why to even consider this idea.
Using the ID in URL - To be honest, I don't see any problems with this either, though it will grow to be large when in fact you can express the same number with fewer letters (hence my solution).
Using Base64 - Base64 expects bytes of data, literally anything from nulls to spaces. Why use this function when your data consists of a number (ie, a mix of 10 different characters, instead of 256)?
You can use any library or some languages like python provides it in standard library.
Example:
import secrets
id_length = 12
random_video_id = secrets.token_urlsafe(id_length)
You could generate a GUID and have that as the ID for the video.
Guids are very unlikely to collide.
Your best bet is probably to simply generate random strings, and keep track (in a DB for example) of which strings you've already used so you don't duplicate. This is very easy to implement and it cannot fail if properly implemented (no duplicates, etc).
I don't think that the URL v parameter has anything to do with the content (video properties, title, description etc).
It's a randomly generated string of fixed length and contains a very specific set of characters. No duplicates are allowed.
I suggest using a perfect hash function:
Perfect Hash Function for Human Readable Order Codes
As the accepted answer indicates, take a number, then apply a sequence of "bijective" (or reversible) operations on the number to get a hashed number.
The input numbers should be in sequence: 0, 1, 2, 3, and so on.
Typically you're hiding a numeric identifier in the form of something that doesn't look numeric. One simple method is something like base-36 encoding the number. You should be able to pull that off with one or another variant of itoa() in the language of your choice.
Just pick random values until you have one never seen before.
Randomly picking and exhausting all values form a set runs in expected time O(nlogn): What is O value for naive random selection from finite set?
In your case you wouldn't exhaust the set, so you should get constant time picks. Just use a fast data structure to do the duplication lookups.