If I were to save the array of bytes from a recorded wave file of someone saying their name, for example, and then compare it to the array of bytes in a new wav file of the same person saying their name, would I be able to tell it was the same person saying their name?
No, there are so many factors involved that simply comparing the files will not suffice.
There's background noise and static, the sample rate could differ from the first recording, they could have their mouth closer to the microphone than last time... and that's before getting into the technical stuff, like your definition of 'compare'. Plus, how many bytes you would have to 'compare' would be ridiculous - each byte is so small in relation to the whole thing that individually they mean basically nothing.
I have a text file with numbers, letters, special characters, and symbols. There are some lines where I want to insert an RLE Unicode control character (Right-to-Left Embedding) at the beginning/middle/end of the line.
First I needed to find out how to catch and represent RLE. I thought of streams. I found out that in UTF-8, RLE takes up 3 bytes: -30 -128 -85 (the signed values of 0xE2 0x80 0xAB).
InputStream input = new BufferedInputStream(new FileInputStream(file_name_here_with_path));
byte[] buffer = new byte[1024];
int bytesRead = input.read(buffer); // fills the buffer and returns how many bytes were actually read
If what the app read included an RLE character, then printing the array shows those 3 signed numbers.
The next problem, the CURRENT problem, is to find a suitable container for this information.
input.read(): this returns the next byte the app read, as an int (or -1 at the end of the stream). I can save it in a byte array, but I can't even create the array unless I knew its size. No, the file size isn't the size of the array, because I need to insert those 3 bytes into the array more than once and at different locations, depending on some conditions I set.
input.read(byte[] array): this returns an int representing the number of bytes that were read; the array parameter holds everything that was read. Same problem as above: an array of fixed size.
input.read(byte[] array, offset, length): same as the previous, except I can control where in the array the bytes land and how many bytes to read, instead of just filling it from the start until the buffer is full or the end of the stream is reached.
Use a BufferedReader: same problem. I read a line, save it in a String, turn the String into a byte array (stringname.getBytes()). Fixed size; can't insert.
The solution to all 4 methods is to create a new byte array and move the bytes from the old array to the new one while inserting the control character. THE PROBLEM, maybe, is that according to a member here, Javier, the read method is slow. I haven't received confirmation yet, as I wasn't sure whether he meant one specific read or all 3. Also, even if I knew how many extra slots I needed in the new array, would it be good practice to create a new array of such a size?
This reminds me, my txt file is 200KB tops. It's not much but I'm looking for the RIGHT practice. The generic solution!
Anyway, I looked for alternatives. I recall using Vectors. Yes, they're obsolete; I don't know why, and since I'm not creating a big app or an app for a client, I can use Vectors :P HOWEVER, I thought I should keep reading, and then I came across ArrayList and read a post here about how it performs better.
So ... what will it be? The possibly slow-performing read methods, or BufferedReader, or the obsolete Vector, or the fast-performing ArrayList? :P
Vector was replaced by the faster ArrayList, with the caveats that ArrayList is not thread-safe (though you can wrap it with Collections.synchronizedList to get around this) and that it has a different growth policy (every time a resize is needed, ArrayList grows by 50%, while Vector basically doubles). Apart from this, they are pretty much the same.
Given your options, I would go with an ArrayList that holds Objects (ArrayList<Object>), as in this manner you can retain the original element type. Although, in such a case, you need to keep track of what type of element is in each index position (if that is necessary).
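To make that concrete, here is a minimal sketch of one way to do the insertion without ever fixing an array size up front. It uses a ByteArrayOutputStream as the growable container (an ArrayList of boxed bytes would work the same way, just with more overhead); the file names and the shouldMark() condition are placeholders for whatever your actual paths and rules are.

import java.io.*;

public class InsertRle {
    // UTF-8 encoding of U+202B (RIGHT-TO-LEFT EMBEDDING): 0xE2 0x80 0xAB,
    // i.e. the signed bytes -30 -128 -85 mentioned above.
    private static final byte[] RLE = { (byte) 0xE2, (byte) 0x80, (byte) 0xAB };

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // grows as needed

        try (InputStream input = new BufferedInputStream(new FileInputStream("input.txt"))) {
            boolean atLineStart = true;
            int b;
            while ((b = input.read()) != -1) {
                if (atLineStart && shouldMark()) {
                    out.write(RLE); // insert the control character before this line
                }
                out.write(b);
                atLineStart = (b == '\n');
            }
        }

        try (OutputStream output = new FileOutputStream("output.txt")) {
            out.writeTo(output);
        }
    }

    // Placeholder for whatever condition decides which lines get the mark.
    private static boolean shouldMark() {
        return true;
    }
}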
I have 2 large text files (csv, to be precise). Both have the exact same content except that the rows in one file are in one order and the rows in the other file are in a different order.
When I compress these 2 files (programmatically, using DotNetZip) I notice that one of the files is always considerably bigger; for example, one file is ~7 MB bigger than the other.
My questions are:
How does the order of data in a text file affect compression, and what measures can one take to guarantee the best compression ratio? I presume that having similar rows grouped together (at least in the case of ZIP files, which is what I am using) would help compression, but I am not familiar with the internals of the different compression algorithms and I'd appreciate a quick explanation on this subject.
Which algorithm handles this sort of scenario better, in the sense that it would achieve the best average compression regardless of the order of the data?
"How" has already been answered. To answer your "which" question:
The larger the window for matching, the less sensitive the algorithm will be to the order. However all compression algorithms will be sensitive to some degree.
gzip has a 32K window, bzip2 a 900K window, and xz an 8MB window. xz can go up to a 64MB window. So xz would be the least sensitive to the order. Matches that are further away will take more bits to code, so you will always get better compression with, for example, sorted records, regardless of the window size. Short windows simply preclude distant matches.
In some sense, it is the entropy of the file that defines how well it will compress. So, yes, the order definitely matters. As a simple example, consider a file filled with the values abcdefgh...zabcd...z repeating over and over. It would compress very well with most algorithms because it is very ordered. However, if you completely randomize the order (but leave the same count of each letter), then it contains the exact same data (although with a different "meaning"). It is the same data in a different order, and it will not compress as well.
In fact, because I was curious, I just tried that. I filled an array with 100,000 characters a-z repeating, wrote that to a file, then shuffled that array "randomly" and wrote it again. The first file compressed down to 394 bytes (less than 1% of the original size). The second file compressed to 63,582 bytes (over 63% of the original size).
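If you want to reproduce that experiment yourself, here is a minimal sketch using the JDK's Deflater. The exact byte counts will differ from the figures above (different compressor and settings), but the gap between the ordered and shuffled versions should be just as dramatic.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.DeflaterOutputStream;

public class OrderVsCompression {
    // Deflate a byte array and return the compressed size.
    static int deflatedSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // 100,000 characters cycling through a-z ...
        StringBuilder ordered = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            ordered.append((char) ('a' + (i % 26)));
        }
        // ... and the same characters shuffled randomly.
        List<Character> chars = new ArrayList<>();
        for (int i = 0; i < ordered.length(); i++) {
            chars.add(ordered.charAt(i));
        }
        Collections.shuffle(chars);
        StringBuilder shuffled = new StringBuilder();
        for (char c : chars) {
            shuffled.append(c);
        }

        System.out.println("ordered:  " + deflatedSize(ordered.toString().getBytes(StandardCharsets.US_ASCII)));
        System.out.println("shuffled: " + deflatedSize(shuffled.toString().getBytes(StandardCharsets.US_ASCII)));
    }
}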
A typical compression algorithm works as follows. Look at a chunk of data. If it's identical to some other recently seen chunk, don't output the current chunk literally, output a reference to that earlier chunk instead.
It certainly helps when similar chunks are close together. The algorithm will only keep a limited amount of look-back data, to keep compression speed reasonable. So even if a chunk of data is identical to some other chunk, if that old chunk is too old, it may already have been flushed out of the window.
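As a toy illustration of that "reference an earlier chunk" idea, here is a brute-force sketch that walks an input and prints either a literal byte or a (distance, length) reference back into a limited window. It is not a real codec (no entropy coding, no output format), and the window and minimum-match values are arbitrary illustrative choices.

public class ToyLz {
    static final int WINDOW = 4096;  // how far back we are willing to look
    static final int MIN_MATCH = 3;  // shorter matches are not worth a reference

    public static void main(String[] args) {
        byte[] data = "abcabcabcabXabcabc".getBytes();
        int pos = 0;
        while (pos < data.length) {
            int bestLen = 0, bestDist = 0;
            // Search the look-back window for the longest match starting here.
            for (int cand = Math.max(0, pos - WINDOW); cand < pos; cand++) {
                int len = 0;
                while (pos + len < data.length && data[cand + len] == data[pos + len]) {
                    len++;
                }
                if (len > bestLen) { bestLen = len; bestDist = pos - cand; }
            }
            if (bestLen >= MIN_MATCH) {
                System.out.println("reference (distance=" + bestDist + ", length=" + bestLen + ")");
                pos += bestLen;
            } else {
                System.out.println("literal '" + (char) data[pos] + "'");
                pos++;
            }
        }
    }
}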
Sure it does. If the input pattern is fixed, there is a 100% chance to predict the character at each position. Given that two parties know this about their data stream (which essentially amounts to saying that they know the fixed pattern), virtually nothing needs to be communicated: total compression is possible (to communicate finite-length strings, rather than unlimited streams, you'd still need to encode the length, but that's sort of beside the point). If the other party doesn't know the pattern, all you'd need to do is to encode it. Total compression is possible because you can encode an unlimited stream with a finite amount of data.
At the other extreme, if you have totally random data - so the stream can be anything, and the next character can always be any valid character - no compression is possible. The stream must be transmitted completely intact for the other party to be able to reconstruct the correct stream.
Finite strings are a little trickier. Since finite strings necessarily contain a fixed number of instances of each character, the probabilities must change once you begin reading off initial tokens. One can read some sort of order into any finite string.
Not sure if this answers your question, but it addresses things a bit more theoretically.
Question about byte-pairing for data compression. If byte-pairing converts two byte values into a single byte value, splitting the file in half, then recursing on a gigabyte file four times (a factor of 16) shrinks it to 62,500,000 bytes. My question is: is byte-pairing really efficient? Is the creation of a 5,000,000-iteration loop, to be conservative, efficient? I would like some feedback and some incisive opinions, please.
Dave, what I read was:
"The US patent office no longer grants patents on perpetual motion machines, but has recently granted at least two patents on a mathematically impossible process: compression of truly random data."
I was not implying that the Patent Office was actually considering what I am inquiring about. I was merely commenting on the notion of a "mathematically impossible process." If someone had, in some way, created a method of having a "single" data byte act as a placeholder for 8 individual bytes of data, that would be a consideration for a patent. Now, about the mathematical impossibility of an 8-to-1 compression method: it is not so much a mathematical impossibility as a matter of the rules and conditions that can be created. As long as there is the rule of 8- or 16-bit representation for storing data on a medium, there are ways to manipulate data that mirror current methods, or to create new ones by a new way of thinking.
In general, "recursive compression" as you have described it is a mirage: compression doesn't actually work that way.
First, you should realize that all compression algorithms have the potential to expand the input file instead of compressing it. You can demonstrate this by a simple counting argument: note that the compressed version of any file must be different from the compressed version of any other file (or you will not be able to decompress that file properly). Also, for any file size N, there is a fixed number of possible files of size <=N. If any files of size > N are compressible to size <= N, then an equal number of files of size <= N must expand to size >N when "compressed".
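To put numbers on that counting argument: there are 2^N possible files of exactly N bits, but only 2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1 possible files shorter than N bits. So no lossless compressor can map every N-bit file to a strictly shorter one; at least one file must come out the same size or larger.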
Second, "truly random" files are uncompressible. Compression works because the compression algorithm expects to receive files with certain kinds of predictable regularities. However, "truly random" files are by definition unpredictable: every random file is as likely as every other random file of the same length, so they don't compress.
Effectively, you have a model which treats some files as more likely than others; to compress such files, you want to choose shorter output files for the input files which are more likely. Information theory tells us the most efficient way to compress files is to assign each input file of probability P an output file of length ~ log2(1/P) bits. This means that, ideally, every output file of a given length has roughly equal probability, just like "truly random" files.
Among completely random files of a given length, each has probability (0.5)^(#original bits). The optimal length from above is ~ log2(1/ 0.5^(#original bits) ) = (#original bits) -- which is to say, the original length is the best you can do.
Because the output of a good compression algorithm is nearly random, re-compressing the compressed file will get you little to no gain. Any further improvements are effectively "leakage" due to suboptimal modeling and encoding; also, compression algorithms tend to scramble any regularity they don't take advantage of, making further compression of such "leakage" more difficult.
For a much longer exposition on this topic, with many examples of failed propositions of this type, see the comp.compression FAQ. Claims of "recursive compression" feature prominently.
I'm doing calculations, and the resulting text file right now has 288012413 lines, with 4 columns. Sample line:
288012413; 4855 18668 5.5677643628300215
The file is nearly 12 GB.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?
Go ahead and use a MySQL database.
MSSQL Express has a limit of 4 GB.
MS Access has a limit of 4 GB.
So these options are out. I think using a simple database like MySQL or SQLite without indexing will be your best bet. It will probably be faster to access the data through a database anyway, and on top of that the file size may be smaller.
Well,
The first column looks suspiciously like a line number; if this is the case then you can probably just get rid of it, saving around 11 characters per line.
If you only need about 3 decimal places then you can round/truncate the last column, potentially saving another 13 characters per line.
I.e. you can get rid of 24 characters per line. That line is 40 characters long, so you can approximately halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.
You might also want to look into compressing the file if it is only used for storing the results.
Reducing the 4th field to 3 decimal places should reduce the file to around 8GB.
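Rough arithmetic behind that figure, assuming one byte per character plus a newline and taking the sample line above with the last column cut to 3 decimal places: "288012413; 4855 18668 5.568" plus a newline is 28 bytes, and 288,012,413 lines * 28 bytes/line ≈ 8.06 * 10^9 bytes ≈ 8 GB. Earlier lines have shorter line numbers, so the real total would be slightly less.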
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.
If you are going to use the result as a lookup table, why use ASCII for numeric data? Why not define a struct like so:
struct x {
    long lineno;   /* first column: the line number */
    short thing1;  /* second column */
    short thing2;  /* third column */
    double value;  /* fourth column */
};
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
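If you happened to be doing this from Java rather than C, a sketch of the same fixed-record idea might look like the following (the file name is illustrative and the field values are taken from the sample line above; DataOutputStream writes big-endian, so every record is exactly 8 + 2 + 2 + 8 = 20 bytes).

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteRecords {
    public static void main(String[] args) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("results.bin")))) {
            // One fixed-size record per result: long + short + short + double.
            out.writeLong(288012413L);
            out.writeShort(4855);
            out.writeShort(18668);
            out.writeDouble(5.5677643628300215);
        }
    }
}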
Well, if the files are that big and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might possibly do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use some compression utility, such as GZIP, ZIP, BlakHole, 7-Zip or something like that to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double to quadruple the size of the file vs. ASCII.
Like AShelly, but smaller.
Assuming line #s are consecutive...
struct x {
    short thing1;
    short thing2;
    short value;  // you said only 3 dp, so store as fixed point n*1000; you get 2 digits left of the dp
};
Save in a binary file.
lseek(), read(), and write() are your friends.
The file will still be large(ish), at around 1.7 GB (288,012,413 records x 6 bytes each).
The most obvious answer is just "split the data". Put the data into different files, e.g. 1 million lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've already got a number of answers regarding reducing the data size.
Next, why keep the data as text if you have a fixed-sized structure? Store the numbers as binaries - this will reduce the space even more (text format is very redundant).
Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one will hold a trillion records.
If I were you, I would go with the fixed-size binary format, where each record occupies a fixed 16-20(?) bytes of space. Then even if I keep the data in one file, I can easily determine at which position I need to start reading. If you need to do lookups (say by column 1) and the data is not re-generated all the time, then it could be possible to do a one-time sort by the lookup key after generation -- this would be slow, but as a one-time procedure it would be acceptable.
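A sketch of the "determine at which position I need to start reading" part, assuming fixed 20-byte records laid out like the struct sketched earlier (the file name and layout are illustrative, not prescriptive):

import java.io.IOException;
import java.io.RandomAccessFile;

public class LookupRecord {
    static final int RECORD_SIZE = 8 + 2 + 2 + 8; // long + short + short + double

    // Read the value column of the i-th record (0-based) without scanning the file.
    static double valueAt(RandomAccessFile file, long i) throws IOException {
        file.seek(i * RECORD_SIZE);
        file.readLong();   // lineno
        file.readShort();  // thing1
        file.readShort();  // thing2
        return file.readDouble();
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("results.bin", "r")) {
            System.out.println(valueAt(file, 0));
        }
    }
}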
Can a long ASCII text string be crushed and compressed into a kind of ASCII "checksum" hash by using a sophisticated mathematical formula/algorithm, just like air, which can be compressed?
The idea is to compress megabytes of ASCII text into 128 or so bytes by shuffling, then mixing new "patterns" of single "bytes" turn by turn from the first to the last. When decompressing, the last character is extracted first, and then we carry on decompressing using the formula and the sequential keys from the last to the first. The sequential keys, the last and first bytes, the fully updated final compiled string, and the total number of bytes that were compressed must all be known exactly.
This is the terra compression I was thinking about. Is this possible? Can you explain with examples? I am working on this theory and it is my own idea.
In general? Absolutely not.
For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but not in the order of several megabytes into 128 bytes.
Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeon-hole principle...
If you have n possible input strings and m possible compressed strings, and m is less than n, then two input strings must map to the same compressed string. This is called the pigeonhole principle and is the fundamental reason why there is a limit on how much you can compress data.
What you are describing is more like a hash function. Many hash functions are designed so that given a hash of a string it is extremely unlikely that you can find another string that gives the same hash. But there is no way that given a hash you can discover the original string. Even if you are able to reverse the hashing operation to produce a valid input that gives that hash, there are infinitely many other inputs that would give the same hash. You wouldn't know which of them is the "correct" one.
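To make the hash comparison concrete: a digest has a fixed size no matter how large the input is, which is exactly why it cannot work as compression. A small sketch using SHA-256 (the repeated sentence is just an arbitrary way to build a few megabytes of input):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestIsNotCompression {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Build a few megabytes of text; the content is irrelevant to the point.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) {
            sb.append("the quick brown fox jumps over the lazy dog ");
        }
        byte[] input = sb.toString().getBytes(StandardCharsets.US_ASCII);

        byte[] digest = MessageDigest.getInstance("SHA-256").digest(input);

        System.out.println("input bytes:  " + input.length);  // about 4.4 million
        System.out.println("digest bytes: " + digest.length); // always 32
        // Those 32 bytes cannot be "decompressed" back into the original text:
        // countless different inputs map to 32-byte digests, so the mapping
        // cannot be uniquely reversed (the pigeonhole principle again).
    }
}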
Information theory is the scientific field which addresses questions of this kind. It also lets you calculate the minimum number of bits needed to store a compressed message (with lossless compression). This lower bound is known as the entropy of the message.
Calculating the entropy of a piece of text is possible using a Markov model. Such a model uses information about how likely a certain sequence of characters of the alphabet is.
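For reference, for a source that emits symbol i with probability p(i), that lower bound is H = -sum_i p(i) * log2(p(i)) bits per symbol. For example, if all 26 lowercase letters were equally likely, H = log2(26) ≈ 4.7 bits per character, already well under the 8 bits of an ASCII byte; a Markov model that also accounts for the dependence between neighbouring characters gives a lower (conditional) entropy still.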
The air analogy is very wrong.
When you compress air you make the molecules come closer to each other, each molecule is given less space.
When you compress data you cannot make the bits smaller (unless you put your hard drive in a hydraulic press). The closest you can get to actually making bits smaller is increasing the bandwidth of a network, but that is not compression.
Compression is about finding a reversible formula for calculating data. The "rules" of data compression go something like this:
The algorithm (including any standard starting dictionaries) is shared beforehand and not included in the compressed data.
All startup parameters must be included in the compressed data, including:
Choice of algorithmic variant
Choice of dictionaries
All compressed data
The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).
To get a feeling of how compression works you may study some examples, like Run length encoding and Lempel Ziv Welch.
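As a taste of the first of those, here is a toy run-length encoder (just the encoding direction, with run counts written as decimal digits). Note how it only pays off when the input actually contains long runs and expands otherwise, which echoes the rule that no algorithm can shrink every possible message.

public class RunLength {
    // Encode "aaabbc" as "3a2b1c": each run becomes its length followed by the character.
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int run = 1;
            while (i + run < s.length() && s.charAt(i + run) == c) {
                run++;
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("aaaaaaaaaabbbbbcc")); // "10a5b2c"      -- shorter
        System.out.println(encode("abcdef"));            // "1a1b1c1d1e1f" -- longer!
    }
}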
You may be thinking of fractal compression which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.
This allows for high compression but is lossy (output is close to input but not exactly the same) and compression can be very slow. Even so, ratios of 170:1 are about the highest achieved at the moment.
This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back in the days when USENET was still interesting.
Seriously, anyone who claims to have a magical compression algorithm that reduces any megabyte text file to a few hundred bytes is either:
a scammer or click-baiter,
someone who doesn't understand basic information theory, or
both.
You can compress text to a certain degree because it doesn't use all the available byte values (i.e. a-z and A-Z make up only 52 out of 256 values). Repeating patterns allow some intelligent storage (zip).
There is no way to store arbitrary large chunks of text in any fixed length number of bytes.
You can compress air, but you won't remove its molecules! Its mass stays the same.