What is a small and fast real time compression technique, like lz77? - algorithm

What is the minimum source length (in bytes) for LZ77? Can anyone suggest a small and fast real-time compression technique (preferably with C source)? I need it to store compressed text and retrieve it quickly for excerpt generation in my search engine.
Thanks for all the responses. I'm using the D language for this project, so it's rather hard to port LZO to D. I'm going with either LZ77 or Predictor. Thanks again :)

I long ago had need for a simple, fast compression algorithm, and found Predictor.
While it may not be the best in terms of compression ratio, Predictor is certainly fast (very fast), easy to implement, and has a good worst-case performance. You also don't need a license to implement it, which is goodness.
You can find a description and C source for Predictor in Internet RFC 1978: PPP Predictor Compression Protocol.
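To give a feel for how little there is to it, here is a minimal sketch of the idea in Python rather than the RFC's C. It assumes a 64 KB guess table, a 16-bit rolling hash of the form (h << 4) ^ byte, and one flag byte per group of up to eight input bytes. The function names and the explicit length argument are my own for illustration, and the output is not meant to be wire-compatible with the PPP framing in RFC 1978.

```python
def _next_hash(h, byte):
    # 16-bit rolling hash over the most recent bytes (assumed form).
    return ((h << 4) ^ byte) & 0xFFFF


def predictor_compress(data):
    table = bytearray(65536)   # guess table: hash -> predicted next byte
    h = 0
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        flags = 0
        literals = bytearray()
        # Handle up to 8 bytes per flag byte.
        for bit in range(min(8, n - i)):
            b = data[i]
            if table[h] == b:
                flags |= 1 << bit      # guessed right: emit nothing
            else:
                table[h] = b           # guessed wrong: learn and emit literal
                literals.append(b)
            h = _next_hash(h, b)
            i += 1
        out.append(flags)
        out += literals
    return bytes(out)


def predictor_decompress(blob, length):
    # `length` is the original size; this sketch does not store it itself.
    table = bytearray(65536)
    h = 0
    out = bytearray()
    pos = 0
    while len(out) < length:
        flags = blob[pos]
        pos += 1
        for bit in range(min(8, length - len(out))):
            if flags & (1 << bit):
                b = table[h]           # predicted byte
            else:
                b = blob[pos]          # literal byte
                pos += 1
                table[h] = b
            out.append(b)
            h = _next_hash(h, b)
    return bytes(out)
```

The worst case is one extra flag byte per eight input bytes, which is where the good worst-case behaviour comes from.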

The lzo compressor is noted for its smallness and high speed, making it suitable for real-time use. Decompression, which uses almost zero memory, is extremely fast and can even exceed memory-to-memory copy on modern CPUs due to the reduced number of memory reads. lzop is an open-source implementation; versions for several other languages are available.

If you're looking for something better known, LZMA, the 7-Zip encoder, is about the best general-purpose compressor you'll get in terms of compression ratio: http://www.7-zip.org/sdk.html

There's also LZJB:
https://hg.java.net/hg/solaris~on-src/file/tip/usr/src/uts/common/os/compress.c
It's pretty simple, based on LZRW1, and is used as the basic compression algorithm for ZFS.

Related

Deflate Compression Algorithm Implemented in High Level Language?

There are lots of implementations of the Deflate decompression algorithm in different languages. The decompression algorithm itself is described in RFC1951. However, the compression algorithm seems more elusive and I've only ever seen it implemented in long C/C++ files.
I'd like to find an implementation of the compression algorithm in a higher level language, e.g. Python/Ruby/Lua/etc., for study purposes. Can someone point me to one?
Pyflate is a pure python implementation of gzip (which uses DEFLATE).
http://www.paul.sladen.org/projects/pyflate/
Edit: Here is a python implementation of LZ77 compression, which is the first step in DEFLATE.
https://github.com/olle/lz77-kit/blob/master/src/main/python/lz77.py
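For a quick feel of what the LZ77 step does, here is a much smaller sketch, not taken from the linked file. It assumes the textbook (offset, length, next byte) token form with a brute-force window search; the window and max_len defaults are arbitrary illustration values, and DEFLATE itself encodes matches differently.

```python
def lz77_compress(data, window=255, max_len=15):
    """Return a list of (offset, length, literal) tokens."""
    tokens = []
    i = 0
    n = len(data)
    while i < n:
        best_len, best_off = 0, 0
        # Brute-force search of the sliding window for the longest match.
        for j in range(max(0, i - window), i):
            k = 0
            # Stop one byte early so a literal always follows the match.
            while k < max_len and i + k < n - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decompress(tokens):
    out = bytearray()
    for offset, length, literal in tokens:
        # Copy byte by byte so overlapping matches work.
        for _ in range(length):
            out.append(out[-offset])
        out.append(literal)
    return bytes(out)
```

A real implementation would replace the inner search with a hash table or hash chains, which is where most of the length of the C versions comes from.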
The next step, Huffman encoding of the symbols, is a simple greedy algorithm which shouldn't be too hard to implement.
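The greedy step can be sketched roughly like this: repeatedly merge the two least frequent groups of symbols, and every symbol in a merged group gets one bit longer. The function name is made up for illustration, and it only computes code lengths. DEFLATE additionally limits code lengths and assigns canonical codes from them, which this sketch does not do.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data):
    """Map each symbol in `data` to its Huffman code length in bits."""
    freq = Counter(data)
    if len(freq) == 1:
        # A lone symbol still needs one bit.
        return {next(iter(freq)): 1}
    # Heap entries: (count, tie_breaker, symbols in this subtree).
    heap = [(count, idx, [sym]) for idx, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes all their symbols one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths
```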

most suitable language for computationally and memory expensive algorithms

Let's say you have to implement a tool to efficiently solve an NP-hard problem, with an unavoidable possible explosion of memory usage (the output size is in some cases exponential in the input size), and you are particularly concerned about the performance of this tool at run time. The source code also has to be readable and understandable once the underlying theory is known, and this requirement is as important as the efficiency of the tool itself.
I personally think that three languages could be suitable for these three requirements: C++, Scala, Java.
They all provide the right abstraction on data types that makes it possible to compare different structures or apply the same algorithms (which is also important) to different data types.
C++ has the advantage of being statically compiled and optimized, and with function inlining (if the data structures and algorithms are designed carefully) and other optimisation techniques it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability.
If you also put a lot of care in data representation you can optimise the cache performance, which can gain orders of magnitude in speed when the cache miss rate is low.
Java, by contrast, is JIT compiled, which allows optimisations to be applied at runtime; for this category of algorithms, whose behaviour can differ between runs, that may be a plus. I fear that such an approach could suffer from the garbage collector. However, this algorithm allocates memory continuously, Java's heap allocation is notoriously faster than that of C/C++, and if you implement your own memory manager inside the language you could even achieve good efficiency.
On the other hand, this approach is not able to inline method invocations (which induces a huge performance penalty) and doesn't give you control over cache performance. Among the pros, there's a better and cleaner syntax than C++.
My concerns about Scala are more or less the same as for Java, plus the fact that I can't control how the language is optimised unless I have deep knowledge of the compiler and the standard library. But well: I get a very clean syntax :)
What's your take on the subject? Have you had to deal with this already? Would you implement an algorithm with such properties and requirements in any of these languages or would you suggest something else? How would you compare them?
Usually I’d say “C++” in a heartbeat. The secret being that C++ simply produces less (memory) garbage that needs managing.
On the other hand, your observation that
"however in the case of this algorithm it's common to continuously allocate memory"
is a hint that Java / Scala may actually be more suited. But then you could use a small object heap in C++ as well. Boost has one that uses the standard allocator interface, if memory serves.
Another advantage of C++ is obviously the use of abstraction without penalty through templates – i.e. that you can easily create generic algorithmic components that can interact without incurring a runtime overhead due to abstraction. In fact, you noted that
"it's possible to achieve a performance close to that of pure C while maintaining a fairly good readability"
– this is looking at things the wrong way: Templates allow C++ to achieve performance superior to that of C while still maintaining high abstraction.
D might be worth a look, seeing as how it tries to be a better C++.
From a superficial glance, it has better source code readability than C++ does, so that's one of your points covered.
It also has memory management, which makes playing with algorithms a bit easier.
And it has templates.
Here is a stackoverflow discussion comparing the performance of C++ and D
The languages you noticed were my first guesses as well.
Each language has a different take on how to handle specific issues like compilation, memory management and source code, but in theory, any of them should be fitting to your problem.
It is impossible to tell which is best, and there is likely no major difference if you are familiar enough with all of them to work around their respective quirks.
And obviously, if you actually find the need to optimize (I'm not sure if that's a given), that's possible in each language. Lower level languages obviously offer more options, but are also (far) more complex to actually improve.
A single note about C++ vs Java: This is really a holy war, and if you've followed the recent development you'll probably have your own opinion. I, for one, think Java offers enough good aspects to make up for its flaws, usually.
And a final note on C++ vs C: as far as I know, the difference usually amounts to a percentage low enough to ignore. If it doesn't make a difference for the source code, it's fine to go with C; if C++ would make for easier-to-read source code, go with C++. In any case, the choice is fairly negligible.
In the end, remember that money spent on a few hours of programming/optimizing this could as well go into slightly superior hardware to make up for missed tiny details.
It all boils down to: Any of your options is fine as long as you do it right (domain knowledge).
I would use a language which makes it very easy to work on the algorithm. Get the algorithm right and it could very easily outweigh any advantage from fine-tuning the wrong algorithm. Don't be scared to play around in a language normally thought of as slow in execution speed if that language makes it easier to express algorithmic ideas. It is usually much easier to transcribe the right algorithm into another language than it is to eke out the last dregs of speed from the wrong algorithm in the fastest executing language.
So do it in a language you are comfortable with and which is expressive. You might surprise yourself and find that what is produced is fast enough!

Computationally intensive algorithms

I am planning to write a bunch of programs on computationally intensive algorithms. The programs would serve as an indicator of different compiler/hardware performance.
I would like to pick a common set of algorithms that are used in different fields, such as bioinformatics, gaming, image processing, et al. The reason I want to do this is to learn the algorithms and to have a personal mini benchmark suite that is small, useful, and easy to maintain.
Any advice on algorithm selection would be tremendously helpful.
Benchmarks are not worthy of your attention!
The very best guide to processor performance is: http://www.agner.org/optimize/
And then someone will chuck it in a box with 3GB of RAM defeating dual-channel hopes and your beautifully tuned benchmark will give widely different results again.
If you have a piece of performance critical code and you are sure you've picked the winning algorithm you can't then go use a generic benchmark to determine the best compiler. You have to actually compile your specific piece of code with each compiler and benchmark them with that. And the results you gain, whilst useful for you, will not extrapolate to others.
Case in point: people who make compression software - like zip and 7zip and the high-end stuff like PPMs and context-mixing and things - are very careful about performance and benchmark their programs. They hang out on www.encode.ru
And the situation is this: for engineers developing the same basic algorithm - say LZ or entropy coding like arithmetic-coding and huffman - the engineers all find very different compilers are better.
That is to say, two engineers solving the same problem with the same high-level algorithm will each benchmark their implementation and have results recommending different compilers...
(I have seen the same thing happen repeatedly in competition programming, e.g. Al Zimmermann's Programming Contests, which is an equally performance-attentive community.)
(The newer GCC 4.x series is very good all round, but that's just my data-point, others still favour ICC)
(Platform benchmarks for IO-related tasks are another thing altogether; people don't appreciate how differently Linux, Windows and FreeBSD (and the rest) perform under stress. Benchmarks there, on the same workload, the same machine, different machines or different core counts, would be very generally informative. Sadly, there aren't enough benchmarks like that about.)
There was some work done at Berkeley about this a few years ago. They identified 13 common application patterns for parallel programming, the "13 Dwarves". They include things like linear algebra, n-body models, FFTs, etc.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
See page 10 onwards.
There are some sample .NET implementations here:
http://paralleldwarfs.codeplex.com/
The typical one is Fast Fourier Transform, perhaps you can also do something like the Lucas–Lehmer primality test.
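The Lucas–Lehmer test is short enough to sketch here. This is a minimal Python version for illustration (the function name is made up); it assumes p is itself a prime exponent and checks whether 2**p - 1 is prime. As a benchmark kernel it mostly stresses big-integer squaring and modular reduction.

```python
def is_mersenne_prime(p):
    """Lucas-Lehmer test for M_p = 2**p - 1, assuming p is prime."""
    if p == 2:
        return True            # M_2 = 3; the recurrence below needs p > 2
    m = (1 << p) - 1
    s = 4
    # Iterate s -> s^2 - 2 (mod M_p) exactly p - 2 times.
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0
```

For timing purposes you would feed it larger known exponents so that each squaring is expensive.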
I remember a guy who tested the computational performance of machines and compiler versions by inverting Hilbert matrices.
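I don't know how he implemented it, but a rough sketch of such a kernel could look like the following. It assumes plain Gauss-Jordan elimination and uses exact rational arithmetic so the result can be checked; the helper names are invented for illustration. A floating-point version would instead stress numerical stability, since Hilbert matrices are badly conditioned.

```python
from fractions import Fraction


def hilbert(n):
    # H[i][j] = 1 / (i + j + 1) with zero-based indices.
    return [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]


def invert(matrix):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment with the identity matrix.
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        # Clear the column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]
```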
For image processing, median filtering (used for removing noise, bad pixels) is always too slow. It might make a good test, given a large enough image say 1000x1000.
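A naive version, which is exactly the slow kind that makes a good test, might look like this sketch. It assumes a grayscale image stored as a list of rows of integers, clips the window at the borders, and takes the upper median for even-sized windows; all of those are choices made here for illustration.

```python
def median_filter(image, radius=1):
    """Naive median filter over a (2*radius+1) square window."""
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [image[yy][xx]
                      for yy in range(max(0, y - radius), min(height, y + radius + 1))
                      for xx in range(max(0, x - radius), min(width, x + radius + 1))]
            window.sort()
            out[y][x] = window[len(window) // 2]
    return out
```

Sorting every window is the expensive part; faster variants keep a running histogram instead.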

Image Recognition

I'd like to do some work with the nitty-gritty of computer imaging. I'm looking for a way to read single pixels of data, analyze them programmatically, and change them. What is the best language to use for this (Python, C++, Java...)? What is the best file format?
I don't want any super fancy software/APIs... I'm looking for the bare basics.
If you need speed (you'll probably always want speed with image processing) you definitely have to work with raw pixel data.
Java has some real disadvantages here: you cannot access memory directly, which makes pixel access quite slow compared to languages that can.
C++ is definitely the language of choice for production image processing. But you can, for example, also use C#, as it allows unsafe code in specific areas. (Take a look at the Scan0 pointer property of the BitmapData class.)
I've used C# successfully for image processing applications and they are definitely much faster than their Java counterparts.
I would not use any scripting language or Java for such a purpose.
It's very easy to manipulate the large multi-dimensional or complex arrays of pixel information that make up pictures using high-level languages such as Python. There's a library called PIL (the Python Imaging Library) that is quite useful and will let you do general filters and transformations (change the brightness, soften, desaturate, crop, etc.) as well as manipulate the raw pixel data.
It is the easiest and simplest image library I've used to date and can be extended to do whatever it is you're interested in (edge detection in very little code, for example).
I studied Artificial Intelligence and Computer Vision, thus I know pretty well the kind of tools that are used in this field.
Basically: you can use whatever you want as long as you know how it works behind the scene.
Now depending on what you want to achieve, you can either use:
C: theoretically the fastest language for this kind of job, but you will lose a lot of time on bug checking and memory management when implementing your algorithms. If your algorithms are not computationally efficient (in terms of complexity), or if you lose too much time on bug checking, it is clearly not worth it. So I would advise first implementing your application in another language; you can always optimize small parts of your code with C bindings later.
Octave/MATLAB: a very efficient language, almost as efficient as C, in which you can write very elegant and succinct algorithms. If you are into vectorization, matrix and linear operations, you should go with that. However, you won't be able to develop a whole application with this language; it's more focused on algorithms. You can always develop an interface using another language later.
Python: an all-in-one, elegant and accessible language, used in very large scale applications at companies such as Google and Facebook. You can do pretty much everything you want with Python, any kind of application. It is a perfect fit if you want to make a full application (with client interaction and all, not only algorithms), or if you want to quickly draft a prototype using existing libraries, since Python has a very large set of high quality libraries, like OpenCV. However, if you only want to write algorithms, you would be better off using Octave/MATLAB.
The answer that was selected as a solution is very biased, and you should be careful about this kind of archaic comment.
Nowadays, hardware is cheaper than wetware (humans), and thus you should use languages in which you will be able to produce results faster, even if it's at the cost of a few CPU cycles or some memory space.
Also, a lot of people tend to think that as long as you implement your software in C/C++, you have found the Holy Grail of speed: this is just not true. First, because algorithmic complexity matters a lot more than the language you are using (a bad algorithm will never beat a better algorithm, even if the better one is implemented in the slowest language in the universe), and secondly because high-level languages nowadays do a lot of caching and speed optimization for you, and this can make your program run even faster than in C/C++.
Of course, you can always do everything of the above in C/C++, but how much of your time are you willing to waste to reinvent the wheel?
Not only will C/C++ be faster, but most of the image processing sample code you find out there will be in C as well, so it will be easier to incorporate things you find.
If you are looking to do numerical work on your images (think matrices) and you are into Python, check out http://www.scipy.org/PyLab - this basically gives you the ability to do MATLAB-style work in Python. A buddy of mine swears by it.
(This might not apply for the OP who only wanted the bare basics -- but now that the speed issue was brought up, I do need to write this, just for the record.)
If you really need speed, it's better to forget about working on the pixel-by-pixel level, and rather see whether the operations that you need to perform could be vectorized. For example, for your C/C++ code you could use the excellent Intel IPP library (no, I don't work for Intel).
It depends a little on what you're trying to do.
If runtime speed is your issue then C++ is the best way to go.
If speed of development is an issue, though, I would suggest looking at Java. You said that you wanted low level manipulation of pixels, which Java will do for you. But the other thing that might be an issue is the handling of the various file formats. Java does have some very nice APIs for reading and writing various image formats to file (in particular the Java 2D library; you can choose to ignore the higher levels of the API).
If you do go for the C++ option (or Python, come to think of it) I would again suggest the use of a library to get you over the startup issues of reading and writing files. I've previously had success with libgd.
What language do you know the best? To me, this is the real question.
If you're going to spend months and months learning one particular language, then there's no real advantage in using Python or Java just for their (to be proven) development speed.
I'm particularly proficient in C++ and I think that for this particular task I can be as speedy as a Java programmer, for example. With the aid of some good library (OpenCV comes to mind) you can create anything you need in a matter of a couple of lines of C++ code, really.
Short answer: C++ and OpenCV
Short answer? I'd say C++, you have far more flexibility in manipulating raw chunks of memory than Python or Java.

Identifying Algorithms in Binaries

Does anyone know a technique to identify algorithms in already compiled files, e.g. by testing the disassembly for some patterns?
The little information I have is that there is some (not exported) code in a library that decompresses the content of a Byte[], but I have no clue how that works.
I have some files which I believe to be compressed in that unknown way, and it looks as if the files come without any compression header or trailer. I assume there's no encryption, but as long as I don't know how to decompress them, they're worth nothing to me.
The library I have is an ARM9 binary for low capacity targets.
EDIT:
It's a lossless compression, storing binary data or plain text.
You could go in a couple of directions: static analysis with something like IDA Pro, or loading it into GDB or an emulator and following the code that way. They may be XOR'ing the data to hide the algorithm, since there are already many good lossless compression techniques.
Decompression algorithms involve a significant amount of looping in tight loops. You might start by looking for loops (decrement register, jump backwards if not 0).
Given that it's a small target, you have a good chance of decoding it by hand. Though it looks hard now, once you dive into it you'll find that you can identify various programming structures yourself.
You might also consider decompiling it to a higher level language, which would be easier than assembly, though still hard if you don't know how it was compiled.
http://www.google.com/search?q=arm%20decompiler
-Adam
The reliable way to do this is to disassemble the library and read the resulting assembly code for the decompression routine (and perhaps step through it in a debugger) to see exactly what it is doing.
However, you might be able to look at the magic number for the compressed file and so figure out what kind of compression was used. If it's a zlib stream (DEFLATE with the zlib wrapper, default compression level), for example, the first two bytes will typically be hexadecimal 78 9c; if bzip2, 42 5a; if gzip, 1f 8b. Raw DEFLATE data without a wrapper has no magic number of its own.
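A quick way to run that check over a batch of files is a small sniffing function like the sketch below. The function name and return values are made up for illustration. For zlib it does not insist on 78 9c but applies the header rule from the zlib format (compression method 8, window size field at most 7, and the two header bytes forming a multiple of 31), so other compression levels are recognised as well.

```python
def guess_container(blob):
    """Guess the compression wrapper from the first bytes, or return None."""
    if blob[:2] == b"\x1f\x8b":
        return "gzip"
    if blob[:3] == b"BZh":          # 42 5a 68
        return "bzip2"
    if len(blob) >= 2:
        cmf, flg = blob[0], blob[1]
        # zlib header: method 8 (deflate), window field <= 7, checksum mod 31.
        if cmf & 0x0F == 8 and (cmf >> 4) <= 7 and ((cmf << 8) | flg) % 31 == 0:
            return "zlib"
    return None
```

A None result does not prove the data is uncompressed; it may simply be raw DEFLATE or a custom format.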
In my experience, most of the time the files are compressed using plain old Deflate. You can try using zlib to open them, starting from different offsets to compensate for custom headers. The problem is that zlib itself expects its own header. In Python (and I guess other implementations have this feature as well), you can pass -15 to zlib.decompress as the history buffer (window) size (i.e. zlib.decompress(data, -15)), which causes it to decompress raw deflated data, without zlib's header.
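The offset scan can be sketched like this. The function name and the max_offset default are invented for illustration; it uses zlib.decompressobj with a negative window size rather than zlib.decompress so that it can check whether the stream actually reached its end marker.

```python
import zlib


def find_raw_deflate(blob, max_offset=64):
    """Try raw DEFLATE at each offset; return (offset, data) for each hit."""
    hits = []
    for offset in range(min(max_offset, len(blob))):
        d = zlib.decompressobj(-15)     # negative wbits: no zlib header
        try:
            out = d.decompress(blob[offset:])
        except zlib.error:
            continue                    # not a valid stream at this offset
        # Keep only streams that terminated cleanly and produced output.
        if d.eof and out:
            hits.append((offset, out))
    return hits
```

Short garbage can occasionally decode as a valid stream, so treat hits as candidates and look at the output before trusting an offset.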
Reverse engineering done by viewing the assembly may have copyright issues. In particular, doing this to write a program for decompressing is almost as bad, from a copyright standpoint, as just using the assembly yourself. But the latter is much easier. So, if your motivation is just to be able to write your own decompression utility, you might be better off just porting the assembly you have.
