I am doing my graduation project, which is Arabic-language compression and encryption.
The encryption part is done using AES and it works very well.
But my problem is the compression: I don't know which algorithm to use. Which is the easiest to implement while still giving good output performance?
It is important that you FIRST compress and THEN encrypt.
Otherwise, you won't be able to compress: well-encrypted data is indistinguishable from random noise, and random data does not compress.
That said, use whatever compression you can get, like xz, bzip2, gzip, ...
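A quick way to see why the order matters. This is a minimal sketch using zlib, with random bytes standing in for AES ciphertext (good ciphertext is statistically indistinguishable from random data, so no actual AES is needed for the demo):

```python
import os
import zlib

# Repetitive natural-language text compresses well.
plaintext = ("Arabic or any natural-language text repeats itself. " * 100).encode("utf-8")

# Compress-then-encrypt order: compression sees the redundancy.
compressed_first = zlib.compress(plaintext)

# Encrypt-then-compress order: the compressor sees what looks like noise.
pseudo_ciphertext = os.urandom(len(plaintext))
compressed_after = zlib.compress(pseudo_ciphertext)

print(len(plaintext), len(compressed_first), len(compressed_after))
# The "compress first" stream is far smaller; the "compress after
# encryption" stream does not shrink at all (it usually grows slightly).
```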
I know this is an old question, but it feels like there is no sufficient answer with regard to which compression algorithm the asker should use, hence this answer.
If I were you I would use an existing library such as zlib, which has been implemented in multiple languages.
Using this library you can decide if you want to use the deflate algorithm or the gzip one. The difference between these two is nicely described in their FAQ.
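For reference, the difference is only in the container around the same deflate data. A sketch with Python's zlib binding, where the wbits parameter (per the zlib manual) selects a zlib wrapper, a gzip wrapper, or a raw deflate stream:

```python
import zlib

data = b"example payload " * 64

def deflate(data, wbits):
    co = zlib.compressobj(wbits=wbits)
    return co.compress(data) + co.flush()

zlib_stream = deflate(data, 15)       # zlib header + Adler-32 checksum
gzip_stream = deflate(data, 16 + 15)  # gzip header + CRC-32 checksum
raw_stream = deflate(data, -15)       # bare deflate, no header or checksum

print(zlib_stream[:2].hex(), gzip_stream[:2].hex())  # 789c vs 1f8b
```

Decompression mirrors this: zlib.decompress handles the zlib wrapper by default, wbits=16+15 selects the gzip wrapper, and wbits=-15 a raw stream.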
Hope this helps.
For example, an LZ-family algorithm example could be LZMA, but I can't find a Huffman example. I understand that BWT-based compression uses it to some extent, but it uses other types of algorithms too.
I think you mean implementation, not algorithm. Huffman coding is an algorithm.
zlib provides the Z_HUFFMAN_ONLY compression strategy, which only uses Huffman coding to compress the input. The string matching zlib normally uses is turned off with that option.
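A short sketch of that strategy option, using Python's zlib binding:

```python
import zlib

data = b"abracadabra " * 200  # highly repetitive input

def compress(strategy):
    co = zlib.compressobj(strategy=strategy)
    return co.compress(data) + co.flush()

default = compress(zlib.Z_DEFAULT_STRATEGY)  # LZ77 string matching + Huffman
huffman = compress(zlib.Z_HUFFMAN_ONLY)      # Huffman coding only

# Without string matching, the repetition is not exploited, so the
# Huffman-only stream is much larger, yet it is still a valid zlib stream.
print(len(default), len(huffman))
```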
I'm really interested in image and video compression, but it's hard for me to find a main source from which to start implementing the major algorithms.
What I want is just a source of information to begin the implementation of my own codec. I want to implement it from scratch (for example, for JPEG, implement my own Huffman coding, cosine transform, ...). All I need is a little step-by-step guide showing me which steps are involved in each algorithm.
I'm interested mainly in image compression algorithms (for now, JPEG) and video compression algorithms (MPEG-4, M-JPEG, and maybe AVI and MP4).
Can anyone suggest an online source with a little more information than Wikipedia? (I checked it, but the information is not really comprehensive.)
Thank you so much :)
Start with JPEG. You'll need the JPEG standard. It will take a while to go through, but that's the only way to have a shot at writing something compatible. Even then, the standard won't help much with deciding on how and how much you quantize the coefficients, which requires experimentation with images.
Once you get that working, then get the H.264 standard and read that.
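To give a feel for the very first transform stage (the "cosine conversion" from the question; an 8x8 DCT-II in the JPEG standard), here is a direct, unoptimized sketch. Real encoders use fast factored DCTs instead:

```python
import math

def dct_8x8(block):
    """Forward 8x8 DCT-II on a block of level-shifted samples (about -128..127)."""
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            cu = 1 / math.sqrt(2) if u == 0 else 1.0
            cv = 1 / math.sqrt(2) if v == 0 else 1.0
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * cu * cv * s
    return out

# A flat block concentrates all energy in the DC coefficient out[0][0];
# after this stage a JPEG encoder quantizes and entropy-codes the result.
```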
The ImpulseAdventure site has a fantastic series of articles about the basics of JPEG encoding.
I'm working on an experimental JPEG encoder that's partly designed to be readable and easy to change (rather than obfuscated by performance optimizations).
There are lots of implementations of the DEFLATE decompression algorithm in different languages. The decompression algorithm itself is described in RFC 1951. However, the compression algorithm seems more elusive, and I've only ever seen it implemented in long C/C++ files.
I'd like to find an implementation of the compression algorithm in a higher level language, e.g. Python/Ruby/Lua/etc., for study purposes. Can someone point me to one?
Pyflate is a pure Python implementation of gzip (which uses DEFLATE).
http://www.paul.sladen.org/projects/pyflate/
Edit: Here is a python implementation of LZ77 compression, which is the first step in DEFLATE.
https://github.com/olle/lz77-kit/blob/master/src/main/python/lz77.py
The next step, Huffman encoding of the symbols, is a simple greedy algorithm which shouldn't be too hard to implement.
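For illustration, a minimal sketch of that greedy construction: repeatedly merge the two lowest-frequency nodes, building the code table directly rather than an explicit tree:

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    freq = Counter(data)
    if len(freq) == 1:  # degenerate single-symbol input
        return {sym: "0" for sym in freq}
    # Heap entries: (frequency, tiebreaker, {symbol: code bits so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes(b"this is an example of huffman coding")
# Frequent symbols (like the space) receive the shortest codes.
```

Note that DEFLATE itself uses canonical Huffman codes derived only from the code lengths, but the greedy merge above is the core idea.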
What is the minimum source length (in bytes) for LZ77? Can anyone suggest a small and fast real-time compression technique (preferably with C source)? I need it to store compressed text and retrieve it quickly for excerpt generation in my search engine.
Thanks for all the responses. I'm using the D language for this project, so it's kind of hard to port LZO to D code, so I'm going with either LZ77 or Predictor. Thanks again :)
I long ago had need for a simple, fast compression algorithm, and found Predictor.
While it may not be the best in terms of compression ratio, Predictor is certainly fast (very fast), easy to implement, and has a good worst-case performance. You also don't need a license to implement it, which is goodness.
You can find a description and C source for Predictor in Internet RFC 1978: PPP Predictor Compression Protocol.
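To show the flavor of the scheme, here is a simplified sketch of the RFC 1978 idea (not the exact wire format, which also frames packets and carries a checksum; see the RFC's C source for that): a guess table predicts the next byte from a rolling hash of recent input, so correctly predicted bytes cost one flag bit and misses cost a whole byte.

```python
def predictor_compress(data: bytes) -> bytes:
    table = [0] * 65536  # guess table indexed by a rolling hash
    h = 0
    out = bytearray()
    for i in range(0, len(data), 8):
        chunk = data[i:i + 8]
        flags = 0
        literals = bytearray()
        for j, b in enumerate(chunk):
            if table[h] == b:
                flags |= 1 << j      # hit: byte is predicted, not stored
            else:
                table[h] = b
                literals.append(b)   # miss: store the literal byte
            h = ((h << 4) ^ b) & 0xFFFF
        out.append(flags)            # one flag byte per 8 input bytes
        out += literals
    return bytes(out)

def predictor_decompress(blob: bytes, length: int) -> bytes:
    table = [0] * 65536
    h = 0
    out = bytearray()
    pos = 0
    while len(out) < length:
        flags = blob[pos]
        pos += 1
        for j in range(min(8, length - len(out))):
            if flags & (1 << j):
                b = table[h]         # hit: reproduce the prediction
            else:
                b = blob[pos]
                pos += 1
                table[h] = b
            out.append(b)
            h = ((h << 4) ^ b) & 0xFFFF
    return bytes(out)
```

The decompressor runs the same predictor, so it stays in lockstep with the compressor without any transmitted model.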
The lzo compressor is noted for its smallness and high speed, making it suitable for real-time use. Decompression, which uses almost zero memory, is extremely fast and can even exceed memory-to-memory copy on modern CPUs due to the reduced number of memory reads. lzop is an open-source implementation; versions for several other languages are available.
If you're looking for something more well known, this is about the best compressor in terms of general compression that you'll get: LZMA, the 7-Zip encoder. http://www.7-zip.org/sdk.html
There's also LZJB:
https://hg.java.net/hg/solaris~on-src/file/tip/usr/src/uts/common/os/compress.c
It's pretty simple, based on LZRW1, and is used as the basic compression algorithm for ZFS.
Does anyone of you know a technique to identify algorithms in already compiled files, e.g. by testing the disassembly for some patterns?
The scarce information I have is that there is some (not exported) code in a library that decompresses the contents of a byte[], but I have no clue how it works.
I have some files which I believe to be compressed in that unknown way, and it looks as if the files come without any compression header or trailer. I assume there's no encryption, but as long as I don't know how to decompress them, they're worth nothing to me.
The library I have is an ARM9 binary for low capacity targets.
EDIT:
It's a lossless compression, storing binary data or plain text.
You could go a couple of directions: static analysis with something like IDA Pro, or load it into GDB or an emulator and follow the code that way. They may be XOR'ing the data to hide the algorithm, since there are already many good lossless compression techniques.
Decompression algorithms spend most of their time in tight loops. You might start by looking for loops (decrement register, jump backwards if not zero).
Given that it's a small target, you have a good chance of decoding it by hand. Though it looks hard now, once you dive into it you'll find that you can identify various programming structures yourself.
You might also consider decompiling it to a higher level language, which would be easier than assembly, though still hard if you don't know how it was compiled.
http://www.google.com/search?q=arm%20decompiler
-Adam
The reliable way to do this is to disassemble the library and read the resulting assembly code for the decompression routine (and perhaps step through it in a debugger) to see exactly what it is doing.
However, you might be able to look at the magic number of the compressed file and figure out what kind of compression was used. If it's DEFLATE with a zlib wrapper, for example, the first two bytes will typically be hexadecimal 78 9c (raw DEFLATE has no magic number); if bzip2, 42 5a; if gzip, 1f 8b.
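A small sketch of that check, with a signature table built from the magic numbers above plus the common zlib level variants (the zlib second byte varies with the compression level):

```python
MAGIC = {
    b"\x78\x01": "zlib wrapper (low compression)",
    b"\x78\x9c": "zlib wrapper (default compression)",
    b"\x78\xda": "zlib wrapper (best compression)",
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",  # 42 5a 68
}

def sniff(blob: bytes) -> str:
    """Guess the compression container from leading magic bytes."""
    for magic, name in MAGIC.items():
        if blob.startswith(magic):
            return name
    return "unknown (possibly raw deflate or a custom format)"
```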
From my experience, most of the time the files are compressed using plain old Deflate. You can try using zlib to open them, starting from different offsets to compensate for custom headers. The problem is, zlib itself expects its own header. In Python (and I guess other implementations have that feature as well), you can pass -15 to zlib.decompress as the window-size argument (i.e. zlib.decompress(data, -15)), which causes it to decompress raw deflate data, without zlib's header and checksum.
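A sketch of that approach, synthesizing a file with a hypothetical 4-byte custom header (the "HDR0" name is made up for the demo) in front of a raw deflate stream, then scanning offsets until raw decompression succeeds:

```python
import zlib

payload = b"some repetitive payload " * 50

# Build a blob the way a custom format might: opaque header, then raw deflate.
co = zlib.compressobj(wbits=-15)
blob = b"HDR0" + co.compress(payload) + co.flush()

# Try successive offsets until raw-deflate decompression (-15) succeeds.
recovered = None
for offset in range(16):
    try:
        recovered = zlib.decompress(blob[offset:], -15)
        print("payload found at offset", offset)
        break
    except zlib.error:
        continue
```

In practice you would also sanity-check the result (e.g. does it look like text?), since garbage input can occasionally decompress without error.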
Reverse engineering done by viewing the assembly may have copyright issues. In particular, doing this to write a program for decompressing is almost as bad, from a copyright standpoint, as just using the assembly yourself. But the latter is much easier. So, if your motivation is just to be able to write your own decompression utility, you might be better off just porting the assembly you have.