Text compression - random

If we use fewer unique symbols, e.g. (a-z) instead of (A-Z, a-z), does the efficiency of the LZ family of algorithms increase?
Is there any other compression algorithm efficient for random unique elements at present?

Yes, efficiency increases (on average): with a smaller alphabet there will be more matches, and longer ones, and the Huffman-coding stage also becomes stronger because each literal symbol can be coded in fewer bits.
Is there any other compression algorithm efficient for random unique elements at present?
Not sure what you mean. If you mean that the individual characters are independent, then even basic compression algorithms come very close to the theoretical optimum (an entropy coder such as Huffman or arithmetic coding approaches the source entropy).
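A rough way to check the first claim is to compress random text drawn from a 26-symbol alphabet versus a 52-symbol one and compare output sizes. The sketch below is my own illustration (not from the original answer); it uses java.util.zip.Deflater, whose DEFLATE format combines LZ77 matching with Huffman coding, and the smaller alphabet should compress to fewer bytes on average.

```java
// Rough experiment: compress random text from a 26-symbol vs. a 52-symbol alphabet
// with DEFLATE (LZ77 + Huffman) and compare the compressed sizes.
import java.util.Random;
import java.util.zip.Deflater;

public class AlphabetSizeDemo {
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[64 * 1024];
        int size = 0;
        while (!deflater.finished()) {
            size += deflater.deflate(buf);     // accumulate total compressed bytes
        }
        deflater.end();
        return size;
    }

    static byte[] randomText(String alphabet, int length, Random rng) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            out[i] = (byte) alphabet.charAt(rng.nextInt(alphabet.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        String lower = "abcdefghijklmnopqrstuvwxyz";
        String mixed = lower + lower.toUpperCase();
        int n = 1_000_000;
        System.out.println("26-symbol alphabet: " + compressedSize(randomText(lower, n, rng)) + " bytes");
        System.out.println("52-symbol alphabet: " + compressedSize(randomText(mixed, n, rng)) + " bytes");
    }
}
```

For random data like this the output sizes track the per-symbol entropy, roughly log2(26) ≈ 4.7 bits versus log2(52) ≈ 5.7 bits per character.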


Sorting algorithm to increase correlation with actual order

I have small (n = 50 to 100) "somewhat sorted" vectors, e.g. their rank order correlates with the true rank order at around r = 0.4.
What sorting algorithm would require the fewest pairwise comparisons to increase this correlation to 0.9?
I'm not an expert specifically on approximate sorting algorithms, but Pearson correlation of permutations isn't really a measure considered in the CS theory I've seen. My intuition is that a complicated algorithm with good asymptotic behavior won't scale down well to 50-100 elements, so the first thing I'd try would be Shellsort, but skipping the smallest gaps.
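To make that suggestion concrete, here is a minimal sketch of a "partial Shellsort": run the usual gap passes but stop before the smallest gaps, so the array becomes much more sorted with relatively few comparisons. The gap sequence and the cutoff value are illustrative assumptions, not tuned recommendations.

```java
// Partial Shellsort sketch: gapped insertion-sort passes, stopping before gap 1,
// so the result is "much more sorted" rather than fully sorted.
import java.util.Arrays;

public class PartialShellsort {
    static void partialShellsort(double[] a, int minGap) {
        int n = a.length;
        for (int gap = n / 2; gap >= minGap; gap /= 2) {   // skip passes with gap < minGap
            for (int i = gap; i < n; i++) {
                double tmp = a[i];
                int j = i;
                while (j >= gap && a[j - gap] > tmp) {
                    a[j] = a[j - gap];
                    j -= gap;
                }
                a[j] = tmp;
            }
        }
    }

    public static void main(String[] args) {
        double[] v = {0.9, 0.1, 0.5, 0.7, 0.2, 0.8, 0.3, 0.6, 0.4, 0.0};
        partialShellsort(v, 2);            // omit the final gap-1 (full insertion sort) pass
        System.out.println(Arrays.toString(v));   // not fully sorted, but much closer to it
    }
}
```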

Disjoint-sets for really big data

Are there any enhanced disjoint-set algorithms for really big data (such as more than 2^32 elements and more than 2^32 pairs to union)?
Obviously the biggest problem is that I cannot allocate such a large array, so I'm wondering if there is a better algorithm or a better data structure to accomplish my task.
One way to deal with really big data is to run the computation in external memory. The paper "I/O-Efficient Batched Union-Find and Its Applications to Terrain Analysis" (http://terrain.cs.duke.edu/pubs/union-find.pdf) contains both a theoretical algorithm, consisting of a fairly complex sequence of calls to other batched algorithms, and (in Section 3) a self-contained recursive algorithm that is not as asymptotically efficient but looks as if it might be practical.
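If the data set is only just past the 2^31-1 limit of a single Java array and you do have enough RAM, a simpler option is an ordinary union-find whose parent table is split across several arrays. The sketch below is my own illustration of that idea, not the batched algorithm from the paper; at 8 bytes per element it still needs tens of gigabytes for 2^32 elements, which is exactly the point where the external-memory approach becomes necessary.

```java
// Union-find over long indices, with the parent table split into 1 GiB chunks
// to get around Java's 2^31-1 limit on single-array length.
public class BigUnionFind {
    private static final int CHUNK_BITS = 27;                 // 2^27 longs = 1 GiB per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private final long[][] parent;

    public BigUnionFind(long n) {
        int chunks = (int) ((n + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        parent = new long[chunks][];
        for (int c = 0; c < chunks; c++) {
            long remaining = n - ((long) c << CHUNK_BITS);
            parent[c] = new long[(int) Math.min(CHUNK_SIZE, remaining)];
            for (int i = 0; i < parent[c].length; i++) {
                parent[c][i] = ((long) c << CHUNK_BITS) + i;   // each element starts as its own root
            }
        }
    }

    private long get(long i)         { return parent[(int) (i >>> CHUNK_BITS)][(int) (i & (CHUNK_SIZE - 1))]; }
    private void set(long i, long v) { parent[(int) (i >>> CHUNK_BITS)][(int) (i & (CHUNK_SIZE - 1))] = v; }

    public long find(long x) {
        while (get(x) != x) {
            set(x, get(get(x)));       // path halving keeps trees shallow
            x = get(x);
        }
        return x;
    }

    public void union(long a, long b) {
        long ra = find(a), rb = find(b);
        if (ra != rb) set(ra, rb);
    }
}
```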

Criteria for selecting a sorting algorithm

I was curious to know how to select a sorting algorithm based on the input, so that I can get the best efficiency.
Should it be based on the size of the input, on how the input is arranged (ascending/descending), on the data structure used, etc.?
What matters when choosing algorithms in general, and sorting algorithms in particular, is the following:
(*) Correctness - This is the most important thing. It is worth nothing if your algorithm is super fast and efficient but wrong. In sorting, even if you have two candidates that both sort correctly but you need a stable sort, you will choose the stable algorithm, even if it is less efficient - because it is correct for your purpose and the other is not.
Next are basically trade-offs between running time, required space and implementation time (if you would have to implement something from scratch, rather than use a library, for a minor performance gain, it probably isn't worth it).
Some things to take into consideration when weighing those trade-offs (a small illustrative sketch follows this list):
Size of the input (for example: for small inputs, insertion sort is empirically faster than more advanced algorithms, even though it is O(n^2)).
Location of the input (sorting algorithms for data on disk differ from algorithms for data in RAM, because non-sequential disk reads are much less efficient. The algorithm usually used to sort on disk is a variation of merge sort).
How the data is distributed. If the data is likely to be "almost sorted", even a usually terrible bubble sort may finish in just 2-3 passes and be very fast compared to other algorithms.
What libraries you already have implemented. How much work will it take to implement something new? Will it be worth it?
Type (and range) of the input - for enumerable data (integers, for example), an integer-specific algorithm (like radix sort) might be more efficient than a general-purpose comparison sort.
Latency requirements - if you are designing a missile head and the result must be returned within a specific amount of time, quicksort, which can decay to quadratic running time in the worst case, might not be a good choice, and you might prefer an algorithm with a strict O(n log n) worst case instead.
Your hardware - if, for example, you have a huge cluster and huge data, a distributed sorting algorithm will probably be better than trying to do all the work on one machine.
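As a concrete (and deliberately simplified) illustration of the size criterion, here is a sketch of a dispatcher that uses insertion sort below a small cutoff and falls back to the library sort otherwise. The cutoff value is an assumption for the example, similar in spirit to the insertion-sort threshold used inside the JDK's dual-pivot quicksort, not a tuned recommendation.

```java
// Illustrative sketch only: pick a sort based on input size.
import java.util.Arrays;

public class SortChooser {
    static final int SMALL = 47;      // assumed cutoff; tune for your data and hardware

    static void sort(int[] a) {
        if (a.length <= SMALL) {
            insertionSort(a);         // low constant factors win on tiny inputs despite O(n^2)
        } else {
            Arrays.sort(a);           // library sort for the general case
        }
    }

    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }
}
```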
It should be based on all of those things.
You need to take the size of your data into account, since insertion sort can be faster than quicksort for small data sets, etc.
You need to know how your data is arranged, because the algorithms differ in their worst/average/best-case asymptotic running times (some have identical worst and average cases, while others have a worst case that is significantly worse than their average case).
And you obviously need to know the data structure used: there are very specialized sorting algorithms if your data is already in a special format, and sometimes you can efficiently put the data into a new structure that does the sorting for you (à la a BST or a heap).
The two main things that determine your choice of a sorting algorithm are time complexity and space complexity. Depending on your scenario and the resources (time and memory) available to you, you might need to choose between sorting algorithms based on what each one has to offer.
The actual performance of a sorting algorithm also depends on the input data, and it helps if we know certain characteristics of the input beforehand, such as its size and how sorted it already is.
For example,
if you know beforehand that the input data consists of only 1000 non-negative integers in a known small range, you can very well use counting sort to sort such an array in linear time.
The choice of a sorting algorithm depends on the constraints of space and time, and also the size/characteristics of the input data.
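The counting-sort example mentioned above might look like the following minimal sketch; the maximum-value parameter is an assumption needed to size the count array.

```java
// Counting sort for n non-negative integers with a known small maximum value,
// running in O(n + maxValue) time and O(maxValue) extra space.
public class CountingSortExample {
    static int[] countingSort(int[] a, int maxValue) {
        int[] count = new int[maxValue + 1];
        for (int x : a) count[x]++;                   // histogram of values
        int[] out = new int[a.length];
        int idx = 0;
        for (int v = 0; v <= maxValue; v++) {         // emit each value count[v] times
            for (int c = 0; c < count[v]; c++) out[idx++] = v;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 8, 3, 0, 9, 5, 1};
        System.out.println(java.util.Arrays.toString(countingSort(data, 9)));
    }
}
```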
At a very high level you need to consider the ratio of element moves (insertions) to comparisons in each algorithm.
For integers in a file this isn't going to be hugely relevant, but if, say, you're sorting files based on their contents, comparisons are expensive and you'll naturally want to do as few of them as possible.

Algorithm to align elements of differing sizes in fixed-size buckets

The problem:
1) I have buckets of fixed size, in my case 64. This cannot change.
2) Values vary in size, but are never bigger than a bucket (64).
3) Access is much slower if any element is split between buckets.
Is there some algorithm that computes the optimal order of the elements in the buckets?
There are two variations here, and I'm interested in both, to enable the user of the code to choose between speed and memory usage:
A) Splitting is allowed, but should be minimized.
B) Splitting is not allowed, and padding should be minimized.
Please either post the algorithms, a link to them, or at least their names, if they are "well known". An Internet search did not return anything useful, probably because the answer was drowned in irrelevant results like optimal bucket sizes and partitioning data in hash tables.
The target language is Java, but I don't think it should make a difference.
I think you could use memory-allocation placement strategies such as first fit, best fit, or worst fit. They are only heuristic approximations of the optimal solution; your original problem is essentially the bin-packing problem (buckets are bins of capacity 64), which is NP-hard to solve optimally. Try the approximation algorithms, and for variation A split an element only when you cannot place it without splitting.
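A minimal sketch of one such heuristic for variation B (no splitting) is first-fit decreasing: sort the element sizes in decreasing order, then put each element into the first bucket with enough remaining capacity, opening a new bucket when none fits. The bucket capacity of 64 comes from the question; everything else is an illustrative assumption.

```java
// First-fit-decreasing heuristic for packing variable-size elements into fixed-size buckets.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FirstFitDecreasing {
    static final int BUCKET_SIZE = 64;

    static List<List<Integer>> pack(List<Integer> sizes) {
        List<Integer> sorted = new ArrayList<>(sizes);
        sorted.sort(Collections.reverseOrder());          // place largest elements first
        List<List<Integer>> buckets = new ArrayList<>();
        List<Integer> free = new ArrayList<>();           // remaining capacity per bucket
        for (int size : sorted) {
            int target = -1;
            for (int b = 0; b < buckets.size(); b++) {
                if (free.get(b) >= size) { target = b; break; }   // first bucket that fits
            }
            if (target == -1) {                           // no room anywhere: open a new bucket
                buckets.add(new ArrayList<>());
                free.add(BUCKET_SIZE);
                target = buckets.size() - 1;
            }
            buckets.get(target).add(size);
            free.set(target, free.get(target) - size);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(pack(List.of(40, 30, 30, 20, 20, 10, 60, 5)));
    }
}
```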

Best data structure to store one million values?

Please mention the time complexity and the best data structure to store these values, when the values are:
Integers
Strings (dictionary like sorting)
I know Counting sort is preferred when integers are in a small range.
Thanks.
Edit:
Sorry, I asked a slightly different question. The actual question is: what would be the best data structure to store these values if the integers are phone numbers (and the strings are names), and then what is the best sorting algorithm?
Have a look at:
B-trees and red-black trees.
You should be able to find open-source implementations of each of these. (Note: I'm assuming that you want to maintain a sorted structure, rather than just sorting once and forgetting.)
For an overview of the options, see the Wikipedia article on sorting algorithms.
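If you do want to keep the data permanently sorted, as that answer assumes, Java's TreeMap (a red-black tree) is a ready-made option with O(log n) insert/lookup and sorted iteration. The phone-number-to-name mapping below follows the edited question; the sample entries are made up for illustration.

```java
// Maintaining a permanently sorted phone-number -> name mapping with a red-black tree.
import java.util.Map;
import java.util.TreeMap;

public class SortedPhoneBook {
    public static void main(String[] args) {
        TreeMap<Long, String> byNumber = new TreeMap<>();
        byNumber.put(9876543210L, "Alice");    // placeholder entries
        byNumber.put(9123456789L, "Bob");
        byNumber.put(9988776655L, "Carol");
        for (Map.Entry<Long, String> e : byNumber.entrySet()) {   // iterates in key order
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```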
Merge sort and quicksort are pretty good: merge sort is O(n log n) in every case, and quicksort is O(n log n) on average (though quadratic in its worst case).
How about a heap? Relatively easy to implement and pretty fast. For strings, you could use a trie together with something like burstsort, which is supposedly the fastest string-sorting algorithm in its class.
For most sorting algorithms there is an in-place version, so a simple array may be sufficient. For strings you may consider a trie (http://en.wikipedia.org/wiki/Trie), which could save space. The right sorting algorithm depends on a lot of factors, e.g. whether the data may already be fully or partially sorted. Of course, if you have just a few distinct values, counting sort, bucket sort, etc. can be used.
On a 32-bit machine, a million integers fit in an array of 4 million bytes. 4 MB isn't all that much; it will fit in this system's memory 500 times over (and this machine isn't that beefy by modern standards). A million string references will take about the same space, plus the storage for the strings themselves; for short strings that's still no problem, so slurp it all in. You can even have an array of pointers to structures holding an integer and a reference to a string; it will all fit just fine. It's only when you're dealing with much more data than that (e.g., a billion items) that you need to take special measures, data-structure-wise.
For sorting that many things, choose an algorithm that is O(n log n) instead of one that is O(n^2). The O(n) algorithms are only useful when you've got particularly compact key spaces, which is pretty rare in practice. Choosing which algorithm from the set that are O(n log n) is a matter of balancing speed and other good properties such as stability.
If you're doing this for real, use a database with appropriate indices instead of futzing around with doing this all by hand.
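To make "slurp it all in" concrete: the sketch below stores a million (phone number, name) pairs as plain objects in an array and sorts them with the library's O(n log n) sort. The record type, field names and the randomly generated values are all illustrative assumptions.

```java
// Sorting a million (phone, name) records in memory with the library sort.
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class MillionRecords {
    record Entry(long phone, String name) {}

    public static void main(String[] args) {
        Random rng = new Random(1);
        Entry[] entries = new Entry[1_000_000];
        for (int i = 0; i < entries.length; i++) {
            // placeholder phone numbers and names
            entries[i] = new Entry(6_000_000_000L + rng.nextInt(1_000_000_000), "name" + i);
        }
        Arrays.sort(entries, Comparator.comparingLong(Entry::phone));   // sort by phone number
        System.out.println(entries[0] + " ... " + entries[entries.length - 1]);
    }
}
```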
