Why is Counting Sort made harder? - algorithm

I was reading: https://en.wikipedia.org/wiki/Counting_sort and https://www.geeksforgeeks.org/counting-sort/
There is one little detail that I don't get at all: why complicate things when they can be so much easier? What's the problem with allocating an array of size k, where the numbers lie in the range [1...k], counting how many times each number appears, and finally walking down the array and printing each number according to the counter in its cell?

"What's the problem of allocating an array of size k where the field of numbers is [1...k] and count how many times each number appeared and lastly walking down the array and printing according to the counter in each cell."
From your phrase "how many times each number appeared", it sounds like you're picturing an array of positive integers, where you want to sort them in increasing order, and where you can use those integers directly as indices in your helper array?
But that's not what the Wikipedia article describes. The algorithm in the Wikipedia article is for an array whose elements can have whatever data-type we choose, provided there's a function key that maps from that data-type to the set of indices in the helper array, with the property that we want to stably sort elements according to the result of key (so, if key(x) < key(y) then we want to sort x before y, and if key(x) = key(y) then we want to keep x and y in the same order they originally had).
In particular, the counting-sort algorithm in the Wikipedia article is useful as a component of radix sort: first you sort by the last digit (using a key function that gives the last digit of a number), then by the second-to-last digit, and so on, until an array of numbers is sorted.
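As a rough illustration of the key-based, stable variant described above, here is a minimal Python sketch. The names counting_sort and radix_sort, and the parameter k (the number of distinct key values), are chosen for this example and are not taken from the Wikipedia article.

```python
def counting_sort(items, key, k):
    """Stably sort items, assuming key(item) is an integer in range(k)."""
    # Count how many items have each key value.
    counts = [0] * k
    for item in items:
        counts[key(item)] += 1

    # Prefix sums: positions[v] is the output index of the first item with key v.
    positions = [0] * k
    total = 0
    for v in range(k):
        positions[v] = total
        total += counts[v]

    # Walk the input in order so that equal keys keep their original order.
    output = [None] * len(items)
    for item in items:
        v = key(item)
        output[positions[v]] = item
        positions[v] += 1
    return output


def radix_sort(numbers, digits):
    """Sort non-negative integers with at most `digits` decimal digits."""
    for d in range(digits):
        # Sort by the d-th digit from the right; stability preserves earlier passes.
        numbers = counting_sort(numbers, lambda n, d=d: (n // 10**d) % 10, 10)
    return numbers
```

The simpler "count and print" version from the question only works when the elements are themselves the keys; here the elements are carried along, which is why the second pass over the input is needed.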
"There is one little detail which I don't get at all, why to complicate things where they can be so much easier?"
A pro tip: we all usually think that our own code is "easier" and that other people are "complicating things", because code is easier to write than to read, so the code that we understand best is the code that we've come up with ourselves.
As it happens, in this case the Wikipedia code really is more complicated, because it serves a much more general use-case than you were picturing; but in general, it's not a good idea to just assume that everyone will agree that your code is the easy version and that others' is unnecessarily complicated.

Related

What is the name for this sorting algorithm?

So, I work in industrial automation, and normally program with ladder logic. So it's rather odd compared to what I would consider normal programming. Anyway, I needed to sort a list of numbers from smallest to biggest. I was looking through sorting algorithms trying to find one I could easily implement using ladder logic. I was having a hard time, but after some thinking I came up with something that wasn't even on the Wikipedia list of sorting algorithms. Well, it might be, but I can't find it. I know this isn't a very efficient sorting algorithm, but it does work. I want to know the name of it if it has one.
The basic version of this is: imagine an array of numbers. Take the first number in the list and compare it to all other numbers in the list, counting how many of the other numbers it is bigger than. This accumulated value is the index number for where it goes in the output array. To place it in the array, check if there is already something written to that spot; if there is, add one to the index and check again until the spot is empty. When the empty spot is found, write the number to the output array. Once you have done that for every number in the list, you will have an output array with the same size as the input, but sorted smallest to biggest. I should note that this assumes the language uses zero-based indexing.
If this wasn't clear enough, I'm happy to elaborate further if needed.
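For readers who prefer code, the procedure described above might look roughly like this in Python. This is only a sketch of the description, not the original ladder-logic program; the name rank_sort and the use of None as the "empty slot" marker are assumptions made for the example, so the input must not contain None.

```python
def rank_sort(values):
    """Place each value at the index equal to the number of smaller values."""
    output = [None] * len(values)  # None marks a slot nothing has been written to
    for v in values:
        # Count how many numbers in the list are smaller than v.
        index = sum(1 for other in values if other < v)
        # If the slot is taken (a duplicate of v got there first), move along.
        while output[index] is not None:
            index += 1
        output[index] = v
    return output
```

Each value is compared against the whole list, so the running time is quadratic in the length of the input.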
I would say it's a worse version of counting sort:
"It operates by counting the number of objects that possess distinct key values, and applying prefix sum on those counts to determine the positions of each key value in the output sequence"
So it basically does the same thing you're doing: put each element in its final position by using counts. Counting sort uses an array to store the needed counts, whereas you iterate over the whole array to recompute them at each step for the current element.
I don't think there's a name for your exact algorithm.

Understanding the Count Sketch data structure and associated algorithms

Working on wrapping my head around the CountSketch data structure and its associated algorithms. It seems to be a great tool for finding common elements in streaming data, and the additive nature of it makes for some fun properties with finding large changes in frequency, perhaps similar to what Twitter uses for trending topics.
The paper is a little difficult to understand for someone who has been away from more academic approaches for a while. A previous post here did help some, but for me at least it still left quite a few questions.
As I understand it, the Count Sketch structure is similar to a Bloom filter. However, the selection of hash functions has me confused. The structure is an N by M table with N hash functions, each with M possible values, determining the "bucket" to alter, and another hash function s for each of the N rows that is "pairwise independent".
Are the hashes to be selected from a universal hashing family, say something of the form h(x) = ((ax+b) % some_prime) % M?
And if so, where are the s hashes that return either +1 or -1 chosen from? And what is the reason for ever subtracting from one of the buckets?
They subtract from the buckets to make the average effect of the additions and subtractions caused by other items' occurrences equal to 0. If half the time I add the count of 'foo', and half the time I subtract the count of 'foo', then in expectation, the count of 'foo' does not influence the estimate of the count for 'bar'.
Picking a universal hash function like you describe will indeed work, but it's mostly important for the theory rather than the practice. Salting your favorite reasonable hash function will work too; you just can't meaningfully write proofs based on the expected values using a few fixed hash functions.
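To make the add/subtract idea concrete, here is a small Python sketch of a Count Sketch that uses the "salted hash" shortcut mentioned above rather than a pairwise-independent family, so it illustrates the mechanics but not the proofs. The class and method names are invented for this example, and taking the median across rows as the estimate is the usual textbook formulation rather than something stated in this thread.

```python
import hashlib
from statistics import median


class CountSketch:
    def __init__(self, rows, width):
        self.rows = rows      # N: number of hash functions (use an odd number)
        self.width = width    # M: buckets per row
        self.table = [[0] * width for _ in range(rows)]

    def _bucket_and_sign(self, row, item):
        # One hash function salted with the row number stands in for the
        # per-row pair (h_i, s_i); this is an assumption of the sketch.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1          # s_i(item): +1 or -1
        bucket = (value >> 1) % self.width     # h_i(item): which bucket to alter
        return bucket, sign

    def add(self, item, count=1):
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        guesses = []
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, item)
            # Multiplying by the sign undoes the item's own +/-; colliding
            # items contribute with random signs and tend to cancel out.
            guesses.append(sign * self.table[row][bucket])
        return median(guesses)
```

With a single item in the sketch the estimate is exact; with many items, collisions add noise whose expected value is zero, which is the point of the signs.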

What is a good way to find pairs of numbers, each stored in a different array, such that the difference between the first and second number is 1?

Suppose you have several arrays of integers. What is a good way to find pairs of integers, not both from the same list, such that the difference between the first and second integer is 1?
Naturally I could write a naive algorithm that just looks through each other list until it finds such a number or hits one bigger. Is there a more elegant solution?
I only mention the condition that the difference be 1 because I'm guessing there might be some use to that knowledge to speed up the computation. I imagine that if the condition for a 'hit' were something else, the algorithm would work just the same.
Some background: I'm engaged in a bit of research mathematics and I seek to find examples of a certain construction. Any help would be much appreciated.
I'd start by sorting each array, preferably with an algorithm that runs in O(n log n) time.
When you've got a bunch of sorted arrays, you can set a pointer to the start of each array, check for any +/- 1 differences in the values of the pointers, and increment the value of the smallest-valued pointer, repeating until you've reached the max length of all but one of the arrays.
To further optimize, you could keep the pointers-values in a sorted linked list, and build the check function into an insertion sort. For each increment, you could remove the previous value from the list, and step through the list checking for +/- 1 comparison until you get to a term that is larger than a possible match. That way, if you're searching a bazillion arrays, you needn't check all bazillion pointer-values - you only need to check until you find a value that is too big, and ignore all larger values.
If you've got any more information about the arrays (such as the range of the terms or number of arrays), I can see how you could take advantage of that to make much faster algorithms for this through clever uses of array properties.
This sounds like a good candidate for the classic merge sort where the final stage is not a unification but comparison.
And the magnitude of the difference wouldn't affect this, but thanks for adding the information.
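One way to read the "merge, but compare instead of unify" idea is sketched below in Python. It assumes the goal is to report each pair of values (v, v + 1) whose members can be taken from two different arrays; the function name adjacent_pairs is made up for this example.

```python
import heapq
from itertools import groupby


def adjacent_pairs(arrays):
    """Return (v, v + 1) pairs whose two members come from different arrays."""
    # Sort each array and tag every value with the index of its source array.
    tagged = [sorted((v, i) for v in arr) for i, arr in enumerate(arrays)]
    merged = heapq.merge(*tagged)  # one sorted stream over all arrays

    pairs = []
    prev_value, prev_sources = None, set()
    for value, group in groupby(merged, key=lambda t: t[0]):
        sources = {i for _, i in group}  # arrays that contain this value
        if prev_value is not None and value - prev_value == 1:
            # The pair is rejected only if both values live in the same single array.
            if len(prev_sources | sources) > 1:
                pairs.append((prev_value, value))
        prev_value, prev_sources = value, sources
    return pairs
```

Because the merged stream is sorted, only neighbouring distinct values need to be compared, so the work after sorting is linear in the total number of elements (up to the log factor of the heap-based merge).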
Even though you state the second list is in an array, if you could put it in a hashmap of some sort then you could make it faster than just the naive approach.
Basically,
Loop through the first array.
Look to see if there exists an object in the hashmap that is one larger than the current array value.
That way you can build up pairs of numbers that meet your requirements.
I don't know if it would be as flexible as you would like though.
Basically, you may want to consider other data structures, to help you find a better solution.
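A minimal Python sketch of the hash-based lookup for two arrays follows. It uses a set in place of a hashmap and checks both x - 1 and x + 1, which goes slightly beyond the "one larger" check described above; the function name is hypothetical.

```python
def pairs_via_set(first, second):
    """Return (x, y) with x from first, y from second, and |x - y| == 1."""
    lookup = set(second)  # average O(1) membership tests
    pairs = []
    for x in first:
        for y in (x - 1, x + 1):
            if y in lookup:
                pairs.append((x, y))
    return pairs
```

Swapping the two candidate values for any other function of x is how a different "hit" condition would be handled.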
You have O(n log n) from the sorting.
You can also do the search in O(log n) for each element, if you have some dynamic query set. You can sort the arrays and then, for each element in the first array, binary search its upper_bound and lower_bound in the second array and check the difference.
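The binary-search variant can be sketched in Python with the standard bisect module, which plays the role of lower_bound here. The helper names are invented for this example, and the sketch looks up x - 1 and x + 1 directly instead of inspecting both bounds.

```python
from bisect import bisect_left


def contains(sorted_values, target):
    """Binary search: is target present in the sorted list?"""
    i = bisect_left(sorted_values, target)
    return i < len(sorted_values) and sorted_values[i] == target


def pairs_via_binary_search(first, second):
    """Return (x, y) with x from first, y from second, and |x - y| == 1."""
    second = sorted(second)  # O(n log n) once, then O(log n) per lookup
    return [(x, y)
            for x in first
            for y in (x - 1, x + 1)
            if contains(second, y)]
```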

Grouping items in an array?

Hey guys, if I have an array that looks like [A,B,C,A,B,C,A,C,B] (random order), and I wish to arrange it into [A,A,A,B,B,B,C,C,C] (each group is together), and the only operations allowed are:
1) query the i-th item of the array
2) swap two items in the array.
How to design an algorithm that does the job in O(n)?
Thanks!
Sort algorithms aren't something you design fresh (i.e. first step of your development process) anymore; you should research known sort algorithms and see what meets your needs.
(It is of course possible you might really require your own new sort algorithm, but usually that has different—and highly-specific—requirements.)
If this isn't your first step (but I don't think that's the case), it would be helpful to know what you've already tried and how it failed you.
This is actually just counting sort.
Scan the array once, count the number of As, Bs, Cs—that should give you an idea. This becomes like bucket sort—not quite but along those lines. The count of As Bs and Cs should give you an idea about where the string of As, Bs and Cs belongs.
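Here is one possible Python sketch of how the counts can drive the swaps. It is an in-place variant of counting sort that only reads items and swaps them, assuming the group labels are hashable and can be ordered; the function name and the bookkeeping dictionaries are made up for this example.

```python
from collections import Counter


def group_in_place(items):
    """Group equal items together using only reads and swaps."""
    counts = Counter(items)  # first scan: how many of each label

    # Each label owns a contiguous region [start, end) of the array.
    next_free, end = {}, {}
    pos = 0
    for label in sorted(counts):
        next_free[label] = pos
        pos += counts[label]
        end[label] = pos

    for label in sorted(counts):
        while next_free[label] < end[label]:
            i = next_free[label]
            item = items[i]
            if item == label:
                next_free[label] += 1  # already in its own region
            else:
                # Send the item to the next free slot of its own region.
                j = next_free[item]
                items[i], items[j] = items[j], items[i]
                next_free[item] += 1
    return items
```

Every swap puts at least one item into its final region, so there are at most n swaps on top of the counting scan, which keeps the whole thing O(n) for a fixed number of groups.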

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project, I won't bother to summarize it here, but this section of the project is to take a very large document of text (minimum of around 50,000 words (not unique)), and output each unique word in order of most used to least used (probably top three will be "a" "an" and "the").
My question is, of course: what would be the best sorting algorithm to use? I was reading about counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much; it will easily fit in memory, so there's nothing to worry about. In C++ you can use the standard library's std::map.
Then, once you have the map, you can copy all the map keys to a vector.
Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
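The same three steps (build the map, copy the keys, sort by count) can be sketched in Python, with collections.Counter standing in for std::map. The function name and the whitespace-based word splitting are simplifying assumptions for this example.

```python
from collections import Counter


def words_by_frequency(text):
    """Return the unique words of text, most frequent first."""
    counts = Counter(text.lower().split())  # map of word -> count
    words = list(counts)                    # copy the keys into a list
    # Custom comparison: order by the count stored in the map, not by the word.
    words.sort(key=lambda w: counts[w], reverse=True)
    return words
```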
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
Have a look at the link "Sorting Algorithms". It gives a pictorial representation of how different algorithms work. This will give you a hint!
You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.
First step: Create a hash map with the words as key values and frequency as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.
Second step: Create a list whose number of entries is the highest frequency from the first step plus one (so that, with zero-based indexing, the highest frequency is a valid index). Each slot in this list will hold a list of the words whose frequency count equals the slot's index. So words that occur 3 times in the document will go in list[3], for example. Iterate through the hash map and insert the words into the appropriate spots in the list. This step is O(n) complexity.
Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.
Overall, this algorithm will accomplish your task in O(n) time rather than the O(n log n) required by quicksort.
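The three steps above could be sketched in Python as follows. The function name is hypothetical, and the input is assumed to be an already tokenized list of words.

```python
from collections import Counter


def words_by_frequency_buckets(words):
    """Return unique words, most frequent first, without a comparison sort."""
    # First step: hash map of word -> frequency, plus the highest frequency.
    counts = Counter(words)
    if not counts:
        return []
    highest = max(counts.values())

    # Second step: buckets[f] holds every word that occurs exactly f times.
    buckets = [[] for _ in range(highest + 1)]
    for word, freq in counts.items():
        buckets[freq].append(word)

    # Third step: walk the buckets from the highest frequency down.
    result = []
    for freq in range(highest, 0, -1):
        result.extend(buckets[freq])
    return result
```

Words with equal frequency come out in whatever order the hash map yields them, which matches the assumption stated at the start of this answer.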
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something as explained in the below post:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages which support closures make the solution much easier, like LINQ as Eric mentioned.
For large sets you can use what is known as the "sort based indexing" in information retrieval, but for 50,000 words you can use the following:
1) Read the entire file into a buffer.
2) Parse the buffer and build a token vector of struct token { char *term; int termlen; }; where term is a pointer to the word in the buffer and termlen is its length.
3) Sort the token vector by term (lexicographical order).
4) Set entrynum = 0 and iterate through the token vector. When a term is new, store it in a second vector of struct { char *term; int frequency; } at index entrynum, set its frequency to 1, and increment entrynum; otherwise increment the frequency of the current entry.
5) Sort the second vector by frequency in descending order.
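The same sort-then-count idea can be sketched compactly in Python, where itertools.groupby collapses each run of equal terms in the sorted token list. This is an illustration rather than a translation of the C structures; the function name and whitespace tokenization are assumptions.

```python
from itertools import groupby


def term_frequencies(text):
    """Return (term, frequency) pairs, most frequent first."""
    tokens = sorted(text.split())  # sort the tokens lexicographically
    # Equal terms are now adjacent, so each run yields one entry.
    entries = [(term, sum(1 for _ in run)) for term, run in groupby(tokens)]
    # Final pass: order the entries by frequency, descending.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries
```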
You can also try implementing a digital tree, also known as a trie. Here is the link

Resources