Grouping items in an array? - sorting

Hey guys, if I have an array that looks like [A,B,C,A,B,C,A,C,B] (random order), and I wish to arrange it into [A,A,A,B,B,B,C,C,C] (each group is together), and the only operations allowed are:
1) query the i-th item of the array
2) swap two items in the array.
How can I design an algorithm that does the job in O(n)?
Thanks!

Sort algorithms aren't something you design fresh (i.e. first step of your development process) anymore; you should research known sort algorithms and see what meets your needs.
(It is of course possible you might really require your own new sort algorithm, but usually that has different—and highly-specific—requirements.)
If this isn't your first step (but I don't think that's the case), it would be helpful to know what you've already tried and how it failed you.

This is actually just counting sort.
Scan the array once and count the number of As, Bs, and Cs. This is similar to bucket sort, though not quite the same. The counts tell you where each run of As, Bs, and Cs belongs in the final array: the As occupy the first count(A) positions, the Bs the next count(B) positions, and the Cs the rest.
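A minimal sketch of that idea in Python, using only the two allowed operations (reading items[i] and swapping two items). The names group_in_place and order are made up for illustration, and the sketch assumes the set of distinct keys is known in advance:

```python
def group_in_place(items, order):
    """Group equal keys together using only reads and swaps.

    `order` lists the distinct keys in the desired group order
    (an assumption of this sketch, e.g. "ABC").
    """
    # Pass 1: count how many items carry each key.
    counts = {k: 0 for k in order}
    for i in range(len(items)):
        counts[items[i]] += 1

    # Each key owns a contiguous region [next_slot[k], end[k]).
    next_slot, end = {}, {}
    pos = 0
    for k in order:
        next_slot[k] = pos
        pos += counts[k]
        end[k] = pos

    # Pass 2: fill each region; every step advances some next_slot,
    # so the total work is O(n).
    for k in order:
        while next_slot[k] < end[k]:
            i = next_slot[k]
            cur = items[i]
            if cur == k:
                next_slot[k] += 1
            else:
                j = next_slot[cur]  # next free slot in cur's own region
                items[i], items[j] = items[j], items[i]
                next_slot[cur] += 1
    return items
```

Each loop iteration either confirms an item that is already in place or swaps one item into its final region, so the number of swaps is at most n.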

Related

Why Counting Sort is made harder?

I was reading: https://en.wikipedia.org/wiki/Counting_sort and https://www.geeksforgeeks.org/counting-sort/
There is one little detail which I don't get at all: why complicate things when they can be so much easier? What's the problem with allocating an array of size k, where the range of the numbers is [1...k], counting how many times each number appeared, and finally walking down the array and printing according to the counter in each cell?
"What's the problem with allocating an array of size k, where the range of the numbers is [1...k], counting how many times each number appeared, and finally walking down the array and printing according to the counter in each cell?"
From your phrase "how many times each number appeared", it sounds like you're picturing an array of positive integers, where you want to sort them in increasing order, and where you can use those integers directly as indices in your helper array?
But that's not what the Wikipedia article describes. The algorithm in the Wikipedia article is for an array whose elements can have whatever data-type we choose, provided there's a function key that maps from that data-type to the set of indices in the helper array, with the property that we want to stably sort elements according to the result of key (so, if key(x) < key(y) then we want to sort x before y, and if key(x) = key(y) then we want to keep x and y in the same order they originally had).
In particular, the counting-sort algorithm in the Wikipedia article is useful as a component of radix sort: first you sort by the last digit (using a key function that gives the last digit of a number), then by the second-to-last digit, and so on, until an array of numbers is sorted.
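To make the difference concrete, here is a rough Python sketch of the general, stable version with a key function, plus a radix sort built on top of it. The names counting_sort, radix_sort, k and digits are illustrative and not taken from the Wikipedia pseudocode:

```python
def counting_sort(items, key, k):
    """Stably sort items by key(item), where key(item) is an int in range(k)."""
    counts = [0] * k
    for x in items:
        counts[key(x)] += 1

    # Turn the counts into starting offsets (exclusive prefix sums).
    total = 0
    for i in range(k):
        counts[i], total = total, total + counts[i]

    # Place each item at the next free offset for its key; scanning the
    # input left to right keeps equal keys in their original order.
    out = [None] * len(items)
    for x in items:
        out[counts[key(x)]] = x
        counts[key(x)] += 1
    return out


def radix_sort(nums, digits):
    """Sort non-negative ints with at most `digits` decimal digits."""
    for d in range(digits):
        nums = counting_sort(nums, key=lambda n, d=d: (n // 10 ** d) % 10, k=10)
    return nums
```

The simple "count and print" version cannot do this, because it throws away the original elements and keeps only their keys; radix sort depends on each pass being stable.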
"There is one little detail which I don't get at all: why complicate things when they can be so much easier?"
A pro tip: we all usually think that our own code is "easier" and that other people are "complicating things", because code is easier to write than to read, so the code that we understand best is the code that we've come up with ourselves.
As it happens, in this case the Wikipedia code really is more complicated, because it serves a much more general use-case than you were picturing; but in general, it's not a good idea to just assume that everyone will agree that your code is the easy version and that others' is unnecessarily complicated.

Search multiple lists by indirect ids

There are x (x=3 in this example) unsorted lists with identifiers:
list1 list2 list3
array1[id3], array2[id4,id4a], array3[id1a,id1b]
array1[id4], array2[id3,id3a], array3[id4a,id4b]
array1[id1], array2[id2,id2a], array3[id3a,id3b]
array1[id2], array2[id1,id1a], array3[id2a,id2b]
...
array1[idn], array2[idn,idna], array3[idna,idnb]
I want to make pairs: {id1,id1b}, {id2,id2b} and so on. Sadly, I cannot do it directly. This is how it works: take id3 from list1, find id3 in list2, take id3a from list2, find id3a in list3, and finally get id3b.
It could be done with nested loops but what if there were more lists? Seems to be inefficient. Is there a better solution?
The only better solutions algorithmically would require a different representation. For example, if the lists can be sorted, then searches to get from key1->key2->key3->value could all be binary searches. That's probably the easiest and least intrusive solution to implement if you can just slightly change the data representation to be sorted.
If you use a different data structure outright like multiple hash tables, then each search could be constant-time (assuming no collisions). You could even consolidate this all to a single hash table with a 3-part key that maps to a single hash index storing the value.
You could also use BSTs, possibly tries, etc., but all of these algorithmic improvements will hinge on a different data representation.
Any search through an unsorted list is generally going to be O(N), since we cannot make any assumptions and may have to search the entire list. With three lists, a naive triple-nested loop is cubic, O(N^3); even if you chain the lookups (for each key in list1, one linear search in list2 and then one in list3), the total is still quadratic, O(N^2), which doesn't scale very well.
Without changing the data representation, I think linear-time searches for each unsorted list is as good as you can get (and yes, that could be quite horrible), and you're probably looking at micro-optimizations like multithreading or SIMD.
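As a rough illustration of the hash-table suggestion, here is a Python sketch that builds one dictionary per intermediate list and then follows the chain of ids. The function name link_pairs and the tuple layout of the lists are assumptions made for this example:

```python
def link_pairs(list1, list2, list3):
    """Follow id -> ida -> idb through two lookup tables.

    list1: iterable of ids
    list2: iterable of (id, ida) pairs
    list3: iterable of (ida, idb) pairs
    """
    step2 = dict(list2)  # id  -> ida, built in O(N)
    step3 = dict(list3)  # ida -> idb, built in O(N)

    pairs = []
    for key in list1:
        mid = step2.get(key)
        if mid is None:
            continue  # no matching entry in list2
        last = step3.get(mid)
        if last is not None:
            pairs.append((key, last))
    return pairs
```

Building the dictionaries costs one pass over each list, and every lookup afterwards is expected constant time, so adding more lists adds one more dictionary rather than one more level of nesting.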
I forgot to mention that after each iteration I'll get a new set of lists.
For example, in the first iteration:
array1[id1], array2[id2,id2a], array3[id3a,id3b]
array1[id2], array2[id1,id1a], array3[id2a,id2b]
In the second one:
array1[id3], array2[id4,id4a], array3[id1a,id1b]
array1[id4], array2[id3,id3a], array3[id4a,id4b]
etc. So if I touch the keys to link them together in one iteration, I will have to do the same in the next one with the new set. It looks like each auxiliary structure has to be rebuilt. Is it worthwhile then? No doubt it depends, but roughly?

Optimized Algorithm: Fastest Way to Derive Sets

I'm writing a program for a competition and I need to be faster than all the other competitors. For this I need a little algorithm help; ideally I'd be using the fastest algorithm.
For this problem I am given 2 things. The first is a list of tuples, each of which contains exactly two elements (strings), each of which represents an item. The second is an integer, which indicates how many unique items there are in total. For example:
# of items = 3
[("ball","chair"),("ball","box"),("box","chair"),("chair","box")]
(The same tuple can be repeated; they are not necessarily unique.) My program is supposed to figure out the maximum number of tuples that can "agree" when the items are sorted into two groups. That is, if all the items are split into two ideal groups, group 1 and group 2, what is the maximum number of tuples that can have their first item in group 1 and their second item in group 2?
For example, the answer to my earlier example would be 2, with "ball" in group 1 and "chair" and "box" in group 2, satisfying the first two tuples. I do not necessarily need to know which items go in which group; I just need to know what the maximum number of satisfied tuples could be.
At the moment I'm trying a recursive approach, but it's running in O(n^2), far too inefficient in my opinion. Does anyone have a method that could produce a faster algorithm?
Thanks!!!!!!!!!!
Speed up approaches for your task:
1. Use integers
Convert the strings to integers: store the strings in an array and use each string's position in the tuples.
String[] words = {"ball", "chair", "box"};
In the tuples, ball now has number 0 (position 0 in the array), chair 1, box 2.
Comparing ints is faster than comparing Strings.
2. Avoid recursion
Recursion tends to be slow because of the call overhead.
For example, look at the binary search algorithm in a recursive implementation, then look at how Java implements Arrays.binarySearch() (with a while loop and iteration).
Recursion is helpful when a problem is so complex that a non-recursive implementation is too hard to reason about.
Iteration is faster, but not when you merely mimic the recursive calls by implementing your own stack.
However, you can start with a recursive algorithm; once it works and proves to be suitable, try to convert it to a non-recursive implementation.
3. If possible, avoid objects
If you want the fastest solution, this is where it gets ugly!
A tuple array can be stored either as an array of class Point(x,y) or, probably faster, as an array of int.
Example:
(1,2), (2,3), (3,4) can be stored as the array (1,2,2,3,3,4)
This needs much less memory because an object needs at least 12 bytes (in Java).
Less memory means more speed when the arrays are really big: the flat structure will hopefully fit in the processor cache, while the array of objects does not.
4. Programming language
An implementation in C will probably be faster than one in Java.
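Points 1 and 3 can be sketched together. The snippet below is in Python for brevity rather than Java, and encode_tuples is a made-up helper name: it maps each string to a small integer and stores the tuples in one flat list of ints.

```python
def encode_tuples(tuples):
    """Map item names to consecutive ints and flatten the tuples.

    Returns (ids, flat) where ids maps name -> int and flat holds
    the tuples as [a0, b0, a1, b1, ...].
    """
    ids = {}
    flat = []
    for first, second in tuples:
        for name in (first, second):
            if name not in ids:
                ids[name] = len(ids)  # next unused integer
        flat.append(ids[first])
        flat.append(ids[second])
    return ids, flat
```

Tuple number t then lives at flat[2 * t] and flat[2 * t + 1], so no per-tuple object is needed.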
Maximum cut is a special case of your problem, so I doubt you have a quadratic algorithm for it. (Maximum cut is NP-complete and it corresponds to the case where every tuple (A,B) also appears in reverse as (B,A) the same number of times.)
The best strategy for you to try here is "branch and bound." It's a variant of the straightforward recursive search you've probably already coded up. You keep track of the value of the best solution you've found so far. In each recursive call, you check whether it's even possible to beat the best known solution with the choices you've fixed so far.
One thing that may help (or may hurt) is to "probe": for each as-yet-unfixed item, see if putting that item on one of the two sides leads only to suboptimal solutions; if so, you know that item needs to be on the other side.
Another useful trick is to recurse on items that appear frequently both as the first element and as the second element of your tuples.
You should pay particular attention to the "bound" step --- finding an upper bound on the best possible solution given the choices you've fixed.
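Here is a small Python sketch of the branch-and-bound idea, under some assumptions: the function name max_agreeing is invented, items are fixed in an arbitrary order, and the bound is deliberately the simplest one (count every tuple that has not been ruled out yet). It is still exponential in the worst case; a tighter bound and a better branching order are where the real gains would come from.

```python
def max_agreeing(tuples):
    """Maximum number of tuples (a, b) with a in group 1 and b in group 2."""
    items = sorted({x for t in tuples for x in t})
    index = {name: i for i, name in enumerate(items)}
    edges = [(index[a], index[b]) for a, b in tuples]
    n = len(items)
    side = [0] * n  # 0 = undecided, 1 = group 1, 2 = group 2
    best = 0

    def bound():
        # A tuple is still satisfiable unless its first item is already
        # in group 2 or its second item is already in group 1.
        return sum(1 for a, b in edges
                   if a != b and side[a] != 2 and side[b] != 1)

    def search(i):
        nonlocal best
        ub = bound()
        if ub <= best:
            return  # cannot beat the best known solution: prune
        if i == n:
            best = ub  # everything is fixed, so the bound is exact
            return
        for group in (1, 2):
            side[i] = group
            search(i + 1)
        side[i] = 0  # undo the choice before returning to the caller

    search(0)
    return best
```

On the example from the question this returns 2, matching the expected answer.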

What is a good way to find pairs of numbers, each stored in a different array, such that the difference between the first and second number is 1?

Suppose you have several arrays of integers. What is a good way to find pairs of integers, not both from the same list, such that the difference between the first and second integer is 1?
Naturally I could write a naive algorithm that just looks through each other list until it finds such a number or hits one bigger. Is there a more elegant solution?
I only mention the condition that the difference be 1 because I'm guessing there might be some use to that knowledge to speed up the computation. I imagine that if the condition for a 'hit' were something else, the algorithm would work just the same.
Some background: I'm engaged in a bit of research mathematics and I seek to find examples of a certain construction. Any help would be much appreciated.
I'd start by sorting each array, preferably with an algorithm that runs in O(n log n) time.
When you've got a bunch of sorted arrays, you can set a pointer to the start of each array, check for any +/- 1 differences between the values under the pointers, and advance the pointer with the smallest value, repeating until all but one of the arrays are exhausted.
To further optimize, you could keep the pointers-values in a sorted linked list, and build the check function into an insertion sort. For each increment, you could remove the previous value from the list, and step through the list checking for +/- 1 comparison until you get to a term that is larger than a possible match. That way, if you're searching a bazillion arrays, you needn't check all bazillion pointer-values - you only need to check until you find a value that is too big, and ignore all larger values.
If you've got any more information about the arrays (such as the range of the terms or number of arrays), I can see how you could take advantage of that to make much faster algorithms for this through clever uses of array properties.
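For the simplest case of just two arrays, the sort-then-walk idea might look like the following Python sketch. The function names are illustrative, and duplicates are dropped first to keep the pointer logic simple, which is an assumption of this sketch:

```python
def pairs_one_apart(a, b):
    """Pairs (x, y) with x from a, y from b and |x - y| == 1."""
    a = sorted(set(a))
    b = sorted(set(b))

    def step_up(lo, hi):
        # Two-pointer walk over sorted, duplicate-free lists,
        # collecting pairs (x, y) with x in lo, y in hi and y == x + 1.
        out = []
        i = j = 0
        while i < len(lo) and j < len(hi):
            d = hi[j] - lo[i]
            if d < 1:
                j += 1  # hi[j] is too small to be lo[i] + 1
            elif d > 1:
                i += 1  # lo[i] is too small to be hi[j] - 1
            else:
                out.append((lo[i], hi[j]))
                i += 1
                j += 1
        return out

    # One walk for y == x + 1, a second walk for x == y + 1.
    return step_up(a, b) + [(x, y) for y, x in step_up(b, a)]
```

After the O(n log n) sorts, each walk is linear in the combined length of the two arrays.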
This sounds like a good candidate for the classic merge sort where the final stage is not a unification but comparison.
And the magnitude of the difference wouldn't affect this, but thanks for adding the information.
Even though you state the second list is in an array, if you could put it in a hashmap of some sort then you could make it faster than just the naive approach.
Basically,
Loop through the first array.
Look to see if there exists an object in the hashmap that is one larger than the current array value.
That way you can build up pairs of numbers that meet your requirements.
I don't know if it would be as flexible as you would like though.
Basically, you may want to consider other data structures, to help you find a better solution.
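A rough Python version of the steps above, using a set as the hash structure (pairs_via_hash is a name chosen for this example):

```python
def pairs_via_hash(first, second):
    """Pairs (x, x + 1) where x is in `first` and x + 1 is in `second`."""
    lookup = set(second)  # expected O(1) membership tests
    return [(x, x + 1) for x in first if x + 1 in lookup]
```

To catch differences in both directions, the same check can be repeated for x - 1.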
You have O(n log n) from the sorting.
You can also do the search in O(log n) for each element, if you have some dynamic query set. You can sort the arrays and then, for each element in the first array, binary search its upper_bound and lower_bound in the second array and check the difference.
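In Python, the standard bisect module plays the role of lower_bound. A minimal sketch, with has_value and pairs_via_bisect as made-up names:

```python
from bisect import bisect_left


def has_value(sorted_vals, target):
    """Binary search: is target present in the sorted list?"""
    i = bisect_left(sorted_vals, target)
    return i < len(sorted_vals) and sorted_vals[i] == target


def pairs_via_bisect(first, second):
    """Pairs (x, y) with x from first, y from second and |x - y| == 1."""
    second_sorted = sorted(second)  # O(n log n), done once
    out = []
    for x in first:
        for y in (x - 1, x + 1):  # the only two candidates
            if has_value(second_sorted, y):
                out.append((x, y))
    return out
```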

Most efficient sorting algorithm for a large set of numbers

I'm working on a large project, I won't bother to summarize it here, but this section of the project is to take a very large document of text (minimum of around 50,000 words (not unique)), and output each unique word in order of most used to least used (probably top three will be "a" "an" and "the").
My question is of course, what would be the best sorting algorithm to use? I was reading of counting sort, and I like it, but my concern is that the range of values will be too large compared to the number of unique words.
Any suggestions?
First, you will need a map of word -> count.
50,000 words is not much. It will easily fit in memory, so there's nothing to worry about. In C++ you can use std::map from the standard library.
Then, once you have the map, you can copy all the map keys to a vector.
Then, sort this vector using a custom comparison operator: instead of comparing the words, compare the counts from the map. (Don't worry about the specific sorting algorithm - your array is not that large, so any standard library sort will work for you.)
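The same three steps, sketched in Python instead of C++ (rank_words is an illustrative name, and splitting on whitespace is a simplification of real tokenizing):

```python
def rank_words(text):
    """Return the unique words of text, most frequent first."""
    # Step 1: map of word -> count.
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1

    # Step 2: copy the keys into a list.
    words = list(counts)

    # Step 3: sort the keys by their counts rather than by the words.
    words.sort(key=lambda w: counts[w], reverse=True)
    return words
```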
I'd start with a quicksort and go from there.
Check out the wiki page on sorting algorithms, though, to learn the differences.
You should try an MSD radix sort. It will sort your entries in lexicographical order. Here is a google code project you might be interested in.
Have a look at the link. It gives a pictorial representation of how different algorithms work. This will give you a hint!
Sorting Algorithms
You can get better performance than quicksort with this particular problem assuming that if two words occur the same number of times, then it doesn't matter in which order you output them.
First step: Create a hash map with the words as key values and frequency as the associated values. You will fill this hash map in as you parse the file. While you are doing this, make sure to keep track of the highest frequency encountered. This step is O(n) complexity.
Second step: Create a list whose number of entries is one more than the highest frequency from the first step, so that the indices run from 0 up to that frequency. Each slot in this list will hold a list of the words whose frequency count equals the slot's index. So words that occur 3 times in the document will go in list[3], for example. Iterate through the hash map and insert the words into the appropriate slots in the list. This step is O(n) complexity.
Third step: Iterate through the list in reverse and output all the words. This step is O(n) complexity.
Overall this algorithm will accomplish your task in O(n) time rather than O(nlogn) required by quicksort.
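The three steps can be sketched in Python as follows. The function name words_by_frequency is invented, and collections.Counter stands in for the hash map:

```python
from collections import Counter


def words_by_frequency(words):
    """Return (word, count) pairs, most frequent first, in O(n)."""
    # Step 1: hash map of word -> frequency, plus the highest frequency.
    counts = Counter(words)
    if not counts:
        return []
    highest = max(counts.values())

    # Step 2: buckets[f] holds every word that occurs exactly f times.
    buckets = [[] for _ in range(highest + 1)]
    for word, freq in counts.items():
        buckets[freq].append(word)

    # Step 3: walk the buckets from the highest frequency down.
    out = []
    for freq in range(highest, 0, -1):
        for word in buckets[freq]:
            out.append((word, freq))
    return out
```

The highest frequency can never exceed the number of words, so the bucket list stays O(n) in size.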
In almost every case I've ever tested, Quicksort worked the best for me. However, I did have two cases where Combsort was the best. Could have been that combsort was better in those cases because the code was so small, or due to some quirk in how ordered the data was.
Any time sorting shows up in my profile, I try the major sorts. I've never had anything that topped both Quicksort and Combsort.
I think you want to do something as explained in the below post:
http://karephul.blogspot.com/2008/12/groovy-closures.html
Languages that support closures make the solution much easier, like LINQ as Eric mentioned.
For large sets you can use what is known as the "sort based indexing" in information retrieval, but for 50,000 words you can use the following:
read the entire file into a buffer.
parse the buffer and build a token vector with
struct token { char *term; int termlen; };
term is a pointer to the word in the buffer.
sort the table by term (lexicographical order).
set entrynum = 0, iterate through the term vector,
when term is new, store it in a vector :
struct { char *term; int frequency; } at index entrynum, set frequency to 1 and increment the entry number, otherwise increment frequency.
sort the vector by frequency in descending order.
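The same sort-then-count-runs procedure can be sketched compactly in Python. This is not the C layout described above: frequencies_by_sorting is a made-up name and plain strings replace the pointer/length tokens.

```python
from itertools import groupby


def frequencies_by_sorting(text):
    """Return (term, frequency) pairs, most frequent first."""
    # Sort the tokens so that equal terms become adjacent.
    tokens = sorted(text.split())

    # Each run of equal terms becomes one (term, frequency) entry.
    table = [(term, sum(1 for _ in run)) for term, run in groupby(tokens)]

    # Finally sort the entries by frequency in descending order.
    table.sort(key=lambda entry: entry[1], reverse=True)
    return table
```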
You can also try implementing digital trees, also known as tries. Here is the link

Resources