Pseudocode for sorting a list of strings without using loops

I was trying to think of an algorithm that would sort a list of strings according to their first 4 characters (say, each line from a file), without using conventional looping constructs such as while and for. An example of the input would be:
1231COME1900123
1233COME1902030
2031COME1923919
1231GO 1231203
1233GO 1932911
2031GO 1239391
The thing is, we do not know the number of records beforehand. Each 4-digit ID number can have multiple COME and GO records, but they are sorted as above beforehand. I want to sort the file by the 4-digit ID number and achieve this:
1231COME1900123
1231GO 1231203
1233COME1902030
1233GO 1932911
2031COME1923919
2031GO 1239391
The only logical idea I have is that we should read through the records recursively, but the sorting part is a bit tricky for me. GOTO could be used as well. Any ideas?

Assuming that the first 4 characters of each entry are always digits, you can do something like the following:
1. Create a list of length 10000, where each element can hold a pair of values.
2. Index into that list using the first 4 digits of each entry.
3. The shape of each individual element will be [COME_ELEMENT, GO_ELEMENT].
4. Each COME_ELEMENT and GO_ELEMENT is itself a list, of length equal to the maximum value + 1 that can appear after the words COME and GO.
5. As each string arrives, split it at the first 4 digits and go to that element of the outer list.
6. Check whether it's a GO or a COME.
7. If it's a GO (say), determine the number after the word GO.
8. Insert the string at the index (determined in step 7) in the inner list.
9. When you're done inserting values, just traverse the non-empty elements.
The result so obtained will be in the sorted order you require, without the use of looping.
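A condensed Python sketch of the bucket idea above, with recursion standing in for the forbidden loops. To keep the sketch small, each bucket just keeps its COME and GO records in arrival order rather than allocating the huge inner arrays; the field widths are assumptions based on the sample records.

```python
import sys
sys.setrecursionlimit(20000)  # traverse() recurses once per bucket

def insert(records, buckets):
    """Recursively drop each record into the bucket for its 4-digit ID."""
    if not records:
        return
    rec = records[0]
    key = int(rec[:4])                 # first 4 characters are digits
    if buckets[key] is None:
        buckets[key] = [[], []]        # [COME_ELEMENT, GO_ELEMENT]
    side = 0 if rec[4:8] == "COME" else 1
    buckets[key][side].append(rec)
    insert(records[1:], buckets)       # recurse instead of looping

def traverse(buckets, i=0, out=None):
    """Recursively collect the non-empty buckets in index order."""
    if out is None:
        out = []
    if i == len(buckets):
        return out
    if buckets[i] is not None:
        out.extend(buckets[i][0])      # COME records first
        out.extend(buckets[i][1])      # then GO records
    return traverse(buckets, i + 1, out)

lines = [
    "1231COME1900123", "1233COME1902030", "2031COME1923919",
    "1231GO 1231203", "1233GO 1932911", "2031GO 1239391",
]
buckets = [None] * 10000
insert(lines, buckets)
result = traverse(buckets)
```

Within one ID the COME slot is emitted before the GO slot, which matches the desired output above.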

Related

Algorithm to print unique unrepeated words across lists

list1 --> aaa, bbb, ddd, xyxz, ...
list2 --> bbb, ccc, ccc, glk, hkp, ...
list3 --> ddd, eee, ffff, lmn, ...
Within each list the words are sorted. I want to remove words that are repeated across lists and print the rest in sorted order. If a word is repeated within the same list, it is valid.
In the above case it should print:
aaa --> ccc --> ccc --> eee --> ffff --> glk --> hkp --> lmn --> xyxz
Here ccc is repeated within the same list, so it is printed, while bbb and ddd are removed since they appear across lists.
I am not looking for code, just a better way to solve this. I tried searching for 3 hrs, so I just wanted to know the approach.
1. Start with an empty list for the results.
2. Get 3 pointers (or indices) pointing to the beginning of the 3 sorted lists.
3. Compare the words pointed to by the 3 pointers, find the smallest, and add it to the result list.
4. Advance each of the 3 pointers until the word it points to is larger than the last added result.
5. Repeat steps 3 and 4 until all the pointers reach the end of their lists.
For each list, make a copy and store it in a set to remove duplicate strings within the same list, e.g.
list2 --> bbb, ccc, ccc, glk, hkp
becomes
set2 --> bbb, ccc, glk, hkp
(This step is only for building the following frequency table; you can skip it if you have another way to build the table.)
Then use a hash table to build a frequency table that maps each string s to the number of sets that contain s.
Using the table, you can check whether a string appears in more than one list.
Then you just concatenate the input word lists and remove those strings that appear in more than one list.
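A short Python sketch of this approach; `Counter` plays the role of the frequency table, and the list contents come from the question:

```python
from collections import Counter

def unique_across_lists(*lists):
    # Count, for each word, how many *lists* (not occurrences) contain it.
    freq = Counter()
    for lst in lists:
        for word in set(lst):          # set() drops within-list duplicates
            freq[word] += 1
    # Keep words that appear in exactly one list; within-list repeats survive.
    merged = [w for lst in lists for w in lst if freq[w] == 1]
    return sorted(merged)

list1 = ["aaa", "bbb", "ddd", "xyxz"]
list2 = ["bbb", "ccc", "ccc", "glk", "hkp"]
list3 = ["ddd", "eee", "ffff", "lmn"]
print(unique_across_lists(list1, list2, list3))
```

Since each input list is already sorted, the final sort could be replaced by a k-way merge for linear time, but `sorted` keeps the sketch simple.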

Where to start: what set of N letters makes the most words?

I'm having trouble coming up with a non-brute-force approach to a problem I've been wondering about: what set of N letters can be used to make the most words from a given dictionary? Letters can be used any number of times.
For example, for N=3, we can have EST to give words like TEST and SEE, etc...
Searching online, I found some answers (such as listed above for EST), but no description of the approach.
My question is: what well-known problems are similar to this, or what principles should I use to tackle this problem?
NOTE: I know it's not necessarily true that if EST is the best for N=3, then ESTx is the best for N=4. That is to say, you can't just append a letter to the previous solution.
In case you're wondering, this question came to mind because I was wondering what set of 4 ingredients could make the most cocktails, and I started searching for that. Then I realized my question was specific, and so I figured this letter question is the same type of problem, and started searching for it as well.
For each word in the dictionary, sort its letters and remove duplicates. Call this the skeleton of the word. For each skeleton, count how many words contain it; call this its frequency. Ignore all skeletons whose size is greater than N.
Let a subskeleton be the result of removing 1 or more letters from a skeleton, i.e. EST has the subskeletons E, S, T, ES, ET, ST. For each skeleton of size N, add the count of that skeleton and all its subskeletons. Select the skeleton with the maximal sum.
You need O(2**N * D) operations, where D is the size of the dictionary.
Correction: we need to take into account all skeletons of size up to N (not only those of actual words), and the number of operations will be O(2**N * C(L,N)), where L is the number of letters (26 in English).
So I coded up a solution to this problem that uses a hash table to get things done. I had to deal with a few problems along the way too!
Let N be the size of the group of letters you are looking for that can make the most words. Let L be the length of the dictionary.
Convert each word in the dictionary into a set of letters: 'test' -> {'e','s','t'}
For each number 1 to N inclusive, create a cut list that contains the words you can make with exactly that many letters.
Make a hash table for each number 1 to N inclusive, then go through the corresponding cut list and use the set as a key, and increment by 1 for each member of the cut list.
This was the part that gave me trouble! Create a set out of your cut list (unique_cut_list) for N. This is essentially all the populated key-value pairs for the hash table for N.
For each set in unique_cut_list, generate all subsets, and check the corresponding hash table (the size of the subset) to see if there is a value. If there is, add that value to the hash table for N with the key of the original set.
Finally, go through the hash table and find the max value. The corresponding key is the group of letters you're after.
You go through the dictionary 1+2N times for steps 1-5; step 6 goes through a version of the dictionary and checks (2^N)-1 subsets each time (ignoring the null set). That gives O(2NL + L*2^N), which approaches O(L*2^N). Not bad, since N will not be too big in most applications!
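Both answers above boil down to the same skeleton-counting idea. A hedged Python sketch, using the corrected strategy of scoring every candidate set of N letters (the tiny word list is illustrative, not a real dictionary):

```python
from collections import Counter
from itertools import chain, combinations

def best_letter_set(words, n):
    # Count how many words each "skeleton" (set of distinct letters) covers.
    skeleton_count = Counter()
    for w in words:
        sk = frozenset(w)
        if len(sk) <= n:               # larger skeletons can never fit
            skeleton_count[sk] += 1
    # Score every candidate set of exactly n letters: a word is makeable
    # iff its skeleton is a subset of the candidate set.
    letters = set(chain.from_iterable(skeleton_count))
    best, best_score = None, -1
    for cand in combinations(sorted(letters), n):
        cand_set = frozenset(cand)
        score = sum(c for sk, c in skeleton_count.items() if sk <= cand_set)
        if score > best_score:
            best, best_score = cand, score
    return "".join(best), best_score

words = ["test", "see", "set", "tee", "dog", "god", "ogre"]
print(best_letter_set(words, 3))
```

For this toy dictionary the winner is "est", covering test, set, see, and tee.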

Need an algorithm to print a list of unique words and number of occurences

I need to create a program that inputs an English language text file and outputs a list of words contained in the file and the number of occurrences. I need to make one using a brute force method and one with divide and conquer.
I will code it myself, so please don't give me code, but I need help figuring out how to go about it: basically, what is the algorithm behind each method, especially divide and conquer? Pseudocode would be great.
Using a brute force method:
Create a list that stores word (say, key) and corresponding counter for their occurrences(say cnt). Traverse through the file and :
if the word is not present, append it in the list and start the counter as cnt = 1.
if the word is present, do cnt++.
Using divide and conquer:
Use a separate list for each starting letter (a to z), so there would be at most 26 separate lists.
Traverse through the file. For each word, take its starting character and choose the corresponding list to search. Then perform the same search-and-update method used in the brute force approach.
Note: neither of them is optimized. They perform poorly compared with a hashmap implementation.
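A sketch of both methods in Python, sticking to plain lists as the answer suggests (words are assumed to be lowercase a-z):

```python
def brute_force_count(words):
    # One list of [word, cnt] pairs; a linear scan for every word.
    table = []
    for w in words:
        for entry in table:
            if entry[0] == w:
                entry[1] += 1
                break
        else:                           # word not present yet
            table.append([w, 1])
    return table

def divide_and_conquer_count(words):
    # One table per starting letter: each search only scans words
    # that share the same first letter.
    tables = {chr(c): [] for c in range(ord('a'), ord('z') + 1)}
    for w in words:
        table = tables[w[0]]
        for entry in table:
            if entry[0] == w:
                entry[1] += 1
                break
        else:
            table.append([w, 1])
    return [e for letter in sorted(tables) for e in tables[letter]]

text = "the cat and the dog and the cat".split()
```

The second version does the same work per word but against a 26x smaller table on average, which is the whole point of the split.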

Finding duplicate digits in integers

Let's say we have an integer array of N elements consisting of integers between 0 and 10000. We need to detect the numbers that include a digit more than once, e.g. 1245 is valid while 1214 is not. How can we do this optimally? Thanks!
You need two loops: an outer loop that scans each element of the array, and an inner loop that determines whether the given element is valid based on the criteria you indicated.
To determine whether a number has the same digit more than once, you need a routine that extracts each digit one by one. The most efficient way to do that is to take the number mod 10 to get a digit, then divide the number by 10, and keep doing that until there is no number left (zero).
Now that you have a routine for looking at each digit of an integer, the most efficient way to detect duplicate digits is to create an array of 10 booleans, starting with a cleared array. For every digit, use it as an index into the bool array and set that entry to true. If the entry is already true before you set it, that digit was seen before, so it's a duplicate. In that case, break out of the loop altogether and report the value as invalid.
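A Python sketch of the digit-extraction and boolean-array idea described above:

```python
def has_duplicate_digit(n):
    # Extract digits with mod 10 / divide by 10; mark each in a
    # 10-slot boolean array; a repeat means a duplicate digit.
    seen = [False] * 10
    while True:
        d = n % 10
        if seen[d]:
            return True
        seen[d] = True
        n //= 10
        if n == 0:                     # no number left
            return False

def invalid_numbers(arr):
    # Outer loop over the array; the inner digit loop is in the helper.
    return [x for x in arr if has_duplicate_digit(x)]

print(invalid_numbers([1245, 1214, 7, 10000, 9876]))
```

Each number costs at most 5 digit extractions for values up to 10000, so the whole scan is linear in N.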

statistical/weighted/probabilistic selection of random element

I am creating a set of items, and for each I count its number of occurrences in a sample. Later I wish to choose an item at random, but I want the chance of choosing any particular item to be proportional to its number of occurrences relative to the total occurrences of all items.
I believe I have found a nice solution, but I'm interested in what the standard term for this concept is and what the standard methods of achieving it are.
This doesn't have a name on its own, but it's an important step in updating your beliefs based on evidence during particle filtering, which is probably the term you're looking for.
Choose a random number r from 0 to n-1 (where n is the total number of occurrences of all items). Then iterate over the items, subtracting each item's number of occurrences from r. When r goes below zero, select the item whose count you just subtracted. Note that it's not important to group the same item in the same place; you may have repeats and this will still work.
Alternatively, if your occurrences are stored individually in an array (rather than a histogram), simply select a random index from the array.
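A minimal Python sketch of the subtract-until-negative selection, assuming the occurrences are stored as a histogram (item -> count):

```python
import random

def weighted_choice(counts):
    """Pick an item with probability proportional to its count."""
    n = sum(counts.values())
    r = random.randrange(n)            # random integer in [0, n-1]
    for item, cnt in counts.items():
        r -= cnt
        if r < 0:                      # r went below zero: pick this item
            return item

counts = {"apple": 5, "banana": 1, "cherry": 4}
picks = [weighted_choice(counts) for _ in range(10000)]
```

Over many draws, "apple" should come up roughly half the time, matching its 5-out-of-10 share of the total count.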
