Algorithm to print unique unrepeated words across lists

list1 --> aaa, bbb, ddd, xyxz, ...
list2 --> bbb, ccc, ccc, glk, hkp, ...
list3 --> ddd, eee, ffff, lmn, ...
Within each list the words are sorted. I want to remove words that are repeated across the lists and print the rest in sorted order. If a word is repeated within the same list, it is still valid.
In the above case it should print:
aaa --> ccc --> ccc --> eee --> ffff --> glk --> hkp --> lmn --> xyxz
Here ccc is repeated within the same list, so it is printed, while bbb and ddd are removed since they repeat across lists.
I am not looking for code, just a better way to solve this. I tried searching for 3 hours, so I just wanted to know the approach.

1. Get an empty list for the results.
2. Get 3 pointers (or indices) pointing to the beginning of the 3 sorted lists.
3. Compare the words pointed to by the 3 pointers and find the smallest. If it sits at the front of only one list, add it (together with any repeats within that list) to the result; if it sits at the front of more than one list, skip it.
4. Move each of the 3 pointers past every occurrence of that word.
5. Repeat steps 3 and 4 until all the pointers reach the end of their lists.
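A minimal sketch of that merge-style walk in JavaScript, assuming the lists are already-sorted arrays of strings; the function name and the generalization to any number of lists are illustrative, not from the original answer:

function uniqueAcrossLists(lists) {
    var pointers = lists.map(function () { return 0; });
    var result = [];
    while (true) {
        // Find the smallest word among the current fronts.
        var smallest = null;
        for (var i = 0; i < lists.length; i++) {
            var p = pointers[i];
            if (p < lists[i].length && (smallest === null || lists[i][p] < smallest)) {
                smallest = lists[i][p];
            }
        }
        if (smallest === null) break; // every pointer has reached its end
        // Advance past every occurrence, counting which lists held the word.
        var listsContaining = 0, occurrences = 0;
        for (var j = 0; j < lists.length; j++) {
            var count = 0;
            while (pointers[j] < lists[j].length && lists[j][pointers[j]] === smallest) {
                count++;
                pointers[j]++;
            }
            if (count > 0) { listsContaining++; occurrences += count; }
        }
        // Keep the word (with its within-list repeats) only if it
        // occurred in a single list.
        if (listsContaining === 1) {
            for (var k = 0; k < occurrences; k++) result.push(smallest);
        }
    }
    return result;
}

With the example above, uniqueAcrossLists([["aaa","bbb","ddd","xyxz"], ["bbb","ccc","ccc","glk","hkp"], ["ddd","eee","ffff","lmn"]]) returns ["aaa","ccc","ccc","eee","ffff","glk","hkp","lmn","xyxz"], matching the expected output.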

For each list, make a copy and store it in a set to remove duplicate strings within the same list.
e.g.
list2 --> bbb, ccc, ccc, glk, hkp
copied as
set2 --> bbb, ccc, glk, hkp, ...
(This step only exists to build the following frequency table; you can skip it if you have another way to build the table.)
Then use a hash table to build a frequency table that maps each string s to the number of sets that contain s.
Using the table, you can check whether a string appears in more than one list.
Finally, concatenate the input word lists, remove the strings that appear in more than one list, and sort what remains.
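A minimal JavaScript sketch of this table-based approach, under the same assumptions (the inputs are arrays of strings; all names are illustrative):

function uniqueViaFrequencyTable(lists) {
    var listCount = Object.create(null); // word -> number of lists containing it
    lists.forEach(function (list) {
        var seen = Object.create(null);  // de-duplicates within this one list
        list.forEach(function (word) { seen[word] = true; });
        Object.keys(seen).forEach(function (word) {
            listCount[word] = (listCount[word] || 0) + 1;
        });
    });
    // Concatenate the lists, keeping only words that occur in exactly
    // one list, then sort the survivors.
    var result = [];
    lists.forEach(function (list) {
        list.forEach(function (word) {
            if (listCount[word] === 1) result.push(word);
        });
    });
    return result.sort();
}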

Related

Pseudocode of sorting a list of strings without using loops

I was trying to think of an algorithm that would sort a list of strings according to their first 4 characters (say, each line from a file), without using conventional looping constructs such as while or for. An example input would be:
1231COME1900123
1233COME1902030
2031COME1923919
1231GO 1231203
1233GO 1932911
2031GO 1239391
The thing is, we do not know the number of records beforehand, and each 4-digit ID number can have multiple COME and GO records; they arrive sorted as shown above. I want to sort the file by the 4-digit ID number and achieve this:
1231COME1900123
1231GO 1231203
1233COME1902030
1233GO 1932911
2031COME1923919
2031GO 1239391
The only logical thought I have is that we should read through the records recursively, but the sorting part is a bit tricky for me. GOTO could be used as well. Any ideas?
Assuming that the first 4 characters of each entry are always digits, you do something as follows:
1. Create a list of length 10000, where each element can hold a pair of values.
2. Each entry goes into the list element indexed by its first 4 digits.
3. The shape of an individual element is [COME_ELEMENT, GO_ELEMENT].
4. Each COME_ELEMENT and GO_ELEMENT is itself a list, of length equal to the maximum value + 1 that can appear after the words COME and GO.
5. As each string arrives, break it at the first 4 digits and go to that element of the list.
6. Check whether it is a COME or a GO record.
7. If it is a GO (say), determine the number after the word GO.
8. Insert the string at the index (determined in step 7) in the inner list.
9. When you are done inserting values, traverse the non-empty elements; the result contains the sorted order you require without the use of looping.
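Here is a minimal JavaScript sketch of the bucketing idea, using recursion in place of while/for as the question requires. The names are illustrative, the buckets live in a plain object rather than a fixed 10000-slot list, and recursion depth limits make this practical only for modestly sized files:

function sortByIdNoLoops(records) {
    var buckets = Object.create(null); // 4-digit ID -> its records, in arrival order
    fill(records, 0, buckets);
    var ids = Object.keys(buckets).sort(); // 4-digit strings compare correctly
    return gather(ids, 0, buckets, []);
}

function fill(records, i, buckets) {
    if (i === records.length) return;
    var id = records[i].substring(0, 4);
    (buckets[id] = buckets[id] || []).push(records[i]); // COME stays before GO
    fill(records, i + 1, buckets);
}

function gather(ids, i, buckets, out) {
    if (i === ids.length) return out;
    return gather(ids, i + 1, buckets, out.concat(buckets[ids[i]]));
}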

Need an algorithm to print a list of unique words and number of occurrences

I need to create a program that reads an English-language text file and outputs a list of the words contained in the file along with the number of occurrences of each. I need to make one version using a brute-force method and one using divide and conquer.
I will code it myself, so please don't give me code, but I need help figuring out how to go about it - basically, what is the algorithm behind each method, especially the divide and conquer? Pseudocode would be great.
Using a brute-force method:
Create a list that stores each word (say, key) and a corresponding counter for its occurrences (say, cnt). Traverse through the file and:
if the word is not present, append it to the list and start the counter at cnt = 1;
if the word is present, do cnt++.
Using divide and conquer:
Use a separate list for each starting letter (a to z), so there are at most 26 separate lists.
Traverse through the file. For each word, take its starting character and choose the corresponding list to search; then perform the same search-and-update used in the brute-force method. A sketch of both methods follows.
Note: neither of these is optimized; they perform poorly against a hashmap implementation.
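The asker wanted no code, but for other readers, here is a minimal JavaScript sketch of both methods, assuming lowercase words; all names are illustrative:

function countWordsBruteForce(words) {
    var entries = []; // [word, cnt] pairs, searched linearly
    words.forEach(function (word) {
        var found = null;
        for (var i = 0; i < entries.length; i++) {
            if (entries[i][0] === word) { found = entries[i]; break; }
        }
        if (found) found[1]++;        // word present: cnt++
        else entries.push([word, 1]); // word absent: append with cnt = 1
    });
    return entries;
}

function countWordsDivideAndConquer(words) {
    var lists = []; // one entry list per starting letter, 'a' to 'z'
    words.forEach(function (word) {
        var slot = word.charCodeAt(0) - "a".charCodeAt(0);
        var entries = lists[slot] = lists[slot] || [];
        var found = null;
        for (var i = 0; i < entries.length; i++) {
            if (entries[i][0] === word) { found = entries[i]; break; }
        }
        if (found) found[1]++;
        else entries.push([word, 1]);
    });
    return [].concat.apply([], lists.filter(Boolean)); // flatten the 26 lists
}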

Best data structure to count letter frequencies?

Task:
What is the most common first letter found among all the words in this document?
- unweighted (count a word once, regardless of how many times it shows up)
- weighted (count a word separately for each time it shows up)
What is the most common word of a given length in this document?
I'm thinking of using a hashmap to count the most common first letter. But should I use a hashmap for both the unweighted and the weighted counts?
And for the most common word of a given length (e.g. 5), could I use something simpler, like an array list?
For the unweighted count, you need a hash table to keep track of the words you've already seen, as well as a hash map to count the occurrences of each first letter. That is, you need to write:
if words_seen does not contain word
    add word to words_seen
    update hash map with first letter of word
end-if
For the weighted count, you don't need that hash table, because you don't care whether you've already seen the word: every occurrence counts. So you can just write:
update hash map with first letter of word
For the most common word of a given length, you need a hash map to keep track of all the unique words you see and the number of times you see each. After you've scanned the entire document, make a pass through that hash map to find the most frequent word with the desired length.
You probably don't want to use an array list for the last task, because you want to count occurrences. If you used an array list, then after scanning the entire document you'd have to sort the list and count frequencies, which would take more memory and more time than just using the hash map.
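A minimal JavaScript sketch of the two first-letter counts, assuming words arrive one at a time from the document scan; names are illustrative:

function firstLetterCounts(words) {
    var seen = Object.create(null);       // words already counted (unweighted)
    var unweighted = Object.create(null); // first letter -> distinct-word count
    var weighted = Object.create(null);   // first letter -> occurrence count
    words.forEach(function (word) {
        var letter = word[0];
        weighted[letter] = (weighted[letter] || 0) + 1;
        if (!seen[word]) {
            seen[word] = true;
            unweighted[letter] = (unweighted[letter] || 0) + 1;
        }
    });
    return { unweighted: unweighted, weighted: weighted };
}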

Checking if a word is made up of one or more concatenated dictionary words

Here's the scenario:
I have an array of millions of random strings of letters, each of length 3-32, and an array of words (the dictionary).
I need to test whether a random string can be made up by concatenating 1, 2, or 3 different dictionary words or not.
As the dictionary words are more or less fixed, I can do any kind of pre-processing on them.
Ideally, I'd like something that optimizes lookup speeds by doing some kind of pre-processing on the dictionary.
What kind of data structures / algorithms should I be looking at to implement this?
First, build a trie from your dictionary. Each child of the root corresponds to a first letter; each second-level subtree then covers all words beginning with that two-letter prefix, and so on.
Then take your string, start at its first letter, and walk down the trie until you reach a node that ends a dictionary word; then recursively apply the same algorithm to the rest of the string. If you never find a match at any point, you know you can't form the string via concatenation.
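A minimal JavaScript sketch of the trie walk, limited to at most 3 concatenated words as in the question; all names are illustrative:

function buildTrie(dict) {
    var root = {};
    dict.forEach(function (word) {
        var node = root;
        for (var i = 0; i < word.length; i++) {
            node = node[word[i]] = node[word[i]] || {};
        }
        node.end = true; // marks the end of a complete dictionary word
    });
    return root;
}

function canConcat(trie, s, start, depth) {
    if (start === s.length) return depth > 0; // used at least one word
    if (depth === 3) return false;            // would need more than 3 words
    var node = trie;
    for (var i = start; i < s.length; i++) {
        node = node[s[i]];
        if (!node) return false; // no dictionary word continues this prefix
        if (node.end && canConcat(trie, s, i + 1, depth + 1)) return true;
    }
    return false;
}

For example, canConcat(buildTrie(["cat", "dog", "fish"]), "catdogfish", 0, 0) returns true.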
Store the dictionary strings in a hashed set data structure. Iterate through all possible splits of the string you want to check into 1, 2, or 3 parts, and for each such split look up all the parts in the hash set.
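A minimal sketch of those split lookups in JavaScript, using an ES6 Set as the hashed set; it allows the same word to repeat across parts, and the names are illustrative:

function canSplit(dictSet, s) {
    if (dictSet.has(s)) return true; // one word
    for (var i = 1; i < s.length; i++) {
        if (!dictSet.has(s.substring(0, i))) continue;
        if (dictSet.has(s.substring(i))) return true; // two words
        for (var j = i + 1; j < s.length; j++) {      // three words
            if (dictSet.has(s.substring(i, j)) && dictSet.has(s.substring(j))) return true;
        }
    }
    return false;
}

For example, canSplit(new Set(["cat", "dog", "fish"]), "catdogfish") returns true.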
Make a regex matching every word in your dictionary.
Put parentheses around it.
Put a + on the end.
Compile it with any correct (DFA-based) regex engine.
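A minimal JavaScript sketch of the regex construction (note that JavaScript's built-in engine is backtracking rather than DFA-based, so for the linear-time behaviour the answer has in mind you would want an engine such as RE2; anchoring the pattern is my addition, and a real dictionary would also need its metacharacters escaped):

var dictionary = ["cat", "dog", "fish"]; // illustrative dictionary
var wordPattern = new RegExp("^(?:" + dictionary.join("|") + ")+$");
wordPattern.test("catdogfish"); // true
wordPattern.test("catdox");     // false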

How to get a group of names all with the same first letter from an alphabetically sorted list?

I was wondering what the best way is to get a group of names given their first letter. The present application I am working on is in JavaScript, but I had a similar problem in another language some time ago. One idea I have thought of would be to do a binary search for the end of the names with a particular letter and then another binary search for the beginning. Another idea was to take the ratio of the distance of the given letter from the beginning of the alphabet and apply that ratio to find where to start the search. For example, if the letter was 'e' then I would start about a quarter of the way through the list, and do some kind of search to see how close I am to the letter I need. The program will be working with several hundred names, so I really didn't want to just do a for loop and search the whole thing. Also, I am interested in what kinds of algorithms for this are out there.
Both of your approaches have advantages and disadvantages. Binary search gives exactly O(log N) complexity; your second method (essentially interpolation search) gives approximately O(log N), with some advantage for a uniform distribution of names and possibly a disadvantage for other distributions. Which is better depends on your needs.
One big improvement I can propose is to index the letter positions while creating the names list. Make a simple hash map with first letters as keys and start positions as values. Building it takes O(N), but only once, and then you get the exact position for each letter in constant time. In JavaScript you can do this, for example, while loading the data into the page, when you walk through the list anyway.
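A minimal JavaScript sketch of that index, assuming the names are already in one sorted array; the names here are illustrative:

function buildLetterIndex(sortedNames) {
    var index = Object.create(null);
    for (var i = 0; i < sortedNames.length; i++) {
        var letter = sortedNames[i][0];
        if (!(letter in index)) index[letter] = i; // first position for this letter
    }
    return index;
}

All names starting with 'e' then occupy the slice from index['e'] up to the start position of the next letter that is present.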
I think we could use an approach similar to counting sort. We could create an array of size 26. This would not be a normal array, but an array of pointers to linked lists with the following structure:
struct node
{
    char *ptr;
    struct node *next;
};

struct node *names[26]; // our array
Now we scan the list in O(n) time and, for the first character of each name, subtract 65 (assuming ASCII letters in the range 65-90, i.e. 'A' to 'Z') to get its slot in the 26-element array.
At each slot we build a linked list holding the corresponding words.
Now if we want to find all names that begin with 'D', we can go directly to array location 3 (no need to apply any hash function) and traverse the linked list until NULL is reached.
The space complexity of hashing would be about the same as the above, but hashing also involves computing the hash function every time we insert or search for words beginning with the same letter.
If the plan is to do something with the names (as opposed to just finding out how many there are), then it will be necessary to scan all the names that match the given first letter. If so, a binary search for the first matching name in the entire set is the fastest method. The "do something" part then involves scanning names starting from the location found by the binary search; when you read a name that no longer starts with the given letter, you are done.
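A minimal JavaScript sketch of that idea, assuming a sorted array of names; the helper names are illustrative:

function firstIndexWithLetter(sortedNames, letter) {
    var lo = 0, hi = sortedNames.length; // half-open search range [lo, hi)
    while (lo < hi) {
        var mid = (lo + hi) >> 1;
        if (sortedNames[mid][0] < letter) lo = mid + 1;
        else hi = mid;
    }
    return lo; // first index whose name starts with letter (or a later one)
}

function forEachNameWithLetter(sortedNames, letter, doSomething) {
    for (var i = firstIndexWithLetter(sortedNames, letter);
         i < sortedNames.length && sortedNames[i][0] === letter;
         i++) {
        doSomething(sortedNames[i]);
    }
}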
If you have an unsorted set of filenames, then I would propose the following algorithm:
1) Create two variables: the currently found first letter (I will call it currentLetter) and the list of filenames which start with that letter (currentFilenames).
2) Initialize currentLetter = null and currentFilenames = [] (an empty list or array).
3) Iterate over the filenames. If the current filename starts with currentLetter, add it to currentFilenames. If it starts with a letter that comes before currentLetter in the alphabet, assign that filename's first letter to currentLetter and create a new currentFilenames list consisting of just that filename.
With such an algorithm you will end up with the letter that comes first in the alphabet and the list of files starting with that letter.
Sample code (I tried to write it in JavaScript, so don't blame me if anything is off):
function GetFirstLetterAndFilenames(allFilenames) {
    var currentLetter = null;
    var currentFilenames = null;
    for (var i = 0; i < allFilenames.length; i++) {
        var thisLetter = allFilenames[i][0];
        if (currentLetter === null || thisLetter < currentLetter) {
            // Found an earlier letter: restart the group with this filename.
            currentLetter = thisLetter;
            currentFilenames = [allFilenames[i]];
        } else if (currentLetter === thisLetter) {
            currentFilenames.push(allFilenames[i]);
        }
    }
    return { lowestLetter: currentLetter, filenames: currentFilenames };
}
Names have a funny way of not distributing themselves evenly over the alphabet, so you're probably not going to win by as much as you'd hope by predicting where to look.
But a really easy way to cut your search down by an average of two steps is as follows: if the letter is from a to m, binary search for the next letter. Then binary search from the beginning of the list only to the position you just found for the next letter. If the letter is from n to z, binary search for it. Then, again, only search the portion of the list after what you just found.
Is this worth saving two steps? Dunno. It's pretty easy to implement, but then again, two steps don't take very long. (Correctly guessing the letter would save you maybe 4 steps at best.)
Another possibility is to have bins for each letter to begin with. It starts out already sorted, and if you have to re-sort, you only have to sort within one letter, not the whole list. The downside is that if you need to manipulate the whole list frequently, you have to glue all the bins together.
