Description of the algorithm for sorting ASCII characters alphabetically

I am a beginner web developer and as a test task I got the following:
An unordered array of printed ASCII characters is given. Describe in your own words (without code or pseudocode) a sorting algorithm that allows you to sort this array alphabetically in linear time. It is necessary to describe the actions at each step of the algorithm. Is a stable version of such a sorting algorithm possible?
I'm not very good at algorithms, because I have just started studying, so I do not understand how to approach this task.
Thanks for the help.

printed ASCII characters
I suppose they mean printable ASCII characters, which are characters with ASCII code in the range 32-126, so 95 characters.
Describe in your own words
For each relevant ASCII code, count how many times that character occurs in the input. The idea is that you do this in one pass over the input: for each encountered character increment the relevant counter.
Iterate over the above (95) counters in order of ASCII code, and output the corresponding character that many times. So if the counter is zero, don't output the character; if the counter is 3, output that character three times.
Is a stable version of such a sorting algorithm possible?
Yes. This is only relevant when each character in the input is accompanied by some related data (payload). In that case we should not only maintain a counter per ASCII code, but collect the associated payloads in an array that is associated with that ASCII code.
For more information, see Counting Sort on Wikipedia
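For illustration, here is a minimal Python sketch of the approach described above (the function name and the 95-entry counter list are my own choices, not part of the task):

def counting_sort_printable(chars):
    """Sort a list of printable ASCII characters (codes 32-126) in code order."""
    LOW, HIGH = 32, 126
    counts = [0] * (HIGH - LOW + 1)       # one counter per printable character
    for c in chars:                       # single pass: count occurrences
        counts[ord(c) - LOW] += 1
    result = []
    for code in range(LOW, HIGH + 1):     # walk the codes in order
        result.extend(chr(code) * counts[code - LOW])
    return result

print(counting_sort_printable(list("counting sort!")))
# [' ', '!', 'c', 'g', 'i', 'n', 'n', 'o', 'o', 'r', 's', 't', 't', 'u']

The single pass over the input plus the fixed-size pass over the 95 counters is what makes this linear in the length of the input.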

To do this, counting sort is used:
Create a count array with one slot per possible character code; the index corresponds to the character's code and the value records how many times that character has been seen.
In one pass over the input, for each character (ASCII code) encountered, increment its counter.
If a character occurs more than once, its counter is simply incremented by one each time.
Output the characters with ASCII codes in the ranges 65 to 90 and 97 to 122, each repeated as many times as it was counted.
Yes, a stable variant of such a sorting algorithm is possible, because elements with the same value appear in the output array in the same order as in the input.

Related

LC-3 How to store a number large than 16-bit and print it out to console?

I'm having difficulty storing and displaying numbers greater than 32767 in LC-3 since a register can only hold values from -32768 to 32767. My apologies for not being able to come up with any idea for the algorithm. Please give me some suggestions. Thanks!
You'll need a representation to store the larger number in a pair or more of words.
There are several approaches to how big integers are stored: in a fixed number of words, and in a variable number of words or bytes.  The critical part is being able to detect the presence and amount of overflow/carry on mathematical operations like *10.
For that reason, one simple approach is to use a variable number of words/bytes (for a single number), and store only one decimal digit in each of the words/bytes.  That way multiplication by 10 simply means adding a digit on the end (which has the effect of moving each existing digit to the next higher power-of-ten position).  Adding numbers of this form is fairly easy as well: we line up the digits, add them up, and whenever a digit sum is >= 10 there is a carry (of 1) to be added to the next higher-order digit of the sum.  (If adding two such variable-length numbers is desired, I would store the decimal digits in reverse order, because then the low-order digits are already lined up for addition.)  See also https://en.wikipedia.org/wiki/Binary-coded_decimal .  (In some sense, this is like storing numbers in a string-like form, but using binary values instead of ASCII characters.)
To simplify this approach for your needs, you can fix the number of words to use, e.g. at 7, for 7 digits.
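Here is a small Python sketch of the one-digit-per-word representation (plain Python lists stand in for LC-3 memory words; this only illustrates the representation and carry logic, it is not LC-3 code):

# Each number is a list of decimal digits, least significant digit first,
# so "multiply by 10" is just inserting a 0 at the low end.
def times_ten(digits):
    return [0] + digits

def add(a, b):
    # Add two digit lists (least significant digit first).
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        s = carry
        if i < len(a):
            s += a[i]
        if i < len(b):
            s += b[i]
        result.append(s % 10)   # keep one decimal digit per "word"
        carry = s // 10         # carry (at most 1) into the next digit
    if carry:
        result.append(carry)
    return result

# Build 123456 digit by digit, as you would when reading input characters:
n = []
for ch in "123456":
    n = add(times_ten(n), [int(ch)])
print(n[::-1])   # [1, 2, 3, 4, 5, 6]

Building a number from its input characters is then just repeated "multiply by 10 and add the next digit", as in the last loop.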
A variation on (unpacked) binary-coded decimal is to pack two decimal digits per byte.  It's a bit more complicated but saves some storage.
Another approach is to store one fewer decimal digit than will fully fit in a word.  Which is to say, since 16 bits can hold values up to 65535, only 4 full decimal digits fit, so we put 3 digits at a time into each word.  You'd need 3 words for 9 digits.  Multiplication by 10 means multiplying each word by 10 numerically, then checking whether the result exceeds 999; if it does, carry the overflow (the result divided by 1000) to the next higher-order word and keep only the remainder (the result modulo 1000) in the overflowing word.
This approach will require actual multiplication and division by 10 on each of the individual words.
There are other approaches, such as using all 16-bits in a word as magnitude, but the difficulty there is determining the amount of overflow/carry on *10 operations.  It is not a monumental task but will require work.  See https://stackoverflow.com/a/1815371/471129, for example.
(If you also want to store negative numbers, that is also a representation issue.  We can either store the sign separately, known as sign-magnitude form (e.g. in its own word/byte or packed into the highest byte), or store the number in a complement form.  The former is better for variable-length implementations and the latter can be made to work for fixed-length implementations.)

Look for a data structure to match words by letters

Given a list of lowercase random words, each of the same length, and many patterns in which some letters at some positions are specified while the other letters are unknown, find all words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) traversal for each pattern, where N is the word count and L is the word length.
But since many patterns may be run against the same word list, is there any good data structure to preprocess and store the word list, and then give efficient matching for all patterns?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
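A small Python sketch of this inverted index (the names are mine; IDs are assigned in input order so each per-letter list comes out already sorted, and the intersection is the merge-style walk just described):

from collections import defaultdict

words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]

# (position, letter) -> sorted list of word IDs having that letter there
index = defaultdict(list)
for wid, word in enumerate(words):
    for pos, letter in enumerate(word):
        index[(pos, letter)].append(wid)

def intersect_sorted(a, b):
    # Merge-style intersection of two sorted ID lists.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def match(pattern):
    lists = [index[(pos, ch)] for pos, ch in enumerate(pattern) if ch != '-']
    if not lists:
        return list(words)          # no constraint: everything matches
    ids = lists[0]
    for lst in lists[1:]:
        ids = intersect_sorted(ids, lst)
    return [words[i] for i in ids]

print(match("-v-n-l"))   # ['vvqnbl', 'qvhntl']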
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words (using 5 bits per letter, with the letters numbered 1-26). Construct a bit-mask with binary 11111 in places where you have a letter and 00000 in places where you have a blank, and a bit-test from your input with the 5-bit code for each letter in each place, using 00000 where you have blanks. For example, if your input is a-c then your bit-mask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each word with the bit-mask and test whether it's equal to the bit-test value. This is cache-friendly and the inner loop is tight, so it may be competitive with algorithms that look like they should be faster on paper.
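Here is a rough Python version of the bit-packing idea (Python integers stand in for the 64-bit words; I pack the first letter into the low bits, so the literal bit patterns differ from the a-c example above, but the scheme is the same):

words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]

def pack(word):
    # Pack a word into an integer, 5 bits per letter ('a' = 1 ... 'z' = 26).
    value = 0
    for pos, ch in enumerate(word):
        value |= (ord(ch) - ord('a') + 1) << (5 * pos)
    return value

def pattern_mask_test(pattern):
    # Bit-mask has 0b11111 where a letter is fixed; bit-test has that letter's code.
    mask = test = 0
    for pos, ch in enumerate(pattern):
        if ch != '-':
            mask |= 0b11111 << (5 * pos)
            test |= (ord(ch) - ord('a') + 1) << (5 * pos)
    return mask, test

packed = [pack(w) for w in words]
mask, test = pattern_mask_test("-v-n-l")
print([w for w, p in zip(words, packed) if p & mask == test])
# ['vvqnbl', 'qvhntl']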
I'll preface this by saying it's more of a comment and less of an answer (I don't have enough reputation to comment, though). I can't think of any data structure that will satisfy the requirements out of the box. It was interesting to think about, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N (N being the word length) maps of char -> set.
When a string is added, we go through each character and add the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then, to determine which strings match a pattern, we just take the intersection of the sets, e.g. "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is if I wanted to just get all of the strings, so if that's a requirement--we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.
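For completeness, a runnable Python version of the maps-of-sets sketch above (names are mine; a dict per position plays the role of firstCharMap, secondCharMap, and so on):

from collections import defaultdict

words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]
word_len = 6

# One map per position: char -> set of words with that char at that position
char_maps = [defaultdict(set) for _ in range(word_len)]
all_words = set(words)                      # covers the "all blanks" edge case
for w in words:
    for pos, ch in enumerate(w):
        char_maps[pos][ch].add(w)

def match(pattern):
    sets = [char_maps[pos][ch] for pos, ch in enumerate(pattern) if ch != '-']
    if not sets:
        return all_words
    return set.intersection(*sets)

print(match("-v-n-l"))   # {'vvqnbl', 'qvhntl'} (order may vary)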

What is the probability that a UUID, stripped of all of its letters and dashes, is unique?

Say I have a UUID a9318171-2276-498c-a0d6-9d6d0dec0e84.
I then remove all the letters and dashes to get 9318171227649806960084.
What is the probability that this is unique, given a set of ID's that are generated in the same way? How does this compare to a normal set of UUID's?
UUIDs are represented as 32 hexadecimal (base-16) digits, displayed in 5 groups separated by hyphens. The issue with your question is that each generated digit of a UUID can be any valid hexadecimal digit from the set [0-9, A-F].
This leaves us with a dilemma, since we don't know beforehand how many of the hexadecimal digits generated for each UUID will be alpha characters: [A-F]. The only thing we can be certain of is that each generated digit of the UUID has a 6/16 chance of being an alpha character: [A-F]. Knowing this makes it impossible to answer the question exactly, since removing the hyphens and alpha characters leaves us with a variable-length digit string for each generated UUID...
With that being said, to give you something to think about: we know that each UUID is 36 characters in length, including the hyphens. So if we simplify and say we have no hyphens, each UUID is 32 characters in length. Building on this, if we further simplify and say that each of the 32 characters can only be a numeric character [0-9], we can now give an accurate probability for the uniqueness of each generated, simplified UUID (according to the simplifications mentioned above):
Assume a UUID is represented by 32 characters, where each character is a numeric character from the set [0-9], so we need to generate 32 digits to create a valid simplified UUID. The chance of selecting any given digit [0-9] is 1/10. Another way to think about this: each digit has an equal opportunity of being generated, and since there are 10 digits, each has a 10% chance of being generated.
Furthermore, each digit is generated independently of the previously generated digits, i.e. the outcome of any one digit doesn't depend on the outcome of any other. Therefore each of the 32 generated digits is independent of the others.
Knowing these facts, we can take advantage of the Product Rule, which states that the probability of the occurrence of two independent events is the product of their individual probabilities. For example, the probability of getting two heads on two coin tosses is 0.5 x 0.5 = 0.25. Therefore, the probability of generating two identical simplified UUIDs would be:
1/10 * 1/10 * 1/10 * .... * 1/10 where the number of 1/10s would be 32.
Simplifying to 1/(10^32), or in general to 1/(10^n) where n is the length of your UUID. So with all that being said, the probability of generating two identical UUIDs, given our assumptions, is infinitesimally small.
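To make both points concrete, here is a short Python illustration (uuid4 stands in for "a set of IDs generated in the same way"; the last two lines use the simplified 32-digit model described above, not real UUIDs):

import uuid
from fractions import Fraction

# Stripping letters and hyphens gives a *variable-length* digit string:
for _ in range(3):
    u = str(uuid.uuid4())
    digits = ''.join(ch for ch in u if ch.isdigit())
    print(u, '->', digits, f'({len(digits)} digits)')

# Under the simplified model (32 independent digits, each uniform over 0-9),
# the chance that a second ID exactly matches a given one is (1/10)**32:
p = Fraction(1, 10) ** 32
print(p)   # 1 over 10**32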
Hopefully that helps!

Using primes to determine anagrams faster than looping through?

I had a telephone interview recently for an SE role and was asked how I'd determine whether two words were anagrams. I gave a reply that involved something along the lines of getting a character, iterating over the other word, exiting the loop if it exists, and so on. I think it was an N^2 solution, as it's one loop per word with an inner loop for the comparison.
After the call I did some digging and wrote a new solution; one that I plan on handing over tomorrow at the next stage interview, it uses a hash map with a unique prime number representing each character of the alphabet.
I'm then looping through the list of words, calculating the value of each word and checking whether it matches the value of the word I'm checking. If the values match, we have a winner (the whole mathematical theorem business).
It means one loop instead of two which is much better but I've started to doubt myself and am wondering if the additional operations of the hashmap and multiplication are more expensive than the original suggestion.
I'm 99% certain the hash map is going to be faster but...
Can anyone confirm or deny my suspicions? Thank you.
Edit: I forgot to mention that I check the size of the words first before even considering doing anything.
An anagram contains all the letters of the original word, in a different order. You are on the right track to use a HashMap to process a word in linear time, but your prime number idea is an unnecessary complication.
Your data structure is a HashMap that maintains the counts of various letters. You can add letters from the first word in O(n) time. The key is the character, and the value is the frequency. If the letter isn't in the HashMap yet, put it with a value of 1. If it is, replace it with value + 1.
When iterating over the letters of the second word, subtract one from your count instead, removing a letter when it reaches 0. If you attempt to remove a letter that doesn't exist, then you can immediately state that it's not an anagram. If you reach the end and the HashMap isn't empty, it's not an anagram. Else, it's an anagram.
Alternatively, you can replace the HashMap with an array. The index of the array corresponds to the character, and the value is the same as before. It's not an anagram if a value drops to -1, and it's not an anagram at the end if any of the values aren't 0.
You can always compare the lengths of the original strings, and if they aren't the same, then they can't possibly be anagrams. Including this check at the beginning means that you don't have to check if all the values are 0 at the end. If the strings are the same length, then either something will produce a -1 or there will be all 0s at the end.
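A short Python sketch of this count-up/count-down approach (a plain dict plays the role of the HashMap; the function name is mine):

def is_anagram(a, b):
    if len(a) != len(b):          # cheap length check first
        return False
    counts = {}
    for ch in a:                  # count letters of the first word
        counts[ch] = counts.get(ch, 0) + 1
    for ch in b:                  # subtract letters of the second word
        if ch not in counts:      # letter that was never there (or is used up)
            return False
        counts[ch] -= 1
        if counts[ch] == 0:
            del counts[ch]
    return not counts             # empty map -> anagram

print(is_anagram("listen", "silent"))   # True
print(is_anagram("listen", "tinsel"))   # True
print(is_anagram("listen", "listens"))  # False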
The problem with multiplying is that the numbers can get big. For example, if letter 'c' was mapped to 11, then a word with 10 c's would overflow a 32-bit integer.
You could reduce the result modulo some other number, but then you risk having false positives.
If you use big integers, then it will go slowly for long words.
Alternative solutions are to sort the two words and then compare for equality, or to use a histogram of letter counts as suggested by chrylis in the comments.
The idea is to have an array initialized to zero containing the number of times each letter appears.
Go through the letters in the first word, incrementing the count for each letter. Then go through the letters in the second word, decrementing the count.
If all the counts are zero at the end of this process, then the words are anagrams.
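A sketch of this array-based variant in Python, assuming lowercase ASCII letters only (that restriction is my simplification):

def is_anagram(a, b):
    if len(a) != len(b):
        return False
    counts = [0] * 26                        # one slot per lowercase letter
    for ch in a:
        counts[ord(ch) - ord('a')] += 1      # count up for the first word
    for ch in b:
        counts[ord(ch) - ord('a')] -= 1      # count down for the second word
        if counts[ord(ch) - ord('a')] < 0:   # more of this letter than in a
            return False
    return True        # equal lengths + no negatives => all counts are zero

print(is_anagram("state", "taste"))   # True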

Efficient algorithm to find most common phrases in a large volume of text

I am thinking about writing a program to collect for me the most common phrases in a large volume of text. Had the problem been reduced to just finding words, then it would be as simple as storing each new word in a hashmap and then increasing the count on each occurrence. But with phrases, storing each permutation of a sentence as a key seems infeasible.
Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").
If so then I would recommend the following linear-time approach:
Split your text into words and remove things you don't consider significant (i.e. remove capitalisation, punctuation, word breaks, etc.)
Convert your text into an array of integers (one integer per unique word) (e.g. every instance of "cat" becomes 1, every "dog" becomes 2) This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers. If the word is not in the dictionary then assign a new id.
Construct a suffix array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed in linear time, e.g. using the algorithm and C code here).
Construct the longest common prefix (LCP) array for your suffix array (this can also be done in linear time, for example using this C code). The LCP array gives the number of common words at the start of each consecutive pair of suffixes in the suffix array.
You are now in a position to collect your common phrases.
It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array looking at places where the longest common prefix array is >= 4. Each run of indices x in the range [start+1...start+len] where the LCP[x] >= 4 (for all except the last value of x) corresponds to a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.
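Here is a compact Python sketch of this pipeline. For brevity it builds the suffix array by simply sorting the suffixes and computes the LCP array naively, instead of the linear-time constructions linked above, and the sample text and variable names are my own:

import re
from collections import defaultdict

text = "the cat sat on the mat and the cat sat on the sofa"

# Steps 1-2: split into words and map each distinct word to an integer id.
words = re.findall(r"[a-z']+", text.lower())
ids = defaultdict(lambda: len(ids))
seq = [ids[w] for w in words]

# Step 3: suffix array (sorted suffix start positions; O(n^2 log n) here, fine for a demo).
sa = sorted(range(len(seq)), key=lambda i: seq[i:])

# Step 4: LCP array, computed naively between consecutive suffixes in sa.
def lcp(i, j):
    n = 0
    while i + n < len(seq) and j + n < len(seq) and seq[i + n] == seq[j + n]:
        n += 1
    return n
lcps = [0] + [lcp(sa[k - 1], sa[k]) for k in range(1, len(sa))]

# Step 5: collect repeated 4-word phrases: runs of consecutive LCP values >= 4.
K = 4
k = 1
while k < len(sa):
    if lcps[k] >= K:
        start = sa[k - 1]
        count = 2
        while k + 1 < len(sa) and lcps[k + 1] >= K:
            count += 1
            k += 1
        print(' '.join(words[start:start + K]), 'x', count)
    k += 1
# prints: "the cat sat on x 2" and "cat sat on the x 2"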
