What is the purpose of retaining first letter in Soundex? - algorithm

Why do we retain the first letter in Soundex?
What happens if we assign a numeric value to it and compare the full numerically converted string?

Related

{ w | at every odd position of w is a 1}

The task is to construct a DFA for this language over the alphabet {0,1}.
I have constructed a DFA that consists of 4 states and does not accept the empty word. However, the answers give a 3-state DFA that accepts it.
Why should my DFA accept the empty word? The empty word has no 1 at an odd position, so shouldn't it be outside the language?
The only requirement is that any symbol at an odd position must be 1. There is no requirement for a particular number of symbols, and specifically not that there be at least one.
Therefore, a DFA with an initial state where 0 leads to a rejection state and where 1 leads to a second state which accepts either symbol and returns to the start would be an acceptable answer, and would accept the empty string. This would be a three-state machine:
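A minimal Python sketch of that three-state machine may make the behavior on the empty string easier to see. The state names `odd`, `even`, and `dead` are labels chosen here for illustration; they do not come from the answer.

```python
# "odd":  the next symbol to be read sits at an odd position (initial state)
# "even": the next symbol to be read sits at an even position
# "dead": a 0 was seen at an odd position, so the word is rejected
TRANSITIONS = {
    ("odd", "0"): "dead",
    ("odd", "1"): "even",
    ("even", "0"): "odd",
    ("even", "1"): "odd",
    ("dead", "0"): "dead",
    ("dead", "1"): "dead",
}
ACCEPTING = {"odd", "even"}


def accepts(word):
    """Run the DFA on a string over the alphabet {0, 1}."""
    state = "odd"
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING
```

Since the initial state is accepting and the loop body never runs for an empty input, `accepts("")` returns `True`, while `accepts("0")` ends in the rejecting state.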
I think you are confused about why the empty string should be part of the set.
Let's take a look at another example. Suppose you have the set of all strings in which every character is 0. Such strings would be 0, 00, 000, 00000, etc. What about the empty string? It belongs to this set as well, because the empty string does not violate the definition of the set.
Compare this example with yours. You check every odd position of the string, and if you find anything other than 1 there, the string is not an element of your set. Nothing is said about whether a string must have an odd position to check in the first place.

Using primes to determine anagrams faster than looping through?

I had a telephone interview recently for an SE role and was asked how I'd determine whether two words are anagrams. My reply involved something along the lines of taking each character, iterating over the other word, exiting the loop if the character exists, and so on. I think it was an N^2 solution, as there is one loop per word with an inner loop for the comparison.
After the call I did some digging and wrote a new solution, one that I plan on handing over tomorrow at the next-stage interview. It uses a hash map with a unique prime number representing each character of the alphabet.
I'm then looping through the list of words, calculating the value of the word and checking to see if it compares with the word I'm checking. If the values match we have a winner (the whole mathematical theorem business).
It means one loop instead of two which is much better but I've started to doubt myself and am wondering if the additional operations of the hashmap and multiplication are more expensive than the original suggestion.
I'm 99% certain the hash map is going to be faster but...
Can anyone confirm or deny my suspicions? Thank you.
Edit: I forgot to mention that I check the size of the words first before even considering doing anything.
An anagram contains all the letters of the original word, in a different order. You are on the right track to use a HashMap to process a word in linear time, but your prime number idea is an unnecessary complication.
Your data structure is a HashMap that maintains the counts of various letters. You can add letters from the first word in O(n) time. The key is the character, and the value is the frequency. If the letter isn't in the HashMap yet, put it with a value of 1. If it is, replace it with value + 1.
When iterating over the letters of the second word, subtract one from your count instead, removing a letter when it reaches 0. If you attempt to remove a letter that doesn't exist, then you can immediately state that it's not an anagram. If you reach the end and the HashMap isn't empty, it's not an anagram. Else, it's an anagram.
Alternatively, you can replace the HashMap with an array. The index of the array corresponds to the character, and the value is the same as before. It's not an anagram if a value drops to -1, and it's not an anagram at the end if any of the values aren't 0.
You can always compare the lengths of the original strings, and if they aren't the same, then they can't possibly be anagrams. Including this check at the beginning means that you don't have to check if all the values are 0 at the end. If the strings are the same length, then either something will produce a -1 or there will be all 0s at the end.
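The counting approach described above could be sketched roughly as follows in Python. The function name `is_anagram` is made up for this example, and a `dict` stands in for the HashMap; an array indexed by letter would work the same way.

```python
def is_anagram(first, second):
    """Return True if `second` uses exactly the letters of `first`."""
    # Different lengths can never be anagrams, and checking this up front
    # means the counts do not need to be inspected again at the end.
    if len(first) != len(second):
        return False

    counts = {}
    for ch in first:
        counts[ch] = counts.get(ch, 0) + 1

    for ch in second:
        remaining = counts.get(ch, 0)
        if remaining == 0:
            # This letter is missing or already used up (the "-1" case).
            return False
        counts[ch] = remaining - 1

    return True
```

Each word is scanned once, so the running time is linear in the word length, assuming constant-time dictionary operations.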
The problem with multiplying is that the numbers can get big. For example, if the letter 'c' were mapped to 11, then a word with 10 c's would give 11^10 = 25,937,424,601, which overflows a 32-bit integer.
You could reduce the result modulo some other number, but then you risk having false positives.
If you use big integers, then it will go slowly for long words.
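To illustrate the size problem, here is a small sketch of the prime-product idea. The mapping of a..z to the first 26 primes is an assumption made for this example (under it, 'e' rather than 'c' happens to be 11), and Python integers never overflow, so the 32-bit limit is checked explicitly.

```python
# First 26 primes, assigned to 'a'..'z' in order (an illustrative choice).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

INT32_MAX = 2**31 - 1


def prime_product(word):
    """Multiply the primes of all letters; assumes lowercase a-z only."""
    product = 1
    for ch in word:
        product *= PRIMES[ord(ch) - ord("a")]
    return product


def fits_in_int32(word):
    """True if the product would still fit a signed 32-bit integer."""
    return prime_product(word) <= INT32_MAX
```

With this mapping, `prime_product("listen") == prime_product("silent")` holds, but `fits_in_int32("e" * 10)` is already `False`, so a fixed-width implementation would need big integers or a modulus, with the drawbacks mentioned above.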
Alternative solutions are to sort the two words and then compare for equality, or to use a histogram of letter counts as suggested by chrylis in the comments.
The idea is to have an array initialized to zero containing the number of times each letter appears.
Go through the letters in the first word, incrementing the count for each letter. Then go through the letters in the second word, decrementing the count.
If all the counts are zero at the end of this process, then the words are anagrams.
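The sorting alternative mentioned above is short enough to sketch in one line of Python; the function name is hypothetical. It runs in O(n log n) rather than linear time, but avoids any bookkeeping.

```python
def is_anagram_by_sorting(first, second):
    """Two words are anagrams exactly when their sorted letters match."""
    return sorted(first) == sorted(second)
```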

Best data structure to count letter frequencies?

Task:
What is the most common first letter found in all the words in this document?
-unweighted (count a word once regardless of how many times it shows up)
-weighted (count a word separately for each time it shows up)
What is the most common word of a given length in this document?
I'm thinking of using a hashmap to count the most common first letter. But should I use a hashmap for both the unweighted and weighted?
And for most common word of a given length(ex. 5) could I use something more simple like an array list?
For the unweighted count, you need a hash set to keep track of the words you've already seen, as well as a hash map to count the occurrences of each first letter. That is, you need to write:
if words_seen does not contain word
    add word to words_seen
    update hash map with first letter of word
end-if
For the weighted count, you don't need that hash set, because every occurrence of a word counts and it doesn't matter whether you've seen the word before. So you can just write:
update hash map with first letter of word
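Both variants could be computed in a single pass, roughly like this Python sketch. It assumes the document has already been split into non-empty, normalized words; the function name `first_letter_counts` is invented for the example.

```python
from collections import Counter


def first_letter_counts(words):
    """Return (unweighted, weighted) first-letter counters for `words`."""
    unweighted = Counter()  # each distinct word counted once
    weighted = Counter()    # every occurrence counted
    words_seen = set()

    for word in words:
        letter = word[0]
        weighted[letter] += 1
        if word not in words_seen:
            words_seen.add(word)
            unweighted[letter] += 1

    return unweighted, weighted
```

Calling `most_common(1)` on either counter then gives the most frequent first letter for that variant.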
For the most common words, you need a hash map to keep track of all the unique words you see, and the number of times you see the word. After you've scanned the entire document, make a pass through that hash map to determine the most frequent one with the desired length.
You probably don't want to use an array list for the last task, because you want to count occurrences. If you used an array list then after scanning the entire document you'd have to sort that list and count frequencies. That would take more memory and more time than just using the hash map.
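A possible sketch of the last task, again assuming a pre-tokenized list of words and a hypothetical function name: count every word in a hash map, then make one pass over the map looking only at words of the desired length.

```python
from collections import Counter


def most_common_word_of_length(words, length):
    """Return the most frequent word with `length` letters, or None."""
    counts = Counter(words)  # word -> number of occurrences

    best = None
    for word, count in counts.items():
        if len(word) == length and (best is None or count > counts[best]):
            best = word
    return best
```

Ties are resolved in favor of whichever word was seen first, which is an arbitrary choice for this example.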

Best way to implement partial key hashing

I appeared for an interview where I was asked to write an algorithm for partial key hashing, i.e., if ABCBC is inserted in the hash, then searching for any of its substrings should return the value stored.
My answer was to create a collection of all possible substrings of a given key and maintain a mapping from each substring to its one or more parent strings. Then maintain a BST of the collection of all substrings. Each node will point to a collection of actual values which that substring might match.
For example:
A, AB, ABC, ABCB, ABCBC, B, BC, BCB, BCBC, C, CB, CBC are the possible substrings of this string. There may also be other strings, like BAB, of which AB and B are substrings.
So given AB, it will map to two strings, BAB and ABCBC.
Is there a more efficient way?
Thanks
Store each substring in the hash, with a note for whether it is final, and the possible next characters and previous characters. Store previous characters for all words that could have this substring in the middle, and next characters for all words that have this substring as their start.
Thus the entry for A does not need to list all words with A in them, but it is easy enough to build that list if you want to. Also, during an insert, as you work down through shorter and shorter substrings, you can stop as soon as you find that the current substring is already stored with the current continuation.
Assuming that you have many words with the same letters, this will save some on space and inserts, at the cost of making actually generating the list slower. Worst case is still O(n*n) for an n letter string though.
To delete you can follow a similar procedure, stopping deletes at any substring that has other substrings coming into it.
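For reference, the straightforward baseline that this scheme tries to improve on, a map from every substring to the keys containing it, could look like the Python sketch below. This is the simple approach from the question, not the next/previous-character structure described in this answer, and the class and method names are made up for the example.

```python
from collections import defaultdict


class SubstringIndex:
    """Map every substring of each inserted key to that key's value."""

    def __init__(self):
        self._keys_by_substring = defaultdict(set)
        self._values = {}

    def insert(self, key, value):
        self._values[key] = value
        # An n-letter key has O(n*n) substrings.
        for start in range(len(key)):
            for end in range(start + 1, len(key) + 1):
                self._keys_by_substring[key[start:end]].add(key)

    def search(self, fragment):
        """Return {key: value} for every key that contains `fragment`."""
        keys = self._keys_by_substring.get(fragment, ())
        return {key: self._values[key] for key in keys}
```

After inserting ABCBC and BAB, a search for AB returns both entries, matching the example in the question.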

When using XPath to find variable, taking only a piece of that variable

I am new to XPath. I am writing code to grab all the 3-digit numbers from a page. They are not constant; examples are 105, 515, and 320. I want to be able to tokenize these numbers into two separate pieces...
I would love to be able to grab the first digit in one XPath expression
and
the second two digits in a second XPath expression.
While doing my research I read that you can't tokenize on a 'zero value,' but is there any way to do this?
Thanks
It seems to me that the question is actually about the possible ways to split a 3-digit number into two strings, the first containing the first digit and the second containing the remaining two digits.
Here is one possible solution:
The following XPath expression when evaluated produces a string containing the first digit of a number $vNum (in an actual XPath expression, substitute $vNum with the XPath expression that produces this value):
substring($vNum, 1, 1)
The following XPath expression when evaluated produces a string containing the last two digits of a 3-digit number $vNum (in an actual XPath expression, substitute $vNum with the XPath expression that produces this value):
substring($vNum, 2)
If we are not sure how many digits $vNum has, the following XPath expression when evaluated produces a string containing the two digits that immediately follow the first digit of a 3+ digit number $vNum (in an actual XPath expression, substitute $vNum with the XPath expression that produces this value):
substring($vNum, 2, 2)
And lastly, if we again don't know the exact number of digits, but want to get the last two of them, the following XPath expression when evaluated produces a string containing the two digits at the end of a 2+ digit number $vNum (in an actual XPath expression, substitute $vNum with the XPath expression that produces this value):
substring($vNum, string-length($vNum) - 1)
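To check what these four expressions return, the 1-based semantics of XPath's substring() can be mimicked in a few lines of Python. This is only an illustration: `xpath_substring` is a hypothetical helper, it handles integer arguments only, and it is not part of any XPath library.

```python
def xpath_substring(s, start, length=None):
    """Mimic XPath 1.0 substring() for integer arguments (1-based)."""
    first = max(start, 1)          # positions before 1 do not exist
    if length is None:
        return s[first - 1:]
    end = start + length           # first position NOT included
    if end <= first:
        return ""
    return s[first - 1:end - 1]


num = "515"
first_digit = xpath_substring(num, 1, 1)            # substring($vNum, 1, 1)
last_two = xpath_substring(num, 2)                  # substring($vNum, 2)
next_two = xpath_substring("51507", 2, 2)           # substring($vNum, 2, 2)
tail_two = xpath_substring("51507", len("51507") - 1)
```

For the 3-digit input 515 this yields the strings 5 and 15; for the longer input 51507 the last two calls yield 15 and 07 respectively.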

Resources