Turing Machine Algorithm

Could you please help me? I need to write code for a one-tape Turing Machine that uses the two-letter alphabet {a, b}.
The program should output the common prefix of the two words.
For example:
g(aab,aaaba) -> aa; g(_,abab) -> _; g(aaba,baa) -> _; g(_,_) -> _; g(babaab,babb) -> bab
where g is the function computed by the machine, an underscore means the empty word, and the two words are separated by a space.
I tried to implement the following approach:
If at the start we see the letter a, we erase it and move to the beginning of the second word. If we also see the letter a there, we erase it too, and after both words we write an a, separated by a space. After that we return to the beginning of the first word and repeat this operation. When the first letter of the first word and the first letter of the second no longer match, we erase everything that is left.
But I have some trouble with the code, because after each operation the space between the two words gets longer and I don't know how to control this. There is also a problem when the first or the second word is itself the whole common prefix, like this:
g(baa,baabab) -> baa

Your approach seems reasonable. From your description it sounds like you just have trouble generalizing some of the individual steps.
For instance, to deal with the growing spaces between the two words, remember that at any time in the program, the two words are separated by one or more spaces. So implement your seek operation for that general case.
For the common prefix case you have to deal with the situation that you eventually run out of characters to compare. So after deleting the character from the first word, while seeking for the beginning of the second word, check whether the first character you pass over is a letter or a space. If it's a space, you're in the prefix case and need to take care that you don't try to seek back to the first word later, because you already erased all of it and there's only spaces left. Similarly, if the second word is the prefix, you can detect this when seeking to the output.
Try breaking the algorithm down into its essential steps and test each of those steps in isolation. It is much easier to make sure you handle the corner cases correctly when you can focus on a simple step in isolation, instead of having to test it as part of the larger algorithm. Doing this is an essential skill in debugging code, so consider this a good exercise for that. Even if it seems painful at first, make sure you have a structured approach to analyzing problems and breaking your code down into smaller parts, and you will be able to fix any problems eventually. Happy coding!
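To make the seek steps concrete, here is a minimal sketch in Python - not your machine, just the "skip over one or more blanks" step in isolation. The simulator, the state names, and the use of '_' as the blank symbol are assumptions made for the example.

def run_tm(tape, transitions, state, blank='_', max_steps=100):
    # a tiny one-tape simulator: look up (state, symbol), write, move, repeat
    tape, head = list(tape), 0
    for _ in range(max_steps):
        symbol = tape[head] if head < len(tape) else blank
        if (state, symbol) not in transitions:   # no rule: halt
            break
        state, write, move = transitions[(state, symbol)]
        while head >= len(tape):
            tape.append(blank)
        tape[head] = write
        head += 1 if move == 'R' else -1
    return state, head, ''.join(tape)

# Fragment: from state 'seek', keep moving right over however many blanks
# separate the words, and switch to 'found' once a letter is reached.
transitions = {
    ('seek', '_'): ('seek', '_', 'R'),
    ('seek', 'a'): ('found', 'a', 'R'),
    ('seek', 'b'): ('found', 'b', 'R'),
}

for tape in ['_aab', '___aab']:
    print(run_tm(tape, transitions, 'seek'))
# the same three rules handle a gap of any width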

Related

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single-letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match, just leave it alone.) A rough sketch of these steps follows below.
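Here is a rough sketch of those three steps in Python; the '§' marker and the tiny word list are stand-ins, and a real run would load a full English dictionary sorted longest-first.

DICTIONARY = sorted(["expression", "terminology", "summary", "syntax", "of"],
                    key=len, reverse=True)

def glue_single_letters(text):
    # step 1: remove the space before each single-letter token and mark the result
    out = []
    for tok in text.split():
        if len(tok) == 1 and out:
            if not out[-1].startswith("§"):
                out[-1] = "§" + out[-1]
            out[-1] += tok
        else:
            out.append(tok)
    return out

def split_marked(word):
    # steps 2-3: greedily break a marked run into the longest dictionary words
    pieces, rest = [], word
    while rest:
        match = next((w for w in DICTIONARY if rest.lower().startswith(w)), None)
        if match is None:          # no match: leave the remainder alone
            pieces.append(rest)
            break
        pieces.append(rest[:len(match)])
        rest = rest[len(match):]
    return " ".join(pieces)

def degroup(text):
    return " ".join(split_marked(t[1:]) if t.startswith("§") else t
                    for t in glue_single_letters(text))

print(degroup("Expr e s s i o n Syntax"))   # Expression Syntax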
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
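For a flavour of the more sophisticated approach, here is a much simplified stand-in in Python: dynamic programming over possible split points using only unigram word frequencies (no part-of-speech tags and no transition model, so it is not the full sequence-model idea). The frequency counts are made up for the demo.

import math

FREQ = {"play": 50, "for": 200, "forth": 5, "the": 300, "fun": 40, "of": 250, "it": 220}
TOTAL = sum(FREQ.values())

def segment(text, max_word_len=10):
    text = text.lower()
    n = len(text)
    # best[i] = (log-probability, split point) for the best segmentation of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in FREQ and best[j][0] > -math.inf:
                score = best[j][0] + math.log(FREQ[word] / TOTAL)
                if score > best[i][0]:
                    best[i] = (score, j)
    words, i = [], n
    while i > 0:                 # walk back through the chosen split points
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return " ".join(reversed(words))

print(segment("Playforthefunofit"))   # play for the fun of it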
Hope that helps - good luck!

Stem comparison algorithm

I'm writing a program that performs word declension for the Polish language. In this language stems can vary in some cases (because of palatalization, mobile/fleeting e, and other effects).
For example, take the word "karzeł", which is the basic dictionary form. Its stem is also 'karzeł'. But the genitive form of this word is "karła" and its stem is "karł". We can see here that the 'e' disappeared and 'rz' changed to 'r'.
Another example:
'uzda' -> stem 'uzd'
'uździe' -> stem 'uździ'
Alternation: 'zd' -> 'ździ'
I'd like to store only the basic form of the stem in the dictionary ('karzeł' and 'uzd'), and when I give my program the stem 'karł' or 'uździ' it should find the proper basic stem. Alternations take place only at the end of the stem and affect at most its last 4 letters.
Are there any algorithms that could do that? Levenshtein distance treats all letters equally, so if I type the word 'barzeł' then the distance to the stem 'karzeł' will be less than to the stem 'karł'.
I also thought about neural networks, but I'm not sure how to encode the words (give each stem variation a different id?).
Another idea is to write an algorithm that performs something like a reverse alternation, creating a set of possible basic stems and trying to find them in the dictionary.
I would like to highlight that I only want to store the basic form of the stem and compute everything else on the fly.
First of all, I remember seeing a number of projects on Polish morphology around. So I would look at them first, before starting one of your own.
Regarding Levenshtein, as Pierre correctly noted in the comment, the distance function can be customized. And it should be. Let me put it this way: think of Levenshtein not as an algorithm in and of itself, but as a solution to a specific error model. The model says that when you are typing a word, every letter can be dropped or replaced by another one due to some random process (fingers not pressing the right keys). The algorithm is then just a generator of maximum-likelihood solutions under this model. The more errors you allow, the smaller the probability that this exact sequence of errors actually happened, and the bigger the score.
You (implicitly) state a very different hypothesis, though. That Polish stems may have certain flexibility at the end (some linguistic process that you do not fully understand within this framework). Then, when you strip your suffix (or something that looks like one), there are three options:
1) there is a chance that what you have here is just a different form of a stem you have stored in your dictionary, or
2) it is a completely different stem, or
3) you've stripped your suffix improperly and what you have is not a stem at all.
You can heuristically estimate these probabilities by looking at how many letters in the beginning of the supposed stem match some dictionary entries, for example (how to find these entries is a related but different question). And then you can pick the guess that is the most plausible according to your metric/heuristic.
Now, note that you can use any algorithm to find the candidates in the dictionary, including the Levenshtein algorithm - as long as you are reasonably sure that the right ones will be picked up. But obviously you are better off writing your own dictionary search algorithm that follows your own metric or emulates it - for example, by giving a prohibitive cost to changing letters at the beginning of the word and reducing the cost as you go towards the end. (A sketch of this idea follows below.)
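To illustrate that last suggestion, here is a minimal sketch of a position-weighted Levenshtein distance in Python. The exact weighting (edits within the last four positions cost 1, earlier edits cost 10, echoing the "at most 4 letters" observation in the question) is my own assumption.

def weighted_levenshtein(a, b, end_window=4):
    # edits near the start of a word are heavily penalized; edits near the end are cheap
    def cost(pos, length):
        return 1.0 if pos >= length - end_window else 10.0

    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(i - 1, m)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(j - 1, n)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else max(cost(i - 1, m), cost(j - 1, n))
            d[i][j] = min(d[i - 1][j] + cost(i - 1, m),    # delete from a
                          d[i][j - 1] + cost(j - 1, n),    # insert into a
                          d[i - 1][j - 1] + sub)           # substitute
    return d[m][n]

print(weighted_levenshtein("karł", "karzeł"))    # cheap: the alternation is at the end
print(weighted_levenshtein("barzeł", "karzeł"))  # prohibitive: the mismatch is at the start

Plain Levenshtein would rank 'barzeł' closer to 'karzeł' than 'karł' is; with start-weighted costs, candidates that differ only near the end win instead.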

Visualizing and Solving recursive questions without a computer

Say I have something like this (predict the output):

void abc(char *s) {
    if (s[0] == '\0')
        return;
    abc(s + 1);
    abc(s + 1);
    printf("%c ", s[0]);
}
It's not tough to solve, but I take too much time doing it and I have to redo such questions 2-3 times because I lose track of the recursion and the values of variables (especially when there are 2-3 such recursive calls).
Is there any good method to use when one has to solve such questions?
The basic technique is to first start with a small input. Then try with one larger. Then try with one larger than that. For recursive functions, a pattern should emerge that lets you predict what the next one will look like given you know what the previous one looked like.
So, let's start with an empty string. Easy, nothing is printed.
input: ""
output:
Next is a string of length one. Almost as easy, the two recursive calls each do nothing (empty string case), and then the string's character is printed.
input: "z"
output: z
Next is a string of length two. Each of the recursive calls ends up printing the second character (string of length one case), and then the first character is printed.
input: "yz"
output: zzy
So, let's try to predict what will happen for the string of length three case. What will happen is that the substring that excludes the first character gets worked on twice, then the first character is printed. That substring is the string of length two case. So:
input: "xyz"
output: zzyzzyx
So, it should be clear now how to derive the next output sequence given the current output sequence.
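If you want to check a hand-traced prediction, the same recursion is easy to mirror in Python; the output length follows the pattern 2^n - 1 (1, 3, 7, ...) for a string of length n.

def abc(s):
    if not s:                   # corresponds to s[0] == '\0'
        return ""
    rest = abc(s[1:])           # the two recursive calls on s + 1...
    return rest + rest + s[0]   # ...and then s[0] is printed last

for s in ["", "z", "yz", "xyz"]:
    print(repr(s), "->", abc(s))
# ''    ->
# 'z'   -> z
# 'yz'  -> zzy
# 'xyz' -> zzyzzyx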
The easiest examples for analyzing recursion are the Fibonacci and factorial functions.
Working through them will help you analyze recursive functions in a better manner. Whenever you lose track of a recursive function, just recall these examples.
Take a stack of index cards of an appropriate size. Start tracing the initial call to the recursive function. When you get another call start a new index card and either put it in front of the first card or behind it (as appropriate). Sooner or later you will (unless you are tracing an infinite recursion) trace the execution of a call which does not make a recursive call, in which case copy the return value back to the card you came from.
It's probably a good idea to include 'go to card X' and 'came from card Y' on your cards.
In complicated situations you might find it useful to create more than one stack of cards to trace your function calls - oh, what the heck, why not call them call stacks.

Phonetically Memorable Password Generation Algorithms

Background
While at the Gym the other day, I was working with my combination lock, and realized something that would be useful to me as a programmer. To wit, my combination is three separate sets of numbers that either sound alike, or have some other relation that makes them easy to remember. For instance, 5-15-25, 7-17-2, 6-24-5. These examples seem easy to remember.
Question
How would I implement something similar for passwords? Yes, they ought to be hard to crack, but they also should be easy for the end user to remember. Combination locks do that with a mix of numbers that have similar sounds and numbers that have similar properties (7-17-23: all prime; 17 rolls right off the tongue after 7; and 23 is another prime and is, out of that set, the 'hard' one to remember).
Criteria
The password should be easy to remember. Dog!Wolf is easy to remember, but once an attacker knows that your website gives out that combination, it becomes infinitely easier to check.
The words or letters should follow the same sounds, for the most part.
At least 8 letters
Do not use !##$%^&*();'{}_+<>?,./ - these punctuation marks, while appropriate for 'hard' passwords, do not have an 'easy to remember' sound.
Resources
This question is language-agnostic, but if there's a specific implementation for C#, I'd be glad to hear of it.
Update
A few users have said that 'this is bad password security'. Don't assume that this is for a website. This could just be for me to make an application for myself that generates passwords according to these rules. Here's an example.
The letters A-C-C-L-I-M-O-P 'flow', and they happen to be two regular words put together (Acclimate and Mop). Further, when a user says these letters, or says them as a word, it's an actual word for them. Easy to remember, but hard to crack (dictionary attack, obviously).
This question has a two-part goal:
Construct Passwords from letters that sound similar (using alliteration) or
Construct Passwords that mesh common words similarly to produce a third set of letters that is not in a dictionary.
You might want to look at:
The pronounceable password generation algorithm used by apg and explained in FIPS-181
Koremutake
First of all make sure the password is long. Consider using a "pass-phrase" instead of a single "pass-word". Breaking pass-phrases like "Dogs and wolves hate each other." is very hard yet they are quite easy to remember.
Some sites may also give you an advice which may be helpful, like Strong passwords: How to create and use them (linked from Password checker, which is a useful tool on its own).
Also, instead of trying to create easy to remember password, in some cases a much better alternative is to avoid remembering the password at all by using (and educating your users to use) a good password management utility (see What is your favourite password storage tool?) - when doing this, the only part left is to create a hard to crack password, which is easy (any long enough random sentence will do).
I am surprised no one has mentioned the Multics algorithm described at http://www.multicians.org/thvv/gpw.html , which is similar to the FIPS algorithm but based on trigraphs rather than digraphs. It produces output such as
ahmouryleg
thasylecta
tronicatic
terstabble
I have ported the code to python as well: http://pastebin.com/f6a10de7b
You could use Markov chains to generate words that sound like English (or any other language you want) but are not actual words.
The question of easy to remember is really subjective, so I don't think you can write an algorithm like this that will be good for everyone.
And why use short passwords on web sites/computer applications instead of pass phrases? They are easy to remember but hard to crack.
After many years, I have decided to use the first letter of each word in a passphrase. It's practically impossible to crack, versatile for length and restrictions like "you must have a digit", and it's hard to make typing errors.
This works by creating a phrase. A crazy fun vivid topic is useful!
"Stack Overflow aliens landed without using rockets or wheels".
Take the first letter, your password is "soalwurow"
You can type this quickly and accurately since you're not remembering letter by letter, you're just speaking a sentence inside your head.
I also like having words alternate from the left and right side of the keyboard; it gives you a fractionally faster typing speed and a more pleasing rhythm. Notice that in my example, your hands alternate left-right-left-right.
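For what it's worth, the letter-picking step itself is a one-liner; a Python sketch using the example phrase above:

phrase = "Stack Overflow aliens landed without using rockets or wheels"
password = "".join(word[0] for word in phrase.lower().split())
print(password)   # soalwurow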
I have a few times used the following algorithm:
Put all lowercase vowels (from a-z) into an array Vowels
Put all lowercase consonants (from a-z) into another array Consonants
Create a third array Pairs of two letters in such a way, that you create all possible pairs of letters between Vowels and Consonants ("ab", "ba", "ac", etc...)
Randomly pick 3-5 elements from Pairs and concatenate them together as string Password
Randomly pick true or false
If true, remove the last letter from Password
If false, don't do anything
Substitute 2-4 randomly chosen characters in Password with their uppercase equivalents
Substitute 2-4 randomly chosen characters in Password with randomly chosen digits 0-9
Voilà - now you should have a password between 5 and 10 characters long, with upper- and lower-case alphanumeric characters. Having vowels and consonants take turns makes the result semi-pronounceable and thus easier to remember. A rough sketch of these steps follows below.
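Here is a rough Python sketch of those steps. To keep short passwords valid it substitutes exactly two characters with uppercase and two with digits rather than the full 2-4 range; that simplification is mine.

import random
import string

VOWELS = "aeiou"
CONSONANTS = [c for c in string.ascii_lowercase if c not in VOWELS]
PAIRS = ([v + c for v in VOWELS for c in CONSONANTS] +
         [c + v for c in CONSONANTS for v in VOWELS])

def generate():
    # pick 3-5 random pairs, then maybe drop the final letter
    password = "".join(random.choices(PAIRS, k=random.randint(3, 5)))
    if random.choice([True, False]):
        password = password[:-1]
    # substitute two characters with uppercase and two with digits
    chars = list(password)
    positions = random.sample(range(len(chars)), k=4)
    for pos in positions[:2]:
        chars[pos] = chars[pos].upper()
    for pos in positions[2:]:
        chars[pos] = str(random.randint(0, 9))
    return "".join(chars)

print(generate())   # e.g. 'ovA3cuD0e' - semi-pronounceable, mixed case plus digits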
FWIW I quite like jumbling word syllables for an easy but essentially random password. Take "Bongo", for example, as a random word. Swap the syllables and you get "Gobong". Swap the o's for zeros on top (or some other common substitution) and you've got an essentially random character sequence with some trail that helps you remember it.
Now how you pick out syllables programmatically - that's a whole other question!
When you generate a password for the user and send it by email, the first thing you should do when they first log in is force them to change their password. Passwords created by the system do not need to be easy to remember because they should only be needed once.
Having easy to remember, hard to guess passwords is a useful concept for your users but is not one that the system should in some manner enforce. Suppose you send a password to your user's gmail account and the user doesn't change the password after logging in. If the password to the gmail account is compromised, then the password to your system is compromised.
So generating easy to remember passwords for your users is not helpful if they have to change the password immediately. And if they aren't changing it immediately, you have other problems.
I prefer giving users a "hard" password, requiring them to change it on the first use, and giving them guidance on how to construct a good, long pass phrase. I would also couple this with reasonable password complexity requirements (8+ characters, upper/lower case mix, and punctuation or digits). My rationale for this is that people are much more likely to remember something that they choose themselves and less likely to write it down somewhere if they can remember it.
A spin on the 'passphrase' idea is to take a phrase and write the first letters of each word in the phrase. E.g.
"A specter is haunting Europe - the specter of communism."
Becomes
asihe-tsoc
If the phrase happens to have punctuation, such as !, ?, etc - might as well shove it in there. Same goes for numbers, or just substitute letters, or add relevant numbers to the end. E.g. Karl Marx (who said this quote) died in 1883, so why not 'asihe-tsoc83'?
I'm sure a creative brute-force attack could capitalise on the statistical properties of such a password, but it's still orders of magnitude more secure than a dictionary attack.
Another great approach is just to make up ridiculous words, e.g. 'Barangamop'. After using it a few times you will commit it to memory, but it's hard to brute-force. Append some numbers or punctuation for added security, e.g. '386Barangamop!'
Here's part 2 of your idea prototyped in a shell script. It takes 4, 5 and 6 letter words (roughly 50,000) from the Unix dictionary file on your computer, and concatenates those words on the first character.
#! /bin/bash
RANDOM=$$
WORDSFILE=./simple-words
DICTFILE=/usr/share/dict/words

# build a list of lowercase words 4 to 6 letters long
grep -ve '[^a-z]' ${DICTFILE} | grep -Ee '^.{4,6}$' > ${WORDSFILE}
N_WORDS=$(wc -l < ${WORDSFILE})

for i in $(seq 1 20); do
    password=""
    # keep growing until the password is at least 8 characters and not itself a dictionary word
    while [ ! "${#password}" -ge 8 ] || grep -qe"^${password}$" ${DICTFILE}; do
        while [ -z "${password}" ]; do
            # seed with a random word from the list
            password="$(sed -ne "$(( (150 * $RANDOM) % $N_WORDS + 1))p" ${WORDSFILE})"
            builtfrom="${password}"
        done
        # pick a random word containing the current password's first letter...
        word="$(sort -R ${WORDSFILE} | grep -m 1 -e "^..*${password:0:1}")"
        builtfrom="${word} ${builtfrom}"
        # ...and overlap the two words on that letter
        password="${word%${password:0:1}*}${password}"
    done
    echo "${password} (${builtfrom})"
done
Like most password generators, I cheat by outputting them in sets of twenty. This is often defended in terms of "security" (someone looking over your shoulder), but really it's just a hack to let the user pick the friendliest password.
I found that the 4-to-6 letter words from the dictionary file still contain obscure words.
A better source for words would be a written document. I copied all the words on this page and pasted them into a text document, and then ran the following set of commands to get the actual English words.
perl -pe 's/[^a-z]+/\n/gi' ./624425.txt | tr A-Z a-z | sort -u > ./words
ispell -l ./words | grep -Fvf - ./words > ./simple-words
Then I used these 500 or so very simple words from this page to generate the following passwords with the shell script -- the script parenthetically shows the words that make up a password.
backgroundied (background died)
soundecrazy (sounding decided crazy)
aboupper (about upper)
commusers (community users)
reprogrammer (replacing programmer)
alliterafter (alliteration after)
actualetter (actual letter)
statisticrhythm (statistical crazy rhythm)
othereplacing (other replacing)
enjumbling (enjoying jumbling)
feedbacombination (feedback combination)
rinstead (right instead)
unbelievabut (unbelievably but)
createdogso (created dogs so)
apphours (applications phrase hours)
chainsoftwas (chains software was)
compupper (computer upper)
withomepage (without homepage)
welcomputer (welcome computer)
choosome (choose some)
Some of the results in there are winners.
The prototype shows it can probably be done, but the intelligence you require about alliteration or syllable information requires a better data source than just words. You'd need pronunciation information. Also, I've shown you probably want a database of good simple words to choose from, and not all words, to better satisfy your memorable-password requirement.
Generating a single password the first time and every time -- something you need for the Web -- will take both a better data source and more sophistication. Using a better programming language than Bash with text files and using a database could get this to work instantaneously. Using a database system you could use the SOUNDEX algorithm, or some such.
Neat idea. Good luck.
I'm completely with rjh. The advantage of just using the starting letters of a pass-phrase is that it looks random, which makes it damn hard to remember if you don't know the phrase behind it, in case Eve looks over your shoulder as you type the password.
OTOH, if she sees you type about 8 characters, among which 's' twice, and then 'o' and 'r' she may guess it correctly the first time.
Forcing the use of at least one digit doesn't really help; you simply know that it will be "pa55word" or "passw0rd".
Song lyrics are an inexhaustible source of pass-phrases.
"But I should have known this right from the start"
becomes "bishktrfts". 10 letters, even only lowercase gives you 10^15 combinations, which is a lot, especially since there's no shortcut for cracking it. (At 1 million combinations a second it takes 30 years to test all 10^15 combinations.)
As an extra (in case Eve knows you're a Police fan), you could swap e.g. the 2nd and 3rd letter, or take the second letter of the third word. Endless possibilities.
System generated passwords are a bad idea for anything other than internal service accounts or temporary resets (etc).
You should always use your own "passphrases" that are easy for you to remember but that are almost impossible to guess or brute force. For example, the password for my old university account was:
Here to study again!
That is 20 characters using upper and lower case with punctuation. This is an unbelievably strong password and there is no piece of software that could generate a more secure one that is easier to remember for me.
Take a look at the gpw tool. The package is also available in the Debian/Ubuntu repositories.
One way to generate passwords that 'sound like' words would be to use a markov chain. An n-degree markov chain is basically a large set of n-tuples that appear in your input corpus, along with their frequency. For example, "aardvark", with a 2nd-degree markov chain, would generate the tuples (a, a, 1), (a, r, 2), (r, d, 1), (d, v, 1), (v, a, 1), (r, k, 1). Optionally, you can also include 'virtual' start-word and end-word tokens.
In order to create a useful markov chain for your purposes, you would feed in a large corpus of English language data - there are many available, including, for example, Project Gutenberg - to generate a set of records as outlined above. For generating natural language words or sentences that at least mostly follow rules of grammar or composition, a 3rd degree markov chain is usually sufficient.
Then, to generate a password, you pick a random 'starting' tuple from the set, weighted by its frequency, and output the first letter. Then, repeatedly select at random (again weighted by frequency) a 'next' tuple - that is, one that starts with the same letters that your current one ends with, and has only one letter different. Using the example above, suppose I start at (a, a, 1), and output 'a'. My only next choice is (a, r, 2), so I output another 'a'. Now, I can choose either (r, d, 1) or (r, k, 1), so I pick one at random based on their frequency of occurrence. Suppose I pick (r, k, 1) - I output 'r'. This process continues until you reach an end-of-word marker, or decide to stop independently (since most markov chains form a cyclic graph, you can potentially never finish generating if you don't apply an artificial length limitation).
At a word level (eg, each element of the tuple is a word), this technique is used by some 'conversation bots' to generate sensible-seeming nonsense sentences. It's also used by spammers to try and evade spam filters. At a letter level, as outlined above, it can be used to generate nonsense words, in this case for passwords.
One drawback: if your input corpus doesn't contain anything other than letters, neither will your output phrases, so they won't pass most 'secure' password requirements. You may want to apply some post-processing to substitute some characters for numbers or symbols.
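Here is a small letter-level sketch in Python of the chain described above, trained on a tiny stand-in word list ('^' and '$' play the roles of the start-of-word and end-of-word tokens); a real run would train on a large corpus instead.

import random
from collections import defaultdict

CORPUS = ["password", "combination", "remember", "generate", "pronounce",
          "memorable", "alliteration", "syllable", "keyboard", "sentence"]

def train(words, order=2):
    chains = defaultdict(list)
    for w in words:
        w = "^" * order + w + "$"
        for i in range(len(w) - order):
            chains[w[i:i + order]].append(w[i + order])
    return chains

def generate(chains, order=2, max_len=10):
    state, out = "^" * order, []
    while len(out) < max_len:
        nxt = random.choice(chains[state])   # duplicates in the list give frequency weighting
        if nxt == "$":                       # end-of-word token reached
            break
        out.append(nxt)
        state = state[1:] + nxt
    return "".join(out)

chains = train(CORPUS)
print(generate(chains))   # prints a nonsense but roughly pronounceable word (varies per run)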
edit: After answering, I realized that this is in no way phonetically memorable. Leaving the answer anyway b/c I find it interesting. /edit
Old thread, I know... but it's worth a shot.
1) I'd probably build the largest dictionary you can amass. Arrange the words into buckets by part of speech.
2) Then, build a grammar that can make several types of sentences. The "type" of a sentence is determined by its permutation of parts of speech.
3) Randomly (or as close to random as possible), pick a type of sentence. What is returned is a pattern with placeholders for parts of speech (n-v-n would be noun-verb-noun).
4) Pick words at random from each part-of-speech bucket to stand in for the placeholders and fill them in. (The example above might become something like car-ate-bicycle.)
5) Randomly scan each character, deciding whether or not you want to replace it with either a similar-sounding character (or set of characters) or a look-alike. This is the hardest step of the problem.
6) The resultant password would be something like kaR#tebyCICle.
7) Laugh at humorous results like the above that look like "karate bicycle". (A rough sketch of these steps follows below.)
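A quick Python sketch of those steps, with a single fixed n-v-n pattern, made-up word buckets, and a made-up substitution table:

import random

BUCKETS = {"n": ["car", "bicycle", "wolf", "keyboard"],
           "v": ["ate", "chased", "jumbled"]}
SUBS = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "5"}

def sentence_password(pattern="nvn", sub_prob=0.4, caps_prob=0.3):
    # pick a word from the right bucket for each placeholder in the pattern
    words = [random.choice(BUCKETS[p]) for p in pattern]
    chars = []
    for ch in "".join(words):
        if ch in SUBS and random.random() < sub_prob:
            chars.append(SUBS[ch])           # look-alike substitution
        elif random.random() < caps_prob:
            chars.append(ch.upper())
        else:
            chars.append(ch)
    return "".join(chars)

print(sentence_password())   # e.g. 'c@rAteb1cYcl3' ("car ate bicycle")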
I would really love to see someone implement passwords with control characters like "<Ctrl>+N" or even combo characters like "A+C" at the same time. Converting this to some binary equivalent would, IMHO, make password requirements much easier to remember, faster to type, and harder to crack (MANY more combinations to check).

How to elegantly compute the anagram signature of a word in ruby?

Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to Ruby and though I could hack something together, I'm sure this is a one-liner for somebody with more experience of Ruby. I'd be interested to see people's approaches and improve my Ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word that are actual English words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison it is also best to pack the array back together into a String, so you don't have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.
