I have a finite state automaton and I want to see all the patterns it accepts. This automaton models the morphology of Turkish.
Turkish is an agglutinative language. I analyze words and find their morphemes and allomorphs.
For example:
?- analyze('kediler', Allomorphs, Morphemes).
Allomorphs = [kedi, ler],
Morphemes = [noun, plur]
An excerpt from my finite state automaton:
final(qv4b).
final(qv4c).
final(qv5).
final(qv5a).
% First Step
t(q0,noun,qn1).
t(q0,verb,qv1).
This is only a part of the transition graph.
Some transitions have issues, so some of the results are wrong, and at the moment I have to test word by word and morpheme by morpheme myself. I want to write a depth-first search and see all the results.
What path should I follow in order to do this?
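For example, a depth-first enumeration over the t/3 and final/1 facts could look like this (a minimal sketch, not your actual automaton: it assumes q0 is the start state and uses a depth bound so it stays safe even if the automaton has cycles):

% path(State, Labels, Depth): Labels is a sequence of transition labels
% leading from State to some final state, using at most Depth transitions.
path(State, [], _) :-
    final(State).
path(State, [Label|Labels], Depth) :-
    Depth > 0,
    t(State, Label, Next),
    Depth1 is Depth - 1,
    path(Next, Labels, Depth1).

% all_patterns(MaxLen, Patterns): every accepted label sequence up to length MaxLen.
all_patterns(MaxLen, Patterns) :-
    findall(P, path(q0, P, MaxLen), Patterns).

A query like ?- all_patterns(6, Ps). should then list every morpheme pattern the automaton currently accepts, which makes it easier to spot wrong transitions than testing word by word.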
Given a piece of text like "This is a test", how can one build a machine learning model that outputs the number of word occurrences? In this example the word count is 4. After training, it should be possible to predict the word count of new text.
I know it is easy to write a program for this directly (like the pseudocode below):
data: punctuation = ['~', '`', '!', '#', '$', '%', '^', '&', '*', ...]
count_words(text):
    tokens = tokenize(text)
    return count of tokens that are not in punctuation
However, in this question we are required to use a machine learning algorithm. I wonder how a machine can learn the concept of counting (currently, we know machine learning is good at classification). Any ideas or suggestions? Thanks in advance.
Failures:
We could use something like word2vec (an encoder) to build word vectors. If we consider a seq2seq approach, we can train on examples like "This is a test <s> 4 <e>" and "This is very very long sentence and the word count is greater than ten <s> 4 1 <e>" (4 1 representing the number 14). However, it does not work, since the attention model is designed to produce similar vectors for corresponding text, for example in translation (This is a test --> 这(this) 是(is) 一个(a) 测试(test)); it is hard to learn a relationship between [this ...] and 4, which is an aggregated number (i.e. the model does not converge).
We know machine learning is good at classification. If we treat "4" as a class, the number of classes is infinite; if we do something tricky and predict count/text.length instead, I have not found a model that fits even the training data set (the model does not converge). For example, a model trained on many short sentences will fail to predict the length of long sentences. It may also be related to an information paradox: we can encode the data in a book as a number 0.x and imagine a machine that marks a position on a rod, splitting it into two parts of lengths a and b where a/b = 0.x, but we cannot actually build such a machine.
What about treating it as a regression problem?
I think it would work quite well, and in the end it would output nearly whole numbers all the time.
You can also train a simple RNN to do the job, assuming you use a one-hot encoding and take the output from the last state.
If V_h is all zeros except at the space index (where it is 1), and V_x likewise, then the network will effectively sum the spaces, and if c is 1 at the end, the output will be the number of words - for every input length!
I think we can treat it as a classification problem, with a character as the input and whether that character is a word breaker as the output.
In other words, at some time point t, we output whether the input character at the same time point is a word breaker (YES) or not (NO). If yes, then increase the word count. If no, then read the next character.
In modern English I don't think there are going to be very long words, so a simple RNN model should perhaps do, without the concern of vanishing gradients.
Let me know what you think!
Use NLTK for counting words:
from nltk.tokenize import word_tokenize
text = "God is Great!"
word_count = len(word_tokenize(text))
print(word_count)
What is the Prolog predicate that helps to show wasteful representations of Prolog terms?
Supplement
In an aside in an earlier Prolog SO answer, IIRC by mat, a Prolog predicate was used to analyze a Prolog term and show how it was overly complicated.
Specifically for a term like
[op(add),[[number(0)],[op(add),[[number(1)],[number(1)]]]]]
it revealed that this has too many [].
I have searched my Prolog questions, looked at the answers twice, and still can't find it. I also recall that it was not in SWI-Prolog but in another Prolog, so instead of installing the other Prolog I was able to use the predicate with an online version of that Prolog.
If you read along in the comments you will see that mat identified the post I was seeking.
What I was seeking
I have one final note on the choice of representation. Please try out the following, using for example GNU Prolog or any other conforming Prolog system:
| ?- write_canonical([op(add),[Left,Right]]).
'.'(op(add),'.'('.'(_18,'.'(_19,[])),[]))
This shows that this is a rather wasteful representation, and at the same time prevents uniform treatment of all expressions you generate, combining several disadvantages.
You can make this more compact for example using Left+Right, or make all terms uniformly available using for example op_arguments(add, [Left,Right]), op_arguments(number, [1]) etc.
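For comparison, the canonical form of the more compact alternative involves no list cells at all (the variable numbering below is only illustrative and will differ between systems):

| ?- write_canonical(Left+Right).
+(_17,_18)

so every expression is a single compound term whose functor names the operation.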
Evolution of a Prolog data structure
If you don't know it already, the question is related to writing a term rewriting system in Prolog that does symbolic math, and I am mostly concentrating on simplification rewrites at present.
Most people only see math expressions in a natural representation
x + 0 + sin(y)
and computer programmers realize that most programming languages have to parse the math expression and convert it into an AST before using it:
add(add(X,0),sin(Y))
but most programming languages cannot work with the AST as written above and have to create their own data structures (see: Compiler/lexical analyzer, Compiler/syntax analyzer, Compiler/AST interpreter).
Now if you have ever done more than dip your toe in the water when learning about Prolog, you will have come across Program 3.30, Derivative rules, which is included in this, though the person did not give attribution.
If you try to roll your own code to do symbolic math with Prolog, you might try using is/2, quickly find that it doesn't work, and then find that Prolog can read the following as a compound term:
add(add(X,0),sin(Y))
This starts to work, until you need to access the name of the functor and find functor/3, at which point we are back to having to parse the input. However, as noted by mat and in "The Art of Prolog", if one were to make the name of the structure accessible, as in
op(add,(op(add,X,0),op(sin,Y)))
then one can access not only the terms of the expression but also the operator in a Prolog-friendly way.
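For instance, simplification rules can then match on the operator name directly. A minimal sketch, assuming the arguments are kept in a list as in mat's op_arguments suggestion (the exact shape of your terms may differ):

% x + 0  ->  x      and      0 + x  ->  x
simplify(op(add, [X, number(0)]), X).
simplify(op(add, [number(0), X]), X).
% anything without a matching rule is left alone
simplify(Term, Term).

So ?- simplify(op(add, [op(sin, [y]), number(0)]), S). yields S = op(sin, [y]).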
If it were not for the aside mat made, the code would still be using the nested list data structure; it is now being converted to use compound terms that expose the name of the structure. I wonder if there is a common phrase to describe that change; if not, there should be one.
Anyway, the new, simpler data structure worked on the first set of tests; now to see if it holds up as the project is developed further.
Try it for yourself online
Using GNU Prolog at tutorialspoint.com enter
:- initialization(main).
main :- write_canonical([op(add),[Left,Right]]).
then click Execute and look at the output
sh-4.3$ gprolog --consult file main.pg
GNU Prolog 1.4.4 (64 bits)
Compiled Aug 16 2014, 23:07:54 with gcc
By Daniel Diaz
Copyright (C) 1999-2013 Daniel Diaz
compiling /home/cg/root/main.pg for byte code...
/home/cg/root/main.pg:2: warning: singleton variables [Left,Right] for main/0
/home/cg/root/main.pg compiled, 2 lines read - 524 bytes written, 9 ms
'.'(op(add),'.'('.'(_39,'.'(_41,[])),[]))| ?-
Clean vs. defaulty representations
From The Power of Prolog by Markus Triska
When representing data with Prolog terms, ask yourself the following question:
Can I distinguish the kind of each component from its outermost functor and arity?
If this holds, your representation is called clean. If you cannot distinguish the elements by their outermost functor and arity, your representation is called defaulty, a wordplay combining "default" and "faulty". This is because reasoning about your data will need a "default case", which is applied if everything else fails. In addition, such a representation prevents argument indexing, and is considered faulty due to this shortcoming. Always aim to avoid defaulty representations! Aim for cleaner representations instead.
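As a small, made-up illustration (not from the book): representing integer leaves directly is defaulty, because an integer cannot be told apart from other terms by its outermost functor and arity, while wrapping every kind of component in its own functor is clean.

% Defaulty: needs a non-structural test, and argument indexing is lost.
eval_defaulty(X, X) :-
    integer(X).
eval_defaulty(A+B, V) :-
    eval_defaulty(A, VA),
    eval_defaulty(B, VB),
    V is VA + VB.

% Clean: every kind of component is distinguishable by functor and arity.
eval_clean(n(X), X).
eval_clean(add(A, B), V) :-
    eval_clean(A, VA),
    eval_clean(B, VB),
    V is VA + VB.

For example, ?- eval_clean(add(n(1), n(2)), V). gives V = 3.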
Please see the last part of:
https://stackoverflow.com/a/42722823/1613573
It uses write_canonical/1 to display the canonical representation of a term.
This predicate is very useful when learning Prolog and helps to clear several misconceptions that are typical for beginners. See for example the recent question about hyphens, where it would have helped too.
Note that in SWI, the output deviates from canonical Prolog syntax in general, so I am not using SWI when explaining Prolog syntax.
You could also programmatically count how many subterms are a single-element list, using something like this (not optimized):
single_element_list_subterms(Term, Index) :-
    Term =.. [Functor|Args],                 % decompose into functor and arguments
    (   Args = []                            % atomic term: nothing below to count
    ->  Index = 0
    ;   maplist(single_element_list_subterms, Args, Indices),
        sum_list(Indices, SubIndex),         % counts found in the subterms
        (   Functor = '.', Args = [_, []]    % a '.'/2 cell with [] tail, i.e. [X]
        ->  Index is SubIndex + 1
        ;   Index = SubIndex
        )
    ).
Trying it on the example compound term:
| ?- single_element_list_subterms([op(add),[[number(0)],[op(add),[[number(1)],[number(1)]]]]], Count).
Count = 7
yes
| ?-
Indicating that there are 7 subterms consisting of a single-element list. Here is the result of write_canonical:
| ?- write_canonical([op(add),[[number(0)],[op(add),[[number(1)],[number(1)]]]]]).
'.'(op(add),'.'('.'('.'(number(0),[]),'.'('.'(op(add),'.'('.'('.'(number(1),[]),'.'('.'(number(1),[]),[])),[])),[])),[]))
yes
| ?-
I wanted to know how similar two strings were, and I found a tool on the following page:
https://www.tools4noobs.com/online_tools/string_similarity/
and it says that this tool is based on the article:
"An O(ND) Difference Algorithm and its Variations"
available on:
http://www.xmailserver.org/diff2.pdf
I have read the article, but I have some doubts about how they programmed that tool. For example, the authors say that it is based on the GNU diff C library and analyze.c; maybe that refers to this:
https://www.gnu.org/software/diffutils/
and this:
https://github.com/masukomi/dwdiff-annotated/blob/master/src/diff/analyze.c
The problem I have is understanding the relation to the article. From what I read, the article presents an algorithm for finding the LCS (longest common subsequence) of a pair of strings, using a modification of the dynamic programming algorithm usually used to solve this problem: a shortest-path formulation that finds the LCS corresponding to the minimum number of modifications.
At this point I am lost, because I do not know how the authors of the tool I first mentioned used the LCS to determine how similar two sequences are. Also, they use a limit value of 0.4; what does that mean? Can anybody help me with this, or have I misunderstood the article?
Thanks
I think the description on the string similarity tool is not being entirely honest, because I'm pretty sure it has been implemented using the Perl module String::Similarity. The similarity score is normalised to a value between 0 and 1, and as the module page describes, the limit value can be used to abort the comparison early if the similarity falls below it.
If you download the Perl module and expand it, you can read the C source of the algorithm, in the file called fstrcmp.c, which says that it is "Derived from GNU diff 2.7, analyze.c et al.".
The connection between the LCS and string similarity is simply that those characters that are not in the LCS are precisely the characters you would need to add, delete or substitute in order to convert the first string to the second, and the number of these differing characters is usually used as the difference score, as in the Levenshtein Distance.
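To make that connection concrete, here is a naive Prolog sketch (my own code, exponential and only usable for short strings; the actual module computes an equivalent score far more efficiently). My understanding is that the score is essentially 2*|LCS| / (|A| + |B|), so identical strings score 1 and strings with nothing in common score 0:

% Longest common subsequence of two lists (naive).
lcs([], _, []).
lcs([_|_], [], []).
lcs([X|Xs], [X|Ys], [X|L]) :-
    lcs(Xs, Ys, L).
lcs([X|Xs], [Y|Ys], L) :-
    X \== Y,
    lcs([X|Xs], Ys, L1),
    lcs(Xs, [Y|Ys], L2),
    longer(L1, L2, L).

longer(L1, L2, L) :-
    length(L1, N1),
    length(L2, N2),
    ( N1 >= N2 -> L = L1 ; L = L2 ).

similarity(A, B, Sim) :-
    atom_chars(A, As),
    atom_chars(B, Bs),
    lcs(As, Bs, Lcs),
    length(As, NA),
    length(Bs, NB),
    length(Lcs, NL),
    Sim is 2.0 * NL / (NA + NB).

For example, ?- similarity(kitten, sitting, S). gives S of about 0.615, since the LCS "ittn" has length 4 and 2*4/(6+7) is roughly 0.615.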
I have a sentence in English. Now I want to jumble the words up and feed that set of words into a program which should unscramble them according to the normal rules of English grammar to output the original sentence. I vaguely assume this would require natural language generation algorithms.
For example:
Sentence: Mary has gone for a walk with her dog.
Set of words: {has, for, a, with, her, dog, Mary, gone, walk}
The output should be the same sentence.
I assume the set of words alone will never be enough to regenerate the original sentence. But what more information must be included to recover the original sentence?
Please guide me as to where I should start.
Language models are things that can take in a text or sentence (any sequence of words) and assign it a probability based on how well the model "recognizes" that text.
To solve your problem, you could take a language model and use it to compute the probability of each possible permutation of the input words. The most probable sentence according to the model is probably the most coherent one.
For a situation like yours, an n-gram model (for n ≥ 2; I think 2 or 3 should suffice) or a hidden Markov model leveraging part-of-speech tags should do the trick. A rough sketch of the n-gram idea follows.
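As a rough sketch of the bigram idea in Prolog (the bigram/3 log-probabilities below are made up; in practice they would be estimated from a corpus, and for more than a handful of words you would need something smarter than enumerating all permutations):

% Hypothetical bigram log-probabilities; '<s>' and '</s>' mark sentence boundaries.
bigram('<s>', mary, -1.0).
bigram(mary, has, -1.5).
bigram(has, gone, -1.2).
bigram(gone, '</s>', -2.0).

% Unseen pairs get a large penalty.
pair_score(A, B, S) :- ( bigram(A, B, S) -> true ; S = -10.0 ).

score([_], 0.0).
score([A, B|Rest], S) :-
    pair_score(A, B, S1),
    score([B|Rest], S2),
    S is S1 + S2.

best_order(Words, Best) :-
    findall(S-P,
            ( permutation(Words, P),
              append(['<s>'|P], ['</s>'], Padded),
              score(Padded, S) ),
            Scored),
    keysort(Scored, Sorted),
    last(Sorted, _-Best).

With these toy numbers, ?- best_order([has, mary, gone], Order). gives Order = [mary, has, gone], because that permutation has the highest total bigram score.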
You will not be able to solve your problem without additional information. Take this example:
{"happy", "you" "are"}
Can you reconstruct the sentence? Is it "You are happy" or is it "Are you happy"? Note that the words are the same but the meaning changes radically. No matter how good algorithm you write it will not be able to reconstruct the sentence if you can not.
You need to do the following to get started:
Maintain a dictionary of English words classified as nouns, adjectives, verbs, etc.
Build grammar rules for the English language, which you can get from any English tutorial.
Try to rearrange the words to match the grammar rules.
Note: English is a very ambiguous language, so you might end up with something else.
For example:
grammar rule: article noun verb
input words: dog, barks, the
dictionary lookup: dog => noun, barks => verb, the => article
rearrange the words according to the rule.
There can be multiple rules, and a word can also be of multiple types, so try all possibilities (a small sketch of this approach is given below).
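A minimal sketch of that approach (my own toy dictionary and a single grammar rule; real code would need a large dictionary, many rules, and a way to rank the alternatives):

% toy dictionary
word(the,   article).
word(dog,   noun).
word(barks, verb).

% one grammar rule: article noun verb
matches_rule([W1, W2, W3]) :-
    word(W1, article),
    word(W2, noun),
    word(W3, verb).

rearrange(Words, Sentence) :-
    permutation(Words, Sentence),
    matches_rule(Sentence).

?- rearrange([dog, barks, the], S).
S = [the, dog, barks]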
C program source code can be parsed according to the C grammar (described as a CFG) and turned into an AST. I am wondering whether a tool exists that does the reverse: first randomly generate ASTs according to the CFG, containing tokens that have only token types and no concrete string values, and then generate the concrete tokens according to the tokens' regular-expression definitions.
I imagine the first step as an iterative replacement of non-terminals, done randomly and limited to a certain number of iterations. The second step is just generating random strings according to the regular expressions.
Is there any tool that can do this?
The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.
In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.
Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.
For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.
The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.
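A small sketch of that idea with a toy grammar of my own (not the linked example): a DCG run with an unbound word list enumerates the sentences of the language.

% A toy grammar as a DCG.
sentence    --> noun_phrase, verb_phrase.
noun_phrase --> [the], noun.
verb_phrase --> [sleeps].
verb_phrase --> [sees], noun_phrase.
noun        --> [cat].
noun        --> [dog].

% ?- phrase(sentence, Words).
% Words = [the, cat, sleeps] ;
% Words = [the, cat, sees, the, cat] ;
% Words = [the, cat, sees, the, dog] ;
% ...

For a recursive grammar, wrapping the query as length(Words, _), phrase(sentence, Words) gives fair, length-ordered enumeration instead of going ever deeper down one branch.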
If you have a model of the grammar in a normalized form (all rules like this):
LHS = RHS1 RHS2 ... RHSn ;
and a language prettyprinter (e.g., an AST-to-text conversion tool), you can build one of these pretty easily.
Simply start with the goal symbol as a unit tree.
Repeat until no nonterminals are left:
Pick a nonterminal N in the tree;
Expand by adding children for the right hand side of any rule
whose left-hand side matches the nonterminal N
For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until all the nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your chosen size; a sketch follows.
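A rough Prolog sketch of that bounded expansion (the rule/2 facts are a made-up toy grammar; the budget is decremented on every rule application, so the search terminates and backtracking tries other rules when the budget runs out):

% Hypothetical normalized grammar: rule(LHS, RHS symbols).
rule(expr, [term]).
rule(expr, [expr, '+', term]).
rule(term, [number]).
rule(term, ['(', expr, ')']).

terminal(S) :- \+ rule(S, _).

% expand(Symbols, Words, Budget0, Budget): expand every nonterminal in Symbols.
expand([], [], Budget, Budget).
expand([S|Ss], Words, B0, B) :-
    (   terminal(S)
    ->  Words = [S|Rest],
        expand(Ss, Rest, B0, B)
    ;   B0 > 0,
        B1 is B0 - 1,
        rule(S, Body),
        expand(Body, Front, B1, B2),
        expand(Ss, Back, B2, B),
        append(Front, Back, Words)
    ).

generate(Start, Budget, Words) :-
    expand([Start], Words, Budget, _).

% ?- generate(expr, 5, Ws).
% Ws = [number] ;
% Ws = ['(', number, ')'] ;
% ...

On backtracking this enumerates derivations in clause order; to get random ASTs you would pick among the matching rule/2 clauses at random instead, and for terminals that carry values you would still substitute random token text afterwards, as described above.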
Then prettyprint the result. It's the prettyprinter part that is hard to get right.
[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].
A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.
The problem with random generation is that for many CFGs the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar). You have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, weighting each production rule for a non-terminal symbol inversely to the length of its RHS sometimes suffices.
There is a lot more on this subject in:
Noam Chomsky and Marcel-Paul Schützenberger, "The Algebraic Theory of Context-Free Languages", pp. 118-161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963)
(see Wikipedia entry on Chomsky–Schützenberger enumeration theorem)