So I'm currently working as an intern at a company and have been tasked with creating the middle-tier layer of a UI rule editor for an analytical engine. As part of this task I have to ensure that all rules created are valid rules. These rules can be quite complex, consisting of around 10 fields with multiple possibilities for each field.
I'm in way over my head here. I've been trying to find some material to guide me on this task but I can't seem to find much. Is there any pattern or design approach I can take to break this up into more manageable tasks? A book to read? Any ideas or guidance would be appreciated.
You may want to invest the time to learn a lexer/parser generator, e.g. ANTLR 4. You can use the ANTLRWorks 2 IDE to assist with visualization and debugging.
You can get off the ground by searching for example grammars and then tweaking them for your particular needs.
ANTLR can generate parsers in a number of different target languages, so you will likely find one that fits your needs.
This is not a trivial task in any case - but an interesting and rewarding one.
You need to build a validation algorithm for this yourself.
Points to cover:
1.) Validate parameters based on the data types they support and their compatibility.
2.) Define which operators may be followed by an operand of a specific data type.
3.) The result type of a sub-expression must in turn be compatible with the next operand or operator.
Also offer a way to simulate the rule, where the user can select the dataset against which the rule is fired.
a + b > c
Possible combinations:
1.) a and b can each be a string, a number or an integer.
2.) But if the result of a + b is a string, the operator ">" cannot follow.
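To make the type-compatibility idea concrete, here is a minimal sketch in Python that walks an expression tree and infers the result type of each sub-expression. The names `Var`, `BinOp`, `infer_type` and `RuleTypeError` are made up for illustration, and the policy that a string cannot be added to a number is an assumption; a real rule editor would encode whatever its engine actually allows.

```python
from dataclasses import dataclass
from typing import Union

NUMERIC = {"integer", "number"}


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Var, BinOp]


class RuleTypeError(Exception):
    """Raised when a rule combines incompatible types."""


def infer_type(expr, env):
    """Return the result type of expr; env maps field names to type names."""
    if isinstance(expr, Var):
        if expr.name not in env:
            raise RuleTypeError(f"unknown field {expr.name!r}")
        return env[expr.name]
    left = infer_type(expr.left, env)
    right = infer_type(expr.right, env)
    if expr.op == "+":
        if left == "string" and right == "string":
            return "string"
        if left in NUMERIC and right in NUMERIC:
            return "integer" if left == right == "integer" else "number"
        # Assumption: mixing strings and numbers is rejected.
        raise RuleTypeError(f"cannot add {left} and {right}")
    if expr.op == ">":
        if left in NUMERIC and right in NUMERIC:
            return "boolean"
        raise RuleTypeError(f"'>' needs numeric operands, got {left} and {right}")
    raise RuleTypeError(f"unsupported operator {expr.op!r}")


# a + b > c
rule = BinOp(">", BinOp("+", Var("a"), Var("b")), Var("c"))
```

Calling `infer_type(rule, {"a": "integer", "b": "number", "c": "integer"})` accepts the rule, while declaring `a` and `b` as strings raises `RuleTypeError` because the string result of `a + b` cannot be followed by ">".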
I have a text pattern matching problem that I could use some direction with. Not being very familiar with pattern recognition overall, I don't know if this is one of those "oh, just use blah-blah algorithm", or this is a really hard pattern problem.
The general statement of what I want to do is identify similarities between a series of SQL statements, in order to allow me to refactor those statements into a smaller number of stored procedures or other dynamically-generated SQL snippets. For example,
SELECT MIN(foo) FROM bar WHERE baz > 123;
SELECT MIN(footer) FROM bar;
SELECT MIN(foo), baz FROM bar;
are all kind of the same, but I would want to recognize that the value inside the MIN() should be a replaceable value, that I may have another column in the SELECT list, or have an optional WHERE clause. Note that this example is highly cooked up, but I hope it allows you to see what I am after.
In terms of scope, I would have a set of thousands of SQL statements that I would hope to reduce to dozens (?) of generic statements. In research so far, I have come across w-shingles, and n-grams, and have discarded approaches like "bag of words" because ordering is important. Taking this out of the SQL realm, another way of stating this problem might be "given a series of text statements, what is the smallest set of text fragments that can be used to reassemble those statements?"
What you really want is to find code clones across the code base.
There are a lot of ways to do that, but most of them seem to ignore the structure that the (SQL) language brings. That structure makes it "easier" to find code elements that make conceptual sense, as opposed to, say, N-grams (yes, "FROM x WHERE" is common, but it is an awkward chunk of SQL).
My abstract syntax tree (AST) based clone detection scheme parses source text to ASTs, and then finds shared trees that can be parameterized in a way that produces sensible generalizations by using the language grammar as a guide. See my technical paper Clone Detection Using Abstract Syntax Trees.
With respect to OP's example:
It will recognize that the value inside the MIN() should be a replaceable value
that the SELECT singleton column could be extended to a list
and that the WHERE clause is optional
It won't attempt to make those proposals unless it finds two candidate clones that vary in a way these generalizations explain. It gets the generalizations basically by extracting them from the (SQL) grammar. OP's examples have exactly enough variation to force those generalizations.
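As a rough illustration of how two similar trees can be turned into one parameterized tree, here is a toy sketch in Python. It is not the tool described above: the trees are hand-written nested tuples rather than the output of a real SQL parser, the `generalize` function is hypothetical, and it only handles the "replaceable value" case, not optional clauses or list extension.

```python
def generalize(a, b, holes):
    """Return a tree that matches both a and b, with ?N placeholders where they differ.

    Trees are tuples of the form (node_type, child, child, ...) or plain strings.
    Each differing pair of subtrees is recorded in holes.
    """
    if a == b:
        return a
    same_shape = (
        isinstance(a, tuple) and isinstance(b, tuple)
        and a and b and a[0] == b[0] and len(a) == len(b)
    )
    if same_shape:
        children = tuple(generalize(x, y, holes) for x, y in zip(a[1:], b[1:]))
        return (a[0],) + children
    holes.append((a, b))
    return f"?{len(holes)}"


# Hand-built stand-ins for the ASTs of two of the example queries.
q1 = ("select", ("min", "foo"), ("from", "bar"))
q2 = ("select", ("min", "footer"), ("from", "bar"))

holes = []
template = generalize(q1, q2, holes)
```

Here `template` keeps the shared SELECT/MIN/FROM structure and replaces only the argument of MIN with a placeholder, while `holes` records which concrete values the placeholder stood for.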
A survey of clone detection techniques (Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach) rated this approach at the top of some 30 different clone detection methods; see table 14.
The question is a bit too broad, but I would suggest giving the following approach a shot:
This sounds like a document clustering problem, where you have a set of pieces of text (SQL statements) and you want to cluster them together to find out whether some of the statements are close to each other. Now, the trick here is in the distance measure between text statements. I would try something like edit distance.
So in general the following approach could work:
Do some preprocessing of the SQL statements you have: tokenization, removing some words from statements, etc. Just be careful here: you are not analysing natural language text, these are SQL statements, so you will need some clever approach.
After that, try to write a function which computes the distance between two SQL queries. The edit distance should work for you.
Finally, try to run document clustering on all your SQL queries, using edit distance as the distance measure for the clustering algorithm.
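The three steps above can be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the regex tokenizer is deliberately naive, literals are collapsed to `<STR>` and `<NUM>` placeholders, the distance is a token-level Levenshtein distance normalized by the longer statement, and the greedy clustering with a 0.5 threshold is just the simplest thing that works, not a recommendation.

```python
import re


def tokenize(sql):
    """Split SQL into upper-cased tokens; literals collapse to placeholders."""
    tokens = re.findall(r"'[^']*'|\d+(?:\.\d+)?|\w+|[^\w\s]", sql)
    out = []
    for tok in tokens:
        if tok[0] == "'":
            out.append("<STR>")
        elif tok[0].isdigit():
            out.append("<NUM>")
        else:
            out.append(tok.upper())
    return out


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def normalized_distance(sql_a, sql_b):
    """Distance in [0, 1]: 0 means identical token streams."""
    a, b = tokenize(sql_a), tokenize(sql_b)
    return edit_distance(a, b) / max(len(a), len(b), 1)


def cluster(statements, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters = []
    for stmt in statements:
        for group in clusters:
            if normalized_distance(group[0], stmt) <= threshold:
                group.append(stmt)
                break
        else:
            clusters.append([stmt])
    return clusters
```

With the three MIN() queries from the question plus an unrelated UPDATE statement, this puts the three SELECTs in one cluster and the UPDATE in its own. For thousands of statements a proper clustering algorithm (hierarchical, DBSCAN, ...) fed with the same distance function would probably behave better than the greedy loop.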
Hope that helps.
I have been trying some frameworks and algorithms, and I can't find one that does what I want, which is to classify a column of data based on its values.
I tried to use a Bayes algorithm, but it isn't very precise because I can't expect that the data being searched for is in the training set, although I can expect that the pattern is in the training set.
I don't have background in Machine Learning / AI, but I was looking for some working example before really going deeper in the implementation.
I built a smaller ARFF to exemplify. Also tried lots of Weka classifying algorithms but none of them gave me good results.
@relation recommend
@attribute class {name,email,taxid,phone}
@attribute text string
@data
name,'Erik Kolh'
name,'Eric Candid'
name,'Allan Pavinan'
name,'Jubaru Guttenberg'
name,'Barabara Bere'
name,'Chuck Azul'
My expectation is to train on a huge dataset like the one above and get recommendations based on the pattern, e.g.: -> email
Joao Vitor -> name
400-123-5519 -> phone
Can you please suggest any algorithms, examples or ideas to research?
I couldn't find a good fit, maybe it's just lack of vocabulary.
Thank you!
What you are trying to do is called named entity recognition (NER). Weka is most likely not much help here. The library Mallet might be a good fit. I would recommend a Conditional Random Field (CRF) based approach.
If you would like to stay with Weka, you need to change your feature space. Then Naive Bayes will do OK on your data as presented.
E.g. add features for
whether the word has only characters
whether it is alphanumeric
whether it is numeric data
the number of digits
whether it starts capitalized
... (just be creative)
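A feature extractor along those lines could look like the sketch below (Python). The function name `surface_features` and the feature names are invented for illustration, and `has_at_sign` and `token_count` are extra features added in the "be creative" spirit, not something from the list above. The resulting dictionaries would then be written out as numeric/nominal attributes for whatever classifier you use.

```python
def surface_features(value):
    """Describe the shape of a cell value instead of its literal text."""
    text = value.strip()
    tokens = text.split()
    return {
        # every whitespace-separated word consists of letters only
        "only_letters": bool(tokens) and all(t.isalpha() for t in tokens),
        # letters and digits only (spaces ignored)
        "alphanumeric": text.replace(" ", "").isalnum(),
        # digits only
        "numeric": text.isdigit(),
        "digit_count": sum(ch.isdigit() for ch in text),
        "starts_capitalized": text[:1].isupper(),
        # assumed extra features, not from the list above
        "has_at_sign": "@" in text,
        "token_count": len(tokens),
    }
```

The point of the change is that "Joao Vitor" and "Erik Kolh" now produce the same feature vector even though they share no characters, so the classifier learns the pattern rather than the values.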
I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Among other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was imagining to use some metric which can calculate the similarity to a term, e.g. in percent, and then cluster everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it can keep the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such, then at least one calculating a similarity metric for a pair of strings would be great, I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking a Hamming distance divided by word length will be a good metric, but noticed that while it will catch swapped letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but
there is no full text available for it to consider context.
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
Edit: To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to semantic relatedness.
One (hard) way to do it is to follow the approach of Markovitch and Gabrilovich.
A quicker way consists of the following steps:
Download a Wikipedia dump and an open source information retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors in any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words that are written similarly to each other), then you can settle for Levenshtein distance, but it won't be able to give you the semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
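For the spelling-only case, a minimal sketch of Levenshtein-based grouping might look like this. It is written in Python for brevity (the same few lines port directly to C#); the `similarity` normalization by the longer string, the 0.8 threshold and the single-link grouping via union-find are all assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 for unrelated ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_terms(terms, threshold=0.8):
    """Put terms in the same group when a chain of similar pairs connects them."""
    parent = list(range(len(terms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if similarity(terms[i], terms[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, term in enumerate(terms):
        groups.setdefault(find(i), []).append(term)
    return list(groups.values())
```

Unlike Hamming distance, this handles insertions and deletions, so hematopoiesis and haematopoiesis come out as one edit apart. The pairwise loop is quadratic; for a few thousand short terms that is slow in pure Python but still a one-off batch job.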
I'm writing up a Smart Home software for my bachelor's degree, that will only simulate the actual house, but I'm stuck at the NLP part of the project. The idea is to have the client listen to voice inputs (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to take the decision on which object to act, and how to act. So I came up with a few things to do, in order to write up something somewhat efficient.
Get rid of unnecessary words (in the previous example "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does have a fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of a dictionary (XML maybe), and just have a list of possible words for one particular object in the house.
Detecting the verb and subject. "turn on" is the verb, and "lights" is the subject. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
If you don't have a lot of time to spend on the NLP problem, you may use the Wit API, which maps natural language sentences to JSON.
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it to your needs. It should be much more robust than grammar-based approaches, especially because the speech-to-text engine might make mistakes that will break your grammar (but the machine learning module can still get the meaning of the sentence).
I am in no way a pioneer in NLP (I love it though), but let me try my hand at this one. For your project I would suggest you go through the Stanford Parser (SP).
From your problem definition I guess you don't need anything other than verbs and nouns. SP generates POS (part-of-speech) tags that you can use to prune the words that you don't require.
For synonyms, I can't think of any better option than what you have in mind right now.
For detecting the verb and subject, you can again use the grammatical dependency (GD) structure from SP, and I am pretty sure that it is good enough to tackle this problem.
This is where the research part lies. I guess you can find enough patterns using GD and POS tags to come up with an algorithm for your problem. I highly doubt that any algorithm would be efficient enough to handle every kind of input sentence (structured + unstructured), but something that is more than 85% accurate should be good enough for you.
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of English text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
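To show the shape of that mapping, here is a small Python sketch. Be aware that the scoring is only a stand-in: it uses plain word overlap (Jaccard similarity) against hand-written keyword sets instead of a learned model, and it ignores location and time of day. The command identifiers, the keyword sets and the stop-word list are all made up; only the "score every command, execute the best one if it clears a threshold" structure comes from the description above.

```python
# Hypothetical command table: identifier -> words that describe the command.
COMMANDS = {
    "PORCH_LIGHT_ON": {"turn", "on", "porch", "light"},
    "PORCH_LIGHT_OFF": {"turn", "off", "porch", "light"},
    "KITCHEN_LIGHT_ON": {"turn", "on", "kitchen", "light"},
}

# Words assumed to carry no meaning for command selection.
STOP_WORDS = {"please", "the", "a", "an"}


def score_commands(sentence):
    """Return a confidence between 0.0 and 1.0 for every known command."""
    words = set(sentence.lower().split()) - STOP_WORDS
    return {
        cmd: len(words & keywords) / len(words | keywords)
        for cmd, keywords in COMMANDS.items()
    }


def choose_command(sentence, threshold=0.70):
    """Return the best-matching command, or None if nothing is confident enough."""
    scores = score_commands(sentence)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

With this table, "please turn on the porch light" resolves to `PORCH_LIGHT_ON`, while an unrelated sentence such as "what time is it" returns `None` and nothing is executed. Swapping `score_commands` for a trained classifier leaves `choose_command` unchanged.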
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithm to try first would be a supervised one. To generate training data, have a normal GUI, speak your command, then press the corresponding command manually. This forms a single supervised training case. Make some large number of these. Set some aside for testing. It is also unskilled work, so other people can help. You can then use these as the training set for your ML algorithm.
I have a need to build an app (Ruby) that allows the user to select one or more patterns and in case those patterns are matched to proceed and complete a set of actions.
While doing my research I've discovered the new (to me) field of rules based systems and have spent some time reading about it and it seems exactly the kind of functionality I need.
The app will be integrated with different web services and would allow rules like this one:
When Highrise contact is added and Zendesk ticket is created do add email to database
I had two ideas for building this. The first is to build some kind of DSL to specify the rule conditions, and to build the rules on the fly from the user input.
The second one is to build some rule classes each one having a pattern/matcher and action methods. The pattern would evaluate the expression and return true or false and the action would be executed if the match is positive.
The rules will then need to be persisted and then evaluated periodically.
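The second idea (rule objects with a matcher and an action, persisted and evaluated periodically) can be sketched roughly like this. The sketch is in Python rather than Ruby, and every name in it (`Rule`, `register_action`, `ACTIONS`, the event dictionaries with `source` and `kind` keys, the in-memory `SAVED_EMAILS` list standing in for a database) is hypothetical. Conditions and actions are stored as plain data so a rule can be serialized to JSON.

```python
import json
from dataclasses import dataclass, asdict

ACTIONS = {}        # action name -> callable(events)
SAVED_EMAILS = []   # stand-in for a real database table


def register_action(name):
    """Decorator that makes a function available to rules under a name."""
    def register(fn):
        ACTIONS[name] = fn
        return fn
    return register


@register_action("add_email_to_database")
def add_email_to_database(events):
    SAVED_EMAILS.extend(sorted({e["email"] for e in events if "email" in e}))


@dataclass
class Rule:
    name: str
    conditions: list  # [source, kind] pairs that must all have occurred
    action: str       # key into ACTIONS

    def matches(self, events):
        seen = {(e["source"], e["kind"]) for e in events}
        return all(tuple(c) in seen for c in self.conditions)


def evaluate(rules, events):
    """Run the action of every rule whose conditions all match; return their names."""
    fired = []
    for rule in rules:
        if rule.matches(events):
            ACTIONS[rule.action](events)
            fired.append(rule.name)
    return fired


rule = Rule(
    name="highrise-and-zendesk",
    conditions=[["highrise", "contact_added"], ["zendesk", "ticket_created"]],
    action="add_email_to_database",
)
stored = json.dumps(asdict(rule))        # persist
restored = Rule(**json.loads(stored))    # load again before evaluating
```

A periodic job would collect the recent events from the web services and call `evaluate` with the restored rules. Keeping conditions as data also leaves the door open for the DSL idea later, since a DSL would simply be another way of producing the same `Rule` records.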
Can anyone shed some light on this design or point somewhere where I can get more information on this?
In a commercial rules engine (e.g. Drools, FlexRule, ...), pattern matching is handled by the RETE algorithm. Some of them also provide multiple engines for different kinds of logic (procedural, validation, inference, flow, workflow, ...) as well as DSL customization.
Rule sequencing and execution are handled through an agenda and activations that can be defined on the engine, and a conflict resolution strategy helps you pick the proper activation to fire.
I recommend using a commercial product hosted on a server or as a service, and using a simple JSON/XML format to communicate with the rule server and execute your rules. This will probably give you a better result than creating your own. However, if you are interested in building your own pattern matching engine, consider the RETE algorithm together with agenda and activation mechanisms for a complex production system.
In a RETE implementation, consider supporting at least positive and negative conditions. You also need to implement alpha and beta memories, as well as join nodes that support left and right activations.
Do you think you could represent your problem in a graph-based representation? I'm pretty sure that your problem can be considered a graph-based problem.
If yes, why don't you use a graph transformation system to define and apply your rules? The one that I would recommend is GrGen.NET. Using GrGen.NET involves five steps:
Definition of the metamodel: Here, you define your building blocks, i.e. types of graph nodes and graph edges.
Definition of the ruleset: This is where you can put your pattern detecting rules. Moreover, you can create rule encapsulating procedures to manipulate your graph-based data structure.
Compilation: Based on the previous two steps, a C#-assembly (DLL) is created. There should be a way to access such a DLL from Ruby.
Definition of a rule sequence: Rule sequences contain the structure in which individual rules are executed. Typically, it's a logical structure in which the rules are concatenated.
Graph transformation: Applying a rule sequence through the DLL transforms a graph, which can subsequently be exported, saved or further manipulated.
You can find a very good manual of GrGen.NET here: