Pattern recognition of SQL statements - algorithm

I have a text pattern matching problem that I could use some direction with. Not being very familiar with pattern recognition overall, I don't know if this is one of those "oh, just use blah-blah algorithm" cases, or a genuinely hard pattern-matching problem.
The general statement of what I want to do is identify similarities between a series of SQL statements, in order to allow me to refactor those statements into a smaller number of stored procedures or other dynamically-generated SQL snippets. For example,
SELECT MIN(foo) FROM bar WHERE baz > 123;
SELECT MIN(footer) FROM bar;
SELECT MIN(foo), baz FROM bar;
are all kind of the same, but I would want to recognize that the value inside the MIN() should be a replaceable value, that I may have another column in the SELECT list, or have an optional WHERE clause. Note that this example is highly cooked up, but I hope it allows you to see what I am after.
In terms of scope, I would have a set of thousands of SQL statements that I would hope to reduce to dozens (?) of generic statements. In research so far, I have come across w-shingles, and n-grams, and have discarded approaches like "bag of words" because ordering is important. Taking this out of the SQL realm, another way of stating this problem might be "given a series of text statements, what is the smallest set of text fragments that can be used to reassemble those statements?"
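To make the goal a bit more concrete, here is the most naive first pass I can imagine (the normalize helper and its regexes are just made up for illustration): it only collapses statements that differ in literal values, whereas what I am really after is recognizing the structural variation as well.

    import re
    from collections import Counter

    def normalize(sql):
        # Replace string and numeric literals with placeholders so that
        # statements differing only in constant values collapse together.
        sql = re.sub(r"'[^']*'", "?", sql)
        sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)
        return re.sub(r"\s+", " ", sql).strip().upper()

    statements = [
        "SELECT MIN(foo) FROM bar WHERE baz > 123;",
        "SELECT MIN(foo) FROM bar WHERE baz > 456;",
        "SELECT MIN(footer) FROM bar;",
    ]
    for template, count in Counter(normalize(s) for s in statements).most_common():
        print(count, template)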

What you really want is to find code clones across the code base.
There are a lot of ways to do that, but most of them seem to ignore the structure that the (SQL) language brings. That structure makes it "easier" to find code elements that make conceptual sense, as opposed to, say, N-grams (yes, "FROM x WHERE" is common, but it is an awkward chunk of SQL).
My abstract syntax tree (AST) based clone detection scheme parses source text to ASTs, and then finds shared trees that can be parameterized in a way that produces sensible generalizations by using the language grammar as a guide. See my technical paper Clone Detection Using Abstract Syntax Trees.
With respect to OP's example:
It will recognize that the value inside the MIN() should be a replaceable value
that the SELECT singleton column could be extended to a list
and that the WHERE clause is optional
It won't attempt to make those proposals unless it finds two candidate clones that vary in a way these generalizations explain. It gets the generalizations essentially by extracting them from the (SQL) grammar. OP's examples have exactly enough variation to force those generalizations.
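This is not the clone-detection algorithm from the paper, but to make the generalization step concrete, here is a toy anti-unification sketch over hand-built parse trees (the tuple encoding, the generalize helper, and the second tree are invented for illustration; grammar-guided generalizations such as an optional WHERE clause or extending a singleton column to a list are beyond this sketch):

    def generalize(a, b, holes):
        # Anti-unify two tree nodes: identical subtrees are kept as-is,
        # differing subtrees become numbered parameters ("holes").
        if a == b:
            return a
        if (isinstance(a, tuple) and isinstance(b, tuple)
                and a[0] == b[0] and len(a) == len(b)):
            return (a[0],) + tuple(generalize(x, y, holes)
                                   for x, y in zip(a[1:], b[1:]))
        holes.append((a, b))
        return "<P%d>" % len(holes)

    # Two hypothetical parse trees of the same shape.
    t1 = ("select", ("min", ("col", "foo")), ("from", "bar"),
          ("where", (">", ("col", "baz"), ("num", 123))))
    t2 = ("select", ("min", ("col", "footer")), ("from", "bar"),
          ("where", (">", ("col", "baz"), ("num", 456))))

    holes = []
    print(generalize(t1, t2, holes))   # MIN argument and constant become parameters
    print(holes)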
A survey of clone detection techniques (Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach) rated this approach at the top of some 30 different clone detection methods; see Table 14.

The question is a bit too broad, but I would suggest giving the following approach a shot:
This sounds like a document clustering problem, where you have a set of pieces of text (SQL statements) and you want to cluster them together to find out whether some of the statements are close to each other. The trick here is the distance measure between text statements; I would try something like edit distance.
So in general the following approach could work:
Do some preprocessing of the SQL statements you have: tokenization, removing some words from statements, etc. Just be careful here - you are not analysing natural language text but SQL statements, so you will need a somewhat smarter approach.
After that, try to write a function that computes the distance between two SQL queries. The edit distance should work for you.
Finally, run document clustering on all your SQL queries, using the edit distance as the distance measure for the clustering algorithm (see the sketch below).
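A rough sketch of those three steps (the tokenizer, the 0.3 threshold, and the greedy single-pass clustering are all invented for illustration; difflib's ratio stands in for a proper token-level edit distance):

    import re
    from difflib import SequenceMatcher

    def tokens(sql):
        # crude tokenizer: keywords/identifiers, numbers, and punctuation
        return re.findall(r"\w+|[^\w\s]", sql.upper())

    def distance(a, b):
        # 1 - similarity of the token sequences; a token-level Levenshtein
        # distance would work here as well
        return 1 - SequenceMatcher(None, tokens(a), tokens(b)).ratio()

    def cluster(statements, threshold=0.3):
        # Greedy single-pass clustering: assign each statement to the first
        # cluster whose representative is close enough, else start a new one.
        clusters = []
        for s in statements:
            for c in clusters:
                if distance(s, c[0]) <= threshold:
                    c.append(s)
                    break
            else:
                clusters.append([s])
        return clusters

    queries = [
        "SELECT MIN(foo) FROM bar WHERE baz > 123;",
        "SELECT MIN(footer) FROM bar;",
        "SELECT MIN(foo), baz FROM bar;",
        "DELETE FROM audit_log WHERE created < '2020-01-01';",
    ]
    for c in cluster(queries):
        print(c)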
Hope that helps.

Related

How are keyword clouds constructed?
I know there are a lot of NLP methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use NLP methods to detect proper nouns, people, places, and possibly subjects. This will be a very large list given a sufficiently sized article, but I assume that I can winnow the list down somehow by comparing articles. How to do this properly is what I am confused about.)
Each item can have a list of keywords, but how do they pick keywords such that the keywords aren't overly specific or overly general between each item?
For example, trivially, "the" can be a keyword that appears in a lot of items, while "supercalifragilistic" might appear in only one.
I suppose I could create a heuristic along the lines of: if a word exists in n% of the items, where n is sufficiently small but still returns a nice sublist (say 5% of 1000 articles is 50, which seems reasonable), then I could just use it. However, the issue I take with this approach is that, given two entirely different sets of items, there is most likely some difference in interrelatedness between the items, and I'm throwing that information away.
This is very unsatisfying.
I feel that, given the popularity of keyword clouds, there must already be a solution for this. I don't want to use a library, however, as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is OK, by the way, but the issue is that the weighting needs to be determined a priori. Given that two distinct collections of documents will have a different inherent similarity between documents, assuming an a priori weighting does not feel correct.
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but I'm still considering it.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the idf vector estimated from Wikipedia, i.e., determined a priori, as you say). A popular alternative is the TextRank algorithm, based on PageRank, which has an off-the-shelf implementation in Gensim.
If you decide on your own implementation, note that all of these algorithms typically need plenty of tuning and text preprocessing to work well.
The minimum you need to do is remove stopwords that you know can never be a keyword (prepositions, articles, pronouns, etc.). If you want something fancier, you can use, for instance, spaCy to keep only the desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (Gensim has a good function for automatic collocation detection) and named entities (spaCy can do that too). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvement.
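If you want to see the math without a library in the way, a minimal tf-idf keyword scorer could look like the sketch below (the stopword list, tokenizer, and keywords helper are invented for illustration); note that the idf here is estimated from your own collection rather than fixed a priori, which addresses part of your concern about the weighting:

    import math
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "it"}

    def tokenize(text):
        return [w for w in re.findall(r"[a-z]+", text.lower())
                if w not in STOPWORDS]

    def keywords(docs, doc_index, top_n=5):
        # Score the terms of one document by tf-idf against the whole collection.
        doc_tokens = [tokenize(d) for d in docs]
        df = Counter()
        for toks in doc_tokens:
            df.update(set(toks))            # document frequency per term
        tf = Counter(doc_tokens[doc_index])
        n_terms = len(doc_tokens[doc_index])
        scores = {t: (c / n_terms) * math.log(len(docs) / df[t])
                  for t, c in tf.items()}
        return sorted(scores, key=scores.get, reverse=True)[:top_n]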

Is there an algorithm to compound multiple sentences into a more complex one?

I'm looking to do the opposite to what is described here: Tools for text simplification (Java)
Finding meaningful sub-sentences from a sentence
That is, take two simple sentences and combine them as a compound sentence.
Are there any algorithms to do this?
I'm fairly sure that you will not be able to compound sentences like the example from the linked question (John played golf. John was the CEO of a company. -> John, who was the CEO of a company, played golf), because it requires a level of language understanding that is still far out of reach.
So it seems the best option is to bluntly replace the full stop with a comma and concatenate the simple sentences (if you have to choose which sentences from a text to compound, you can try simple heuristics like approximating semantic similarity by the number of common words, or tools based on WordNet). I guess in most cases human readers can infer the missing conjunction from the context.
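A minimal sketch of that blunt approach (the common_words heuristic and the example sentences are just for illustration):

    def common_words(s1, s2):
        # crude proxy for relatedness: number of shared (lowercased) words
        return len(set(s1.lower().split()) & set(s2.lower().split()))

    def compound(s1, s2, conjunction="and"):
        # bluntly replace the full stop with a comma plus a conjunction
        return s1.rstrip(". ") + ", " + conjunction + " " + s2.rstrip(". ") + "."

    a = "John played golf."
    b = "John was the CEO of a company."
    if common_words(a, b) >= 1:
        print(compound(a, b))
        # -> John played golf, and John was the CEO of a company.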
Of course, you could develop more sophisticated solutions, but that requires either a narrow domain (e.g. all sentences share a very similar structure) or tools that can determine relations between sentences, e.g. cause-and-effect relationships. I'm not aware of such tools and doubt they exist, because this level (sentences and phrases) is much more diverse and sparse than the level of words and collocations.

Finding an experiment to evaluate how good an algorithm for keywords extraction is

I have a few algorithms that extract and rank keywords [both terms and bigrams] from a paragraph [most are based on the tf-idf model].
I am looking for an experiment to evaluate these algorithms. This experiment should give a grade to each algorithm, indicating "how good was it" [on the evaluation set, of course].
I am looking for an automatic / semi-automatic method to evaluate each algorithm's results, and an automatic / semi-automatic method to create the evaluation set.
Note: These experiments will be run off-line, so efficiency is not an issue.
The classic way to do this would be to define a set of key words you want the algorithms to find per paragraph, then check how well the algorithms do with respect to this set, e.g. (generated_correct - generated_not_correct)/total_generated (see update, this is nonsense). This is automatic once you have defined this ground truth. I guess constructing that is what you want to automate as well when you talk about constructing the evaluation set? That's a bit more tricky.
Generally, if there was a way to generate key words automatically that's a good way to use as a ground truth - you should use that as your algorithm ;). Sounds cheeky, but it's a common problem. When you evaluate one algorithm using the output of another algorithm, something's probably going wrong (unless you specifically want to benchmark against that algorithm).
So you might start harvesting key words from common sources. For example:
Download scientific papers that have a keyword section. Check if those keywords actually appear in the text, if they do, take the section of text including the keywords, use the keyword section as ground truth.
Get blog posts, check if the terms in the heading appear in the text, then use the words in the title (always minus stop words of course) as ground truth
...
You get the idea. Unless you want to employ people to manually generate keywords, I guess you'll have to make do with something like the above.
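For example, harvesting ground truth from papers or posts that already carry author-supplied keywords might look roughly like this (the input format and the harvest_ground_truth helper are hypothetical):

    def harvest_ground_truth(documents):
        # documents: list of (text, author_keywords) pairs, e.g. scraped from
        # papers with a keyword section or from blog post titles.
        # Keep only keywords that actually occur in the text.
        ground_truth = []
        for text, candidate_keywords in documents:
            lowered = text.lower()
            present = [k for k in candidate_keywords if k.lower() in lowered]
            if present:
                ground_truth.append((text, present))
        return ground_truth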
Update
The evaluation function mentioned above is stupid: it does not take into account how many of the available key words have been found. Instead, the way to judge a ranked list of relevant and irrelevant results is to use precision and recall. Precision rewards the absence of irrelevant results; recall rewards the presence of relevant results. To combine the two into a single number, use the F-measure, with an optional weighting. Alternatively, use Precision@X, where X is the number of results you want to consider. Precision@X is closely related to Recall@X (they coincide when exactly X keywords are relevant). However, you need a sensible X here, i.e. if you have fewer than X keywords in some cases, those results will be punished for never providing an Xth keyword. In the literature on tag recommendation, for example, which is very similar to your case, the F-measure and P@5 are often used.
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall
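A minimal sketch of these measures (the function name and the sample keyword lists are made up; beta=1 gives the usual F1 score):

    def evaluate(predicted, relevant, beta=1.0, x=5):
        # predicted: ranked keywords from the algorithm; relevant: ground truth
        relevant = set(relevant)
        hits = [k for k in predicted if k in relevant]
        precision = len(hits) / len(predicted) if predicted else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        if precision + recall == 0:
            f = 0.0
        else:
            f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        p_at_x = sum(1 for k in predicted[:x] if k in relevant) / x
        return precision, recall, f, p_at_x

    print(evaluate(["tf", "idf", "keyword", "foo", "bar"],
                   ["keyword", "tf", "extraction"]))
    # -> (0.4, 0.666..., 0.5, 0.4)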

Method for runtime comparison of two programs' objects

I am working through a particular type of code testing that is rather nettlesome and could be automated, yet I'm not sure of the best practices. Before describing the problem, I want to make clear that I'm looking for the appropriate terminology and concepts, so that I can read more about how to implement it. Suggestions on best practices are welcome, certainly, but my goal is specific: what is this kind of approach called?
In the simplest case, I have two programs that take in a bunch of data, produce a variety of intermediate objects, and then return a final result. When tested end-to-end, the final results differ, hence the need to find out where the differences occur. Unfortunately, even intermediate results may differ, but not always in a significant way (i.e. some discrepancies are tolerable). The final wrinkle is that intermediate objects may not necessarily have the same names between the two programs, and the two sets of intermediate objects may not fully overlap (e.g. one program may have more intermediate objects than the other). Thus, I can't assume there is a one-to-one relationship between the objects created in the two programs.
The approach that I'm thinking of taking to automate this comparison of objects is as follows (it's roughly inspired by frequency counts in text corpora):
For each program, A and B: create a list of the objects created throughout execution, which may be indexed in a very simple manner, such as a001, a002, a003, a004, ... and similarly for B (b001, ...).
Let Na be the number of unique object names encountered in A, and similarly Nb for the number of objects in B.
Create two tables, TableA and TableB, with Na and Nb columns, respectively. Entries will record a value for each object at each trigger (i.e. for each row, defined next).
For each assignment in A, the simplest approach is to capture the hash value of all of the Na items; of course, one can use LOCF (last observation carried forward) for those items that don't change, and any as-yet unobserved objects are simply given a NULL entry. Repeat this for B.
Match entries in TableA and TableB via their hash values. Ideally, objects will arrive into the "vocabulary" in approximately the same order, so that order and hash value will allow one to identify the sequences of values.
Find discrepancies in the objects between A and B based on when the sequences of hash values diverge for any objects with divergent sequences.
Now, this is a simple approach and could work wonderfully if the data were simple, atomic, and not susceptible to numerical precision issues. However, I believe that numerical precision may cause hash values to diverge, though the impact is insignificant if the discrepancies are approximately at the machine tolerance level.
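To make steps 4-6 a bit more concrete, here is a rough sketch in Python (the snapshot and first_divergence helpers are invented; it assumes, unlike the harder general case, that a one-to-one name mapping between the two programs is already known, and it rounds floats before hashing so machine-tolerance discrepancies do not change the hash):

    import hashlib
    import json

    def snapshot(value, ndigits=9):
        # Hash an object's value; floats are rounded first so that
        # discrepancies near machine tolerance hash identically.
        def canon(v):
            if isinstance(v, float):
                return round(v, ndigits)
            if isinstance(v, dict):
                return {k: canon(x) for k, x in sorted(v.items())}
            if isinstance(v, (list, tuple)):
                return [canon(x) for x in v]
            return v
        encoded = json.dumps(canon(value), sort_keys=True).encode()
        return hashlib.sha1(encoded).hexdigest()

    def first_divergence(table_a, table_b):
        # table_a / table_b: lists of rows, one per assignment, each row a
        # dict mapping object name -> hash (with LOCF already applied).
        for i, (row_a, row_b) in enumerate(zip(table_a, table_b)):
            diverged = [k for k in row_a.keys() & row_b.keys()
                        if row_a[k] != row_b[k]]
            if diverged:
                return i, diverged
        return None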
First: What is a name for such types of testing methods and concepts? An answer need not necessarily be the method above, but reflects the class of methods for comparing objects from two (or more) different programs.
Second: What standard methods exist for what I describe in steps 3 and 4? For instance, the "value" need not only be a hash: one might also store the sizes of the objects - after all, two objects cannot be the same if they are massively different in size.
In practice, I tend to compare a small number of items, but I suspect that when automated this need not involve a lot of input from the user.
Edit 1: This paper is related in terms of comparing execution traces; it mentions "code comparison", which is related to my interest, though I'm more concerned with the data (i.e. objects) than with the actual code that produces the objects. I've just skimmed it, but will review its methodology more carefully. More importantly, it suggests that comparing code traces may be extended to comparing data traces. This paper analyzes some comparisons of code traces, albeit in a wholly unrelated area, security testing.
Perhaps data-tracing and stack-trace methods are related. Checkpointing is slightly related, but its typical use (i.e. saving all of the state) is overkill.
Edit 2: Other related concepts include differential program analysis and monitoring of remote systems (e.g. space probes) where one attempts to reproduce the calculations using a local implementation, usually a clone (think of a HAL-9000 compared to its earth-bound clones). I've looked down the routes of unit testing, reverse engineering, various kinds of forensics, and whatnot. In the development phase, one could ensure agreement with unit tests, but this doesn't seem to be useful for instrumented analyses. For reverse engineering, the goal can be code & data agreement, but methods for assessing fidelity of re-engineered code don't seem particularly easy to find. Forensics on a per-program basis are very easily found, but comparisons between programs don't seem to be that common.
(Making this answer community wiki, because dataflow programming and reactive programming are not my areas of expertise.)
The area of data flow programming appears to be related, and thus debugging of data flow programs may be helpful. This paper from 1981 gives several useful high level ideas. Although it's hard to translate these to immediately applicable code, it does suggest a method I'd overlooked: when approaching a program as a dataflow, one can either statically or dynamically identify where changes in input values cause changes in other values in the intermediate processing or in the output (not just changes in execution, if one were to examine control flow).
Although dataflow programming is often related to parallel or distributed computing, it seems to dovetail with Reactive Programming, which is how the monitoring of objects (e.g. the hashing) can be implemented.
This answer is far from adequate, hence the CW tag, as it doesn't really name the debugging method that I described. Perhaps this is a form of debugging for the reactive programming paradigm.
[Also note: although this answer is CW, if anyone has a far better answer in relation to dataflow or reactive programming, please feel free to post a separate answer and I will remove this one.]
Note 1: Henrik Nilsson and Peter Fritzson have a number of papers on debugging for lazy functional languages, which are somewhat related: the debugging goal is to assess values, not the execution of code. This paper seems to have several good ideas, and their work partially inspired this paper on a debugger for a reactive programming language called Lustre. These references don't answer the original question, but may be of interest to anyone facing this same challenge, albeit in a different programming context.

normalize boolean expression for caching reasons. is there a more efficient way than truth tables?

My current project is an advanced tag database with boolean retrieval features. Records are queried with boolean expressions like the following (e.g. in a music database):
funky-music and not (live or cover)
which should yield all funky music in the music database but not live or cover versions of the songs.
When it comes to caching, the problem is that there exist queries which are equivalent but different in structure. For example, applying de Morgan's rule the above query could be written like this:
funky-music and not live and not cover
which would yield exactly the same records but would of course break caching if caching were implemented by hashing the query string, for example.
Therefore, my first intention was to create a truth table of the query which could then be used as a caching key as equivalent expressions form the same truth table. Unfortunately, this is not practicable as the truth table grows exponentially with the number of inputs (tags) and I do not want to limit the number of tags used in one query.
Another approach could be traversing the syntax tree, applying rules defined by boolean algebra to form a (minimal) normalized representation, which seems tricky too.
Thus the overall question is: Is there a practicable way to implement recognition of equivalent queries without the need for circuit minimization or truth tables (edit: or any other algorithm which is NP-hard)?
The ne plus ultra would be recognizing already cached subqueries, but that is not a primary goal.
A general and efficient algorithm to determine whether a query is equivalent to "False" could be used to solve NP-complete problems efficiently, so you are unlikely to find one.
You could try transforming your queries into a canonical form. Because of the above, there will always be queries that are very expensive to transform into any given form, but you might find that, in practice, some form works pretty well most of the time - and you can always give up halfway through a transformation if it is becoming too hard.
You could look at http://en.wikipedia.org/wiki/Conjunctive_normal_form, http://en.wikipedia.org/wiki/Disjunctive_normal_form, http://en.wikipedia.org/wiki/Binary_decision_diagram.
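As a sketch of the "good enough most of the time" idea (the tuple encoding of the syntax tree and the normalize helper are invented for illustration): push negations inward with De Morgan, flatten nested and/or, and sort operands. This is not a full canonical form, but it already maps both spellings of the example query to the same cache key.

    def normalize(node):
        # node is either a tag string or a tuple: ('not', child) or
        # ('and'|'or', child1, child2, ...).
        if isinstance(node, str):
            return node
        op = node[0]
        if op == "not":
            child = node[1]
            if isinstance(child, str):
                return ("not", child)
            if child[0] == "not":                            # double negation
                return normalize(child[1])
            flipped = "or" if child[0] == "and" else "and"   # De Morgan
            return normalize((flipped,) + tuple(("not", c) for c in child[1:]))
        children = []
        for c in node[1:]:
            c = normalize(c)
            if isinstance(c, tuple) and c[0] == op:          # flatten nesting
                children.extend(c[1:])
            else:
                children.append(c)
        return (op,) + tuple(sorted(children, key=repr))

    q1 = ("and", "funky-music", ("not", ("or", "live", "cover")))
    q2 = ("and", "funky-music", ("not", "live"), ("not", "cover"))
    print(normalize(q1) == normalize(q2))   # True for this pair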
You can convert the queries into conjunctive normal form (CNF). It is a canonical, simple representation of boolean formulae that is normally the basis for SAT solvers.
Most likely "large" queries are going to have lots of conjunctions (rather than lots of disjunctions) so CNF should work well.
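If you happen to be working in Python and don't want to hand-roll the rewriting, sympy ships a CNF converter; a small sketch (hashing the string form of the CNF could then give a structure-independent cache key for many equivalent queries):

    from sympy import symbols
    from sympy.logic.boolalg import to_cnf

    funky_music, live, cover = symbols("funky_music live cover")

    q1 = funky_music & ~(live | cover)
    q2 = funky_music & ~live & ~cover

    # Both queries should normalize to the same CNF expression,
    # so its string form can serve as the cache key.
    print(to_cnf(q1), to_cnf(q2), to_cnf(q1) == to_cnf(q2))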
The Quine-McCluskey algorithm should achieve what you are looking for. It is similar to Karnaugh maps, but easier to implement in software.
