I have been reading the Eigen documentation and some articles on the internet about Eigen, and I came across the term "Eigen transformation of expression". I don't understand what it means. I found space transformations in the Geometry module in the documentation, but I think space transformations and expression transformations are not the same thing.
It would be nice if someone could point out what a transformation of expressions means in terms of Eigen. What kinds of transformations are performed, and is there some sort of explicit list of them, or are they hardcoded in the classes representing operands/operators?
I think that what you are referring to as "Eigen transformation of expression" is Eigen's expression templates.
Expression-templates-based libraries can avoid evaluating sub-expressions into temporaries, which in many cases results in large speed improvements. This is called lazy evaluation as an expression is getting evaluated as late as possible, instead of immediately. However, most other expression-templates-based libraries always choose lazy evaluation. There are two problems with that: first, lazy evaluation is not always a good choice for performance; second, lazy evaluation can be very dangerous, for example with matrix products: doing matrix = matrix*matrix gives a wrong result if the matrix product is lazy-evaluated, because of the way matrix product works.
See more on the Eigen Lazy Evaluation and Aliasing page.
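To make the aliasing point concrete, here is a minimal sketch, assuming Eigen 3 and its dense MatrixXd type: by default Eigen evaluates a matrix product into a temporary before assigning, so m = m * m is safe, while .noalias() lets you opt back into lazy evaluation when you know the destination does not alias the operands.

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        Eigen::MatrixXd m(2, 2);
        m << 1, 2,
             3, 4;

        // Safe: the product on the right-hand side is evaluated into a
        // temporary before assignment, so aliasing between m and m*m does
        // not corrupt the result.
        m = m * m;

        Eigen::MatrixXd a(2, 2), b(2, 2), c(2, 2);
        a << 1, 2, 3, 4;
        b << 5, 6, 7, 8;

        // If the destination is known not to alias the operands, .noalias()
        // skips the temporary and evaluates the product lazily into c.
        c.noalias() = a * b;

        std::cout << m << "\n\n" << c << "\n";
        return 0;
    }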
This is a simple way of abstracting the expression (operators such as *, dot, ...) from the actual calculation, among some other things. You can read more about it in this paper:
A New Vectorization Technique for Expression Templates in C++
and it does not appear to be documented in much detail on the Eigen Expression Templates page.
For example, if I have I = V / R as input, I want V = I * R and R = V / I as output. I get that this can be a broad question, but how should I get started with it? Should I use a stack/tree, as when building a postfix-notation evaluator/interpreter?
You need to be able to represent the formulas symbolically, and apply algebraic rules to manipulate those formulas.
The easiest way to do this is to define a syntax that will accept your formulas, best defined explicitly as a BNF grammar. With that you can build a parser for such formulas; done appropriately, your parser can build an abstract syntax tree representing the formula. You can do this with tools like lex and yacc or ANTLR. Here's my advice on how to do this with a custom recursive descent parser: Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
Once you have trees encoding your formulas, you can implement procedures to modify the trees according to algebraic laws, such as:
X = Y/Z  =>  X*Z = Y    if Z != 0
Now you can implement such a rule by writing procedural code that climbs over the tree, finds a match to the pattern, and then smashes the tree to produce the result. This is pretty straightforward compiler technology. If you are enthusiastic, you can probably code half a dozen algebraic laws fairly quickly. You'll discover the code that does this is pretty grotty, what with climbing up and down the tree, matching nodes, and smashing links between nodes to produce the result.
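As a rough illustration of that procedural approach, here is a minimal sketch in C++ (the Node type and helper functions are invented for this example, and there is no parser; the tree for I = V / R is built by hand) applying the rule X = Y/Z => X*Z = Y:

    #include <iostream>
    #include <memory>
    #include <string>

    // Minimal expression-tree node: either a variable leaf or a binary
    // operator with two children. A real implementation would also carry
    // constants, source positions, etc.
    struct Node {
        std::string op;                    // "=", "*", "/" or a variable name
        std::shared_ptr<Node> left, right; // null for leaves
    };
    using NodePtr = std::shared_ptr<Node>;

    NodePtr leaf(const std::string& name) {
        return std::make_shared<Node>(Node{name, nullptr, nullptr});
    }
    NodePtr node(const std::string& op, NodePtr l, NodePtr r) {
        return std::make_shared<Node>(Node{op, l, r});
    }

    // Rewrite rule: X = Y/Z  =>  X*Z = Y  (valid when Z != 0).
    // Returns the rewritten tree, or the original if the pattern does not match.
    NodePtr multiplyThroughByDivisor(const NodePtr& eq) {
        if (eq && eq->op == "=" && eq->right && eq->right->op == "/") {
            NodePtr x = eq->left;
            NodePtr y = eq->right->left;
            NodePtr z = eq->right->right;
            return node("=", node("*", x, z), y);
        }
        return eq;
    }

    std::string toString(const NodePtr& n) {
        if (!n->left) return n->op;        // leaf
        return "(" + toString(n->left) + " " + n->op + " " + toString(n->right) + ")";
    }

    int main() {
        // I = V / R
        NodePtr formula = node("=", leaf("I"), node("/", leaf("V"), leaf("R")));
        NodePtr rewritten = multiplyThroughByDivisor(formula);   // (I * R) = V
        std::cout << toString(formula) << "  =>  " << toString(rewritten) << "\n";
    }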
Another way to do this is to use a program transformation system that will let you
define a grammar for your formulas directly,
define (tree) rewrite rules directly in terms of your grammar (e.g., essentially you provide the algebra rule above directly),
apply the rewrite rules on demand for you, and
regenerate the symbolic formula from the AST.
My company's DMS Software Reengineering Toolkit can do this. You can see a fully worked example (too large to copy here) of algebra and calculus at Algebra Defined By Transformation Rules.
I have a text pattern matching problem that I could use some direction with. Not being very familiar with pattern recognition overall, I don't know if this is one of those "oh, just use blah-blah algorithm", or this is a really hard pattern problem.
The general statement of what I want to do is identify similarities between a series of SQL statements, in order to allow me to refactor those statements into a smaller number of stored procedures or other dynamically-generated SQL snippets. For example,
SELECT MIN(foo) FROM bar WHERE baz > 123;
SELECT MIN(footer) FROM bar;
SELECT MIN(foo), baz FROM bar;
are all kind of the same, but I would want to recognize that the value inside the MIN() should be a replaceable value, that I may have another column in the SELECT list, or have an optional WHERE clause. Note that this example is highly cooked up, but I hope it allows you to see what I am after.
In terms of scope, I would have a set of thousands of SQL statements that I would hope to reduce to dozens (?) of generic statements. In research so far, I have come across w-shingles, and n-grams, and have discarded approaches like "bag of words" because ordering is important. Taking this out of the SQL realm, another way of stating this problem might be "given a series of text statements, what is the smallest set of text fragments that can be used to reassemble those statements?"
What you really want is to find code clones across the code base.
There's a lot of ways to do that, but most of them seem to ignore the structure that the (SQL) language brings. That structure makes it "easier" to find code elements that make conceptual sense, as opposed to say N-grams (yes, "FROM x WHERE" is common but is an awkward chunk of SQL).
My abstract syntax tree (AST) based clone detection scheme parses source text to ASTs, and then finds shared trees that can be parameterized in a way that produces sensible generalizations by using the language grammar as a guide. See my technical paper Clone Detection Using Abstract Syntax Trees.
With respect to OP's example:
It will recognize that the value inside the MIN() should be a replaceable value
that the SELECT singleton column could be extended to a list
and that the WHERE clause is optional
It won't attempt to make those proposals unless it finds two candidate clones that vary in a way these generalizations explain. It gets the generalizations essentially by extracting them from the (SQL) grammar. OP's examples have exactly enough variation to force those generalizations.
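The bucketing idea behind AST-based clone detection can be sketched roughly like this (a toy illustration, not the actual DMS implementation; the node kinds and helper names are made up): hash subtrees while ignoring leaf text, so MIN(foo) and MIN(footer) land in the same bucket, and their differing leaves are exactly the points to parameterize.

    #include <functional>
    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Toy AST node: a kind such as "MIN" or "COLUMN", an optional leaf
    // value, and child nodes.
    struct Ast {
        std::string kind;
        std::string text;                       // leaf value, empty otherwise
        std::vector<std::shared_ptr<Ast>> kids;
    };
    using AstPtr = std::shared_ptr<Ast>;

    AstPtr mk(std::string kind, std::string text = "", std::vector<AstPtr> kids = {}) {
        return std::make_shared<Ast>(Ast{std::move(kind), std::move(text), std::move(kids)});
    }

    // Structural hash that deliberately ignores leaf text, so MIN(foo) and
    // MIN(footer) hash identically and become clone candidates; the differing
    // leaves are the points to turn into parameters.
    std::size_t shapeHash(const AstPtr& n) {
        std::size_t h = std::hash<std::string>{}(n->kind);
        for (const auto& k : n->kids) h = h * 31 + shapeHash(k);
        return h;
    }

    // Bucket every subtree by its shape hash; subtrees sharing a bucket are
    // compared pairwise in a second pass to confirm the clone and extract
    // the parameters.
    void collect(const AstPtr& n, std::map<std::size_t, std::vector<AstPtr>>& buckets) {
        buckets[shapeHash(n)].push_back(n);
        for (const auto& k : n->kids) collect(k, buckets);
    }

    int main() {
        AstPtr a = mk("MIN", "", { mk("COLUMN", "foo") });
        AstPtr b = mk("MIN", "", { mk("COLUMN", "footer") });

        std::map<std::size_t, std::vector<AstPtr>> buckets;
        collect(a, buckets);
        collect(b, buckets);

        for (const auto& [hash, group] : buckets)
            if (group.size() > 1)
                std::cout << "clone candidates in bucket " << hash
                          << ": " << group.size() << " subtrees\n";
    }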
A survey of clone detection techniques (Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach) rated this approach at the top of some 30 different clone detection methods; see Table 14.
The question is a bit too broad, but I would suggest giving the following approach a shot:
This sounds like a document clustering problem, where you have a set of pieces of text (SQL statements) and you want to cluster them to find out whether some of the statements are close to each other. Now, the trick here is the distance measure between text statements. I would try something like edit distance.
So in general the following approach could work:
Do some preprocessing of the SQL statements you have: tokenization, removing some words from the statements, etc. Just be careful here: you are not analysing arbitrary natural-language text, these are SQL statements, so you will need a somewhat tailored approach.
After that, try to write a function that computes the distance between two SQL queries. Edit distance should work for you; a minimal sketch follows this list.
Finally, try to run document clustering on all your SQL queries, using edit distance as the distance measure for the clustering algorithm.
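Here is a minimal sketch of the distance measure from steps 1 and 2 (the tokenizer is deliberately naive and the statements come from the question; a real preprocessor would lowercase keywords, split punctuation, and perhaps replace literals with placeholders first):

    #include <algorithm>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Very crude tokenizer for the sketch: split on whitespace.
    std::vector<std::string> tokenize(const std::string& sql) {
        std::istringstream in(sql);
        std::vector<std::string> tokens;
        std::string t;
        while (in >> t) tokens.push_back(t);
        return tokens;
    }

    // Classic Levenshtein distance, computed over tokens instead of characters,
    // so inserting a whole WHERE clause costs a few edits rather than dozens.
    std::size_t editDistance(const std::vector<std::string>& a,
                             const std::vector<std::string>& b) {
        std::vector<std::vector<std::size_t>> d(a.size() + 1,
                                                std::vector<std::size_t>(b.size() + 1));
        for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = i;
        for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = j;
        for (std::size_t i = 1; i <= a.size(); ++i)
            for (std::size_t j = 1; j <= b.size(); ++j) {
                std::size_t subst = (a[i - 1] == b[j - 1]) ? 0 : 1;
                d[i][j] = std::min({ d[i - 1][j] + 1,            // deletion
                                     d[i][j - 1] + 1,            // insertion
                                     d[i - 1][j - 1] + subst }); // substitution
            }
        return d[a.size()][b.size()];
    }

    int main() {
        auto a = tokenize("SELECT MIN(foo) FROM bar WHERE baz > 123;");
        auto b = tokenize("SELECT MIN(footer) FROM bar;");
        std::cout << "token edit distance: " << editDistance(a, b) << "\n";
    }

The clustering step can then work directly on the pairwise distances (e.g. hierarchical clustering), since edit distance does not embed the statements in a vector space.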
Hope that helps.
I have a few algorithms that extract and rank keywords [both terms and bigrams] from a paragraph [most are based on the tf-idf model].
I am looking for an experiment to evaluate these algorithms. This experiment should give a grade to each algorithm, indicating "how good was it" [on the evaluation set, of course].
I am looking for an automatic / semi-automatic method to evaluate each algorithm's results, and an automatic / semi-automatic method to create the evaluation set.
Note: these experiments will be run off-line, so efficiency is not an issue.
The classic way to do this would be to define a set of key words you want the algorithms to find per paragraph, then check how well the algorithms do with respect to this set, e.g. (generated_correct - generated_not_correct)/total_generated (see update, this is nonsense). This is automatic once you have defined this ground truth. I guess constructing that is what you want to automate as well when you talk about constructing the evaluation set? That's a bit more tricky.
Generally, if there was a way to generate key words automatically that's a good way to use as a ground truth - you should use that as your algorithm ;). Sounds cheeky, but it's a common problem. When you evaluate one algorithm using the output of another algorithm, something's probably going wrong (unless you specifically want to benchmark against that algorithm).
So you might start harvesting key words from common sources. For example:
Download scientific papers that have a keyword section. Check if those keywords actually appear in the text, if they do, take the section of text including the keywords, use the keyword section as ground truth.
Get blog posts, check if the terms in the heading appear in the text, then use the words in the title (always minus stop words of course) as ground truth
...
You get the idea. Unless you want to employ people to manually generate keywords, I guess you'll have to make do with something like the above.
Update
The evaluation function mentioned above is flawed: it does not take into account how many of the available keywords have been found. Instead, the way to judge a ranked list of relevant and irrelevant results is to use precision and recall. Precision rewards the absence of irrelevant results; recall rewards the presence of relevant results. That again gives you two measures. To combine them into a single measure, either use the F-measure, which combines the two with an optional weighting, or use Precision@X, where X is the number of results you want to consider. Precision@X, interestingly, is equivalent to Recall@X here. However, you need a sensible X: if you have fewer than X keywords in some cases, those results will be punished for never providing an Xth keyword. In the literature on tag recommendation, for example, which is very similar to your case, F-measure and P@5 are often used.
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall
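A minimal sketch of the measures discussed above (the keyword lists in main are invented for illustration; note that this variant of Precision@X divides by X, so a list shorter than X is punished, as described):

    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Precision, recall and F-measure for one paragraph: `generated` is the
    // ranked keyword list produced by the algorithm, `truth` is the ground
    // truth set (e.g. the author-supplied keyword section of a paper).
    struct Scores { double precision, recall, f1; };

    Scores evaluate(const std::vector<std::string>& generated,
                    const std::set<std::string>& truth) {
        std::size_t hits = 0;
        for (const auto& kw : generated)
            if (truth.count(kw)) ++hits;

        double p = generated.empty() ? 0.0 : double(hits) / generated.size();
        double r = truth.empty()     ? 0.0 : double(hits) / truth.size();
        double f = (p + r == 0.0)    ? 0.0 : 2.0 * p * r / (p + r);
        return {p, r, f};
    }

    // Precision at a cutoff X: only the top X ranked keywords are considered,
    // and the denominator is X, so short result lists are penalized.
    double precisionAt(const std::vector<std::string>& generated,
                       const std::set<std::string>& truth, std::size_t x) {
        std::size_t n = std::min(x, generated.size());
        std::size_t hits = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (truth.count(generated[i])) ++hits;
        return x == 0 ? 0.0 : double(hits) / x;
    }

    int main() {
        std::vector<std::string> generated = {"tf-idf", "bigram", "banana", "keyword", "ranking"};
        std::set<std::string> truth = {"tf-idf", "keyword", "evaluation"};

        Scores s = evaluate(generated, truth);
        std::cout << "P=" << s.precision << " R=" << s.recall << " F1=" << s.f1
                  << " P@5=" << precisionAt(generated, truth, 5) << "\n";
    }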
I have been working on an implementation of a plagiarism detection engine based on the academic paper behind MOSS (Measure Of Software Similarity).
Link to MOSS
For designing a noise filter for a language like C/C++/Java, I have some decisions to make.
Are keywords relevant for detecting plagiarism, or should they be removed?
Source files in the same language are bound to share the same set of keywords. The paper does not discuss how to deal with them.
How to deal with identifiers?
Replacing all identifiers with a single character 'V', making matches independent of variable names, makes sense.
What to do with package imports and library includes?
Whitespace, comments and punctuation are definitely to be stripped.
I am worried that after all these operations the source file will be just a bunch of 'V's and other garbled text.
What operations should the noise filter perform?
Any insights and opinions on the best way to deal with noise?
For single functions: compile them, and compare the resulting assembler code or objects.
For a whole program: do the above for all the functions and create a fuzzy search to find back the fragments in a database of known functions and fragments.
So basically, you need to build a compiler that emits a canonised representation of its input, similar to P-code, but preferably human-readable.
Some fragments are more characteristic than others. For example, the fragment
for (i=0; i < 12345; i++) {
array[i] = 54321;
}
will probably occur in some form in every program. It is 100% functionally identical to
j=0;
while ( j < 12345) {
foobar[j++] = 54321;
}
and a compiler would probably produce identical code for both.
There can be differences in variable-names, numerical constants, address constants, anything. But the "skeleton" of keywords (-> {comparisons, loops, expressions, assignments, function calls}) will be the same. So: don't drop the keywords, they are the scaffolding of a program.
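As a rough sketch of such a canonising filter (simplified and hypothetical, not what MOSS itself does; the keyword list is truncated): strip comments and whitespace, keep keywords as the skeleton, and map every other identifier to the single token 'V'. Run on the two fragments above, the outputs differ mainly in the for/while scaffolding.

    #include <cctype>
    #include <iostream>
    #include <set>
    #include <string>

    // Simplified noise filter for C-like source: comments and whitespace are
    // dropped, keywords are kept verbatim as the "skeleton", and every other
    // identifier becomes 'V'. Digits and punctuation pass through unchanged.
    std::string normalize(const std::string& src) {
        static const std::set<std::string> keywords = {
            "for", "while", "if", "else", "return", "int", "char", "void"};
        std::string out;
        std::size_t i = 0;
        while (i < src.size()) {
            if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '/') {        // line comment
                while (i < src.size() && src[i] != '\n') ++i;
            } else if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '*') { // block comment
                i += 2;
                while (i + 1 < src.size() && !(src[i] == '*' && src[i + 1] == '/')) ++i;
                i += 2;
            } else if (std::isspace(static_cast<unsigned char>(src[i]))) {
                ++i;
            } else if (std::isalpha(static_cast<unsigned char>(src[i])) || src[i] == '_') {
                std::string word;
                while (i < src.size() &&
                       (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                    word += src[i++];
                out += keywords.count(word) ? word : "V";                          // keep keywords
            } else {
                out += src[i++];                                                   // digits, operators, punctuation
            }
        }
        return out;
    }

    int main() {
        std::cout << normalize("for (i=0; i < 12345; i++) { array[i] = 54321; } // fill\n") << "\n";
        std::cout << normalize("j=0; while ( j < 12345) { foobar[j++] = 54321; } /* same */\n") << "\n";
    }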
There is quite a lot to find on Google if you search for "text fingerprint shingle". A shingle is a run of x consecutive words (x = 7 in many research projects). You build the set of all shingles, word by word.
You then build a hash over each shingle and compare the thousands of shingles in a text. It's pretty simple. There are a few details, such as special hash functions you for sure haven't heard of outside this context, etc.
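A minimal sketch of word shingling, with std::hash standing in for the specialized hash functions mentioned above and Broder's resemblance (Jaccard similarity of the shingle sets) as the comparison:

    #include <functional>
    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // Build the set of hashed x-word shingles for a text.
    std::set<std::size_t> shingleHashes(const std::string& text, std::size_t x = 7) {
        std::istringstream in(text);
        std::vector<std::string> words;
        std::string w;
        while (in >> w) words.push_back(w);

        std::set<std::size_t> hashes;
        if (words.size() < x) return hashes;
        for (std::size_t i = 0; i + x <= words.size(); ++i) {
            std::string shingle;
            for (std::size_t j = 0; j < x; ++j) shingle += words[i + j] + " ";
            hashes.insert(std::hash<std::string>{}(shingle));   // std::hash as a stand-in
        }
        return hashes;
    }

    // Resemblance of two texts: Jaccard similarity of their shingle-hash sets.
    double resemblance(const std::set<std::size_t>& a, const std::set<std::size_t>& b) {
        std::size_t common = 0;
        for (std::size_t h : a) if (b.count(h)) ++common;
        std::size_t unionSize = a.size() + b.size() - common;
        return unionSize == 0 ? 0.0 : double(common) / unionSize;
    }

    int main() {
        // x = 4 here only because the example sentences are short.
        auto a = shingleHashes("the quick brown fox jumps over the lazy dog near the river bank", 4);
        auto b = shingleHashes("a quick brown fox jumps over the lazy dog near the river bank", 4);
        std::cout << "resemblance: " << resemblance(a, b) << "\n";
    }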
Start by reading, for example, the papers below; it's not really rocket science, but not trivial either.
"Text Origin Detection in an Efficient Way" Besnik Fetahu, Andreas Frische
http://resources.mpi-inf.mpg.de/d5/teaching/ws10_11/hir/reports/BesnikFetahu.pdf
"Algorithms for duplicate documents", Andrei Broder
http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf
My current project is an advanced tag database with boolean retrieval features. Records are queried with boolean expressions like the following (e.g. in a music database):
funky-music and not (live or cover)
which should yield all funky music in the music database but not live or cover versions of the songs.
When it comes to caching, the problem is that there exist queries which are equivalent but different in structure. For example, applying de Morgan's rule the above query could be written like this:
funky-music and not live and not cover
which would yield exactly the same records but of course break caching if caching were implemented by hashing the query string, for example.
Therefore, my first intention was to create a truth table of the query which could then be used as a caching key as equivalent expressions form the same truth table. Unfortunately, this is not practicable as the truth table grows exponentially with the number of inputs (tags) and I do not want to limit the number of tags used in one query.
Another approach could be traversing the syntax tree applying rules defined by the boolean algebra to form a (minimal) normalized representation which seems to be tricky too.
Thus the overall question is: Is there a practicable way to implement recognition of equivalent queries without the need of circuit minimization or truth tables (edit: or any other algorithm which is NP-hard)?
The ne plus ultra would be recognizing already-cached subqueries, but that is not a primary goal.
A general and efficient algorithm to determine whether a query is equivalent to "False" could be used to solve NP-complete problems efficiently, so you are unlikely to find one.
You could try transforming your queries into a canonical form. Because of the above, there will always be queries that are very expensive to transform into any given form, but you might find that, in practice, some form works pretty well most of the time, and you can always give up halfway through a transformation if it is becoming too hard.
You could look at http://en.wikipedia.org/wiki/Conjunctive_normal_form, http://en.wikipedia.org/wiki/Disjunctive_normal_form, http://en.wikipedia.org/wiki/Binary_decision_diagram.
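As a sketch of one cheap canonical form (not CNF/DNF and not a full minimization; all type and function names here are invented for the illustration): push negations inward with De Morgan's laws, drop double negations, flatten nested and/or, and sort operands, then use the printed result as the cache key. The two example queries from the question then produce the same key, although more deeply hidden equivalences will still be missed.

    #include <algorithm>
    #include <iostream>
    #include <memory>
    #include <string>
    #include <vector>

    // Query AST: a tag leaf, NOT, or an n-ary AND/OR.
    struct Q;
    using QP = std::shared_ptr<Q>;
    struct Q {
        enum Kind { TAG, NOT, AND, OR } kind;
        std::string tag;        // for TAG
        std::vector<QP> kids;   // for NOT/AND/OR
    };

    QP tag(std::string t) { return std::make_shared<Q>(Q{Q::TAG, std::move(t), {}}); }
    QP mk(Q::Kind k, std::vector<QP> kids) { return std::make_shared<Q>(Q{k, "", std::move(kids)}); }

    std::string key(const QP& q);   // forward declaration

    // Canonicalize: push NOT inward via De Morgan, drop double negations,
    // flatten nested AND/AND and OR/OR, and sort operands of AND/OR.
    QP canon(const QP& q) {
        if (q->kind == Q::TAG) return q;
        if (q->kind == Q::NOT) {
            QP c = canon(q->kids[0]);
            if (c->kind == Q::NOT) return c->kids[0];                   // not not X -> X
            if (c->kind == Q::AND || c->kind == Q::OR) {                // De Morgan
                std::vector<QP> negated;
                for (const auto& k : c->kids) negated.push_back(canon(mk(Q::NOT, {k})));
                return canon(mk(c->kind == Q::AND ? Q::OR : Q::AND, negated));
            }
            return mk(Q::NOT, {c});
        }
        std::vector<QP> kids;
        for (const auto& k : q->kids) {
            QP c = canon(k);
            if (c->kind == q->kind)                                     // flatten (a and (b and c))
                kids.insert(kids.end(), c->kids.begin(), c->kids.end());
            else
                kids.push_back(c);
        }
        std::sort(kids.begin(), kids.end(),
                  [](const QP& a, const QP& b) { return key(a) < key(b); });
        return mk(q->kind, kids);
    }

    // Deterministic string form of a canonicalized query: usable as a cache key.
    std::string key(const QP& q) {
        switch (q->kind) {
            case Q::TAG: return q->tag;
            case Q::NOT: return "not(" + key(q->kids[0]) + ")";
            default: {
                std::string s = (q->kind == Q::AND) ? "and(" : "or(";
                for (const auto& k : q->kids) s += key(k) + ",";
                return s + ")";
            }
        }
    }

    int main() {
        // funky-music and not (live or cover)
        QP a = mk(Q::AND, {tag("funky-music"), mk(Q::NOT, {mk(Q::OR, {tag("live"), tag("cover")})})});
        // funky-music and not live and not cover
        QP b = mk(Q::AND, {mk(Q::AND, {tag("funky-music"), mk(Q::NOT, {tag("live")})}),
                           mk(Q::NOT, {tag("cover")})});
        std::cout << key(canon(a)) << "\n" << key(canon(b)) << "\n";    // identical cache keys
    }

Note that this only catches equivalences the chosen rules can expose; as stated above, a complete equivalence check is NP-hard.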
You can convert the queries into conjunctive normal form (CNF). It is a canonical, simple representation of boolean formulae that is normally the basis for SAT solvers.
Most likely "large" queries are going to have lots of conjunctions (rather than lots of disjunctions) so CNF should work well.
The Quine-McCluskey algorithm should achieve what you are looking for. It is similar to Karnaugh maps, but easier to implement in software.