Algorithm to parse an expression and assign a value that satisfies the conditions

The requirement is to parse a complex expression and assign a value to the variable so that the condition is satisfied. For example:
((!(($weatherResult.cityName=="Seattle")||($weatherResult.cityName=="Portland")))&&($weatherResult.cityName=="Folsom"))
So based on this expression, the value of $weatherResult.cityName should be Folsom.
Now if we take this expression:
((!(($weatherResult.cityName=="Seattle")||($weatherResult.cityName=="Portland")))&&(!($weatherResult.cityName=="Folsom")))
Here the value of $weatherResult.cityName should be any other city that does not match Seattle, Portland, or Folsom, e.g. Boston.
There is a catalog of US cities, but not every expression needs to be backed by a catalog; the values can be plain strings as well.
One idea is to randomly pick a value and evaluate the expression; if it evaluates to false, keep repeating. But this approach is time-consuming and expensive. Instead, I want to parse the expression and pick the value intelligently. I was thinking of using ANTLR to parse the expression, but I still have not been able to come up with an algorithm that would let me walk the tree and pick/assign values.
If anyone has any recommendations, please suggest them.

A potential solution is to use a constraint programming solver. See https://mlabonne.github.io/blog/constraintprogramming/
Note that this problem is NP-complete: it will be solved quickly for small inputs, but the runtime will explode for large inputs.
In fact, if all you have are equalities to city names, you could use a SAT solver.
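For illustration, here is a minimal sketch of the second example using the Z3 SMT solver's Python bindings (my choice, not something from the question; any constraint-programming or SAT solver would work similarly, and the catalog below is hypothetical):
# a minimal sketch, assuming the Z3 SMT solver's Python bindings ("pip install z3-solver")
from z3 import String, StringVal, Solver, And, Or, Not, sat

city = String("cityName")  # stands in for $weatherResult.cityName

# ((!((city=="Seattle")||(city=="Portland")))&&(!(city=="Folsom")))
expr = And(Not(Or(city == StringVal("Seattle"), city == StringVal("Portland"))),
           Not(city == StringVal("Folsom")))

# optional: restrict answers to a (hypothetical) catalog of US cities
catalog = ["Seattle", "Portland", "Folsom", "Boston", "Austin"]

solver = Solver()
solver.add(expr)
solver.add(Or([city == StringVal(c) for c in catalog]))

if solver.check() == sat:
    print(solver.model()[city])  # e.g. "Boston"
else:
    print("no value satisfies the expression")
An ANTLR-based parser would then only need to translate the expression tree into the corresponding solver constraints instead of evaluating it against random guesses.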

Related

How to decide to convert to categorical variable or keep it numeric?

This might be a basic or trivial question with a straightforward answer. Still, I would like to ask it to clear up my doubt once and for all.
Take the example of Passenger Class in the famous Titanic data. Functionally it is indeed categorical data, so it makes perfect sense to convert it to a categorical variable. Algorithms, as I understand it, tend to see a pattern specific to each class. But at the same time, if you treat it as a numeric variable, it might also denote a range for a decision tree, say passengers between first class and second class.
It looks like both are correct, and both will affect the machine learning algorithm's output in different ways.
Which one is appropriate, and is there an extensive discussion about it anywhere? Should we use such ambiguous variables as numeric and also include a copy as a categorical variable, which might prove to be a technique for uncovering more patterns?
I suppose it's up to you whether you'd rather interpret a continuous PassengerClass variable as "for every one-unit increase in PassengerClass, the passenger's likelihood of survival goes up/down X%," versus a categorical (factor) PassengerClass as, "the likelihoods of survival for groups 2 and 3 (for example, leaving 1st-class passengers as the base group) are X% and Y% higher, respectively, than the base group, holding all else constant."
I think about variables like PassengerClass almost as "treatment groups." Yes, I suppose you could interpret it as continuous, but I think it makes more sense to consider the unique effects of each class like "people who were given the drug versus those who weren't" - you can very easily compare the impacts of being in a higher class (e.g. 2 or 3) to being in the most common class, 1, which again would be left out.
The problem with mapping categorical notions to numerical is that some algorithms (e.g. neural networks) will interpret the value itself as having a meaning, i.e. you would get different results if you assign values 1,2,3 to passenger classes than, for example 0,1,2 or 3,2,1. The correspondence between the passenger classes and numbers is purely conventional and doesn't necessarily convey any additional meaning.
One could argue that the lower the number, the "better" the class is; however, it's still hard to interpret it as "the first class is twice as good as second class", unless you define some measure of "goodness" that makes the relation between the numbers "1" and "2" sensible.
In this example, you have categorical data that is ordinal - meaning you can rank the categories (from best accommodations to worst, for example) but they're still categories. Regardless of how you label them, there's no actual information about the relative distances among your categories. You can put them in a table, but not (correctly) on a number line. In cases like this, it's generally best to treat your categorical data as independent categories.
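To make the two treatments concrete, here is a minimal sketch (assuming pandas, which is not mentioned above, and a hypothetical Pclass column):
# a sketch assuming pandas; Pclass is a Titanic-style passenger-class column
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3, 2, 3, 1]})

# treated as numeric: the model sees an ordered quantity where 1 < 2 < 3
as_numeric = df["Pclass"].astype(float)

# treated as categorical: one dummy column per class, no implied distance between classes
as_categorical = pd.get_dummies(df["Pclass"], prefix="Pclass")
print(as_categorical)  # columns Pclass_1, Pclass_2, Pclass_3 with 0/1 indicators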

C4.5 algorithm with unbounded attributes

The current implementation of C4.5 in VFDT (http://www.cs.washington.edu/dm/vfml/vfdt.html), or for that matter any other implementation, uses the C4.5 file format for providing inputs for constructing the decision tree. According to this, the attributes can have the following formats:
continuous
If the attribute has a continuous value.
discrete
The word 'discrete' followed by an integer which indicates how many values the attribute can take.
list of identifiers
This is a discrete attribute with the values enumerated (this is the preferred method for discrete attributes). The identifiers should be separated by commas.
ignore
Means the attribute should be ignored; it won't be used.
Does anybody know how we can specify discrete-valued attributes whose complete set of possible values is too large to list?
For example, an "IP-Address" attribute can have Math.Pow(256, 4) possible discrete values;
a "QueryString" attribute can have an effectively unbounded number of possible values, etc.
Can the C4.5 algorithm handle the case where an attribute has, say, 100,000 distinct discrete values, or where the exact bound is not known and only an approximation is available?
Thanks.
The usual choice is to enumerate all the values of a discrete feature that occur in your training set. Since the algorithm can never gather enough statistics for values that are not seen during training, those would be ignored no matter how you'd implement them.
Mind you, it's quite hard to gather statistics for such features anyway, so you might want to think about different representations. In particular, multi-word strings of text can be tokenized and treated as bags of words.
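For instance, here is a minimal sketch of the bag-of-words representation (assuming scikit-learn, which has nothing to do with VFDT; it is only to illustrate the encoding, and the query strings are made up):
# a sketch assuming scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

queries = ["cheap flights to seattle",
           "seattle weather forecast",
           "cheap hotels in boston"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(queries)          # sparse document-term count matrix
print(vectorizer.get_feature_names_out())      # the vocabulary enumerated from the training data
print(X.toarray())                             # one row per string, one column per token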

I'm looking for an algorithm or function that can take a text string and convert it to a number

I am looking for an algorithm, function, or technique that can take a string and convert it to a number. I would like the algorithm or function to have the following properties:
Identical strings yield the same calculated value
Similar strings would yield similar values (similar can be defined as similar in meaning or similar in composition)
Capable of handling strings of variable length
I read an article several years ago that gives me hope that this can be achieved. Unfortunately, I have been unable to recall the source of the article.
Similar in composition is pretty easy, I'll let somebody else tackle that.
Similar in meaning is a lot harder, but fun :). I remember reading an article about how a neural network was trained to construct a 2D "semantic meaning graph" of a whole bunch of English words, where the distance between two words represented how "similar" they are in meaning, just by training it on Wikipedia articles.
You could do the same thing, but make it one-dimensional; that will give you a single continuous number, where similar words will be close to each other.
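As a rough sketch of that idea (assuming gensim for the word vectors and scikit-learn's PCA to squash them down to one dimension; the toy corpus is made up, and a real run would train on something like Wikipedia):
# a sketch assuming gensim (Word2Vec) and scikit-learn (PCA)
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

sentences = [["king", "queen", "palace", "crown"],
             ["cat", "dog", "pet", "vet"],
             ["king", "crown", "queen", "throne"]]

model = Word2Vec(sentences, vector_size=16, min_count=1, seed=1)

words = list(model.wv.index_to_key)
coords = PCA(n_components=1).fit_transform(model.wv[words])  # squash each vector to one number
for word, x in zip(words, coords[:, 0]):
    print(word, round(float(x), 3))  # with enough training data, similar words land near each other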
Non-serious answer: Map everything to 0
Property 1: check. Property 2: check. Property 3: check.
But I figure you want dissimilar strings to get different values, too. The question then is, what is similar and what is not.
Essentially, you are looking for a hash function.
There are a lot of hash functions designed with different objectives. Cryptographic hashes, for example, are pretty expensive to compute, because you want to make it really hard to go backwards or even predict how a change to the input affects the output. So they try really hard to violate your condition 2. There are also simpler hash functions that mostly try to spread the data: they try to ensure that close input values are not close to each other afterwards (but it is okay if the mapping is predictable).
You may want to read up on Wikipedia:
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_substrings
(Yes, it has a section on "Finding similar substrings" via Hashing)
Wikipedia also has a list of hash functions:
https://en.wikipedia.org/wiki/List_of_hash_functions
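To make the point about condition 2 concrete, here is what a cryptographic hash does to two nearly identical inputs (a tiny demo using Python's standard hashlib; the strings are arbitrary):
import hashlib

print(hashlib.sha256(b"hello world").hexdigest())
print(hashlib.sha256(b"hello worlds").hexdigest())
# one extra character produces a completely unrelated digest, which is exactly
# the opposite of property 2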
There is some related work for you; for example, MinHash could be used. Here is a MinHash-inspired approach: define a few random orderings (permutations) of all the letters in your alphabet. Say I have only the letters "abcde", and I'll use just two lists for this example. Then my lists are:
p1 = "abcde"
p2 = "edcba"
Let f1(str) be the smallest index in p1 of any letter occurring in my test word, and f2(str) the same for p2. So the word "bababa" would map to 0,3. The word "ababab" also. The word "dada" would map to 0,1, while "ce" maps to 2,0. Note that this map is invariant to word permutations (because it treats words as sets of letters), and for long texts it will converge to "0,0". Yet with some fine-tuning it can give you a fast way of finding candidates for closer inspection.
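A short sketch of that scheme (using only the two hand-picked orderings from the example; a real MinHash would use many random permutations over a full alphabet):
# sketch of the permutation-minimum signature described above
p1 = "abcde"
p2 = "edcba"

def sig(word, perm):
    # smallest index in perm of any letter that occurs in the word
    return min(perm.index(c) for c in set(word) if c in perm)

for w in ["bababa", "ababab", "dada", "ce"]:
    print(w, (sig(w, p1), sig(w, p2)))
# bababa -> (0, 3), ababab -> (0, 3), dada -> (0, 1), ce -> (2, 0)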
Fuzzy hashing (context triggered piecewise hashing) may be what you are looking for.
Implementation: ssdeep
Explanation of the algorithm: Identifying almost identical files using context triggered piecewise hashing
I think you're probably after a hash function, as numerous posters have said. However, similar in meaning is also possible, after a fashion: use something like Latent Dirichlet Allocation or Latent Semantic Analysis to map your word into multidimensional space, relative to a model trained on a large collection of text (these pre-trained models can be downloaded if you don't have access to a representative sample of the kind of text you're interested in). If you need a scalar value rather than a multi-dimensional vector (it's hard to tell; you don't say what you want it for), you could try a number of things, like the probability of the most probable topic, the mean across the dimensions, the index of the most probable topic, etc.
# sum the byte values of the string's encoding
def string_to_number(s):
    return sum(s.encode("utf-8"))
This would meet all 3 properties (for #2, it works on the string's binary composition).

Most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.
Question: which of these measures would be most relevant for detecting two different person names that are actually the same (different because of a misspelling)? The trick is that it should minimize false positives. Example:
Obaama
Obama
=> should probably be merged
Obama
Ibama
=> should not be merged.
This is just an oversimplified example. Are there programmers and computer scientists who have worked out this issue in more detail?
I can suggest an information-retrieval technique of doing so, but it requires a large collection of documents in order to work properly.
Index your data, using the standard IR techniques. Lucene is a good open source library that can help you with it.
Once you get a name (Obaama, for example): retrieve the set of documents the word Obaama appears in. Let this set be D1.
Now, for each word w in D1 [1], search for Obaama AND w (using your IR system). Let this set be D2.
The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and it will most likely be close to 1 for w=Obama [2].
You can manually label a set of examples and find a threshold value above which words are expected to be merged.
Using a standard lexicographical similarity technique, you can choose to filter out words that are definitely not spelling mistakes (like Barack).
Another solution that is often used requires a query log: find correlations between searched words; if obaama is correlated with obama in the query log, they are connected.
[1] You can improve performance by first applying the second filter, checking only candidates that are "similar enough" lexicographically.
[2] Usually a normalization is also applied, because more frequent words are more likely to appear in the same documents as any given word, regardless of whether they are related.
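Here is a toy illustration of the |D2|/|D1| score over a handful of made-up documents; a real system would use a proper IR index such as Lucene:
# toy illustration of the |D2|/|D1| co-occurrence score; the "documents" are hypothetical
docs = [
    "obaama obama wins election",
    "obama wins election in landslide",
    "obaama obama speech transcript",
    "barack obama biography",
]

def docs_containing(term):
    return {i for i, d in enumerate(docs) if term in d.split()}

d1 = docs_containing("obaama")                                   # documents with the misspelling
candidates = {w for i in d1 for w in docs[i].split()} - {"obaama"}
for w in sorted(candidates):
    score = len(d1 & docs_containing(w)) / len(d1)               # |D2| / |D1|
    print(w, score)                                              # close to 1 => strongly connected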
You can check NerSim (demo) which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.

Approximate string matching algorithms

Here at work, we often need to find the string in a list of strings that is the closest match to some other input string. Currently, we are using the Needleman-Wunsch algorithm. The algorithm often returns a lot of false positives (if we set the minimum score too low), sometimes it doesn't find a match when it should (when the minimum score is too high), and, most of the time, we need to check the results by hand. We thought we should try other alternatives.
Do you have any experiences with the algorithms?
Do you know how the algorithms compare to one another?
I'd really appreciate some advice.
PS: We're coding in C#, but you shouldn't care about it - I'm asking about the algorithms in general.
Oh, I'm sorry I forgot to mention that.
No, we're not using it to match duplicate data. We have a list of strings that we are looking for - we call it search-list. And then we need to process texts from various sources (like RSS feeds, web-sites, forums, etc.) - we extract parts of those texts (there are entire sets of rules for that, but that's irrelevant) and we need to match those against the search-list. If the string matches one of the strings in search-list - we need to do some further processing of the thing (which is also irrelevant).
We cannot perform a normal comparison, because the strings extracted from the outside sources most of the time include some extra words, etc.
Anyway, it's not for duplicate detection.
OK, Needleman-Wunsch (NW) is a classic end-to-end ("global") aligner from the bioinformatics literature. It was long ago available as "align" and "align0" in the FASTA package. The difference was that the "0" version wasn't as biased about avoiding end-gapping, which often made it easier to favor high-quality internal matches. Smith-Waterman, I suspect you're aware, is a local aligner and is the original basis of BLAST. FASTA had its own local aligner as well that was slightly different. All of these are essentially heuristic methods for estimating Levenshtein distance relative to a scoring metric for individual character pairs (in bioinformatics, often given by Dayhoff/"PAM", Henikoff & Henikoff, or other matrices, and usually replaced with something simpler and more reasonably reflective of replacements in linguistic word morphology when applied to natural language).
Let's not be precious about labels: Levenshtein distance, as referenced in practice at least, is basically edit distance, and you have to estimate it because it's not feasible to compute it generally, and it's expensive to compute exactly even in interesting special cases: the water gets deep quickly there, and thus we have heuristic methods of long and good repute.
Now as to your own problem: several years ago, I had to check the accuracy of short DNA reads against reference sequence known to be correct and I came up with something I called "anchored alignments".
The idea is to take your reference string set and "digest" it by finding all locations where a given N-character substring occurs. Choose N so that the table you build is not too big but also so that substrings of length N are not too common. For small alphabets like DNA bases, it's possible to come up with a perfect hash on strings of N characters and make a table and chain the matches in a linked list from each bin. The list entries must identify the sequence and start position of the substring that maps to the bin in whose list they occur. These are "anchors" in the list of strings to be searched at which an NW alignment is likely to be useful.
When processing a query string, you take the N characters starting at some offset K in the query string, hash them, look up their bin, and if the list for that bin is nonempty then you go through all the list records and perform alignments between the query string and the search string referenced in the record. When doing these alignments, you line up the query string and the search string at the anchor and extract a substring of the search string that is the same length as the query string and which contains that anchor at the same offset, K.
If you choose a long enough anchor length N, and a reasonable set of values of offset K (they can be spread across the query string or be restricted to low offsets) you should get a subset of possible alignments and often will get clearer winners. Typically you will want to use the less end-biased align0-like NW aligner.
This method tries to boost NW a bit by restricting its input, and this gives a performance gain because you do fewer alignments and they are more often between similar sequences. Another good thing to do with your NW aligner is to allow it to give up after some amount or length of gapping occurs, to cut costs, especially if you know you're not going to see or be interested in middling-quality matches.
Finally, this method was used on a system with small alphabets, with K restricted to the first 100 or so positions in the query string and with search strings much larger than the queries (the DNA reads were around 1000 bases and the search strings were on the order of 10000, so I was looking for approximate substring matches justified by an estimate of edit distance specifically). Adapting this methodology to natural language will require some careful thought: you lose on alphabet size but you gain if your query strings and search strings are of similar length.
Either way, allowing more than one anchor from different ends of the query string to be used simultaneously might be helpful in further filtering data fed to NW. If you do this, be prepared to possibly send overlapping strings each containing one of the two anchors to the aligner and then reconcile the alignments... or possibly further modify NW to emphasize keeping your anchors mostly intact during an alignment using penalty modification during the algorithm's execution.
Hope this is helpful or at least interesting.
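For what it's worth, here is a rough sketch of the anchor-table part described above (the digest plus candidate extraction); difflib's SequenceMatcher stands in for the NW aligner purely to keep the sketch short, which is my substitution, not the original method:
# rough sketch: anchor table + candidate extraction, with a stand-in scorer
from collections import defaultdict
from difflib import SequenceMatcher

N = 4  # anchor (substring) length

def build_anchor_table(search_strings):
    # map every N-character substring to the (string id, start position) pairs where it occurs
    table = defaultdict(list)
    for sid, s in enumerate(search_strings):
        for pos in range(len(s) - N + 1):
            table[s[pos:pos + N]].append((sid, pos))
    return table

def candidates(query, search_strings, table, offsets=(0, 5, 10)):
    # take the N-gram at each offset K of the query, look up its anchors, extract a
    # same-length window of the search string, and score it against the query
    scored = []
    for k in offsets:
        anchor = query[k:k + N]
        if len(anchor) < N:
            continue
        for sid, pos in table.get(anchor, []):
            start = max(0, pos - k)
            window = search_strings[sid][start:start + len(query)]
            scored.append((SequenceMatcher(None, query, window).ratio(), sid, window))
    return sorted(scored, reverse=True)

search = ["the quick brown fox jumps over the lazy dog",
          "a quick brown cat naps all day long"]
table = build_anchor_table(search)
print(candidates("quick brown fox jumped", search, table)[:3])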
Related to the Levenshtein distance: you might wish to normalize it by dividing the result by the length of the longer string, so that you always get a number between 0 and 1 and so that you can compare the distances of pairs of strings in a meaningful way (the expression L(A, B) > L(A, C), for example, is meaningless unless you normalize the distance).
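For instance, a small sketch of the normalized distance (plain dynamic programming):
def levenshtein(a, b):
    # classic dynamic-programming edit distance, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free if equal)
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    return levenshtein(a, b) / max(len(a), len(b), 1)

print(normalized_levenshtein("Obaama", "Obama"))  # 1/6, roughly 0.17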
We are using the Levenshtein distance method to check for duplicate customers in our database. It works quite well.
Alternative algorithms to look at are agrep (Wikipedia entry on agrep)
and the FASTA and BLAST biological sequence matching algorithms. These are special cases of approximate string matching, also in the Stony Brook algorithm repository. If you can specify the ways the strings differ from each other, you could probably focus on a tailored algorithm. For example, aspell uses some variant of "soundslike" (soundex-metaphone) distance in combination with a "keyboard" distance to accommodate bad spellers and bad typists alike.
Use an FM index with backtracking, similar to the one in the Bowtie fuzzy aligner.
In order to minimize mismatches due to slight variations or errors in spelling, I've used the Metaphone algorithm, then Levenshtein distance (scaled to 0-100 as a percentage match) on the Metaphone encodings for a measure of closeness. That seems to have worked fairly well.
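Something along these lines (a sketch assuming the jellyfish library for the Metaphone codes and the Levenshtein distance; not the exact implementation described above, and the example names are arbitrary):
# a sketch assuming the jellyfish library ("pip install jellyfish")
import jellyfish

def phonetic_similarity(a, b):
    # compare Metaphone encodings, then scale the edit distance to a 0-100 "% match"
    ma, mb = jellyfish.metaphone(a), jellyfish.metaphone(b)
    dist = jellyfish.levenshtein_distance(ma, mb)
    return 100 * (1 - dist / max(len(ma), len(mb), 1))

print(phonetic_similarity("Jonathon", "Jonathan"))  # near-homophones score close to 100
print(phonetic_similarity("Jonathon", "Rebecca"))   # unrelated names score much lower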
To expand on Cd-MaN's answer, it sounds like you're facing a normalization problem. It isn't obvious how to handle scores between alignments with varying lengths.
Given what you are interested in, you may want to obtain p-values for your alignment. If you are using Needleman-Wunsch, you can obtain these p-values using Karlin-Altschul statistics http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST can do local alignments and evaluate them using these statistics. If you are concerned about speed, this would be a good tool to use.
Another option is to use HMMER. HMMER uses Profile Hidden Markov Models to align sequences. Personally, I think this is a more powerful approach since it also provides positional information. http://hmmer.janelia.org/
