I want to measure semantic similarity between two phrases/sentences. Is there any framework that I can use directly and reliably?
I have already checked out this question, but it's pretty old and I couldn't find a really helpful answer there. There was one link, but I found it unreliable.
e.g.:
I have a phrase: felt crushed
I have several choices: force inwards, pulverized, destroyed emotionally, reshaping, etc.
I want to find the term/phrase with highest similarity to the first one.
The answer here is: destroyed emotionally.
The bigger picture is: I want to identify which frame from FrameNet matches the given verb, as per its usage in a sentence.
Update: I found this library very useful for measuring similarity between two words. The ConceptNet similarity mechanism is also very good.
I also found this library useful for measuring semantic similarity between sentences.
If anyone has any insights please share.
This is a very complicated problem.
The main technique I can think of (before going into more complicated NLP processes) would be to apply cosine (or any other) similarity to each pair of phrases. Obviously this solution by itself would be fairly ineffective because of the vocabulary-mismatch problem: the sentences might refer to the same concept with different words.
To solve this issue, you should transform the initial representation of each phrase into a more "conceptual" one. One option is to extend each word with its synonyms (e.g. using WordNet); another is to apply techniques such as distributional semantics (DS) (http://liawww.epfl.ch/Publications/Archive/Besanconetal2001.pdf), which extend the representation of each term with the words most likely to appear with it.
Example:
A representation of the document {"car","race"} would be transformed to {"car","automobile","race"} with synonyms, while with DS it would be something like {"car","wheel","road","pilot", ...}.
Obviously this transformation won't be binary; each term will carry an associated weight.
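Here is a rough sketch of the synonym-expansion idea in Python, assuming NLTK with the WordNet corpus is available; the 0.5 synonym weight and the example phrases are arbitrary choices of mine, not part of any standard recipe:

# Requires: pip install nltk, plus nltk.download('wordnet') once.
from collections import Counter
from math import sqrt
from nltk.corpus import wordnet as wn

def expand(phrase):
    # Bag of words for the phrase, extended with WordNet synonyms at a lower weight.
    bag = Counter()
    for word in phrase.lower().split():
        bag[word] += 1.0
        for syn in wn.synsets(word):
            for lemma in syn.lemma_names():
                if lemma.lower() != word:
                    bag[lemma.lower()] += 0.5
    return bag

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = expand("felt crushed")
candidates = ["force inwards", "pulverized", "destroyed emotionally", "reshaping"]
print(max(candidates, key=lambda c: cosine(query, expand(c))))

The same skeleton works for the distributional variant: just replace the synonym loop with whatever co-occurrence-based expansion you compute.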
I hope this helps.
Maybe the cortical.io API could help with your problem. The approach there is that every word is converted into a semantic fingerprint that characterizes its meaning with 16K semantic features. Phrases, sentences, or longer texts are converted into fingerprints by ORing the word fingerprints together. After this conversion into a (numeric) binary vector representation, semantic distance can easily be computed using distance measures like Euclidean distance or cosine similarity.
All the necessary conversion and comparison functions are provided by the API.
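As an illustration of the comparison step only (the fingerprints below are random stand-ins; real ones would come from the cortical.io API):

import numpy as np

# Hypothetical 16K-bit word fingerprints; in practice these are returned by the API.
rng = np.random.default_rng(0)
fp_felt = rng.integers(0, 2, 16384).astype(bool)
fp_crushed = rng.integers(0, 2, 16384).astype(bool)

# A phrase fingerprint is the OR of its word fingerprints.
phrase_fp = fp_felt | fp_crushed

def binary_cosine(x, y):
    # Cosine similarity for binary vectors: shared bits over the geometric mean of set bits.
    overlap = np.count_nonzero(x & y)
    return overlap / np.sqrt(np.count_nonzero(x) * np.count_nonzero(y))

print(binary_cosine(phrase_fp, fp_crushed))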
I am building a program to do some text analysis.
I'm guessing that expanding an abbreviated word to its original form will improve the accuracy of my analysis.
But I have no idea how to implement it. I've googled a little but can't find any article or paper discussing this (or maybe I just don't know the right keywords to search for).
Basically what I need is: given a word W, find the word from a dictionary (a list of unabbreviated words) with the highest probability of being the unabbreviated version of W. Optionally, I want the algorithm to work for the Indonesian language.
My question is somewhat similar to this SO question: A string searching algorithm to quickly match an abbreviation in a large list of unabbreviated strings?, but that question hasn't been answered, despite being asked in 2010.
So, any idea? Thanks in advance!
Without any knowledge of Indonesian, my first step would be to obtain a list of common abbreviations, and simply do a dictionary lookup.
viz. => namely
i.e. => that is
fr. => from
Fr. => France, French
abbr. => abbreviated, abbreviation
How to decide which expansion to choose is a can of worms of its own. The examples I could quickly come up with are nice in that they are different parts of speech, so pick the adjective where an adjective fits in the sentence; but in the general case, you just have to cope with the fact that some abbreviations are genuinely ambiguous, just like there are ambiguous words. Maybe don't expand those at all, after all.
For abbreviations you don't have in the dictionary, I would simply look them up in a word list, perhaps with frequency and/or part-of-speech information, so you can pick the most likely / most popular one if there are several prefix matches. Absent that information, I would use the crude heuristic of always picking the shortest match.
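A minimal sketch of that combination (the abbreviation table, word list, and frequency counts below are invented examples, not a real resource):

# Known abbreviations take priority; otherwise fall back to prefix matching
# against a frequency-ranked word list.
KNOWN = {"viz.": "namely", "i.e.": "that is", "fr.": "from"}
WORD_LIST = [("abbreviation", 120), ("abbreviated", 95), ("abbreviate", 40)]  # (word, corpus frequency)

def expand(abbr):
    if abbr in KNOWN:
        return KNOWN[abbr]
    stem = abbr.rstrip(".").lower()
    candidates = [(w, f) for w, f in WORD_LIST if w.startswith(stem)]
    if not candidates:
        return abbr  # give up and keep the abbreviation as-is
    # Prefer the most frequent expansion; break ties with the shortest match.
    return max(candidates, key=lambda wf: (wf[1], -len(wf[0])))[0]

print(expand("abbr."))  # -> abbreviation
print(expand("viz."))   # -> namely

The same structure works for Indonesian; only the two tables need to be replaced with Indonesian resources.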
Context is everything with abbreviations. Your "highest probability" match is almost certainly going to be the one where the context of the abbreviation matches the (intended) context of the expansion.
Of course, the issue is that there are so many possible contexts, as shown by certain abbreviations having dozens of possible expansions. There is also the difficulty of trying to define the context of an abbreviation.
You might be able to get away with limiting it to, say, only 10-20 different contexts and then doing a rather rough matching. I'm fairly sure it'll have a high error rate, and it will also require a lot of work to manually add/verify the contexts.
Suppose I have the following list of words:
banana, apple, orange, tree
In this list the odd word out is tree. Can anyone give me an idea of how to write an algorithm for this?
What is it about tree that makes it the odd one out? Why not banana (since it's a herb, where the others are trees, and also because it's the only one in the list that doesn't end with 'e')? Or why not orange (since it's a colour as well as a plant, where the others are just plants)?
You need to define the criteria that you're trying to filter by: something may be obvious to a human reader, but a computer algorithm can't see that without knowing all the facts that make it obvious to a human. Or at least sufficient facts that are relevant to draw a reliable conclusion.
You're basically talking about a large knowledge-base, not a simple algorithm.
Disclaimer: this is not an easy task, so my suggested solutions will be high level and include references to academic papers that aim to solve a part of your problem:
You can try a semantic relatedness approach:
Find the relatedness between every pair of words and filter out the word that is least related to all the others.
Semantic relatedness can be computed, for example, using semantic sort in a supervised learning setting.
Another alternative is to model a semantic representation of each word: each word is represented by a vector capturing its meaning. Such a vector can be obtained, for example, from the Wikipedia articles that mention the word; more information on this approach can be found in Markovitch et al., Wikipedia-based Semantic Interpretation for Natural Language Processing. Once you represent your data as vectors, it becomes a question of finding the word that is least similar to the others. This can be done using supervised learning, or alternatively by choosing the point that is most distant from the median of all the vectors.
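As a sketch of the "most distant from the rest" idea, assuming you already have word vectors in a dict (from word2vec, GloVe, or the Wikipedia-based representation above); the tiny 3-dimensional vectors are invented purely for illustration:

import numpy as np

def odd_one_out(words, vectors):
    # vectors: dict mapping each word to a numpy array of the same dimension.
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Average each word's similarity to all the others; the lowest average is the outlier.
    avg_sim = {w: np.mean([cos(vectors[w], vectors[o]) for o in words if o != w]) for w in words}
    return min(avg_sim, key=avg_sim.get)

vecs = {
    "banana": np.array([0.9, 0.1, 0.0]),
    "apple":  np.array([0.8, 0.2, 0.1]),
    "orange": np.array([0.85, 0.15, 0.05]),
    "tree":   np.array([0.1, 0.9, 0.4]),
}
print(odd_one_out(list(vecs), vecs))  # -> tree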
One more possible solution is using WordNet.
Note that all of these methods are heuristics that I would try; they are expected to fail for some cases, but I believe they will work pretty well for most of them.
Have a look at ontologies and reasoning algorithms. If you have an ontology that models the specific area of knowledge, you have a source of information that allows you to distinguish words, e.g. by using the partial order and the relations and then checking whether the words are in the same "sub-branch" of the partial order. You might even define a metric to get a "level of closeness" or something similar.
Edit: also check out SPARQL, a language for querying such structures, and triple stores, which let you retrieve information by subject-predicate-object combinations. This matches your problem since it allows you to compare two objects in your list by a predicate.
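WordNet can serve as a ready-made ontology for the fruit example. A sketch with NLTK, taking only the first noun sense of each word (a real system would need sense disambiguation, and as noted above the answer depends on which senses you pick):

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

words = ["banana", "apple", "orange", "tree"]
synsets = {w: wn.synsets(w, pos=wn.NOUN)[0] for w in words}  # first sense only

def avg_path_similarity(w):
    # Path similarity follows the hypernym hierarchy, so words in the same
    # "sub-branch" of the ontology score higher against each other.
    others = [o for o in words if o != w]
    return sum(synsets[w].path_similarity(synsets[o]) for o in others) / len(others)

print(min(words, key=avg_path_similarity))  # the word least related to the rest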
You could try creating a database of categorized words like:
banana {food, plant, fruit, yellow}
apple {food, plant, fruit, computer, phone}
orange {food, plant, fruit, phone}
tree {plant}
And then you can see that all the words other than tree belong to the fruit category. That kind of check would be easy to code.
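For example, something along these lines (using the toy database above):

categories = {
    "banana": {"food", "plant", "fruit", "yellow"},
    "apple":  {"food", "plant", "fruit", "computer", "phone"},
    "orange": {"food", "plant", "fruit", "phone"},
    "tree":   {"plant"},
}

def odd_one_out(words):
    # The outlier is the word that shares the fewest category labels with the rest.
    def shared(w):
        return sum(len(categories[w] & categories[o]) for o in words if o != w)
    return min(words, key=shared)

print(odd_one_out(["banana", "apple", "orange", "tree"]))  # -> tree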
The biggest problem here is getting the database - I don't think you would want to create it manually, and I have no idea where to find one. It also might not always work. Imagine we add
eclair {food, phone}
to this database (phone because Android 2.1 is called Eclair). Then for the query orange, apple, banana, eclair there are two possible answers: eclair, which is not a fruit, or banana, which is not connected with mobile phones.
Can someone provide me some pointers on population initialization algorithms for genetic programming?
I already know about Grow, Full, and Ramped half-and-half (taken from "A Field Guide to Genetic Programming"), and I came across one newer algorithm, Two Fast Tree-Creation (haven't read the paper yet).
The initial population plays an important role in heuristic algorithms such as GAs, as it helps to decrease the time those algorithms need to achieve an acceptable result. Furthermore, it may influence the quality of the final answer given by evolutionary algorithms. (http://arxiv.org/pdf/1406.4518.pdf)
Since you already know Koza's different population methods, also remember that none of the algorithms used is 100% random - nor can it be, being an algorithm - so in principle you can predict what the next value will be.
Another method you could potentially use is uniform initialisation (refer to the free PDF of "A Field Guide to Genetic Programming"). The motivation is that, with the usual initialisation, whole regions of the syntax-tree search space can be lost within a few generations due to crossover and selection. Langdon (2000) came up with the idea of a ramped uniform distribution, which effectively lets the user specify the range of sizes a tree may have; if a tree generated in the search space does not fall within that range of sizes, it is automatically discarded, regardless of its fitness value. From there, the ramped uniform distribution creates an equal number of trees across the range you have specified - all of them random, unique permutations of the functions and terminal values you are using (again, refer to "A Field Guide to Genetic Programming" for more detail).
This method can be quite useful for sampling when the desired solutions are asymmetric rather than symmetric (which is what ramped half-and-half deals with).
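For concreteness, here is a bare-bones sketch of the grow/full/ramped half-and-half generators mentioned above; the function and terminal sets are toy choices of mine, and trees are plain nested lists:

import random

FUNCTIONS = {"+": 2, "*": 2}   # toy function set: name -> arity
TERMINALS = ["x", "1", "2"]    # toy terminal set

def gen_tree(max_depth, method):
    # method is 'grow' or 'full'; returns a nested-list syntax tree.
    if max_depth == 0 or (method == "grow" and random.random() < 0.5):
        return random.choice(TERMINALS)
    fn = random.choice(list(FUNCTIONS))
    return [fn] + [gen_tree(max_depth - 1, method) for _ in range(FUNCTIONS[fn])]

def ramped_half_and_half(pop_size, max_depth=6):
    # Half the trees use 'grow', half use 'full', with depth limits ramped from 2 up to max_depth.
    pop = []
    for i in range(pop_size):
        depth = 2 + i % (max_depth - 1)
        method = "grow" if i % 2 == 0 else "full"
        pop.append(gen_tree(depth, method))
    return pop

print(ramped_half_and_half(4))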
Other recommended reading for population initialisation:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.962&rep=rep1&type=pdf
I think that will depend on the problem you want to solve. For example, I'm working on a TSP, and my initial population is generated using a simple greedy technique. Sometimes you need to create only feasible solutions, so you have to build a mechanism for doing that. Usually you will find papers about your problem and how to create initial solutions for it. Hope this helps.
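For the TSP case, the greedy seeding can be as simple as a nearest-neighbour tour started from different cities (coordinates here are random, just for illustration):

import math, random

def nearest_neighbour_tour(cities, start=0):
    # Greedy TSP construction: always visit the closest unvisited city next.
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        last = cities[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, cities[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

random.seed(1)
cities = [(random.random(), random.random()) for _ in range(8)]
# One feasible, reasonably good tour per starting city gives a diverse initial population.
initial_population = [nearest_neighbour_tour(cities, start=s) for s in range(len(cities))]
print(initial_population[0])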
Do you know of any implementations or improvements of the method of matching images proposed by David Nister and Henrik Stewenius, called "Scalable Recognition with a Vocabulary Tree"? I am trying to implement it and I am having trouble understanding some parts of the algorithm (more specifically, computing the score).
Here is a good implementation of a vocabulary tree - libvot. It uses the C++11 standard multi-threading library to accelerate the build process, so it runs pretty fast.
It uses three steps to build a vocabulary tree. The first step is to build a k-means tree using SIFT descriptors. The second step is to build an image database using the vocabulary tree you built in the first step. The third step is to query an image against the image database. Some advanced techniques, such as inverted lists and the L1 distance measure, are also implemented in this repository.
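Regarding the scoring step you are stuck on: as I read the paper, once each image is reduced to a vector of visual-word counts, the score is just the L1 (or L2) distance between the IDF-weighted, normalized vectors. A sketch with made-up counts and weights for a toy 6-word vocabulary (a real implementation would exploit sparsity via the inverted lists):

import numpy as np

def vocab_tree_score(q_hist, d_hist, idf, p=1):
    # Nister-Stewenius style score: distance between IDF-weighted,
    # L_p-normalized visual-word vectors. Lower score = better match.
    q = q_hist * idf
    d = d_hist * idf
    q /= np.linalg.norm(q, ord=p)
    d /= np.linalg.norm(d, ord=p)
    return np.linalg.norm(q - d, ord=p)

idf = np.array([0.5, 1.2, 0.8, 2.0, 1.5, 0.3])
query = np.array([3, 0, 1, 2, 0, 5], dtype=float)
db_im = np.array([2, 1, 0, 2, 0, 4], dtype=float)
print(vocab_tree_score(query, db_im, idf))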
Regarding vocabulary trees, I found this thesis (http://www.tango-controls.org/Members/srubio/MasterThesis-VocabularyTree-SergiRubio-2009.pdf), which implements them in C++/Python. However, I can't find the code anywhere; I contacted the author to get it, but without success to date.
Furthermore, I found this other implementation (http://www.inf.ethz.ch/personal/fraundof/page2.html), but I was unable to get it to work.
Have you implemented it already? I would like to do the same for image recognition, but it seems like a very painful task.
Best regards.
Sergio Rubio has posted an implementation of using a vocabulary tree for image classification at http://sourceforge.net/projects/vocabularytree/. I had to rework much of the C code he posted to get it to work on my Windows system, but overall it was a very good resource for implementing the ideas presented in the original paper.
Recently I found a pretty great (though non-free) vocabulary tree implementation in C++ called DBow.
The code is well organized and has a lot of comments.
Checkout here: http://webdiis.unizar.es/~dorian/index.php?p=31
and here: http://webdiis.unizar.es/~dorian/index.php?p=32
You may want to look at space-filling curves or spatial indexes. An SFC reduces the 2D complexity to a 1D complexity, although it is just a reordering of the surface. An SFC recursively subdivides the surface into smaller tiles and preserves locality between nearby tiles; it can be compared with a quadtree. This can be useful for comparing images because you compare nearby tiles. The difficulty is then making the tiles comparable; I believe a DCT can be useful here. You may want to look for Nick's hilbert curve quadtree spatial index blog.
I believe the Pyramid Match kernel method proposed by Grauman and Darrell is generally considered to be even better. You can get a C++ library implementation here.
Who knows the most robust algorithm for a chromatic instrument tuner?
I am trying to write an instrument tuner. I have tried the following two algorithms:
FFT to create a Welch periodogram and then detect the peak frequency
A simple autocorrelation (http://en.wikipedia.org/wiki/Autocorrelation)
I encountered the following basic problems:
Accuracy 1: with the FFT, the relation between sample rate, recording length, and bin size is fixed. This means that I need to record 1-2 seconds of data to get an accuracy of a few cents, which is not exactly what I would call real-time.
Accuracy 2: autocorrelation works a bit better. To get the needed accuracy of a few cents I had to introduce linear interpolation of samples.
Robustness: in the case of a guitar I see a lot of overtones, some of which are actually stronger than the fundamental produced by the string. I could not find a robust way to select the right string being played.
Still, any cheap electronic tuner works more robustly than my implementation.
How are those tuners implemented?
You can interpolate FFTs as well, and you can often use the higher harmonics for increased precision. You need to know a little bit about the harmonics the instrument produces, and it's easier if you can assume you're less than half an octave off target; but even without that, the fundamental frequency is usually much stronger than the first subharmonic, and is not that far below the primary harmonic, so a simple heuristic should let you pick the fundamental.
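To illustrate the FFT-interpolation point: a quadratic fit around the magnitude peak refines the estimate to a small fraction of a bin, so you don't need seconds of audio for cent-level accuracy. A minimal NumPy sketch (no harmonic handling, just the interpolation):

import numpy as np

def refined_peak_frequency(samples, sample_rate):
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    k = int(np.argmax(spectrum[1:])) + 1            # skip the DC bin
    a, b, c = np.log(spectrum[k - 1:k + 2])         # log-magnitude parabolic interpolation
    offset = 0.5 * (a - c) / (a - 2 * b + c)        # peak offset in bins, within [-0.5, 0.5]
    return (k + offset) * sample_rate / len(samples)

sr = 44100
t = np.arange(int(0.1 * sr)) / sr                   # 100 ms test tone
print(refined_peak_frequency(np.sin(2 * np.pi * 440 * t), sr))  # ~440 Hz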
I doubt that the autocorrelation method will work all that robustly across instruments, but you should get a series of self-similarity scores that is highest when you are offset by one fundamental period. If you go two periods, you should get the same score again (to within noise and differential damping of the different harmonics).
There's a pretty cool algorithm called bitstream autocorrelation. It doesn't take too many CPU cycles, and it's very accurate. You basically find all the zero-crossing points, save them as a binary string, and then run autocorrelation on that string. It's fast because you can use XOR instead of floating-point multiplication.
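A much-simplified sketch of that idea (the real bitstream autocorrelation algorithm is more refined, but this shows the XOR trick on the sign bits):

import numpy as np

def bitstream_pitch(samples, sample_rate, min_freq=60.0, max_freq=1000.0):
    bits = samples > 0                               # one sign bit per sample
    best_lag, best_mismatch = None, len(bits)
    for lag in range(int(sample_rate / max_freq), int(sample_rate / min_freq)):
        # XOR counts how many sign bits disagree at this lag;
        # the true period gives the fewest disagreements.
        mismatch = np.count_nonzero(bits[:-lag] ^ bits[lag:])
        if mismatch < best_mismatch:
            best_lag, best_mismatch = lag, mismatch
    return sample_rate / best_lag

sr = 44100
t = np.arange(int(0.05 * sr)) / sr
print(bitstream_pitch(np.sin(2 * np.pi * 196 * t), sr))  # ~196 Hz (a guitar G3)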