I need to implement an algorithm for multiple extended string matching in text.
Extended means the presence of wildcards (any number of characters instead of a star), for example:
abc*def //matches abcdef, abcpppppdef etc.
Multiple means that the search is going on simultaneously for multiple string patterns (not a separate search for each pattern), for example:
abc*def
abc
whatever
some*string
QUESTION:
What is the fast algorithm that can do multiple extended string matching?
Preferably, optimized for SIMD instructions and multicore implementation. Open source implementation (C/C++/Python) would be great as well.
Thank you
I think that it might make sense to start by reading the following Wikipedia article's section: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times. You can then perform a literature review on algorithms, implementing regular expression pattern matching.
In terms of practical implementation, there is a large variety of regular expression (regex) engines in a form of libraries, focused on one or more programming languages. Most likely, the best and most popular option is the C/C++ PCRE library, with its newest version PCRE2, released in 2015. Another C++ regex library, which is quite popular at Google, is RE2. I recommend you to read this paper, along with the two other, linked within the article, for details on algorithms, implementation and benchmarks. Just recently, Google has released RE2/J - a linear time version of RE2 for Java: see this blog post for details. Finally, I ran across an interesting pure C regex library TRE, which offers way too many cool features to list here. However, you can read about them all on this page.
P.S. If the above is not enough for you, feel free to visit this Wikipedia page for details of many more regex engines/libraries and their comparison across several criteria. Hope my answer helps.
Related
I have a list of names of millions of famous people (from Wikidata), and I need to create a system that efficiently finds all people mentioned in a fairly short text: it can be just one word (eg. "Einstein") to a few pages of text (eg. a Wikipedia page).
I need the system to be fairly tolerant to spelling mistakes (eg. Mikael Jackson instead of Michael Jackson), and short forms (eg. M. Jackson). In case of ambiguity, it should return all possible people (eg. "George Bush" should return both father and son, and possibly other homonyms).
This related question has a few interesting answers, including using the Aho-Corasick algorithm. There are libraries in many languages, including in Python. However, it does not seem to support fuzzy search (ie. tolerate misspellings).
I guess I could extend the vocabulary to include all the possible spellings of each name, but that would make the vocabulary too large, so I would rather avoid that if possible (moreover, I may want to extend this solution to more than just people at one point).
I took a quick look at Lucene/ElasticSearch but it does not seem to support this use case (unless I missed it).
Any ideas?
Elasticsearch has support for fuzzy matching: See documentation here.
What Algorithm/method do I use for a Question Answering System's Question Processing?
I have been searching possible algorithms for my Question Answering System, the only thing that I think that would be possible to use is Parsing but I have asked about parsing in my last question and with the answers there i think its not possible to be used?(I'm not sure).
My idea of using Parsing is by Cutting the question into pieces word per word and then it will go through a Storage of Words that would determine what Kind of Word(noun,adjective,verb,etc) is being said. My purpose of using Parsing is to remove or rather to determine the Topic of the question.
The other idea of mine is the ChatterBot. A Chatterbot uses a query of words? Correct me if I'm not mistaken and those words are assigned to another Word. It would randomly choose a word from its Query.
Example: User's Statement: Hello > ChatterBot's Possible Replies: Hi,Hello,Hey
I'm not quite sure what is the possible method/algorithm to use in a Question Answering, I have read the Wikipedia post : http://en.wikipedia.org/wiki/Question_answering but I do not quite understand what algorithm to use in Question Processing.
Thank you.
PS: I'm developing in Javascript. Q = Question
You could use a naive bayes classifier in order to look at the questions and determine their subject. You'd need a lot of training data and a fairly narrow domain.
The sophisticated responses to this problem involve a lot of machine inference techniques which are a bit out of my skill level to explain extremely well. My idea is to use a markov network in which each word has an edge to one or two words next to it. A series of tests are applied to each word which indicate likely memberhood of that word to one of its possible meanings (For example, Mark is more likely a name if it's capitalized, but if the next word is 'a' it probably is used in the sense of a verb.) From there the machine can attempt to determine the actual meaning of the sentence, which will rely on the use of, again, unimaginably large amounts of training data.
Coursera's Probabilistic Graphical Models class (Probably their NLP class too) would probably be the best resource if you're interested in becoming skilled in this area. (PGM is the only reason I know anything about this!)
here's a great book, you may need to read to get a lot of stuff related to NLP, and Question answering systems http://www.amazon.com/Speech-Language-Processing-2nd-Edition/dp/0131873210
the book has a full section (V.Applications) that will help you a lot to develop a good system.
but note that the book is discussing theories and algorithms only (no code)
it's not about parsing text only, you'll need to understand the context to provide better answer. actually you need to extract some keywords and ignore everything else.
also you may read in topics Keywords (Bag of words), algorithms like (TF/IDF).
This topic has many thread. But also I am posting another one. All the post may be a way to do a sentiment analysis, but I found no way.
I want to implement the doing ways of sentiment analysis. So I would request to show me a way. During my research, I found that this is used anyway. I guess Bayesian algorithm is used to calculate positive words and negative words and calculate the probability of the sentence being positive or negative using bag of words.
This is only for the words, I guess we have to do language processing too. So is there anyone who has more knowledge? If yes, can you guide me with some algorithms with their links for reference so that I can implement. Anything in particular that may help me in my analysis.
Also can you prefer me language that I can work with? Some says Java is comparably time consuming so they don't recommend Java to work with.
Any type of help is much appreciated.
First of all, sentiment analysis is done on various levels, such as document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering and it's not limited to bag of words. You can find many other useful features in different applications from the tutorial I linked. What language processing you need to do depends on what features you want to use. You may need POS-tagging if POS information is needed for your features for example.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline) and these are frequently used in the literature, about which you can also find a pretty comprehensive list in the link. The Mallet toolkit contains ME and NB, and if you use SVMlight, you can easily convert the feature formats to the Mallet format with a function. Of course there are many other implementations of these classifiers.
For rule-based methods, Pointwise Mutual Information is frequently used, and some kinds of scoring-based methods, etc.
Hope this helps.
For the text analyzing there is no language stronger than SNOBOL. In SNOBOL-4 the Fortran interpretator, for example, takes only 60 lines.
NLTK offers really good Algorithm for sentiment analysis. It is open source so you can have a look at the source code and check out the algorithm used. You can even download NLTK book which is free and has some good material on sentiment analysis.
Coming to your second point I dont think Java is that slow. I am myself coding in c++ for years but lately also started with java as if you see a lot of very popular open source softwares like lucene, solr, hadoop, neo4j are all written in java.
I have tried to be disciplined about decomposing into small reusable methods when possible. As the project growing, I am re-implementing the exact same method.
I would like to know how to deal with this in an automated way. I am not looking for an IDE specific solution. Dependency on method names may not be sufficient. Unix and scripting are solutions that would be extremely beneficial. Answers such as "take care" etc. are not the solutions I am seeking.
I think the cheapest solution to implement might be to use Google Desktop. A more accurate solution would probably be much harder to implement - treat your code base as a collection of documents where the identifiers (or tokens in the identifiers) are words of the document, and then use document clustering techniques to find the closest matching code to a query. I'm aware of some research similar to that, but nothing close to out-of-the-box code that you could use. You might try looking on Google Code Search for something. I don't think they offer a desktop version, but you might be able to find some applicable code you can adapt.
Edit: And here's a list of somebody's favorite code search engines. I don't know whether any are adaptable for local use.
Edit2: Source Code Search Engine is mentioned in the comments following the list of code search engines. It appears to be a commercial product (with a free evaluation version) that is intended to be used for searching local code.
If I wanted to learn about pattern recognition in general what would be a good place to start (recommend a book)?
Also, does anybody have any experience/knowledge on how to go about applying these algorithms to find abstraction patterns in programs? (repeated code, chunks of code that do the same thing, but in slightly different ways, etc.)
Thanks
Edit: I don't mind mathematically intensive books. In fact, that would be a good thing.
If you are reasonably mathematically confident then either of Chris Bishop's books "Pattern Recognition and Machine Learning" or "Neural Networks for Pattern Recognition" are very good for learning about pattern recognition.
It helps if you have access to the parse tree generated during compilation. This way you can look for pieces of the tree which are similar, ignoring the nodes which are deeper than what you are looking at, this way you can pick out e.g. nodes which multiply together two sub-expressions, ignoring the contents of the sub-expressions. You can apply the same logic to a collection of nodes, e.g. you want to find a multiplication of two sub-expressions where those two sub-expressions are additions of more sub-expressions. You first look for multiplies, then check if the two nodes underneath the multiply are additions, ignoring anything any deeper.
I'd suggest looking at the code of some open source project (e.g. FindBugs or SIM)
that does the kind of thing you're talking about.
If you're working in one of the supported languages, IntelliJ idea has a really smart structural search and replace that would fit your problem.
Other interesting projects are PMD and Eclipse.
Eclipse uses AST (abstract syntax trees) for all source code in any project. Tools can then register for certain types of ASTs (like Java source) and get a preprocessed view where they can add additional information (like links to documentation, error markers, etc).
Another project you can look into is Duplo - it's an open-source/GPL project, so you can pore over their approach by grabbing the code from SourceForge.
This is specific to .Net and visual studio, but it finds duplicate code in your project. It does report some false positives I've found but it could be a good place to start.
Clone Detective
One kind of pattern is code that has been cloned by copy and paste methods. See CloneDR for a tool that automatically finds such code in spite of variations in layout and even changes in the body of the clone, by comparing abstract syntax trees for the language in question.
CloneDR works with a variety of langauges: C, C++, C#, Java, JavaScript, PHP, COBOL, Python, ... The website shows clone detection reports for a variety of programming languages.