Detection of patterns in unstructured data using fuzzy logic - fuzzy-search

I'm trying to extract a pattern and trim data from unstructured data. I'm using regex for this. I'm wondering if I can use fuzzy logic to do the same. I'm an amateur when it comes to fuzzy logic. Should I invest my time in it?
PS: the data to be extracted contains both numbers and words.
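For context, here is a rough sketch of the regex route I mean; the sample text, pattern, and group names are only illustrative, not taken from my actual data:

    # Pull mixed word/number fragments out of free text with a regex.
    # The text and pattern below are made-up examples.
    import re

    text = "Order 1234 shipped to J. Smith; tracking ref AB-778."
    pattern = re.compile(r"(?P<label>[A-Za-z]+)[\s-]+(?P<number>\d+)")

    for match in pattern.finditer(text):
        print(match.group("label"), match.group("number"))
    # Order 1234
    # AB 778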

Related

Is it possible to get the meaning of each word using BERT?

I'm a linguist who is new to AI, and I would like to know whether BERT is able to get the meaning of each word based on context.
I've done some searching and found that BERT can do this: if I'm not wrong, it recognizes the words and converts them into unique vectors, but that's not the output I want.
What I want is the meaning, or the components that constitute the meaning, of each word, written in plain English. Is this possible?
No, you cannot get the meaning of a word in plain English. The whole idea of BERT is to convert plain English into meaningful numerical representations.
Unfortunately, these vectors are not interpretable. This is a general limitation of deep learning compared to traditional ML models that rely on hand-crafted, interpretable features.
But note that you can use these representations to find certain relationships between words. For example, words that are close to each other (in terms of some distance measure) have similar meanings. Have a look at this link for more information.
https://opensource.googleblog.com/2013/08/learning-meaning-behind-words.html
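For illustration, here is a minimal sketch of what those contextual vectors look like in practice; it assumes the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint, which are my choices rather than anything from your setup:

    # Obtain contextual word vectors from BERT and compare them.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def word_vectors(sentence):
        # Each (sub-word) token gets a 768-dimensional vector.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        return dict(zip(tokens, outputs.last_hidden_state[0]))

    # "bank" gets a different vector in each context, but neither vector
    # is a human-readable definition; all you can do is compare them.
    river = word_vectors("He sat on the bank of the river.")
    money = word_vectors("She deposited cash at the bank.")
    sim = torch.cosine_similarity(river["bank"], money["bank"], dim=0)
    print(f"cosine similarity of the two 'bank' vectors: {sim.item():.3f}")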

Are there any approaches/suggestions for classifying a keyword so the search space will be reduced in Elasticsearch?

I was wondering whether there is any way to classify a single word before running a search on Elasticsearch. Let's say I have 4 indexes, each holding a few million documents about a specific category.
I'd like to avoid searching the whole search space each time.
This problem is made more challenging by the fact that the input is not a sentence: the search query usually consists of only one or two words, so the usual NLP tricks (named-entity recognition, POS tagging, etc.) can't be applied.
I have read a few questions on Stack Overflow, such as:
Semantic search with NLP and elasticsearch
Identifying a person's name vs. a dictionary word
and a few more, but couldn't find a workable approach. Are there any suggestions I should try?

How should I index and search on hyphenated words in English?

I'm using Elasticsearch to search over a fairly broad range of documents, and I'm having trouble finding best practices for dealing with hyphenated words.
In my data, words frequently appear either hyphenated or as compound words, e.g. pre-eclampsia and preeclampsia. At the moment, searching for one won't find the other (the standard tokenizer indexes the hyphenated version as pre eclampsia).
This specific case could easily be fixed by stripping hyphens in a character filter. But often I do want to tokenize on hyphens: searches for jean claude and happy go lucky should match jean-claude and happy-go-lucky.
One approach to solving this is in the application layer, by essentially transforming any query for hyphenated-word into hyphenated-word OR hyphenatedword. But is there any way of dealing with all these use cases within the search engine, e.g. with some analyzer configuration? (Assume that my data is large and varied enough that I can't manually create exhaustive synonym files.)
You can use a compound word token filter; hyphenation_decompounder should probably work decently enough.
It seems like your index consists of many domain-specific words that aren't necessarily in a regular English dictionary, so I'd spend some time first creating my own dictionary with the words that are important to your domain. This can be based on domain-specific literature, taxonomies, etc. The dictionary_decompounder is suitable for that.
This assumes that your question is about Elasticsearch and not Solr, where the filter is named DictionaryCompoundWordTokenFilter instead.
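As a rough illustration of the dictionary_decompounder route (the index name, field name, and word list below are placeholders, and it assumes an Elasticsearch node on localhost:9200):

    # Create an index whose analyzer splits compounds into known sub-words,
    # so "preeclampsia" is also indexed as "pre" + "eclampsia" and matches
    # queries for "pre-eclampsia".
    import requests

    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "domain_decompounder": {
                        "type": "dictionary_decompounder",
                        # In practice this list comes from your domain dictionary.
                        "word_list": ["pre", "eclampsia", "jean", "claude"],
                    }
                },
                "analyzer": {
                    "decompound_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "domain_decompounder"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "body": {"type": "text", "analyzer": "decompound_analyzer"}
            }
        },
    }

    resp = requests.put("http://localhost:9200/documents", json=settings)
    print(resp.json())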

Sample data or corpora for testing text processing functions?

I'm wondering if there are online sample texts that can be used for testing algorithms. For example, I'm whipping up a simple tokenization function and want to make sure it works for special cases like mid-word punctuation characters ("don't", "O'Brien"), dashes (for my purposes, "Sacksville-Bagginses" should be a single token), international characters, etc.
Similarly, when whipping up other algorithms it would be nice to have documents at hand that are ideal for testing them, instead of having to either write my own or hunt for good sample texts on Project Gutenberg.
Also useful would be text that could be used for testing things like spelling & grammar tools, etc.
There are a bunch of text corpora listed in this Wikipedia entry. Also, there are some good pointers in the NLTK corpora list. And you might want to check out the Google ngram datasets.
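For example, the NLTK corpora are easy to pull down for throwaway tests; the regex tokenizer below is only a stand-in for whatever function is actually under test (it assumes the nltk package is installed):

    # Grab some real prose from NLTK's bundled Gutenberg corpus and run a
    # home-grown tokenizer over it.
    import re
    import nltk

    nltk.download("gutenberg", quiet=True)
    from nltk.corpus import gutenberg

    # Naive tokenizer under test: keep internal apostrophes and hyphens
    # ("don't", "O'Brien", "happy-go-lucky" stay single tokens).
    TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+")

    def tokenize(text):
        return TOKEN_RE.findall(text)

    sample = gutenberg.raw("austen-emma.txt")[:2000]
    print(tokenize(sample)[:25])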

Would my approach to fuzzy search, for my dataset, be better than using Lucene?

I want to implement a fuzzy search facility in the web app I'm currently working on. The back-end is in Java, and it just so happens that the search engine that everyone recommends on here, Lucene, is coded in Java as well. I, however, am shying away from using it for several reasons:
I would feel accomplished building something of my own.
Lucene has a plethora of features that I don't see myself utilizing; I'd like to minimize bloat.
From what I understand, Lucene's fuzzy search implementation manually evaluates the edit distance to each indexed term. I feel the approach I want to take (detailed below) would be more efficient.
The data to be indexed could potentially be the entire set of nouns and pronouns in the English language, so you can see how Lucene's approach to fuzzy search makes me wary.
What I want to do is take an n-gram based approach to the problem: read and tokenize each item from the database and save them to disk in files named by a given n-gram and its location.
For example: let's assume n = 3 and my file-naming scheme is something like: [n-gram]_[location_of_n-gram_in_string].txt.
The file bea_0.txt would contain:
bear
beau
beacon
beautiful
beats by dre
When I receive a term to be searched, I can simply tokenize it into n-grams, and use them along with their corresponding locations to read in the corresponding n-gram files (if present). I can then perform any filtering operations (eliminating terms not within a given length range, performing edit-distance calculations, etc.) on this smaller set of data instead of on the entire dataset.
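A minimal in-memory sketch of what I have in mind (a dict keyed by (n-gram, position) stands in for the per-n-gram files, and difflib's similarity ratio stands in for a real edit-distance check):

    from collections import defaultdict
    from difflib import SequenceMatcher

    N = 3
    index = defaultdict(set)   # (n-gram, position) -> terms containing it

    def ngrams(term):
        return [(term[i:i + N], i) for i in range(len(term) - N + 1)]

    def add_term(term):
        for gram, pos in ngrams(term.lower()):
            index[(gram, pos)].add(term)

    def fuzzy_search(query, min_ratio=0.7):
        query = query.lower()
        candidates = set()
        for gram, pos in ngrams(query):
            candidates |= index.get((gram, pos), set())
        # Cheap length filter first, then a similarity score, applied to
        # the (much smaller) candidate set instead of the whole dataset.
        return sorted(
            t for t in candidates
            if abs(len(t) - len(query)) <= 2
            and SequenceMatcher(None, query, t.lower()).ratio() >= min_ratio
        )

    for term in ["bear", "beau", "beacon", "beautiful", "beats by dre"]:
        add_term(term)

    print(fuzzy_search("beaar"))   # -> ['bear']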
My question is... well I guess I have a couple of questions.
Have there been any improvements in Lucene's fuzzy search that I'm not aware of that would make my approach unnecessary?
Is this a good approach to implement fuzzy-search, (considering the set of data I'm dealing with), or is there something I'm oversimplifying/missing?
Lucene 3.x's fuzzy query used to evaluate the Levenshtein distance between the queried term and every indexed term (a brute-force approach). Given that this approach is rather inefficient, the Lucene spellchecker used to rely on something similar to what you describe: Lucene would first search for terms with n-grams similar to those of the queried term, and would then score these terms according to a string distance (such as Levenshtein or Jaro-Winkler).
However, this has changed a lot in Lucene 4.0 (an alpha preview was released a few days ago): FuzzyQuery now uses a Levenshtein automaton to efficiently intersect the terms dictionary. This is so much faster that there is now a new direct spellchecker that doesn't require a dedicated index and directly intersects the terms dictionary with an automaton, similarly to FuzzyQuery.
For the record, since you are dealing with an English corpus, Lucene (or Solr, though I guess you could use them in vanilla Lucene) has some phonetic analyzers that might be useful (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone).
Lucene 4.0 alpha was just released; many things are easier to customize now, so you could also build on it and create a custom fuzzy search.
In any case, Lucene has many years of performance improvements behind it, so you would be hard pressed to achieve the same performance. Of course, it might be good enough for your case...
