Text preprocessing for fitting Tokenizer model - huggingface-transformers

I have read that when preprocessing text it is best practice to remove stop words, special characters, and punctuation, so that only a list of words remains. My question is: if the text I want to fit my tokenizer on contains a lot of statistics (and hence many characters such as %, =, and /) alongside ordinary prose, does it make sense to keep special characters and numbers as input to the tokenizer model? Or should they be removed in any case because a tokenizer can only understand words? Thanks a lot in advance.

Related

Approximate text matching

I need to compare two pieces of text, say 200 words long. As these were obtained by OCR, discrepancies can arise at two levels:
words can be misspelled,
whole words can be missing or merged, or extra parasitic chunks inserted (in extreme cases, groups of words could be swapped).
The output of the recognition would be a similarity score. I don't think that matching the whole text as a long string can be efficient enough.
Are you aware of methods that specifically address this problem (a two-level Levenshtein distance, perhaps)? Are there libraries available?
(I am not looking for an OCR package.)

How to extract features from plain text?

I am writing a text parser which should extract features from product descriptions.
Eg:
text = "Canon EOS 7D Mark II Digital SLR Camera with 18-135mm IS STM Lens"
features = extract(text)
print(features)
Brand: Canon
Model: EOS 7D
....
The way I do this is by training the system with structured data and coming up with an inverted index which can map a term to a feature. This works mostly well.
When the text contains measurements like 50ml, or 2kg, the inverted index will say 2kg -> Size and 50ml -> Size for eg.
The problem here is that when I get a value I haven't seen before, like 13ml, it won't be processed. But since the pattern matches a size, we could tag it as Size.
I was thinking of solving this problem by preprocessing the tokens I get from the text and looking for patterns that I know. So when new patterns are identified, they have to be added to the preprocessing.
I was wondering, is this the best way to go about this? Or is there a better way of doing this?
The age-old problem of unseen cases. You could train your scraper to grab any number-like characters preceding certain suffixes (ml, kg, etc.) and treat those as size. The problem with this is that typos and other poorly formatted text could enter your structured data. There is no right answer for how to handle values you haven't seen before: you'll either have to QC them individually or have rules around them. This is dependent on your dataset.
As far as identifying patterns, you'll either have to manually enter them, or manually classify a lot of records and let the algorithm learn them. Not sure that's very helpful, but a lot of this is very dependent on your data.
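The suffix rule described above can be sketched with a regular expression. This is only a minimal illustration: the unit table `UNIT_LABELS` and the function name `tag_measurements` are hypothetical, and the set of suffixes would have to come from your own data.

```python
import re

# Hypothetical unit table; extend it with the suffixes that occur in your data.
UNIT_LABELS = {"ml": "Size", "kg": "Size", "mm": "Size", "l": "Size", "g": "Size"}

# Longest suffixes first so "ml" is tried before "l".
_units = "|".join(sorted(map(re.escape, UNIT_LABELS), key=len, reverse=True))

# A number (optionally a decimal or a range such as 18-135), optional
# whitespace, then one of the known unit suffixes as a whole token.
MEASUREMENT = re.compile(
    r"\b(\d+(?:[.,]\d+)?(?:-\d+(?:[.,]\d+)?)?)\s?(" + _units + r")\b",
    re.IGNORECASE,
)

def tag_measurements(text):
    """Return (value, unit, label) for every number followed by a known unit."""
    return [
        (m.group(1), m.group(2).lower(), UNIT_LABELS[m.group(2).lower()])
        for m in MEASUREMENT.finditer(text)
    ]
```

Because the pattern requires a word boundary after the unit, a typo such as "13mll" is not tagged, which is the kind of case that would still need a manual check or an extra rule.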
If you have training data like this:
word label
10ml size-volume
20kg size-weight
etc...
you could train a classifier based on character n-grams, and it would detect that ml is size-volume even if it sees 11-ml or ml11, etc. You should also convert the numbers into a single number (e.g. 0) so that 11-ml is seen as 0-ml before feature extraction.
For that you'll need a preprocessing module and also a large training sample. For feature extraction you can use scikit-learn's character n-grams, with an SVM as the classifier.
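The preprocessing step mentioned above might look like the following sketch. It spells out the digit normalization and the character n-gram counting by hand in plain Python, purely to show what the features look like; in practice a library vectorizer with a character analyzer would produce them, and the classifier itself is not shown. The function names and the n-gram range of 1 to 3 are assumptions.

```python
import re
from collections import Counter

def normalize_digits(token):
    """Collapse every run of digits to a single '0' so 11-ml and 250-ml look alike."""
    return re.sub(r"\d+", "0", token)

def char_ngrams(token, n_min=1, n_max=3):
    """Count character n-grams of the lowercased, digit-normalized token."""
    text = normalize_digits(token.lower())
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams
```

With this normalization, "10ml" and "250ML" yield identical feature counts, so a value that was never seen in training still shares the "ml" n-gram with the training examples.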

elasticsearch - breaking english compound words?

I'm looking for a filter in Elasticsearch that will let me break English compound words into their constituent parts, so that, for a term like eyewitness, the queries eye witness and eyewitness would both match eyewitness. I noticed the compound word filter, but this requires explicitly defining a word list, which I couldn't possibly come up with on my own.
First, you need to ask yourself if you really need to break the compound words. Consider a simpler approach like using "edge n-grams" to hit in the leading or trailing edges. It would have the side effect of loosely hitting on fragments like "ey", but maybe that would be acceptable for your situation.
If you do need to break the compounds, and want to explicitly index the word fragments, then you'll need to get a word list. You can download a list of English words; one example is here. The dictionary word list is used to know which fragments of the compound words are actually words themselves. This will add overhead to your indexing, so be sure to test it. An example showing the usage is here.
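To make the role of the word list concrete, here is a small plain-Python sketch of dictionary-based decompounding. It is not the Elasticsearch filter itself, only an illustration of the idea; `split_compound` and the `min_len` threshold are hypothetical.

```python
def split_compound(word, dictionary, min_len=3):
    """Return every way to split `word` into two or more dictionary words."""
    word = word.lower()
    results = []

    def walk(start, parts):
        # Reached the end of the word: keep the split if it has several parts.
        if start == len(word):
            if len(parts) > 1:
                results.append(parts)
            return
        # Try every fragment of at least `min_len` characters from `start`.
        for end in range(start + min_len, len(word) + 1):
            piece = word[start:end]
            if piece in dictionary:
                walk(end, parts + [piece])

    walk(0, [])
    return results
```

With a dictionary containing "eye" and "witness", the word "eyewitness" splits into those two fragments, which is why the quality of the word list determines which compounds can be broken.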
If your text is German, consider https://github.com/jprante/elasticsearch-analysis-decompound

Algorithm to search for a list of words in a text

I have a fairly small list of words, about 1000 or so. I want to check whether any of the words in that list occur in an input text and, if so, which ones. The input texts are a few hundred words each; they are paragraphs from the web, so there are a lot of them from different sites. I am trying to find the best algorithm for this.
I can see two obvious ways to do this:
1. A brute-force search for each word from the list in the text.
2. Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using python though I am not sure if that changes the algorithm anyway.
Also, as an optimization to solution 2 above, I would like to store the generated hash table in persistent storage (a DB) so that if the list of words changes I can reuse the hash table without having to create it again. Of course, if the input text changes I have to regenerate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store JSON documents in it. I am new to MongoDB, have only just started working with it, and do not yet understand its full potential.
I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
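Generating the regex from a word list, as described above, could be sketched like this in Python. The helper names are hypothetical, and the `whole_words` flag is an added assumption: when set, the alternation is wrapped in `\b` word-boundary anchors so only complete words match.

```python
import re

def build_word_regex(words, whole_words=False):
    """Compile one alternation pattern that matches any of `words`."""
    # Escape each word so characters like '.' or '+' are matched literally,
    # and put longer words first so they are preferred over their prefixes.
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    body = "|".join(re.escape(w) for w in ordered)
    if whole_words:
        body = r"\b(?:" + body + r")\b"
    return re.compile(body)

def find_words(text, words, whole_words=False):
    """Return the matched words in the order they appear in `text`."""
    regex = build_word_regex(words, whole_words)
    return [m.group(0) for m in regex.finditer(text)]
```

For a list of about 1000 words the compiled pattern can be built once and reused across all input texts.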
There is a difference, though, between the results from a regex and the results from the Aho-Corasick algorithm. For example, suppose you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." A regex search reports a single match at that position: "dogma" with a leftmost-longest engine, or whichever alternative is listed first with a backtracking engine. An Aho-Corasick implementation reports both "dog" and "dogma" at the same position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
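The state machine described above can be sketched in plain Python as follows. This is a compact teaching version of Aho-Corasick, not a tuned implementation; the function names and the list-of-dicts representation are choices made for this sketch. It reports every occurrence, including overlapping ones and matches inside longer words.

```python
from collections import deque

def build_automaton(words):
    """Build the trie, failure links and output sets for the given words."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state for the longest proper suffix
    out = [set()]    # out[state] -> words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)

    # Breadth-first pass, so a state's failure target is finished before it.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Yield (start_index, word) for every occurrence, including overlaps."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in sorted(out[state]):
            yield i - len(word) + 1, word
```

Run on "My karma ate your dogma." with the words "dog" and "dogma", this yields both words starting at the same index. Restricting it to whole words would mean checking the characters before and after each reported match.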
Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to match whole words only. Typically, that's done with the \b word-boundary anchor, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:
hashTable = Build hash table from target words
for each word in input text
    if word in hashTable then
        output word
Or, if you want a list of matching words that are in the input text:
hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
    if word in hashTable then
        add word to foundWords
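In Python, the pseudocode above maps almost directly onto the built-in `set` type, which is hash-based. The following is a minimal sketch; the tokenizing pattern and the case-insensitive comparison are assumptions that you may want to change for your texts.

```python
import re

def words_in_text(text, target_words):
    """Return the set of target words that occur in `text` (case-insensitive)."""
    targets = {w.lower() for w in target_words}   # plays the role of hashTable
    found = set()                                 # plays the role of foundWords
    # Break the input into word-like tokens and test each one for membership.
    for token in re.findall(r"[A-Za-z0-9']+", text.lower()):
        if token in targets:
            found.add(token)
    return found
```

Note that this version hashes the target words rather than the input text, so a change to the word list only requires rebuilding a small set.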

How to recognize words in text with non-word tokens?

I am currently parsing a bunch of mails and want to get words and other interesting tokens out of mails (even with spelling errors or combination of characters and letters, like "zebra21" or "customer242"). But how can I know that "0013lCnUieIquYjSuIA" and "anr5Brru2lLngOiEAVk1BTjN" are not words and not relevant? How to extract words and discard tokens that are encoding errors or parts of pgp signature or whatever else we get in mails and know that we will never be interested in those?
You need to decide on a good enough criterion for what counts as a word and write a regular expression or a manual check to enforce it.
A few rules that can be extrapolated from your examples:
Words can start with a capital letter or be all capital letters, but if a token has more than, say, 2 uppercase letters and more than 2 lowercase letters, it's not a word.
If a token has numbers inside it, it's not a word.
If a token is longer than, say, 20 characters, it's not a word.
There's no magic trick. You need to decide what you want the rules to be and make them happen.
An alternative is to train some kind of hidden Markov model to recognize things that sound like words, but I think this is overkill for what you want to do.
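As a rough illustration, the three rules listed above could be written as a single check. The function name and the default thresholds are placeholders taken from the "say, 2" and "say, 20" figures, not tested values.

```python
def looks_like_word(token, max_len=20, max_mixed=2):
    """Apply the three rules above; the thresholds are illustrative defaults."""
    # Rule: overly long tokens are not words.
    if len(token) > max_len:
        return False
    # Rule: tokens containing digits are not words.
    if any(ch.isdigit() for ch in token):
        return False
    # Rule: many uppercase AND many lowercase letters suggests mixed-case noise.
    upper = sum(ch.isupper() for ch in token)
    lower = sum(ch.islower() for ch in token)
    if upper > max_mixed and lower > max_mixed:
        return False
    return True
```

Taken literally, the digit rule also rejects tokens such as "zebra21", so it would need to be relaxed (for example by stripping trailing digits first) if those tokens are wanted.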
http://en.wikipedia.org/wiki/English_words_with_uncommon_properties
You can make rules that reject anything with these 'uncommon properties' to build a system that accepts most actual words.
Although I generally agree with shoosh's answer, his approach makes it easy to achieve high recall but also low precision, i.e. you would get almost all real words but also a lot of non-words. If your definition of a word is too restrictive, it's the other way around, but that's also not what you want, since you would then miss cases like 'zebra123'. So here are a few ideas about how to improve precision:
It may be worthwhile to think about whether you can determine which parts of an email belong to the main text and which are footers such as PGP signatures. I'm sure it's possible to find some simple heuristics that match most cases, e.g. cut off everything below a line that consists only of '-' characters.
Depending on your performance criteria, you may want to check whether a token is a real word or contains a real word by matching against a simple word list. It's easy to find quite exhaustive lists of English words on the web, and you could also compile one yourself by extracting words from a large and clean text corpus.
Using a lexical analyser, you could filter every token which is marked as unknown.
Some simple statistics may tell you how likely it is that something is a word. Tokens which occur with high frequency most probably are words. Tokens which appear only once or whose number is below a certain threshold very probably are not words. Common spelling errors should appear more than once and uncommon ones may be ignored.
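Two of the ideas above, cutting off footers and using token frequency, are simple enough to sketch. The helper names and the `min_count` threshold are hypothetical, and the footer rule is exactly the dash-only line heuristic mentioned above, which will not catch every signature format.

```python
import re
from collections import Counter

def strip_footer(body):
    """Drop everything from the first line made up only of '-' characters."""
    kept = []
    for line in body.splitlines():
        # One or more dashes, optionally followed by trailing whitespace.
        if re.fullmatch(r"-+\s*", line):
            break
        kept.append(line)
    return "\n".join(kept)

def frequent_tokens(texts, min_count=2):
    """Return tokens that appear at least `min_count` times across `texts`."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok for tok, n in counts.items() if n >= min_count}
```

Tokens that survive the frequency threshold are likely to be real words or recurring identifiers, while one-off strings from signatures or encoding errors tend to fall below it.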
Some of these suggestions clearly don't work for cases like 'zebra123'. Again, simply cutting off, or splitting on, in-word numbers may do the trick.
My general approach would be to first identify tokens which certainly are words (using the suggestions above), then identify tokens which certainly are not words (using a regular expression), and then look (with your eyes) at the few hundred or thousand remaining tokens to find common characteristics to handle these separately.
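The three-stage approach just described can be sketched as a simple triage function. Everything here is illustrative: the `NOISE` pattern (15 or more alphanumeric characters mixing letters and digits) and the `known_words` set are assumptions that would need tuning against real mail.

```python
import re

# Illustrative pattern only: long alphanumeric tokens that mix letters and digits.
NOISE = re.compile(r"(?=.*\d)(?=.*[A-Za-z])[A-Za-z0-9]{15,}")

def triage(tokens, known_words):
    """Split tokens into certain words, certain non-words and a review pile."""
    words, non_words, review = [], [], []
    for token in tokens:
        # Strip in-word digits so 'zebra123' is looked up as 'zebra'.
        stem = re.sub(r"\d+", "", token).lower()
        if stem and stem in known_words:
            words.append(token)
        elif NOISE.fullmatch(token):
            non_words.append(token)
        else:
            review.append(token)
    return words, non_words, review
```

The third list is the part meant for manual inspection; patterns spotted there can be turned into further rules.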
