Can Stanford CoreNLP lemmatise a word given a custom POS? - stanford-nlp

I would like to lemmatise a given word multiple times, with different POS supplied.
For example, the lemma of "met" is "meet" (POS: verb), while the lemma of "meeting" is "meeting" (POS: noun).
But if "meeting" is a verb, the lemma is "meet". I would like then to lemmatise "meeting" with a given verb POS, in an effort to find such similarities.
Is this possible?
Using latest Java CoreNLP 3.9.2

Try the method String lemma(String word, String tag) in edu.stanford.nlp.process.Morphology.
import edu.stanford.nlp.process.Morphology;

Morphology morphology = new Morphology();
String word = "meeting";
String tag = "VB"; // lemmatise "meeting" as a verb
String lemma = morphology.lemma(word, tag);
System.out.println(String.format("%s_%s %s", word, tag, lemma));
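To lemmatise the same word under several POS hypotheses, as the question describes, just call lemma once per tag. A minimal sketch (the tags are standard Penn Treebank tags; the expected lemmas follow the question's own examples):

import edu.stanford.nlp.process.Morphology;

public class LemmaDemo {
    public static void main(String[] args) {
        Morphology morphology = new Morphology();
        // Lemmatise "meeting" once as a noun and once as a gerund/verb.
        for (String tag : new String[] {"NN", "VBG"}) {
            // Expect: meeting_NN -> meeting, meeting_VBG -> meet
            System.out.println("meeting_" + tag + " -> " + morphology.lemma("meeting", tag));
        }
    }
}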

Related

Is there a way to remove ALL special characters using Lucene filters?

Standard Analyzer removes special characters, but not all of them (e.g. '-'). I want to index my string with only alphanumeric characters, but referring to the original document.
Example: 'doc-size type' should be indexed as 'docsize' and 'type' and both should point to the original document: 'doc-size type'
It depends what you mean by "special characters", and what other requirements you may have. But the following may give you what you need, or point you in the right direction.
The following examples all assume Lucene version 8.4.1.
Basic Example
Starting with the very specific example you gave, where doc-size type should be indexed as docsize and type, here is a custom analyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
import java.util.regex.Pattern;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new WhitespaceTokenizer();
        TokenStream tokenStream = source;
        Pattern p = Pattern.compile("\\-");
        boolean replaceAll = true; // replace every occurrence, not just the first
        tokenStream = new PatternReplaceFilter(tokenStream, p, "", replaceAll);
        return new TokenStreamComponents(source, tokenStream);
    }
}
This splits on whitespace, and then removes hyphens, using a PatternReplaceFilter. It works as shown below (I use 「 and 」 as delimiters to show where whitespace may be part of the inputs/outputs):
Input text:
「doc-size type」
Output tokens:
「docsize」
「type」
NOTE - this removes only standard keyboard hyphens (U+002D), not em dashes, en dashes, and so on. It removes those standard hyphens regardless of where they appear in the text (word starts, word ends, on their own, etc.).
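If you want to reproduce these outputs, here is a minimal sketch of a token-printing helper (standard Lucene 8.x token inspection; the field name "myField" is an arbitrary placeholder, and MyAnalyzer is the class above):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {
    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new MyAnalyzer();
             TokenStream stream = analyzer.tokenStream("myField", "doc-size type")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // required before the first incrementToken() call
            while (stream.incrementToken()) {
                System.out.println("「" + term + "」");
            }
            stream.end();
        }
    }
}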
A Set of Punctuation Marks
You can change the pattern to cover more punctuation, as needed - for example:
Pattern p = Pattern.compile("[$^-]");
This does the following:
Input text:
「doc-size type $foo^bar」
Output tokens:
「docsize」
「type」
「foobar」
Everything Which is Not a Character or Digit
You can use the following to remove everything which is not a character or digit:
Pattern p = Pattern.compile("[^A-Za-z0-9]");
This does the following:
Input text:
「doc-size 123 %^&*{} type $foo^bar」
Output tokens:
「docsize」
「123」
「」
「type」
「foobar」
Note that this produces one empty-string token in the output (from the run of symbols %^&*{}, which is removed entirely).
WARNING: Whether the above will work for you depends very much on your specific, detailed requirements. For example, you may need to perform extra transformations to handle upper/lowercase differences - i.e. the usual things which typically need to be considered when indexing text.
Note on the Standard Analyzer
The StandardAnalyzer actually does remove hyphens in words (with some obscure exceptions). In your question you mentioned that it does not remove them. The standard analyzer uses the standard tokenizer, and the standard tokenizer implements the word-break rules from the Unicode Text Segmentation algorithm (UAX #29), which has a section discussing how hyphens in words are handled.
So, the Standard analyzer will do this:
Input text:
「doc-size type」
Output tokens:
「doc」
「size」
「type」
That should work with searches for doc as well as doc-size - it's just a question of whether it works well enough for your needs.
I understand that may not be what you want. But if you can avoid needing to build a custom analyzer, life will probably be much simpler.
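If you want to check this yourself, swap new MyAnalyzer() for new StandardAnalyzer() (from org.apache.lucene.analysis.standard) in the token-printing sketch above; with the 8.4.1 jars it should print 「doc」, 「size」 and 「type」.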

How to reverse tokenization after running tokens through name finder?

After using NameFinderME to find the names in a series of tokens, I would like to reverse the tokenization and reconstruct the original text with the names that have been modified. Is there a way I can reverse the tokenization operation in the exact way in which it was performed, so that the output is the exact structure as the input?
Example
Hello my name is John. This is another sentence.
Find sentences
Hello my name is John.
This is another sentence.
Tokenize sentences.
> Hello
> my
> name
> is
> John.
>
> This
> is
> another
> sentence.
My code that analyzes the tokens above looks something like this so far.
TokenNameFinderModel model3 = new TokenNameFinderModel(modelIn3);
NameFinderME nameFinder = new NameFinderME(model3);
List<Span[]> spans = new List<Span[]>();
foreach (string sentence in sentences)
{
String[] tokens = tokenizer.tokenize(sentence);
Span[] nameSpans = nameFinder.find(tokens);
string[] namedEntities = Span.spansToStrings(nameSpans, tokens);
//I want to modify each of the named entities found
//foreach(string s in namedEntities) { modifystring(s) };
spans.Add(nameSpans);
}
Desired output, perhaps masking the names that were found.
Hello my name is XXXX. This is another sentence.
In the documentation, there is a link to the post below describing how to use the detokenizer. I don't understand how the operations array relates to the original tokenization (if at all):
https://issues.apache.org/jira/browse/OPENNLP-216
Create an instance of SimpleTokenizer:
String sentence = "He said \"This is a test\".";
SimpleTokenizer instance = SimpleTokenizer.INSTANCE;
Tokenize the sentence using the tokenize(String str) method from SimpleTokenizer:
String[] tokens = instance.tokenize(sentence);
The operations array must contain one operation name per token; that is, the two arrays must have the same length.
Store the operation name N times (tokens.length times) in the operations array:
Operation[] operations = new Operation[tokens.length];
// Valid Operation names are MERGE_TO_LEFT, MERGE_TO_RIGHT, MERGE_BOTH and RIGHT_LEFT_MATCHING.
String oper = "MERGE_TO_RIGHT";
for (int i = 0; i < tokens.length; i++) {
    operations[i] = Operation.parse(oper);
}
System.out.println(operations.length); // sanity check: equals tokens.length
Now create an instance of DetokenizationDictionary by passing the tokens and operations arrays to the constructor:
DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);
Pass the DetokenizationDictionary instance to the DictionaryDetokenizer class to detokenize the tokens:
DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);
DictionaryDetokenizer.detokenize takes two parameters: (a) the tokens array and (b) a split marker.
String st = dictDetokenize.detokenize(tokens, " ");
The detokenized string is returned in st.
Alternatively, use the Detokenizer interface directly (detokenizer being any Detokenizer instance, such as the DictionaryDetokenizer built above):
String text = detokenizer.detokenize(tokens, null);
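If the goal is simply to mask names in the original text, an alternative worth considering is to avoid detokenization entirely: tokenize with character offsets via tokenizePos, then splice the replacements into the original string. A minimal sketch in Java (en-token.bin and en-ner-person.bin are the standard OpenNLP model file names; adjust the paths to your setup):

import java.io.FileInputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

public class MaskNames {
    public static void main(String[] args) throws Exception {
        TokenizerME tokenizer =
            new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")));
        NameFinderME nameFinder =
            new NameFinderME(new TokenNameFinderModel(new FileInputStream("en-ner-person.bin")));

        String sentence = "Hello my name is John.";
        Span[] tokenSpans = tokenizer.tokenizePos(sentence); // character offsets into 'sentence'
        String[] tokens = Span.spansToStrings(tokenSpans, sentence);
        Span[] nameSpans = nameFinder.find(tokens);          // token-index spans

        StringBuilder masked = new StringBuilder(sentence);
        // Replace right-to-left so earlier character offsets stay valid.
        for (int i = nameSpans.length - 1; i >= 0; i--) {
            int start = tokenSpans[nameSpans[i].getStart()].getStart();
            int end = tokenSpans[nameSpans[i].getEnd() - 1].getEnd();
            masked.replace(start, end, "XXXX");
        }
        System.out.println(masked); // expected: Hello my name is XXXX.
    }
}

Because the edits are applied to the original string, the surrounding whitespace and punctuation come out exactly as they went in, which is what the question asks for.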

Stanford POS Tagger to return more than one tag

I am implementing POS tagging with the Stanford POS Tagger. However, some words can have multiple tags. For example, the word "heat" can be a noun or a verb, yet the POS tagger returns only one value for the current sentence: NOUN. Is it possible to get all possible POS tags from the Stanford POS Tagger, so that for the word "heat" I get both NOUN and VERB?
The POS tagger is designed to assign each word the single most likely tag given the context of the sentence, so by design it returns one tag per token.
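You can see this context sensitivity directly by tagging the same word in two different sentences. A minimal sketch, assuming the stanford-corenlp models jar (which contains the standard english-left3words tagger model at the path below) is on the classpath:

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class HeatDemo {
    public static void main(String[] args) {
        MaxentTagger tagger = new MaxentTagger(
            "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
        System.out.println(tagger.tagString("The heat is intense.")); // expect heat_NN (noun)
        System.out.println(tagger.tagString("They heat the room."));  // expect heat_VBP (verb)
    }
}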

Search by start string with accent

I'm trying to strip accents when matching, in much the same way that downcase handles case differences.
Currently I search for all results starting with a string, like this:
r.Table("places").Filter(func(customer r.Term) interface{}{
return customer.Field("Name").Downcase().Match("^" + strings.ToLower(value))
})
but it does not match words that contain accents.
Example: with the search word "yes", it will find:
"yes", "yesy", "yessss"
but not
"yés"
What is the best way to remove accents in the query so that those are found too?
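As far as I know, ReQL has no built-in accent folding, so a common approach is to store an accent-stripped copy of the field at write time and run the prefix match against that. The standard technique is Unicode NFD decomposition followed by removal of combining marks; here is a minimal sketch of that normalization step (shown in Java for consistency with the other answers here; the same idea ports to Go via golang.org/x/text):

import java.text.Normalizer;

public class StripAccents {
    // Decompose accented characters (NFD), then drop the combining marks (\p{M}).
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("yés")); // prints: yes
    }
}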

String parsing for spanish language detection in Ruby

I'm in a situation where I'm given a character string and need to determine whether its language is Spanish or English. I plan on parsing for stop words: Spanish ('de', 'es', 'si', 'y') vs English ('of', 'is', 'if', 'and'). If there are more Spanish occurrences than English occurrences, then I conclude the text is Spanish.
Are there any Ruby snippets already available to do this? If not, what would be good method for string parsing or regex to do this?
If you have a string that contains a sentence (or a series of words, at least), you can use string.split(' ') to split the string into an array of words. From there, you can use .each to iterate through the list and process each word. For example:
# Minimal stop-word helpers, using the word lists from the question.
def looks_like_english(word)
  ['of', 'is', 'if', 'and'].include?(word.downcase)
end

def looks_like_spanish(word)
  ['de', 'es', 'si', 'y'].include?(word.downcase)
end

def detect_language(sentence)
  english_count = 0
  spanish_count = 0
  sentence.split(' ').each { |word|
    if looks_like_english(word)
      english_count += 1
    elsif looks_like_spanish(word)
      spanish_count += 1
    end
  }
  # <=> yields -1, 0 or 1, so this indexes "spanish", "unknown" or "english".
  retval = ["spanish", "unknown", "english"]
  retval[(english_count <=> spanish_count) + 1]
end
I've had experience with the same task, and decided against regex/text-parsing solutions after several days of discussion.
Now I use translation web services (like Google, Bing, ...) that support language auto-detection. I think it's the best way to solve this (if your constraints permit, of course).
