How to reverse tokenization after running tokens through name finder? - opennlp

After using NameFinderME to find the names in a series of tokens, I would like to reverse the tokenization and reconstruct the original text with the names modified. Is there a way to reverse the tokenization exactly as it was performed, so that the output has the same structure as the input?
Example
Hello my name is John. This is another sentence.
Find sentences
Hello my name is John.
This is another sentence.
Tokenize sentences.
> Hello
> my
> name
> is
> John.
>
> This
> is
> another
> sentence.
My code that analyzes the tokens above looks something like this so far.
TokenNameFinderModel model3 = new TokenNameFinderModel(modelIn3);
NameFinderME nameFinder = new NameFinderME(model3);
List<Span[]> spans = new List<Span[]>();
foreach (string sentence in sentences)
{
String[] tokens = tokenizer.tokenize(sentence);
Span[] nameSpans = nameFinder.find(tokens);
string[] namedEntities = Span.spansToStrings(nameSpans, tokens);
//I want to modify each of the named entities found
//foreach(string s in namedEntities) { modifystring(s) };
spans.Add(nameSpans);
}
Desired output, perhaps masking the names that were found.
Hello my name is XXXX. This is another sentence.
In the documentation, there is a link to this post describing how to use the detokenizer, but I don't understand how the operations array relates to the original tokenization (if at all).
https://issues.apache.org/jira/browse/OPENNLP-216
Create instance of SimpleTokenizer.
String sentence = "He said \"This is a test\".";
SimpleTokenizer instance = SimpleTokenizer.INSTANCE;
Tokenize the sentence using tokenize(String str) method from SimpleTokenizer
String tokens[] = instance.tokenize(sentence);
The operations array must contain the same number of entries as the tokens array; in other words, the two array lengths must be equal.
Store the operation name N times (tokens.length times) in the operations array.
Operation operations[] = new Operation[tokens.length];
String oper = "MOVE_RIGHT"; // please refer above list for the list of operations
for (int i = 0; i < tokens.length; i++)
{ operations[i] = Operation.parse(oper); }
System.out.println(operations.length);
Here the operations array length equals the tokens array length.
Now create an instance of DetokenizationDictionary by passing tokens and operations arrays to the constructor.
DetokenizationDictionary detokenizeDict = new DetokenizationDictionary(tokens, operations);
Pass DetokenizationDictionary instance to the DictionaryDetokenizer class to detokenize the tokens.
DictionaryDetokenizer dictDetokenize = new DictionaryDetokenizer(detokenizeDict);
DictionaryDetokenizer.detokenize requires two parameters: (a) the tokens array and (b) the split marker.
String st = dictDetokenize.detokenize(tokens, " ");
Output:

Use the Detokenizer.
String text = detokenizer.detokenize(myTokens, null); // detokenizer is a Detokenizer instance, e.g. a DictionaryDetokenizer
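One way to get output with exactly the input's structure is to avoid detokenizing at all: keep each token's character offsets and edit the original string in place. The sketch below is plain Java and does not call OpenNLP; the Token class, the tokenize method and the mask method are hypothetical stand-ins. In OpenNLP the character offsets would presumably come from Tokenizer.tokenizePos, which returns spans instead of strings, and the name spans (token indices, end exclusive) from NameFinderME.find.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetMasking {
    /** Character offsets [start, end) of one token in the source text. */
    static final class Token {
        final int start;
        final int end;
        Token(int start, int end) { this.start = start; this.end = end; }
    }

    // Hypothetical tokenizer: runs of letters/digits form one token,
    // every other non-whitespace character is a token of its own.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (Character.isLetterOrDigit(c)) {
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
            } else {
                i++;
            }
            tokens.add(new Token(start, i));
        }
        return tokens;
    }

    // nameSpans holds {firstToken, endTokenExclusive} pairs, assumed to be
    // sorted ascending and non-overlapping.
    static String mask(String text, List<Token> tokens, int[][] nameSpans, String replacement) {
        StringBuilder sb = new StringBuilder(text);
        // Replace from the last span to the first so earlier offsets stay valid.
        for (int k = nameSpans.length - 1; k >= 0; k--) {
            int start = tokens.get(nameSpans[k][0]).start;
            int end = tokens.get(nameSpans[k][1] - 1).end;
            sb.replace(start, end, replacement);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Hello my name is John. This is another sentence.";
        List<Token> tokens = tokenize(text);
        int[][] nameSpans = { {4, 5} }; // token 4 is "John" (hand-picked for the demo)
        System.out.println(mask(text, tokens, nameSpans, "XXXX"));
    }
}
```

Everything outside the replaced spans, including spacing and punctuation, is copied from the original string untouched. When working sentence by sentence, the token offsets are relative to the sentence, so the sentence's own start offset in the document would have to be added.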

Related

I wrote a code to update the Lettering of the first name in Zoho but it's not working

Here's the Deluge script that should capitalize the first letter and make the other letters lowercase, but it isn't working:
a = zoho.crm.getRecordById("Contacts",input.ID);
d = a.get("First_Name");
firstChar = d.subString(0,1);
otherChars = d.removeFirstOccurence(firstChar);
Name = firstChar.toUppercase() + otherChars.toLowerCase();
mp = map();
mp.put("First_Name",d);
b = zoho.crm.updateRecord("Contacts", Name,{"First_Name":"Name"});
info Name;
info b;
I tried capitalizing the first letter and making the other letters lowercase, but it isn't working as expected.
Try using concat
Name = firstChar.toUppercase().concat( otherChars.toLowerCase() );
Try removing the double quotes from the Name value in the following statement. Name is a variable holding the case-adjusted name, whereas "Name" is the literal string "Name".
From:
b = zoho.crm.updateRecord("Contacts", Name,{"First_Name":"Name"});
To
b = zoho.crm.updateRecord("Contacts", Name,{"First_Name":Name});

How can I filter() stream method using regexp and predicate to get negated list

I am trying to filter anything not in the regexp.
So what I am trying to express is write anything to a list that has characters other than a-z,0-9 and -, so I can deal with these city names with invalid characters afterwards.
But whatever I try, I end up with either a list of valid cities or an IllegalArgumentException raised for a list that contains only valid city names.
String str;
List<String> invalidCharactersList = cityName.stream()
.filter(Pattern.compile("[^a-z0-9-]*$").asPredicate())
.collect(toList());
// Check for invalid names
if (!invalidCharactersList.isEmpty()) {
str = (inOut) ? "c" : "q";
throw new IllegalArgumentException("City name characters "
+ str + ": for city name " + invalidCharactersList.get(0)
+ ": fails constraint city names [a-z, 0-9, -]");
}
I am trying to filter anything not in the regexp.
Here is some test data. It fails on the first list; I want it to fail on the last one.
List<String> c = new ArrayList<>(Arrays.asList("fastcity", "bigbanana", "xyz"));
List<Integer> x = new ArrayList<>(Arrays.asList(23, 23, 23));
List<Integer> y = new ArrayList<>(Arrays.asList(1, 10, 20));
List<String> q = new ArrayList<>(Arrays.asList("fastcity*", "bigbanana", "xyz&"));
Following is output:
@Holger
filter(Pattern.compile("[^a-z0-9-]").asPredicate())
Thanks, this works fine.
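The reason the shorter pattern works: asPredicate() tests with find(), so the predicate is true as soon as any part of the string matches. The original "[^a-z0-9-]*$" can match the empty string at the end of every input, so every city passed the filter, valid or not. A minimal sketch of the corrected filter, with a hypothetical class and method name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class InvalidCityFilter {
    // Matches one character outside a-z, 0-9 and '-'.
    private static final Pattern INVALID_CHAR = Pattern.compile("[^a-z0-9-]");

    // Keeps only the names that contain at least one invalid character.
    static List<String> invalidNames(List<String> cityNames) {
        return cityNames.stream()
                .filter(INVALID_CHAR.asPredicate())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(invalidNames(Arrays.asList("fastcity", "bigbanana", "xyz")));
        // prints []
        System.out.println(invalidNames(Arrays.asList("fastcity*", "bigbanana", "xyz&")));
        // prints [fastcity*, xyz&]
    }
}
```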

How to display array of split elements using LINQ

I have this simple code
string[] sequences = {"red,green,blue","orange","white,pink"};
var allWords = sequences.Select(s => s.Split(','));
foreach (var letter in allWords)
{
Console.WriteLine(letter);
}
The problem is that the output shows System.String[] instead of the split array.
How can I display the result in the console?
Use SelectMany if you want an array of strings and not an array of arrays of strings.
See https://dotnetfiddle.net/0vsjfN
SelectMany concatenates the lists, which are generated by using .Split(','), into a single list.

Excel Power Query: How to Combine All List Items into Single Row

I have a query to Microsoft's Cognitive Services text key phrases API from Excel 2016 Power Query, getting keywords from tweets. It works fine.
However, the JSON doc that's returned per query is converted by Power Query into a list of ~1-5 rows.
In the case of the pic, I want all responses returned to be in one cell/row, regardless of the number of items returned.
Here is my full M query (you need to put your own key in) if you're interested.
let
TweetCognitive = (TweetID as text, TweetText as text) =>
let
JsonRecords = Text.FromBinary(Json.FromValue([id=TweetID, text=TweetText])),
JsonRequest = "{""documents"": [" & JsonRecords & "]}",
JsonContent = Text.ToBinary(JsonRequest, TextEncoding.Ascii),
Response =
Web.Contents("https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases?",
[
Headers = [#"Ocp-Apim-Subscription-Key"="yourkeyhere",
#"Content-Type"="application/json", Accept="application/json"],
Content=JsonContent
]),
JsonResponse = Json.Document(Response,1252)
in
JsonResponse
in
TweetCognitive
You can use List.Accumulate to turn a list of values into a single value. For example, this would combine the values in the list into a single text value with ". " separating each row's value:
List.Accumulate(JsonResponse, "", (state, current) => state & current & ". ")
This would generate "monday frank love happiness today. nice good kind. tomorrow. " in your example. If you want to get rid of the trailing space, you can surround the List.Accumulate expression with Text.Trim.
The basic function to concatenate elements in a list is Text.Combine. For instance:
Text.Combine(JsonResponse, " ")
This avoids the extra delimiter at the end that you get with List.Accumulate. Note also that List.Combine is for creating a longer combined list from shorter lists; the similar naming may cause confusion.

how to detect all caps word in a string

I am new to Java. I have a text file containing different words per line, and I want to read that file as a string in order to detect words that are written in all caps (abbreviations). The exception is that a word starting with "#" or "@" should not be counted. For example, I have:
OMG terry is cute #HAWT SMH
The result will be:
Abbreviations = 2.
or
terry likes TGIF parties #ANDERSON
The result will be:
Abbreviations = 1.
Please help
Try using the .split(String regex) and .contains(CharSequence s) methods.
I think they will help you a lot.
Function split:
http://www.tutorialspoint.com/java/java_string_split.htm
Function contains:
http://www.tutorialspoint.com/java/lang/string_contains.htm
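The split-based suggestion could look roughly like the sketch below. The class name, the method name and the ignoredPrefixes parameter are hypothetical. Requiring at least two capital letters is an assumption, made so that single-letter words such as "I" or "A" are not counted.

```java
class AbbreviationCounter {
    // Counts whitespace-separated words made up only of capital letters,
    // skipping words whose first character appears in ignoredPrefixes.
    static int countAbbreviations(String line, String ignoredPrefixes) {
        int count = 0;
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // a blank line yields a single empty string
            }
            if (ignoredPrefixes.indexOf(word.charAt(0)) >= 0) {
                continue; // tagged word, e.g. "#HAWT"
            }
            if (word.matches("[A-Z]{2,}")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("Abbreviations = "
                + countAbbreviations("OMG terry is cute #HAWT SMH", "#"));
        System.out.println("Abbreviations = "
                + countAbbreviations("terry likes TGIF parties #ANDERSON", "#"));
    }
}
```

For the two sample lines from the question this prints "Abbreviations = 2" and "Abbreviations = 1". Words with attached punctuation (for example "OMG!") are not counted by this version.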
String str1 = "OMG terry is cute #HAWT SMH";
String str2 = "terry likes TGIF parties #ANDERSON";
Pattern p = Pattern.compile("(?>\\s)([A-Z]+)(?=\\s)");
Matcher matcher = p.matcher(" "+str1+" ");//pay attention! adding spaces
// before and after to catch potentials in
// beginning/end of the sentence
int i=0;
while (matcher.find()) {
i++; //count how many matches were found
}
System.out.println("matches: "+i); // prints 2
matcher = p.matcher(" "+str2+" ");
i=0;
while (matcher.find()) {
i++;
}
System.out.println("matches: "+i); // prints 1
OUTPUT:
matches: 2
matches: 1
Here is a bazooka for your spider problem.
(mystring+" ").split("(?<![#@A-Z])[A-Z]{2,}").length-1;
Pad the string with a space (because .split removes trailing empty strings).
Split on the pattern "behind this is neither # nor @ nor another capital letter, and this is two or more capital letters". The capital-letter check keeps a match from starting inside a tagged word; without it, the "AWT" in "#HAWT" would be counted. This returns an array of the substrings that are not part of abbreviations.
Take the length of the array and subtract 1.
Example:
String mystring = "OMG terry is cute #HAWT SMH";
String[] arr = (mystring+" ").split("(?<![#@A-Z])[A-Z]{2,}");
//arr is now {"", " terry is cute #HAWT ", " "}, three strings
return arr.length-1; //returns 2
