Is it possible to get functionality similar to the Stanford Named Entity Recognizer using just NLTK?
Is there any example?
In particular, I am interested in extracting the LOCATION part of the text. For example, from the text
The meeting will be held at 22 West Westin st., South Carolina, 12345
on Nov.-18
ideally I would like to get something like:
(S
22/LOCATION
(LOCATION West/LOCATION Westin/LOCATION)
st./LOCATION
,/,
(South/LOCATION Carolina/LOCATION)
,/,
12345/LOCATION
.....
or simply
22 West Westin st., South Carolina, 12345
Instead, I am only able to get
(S
The/DT
meeting/NN
will/MD
be/VB
held/VBN
at/IN
22/CD
(LOCATION West/NNP Westin/NNP)
st./NNP
,/,
(GPE South/NNP Carolina/NNP)
,/,
12345/CD
on/IN
Nov.-18/-NONE-)
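For reference, output in this shape is what NLTK's built-in chunker produces; roughly (a sketch, assuming the standard nltk.pos_tag + nltk.ne_chunk pipeline):
import nltk

# Requires the NLTK data packages: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words (via nltk.download).
text = ("The meeting will be held at 22 West Westin st., South Carolina, 12345 "
        "on Nov.-18")
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
print(tree)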
Note that if I enter my text into http://nlp.stanford.edu:8080/ner/process I get results that are far from perfect (the street number and zip code are still missing), but at least "st." is part of the LOCATION and South Carolina is a LOCATION rather than some "GPE / NNP".
What am I doing wrong? How can I fix it so that I can use NLTK to extract the location part of the text?
Many thanks in advance!
NLTK DOES have an interface for Stanford NER; see nltk.tag.stanford.NERTagger.
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
'/usr/share/stanford-ner/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
output:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
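If you want just the location span as a single string (as in the question), you can group consecutive tokens that share a label on top of this tuple output. A small helper sketch (my own code, not part of NLTK):
from itertools import groupby

def spans(tagged, label):
    # Join runs of consecutive tokens that carry the given label.
    return [' '.join(tok for tok, _ in run)
            for tag, run in groupby(tagged, key=lambda pair: pair[1])
            if tag == label]

tagged = st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(spans(tagged, 'LOCATION'))       # ['NY']
print(spans(tagged, 'ORGANIZATION'))   # ['Stony Brook University']
Applied to the question's sentence, this joins whatever multi-token spans the tagger actually labels as LOCATION into single strings.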
However, every time you call tag, NLTK simply writes the target sentence to a file, runs the Stanford NER command-line tool to process that file, and finally parses the output back into Python. So the overhead of loading the classifiers (around 1 minute for me every time) is unavoidable.
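If you have many sentences, one partial mitigation is to batch them into a single call so the classifier is loaded once per batch rather than once per sentence. A sketch (assumption on my part: a recent NLTK where the Stanford wrapper, e.g. StanfordNERTagger, exposes tag_sents):
from nltk.tag import StanfordNERTagger

st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
                       '/usr/share/stanford-ner/stanford-ner.jar')
sentences = [
    'Rami Eid is studying at Stony Brook University in NY'.split(),
    'The meeting will be held in South Carolina'.split(),
]
# One subprocess call and one classifier load for the whole batch.
results = st.tag_sents(sentences)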
If that's a problem, use Pyner.
First run Stanford NER as a server
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9191
then, from the Pyner folder, in Python:
import ner
tagger = ner.SocketNER(host='localhost', port=9191)
tagger.get_entities("University of California is located in California, United States")
# {'LOCATION': ['California', 'United States'],
# 'ORGANIZATION': ['University of California']}
tagger.json_entities("Alice went to the Museum of Natural History.")
#'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Hope this helps.
Related
I am trying to explore T5. This is the code:
!pip install transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration
qa_input = """question: What is the capital of Syria? context: The name "Syria" historically referred to a wider region,
broadly synonymous with the Levant, and known in Arabic as al-Sham. The modern state encompasses the sites of several ancient
kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
among the oldest continuously inhabited cities in the world."""
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
outputs = model.generate(input_ids)
output_str = tokenizer.decode(outputs.reshape(-1))
I got this error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-8d24c6a196e4> in <module>()
5 kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
6 among the oldest continuously inhabited cities in the world."""
----> 7 tokenizer = T5Tokenizer.from_pretrained('t5-small')
8 model = T5ForConditionalGeneration.from_pretrained('t5-small')
9 input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
1 frames
/usr/local/lib/python3.6/dist-packages/transformers/file_utils.py in requires_sentencepiece(obj)
521 name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
522 if not is_sentencepiece_available():
--> 523 raise ImportError(SENTENCEPIECE_IMPORT_ERROR.format(name))
524
525
ImportError:
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment.
--------------------------------------------------------------------------
After that, I installed the sentencepiece library as was suggested, like this:
!pip install transformers
!pip install sentencepiece
from transformers import T5Tokenizer, T5ForConditionalGeneration
qa_input = """question: What is the capital of Syria? context: The name "Syria" historically referred to a wider region,
broadly synonymous with the Levant, and known in Arabic as al-Sham. The modern state encompasses the sites of several ancient
kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
among the oldest continuously inhabited cities in the world."""
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
outputs = model.generate(input_ids)
output_str = tokenizer.decode(outputs.reshape(-1))
but I got another issue:
Some weights of the model checkpoint at t5-small were not used when
initializing T5ForConditionalGeneration:
['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another
architecture (e.g. initializing a BertForSequenceClassification model
from a BertForPreTraining model).
This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you
expect to be exactly identical (initializing a
BertForSequenceClassification model from a
BertForSequenceClassification model).
So I don't understand what is going on. Any explanation?
I used these two commands and this is working fine for me!
!pip install datasets transformers[sentencepiece]
!pip install sentencepiece
This is not an issue. I also observe the second output. It is just a warning that the library shows. You fixed your actual issue. Do not worry about the warning.
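If the warning is distracting, recent versions of transformers let you lower the library's logging verbosity before loading the model. A sketch (assumes your transformers version exposes its logging helpers as transformers.logging):
from transformers import logging

# Optional: hide library warnings such as "Some weights ... were not used".
# The warning is harmless either way.
logging.set_verbosity_error()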
import numpy as np
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
#english.all.3class.distsim.crf.ser.gz
st = StanfordNERTagger('/media/sf_codebase/modules/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
'/media/sf_codebase/modules/stanford-ner-2018-10-16/stanford-ner.jar',
encoding='utf-8')
After initializing Stanford NER with the code above, the following code takes about 10 seconds to tag the text, as shown below. How can I speed this up?
%%time
text="My name is John Doe"
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print (classified_text)
Output
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 20 ms, total: 24 ms
Wall time: 10.9 s
Another solution within NLTK is to not use the old nltk.tag.StanfordNERTagger but instead to use the newer nltk.parse.CoreNLPParser. See, e.g., https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK.
More generally the secret to good performance is indeed to use a server on the Java side, which you can repeatedly call without having to start new subprocesses for each sentence processed. You can either use the NERServer if you just need NER or the StanfordCoreNLPServer for all CoreNLP functionality. There are a number of Python interfaces to it, see: https://stanfordnlp.github.io/CoreNLP/other-languages.html#python
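For example, with a CoreNLP server already running on its default port 9000, the NLTK side looks roughly like this (a sketch following the wiki page linked above):
from nltk.parse import CoreNLPParser

# Assumes a StanfordCoreNLPServer is already running and listening on localhost:9000.
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
print(list(ner_tagger.tag('My name is John Doe'.split())))
# Expect something like: [('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]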
Found the answer.
Start the Stanford NER server in the background, from the folder where Stanford NER is unzipped:
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
Then create the tagger in Python using the sner library:
from sner import Ner
tagger = Ner(host='localhost',port=9199)
Then run the tagger.
%%time
classified_text=tagger.get_entities(text)
print (classified_text)
Output:
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms
Almost 600 times better performance in terms of timing (10.9 s down to 18.2 ms)! Wow!
After attempting several options, I like Stanza. It is developed by Stanford, is very simple to implement, I didn't have to figure out how to start the server properly on my own, and it dramatically improved the speed of my program. It implements 18 different entity types.
I found Stanza by following the link provided in Christopher Manning's answer.
To download:
pip install stanza
then in Python:
import stanza
stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("My name is John Doe.") # run annotation over a sentence or multiple sentences
If you only want a specific tool (NER), you can specify with processors as:
nlp = stanza.Pipeline('en',processors='tokenize,ner')
For an output similar to that produced by the OP:
classified_text = [(token.text, token.ner) for sentence in doc.sentences for token in sentence.tokens]
print(classified_text)
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]
But to produce a list of only those words that are recognizable entities:
classified_text = [(ent.text,ent.type) for ent in doc.ents]
[('John Doe', 'PERSON')]
It produces a couple of features that I really like:
Instead of each word being classified as a separate person entity, it combines John Doe into one 'PERSON' object.
If you do want each separate word, you can extract those, and it identifies which part of the entity each word is ('B' for the first word, 'I' for intermediate words, and 'E' for the last word).
I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I've created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a Scala REPL session showing how I'm trying to set up an annotation pipeline.
# import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
# import edu.stanford.nlp.util.PropertiesUtils.asProperties
# val pipe = new StanfordCoreNLP(asProperties(
"customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
"annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
"ner.combinationMode", "HIGH_RECALL",
"tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP#2ce6a051
# val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
# pipe.annotate(doc)
# doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
# doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the documents entityMentions. Is there a way to do this?
Also, is there a way, from the document, to discern two adjacent matches of the TokensRegex from a single match (assuming I have a more complicated set of regexes; in the current example I only match exactly 3 tokens, so I could just count tokens)?
I tried approaching it using the regexner with a tokens regex described here https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't seem to get that working.
Since I'm working in scala I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.
Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. Though we are aiming to release Stanford CoreNLP 3.9.2 fairly soon and it will have these changes.
If you read this page you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator.
https://stanfordnlp.github.io/CoreNLP/ner.html
Furthermore, there is a lot of write-up on TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically, what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.
You could run a command like this:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
This will run a TokensRegex sub-annotator during the full named-entity recognition process. Then, when the final entity-mention step is run, it will operate on the rule-extracted named entities and create entity mentions from them.
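If you'd rather drive this from Python (the language used elsewhere in this thread) instead of the command line, a rough equivalent using stanza's CoreNLPClient might look like this (a sketch; it assumes CORENLP_HOME points at your CoreNLP install and that my_ner.rules is the rules file above):
from stanza.server import CoreNLPClient

# Sketch: pass the additional TokensRegex rules through the same property
# the command line uses.
with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner'],
                   properties={'ner.additional.tokensregex.rules': 'my_ner.rules'},
                   memory='5G', timeout=30000) as client:
    ann = client.annotate('Adam Smith lived at 123 noun street in Glasgow, Scotland')
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.ner)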
I have a bunch of badly formatted text with lots of missing punctuation. I want to know if there is any method to segment text into sentences when periods, semicolons, capitalization, etc. are missing.
For example, consider the paragraph: "the lion is called the king of the forest it has a majestic appearance it eats flesh it can run very fast the roar of the lion is very famous".
This text should be segmented as separate sentences:
the lion is called the king of the forest
it has a majestic appearance
it eats flesh
it can run very fast
the roar of the lion is very famous
Can this be done or is it impossible? Any suggestion is much appreciated!
You can try using the following Python implementation from here.
import torch
model, example_texts, languages, punct, apply_te = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_te')
#your text goes here. I imagine it is contained in some list
input_text = input('Enter input text\n')
apply_te(input_text, lan='en')
From my understanding, to create a training file you put your words in a text file, and after each word you add a space or tab along with the tag (such as PERS, LOC, etc.).
I also copied text from a sample properties file into WordPad. How do I get these into a .gz file that I can feed to the classifier and use?
Please guide me through this. I'm a newbie and am fairly inept with technology.
Your training file (say training-data.tsv) should look like this:
I O
drove O
to O
Vancouver LOCATION
BC LOCATION
yesterday O
where O means "Outside", as in not a named entity, and the whitespace between the columns is a tab.
You don't put them in a ser.gz file. The ser.gz file is the classifier model that is created by the training process.
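If you want to generate the training file programmatically, here is a minimal sketch (my own helper, not part of Stanford NER) that writes token/tag pairs with tab separators:
# Writes (token, tag) pairs to a tab-separated training file.
rows = [('I', 'O'), ('drove', 'O'), ('to', 'O'),
        ('Vancouver', 'LOCATION'), ('BC', 'LOCATION'), ('yesterday', 'O')]
with open('training-data.tsv', 'w', encoding='utf-8') as f:
    for token, tag in rows:
        f.write(f'{token}\t{tag}\n')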
To train the classifier run:
java -cp ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop my-classifier.properties
where my-classifier.properties would look like this:
trainFile = training-data.tsv
serializeTo = my-classification-model.ser.gz
map = word=0,answer=1
...
I'd advise you to take a look at the NLTK documentation to learn more about training a parser: http://nltk.googlecode.com/svn/trunk/doc/howto/tag.html. Now, it seems that you want to train the CRFClassifier (not the parser!); for that you may want to check this FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a
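Once training has produced my-classification-model.ser.gz, you can load it from NLTK's StanfordNERTagger along these lines (a sketch; the paths are placeholders for your own setup):
from nltk.tag import StanfordNERTagger

# Placeholder paths: point these at your trained model and the Stanford NER jar.
st = StanfordNERTagger('my-classification-model.ser.gz', 'stanford-ner.jar')
print(st.tag('I drove to Vancouver BC yesterday'.split()))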