import numpy as np
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
#english.all.3class.distsim.crf.ser.gz
st = StanfordNERTagger('/media/sf_codebase/modules/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
'/media/sf_codebase/modules/stanford-ner-2018-10-16/stanford-ner.jar',
encoding='utf-8')
After the initialization above, the following code takes about 10 seconds to tag a short text, as shown below. How can I speed this up?
%%time
text="My name is John Doe"
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print (classified_text)
Output:
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 20 ms, total: 24 ms
Wall time: 10.9 s
Another solution within NLTK is to not use the old nltk.tag.StanfordNERTagger but instead to use the newer nltk.parse.CoreNLPParser. See, e.g., https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK.
More generally, the secret to good performance is indeed to use a server on the Java side, which you can call repeatedly without having to start a new subprocess for each sentence processed. You can either use the NERServer if you just need NER, or the StanfordCoreNLPServer for all CoreNLP functionality. There are a number of Python interfaces to it; see: https://stanfordnlp.github.io/CoreNLP/other-languages.html#python
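For reference, a minimal sketch of that CoreNLPParser route, assuming a CoreNLP server is already running on localhost:9000 (started with something like java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 from the CoreNLP directory):
from nltk.parse.corenlp import CoreNLPParser
# Talk to the already-running server; each tag() call is a cheap HTTP request,
# so the Java models are only loaded once.
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
print(list(ner_tagger.tag('My name is John Doe'.split())))
# e.g. [('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]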
Found the answer.
Start the Stanford NER server in the background, in the folder where Stanford NER is unzipped:
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
Then create a tagger in Python that talks to that server, using the sner library:
from sner import Ner
tagger = Ner(host='localhost',port=9199)
Then run the tagger.
%%time
classified_text=tagger.get_entities(text)
print (classified_text)
Output:
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms
Roughly 600 times faster in wall-clock time (10.9 s down to 18.2 ms)! Wow!
After attempting several options, I like Stanza. It is developed by Stanford, is very simple to implement, I didn't have to figure out how to start the server properly on my own, and it dramatically improved the speed of my program. It supports 18 different entity classes.
I found Stanza by following the link provided in Christopher Manning's answer.
To download:
pip install stanza
then in Python:
import stanza
stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("My name is John Doe.") # run annotation over a sentence or multiple sentences
If you only want a specific tool (NER), you can specify it with processors:
nlp = stanza.Pipeline('en',processors='tokenize,ner')
For an output similar to that produced by the OP:
classified_text = [(token.text,token.ner) for i, sentence in enumerate(doc.sentences) for token in sentence.tokens]
print(classified_text)
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]
But to produce a list of only those words that are recognizable entities:
classified_text = [(ent.text,ent.type) for ent in doc.ents]
[('John Doe', 'PERSON')]
It produces a couple of features that I really like:
Instead of each word being classified as a separate PERSON entity, it combines John Doe into one 'PERSON' object.
If you do want each separate word, you can extract those, and it identifies which part of the entity each word is ('B' for the first word, 'I' for intermediate words, and 'E' for the last word).
Related
I have two pieces of code to count the number of sentences in one text file. The two options generate different results, and Option 2 (Stanza) is very slow. Is Option 2 (Stanza) more accurate? How can I speed up Option 2 (Stanza)? Thanks a lot!
Option 1 (Regular expression): The following code takes 2 seconds and the output is 1444.
import requests
from bs4 import BeautifulSoup
import re
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
def identify_sentences(input_text: str):
    """Returns all sentences in the input text"""
    sentences = re.findall(sentence_regex, input_text)
    return sentences
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
sentences=identify_sentences(text)
len(sentences)
Option 2 (Stanza): The following code takes 6 minutes and the output is 2481.
import requests
from bs4 import BeautifulSoup
import stanza
nlp=stanza.Pipeline(lang='en', processors='tokenize, pos, ner')
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
doc=nlp(text)
sentences=doc.sentences
len(sentences)
Two answers:
If all you want to do is split text into sentences, then your pipeline should simply be nlp=stanza.Pipeline(lang='en', processors='tokenize'), and that will be much faster than the pipeline you show, which also runs a part-of-speech tagger and a named entity recognizer (see the sketch after these two points).
But, yes, running Stanza is way slower than simply matching against a single regex. It should also work differently and better in many places, because exclamation marks, question marks, and especially periods often occur in the middle of English sentences (e.g., here!). You'll have to decide for yourself whether the better accuracy is worth it to you.
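Here is that sketch: a minimal tokenize-only pipeline, reusing the text variable from the question.
import stanza
stanza.download('en')  # one-time model download
nlp = stanza.Pipeline(lang='en', processors='tokenize')  # tokenizer/sentence splitter only, no POS or NER
doc = nlp(text)
print(len(doc.sentences))  # number of sentences found by the neural tokenizer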
I am trying to explore T5. This is the code:
!pip install transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration
qa_input = """question: What is the capital of Syria? context: The name "Syria" historically referred to a wider region,
broadly synonymous with the Levant, and known in Arabic as al-Sham. The modern state encompasses the sites of several ancient
kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
among the oldest continuously inhabited cities in the world."""
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
outputs = model.generate(input_ids)
output_str = tokenizer.decode(outputs.reshape(-1))
I got this error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-8d24c6a196e4> in <module>()
5 kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
6 among the oldest continuously inhabited cities in the world."""
----> 7 tokenizer = T5Tokenizer.from_pretrained('t5-small')
8 model = T5ForConditionalGeneration.from_pretrained('t5-small')
9 input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
1 frames
/usr/local/lib/python3.6/dist-packages/transformers/file_utils.py in requires_sentencepiece(obj)
521 name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
522 if not is_sentencepiece_available():
--> 523 raise ImportError(SENTENCEPIECE_IMPORT_ERROR.format(name))
524
525
ImportError:
T5Tokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones
that match your environment.
--------------------------------------------------------------------------
After that, I installed the sentencepiece library as was suggested, like this:
!pip install transformers
!pip install sentencepiece
from transformers import T5Tokenizer, T5ForConditionalGeneration
qa_input = """question: What is the capital of Syria? context: The name "Syria" historically referred to a wider region,
broadly synonymous with the Levant, and known in Arabic as al-Sham. The modern state encompasses the sites of several ancient
kingdoms and empires, including the Eblan civilization of the 3rd millennium BC. Aleppo and the capital city Damascus are
among the oldest continuously inhabited cities in the world."""
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer.encode(qa_input, return_tensors="pt") # Batch size 1
outputs = model.generate(input_ids)
output_str = tokenizer.decode(outputs.reshape(-1))
But then I got another issue:
Some weights of the model checkpoint at t5-small were not used when
initializing T5ForConditionalGeneration:
['decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight']
This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another
architecture (e.g. initializing a BertForSequenceClassification model
from a BertForPreTraining model).
This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you
expect to be exactly identical (initializing a
BertForSequenceClassification model from a
BertForSequenceClassification model).
So I do not understand what is going on. Any explanation?
I used these two commands and this worked fine for me!
!pip install datasets transformers[sentencepiece]
!pip install sentencepiece
This is not an issue; I see the same output. It is just a warning that the library shows. You already fixed your actual issue (the missing SentencePiece dependency), so do not worry about the warning.
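For completeness, a small sketch of finishing the run once sentencepiece is installed, continuing from the variables in the snippet above; skip_special_tokens is my own addition so the decoded string does not include markers like <pad> and </s>:
outputs = model.generate(input_ids)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)  # decode the first (and only) sequence in the batch
print(answer)  # the model's predicted answer span for the question above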
I want to add addresses (and possibly other rule-based entities) to an NER pipeline, and TokensRegex seems like a terribly useful DSL for doing so. Following https://stackoverflow.com/a/42604225, I created this rules file:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: ([{ner:"NUMBER"}] [{pos:"NN"}|{pos:"NNP"}] /ave(nue)?|st(reet)?|boulevard|blvd|r(oa)?d/), action: Annotate($0, ner, "address") }
Here's a Scala REPL session showing how I'm trying to set up an annotation pipeline.
@ import edu.stanford.nlp.pipeline.{StanfordCoreNLP, CoreDocument}
@ import edu.stanford.nlp.util.PropertiesUtils.asProperties
@ val pipe = new StanfordCoreNLP(asProperties(
    "customAnnotatorClass.tokensregex", "edu.stanford.nlp.pipeline.TokensRegexAnnotator",
    "annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex",
    "ner.combinationMode", "HIGH_RECALL",
    "tokensregex.rules", "addresses.tregx"))
pipe: StanfordCoreNLP = edu.stanford.nlp.pipeline.StanfordCoreNLP@2ce6a051
@ val doc = new CoreDocument("Adam Smith lived at 123 noun street in Glasgow, Scotland")
doc: CoreDocument = Adam Smith lived at 123 noun street in Glasgow, Scotland
@ pipe.annotate(doc)
@ doc.sentences.get(0).nerTags
res5: java.util.List[String] = [PERSON, PERSON, O, O, address, address, address, O, CITY, O, COUNTRY]
@ doc.entityMentions
res6: java.util.List[edu.stanford.nlp.pipeline.CoreEntityMention] = [Adam Smith, 123, Glasgow, Scotland]
As you can see, the address gets correctly tagged in the nerTags for the sentence, but it doesn't show up in the documents entityMentions. Is there a way to do this?
Also, is there a way from the document to discern two adjacent matches of the tokenregex from a single match (assuming I have more complicated set of regexes; in the current example I only match exactly 3 tokens, so I could just count tokens)?
I tried approaching it using regexner with a TokensRegex, as described at https://stanfordnlp.github.io/CoreNLP/regexner.html, but I couldn't seem to get that working.
Since I'm working in scala I'll be happy to dive into the Java API to get this to work, rather than fiddle with properties and resource files, if that's necessary.
Yes, I've recently added some changes (in the GitHub version) to make this easier! Make sure to download the latest version from GitHub. Though we are aiming to release Stanford CoreNLP 3.9.2 fairly soon and it will have these changes.
If you read this page you can get an understanding of the full NER pipeline run by the NERCombinerAnnotator.
https://stanfordnlp.github.io/CoreNLP/ner.html
Furthermore there is a lot of write up on the TokensRegex here:
https://stanfordnlp.github.io/CoreNLP/tokensregex.html
Basically what you want to do is run the ner annotator and use its TokensRegex sub-annotator. Imagine you have some named entity rules in a file called my_ner.rules.
You could run a command like this:
java -Xmx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ner.additional.tokensregex.rules my_ner.rules -outputFormat text -file example.txt
This will run a TokensRegex sub-annotator during the full named entity recognition process. Then when the final step of entity mentions are run, it will operate on the rules extracted named entities and create entity mentions from them.
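If you would rather drive the same setup from Python (most of the rest of this page uses Python), here is a hedged sketch using Stanza's CoreNLPClient wrapper; my_ner.rules and the example sentence are placeholders, and it assumes the CORENLP_HOME environment variable points at a recent unzipped CoreNLP release:
from stanza.server import CoreNLPClient
# Pass the same property the command line above uses, so the TokensRegex rules
# run inside the ner annotator and feed into entity-mention detection.
with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner'],
                   properties={'ner.additional.tokensregex.rules': 'my_ner.rules'},
                   memory='5G') as client:
    ann = client.annotate('Adam Smith lived at 123 noun street in Glasgow, Scotland')
    for sentence in ann.sentence:
        print(sentence.mentions)  # per the answer above, this should now include the rules-derived entities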
Is it possible to get functionality similar to the Stanford Named Entity Recognizer using just NLTK?
Is there any example?
In particular, I am interested in extracting the LOCATION part of text. For example, from the text
The meeting will be held at 22 West Westin st., South Carolina, 12345
on Nov.-18
ideally I would like to get something like
(S
22/LOCATION
(LOCATION West/LOCATION Westin/LOCATION)
st./LOCATION
,/,
(South/LOCATION Carolina/LOCATION)
,/,
12345/LOCATION
.....
or simply
22 West Westin st., South Carolina, 12345
Instead, I am only able to get
(S
The/DT
meeting/NN
will/MD
be/VB
held/VBN
at/IN
22/CD
(LOCATION West/NNP Westin/NNP)
st./NNP
,/,
(GPE South/NNP Carolina/NNP)
,/,
12345/CD
on/IN
Nov.-18/-NONE-)
Note that if I enter my text into http://nlp.stanford.edu:8080/ner/process I get results that are far from perfect (the street number and zip code are still missing), but at least "st." is part of the LOCATION and South Carolina is a LOCATION and not some "GPE / NNP".
What am I doing wrong? How can I fix this so that NLTK extracts the location piece from the text?
Many thanks in advance!
NLTK DOES have an interface for Stanford NER; check nltk.tag.stanford.NERTagger.
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
'/usr/share/stanford-ner/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
output:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
However, every time you call tag, NLTK simply writes the target sentence into a file and runs the Stanford NER command-line tool to parse that file, then finally parses the output back into Python. Therefore the overhead of loading the classifiers (around 1 minute for me, every time) is unavoidable.
If that's a problem, use Pyner.
First run Stanford NER as a server
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9191
then go to the pyner folder and, in Python:
import ner
tagger = ner.SocketNER(host='localhost', port=9191)
tagger.get_entities("University of California is located in California, United States")
# {'LOCATION': ['California', 'United States'],
# 'ORGANIZATION': ['University of California']}
tagger.json_entities("Alice went to the Museum of Natural History.")
#'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Hope this helps.
I was curious to know if there is any bioinformatics tool out there able to process a multi-FASTA file and give me info like the number of sequences, their lengths, nucleotide/amino acid content, etc., and maybe automatically draw descriptive plots.
An R Bioconductor solution or a BioPerl module would also do, but I didn't manage to find anything.
Can you help me? Thanks a lot :-)
The EMBOSS suite is a collection of small tools that can help you out.
seqstats returns sequence length
pepstats should give you amino acid content, etc.
Some of the tools also offer plotting functions. Very handy.
http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/groups.html
To count the number of FASTA entries, I use:
grep -c '^>' mySequences.fasta
To make sure none of the entries are duplicates, I check that I get the same number when doing this: grep '^>' mySequences.fasta | sort | uniq | wc -l
You may also be interested in faSize, a tool from the Kent Source Tree, although this requires a bit more effort (you must download and compile it) than just using grep... here is some example output:
me@my-lab ~/data $ time faSize myfile.fna
215400419 bases (104761 N's 215295658 real 215295658 upper 0 lower) in 731620 sequences in 1 files
Total size: mean 294.4 sd 138.5 min 30 (F5854LK02GG895) max 1623 (F5854LK01AHBEH) median 307
N count: mean 0.1 sd 0.4
U count: mean 294.3 sd 138.5
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
real 0m3.710s
user 0m3.541s
sys 0m0.164s
Screed in Python is brilliant:
import screed
for record in screed.open(fastafilename):
    print(len(record.sequence))
It should be noted (for anyone stumbling upon this, like I just did) that there is a robust Python library specifically designed to handle these tasks, called Biopython. In a few lines of code you can quickly get answers to all of the above questions. Here are some very basic examples, mostly adapted from the link. There are boilerplate GC% graphs and sequence-length graphs in the tutorial as well.
In [1]: from Bio import SeqIO
In [2]: allSeqs = [seq_record for seq_record in SeqIO.parse('/home/kevin/stack/ls_orchid.fasta', 'fasta')]
In [3]: allSeqs[0]
Out[3]: SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])
In [4]: len(allSeqs) #number of unique sequences in the file
Out[4]: 94
In [5]: len(allSeqs[0].seq) # call len() on each SeqRecord.seq object
Out[5]: 740
In [6]: A_count = allSeqs[0].seq.count('A')
C_count = allSeqs[0].seq.count('C')
G_count = allSeqs[0].seq.count('G')
T_count = allSeqs[0].seq.count('T')
print A_count # number of A's
144
In [7]: allSeqs[0].seq.count("AUG") # or count how many start codons
Out[7]: 0
In [8]: allSeqs[0].seq.translate() # translate DNA -> Amino Acid
Out[8]: Seq('RNKVSVGEPAEGSLMRPWNKRSSESGGPVYSAHRGHCSRGDPDLLLGRLGSVHG...*VY', HasStopCodon(ExtendedIUPACProtein(), '*'))
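Building on the same allSeqs list, here is a short sketch of the whole-file summary the question asks about (GC is the helper in the older Biopython used in this session; newer releases expose gc_fraction instead):
from Bio.SeqUtils import GC
lengths = [len(rec.seq) for rec in allSeqs]    # per-record sequence lengths
gc_values = [GC(rec.seq) for rec in allSeqs]   # per-record GC percentage
print(len(allSeqs), min(lengths), max(lengths), sum(lengths) / len(lengths))  # count, min, max, mean length
print(sum(gc_values) / len(gc_values))         # mean GC%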