How to use my own corpus on word embedding model BERT

I am trying to create a question-answering model with the word embedding model BERT from Google. I am new to this and would like to use my own corpus for training. At first I used an example from the Hugging Face site, and that worked fine:
from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
    'context': "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland.",
    'question': "Wat is de hoofdstad van Nederland?"})
output
> {'answer': 'Amsterdam', 'end': 9, 'score': 0.825619101524353, 'start': 0}
So I created a .txt file to test whether I could replace the sentence in the context parameter with the exact same sentence read from a .txt file.
with open('test.txt') as f:
    lines = f.readlines()
qa_pipeline = pipeline(
    "question-answering",
    model="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-dutch-squad2"
)
qa_pipeline({
    'context': lines,
    'question': "Wat is de hoofdstad van Nederland?"})
But this gave me the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-2bae0ecad43e> in <module>()
10 qa_pipeline({
11 'context': lines,
---> 12 'question': "Wat is de hoofdstad van Nederland?"})
5 frames
/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py in _is_whitespace(c)
84
85 def _is_whitespace(c):
---> 86 if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
87 return True
88 return False
TypeError: ord() expected a character, but string of length 66 found
I was just experimenting with ways to read and use a .txt file, but I can't seem to find a different solution. I did some research on the Hugging Face pipeline() function, and this is what was written about the question and context parameters:

Got it! The solution was really easy. I assumed that the variable 'lines' was already a str, but that wasn't the case: f.readlines() returns a list of strings. Just by casting it to a string, the question-answering model accepted my test.txt file.
So from:
with open('test.txt') as f:
    lines = f.readlines()
to:
with open('test.txt') as f:
    lines = str(f.readlines())
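The cast works, but str() on a list returns the list's printed form, so the brackets, quotes and escaped newlines end up in the context too. Here is a minimal sketch of the difference, assuming the goal is simply to pass the file's text as the context. f.read() is swapped in as an alternative to the cast, and the temporary file exists only to make the example self-contained:

```python
import os
import tempfile

sentence = "Amsterdam is de hoofdstad en de dichtstbevolkte stad van Nederland.\n"

# Write a throwaway file so the example does not depend on test.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sentence)
    path = tmp.name

with open(path, encoding="utf-8") as f:
    as_list = f.readlines()      # list of str, one item per line

with open(path, encoding="utf-8") as f:
    as_text = f.read().strip()   # one plain str, trailing newline removed

os.remove(path)

casted = str(as_list)            # printed form of the list, not the raw text
print(type(as_list).__name__, type(as_text).__name__)
print(casted)
print(as_text)
```

With as_text as the context, the start and end offsets in the answer refer to positions in the original sentence rather than in the list's printed form.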


KeyError when using non-default models in Huggingface transformers pipeline

I have no problems using the default model in the sentiment analysis pipeline.
from transformers import AutoModelWithLMHead, AutoTokenizer, pipeline
# Allocate a pipeline for sentiment-analysis
nlp = pipeline('sentiment-analysis')
nlp('I am a black man.')
>>>[{'label': 'NEGATIVE', 'score': 0.5723695158958435}]
But when I try to customise the pipeline a little by specifying a model, it throws a KeyError.
nlp = pipeline('sentiment-analysis',
               tokenizer=AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
               model=AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
nlp('I am a black man.')
>>>---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-55-af7e46d6c6c9> in <module>
3 tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
4 model = AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
----> 5 nlp('I am a black man.')
6
7
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in <listcomp>(.0)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
KeyError: 58129
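One likely reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the pipeline looks up id2label[argmax], which only works if the model emits one score per label. A language-modelling head emits one score per vocabulary entry, so the winning index can be any vocabulary position. The sketch below mimics that lookup with plain Python lists; id2label, the vocabulary size and the scores are all made up for illustration.

```python
# Hypothetical two-class label map, as a classification head would carry.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def top_label(scores, id2label):
    """Mimic the pipeline's lookup: index of the best score -> label."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return id2label[best]

# A classification head yields one score per label: the lookup succeeds.
print(top_label([0.57, 0.43], id2label))

# A language-modelling head yields one score per vocabulary entry
# (the size here is illustrative), so the best index is rarely a label id.
vocab_scores = [0.0] * 60000
vocab_scores[58129] = 1.0
try:
    top_label(vocab_scores, id2label)
except KeyError as err:
    print("KeyError:", err)
```

If that is the cause, loading a checkpoint fine-tuned for sentiment with a sequence-classification class such as AutoModelForSequenceClassification would be the thing to try.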
I am facing the same problem. I am working with an XLM-R model fine-tuned on the SQuAD v2 data set ("a-ware/xlmroberta-squadv2"). In my case, the KeyError is 16.
Link
Looking for help on the issue, I found this information: link. I hope you find it helpful.
Answer (from the link)
The pipeline throws an exception when the model predicts a token that is not part of the document (e.g. the final special token [SEP]).
My problem:
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
from transformers import pipeline
nlp = pipeline('question-answering',
               model=XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2'),
               tokenizer=XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2'))
nlp(question="Who was Jim Henson?", context="Jim Henson was a nice puppet")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-15-b5a8ece5e525> in <module>()
1 context = "Jim Henson was a nice puppet"
2 # --------------- CON INTERROGACIONES
----> 3 nlp(question = "Who was Jim Henson?", context =context)
1 frames
/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <listcomp>(.0)
1745 ),
1746 }
-> 1747 for s, e, score in zip(starts, ends, scores)
1748 ]
1749
KeyError: 16
Solution 1: Adding punctuation at the end of the context
To avoid the bug of trying to extract the final token (which may be a special one such as [SEP]), I added an element (in this case a punctuation mark) at the end of the context:
nlp(question = "Who was Jim Henson?", context ="Jim Henson was a nice puppet.")
[OUT]
{'answer': 'nice puppet.', 'end': 28, 'score': 0.5742837190628052, 'start': 17}
Solution 2: Do not use pipeline()
Used on its own, the model retrieves the correct token indices.
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
import torch
tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
# Encode question and context together as one sequence pair
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
# The first two outputs are the start and end logits
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
# Take the tokens between the most likely start and end positions
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
# Map the tokens back to ids and decode them into readable text
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
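Taking the raw argmax can still land on a special token or on the question. A minimal sketch of one way to guard against that, assuming you can build a mask marking which positions belong to the context; best_context_span, the mask and all scores below are hypothetical and not part of transformers:

```python
def best_context_span(start_scores, end_scores, context_mask):
    """Return (start, end) of the highest-scoring span inside the context.

    context_mask[i] is True when token i belongs to the context (not the
    question and not a special token). Spans satisfy start <= end.
    Returns None if no token is marked as context.
    """
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_scores):
        if not context_mask[s]:
            continue
        for e in range(s, len(end_scores)):
            if not context_mask[e]:
                break  # stop at the first token outside the context
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up scores for 8 positions: 4 question/special tokens, 3 context
# tokens, 1 final special token whose end score is the raw maximum.
mask = [False, False, False, False, True, True, True, False]
start = [0.1, 0.0, 0.0, 0.0, 2.0, 0.5, 0.3, 0.2]
end = [0.0, 0.0, 0.0, 0.0, 0.1, 0.4, 1.0, 3.0]
print(best_context_span(start, end, mask))
```

The inner loop stops at the first non-context position, which assumes the context tokens form one contiguous run.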
Update
Looking at your case in more detail, I found that the default model for the Conversational task in the pipeline is distilbert-base-cased (source code).
The first solution I posted is not a good one after all: trying other questions, I got the same error. However, the model itself works fine outside the pipeline (as shown in Solution 2). Thus, I believe that not all models can be used in the pipeline. If anyone has more information about this, please help us out. Thanks.

How to build an empirical codon substitution matrix from a multiple sequence alignment

I have been trying to build an empirical codon substitution matrix from a multiple sequence alignment in FASTA format using Biopython.
It appears to be relatively straightforward for single-nucleotide substitution matrices using the AlignInfo module when the aligned sequences have the same length. Here is what I managed to do using Python 2.7:
#!/usr/bin/env python
import os
import argparse
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio import SubsMat
import sys
version = "0.0.1 (23.04.20)"
name = "Aln2SubMatrix.py"
parser=argparse.ArgumentParser(description="Outputs a codon substitution matrix given a multi-alignment in FastaFormat. Will raise error if alignments contain dots (\".\"), so replace those with dashes (\"-\") beforehand (e.g. using sed)")
parser.add_argument('-i','--input', action = "store", dest = "input", required = True, help = "(aligned) input fasta")
parser.add_argument('-o','--output', action = "store", dest = "output", help = "Output filename (default = <Input-file>.codonSubmatrix)")
args=parser.parse_args()
if not args.output:
    args.output = args.input + ".codonSubmatrix" #if no outputname was specified set outputname based on inputname
def main():
    infile = open(args.input, "r")
    outfile = open(args.output, "w")
    align = AlignIO.read(infile, "fasta")
    summary_align = AlignInfo.SummaryInfo(align)
    replace_info = summary_align.replacement_dictionary()
    mat = SubsMat.SeqMat(replace_info)
    print >> outfile, mat
    infile.close()
    outfile.close()
    sys.stderr.write("\nfinished\n")
main()
Using a multiple sequence alignment file in FASTA format with sequences of the same length (aln.fa), the output is a half-matrix giving the number of nucleotide substitutions observed in the alignment (note that gaps (-) are allowed):
python Aln2SubMatrix.py -i aln.fa
-     0
a   860  232
c   596   75  129
g   571  186   75  173
t   892   58  146   59  141
      -    a    c    g    t
What I am aiming to do is compute a similar empirical substitution matrix, but for all nucleotide triplets (codons) present in a multiple sequence alignment.
I have tried to tweak the _pair_replacement function of the AlignInfo module so that it accepts nucleotide triplets, by changing
lines 305 to 308
for residue_num in range(len(seq1)):
    residue1 = seq1[residue_num]
    try:
        residue2 = seq2[residue_num]
to
for residue_num in range(0, len(seq1), 3):
    residue1 = seq1[residue_num:residue_num+3]
    try:
        residue2 = seq2[residue_num:residue_num+3]
At this stage it can retrieve the codons from the alignment, but it complains about the alphabet (does the module only accept single-character alphabets?).
Note that
(i) I would like to get a substitution matrix that accounts for the three possible reading frames
Any help is highly appreciated.
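A minimal sketch of the counting step, assuming plain Python strings instead of Biopython's AlignInfo machinery (which avoids the alphabet check entirely). The function name, the decision to skip codons that contain gaps, and the toy alignment are all assumptions made for illustration:

```python
from collections import Counter
from itertools import combinations

def codon_substitution_counts(seqs, frame=0):
    """Count codon pairs seen at the same alignment position.

    seqs: equal-length aligned nucleotide strings.
    frame: 0, 1 or 2, the offset at which codons start.
    Returns a Counter keyed by an alphabetically sorted (codon, codon) pair.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have the same length")
    counts = Counter()
    for seq1, seq2 in combinations(seqs, 2):
        # Step three columns at a time; an incomplete trailing codon is skipped.
        for i in range(frame, length - 2, 3):
            c1 = seq1[i:i + 3].lower()
            c2 = seq2[i:i + 3].lower()
            if "-" in c1 or "-" in c2:
                continue  # assumption: codons containing gaps are ignored
            counts[tuple(sorted((c1, c2)))] += 1
    return counts

# Toy alignment, made up for illustration.
alignment = ["atggcc", "atggct", "atgacc"]
by_frame = {f: codon_substitution_counts(alignment, f) for f in range(3)}
print(by_frame[0])
```

Calling it once per frame, as in by_frame, gives one count table per reading frame; summing the three Counters would give a single frame-independent table if that is what is wanted.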

Analyzing protein sequences with the ProtParam module

I'm fairly new to Biopython. Right now, I'm trying to compute protein parameters for several protein sequences (more than 100) in FASTA format. However, I've found it difficult to parse the sequences correctly.
This is the code I'm using:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis
input_file = open("/Users/matias/Documents/Python/DOE.fasta", "r")
for record in SeqIO.parse(input_file, "fasta"):
    my_seq = str(record.seq)
    analyse = ProteinAnalysis(my_seq)
    print(analyse.molecular_weight())
But I'm getting this error message:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in molecular_weight
weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/Bio/SeqUtils/__init__.py", line 438, in <genexpr>
weight = sum(weight_table[x] for x in seq) - (len(seq) - 1) * water
KeyError: '\\'
Printing each sequence as a string shows that every sequence has a "\" at the end, but so far I haven't been able to remove it. Any ideas would be much appreciated.
That really shouldn't be there in your file, but if you can't get a clean input file, you can use my_seq = str(record.seq).rstrip('\\') to remove it at runtime.
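A minimal sketch of that clean-up as a reusable step, assuming the stray character is always a trailing backslash. The helper name clean_protein and the check against the 20 standard amino-acid letters are additions for illustration, not part of Biopython:

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def clean_protein(raw):
    """Strip whitespace and trailing backslashes, then report leftover symbols."""
    seq = raw.strip().rstrip("\\").upper()
    unexpected = sorted(set(seq) - VALID_RESIDUES)
    return seq, unexpected

seq, unexpected = clean_protein("MKVLAAGIVG\\")
print(seq, unexpected)
```

The cleaned string would then be passed to ProteinAnalysis in place of str(record.seq). A non-empty unexpected list flags records that would still raise a KeyError in molecular_weight().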

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it to the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
import re
dt = re.compile('^<DT>')
dd = re.compile('^<DD>')
# Start with empty buffers so the first non-bookmark line does not fail
current_dt = ""
current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        if re.match(dt, line):
            current_dt = line.strip()
        elif re.match(dd, line):
            current_dd = line
            tags = [w for w in line[4:].split(' ') if w.startswith('#')]
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t + ' ', '')
            if current_dd.strip() == '<DD>':
                current_dd = ""
        else:
            print(current_dt)
            print(current_dd)
            current_dt = ""
            current_dd = ""
print(current_dt)
print(current_dd)
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
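The tag/description split at the heart of this task can also be isolated and checked on its own. A minimal sketch, assuming tags only ever appear as leading #-prefixed words on the <DD> line; split_tags is a hypothetical helper, not part of the answer above:

```python
def split_tags(dd_line):
    """Split a Kippt '<DD>' line into (comma-joined tags, description)."""
    body = dd_line.strip()
    if body.startswith("<DD>"):
        body = body[len("<DD>"):]
    words = body.split()
    tags = []
    # Leading words that start with '#' are tags; the rest is the description.
    while words and words[0].startswith("#"):
        tags.append(words.pop(0)[1:])
    return ",".join(tags), " ".join(words)

print(split_tags("<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland"))
print(split_tags("<DD>#bug #tracking\n"))
```

Because the line is split on whitespace, the trailing newline never ends up glued to the last tag, and a '#' later in the description is left alone. Runs of spaces in the description are collapsed to single spaces.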
script.awk
BEGIN{FS="#"}
/^<DT>/{
    if(d==1) print "<DT>"s # for printing lines with no tags
    s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
    d=1
}
/^<DD>/{
    d=0
    m=match(s,/>/) # Find the end of the HREF descriptor (first match of ">")
    for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
    td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
    if(td==0){ # No description exists
        tags=substr(tags,2)
        tagdes=""
    }
    else{ # Description exists
        tagdes=substr(tags,td)
        tags=substr(tags,2,td-2)
    }
    print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
    print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

mysterious "'str' object is not callable" python error

I am currently making my first Python effort, a modification of some code written by a friend. I am using Python 2.6.6. The original piece of code, which works, extracts information from a log file of data from donations made by credit card to my nonprofit. My new version, should it one day work, will perform the same task for donations that were made by PayPal. The log files are similar, but have different field names and other differences.
The error messages I'm getting are:
Traceback (most recent call last):
  File "../logparse-paypal-1.py", line 196, in <module>
    convert_log(sys.argv[1], sys.argv[2], access_ids)
  File "../logparse-paypal-1.py", line 170, in convert_log
    output = [f(record, access_ids) for f in output_fns]
TypeError: 'str' object is not callable
I've read some of the posts on this forum related to this error message, but so far I'm still at sea. I can't find any consequential differences between the portions of my code that relate to the likely problem object (access_ids) and the code that I started with. All I did related to the access_ids table was remove some lines that printed the problems the script finds with the table (problems that cause it to ignore some data). Perhaps I changed a character or something while doing that, but I've looked and so far can't find anything.
The portion of the code that is producing these error messages is the following:
        # Use the output functions configured above to convert the
        # transaction record into a list of outputs to be emitted to
        # the CSV output file.
        print "Converting %s at %s to CSV" % (record["type"], record["time"])
        output = [f(record, access_ids) for f in output_fns]
        j = 0
        while j < len(output):
            os.write(csv_fd, output[j])
            if j < len(output) - 1:
                os.write(csv_fd, ",")
            else:
                os.write(csv_fd, "\n")
            j += 1
        convert_count += 1
    print "Converted %d approved transactions to CSV format, skipped %d non-approved transactions" % (convert_count, skip_count)
if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: logparse.py INPUT_FILE OUTPUT_FILE [ACCESS_IDS_FILE]"
        print
        print " INPUT_FILE Silent post log containing transaction records (must exist)"
        print " OUTPUT_FILE Filename for the CSV file to be created (must not exist, will be created)"
        print " ACCESS_IDS_FILE List of Access IDs and email addresses (optional, must exist if specified)"
        sys.exit(-1)
    access_ids = {}
    if len(sys.argv) > 3:
        access_ids = load_access_ids(sys.argv[3])
    convert_log(sys.argv[1], sys.argv[2], access_ids)
Line 170 is this one:
output = [f(record, access_ids) for f in output_fns]
and line 196 is this one:
convert_log(sys.argv[1], sys.argv[2], access_ids)
The access_ids definition, possibly related to the problem, is this:
def access_id_fn(record, access_ids):
    if "payer_email" in record and len(record["payer_email"]) > 0:
        if record["payer_email"] in access_ids:
            return '"' + access_ids[record["payer_email"]] + '"'
        else:
            return ""
    else:
        return ""
AND
def load_access_ids(filename):
    print "Loading Access IDs from %s..." % filename
    access_ids = {}
    for line in open(filename, "r"):
        line = line.rstrip()
        access_id, email = [s.strip() for s in line.split(None, 1)]
        if not email_address.match(email):
            continue
        if email in access_ids:
            access_ids[string.strip(email)] = string.strip(access_id)
    return access_ids
Thanks in advance for any advice with this.
Dave
I'm not seeing anything offhand, but you did mention that the log files are similar, which I take to mean that there are differences between the two.
Can you post a line from each?
I would double-check the data in the log files and make sure that what you think is being read in is correct. It looks to me as though a piece of data is being read in that, somewhere, breaks what the code expects.
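The traceback itself narrows things down a little: the failing expression is f(record, access_ids), so whatever is "not callable" is some f drawn from output_fns, not access_ids. Here is a minimal sketch of how one might locate such an entry. The contents of output_fns are invented for illustration, since the real list is not shown in the question, and callable() behaves the same way in Python 2.6:

```python
def access_id_fn(record, access_ids):
    """Stand-in for one of the real output functions."""
    return access_ids.get(record.get("payer_email", ""), "")

# Hypothetical list: the second entry is a field name left in by mistake,
# for example where a function or lambda was intended.
output_fns = [access_id_fn, "payer_email"]

def non_callable_entries(fns):
    """Return (index, value) for every entry that cannot be called."""
    return [(i, f) for i, f in enumerate(fns) if not callable(f)]

bad = non_callable_entries(output_fns)
print(bad)
```

Running a check like this just before the list comprehension on line 170 would show which position in output_fns holds a string.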
