Get position of token in BERT's output layer - huggingface-transformers

We are interested in the BERT vectors for each token. By "BERT vector" we mean the vector for a specific token in BERT's output layer, so we would like to find out which token produces which vector. We wrote some code, but we are not sure whether it is correct or how to test it.
In the code we process a sentence with BERT. We construct a list of position ids and hand them to the model. Afterwards we use the same position ids to map the tokens to the output layer. Finally, some code calculates the character offsets of each token's vector in the input sentence.
Is this the correct way to use position_ids to generate
from transformers import BertModel, BertConfig, BertTokenizer
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
def sentence_to_vector(input_sentence):
    tokens_encoded = tokenizer.encode(input_sentence, add_special_tokens=True)
    input_ids = torch.tensor(tokens_encoded).unsqueeze(0) # Batch size 1
    seq_length = input_ids.size(1)
    # code to construct position_ids from here:
    # https://github.com/huggingface/transformers/blob/8da280ebbeca5ebd7561fd05af78c65df9161f92/pytorch_pretrained_bert/modeling.py#L188:L189
    position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
    position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
    outputs = model(input_ids, position_ids=position_ids)
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    # from the BertModel documentation (example at the bottom):
    # The last hidden-state is the first element of the output tuple
    # https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel
    #ttv = {} # token to vector
    #for i in position_ids[0]:
    #    ttv[tokens[i]] = outputs[0][0][position_ids[0][i]]
    data = []
    last_offset = 0
    for i in range(0, len(position_ids[0])):
        token = tokens[position_ids[0][i]]
        vector = outputs[0][0][position_ids[0][i]]
        pos_begin = None
        pos_end = None
        if not token == "[CLS]" and not token == "[SEP]":
            pos_begin = input_sentence.find(token, last_offset)
            pos_end = pos_begin + len(token)
            last_offset = pos_end
        data.append({
            "token": token,
            "pos_begin": pos_begin,
            "pos_end": pos_end,
            "vector": vector
        })
    return data
input_sentence = "do the chicken dance!"
data = sentence_to_vector(input_sentence)
for token in data:
    print(token["token"] + "\t" + str(token["pos_begin"]) + "\t" + str(token["pos_end"]) + "\t" + str(token["vector"][0:3]) + "..." )
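One thing to watch in the offset logic above: WordPiece marks continuation pieces with a leading "##", and bert-base-uncased lowercases its input, so searching the raw sentence for the token text can return -1. Below is a minimal sketch of a more forgiving lookup. It is plain Python, the helper name wordpiece_offsets is hypothetical, and the token lists are written by hand for illustration rather than produced by the tokenizer. It also assumes lowercasing is the only normalization, which does not hold for accented text.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def wordpiece_offsets(sentence, tokens):
    """Return one (begin, end) pair per token; (None, None) for special or unmatched tokens."""
    haystack = sentence.lower()  # assumption: the tokenizer only lowercased the text
    offsets = []
    cursor = 0
    for token in tokens:
        if token in SPECIAL_TOKENS:
            offsets.append((None, None))
            continue
        # Continuation pieces carry a "##" prefix that is not part of the sentence.
        piece = token[2:] if token.startswith("##") else token
        begin = haystack.find(piece, cursor)
        if begin == -1:
            # e.g. "[UNK]" or text changed by normalization: leave it unmapped
            offsets.append((None, None))
            continue
        end = begin + len(piece)
        offsets.append((begin, end))
        cursor = end
    return offsets

# Hand-written token list, shaped like BERT output for the sentence in the question.
tokens = ["[CLS]", "do", "the", "chicken", "dance", "!", "[SEP]"]
offsets = wordpiece_offsets("do the chicken dance!", tokens)
for token, span in zip(tokens, offsets):
    print(token, span)
```

Regarding position_ids themselves: passing them should be optional, since BertModel builds the same 0..seq_length-1 range when none are given, and row i of the last hidden state corresponds to input token i either way. With a fast tokenizer, encoding with return_offsets_mapping=True should return the character spans directly and avoid the string search altogether.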

Related

Understanding the distance metric in company name matching using KNN

I am trying to understand the following code that I found for matching a messy list of company names to a clean list of company names. My question is how the 'Ratio' metric is calculated. It appears that the ratio comes from scorer = fuzz.token_sort_ratio, which I understand is part of the fuzzywuzzy package and is therefore a Levenshtein distance calculation, correct? I'm trying to understand why the author uses this as the scorer rather than the distance output from KNN. When I try changing the metric inside NearestNeighbors, it doesn't appear to change the results. Does the metric in NearestNeighbors matter, then?
Original article:
https://audhiaprilliant.medium.com/fuzzy-string-matching-optimization-using-tf-idf-and-knn-b07fce69b58f
def build_vectorizer(
    clean: pd.Series,
    analyzer: str = 'char',
    ngram_range: Tuple[int, int] = (1, 4),
    n_neighbors: int = 1,
    **kwargs
) -> Tuple:
    # Create vectorizer
    vectorizer = TfidfVectorizer(analyzer = analyzer, ngram_range = ngram_range, **kwargs)
    X = vectorizer.fit_transform(clean.values.astype('U'))
    # Fit nearest neighbors corpus
    nbrs = NearestNeighbors(n_neighbors = n_neighbors, metric = 'cosine').fit(X)
    return vectorizer, nbrs
# String matching - KNN
def tfidf_nn(
    messy,
    clean,
    n_neighbors = 1,
    **kwargs
):
    # Fit clean data and transform messy data
    vectorizer, nbrs = build_vectorizer(clean, n_neighbors = n_neighbors, **kwargs)
    input_vec = vectorizer.transform(messy)
    # Determine best possible matches
    distances, indices = nbrs.kneighbors(input_vec, n_neighbors = n_neighbors)
    nearest_values = np.array(clean)[indices]
    return nearest_values, distances
# String matching - match fuzzy
def find_matches_fuzzy(
    row,
    match_candidates,
    limit = 5
):
    row_matches = process.extract(
        row, dict(enumerate(match_candidates)),
        scorer = fuzz.token_sort_ratio,
        limit = limit
    )
    result = [(row, match[0], match[1]) for match in row_matches]
    return result
# String matching - TF-IDF
def fuzzy_nn_match(
    messy,
    clean,
    column,
    col,
    n_neighbors = 100,
    limit = 5, **kwargs):
    nearest_values, _ = tfidf_nn(messy, clean, n_neighbors, **kwargs)
    results = [find_matches_fuzzy(row, nearest_values[i], limit) for i, row in enumerate(messy)]
    df = pd.DataFrame(itertools.chain.from_iterable(results),
        columns = [column, col, 'Ratio']
    )
    return df
# String matching - Fuzzy
def fuzzy_tf_idf(
    df: pd.DataFrame,
    column: str,
    clean: pd.Series,
    mapping_df: pd.DataFrame,
    col: str,
    analyzer: str = 'char',
    ngram_range: Tuple[int, int] = (1, 3)
) -> pd.Series:
    # Create vectorizer
    clean = clean.drop_duplicates().reset_index(drop = True)
    messy_prep = df[column].drop_duplicates().dropna().reset_index(drop = True).astype(str)
    messy = messy_prep.apply(preprocess_string)
    result = fuzzy_nn_match(messy = messy, clean = clean, column = column, col = col, n_neighbors = 1)
    # Map value from messy to clean
    return result
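In the code above, fuzzy_nn_match discards the KNN distances (nearest_values, _ = tfidf_nn(...)), so the 'Ratio' column comes only from the fuzzy scorer; the NearestNeighbors metric can only influence which candidates are handed to it. The sketch below approximates what a token-sort ratio does using only the standard library: tokens are lowercased, sorted, rejoined, and compared with difflib.SequenceMatcher. The function name and the normalization are assumptions for illustration, and the scores may differ from those of fuzzywuzzy.

```python
import re
from difflib import SequenceMatcher

def approx_token_sort_ratio(a, b):
    """Rough stand-in for a token-sort ratio: an integer from 0 to 100."""
    def normalize(text):
        # Lowercase, drop punctuation, then sort tokens so word order is ignored.
        tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
        return " ".join(sorted(tokens))
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())

# Same tokens in a different order score as identical.
print(approx_token_sort_ratio("Acme Corp.", "corp acme"))
# Unrelated names score lower.
print(approx_token_sort_ratio("Acme Corp.", "Apex Holdings"))
```

With n_neighbors = 1, as passed in fuzzy_tf_idf, each messy name receives a single candidate, so a different metric shows up in the results only if it changes which clean name is nearest. TfidfVectorizer L2-normalizes rows by default, and for unit vectors Euclidean and cosine distance rank neighbors identically, which would explain unchanged results when switching between those two.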

how to add symbols to the multiple stock data

I have scraped the data with the code below. Now I want to add a column of symbols to the respective company data. Please guide me on how the symbol can be added to each firm's data.
Code below:
from time import sleep
import pandas as pd
import os
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install())
symbols =['FATIMA',
'SSGC',
'FCCL',
'ISL',
'KEL',
'NCL',
'DGKC',
'SNGP',
'NML',
'ENGRO',
'HUMNL',
'CHCC',
'ATRL',
'HUBC',
'ASTL',
'PIBTL',
'OGDC',
'EFERT',
'FFC',
'NCPL',
'KTML',
'PSO',
'LUCK',
'SEARL',
'KOHC',
'ABOT',
'AICL',
'HASCOL',
'PTC',
'KAPCO',
'PIOC',
'POL',
'SHEL',
'GHGL',
'HCAR',
'DCR',
'BWCL',
'MTL',
'GLAXO',
'PKGS',
'SHFA','MARI',
'ICI',
'ACPL',
'PSMC',
'SPWL',
'THALL',
'BNWM',
'EFUG',
'GADT',
'AABS']
company = 1
for ThisSymbol in symbols :
    # Get first symbol from the above python list
    company = 2
    # In the URL, make symbol as variable
    url = 'http://www.scstrade.com/stockscreening/SS_CompanySnapShotYF.aspx?symbol=' + ThisSymbol
    browser.get(url)
    sleep(2)
    # The below command will get all the contents from the url
    html = browser.execute_script("return document.documentElement.outerHTML")
    # So we will supply the contents to beautiful soup and we tell to consider this text as a html, with the following command
    soup = BeautifulSoup (html, "html.parser")
    for rn in range(0,9) :
        plist = []
        r = soup.find_all('tr')[rn]
        # Condition: if first row, then th, otherwise td
        if (rn==0) :
            celltag = 'th'
        else :
            celltag = 'td'
        # Now use the celltag instead of using fixed td or th
        col = r.find_all(celltag)
        print()
        if col[i] == 0:
            print ("")
        else:
            for i in range(0,4) :
                cell = col[i].text
                clean = cell.replace('\xa0 ', '')
                clean = clean.replace (' ', '')
                plist.append(clean)
        # If first row, create df, otherwise add to it
        if (rn == 0) :
            df = pd.DataFrame(plist)
        else :
            df2 = pd.DataFrame(plist)
            colname = 'y' + str(2019-rn)
            df[colname] = df2
    if (company == 1):
        dft = df.T
        # Get header Column
        head = dft.iloc[0]
        # Exclude first row from the data
        dft = dft[1:]
        dft.columns = head
        dft = dft.reset_index()
        # Assign Headers
        dft = dft.drop(['index'], axis = 'columns')
    else:
        dft2 = df.T
        # Get header Column
        head = dft2.iloc[0]
        # Exclude first row from the data
        dft2 = dft2[1:]
        dft2.columns = head
        dft2 = dft2.reset_index()
        # Assign Headers
        dft2 = dft2.drop(['index'], axis = 'columns')
        dft['Symbol'] = ThisSymbol
        dft = dft.append(dft2, sort=['Year','Symbol'])
    company = company +1
dft
My output looks like this. I want to have a Symbol column for each respective firm's data. To add the symbol, I used
dft['Symbol'] = ThisSymbol
but it adds just the first company from the list to all companies' data.
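A common pattern for this is to tag each company's table with its own symbol before combining, rather than assigning to the table that accumulates all companies. Below is a minimal sketch in plain Python; fetch_rows is a hypothetical stand-in for the scraping step and returns made-up rows, and the field names are illustrative only.

```python
def fetch_rows(symbol):
    """Hypothetical stand-in for scraping one company's table (made-up values)."""
    return [
        {"Year": 2018, "Value": len(symbol)},
        {"Year": 2017, "Value": len(symbol) + 1},
    ]

def collect(symbols):
    all_rows = []
    for symbol in symbols:
        for row in fetch_rows(symbol):
            # Tag the row while `symbol` still refers to the company it came from.
            all_rows.append({"Symbol": symbol, **row})
    return all_rows

rows = collect(["FATIMA", "SSGC"])
for row in rows:
    print(row)
```

In the pandas version this corresponds to setting the Symbol column on the per-company frame (dft2 in the code above) before it is appended. Assigning to dft instead writes the current symbol into every row collected so far.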

How can I get the score from Question-Answer Pipeline? Is there a bug when Question-answer pipeline is used?

When I run the following code
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
As checked Dis is not yet on boarded to ARB portal, hence we cannot upload the invoices in portal
"""
questions = [
    "Dis asked if it is possible to post the two invoice in ARB.I have not access so I wanted to check if you would be able to do it.",
]
for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer_start_scores, answer_end_scores = model(**inputs)
    answer_start = torch.argmax(
        answer_start_scores
    ) # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
The answer that I get here is:
Question: Dis asked if it is possible to post the two invoice in ARB.I have not access so I wanted to check if you would be able to do it.
Answer: dis is not yet on boarded to ARB portal
How do I get a score for this answer? The score I mean is similar to the one I get when I run the question-answering pipeline.
I have to take this approach because the question-answering pipeline gives me a KeyError for the code below:
from transformers import pipeline
nlp = pipeline("question-answering")
context = r"""
As checked Dis is not yet on boarded to ARB portal, hence we cannot upload the invoices in portal.
"""
print(nlp(question="Dis asked if it is possible to post the two invoice in ARB?", context=context))
This is my attempt to get the score. I cannot figure out what feature.p_mask is, so at the moment I cannot remove the non-context indexes that contribute to the softmax.
# ... assuming imports and question and context
model_name="deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
inputs = tokenizer(question, context,
add_special_tokens=True,
return_tensors='pt')
input_ids = inputs['input_ids'].tolist()[0]
outputs = model(**inputs)
# used to compute score
start = outputs.start_logits.detach().numpy()
end = outputs.end_logits.detach().numpy()
# from source code
# Ensure padded tokens & question tokens cannot belong to the set of candidate answers.
#?? undesired_tokens = np.abs(np.array(feature.p_mask) - 1) & feature.attention_mask
# Generate mask
undesired_tokens = inputs['attention_mask']
undesired_tokens_mask = undesired_tokens == 0.0
# Make sure non-context indexes in the tensor cannot contribute to the softmax
start_ = np.where(undesired_tokens_mask, -10000.0, start)
end_ = np.where(undesired_tokens_mask, -10000.0, end)
# Normalize logits and spans to retrieve the answer
start_ = np.exp(start_ - np.log(np.sum(np.exp(start_), axis=-1, keepdims=True)))
end_ = np.exp(end_ - np.log(np.sum(np.exp(end_), axis=-1, keepdims=True)))
# Compute the score of each tuple(start, end) to be the real answer
outer = np.matmul(np.expand_dims(start_, -1), np.expand_dims(end_, 1))
# Remove candidate with end < start and end - start > max_answer_len
max_answer_len = 15
candidates = np.tril(np.triu(outer), max_answer_len - 1)
scores_flat = candidates.flatten()
idx_sort = [np.argmax(scores_flat)]
start, end = np.unravel_index(idx_sort, candidates.shape)[1:]
end += 1
score = candidates[0, start, end-1]
start, end, score = start.item(), end.item(), score.item()
print(tokenizer.decode(input_ids[start:end]))
print(score)
See more source code
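Judging by the source comments quoted above, the score is the product of the softmax probabilities of the chosen start and end tokens, computed after tokens that cannot belong to the answer (question tokens and padding) are masked out. The sketch below reproduces that computation in plain Python on toy logits. The function names, the logits, and the context mask are made up for illustration; this is not the pipeline's own code.

```python
import math

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, context_mask, max_answer_len=15):
    """Return (start, end_inclusive, score) for the most probable span inside the context."""
    # Push logits outside the context far down so they get ~0 probability.
    start_p = softmax([s if keep else -10000.0 for s, keep in zip(start_logits, context_mask)])
    end_p = softmax([e if keep else -10000.0 for e, keep in zip(end_logits, context_mask)])
    best = (0, 0, 0.0)
    for i, ps in enumerate(start_p):
        # Only spans with start <= end and at most max_answer_len tokens.
        for j in range(i, min(i + max_answer_len, len(end_p))):
            score = ps * end_p[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Toy example: 6 tokens, the first three belong to the question.
start_logits = [5.0, 0.1, 0.2, 2.0, 0.5, 0.1]
end_logits = [4.0, 0.3, 0.1, 0.2, 3.0, 0.4]
context_mask = [False, False, False, True, True, True]
start, end, score = best_span(start_logits, end_logits, context_mask)
print(start, end, round(score, 3))
```

Note that a mask built from attention_mask alone, as in the attempt above, excludes padding but not the question tokens. With a fast tokenizer, inputs.sequence_ids(0) should report which tokens belong to the second sequence (the context), which is one way to build the missing mask.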

Steganography program - converting python 2 to 3, syntax error in: base64.b64decode("".join(chars))

I have a problem with the syntax in the last part of a steganography program. I tried to convert the working Python 2 version of the code to Python 3, and this is the last part of it:
flag = base64.b64decode("".join(chars)) <- error
print(flag)
The program (1) encodes the message in the least significant bits of the image and saves it as a new image, then (2) extracts the message, which is stored in "flag", and prints it.
Can the error be caused by the wrong type of input?
message = input("Your message: ")
BELOW: UNHIDING PROGRAM
#coding: utf-8
import base64
from PIL import Image
image = Image.open("after.png")
extracted = ''
pixels = image.load()
#Iterating in 1st row
for x in range(0,image.width):
    r,g,b = pixels[x,0]
    # Storing LSB of each color
    extracted += bin(r)[-1]
    extracted += bin(g)[-1]
    extracted += bin(b)[-1]
chars = []
for i in range(len(extracted)/8):
    byte = extracted[i*8:(i+1)*8]
    chars.append(chr(int(''.join([str(bit) for bit in byte]), 2)))
flag = base64.b64decode(''.join(chars))
print flag
BELOW: HIDING PROGRAM:
import bitarray
import base64
from PIL import Image
with Image.open('before.png') as im:
    pixels=im.load()
message = input("Your message: ")
encoded_message = base64.b64encode(message.encode('utf-8'))
#Convert the message into an array of bits
ba = bitarray.bitarray()
ba.frombytes(encoded_message)
bit_array = [int(i) for i in ba]
#Duplicate the original picture
im = Image.open("before.png")
im.save("after.png")
im = Image.open("after.png")
width, height = im.size
pixels = im.load()
#Hide message in the first row
i = 0
for x in range(0,width):
    r,g,b = pixels[x,0]
    #print("[+] Pixel : [%d,%d]"%(x,0))
    #print("[+] \tBefore : (%d,%d,%d)"%(r,g,b))
    #Default values in case no bit has to be modified
    new_bit_red_pixel = 255
    new_bit_green_pixel = 255
    new_bit_blue_pixel = 255
    if i<len(bit_array):
        #Red pixel
        r_bit = bin(r)
        r_last_bit = int(r_bit[-1])
        r_new_last_bit = r_last_bit & bit_array[i]
        new_bit_red_pixel = int(r_bit[:-1]+str(r_new_last_bit),2)
        i += 1
    if i<len(bit_array):
        #Green pixel
        g_bit = bin(g)
        g_last_bit = int(g_bit[-1])
        g_new_last_bit = g_last_bit & bit_array[i]
        new_bit_green_pixel = int(g_bit[:-1]+str(g_new_last_bit),2)
        i += 1
    if i<len(bit_array):
        #Blue pixel
        b_bit = bin(b)
        b_last_bit = int(b_bit[-1])
        b_new_last_bit = b_last_bit & bit_array[i]
        new_bit_blue_pixel = int(b_bit[:-1]+str(b_new_last_bit),2)
        i += 1
    pixels[x,0] = (new_bit_red_pixel,new_bit_green_pixel,new_bit_blue_pixel)
    #print("[+] \tAfter: (%d,%d,%d)"%(new_bit_red_pixel,new_bit_green_pixel,new_bit_blue_pixel))
im.save('after.png')
error
ValueError: string argument should contain only ASCII characters
help for base64.b64decode says:
b64decode(s, altchars=None, validate=False)
Decode the Base64 encoded bytes-like object or ASCII string s.
...
Considering that Python 2 had "normal" strs and unicode strs (u-prefixed), I suggest taking a closer look at what "".join(chars) produces. Does it contain solely ASCII characters?
I suggest adding:
print("Codes:",[ord(c) for c in chars])
directly before:
flag = base64.b64decode("".join(chars))
If any of the codes is greater than 127, decoding might not work, because a str argument must contain only ASCII characters.
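Alongside the ASCII check, a few other details of the posted code are worth a look. In Python 3, len(extracted)/8 is a float, which range() rejects, so integer division (//) is needed. Bits read beyond the hidden message decode to arbitrary characters. The hiding code also combines the old last bit with the message bit using &, which can never turn a 0 into a 1. The sketch below is a standard-library round trip on a plain list of integers standing in for pixel channel values. The helper names are hypothetical, the channel values are made up, and the number of hidden bits is assumed to be known when reading back.

```python
import base64

def to_bits(data):
    """Bytes -> list of 0/1 ints, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]

def from_bits(bits):
    """List of 0/1 ints -> bytes; trailing bits that do not fill a byte are dropped."""
    out = bytearray()
    for i in range(len(bits) // 8):  # integer division: range() rejects floats
        byte = 0
        for bit in bits[i * 8:(i + 1) * 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def hide(channels, bits):
    """Overwrite the least significant bit of the first len(bits) values."""
    hidden = list(channels)
    for i, bit in enumerate(bits):
        hidden[i] = (hidden[i] & ~1) | bit  # clear the LSB, then set it to the message bit
    return hidden

def unhide(channels, n_bits):
    return [value & 1 for value in channels[:n_bits]]

message = "hello"
payload = base64.b64encode(message.encode("utf-8"))
bits = to_bits(payload)
channels = [(37 * k) % 256 for k in range(len(bits) + 10)]  # made-up channel values
stego = hide(channels, bits)
flag = base64.b64decode(from_bits(unhide(stego, len(bits)))).decode("utf-8")
print(flag)
```

Passing bytes rather than a str to b64decode sidesteps the ASCII check, but the decoded result is only meaningful if exactly the hidden bits are read back.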

Discrepancies in gensim doc2vec embedding vectors

I use the gensim Doc2Vec package to train doc2vec embeddings. I would expect two models trained with identical parameters and data to have very close doc2vec vectors. However, in my experience this is only true for doc2vec trained in PV-DBOW mode without training word embeddings (dbow_words = 0).
For PV-DM and for PV-DBOW with dbow_words = 1, i.e. in every case where word embeddings are trained along with doc2vec, the doc2vec embedding vectors of identically trained models are fairly different.
Here is my code:
from sklearn.datasets import fetch_20newsgroups
from gensim import models
import scipy.spatial.distance as distance
import numpy as np
from nltk.corpus import stopwords
from string import punctuation
def clean_text(texts, min_length = 2):
    clean = []
    #don't remove apostrophes
    translator = str.maketrans(punctuation.replace('\'',' '), ' '*len(punctuation))
    for text in texts:
        text = text.translate(translator)
        tokens = text.split()
        # remove not alphabetic tokens
        tokens = [word.lower() for word in tokens if word.isalpha()]
        # filter out stop words
        stop_words = stopwords.words('english')
        tokens = [w for w in tokens if not w in stop_words]
        # filter out short tokens
        tokens = [word for word in tokens if len(word) >= min_length]
        tokens = ' '.join(tokens)
        clean.append(tokens)
    return clean
def tag_text(all_text, tag_type =''):
    tagged_text = []
    for i, text in enumerate(all_text):
        tag = tag_type + '_' + str(i)
        tagged_text.append(models.doc2vec.TaggedDocument(text.split(), [tag]))
    return tagged_text
def train_docvec(dm, dbow_words, min_count, epochs, training_data):
    model = models.Doc2Vec(dm=dm, dbow_words = dbow_words, min_count = min_count)
    model.build_vocab(tagged_data)
    model.train(training_data, total_examples=len(training_data), epochs=epochs)
    return model
def compare_vectors(vector1, vector2):
    cos_distances = []
    for i in range(len(vector1)):
        d = distance.cosine(vector1[i], vector2[i])
        cos_distances.append(d)
    print (np.median(cos_distances))
    print (np.std(cos_distances))
dataset = fetch_20newsgroups(shuffle=True, random_state=1,remove=('headers', 'footers', 'quotes'))
n_samples = len(dataset.data)
data = clean_text(dataset.data)
tagged_data = tag_text(data)
data_labels = dataset.target
data_label_names = dataset.target_names
model_dbow1 = train_docvec(0, 0, 4, 30, tagged_data)
model_dbow2 = train_docvec(0, 0, 4, 30, tagged_data)
model_dbow3 = train_docvec(0, 1, 4, 30, tagged_data)
model_dbow4 = train_docvec(0, 1, 4, 30, tagged_data)
model_dm1 = train_docvec(1, 0, 4, 30, tagged_data)
model_dm2 = train_docvec(1, 0, 4, 30, tagged_data)
compare_vectors(model_dbow1.docvecs, model_dbow2.docvecs)
> 0.07795828580856323
> 0.02610614028793008
compare_vectors(model_dbow1.docvecs, model_dbow3.docvecs)
> 0.6476179957389832
> 0.14797587172616306
compare_vectors(model_dbow3.docvecs, model_dbow4.docvecs)
> 0.19878000020980835
> 0.06362519480831186
compare_vectors(model_dm1.docvecs, model_dm2.docvecs)
> 0.13536489009857178
> 0.045365127475424386
compare_vectors(model_dbow1.docvecs, model_dm1.docvecs)
> 0.6358324736356735
> 0.15150255674571805
UPDATE
I tried, as suggested by gojomo, to compare the differences between the vectors, and, unfortunately, those are even worse:
def compare_vector_differences(vector1, vector2):
    diff1 = []
    diff2 = []
    for i in range(len(vector1)-1):
        diff1.append( vector1[i+1] - vector1[i])
    for i in range(len(vector2)-1):
        diff2.append(vector2[i+1] - vector2[i])
    cos_distances = []
    for i in range(len(diff1)):
        d = distance.cosine(diff1[i], diff2[i])
        cos_distances.append(d)
    print (np.median(cos_distances))
    print (np.std(cos_distances))
compare_vector_differences(model_dbow1.docvecs, model_dbow2.docvecs)
> 0.1134452223777771
> 0.02676398444178949
compare_vector_differences(model_dbow1.docvecs, model_dbow3.docvecs)
> 0.8464127033948898
> 0.11423789350773429
compare_vector_differences(model_dbow4.docvecs, model_dbow3.docvecs)
> 0.27400463819503784
> 0.05984108730423529
SECOND UPDATE
This time, after I finally understood gojomo's suggestion, things look fine.
def compare_distance_differences(vector1, vector2):
    diff1 = []
    diff2 = []
    for i in range(len(vector1)-1):
        diff1.append( distance.cosine(vector1[i+1], vector1[i]))
    for i in range(len(vector2)-1):
        diff2.append( distance.cosine(vector2[i+1], vector2[i]))
    diff_distances = []
    for i in range(len(diff1)):
        diff_distances.append(abs(diff1[i] - diff2[i]))
    print (np.median(diff_distances))
    print (np.std(diff_distances))
compare_distance_differences(model_dbow1.docvecs, model_dbow2.docvecs)
>0.017469733953475952
>0.01659284710785352
compare_distance_differences(model_dbow1.docvecs, model_dbow3.docvecs)
>0.0786697268486023
>0.06092163158218411
compare_distance_differences(model_dbow3.docvecs, model_dbow4.docvecs)
>0.02321992814540863
>0.023095123172320778
The doc-vectors (or word-vectors) of Doc2Vec & Word2Vec models are only meaningfully comparable to other vectors that were co-trained, in the same interleaved training sessions.
Otherwise, randomness introduced by the algorithms (random initialization and random sampling) and by slight differences in training order (from multithreading) will cause the trained positions of individual vectors to wander to arbitrarily different positions. Their relative distances and directions to other vectors that shared interleaved training should be about equally useful from one model to the next.
But there's no one right place for such a vector, and measuring the differences between the vector for document '1' (or word 'foo') in one model, and the corresponding vector in another model, isn't reflective of anything the models/algorithms are trained to provide.
There's more information in the Gensim FAQ:
Q11: I've trained my Word2Vec/Doc2Vec/etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake?
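A toy illustration of this point, using only the standard library: rotating a set of 2-D vectors moves every individual vector, yet leaves the cosine distances between vectors of the same set unchanged. The rotation is only an idealized stand-in for run-to-run randomness (real models will not differ by an exact rotation), and the vectors are made up.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rotate(v, angle):
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

model_a = [(1.0, 0.2), (0.4, 0.9), (-0.7, 0.5)]      # made-up "doc-vectors"
model_b = [rotate(v, math.pi / 2) for v in model_a]   # same geometry, different coordinates

# Same document in two models: large distance although nothing is wrong.
cross = [cosine_distance(a, b) for a, b in zip(model_a, model_b)]
# Neighbouring documents within each model: the distances agree.
within_a = [cosine_distance(model_a[i], model_a[i + 1]) for i in range(len(model_a) - 1)]
within_b = [cosine_distance(model_b[i], model_b[i + 1]) for i in range(len(model_b) - 1)]
print(cross)
print(within_a)
print(within_b)
```

This mirrors compare_distance_differences from the second update in the question, which compares distances within each model instead of comparing the same document's vector across models.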
