Gensim - iterating over multiple documents

I'm trying to follow the Q6 recipe shown here, but my corpus keeps getting returned as [] even though I've checked and it does seem to be reading the document correctly.
So my code is:
import os
from gensim import corpora, utils

def iter_documents(top_directory):
    """Iterate over all documents, yielding a document (=list of utf8 tokens) at a time."""
    for root, dirs, files in os.walk(top_directory):
        for file in filter(lambda file: file.endswith('.txt'), files):
            document = open(os.path.join(root, file)).read() # read the entire document, as one big string
            yield utils.tokenize(document, lower=True) # or whatever tokenization suits you

class MyCorpus(object):
    # Used to create the object
    def __init__(self, top_dir):
        self.top_dir = top_dir
        self.dictionary = corpora.Dictionary(iter_documents(top_dir))
        self.dictionary.filter_extremes(no_below=1, keep_n=30000) # check API docs for pruning params

    # Used if you ever need to iterate through the values
    def __iter__(self):
        for tokens in iter_documents(self.top_dir):
            yield self.dictionary.doc2bow(tokens)
and the text file I'm using to test is this.

OK, I figured it out. Change the filter_extremes line to this: self.dictionary.filter_extremes(no_below=0, no_above=1, keep_n=30000)
Because I only have one document to start with, everything was being filtered out: with a single document every token appears in 100% of the documents, so the default no_above cutoff (0.5) removes them all.
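A quick way to confirm the fix (a minimal sketch; the directory path is just a placeholder):

corpus = MyCorpus('/path/to/my/texts')  # directory containing the .txt test file
print(len(corpus.dictionary))           # should now be > 0 instead of an empty dictionary
for bow in corpus:
    print(bow)                          # e.g. [(0, 1), (1, 2), ...] rather than []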

Related

Adding new tokens to BERT/RoBERTa while retaining tokenization of adjacent tokens

I'm trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word. The idea is to fine-tune the models on a limited set of sentences with the new word, and then see what it predicts about the word in other, different contexts, to examine the state of the model's knowledge of certain properties of language.
In order to do this, I'd like to add the new tokens and essentially treat them like new ordinary words (that the model just hasn't happened to encounter yet). They should behave exactly like normal words once added, with the exception that their embedding matrices will be randomly initialized and then be learned during fine-tuning.
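For context, the flow I have in mind is the usual add-then-resize pattern (a minimal sketch; the checkpoint and the new word are just the ones used in the examples below):

from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

tokenizer.add_tokens('mynewword')              # register the new token with the tokenizer
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix; the new row is randomly initialized
# ...then fine-tune on the small set of sentences containing 'mynewword'...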
However, I'm running into some issues doing this. In particular, the tokens surrounding the newly added tokens do not behave as expected when initializing the tokenizer with do_basic_tokenize=False in the case of BERT (in the case of RoBERTa, changing this setting doesn't seem to affect the output in the examples here). The problem can be observed in the following example: in the case of BERT, the period following the newly added token is not tokenized as a subword (i.e., it is tokenized as . instead of as the expected ##.), and in the case of RoBERTa, the word following the newly added subword is treated as though it does not have a preceding space (i.e., it is tokenized as a instead of as Ġa).
from transformers import BertTokenizer, RobertaTokenizer
new_word = 'mynewword'
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert.tokenize('testing.')
# ['testing', '##.']
bert.add_tokens(new_word)
bert.tokenize('mynewword') # now it does
# ['mynewword']
bert.tokenize('mynewword.')
# ['mynewword', '.']
roberta = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']
roberta.add_tokens(new_word)
roberta.tokenize('mynewword') # now it does
# ['mynewword']
roberta.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']
Is there a way for me to add the new tokens while getting the behavior of the surrounding tokens to match what it would be if there were no added token there? I feel like it's important because the model could end up learning that (for instance) the new token can occur before ., while most others can only occur before ##. That seems like it would affect how it generalizes. In addition, I could turn on basic tokenization to solve the BERT problem here, but that wouldn't really reflect the full state of the model's knowledge, since it collapses the distinction between different tokens. And it doesn't help with the RoBERTa problem, which is still there regardless.
In addition, I'd ideally be able to add the RoBERTa token as Ġmynewword, but I'm assuming that as long as it never occurs as the first word in a sentence, that shouldn't matter.
After continuing to try and figure this out, I seem to have found something that might work. It's not necessarily generalizable, but one can load a tokenizer from a vocabulary file (+ a merges file for RoBERTa). If you manually edit those files to add the new tokens in the right way, everything seems to work as expected. Here's an example for BERT:
from transformers import BertTokenizer

bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bert.tokenize('testing.')  # ['testing', '##.']
bert.tokenize('mynewword') # ['my', '##ne', '##w', '##word']

bert_vocab = bert.get_vocab() # get the pretrained tokenizer's vocabulary
bert_vocab.update({'mynewword': len(bert_vocab)}) # add the new word to the end

with open('vocab.tmp', 'w', encoding='utf-8') as tmp_vocab_file:
    tmp_vocab_file.write('\n'.join(bert_vocab))

new_bert = BertTokenizer(name_or_path='bert-base-uncased', vocab_file='vocab.tmp', do_basic_tokenize=False)
new_bert.model_max_length = 512 # match this setting on the pretrained tokenizer

new_bert.tokenize('mynewword')  # ['mynewword']
new_bert.tokenize('mynewword.') # ['mynewword', '##.']

import os
os.remove('vocab.tmp') # cleanup
RoBERTa is much harder, since we also have to add the pairs to merges.txt. I have a way of doing this that works for the new tokens, but unfortunately it can affect the tokenization of words that are subparts of the new tokens, so it's not perfect. If you are using this to add made-up words (as in my use case), you can just choose strings that are unlikely to cause problems (unlike the example here of 'mynewword'), but in other cases it is likely to cause problems. (While it's not a perfect solution, hopefully it might prompt others to find a better one.)
import re
import json
import requests
from transformers import RobertaTokenizer

roberta = RobertaTokenizer.from_pretrained('roberta-base')
roberta.tokenize('testing a') # ['testing', 'Ġa']
roberta.tokenize('mynewword') # ['my', 'new', 'word']

# update the vocabulary with the new token and the 'Ġ' version
roberta_vocab = roberta.get_vocab()
roberta_vocab.update({'mynewword': len(roberta_vocab)})
roberta_vocab.update({chr(288) + 'mynewword': len(roberta_vocab)}) # chr(288) = 'Ġ'

with open('vocab.tmp', 'w', encoding='utf-8') as tmp_vocab_file:
    json.dump(roberta_vocab, tmp_vocab_file, ensure_ascii=False)

# get and modify the merges file so that the new token will always be tokenized as a single word
url = 'https://huggingface.co/roberta-base/resolve/main/merges.txt'
roberta_merges = requests.get(url).content.decode().split('\n')
# this is a helper function to loop through a list of new tokens and get the byte-pair encodings
# such that the new token will be treated as a single unit always
def get_roberta_merges_for_new_tokens(new_tokens):
    merges = [gen_roberta_pairs(new_token) for new_token in new_tokens]
    merges = [pair for token in merges for pair in token]
    return merges

def gen_roberta_pairs(new_token, highest = True):
    # highest is used to determine whether we are dealing with the Ġ version or not.
    # we add those pairs at the end, which is only if highest = True
    # this is the hard part...
    chrs = [c for c in new_token] # list of characters in the new token, which we will recursively iterate through to find the BPEs

    # the simplest case: add one pair
    if len(chrs) == 2:
        if not highest:
            return tuple([chrs[0], chrs[1]])
        else:
            return [' '.join([chrs[0], chrs[1]])]

    # add the tokenization of the first letter plus the other two letters as an already merged pair
    if len(chrs) == 3:
        if not highest:
            return tuple([chrs[0], ''.join(chrs[1:])])
        else:
            return gen_roberta_pairs(chrs[1:]) + [' '.join([chrs[0], ''.join(chrs[1:])])]

    if len(chrs) % 2 == 0:
        pairs = gen_roberta_pairs(''.join(chrs[:-2]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += tuple([''.join(chrs[:-2]), ''.join(chrs[-2:])])
        if not highest:
            return pairs
    else:
        # for new tokens with odd numbers of characters, we need to add the final two tokens before the
        # third-to-last token
        pairs = gen_roberta_pairs(''.join(chrs[:-3]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-3:]), highest = False)
        pairs += tuple([''.join(chrs[:-3]), ''.join(chrs[-3:])])
        if not highest:
            return pairs

    pairs = tuple(zip(pairs[::2], pairs[1::2]))
    pairs = [' '.join(pair) for pair in pairs]

    # pairs with the preceding special token
    g_pairs = []
    for pair in pairs:
        if re.search(r'^' + ''.join(pair.split(' ')), new_token):
            g_pairs.append(chr(288) + pair)

    pairs = g_pairs + pairs
    pairs = [chr(288) + ' ' + new_token[0]] + pairs
    pairs = list(dict.fromkeys(pairs)) # remove any duplicates

    return pairs
# first line of this file is a comment; add the new pairs after it
roberta_merges = roberta_merges[:1] + get_roberta_merges_for_new_tokens(['mynewword']) + roberta_merges[1:]
roberta_merges = list(dict.fromkeys(roberta_merges))

with open('merges.tmp', 'w', encoding='utf-8') as tmp_merges_file:
    tmp_merges_file.write('\n'.join(roberta_merges))

new_roberta = RobertaTokenizer(name_or_path='roberta-base', vocab_file='vocab.tmp', merges_file='merges.tmp')

# for some reason, we have to re-add the <mask> token to roberta if we are using it, since
# loading the tokenizer from a file will cause it to be tokenized as separate parts
# the weight matrix is identical, and once re-added, a fill-mask pipeline still identifies
# the mask token correctly (not shown here)
new_roberta.add_tokens(new_roberta.mask_token, special_tokens=True)
new_roberta.model_max_length = 512

new_roberta.tokenize('mynewword')   # ['mynewword']
new_roberta.tokenize('mynewword a') # ['mynewword', 'Ġa']
new_roberta.tokenize(' mynewword')  # ['Ġmynewword']

# however, this does not guarantee that tokenization of other words will not be affected
roberta.tokenize('mynew')     # ['my', 'new']
new_roberta.tokenize('mynew') # ['myne', 'w']

import os
os.remove('vocab.tmp')
os.remove('merges.tmp') # cleanup
If you want to add new tokens to fine-tune a RoBERTa-based model, consider training your tokenizer on your corpus. Take a look at the HuggingFace How To Train tutorial for a complete roadmap of how to do that.
I did that myself to fine-tune XLM-RoBERTa-base on my health-related corpus.
Here's the snippet:
from tokenizers import ByteLevelBPETokenizer
from glob import glob
import os

CORPUS_TRAIN = 'corpus_train.shc'
TOKENIZER_DIR = 'your_tokenizer_dir'

paths = list(glob(CORPUS_TRAIN))

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer(lowercase=False)

# Customize training
tokenizer.train(files=paths, vocab_size=32000, min_frequency=3, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
os.makedirs(TOKENIZER_DIR, exist_ok=True)
tokenizer.save_model(TOKENIZER_DIR)
The vocab_size of 32k was chosen arbitrarily. Training took about 10 minutes on my corpus, and then I was able to train my model.
Inside TOKENIZER_DIR you will see the vocab.json and merges.txt files.
If you are using a custom script for training, you can load the tokenizer like this: tokenizer = RobertaTokenizerFast.from_pretrained(TOKENIZER_DIR, max_len=512).
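As a quick sanity check (just a sketch; the example sentence is a placeholder), you can confirm the freshly trained tokenizer picks up your domain vocabulary:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained(TOKENIZER_DIR, max_len=512)
tokens = tokenizer.tokenize('patient was prescribed ibuprofen')  # frequent domain terms should come out as single tokens
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))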

Dynamic Nested Ruby Loops

So, what I'm trying to do is make calls to a Reporting API to filter by all possible breakdowns (break the reports down by site, advertiser, ad type, campaign, etc...). But one issue is that the breakdowns can be unique to each login.
Example:
user1: alice123's reporting breakdowns are ["site","advertiser","ad_type","campaign","line_items"]
user2: bob789's reporting breakdowns are ["campaign","position","line_items"]
When I first built the code for this reporting API, I only had one login to test with, so I hard-coded the loops for the dimensions (["site","advertiser","ad_type","campaign","line_items"]). So what I did was ping the API for a report by sites, then for each site ping for advertisers, and for each advertiser ping for the next dimension, and so on, leaving me with a nested loop about six layers deep.
Basically, what I'm doing is:

sites = mechanize.get "#{base_url}/report?dim=sites"
sites = Yajl::Parser.parse(sites.body) # json parser
sites.each do |site|
  advertisers = mechanize.get "#{base_url}/report?site=#{site.fetch("id")}&dim=advertiser"
  advertisers = Yajl::Parser.parse(advertisers.body) # json parser
  advertisers.each do |advertiser|
    ad_types = mechanize.get "#{base_url}/report?site=#{site.fetch("id")}&advertiser=#{advertiser.fetch("id")}&dim=ad_type"
    ad_types = Yajl::Parser.parse(ad_types.body) # json parser
    ad_types.each do |ad_type|
      # ...and so on...
    end
  end
end
GET <api_url>/?dim=<dimension to breakdown>&site=<filter by site id>&advertiser=<filter by advertiser id>...etc...
At the end of the nested loop, I'm left with a report broken down with as much granularity as possible.
This worked while I thought there was only one path of breaking down, but apparently each account can have a different set of breakdown dimensions.
So what I'm asking is: given an array of breakdowns, how can I set up the nested loops dynamically to drill all the way down to the finest granularity?
Thanks.
I'm not sure what your JSON/GET returns exactly but for a problem like this you would need recursion.
Something like this perhaps? It's not very elegant and can definitely be optimised further but should hopefully give you an idea.
some_hash = {:id=>"site-id", :body=>{:id=>"advertiser-id", :body=>{:id=>"ad_type-id", :body=>{:id=>"something-id"}}}}
@breakdowns = ["site", "advertiser", "ad_type", "something"]

def recursive(some_hash, str = nil, i = 0)
  if @breakdowns[i + 1].nil?
    str += "#{@breakdowns[i]}=#{some_hash[:id]}"
  else
    str += "#{@breakdowns[i]}=#{some_hash[:id]}&dim=#{@breakdowns[i + 1]}"
  end
  p str
  some_hash[:body].is_a?(Hash) ? recursive(some_hash[:body], str.gsub(/dim.*/, ''), i + 1) : return
end

recursive(some_hash, 'base-url/report?')
=> "base-url/report?site=site-id&dim=advertiser"
=> "base-url/report?site=site-id&advertiser=advertiser-id&dim=ad_type"
=> "base-url/report?site=site-id&advertiser=advertiser-id&ad_type=ad_type-id&dim=something"
=> "base-url/report?site=site-id&advertiser=advertiser-id&ad_type=ad_type-id&something=something-id"
If you are just looking to map your data, you can recursively map to a hash as another user pointed out. If you are actually looking to do something with this data while within the loop and want to dynamically recreate the loop structure you listed in your question (though I would advise coming up with a different solution), you can use metaprogramming as follows:
require 'active_support/inflector'

# Assume we are given an input of breakdowns
# I put 'testarr' in place of the operations you perform on each local variable
# for brevity and so you can see that the code works.
# You will have to modify to suit your needs
result = []
testarr = [1, 2, 3]
b = binding

breakdowns.each do |breakdown|
  snippet = <<-END
    eval("#{breakdown.pluralize} = testarr", b)
    eval("#{breakdown.pluralize}", b).each do |#{breakdown}|
  END
  result << snippet
end

result << "end\n" * breakdowns.length
eval(result.join)
Note: This method is probably frowned upon, and as I've said I'm sure there are other methods of accomplishing what you are trying to do.

Graphlab: How to avoid manually duplicating functions that differ only by a string variable?

I imported my dataset with SFrame:
import graphlab

products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on a set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
Then I would like to create a new column in the products SFrame for each of the selected words, where the entry is the number of times that word occurs, so I created a function for the word "awesome":
def awesome_count(word_count):
    if 'awesome' in word_count:
        return word_count['awesome']
    else:
        return 0

products['awesome'] = products['word_count'].apply(awesome_count)
So far so good, but I would need to manually create more functions like this for each of the selected words (great_count, etc.). How can I avoid this manual effort and write cleaner code?
I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.
I don't know precisely what's in your reviews data, so I made a toy example:
# Create the data and convert to bag-of-words.
import graphlab

products = graphlab.SFrame({'review': ['this book is awesome',
                                       'I hate this book']})
products['word_count'] = \
    graphlab.text_analytics.count_words(products['review'])

# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)

# Fill in zeros for the missing values.
for word in selected_words:
    col_name = 'word_count.{}'.format(word)
    products2[col_name] = products2[col_name].fillna(value=0)
I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.
I actually found an easier way to do this:

def wordCount_select(wc, selectedWord):
    if selectedWord in wc:
        return wc[selectedWord]
    else:
        return 0

for word in selected_words:
    products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
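If the values in word_count behave like plain Python dicts (as the examples above suggest), the helper can also be folded into the lambda with dict.get; this is just a sketch of the same idea, not something I've benchmarked:

for word in selected_words:
    # w=word captures the current word; wc.get returns 0 when the word is absent
    products[word] = products['word_count'].apply(lambda wc, w=word: wc.get(w, 0))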

How do I generate a hash from an array

I'm trying to build a hash from an array. Basically I want to take the unique string values of the array and build a hash keyed by those values. I'm also trying to figure out how to record how many times each unique word occurs.
# The text from the .txt file:
# Bob and George are great! George and Sam are great.
# Bob, George, and sam are great!

# The source code:
count_my_rows = File.readlines("bob.txt")
row_text = count_my_rows.join
puts row_text.split.uniq # testing to make sure array is getting filled
Anyways, I've tried http://ruby-doc.org/core/classes/Hash.html
I think I need to declare an empty hash with Hash.new to start, but I have no idea how to fill it up. I'm assuming some iteration through the array fills the hash. I'm starting to think I need to record each value as a separate array storing the word and the number of times it occurs, and then assign the hash key to it.
Example = { ["Bob", 2] => 1, ["George", 3] => 2 }
Leave some code so I can mull over it.
To get you started,
h = {}
h.default = 0
File.read("myfile").split.each do |x|
  h[x] += 1
end
p h
Note: this is not a complete solution (for one thing, splitting on whitespace means "great!" and "great." count as different words).

Ruby, how should I design a parser?

I'm writing a small parser for Google and I'm not sure of the best way to design it. The main problem is how it will remember the position it stopped at.
During parsing it's going to append new searches to the end of a file and go through the file starting with the first line. Now I want it so that, if for some reason the execution is interrupted, the script knows the last search it completed successfully.
One way is to delete a line from the file after fetching it, but in that case I have to manage the order in which threads access the file, and as far as I know deleting the first line of a file can't be done efficiently.
Another way is to write the number of each consumed line to a text file and skip the lines whose numbers are in that file. Or maybe I should use some database instead? TIA
There's nothing wrong with using a state file. The only catch will be that you need to ensure you have fully committed your changes to the state file before your program enters a section where it may be interrupted. Typically this is done with an IO#flush call.
For example, here's a simple state-tracking class that works on a line-by-line basis:
class ProgressTracker
  def initialize(filename)
    @filename = filename
    @file = open(@filename)
    @state_filename = File.expand_path(".#{File.basename(@filename)}.position", File.dirname(@filename))

    if (File.exist?(@state_filename))
      @state_file = open(@state_filename, File::RDWR)
      resume!
    else
      @state_file = open(@state_filename, File::RDWR | File::CREAT)
    end
  end

  def each_line
    @file.each_line do |line|
      mark_position!
      yield(line) if (block_given?)
    end
  end

protected
  def mark_position!
    @state_file.rewind
    @state_file.puts(@file.pos)
    @state_file.flush
  end

  def resume!
    if (position = @state_file.readline)
      @file.seek(position.to_i)
    end
  end
end
You use it with an IO-like block call:
test = ProgressTracker.new(__FILE__)

n = 0

test.each_line do |line|
  n += 1
  puts "%3d %s" % [ n, line ]

  if (n == 10)
    raise 'terminate'
  end
end
In this case, the program reads itself and will stop after ten lines due to a simulated error. On the second run it should display the next ten lines, if there are that many, or simply exit if there's no additional data to retrieve.
One caveat is that you need to remove the .position file associated with the input data if you want the file to be reprocessed, or if the file has been reset. It's also not possible to edit the file and remove earlier lines or it will throw off the offset tracking. So long as you're simply appending data to the file, or restarting it, everything will be fine.
