How does TextCategorizer.predict work with spaCy?

I've been following the spaCy quick-start guide for text classification.
Let's say I have a very simple dataset.
TRAIN_DATA = [
    ("beef", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("apple", {"cats": {"POSITIVE": 0, "NEGATIVE": 1}})
]
I'm training a pipe to classify text. It trains, and the loss ends up low.
import random
from spacy.util import minibatch

textcat = nlp.create_pipe("pytt_textcat", config={"exclusive_classes": True})
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
nlp.add_pipe(textcat)

optimizer = nlp.resume_training()
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)
Now, how do I predict whether a new string of text is "POSITIVE" or "NEGATIVE"?
This will work:
doc = nlp(u'Pork')
print(doc.cats)
It gives a score for each category we've trained to predict on.
But that seems at odds with the docs, which say I should use the predict method on the pipeline component subclass itself. That doesn't work, though.
Trying textcat.predict('text'), textcat.predict(['text']), etc. throws:
AttributeError Traceback (most recent call last)
<ipython-input-29-39e0c6e34fd8> in <module>
----> 1 textcat.predict(['text'])
pipes.pyx in spacy.pipeline.pipes.TextCategorizer.predict()
AttributeError: 'str' object has no attribute 'tensor'

The predict methods of pipeline components actually expect Doc objects as input rather than raw strings, so you'd need to do something like textcat.predict([nlp(text)]). The nlp used there does not necessarily have to have a textcat component. The result of that call then needs to be fed into a call to set_annotations(), as shown here.
However, your first approach is just fine:
...
nlp.add_pipe(textcat)
...
doc = nlp(u'Pork')
print(doc.cats)
...
Internally, when calling nlp(text), first the Doc for the text will be generated, and then each pipeline component, one by one, will run its predict method on that Doc and keep adding information to it with set_annotations. Eventually the textcat component will fill in the cats attribute of the Doc.
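To make this concrete, here is a minimal sketch of that "under the hood" route, assuming the spaCy v2.1 API, where TextCategorizer.predict returns the scores together with the tensors:
# a hedged sketch of doing manually what nlp(text) does for you;
# predict() expects a list of Docs, not raw strings
doc = nlp(u'Pork')  # the full pipeline has already run on this Doc
scores, tensors = textcat.predict([doc])
textcat.set_annotations([doc], scores, tensors=tensors)
print(doc.cats)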
The API docs you're citing for the other approach kind of give you a look "under the hood", so they're not really conflicting approaches ;-)

Related

Set the name for each ParallelFor iteration in KFP v2 on Vertex AI

I am currently using kfp.dsl.ParallelFor to train 300 models. It looks something like this:
...
models_to_train_op = get_models()
with dsl.ParallelFor(models_to_train_op.outputs["data"], parallelism=100) as item:
    prepare_data_op = prepare_data(item)
    train_model_op = train_model(prepare_data_op.outputs["train_data"])
...
Currently, the iterations in Vertex AI are labeled in a dropdown as something like for-loop-worker-0, for-loop-worker-1, and so on. For tasks (like prepare_data_op), there's a method called set_display_name. Is there a similar method that allows you to set the iteration name? It would be helpful to relate iterations to the training data so that it's easier to look through the dropdown UI that Vertex AI provides.
I reached out to a contact I have at Google. They said you can pass the list that is given to ParallelFor to set_display_name for each 'iteration' of the loop; when the pipeline is compiled, it will know to set the corresponding iteration name.
# Create component that returns a range list
model_list_op = model_list(n_models)

# Parallelize jobs
with dsl.ParallelFor(model_list_op.outputs["model_list"], parallelism=100) as x:
    x.set_display_name(str(model_list_op.outputs["model_list"]))

transformers: TypeError: forward() takes from 2 to 7 positional arguments but 8 were given

As per the Roberta-long docs, this is the way to load the roberta long model for sequence classification:
class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)

class RobertaLongForSequenceClassification(RobertaForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)
        print("Config.........")
        print(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)
I have my own pre-trained long model and tokenizer, but when I try to call trainer.train(), it gives an error saying:
TypeError: forward() takes from 2 to 7 positional arguments but 8 were given
I also tried pulling the simonlevine/bioclinical-roberta-long model and tokenizer, just to check whether something's wrong with my own pre-trained model and tokenizer, but I'm still getting the same error.
Any idea how to get rid of it?
Note: I used transformers 4.17.0 for converting the pre-trained model to a long one, and I'm using the same version for fine-tuning.
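For what it's worth, the arity in the traceback is consistent with RobertaAttention in transformers 4.x passing past_key_value as an extra positional argument to the self-attention module. A hedged sketch of a forward signature that accepts (and ignores) it, assuming that is indeed the cause in 4.17.0:
# accept the extra past_key_value positional argument that transformers 4.x
# passes to self-attention modules, and ignore it as before
class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)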

dry-validation: Case insensitive `included_in?` validation with Dry::Validation.Schema

I'm trying to create a validation for a predetermined list of valid brands as part of an ETL pipeline. My validation needs to be case insensitive, as some brands are compound words or abbreviations whose casing is insignificant.
I created a custom predicate, but I cannot figure out how to generate the appropriate error message.
I read the error messages doc, but I'm having a hard time interpreting:
How do I build the syntax for my custom predicate's message?
Can I apply the messages in my schema class directly, without referencing an external .yml file? I looked here and it seems it's not as straightforward as I'd hoped.
Below I've given code that represents what I have tried using both built-in predicates, and a custom one, each with their own issues. If there is a better way to compose a rule that achieves the same goal, I'd love to learn it.
require 'dry/validation'

CaseSensitiveSchema = Dry::Validation.Schema do
  BRANDS = %w(several hundred valid brands)
  # :included_in? from https://dry-rb.org/gems/dry-validation/basics/built-in-predicates/
  required(:brand).value(included_in?: BRANDS)
end

CaseInsensitiveSchema = Dry::Validation.Schema do
  BRANDS = %w(several hundred valid brands)
  configure do
    def in_brand_list?(value)
      BRANDS.include? value.downcase
    end
  end
  required(:brand).value(:in_brand_list?)
end
# A valid string if case insensitive
valid_product = {brand: 'Valid'}

CaseSensitiveSchema.call(valid_product).errors
# => {:brand=>["must be one of: several, hundred, valid, brands"]}
# This message will be ridiculous when the full brand list is applied

CaseInsensitiveSchema.call(valid_product).errors
# => {} # Good!

invalid_product = {brand: 'Junk'}

CaseSensitiveSchema.call(invalid_product).errors
# => {:brand=>["must be one of: several, hundred, valid, brands"]}
# Good... (except this error message will contain the entire brand list!!!)

CaseInsensitiveSchema.call(invalid_product).errors
# => Dry::Validation::MissingMessageError: message for in_brand_list? was not found
# => from ../gems/2.5.0/gems/dry-validation-0.12.2/lib/dry/validation/message_compiler.rb:116:in `visit_predicate'
The correct way to reference my error message was to key it on the predicate method; no need to worry about arg, value, etc.:
en:
  errors:
    in_brand_list?: "must be in the master brands list"
Additionally, I was able to load this error message without a separate .yml by doing this:
CaseInsensitiveSchema = Dry::Validation.Schema do
  BRANDS = %w(several hundred valid brands)
  configure do
    def in_brand_list?(value)
      BRANDS.include? value.downcase
    end

    def self.messages
      super.merge({en: {errors: {in_brand_list?: "must be in the master brand list"}}})
    end
  end
  required(:brand).value(:in_brand_list?)
end
I'd still love to see other implementations, specifically for a generic case-insensitive predicate. Many people say dry-rb is fantastically organized, but I find it hard to follow.

XPath: Select Certain Child Nodes

I'm using XPath with Scrapy to scrape data off of a movie website BoxOfficeMojo.com.
As a general question: I'm wondering how to select certain child nodes of one parent node all in one XPath string.
Depending on the movie web page from which I'm scraping data, the data I need is sometimes located at different child nodes, for example depending on whether or not there is a link. I will be going through about 14000 movies, so this process needs to be automated.
Using this page as an example, I will need the actor(s), director(s) and producer(s).
This is the XPath to the director (note: the %s corresponds to a determined index where that information is found; in the Action Jackson example the director is found at [1] and the actors at [2]):
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()
However, if a link to a page on the director exists, this would be the XPath:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/a/text()
Actors are a bit more tricky, as there are <br> elements included for subsequent actors, which may be children of an /a or children of the parent /font, so:
//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()
gets almost all of the actors (except those that follow a font/br).
Now, the main problem here, I believe, is that there are multiple //div[@class="mp_box_content"] elements: everything I have works, EXCEPT that I also end up getting some digits from other mp_box_content divs. I have also added numerous try:/except: statements in order to cover everything (actors, directors and producers who both have and do not have links associated with them). For example, the following is my Scrapy code for actors:
actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font//a/text()' % (locActor,)).extract()
try:
    second = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
    for n in second:
        actors.append(n)
except:
    actors = hxs.select('//div[@class="mp_box_content"]/table/tr[%s]/td[2]/font/text()' % (locActor,)).extract()
This is an attempt to cover the facts that the first actor may not have a link associated with him/her while subsequent actors do, or that the first actor may have a link while the rest do not.
I appreciate the time taken to read this and any attempts to help me find/address this problem! Please let me know if any more information is needed.
I am assuming you are only interested in textual content, not the links to actors' pages etc.
Here is a proposal using lxml.html (and a bit of lxml.etree) directly.
First, I recommend you select the td[2] cells by the text content of td[1], with expressions like .//tr[starts-with(td[1], "Director")]/td[2] to account for "Director" or "Directors".
Second, testing various expressions with or without <font>, with or without <a>, etc. makes the code difficult to read and maintain. Since you're interested only in the text content, you might as well use string(.//tr[starts-with(td[1], "Actor")]/td[2]) to get the text, or use lxml.html.tostring(e, method="text", encoding=unicode) on the selected elements.
As for the <br> issue with multiple names, what I generally do is modify the lxml tree containing the targeted content, adding a special formatting character to the <br> elements' .text or .tail (for example a \n) with one of lxml's iter() functions. This can be useful with other HTML block elements, like <hr> for example.
You may see better what I mean with some spider code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

import lxml.etree
import lxml.html

MARKER = "|"

def br2nl(tree):
    for element in tree:
        for elem in element.iter("br"):
            elem.text = MARKER

def extract_category_lines(tree):
    if tree is not None and len(tree):
        # modify the tree by adding a MARKER after <br> elements
        br2nl(tree)
        # use lxml's .tostring() to get a unicode string
        # and split lines on the marker we added above
        # so we get lists of actors, producers, directors...
        return lxml.html.tostring(
            tree[0], method="text", encoding=unicode).split(MARKER)

class BoxOfficeMojoSpider(BaseSpider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/movies/?id=actionjackson.htm",
        "http://www.boxofficemojo.com/movies/?id=cloudatlas.htm",
    ]

    # locate 2nd cell by text content of first cell
    XPATH_CATEGORY_CELL = lxml.etree.XPath('.//tr[starts-with(td[1], $category)]/td[2]')

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # locate the "The Players" table
        players = root.xpath('//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table')

        # we have only one table in "players" so the for loop is not really necessary
        for players_table in players:
            directors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Director")
            actors_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                    category="Actor")
            producers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Producer")
            writers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                     category="Writer")
            composers_cells = self.XPATH_CATEGORY_CELL(players_table,
                                                       category="Composer")

            directors = extract_category_lines(directors_cells)
            actors = extract_category_lines(actors_cells)
            producers = extract_category_lines(producers_cells)
            writers = extract_category_lines(writers_cells)
            composers = extract_category_lines(composers_cells)

            print "Directors:", directors
            print "Actors:", actors
            print "Producers:", producers
            print "Writers:", writers
            print "Composers:", composers
            # here you should of course populate scrapy items
The code can be simplified for sure, but I hope you get the idea.
You can do similar things with HtmlXPathSelector of course (with the string() XPath function, for example), but without modifying the tree for <br> (how would you do that with hxs?) it only works for single names in your case:
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Director")]/td[2])').extract()
[u'Craig R. Baxley']
>>> hxs.select('string(//div[@class="mp_box"][div[@class="mp_box_tab"]="The Players"]/div[@class="mp_box_content"]/table//tr[contains(td, "Actor")]/td[2])').extract()
[u'Carl WeathersCraig T. NelsonSharon Stone']

Django models are not ajax serializable

I have a simple view that I'm using to experiment with AJAX.
def get_shifts_for_day(request, year, month, day):
    data = dict()
    data['d'] = year
    data['e'] = month
    data['x'] = User.objects.all()[2]
    return HttpResponse(simplejson.dumps(data), mimetype='application/javascript')
This returns the following:
TypeError at /sched/shifts/2009/11/9/
<User: someguy> is not JSON serializable
If I take out the data['x'] line so that I'm not referencing any models it works and returns this:
{"e": "11", "d": "2009"}
Why can't simplejson serialize one of the default Django models? I get the same behavior with any model I use.
You just need to add, in your .dumps call, a default=encode_myway argument to let simplejson know what to do when you pass it data whose types it does not know -- the answer to your "why" question is of course that you haven't told poor simplejson what to DO with one of your models' instances.
And of course you need to write encode_myway to provide JSON-encodable data, e.g.:
def encode_myway(obj):
    if isinstance(obj, User):
        return [obj.username,
                obj.first_name,
                obj.last_name,
                obj.email]
    # and/or whatever else
    elif isinstance(obj, OtherModel):
        return []  # whatever
    elif ...
    else:
        raise TypeError(repr(obj) + " is not JSON serializable")
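The view from the question would then wire the hook in through the dumps call, e.g.:
def get_shifts_for_day(request, year, month, day):
    data = {'d': year, 'e': month, 'x': User.objects.all()[2]}
    # default= tells simplejson how to encode types it doesn't know about
    return HttpResponse(simplejson.dumps(data, default=encode_myway),
                        mimetype='application/javascript')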
Basically, JSON knows about VERY elementary data types (strings, ints and floats, grouped into dicts and lists) -- it's YOUR responsibility as an application programmer to match everything else into/from such elementary data types, and in simplejson that's typically done through a function passed to default= at dump or dumps time.
Alternatively, you can use the json serializer that's part of Django, see the docs.
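A minimal sketch of that alternative, assuming django.core.serializers; note that serialize() expects an iterable of model instances and produces a different JSON shape (a list of {model, pk, fields} objects):
from django.core import serializers

def get_shifts_for_day(request, year, month, day):
    # serialize the model instance(s); plain values like year/month would
    # still go through simplejson separately
    payload = serializers.serialize("json", [User.objects.all()[2]])
    return HttpResponse(payload, mimetype='application/javascript')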
