Fix tokenization to tensors with padding Huggingface - huggingface-transformers

I'm trying to tokenize my dataset with the following preprocessing function. I've already downloaded the tokenizer with AutoTokenizer from the Spanish BERT model.
```
max_input_length = 280
max_target_length = 280
source_lang = "es"
target_lang = "en"
prefix = "translate spanish_to_women to spanish_to_men: "

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = [ex for ex in examples["hombres_tweet"]]
    model_inputs = tokz(inputs,
                        padding=True,
                        truncation=True,
                        max_length=max_input_length,
                        return_tensors='pt')

    # Setup the tokenizer for targets
    with tokz.as_target_tokenizer():
        labels = tokz(targets,
                      padding=True,
                      truncation=True,
                      max_length=max_target_length,
                      return_tensors='pt')

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
And I get the following error when trying to pass my dataset object through the function.
I've already tried dropping the columns that have strings. I've also seen that when I do not set return_tensors it does tokenize my dataset (but later on I have the same problem when trying to train my BERT model). Does anyone know what might be going on? *inserts crying face*
Also, I've tried tokenizing it without return_tensors and then doing set_format, but that returns an empty dataset object *inserts another crying face*.
My Dataset looks like the following
And an example of the inputs
So that I just do:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
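For what it's worth, the usual workaround here (a sketch of mine, not code from the post; it assumes a seq2seq-style setup and an existing `model` variable) is to drop both padding=True and return_tensors inside the map function, and let a data collator pad each batch to a common length at training time:

```
from transformers import DataCollatorForSeq2Seq

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = [ex for ex in examples["hombres_tweet"]]
    # No padding or return_tensors here: map() stores plain lists of token ids,
    # and examples are allowed to have different lengths at this stage.
    model_inputs = tokz(inputs, truncation=True, max_length=max_input_length)
    with tokz.as_target_tokenizer():
        labels = tokz(targets, truncation=True, max_length=max_target_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

# The collator pads input_ids and labels per batch and returns PyTorch tensors,
# so return_tensors='pt' is not needed during preprocessing.
data_collator = DataCollatorForSeq2Seq(tokz, model=model)  # 'model' assumed to exist already
```

Padding per batch with a collator (or, alternatively, padding="max_length" in the tokenizer call) sidesteps the problem of examples being padded to different lengths in different map batches, which is a common cause of errors when tensors are built later for training.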

Related

Power Query: Change Header Titles from Camel Case to Snake Case

I have just imported some data into Power Query. The headers are in camel case. I.e.:
headerOne
headerTwo
headerThree
Etc.
I would like them to be in snake case. I.e.:
header_one
header_two
header_three
Etc.
I am not sure, though, how to do this. Any ideas?
Thanks.
Examine the applied steps to understand the algorithm.
let
    //change next line to reflect actual data source
    Source = Excel.CurrentWorkbook(){[Name="Table4"]}[Content],

    //change column headers
    colNames = Table.ColumnNames(Source),
    #"Split at UpperCase" = List.Transform(colNames, each Splitter.SplitTextByCharacterTransition({"a".."z"},{"A".."Z"})(_)),
    #"Snake Case" = List.Transform(#"Split at UpperCase", each Text.Lower(Text.Combine(_,"_"))),
    rename = Table.RenameColumns(Source, List.Zip({colNames,#"Snake Case"}))
in
    rename
Another method is to replace "X" with "_x" for any upper-case letter.
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WSlTSUUoC4mSl2FgA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [headerOne = _t, headerTwo = _t, headerThree = _t]),
    Headers = Table.ColumnNames(Source),
    NewHeaders = List.Transform(Headers, each Text.Combine(List.Transform(Text.ToList(_), each if List.Contains({"A".."Z"}, _) then "_" & Text.Lower(_) else _))),
    Result = Table.RenameColumns(Source, List.Zip({Headers, NewHeaders}))
in
    Result
This splits each header into a list of characters (using Text.ToList), replaces any capital letter from "A" to "Z" with "_" followed by its lower-case version (using List.Transform), and then combines the list back into a string (using Text.Combine).
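The same per-character idea can be sketched outside Power Query; here is a small Python illustration (just to show the algorithm, the M code above is what actually runs in the query):

def to_snake_case(name):
    # Replace each upper-case letter with "_" followed by its lower-case version.
    return "".join("_" + ch.lower() if ch.isupper() else ch for ch in name)

headers = ["headerOne", "headerTwo", "headerThree"]
print([to_snake_case(h) for h in headers])
# ['header_one', 'header_two', 'header_three']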

SQLAlchemy Oracle can't insert characters with accents

I've got a project based on Flask that uses an Oracle database and communicates through SQLAlchemy and the cx_Oracle driver. My problem is that I have a simple table with two String columns:
class Example(Base):
    __tablename__ = 'example'

    id = Column(Integer, primary_key=True)
    title = Column(String(255))
    description = Column(String(1024))
And when I try to save values with accents I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 5: ordinal not in range(128)
The offending character in the error differs depending on the value of the text.
Here's an example of the values:
object = Example()
object.title = 'É its a character with accent'
object.description = 'Á another characters with accent'
db_session.add(object)
db_session.commit()
Do you have any idea what I can do to fix this? Some configuration?
Thanks :)
UPDATE:
As suggested I've tried 2 other ways:
class Example(Base):
    __tablename__ = 'example'

    id = Column(Integer, primary_key=True)
    title = Column(Unicode(255))
    description = Column(Unicode(1024))
And
class Example(Base):
    __tablename__ = 'example'

    id = Column(Integer, primary_key=True)
    title = Column(String(255, convert_unicode=True))
    description = Column(String(1024, convert_unicode=True))
Still got the same error.
That is because the values you are using, especially the accented characters, are not in the ASCII table. Please try declaring the title property as:
title = Column(String(255, convert_unicode=True))
This may help; if not, declare it as Unicode instead of String.
For more information you can also check the documentation here:
http://docs.sqlalchemy.org/en/latest/core/type_basics.html
You should also ensure that in your create_engine() call you pass the optional encoding parameter as "UTF-8" or "latin1", depending on the charset you need to accept. Of course, "UTF-8" covers everything you might actually need.
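As a minimal sketch of that engine setup (the connection string below is a placeholder, and the encoding argument applies to SQLAlchemy versions that still accept it, i.e. 1.x):

from sqlalchemy import create_engine

# Placeholder DSN; replace user/password/host/service name with real values.
engine = create_engine(
    "oracle+cx_oracle://user:password@host:1521/?service_name=orcl",
    encoding="UTF-8",  # charset used when encoding non-Unicode String columns
)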

How to write this domain to be 100% sure that we will get the right stock pack operation for each invoice line?

This post should be a little more complex than usual.
We have created a new field for account.invoice.line: pack_operation. With this field, we can print the serial/lot numbers for each line on the PDF invoice (this part works well).
We have spent many hours trying to write the domain that selects the EXACT and ONLY stock pack operation for each invoice line.
In the code below, we used the domain [('id','=', 31)] to make our tests while printing the PDF.
How do we write this domain so that we are 100% sure to get the right stock pack operation for each invoice line?
I really need your help here... Too complex for my brain.
Our code :
class AccountInvoiceLine(models.Model):
    _inherit = "account.invoice.line"

    pack_operation = fields.Many2one(comodel_name='stock.pack.operation',
                                     compute='compute_stock_pack_operation_id')

    def compute_stock_pack_operation_id(self):
        stock_operation_obj = self.env['stock.pack.operation']
        stock_operation = stock_operation_obj.search([('id', '=', 31)])
        self.pack_operation = stock_operation[0]
EDIT#1
I know that you won't like my code, but this one seems to work. I'll take any comments and improvements with pleasure.
class AccountInvoiceLine(models.Model):
    _inherit = "account.invoice.line"

    pack_operation = fields.Many2one(comodel_name='stock.pack.operation',
                                     compute='compute_stock_pack_operation_id')  # api.one

    def compute_stock_pack_operation_id(self):
        procurement_order_obj = self.env['procurement.order']
        stock_operation_obj = self.env['stock.pack.operation']

        all_picking_ids_for_this_invoice_line = []
        for saleorderline in self.sale_line_ids:
            for procurement in saleorderline.procurement_ids:
                for stockmove in procurement.move_ids:
                    if stockmove.picking_id.id not in all_picking_ids_for_this_invoice_line:
                        all_picking_ids_for_this_invoice_line.append(stockmove.picking_id.id)

        stock_operation = stock_operation_obj.search(
            ['&',
             ('picking_id', 'in', all_picking_ids_for_this_invoice_line),
             ('product_id', '=', self.product_id.id)]
        )
        self.pack_operation = stock_operation[0]
The pack_operation field is a computed field; by default that means the field will not be saved in the database unless you set store=True when you define your field.
So, what you can do here is change:
pack_operation = fields.Many2one(comodel_name='stock.pack.operation', compute='compute_stock_pack_operation_id')
to:
pack_operation = fields.Many2one(comodel_name='stock.pack.operation', compute='compute_stock_pack_operation_id', store=True)
And try running your query again.
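As a sketch of what the stored version could look like (the @api.depends chain below is my assumption about which fields drive the computation; adjust it to your model):

from openerp import api, fields, models  # 'odoo' instead of 'openerp' on newer versions

class AccountInvoiceLine(models.Model):
    _inherit = "account.invoice.line"

    pack_operation = fields.Many2one(
        comodel_name='stock.pack.operation',
        compute='compute_stock_pack_operation_id',
        store=True,
    )

    # A stored computed field is only recomputed when its dependencies change,
    # so list the fields the computation actually reads.
    @api.depends('sale_line_ids.procurement_ids.move_ids', 'product_id')
    def compute_stock_pack_operation_id(self):
        # same body as in the EDIT#1 code above
        ...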

Labeled LDA learn in Stanford Topic Modeling Toolbox

It's ok when I run the example-6-llda-learn.scala as follows:
val source = CSVFile("pubmed-oa-subset.csv") ~> IDColumn(1);

val tokenizer = {
    SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
    CaseFolder() ~>                        // lowercase everything
    WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
    MinimumLengthFilter(3)                 // take terms with >=3 characters
}

val text = {
    source ~>                              // read from the source file
    Column(4) ~>                           // select column containing text
    TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
    TermCounter() ~>                       // collect counts (needed below)
    TermMinimumDocumentCountFilter(4) ~>   // filter terms in <4 docs
    TermDynamicStopListFilter(30) ~>       // filter out 30 most common terms
    DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}

// define fields from the dataset we are going to slice against
val labels = {
    source ~>                              // read from the source file
    Column(2) ~>                           // take column two, the year
    TokenizeWith(WhitespaceTokenizer()) ~> // turns label field into an array
    TermCounter() ~>                       // collect label counts
    TermMinimumDocumentCountFilter(10)     // filter labels in < 10 docs
}

val dataset = LabeledLDADataset(text, labels);

// define the model parameters
val modelParams = LabeledLDAModelParams(dataset);

// Name of the output model folder to generate
val modelPath = file("llda-cvb0-"+dataset.signature+"-"+modelParams.signature);

// Trains the model, writing to the given output path
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// or could use TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
But it's not ok when I change the last line from:
TrainCVB0LabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
to:
TrainGibbsLabeledLDA(modelParams, dataset, output = modelPath, maxIterations = 1500);
Also, the CVB0 method uses a lot of memory: training a corpus of 10,000 documents with about 10 labels per document costs about 30 GB of memory.
I've encountered the same situation and indeed I believe it's a bug. Check GibbsLabeledLDA.scala in edu.stanford.nlp.tmt.model.llda under the src/main/scala folder, from line 204:
val z = doc.labels(zI);
val pZ = (doc.theta(z)+topicSmoothing(z)) *
         (countTopicTerm(z)(term)+termSmooth) /
         (countTopic(z)+termSmoothDenom);
doc.labels is self-explanatory, and doc.theta records the distribution (counts, actually) of its labels, which has the same size as doc.labels.
zI is the index variable iterating over doc.labels, while the value z gets the actual label number. Here comes the problem: it's possible this document has only one label - say 1000 - so zI is 0 while z is 1000, and then doc.theta(z) goes out of range.
I suppose the solution would be to modify doc.theta(z) to doc.theta(zI).
(I'm still checking whether the results are meaningful; in any case, this bug has made me less confident in this toolbox.)

Sphinx wildcard searching won't work

I have used the following code:
function searchSphinx2($tofind, $jobtype_id, $payper_id, $onetimeBounds)
{
    $this->load->library('session');
    $this->load->library('sphinxclient');

    global $result;
    global $functionresult;
    $functionresult = array();

    $this->sphinxclient->setServer('localhost', 3312);
    $this->sphinxclient->SetMatchMode(SPH_MATCH_ANY);
    $this->sphinxclient->SetIndexWeights(array(
        "jobs_index_main"         => 10,
        "jobs_index_delta"        => 10,
        "jobs_index_prefix_main"  => 1,
        "jobs_index_prefix_delta" => 1,
        "jobs_index_infix_main"   => 1,
        "jobs_index_infix_delta"  => 1
    ));
    $this->sphinxclient->ResetFilters();
    $this->sphinxclient->SetFilter('jobtype_id', $jobtype_id, TRUE);
    $this->sphinxclient->SetFilter('payper_id', $payper_id, TRUE);
    // the parameter is $onetimeBounds, so use that name here
    $this->sphinxclient->SetFilterFloatRange('payamount', $onetimeBounds[0], $onetimeBounds[1], FALSE);
    $this->sphinxclient->AddQuery("$tofind", "jobs_index_main;jobs_index_delta");
    $this->sphinxclient->AddQuery("*$tofind*", "jobs_index_main_prefix;jobs_index_delta_prefix");
    $this->sphinxclient->AddQuery("*$tofind*", "jobs_index_main_infix;jobs_index_delta_infix");
    $result = $this->sphinxclient->RunQueries();
In my database there is a job with the title "Intern". However, if I search for "inter" I do not get any results.
The indices in my config file are set up as follows:
index jobs_index_prefix_main
{
    source           = jobs_main
    path             = /var/newsphinx/index/main_prefix
    morphology       = stem_en
    min_stemming_len = 4
    min_word_len     = 3
    min_prefix_len   = 3
    prefix_fields    = title, contactname
    enable_star      = 1
}
Can anyone tell me why I am not getting partial word results?
I've never found Sphinx to return partial matches without using stars. I agree that it's not particularly intuitive (surely if the prefixes are being indexed, there's a match?), but if you want to ensure you always get results, add a star to the end of each word.
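For example, the star-appending step can be sketched like this (plain Python just to illustrate the transformation; in the PHP code above you would apply the same idea to $tofind before calling AddQuery):

def add_trailing_wildcards(query):
    # Append a star to every word so Sphinx treats each as a prefix query,
    # e.g. "inter" -> "inter*", which then matches "Intern".
    return " ".join(word + "*" for word in query.split())

print(add_trailing_wildcards("inter"))        # inter*
print(add_trailing_wildcards("intern jobs"))  # intern* jobs*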

Resources