K-fold cross validation - save folds for different models

I am trying to train my models and validate them using sklearn's cross validation. What I want to do is use the same folds across all of my models (which will be running from different python scripts).
How can I do this? Should I save them to a file, save the kfold object, or just use the same seed?
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

The easiest way I found to save the folds was to get them from the stratified k-fold split method by looping over it, then store them in a JSON file:
import json
import numpy as np
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
folds = {}
count = 1
# X is only used for its length here, so a dummy array of zeros is enough
for train, test in kfold.split(np.zeros(len(y)), y.argmax(1)):
    folds['fold_{}'.format(count)] = {}
    # convert numpy index arrays to plain lists so json can serialize them
    folds['fold_{}'.format(count)]['train'] = train.tolist()
    folds['fold_{}'.format(count)]['test'] = test.tolist()
    count += 1
assert len(folds) == n_splits  # we have the same number of folds as splits
# dump folds to json
with open('folds.json', 'w') as fp:
    json.dump(folds, fp)
Note: argmax is used here because my y values are one-hot encoded, so we need to recover the class index (predicted or ground truth) for stratification.
Now to load it from any other script:
# load to dict to be used
import json
with open('folds.json') as f:
    kfolds = json.load(f)
From here we can easily just loop over the elements in the dict:
for key, val in kfolds.items():
    print(key)
    train = val['train']
    test = val['test']
Our json file looks like so:
{"fold_1": {"train": [193, 2405, 2895, 565, 1215, 274, 2839, 1735, 2536, 1196, 40, 2541, 980,...SNIP...830, 1032], "test": [1, 5, 6, 7, 10, 15, 20, 26, 37, 45, 52, 54, 55, 59, 60, 64, 65, 68, 74, 76, 78, 90, 100, 106, 107, 113, 122, 124, 132, 135, 141, 146,...SNIP...]}
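Before reusing a saved file across scripts, it can help to sanity-check it. The following is a minimal sketch, not part of the original answer: `check_folds` is a hypothetical helper, and the tiny `example` dict is hand-written in the same layout as folds.json. It assumes each fold's train and test indices should be disjoint and together cover every sample, and that each sample appears in exactly one test set.

```python
import json


def check_folds(folds: dict, n_samples: int) -> None:
    """Raise AssertionError if the folds are not a valid k-fold partition."""
    all_indices = set(range(n_samples))
    seen_test = []
    for name, fold in folds.items():
        train, test = set(fold["train"]), set(fold["test"])
        assert not train & test, f"{name}: train and test overlap"
        assert train | test == all_indices, f"{name}: some indices are missing"
        seen_test.extend(fold["test"])
    # Across all folds, every sample should be tested exactly once.
    assert sorted(seen_test) == list(range(n_samples)), "test sets do not partition the data"


# Hand-written example in the folds.json layout (6 samples, 3 folds).
example = json.loads("""
{"fold_1": {"train": [2, 3, 4, 5], "test": [0, 1]},
 "fold_2": {"train": [0, 1, 4, 5], "test": [2, 3]},
 "fold_3": {"train": [0, 1, 2, 3], "test": [4, 5]}}
""")
check_folds(example, n_samples=6)
print("folds OK")
```

In a real script you would pass the dict loaded from folds.json and `len(y)` instead of the toy values.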

Do huggingface translation models support separate vocabulary for source and target?

Every example I've looked at so far seems to use a shared vocabulary between source and target languages, and I'm wondering if that is a hard-coded constraint of the Huggingface models, or my misunderstanding, or I've just not looked in the right place yet?
To take a random example, when I look at the files here, https://huggingface.co/Helsinki-NLP/opus-mt-en-zls/tree/main, I see separate "spm" (SentencePiece model) files for source and target languages, and they are of different sizes (792kb vs. 850kb). But there is only a single "vocab.json" file, and the config.json file only mentions a single "vocab_size": 57680.
I've also been experimenting, e.g. tokenizer(inputs, text_target=inputs, return_tensors="pt"). If source and target used different vocabularies I would expect the returned input_ids and labels to use different numbers. But for every model I've tried so far the numbers are identical (NO, my mistake - see update below).
Can a Huggingface tokenizer even support two vocabularies? If not then a model would need two tokenizers, which seems to clash with the way AutoTokenizer works.
UPDATE
Here is a test script to show the above model is actually using two spm vocabs with AutoTokenizer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = 'Helsinki-NLP/opus-mt-en-zls'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
inputs = ['Filter all items from same host']
targets = ['Filtriraj sve stavke s istog hosta']
x=tokenizer(inputs, text_target=targets, return_tensors="pt")
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print("\nGiving inputs on both sides")
x=tokenizer(inputs, text_target=inputs, return_tensors="pt")
print(x) ## Expecting to see different numbers if they use different vocabs
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print("\nGiving targets on both sides")
x=tokenizer(targets, text_target=targets, return_tensors="pt") ## Expecting to see different numbers if they use different vocabs
print(x)
print(tokenizer.decode(x['input_ids'][0]))
print(tokenizer.decode(x['labels'][0]))
print(model)
The output is:
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
0]])}
▁Filter all▁items from same host</s>
Filtriraj sve stavke s istog hosta</s>
Giving inputs on both sides
{'input_ids': tensor([[10373, 90, 8255, 98, 605, 6276, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 911, 90, 3188, 7, 98, 605, 6276, 0]])}
▁Filter all▁items from same host</s>
Filter all items from same host</s>
Giving targets on both sides
{'input_ids': tensor([[11638, 1392, 7636, 95, 120, 914, 465, 478, 95, 29,
25, 897, 6276, 27, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[11638, 1392, 7636, 386, 35861, 95, 2130, 218, 6276, 27,
0]])}
Filtriraj sve stavke s istog hosta</s>
Filtriraj sve stavke s istog hosta</s>
When I choose identical strings in English or Croatian it gives slightly different numbers, showing that different tokenizers are involved. You can then see that the different ids sometimes map back to an identical string, sometimes not.
But when I print out the model we see it is actually a shared vocabulary, which makes the two spm models a bit pointless.
(encoder): MarianEncoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
(decoder): MarianDecoder(
(embed_tokens): Embedding(57680, 512, padding_idx=57679)
...
(lm_head): Linear(in_features=512, out_features=57680, bias=False)
I haven't got as far as finding out whether a non-shared vocabulary is possible, and I have yet to see evidence of one.
For Marian-based models, HuggingFace now supports separate vocabularies for source and target, but some models may not, especially older models.
(As you know, OPUS-MT models are based on MarianMT. The MarianMT framework supports it.)
Before https://github.com/huggingface/transformers/pull/15831, HuggingFace used a shared vocabulary file for Marian.
This PR updates the Marian model:
To allow not sharing embeddings between encoder and decoder.
Allow tying only decoder embeddings with lm_head.
Separate two vocabs in tokenizer for src and tgt language
...
share_encoder_decoder_embeddings: to indicate if emb should be shared or not
So models trained with earlier versions of the framework, or with that parameter set to true (the default), only have one shared vocabulary file for source and target.
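As a rough illustration of how that flag could be checked without loading a model, here is a small sketch that reads a config.json as plain JSON. It is not the transformers API: `uses_shared_vocab` is a hypothetical helper, the two config strings are made-up fragments rather than downloaded files, and treating a missing key as "shared" is an assumption based on the older behaviour described above.

```python
import json


def uses_shared_vocab(config_text: str) -> bool:
    """Return True when a Marian-style config.json describes a shared vocabulary.

    Assumption: a config without the key predates the PR, so it is shared.
    """
    config = json.loads(config_text)
    return bool(config.get("share_encoder_decoder_embeddings", True))


# Hypothetical config fragments for illustration only.
old_style = '{"vocab_size": 57680}'
new_style = '{"vocab_size": 57680, "share_encoder_decoder_embeddings": false}'

print(uses_shared_vocab(old_style))
print(uses_shared_vocab(new_style))
```

For a real model, the same key can be inspected in the config.json shown on the model's file listing.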

How does Phoenix conglomerate cookie data?

I'm attempting to store some data in the session storage and I'm getting the same cookie error as this guy: the cookie is over the system byte limit of 4096.
This seems pretty straightforward: don't attempt to store more than the system limit in the session. Right, but I'm not attempting to do that. Clearly, the cookie is over 4096 bytes and my additions have caused it to overflow, but that doesn't explain where the data is.
The data I'm attempting to store is only 1500 bytes. In fact, the entire session that is being saved is 1500 bytes (the errored session). That's nowhere near the overflow limit. So that means one thing for certain: the data stored in :plug_session inside of conn is not the only data being stored inside of the session cookie.
This is the session that's throwing the CookieOverflowError:
:plug_session => %{
  "_csrf_token" => "XmE4kgdxk4D0NwwlfTL77Ic62t123123sdfh1s",
  "page_trail" => [{"/", "Catalog"}, {'/', "Catalog"}],
  "shopping_cart_changeset" => #Ecto.Changeset<
    action: nil,
    changes: %{
      order: #Ecto.Changeset<
        action: :insert,
        changes: %{
          address: #Ecto.Changeset<
            action: :insert,
            changes: %{
              address_one: "800 Arola Drive, apt 321, apt 321",
              address_two: "apt 321",
              city: "Wooster",
              company: "Thomas",
              country: "US",
              name: "user one",
              phone: "3305551111",
              state: "WV",
              zip_code: "44691"
            },
            errors: [],
            data: #FulfillmentCart.Addresses.Address<>,
            valid?: true
          >,
          priority: false,
          shipping_method: #Ecto.Changeset<
            action: :insert,
            changes: %{id: 2, is_priority?: false, name: "3 Day Select"},
            errors: [],
            data: #FulfillmentCart.ShippingMethods.ShippingMethod<>,
            valid?: true
          >
        },
        errors: [],
        data: #FulfillmentCart.Orders.Order<>,
        valid?: true
      >
    },
    errors: [],
    data: #FulfillmentCart.ShoppingCarts.ShoppingCart<>,
    valid?: true
  >,
  "user_id" => 8
},
I actually followed this guide on decoding a phoenix session cookie, and I get the session before the error.
Which gives me:
iex(8)> [_, payload, _] = String.split(cookie, ".", parts: 3)
["SFMyNTY",
"g3QAAAADbQAAAAtfY3NyZl90b2tlbm0AAAAYWU92dkRfVDh5UXlRTUh4TGlpRTQxOFREbQAAAApwYWdlX3RyYWlsbAAAAAJoAm0AAAABL20AAAAHQ2F0YWxvZ2gCawABL20AAAAHQ2F0YWxvZ2ptAAAAB3VzZXJfaWRhCA",
"Ytg5oklzyWMvtu1vyXVvQ2xBzdtMnS9zVth7LIRALsU"]
iex(9)> {:ok, encoded_term } = Base.url_decode64(payload, padding: false)
{:ok,
<<131, 116, 0, 0, 0, 3, 109, 0, 0, 0, 11, 95, 99, 115, 114, 102, 95, 116, 111,
107, 101, 110, 109, 0, 0, 0, 24, 89, 79, 118, 118, 68, 95, 84, 56, 121, 81,
121, 81, 77, 72, 120, 76, 105, 105, 69, 52, 49, ...>>}
iex(10)> :erlang.binary_to_term(encoded_term)
%{
"_csrf_token" => "YOvvD_T8yQyQMHxLiiE418TD",
"page_trail" => [{"/", "Catalog"}, {'/', "Catalog"}],
"user_id" => 8
}
iex(11)>
This is 127 bytes, so the addition of the 1500 bytes isn't the problem. It's the other allocation of storage that isn't represented inside of the session. What is that?
My estimate of the byte size of the text in :plug_session was correct, but that is not what makes the cookie overflow. The cookie overflows because the encoded version of :plug_session is too big, not because the decoded text is. I figured this out by creating multiple cookies and looking at the byte_size of the data.
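To make the "encoded is bigger than decoded" point concrete, here is a rough Python sketch that repeats the iex decoding steps on the payload from the question and compares sizes. It only illustrates the Base64 overhead and is not how Plug itself works; `encoded_size` is a hypothetical helper, and the cookie's prefix and signature segments come on top of what it computes.

```python
import base64

# Payload segment copied from the iex session above (the middle part of the cookie).
payload = (
    "g3QAAAADbQAAAAtfY3NyZl90b2tlbm0AAAAYWU92dkRfVDh5UXlRTUh4TGlpRTQxOFREbQAAAApw"
    "YWdlX3RyYWlsbAAAAAJoAm0AAAABL20AAAAHQ2F0YWxvZ2gCawABL20AAAAHQ2F0YWxvZ2ptAAAA"
    "B3VzZXJfaWRhCA"
)

# The payload is unpadded (iex decodes it with padding: false), so restore the padding.
padded = payload + "=" * (-len(payload) % 4)
decoded = base64.urlsafe_b64decode(padded)

print(list(decoded[:6]))  # [131, 116, 0, 0, 0, 3], the same leading bytes iex shows
print(len(decoded), len(payload))  # serialized term size vs. its size inside the cookie


def encoded_size(raw_bytes: int) -> int:
    """Length of the unpadded Base64 text for raw_bytes bytes of input."""
    return (raw_bytes * 8 + 5) // 6


# About 3 KB of serialized term already fills 4096 characters of cookie payload.
print(encoded_size(3072))  # 4096
```

Base64 inflates the serialized term by roughly a third, so the size to watch is that of the encoded term, not of the text printed by inspect.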
Save a new cookie
conn = put_resp_cookie(conn, "address",
changeset.changes.order.changes.address.changes, sign: true)
Get a saved cookie
def get_resp_cookie(conn, attribute) do
  cookie = conn.req_cookies[attribute]
  case cookie != nil do
    false ->
      {:invalid, %{}}
    true ->
      # signed cookie layout: prefix.payload.signature
      [_, payload, _] = String.split(cookie, ".", parts: 3)
      {:ok, encoded_term} = Base.url_decode64(payload, padding: false)
      # only the stored value is needed; the other two fields are ignored
      {val, _max, _max_age} = :erlang.binary_to_term(encoded_term)
      {:valid, val}
  end
end
get_resp_cookie/2 pattern matching
address_map =
  case Connection.get_resp_cookie(conn, "address") do
    {:invalid, val} ->
      IO.puts("Unable to find cookie.")
      val
    {:valid, val} ->
      val
  end
I made a few changes to the way I save the data from when I posted this. Namely I am now storing a map of changes, not the actual changeset...which means that the session most likely would've worked for me all along.
I think the answer to this issue was that the encoded %Ecto.Changeset{} was too big for the cookie to hold.
If you use this solution then be wary: you have to manage the newly created cookies yourself.

How can I add multiple datasets in the chart.js package in Laravel

How can I add multiple datasets with the ConsoleTVs/Charts chart.js package in Laravel?
My single-dataset code is running well:
$data['transactionChart'] = new TransactionChart();
$data['transactionChart']->dataset('Sample', 'line',[100, 65, 84, 45, 90])
->options(['borderColor' => '#97d881']);
Simply use ->dataset() multiple times.
https://github.com/ConsoleTVs/Charts/issues/331
Example:
$data['transactionChart'] = new TransactionChart();
$data['transactionChart']->dataset('Sample', 'line',[100, 65, 84, 45, 90])
->options(['borderColor' => '#97d881']);
$data['transactionChart']->dataset('Another Sample', 'line',[100, 65, 84, 45, 90])
->options(['borderColor' => '#ff0000']);

Same string but different byte codes

I have two strings:
a = 'hà nội'
b = 'hà nội'
When I compare them with a == b, it returns false.
I checked the byte codes:
a.bytes = [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]
b.bytes = [104, 195, 160, 32, 110, 225, 187, 153, 105]
What is the cause? How can I fix it so that a == b returns true?
This is an issue with Unicode equivalence.
In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters.
a.unicode_normalize == b.unicode_normalize
unicode_normalize(form=:nfc)
Returns a normalized form of str, using Unicode normalizations NFC,
NFD, NFKC, or NFKD. The normalization form used is determined by form,
which is any of the four values :nfc, :nfd, :nfkc, or :nfkd. The
default is :nfc.
If the string is not in a Unicode Encoding, then an Exception is
raised. In this context, 'Unicode Encoding' means any of UTF-8,
UTF-16BE/LE, and UTF-32BE/LE, as well as GB18030, UCS_2BE, and
UCS_4BE. Anything else than UTF-8 is implemented by converting to
UTF-8, which makes it slower than UTF-8.
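The same effect can be reproduced outside Ruby. The sketch below uses Python's standard unicodedata module rather than Ruby's unicode_normalize, rebuilding both strings from the byte lists in the question. It shows that a uses combining marks (decomposed form) while b uses precomposed characters, and that NFC normalization makes them equal.

```python
import unicodedata

# Rebuild both strings from the byte lists given in the question.
a = bytes([104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]).decode("utf-8")
b = bytes([104, 195, 160, 32, 110, 225, 187, 153, 105]).decode("utf-8")

print(a == b)  # False: different code point sequences
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True

# Listing the code points makes the difference visible.
print([f"U+{ord(ch):04X}" for ch in a])
print([f"U+{ord(ch):04X}" for ch in b])
```

Normalizing both sides to the same form (NFC here, matching Ruby's default) is what matters; comparing one normalized string against one raw string can still fail.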

use for loop to call multiple functions in lua

I want to call multiple methods in Lua that are very similar except that their parameters change by one character. The way I'm doing it now works but is extremely inefficient.
function scene:createScene(event)
    screenGroup = self.view
    level1 = display.newRoundedRect( 50, 110, 50, 50, 5 )
    level1:setFillColor( 100,0,200 )
    level2 = display.newRoundedRect( 105, 110, 50, 50, 5 )
    level2:setFillColor (100,200,0)
    --and so on so forth
    screenGroup:insert (level1)
    screenGroup:insert (level2)
    screenGroup:insert (level3)
    screenGroup:insert (level4)
end
I plan on extending the screenGroup:insert calls to hundreds of levels, maybe up to level300. As you can see, the way I'm doing it now is inefficient. I tried doing
for i=1, 4, 1 do
    screenGroup:insert(level..i)
end
but I get the error "table expected."
The best way in this case is to probably use a table:
local levels = {}
levels[1] = display.newRoundedRect( 50, 110, 50, 50, 5 )
levels[1]:setFillColor( 100,0,200 )
levels[2] = display.newRoundedRect( 105, 110, 50, 50, 5 )
levels[2]:setFillColor (100,200,0)
--and so on so forth
for _, level in ipairs(levels) do
    screenGroup:insert(level)
end
For other alternatives check the SO answer from @EtanReisner's comment.
If your 'level' tables are global, which it appears they are, you can use getfenv to index them (getfenv exists in Lua 5.1; it was removed in Lua 5.2).
for i = 1, number_of_levels do
    screenGroup:insert(getfenv()["level" .. i])
end
getfenv returns the environment, with all global variables, in the form of a dictionary. Therefore, you can index it like a normal table, e.g. getfenv()["key"].
