Transformers fine tune model warning - huggingface-transformers

python run_clm.py \
--model_name_or_path ctrl \
--train_file df_finetune_train.csv \
--validation_file df_finetune_test.csv \
--do_train \
--do_eval \
--preprocessing_num_workers 72 \
--block_size 256 \
--output_dir ./finetuned
I am trying to fine-tune the ctrl model on my own dataset, where each row is one sample.
However, I get the warning below.
[WARNING|tokenization_utils_base.py:3213] 2021-03-25 01:32:22,323 >>
Token indices sequence length is longer than the specified maximum
sequence length for this model (934 > 256). Running this sequence
through the model will result in indexing errors
What causes this warning, and how can I fix it?
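A likely explanation, offered as an assumption rather than a confirmed diagnosis: the tokenizer emits this warning whenever a single row tokenizes to more tokens than the model's maximum (here 934 versus 256), before any chunking has happened. Language-modeling scripts of this kind typically concatenate the tokenized rows and cut them into pieces of --block_size tokens afterwards, in which case no over-long sequence would reach the model. A minimal sketch of that chunking step in plain Python (the name group_into_blocks is hypothetical, not taken from run_clm.py):

```python
from typing import List


def group_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split a long token sequence into consecutive blocks of block_size tokens.

    The trailing remainder shorter than block_size is dropped, which is one
    common convention; other scripts may pad it instead.
    """
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


# A 934-token row, as in the warning, cut into 256-token blocks.
row = list(range(934))
blocks = group_into_blocks(row, block_size=256)
print(len(blocks), [len(b) for b in blocks])
```

If the script does chunk its inputs this way, the warning is harmless; if it does not, truncating each row at tokenization time would be the alternative to look into.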

Related

Output training losses over iterations/epochs to file from trainer.py in HuggingFace Transformers

In HuggingFace's Transformers library, only the evaluation metrics are written to a file, named eval_results_{dataset}.txt, in the "output_dir" when running run_glue.py. That file contains the metrics associated with the dataset, e.g. accuracy for MNLI and the evaluation loss.
Can a parameter be passed to run_glue.py to generate a training_results_{dataset}.txt file that tracks the training loss, or would I have to build that functionality myself?
My script, run_python_script_glue.bash:
GLUE_DIR=../../huggingface/GLUE_SMALL/
TASK_NAME=MNLI
ID=OT
python3 run_glue.py \
--local_rank -1 \
--seed 42 \
--model_type albert \
--model_name_or_path albert-base-v2 \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--data_dir $GLUE_DIR/$TASK_NAME \
--max_seq_length 128 \
--per_gpu_train_batch_size 8 \
--per_gpu_eval_batch_size 8 \
--gradient_accumulation_steps 2\
--learning_rate 3e-5 \
--max_steps -1 \
--warmup_steps 1000\
--doc_stride 128 \
--num_train_epochs 3.0 \
--save_steps 9999\
--output_dir ./results/GLUE_SMALL/$TASK_NAME/ALBERT/$ID/ \
--do_lower_case \
--overwrite_output_dir \
--label_noise 0.2\
--att_kl 0.01\
--att_se_hid_size 16\
--att_se_nonlinear relu\
--att_type soft_attention \
--adver_type ot \
--rho 0.5 \
--model_type whai \
--prior_gamma 2.70 \
--three_initial 0.0
In trainer.py in the Transformers library, the training loss during the training step is held in tr_loss:
tr_loss = self._training_step(model, inputs, optimizer, global_step)
loss_scalar = (tr_loss - logging_loss) / self.args.logging_steps
logs["loss"] = loss_scalar
logging_loss = tr_loss
In this code, the accumulated training loss is averaged over the logging steps and then stored in the logs dictionary. logs['loss'] is later printed to the terminal but not written to a file. Is there a way to also write it to a txt file?
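The averaging described above, plus an append to a text file, could look roughly like the following. This is a standalone sketch, not a patch to trainer.py: the function name log_training_loss, the file name training_results.txt, and the list of per-step losses are all assumptions made for illustration.

```python
import os
import tempfile
from typing import List


def log_training_loss(step_losses: List[float], logging_steps: int, path: str) -> List[float]:
    """Append the mean loss of every `logging_steps` steps to a text file.

    Mirrors the tr_loss / logging_loss bookkeeping: tr_loss accumulates the
    per-step losses, and the difference since the last log is averaged.
    """
    tr_loss = 0.0        # running sum of all step losses so far
    logging_loss = 0.0   # value of tr_loss at the previous log
    logged = []
    with open(path, "a") as f:
        for global_step, loss in enumerate(step_losses, start=1):
            tr_loss += loss
            if global_step % logging_steps == 0:
                loss_scalar = (tr_loss - logging_loss) / logging_steps
                logging_loss = tr_loss
                logged.append(loss_scalar)
                f.write(f"step {global_step}\tloss {loss_scalar:.4f}\n")
    return logged


# Hypothetical per-step losses, logged every 2 steps into a temporary file.
out_path = os.path.join(tempfile.mkdtemp(), "training_results.txt")
averages = log_training_loss([4.0, 2.0, 3.0, 1.0], logging_steps=2, path=out_path)
print(averages)
```

Opening the file in append mode means a resumed run keeps adding to the same log instead of overwriting it.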

Learning rate not set in run_mlm.py?

I want to run (or resume) the run_mlm.py script with a specific learning rate, but it doesn't seem like setting it in the script arguments does anything.
os.system(
f"python {script} \
--model_type {model} \
--config_name './models/{model}/config.json' \
--train_file './content/{data}/train.txt' \
--validation_file './content/{data}/test.txt' \
--learning_rate 6e-4 \
--weight_decay 0.01 \
--warmup_steps 6000 \
--adam_beta1 0.9 \
--adam_beta2 0.98 \
--adam_epsilon 1e-6 \
--tokenizer_name './tokenizer/{model}' \
--output_dir './{out_dir}' \
--do_train \
--do_eval \
--num_train_epochs 40 \
--overwrite_output_dir {overwrite} \
--ignore_data_skip"
)
After warm-up, the log indicates that the learning rate tops out at 1e-05, presumably a default from somewhere, though I'm not sure where (and certainly not 6e-4):
{'loss': 3.9821, 'learning_rate': 1e-05, 'epoch': 0.09}

How to fix 'Too few arguments to function Http\Adapter\Guzzle6\Client::buildClient()' when using 'GrahamCampbell/Laravel-GitHub'

I am building a web application with Laravel for managing an educational institution, and I need to add a collaborative workspace.
My idea is to back it with a GitHub repository; after searching the web I found 'GrahamCampbell/Laravel-GitHub'.
I installed it as described in the documentation, but when I test it I get the following error:
Too few arguments to function Http\Adapter\Guzzle6\Client::buildClient(),
0 passed in C:\Users\Fehmi\Dropbox\GRASP\vendor\php-http\guzzle6-adapter\src\Client.php on line 31 and exactly 1 expected
use GrahamCampbell\GitHub\Facades\GitHub;
class GitController extends Controller
{
public function FuncName ()
{
dd(GitHub::me()->organizations());
}
}
The result I get is:
Symfony \ Component \ Debug \ Exception \ FatalThrowableError (E_RECOVERABLE_ERROR)
Too few arguments to function Http\Adapter\Guzzle6\Client::buildClient(), 0 passed in C:\Users\Fehmi\Dropbox\GRASP\vendor\php-http\guzzle6-adapter\src\Client.php on line 31 and exactly 1 expected

Make sure to use the latest php-http/guzzle6-adapter version.
Only the one from May 2016 has a line 31 with $client = static::buildClient();, and it had an issue fixed in PR 32 to allow calling the buildClient() with no parameters.
GrahamCampbell/Laravel-GitHub only imposes a guzzle6 version as a range between 1.0 (included) and 2.0.
Maybe using ^2.0 or at least ^1.1 might help.

Unclear error report from RRDTool graphing script

When one version of a set of scripts that use RRDTool runs, you try more of the same...
I made a version of the Lua script that now collects power/energy data. The related file create_pipower1A_graph.sh is derived directly from the error-free sh file described in "RRDTool, How to get png-files by means of os-execute-call from lua-script?"
The derived sh file should produce a graph with the output of 3 inverters and the parallel consumption.
The sh file for the graphic output is below.
#!/bin/bash
rrdtool graph /home/pi/pipower1.png \
DEF:Pwr_MAC=/home/pi/pipower1.rrd:Power0430:AVERAGE \
DEF:Pwr_SAJ=/home/pi/pipower1.rrd:Power1530:AVERAGE \
DEF:Pwr_STECA=/home/pi/pipower1.rrd:Power2950:AVERAGE \
DEF:Pwr_Cons=/home/pi/pipower1.rrd:Power_Cons:AVERAGE \
LINE1:Pwr_MAC#ff0000:Output Involar \
LINE1:Pwr_SAJ#0000ff:Output SAJ1.5 \
LINE1:Pwr_STECA#5fd00b:Output STECA \
LINE1:Pwr_Cons#00ffff:Consumption \
COMMENT:"\t\t\t\t\t\t\l" \
COMMENT:"\t\t\t\t\t\t\l" \
GPRINT:Pwr_MAC:LAST:"Output_Involar Latest\: %2.1lf" \
GPRINT:Pwr_MAC:MAX:" Max.\: %2.1lf" \
GPRINT:Pwr_MAC:MIN:" Min.\: %2.1lf" \
COMMENT:"\t\t\t\t\t\t\l" \
GPRINT:Pwr_SAJ:LAST:"Output SAJ1.5k Latest\: %2.1lf" \
GPRINT:Pwr_SAJ:MAX:" Max.\: %2.1lf" \
GPRINT:Pwr_SAJ:MIN:" Min.\: %2.1lf" \
COMMENT:"\t\t\t\t\t\t\l" \
GPRINT:Pwr_STECA:LAST:"Output STECA Latest\: %2.1lf" \
GPRINT:Pwr_STECA:MAX:" Max.\: %2.1lf" \
GPRINT:Pwr_STECA:MIN:" Min.\: %2.1lf" \
COMMENT:"\t\t\t\t\t\t\l" \
GPRINT:Pwr_Cons:LAST:"Consumption Latest\: %2.1lf" \
GPRINT:Pwr_Cons:MAX:" Max.\: %2.1lf" \
GPRINT:Pwr_Cons:MIN:" Min.\: %2.1lf" \
COMMENT:"\t\t\t\t\t\t\l" \
--width 700 --height 400 \
--title="Graph B: Power Production & Consumption for last 24 hour" \
--vertical-label="Power(W)" \
--watermark "`date`"
The Lua script again runs without errors and, as a result, the rrd file is updated periodically and the graphic output is generated, but no graph appears! Tested on 2 different Raspberry Pis, with no difference in behaviour.
Running the sh file create_pipower1A_graph.sh from the command line produces the following errors.
pi#raspberrypi:~$ sudo /home/pi/create_pipower1A_graph.sh
ERROR: 'I' is not a valid function name
pi#raspberrypi:~$ ./create_pipower1A_graph.sh
ERROR: 'I' is not a valid function name
Question: I am puzzled, because nowhere in the sh file is an I used as a function. What is the explanation, and how can this error be remedied?

Your problem is here:
LINE1:Pwr_MAC#ff0000:Output Involar \
LINE1:Pwr_SAJ#0000ff:Output SAJ1.5 \
LINE1:Pwr_STECA#5fd00b:Output STECA \
LINE1:Pwr_Cons#00ffff:Consumption \
These lines need to be quoted as they contain spaces and hash symbols.
LINE1:"Pwr_MAC#ff0000:Output Involar" \
LINE1:"Pwr_SAJ#0000ff:Output SAJ1.5" \
LINE1:"Pwr_STECA#5fd00b:Output STECA" \
LINE1:"Pwr_Cons#00ffff:Consumption" \
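To see why the unquoted legends break the command, it can help to look at how a POSIX shell splits the line into arguments. The sketch below uses Python's shlex.split as a stand-in for the shell's word splitting (an approximation; it does not model every bash feature), with a shortened, hypothetical command line:

```python
import shlex

unquoted = "rrdtool graph out.png LINE1:Pwr_MAC#ff0000:Output Involar"
quoted = 'rrdtool graph out.png LINE1:"Pwr_MAC#ff0000:Output Involar"'

# Without quotes the space ends the LINE1 argument, so "Involar" is handed
# to rrdtool as a separate argument of its own.
unquoted_args = shlex.split(unquoted)
# With quotes the legend stays inside a single LINE1 argument.
quoted_args = shlex.split(quoted)

print(unquoted_args)
print(quoted_args)
```

rrdtool then has to interpret the stray Involar as a graph element in its own right, which is presumably where the complaint about 'I' comes from.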

Import sensor data to RRDtool DB

I am trying to import data into an RRDtool DB from a couple of temperature sensors, collected via an RFXtrx433e USB controller and written to .txt files.
My database is created like this:
# Script to create rrd-file
# 24h with 2,5 min resolution
# 7d with 5 min resolution
# 1y with 10 min resolution
# 20y with 1h resolution
directory="/home/pi/temp/rrddata/"
filename="domoticz_temp.rrd"
# Check if file already exists
if [ ! -f "$directory$filename" ]
then
# File doesn't exist, create new rrd-file
echo "Creating RRDtool DB for outside temp sensor"
rrdtool create $directory$filename \
--step 120 \
DS:probe:GAUGE:120:-50:60 \
DS:xxxx1:GAUGE:120:-50:60 \
DS:vardagsrum:GAUGE:120:-50:60 \
RRA:AVERAGE:0.5:1:576 \
RRA:AVERAGE:0.5:2:2016 \
RRA:AVERAGE:0.5:4:52560 \
RRA:AVERAGE:0.5:24:175200 \
RRA:MAX:0.5:1:5760 \
RRA:MAX:0.5:2:2016 \
RRA:MAX:0.5:4:52560 \
RRA:MAX:0.5:24:175200 \
RRA:MIN:0.5:1:5760 \
RRA:MIN:0.5:2:2016 \
RRA:MIN:0.5:4:52560 \
RRA:MIN:0.5:24:175200
echo "Done!"
else
echo $directory$filename" already exists, delete it first."
fi
Importing the sensor data:
rrdtool update /home/pi/temp/rrddata/domoticz_temp.rrd --template probe N:`head -n 1 </home/pi/temp/output/temp_probe.txt`
The imported text file contains just one row with a number (the temperature collected from the sensor by a Lua script).
The code that creates the graph:
rrdtool graph /home/pi/temp/output/img/test/hour.png \
-w 697 -h 287 -a PNG \
--slope-mode \
--start -6h --end now \
--vertical-label "Last 6 hour temperature" \
DEF:probe=/home/pi/temp/rrddata/domoticz_temp.rrd:probe:AVERAGE \
DEF:xxxx1=/home/pi/temp/rrddata/domoticz_temp.rrd:xxxx1:AVERAGE \
DEF:vardagsrum=/home/pi/temp/rrddata/domoticz_temp.rrd:vardagsrum:AVERAGE \
COMMENT:" Location Min Max Senaste\l" \
LINE1:probe#ff0000:"Utetemp" \
LINE1:0#ff0000: \
GPRINT:probe:MIN:" %5.1lf" \
GPRINT:probe:MAX:" %5.1lf" \
GPRINT:probe:LAST:" %5.1lf\n" \
LINE1:xxxx1#00ff00:"Xxxx1" \
LINE1:0#00ff00: \
GPRINT:probe:MIN:" %5.1lf" \
GPRINT:probe:MAX:" %5.1lf" \
GPRINT:probe:LAST:" %5.1lf\n" \
LINE1:vardagsrum#0000ff:"vardagsrum" \
LINE1:0#0000ff: \
GPRINT:probe:MIN:" %5.1lf" \
GPRINT:probe:MAX:" %5.1lf" \
GPRINT:probe:LAST:" %5.1lf\n" \
This gives me this graph: http://i.imgur.com/lnFxTik.png
Now to my questions:
1. Have I created the database and the rest of the script correctly? I think I should get NaN for the values that are not in the DB?
2. How do I import the rest of the sensors? They are in several similar TXT files.
3. Should/can I collect the data from the sensors in a better way to get them into the RRDtool DB?
I hope someone can help me.
New info!
My Lua script for collecting the sensor data:
commandArray = {}
if (devicechanged['Probe']) then
local file = io.open("/home/pi/temp/output/temp_probe.txt", "w")
file:write(tonumber(otherdevices_temperature['Probe']))
file:close()
end
if (devicechanged['Xxxx1']) then
local file = io.open("/home/pi/temp/output/temp_xxxx1.txt", "w")
file:write(tonumber(otherdevices_temperature['Xxxx1']))
file:close()
end
if (devicechanged['Vardagsrum']) then
local file = io.open("/home/pi/temp/output/temp_vardagsrum.txt", "w")
file:write(tonumber(otherdevices_temperature['Vardagsrum']))
file:close()
end
return commandArray

Yes, if a value is missing you get NaN. Your create statement looks OK ... although 20y with 1h resolution ... wow!
Importing from several text files would work like this:
A=`perl -ne 'chomp;print;exit' xx1.txt`
B=`perl -ne 'chomp;print;exit' xx2.txt`
rrdtool update domoticz_temp.rrd --template xx1:xx2 N:$A:$B
Yes, instead of writing them to a file first, I would recommend updating the rrd file directly.
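The same idea can be sketched in Python: read the first line of each sensor file and assemble a single rrdtool update call for all data sources. The helper name build_update_command and the demo readings are assumptions for illustration; the DS names and file names are those from the question, and a missing or empty file is reported as U, which rrdtool treats as unknown.

```python
import os
import tempfile
from typing import Dict, List


def build_update_command(rrd_path: str, sensor_files: Dict[str, str]) -> List[str]:
    """Build one `rrdtool update` argument list from one text file per sensor.

    sensor_files maps a data-source name to a file whose first line holds
    the latest reading. A missing or empty file becomes "U" (unknown).
    """
    names, values = [], []
    for ds_name, path in sensor_files.items():
        try:
            with open(path) as f:
                value = f.readline().strip()
        except OSError:
            value = ""
        names.append(ds_name)
        values.append(value if value else "U")
    return ["rrdtool", "update", rrd_path,
            "--template", ":".join(names),
            "N:" + ":".join(values)]


# Demo with temporary files; the vardagsrum file is deliberately missing.
tmp = tempfile.mkdtemp()
for name, reading in (("temp_probe.txt", "3.5\n"), ("temp_xxxx1.txt", "21.0\n")):
    with open(os.path.join(tmp, name), "w") as f:
        f.write(reading)

cmd = build_update_command("domoticz_temp.rrd", {
    "probe": os.path.join(tmp, "temp_probe.txt"),
    "xxxx1": os.path.join(tmp, "temp_xxxx1.txt"),
    "vardagsrum": os.path.join(tmp, "temp_vardagsrum.txt"),
})
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run(cmd, check=True) would execute the update on a machine where rrdtool is installed.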

# 24h with 2,5 min resolution
# 7d with 5 min resolution
# 1y with 10 min resolution
# 20y with 1h resolution
...
rrdtool create $directory$filename \
--step 120 \
DS:probe:GAUGE:120:-50:60 \
DS:xxxx1:GAUGE:120:-50:60 \
DS:vardagsrum:GAUGE:120:-50:60 \
RRA:AVERAGE:0.5:1:576 \
RRA:AVERAGE:0.5:2:2016 \
RRA:AVERAGE:0.5:4:52560 \
RRA:AVERAGE:0.5:24:175200 \
OK, you seem to have a 2min step size, and your RRAs are consolidating 1, 2, 4 and 24 steps. This corresponds to 2min, 4min, 8min and 48min, not to 2.5min, 5min, 10min and 1h. Maybe your step should be 150? Also, the heartbeat on your DSs is the same as your step, which might cause you to lose data. Generally speaking, the heartbeat should be about 1.5 to 2 times the step size to allow for irregular data arrival.
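The arithmetic behind these figures can be checked with a few lines of Python. The helper describe_rra is a hypothetical name; the steps-per-row and row counts are the ones from the AVERAGE RRAs in the create statement.

```python
def describe_rra(step_seconds: int, steps_per_row: int, rows: int):
    """Return (resolution in minutes, retention in days) for one RRA."""
    resolution_s = step_seconds * steps_per_row
    retention_s = resolution_s * rows
    return resolution_s / 60, retention_s / 86400


# (steps per row, number of rows) for the four AVERAGE RRAs.
rras = [(1, 576), (2, 2016), (4, 52560), (24, 175200)]

for step in (120, 150):
    print(f"step = {step}s")
    for steps_per_row, rows in rras:
        minutes, days = describe_rra(step, steps_per_row, rows)
        print(f"  {steps_per_row:>2} steps/row: {minutes:g} min resolution, {days:g} days kept")
```

With a 150 s step the four RRAs come out at 2.5, 5, 10 and 60 minutes, covering 1, 7, 365 and 7300 days, which matches the comments in the create script.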
However, none of this relates to your 'unknown' question, much of which Tobi has already answered.
1. You will get 'unknown' on timeslots you have not loaded, yes.
2 and 3. Since you have a single RRD you need to have all the samples updated at the same timestamp, in the same operation. In this case, you're probably better off collecting them all at once and storing them into the same file, so that you can load them together and store into the RRD together. If this is an issue, and the sensors are probed independently, then I'd advise having a separate RRD for each sensor, so that you can update them independently. You can still generate a graph over all 3 together as you can define your graph DEFs to point to different RRD files no problem. This might be a better way to do it.
And Tobi's right about a 20y RRA possibly being somewhat excessive ;)
