Vowpal Wabbit: obtaining a readable_model when in --daemon mode

I am trying to stream my data to vw in --daemon mode, and at the end I would like to obtain the value of the coefficient for each variable.
Therefore I'd like vw in --daemon mode to either:
- send me back the current value of the coefficients for each line of data I send.
- Write the resulting model in the "--readable_model" format.
I know about the dummy-example trick (sending a line like save_namemodel | ...) to get vw in daemon mode to save the model to a given file, but that isn't enough, as I can't read the coefficient values from that binary file.
Any idea how I could solve my problem?

Unfortunately, on-demand saving of readable models isn't currently supported in the code, but it shouldn't be too hard to add. Open-source software is there for users to improve according to their needs. You may open an issue on GitHub, or better, contribute the change.
See:
this code line, where only the binary regressor is saved using save_predictor(). One could envision an "rsave" or "saver" tag/command that stores the regressor in readable form, as is done in this code line.
As a workaround you may call vw with --audit and parse every audit line for the feature names and their current weights (a rough parsing sketch follows after this list), but this would:
- make vw much slower
- require parsing every line to get the values, rather than getting them on demand
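For illustration, here is a minimal parsing sketch in Python. It assumes the audit feature tokens have the shape name:hash:value:weight on tab-indented lines, and the log file name is made up; the exact audit format can vary between vw versions, so treat this as a starting point rather than a definitive parser.
import re

# Sketch: pull feature names and current weights out of `vw --audit` output.
# Assumption: feature tokens look like  namespace^feature:hash:value:weight
# and appear on tab-indented audit lines (adjust the regex for your vw version).
AUDIT_TOKEN = re.compile(
    r'(?P<name>\S+):(?P<hash>\d+):(?P<value>[-+.\deE]+):(?P<weight>[-+.\deE]+)'
)

def parse_audit_stream(lines):
    """Yield (feature_name, weight) pairs seen in an audit output stream."""
    for line in lines:
        if not line.startswith('\t'):        # prediction lines are not indented
            continue
        for m in AUDIT_TOKEN.finditer(line):
            yield m.group('name'), float(m.group('weight'))

if __name__ == '__main__':
    with open('vw_audit.log') as f:          # hypothetical file name
        for name, weight in parse_audit_stream(f):
            print(name, weight)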

Related

Is there a way to save only the model with the Hugging Face Trainer?

I want to keep multiple checkpoints during training to analyse them later but the Trainer also saves other files to resume training. Is there a way to only save the model to save space and writing time?
15K rng_state.pth
906 trainer_state.json
623 scheduler.pt
2,1G optimizer.pt
2,5K training_args.bin
1,1G pytorch_model.bin
900 config.json
I could just delete the optimizer after training, but I'm also working with a disk with slow write speed, so that is also a consideration.
Unfortunately, there is currently no way to disable the saving of individual files. There are basically two ways to get the behavior you want:
The "hacky" way would be to simply disable the line of code in the Trainer source code that stores the optimizer, which (if you train on your local machine) should be this one. However, whenever you update your transformers version, this could lead to ugly behavior, which is why I recommend the second one.
Override the _save_checkpoint() method in your own Trainer subclass. This way you always guarantee that only the files you want are saved, and you don't have to touch the library's code.
An outline for what this looks like:
from transformers import Trainer

class OwnTrainer(Trainer):
    # Don't forget to correctly set up __init__()
    # ...
    def _save_checkpoint(self, model, trial, metrics=None):
        # insert your own behavior here
        pass
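For a more concrete (but still hypothetical) sketch, the override below only writes the model files into the usual checkpoint-<step> folder via Trainer.save_model(), so no optimizer.pt, scheduler.pt, rng_state.pth or trainer_state.json are produced. Note that you then cannot resume training from these checkpoints, and the _save_checkpoint() signature has changed between transformers versions, so match it to the version you have installed.
import os
from transformers import Trainer
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR

class ModelOnlyTrainer(Trainer):
    # Hypothetical Trainer subclass that keeps only the model per checkpoint.
    def _save_checkpoint(self, model, trial, metrics=None):
        # Build the usual checkpoint-<global_step> folder name, but save only
        # the model (config.json + weights), skipping optimizer/scheduler/rng state.
        checkpoint_dir = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
        output_dir = os.path.join(self.args.output_dir, checkpoint_dir)
        self.save_model(output_dir)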

TwinCAT fails to save data to CSV

I am part of a tractor pulling team and we have a Beckhoff CX8190 based PLC for data logging. The system works most of the time, but every now and then saving the sensor values (collected every 10 ms) to CSV fails (mostly in the middle of a CSV row). The guy who built the code is new to TwinCAT and does not know how to find what causes this. Any ideas where to look for the reason?
Writing to a file is always an asynchronous action in TwinCAT. That is to say, it is not a real-time action, and there is no guarantee that the writing process finishes within the task cycle time of 10 ms. Therefore these function blocks always have a BUSY output which has to be evaluated, and the function block has to be called successively until the BUSY output returns to FALSE. Only then can a new write command be executed.
I normally tackle this task with a double-buffer algorithm (see the sketch below). Let's say the buffer array has 2x100 entries. Fill up the first 100 entries with sample values, then write them all together to the file with one command. When that's done, clear that half of the buffer. In the meantime the other half of the buffer can be filled with sample values. When the second half is full, write it all together to the file, and so on. This way you have much more time for the file system access (in the example above 100 x 10 ms = 1 s) than the 10 ms task cycle time.
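To make the buffering logic concrete, here is a language-agnostic sketch written in Python (the real implementation would be IEC 61131-3 Structured Text using the TwinCAT file-write function blocks); the names and the buffer size are assumptions for illustration only.
BUFFER_SIZE = 100

class DoubleBuffer:
    def __init__(self):
        self.buffers = [[], []]
        self.active = 0                        # side currently filled by the fast 10 ms task

    def add_sample(self, row):
        """Called every task cycle with one row of sensor values."""
        self.buffers[self.active].append(row)
        if len(self.buffers[self.active]) >= BUFFER_SIZE:
            full = self.buffers[self.active]
            self.active = 1 - self.active      # switch sides; sampling continues uninterrupted
            self.buffers[self.active].clear()  # make sure the new side starts empty
            return full                        # hand the full side to the slow writer
        return None

def write_rows(path, rows):
    # The slow, non-real-time part: one file access per 100 samples (about 1 s)
    # instead of one per 10 ms cycle.
    with open(path, "a") as f:
        f.write("\n".join(",".join(str(v) for v in row) for row in rows) + "\n")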
But this is just a suggestion out of my experience. I agree with the others, some code could really help.

When we specify "--algorithms=sgd" in vw-hyperopt, does it run with adaptive, normalised and invariant updates?

The confusion arises because when we specify --sgd on the vw command line, it runs classic SGD, without adaptive, normalised and invariant updates. So when we specify the algorithm as sgd in vw-hyperopt, does it run as classic SGD or with the special updates? Is it mandatory to specify an algorithm in vw-hyperopt? Which is the default algorithm? Thank you.
Looking at the source code confirms that --algorithm sgd here simply leaves the defaults alone.
This is different from vw --sgd: it does not pass --sgd to vw, so it does not disable the defaults. In other words: yes, the adaptive, normalized and invariant updates will still be in effect.
Also: you can verify this further by looking at the log file created by vw-hyperopt in the current directory and checking that it has no --sgd option in it (a quick check is sketched below the example). This log includes the full vw command line it executes for training and testing, e.g.:
2020-09-08 00:58:45,053 INFO [root/vw-hyperopt:239]: executing the following command (training): vw -d mydata.train -f ./current.model --holdout_off -c ... --loss_function quantile
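A tiny, hypothetical check of that log (the file name is an assumption; use whatever file vw-hyperopt actually wrote in your working directory):
# Print each logged vw command line and whether it contains --sgd.
with open('vw-hyperopt.log') as f:
    for line in f:
        if 'executing the following command' in line:
            print('--sgd present:', '--sgd' in line)
            print(line.strip())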

NiFi: how to avoid copying files that are partially written

I am trying to use NiFi to get a file from an SFTP server. Potentially the file can be big, so my question is how to avoid fetching the file while it is still being written. I am planning to use ListSFTP + FetchSFTP, but I am also okay with GetSFTP if it can avoid copying partially written files.
Thank you.
In addition to Andy's solid answer you can also be a bit more flexible by using the ListSFTP/FetchSFTP processor pair by doing some metadata based routing.
After ListSFTP each flowfile will have attributes such as 'file.lastModifiedTime' and others. You can read about them here https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.3.0/org.apache.nifi.processors.standard.ListSFTP/index.html
You can put a RouteOnAttribute processor in between the List and Fetch to detect objects that, at least based on the reported last modified time, are 'too new'. You could route those to a processor that is just a slow pass-through, to intentionally wait a bit, and then run them back through the first router until they are 'old enough'. This is admittedly a power-user approach, but it gives you a lot of flexibility and control. It is not foolproof, as the source system may not report the last modified time correctly, and a stable timestamp may not mean the source file is done being written, etc. But it gives you additional options if you cannot do the definitely correct thing that Andy describes.
If you have control over the process which writes the file, a common pattern to solve this is to initially write the file with a specific naming structure, such as beginning with a dot ('.'). After the successful write operation, the file is renamed without the dot and only then is it picked up by the processor. Both GetSFTP and ListSFTP have a processor property called Ignore Dotted Files, which is set to true by default and means those processors will not operate on or return files beginning with the dot character.
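For illustration, a minimal writer-side sketch of that pattern (the function and file handling are hypothetical):
import os

def publish(path, data):
    # Write to a hidden temporary name first, then rename once the write is
    # complete, so GetSFTP/ListSFTP (with Ignore Dotted Files = true) never
    # see a partially written file.
    tmp = os.path.join(os.path.dirname(path), '.' + os.path.basename(path))
    with open(tmp, 'wb') as f:
        f.write(data)
    os.rename(tmp, path)   # becomes visible to NiFi only after the rename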
There is a minimum file age property you can use. The last modification time gets updated as the file is being written, so setting this property to something other than 0 will help fix the problem.

How to use RRDTool/Cacti to count "user activities" in apache access logs?

Goal
I wish to use RRDTool to count logical "user activity" from our web application's apache/tomcat access logs.
Specifically we want to count, for a period, occurrences of several url patterns.
Example
We have two applications (call them 'foo' and 'bar')
These URLs interest us; they indicate when users 'did interesting stuff':
/foo/hop
/foo/skip
/foo/jump
/bar/crawl
/bar/walk
/bar/run
Basically we want to know, for a given interval (10 minutes, hour, day, etc.), how many users hopped, skipped, jumped, crawled, walked, and so on.
Reference/Starting point
This article on importing access logs into RRDTool seemed like a helpful starting point.
http://neidetcher.com/programming/2014/05/13/just-enough-rrdtool.html
However, to clarify: that example uses the access log directly, whereas we want to group a handful of URLs into 'buckets' and count the number in each bucket.
Some Scripting Required..
I could do this with bash & grep & wc, iterating through the patterns and sending output to an 'intermediate results' text file.
That said, I believe RRDTool could do this with minimal 'outside coding', but I am unclear on the details.
Some points
I mention 'two applications' because we actually serve them from separate servers with different log file formats. I'd like to get them into the same RRA file.
Eventually I'd like to report this in Cacti; initially, however, I wanted to understand the RRDTool details.
I'm open to doing any coding, but would like to keep it as efficient as possible, both administratively and in computer resources. (By administratively, I mean: easy to monitor new instances.)
I am very new to RRDTool and am RTFM'ing (and walking through the tutorial). I'm used to relational databases, spreadsheets, etc., and don't yet have my mind around all the nuances of the RRA format.
Thanks in advance!
You could set up a separate RRD file with an ABSOLUTE-type data source for each URL you want to track.
Then you tail the log file, and whenever you see one of the interesting URLs rush by, you call:
rrdtool update url-xyz.rrd N:1
The ABSOLUTE data source type is like a counter, but it gets reset every time it is read. Your counter will just count to one, but that should not be a problem.
In the example above I am using N: and not the timestamp from the access log. You could also use the log timestamp if you are not doing this in real time, but beware that you cannot update the same rrd file twice with the same timestamp. N: uses millisecond-precision timestamps internally and thus will probably avoid this problem.
On the other hand, it may make more sense to accumulate matching log entries that share a timestamp and update rrdtool with that count only once the timestamp in the log file changes.
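Putting the answer together, here is a small hypothetical sketch in Python that follows the access log and bumps one ABSOLUTE-type RRD per interesting URL; the log path, URL list, and rrd file names are assumptions, and the RRD files are assumed to already exist with an ABSOLUTE data source.
import subprocess
import time

# Hypothetical mapping of interesting URL patterns to their rrd files.
PATTERNS = {
    '/foo/hop':   'url-foo-hop.rrd',
    '/foo/skip':  'url-foo-skip.rrd',
    '/foo/jump':  'url-foo-jump.rrd',
    '/bar/crawl': 'url-bar-crawl.rrd',
    '/bar/walk':  'url-bar-walk.rrd',
    '/bar/run':   'url-bar-run.rrd',
}

def follow(path):
    """Yield lines appended to the log file (a minimal tail -f)."""
    with open(path) as f:
        f.seek(0, 2)                      # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

for line in follow('/var/log/apache2/access.log'):   # hypothetical log path
    for url, rrd in PATTERNS.items():
        if url in line:
            # ABSOLUTE data sources reset on read, so a "1" per hit is enough.
            subprocess.run(['rrdtool', 'update', rrd, 'N:1'], check=False)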
