When we specify "--algorithms=sgd" in vw-hyperopt, does it run with adaptive, normalised and invariant updates? - vowpalwabbit

The confusion arises because when we specify --sgd on the vw command line, it runs classic SGD, without adaptive, normalised and invariant updates. So when we specify the algorithm as sgd in vw-hyperopt, does it run classic SGD or SGD with the special updates? Is it mandatory to specify an algorithm in vw-hyperopt? Which is the default algorithm? Thank you.

Looking at the source code confirms that --algorithm sgd here simply leaves vw's defaults alone.
This is different from vw --sgd: vw-hyperopt does not pass --sgd to vw, so the defaults are not disabled. IOW: yes, the adaptive, normalized and invariant updates will still be in effect.
You can verify this further by looking at the log file that vw-hyperopt creates in the current directory and checking that it contains no --sgd option. This log includes the full vw command line executed for training and testing, e.g.:
2020-09-08 00:58:45,053 INFO [root/vw-hyperopt:239]: executing the following command (training): vw -d mydata.train -f ./current.model --holdout_off -c ... --loss_function quantile
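The log check described above can be automated. The following is a minimal sketch, assuming the log lines look like the sample above ("executing the following command (training): vw ..."); the function names are hypothetical and the path of the log file is whatever vw-hyperopt wrote in your working directory.

```python
import re

# Assumption: command lines in the log follow the sample shown above.
CMD_RE = re.compile(r"executing the following command \((\w+)\): (.*)$")

def find_vw_commands(log_lines):
    """Return (phase, command) pairs for every vw command found in the log."""
    commands = []
    for line in log_lines:
        match = CMD_RE.search(line)
        if match:
            commands.append((match.group(1), match.group(2).strip()))
    return commands

def uses_classic_sgd(command):
    """True if the command line contains a stand-alone --sgd token."""
    return "--sgd" in command.split()

def check_log(path):
    """Print each logged vw command together with whether it passes --sgd."""
    with open(path) as handle:
        for phase, command in find_vw_commands(handle):
            print(phase, "uses --sgd:", uses_classic_sgd(command))
```

Calling check_log() with the path of the log file prints one line per executed command; if no command reports --sgd, vw ran with its default (adaptive, normalized, invariant) updates.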

Related

Where to view logged results in Veins 5.1

I'm somewhat new to Veins and I'm trying to record collision statistics within the sample "RSUExampleScenario" provided in the VM. I found this question which describes what line to add to the .ini file, which I have, but I'm unable to find the "ncollisions" value in the results folder, which makes me think either I ran the wrong .ini line or am looking in the wrong place.
Thanks!
Because collision statistics take time to compute (essentially: trying to decode every transmission twice: once while considering interference by other nodes as usual, then trying again while ignoring all interference), Veins 5.1 requires you to explicitly turn collision statistics on. As discussed in https://stackoverflow.com/a/52103375/4707703, this can be achieved by adding a line *.**.nic.phy80211p.collectCollisionStatistics = true to omnetpp.ini.
After altering the Veins 5.1 example simulation this way and running it again (e.g., by running ./run -u Cmdenv -c Default from the command line), the ncollisions field in the resulting .sca file should now (sometimes) have non-zero values.
You can quickly verify this by running (from the command line)
opp_scavetool export --filter 'module("**.phy80211p") and name("ncollisions")' results/Default-\#0.sca -F CSV-R -o collisions.csv
The resulting collisions.csv should now contain a line containing (among other information) param,,,*.**.nic.phy80211p.collectCollisionStatistics,true (indicating that the simulation was executed with the required configuration) as well as many lines containing (among other information) scalar,RSUExampleScenario.node[10].nic.phy80211p,ncollisions,,,1 (indicating that node[10] could have received one more message, had it not been for interference caused by other transmissions in the simulation).
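To summarise the exported file without reading it by eye, the scalar rows can be collected per module. This is only a sketch: it assumes that each scalar row contains the fields in the order shown in the fragment above (scalar, module, name, two attribute fields, value), and the function name is hypothetical.

```python
import csv
import io

def collisions_per_module(csv_text):
    """Map module path -> ncollisions value for each scalar row in the export."""
    result = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if "scalar" not in row:
            continue
        i = row.index("scalar")
        # Assumed layout after "scalar": module, name, attrname, attrvalue, value
        if len(row) > i + 5 and row[i + 2] == "ncollisions":
            try:
                result[row[i + 1]] = float(row[i + 5])
            except ValueError:
                continue  # value field empty or not numeric
    return result
```

Passing the contents of collisions.csv to collisions_per_module() gives a dictionary whose non-zero entries are the modules that recorded collisions.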

Makefile: store warning count into variable without using temp file

I would like to improve an existing Makefile, so it prints out the number of warnings and/or errors that were encountered during the build process.
My basic idea is that there must be a way to pipe the output to grep and have the number of occurrences of a certain string (e.g. "Warning:") in either the stderr or the stdout stream stored in a variable that can then simply be echoed at the end of the make command.
Requirements / Challenges:
1. Current console output and exit code must remain exactly the same
That means without changing even control characters. Devs using the Makefile must not notice any difference from the output prior to my change (except for a nice, additional warning count at the end of the make process). None of the approaches with tee I have tried so far were successful, as the color coding of stderr messages in the console is lost, turning them all black & white.
2. Must be system-independent
The project is currently being built by Win/OSX/Linux devs and thus needs to work with standard tools available out-of-the-box in most *nix / Cygwin shells. Introducing another dependency such as unbuffer is not an option.
3. It must be stable and free of side-effects (see also 5.)
If the make process is interrupted (e.g. by the user pressing CTRL+C or for any other reason), there should be no side-effects (such as an orphaned log file of the output left behind on disk).
4. (Bonus) It should be efficient
The amount of output may reach 1MB or more, so just piping to a file and grepping it will be a small performance hit, and there will also be additional disk I/O (unnecessarily slowing down the build). I'm simply wondering whether this can be done without a temp file, as I understand pipes as a sort of "stream" that just needs to be analysed as it flows through.
5. (Bonus) Make it locale-independent w/o changing the current output
Depending on the current locale, the string to grep and count is localized differently, e.g. "Warning:" (en_US.utf8) or "Warnung:" (de_DE.utf8). I could of course switch the locale to en_US in the Makefile, but that would change the console output for users (hence breaking requirement 1.), so I'd like to know if there's any (efficient) approach you can think of for this.
At the end of the day, I could make do with a solid solution that just fulfills requirements 1. + 2.
If 3. to 5. cannot be done, then I'd have to convince the project maintainers to accept some changes to .gitignore, to have the build process take up slightly more time and resources, and/or to fix the make output to English only, but I assume they will agree that it would be worth it.
Current solution
The best I have so far is:
script -eqc "make" make.log && WARNING_COUNT=$(grep -i --count "Warning:" make.log) && rm make.log || rm make.log
That fulfills my requirements 1, 2 and almost no. 3: still, if the machine has a power-outage while running the command, make.log will remain as an unwanted artifact. Also the repetition of rm make.log looks ugly.
So I'm open to alternative approaches and improvements from anybody. Thanks in advance.

Vocabulary for a script that is expected to produce the same output no matter where it is run

I'd like some advice on what vocabulary to use to describe the following. Having the right vocabulary will allow me to search for tools and ideas related to the concept.
I'd like to say a script is SomeWord if it is expected to produce the same output no matter where it is run.
For example, the following script is not SomeWord:
#!/bin/bash
ls ~
because of course it depends on where it is executed.
Whereas the following (if it runs without error) is expected to always produce the same output:
#!/bin/bash
echo "hello, world"
A more useful example would be something that loads and runs a docker or singularity container in a way that guarantees that a very particular container image is being used. For example by retrieving the singularity image by its content-hash.
The advantages of SomeWord scripts are: (a) they may be safely run on a remote system without worrying about the environment and (b) their outputs may be cached.
The best I can think of would be "deterministic" or some variation on "environment independent" or "reproducible".
Any container should be able to do this, as that is a big part of why the tech developed in the first place. Environment managers like conda can also do this to a certain extent, but because it's just modifying the host environment it's possible to be using non-conda binaries without realizing it.

OMNeT++: Different results for simulation repetition using the same seed

Versions used: OMNeT++ 5.0 with INET 3.4.0
Using OMNeT++, I'm running a simulation with a large number of repetitions.
In some cases I don't understand the behaviour of my system, so I want to watch the procedure using Qt. Therefore I need to repeat some specific cases from the previously simulated repetitions.
Even though I use the exact same configuration file in combination with the corresponding seed set, I don't get the desired repetition; instead I get completely different results. What can be the reason for that?
Analyzing the header of the generated log-files, there are only differences in the following lines:
run General-107342-20170331-15:42:22-5528
attr datetime 20170331-15:42:22
attr processid 5528
All other parameters are matching exactly. I don't understand why the results are different. Is the processid relevant for behavior like this?
Some tips to nail down the problem:
Check if the difference is indeed caused by the Graphical/Non-graphical difference. Run your simulation with both:
$ mysim -r 154 -u Cmdenv
$ mysim -r 154 -u Qtenv
$ mysim -r 154 -u Tkenv
Check the results. Different results may be caused by several issues:
Relying on undefined behavior in C++, e.g. you have a (set) collection and you iterate over it. The order of the collection is undefined and it can throw the simulation onto a different trajectory
Accessing uninitialized memory
Using data that is available only in a graphical runtime, like using the positions of the nodes defined by the #displayString property. Node positions may change based on the layouting algorithm, and layouting is not available in Cmdenv
Changing the model state while testing whether the model is running under a graphical runtime, i.e. inside if (isGUI()) {} blocks.
First I would try to figure out whether this is related to GUI vs non-GUI or rather to the use of undefined behavior. If Tkenv and Qtenv give the same result while Cmdenv differs, then it is a GUI/non-GUI issue. If all of them are different, I would suspect a memory issue or undefined behavior.
If everything else fails, run the simulation in both Cmdenv and Qtenv and turn on event logging. Compare the logs to see where the two trajectories start to diverge, and debug both runs around that point to find the cause of the divergence.
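Finding the point where two event logs diverge can be scripted. Below is a minimal sketch in Python that reports the first pair of lines that differ; the function name and the skip_prefixes parameter are hypothetical, and which header lines need to be skipped (for example those carrying the run id, date or process id) is an assumption you will have to adapt to your own log files.

```python
from itertools import zip_longest

def first_divergence(lines_a, lines_b, skip_prefixes=()):
    """Return ((line_no_a, text_a), (line_no_b, text_b)) for the first
    differing pair of lines, or None if both logs are identical.
    Lines starting with any of skip_prefixes are ignored."""
    prefixes = tuple(skip_prefixes)

    def keep(lines):
        # Remember original line numbers so the result can be looked up later.
        return [(n, line.rstrip("\n"))
                for n, line in enumerate(lines, start=1)
                if not line.startswith(prefixes)]

    kept_a, kept_b = keep(lines_a), keep(lines_b)
    for a, b in zip_longest(kept_a, kept_b, fillvalue=(None, None)):
        if a[1] != b[1]:
            return a, b
    return None
```

Open the two event log files, pass the file objects as lines_a and lines_b, and start debugging both runs shortly before the reported line numbers.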

Vowpal Wabbit: obtaining a readable_model when in --daemon mode

I am trying to stream my data to vw in --daemon mode, and would like to obtain at the end the value of the coefficients for each variable.
Therefore I'd like vw in --daemon mode to either:
- send me back the current value of the coefficients for each line of data I send.
- Write the resulting model in the "--readable_model" format.
I know about the dummy example trick save_namemodel | ... to get vw in daemon mode to save the model to a given file, but it isn't enough as I can't access the coefficient values from that file.
Any idea on how I could solve my problem ?
Unfortunately, on-demand saving of readable models isn't currently supported in the code, but it shouldn't be too hard to add. Open source software is there for users to improve according to their needs. You may open an issue on GitHub or, better, contribute the change.
See:
this code line where only the binary regressor is saved using save_predictor(). One could envision a "rsave" or "saver" tag/command to store the regressor in readable form as is being done in this code line
As a work-around you may call vw with --audit and parse every audit line for the feature names and their current weights but this would:
make vw much slower
require parsing every line to get the values rather than on demand
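For the --audit work-around, the parsing step might look like the sketch below. It assumes that each audit line consists of tab-separated entries of the form name:hash:value:weight, optionally followed by an @-suffix on the weight; please check this against the output of your vw version before relying on it. The function name is hypothetical.

```python
def parse_audit_line(line):
    """Extract {feature_name: weight} from one line of vw --audit output.
    Assumed entry format: name:hash:value:weight[@extra], tab-separated."""
    weights = {}
    for token in line.strip().split("\t"):
        parts = token.rsplit(":", 3)
        if len(parts) != 4:
            continue  # not an audit entry (e.g. a prediction line)
        name, _hash, _value, weight = parts
        weight = weight.split("@", 1)[0]  # drop the optional @-suffix
        try:
            weights[name] = float(weight)
        except ValueError:
            continue  # weight field is not numeric, skip the entry
    return weights
```

Applying this to every audit line and keeping the most recent value per feature name yields the current coefficients, at the cost of the slowdown mentioned above.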
