How to use Vowpal Wabbit for online prediction (streaming mode)

I am trying to use Vowpal Wabbit for a multi-class classification task with 154 different class labels, as follows:
Trained a VW model with a large amount of data.
Tested the model with a dedicated test set.
In this scenario I was able to reach >80% accuracy, which is good. But the problem I am currently working on is this:
I have to replicate a real-time prediction scenario. In this case I have to pass one data point (i.e., one text line) at a time, so that the model can predict and output a value.
I have tried all the options I knew of but failed. Can anyone tell me how to create this real-time scenario, passing a single data point to the vw command rather than a file?

You can use vw as a daemon:
vw --daemon --port 54321 --quiet -i model_file -t --num_children 1
Now vw loads the model and listens on port 54321 (on localhost). Every time you send a line (ending with a newline, ASCII 10) to localhost:54321 you'll get a prediction back on the same socket, for example:
echo " | your features here..." | netcat localhost 54321
This is just an example, normally you would write a program that will write and then read from the socket in a loop instead of calling netcat.
You can also call vw in regular input/output and prediction mode:
vw --quiet -i model_file -t -p /dev/stdout
And write to it (via stdin) and read from it (via stdout). The key is that you'll get one line of output for each line of input you send, in the same order. You can also send N lines at a time, and then read back N responses. The relative order of requests and responses is guaranteed to be preserved.
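To make the "write and then read from the socket in a loop" idea concrete, here is a minimal Python sketch of such a client. It assumes a vw daemon already listening on localhost:54321 as in the command above; the function name and one-reply-per-line protocol framing are from the answer, everything else is illustrative:

```python
import socket

def vw_predict(lines, host="localhost", port=54321):
    """Send VW-formatted example lines to a running vw daemon and
    return one prediction string per line, in the same order."""
    with socket.create_connection((host, port)) as sock:
        f = sock.makefile("rw", encoding="ascii", newline="\n")
        preds = []
        for line in lines:
            f.write(line.rstrip("\n") + "\n")  # each example must end with \n
            f.flush()
            preds.append(f.readline().strip())  # exactly one reply per example
        return preds
```

Usage would look like `vw_predict(["| your features here..."])`, returning a list with one prediction string.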

Related

Where to view logged results in Veins 5.1

I'm somewhat new to Veins and I'm trying to record collision statistics within the sample "RSUExampleScenario" provided in the VM. I found this question which describes what line to add to the .ini file, which I have, but I'm unable to find the "ncollisions" value in the results folder, which makes me think either I ran the wrong .ini line or am looking in the wrong place.
Thanks!
Because collision statistics take time to compute (essentially: trying to decode every transmission twice: once while considering interference by other nodes as usual, then trying again while ignoring all interference), Veins 5.1 requires you to explicitly turn collision statistics on. As discussed in https://stackoverflow.com/a/52103375/4707703, this can be achieved by adding a line *.**.nic.phy80211p.collectCollisionStatistics = true to omnetpp.ini.
After altering the Veins 5.1 example simulation this way and running it again (e.g., by running ./run -u Cmdenv -c Default from the command line), the ncollisions field in the resulting .sca file should now (sometimes) have non-zero values.
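For reference, the change described above could look like this in omnetpp.ini (a sketch assuming the stock example's Default configuration; only the collectCollisionStatistics line comes from the answer):

```ini
[Config Default]
# enable the (costly) second decoding pass that counts collisions
*.**.nic.phy80211p.collectCollisionStatistics = true
```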
You can quickly verify this by running (from the command line)
opp_scavetool export --filter 'module("**.phy80211p") and name("ncollisions")' results/Default-\#0.sca -F CSV-R -o collisions.csv
The resulting collisions.csv should now contain a line containing (among other information) param,,,*.**.nic.phy80211p.collectCollisionStatistics,true (indicating that the simulation was executed with the required configuration) as well as many lines containing (among other information) scalar,RSUExampleScenario.node[10].nic.phy80211p,ncollisions,,,1 (indicating that node[10] could have received one more message, had it not been for interference caused by other transmissions in the simulation).

Hadoop/Yarn/Spark can I call command line?

Short version of my question: We need to call the command line from within a Spark job. Is this feasible? The cluster support group indicated this could cause problems with memory.
Long version: I have a job that I need to run on a Hadoop/MapR cluster, processing packet data captured with tshark/wireshark. The data is binary packet data, one file per minute of capture. We need to extract certain fields from this packet data, such as IP addresses etc. We have investigated options such as jNetPcap, but this library is a bit limited. So it looks like we need to call the tshark command line from within the job and process the response. We can't do this directly during capture, as we need the capture to be as efficient as possible to avoid dropping packets. Converting the binary data to text outside the cluster is possible, but this is 95% of the work, so we may as well run the entire job as a non-distributed job on a single server. This limits the number of cores we can use.
Command line to decode is:
tshark -V -r somefile.pcap
or
tshark -T pdml -r somefile.pcap
Well, it is not impossible. Spark provides a pipe method which can be used to pipe data to an external process and read the output. The general structure could be, for example, something like this:
val files: RDD[String] = ??? // List of paths
val processed: RDD[String] = files.pipe("some Unix pipe")
Still, from your description it looks like GNU Parallel could be much better choice than Spark here.
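To illustrate what pipe does per partition, here is a hedged Python sketch of the same pattern outside Spark: feed input lines to an external command's stdin and collect its stdout lines. The helper name is made up, and tr stands in for tshark (which may not be installed where you test this):

```python
import subprocess

def pipe_through(cmd, lines):
    """Mimic what RDD.pipe does for one partition: write each input line
    to an external command's stdin and return its stdout lines."""
    proc = subprocess.run(
        cmd,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()
```

For example, `pipe_through(["tr", "a-z", "A-Z"], ["hello"])` returns `["HELLO"]`; in the real job the command would be a tshark invocation and the input lines would be file paths, as in the Scala snippet above.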

get progress of a file being uploaded - unix

I have a requirement to monitor the progress of a file being uploaded, using a script. In PuTTY (software) we can view the Percentage Upload, Bytes transferred, Upload Speed and ETA on the right-hand side. I want to develop similar functionality. Is there any way to achieve this?
Your question lacks any information about how your file is transferred. Most clients have some way to display progress, but that depends on the individual client used (scp, sftp, ncftp, ...).
But there is a way to monitor progress independently of what is doing the transfer: pv (pipe viewer).
This tool has the sole purpose of generating monitoring information. It can be used much like cat. You either use it to "lift" a file to pv's stdout...
pv -petar <file> | ...
...or you use it in the middle of a pipe -- but you need to manually provide the "expected size" in order to get a proper progress bar, since pv cannot determine the size of the transfer beforehand. I used 2 Gigabyte expected size here (-s 2G)...
cat <file> | pv -petar -s 2G | ...
The options used are -p (progress bar), -e (ETA), -t (elapsed time), -a (average rate), and -r (current rate). Together they make for a nice mnemonic.
Other nice options:
-L, which can be used to limit the maximum rate in the pipe (throttle).
-W, to make pv wait until data is actually transferred before showing a progress bar (e.g. if the tool you are piping the data to will require a password first).
This is most likely not what you're looking for (since chances are the transfer client you're using has its own options for showing progress), but it's the one tool I know that could work for virtually any kind of data transfer, and it might help others visiting this question in the future.
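If you need the same numbers inside your own script rather than from pv, the bookkeeping is simple to replicate. Here is a hedged Python sketch that reports bytes copied, elapsed time, average rate, and (if an expected size is supplied, like pv's -s) a percentage; the function name and output format are made up:

```python
import sys
import time

def copy_with_progress(src, dst, total_bytes=None, chunk=64 * 1024):
    """Copy src to dst while printing bytes copied, elapsed time and
    average rate to stderr -- roughly what pv -petar displays."""
    start = time.monotonic()
    copied = 0
    while True:
        buf = src.read(chunk)
        if not buf:
            break
        dst.write(buf)
        copied += len(buf)
        elapsed = max(time.monotonic() - start, 1e-9)
        rate = copied / elapsed  # average rate, like pv's -a
        pct = f" {100 * copied / total_bytes:5.1f}%" if total_bytes else ""
        print(f"\r{copied} B  {elapsed:6.2f}s  {rate:10.0f} B/s{pct}",
              end="", file=sys.stderr)
    print(file=sys.stderr)
    return copied
```

Like pv in the middle of a pipe, it can only show a percentage when you tell it the expected size up front.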

Vowpal Wabbit: obtaining a readable_model when in --daemon mode

I am trying to stream my data to vw in --daemon mode, and would like to obtain at the end the value of the coefficients for each variable.
Therefore I'd like vw in --daemon mode to either:
- send me back the current value of the coefficients for each line of data I send.
- Write the resulting model in the "--readable_model" format.
I know about the dummy-example trick (sending save_namemodel | ... as an example) to get vw in daemon mode to save the model to a given file, but it isn't enough, as I can't access the coefficient values from that file.
Any idea how I could solve this?
Unfortunately, on-demand saving of readable models isn't currently supported in the code, but it shouldn't be too hard to add. Open-source software is there for users to improve according to their needs. You may open an issue on GitHub or, better, contribute the change.
See this code line, where only the binary regressor is saved using save_predictor(). One could envision an "rsave" or "saver" tag/command to store the regressor in readable form, as is done in this code line.
As a workaround you may call vw with --audit and parse every audit line for the feature names and their current weights, but this would:
- make vw much slower
- require parsing every line to get the values, rather than getting them on demand
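For the --audit workaround, a parser could look like the sketch below. The assumed token format name:hashindex:value:weight (optionally followed by @-suffixed extras) is my reading of typical vw audit output, not something stated in the answer; verify it against the output of your vw version before relying on it:

```python
def parse_audit_features(audit_line):
    """Extract (feature_name, weight) pairs from one vw --audit line.
    Assumes whitespace-separated tokens shaped like name:hash:value:weight;
    tokens with fewer fields (e.g. the bare prediction) are skipped."""
    pairs = []
    for token in audit_line.split():
        parts = token.split(":")
        if len(parts) < 4:
            continue  # not a feature token
        name, weight_str = parts[0], parts[3].split("@")[0]
        try:
            pairs.append((name, float(weight_str)))
        except ValueError:
            continue  # malformed weight field
    return pairs
```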

bash: wait for specific command output before continuing

I know there are several posts asking similar things, but none address the problem I'm having.
I'm working on a script that handles connections to different Bluetooth low energy devices, reads from some of their handles using gatttool and dynamically creates a .json file with those values.
The problem I'm having is that gatttool commands take a while to execute (and are not always successful in connecting to the devices, due to "device is busy" or similar messages). These errors translate not only into wrong data filling the .json file; they also let later lines of the script keep writing to the file (e.g., adding an extra } or similar). An example of the commands I'm using would be the following:
sudo gatttool -l high -b <MAC_ADDRESS> --char-read -a <#handle>
How can I approach this in a way that I can wait for a certain output? In this case, the ideal output when you --char-read using gatttool would be:
Characteristic value/description: some_hexadecimal_data
This way I can make sure I am following the script line by line instead of having these "jumps".
grep allows you to filter the output of gatttool for the data you are looking for.
If you are actually looking for a way to wait until a specific output is encountered before continuing, expect might be what you are looking for.
From the manpage:
expect [[-opts] pat1 body1] ... [-opts] patn [bodyn]
waits until one of the patterns matches the output of a spawned
process, a specified time period has passed, or an end-of-file is
seen. If the final body is empty, it may be omitted.
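If you would rather stay in a plain script than bring in expect, the same "wait for a specific output" behavior can be approximated with a retry loop: run the command, check its output against the expected pattern, and only continue once it matches. A hedged Python sketch (the function name and retry policy are made up; the gatttool invocation would replace the stand-in command):

```python
import re
import subprocess
import time

def run_until_match(cmd, pattern, retries=5, delay=2.0):
    """Run cmd repeatedly until its stdout matches pattern, so the
    caller only proceeds once the expected output has appeared."""
    for attempt in range(retries):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        m = re.search(pattern, proc.stdout)
        if m:
            return m  # match object; groups hold the captured data
        time.sleep(delay)  # e.g. device was busy -- back off and retry
    raise RuntimeError(f"no match for {pattern!r} after {retries} tries")
```

For the gatttool case this might be called as `run_until_match(["sudo", "gatttool", "-l", "high", "-b", mac, "--char-read", "-a", handle], r"Characteristic value/description: (.+)")`, writing to the .json file only after a match is returned.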