Possible to output Vowpal Wabbit predictions to .txt along with observed target values? - vowpalwabbit

We're writing a forecasting application that uses Vowpal Wabbit and are looking to automate as much of our model validation process as we can. Anyone know whether vw has a native utility to output the target values in a test file along with the predictions from a vw model? These values are printed to the terminal output during prediction. Is there an argument to the regular vw call, or perhaps a tool in the utl folder that prints targets and forecasts together on a row-wise basis?
Here's what the code I'm using now for prediction looks like:
vw -d /path/to/data/test.vw -t -i lg.vw --link=logistic -p predictions.txt
My goal is to produce from within Vowpal an output file that looks like this:
Predicted Target
0.78 1
0.23 0
0.49 1
...
UPDATE
@arielf's code worked like a charm. I've only made one minor addition to print the streaming results to a validation.txt file:
vw -d test.vw -t -i lg.vw --link=logistic -P 1 2>&1 | \
perl -ane 'print "$F[5]\t$F[4]\n" if (/^\d/)' > validation.txt

Try this:
vw -d test.vw -t -i lg.vw --link=logistic -P 1 2>&1 | \
perl -ane 'print "$F[5]\t$F[4]\n" if (/^\d/)'
Explanation:
-P 1 # Add option: set vw progress report to apply to every example
Note: -P is a capital P (alias for --progress), 1 is the progress printing interval.
Note that you don't need to add predictions with -p ... since that is redundant in this case (predictions are already included in the vw progress lines).
A progress report line with headers, looks like this:
average    since      example   example   current   current   current
loss       last       counter   weight    label     predict   features
0.000494   0.000494   1         1.0       -0.0222   0.0000    14
Since progress report goes to stderr, we need to redirect stderr to stdout (2>&1).
Now we pipe the vw progress output into perl for simple post-processing. The perl command loops over each line of input without printing by default (-n), auto-splits each line into fields on whitespace (-a), and applies the expression (-e), printing $F[5] and $F[4] (the 6th and 5th columns: current predict and current label) separated by a TAB and terminated by a newline, but only if the line starts with a number (to skip whatever isn't a progress line, e.g. headers, preambles, and summary lines). I reversed the field order because vw progress lines have the observed value before the predicted value, and you asked for the opposite order.
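For reference, an equivalent awk one-liner (a sketch, assuming the same whitespace-separated progress format shown above) is:
vw -d test.vw -t -i lg.vw --link=logistic -P 1 2>&1 | \
awk '/^[0-9]/ { printf "%s\t%s\n", $6, $5 }' > validation.txt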
UPDATE
Aaron published a working example using this solution in Google Drive: https://drive.google.com/open?id=0BzKSYsAMaJLjZzJlWFA2N3NnZGc

Related

Show only newly added lines of logfile in terminal

I use tail -f to show the contents of a logfile.
What I want is that when the logfile content changes, only the newly added lines are shown on my screen, instead of being appended to what is already there.
As if the screen were cleared every time before the new lines are printed.
I tried to find a solution by web search but couldn't find anything useful.
edit:
In my case several lines may be added at once (it is a PHP error logfile), so I am looking for a solution that can show more than just the single last line.
The watch command in combination with tail shows the last line of a log file at an interval of every 2 seconds. It doesn't refresh whenever a new line is appended to the log file, but since you can specify the interval, it might help for your use case.
watch -t tail -1 <path_to_logfile>
If you need a faster interval, such as every 0.5 seconds, you can specify it with the -n option, i.e.:
watch -t -n 0.5 tail -1 <path_to_logfile>
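Since several lines may be appended at once in your case, the same idea extends to showing the last N lines instead of just one; 20 here is an arbitrary choice:
watch -t -n 0.5 tail -n 20 <path_to_logfile>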
Try
$ watch 'tac FILE | grep -m1 -C2 PATTERN | tac'
where
PATTERN is any keyword (or regexp) to identify errors you seek in the log,
tac prints the lines in reverse,
-m is a max count of matching lines to grep,
-C is any number of lines of context (before and after the match) to show (optional).
That would be similar to
$ tail -f FILE | grep -C2 PATTERN
if you didn't mind just appending occurrences to the output in real-time.
But if you don't know any generic PATTERN to look for at all,
you'd have to just follow all the updates as the logfile grows:
$ tail -n0 -f FILE
Or even, create a copy of the logfile and then do a diff:
Copy: cp file.log{,.old}
Refresh the webpage with your .php code (or whatever, to trigger the error)
Run: diff file.log{,.old}
(or, if you prefer sort to diff: $ sort file.log{,.old} | uniq -u)
The curly braces is shorthand for both filenames (see Brace Expansion in $ man bash)
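To see for yourself what the braces expand to, prefix the command with echo:
$ echo cp file.log{,.old}
cp file.log file.log.old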
If you must avoid any temp copies, store the line count in memory:
z=$(grep -c ^ file.log)
Refresh the webpage to trigger an error
tail -n +$((z+1)) file.log
The latter approach can be built upon, to create a custom scripting solution more suitable for your needs (check timestamps, clear screen, filter specific errors, etc). For example, to only show the lines that belong to the last error message in the log file updated in real-time:
$ clear; z=$(grep -c ^ FILE); while true; do d=$(date -r FILE); sleep 1; b=$(date -r FILE); if [ "$d" != "$b" ]; then clear; tail -n +$((z+1)) FILE; z=$(grep -c ^ FILE); fi; done
where
FILE is, obviously, your log file name;
grep -c ^ FILE counts all lines in the file (subtly different from cat FILE | wc -l, which counts only newline characters and so would miss a final line without a trailing newline);
sleep 1 sets the pause/delay between checking the file timestamps to 1 second, but you could change it to even a floating point number (the less the interval, the higher the CPU usage).
To simplify any repetitive invocations in future, you could save this compound command in a Bash script that could take a target logfile name as an argument, or define a shell function, or create an alias in your shell, or just reverse-search your bash history with CTRL+R. Hope it helps!
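For repeated use, the compound command above can be wrapped in a small script (a sketch; the logfile path is taken as the first argument, and $((z+1)) skips the lines that were already present):
#!/bin/bash
# show-new.sh: on each change, clear the screen and show only newly appended lines
file=${1:?usage: show-new.sh LOGFILE}
clear
z=$(grep -c ^ "$file")            # lines currently in the file
stamp=$(date -r "$file")          # last modification time
while true; do
  sleep 1
  now=$(date -r "$file")
  if [ "$now" != "$stamp" ]; then
    clear
    tail -n +$((z+1)) "$file"     # print only the lines added since the last check
    z=$(grep -c ^ "$file")
    stamp=$now
  fi
done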

How to download URLs in a csv and naming outputs based on a column value

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) CSV (as fast/parallel as possible), naming each output file based on a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
Files in a folder named outputs, like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried mainly in two styles.
1. Using the download tool's inner support
Take aria2c as an example: it supports the -i option to import a file of URLs to download, and (I think) it will process them in parallel for maximum speed. It does have a --force-sequential option to force downloading in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into pieces and run a script like the following to process each piece:
#!/bin/bash
INPUT=$1
while IFS=, read -r serino url
do
aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this means aria2c is restarted for every line, which seems to cost time and lower the speed.
One could run the script multiple times to get 'shell-level' parallelism, but that does not seem to be the best way.
Any suggestions?
Thank you.
aria2c supports so called option lines in input files. From man aria2c
-i, --input-file=
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However this won't create files 001.jpg, 002.jpg, … but 001, 002, … since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
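One way around that limitation is a two-step substitution with a branch: try to capture an extension first, and fall back to a fixed one when there is none (a sketch for GNU sed, assuming .jpg as the fallback extension):
sed -E 's/([^,]*),(.*)(\.[a-zA-Z0-9]+)$/\2\3\n out=\1\3/
t
s/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
The t command branches to the end of the script whenever the first substitution succeeded, so the fallback only applies to URLs without an extension.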
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option tells xargs to run as many commands in parallel as possible at a time.
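To sanity-check the pairing before downloading anything, you can preview the generated commands by putting echo in front of curl, which prints each invocation instead of running it:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'echo curl -s "$2" -o "$1.jpg"' -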

Unix Epoch to date with sed

I want to convert Unix epoch timestamps to normal dates.
I'm trying:
sed < file.json -e 's/\([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/`date -r \1`/g'
any hint?
Given the lack of information in your post, I cannot give you a better answer than this, but it is possible to execute commands using sed!
There are different ways to do it. You can:
use sed's e instruction directly, followed by the command to be executed; if you do not pass a command to e, it will treat the content of the pattern buffer as an external command;
use a simple substitute command with sed and pipe the output to sh.
Example 1:
echo 12687278 | sed "s/\([0-9]\{8,\}\)/date -d @\1/;e"
Example 2:
echo 12687278 | sed "s/\([0-9]\{8,\}\)/date -d @\1/" | sh
Tests 1 and 2 were run with the Japanese locale (LC_TIME=ja_JP.UTF-8); the output screenshots are not reproduced here.
Remarks:
I will let you adapt the date command according to your system's specifications.
Since modern epoch timestamps are longer than 8 digits, the sed command uses an open-ended length specifier of at least 8, rather than exactly 8.
Allan has a nice way to tackle dynamic arguments: write a script dynamically and pipe it to a shell! It works. It tends to be a bit insecure, though, because you could pipe unintended shell commands to sh. For example, if rm -f some-important-file were in the file along with the numbers, the sed pipeline wouldn't change that line, and it would be passed to sh along with the date commands. Obviously, this is only a concern if you don't control the input. But mistakes can happen.
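A defensive alternative is to never evaluate file content as code at all, converting matching lines in a plain read loop (a sketch, assuming GNU date):
while read -r line; do
  if [[ $line =~ ^[0-9]{8,}$ ]]; then
    date -d "@$line"          # convert epoch seconds to a date
  else
    printf '%s\n' "$line"     # pass non-timestamp lines through unchanged
  fi
done < file.json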
A similar method I much prefer is with xargs. It's a bit of a head trip for new users, but very powerful. The idea behind xargs is that it takes its input from its standard input, then appends it to the command comprised of its own non-option arguments and runs the resulting command(s). For instance,
$ echo -e "/tmp\n/usr/lib" | xargs ls -d
/tmp /usr/lib
It's a trivial example of course, but you can see more exactly how this works by adding an echo:
echo -e "/tmp\n/usr/lib" | xargs echo ls -d
ls -d /tmp /usr/lib
The input to xargs becomes the additional arguments to the command specified in xargs's own arguments. Read that twice if necessary, or better yet, fiddle with this powerful tool, and the light bulb should come on.
Here's how I would approach what you're doing. Of course I'm not sure if this is actually a logical thing to do in your case, but given the detail you went into in your question, it's the best I can do.
$ cat dates.txt
Dates:
1517363346
I can run a command like this:
$ sed -ne '/^[0-9]\{8,\}$/ p' < dates.txt | xargs -I % -n 1 date -d @%
Tue Jan 30 19:49:06 CST 2018
Makes sense, because I used the command echo -e "Dates:\n$(date +%s)" > dates.txt to make the file a few minutes before I wrote this post! Let's go through it together and I'll break down what I'm doing here.
For one thing, I'm running sed with -n. This tells it not to print lines by default. That makes this script work even if not every line has an 8+ digit "date" in it. I also added anchors to the start (^) and end ($) of the regex so the line must consist only of the appropriate digits (I realize this may not be perfect for you, but without understanding your input, I can't do better). These are important changes if your file is not entirely comprised of date strings. Additionally, I am matching at least 8 characters, as modern date strings are going to be more like 10 characters long. Finally, I added the p command to sed. This tells it to print the matching lines, which is necessary because I specifically said not to print the nonmatching lines.
The next bit is the xargs itself. The sed will write a date string out to xargs's standard input. I set only a few options for xargs. By default it will add the standard input to the end of the command, separated by a space. I didn't want a space, so I used -I to specify a replacement string. % doesn't have a special meaning; it's just a placeholder that gets replaced with the input. I used % because it's not a special character but is rarely used in commands. Finally, I added -n 1 to make sure only 1 input is used per execution of date. (xargs can also combine many inputs, as in my ls example above.)
The end result? Sed matches lines that consist, exclusively, of 8 or more numeric values, outputting the matching lines. The pipe then sends this output to xargs, which takes each line separately (-n 1) and, replacing the placeholder (-I %) with each match, then executes the date command.
This is a shell pattern I really like, and use every day, and with some clever tweaks, can be very powerful. I encourage anyone who uses linux shell to get to know xargs right away.
There is another option for GNU sed users. While the BSD folks stayed pretty true to their old BSD unix roots, the GNU folks, who wrote their userspace from scratch, added many wonderful enhancements to the standards. GNU sed can run a subshell command for you and then do the replacement, which is dramatically easier. Since you are using the BSD-style date invocation, I'm going to assume you don't have GNU sed at your disposal.
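For completeness, that GNU-only feature is the e flag of the s command, which executes the pattern space as a command after a successful substitution (a sketch, assuming GNU sed and GNU date):
sed -E 's/^([0-9]{8,})$/date -d @\1/e' dates.txt
Lines that don't match pass through unchanged; matching lines are replaced by the output of the generated date command.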
Using sed (tested with macOS only):
There is a slight difference with the date command: on macOS it takes the -r flag (exclusive to macOS) instead of -d.
echo 12687278 | sed "s/\([0-9]\{8,\}\)/$(date -r \1)/g"
Results:
Thu Jan 1 09:00:01 JST 1970
Note, though, that the command substitution $(date -r \1) is expanded by the shell before sed runs, so date receives the literal argument \1 (parsed as epoch second 1); that is why the result above is the start of the epoch (in JST) rather than the converted value of 12687278.

Prepend message to rsstail

I am trying to prepend a message to the output of rsstail, this is what I have right now:
rsstail -o -i 15 --initial 0 http://feeds.bbci.co.uk/news/world/europe/rss.xml | awk -v time=$( date +\[%H:%M:%S_%d/%m/%Y\] ) '{print time,$0}' | tee someFile.txt
which should give me the following:
[23:46:49_23/10/2014] Title: someTitle
After the command I have a | while read line; do ...; done which never gets called, because the above command does not output a single thing. What am I doing wrong?
PS: I am using the python version of rsstail, since the other one kept on crashing (https://github.com/gvalkov/rsstail.py)
EDIT:
As requested in the comments the command:
rsstail -o -i 15 --initial 0 http://feeds.bbci.co.uk/news/world/europe/rss.xml
Will give back a message like the following when a new article is found
Title: Sweden calls off search for sub
It seems that my rsstail is different from yours, but mine supports the option
-Z x add heading 'x'
so that
rsstail -Z"$( date +\[%H:%M:%S_%d/%m/%Y\] ) " ...
does the job without awk. On the other hand, you do have some problem with buffering; is it possible to ask rsstail to stop after a given number of titles?
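If your rsstail lacks -Z, the frozen timestamp in the original pipeline can be fixed inside awk, and fflush() forces each line out immediately (a sketch assuming GNU awk, whose strftime recomputes the time for every line):
rsstail -o -i 15 --initial 0 http://feeds.bbci.co.uk/news/world/europe/rss.xml | \
awk '{ print strftime("[%H:%M:%S_%d/%m/%Y]"), $0; fflush() }' | tee someFile.txt
Note that the Python rsstail may still buffer its own output when writing to a pipe, which would delay lines regardless of what awk does.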

Direct xargs output to file, using multiple arguments

Given a set of files, I need to pass 2 arguments and direct the output to a newly named file based on either input filename. The input list follows a defined format: S1_R1.txt, S1_R2.txt; S2_R1.txt, S2_R2.txt; S3_R1.txt, S3_R2.txt, etc. The first number increments by 1, and each has an R1 and a corresponding R2.
The output file is a combination of each S#-pair and should be named accordingly, e.g. S1_interleave.txt, S2_interleave.txt, S3_interleave.txt, etc.
The following works to print to screen
find S*R*.txt -maxdepth 0 | xargs -n 2 python interleave.py
How can I utilize the input filenames for use as output?
Just to make it at bit more fun: Let us assume the files are gzipped (as paired end reads often are) and you want the result gzipped, too:
parallel --xapply 'python interleave.py <(zcat {1}) <(zcat {2}) |gzip > {=1 s/_R1.txt.gz/_interleave.txt.gz/=}' ::: *R1.txt.gz ::: *R2.txt.gz
You need the pre-release of GNU Parallel to do this http://git.savannah.gnu.org/cgit/parallel.git/snapshot/parallel-1a1c0ebe0f79c0ada18527366b1eabeccd18bdf5.tar.gz (or wait for the release 20140722).
As asked, it is even simpler (though you still need the pre-release):
parallel --xapply 'python interleave.py {1} {2} > {=1 s/_R1.txt/_interleave.txt/=}' ::: *R1.txt ::: *R2.txt
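If the pre-release requirement is a problem, a plain Bash loop achieves the same pairing and naming without GNU Parallel (a sketch, assuming the S#_R1/S#_R2 convention always holds):
for r1 in S*_R1.txt; do
  r2=${r1/_R1/_R2}              # the matching R2 file
  python interleave.py "$r1" "$r2" > "${r1/_R1.txt/_interleave.txt}" &
done
wait                            # let all background jobs finish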
