Vowpal Wabbit predictions for multi-label classification - vowpalwabbit

I am sorry, I do feel I am overlooking something really obvious.
But how can the following happen:
$ cat myTrainFile.txt
1:1 |f 1:12 2:13
2:1 |f 3:23 4:234
3:1 |f 5:12 6:34
$ cat myTestFile.txt
14:1 |f 1:12 2:13
14:1 |f 3:23 4:234
14:1 |f 5:12 6:34
$ vw --csoaa 3 -f myModel.model --compressed < myTrainFile.txt
final_regressor = myModel.model
...
...
$ vw -t -i myModel.model -p myPred.pred < myTestFile.txt
only testing
Num weight bits = 18
...
...
$ cat myPred.pred
14.000000
14.000000
14.000000
So the test file is identical to the train file, but for the labels.
Hence, I would expect vw to produce the original labels that it learned from the train file, as it ignores the labels in the test file completely.
However, it seems to reproduce the labels from the test file?!?
Clearly, I am doing something completely wrong here... but what?

If you specify just one label in --csoaa (even in the -t test mode), it means that only that label is "available" for this example, so no other label can be predicted.
This is another difference from --oaa (where you always specify just the correct label).
See https://groups.yahoo.com/neo/groups/vowpal_wabbit/conversations/topics/2949.
If all labels are "available" (possible) for any test example, you must always include all the labels on each line.
With -t you can omit the costs of the labels if you only want the --predictions and do not need vw to compute the test loss.
So your myTestFile.txt should look like:
1 2 3 |f 1:12 2:13
1 2 3 |f 3:23 4:234
1 2 3 |f 5:12 6:34
and your myTrainFile.txt should look like:
1:0 2:1 3:1 |f 1:12 2:13
1:1 2:0 3:1 |f 3:23 4:234
1:1 2:1 3:0 |f 5:12 6:34
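If your data starts out with just the correct class on each line, the two layouts above can be generated mechanically. Here is a minimal sketch, assuming the input lines look like "LABEL |features" (the --oaa layout) and that the classes are numbered 1..K. The helper names oaa_to_csoaa and csoaa_test_labels are made up for this example and are not part of vw:

```shell
# Hypothetical helper (not part of vw): turn "LABEL |features" lines into
# --csoaa training lines that list every class 1..K, with cost 0 for the
# correct class and cost 1 for all the others. $1 is K.
oaa_to_csoaa() {
    awk -v k="$1" '{
        bar = index($0, "|")          # the features start at the first "|"
        correct = $1 + 0              # numeric prefix of the first field
        labels = ""
        for (c = 1; c <= k; c++)
            labels = labels c ":" (c == correct ? 0 : 1) " "
        print labels substr($0, bar)
    }'
}

# Hypothetical helper: build test lines that list all K classes without
# costs, so that every class is "available" for prediction.
csoaa_test_labels() {
    awk -v k="$1" '{
        bar = index($0, "|")
        labels = ""
        for (c = 1; c <= k; c++)
            labels = labels c " "
        print labels substr($0, bar)
    }'
}
```

Both helpers read stdin and write stdout, so something like oaa_to_csoaa 3 < labels.txt > myTrainFile.txt would do (labels.txt is a placeholder name).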

So, for completeness' sake, here is how it does work:
$ cat myTrainFile.txt
1:1.0 |f 1:12 2:13
2:1.0 |f 3:23 4:234
3:1.0 |f 5:12 6:34
$ cat myTestFile.txt
1 2 3 |f 1:12 2:13
1 2 3 |f 3:23 4:234
1 2 3 |f 5:12 6:34
$ vw -t -i myModel.model -p myPred.pred < myTestFile.txt
only testing
...
$ cat myPred.pred
2.000000
1.000000
2.000000
So it is perhaps a bit surprising that none of the examples is classified correctly, but that is another problem.
Thanks @Martin Popel!

Related

bash: conserve tab with spaces for alignment with column

I am trying to display .tsv files aligned nicely as columns, while limiting the display to the current screen width. My current approach works in general, but it fails if the input contains the character that I pass to column as the separator:
bash$ cat sample.tsv | tr '\t' '#' | column -n -t -s '#' | cut -c-`tput cols`
I tried passing a tab directly as the separator but could not make it work. With its default options, column splits on any whitespace and not just on tabs, so that does not work for me either. I would be thankful for a better alternative to the above.
PS:
A sample is shown below:
bash:~$ cat sample.tsv
Sl Name Number Status
1 W Jhon +1 234 4454 y
2 M Walter +2 232 453 n
3 S M Ray +1 343 453 y
bash:~$ cat sample.tsv | tr '\t' '#' | column -n -t -s '#' | cut -c-`tput cols`
Sl  Name      Number       Status
1   W Jhon    +1 234 4454  y
2   M Walter  +2 232 453   n
3   S M Ray   +1 343 453   y
bash:~$ cat sample.tsv | column -n -t | cut -c-`tput cols`
Sl  Name  Number  Status
1   W     Jhon    +1      234  4454  y
2   M     Walter  +2      232  453   n
3   S     M       Ray     +1   343   453  y
bash:~$
You can tell column to use a tab as the column delimiter with -s:
column -t -s $'\t' -n sample.tsv
Sl  Name      Number       Status
1   W Jhon    +1 234 4454  y
2   M Walter  +2 232 453   n
3   S M Ray   +1 343 453   y

select multiple patterns with grep

I have a file that looks like this:
t # 3-7, 1
v 0 104
v 1 92
v 2 95
u 0 1 2
u 0 2 2
u 1 2 2
t # 3-8, 1
v 0 94
v 1 13
v 2 19
v 3 5
u 0 1 2
u 0 2 2
u 0 3 2
t # 3-9, 1
v 0 94
v 1 13
v 2 19
v 3 7
u 0 1 2
u 0 2 2
u 0 3 2
t corresponds to the header of each block.
I would like to extract multiple patterns from the file and output the transactions that contain the required patterns together.
I tried the following code:
ps | grep -e 't\|u 0 1 2' file.txt
and it works well to extract the headers and the pattern 'u 0 1 2'. However, when I add one more pattern, the output lists only the headers that start with t #. My modified code looks like this:
ps | grep -e 't\|u 0 1 2 && u 0 2 2' file.txt
I tried sed and awk solutions, but they did not work for me either.
Thank you for your help!
Olha
Use | as the separator before the third alternative, just like the second alternative.
grep -E 't|u 0 1 2|u 0 2 2' file.txt
Also, it doesn't make sense to specify a filename and also pipe ps to grep. If you provide filename arguments, it doesn't read from the pipe (unless you use - as a filename).
You can use grep with multiple -e expressions to grep for more than one thing at a time:
$ printf '%d\n' {0..10} | grep -e '0' -e '5'
0
5
10
Expanding on @kojiro's answer, you'll want to use an array to collect arguments:
mapfile -t lines < file.txt
arguments=()
for line in "${lines[@]}"
do
    arguments+=(-e "$line")
done
grep "${arguments[@]}"
You'll probably need a condition within the loop to check whether the line is one you want to search for, but that's it.
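One caveat: alternation (or several -e options) selects matching lines independently of each other, so it cannot express "this block contains both u 0 1 2 and u 0 2 2". If that is what you are after, here is a rough awk sketch. It assumes that every block starts with a line beginning with "t " and that the required lines must match exactly, character for character. The name blocks_with_all is made up for this example:

```shell
# Hypothetical helper: print every block (a "t " header plus the lines up to
# the next header) that contains ALL of the given lines.
# usage: blocks_with_all FILE 'u 0 1 2' 'u 0 2 2'
blocks_with_all() {
    file=$1; shift
    awk '
        # Print the buffered block if every required line was seen in it,
        # then reset the buffer for the next block.
        function flush(   i, ok) {
            ok = (block != "")
            for (i = 1; i <= n && ok; i++)
                if (!(want[i] in seen)) ok = 0
            if (ok) printf "%s", block
            block = ""
            split("", seen)           # portable way to empty an array
        }
        BEGIN {
            # ARGV[1] is the data file; the remaining arguments are the
            # required lines. Blank them so awk does not open them as files.
            for (i = 2; i < ARGC; i++) { want[++n] = ARGV[i]; ARGV[i] = "" }
        }
        /^t / { flush() }             # a new header closes the previous block
        { block = block $0 "\n"; seen[$0] = 1 }
        END { flush() }
    ' "$file" "$@"
}
```

With the sample from the question, blocks_with_all file.txt 'u 0 1 2' 'u 1 2 2' should print only the first block, since it is the only one containing u 1 2 2.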

Bash_shell Use shell to convert three format in one script to another script at one time

cat file1.txt
set A B 1
set C D E 2
set E F 3 3 3 3 3 3
cat file2.txt
A;B;1;
C;D.E;2;
E;F;3 3 3 3 3 3;
Please help me convert the format in file1.txt to the one in file2.txt; file2.txt is the desired output. I show only 3 lines of file1.txt as an example, but in fact there are many command lines in these 3 formats, so the shell command should handle any file1.txt whose content uses these 3 formats.
echo "set A B 1
set C D E 2
set E F 3 3 3 3 3 3 " | sed -r 's/set (.) /\1;/;s/([A-Z])*( ([A-Z]))/\1.\3/g;s/([A-Z]) ([0-9])/\1;\2/;s/ ?$/;/'
A;B;1;
C;D.E;2;
E;F;3 3 3 3 3 3;
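If the regex is hard to follow, the same conversion can be sketched field by field in awk. This is only a guess at the rules based on the three sample lines: the first word after set is the first column, any further non-numeric words are joined with "." into the second column, and the purely numeric words are joined with spaces into the third. The name convert_set_lines is made up for this example:

```shell
# Hypothetical helper: convert "set NAME WORD... NUMBER..." lines read from
# stdin into "NAME;WORD.WORD;NUMBER NUMBER;" lines on stdout.
convert_set_lines() {
    awk '$1 == "set" {
        names = ""; values = ""
        for (i = 3; i <= NF; i++) {
            if ($i ~ /^[0-9]+$/)      # numeric words go to the last column
                values = values (values == "" ? "" : " ") $i
            else                      # other words are joined with "."
                names = names (names == "" ? "" : ".") $i
        }
        print $2 ";" names ";" values ";"
    }'
}
```

Usage would be convert_set_lines < file1.txt > file2.txt. Lines that do not start with set are dropped, which may or may not be what you want.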

same dataset different prediction results

I have a very simple dataset, see below (let's call it a.vw):
-1 |a 1 |b c57
1 |a 2 |b c3
There are 2 namespaces (a and b), and after reading the wiki, I know that vw will automatically build the real feature names, such as a^1 or b^c57.
However, before I knew that, I had made a vw file like this (call it b.vw):
-1 |a a_1 |b b_c57
1 |a a_2 |b b_c3
As you can see, I just added a prefix to each feature manually.
Now I train models on both files with the same configuration, like this:
cat a.vw | vw --loss_function logistic --passes 1 --hash all -f a.model --invert_hash a.readable --random_seed 1
cat b.vw | vw --loss_function logistic --passes 1 --hash all -f b.model --invert_hash b.readable --random_seed 1
Then I checked the readable model files; they have exactly the same weights for each feature, see below:
$ cat a.readable
Version 8.2.1
Id
Min label:-50
Max label:50
bits:18
lda:0
0 ngram:
0 skip:
options:
Checksum: 295637807
:0
Constant:116060:-0.0539969
a^1:112195:-0.235305
a^2:1080:0.243315
b^c3:46188:0.243315
b^c57:166454:-0.235305
$ cat b.readable
Version 8.2.1
Id
Min label:-50
Max label:50
bits:18
lda:0
0 ngram:
0 skip:
options:
Checksum: 295637807
:0
Constant:116060:-0.0539969
a^a_1:252326:-0.235305
a^a_2:85600:0.243315
b^b_c3:166594:0.243315
b^b_c57:227001:-0.235305
Finally, I ran prediction with each model on its own dataset, like this:
$ cat a.vw | vw -t -i a.model -p a.pred --link logistic --quiet
$ cat b.vw | vw -t -i b.model -p b.pred --link logistic --quiet
Now here comes the problem: a.pred holds very different results from b.pred, see below:
$ cat a.pred
0.428175
0.547189
$ cat b.pred
0.371776
0.606502
WHY? Does it mean we have to manually add a prefix to the features?
If you try cat a.vw | vw -t -i a.model -p a.pred --link logistic --quiet --hash all you'll get:
$ cat a.pred
0.371776
0.606502
It seems the value of the --hash argument isn't stored in the model file, so you need to specify it at the test step too. It doesn't matter for b.vw, as it has no purely numeric features, but it comes into play with a.vw. I'm not sure if it's a bug, but you may report it.

Add to the end of a predetermined line using sed in bash

I have a file in the format:
C 1 1 2
H 2 2 1
C 3 1 2
C 3 3 2
H 2 3 1
I need to add " f" to the end of specific lines, for example the third line, so the output would be:
C 1 1 2
H 2 2 1
C 3 1 2 f
C 3 3 2
H 2 3 1
From Googling, it seems that I need to use sed, but I couldn't find any examples on how to do specifically what I want.
Thanks in advance.
You are looking for this article on sed. Specifically, the section on restricting to a line number. An example:
sed '3 s/$/ f/' < yourFile
awk 'NR==3{$0=$0" f"}1' your_file
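If you need to mark several lines at once, the sed approach extends naturally by generating one substitution per line number. A small sketch, with the function name append_f made up for this example:

```shell
# Hypothetical helper: append " f" to the given line numbers of FILE and
# write the result to stdout. usage: append_f FILE LINE...
append_f() {
    file=$1; shift
    # Without line numbers there is nothing to change.
    [ "$#" -gt 0 ] || { cat "$file"; return; }
    # printf repeats its format once per argument, which yields one
    # "Ns/$/ f/" command per line number, separated by newlines.
    script=$(printf '%ss/$/ f/\n' "$@")
    sed "$script" "$file"
}
```

For example, append_f yourFile 3 5 adds " f" to lines 3 and 5 and leaves the rest untouched. To change the file in place you would still have to redirect to a temporary file and move it back.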

Resources