Same dataset, different prediction results - vowpalwabbit

I have a very simple dataset, see below (let's call it a.vw):
-1 |a 1 |b c57
1 |a 2 |b c3
There are 2 namespaces (a and b), and after reading the wiki, I know that vw will automatically build the real feature names, like a^1 or b^c57.
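You can confirm the constructed names with vw's --audit option, which prints each feature as namespace^feature together with its hash (a quick check; audit output elided here):
$ echo '-1 |a 1 |b c57' | vw --audit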
However, before I knew that, I had actually made a vw file like this (call it b.vw):
-1 |a a_1 |b b_c57
1 |a a_2 |b b_c3
As you can see, I just added a prefix to each feature manually.
Now I train models on both files with the same configuration, like this:
cat a.vw | vw --loss_function logistic --passes 1 --hash all -f a.model --invert_hash a.readable --random_seed 1
cat b.vw | vw --loss_function logistic --passes 1 --hash all -f b.model --invert_hash b.readable --random_seed 1
Then I checked the readable model files; they have exactly the same weights for each feature, see below:
$ cat a.readable
Version 8.2.1
Id
Min label:-50
Max label:50
bits:18
lda:0
0 ngram:
0 skip:
options:
Checksum: 295637807
:0
Constant:116060:-0.0539969
a^1:112195:-0.235305
a^2:1080:0.243315
b^c3:46188:0.243315
b^c57:166454:-0.235305
$ cat b.readable
Version 8.2.1
Id
Min label:-50
Max label:50
bits:18
lda:0
0 ngram:
0 skip:
options:
Checksum: 295637807
:0
Constant:116060:-0.0539969
a^a_1:252326:-0.235305
a^a_2:85600:0.243315
b^b_c3:166594:0.243315
b^b_c57:227001:-0.235305
Finally, I made predictions with each model on its respective dataset, like this:
$ cat a.vw | vw -t -i a.model -p a.pred --link logistic --quiet
$ cat b.vw | vw -t -i b.model -p b.pred --link logistic --quiet
Now here comes the problem: a.pred holds very different results from b.pred, see below:
$ cat a.pred
0.428175
0.547189
$ cat b.pred
0.371776
0.606502
Why? Does it mean we have to manually add prefixes to features?

If you try cat a.vw | vw -t -i a.model -p a.pred --link logistic --quiet --hash all you'll get:
$ cat a.pred
0.371776
0.606502
It seems the --hash argument value isn't stored in the model file, so you need to specify it at the test step too. It doesn't matter for b.vw, as it has no purely numeric features, but it comes into play with a.vw. I'm not sure if it's a bug, but you may want to report it.
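For reference, a sketch of a consistent train/test pair (the flags are copied from the question; the point is simply that --hash all appears at both steps):
$ cat a.vw | vw --loss_function logistic --hash all -f a.model --random_seed 1 --quiet
$ cat a.vw | vw -t -i a.model -p a.pred --link logistic --hash all --quiet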

How to add a sequential number to the output of jq

I'm getting some values with a jq command like this:
curl xxxxxx | jq -r '.[] | ["\(.job.Name), \(.atrib.data)"] | @tsv' | column -t -s ","
It gives me:
AAAA PENDING
ZZZ FAILED BAD
What I want is to get a first field with a sequential number (1, 2, ...), like this:
1 AAA PENDING
2 ZZZ FAILED BAD
......
Do you know if it's possible? Thanks!
One way would be to start your pipeline with:
range(0;length) as $i | .[$i]
You can then use $i in the remainder of the program.
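For example, spliced into the pipeline from the question (a sketch; .job.Name and .atrib.data are copied from the original command, and \($i+1) makes the numbering start at 1):
curl xxxxxx | jq -r 'range(0;length) as $i | .[$i] | ["\($i+1), \(.job.Name), \(.atrib.data)"] | @tsv' | column -t -s ","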

sort: wrong order when comparing according to numerical value

I'm trying to sort the lines in the following file according to numerical value in the second column:
2 117.336136
1 141.003021
1 342.389160
1 169.059006
1 208.173370
1 117.608192
However, for some reason, the following command returns the lines in the wrong order:
cat file | sort -n -k2
1 117.608192
2 117.336136
1 141.003021
1 169.059006
1 208.173370
1 342.389160
The first two lines are swapped. For other lines, the content of the first column does not affect the result.
Without the -k argument, the sort works exactly as expected:
cat file | cut -d' ' -f2 | sort -n
117.336136
117.608192
141.003021
169.059006
208.173370
342.389160
Why is that? Did I misunderstand the meaning of the -k argument?
Additional information:
LC_ALL=cs_CZ.utf8
sort --version gives sort (GNU coreutils) 8.31
sort orders lines according to your locale settings.
As mentioned by @KamilCuk, the decimal separator for cs_CZ is "," instead of ".". You can override the default locale with LC_ALL=C.UTF-8 or LC_ALL=C (no UTF-8 support) to use the C locale settings for sorting.
For sorting floating-point values you need -g here: -n recognizes only the locale's decimal separator, so under cs_CZ it compares just the integer part.
Also important when using sort is to restrict the key to the specific column with -k 2,2g, because by default a key such as -k2 extends from the second field to the end of the line.
$ LC_ALL=C.UTF-8 sort -k 2,2g test_sort.txt
2 117.336136
1 117.608192
1 141.003021
1 169.059006
1 208.173370
1 342.389160
$ LC_ALL=C sort -k 2,2g test_sort.txt
2 117.336136
1 117.608192
1 141.003021
1 169.059006
1 208.173370
1 342.389160
$ printf '1\t5.3\t6.0\n2\t5.3\t5.0\n'
1 5.3 6.0
2 5.3 5.0
# Sort uses the rest of the line to sort.
$ printf '1\t5.3\t6.0\n2\t5.3\t5.0\n' | LC_ALL=C.UTF-8 sort -k 2
2 5.3 5.0
1 5.3 6.0
# Sort only uses the second column.
$ printf '1\t5.3\t6.0\n2\t5.3\t5.0\n' | LC_ALL=C.UTF-8 sort -k 2,2
1 5.3 6.0
2 5.3 5.0
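And to see the locale effect itself in isolation, a sketch (assuming the cs_CZ.utf8 locale from the question is installed; -s keeps tied lines in input order, which makes the tie visible):
$ printf '117.608192\n117.336136\n' | LC_ALL=cs_CZ.utf8 sort -sn
117.608192
117.336136
$ printf '117.608192\n117.336136\n' | LC_ALL=C sort -sn
117.336136
117.608192
# Under cs_CZ the "." is not recognized as a decimal separator, so both lines
# compare equal as 117 and the stable sort leaves them in input order.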

How do I remove duplicated by position SNPs using PLink?

I am working with PLINK to analyse SNP chip data.
Does anyone know how to remove duplicated SNPs (duplicated by position)?
If we already have files in PLINK format, then we should have a .bim file for binary PLINK files or a .map file for text PLINK files. In either case the SNP names are in the 2nd column, and in a three-column .map file the position is in the 3rd column (note that in a .bim file, or a .map file with the optional genetic-distance column, the base-pair position is in the 4th column, so the field numbers below would need adjusting).
We need to create a list of SNPs that are duplicated:
sort -k3n myFile.map | uniq -f2 -D | cut -f2 > dupeSNP.txt
Then run plink with --exclude flag:
plink --file myFile --exclude dupeSNP.txt --out myFileSubset
You can also do it directly in PLINK 1.9 using the --list-duplicate-vars flag,
together with the require-same-ref, ids-only, or suppress-first modifiers, depending on what you want to do.
Check https://www.cog-genomics.org/plink/1.9/data#list_duplicate_vars for more details.
If you want to delete all occurrences of a variant that has duplicates, you will have to use the --exclude flag on the output file of --list-duplicate-vars,
which should have a .dupvar extension.
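For example, a sketch with placeholder file names (per the linked docs, ids-only writes bare variant IDs, which is the format --exclude expects, and suppress-first leaves the first member of each duplicate group out of the list, so one copy is kept):
plink --bfile myFile --list-duplicate-vars ids-only suppress-first --out myDups
plink --bfile myFile --exclude myDups.dupvar --make-bed --out myFileSubset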
I should caution that the two answers given above yield different results. This is because the sort | uniq method only takes into account SNP name and bp location, whereas the PLINK method (--list-duplicate-vars) takes A1 and A2 into account as well.
Similar to sort | uniq on the .map file, we could use awk on a .gen file, which looks like this:
22 rs1 12 A G 1 0 0 1 0 0
22 rs1 12 G A 0 1 0 0 0 1
22 rs2 16 C A 1 0 0 0 1 0
22 rs2 16 C G 0 0 1 1 0 0
22 rs3 17 T CTA 0 0 1 0 1 0
22 rs3 17 CTA T 1 0 0 0 0 1
# Get list of duplicate rsXYZ ID's
awk -F' ' '{print $2}' chr22.gen |\
sort |\
uniq -d > chr22_rsid_duplicates.txt
# Get list of duplicated bp positions
awk -F' ' '{print $3}' chr22.gen |\
sort |\
uniq -d > chr22_location_duplicates.txt
# Now match this list of bp positions to gen file to get the rsid for these locations
awk 'NR==FNR{a[$1]=$2;next}$3 in a{print $2}' \
chr22_location_duplicates.txt \
chr22.gen |\
sort |\
uniq \
> chr22_rsidBylocation_duplicates.txt
cat chr22_rsid_duplicates.txt \
chr22_rsidBylocation_duplicates.txt \
> tmp
# Get list of duplicates (by location and/or rsid)
cat tmp | sort | uniq > chr22_duplicates.txt
plink --gen chr22.gen \
--sample chr22.sample \
--exclude chr22_duplicates.txt \
--recode oxford \
--out chr22_noDups
This will classify rs2 as a duplicate; however, for the PLINK list-duplicate-vars method rs2 will not be flagged as a duplicate.
If one wants to obtain the same results using PLINK (a non-trivial task for BGEN file formats, since awk, sed, etc. do not work on binary files!), you can use the --rm-dup command from PLINK 2.0. The list of all removed duplicate SNPs can be logged (to a file ending in .rmdup.list) using the list modifier, like so:
# Export as bgen version 1.1 (a comment after a trailing backslash would break the command)
plink2 --bgen chr22.bgen \
--sample chr22.sample \
--rm-dup exclude-all list \
--export bgen-1.1 \
--out chr22_noDups
Note: I'm saving the output as version 1.1, since plink 1.9 still has commands that are not available in plink 2.0; therefore the only way to use BGEN files with plink 1.9 (at this time) is with the older 1.1 format.

Minimal two column numeric input data for `sort` example, with distinct permutations

What's the least number of rows of two-column numeric input needed to produce four unique sort outputs for the following four options?
1. -sn -k1
2. -sn -k2
3. -sn -k1 -k2
4. -sn -k2 -k1
Here's a 6-row example (with 4 unique outputs):
6 5
3 7
6 3
2 7
4 4
5 2
As a convenience, here is a function that takes two columns of numbers as arguments and prints the number of unique outputs among those four sorts (requires the moreutils pee command):
# Usage: foo c1_1 c2_1 c1_2 c2_2 ...
foo() { echo "$@" | tr -s '[:space:]' '\n' | paste - - | \
pee "sort -sn -k1 | md5sum" \
"sort -sn -k2 | md5sum" \
"sort -sn -k1 -k2 | md5sum" \
"sort -sn -k2 -k1 | md5sum" | \
sort -u | wc -l ; }
So to count the unique permutations of this input:
8 5
3 1
8 3
Run this:
foo 8 5 3 1 8 3
Output:
2
(Only two unique outputs. Not enough...)
Note: This question was inspired by the obscurity of the current version of the sort manual, specifically COLUMNS=65 man sort | grep -A 17 KEYDEF | sed 3,18d. The info sort page's treatment of KEYDEFs is much better.
KEYDEFs are more useful than they might first seem. The -u or --unique switch works nicely with the KEYDEFs, and in effect allows sort to delete unwanted redundant lines, and therefore can furnish a more concise substitute for certain sed or awk scripts and similar pipelines.
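For instance, a small sketch of that: keeping only the first line for each value of the first field (with -s, "first" means first in input order), a job one might otherwise give to awk '!seen[$1]++':
$ printf '1 a\n1 b\n2 c\n' | sort -s -n -u -k1,1
1 a
2 c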
I can do it in 3 by varying the whitespace:
1 1
2 1
1 2
Your foo function doesn't produce this kind of output, but since it was only a "convenience" and not a part of the question proper, I declare this answer correct and minimal!
Sneakier version:
2 1
11 1
2 2
(The last line contains a tab; the others don't.)
With the -s option, I can't exploit non-numeric comparisons, but then I can exploit the stability of the sort:
1 2
2 1
1 1
The 1 1 line goes above both of the others if both fields are compared numerically, regardless of which comparison is done first. The ordering of the two comparisons determines the ordering of the other two lines.
On the other hand, if one of the fields isn't used for comparison, the 1 1 line stays below one of the other lines (and which one that is depends on which field is used for comparison).
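As a check, feeding this three-row input to the foo helper from the question should report all four outputs as distinct:
$ foo 1 2 2 1 1 1
4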

Vowpal Wabbit predictions for multi-label classification

I am sorry, I do feel I am overlooking something really obvious.
But how can the following happen:
$ cat myTrainFile.txt
1:1 |f 1:12 2:13
2:1 |f 3:23 4:234
3:1 |f 5:12 6:34
$ cat myTestFile.txt
14:1 |f 1:12 2:13
14:1 |f 3:23 4:234
14:1 |f 5:12 6:34
$ vw --csoaa 3 -f myModel.model --compressed < myTrainFile.txt
final_regressor = myModel.model
...
...
$ vw -t -i myModel.model -p myPred.pred < myTestFile.txt
only testing
Num weight bits = 18
...
...
$ cat myPred.pred
14.000000
14.000000
14.000000
So the test file is identical to the train file, but for the labels.
Hence, I would expect vw to produce the original labels that it learned from the train file, as it ignores the labels in the test file completely.
However, it seems to reproduce the labels from the test file?!
Clearly, I am doing something completely wrong here... but what?
If you specify just one label in --csoaa (even in the -t test mode), it means that only that label is "available" for this example, so no other label can be predicted.
This is another difference from --oaa (where you always specify just the correct label).
See https://groups.yahoo.com/neo/groups/vowpal_wabbit/conversations/topics/2949.
If all labels are "available" (possible) for any test example, you must always include all the labels on each line.
With -t you do not need to include the costs of the labels if you just want to get the predictions (i.e., if you don't need vw to compute the test loss).
So your myTestFile.txt should look like:
1 2 3 |f 1:12 2:13
1 2 3 |f 3:23 4:234
1 2 3 |f 5:12 6:34
and your myTrainFile.txt should look like:
1:0 2:1 3:1 |f 1:12 2:13
1:1 2:0 3:1 |f 3:23 4:234
1:1 2:1 3:0 |f 5:12 6:34
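For instance, a minimal sketch of retraining and predicting with these corrected files (same flags as the question's commands):
$ vw --csoaa 3 -f myModel.model --compressed < myTrainFile.txt
$ vw -t -i myModel.model -p myPred.pred < myTestFile.txt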
So, for completeness' sake, here is how it does work:
$ cat myTrainFile.txt
1:1.0 |f 1:12 2:13
2:1.0 |f 3:23 4:234
3:1.0 |f 5:12 6:34
$ cat myTestFile.txt
1 2 3 |f 1:12 2:13
1 2 3 |f 3:23 4:234
1 2 3 |f 5:12 6:34
$ vw -t -i myModel.model -p myPred.pred < myTestFile.txt
only testing
...
$ cat myPred.pred
2.000000
1.000000
2.000000
So it is perhaps a bit surprising that none of the examples is classified correctly, but that is another problem.
Thanks @Martin Popel!
