How to interpret the vw output with the -a (audit) option in contextual bandit mode? - vowpalwabbit

I am running a contextual bandit approach on two different data sets which differ only in the action variable. One set, on which I build a model (klaster3.model), has 6 different action types, while the other set, on which I also build a model (klaster8.model), has 7 different action types.
When I run the following line
head testLabels -n 1 | vw -i klaster8.model -t -p /dev/stdout --quiet
on the command line, I get
5.000000 Mloda_kobieta
which looks like the action chosen by the policy for that context.
But when I run the same command with the -a (audit) option, I receive the following output:
0.943965 Mloda_kobieta
Constant:142055:1:0.50745 ^K:136407:1:0.236886 ^Young:101199:1:0.199628
0.994175 Mloda_kobieta
Constant:142056:1:0.488827 ^K:136408:1:0.281023 ^Young:101200:1:0.224326
0.948740 Mloda_kobieta
Constant:142057:1:0.482498 ^K:136409:1:0.2568 ^Young:101201:1:0.209442
0.979921 Mloda_kobieta
Constant:142058:1:0.497253 ^K:136410:1:0.241421 ^Young:101202:1:0.241247
0.910945 Mloda_kobieta
Constant:142059:1:0.506602 ^K:136411:1:0.208468 ^Young:101203:1:0.195875
1.004143 Mloda_kobieta
Constant:142060:1:0.49813 ^K:136412:1:0.280554 ^Young:101204:1:0.225459
0.934807 Mloda_kobieta
Constant:142061:1:0.494118 ^K:136413:1:0.240735 ^Young:101205:1:0.199954
0.953710 Mloda_kobieta
Constant:142048:1:0.582269 ^K:136400:1:0.213502 ^Young:101192:1:0.15794
0.994442 Mloda_kobieta
Constant:142049:1:0.526175 ^K:136401:1:0.243671 ^Young:101193:1:0.224595
0.944228 Mloda_kobieta
Constant:142050:1:0.504455 ^K:136402:1:0.22308 ^Young:101194:1:0.216693
0.979964 Mloda_kobieta
Constant:142051:1:0.521737 ^K:136403:1:0.233687 ^Young:101195:1:0.22454
0.907704 Mloda_kobieta
Constant:142052:1:0.547686 ^Young:101196:1:0.186401 ^K:136404:1:0.173617
1.004132 Mloda_kobieta
Constant:142053:1:0.549014 ^K:136405:1:0.247787 ^Young:101197:1:0.207331
0.937724 Mloda_kobieta
Constant:142054:1:0.525254 ^K:136406:1:0.236784 ^Young:101198:1:0.175686
5.000000 Mloda_kobieta
This looks like some kind of scoring of the actions for this context, and in my opinion the action with the lowest score should be chosen (action 5 in this example). I am wondering why there are 14 rows when I have only 7 different action types in this data set, and why I receive 12 rows when I have 6 different action types. It looks to me like number_of_different_action_types*2. In my case there are only 2 explanatory variables, age and sex.
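For reference, the VW documentation describes each audit entry as namespace^feature:hashed_index:feature_value:weight, so a small parser (a Python sketch, assuming that format holds) makes the rows easier to compare:
def parse_audit_line(line):
    # Each token should look like namespace^feature:hash:value:weight
    # (per the VW audit docs; treated as an assumption here).
    entries = []
    for tok in line.split():
        parts = tok.rsplit(':', 3)
        if len(parts) == 4:
            name, h, value, weight = parts
            entries.append((name, int(h), float(value), float(weight)))
    return entries

print(parse_audit_line('Constant:142055:1:0.50745 ^K:136407:1:0.236886'))
# -> [('Constant', 142055, 1.0, 0.50745), ('^K', 136407, 1.0, 0.236886)]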
The questions are:
1) Does the number of rows in the audit (-a) output correspond to the equation number_of_different_action_types*number_of_explanatory_variables?
2) If yes, do the first 7 rows (in this example) correspond to the first variable, and the other 7 to the second variable?
3) How can I find out the order of the output? Which variable is treated as first and which as second? Does it correspond to the order of columns in the input data set?
4) If the first 7 rows correspond to the cost coefficients for the 1st variable, and the other 7 rows correspond to the cost coefficients for the 2nd variable, does the output policy choose the arm/action with the lowest sum of those coefficients? (Each action has 2 coefficients, because there are 2 variables.)
I suspect that the order of the output corresponds to the order of the columns in the input, but I am not sure.
Thanks for any answer.

Related

How to grab Index instead of relative position using xpath

Given the following xml:
<randomName>
<otherName>
<a>item1</a>
<a>item2</a>
<a>item3</a>
</otherName>
<lastName>
<a>item4</a>
<a>item5</a>
</lastName>
</randomName>
Running '//a' gives me an array of all 5 "a" elements; however, '//a[1]' does not give me the first of those five elements (item1). Instead it gives me an array containing item1 and item4.
I believe this is because each of them is at position 1 relative to its parent. How can I grab an a element by its overall index?
I would like to be able to use a variable "x" to get itemX.
You can wrap it in parentheses so the index is applied to the entire result set:
(//a)[1]
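//a[1] selects every a that is the first a child of its parent, whereas (//a)[1] takes the first node of the combined result set. To drive the index with a variable x, here is a minimal sketch using Python's lxml (an assumption, since the question doesn't name a host language; lxml binds XPath variables from keyword arguments):
from lxml import etree

xml = '''<randomName>
<otherName>
<a>item1</a>
<a>item2</a>
<a>item3</a>
</otherName>
<lastName>
<a>item4</a>
<a>item5</a>
</lastName>
</randomName>'''

root = etree.fromstring(xml)
x = 4                                        # 1-based overall index
# $x in the expression is bound to the keyword argument x.
print(root.xpath('(//a)[$x]', x=x)[0].text)  # -> item4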

Passing file name of a specified length in MATLAB

Problem in generating file names
I have around 4000 .txt files, each containing three columns of data. I want to read all 3 columns from one file at a time and then plot the three values, which correspond to x, y, z values, on a contour plot.
These files are created at various time steps, so a plot from one file will be a level curve, and plots from all of them will give me a contour plot.
The problem is that I want to do something which I can do in bash like this:
for n in `seq -f "%09g" 30001 200 830001`; do
./someFile$n.whateverFileFormat
done
How can I do this in MATLAB so that if I have, let's say:
t-000030001.txt
1 2 3
......
......
......
t-0000320001.txt
2 4 5
. . .
. . .
. . .
and so on to
t-0008300001.txt
3 5 6
. . .
. . .
and on it goes.
I want to load all these files one at a time, store the values in an n-by-3 array, plot them on a contour plot, and do this again for all the files so that I can have all of them on a single plot.
P.S. I need to reproduce something equivalent to the bash script mentioned above so as to generate the file names appropriately; only then will I be able to read from them.
One way to get the list of file names is this:
fnames = arrayfun(@(num)sprintf('t-%09g.txt', num), 30001:200:830001, 'UniformOutput', 0);
Let's have a closer look: 30001:200:830001 generates an array, starting at 30001, incrementing by 200, ending at 830001. sprintf generates a formatted string, and arrayfun applies the anonymous function passed as its first argument to each element of the array in its second argument (the sequence). The output is a cell array containing the file names.
EDIT
The solution above is equivalent to the following code:
ind = 30001:200:830001;
fnames = cell(numel(ind), 1);
for i = 1:numel(ind)
    fnames{i} = sprintf('t-%09g.txt', ind(i));
end
This stores all the values in a cell array.
Writing @(num)sprintf('t-%09g.txt', num) creates an anonymous function. The looping happens inside arrayfun.
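To tie this back to the plotting goal, here is a minimal sketch of the read-and-plot loop (the whitespace-delimited three-column format is taken from the question; using plot3 to overlay one curve per file is an illustrative choice, not something the question specifies):
fnames = arrayfun(@(num)sprintf('t-%09g.txt', num), 30001:200:830001, 'UniformOutput', 0);
figure; hold on;
for k = 1:numel(fnames)
    data = dlmread(fnames{k});              % n-by-3 array: columns x, y, z
    plot3(data(:,1), data(:,2), data(:,3)); % one level curve per time step
end
hold off;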

Can I calculate something inside a for loop and then plot those values on the same graph?

I have the following code, which plots 4 lines:
plot for [i=1:4] \
path_to_file using 1:(column(i)) , \
I also want to plot 8 horizontal lines on this graph, the values of which come from mydata.txt.
I have seen, from the answer to Gnuplot: How to load and display single numeric value from data file, that I can use the stats command to access the constant values I am interested in. I think I can access the cell (row, col) as follows:
stats 'mydata.txt' every ::row::row using col nooutput
value = int(STATS_min)
But their location is a function of i. So, inside the plot command, I want to add something like:
for [i=1:4] \
stats 'mydata.txt' every ::(1+i*10)::(1+i*10) using 1 nooutput
mean = int(STATS_min)
stats 'mydata.txt' every ::(1+i*10)::(1+i*10) using 2 nooutput
SE = int(STATS_min)
upper = mean + 2 * SE
lower = mean - 2 * SE
and then plot upper and lower, as horizontal lines on the graph, above.
I think I can plot them separately by typing plot upper, lower but how do I plot them on the graph, above, for all i?
Thank you.
You can create an array and store the values in it; then, using an index that refers to each value's position in the array, you can access it inside a loop.
You can create the array as follows:
array=""
do for [i=1:4] {
val = i / 9.
array = sprintf("%s %g",array,val)
}
where I have stored 4 values: 1/9, 2/9, 3/9 and 4/9. In your case you would run stats and store your upper and/or lower variables. You can check what the array looks like in this way:
gnuplot> print array
0.111111 0.222222 0.333333 0.444444
For plotting, you can access the different elements in the array using word(array,i), where i refers to the position. Since the array is a string, you need to convert the element to a float, which can be done by multiplying by 1.:
plot for [i=1:4] 1.*word(array,i)
If you have values stored in a data file, you can process it with awk or even with gnuplot:
array = ""
plot for [i=1:4] "data" every ::i::i u (array=sprintf("%s %g",array,$1), 1/0), \
for [i=1:4] 1.*word(array,i)
The first plot instance creates the array from the first column data entries without plotting the points (the 1/0 option tells gnuplot to ignore them, so expect warning messages) and the second plot instance uses the values stored in array as variables (hence as horizontal lines in this case). Note that every takes 0 as the first entry, so [i=1:4] runs from the second through to the fifth lines of the file.
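Applied to the question above, the same trick can collect upper and lower into string arrays with a do for loop around stats, and then plot them as horizontal lines (a sketch reusing the question's row positions 1+i*10 and its path_to_file placeholder; do for requires gnuplot >= 4.6):
uppers = ""
lowers = ""
do for [i=1:4] {
    stats 'mydata.txt' every ::(1+i*10)::(1+i*10) using 1 nooutput
    mean = STATS_min
    stats 'mydata.txt' every ::(1+i*10)::(1+i*10) using 2 nooutput
    SE = STATS_min
    uppers = sprintf("%s %g", uppers, mean + 2*SE)
    lowers = sprintf("%s %g", lowers, mean - 2*SE)
}
plot for [i=1:4] path_to_file using 1:(column(i)), \
     for [i=1:4] 1.*word(uppers,i) notitle, \
     for [i=1:4] 1.*word(lowers,i) notitle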

hadoop stream, how to set partition?

I'm very new to Hadoop streaming and am having some difficulties with partitioning.
Depending on what is found in a line, my mapper function returns either
key1, 0, somegeneralvalues # some kind of "header" line where linetype = 0
or
key1, 1, value1, value2, othervalues... # "data" line, different values, linetype =1
To reduce properly, I need to group all lines having the same key1 and to sort them by linetype (0 or 1), then by value1 and value2, something like:
1 0 foo bar... # header first
1 1 888 999.... # data line, with lower value1
1 1 999 111.... # a few datalines may follow. Sort by value1,value2 should be performed
------------ #possible partition here, and only here in this example
2 0 baz foobar....
2 1 123 888...
2 1 123 999...
2 1 456 111...
Is there a way to ensure such partitioning? So far I've tried to play with options such as
-partitioner,'org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner'
-D stream.num.map.output.key.fields=4 # please use 4 fields to sort data
-D mapred.text.key.partitioner.options=-k1,1 # please make partitions based on first key
or alternatively
-D num.key.fields.for.partition=1 # Seriously, please group by key1 !
which yet only brought rage and despair.
If it's worth mentioning, my scripts work properly if I use cat data | mapper | sort | reduce,
and I'm using the Amazon Elastic MapReduce Ruby client, so I'm passing the options with
--arg '-D','options' for the Ruby script.
Any help would be highly appreciated! Thanks in advance.
Thanks to ryanbwork, I've been able to solve this problem. Yay!
The right idea was indeed to create a key that consists of a concatenation of the values. To go a little further, it is also possible to create a key that looks like
<'1.0.foo.bar', {'0','foo','bar'}>
<'1.1.888.999', {'1','888','999'}>
Options can then be passed to Hadoop so that it can partition by the first "part" of the key. If I'm not mistaken in my interpretation, it looks like
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
-D stream.map.output.field.separator=. # I added some "." in the key
-D stream.num.map.output.key.fields=4 # 4 "sub-fields" are used to sort
-D num.key.fields.for.partition=1 # only one field is used to partition
This solution, based on what ryanbwork said, allows creating more reducers while ensuring the data is properly split and sorted.
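For illustration, a minimal streaming mapper along these lines could look like the following (a Python sketch; the comma-separated input format and the field positions are assumptions, not taken from the post):
#!/usr/bin/env python
# Emits a dot-separated composite key (key1.linetype.value1.value2),
# a tab, then the payload, so KeyFieldBasedPartitioner can partition
# on the first sub-field while the shuffle sorts on all four.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split(',')
    key1, linetype = fields[0], fields[1]
    value1 = fields[2] if len(fields) > 2 else ''
    value2 = fields[3] if len(fields) > 3 else ''
    composite = '.'.join([key1, linetype, value1, value2])
    print('%s\t%s' % (composite, ','.join(fields[1:])))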
After reading this post I'd propose modifying your mapper such that it returns pairs whose 'keys' include your key value, your linetype value, and the value1/value2 values all concatenated together. You'd keep the 'value' part of the pair the same. So for example, you'd return the following pairs to represent your first two examples:
<'10foobar',{'0','foo','bar'}>
<'11888999',{'1','888','999'}>
Now if you were to utilize a single reducer, all of your records would get sent to the same reduce task and sorted in alphabetical order based on their 'key'. This would fulfill your requirement that pairs get sorted by key, then by linetype, then by value1 and finally value2, and you could access these values individually in the 'value' portion of the pair. I'm not very familiar with the different built-in partitioner/sort classes, but I'd assume you could just use the defaults and get this to work.

Group and Count an Array of Structs

Ruby noob here!
I have an array of structs that look like this
Token = Struct.new(:token, :ordinal)
So an array of these would look like this, in tabular form:
Token | Ordinal
---------------
C | 2
CC | 3
C | 5
And I want to group by the "token" (i.e. the left-hand column) of the struct and get a count, but also preserve the "ordinal" element. So the above would look like this:
Token | Merged Ordinal | Count
------------------------------
C | 2, 5 | 2
CC | 3 | 1
Notice that the last column is a count of the grouped tokens and the middle column merges the "ordinal" values. The first column ("Token") can contain a variable number of characters, and I want to group on it.
I have tried various methods using group_by (I can get the count, but not the middle column), inject, and plain iteration (which does not seem very functional), but I just can't get it right, partly because I don't have a good grasp of Ruby and its available operations/functions.
I have also had a good look around SO, but I am not getting very far.
Any help, pointers would be much appreciated!
Use Enumerable#group_by to do the grouping for you and use the resulting hash to get what you want with map or similar.
structs.group_by(&:token).map do |token, with_same_token|
  [token, with_same_token.map(&:ordinal), with_same_token.size]
end
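A quick check with the data from the question (assuming the Token struct defined there):
Token = Struct.new(:token, :ordinal)
structs = [Token.new('C', 2), Token.new('CC', 3), Token.new('C', 5)]

structs.group_by(&:token).map do |token, with_same_token|
  [token, with_same_token.map(&:ordinal), with_same_token.size]
end
# => [["C", [2, 5], 2], ["CC", [3], 1]]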
