Numerical training data format for contextual bandits in Vowpal Wabbit

I plan to use the contextual bandit functionality of Vowpal Wabbit (VW) to build a recommender system.
I have an M-dimensional (26 in this case) numerical feature vector for each of N users, plus feedback logs that record which user clicked which item (e.g. an ad). The total number of valid actions differs slightly from log to log (about 100~150). The only information available for the items (actions) is their unique ID.
Given this, I decided to use the ADF learning mode (--cb_explore_adf). But the tutorial seems to show VW handling only categorical features, not numerical ones. In any case, I tried to set up the test data format like below.
shared |User feat_0=1.0 feat_1=0.00389094278216362 feat_2=0.004632890224456787 feat_3=0.003936515189707279 feat_4=0.0053831832483410835 ... feat_23=0.4192083477973938 feat_24=0.003969503100961447 feat_25=0.0038898871280252934
|Action item_id=hamny-kU9bbbbbak
|Action item_id=hamny-kU9bcxP9v1
...
|Action item_id=hamny-bbbbbcxP9v
|Action item_id=hamny-k7bbbbbcxd
|Action item_id=hamny-bbbbbbbbbc
|Action item_id=hamny-aaaaaaaaac
The example above asks the CB model to produce a PMF (predict) over the ~100 actions, given the 26-dimensional user context features.
After getting a prediction from the model and observing the reward, the training data format would be:
shared |User feat_0=1.0 feat_1=0.00389094278216362 feat_2=0.004632890224456787 feat_3=0.003936515189707279 feat_4=0.0053831832483410835 ... feat_23=0.4192083477973938 feat_24=0.003969503100961447 feat_25=0.0038898871280252934
|Action item_id=hamny-kU9bbbbbak
|Action item_id=hamny-kU9bcxP9v1
...
|Action item_id=hamny-bbbbbcxP9v
0:-1:0.57124 |Action item_id=hamny-k7bbbbbcxd
|Action item_id=hamny-bbbbbbbbbc
|Action item_id=hamny-aaaaaaaaac
I'm not sure whether this is the proper format or not. But when I run a CTR simulation, the CB model gives almost the same result regardless of the exploration option (e.g. epsilon, bag, softmax, etc.).
I used the same logic as the tutorial function (run_simulation); the only differences are the example (shared context), the number of actions, and ADF.
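For reference, my simulation loop follows the tutorial's pattern, roughly like the sketch below. It assumes the vowpalwabbit Python package (pyvw, as used in the tutorial); to_adf_format and the dummy context/action values are placeholders for my real data, and older bindings may require vw.parse(...) before vw.learn, as the tutorial does.

import random
from vowpalwabbit import pyvw

# --cb_explore_adf plus an exploration option; swapping --epsilon for --bag
# or --softmax is where I expected (but did not see) different CTRs
vw = pyvw.vw("--cb_explore_adf --epsilon 0.2 --quiet")

def to_adf_format(context, actions, label=None):
    """Build the multi-line shared/action text shown above.
    label is a (chosen_index, cost, probability) tuple when learning."""
    lines = ["shared |User " + " ".join(f"feat_{i}={v}" for i, v in enumerate(context))]
    for i, item_id in enumerate(actions):
        prefix = f"0:{label[1]}:{label[2]} " if label is not None and label[0] == i else ""
        lines.append(f"{prefix}|Action item_id={item_id}")
    return "\n".join(lines)

user_features = [1.0, 0.00389094278216362, 0.004632890224456787]   # truncated 26-dim vector
item_ids = ["hamny-kU9bbbbbak", "hamny-kU9bcxP9v1", "hamny-bbbbbcxP9v"]

# predict: returns the pmf over the listed actions, in order
pmf = vw.predict(to_adf_format(user_features, item_ids))
chosen = random.choices(range(len(item_ids)), weights=pmf)[0]

# observe a reward, convert to a cost (cost = -reward), then learn
reward = 1
vw.learn(to_adf_format(user_features, item_ids, label=(chosen, -reward, pmf[chosen])))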

The VW text format is quite simple. When specifying a feature, if you use a ':' followed by a float, you set the feature's value. If there is no ':' and explicit value, the value is 1.
So when you supply a feature as feat_1=0.00389094278216362 it is a categorical feature with value 1. The important thing to note here is that if any part of that feature string changes, it results in a completely different feature (the entire string is hashed to determine its index), so feat_1=0.00389094278216363 (last character changed) is a completely different feature. There is no relation between the two.
You could try specifying the value like feat_1:0.00389094278216362, but I am not sure if that will really work. Perhaps it will if there is some sort of linear relationship between the feature and the outcome?
You could also try binning the features to some decimal place with rounding, so feat_1=0.00389094278216362 may become feat_1=0.004.
I am not sure of the theory behind what should be done here, but those are some things you could try empirically.
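To make both suggestions concrete, here is a small sketch (plain Python string formatting, nothing VW-specific; the helper name and feature values are just illustrative):

def numeric_feature(name, value, bin_decimals=None):
    """Emit a VW feature token.

    name:value -> one feature whose weight is multiplied by value
    name=value -> (what the question used) a distinct hashed feature per
                  exact string, each with an implicit value of 1
    """
    if bin_decimals is not None:
        # binning alternative: round, then keep it as a categorical token
        return f"{name}={round(value, bin_decimals)}"
    return f"{name}:{value}"

feats = {"feat_0": 1.0, "feat_1": 0.00389094278216362, "feat_2": 0.004632890224456787}

shared_numeric = "shared |User " + " ".join(numeric_feature(n, v) for n, v in feats.items())
shared_binned = "shared |User " + " ".join(numeric_feature(n, v, bin_decimals=3) for n, v in feats.items())

print(shared_numeric)  # shared |User feat_0:1.0 feat_1:0.00389094278216362 feat_2:0.004632890224456787
print(shared_binned)   # shared |User feat_0=1.0 feat_1=0.004 feat_2=0.005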

Related

How do I access h2o xgb model input features after saving a model to disk and reloading it?

I'm using h2o's xgboost implementation in Python. I've saved a model to disk and I'm trying to load it later on for analysis and prediction. I'm trying to access the list of input features or, even better, the feature list actually used by the model, which does not include the features it decided not to use. The usual advice is to use the varimp function to get the variable importance, and while this does remove features that aren't used in the model, it actually gives you the variable importance of the intermediate features created by one-hot encoding (OHE) the categorical features, not the original categorical feature names.
I've searched for how to do this and so far I've found the following but no concrete way to do this:
Someone asking something very similar to this and being told the feature has been requested in Jira
The Jira ticket in question, which has been marked resolved, but which I believe says this was implemented without being made customer-visible.
A similar ticket requesting this feature (original categorical feature importance) for variable importance heatmaps but it is still open.
Someone else who found an unofficial way to access the columns with model._model_json['output']['names'], but that doesn't give the features that weren't used by the model, and they are told to use a different method that doesn't work if you have saved the model to disk and reloaded it (which I am doing).
The only option I see is to just take the varimp features, split on the period character to break up the OHE feature names, keep the first part of each split, and then run a set over everything to get the unique column names. But I'm hoping there's a better way to do this.
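Concretely, the workaround I have in mind looks roughly like this (assuming varimp(use_pandas=True) returns its usual DataFrame with a 'variable' column and that the OHE names have the form originalcolumn.level; the model path is just a placeholder):

import h2o

h2o.init()
model = h2o.load_model("/path/to/saved_model")   # placeholder path

# varimp as a pandas DataFrame: columns include 'variable', 'relative_importance', ...
varimp_df = model.varimp(use_pandas=True)

# OHE'd categoricals appear as "column.level"; strip the level suffix to
# recover the original column name, then de-duplicate (this breaks if a
# real column name itself contains a period)
used_columns = {name.split(".")[0] for name in varimp_df["variable"]}
print(sorted(used_columns))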

Can you provide additional tags for documents using TaggedLineDocument?

When training a doc2vec model using a corpus in the TaggedDocument class, you can provide a list of tags. When the doc2vec model is trained it learns a vector representation for the tags. For example you could have one tag representing the document, and another representing some classification that can be shared between documents.
How would one provide additional tags when streaming a corpus using TaggedLineDocument?
The TaggedLineDocument class only considers documents to be one per line, with a single tag that is their line-number.
If you want more tags, you'll have to provide your own iterable which does that. It should only be a few lines of code, depending on where your other tags come from. You can use the source for TaggedLineDocument – which is itself only 9 lines of Python code – as a model to build on:
https://github.com/RaRe-Technologies/gensim/blob/e4199cb4e9a90df44ca59c1d0505b138caa21951/gensim/models/doc2vec.py#L1126
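For example, a minimal sketch of such an iterable (here the extra tags are assumed to come from a dict keyed by line number; adapt the lookup to wherever your tags actually live):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class MultiTagLineDocument:
    """Like TaggedLineDocument, but appends extra tags to the line-number tag."""

    def __init__(self, path, extra_tags_by_line):
        self.path = path
        self.extra_tags_by_line = extra_tags_by_line  # e.g. {0: ['sports'], 1: ['politics']}

    def __iter__(self):
        with open(self.path, encoding='utf-8') as fin:
            for item_no, line in enumerate(fin):
                tags = [item_no] + list(self.extra_tags_by_line.get(item_no, []))
                yield TaggedDocument(line.split(), tags)

# usage sketch:
# documents = MultiTagLineDocument('corpus.txt', {0: ['sports']})
# model = Doc2Vec(documents, vector_size=100, epochs=20)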
Note: while supplying more than one tag per document is a natural extension of the original 'Paragraph Vectors' approach, and often can provide benefits, sometimes it also 'dilutes' the salience of each tag's vector – which becomes a special concern as the average number of tags per document grows, or as the model acquires many more tags than unique documents. So be sure to comparatively evaluate whether any multiple-tag strategy is helping or hurting, in different modes, and whether things like pre-known categories work better as extra tags or as known labels for some later step.

openEHR, Snomed and Measurement units

I'm new to openEHR and SNOMED. I want to store an information pack definition for a tobacco summary. How do I go about storing the measurement units (grams, oz, number of cigarettes)? Is there a reference list of these in either of the standards?
Thanks
Your question should not be about storing, it should be about modeling with openEHR. Storage of openEHR data is a separate issue.
For modeling, you will first need to understand the information model: the structure, the data types, etc. You will find some types that might be useful in your case, for instance DV_COUNT for storing a "number of" (this is for counting, like the number of cigarettes), which doesn't have units of measure since it is a count. If you want to store volume or weight, the openEHR information model has DV_QUANTITY. For standard units, as Bert says, you can use UCUM. For non-standard units, you might need to choose a different datatype, since the recommendation for DV_QUANTITY.units is to use UCUM (Unified Code for Units of Measure).
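To make the distinction concrete, a rough sketch of the two data types as plain Python dicts (not a normative serialization; the magnitude/units attribute names follow the reference model, but check the specs and whichever serialization you end up using):

# DV_COUNT: a dimensionless count, e.g. number of cigarettes per day
cigarettes_per_day = {
    "_type": "DV_COUNT",
    "magnitude": 15,
}

# DV_QUANTITY: a measured amount carrying a UCUM unit, e.g. grams of tobacco
tobacco_amount = {
    "_type": "DV_QUANTITY",
    "magnitude": 12.5,
    "units": "g",   # UCUM code for grams
}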
When you have that figured out, you need to follow the openEHR methodology for modeling, using archetypes and templates. A template would be the final form of your structure that can be used in software. At that moment you can worry about storage.
Storing today is a solved problem. There are many solutions, using relational, document and mixed databases. My implementation, the EHRServer, uses pure relational approach. But you can create your own, just map the openEHR information model structures to your database of preference, starting from the datatypes.
And of course, start with the openEHR specs: https://www.openehr.org/programs/specification/workingbaseline
BTW, SNOMED doesn't play any role here, not sure why you mentioned that in the title. You need to understand the standards before trying to implement them.
openEHR has its own unit list from which you used to choose a unit in a DvQuantity, but recently the newest version of the specs states that you must use a unit from the UCUM standard. Check the description of the data types in the specifications.
You can find the UCUM standard here. The link is published by the Regenstrief Institute (the same institute which maintains the LOINC standard), so it is stable.
http://unitsofmeasure.org/ucum.html
There is a Golang-UCUM-library:
https://github.com/BertVerhees/ucum

How do I use IOB tags with Stanford NER?

There seem to be a few different settings:
iobtags
iobTags
entitySubclassification (IOB1 or IOB2?)
evaluateIOB
Which setting do I use, and how do I use it correctly?
I tried labelling like this:
1997 B-DATE
volvo B-BRAND
wia64t B-MODEL
highway B-TYPE
tractor I-TYPE
But on the training output, it seemed to think that B-TYPE and I-TYPE were different classes.
I am using the 2013-11-12 release.
How this can be done is currently (2013 releases) a bit of a mess, since there are two different sets of flags for two different DocumentReaderAndWriter implementations. Sorry.
The most flexible support for different IOB styles is found in CoNLLDocumentReaderAndWriter. You can have it map any IOB/IOE/... annotation done with hyphenated prefixes like in your examples (B-BRAND) to any other scheme while it is reading files, using the flag:
-entitySubclassification IOB2
The resulting label set is then used for training and classification. The options are documented in the entitySubclassify() method of CoNLLDocumentReaderAndWriter: IOB1, IOB2, IOE1, IOE2, SBIEO, IO. You can find a discussion of IOB1 vs. IOB2 in Tjong Kim Sang and Veenstra 1999. By default the representation is mapped back to IOB1 on output, since that is the default used in the CoNLL conlleval program, but you can keep it as what you mapped it to with the flag:
-retainEntitySubclassification
To use this DocumentReaderAndWriter, you can give a training command like:
java8 -mx6g edu.stanford.nlp.ie.crf.CRFClassifier -prop conll.crf.chris2009.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter -entitySubclassification iob2
Alternatively, ColumnDocumentReaderAndWriter is the default DocumentReaderAndWriter which we use in the distributed models. The options you get with it are different and slightly more limited. You have these two flags:
-mergeTags will take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and map them down to a prefix-less IO label ("BRAND") and use that for training and classifying.
-iobTags can take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and maps them to IOB2.
In a sequence model, for any of the labeling schemes like IOB2, the labels are different classes. That is how these labeling schemes work. The special interpretation of "I-", "B-", etc. is left to the human observer and entity-level evaluation software. The included evaluation software will work with IOB1, IOB2, or prefixless IO encoding only.
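To illustrate that last point: grouping IOB2 labels into entities is a small post-processing step outside the classifier itself. A rough sketch in Python (not part of Stanford NER):

def iob2_spans(labels):
    """Group IOB2 labels into (entity_type, start, end) spans.
    A 'B-' starts a new entity; 'I-' continues one of the same type."""
    spans, current = [], None
    for i, label in enumerate(labels):
        if label.startswith("B-") or (label.startswith("I-") and
                                      (current is None or current[0] != label[2:])):
            if current:
                spans.append(current)
            current = (label[2:], i, i + 1)
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current = (current[0], current[1], i + 1)
        else:  # "O"
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

tokens = ["1997", "volvo", "wia64t", "highway", "tractor"]
labels = ["B-DATE", "B-BRAND", "B-MODEL", "B-TYPE", "I-TYPE"]
for etype, start, end in iob2_spans(labels):
    print(etype, tokens[start:end])
# DATE ['1997']
# BRAND ['volvo']
# MODEL ['wia64t']
# TYPE ['highway', 'tractor']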

How to Build a User Friendly Filter

Our application displays tons of valuable information to our users in a table. We have a filtering capability that is based on boolean/logic searches. Even after coaching, users still tend not to understand how to use the filters, because AND, OR, >, >= etc. are foreign to them. This filter is easy for programmers since it is easily translated into code. Any examples of how this can be made more user-friendly and less prone to error?
In the past, when I needed to solve this problem, I presented the users with a list of items (in one or more columns), and gave them a single text box to type text into. I would then match the text against the text in the columns, and collapse the list (removing records that do not match) as they type.
This approach reminds users of Google. Everyone knows how to Google.
If you don't like the idea of presenting a large list of all items initially, you can show an empty results pane first, and display results after a search is typed in.
Convert the operators to plain English text and ask users to select from them.
For example:
Show me all Books whose author is [text field] and the price is [less than/greater than] [text field]
[less than/greater than] is a dropdown list
[text field] is an input box
The resulting text, after the user has filled in all the fields, should read as plain, simple English.
E.g.: Show me all books whose author is Stephen King and the price is less than $10
I used this in an app of mine when I used to freelance and the users loved it.
Using some nifty UI programming you can give options to expand the filter to n levels.
In web applications, Telerik had a good idea with their grid; you should be able to do that in desktop applications too.
you can provide some preset filters for the most common queries to that table - if that's possible with the application you are using
you can provide a "count instead of display" mechanism so the user sees how many rows he/she will potentially retrieve
you can provide them a Wiki page with some examples online
you can give them a QBE tool
hope that helps
good luck MikeD
In my experience you are simply not going to get end users to understand the difference between AND and OR conditions. Therefore I build my filters so that ANDing or ORing is built in. In general, my logic is as follows:
Criteria for different fields are ANDed together to restrict results.
Multiple values for the same field are effectively ORed together and then ANDed onto the criteria for other fields. Within a single field's input I generally detect comma-separated lists (translated to IN ()), dash-separated ranges (translated to BETWEEN), wildcard values (translated to LIKE), and any combination (for example Customer ID: 1-10, 50, 52) – see the sketch at the end of this answer.
I find that most users intuitively understand this system.
Of course, from time to time a different interface with some degree of ORing is required and in those cases I generally have a section of the search user interface in a panel or group box labelled "Any of these is true".
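A rough sketch of that single-field parsing (illustrative only; real code should use parameterized queries and proper quoting rather than string interpolation, and the column names here are made up):

def field_condition(column, raw):
    """Turn one field's free-text input into a SQL-ish condition.

    Dash ranges -> BETWEEN, wildcards (*) -> LIKE, remaining
    comma-separated values -> IN (...); the pieces for one field
    are ORed together."""
    clauses, in_values = [], []
    for part in (p.strip() for p in raw.split(",") if p.strip()):
        if "-" in part:
            low, high = (x.strip() for x in part.split("-", 1))
            clauses.append(f"{column} BETWEEN {low} AND {high}")
        elif "*" in part:
            clauses.append(f"{column} LIKE '{part.replace('*', '%')}'")
        else:
            in_values.append(part)
    if in_values:
        clauses.append(f"{column} IN ({', '.join(in_values)})")
    return "(" + " OR ".join(clauses) + ")"

# criteria for different fields get ANDed together:
where = " AND ".join([
    field_condition("CustomerID", "1-10, 50, 52"),
    field_condition("Name", "Smi*"),
])
print(where)
# (CustomerID BETWEEN 1 AND 10 OR CustomerID IN (50, 52)) AND (Name LIKE 'Smi%')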
I have recently been working on this problem. My solution is to be more descriptive, to use words instead of symbols and to change the words where it allows for a more readable layout. To illustrate, imagine the filter expression:
Breed == "Spaniel" AND (Age == 2 OR Colour == "White")
Certain linear Query builders might write this:
( And/Or Field Operator Value
[ ] [Breed] [=] [Spaniel]
[1] [AND] [Age] [=] [2]
[1] [OR] [Colour] [=] [White]
Or a hierarchical one may display this as:
AND
[Breed] [Is Equal To] [Spaniel]
OR
[Age] [Is Equal To] [2]
[Colour] [Is Equal To] [White]
Both of which might be readable to a developer but not so readable to the layperson.
My solution is more like:
Show ALL records where
[Breed] [Is Equal To] [Spaniel]
Show ANY records where
[Age] [Is Equal To] [2]
[Colour] [Is Equal To] [White]
So borrowing from the hierarchical approach but changing the AND and OR to an ALL or ANY. This means it can be read from top to bottom a little more easily.
I think Django's built-in admin interface has a very intuitive UI for filters.
There's a simple screenshot in the docs but there's a lot more you can do, especially when filtering on dates.
You might want to take a closer look at Django's admin interface to see if you can apply some of their tricks to your case.
I would suggest something similar to the MS Access query generator. You may also want to have a good context-sensitive help system that will guide first-time users.
Theresa Neil illustrated several approaches for building complex rule interfaces (AKA predicate clauses) in the iTunes Solves the Nested Clause Dilemma post. Some good examples there. I really like the way Apple does it in iTunes (although I don't use iTunes).
