Does the MovieLens 100k dataset lack a validation set?

The MovieLens 100k data set provides five pairs of training and test sets for 5-fold cross-validation. However, I have learned that a validation set should be used before evaluating on the test set, in order to choose the optimal parameter values.
I assume that in the original split, the five "test sets" are actually validation sets. If that is true, then there is no test set on which model performance can be evaluated. So should I re-split the MovieLens data in order to perform a sound train-validate-test process?
Thanks!

You actually have two options for testing with the MovieLens 100k set.
First option:
The ratings are split into 5 folds, and each fold is divided into a base set and a test set.
The base sets are there to "train" your algorithms, and the test sets to test them. Because you have 5 different folds, you can run the training and testing process 5 times and collect statistics across the various splits.
Second option:
Every user in the 100k set has rated at least 20 movies, and the set also provides two splits, a and b. In each of these, exactly 10 ratings per user are held out as the test set and the remainder form the base set. You can therefore learn from a base set and then try to predict and compare against the corresponding test ratings.
Of course, since you have the complete data set, you can also create your own splits if you want!
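If you do decide to re-split, a minimal sketch of a simple random train/validation/test split might look like the following (assuming the raw u.data file, which is tab-separated as user, item, rating, timestamp; the 80/10/10 proportions are only an example):
import pandas as pd
# Load the full ratings file shipped with MovieLens 100k.
ratings = pd.read_csv("u.data", sep="\t", names=["user", "item", "rating", "timestamp"])
# Shuffle once for reproducibility, then cut into 80% / 10% / 10%.
ratings = ratings.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(ratings)
train = ratings.iloc[:int(0.8 * n)]
validation = ratings.iloc[int(0.8 * n):int(0.9 * n)]
test = ratings.iloc[int(0.9 * n):]
Tune hyperparameters on the validation frame and touch the test frame only once, for the final numbers.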

Related

VW contextual bandits: historical data and online learning

I'd like to test CB for an e-commerce task: personal offer recommendations (like "last chance to buy", "similar positions", "consumers recommend", "bestsellers", etc.). My task is to order them (a more relevant offer should appear higher in the list of recommendations).
So, there are 5 possible offers.
I have some historical data collected without using any model: context (user and web-session features), action id (one of my 5 offers), and reward (1 if the user clicked the offer, 0 if not). So I have N users and 5 offers with known rewards, for a total of 5*N rows in my historical data.
Ex:
1:1:1 | user_id:1 f1:... f2:...
2:-1:1 | user_id:1 f1:... f2:...
3:-1:1 | user_id:1 f1:... f2:...
This means that user 1 has seen 3 offers (1, 2, 3); the cost of offer 1 is 1 (the user didn't click it), and the user clicked offers 2 and 3 (negative cost -> positive reward). The probabilities are equal to 1, since all offers were shown and we know the rewards.
The overall goal is to increase CTR. I'd like to use this data to train a CB model and then improve it with exploration/exploitation policies. I set the probabilities equal to 1 in this data (is that right?). Next, I'd like to order the offers according to the predicted rewards.
Should I use warm start in VW CB for this? Will it work correctly with data collected without using CB? Maybe you can advise more relevant CB methods for this data and task?
Thanks a lot.
If there are only 5 possible offers and if you (as indicated) have data of the form "I have N users and 5 offers with known reward, totally 5*N rows in my historical data", then your historical data is supervised multilabel data and the warm-start functionality applies; make sure you use the cost-sensitive version to accommodate the multilabel aspect of your historical data (i.e., there can be more than one offer that would result in a click).
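For example, the three logged offers for user 1 above could be recast as a single cost-sensitive line (here cost 0 for a clicked offer and 1 for an unclicked one; the feature placeholders are copied from the question and purely illustrative):
1:1 2:0 3:0 | user_id:1 f1:... f2:...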
Will this work correctly with data collected without using CB?
Because the reward of every action is specified for every user in the data set, you only have to ensure that the sample of users is representative of the population you care about.
Maybe you can advise more relevant methods in CB for this data and task?
The first paragraph started with "if" because the more typical case is that 1) there are many possible offers and 2) users have historically seen only a few of them.
In that case, what you have is a combination of a degenerate logging policy and multiple rewards being revealed. If there are k possible actions but each user has only seen n <= k of them historically, then you could try making n lines for each user as you did. Theoretically this does not necessarily work, but in practice it might help.
Out of the box: change the data
If the data you have was collected by running an existing policy, an alternative is to start randomizing the decisions made by that system in order to collect a dataset that conforms to CB. For example, use your current system to pick the "best" action 96% of the time and one of the other 4 actions uniformly at random 4% of the time (i.e., 1% each), log the probability along with the reward (either 0.96 or 0.01, depending on whether the action was the one considered best), and then set up a proper CB-style training set for vw. With this you can also counterfactually estimate the value of both your current policy and the policy vw generates, and only switch to vw when it is winning.
The fastest way to implement the last paragraph is to just start using APS.
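To make the logging format concrete, a logged interaction under that 96%/1% scheme might look like the following action:cost:probability lines (the user ids and feature placeholders are illustrative):
2:-1:0.96 | user_id:7 f1:... f2:...
5:1:0.01 | user_id:8 f1:... f2:...
The first line records that the system's "best" action 2 was shown (probability 0.96) and clicked (cost -1); the second records an exploration pick of action 5 (probability 0.01) that was not clicked (cost 1).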

How to merge multiple datasets in long format with missing data in SPSS

I am working on a project compiling data from a longitudinal study with approximately 300 participants. The data I have been supplied with are divided into multiple SPSS files, all in long format. Each data set contains the longitudinal information for one test. The participants have been measured at 20 time points, so there is a maximum of 20 observations per participant.
I am to compile all these data sets into one SPSS file in wide format.
However, there are multiple challenges when combining these data sets:
Missing visits are not represented by a row in any of the data sets.
The response rate (i.e. the total number of rows) differs between measures.
There are multiple errors in the coding of visits (i.e. the "timepoint" variable is wrong in many cases), but I have the date of each visit.
To correct two of these challenges I have manually checked one of the databases, corrected erroneous "timepoint" values, and added the missing rows. This "mother database" is now of sound quality.
I was wondering if there is a way to merge the rest of the SPSS files with this one while meeting the following criteria (a rough sketch of this kind of merge follows the list):
Match by ID number.
Match by visit date, using the "mother database" as the definition of a visit (note that not all visits are available in the data sets I am trying to merge with the "mother database").
In case of missing data on a visit, add missing values.
In case of missing visit (i.e. no date), add missing values.
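Purely to make those criteria concrete (this is pandas, not SPSS syntax, and the file and variable names are hypothetical), the shape of the operation is an ID-by-date left join from the mother database:
import pandas as pd
mother = pd.read_spss("mother.sav")        # one row per ID per visit date, already cleaned
other = pd.read_spss("test_measure.sav")   # long-format file for one test
# Left join on ID and visit date: visits absent from `other` come back as rows of missing values.
merged = mother.merge(other, on=["id", "visit_date"], how="left")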

Cross-validation in Lenskit

I'm trying to understand how exactly cross-validation is performed in lenskit. The documentation says that, by default, the data are partitioned by user. Does that mean that, in each fold, none of the users in the test set has been used for training? Is this achieved through the "holdout" option? If so, does this option break the user-based partitioning and yield folds in which each user shows up in both the training and test sets?
Right now, my evaluation code looks something like this:
dataset crossfold("data") {
    source csvfile(sourceFile) {
        delimiter "\t"
        domain {
            minimum 0.0
            maximum 10.0
            precision 0.1
        }
    }
    // order RandomOrder
    holdoutFraction 0.1
}
I commented out the "order" option because, when using it, lenskit eval throws an error.
Cheers!!!
Each user appears in both the training and the test sets, no matter the holdout, holdoutFraction, or retain options.
However, for each test user (when using 5 partitions, 20% of the users), part of their ratings (the test ratings) are held out and placed in the test set. The remainder of their ratings are placed in the training set, along with all ratings from other users.
This simulates the common case of a recommender system: you have users, for whom some of their history is already known and can be used in model training, and you're trying to recommend or predict their future behavior.
The holdout, holdoutFraction, and retain options are different ways of deciding how many ratings are put in the test set. If you say holdout 5, then 5 ratings from each test user are put in the test set, and the rest are used for training. If you say holdoutFraction 0.2, then 20% are used for testing and 80% for training. If you say retain 5, then 5 are used for training and the rest are used for testing.
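As a rough illustration of how those three options carve up one test user's ratings (a plain-Python sketch, not LensKit's actual implementation):
import random
def split_test_user(ratings, holdout=None, holdout_fraction=None, retain=None):
    # Split one test user's ratings into (train, test) according to the crossfold option used.
    ratings = list(ratings)
    random.shuffle(ratings)
    if holdout is not None:               # holdout 5: 5 ratings go to the test set
        n_test = holdout
    elif holdout_fraction is not None:    # holdoutFraction 0.2: 20% go to the test set
        n_test = int(round(holdout_fraction * len(ratings)))
    else:                                 # retain 5: 5 ratings stay in training, the rest are tested
        n_test = len(ratings) - retain
    return ratings[n_test:], ratings[:n_test]
For example, split_test_user(user_ratings, holdout_fraction=0.1) keeps 90% of that user's ratings in training and tests on the remaining 10%; all ratings from non-test users go to training unchanged.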

How can I test randomly ordered data from Postgres?

I'm writing a REST API that returns products in JSON from a Postgres database.
I have written integration tests to test which products are returned and this works fine. A requirement has just been introduced to randomly order the products returned.
I've changed my tests to not rely on the order the results come back in. My problem is testing the new random requirement.
I plan on implementing this in the database with Postgres' RANDOM() function. If I were doing this "in code" I could stub the random number generator to always return the same value, but I'm not sure what to do in the database.
How can I test that my new random requirement is working?
I've found a way of doing what I need.
You can set the seed value for Postgres using SETSEED().
If you set the seed before you execute the query that uses RANDOM(), the results will come back in the same order every time.
SELECT SETSEED(0.5);
SELECT id, title FROM products ORDER BY RANDOM() LIMIT 2;
The seed value is reset after the SELECT query.
To test that the data comes back random we can change the seed value.
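A test along those lines might look roughly like this (using psycopg2 here; the connection string, table, and column names are assumptions):
import psycopg2
def fetch_order(conn, seed):
    # Re-seed the session, then fetch ids in the randomized order.
    with conn.cursor() as cur:
        cur.execute("SELECT SETSEED(%s);", (seed,))
        cur.execute("SELECT id FROM products ORDER BY RANDOM() LIMIT 10;")
        return [row[0] for row in cur.fetchall()]
conn = psycopg2.connect("dbname=shop")
baseline = fetch_order(conn, 0.5)
assert fetch_order(conn, 0.5) == baseline   # same seed -> same order
assert fetch_order(conn, 0.1) != baseline   # different seed -> almost surely a different order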
I don't want to test if Postgres' RANDOM() works, but that my code that uses it does.
That will depend on your definition of randomness. As a first try you could issue the same request twice and make sure that the same result set is returned, but in a different order. This of course assumes that your test data will not page or some such; if it does, your test will be more difficult, as you would probably have to retrieve all pages in order to verify anything.
On second thoughts paging would probably complicate the whole request, as it would require having the same randomness across several pages.
IMHO if you want to test randomness, you should find a query that returns only a few results; 2 would be the ideal number.
You then run the query a large number of times and count the occurrences of the different possible orderings. The counts will not all reach the same value, but the frequencies should converge toward 1/n, where n is the number of orderings. In fact, you do not care about the quality of the random generator; all you need is to be sure that you use it correctly. So you should only test that you get each possibility at least once over an appropriate number of runs.
I would use 100 runs if n <= 10 and n^2 runs if n > 10. For n = 10 and 100 runs, the probability that a given ordering never shows up is less than 3e-5. So run the test once, and run it again if it fails; that should be enough. Of course, if you want to reduce the risk of a false failure, simply increase the number of runs, but the tests will take longer.
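A sketch of that counting check (the fetch_order() helper is hypothetical and stands for whatever call returns the row ids in the order the API delivered them):
from collections import Counter
from math import factorial
def orderings_all_seen(fetch_order, runs=100):
    # Count how often each ordering occurs, then check that every possible ordering appeared.
    counts = Counter(tuple(fetch_order()) for _ in range(runs))
    n_rows = len(next(iter(counts)))
    return len(counts) == factorial(n_rows)
With a query returning 2 rows, orderings_all_seen(fetch_order) should virtually always return True after 100 runs if the randomized ordering is wired up correctly.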

Sorting and merging in Stata on categorical variables

I am in the process of merging two data sets together in Stata and came up with a potential concern.
I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both sets of data. HOWEVER, several of the categorical variables have more categories present in one data set than in the other. I have been careful to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and B, but data set A has only Red, Green and Blue whereas data set B has Red, Green, Blue, and Yellow).
If I were to sort each data set the same way and generate an id variable (gen id = _n) and merge on that, would I run into any problems?
There is no statistical question here, as this is purely about data management in Stata, so I too shall shortly vote for this to be migrated to Stack Overflow, where I would be one of those who might try to answer it, so I will do that now.
What you describe for generating identifiers is not how to think about merging data sets, regardless of any of the other details in your question.
Imagine any two data sets, and then in each data set generate an identifier based on the observation numbers, as you propose. Generating such similar identifiers does not create a genuine merge key. You might as well say that the four values "Alan" "Bill" "Christopher" "David" in one data set can be merged with "William" "Xavier" "Yulia" "Zach" in another data set because both can be labelled with observation numbers 1 to 4.
My advice is threefold:
Try what you are proposing with your data and try to understand the results.
Consider whether you have something else altogether, namely an append problem. It is quite common to confuse the two.
If both of those fail, come back with a real problem and real code and real results for a small sample, rather than abstract worries.
I think I may have solved my problem - I figured I would post an answer specifically relating to the problem in case anybody has the same issue.
~~
I have two data sets: One containing information about the amount of time IT help spent at a customer and another data set with how much product a customer purchased. Both data sets contain unique ID numbers for each company and the fiscal quarter and year that link the sets together (e.g. ID# 1001 corresponds to the same company in both data sets). Additionally, the IT data set contains unique ID numbers for each IT person and the customer purchases data set contains a unique ID number for each purchase made. I am not interested in analysis at the individual employee level, so I collapsed the IT time data set to the total sum of time spent at a given company regardless of who was there.
I was interested in merging both data sets so that I could perform analysis to estimate some sort of "responsiveness" (or elasticity) function linking together IT time spent and products purchased.
I am certain this is a case of "merging" data because I want to add more VARIABLES, not OBSERVATIONS; that is, I wish to widen my final data set horizontally rather than lengthen it vertically.
Stata 12 has many options for merging: one-to-one, many-to-one, and one-to-many. Supposing that I treat my collapsed IT time data set as my master and my purchases data set as the using (merging) data set, I would perform a "1:m" (one-to-many) merge, because many purchase observations correspond to each single IT observation per quarter per company. (If the purchases data set were the master instead, the same merge would be written "m:1".)
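As a rough sketch of that shape of merge (shown in pandas rather than Stata, with hypothetical file and variable names):
import pandas as pd
it_time = pd.read_csv("it_time.csv")      # already collapsed: one row per company_id, quarter, year
purchases = pd.read_csv("purchases.csv")  # many purchase rows per company_id, quarter, year
# One-to-many: the key is unique in it_time and repeated in purchases.
merged = it_time.merge(purchases, on=["company_id", "quarter", "year"], how="left", validate="1:m")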
