How can I make randomization automatic on REDCap? - survey

I am running a survey on REDCap where participants need to be assigned to one of three groups before receiving a group-specific intervention to reduce their smartphone use (e.g., 1 - intervention 1, 2 - intervention 2, 3 - intervention 3). I have tried using the randomization module, but it requires each record to be allocated manually. For this specific study that is a problem, because we want to collect data from hundreds of people completing the study all over the world, meaning that I cannot be at the computer at all times manually randomizing people and entering their records.
Is there a way to set up the randomization (or any other method) so that participants are randomly assigned to one of the three groups?

As you noted, randomisation in REDCap must be executed by a user with sufficient rights to do so, and ordinarily cannot be automated. But there are other options.
Realtime Randomization
You should reach out to your local REDCap administrators, as they may be amenable to installing the Realtime Randomization External Module, which may provide the functionality you want. This will (I think) automate the execution of the randomize button when a form is completed. Whether it works on surveys I don't fully know. Assuming it does, this is advantageous as it will use the pre-defined randomisation allocation table that you generate outside REDCap, possibly with the help of a statistician. This is preferred if you need real randomisation.
Pseudo-randomisation
If you don't need a pre-defined randomisation allocation table, and can get by with each successive participant being allocated to a different group (record 1 -> intervention 1, record 2 -> intervention 2, record 3 -> intervention 3, record 4 -> intervention 1, etc.), which is in fact not random at all but sort of gated, then you can use the record ID in a calculated field to determine which of the three interventions a record should be allocated to. To do this, return the modulo of the record ID by 3:
[record-name] - (rounddown([record-name]/3) * 3)
This will return 1, 2 and 0 for record IDs 1, 2 and 3, respectively, and for 4, 5 and 6, respectively, and so on ad infinitum.
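If you want to sanity-check that rule outside REDCap, here is a minimal Python sketch of the same allocation (the group labels are just for illustration):

# same modulo rule as the calculated field, applied to records 1..6
def allocate(record_id):
    groups = {1: "intervention 1", 2: "intervention 2", 0: "intervention 3"}
    return groups[record_id % 3]

for record_id in range(1, 7):
    print(record_id, "->", allocate(record_id))
# 1 -> intervention 1, 2 -> intervention 2, 3 -> intervention 3,
# 4 -> intervention 1, and so on.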
Then, from this value, you can use standard branching logic to display different fields, direct respondents to different surveys using logic in the survey queue, invite them to specific instruments using logic in automated survey invitations, fire off different alerts with instructions for each intervention group, etc.

Related

How to measure flexibility using entropy or other methods?

E.g., I want to measure how flexible customers are across different brands of goods by looking at their search behavior over the past months.
Users could have the search logs below:
User 1: Brand A
User 2: Brand A A A A
User 3: Brand A A B A
User 4: Brand A A B B
So it is clear that users 1 and 2 are less flexible than users 3 and 4, as they tend to buy from only one brand.
And user 2 is less flexible than user 1, since user 2 has multiple searches demonstrating inflexibility, while for user 1 we don't have the same amount of confidence.
So intuitively the degree of flexibility is 2 < 1 < 3 < 4.
Initially I wanted to use entropy, but it can't distinguish users 1 and 2, as the entropy of both is 0.
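A quick Python sketch confirming that claim on the logs above (Shannon entropy of each user's brand counts):

from collections import Counter
from math import log2

def entropy(searches):
    # Shannon entropy of the user's empirical brand distribution
    counts = Counter(searches)
    n = sum(counts.values())
    return sum(-(c / n) * log2(c / n) for c in counts.values())

logs = {"user 1": "A", "user 2": "AAAA", "user 3": "AABA", "user 4": "AABB"}
for user, brands in logs.items():
    print(user, round(entropy(brands), 3))
# users 1 and 2 both score 0.0; user 3 scores 0.811 and user 4 scores 1.0,
# and the number of searches never enters the calculation.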
Do you know of any method to calculate the flexibility of each user as one numeric value that also takes the frequency of each unique value into account? Really appreciate it.

VW contextual bandits: historical data and online learning

I'd like to test CB for an e-commerce task: personal offer recommendations (like "last chance to buy", "similar positions", "consumers recommend", "bestsellers", etc.). My task is to order them (more relevant offers go higher in the list of recommendations).
So, there are 5 possible offers.
I have some historical data collected without using any model: context (user and web-session features), action id (one of my 5 offers), and reward (1 if the user clicked this offer, 0 if not). So I have N users and 5 offers with known rewards, 5*N rows in total in my historical data.
Ex:
1:1:1 | user_id:1 f1:... f2:...
2:-1:1 | user_id:1 f1:... f2:...
3:-1:1 | user_id:1 f1:... f2:...
This means that user 1 has seen 3 offers (1, 2, 3), the cost of offer 1 is 1 (the user didn't click), and the user clicked offers 2 and 3 (negative cost -> positive reward). The probabilities are all equal to 1, since every offer was shown and we know its reward.
Global task is to increase CTR. I'd like to use this data for training CB and then improve the model by exploration/exploitation policies. I set probabilities equal to 1 in this data (Is it right?). But next I'd like to set the order of offers according to rewards.
Should I use warm start in VW CB for this? Will this work correctly with data collected without using CB? Maybe you can advise more relevant methods in CB for this data and task?
Thanks a lot.
If there are only 5 possible offers and if you (as indicated) have data of the form "I have N users and 5 offers with known reward, totally 5*N rows in my historical data." then your historical data is supervised multilabel data and the warm-start functionality would apply; make sure you use the cost-sensitive version to accommodate the multilabel aspect of your historical data (i.e., there is more than one offer that would result in a click).
Will this work correctly with data collected without using CB?
Because every action's reward is specified for every user in the data set, you only have to ensure that the sample of users is representative of the population you care about.
Maybe you can advise more relevant methods in CB for this data and task?
The first paragraph started with "if" because the more typical case is 1) there are many possible offers and 2) users have only seen a few of them historically.
In such a case, what you have is a combination of a degenerate logging policy and multiple rewards being revealed. If there are k possible actions but each user has only seen n <= k of them historically, then you could try making n lines for each user, as you did. Theoretically this does not necessarily work, but in practice it might help.
Out of the box: change the data
If the data you have was collected as the result of running an existing policy, then an alternative would be to start randomizing the decisions made by that system in order to collect a dataset that conforms to CB. For example, use your current system to pick the "best" action 96% of the time, and one of the other 4 actions at random 4% of the time, and log the probability along with the reward (either 0.96 or 0.01, depending upon whether the action was the one considered best). Then set up a proper CB-style training set for vw. With this you can also counterfactually estimate the value of both your current policy and the policy vw generates, and only switch to vw when it is winning.
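A minimal Python sketch of that logging scheme (the function and variable names are made up; the log lines themselves would still need to be written in vw's cost:probability format shown earlier):

import random

OFFERS = [1, 2, 3, 4, 5]

def choose_offer(best_offer, epsilon=0.04):
    # exploit: show the current system's "best" offer 96% of the time
    if random.random() >= epsilon:
        return best_offer, 0.96
    # explore: show one of the other 4 offers uniformly (1% each)
    other = random.choice([o for o in OFFERS if o != best_offer])
    return other, 0.01

offer, prob = choose_offer(best_offer=2)
# show `offer`, observe the click (reward), then log
# (context, offer, cost, prob) so vw can train a proper CB policy.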
The fastest way to implement the last paragraph is to just start using APS.

Using scoring to find customers

I have a site where customers purchase items that are tagged with a variety of taxonomy terms. I want to create a group of customers who might be interested in the same items by considering the tags associated with purchases they've made. Rather than comparing a list of tags for each customer each time I want to build the group, I'm wondering if I can use some type of scoring to solve the problem.
The way I'm thinking about it, each tag would have some unique number assigned to it. When I perform a scoring operation it would render a number that could only be achieved by combining a specific set of tags.
I could update a customer's "score" periodically so that it remains relevant.
Am I on the right track? Any ideas?
Your description of the problem looks much more like a clustering or recommendation problem. I am not sure whether those tags are enough information for clustering or recommendation, though.
Your idea of the score doesn't look promising to me, because the same sum could be achieved in several ways if those numbers aren't chosen carefully enough.
What I would suggest:
You can store tags for each user. When a user purchases a new item, you add the item's tags to the user's tags. Periodically, you update the user profiles. Let's say we have users A and B. If at the time of the update the similarity between A and B is greater than some threshold, you add a relation between the users, indicating that the two users are similar. If it's lower, you remove the relation (if they were previously related). The similarity could be either the number of common tags, or num_common_tags / num_of_tags_assigned_either_in_A_or_B (the Jaccard similarity).
Later, when you want to get users with a particular set of tags, you just run a query that checks which users have that set of tags. You can also find users similar to a given user by looking up which users are linked to the user in question.
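A minimal sketch of that update step, assuming tag profiles are kept as Python sets (all names here are illustrative):

def jaccard(a, b):
    # num_common_tags / num_of_tags_assigned_either_in_A_or_B
    return len(a & b) / len(a | b) if (a | b) else 0.0

def update_relations(profiles, threshold=0.5):
    # profiles: {user: set of tags}; returns the pairs to keep related
    users = sorted(profiles)
    return {(a, b)
            for i, a in enumerate(users) for b in users[i + 1:]
            if jaccard(profiles[a], profiles[b]) >= threshold}

profiles = {"A": {"red", "blue"}, "B": {"red", "blue", "green"}, "C": {"yellow"}}
print(update_relations(profiles))  # {('A', 'B')}: similarity 2/3 passes 0.5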
If you assign a unique power of two to each tag, then you can sum the values corresponding to the tags, and users with the exact same sets of tags will get identical values.
red = 1
green = 2
blue = 4
yellow = 8
For example, only customers who have the set of { red, blue } will have a value of 5.
This is essentially using a bitmap to represent a set. The drawback is that if you have many tags, you'll quickly run out of integers. For example, if your (unsigned) integer type is four bytes, you'd be limited to 32 tags. There are libraries and classes that let you represent much larger bitsets, but, at that point, it's probably worth considering other approaches.
Another problem with this approach is that it doesn't help you cluster members that are similar but not identical.
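A minimal sketch of the bitmap idea (tag values as above; note that Python integers are arbitrary precision, so the 32-tag ceiling applies to fixed-width integer columns such as a database field):

# each tag gets a unique power of two, i.e. its own bit
TAG_VALUES = {"red": 1, "green": 2, "blue": 4, "yellow": 8}

def tag_score(tags):
    score = 0
    for tag in tags:
        score |= TAG_VALUES[tag]  # OR-ing equals summing when tags are unique
    return score

print(tag_score({"red", "blue"}))      # 5: only the set {red, blue} yields 5
print(tag_score({"green", "yellow"}))  # 10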

How to manage multiple positive implicit feedbacks?

When there are no ratings, a common scenario is to use implicit feedback (items bought, pageviews, clicks, ...) to generate recommendations. I'm using a model-based approach, and I am wondering how to deal with multiple identical feedback events.
As an example, let's imagine that consumers buy items more than once. Should I treat the number of feedback events (pageviews, items bought, ...) as a rating, or compute a custom value?
To model implicit feedback, we usually have a mapping procedure that maps implicit user feedback into explicit ratings. I guess in most domains, repeated user actions on the same item indicate that the user's preference for the item is increasing.
This is certainly true if the domain is music or video recommendation. In a shopping site, such a behavior might indicate the item is consumed periodically, e.g., diapers or printer ink.
One way I am aware of to model this repeated implicit feedback is to create a numeric rating mapping function. As the number of occurrences k of the implicit feedback increases, the mapped rating should increase. At k = 1, you have a minimal positive-feedback rating, for example 0.6; as k increases, it approaches 1. Of course, you don't need to map to [0, 1]; you can also use integer ratings 0, 1, 2, 3, 4, 5.
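One possible shape for such a mapping, as a sketch (the 0.6 floor comes from the example above; the exponential form and the decay rate are just assumptions):

from math import exp

def implicit_rating(k, floor=0.6, decay=0.5):
    # k = 1 maps to the minimal positive rating (0.6);
    # the rating approaches 1 as k grows
    return 1.0 - (1.0 - floor) * exp(-decay * (k - 1))

for k in (1, 2, 5, 10):
    print(k, round(implicit_rating(k), 3))
# 1 -> 0.6, 2 -> 0.757, 5 -> 0.946, 10 -> 0.996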
To give you a concrete example of the mapping, here is what they did in a music recommendation domain. In short, they used per-user item statistics to define the mapping function.
We assume that the more times the user has listened to an artist, the more the user likes that particular artist. Note that a user's listening habits usually present a power-law distribution, meaning that a few artists have lots of plays in the user's profile, while the rest of the artists have significantly fewer play counts. Therefore, we compute the complementary cumulative distribution of artist plays in the user's profile. Artists located in the top 80-100% of the distribution are assigned a score of 5, while artists in the 60-80% range are assigned a score of 4.
Another way I have seen in the literature is to create another variable besides the binary rating variable, called a confidence level. See here for details.
Probably not that helpful for OP any longer, but it might be for others in the same boat.
Evaluating Various Implicit Factors in E-commerce
Modelling User Preferences from Implicit Preference Indicators via Compensational Aggregations
If anyone knows more papers/methods, please share as I'm currently looking for state of the art approaches to this problem. Thanks in advance.
You typically use a sum of clicks, or some weighted sum of events, as a "score" for each user-item pair in implicit feedback systems. It's not a rating, and that's more than a semantic distinction: you won't get good results if you feed these values into a process that expects rating-like values and tries to minimize a squared-error loss.
You treat 3 clicks as adding 3 times the value of 1 click to the user-item interaction strength. Other events, like a purchase, might be weighted much more highly than a click. But in the end it also adds to a sum.
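A minimal sketch of such a weighted event sum (the weights are illustrative assumptions, not established values):

from collections import defaultdict

# illustrative weights: a purchase counts for much more than a click
EVENT_WEIGHTS = {"pageview": 0.5, "click": 1.0, "purchase": 10.0}

def interaction_strengths(events):
    # events: (user, item, event_type) tuples
    scores = defaultdict(float)
    for user, item, event_type in events:
        scores[(user, item)] += EVENT_WEIGHTS[event_type]
    return dict(scores)

events = [("u1", "i1", "click")] * 3 + [("u1", "i2", "purchase")]
print(interaction_strengths(events))
# {('u1', 'i1'): 3.0, ('u1', 'i2'): 10.0} -- 3 clicks add 3x one click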

Complex Combinatorial Algorithms

So Wendy's advertises their sandwich as having 256 combinations, meaning there are 8 ingredients you can either have or not have (although I wonder why they would count the combination where you include nothing as valid, but I digress).
A generalized approach allows you to multiply the various states of each selection together, which allows more complex combinations. In this case Wendy's items can only be included or excluded. But some sandwiches might have the option of two kinds of mustard (but not both, to save costs).
These are fairly straightforward. You multiply the number of options together, so for Wendy's it's:
2*2*2*2*2*2*2*2 = 256
If they diversified their mustard selection as above it would be:
2*2*3*2*2*2*2*2 = 384
Going further appears to be harder.
If you make sesame seeds a separate item, then they require the bun item. You can have the sesame seed only if you include the bun, and you can have the bun without sesame seeds, but you cannot have sesame seeds without the bun. This can be simplified to a single bun item with three states (none, bun with seeds, bun without) but there are situations where that cannot be done.
Dell's computer configurator, for instance, disallows certain combinations (maybe the slots are all full, items are incompatible when put into same system, etc).
What are the appropriate combinatorial approaches when dealing with significantly more complex systems where items can conflict?
What are good, generalized, approaches to storing such information without having to code for each product/combination/item to catch conflicts?
Is there a simple way to say, "there are X ways to configure your system/sandwich" when the system has to deal with complex conflicting combinations?
HP's high-end server manufacturing facility in California used a custom rule-based system for many years to do just this.
The factory shopfloor build-cycle process included up-front checks to ensure the order was buildable prior to releasing it to the builders and testers.
One of these checks determined whether the order's bill of materials (BOM) conformed to a list of rules specified by the process engineers. For example, if the customer orders processors, ensure they have also ordered sufficient dc-converter parts; or, if they have ordered a certain quantity of memory DIMMs, ensure they have also ordered a daughter-board to accommodate the additional capacity.
A computer science student with a background in compilers would have recognized the code. The code parsed the BOM, internally generating a threaded tree of parts grouped by type. It then applied the rules to the internal tree to make the determination of whether the order conformed.
As a side-effect, the system also generated build documentation for each order which workers pulled up as they built each system. It also generated expected test results for the post-build burn-in process so the testing bays could reference them and determine whether everything was built correctly.
Adam Davis: If I understand correctly, you intend to develop some sort of system that could in effect be used for shopping carts that assist users in purchasing compatible parts.
Problem definition
Well, this is a graph problem (aren't they all?): you have items that are compatible with other items. For example, the Pentium i3-2020 is compatible with any Socket 1155 motherboard; the Asrock H61M-VS is a Socket 1155 motherboard, which is compatible with 2x DDR3 (speed = 1066) and requires a PCI-Express GPU, DDR3 PC RAM {Total(size) <= 16GB}, 4-pin ATX 12V power, etc.
You need to be able to (a) identify whether each item in the basket is satisfied by another item in the basket (i.e. the RAM card has a compatible motherboard), (b) assign the most appropriate items (i.e. assign the USB hub to the motherboard's USB port and the printer to the USB hub if the motherboard runs out of USB ports, rather than doing it the other way around and leaving the hub dry), and (c) provide the user with a function to find a list of satisfying components. Perhaps USB hubs can always take precedence as they are extensions (but do be aware of it).
Data structures you will need
You will need a simple classification system, i.e. H61M-VS is-a Motherboard, H61M-VS has-a DDR3 memory slot (with speed properties for each slot).
Second to classification and composition, you will need to identify requirements, which is quite simple. Now the simple classification can allow a simple SQL query to find all items that fit a classification.
Testing for a satisfying basket
To test the basket, a configuration needs to be created, identifying which items are matched with which (i.e. the motherboard's DDR3 slot matches the 4GB RAM module, the SATA HDD cable connects to the motherboard's SATA port and the PSU's SATA power cable, while the PSU's 4-pin ATX 12V power cable connects to the motherboard).
The simplest thing is just to check whether another satisfying item exists.
Dell's Computer Configurator
You begin with one item, say a processor. The processor requires a motherboard and a fan, so you can give the user a choice of motherboard (adding the processor fan to list_of_things_to_be_satisfied). This continues until there are no more items held in list_of_things_to_be_satisfied. Of course this all depends on your exact requirements and knowing what problem(s) you will solve for the user.
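A minimal Python sketch of that loop, with a made-up requirements table (a real configurator would also need cycle and duplicate handling):

# hypothetical requirements table: part type -> the part types it needs
REQUIRES = {
    "processor": ["motherboard", "fan"],
    "motherboard": [],
    "fan": [],
}

def configure(start, choose):
    # choose(part_type) stands in for the user picking a concrete item
    chosen, to_satisfy = [], [start]
    while to_satisfy:                       # stop when nothing is unsatisfied
        part_type = to_satisfy.pop()
        chosen.append(choose(part_type))
        to_satisfy.extend(REQUIRES[part_type])
    return chosen

print(configure("processor", choose=lambda t: "chosen " + t))
# ['chosen processor', 'chosen fan', 'chosen motherboard']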
There are many ways you can implement this in code, but here is, in my humble opinion, the best way to go about solving the problem before programming anything:
Define Parts & Products (Pre-code)
When defining all the "parts" it will be paramount to identify hierarchy and categorization for the parts. This is true because some rules may be exclusive to a unique part (ex. "brown mustard only"), some categorical (ex. "all mustards"), some by type (ex. "all condiments"), etc.
Build Rule Sets (Pre-code)
Define the rule sets (prerequisites, exclusions, etc.) for each unique part, category, type, and finished product.
It may sound silly, but a lot of care must be taken to ensure the rules are defined with an appropriate scope. For example, if the finished product is a Burger:
Unique Item Rule - "Mushrooms only available with Blue Cheese selected" prerequisite
Categorical Rule - "Only 1 mustard may be selected" exclusive
Type Rule - "Pickles are incompatible with Peppers" exclusive
After having spent so much time on unique/category/type rules for "parts", many designers will overlook rules that apply only to the finished product even when the parts have no conflict.
Product Rule - "Maximum 5 condiments" condition
Product Rule - "Burger must have a bun" prerequisite
This graph of rules can quickly grow very complex.
Suggestions for Building Data Structures (Code)
Ensure your structures accommodate hierarchy and categorization. For example: "brown mustard" and "dijon mustard" are individual objects, and they are both mustards, and both condiments.
Carefully select the right combination of inheritance modeling (base classes) and object attributes (ex. Category property, or HasCondiments flag) to make this work.
Make a private field for RuleSets at each hierarchic object level.
Make public properties for a HasConflicts flag, and a RuleViolations collection.
When a part is added to a product, check against all levels of rules (its own, category, type, and product) -- do this via a public function that can be called from the product. Or for better internalization, you can make an event handler on the part itself.
Write your algorithms (Code)
This is where I suck, and good thing as it is sort of beyond the scope of your question.
The trick with this step will be how to implement in code the rules that travel up the tree/graph -- for example, when a specific part has an issue with another part outside its scope, or how its validation gets run when another part is added. My thoughts:
Use a public function methodology for each part. Pass it the product's CurrentParts collection.
On the Product object, have handlers defined to handle OnPartAdded and OnPartRemoved, and have them enumerate the CurrentParts collection and call each part's validation function.
Example Bare-bones Prototype
using System.Collections.Generic;

// interface for products
interface IProduct
{
    void AddPart(Part p);
    void OnPartAdded();
}

// interface for parts
interface IPart
{
    // "object" should be replaced with whatever way you implement your rules
    object RuleSet { get; }
    void ValidateRules(ICollection<Part> otherParts);
}

// base class for products
public class Product : IProduct
{
    // no setter; write functions as you like to add/remove parts.
    public ICollection<Part> CurrentParts { get; } = new List<Part>();

    // Add part function adds to the collection and triggers a handler.
    public void AddPart(Part p)
    {
        CurrentParts.Add(p);
        OnPartAdded();
    }

    // handler for adding a part triggers the part validations
    public void OnPartAdded()
    {
        // validate part-scope rules; you'll want to return some message/exception
        foreach (var part in CurrentParts)
        {
            part.ValidateRules(CurrentParts);
        }
        ValidateProductRules(); // validate Product-scope rules.
    }

    private void ValidateProductRules()
    {
        // insert your algorithms here for validating Product-scope rules
        // (e.g. "maximum 5 condiments", "burger must have a bun").
    }
}

// base class for parts
public class Part : IPart
{
    public object RuleSet { get; } // see note in the interface.

    public void ValidateRules(ICollection<Part> otherParts)
    {
        // insert your algorithms here for validating
        // the product parts against this part's rule set.
    }
}
Nice and clean.
As a programmer I would do the following (although I have never actually ever had to do this in real life):
1. Work out the total number of combinations. Usually a straight multiplication of the options, as stated in your question, will suffice. There's no need to store all these combinations.
2. Then subtract the exceptions from your total. The exceptions can be stored as just a set of rules, effectively saying which combinations are not allowed.
3. To work out the total number of allowable combinations, you will have to run through the entire set of exception rules.
If you think of all your combinations as a set, then the exceptions just remove members of that set. But you don't need to store the entire set, just the exceptions, since you can calculate the size of the set quite easily.
"Generating Functions" comes to mind as one construct that can be used when solving this type of problem. I'd note that there are several different generating functions depending on what you want.
In North America, car license plates can be an interesting combinatorial problem: counting all the permutations where there are 36 possible values for each of the 6 or 7 positions, the length depending on where one is getting a plate. However, some combinations are disqualified for containing swear words or racist terms, which makes for a slightly harder problem. For example, there is an infamous N-word with at least a couple of different spellings that I'd think wouldn't be allowed on license plates.
Another example would be determining all the different orderings of words over a given alphabet that contains some items repeated multiple times. For example, if one wanted all the different ways to arrange the letters of, say, the word "letter", it isn't just 6!, which would be the case for "abcdef", because there are 2 pairs of repeated letters that make it slightly trickier to compute.
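Concretely, dividing out the two repeated pairs (the two e's and the two t's): $\frac{6!}{2!\,2!} = \frac{720}{4} = 180$ distinct arrangements.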
L33t can be another way to bring in more complexity when identifying inappropriate words: while a-s-s gets censored, a$$ or #ss may not necessarily be treated the same way, even though it is basically the same term expressed differently. I'm not sure whether special characters like $ or # would appear on license plates, but one could think of parental controls on web content as needing these kinds of algorithms to identify which terms to censor.
You'd probably want to create a data structure that represents an individual configuration uniquely. Then each compatibility rule should be defined in a way that can generate the set of all individual configurations that fail that rule. Then you would take the union of all the sets generated by all the rules to get the set of all configurations that fail the rules. Then you count the size of that set and subtract it from the size of the set of all possible configurations.
The hard part is defining the data structure in a way that can be generated by your rules and can have the set operations work on it! That's an exercise for the reader, AKA I've got nothing.
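For small systems, though, brute force works; here is a minimal sketch using the sandwich rules from the question (item and rule names are illustrative):

from itertools import product

# four yes/no items (a simplified sandwich)
ITEMS = ["bun", "sesame", "pickles", "peppers"]

# each rule returns True when a configuration violates it
RULES = [
    lambda c: c["sesame"] and not c["bun"],   # sesame seeds require the bun
    lambda c: c["pickles"] and c["peppers"],  # pickles conflict with peppers
]

configs = [dict(zip(ITEMS, bits)) for bits in product([False, True], repeat=4)]
failing = [c for c in configs if any(rule(c) for rule in RULES)]
print(len(configs), len(failing), len(configs) - len(failing))
# 16 total, 7 failing (the union counts the overlap once), 9 valid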
The only thing I can think of right now is this: if you can build a tree that defines the dependencies between the parts, you have a simple solution.
sandwich
|
|__Bun(2)__sesame(1)
|
|__Mustard(3)
|
|__Mayo(2)
|
|__Ketchup(2)
|
|__Olives(3)
This simply says that you have 2 options for the bun (bun or no bun) and 1 type of sesame (available only if you have a bun, signifying the dependency; a 7 here would mean 7 types that can exist only when you have a bun), 3 for the mustard, etc.
Then simply multiply together the totals of all the top-level branches.
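A minimal recursive sketch of that multiplication, under one reading of the diagram (each node stores its variant count excluding "none", and children are available only when the parent is present):

def count_states(variants, children=()):
    # one "absent" state, plus each variant multiplied by the states
    # of the children that depend on this item being present
    present = variants
    for child in children:
        present *= count_states(*child)
    return 1 + present

# the sandwich tree above, with each diagram number read as states incl. "none"
branches = [
    (1, ((1, ()),)),  # bun, sesame depends on it: none / bun / bun+seeds = 3
    (2, ()),          # Mustard(3) = 2 mustards + "none"
    (1, ()),          # Mayo(2) = mayo or no mayo
    (1, ()),          # Ketchup(2)
    (2, ()),          # Olives(3)
]
total = 1
for branch in branches:
    total *= count_states(*branch)
print(total)  # 3 * 3 * 2 * 2 * 3 = 108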
It is probably possible to formalize the problem as a k-SAT problem. In some cases the problem will be NP-complete and you will have to enumerate all the possibilities to check whether they satisfy all the conditions. In other cases the problem will be easily solvable (when few conditions are required, for instance). This is an active field of research; you will find relevant references on Google Scholar.
In the case of the mustard, you would add a binary variable mustard_type for the mustard type and introduce the condition not (not mustard and mustard_type), where mustard is the binary variable for mustard. This imposes the default choice mustard_type == 0 when you choose not mustard.
For the sesame choice, this is more explicit: not (sesame and not bun).
It thus seems that the cases you propose fall into the 2-sat family of problems.
