brms: how do I set up a model with multiple categorical variables, so that all levels are present and none are baked into the general intercept?

E.g. suppose we have predictors:
gender (2 levels)
educ (educational level, 3 levels)
Doing
brm(somey ~ 0 + gender)
seems to allow me to have both levels of gender present as coefficients. However, modifying this to:
brm(somey ~ 0 + gender + educ)
still gives the 2 gender levels individually, but now one level of educ is baked into the intercept. I'm not sure this is desirable. I wonder how to have all 3 levels individually present and none baked into the intercept.
Doing
brm(somey ~ 0 + gender + 0 + educ)
does not help.
Or put otherwise:
Can I have
\beta_1 x_1 + \beta_2 x_2
rather than
\beta_0 + \beta_1 x_1 + \beta_2 x_2
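For reference, the dropping behaviour itself can be reproduced outside brms: if both factors are fully dummy-coded and there is no intercept, the columns are linearly dependent, so any full-rank parameterisation has to absorb one level somewhere. A minimal pandas/NumPy sketch (made-up data, just to show the rank deficiency):

import numpy as np
import pandas as pd

# made-up data with the two categorical predictors
df = pd.DataFrame({
    "gender": ["m", "f", "f", "m", "f", "m"],
    "educ":   ["low", "mid", "high", "high", "low", "mid"],
})

# full dummy coding for both factors, no intercept column
X = pd.get_dummies(df[["gender", "educ"]]).to_numpy(dtype=float)

# 2 gender columns + 3 educ columns = 5 columns, but the rank is only 4:
# gender_f + gender_m == educ_high + educ_low + educ_mid == 1 in every row,
# so the columns are linearly dependent and one level has to be dropped.
print(X.shape[1], np.linalg.matrix_rank(X))   # 5 4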

Related

How do binary heaps determine their parent if it involves a half (i.e. 1.5)?

I am learning about binary heaps and I understand that to get the left child the formula is (2 x index) + 1 and to get the right child it is (2 x index) + 2. That makes sense to me, but getting the parent node is what I don't really understand. I know the formula for that is (index - 1) / 2, but how does it work if it returns a half (.5)?
For example, if I am trying to find the parent node of index 3, that would be (3 - 1) / 2, which gives you 1, so that makes sense. But what about index 4? That would be (4 - 1) / 2, which gives you 1.5. So would the parent of index 4 be index 1 or index 2? Looking at a diagram it makes sense obviously, because index 4 is tied to index 1, but from a mathematics standpoint I'm just not sure how to handle it when halves come into play.
Thank you!
It depends on the language in which it is implemented. In some languages / represents an integer division when the operands are integers, which is what you need. For instance, this is the case in Java where 3/2 == 1.
If it is not, then indeed you need to truncate the result to an integer. Python for example, has the integer division operator, //, so that 3//2 == 1.
Alternatively, in many of those languages you can use a shift operator: >>1 corresponds to an integer division by 2.
So here are some equivalent ways to do it when / does not truncate the result to an integer:
// Using bit shift
parentIndex = (childIndex - 1) >> 1
// Using ternary operator based on check whether number is odd
parentIndex = childIndex % 2 > 0 ? (childIndex - 1) / 2 : (childIndex - 2) / 2
// Use remainder of two to get to an even number that can be divided by 2
parentIndex = (childIndex - 2 + childIndex % 2) / 2
// Same thing, but the subtraction of 2 is taken out of the numerator
parentIndex = (childIndex + childIndex % 2) / 2 - 1
// Or doing the explicit truncation
parentIndex = floor((childIndex - 1) / 2)
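For instance, in Python (where // already does the flooring integer division) the whole thing boils down to:

def parent(i):
    # floor division truncates the .5, so (4 - 1) // 2 == 1
    return (i - 1) // 2

def left(i):
    return 2 * i + 1

def right(i):
    return 2 * i + 2

print(parent(3), parent(4))   # 1 1  -> indices 3 and 4 are both children of index 1
print(left(1), right(1))      # 3 4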

Math function with three variables (correlation)

I want to analyse some data in order to program a pricing algorithm.
The following data are available:
I need a function/correlation factor of the three variables/dimensions that shows how the median (price) changes as the three dimensions (pers_capacity, number of bedrooms, number of bathrooms) grow.
e.g. Y(#pers_capacity,bedroom,bathroom) = ..
note:
- the screenshot below does not contain all the data (just a part of it)
- median => price per night
- yellow => #bathroom
e.g. for 2 persons, 2 bedrooms and 1 bathroom, the median price is $187 per night
Do you have any ideas on how I can calculate the correlation/equation (f(..)=...) in order to get a reliable factor?
Kind regards
One typical approach would be formulating this as a linear model. Given three variables x, y and z which explain your observed values v, you assume v ≈ ax + by + cz + d and try to find a, b, c and d which match this as closely as possible, minimizing the squared error. This is called a linear least squares approximation. You can also refer to this Math SE post for one example of a specific linear least squares approximation.
If your dataset is sufficiently large, you may consider more complicated formulas. Things like
v ≈ a1*x^2 + a2*y^2 + a3*z^2 + a4*x*y + a5*x*z + a6*y*z + a7*x + a8*y + a9*z + a10
The above is non-linear in the variables but still linear in the coefficients a1, ..., a10, so it's still a linear least squares problem.
Or you could apply transformations to your variables, e.g.
v ≈ a1*x + a2*y + a3*z + a4*exp(x) + a5*exp(y) + a6*exp(z) + a7
Looking at the residual errors (i.e. difference between predicted and observed values) in any of these may indicate terms worth adding.
Personally I'd try all this in R, since computing linear models is just one line in that language, and visualizing data is fairly easy as well.
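If you would rather stay in Python, a minimal NumPy sketch of the same linear least squares fit looks like this (the arrays below are made-up placeholders for your pers_capacity / bedrooms / bathrooms / median price columns):

import numpy as np

# made-up placeholder data
x = np.array([2, 2, 4, 4, 6, 6], dtype=float)              # pers_capacity
y = np.array([1, 2, 2, 3, 3, 4], dtype=float)              # bedrooms
z = np.array([1, 1, 2, 2, 3, 3], dtype=float)              # bathrooms
v = np.array([120, 187, 210, 260, 320, 380], dtype=float)  # median price per night

# design matrix for v ≈ a*x + b*y + c*z + d
A = np.column_stack([x, y, z, np.ones_like(x)])
(a, b, c, d), *_ = np.linalg.lstsq(A, v, rcond=None)

# predicted price for 2 persons, 2 bedrooms, 1 bathroom
print(a * 2 + b * 2 + c * 1 + d)

# the quadratic/interaction variant is the same call with more columns, e.g.:
# A2 = np.column_stack([x**2, y**2, z**2, x*y, x*z, y*z, x, y, z, np.ones_like(x)])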

random choice, specific distributions, variable applicability

Imagine I have persons 1, 2, 3 and 4, and shirt styles A, B, C, D. I want to distribute the shirt styles to the people such that 25% of them get style A, 25% get style B, 25% get style C and 25% get style D, but some of the people refuse to wear certain styles; the refusals are marked with F below. How can I randomly match all the people with the styles they are willing to wear and still get the approximate distribution?
  A B C D
1 T F T T
2 T F F F
3 T T T T
4 T T T F
In this case this is easy and 25% can be fully achieved: just give each person a different style. However, I intend to take this problem beyond this simple situation, so my solution has to be generic. The number of styles, the number of people, and the distribution are all variable. Sometimes the distribution will be impossible to achieve exactly; an approximate/close/best-effort result is expected. The selection process should be random while attempting to maintain the distribution.
I'm pretty agnostic to the language here, I'm just seeking the algorithm. Though preferably it would be able to be distributed.
Finding a solution when you are hampered by the Fs is the assignment problem (https://en.wikipedia.org/wiki/Assignment_problem). One way to select an arbitrary assignment when there are many is to set random costs wherever a style is acceptable to a person and then find the assignment with the lowest possible cost. However, it is not obvious that this fits any natural definition of random. One (very inefficient) natural definition of random would be to select from all possible assignments at random until you get one that is acceptable to everybody. The distribution you get from this might not be the same as the one you would get by setting up random costs and then solving the resulting assignment problem.
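A minimal sketch of the random-cost idea, assuming SciPy is available (linear_sum_assignment solves the assignment problem; the forbidden cells just get a prohibitively large cost):

import numpy as np
from scipy.optimize import linear_sum_assignment

# willing[i][j] = True if person i is willing to wear style j (the T/F table)
willing = np.array([
    [True, False, True,  True ],   # person 1
    [True, False, False, False],   # person 2
    [True, True,  True,  True ],   # person 3
    [True, True,  True,  False],   # person 4
])

rng = np.random.default_rng()
cost = rng.random(willing.shape)   # random cost where the style is acceptable
cost[~willing] = 1e9               # effectively forbid the F cells

rows, cols = linear_sum_assignment(cost)
# if the total cost is >= 1e9, no assignment avoiding all Fs exists
for person, style in zip(rows, cols):
    print(f"person {person + 1} gets style {'ABCD'[style]}")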
You are using the term 'randomly match', which should be used with caution. The correct interpretation, I believe, is a random selection from the set of all valid solutions, so basically, if we could enumerate all valid solutions, we could trivially solve the problem.
You are looking for a close-enough solution, so we need to better define what a valid solution is. I suggest defining some threshold (say 1% error at most).
In your example, there are 4 groups (those assigned with shirt style A/B/C/D). Therefore, there are 2^4-1 possible person archetypes (love/hate each of A/B/C/D, with the assumption that anyone loves at least one shirt style). Each archetype population has a given population size, and each archetype can be assigned to some of the 4 groups (1 or more).
The goal is to divide the population of each archetype between the 4 groups, such that, say, the size of each group is between L and H.
Let's formalize it.
Problem statement:
Denote A(0001),...,A(1111): the population size of each of the 15 archetypes
Denote G1(0001): the size of A(0001) assigned to G1, etc.
Given
L,H: constants
A(0001),...,A(1111): 15 constants
Our goal is to find all integer solutions for
G1(0001),G1(0011),G1(0101),G1(0111),G1(1001),G1(1011),G1(1101),G1(1111),
G2(0010),G2(0011),G2(0110),G2(0111),G2(1010),G2(1011),G2(1110),G2(1111),
G3(0100),G3(0101),G3(0110),G3(0111),G3(1100),G3(1101),G3(1110),G3(1111),
G4(1000),G4(1001),G4(1010),G4(1011),G4(1100),G4(1101),G4(1110),G4(1111)
subject to:
G1(0001) = A(0001)
G2(0010) = A(0010)
G2(0011) + G1(0011) = A(0011)
G3(0100) = A(0100)
G3(0101) + G1(0101) = A(0101)
G3(0110) + G2(0110) = A(0110)
G3(0111) + G2(0111) + G1(0111) = A(0111)
G4(1000) = A(1000)
G4(1001) + G1(1001) = A(1001)
G4(1010) + G2(1010) = A(1010)
G4(1011) + G2(1011) + G1(1011) = A(1011)
G4(1100) + G3(1100) = A(1100)
G4(1101) + G3(1101) + G1(1101) = A(1101)
G4(1110) + G3(1110) + G2(1110) = A(1110)
G4(1111) + G3(1111) + G2(1111) + G1(1111) = A(1111)
L < G1(0001)+G1(0011)+G1(0101)+G1(0111)+G1(1001)+G1(1011)+G1(1101)+G1(1111) < H
L < G2(0010)+G2(0011)+G2(0110)+G2(0111)+G2(1010)+G2(1011)+G2(1110)+G2(1111) < H
L < G3(0100)+G3(0101)+G3(0110)+G3(0111)+G3(1100)+G3(1101)+G3(1110)+G3(1111) < H
L < G4(1000)+G4(1001)+G4(1010)+G4(1011)+G4(1100)+G4(1101)+G4(1110)+G4(1111) < H
Now we can use an integer programming solver for the job.
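As a sketch of what that could look like in practice, here is the formulation above written with PuLP (just one possible ILP library; the archetype sizes and the L/H bounds are made up):

from pulp import LpProblem, LpVariable, LpMinimize, LpStatus, lpSum, value

# made-up archetype sizes: the key is a 4-bit mask, bit g set means the
# archetype accepts group g+1 (0b0001 accepts only G1, 0b1111 accepts all four)
A = {mask: 10 for mask in range(1, 16)}   # 15 archetypes, 10 people each
L, H = 30, 45                             # made-up bounds on each group size

prob = LpProblem("shirt_assignment", LpMinimize)

# G[(g, mask)] = how many people of this archetype are assigned to group g+1
G = {(g, mask): LpVariable(f"G{g + 1}_{mask:04b}", lowBound=0, cat="Integer")
     for g in range(4) for mask in A if mask & (1 << g)}

# everyone in an archetype goes to exactly one acceptable group
for mask, size in A.items():
    prob += lpSum(G[(g, mask)] for g in range(4) if mask & (1 << g)) == size

# every group size stays within [L, H]
for g in range(4):
    total = lpSum(var for (gg, _), var in G.items() if gg == g)
    prob += total >= L
    prob += total <= H

# constant objective (the total population is fixed), i.e. a pure feasibility problem
prob += lpSum(G.values())

prob.solve()
print(LpStatus[prob.status])
for (g, mask), var in sorted(G.items()):
    print(f"G{g + 1}({mask:04b}) = {int(value(var))}")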

Search cost on Tree with ordered nodes

Context: I am building on top of FoundationDB, and I am thinking about which key to use first.
Let's say we have this set of elements :
{AP,AQ,AR,BP,BQ,BR}
and we want to build a tree from it. One way is to group by the first character first, and then by the second, obtaining
root
+-- A
|   +-- P
|   +-- Q
|   +-- R
+-- B
    +-- P
    +-- Q
    +-- R
One other possible way is to group first by the second character, and then by the first, obtaining:
root
+-- P
|   +-- A
|   +-- B
+-- Q
|   +-- A
|   +-- B
+-- R
    +-- A
    +-- B
Assuming the probability distribution of the strings is uniform, which one leads to the faster search time? In general, is it better to have a high number of branches at the top levels of the tree or at the bottom ones?
The first solution will lead to choosing one out of 2 options and then one out of 3, while the second will first make a choice out of three and then one out of two. Theoretically both should be approximately the same.
EDIT: as per your comment, in case you have two layers where the number of choices is significantly different, like 30 and 1,000,000, I advise you to put the 30 options on the higher level and the 1,000,000 on the lower level. I believe caching will speed up the lower level more in such cases.
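A quick back-of-envelope check of this, assuming a plain linear scan over the children at each level (so a node with k children costs (k + 1) / 2 comparisons on average):

def expected_comparisons(branching_factors):
    # linear scan at each level: (k + 1) / 2 comparisons on average per node
    return sum((k + 1) / 2 for k in branching_factors)

print(expected_comparisons([2, 3]), expected_comparisons([3, 2]))                     # 3.5 3.5
print(expected_comparisons([30, 1_000_000]), expected_comparisons([1_000_000, 30]))   # identical

The comparison counts are symmetric either way, so any practical difference comes from things like caching and locality, as noted above.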

Find similar records in dataset

I have a dataset of 25 integer fields and 40k records, e.g.
1:
field1: 0
field2: 3
field3: 1
field4: 2
[...]
field25: 1
2:
field1: 2
field2: 1
field3: 4
field4: 0
[...]
field25: 2
etc.
I'm testing with MySQL but am not tied to it.
Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.
https://en.wikipedia.org/wiki/Euclidean_distance
https://en.wikipedia.org/wiki/S%C3%B8rensen_similarity_index
https://en.wikipedia.org/wiki/Similarity_matrix
I know it's an old post, but for anyone who comes by it seeking similar algorithms, one that works particularly well is Cosine Similarity. Find a way to vectorize your records, then look for vectors with the minimum angle between them. If vectorizing a record is not trivial, you can vectorize the similarity between records via some known algorithm, and then look at the cosine similarity of those similarity vectors to the perfect-match vector (assuming perfect matches aren't the goal, since they're easy to find anyway).
I get tremendous results with this matching, even comparing things like lists of people in various countries working on a particular project with various contributions to it. Vectorization then means looking at the number of country matches, country mismatches, the ratio of people in a matching country between two datasets, and so on. I use string edit distance functions like Levenshtein distance for getting a numeric value from string dissimilarities, but one could use phonetic matching, etc.
One caveat: the target vector must not be all zeros (the zero vector [0 0 ... 0] has zero norm, so its angle to any other vector is undefined). To get around this, for example in the case of edit distance, I give a perfect match (edit distance 0) a negative weight, so that perfect matches are really emphasized: -1 and 1 are farther apart than 1 and 2, which makes a lot of sense, since a perfect match is better than anything with even one misspelling.
Cos(theta) = (A dot B) / (Norm(A) * Norm(B)), where dot is the dot product and Norm is the Euclidean magnitude of the vector.
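For example, a minimal NumPy version of that formula (the vectors below are made up):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (|A| * |B|); undefined if either norm is 0
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

record = np.array([2.0, 2.0, 3.0, 1.0, 0.0])   # the query record
others = np.array([
    [0.0, 3.0, 1.0, 2.0, 1.0],
    [2.0, 1.0, 4.0, 0.0, 2.0],
    [2.0, 2.0, 3.0, 1.0, 1.0],
])

scores = [cosine_similarity(record, r) for r in others]
print(scores, int(np.argmax(scores)))   # highest score = most similar by angle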
Good luck!
Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):
SELECT id,
(
ABS(field1-2)
+ ABS(field2-2)
+ ABS(field3-3)
+ ABS(field4-1)
+ ABS(field5-0)
+ ABS(field6-3)
+ ABS(field7-2)
+ ABS(field8-0)
+ ABS(field9-1)
+ ABS(field10-0)
+ ABS(field11-2)
+ ABS(field12-2)
+ ABS(field13-3)
+ ABS(field14-2)
+ ABS(field15-0)
+ ABS(field16-1)
+ ABS(field17-0)
+ ABS(field18-2)
+ ABS(field19-3)
+ ABS(field20-1)
+ ABS(field21-0)
+ ABS(field22-1)
+ ABS(field23-3)
+ ABS(field24-2)
+ ABS(field25-2)
)/25
AS distance
FROM mytable
ORDER BY distance ASC
LIMIT 20;
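Since you are not tied to MySQL, the same measure is also easy to compute outside the database. A minimal NumPy sketch with made-up data, computing the average absolute difference for all 40k rows at once and keeping the 20 closest:

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 5, size=(40_000, 25))   # made-up stand-in for the table
query = rng.integers(0, 5, size=25)            # the record needing a match

# average absolute difference per record (same measure as the SQL above)
distances = np.abs(data - query).mean(axis=1)

# indices (row ids) of the 20 most similar records, closest first
nearest = np.argsort(distances)[:20]
print(nearest, distances[nearest])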
