Continuous Parent and Discrete Child Conditional with Observed Data in PyMC - probability

I am new to PyMC and trying to set up a simple conditional probability model: P(has_diabetes|bmi, race). Race takes one of 5 discrete values encoded as 0-4, and BMI is a positive real number. So far I have the parent variables set up like this:
p_race = [0.009149232914923292,
0.15656903765690378,
0.019637377963737795,
0.013947001394700141,
0.800697350069735]
race = pymc.Categorical('race', p_race)
bmi_alpha = pymc.Exponential('bmi_alpha', 1)
bmi_beta = pymc.Exponential('bmi_beta', 1)
bmi = pymc.Gamma('bmi', bmi_alpha, bmi_beta, value=bmis, observed=True)
I have observed data that looks like:
| bmi | race | has_diabetes |
| 21.7 | 1 | 0 |
| 45.3 | 4 | 1 |
| 18.9 | 2 | 0 |
| 26.6 | 0 | 0 |
| 35.1 | 4 | 0 |
I am attempting to model has_diabetes as:
has_diabetes = pymc.Bernoulli('has_diabetes', p_diabetes, value=data, observed=True)
My problem is that I am not sure how to construct the p_diabetes function, since it depends on the value of race and on the continuous value of bmi.

You need to construct a deterministic function that generates p_diabetes as a function of your predictors. The safest way to do this is via a logit-linear transformation. For example:
import numpy as np
import pymc

intercept = pymc.Normal('intercept', 0, 0.01, value=0)
beta_race = pymc.Normal('beta_race', 0, 0.01, value=np.zeros(4))
beta_bmi = pymc.Normal('beta_bmi', 0, 0.01, value=0)

@pymc.deterministic
def p_diabetes(b0=intercept, b1=beta_race, b2=beta_bmi):
    # Prepend a zero coefficient for the baseline race (index 0)
    b1 = np.append(0, b1)
    # Logit-linear model
    return pymc.invlogit(b0 + b1[race] + b2*bmi)
I would make the baseline race be the largest group (it is assumed to be index 0 in this example).
Actually, it's not clear what the first part of the model above is for, specifically why you are building models for the predictors, but perhaps I am missing something.
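To make the logit-linear transformation concrete, here is a minimal sketch in plain Python (no PyMC) that evaluates it for the rows of the question's table. The coefficient values are hypothetical and chosen only for illustration; in the real model they would be sampled. The helper names invlogit and p_diabetes_for_row are likewise assumptions of this sketch.

```python
import math

def invlogit(x):
    """Inverse logit (logistic) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def p_diabetes_for_row(race, bmi, intercept, beta_race, beta_bmi):
    """Probability of diabetes for one person under a logit-linear model.

    beta_race holds one coefficient per non-baseline race (codes 1-4);
    race 0 is the baseline and contributes 0.
    """
    race_effect = [0.0] + list(beta_race)
    return invlogit(intercept + race_effect[race] + beta_bmi * bmi)

# Rows from the question's table: (bmi, race, has_diabetes)
rows = [(21.7, 1, 0), (45.3, 4, 1), (18.9, 2, 0), (26.6, 0, 0), (35.1, 4, 0)]

# Hypothetical coefficient values, for illustration only
intercept, beta_race, beta_bmi = -5.0, [0.3, -0.2, 0.1, 0.4], 0.12

probs = [p_diabetes_for_row(race, bmi, intercept, beta_race, beta_bmi)
         for bmi, race, _ in rows]
for (bmi, race, observed), p in zip(rows, probs):
    print(f"bmi={bmi:5.1f} race={race} observed={observed} p={p:.3f}")
```

Because the inverse logit squeezes the linear predictor into (0, 1), any coefficient values give a valid Bernoulli probability, which is why this form is a safe default.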

Related

Sort order of a Bitwise Enum

Is there an intelligent way to identify the order of a Bitwise Enum?
Take this enum for example:
[System.FlagsAttribute()]
internal enum AnswersSort : int
{
    None = 0,
    Bounty = 1,
    MarkedAnswer = 2,
    MostVotes = 4
}
If I combine these in different ways:
var result = AnswersSort.MarkedAnswer | AnswersSort.MostVotes | AnswersSort.Bounty;
result = AnswersSort.MostVotes | AnswersSort.Bounty | AnswersSort.MarkedAnswer;
Both results are 7 and the order is lost. Is there a way to do this without using an array or a list? Ideally I'm looking for a solution using an enum but I'm not sure how or if it's possible.
If you have 10 values, you need 4 bits per item. You could treat the combination as a single 40-bit value, encoded with 4 bits per digit. So, given your two examples:
var result = AnswersSort.MarkedAnswer | AnswersSort.MostVotes | AnswersSort.Bounty;
result = AnswersSort.MostVotes | AnswersSort.Bounty | AnswersSort.MarkedAnswer;
The first would be encoded as
0010 0100 0001
---- ---- ----
 |    |    - Bounty
 |    - MostVotes
 - MarkedAnswer
You could build that in a 64-bit integer:
long first = BuildValue(AnswersSort.MarkedAnswer, AnswersSort.MostVotes, AnswersSort.Bounty);

long BuildValue(params AnswersSort[] values)
{
    long result = 0;
    foreach (var val in values)
    {
        result = result << 4;
        result |= (int)val;
    }
    return result;
}
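The packing idea works in both directions. As a rough sketch in Python (the names build_value and decode_value are made up for this illustration), the ordered list can be recovered by peeling off 4 bits at a time. This assumes every stored value is non-zero and fits in 4 bits, so None = 0 cannot be stored.

```python
BITS_PER_ITEM = 4

def build_value(values):
    """Pack an ordered sequence of small non-zero integers, 4 bits each."""
    packed = 0
    for v in values:
        if not 0 < v < (1 << BITS_PER_ITEM):
            raise ValueError(f"value {v} does not fit in {BITS_PER_ITEM} bits")
        packed = (packed << BITS_PER_ITEM) | v
    return packed

def decode_value(packed):
    """Recover the ordered sequence from a packed integer."""
    mask = (1 << BITS_PER_ITEM) - 1
    items = []
    while packed:
        items.append(packed & mask)   # lowest 4 bits hold the last item
        packed >>= BITS_PER_ITEM
    items.reverse()
    return items

# Same numeric values as the enum in the question
BOUNTY, MARKED_ANSWER, MOST_VOTES = 1, 2, 4

first = build_value([MARKED_ANSWER, MOST_VOTES, BOUNTY])
second = build_value([MOST_VOTES, BOUNTY, MARKED_ANSWER])
print(hex(first), decode_value(first))
print(hex(second), decode_value(second))
```

Unlike the OR-ed flags, the two orderings now produce different integers, so the sort order survives a round trip.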

how to calculate svg transform matrix from rotation with pivot point

If there is an SVG rotation (a degrees) about the default pivot point (0,0), then I can calculate the rotation transform matrix as
$$
\begin{pmatrix}
\cos a & -\sin a & 0 \\
\sin a & \cos a & 0 \\
0 & 0 & 1
\end{pmatrix}
$$
But if the pivot point is not (0,0), let's say (px,py), how do I calculate the rotation transform matrix?
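I got the answer.
Let the pivot point be (px, py), written $(p_x, p_y)$ below, and let the rotation be a degrees.
The net transform matrix is then
$$
\text{net\_matrix} =
\begin{pmatrix} 1 & 0 & p_x \\ 0 & 1 & p_y \\ 0 & 0 & 1 \end{pmatrix}
\times
\begin{pmatrix} \cos a & -\sin a & 0 \\ \sin a & \cos a & 0 \\ 0 & 0 & 1 \end{pmatrix}
$$
$$
\text{rotate\_transform\_matrix} = \text{net\_matrix} \times
\begin{pmatrix} 1 & 0 & -p_x \\ 0 & 1 & -p_y \\ 0 & 0 & 1 \end{pmatrix}
$$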
You can use JavaScript to apply the rotation transform to an SVG element:
var rect = document.createElementNS("http://www.w3.org/2000/svg", "rect");
rect.setAttribute('transform', 'rotate(-30 50 50)');
rect.getCTM();
Calling getCTM() then gives you the transform matrix.
Just multiplying out (and tidying up the result to use the same variable names as the W3C) in case anyone else reading this wants something explicit.
rotate(a, cx, cy)
is equivalent to
matrix(cos(a), sin(a), -sin(a), cos(a), cx(1 - cos(a)) + cy(sin(a)), cy(1 - cos(a)) - cx(sin(a)))
Using mathematical notation assuming rotate and matrix are functions.
For anyone who is interested in Swarnendu Paul's rotate_transform_matrix above, multiplying it out gives:
$$
\begin{pmatrix}
\cos a & -\sin a & p_x (1 - \cos a) + p_y \sin a \\
\sin a & \cos a & p_y (1 - \cos a) - p_x \sin a \\
0 & 0 & 1
\end{pmatrix}
$$
I used it for SVG matrix transforms.
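As a quick numerical check, the following Python sketch composes translate(px, py) x rotate(a) x translate(-px, -py) and compares the product with the multiplied-out matrix. The function names are assumptions of this sketch, not part of any SVG API.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def translate(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]

def rotate(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def rotate_about(deg, px, py):
    """Rotation about (px, py), built by composing three matrices."""
    return matmul(matmul(translate(px, py), rotate(deg)), translate(-px, -py))

def rotate_about_closed_form(deg, px, py):
    """The same rotation, using the multiplied-out matrix."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, px * (1 - c) + py * s],
            [s, c, py * (1 - c) - px * s],
            [0, 0, 1]]

composed = rotate_about(-30, 50, 50)
closed = rotate_about_closed_form(-30, 50, 50)
max_error = max(abs(composed[i][j] - closed[i][j])
                for i in range(3) for j in range(3))
print(max_error)
```

The two matrices agree up to floating-point rounding, and the pivot point itself is left unchanged by the transform, as expected for a rotation about that point.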

FFI code segfaults in JRuby but not MRI Ruby

I'm porting a Ruby gem written in C to Ruby with FFI.
When I run the tests using MRI Ruby there aren't any seg-faults.
When running in JRuby, I get a segfault.
This is the code in the test that I think is responsible:
if type == Date or type == DateTime then
  assert_nil param.set_value(value.strftime("%F %T"));
else
  assert_nil param.set_value(value);
end
@api.sqlany_bind_param(stmt, 0, param)
puts "\n#{param.inspect}"
#return if String === value or Date === value or DateTime === value
assert_succeeded @api.sqlany_execute(stmt)
The segmentation fault happens when running sqlany_execute, but only when the object passed to set_value is of the class String.
sqlany_execute just uses FFI's attach_function method.
param.set_value is more complicated. I'll focus just on the String specific part. Here is the original C code
case T_STRING:
    s_bind->value.length = malloc(sizeof(size_t));
    length = RSTRING_LEN(val);
    *s_bind->value.length = length;
    s_bind->value.buffer = malloc(length);
    memcpy(s_bind->value.buffer, RSTRING_PTR(val), length);
    s_bind->value.type = A_STRING;
    break;
https://github.com/in4systems/sqlanywhere/blob/db25e7c7a2d5c855ab3899eacbc7a86b91114f53/ext/sqlanywhere.c#L1461
In my port, this became:
when String
  self[:value][:length] = SQLAnywhere::LibC.malloc(FFI::Type::ULONG.size)
  length = value.bytesize
  self[:value][:length].write_int(length)
  self[:value][:buffer] = SQLAnywhere::LibC.malloc(length + 1)
  self[:value][:buffer_size] = length + 1
  ## Don't use put_string as that includes the terminating null
  # value.each_byte.each_with_index do |byte, index|
  #   self[:value][:buffer].put_uchar(index, byte)
  # end
  self[:value][:buffer].put_string(0, value)
  self[:value][:type] = :string
https://github.com/in4systems/sqlanywhere/blob/e49099a4e6514169395523391f57d2333fbf7d78/lib/bind_param.rb#L31
My question is: what's causing JRuby to segfault, and what can I do about it?
This answer is possibly overly detailed, but I thought it would be good to go into a bit of depth for those who run across similar problems in the future.
It looks like this was your problem:
self[:value][:length].write_int(length)
when it should have been:
self[:value][:length].write_ulong(length)
On a 64 bit system, bytes 4..7 of the memory self[:value][:length] points to could have contained garbage (since malloc does not clear the memory it returns), and when the native code reads a size_t quantity at that address, it will be garbage, potentially indicating a buffer larger than 4 gigabytes.
e.g. if the string length is really 15 bytes, the lower 4 bits will be set, and the upper 60 should be all zero.
bit   0   1   2   3   4      32      63
    +---+---+---+---+---+ ~ +---+ ~ +---+
    | 1 | 1 | 1 | 1 | 0 | ~ | 0 | ~ | 0 |
    +---+---+---+---+---+ ~ +---+ ~ +---+
If just one bit in the upper 32 bits is set, then you get a value larger than 4 gigabytes:
bit   0   1   2   3   4      32      63
    +---+---+---+---+---+ ~ +---+ ~ +---+
    | 1 | 1 | 1 | 1 | 0 | ~ | 1 | ~ | 0 |
    +---+---+---+---+---+ ~ +---+ ~ +---+
which would be a length of 4294967311 bytes.
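The effect can be reproduced without any native library. This small Python sketch uses the struct module on an 8-byte buffer with one simulated garbage bit (bit 32) and assumes a little-endian layout with an 8-byte size_t, which is an assumption about the target platform rather than something the question states.

```python
import struct

def read_size_t(buf):
    """Read an unsigned 64-bit little-endian value, as native code would read a size_t."""
    return struct.unpack_from("<Q", buf, 0)[0]

# Simulate malloc'ed memory that still contains garbage: bit 32 is set.
buf = bytearray(8)
buf[4] = 0x01

# Writing a 32-bit int only overwrites bytes 0..3; the garbage in bytes 4..7 survives.
struct.pack_into("<i", buf, 0, 15)
wrong = read_size_t(buf)

# Writing the full 64-bit quantity overwrites all 8 bytes.
struct.pack_into("<Q", buf, 0, 15)
right = read_size_t(buf)

print(wrong)  # 4294967311
print(right)  # 15
```

The first read returns 2^32 + 15 because only the low half of the size_t was written, which is the same failure mode as calling write_int on a pointer that the native code reads as a size_t.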
One way to fix it is to define a SizeT struct and use that for the length, e.g.
class SizeT < FFI::Struct
  layout :value, :size_t
end
self[:value][:length] = SQLAnywhere::LibC.malloc(SizeT.size)
length = value.bytesize
SizeT.new(self[:value][:length])[:value] = length
Or you could monkey patch FFI::Pointer:
class FFI::Pointer
  if FFI.type_size(:size_t) == 4
    def write_size_t(val)
      write_int(val)
    end
  else
    def write_size_t(val)
      write_long_long(val)
    end
  end
end
Why was it only segfaulting on JRuby, not on MRI? Maybe MRI was a 32 bit executable (printing the value of FFI.type_size(:size_t) will tell you).

Divide a list into two parts whose sums are closest to each other

This is a hard algorithms problem:
Divide the list into 2 parts so that their sums are as close to each other as possible.
The list length is 1 <= n <= 100 and the weights of the numbers are 1 <= w <= 250, as given in the question.
For example: 23 65 134 32 95 123 34
1st sum = 256
2nd sum = 250
1st list = items 1 2 3 7
2nd list = items 4 5 6
I have an algorithm, but it doesn't work for all inputs:
1. Initialize the lists: list1 = [], list2 = [].
2. Sort the elements of the given list: [23 32 34 65 95 123 134].
3. Pop the last (largest) element.
4. Insert it into the list that keeps the difference between the two sums smaller.
Walkthrough:
1. list1 = [], list2 = []
2. Select 134 and insert it into list1: list1 = [134].
3. Select 123 and insert it into list2, because inserting it into list1 would make the difference bigger.
4. Select 95 and insert it into list2, because sum(list2) + 95 - sum(list1) is smaller.
and so on...
You can reformulate this as the knapsack problem.
You have a list of items with total weight M that should be fitted into a bin that can hold maximum weight M/2. The items packed in the bin should weigh as much as possible, but not more than the bin holds.
For the case where all weights are non-negative integers, this problem is only weakly NP-complete and has pseudo-polynomial time solutions.
A description of dynamic programming solutions for this problem can be found on Wikipedia.
The problem is NP-complete, but there is a pseudo-polynomial algorithm for it. This is the 2-partition problem, and you can solve it the same way as the pseudo-polynomial time algorithm for the subset sum problem. If the input values are polynomially bounded in the input size, then this can be done in polynomial time.
In your case (weights <= 250) it is polynomial: each weight is at most 250, so the total is at most 250 n, and a table with n rows and sum/2 columns has O(n^2) entries.
Let sum be the sum of the weights. We create a two-dimensional array A and fill it column by column:
A[i,j] = true if (j == weight[i] or j - weight[i] = weight[k] (k is in list)).
Building the array with this algorithm takes O(n^2 * sum/2).
Finally, we look for the highest-numbered column that contains a true value.
Here is an example:
items:{0,1,2,3}
weights:{4,7,2,8} => sum = 21 sum/2 = 10
items/weights |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10
--------------+----+----+----+----+----+----+----+----+----+----
0             |  0 |  0 |  0 |  1 |  0 |  1 |  0 |  0 |  0 |  0
1             |  0 |  0 |  0 |  0 |  0 |  0 |  1 |  0 |  1 |  0
2             |  0 |  1 |  0 |  0 |  0 |  1 |  0 |  0 |  1 |  1
3             |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  1 |  0 |  1
So because a[10, 2] == true the partition is 10, 11
This is an algorithm I found here and edited a little bit to solve your problem:
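#include <vector>
using namespace std;

// Returns the largest subset sum that does not exceed half of the total.
int partition(const vector<int>& C) {
    // compute the total sum
    int n = C.size();
    int N = 0;
    for (int i = 0; i < n; i++) N += C[i];
    // initialize the table: T[s] is true if some subset sums to s
    vector<bool> T(N + 1, false);
    T[0] = true;
    // process the numbers one by one
    for (int i = 0; i < n; i++)
        for (int j = N - C[i]; j >= 0; j--)
            if (T[j]) T[j + C[i]] = true;
    // scan downwards from N/2 for the largest reachable sum
    for (int i = N / 2; i >= 0; i--)
        if (T[i])
            return i;
    return 0;
}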
Instead of returning T[N/2], I return the first i for which T[i] is true, scanning from N/2 downwards.
Finding the path which gives this value is not hard.
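One possible way to recover the actual split is to remember, for every reachable sum, which item produced it and from which smaller sum. The Python sketch below does this; the function name best_partition and the reached_by dictionary are assumptions of this illustration, not taken from the answer above.

```python
def best_partition(weights):
    """Split weights into two lists whose sums are as close as possible.

    Subset-sum dynamic programming over sums up to total // 2, with
    back-pointers so that the chosen items can be reconstructed.
    """
    half = sum(weights) // 2
    # reached_by[s] = (previous sum, index of the item added), or None for s == 0
    reached_by = {0: None}
    for i, w in enumerate(weights):
        # Iterate over a snapshot so that item i is used at most once.
        for s in list(reached_by):
            t = s + w
            if t <= half and t not in reached_by:
                reached_by[t] = (s, i)

    # Largest reachable sum not exceeding half, then walk the back-pointers.
    s = max(reached_by)
    chosen = set()
    while reached_by[s] is not None:
        s, i = reached_by[s]
        chosen.add(i)

    left = [w for i, w in enumerate(weights) if i in chosen]
    right = [w for i, w in enumerate(weights) if i not in chosen]
    return left, right

left, right = best_partition([4, 7, 2, 8])
print(sum(left), left)
print(sum(right), right)
```

For the weights {4, 7, 2, 8} this yields the sums 10 and 11 from the example table. With n <= 100 and weights <= 250 the dictionary never holds more than 12501 sums, so the approach stays cheap for the stated limits.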
This problem is at least as hard as the NP-complete problem subset sum. Your algorithm is a greedy algorithm. This type of algorithm is fast and can generate an approximate solution quickly, but it is not guaranteed to find the exact solution to an NP-complete problem.
A brute force approach is probably the simplest way to solve your problem, although it will be too slow if there are too many elements.
Try every possible way of partitioning the elements into two sets and calculate the absolute difference in the sums.
Choose the partition for which the absolute difference is minimal.
Generating all the partitions can be done by considering the binary representation of each integer from 0 to 2^n - 1, where each binary digit determines whether the corresponding element is in the left or right partition.
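The enumeration described above can be sketched in a few lines of Python. The function name is made up for this illustration, and the approach is only practical for small n because it examines 2^n masks.

```python
def brute_force_partition(weights):
    """Try every assignment of elements to two sets; keep the most balanced one."""
    n = len(weights)
    total = sum(weights)
    best_diff, best_mask = None, 0
    for mask in range(2 ** n):
        # Bit i of mask decides whether element i goes to the left set.
        left_sum = sum(w for i, w in enumerate(weights) if (mask >> i) & 1)
        diff = abs(total - 2 * left_sum)
        if best_diff is None or diff < best_diff:
            best_diff, best_mask = diff, mask
    left = [w for i, w in enumerate(weights) if (best_mask >> i) & 1]
    right = [w for i, w in enumerate(weights) if not (best_mask >> i) & 1]
    return left, right, best_diff

left, right, diff = brute_force_partition([4, 7, 2, 8])
print(left, right, diff)
```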
While trying to solve the same problem I came up with the following idea. It looks almost too simple to be a solution, but it runs quickly (one pass over the sorted list). Could someone provide an example showing that it does not work, or explain why it is not a solution?
arr = [20,10,15,6,1,17,3,9,10,2,19] # a list of numbers
g1 = []
g2 = []
for el in reversed(sorted(arr)):
    if sum(g1) > sum(g2):
        g2.append(el)
    else:
        g1.append(el)
print(f"{sum(g1)}: {g1}")
print(f"{sum(g2)}: {g2}")
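A small counterexample for this greedy idea is [3, 3, 2, 2, 2]. The sketch below wraps the same loop in a function (the name greedy_partition is an assumption of this illustration) and compares its result with a split that can be checked by hand.

```python
def greedy_partition(numbers):
    """The greedy rule from above: give each element, largest first,
    to the group whose current sum is not larger."""
    g1, g2 = [], []
    for el in sorted(numbers, reverse=True):
        if sum(g1) > sum(g2):
            g2.append(el)
        else:
            g1.append(el)
    return g1, g2

numbers = [3, 3, 2, 2, 2]
g1, g2 = greedy_partition(numbers)
print(f"{sum(g1)}: {g1}")  # 7: [3, 2, 2]
print(f"{sum(g2)}: {g2}")  # 5: [3, 2]

# A better split exists: both sides sum to 6.
better = ([3, 3], [2, 2, 2])
print(abs(sum(g1) - sum(g2)), abs(sum(better[0]) - sum(better[1])))
```

The greedy split differs by 2, while {3, 3} versus {2, 2, 2} differs by 0: once the two 3s have gone to different groups, no later choice can bring them back together.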
TypeScript code:
import * as _ from 'lodash'
function partitionArray(numbers: number[]): {
  arr1: number[]
  arr2: number[]
  difference: number
} {
  let sortedArr: number[] = _.chain(numbers).without(0).sortBy((x) => x).value().reverse()
  let arr1: number[] = []
  let arr2: number[] = []
  let median = _.sum(sortedArr) / 2
  let sum = 0
  _.each(sortedArr, (n) => {
    let ns = sum + n
    if (ns > median) {
      arr1.push(n)
    } else {
      sum += n
      arr2.push(n)
    }
  })
  return {
    arr1: arr1,
    arr2: arr2,
    difference: Math.abs(_.sum(arr1) - _.sum(arr2))
  }
}

Quickest way to determine range overlap in Perl

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).
$a_ranges =
{
    a_1 =>
    {
        start => ...,
        end => ...,
    },
    a_2 =>
    {
        start => ...,
        end => ...,
    },
    a_3 =>
    {
        start => ...,
        end => ...,
    },
    # and so on
};
$b_ranges =
{
    b_1 =>
    {
        start => ...,
        end => ...,
    },
    b_2 =>
    {
        start => ...,
        end => ...,
    },
    b_3 =>
    {
        start => ...,
        end => ...,
    },
    # and so on
};
I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.
I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.
Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?
Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.
This sounds like the perfect use case for an interval tree, which is a data structure specifically designed to support this operation. If you have two sets of intervals of sizes m and n, then you can build one of them into an interval tree in time O(m lg m) and then do n intersection queries in time O(n lg m + k), where k is the total number of intersections you find. This gives a net runtime of O((m + n) lg m + k). Remember that in the worst case k = O(nm), so this isn't any better than what you have, but for cases where the number of intersections is sparse this can be substantially better than the O(mn) runtime you have right now.
I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description it seems like they shouldn't be that hard to build. I'd be pretty surprised if one didn't exist already.
Hope this helps!
In case you are not exclusively tied to Perl: the IRanges package in R deals with interval arithmetic. It has very powerful primitives, so it would probably be easy to code a solution with them.
A second remark is that the problem could possibly become very easy if the intervals have additional structure; for example, if within each set of ranges there is no overlap (in that case a linear approach sifting through the two ordered sets simultaneously is possible). Even in the absence of such structure, the least you can do is to sort one set of ranges by start point, and the other set by end point, then break out of the inner loop once a match is no longer possible. Of course, existing and general algorithms and data structures such as the interval tree mentioned earlier are the most powerful.
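The sort-and-sweep idea can be sketched as follows. This is an illustration in Python rather than Perl, under the assumptions that ranges are inclusive and given as id -> (start, end) mappings; the function name find_overlaps is made up. It produces the two lookup tables requested in the question's update.

```python
import heapq

def find_overlaps(a_ranges, b_ranges):
    """Return (a_hits, b_hits): for each range id, the ids from the other
    set that overlap it. Ranges are inclusive (start, end) pairs."""
    a_hits = {rid: [] for rid in a_ranges}
    b_hits = {rid: [] for rid in b_ranges}
    hits = {"a": a_hits, "b": b_hits}

    # Process all ranges from both sets in order of start point.
    events = [(start, end, "a", rid) for rid, (start, end) in a_ranges.items()]
    events += [(start, end, "b", rid) for rid, (start, end) in b_ranges.items()]
    events.sort()

    # Per set, a min-heap of (end, id) for ranges that may still overlap.
    active = {"a": [], "b": []}
    for start, end, side, rid in events:
        other = "b" if side == "a" else "a"
        heap = active[other]
        # Drop ranges of the other set that end before this one starts.
        while heap and heap[0][0] < start:
            heapq.heappop(heap)
        # Everything left started earlier (or at the same point) and has not
        # ended yet, so it overlaps the current range.
        for _, other_id in heap:
            hits[side][rid].append(other_id)
            hits[other][other_id].append(rid)
        heapq.heappush(active[side], (end, rid))
    return a_hits, b_hits

a = {"a_1": (1, 11), "a_2": (13, 44), "a_3": (17, 23), "a_4": (55, 66)}
b = {"b_1": (0, 1), "b_2": (2, 29), "b_3": (88, 133)}
a_hits, b_hits = find_overlaps(a, b)
print(a_hits)
print(b_hits)
```

After the initial sort, each range is pushed and popped at most once, so the work beyond reporting the overlaps themselves is O((m + n) log(m + n)), which suits the sparse case described in the question.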
There are several existing CPAN modules that solve this issue. I have developed two of them: Data::Range::Compare and Data::Range::Compare::Stream.
Data::Range::Compare only works with arrays in memory, but supports generic range types.
Data::Range::Compare::Stream works with streams of data via iterators, but it requires an understanding of OO Perl to extend it to generic data types. Data::Range::Compare::Stream is recommended if you are processing very large sets of data.
Here is an excerpt from the Examples folder of Data::Range::Compare::Stream.
Given these 3 sets of data:
Numeric Range set: A contained in file: source_a.src
+----------+
| 1 - 11 |
| 13 - 44 |
| 17 - 23 |
| 55 - 66 |
+----------+
Numeric Range set: B contained in file: source_b.src
+----------+
| 0 - 1 |
| 2 - 29 |
| 88 - 133 |
+----------+
Numeric Range set: C contained in file: source_c.src
+-----------+
| 17 - 29 |
| 220 - 240 |
| 241 - 250 |
+-----------+
The expected output would be:
+--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+
| 0 - 0 | No Data | 0 - 1 | No Data |
| 1 - 1 | 1 - 11 | 0 - 1 | No Data |
| 2 - 11 | 1 - 11 | 2 - 29 | No Data |
| 12 - 12 | No Data | 2 - 29 | No Data |
| 13 - 16 | 13 - 44 | 2 - 29 | No Data |
| 17 - 29 | 13 - 44 | 2 - 29 | 17 - 29 |
| 30 - 44 | 13 - 44 | No Data | No Data |
| 55 - 66 | 55 - 66 | No Data | No Data |
| 88 - 133 | No Data | 88 - 133 | No Data |
| 220 - 240 | No Data | No Data | 220 - 240 |
| 241 - 250 | No Data | No Data | 241 - 250 |
+--------------------------------------------------------------------+
The source code can be found here.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use lib qw(./ ../lib);
# custom package from FILE_EXAMPLE.pl
use Data::Range::Compare::Stream::Iterator::File;
use Data::Range::Compare::Stream;
use Data::Range::Compare::Stream::Iterator::Consolidate;
use Data::Range::Compare::Stream::Iterator::Compare::Asc;
my $source_a=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_a.src');
my $source_b=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_b.src');
my $source_c=Data::Range::Compare::Stream::Iterator::File->new(filename=>'source_c.src');
my $consolidator_a=new Data::Range::Compare::Stream::Iterator::Consolidate($source_a);
my $consolidator_b=new Data::Range::Compare::Stream::Iterator::Consolidate($source_b);
my $consolidator_c=new Data::Range::Compare::Stream::Iterator::Consolidate($source_c);
my $compare=new Data::Range::Compare::Stream::Iterator::Compare::Asc();
my $src_id_a=$compare->add_consolidator($consolidator_a);
my $src_id_b=$compare->add_consolidator($consolidator_b);
my $src_id_c=$compare->add_consolidator($consolidator_c);
print " +--------------------------------------------------------------------+
| Common Range | Numeric Range A | Numeric Range B | Numeric Range C |
+--------------------------------------------------------------------+\n";
my $format=' | %-12s | %-13s | %-13s | %-13s |'."\n";
while($compare->has_next) {
  my $result=$compare->get_next;
  my $string=$result->to_string;
  my @data=($result->get_common);
  next if $result->is_empty;
  for(0 .. 2) {
    my $column=$result->get_column_by_id($_);
    unless(defined($column)) {
      $column="No Data";
    } else {
      $column=$column->get_common->to_string;
    }
    push @data,$column;
  }
  printf $format,@data;
}
print " +--------------------------------------------------------------------+\n";
Try Tree::RB, but note that it is for finding mutually exclusive ranges, with no overlaps.
The performance is not bad. I had about 10000 segments and had to find the segment for each discrete number. My input had 300 million records, and I needed to put them into separate buckets, like partitioning the data. Tree::RB worked out great.
$var = [
[0,90],
[91,2930],
[2950,8293]
.
.
.
]
My lookup values were 10, 99, 991 ...
and basically I needed the position of the range containing the given number.
My comparison function looks something like this:
my $cmp = sub
{
    my ($a1, $b1) = @_;
    # Two ranges: order them by position (ranges are assumed not to overlap)
    if(ref($b1) && ref($a1))
    {
        return ($$a1[1]) <=> ($$b1[0]);
    }
    my $ret = 0;
    if(ref($a1) eq 'ARRAY')
    {
        # Range vs. number: 0 when the number falls inside the range
        if($$a1[1] < $b1)
        {
            $ret = -1;
        }
        elsif($$a1[0] > $b1)
        {
            $ret = 1;
        }
    }
    else
    {
        # Number vs. range: 0 when the number falls inside the range
        if($$b1[0] > $a1)
        {
            $ret = -1;
        }
        elsif($$b1[1] < $a1)
        {
            $ret = 1;
        }
    }
    return $ret;
};
I would need to time it to know whether it's the fastest way, but given the structure of your data you could try this:
use strict;
my $fromA = 12;
my $toA = 15;
my $fromB = 7;
my $toB = 35;
my @common_range = get_common_range($fromA, $toA, $fromB, $toB);
my $common_range = $common_range[0]."-".$common_range[-1];
sub get_common_range {
    my @A = $_[0]..$_[1];
    my %B = map {$_ => 1} $_[2]..$_[3];
    my @common = ();
    foreach my $i (@A) {
        if (defined $B{$i}) {
            push (@common, $i);
        }
    }
    return sort {$a <=> $b} @common;
}
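For two contiguous integer ranges, the common part can also be computed directly from the endpoints, without enumerating every integer. A minimal Python sketch, assuming inclusive ranges (the function name common_range is chosen to mirror the Perl code above):

```python
def common_range(from_a, to_a, from_b, to_b):
    """Return the inclusive overlap of [from_a, to_a] and [from_b, to_b],
    or None if the two ranges do not overlap."""
    start = max(from_a, from_b)
    end = min(to_a, to_b)
    return (start, end) if start <= end else None

print(common_range(12, 15, 7, 35))
print(common_range(1, 5, 6, 9))
```

This takes constant time per pair regardless of how wide the ranges are, whereas building the list and hash grows with the range lengths.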

Resources