Design of Bayesian networks: Understanding the difference between "States" and "Nodes" - probability

I'm designing a small Bayesian Network using the program "Hugin Lite".
The problem is that I have difficulty understanding the difference between "Nodes" (the visual circles) and "States" (which are the "fields" of a node).
I will give one example that is clear to me, and another that I can't understand.
The example I understand:
There are two women (W1 and W2) and one man (M).
M has a child with W1. The child's name is C1.
Then M has a child with W2. The child's name is C2.
The resulting network is:
The four possible STATES of every node (W1, W2, M, C1, C2) are:
AA: the person has two genes "A"
Aa/aA: the person has one gene "A" and one gene "a"
aa: the person has two genes "a"
Now the example that I can't understand:
The data given:
Total (authorized or not) payments made while a person is in a foreign country (travelling): 5% (so the other 95% of transactions are made in the home country)
NOT AUTHORIZED payments while TRAVELLING: 1%
NOT AUTHORIZED payments while in HOME COUNTRY: 0.2%
NOT AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 10%
AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 1%
TOTAL (authorized or not authorized) payments while TRAVELLING and to a FOREIGN country: 90%
What I've drawn is the following.
But I'm not sure whether it's correct. What do you think? Next, I'm supposed to fill in a "probability table" for each node. But what should I write?
Probability table:
Any hint about the correctness of the network and how to fill in the table is really appreciated.

Nodes are random variables (RVs), that is, "things" that can be in different states. Because you are uncertain which state a variable is in, you assign a probability to each state. For example, an RV Person could have the states [Man, Woman] with their corresponding probabilities. If you want to relate it to another RV, Credit Worthiness [Good, Bad], you can "marry" Person and Credit Worthiness to get a table over every combination of their states.
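To make the node/state distinction concrete, here is a minimal Python sketch based on the genetics example from the question. Each node (M, W1, C1, ...) is one random variable, and each can take one of the four states. The function name `child_cpt`, and the convention that the first letter of a child's state comes from the father, are my own assumptions for illustration. Hugin has its own table editor, so this only shows what one row of such a table contains.

```python
from itertools import product

# The four states shared by every node in the genetics example.
STATES = ["AA", "Aa", "aA", "aa"]

def child_cpt(father, mother):
    """Return P(child state | father state, mother state) as a dict.

    Assumption: each parent passes on either of its two genes with
    probability 1/2, and the first letter of the child's state is the
    gene inherited from the father.
    """
    table = {state: 0.0 for state in STATES}
    for gene_f, gene_m in product(father, mother):
        table[gene_f + gene_m] += 0.25
    return table

# One row of the probability table for node C1 (parents M and W1).
print(child_cpt("Aa", "aa"))
```

The node C1 needs one such row for every combination of its parents' states (4 x 4 = 16 rows), while a node without parents, such as M, needs only a single row of prior probabilities over its four states.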

This is homework so I don't want to just tell you the answer. Instead, I'll make an observation, and ask a few questions. The observation is that you want your arrows going from cause to effect.
So. Is the payment authorization status a/the cause of the location? Or is the location a/the cause of the payment authorization?
Also, do you really need four variables for each of travelling, home, foreign, and local? Or might some smaller number of variables suffice?

Related

Finding combination of trainstops through Europe

I am currently working on a school project. The objective is to find a route through Europe from train station to train station, and from country to country, where the names of consecutive train stations start with consecutive letters of the alphabet and each country is used only once. An example route: (Amsterdam Central, Netherlands) --> (Berlin, Germany) --> (Carcassonne, France), etc. Consecutive countries also need to be neighbouring countries. We have received a dataset that lists countries and some of their stations. Some countries don't have a large selection of stations, so it is important to use a certain letter with a certain country, because only a small selection of first letters is available for that country. Can someone give me some guidance on how to tackle this problem? I am currently coding in Python.
cheers!
- Sort countries in order of increasing number of letter choices
- Loop C over countries in order of increasing number of letter choices
- Place C in position for a random letter that it provides
- IF neighbouring positions have been populated, but they are NOT neighbours geographically
- THEN choose a different letter from those available in C
I would use a constraint solver for this. (I'm familiar with CP-SAT because I use it at work, but you have options.)
For each letter from A to Z, create a variable whose domain is the set of stations whose name starts with that letter. Create 26 corresponding country variables and, for each corresponding station variable and country variable, add a table constraint to ensure that they describe the same thing (so the table contains entries like ("Amsterdam Central, Netherlands", "Netherlands"), though you have to translate into integer indices). Add an all-different constraint on the country variables. For each pair of consecutive country variables, add a table constraint that they be neighbors.
The solver contains a powerful deduction engine that will surely pick through the possibilities much faster than brute force or simple heuristics.
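For readers who want to see the constraints before reaching for a solver, here is a minimal sketch that uses plain backtracking in pure Python instead of CP-SAT. It enforces the same three rules (one station per letter, all countries different, consecutive countries neighbouring). The tiny `STATIONS` and `NEIGHBOURS` tables are made-up illustration data, not the dataset from the question.

```python
# Hypothetical toy data: station name -> country, and country adjacency.
STATIONS = {
    "Amsterdam Centraal": "Netherlands",
    "Antwerpen-Centraal": "Belgium",
    "Berlin Hbf": "Germany",
    "Brussel-Zuid": "Belgium",
    "Carcassonne": "France",
    "Cologne Hbf": "Germany",
}
NEIGHBOURS = {
    "Netherlands": {"Belgium", "Germany"},
    "Belgium": {"Netherlands", "Germany", "France"},
    "Germany": {"Netherlands", "Belgium", "France"},
    "France": {"Belgium", "Germany"},
}

def find_route(letters, route=()):
    """Depth-first search for one station per letter, in the given order."""
    if len(route) == len(letters):
        return list(route)
    letter = letters[len(route)]
    used = {STATIONS[s] for s in route}
    for station, country in sorted(STATIONS.items()):
        # Station must start with the current letter, in an unused country.
        if not station.startswith(letter) or country in used:
            continue
        # Consecutive countries must share a border.
        if route and country not in NEIGHBOURS[STATIONS[route[-1]]]:
            continue
        found = find_route(letters, route + (station,))
        if found is not None:
            return found
    return None  # dead end: the caller tries its next candidate

print(find_route("ABC"))
```

On a full A to Z instance this naive search can take a long time, which is where the solver's deduction engine (or at least trying the most constrained letters first) pays off.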

How to manage multiple positive implicit feedbacks?

When there are no ratings, a common approach is to use implicit feedback (items bought, pageviews, clicks, ...) to generate recommendations. I'm using a model-based approach and I'm wondering how to deal with multiple identical feedback events.
As an example, let's imagine that consumers buy items more than once. Should I treat the number of feedback events (pageviews, items bought, ...) as a rating, or compute a custom value?
To model implicit feedback, we usually have a procedure that maps implicit user feedback onto explicit ratings. I guess that in most domains, repeated user actions on the same item indicate that the user's preference for the item is increasing.
This is certainly true if the domain is music or video recommendation. In a shopping site, such a behavior might indicate the item is consumed periodically, e.g., diapers or printer ink.
One way I am aware of to model multiple implicit feedback events is to create a numeric rating mapping function. As the number of implicit feedback events (k) increases, the mapped rating should increase. At k = 1 you have a minimal positive rating, for example 0.6; as k increases, the rating approaches 1. Of course, you don't need to map to [0,1]; you can use integer ratings 0, 1, 2, 3, 4, 5.
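A minimal sketch of such a mapping function in Python is shown here. The exponential form, the name `implicit_to_rating` and the `rate` parameter are my own assumptions. Any increasing function that starts at the minimal positive rating and saturates at 1 would fit the description.

```python
import math

def implicit_to_rating(k, floor=0.6, rate=0.5):
    """Map k >= 1 repeated positive events to a rating.

    The rating equals `floor` at k = 1 and approaches 1 as k grows.
    `rate` (an assumed tuning knob) controls how quickly it saturates.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - floor) * math.exp(-rate * (k - 1))

for k in (1, 2, 5, 10):
    print(k, round(implicit_to_rating(k), 3))
```

In practice `rate` would be tuned per domain, since ten plays of a song and ten purchases of printer ink say rather different things about preference.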
To give you a concrete example of the mapping, here is what they did in a music recommendation domain. In short, they used per-user statistics about the items to define the mapping function.
We assume that the more times the user has listened to an artist the more the user likes that particular artist. Note that user’s listening habits usually present a power law distribution, meaning that a few artists have lots of plays in the users profile, while the rest of the artists have significantly less play counts. Therefore, we compute the complementary cumulative distribution of artist plays in the users’ profile. Artists located in the top 80-100% of the distribution are assigned a score of 5, while artists in the 60-80% range assign a score of 4.
Another way I have seen in the literature is to create another variable besides a binary rating variable. They call it confidence levels. See here for details.
Probably not that helpful for OP any longer, but it might be for others in the same boat.
Evaluating Various Implicit Factors in E-commerce
Modelling User Preferences from Implicit Preference Indicators via Compensational Aggregations
If anyone knows more papers/methods, please share as I'm currently looking for state of the art approaches to this problem. Thanks in advance.
You typically use a sum of clicks, or some weighted sum of events, as a "score" for each user-item pair in implicit feedback systems. It's not a rating, and that's more than a semantic distinction. You won't get good results if you feed these values into a process that expects rating-like values and tries to minimize a squared-error loss.
You treat 3 clicks as adding 3 times the value of 1 click to the user-item interaction strength. Other events, like a purchase, might be weighted much more highly than a click. But in the end it also adds to a sum.
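As a minimal sketch of that weighted sum in Python: the event names and the weights in `EVENT_WEIGHTS` are made-up values for illustration, not recommended settings.

```python
from collections import defaultdict

# Hypothetical event weights: a purchase counts for more than a click.
EVENT_WEIGHTS = {"click": 1.0, "purchase": 5.0}

def interaction_strengths(events):
    """Sum the weighted events for each (user, item) pair."""
    scores = defaultdict(float)
    for user, item, kind in events:
        scores[(user, item)] += EVENT_WEIGHTS[kind]
    return dict(scores)

events = [
    ("u1", "ink", "click"),
    ("u1", "ink", "click"),
    ("u1", "ink", "click"),
    ("u1", "ink", "purchase"),
    ("u2", "ink", "click"),
]
print(interaction_strengths(events))
# {('u1', 'ink'): 8.0, ('u2', 'ink'): 1.0}
```

The resulting numbers are interaction strengths, so they belong in a model built for implicit data rather than in one that treats them as ratings.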

How do you group data objects subject to several constraints?

I'm writing an application for a society on my campus and the main job of this application is to put members into groups subject to several constraints.
I need to determine the number of groups which depends on the number of members.
Then I need to arbitrarily/randomly select leaders for each group.
Then I need to add members to each group ensuring that each group satisfies the following constraints:
The number of members (including the leader) should be less than or equal to 7 and greater than or equal to 4.
No more than 2/3 of the group should be the same gender.
No more than 2/3 of the group should be the same year of study.
Each member is classified as coming from a certain region depending on their place of residency. All members of the group should come from the same region.
Now I'd like to know how to go about this. Which known data structures and abstract data types can I use? Which known algorithms might come in handy? Is there already a known computer science problem similar to mine that I can read up on? I've done some googling but found nothing really helpful so far.
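Whatever algorithm ends up building the groups, the four constraints above can be written down as a single validity check, which is useful both for testing and inside a search. This is only a sketch: the record layout (dicts with `gender`, `year` and `region` keys) and the function name are assumptions, not part of the question.

```python
from collections import Counter
from fractions import Fraction

def group_is_valid(members):
    """Check one group; `members` is a list of dicts (leader included)."""
    size = len(members)
    # Between 4 and 7 members, leader included.
    if not 4 <= size <= 7:
        return False
    # Everyone must come from the same region.
    if len({m["region"] for m in members}) != 1:
        return False
    # No more than 2/3 of the group may share a gender or a year of study.
    limit = Fraction(2, 3) * size
    for key in ("gender", "year"):
        largest = Counter(m[key] for m in members).most_common(1)[0][1]
        if largest > limit:
            return False
    return True
```

Since every member of a group must share a region, the members can be split by region first and each region handled as an independent, smaller problem.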

Logic Implementation: Determining availability by resource type, when a resource can belong to multiple types

Consider a hotel which has multiple room types (e.g. single, double, twin, family), and multiple rooms. Each room can be a combination of room types (e.g. one particular room can be a double/twin room).
The problem I'm facing is how to determine availability of rooms based on what is booked already. Consider a hotel with 2 rooms:
Single / Double
Double / Family
We have a basic availability of:
Single: 1
Double: 2
Family: 1
(Yes, it looks as if there are four rooms, but so long as the availability of a type is at least 1, it can be assigned; that's the premise I'm working on right now.)
In this way, I can sell any combination of rooms, and only when a room availability counter hits zero will it affect the other rooms. E.g. I can sell a double room, and still keep the option of single or family room available. Only when another room is sold will everything close off.
So far, so good.
Except that when I have multiple S/D rooms (e.g. two or more) and sell them separately (e.g. a single, then a double), the counter doesn't reach 0 (so I can't use that as a trigger to close off other rooms), but I've sold the maximum number of physical rooms the hotel has anyway.
Clearly there's some fault in my approach to how I'm determining what's available, and I'd appreciate any pointers if this issue has been resolved before (as pseudo-code for now, I'll translate to MySQL/PHP once I've got my head around it).
Thanks
I managed to resolve this eventually through SQL.
My reservations table holds a room_type_id, and a room_id. Depending on whether a room is assigned, I either join the pivot table and then room_types table, or the room_types table directly using the room_type_id. And then I just SUM() 1 for each tuple which thankfully returns the right amount when you group by room_type.id in the end.
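For anyone who wants the logic spelled out independently of SQL, here is a different way to look at it, not the query described above: treat each booking as needing its own physical room, and call a room type available only if one more booking of that type could still be fitted in. The sketch below uses a simple augmenting-path bipartite matching in Python. The room names and the function names are hypothetical.

```python
def can_place(bookings, rooms):
    """True if every booked type can be given its own physical room."""
    assigned = {}  # room name -> index of the booking occupying it

    def place(i, seen):
        for room, types in rooms.items():
            if bookings[i] in types and room not in seen:
                seen.add(room)
                # Take a free room, or move its current booking elsewhere.
                if room not in assigned or place(assigned[room], seen):
                    assigned[room] = i
                    return True
        return False

    return all(place(i, set()) for i in range(len(bookings)))

def available_types(bookings, rooms):
    """Room types for which one more booking would still fit."""
    all_types = set().union(*rooms.values())
    return {t for t in all_types if can_place(bookings + [t], rooms)}

rooms = {"R1": {"single", "double"}, "R2": {"double", "family"}}
print(available_types(["double"], rooms))
print(available_types(["double", "single"], rooms))
```

With the two-room hotel from the question, one sold double leaves all three types on sale, and a second sale closes everything. The same check handles the case of two S/D rooms sold as a single and a double, where per-type counters never reach zero.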

Privacy and Anonymization "Algorithm"

I read this problem in a book (as an interview question) and wanted to discuss it in detail here. Kindly shed some light on it.
The problem is as follows:
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help researchers. The state spent time removing identifiers such as name, address and Social Security number. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data and, by collating the data in multiple columns, she was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem with some kind of example data set, and see what it actually takes to do this anonymization. I hope the question is clear.
I don't know anyone experienced who can help me with this kind of problem, so please don't vote to close this question; I would be stuck if that happened.
Thanks, and if the question needs more explanation, please ask.
I just copy pasted part of your text, and stumbled upon this
This helps in understanding your problem:
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated, you need a random database, and you need to ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminating information. For example, if you keep only the sex (M/F) in the database, then there is no way to find out who is who, because there are only two possible entries: M and F.
But if you also keep the birthdate, then the total number of possible entries becomes roughly 2*365*80 ≈ 58,000 (I chose 80 years). Even if your database contains 500,000 people, there is a chance that one of them (let's say a male born on 03/03/1985) is the ONLY one with such an entry, and thus you can recognize him.
This is only a simplistic approach that relies on combinatorics. If you want something more complex, look into correlated information and PCA.
Edit: Let's give an example. Suppose I'm working with medical data and I keep only:
The sex: 2 possibilities (M, F)
The blood group: 4 possibilities (O, A, B, AB)
The Rhesus factor: 2 possibilities (+, -)
The state they live in: 50 possibilities (if you're in the USA)
The month of birth: 12 possibilities (affects death rate of babies)
Their age category: 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total of 2*4*2*50*12*10 = 96,000 categories. Thus, if your database contains 200,000,000 entries (a rough approximation of the number of US inhabitants in your database), that is about 2,000 people per category on average, and there is NO WAY you can identify someone.
This also implies that you do not give out any further information: no ZIP code, etc. With only the six pieces of information given, you can compute some nice statistics (do people born in December live longer?), but no identification is possible because 96,000 is far smaller than 200,000,000.
However, if you only have the database of the city you live in, which has for example 200,000 inhabitants, then you cannot guarantee anonymization, because 200,000 is "not much bigger" than 96,000. ("Not much bigger" is a truly complex scientific term that requires knowledge of probability :P )
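If you want to check this on an actual data set rather than by counting categories, a minimal Python sketch looks like this. It computes the number of categories for the six attributes above, and the size of the rarest combination actually present in a list of records, which is the k that the data achieves. The attribute names and the sample records are made up for illustration.

```python
from collections import Counter
from math import prod

# Number of possible values per attribute, as listed in the example.
CARDINALITIES = {"sex": 2, "blood_group": 4, "rhesus": 2,
                 "state": 50, "birth_month": 12, "age_band": 10}

def category_count(cardinalities):
    """Total number of distinct attribute combinations."""
    return prod(cardinalities.values())

def smallest_group(records, quasi_identifiers):
    """Return k: the size of the rarest combination found in the data."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

print(category_count(CARDINALITIES))  # 96000

records = [
    {"sex": "M", "zip": "02138"},
    {"sex": "M", "zip": "02139"},
    {"sex": "F", "zip": "02138"},
    {"sex": "F", "zip": "02139"},
]
print(smallest_group(records, ["sex"]))         # 2
print(smallest_group(records, ["sex", "zip"]))  # 1
```

A result of 1 means at least one person is unique on the released columns. Real data is unevenly spread over the categories, so the smallest group is usually much smaller than the average group.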
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own dataset by finding one on your own, "anonymizing" it, and trying to reconstruct it.
Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving them.
https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information
The document above is written for the rules of the Canadian public health system, but its recommendations are conceptually applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant in the EU, for example, you would need to dig into the specific EU requirements as well as the country requirements, and potentially state/local requirements.
