Entity Relationship Diagram (ERD) - Is "CAN" relationship considered total participation?

In Entity Relationship Diagram (ERD) is the word usage CAN considered total participation or partial participation?
For example, given the statement:
"An apple CAN be eaten by many different students."
In this relationship between the entities apple and student, would the apple be considered to be in total participation or partial participation?

There's no formal rule in the entity-relationship model about the word "CAN". It's up to the data modeler to name relationships and decide the cardinality and participation of the roles.
My common sense says that "CAN" implies partial participation, unless it's accompanied by "MUST". For example, if "An apple CAN be eaten by many different students", then it can also be eaten by no students. If the participation was total, I would expect a word like "MUST" or "SHALL", e.g. "An apple MUST be eaten by at least one student, and CAN be eaten by many different students".
In the case of one-to-one relationships, we can distinguish "An apple CAN be eaten by a student" vs "An apple MUST be eaten by a student". Again, "CAN" sounds like partial participation.

Related

Design of Bayesian networks: Understanding the difference between "States" and "Nodes"

I'm designing a small Bayesian Network using the program "Hugin Lite".
The problem is that I have difficulty understanding the difference between "Nodes" (the visual circles) and "States" (which are the "fields" of a node).
I will write one example where it is clear to me, and another which I can't understand.
The example I understand:
There are two women (W1 and W2) and one man (M).
M has a child with W1. The child's name is C1.
Then M has a child with W2. The child's name is C2.
The resulting network is:
The three possible STATES of every node (W1, W2, M, C1, C2) are:
AA: the person has two genes "A"
Aa/aA: the person has one gene "A" and one gene "a"
aa: the person has two genes "a"
Now the example that I can't understand:
The data given:
Total (authorized or not) payments while a person is in a foreign country (travelling): 5% (so 95% of transactions are made in the home country)
NOT AUTHORIZED payments while TRAVELLING: 1%
NOT AUTHORIZED payments while in HOME COUNTRY: 0.2%
NOT AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 10%
AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 1%
TOTAL (authorized or not) payments while TRAVELLING and to a FOREIGN COMPANY: 90%
What I've drawn is the following.
But I'm not sure whether it's correct. What do you think? Then I'm supposed to fill in a "probability table" for each node. But what should I write?
Probability table:
Any hint about the network's correctness and how to fill in the table is really appreciated.
Nodes are Random Variables (RVs), that is, "things" that can be in different states with some level of uncertainty, so you assign probabilities to those states. For example, an RV Person could have states [Man, Woman] with corresponding probabilities. If you want to relate it to another RV, Credit Worthiness [Good, Bad], you can "marry" Person and Credit Worthiness to get a combination of both RVs and of their states.
This is homework, so I don't want to just tell you the answer. Instead, I'll make an observation and ask a few questions. The observation is that you want your arrows going from cause to effect.
So. Is the payment authorization status a/the cause of the location? Or is the location a/the cause of the payment authorization?
Also, do you really need four variables for each of travelling, home, foreign, and local? Or might some smaller number of variables suffice?
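To make "probability table" concrete without giving away the homework, here is a minimal sketch of a two-node network, Location → Authorized, built from the percentages in the question; the variable names are my own:

```python
# Prior table for Location, and conditional table for Authorized given Location,
# taken from the numbers in the question.
p_location = {"travelling": 0.05, "home": 0.95}
p_not_auth_given_location = {"travelling": 0.01, "home": 0.002}

# Marginalize out Location to get the overall rate of unauthorized payments.
p_not_auth = sum(p_location[loc] * p_not_auth_given_location[loc]
                 for loc in p_location)
print(p_not_auth)  # 0.05*0.01 + 0.95*0.002 = 0.0024
```

Each node in the full network gets one such table: a prior table if it has no parents, or one row per combination of parent states otherwise.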

Multi Attribute Matching of Profiles

I am trying to solve a problem of a dating site. Here is the problem
Each user of app will have some attributes - like the books he reads, movies he watches, music, TV show etc. These are defined top level attribute categories. Each of these categories can have any number of values. e.g. in books : Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data with a reverse index, i.e. each of Fountain Head, Love Story etc. is an index key pointing to the set of users with that attribute.
When a new user joins, get the attributes of this user, find the index keys for this user, get all the users for those keys, and bucket sort (or radix sort, or similar) the merged list by how many times each user appears in it.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating a complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes with every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (like a book, a song etc) to the people who have that in their profile. Then, say you want to find matches of guy g:
for each interest i in g's interests:
    for each person p in the list for i:
        if p and g have mismatching sexual preferences:
            continue
        if p is already in g's match list:
            g->match_list[p].score += i->match_weight
        else:
            add p to g->match_list with score i->match_weight
sort g->match_list based on score
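A minimal runnable sketch of the pseudocode above, assuming a dict-based inverted index; the index contents, the weights, and the `compatible` check are all placeholders:

```python
from collections import defaultdict

# Hypothetical inverted index: interest -> users who list it.
index = {"fountainhead": {"alice", "bob"}, "jazz": {"bob", "carol"}}
# Hypothetical weights: similarity in books counts more than in music.
weights = {"fountainhead": 3.0, "jazz": 1.0}

def ranked_matches(interests, compatible=lambda person: True):
    scores = defaultdict(float)
    for interest in interests:
        for person in index.get(interest, ()):
            if not compatible(person):  # e.g. mismatching preferences
                continue
            scores[person] += weights[interest]
    # Highest combined weight first.
    return sorted(scores, key=scores.get, reverse=True)

print(ranked_matches({"fountainhead", "jazz"}))  # ['bob', 'alice', 'carol']
```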
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example, a certain set of matching interests may have more (or in some cases less) effect than the sum of them individually. Also, a single interest may result in outright rejection by the other person no matter what other matching interests exist (take two very similar people where one loves and the other hates Twilight, for example).

How to create persistent data model for the future?

I have a recurring concern about keeping a data model consistent across future changes while staying backwards compatible.
Assume an application that cycles through periods (each year), with part of its data organized hierarchically, where the hierarchy may or may not change between periods. Some implementations simply use a separate database for each cycle, but then there is the problem of making data interoperate across cycles. How can we keep this hierarchy and its changes for each cycle, without having to store the entire hierarchy every cycle? It does not necessarily change, much less change completely, but there is that possibility.
Example:
In an academic information system, we have a hierarchy of subjects in each knowledge area:
Mathematics
  Algebra
  Trigonometry
  Arithmetic
Social Sciences
  History
  Geography
Now, based on this hierarchy, the system keeps the grades of each student for the 2010 period. In the following period, 2011, the hierarchy changes:
Mathematics
  Trigonometry
  Arithmetic
  Algebra /* here's a change */
  Algebra
Social Sciences
  History
  Geography
or
Mathematics
  Trigonometry
  Arithmetic
  /* here's another change: no more Algebra */
Social Sciences
  History
  Geography
The system keeps working and continues to record student grades in the 2011 period. Now a student needs a rating from a past period, but the hierarchy has changed. How can the system retrieve the previous hierarchy?
How can I fix this problem?
Here is a modeling suggestion: a subject entity should have attributes
subject_id (unique primary key)
name
superordinate_subject_id (if empty, you have a top node in your hierarchy)
lifetime (from_year, to_year; when to_year is empty, it is the currently active subject)
Subjects of the same name should not have overlapping lifetimes. Every time you change an active subject in the hierarchy, make a copy of the subject and adjust the lifetime fields accordingly. As long as the hierarchy does not change, you have nothing to change in your data.
To match your example:
subject Mathematics: lifetime from_year=2010, to_year=NULL
  Algebra: lifetime from_year=2010, to_year=2010
  Trigonometry: lifetime from_year=2010, to_year=NULL
  Arithmetic: lifetime from_year=2010, to_year=NULL
subject Algebra: lifetime from_year=2011, to_year=NULL
  Algebra: lifetime from_year=2011, to_year=NULL
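A minimal sketch of the year-based lookup this scheme allows, assuming subjects are stored as (subject_id, name, parent_id, from_year, to_year) rows as suggested above; the concrete rows are illustrative:

```python
# (subject_id, name, parent_id, from_year, to_year); to_year=None means active.
subjects = [
    (1, "Mathematics",  None, 2010, None),
    (2, "Algebra",      1,    2010, 2010),  # old position, closed after 2010
    (3, "Trigonometry", 1,    2010, None),
    (4, "Arithmetic",   1,    2010, None),
    (5, "Algebra",      None, 2011, None),  # new copy in its new position
]

def active_in(year):
    """Rows whose lifetime covers the given year."""
    return [s for s in subjects
            if s[3] <= year and (s[4] is None or year <= s[4])]

print([s[0] for s in active_in(2010)])  # [1, 2, 3, 4]
print([s[0] for s in active_in(2011)])  # [1, 3, 4, 5]
```

A grade recorded against subject_id 2 in 2010 stays valid forever: the row is closed, never deleted.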
Another option is to have a single "year" field in your subject instead of a lifetime; that may be a much simpler solution, better suited to the case where you want to store a different grade for each student per subject per year. But that would mean storing the entire hierarchy each cycle, which is what you excluded.
Don't mix up the identity of each subject with its position in the hierarchy.
If I got a B+ in Algebra in 2010, that data looks like:
ClassID  StudentID  Grade
100      100001     B+
The ID of the 'Algebra' class shouldn't change when your categories change.

Privacy and Anonymization "Algorithm"

I read this problem in a book (Interview Question), and wanted to discuss this problem, in detail over here. Kindly throw some lights on it.
The problem is as follows:-
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help researchers. The state spent time removing identifiers such as name, address and social security number. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data and, by collating the data in multiple columns, she was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem with some kind of example data set, and see what it actually takes to do this anonymization. I hope the question is clear.
I have no experienced person who can help me deal with such problems. Kindly don't vote to close this question, as I would be helpless if that happens.
Thanks, and if any more explanation of the question is required, kindly shoot with the questions.
I just copy-pasted part of your text and stumbled upon this, which helps in understanding your problem:
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated it, you need a random database, and ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminative information. For example, if you keep only the sex (M/F) in the database, then there is no way to find out who is who, because there are only two kinds of entries: M and F.
But if you add the birthdate, then your total number of possible entries becomes roughly 2*365*80 ≈ 58,000 (I chose 80 years). Even if your database contains 500,000 people, there is a chance that one of them (say, a male born on 03/03/1985) is the ONLY one with such an entry, and thus you can recognize him.
This is only a simplistic approach that relies on combinatorics. If you want something more sophisticated, look into correlated information and PCA.
Edit : Let's give an example. Let's suppose I'm working with medical things. If I keep only
The sex : 2 possibilities (M, F)
The blood group : 4 possibilities (O, A, B, AB)
The rhesus : 2 possibilities (+, -)
The state they're living in : 50 possibilities (if you're in the USA)
The month of birth : 12 possibilities (affects death rate of babies)
Their age category : 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total number of categories of 2*4*2*50*12*10 = 96,000. Thus, if your database contains 200,000,000 entries (a rough approximation of the number of inhabitants of the USA in your database), there is NO WAY you can identify someone.
This also implies that you do not give out any further information: no ZIP code, etc. With only the six attributes given, you can compute some nice statistics (do people born in December live longer?) but no identification is possible, because 96,000 is much smaller than 200,000,000.
However, if you only have the database of the city you live in, which has, say, 200,000 inhabitants, then you cannot guarantee anonymization, because 200,000 is "not much bigger" than 96,000. ("Not much bigger" is a truly complex scientific term that requires knowledge of probabilities :P )
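The counting argument above is exactly the idea behind k-anonymity: the guarantee is only as strong as the smallest group of records sharing the same quasi-identifier values. A toy sketch, with made-up records:

```python
from collections import Counter

# Toy records; (sex, blood group, state) play the role of quasi-identifiers.
records = [
    ("M", "O", "MA"), ("M", "O", "MA"),
    ("F", "A", "MA"), ("F", "A", "MA"), ("F", "A", "MA"),
    ("M", "AB", "NY"),
]

def k_anonymity(rows):
    """Smallest group size over all quasi-identifier combinations."""
    return min(Counter(rows).values())

print(k_anonymity(records))  # 1: the ("M", "AB", "NY") record is unique
```

A k of 1 means at least one person is uniquely identifiable; you would generalize or suppress values until k reaches an acceptable size.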
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own data set by finding one on your own, "anonymizing" it, and trying to reconstruct it.
Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving them.
https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information
The jurisdiction for the document above is within the rules of the Canadian public health system, but they are conceptually applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant, with the EU, for example, you would need to dig into their specific EU requirements as well as the country requirements and potentially State/local requirements.

Algorithm for Rating Objects Based on Amount of Votes and 5 Star Rating

I'm creating a site whereby people can rate an object of their choice by allotting a star rating (say 5 star rating). Objects are arranged in a series of tags and categories eg. electronics>graphics cards>pci express>... or maintenance>contractor>plumber.
If another user searches for a specific category or tag, the results must return the highest-rated objects in that category. However, the system would be flawed if a single user's 5-star vote for one object outranked another object that 1000 users rated 4.5 stars on average. Obviously, credibility should go to the object rated by 1000 users rather than the object evaluated by one user, even though it has a "lower" score.
Conversely, it is more reasonable to trust an object with 500 ratings averaging 4.8 than an object with 1000 ratings averaging 4.5, for example.
What algorithm can achieve this weighting?
A great answer to this question is here:
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
You can use the Bayesian average when sorting by recommendation.
I'd be tempted to have a cutoff (say, fifty votes though this is obviously traffic dependent) before which you consider the item as unranked. That would significantly reduce the motivation for spam/idiot rankings (especially if each vote is tied to a user account), and also gets you a simple, quick to implement, and reasonably reliable system.
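A minimal sketch of the Bayesian average mentioned above, using a hypothetical prior of 3.5 stars weighted like the fifty-vote cutoff; both parameters are tuning knobs, not prescribed values:

```python
def bayesian_average(ratings, prior_mean=3.5, prior_weight=50):
    """Pull sparsely-rated items toward the prior; many votes dominate it."""
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))

lone_five_star = bayesian_average([5.0])         # barely moves off the 3.5 prior
crowd_rated    = bayesian_average([4.5] * 1000)  # stays close to 4.5
print(lone_five_star < crowd_rated)  # True
```

This directly resolves the 1-vote-versus-1000-votes problem in the question: the lone 5-star object ranks below the heavily-rated 4.5-star one.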
sigmoid_function(value) = 1 / (1 + e^(-value))
rating = sigmoid_function(number_of_voters) + sigmoid_function(average_rating)
