How do I extract certain information from a Prolog database of a taxi company? - prolog

I'm working with a database or knowledge base, whatever you call it in Prolog.
It is filled with facts about clients and jobs of a taxi company:
client(id, name, surname, list_of_jobs).
The list of jobs is filled with facts in this form:
job(from, to, price_per_km, date(day,month,year), no_of_car).
If there is a direct link between two locations, the database can contain a fact like this:
distance(loc_a, loc_b, km).
The drivers always choose the shortest path from point A to point B.
We need:
count_loc(Loc, N) - A predicate which counts how many times a location has been an arrival or departure point in the whole database, stores that number in N and returns N.
most_km(X,Y) - Returns the name and surname of the client who has accumulated the most mileage while driving with this company.
most_money_made(X) - Returns the no_of_car of the taxi which made the most money in December 2015.
This is way out of my league, but I need to solve it, so if anyone knows how and wants to help, I'd be thrilled.
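To make the three requirements concrete, here is a rough sketch of the logic in Python rather than Prolog. The sample data is made up, and it assumes that distance facts can be travelled in both directions and that the length of a job is the shortest path between its two locations. The function names mirror the predicates above, but everything else is illustrative only.

```python
import heapq
from collections import defaultdict

# Hypothetical sample data mirroring the client/4, job/5 and distance/3 facts.
clients = [
    (1, "Ann", "Lee", [("a", "c", 2.0, (5, 12, 2015), 7)]),
    (2, "Bob", "Ray", [("a", "b", 1.5, (9, 12, 2015), 8),
                       ("b", "a", 1.5, (3, 11, 2015), 7)]),
]
distances = [("a", "b", 4), ("b", "c", 3), ("a", "c", 10)]

def count_loc(loc):
    """Count how often loc is a departure or arrival point over all jobs."""
    return sum((frm == loc) + (to == loc)
               for _, _, _, jobs in clients
               for frm, to, *_ in jobs)

def shortest_km(src, dst):
    """Dijkstra over the distance facts (assumed to be bidirectional)."""
    graph = defaultdict(list)
    for a, b, km in distances:
        graph[a].append((b, km))
        graph[b].append((a, km))
    best = {src: 0}
    queue = [(0, src)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == dst:
            return d
        if d > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, km in graph[node]:
            nd = d + km
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                heapq.heappush(queue, (nd, nxt))
    raise ValueError(f"no route from {src} to {dst}")

def most_km():
    """Name and surname of the client with the largest total mileage."""
    totals = {(name, surname): sum(shortest_km(f, t) for f, t, *_ in jobs)
              for _, name, surname, jobs in clients}
    return max(totals, key=totals.get)

def most_money_made(month, year):
    """Car number with the highest earnings in the given month."""
    earned = defaultdict(float)
    for _, _, _, jobs in clients:
        for frm, to, price, (_, m, y), car in jobs:
            if (m, y) == (month, year):
                earned[car] += price * shortest_km(frm, to)
    return max(earned, key=earned.get)
```

In Prolog the same structure would probably come out as a findall/aggregate over the client facts plus a shortest-path predicate over distance/3.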

Related

Finding combination of trainstops through Europe

I am currently working on a school project. The objective is to find a route through Europe from train station to train station and from country to country, where the names of consecutive stations have to start with consecutive letters of the alphabet and each country can only be used once. To give an example route: (Amsterdam Central, Netherlands) --> (Berlin, Germany) --> (Carcassonne, France) etc. Consecutive countries also need to be neighbouring countries. We have received a dataset in which countries and some of their stations are listed. Some of the countries don't have a large selection of stations, so it is important that a certain letter is used with a certain country, because only a small selection of first letters is present for that country. Can someone give me some guidance on how to tackle this problem? I am currently coding in Python.
cheers!
- Sort countries in order of increasing number of letter choices
- Loop C over countries in order of increasing number of letter choices
- Place C in position for a random letter that it provides
- IF neighbouring positions have been populated, but they are NOT neighbours geographically
- THEN choose a different letter for those available in C
I would use a constraint solver for this. (I'm familiar with CP-SAT because I use it at work, but you have options.)
For each letter from A to Z, create a variable whose domain is the set of stations whose name starts with that letter. Create 26 corresponding country variables and, for each corresponding station variable and country variable, add a table constraint to ensure that they describe the same thing (so the table contains entries like ("Amsterdam Central, Netherlands", "Netherlands"), though you have to translate into integer indices). Add an all-different constraint on the country variables. For each pair of consecutive country variables, add a table constraint that they be neighbors.
The solver contains a powerful deduction engine that will surely pick through the possibilities much faster than brute force or simple heuristics.
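For comparison, the same three constraints (one station per letter, all countries different, consecutive countries neighbouring) can be written as a plain backtracking search. This is not CP-SAT and will not scale as well as a real solver; it is only a sketch to make the constraints concrete. The station list and the neighbour map are hypothetical stand-ins for the dataset.

```python
def find_route(stations, neighbours, letters):
    """Return one station per letter, or None if no valid route exists.

    stations:   list of (station_name, country) pairs
    neighbours: dict mapping a country to the set of bordering countries
    letters:    the consecutive initial letters to cover, e.g. "ABC"
    """
    by_letter = {c: [s for s in stations if s[0].upper().startswith(c)]
                 for c in letters}

    def extend(route, used):
        if len(route) == len(letters):
            return route
        letter = letters[len(route)]
        for name, country in by_letter[letter]:
            if country in used:
                continue  # all-different constraint on countries
            if route and country not in neighbours[route[-1][1]]:
                continue  # consecutive countries must share a border
            found = extend(route + [(name, country)], used | {country})
            if found is not None:
                return found
        return None  # dead end, backtrack

    return extend([], frozenset())


# Hypothetical toy data, not the course dataset.
stations = [
    ("Amsterdam Central", "Netherlands"), ("Antwerp", "Belgium"),
    ("Berlin", "Germany"), ("Brussels", "Belgium"),
    ("Carcassonne", "France"), ("Cologne", "Germany"),
]
neighbours = {
    "Netherlands": {"Belgium", "Germany"},
    "Belgium": {"Netherlands", "Germany", "France"},
    "Germany": {"Netherlands", "Belgium", "France"},
    "France": {"Belgium", "Germany"},
}
route = find_route(stations, neighbours, "ABC")
```

Trying the letters with the fewest candidate stations first, as the other answer suggests, would prune this search earlier.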

Class Scheduling to Boolean satisfiability [Polynomial-time reduction] part 2

A few days ago I asked a question about how to transform a University Class Scheduling Problem into a Boolean Satisfiability Problem.
(Class Scheduling to Boolean satisfiability [Polynomial-time reduction])
I got an answer from @Amit that was very elegant and easy to code.
Basically, his answer was this: instead of considering courses, he considered time intervals.
So for the i-th course, he just listed all the possible intervals for this course. We obtain a solution when at least one interval is true for every course and no chosen interval overlaps another.
This method works very well when we consider only courses and nothing else. I generalized it by encoding the room inside the interval.
For example, instead of [8-10] to say that a course can happen between 8am and 10am,
I used [0.00801 - 0.01001] to say that a course can happen between 8am and 10am in room 1.
I'm sure you are currently wondering "why use doubles?" Well, this is where my problem comes in:
To generalize this method further, I also encode the number of the teacher in this interval.
I used [1.00801 - 1.01001] to say that a course can happen between 8am and 10am in room 1 and be taught by teacher n°1.
Here is what I have so far:
With this encoding, [1.008XX - 1.010XX] can happen at the same time as [2.008YY - 2.010YY]. That is correct: if teacher 1 is teaching in room XX between 8am and 10am, teacher 2 can also teach in room YY between 8am and 10am, if and only if that room is available.
The problem is that with this method I cannot ensure that XX and YY are different and that YY is available, because [1.008XX - 1.010XX] doesn't overlap [2.008XX - 2.010XX], so for now the solver considers this possible.
And I still don't have any clue how to ensure this using the interval method...
I need a way to encode {interval, room and teacher-id} such that:
a teacher cannot be in 2 places in the same interval.
there cannot be 2 teachers in the same room for the same interval.
there is at least 1 interval true by course.
Thanks in advance for your help,
Best regards !
Follow up question: Class Scheduling to Boolean satisfiability [Polynomial-time reduction] Final Part
This answer is an extension of Part 1's answer, and uses the same notation where possible.
OK, assume each interval is assigned to one teacher (if more than one teacher can take the interval, just have multiple instances of it, with a different teacher per instance). Then, to indicate that teacher t teaches in classroom p from time x to y, we can use the old variable that says this class is given, V_{i,j}, for the class and interval.
For each teacher t , and for each pair of intervals c=(x1,y1), d=(x2,y2) in classes (a,b) the teacher might participate in, add the clause:
Q_{t,i,j} = Not(V_ac) OR Not(V_bd) OR Smaller(y1,x2) OR Smaller(y2,x1)
Intuitively, the above clause guarantees that a teacher cannot be in two places at the same time: no two intervals assigned to the same teacher overlap.
By chaining the clause for each pair (i,j) and each teacher t with AND to the original formula, you satisfy your first constraint, "a teacher cannot be in 2 places in the same interval", since each teacher cannot be in two places at the same time.
Your second constraint, "there cannot be 2 teachers in the same room for the same interval", is also satisfied, because there cannot be two classes that overlap in both time and room.
The 3rd constraint, "there is at least 1 interval true by course", is satisfied by the F1 clause, since you have to choose at least one interval (with one teacher assigned) for each course.
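As a rough illustration of the clause construction, the sketch below builds the three kinds of clauses for a tiny made-up instance and checks them by brute force. Each candidate is one (course, interval, room, teacher) instance with its own Boolean variable. Because the overlap test is known when the clauses are built, a conflict clause is only emitted for pairs that really overlap. The data and all names are hypothetical, and a real instance would be passed to a SAT solver instead of being enumerated.

```python
from itertools import combinations, product

# Hypothetical candidate instances: (course, start, end, room, teacher).
candidates = [
    ("math", 8, 10, 1, "t1"),
    ("math", 10, 12, 1, "t1"),
    ("physics", 8, 10, 1, "t2"),
    ("physics", 8, 10, 2, "t1"),
]

def overlaps(a, b):
    """True if the two time intervals share more than an endpoint."""
    return a[1] < b[2] and b[1] < a[2]

def build_clauses(cands):
    """Return CNF clauses as lists of (variable_index, polarity) literals."""
    clauses = []
    # At least one candidate must be chosen for every course.
    for course in sorted({c[0] for c in cands}):
        clauses.append([(i, True) for i, c in enumerate(cands) if c[0] == course])
    # Two overlapping candidates may not share a room or a teacher:
    # Not(V_i) OR Not(V_j).
    for (i, a), (j, b) in combinations(enumerate(cands), 2):
        if overlaps(a, b) and (a[3] == b[3] or a[4] == b[4]):
            clauses.append([(i, False), (j, False)])
    return clauses

def brute_force(clauses, n_vars):
    """Try every assignment; only viable for toy instances."""
    for assignment in product([False, True], repeat=n_vars):
        if all(any(assignment[i] == pol for i, pol in clause) for clause in clauses):
            return assignment
    return None

clauses = build_clauses(candidates)
solution = brute_force(clauses, len(candidates))
chosen = [c for c, picked in zip(candidates, solution) if picked]
```

Here teacher t1 ends up teaching both courses, but in non-overlapping intervals, which is exactly what the pairwise clauses allow.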

Logic Implemention: Determining availability by resource type, when a resource can belong to multiple types

Consider a hotel which has multiple room types (e.g. single, double, twin, family), and multiple rooms. Each room can be a combination of room types (e.g. one particular room can be a double/twin room).
The problem I'm facing is how to determine availability of rooms based on what is booked already. Consider a hotel with 2 rooms:
Single / Double
Double / Family
We have a basic availability of:
Single: 1
Double: 2
Family: 1
(yes, it seems like there are four rooms, but so long as the availability is at least 1, a room can be assigned; that's the premise I'm working on right now)
In this way, I can sell any combination of rooms, and only when a room availability counter hits zero will it affect the other rooms. E.g. I can sell a double room, and still keep the option of single or family room available. Only when another room is sold will everything close off.
So far, so good.
Except that when I have multiple S/D rooms (e.g. two or more) and sell them separately (e.g. a single, then a double), the counter doesn't reach 0 (so I can't use that as a trigger to close off other rooms), but I've sold the maximum number of physical rooms the hotel has anyway.
Clearly there's some fault in my approach to how I'm determining what's available, and I'd appreciate any pointers if this issue has been resolved before (as pseudo-code for now, I'll translate to MySQL/PHP once I've got my head around it).
Thanks
I managed to resolve this eventually through SQL.
My reservations table holds a room_type_id and a room_id. Depending on whether a room is assigned, I either join the pivot table and then the room_types table, or join the room_types table directly using the room_type_id. Then I just SUM() 1 for each tuple, which thankfully returns the right amount when you group by room_type.id at the end.
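A different way to look at the underlying problem, not the SQL approach described above, is as a matching problem: a set of bookings is feasible exactly when every booking can be given its own physical room of a compatible type. The sketch below uses a simple augmenting-path bipartite matching to test that. The room data and function names are hypothetical.

```python
def can_accommodate(rooms, bookings):
    """Check whether every booking can get its own compatible room.

    rooms:    dict mapping room id to the set of types it can be sold as
    bookings: list of requested room types, one entry per booking
    """
    assigned = {}  # room id -> index of the booking currently holding it

    def place(b, seen):
        # Try to give booking b a room, moving earlier bookings if needed.
        for room, types in rooms.items():
            if bookings[b] in types and room not in seen:
                seen.add(room)
                if room not in assigned or place(assigned[room], seen):
                    assigned[room] = b
                    return True
        return False

    return all(place(b, set()) for b in range(len(bookings)))

def available_types(rooms, bookings):
    """Room types that can still be sold on top of the existing bookings."""
    all_types = set().union(*rooms.values())
    return {t for t in all_types if can_accommodate(rooms, bookings + [t])}


# The two-room hotel from the question.
hotel = {1: {"single", "double"}, 2: {"double", "family"}}
# Two single/double rooms, the case where per-type counters break down.
two_sd = {1: {"single", "double"}, 2: {"single", "double"}}
```

With two_sd and the bookings ["single", "double"], available_types returns an empty set, which is the "hotel is full" signal the counters failed to give.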

Design of Bayesian networks: Understanding the difference between "States" and "Nodes"

I'm designing a small Bayesian Network using the program "Hugin Lite".
The problem is that I have difficulty understanding the difference between "Nodes" (the visual circles) and "States" (which are the "fields" of a node).
I will write one example where it is clear to me, and another which I can't understand.
The example I understand:
There are two women (W1 and W2) and one man (M).
M has a child with W1. The child's name is C1.
Then M has a child with W2. The child's name is C2.
The resulting network is:
The four possible STATES of every node (W1, W2, M, C1, C2) are:
AA: the person has two genes "A"
Aa/aA: the person has one gene "A" and one gene "a"
aa: the person has two genes "a"
Now the example that I can't understand:
The data given:
Total(authorized or not) of payments while a person is in a foreign country (travelling): 5% (of course the 95% of transactions are transactions made in the home country)
NOT AUTHORIZED payments while TRAVELLING: 1%
NOT AUTHORIZED payments while in HOME COUNTRY: 0,2%
NOT AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 10%
AUTHORIZED payments while in HOME COUNTRY and to a FOREIGN COMPANY: 1%
TOTAL (authorized or not authorized) payments while TRAVELLING and to a FOREIGN country: 90%
What I've drawn is the following.
But I'm not sure whether it's correct. What do you think? Then I'm supposed to fill in a "probability table" for each node. But what should I write?
Probability table:
Any hint about the correctness of the network and how to fill in the table is really appreciated.
Nodes are random variables (RVs), that is, "things" that can be in different states with some uncertainty, so you assign probabilities to those states. For example, an RV Person could have the states [Man, Woman] with their corresponding probabilities. If you want to relate it to another RV, Credit Worthiness [Good, Bad], you can "marry" Person and Credit Worthiness to get a combination of both RVs and of their states.
This is homework so I don't want to just tell you the answer. Instead, I'll make an observation, and ask a few questions. The observation is that you want your arrows going from cause to effect.
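To see the node/state distinction on the genetics example from the question: each person is one node, the genotypes are that node's states, and the probability table of a child node has one row per combination of the parents' states. The sketch below computes such a row under standard Mendelian inheritance. It merges Aa and aA into a single state, which is a simplifying assumption, and the function names are made up.

```python
from itertools import product

# States shared by every node (person) in the genetics network.
STATES = ("AA", "Aa", "aa")

def gamete_probs(genotype):
    """Probability that a parent passes on allele 'A' or allele 'a'."""
    p_big = genotype.count("A") / 2
    return {"A": p_big, "a": 1 - p_big}

def child_distribution(mother, father):
    """One row of the child's probability table: P(child state | parent states)."""
    dist = {state: 0.0 for state in STATES}
    from_mother, from_father = gamete_probs(mother), gamete_probs(father)
    for (am, pm), (af, pf) in product(from_mother.items(), from_father.items()):
        child = "".join(sorted(am + af))  # 'A' sorts before 'a', so aA becomes Aa
        dist[child] += pm * pf
    return dist

# The full table for a child node: one row per pair of parent states.
cpt = {(m, f): child_distribution(m, f) for m, f in product(STATES, STATES)}
```

The nodes W1, M and C1 each carry the same three states; what differs is the table attached to each node, and for C1 that table is indexed by the states of W1 and M.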
So. Is the payment authorization status a/the cause of the location? Or is the location a/the cause of the payment authorization?
Also, do you really need four variables for each of travelling, home, foreign, and local? Or might some smaller number of variables suffice?

Privacy and Anonymization "Algorithm"

I read this problem in a book (as an interview question) and wanted to discuss it in detail here. Kindly shed some light on it.
The problem is as follows:-
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help researchers. The state spent time removing identifiers such as name, address and social security no. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data and, by collating the data in multiple columns, she was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem with some kind of example set, and see what it actually takes to do this anonymization. I hope the question is clear.
I don't know anyone experienced who can help me with this kind of problem. Kindly don't vote to close this question, as I would be helpless if that happens.
Thanks, and if any more explanation of the question is required, kindly ask.
I just copy pasted part of your text, and stumbled upon this
This helps understanding your problem :
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated, you need a random database, and you need to ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminative information. For example, if you keep only the sex (M/F) in the database, then there is no way to find out who is who, because there are only two possible entries: M and F.
But if you also keep the birthdate, then the total number of possible entries becomes roughly 2*365*80 ~= 58.000 (I chose 80 years). Even if your database contains 500.000 people, there is a chance that one of them (let's say a male born on 03/03/1985) is the ONLY one with that entry, and thus you can recognize him.
This is only a simplistic approach that relies on combinatorics. If you want something more complex, look into correlated information and PCA.
Edit : Let's give an example. Let's suppose I'm working with medical things. If I keep only
The sex : 2 possibilities (M, F)
The blood group : 4 possibilities (O, A, B, AB)
The rhesus : 2 possibilities (+, -)
The state they're living in : 50 possibilities (if you're in the USA)
The month of birth : 12 possibilities (affects death rate of babies)
Their age category : 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total of 2*4*2*50*12*10 = 96.000 categories. Thus, if your database contains 200.000.000 entries (a rough approximation of the number of inhabitants of the USA that are in your database), there is practically NO WAY you can identify someone.
This also implies that you do not give out any further information: no ZIP code, etc. With only the 6 pieces of information given, you can compute some nice statistics (do people born in December live longer?) but there is no identification possible, because 96.000 is far smaller than 200.000.000.
However, if you only have the database of the city you live in, which has for example 200.000 inhabitants, then you cannot guarantee anonymization, because 200.000 is "not much bigger" than 96.000. ("not much bigger" is a truly complex scientific term that requires knowledge of probabilities :P )
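To experiment with this on a small example set, one option is to group the records by the columns you intend to publish and look at the smallest group: that size is the k the release actually achieves. The records and column names below are invented for illustration only.

```python
from collections import Counter
from math import prod

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same published values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Invented toy records; the column names are illustrative assumptions.
records = [
    {"sex": "M", "zip": "Z1", "birth_year": 1945},
    {"sex": "M", "zip": "Z1", "birth_year": 1950},
    {"sex": "F", "zip": "Z2", "birth_year": 1950},
    {"sex": "F", "zip": "Z2", "birth_year": 1962},
]

k_sex = k_anonymity(records, ["sex"])
k_sex_zip = k_anonymity(records, ["sex", "zip"])
k_all = k_anonymity(records, ["sex", "zip", "birth_year"])

# Number of possible categories in the six-attribute example above.
n_categories = prod([2, 4, 2, 50, 12, 10])
```

As soon as k drops to 1, at least one person is alone in their group and can be re-identified by anyone who knows those values from another source, such as a voter roll.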
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own dataset by finding one alone, "anonymizing" it, and trying to reconstruct it.
Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving them.
https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information
The document above is written for the rules of the Canadian public health system, but its ideas are conceptually applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant in the EU, for example, you would need to dig into the specific EU requirements as well as the country requirements and potentially state/local requirements.
