Database design for bus reservation - algorithm

I'm developing a reservation module for buses and I have trouble designing the right database structure for it.
Let's take following case:
Buses go from A to D with stopovers at B and C. A passenger can reserve a ticket for any part of the route, i.e. from A to B, C to D, A to D, etc.
So each route can have many "subroutes", and bigger ones contain smaller ones.
I want to design a table structure for routes and stops in a way that makes it easy to search for free seats. So if someone reserves a seat from A to B, that seat should still be available from B to C or D.
All ideas would be appreciated.

I'd probably go with a "brute force" structure similar to this basic idea:
(There are many more fields that should exist in the real model. This is only a simplified version containing the bare essentials necessary to establish relationships between tables.)
The ticket "covers" stops through the TICKET_STOP table. For example, if a ticket covers 3 stops, then TICKET_STOP will contain 3 rows related to that ticket. If there are 2 other stops not covered by that ticket, then there will be no related rows there, but there is nothing preventing a different ticket from covering these stops.
Liberal usage of natural keys / identifying relationships ensures two tickets cannot cover the same seat/stop combination. Look at how LINE.LINE_ID "migrates" along both edges of the diamond-shaped dependency, only to be merged at its bottom, in the TICKET_STOP table.
This model, by itself, won't protect you from anomalies such as a single ticket "skipping" some stops - you'll have to enforce some rules through the application logic. But, it should allow for a fairly simple and fast determination of which seats are free for which parts of the trip, something like this:
SELECT *
FROM
    STOP CROSS JOIN SEAT
WHERE
    STOP.LINE_ID = :line_id
    AND SEAT.BUS_NO = :bus_no
    AND NOT EXISTS (
        SELECT *
        FROM TICKET_STOP
        WHERE
            TICKET_STOP.LINE_ID = :line_id
            AND TICKET_STOP.BUS_NO = :bus_no
            AND TICKET_STOP.TRIP_NO = :trip_no
            AND TICKET_STOP.SEAT_NO = SEAT.SEAT_NO
            AND TICKET_STOP.STOP_NO = STOP.STOP_NO
    )
(Replace the parameter prefix : with what is appropriate for your DBMS.)
This query essentially generates all combinations of stops and seats for the given line and bus, then discards those that are already "covered" by some ticket on the given trip. The combinations that remain "uncovered" are free for that trip.
You can easily add STOP.STOP_NO IN ( ... ) or SEAT.SEAT_NO IN ( ... ) to the WHERE clause to restrict the search to specific stops or seats.
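To try the query end to end, here is a minimal sketch using Python's built-in sqlite3 module. The table layout is an assumption reduced to the columns the query touches (the real model has more tables and fields), and the sample line, bus, trip and ticket numbers are made up.

```python
import sqlite3

# Hypothetical minimal schema: only the columns the query needs.
SCHEMA = """
CREATE TABLE STOP (LINE_ID INTEGER, STOP_NO INTEGER, PRIMARY KEY (LINE_ID, STOP_NO));
CREATE TABLE SEAT (BUS_NO INTEGER, SEAT_NO INTEGER, PRIMARY KEY (BUS_NO, SEAT_NO));
CREATE TABLE TICKET_STOP (
    LINE_ID INTEGER, BUS_NO INTEGER, TRIP_NO INTEGER,
    SEAT_NO INTEGER, STOP_NO INTEGER, TICKET_NO INTEGER,
    -- this key prevents two tickets from covering the same seat/stop on a trip
    PRIMARY KEY (LINE_ID, BUS_NO, TRIP_NO, SEAT_NO, STOP_NO)
);
"""

FREE_SQL = """
SELECT STOP.STOP_NO, SEAT.SEAT_NO
FROM STOP CROSS JOIN SEAT
WHERE STOP.LINE_ID = :line_id
  AND SEAT.BUS_NO = :bus_no
  AND NOT EXISTS (
      SELECT 1 FROM TICKET_STOP
      WHERE TICKET_STOP.LINE_ID = :line_id
        AND TICKET_STOP.BUS_NO = :bus_no
        AND TICKET_STOP.TRIP_NO = :trip_no
        AND TICKET_STOP.SEAT_NO = SEAT.SEAT_NO
        AND TICKET_STOP.STOP_NO = STOP.STOP_NO
  )
ORDER BY STOP.STOP_NO, SEAT.SEAT_NO
"""

def free_seats(conn, line_id, bus_no, trip_no):
    """Return the (stop, seat) combinations not covered by any ticket."""
    params = {"line_id": line_id, "bus_no": bus_no, "trip_no": trip_no}
    return conn.execute(FREE_SQL, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO STOP VALUES (1, ?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO SEAT VALUES (7, ?)", [(1,), (2,)])
# Ticket 100 covers stops 1 and 2 in seat 1 on trip 1.
conn.executemany("INSERT INTO TICKET_STOP VALUES (1, 7, 1, 1, ?, 100)", [(1,), (2,)])
free = free_seats(conn, line_id=1, bus_no=7, trip_no=1)
```

With this sample data, seat 1 is reported free only at stop 3, while seat 2 is free at every stop.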

From the perspective of a bus company:
Usually a route is treated as a series of sections: A to B, B to C, C to D, etc. Occupancy is calculated for each of those sections separately. So if the bus leaves A full and people get off at C, another passenger can buy a ticket from C.
We calculate it this way: each route has an ID, and each section belongs to that route ID. If a passenger buys a ticket covering more than one section, each of those sections is marked. For the next passenger, the system checks whether all sections along the way are available.
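The section-based bookkeeping can be sketched in a few lines of Python. This is only an illustration: the Trip class and its method names are hypothetical, and it counts sold seats per section instead of tracking individual seat numbers.

```python
from collections import defaultdict

class Trip:
    """Seat occupancy per section for one run of a route (hypothetical model)."""

    def __init__(self, stops, capacity):
        self.stops = list(stops)        # e.g. ["A", "B", "C", "D"]
        self.capacity = capacity        # seats on the bus
        self.sold = defaultdict(int)    # section index -> seats sold

    def _sections(self, origin, destination):
        start, end = self.stops.index(origin), self.stops.index(destination)
        if start >= end:
            raise ValueError("destination must come after origin")
        return range(start, end)        # section i runs from stop i to stop i + 1

    def is_available(self, origin, destination):
        # A ticket can be sold only if every section on the way has a free seat.
        return all(self.sold[i] < self.capacity
                   for i in self._sections(origin, destination))

    def book(self, origin, destination):
        if not self.is_available(origin, destination):
            return False
        for i in self._sections(origin, destination):
            self.sold[i] += 1           # mark each section the ticket covers
        return True

trip = Trip("ABCD", capacity=1)
first = trip.book("A", "C")             # fills sections A-B and B-C
```

After that booking, a ticket from C to D can still be sold, while B to D or A to D cannot, because section B to C is full.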

Related

Handling multiple grains within a star schema

I'm trying to model a business process that is inherently measured at multiple grains. Usually, this would necessitate one fact table per grain. Because this is a single business process and only one of the dimensions is at a mixed grain (for some records), I'm not sure a separate fact table makes the most sense.
The process itself is based on measuring a Research Application. Each application can have applicants, funders, collaborators, and so forth. Additionally, each application can be managed by an organisation. For the M:N relationships I'm using bridge tables and weighting factors. The problem lies with the organisation dimension, which models a slightly ragged hierarchy as fixed depth attributes.
dim_organisation
id, organisation, faculty, school, division, unit
Each fact record has the same dimensionality with the exception of this dimension. Sometimes the application is managed by a faculty (level 2 in the hierarchy), and sometimes by a school (level 3 in the hierarchy). Furthermore, the fact record itself will only contain the business key for one of those levels e.g. school_code or faculty_code.
Here's how I believe the problem can and should be solved but I'd like some validation of this approach and / or some better proposals if necessary:
The initial dim_organisation table is populated via an external, master data source. The data is always balanced, i.e. there are no missing levels in the data, but it's ragged, so that some entries end at school, whereas others go right down to the unit level:
id, organisation, faculty, school, division, unit
1, org A, faculty A, school A, NULL, NULL
2, org A, faculty B, school B, division B, unit B
3, org A, faculty C, NULL, NULL, NULL
Because these records are at different grains I've copied down the last non-NULL level to complete the hierarchy:
id, organisation, faculty, school, division, unit
1, org A, faculty A, school A, school A, school A
2, org A, faculty B, school B, division B, unit B
3, org A, faculty C, faculty C, faculty C, faculty C
This ensures that every record in dim_organisation is at the same grain and is a standard approach for handling slightly ragged hierarchies. In addition, each of these levels has its own code, e.g. L456 for a level 4 division or L521 for a level 5 unit. These are the business keys obtained from the source system.
Therefore, I can only refer to a single record in the dimension by combining all of the level codes accordingly. At the moment I'm creating a hash key on these level codes and storing the value in a lookup column on the dimension.
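As a rough illustration of the copy-down and the hashed lookup column, here is a small Python sketch. The function names, the "|" delimiter and the choice of SHA-256 are assumptions; the codes L1 and L230 are made up, and the others are borrowed from the examples in this question purely as sample values.

```python
import hashlib

LEVELS = ("organisation", "faculty", "school", "division", "unit")

def copy_down(row):
    """Fill each missing level with the last non-missing level above it."""
    filled, last = {}, None
    for level in LEVELS:
        value = row.get(level)
        last = value if value is not None else last
        filled[level] = last
    return filled

def lookup_key(codes):
    """Hash the ordered level codes into a single lookup value."""
    joined = "|".join(codes[level] for level in LEVELS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Row 3 of the sample data: the hierarchy ends at faculty level.
names = copy_down({"organisation": "org A", "faculty": "faculty C"})

# Level codes for a division-level entry; the division code is copied down to unit.
codes = copy_down({"organisation": "L1", "faculty": "L230",
                   "school": "L321", "division": "L456"})
key = lookup_key(codes)
```

Joining the codes with a delimiter before hashing avoids two different code combinations concatenating to the same string.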
Assuming this approach is correct, I then have fact records coming in as follows:
application_id, organisation_id, applicant_id, ...
1, L456, 99
2, L321, 50
3, L549, 20
As you can see, the application fact is linked to my organisation dimension at different grains e.g. Level 4, Level 3, Level 5 and so forth. Because of the changes I've made to the dimension, I believe I now need to do the following:
1. Lookup the level code from dim_organisation.
2. Return the parent levels.
3. Copy down the level value associated with the fact to level 5.
4. Hash the keys and lookup the corresponding dimensional record.
For example:
1. Lookup L456 to return Division e.g. "Research and Engineering".
2. Return parents: "UoM" -> "Faculty of R&D" -> "School of Engineering".
3. Copy levels: L1 -> L2 -> L3 -> "Research and Engineering" (L4) -> "Research and Engineering" (L5).
4. Now we have all the levels (parents + cascaded) to give us a unique record to look up in dim_organisation.
I'd like to know if this approach makes sense or if there is a better and more intuitive way of doing this? It's slightly messy because of the source data that I'm dealing with but that's the data reality I have to work with.
You've done well with your dimension by pushing the ragged hierarchy down to the lowest grain. Now have your fact record reference the unique row identifier for the dimension.
1, org A, faculty A, school A, school A, school A
2, org A, faculty B, school B, division B, unit B
3, org A, faculty C, faculty C, faculty C, faculty C
If the fact event is related to school A, the fact would store row id #1.
The only caveat to this approach is that the real level of the dim should be identifiable by the content. In other words, if School A is West Side High School and faculty C is Mr West, you wouldn't want them both described as "WEST". If the content of each level is fully descriptive then this model will work just fine.
I have used this exact same approach to model an organizational hierarchy containing up to 10 levels of reporting.

Jira's Lexorank algorithm for new stories

I am looking to create a large list of items that allows for easy insertion of new items and for easily changing the position of items within that list. When updating the position of an item, I want to change as few fields as possible regarding the order of items.
After some research, I found that Jira's Lexorank algorithm fulfills all of these needs. Each story in Jira has a 'rank-field' containing a string which is built up of 3 parts: <bucket>|<rank>:<sub-rank>. (I don't know whether these parts have actual names, this is what I will call them for ease of reference)
Examples of valid rank-fields:
0|vmis7l:hl4
0|i000w8:
0|003fhy:zzzzzzzzzzzw68bj
When dragging a card above 0|vmis7l:hl4, the new card will receive rank 0|vmis7l:hl2, which means that only the rank-field for this new card needs to be updated while the entire list can always be sorted on this rank-field. This is rather clever, and I can't imagine that Lexorank is the only algorithm to use this.
Is there a name for this method of sorting used in the sub-rank?
My question is related to the creation of new cards in Jira. Each new card starts with an empty sub-rank, and the rank is always chosen such that the new card is located at the bottom of the list. I've created a bunch of new stories just to see how the rank would change, and it seems that the rank is always incremented by 8 (in base-36).
Does anyone know more specifically how the rank for new cards is generated? Why is it incremented by 8?
I can only imagine that after some time (270 million cards) there are no more ranks to generate, and the system needs to recalculate the rank-field of all cards to make room for additional ranks.
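The 270 million figure can be checked with a quick calculation, assuming the rank is a six-character base-36 number (as the examples above suggest) and that new ranks start at the lowest value and always advance by 8. The helper name below is hypothetical.

```python
import string

DIGITS = string.digits + string.ascii_lowercase   # base-36 alphabet: 0-9, then a-z

def to_base36(n, width=6):
    """Render a non-negative integer as a zero-padded base-36 string."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(width, "0")

STEP = 8                                   # observed increment between new cards
total_ranks = 36 ** 6                      # 2,176,782,336 possible six-character ranks
max_cards = total_ranks // STEP            # 272,097,792, i.e. roughly 270 million
next_rank = to_base36(int("i000w8", 36) + STEP)   # "i000wg"
```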
Are there other triggers that require recalculation of all rank-fields?
I suppose the bucket plays a role in this recalculation. I would like to know how?
We are talking about a special kind of indexing here. This is not sorting; it is just preparing items to end up in a certain order in case someone happens to sort them (by whatever sorting algorithm). I know that variants of this kind of indexing have been used in libraries for decades, maybe centuries, to ensure that books belonging together but lacking a common title end up next to each other in the shelves, but I have never heard of a name for it.
The 8 is probably a well-chosen compromise, maybe even derived from analyzing typical use cases. Consider this: if you choose a small increment, e.g. 1, then all tickets will have ranks like [a, b, c, …]. This is great if you create a lot of tickets (up to 26) in the correct order, because your rank fields stay short (one letter). But as soon as you move a ticket between two other tickets, you have to add a letter: [a, b] plus a new ticket between them gives [a, an, b]. If you expect this to happen a lot, you had better leave gaps between the ranks: [a, i, q, …]; then an additional ticket can get a single letter as well: [a, e, i, q, …]. But of course, if you then create lots of tickets in the correct order right at the beginning, you quickly run out of letters: [a, i, q, y, z, za, zi, zq, …]. The 8 is probably a good value that leaves enough gaps between the tickets without requiring many letters too soon. Keep in mind that in other scenarios (maybe not Jira tickets, which are created manually) other values might be more reasonable.
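The letter examples above can be reproduced with a short Python sketch. This is not Jira's implementation, only one possible way to pick a rank between two neighbours: ranks are read as base-26 fractions over the letters a to z, with 'a' acting as zero. The function names are hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)

def _to_int(rank, width):
    """Read a rank as a base-26 number, right-padded with 'a' (zero) to width."""
    value = 0
    for ch in rank.ljust(width, "a"):
        value = value * BASE + ALPHABET.index(ch)
    return value

def _to_rank(value, width):
    """Render an integer as a base-26 letter string of the given width."""
    chars = []
    for _ in range(width):
        value, digit = divmod(value, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

def rank_between(lo, hi):
    """Return a rank that sorts strictly between lo and hi."""
    width = max(len(lo), len(hi))
    low, high = _to_int(lo, width), _to_int(hi, width)
    if low >= high:
        raise ValueError("lo must sort strictly before hi")
    while high - low < 2:      # no free slot at this length: add one more letter
        width += 1
        low, high = low * BASE, high * BASE
    # Trailing 'a' characters are zeros, so dropping them keeps the rank short.
    return _to_rank((low + high) // 2, width).rstrip("a")

tight = rank_between("a", "b")   # "an": the neighbours are adjacent, so a letter is added
roomy = rank_between("a", "i")   # "e": the gap of 8 leaves room for a single letter
```

Only the moved ticket gets a new rank; its neighbours keep theirs, which is the property the question is after.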
You are right, the rank fields get recalculated now and then; Lexorank calls this "balancing". Basically, balancing takes place on one of three occasions: ① the ranks are exhausted (largest value reached), ② user re-ranking has left the ranks of some tickets too close together ([a, b, i] and something is supposed to go in between a and b), and ③ a balancing is triggered manually in the management page. (Actually, according to the presentation, Lexorank allows for up to three-letter ranks, so "too close together" can be something like aaa and aab, but the idea is the same.)
The <bucket> part of the rank is increased during balancing, so a messy [0|a, 0|an, 0|b] can become a nice and clean [1|a, 1|i, 1|q] again. The brownbag presentation about Lexorank (as linked by #dandoen in the comments) mentions a round-robin use of <buckets>: instead of a constant increment (0→1→2→3→…), the bucket is increased modulo 3, so it turns back to 0 after the 2 (0→1→2→0→…). When comparing the ranks, the sorting algorithm can consider a 0 "greater" than a 2 (admittedly, it is then no longer purely lexicographical). If the balancing algorithm now works backwards (reordering the last ticket first), the sorting order stays intact the whole time. (This is just a side aspect, which is why I keep the explanation short, but if this is interesting, ask, and I will elaborate on it.)
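To make the round-robin idea concrete, here is a small Python sketch of a comparison key that treats the next bucket as "greater" than the current one while a balancing run is in progress. It only illustrates the idea described above and is not taken from Lexorank itself; the ticket names and the helper functions are hypothetical.

```python
BUCKETS = 3

def next_bucket(bucket):
    """Round-robin successor: 0 -> 1 -> 2 -> 0."""
    return (bucket + 1) % BUCKETS

def sort_key(rank_field, old_bucket):
    """Sort already-migrated tickets (those in the next bucket) after all others."""
    bucket, rank = rank_field.split("|", 1)
    migrated = int(bucket) == next_bucket(old_bucket)
    return (migrated, rank)

# Balancing from bucket 2 into bucket 0, working backwards (last ticket first).
tickets = {"T1": "2|a", "T2": "2|an", "T3": "2|b"}
clean = {"T1": "0|a", "T2": "0|i", "T3": "0|q"}
orders = []
for ticket in ["T3", "T2", "T1"]:
    tickets[ticket] = clean[ticket]
    ordered = sorted(tickets, key=lambda t: sort_key(tickets[t], old_bucket=2))
    orders.append(ordered)
```

After every single step the sorted order is still T1, T2, T3, even though the tickets are temporarily spread over two buckets.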
Sidenote: Lexorank also keeps track of minimum and maximum values of the ranks. For the functioning of the algorithm itself, this is not necessary.

Logic Implementation: Determining availability by resource type, when a resource can belong to multiple types

Consider a hotel which has multiple room types (e.g. single, double, twin, family), and multiple rooms. Each room can be a combination of room types (e.g. one particular room can be a double/twin room).
The problem I'm facing is how to determine availability of rooms based on what is booked already. Consider a hotel with 2 rooms:
Single / Double
Double / Family
We have a basic availability of:
Single: 1
Double: 2
Family: 1
(yes, it seems like there are four rooms, but so long as the availability > 0, it can be assigned; that's the premise I'm working on right now)
In this way, I can sell any combination of rooms, and only when a room availability counter hits zero will it affect the other rooms. E.g. I can sell a double room, and still keep the option of single or family room available. Only when another room is sold will everything close off.
So far, so good.
Except when I have multiple S/D rooms (e.g. two or more) and sell them separately (e.g. a single, then a double): the counter doesn't reach 0 (so I can't use that as a trigger to close off other rooms), but I've sold the maximum number of physical rooms the hotel has anyway.
Clearly there's some fault in my approach to how I'm determining what's available, and I'd appreciate any pointers if this issue has been resolved before (as pseudo-code for now, I'll translate to MySQL/PHP once I've got my head around it).
Thanks
I managed to resolve this eventually through SQL.
My reservations table holds a room_type_id, and a room_id. Depending on whether a room is assigned, I either join the pivot table and then room_types table, or the room_types table directly using the room_type_id. And then I just SUM() 1 for each tuple which thankfully returns the right amount when you group by room_type.id in the end.
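In Python terms, my reading of that counting rule looks roughly like the sketch below. It is an interpretation of the description above, not the actual SQL, and the data shapes (a dict of room types per room, and reservations as (room_type, room_id) pairs) are assumptions.

```python
from collections import Counter

def availability(rooms, reservations):
    """rooms: room_id -> set of room types the room can be sold as.
    reservations: list of (room_type, room_id), room_id is None if unassigned."""
    capacity = Counter(t for types in rooms.values() for t in types)
    used = Counter()
    for room_type, room_id in reservations:
        if room_id is not None:
            used.update(rooms[room_id])   # an assigned room blocks every type it offers
        else:
            used[room_type] += 1          # unassigned: only the booked type is counted
    return {t: capacity[t] - used[t] for t in capacity}

rooms = {1: {"single", "double"}, 2: {"single", "double"}}
assigned = availability(rooms, [("single", 1), ("double", 2)])
unassigned = availability(rooms, [("single", None), ("double", None)])
```

With both reservations assigned to a physical room, every type drops to zero. With neither assigned, each type still shows one room left, which is exactly the overselling problem from the question.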

Algorithm for multicriterial arrangement

Let me describe the problem in the form of a small fictional story.
The story
In a Brave New World, new cities are built in a couple of days and only need to be populated. Moreover, there is no more long, boring hiring process, no interviews and no subjective decisions: every person takes several tests, and the results are used to find the best employees.
When a new city is built, a number of companies place their offices there and ask Super Mind to find the best employees for them, each providing a way to calculate a person's score for that particular company. People, for their part, ask Super Mind to find work for them. They give it a list of companies where they would like to work, together with corresponding priorities. Super Mind is very humanistic, so its task is to find an arrangement in which people get into the best companies they want, even if some companies are left without employees at all.
Formal definition
Now let me define the task more formally.
E - number of employees seeking for a job.
C - number of companies.
S(e,c) - score of employee e for company c.
Pr(e,c) - priority of company c in a personal "wishlist" of employee e.
P(c) - # of positions available in company c.
Task: obtain a list of (e, c) tuples satisfying the following conditions:
Employees with higher S(e,c) should always go first (e.g. if there is only one position left in company c and there are 2 candidates for it, it should be guaranteed that the employee with the higher score gets this position).
Employees should get into the highest-priority company available to them.
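One way to make the two conditions testable is a checker that, given a finished arrangement, lists every (employee, company) pair that breaks them. The sketch below is only an illustration: the function name and data shapes are hypothetical, and it assumes each wishlist is ordered from most to least preferred and that every assigned company appears in the employee's wishlist.

```python
def violations(assignment, wishlist, score, positions):
    """Return (employee, company) pairs where the employee prefers the company
    to their current outcome and the company would have taken them."""
    hired = {}
    for e, c in assignment.items():
        hired.setdefault(c, []).append(e)
    found = []
    for e, wishes in wishlist.items():
        current = assignment.get(e)
        # Companies the employee ranks above what they actually got.
        better = wishes if current is None else wishes[:wishes.index(current)]
        for c in better:
            staff = hired.get(c, [])
            has_room = len(staff) < positions[c]
            outscored = any(score[(other, c)] < score[(e, c)] for other in staff)
            if has_room or outscored:
                found.append((e, c))
    return found

wishlist = {"e1": ["X", "Y"], "e2": ["X", "Y"]}
score = {("e1", "X"): 90, ("e2", "X"): 80, ("e1", "Y"): 50, ("e2", "Y"): 60}
positions = {"X": 1, "Y": 1}
good = violations({"e1": "X", "e2": "Y"}, wishlist, score, positions)
bad = violations({"e1": "Y", "e2": "X"}, wishlist, score, positions)
```

In the second arrangement e1 prefers X and outscores the person hired there, so the pair ("e1", "X") is reported.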
My algorithm
The only algorithm I can think of that guarantees all conditions is as follows. First I create a list of all possible applications from employees to companies (A(e,c,s,p)), where s is the score of employee e for company c and p is the priority of that company for this employee. Then I sort all applications by total score and run the following recursive procedure:
def arrange(As, Ps, not_approved, approved):
    # As - list of applications left
    # Ps - map of type (company -> # of positions left)
    # not_approved - set of not approved applications
    # approved - set of approved applications (holds the intermediate result)
    if empty(As):
        return approved
    a = head(As)
    As_rest = tail(As)
    if cant_be_hired(a):  # no places left in the company from this application
        return arrange(As_rest, Ps, not_approved + a, approved)
    elif highest_priority(a):  # this application has the highest of the remaining priorities
        # Ps(c) - 1: Ps with one position fewer for the company c of application a
        return arrange(As_rest, Ps(c) - 1, not_approved, approved + a)
    else:
        # the application can be accepted, but the employee still has higher priorities left,
        # so check what happens if we do not accept this application
        check_result = arrange(As_rest, Ps, not_approved + a, approved)
        if employee_is_hired_for_better_job(a, check_result):
            # the employee can be hired for a job with higher priority,
            # so just return check_result - it is already an answer
            return check_result
        else:
            # otherwise accept this application and proceed with the rest of them
            return arrange(As_rest, Ps(c) - 1, not_approved, approved + a)
But, of course, this algorithm has very high computational complexity. Dynamic programming with cached check results helps a bit, but it is still too slow.
I was thinking of some kind of constrained optimization algorithm that always converges, but I'm not familiar enough with this field to find an appropriate one.
So, is there a better algorithm?

Looking for a model to represent this problem, which I suspect may be NP-complete

(I've changed the details of this question to avoid NDA issues. I'm aware that if taken literally, there are better ways to run this theoretical company.)
There is a group of warehouses, each of which is capable of storing and distributing 200 different products, out of a possible 1000 total products that Company A manufactures. Each warehouse is stocked with 200 products and assigned orders, which it is then to fill from its stock on hand.
The challenge is that each warehouse needs to be self-sufficient. There will be an order for an arbitrary number of products (5-10 usually), which is assigned to a warehouse. The warehouse then packs the required products for the order, and ships them together. For any item which isn't available in the warehouse, the item must be delivered individually to the warehouse before the order can be shipped.
So, the problem lies in determining the best warehouse/product configurations so that the largest possible number of orders can be packed without having to order and wait for individual items.
For example (using products each represented by a letter, and warehouses capable of stocking 5 product lines):
Warehouse 1: [A, B, C, D, E]
Warehouse 2: [A, D, F, G, H]
Order: [A, C, D] -> Warehouse 1
Order: [A, D, H] -> Warehouse 2
Order: [A, B, E, F] -> Warehouse 1 (+1 separately ordered)
Order: [A, D, E, F] -> Warehouse 2 (+1 separately ordered)
The goal is to use historical data to minimize the number of individually ordered products in future. Once the warehouses had been set up a certain way, the software would just determine which warehouse could handle an order with minimal overhead.
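The order-routing half of this is straightforward once the stock is fixed. A minimal Python sketch using the example data above (the function name is hypothetical, and ties are broken by whichever warehouse is listed first):

```python
def best_warehouse(order, warehouses):
    """Return (name, missing) for the warehouse that lacks the fewest ordered items."""
    def missing(name):
        return sorted(set(order) - warehouses[name])
    name = min(warehouses, key=lambda n: len(missing(n)))
    return name, missing(name)

warehouses = {
    "Warehouse 1": set("ABCDE"),
    "Warehouse 2": set("ADFGH"),
}
routed = [best_warehouse(order, warehouses)
          for order in ("ACD", "ADH", "ABEF", "ADEF")]
```

Note that the last order, [A, D, E, F], is a tie: either warehouse has to order exactly one item separately, so the choice between them needs some other criterion.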
This immediately strikes me as a machine learning style problem. It also seems like a combination of certain well known NP-Complete problems, though none of them seem to fit properly.
Is there a model which represents this type of problem?
If I understand correctly, you have two separate problems:
Predict what each warehouse should pre-buy
Get the best warehouse for an order
For the first problem, I point you to the Netflix Prize: this was almost the same problem, and great solutions have been proposed. (My data mining handbook is at home and I can't remember the precise keyword to google, sorry. Try "data mining time series".)
For the second one, this is a problem for Prolog.
Set a cost for separately ordering an item
Set a cost, for, idk, proximity to the customer
Set the cost for already owning the product to 0
Make the rule to get a product : buy it if you don't have it, get it if you do
Make the rule to get all products : foreach product, rule above
get the cost of this rule
Gently ask Prolog to get a solution. If it's not good enough, ask more.
If you don't want to use Prolog, there are several constraints libraries out there. Just google "constraint library <insert your programming language here>"
The first part of the problem (which items are frequently ordered together) is sometimes known as the co-occurrence problem, and is a big part of the data mining literature. (My recollection is that the problem is in NP, but there exist quite good approximate algorithms).
Once you have co-occurrence data you are happy with, you are still left with the assignment of items to warehouses. It's a little like the set-covering problem, but not quite the same. This problem is NP-hard.
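As a starting point for the co-occurrence data, counting how often each pair of products appears in the same historical order takes only a few lines of Python. This is a plain pair count, not one of the approximate algorithms mentioned above, and the function name is hypothetical; the sample history reuses the four orders from the question.

```python
from collections import Counter
from itertools import combinations

def pair_counts(orders):
    """Count how many orders contain each unordered pair of products."""
    counts = Counter()
    for order in orders:
        # Sort so that each pair is always stored in the same orientation.
        for pair in combinations(sorted(set(order)), 2):
            counts[pair] += 1
    return counts

history = [list("ACD"), list("ADH"), list("ABEF"), list("ADEF")]
counts = pair_counts(history)
top_pair, top_count = counts.most_common(1)[0]   # ("A", "D") appears in 3 orders
```

Pairs with high counts are candidates for being stocked in the same warehouse; how to turn those counts into an actual assignment is the hard part.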

Resources