Google transit is too idealistic. How would you change that? - algorithm

Suppose you want to get from point A to point B. You use Google Transit directions, and it tells you:
Route 1:
1. Wait 5 minutes
2. Walk from point A to Bus stop 1 for 8 minutes
3. Take bus 69 till stop 2 (15 minutes)
4. Wait 2 minutes
5. Take bus 6969 till stop 3 (12 minutes)
6. Walk from stop 3 till point B (3 minutes).
Total time = 5 wait + 40 minutes.
Route 2:
1. Wait 10 minutes
2. Walk from point A to Bus stop I for 13 minutes
3. Take bus 96 till stop II (10 minutes)
4. Wait 17 minutes
5. Take bus 9696 till stop III (12 minutes)
6. Walk from stop III till point B (8 minutes).
Total time = 10 wait + 60 minutes.
All in all, Route 1 looks way better. However, what really happens in practice is that bus 69 is 3 minutes behind due to traffic, and I end up missing bus 6969. The next bus 6969 comes at least 30 minutes later, which amounts to 5 wait + 70 minutes (including a 30-minute wait in the cold or heat). Wouldn't it be nice if Google actually advertised this possibility? My question now is: what is a better algorithm for displaying the top 3 routes, given uncertainty in the schedule?
Thanks!

How about adding weightings that express a level of uncertainty for different types of journey elements?
Bus services in Dublin City are notoriously untimely, so you could add a 40% margin of error to anything to do with the Dublin Bus schedule, giving a best and worst case scenario. You could also factor in the chronic traffic delays at rush hour. Then a user could see that they may have a 20% or 80% chance of actually making a connection.
You could sort "best" journeys by the "most probably correct" factor, and include this data in the results shown to the user.
My two cents :)
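As a rough illustration of that idea, here is a small Python sketch (the leg data, delay mean, and standard deviation below are made up) that turns a per-leg margin of error into a probability of making each connection, assuming delays are roughly normally distributed:

    import math

    def p_make_connection(slack_min, delay_mean_min, delay_sd_min):
        # Probability that the incoming leg's delay stays below the scheduled slack,
        # using the normal CDF.
        z = (slack_min - delay_mean_min) / delay_sd_min
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def route_success_probability(connections):
        # connections: list of (slack, mean delay, sd of delay) per transfer.
        p = 1.0
        for slack, mu, sd in connections:
            p *= p_make_connection(slack, mu, sd)
        return p

    # Route 1 from the question: a 2-minute slack before bus 6969, with an
    # assumed 3-minute mean delay and 4-minute standard deviation for bus 69.
    print(route_success_probability([(2, 3, 4)]))

Routes could then be sorted by, or at least annotated with, this success probability, as suggested above.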

For the UK rail system, each interchange node has an associated 'minimum transfer time to allow'. The interface to the route planner here then has an Advanced option allowing the user to either accept the default or add half-hour increments.
In your example, setting a 'minimum transfer time to allow' of, say, 10 minutes at stop 2 would prevent Route 1 as shown from being suggested. Of course, this means that the minimum possible journey time is increased, but that's the trade-off.
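For what it's worth, the filtering itself is trivial once each leg has scheduled arrival and departure times; here is a minimal sketch (the route/leg structure is hypothetical):

    def respects_min_transfer(route, min_transfer_min):
        # route: list of legs, each a dict with 'arrive' and 'depart' times in minutes.
        for incoming, outgoing in zip(route, route[1:]):
            if outgoing["depart"] - incoming["arrive"] < min_transfer_min:
                return False
        return True

    def filter_routes(routes, min_transfer_min=10):
        # Keep only routes where every interchange leaves at least the minimum slack.
        return [r for r in routes if respects_min_transfer(r, min_transfer_min)]

With min_transfer_min=10, Route 1's 2-minute connection would be rejected, exactly as described.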

If you take uncertainty into account then there is no longer a "best route"; instead there can be a "best strategy" that minimizes the total time in transit. However, it can't be represented as a linear sequence of instructions and takes more the form of a general plan, i.e. "go to bus station X, wait until 10:00 for bus Y, if it does not arrive walk to station Z...". This would be notoriously difficult to present to the user (in addition to being computationally expensive to produce).
For a fixed sequence of instructions it is possible to calculate the probability that it actually works out; but what level of certainty would users want to accept? Would you be content with, say, an 80% success rate? When you then miss one of your connections, the house of cards falls down in the worst case, e.g. if you miss a train that leaves only every two hours.
Many years ago I wrote a similar program to calculate long-distance bus journeys in Finland, and I simply reported the transfer times assuming every bus was on schedule. Then basically every plan with less than about 15 minutes of transfer time was disregarded as too risky (there were sometimes only one or two long-distance buses per day on a given route).

Empirically. Record the actual arrival times vs. the scheduled arrival times, and compute the mean and standard deviation of the delay for each leg. When considering possible routes, calculate the probability that a given leg will arrive late enough to make you miss the next leg, and take the expected wait time to be P(on time) * T(first bus) + (1 - P(on time)) * T(second bus). This gets more complicated if you have to consider multiple legs, each of which could be late independently, and multiple possible next legs you could miss, but the general principle holds.
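A minimal sketch of that calculation in Python (the recorded delays, slack, and wait times below are invented for illustration):

    from statistics import mean, stdev

    def p_on_time(delays_min, slack_min):
        # Fraction of recorded delays small enough that the connection is still made.
        return sum(d <= slack_min for d in delays_min) / len(delays_min)

    def expected_wait(delays_min, slack_min, wait_if_made, wait_if_missed):
        # P(on time) * T(first bus) + (1 - P(on time)) * T(second bus)
        p = p_on_time(delays_min, slack_min)
        return p * wait_if_made + (1 - p) * wait_if_missed

    # Hypothetical recorded delays for bus 69, a 2-minute scheduled slack,
    # a 2-minute wait if the connection is made, 32 minutes if it is missed.
    delays = [0, 1, 3, 5, 2, 0, 4, 6, 1, 2]
    print(mean(delays), stdev(delays))   # the per-leg statistics mentioned above
    print(expected_wait(delays, slack_min=2, wait_if_made=2, wait_if_missed=32))

Extending this to several legs means combining the miss probabilities across all the connections, as noted above.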

Catastrophic failure should be the first check.
This is especially important when you are trying to connect to that last bus of the day, which is a critical part of the route. The rider needs to know that this is the situation, so they don't get too distracted and they understand the risk.
After that it could evaluate worst-case single misses.
And then, if you really wanna get fancy, take a look at the crime stats for the neighborhood or transit station where the waiting point is.

Related

Distribute user active time blocks subject to total constraint

I am building an agent-based model for product usage. I am trying to develop a function to decide whether the user is using the product at a given time, while incorporating randomness.
So, say we know the user spends a total of 1 hour per day using the product, and we know the average distribution of this time (e.g., most used at 6-8pm).
How can I generate a set of usage/non-usage times (i.e., during each 10-minute block, is the user active or not?) while ensuring that at the end of the day the total active time sums to one hour?
In most cases I would just run the distributor without concern for the total, and then at the end normalize it so that the total comes out to the 1-hour target. However, I can't do that here because time blocks must be 10 minutes. I think this is a different question because I'm really not computing time ranges; I'm computing booleans to associate with different 10-minute time blocks (e.g., the user was/was not active during a given block).
Is there a standard way to do this?
I did some more thinking and figured it out, if anyone else is looking at this.
The approach to take is this: You know the allowed number n of 10-minute time blocks for a given agent.
Iterate n times, and on each iteration select a time block out of the day subject to your activity distribution function.
The main point is to iterate over the number of time blocks you want to place, not over the entire day.
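A small Python sketch of that approach (the weights below are made up, with extra weight on the 6-8pm blocks):

    import random

    def place_active_blocks(n_blocks, weights):
        # weights: one relative weight per 10-minute block of the day (144 blocks).
        blocks = list(range(len(weights)))
        w = list(weights)
        chosen = set()
        for _ in range(n_blocks):
            idx = random.choices(blocks, weights=w, k=1)[0]
            chosen.add(idx)
            w[idx] = 0.0                           # don't pick the same block twice
        return [b in chosen for b in blocks]       # one boolean per 10-minute block

    # 6 blocks = 1 hour; blocks 108-119 cover 6pm-8pm.
    weights = [1.0] * 144
    for i in range(108, 120):
        weights[i] = 5.0
    active = place_active_blocks(6, weights)
    print(sum(active))   # always exactly 6 blocks, i.e. one hour

Because you iterate over the n blocks rather than over the whole day, the total always comes out to exactly one hour.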

Best algorithm for threshold identification

Assume I have a huge set of data about a system's idle time.
Day 1 - 5 mins
Day 2 - 3 mins
Day 3 - 7 mins
...
Day 'n' - 'k' mins
We can assume that even though the idle time is random, the pattern repeats.
Using this as training data, is it possible for me to identify the idle time behavior of the system? And with that, can an abnormality be predicted?
Which algorithm would best suit this purpose?
I tried to fit a regression, but it can only answer "What is the expected idle time today?"
What I want instead is this: when the idle time deviates from the pattern, it has to be detected.
Edit:
Or does it make sense to predict for the current day only? I.e., today the expected idle time is 'x' mins; tomorrow it may differ.
I would try a Fourier transform and have a look at whether your system behaves in a periodic way (this would show up as peaks in the frequency domain).
Then get rid of the frequencies with low magnitudes and use the rest to predict the system's behavior in the future.
If the real behavior differs a lot from the prediction, that is what you want to detect.
Wikipedia: Fast Fourier transform
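A rough sketch of that with NumPy (the idle-time series, the number of frequencies kept, and the threshold are invented): keep only the strongest frequency components, reconstruct the expected pattern, and flag days that deviate too much from it.

    import numpy as np

    def fft_baseline(series, keep=3):
        # Reconstruct the series from its `keep` largest-magnitude frequency components.
        spectrum = np.fft.rfft(series)
        order = np.argsort(np.abs(spectrum))[::-1]
        filtered = np.zeros_like(spectrum)
        filtered[order[:keep]] = spectrum[order[:keep]]
        return np.fft.irfft(filtered, n=len(series))

    def anomalies(series, keep=3, threshold=2.0):
        # Flag days whose residual against the periodic baseline is unusually large.
        baseline = fft_baseline(series, keep)
        residual = np.asarray(series, dtype=float) - baseline
        return np.where(np.abs(residual) > threshold * residual.std())[0]

    idle = [5, 3, 7, 5, 4, 6, 5, 3, 7, 5, 4, 25, 5, 3]   # made-up data; day 12 looks unusual
    print(anomalies(idle))

How many frequencies to keep and what residual threshold to use would have to be tuned on your own data.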

Algorithm to distribute heartbeats?

I am building a sensor network where a large number of sensors report their status to a central hub. The sensors need to report status at least once every 3 hours, but I want to make sure that the hub does not get inundated with too many reports at any given time. To mitigate this, I let the hub tell the sensors the 'next report time'.
Now I am looking for any standard algorithms for load balancing these updates, such that the sensors don't exceed a set interval between reports and the hub can calculate the next report time such that its load (of receiving reports) is evenly divided over the day.
Any help will be appreciated.
If you know how many sensors there are, just divide every three-hour chunk into that many time slots and assign one to each sensor (either randomly or programmatically, as you need).
If you don't, you can still divide every three-hour chunk into some large number of time slots and assign them to sensors. In your assignment algorithm, you just have to make sure that all the slots have one assigned sensor before any of them have two, all of them have two before any of them have three, and so on.
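As a sketch (the names and numbers are hypothetical), the slot assignment could be as simple as:

    def assign_report_slots(sensor_ids, window_minutes=180, n_slots=360):
        # Return {sensor_id: offset in minutes within each 3-hour window}.
        slot_length = window_minutes / n_slots
        assignments = {}
        for i, sensor in enumerate(sensor_ids):
            # Round-robin over the slots: every slot gets one sensor
            # before any slot gets two, and so on.
            assignments[sensor] = (i % n_slots) * slot_length
        return assignments

    print(assign_report_slots(["s1", "s2", "s3", "s4"], n_slots=4))

The hub would then hand each sensor its offset as the 'next report time'.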
Easiest solution: Is there any reason why the hub cannot poll the sensors according to its own schedule?
Otherwise you may want to devise a system where the hub can decide whether or not to accept a report based on its own load. If a sensor has its connection denied, make it wait a random period of time and retry. Over time the sensors should space themselves out more or less optimally.
IIRC some facet of TCP/IP uses a similar method, but I'm drawing a blank as to which.
I would use a base of 90 minutes with a randomized variation of up to ±30 minutes, so that the intervals fall randomly between 60 and 120 minutes. Adjust these numbers if you want to get closer to the 3-hour interval, but I would personally stay well under it.
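In code that is just (a trivial sketch):

    import random

    def next_report_delay_minutes(base=90, jitter=30):
        # Uniformly random interval between 60 and 120 minutes.
        return base + random.uniform(-jitter, jitter)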

Regrading simulation of bank-teller

We have a system, such as a bank, where customers arrive and wait on a line until one of k tellers is available. Customer arrival is governed by a probability distribution function, as is the service time (the amount of time to be served once a teller is available). We are interested in statistics such as how long on average a customer has to wait or how long the line might be.
We can use the probability functions to generate an input stream consisting of ordered pairs of arrival time and service time for each customer, sorted by arrival time. We do not need to use the exact time of day. Rather, we can use a quantum unit, which we will refer to as a tick.
One way to do this simulation is to start a simulation clock at zero ticks. We then advance the clock one tick at a time, checking to see if there is an event. If there is, then we process the event(s) and compile statistics. When there are no customers left in the input stream and all the tellers are free, then the simulation is over.
The problem with this simulation strategy is that its running time does not depend on the number of customers or events (there are two events per customer), but instead depends on the number of ticks, which is not really part of the input. To see why this is important, suppose we changed the clock units to milliticks and multiplied all the times in the input by 1,000. The result would be that the simulation would take 1,000 times longer!
My question about the above text concerns the last paragraph: what does the author mean by "suppose we changed the clock units to milliticks and multiplied all the times in the input by 1,000. The result would be that the simulation would take 1,000 times longer!"?
Thanks!
With this algorithm we have to check every tick. The more ticks there are, the more checks we carry out. For example, if the first customer arrives at the 3rd tick, then we had to do 2 unnecessary checks. But if we checked every millitick, we would have to do 2,999 unnecessary checks.
Because the checking is carried out on a per-tick basis, if the number of ticks is multiplied by 1,000 then there will be 1,000 times more checks.
Imagine that you set an alarm so that you perform a task, like checking your email, every hour. This means you would check your email 24 times in a day, assuming you didn't sleep. If you decide to change this alarm so that it goes off every minute, you would now be checking your email 24 * 60 = 1440 times per day, where 24 is the number of times you were checking it before and 60 is the number of minutes in an hour.
This is exactly what happens in the simulation above, except that rather than performing some action every time an alarm goes off, you do all 1440 email checks as quickly as you can; the work still grows in proportion to the number of ticks.
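A toy Python sketch of the point (the customer arrival ticks are made up): the loop does one check per tick, so multiplying the clock resolution by 1,000 multiplies the work by 1,000 even though the number of events is unchanged.

    def tick_simulation(arrival_ticks, horizon_ticks):
        checks = 0
        for tick in range(horizon_ticks):
            checks += 1                      # one check per tick, event or not
            if tick in arrival_ticks:
                pass                         # process the event(s) here
        return checks

    print(tick_simulation({3, 7}, horizon_ticks=10))           # 10 checks for 2 events
    print(tick_simulation({3000, 7000}, horizon_ticks=10000))  # 10,000 checks for the same 2 events

The usual fix is to advance the clock straight to the next event instead of one tick at a time.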

Average waiting time in Round Robin scheduling

Waiting time is defined as how long each process has to wait before it gets its time slice.
In scheduling algorithms such as Shortest Job First and First Come First Serve, we can find the waiting time easily: we just queue up the jobs and see how long each one had to wait before it got serviced.
When it comes to Round Robin or any other preemptive algorithm, long-running jobs spend a little time in the CPU, are preempted, wait for some time for their next turn to execute, and at some point, during one of their turns, execute until completion. I want to find out the best way to understand the 'waiting time' of jobs under such a scheduling algorithm.
I found a formula which gives waiting time as:
Waiting Time = (Final Start Time - Previous Time in CPU - Arrival Time)
But I fail to understand the reasoning behind this formula. For example, consider a job A which has a burst time of 30 units, with round robin happening every 5 units. There are two more jobs, B (10) and C (15).
The order in which these will be serviced would be:
0 A 5 B 10 C 15 A 20 B 25 C 30 A 35 C 40 A 45 A 50 A 55
Waiting time for A = 40 - 5 - 0
I chose 40 because after 40, A never waits. It just gets its time slices and runs on.
I chose 5 because A was previously in the CPU between 30 and 35.
0 is the arrival time.
Well, I have a doubt about this formula: why is the slice 15 A 20 not accounted for?
Intuitively, I am unable to see how this gives us the waiting time for A when we account for the penultimate execution only and then subtract the arrival time.
According to me, the waiting time for A should be:
Final start time - (sum of all the time it spent in processing).
If this formula is wrong, why is it?
Please help clarify my understanding of this concept.
You've misunderstood what the formula means by "previous time in CPU". This actually means the same thing as what you call "sum of all times it spend in the processing". (I guess "previous time in CPU" is supposed to be short for "total time previously spent running on the CPU", where "previously" means "before the final start".)
You still need to subtract the arrival time because the process obviously wasn't waiting before it arrived. (Just in case this is unclear: The "arrival time" is the time when the job was submitted to the scheduler.) In your example, the arrival time for all processes is 0, so this doesn't make a difference there, but in the general case, the arrival time needs to be taken into account.
Edit: If you look at the example on the webpage you linked to, process P1 takes two time slices of four time units each before its final start, and its "previous time in CPU" is calculated as 8, consistent with the interpretation above.
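For what it's worth, here is a short round-robin simulation of the example (quantum 5; A = 30, B = 10, C = 15, all arriving at time 0) that computes waiting time as completion time minus arrival minus burst, which comes out the same as "final start minus previous time in CPU minus arrival":

    from collections import deque

    def round_robin_waiting(bursts, quantum=5):
        # bursts: {job: burst time}; all jobs are assumed to arrive at time 0.
        remaining = dict(bursts)
        queue = deque(bursts)
        time, completion = 0, {}
        while queue:
            job = queue.popleft()
            run = min(quantum, remaining[job])
            time += run
            remaining[job] -= run
            if remaining[job] == 0:
                completion[job] = time
            else:
                queue.append(job)          # not finished: back to the end of the queue
        # waiting = completion - burst - arrival (arrival is 0 here)
        return {job: completion[job] - bursts[job] for job in bursts}

    print(round_robin_waiting({"A": 30, "B": 10, "C": 15}))   # {'A': 25, 'B': 15, 'C': 25}

For A this matches 40 - 15 - 0 = 25, where 15 is the total CPU time A received before its final start at 40.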
Last waiting value - (time quantum × (n - 1)).
Here n denotes the number of times a process appears in the Gantt chart.
