Machine learning in cab pooling scenario? - algorithm

So I have this dataset for a transportation problem. Which shows a cab pooling scenario. Consider the following image:
The users with same ride number went in the same cab (each user has the same starting point so please ignore that). Now that means, Y, Z and A are in same proximity, and so as B & C and D & E.
Now I would like to fit this dataset into a machine learning model such that when I enter the destination of any user, the model should give me the prediction on with whom my destination can be coupled so I can go in the cab with those people.
Like if I have to go to a location 'C' I can join people who are going to 'B'.
Which machine learning algorithm can I use in this scenario?

You can probably do without machine learning algorithm. Given the ride number, you can identify locations which are close to each other and group them. When a new location comes, you can see which group it belongs to and pair the people traveling to locations within that group.
To do this you can create a matrix which has locations A, B,C,... as the rows and as columns. What you'll get is a num_of_locations x num_of_locations matrix. For the cell with row label B and column label C you can mark it as 1 since they are in proximity and the locations which aren't in proximity(like A and B) should be marked as zero.
The matrix will be a symmetric one, so if you have too many locations you can save on memory and computation by some optimizations. You can research around saving triangular matrices as sparse matrices.
Also, if you have access to the right resources(paid libraries), you can replace the 0,1 with distances(displacements actually).

Related

What algorithm and data structure would fit the use case of overlapping traffic on a roadway

I have a problem where I have a road that has multiple entry points and exits. I am trying to model it so that traffic can flow into an entry and go out the exit. The entry points also act as exits. All the entrypoints are labelled 1 to 10 (i.e. we have 10 entry and exits).
A car is allowed to enter and exit at any point however the entry is always lower number than the exit. For example a car enters at 3 and goes to 8, it cannot go from 3 to 3 or from 8 to 3.
After every second the car moves one unit on the road. So from above example the car goes from 3 to 4 after one second. I want to continuously accept cars at different entrypoints and update their positions after each second. However I cannot accept a car at an entry if there is already one present at that location.
All cars are travelling at the same speed of 1 unit per second and all are same size and occupy just the space at the point they are in. Once a car reaches its destination, its removed from the road.
For all new cars that come into the entrypoint and are waiting, we need to assign a waiting time. How would that work? For example it needs to account for when it is able to find a slot where it can be put on the road.
Is there an algorithm that this problem fits into?
What data structure would I model this in - for example for each entrypoints, I was thinking something like a queue or like an ordered map and for the road, maybe a linkedlist?
Outside of a top down master algorithm that decides what each car does and when, there is another approach that uses agents that interact with their environment and amongst themselves, with a limited set of simple rules. This often give rise to complex behaviors: You could maybe code simple rules into car objects, to define these interactions?
Maybe something like this:
emerging behavior algorithm:
a car moves forward if there are no cars just in front of it.
a car merges into a lane if there are no car right on its side (and
maybe behind that slot too)
a car progresses towards its destination, and removes itself when destination is reached.
proposed data structure
The data structure could be an indexed collection of "slots" along which a car moves towards a destination.
Two data structures could intersect at a tuple of index values for each.
Roads with 2 or more lanes could be modeled with coupled data structures...
optimial numbers
Determining the max road use, and min time to destination would require running the simulation several times, with varying parameters of the number of cars, and maybe variations of the rules.
A more elaborate approach would us continuous space on the road, instead of discrete slots.
I can suggest a Directed Acyclic Graph (DAG) which will store each entry point as a node.
The problem of moving from one point to another can be thought of as a graph-flow problem, which has a number of algorithms for determining movement in a graph.

Finding the flow of water in a connected network

I have a set of water meters for water consumers drawn up as geojson and visualized with ol3. For each consumer house i have their usage of water for the given year, and also the water pipe system is given as linestrings, with metadata for the diameter of each pipe section.
What is the minimum required information I need to be able to visualize/calculate the amount of water that passed each pipe in total of the year when the pipes have inner loops/circles.
is there a library that makes it easy to do the calculations in javascript.
Naive approach, start from each house and move to the first pipe junction and add the used mater measurement for the house as water out of the junction and continue until the water plant is reached. This works if there was no loops within the pipe system.
This sounds more like a physics or civil engineering problem than a programming one.
But as best I can tell, you would need time series data for sources and sinks.
Consider this simple network:
Say, A is a source and B and D are sinks/outlets.
If the flow out of B is given, the flow in |CB| would be dependent on the flow out of D.
So e.g. if B and D were always open at the same time, the total volume that has passed |CB| might be close to 0. Conversely, if B and D were never open at the same time the number might be equal to the volume that flowed through |AB|.
If you can obtain time series data, so you have concurrent values of flow through D and B, I would think there would exist a standard way of determining the flow through |CB|.
Wikipedia's Pipe Network Analysis article mentions one such method: The Hardy Cross method, which:
"assumes that the flow going in and out of the system is known and that the pipe length, diameter, roughness and other key characteristics are also known or can be assumed".
If time series data are not an option, I would pretend it was always average (which might not be so bad given a large network, like in your image) and then do the same thing.
You can use the Ford-Fulkerson algorithm to find the maximum flow in a network. To use this algorithm, you need to represent your network as a graph with nodes that represent your houses and edges to represent your pipes.
You can first simplify the network by consolidating demands on "dead-ends". Next you'll need pressure data at the 3 feeds into this network, which I see as the top feed from the 90 (mm?), centre feed at the 63 and bottom feed near to the 50. These 3 clusters are linked by a 63mm running down, which have the consolidated demand and the pressure readings at the feed would be sufficient to give the flowrate across the inner clusters.

Multiple depot vehicle scheduling

I have been playing around with algorithms and ILP for the single depot vehicle scheduling problem (SDVSP) and now want to extend my knowledge towards the multiple depot vehicle scheduling problem (MDVSP), as i would like to use this knowledge in a project of mine.
As for the question, I've found and implemented several algorithms for the MDSVP. However, one question i am very curious about is how to go about determining the amount of needed depots (and locations to an extend). Sadly enough i haven't been able to find any resources really which do not assume/require that the depots are set. Thus my question would be: How would i be able to approach a MDVSP in which i can determine the amount and locations of the depots?
(Edit) To clarify:
Assume we are given a set of trips T1, T2...Tn like usually in a SDVSP or MDVSP. Multiple trips can be driven in succession before returning to a depot. Leaving and returning to depots usually only happen at the start and end of a day. But as an extension to the normal problems, we can now determine the amount and locations of our depots, opposed to having set depots.
The objective is to find a solution in which all trips are driven with the minimal cost. The cost consists of the amount of deadhead (the distance which the car has to travel between trips, and from and to the depots), a fixed cost K per car, and a fixed cost C per depots.
I hope this clears up the question somewhat.
The standard approach involves adding |V| binary variables in ILP, one for each node where x_i = 1 if v_i is a depot and 0 otherwise.
However, the way the question is currently articulated, all x_i values will come out to be zero, since there is no "advantage" of making the node a depot and the total cost = (other cost factors) + sum_i (x_i) * FIXED_COST_PER_DEPOT.
Perhaps the question needs to be updated with some other constraint about the range of the car. For example, a car can only go so and so miles before returning to a depot.

System Design of Google Trends?

I am trying to figure out system design behind Google Trends (or any other such large scale trend feature like Twitter).
Challenges:
Need to process large amount of data to calculate trend.
Filtering support - by time, region, category etc.
Need a way to store for archiving/offline processing. Filtering support might require multi dimension storage.
This is what my assumption is (I have zero practial experience of MapReduce/NoSQL technologies)
Each search item from user will maintain set of attributes that will be stored and eventually processed.
As well as maintaining list of searches by time stamp, region of search, category etc.
Example:
Searching for Kurt Cobain term:
Kurt-> (Time stamp, Region of search origin, category ,etc.)
Cobain-> (Time stamp, Region of search origin, category ,etc.)
Question:
How do they efficiently calculate frequency of search term ?
In other words, given a large data set, how do they find top 10 frequent items in distributed scale-able manner ?
Well... finding out the top K terms is not really a big problem. One of the key ideas in this fields have been the idea of "stream processing", i.e., to perform the operation in a single pass of the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So, people came up with the idea of "count-sketches" (you can read up more here: count min sketch page on wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:
h1(x) = 0 ... n-1 with uniform probability
h2(x) = 0/1 each with probability 0.5
You then do A[h1[x]] += h2[x]. The key observation is that since each value randomly hashes to +/-1, E[ A[h1[x]] * h2[x] ] = count(x), where E is the expected value of the expression, and count is the number of times x appeared in the stream.
Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining a large set of hash counters and taking the average or the minimum count from each set.
With this sketch data structure, you are able to get an approximate frequency of each item. Now, you simply maintain a list of 10 items with the largest frequency estimates till now, and at the end you will have your list.
How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you or Google or whoever)
But many of the tools and research is out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real-time
Also check out some of the Big Data and Web Science conferences like KDD or WSDM, as well as papers put out by Google Research
How to design such a system is challenging with no correct answer, but the tools and research are available to get you started

Confusion with neural networks in MATLAB

I'm working on character recognition (and later fingerprint recognition) using neural networks. I'm getting confused with the sequence of events. I'm training the net with 26 letters. Later I will increase this to include 26 clean letters and 26 noisy letters. If I want to recognize one letter say "A", what is the right way to do this? Here is what I'm doing now.
1) Train network with a 26x100 matrix; each row contains a letter from segmentation of the bmp (10x10).
2) However, for the test targets I use my input matrix for "A". I had 25 rows of zeros after the first row so that my input matrix is the same size as my target matrix.
3) I run perform(net, testTargets,outputs) where outputs are the outputs from the net trained with the 26x100 matrix. testTargets is the matrix for "A".
This doesn't seem right though. Is training supposed by separate from recognizing any character? What I want to happen is as follows.
1) Training the network for an image file that I select (after processing the image into logical arrays).
2) Use this trained network to recognize letter in a different image file.
So train the network to recognize A through Z. Then pick an image, run the network to see what letters are recognized from the picked image.
Okay, so it seems that the question here seems to be more along the lines of "How do I neural networks" I can outline the basic procedure here to try to solidify the idea in your mind, but as far as actually implementing it goes you're on your own. Personally I believe that proprietary languages (MATLAB) are an abomination, but I always appreciate intellectual zeal.
The basic concept of a neural net is that you have a series of nodes in layers with weights that connect them (depending on what you want to do you can either just connect each node to the layer above and beneath, or connect every node, or anywhere in betweeen.). Each node has a "work function" or a probabilistic function that represents the chance that the given node, or neuron will evaluate to "on" or 1.
The general workflow starts from whatever top layer neurons/nodes you've got, initializing them to the values of your data (in your case, you would probably start each of these off as the pixel values in your image, normalized to be binary would be simplest). Each of those nodes would then be multiplied by a weight and fed down towards your second layer, which would be considered a "hidden layer" depending on the sum (either geometric or arithmetic sum, depending on your implementation) which would be used with the work function to determine the state of your hidden layer.
That last point was a little theoretical and hard to follow, so here's an example. Imagine your first row has three nodes ([1,0,1]), and the weights connecting the three of those nodes to the first node in your second layer are something like ([0.5, 2.0, 0.6]). If you're doing an arithmetic sum that means that the weighting on the first node in your "hidden layer" would be
1*0.5 + 0*2.0 + 1*0.6 = 1.1
If you're using a logistic function as your work function (a very common choice, though tanh is also common) this would make the chance of that node evaluating to 1 approximately 75%.
You would probably want your final layer to have 26 nodes, one for each letter, but you could add in more hidden layers to improve your model. You would assume that the letter your model predicted would be the final node with the largest weighting heading in.
After you have that up and running you want to train it though, because you probably just randomly seeded your weights, which makes sense. There are a lot of different methods for this, but I'll generally outline back-propagation which is a very common method of training neural nets. The idea is essentially, since you know which character the image should have been recognized, you compare the result to the one that your model actually predicted. If your model accurately predicted the character you're fine, you can leave the model as is, since it worked. If you predicted an incorrect character you want to go back through your neural net and increment the weights that lead from the pixel nodes you fed in to the ending node that is the character that should have been predicted. You should also decrement the weights that led to the character it incorrectly returned.
Hope that helps, let me know if you have any more questions.

Resources