How to align asynchronous reports for a related asset? - algorithm

The example I will give is a train. A train has multiple train cars, and in my system each train car sends me a packet of info at a set interval. For example, I can guarantee I will have at least one packet from each train car every 10 minutes.
What I want to do is show an animation of the train cars proceeding along the map.
The problem I have is how to align all the data to make a train visualization, with each car shown in its position relative to the others, when the data is recorded at different times and in no order that corresponds to the order of the cars in the train.
What ends up happening is sometimes car2 will look like it's in front of car1! In the example below, if I showed the data reported for this period, car2 looks like it's on top of car1!
For example:
CAR | TIME | LOCATION | LOCATION OF CAR 1 AT THAT TIME (guessed, not reported)
  1 |   05 |    10,10 | 10,10
  2 |   06 |    10,10 | 10,11
  3 |   04 |    10,07 | 10,09
You may suggest that I look at my own example and visualize the guessed location of the other cars at a specific point. For a train where the route is known this is a good solution (and a likely candidate I'll be implementing). But if you change the scope of the problem from trains on tracks to semi-convoys on roads the guessing solution breaks down. There's no way to know if the first car in the convoy took a turn...
This returns the question to trying to find a reasonable analytical/computational solution to synchronizing the recorded metrics.
Is there a known strategy for dealing with this type of problem? Where do I start my research? Any prefab solutions?
The only analytical solution I can think of is to find the smallest window of available data points that forms the smallest possible frame of data, select the mean time for the frame, and then approximate the location of the other points based on the point closest to that mean.
This is very close to the strategy I disregarded earlier, where I use the known distance between the cars to guess where carA is based on carC. Because it's a train, I can assume that the speed is the same for all cars at any time. In a semi convoy this isn't true; semiACar1 could be slightly slower or faster than semiBCar1.
So by now I think you understand the problem and where I am.
Thanks for your ideas and interest.

If you don't have reported data, interpolate/extrapolate the positions. If your temporal resolution is high enough, this should work just fine.
If your temporal resolution isn't good enough, first model the likely track, then interpolate along the track.
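As a rough illustration of the interpolation idea (not part of the original answer), here is a minimal Python sketch that resamples every car's reports to one common timestamp so each car is drawn as of the same moment. The report data is hypothetical, loosely mirroring the table in the question:

from bisect import bisect_left

# Each car keeps a time-sorted list of (time, x, y) reports (hypothetical data).
reports = {
    1: [(0, 10.0, 5.0), (5, 10.0, 10.0), (10, 10.0, 15.0)],
    2: [(1, 10.0, 4.0), (6, 10.0, 10.0)],
    3: [(4, 10.0, 7.0), (9, 10.0, 12.0)],
}

def position_at(track, t):
    # Linearly interpolate a car's position at time t; clamp at the ends
    # (extrapolation along a modelled track could go here instead).
    times = [p[0] for p in track]
    i = bisect_left(times, t)
    if i == 0:
        return track[0][1:]
    if i == len(track):
        return track[-1][1:]
    (t0, x0, y0), (t1, x1, y1) = track[i - 1], track[i]
    f = (t - t0) / (t1 - t0)
    return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))

# Draw one frame: pick a common timestamp every car can cover, e.g. the
# earliest "latest report" across cars, and interpolate everyone to it.
t_frame = min(track[-1][0] for track in reports.values())
frame = {car: position_at(track, t_frame) for car, track in reports.items()}
print(t_frame, frame)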

Related

Graph databases (e.g. Neo4j) for similar matches

I am investigating an algorithm for similar matches and am trying to work out if a graph database would be the best data model for my solution. Let's use "find a similar car" as an example.
If we had car data like:
Owner | Make | Model | Engine | Colour
Jeff | Ford | Focus | 1400cc | Light Red
Bob | Ford | Focus | 1800cc | Dark Red
Paul | Ford | Mondeo | 2000cc | Blue
My understanding is that a graph database would be extremely performant with queries like:
Get me all owners who own a car of the same make as Jeff
Because you would start at the 'Jeff' node, follow the 'Make' edge to the 'Ford' node, and from here follow all the 'Owner' edges to get all people that own a Ford.
Now my question is would it be performant to do "Similar" lookups, eg:
Get me all owners whose car is within 500cc of Jeff
Presumably if you had "1400cc" as an Engine node, you would not be able to traverse the graph from here to find other Engines of a similar size, and so it would not be performant. My thinking is you would have to run some sort of overnight batch to create new edges between all Engine nodes, with the size difference between those two engines.
Have I understood correctly? Does a graph database seem like a good fit, or is there some other storage / retrieval / analysis method that would fit exactly to this problem?
What about in the case where I want to see the top 10 most similar cars, and my algorithm for similarity is something like "Start at 100%, deduct 2% for every 100cc difference, deduct 20% for different model, deduct 30% for different make, deduct 20% for different colour (or 5% if it's different shades of the same colour)". The only way I can think of doing this currently so that an application would be performant, would be to have a background task constantly iterating through the entire dataset and creating "similarity score" edges between every Owner.
Obviously with small datasets the solution doesn't really matter as any hodge-podge will be performant, but eventually we will have potentially hundreds of thousands of cars.
Any thoughts appreciated!
To get you started, here is a simple model, illustrated using sample data for "Jeff":
(make:Make {name: "Ford"})-[:MAKES]->(model:Model {name: "Focus", cc: 1400, year: 2016})
(o:Owner {name: "Jeff"})-[:OWNS]->(v:Vehicle {vin: "WVWZZZ6XZXW068123", plate: "ABC123", color: "Light Red"})-[:MODEL]->(model)
To get all owners who own a car of the same make as Jeff:
MATCH (o1:Owner { name: "Jeff" })-[:OWNS]->(:Vehicle)-[:MODEL]->(model:Model)<-[:MAKES]-(make:Make)
MATCH (make)-[:MAKES]->(:Model)<-[:MODEL]-(:Vehicle)<-[:OWNS]-(owners:Owner)
RETURN DISTINCT owners;
To get all owners whose car is within 500cc of Jeff:
MATCH (o1:Owner { name: "Jeff" })-[:OWNS]->(:Vehicle)-[:MODEL]->(model:Model)<-[:MAKES]-(make:Make)
MATCH (make)-[:MAKES]->(x:Model)
WHERE (x.cc >= model.cc - 500) AND (x.cc <= model.cc + 500)
MATCH (x)<-[:MODEL]-(:Vehicle)<-[:OWNS]-(owners:Owner)
RETURN DISTINCT owners;
The above queries will be a bit faster if you first create an index on :Owner(name):
CREATE INDEX ON :Owner(name);
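For the weighted "top 10 most similar" scoring described in the question, another option is to compute the score in application code instead of materialising similarity edges between every pair of owners. A minimal Python sketch of that exact formula, using hypothetical car records (the field names are made up for illustration):

def similarity(a, b):
    # Score per the question: start at 100%, deduct 2% per 100cc difference,
    # 20% for a different model, 30% for a different make, and 20% for a
    # different colour (5% if it's a different shade of the same base colour).
    score = 100.0
    score -= 2.0 * abs(a["cc"] - b["cc"]) / 100
    if a["make"] != b["make"]:
        score -= 30.0
    if a["model"] != b["model"]:
        score -= 20.0
    if a["colour"] != b["colour"]:
        same_base = a["colour"].split()[-1] == b["colour"].split()[-1]   # crude shade check
        score -= 5.0 if same_base else 20.0
    return max(score, 0.0)

cars = {
    "Jeff": {"make": "Ford", "model": "Focus",  "cc": 1400, "colour": "Light Red"},
    "Bob":  {"make": "Ford", "model": "Focus",  "cc": 1800, "colour": "Dark Red"},
    "Paul": {"make": "Ford", "model": "Mondeo", "cc": 2000, "colour": "Blue"},
}

target = cars["Jeff"]
ranked = sorted(((similarity(target, car), name) for name, car in cars.items() if name != "Jeff"),
                reverse=True)
print(ranked[:10])   # top 10 most similar to Jeff's car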
As @manonthemat said in the comments, there's no best answer for your question, but I'll try to provide a data model to help you.
First of all, you have to know which properties will be "the same" across your matches, like this:
Get me all owners who own a car of the same make as Jeff
Here, you'll want to create one node per make, and create a relationship from each car to its make node.
(Example data model diagram for this use case omitted.)
You can still create one node per property value, but it's not always the best approach: if a property has effectively unlimited possible values, you'll end up having to create one node per value.
Keep in mind that graph databases are really good for data modeling, because their relationship management is really easy to understand and use. So everything comes down to the data model, and each data model is unique. This guide should help you.

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights if a point is misclassified, until none are misclassified anymore. I guess what I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a Perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shape) while telling it what it sees at every turn ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn to understand what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shape) without you actually telling it. This is what a Perceptron does. After you have shown it all the examples, you start again at the beginning; this is called a new epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child the exact same data as during training? Because the child has perfect memory, it will only tell you what you told it. You won't see how well it generalizes from known to unseen data unless you test on data that you never showed it during training. If the child performs horribly on the test data but gets 100% on the training data, you will know that it has learned nothing useful; it is simply repeating what it was told during training. You trained it too long, and it only memorized your examples without understanding what makes an apple an apple because you gave it too many details; this is called overfitting. To prevent your Perceptron from only (!) recognizing training data, you'll have to stop training at a reasonable time and find a good balance between the size of the training and test sets.
How do I know if a point is misclassified?
If it's different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading into Single/MultiLayer Perceptrons and how neural networks of multiple Perceptrons work). The network will take your input. How it's coded is irrelevant here; let's say the input is the string "apple". Your training set then is {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1), ...}. Since you know the class beforehand, the network will either output 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1. A non-zero value means the network gave a wrong output. Compare this to the delta rule and you will see that this small difference is part of the larger update equation. Whenever you get a non-zero value, you perform a weight update. If target and actual value are the same, you get 0 and you know that the network didn't misclassify.
How do I go about choosing test points, training points, threshold or a bias?
Practically, the bias and threshold aren't "chosen" per se. The bias is trained like any other weight using a simple "trick": treat the bias as an additional input unit fixed to the value 1. The actual bias value is then encoded in this extra unit's weight, and the learning algorithm learns the bias for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, classification uses a step function: output 1 if the weighted sum of the inputs reaches the threshold, and 0 otherwise.
Since we use a binary output (0 or 1), it's a good start to put the threshold at 0.5, since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you do it by experience. Where you're at, you start off by implementing simple logical functions like AND, OR, XOR etc. There it's trivial: you put everything in your training set and test with the same values as your training set (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). For complex data like images, audio etc. you'll have to tweak your data and features until you feel like the network works with them as well as you want it to.
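To make the bias-as-extra-input trick and the update rule above concrete, here is a minimal Python sketch of a single perceptron with a step activation, trained on the AND function. It is an illustrative toy (not the homework solution), and the 0.5 threshold mentioned above is absorbed into the bias weight here, so the comparison is against 0:

def predict(weights, x):
    # Step activation: 1 if the weighted sum (bias included) is >= 0, else 0.
    s = sum(w * xi for w, xi in zip(weights, [1.0] + list(x)))
    return 1 if s >= 0 else 0

def train(data, lr=0.1, max_epochs=100):
    weights = [0.0, 0.0, 0.0]            # initial choice: all zeros (bias weight + 2 inputs)
    updates = 0
    for epoch in range(max_epochs):
        errors = 0
        for x, target in data:
            delta = target - predict(weights, x)    # non-zero means the point is misclassified
            if delta != 0:
                errors += 1
                updates += 1
                weights = [w + lr * delta * xi
                           for w, xi in zip(weights, [1.0] + list(x))]
        if errors == 0:                  # converged: every training point classified correctly
            return weights, updates, epoch + 1
    return weights, updates, max_epochs

training = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # the AND function
weights, updates, epochs = train(training)
print("weights:", weights, "updates:", updates, "epochs:", epochs)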
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set, you know what class it belongs to, so you can compare the class your perceptron predicts against the true label.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the labelled data you have and split it around 80/20. You train on the 80% and test against the remaining 20%.
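A minimal sketch of that 80/20 split, with a hypothetical list of (point, label) pairs:

import random

# Hypothetical labelled 2-D points: ((x, y), label) pairs.
data = [((1.0, 2.0), 0), ((2.0, 1.0), 0), ((4.0, 5.0), 1), ((5.0, 4.0), 1), ((3.0, 3.5), 1)]

random.shuffle(data)                       # remove any ordering bias before splitting
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]
print(len(train_set), "training points,", len(test_set), "test points")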

How to notice unusual news activity

Suppose you were able to keep track of the news mentions of different entities, like say "Steve Jobs" and "Steve Ballmer".
What are some ways you could tell whether the number of mentions per entity in a given time period was unusual relative to their normal frequency of appearance?
I imagine that for a more popular person like Steve Jobs an increase of 50% might be unusual (an increase from 1000 to 1500), while for a relatively unknown CEO an increase of 1000% on a given day could be possible (an increase from 2 to 200). If you didn't have a way of scaling that, your unusualness index could be dominated by unheard-ofs getting their 15 minutes of fame.
update: To make it clearer, it's assumed that you are already able to get a continuous news stream and identify entities in each news item and store all of this in a relational data store.
You could use a rolling average. This is how a lot of stock trackers work. By tracking the last n data points, you could see if this change was a substantial change outside of their usual variance.
You could also try some normalization. One very simple approach: each category has a total number of mentions (m), a percent change from the last time period (δ), and a normalized value (z), where z = m * δ. Let's look at the table below (m0 is the previous value of m):
Name          |    m |   m0 |    δ |    z
--------------|------|------|------|-----
Steve Jobs    | 4950 | 4500 | 0.10 |  495
Steve Ballmer |  400 |  300 | 0.33 |  132
Larry Ellison |   50 |   10 | 4.00 |  200
Andy Nobody   |   50 |   40 | 0.25 | 12.5
Here, a 400% change for relatively unknown Larry Ellison results in a z value of 200, a 10% change for the much better known Steve Jobs gives 495, and my spike of 25% is still a low 12.5. You could tweak this algorithm depending on what you feel are good weights, or use the standard deviation or the rolling average to find whether a value is far away from the "expected" results.
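A minimal Python sketch of that normalisation, assuming you already have the current and previous mention counts per name (the numbers are the ones from the table; δ is computed exactly rather than rounded):

# Hypothetical counts: name -> (current mentions m, previous mentions m0).
counts = {
    "Steve Jobs":    (4950, 4500),
    "Steve Ballmer": (400, 300),
    "Larry Ellison": (50, 10),
    "Andy Nobody":   (50, 40),
}

scores = {}
for name, (m, m0) in counts.items():
    delta = (m - m0) / m0            # percent change from the last period
    scores[name] = m * delta         # z = m * delta

for name, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} z = {z:.1f}")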
Create a database and keep a history of stories with a time stamp. You then have a history of stories over time of each category of news item you're monitoring.
Periodically calculate the number of stories per unit of time (you choose the unit).
Test whether the current value is more than X standard deviations away from the historical mean.
Some data will be more volatile than others, so you may need to adjust X appropriately. X=1 is a reasonable starting point.
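A minimal sketch of that check for a single entity, with hypothetical historical counts; the mean and standard deviation come straight from the stored history:

from statistics import mean, stdev

history = [12, 9, 15, 11, 13, 10, 14, 12]    # hypothetical past stories-per-day for one entity
current = 31                                 # today's count

x = 1                                        # the suggested starting point; tune per entity
threshold = mean(history) + x * stdev(history)
if current > threshold:
    print(f"unusual: {current} vs threshold {threshold:.1f}")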
Way over-simplified:
Store people's names and the number of articles created in the past 24 hours that involve their name. Compare to historical data.
Real life:
If you're trying to dynamically pick out people's names, how would you go about doing that? When searching through articles, how do you grab names? Once you grab a new name, do you search all articles for it? How do you separate Steve Jobs of Apple from Steve Jobs the new star running back who is generating a lot of articles?
If you're looking for simplicity, create a table with 50 people's names that you actually insert. Every day at midnight, have your program run a quick Google query for the past 24 hours and store the number of results. There are a lot of variables here, though, that we're not accounting for.
The method you use is going to depend on the distribution of the counts for each person. My hunch is that they are not going to be normally distributed, which means that some of the standard approaches to longitudinal data might not be appropriate - especially for the small-fry, unknown CEOs you mention, who will have data that are very much non-continuous.
I'm really not well-versed enough in longitudinal methods to give you a solid answer here, but here's what I'd probably do if you locked me in a room to implement this right now:
Dig up a bunch of past data. Hard to say how much you'd need, but I would basically go until it gets computationally insane or the timeline gets unrealistic (not expecting Steve Jobs references from the 1930s).
In preparation for creating a simulated "probability distribution" of sorts (I'm using terms loosely here), more recent data needs to be weighted more than past data - e.g., a thousand years from now, hearing one mention of (this) Steve Jobs might be considered a noteworthy event, so you wouldn't want to be using expected counts from today (Andy's rolling mean is using this same principle). For each count (day) in your database, create a sampling probability that decays over time. Yesterday is the most relevant datum and should be sampled frequently; 30 years ago should not.
Sample out of that dataset using the weights and with replacement (i.e., same datum can be sampled more than once). How many draws you make depends on the data, how many people you're tracking, how good your hardware is, etc. More is better.
Compare your actual count of stories for the day in question to that distribution. What percent of the simulated counts lie above your real count? That's roughly (god don't let any economists look at this) the probability of your real count or a larger one happening on that day. Now you decide what's relevant - 5% is the norm, but it's an arbitrary, stupid norm. Just browse your results for awhile and see what seems relevant to you. The end.
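A minimal Python sketch of that resampling idea, with hypothetical daily counts and an exponential decay chosen arbitrarily for the sampling weights:

import random

# Hypothetical history of daily mention counts for one person, oldest first.
history = [3, 5, 2, 8, 4, 6, 3, 7, 5, 4]
today = 12

# More recent days get higher sampling weight (exponential decay into the past;
# the 0.9 decay factor is an arbitrary choice).
n = len(history)
weights = [0.9 ** (n - 1 - i) for i in range(n)]

draws = 10_000
sample = random.choices(history, weights=weights, k=draws)   # sampling with replacement

# Fraction of simulated counts at or above today's count: roughly "how surprising is today?"
p = sum(1 for s in sample if s >= today) / draws
print(f"approximate probability of a count this large or larger: {p:.3f}")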
Here's what sucks about this method: there's no trend in it. If Steve Jobs had 15,000 a week ago, 2000 three days ago, and 300 yesterday, there's a clear downward trend. But the method outlined above can only account for that by reducing the weights for the older data; it has no way to project that trend forward. It assumes that the process is basically stationary - that there's no real change going on over time, just more and less probable events from the same random process.
Anyway, if you have the patience and willpower, check into some real statistics. You could look into multilevel models (each day is a repeated measure nested within an individual), for example. Just beware of your parametric assumptions... mention counts, especially on the small end, are not going to be normal. If they fit a parametric distribution at all, it would be in the Poisson family: the Poisson itself (good luck), the overdispersed Poisson (aka negative binomial), or the zero-inflated Poisson (quite likely for your small-fry, no chance for Steve).
Awesome question, at any rate. Lend your support to the statistics StackExchange site, and once it's up you'll be able to get a much better answer than this.

When calculating trends, how do you account for low sample size?

I'm doing some work processing some statistics for home approvals in a given month. I'd like to be able to show trends - that is, which areas have seen a large relative increase or decrease since the last month(s).
My first naive approach was to just calculate the percentage change between two months, but that has problems when the counts are very low - any change at all is magnified:
// diff = (new - old) / old
Area         | June | July | Diff  |
-------------|------|------|-------|
South Sydney |  427 |  530 |  +24% |
North Sydney |  167 |  143 |  -14% |
Dubbo        |    1 |    3 | +200% |
I don't want to just ignore any area or value as an outlier, but I don't want Dubbo's increase of 2 per month to outshine the increase of 103 in South Sydney. Is there a better equation I could use to show more useful trend information?
This data is eventually being plotted on Google Maps. In this first attempt, I'm just converting the difference to a "heatmap colour" (blue = decrease, green = no change, red = increase). Perhaps using some other metric to alter the view of each area might be a solution; for example, change the alpha channel based on the total number of approvals or something similar. In this case, Dubbo would be bright red but quite transparent, whereas South Sydney would be closer to yellow but quite opaque.
Any ideas on the best way to show this data?
Look into measures of statistical significance. It could be as simple as assuming counting statistics.
In a very simple minded version, the thing you plot is
(A_2 - A_1)/sqrt(A_2 + A_1)
i.e. change over 1 sigma in simple counting statistics.
Which makes the above chart look like:
Area         | Reduced difference
-------------|-------------------
South Sydney | +3.3
North Sydney | -1.3
Dubbo        | +1.0
which is interpreted as meaning that South Sydney has experienced a significant (i.e. important, and possibly related to a real underlying cause) increase, while North Sydney and Dubbo saw relatively minor changes that may or may not point to a trend. Rule of thumb:
1 sigma changes are just noise
3 sigma changes probably point to an underlying cause (and therefore the expectation of a trend)
5 sigma changes almost certainly point to a trend
Areas with very low rates (like Dubbo) will still be volatile, but they won't overwhelm the display.
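A minimal sketch of that reduced-difference calculation on the question's numbers:

from math import sqrt

approvals = {                  # (June, July) counts from the question
    "South Sydney": (427, 530),
    "North Sydney": (167, 143),
    "Dubbo":        (1, 3),
}

for area, (a1, a2) in approvals.items():
    reduced = (a2 - a1) / sqrt(a2 + a1)   # change in units of 1 sigma (counting statistics)
    print(f"{area:12s} {reduced:+.2f}")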
This is really a statistics question. I'm not a statistician, but I suspect the answer is along the lines of well, you have no data — what do you expect‽
Perhaps you could merge Dubbo with a nearby region? You've sliced your data small enough that your signal has fallen below noise.
You could also just not show Dubbo, or make a color for not enough data.
I kinda like your transparency idea -- the data you're confident about is opaque and the data you're not confident is transparent. It's easy for the user to understand, but it will look cluttered.
My take: don't use a heatmap. It's for continuous data, while yours is discrete. Use dots. Colour represents the increase/decrease in the surrounding region, and raw volume is proportional to the size of the dot.
Now how does the user know which region a dot represents? Where does South Sydney turn into North Sydney? The best approach would be to add Voronoi-like guiding lines between the dots, but smartly placed rectangles will do too.
If you happen to have the area of each region in units such as sq. km, you can normalize your data by calculating home approvals/km^2 to get home approval density, and use that in your equation rather than the raw count of home approvals. This would fix the problem if Dubbo has fewer home approvals than other regions due to its size. You could also normalize by population, if you have it, to get the number of home approvals per person.
Maybe you could use the totals. Add up all old and new values, which gives old=595, new=676, diff=+13.6%. Then calculate each change based on the old total, which gives you +17.3% / -4.0% / +0.3% for the three places.
With a heat map you are generally attempting to show easily assimilated information. Anything too complex would probably be counter-productive.
In the case of Dubbo, the reality is that you don't have the data to draw any firm conclusions about it, so I'd color it white, say. You could possibly label it with the difference/current value too.
I think this would be preferable to possibly misleading the users.
I would highly recommend going with a hierarchical model (i.e., partial pooling). Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill is an excellent resource on the topic.
You can use an exact test like Fisher's exact test (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), or use Student's t-test (http://en.wikipedia.org/wiki/Student%27s_t-test), both of which are designed for low sample sizes.
As a note, the t-test is pretty much the same as a z-test but in the t-test you don't have to know the standard deviation nor do you have to approximate it like you would if you did a z-test.
You can apply a z or t test without any justification in 99.99% of cases because of the Central Limit Theorem (http://en.wikipedia.org/wiki/Central_limit_theorem); formally, you only need the underlying distribution X to have finite variance. You don't need justification for the Fisher test either; it's exact and does not make any assumptions.
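A minimal sketch of Fisher's exact test using SciPy, framing one area (Dubbo) against the rest of the listed areas for the two months; this 2x2 framing is my assumption, not something the answer prescribes:

from scipy.stats import fisher_exact

# 2x2 contingency table: rows = month, columns = (Dubbo, all other listed areas).
table = [[1, 427 + 167],    # June
         [3, 530 + 143]]    # July

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio {odds_ratio:.2f}, p-value {p_value:.3f}")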

K Nearest Neighbour Algorithm doubt

I am new to Artificial Intelligence. I understand the K nearest neighbour algorithm and how to implement it. However, how do you calculate the distance or weight of things that aren't on a scale?
For example, the distance between ages can be easily calculated, but how do you calculate how near red is to blue? Maybe colour is a bad example because you could still use the frequency. How about a burger to pizza to fries, for example?
I got a feeling there's a clever way to do this.
Thank you in advance for your kind attention.
EDIT: Thank you all for very nice answers. It really helped and I appreciate it. But I am thinking there must be a way out.
Can I do it this way? Let's say I am using my KNN algorithm to predict whether a person will eat at my restaurant, which serves all three of the above foods. Of course, there are other factors, but to keep it simple, for the favourite food field, out of 300 people, 150 love burgers, 100 love pizza, and 50 love fries. Common sense tells me favourite food affects people's decision on whether to eat there or not.
So now a person enters their favourite food as burger, and I am going to predict whether they are going to eat at my restaurant. Ignoring other factors, and based on my previous (training) knowledge base, common sense tells me that the k nearest neighbours' distance for this particular favourite-food field will be smaller than if they had entered pizza or fries.
The only problem with that is that I used probability, and I might be wrong because I don't know and probably can't calculate the actual distance. I also worry about this field putting too much/too little weight on my prediction because the distance probably isn't to scale with other factors (price, time of day, whether the restaurant is full, etc that I can easily quantify) but I guess I might be able to get around it with some parameter tuning.
Oh, everyone put up a great answer, but I can only accept one. In that case, I'll just accept the one with highest votes tomorrow. Thank you all once again.
Represent all food for which you collect data as a "dimension" (or a column in a table).
Record "likes" for every person on whom you can collect data, and place the results in a table:
        | Burger | Pizza | Fries | Burritos | Likes my food
person1 |      1 |     0 |     1 |        1 |             1
person2 |      0 |     0 |     1 |        0 |             0
person3 |      1 |     1 |     0 |        1 |             1
person4 |      0 |     1 |     1 |        1 |             0
Now, given a new person, with information about some of the foods he likes, you can measure similarity to other people using a simple measure such as the Pearson Correlation Coefficient, or the Cosine Similarity, etc.
Now you have a way to find the K nearest neighbors and make some decision.
For more advanced information on this, look up "collaborative filtering" (but I'll warn you, it gets math-y).
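A minimal Python sketch of that approach: cosine similarity between the binary "likes" vectors from the table above, with the k most similar people voting on the prediction:

from math import sqrt

# Rows from the table above: [burger, pizza, fries, burritos] likes, plus the known label.
people = {
    "person1": ([1, 0, 1, 1], 1),
    "person2": ([0, 0, 1, 0], 0),
    "person3": ([1, 1, 0, 1], 1),
    "person4": ([0, 1, 1, 1], 0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, k=3):
    # Rank everyone by similarity to the query vector and let the k nearest vote.
    ranked = sorted(people.values(), key=lambda pv: cosine(query, pv[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return 1 if sum(votes) * 2 >= k else 0   # simple majority vote

new_person = [1, 0, 0, 1]                    # likes burgers and burritos
print(knn_predict(new_person))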
Well, 'nearest' implies that you have some metric on which things can be more or less 'distant'. Quantification of 'burger', 'pizza', and 'fries' isn't so much a KNN problem as it's about fundamental system modeling. If you have a system where you're doing analysis where 'burger', 'pizza', and 'fries' are terms, the reason for the system to exist is going to determine how they're quantified -- like if you're trying to figure out how to get the best taste and least calories for a given amount of money, then ta-da, you know what your metrics are. (Of course, 'best taste' is subjective, but that's another set of issues.)
It's not up to these terms to have inherent quantifiability and thereby to tell you how to design your system of analysis; it's up to you to decide what you're trying to accomplish and design metrics from there.
This is one of the problems of knowledge representation in AI. Subjectivity plays a big part. Would you and I agree, for example, on the "closeness" of a burger, pizza and fries?
You'd probably need a look up matrix containing the items to be compared. You may be able to reduce this matrix if you can assume transitivity, but I think even that would be uncertain in your example.
The key may be to try and determine the feature that you are trying to compare on. For example, if you were comparing your food items on health, you may be able to get at something more objective.
If you look at "Collective Intelligence", you'll see that they assign a scale and a value. That's how Netflix is comparing movie rankings and such.
You'll have to define "nearness" by coming up with that scale and assigning values for each.
I would actually present pairs of these attributes to users and ask them to define their proximity. You would present them with a scale reaching from [synonym..very foreign] or similar. Having many people do this you will end up with a widely accepted proximity function for the non-linear attribute values.
There is no "best" way to do this. Ultimately, you need to come up with an arbitrary scale.
Good answers. You could just make up a metric, or, as malach suggests, ask some people. To really do it right, it sounds like you need bayesian analysis.
