Graph databases (e.g. Neo4j) for similar matches - algorithm

I am investigating an algorithm for similar matches and am trying to work out if a graph database would be the best data model for my solution. Let's use "find a similar car" as an example.
If we had car data like:
Owner | Make | Model | Engine | Colour
Jeff | Ford | Focus | 1400cc | Light Red
Bob | Ford | Focus | 1800cc | Dark Red
Paul | Ford | Mondeo | 2000cc | Blue
My understanding is that a graph database would be extremely performant with queries like:
Get me all owners who own a car of the same make as Jeff
Because you would start at the 'Jeff' node, follow the 'Make' edge to the 'Ford' node, and from here follow all the 'Owner' edges to get all people that own a Ford.
Now my question is: would it be performant to do "similar" lookups, e.g.:
Get me all owners whose car is within 500cc of Jeff's
Presumably if you had "1400cc" as an Engine node, you would not be able to traverse the graph from here to find other Engines of a similar size, and so it would not be performant. My thinking is you would have to run some sort of overnight batch to create new edges between all Engine nodes, with the size difference between those two engines.
Have I understood correctly? Does a graph database seem like a good fit, or is there some other storage / retrieval / analysis method that would fit exactly to this problem?
What about in the case where I want to see the top 10 most similar cars, and my algorithm for similarity is something like "Start at 100%, deduct 2% for every 100cc difference, deduct 20% for different model, deduct 30% for different make, deduct 20% for different colour (or 5% if it's different shades of the same colour)". The only way I can think of doing this currently so that an application would be performant would be to have a background task constantly iterating through the entire dataset and creating "similarity score" edges between every pair of Owners.
Obviously with small datasets the solution doesn't really matter as any hodge-podge will be performant, but eventually we will have potentially hundreds of thousands of cars.
Any thoughts appreciated!

To get you started, here is a simple model, illustrated using sample data for "Jeff":
(make:Make {name: "Ford"})-[:MAKES]->(model:Model {name: "Focus", cc: 1400, year: 2016})
(o:Owner {name: "Jeff"})-[:OWNS]->(v:Vehicle {vin: "WVWZZZ6XZXW068123", plate: "ABC123", color: "Light Red"})-[:MODEL]->(model)
To get all owners who own a car of the same make as Jeff:
MATCH (o1:Owner { name: "Jeff" })-[:OWNS]->(:Vehicle)-[:MODEL]->(model:Model)<-[:MAKES]-(make:Make)
MATCH (make)-[:MAKES]->(:Model)<-[:MODEL]-(:Vehicle)<-[:OWNS]-(owners:Owner)
RETURN DISTINCT owners;
To get all owners whose car is within 500cc of Jeff:
MATCH (o1:Owner { name: "Jeff" })-[:OWNS]->(:Vehicle)-[:MODEL]->(model:Model)<-[:MAKES]-(make:Make)
MATCH (make)-[:MAKES]->(x:Model)
WHERE (x.cc >= model.cc - 500) AND (x.cc <= model.cc + 500)
MATCH (x)<-[:MODEL]-(:Vehicle)<-[:OWNS]-(owners:Owner)
RETURN DISTINCT owners;
The above queries will be a bit faster if you first create an index on :Owner(name):
CREATE INDEX ON :Owner(name);
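For the "top 10 most similar" part of the question, one option is to compute the weighted score in application code after fetching candidate vehicles with queries like the ones above, rather than materialising similarity edges overnight. A minimal sketch, assuming each car is a plain dict with make/model/cc/colour fields and that a same_colour_family helper exists (both are illustrative assumptions, not part of any Neo4j API):

def similarity(target, other, same_colour_family):
    # Score per the question's rule set: start at 100%, deduct 2% per 100cc
    # difference, 20% for a different model, 30% for a different make,
    # 20% for a different colour (only 5% for a different shade of the same colour).
    score = 100.0
    score -= 2.0 * abs(target["cc"] - other["cc"]) / 100.0
    if target["model"] != other["model"]:
        score -= 20.0
    if target["make"] != other["make"]:
        score -= 30.0
    if target["colour"] != other["colour"]:
        score -= 5.0 if same_colour_family(target["colour"], other["colour"]) else 20.0
    return max(score, 0.0)

def top_similar(target, candidates, same_colour_family, k=10):
    # Sort the fetched candidate cars by descending similarity and keep the top k.
    return sorted(candidates,
                  key=lambda c: similarity(target, c, same_colour_family),
                  reverse=True)[:k]

With a few hundred thousand cars this is still a linear scan per query, so you would either cache the results or narrow the candidate set first (e.g. by make or engine-size range) using queries like the ones above.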

As @manonthemat said in the comments, there's no best answer for your question, but I'll try to provide a data model to help you.
First of all, you have to know which properties will be "the same" on your matches, like this:
Get me all owners who own a car of the same make as Jeff
Here, you'll want to create one node per Make, and create a relationship from each car to its Make node.
Example data model for this use case:
You can still create one node per property value, but that's not always the best approach: if a property has an effectively unlimited range of possible values, you'd end up creating one node per value.
Keep in mind that graph databases are really good for data modeling, because their relationship management is really easy to understand and use. So everything is about the data model, and each data model is unique. This guide should help you.

Related

How to align asynchronous reports for a related asset?

The example I will give is a train. A train has multiple train-cars and in my system each train car will send me a packet of info on a determined interval. For example, I can guarantee I will have at least 1 packet from each train-car every 10 minutes.
What I want to do is show an animation of the train cars proceeding along the map.
The problem I have is how to align all the data to make a train visualization - with each car in its position relative to the others - when the data is recorded at different times and without any order relevant to the order of the train cars?
What ends up happening is sometimes car2 will look like it's in front of car1! In the example below, if I showed the data reported for this period, car2 looks like it's on top of car1!
For example
CAR | TIME | LOCATION | LOCATION of car 1 at that time (guessed, not reported)
1   | 05   | 10,10    | 10,10
2   | 06   | 10,10    | 10,11
3   | 04   | 10,07    | 10,09
You may suggest that I look at my own example and visualize the guessed location of the other cars at a specific point. For a train where the route is known this is a good solution (and a likely candidate I'll be implementing). But if you change the scope of the problem from trains on tracks to semi-convoys on roads the guessing solution breaks down. There's no way to know if the first car in the convoy took a turn...
This returns the question to trying to find a reasonable analytical/computational solution to synchronizing the recorded metrics.
Is there a known strategy for dealing with this type of problem? Where do I start my research? Any prefab solutions?
The only solution I can think of that's analytical will be to find the smallest window of available data from the data points to create the smallest possible frame of data, select the mean time for the frame and then approximate the location for the other points based on the point closest to the mean.
This is very close to the strategy I disregarded earlier, where I use the known distance between the cars to guess where carA is based on carC. Because it's a train, I can assume that the speed is the same for all cars at any time. In a semi convoy this isn't true: semiACar1 could be slightly slower or faster than semiBcar1.
So by now I think you understand the problem and where I am.
Thanks for your ideas and interest.
If you don't have reported data, interpolate/extrapolate the positions. If your temporal resolution is high enough, this should work just fine.
If your temporal resolution isn't good enough, first model the likely track, then interpolate along the track.
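As a rough illustration of the interpolation approach, here is a hedged sketch that estimates every car's position at a single reference time by linearly interpolating between its two surrounding reports; the data layout is assumed for the example, not taken from the question:

from bisect import bisect_left

def position_at(reports, t):
    # reports: list of (time, (x, y)) tuples for one car, sorted by time.
    # Returns the (x, y) position linearly interpolated at time t.
    times = [time for time, _ in reports]
    i = bisect_left(times, t)
    if i == 0:
        return reports[0][1]        # before the first report: clamp
    if i == len(reports):
        return reports[-1][1]       # after the last report: clamp (or extrapolate)
    (t0, (x0, y0)), (t1, (x1, y1)) = reports[i - 1], reports[i]
    f = (t - t0) / (t1 - t0)
    return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))

def aligned_frame(reports_by_car, t):
    # Estimate every car's position at the same reference time before drawing the train.
    return {car: position_at(reports, t) for car, reports in reports_by_car.items()}

For the road-convoy case the straight-line interpolation would be replaced by interpolation along the modelled track, as suggested above.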

Ranking/weighing search results

I am trying to build an application that has a smart adaptive search engine (let's say for cars). If I search for 4x4 then the DB will return all the 4x4 cars I have (100 cars) - but as time goes by and I start checking out cars, liking them, commenting on them, etc., the order of the search results should be different. That means 1 month later when searching for 4x4, I should get the same result set ordered differently as per my previous interaction with the site. If I was mainly liking and commenting on German cars, BMW should be at the top and the Land Cruiser should be further down.
This ranking should be based on attributes that I capture during user interaction (e.g. car origin, user age, user location, car type [4x4, coupe, hatchback], price range). So for each car result I get, I will be weighing it based on how well it is performing on the 5 attributes above.
I intend to use the DB just as a repository and do the ranking and the thinking on the server. My question is, what kind of algorithm should I be using to weigh/rank my search result?
Thanks.
You're basically saying that you already have several ordering schemes:
Keyword search result
amount of likes for car's category
likely others, such as popularity, some form of date, etc.
What you do then is make up a new scheme, call it relevance:
relevance = W1 * keyword_score + W2*likes_score + ...
and sort by relevance. Experiment with the weights W1, W2, ..., until you get something you find useful.
From my understanding search engines work on this principle. It's been long thrown around that Google has on the order of 200 different inputs into the relevance score, PageRank being just one. The beauty of this approach is that it lets you fine tune the importance of everything (even individually for every query), and it lets you add additional inputs without screwing everything up.
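A minimal sketch of that weighted-sum relevance; the feature names and weight values here are placeholders you would tune against your own interaction data, not anything prescribed above:

WEIGHTS = {"keyword": 1.0, "likes": 0.5, "comments": 0.3, "recency": 0.2}

def relevance(car_scores):
    # car_scores: dict mapping feature name -> score normalised to 0..1 for one car.
    return sum(weight * car_scores.get(name, 0.0) for name, weight in WEIGHTS.items())

def rank(results):
    # results: list of (car, car_scores) pairs; highest relevance first.
    return sorted(results, key=lambda r: relevance(r[1]), reverse=True)

Normalising each feature score to a common range before weighting keeps any single input from dominating simply because of its units.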

Multi Attribute Matching of Profiles

I am trying to solve a problem for a dating site. Here is the problem:
Each user of the app will have some attributes - like the books he reads, the movies he watches, music, TV shows, etc. These are predefined top-level attribute categories. Each of these categories can have any number of values, e.g. in books: Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data with reverse indexing, i.e. each of Fountain Head, Love Story, etc. is an index key pointing to the set of users with that attribute.
When a new user joins, get the attributes of this user, find the index keys for this user, get all the users for those keys, then bucket sort (or radix sort or similar) on the basis of how many times each user appears in this merged list.
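To make the plan concrete, a small Python sketch of the reverse index and the counting step (the structure names are made up for illustration):

from collections import defaultdict

# Reverse index: attribute value -> set of user ids that list it.
users_by_attribute = defaultdict(set)

def add_user(user_id, attributes):
    for value in attributes:               # e.g. "Fountain Head", "Love Story"
        users_by_attribute[value].add(user_id)

def candidate_matches(new_user_attributes):
    # Count how many attribute values each existing user shares with the new user.
    counts = defaultdict(int)
    for value in new_user_attributes:
        for user_id in users_by_attribute[value]:
            counts[user_id] += 1
    # Users with the most overlap first (a bucket/counting sort also works here).
    return sorted(counts, key=counts.get, reverse=True)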
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes for every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (a book, a song, etc.) of the people who have it in their profile. Then, say you want to find matches for guy g. Here is that idea as a Python sketch, where people_by_interest maps each interest to the people who list it, match_weight holds the per-category weights, and preferences_compatible checks the sexual-preference filter (all names are illustrative):

scores = {}
for interest in g.interests:
    for p in people_by_interest[interest]:
        if not preferences_compatible(p, g):    # skip mismatching sexual preferences
            continue
        # accumulate this interest category's weight for every shared interest
        scores[p] = scores.get(p, 0.0) + match_weight[interest]
# best matches first
match_list = sorted(scores, key=scores.get, reverse=True)
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example, a certain set of matching interests may have more (or in some cases less) effect than the sum of them individually. Also, an interest of one person may result in outright rejection by the other, no matter what other matching interests exist (take two otherwise very similar people where one loves and the other hates Twilight, for example).

Displaying sorted data in multiple columns

Suppose I'm trying to display all US states in two columns, ordered alphabetically. Which approach is better from the usability standpoint?
Is it sorting horizontally, like:
Alabama | Alaska
Arizona | Arkansas
Colorado | Connecticut
Delaware | Georgia
or is it vertically, like:
Alabama | Montana
Alaska | Nebraska
Arizona | New Hampshire
Arkansas | New Jersey
I tried googling for an authoritative answer backed by some testing, but all I've found are opinions.
Is it just a personal preference thing and no option is better than the other?
It’s faster and easier for users to scan down one column of words than across a row of words, especially if they are searching for a specific target word (e.g., “Is my state on this list?”). See Parkinson SR, Sisson N, & Snowberry K, 1985. Organization of broad computer menu displays, International Journal of Man-Machine Studies, 23, 689-697. Ordering down a column is also a requirement (5.14.3.5.6.5) in the US DOD Design Criteria Standard - Human Engineering (MIL-STD-1472-F), presumably for human performance reasons.
In this case, I would expect an especially large performance advantage to sorting vertically because it has fewer direction changes for the eyes. For vertically sorted, users only have to reverse direction once to get to the top of the second column, while for horizontally sorted, the number of reversals is equal to the number of items divided by two. I believe these reversals predict scan speed and eye fatigue.
Be sure to use graphic design to communicate that the list is sorted vertically, such as by including a vertical rule like you did in your example.
It’s quite important that the list fit within a window-full so that users don’t have to scroll down and up and down to read the whole list. Scrolling costs time and effort. Worse, some users may not realize the list continues below what they can see (“Oh, I guess my state isn’t on this list”). Better to add columns (or use only one column) than to require vertical scrolling of a multi-column vertically-sorted list.
Another way to determine how to lay out data/components:
print out the UI/panel on a piece of paper (or take a screenshot into an image editor like Gimp or Photoshop)
use a highlighter to draw on it the pattern that your eyes take, from element to element.
For example, comparing the eye-movement patterns for the two layouts, it's obvious which one is simpler and easier both cognitively and on the eyes.
I think it depends on what you are trying to do, but for me the first choice is easier to read, though then it isn't really in two columns.
If the assignment is to sort into two columns, then the second one is probably more correct, but if you want to be fancy you could perhaps give a checkbox so that they can switch between the two.
If you were doing this for a job, then I would suggest that you do give a checkbox, and talk with people about what works best for the application.
I'd prefer the second one, as it is more consistent with reading a newspaper article that is split into two columns: you read all the way down one column and finish, then go to the next column.
Another thing to consider is the proximity between each entry. In the first example you may have more horizontal separation between the first two items because of the difference in the text length, which makes them visually less connected.
Alabama [Lots of space] Alaska
VS.
Alabama
[Less Space]
Alaska
Simply put, if it is a subjective decision, then you can go with whatever you like better.
You can always make things work just by applying principles of design.
But if you are trying to approach it from the point of view of what is more common in web and print publications, then your vertical approach would be best, as it is something the average user/visitor is used to interacting with on a daily basis.
This is my 5c.

K Nearest Neighbour Algorithm doubt

I am new to Artificial Intelligence. I understand the K nearest neighbour algorithm and how to implement it. However, how do you calculate the distance or weight of things that aren't on a scale?
For example, distance in age can be easily calculated, but how do you calculate how near red is to blue? Maybe colour is a bad example because you could still, say, use the frequency. How about a burger vs. pizza vs. fries, for example?
I got a feeling there's a clever way to do this.
Thank you in advance for your kind attention.
EDIT: Thank you all for very nice answers. It really helped and I appreciate it. But I am thinking there must be a way out.
Can I do it this way? Let's say I am using my KNN algorithm to predict whether a person will eat at my restaurant, which serves all three of the above foods. Of course, there are other factors, but to keep it simple, for the field of favourite food, out of 300 people, 150 love burgers, 100 love pizza, and 50 love fries. Common sense tells me favourite food affects people's decision on whether to eat there or not.
So now a person enters his/her favourite food as burger and I am going to predict whether he/she is going to eat at my restaurant. Ignoring other factors, and based on my (training) previous knowledge base, common sense tells me that there's a higher chance the k nearest neighbours' distance on this particular field (favourite food) is smaller than if he had entered pizza or fries.
The only problem with that is that I used probability, and I might be wrong because I don't know and probably can't calculate the actual distance. I also worry about this field putting too much or too little weight on my prediction, because the distance probably isn't on the same scale as the other factors (price, time of day, whether the restaurant is full, etc., which I can easily quantify), but I guess I might be able to get around it with some parameter tuning.
Oh, everyone put up a great answer, but I can only accept one. In that case, I'll just accept the one with highest votes tomorrow. Thank you all once again.
Represent all food for which you collect data as a "dimension" (or a column in a table).
Record "likes" for every person on whom you can collect data, and place the results in a table:
        | Burger | Pizza | Fries | Burritos | Likes my food
person1 |   1    |   0   |   1   |    1     |       1
person2 |   0    |   0   |   1   |    0     |       0
person3 |   1    |   1   |   0   |    1     |       1
person4 |   0    |   1   |   1   |    1     |       0
Now, given a new person, with information about some of the foods he likes, you can measure similarity to other people using a simple measure such as the Pearson Correlation Coefficient, or the Cosine Similarity, etc.
Now you have a way to find the K nearest neighbors and make a decision.
For more advanced information on this, look up "collaborative filtering" (but I'll warn you, it gets math-y).
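A small sketch of that similarity measurement over the 0/1 "likes" vectors, using cosine similarity; the new person's vector is made up purely for illustration:

import math

def cosine(a, b):
    # Cosine similarity between two equal-length 0/1 preference vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Columns: Burger, Pizza, Fries, Burritos (the table above, minus the label column).
people = {
    "person1": [1, 0, 1, 1],
    "person2": [0, 0, 1, 0],
    "person3": [1, 1, 0, 1],
    "person4": [0, 1, 1, 1],
}
new_person = [1, 0, 1, 0]
# Nearest neighbours of the new person, most similar first.
neighbours = sorted(people, key=lambda name: cosine(people[name], new_person), reverse=True)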
Well, 'nearest' implies that you have some metric on which things can be more or less 'distant'. Quantification of 'burger', 'pizza', and 'fries' isn't so much a KNN problem as it's about fundamental system modeling. If you have a system where you're doing analysis where 'burger', 'pizza', and 'fries' are terms, the reason for the system to exist is going to determine how they're quantified -- like if you're trying to figure out how to get the best taste and least calories for a given amount of money, then ta-da, you know what your metrics are. (Of course, 'best taste' is subjective, but that's another set of issues.)
It's not up to these terms to have inherent quantifiability and thereby to tell you how to design your system of analysis; it's up to you to decide what you're trying to accomplish and design metrics from there.
This is one of the problems of knowledge representation in AI. Subjectivity plays a big part. Would you and I agree, for example, on the "closeness" of a burger, pizza and fries?
You'd probably need a look-up matrix containing the items to be compared. You may be able to reduce this matrix if you can assume transitivity, but I think even that would be uncertain in your example.
The key may be to try and determine the feature that you are trying to compare on. For example, if you were comparing your food items on health, you may be able to get at something more objective.
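One hedged way to hold such a look-up matrix is a small symmetric table of hand-assigned distances; the values below are invented purely to illustrate the shape:

# Hand-assigned distances between items: 0 = identical, 1 = completely unrelated.
FOOD_DISTANCE = {
    ("burger", "pizza"): 0.4,
    ("burger", "fries"): 0.3,
    ("pizza", "fries"): 0.6,
}

def food_distance(a, b):
    if a == b:
        return 0.0
    # The matrix is symmetric, so look the pair up in either order.
    return FOOD_DISTANCE.get((a, b), FOOD_DISTANCE.get((b, a), 1.0))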
If you look at "Collective Intelligence", you'll see that they assign a scale and a value. That's how Netflix is comparing movie rankings and such.
You'll have to define "nearness" by coming up with that scale and assigning values for each.
I would actually present pairs of these attributes to users and ask them to define their proximity. You would present them with a scale reaching from [synonym..very foreign] or similar. Having many people do this you will end up with a widely accepted proximity function for the non-linear attribute values.
There is no "best" way to do this. Ultimately, you need to come up with an arbitrary scale.
Good answers. You could just make up a metric, or, as malach suggests, ask some people. To really do it right, it sounds like you need Bayesian analysis.
