How to implement a real estate recommendation engine? - algorithm

I am talking about something like movie/item recommendation, but it seems that real estate is more tricky. When visiting a web-site and doing some search for RE, the user should be presented with some suggestions. Let's separate the task in two tasks:
a) the user has still not entered any personal info - item based recommendation
b) the user has already entered his/hers details such as income, location, etc. - item/user based recommendation
The first thing that comes to my mind for task a) is to start modeling RE features, but using some ranges instead of exact values. For example:
Area in m2
40 - 50 we can mark it for "1"
50 - 70 is "2"
etc ...
Price:
20 - 30 thousands € will be marked as 1
30 - 40 will be 2
etc ...
Proximity to city center:
1 for the RE being within the city center
2 for Zone 2 or up to 2/3 kilometers from center
3 for Zone 3 or 7 kilometers from center
So having ranges lets us assign a vector to each RE property which will allows us to use: Euclidean distance, Pearson correlation and some nearest neighbor algorithms.
Please comment on my approach or suggest a new one.

If you already have a website with enough traffic, you can try a pure collaborative filtering approach, i.e people who viewed this property also viewed these other properties. You could use the Pearson correlation there for good results.
Similarity between 2 RE can be defined as
number of people who viewed both RE1 and RE2
sim = ---------------------------------------------
number of people who viewed either 1 or both
When a user is viewing property RE you can sort all other RE properties based on the similarity score with the property being shown and show the top few.
You could add some obvious filters on top of this like the location of the property, the price range etc.
You can also define the similarity as you have suggested and mix the results from both for good representation from new RE entries which do not have a high chance of getting in if a pure collaborative filtering algorithm is used.

Related

VW contextual bandits: historical data and online learning

I'd like to test CB for e-commerce task: personal offer recommendations (like "last chance to buy", "similar positions", "consumers recommend", "bestsellers", etc.). My task is to order them (more relevant issue is higher in the list of recommendations).
So, there are 5 possible offers.
I have some historical data collected without using any model: context (user and web-session features), action id (one of my 5 offers), reward (1 if user clicked this offer, 0 - not clicked). So I have N users and 5 offers with known reward, totally 5*N rows in my historical data.
Ex:
1:1:1 | user_id:1 f1:... f2:...
2:-1:1 | user_id:1 f1:... f2:...
3:-1:1 | user_id:1 f1:... f2:...
This means that user 1 have seen 3 offers (1,2,3), cost of the 1 offer is equal to 1 (didn't click), user ckickes on offers 2 and 3 (cost is negative -> reward is positive). Probabilities are equal to 1, since all offers were shown and we know rewards.
Global task is to increase CTR. I'd like to use this data for training CB and then improve the model by exploration/exploitation policies. I set probabilities equal to 1 in this data (Is it right?). But next I'd like to set the order of offers according to rewards.
Should I use for this warm start in VW CB? Will this work correctly with data collected without using CB? Maybe you can advise more relevant methods in CB for this data and task?
Thanks a lot.
If there are only 5 possible offers and if you (as indicated) have data of the form "I have N users and 5 offers with known reward, totally 5*N rows in my historical data." then your historical data is supervised multilabel data and the warm-start functionality would apply; make sure you use the cost-sensitive version to accommodate the multilabel aspect of your historical data (i.e., there is more than one offer that would result in a click).
Will this work correctly with data collected without using CB?
Because the every action-reward is specified for every user in the data set, you only have to ensure that the sample of users is representative of the population you care about.
Maybe you can advise more relevant methods in CB for this data and task?
The first paragraph started with "if" because the more typical case is 1) there are many possible offers and 2) users have only seen a few of them historically.
In such case what you have is a combination of a degenerate logging policy and multiple rewards being revealed. If there are k possible actions but each user has only seen n<=k historically then you could try and make n lines for each user as you did. Theoretically this does not necessarily work but in practice it might help.
Out of the box: change the data
If the data you have was collected as the result of running an existing policy, then an alternative would be to start randomizing the decisions made by that system in order to collect a dataset which conforms to CB. For example, use your current system to pick the "best" action 96% of the time, and one of the other 4 actions at random 4% of the time, and log the probability along with the reward (either 0.96 or 0.01 depending upon whether it was the considered best), and then set up a proper CB-style training set for vw. With this you can also counterfactually estimate the value of both your current policy and the policy vw generates, and only switch to vw when it is winning.
The fastest way to implement the last paragraph is to just start using APS.

Sorting a list based on multiple indices and weights

Sort of a very long winded explanation of what I'm looking at so I apologize in advance.
Let's consider a Recipe:
Take the bacon and weave it ...blahblahblah...
This recipe has 3 Tags
author (most important) - Chandler Bing
category (medium importance) - Meat recipe (out of meat/vegan/raw/etc categories)
subcategory (lowest importance) - Fast food (our of fast food / haute cuisine etc)
I am a new user that sees a list of randomly sorted recipes (my palate/profile isn't formed yet). I start interacting with different recipes (reading them, saving them, sharing them) and each interaction adds to my profile (each time I read a recipe a point gets added to the respective category/author/subcategory). After a while my profile starts to look something like this :
Chandler Bing - 100 points
Gordon Ramsey - 49 points
Haute cuisine - 12 points
Fast food - 35 points
... and so on
Now, the point of all this exercise is to actually sort the recipe list based on the individual user's preferences. For example in this case I will always see Chandler Bing's recipes on the top (regardless of category), then Ramsey's recipes. At the same time, Bing's recipes will be sorted based on my preferred categories and subcategories, seeing his fast food recipes higher than his haute cuisine ones.
What am I looking at here in terms of a sorting algorithm?
I hope that my question has enough information but if there's anything unclear please let me know and I'll try to add to it.
I would allow the "Tags" with the most importance to have the greatest capacity in point difference. Example: Give author a starting value of 50 points, with a range of 0-100 points. Give Category a starting value of 25 points, with a possible range of 0-50 points, give subcategory a starting value of 12.5 points, with a possible range of 0-25 points. That way, if the user's palate changes over time, s/he will only have to work down from the maximum, or work up from the minimum.
From there, you can simply add up the points for each "Tag", and use one of many languages' sort() methods to compare each recipe.
You can write a comparison function that is used in your sort(). The point is when you're comparing two recipes just add up the points respectively based on their tags and do a simple comparison. That and whatever sorting algorithm you choose should do just fine.
You can use a recursively subdividing MSD (sort of radix sort algorithm). Works as follows:
Take the most significant category of each recipe.
Sort the list of elements based on that category, grouping elements with the same category into one bucket (Ramsay bucket, Bing bucket etc).
Recursively sort each bucket, starting with the next category of importance (Meat bucket etc).
Concatenate the buckets together in order.
Complexity: O(kn) where k is the number of category types and N is the number of recipes.
I think what you're looking for is not a sorting algorithm, but a rating scheme.
You say, you want to sort by preferences. Let's assume, these preferences have different “dimensions”, like level of complexity, type of cuisine, etc.
These dimensions have different levels of measurement. These can be e.g. numeric or simple categories/tags. It would be your job to:
Create a scheme of dimensions and scales that can represent a user's preferences.
Operationalize real-world data to fit into this scheme.
Create a profile for the users which reflects their preferences. Same for the chefs; treat them just like normal users here.
To actually match a user to a chef (or, even to another user), create a sorting callback that matches all your dimensions against each other and makes sure that in each of the dimension the compared users have a similar value (on a numeric scale), or an overlapping set of properties (on a nominal scale, like tags). Then you sort the result by the best match.

Ranking/ weighing search result

I am trying to build an application that has a smart adaptive search engine (lets say for cars). If I search for for 4x4 then the DB will return all the 4x4 cars I have (100 cars) - but as time goes by and I start checking out cars, liking them, commenting on them, etc the order of the search result should be the different. That means 1 month later when searching for 4x4, I should get the same result set ordered differently as per my previous interaction with the site. If I was mainly liking and commenting on German cars, BMW should be on the top and Land cruiser should be further down.
This ranking should be based on attributes that I captureduring user interaction (eg: car origin, user age, user location, car type[4x4, coupe, hatchback], price range). So for each car result I get, I will be weighing it based on how well it is performing on the 5 attributes above.
I intend to use the DB just as a repository and do the ranking and the thinking on the server. My question is, what kind of algorithm should I be using to weigh/rank my search result?
Thanks.
You're basically saying that you already have several ordering schemes:
Keyword search result
amount of likes for car's category
likely others, such as popularity, some form of date, etc.
What you do then is make up a new scheme, call it relevance:
relevance = W1 * keyword_score + W2*likes_score + ...
and sort by relevance. Experiment with the weights W1, W2, ..., until you get something you find useful.
From my understanding search engines work on this principle. It's been long thrown around that Google has on the order of 200 different inputs into the relevance score, PageRank being just one. The beauty of this approach is that it lets you fine tune the importance of everything (even individually for every query), and it lets you add additional inputs without screwing everything up.

How to calculate difficulty metric?

Note: I have completely changed the original question!
I do have several texts, which consists of several words. Words are categorized into difficulty categories from 1 to 6, 1 being the easiest one and 6 the hardest (or from common to least common). However, obviously not all words can be put into these categories, because they are countless words in the english language.
Each category has twice as many words as the category before.
Level: 100 words in total (100 new)
Level: 200 words in total (100 new)
Level: 400 words in total (200 new)
Level: 800 words in total (400 new)
Level: 1600 words in total (800 new)
Level: 3200 words in total (1600 new)
When I use the term level 6 below, I mean introduced in level 6. So it is part of the 1600 new words and can't be found in the 1600 words up to level 5.
How would I rate the difficulty of an individual text? Compare these texts:
An easy one
would only consist of very basic vocabulary:
I drive a car.
Let's say these are 4 level 1 words.
A medium one
This old man is cretinous.
This is a very basic sentence which only comes with one difficult word.
A hard one
would have some advanced vocabulary in there too:
I steer a gas guzzler.
So how much more difficult is the second or third of the first one? Let's compare text 1 and text 3. I and a are still level 1 words, gas might be lvl 2, steer is 4 and guzzler is not even in the list. cretinous would be level 6.
How to calculate a difficulty of these texts, now that I've classified the vocabulary?
I hope it is more clear what I want to do now.
The problem you are trying to solve is how to quantify your qualitative data.
The search term "quantifying qualitative data" may help you.
There is no general all-purpose algorithm for this. The best way to do it will depend upon what you want to use the metric for, and what your ratings of each individual task mean for the project as a whole in terms of practical impact on the factors you are interested in.
For example if the hardest tasks are typically unsolvable, then as soon as a project involves a single type 6 task, then the project may become unsolvable, and your metric would need to reflect this.
You also need to find some way to address the missing data (unrated tasks). It's likely that a single numeric metric is not going to capture all the information you want about these projects.
Once you have understood what the metric will be used for, and how the task ratings relate to each other (linear increasing difficulty vs. categorical distinctions) then there are plenty of simple metrics that may codify this analysis.
For example, you may rate projects for risk based on a combination of the number of unknown tasks and the number of tasks with difficulty above a certain threshold. Alternatively you may rate projects for duration based on a weighted sum of task difficulty, using a default or estimated difficulty for unknown tasks.

Algorithm for Rating Objects Based on Amount of Votes and 5 Star Rating

I'm creating a site whereby people can rate an object of their choice by allotting a star rating (say 5 star rating). Objects are arranged in a series of tags and categories eg. electronics>graphics cards>pci express>... or maintenance>contractor>plumber.
If another user searches for a specific category or tag, the hits must return the highest "rated" object in that category. However, the system would be flawed if 1 person only votes 5 stars for an object whilst 1000 users vote an average of 4.5 stars for another object. Obviously, logic dictates that credibility would be given to the 1000 user rated object as opposed to the object that is evaluated by 1 user even though it has a "lower" score.
Conversely, it's reliable to trust an object with 500 user rating with score of 4.8 than it is to trust an object with 1000 user ratings of 4.5 for example.
What algorithm can achieve this weighting?
A great answer to this question is here:
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
You can use the Bayesian average when sorting by recommendation.
I'd be tempted to have a cutoff (say, fifty votes though this is obviously traffic dependent) before which you consider the item as unranked. That would significantly reduce the motivation for spam/idiot rankings (especially if each vote is tied to a user account), and also gets you a simple, quick to implement, and reasonably reliable system.
simboid_function(value) = 1/(1+e^(-value));
rating = simboid_function(number_of_voters) + simboid_function(average_rating);

Resources