What's the best way to classify a list of sites? - algorithm

I have a list of X sites that I need to classify in some way. Is the site about cars, health, products or is it about everything(wikihow, about.com, etc?) What are some of the better ways to classify sites like this? Should I get keywords that bring traffic to the site and use those? Should I read the content of some random pages and judge it off of that?

Well if the site is well designed there will be meta tags in the header specifically for this.

Yahoo has a api to extract terms, http://developer.yahoo.com/search/content/V2/termExtraction.html
"The Term Extraction Web Service provides a list of significant words or phrases extracted from a larger content. It is one of the technologies used in Y!Q."

Maybe I'm a bit biased (disclaimer : I have a degree in library science, and this topic is one of the reasons I got the degree), so the easiest answer is that there is no best way.
Consider this like you would database design -- once you have your system populated, what sort of questions are you going to ask of it?
Is the fact that the site is run by the government significant? Or that it uses flash? Or that the pages are blue? Or that it's a hobbyist site? Or that the intended audience is children?.
Then we get the question of if we're going to have a hierarchical category for any of the facets we're concerned with -- if it's about both cars and motorcycles, should we use the term 'vehicles' instead? And if we do that, will we use keyword expansion so that 'motorcycle' matches the broader terms (ie, vehicles) as well?
So ... the point is ... figure out what your needs are, and work towards that. 'Best' will never come, even with years of refinement (if anything, it gets more difficult, as terms start changing meanings. Remember when 'weblog' was related to web server metrics?)

This is a tough question to answer. Consider:
How granular do you want your classification to be?
Do you want to classify sites based on your own criteria or the criteria provided by the sites? In other words, if a site classifies itself as "a premier source for motorcycle maintenance", do you want to create a "motorcycle maintenance" category just for that site? This, of course, will cause your list to become inconsistent. However, if you pigeon-hole the sites to follow your own classification scheme, there is a loss of information, and a risk that the site will not match any of the categories you've defined.
Do you allow subcategories? The problem becomes much more complicated if so.
Can a site belong to more than one category? If so, is there an ordering or a weight (ie. Primary Category, Secondary Categories, etc.), or do you follow a scheme similar to SO's tags?
As an initial stab at the problem, I think I'd define a set of categories, and then spider each site, keeping track of the number of occurrences of each category name, or a mutation thereof. Then, you can choose the name that had the greatest number of "hits."
For instance, given the following categories:
{ "Cars", "Motorcycles", "Video Games" }
Spidering the following blocks of text from a site:
The title is an incongruous play on the title of the book Zen in the Art of Archery by Eugen Herrigel. In its introduction, Pirsig explains that, despite its title, "it should in no way be associated with that great body of factual information relating to orthodox Zen Buddhist practice. It's not very factual on motorcycles, either."
and:
Most motorcycles made since 1980 are pretty reliable if properly maintained but that's a big if. To some extent the high reliability of today's motorcycles has worked to the disadvantage of many riders. Some riders have been lulled into believing that motorcycles are like modern cars and require essentially no maintenance. This is not the case (even with cars). Modern bikes require less maintenance than they did in the 60's and 70's but they still need a lot more maintence than a car. This higher reliability also means that there are a a whole bunch of motorcyclists out there who haven't a clue how to work on their bikes or what really needs to be done to ensure reliability.
We get the following scores:
{ "Cars" : 3, "Motorcycles" : 4, "Video Games" : 0 }
And we can thus categorize the site as being related mostly to "Motorcycles".
Note that I said "mutations thereof" with regards to category names, so "motorcycle" or "car" are both detected. We can see from this that you should also perhaps consider using a list of related words. For instance, perhaps we should detect the word "motorcyclists" when searching for instances of "Motorcycles". Perhaps we should've seen "modern bikes", too.
You could also save those hits, perhaps combined them with some other data, and use Bayesian probability to determine which category the site is most likely to fit into.

Related

What matching algorithm could I use?

I would need some help because I don't know what algorithm i could use for the following (I use python) :
Steve is 25 and he buys everyday orange juice
Maria is 23 and she likes to buy smoothies
Steve & Maria tastes are pretty much the same.
Juan is 16 and he only drinks sodas
Juan tastes are not the same as Steve and Maria.
====================================================
I would like to use a matching algorithm that will detect the users who have the same drink preference and a close age. To continue with the example, Steve and Maria would be matched together but not Juan. Which one should I use ?
I agree with #klutt that your task is pretty vague. There are two approaches that come to mind, but not knowing more details about your problem does limit the details I can provide in my answer that would help you. I am interpreting the question as if you are taking in raw text and might want to process more sentences that have very similar semantic and syntactical structure.
An algorithmic approach:
Assuming that your word choices are static in their semantic meaning (Maria is 23 ... Steve is 25), we can parse each sentence and identify tokens like is or and or same and essentially perform lexical analysis on the text... from here, you could continue thinking about how you would go about matching and so forth... but this is rather complicated...
Neural Network approach:
If you are taking in raw text in the form of sentences, it's a problem that's not straight forward to solve using a top-down algorithmic approach.
You could take an approach with neural networks that trains a model to solve your problem, but then again what you seem to be asking is quite complex since there are multiple "facts" within each sentence that are not semantically related. For example, your second sentence identifies that Maria is 23 but at the end of that sentence there is a comparison between Steve and Maria. And your first sentence only identifies Steve as 25.
Even if you chunk raw text into sentences, you would have to have a very fine tuned neural network architecture and a lot of training data to get remotely close to your goal.
Now, both of those solutions are very complex... but if you wanted to create an application that collects this data (via a form or prompt) and puts it into a structured format (like a json or xml object) to organize and store the data in memory (perhaps writing out to a database or file for persistent storage), that might be a good route to go down.
This can serve as a good lesson in how to think about data as well. It is one thing if you have a pool of thousands of sentences, just raw data that you need to organize for quantitative purposes (classic qualitative -> quantitative problems). It is another thing if you are going to be collecting this data. If you are going to be collecting data, having a program that collects and organizes names, ages, and drink preferences (and then organizes that data within certain data structures), then we can talk about matching algorithms.
I will also add here that if you do have structured data, Collaborative filtering (mentioned by Shridhar) is a great starting place.
Collaborative filtering best suits your needs.
In the newer, narrower sense, collaborative filtering is a method of
making automatic predictions (filtering) about the interests of a user
by collecting preferences or taste information from many users
(collaborating). The underlying assumption of the collaborative
filtering approach is that if a person A has the same opinion as a
person B on an issue, A is more likely to have B's opinion on a
different issue than that of a randomly chosen person. For example, a
collaborative filtering recommendation system for television tastes
could make predictions about which television show a user should like
given a partial list of that user's tastes (likes or dislikes).[3]
Note that these predictions are specific to the user, but use
information gleaned from many users. This differs from the simpler
approach of giving an average (non-specific) score for each item of
interest, for example based on its number of votes.

How to develop an english .com domain value rating algorithm?

I've been thinking about an algorithm that should rougly be able to guess the value of an english .com domain in most cases.
For this to work I want to perform tests that consider the strengths and weaknesses of an english .com domain.
A simple point based system is what I had in mind, where each domain property can be given a certain weight to factor it's importance in.
I had these properties in mind:
domain character length
Eg. initially 20 points are added. If the domain has 4 or less characters, no points are substracted. For each extra character, one or more points are substracted on an exponential basis (the more characters, the higher the penalty).
domain characters
Eg. initially 20 points are added. If the domain is only alphabetic, no points are substracted. For each non-alhabetic character, X points are substracted (exponential increase again).
domain name words
Scans through a big offline english database, including non-formal speech, eg. words like "tweet" should be recognized.
Question 1 : where can I get a modern list of english words for use in such application? Are these lists available for free? Are there lists like these with non-formal words?
The more words are found per character, the more points are added. So, a domain with a lot of characters will still not get a lot of points.
words hype-level
I believe this is a tricky one, but this should be the cause to differentiate perfect but boring domains from perfect and interesting domains.
For example, the following domain is probably not that valueable: www.peanutgalaxy.com
The algorithm should identify that peanuts and galaxies are not very popular topics on the web. This is just an example.
On the other side, a domain like www.shopdeals.com should ring a bell to the hype test, as shops and deals are quite popular on the web.
My initial thought would be to see how often these keywords are references to on the web, preferably with some database.
Question 2: is this logic flawed, or does this hype level test have merit?
Question 3: are such "hype databases" available? Or is there anything else that could work offline? The problem with eg. a query to google is that it requires a lot of requests due to the many domains to be tested.
domain name spelling mistakes
Domains like "freemoneyz.com" etc. are generally (notice I am making a lot of assumptions in this post but that's necessary I believe) not valueable due to the spelling mistakes.
Question 4: are there any offline APIs available to check for spelling mistakes, preferably in javascript or some database that I can use interact with myself. Or should a word list help here as well?
use of consonants, vowels etc.
A domain that is easy to pronounce (eg. Google) is usually much more valueable than one that is not (eg. Gkyld).
Question 5: how does one test for such pronuncability? Do you check for consonants, vowels, etc.? What does a valueable domain have? Has there been any work in this field, where should I look?
That is what I came up with, which leads me to my final two questions.
Question 6: can you think of any more english .com domain strengths or weaknesses? Which? How would you implement these?
Question 7: do you believe this idea has any merit or all, or am I too naive? Anything I should know, read or hear about? Suggestions/comments?
Thanks!
The value of a domain is the highest price someone is prepared to pay on the day of sale - it really is pretty arbitrary, especially when un pre-registered domain can be bought for less than $20.
The domain evaluator tool is what you're looking for.
Here's a screenshot.
Optionally, you can make a simple bidding website. I someone clicks that there is interest in a certain domain, then it raises the hidden price. Only when a user does XYZ can they see the price, by then the community has successfully done your valuation automatically.

Algorithm: Determining type of homepage?

I've been thinking about this for a while now, so I thought I would ask for suggestions:
I have some crawler which enters the root of some site (could be anything from www.StackOverFlow.com, www.SomeDudesPersonalSite.se or even www.Facebook.com). Then I need to determin what "kind of homepage" I'm visiting.. Different types could for instance be:
Forum
Blog
Link catalog
Social media site
News site
"One man site"
I've been brainstorming for a while, and the best solution seems to be some heuristic with a point system. By this I mean different trends gives some points to the different types, and then the program makes a guess afterwards.
But this is where I get stuck.. How do you detect trends?
Catalogs could be easy: If sitesIndexed/Outgoing links is very high, catalogs should get several points.
News sites/Blogs could be easy: If a high amount of sites indexed has a datetime, those types should get several points..
BUT I can't really find too many trends.
SO: My question is:
Any ideas on how to do this?
Thanks so much..
I believe you are attempting document classification, which is a well-researched topic.
http://en.wikipedia.org/wiki/Document_classification
You will see a considerable list of many different methods. But to suggest any one of those (or neural networks or the like) prior to determining the "trends" as you call them is to suggest it prematurely. I would recommend looking into "web document classification" or the like. It is evidently a considerable subset of document classification, and if you have access to academic journals there are plenty of incomprehensible articles for your enjoyment.
I did also find your idea as a homework assignment -- perhaps if you are particularly audacious you could contact the professor.
http://uhaweb.hartford.edu/compsci/ccli/wdc.htm
Lastly, I believe that this is an accessible (if strangely formatted) website that has a general and perhaps outdated discussion:
http://www.webology.ir/2008/v5n1/a52.html
I'm afraid I don't have much personal knowledge of the topic, so the most I could do was tell you the keyword "document classification" and provide some quick googling. However, if I wanted to play around with this concept, I think simply looking for the rate of certain keywords is a decent starting "trend." ("Sale" or "purchase" or "customers" are trends for shopping sites, "my," "opinion," "comment," for blogs, and so on)
You could train a neural network to recognise them. Give it number/types of links, maybe types of HTML tags as well.
I think otherwise you're just going to be second-guessing what makes a site what it is.

How the computer knows "Recommended for You"?

Recently, I found several web site have something like : "Recommended for You", for example youtube, or facebook, the web site can study my using behavior, and recommend some content for me... ...I would like to know how they analysis this information? Is there any Algorithm to do so? Thank you.
Amazon and Netflix (among others) use a technique called Collaborative filtering to suggest things you might like based on the likes/dislikes of others who have made purchases and selections similar to yours.
Is there any Algorithm to do so?
Yes
Yes. One fairly common one is to look at things you've selected in the past, find other people who've made those selections, then find the other selections most common among those other people, and guess that you're likely to be interested in those as well.
Yup there are lots of algorithms. Things such as k-nearest neighbor: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
Here is a pretty good book on the subject that covers making these sorts of systems along with others: http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=ianburriscom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325.
It's generally done by matching you with other users who have similar usage history / profile and then recommending other things that they've purhased/watched/whatever.
Searching for "recommendation algorithm" yields lots of papers. Most algorithms incorporate "machine learning" algorithms to determine groups of things (comedy movies, books on gardening, orchestral music, etc.). Your matching with those groups yields recommendations. Some companies use humans to classify things, too.
Such an algorithm is going to vary wildly from company to company. In many cases, it analyzes some combination of your search history, purchase history, physical location, and other factors. It probably will also compare purchases/searches amongst other people to find what those people have purchased/searched for, and recommend some of those products to you.
There are probably hundreds of these algorithms out there, but I doubt you can use any of them (that are actually good). Probably you are better off figuring it out yourself.
If you can categorize your contents (i.e. by tagging or content analysis), you can also categorize your users and their preferences.
For example: you have a video portal with 5 million videos .. 1 mio of them are tagged mostly red. If 80% of all videos watched by a user (who is defined by an IP, a persistent user account, ...) are tagged mostly red, you might want to recommend even more red videos to him. You might want to refine your recommendations by looking at his further actions: does he like your recommendations -- if so, why not give him even more, if not, try the second-best guess, maybe he's not looking for color, but for the background music ...
There's no absolute algorithm to do it, but all implementations will go into a similar direction. It's always basing on observing users, which scares me from time to time :-)
There's whole lot of algorithms tackling the issue: Wiki article. It's a Machine Learning domain problem. Computer's can be learned using two main techniques: classification and clustering. They require some datasets as input. If the dataset is informative (really holds some useful patterns) than those ML techniques can dig most of it.
Clustering could be best to use for this kind of problem. It's main usage is to find similarities among points in provided dataset. If the points are, e.g. your search history, they can be grouped together to form certain clusters. If Your search history closely relates to another, a hint can be given - picking links that are most similar to Your's.
The same comes with book recommendations - it's obvious what dataset they use: "Other people who bought this product also bought Product A, Product B,...". The key here is to match your profile to other's and use the most similar to recommend.
The computer retrieves information from the human brain with complex memory scan process, sorts it accordingly and outputs results based on what you have experienced in your life so far.

Algorithm for suggesting products

What's a good algorithm for suggesting things that someone might like based on their previous choices? (e.g. as popularised by Amazon to suggest books, and used in services like iRate Radio or YAPE where you get suggestions by rating items)
Simple and straightforward (order cart):
Keep a list of transactions in terms of what items were ordered together. For instance when someone buys a camcorder on Amazon, they also buy media for recording at the same time.
When deciding what is "suggested" on a given product page, look at all the orders where that product was ordered, count all the other items purchased at the same time, and then display the top 5 items that were most frequently purchased at the same time.
You can expand it from there based not only on orders, but what people searched for in sequence on the website, etc.
In terms of a rating system (ie, movie ratings):
It becomes more difficult when you throw in ratings. Rather than a discrete basket of items one has purchased, you have a customer history of item ratings.
At that point you're looking at data mining, and the complexity is tremendous.
A simple algorithm, though, isn't far from the above, but it takes a different form. Take the customer's highest rated items, and the lowest rated items, and find other customers with similar highest rated and lowest rated lists. You want to match them with others that have similar extreme likes and dislikes - if you focus on likes only, then when you suggest something they hate, you'll have given them a bad experience. In suggestions systems you always want to err on the side of "lukewarm" experience rather than "hate" because one bad experience will sour them from using the suggestions.
Suggest items in other's highest lists to the customer.
Consider looking at "What is a Good Recommendation Algorithm?" and its discussion on Hacker News.
There isn't a definitive answer and it's highly unlikely there is a standard algorithm for that.
How you do that heavily depends on the kind of data you want to relate and how it is organized. It depends on how you define "related" in the scope of your application.
Often the simplest thought produces good results. In the case of books, if you have a database with several attributes per book entry (say author, date, genre etc.) you can simply chose to suggest a random set of books from the same author, the same genre, similar titles and others like that.
However, you can always try more complicated stuff. Keeping a record of other users that required this "product" and suggest other "products" those users required in the past (product can be anything from a book, to a song to anything you can imagine). Something that most major sites that have a suggest function do (although they probably take in a lot of information, from product attributes to demographics, to best serve the client).
Or you can even resort to so called AI; neural networks can be constructed that take in all those are attributes of the product and try (based on previous observations) to relate it to others and update themselves.
A mix of any of those cases might work for you.
I would personally recommend thinking about how you want the algorithm to work and how to suggest related "products". Then, you can explore all the options: from simple to complicated and balance your needs.
Recommended products algorithms are huge business now a days. NetFlix for one is offering 100,000 for only minor increases in the accuracy of their algorithm.
As you have deduced by the answers so far, and indeed as you suggest, this is a large and complex topic. I can't give you an answer, at least nothing that hasn't already been said, but I an point you to a couple of excellent books on the topic:
Programming CI:
http://oreilly.com/catalog/9780596529321/
is a fairly gentle introduction with
samples in Python.
CI In Action:
http://www.manning.com/alag looks a
bit more in depth (but I've only just
read the first chapter or 2) and has
examples in Java.
I think doing a Google on Least Mean Square Regression (or something like that) might give you something to chew on.
I think most of the useful advice has already been suggested but I thought I'll just put in how I would go about it, just thinking though, since I haven't done anything like this.
First I Would find where in the application I will sample the data to be used, so If I have a store it will probably in the check out. Then I would save a relation between each item in the checkout cart.
now if a user goes to an items page I can count the number of relations from other items and pick for example the 5 items with the highest number of relation to the selected item.
I know its simple, and there are probably better ways.
But I hope it helps
Market basket analysis is the field of study you're looking for:
Microsoft offers two suitable algorithms with their Analysis server:
Microsoft Association Algorithm Microsoft Decision Trees Algorithm
Check out this msdn article for suggestions on how best to use Analysis Services to solve this problem.
link text
there is a recommendation platform created by amazon called Certona, you may find this useful, it is used by companies such as B&Q and Screwfix find more information at www.certona.com/‎

Resources