From a quote on Google's official blog (hosted on Blogspot):
"In fact, we found even more than 1 trillion individual links, but not all of
them lead to unique web pages. Many pages have multiple URLs with exactly the same
content or URLs that are auto-generated copies of each other. Even after removing
those exact duplicates . . . "
How does Google detect those exact duplicate web pages or documents? Any idea which algorithm Google uses?
According to http://en.wikipedia.org/wiki/MinHash:
A large scale evaluation has been conducted by Google in 2006 [10] to
compare the performance of Minhash and Simhash[11] algorithms. In 2007
Google reported using Simhash for duplicate detection for web
crawling[12] and using Minhash and LSH for Google News
personalization.[13]
A search for Simhash turns up this page:
https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/
https://github.com/leonsim/simhash
which reference a paper written by Google employees: Detecting Near-Duplicates for Web Crawling
Abstract:
Near-duplicate web documents are abundant. Two such documents differ
from each other in a very small portion that displays advertisements,
for example. Such differences are irrelevant for web search. So the
quality of a web crawler increases if it can assess whether a newly
crawled web page is a near-duplicate of a previously crawled web page
or not. In the course of developing a near-duplicate detection system
for a multi-billion page repository, we make two research
contributions. First, we demonstrate that Charikar's fingerprinting
technique is appropriate for this goal. Second, we present an
algorithmic technique for identifying existing f-bit fingerprints that
differ from a given fingerprint in at most k bit-positions, for small
k. Our technique is useful for both online queries (single
fingerprints) and all batch queries (multiple fingerprints).
Experimental evaluation over real data confirms the practicality of
our design.
Another Simhash paper:
http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
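To make the idea concrete, here is a minimal SimHash sketch in Python. This is my own simplification of Charikar's fingerprinting scheme, not Google's production code; the 64-bit fingerprint and the MD5-based token hash are arbitrary choices. Two documents are treated as near-duplicates when their fingerprints differ in at most k bit positions, matching the abstract above.

    import hashlib

    def simhash(tokens, f=64):
        # Compute an f-bit SimHash fingerprint from a list of tokens.
        v = [0] * f
        for token in tokens:
            # Hash each token to an f-bit integer (MD5 truncated; an arbitrary choice).
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
            for i in range(f):
                v[i] += 1 if (h >> i) & 1 else -1
        fingerprint = 0
        for i in range(f):
            if v[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def hamming_distance(a, b):
        # Number of bit positions in which two fingerprints differ.
        return bin(a ^ b).count("1")

    page1 = "the quick brown fox jumps over the lazy dog".split()
    page2 = "the quick brown fox jumped over the lazy dog".split()
    # Similar pages tend to produce a small distance; a crawler would compare it to a threshold k.
    print(hamming_distance(simhash(page1), simhash(page2)))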
possible solutions
exact methods
1) brute force: compare every new page to all previously visited pages (very slow and inefficient)
2) calculate a hash of every visited page (e.g. MD5, SHA-1), store the hashes in a database, and look up every new page's hash in the database (see the sketch after this list)
3) standard Boolean model of information retrieval (BIR)
... many other possible methods
near-exact methods
1) fuzzy hashing
2) latent semantic indexing
...
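As referenced in exact method 2, here is a minimal sketch of hash-based exact-duplicate detection. It uses an in-memory set where a real crawler would use a database table, and SHA-1 as one possible hash.

    import hashlib

    seen_hashes = set()  # a real crawler would use a database table keyed on the digest

    def is_exact_duplicate(page_content: str) -> bool:
        # Return True if an identical page body has already been seen.
        digest = hashlib.sha1(page_content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False

    print(is_exact_duplicate("<html>hello</html>"))  # False
    print(is_exact_duplicate("<html>hello</html>"))  # True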
Related
I want to build a web application that lets users upload documents, videos, images, and music, and then gives them the ability to search them. Think of it as Dropbox + Semantic Search.
When a user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words, no user input is needed to determine what the file is about. Suppose that Document1.docx is a research paper on data mining; then when a user searches for "data mining", "research paper", or "document1", that file should be returned in the search results, since "data mining" and "research paper" will most likely be potential auto-generated tags for that document.
1. Which algorithms would you recommend for this problem?
2. Is there a natural language processing library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!
The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents assigns words to topics with some probability; when a user searches for a word, you can then retrieve the documents with the highest probability of being relevant to that word.
There have been some extensions to images and music as well, see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.
LDA has several efficient implementations in several languages:
many implementations from the original researchers
http://mallet.cs.umass.edu/, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation
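As a concrete illustration of the LDA workflow, here is a minimal sketch using gensim (my own choice of library, not one of the implementations listed above; the tiny corpus is invented). Any LDA implementation follows the same bag-of-words pattern.

    from gensim import corpora, models

    # Documents as pre-tokenized word lists (toy corpus for illustration).
    docs = [
        ["data", "mining", "research", "algorithm"],
        ["music", "album", "guitar", "band"],
        ["mining", "clustering", "classification", "research"],
    ]

    dictionary = corpora.Dictionary(docs)               # word <-> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

    # Topic distribution of the first document; the dominant topic's top words
    # can serve as auto-generated tags for it.
    print(lda.get_document_topics(corpus[0]))
    print(lda.show_topic(0, topn=5))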
These guys propose an alternative to LDA.
Automatic Tag Recommendation Algorithms for Social Recommender Systems
http://research.microsoft.com/pubs/79896/tagging.pdf
I haven't read through the whole paper, but they present two algorithms:
A supervised learning version. This isn't that bad; you can use Wikipedia to train the algorithm.
A "prototype" version. I haven't had a chance to go through this yet, but it is the one they recommend.
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the meantime, here's the approach:
Use TextRank as per http://qr.ae/36RAP to generate a tag list for a single document, independent of other documents (a rough sketch follows these steps).
Use the algorithm from "Using Machine Learning to Support Continuous Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list from step 1 into the existing tag list.
Text documents can be tagged using this keyphrase extraction algorithm/package.
http://www.nzdl.org/Kea/
Currently it supports a limited range of document types (agricultural and medical, I believe), but you can train it according to your requirements.
I'm not sure how the image/video part would work out, unless you're doing very accurate object detection (which has its own shortcomings). How are you planning to do it?
You want Doc-Tags (https://www.Doc-Tags.com), a commercial product that automatically generates contextually accurate document tags without supervision. The built-in reporting functionality makes the product a lightweight document management system.
For developers wanting to customize their own approach, the source code is available (very cheap) and the back-end service xAIgent (https://xAIgent.com) is very inexpensive to use.
I posted a blog article today to answer your question.
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo sites and source code.
Thanks, Scott
A search engine returns 2 results A and B for search "play game XYZ".
People who click on result B spend a much longer time and play a lot more XYZ games at site B, while clickers to site A leave the site after a short visit.
I'd imagine site B is a better search result than site A even though it's possible that site B is new and has few links to it in comparison with A. I assume better search engines would take this into account.
My question is, if so, how do they keep track of usage pattern of a search result in the real world?
There are two issues here:
If a user plays game B a lot, he is likely to write about it and link to it (blogs, reviews, social networks, ...). If he does, the static score of B will rise. This is part of the PageRank algorithm, which gives each page a static score and helps the search engine decide which page is better.
There is another factor some search engines use: if a user clicked a page but searched the same or a similar query very soon afterwards, it is likely he did not find what he was after. In this case, the search engine can assume the page is not a good fit to the query and reduce the score given to that page.
Other than that, the search engine cannot really know how much time you spent playing a game (unless you reach the site multiple times by re-running the query rather than navigating to it directly, in which case it can use the number of times the user reached the game through search).
Search engines get results with algorithms like PageRank, which sorts the websites in its database by how many sites link to them.
From Wikipedia:
that assigns a numerical weighting to each element of a hyperlinked
set of documents, such as the World Wide Web, with the purpose of
"measuring" its relative importance within the set.
As more sites link to it, its reputation, and thus its ranking, is assumed to rise.
Other methods can be used as well; for instance, search engines can detect how much time is spent on a website through their external services. Google, for example, can track time spent through its widely used tracking/web-stats service, Google Analytics.
As the other answers mention, another method of detecting a site's relevance to the query is checking whether a similar search is conducted within a short time frame of the previous one. This can indicate whether the user actually found what they were looking for on the previous site.
We are crawling and downloading lots of companies' PDFs and trying to pick out the ones that are Annual Reports. Such reports can be downloaded from most companies' investor-relations pages.
The PDFs are scanned and the database is populated with, among other things, the:
Title
Contents (full text)
Page count
Word count
Orientation
First line
Using this data we are checking for the obvious phrases such as:
Annual report
Financial statement
Quarterly report
Interim report
We then record the frequency of these and other phrases. So far we have around 350,000 PDFs to scan and a training set of 4,000 documents that have been manually classified as either a report or not.
We are experimenting with a number of different approaches including Bayesian classifiers and weighting the different factors available. We are building the classifier in Ruby. My question is: if you were thinking about this problem, where would you start?
You should try a quick and basic approach first to form a baseline, which may be good enough for your purposes. Here is one such approach:
Scan all PDFs and form the vocabulary, which is a numbered list of all words that occur in any document.
Create a feature vector from this vocabulary for each document by counting the frequency of each word (all words; don't bother hand-picking them). Feature i of document j is the number of times word i appears in document j.
Then weight the features by word importance, which is inversely related to how often the word occurs across all documents (i.e. the more often a word occurs in all documents, e.g. "the", the less information it carries).
Then use an unsupervised clustering algorithm such as k-means to cluster the documents: initialize by randomly placing k cluster centroids, assign each document to its nearest centroid, move each centroid to the average of the documents assigned to it, and repeat the last two steps until convergence.
Then find the cluster that contains the annual reports by using a few hand-labeled examples.
Adjust the number of clusters using a cross-validation set until accuracy on that set is high.
Finally, test on a held-out test set. If accuracy there is low, revisit the earlier steps.
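A minimal version of this baseline in Python with scikit-learn (the question mentions Ruby, so treat this as runnable pseudocode for the steps above; the sample documents are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    documents = [
        "annual report 2012 financial statements revenue",
        "quarterly interim report cash flow",
        "product brochure new features pricing",
    ]

    # Steps 1-3: vocabulary, per-document word counts, and inverse-document-frequency weighting.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)

    # Step 4: unsupervised clustering with k-means.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    print(labels)  # check which cluster your hand-labeled annual reports fall into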
For my dissertation a few years back I did something similar, but with digitised lecture slides and exam papers. One of the nicest books I came across for a good broad overview of search engines, search algorithms, and determining the effectiveness of the search was:
Search Engines: Information Retrieval in Practice, W. Bruce Croft, Donald Metzler, Trevor Strohman
There are some sample chapters on the publisher's website which will tell you whether the book is for you or not: pearsonhighered.com
Hope that helps.
Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.
For example, submitting "baseball" would return
["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]
Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I won't go against their TOS). Also, single-word input doesn't garner a very diverse set of results. I'm looking for a solution that goes off on tangents.
The closest I've experimented with is using Wikipedia's API to search categories and backlinks, but there's no way to directly sort those results by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.
Using a thesaurus could also work, minimally, but that would leave out any proper nouns or tangentially relevant terms (like any of the results listed above).
I would happily reuse an open service, if one exists, but I haven't found anything sufficient.
I'm looking for either a way to implement this in-house with a decently populated starting set, or a free service that offers this which I can reuse.
Have a solution? Thanks ahead of time!
UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)
You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.
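For example, a small sketch using NLTK's WordNet interface (NLTK is just one convenient way to query WordNet, and it requires downloading the WordNet corpus once with nltk.download("wordnet")):

    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("baseball"):
        print(synset.name(), "-", synset.definition())
        # More general and more specific concepts linked to this sense of the word.
        print("  more general:", [s.name() for s in synset.hypernyms()])
        print("  more specific:", [s.name() for s in synset.hyponyms()[:5]])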
Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.
You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.
If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.
It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
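As a toy illustration of the n-gram idea (nowhere near the scale of Google's collection, and the sample sentence is invented), frequent neighbors of a word already hint at related concepts:

    from collections import Counter

    text = ("babe ruth hit a foul ball in the baseball game "
            "after the baseball strike and a steroids scandal")
    stopwords = {"a", "the", "in", "after", "and"}
    tokens = [w for w in text.split() if w not in stopwords]

    # Count neighboring word pairs (bigrams).
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Words that most often appear next to "baseball" are candidate related terms.
    related = Counter()
    for (left, right), count in bigrams.items():
        if left == "baseball":
            related[right] += count
        elif right == "baseball":
            related[left] += count
    print(related.most_common(3))  # [('game', 2), ('ball', 1), ('strike', 1)]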
Take a look at the following two papers:
Clustering User Queries of a Search Engine [pdf]
Topic Detection by Clustering Keywords [pdf]
Here is my attempt at a very simplified explanation:
If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.
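Here is a toy sketch of that document co-occurrence similarity (the mini corpus of tokenized documents is invented):

    def term_similarity(term_a, term_b, documents):
        # Documents containing both terms divided by documents containing either.
        docs_a = {i for i, doc in enumerate(documents) if term_a in doc}
        docs_b = {i for i, doc in enumerate(documents) if term_b in doc}
        both = len(docs_a & docs_b)
        either = len(docs_a | docs_b)
        return both / either if either else 0.0

    documents = [
        {"celtics", "lakers", "nba", "game"},
        {"lakers", "playoffs", "score"},
        {"recipe", "pasta", "dinner"},
    ]
    print(term_similarity("celtics", "lakers", documents))  # 0.5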
If our data permits, we could see which queries led users to choose which results, instead of comparing documents by content. For example, if we had data showing that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.
If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.
What searching algorithm/concept is used in Google?
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Indexing
If you want to get down to basics:
Google uses an inverted index of the Internet. What this means is that Google has an index of all the pages it has crawled, keyed on the terms in each page. For instance, the term "Google" maps to this page, the Google home page, and the Wikipedia article for Google, amongst others.
Thus, when you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet and finds the entry for the term "Google" and with it the list of all pages that have that term referenced in it.
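A bare-bones sketch of such an inverted index (the pages and their text are invented, and a real index stores far more than a set of URLs per term):

    from collections import defaultdict

    pages = {
        "https://www.google.com": "google search engine home",
        "https://en.wikipedia.org/wiki/Google": "google is a search company",
        "https://example.com": "an example page about maps",
    }

    # Map each term to the set of pages that contain it.
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.split():
            index[term].add(url)

    print(index["google"])  # every page containing the term "google"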
For veteran users:
Google's index goes beyond your simple inverted index, however. This is why Google is the best. Google's crawlers (spiders) are smart. Very smart. Beyond just keeping track of the terms that are on any given web page, they also keep track of words that are on related pages and link those to the given document.
In other words, if a page has the term Google in it and the page has a link to or is linked from another web page, the other page may be referenced in the index under the term Google as well. All this and more go into why a given page is returned for a given query.
If you want to go into why pages are ordered the way they are in your search results, that gets into even more interesting stuff.
Ranking
To get down to basics:
Perhaps one of the most basic algorithms a search engine can use to sort your results is known as term frequency-inverse document frequency (tf-idf). Simply put, this means that your results will be ordered by the relative importance of your search terms in the document. In other words, a document that has 10 pages and lists the word Google once is not nearly as important as a document that has 1 page and lists the word Google ten times.
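A textbook version of that weighting looks roughly like this (real engines layer many refinements on top of it):

    import math

    def tf_idf(term, doc, all_docs):
        tf = doc.count(term) / len(doc)                    # how dominant the term is in this document
        df = sum(1 for d in all_docs if term in d)         # how many documents contain the term
        idf = math.log(len(all_docs) / df) if df else 0.0  # rarer terms carry more weight
        return tf * idf

    docs = [["google", "maps"], ["google"] * 10, ["weather", "report"]]
    print(tf_idf("google", docs[1], docs))  # the short, ten-mention document scores highest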
For veteran users:
Again, Google does quite a bit more than your basic search engine when it comes to ranking results. Google has implemented the aforementioned patented PageRank algorithm. In short, PageRank enhances the tf-idf algorithm by taking into account the popularity/importance of a given page. At this point, popularity/importance may be judged by any number of factors that Google just won't tell us. However, at the most basic level, Google can tell that one page is more important than another because loads and loads of other pages link to it.
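For illustration, here is a toy power-iteration PageRank over an invented three-page link graph. This is the basic published algorithm, not Google's production ranking, and dangling pages are ignored for brevity.

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue  # dangling pages ignored in this toy version
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank
        return rank

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(links))  # C accumulates the most rank, since both A and B link to it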
Google's patented PigeonRank™
Wow, they originally posted this seven years ago this Wednesday ...
PageRank is a link analysis algorithm used by Google's search engine, but the patent is assigned to Stanford University.
I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" is a little outdated.
Here is a more recent talk about scalability: Challenges in Building Large-Scale Information Retrieval Systems
An inverted index and MapReduce are the basics of most search engines (I believe). You create an index of the content and run queries against that index to determine relevance. Google, however, does much more than a simple index of where each word occurs: it also tracks how many times a word appears, where it appears, where it appears in relation to other words, the ordering, and so on. Another simple concept that's used is "stop words", which may include things like "and", "the", and so on (basically "simple" words that occur often and are generally not the focus of a query). In addition, they employ things like PageRank (mentioned by TStamper) to order pages by relevance and importance.
MapReduce is basically taking one job, dividing it into smaller jobs, and letting those smaller jobs run on many systems (partly for scalability and partly for speed). If I recall correctly, Google was able to use "average" computers rather than server-grade machines to distribute jobs to. Since the processing capability of a single computer is reaching its peak, much of the industry is heading towards cloud computing, where a job is done by many physical machines.
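A single-machine toy of the MapReduce idea applied to term counting (the real framework distributes the map and reduce phases across many machines and handles shuffling, failures, and so on):

    from collections import defaultdict

    def map_phase(doc_id, text):
        # The mapper emits (word, 1) pairs for every word in one document.
        return [(word, 1) for word in text.split()]

    def reduce_phase(pairs):
        # The reducer sums the emitted counts per word.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = {1: "the quick brown fox", 2: "the lazy dog"}
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    print(reduce_phase(pairs))  # {'the': 2, 'quick': 1, ...}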
I'm not sure how much searching Google does; it's more accurately described as crawling. The difference is that they start at specific points, crawl to anything reachable, and repeat until they hit some sort of dead end.
While looking into the PageRank algorithm and similar techniques, I was disturbed to discover that the introduction of personalized search at the turn of the year (not widely commented on) seems to change quite a lot; see Failure of the Google Gold Standard and Google's Personalized Results.
This question cannot be answered canonically. The algorithms used by Google (and other search engines) are among their most closely guarded secrets and change constantly. Any correct answer could be invalid a month or a year later.
(I know this doesn't really answer the question, but that's the point, there is no possible answer.)