How does Google Chrome Omnibox autocomplete work? - algorithm

The comment in the following post is particularly helpful in understanding part of the algorithm.
How does Chrome update URL bar completions?
Yet questions remain. I ran some experiments in Chrome:
When I type "eddit", it only suggests "reddit" as a general Google search, while if I type "reddit" in full, the reddit URLs from my history pop up.
If I type a substring of "facebook", "google" or "youtube" (say "ceb", "ogl", "utu"), the URLs pop up successfully. Hence tries cannot be the (only) data structure used here.
Furthermore, I know Chrome uses SQLite's FTS for full-text search (SQLite attributes FTS3/4 to Google). So I guess Chrome keeps an inverted index of URLs in SQLite.
My question is:
How does Chrome manage to autocomplete "utu" -> "youtube" (based on my local history URLs)?
I know a suffix array/tree can match substrings efficiently, but finding the particular word "youtube" from a match would still be linear.
I guess a tailored tokenizer (for FTS3/4) could achieve this, say "google" -> {"g", ..., "gle", ...}. But that would generate too many tokens.
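To make the tokenizer idea above concrete, here is a minimal sketch of a fixed-length n-gram inverted index. It is not Chrome's actual implementation (the question only guesses at that); the class name NgramIndex and the choice of n=3 are assumptions for illustration. With a fixed n, a word of length L produces at most L - n + 1 tokens rather than every possible substring, which keeps the token count linear in the word length.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all length-n substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramIndex:
    """Hypothetical inverted index: n-gram -> set of words containing it."""

    def __init__(self, n=3):
        self.n = n
        self.words = set()
        self.postings = defaultdict(set)

    def add(self, word):
        self.words.add(word)
        for gram in ngrams(word, self.n):
            self.postings[gram].add(word)

    def search(self, query):
        """Return all indexed words that contain query as a substring."""
        if len(query) < self.n:
            # Query is too short to form an n-gram: fall back to a linear scan.
            candidates = self.words
        else:
            # A word containing the query must contain every n-gram of the query.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in ngrams(query, self.n))
            )
        # Sharing all n-grams is necessary but not sufficient, so verify.
        return sorted(w for w in candidates if query in w)


index = NgramIndex()
for host in ["youtube", "google", "facebook", "reddit"]:
    index.add(host)

print(index.search("utu"))
print(index.search("ogl"))
```

The final substring check is needed because a word can contain all of a query's n-grams without containing the query itself.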

Related

Correct way to search videos with multiple keywords with OR condition for youtube search API

I'm trying to use the YouTube Data search and video APIs in my web application to display the most-viewed videos related to several keywords. I'm planning to use two calls in total: the first call gets a list of IDs with the search API, and the second call gets details for the IDs returned by the first call, with the video API.
My question concerns the search API. Based on trial and error, if I pass multiple space-separated keywords in the q parameter of the search API, it appears to behave as an AND condition, which is not the usual behavior of, say, Google. To search for multiple keywords with an OR condition, as far as I can tell it works if I include OR between the keywords, but I would like to confirm that my assumption is correct, officially if possible.
I should be able to find this kind of specification in the official documentation, but so far I have had no luck. It would be very helpful if you could share the relevant links, if they exist, or give me the official answer.
By the way, this is my first post on Stack Overflow. If my question is missing anything, please kindly advise.
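The two-call plan described above could be sketched as follows. This only builds the request URLs and does not call the API. It assumes the v3 endpoints search.list and videos.list, and it assumes that the q parameter accepts the pipe character as an OR operator, which is how I recall the search.list reference describing it; please verify that against the current documentation. The helper names and the API key value are placeholders.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"
VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"


def build_search_url(keywords, api_key, max_results=10):
    """First call: find video IDs matching any keyword, ordered by view count."""
    params = {
        "part": "snippet",
        "q": "|".join(keywords),  # assumption: "|" means OR in the q parameter
        "type": "video",
        "order": "viewCount",
        "maxResults": max_results,
        "key": api_key,
    }
    return SEARCH_URL + "?" + urlencode(params)


def build_videos_url(video_ids, api_key):
    """Second call: fetch details for the IDs returned by the first call."""
    params = {
        "part": "snippet,statistics",
        "id": ",".join(video_ids),
        "key": api_key,
    }
    return VIDEOS_URL + "?" + urlencode(params)


print(build_search_url(["boating", "sailing"], "YOUR_API_KEY"))
print(build_videos_url(["id1", "id2"], "YOUR_API_KEY"))
```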

Skip common/duplicate parts while indexing web pages with ElasticSearch

I don't have any experience with ElasticSearch yet, but from what I have read I think it suits most of my needs. I have a web scraper which scrapes the pages of certain domains.
I want to feed these pages into ES and offer a front-end interface to search the scraped content. I'm building a sort of vertical search engine.
But as we all know, the web pages of one host often contain only a little unique content; a large part of each page is common. Footer, header, menu etc. are the same on every page.
Does ElasticSearch have some built-in intelligence that can filter out the common parts and search only the real content?
It's not terribly difficult to pump web content into Elastic, so I'll assume you have that down. =)
I think this article is fantastic for understanding how to index/search web pages:
http://blog.urx.com/urx-blog/2014/9/4/the-science-of-crawl-part-1-deduplication-of-web-content
It's a complex problem and they have some great detail. There is nothing I know of natively in Elastic that has intelligence to help you eliminate duplicates etc.
The strategy to adopt here is to create a unique key per document. A checksum of the content using SHA-1 or a similar algorithm will do the job. Make this the document ID so that only one copy of a page exists at any point in time. Use the _create API to index if you don't want new duplicates to be indexed (more efficient); if you want the newest copy to become the document, use normal indexing.
If you need to modify the original document when a duplicate is discovered, use upsert.
I have explained a great deal of this in this blog.
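The checksum-as-ID strategy can be sketched without a running cluster. The FakeIndex class below is a hypothetical in-memory stand-in that only mimics the difference between create-only and normal (overwriting) indexing; it is not the Elasticsearch client. Collapsing whitespace before hashing is my own assumption, not part of the answer.

```python
import hashlib


def document_id(content: str) -> str:
    """Derive a stable document ID from the page content."""
    # Assumption: collapse whitespace so trivial formatting changes
    # do not produce a different checksum.
    normalized = " ".join(content.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class FakeIndex:
    """Hypothetical stand-in for an index, keyed by document ID."""

    def __init__(self):
        self.docs = {}

    def create(self, doc_id, doc):
        """Create-only semantics: refuse to index an ID that already exists."""
        if doc_id in self.docs:
            return False
        self.docs[doc_id] = doc
        return True

    def index(self, doc_id, doc):
        """Normal indexing semantics: the newest copy replaces the old one."""
        self.docs[doc_id] = doc


idx = FakeIndex()
page_a = {"url": "http://example.com/a", "body": "Same   article text"}
page_b = {"url": "http://example.com/b", "body": "Same article text"}

first = idx.create(document_id(page_a["body"]), page_a)
second = idx.create(document_id(page_b["body"]), page_b)
print(first, second, len(idx.docs))
```

Note that a checksum only catches pages whose hashed content is identical. It does not by itself strip a shared header, footer or menu from otherwise different pages.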

How is the search snippet and meta "description" of a post determined?

When you search on Stack Overflow, in most cases a search snippet (the first 40 words or so of a post / question) is shown. In some cases more text is shown, and this text includes the search terms. Both blocks of text end with an ellipsis.
If you look at the meta tag "description" or "og:description", a similar text is included, thus allowing Google to index correctly.
My questions:
What search engine is Stack Overflow using (Elasticsearch / Lucene)?
How and when is the search snippet determined (in real time during the search, or when saving a post / question)?
How and when is the meta description determined?
I ask these questions because I want to avoid coding my own algorithm to determine the first 40 words or so of an HTML article (in our case a blog post).
thx
Marc
Stack Overflow uses Elasticsearch.
Elasticsearch has a highlighting feature that takes care of these things.
The snippet is determined at search time, in order to find the snippet that is most likely to be relevant to the user's query.
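For illustration only, here is a plain-Python sketch of the two kinds of snippet described in the question: a leading snippet of the first N words, and a snippet computed at search time around the first matching query term. This is not Elasticsearch's highlighter; the function names and window sizes are assumptions.

```python
def leading_snippet(text, max_words=40):
    """Return the first max_words words, with an ellipsis if truncated."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


def query_snippet(text, query, window=10):
    """Return the words around the first query term found in text."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Ignore surrounding punctuation when comparing against query terms.
        if word.lower().strip(".,;:!?\"'()") in terms:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No term matched: fall back to the leading snippet.
    return leading_snippet(text)


post = " ".join(f"w{i}" for i in range(100))
print(leading_snippet(post, max_words=5))
print(query_snippet(post, "w50", window=2))
```

The leading snippet can be computed once when a post is saved, whereas the query-dependent one has to be computed per search, which matches the answer's point about search time.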

Can I rely on Google CSE results accuracy compared to google.com?

I have been testing a CSE's accuracy in comparison to Google, and it seems to fall down when I type in full URLs with long query strings. Shorter keyword-based queries and pages with clean URLs are coming through fine.
At first I just thought the pages were not indexed, but they are on google.com and google.co.uk, the only problem is with my CSE. Hence the confusion.
Does anyone know if there is a fundamental difference between:
The ranking algorithm used
The datasets being used
The datacenters being used.
Anything else.
I have tried only allowing the specific site, as well as allowing results from the entire web.
To put it simply, can I reliably expect a CSE's and Google's results to match or be very similar, all other things being equal?
No, the mismatch between google.com results and CSE results is a known issue. Google has said that they value speed of results over completeness, and that's just how it is.
This answer has been the same since 2007:
http://www.google.com/support/customsearch/bin/answer.py?hl=en&answer=141877
I've noticed that CSE search results are missing the results from forums.

How to search usenet for programming questions?

I've been using Usenet searches since about 1995 to get programming information, mostly for Microsoft APIs: first via DejaNews, and now via Google Groups, which bought out DejaNews. Over the last few years I've noticed a steady decline in the number of Usenet search results from Google, and today I find I'm completely unable to get a working Usenet search on their advanced group search page. I'm used to searching on "microsoft.*", sometimes supplemented with "microsoft" or "microsoft*". Just try to find a post from the 1996-1998 period on "database" in either the comp.* or microsoft.* hierarchies, and if you can do it, please show your search expression. There should be thousands of results.
http://groups.google.com/groups/search?safe=off&q=database+group%3Amicrosoft*&btnG=Rechercher&as_mind=1&as_minm=1&as_miny=1996&as_maxd=1&as_maxm=1&as_maxy=1999&as_drrb=b&sitesearch=
seems to work nicely... 994 results (not thousands, but still...)
It appears to be a problem with the advanced search form. I can't get the one at
http://groups.google.fr/advanced_search?hl=fr&q=&hl=fr&
to work either. But I can use the basic form with "database group:microsoft*" and I get many results as expected.
http://www.google.ca/groups/search?safe=off&q=database+group%3Acomp.*&btnG=Search&sitesearch=
returns 3,000 results
The advanced search isn't working for me either:
Broken advanced search results URL
However, removing lr=selected from the query string in that URL makes it work, for some reason:
Working advanced search results URL
In fact, hitting the search button again on the broken advanced search results page will return those results as well for me.
Or actually, it's only partly working, since entering multiple comma-separated groups in the advanced search form (or using the group: search operator) doesn't quite work as expected and ends up adding all the words in the additional group names as search keywords too.
You could try learning Julian dates and use the daterange search operator:
Search results using daterange:
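If you want to try the daterange suggestion, the Julian day numbers can be computed rather than learned. The sketch below assumes the operator takes the form daterange:START-END with integer Julian day numbers, as the answer describes; the helper names are hypothetical. It relies on the fact that Python's proleptic Gregorian ordinal differs from the Julian Day Number by a constant 1721425.

```python
from datetime import date

# Julian Day Number = proleptic Gregorian ordinal + 1721425
JDN_OFFSET = 1721425


def julian_day(d: date) -> int:
    """Return the integer Julian Day Number for a calendar date."""
    return d.toordinal() + JDN_OFFSET


def daterange_query(keywords: str, start: date, end: date) -> str:
    """Build a query string using the daterange operator (assumed syntax)."""
    return f"{keywords} daterange:{julian_day(start)}-{julian_day(end)}"


print(julian_day(date(2000, 1, 1)))
print(daterange_query("database", date(1996, 1, 1), date(1999, 1, 1)))
```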
