Can I rely on Google CSE results accuracy compared to google.com? - google-api

I have been testing a CSE's accuracy in comparison to Google, and it seems to fall down when I type in full URLs with long query strings. Shorter keyword-based queries and pages with clean URLs are coming through fine.
At first I just thought the pages were not indexed, but they are on google.com and google.co.uk; the only problem is with my CSE. Hence the confusion.
Does anyone know if there is a fundamental difference between:
The ranking algorithm used
The datasets being used
The datacenters being used
Anything else
I have tried only allowing the specific site, as well as allowing results from the entire web.
To put it simply, can I reliably expect a CSE's and Google's results to match or be very similar, all else being equal?

No, the mismatch between google.com results and CSE results is a known issue. Google has said that they value speed of results over completeness, and that's just how it is.
This answer has been the same since 2007:
http://www.google.com/support/customsearch/bin/answer.py?hl=en&answer=141877

I've noticed that CSE search results are missing results from forums.

Related

Has anything changed on geocode API

I just wanted to know if anything changed in the geocode API on 21st February, because before the 21st it was validating 9-digit zip codes, but since yesterday it gives an error on 9-digit zip codes and now only validates 5-digit zip codes.
More information in your question would be helpful.
I haven't noticed any change, but I thought I'd take a look at the GeoCoder Documentation FAQ for you.
Yes, based on that date, I'd say something changed recently.
Perhaps this is what you're referring to, but that's only speculation since you didn't provide any details or examples.
Troubleshooting
I’m getting more queries that return ZERO_RESULTS with the new geocoder. What’s going on?
In the new geocoder, ambiguous, incomplete and badly formatted queries, such as misspelled or nonexistent addresses, are prone to produce ZERO_RESULTS. These queries would typically produce incorrect results in the old geocoder, such as returning the suburb if the address could not be found. We believe that returning ZERO_RESULTS is actually a more correct response in such situations.
If your application deals with user input of addresses, the Place Autocomplete feature in the Places API may produce better quality results. Place Autocomplete allows users to select from a set of results based on what they’ve typed, which allows users to choose between similarly named results, and to adjust their query if they misspell an address.
If you have an application dealing with ambiguous or incomplete queries or queries that may contain errors, we recommend you use the Place Autocomplete feature in the Places API rather than the forward geocoder available in the Geocoding API. For more details, see Best Practices When Geocoding Addresses and the Address Geocoding in the Google Maps APIs blog post.
More Information:
Documentation FAQ
Related Issue Tracker
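If you want to check the behaviour yourself, here is a minimal sketch (Python) that calls the public Geocoding web service and prints the status field, which is ZERO_RESULTS when nothing is found. The API key and the sample ZIP codes are placeholders I made up, not values from the question:

```python
# Minimal sketch: compare how the geocoder handles a 5-digit ZIP versus
# a 9-digit ZIP+4 query. Replace YOUR_API_KEY with a real key; the
# sample ZIP codes are arbitrary placeholders.
import json
import urllib.parse
import urllib.request

def geocode_status(address, api_key):
    params = urllib.parse.urlencode({"address": address, "key": api_key})
    url = "https://maps.googleapis.com/maps/api/geocode/json?" + params
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data["status"]  # "OK", "ZERO_RESULTS", etc.

for zip_code in ["94043", "94043-1351"]:
    print(zip_code, geocode_status(zip_code, "YOUR_API_KEY"))
```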

google search appliance - explain results

I might be missing something obvious, but is there any way to get insight into why the GSA results for a query are what they are? E.g. Lucene searchers have an explain method. Is there anything similar in GSA?
This would be extremely useful when you don't quite understand why you are getting the results you are getting and why the order is what it is.
No. According to expert reports in the enterprise search domain (e.g. Gartner, among others), Google has never explained how it ranks search results in the GSA.

How does spell checker and spell fixer of Google (or any search engine) work?

When searching for something in Google, if you misspell a word (maybe by mistake, or maybe when you really mean this non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
This basically means being able to find the closest dictionary word to the non-dictionary word entered. How does it work? One way I can guess is:
count the number of occurrences of each character and then scan the dictionary for a word with the same counts (allowing a ±1 difference). But this will also return anagrams.
Is some kind of probabilistic model of any use here, such as a Markov model? I don't understand Markov models well enough to throw the term around; it's just a very wild guess.
Any insights?
You're forgetting that Google has a lot more information available to it than you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".
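For the "closest dictionary word" part of the question, the usual textbook approach is a noisy-channel style corrector: generate every candidate within a small edit distance of the query and rank the candidates by how often they appear in a corpus. A minimal sketch, with a made-up toy frequency table standing in for real query or corpus counts:

```python
# Minimal sketch of the "closest dictionary word" idea: generate every
# string within one edit of the query and keep the most frequent known
# word. WORD_FREQ is a toy stand-in for counts mined from a large corpus.
import string

WORD_FREQ = {"flower": 120, "flickr": 80, "red": 200}  # toy counts

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + replaces + inserts + transposes)

def correct(word):
    candidates = [w for w in edits1(word) if w in WORD_FREQ] or [word]
    return max(candidates, key=lambda w: WORD_FREQ.get(w, 0))

print(correct("flwoer"))  # -> "flower"
```

Unlike counting character occurrences, edit operations preserve letter order, so this does not suffer from the anagram problem mentioned in the question.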
When you search for something that is related to other searches performed earlier and close to yours that returned more results, Google shows suggestions based on them.
It is not really spell checking; it shows what other people queried for using related keywords.

How does google know if I type in redflower.jpg I mean Red Flower?

I'm curious what programming terms or methodology are used when Google shows you the "did you mean" link for a word that is made up of multiple words.
For example, if I type in "redflower.jpg", it knows to break that up into Red Flower.
Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?
thanks!
If Google does not see a lot of matching results for redflower.jpg, it might then try to split the term into multiple words until it finds a lot of matching results.
It might also recognize the image extension (.jpg) and then try to find images with a similar name.
If I had to make an algorithm like this, I would use a huge existing database (either a dictionary or a search engine) and then try what I described at the beginning of my post.
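A minimal sketch of that word-splitting step, assuming you already have a word list (the WORDS set below is a toy stand-in for a real dictionary or query log):

```python
# Minimal sketch of dictionary-based word splitting: try every split of
# the filename (extension stripped) and keep the one whose parts are all
# known words, preferring the split with the fewest words.
from functools import lru_cache

WORDS = {"red", "flower", "flowers"}  # toy dictionary

def segment(text):
    @lru_cache(maxsize=None)
    def best(s):
        if not s:
            return []
        options = []
        for i in range(1, len(s) + 1):
            head, tail = s[:i], s[i:]
            if head in WORDS:
                rest = best(tail)
                if rest is not None:
                    options.append([head] + rest)
        return min(options, key=len) if options else None
    return best(text)

name = "redflower.jpg".rsplit(".", 1)[0]  # strip the extension first
print(segment(name))  # -> ['red', 'flower']
```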
Perhaps they could look at what other people do when they have searched for redflower.jpg? Maybe a number of people searched for "redflower.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.
Of course they would have to take into account whether the queries are similar (e.g. contain matching strings), otherwise some strange results might appear.
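As an illustration of that idea, here is a minimal sketch that mines "correction" pairs from a toy search log: a query with no clicks followed shortly by a similar query that did get clicks. The log format, similarity threshold and 60-second window are all assumptions made up for the example:

```python
# Minimal sketch of mining corrections from a search log: count cases
# where a query with no clicks is quickly followed by a similar query
# that did get clicks.
from collections import Counter
from difflib import SequenceMatcher

# (timestamp_seconds, query, clicked_a_result) -- toy log
log = [
    (0,  "redflowers.jpg", False),
    (25, "red flower",     True),
    (90, "blue car",       True),
]

corrections = Counter()
for (t1, q1, clicked1), (t2, q2, clicked2) in zip(log, log[1:]):
    similar = SequenceMatcher(None, q1, q2).ratio() > 0.5
    if not clicked1 and clicked2 and t2 - t1 < 60 and similar:
        corrections[(q1, q2)] += 1

print(corrections.most_common(1))  # [(('redflowers.jpg', 'red flower'), 1)]
```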

Searching a datastore for related topics by keyword

For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team@stackoverflow.com on Mar 20 that mentions how it works:
the "ask a question" search is
exclusively on title and will not
match anything in the body. It is a
mystery to me why people think it's
better.
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do a Google search with the site:stackoverflow.com prefix than to rely on SO to provide the relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, all keywords in your question are matched against the index. This is similar to Google Search. The open source search library Lucene can be (and quite possibly has been) used for this. Since the results are not quite accurate, I presume they index just the titles of the questions, as an approximation.
The other related keyword is collaborative filtering, the approach popularized by Amazon to recommend products based on the behavior of other similar customers. In the current case, an alternative algorithm based on collaborative filtering would be: extract keywords from the question, find the tags historically associated with those keywords, and return questions that have those tags. Experiments would be needed to see whether it works well at all.
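As a rough illustration of title-only keyword matching (which is what the email quoted above describes), here is a minimal sketch that scores existing titles by how many non-stop-words they share with a new title. The titles, stop-word list and scoring are made up for the example; a real system would use something like tf-idf or Lucene scoring:

```python
# Minimal sketch of title-keyword matching: score existing question
# titles by how many non-stop-words they share with the new title.
import re
from collections import Counter

existing = [
    "Searching a datastore for related topics by keyword",
    "How does the Google 'Did you mean?' algorithm work?",
    "Has anything changed on geocode API",
]
STOP = {"a", "the", "for", "by", "on", "does", "how", "has"}

def tokens(text):
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}

def related(new_title, titles, top=2):
    query = tokens(new_title)
    scored = [(len(query & tokens(t)), t) for t in titles]
    return [t for score, t in sorted(scored, reverse=True) if score > 0][:top]

print(related("How do search engines find related keywords?", existing))
```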
