How does Google know that when I type in redflower.jpg I mean Red Flower? - algorithm

I'm curious what programming term or methodology is used when Google shows you the "did you mean" link for a word that is made up of multiple words.
For example, if I type in "redflower.jpg", it knows to break that up into Red Flower.
Is there a common paradigm for doing that sort of operation? Would a Lucene search give you that?
Thanks!

If Google does not see a lot of matching results for redflower.jpg, it might then try to split the query into multiple words until it finds a lot of matching results.
It might also recognize the extension (.jpg) as an image extension and then try to find images with a similar name.
If I had to write an algorithm like this, I would use a huge existing database (either a dictionary or a search engine) and then try what I described at the beginning of this post.
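A minimal sketch of that splitting idea, assuming you already have a dictionary of known words (the WORDS set and the "fewest words wins" scoring below are illustrative assumptions, not Google's actual method):

```python
# Hypothetical sketch: split a run-together token like "redflower" into
# dictionary words using dynamic programming with memoization.
WORDS = {"red", "flower", "flowers", "a"}

def segment(text, memo=None):
    """Return the best split of text into known words, or None if impossible."""
    if memo is None:
        memo = {}
    if text in memo:
        return memo[text]
    if not text:
        return []
    best = None
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in WORDS:
            rest = segment(text[i:], memo)
            if rest is not None and (best is None or 1 + len(rest) < len(best)):
                best = [head] + rest  # prefer splits with fewer, longer words
    memo[text] = best
    return best

query = "redflower.jpg"
name, _, ext = query.partition(".")  # handle the file extension separately
print(segment(name), ext)            # ['red', 'flower'] jpg
```

A real engine would rank candidate splits by word or query frequency rather than just word count, and would only offer the suggestion when the split version returns many more results.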

Perhaps they could look at what other people do when they have searched for redflower.jpg. Maybe a number of people searched for "redflower.jpg", didn't click on any links, and then searched for "Red Flower" and found some results worth clicking on.
Of course, they would have to take into account that the queries are similar (contain matching strings), otherwise some strange results might appear.
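As a toy illustration of that log-mining idea (the session format, the threshold, and the data below are all invented for the example), you could count cases where a query with no clicks was followed in the same session by a similar query that did get clicks:

```python
from collections import Counter
from difflib import SequenceMatcher

# Invented log format: (session_id, query, clicked_a_result)
log = [
    ("s1", "redflower.jpg", False), ("s1", "red flower", True),
    ("s2", "redflower.jpg", False), ("s2", "red flower", True),
    ("s3", "redflower.jpg", False), ("s3", "weather", True),  # unrelated
]

def similar(a, b, threshold=0.6):
    # Require the two queries to share most of their characters,
    # per the "contain matching strings" caveat above.
    return SequenceMatcher(None, a, b).ratio() >= threshold

rewrites = Counter()
for (s1, q1, clicked1), (s2, q2, clicked2) in zip(log, log[1:]):
    if s1 == s2 and not clicked1 and clicked2 and similar(q1, q2):
        rewrites[(q1, q2)] += 1

print(rewrites.most_common(1))
# [(('redflower.jpg', 'red flower'), 2)] -> suggest "red flower"
```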

Related

How do the spell checker and spell fixer of Google (or any search engine) work?

When searching for something in Google, if you misspell a word (maybe by mistake, or maybe when you really mean this non-dictionary word), Google says:
"Showing results for ..... Search instead for .......".
I am trying to figure out how this would work.
It basically means being able to find the closest dictionary word to the non-dictionary word entered. How does that work? One way I can guess is: count the number of instances of each character, then scan the dictionary to find a word with the same number of instances of each character (allowing a ±1 difference). But this would also return anagrams.
Is some kind of probabilistic model of any use here, such as a Markov model? I don't understand Markov models well enough to throw them around, but that is just a very wild guess.
Any insights?
You're forgetting that Google has a lot more information available to it than you do. They track when people type in a word, don't select a result, and then do another search shortly afterwards. They then use this information to suggest better searches for you.
See How does the Google "Did you mean?" Algorithm work? for a fuller explanation.
Note that this approach makes sense when you consider that Google aren't actually doing spell-checking. Instead, they are trying to work out what search term will give you the answer you are looking for. Obviously there is a lot of overlap between this and spell-checking, but it means they are not always trying to correct a search for, e.g., "Flickr".
When you search for something that is related to other, earlier searches close to yours which got more results, Google shows suggestions based on them.
We can be sure that it is not just spell checking: it reflects what other people queried for with related keywords.
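On the dictionary-distance idea from the question: a common baseline that avoids the anagram problem is edit (Levenshtein) distance, which counts insertions, deletions, and substitutions instead of comparing character frequencies. A minimal sketch, with an illustrative word list:

```python
# Hypothetical sketch: find the closest dictionary word by edit distance.
# DICTIONARY is a stand-in; a real system would also weight candidates by
# how often people actually search for them.
DICTIONARY = ["flower", "flows", "floor", "fowler", "wolfer"]

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

def closest(word):
    return min(DICTIONARY, key=lambda w: edit_distance(word, w))

# "wolfer" has exactly the same character counts as "flwoer" and "flower",
# so the character-counting idea cannot separate them; edit distance can.
print(closest("flwoer"))  # flower
```

Peter Norvig's essay "How to Write a Spelling Corrector" combines exactly this kind of candidate generation with word-frequency probabilities, which is one simple probabilistic model of the sort the question guesses at.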

Bing/Google/Flickr API: how would you find an image to go along each of 150,000 Japanese sentences?

I'm doing a part-of-speech & morphological analysis project for Japanese sentences. Each sentence will have its own webpage. To make each page more visual, I want to show one picture which is somehow related to the sentence. For example, for the sentence "私は学生です" ("I'm a student"), relevant pictures would be pictures of a school, a Japanese textbook, students, etc. What I have: part-of-speech tagging for every word. My approach now: use 2-3 nouns from every sentence and retrieve the first image from the search results using the Bing Images API. Note: all the sentence processing up to this point was done in Java.
Have a couple of questions though:
1) Which is better (richer corpus & more powerful search) for searching nouns in Japanese: the Google Images API, the Bing Images API, the Flickr API, etc.?
2) How do you select the most important noun from the sentence to use as the image search query, without doing complicated topic modeling, etc.?
Thanks!
Japanese WordNet has links to OpenClipart pictures. That could be another relevant source. They describe it in their paper called "Enhancing the Japanese WordNet".
I thought you would start by choosing any noun before は、が and を and giving these priority - probably in that order.
But that assumes that your part-of-speech tagging is good enough to get は=subject identified properly (as I guess you know that は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it worked about as well as could be expected, except where none of those particles are used, which is fairly rare.
There are also sentences like the one below, where you'd have to consider looking for で and taking the noun before it in the case where there is no を or は. If you look at the example, the word 人 (people) really doesn't tell you anything about what's being said; without parsing the context properly, you don't even know whether the noun is person or people.
毎年 交通事故で 多くの人が 死にます
(many people die in traffic accidents every year)
But basically, couldn't you implement a priority/fallback type system like this?
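A minimal sketch of such a priority/fallback system, assuming the upstream tagger emits (surface, part-of-speech) pairs (that token format, the tags, and the particle ordering are assumptions; で is tried before が here precisely because of the 人 problem above):

```python
# Hypothetical sketch of the particle-priority idea.
PARTICLE_PRIORITY = ["は", "を", "で", "が"]

def pick_query_noun(tokens):
    """tokens: list of (surface, pos) pairs in sentence order."""
    for particle in PARTICLE_PRIORITY:
        for prev, (surface, _pos) in zip(tokens, tokens[1:]):
            if surface == particle and prev[1] == "noun":
                return prev[0]  # the noun directly before the particle
    for surface, pos in tokens:  # fallback: first noun in the sentence
        if pos == "noun":
            return surface
    return None

# 毎年 交通事故で 多くの人が 死にます
sentence = [("毎年", "noun"), ("交通事故", "noun"), ("で", "particle"),
            ("多く", "noun"), ("の", "particle"), ("人", "noun"),
            ("が", "particle"), ("死に", "verb"), ("ます", "aux")]
print(pick_query_noun(sentence))  # 交通事故 (rather than the vague 人)
```

Where が should sit relative to で in the ordering is exactly the kind of thing you'd tune by eyeballing sample sentences, as described above.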
BTW I hope your sentences all use kanji, or when you see はし (in one of the sentences linked to) you won't know whether to show a bridge or chopsticks - and showing the wrong one will probably not be good.

How to detect sensitive/personal information in CVs programmatically (by means of syntax analysis/parsing, etc.)

To make the matter more specific:
How to detect people's names (seems like a simple case of named entity extraction?)
How to detect addresses: my best guess - find the postcode (regexes), and country and town names, and take some text around them.
As for phones and emails - they could probably be caught by various regexes + preprocessing.
I don't care about education/working experience at this point.
Reasoning:
In order to build a fulltext index on resumes, all sensitive information should be stripped out of them.
P.S. Any 3rd-party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi-structured sources. http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume's content into it, or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, information relating to proper names could probably best be found by searching for items that are grammatically important or significant: English capitalizes only the first word of a sentence and proper nouns. So for the grammatical rules, you could look for all of the words that have a capitalized first letter and check them against a database that contains each word and its type [e.g. Bob - Name, Elon - Place, England - Place].
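A toy version of that capitalization-plus-lookup heuristic (the gazetteer contents and the sentence splitting are simplified illustrations):

```python
import re

# Hypothetical sketch: flag capitalized words mid-sentence and look them up
# in a small gazetteer. The GAZETTEER contents are illustrative.
GAZETTEER = {"bob": "Name", "alice": "Name", "england": "Place"}

def flag_proper_nouns(text):
    hits = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = sentence.split()
        for i, word in enumerate(words):
            token = word.strip(".,;:!?\"'")
            # Skip the first word of each sentence: it is capitalized anyway.
            if i > 0 and token[:1].isupper():
                hits.append((token, GAZETTEER.get(token.lower(), "Unknown")))
    return hits

print(flag_proper_nouns("My name is Bob. I worked in England for two years."))
# [('Bob', 'Name'), ('England', 'Place')]
```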
Secondly, information that is formulaic: this is more about email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change. Use regexes, plus an algorithm to assess the quality of the matches.
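And a minimal sketch of the regex side (the patterns are deliberately simplified and will miss real-world variants; they only illustrate the approach):

```python
import re

# Hypothetical sketch: catch formulaic personal data with regexes.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "uk_postcode": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b"),
}

def redact(text):
    """Replace each match with a placeholder so the text can be indexed."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

cv = "Contact: jane.doe@example.com, +44 20 7946 0958, London SW1A 1AA."
print(redact(cv))
# Contact: [EMAIL], [PHONE], London [UK_POSTCODE].
```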
Watch out:
The grammatical rules change based on language; German capitalizes EVERY noun. It might be best to detect the language of the document before applying your rules. Another issue [with this, and with my resume sometimes] is the design: if the resume was produced with something other than a text editor [designer tools], the text may not line up, or may even be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

Searching a datastore for related topics by keyword

For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team@stackoverflow.com on Mar 20 that mentions how it works:
"the "ask a question" search is exclusively on title and will not match anything in the body. It is a mystery to me why people think it's better."
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do a Google search with the site:stackoverflow.com prefix than to rely on SO to provide relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, the keywords in your question are matched against the index, similar to a Google search. The open-source Lucene search engine could be (and with high probability has been) used for this. Since the results are not quite accurate, I presume they index just the titles of the questions, as an approximation.
The other relevant keyword is collaborative filtering, the algorithm popularized by Amazon to recommend products based on the behavior of other, similar customers. In the current case, an alternative algorithm based on collaborative filtering would be: extract keywords from the question, find the tags that have historically been associated with those keywords, and return questions that carry those tags. Experiments would be needed to see whether that works well at all.
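A bare-bones sketch of that index-matching idea over question titles, using TF-IDF-style weighting so that rare shared keywords count for more (the toy corpus is invented; Lucene does all of this, and better, out of the box):

```python
import math
import re
from collections import Counter

# Hypothetical sketch: rank stored question titles against a new title.
titles = [
    "How does the Google Did you mean algorithm work?",
    "Searching a datastore for related topics by keyword",
    "How to parse Japanese sentences in Java",
]

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Document frequency: words that appear in many titles are weighted down.
df = Counter(word for title in titles for word in set(tokens(title)))
N = len(titles)

def score(query, title):
    shared = set(tokens(query)) & set(tokens(title))
    return sum(math.log((N + 1) / (df[w] + 1)) for w in shared)

query = "How does StackOverflow find related questions by keyword?"
for title in sorted(titles, key=lambda t: score(query, t), reverse=True):
    print(round(score(query, title), 2), title)
# 2.08 Searching a datastore for related topics by keyword
# 0.98 How does the Google Did you mean algorithm work?
# 0.29 How to parse Japanese sentences in Java
```

Real systems would add stemming, stop-word removal, and length normalization on top of this.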

How to Google for --depend?

The latest makefiles we've received from a third-party vendor contain rules with --depend on the end of build rules, so I thought I would look it up on Google, but try as I might, I can't persuade it to display any pages containing exactly the characters --depend.
I've tried surrounding it with quotes ("--depend"); I've tried the Advanced Search; I've tried backslashes ("\-\-depend") in the (vain) hope that there is some sort of unpublished regular-expression search available.
Am I missing something blindingly obvious?
Please note that this is NOT a question about what --depend does, I know that, it's a question about how you Google for very precise, programmer oriented, text.
You can specify literal symbols in a Google Code Search but not in a Google Web Search.
Examples:
Google Code Search for +"--depend"
Google Web Search for +"--depend"
I had the same issue searching for 'syntax-rules'. You would think they would have solved this by now.
I remember having read somewhere that Google's web search does not index non-alphanumeric characters, treating them as word separators, so that's not possible.
The reason for this problem is that a minus sign at the start of a token indicates that you want to EXCLUDE it from the search.
This is how you can filter out really popular results that have nothing to do with what you want.
For example, try searching for "wow". Then try searching for "wow -warcraft".
