When you search on Stack Overflow, in most cases a search snippet (the first 40 words or so of a post / question) is shown. In some cases, more text is shown and this text includes the search terms. Both blocks of text end with an ellipsis.
If you look at the meta tag "description" or "og:description", a similar text is included, thus allowing Google to index correctly.
My questions:
What search engine is Stack Overflow using (Elasticsearch / Lucene)?
How and when is the search snippet determined (in real time during the search, or when saving a post / question)?
How and when is the meta description determined?
I ask these questions because I want to avoid coding my own algorithm to determine the first 40 words or so of an HTML article (in our case a blog post).
thx
Marc
Stack Overflow uses Elasticsearch.
Elasticsearch has a highlighting feature that takes care of these things.
The snippet is determined at search time, in order to find the snippet that is most likely to be relevant to the user's query.
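As a rough illustration of search-time highlighting, here is a minimal Python sketch that only builds an Elasticsearch request body. The field names "title" and "body" and the helper name build_snippet_query are assumptions made up for this example, not Stack Overflow's actual mapping or code. The highlight options used (fragment_size, number_of_fragments, no_match_size) are standard Elasticsearch highlighter settings; no_match_size asks for the start of the field when nothing in it matched, which would explain the "first 40 words or so" case from the question.

```python
import json


def build_snippet_query(user_query, snippet_chars=200):
    """Build a search request body that asks Elasticsearch for one snippet.

    The "title" and "body" field names are hypothetical.
    """
    return {
        "query": {
            "multi_match": {"query": user_query, "fields": ["title", "body"]}
        },
        "highlight": {
            "fields": {
                "body": {
                    # Approximate snippet length in characters.
                    "fragment_size": snippet_chars,
                    # Return a single best fragment.
                    "number_of_fragments": 1,
                    # If the body has no match (e.g. only the title matched),
                    # return the beginning of the body instead.
                    "no_match_size": snippet_chars,
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_snippet_query("search snippet"), indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request; each hit then carries a highlight section with the chosen fragment.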
Related
We have social collaboration enabled on our FileNet system. I can add comment, tag, like and track how many times a document has been downloaded. These features are nice. When I tag a document, I can search documents by the tag text.
Ex: If I tag a document as, say, "test", I can use a search template to search for the document by its tag value, i.e. test.
When I comment, I can't search for the document based on the comment text.
Say I added the comment "good doc". I can't search for it by that text. Instead I need to provide an integer value like 1, and the search then works like "get all documents whose number of comments = 1". I don't want this behavior; instead I should be able to search on the comment text.
Can anybody help with this?
One way to achieve this would be to use CBR on the property. See how to enable CBR on a property
The property will then be full-text searchable using the CONTAINS statement, see doc.
Optionally (but I'm not sure, as I've never personally used it), the satisfies operator might be exactly what you're looking for, according to the documentation.
I am new to Elasticsearch and I am trying to find a way to convert Greeklish characters to Greek when the search executes.
e.g. the word "papoutsia" should be searched as "παπουτσια" (shoes)
While searching I found the following plugins:
elasticsearch-analysis-greeklish
elasticsearch-skroutz-greekstemmer
I applied the filters to my index as in the example, but my queries still hit nothing.
Do I have to apply the filter in some way in every query, or write a special one?
Sorry if this question has a very large/broad answer.
I have been trying to figure out how the whole filtering thing works for a couple of days, to understand whether I am even heading in the right direction or have to find another way to solve this.
Unfortunately, the intention of the greeklish plugin / char filter is the inverse of what you want to achieve:
Using this filter, you can retrieve greek text from a document, using a query that is written in latin characters ("greeklish").
So, for your example, you can add a document with the text παπούτσια and retrieve it using the terms papoutsia, papoutsi, etc.
We have prepared a detailed text pipeline example in the repo's wiki for future reference.
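To make the direction of the filter concrete, here is a minimal Python sketch. It is not the plugin's actual algorithm: the one-to-one letter table, the accent stripping, and the function names are simplifying assumptions made up for this illustration. The idea it shows is that Latin-script tokens are generated from the Greek text at index time, so a query typed in Latin characters can match without any conversion at query time.

```python
import unicodedata

# Simplified, hypothetical transliteration table (one Latin form per letter).
GREEK_TO_LATIN = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "f", "χ": "ch", "ψ": "ps",
    "ω": "o",
}


def strip_accents(text):
    """Remove combining marks such as the Greek tonos."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_greeklish(word):
    """Transliterate one Greek word into a single Latin-script form."""
    plain = strip_accents(word.lower())
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in plain)


def build_greeklish_index(docs):
    """Index time: map each transliterated token to the ids of its documents."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(to_greeklish(word), set()).add(doc_id)
    return index


def search(index, latin_query):
    """Query time: the Latin-script query is looked up as typed."""
    return index.get(latin_query.lower(), set())
```

With documents {1: "παπούτσια δερμάτινα"}, the token "papoutsia" is stored at index time, so search(index, "papoutsia") finds document 1. A real filter would emit several spelling variants per word rather than one.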
The comment in the following post is particularly helpful in understanding part of the algorithm.
How does Chrome update URL bar completions?
Yet questions remain. I did some experiments in Chrome:
When I input "eddit", it only suggests "reddit" as a general Google search, while if I input "reddit" in full, the reddit URL from my history pops up.
If I input a substring of "facebook", "google" or "youtube" (say "ceb", "ogl", "utu"), then the URLs pop up successfully. Hence tries cannot be the (only) data structure used here.
Furthermore, I know Chrome uses SQLite's FTS for full-text search (SQLite attributes FTS3/4 to Google). So I guess that Chrome uses an inverted index of URLs in SQLite.
My question is:
How does Chrome manage to autocomplete "utu" -> "youtube" (based on my local history URLs)?
I know a suffix array/tree can match substrings efficiently. But finding the particular word "youtube" will be linear.
I guess a tailored tokenizer (for FTS3/4) may achieve this, say "google" -> {"g", ..., "gle", ...}. But too many tokens would be generated.
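The tokenizer idea can be sketched with a fixed-size character n-gram index, which bounds the number of tokens per URL to roughly its length. This is only an illustration of that guess in Python, not Chrome's actual implementation; the class name NgramIndex and the choice of trigrams are assumptions.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of all character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramIndex:
    """Inverted index from character n-grams to URL ids (hypothetical sketch)."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)
        self.urls = []

    def add(self, url):
        url_id = len(self.urls)
        self.urls.append(url)
        for gram in ngrams(url.lower(), self.n):
            self.postings[gram].add(url_id)

    def search(self, query):
        query = query.lower()
        if len(query) < self.n:
            # Too short for the index: fall back to a linear scan.
            return [u for u in self.urls if query in u.lower()]
        candidates = None
        for gram in ngrams(query, self.n):
            ids = self.postings.get(gram, set())
            candidates = ids if candidates is None else candidates & ids
        # Sharing all n-grams does not guarantee a contiguous match, so verify.
        return [self.urls[i] for i in sorted(candidates)
                if query in self.urls[i].lower()]
```

For example, after adding "https://www.youtube.com/" and "https://www.google.com/", search("utu") returns only the YouTube URL, because "utu" is one of its trigrams. Only the candidate URLs are scanned for verification, not the whole history.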
For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team#stackoverflow.com on Mar 20 that mentions how it works:
the "ask a question" search is
exclusively on title and will not
match anything in the body. It is a
mystery to me why people think it's
better.
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do a Google search with a site:stackoverflow.com prefix than to rely on SO to provide the relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, all keywords in your question are matched against the index. This is similar to Google Search. Lucene open source search can be (and with high probability has been) used for this. Since the results are not quite accurate, I presume they index just the headlines of the questions, as an approximation.
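The keyword-matching idea described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Stack Overflow or Lucene actually works: the stopword list, the function names, and the scoring (a plain count of shared keywords between titles) are all made up for the example.

```python
import re
from collections import Counter, defaultdict

# Tiny, hypothetical stopword list for the example.
STOPWORDS = {"a", "an", "the", "is", "in", "to", "of", "how", "does",
             "what", "with", "do", "i"}


def keywords(text):
    """Lowercase the text and keep the words that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def build_title_index(titles):
    """Inverted index: keyword -> ids of questions whose title contains it."""
    index = defaultdict(set)
    for qid, title in titles.items():
        for word in keywords(title):
            index[word].add(qid)
    return index


def similar_questions(index, new_title, limit=5):
    """Rank existing questions by the number of keywords shared with new_title."""
    scores = Counter()
    for word in keywords(new_title):
        for qid in index.get(word, ()):
            scores[qid] += 1
    return [qid for qid, _ in scores.most_common(limit)]
```

A production engine would also stem words and weight rare terms more heavily; the sketch only shows the index lookup itself.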
The other related keyword is collaborative filtering, the algorithm popularized by Amazon to recommend products based on behavior of other similar customers. In the current case, an alternative algorithm based on collaborative filtering is: keywords are extracted from the question, then tags associated (in the history) with the keywords are found. Questions which have those tags are returned. Well, experiments are needed to see whether it works well at all.
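The tag-based variant described above (keywords -> tags seen with those keywords in the history -> questions carrying those tags) might look like the following Python sketch. It is untested as a recommendation strategy, as the answer itself notes, and every name in it is hypothetical.

```python
import re
from collections import Counter


def keywords(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def tags_for_keywords(history, text, top=2):
    """Find the tags most often attached to past questions sharing a keyword.

    history is a list of (title, tags) pairs.
    """
    words = keywords(text)
    tag_counts = Counter()
    for title, tags in history:
        if words & keywords(title):
            tag_counts.update(tags)
    return [tag for tag, _ in tag_counts.most_common(top)]


def questions_with_tags(history, tags):
    """Return the titles of past questions that carry at least one of the tags."""
    wanted = set(tags)
    return [title for title, q_tags in history if wanted & set(q_tags)]
```

Note that this can surface questions that share no words with the new one, as long as they share a tag, which is the point of going through tags instead of matching keywords directly.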
I've been using Usenet searches since about 1995 to get programming information, mostly for Microsoft APIs: first via Dejanews, and now via Google Groups, which bought out Dejanews. Over the last few years I've noticed a steady decline in the quantity of Usenet search results from Google, and today I find I'm completely unable to get a working Usenet search on their advanced group search page. I'm used to searching on "microsoft.*", sometimes supplemented with "microsoft" or "microsoft*". Just try to find a post from the 1996-1998 time period on "database" in either the comp.* or microsoft.* hierarchies, and if you can do it, please show your search expression. There should be thousands of results.
http://groups.google.com/groups/search?safe=off&q=database+group%3Amicrosoft*&btnG=Rechercher&as_mind=1&as_minm=1&as_miny=1996&as_maxd=1&as_maxm=1&as_maxy=1999&as_drrb=b&sitesearch=
seems to work nicely... 994 results (not thousands, but still...)
It appears to be a problem with the advanced search form. I can't get the one at
http://groups.google.fr/advanced_search?hl=fr&q=&hl=fr&
to work either. But I can use the basic form with "database group:microsoft*" and I get many results as expected.
http://www.google.ca/groups/search?safe=off&q=database+group%3Acomp.*&btnG=Search&sitesearch=
returns 3,000 results
The advanced search isn't working for me either:
Broken advanced search results URL
However, removing lr=selected from the query string in that URL makes it work, for some reason:
Working advanced search results URL
In fact, hitting the search button again on the broken advanced search results page will return those results as well for me.
Or actually, it's only partly working, since entering multiple comma-separated groups in the advanced search form (or using the group: search operator) doesn't quite work as expected and ends up adding all the words in the additional group names as search keywords too.
You could try learning Julian dates and use the daterange search operator:
Search results using daterange:
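For the Julian date suggestion, a small Python helper can compute the numbers. This is a sketch assuming the operator takes whole Julian day numbers in the form daterange:START-END; the function names are made up, and the query terms are simply the ones used earlier in this thread.

```python
from datetime import date

# Offset between Python's proleptic Gregorian ordinal and the Julian day number.
JDN_OFFSET = 1721425


def julian_day_number(d):
    """Return the Julian day number of a calendar date."""
    return d.toordinal() + JDN_OFFSET


def daterange_query(terms, start, end):
    """Append a daterange: operator covering start..end to the search terms."""
    return "{} daterange:{}-{}".format(
        terms, julian_day_number(start), julian_day_number(end))


if __name__ == "__main__":
    print(daterange_query("database group:microsoft*",
                          date(1996, 1, 1), date(1999, 1, 1)))
```

January 1, 2000 is Julian day 2451545, which is a convenient value for checking the offset.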