Exact phrase search using Lucene without increasing the number of fields

For a phrase search, we want to return results only if there is an exact match (without ignoring stopwords). For a non-phrase search, we are fine displaying results even if only the root form of a word matches.
We currently pass our data through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter. Because of this, when a user searches for "password management", the search brings up results containing "password manager".
If I remove the StemFilter, I will no longer be able to match the root form of a word for non-phrase queries. I was wondering whether I should index the same data in two fields of the document.
I have asked the same question at Different indexing and search strategies on same field without doubling index size?. However, folks at the office are not happy about indexing the same data in two fields (we currently have around 20 text fields in the Lucene document). Is there any way to support both of the cases listed above using TokenFilters?
Say, for the StopFilter, change it so that it emits both the input token and "?" (for an ignored word) with the same position increments. Similarly, the StemFilter would emit both the input token and the stemmed token with the same position increment. Basically, input and output tokens (even ignored ones) would share the same positions.
Is it safe to go ahead with this approach? Has anyone else faced the requirements listed here? Are there any Filters readily available which do something similar to what I mentioned in my approach?
Thanks

I don't understand what you mean by "input and output tokens." Are you storing the data twice - once as stemmed and once non-stemmed?
If you aren't storing it twice, I don't think your method will work. Suppose the stored word is jumping and they search for jumped. Your query parser can emit jump and jumped but it still won't match jumping unless you have a value stored as jump.
And if you're going to store the value once as stemmed and once as non-stemmed, then why not just store it in two fields? Then you won't have to deal with weird tokenizer changes.
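For what it's worth, recent Lucene versions ship a pair of filters that implement exactly the "both tokens at one position" idea from the question: KeywordRepeatFilter emits every token twice (one copy flagged with KeywordAttribute so stemmers skip it), and RemoveDuplicatesTokenFilter drops the extra copy when stemming changed nothing. A minimal sketch of an analyzer chain built on them (Lucene 5/6-era API assumed; the stopword half of the plan is left out, though note that StopFilter already leaves position gaps which phrase queries respect):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Indexes both the original and the stemmed form of every token at the
// same position, so phrase queries can match the exact surface form
// while ordinary queries still match on the stem.
public class DualFormAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // Emits every token twice; the copy marked with KeywordAttribute
        // is skipped by the stemmer, the other copy gets stemmed. Both
        // end up at the same position (positionIncrement 0 on the second).
        stream = new KeywordRepeatFilter(stream);
        stream = new PorterStemFilter(stream);
        // If stemming did not change the token, drop the duplicate.
        stream = new RemoveDuplicatesTokenFilter(stream);
        return new TokenStreamComponents(source, stream);
    }
}

A phrase query against such a field matches the exact surface form, while an ordinary term query still matches the root form - at the cost of roughly one extra term per stem-changed word, rather than a whole duplicate field.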

Related

Elasticsearch show what is matched from a query

I'm implementing a sort of "natural language" search assistant. I have a form with a number of select fields. The list of options in each field can be pretty lengthy. So rather than having to select each item individually, I'm adding a text input box where people can just type what they're looking for and the app will suggest possible searches, based on the options in the select dropdowns.
Let's say my options are:
Color: red, blue, black, yellow, green
Size: very small, kinda medium, super large
Shape: round, square, oblong, cylindrical
Year: 2007, 2008, 2009, 2010
If you typed in "2007 very small star-spangled", the text input would suggest "Search all 2007 very small widgets for 'star-spangled'". It understood that "2007" and "very small" were select options in the form, and that "star-spangled" was not, and suggested a search where "2007" and "very small" are selected, and then left the "star-spangled" bit for a plaintext search.
What I'm working on right now is parsing the search query and picking out the bits that fit into the select fields. I have all the options in Elasticsearch. I was thinking of searching each type individually to see if it matches anything in the search query. That seems straightforward to me. I can easily find matches. However, I don't know which part of the query actually matches each type, which I need in order to find out that e.g. "star-spangled" is the part that didn't match options.
So, in the end, I need to know that only the "2007" substring matched the year, only the "very small" substring matched the size, and "star-spangled" didn't match anything.
My first thought is to split the query into word-grams (e.g. "2007", "2007 very", "2007 very small", "2007 very small star-spangled", "very", "very small", "very small star-spangled", "small", "small star-spangled", "star-spangled") and search each option for each gram. Then I would know for sure which gram matched. However, this could obviously get resource intensive pretty quickly. Also, I know Elasticsearch can do that sort of search internally much faster.
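For illustration, generating the grams themselves is cheap - it is the one-search-per-gram step that hurts. A sketch of the expansion in plain Java (a hypothetical helper, not part of any Elasticsearch API):

import java.util.ArrayList;
import java.util.List;

public class WordGrams {
    // "2007 very small" -> [2007, 2007 very, 2007 very small, very,
    // very small, small]: every contiguous run of words in the query.
    public static List<String> wordGrams(String query) {
        String[] words = query.trim().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) gram.append(' ');
                gram.append(words[end]);
                grams.add(gram.toString());
            }
        }
        return grams;
    }
}

A query of n words produces n(n+1)/2 grams, so a ten-word query already means 55 lookups per option type - which is why this approach gets expensive quickly.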
So what I really need is to be able to perform a search and, along with the results, get back which part of the original query actually matched. So if I searched, "2007 verr small" (intentional misspelling) and did a fuzzy search of sizes, passing the entire query string, and I get the "Very Small" size back as a result, it would indicate that "verr small" is the part of the query that matched that size.
Any idea of how to do that? Or possibly some other solutions?
I could do the search and parse the results to see which bits match the string. Though I could see that being resource intensive as well. And if I'm doing a fuzzy search, it wouldn't necessarily be clear which part of the query triggered a match in the result.
I was also thinking that highlighting might work for this, but I don't know enough about Elasticsearch to know for sure.
EDIT: I tested this out using highlighting. It's so close to working. The highlight field comes back with the part of the string that matches. However, it only shows the part of the result that matches. It doesn't show the part of the query that matches. So if I want to allow for fuzzy searches, the highlight field won't match the original query and I won't be able to tell which part of the query matched. For example, a query of "very smaal" will return the size "Very Small", but the highlight field will show <em>very</em> <em>small</em>, not <em>very</em> <em>smaal</em>.
There are two types of queries in Elasticsearch: the match query and the filtered query. A match query matches your terms in the documents and finds all the relevant documents with a relevance score. For example, when you search for the term "help fixing javascript problem", you are interested in all documents which contain one or more of the search terms.
On the other hand, when you are using a filtered query, a document either matches or it does not... there is no relevance score here... As an example, say you want all the products built in the year "2007" - here you need to use a filtered query. All the products built in 2007 have the same score, and all other years are excluded from the result.
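To make the distinction concrete, a rough sketch using the Elasticsearch Java API's QueryBuilders (2.x-style bool/filter syntax assumed; the field names are invented):

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class QuerySketch {
    // Scored full-text match: any document containing one or more of
    // the terms, ranked by relevance.
    QueryBuilder match =
            QueryBuilders.matchQuery("description", "help fixing javascript problem");

    // Filter: a yes/no decision with no scoring - e.g. everything
    // built in 2007, everything else excluded.
    QueryBuilder filtered =
            QueryBuilders.boolQuery()
                    .filter(QueryBuilders.termQuery("year", "2007"));
}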
In my opinion, your problem should be dealt with using a filter query...
When using filter queries, normally each filter has its own corresponding input in the UI - think of the filter sidebar on eBay, for example.
If I have understood your requirement correctly, you want to include all those filters in a single search-box. In my opinion, this is nearly impossible to implement because you have no way to parse user input and decide which word corresponds to which filter...
If you want to go down the filter path, it's better to introduce corresponding UI fields for each filter...
If you want to stick to a single search box, then don't implement the filter functionality and stick to the Elasticsearch multi-match query... you can match the input term across multiple fields, but you won't be able to filter out (exclude) results; instead you get a relevance score.
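A sketch of that multi-match fallback (again with invented field names): the raw input is matched across all the option fields at once, and results come back scored rather than filtered.

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class MultiMatchSketch {
    // One scored query across several fields; nothing is excluded
    // outright, documents just rank higher the more terms they match.
    QueryBuilder query = QueryBuilders.multiMatchQuery(
            "2007 very small star-spangled",
            "color", "size", "shape", "year", "description");
}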

analyzed field vs doc_values: true field

We have an Elasticsearch index containing over half a billion documents, each of which has a url field that stores a URL.
The url field mapping currently has the settings:
{
  "index": "not_analyzed",
  "doc_values": true,
  ...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having an analysed field, or having a variable-length list field with doc_values on (i.e. option 1 or option 2)? Or is there an option 3 that would be even better?!
Thanks for your help!
Your question is about a field where you need doc_values but cannot index with the keyword analyzer.
You did not mention why you need doc_values. But you did mention that you currently do not search in this field.
So I guess the name of the search field does not have to be the same: you can copy the field value into another field which exists only for search ("store": false). For this new field you can use the pattern analyzer or pattern tokenizer for your use case.
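In Lucene terms (Elasticsearch's pattern tokenizer wraps Lucene's PatternTokenizer), the search-only field could be analysed by simply splitting on slashes - a sketch, where the choice of separator is an assumption based on the example URL:

import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;

// Splits /part1/user#site/part2/part3.ext into part1, user#site,
// part2, part3.ext - note "#" stays inside its token, unlike with
// the standard tokenizer.
public class UrlPartAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // group = -1 means the pattern is used as a splitter, and
        // empty tokens (e.g. before the leading slash) are dropped.
        Tokenizer source = new PatternTokenizer(Pattern.compile("/"), -1);
        return new TokenStreamComponents(source);
    }
}

Multi-segment searches such as part2/part3.ext would then be phrase queries against this field, so the segments must appear adjacent and in order.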
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was set up as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of Gatling tests to run against the indices and compared the Gatling results and the netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netdata landscape: the analysed field showed a spike - although only a small one - on all Elastic nodes. The not_analyzed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, no longer fit in the amount of memory you can allocate to it.

ElasticSearch: A way to know which term hit in which field?

There are use cases where I really would like to know which term was matched in which field by my search. With this information I would like to disclose to the user on my webpage which field caused the hit. I would also like to know the term that played a part in the hit. In my case it is a database identifier, so I would take the matched term - an ID - fetch the respective database record and display useful information to the user.
I currently know two ways: Highlighting and the explain API. However, the first requires stored values which seems unnecessary. The second is meant for debugging only and is rather expensive so I wouldn't want it to run with every query.
I don't know of another way, which is confusing: the highlighting algorithms need the information I want anyway, so can't I just get it somehow?
On a related note, I would also be interested in the opposite case: Which term did not hit at all? This information would allow for features like "terms that didn't match your query" like Google does sometimes (where the respective words are shown in grey-strikeout).
Thanks for hints!

Sphinx reverse search - when new item is added, execute searches on existing stored keywords

I have an app where people can list stuff to sell/swap/give away, with 200-character descriptions. Let's call them sellers.
Other users can search for things - let's call them buyers.
I have a system set up using Django, MySQL and Sphinx for text search.
Let's say a buyer is looking for "t-shirts". They don't get any results they want. I want the app to give the buyer the option to check a box to say "Tell me if something comes up".
Then when a seller lists a "Quicksilver t-shirt", this would trigger a sort of reverse search on all saved searches to notify those buyers that a new item matching their query has been listed.
Obviously I could trigger Sphinx searches on every saved search every time any new item is listed (in a loop) to look for matches - but this would be insane and intensive. This is the effect I want to achieve in a sane way - how can I do it?
You literally build a reverse index!
Store the 'searches' in the database, and build an index on them.
So 't-shirts' would be a document in this index.
Then when a new product is submitted, you run a query against this index. Use 'quorum' syntax - or even match-any - to get saved searches that match on even a single keyword.
So in your example, the query would be "Quicksilver t-shirt"/1 which means match Quicksilver OR t-shirt. But the same holds with much longer titles, or even the whole description.
The result of that query would be a list of (single-word*) original searches that matched. Note this also assumes you have your index set up to treat - as a word character.
*Note it's slightly more complicated if you allow more complex queries - multiple keywords, negations, OR brackets, phrases etc. In that case the reverse search just gives you POTENTIAL matches, so you need to confirm that each candidate still matches. That is still a number of queries, but you don't need to run one for every saved search.
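Sphinx's searchd speaks the MySQL wire protocol, so the reverse query can be issued from any MySQL client - from Django you would go through its MySQL connector; the Java/JDBC version below is just for illustration, and the index name saved_searches, port 9306 and column names are all assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReverseSearch {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                     DriverManager.getConnection("jdbc:mysql://127.0.0.1:9306");
             Statement stmt = conn.createStatement();
             // "/1" is the quorum operator: any saved search sharing at
             // least one keyword with the new listing's title matches.
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, query FROM saved_searches "
                             + "WHERE MATCH('\"Quicksilver t-shirt\"/1')")) {
            while (rs.next()) {
                // Each hit is only a POTENTIAL match: re-run the stored
                // search forwards against the new item before notifying.
                System.out.println(rs.getLong("id") + ": " + rs.getString("query"));
            }
        }
    }
}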
btw, I think the technical term for these 'reverse' searches is Prospective Search
http://en.wikipedia.org/wiki/Prospective_search

Solr: How to search for a full match on a text field? Is there a hidden equal() operator?

It is very simple to describe:
q=mydynamicfield_txt:"video"
I want hits only when mydynamicfield is exactly "video".
The other way round: how do I suppress hits where "video" is only part of the field value (like "home video")?
Is this supported in Solr 3.1 out of the box, or do I have to add my own special brackets like "SOLRSTARTSOLR video SOLRENDSOLR" to my index, and later search for my term between "START" and "END"? A kind of manual regex anchoring.
This is a pain, because it needs special handling in the index/GUI and breaks highlighting.
What is the way to go?
regards
Peter
(=PA=)
One solution is to create an untokenized (keyword-analyzed) field and search within it - all your text will be a single distinct token in the Solr index.
Another solution is to write a filter which reads the token count from the index and compares it to the number of query tokens, i.e. filter out entities where doc_tokens > query_tokens, assuming that all query tokens matched.
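A sketch of the first suggestion in plain Lucene terms (Solr's KeywordTokenizerFactory behaves like KeywordAnalyzer here): the whole field value becomes one token, so only a query for the exact full value hits. Field and index names are invented, and there is no lowercasing - add it on both sides if you need case-insensitive matching.

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ExactMatchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        // KeywordAnalyzer indexes the entire field value as a single
        // token, so "home video" and "video" are unrelated terms.
        try (IndexWriter writer = new IndexWriter(
                dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("mydynamicfield_s", "home video", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Only a document whose whole value is "video" would match.
            int hits = searcher.search(
                    new TermQuery(new Term("mydynamicfield_s", "video")), 1)
                    .scoreDocs.length;
            System.out.println(hits); // 0 - "home video" is not an exact match
        }
    }
}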