Searching for a term as both a single string and a multi-word string - elasticsearch

I'm setting up my elastic instance in a schema-less manner (no up-front mappings) and the application requires that users be able to search against a field containing a word that may or may not be tokenized into multiple strings. For example, the field may contain the word "ONETWO". The spec requires that a user be able to search for "ONETWO", "ONE", or "TWO" and retrieve that same document. There doesn't seem to be any easy way to accomplish this even with a custom tokenizer (and I don't think there SHOULD be an easy way to do this -- or any way at all). Just want to confirm my thoughts.

It's quite easy to meet your requirement with a custom analyzer that uses the n-gram tokenizer. You can also pass the output through a lowercase token filter, so that even though your text is "ONETWO", a user searching for one, One, or ONE still gets a result. Note that for this to work you need to apply a different analyzer at search time; read more about that at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html.
Refer to https://devticks.com/how-to-improve-your-full-text-search-in-elasticsearch-with-ngram-tokenizer-e346f29f8ddb for more information, and let me know if you need anything else.
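For illustration, here is a minimal sketch of that setup, assuming Elasticsearch 7+ reachable at localhost:9200. The index name ngram-demo, the field name label, and the gram sizes are made-up choices, and the field does need a mapping (or a dynamic template) so the custom analyzer is actually applied:

import requests

ES = "http://localhost:9200"  # assumed local cluster

body = {
    "settings": {
        "index": {"max_ngram_diff": 8},  # allow a wide min/max gram spread
        "analysis": {
            "tokenizer": {
                "grams_3_8": {"type": "ngram", "min_gram": 3, "max_gram": 8,
                              "token_chars": ["letter", "digit"]}
            },
            "analyzer": {
                # index time: break "ONETWO" into lowercased grams such as one, two, onetwo
                "ngram_index": {"type": "custom", "tokenizer": "grams_3_8",
                                "filter": ["lowercase"]},
                # search time: keep the query terms whole, just lowercased
                "plain_search": {"type": "custom", "tokenizer": "standard",
                                 "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "label": {"type": "text", "analyzer": "ngram_index",
                      "search_analyzer": "plain_search"}
        }
    }
}

requests.put(f"{ES}/ngram-demo", json=body).raise_for_status()
requests.post(f"{ES}/ngram-demo/_doc", json={"label": "ONETWO"}, params={"refresh": "true"})

# Any of these should now return the same document.
for term in ("ONETWO", "ONE", "two"):
    hits = requests.post(f"{ES}/ngram-demo/_search",
                         json={"query": {"match": {"label": term}}}).json()
    print(term, hits["hits"]["total"])

The key point is the separate search_analyzer: documents are shredded into grams at index time, while query terms stay whole.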

Related

How to index mixed language contents on Elasticsearch?

How do we index mixed-language content in Elasticsearch? Let's say we have a system where people submit content from various parts of the world: the US, Canada, Europe, Japan, Korea, India, China, Kenya, the Arab world, Russia, and everywhere else.
The content can be in any language, which we can't know beforehand, and it can even mix languages. We don't want to guess the language of the content and create multiple language-specific indices for each input language; we believe this is unmanageable.
We need an easy solution to index this content efficiently in Elasticsearch, with full-text search as well as fuzzy string matching. Can anyone help in this regard?
What is the target you want to achieve? Do you want to have hits only in the language used at query time? Or would you also accept hits in any other language?
One approach would be to run all of elasticsearch's different language analyzers on the input and store the result in separate fields, for instance suffixed by the language of the current analyzer.
Then, at query time, you would have to search in all of these fields if you have no method to guess the most relevant ones.
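A minimal sketch of this multi-field idea follows; the index name, field name, and the particular set of language analyzers are assumptions (the analyzers used are built into Elasticsearch):

import requests

ES = "http://localhost:9200"  # assumed local cluster
languages = ["english", "french", "german", "russian", "cjk"]  # built-in analyzers you expect to need

body = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",  # base field keeps the standard analyzer
                "fields": {lang: {"type": "text", "analyzer": lang} for lang in languages}
            }
        }
    }
}
requests.put(f"{ES}/multilang-demo", json=body).raise_for_status()

# Query time: search the base field plus every per-language sub-field.
query = {
    "query": {
        "multi_match": {
            "query": "some user input",
            "fields": ["content"] + [f"content.{lang}" for lang in languages],
            "fuzziness": "AUTO"  # rough stand-in for the fuzzy-search requirement
        }
    }
}
print(requests.post(f"{ES}/multilang-demo/_search", json=query).json())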
However, this is likely to explode since you create a multitude of unused duplicates. This is IMHO also less elegant than having separate indices.
I would strongly recommend evaluating whether you really won't know the number of languages you will see in production. Having a distinct index per language would give you much more control over the input/output and let you fine-tune your engine for the actual use case.
Alternatively, you may start with a simple whitespace tokenizer and evaluate the quality of the search results (per use case).
You will not have language specific stemming but at least token streams for most languages.

autocomplete and search in Elasticsearch

Is there any way to search on two incomplete words in the same field using Elasticsearch in Rails? I mean a situation where I could successfully find the phrase "victorian buildings" by typing only the beginnings of the words into the search input, for example "vict bui" (ideally with fuzziness as well).
Partial matching (word_start, text_start, etc., available in Searchkick) doesn't work in this project. I've also tried using wildcard queries, but that failed too. Maybe writing some custom mappings/settings would be a good idea?
Can I ask you for any suggestions on what to search/read to do this task?
Try this example:
"%#{params[:place]}%"
Since % is a wildcard, doing a LIKE on '%%' matches everything, and you get all the records in the result.

How to configure standard tokenizer in elasticsearch

I have a multi-language data set and a standard analyzer that handles tokenizing this data set very nicely. The only bad part is that it removes special characters like #, :, etc. Is there any way I can use the standard tokenizer and still be able to search on the special characters?
I have already looked into the combo analyzer plugin, which did not work as I had hoped. Apparently combined analyzers do not work in a chain the way token filters do; they work independently, which is not useful for me.
I also looked into the char mapping filter, in order to process the data before tokenizing it, but it does not work like the word delimiter token filter, where we can specify a "type_table" to treat a special character as ALPHANUM. It just maps one string to another, so I still wouldn't be able to search on the special characters.
I have also looked into the pattern analyzers, which would work for the special characters, but they are not recommended for a multi-language data set.
Can anybody point me in the right direction in order to solve this problem?
Thanks in advance!
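For reference, the type_table mechanism mentioned above looks roughly like this. This is only a sketch of that one mechanism; the index name, filter name, and the choice of a whitespace tokenizer are assumptions, and it does not address the multi-language aspect (whitespace tokenization is simply what lets the special characters survive to the filter at all):

import requests

ES = "http://localhost:9200"  # assumed local cluster

body = {
    "settings": {
        "analysis": {
            "filter": {
                "keep_specials": {
                    "type": "word_delimiter",
                    # treat these characters as token characters instead of delimiters;
                    # add more entries as needed
                    "type_table": ["# => ALPHANUM", "@ => ALPHANUM"]
                }
            },
            "analyzer": {
                "specials_aware": {
                    "type": "custom",
                    "tokenizer": "whitespace",  # unlike standard, keeps # and @ in the tokens
                    "filter": ["keep_specials", "lowercase"]
                }
            }
        }
    }
}
requests.put(f"{ES}/specials-demo", json=body).raise_for_status()

# Inspect what the analyzer produces:
print(requests.post(f"{ES}/specials-demo/_analyze",
                    json={"analyzer": "specials_aware",
                          "text": "search for c# and @mentions"}).json())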

ElasticSearch search for partial alphanumeric values

I have a string field with values like PA2456U or PA23U-RB, and I would like to do a partial match, so that I can search for PA24 and get the first result, or search for PA23U-RB and find the second result (that would be a full match).
I tried using ngram, but it ignores the numeric values, so if I enter pa111 it returns anything that starts with pa.
See this gist for an example.
This may be a separate question, or related, but searching for 12345001 should also match 12345-001
Thanks
Update
The final analyzer I used is here: https://gist.github.com/3803180
Making ngrams looks like a good choice based on your requirements, but I think edge_ngrams should be enough. This way your index would grow a little more slowly, since you'd be indexing fewer terms. The real problem is that you shouldn't apply the same analyzer to the query as well; otherwise querying for pa111 would mean querying for all the ngrams that can be made out of it, which would lead to a lot more matches than you'd expect.
You just need to change your search_analyzer to an analyzer which doesn't produce ngrams. You can use the same one you already have and remove the ngram token filter (only for the search_analyzer; the index_analyzer is fine).
Regarding the dash question, have a look at the word delimiter token filter. You need to configure it to make it work as you expect. I guess the generate_number_parts=false, generate_word_parts=false and split_on_numerics=false options should do what you want; that way the dash won't be indexed. You need to apply that token filter at both index time and query time.
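Here is a rough sketch of the edge_ngram part of that advice. The index name parts-demo, the field part_no, and the gram sizes are assumptions, and the word delimiter tuning for the dash is left out; note also that in current Elasticsearch the index-time analyzer is simply set as analyzer on the field, with search_analyzer overriding it at query time:

import requests

ES = "http://localhost:9200"  # assumed local cluster

body = {
    "settings": {
        "analysis": {
            "filter": {
                "prefix_grams": {"type": "edge_ngram", "min_gram": 2, "max_gram": 10}
            },
            "analyzer": {
                # index time: lowercase each token, then expand it into its prefixes
                "code_index": {"type": "custom", "tokenizer": "whitespace",
                               "filter": ["lowercase", "prefix_grams"]},
                # search time: same pipeline WITHOUT the ngram filter
                "code_search": {"type": "custom", "tokenizer": "whitespace",
                                "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "part_no": {"type": "text", "analyzer": "code_index",
                        "search_analyzer": "code_search"}
        }
    }
}
requests.put(f"{ES}/parts-demo", json=body).raise_for_status()
requests.post(f"{ES}/parts-demo/_doc", json={"part_no": "PA2456U"}, params={"refresh": "true"})

# "pa24" matches PA2456U, while "pa111" no longer matches everything starting with "pa".
for term in ("pa24", "pa111"):
    hits = requests.post(f"{ES}/parts-demo/_search",
                         json={"query": {"match": {"part_no": term}}}).json()
    print(term, hits["hits"]["total"])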

to_tsquery() validation

I'm currently developing a website that lets users search a PostgreSQL database. The search uses to_tsquery(), and I'm trying to find a way to validate the input before it's sent as a query.
Other than that, I'm also trying to add a phrasing capability, so that if someone searches for HELLO | "I LIKE CATS" it will only find results containing "hello" or the entire phrase "i like cats" (as opposed to I & LIKE & CATS, which will find articles that have all 3 words, regardless of where they appear).
Is there some reason why it's too expensive to let the DB server validate it? It does seem a bit excessive to duplicate the tsquery parsing algorithm in the client.
If the concern is that you don't want it to try running the whole query (which presumably will involve table access) each time it validates, you could use the input in a smaller query, just in pseudocode (which may look a bit like Python, but that's just coincidence):
is_valid_query(input):
    try:
        execute("SELECT to_tsquery($1)", input)
        return True
    except DatabaseError:
        return False
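A concrete version of that idea might look like this (a sketch assuming psycopg2 and an existing connection; a parse failure aborts the current transaction, hence the rollback):

import psycopg2

def is_valid_query(conn, user_input):
    """Return True if PostgreSQL accepts user_input as a to_tsquery() string."""
    try:
        with conn.cursor() as cur:
            # Parsing the string is enough to validate it; no table is touched.
            cur.execute("SELECT to_tsquery('simple', %s)", (user_input,))
        return True
    except psycopg2.Error:
        conn.rollback()  # clear the failed transaction so the connection stays usable
        return False

conn = psycopg2.connect("dbname=mydb")                     # connection details are assumptions
print(is_valid_query(conn, 'HELLO | (I & LIKE & CATS)'))   # True
print(is_valid_query(conn, 'HELLO |'))                     # False: dangling operator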
With regard to phrasing, it's probably easiest to search by the non-phrased query first (using indexes), then filter those for having the phrase. That could be done server side or client side. Depending on the language being parsed, it might be easiest to construct a simple regex of the phrase that deals with repeated whitespace or other ignorable symbols.
Search for to_tsquery('HELLO|(I&LIKE&CATS)'), getting back a list of documents which loosely match.
In the client, filter that to those matching the regex "HELLO|(I\s+LIKE\s+CATS)".
The downside is you do need some additional code for translating your query into the appropriate looser query, and then for translating it into a regex.
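Put together, the two-step approach could look roughly like this; the articles table, its id and body columns, and the 'simple' text search configuration are assumptions for the sketch:

import re
import psycopg2

def phrase_search(conn, phrase):
    words = phrase.split()
    # Step 1: loose full-text match on the AND-ed words (can use a GIN/GiST index).
    loose_query = " & ".join(words)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, body FROM articles "
            "WHERE to_tsvector('simple', body) @@ to_tsquery('simple', %s)",
            (loose_query,),
        )
        candidates = cur.fetchall()
    # Step 2: keep only the rows where the words actually appear as a contiguous phrase.
    pattern = re.compile(r"\s+".join(re.escape(w) for w in words), re.IGNORECASE)
    return [(row_id, body) for row_id, body in candidates if pattern.search(body)]

conn = psycopg2.connect("dbname=mydb")  # connection details are assumptions
print(phrase_search(conn, "I LIKE CATS"))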
Finally, there might be a technique in PostgreSQL to do proper phrase searching using the lexeme positions that are stored in tsvector values. I'm guessing that phrase searches are one of the intended uses, but I couldn't find an example of it in my cursory search. There's a section on it near the bottom of http://linuxgazette.net/164/sephton.html at least.
