I am trying to search names in Elasticsearch.
Consider a name such as kanal-kannan.
Normally we search a name with a wildcard like *na, and I tried to search like this -
"/index/party_details/_search?size=200&from=0&q=(first_name_v:kanal-*)"
This returns zero records.
Unless the hyphen character has been dealt with specifically by the analyzer, the two words in your example, kanal and kannan, will be indexed separately, because any non-alpha character is treated by default as a word delimiter.
Have a look at the documentation for Word Delimiter Token Filter and specifically at the type_table parameter.
Here's an example I used to ensure that an email field was correctly indexed:
ft.custom_delimiter = {
    "type": "word_delimiter",
    "split_on_numerics": false,
    "type_table": ["# => ALPHANUM", ". => ALPHANUM", "- => ALPHA", "_ => ALPHANUM"]
};
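Applied to the hyphenated-name case from the question, a similar filter can be wired into a custom analyzer in the index settings. The following is only a sketch: the index, analyzer, filter, and field names are made up, and it assumes a recent Elasticsearch version without mapping types:

PUT /party_details_index
{
    "settings": {
        "analysis": {
            "filter": {
                "name_delimiter": {
                    "type": "word_delimiter",
                    "type_table": ["- => ALPHA"]
                }
            },
            "analyzer": {
                "name_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "name_delimiter"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "first_name_v": { "type": "text", "analyzer": "name_analyzer" }
        }
    }
}

With the hyphen mapped to ALPHA, kanal-kannan stays a single token, so a wildcard search such as first_name_v:kanal\-* should be able to match it.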
- is a special character that needs to be escaped to be searched literally: \-
If you use the q parameter (that is, the query_string query), the rules of the Lucene Queryparser Syntax apply.
Depending on your analyzer chain, you might not have any - characters in your index; replacing them with a space in your query would work in that case too.
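Applied to the query from the question, the escaped form would look like this (whether it matches still depends on how first_name_v was analyzed):

/index/party_details/_search?size=200&from=0&q=(first_name_v:kanal\-*)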
@l4rd's answer should work properly (I have the same setup). Another option you have is to map the field with the keyword analyzer to prevent tokenizing at all. Note that the keyword tokenizer wouldn't lowercase anything, so use a custom analyzer with the keyword tokenizer (plus a lowercase filter) in this case.
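A minimal sketch of that approach; the index, analyzer, and field names are made up:

PUT /party_details_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "lowercase_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "first_name_v": { "type": "text", "analyzer": "lowercase_keyword" }
        }
    }
}

Because the whole value is kept as a single lowercased token, a wildcard such as first_name_v:kanal\-* can still match it.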
Related
Standard Analyzer removes special characters, but not all of them (e.g. '-'). I want to index my string with only alphanumeric characters, but still refer to the original document.
Example: 'doc-size type' should be indexed as 'docsize' and 'type' and both should point to the original document: 'doc-size type'
It depends what you mean by "special characters", and what other requirements you may have. But the following may give you what you need, or point you in the right direction.
The following examples all assume Lucene version 8.4.1.
Basic Example
Starting with the very specific example you gave, where doc-size type should be indexed as docsize and type, here is a custom analyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

import java.util.regex.Pattern;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Split on whitespace, then strip hyphens from each token:
        final Tokenizer source = new WhitespaceTokenizer();
        TokenStream tokenStream = source;
        Pattern p = Pattern.compile("\\-");
        boolean replaceAll = Boolean.TRUE;
        tokenStream = new PatternReplaceFilter(tokenStream, p, "", replaceAll);
        return new TokenStreamComponents(source, tokenStream);
    }
}
This splits on whitespace, and then removes hyphens, using a PatternReplaceFilter. It works as shown below (I use 「 and 」 as delimiters to show where whitespaces may be part of the inputs/outputs):
Input text:
「doc-size type」
Output tokens:
「docsize」
「type」
NOTE - this will remove all standard keyboard hyphens, but not related characters such as em dashes, en dashes, and so on. It will remove these standard hyphens regardless of where they appear in the text (word starts, word ends, on their own, etc.).
A Set of Punctuation Marks
You can change the pattern to cover more punctuation, as needed - for example:
Pattern p = Pattern.compile("[$^-]");
This does the following:
Input text:
「doc-size type $foo^bar」
Output tokens:
「docsize」
「type」
「foobar」
Everything Which is Not a Character or Digit
You can use the following to remove everything which is not a character or digit:
Pattern p = Pattern.compile("[^A-Za-z0-9]");
This does the following:
Input text:
「doc-size 123 %^&*{} type $foo^bar」
Output tokens:
「docsize」
「123」
「」
「type」
「foobar」
Note that this leaves one empty string among the resulting tokens, because the punctuation-only token %^&*{} is reduced to nothing.
WARNING: Whether the above will work for you depends very much on your specific, detailed requirements. For example, you may need to perform extra transformations to handle upper/lowercase differences - i.e. the usual things which typically need to be considered when indexing text.
Note on the Standard Analyzer
The StandardAnalyzer actually does remove hyphens in words (with some obscure exceptions). In your question you mentioned that it does not remove them. The standard analyzer uses the standard tokenizer. And the standard tokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified here. There's a section discussing how hyphens in words are handled.
So, the Standard analyzer will do this:
Input text:
「doc-size type」
Output tokens:
「doc」
「size」
「type」
That should work with searches for doc as well as doc-size - it's just a question of whether it works well enough for your needs.
I understand that may not be what you want. But if you can avoid needing to build a custom analyzer, life will probably be much simpler.
I need to grok a pipe-delimited string of values in a log line; for example:
|NAME=keith|DAY=wednesday|TIME=09:27:423227|DATE=08/06/2019|amount=68.23|currency=USD|etc...
What is the easiest way to do this?
Is there any form of a grok split?
Thanks,
Keith
Your scenario is the perfect use case for Logstash's kv (key-value) filter!
The basic idea behind this filter plugin is to extract key-value pairs in a repetitive pattern like yours.
In this case the field_split character would be the pipe ( | ).
To distinguish keys from values you would set the value_split character to the equal sign ( = ).
Here's a sample but untested filter configuration:
filter {
  kv {
    source => "your_field_name"
    target => "kv"
    field_split => "\|"
    value_split => "="
  }
}
Notice how the pipe character in the field_split setting is escaped. Since the pipe is a regex-recognized character, you have to escape it!
This filter will extract all key-value pairs it finds in your source field and put them under the target field named "kv" (the name is arbitrary), from which you can access the extracted fields.
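Given the sample line from the question, the resulting event would contain fields roughly like the following (a sketch, not actual Logstash output):

"kv": {
    "NAME": "keith",
    "DAY": "wednesday",
    "TIME": "09:27:423227",
    "DATE": "08/06/2019",
    "amount": "68.23",
    "currency": "USD"
}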
You might want to take a look at the other possible settings of the kv filter to satisfy your needs.
I hope I could help you! :-)
So in DB I have this entry:
Mark-Whalberg
When searching with term
Mark-Whalberg
I get no match.
Why? Is the minus sign a special character, as I understand it? Does it symbolize "exclude"?
The query is this:
{"query_string": {"query": 'Mark-Whalberg', "default_operator": "AND"}}
Searching everything else, like:
Mark
Whalberg
hlb
Mark Whalberg
returns a match.
Is this stored as two different pieces? How can I get a match when including the minus sign in the search term?
--------------EDIT--------------
This is the current query:
var fields = [
"field1",
"field2",
];
{"query_string":{"query": '*Mark-Whalberg*',"default_operator": "AND","fields": fields}};
You have an analyzer configuration issue.
Let me explain that. When you defined your index in ElasticSearch, you didn't indicate any analyzer for the field, which means the Standard Analyzer will apply.
According to the documentation:
Standard Analyzer
The standard analyzer is the default analyzer which is used if none is specified. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
Also, to answer your question:
Why? Is the minus sign a special character, as I understand it? Does it symbolize "exclude"?
For the Standard Analyzer, yes, it is. It doesn't mean "exclude", but it is a special character that will be deleted during analysis.
From the documentation:
Why doesn't the term query match my document?
[...] There are many ways to analyze text: the default standard analyzer drops most punctuation, breaks up text into individual words, and lower cases them. For instance, the standard analyzer would turn the string “Quick Brown Fox!” into the terms [quick, brown, fox]. [...]
Example:
If you have the following text:
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
Then the Standard Analyzer will produce:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
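The same thing happens to the value in the question: Mark-Whalberg is indexed as the two terms mark and whalberg, which is why the hyphenated search string behaves differently from what you expect. On recent Elasticsearch versions you can verify this with the _analyze API (the index name is just an example):

GET http://localhost:9200/your_index/_analyze
{
    "analyzer": "standard",
    "text": "Mark-Whalberg"
}

This should return the tokens mark and whalberg.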
If you don't want the field to be analyzed this way, you have two solutions:
You can use a match query.
You can ask ElasticSearch not to analyze the field when you create your index: here's how (see the sketch below).
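As a rough sketch of the second solution for older Elasticsearch versions (index, type, and field names are assumptions; recent versions would use the keyword field type instead of not_analyzed):

PUT http://localhost:9200/your_index
{
    "mappings": {
        "your_type": {
            "properties": {
                "name": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}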
I hope this will help you.
I was stuck on the same question, and the answer from @Mickael was perfect for understanding what is going on (I really recommend you read the linked documentation).
I solved this by defining an operator for the query:
GET http://localhost:9200/creative/_search
{
    "query": {
        "match": {
            "keyword_id": {
                "query": "fake-keyword-uuid-3",
                "operator": "AND"
            }
        }
    }
}
To better understand the algorithm this query uses, try adding "explain": true and analysing the results:
GET http://localhost:9200/creative/_search
{
    "explain": true,
    "query": // ...
}
I have a string field "title"(not analyzed) in elasticsearch. A document has title "Garfield 2: A Tail Of Two Kitties (2006)".
When I use the following json to query, no result returns.
{"query":{"term":{"title":"Garfield 2: A Tail Of Two Kitties (2006)"}}}
I tried to escape the colon character and the braces, like:
{"query":{"term":{"title":"Garfield 2\\: A Tail Of Two Kitties \\(2006\\)"}}}
Still not working.
The term query won't tokenize or apply analyzers to the search text. Instead, it looks for an exact match, which won't work here because string fields are analyzed/tokenized by default.
To give this a better explanation:
Let's say there is a string value "I am in summer:camp".
When indexing it, it is broken into tokens as below:
"I am in summer:camp" => [ I , am , in , summer , camp ]
Hence, even if you do a term search for "I am in summer:camp", it still won't work, as the token "I am in summer:camp" is not present in the index.
Something like a phrase query might work better here (see the sketch below).
Or you can set the field's "index" option to "not_analyzed" to make sure the string is not tokenized.
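A minimal sketch of the phrase approach, using the title from the question (whether it matches depends on how the title field is actually analyzed):

{"query":{"match_phrase":{"title":"Garfield 2: A Tail Of Two Kitties (2006)"}}}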
I am running elasticsearch v1.1.1 and I am having trouble getting results from regex searches.
{
    "query" : {
        "regexp" : {
            "lastname" : "smit*"
        }
    }
}
Returns 0 results (when I know I have 'smith' in the data).
I have also tried:
{
    "query" : {
        "filtered" : {
            "filter" : {
                "regexp" : {
                    "lastname" : "smit*"
                }
            }
        }
    }
}
Any help would be appreciated.
So first off, a lot of this is dependent on how you indexed the field - analyzed or not, what kind of tokenizer, was it lowercased, etc.
To answer your specific question concerning regexp queries, assuming your field is indexed as "smith" (all lower case) you should change your search string to "smit.*" which should match "smith". "smit." should also work.
The reason is that in regexp (which is different than wildcard) "." matches any character. "*" matches any number of the previous character. So your search would match "smitt" or "smittt". The construct ".*" means match any number (including 0) of the previous character - which is "." which matches any. The combination of the two is the regexp equivalent of the wildcard "*".
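For example, the corrected version of the first query from the question would be (assuming the field was indexed in lowercase):

{
    "query" : {
        "regexp" : {
            "lastname" : "smit.*"
        }
    }
}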
That said, I'd caution that regexp and wildcard searches can have significant performance challenges in text indexes, depending upon the nature of the field, how it's indexed and the number of documents. These kinds of searches can be very useful but more than one person has built wildcard or regexp searches tested on small data sets only to be disappointed by the production performance. Use with caution.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html
ElasticSearch Regexp Filter