Understanding ELK Analyzer - elasticsearch

I'm a newbie to the ELK 5.1.1 stack and I have a few questions, just for my understanding.
I have set up this stack basically with the standard analyzers/filters and everything works great.
My data source is a MySQL backend that I index using Logstash.
I would like to handle queries containing accents, and hopefully the asciifolding token filter can help me achieve this.
First I learned how to create a custom analyzer and save it as a template.
Right now, when I query the URL http://localhost:9200/_template?pretty I have 2 templates: the default Logstash template named logstash, and my custom template, whose settings are:
"custom_template" : {
"order" : 1,
"template" : "doo*",
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"myCustomAnalyzer" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"refresh_interval" : "5s"
}
},
"mappings" : { },
"aliases" : { }
}
Searching for the keyword Yaoundé returns 70 hits, but when I search for Yaounde I get no hits.
Below is my query for the second case:
{
  "query": {
    "query_string": {
      "query": "yaounde",
      "fields": [
        "title"
      ]
    }
  },
  "from": 0,
  "size": 10
}
Please can somebody help me figure out what I am doing wrong here?
Also, knowing that my data is analyzed by Logstash during the indexing process, do I really have to specify that the analyzer myCustomAnalyzer should be applied at search time, as in this second query?
{
  "query": {
    "query_string": {
      "query": "yaounde",
      "fields": [
        "title"
      ],
      "analyzer": "myCustomAnalyzer"
    }
  },
  "from": 0,
  "size": 10
}
Here is a sample of the output part of my Logstash config file:
output {
  stdout { codec => json_lines }
  if [type] == "announces" {
    elasticsearch {
      hosts => "localhost:9200"
      document_id => "%{job_id}"
      index => "dooone"
      document_type => "%{type}"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
      document_id => "%{uid}"
      index => "dootwo"
      document_type => "%{type}"
    }
  }
}
Thank You

A good place to start is with the Elasticsearch index template documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
Here is an example for your scenario that could work for the title field:
"custom_template" : {
"order" : 1,
"template" : "doo*",
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"myCustomAnalyzer" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"refresh_interval" : "5s"
}
},
"mappings" : {
"your_type": {
"properties": {
"title": {
"type": "text",
"analyzer": "myCustomAnalyzer"
}
}
}
},
"aliases" : { }
}
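One caveat worth noting (not part of the original answer, but standard Elasticsearch behaviour): index templates are only applied when an index is created, so the existing dooone and dootwo indices from the question would have to be deleted and re-populated by Logstash before the new title mapping takes effect. A minimal sketch, using the index names from the question:
# delete the existing indices so the template applies when they are re-created
curl -XDELETE 'localhost:9200/dooone,dootwo'
# then re-run the Logstash pipeline so documents are re-indexed
# with title analyzed by myCustomAnalyzer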
An alternative would be to change the dynamic mapping. You can find a good example for strings right here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html#match-mapping-type
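A rough sketch of that dynamic-template alternative (the template name strings_as_custom is made up here for illustration), routing every newly detected string field through the custom analyzer:
"mappings": {
  "_default_": {
    "dynamic_templates": [
      {
        "strings_as_custom": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "myCustomAnalyzer"
          }
        }
      }
    ]
  }
}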

Can you show the mapping of your document?
(GET /my_index/my_doc/_mapping)
The analyzer you provide as an argument in your query only applies at search time, not at index time. So if you haven't set this analyzer in your mapping, the string is still indexed with the default analyzer, and it will not match your results.
The analyzer you provide at search time is applied to your query string, but the search then looks into the indexed data, which was indexed as "Yaoundé", not "yaounde".
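To see the difference, the _analyze API can be run against the custom analyzer; a quick check along these lines (index name dooone taken from the question's Logstash output) should show the accented term being folded into a lowercased, plain-ASCII token:
GET /dooone/_analyze
{
  "analyzer": "myCustomAnalyzer",
  "text": "Yaoundé"
}
Run through the default standard analyzer instead, the same text keeps its accent, which is why the unaccented query finds nothing until the field is mapped with myCustomAnalyzer.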

Related

Can only use wildcard queries on keyword, text and wildcard fields - not on [id] which is of type [long]

Elasticsearch version 7.13.1
GET test/_mapping
{
  "test" : {
    "mappings" : {
      "properties" : {
        "id" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text"
        }
      }
    }
  }
}
POST test/_doc/101
{
  "id": 101,
  "name": "hello"
}
POST test/_doc/102
{
  "id": 102,
  "name": "hi"
}
Wildcard Search pattern
GET test/_search
{
  "query": {
    "query_string": {
      "query": "*101* *hello*",
      "default_operator": "AND",
      "fields": [
        "id",
        "name"
      ]
    }
  }
}
The error is: "reason" : "Can only use wildcard queries on keyword, text and wildcard fields - not on [id] which is of type [long]"
It was working fine in version 7.6.0.
What changed in the latest ES, and what is the resolution of this issue?
It's not directly possible to run wildcard queries on numeric data types; it is better to index those integers as strings. You need to modify your index mapping along these lines:
PUT /my-index
{
  "mappings": {
    "properties": {
      "code": {
        "type": "text"
      }
    }
  }
}
Otherwise, if you want to perform a partial search, you can use the edge n-gram tokenizer, as sketched below.
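A minimal sketch of that edge n-gram approach, with analyzer and tokenizer names invented here for illustration (grams are produced at index time while the standard analyzer is kept for search):
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "partial_match_analyzer": {
          "tokenizer": "partial_match_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "partial_match_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "code": {
        "type": "text",
        "analyzer": "partial_match_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
A plain match query for a leading fragment such as 10 would then match a document whose code is 101, without needing wildcards.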

How do I configure elastic search to use the icu_tokenizer?

I'm trying to search text indexed by Elasticsearch with the icu_tokenizer but can't get it working.
My test case is to tokenize the sentence "Hello. I am from Bangkok", in Thai สวัสดี ผมมาจากกรุงเทพฯ, which should be tokenized into the five words สวัสดี, ผม, มา, จาก, กรุงเทพฯ (sample from Elasticsearch - The Definitive Guide).
Searching using any of the last four words fails for me. Searching using either of the space-separated words สวัสดี or ผมมาจากกรุงเทพฯ works fine.
If I specify the icu_tokenizer on the command line, like
curl -XGET 'http://localhost:9200/icu/_analyze?tokenizer=icu_tokenizer' -d "สวัสดี ผมมาจากกรุงเทพฯ"
it tokenizes to five words.
My settings are:
curl http://localhost:9200/icu/_settings?pretty
{
  "icu" : {
    "settings" : {
      "index" : {
        "creation_date" : "1474010824865",
        "analysis" : {
          "analyzer" : {
            "nfkc_cf_normalized" : [ "icu_normalizer" ],
            "tokenizer" : "icu_tokenizer"
          }
        }
      },
      "number_of_shards" : "5",
      "number_of_replicas" : "1",
      "uuid" : "tALRehqIRA6FGPu8iptzww",
      "version" : {
        "created" : "2040099"
      }
    }
  }
}
The index is populated with
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
  "first_name" : "John",
  "last_name" : "Doe",
  "about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
Searching with
curl -XGET 'http://localhost:9200/_search' -d'
{
  "query" : {
    "match" : {
      "about" : "กรุงเทพฯ"
    }
  }
}'
Returns nothing ("hits" : [ ]).
Performing the same search with either สวัสดี or ผมมาจากกรุงเทพฯ works fine.
I guess I've misconfigured the index; how should it be done?
The missing part is:
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
In the mapping, the document field has to specify which analyzer to use:
[Index]: icu
[Type]: employee
[Field]: about
PUT /icu
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_analyzer" : {
          "char_filter": [
            "icu_normalizer"
          ],
          "tokenizer" : "icu_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "employee" : {
      "properties": {
        "about": {
          "type": "text",
          "analyzer": "icu_analyzer"
        }
      }
    }
  }
}
Test the custom analyzer using the following DSL JSON:
POST /icu/_analyze
{
  "text": "สวัสดี ผมมาจากกรุงเทพฯ",
  "analyzer": "icu_analyzer"
}
The result should be [สวัสดี, ผม, มา, จาก, กรุงเทพฯ].
My suggestion: the Kibana Dev Tools console can help with crafting queries like these effectively.
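With that mapping in place and the document re-indexed, the match query from the question should now return the employee document; a quick check in the same console:
GET /icu/_search
{
  "query": {
    "match": {
      "about": "กรุงเทพฯ"
    }
  }
}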

CSV geodata into elasticsearch as a geo_point type using logstash

Below is a reproducible example of the problem I am having using the most recent versions of Logstash and Elasticsearch.
I am using Logstash to load geospatial data from a CSV into Elasticsearch as geo_points.
The CSV looks like the following:
$ head simple_base_map.csv
"lon","lat"
-1.7841,50.7408
-1.7841,50.7408
-1.78411,50.7408
-1.78412,50.7408
-1.78413,50.7408
-1.78414,50.7408
-1.78415,50.7408
-1.78416,50.7408
-1.78416,50.7408
I have created a mapping template that looks like the following:
$ cat simple_base_map_template.json
{
  "template": "base_map_template",
  "order": 1,
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "node_points" : {
      "properties" : {
        "location" : { "type" : "geo_point" }
      }
    }
  }
}
and have a logstash config file that looks like the following:
$ cat simple_base_map.conf
input {
  stdin {}
}
filter {
  csv {
    columns => [
      "lon", "lat"
    ]
  }
  if [lon] == "lon" {
    drop { }
  } else {
    mutate {
      remove_field => [ "message", "host", "@timestamp", "@version" ]
    }
    mutate {
      convert => { "lon" => "float" }
      convert => { "lat" => "float" }
    }
    mutate {
      rename => {
        "lon" => "[location][lon]"
        "lat" => "[location][lat]"
      }
    }
  }
}
output {
  stdout { codec => dots }
  elasticsearch {
    index => "base_map_simple"
    template => "simple_base_map_template.json"
    document_type => "node_points"
  }
}
I then run the following:
$cat simple_base_map.csv | logstash-2.1.3/bin/logstash -f simple_base_map.conf
Settings: Default filter workers: 16
Logstash startup completed
....................................................................................................Logstash shutdown completed
However, looking at the index base_map_simple suggests the documents do not have a location field of type geo_point; instead, location is mapped as two doubles, lat and lon.
$ curl -XGET 'localhost:9200/base_map_simple?pretty'
{
  "base_map_simple" : {
    "aliases" : { },
    "mappings" : {
      "node_points" : {
        "properties" : {
          "location" : {
            "properties" : {
              "lat" : {
                "type" : "double"
              },
              "lon" : {
                "type" : "double"
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1457355015883",
        "uuid" : "luWGyfB3ToKTObSrbBbcbw",
        "number_of_replicas" : "1",
        "number_of_shards" : "5",
        "version" : {
          "created" : "2020099"
        }
      }
    },
    "warmers" : { }
  }
}
How would I need to change any of the above files to ensure that the location goes into Elasticsearch as a geo_point type?
Finally, I would like to be able to carry out a nearest-neighbour search on the geo_points using a command such as the following:
curl -XGET 'localhost:9200/base_map_simple/_search?pretty' -d'
{
  "size": 1,
  "sort": {
    "_geo_distance" : {
      "location" : {
        "lat" : 50,
        "lon" : -1
      },
      "order" : "asc",
      "unit": "m"
    }
  }
}'
Thanks
The problem is that in your elasticsearch output you named the index base_map_simple, while in your template the template property is base_map_template, so the template is not applied when the new index is created. The template property needs to match the name of the index being created in order for the template to kick in.
It will work if you simply change the latter to base_map_*, i.e. as in:
{
  "template": "base_map_*",        <--- change this
  "order": 1,
  "settings": {
    "index.number_of_shards": 1
  },
  "mappings": {
    "node_points": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
UPDATE
Make sure to delete the current index as well as the template first, i.e.
curl -XDELETE localhost:9200/base_map_simple
curl -XDELETE localhost:9200/_template/logstash
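After re-running the Logstash pipeline, the mapping can be checked again to confirm that location now comes through as a geo_point (index name taken from the question):
curl -XGET 'localhost:9200/base_map_simple/_mapping?pretty'
The node_points mapping should then show "location" : { "type" : "geo_point" } instead of the two double sub-fields, and the _geo_distance sort from the question should work.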

Indexing a comma-separated value field in Elastic Search

I'm using Nutch to crawl a site and index it into Elasticsearch. My site has meta tags, some of them containing a comma-separated list of IDs (that I intend to use for search). For example:
contentTypeIds="2,5,15". (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!
Create a custom analyzer that splits the indexed text into tokens on commas.
Then you can try the search. If you don't care about relevance, you can use a filter to search through your documents; my example shows how to search with a term filter.
Below you can find how to do this with the Sense plugin.
DELETE testindex

PUT testindex
{
  "index" : {
    "analysis" : {
      "tokenizer" : {
        "comma" : {
          "type" : "pattern",
          "pattern" : ","
        }
      },
      "analyzer" : {
        "comma" : {
          "type" : "custom",
          "tokenizer" : "comma"
        }
      }
    }
  }
}

PUT /testindex/_mapping/yourtype
{
  "properties" : {
    "contentType" : {
      "type" : "string",
      "analyzer" : "comma"
    }
  }
}

PUT /testindex/yourtype/1
{
  "contentType" : "1,2,3"
}

PUT /testindex/yourtype/2
{
  "contentType" : "3,4"
}

PUT /testindex/yourtype/3
{
  "contentType" : "1,6"
}

GET /testindex/_search
{
  "query": {"match_all": {}}
}

GET /testindex/_search
{
  "filter": {
    "term": {
      "contentType": "6"
    }
  }
}
Hope it helps.
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n",
      ","
    ]
  },
  "text": "QUICK,brown, fox"
}
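For that text the char_group tokenizer above should emit the tokens QUICK, brown and fox, so a comma-separated field analyzed this way becomes searchable per individual value. This is an alternative to the pattern tokenizer in the first answer, available on more recent Elasticsearch versions.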

Logstash/Elasticsearch CSV Field Types, Date Formats and Multifields (.raw)

I've been playing around with getting a tab-delimited file into Elasticsearch using the CSV filter in Logstash. Getting the data in was actually incredibly easy, but I'm having trouble getting the field types to come out right when I look at the data in Kibana. Dates and integers keep coming in as strings, so I can't plot by date or run any analysis functions on integers (sum, mean, etc.).
I'm also having trouble getting the .raw version of the fields to populate. For example, in device I have data like "HTC One", but if I do a pie chart in Kibana, it shows up as two separate groupings, "HTC" and "One". When I try to chart device.raw instead, it comes up as a missing field. From what I've read, it seems like Logstash should automatically create a raw version of each string field, but that doesn't seem to be happening.
I've been sifting through the documentation, Google and Stack Overflow, but haven't found a solution. Any ideas appreciated! Thanks.
Config file:
#logstash.conf
input {
  file {
    path => "file.txt"
    type => "event"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    columns => ["userid","date","distance","device"]
    separator => " "
  }
}
output {
  elasticsearch {
    action => "index"
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "userid"
    workers => 2
    template => template.json
  }
  #stdout {
  #  codec => rubydebug
  #}
}
Here's the template file:
#template.json:
{
  "template": "event",
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0,
    "index" : {
      "query" : { "default_field" : "userid" }
    }
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "_source": { "compress": true },
      "dynamic_templates": [
        {
          "string_template" : {
            "match" : "*",
            "mapping": { "type": "string", "index": "not_analyzed" },
            "match_mapping_type" : "string"
          }
        }
      ],
      "properties" : {
        "date" : { "type" : "date", "format": "yyyy-MM-dd HH:mm:ss" },
        "device" : { "type" : "string", "fields": { "raw": { "type": "string", "index": "not_analyzed" } } },
        "distance" : { "type" : "integer" }
      }
    }
  }
}
Figured it out - the "template" property has to match the index name. So the "template": "event" line should have been "template": "userid".
I found another (easier) way to specify the type of the fields. You can use Logstash's mutate filter to change the type of a field. Simply add the following filter after your csv filter in your Logstash config:
mutate {
  convert => [ "fieldname", "integer" ]
}
For details, check out the Logstash docs - mutate convert.
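mutate's convert covers integers and floats but not dates; for the date field from the question, a Logstash date filter could be added after the csv filter as well (a sketch, reusing the field name and format from the template above):
filter {
  date {
    # parse the "date" column produced by the csv filter
    match => [ "date", "yyyy-MM-dd HH:mm:ss" ]
    # keep the parsed value in "date" instead of overwriting @timestamp
    target => "date"
  }
}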
