How to aggregate by substring in elasticsearch - elasticsearch

I have to index many documents like this:
POST /example/doc
{
"id": "type-name",
"foo": "bar"
}
and I would like to retrieve a list of all the types that are present. For example:
POST /example/doc
{
"id": "AAA-123",
"foo": "bar"
}
POST /example/doc
{
"id": "AAA-456",
"foo": "bar"
}
POST /example/doc
{
"id": "BBB-123",
"foo": "bar"
}
and ask elasticsearch to give me a list containing AAA and BBB.
UPDATE
I've also solved this using a custom analyzer:
"settings": {
"analysis": {
"char_filter" : {
"remove_after_minus":{
"type":"pattern_replace",
"pattern":"-(.*)",
"replacement":""
}
},
"analyzer": {
"id_analyzer":{
"tokenizer" : "standard",
"char_filter" : ["remove_after_minus"]
}
}
}
}
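Conceptually, the remove_after_minus char filter just strips everything from the first "-" onward before tokenization. A minimal Python sketch of that transformation (illustrative only, not Elasticsearch code):

```python
import re

# Equivalent of the "remove_after_minus" pattern_replace char filter:
# drop the "-" and everything after it, leaving only the type prefix.
def remove_after_minus(value):
    return re.sub(r"-(.*)", "", value)

for doc_id in ["AAA-123", "AAA-456", "BBB-123"]:
    print(doc_id, "->", remove_after_minus(doc_id))  # AAA, AAA, BBB
```

With id_analyzer applied to the id field, a terms facet/aggregation on it then returns AAA and BBB directly.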

If you keep the standard analyzer, the id will be split at the "-". So, if for your types lower and upper case are equivalent, you can just go with a simple facet query:
curl -XPOST "http://localhost:9023/index/type/_search?size=0&pretty=true" -d
'{
"query" : {
{ "regexp":{ "id": "[A-Z]+" }
},
"facets" : {
"id" : {
"terms" : {
"field" : "id",
"size" : 50
}
}
}
}'
should give you something that you can use.
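For intuition, the terms facet on the analyzed id field amounts to counting distinct prefixes. A client-side Python sketch with the example ids (illustrative only):

```python
from collections import Counter

# What the terms facet on "id" effectively returns once each id is reduced
# to the part before the "-": a count per distinct type.
ids = ["AAA-123", "AAA-456", "BBB-123"]
type_counts = Counter(i.split("-")[0] for i in ids)
print(type_counts)  # Counter({'AAA': 2, 'BBB': 1})
```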

Related

How do I configure elastic search to use the icu_tokenizer?

I'm trying to search text indexed by Elasticsearch with the icu_tokenizer but can't get it working.
My test case is to tokenize the sentence “Hello. I am from Bangkok”, in Thai สวัสดี ผมมาจากกรุงเทพฯ, which should be tokenized into the five words สวัสดี, ผม, มา, จาก, กรุงเทพฯ. (Sample from Elasticsearch - The Definitive Guide)
Searching using any of the last four words fails for me. Searching using either of the space-separated chunks สวัสดี or ผมมาจากกรุงเทพฯ works fine.
If I specify the icu_tokenizer on the command line, like
curl -XGET 'http://localhost:9200/icu/_analyze?tokenizer=icu_tokenizer' -d "สวัสดี ผมมาจากกรุงเทพฯ"
it tokenizes to five words.
My settings are:
curl http://localhost:9200/icu/_settings?pretty
{
"icu" : {
"settings" : {
"index" : {
"creation_date" : "1474010824865",
"analysis" : {
"analyzer" : {
"nfkc_cf_normalized" : [ "icu_normalizer" ],
"tokenizer" : "icu_tokenizer"
}
}
},
"number_of_shards" : "5",
"number_of_replicas" : "1",
"uuid" : "tALRehqIRA6FGPu8iptzww",
"version" : {
"created" : "2040099"
}
}
}
}
The index is populated with
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
"first_name" : "John",
"last_name" : "Doe",
"about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
Searching with
curl -XGET 'http://localhost:9200/_search' -d'
{
"query" : {
"match" : {
"about" : "กรุงเทพฯ"
}
}
}'
Returns nothing ("hits" : [ ]).
Performing the same search with one of สวัสดี or ผมมาจากกรุงเทพฯ works fine.
I guess I've misconfigured the index, how should it be done?
The missing part is:
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
In the mapping, the analyzer to use has to be specified on the document field:
[Index] : icu
[type] : employee
[field] : about
PUT /icu
{
"settings": {
"analysis": {
"analyzer": {
"icu_analyzer" : {
"char_filter": [
"icu_normalizer"
],
"tokenizer" : "icu_tokenizer"
}
}
}
},
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
}
Test the custom analyzer using the following request:
POST /icu/_analyze
{
"text": "สวัสดี ผมมาจากกรุงเทพฯ",
"analyzer": "icu_analyzer"
}
The result should be [สวัสดี, ผม, มา, จาก, กรุงเทพฯ]
My suggestion would be:
Kibana Dev Tools can help you craft queries effectively.
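To see why the original setup returned no hits, compare the token lists (both taken from the question) in this toy Python sketch; a match query can only hit tokens that were actually indexed:

```python
# Toy illustration, not real ICU segmentation. With whitespace-only
# tokenization the Thai sentence stays as two big tokens, so the query
# token for "Bangkok" never matches; with ICU-style word segmentation
# it does. Token lists are copied from the question, not computed here.
whitespace_tokens = ["สวัสดี", "ผมมาจากกรุงเทพฯ"]
icu_tokens = ["สวัสดี", "ผม", "มา", "จาก", "กรุงเทพฯ"]

query = "กรุงเทพฯ"
print(query in whitespace_tokens)  # False -> "hits" : [ ]
print(query in icu_tokens)         # True
```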

ElasticSearch filtering for a tag in array

I've got a bunch of events that are tagged for their audience:
{ id = 123, audiences = ["Public", "Lecture"], ... }
I'm trying to do an ElasticSearch query with filtering, so that the search will only return events that have an exact entry of "Public" in that audiences array (and won't return events that are tagged "Not Public").
How do I do that?
This is what I have so far, but it's returning zero results, even though I definitely have "Public" events:
curl -XGET 'http://localhost:9200/events/event/_search' -d '
{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"audiences": "Public"
}
},
"query" : {
"match" : {
"title" : "[searchterm]"
}
}
}
}
}'
You could use this mapping for your content type:
{
"your_index": {
"mappings": {
"your_type": {
"properties": {
"audiences": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
not_analyzed
Index this field, so it is searchable, but index the
value exactly as specified. Do not analyze it.
Alternatively, if you keep the existing analyzed mapping, use a lowercase term value ("public") in the search query, since the standard analyzer lowercases tokens at index time.
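The underlying mismatch: the standard analyzer lowercases tokens at index time, while a term filter is not analyzed, so it compares the query value verbatim against the indexed tokens. A rough Python sketch (standard_analyze is a simplification for plain ASCII words, not the real analyzer):

```python
# Simplified stand-in for the standard analyzer: split on whitespace
# and lowercase each token.
def standard_analyze(value):
    return value.lower().split()

indexed = standard_analyze("Public")   # ['public'] goes into the index
print("Public" in indexed)  # False -> the term filter matches nothing
print("public" in indexed)  # True  -> a lowercase term matches
```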

CSV geodata into elasticsearch as a geo_point type using logstash

Below is a reproducible example of the problem I am having with the most recent versions of Logstash and Elasticsearch.
I am using logstash to input geospatial data from a csv into elasticsearch as geo_points.
The CSV looks like the following:
$ head simple_base_map.csv
"lon","lat"
-1.7841,50.7408
-1.7841,50.7408
-1.78411,50.7408
-1.78412,50.7408
-1.78413,50.7408
-1.78414,50.7408
-1.78415,50.7408
-1.78416,50.7408
-1.78416,50.7408
I have created a mapping template that looks like the following:
$ cat simple_base_map_template.json
{
"template": "base_map_template",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"node_points" : {
"properties" : {
"location" : { "type" : "geo_point" }
}
}
}
}
and have a logstash config file that looks like the following:
$ cat simple_base_map.conf
input {
stdin {}
}
filter {
csv {
columns => [
"lon", "lat"
]
}
if [lon] == "lon" {
drop { }
} else {
mutate {
remove_field => [ "message", "host", "@timestamp", "@version" ]
}
mutate {
convert => { "lon" => "float" }
convert => { "lat" => "float" }
}
mutate {
rename => {
"lon" => "[location][lon]"
"lat" => "[location][lat]"
}
}
}
}
output {
stdout { codec => dots }
elasticsearch {
index => "base_map_simple"
template => "simple_base_map_template.json"
document_type => "node_points"
}
}
I then run the following:
$cat simple_base_map.csv | logstash-2.1.3/bin/logstash -f simple_base_map.conf
Settings: Default filter workers: 16
Logstash startup completed
....................................................................................................Logstash shutdown completed
However, when looking at the index base_map_simple, the documents do not have a location field of type geo_point; instead, location is mapped as two doubles, lat and lon.
$ curl -XGET 'localhost:9200/base_map_simple?pretty'
{
"base_map_simple" : {
"aliases" : { },
"mappings" : {
"node_points" : {
"properties" : {
"location" : {
"properties" : {
"lat" : {
"type" : "double"
},
"lon" : {
"type" : "double"
}
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1457355015883",
"uuid" : "luWGyfB3ToKTObSrbBbcbw",
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "2020099"
}
}
},
"warmers" : { }
}
}
How would I need to change any of the above files to ensure that the location goes into Elasticsearch as a geo_point type?
Finally, I would like to be able to carry out a nearest neighbour search on the geo_points by using a command such as the following:
curl -XGET 'localhost:9200/base_map_simple/_search?pretty' -d'
{
"size": 1,
"sort": {
"_geo_distance" : {
"location" : {
"lat" : 50,
"lon" : -1
},
"order" : "asc",
"unit": "m"
}
}
}'
Thanks
The problem is that in your elasticsearch output you named the index base_map_simple while in your template the template property is base_map_template, hence the template is not being applied when creating the new index. The template property needs to somehow match the name of the index being created in order for the template to kick in.
It will work if you simply change the latter to base_map_*, i.e. as in:
{
"template": "base_map_*", <--- change this
"order": 1,
"settings": {
"index.number_of_shards": 1
},
"mappings": {
"node_points": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
UPDATE
Make sure to delete the current index as well as the template first, i.e.:
curl -XDELETE localhost:9200/base_map_simple
curl -XDELETE localhost:9200/_template/logstash
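The matching behaves like a filename wildcard: the template property must match the name of the index being created. A quick Python illustration, using fnmatch as a stand-in for Elasticsearch's pattern matching:

```python
from fnmatch import fnmatch

# "base_map_template" never matches the index name "base_map_simple",
# so the template is ignored; the wildcard pattern does match.
print(fnmatch("base_map_simple", "base_map_template"))  # False: template ignored
print(fnmatch("base_map_simple", "base_map_*"))         # True: template applied
```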

Elastic exact match w/o changing indexing

I have following query to elastic:
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"entities.hashtags": "gf"
}
}
]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
},
entities.hashtags is an array, and as a result I receive entries with the hashtags gf_anime, gf_whatever, gf_foobar, etc.
But what I need is to receive entries where the exact hashtag "gf" exists.
I've looked at other questions on SO and saw that the solution in this case is to change the analysis of entities.hashtags so it matches only exact values (I am pretty new to Elastic, so I may be getting the terminology wrong).
My question is whether it's possible to define an exact-match search INSIDE THE QUERY, i.e. without changing how Elastic indexes its fields?
Are you sure that you need to do anything? Given your examples, you don't, and you probably don't want not_analyzed:
curl -XPUT localhost:9200/test -d '{
"mappings": {
"test" : {
"properties": {
"body" : { "type" : "string" },
"entities" : {
"type" : "object",
"properties": {
"hashtags" : {
"type" : "string"
}
}
}
}
}
}
}'
curl -XPUT localhost:9200/test/test/1 -d '{
"body" : "anime", "entities" : { "hashtags" : "gf_anime" }
}'
curl -XPUT localhost:9200/test/test/2 -d '{
"body" : "anime", "entities" : { "hashtags" : ["GF", "gf_anime"] }
}'
curl -XPUT localhost:9200/test/test/3 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf_whatever", "gf_anime"] }
}'
With the above data indexed, your query only returns document 2 (note: this is a simplified version of your query without the unnecessary/undesirable and filter; at least for the time being, you should always use the bool filter rather than and/or, as it understands how to use the filter caches):
curl -XGET localhost:9200/test/_search -d '
{
"query": {
"filtered": {
"filter": {
"term": {
"entities.hashtags": "gf"
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
}
}'
Where this breaks down is when you start putting in hashtag values that will be split into multiple tokens, thereby triggering false hits with the term filter. You can determine how the field's analyzer will treat any value by passing it to the _analyze endpoint and telling it the field to use the analyzer from:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf_anime'
{
"tokens" : [ {
"token" : "gf_anime",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
# Note the space instead of the underscore:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf anime'
{
"tokens" : [ {
"token" : "gf",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "anime",
"start_offset" : 3,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
If you were to add a fourth document with the "gf anime" variant, then you would get a false hit:
curl -XPUT localhost:9200/test/test/4 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf whatever", "gf anime"] }
}'
This is really not an indexing problem, but a bad data problem.
With all of the explanation out of the way, you can inefficiently solve this by using a script that always follows the term filter (to efficiently rule out the more common cases that don't hit it):
curl -XGET localhost:9200/test/_search -d '
{
"query": {
"filtered": {
"filter": {
"bool" : {
"must" : [{
"term" : {
"entities.hashtags" : "gf"
}
},
{
"script" : {
"script" :
"_source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null",
"params" : {
"tag" : "gf"
}
}
}]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
}
}'
This works by parsing the original _source (and not using the indexed doc values). That is why this is not going to be very efficient, but it will work until you reindex. The _source.entities.hashtags == tag portion is only necessary if hashtags is not always an array (in my example, document 1 would not be an array). If it is always an array, then you can use _source.entities.hashtags.contains(tag) instead of _source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null.
Note: The script language is Groovy, which is the default starting in 1.4.0. It is not the default in earlier versions, and it must be explicitly enabled using script.default_lang : groovy.
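For reference, the check the Groovy script performs can be sketched in Python (has_exact_tag is a hypothetical helper, shown only to illustrate the scalar-or-array logic):

```python
# The field may be a single value or an array; the document matches only
# when one entry equals the tag exactly, which is what the script filter
# enforces after the term filter has done the cheap pre-selection.
def has_exact_tag(hashtags, tag):
    if isinstance(hashtags, list):
        return any(h == tag for h in hashtags)
    return hashtags == tag

print(has_exact_tag("gf_anime", "gf"))                   # False: doc 1 rejected
print(has_exact_tag(["gf whatever", "gf anime"], "gf"))  # False: doc 4 (false hit) rejected
print(has_exact_tag(["gf", "gf_anime"], "gf"))           # True: exact tag present
```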

elasticsearch: can I defined synonyms with boost?

Let's say A, B, C are synonyms. I want to define that B is "closer" to A than C is,
so that when I search for the keyword A, A comes first in the results, B comes second, and C comes last.
Any help?
There is no search-time mechanism (as of yet) to differentiate between matches on synonyms and matches on the original term. This is because, when indexed, a field's synonyms are placed into the inverted index alongside the original term, leaving all words equal.
This is not to say however that you cannot do some magic at index time to glean the information you want.
Create an index with two analyzers: one with a synonym filter, and one without.
PUT /synonym_test/
{
"settings": {
"analysis": {
"analyzer": {
"no_synonyms": {
"tokenizer": "lowercase"
},
"synonyms": {
"tokenizer": "lowercase",
"filter": ["synonym"]
}
},
"filter": {
"synonym": {
"type": "synonym",
"format": "wordnet",
"synonyms_path": "prolog/wn_s.pl"
}
}
}
}
}
Use a multi-field mapping so that the field of interest is indexed twice:
PUT /synonym_test/mytype/_mapping
{
"properties":{
"mood": {
"type": "multi_field",
"fields" : {
"syn" : {"type" : "string", "analyzer" : "synonyms"},
"no_syn" : {"type" : "string", "analyzer" : "no_synonyms"}
}
}
}
}
Index a test document:
POST /synonym_test/mytype/1
{
"mood": "elated"
}
At search time, boost the score of hits on the field without synonyms.
GET /synonym_test/mytype/_search
{
"query": {
"bool": {
"should": [
{ "match": { "mood.syn": { "query": "gleeful", "boost": 3 } } },
{ "match": { "mood.no_syn": "gleeful" } }
]
}
}
}
Results in "_score": 0.2696457.
Searching for the original term returns a better score:
GET /synonym_test/mytype/_search
{
"query": {
"bool": {
"should": [
{ "match": { "mood.syn": { "query": "elated", "boost": 3 } } },
{ "match": { "mood.no_syn": "elated" } }
]
}
}
}
Results in "_score": 0.6558018.
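To see why the scores differ, here is a toy Python model of the two should clauses (illustrative only; real Lucene scoring also involves tf-idf and norms, so the numbers below are not the actual ES scores):

```python
# Toy scoring model: a hit on the synonym-expanded field contributes with
# boost 3; a hit on the plain field adds on top. The original term matches
# both fields, so it always outscores a mere synonym.
def score(query, original, synonyms):
    syn_hit = query == original or query in synonyms
    no_syn_hit = query == original
    return 3.0 * syn_hit + 1.0 * no_syn_hit

syns = {"gleeful", "joyful"}
print(score("gleeful", "elated", syns))  # 3.0: synonym field only
print(score("elated", "elated", syns))   # 4.0: matches both fields
```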