Ignoring Apostrophes (Possessive) In Elasticsearch

I'm trying to get user submitted queries for "Joe Frankles", "Joe Frankle", "Joe Frankle's" to match the original text "Joe Frankle's". Right now we're indexing the field this text is in with (Tire / Ruby Format):
{ :type => 'string', :analyzer => 'snowball' }
and searching with:
query { string downcased_query, :default_operator => 'AND' }
I tried this unsuccessfully:
create :settings => {
  :analysis => {
    :char_filter => {
      :remove_accents => {
        :type => "mapping",
        :mappings => ["`=>", "'=>"]
      }
    },
    :analyzer => {
      :myanalyzer => {
        :type => 'custom',
        :tokenizer => 'standard',
        :char_filter => ['remove_accents'],
        :filter => ['standard', 'lowercase', 'stop', 'snowball', 'ngram']
      }
    },
    :default => {
      :type => 'myanalyzer'
    }
  }
}

There are two official ways of handling possessive apostrophes:
1) Use the "possessive_english" stemmer as described in the ES docs:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html
Example:
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter" : {
        "my_stemmer" : {
          "type" : "stemmer",
          "name" : "possessive_english"
        }
      }
    }
  }
}
Use other stemmers or snowball in addition to the "possessive_english" filter if you like. It should work, but it's untested code.
2) Use the "word_delimiter" filter:
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "lowercase", "my_word_delimiter"]
        }
      },
      "filter" : {
        "my_word_delimiter" : {
          "type" : "word_delimiter",
          "preserve_original": "true"
        }
      }
    }
  }
}
Works for me :-) ES docs:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
Both will cut off "'s".
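You can sanity-check either setup with the Analyze API before reindexing anything. A minimal sketch, assuming an index called my_index created with one of the "my_analyzer" definitions above (newer Elasticsearch versions accept a JSON body like this; older releases take the same parameters in the URL):
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Joe Frankle's"
}
With the possessive_english stemmer you should get the tokens joe and frankle; with word_delimiter plus preserve_original you also keep frankle's, so queries with and without the apostrophe can match.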

I ran into a similar problem; the snowball analyzer alone didn't work for me (I don't know if it's supposed to or not). Here's what I use:
properties: {
  name: {
    boost: 10,
    type: 'multi_field',
    fields: {
      name: { type: 'string', index: 'analyzed', analyzer: 'title_analyzer' },
      untouched: { type: 'string', index: 'not_analyzed' }
    }
  }
}
analysis: {
  char_filter: {
    remove_accents: {
      type: "mapping",
      mappings: ["`=>", "'=>"]
    }
  },
  filter: {},
  analyzer: {
    title_analyzer: {
      type: 'custom',
      tokenizer: 'standard',
      char_filter: ['remove_accents']
    }
  }
}
The Admin indices analyze tool is also great when working with analyzers.

It looks like your query is searching the _all field, but your analyzer is applied only to the individual field. To enable this behaviour for the _all field, simply make snowball your default analyzer.
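For example, a minimal settings sketch (untested; the index name is just a placeholder) that makes snowball the index-wide default analyzer, so the _all field gets the same treatment:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "snowball"
        }
      }
    }
  }
}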

Related

Logstash - how to change field types

Hello guys I am new to Elasticsearch and I have a table called 'american_football_sacks_against_stats'. It consists of three columns.
id | sacks_against_total | sacks_against_yards
 1 | 12                  | 5
 2 | 15                  | 3
...| ...                 | ...
The problem is that sacks_against_total and sacks_against_yards aren't 'imported' as integers/longs/floats whatsoever but as a text field and a keyword field. How can I convert them into numbers?
I tried this, but it's not working:
mutate {
  convert => {
    "id" => "integer"
    "sacks_against_total" => "integer"
    "sacks_against_yards" => "integer"
  }
}
This is my logstash.conf file:
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/sportsdb"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_driver_class => "org.postgresql.Driver"
    schedule => "*/5 * * * *"
    statement => "SELECT * FROM public.american_football_sacks_against_stats"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "300"
  }
}
filter {
  mutate {
    convert => {
      "id" => "integer"
      "sacks_against_total" => "integer"
      "sacks_against_yards" => "integer"
    }
  }
}
output {
  stdout { codec => "json" }
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "sportsdb"
    doc_as_upsert => true
  }
}
This is the solution I was looking for:
https://linuxhint.com/elasticsearch-reindex-change-field-type/
To start, you create an ingest pipeline that converts the fields to the new types.
PUT _ingest/pipeline/convert_pipeline
{
  "description": "converts the sacks_against_total and sacks_against_yards fields from string to long",
  "processors" : [
    {
      "convert" : {
        "field" : "sacks_against_yards",
        "type": "long"
      }
    },
    {
      "convert" : {
        "field" : "sacks_against_total",
        "type": "long"
      }
    }
  ]
}
Or in cURL:
curl -XPUT "http://localhost:9200/_ingest/pipeline/convert_pipeline" -H 'Content-Type: application/json' -d'{ "description": "converts the sacks_against_total field to a long from string", "processors" : [ { "convert" : { "field" : "sacks_against_total", "type": "long" } }, {"convert" : { "field" : "sacks_against_yards", "type": "long" } } ]}'
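Before running the reindex you can dry-run the pipeline with the simulate API; the sample document below is made up:
POST _ingest/pipeline/convert_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "id": 1,
        "sacks_against_total": "12",
        "sacks_against_yards": "5"
      }
    }
  ]
}
The response should show sacks_against_total and sacks_against_yards as numbers rather than strings.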
Then you reindex the index into another one:
POST _reindex
{
  "source": {
    "index": "american_football_sacks_against_stats"
  },
  "dest": {
    "index": "american_football_sacks_against_stats_withLong",
    "pipeline": "convert_pipeline"
  }
}
Or in cURL:
curl -XPOST "http://localhost:9200/_reindex" -H 'Content-Type: application/json' -d'{ "source": { "index": "sportsdb" }, "dest": { "index": "sportsdb_finish", "pipeline": "convert_pipeline" }}'
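Once the reindex has finished, a quick way to confirm the result (assuming the destination index from the cURL example) is to look at the new index's mapping; dynamic mapping should now pick the converted values up as long:
curl -XGET "http://localhost:9200/sportsdb_finish/_mapping?pretty"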
Elasticsearch does not provide the functionality to change types for existing fields.
You can read here for some options:
https://medium.com/@max.borysov/change-field-type-in-elasticsearch-index-2d11bb366517

geo_point in Elastic

I'm trying to map a latitude and longitude to a geo_point in Elastic.
Here's my log file entry:
13-01-2017 ORDER COMPLETE: £22.00 Glasgow, 55.856299, -4.258845
And here's my conf file
input {
  file {
    path => "/opt/logs/orders.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "(?<date>[0-9-]+) (?<order_status>ORDER [a-zA-Z]+): (?<order_amount>£[0-9.]+) (?<order_location>[a-zA-Z ]+)" }
  }
  mutate {
    convert => { "order_amount" => "float" }
    convert => { "order_lat" => "float" }
    convert => { "order_long" => "float" }
    rename => {
      "order_long" => "[location][lon]"
      "order_lat" => "[location][lat]"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "sales"
    document_type => "order"
  }
  stdout {}
}
I start logstash with /bin/logstash -f orders.conf and this gives:
"#version"=>{"type"=>"keyword", "include_in_all"=>false}, "geoip"=>{"dynamic"=>true,
"properties"=>{"ip"=>{"type"=>"ip"},
"location"=>{"type"=>"geo_point"}, "latitude"=>{"type"=>"half_float"},
"longitude"=>{"type"=>"half_float"}}}}}}}}
See? It's seeing location as a geo_point. Yet GET sales/_mapping results in this:
"location": {
"properties": {
"lat": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"lon": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
},
Update
Each time I reindex, I stop Logstash, then remove the .sincedb from /opt/logstash/data/plugins/inputs/file.... I have also made a brand new log file, and I increment the index each time (I'm currently up to sales7).
conf file
input {
  file {
    path => "/opt/ag-created/logs/orders2.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "(?<date>[0-9-]+) (?<order_status>ORDER [a-zA-Z]+): (?<order_amount>£[0-9.]+) (?<order_location>[a-zA-Z ]+), (?<order_lat>[0-9.]+), (?<order_long>[-0-9.]+)( - (?<order_failure_reason>[A-Za-z :]+))?" }
  }
  mutate {
    convert => { "order_amount" => "float" }
  }
  mutate {
    convert => { "order_lat" => "float" }
  }
  mutate {
    convert => { "order_long" => "float" }
  }
  mutate {
    rename => { "order_long" => "[location][lon]" }
  }
  mutate {
    rename => { "order_lat" => "[location][lat]" }
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "sales7"
    document_type => "order"
    template_name => "myindex"
    template => "/tmp/templates/custom-orders2.json"
    template_overwrite => true
  }
  stdout {}
}
JSON file
{
  "template": "sales7",
  "settings": {
    "index.refresh_interval": "5s"
  },
  "mappings": {
    "sales": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  },
  "aliases": {}
}
index => "sales7"
document_type => "order"
template_name => "myindex"
template => "/tmp/templates/custom-orders.json"
template_overwrite => true
}
stdout {}
}
Interestingly, when the geo_point mapping doesn't work (i.e. both lat and long are floats), my data is indexed (30 rows). But when the location is correctly made into a geo_point, none of my rows are indexed.
There are two ways to do this. The first is to create a template for your mapping, so that the correct mapping is applied while indexing your data. Elasticsearch cannot know what your data type is, so you have to tell it these things, as shown below.
First, create a template.json file for your mapping structure:
{
  "template": "sales*",
  "settings": {
    "index.refresh_interval": "5s"
  },
  "mappings": {
    "sales": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  },
  "aliases": {}
}
After that, change your Logstash configuration so that it applies this template to your index:
input {
  file {
    path => "/opt/logs/orders.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "(?<date>[0-9-]+) (?<order_status>ORDER [a-zA-Z]+): (?<order_amount>£[0-9.]+) (?<order_location>[a-zA-Z ]+)" }
  }
  mutate {
    convert => { "order_amount" => "float" }
    convert => { "order_lat" => "float" }
    convert => { "order_long" => "float" }
    rename => {
      "order_long" => "[location][lon]"
      "order_lat" => "[location][lat]"
    }
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "sales"
    document_type => "order"
    template_name => "myindex"
    template => "/etc/logstash/conf.d/template.json"
    template_overwrite => true
  }
  stdout {}
}
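Before indexing any data you can check that Logstash actually uploaded the template and that a newly created index picks up the geo_point mapping (using the names from the configuration above):
curl -XGET "http://localhost:9200/_template/myindex?pretty"
curl -XGET "http://localhost:9200/sales/_mapping?pretty"
Keep in mind that the template is only applied when an index is created, so if the mapping is wrong you need to delete the index (or use a new name) and index again.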
The second option is the ingest node feature. I will update my answer for this option, but for now you can check my dockerized repository. In that example, I used the ingest node feature instead of a template while parsing the location data.
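As a rough idea of what the ingest approach looks like (a sketch only, not the exact pipeline from that repository, and it still assumes the index mapping declares location as geo_point, e.g. via the template above): a set processor can combine the two grok fields into a "lat,lon" string, which Elasticsearch accepts for a geo_point field.
PUT _ingest/pipeline/geo_pipeline
{
  "description": "combine order_lat and order_long into a geo_point-compatible string",
  "processors": [
    {
      "set": {
        "field": "location",
        "value": "{{order_lat}},{{order_long}}"
      }
    }
  ]
}
The name geo_pipeline is made up here; you would reference it from the Logstash elasticsearch output with its pipeline option instead of doing the rename in the mutate filter.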

Understanding ELK Analyzer

I'm new to the ELK 5.1.1 stack and I have a few questions, just for my understanding.
I have set up this stack basically with standard analyzers / filters and everything works great.
My data source is a MySQL backend that I index using Logstash.
I would like to deal with queries containing accents, and hopefully the asciifolding token filter can help achieve this.
First I learned how to create a custom analyzer and save it as a template.
Right now when I query this url http://localhost:9200/_template?pretty I have 2 templates: the Logstash default template named logstash, and my custom template, whose settings are:
"custom_template" : {
"order" : 1,
"template" : "doo*",
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"myCustomAnalyzer" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"refresh_interval" : "5s"
}
},
"mappings" : { },
"aliases" : { }
}
Searching for the keyword Yaoundé returns 70 hits, but when I search for Yaounde I get no hits at all.
Below is my query for the second case:
{
  "query": {
    "query_string": {
      "query": "yaounde",
      "fields": [
        "title"
      ]
    }
  },
  "from": 0,
  "size": 10
}
Please, can somebody help me figure out what I am doing wrong here?
Also, knowing that my data is analyzed by Logstash during the indexing process, do I really have to specify that the analyzer myCustomAnalyzer should be applied at search time, as in this second query?
{
  "query": {
    "query_string": {
      "query": "yaounde",
      "fields": [
        "title"
      ],
      "analyzer": "myCustomAnalyzer"
    }
  },
  "from": 0,
  "size": 10
}
Here is a sample of the output part of my logstash config file
output {
  stdout { codec => json_lines }
  if [type] == "announces" {
    elasticsearch {
      hosts => "localhost:9200"
      document_id => "%{job_id}"
      index => "dooone"
      document_type => "%{type}"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
      document_id => "%{uid}"
      index => "dootwo"
      document_type => "%{type}"
    }
  }
}
Thank You
A good place to start is with the index template documentation of elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
An example for your scenario that could work for the title field:
"custom_template" : {
"order" : 1,
"template" : "doo*",
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"myCustomAnalyzer" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"refresh_interval" : "5s"
}
},
"mappings" : {
"your_type": {
"properties": {
"title": {
"type": "text",
"analyzer": "myCustomAnalyzer"
}
}
}
},
"aliases" : { }
}
An alternative would be to change the dynamic mapping. You can find a good example right here for strings.
https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html#match-mapping-type
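For example, a dynamic template along these lines (a sketch, untested, written for ES 5.x where new string fields become text) would route every dynamically created string field through the custom analyzer:
"mappings": {
  "_default_": {
    "dynamic_templates": [
      {
        "strings_as_custom": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "myCustomAnalyzer"
          }
        }
      }
    ]
  }
}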
Can you show the mapping of your document?
(GET /my_index/my_doc/_mapping)
The analyzer you provide as an argument in your query only applies at search time, not at index time. So if you haven't set this analyzer in your mapping, the string is still indexed with the default analyzer, and it will not match your results.
The analyzer you provide at search time will apply to your query string, but it then looks into the indexed data, which is indexed as "Yaoundé", not "yaounde".
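You can see that mismatch directly with the Analyze API (assuming one of your indices, e.g. dooone, has the template applied):
GET dooone/_analyze
{
  "analyzer": "standard",
  "text": "Yaoundé"
}
GET dooone/_analyze
{
  "analyzer": "myCustomAnalyzer",
  "text": "Yaoundé"
}
The standard analyzer produces the token yaoundé (accent kept), while myCustomAnalyzer folds it to yaounde. Only if the title field is mapped with myCustomAnalyzer at index time will a search for yaounde match.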

Stemming and highlighting for phrase search

My Elasticsearch index is full of large English-text documents. When I search for "it is rare", I get 20 hits with that exact phrase, and when I search for "it is rarely" I get a different 10. How can I get all 30 hits at once?
I've tried creating a multi-field with the english analyzer (below), but if I search in that field I only get results for parts of the phrase (e.g., documents matching it or is or rare) instead of the whole phrase.
"mappings" : {
...
"text" : {
"type" : "string",
"fields" : {
"english" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets_payloads",
"analyzer" : "english"
}
}
},
...
Figured it out!
1) Store two fields: one for the text content (text) and a sub-field with the English-ified stem words (text.english).
2) Create a custom analyzer, based on the default English analyzer, that doesn't strip stop words.
3) Highlight both fields and check each when displaying results to the user.
Here's my index configuration:
{
  mappings: {
    documents: {
      properties: {
        title: { type: 'string' },
        text: {
          type: 'string',
          term_vector: 'with_positions_offsets_payloads',
          fields: {
            english: {
              type: 'string',
              analyzer: 'english_nostop',
              term_vector: 'with_positions_offsets_payloads',
              store: true
            }
          }
        }
      }
    }
  },
  settings: {
    analysis: {
      filter: {
        english_stemmer: {
          type: 'stemmer',
          language: 'english'
        },
        english_possessive_stemmer: {
          type: 'stemmer',
          language: 'possessive_english'
        }
      },
      analyzer: {
        english_nostop: {
          tokenizer: 'standard',
          filter: [
            'english_possessive_stemmer',
            'lowercase',
            'english_stemmer'
          ]
        }
      }
    }
  }
}
And here's what a query looks like:
{
  query: {
    query_string: {
      query: <query>,
      fields: ['text.english'],
      analyzer: 'english_nostop'
    }
  },
  highlight: {
    fields: {
      'text.english': {},
      'text': {}
    }
  }
}
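One note on phrase searches (my reading of this setup, not something tested here): because english_nostop keeps the stop words, "it is rare" and "it is rarely" both index as the token sequence it / is / rare, so even a quoted phrase query against text.english should match both forms. The index name below is a placeholder:
GET /my_index/_search
{
  "query": {
    "query_string": {
      "query": "\"it is rarely\"",
      "fields": ["text.english"],
      "analyzer": "english_nostop"
    }
  }
}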

Which elasticsearch analyzer should I use?

I'm creating a search engine for video games with about 100,000 entries and want to index them with Elasticsearch.
I played around with some analyzer configurations, but I'm not quite sure which configuration is best for ecommerce products.
My current setup looks like this:
:filter => {
  :en_stop_filter => {
    "type" => "stop",
    "stopwords" => ["_english_"]
  },
  :en_stem_filter => {
    "type" => "stemmer",
    "name" => "minimal_english"
  }
},
:analyzer => {
  :ja_analyzer => {
    "type" => "custom",
    "tokenizer" => "kuromoji",
    "filter" => ["icu_folding", "icu_normalizer"],
    "char_filter" => ["html_strip"],
    "mode" => "search"
  },
  :en_analyzer => {
    "type" => "custom",
    "tokenizer" => "icu_tokenizer",
    "filter" => ["icu_folding", "icu_normalizer", "en_stop_filter", "en_stem_filter"],
    "char_filter" => ["html_strip"]
  }
},
:tokenizer => {
  :kuromoji => {
    "type" => "kuromoji_tokenizer"
  }
}
en_analyzer is for English titles and ja_analyzer is for Japanese titles.
Should I use ngrams, or try other types of analyzers?
It's hard for me to compare the search results; maybe someone has experience with ecommerce search and can help me out.
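One practical way to compare configurations is to run the same title through each analyzer with the Analyze API and look at the tokens it produces (a sketch, assuming the settings above are installed on an index called games; the titles are just sample values, and older Elasticsearch versions take the parameters in the URL instead of a JSON body):
GET games/_analyze
{
  "analyzer": "en_analyzer",
  "text": "The Legend of Zelda: Breath of the Wild"
}
GET games/_analyze
{
  "analyzer": "ja_analyzer",
  "text": "ゼルダの伝説 ブレス オブ ザ ワイルド"
}
Comparing the token output for typical user queries makes it much easier to judge whether an additional ngram-based field is worth the extra index size.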
