Which elasticsearch analyzer should I use? - elasticsearch

I'm creating a search engine for video games with about 100,000 entries and want to index them with Elasticsearch.
I played around with some analyzer configurations, but I'm not quite sure which configuration is best for e-commerce products.
My current setup looks like this:
:filter => {
  :en_stop_filter => {
    "type" => "stop",
    "stopwords" => ["_english_"]
  },
  :en_stem_filter => {
    "type" => "stemmer",
    "name" => "minimal_english"
  }
},
:analyzer => {
  :ja_analyzer => {
    "type" => "custom",
    "tokenizer" => "kuromoji",
    "filter" => ["icu_folding", "icu_normalizer"],
    "char_filter" => ["html_strip"],
    "mode" => "search"
  },
  :en_analyzer => {
    "type" => "custom",
    "tokenizer" => "icu_tokenizer",
    "filter" => ["icu_folding", "icu_normalizer", "en_stop_filter", "en_stem_filter"],
    "char_filter" => ["html_strip"]
  }
},
:tokenizer => {
  :kuromoji => {
    "type" => "kuromoji_tokenizer"
  }
}
en_analyzer is used for English titles and ja_analyzer for Japanese titles.
Should I use ngrams, or try other types of analyzers?
It's hard for me to compare the search results; maybe someone with some practice in e-commerce search can help me out.
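If n-grams are the way to go, I imagine the English side would need something roughly like this (an untested sketch in plain Elasticsearch JSON rather than the Ruby form above; the en_edge_ngram filter name is made up):
{
  "settings": {
    "analysis": {
      "filter": {
        "en_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "en_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding", "icu_normalizer", "en_stop_filter", "en_edge_ngram"]
        }
      }
    }
  }
}
As far as I understand, an edge n-gram filter like this would mainly help with prefix / search-as-you-type matching, at the cost of a larger index.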

Related

Nested object in Elasticsearch Using NEST

We have created a nested object in our index mapping as shown:
.Nested<Bedroom>(n => n
    .Name(c => c.Beds)
    .IncludeInParent(true)
    .Properties(pp => pp
        .Number(d => d.Name(c => c.BedId).Type(NumberType.Long))
        .Number(d => d.Name(c => c.PropertyId).Type(NumberType.Long))
        .Number(d => d.Name(c => c.SingleDoubleShared).Type(NumberType.Integer))
        .Number(d => d.Name(c => c.Price).Type(NumberType.Integer))
        .Number(d => d.Name(c => c.RentFrequency).Type(NumberType.Integer))
        .Date(d => d.Name(c => c.AvailableFrom))
        .Boolean(d => d.Name(c => c.Ensuite))
    )
)
However, we are experiencing two problems.
1- The AvailableFrom field does not get included in the index mapping (the following shows the fields as listed on the Kibana index pattern page, with AvailableFrom missing):
beds.bedId
beds.ensuite
beds.price
beds.propertyId
beds.rentFrequency
beds.singleDoubleShared
Thanks JFM for a constructive comment. This is the relevant part of the mapping in Elasticsearch:
"beds" : {
"type" : "nested",
"include_in_parent" : true,
"properties" : {
"availableFrom" : {
"type" : "date"
},
"bedId" : {
"type" : "long"
},
"ensuite" : {
"type" : "boolean"
},
"price" : {
"type" : "integer"
},
"propertyId" : {
"type" : "long"
},
"rentFrequency" : {
"type" : "integer"
},
"singleDoubleShared" : {
"type" : "integer"
}
}
I can see availableFrom here, but not in the index pattern. Why?
2- When we try to index a document with a nested object, the whole application (MVC Core 3) crashes.
Would appreciate any assistance.

ELK - Kibana doesn't recognize geo_point field

I'm trying to create a Tile map in Kibana with geo location points.
For some reason, when I try to create the map, I get the following message in Kibana:
No Compatible Fields: The "logs" index pattern does not contain any of
the following field types: geo_point
My settings:
Logstash (version 2.3.1):
filter {
  grok {
    match => {
      "message" => "MY PATTERN"
    }
  }
  geoip {
    source => "ip"
    target => "geoip"
    add_field => [ "location", "%{[geoip][latitude]}, %{[geoip][longitude]}" ] # added this extra field in case the nested field is the problem
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs"
  }
}
When a log line arrives, I can see it being parsed as it should, and I do get the GeoIP data for a given IP:
"geoip" => {
"ip" => "XXX.XXX.XXX.XXX",
"country_code2" => "XX",
"country_code3" => "XXX",
"country_name" => "XXXXXX",
"continent_code" => "XX",
"region_name" => "XX",
"city_name" => "XXXXX",
"latitude" => XX.0667,
"longitude" => XX.766699999999986,
"timezone" => "XXXXXX",
"real_region_name" => "XXXXXX",
"location" => [
[0] XX.766699999999986,
[1] XX.0667
]
},
"location" => "XX.0667, XX.766699999999986"
ElasticSearch (version 2.3.1):
GET /logs/_mapping returns:
{
  "logs": {
    "mappings": {
      "logs": {
        "properties": {
          "@timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          ...
          "geoip": {
            "properties": {
              ...
              "latitude": {
                "type": "double"
              },
              "location": {
                "type": "geo_point"
              },
              "longitude": {
                "type": "double"
              }
            }
          },
          "location": {
            "type": "geo_point"
          }
        }
      }
    }
  }
}
Kibana (version 4.5.0):
I do see all the data and everything seems to be fine.
Just when I go to "Visualize" -> "Tile map" -> "From a new search" -> "Geo Coordinates", I get this error message:
No Compatible Fields: The "logs" index pattern does not contain any of the following field types: geo_point
Even though I see in the Elasticsearch mapping that the location type is geo_point.
What am I missing?
Found the issue!
I had called the index "logs". Changing the index name to "logstash-logs" (it needs the logstash-* prefix for the default Logstash index template to apply) made everything start to work!
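If renaming the index is not an option, an alternative might be to register an index template for the custom index name so geoip.location is mapped as geo_point before the index is created. An untested sketch (the template name logs_geoip is made up; syntax as for Elasticsearch 2.x):
PUT /_template/logs_geoip
{
  "template": "logs*",
  "mappings": {
    "_default_": {
      "properties": {
        "geoip": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}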

Logstash/Elasticsearch CSV Field Types, Date Formats and Multifields (.raw)

I've been playing around with getting a tab-delimited file into Elasticsearch using the CSV filter in Logstash. Getting the data in was actually incredibly easy, but I'm having trouble getting the field types to come in right when I look at the data in Kibana. Dates and integers continue to come in as strings, so I can't plot by date or do any analysis functions on integers (sum, mean, etc.).
I'm also having trouble getting the .raw version of the fields to populate. For example, in device I have data like "HTC One", but if I do a pie chart in Kibana, it'll show up as two separate groupings, "HTC" and "One". When I try to chart device.raw instead, it comes up as a missing field. From what I've read, it seems like Logstash should automatically create a .raw version of each string field, but that doesn't seem to be happening.
I've been sifting through the documentation, Google and Stack Overflow, but haven't found a solution. Any ideas appreciated! Thanks.
Config file:
# logstash.conf
input {
  file {
    path => "file.txt"
    type => "event"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    columns => ["userid","date","distance","device"]
    separator => " " # literal tab character (the file is tab-delimited)
  }
}
output {
  elasticsearch {
    action => "index"
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "userid"
    workers => 2
    template => "template.json"
  }
  #stdout {
  #  codec => rubydebug
  #}
}
Here's the template file:
# template.json:
{
  "template": "event",
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0,
    "index" : {
      "query" : { "default_field" : "userid" }
    }
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "_source": { "compress": true },
      "dynamic_templates": [
        {
          "string_template" : {
            "match" : "*",
            "mapping": { "type": "string", "index": "not_analyzed" },
            "match_mapping_type" : "string"
          }
        }
      ],
      "properties" : {
        "date" : { "type" : "date", "format": "yyyy-MM-dd HH:mm:ss" },
        "device" : { "type" : "string", "fields": { "raw": { "type": "string", "index": "not_analyzed" } } },
        "distance" : { "type" : "integer" }
      }
    }
  }
}
Figured it out - the "template" value is matched against the index name, not the mapping type. So the "template" : "event" line should have been "template" : "userid".
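So the corrected template would look roughly like this (an abridged sketch; only the "template" pattern changes, the settings and mappings stay as above):
{
  "template": "userid",
  "mappings": {
    "_default_": {
      "properties": {
        "date": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
        "distance": { "type": "integer" }
      }
    }
  }
}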
I found another (easier) way to specify the type of the fields: you can use Logstash's mutate filter to change the type of a field. Simply add the following filter after your csv filter in your Logstash config:
mutate {
  convert => [ "fieldname", "integer" ]
}
For details check out the logstash docs - mutate convert

ElasticSearch : field not returned

I am new to Elasticsearch, please forgive my stupidity.
I can't seem to get the keepalive field out of ES.
{
  "_index" : "2013122320",
  "_type" : "log",
  "_id" : "Y1M18ZItTDaap_rOAS5YOA",
  "_score" : 1.0
}
I can get other fields out of it, for example cdn:
{
  "_index" : "2013122320",
  "_type" : "log",
  "_id" : "2neLlVNKQCmXq6etTE6Kcw",
  "_score" : 1.0,
  "fields" : {
    "cdn" : "-"
  }
}
The mapping is there:
{
  "log": {
    "_timestamp": {
      "enabled": true,
      "store": true
    },
    "properties": {
      "keepalive": {
        "type": "integer"
      }
    }
  }
}
EDIT
We create a new index every hour using the following Perl code:
create_index(
  index    => $index,
  settings => {
    _timestamp         => { enabled => 1, store => 1 },
    number_of_shards   => 3,
    number_of_replicas => 1,
  },
  mappings => {
    varnish => {
      _timestamp => { enabled => 1, store => 1 },
      properties => {
        content_length => { type => 'integer' },
        age            => { type => 'integer' },
        keepalive      => { type => 'integer' },
        host           => { type => 'string', index => 'not_analyzed' },
        time           => { type => 'string', store => 'yes' },
        <SNIPPED>
        location       => { type => 'string', index => 'not_analyzed' },
      }
    }
  }
);
With so little information, I can only guess:
In the mapping you gave, keepalive is not explicitly stored, and Elasticsearch defaults to not storing fields. If you do not store a field, you can only get it via the complete _source, which is stored by default. Alternatively, change the mapping, adding "store" : "yes" to your field, and reindex.
Good luck with ES, it is well worth a few days of learning.
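For example, the keepalive part of the mapping would then look something like this (a minimal sketch, using the same "store" => 'yes' style as the Perl snippet above):
{
  "log": {
    "properties": {
      "keepalive": {
        "type": "integer",
        "store": "yes"
      }
    }
  }
}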

Ignoring Apostrophes (Possessive) In ElasticSearch

I'm trying to get user-submitted queries for "Joe Frankles", "Joe Frankle", and "Joe Frankle's" to match the original text "Joe Frankle's". Right now we're indexing the field this text is in with (Tire / Ruby format):
{ :type => 'string', :analyzer => 'snowball' }
and searching with:
query { string downcased_query, :default_operator => 'AND' }
I tried this unsuccessfully:
create :settings => {
  :analysis => {
    :char_filter => {
      :remove_accents => {
        :type => "mapping",
        :mappings => ["`=>", "'=>"]
      }
    },
    :analyzer => {
      :myanalyzer => {
        :type => 'custom',
        :tokenizer => 'standard',
        :char_filter => ['remove_accents'],
        :filter => ['standard', 'lowercase', 'stop', 'snowball', 'ngram']
      }
    },
    :default => {
      :type => 'myanalyzer'
    }
  }
},
There are two official ways of handling possessive apostrophes:
1) Use the "possessive_english" stemmer as described in the ES docs:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html
Example:
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter" : {
        "my_stemmer" : {
          "type" : "stemmer",
          "name" : "possessive_english"
        }
      }
    }
  }
}
Use other stemmers or snowball in addition to the "possessive_english" filter if you like. It should work, but it's untested code.
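For example (equally untested), chaining the built-in snowball filter after the possessive stemmer in the analyzer above would look like this:
"filter" : ["standard", "lowercase", "my_stemmer", "snowball"]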
2) Use the "word_delimiter" filter:
{
  "index" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "lowercase", "my_word_delimiter"]
        }
      },
      "filter" : {
        "my_word_delimiter" : {
          "type" : "word_delimiter",
          "preserve_original": "true"
        }
      }
    }
  }
}
Works for me :-) ES docs:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
Both will cut off "'s".
I ran into a similar problem; the snowball analyzer alone didn't work for me. I don't know if it's supposed to or not. Here's what I use:
properties: {
  name: {
    boost: 10,
    type: 'multi_field',
    fields: {
      name: { type: 'string', index: 'analyzed', analyzer: 'title_analyzer' },
      untouched: { type: 'string', index: 'not_analyzed' }
    }
  }
}
analysis: {
  char_filter: {
    remove_accents: {
      type: "mapping",
      mappings: ["`=>", "'=>"]
    }
  },
  filter: {},
  analyzer: {
    title_analyzer: {
      type: 'custom',
      tokenizer: 'standard',
      char_filter: ['remove_accents']
    }
  }
}
The Admin indices analyze tool is also great when working with analyzers.
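For example, on more recent Elasticsearch versions you can check what an analyzer produces with the _analyze API (a hypothetical request against an index using the analyzer above; older versions take the analyzer and text as query parameters instead):
GET my_index/_analyze
{
  "analyzer": "title_analyzer",
  "text": "Joe Frankle's"
}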
It looks like in your query you are searching the _all field, but your analyzer is applied only to the individual field. To enable this functionality for the _all field, simply make snowball your default analyzer.
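Something along these lines should do it (a minimal sketch; defining an analyzer named "default" in the index settings makes it the index default):
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "snowball"
        }
      }
    }
  }
}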
