Elasticsearch seems to prioritize results with an isolated search term during a full text search - elasticsearch

I am having problems with Elasticsearch. It seems the search term is being isolated in search results.
We have a large subtitle database that was indexed using Elasticsearch.
It seems, however, that our searches prioritize results where the search term appears on its own.
For example, a search for "Eat" produces:
Oh, skydiving. // Skydiving. // Oh, I got that one. // Eating crazy. // Eating, eating. // Just pass, just pass. // You guys suck at that. // What was that? // Synchronized swimming
AND
it's my last night so we're gonna live // life like there's no tomorrow. // - I think I'd just wanna, // - Eat. // - Bring all the food, // whether it's Mcdonald's, whether it's, // - Ice cream.
We need to INSTEAD prioritize search results where the search term is found WITHIN a sentence, rather than just on its own.
I need help determining what needs to be fixed: the mapping, the filters, the tokenizers, etc.
Here are my settings:
static public function getSettings() {
    return [
        'number_of_shards' => 1,
        'number_of_replicas' => 1,
        'analysis' => [
            'filter' => [
                'filter_stemmer' => [
                    'type' => 'stemmer',
                    'language' => 'english'
                ]
            ],
            'analyzer' => [
                'text_analyzer' => [
                    'type' => 'custom',
                    'stopwords' => [],
                    'filter' => ['lowercase', 'filter_stemmer', 'stemmer'],
                    'tokenizer' => 'standard'
                ],
            ]
        ]
    ];
}
And here is my mapping:
https://gist.github.com/firecentaur/d0e1e196f7fddbb4d02935bec5592009
And here is my search:
https://gist.github.com/firecentaur/5ac97bbd8eb02c406d6eecf867afc13c
What am I doing wrong?

This behavior is most likely caused by TF/IDF scoring, specifically the field-length norm.
When a query matches a field, the match scores higher when the field contains fewer words.
If you want to adapt this to your use case, you can use a function_score query.
This post should help you find a solution:
How can I boost the field length norm in elasticsearch function score?
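For reference, here is a minimal sketch of such a query. The index name subtitles, the field name text, and the numeric field subtitle_length (a word count you would have to store at index time) are assumptions for illustration, not taken from the question; the idea is to multiply the relevance score by a factor that grows with document length, counteracting the bias toward short fields:
POST subtitles/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "text": "eat" }
      },
      "field_value_factor": {
        "field": "subtitle_length",
        "modifier": "log1p",
        "missing": 1
      },
      "boost_mode": "multiply"
    }
  }
}
Depending on your Elasticsearch version, disabling norms on the field in the mapping can achieve a similar effect by removing field length from scoring altogether.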

Related

Is it possible to change a field using a previous value in Logstash

I'm searching for a way in Logstash to store a variable and use or modify its value when a term matches a pattern.
Here is an example of my data source:
2017-04-12 15:49:57,641|OK|file1|98|||
2017-04-12 15:49:58,929|OK|file2|1387|null|msg_fils|
2017-04-12 15:49:58,931|OK|file3|2|msg_pere|msg_fils|
2017-04-12 15:50:17,666|OK|file1|25|||
2017-04-12 15:50:17,929|OK|file2|1387|null|msg_fils|
I'm using this grok filter to parse my source:
grok {
  match => { "message" => '%{TIMESTAMP_ISO8601:msgdates:date}\|%{WORD:verb}\|%{DATA:component}\|%{NUMBER:temps:int}\|%{DATA:msg_pere}\|%{DATA:msg_fils}\|' }
}
But in fact I want to set the first field using the value from the previous line that contains file1.
Can you tell me if it's possible or not?
Thanks
I have found a solution to my issue, and I'm sharing it here.
I'm using a plugin named logstash-filter-memorize, which can be installed with the command:
logstash-plugin install logstash-filter-memorize
So my filter looks like this:
grok {
  match => { "message" => '%{TIMESTAMP_ISO8601:msgdates:date}\|%{WORD:verb}\|%{DATA:component}\|%{NUMBER:temps:int}\|%{DATA:msg_pere}\|%{DATA:msg_fils}\|' }
}
if [component] =~ "file1" {
  mutate {
    add_field => [ "msg_id", "%{msgdates}" ]
  }
  memorize {
    fields => [ "msg_id" ]
    default => { "msg_id" => "NOTFOUND" }
  }
}
memorize {
  fields => [ "msg_id9" ]
}
I hope it can be useful for others.

Fos Elastica: remove common words (or, and, etc.) from search query

Hello, I'm trying to get query results using FosElasticaBundle with the query below. I can't find a working example for filtering out common words like "and" and "or"; if possible, it would also be good if these words were not highlighted. My attempt so far:
$searchForm = $this->createForm(SearchFormType::class, null);
$searchForm->handleRequest($request);

$matchQuery = new \Elastica\Query\Match();
$matchQuery->setField('_all', $queryString);

$searchQuery = new \Elastica\Query();
$searchQuery->setQuery($matchQuery);
$searchQuery->setHighlight(array(
    'fields' => array(
        'title'   => new \stdClass(),
        'content' => new \stdClass()
    ),
    'pre_tags' => array(
        '<strong>'
    ),
    'post_tags' => array(
        '</strong>'
    ),
    // number_of_fragments expects an integer; 0 returns the whole field instead of fragments
    'number_of_fragments' => 0
));
Thanks in advance ;)
Do you want "and" and "or" to be ignored entirely, or just not to contribute to the score of your search?
If that's the case, you may want to use stop words on your Elasticsearch index.
Here's a reference:
https://www.elastic.co/guide/en/elasticsearch/guide/current/using-stopwords.html
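For illustration, here is a minimal sketch of index analysis settings that remove English stop words with the built-in stop token filter; the index, filter, and analyzer names are assumptions, and you would then point your title and content fields at the analyzer in your mapping (or in your FosElastica configuration):
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "text_without_stopwords": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop"]
        }
      }
    }
  }
}
Because stop words are then dropped at index and search time, they will not match and therefore will not be highlighted either.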

Logstash output from json parser not being sent to elasticsearch

This is kind of a follow-up to another one of my questions:
JSON parser in logstash ignoring data?
But this time I feel the problem is clearer than last time and might be easier for someone to answer.
I'm using the JSON parser like this:
# Parse all the JSON
json {
  source => "MFD_JSON"
  target => "PARSED"
  add_field => { "%{FAMILY_ID}" => "%{[PARSED][platform][family_id][1]}_%{[PARSED][platform][family_id][0]}" }
}
The part of the output for one of the logs in logstash.stdout looks like this:
"FACILITY_NUM" => "1",
"LEVEL_NUM" => "7",
"PROGRAM" => "mfd_status",
"TIMESTAMP" => "2016-01-12T11:00:44.570Z",
MORE FIELDS
There are a whole bunch of fields like the ones above that work when I remove the JSON code. When I add the JSON filter, the whole log just disappears from Elasticsearch/Kibana for some reason. The bit added by the JSON filter is below:
"PARSED" => {
"platform" => {
"boot_mode" => [
[0] 2,
[1] "NAND"
],
"boot_ver" => [
[0] 6,
[1] 1,
[2] 32576,
[3] 0
],
WHOLE LOT OF OTHER VARIABLES
"family_id" => [
[0] 14,
[1] "Hatchetfish"
],
A WHOLE LOT MORE VARIABLES
},
"flash" => [
[0] 131072,
[1] 7634944
],
"can_id" => 1700,
"version" => {
"kernel" => "3.0.35 #2 SMP PREEMPT Thu Aug 20 10:40:42 UTC 2015",
"platform" => "17.0.32576-r1",
"product" => "next",
"app" => "53.1.9",
"boot" => "2013.04 (Aug 20 2015 - 10:33:51)"
}
},
"%{FAMILY_ID}" => "Hatchetfish 14"
Let's pretend the JSON won't work; I'm okay with that for now, but it shouldn't mess with everything else to do with the log in Elasticsearch/Kibana. Also, at the end I've got FAMILY_ID as a field that I added separately using add_field. At the very least that should show up, right?
If someone's seen something like this before, it would be a great help.
Also, sorry for spamming almost the same question twice.
SAMPLE LOG LINE:
1452470936.88 1448975468.00 1 7 mfd_status 000E91DCB5A2 load {"up":[38,1.66,0.40,0.13],"mem":[967364,584900,3596,116772],"cpu":[1299,812,1791,3157,480,144],"cpu_dvfs":[996,1589,792,871,396,1320],"cpu_op":[996,50]}
The sample line will be parsed (everything after load is JSON), and in stdout I can see that it is parsed successfully, but I don't see it in Elasticsearch.
This is my output code:
elasticsearch {
  hosts => ["localhost:9200"]
  document_id => "%{fingerprint}"
}
stdout { codec => rubydebug }
A lot of my Logstash filter is in the other question, but I think all the relevant parts are in this question now.
If you want to check it out here's the link: JSON parser in logstash ignoring data?
Answering my own question here. It's not the ideal answer, but if anyone has a similar problem to mine, you can try this out.
# Parse all the JSON
json {
  source => "MFD_JSON"
  target => "PARSED"
  add_field => { "%{FAMILY_ID}" => "%{[PARSED][platform][family_id][1]}_%{[PARSED][platform][family_id][0]}" }
}
That's how I parsed all the JSON before; I kept at the trial and error hoping I'd get it eventually. I was about to just use a grok filter to pull out the bits that I wanted, which is an option if this doesn't work for you. I came back to this later and thought "What if I removed everything afterwards?" for some reason that I've since forgotten. In the end I did this:
json {
  source => "MFD_JSON"
  target => "PARSED_JSON"
  add_field => { "FAMILY_ID" => "%{[PARSED_JSON][platform][family_id][1]}_%{[PARSED_JSON][platform][family_id][0]}" }
  remove_field => [ "PARSED_JSON" ]
}
So, extract the field or fields you're interested in, and then remove the field made by the parser at the end. That's what worked for me. I don't know why, but it might work for other people too.

Add reusable field type to elasticsearch

Is it possible to define a custom field type and reuse that definition for multiple fields? I'm trying to do something like a template, but I don't want it to be defined dynamically.
For example, I have something in the system called "keywords"; keywords always have a specific mapping:
'keywords' => [
    'type' => 'object',
    'properties' => [
        'id' => [
            'type' => 'integer'
        ],
        'name' => [
            'type' => 'string',
            'position_offset_gap' => 100,
            'analyzer' => 'my_keyword',
        ]
    ]
]
I have these throughout the system (post, media, folder, etc.), and I have two kinds that are very similar, let's say keywords and categories. It's the same definition; I just keep them separate for business reasons.
Ideally, what I would like to do is define a "keyword" type, and then for a field I would just define:
'keywords' => [
    'type' => 'keyword'
]
or something similar. Then, when I want to change that definition, I can do it in one place for all the fields using it.
Is this possible in Elasticsearch? I'd prefer not to use an index template because I like having my mappings explicit.
I would recommend using dynamic templates plus a naming convention for keyword fields. In practice, for example:
(1) Define a dynamic template that maps any field whose name starts with k_ to your custom mapping:
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "keywords": {
            "match": "k_*",
            "mapping": {
              "type": "keyword",
              ...
            }
          }
        }
      ]
    }
  }
}
(2) Add the k_ prefix to the name of any field that should apply your custom mapping (e.g., k_post, k_media, ...)
Of course, you can choose any other naming convention for your keyword fields (e.g., *_keywords, k*, ...).
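As a quick illustration (the index name and field values are made up), indexing a document with k_-prefixed fields would then pick up the template mapping automatically, while other fields are mapped as usual:
PUT my_index/_doc/1
{
  "k_post": "travel",
  "k_media": "photo",
  "description": "mapped by the normal rules, not by the template"
}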

Selectively turn off stop words in Elastic Search

So I would like to turn off stop word filtering on the username, title, and tags fields but not the description field.
As you can imagine, I do not want to filter out a result called "the best", but I do want to stop "the" from affecting the score when it appears in the description field (search for "the" on GitHub if you want an example).
Now @javanna says (in Is there a way to "escape" ElasticSearch stop words?):
In your case I would disable stopwords for that specific field rather than modifying the stopword list, but you could do the latter too if you wish to.
No example was provided, so I searched around and tried the common terms query (http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/), which didn't work for me either.
So I searched specifically for ways to stop the filtering of stop words; however, the closest I have come is disabling it index-wide (Can I customize Elastic Search to use my own Stop Word list?) by attacking the analyzer directly, or, failing that, the documentation hints at making my own analyzer :/
What is the best way to selectively disable stop words on certain fields?
I think you already know what to do, which would be to customize your analyzers for certain fields. From what I understand, you did not manage to create a valid syntax example for that. This is what we used in a project; I hope this example points you in the right direction:
{
  :settings => {
    :analysis => {
      :analyzer => {
        :analyzer_umlauts => {
          :tokenizer => "standard",
          :char_filter => ["filter_umlaut_mapping"],
          :filter => ["standard", "lowercase"],
        }
      },
      :char_filter => {
        :filter_umlaut_mapping => {
          :type => 'mapping',
          :mappings_path => es_config_file("char_mapping")
        }
      }
    }
  },
  :mappings => {
    :company => {
      :properties => {
        [...]
        :postal_city => { :type => "string", :analyzer => "analyzer_umlauts", :omit_norms => true, :omit_term_freq_and_positions => true, :include_in_all => false },
      }
    }
  }
}
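Applied to the stop-word case in the question, a rough sketch might look like the following. The analyzer names and index name are assumptions, and the mapping syntax is for a recent Elasticsearch version (older versions use "string" instead of "text" and nest properties under a type name): username, title, and tags use an analyzer without a stop filter, while description uses one with the built-in English stop filter:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "no_stop_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "with_stop_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username":    { "type": "text", "analyzer": "no_stop_analyzer" },
      "title":       { "type": "text", "analyzer": "no_stop_analyzer" },
      "tags":        { "type": "text", "analyzer": "no_stop_analyzer" },
      "description": { "type": "text", "analyzer": "with_stop_analyzer" }
    }
  }
}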
