Selectively turn off stop words in Elastic Search - elasticsearch

So I would like to turn off stop-word filtering on the username, title, and tags fields, but not on the description field.
As you can imagine, I do not want to filter out a result called "the best", but I do want to stop "the" from affecting the score when it appears in the description field (search "the" on GitHub if you want an example).
Now @javanna says ( Is there a way to "escape" ElasticSearch stop words? ):
In your case I would disable stopwords for that specific field rather than modifying the stopword list, but you could do the latter too if you wish to.
That answer doesn't provide an example, so I searched around and tried the common terms query: http://www.elasticsearch.org/blog/stop-stopping-stop-words-a-look-at-common-terms-query/ which didn't work for me either.
So I searched specifically for disabling stop-word filtering, but the closest I have come is disabling it index-wide ( Can I customize Elastic Search to use my own Stop Word list? ) by attacking the analyzer directly; failing that, the documentation hints at making my own analyzer :/.
What is the best way to selectively disable stop words on certain fields?

I think you already know what to do, which would be to customize your analyzers for certain fields. From what I understand, you just did not manage to find a valid syntax example for that. This is what we used in a project; I hope this example points you in the right direction:
{
  :settings => {
    :analysis => {
      :analyzer => {
        :analyzer_umlauts => {
          :tokenizer => "standard",
          :char_filter => ["filter_umlaut_mapping"],
          :filter => ["standard", "lowercase"],
        }
      },
      :char_filter => {
        :filter_umlaut_mapping => {
          :type => 'mapping',
          :mappings_path => es_config_file("char_mapping")
        }
      }
    }
  },
  :mappings => {
    :company => {
      :properties => {
        [...]
        :postal_city => { :type => "string", :analyzer => "analyzer_umlauts", :omit_norms => true, :omit_term_freq_and_positions => true, :include_in_all => false },
      }
    }
  }
}
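For the stop-word case itself, a minimal sketch of per-field analyzers (recent Elasticsearch JSON syntax; the index, field, and analyzer names are placeholders, and older versions nest properties under a type name and use "string" instead of "text", as in the example above): the fields that must keep words like "the" get an analyzer with no stop filter, while description keeps one.

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      },
      "analyzer": {
        "no_stopwords": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "with_stopwords": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username":    { "type": "text", "analyzer": "no_stopwords" },
      "title":       { "type": "text", "analyzer": "no_stopwords" },
      "tags":        { "type": "text", "analyzer": "no_stopwords" },
      "description": { "type": "text", "analyzer": "with_stopwords" }
    }
  }
}
```

With this mapping, a title "the best" is indexed with "the" intact, while "the" in description is dropped at index and query time.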

Related

Elasticsearch seems to prioritize results with an isolated search term during a full-text search

I am having problems with Elasticsearch. It seems the search term is being isolated in search results.
We have a large subtitle database that was indexed using Elasticsearch.
It seems, however, that our searches prioritize results where the search term appears on its own.
E.g. the search for "Eat" produces:
Oh, skydiving. // Skydiving. // Oh, I got that one. // Eating crazy. // Eating, eating. // Just pass, just pass. // You guys suck at that. // What was that? // Synchronized swimming
AND
it's my last night so we're gonna live // life like there's no tomorrow. // - I think I'd just wanna, // - Eat. // - Bring all the food, // whether it's Mcdonald's, whether it's, // - Ice cream.
We need to INSTEAD prioritize search results where the search term is found WITHIN the sentence, rather than just on its own.
I need help determining what needs to be fixed: the mapping, the filters, the tokenizers, etc.
Here are my settings:
static public function getSettings(){
    return [
        'number_of_shards' => 1,
        'number_of_replicas' => 1,
        'analysis' => [
            'filter' => [
                'filter_stemmer' => [
                    'type' => 'stemmer',
                    'language' => 'english'
                ]
            ],
            'analyzer' => [
                'text_analyzer' => [
                    'type' => 'custom',
                    'stopwords' => [],
                    'filter' => ['lowercase', 'filter_stemmer', 'stemmer'],
                    'tokenizer' => 'standard'
                ],
            ]
        ]
    ];
}
And here is my mapping:
https://gist.github.com/firecentaur/d0e1e196f7fddbb4d02935bec5592009
And here is my search
https://gist.github.com/firecentaur/5ac97bbd8eb02c406d6eecf867afc13c
What am I doing wrong?
This behavior is caused by the TF-IDF similarity, specifically the field-length norm.
When a query matches a field, the match scores higher when the field contains fewer terms.
If you want to adapt this to your use case, you can use a function_score query.
This post should help you find a solution:
How can I boost the field length norm in elasticsearch function score?
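As an alternative to function_score, you can remove the length effect at its root by disabling norms on the field, so that a match in a short subtitle line no longer outscores the same match in a longer one. A sketch, assuming recent Elasticsearch syntax (the index and field names are placeholders; older versions used "omit_norms": true, and the field must be reindexed for the change to take effect):

```json
PUT /subtitles
{
  "mappings": {
    "properties": {
      "subtitle_text": {
        "type": "text",
        "norms": false
      }
    }
  }
}
```

The trade-off is that norms are disabled for all queries against that field, whereas a function_score query lets you keep length normalization and merely dampen it per query.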

elastic stack twitter sample tweets

I am new to the Elastic Stack and not sure how to approach this problem. I have managed to get a live stream of tweets matching a specific keyword using the Twitter input plugin for Logstash, but I want a sample of real-time tweets with no specific keyword, i.e. just a percentage of all real-time tweets. I tried to search for how to do it but cannot find good documentation. I believe I need to use the GET statuses/sample API, but there is no documentation on it. This is what I have for now:
input {
  twitter {
    consumer_key => "consumer_key"
    consumer_secret => "consumer_secret"
    oauth_token => "token"
    oauth_token_secret => "secret"
    keywords => ["something"]
    languages => ["en"]
    full_tweet => true
  }
}
output {
  elasticsearch {}
}
How would I search for all sample tweets without using the keyword?
Thank you so much in advance.
Here's an example random_score query; this should solve your problem:
GET /twitter/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "random_score": {}
        }
      ]
    }
  }
}
Edit - Adding a Logstash config that takes random samples as well (note that Logstash config settings are not comma-separated):
input {
  twitter {
    consumer_key => "consumer_key"
    consumer_secret => "consumer_secret"
    oauth_token => "token"
    oauth_token_secret => "secret"
    keywords => ["something"]
    languages => ["en"]
    full_tweet => true
    use_samples => true
  }
}
output {
  elasticsearch {}
}
use_samples:
Returns a small random sample of all public statuses. The tweets returned by the default access level are the same, so if two different clients connect to this endpoint, they will see the same tweets. If set to true, the keywords, follows, locations, and languages options will be ignored. Default ⇒ false

Can we extract numeric values from string through script in kibana?

I need to extract numeric values from a string and store them in a new field. Can we do this through a scripted field?
Example: 1 hello 3 test
I need to extract 1 and 3.
You can do this through Logstash if you are using Elasticsearch.
Run a Logstash process with a config like:
input {
  elasticsearch {
    hosts => "your_host"
    index => "your_index"
    query => '{ "query": { "match_all": {} } }'
  }
}
filter {
  grok {
    match => { "your_string_field" => "%{NUMBER:num1} %{GREEDYDATA:middle_stuff} %{NUMBER:num2} %{GREEDYDATA:other_stuff}" }
  }
  mutate {
    remove_field => ["middle_stuff", "other_stuff"]
  }
}
output {
  elasticsearch {
    hosts => "your_host"
    index => "your_index"
    document_id => "%{id}"
  }
}
This would essentially overwrite each document in your index with two more fields, num1 and num2 that correspond to the numbers that you are looking for. This is just a quick and dirty approach that would take up more memory, but would allow you to do all of the break up at one time instead of at visualization time.
I am sure there is a way to do this with scripting, look into groovy regex matching where you return a specific group.
Also no guarantee my config representation is correct as I don't have time to test it at the moment.
Have a good day!
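If you prefer to stay inside Kibana, here is a scripted-field sketch in Painless (which has since replaced Groovy as the default scripting language). This is an untested assumption-laden sketch: it assumes the text is also indexed as a keyword sub-field, that script.painless.regex.enabled is set to true in elasticsearch.yml, and your_string_field is a placeholder name:

```painless
// Collect every run of digits in the string field and return them as an array.
def m = /\d+/.matcher(doc['your_string_field.keyword'].value);
def nums = [];
while (m.find()) {
  nums.add(Integer.parseInt(m.group()));
}
return nums;
```

For "1 hello 3 test" this would yield [1, 3], but unlike the Logstash approach the values are computed at query/visualization time rather than stored in the index.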

How to overwrite field value in Kibana?

I am using Logstash to feed data into Elasticsearch and then analyzing that data with Kibana. I have a field that contains numeric identifiers. These are not easy to read. How can I have Kibana overwrite or show a more human-readable value?
More specifically, I have an 'ip.proto' field. When this field contains a 6, it should be shown as 'TCP'. When it contains a 17, it should be shown as 'UDP'.
I am not sure which tool in the ELK stack I need to modify to make this happen.
Thanks
You can use conditionals and the mutate filter:
filter {
  if [ip][proto] == "6" {
    mutate {
      replace => ["[ip][proto]", "TCP"]
    }
  } else if [ip][proto] == "17" {
    mutate {
      replace => ["[ip][proto]", "UDP"]
    }
  }
}
This quickly gets clumsy, and the translate filter is more elegant (and probably faster). Untested example:
filter {
  translate {
    field => "[ip][proto]"
    destination => "[ip][proto]"
    override => true
    dictionary => {
      "6"  => "TCP"
      "17" => "UDP"
    }
  }
}

reparsing a logstash record? fix extracts?

I'm taking a JSON message (CloudTrail, with many objects concatenated together), and by the time I'm done filtering it, Logstash doesn't seem to be parsing the message correctly. It's as if the hash was simply dumped into a string.
Anyhow, here's the input and filter.
input {
  s3 {
    bucket => "stanson-ops"
    delete => false
    #snipped unimportant bits
    type => "cloudtrail"
  }
}
filter {
  if [type] == "cloudtrail" {
    json { # http://logstash.net/docs/1.4.2/filters/json
      source => "message"
    }
    ruby {
      code => "event['RecordStr'] = event['Records'].join('~~~')"
    }
    split {
      field => "RecordStr"
      terminator => "~~~"
      remove_field => [ "message", "Records" ]
    }
  }
}
By the time I'm done, elasticsearch entries include a RecordStr key with the following data. It doesn't have a message field, nor does it have a Records field.
{"eventVersion"=>"1.01", "userIdentity"=>{"type"=>"IAMUser", "principalId"=>"xxx"}}
Note that this is not JSON; it has been parsed into a Ruby hash representation (which is important for the concat-then-split trick to work).
So the RecordStr key looks not quite right as one value. Further, in Kibana, the filterable fields include RecordStr (with no subfields), and the list still contains entries that aren't there anymore: Records.eventVersion, Records.userIdentity.type.
Why is that? How can I get the proper fields?
Edit 1: here's part of the input.
{"Records":[{"eventVersion":"1.01","userIdentity":{"type":"IAMUser",
It's unprettified JSON. It appears the body of the file (the above) is in the message field; the json filter extracts it, and I end up with an array of records in the Records field. That's why I join and split it: I then end up with individual documents, each with a single RecordStr entry. However, the template(?) doesn't seem to understand the new structure.
I've worked out a method that allows for indexing the appropriate CloudTrail fields as you requested. Here are the modified input and filter configs:
input {
  s3 {
    backup_add_prefix => "processed-logs/"
    backup_to_bucket => "test-bucket"
    bucket => "test-bucket"
    delete => true
    interval => 30
    prefix => "AWSLogs/<account-id>/CloudTrail/"
    type => "cloudtrail"
  }
}
filter {
  if [type] == "cloudtrail" {
    json {
      source => "message"
    }
    ruby {
      code => "event.set('RecordStr', event.get('Records').join('~~~'))"
    }
    split {
      field => "RecordStr"
      terminator => "~~~"
      remove_field => [ "message", "Records" ]
    }
    mutate {
      gsub => [
        "RecordStr", "=>", ":"
      ]
    }
    mutate {
      gsub => [
        "RecordStr", "nil", "null"
      ]
    }
    json {
      skip_on_invalid_json => true
      source => "RecordStr"
      target => "cloudtrail"
    }
    mutate {
      add_tag => ["cloudtrail"]
      remove_field => ["RecordStr", "@version"]
    }
    date {
      match => ["[cloudtrail][eventTime]", "ISO8601"]
    }
  }
}
The key observation here is that once the split is done we no longer have valid JSON in the event, so we have to execute the mutate replacements ('=>' to ':' and 'nil' to 'null') before re-parsing. Additionally, I found it useful to take the timestamp from the CloudTrail eventTime and to clean up some unnecessary fields.
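To illustrate the two gsub steps, here is the same transformation in plain Ruby. The sample record follows the one shown in the question, with a nil value added to exercise the second substitution:

```ruby
require 'json'

# After the split, each event's RecordStr holds a Ruby hash inspection
# string, not JSON:
record_str = '{"eventVersion"=>"1.01", "userIdentity"=>{"type"=>"IAMUser", "principalId"=>nil}}'

# Same substitutions as the two mutate/gsub filters: "=>" -> ":" and "nil" -> "null".
json_str = record_str.gsub('=>', ':').gsub('nil', 'null')

# The result is parseable JSON again.
record = JSON.parse(json_str)
puts record['userIdentity']['type']  # prints "IAMUser"
```

Note that such a blind substitution would also rewrite a literal "=>" or "nil" occurring inside string values, which is one reason the json filter above sets skip_on_invalid_json.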
