How to perform `sum` and `avg` aggs even if the field mapping is of `text` and `keyword` types? - elasticsearch

I'm trying to perform sum and avg aggregations in my Elasticsearch query. Everything works perfectly fine, but I've encountered a problem -- I want to apply the aforementioned aggs to nested fields that are of text / keyword type.
The reason they are mapped like that is that we use the keyword analyzer in the search API when these specific nested fields and subfields are required.
Here's my mapping:
"eng" : {
"type" : "nested",
"properties" : {
"date_updated" : {
"type" : "long"
},
"soc_angry_count" : {
"type" : "float"
},
"soc_comment_count" : {
"type" : "float"
},
"soc_dislike_count" : {
"type" : "float"
},
"soc_eng_score" : {
"type" : "float"
},
"soc_er_score" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"soc_haha_count" : {
"type" : "float"
},
"soc_kf_score" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"soc_like_count" : {
"type" : "float"
},
"soc_love_count" : {
"type" : "float"
},
"soc_mm_score" : {
"type" : "float"
},
"soc_sad_count" : {
"type" : "float"
},
"soc_save_count" : {
"type" : "float"
},
"soc_share_count" : {
"type" : "float"
},
"soc_te_score" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"soc_view_count" : {
"type" : "float"
},
"soc_wow_count" : {
"type" : "float"
}
}
}
Please focus on the soc_er_score, soc_kf_score and soc_te_score subfields of the eng nested field...
When I perform the following aggs, it works fine:
'aggs' => [
'ENGAGEMENT' => [
'nested' => [
'path' => "eng"
],
'aggs' => [
'ARTICLES' => [
//Use Histogram because the pub_date is of
//long data type
//Use interval 86400 to represent 1 day
'histogram' => [
'field' => "eng.date_updated",
"interval" => "86400",
],
'aggs' => [
'SUM' => [
'sum' => [
"field" => "eng.soc_like_score"
]
]
]
]
]
]
]
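For reference, here is roughly the JSON body that PHP array produces (a sketch, using the same names as above):
"aggs": {
  "ENGAGEMENT": {
    "nested": { "path": "eng" },
    "aggs": {
      "ARTICLES": {
        "histogram": { "field": "eng.date_updated", "interval": 86400 },
        "aggs": {
          "SUM": { "sum": { "field": "eng.soc_like_score" } }
        }
      }
    }
  }
}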
Here's the output after calling the search API (the expected results come back).
BUT if the query is like this:
'aggs' => [
'ENGAGEMENT' => [
'nested' => [
'path' => "eng"
],
'aggs' => [
'ARTICLES' => [
//Use Histogram because the pub_date is of
//long data type
//Use interval 86400 to represent 1 day
'histogram' => [
'field' => "eng.date_updated",
"interval" => "86400",
],
'aggs' => [
'SUM' => [
'sum' => [
"field" => "eng.soc_te_score"
]
]
]
]
]
]
]
But this time the output is not what I expect.
SOLUTIONS PERFORMED
SOLUTION 1 (for confirmation)
After reading some thorough forum discussions, I learned that Java-based parsing is available, but it doesn't seem to work on my end.
Here's my revised query:
'aggs' => [
'SUM' => [
'sum' => [
"field" => "Float.parseFloat(eng.soc_te_score).value"
]
]
]
But unfortunately, it responds with null or 0 values.
By the way, I'm using Laravel as my web framework, which is why my debugger / error message window looks the way it does.
Requesting your help please, thank you in advance!

I would create another numeric subfield in addition to the keyword one. So you can use the keyword field for search and the numeric one for aggregations.
For example, modify your mapping like this:
"soc_er_score" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
},
"numeric" : {
"type" : "long",
"ignore_malformed": true
}
}
},
You can then use:
soc_er_score for full text search
soc_er_score.keyword for sorting, terms aggregations and exact matching
soc_er_score.numeric for sum and other metric aggregations.
If you already have data in your index, simply modify the mapping by adding the new sub-field, like this:
PUT my-index/_mapping/doc
{
  "properties": {
    "eng": {
      "type": "nested",
      "properties": {
        "soc_er_score" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            },
            "numeric" : {
              "type" : "long",
              "ignore_malformed": true
            }
          }
        }
      }
    }
  }
}
And then call the update by query endpoint in order to pick up the new mapping:
POST my-index/_update_by_query
When done, the eng.soc_er_score.numeric field will be indexed for all your existing documents.
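With the new sub-field in place, the original aggregation only needs to point at it; a sketch of the revised query, assuming the same ENGAGEMENT/ARTICLES structure from the question:
'aggs' => [
    'ENGAGEMENT' => [
        'nested' => [
            'path' => "eng"
        ],
        'aggs' => [
            'ARTICLES' => [
                'histogram' => [
                    'field' => "eng.date_updated",
                    "interval" => "86400",
                ],
                'aggs' => [
                    'SUM' => [
                        'sum' => [
                            // numeric copy of the text/keyword field
                            "field" => "eng.soc_er_score.numeric"
                        ]
                    ]
                ]
            ]
        ]
    ]
]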

I was able to solve my problem with the following simple script:
"aggs" = [
'SUM' => [
'sum' => [
"script" => "Float.parseFloat(doc['eng.soc_te_score.keyword'].value)"
]
]
];
With this, even though my nested field is of text and keyword type, I can still compute its average and sum.
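For completeness, here is a sketch of that script aggregation dropped into the same nested/histogram structure from the question; an avg works the same way by swapping 'sum' for 'avg':
'aggs' => [
    'ENGAGEMENT' => [
        'nested' => [
            'path' => "eng"
        ],
        'aggs' => [
            'ARTICLES' => [
                'histogram' => [
                    'field' => "eng.date_updated",
                    "interval" => "86400",
                ],
                'aggs' => [
                    'SUM' => [
                        'sum' => [
                            // parse the keyword value to a float at aggregation time
                            "script" => "Float.parseFloat(doc['eng.soc_te_score.keyword'].value)"
                        ]
                    ],
                    'AVG' => [
                        'avg' => [
                            "script" => "Float.parseFloat(doc['eng.soc_te_score.keyword'].value)"
                        ]
                    ]
                ]
            ]
        ]
    ]
]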

Related

Extract Hashtags and Mentions into separate fields

I am doing a DIY tweet sentiment analyser, and I have an index of tweets like these:
"_source" : {
"id" : 26930655,
"status" : 1,
"title" : "Hereโ€™s 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : null,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan ๐Ÿ™๐Ÿšฉ๐Ÿšฉ๐Ÿ‡ฎ๐Ÿ‡ณ""",
"hashtags" : null,
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
Their mappings and settings are like:
"sentiment-en" : {
"mappings" : {
"properties" : {
"category" : {
"type" : "text"
},
"created_at" : {
"type" : "integer"
},
"hashtags" : {
"type" : "text"
},
"id" : {
"type" : "long"
},
"language" : {
"type" : "integer"
},
"status" : {
"type" : "integer"
},
"title" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
},
"raw_text" : {
"type" : "text"
},
"stop" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "stop_words_filter"
},
"syn" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "synonyms_filter"
}
},
"index_options" : "docs",
"analyzer" : "all_ok_filter"
}
}
}
}
}
"settings" : {
"index" : {
"number_of_shards" : "10",
"provided_name" : "sentiment-en",
"creation_date" : "1627975717560",
"analysis" : {
"filter" : {
"stop_words" : {
"type" : "stop",
"stopwords" : [ ]
},
"synonyms" : {
"type" : "synonym",
"synonyms" : [ ]
}
},
"analyzer" : {
"stop_words_filter" : {
"filter" : [ "stop_words" ],
"tokenizer" : "standard"
},
"synonyms_filter" : {
"filter" : [ "synonyms" ],
"tokenizer" : "standard"
},
"all_ok_filter" : {
"filter" : [ "stop_words", "synonyms" ],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "0",
"uuid" : "Q5yDYEXHSM-5kvyLGgsYYg",
"version" : {
"created" : "7090199"
}
}
Now the problem is I want to extract all the hashtags and mentions into a separate field.
What I want as output:
"id" : 26930655,
"status" : 1,
"title" : "Hereโ€™s 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : BTC,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan ๐Ÿ™๐Ÿšฉ๐Ÿšฉ๐Ÿ‡ฎ๐Ÿ‡ณ""",
"hashtags" : bulls,bears,ATH, ALTSEASON, BSCGem, eth , btc, memecoin, 100xGem, satyasanatan
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
What I have tried so far:
Created a pattern-based tokenizer that reads only hashtags and mentions (and no other tokens) for the hashtag and mentions fields; did not have much success there.
Tried to write an n-gram tokenizer without any analyzers; did not achieve much success there either.
Any help would be appreciated, I am open to reindexing my data. Thanks in advance!
You can use the Logstash Twitter input plugin for indexing the data and configure the Ruby script below in the filter plugin, as mentioned in the blog:
if [message] {
ruby {
code => "event.set('hashtags', event.get('message').scan(/\#[a-z]*/i))"
}
}
You can use the Logstash Elasticsearch input plugin for the source index, configure the above Ruby code in the filter plugin, and use the Logstash Elasticsearch output plugin with the destination index.
input {
elasticsearch {
hosts => "localhost:9200"
index => "current_twitter"
query => '{ "query": { "query_string": { "query": "*" } } }'
size => 500
scroll => "5m"
}
}
filter{
if [message] {
ruby {
code => "event.set('hashtags', event.get('message').scan(/\#[a-z]*/i))"
}
}
}
output {
elasticsearch {
index => "new_twitter"
}
}
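The question also asks for mentions; a minimal extension of the same Ruby filter, assuming mentions use the usual @handle form and writing them to a hypothetical mentions field:
filter {
  if [message] {
    ruby {
      code => "
        event.set('hashtags', event.get('message').scan(/\#[a-z0-9_]*/i))
        event.set('mentions', event.get('message').scan(/\@[a-z0-9_]*/i))
      "
    }
  }
}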
Another option is to use the reindex API with an ingest pipeline, but ingest pipelines do not support Ruby code, so you would need to convert the above Ruby code into a Painless script.
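A rough sketch of that approach (the pipeline name, destination index and hashtags field are made up for illustration, and Painless regex support may need to be enabled via script.painless.regex.enabled):
PUT _ingest/pipeline/extract_hashtags
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "def m = /#\\w+/.matcher(ctx.title); def tags = []; while (m.find()) { tags.add(m.group().substring(1)); } ctx.hashtags = tags;"
      }
    }
  ]
}

POST _reindex
{
  "source": { "index": "sentiment-en" },
  "dest": { "index": "sentiment-en-v2", "pipeline": "extract_hashtags" }
}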

ElasticSearch Multi-Match in Nest

I have this DSL query which works. It returns the result as expected.
GET /filedocuments/_search
{
"query": {
"multi_match": {
"query": "abc",
"fields": ["fileName", "metadata"]
}
}
}
But when it is run through the NEST library below, it returns no results. What have I missed?
var response = await _elasticClient.SearchAsync<FileDocument>(s => s
.Query(q => q
.MultiMatch(c => c
.Fields(f => f.Field(p => p.FileName).Field(p => p.Metadata))
.Query("abc")
)
)
);
This is the mapping:
"fileName" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
and
"metadata" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
Solved after I converted it to .ToUpper().

logstash remove_field not working in order to upload csv to elasticsearch

I'm using elasticsearch, kibana and logstash 6.0.1.
I wish to upload CSV data to Elasticsearch with Logstash while removing some fields (path, @timestamp, @version, host and message). The logstash.conf and emp.csv files are shown below. The upload works if I don't use the remove_field instruction, but I need it. Furthermore, the index was not created.
logstash.conf:
input {
file {
path => "e:\emp.csv"
start_position => "beginning"
}
}
filter {
csv {
separator => ","
columns => ["code","color"]
remove_field => ["path", "@timestamp", "@version", "host", "message"]
}
mutate {convert => ["code", "string"]}
mutate {convert => ["color", "string"]}
}
output {
elasticsearch {
hosts => "http://localhost:9200"
index => "emp5"
user => "elastic"
password => "password"
}
stdout {}
}
emp.csv:
1,blue
2,red
What is missing in this case?
The fields you are trying to delete are not part of the data in your CSV file.
Instead, try this to delete, for example, the path and host fields:
(...)
filter {
csv {
separator => ","
columns => ["code","color"]
}
mutate {
remove_field => ["path", "host"]
}
(...)
For information: if the path and/or host field doesn't exist, there's no problem. The plugin removes a field if it exists and does nothing if it does not.
Edit:
I have tested it on a fresh Elastic Stack.
You can delete the index with:
curl -X DELETE "localhost:9200/emp5"
Also note that with your current config Logstash will read the file only once.
You can change that behaviour by adding sincedb_path => "/dev/null" (or sincedb_path => "NUL" on Windows) inside the input { file { ... } } section, as sketched below.
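A sketch of the full input block (Windows variant; use "/dev/null" instead of "NUL" on Linux):
input {
  file {
    path => "e:\emp.csv"
    start_position => "beginning"
    sincedb_path => "NUL"
  }
}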
Then, after Logstash has run, verify the result with:
curl -X GET "localhost:9200/emp5?pretty"
{
"emp5" : {
"aliases" : { },
"mappings" : {
"doc" : {
"properties" : {
"#timestamp" : {
"type" : "date"
},
"#version" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"code" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"color" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"message" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : "5",
"blocks" : {
"read_only_allow_delete" : "true"
},
"provided_name" : "emp5",
"creation_date" : "1576099826712",
"number_of_replicas" : "1",
"uuid" : "reXYzqPgQryYcASoov9l5A",
"version" : {
"created" : "6080599"
}
}
}
}
}
As you can see there is no host and path field.

Must match query is not working in elastic search

I am trying to find all videos with the name "The Shining",
but also with parent_id = 189 and parent_type = "folder".
My query seems to connect all of the match statements with "OR" instead of "AND".
What am I doing wrong?
{
"fields": ["name","parent_id","parent_type"],
"query": {
"and": {
"must":[
{
"match": {
"name": "The Shining"
}
},
{
"match": {
"parent_id": 189
}
},
{
"match": {
"parent_type": "folder"
}
}
]
}
}
}
Mapping:
{"video" : {
"mappings" : {
"video" : {
"properties" : {
"homepage_tags" : {
"type" : "nested",
"properties" : {
"id" : {
"type" : "integer"
},
"metaType" : {
"type" : "string"
},
"tag_category_name" : {
"type" : "string"
},
"tag_category_order" : {
"type" : "integer"
},
"tag_name" : {
"type" : "string"
}
}
},
"id" : {
"type" : "integer"
},
"name" : {
"type" : "string"
},
"parent_id" : {
"type" : "integer"
},
"parent_type" : {
"type" : "string"
},
"provider" : {
"type" : "string"
},
"publish" : {
"type" : "string"
},
"query" : {
"properties" : {
"bool" : {
"properties" : {
"must" : {
"properties" : {
"match" : {
"properties" : {
"name" : {
"type" : "string"
},
"parent_id" : {
"type" : "long"
}
}
}
}
}
}
}
}
},
"source_id" : {
"type" : "string"
},
"subtitles" : {
"type" : "nested",
"include_in_root" : true,
"properties" : {
"content" : {
"type" : "string",
"store" : true,
"analyzer" : "no_stopwords"
},
"end_time" : {
"type" : "float"
},
"id" : {
"type" : "integer"
},
"parent_type" : {
"type" : "string"
},
"start_time" : {
"type" : "float"
},
"uid" : {
"type" : "integer"
},
"video_id" : {
"type" : "string"
},
"video_parent_id" : {
"type" : "integer"
},
"video_parent_type" : {
"type" : "string"
}
}
},
"tags" : {
"type" : "nested",
"properties" : {
"content" : {
"type" : "string"
},
"end_time" : {
"type" : "string"
},
"id" : {
"type" : "integer"
},
"metaType" : {
"type" : "string"
},
"parent_type" : {
"type" : "string"
},
"start_time" : {
"type" : "string"
}
}
},
"uid" : {
"type" : "integer"
},
"vid_url" : {
"type" : "string"
}
}
}
}}}
I was able to solve the problem by reading Volodymyr's answer above.
Additional matches were being found because they were matching "the". I tried to add the operator argument, but that did not work unfortunately. What I did instead was to use "match_phrase" and also switch my two other match fields to "term" -- see my answer below:
[
'query' => [
'bool' => [
'must' => [
[
'match_phrase' => [
'name' => $searchTerm
]
],
[
'term' => [
'parent_id' => intVal($parent_id)
]
],
[
'term' => [
'parent_type' => strtolower($parent_type)
]
]
]
]
]
];
You will get matches on "the" or "shining" because of the match query, and you will get a score depending on how many terms match. One of the easiest fixes would be to add the operator and:
{
  "match": {
    "name": {
      "query": "The Shining",
      "operator": "and"
    }
  }
}
But it's not what you need since this will also match names "shining The" or "The sun is shining".
Another option: if you need exact matches on name, you would need to make the name field non-analyzed. In ES 5 you can set the field type to keyword.
In addition, I would recommend using a bool query with term queries, since they do exact matching.
{
"fields": ["name","parent_id","parent_type"],
"query": {
"bool": {
"must":[{
"term": {
"name": "The Shining"
}
},
{
"term": {
"parent_id": 189
}
},
{
"term": {
"parent_type": "folder"
}
}
]
}
}
}
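Note that with the mapping shown in the question, name is an analyzed string, so a term query on "The Shining" will not match as-is. A minimal sketch of the non-analyzed variant suggested above (ES 5+ syntax, using a hypothetical raw sub-field):
"name" : {
  "type" : "text",
  "fields" : {
    "raw" : { "type" : "keyword" }
  }
}
The term clause would then target the sub-field:
{ "term": { "name.raw": "The Shining" } }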

How to map geoip field in logstash with elasticsearch in order to display it in tile map of Kibana4

I'd like to display geoip fields in tile map of Kibana4.
Using the standard / automatic logstash geoip mapping to elasticsearch it all works fine.
However when creating a non-standard geoip field, I am not quite sure how to customize the elasticsearch-template.json in logstash in order to represent this field correctly in elasticsearch so that it can be chosen in Kibana4 for tile map creation.
Sure, customizing the standard template is not the best way - it would be better to create a custom template and point to it in the elasticsearch output of logstash.conf. I just quickly wanted to check how the mapping has to be defined, so I modified the standard template.
My logstash.conf:
input {
tcp {
port => 514
type => syslog
}
udp {
port => 514
type => syslog
}
}
filter {
# Standard geoip field is automatically mapped by logstash to
# elastic search by using the elasticsearch-template.json file
geoip { source => "host" }
grok {
match => [
"message", "<%{POSINT:syslog_pri}>%{YEAR} %{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:device} <%{POSINT:status}> %{WORD:activity} %{DATA:inout} \(%{DATA:msg}\) Src:%{IPV4:src} SPort:%{INT:sport} Dst:%{IPV4:dst} DPort:%{INT:dport} IPP:%{INT:ipp} Rule:%{INT:rule} Interface:%{WORD:iface}",
"message", "<%{POSINT:syslog_pri}>%{YEAR} %{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:device} <%{POSINT:status}> %{WORD:activity} %{DATA:inout} \(%{DATA:msg}\) Src:%{IPV4:src} Dst:%{IPV4:dst} IPP:%{INT:ipp} Rule:%{INT:rule} Interface:%{WORD:iface}",
"message", "<%{POSINT:syslog_pri}>%{YEAR} %{SYSLOGTIMESTAMP:syslog_timestamp} %{DATA:device} <%{POSINT:status}> %{WORD:activity} %{DATA:inout} \(%{DATA:msg}\) Src:%{IPV4:src} Dst:%{IPV4:dst} Type:%{POSINT:type} Code:%{INT:code} IPP:%{INT:ipp} Rule:%{INT:rule} Interface:%{WORD:iface}"
]
}
# Is not mapped automatically by logstash in that it can be
# chosen in Kibana4 for tile map creation
geoip {
source => "src"
target => "src_geoip"
}
}
output {
elasticsearch {
host => "localhost"
protocol => "http"
}
}
My ...logstash-1.4.2\lib\logstash\outputs\elasticsearch\elasticsearch-template.json:
{
"template" : "logstash-*",
"settings" : {
"index.refresh_interval" : "5s"
},
"mappings" : {
"_default_" : {
"_all" : {"enabled" : true},
"dynamic_templates" : [ {
"string_fields" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string", "index" : "analyzed", "omit_norms" : true,
"fields" : {
"raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
}
}
}
} ],
"properties" : {
"#version": { "type": "string", "index": "not_analyzed" },
"geoip" : {
"type" : "object",
"dynamic": true,
"path": "full",
"properties" : {
"location" : { "type" : "geo_point" }
}
},
"src_geoip" : {
"type" : "object",
"dynamic": true,
"path": "full",
"properties" : {
"location" : { "type" : "geo_point" }
}
}
}
}
}
}
UPDATE: I haven't figured out yet when this JSON file gets applied in Elasticsearch. I followed the hints outlined in this question and copied the JSON file to a config/templates folder in the Elasticsearch directory. After deleting the indices and restarting Elasticsearch, the template was applied successfully.
Anyway, the field "src_geoip.location" still does not show up in the tile map creation form of Kibana4 (only the standard geoip.location field does).
Try overwriting the template after editing it. Re-create the indices in Kibana after the config change.
output {
elasticsearch {
template_overwrite => "true"
...
}
}
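A sketch of the complete output section with those options (the template path is hypothetical; point it at your edited template file):
output {
  elasticsearch {
    host => "localhost"
    protocol => "http"
    template => "/path/to/elasticsearch-template.json"
    template_overwrite => true
  }
}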
You also need to add a mapping for the src_geoip object in the index template on your Elasticsearch instance. To set the default template for all indexes that match "logstash-netflow-*", execute the following on your Elasticsearch instance:
curl -XPUT localhost:9200/_template/logstash-netflow -d '{
"template" : "logstash-netflow-*",
"mappings" : {
"_default_" : {
"_all" : {
"enabled" : false
},
"properties" : {
"#timestamp" : { "index" : "analyzed", "type" : "date" },
"#version" : { "index" : "analyzed", "type" : "integer" },
"src_geoip" : {
"dynamic" : true,
"type" : "object",
"properties" : {
"area_code" : { "type" : "long" },
"city_name" : { "type" : "string" },
"continent_code" : { "type" : "string" },
"country_code2" : { "type" : "string" },
"country_code3" : { "type" : "string" },
"country_name" : { "type" : "string" },
"dma_code" : { "type" : "long" },
"ip" : { "type" : "string" },
"latitude" : { "type" : "double" },
"location" : { "type" : "double" },
"longitude" : { "type" : "double" },
"postal_code" : { "type" : "string" },
"real_region_name" : { "type" : "string" },
"region_name" : { "type" : "string" },
"timezone" : { "type" : "string" }
}
},
"netflow" : { ....snipped......
}
}
}
}}'
