Elasticsearch Nested search with match - elasticsearch

I have an orders index with a nested structure like the one below:
{
  "baskets": [
    {
      "id": 123,
      "product": {
        "title": "blabla",
        "tags": [
          {"value": "Tag1"},
          {"value": "Tag2"}
        ]
      }
    },
    {
      "id": 1234,
      "product": {
        "title": "blabla2",
        "tags": [
          {"value": "Tag3"},
          {"value": "Tag2"}
        ]
      }
    }
  ]
}
I want to run a "similarity search" on this data, along these lines:
"Match on the tags, and treat the documents with the most matching tags as the closest. (For example, if I search with Tag1 and Tag2, a basket with both tags should match better.)"
How can I match against arrays in a nested search?
I need to search like: where baskets.product.tags.value in ['Tag1', 'Tag2', 'Tag4']
Better matches must be ordered first.
My search query is below, but it does not work well.
[
    'bool' => [
        'should' => [
            'nested' => [
                'path' => 'baskets.product.tags',
                'query' => ['terms' => ['baskets.product.tags' => $concept_array]]
            ]
        ]
    ]
]

Elasticsearch does not really treat arrays as arrays. An array is more like multiple values in the same field, so you cannot easily score on the number of matching elements. A score script might do it, but it would be very brittle.
Instead, you can add a new field to the product, such as all_tags, of type text, and put all of the tags in it. In your example it would contain "Tag1 Tag2" and "Tag3 Tag2". (If your tags can contain spaces, you may need to change the separator and tune the analyzer.) Then a simple
"match": {"baskets.product.all_tags": "Tag3 Tag2"}
inside your nested query should do the trick, scoring documents with closely matching baskets higher.
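As a rough sketch of the suggestion above, the snippet below builds the extra all_tags value at index time and the request body for the nested match query. The helper names with_all_tags and build_tag_query are made up for illustration, and the nested path "baskets" is an assumption, since the actual mapping is not shown in the question.

```python
def with_all_tags(product):
    """Return a copy of `product` with a space-joined `all_tags` text field."""
    tags = [tag["value"] for tag in product.get("tags", [])]
    return {**product, "all_tags": " ".join(tags)}


def build_tag_query(search_tags, nested_path="baskets"):
    """Build a nested `match` query body.

    `nested_path` is an assumption: it must be whichever field is
    actually mapped as `nested` in the orders index.
    """
    return {
        "query": {
            "nested": {
                "path": nested_path,
                "query": {
                    # The match query analyzes the text, so each tag becomes a term
                    "match": {"baskets.product.all_tags": " ".join(search_tags)}
                },
            }
        }
    }


product = with_all_tags(
    {"title": "blabla", "tags": [{"value": "Tag1"}, {"value": "Tag2"}]}
)
body = build_tag_query(["Tag1", "Tag2", "Tag4"])
```

The resulting body can be sent as the search request; because match scores each matching term, baskets sharing more tags with the search text should generally rank higher.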

Related

Logstash giving _rubyexception while adding a field and altering its value

Logstash version 6.5.4
I want to create jobExecutionTime field when status is COMPLETE and set its value as current_timestamp-created_timestamp.
These are a few lines from my config file:
match => { "message" => '%{DATA:current_timestamp},%{WORD:status},%{DATA:created_timestamp}' }
if [status] == "COMPLETE" {
  mutate {
    add_field => [ "jobExecutionTime" , "null" ]
  }
  ruby {
    code => "event.set('jobExecutionTime', event.get('current_timestamp') - event.get('created_timestamp'))"
  }
}
This is my input:
"created_timestamp" => "2022-07-10 23:50:03.644"
"current_timestamp" => "2022-07-10 23:50:03.744"
"status" => "COMPLETE"
I am getting this as output
"jobExecutionTime" => "null",
"exportFrequency" => "RECURRENT",
"successfulImportMilestone" => 0,
"tags" => [
[0] "_rubyexception"
],
Here jobExecutionTime is set to null rather than the expected value.
Your [created_timestamp] and [current_timestamp] fields are strings. You cannot do math on strings; you need to convert them to an object type that supports arithmetic. In this case you should use date filters to convert them to LogStash::Timestamp objects.
If you add
date { match => [ "created_timestamp", "ISO8601" ] target => "created_timestamp" }
date { match => [ "current_timestamp", "ISO8601" ] target => "current_timestamp" }
to your filter section then your ruby filter will work as-is, and you will get
"created_timestamp" => 2022-07-11T03:50:03.644Z,
"current_timestamp" => 2022-07-11T03:50:03.744Z,
"jobExecutionTime" => 0.1

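The same idea can be illustrated outside Logstash. The sketch below is plain Python, not Logstash code: it parses the two sample timestamps from the question into date objects before subtracting them. The function name execution_time_seconds is hypothetical.

```python
from datetime import datetime

# Format of the sample timestamps in the question, e.g. "2022-07-10 23:50:03.644"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def execution_time_seconds(created, current):
    """Parse both strings into datetimes, then subtract them."""
    created_at = datetime.strptime(created, TIMESTAMP_FORMAT)
    current_at = datetime.strptime(current, TIMESTAMP_FORMAT)
    return (current_at - created_at).total_seconds()


elapsed = execution_time_seconds(
    "2022-07-10 23:50:03.644", "2022-07-10 23:50:03.744"
)
print(elapsed)
```

Subtracting the raw strings would raise a TypeError here, which is the Python counterpart of the _rubyexception tag in the output above.
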
Elasticsearch Aggregation using Nested returns empty buckets

Elasticsearch: v7.2
Application: Laravel v.5.7
Hello and good day! I'm using Elasticsearch to build a report on my documents, and I need to present data from fields that are nested.
I have the following mappings. My index web has a field called ent:
Now I have the following query using aggs. My goal is to present the entities that occur most often in my documents:
'aggs' => [
    'ENT' => [
        'nested' => [
            'path' => 'ent'
        ],
        'aggs' => [
            'TOP_ENTITIES' => [
                'terms' => [
                    'field' => 'ent.ent_count'
                ]
            ]
        ]
    ]
]
What I find weird is that when I target the ent.ent_count field, the buckets work perfectly fine: I get the distinct ent_count values together with their respective doc_count, which is the total number of occurrences of each ent_count:
BUT when I target the ent.ent_name field, it returns empty buckets:
'aggs' => [
    'ENT' => [
        'nested' => [
            'path' => 'ent'
        ],
        'aggs' => [
            'TOP_ENTITIES' => [
                'terms' => [
                    'field' => 'ent.ent_name.keyword'
                ]
            ]
        ]
    ]
]
RESULTS IN
With non-nested fields this works perfectly fine. Am I doing something wrong with my query? Even the examples in the documentation use the same approach.
There is no way to solve this problem without changing the mappings.
Instead of leaving the nested ent.ent_name field as a text field, we found that nested fields holding short words should be of KEYWORD type:
After changing the mapping to keyword, everything worked perfectly fine. (A terms aggregation works on exact values, which keyword fields provide; analyzed text fields do not support it by default.)
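Since the original mapping is not shown, here is a minimal sketch of what the changed mapping and the matching aggregation could look like, written as Python dictionaries that mirror the request bodies. The integer type for ent_count and the helper name top_entities_agg are assumptions for illustration.

```python
# Assumed mapping: `ent` is nested and `ent_name` is a keyword field.
mapping = {
    "mappings": {
        "properties": {
            "ent": {
                "type": "nested",
                "properties": {
                    "ent_name": {"type": "keyword"},
                    "ent_count": {"type": "integer"},  # assumed type
                },
            }
        }
    }
}


def top_entities_agg(field="ent.ent_name", size=10):
    """Build a nested terms aggregation on `field` under the `ent` path."""
    return {
        "aggs": {
            "ENT": {
                "nested": {"path": "ent"},
                "aggs": {
                    "TOP_ENTITIES": {"terms": {"field": field, "size": size}}
                },
            }
        }
    }


agg_body = top_entities_agg()
```

With ent_name mapped directly as keyword, the aggregation targets ent.ent_name itself rather than an ent.ent_name.keyword sub-field, which only exists if the mapping defines it.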

Require Elasticsearch highlight words enclosed with Double Quotes (") instead of chopping them

Elasticsearch: v7.2
Application: PHP - Laravel v5.7
Hello and good day!
I'm developing a web application similar to a search engine, where a user enters words that are assigned to the variable $keywords. I then use this query to search my index:
$params = [
    'index' => 'my_index',
    'type' => 'my_type',
    'from' => 0,
    'size' => 10,
    'body' => [
        "query" => [
            'bool' => [
                'must' => [
                    [
                        "query_string" => [
                            "fields" => ['title', 'content'],
                            "query" => $keywords
                        ]
                    ]
                ]
            ]
        ]
    ]
];
$articles = $client->search($params);
Now, in line with my previous post, I was able to count the number of times my $keywords occurred within the documents of my index.
Here's my highlight query, which is attached to the $params above:
"highlight" => [
    "fields" => [
        "content" => ["number_of_fragments" => 0],
        "title" => ["number_of_fragments" => 0]
    ],
    'require_field_match' => true
]
Even though the $keywords are enclosed in double quotation marks ("), which I added so that the words would be followed strictly, the highlighter still chops the $keywords into separate words.
For example, my $keywords contains "Ayala Alabang", but when I display the output, it looks like this:
The $keywords are highlighted separately, even though in the output they are adjacent to each other.
Is there any tweak or revision I can make to my query? I found some related posts and questions in forums, but their last replies were from March 2019. Any advice would be an excellent help with this dilemma.
After a few days of digging through the documentation, I found a way to properly separate the keywords that are found in a document.
STEP 1
Set "explain" => true in your $params:
$params = [
    'index' => "myIndex",
    'type' => "myType",
    'size' => 50,
    'explain' => true,
    'body' => [
        'query' => [
            'match_all' => [
                // your elasticsearch query here
            ]
        ]
    ]
];
STEP 2
Then fetch the result by calling $client->search($params):
$result = $client->search($params);
Each hit in $result will then include a long literal EXPLANATION in which your keywords and their frequencies are displayed in text form.
Try displaying it via dd($result['hits']['hits'][0]['_explanation'])
NOTE: the problem here is that the _explanation array key contains a lot of nested arrays, so we came up with a recursive function to look for the keywords and their frequencies.
STEP 3
You need to create a function that returns the string BETWEEN two other strings:
public static function get_string_between($string, $start, $end)
{
    // Prepend a space so that a match at position 0 differs from "not found"
    $string = ' ' . $string;
    $ini = strpos($string, $start);
    if ($ini == 0) return '';
    $ini += strlen($start);
    $len = strpos($string, $end, $ini) - $ini;
    return substr($string, $ini, $len);
}
STEP 4
Then create the recursive function:
public static function extract_kwds($expln, $kwds)
{
    foreach ($expln as $k => $v) {
        if ($k === 'description' && strpos(json_encode($v), 'weight(') !== false) {
            // The keyword is whatever sits between ':' and ')' in the description
            $kwd = self::get_string_between($v, ':', ')');
            // The frequency is read from the first detail entry, "score(freq=...)"
            $freq = intval(self::get_string_between($expln['details'][0]['description'], 'score(freq=', ')'));
            $kwds[$kwd] = isset($kwds[$kwd]) ? $kwds[$kwd] + $freq : $freq;
        }
        if ($k === 'details' && count($v) != 0) {
            // Recurse into every nested explanation (static method, so self:: rather than $this->)
            foreach ($v as $k2 => $v2) {
                $kwds = self::extract_kwds($v2, $kwds);
            }
        }
    }
    return $kwds;
}
FINALLY
I was able to fetch all the keywords together with their frequencies, that is, how many times each keyword appeared in the documents.
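For readers who prefer to see the recursion in isolation, here is a rough Python rendering of the same walk over an explanation tree. The sample dictionary is hand-written to mimic the 'weight(' and 'score(freq=' markers the PHP code looks for; it is not real Elasticsearch output, and the function names are illustrative only.

```python
def string_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or ''."""
    i = text.find(start)
    if i == -1:
        return ""
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else ""


def extract_keywords(node, counts=None):
    """Recursively sum keyword frequencies found in an explanation tree."""
    counts = {} if counts is None else counts
    description = node.get("description", "")
    details = node.get("details", [])
    if "weight(" in description and details:
        keyword = string_between(description, ":", ")")
        freq = string_between(details[0].get("description", ""), "score(freq=", ")")
        if freq:
            counts[keyword] = counts.get(keyword, 0) + int(float(freq))
    for child in details:
        extract_keywords(child, counts)
    return counts


# Hand-written sample shaped like a nested explanation (assumption, not real output)
sample = {
    "description": "sum of:",
    "details": [
        {"description": "weight(content:ayala)",
         "details": [{"description": "score(freq=2.0)", "details": []}]},
        {"description": "weight(content:alabang)",
         "details": [{"description": "score(freq=1.0)", "details": []}]},
    ],
}
keyword_counts = extract_keywords(sample)
```

Note that the key is everything between the colon and the closing parenthesis, so if a real description carries extra text there, it ends up in the key as well.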

Hierarchical data matching and displaying

In my log files, I have data that represents a hierarchy of items, much like an HTTP log file might show the hierarchy of a website.
I may have data such as this
41 2016-01-01 01:41:32-500 show:category:all
41 2016-01-01 04:11:20-500 show:category:animals
42 2016-01-02 01:41:32-500 show:item:wallaby
42 2016-01-02 01:41:32-500 show:home
and I would have 3 items in here: %{NUMBER:terminal}, %{TIMESTAMP_ISO8601:ts} and (?<info>([^\r])*)
I parse the info data into an array using mutate and split to convert lvl1:lvl2:lvl3 into ['lvl1','lvl2','lvl3'].
I'm interested in aggregating the data to get counts at various levels easily, such as counting all records where info[0] is the same or where info[0] and info[1] are the same. (and be able to select time range and terminal)
Is there a way to set up kibana to visualize this kind of information?
Or should I change the way the filter is matching the data to make the data easier to access?
The depth of the levels varies, but I can be pretty certain that the maximum is 5 levels, so I could parse the text into separate fields lvl1 lvl2 lvl3 lvl4 lvl5 instead of putting them in an array.
I agree with your way of parsing the data, but I would add a few steps to make it directly aggregatable and easy to visualize in Kibana.
The approach should be:
Grok the data using %{NUMBER:terminal} %{TIMESTAMP_ISO8601:ts} and (?<info>([^\r])*), as given in your question.
Use mutate with split, after which you will have the data as an array, as you mentioned.
Add a field for level 1 with add_field => [ "fieldname", "%{[arrayname][0]}" ]
Add a field for level 2 with add_field => [ "fieldname", "%{[arrayname][1]}" ]
Add a field for level 3 with add_field => [ "fieldname", "%{[arrayname][2]}" ]
Then you can use Kibana directly to visualize this information.
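To make the effect of those steps concrete, the sketch below performs the same split in plain Python and counts records that share the first two levels. It only illustrates the resulting data shape and is not Logstash code; the function name split_levels and the lvl_N field names are chosen for illustration.

```python
from collections import Counter


def split_levels(info, max_levels=5):
    """Split 'lvl1:lvl2:...' into separate fields, one per level present."""
    parts = info.split(":")[:max_levels]
    return {f"lvl_{i}": part for i, part in enumerate(parts, start=1)}


# The info values from the sample log lines in the question
events = ["show:category:all", "show:category:animals", "show:item:wallaby", "show:home"]
records = [split_levels(event) for event in events]

# Count records whose first two levels are the same
by_first_two = Counter((r.get("lvl_1"), r.get("lvl_2")) for r in records)
```

Once each level is its own field, the same grouping can be done in Kibana with an ordinary terms aggregation on one field, or on two fields nested inside each other.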
My solution:
input {
  file {
    path => "C:/Temp/zipped/*.txt"
    start_position => beginning
    ignore_older => 0
    sincedb_path => "C:/temp/logstash_temp2.sincedb"
  }
}
filter {
  grok {
    match => ["message","^%{NOTSPACE}\[%{NUMBER:terminal_id}\] %{NUMBER:log_level} %{NUMBER} %{TIMESTAMP_ISO8601:ts} \[(?<facility>([^\]]*))\] (?<lvl>([^$|\r])*)"]
  }
  mutate {
    split => ["lvl", ":"]
    add_field => {"lvl_1" => "%{lvl[0]}"}
    add_field => {"lvl_2" => "%{lvl[1]}"}
    add_field => {"lvl_3" => "%{lvl[2]}"}
    add_field => {"lvl_4" => "%{lvl[3]}"}
    add_field => {"lvl_5" => "%{lvl[4]}"}
    add_field => {"lvl_6" => "%{lvl[5]}"}
    add_field => {"lvl_7" => "%{lvl[6]}"}
    add_field => {"lvl_8" => "%{lvl[7]}"}
    lowercase => [ "terminal_id" ] # set to lowercase so that it can be used for index - additional filtering may be required
  }
  date {
    # dd is day of month; DD would be day of year
    match => ["ts", "YYYY-MM-dd HH:mm:ssZZ"]
  }
}
filter {
  # remove the level fields whose placeholder was never resolved
  if [lvl_1] =~ /%\{lvl\[0\]\}/ {mutate {remove_field => [ "lvl_1" ]}}
  if [lvl_2] =~ /%\{lvl\[1\]\}/ {mutate {remove_field => [ "lvl_2" ]}}
  if [lvl_3] =~ /%\{lvl\[2\]\}/ {mutate {remove_field => [ "lvl_3" ]}}
  if [lvl_4] =~ /%\{lvl\[3\]\}/ {mutate {remove_field => [ "lvl_4" ]}}
  if [lvl_5] =~ /%\{lvl\[4\]\}/ {mutate {remove_field => [ "lvl_5" ]}}
  if [lvl_6] =~ /%\{lvl\[5\]\}/ {mutate {remove_field => [ "lvl_6" ]}}
  if [lvl_7] =~ /%\{lvl\[6\]\}/ {mutate {remove_field => [ "lvl_7" ]}}
  if [lvl_8] =~ /%\{lvl\[7\]\}/ {mutate {remove_field => [ "lvl_8" ]}}
  mutate {
    remove_field => [ "lvl","host","ts" ] # do not keep this data
  }
}
output {
  if [facility] == "mydata" {
    elasticsearch {
      hosts => ["localhost:9200"]
      # dd is day of month; DD would be day of year
      index => "logstash-mydata-%{terminal_id}-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "logstash-other-%{terminal_id}-%{+YYYY.MM.dd}"
    }
  }
  # stdout { codec => rubydebug }
}

Perl 2D array comparison issues

I'm coding a perl script that audits a library and compares the list of installed software with a list from another machine to ensure that they are working off of the same stuff. I've taken the raw data and placed it into two, 2-dimensional arrays of size Nx4 where N is the number of software titles. For example:
[Fileset1], [1.0.2.3], [COMMITTED], [Description of file]
[Fileset2], [2.4.2.2], [COMMITTED], [Description of a different file]
....
I now need to compare the two lists to find discrepancies, whether they be missing files or level differences. Not being a Perl pro yet, the only way I can conceive of doing this is to compare each element of the first array against each element of the other array, looking first for matching filesets with different levels or for filesets with no match at all. Then I would have to repeat the process with the other list to ensure that I had found all possible differences. Obviously, with this procedure I'm looking at a cost of more than n^2 comparisons. I was wondering if there was some application of grep, or something similar, that I could use to avoid this when comparing libraries with upwards of 20,000 entries.
In short, I need to compare two 2 dimensional arrays and keep track of the differences for each list, instead of merely finding the intersection of the two.
Thanks in advance for the help!
The output is a little unwieldy, but I like Data::Diff for tasks like this:
use strict;
use warnings;
use Data::Diff 'Diff';
use Data::Dumper;
my @a = ( ["Fileset1", "1.0.2.3", "COMMITTED", "Description of file" ],
          ["Fileset2", "2.4.2.2", "COMMITTED", "Description of a different file" ],
          ["Fileset3", "1.2.3.4", "COMMITTED", "Description of a different file" ] );
my @b = ( ["Fileset1", "1.0.2.3", "COMMITTED", "Description of file" ],
          ["Fileset2", "2.4.2.99", "COMMITTED", "Description of a different file" ] );
# Diff takes references to the two structures being compared
my $out = Diff(\@a, \@b);
print Dumper($out);
Result:
$VAR1 = {
'diff' => [
{
'uniq_a' => [
'2.4.2.2'
],
'same' => [
{
'same' => 'COMMITTED',
'type' => ''
},
{
'same' => 'Description of a different file',
'type' => ''
},
{
'same' => 'Fileset2',
'type' => ''
}
],
'type' => 'ARRAY',
'uniq_b' => [
'2.4.2.99'
]
}
],
'uniq_a' => [
[
'Fileset3',
'1.2.3.4',
'COMMITTED',
'Description of a different file'
]
],
'same' => [
{
'same' => [
{
'same' => '1.0.2.3',
'type' => ''
},
{
'same' => 'COMMITTED',
'type' => ''
},
{
'same' => 'Description of file',
'type' => ''
},
{
'same' => 'Fileset1',
'type' => ''
}
],
'type' => 'ARRAY'
}
],
'type' => 'ARRAY'
};
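A different technique from Data::Diff, and one that speaks to the efficiency concern in the question, is to index each list by fileset name in a hash and compare the key sets. The sketch below shows that idea in Python with dictionaries, using the same sample rows; the function name compare_filesets is hypothetical, and the same approach carries over to Perl hashes.

```python
def compare_filesets(list_a, list_b):
    """Compare two lists of [name, level, state, description] rows.

    Each list is indexed by name once, so the cost grows roughly
    linearly with the number of entries instead of quadratically.
    """
    levels_a = {row[0]: row[1] for row in list_a}
    levels_b = {row[0]: row[1] for row in list_b}
    only_a = sorted(levels_a.keys() - levels_b.keys())
    only_b = sorted(levels_b.keys() - levels_a.keys())
    level_diffs = {
        name: (levels_a[name], levels_b[name])
        for name in sorted(levels_a.keys() & levels_b.keys())
        if levels_a[name] != levels_b[name]
    }
    return only_a, only_b, level_diffs


a = [
    ["Fileset1", "1.0.2.3", "COMMITTED", "Description of file"],
    ["Fileset2", "2.4.2.2", "COMMITTED", "Description of a different file"],
    ["Fileset3", "1.2.3.4", "COMMITTED", "Description of a different file"],
]
b = [
    ["Fileset1", "1.0.2.3", "COMMITTED", "Description of file"],
    ["Fileset2", "2.4.2.99", "COMMITTED", "Description of a different file"],
]
only_a, only_b, level_diffs = compare_filesets(a, b)
```

This keeps the differences for each list separately (only_a, only_b) alongside the level mismatches, rather than only the intersection.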
