Elasticsearch: upsert a document with a script when the index does not exist

I'm receiving payloads in Logstash that I push to Elasticsearch in a monthly rolling index, with a script that allows me to override the fields depending on the order of the statuses of those payloads.
Example:
{
  "id" : "abc",
  "status" : "OPEN",
  "field1" : "foo",
  "opening_ts" : 1234567
}
{
  "id" : "abc",
  "status" : "CLOSED",
  "field1" : "bar",
  "closing_ts": 7654321
}
I want that, even if I receive the OPEN payload after the CLOSED one for the id "abc", my Elasticsearch document ends up as:
{
  "_id" : "abc",
  "status": "CLOSED",
  "field1" : "bar",
  "closing_ts": 7654321,
  "opening_ts" : 1234567
}
In order to guarantee that, I have added a script to my elasticsearch output plugin in Logstash:
script => "
  if (ctx._source['status'] == 'CLOSED') {
    // document is already CLOSED: only fill in fields that are still missing
    for (key in params.event.keySet()) {
      if (ctx._source[key] == null) {
        ctx._source[key] = params.event[key];
      }
    }
  } else {
    // document is not CLOSED yet: overwrite everything with the new event
    for (key in params.event.keySet()) {
      ctx._source[key] = params.event[key];
    }
  }
"
But adding this script also added an extra step before the implicit "PUT" of the document, and if the target index does not exist, the script fails and the document is never created (nor is the index).
Do you know how I could handle this error in the script?

You need to resort to scripted upsert:
output {
  elasticsearch {
    index => "your-index"
    document_id => "%{id}"
    action => "update"
    scripted_upsert => true
    script => "... your script..."
  }
}
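With scripted_upsert enabled, Logstash sends each event as a scripted upsert to the _update endpoint, so the script also runs when the document does not exist yet: ctx._source starts from the upsert document (empty by default), and your else branch then creates the document from params.event. Roughly, each event translates into a request like the following — a hedged sketch, shown in the 7.x URL form (on 6.x the path includes the document type):
curl -X POST "localhost:9200/your-index/_update/abc" -H 'Content-Type: application/json' -d'
{
  "scripted_upsert": true,
  "script": {
    "source": "... your script ...",
    "params": { "event": { "id": "abc", "status": "OPEN", "field1": "foo", "opening_ts": 1234567 } }
  },
  "upsert": {}
}
'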

Related

Elasticsearch index not being created with settings from logstash template

I have a bulk upload for a new index that I'm sending to my ES cluster from logstash. As such I want replication and refreshing turned off until the load is done, and I'll re-enable those values after the upload is complete.
I have a config file that looks like the following:
input {
  stdin { type => stdin }
}
filter {
  csv {
    separator => " "
    columns => [ ...]
  }
}
output {
  amazon_es {
    hosts => ["my-domain.us-east-1.es.amazonaws.com"]
    index => "my-index"
    template => "conf/my-index-template.json"
    template_name => "my-index-template-name"
    region => "us-east-1"
  }
}
And the template file looks like:
{
  "template" : "my-index-template-name",
  "mappings" : {
    ...
  },
  "settings" : {
    "index" : {
      "number_of_shards" : "48",
      "number_of_replicas" : "0",
      "refresh_interval": "-1"
    }
  }
}
And when I run Logstash and go to look at the settings for that index, the mappings from this template are all respected, which is good, but everything in the settings section is ignored and the index takes on default values (i.e. number_of_shards=5 and number_of_replicas=1).
Some investigation notes:
If I get the template from ES itself after it's installed, I see the proper values (for both mappings and settings). They just don't seem to be applied to the index.
Also, if I take the contents of the template file and create the index manually with a PUT, it shows up as I would expect.
My Logstash version is 7.3.0 and my Elasticsearch version is 6.7.
Not sure what I'm doing wrong here.
Your index name is my-index, but the template setting in your template file uses my-index-template-name; it needs to be a wildcard pattern that matches your index name, or the same name as your index.
Since you are using Elasticsearch 6.7, you should use index_patterns instead of template in your template file:
{
  "index_patterns" : ["my-index"],
  "mappings" : {
    ...
  },
  "settings" : {
    "index" : {
      "number_of_shards" : "48",
      "number_of_replicas" : "0",
      "refresh_interval": "-1"
    }
  }
}
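Also keep in mind that templates are only applied at index creation time, so an index that was already created with default settings keeps them. A hedged way to check and fix this by hand against the cluster's REST API (the amazon_es output talks to the same API; the localhost host below is an assumption for a local test):
# Install the corrected template manually from the file
curl -X PUT "localhost:9200/_template/my-index-template-name" -H 'Content-Type: application/json' -d @conf/my-index-template.json

# Verify what is actually stored for the template
curl -X GET "localhost:9200/_template/my-index-template-name?pretty"

# Delete the index created with default settings so the next load re-creates it with the template applied
curl -X DELETE "localhost:9200/my-index"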

How to get ElasticSearch output?

I want to add my log documents to Elasticsearch, and then I want to check the documents in Elasticsearch.
The following is the content of the log file:
Jan 1 06:25:43 mailserver14 postfix/cleanup[21403]: BEF25A72965: message-id=<20130101142543.5828399CCAF#mailserver14.example.com>
Feb 2 06:25:43 mailserver15 postfix/cleanup[21403]: BEF25A72999: message-id=<20130101142543.5828399CCAF#mailserver15.example.com>
Mar 3 06:25:43 mailserver16 postfix/cleanup[21403]: BEF25A72998: message-id=<20130101142543.5828399CCAF#mailserver16.example.com>
I am able to run my Logstash instance with the following Logstash configuration file:
input {
  file {
    path => "/Myserver/mnt/appln/somefolder/somefolder2/testData/fileValidator-access.LOG"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  grok {
    patterns_dir => ["/Myserver/mnt/appln/somefolder/somefolder2/logstash/pattern"]
    match => { "message" => "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" }
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    document_id => "test"
    index => "testindex"
    action => "update"
  }
  stdout { codec => rubydebug }
}
I have defined my own grok pattern as:
POSTFIX_QUEUEID [0-9A-F]{10,11}
When I run the Logstash instance, the data is successfully sent to Elasticsearch (the rubydebug output on stdout confirms it).
Now I have the index stored in Elasticsearch under testindex, but when I use curl -X GET "localhost:9200/testindex" I get the following output:
{
  "depositorypayin" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "creation_date" : "1547795277865",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "5TKW2BfDS66cuoHPe8k5lg",
        "version" : {
          "created" : "6050499"
        },
        "provided_name" : "depositorypayin"
      }
    }
  }
}
This is not what is stored inside the index. I want to query the documents inside the index. Please help. (PS: please forgive me for the typos)
The API you used above only returns information about the index itself (docs here). You need to use the Query DSL to search the documents. The following Match All Query will return all the documents in the index testindex:
curl -X GET "localhost:9200/testindex/_search" -H 'Content-Type: application/json' -d'
{
"query": {
"match_all": {}
}
}
'
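For completeness: because the config above forces document_id => "test", all events target that single id. A hedged way to pull that one document back via the search API (which avoids having to know the mapping type):
curl -X GET "localhost:9200/testindex/_search?q=_id:test&pretty"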
Actually, I have edited my config file, which looks like this now:
input {
  . . .
}
filter {
  . . .
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "testindex"
  }
}
And now I am able to fetch the data from Elasticsearch using
curl 'localhost:9200/testindex/_search'
I don't know how it works, but it does now.
Can anyone explain why?

elasticsearch bulk delete by custom field values

I'm building an app with Elasticsearch (5.4) and everything was going well until I tried to delete several documents by field values. My x-ndjson looks like this:
{ "delete" : {} }
{ "id" : "109991" }
{ "delete" : {} }
{ "id" : "109992" }
{ "delete" : {} }
{ "id" : "109993" }
<- empty line
and I am POSTing it to http://localhost:9200/someindex/sometype/_bulk, but it responds with "Malformed action/metadata line [2], expected START_OBJECT or END_OBJECT but found [VALUE_NUMBER]".
Note that my "id" is my custom field, not the _id.
Is something missing in my request?
Thank you
I guess you need to use Delete By Query for this.
POST index/_delete_by_query
{
  "query": {
    "terms": {
      "id": [
        109991,
        109992
      ]
    }
  }
}
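For context on the error: a delete action in the bulk API carries its target in the action's metadata and has no source line, so your { "id" : ... } lines are parsed as the next action and rejected. The bulk form only helps when you know the actual _id values; a hedged sketch of that format (the ids shown are placeholders for real _id values, which in your case differ from the custom id field):
curl -X POST "localhost:9200/someindex/sometype/_bulk" -H 'Content-Type: application/x-ndjson' -d'
{ "delete" : { "_id" : "109991" } }
{ "delete" : { "_id" : "109992" } }
{ "delete" : { "_id" : "109993" } }
'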

Query Mongo Embedded Documents with a size

I have a ruby on rails app using Mongoid and MongoDB v2.4.6.
I have the following MongoDB structure, a record which embeds_many fragments:
{
  "_id" : "76561198045636214",
  "fragments" : [
    {
      "id" : 76561198045636215,
      "source_id" : "source1"
    },
    {
      "id" : 76561198045636216,
      "source_id" : "source2"
    },
    {
      "id" : 76561198045636217,
      "source_id" : "source2"
    }
  ]
}
I am trying to find all records in the database that contain fragments with duplicate source_ids.
I'm pretty sure I need to use $elemMatch as I need to query embedded documents.
I have tried
Record.elem_match(fragments: {source_id: 'source2'})
which works but doesn't restrict to duplicates.
I then tried
Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})
which returns no results (but is a valid query). The query Mongoid produces is:
selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}
Once that works I need to update it so that $size is > 1.
Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)
Any help is much appreciated!
I have been able to achieve the desired outcome, but in testing it's far too slow (it will take many weeks to run across our production system). The problem is the double query per record (we have ~30 million records in production).
Record.where('fragments.source_id' => 'source2').each do |record|
  query = record.fragments.where(source_id: 'source2')
  if query.count > 1
    # contains duplicates, delete all but latest
    query.desc(:updated_at).skip(1).delete_all
  end
  # needed to trigger after_save filters
  record.save!
end
The problem with the current approach here is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way, which is essentially what you need in order to "find the duplicates" within your documents.
For this, MongoDB provides the aggregation framework as probably the best approach. There is no direct "mongoid" style approach to these queries, as those are geared towards the existing "rails" style of dealing with relational documents.
You can, however, access the "moped" form through the .collection accessor on your class model:
Record.collection.aggregate([
  # Find arrays two elements or more as possibles
  { "$match" => {
    "$and" => [
      { "fragments" => { "$not" => { "$size" => 0 } } },
      { "fragments" => { "$not" => { "$size" => 1 } } }
    ]
  }},
  # Unwind the arrays to "de-normalize" as documents
  { "$unwind" => "$fragments" },
  # Group back and get counts of the "key" values
  { "$group" => {
    "_id" => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
    "fragments" => { "$push" => "$fragments.id" },
    "count" => { "$sum" => 1 }
  }},
  # Match the keys found more than once
  { "$match" => { "count" => { "$gte" => 2 } } }
])
That would return you results like this:
{
  "_id" : { "_id": "76561198045636214", "source_id": "source2" },
  "fragments": ["76561198045636216","76561198045636217"],
  "count": 2
}
That at least gives you something to work with on how to deal with the "duplicates" here.
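Building on that, one hedged sketch for acting on the output in bulk (assuming the same Moped-era .collection API as above, and that pipeline holds the stages shown; here the first fragment id per group is kept and the rest are pulled, so adjust which ids you drop if you need to keep the newest instead):
Record.collection.aggregate(pipeline).each do |doc|
  # doc["_id"] is { "_id" => record id, "source_id" => duplicated source }
  duplicate_ids = doc["fragments"].drop(1) # keep the first, drop the rest
  Record.collection.find("_id" => doc["_id"]["_id"]).update(
    "$pull" => { "fragments" => { "id" => { "$in" => duplicate_ids } } }
  )
end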

Mongo DB MapReduce: Emit key from array based on condition

I am new to MongoDB, so excuse me if this is rather trivial. I would really appreciate the help.
The idea is to generate a histogram over some specific values, in this case the mime types of some files. For that I am using a map/reduce job.
I have a Mongo collection with documents in the following form:
{
  "_id" : ObjectId("4fc5ed3e67960de6794dd21c"),
  "name" : "some name",
  "uid" : "some app specific uid",
  "collection" : "some name",
  "metadata" : [
    {
      "key" : "key1",
      "value" : "Plain text",
      "status" : "SINGLE_RESULT"
    },
    {
      "key" : "key2",
      "value" : "text/plain",
      "status" : "SINGLE_RESULT"
    },
    {
      "key" : "key3",
      "value" : 3469,
      "status" : "OK"
    }
  ]
}
Please note that almost every document has more metadata key/value entries than shown here.
Map Reduce job
I tried doing the following:
function map() {
    var mime = "";
    this.metadata.forEach(function (m) {
        if (m.key === "key2") {
            mime = m.value;
        }
    });
    emit(mime, {count: 1});
}

function reduce(key, values) {
    var res = {count: 0};
    values.forEach(function (v) { res.count += v.count; });
    return res;
}

db.collection.mapReduce(map, reduce, {out: {inline: 1}})
This seems to work for a small number of documents (~15K), but the problem is that iterating through all metadata key/values takes a lot of time during the mapping phase. When running this on more documents (~1 million) the operation takes forever.
So my question is:
Is there some way in which I can emit the mime type (the value) directly instead of iterating through all keys and selecting it? Or is there a better way to write the map/reduce functions?
Something like emit (this.metadata.value {$where this.metadata.key:"key2"}) or similar...
Thanks for your help!
Two thoughts ...
First thought: How attached are you to this document schema? Could you instead have the metadata field value as an embedded document rather than an embedded array, like so:
{
  "_id" : ObjectId("4fc5ed3e67960de6794dd21c"),
  "name" : "some name",
  "uid" : "some app specific uid",
  "collection" : "some name",
  "metadata" : {
    "key1" : {
      "value" : "Plain text",
      "status" : "SINGLE_RESULT"
    },
    "key2": {
      "value" : "text/plain",
      "status" : "SINGLE_RESULT"
    },
    "key3" : {
      "value" : 3469,
      "status" : "OK"
    }
  }
}
Then your map step does away with the loop entirely:
function map() {
    emit( this.metadata["key2"].value, { count : 1 } );
}
At that point, you might even be able to cast this as a "group" command rather than a "mapReduce".
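For what it's worth, with that restructured schema the same histogram can also be expressed through the aggregation framework (available since MongoDB 2.2), which tends to be faster than mapReduce for simple counts; a hedged sketch:
// Count documents per mime type stored under metadata.key2.value
db.collection.aggregate([
    { $group: { _id: "$metadata.key2.value", count: { $sum: 1 } } }
])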
Second thought: Absent a schema change like that, particularly if "key2" appears early in the metadata array, you could at least exit the loop eagerly once the key is found to save yourself some iterations (note that you cannot break out of forEach, so use a plain for loop), like so:
function map() {
    var mime = "";
    for (var i = 0; i < this.metadata.length; i++) {
        if (this.metadata[i].key === "key2") {
            mime = this.metadata[i].value;
            break;
        }
    }
    emit(mime, {count: 1});
}
Not sure if either path is the key to victory, but hopefully helpful thoughts. Best of luck!
