Share data between Elasticsearch data nodes on recipient request - elasticsearch

I have two Elasticsearch data nodes, Slave and Master.
M and S can communicate with each other; however, for security reasons S cannot push data to M as it arrives. M must request data from S, and (assuming no other requirements on what data S exports) M then receives the requested data from S.
S's data is then incorporated into M's data.
Is this behaviour achievable with Elasticsearch? Unless I am mistaken, neither replication nor snapshotting achieves this behaviour, and while I am aware that I could use S's REST API to receive this data on M before purging the copied data from S, that solution seems clunky and prone to error.
Is there an elegant solution to achieve this architecture?

It is true that Cross Cluster Replication (CCR) is a potential solution for this, but CCR requires the most expensive version of Elasticsearch, and there is a free alternative.
The elasticsearch input and output plugins for Logstash work for this, albeit with some tweaking to get them to behave exactly as you want.
Below is a crude example which queries one Elasticsearch node for data and exports it to another. This does mean that you need a Logstash instance sitting between the Slave and Master nodes to handle the transfer.
input {
  elasticsearch {
    docinfo => true #Necessary to get metadata (index, type, id) info
    hosts => "192.168.0.1" #Slave (source) Elasticsearch instance
    query => '{ "query": { "query_string": { "query": "*" } } }' #Query used to select documents; this example returns all data, which is a bad idea when combined with the schedule below
    schedule => "* * * * *" #Run periodically; this example runs every minute
  }
}
output {
  elasticsearch {
    hosts => "192.168.0.2:9200" #Master (destination) Elasticsearch instance
    index => "replica.%{[@metadata][_index]}"
    document_id => "%{[@metadata][_id]}"
  }
}
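As the comment on the query above warns, matching every document on a one-minute schedule re-copies the whole data set on every run. A minimal sketch of one way around that, assuming the documents carry an @timestamp field (not stated in the original answer), is to restrict the input query to the window covered by the schedule:
input {
  elasticsearch {
    docinfo => true
    hosts => "192.168.0.1"
    #Only fetch documents indexed within the last minute, matching the one-minute schedule
    query => '{ "query": { "range": { "@timestamp": { "gte": "now-1m" } } } }'
    schedule => "* * * * *"
  }
}
The output block stays the same as above.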

Related

ElasticSearch retrieves documents slowly

I'm using the Java API to retrieve records from Elasticsearch; it takes approximately 5 seconds to retrieve 100,000 documents (records/rows) in my Java application.
Is that slow for Elasticsearch, or is it normal?
Here are the index settings:
I tried to get better performance, but without result. Here is what I did:
Set the Elasticsearch heap size to 3 GB (it was the 1 GB default): -Xms3g -Xmx3g
Migrated Elasticsearch to an SSD from a 7200 RPM hard drive
Retrieved only one field instead of 30
Here is my Java implementation code:
private void getDocuments() {
    int counter = 1;
    try {
        lgg.info("started");
        TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));
        SearchResponse scrollResp = client.prepareSearch("ebpp_payments_union")
                .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                .setQuery(QueryBuilders.matchAllQuery())
                .setScroll(new TimeValue(1000))
                .setFetchSource(new String[] { "payment_id" }, null)
                .setSize(10000)
                .get();
        do {
            for (SearchHit hit : scrollResp.getHits().getHits()) {
                if (counter % 100000 == 0) {
                    lgg.info(counter + "--" + hit.getSourceAsString());
                }
                counter++;
            }
            scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute()
                    .actionGet();
        } while (scrollResp.getHits().getHits().length != 0);
        client.close();
    } catch (UnknownHostException e) {
        e.printStackTrace();
    }
}
I know that TransportClient is deprecated; I tried the RestHighLevelClient as well, but it does not change anything.
Do you know how to get better performance?
Should I change something in ElasticSearch or modify my Java code?
Performance troubleshooting/tuning is hard to do without understanding everything involved, but that does not seem very fast. Because this is a single-node cluster you're going to run into some performance issues. If this were a production cluster, you would have at least one replica for each shard, which can also be used for reads.
A few other things you can do:
Index your documents based on your most frequently searched attribute - this writes all of the documents with the same attribute value to the same shard, so ES does less work when reading. (This won't help you here since you have a single shard.)
Add multiple replica shards so you can fan out the reads across nodes in the cluster (once again, you need to actually have a cluster).
Don't put the master role on the same boxes as your data - if you have a moderate or large cluster, you should have boxes that are neither master nor data but are the boxes your app connects to, so they can manage the meta work for the searches and let the data nodes focus on data.
Use "query_then_fetch" - unless you are using weighted searches, in which case you should probably stick with DFS.
I see three possible axes for optimization:
1/ Sort your documents on the _doc key:
Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option.
(documentation source)
2/ Reduce your page size; 10000 seems a high value. Can you run some tests with reduced values like 5000 / 1000?
3/ Remove the source filtering
.setFetchSource(new String[] { "payment_id" }, null)
Source filtering can be heavy, since the Elasticsearch node needs to read the source, transform it into an object and then filter it. So can you try removing this? The network load will increase, but it's a trade-off :)

Mongoid each + set vs Criteria#set vs update_all + $addToSet

I was wondering what is better performance- and memory-wise: iterating over all objects in a collection and calling set/add_to_set, calling set/add_to_set directly on the Criteria, or using update_all with set/add_to_set.
# update_all
User.where(some_query).update_all(
  {
    '$addToSet': {
      :'some.field.value' => :value_to_add
    }
  }
)

# each do + add_to_set
User.where(some_query).each do |user|
  user.add_to_set(:'some.field.value' => :value_to_add)
end

# Criteria#add_to_set
User.where(some_query).add_to_set(
  :'some.field.value' => :value_to_add
)
Any input is appreciated. Thanks!
I started the MongoDB server with the verbose flag. Here's what I got.
Option 1. update_all applied on a selector
2017-04-25 COMMAND command production_v3.$cmd command: update { update: "products", updates: [ { q: { ... }, u: { $addToSet: { test_field: "value_to_add" } }, multi: true, upsert: false } ], ordered: true }
I removed some output so that it is easier to read. The flow is:
Mongoid generates a single command with the query and update specified.
The MongoDB server gets the command. It goes through the collection and updates each match in [roughly] one go.
Note! You may verify this in the source code or take it for granted. Since Mongoid, as per my terminology, generates the command to send in step 1, it does not check your models. E.g. if 'some.field.value' is not one of the fields in your User model, the command will still go through and persist in the DB.
Option 2. each on a selector
I get find commands like below followed by multiple getMore-s:
2017-04-25 COMMAND command production_v3.products command: find { find: "products", filter: { ... } } 0ms
I also get a huge number of update-s:
2017-04-25 COMMAND command production_v3.$cmd command: update { update: "products", updates: [ { q: { _id: ObjectId('52a6db196c3f4f422500f255') }, u: { $addToSet: { test_field: { $each: [ "value_to_add" ] } } }, multi: false, upsert: false } ], ordered: true } 0ms
The flow is radically different from the 1st option:
Mongoid sends a simple query to the MongoDB server. Provided your collection is large enough and the query covers a material chunk of it, the following happens in a loop:
[loop] MongoDB responds with a subset of all matches and leaves the rest for the next iteration.
[loop] Mongoid gets an array of matching items in Hash format. Mongoid parses each entry and initializes a User instance for it. That's an expensive operation!
[loop] For each User instance from the previous step, Mongoid generates an update command and sends it to the server. Sockets are expensive too.
[loop] MongoDB gets the command and goes through the collection until the first match, then updates it. Each update is quick, but it adds up inside a loop.
[loop] Mongoid parses the response and updates its User instance accordingly. Expensive and unnecessary.
Option 3. add_to_set applied on a selector
Under the hood it is equivalent to Option 1. Its CPU and memory overhead is immaterial for the sake of the question.
Conclusion.
Option 2 is so much slower that there is no point in benchmarking. In the particular case I tried, it resulted in 1000s of requests to MongoDB and 1000s of User class initializations. Options 1 and 3 resulted in a single request to MongoDB and relied on MongoDB's highly optimized engine.

Logstash doc_as_upsert cross index in Elasticsearch to eliminate duplicates

I have a logstash configuration that uses the following in the output block in an attempt to mitigate duplicates.
output {
  if [type] == "usage" {
    elasticsearch {
      hosts => ["elastic4:9204"]
      index => "usage-%{+YYYY-MM-dd-HH}"
      document_id => "%{[@metadata][fingerprint]}"
      action => "update"
      doc_as_upsert => true
    }
  }
}
The fingerprint is calculated from a SHA1 hash of two unique fields.
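For context, a fingerprint like that is typically produced with Logstash's fingerprint filter. A minimal sketch, where field_a, field_b and the key are placeholders rather than the actual values used here:
filter {
  fingerprint {
    #field_a and field_b stand in for the two unique fields mentioned above
    source => ["field_a", "field_b"]
    concatenate_sources => true
    method => "SHA1"
    key => "some-arbitrary-key"
    target => "[@metadata][fingerprint]"
  }
}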
This works when Logstash sees the same doc in the same index, but since the command that generates the input data doesn't have a reliable rate at which different documents appear, Logstash will sometimes insert duplicate docs into a different date-stamped index.
For example, the command that Logstash runs to get the input generally returns the last two hours of data. However, since I can't definitively tell when a doc will appear/disappear, I run the command every fifteen minutes.
This is fine when the duplicates occur within the same hour. However, when the hour or day date stamp rolls over and the document still appears, Elasticsearch/Logstash thinks it's a new doc.
Is there a way to make the upsert work cross-index? These would all be the same type of doc; they would simply apply to every index that matches "usage-*".
A new index is an entirely new keyspace and there's no way to tell ES to not index two documents with the same ID in two different indices.
However, you could prevent this by adding an elasticsearch filter to your pipeline which would look up the document in all indices and if it finds one, it could drop the event.
Something like this would do (note that usages would be an alias spanning all usage-* indices):
filter {
  elasticsearch {
    hosts => ["elastic4:9204"]
    index => "usages"
    query => "_id:%{[@metadata][fingerprint]}"
    fields => {"_id" => "other_id"}
  }
  # if the document was found, drop this one
  if [other_id] {
    drop {}
  }
}

JSON parser in logstash ignoring data?

I've been at this a while now, and I feel like the JSON filter in logstash is removing data for me. I originally followed the tutorial from https://www.digitalocean.com/community/tutorials/how-to-install-elasticsearch-logstash-and-kibana-elk-stack-on-ubuntu-14-04
I've made some changes, but it's mostly the same. My filter section looks like this:
uuid #uuid and fingerprint to avoid duplicates
{
  target => "@uuid"
  overwrite => true
}
fingerprint
{
  key => "78787878"
  concatenate_sources => true
}
grok #Get device name from the name of the log
{
  match => { "source" => "%{GREEDYDATA}%{IPV4:DEVICENAME}%{GREEDYDATA}" }
}
grok #get all the other data from the log
{
  match => { "message" => "%{NUMBER:unixTime}..." }
}
date #Set the unix times to proper times.
{
  match => [ "unixTime","UNIX" ]
  target => "TIMESTAMP"
}
grok #Split up the message if it can
{
  match => { "MSG_FULL" => "%{WORD:MSG_START}%{SPACE}%{GREEDYDATA:MSG_END}" }
}
json
{
  source => "MSG_END"
  target => "JSON"
}
So the bit causing problems is at the bottom, I think. My grok stuff should all be correct. When I run this config, I see everything displayed correctly in Kibana, except for the logs which would have JSON code in them (not all of the logs have JSON). When I run it again without the JSON filter, it displays everything.
I've tried to use an IF statement so that it only runs the JSON filter if the message contains JSON code, but that didn't solve anything.
However, when I added an IF statement to only parse a specific JSON format (so, if MSG_START = x, y or z then MSG_END will have a different JSON format; in this case let's say I'm only parsing the z format), then in Kibana I would see all the logs that contain the x and y JSON formats (not parsed, though), but it won't show z. So I'm sure it must be something to do with how I'm using the JSON filter.
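For reference, a conditional of that kind might look like the sketch below; the leading-brace test is only an assumption about what distinguishes the JSON-carrying messages in this data:
#Only attempt JSON parsing when MSG_END looks like a JSON object
if [MSG_END] =~ /^\{/ {
  json {
    source => "MSG_END"
    target => "JSON"
  }
}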
Also, whenever I want to test with new data I started clearing the old data in Elasticsearch, so that if it works I know it's my Logstash that's working and not just leftover data in Elasticsearch. I've done this using XDELETE 'http://localhost:9200/logstash-*/'. But Logstash won't make new indexes in Elasticsearch unless I provide Filebeat with new logs. I don't know if this is another problem or not; just thought I should mention it.
I hope that all makes sense.
EDIT: I just checked the logstash.stdout file; it turns out it is parsing the JSON, but it's only showing things with "_jsonparsefailure" in Kibana, so something must be going wrong with Elasticsearch. Maybe. I don't know, just brainstorming :)
SAMPLE LOGS:
1452470936.88 1448975468.00 1 7 mfd_status 000E91DCB5A2 load {"up":[38,1.66,0.40,0.13],"mem":[967364,584900,3596,116772],"cpu":[1299,812,1791,3157,480,144],"cpu_dvfs":[996,1589,792,871,396,1320],"cpu_op":[996,50]}
MSG_START is load, MSG_END is everything after in the above example, so MSG_END is valid JSON that I want to parse.
The log below has no JSON in it, but my logstash will try to parse everything after "Inf:" and send out a "_jsonparsefailure".
1452470931.56 1448975463.00 1 6 rc.app 02:11:03.301 Inf: NOSApp: UpdateSplashScreen not implemented on this platform
Also this is my output in logstash, since I feel like that is important now:
elasticsearch {
  hosts => ["localhost:9200"]
  document_id => "%{fingerprint}"
}
stdout { codec => rubydebug }
I experienced a similar issue and found that some of my logs were using a UTC time/date stamp and others were not.
Fixing the code to use UTC exclusively sorted the issue for me.
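If the mismatch comes from sources that emit timestamps without an explicit zone, one option is to tell Logstash's date filter which zone to assume. A sketch only, with a hypothetical local_time field and pattern:
date {
  #Interpret zone-less timestamps as UTC rather than the server's local zone
  match => [ "local_time", "yyyy-MM-dd HH:mm:ss" ]
  timezone => "UTC"
}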
I asked this question: Logstash output from json parser not being sent to elasticsearch later on, and it has more relevant information, and maybe a better answer. If anyone ever has a similar problem to mine, you can check out that link.

Logstash Indexing

I would like to create two separate indexes for two different systems that are sending data to the Logstash server set up for UDP syslog. In Elasticsearch, I created an index called CiscoASA01 and another index called CiscoASA02. How can I configure Logstash so that all events coming from the first device go into the CiscoASA01 index and all events coming from the second device go into the CiscoASA02 index? Thank you.
You can use an if conditional to separate the logs. Assume your first device is CiscoASA01 and the second is CiscoASA02.
Here is the output configuration:
output {
  if [host] == "CiscoASA01" {
    elasticsearch {
      host => "elasticsearch_server"
      index => "CiscoASA01"
    }
  }
  if [host] == "CiscoASA02" {
    elasticsearch {
      host => "elasticsearch_server"
      index => "CiscoASA02"
    }
  }
}
[host] is a field in the Logstash event. You can use it to route the logs to different outputs.
Hope this can help you.
