Elasticsearch indexing multiple files in parallel

I'm trying to index multiple files in ES. Since there are many files and each file gets its own index, sequential indexing is too slow for production use. What I would like is to index multiple files in parallel. Let's say I have 100 files: I would like to index 10 files at a time and complete the indexing in 10 batches. I was expecting the time taken to index 10 files in parallel to be roughly the same as the time taken to index a single file, since they execute concurrently and there are enough resources. However, on the ES side the indexing appears to run sequentially, and the time required to index 10 files is almost 10 times the time required for a single file.
Based on that behaviour, it seems like ES indexing runs sequentially even though the indexing requests are sent in parallel, as discussed in this question. Is it possible to index data in parallel to reduce indexing time, or am I missing something here? Thanks for the help.
Note: I am testing this on a single-node setup. Can that be an issue?

As you have not provided your code or your performance test numbers, it's difficult to guess where you are going wrong. My guess is that by "parallel requests" you mean sending 10 separate index requests simultaneously, which is not the right way to do it; you should instead use the Bulk API, which is the right choice for your setup.
If you are already using the Bulk API, please provide all the relevant information required to debug the issue further.
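For illustration, here is a minimal sketch of combining the Bulk API with client-side parallelism using the Python Elasticsearch client (an 8.x client, newline-delimited JSON files, and the file/index names are all assumptions, not details from the question):

# Hypothetical sketch: bulk-index several files at once, each into its own index.
import json
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_file(path, index):
    # Stream one newline-delimited JSON file into its index via the Bulk API.
    def actions():
        with open(path) as f:
            for line in f:
                yield {"_index": index, "_source": json.loads(line)}
    success, _errors = helpers.bulk(es, actions(), chunk_size=1000)
    return success

# 100 files, indexed 10 at a time
files = [("data/file_%d.json" % i, "file-index-%d" % i) for i in range(100)]
with ThreadPoolExecutor(max_workers=10) as pool:
    totals = list(pool.map(lambda args: index_file(*args), files))
print(sum(totals), "documents indexed")

The Bulk API reduces the per-request overhead, and the thread pool bounds how many files are in flight at once.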

Related

ELK indexes design according to development goal

I used Elastic in the past to analyze logs, but I don't have any experience with Elastic "architecture". I have an application that I deployed to multiple machines (200+). I want to connect to each machine and gather metadata like logs, metrics, DB stats and so on.
With that data I want to be able to:
Find problems on each machine and notify about them (finding problems requires joining data from different sources; for example, finding an exception in log1 requires me to go and check the DB).
Analyze common issues across all machines and implement an ML model that will be able to predict issues.
I need to create indexes, and I thought about 2 options:
Create one index per machine, so that all the data related to a machine is available in its own index.
Create an index per data source. For example, all DB logs from all machines would be available in one dedicated index. Another index would contain only data related to machine metrics (CPU/RAM usage, ...).
What would be the best way to create those indexes?
OK, now that I have a better understanding of your needs, here's my suggestion:
I strongly recommend not creating an index per machine. I don't know much about your use case(s), but I assume you want to search the data either in Kibana or by implementing search requests in your application.
Let's say you are interested in the RAM usage of every machine. You would need to execute 200 search requests against Elasticsearch, since the data (RAM usage) is spread over 200 indices (of course one could create aliases, but these would have to be updated for every new machine). Furthermore, you wouldn't be able to do basic aggregations like "which machine has the highest RAM usage?" in a convenient way. In my opinion there are plenty more disadvantages, like index management, shard allocation, etc.
So what's a better solution?
As you have already suggested, you should create an index per data source. With that, each of your indices has a dedicated "purpose", e.g. one index stores database data, another system metrics, and so on. Referring to my examples above, you would then only need to execute one search request to determine a) the RAM usage of every machine and b) which machine has the highest RAM usage. However, this requires that every document contains a field that references the particular host, like so:
PUT metrics/_doc/1
{
  "system": {
    "ram": {
      "usage": "45%",
      "free": "55%"
    }
  },
  "host": {
    "name": "YOUR HOSTNAME",
    "ip": "192.168.17.100"
  }
}
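To illustrate the "one search request instead of 200" point, here is a rough sketch with the Python client (an 8.x client is assumed, host.name is assumed to be a keyword field, and system.ram.usage is assumed to be mapped as a number such as 45 rather than the "45%" string shown above):

# Hypothetical sketch: which machine has the highest RAM usage, in one request.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="metrics",
    size=0,  # only the aggregation matters here
    aggs={
        "by_host": {
            "terms": {
                "field": "host.name",
                "size": 1,
                "order": {"max_ram": "desc"},  # highest RAM usage first
            },
            "aggs": {"max_ram": {"max": {"field": "system.ram.usage"}}},
        }
    },
)
print(resp["aggregations"]["by_host"]["buckets"][0])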
In addition to that, I recommend using daily indices. So instead of creating one huge index for the system metrics, you would create an index for every day, like metrics-2020.01.01, metrics-2020.01.02 and so on (see the sketch after the list below). This approach has the following advantages:
Your indices will be much smaller in size, making them easier to manage and (re-)allocate.
After some time period, you can roughly estimate the data size and choose the number of shards much better. With only one huge index, you would constantly need to adjust the number of shards in order to handle your requests quickly.
Furthermore, you can search your data on a day-by-day basis in a convenient way.
You are able to set up ILM policies to automate the maintenance of your indices, e.g. delete metrics indices that are older than X days.
...
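For illustration, a minimal sketch of routing documents into daily indices with the Python client (an 8.x client is assumed and the values are made up):

# Hypothetical sketch: write each metrics document into the index for its day.
from datetime import date, datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "system": {"ram": {"usage": 45, "free": 55}},  # numeric instead of "45%"
    "host": {"name": "YOUR HOSTNAME", "ip": "192.168.17.100"},
    "@timestamp": datetime.now(timezone.utc).isoformat(),
}
index_name = "metrics-{:%Y.%m.%d}".format(date.today())  # e.g. metrics-2020.01.01
es.index(index=index_name, document=doc)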
I hope I could help you!

Initial ElasticSearch Bulk Index/Insert /Upload is really slow, How do I increase the speed?

I'm trying to upload about 7 million documents to ES 6.3 and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (I have no documents in the index prior to this).
I have a 3-node ES setup with 16GB of memory and an 8GB JVM heap, 1 index, 5 shards.
I have turned off refresh ("-1"), set replicas to 0, and increased the index buffer size to 30%.
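(For reference, a rough sketch of those index-level settings with the Python client, assuming an 8.x client and a placeholder index name; note that the index buffer size, indices.memory.index_buffer_size, is a node-level setting configured in elasticsearch.yml rather than an index setting.)

# Hypothetical sketch of the index settings described above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_settings(
    index="my-index",  # placeholder name
    settings={
        "index": {
            "refresh_interval": "-1",   # disable refresh during the load
            "number_of_replicas": 0,    # restore replicas after the load
        }
    },
)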
On my upload side I have 22 threads, each sending bulk requests of 150 docs. This is just a basic Ruby script using PostgreSQL, ActiveRecord, Net::HTTP (for the network call), and the ES Bulk API (no gem).
For all of my nodes and upload machines, CPU, memory and SSD disk IO are low.
I've been able to get about 30k-40k inserts per minute, but that seems really slow to me since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million documents everything slows to a crawl.
I'm pretty new to ES, so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything I could find and wonder why my upload speed is a factor slower.
I've also checked the logs and only saw some errors about a mapping field that couldn't be changed, but nothing about running out of memory or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database and increased that connection pool, which helped, but it still slows to a crawl at about 1 million records; it did get to 2 million over about 8 hours of running.
I also tried an experiment on a big machine that runs the upload job, with 80 threads uploading 1000 documents each per bulk request. I did some calculations and found that my documents are about 7-10KB each, so each bulk index request is about 7-10MB. This reached the 1M document count faster, but once you get there everything slows to a crawl. The machine stats are still really low. I see output from the threads about every 5 minutes or so in the job logs, about the same time I see the ES count change.
The ES machines still have low CPU and memory usage. Disk IO is around 3.85MB/s and network bandwidth was at 55MB/s, dropping to about 20MB/s.
Any help would be appreciated. I'm not sure if I should try the ES gem and use its bulk insert, which maybe keeps a connection open, or try something totally different.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x and settings that people were using are no longer supported.
Could you give an example of a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those and I can't recall anything off the top of my head.
I've started profiling that DB and noticed that once you use an offset of about 1 million, the queries start to take a long time.
Deep pagination is terrible performance-wise. There is a great blog post, no-offset, which explains:
Why it's bad: to get results 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it gets.
How to avoid it: impose a unique order on your entries (for example by ID, or by combining date and ID, but something that is absolute) and add a condition on where to start. For example, order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that iteration, order by ID again, but with the condition that the ID must be greater than the last one from your previous run; fetch the next 10 entries and again remember the last ID. Repeat until done.
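A rough sketch of that keyset approach in Python (psycopg2, and the table and column names, are assumptions rather than anything from the original script):

# Hypothetical sketch of keyset pagination: remember the last id instead of OFFSET.
import psycopg2

conn = psycopg2.connect("dbname=app")  # connection details are placeholders
last_id = 0
batch_size = 1000

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM documents"
            " WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        )
        rows = cur.fetchall()
    if not rows:
        break
    # ... hand `rows` to the existing bulk-indexing code here ...
    last_id = rows[-1][0]  # start the next batch after this id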
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.

Update nested field for millions of documents

I use a bulk update with a script in order to update a nested field, but it is very slow:
POST index/type/_bulk
{"update":{"_id":"1"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
{"update":{"_id":"2"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"3","field2":"4"}}}}
... [a lot more, split into several batches]
Do you know another way that could be faster?
It seems possible to store the script so that it is not repeated for each update, but I couldn't find a way to keep "dynamic" params.
As often with performance optimization questions, there is no single answer since there are many possible causes of poor performance.
In your case you are making bulk update requests. When an update is performed, the document is actually being re-indexed:
... to update a document is to retrieve it, change it, and then reindex the whole document.
Hence it makes sense to take a look at indexing performance tuning tips. The first few things I would consider in your case are selecting the right bulk size, using several threads for bulk requests, and increasing or disabling the index refresh interval.
You might also consider using a ready-made client that supports parallel bulk requests, as the Python Elasticsearch client does.
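For example, a rough sketch with the Python client's parallel_bulk helper, reusing the script-based updates from the question (the index name, document IDs and an 8.x client are assumptions; "source" replaces the older "inline" script key in recent versions):

# Hypothetical sketch: run the scripted updates through parallel bulk requests.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def update_actions(docs):
    for doc_id, nested in docs:
        yield {
            "_op_type": "update",
            "_index": "index",  # placeholder index name
            "_id": doc_id,
            "script": {
                "source": "ctx._source.nestedfield.add(params.nestedfield)",
                "params": {"nestedfield": nested},
            },
        }

docs = [("1", {"field1": "1", "field2": "2"}),
        ("2", {"field1": "3", "field2": "4"})]

# parallel_bulk fans the actions out over several worker threads
for ok, item in helpers.parallel_bulk(es, update_actions(docs),
                                      thread_count=4, chunk_size=500):
    if not ok:
        print("failed:", item)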
It would be ideal to monitor Elasticsearch performance metrics to understand where the bottleneck is and whether your performance tweaks are giving an actual gain. Here is an overview blog post about Elasticsearch performance metrics.

MarkLogic 8 - Reporting and Aggregation from Large Collection

Say I have a collection with 100 million records/documents in it.
I want to create a series of reports that involve summing of values in certain columns and grouping by various columns.
What references for XQuery and/or MarkLogic can anyone point me to that will allow me to do this quickly?
I saw cts:avg-aggregate, which looks fine. But then I need to group as well.
Also, since I am dealing with a large amount of data and it will take some time to go through it all, I am thinking about setting this up as a job that runs at night to update the report.
I thought of using corb to run through the records and then do something with the output from that. Is this the right approach to reporting with MarkLogic?
Perhaps this guide would help:
http://developer.marklogic.com/blog/group-by-the-marklogic-way
You have several options, which are discussed in that post:
cts:estimate
cts:element-value-co-occurrences
cts:value-tuples + cts:frequency

MongoID where queries map_reduce association

I have an application that aggregates data from different social network sites. The back-end processing is done in Java and works great.
The front end is a Rails application. The deadline was 3 weeks for some analytics filter and report tasks; a few days are still left and it is almost complete.
When I started, I implemented map-reduce for different states over 100,000 records on my local machine and it worked great.
Then my colleague gave me the current, updated database, which has 2.7 million records. My expectation was that it would still run fine, since I specify a date range and filter before the map_reduce execution; I believed it would operate only on the result set of that filter, but that's not the case.
Example
I have a query that just shows the stats for records loaded in the last 24 hours.
The result comes back with 0 records found, but only after 200 seconds with the 2.7 million records; before, it came back in milliseconds.
CODE EXAMPLE BELOW
filter is a hash of conditions expected to be applied before the map_reduce execution
map function
reduce function
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Any suggestions? What would be the ideal solution in the remaining time frame, given that the database is growing exponentially by the day?
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R cannot make use of indexes. You have not shown your Map and Reduce functions, so it's not possible to point out mistakes. Update the question with those functions and what they are supposed to do, and I'll update the answer.
Look at using the Aggregation Framework instead: it can make use of indexes and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/
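For instance, a rough sketch of a "records loaded in the last 24 hours, grouped by state" query with the aggregation framework via pymongo (the collection and field names are guesses, since the original map/reduce functions were not shown):

# Hypothetical sketch: filter by date first (so an index can be used), then group.
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

coll = MongoClient()["app"]["social_contents"]  # placeholder database/collection

since = datetime.now(timezone.utc) - timedelta(hours=24)
pipeline = [
    {"$match": {"created_at": {"$gte": since}}},        # can use an index on created_at
    {"$group": {"_id": "$state", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in coll.aggregate(pipeline):
    print(row)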
