Initial ElasticSearch Bulk Index/Insert /Upload is really slow, How do I increase the speed? - ruby

I'm trying to upload about 7 million documents to ES 6.3 and I've been running into and issue where the bulk upload slows to a crawl at about 1 million docs (I have no documents previous to this in the index).
I have a 3 node ES setup with 16GB with 8GB JVM settings, 1 index, 5 shards.
I have turned off refresh ("-1"), set replica to 0, increased the index buffer size to 30%.
On my upload side I have 22 threads running 150 docs per request of bulk insert. This is just a basic ruby script using Postgresql, ActiveRecord, Net/HTTP (For the network call), and and using the ES Bulk API (No gem).
For all of my nodes and upload machines the CPU, Memory, SSD Disk IO is low.
I've been able to get about 30k-40k inserts per/minute, but that seems really slow to me since others have been able to do 2k-3k per/sec. My documents do have nested json, but they don't seem to be very large to me (Is there way to check a single size doc or average?).
I would like to be able to bulk upload these documents in less than 12 - 24hrs and seems like ES should handle that, but once I get to 1 million it seems like it slows to a crawl.
I'm pretty new to ES so any help would be appreciated. I know this seems like question that has already been asked, but I've tried just about everything that I could find and wonder why my upload speed is a factor slower.
I've also checked the logs and only saw some errors about mapping field couldn't change, but nothing about memory over or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch to 6 and settings that people were using are no longer supported.
I think I found a bottleneck at the active connections to my original database and increased that connection pool which helped, but still slows to a crawl at about 1 Million records, but got to 2 Million over about 8hrs of running.
I also tried an experiment on a big machine, that is used to run the upload job, running 80 threads at 1000 document uploads each. I did some calculations and found out that my documents are about 7-10k per document so doing uploads of 7-10MBs each bulk index. This got to the document count faster to 1M, but once you get there everything slows to a crawl. The machines stats are still really low. I do see output of the threads about every 5 mins or so on the logs for the job, about the same time I see the ES count change.
The ES machines still have low CPU, Memory. The IO is around 3.85MBs and the Network Bandwidth was at 55MBs and drops to about 20MBs.
Any help would be appreciated. Not sure if I should try the ES gem, and use the bulk insert which maybe keeps a connection open, or try something totally different to insert.

ES 6.3 is great, but I'm also finding that the API has changed a bunch to 6 and settings that people were using are no longer supported.
Could you give an example for a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those and I can't really recall anything from the top of my head.
I've started profiling that DB and noticed that once you use offset of about 1 Million the queries are starting to take a long time.
Deep pagination is terrible performance wise. There is the great blog post no-offset, which explains
why it's bad: To get the result 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination the more expensive it will be
how to avoid it: Make a unique order of your entries (for example by ID or combine date and ID, but something that is absolute) and add a condition on where to start. For example order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that one order by the ID again, but with the condition that the ID must be greater than the last one in your previous run, and fetch the next 10 entries plus remember the last ID again. Repeat until done.
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.

Related

How to increase the speed of bulk inserts with GORM?

I am using v2 GORM's CreateInBatches and changed SkipDefaultTransaction: true as mentioned in the "Performance" page, but I find that +350k records inserted in batches of 1000 take almost 3 minutes.
I tried removing the gorm.Model{} fields but didn't see much improvement.
What can I do to increase bulk-insert speed?
EDIT for anyone reading this with the same problem: I ended up saving my data in CSVs and importing it with pg_bulkload - got 1 million rows imported in 1 second (on a server with lots of cores & RAM)

Elasticsearch indexing multiple files in parallel

I'm trying to index multiple files in ES. Since there are many files and each file has its own index, sequential indexing seems to be slow in production usage. What I would like is to index multiple files in parallel. Let's say I have 100 files and would like to index 10 files at a time and complete indexing in 10 batches. I was expecting the time taken by 10 files to index and a single file to index be the same since they are executing in parallel and resources are also enough. However, on the ES side, indexing is being done sequentially and the time required to index 10 files is almost 10 times the time required by a single file.
It seems like ES indexing runs sequentially although parallel requests are sent for indexing from this question. Is it possible to index data in parallel to reduce indexing time or am I missing something here? Thanks for the help
Note: I am testing this in a single node setup. Can that be an issue?
As you have not provided the code and your performance test numbers, so it's difficult to guess, where you are making a mistake, I am guessing you are sending 10 different index request simultaneously when you are saying parallel request which is not the right way, you should instead use the Bulk API which is the right choice for your setup.
if you are already using the Bulk API, please provide all the relevant information required to debug the issue further.

Improving query execution time

I am working with spring data mongo, I have around 2000 documents stored(would probably reach 10000 in the upcoming 2-3 months), I would like to extract them all, however the query takes around ~2.5 seconds, which is pretty bad in my opinion, I am using MongoRepository default - findAll()
Tried to increase the cursor batchsize to 500,1000,2000 without any much improvement(best result was 2.13 seconds).
Currently I'm using a workaround - I store the documents in a different collection which used for cache, extracting this data takes around 0.25 seconds, but I would like to figure out how to fix the original query execution time.
Would like the answer will return in less then 1 sec, less is even better.
Without knowing the exact details i cannot confirm you a method.
But for data selection queries "Indexing" will help you.
Please Try Indexing the DB.
https://docs.mongodb.com/manual/indexes/

Execute Select Statement at Cassandra

I have been a problem in Cassandra. Please help me..
I am executing Select statement at 500K rows table at intervals 1 millisecond. After some time I get message "All host(s) tried for query failed. First host tried, 10.1.60.12:9042: Host considered as DOWN. See innerErrors".
I run select statement the fallowing:
select * from demo.users
this returning to me 5K rows. There are 500K rows in the users table.
I don't know what is wrong. I have not changed the cassandra.yaml file.
I need to make settings for the memory cache? There is too much disk i/o when I run select statement.
Please help me
A range query (select * with no primary key or token ranges) can be a very expensive query that has to hit at least 1 of every replica set (depend on size of dataset). If your trying to read the entire dataset or doing batch processing either be best to use spark connector or behave like it, and query individual token ranges to prevent putting too much load on coordinators.
If you are going to be using inefficient queries (which is fine, just don't expect the same throughput as normal reads) you will probably need more resources or some specialized tuning. You could add more nodes or look into whats causing it to go DOWN. Most likely its GCs from heap load, so can check GC log. If you have the memory available you can increase heap. Would be good idea to max heap size since with reading everything, system caches are not going to be as meaningful. Use G1 once over 16gb (which you should be) in the jvm.options.

MongoID where queries map_reduce association

I have an application who is doing a job aggregating data from different Social Network sites Back end processes done Java working great.
Its front end is developed Rails application deadline was 3 weeks for some analytics filter abd report task still few days left almost completed.
When i started implemented map reduce for different states work great over 100,000 record over my local machine work great.
Suddenly my colleague gave me current updated database which 2.7 millions record now my expectation was it would run great as i specify date range and filter before map_reduce execution. My believe was it would result set of that filter but its not a case.
Example
I have a query just show last 24 hour loaded record stats
result comes 0 record found but after 200 seconds with 2.7 million record before it comes in milliseconds..
CODE EXAMPLE BELOW
filter is hash of condition expected to check before map_reduce
map function
reduce function
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Suggestion please.. what would be ideal solution in remaining time frame as database is growing exponentially in days.
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R can not make use of indexes. You have not shown your Map and Reduce functions so it's not possible to point out mistakes. Update the question with those functions, and what they are supposed to do and I'll update the answer.
Look at using the Aggregation Framework instead, it can make use of indexes, and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/

Resources