Improving query execution time - spring-boot

I am working with Spring Data Mongo. I have around 2,000 documents stored (this will probably reach 10,000 in the upcoming 2-3 months) and I would like to extract them all, but the query takes around ~2.5 seconds, which is pretty bad in my opinion. I am using the MongoRepository default findAll().
I tried increasing the cursor batch size to 500, 1000, and 2000 without much improvement (the best result was 2.13 seconds); see the sketch below.
Currently I'm using a workaround: I store the documents in a separate collection that is used as a cache, and extracting that data takes around 0.25 seconds, but I would like to figure out how to fix the original query's execution time.
I would like the query to return in less than 1 second; less is even better.
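For reference, the batch-size experiment looks roughly like this (a sketch; MyDocument stands in for my real document class, the value 1000 is just one of the sizes I tried, and Query#cursorBatchSize needs Spring Data MongoDB 2.1+):

import java.util.List;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Query;

public class MyDocumentReader {

    private final MongoTemplate mongoTemplate;

    public MyDocumentReader(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    public List<MyDocument> loadAll() {
        // An empty Query matches every document; the batch size only changes how
        // many documents come back per network round trip, not the total work.
        Query query = new Query().cursorBatchSize(1000);
        return mongoTemplate.find(query, MyDocument.class);
    }
}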

Without knowing the exact details I cannot recommend a specific method, but for data-selection queries indexing will help you.
Please try indexing the DB (a minimal Spring Data sketch follows the link):
https://docs.mongodb.com/manual/indexes/
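For example, declaring an index with Spring Data MongoDB could look like this (a sketch; the document class and the createdAt field are assumptions, so index the fields you actually filter or sort on, since a plain findAll() full scan will not benefit from an index; note that recent Spring Boot versions need spring.data.mongodb.auto-index-creation=true for @Indexed to create the index automatically):

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.index.Indexed;
import org.springframework.data.mongodb.core.mapping.Document;

@Document("myDocuments")
public class MyDocument {

    @Id
    private String id;

    // Queries that filter or sort on createdAt can use this index.
    @Indexed
    private java.time.Instant createdAt;

    // ... other fields, getters and setters omitted
}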

Related

Initial ElasticSearch bulk index/insert/upload is really slow, how do I increase the speed?

I'm trying to upload about 7 million documents to ES 6.3 and I've been running into an issue where the bulk upload slows to a crawl at about 1 million docs (I have no documents in the index prior to this).
I have a 3-node ES setup with 16 GB of RAM and an 8 GB JVM heap per node, 1 index, 5 shards.
I have turned off refresh (set to "-1"), set replicas to 0, and increased the index buffer size to 30%.
On my upload side I have 22 threads running 150 docs per bulk insert request. This is just a basic Ruby script using PostgreSQL, ActiveRecord, Net/HTTP (for the network call), and the ES Bulk API (no gem).
For all of my nodes and upload machines the CPU, memory, and SSD disk IO are low.
I've been able to get about 30k-40k inserts per minute, but that seems really slow to me since others have been able to do 2k-3k per second. My documents do have nested JSON, but they don't seem very large to me (is there a way to check the size of a single doc, or the average?).
I would like to be able to bulk upload these documents in less than 12-24 hours, and it seems like ES should handle that, but once I get to 1 million it seems to slow to a crawl.
I'm pretty new to ES, so any help would be appreciated. I know this seems like a question that has already been asked, but I've tried just about everything I could find and wonder why my upload speed is a factor slower.
I've also checked the logs and only saw some errors about a mapping field that couldn't be changed, but nothing about running out of memory or anything like that.
ES 6.3 is great, but I'm also finding that the API has changed a bunch in 6.x and settings that people were using are no longer supported.
I think I found a bottleneck in the active connections to my original database and increased that connection pool, which helped, but it still slows to a crawl at about 1 million records; it got to 2 million over about 8 hours of running.
I also tried an experiment on a big machine that is used to run the upload job, running 80 threads at 1,000 documents per upload each. I did some calculations and found that my documents are about 7-10 KB each, so each bulk index is about 7-10 MB. This got the document count to 1M faster, but once you get there everything slows to a crawl. The machine's stats are still really low. I see output from the threads about every 5 minutes or so in the job's logs, about the same time I see the ES count change.
The ES machines still have low CPU and memory usage. The IO is around 3.85 MB/s and the network bandwidth was at 55 MB/s, dropping to about 20 MB/s.
Any help would be appreciated. Not sure if I should try the ES gem and use its bulk insert, which maybe keeps a connection open, or try something totally different for the inserts.
"ES 6.3 is great, but I'm also finding that the API has changed a bunch to 6 and settings that people were using are no longer supported."
Could you give an example of a breaking change between 6.0 and 6.3 that is a problem for you? We're really trying to avoid those and I can't recall anything off the top of my head.
I've started profiling that DB and noticed that once you use an offset of about 1 million, the queries start to take a long time.
Deep pagination is terrible performance-wise. There is the great blog post "no-offset", which explains:
Why it's bad: to get results 1,000 to 1,010 you sort the first 1,010 records, throw away 1,000, and then send 10. The deeper the pagination, the more expensive it gets.
How to avoid it: make a unique order for your entries (for example by ID, or combine date and ID, but something that is absolute) and add a condition on where to start. For example, order by ID, fetch the first 10 entries, and keep the ID of the 10th entry for the next iteration. In that one, order by ID again, but with the condition that the ID must be greater than the last one from your previous run, fetch the next 10 entries, and again remember the last ID. Repeat until done (see the JDBC sketch below).
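A minimal JDBC sketch of that keyset approach, assuming a hypothetical documents table with a unique, increasing id column (the job in the question is a Ruby script, but the idea is identical: remember the last key instead of using OFFSET):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class KeysetPager {

    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret")) {
            long lastId = 0;            // start before the first row
            final int pageSize = 1000;
            while (true) {
                int rows = 0;
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload FROM documents WHERE id > ? ORDER BY id LIMIT ?")) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, pageSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");   // remember the key for the next page
                            // ... add rs.getString("payload") to the next bulk request ...
                            rows++;
                        }
                    }
                }
                if (rows < pageSize) {
                    break;              // last page reached
                }
            }
        }
    }
}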
Generally, with your setup you really shouldn't have a problem inserting more than 1 million records. I'd look into the part that is fetching the data first.

JPA getResultList much slower than SQL query

I have an Oracle table with about 5 million records and a quite complex query which returns about 5,000 of them in less than 5 seconds with a database tool like Toad.
However, when I run the query via the EntityManager (EclipseLink), it runs for minutes...
I'm probably too naive in the implementation.
I do:
Query query = em.createNativeQuery(complexQueryString, Myspecific.class);
... setParameter...
List result = query.getResultList();
The complexQueryString starts with a "SELECT *".
What kinds of optimizations do I have?
Maybe one is to select only the fields I really need later. Some explanation would be great.
I had a similar problem (I tried to read 800,000 records with 8 columns in less than one second) and the best solution was to fall back to JDBC. The ResultSet was created and read roughly 10 times faster than with JPA, even when using a native query.
How to use JDBC: normally, in Java EE servers a JDBC DataSource can be injected with @Resource.
An explanation: I think the O/R mappers try to create and cache objects so that changes can easily be detected later. This is a substantial overhead that isn't noticeable when you are just working with single entities.
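A sketch of that JDBC fallback, assuming an injected DataSource and a hypothetical MY_TABLE with the three columns that are actually needed (all names here are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;
import javax.annotation.Resource;
import javax.sql.DataSource;

public class FastReader {

    // A plain value holder instead of a managed entity: no dirty checking,
    // no persistence-context bookkeeping.
    public static class Row {
        public final long id;
        public final String name;
        public final Timestamp created;
        public Row(long id, String name, Timestamp created) {
            this.id = id;
            this.name = name;
            this.created = created;
        }
    }

    @Resource(name = "jdbc/myDataSource")    // the JNDI name is an assumption
    private DataSource dataSource;

    public List<Row> readAll() throws SQLException {
        List<Row> result = new ArrayList<>();
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT ID, NAME, CREATED FROM MY_TABLE")) {
            ps.setFetchSize(5000);           // rows fetched per round trip
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.add(new Row(rs.getLong("ID"),
                                       rs.getString("NAME"),
                                       rs.getTimestamp("CREATED")));
                }
            }
        }
        return result;
    }
}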
setFetchSize(...) may help a bit. It tells the JDBC driver how many rows to return in one chunk. Just call it before getResultList():
query.setFetchSize(5000);
query.getResultList();
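With EclipseLink (which the question uses), the same idea can be expressed on a plain JPA query via the eclipselink.jdbc.fetch-size hint, set before getResultList(); the value here is just an example:

Query query = em.createNativeQuery(complexQueryString, Myspecific.class);
query.setHint("eclipselink.jdbc.fetch-size", 5000);   // JDBC rows per round trip
List result = query.getResultList();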

Hibernate fetching quick if setMaxResults is set

There is a simple SQL query which fetches only ONE record. The database is Oracle.
Following is the simple query:
select *
from APPEALCASE appealcase0_
where appealcase0_.caseNumber='BAXXXXX00' and appealcase0_.DELETED_FLAG='N'
When I fetch this row using Hibernate, the response time is 500 ms, which is slow since it needs to be really quick, within 10 ms. But when I set MaxResults on the Hibernate query object to 1 (one), the response time improves to 15 ms.
Though my issue is fixed, I'm still puzzled how setting MaxResults to 1 improved the response time so drastically. Can anyone explain this to me?
Well, that's quite logical to me. Since you tell Oracle to retrieve at most one record, it stops searching for more as soon as it finds one. Whereas if you don't, it scans the whole table (or index) to find all the records matching the search criteria.
What you should check, though, is if you have an index defined on the caseNumber column.
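For illustration, the limited Hibernate query could look roughly like this (a sketch; the AppealCase entity and its property names are assumed from the SQL above). setMaxResults(1) is what lets Oracle stop after the first matching row; an index on caseNumber is what makes finding that row fast:

AppealCase result = session
        .createQuery(
                "from AppealCase where caseNumber = :caseNumber and deletedFlag = 'N'",
                AppealCase.class)
        .setParameter("caseNumber", "BAXXXXX00")
        .setMaxResults(1)   // stop after the first match
        .uniqueResult();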

MongoID where queries map_reduce association

I have an application that aggregates data from different social network sites. The back-end processes are done in Java and work great.
Its front end is a Rails application. The deadline was 3 weeks for some analytics filter and report tasks; there are still a few days left and it is almost completed.
When I started, I implemented map/reduce for different states and it worked great over 100,000 records on my local machine.
Then my colleague gave me the current, updated database, which has 2.7 million records. My expectation was that it would still run great, since I specify a date range and filter before the map_reduce execution; my belief was that it would only operate on that filter's result set, but that's not the case.
Example: I have a query that just shows stats for records loaded in the last 24 hours.
The result comes back as 0 records found, but only after 200 seconds with the 2.7 million records; before, it came back in milliseconds.
Code example below: filter is a hash of conditions expected to be applied before map_reduce, and map and reduce are the map and reduce functions.
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Suggestions please: what would be the ideal solution in the remaining time frame, as the database is growing rapidly by the day?
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R cannot make use of indexes. You have not shown your Map and Reduce functions, so it's not possible to point out mistakes. Update the question with those functions and what they are supposed to do, and I'll update the answer.
Look at using the Aggregation Framework instead: it can make use of indexes and also run concurrently, and it's a lot easier to understand and debug (a sketch follows below). There is information about it at http://docs.mongodb.org/manual/reference/aggregation/
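A sketch of such a pipeline using the MongoDB Java driver (the app in the question is Rails/Mongoid, but the stages translate one-to-one; the collection name social_contents and the fields created_at and state are assumptions):

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.gte;

import java.util.Arrays;
import java.util.Date;
import java.util.concurrent.TimeUnit;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class LastDayStats {

    public static void main(String[] args) {
        Date since = new Date(System.currentTimeMillis() - TimeUnit.HOURS.toMillis(24));

        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost")
                .getDatabase("mydb")
                .getCollection("social_contents");

        // $match first so an index on created_at can be used,
        // then $group the matched documents per state.
        for (Document doc : coll.aggregate(Arrays.asList(
                match(gte("created_at", since)),
                group("$state", sum("count", 1))))) {
            System.out.println(doc.toJson());
        }
    }
}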

Awful performance of JPA batch processing compared to Hibernate or JDBC

I needed to create a batch job recently which reads over a table with millions of rows. The table has about 12 columns and I only need to do a read operation, but I needed all fields, therefore I thought about using persistence objects.
I really used only the most basic code to achieve that, with no tweaks. JPA was quite annoying because it forced me to use custom paging with setFirstResult and setMaxResults. You can view the approximate code via the hyperlinks below, if you are interested. There really is nothing else to it besides the default XML files, etc.
The JPA code: http://codeviewer.org/view/code:297e
The Hibernate code: http://codeviewer.org/view/code:297f
The JDBC code: same as above, but with "d" on the end (sorry I can only post 2 links)
The resulting throughput for finished operations was something like this (I am only talking about read operations):
JPA: 1,000 per 5 seconds | 12,000 per minute | 720,000 per hour
Hibernate: 20,000 per 5 seconds | 240,000 per minute | 14,400,000 per hour
JDBC: 50,000-80,000 per 5 seconds | 600,000-960,000 per minute | 36,000,000-57,600,000 per hour
I can't explain it, but JPA is ridiculous; it can only be a big bad joke. The funny thing is that it started with the same speed as the Hibernate code, but after about 30,000 records it became slower and slower until it stabilized at 1,000 read operations per 5 seconds. It reached that point after finishing approximately 100,000 records. But honestly... there is no point in that speed.
Why is that so? Please explain it to me; I really don't know what I'm doing wrong, but I also think it shouldn't be that slow, even with default settings. It can't be and it must not be! In comparison, the Hibernate and JDBC speeds are acceptable and stable the whole time.
With Hibernate you get good performance using only one query and scrollable results. Unfortunately, this is not currently possible in JPA, and you must execute a query for every result page.
So, you are doing it right. But your page size is only set to 20 results. This is very few, so your code issues a very high number of queries. Try a greater size, for example 10,000 results, and performance will probably increase. Anyway, I think you won't be able to get numbers close to Hibernate's (see the sketch below).
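For comparison, the "one query plus scrollable results" variant with the Hibernate 5 Session API looks roughly like this (a sketch; MyEntity is a placeholder for the mapped class, and the fetch/clear sizes are arbitrary):

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.Session;

public class ScrollingReader {

    public void readAll(Session session) {
        ScrollableResults results = session
                .createQuery("from MyEntity")
                .setReadOnly(true)               // pure read, no dirty checking needed
                .setFetchSize(1000)              // rows per JDBC round trip
                .scroll(ScrollMode.FORWARD_ONLY);
        try {
            int count = 0;
            while (results.next()) {
                Object row = results.get(0);     // the entity for the current row
                // ... process the row ...
                if (++count % 1000 == 0) {
                    session.clear();             // keep the persistence context small
                }
            }
        } finally {
            results.close();
        }
    }
}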
