Spring Data Page/Pageable returns duplicates on large data sets?

When operating on large data sets, Spring Data offers two abstractions: Stream and Page. We've been using Stream for a while with no issues, but recently I wanted to try a paginated approach and ran into a reliability issue.
Consider the following:
@Entity
public class MyData {
}

public interface MyDataRepository extends JpaRepository<MyData, UUID> {
}

@Component
public class MyDataService {

    private final MyDataRepository repository;

    public MyDataService(final MyDataRepository repository) {
        this.repository = repository;
    }

    // Bridge between a Reactive service and a transactional / non-reactive database call
    @Transactional
    public void getAllMyData(final FluxSink<MyData> sink) {
        final Pageable firstPage = PageRequest.of(0, 500);
        Page<MyData> page = repository.findAll(firstPage);
        while (page != null && page.hasContent()) {
            page.getContent().forEach(sink::next);
            if (page.hasNext()) {
                page = repository.findAll(page.nextPageable());
            }
            else {
                page = null;
            }
        }
        sink.complete();
    }
}
Using two Postgres 9.5 databases, the source database had close to 100,000 rows while the destination was empty. The example code was then used to copy from the source to the destination. At the end I would find that my destination database had a far smaller row count than the source.
Run as a Spring Boot app
The flux doing the copy was using 4-6 threads in parallel (for speed)
Total run time of at least an hour (max was 2 hours)
As it turns out, I was eventually processing the same rows multiple times (and missing other rows as a result). This led me to a fix that others had already run into, which is to provide a Sort.by(...) argument.
After changing the service to use:
// Make our pages sorted by the PKEY
final Pageable firstPage = PageRequest.of(0, 500, Sort.by("id"));
I found that while it GREATLY helped, I would still process some rows multiple times (going from losing about half the rows to only seeing ~12 duplicates). When I use a Stream instead, I have no issues.
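For reference, the Stream-based variant I'm comparing against looks roughly like this (a sketch only; the findAllBy method name is just the derived-query convention I happened to use, any repository method declared to return a Stream should behave the same):

public interface MyDataRepository extends JpaRepository<MyData, UUID> {
    // Spring Data JPA streams the results lazily from a single cursor
    Stream<MyData> findAllBy();
}

// In MyDataService: the stream must stay inside the transaction and be closed when done
@Transactional(readOnly = true)
public void getAllMyData(final FluxSink<MyData> sink) {
    try (Stream<MyData> stream = repository.findAllBy()) {
        stream.forEach(sink::next);
    }
    sink.complete();
}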
Does anyone have any explanation for what is going on? I don't seem to see any duplicates come through until the test has been running for at least 10-15 minutes, which almost leads me to believe that there is some kind of session or other timeout happening (either in the client or on the database) that causes the hiccups. But I'm really far out of my knowledge area for troubleshooting it further.

Related

Benchmarking Spring Data vs JDBI in a select from a Postgres database

I wanted to compare the performance of Spring Data vs JDBI.
I used the following versions:
Spring Boot 2.2.4.RELEASE
vs
JDBI 3.13.0
The test is fairly simple: select * from the admins table and convert the results to a list of Admin objects.
Here are the relevant details.
With Spring Boot:
public interface AdminService extends JpaRepository<Admin, Integer> {
}
and for JDBI
public List<Admin> getAdmins() {
    String sql = "Select admin_id as adminId, username from admins";
    Handle handle = null;
    try {
        handle = Sql2oConnection.getInstance().getJdbi().open();
        return handle.createQuery(sql).mapToBean(Admin.class).list();
    } catch (Exception ex) {
        log.error("Could not select admins from admins: {}", ex.getMessage(), ex);
        return null;
    } finally {
        handle.close();
    }
}
The test class is executed using JUnit 5:
@Test
@DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
    System.out.println("Running load test");
    Instant start = Instant.now();
    for (int i = 0; i < 1000; i++) {
        List<Admin> admins = adminService.getAdmins(); // for Spring Data this is adminService.findAll()
        for (Admin admin : admins) {
            if (admin.getAdminId() == 654) {
                System.out.println("just to simulate work with the data");
            }
        }
    }
    Instant end = Instant.now();
    Duration duration = Duration.between(start, end);
    System.out.println("Total duration: " + duration.getSeconds());
}
I was quite shocked to get the following results:
Spring Data: 2 seconds
JDBI: 59 seconds
Any idea why I got these results? I was expecting JDBI to be faster.
The issue was that Spring manages the connection life cycle for us, and for good reason. From the JDBI docs:
There is a performance penalty every time a connection is allocated
and released. In the example above, the two insertFullContact
operations take separate Connection objects from your database
connection pool.
I changed the JDBI test code to the following:
@Test
@DisplayName("How long does it take to run 1000 queries")
public void loadAdminTable() {
    System.out.println("Running load test");
    String sql = "Select admin_id as adminId, username from admins";
    Handle handle = Sql2oConnection.getInstance().getJdbi().open();
    Instant start = Instant.now();
    for (int i = 0; i < 1000; i++) {
        List<Admin> admins = handle.createQuery(sql).mapToBean(Admin.class).list();
        if (!admins.isEmpty()) {
            for (Admin admin : admins) {
                System.out.println(admin.getUsername());
            }
        }
    }
    handle.close();
    Instant end = Instant.now();
    Duration duration = Duration.between(start, end);
    System.out.println("Total duration: " + duration.getSeconds());
}
This way the connection is opened once and the query runs 1000 times.
The final result was 1 second, twice as fast as Spring.
On the one hand, you seem to be making some basic benchmarking mistakes:
You are not warming up the JVM.
You are not using the results in any way.
Therefore what you are seeing might just be effects of different optimisations of the VM.
Look into JMH in order to improve your benchmarks.
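For example, a rough JMH skeleton for this particular test could look like the sketch below (Application stands in for your Spring Boot main class; booting the full context inside @Setup is only one way to wire it up):

import org.openjdk.jmh.annotations.*;
import org.springframework.boot.SpringApplication;
import org.springframework.context.ConfigurableApplicationContext;
import java.util.List;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)       // give the JIT time to optimise before measuring
@Measurement(iterations = 5)
@Fork(1)
public class AdminQueryBenchmark {

    private ConfigurableApplicationContext context;
    private AdminService adminService;

    @Setup
    public void setup() {
        context = SpringApplication.run(Application.class);
        adminService = context.getBean(AdminService.class);
    }

    @TearDown
    public void tearDown() {
        context.close();
    }

    @Benchmark
    public List<Admin> findAllAdmins() {
        // returning the result keeps the JIT from optimising the call away
        return adminService.findAll();
    }
}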
Benchmarks with an external resource are extra hard, because you have so many more parameters to control.
One big question, for example, is whether the connection to the database is realistically slow: in most production systems the database will be on a different machine, at least virtually, quite possibly on different hardware.
Is that true in your test as well?
Assuming your results are real, the next step is to investigate where the extra time gets spent.
I would expect the most time to be spent with executing the SQL statements and obtaining the result via the network.
Therefore you should inspect what SQL statements actually get executed.
This might point you to one possible answer: that JPA is doing lots of lazy loading and hasn't even loaded most of what you actually need.
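If you are on Spring Boot with Hibernate as the JPA provider, one quick way to see the executed statements is to turn on SQL logging in application.properties (a common setup; adjust for your provider):

# print the SQL Hibernate generates
spring.jpa.show-sql=true
# or route it through the logging framework, with bound parameters
logging.level.org.hibernate.SQL=DEBUG
logging.level.org.hibernate.type.descriptor.sql.BasicBinder=TRACE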

Spring batch reader - How to avoid returning a list of objects

So I have a Spring Batch app that fetches a list of ids and then calls read() for each id, getting 1 to many results back. The issue is that I have no control over how many results I get back for each id, meaning my chunking is spotty at best. Is there a suggested way to avoid spikes in memory/CPU? An example is below:
@Before
public void getIds() {
    // getListOfIds -- usually around 10,000 or so
}

@Override
public List<AccountObject> read() {
    if (/* the list of ids hasn't all been used */) {
        List<AccountObject> myAccounts = myService.getAccounts(id);
        return myAccounts; // This could be anywhere from 1 result to 100,000 results.
    } else {
        return null;
    }
}
So the myAccounts object above could be small or huge. This causes chunking to basically be useless because at the moment I am chunking by List. I'd really rather chunk by straight AccountObject but don't see an easy way to do this.
Is there a class, strategy, etc. that I am missing here?
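One common way to get back to chunking by a single AccountObject is to buffer the per-id results inside the reader and hand them out one at a time. A minimal sketch, assuming the same myService.getAccounts(id) call from the question (the MyService type name and Long id type are assumptions):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import org.springframework.batch.item.ItemReader;

public class AccountItemReader implements ItemReader<AccountObject> {

    private final Iterator<Long> idIterator;                      // the ~10,000 ids
    private final Deque<AccountObject> buffer = new ArrayDeque<>();
    private final MyService myService;

    public AccountItemReader(List<Long> ids, MyService myService) {
        this.idIterator = ids.iterator();
        this.myService = myService;
    }

    @Override
    public AccountObject read() {
        // refill the buffer from the next id(s) until there is something to return
        while (buffer.isEmpty() && idIterator.hasNext()) {
            buffer.addAll(myService.getAccounts(idIterator.next()));
        }
        // returning null signals end of input to Spring Batch
        return buffer.poll();
    }
}

This keeps only one id's worth of accounts buffered at a time, and the chunk size then counts individual AccountObjects rather than whole lists.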

Why are connections to Azure Redis Cache so high?

I am using the Azure Redis Cache in a scenario of high load for a single machine querying the cache. This machine roughly gets and sets about 20 items per second. During daytime this increases, during nighttime this is less.
So far, things have been working fine. Today I realized that the metric of "Connected Clients" is extremely high, although I only have 1 client that just constantly Gets and Sets items. Here is a screenshot of the metric I mean:
My code looks like this:
public class RedisCache<TValue> : ICache<TValue>
{
    private IDatabase cache;
    private ConnectionMultiplexer connectionMultiplexer;

    public RedisCache()
    {
        ConfigurationOptions config = new ConfigurationOptions();
        config.EndPoints.Add(GlobalConfig.Instance.GetConfig("RedisCacheUrl"));
        config.Password = GlobalConfig.Instance.GetConfig("RedisCachePassword");
        config.ConnectRetry = int.MaxValue; // retry connection if broken
        config.KeepAlive = 60;              // keep connection alive (ping every minute)
        config.Ssl = true;
        config.SyncTimeout = 8000;          // 8 seconds timeout for each get/set/remove operation
        config.ConnectTimeout = 20000;      // 20 seconds to connect to the cache
        connectionMultiplexer = ConnectionMultiplexer.Connect(config);
        cache = connectionMultiplexer.GetDatabase();
    }

    public virtual bool Add(string key, TValue item)
    {
        return cache.StringSet(key, RawSerializationHelper.Serialize(item));
    }
}
I am not creating more than one instance of this class, so this is not the problem. Maybe I misunderstand the connections metric and what it really means is the number of times I access the cache; however, that would not really make sense in my opinion. Any ideas, or anyone with a similar problem?
StackExchange.Redis had a race condition that could lead to leaked connections under some conditions. This has been fixed in build 1.0.333 or newer.
If you want to confirm this is the issue you are hitting, get a crash dump of your client application and look at the objects on the heap in a debugger. Look for a large number of StackExchange.Redis.ServerEndPoint objects.
Also, several users have had bugs in their code that resulted in leaked connection objects. This is often because their code tries to re-create the ConnectionMultiplexer object if it sees failures or a disconnected state. There is really no need to recreate the ConnectionMultiplexer, as it has logic internally to recreate the connection as necessary. Just make sure to set abortConnect to false in your connection string.
If you do decide to re-create the connection object, make sure to dispose the old object before releasing all references to it.
The following is the pattern we are recommending:
private static Lazy<ConnectionMultiplexer> lazyConnection = new Lazy<ConnectionMultiplexer>(() =>
{
    return ConnectionMultiplexer.Connect("contoso5.redis.cache.windows.net,abortConnect=false,ssl=true,password=...");
});

public static ConnectionMultiplexer Connection
{
    get
    {
        return lazyConnection.Value;
    }
}

ListChangeListener, JavaFX

I have a question related to filters on an ObservableList. The code I have works fine, but I think it is too slow. This is because I load about 40,000 orders at the start of the app; after that the app keeps receiving orders, but for now I only have the problem in the initial load. My main problem is that the copy of my original collection of orders is considerably slower, and I think the code I have in the change listener could be better, but I haven't found a solution yet. Here's an example of my code:
public MainController()
{
    filteredData.addAll(Repository.masterObservableList);
    Repository.masterObservableList.addListener(new ListChangeListener<OrderVo>()
    {
        @Override
        public void onChanged(ListChangeListener.Change<? extends OrderVo> change)
        {
            filteredData.clear();
            for (OrderVo o : Repository.masterObservableList)
                filteredData.add(o);
        }
    });
}
I'll explain the code a little bit. The Repository is a singleton, and masterObservableList is an ObservableList that, as the name says, is the "master" or original list. filteredData is also an ObservableList, but it is declared only in the controller of my FXML (MainController) and works as the copy of the master collection. Every time my master collection receives a change (an update or a new order), filteredData should apply that change, but I'm doing a for-each iteration and this is the problem: it works, but it is too slow.

Why am I saying it is slow? Because in the beginning I was using the master collection directly as the data provider of a TableView that shows the orders, and it worked fast and clean. After that I wanted to add filters to the table, and that's when I started researching and found filtered data and other methods. I kept this method (it is a larger piece of code, but the main problem is here) and it works, but the time it takes to load the orders at the beginning is about 1:30 min more than before. So, if you have any idea how to keep filteredData updated without doing a for-each in the change listener, I will be very happy and grateful. Thanks for reading!
Your algorithm forcefully rebuilds all of filteredData every time masterObservableList changes.
That is wasteful, as the listener can be triggered for each item added or removed.
Maybe it's better to only add or remove the elements that have changed?
Repository.masterObservableList.addListener(new ListChangeListener<OrderVo>()
{
    @Override
    public void onChanged(ListChangeListener.Change<? extends OrderVo> change)
    {
        while (change.next()) {
            if (change.wasAdded()) {
                List<? extends OrderVo> added = change.getAddedSubList();
                // add those elements to your data
                filteredData.addAll(added);
            }
            if (change.wasRemoved()) {
                List<? extends OrderVo> removed = change.getRemoved();
                // remove those elements from your data
                filteredData.removeAll(removed);
            }
        }
    }
});

Thread safe caching

I am trying to analyze what problems I might be having with unsafe threading in my code.
In my MVC3 web application I do the following:
// Caching code
public static class CacheExtensions
{
    private static readonly object sync = new object();

    public static T GetOrStore<T>(this Cache cache, string key, Func<T> generator)
    {
        var result = cache[key];
        if (result == null)
        {
            result = generator();
            lock (sync)
            {
                cache[key] = result;
            }
        }
        return (T)result;
    }
}
Using the caching like this:
// Using the cached stuff
public class SectionViewData
{
    public IEnumerable<Product> Products { get; set; }
    public IEnumerable<SomethingElse> SomethingElse { get; set; }
}

private void Testing()
{
    var cachedSection = HttpContext.Current.Cache.GetOrStore("Some Key", () => GetSectionViewData());
    // Threading problem?
    foreach (var product in cachedSection.Products)
    {
        // Do some stuff with product...
    }
}

private SectionViewData GetSectionViewData()
{
    SectionViewData viewData = new SectionViewData();
    viewData.Products = CreateProductList();
    viewData.SomethingElse = CreateSomethingElse();
    return viewData;
}
Could I run into problems with the IEnumerable? I don't have much experience with threading problems. The cachedSection would not get touched if some other thread adds a new value to the cache, right? To me this would work!
Should I cache Products and SomethingElse individually? Would that be better than caching the whole SectionViewData?
Threading is hard;
In your GetOrStore method, the get/generator sequence is entirely unsynchronized, so any number of threads can get null from the cache and run the generator function at the same time. This may - or may not - be a problem.
Your lock statement only locks the setter of cache[string], which is already thread safe and doesn't need to be "extra locked".
The variation of double-checked locking in the cache is suspect, I'd try to get rid of it. Since the thread that never enters the lock() section can get result without a memory barrier, result may not be entirely constructed by the time the thread gets it.
Enumerating the cached IEnumerables is safe as long as nothing modifies them at the same time. If GetSectionViewData() returns an object with immutable (as in non-changing) collections, you're safe.
Your code is missing some parts, like how Products would be populated. Only in GetSectionViewData?
If so, then I don't see a major problem with your code.
There is, however, a chance that two threads generate the same data (cachedSection) for the same key. It shouldn't create a threading problem, except that you are doing the work twice, so if this is an expensive operation I would change the code so it only generates the data once per key. If it is not expensive, it works fine as is.
The IEnumerable for Products is not touched (assuming you create it separately per thread), but the enumerator on the cache is modified for each insert operation, hence it is not thread safe. So if you are enumerating that, I would be careful.
