Spring Data Cassandra gigabytes of data findAll() approach - java-8

I know this has been asked before, but Spring Data is growing a lot with time.
How would you implement a findAll() that returns millions of rows?
I know Spring Data has a stream API, though I'm not sure it would be safe with so much data. As I understand it, this does not retrieve all the data at once but fetches it while the stream is being processed.
Stream<T> streamAllBy...(...);
A second approach would be this; the only downside is that I would have to handle pagination manually.
Slice<T> findAllBy...(..., Pageable pageable)
Any ideas?

Declaring Stream<T> as a return type for a query method is indeed the preferred approach. The repository layer adapts query execution to the declared type and performs transparent forward-pagination while consuming the stream.
Spring Data's repository approach requires certain method signatures, which might not be practical in every scenario.
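A minimal sketch of how such a stream might be consumed, written in plain Java so it runs without Spring. PersonRepository, streamAllBy(), and the in-memory fake are hypothetical stand-ins for a real Spring Data repository. The part worth noting is the try-with-resources block: a repository stream may hold on to store-specific resources until it is closed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

class StreamConsumptionSketch {
    // Records whether the stream was closed; for illustration only.
    static final AtomicBoolean closed = new AtomicBoolean(false);

    // In-memory fake; a real repository would fetch further pages while the stream is consumed.
    static PersonRepository fakeRepository(List<String> rows) {
        return () -> rows.stream().onClose(() -> closed.set(true));
    }

    // Processes rows one at a time and closes the stream when done.
    static long countNonEmpty(PersonRepository repository) {
        try (Stream<String> rows = repository.streamAllBy()) {
            return rows.filter(row -> !row.isEmpty()).count();
        }
    }

    public static void main(String[] args) {
        PersonRepository repository = fakeRepository(Arrays.asList("alice", "", "bob"));
        System.out.println(countNonEmpty(repository));
    }
}

// Hypothetical stand-in for a repository interface that declares a Stream return type.
interface PersonRepository {
    Stream<String> streamAllBy();
}
```

Because the rows are handled inside the stream pipeline, nothing in this sketch collects the whole result into a list, which is the property that matters for very large tables.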

Shall I use a DTO or not?

I'm building a web application with Spring, and I'm at the point where I have an Entity, a Repository, a RestController, and I can access endpoints in my browser.
I'm now trying to return JSON data to the browser, and I'm seeing all of this stuff about DTOs in various guides.
Do I really need a DTO? Can't I just put the serialization logic on the entity itself?
I think this is a somewhat debatable question, where the short answer would be:
It depends.
A slightly longer answer
There are plenty of people who, in plenty of cases, would prefer one approach (using DTOs) over another (using bare entities), and vice versa; however, there is no single source of truth on which is better to use.
It very much depends on the requirements, the architectural approach you decide to stick with, personal preference (even), and other project-specific details.
Some even claim that DTO is an anti-pattern; some love using them; some think that data refinement/adjustment should happen on the consumer/client side (for various reasons, one of which can be No Policy for API changes).
That being said, YES, you can simply return the @Entity instance (or list of entities) right from your controller, and there is no problem with this approach. I would even say that this does not necessarily violate anything from the SOLID or Clean Code principles. Again, it depends on what you use the response for, what representation of the data you need, what the capacity and purpose of the object in question should be, and so on.
DTO is generally a good practice in the following scenarios:
When you want to aggregate the data for your object from different resources, i.e. you want to put some object transformation logic between the Persistence Layer and the Business (or Web) Layer:
Imagine you fetch a List<Employee> from your database; however, from a third-party web service you also receive some complementary data for each Employee object, which you have to aggregate into the Employee objects (aggregate, do some calculation, and so on; the point is that you want to combine data from different resources). This is a good case for the DTO pattern. It is reusable, it conforms to the Single-Responsibility Principle, and it is well segregated from other layers;
When you don't necessarily combine data received from different sources, but you want to modify the entity which you will be returning:
Imagine you have a very big Entity (with a lot of fields), and the client that calls the corresponding endpoint (a front-end application, a mobile app, or any other client) has no need for this huge entity (or list of entities). If you still send the original, unchanged entity despite the client's requirement, you will consume network bandwidth inefficiently, performance will suffer, and you will generally waste computing resources for no good reason. In this case, you might want to transform your original Entity into a DTO that contains only the fields the client requires. Here, you might even want to implement different DTO classes for one entity, for different consumers/clients.
However, if you are sure that your table/relation representations (instances of @Entity classes) are exactly what the client needs, I see no necessity for introducing DTOs.
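To make the second scenario concrete, here is a minimal sketch in plain Java. Employee, EmployeeSummaryDto, and their fields are hypothetical names chosen for illustration; a real entity would carry persistence annotations, and the mapping could just as well be done by a mapping library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DtoMappingSketch {
    // Converts entities to DTOs before they leave the service layer.
    static List<EmployeeSummaryDto> toSummaries(List<Employee> employees) {
        List<EmployeeSummaryDto> result = new ArrayList<>();
        for (Employee employee : employees) {
            result.add(EmployeeSummaryDto.from(employee));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Employee> employees = Arrays.asList(
                new Employee(1L, "Ada", "ada@example.com", "B3"),
                new Employee(2L, "Linus", "linus@example.com", "C1"));
        for (EmployeeSummaryDto dto : toSummaries(employees)) {
            System.out.println(dto.id + " " + dto.name);
        }
    }
}

// Hypothetical entity with more fields than a given client needs.
class Employee {
    final long id;
    final String name;
    final String email;
    final String salaryBand; // internal detail, not meant for the client

    Employee(long id, String name, String email, String salaryBand) {
        this.id = id;
        this.name = name;
        this.email = email;
        this.salaryBand = salaryBand;
    }
}

// Hypothetical DTO that carries only the fields this client requires.
class EmployeeSummaryDto {
    final long id;
    final String name;

    EmployeeSummaryDto(long id, String name) {
        this.id = id;
        this.name = name;
    }

    // The mapping lives outside the entity, so the entity stays a persistence concern.
    static EmployeeSummaryDto from(Employee employee) {
        return new EmployeeSummaryDto(employee.id, employee.name);
    }
}
```

A second client with different needs would get its own DTO class and its own from method, leaving Employee untouched.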
Further supporting the idea that an @Entity can be returned to the presentation layer without a DTO:
Java Persistence with Hibernate, Second Edition, in §3.3.2, even states explicitly that:
You can reuse persistent classes outside the context of persistence, in unit tests or in the presentation layer, for example. You can create instances in any runtime environment with the regular Java new operator, preserving testability and reusability;
Hibernate entities do not need to be explicitly Serializable;
You might also want to have a look at this question.
In general, it’s up to you to decide. If your application is relatively simple, you don’t expose any sensitive information, and the response is unambiguous for the client, there is nothing criminal in returning the whole entity. If your client expects a small slice of the entity, e.g. only 2-3 fields from a 30-field entity, then it makes sense to do the translation or to consider a different protocol such as GraphQL.
Ideally, you should not expose the entity.
It is good design to convert your entity to a DTO before passing it to the web layer.
These days RestJpacontrollers are also available.
But again, which one to use varies from application to application.
If your application only needs read-only operations, it makes sense to use RestJpacontrollers and to use the entity at the web layer.
If, on the other hand, the application modifies data frequently, the better option is to use a DTO at the UI layer.
Another case is when multiple requests are required to bring the data for a particular task. The data can then be combined in a DTO so that a single request brings all the required data.
We can combine data from multiple entities into one DTO.
This DTO can be used for the front end or in the rest API.
Do I really need a DTO? Can't I just put the serialization logic on the entity itself?
I'd say you don't, but it is better to use them, according to the SOLID principles, namely the single responsibility one. Entities belong to the ORM and should be used to interact with the database, not be serialized and passed to other layers.

Spring Data MongoDB Reactive - Dealing with findAll for a large number of documents?

Let's say I have a ReactiveMongoRepository defined like this:
@Repository
interface MyRepo extends ReactiveMongoRepository<MyDTO, String> {}
Given that the repository contains a lot of MyData documents (hundreds of thousands at least) and you do a simple "findAll()" followed by a deletion:
myRepo.findAll()
    .doOnNext(myDto -> System.out.println(myDto.message))
    // flatMap has to return the Mono produced by deleteById, otherwise the lambda does not compile
    .flatMap(myDto -> myRepo.deleteById(myDto.id))
This will be executed roughly once a month.
Is it safe to use Spring Data / MongoDB like this when streaming large sets of data? Or is it recommended to use some sort of batching or pagination to avoid cursor issues etc.?
The general answer is that it depends, but in your specific case my opinion is no, at least not in the way you presented it.
First of all, I would guess that a find-all operation over an entire collection makes very little sense.
I suppose that finding a use case that needs to handle hundreds of thousands of documents this way is nearly impossible. If you have implemented a data ingestion pipeline then yes, you have to handle an unbounded stream of data, but for that use case I can suggest a more suitable architecture, such as streaming with Kafka using Spring Cloud Stream.
The problem is not the ability to handle a lot of data: the reactive Mongo driver is very performant, and by tuning the back-pressure mechanism you should be able to protect your server. Rather, a findAll over such a large stream is rarely applicable. If you need to handle a stream of data, a messaging middleware with Spring Cloud Stream may be the best option. Imagine you fire a findAll: your server and Mongo will probably be fine, but your user will wait many hours before the request finishes. If instead the use case is an offline process, then as said before, Spring Cloud Stream may be the best option for processing an unbounded data stream.
UPDATE
Considering the use case of, let's say, a batch job that runs once per month, the picture changes a lot.
Reading the code of Spring data reactive mongo I see that:
@NoRepositoryBean
public interface ReactiveMongoRepository<T, ID> extends ReactiveSortingRepository<T, ID>, ReactiveQueryByExampleExecutor<T> {
....
}
instead of
@NoRepositoryBean
public interface MongoRepository<T, ID> extends PagingAndSortingRepository<T, ID>, QueryByExampleExecutor<T> {
...
}
The key point here is that the reactive version of the repository does not have the pagination feature; indeed, the name of its base interface does not contain the word Paging. What matters is the kind of technology.
With blocking I/O, pagination is necessary because of the one-thread-per-request model: keeping a connection and the client busy for the whole duration of a query is risky in terms of timeouts and load, and splitting the query into pages helps avoid stressing the system too much. With non-blocking I/O the behavior is different: you attach to a stream of data. The driver is non-blocking; Spring Data does not use the classic Mongo driver but the dedicated reactive Mongo driver, which is optimized for this job and based on an event-loop model.
That said, for an offline process the question is less whether an I/O-intensive model is safe and more whether it is useful. The reactive model suits software that is mainly I/O bound with high traffic, because it supports high concurrency. If your use case is cleaning a collection once per month, reactive programming is probably safe, since it is designed for I/O-intensive workloads, but a classic blocking batch model with pagination is the more suitable approach. In short: it should be safe, because the driver is designed to manage large amounts of data in high-throughput streaming use cases, but it brings little benefit for a batch use case.
I hope this helps.

Am I misusing GraphQL if I must decompose REST data, then re-aggregate it?

We are considering using GraphQL on top of a REST service (using the
FHIR standard for medical records).
I understand that the pattern with GraphQL is to aggregate the results
of multiple, independent resolvers into the final result. But a
FHIR-compliant REST server offers batch endpoints that already aggregate
data. Sometimes we’ll need à la carte data—a patient’s age or address
only, for example. But quite often, we’ll need most or all of the data
available about a particular patient.
So although we can get that kind of plenary data from a single REST call
that knits together multiple associations, it seems we will need to
fetch it piecewise to do things the GraphQL way.
An optimization could be to eager load and memoize all the associated
data anytime any resolver asks for any data. In some cases this would be
appropriate while in other cases it would be serious overkill. But
discerning when it would be overkill seems impossible given that
resolvers should be independent. Also, it seems bloody-minded to undo
and then redo something that the REST service is already perfectly
capable of doing efficiently.
So—
Is GraphQL the wrong tool when it sits on top of a REST API that can
efficiently aggregate data?
If GraphQL is the right tool in this situation, is eager-loading and
memoization of associated data appropriate?
If eager-loading and memoization is not the right solution, is there
an alternative way to take advantage of the REST service’s ability
to aggregate data?
My question is different from this question and this question because
neither touches on how to take advantage of another service’s ability to
aggregate data.
An alternative approach would be to parse the request inside the resolver for a particular query. The fourth parameter passed to a resolver is an object containing extensive information about the request, including the selection set. You could then await the batched request to your API endpoint based on the requested fields, and finally return the result of the REST call, and let your lower level resolvers handle parsing it into the shape the data was requested in.
Parsing the info object can be a PITA, although there are libraries out there for that, at least in the Node ecosystem.

My Concerns about Spring-Batch that you cant actually multi-thread/read in chunks while reading items

I was trying to batch-process a simple file. I understand that I couldn't multi-thread it, so at least I tried to get better performance by increasing the chunk param:
@Bean
public Step processFileStep() {
    return stepBuilderFactory.get("processSnidFileStep")
        .<MyItem, MyItem>chunk(10)
        .reader(reader())
        ....
My logic needs the processor to 'filter' out non-valid records.
But then I found out that the processor is not able to get chunks, only one item at a time:
public interface ItemProcessor<I, O> {
O process(I item) throws Exception;
}
In my case I need to access the database and validate my record there, so for each item I have to query the DB (instead of doing it for a bunch of items together).
Can't I multi-thread or make my process perform better? What am I missing here? It will take too long to process each record one by one from a file.
Thanks.
From past discussions, the CSV reader may have serious performance issues. You might be better served by writing a reader using another CSV parser.
Depending on your validation data, you might create a job-scoped filter bean that wraps a Map that can be either preloaded very quickly or lazily loaded. This way you would limit the hits on the database to either initialization or first reference (respectively), and reduce the filter time to a hash map lookup.
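A minimal sketch of the lazily loaded variant, in plain Java so it runs without Spring Batch. The Processor interface only mirrors the shape of ItemProcessor; LookupFilterProcessor and the loader are hypothetical, and a Set is used instead of a Map because only membership matters here. The sketch assumes a single-threaded step; a multi-threaded one would need synchronization around the lazy load.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

class PreloadedFilterSketch {
    public static void main(String[] args) {
        // Hypothetical loader; in practice this would be a single database query.
        Supplier<Set<String>> loader = () -> new HashSet<>(Arrays.asList("A-1", "B-2"));
        LookupFilterProcessor processor = new LookupFilterProcessor(loader);
        System.out.println(processor.process("A-1")); // item is kept
        System.out.println(processor.process("Z-9")); // item is filtered out
    }
}

// Mirrors the shape of Spring Batch's ItemProcessor so the sketch runs without Spring.
interface Processor<I, O> {
    O process(I item) throws Exception;
}

// Hypothetical filter: loads the valid keys once, then answers from memory.
class LookupFilterProcessor implements Processor<String, String> {
    private final Supplier<Set<String>> loader;
    private Set<String> validKeys; // lazily loaded on first reference

    LookupFilterProcessor(Supplier<Set<String>> loader) {
        this.loader = loader;
    }

    @Override
    public String process(String item) {
        if (validKeys == null) {
            validKeys = loader.get(); // the only hit on the backing store
        }
        // Returning null is how an ItemProcessor signals that an item is filtered out.
        return validKeys.contains(item) ? item : null;
    }
}
```

The preloaded variant would simply call the loader in the constructor instead of on first use.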
In the Spring Batch chunk-oriented processing architecture, the only component where you get access to the complete chunk of records is the ItemWriter.
So if you want to do any kind of bulk processing this is where you would typically do that. Either with an ItemWriteListener#beforeWrite or by implementing your own custom ItemWriter.
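As a rough illustration of chunk-level validation in a custom writer, here is a plain-Java sketch. BulkValidatingWriter, its bulkLookup function, and the sink list are hypothetical; the write method only imitates the list-based ItemWriter signature, and a real implementation would query the database once per chunk and delegate to an actual writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class ChunkValidationSketch {
    public static void main(String[] args) {
        // Hypothetical bulk lookup: one call answers for the whole chunk.
        Function<Set<String>, Set<String>> bulkLookup = keys -> {
            Set<String> valid = new HashSet<>(keys);
            valid.retainAll(Arrays.asList("A-1", "B-2"));
            return valid;
        };
        List<String> sink = new ArrayList<>();
        new BulkValidatingWriter(bulkLookup, sink).write(Arrays.asList("A-1", "Z-9", "B-2"));
        System.out.println(sink);
    }
}

// Hypothetical writer shaped like a list-based ItemWriter: it sees the whole chunk at once.
class BulkValidatingWriter {
    private final Function<Set<String>, Set<String>> bulkLookup;
    private final List<String> sink;

    BulkValidatingWriter(Function<Set<String>, Set<String>> bulkLookup, List<String> sink) {
        this.bulkLookup = bulkLookup;
        this.sink = sink;
    }

    void write(List<? extends String> items) {
        // One lookup per chunk instead of one per item.
        Set<String> validKeys = bulkLookup.apply(new HashSet<>(items));
        for (String item : items) {
            if (validKeys.contains(item)) {
                sink.add(item); // stand-in for the real write
            }
        }
    }
}
```

With a chunk size of 10, as in the question, this turns ten single-row queries into one query per chunk.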

If I expose IQueryable from my service layer, wouldn't the database calls be less if I need to grab information from multiple services?

If I expose IQueryable from my service layer, wouldn't the database calls be less if I need to grab information from multiple services?
For example, I'd like to display 2 separate lists on a page, Posts and Users. I have 2 separate services that provide these lists. If both provide IQueryable, will they be joined in 1 database call? Each repository creates a context for itself.
It's best to think of an IQueryable<T> as a single query waiting to be run. So if you return 2 IQueryable<T> instances and run them in the controller, it wouldn't be any different from running them separately in their own service methods. Each time you execute an IQueryable<T> to get results, it runs its query by itself, independent of other IQueryable<T> objects.
The only time (as far as I know) it will make an impact is if there is enough time between the two service calls that the database connection might close, but you would need a considerable amount of time between the service calls for that to be the case.
Returning IQueryable<T> to the controller still has some usefulness, such as easier handling of paging and sorting (so sorting is done in the controller and not in the service layer, which doesn't necessarily care about how data is sorted or paged). This isn't a performance concern though, and people will disagree about whether it's best to do this in the controller or not (I've seen reputable developers do this and give well thought out reasons why).
No. The best an IQueryable can do is reduce the number of calls within a singular database context. An IQueryable will not cross contexts.
Personally, I don't use IQueryables past the repositories for a number of reasons:
1) I don't use the same domain objects as database objects, and seeing "no translation to SQL" pisses me off ;)
2) I don't like the necessary structure for IQueryables in views: foreach (var item in collection){var tempItem = item; code on tempItem}
3) I've come up with a method of passing generic filters to the data layer (LinqKit and PredicateBuilder are gods)
If these reasons don't apply to you, of course you should feel free to use IQueryables to whichever layer you desire.
Not with two different contexts.
Definitely NO. It's a leaky abstraction.
It allows abominations like this:
q.Where(x=>{Console.WriteLine("fail");return true;});
Thing is - when exposing IQueryable, You are saying that Your data layer fully supports linq to objects.
If you make two method calls you will make two queries.
You can combine the methods into a single method which gets all the data at once.
If you are implementing the repository pattern you will have an easier time if you instantiate one database context per request.
Your service layer is exactly that, a layer which serves up what you need. Often times my service layers are named things like SearchService which has methods for returning every packaged collection I will ever need (the actual view models themselves). And if I ever need a new search, my service layer gets a new method. The backing for your service layer can then contain any data backing or persistence model you would like, be it a repository or Entity Framework provider, etc.
To answer your question though, the line needs to be drawn at the service layer, all queries need to be contained within it and only data returned.
