hibernate indexing time based UUID - performance

We currently have a PostgreSQL database that uses UUIDs as primary keys. We are using Hibernate as an ORM tool to manage PostgreSQL, with the dialect
"org.hibernate.dialect.PostgreSQLDialect".
I have read an article (http://mysql.rjweb.org/doc.php/uuid) saying that there is a way to increase read/write performance if "time-based" / "version 1" UUIDs are used. To do that, we need to rearrange the third part of the UUID so that it becomes the first one. Then we get a number that increases nicely over time. The idea is to put the index on such a column, so the I/O becomes even faster.
Now the question is: has anybody tried this in a table with hundreds of millions of rows, and if so, does this change make sense? I mean, UUIDs are not an idea discovered today, and I bet there is a reason why this has NOT been done so far.
How can I produce such UUIDs using Hibernate? There are five generator types, and as far as I understand none of them produces those time-based, rearranged UUIDs. How should the field "private UUID studentId;" look, i.e. which annotations should be used to tell Hibernate to generate such a UUID?
import java.util.UUID;

@Id
@Index
@Column(name = "STUNDENT_ID")
@NotNull
private UUID studentId;
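Out of the box, none of Hibernate's UUID strategies emits the rearranged, time-first layout, but you can plug in your own generator and reference it from the mapping. Below is a minimal sketch under these assumptions: the class and package names (com.example.TimeOrderedUuidGenerator) are mine, and the bit layout simply puts the current timestamp into the most significant bits rather than performing a strict RFC 4122 version-1 rearrangement.

import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

import org.hibernate.engine.spi.SharedSessionContractImplementor;
import org.hibernate.id.IdentifierGenerator;

// Sketch: the high 64 bits start with the current timestamp, so ids generated
// later sort higher and cluster at the right-hand side of the index.
public class TimeOrderedUuidGenerator implements IdentifierGenerator {

    @Override
    public UUID generate(SharedSessionContractImplementor session, Object entity) {
        long msb = System.currentTimeMillis() << 16;          // timestamp goes first
        long lsb = ThreadLocalRandom.current().nextLong();    // randomness against collisions
        return new UUID(msb, lsb);
    }
}

The field could then be mapped with @GenericGenerator, for example:

@Id
@GeneratedValue(generator = "time-ordered-uuid")
@GenericGenerator(name = "time-ordered-uuid", strategy = "com.example.TimeOrderedUuidGenerator")
@Column(name = "STUNDENT_ID")
@NotNull
private UUID studentId;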

Related

Where does spring/hibernate store the Id?

I am using Spring with a basic PostgreSQL database (everything is a default value except the access credentials) and, using JPA, I get the expected Id increment when using @Id and @GeneratedValue on my @Entity. But when I drop the entire table, I notice that the Id keeps incrementing from the previous (and deleted) values.
Where are the Id values stored?
From the Hibernate documentation for identifier generators:
AUTO (the default)
Indicates that the persistence provider (Hibernate) should choose an appropriate generation strategy.
You didn't list GenerationType as one of the annotations present, so it would default to AUTO. From the documentation for how AUTO works:
If the identifier type is numerical (e.g. Long, Integer), then Hibernate is going to use the IdGeneratorStrategyInterpreter to resolve the identifier generator strategy. The IdGeneratorStrategyInterpreter has two implementations:
FallbackInterpreter
This is the default strategy since Hibernate 5.0. For older versions, this strategy is enabled through the hibernate.id.new_generator_mappings configuration property. When using this strategy, AUTO always resolves to SequenceStyleGenerator. If the underlying database supports sequences, then a SEQUENCE generator is used. Otherwise, a TABLE generator is going to be used instead.
Postgres supports sequences, so you get a sequence. From a bit farther down in the same document:
The simplest form is to simply request sequence generation; Hibernate will use a single, implicitly-named sequence (hibernate_sequence) for all such unnamed definitions.
Hibernate asks Postgres to create a sequence. The sequence keeps track of which ids have been handed out, and the database persists this internally. You should be able to get into the admin UI of the database and reset this sequence if you want.
To clarify, a database sequence is a database object independent of any tables (multiple tables can use the same sequence), so in general dropping a table won't affect any sequences. The exception is when you're using auto-increment, in which case there is an ownership relationship, and the sequence implementing the auto-increment is reset when the table is dropped.
It's a judgment call on Hibernate's part whether to make the default implementation of id generation use a sequence directly or auto-increment. If it used auto-increment you would see the values get recycled like you expected, but with the sequence there is no automatic reset.
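If you do want the counter to start over after dropping a table, you can also reset the sequence yourself. A minimal sketch via a JPA native query, assuming PostgreSQL and the default implicit sequence name hibernate_sequence:

import javax.persistence.EntityManager;

// Resets the implicit Hibernate sequence so the next generated id starts at 1 again.
// Run inside a transaction; the sequence name is the default one and may differ in your setup.
public static void resetHibernateSequence(EntityManager em) {
    em.createNativeQuery("ALTER SEQUENCE hibernate_sequence RESTART WITH 1").executeUpdate();
}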

How to update ReadModel of an Aggregate that has an association with another Aggregate

I'm trying to separate read and write models. In summary, I have these two entities with an association between them:
// AggregateRoot
class ProfessionalFamily {
    private ProfessionalFamilyId id;
    private String name;
}

// AggregateRoot
class Group {
    private GroupId id;
    private String literal;
    private ProfessionalFamilyId professionalFamilyId; // ManyToOne association referenced by the ID of "professional-family"
}
The read model I'm using to return data in a grid is the following one:
class GroupReadModel {
    private String id;
    private String groupLiteral;
    private String professionalFamilyName;
}
I want to use NoSql for the read-model queries and keep them separate from the write models. But my headache is this: with that approach, when a Group is created I fire an event (GroupCreated), and an event handler listens for the event and stores the Read/View/Projection model in the NoSql database. So my question is: if I need to update the ProfessionalFamily name and it is related to more than, for example, 1000 groups (there are many more groups), how can I update all the groups in the read model that are related to the ProfessionalFamily I've just updated? Most probably I'm not doing anything well.
Thanks a lot.
NoSql databases are usually not designed to support data normalization and even intentionally break with this concept. If you used a relational database system, you would usually normalize your data, and for each group you would only store the id of the ProfessionalFamily rather than duplicating its name in each group document. So, in general, duplication is accepted for NoSql databases.
But I think before deciding to go with NoSql or a relational database you should consider (at least) the following:
Priority for speed of reads vs. writes:
If you need your writes (in your case, changes of the name) to be very fast because they happen very often, and read speed is of lower priority, maybe NoSql is not the best choice. You could still look into technology such as MongoDB, which provides some kind of hybrid approach and allows you to normalize and index data to a certain extent.
Writes will usually be faster with a normalized structure in a relational database, whereas reads will normally be faster without normalization and with duplication in a NoSql database. But this of course depends on the technologies you are comparing, on the number of entities (in your case, Groups) we are talking about, and on the amount of cross-referenced data. If you need to do lots of joins during reads because of normalization, your read performance will usually be worse compared to Group documents where all required data is already there thanks to duplication.
Control over the data structure/schema
If you are the one who knows what the data will look like, you might not need the main advantage of a NoSql database, which is being very well suited for data structures that change frequently or that you do not control. If that is not really your case, you might not benefit enough from NoSql technology.
And in addition there is another thing to consider: how consistent does your read-model data have to be? As you are following some kind of event-sourcing approach, I guess you are already embracing eventual consistency. That means not only that the event processing is performed asynchronously, but you could also accept that - getting back to your example - not all groups are updated with the new family name at the same time, but asynchronously as well, or via some background jobs, provided it is not a problem that one Group still shows the old name while another group already shows the new name for some time.
Most probably I'm not doing anything well.
You are not doing anything wrong or right per se by choosing this approach, as long as you decide for (or against) NoSql for the right reasons, which include these considerations.
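If you do embrace eventual consistency, the fan-out itself can be a plain asynchronous handler that updates every affected read-model document. A hedged sketch with Spring Data MongoDB follows; it assumes the read model also keeps the professionalFamilyId for correlation (the class shown above only holds the name), and the class and method names are mine:

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

// Event handler sketch: when the family is renamed, update the denormalized
// name in every GroupReadModel document that references it.
public class ProfessionalFamilyRenamedHandler {

    private final MongoTemplate mongoTemplate;

    public ProfessionalFamilyRenamedHandler(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    public void on(String professionalFamilyId, String newName) {
        mongoTemplate.updateMulti(
            Query.query(Criteria.where("professionalFamilyId").is(professionalFamilyId)),
            Update.update("professionalFamilyName", newName),
            GroupReadModel.class);
    }
}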
My team and I discussed a similar scenario recently and we solved it by changing our CRUD approach to a DDD approach. Here is an example:
Given a traveler with a list of visited destinations.
If I have an event such as destinationUpdated, then I should loop across every traveler like you said, but does it make sense? What does destinationUpdated mean from a user's point of view? Nothing! You should find the real user intent.
If the traveler made a mistake entering his visited destination, then your event should be travelerCorrectedDestination, which solves the problem because travelerCorrectedDestination now contains the traveler ID, so you don't have to loop through all travelers anymore.
By applying a DDD approach, problems usually solve themselves.
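To make that concrete, here is a minimal sketch of such an intent-revealing event (the names are illustrative, not taken from the question):

// Because the event carries the traveler's id, the projection handler can
// update a single read-model document instead of scanning all travelers.
public final class TravelerCorrectedDestination {

    private final String travelerId;
    private final String oldDestination;
    private final String newDestination;

    public TravelerCorrectedDestination(String travelerId, String oldDestination, String newDestination) {
        this.travelerId = travelerId;
        this.oldDestination = oldDestination;
        this.newDestination = newDestination;
    }

    public String getTravelerId() { return travelerId; }
    public String getOldDestination() { return oldDestination; }
    public String getNewDestination() { return newDestination; }
}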

Aerospike reporting requirement | Hourly frequent scans required

We have the Aerospike set data model below in prod. We are relying ONLY on Aerospike as our data store. Now we need to generate an hourly report for the sales team: a report detailing the number of customers acquired in every hour.
@Document(collection = "cust")
public class Customer {

    @Id
    @Field(value = "PK")
    private String custId;

    @Field(value = "mobileNumber")
    private String mobileNumber;

    @Field(value = "status")
    private String customerStatus;

    @Field(value = "creationTime")
    private String creationTime;

    @Field(value = "corrDetails")
    private HashMap<String, Object> corrDetails;
}
Concerns I need help with:
a.) How can the same be achieved while avoiding secondary indices? We don't have any secondary indexes in production and would like to avoid them.
b.) Is there a way such reports can be generated, since we DON'T have MySQL / an RDBMS replicating the data underneath?
c.) Do frequent Aerospike set scans lead to deterioration in performance?
Aerospike can scan/query for records whose 'last update time' (LUT) is greater than a particular value. Assuming there are no other updates to the set you are talking about, you should be able to exploit this feature. Also, it seems like you only need the count and do not need the details of the users acquired in the last hour. In that case, you can avoid fetching bin data, which makes the scan/query even more efficient.
An Aerospike scan based on LUT is going to be efficient because the LUT is part of the primary index and held in memory. However, each scan needs to walk the entire in-memory primary index to compare LUTs. So it is not as efficient as a secondary index, but it is possibly still a better tradeoff overall given the other overheads that come with secondary indices. Be careful not to overwhelm the system with too many scans, though. Maybe you can cache a summary in Aerospike itself and keep refreshing it.
You can take a look at the Java client example of how to do a scan with a predicate expression (a query without a where clause on a bin). Refer to the runQuery2 function in the example. You do not need an end time for your use case. To avoid fetching bin data, you can set includeBinData to false in the QueryPolicy.
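For illustration, here is a hedged sketch with the Aerospike Java client using the newer expression filters (Exp) instead of the older PredExp API from the linked example; the namespace and set names are assumptions:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.exp.Exp;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class HourlyAcquisitionCount {

    // Counts records whose last-update-time is at or after the given window start.
    public static long countCustomersSince(AerospikeClient client, long sinceEpochMillis) {
        Statement stmt = new Statement();
        stmt.setNamespace("prod");      // assumption: adjust to your namespace
        stmt.setSetName("cust");

        QueryPolicy policy = new QueryPolicy();
        policy.includeBinData = false;  // we only need the count, not bin data
        // Keep only records whose LUT (nanoseconds since epoch) is >= the window start.
        policy.filterExp = Exp.build(
            Exp.ge(Exp.lastUpdate(), Exp.val(sinceEpochMillis * 1_000_000L)));

        long count = 0;
        try (RecordSet rs = client.query(policy, stmt)) {
            while (rs.next()) {
                count++;
            }
        }
        return count;
    }
}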

What's the effect of removing @GeneratedValue from the Id on the performance

I'm working on an RCP application that communicates with a Tomcat server using REST. Since we've been getting more and more data, the load/copy routines are slowly but surely becoming obsolete. It sometimes takes me minutes to execute a copy operation. So I'm looking for some advice on how to speed up my routines.
Here are the technologies I'm using:
RCP client (e4 platform)
Tomcat 8 server
Oracle DB
JDBC as API, with Hibernate
REST
First things first: I checked the entities, and pretty much all of them look like the code below.
@Entity
@SequenceGenerator(name = "CHECKITEM_SEQ", sequenceName = "CHECKITEM_SEQ", allocationSize = 1)
public class CheckItem extends AbstractTreeNode implements Serializable, Cloneable {...}
Since I use these generated values as the primary key, I figured that when copying data (most of the time more than 200K records per operation),
@Id
@GeneratedValue(generator = "CHECKITEM_SEQ", strategy = GenerationType.SEQUENCE)
public Integer getId() {
    return id;
}
the DB must generate a sequence value per object and check the constraint on it. So I was wondering how much performance I would gain if I removed the sequence, since I don't really use/need it in the DB. Now my questions:
Is there anything that speaks against removing a constraint (the primary key in this particular case) in the DB?
Does anyone have more/better suggestions on how to increase the performance of the DB for such operations?
Can you point me to a tutorial or document which can help me through this process?
I hope I was clear enough, and I will appreciate any kind of help. Thanks already.
The problem with using @GeneratedValue identifiers is that, in order for Hibernate to place the new entity into the Persistence Context (the first-level cache), it must know the identifier. So when you're using IDENTITY- or SEQUENCE-based identifiers, this can keep the JDBC driver from being able to adequately batch insert operations.
For example, you illustrated that most of your entities use the following sequence generation:
@SequenceGenerator(
    name = "CHECKITEM_SEQ",
    sequenceName = "CHECKITEM_SEQ",
    allocationSize = 1)
So whenever a persist operation for an entity happens, you're telling the sequence generator to only generate one value, so the JDBC communication looks like this:
1. Get Next Sequence
2. Insert
3. Get Next Sequence
4. Insert
5. Get Next Sequence
6. Insert
As seen here, we cannot batch the insert operations because we must fetch the identifier for each insert before the insert can happen. One solution to minimize that impact and enable batch inserts is to use a larger allocationSize.
1. allocationSize=10 -> Get Next 10 sequences
2 - 11. Perform 10 inserts in batch
Repeat
As you can see here, the driver can do 10 inserts in a batch: Hibernate allocates the sequence values in batches of 10, so the inserts can happen much faster.
Obviously this comes with a small drawback: if you allocate 10 sequence values but the remaining batch only needs to insert 6 entities, you've wasted 4 sequence values; but you gain the performance of being able to do JDBC batch inserts.
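For illustration, a hedged sketch of what the mapping could look like with a larger allocation size (50 is an arbitrary example; tune it to your batch size):

@Entity
@SequenceGenerator(
    name = "CHECKITEM_SEQ",
    sequenceName = "CHECKITEM_SEQ",
    allocationSize = 50)   // one database round trip now hands out 50 ids
public class CheckItem extends AbstractTreeNode implements Serializable, Cloneable {
    // ...
}

Combined with the hibernate.jdbc.batch_size property (and optionally hibernate.order_inserts=true), Hibernate can then flush the inserts to the driver in batches instead of one statement at a time.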
The next logical step would be to determine whether you can eliminate the use of @GeneratedValue altogether, as that would give you the maximum performance with batch inserts for your copy operations; however, that may not be possible with your data model. In the past, when I dealt with moving large volumes of data, I tried to define the primary key based on natural keys from the data, without involving a surrogate key if possible.
Feel free to read more about JDBC batch operations here.

I want to have number sequence as IDs in Spring Data MongoDB persistence layer. How to configure this behavior?

Spring Data MongoDB is still generating alphanumeric ObjectIds even though my ids are declared as BigInteger. I want to have a number sequence as IDs. How do I configure this behavior?
Spring Data MongoDB tries to convert all types that could make up ObjectIds into ObjectIds, as those are the recommended id type. As described in the MongoDB reference documentation, this is because they allow creating steadily increasing ids across a cluster. If you really need linear ids (1, 2, 3…, not just steadily increasing ones), use a type of Long and create the ids manually.
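A common way to "create ids manually" is a small counters collection that is incremented atomically with findAndModify. A hedged sketch using Spring Data MongoDB; the collection name, field names, and the DatabaseSequence/SequenceGeneratorService classes are my own:

import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.FindAndModifyOptions;
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.mapping.Document;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

// One document per sequence, e.g. { "_id": "customer_seq", "value": 42 }
@Document(collection = "database_sequences")
class DatabaseSequence {
    @Id
    private String id;
    private long value;

    public long getValue() { return value; }
}

class SequenceGeneratorService {

    private final MongoOperations mongoOperations;

    SequenceGeneratorService(MongoOperations mongoOperations) {
        this.mongoOperations = mongoOperations;
    }

    // Atomically increments and returns the next value; creates the counter on first use.
    long nextId(String seqName) {
        DatabaseSequence counter = mongoOperations.findAndModify(
            Query.query(Criteria.where("_id").is(seqName)),
            new Update().inc("value", 1),
            FindAndModifyOptions.options().returnNew(true).upsert(true),
            DatabaseSequence.class);
        return counter != null ? counter.getValue() : 1L;
    }
}

You would then assign the returned value to a Long-typed @Id field before saving the entity.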
As per the Spring Data doc:
An id property or field declared as BigInteger in the Java class will be converted to and stored as an ObjectId using a Spring Converter
What exactly are you trying to represent with the _id?
If it's just a large number, using a long value will let you represent 64-bit numbers.
If you need to represent values larger than 64 bits, then they would have to be represented either as a String or as BinData in Mongo, but not as an ObjectId, since that is a fixed 12 bytes.
