Aerospike reporting requirement | frequent hourly scans required for reporting

We have the Aerospike set data model below in production. We rely ONLY on Aerospike as our data store. We now need to generate an hourly report for the sales team: a report detailing the number of customers acquired in each hour.
@Document(collection = "cust")
public class Customer {
    @Id
    @Field(value = "PK")
    private String custId;
    @Field(value = "mobileNumber")
    private String mobileNumber;
    @Field(value = "status")
    private String customerStatus;
    @Field(value = "creationTime")
    private String creationTime;
    @Field(value = "corrDetails")
    private HashMap<String, Object> corrDetails;
}
Concerns we need help with:
a) How can this be achieved while avoiding secondary indexes? We don't have any secondary indexes in production and would like to keep it that way.
b) Is there a way to generate this kind of report given that we DON'T have MySQL or another RDBMS replicating the data underneath?
c) Do frequent Aerospike set scans lead to performance deterioration?

Aerospike can scan/query for records whose 'last update time' (LUT) is greater than a particular value. Assuming there are no other updates to the set in question, you should be able to exploit this feature. Also, it seems you only need the count and not the details of the users acquired in the last hour. In that case, you can avoid fetching bin data, which makes the scan/query even more efficient.
An Aerospike scan based on LUT is efficient because the LUT is part of the primary index and held in memory. However, each scan still needs to walk the entire in-memory primary index to compare LUTs, so it is not as efficient as a secondary index, but it is possibly still a better trade-off overall given the other overheads that come with secondary indexes. Just be careful not to overwhelm the system with too many scans. Maybe you can cache the summary in Aerospike itself and keep refreshing it.
You can take a look at the Java client example of how to do a scan with a predicate expression (a query without a where clause on a bin). Refer to the runQuery2 function in that example. You do not need an end time for your use case. To avoid fetching bin data, set includeBinData to false in the QueryPolicy.
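As a rough sketch only (the host, namespace, and summary set/key names are assumptions; this uses the older PredExp API that the referenced example is based on, whereas newer clients use filter expressions instead), counting last-hour records and caching the summary could look roughly like this:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.QueryPolicy;
import com.aerospike.client.query.PredExp;
import com.aerospike.client.query.RecordSet;
import com.aerospike.client.query.Statement;

public class HourlyAcquisitionCount {

    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Records whose last-update-time (LUT) falls within the past hour.
        // LUT predicates are expressed in nanoseconds since the epoch.
        long cutoffNanos = (System.currentTimeMillis() - 3_600_000L) * 1_000_000L;

        Statement stmt = new Statement();
        stmt.setNamespace("prod");   // assumed namespace name
        stmt.setSetName("cust");     // set holding the Customer records
        stmt.setPredExp(
            PredExp.recLastUpdate(),
            PredExp.integerValue(cutoffNanos),
            PredExp.integerGreaterEq());

        // Only the count is needed, so skip bin data to keep the scan light.
        QueryPolicy policy = new QueryPolicy();
        policy.includeBinData = false;

        long acquiredLastHour = 0;
        RecordSet rs = client.query(policy, stmt);
        try {
            while (rs.next()) {
                acquiredLastHour++;
            }
        } finally {
            rs.close();
        }

        // Optionally cache the summary back into Aerospike, as suggested above
        // (the "hourlyReport" set and key name are made up for this sketch).
        Key summaryKey = new Key("prod", "hourlyReport", "customersLastHour");
        client.put(null, summaryKey, new Bin("count", acquiredLastHour));

        client.close();
    }
}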

Related

Javers - @DiffIgnore a byte array from being audited because of a performance issue

We are using Javers to audit a few entities; one entity has a relationship with another entity that stores the uploaded file data. Ideally we need the audit of the file data as well, to know whether a different file was uploaded. There is a significant performance issue when inserting data when Javers is enabled and the file binary (<1MB) is part of the entity. With the performance issue in mind, we added @DiffIgnore on the
@Type(type = "org.hibernate.type.BinaryType")
@Column(name = "file_data", columnDefinition = "BLOB NOT NULL")
@DiffIgnore
private byte[] filedata;
property, but there is no improvement in performance when inserting new entities.
Are there ways to improve performance here when the entity has a byte[] field?

How to update ReadModel of an Aggregate that has an association with another Aggregate

I'm trying to separate read and write models. In summary, I have these two entities with an association between them:
// AggregateRoot
class ProfessionalFamily {
    private ProfessionalFamilyId id;
    private String name;
}

// AggregateRoot
class Group {
    private GroupId id;
    private String literal;
    private ProfessionalFamilyId professionalFamilyId; // ManyToOne association, referenced by the ID of "professional-family"
}
The read model I'm using to return data in a grid is the following:
class GroupReadModel {
    private String id;
    private String groupLiteral;
    private String professionalFamilyName;
}
I want to use NoSql for the ReadModel queries and keep them separate from the write models. But here is my headache: with that approach, when a Group is created I fire an event (GroupCreated) and an event handler listens for the event and stores the Read/View/Projection model in the NoSql database. So my question is: if I need to update the ProfessionalFamily name, and it is related to more than, for example, 1000 groups (there are many more groups), how can I update all the Groups in the ReadModel that are related to the ProfessionalFamily I've just updated? Most probably I'm not doing anything well.
Thanks a lot.
NoSql databases are usually not designed to support data normalization and even intentionally break with this concept. If you used a relational database system you would usually normalize your data, and for each group you would only store the id of the ProfessionalFamily rather than duplicating the name of the ProfessionalFamily in each group document. So, in general, duplication is accepted for NoSql databases.
But I think before deciding to go with NoSql or a relational database you should consider (at least) the following:
Priority for speed of reads vs. writes:
If you need your writes (in your case, changes of the name) to be very fast because they happen very often, and read speed is of lower priority, maybe NoSql is not the best choice. You could still look into technology such as MongoDB, which provides some kind of hybrid approach and allows you to normalize and index data to a certain extent.
Writes will usually be faster with a normalized structure in a relational database, whereas reads will normally be faster without normalization and with duplication in a NoSql database. But this of course depends on the technologies being compared, on the number of entities (in your case, Groups) involved, and on the amount of cross-referenced data. If you need to do lots of joins during reads because of normalization, your read performance will usually be worse compared to Group documents where all required data is already present thanks to duplication.
Control over the data structure/schema
If you are the one who knows what the data will look like, you might not need the advantage of a NoSql database, which is very well suited for data structures that change frequently or that you are not in control of. If that is not really the case, you might not benefit enough from NoSql technology.
In addition, there is another thing to consider: how consistent does your read-model data have to be? As you have some kind of event-sourcing approach, I guess you are already embracing eventual consistency. That means not only is the event processing performed asynchronously, but you could also accept that, getting back to your example, not all groups are updated with the new family name at the same time; they can be updated asynchronously or via background jobs, if it is not a problem that one group still shows the old name while another group already shows the new name for some time.
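If, for example, the read store were MongoDB accessed through Spring Data (an assumption; the question does not name a concrete NoSql product), the handler for a hypothetical ProfessionalFamilyRenamed event could update every affected projection with a single multi-document update, sketched roughly like this:

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

public class GroupReadModelProjector {

    private final MongoTemplate mongoTemplate;

    public GroupReadModelProjector(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Hypothetical event handler: one bulk update over all denormalized group
    // documents; readers may briefly see the old name, which is the eventual
    // consistency trade-off described above.
    public void on(ProfessionalFamilyRenamed event) {
        Query affectedGroups = Query.query(
                Criteria.where("professionalFamilyId").is(event.professionalFamilyId));
        Update newName = Update.update("professionalFamilyName", event.newName);

        mongoTemplate.updateMulti(affectedGroups, newName, "groupReadModel"); // assumed collection name
    }

    // Minimal stand-in for the event type, just to keep the sketch self-contained.
    public static class ProfessionalFamilyRenamed {
        public final String professionalFamilyId;
        public final String newName;

        public ProfessionalFamilyRenamed(String professionalFamilyId, String newName) {
            this.professionalFamilyId = professionalFamilyId;
            this.newName = newName;
        }
    }
}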
Most probably I'm not doing anything well.
You are not doing anything wrong or right per se by choosing this approach, as long as you decide for (or against) NoSql for the right reasons, which include these considerations.
My team and I discussed a similar scenario recently and we solved it by changing our CRUD approach to a DDD approach. Here is an example:
Given a traveler with a list of visited destinations:
If I have an event such as destinationUpdated, then I should loop across every traveler as you said, but does that make sense? What does destinationUpdated mean from a user's point of view? Nothing! You should find the real user intent.
If the traveler made a mistake entering a visited destination, then your event should be travelerCorrectedDestination, which solves the problem: travelerCorrectedDestination now contains the traveler ID, so you don't have to loop through all travelers anymore.
By applying a DDD approach, problems usually solve themselves.
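To make the contrast concrete (the event and field names below are illustrative only, not from the original answer), the generic event forces a scan over every traveler's projection, while the intent-revealing event already identifies the single traveler to update:

// Generic, CRUD-style event: it does not say who changed what, so the read
// side would have to revisit every traveler that references the destination.
class DestinationUpdated {
    final String destinationId;
    final String newName;

    DestinationUpdated(String destinationId, String newName) {
        this.destinationId = destinationId;
        this.newName = newName;
    }
}

// Intent-revealing event: the traveler who corrected their destination is part
// of the event, so only that traveler's read model needs to be updated.
class TravelerCorrectedDestination {
    final String travelerId;
    final String destinationId;
    final String correctedName;

    TravelerCorrectedDestination(String travelerId, String destinationId, String correctedName) {
        this.travelerId = travelerId;
        this.destinationId = destinationId;
        this.correctedName = correctedName;
    }
}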

What's the effect of removing @GeneratedValue from the Id on performance?

I'm working on an RCP application which communicates with a Tomcat server using REST. Since we've gotten more and more data, the load/copy routines are slowly but surely becoming unworkable; it sometimes takes me minutes to execute a copy operation. So I'm looking for some advice on how to speed up my routines.
Here are the technologies I'm using:
RCP client (e4 platform)
Tomcat 8 server
Oracle DB
JDBC as the API, with Hibernate
REST
First things first: I checked the entities, and pretty much all of them look like the code below.
@Entity
@SequenceGenerator(name = "CHECKITEM_SEQ", sequenceName = "CHECKITEM_SEQ", allocationSize = 1)
public class CheckItem extends AbstractTreeNode implements Serializable, Cloneable {...}
I figured that when copying data (most of the time over 200K records per operation), since I use these generated values as the primary key,
@Id
@GeneratedValue(generator = "CHECKITEM_SEQ", strategy = GenerationType.SEQUENCE)
public Integer getId() {
    return id;
}
the DB must generate a sequence value per object and check the constraint on it. So I was wondering how much performance I would gain if I removed the sequence, since I don't really use or need it in the DB. Now my questions:
Is there anything that speaks against removing a constraint (the primary key, in this particular case) in the DB?
Does anyone have more or better suggestions on how to increase DB performance for such operations?
Is there a tutorial or document that can help me through this process?
I hope I was clear enough; I'd appreciate any kind of help. Thanks in advance.
The problem with using @GeneratedValue identifiers is that in order for Hibernate to place the new entity into the Persistence Context (the first-level cache), it must know the identifier. So when you're using IDENTITY- or SEQUENCE-based identifiers, this can prevent the JDBC driver from adequately batching insert operations.
For example, you illustrated that most of your entities use the following sequence generation:
@SequenceGenerator(
    name = "CHECKITEM_SEQ",
    sequenceName = "CHECKITEM_SEQ",
    allocationSize = 1)
So whenever a persist operation for an entity happens, you're telling the sequence generator to only generate one value, so the JDBC communication looks like this:
1. Get Next Sequence
2. Insert
3. Get Next Sequence
4. Insert
5. Get Next Sequence
6. Insert
As seen here, we cannot batch the insert operations because we must fetch the identifier for each insert operation before the insert operation can happen. One solution to minimize that impact and deal with batch inserts is to use a larger allocationSize.
1. allocationSize=10 -> Get Next 10 sequences
2 - 11. Perform 10 inserts in batch
Repeat
As you can see here, the driver can do 10 inserts in a batch; Hibernate allocates the sequence values in batches of 10, and so the inserts can happen much faster.
Obviously this comes with a small drawback: if you allocate 10 sequence values but the remaining batch only needs to insert 6 entities, you've wasted 4 sequence values, but you gain the performance of being able to do JDBC batch inserts.
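A rough sketch of what that could look like for the entity above (the allocation size of 50 and the property values are illustrative choices, not figures from the question):

// Hibernate reserves 50 identifiers per round trip to the database sequence,
// so most persists no longer need a "get next sequence" call of their own.
@Entity
@SequenceGenerator(name = "CHECKITEM_SEQ", sequenceName = "CHECKITEM_SEQ", allocationSize = 50)
public class CheckItem extends AbstractTreeNode implements Serializable, Cloneable {...}

// JDBC batching itself must also be switched on, e.g. in persistence.xml or
// hibernate.cfg.xml, so that Hibernate actually groups the inserts:
//   hibernate.jdbc.batch_size=50
//   hibernate.order_inserts=true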
The next logical step would be to determine whether you can eliminate the use of @GeneratedValue altogether, as that would give you the maximum performance with batch inserts for your copy operations; however, that may not be possible with your data model. In the past, when I dealt with moving large volumes of data, I tried to define the primary key based on natural keys from the data, without involving a surrogate key where possible.
Feel free to read more about JDBC batch operations here.

Neo4J Cypher: performance of matching multiple properties and creating relationships

A little context: I'm experimenting with Neo4J (as a newbie, but experienced in other database technologies) for possible use as a master data management system within our business of identity intelligence, in particular looking at building up a graph of places, identity attributes (eg: email addresses, telephone numbers, electoral roll data, etc.) with relationships between these nodes that express something meaningful, for example where an email address has been used, or where a telephone number is registered.
Desired system properties: I would like this system to have some specific properties that are valuable to us:
Fast ingestion of information from a significant number of providers (100+); this precludes lengthy (hours-long) ETL processes, though short ones are OK!
Online at all times; this precludes use of the batch importer. We are most likely to use a fault-tolerant cluster, and sharding would be good :)
Capacity to eventually ingest ~30G records/year (~1000/second) and retain them, plus creation and retention of ~100G relationships/year; right now we are ingesting ~1/10 of this load.
Where I'm stuck: I have been experimenting with a single node in Azure (32GB RAM, 4 cores, non-local disk) running Debian 8 and Neo4J 3.1.1. This happily ingests and relates back together the UK postal address file (PAF), around 29M records, in a few tens of minutes using either LOAD CSV or home-brewed Java and Bolt. I have also ingested, but not yet related, a test set of email address data, around 20M records, and now need to build relationships based on matching postcodes, building numbers, and possibly other fields between the two data sets. This is where things get much slower when using Cypher; here's the fastest query I have been able to create thus far:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i WITH e
MATCH(s:SUBBNAME) USING INDEX s:SUBBNAME(SBNA)
WHERE upper(e.Building) = s.SBNA WITH e,s
MATCH(m:MAINFILE)
WHERE trim(split(e.Postcode,' ')[0]) = m.OUTC AND
trim(split(e.Postcode,' ')[1]) = m.INCO AND
right('0000'+e.HouseNo,4) = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Indexes are:
ON :DDSEMAIL(HouseNo) ONLINE
ON :DDSEMAIL(Postcode) ONLINE
ON :DDSEMAIL(Building) ONLINE
ON :MAINFILE(OUTC) ONLINE
ON :MAINFILE(INCO) ONLINE
ON :MAINFILE(BNUM) ONLINE
ON :SUBBNAME(SBNA) ONLINE
Please note that the {list} parameter is being supplied through bolt from a Java client that has already enumerated all the ~20M DDSEMAIL nodes, and is batching into transactions (typically 1000 IDs at a time).
This is taking between 100 and 200 ms per ID; over a test run of 157,000 IDs it took 7.3 hours, indicating a full execution time of ~760 hours, or more than a month. The underlying machine appears CPU-bound (no significant IO wait time).
Looking at the EXPLAIN for this query, there are no full scans; it's all schema index matching (once I had included the explicit index statement), so I'm not sure where to look for more speed.
(edited to add this PROFILE output):
PROFILE part 1
PROFILE part 2
This shows that the match on both parts of the postcode is filtering a lot of rows (56k); it may be better to re-order these fields to reduce the filter input size.
(end of edit)
As a (very unfair) comparison, I pushed both sets of data from CSV files into a custom Bloom filter written in C#/.NET, which performs similar field reformatting to the above, then concatenates the fields to generate textual keys and matches those keys together. This matched all 20M email records against all 29M PAF records in under 5 minutes on a single core of my laptop. It was largely IO-bound.
Right now I'm considering using an external application or a user procedure to perform the record matching, and just creating relationships using Cypher, but it feels wrong to avoid a well-written query engine that should be able to do this much, much quicker than it is.
What should I be looking at to improve performance please?
If I recall correctly, the index won't be utilized when transformations (such as UPPER(), LOWER(), or TRIM()) are applied to the comparison values and those values are sourced from another node property. You may need to perform these operations first and alias the results, then do the match.
Providing the index hint gets around this, I think, so your match on s.SBNA should correctly use the index; but if there's an index on any of the matched properties on m:MAINFILE, those matches may not be using it.
Test to see if this makes a difference, comparing this query to the older query on a smaller data set:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Also, if you could add a screenshot of a PROFILE or EXPLAIN of the query to your description (after expanding all plan nodes) that may help to see where things could improve.
EDIT
As you mentioned in your description, batching these may be a good idea. APOC Procedures has apoc.periodic.iterate(), which may help here.
Let's see if we can apply that to your query. Try this out:
WITH {list} AS list
CALL apoc.periodic.iterate('
UNWIND {list} as list
RETURN list
', '
WITH {list} as i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
MERGE (e)-[:USED_AT]->(m)
', {batchSize:1000, iterateList:true, params:{list:list}}) YIELD batches, total, committedOperations, failedOperations, failedBatches, errorMessages
RETURN batches, total, committedOperations, failedOperations, failedBatches, errorMessages
We have to sacrifice returning the total number of relationships created, however, as we can't return values from the batched query.

Hibernate indexing time-based UUIDs

We currently have a PostgreSQL database using UUIDs as primary keys. We are using Hibernate as the ORM tool to manage PostgreSQL, with the dialect
"org.hibernate.dialect.PostgreSQLDialect"
I have read an article (http://mysql.rjweb.org/doc.php/uuid) saying there is a way to increase read/write performance if "time-based" / "version 1" UUIDs are used. To do that, we need to rearrange the 3rd part of the UUID, i.e. rearrange it so that the 3rd part comes first. We then get a number that increases nicely over time. The idea is to index such a column, so that the IO becomes even faster.
Now the questions: Has anybody tried using this in a table with hundreds of millions of rows, and if so, does this change make sense? I mean, UUIDs are not an idea discovered today, and I bet there is a reason why this has NOT been done so far.
How can I produce such UUIDs using Hibernate? There are 5 UUID versions and, as far as I understand, none of them produces these time-based, rearranged UUIDs. What should the field "private UUID studentId;" look like, i.e. which annotations should be used to tell Hibernate to generate such a UUID?
import java.util.UUID;

@Id
@Index
@Column(name = "STUNDENT_ID")
@NotNull
private UUID studentId;
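There is no built-in Hibernate strategy for this rearranged layout, so one possible direction, sketched under the assumption of Hibernate 5.x and with a made-up class name and package, is a custom IdentifierGenerator that puts the time bits first and is wired in via @GenericGenerator. Note that this only approximates the article's rearranged version-1 UUID by using the current time for the most significant bits:

import java.io.Serializable;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

import org.hibernate.HibernateException;
import org.hibernate.engine.spi.SharedSessionContractImplementor;
import org.hibernate.id.IdentifierGenerator;

// Hypothetical generator: time bits in the most significant half so that
// generated ids (and the index built on them) increase roughly over time.
public class TimeOrderedUuidGenerator implements IdentifierGenerator {

    @Override
    public Serializable generate(SharedSessionContractImplementor session, Object object)
            throws HibernateException {
        long timeBits = System.currentTimeMillis() << 16;           // ordered part first
        long randomBits = ThreadLocalRandom.current().nextLong();   // uniqueness within the same instant
        return new UUID(timeBits, randomBits);
    }
}

The field would then reference the generator by name (again, the generator name and package are placeholders):

@Id
@GeneratedValue(generator = "timeOrderedUuid")
@GenericGenerator(name = "timeOrderedUuid", strategy = "com.example.TimeOrderedUuidGenerator")
@Column(name = "STUNDENT_ID")
@NotNull
private UUID studentId;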

Resources