OSB schema validation performance - validation

Are there any performance issues when using Validate nodes in a proxy service inside Oracle Service Bus (OSB)?
What are the best practices when using the Validate node?
What's the time cost of using several Validate nodes, for example:
1. Validate header
2. Branch depending on incoming operation
3. Validate body schema depending on operation
4. Do an XQuery transformation
5. Validate schema after transformation
6. Send request to business service
Is a Validate node useful in step 5, after the XQuery? Doesn't an XQuery transformation assure schema integrity?
Thanks!

Validation does have a performance cost, but generally you validate by default and only reassess when performance is not sufficient (it's likely that performance gains could be found elsewhere first, by using split-joins or by rationalising multiple OSB nodes into a single XQuery).
Personally, I'd validate the request after the first operational branch (so you know what element to validate against), and then optionally validate the response just before you send it back in the response pipeline.
And no, XQuery transforms do not assure schema integrity. I would not recommend validating after your own XQuery transformation; the result is within your control, so you should be testing it in other ways (ideally statically) rather than relying on a runtime validation.

Related

Retrieving all fields vs only some in graphql, Time Comparison

I am currently working on a project involving GraphQL and I was wondering whether retrieving every element of a given type in a query takes significantly more time than retrieving only some of them, or whether the difference is negligible.
Here is an example:
fragment GlobalProtocolStatsFields on GlobalProtocolStats {
  totalProfiles
  totalBurntProfiles
  totalPosts
  totalMirrors
  totalComments
  totalCollects
  totalFollows
  totalRevenue {
    ...Erc20AmountFields
  }
}
vs
fragment GlobalProtocolStatsFields on GlobalProtocolStats {
  totalProfiles
  totalBurntProfiles
  totalPosts
  totalMirrors
}
Thanks in advance!
The answer highly depends on the implementation on the backend side. Let's look at the three stages the data goes through and how each of them can impact response time.
1. Data fetching from the source
First, the GraphQL server has to fetch the data from the database or a different data source. Some data sources allow you to specify which fields you want to receive. If the GraphQL service is optimised to fetch only the data needed, some time can be saved here. In my experience, it is often not worth it to do this, and it is much easier to just fetch all fields that could be needed for an object type. Some GraphQL implementations do this automatically, e.g. Hasura, PostGraphile, or Pothos with the Prisma plugin. What can be more expensive is resolving relationships between entities; often the GraphQL implementation has to do another roundtrip to the data source.
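For illustration, here is a rough TypeScript sketch of pushing the selection down to the data source in a graphql-js style resolver. It only looks at the top-level selection (fragments and nested fields are ignored for brevity), and `db.selectColumns` is a made-up placeholder for whatever query builder the server actually uses:

    import { FieldNode, GraphQLResolveInfo, Kind } from "graphql";

    // Collect the top-level field names the client actually asked for.
    function requestedFields(info: GraphQLResolveInfo): string[] {
      const selections = info.fieldNodes[0].selectionSet?.selections ?? [];
      return selections
        .filter((sel): sel is FieldNode => sel.kind === Kind.FIELD)
        .map((sel) => sel.name.value);
    }

    const resolvers = {
      Query: {
        // Only the requested columns are fetched from the data source.
        globalProtocolStats: (
          _parent: unknown,
          _args: unknown,
          ctx: { db: { selectColumns(table: string, cols: string[]): Promise<unknown> } },
          info: GraphQLResolveInfo,
        ) => ctx.db.selectColumns("global_protocol_stats", requestedFields(info)),
      },
    };

Whether this extra plumbing pays off depends on how wide your rows are; as noted above, fetching everything for the object type is often simpler.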
2. Data transformation and business logic
Sometimes, the data has to be transformed before it is returned from the resolver. The resolver model allows this business logic to be called conditionally. Leaving out a field will skip its resolver. In my experience, most business logic is incredibly fast and does not really impact response time.
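As a small illustration (TypeScript, graphql-js style; `fetchRevenueBreakdown` is a hypothetical piece of business logic), a field-level resolver like this simply never runs when the client's fragment omits the field:

    // Hypothetical expensive business logic behind a single field.
    async function fetchRevenueBreakdown(statsId: string): Promise<unknown> {
      // ...call out to another service, aggregate, etc.
      return [];
    }

    const resolvers = {
      GlobalProtocolStats: {
        // Runs only when a query or fragment actually selects totalRevenue;
        // the shorter fragment in the question skips this work entirely.
        totalRevenue: (stats: { id: string }) => fetchRevenueBreakdown(stats.id),
      },
    };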
3. Data serialisation and network
Once all the data is ready on the server side, it has to be serialised to JSON and sent to the client. Serialising large amounts of data can be expensive, especially because GraphQL is hard to serialise in a stream. Sending data to the client can also take a while if the connection is slow or the payload is large. This was one of the motivations for GraphQL: allow the client to select the required fields and reduce unused data transfer.
Summary
As you can see, the response time is mostly related to the amount of data returned from the API and the network connection. Depending on the implementation, real savings are only made on the network, but more advanced implementations can drastically reduce the work done on the server as well.

Microservice cross-db referential integrity

We have a database that manages codes, such as a list of valid currencies, a list of country codes, etc (hereinafter known as CodesDB).
We also have multiple microservices that in a monolithic app + database would have foreign key constraints to rows in tables in the CodesDB.
When a microservice receives a request to modify data, what are my options for ensuring the codes passed in the request are valid?
I am currently leaning towards having the CodesDB microservice post an event onto a service bus announcing when a code is added or modified; each other microservice interested in that type of code (country / currency / etc.) can then issue an API request to the CodesDB microservice to grab the state it needs and reflect the changes in its own local DB. That way we get referential integrity within each microservice DB.
Is this the correct approach? Are there any other recommended approaches?
Asynchronous, event-based notification is a pattern commonly used in the microservices world for ensuring eventual consistency. Depending on how strict your consistency requirements are, you may have to add further checks.
Other possible approaches include:
Read-only data stores using materialized views. This is a form of the CQRS pattern where data from multiple services is stored in a denormalized form in a read-only data store. The data gets updated asynchronously using the approach mentioned above, and consumers get fast access to the data without having to query multiple services.
Caching - you could also use distributed or replicated caches, depending on your performance and consistency requirements.
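A minimal TypeScript sketch of the asynchronous, event-based sync described above; the event shape, the CodesDB URL, and `localCodeRepository` are illustrative placeholders rather than any specific broker's API:

    interface CodeChangedEvent {
      codeType: "currency" | "country";   // which code list changed
      code: string;                       // e.g. "EUR" or "NZ"
      action: "added" | "modified";
    }

    // Placeholder for this microservice's own persistence layer.
    const localCodeRepository = {
      async upsert(codeType: string, code: unknown): Promise<void> {
        /* write/refresh the row in the local DB */
      },
    };

    // Handler subscribed to the topic the CodesDB service publishes to.
    async function onCodeChanged(event: CodeChangedEvent): Promise<void> {
      // Fetch the authoritative state from CodesDB rather than trusting
      // the event payload alone (events may arrive late or out of order).
      const res = await fetch(`https://codesdb.internal/${event.codeType}/${event.code}`);
      const latest = await res.json();

      // Keep a local copy so this service can validate incoming requests
      // (and enforce FK-style integrity) against its own DB.
      await localCodeRepository.upsert(event.codeType, latest);
    }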

Apache Nifi - Federated Search

My team’s been thrown into the deep end and has been asked to build a federated search of customers over a variety of large datasets which hold varying degrees of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the result, deduplicate the entries via an external tool and then push this result into a database, which is then queried and fed into an Elasticsearch instance for the application's use.
So roughly speaking something like this:-
For example's sake, the following data then exists in the result database from the first flow:-

Then https://github.com/dedupeio/dedupe would be run over this database table, adding cluster IDs to aid the record linkage, e.g.:-

The second flow would then query the result database and feed the result into an Elasticsearch instance for use by the application's API, which would use the cluster ID to link the duplicates.
Couple questions:-
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question - how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven’t considered any CDC process here as the databases will be getting constantly updated which I'd need to handle, so really interested if anybody had solved a similar problem or used different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since it looks like a Python library, I'm guessing writing a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

Uniqueness validation performance

When performing uniqueness validation in Core Data in the usual way (via NSManagedObject validate…), complexity is O(n²) because every entity is going to compare itself to every other entity of its type.
Is there a straightforward way to get linear performance for Core Data uniqueness validations? Unfortunately, there doesn't seem to be a class-level or context-level validation.
There is no default implementation for validation because it very much depends on your application and business logic.
If you are importing data, it is best to gather all of the unique IDs and then perform a single fetch to determine existence.
If you are creating a new record, then I recommend doing the one-off, expensive fetch to determine uniqueness.
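The same batched idea, expressed outside Core Data as a rough TypeScript sketch (the `RecordStore` interface is hypothetical); the point is one set-based lookup instead of comparing every record against every other:

    interface RecordStore {
      // A single query, e.g. SELECT id FROM items WHERE id IN (...).
      findExistingIds(ids: string[]): Promise<string[]>;
    }

    // Returns the incoming records that do not already exist, in roughly O(n)
    // instead of the O(n²) pairwise comparison described above.
    async function filterNewRecords<T extends { id: string }>(
      store: RecordStore,
      incoming: T[],
    ): Promise<T[]> {
      const existing = new Set(await store.findExistingIds(incoming.map((r) => r.id)));
      return incoming.filter((r) => !existing.has(r.id));
    }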

couchdb validation based on content from existing documents

QUESTION
Is it possible to query other CouchDB documents as part of a standard CouchDB validation function?
If not, what is the standard approach for including properties of other documents as part of a validation rule inside a CouchDB validation function?
RATIONALE
Consider a run-of-the-mill address book application where the validation function is intended to prevent two or more entries from having the same value in the 'e-mail' field of an address book entry.
Consider also an address book application where it is possible to specify validation rules in separate documents, based on whether the postal code is a US-based postal code or something else.
No, it is not possible to query other CouchDB documents in a validate_doc_update function. Each runs in isolation and is passed references only to the new document, the old document, and the user context (where applicable).
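To make the isolation concrete, this is roughly the shape of such a function (written here with TypeScript annotations for readability; CouchDB itself expects plain JavaScript). Every check has to be expressed in terms of these arguments alone:

    type UserCtx = { db: string; name: string | null; roles: string[] };

    // CouchDB calls this with the incoming doc, the currently stored doc
    // (null on first write), and the requesting user's context. There is
    // no handle for looking up any other document in the database.
    function validate_doc_update(
      newDoc: { [key: string]: unknown; _deleted?: boolean },
      oldDoc: { [key: string]: unknown } | null,
      userCtx: UserCtx,
    ): void {
      if (!newDoc._deleted && !newDoc["email"]) {
        // Rejecting a write is done by throwing an object like this.
        throw { forbidden: "An address book entry requires an e-mail." };
      }
    }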
My personal experience has been there are at least three options for dealing with duplicate checking:
1. Use Cloudant as your CouchDB provider. They offer a free tier for now if you'd like to experiment, but they guarantee consistency across nodes for a CouchDB database. (See #2)
2. I've used a secondary "reserve table" for names, using the type-key as the ID (see the sketch after this list). Then you need to check for conflicts if you're not using a system like Cloudant. Basically, there's a simple document that maintains a key to prevent duplicates. It's not fun code to write, given that you need to watch for conflicts. (Even with Cloudant, you need to deal with failed write requests, but it's easier than dealing with timing issues surrounding data replication across multiple nodes.)
3. Use a traditional DB, MySQL for example, that can maintain a unique and consistent index for specific data values like you're describing. Store the documents away in CouchDB, though. While it's slightly annoying that you need different data providers, it's reliable.
4. (Optional: decide that CouchDB isn't a great fit for the type of system you're building.)
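A rough TypeScript sketch of option 2's reserve document, using CouchDB's HTTP API directly (the database name and field are illustrative): the unique value is baked into the _id, so a 409 Conflict from CouchDB is the duplicate check.

    async function reserveEmail(
      couchUrl: string,     // e.g. "http://localhost:5984"
      email: string,
      contactDocId: string, // the address book entry this e-mail belongs to
    ): Promise<boolean> {
      const res = await fetch(
        `${couchUrl}/email_reservations/${encodeURIComponent("email:" + email)}`,
        {
          method: "PUT",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ contactDocId }),
        },
      );

      if (res.status === 201) return true;   // reservation created
      if (res.status === 409) return false;  // the e-mail is already reserved
      throw new Error(`Unexpected CouchDB response: ${res.status}`);
    }

Remember to delete the reservation document if the address book entry is later removed or its e-mail changes, which is part of the "not fun" bookkeeping mentioned above.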
