Regarding data validation, I've heard that the two options are "fail fast, fail early" and "complete validation". The first approach fails on the very first validation error, whereas the second builds up a list of failures and presents them all.
I'm wondering about this in the context of both server-side and client-side data validation. Which method is appropriate in what context, and why?
My personal preference for data validation on the client side is the second method, which informs the user of all failing constraints. I'm not informed enough to have an opinion about the server side, although I would imagine it depends on the business logic involved.
Part of why this is confusing is that people don't discuss orthogonality as part of the criteria. 'Fail-early' is useful so that the error is caught where it happens, not downstream. But for orthogonal failures, there is no downstream, or multiple downstreams.
For example, most user forms are filled with lots of independent questions: username, password, email, and so on. Since they're independent, wait until all three arrive and describe all the errors at once. Making the user go through three submit-check-complain cycles is preposterous.
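To make that concrete, here's a minimal sketch of the "describe all errors at once" approach (field names and rules invented for illustration): validate every field and return the full list of failures in one pass.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SignupValidator
{
    // Validate all fields and collect every failure, rather than
    // returning on the first error ("complete validation").
    public static List<string> Validate(string username, string password, string email)
    {
        var errors = new List<string>();

        if (string.IsNullOrWhiteSpace(username))
            errors.Add("Username is required.");

        if (password == null || password.Length < 8)
            errors.Add("Password must be at least 8 characters.");

        if (email == null || !Regex.IsMatch(email, @"^[^@\s]+@[^@\s]+\.[^@\s]+$"))
            errors.Add("Email address is not valid.");

        return errors; // an empty list means the input passed all checks
    }
}
```

The caller presents the whole list, so a single submit surfaces every problem.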
For trivial errors such as invalid inputs or missing data, it's up to you how convenient you want to make the system for the user. It could be very annoying, for example, if somebody imported a full spreadsheet of data into a system, the first row failed, and you said "first row failed." The user fixes the first row, imports again, and gets the message "second row failed." Imagine there are 65536 rows. You already know you're not going to do anything with the data either way, so why not make the user's life easier? Again, these are the trivial errors I'm discussing; of course you will be designing a system that validates all data before processing.
For more serious errors that either you do not expect or are more than mere validation issues, revert to fail fast and hard.
The Waze application receives explicit input from users referring to many issues: traffic, police traps, construction, speedcams, accidents, weather conditions, or any hazards along the way.
How does Waze validate whether these inputs are true or false?
How does Waze make the decisions that can be triggered by such input validation?
Are these validations and decisions performed automatically, or through the intervention of a computational scientist?
Since Waze handles streaming data, the validation and decision making after a user input should be done online.
Thanks; I'd appreciate any tips.
AFAIK, when a report is sent by a user in Waze (police, accident, etc.), the data is sent to the Waze servers, and then Waze performs a gradual roll-out of the report to a subset of users. Other users will either confirm the report (thumbs up) or say it's not there (tapping X). The more a report is confirmed, the larger exposure it gets. This also works the other way around: if a report has already been confirmed, but then users start reporting that it's gone (for example, if an accident has been cleared), the effect on other users is gradual until the system is confident that the report is gone.
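None of this is public, so any code is pure speculation, but the gradual roll-out described above can be pictured as a simple confidence score. A toy sketch, with all names and thresholds invented:

```csharp
public class ReportConfidence
{
    // All numbers invented purely for illustration; Waze's real model is not public.
    private double score = 0.3;          // initial trust in a fresh report
    private const double Step = 0.1;

    public void Confirm() => score = System.Math.Min(1.0, score + Step); // thumbs up
    public void Deny()    => score = System.Math.Max(0.0, score - Step); // tapping X

    // Expose the report to a wider share of drivers as confidence grows.
    public double ExposureFraction => score;

    // Retire the report once the community has voted it away.
    public bool ShouldRetire => score <= 0.0;
}
```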
Waze's data highly depends on community input, and this has proved to be fairly reliable. Engaged users will not only send reports, but also mark reports as incorrect, and so even if incorrect data is sent to the Waze servers, the community will be quick to correct it.
So, I'm working on a CQRS/ES project in which we are having some doubts about how to handle trivial problems that would be easy to handle in other architectures.
My scenario is the following:
I have a customer CRUD REST API, and each customer has a unique document (number). So when I'm registering a new customer, I have to verify that there is no other customer with that document, to avoid duplication. But in a CQRS/ES architecture, where we have eventual consistency, I found out that this kind of validation can be very hard to address.
It is important to note that my problem is not across microservices, but between the command application and the query application of the same microservice.
Also we are using eventstore.
My current solution:
So what I do today is: in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) whether there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query side can be desynchronized, I cannot trust it 100%. That's when my second validation kicks in: when my query application is processing the events and saving them to PostgreSQL, I check again whether there is a customer with that document, and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, thereby finishing that customer stream on eventstore.
Although this works, two things bother me here. The first is my command application relying on the query application: if my query application is down, my command side is affected (today I just return false from my validation if the query side is down, but still...). The second is: should a query/read model really be able to emit events? And if so, what is the correct way of doing it? Should the command side have some kind of API for that? Or should the query side emit the event directly to eventstore using some common shared library? And if I have more than one view/read model, which one should handle this?
I really hope someone can shed some light on these questions and help me with these matters.
For reference, you may want to review what Greg Young has written about Set Validation.
I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right?
That's exactly right: your read model is a stale copy, and may not have all of the information collected by the write model.
That's when my second validation kicks in, when my query application is processing the events and saving them to my PostgreSQL, I check again if there is a customer with that document and if there is, I reject that event and emit a compensating event to undo/cancel/inactivate the customer with the duplicated document, therefore finishing that customer stream on eventstore.
This spelling doesn't quite match the usual designs. The more common implementation is that, if we detect a problem when reading data, we send a command message to the write model, telling it to straighten things out.
This is commonly referred to as a process manager, but you can think of it as the automation of a human supervisor of the system. Conceptually, a process manager is an event sourced collection of messages to be sent to the command model.
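As a rough illustration of that idea, here is a hedged sketch of such a process manager in C#; every type name is invented for the example, and a real one would also persist its own state:

```csharp
using System;

// Hypothetical event/command shapes; yours will differ.
public record CustomerRegistered(Guid CustomerId, string DocumentNumber);
public record DeactivateCustomer(Guid CustomerId, string Reason);

public interface IReadModel  { Guid? FindCustomerByDocument(string documentNumber); }
public interface ICommandBus { void Send(object command); }

public class DuplicateDocumentProcessManager
{
    private readonly IReadModel readModel;   // query-side lookup
    private readonly ICommandBus commandBus; // route back to the write model

    public DuplicateDocumentProcessManager(IReadModel readModel, ICommandBus commandBus)
    {
        this.readModel = readModel;
        this.commandBus = commandBus;
    }

    // Invoked for each CustomerRegistered event the projection processes.
    public void When(CustomerRegistered e)
    {
        Guid? existing = readModel.FindCustomerByDocument(e.DocumentNumber);
        if (existing != null && existing != e.CustomerId)
        {
            // Don't mutate the read model directly: send a command and let the
            // write model straighten things out inside its own unit of work.
            commandBus.Send(new DeactivateCustomer(e.CustomerId, "duplicate document"));
        }
    }
}
```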
You might also want to consider whether you are modeling your domain correctly. If documents are supposed to be unique, then maybe the command model should be using the document number as a key in the book of record, rather than using the customer. Or perhaps the document id should be a function of the customer data, rather than being an arbitrary input.
as far as I know, eventstore doesn't have transactions across different streams
Right - one of the things you really need to be thinking about in general is where your stream boundaries lie. If set validation has significant business value, then you really need to be thinking about getting the entire set into a single stream (or by finding a way to constrain uniqueness without using a set).
How should I send a command message to the write model? via API? via a message broker like Kafka?
That's plumbing; it doesn't really matter how you do it, so long as you are sure that the command runs within its own transaction/unit of work.
So what I do today is, in my command application, before saving the CustomerCreated event, I ask the query application (using PostgreSQL) if there is a customer with that document, and if not, I allow the event to go on. But that doesn't guarantee 100%, right? Because my query can be desynchronized, so I cannot trust it 100%.
No, you cannot safely rely on the query side, which is eventually consistent, to prevent the system from stepping into an invalid state.
You have two options:
You permit the system to enter a temporary, pending state and then, eventually, bring it into a valid permanent state. For this you could allow the command to pass and yield the CustomerRegistered event; then, using a Saga/Process manager, you verify against a uniquely-indexed-by-document collection and issue a compensating command (not an event!), e.g. UnregisterCustomer.
Instead of sending the command directly, you create and start a Saga/Process that preallocates the document in a uniquely-indexed-by-document collection and, if successful, then sends the RegisterCustomer command. You can model the Saga as an entity (a sketch follows this list).
So, in both solutions you use a Saga/Process manager. For the system to be resilient, you should make sure the RegisterCustomer command is idempotent (so you can resend it if the Saga fails or is restarted).
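Here is a minimal sketch of the second option, assuming a hypothetical IDocumentReservations store backed by a unique index and a generic command bus; all names are illustrative only:

```csharp
using System;
using System.Threading.Tasks;

public interface IDocumentReservations
{
    // Returns false if the document number is already taken (backed by a unique index).
    Task<bool> TryReserveAsync(string documentNumber, Guid customerId);
}

public interface ICommandBus { void Send(object command); }

public record RegisterCustomer(Guid CustomerId, string DocumentNumber);

public class RegisterCustomerSaga
{
    private readonly IDocumentReservations reservations;
    private readonly ICommandBus commandBus;

    public RegisterCustomerSaga(IDocumentReservations reservations, ICommandBus commandBus)
    {
        this.reservations = reservations;
        this.commandBus = commandBus;
    }

    public async Task StartAsync(Guid customerId, string documentNumber)
    {
        // Step 1: preallocate the document in the uniquely-indexed collection.
        if (!await reservations.TryReserveAsync(documentNumber, customerId))
            throw new InvalidOperationException("Document number already registered.");

        // Step 2: only after the reservation succeeds, send the idempotent
        // RegisterCustomer command; it is safe to resend if the saga restarts.
        commandBus.Send(new RegisterCustomer(customerId, documentNumber));
    }
}
```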
You've butted up against a fairly common problem. I think the other answer by VoicOfUnreason is worth reading. I just wanted to make you aware of a few more options.
A simple approach I have used in the past is to create a lookup table. Your command tries to register the key in a unique-constraint table; if it can reserve the key, the command can go ahead.
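As a sketch of that reservation (assuming PostgreSQL via Npgsql, with invented table and column names), the unique constraint does the real work:

```csharp
using Npgsql;

public static class DocumentKeyReservation
{
    // Assumes: CREATE TABLE customer_documents (document_number text PRIMARY KEY);
    public static bool TryReserve(NpgsqlConnection conn, string documentNumber)
    {
        using var cmd = new NpgsqlCommand(
            "INSERT INTO customer_documents (document_number) VALUES (@doc)", conn);
        cmd.Parameters.AddWithValue("doc", documentNumber);
        try
        {
            cmd.ExecuteNonQuery();
            return true;            // key reserved; the command may proceed
        }
        catch (PostgresException e) when (e.SqlState == "23505") // unique_violation
        {
            return false;           // someone else already owns this key
        }
    }
}
```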
Depending on the nature of the data and the domain, you could let this 'problem' occur and raise additional events to mark it. If it is something that's important to the business or to the way the application works, then you can deal with it either manually or at the time via compensating commands. If the latter, it would make sense to use a process manager.
In some (rare) cases where speed/capacity is less of an issue then you could consider old-fashioned locking and transactions. Admittedly these are much better suited to CRUD style implementations but they can be used in CQRS/ES.
I have more detail on this in my blog post: How to Handle Set Based Consistency Validation in CQRS
I hope you find it helpful.
I'm publishing a comment on Instagram using their API. In the documentation they describe the rules that the message being sent has to pass. So far my approach has always been to add a validation layer just before the message is sent to the service, checking that it satisfies all the requirements. I preferred to get back to the user more quickly with a proper error, without sending any requests to the social network.
It requires maintaining additional logic in my application, and in the case of Instagram, where the rules are not so simple (i.e. more than just limiting the length of the message), I started wondering whether that's the optimal approach.
For example, one of the requirements on comments is that they cannot contain more than 4 hashtags, which forces me to keep logic that can count how many hashtags are in a string.
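For what it's worth, the counting itself can stay small. A sketch, using a simple `#\w+` approximation that may not match Instagram's exact tokenization rules:

```csharp
using System.Text.RegularExpressions;

public static class CommentRules
{
    // Rough approximation of a hashtag; Instagram's definition may differ.
    private static readonly Regex Hashtag = new Regex(@"#\w+");

    // Instagram's documented limit on hashtags per comment (4 at the time of writing).
    public static bool TooManyHashtags(string comment) =>
        Hashtag.Matches(comment).Count > 4;
}
```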
Would you think that the effort put into keeping that validation is worth it? I always thought so, but am not so sure any more.
Would you think that the effort put into keeping that validation is worth it? I always thought so, but am not so sure any more.
Absolutely yes, unless you don't care about the user at all.
The user is your primary concern, and their comfort comes above all else. So, double validation is a must for good software.
I think you shouldn't struggle to implement absolutely ALL of Instagram's checks, but at least most of the ones that users fail most often.
Hopefully you'll see the problem I'm describing in the scenario below. If it's not clear, please let me know.
You've got an application that's broken into three layers,
front-end UI layer, which could be an ASP.NET WebForm or a Windows Form (used for editing Person data)
middle tier business service layer, compiled into a dll (PersonServices)
data access layer, compiled into a dll (PersonRepository)
In my front end, I want to create a new Person object, set some properties, such as FirstName and LastName, according to what has been entered in the UI by a user, and call PersonServices.AddPerson, passing the newly created Person. (AddPerson doesn't have to be static; this is just for simplicity. In any case, AddPerson will eventually call the repository's AddPerson, which will then persist the data.)
Now the part I'd like to hear your opinion on is validation. Somewhere along the line, that newly created Person needs to be validated. You could do it on the client side, which would be simple, but what if I wanted to validate the Person in my PersonServices.AddPerson method? This would ensure any person I want to save is validated, and removes any dependency on the UI layer doing the work. Or maybe validate both in the UI and in the business service layer. Sounds good so far, right?
So, for simplicity, I'll update the PersonService.AddPerson method to perform the following validation checks
- Check if FirstName and LastName are not empty
- Ensure this new Person doesn't already exist in my repository
And this method will return True if all validation passes and the Person is persisted, False if Validation fails or if the Person is not persisted.
But this Boolean value that AddPerson returns isn't enough for me at the UI layer to give the user a clear reason why the save process failed. So what's a lonely developer to do? Ultimately, I'd like the AddPerson method to be able to ensure that what it's about to save is valid, and if not, be able to communicate the reasons why it's not valid to my UI layer.
Just to get your juices flowing, some ways of solving this could be: (Some of these solutions, in my opinion, suck, but I'm just putting them there so you get an understanding of what I'm trying to solve)
Instead of AddPerson returning a boolean, it could return an int (i.e. 0 = success, non-zero = failure, where the number indicates the reason it failed).
In AddPerson, throw custom exceptions when validation fails. Each type of custom exception would have its own error message. In addition, each custom exception would be unique enough to catch in the UI layer
Have AddPerson return some sort of custom class that would have properties indicating whether validation passed or failed, and if it did fail, what were the reasons
Not sure if this can be done in VB or C#, but attach some sort of property to the Person and its underlying properties. This "attached" property could contain things like validation info
Insert your idea or pattern here
And maybe another here
Apologies for the long-winded question, but I'd definitely like to hear your opinion on this.
Thanks!
Multiple layers of validation go well with multi-layer apps.
The UI itself can do the simplest and quickest checks (are all mandatory fields present, are they using the appropriate character sets, etc) to give immediate feedback when the user makes a typo.
However the business logic should have the lion's share of validation responsibilities... and for once it's not a problem if this is "repetitious", i.e., if the business layer re-checks something that should already have been checked in the UI -- the BL should check all the business rules (this double checks on UI's correctness, enables multiple different UI clients that may not all be perfect in their checks -- e.g. a special client on a smart phone which may not have good javascript, and so on -- and, a bit, wards against maliciously hacked clients).
When the business logic saves the "validated" data to the DB, that layer should perform its own checks -- DBs are good at that, and, again, don't worry about some repetition -- it's the DB's job to enforce data integrity (you might want different ways to feed data to it one day, e.g. a "bulk loader" to import a number of Persons from another source, and it's key to ensure that all those ways to load data always respect data integrity rules); some rules such as uniqueness and referential integrity are really best enforced in the DB, in particular, for performance reasons too.
When the DB returns an error message (data not inserted as constraint X would be violated) to the business layer, the latter's job is to reinterpret that error in business terms and feed the results to the UI to inform the user; and of course the BL must similarly provide clear and complete info on business rules violation to the UI, again for display to the user.
A "custom object" is thus clearly "the only way to go" (in some scenarios I'd just make that a JSON object, for example). Keeping the Person object around (to maintain its "validation problems" property) when the DB refused to persist it does not look like a sharp and simple technique, so I don't think much of that option; but if you need it (e.g. to enable "tell me again what was wrong" functionality, maybe if the client went away before the response was ready and needs to smoothly restart later; or, a list of such objects for later auditing, &c), then the "custom validation-failure object" could also be appended to that list... but that's a "secondary issue", the main thing is for the BL to respond to the UI with such an object (which could also be used to provide useful non-error info if the insertion did in fact succeed).
Just a quick (and hopefully helpful) comment: when you're wondering where to place validation, try pretending that, soon, you're going to completely recreate your UI layer using a technology you're not yet so familiar with**. Try to keep out of that layer any validation-like business logic that you know for certain you'd have to rewrite in the new technology.
You'll find exceptions - business logic that ends up in your UI layer regardless, but it's a useful consideration nonetheless.
** Mobile dev, Silverlight, Voice XML, whatever - pretending you don't know the technology of your "new" UI layer helps you abstract your concerns and get less mired in implementation details.
The only important points are:
From the perspective of the front-end(s), the Middle Tier must perform all validation, you never know whether someone is going to try circumventing your front-end validation by talking directly to your Middle Tier (for whatever reason)
The Middle Tier may elect to delegate some of that validation to the DB layer (e.g. data integrity constraints)
You may optionally duplicate some validation in the UI, but that should only be for the sake of performance (to avoid round-trips to the Middle Tier for common scenarios, such as missing mandatory fields, incorrectly formatted data, etc.) These checks should never take the place of doing them in the Middle Tier
Validation should be done at all three levels.
When I am on a project, I assume I am building a framework, even though most of the time that is not the case. Each layer is separate and must check all incoming input before performing an operation.
Each level can validate in a different way; they need not all be identical, but ideally they should all share the same validation logic, with the ability to customize it.
You never want to let bad data into the database. So you can never trust the data you are getting from the business layer. It needs to be checked.
In the business layer you can never trust the UI layer, and you must check its input to prevent unneeded calls to the database layer. The UI layer works the same way.
I disagree with David Basarab's comment that the same validations should be present in all layers. This defies the paradigm of layer responsibility. Though the main intention is to make the layers (or components) loosely coupled, it is also important that a level of responsibility (and hence trust) is given to each layer. While it may be necessary to duplicate some validations in the UI and business layers (since the UI layer can be bypassed by hacking attempts), it is not advisable to repeat every validation in each layer. Each layer should perform only the validations it is responsible for. The biggest flaw in repeating validations in all layers is code redundancy, which can cause a maintenance nightmare.
A lot of this is more style than substance. I personally favor returning status objects as a flexible and extensible solution. I would say there are a couple of classes of validation in play: the first being "does this person data conform to the contract of what a person is?" and the second being "does this person data violate constraints in the database?" I think the first validation can, and should, be done at the client. The second should be done at the middle tier. With this division, you may find that the only reasons the save could fail are 1) it violates a uniqueness constraint, or 2) something catastrophic happened. You could then return false for the first case, and throw an exception for the other.
If tier R is closer to the user (or any input stream you don't control) than tier S then tier S should validate all data received from tier R. This does not mean that tier R shouldn't validate data. It's better for the user if the GUI warns him he's making a mistake before he attempts a new transaction. But no matter how bulletproof the validation in your GUI is, the next tier up should not trust that any validation has taken place.
This assumes your database is completely under your control. If not, you have bigger problems.
Also, you could have the UI pass the data needed to build a Person object through some sort of PersonBuilder object, so that object creation is consolidated in the domain/business layer, and you can keep the Person object in a state that is always consistent. This makes more sense for more complex entities, however even for simple ones, it is good to centralize object creation, just like you centralize persistence, etc.
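A sketch of that builder idea (hypothetical names), keeping validation at the point of construction so a Person never exists in an invalid state:

```csharp
using System;

public class Person
{
    public string FirstName { get; }
    public string LastName  { get; }

    // Private constructor: the builder is the only way to create a Person.
    private Person(string first, string last) { FirstName = first; LastName = last; }

    public class Builder
    {
        private string first, last;

        public Builder WithFirstName(string value) { first = value; return this; }
        public Builder WithLastName(string value)  { last  = value; return this; }

        public Person Build()
        {
            // Validation happens once, centrally, before the object can exist.
            if (string.IsNullOrWhiteSpace(first)) throw new ArgumentException("FirstName is required.");
            if (string.IsNullOrWhiteSpace(last))  throw new ArgumentException("LastName is required.");
            return new Person(first, last);
        }
    }
}
```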
A team member has run into an issue with an old in-house system where a user double-clicking on a link on a web page can cause two requests to be sent from the browser resulting in two database inserts of the same record in a race condition; the last one to run fails with a primary key violation. Several solutions and hacks have been proposed and discussed:
Use Javascript on the web page to mitigate the second click by disabling the link on the first click. This is a quick and easy way to reduce the occurrences of the problem, but not entirely eliminate it.
Wrap the request execution on the server side in a transaction. This has been deemed too expensive an operation due to server load and lock levels on the table in question.
Catch the primary key exception thrown by the failed insert, identify it as such, and eat it. This has the disadvantages of (a) vendor lock-in, having to know the nuances of the database-specific exceptions, and (b) potentially not logging/dealing with legitimate database failures.
An extension of #3: attempt to update the record if the insert fails, and check that the update reports exactly 1 record affected. (A single idempotent statement, sketched below, can collapse these two steps.)
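For what it's worth, if the database supports it, the insert-then-update dance in #4 can collapse into a single idempotent statement. A sketch using PostgreSQL's ON CONFLICT via Npgsql, with invented table and column names (other vendors have MERGE or INSERT IGNORE equivalents):

```csharp
using Npgsql;

public static class IdempotentInsert
{
    // PostgreSQL 9.5+ syntax; a duplicate request simply affects 0 rows.
    public static void InsertOnce(NpgsqlConnection conn, int recordId, string payload)
    {
        using var cmd = new NpgsqlCommand(
            "INSERT INTO records (id, payload) VALUES (@id, @payload) " +
            "ON CONFLICT (id) DO NOTHING", conn);
        cmd.Parameters.AddWithValue("id", recordId);
        cmd.Parameters.AddWithValue("payload", payload);
        cmd.ExecuteNonQuery();
    }
}
```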
Are there other options that haven't been considered? Are there pros and cons to the options presented that were overlooked? Which is the lesser of all evils?
Put a unique identifier on the page in a hidden field. Only accept one response with a given unique identifier.
It sounds like you might be misusing a GET request to modify server state (although this is not necessarily the case). While it may not be appropriate for your situation, it should be stated that you should consider converting the link into a form POST.
You need to implement the Synchronizer Token pattern.
How it works is: a value (the token) is generated on the server for each request. This same token must then be included in your form submission. On receipt of the request, the server token and client token are compared, and if they are the same you may continue to add your record. The server-side token is then regenerated, so subsequent requests containing the old token will fail.
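A minimal, framework-agnostic sketch of the pattern in C# (the token store here is a stand-in for whatever per-session storage your stack provides):

```csharp
using System;
using System.Collections.Concurrent;

public static class SynchronizerToken
{
    // Stand-in for real per-session server-side storage.
    private static readonly ConcurrentDictionary<string, byte> issued =
        new ConcurrentDictionary<string, byte>();

    // Called when rendering the page: embed the result in a hidden form field.
    public static string Issue()
    {
        var token = Guid.NewGuid().ToString("N");
        issued[token] = 0;
        return token;
    }

    // Called on submission: only the first request bearing a given token succeeds;
    // a double-click replays the same token and is rejected.
    public static bool Consume(string token) =>
        token != null && issued.TryRemove(token, out _);
}
```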
There's a more thorough explanation about half-way down this page.
I'm not sure what technology you're using, but Struts provides framework level support for this pattern. See example here
It seems you already replied to your own question there; #1 seems to be the only viable option.
Otherwise, you should really do all three steps: data integrity should be handled at the database level, but extra checks in the code (such as the explicit transaction) to avoid round-trips to the database could be good for performance.
Re: "You need to implement the Synchronizer Token pattern."
This question is about JavaScript/HTML, not Java.