Handling large transactions: any time/memory tradeoffs? - performance

In our system there is a (quite common) case where a user's action can trigger an operation that sets/removes labels on/from nodes and relationships for a total on the order of hundreds of thousands of entities (remove label A from 100K nodes, set label B on 80K nodes, set properties [x, y, z] on 20K nodes, and so on). Of course I can't squeeze them all into one transaction. Thanks to the fact that these nodes can easily be separated into a large number of subsets, I perform the actions inside a number of separate transactions, which of course breaks ACIDity but satisfies us in terms of performance. If, however, I try to nest those transactions into a single large one to rule them all, that top-level transaction tries to track all the internal transactions' updates to the DB, which of course results in extremely poor performance.
What can you recommend to solve this problem?
My config (well, its relevant parts):
"org.neo4j.server.database.mode" : "HA",
"use_memory_mapped_buffers" : "true",
"neostore.nodestore.db.mapped_memory" : "450M",
"neostore.relationshipstore.db.mapped_memory" : "450M",
"neostore.propertystore.db.mapped_memory" : "450M",
"neostore.propertystore.db.strings.mapped_memory" : "300M",
"neostore.propertystore.db.arrays.mapped_memory" : "50M",
"cache_type" : "hpc",
"dense_node_threshold" : "15",
"query_cache_size" : "150"
Any hints and clues are much appreciated :)

You are right that modifying hundreds of thousands of entities as a result of a user action in the same transaction isn't going to be performant. Nested transactions in Neo4j are just "placebo" transactions, as you correctly point out.
I would start by thinking about alternative strategies to achieve your goal (which I know nothing about) without needing to update so many entities.
If an alternative isn't possible, I would ask whether it is ok for the updates to happen a short time after the user action. If the answer is yes, then I would store a message about the user action in a persistent queue, which I would process asynchronously. That way, the user call returns quickly and the update happens eventually.
Finally, if it is acceptable for the time between the user action and the large update to be even longer, I would consider an "agent" that continuously crawls the graph and updates the labels of the entities it encounters, as opposed to transaction-driven updates. Have a look at GraphAware NodeRank for inspiration.
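For reference, here is a minimal sketch of the chunked-transaction approach the question already uses (and which a queue-driven worker would run asynchronously), written against the embedded Java API of the Neo4j 2.x line your config suggests; the label names, node-id list and batch size are assumptions:

import java.util.List;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class BatchedRelabeler {

    private static final int BATCH_SIZE = 10_000; // assumption: tune so one batch's tx state fits in heap

    private final GraphDatabaseService db;

    public BatchedRelabeler(GraphDatabaseService db) {
        this.db = db;
    }

    // Removes label A and adds label B for the given node ids, one small transaction per batch.
    // A failure only rolls back the current batch, so ACIDity holds per batch, not for the whole job.
    public void relabel(List<Long> nodeIds) {
        Label a = DynamicLabel.label("A");
        Label b = DynamicLabel.label("B");
        for (int from = 0; from < nodeIds.size(); from += BATCH_SIZE) {
            int to = Math.min(from + BATCH_SIZE, nodeIds.size());
            try (Transaction tx = db.beginTx()) {
                for (long id : nodeIds.subList(from, to)) {
                    Node node = db.getNodeById(id);
                    node.removeLabel(a);
                    node.addLabel(b);
                }
                tx.success(); // commit just this batch
            }
        }
    }
}

Each transaction only ever tracks one batch's worth of changes, which is exactly what keeps the memory footprint bounded compared to the single "transaction to rule them all".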

Related

CQRS Where to Query for business logic/Internal Processes

I'm currently looking at implementing CQRS driven by events (not yet event sourcing) for a service at work; the reasoning being:
I need aggregate data to support a REST API coming out of this service (which will be used to populate views) - however, the aggregated data will not be used by the application logic/processing (i.e. the data originating outside this service won't be used by the logic; the bits of the aggregate originating within it will be)
I need to stream events to other systems so that they can react to the data (I will produce to a Kafka topic, so the 'read'/'projection' side of this system will consume the same events as the external systems, from these Kafka topics)
I will be consuming events from internal systems to help populate the aggregate for the views in the first point (i.e. it's data from this service and others)
The reason for not going event sourced currently is that a) we're in a bit of a time crunch, and b) we're still learning about it. Having said that, it is something we are looking to do in the future - though currently we have a static DB on the 'Command' side of the system, which will just store current state.
I'm pretty confident with the concept of using the aggregate data to provide the REST API; however, my confusion comes from when I want to change a resource from within the system (for example via a cron job triggered 5 times a day). Example:
If I have a resource of class x which (given some data) wants a piece of state changed
I need to select the instances of class x which meet the requirements (from one of the DBs). Think select * from {class x} where last_changed_date > 5 days ago;
Then create a command to change the state of these instances of x (in my case, the static command DB would be updated, as well as an event emitted to update the read DB)
The middle bullet point is what is confusing me. If I pull the data out of the Read DB and check some information on it, then decide to change a property, I then have to convert the object from the 'Read Object' to the 'Command Object' so that I can persist it and create an event? With my current architecture I could query the command DB no problem to find all the instances of {class x} that match the criteria, however I don't know a) whether this is the right thing to do, and b) how this would work if I were using an event store as a DB - I'd have to query a table with millions of rows to find the most recent bit of state about the objects, to then see if they match?
Lots of what I read online has been very conceptual- so I think when it comes to implementations it maybe seems more difficult than it is? Anyhow, if anyone has any advice it would be hugely appreciated!
TIA :)
CQRS can be interpreted in a "permissive" way: rather than saying "thou shalt not query the command/write side", it says "it's OK to have a query/read side that's separate from the command/write side". Because you have this permission to make such a separation, it follows that one can optimize the command/write side for a more write-heavy workload (in practice there are always some reads on the command/write side: since command validation is typically done against some state, that requires some means of getting the state!). From this, it's extremely likely that there will be some queries which can be performed efficiently against the command/write side and some that can't be (without deoptimizing the command/write side). From this perspective, it's OK to perform the first kind of query against the command/write side: you get the benefit of strong consistency by doing so, though make sure you're not affecting the command/write side's primary raison d'être of taking writes.
Event sourcing is in many ways the maximally optimized persistence model for a command/write side, especially if you have some means of keeping the absolute latest state cached and ensuring concurrency control. This is because you can then have many times more writes than reads. The tradeoff in event sourcing is that nearly all reads become rather more expensive than in an update-in-place model: it's thus generally the case that CQRS doesn't force event sourcing but event sourcing tends to force CQRS (and in turn, event sourcing can simplify ensuring that a CQRS system is eventually consistent, which can be difficult to ensure with update-in-place).
In an event-sourced system, you would tend to have a read-side which subscribes to the event stream, tracks the mapping of X's ID to its last-updated time, and periodically queries that mapping and issues commands. Alternatively, you can have a scheduler service that lets you say "issue this command at this time, unless canceled or rescheduled before then", and a read-side which subscribes to updates and, on each update, cancels the command scheduled by the previous update and schedules a command for the given ID 5 days from now.
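As a rough sketch of that first variant (all names here are hypothetical and framework-agnostic): a read-side projection keeps an id-to-last-changed map fed by the event stream, and the scheduled job walks that map and issues commands, so the command side is never scanned:

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical event and command types - names are illustrative, not from any framework.
record XChanged(String xId, Instant changedAt) {}
record ChangeStateOfX(String xId) {}

interface CommandBus {
    void send(ChangeStateOfX command);
}

class StaleXProjection {

    private final Map<String, Instant> lastChanged = new ConcurrentHashMap<>();
    private final CommandBus commandBus;

    StaleXProjection(CommandBus commandBus) {
        this.commandBus = commandBus;
    }

    // Subscribed to the event stream (e.g. the Kafka topic this service already publishes to).
    void on(XChanged event) {
        lastChanged.merge(event.xId(), event.changedAt(),
                (previous, latest) -> latest.isAfter(previous) ? latest : previous);
    }

    // Invoked by the cron job: no scan of the command DB or the event store is needed,
    // because the projection already holds "id -> last changed" for every X it has seen.
    void issueCommandsForStaleX() {
        Instant cutoff = Instant.now().minus(Duration.ofDays(5));
        lastChanged.forEach((id, changedAt) -> {
            if (changedAt.isBefore(cutoff)) {
                commandBus.send(new ChangeStateOfX(id));
            }
        });
    }
}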

Seeding microservices databases

Given service A (a CMS) that controls a model (Product; let's assume its only fields are id, title and price), and services B (Shipping) and C (Emails) that have to display that model, what should the approach be to synchronizing the model's information across those services in an event-sourcing approach? Let's assume that the product catalog rarely changes (but does change) and that there are admins who access shipment and email data very often (example functionalities are B: display the titles of the products the order contained, and C: display the content of the shipping email that is about to be sent). Each of the services has its own DB.
Solution 1
Send all required information about the Product within the event - this means the following structure for order_placed:
{
  order_id: [guid],
  product: {
    id: [guid],
    title: 'Foo',
    price: 1000
  }
}
On services B and C, product information is stored in a product JSON attribute on the orders table
As such, only data retrieved from the event is used to display the necessary information
Problems: depending upon what other information needs to be presented in B and C, the amount of data in the event can grow. B and C might not require the same information about the Product, but the event will have to contain both (unless we separate the events into two). If given data is not present within the event, the code cannot use it - if we add a color option to the Product, then for existing orders in B and C the product will be colorless unless we update the events and then rerun them.
Solution 2
Send only the guid of the Product within the event - this means the following structure for order_placed:
{
  order_id: [guid],
  product_id: [guid]
}
On services B and C, product information is stored in a product_id attribute on the orders table
Product information is retrieved by services B and C when required, by performing an API call to the A/product/[guid] endpoint
Problems: this makes B and C dependent upon A (at all times). If the schema of Product changes on A, changes have to be made (suddenly) on all services that depend on it
Solution 3
Send only the guid of the Product within the event - this means the following structure for order_placed:
{
  order_id: [guid],
  product_id: [guid]
}
On services B and C, product information is stored in a products table; there's still a product_id on the orders table, but now product data is replicated between A, B and C; B and C might contain different information about the Product than A
Product information is seeded when services B and C are created and is updated whenever information about Products changes, either by making a call to the A/product endpoint (which returns the required information for all products) or by direct DB access to A, copying the product information required by the given service
Problems: this makes B and C dependent upon A (when seeding). If the schema of Product changes on A, changes have to be made on all services that depend on it (when seeding)
From my understanding, the correct approach would be to go with solution 1, and either update the event history per certain logic (if the Product catalog hasn't changed and we want color to be displayed, we can safely update the history to get the current state of Products and fill in the missing data within the events) or cater for the nonexistence of given data (if the Product catalog has changed and we want color to be displayed, we can't be sure whether a given Product had a color at that point in the past - we can assume that all Products in the previous catalog were black and cater for that by updating either the events or the code)
Solution #3 is really close to the right idea.
A way to think about this: B and C are each caching "local" copies of the data that they need. Messages processed at B (and likewise at C) use the locally cached information. Likewise, reports are produced using the locally cached information.
The data is replicated from the source to the caches via a stable API. B and C don't even need to be using the same API - they use whatever fetch protocol is appropriate for their needs. In effect, we define a contract -- protocol and message schema -- which constrains the provider and the consumer. Then any consumer for that contract can be connected to any supplier. Backward-incompatible changes require a new contract.
Services choose the appropriate cache invalidation strategy for their needs. This might mean pulling changes from the source on a regular schedule, or in response to a notification that things may have changed, or even "on demand" -- acting as a read through cache, falling back to the stored copy of the data when the source is not available.
This gives you "autonomy", in the sense that B and C can continue to deliver business value when A is temporarily unavailable.
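To make the local-cache idea concrete, here is a small sketch of one possible invalidation strategy - the read-through variant with fallback (the ProductCopy fields and the ProductSource abstraction are assumptions; the call behind it would be something like the A/product/[guid] endpoint from the question):

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// The fields B actually needs - deliberately not A's full Product schema.
record ProductCopy(String id, String title, long price) {}

// Whatever fetch protocol B agreed on with A (REST call to A/product/[guid], a feed, etc.).
interface ProductSource {
    ProductCopy fetch(String productId) throws Exception;
}

class LocalProductCache {

    private final Map<String, ProductCopy> stored = new ConcurrentHashMap<>(); // stands in for B's own table
    private final ProductSource source;

    LocalProductCache(ProductSource source) {
        this.source = source;
    }

    // Read-through: prefer fresh data from the source, fall back to the stored copy when A is unavailable.
    Optional<ProductCopy> get(String productId) {
        try {
            ProductCopy fresh = source.fetch(productId);
            stored.put(productId, fresh); // keep the local copy current
            return Optional.of(fresh);
        } catch (Exception sourceUnavailable) {
            return Optional.ofNullable(stored.get(productId)); // possibly stale, but B stays autonomous
        }
    }
}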
Recommended reading: Data on the Outside versus Data on the Inside, Pat Helland, 2005.
Generally speaking, I'd strongly recommend against option 2 because of the temporal coupling between those services (unless communication between these services is super stable, and not very frequent). Temporal coupling is what you describe as "this makes B and C dependent upon A (at all times)", and it means that if A is down or unreachable from B or C, B and C cannot fulfill their function.
I personally believe that both options 1 and 3 have situations where they are valid options.
If the communication volume between A and B & C is very high, or the amount of data that needs to go into the event is large enough to make it a concern, then option 3 is the best option, because the burden on the network is much lower and the latency of operations will decrease as the message size decreases. Other concerns to consider here are:
Stability of the contract: if the contract of the messages leaving A changes often, then putting a lot of properties in the message would result in lots of changes in consumers. However, in this case I don't believe this is a big problem, because:
You mentioned that system A is a CMS. This means that you're working in a stable domain, and as such I don't believe you'll be seeing frequent changes
Since B and C are shipping and email, and you're receiving data from A, I believe you'll be experiencing additive changes instead of breaking ones, which are safe to add whenever you discover them with no rework
Coupling: there is very little to no coupling here. First, since the communication is via messages, there is no coupling between the services other than a short temporal one during seeding of the data, and the contract of that operation (which is not a coupling you can or should try to avoid)
Option 1 is not something I'd dismiss, though. There is the same amount of coupling, but development-wise it should be easy to do (no need for special actions), and the stability of the domain should mean that these won't change often (as I mentioned already).
Another option I'd suggest is a slight variation of 3: do not run the process during start-up, but instead observe "ProductAdded" and "ProductDetailsChanged" events on B and C whenever there is a change in the product catalogue in A. This would make your deployments faster (and so it would be easier to fix a problem/bug if you find any).
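A sketch of that variation (the event and field names are assumptions; wire the handlers to whatever broker carries A's catalogue events): B and C keep their own products table current by handling the events instead of re-seeding at start-up:

// Hypothetical catalogue events published by A, carrying only what B and C display.
record ProductAdded(String productId, String title, long price) {}
record ProductDetailsChanged(String productId, String title, long price) {}

// Stands in for the consuming service's own products table.
interface LocalProductStore {
    void upsert(String productId, String title, long price);
}

class ProductCatalogueListener {

    private final LocalProductStore store;

    ProductCatalogueListener(LocalProductStore store) {
        this.store = store;
    }

    void on(ProductAdded event) {
        store.upsert(event.productId(), event.title(), event.price());
    }

    void on(ProductDetailsChanged event) {
        // Idempotent upsert, so replays and out-of-order deliveries stay harmless.
        store.upsert(event.productId(), event.title(), event.price());
    }
}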
Edit 2020-03-03
I have a specific order of priorities when determining the integration approach:
What is the cost of consistency? Can we accept some milliseconds of inconsistency between things changed in A and them being reflected in B & C?
Do you need point-in-time queries (also called temporal queries)?
Is there any source of truth for the data? A service which owns them and is considered upstream?
If there is an owner / single source of truth is that stable? Or do we expect to see frequent breaking changes?
If the cost of inconsistency is high (basically the product data in A needs to be consistent as soon as possible with the product data cached in B and C), then you cannot avoid accepting unavailability, and you have to make a synchronous request (like a web/REST request) from B & C to A to fetch the data. Be aware! This still does not mean transactional consistency, it just minimizes the window of inconsistency. If you absolutely, positively have to be immediately consistent, you need to rethink your service boundaries. However, I very strongly believe this should not be a problem. From experience, it's actually extremely rare that the company can't accept some seconds of inconsistency, so you shouldn't even need to make synchronous requests.
If you do need point-in-time queries (which I didn't notice in your question and hence didn't include above, maybe wrongly), the cost of maintaining this in downstream services is so high (you'd need to duplicate internal event-projection logic in all downstream services) that it makes the decision clear: you should leave ownership to A and query A ad hoc over a web request (or similar), and A should use event sourcing to retrieve all the events known at that point in time, project them to the state, and return it. I guess this may be option 2 (if I understood correctly?), but the costs are such that the temporal coupling is better than the maintenance cost of duplicated events and projection logic.
If you don't need point-in-time queries, and there isn't a clear, single owner of the data (which in my initial answer I assumed based on your question), then a very reasonable pattern would be to hold representations of the product in each service separately. When you update the data for products, you update A, B and C in parallel by making parallel web requests to each one, or you have a command API which sends multiple commands, one to each of A, B and C. B & C use their local version of the data to do their job, which may or may not be stale. This isn't any of the options above (although it could be made to be close to option 3), as the data in A, B and C may differ, and the "whole" of the product may be a composition of all three data sources.
Knowing whether the source of truth has a stable contract is useful because you can then use the domain/internal events (or the events you store as part of event sourcing as the storage pattern in A) for integration across A and services B and C. If the contract is stable, you can integrate through the domain events. However, you then have an additional concern if changes are frequent, or if the message contract is large enough to make transport a concern.
If you have a clear owner, with a contract that is expected to be stable, the best option would be option 1; an order would contain all the necessary information, and then B and C would do their function using the data in the event.
If the contract is liable to change or break often, following your option 3 - that is, falling back to web requests to fetch product data - is actually the better option, since it's much easier to maintain multiple versions there. So B would make a request against v3 of Product.
There are two hard things in Computer Science, and one of them is cache invalidation.
Solution 2 is absolutely my default position, and you should generally only consider implementing caching if you run into one of the following scenarios:
The API call to Service A is causing performance problems.
The cost of Service A being down and being unable to retrieve the data is significant to the business.
Performance problems are really the main driver. There are many ways of solving #2 that don't involve caching, like ensuring Service A is highly available.
Caching adds significant complexity to a system, and can create edge cases that are hard to reason about, and bugs that are very hard to replicate. You also have to mitigate the risk of providing stale data when newer data exists, which can be much worse from a business perspective than (for example) displaying a message that "Service A is down--please try again later."
From this excellent article by Udi Dahan:
These dependencies creep up on you slowly, tying your shoelaces together, gradually slowing down the pace of development, undermining the stability of your codebase where changes to one part of the system break other parts. It's a slow death by a thousand cuts, and as a result nobody is exactly sure what big decision we made that caused everything to go so bad.
Also, if you need point-in-time querying of product data, this should be handled in the way the data is stored in the Product database (e.g. start/end dates) and should be clearly exposed in the API (the effective date needs to be an input to the API call that queries the data).
It is very hard to simply say one solution is better than the other. Choosing between Solutions #2 and #3 depends on other factors (cache duration, consistency tolerance, ...).
My 2 cents:
Cache invalidation might be hard, but the problem statement mentions that the product catalog changes rarely. This fact makes product data a good candidate for caching
Solution #1 (NOK)
Data is duplicated across multiple systems
Solution #2 (OK)
Offers strong consistency
Works only when product service is highly available and offers good performance
If the email service prepares a summary (with a lot of products), then the overall response time could be longer
Solution #3 (Complex but preferred)
Prefer API approach instead of direct DB access to retrieve product information
Resilient - consuming services are not impacted when product service is down
Consuming applications (shipping and email services) retrieve product details immediately after an event is published. The possibility of product service going down within these few milliseconds is very remote.

Consistent N1QL Query with the Couchbase GOCB SDK

I'm currently implementing EventSourcing for my Go Actor lib.
The problem that I have right now is that when an actor restarts and needs to replay all its state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is however very slow: writing about 10 000 events in order takes about 5 sec on my setup.
If I instead write those 10 000 asynchronously, using goroutines, I can write all of the data in less than one sec.
But then the writes are in a nondeterministic order and I can't know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to goroutine scheduling, AFAIK.
What are my options here?
Technically speaking, MutationToken and asynchronous operations are not mutually exclusive. It may be possible without a change to the client (I'm not sure), but the key here is to take all of the MutationToken responses and then issue the query with the highest sequence number per vbucket from all of them.
The key here is that, given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward, and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to the query at the time the query is issued.
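The merging described above boils down to "keep the highest sequence number per vbucket". Here is a language-agnostic sketch of that bookkeeping (written in Java purely for illustration - it is not the gocb client API; the Token fields mirror what a Couchbase mutation token carries): collect a token from every asynchronous write, fold them together, and hand the result to the query as its consistency requirement.

import java.util.HashMap;
import java.util.Map;

// Conceptual stand-in for a mutation token: vbucket id, vbucket UUID and sequence number.
record Token(int vbucketId, long vbucketUuid, long seqno) {}

class ConsistencyRequirement {

    // One entry per vbucket, keeping the highest sequence number seen so far.
    private final Map<Integer, Token> highestPerVbucket = new HashMap<>();

    // Call this with the token returned by each asynchronous write, in any order.
    void add(Token token) {
        highestPerVbucket.merge(token.vbucketId(), token,
                (current, candidate) -> candidate.seqno() > current.seqno() ? candidate : current);
    }

    // What the N1QL query would need to wait for ("at least these sequence numbers").
    Map<Integer, Token> asScanConsistency() {
        return Map.copyOf(highestPerVbucket);
    }
}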

SubSonic AddMany() vs foreach loop Add()

I'm trying to figure out whether or not SubSonic's AddMany() method is faster than a simple foreach loop. I poked around a bit on the SubSonic site but didn't see much on performance stats.
What I currently have (.ForEach() just has some validation in it; other than that it works just like foreach(...) { do stuff }):
records.ForEach(record =>
{
    newRepository.Add(record);
    recordsProcessed++;
    if (cleanUp) oldRepository.Delete<T>(record);
});
Which would change to:
newRepository.AddMany(records);
if (cleanUp) oldRepository.DeleteMany<T>(records);
If you notice, with this method I lose the count of how many records I've processed, which isn't critical... but it would be nice to be able to display to the user how many records were moved with this tool.
So my questions boil down to: Would AddMany() be noticeably faster to use? And is there any way to get a count of the number of records actually copied over? If it succeeds can I assume all the records were processed? If one record fails, does the whole process fail?
Thanks in advance.
Just to clarify, AddMany() generates individual queries per row and submits them via a batch; DeleteMany() generates a single query. Please consult the source code and the generated SQL when you want to know what happens to your queries.
Your first approach is slow: 2*N queries. However, if you submit the queries using a batch it would be faster.
Your second approach is faster: N+1 queries. You can find how many will be added simply by enumerating 'records'.
If there is a risk of exceeding capacity limits on the size of a batch, then submit 50 or 100 at a time with little penalty.
Your final question depends on transactions. If the whole operation is one transaction, it will commit or abort as one. Otherwise, each query will stand alone. Your choice.
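To illustrate the "submit 50 or 100 at a time" point and how to keep the per-record count, here is a sketch (in Java purely for illustration - the same shape works in C# with GetRange/AddMany/DeleteMany; the Repository interface is an assumption standing in for the SubSonic repository calls in the question):

import java.util.List;

class BatchedCopy {

    // Minimal stand-in for the repository calls used in the question.
    interface Repository<T> {
        void addMany(List<T> records);
        void deleteMany(List<T> records);
    }

    // Copies records in fixed-size chunks so no single batch grows unbounded,
    // and returns the processed count the question wants to display.
    static <T> int copy(List<T> records, Repository<T> target, Repository<T> source,
                        boolean cleanUp, int batchSize) {
        int processed = 0;
        for (int from = 0; from < records.size(); from += batchSize) {
            List<T> chunk = records.subList(from, Math.min(from + batchSize, records.size()));
            target.addMany(chunk);
            if (cleanUp) {
                source.deleteMany(chunk);
            }
            processed += chunk.size();
        }
        return processed;
    }
}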

Keeping track of the states of an object - SOS

Hi the wise folks at SO. This is an SOS.
I'm in deep trouble. In my web application there is an object (say it is a request for something). The user submits his/her request. After this it goes to the people who can approve/disapprove that request. During the period from submission to approval/disapproval, many actions can be taken on the request. I have to present the user with an actions panel (a collection of links) with which he/she can modify the state of the request.
Now, based on which stage of processing the request is in, some actions are not allowed. Also, if some action has already been taken, it excludes the possibility of other actions.
Overall it creates a pretty complex matrix of allowed/forbidden actions that my tiny head is not able to take care of.
I've created some static classes/methods which return the arrays of allowed actions based on the state of the request. There are about 20 states that a request can be in. Based on the state, I've taken care to remove/disable links for actions that are not possible in that state.
Now the problem is this: suppose the request is in state X.
If in the past action l has been taken on the request, we may no longer allow l, or, based on this, some arbitrary actions m, n, o.
After writing all the methods to get the arrays of links for the 20 states, I have to filter the arrays based on the past history of actions (which is stored in a SQL DB), which is a very, very big task.
Please suggest some pattern which is easier to implement and efficient. This is getting on my nerves.
As I understand it, you have a real-world workflow scenario. In this case I would:
Model the entire state as a single entity if possible (a single row with a fixed number of fields). I would not model this as a set of actions.
Model each action as some change to the row. It is quite obvious when the user enters some data, but I would also model each acceptance as either a boolean field or a state field, depending on whether the acceptance is done by independent departments or is a cascade of acceptances within a single department.
Also, there may be a situation where an acceptance is given for some particular parameter and the parameter may change in the future, requiring new acceptance. In this case I would model such a scenario as two fields: one for the parameter value and a second for the accepted value. I would decide whether an acceptance is still needed based on the difference between these two fields. This allows for implementing some thresholds.
Having the state modeled as a single row, I would implement independent predicates for action allowance (see the sketch after this answer).
I think that point 4 is the most important one. If you are able to implement independent predicates for enabling actions, then you will be able to easily modify them in the future.
Having points 1-3 properly implemented, you will be able to easily implement acceptance revoking, which may be required and in this case may make the overall code size smaller.
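Here is the sketch referenced in point 4 (all field, action and rule names are made up): the whole request lives in one row/object, and each action has its own independent predicate, so rendering the actions panel is just a filter over the rules.

import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Points 1-3: the entire state is one row with a fixed number of fields (illustrative names).
record RequestState(String stage, boolean managerAccepted, boolean financeAccepted,
                    int requestedAmount, int acceptedAmount) {}

enum Action { EDIT, SUBMIT, ACCEPT_FINANCE, REJECT, REVOKE_ACCEPTANCE }

class ActionPolicy {

    // Point 4: one independent predicate per action; each rule can change without touching the others.
    private final Map<Action, Predicate<RequestState>> rules = new EnumMap<>(Action.class);

    ActionPolicy() {
        rules.put(Action.EDIT, s -> !s.managerAccepted());
        rules.put(Action.SUBMIT, s -> s.stage().equals("DRAFT"));
        rules.put(Action.ACCEPT_FINANCE, s -> s.managerAccepted() && !s.financeAccepted());
        rules.put(Action.REJECT, s -> !s.financeAccepted());
        // Point 3's threshold idea: acceptance can be revoked when the parameter drifted from what was accepted.
        rules.put(Action.REVOKE_ACCEPTANCE,
                s -> s.managerAccepted() && s.requestedAmount() != s.acceptedAmount());
    }

    // What the actions panel should render for the current state.
    Set<Action> allowed(RequestState state) {
        Set<Action> result = EnumSet.noneOf(Action.class);
        rules.forEach((action, rule) -> {
            if (rule.test(state)) {
                result.add(action);
            }
        });
        return result;
    }
}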
Sounds like a job for a state machine workflow, or a few giant nested switches (whichever you prefer).
First thing that came to my mind: a state machine. Each state is some kind of object. All states have some method "processRequest" that transitions the execution into the next state.
The second thing that came to my mind: these states have to be organized like a tree or graph. The graph represents the history of requests. You start in the initial state. You get request A, you proceed to state A. After that, you get request B, you proceed to AB. Whether state AB is equal to BA is not clear from your description.
That way you get far more states than the 20 states you have now, but each state includes the history. I'd suggest a naming convention based on the path you took to get there (like AB above). And perhaps you can reuse states A and B in AB, to minimize coding.
