In Firebase Database, how to read a large list and then get child updates, efficiently

I want to load a large list (say, 5,000 items) from a Firebase Database as fast as possible, and then get updates whenever a child is added or changed.
My approach so far is to use ObserveSingleEvent to load the entire list at startup:
pathRef.ObserveSingleEvent(DataEventType.Value, (snapshot) => { /* populate the list */ });
and, when that has finished, use ObserveEvent for child changes and additions:
nuint observerHandle1 = pathRef.ObserveEvent(DataEventType.ChildChanged, (snapshot) => { /* apply the change */ });
nuint observerHandle2 = pathRef.ObserveEvent(DataEventType.ChildAdded, (snapshot) => { /* add the item */ });
The problem is that the ChildAdded event is triggered 5,000 times, almost all of which are unnecessary, and it takes a long time to complete.
(From the docs: "child_added is triggered once for each existing child and then again every time a new child is added to the specified path.")
And I can't just ignore the first update for each child, because it might be a genuine change (e.g. the database was updated just after ObserveSingleEvent completed).
So I have to check each item on the client to see whether it has changed. This is all very slow and inefficient.
I've seen posts suggesting a query with LimitToLast etc., but this is not reliable, as child changes may be missed.
My Plan B is to skip ObserveSingleEvent and just rely on the 5,000 ChildAdded calls to load the list, but that seems crazy: it is slower to get the list initially and probably uses more of the user's data allowance.
I would think this is a common scenario, so what is the best solution?
Also, I have persistence enabled, so a couple of related questions:
1) If the entire list has been loaded before, does ObserveSingleEvent with Value just load from disk with no network use when online (assuming no changes have happened since the last download), or does it download the whole list again?
2) Similarly with the 5,000 ChildAdded calls - if it's all on disk, is there any network use when online and nothing has changed?
(I'm using C# for iOS, but I'm sure it's the same issue on other platforms.)

The wire traffic for a value listener and for child_* listeners is exactly the same. The events are purely a client-side interpretation of the underlying data.
If you need all children, consider using only the child_* listeners and dropping the pathRef.ObserveSingleEvent(DataEventType.Value, ...) call. It'll lead to the simplest code.
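To make that concrete, here is a minimal sketch of the child_*-only approach, written against the Firebase web SDK (v8-style API) since the listeners behave the same on every platform; the "items" path and the render/renderAll helpers are hypothetical placeholders:

declare const firebase: any;                 // assumed Firebase web SDK global
declare function render(key: string): void;  // hypothetical UI update hooks
declare function renderAll(): void;

const pathRef = firebase.database().ref("items");
const items = new Map<string, unknown>();
let initialLoadDone = false;

// child_added fires once per existing child, then for every new child.
pathRef.on("child_added", (snap: any) => {
  items.set(snap.key, snap.val());
  if (initialLoadDone) render(snap.key); // skip per-item UI work during the initial burst
});
pathRef.on("child_changed", (snap: any) => {
  items.set(snap.key, snap.val());
  render(snap.key);
});

// A value event on the same path is delivered after the initial batch of
// child_added events, so it can mark the end of the initial load.
pathRef.once("value", () => {
  initialLoadDone = true;
  renderAll();
});

Since the wire traffic is identical either way, this costs no more bandwidth than ObserveSingleEvent followed by listeners, and it removes the need to diff 5,000 redundant callbacks on the client.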

Related

CQRS: Where to query for business logic/internal processes

I'm currently looking at implementing CQRS driven by events (not yet event sourcing) for a service at work; the reasoning being:
I need aggregated data to support a REST API coming out of this service (which will be used to populate views); however, the aggregated data will not be used by the application logic/processing (i.e. the data originating outside this service won't be used by it, while the bits of the aggregate originating within it will be).
I need to stream events to other systems so that they can react to the data (I will produce to a Kafka topic, so the 'read'/'projection' side of this system will consume the same events as the external systems, from these Kafka topics).
I will be consuming events from internal systems to help populate the aggregate for the views in the first point (i.e. it's data from this service and others).
The reason for not going event-sourced currently is that (a) we're in a bit of a time crunch, and (b) we're still learning about it. Having said that, it is something we are looking to do in the future; for now, we have a static DB on the 'command' side of the system, which will just store current state.
I'm pretty confident with the concept of using the aggregate data to provide the REST API; my confusion comes from when I want to change a resource from within the system (for example, via a cron job triggered 5 times a day). Example:
If I have a resource of class X which (given some data) needs a piece of state changed:
I need to select the instances of class X which meet the requirements (from one of the DBs). Think select * from {class x} where last_changed_date > 5 days ago;
Then create a command to change the state of these instances of X (in my case, the static command DB would be updated, and an event produced to update the read DB).
The middle bullet point is what is confusing me. If I pull the data out of the read DB and check some information on it, then decide to change a property, I then have to convert the object from a 'read object' to a 'command object' so that I can persist it and create an event? With my current architecture I could query the command DB no problem to find all the instances of {class x} that match the criteria, but I don't know (a) whether this is the right thing to do, and (b) how this would work if I were using an event store as a DB. Would I have to query a table with millions of rows to find the most recent bit of state about the objects, just to see if they match?
Lots of what I've read online has been very conceptual, so I think when it comes to implementation it may seem more difficult than it is. Anyhow, if anyone has any advice it would be hugely appreciated!
TIA :)
CQRS can be interpreted in a "permissive" way: rather than saying "thou shalt not query the command/write side", it says "it's OK to have a query/read side that's separate from the command/write side". Because you have this permission to separate them, it follows that you can optimize the command/write side for a more write-heavy workload (in practice, there are always some reads on the command/write side: since command validation is typically done against some state, that requires some means of getting the state!).
From this, it's extremely likely that some queries can be performed efficiently against the command/write side and some can't be (without deoptimizing the command/write side). From this perspective, it's OK to perform the first kind of query against the command/write side: you get the benefit of strong consistency by doing so, though make sure you're not interfering with the command/write side's primary raison d'etre of taking writes.
Event sourcing is in many ways the maximally optimized persistence model for a command/write side, especially if you have some means of keeping the absolute latest state cached and of ensuring concurrency control, because you can then sustain many times more writes than reads. The tradeoff in event sourcing is that nearly all reads become rather more expensive than in an update-in-place model: it's thus generally the case that CQRS doesn't force event sourcing, but event sourcing tends to force CQRS (and in turn, event sourcing can simplify ensuring that a CQRS system is eventually consistent, which can be difficult to ensure with update-in-place).
In an event-sourced system, you would tend to have a read-side which subscribes to the event stream and tracks the mapping of X ID to last updated and which periodically queries and issues commands. Alternatively, you can have a scheduler service that lets you say "issue this command at this time, unless canceled or rescheduled before then" and a read-side which subscribes to updates and schedules a command for the given ID 5 days from now after canceling the command from the previous update.
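As a rough illustration of the first option, here is a sketch (all names - XUpdated, commandBus, ChangeStateOfX - are hypothetical, not a real API) of a read-side that tracks last-updated timestamps from the event stream and periodically issues commands:

declare const commandBus: { send(cmd: { type: string; id: string }): void }; // assumed

const lastUpdated = new Map<string, Date>();

// Projection: subscribe to the event stream and track when each X changed.
function onEvent(event: { type: string; id: string; at: Date }): void {
  if (event.type === "XUpdated") lastUpdated.set(event.id, event.at);
}

// Cron-style job: query the projection (not the event store) and issue commands.
function issueStaleCommands(now: Date): void {
  const fiveDaysMs = 5 * 24 * 60 * 60 * 1000;
  for (const [id, at] of lastUpdated) {
    if (now.getTime() - at.getTime() > fiveDaysMs) {
      commandBus.send({ type: "ChangeStateOfX", id });
    }
  }
}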

Is there a "best practice" in microservice development for versioning a database table?

A system is being implemented using microservices. In order to decrease interactions between microservices implemented "at the same level" in an architecture, some microservices will locally cache copies of tables managed by other services. The assumption is that the locally cached table (a) is frequently accessed in "read mode" by the microservice, and (b) has relatively static content (i.e. more of a "lookup table" than transactional content).
The local caches will stay in sync using inter-service messaging. As the content should be fairly static, this should not be a significant issue/workload. However, on startup of a microservice there is a possibility that its local cache has gone stale.
I'd like to implement some sort of rolling revision number on the source table, so that microservices with local caches can check this revision number to potentially avoid a re-synch event.
Is there a "best practice" to this approach? Or, a "better alternative", given that each microservice is backed by it's own database (i.e., no shared database)?
In my opinion you shouldn't be loading the data at startup; maintaining a version can get complicated.
Cache-Aside Pattern
Generally in a microservices architecture you'd consider the "cache-aside pattern". You don't build the cache up front, but on demand: when you get a request you check the cache, and if the value isn't there you fetch the latest value, update the cache, and return the response; from then on it's always returned from the cache. The benefit is that you don't need to load everything up front. Say you have 200 records while the services only use 50 of them frequently; you'd be maintaining extra cache that may not be required.
Let the requests build the cache; it's a one-time DB hit. You can set an expiry on the cache, and incoming requests will build it again.
If you have data which is totally static (never ever changes) then this pattern may not be worth discussing, but if you have a lookup table that can change even once a week or month, then you should use this pattern with a longer cache-expiration time. Maintaining a version could be costly, but it's really up to you how you want to implement it.
https://learn.microsoft.com/en-us/azure/architecture/patterns/cache-aside
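A minimal sketch of the pattern (the db handle and the TTL value are illustrative assumptions):

declare const db: { query(key: string): Promise<unknown> }; // assumed lookup-table access

const cache = new Map<string, { value: unknown; expiresAt: number }>();
const TTL_MS = 60 * 60 * 1000; // longer expiry suits near-static lookup data

async function getLookup(key: string): Promise<unknown> {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) {
    return hit.value;                // served from cache
  }
  const value = await db.query(key); // one-time DB hit on a miss
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}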
We ran into this same issue and have temporarily solved it by using a LastUpdated timestamp comparison (same concept as your VersionNumber). Every night (when our application tends to be slow), each service publishes a ServiceXLastUpdated message that includes the most recent timestamp at which the data it owns was added/edited. Any other service that subscribes to this data processes the message, and if there's a mismatch it requests all rows "touched" since its last local update so that it can get back in sync.
For us, for now, this is okay, as new services don't tend to come online and be in use the same day. But our plan going forward is that any time a service starts up, it can publish a message for each subscribed service indicating its most recent cache-update timestamp. If a "source" service sees the timestamp is not current, it can send updates to re-sync the data. This has the advantage of only sending the needed updates to the specific service(s) that need them, even though (at least for us) all subscribed services have access to the messages.
We started out using persistent queues, so if all instances of a microservice were down, the messages would just build up in its queue. There are two issues with this that led us to build something better:
1) It obviously doesn't solve the "first startup" scenario, as there is no queue for messages to build up in.
2) If ANYTHING goes wrong, either in storing queued messages or in processing them, you end up out of sync. If that happens, you still need a proactive mechanism like the one we have now to bring things back in sync. So it seemed worth going this route.
I wouldn't say our method is a "best practice" and if there is one I'm not aware of it. But, the way we're doing it (including planned future work) has so far proven simple to build, easy to understand and monitor, and robust in that it's extremely rare we get an event caused by out-of-sync local data.
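A sketch of the nightly check described above (the message shape and the bus/cache APIs are assumptions, not a real library):

declare const localCache: { lastSyncFor(service: string): string }; // ISO timestamps
declare const bus: { send(msg: object): void };

interface ServiceXLastUpdated {
  service: string;
  lastUpdated: string; // most recent add/edit of the data the service owns
}

function onServiceXLastUpdated(msg: ServiceXLastUpdated): void {
  const localTimestamp = localCache.lastSyncFor(msg.service);
  if (localTimestamp < msg.lastUpdated) {
    // Mismatch: request only the rows touched since our last local update.
    bus.send({ type: "RequestUpdatesSince", service: msg.service, since: localTimestamp });
  }
}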

Angular2: How is change detection triggered?

I am currently trying to debug my application performance. It turns out that many frames take more than 200ms, and most of the time is spent in NumberFormat.format(), due to the various number pipes used.
In the developers tools timeline, I see that change detection is triggered more often than required.
For a single XHR ready-state change of a single request, Lifecycle.tick() is called 11 times. I expected this to happen only once, when I assign the final result to a local field used in the template. Even though the template bindings are not modified, running NumberFormat.format() on each record 11 times in a frame causes a noticeable lag in the app.
This is the Promise chain used in my code once the XHR ready state changes, in this case a search request:
The Promise wrapping the request resolves the response
An array is constructed by parsing the response JSON
For each item in the array, a local model (ItemModel) is constructed:
For each reference to other entities, the entity is fetched from the cache or, if absent, from the remote server, then converted to a local model instance. Once all references are fetched, the Promise resolves to the ItemModel instance. Note that this may cause several new Promises to be created, as several levels of references need to be resolved. Usually, entities are in the cache and these Promises resolve directly.
The result, an array of ItemModel, is assigned to a field in the controller.
This sequence runs pretty fast. However, as seen in the timeline, Lifecycle.tick() is called each time a local model instance is created. So if 10 references need to be resolved to build the complete local model tree, Lifecycle.tick() will be called 10 times. Why?
It also seems that sometimes Zone.fork is called when I create a local model. Why and why not always?
I'd be happy to hear more about how zones and change detection are triggered, in order to improve my app's performance.
Developer tools profile, timeline and network samples can be found here (they expire on Oct 11, 2015).
Application code can be found here.
I finally got around this by using immutable collections from immutable-js and changing the change-detection strategy to ChangeDetectionStrategy.OnPush.
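A simplified sketch of what that looks like (the component and its input are illustrative; the question's app predates Angular's final API, so the syntax here is modernized):

import { Component, ChangeDetectionStrategy, Input } from "@angular/core";
import { List } from "immutable";

@Component({
  selector: "item-list",
  // Only re-check this component when an input changes by reference.
  changeDetection: ChangeDetectionStrategy.OnPush,
  template: `<div *ngFor="let item of items">{{ item.price | number }}</div>`,
})
export class ItemListComponent {
  // immutable-js returns a new List reference on every change, so OnPush
  // knows exactly when the number pipes actually need to re-run.
  @Input() items: List<{ price: number }> = List();
}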
For information about change detection in Angular2:
http://victorsavkin.com/post/114168430846/two-phases-of-angular-2-applications
This blog contains other useful resources.

Posting an update request to ElasticSearch without waiting for completion

I have an ElasticSearch index that stores files, sometimes very large ones. Because the underlying Lucene engine actually does a complete replacement each time a document is updated, the entire document needs to be rewritten behind the scenes even if I am just modifying the value of one field.
For large, multi-MB files this can take a fairly long time (several hundred ms). Since this is done as part of a web application, that is not really acceptable. What I am doing right now is forking the process, so the update is called on a separate thread while the request finishes.
This works, but I'm not really happy with it as a long-term solution, partly because it means that every time I create a new interface to the search engine I'll have to recode the forking logic. It also means I basically can't know whether the request succeeded or some kind of error occurred, without writing additional code to log successful and unsuccessful requests somewhere.
So I'm wondering if there is a little-known feature where you can post an UPDATE request to ElasticSearch and have it return an acknowledgement without waiting for the update task to actually complete.
If you look at the documentation for Snapshot and Restore, you'll see that when you make a request you can add wait_for_completion=true in order to have the entire process run before the result is returned.
What I want is the reverse - the ability to add ?wait_for_completion=false to a POST request.
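Lacking such a flag, one option is at least to centralize the fire-and-forget in a single client-side helper. A sketch using the official @elastic/elasticsearch JavaScript client (the endpoint and names are assumptions):

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" }); // assumed endpoint

function updateWithoutWaiting(index: string, id: string, doc: object): void {
  // Deliberately not awaited: the request goes out and the web request
  // returns immediately; failures are logged instead of blocking.
  client
    .update({ index, id, doc })
    .catch((err) => console.error(`ES update of ${index}/${id} failed`, err));
}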

How to update/migrate data when using CQRS and an EventStore?

So I'm currently diving into the CQRS architecture along with the EventStore "pattern".
It opens applications up to a new dimension of scalability and flexibility, as well as testing.
However I'm still stuck on how to properly handle data migration.
Here is a concrete use case:
Let's say I want to manage a blog with articles and comments.
On the write side I'm using MySQL, and on the read side ElasticSearch; now, every time I process a Command, I persist the data on the write side and dispatch an Event to persist the data on the read side.
Now let's say I have some sort of ViewModel called ArticleSummary, which contains an id and a title.
I have a new feature request: to include the article's tags in my ArticleSummary, I would add some dictionary to my model to hold the tags.
Given that the tags already exist in my write layer, I would need to update or use a new "table" to properly use the newly included data.
I'm aware of the EventLog Replay strategy, which consists of replaying all the events to "update" all the ViewModels, but, seriously, is that viable when we have a billion rows?
Are there any proven strategies? Any feedback?
"I'm aware of the EventLog Replay strategy, which consists of replaying all the events to "update" all the ViewModels, but, seriously, is that viable when we have a billion rows?"
I would say "yes" :)
You are going to write a handler for the new summary feature that updates your query side anyway, so you already have the code. Writing special once-off migration code may not buy you all that much. I would go with migration code when you have to do a once-off initial load of, say, a new system that requires some data transformation, but in this case your infrastructure already exists.
You would only need to send the relevant events to the new handler, so you wouldn't replay everything either.
In any event, if you have a billion rows of data your servers would probably be able to handle the load :)
I'm currently using NEventStore by JOliver.
When we started, we were replaying our entire store back through our denormalizers/event handlers when the application started up.
We were initially keeping all our data in memory but knew this approach wouldn't be viable in the long term.
The approach we currently use is to replay through an individual denormalizer, which makes things a lot faster since you aren't unnecessarily replaying events through denormalizers that haven't changed.
The trick we found, though, was that we needed another representation of our commits so we could query all the events handled by a given event type - a query that cannot be performed against the normal store.
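A rough sketch of that per-denormalizer replay (the event-type index and all APIs here are illustrative, not NEventStore's):

interface StoredEvent { type: string; payload: unknown }

interface Denormalizer {
  handledTypes: Set<string>; // event types this read model cares about
  handle(event: StoredEvent): void;
}

// Replays a single denormalizer from a secondary index of commits keyed by
// event type, instead of streaming the entire store through every handler.
async function replayOne(
  denormalizer: Denormalizer,
  queryByType: (types: Set<string>) => AsyncIterable<StoredEvent>,
): Promise<void> {
  for await (const event of queryByType(denormalizer.handledTypes)) {
    denormalizer.handle(event);
  }
}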
