How to avoid lambda trigger recursive call - aws-lambda

I've written a lambda function that is triggered via an s3 bucket's putObject event. I am modifying the headers of an object post upload, downloading the object, and reuploading with appropriate headers. But because the function itself uses the putObject to reupload the object, the lambda triggers itself.

Three options:
1. Use a different API to upload your changes than the one your event is bound to. For example, if your lambda is triggered by PUT, use a POST to modify the content afterwards (tough to do, since POST isn't well supported by the SDKs AFAIK, so this may not be an option).
2. Track usage and put a small guard at the beginning of your handler that short-circuits if the only changes made to a file are the ones you made. If you can't programmatically detect the headers you've set, you'll probably need a small DynamoDB table or similar to keep track of which files you've already touched. This lets you abort immediately and be charged only the minimum 100ms fee.
3. Reorganize your project to have an 'ingest' bucket and an output bucket. Unprocessed files are put into the former, modified, and then placed into the latter. This has a number of advantages. The first is that you don't end up with the current situation, so that's a plus. The second is that whatever process consumes these modified files can't accidentally pull an unmodified version. The third is that you get better insight into the process: if something goes wrong, it's easy to see which batches of files have undergone which process.
Overall, I'd recommend option 3 for you, though I know that in my lazier moments I might try to opt for 1 or 2.
Either way, good luck.
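For the guard idea (tracking what you've already touched), here is a rough Python sketch. It assumes the marker can live in the object's user metadata; the names MARKER_KEY, MARKER_VALUE and handle_event are made up for illustration, and the s3 argument is expected to behave like a boto3 S3 client (head_object / copy_object). Treat it as a starting point, not a tested Lambda.

```python
from urllib.parse import unquote_plus

# Hypothetical marker written into the object's user metadata.
MARKER_KEY = "processed-by"
MARKER_VALUE = "header-fixer"


def already_processed(metadata):
    """True if this function has already rewritten the object."""
    return metadata.get(MARKER_KEY) == MARKER_VALUE


def handle_event(event, s3, content_type="application/octet-stream"):
    """Rewrite headers once per object; skip objects that carry the marker.

    `s3` is assumed to be a boto3 S3 client (or anything with the same
    head_object / copy_object call shape).
    """
    rewritten = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])

        head = s3.head_object(Bucket=bucket, Key=key)
        metadata = dict(head.get("Metadata", {}))
        if already_processed(metadata):
            continue  # our own write triggered us: bail out early

        metadata[MARKER_KEY] = MARKER_VALUE
        # Copy the object onto itself, replacing metadata and headers.
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            Metadata=metadata,
            MetadataDirective="REPLACE",
            ContentType=content_type,
        )
        rewritten.append(key)
    return rewritten
```

The second invocation (caused by the function's own write) finds the marker and returns without touching the object. As far as I know, a copy-in-place is reported as an ObjectCreated:Copy event rather than ObjectCreated:Put, so if your trigger is bound to Put only, this may also cover option 1; check that against your bucket's notification settings before relying on it.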

Related

What does data look like when using Event Sourcing?

I'm trying to understand how Event Sourcing changes the data architecture of a service. I've been doing a lot of research, but I can't seem to understand how data is supposed to be properly stored with event sourcing.
Let's say I have a service that keeps track of vehicles transporting packages. The current non relational structure for the data model is that each document represents a vehicle, and has many fields representing origin location, destination location, types of packages, amount of packages, status of the vehicle, etc. Normally this gets queried for information to be read to the front end. When changes are made by the user, the appropriate changes are made to this document in order to update this.
With event sourcing, it seems that a snapshot of every event is stored, but there seem to be a few ways to interpret that:
The first is that the multiple versions of the document I described exist, each a new snapshot every time a change is made. Each event would create a new version of this document and alter it. This is the easiest way for me to wrap my head around it, but I believe this to be incorrect.
Another interpretation I have is that each event stores SPECIFIC information about what's been altered in the document. When the vehicle status changes from On Road to Available, for example, an event specifically for vehicle status changes is triggered. Let's say it's called VehicleStatusUpdatedEvent, and contains the Vehicle ID number, the new status, and the timestamp for this event. So this event is stored and is published to a messaging queue. When picked up from the queue, the appropriate changes are made to the current version of the document. I can understand this, but I think I still have some misconceptions here. My understanding is that event sourcing allows us to have a snapshot of data upon each change, so we can know what it looks like at any point. What I just described would keep a log of changes, but still only have one version of the file, as the events only contain specific pieces of the whole file.
Can someone describe how the data flow and architecture works with event sourcing? Using the vehicle data example I provided might help me frame it better. I feel that I am close to understanding this, but I am missing some fundamental pieces that I can't seem to understand by searching online.
The current non relational structure for the data model is that each document represents a vehicle
OK, let's start from there.
In the data model you've described, storage of a document destroys the earlier copy.
Now imagine that instead we were storing the document in a git repository. Then saving the document would also save metadata, and that metadata would include a pointer to the previous document.
Of course, we've probably got a lot of duplication in that case. So instead of storing the complete document every time, we'll store a patch document (think JSON Patch), and metadata pointing to the original patch.
Take that same idea again, but instead of storing generic patch documents, we use domain specific messages that describe what is going on in terms of the model.
That's what the data model of an event sourced entity looks like: a list of domain specific descriptions of document transformations.
When you need to reconstitute the current state, you start with a state you know (which could be the "null" state of the document before anything happened to it), and replay onto that document all of the patches (events) that have occurred since.
If you want to do a temporal query, the game is the same: you replay the events up to the point in time that you are interested in.
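To make the replay idea concrete, here is a small Python sketch using the vehicle example. VehicleStatusUpdated follows the event named in the question; PackagesLoaded, apply and replay are names invented for illustration, and integer timestamps stand in for real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleStatusUpdated:
    vehicle_id: str
    new_status: str
    timestamp: int


@dataclass(frozen=True)
class PackagesLoaded:  # hypothetical second event type
    vehicle_id: str
    count: int
    timestamp: int


def apply(document, event):
    """Return a new document with one event applied to it."""
    updated = dict(document)
    if isinstance(event, VehicleStatusUpdated):
        updated["status"] = event.new_status
    elif isinstance(event, PackagesLoaded):
        updated["packages"] = updated.get("packages", 0) + event.count
    return updated


def replay(events, until=None, initial=None):
    """Fold the write-ordered events onto a known starting state.

    With `until` set, events after that timestamp are ignored, which
    gives the document as it looked at that point in time.
    """
    document = dict(initial or {})
    for event in events:
        if until is not None and event.timestamp > until:
            break  # assumes the list is in write order
        document = apply(document, event)
    return document


history = [
    PackagesLoaded("v-1", 10, timestamp=1),
    VehicleStatusUpdated("v-1", "On Road", timestamp=2),
    VehicleStatusUpdated("v-1", "Available", timestamp=3),
]
```

Here replay(history) gives the current document, and replay(history, until=2) gives the vehicle as it was while still on the road. Only the list of events is stored; both documents are derived.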
So essentially when referring to an older build, you reconstruct the document using the events, correct?
Yes, that's exactly right.
So is there still a "current status" document or is that considered bad practice?
"It depends". In the general case, there is no current status document; only the write-ordered list of events is "real", and everything else is derived from that.
Conversations about event sourcing often lead to consideration of dedicated message stores for managing persistence of those ordered lists, and it is common that the message stores do not also support document storage. So trying to keep a "current version" around would require commits to two different stores.
At this point, designers typically either decide that "recent version" is good enough, in which case they build eventually consistent representations of documents outside of the transaction boundary... OR they decide current version is important, and look into storage solutions that support storing the current version in the same transaction as the events (ex: using an RDBMS).
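As an illustration of the second choice (current version stored in the same transaction as the events), here is a sketch using SQLite from the Python standard library. The table layout and the names open_store and append_event are assumptions made for the example, and each event is simplified to a dict of changed fields.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the two tables: the event log and the derived document."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "vehicle_id TEXT, seq INTEGER, payload TEXT, "
        "PRIMARY KEY (vehicle_id, seq))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS current ("
        "vehicle_id TEXT PRIMARY KEY, document TEXT)"
    )
    return conn


def append_event(conn, vehicle_id, change):
    """Store the event and the updated document atomically."""
    with conn:  # commits on success, rolls back if anything raises
        row = conn.execute(
            "SELECT document FROM current WHERE vehicle_id = ?",
            (vehicle_id,),
        ).fetchone()
        document = json.loads(row[0]) if row else {}
        seq = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM events WHERE vehicle_id = ?",
            (vehicle_id,),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?)",
            (vehicle_id, seq, json.dumps(change)),
        )
        document.update(change)
        conn.execute(
            "INSERT OR REPLACE INTO current VALUES (?, ?)",
            (vehicle_id, json.dumps(document)),
        )
    return document
```

Because both inserts sit inside one transaction, the "current" row can never disagree with the event log. The events table is still the source of truth; the current table can be dropped and rebuilt from it at any time.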
What is the procedure used to generate the snapshot you want using the events?
If you want to generate a snapshot, then you'll normally end up using a pattern called a projection, to iterate over the events and either fold or reduce them to create the document.
Roughly, you have a function somewhere that looks like
document-with-meta-data = projection(event-history-with-metadata)

Event sourcing - delete event related files

I'd like to store some of my data in relatively big files (a few GBs per file). I'd like to use event sourcing and save events related to these files, e.g. FileCreated: title, description, timestamp, author, personal, encryptionkey, etc. After a while some of the files won't be needed any longer, and they take up a lot of space. So in order to free up space, I need to delete them. Doing so is problematic, because I will have the history in the event storage, but not the file in the filesystem. Is there any way to keep integrity and somehow delete both? Or is there a best practice for this problem?
Since I did not get an answer, I'll try to answer this myself.
It is possible to remove an event from the history: you create a new event storage and filter out the events for the aggregate id you want to get rid of. When you are done, you switch to the new event storage and remove the old one. You probably need to replay the projections as well. So it is very similar to a full migration, and it takes a lot of time. In my case that is not a problem if I need to do it only once a year or so. Another problem with storing this data in the event storage is that I either have to stream it from there or duplicate it in order to serve it. The latter is not always a good solution, because copying sometimes takes too much time, and to save the data you need to stream it anyway, otherwise you will run out of memory very fast. So the event storage should support streaming attachments.
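A sketch of that copy-and-filter migration in Python, assuming events are plain dicts with aggregate_id and type fields (my own layout, not any particular event store's). The generator keeps memory flat because events are passed through one at a time.

```python
def copy_without_aggregate(old_events, removed_aggregate_id):
    """Yield every event except those of the aggregate being purged."""
    for event in old_events:
        if event["aggregate_id"] != removed_aggregate_id:
            yield event


def rebuild_projection(events):
    """Replay the surviving events into a simple id -> title listing."""
    titles = {}
    for event in events:
        if event["type"] == "FileCreated":
            titles[event["aggregate_id"]] = event["title"]
    return titles


old_store = [
    {"aggregate_id": "file-1", "type": "FileCreated", "title": "report"},
    {"aggregate_id": "file-2", "type": "FileCreated", "title": "raw dump"},
    {"aggregate_id": "file-2", "type": "FileRenamed", "title": "dump"},
]

# Build the new storage, then replay the projections against it.
new_store = list(copy_without_aggregate(old_store, "file-2"))
listing = rebuild_projection(new_store)
```

After the switch, nothing in the new storage or in the rebuilt projection refers to file-2 any more.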
Another solution is to keep the relatively big data in files and display something like 404 Not Found, or "the file was removed because of this and that". I see this frequently. In this case it is OK to keep the events in the storage, and you can, for example, add a ContentRemoved event in which you record the cause. Another option is to hide the removed file so that the app no longer lists it; I guess this is common too. This solution has drawbacks as well. Migration is more complex with this approach, because you need to move both the event storage and the files. If you remove a file by accident, you cannot undo it later unless you have the file in a backup. This can be fixed by delaying the actual file removal by a few days, so you can undo it if you change your mind. Another option is to have a trash, so that files are deleted only when the trash is emptied.
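Here is roughly how the ContentRemoved approach with a delayed deletion could look in Python. The event layout, the ContentRestored event and the seven-day GRACE_PERIOD are assumptions for the example.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(days=7)  # assumed delay before the real deletion


def file_states(events):
    """Fold the events into one state record per file."""
    states = {}
    for event in events:
        file_id = event["file_id"]
        if event["type"] == "FileCreated":
            states[file_id] = {"removed_at": None, "cause": None}
        elif event["type"] == "ContentRemoved":
            states[file_id].update(
                removed_at=event["timestamp"], cause=event["cause"]
            )
        elif event["type"] == "ContentRestored":  # hypothetical undo event
            states[file_id].update(removed_at=None, cause=None)
    return states


def visible_files(states):
    """Files the app should still list."""
    return sorted(f for f, s in states.items() if s["removed_at"] is None)


def files_to_purge(states, now):
    """Files whose grace period has passed and can be deleted from disk."""
    return sorted(
        f
        for f, s in states.items()
        if s["removed_at"] is not None and now - s["removed_at"] >= GRACE_PERIOD
    )
```

A periodic job would call files_to_purge and delete only those files from the filesystem. Until then a ContentRestored event is enough to undo the removal, and the history in the event storage stays intact either way.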
I think both solutions are worth considering, and which one is better suited probably depends on the actual project.

libtorrent new piece alerts

I am developing an application that will stream multimedia files over torrents.
The backend needs to serve new pieces to the frontend as they arrive.
I need a mechanism to get notified when new pieces have arrived and been verified. From what I can tell, I could do this using block_finished_alerts. I would keep track of which blocks have arrived for a given piece, and read the piece when all blocks have arrived.
This solution seems kind of roundabout and I was wondering if there was a better way.
What you're asking for is called piece_finished_alert. It's posted every time a new piece completes downloading and passes the hash-check. To read a piece from disk, you may use torrent_handle::read_piece() (and get the result in read_piece_alert).
However, if you want to stream media, you probably want to use torrent_handle::set_piece_deadline() and set the alert_when_available flag so that read_piece_alerts are posted as pieces come in. This will invoke the built-in streaming feature of libtorrent.

write only stream

I'm using the joliver/EventStore library and trying to find a way to get a stream without reading any events from it.
The reason is that I just want to write some events to a specific stream in that store without loading all 10k messages from it.
The way you're expected to use the store is that you always do a GetById first. Even if you new up an Aggregate and Save it, you'll see in the CommonDomain EventStoreRepository that it will first correlate it with the existing data.
The key reason why a read is needed first is that the infrastructure needs to work out how many events have gone before to compute the new commit sequence number.
Regarding the example threshold you cite as the reason for wanting to optimize this away: if you're really going to have that many events, you'll already be in snapshotting territory, as you'll need an appropriately efficient way of doing things other than blind writes too.
Even if you're not intending to lean on snapshotting, half the benefit of using EventStore is that the facility is built in for when you need it.
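To show why a blind write doesn't work and how a snapshot shortens the read, here is a toy in-memory model in Python. It is not the joliver/EventStore API; InMemoryStream, ConcurrencyError and write_after_snapshot are invented names that only mimic the commit-sequence check.

```python
class ConcurrencyError(Exception):
    """Raised when a commit does not carry the next sequence number."""


class InMemoryStream:
    def __init__(self):
        self._commits = []

    def read_from(self, min_sequence=1):
        """Return the commits with sequence >= min_sequence."""
        return self._commits[min_sequence - 1:]

    def append(self, commit_sequence, events):
        """Accept the commit only if it is the next one in order."""
        expected = len(self._commits) + 1
        if commit_sequence != expected:
            raise ConcurrencyError(
                f"expected commit {expected}, got {commit_sequence}"
            )
        self._commits.append((commit_sequence, list(events)))


def write_after_snapshot(stream, snapshot_sequence, new_events):
    """Read only the commits after the snapshot, then append.

    The snapshot tells us how many commits came before it, so the
    read covers just the tail instead of the whole stream.
    """
    tail = stream.read_from(snapshot_sequence + 1)
    next_sequence = snapshot_sequence + len(tail) + 1
    stream.append(next_sequence, new_events)
    return next_sequence
```

With a snapshot taken at commit 10,000, write_after_snapshot reads only what was committed after it, yet still arrives at the right sequence number for the new commit.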

How would you implement a Workflow system?

I need to implement a Workflow system.
For example, to export some data, I need to:
Use an XSLT processor to transform an XML file
Use the resulting transformation to convert into an arbitrary data structure
Use the resulting (file or data) and generate an archive
Move the archive into a given folder.
I started by creating two types of class: Workflow, which is responsible for adding new Step objects and running them.
Each Step implements a StepInterface.
My main concern is that all my steps depend on the previous one (except the first), and I'm wondering what the best way to handle this would be.
I thought of looping over the steps and giving each step the result of the previous one (if any), but I'm not really happy with it.
Another idea would have been to allow a "previous" Step to be set on a Step, like:
$previous = new Step();
$s = new Step();
// Pass the instance; the Step type hint belongs in the method signature.
$s->setPreviousStep($previous);
But I lose the utility of a Workflow class.
Any ideas or advice?
By the way, I'm also concerned about the success or failure of the whole workflow: if any step fails, I need to roll back or clean up the previous data.
I implemented a similar workflow engine last year (closed source, though, so there's no code I can share). Here are a few ideas based on that experience:
StepInterface - can do what you're doing right now - abstract a single step.
Additionally, provide a rollback capability but I think a step should know when it fails and clean up before proceeding further. An abstract step can handle this for you (template method)
You might want to consider branching based on the StepResult - so you could do a StepMatcher that takes a stepResult object and a conditional - its sub-steps are executed only if the conditional returns true.
You could also do a StepException to handle exceptional flows if a step errors out. Ideally, this is something that you can define either at a workflow level (do this if any step fails) and/or at a step level.
I'd taken the approach that a step returns a well defined structure (StepResult) that's available to the next step. If there's bulky data (say a large file etc), then the URI/locator to the resource is passed in the StepResult.
Your workflow is going to need a context to work with - in the example you quote, this would be the name of the file, the location of the archive and so on - so think of a WorkflowContext
Additional thoughts
You might want to consider the following too - if this is something that you're planning to implement as a large scale service/server:
Steps could be in libraries that were dynamically loaded
Workflow definition in an XML/JSON file - again, dynamically reloaded when edited.
Remote invocation and callback - submit a job to a remote service with a callback API. When the remote service calls back, the workflow execution is picked up at the subsequent step in the flow.
Parallel execution where possible etc.
stateless design
Rolling back can be fit into this structure easily, as each Step will implement its own rollback() method, which the workflow can call (in reverse order preferably) if any of the steps fail.
As for the main question, it really depends on how sophisticated you want to get. On a basic level, you can define a StepResult interface, which is returned by each step and passed on to the next one. The obvious problem with this approach is that each step should "know" which implementation of StepResult to expect. For small systems this may be acceptable; for larger systems you'd probably need some kind of configurable mapping framework that can be told how to convert the result of the previous step into the input of the next one. So Workflow will call Step, Step returns StepResult, Workflow then calls StepResultConverter (which is your configurable mapping thingy), StepResultConverter returns a StepInput, Workflow then calls the next Step with StepInput and so on.
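The basic version (each step gets the previous StepResult, and the workflow rolls back in reverse on failure) might look like the sketch below. It is in Python rather than PHP just to keep it short; the class names follow the ones used above, but the method signatures are my own guess at a reasonable shape, and the converter layer is left out.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepResult:
    value: Any = None


class Step:
    def run(self, previous: Optional[StepResult], context: dict) -> StepResult:
        raise NotImplementedError

    def rollback(self, context: dict) -> None:
        """Undo whatever run() did; the default is nothing to undo."""


class Workflow:
    def __init__(self, context=None):
        self.context = dict(context or {})  # shared settings (paths, names)
        self.steps = []

    def add(self, step):
        self.steps.append(step)
        return self

    def run(self):
        completed = []
        result = None
        try:
            for step in self.steps:
                # Each step receives the result of the one before it.
                result = step.run(result, self.context)
                completed.append(step)
        except Exception:
            # Undo the steps that finished, most recent first. The failing
            # step is expected to clean up after itself before raising.
            for step in reversed(completed):
                step.rollback(self.context)
            raise
        return result
```

Usage would be along the lines of Workflow({"target": "/some/folder"}).add(TransformStep()).add(ArchiveStep()).run(), where TransformStep and ArchiveStep are your own Step subclasses. The Workflow class keeps its purpose: steps never need to know about each other, only about the StepResult they are handed.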
I've had great success implementing workflow using a finite state machine. It can be as simple or complicated as you like, with multiple workflows linking to each other. Generally an FSM can be implemented as a simple table where the current state of a given object is tracked in a history table by keeping a journal of the transitions on the object and simply retrieving the last entry. So a transition would be of the form:
nextState = TransLookup(currState, Event, [Condition])
If you are implementing a front end you can use this transition information to construct a list of the events available to a given object in its current state.
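A table-driven version of that lookup could be sketched like this in Python. The states and events are invented for the export example, and trans_lookup, available_events and Journal are illustrative names, not from any library.

```python
def always(obj):
    return True


# (current state, event, condition, next state); the first matching row wins.
TRANSITIONS = [
    ("new", "transform", always, "transformed"),
    ("transformed", "archive", lambda obj: obj.get("size", 0) > 0, "archived"),
    ("transformed", "archive", always, "failed"),
    ("archived", "move", always, "done"),
]


def trans_lookup(curr_state, event, obj=None):
    """Return the next state for (curr_state, event), honouring conditions."""
    for state, name, condition, next_state in TRANSITIONS:
        if state == curr_state and name == event and condition(obj or {}):
            return next_state
    raise ValueError(f"no transition for {event!r} in state {curr_state!r}")


def available_events(curr_state):
    """Events a front end could offer for an object in this state."""
    return sorted({name for state, name, _, _ in TRANSITIONS if state == curr_state})


class Journal:
    """History of transitions; the current state is simply the last entry."""

    def __init__(self, initial="new"):
        self.entries = [(None, initial)]

    @property
    def state(self):
        return self.entries[-1][1]

    def fire(self, event, obj=None):
        next_state = trans_lookup(self.state, event, obj)
        self.entries.append((event, next_state))
        return next_state
```

The journal doubles as an audit trail, and available_events gives the front end its list of actions without any workflow logic living in the UI.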
