How best to backfill data with Google Cloud Workflows - google-workflows

I am getting up to speed on GCP Workflows. What is the best practice for backfilling data with Workflows, and are there any built-in methods like Airflow has? If not, it looks like the best option is to pass in a start_date and keep looping through a subworkflow while incrementing the date.

I believe you can find how to backfill in the official Google documentation on backfilling, though this refers specifically to BigQuery... It's what makes the most sense within the context you're describing, seeing as Workflows is more of a... job scheduler? I guess?
If this is not what you are referring to, then could you please clarify what it is you are attempting to do with GCP Workflows?
I'm asking because, by the definition of workflows found in the overview for Google Workflows, they don't really have associated/built-in methods; they just call the methods that you program them to call.
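If you do go the looping route, one pragmatic variant (not an official Workflows feature) is to drive the backfill from outside: launch one execution per day and hand the date in as a runtime argument, so each daily run is an independent, repeatable unit. Below is a minimal, hedged Python sketch assuming the google-cloud-workflows client library; the project, location and workflow names are placeholders, and the target workflow is assumed to read a run_date argument.

    # Hedged sketch: launch one Workflows execution per day of the backfill
    # window, passing the date as a runtime argument. Assumes the
    # google-cloud-workflows client library; PROJECT/LOCATION/WORKFLOW are
    # placeholders, and the target workflow is assumed to read "run_date".
    import json
    from datetime import date, timedelta

    from google.cloud.workflows import executions_v1

    PROJECT = "my-project"      # placeholder
    LOCATION = "us-central1"    # placeholder
    WORKFLOW = "daily-load"     # placeholder: the workflow to run once per day


    def backfill(start: date, end: date) -> None:
        client = executions_v1.ExecutionsClient()
        parent = f"projects/{PROJECT}/locations/{LOCATION}/workflows/{WORKFLOW}"
        day = start
        while day <= end:
            execution = executions_v1.Execution(
                argument=json.dumps({"run_date": day.isoformat()})
            )
            client.create_execution(request={"parent": parent, "execution": execution})
            day += timedelta(days=1)


    if __name__ == "__main__":
        backfill(date(2024, 1, 1), date(2024, 1, 31))

Whether you fan out like this or loop over a subworkflow inside the workflow definition itself, the useful property is the same: the per-day work takes its date as an explicit parameter, so any single day can be re-run in isolation.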

Related

The best way to schedule/automate MarkLogic data hub flows/custom steps

I use DMSDK to ingest data; I have multiple custom flows to run following data ingestion. Instead of manually running the flows one by one, what is the best way to orchestrate MarkLogic data hub flows?
Gradle, triggers, or other scheduling tools?
I concur with Dave Cassel that NiFi, or perhaps something like MuleSoft, or maybe even Camel is a great way to manage running your flows. Particularly if you are talking about operational management.
To answer on other mechanisms:
Crontab doesn't connect to MarkLogic itself. You'd have to write scripts or code to make something actually happen (see the sketch after this answer). You won't have much control either, nor logging, unless you add that as well.
We have great plugins for Gradle that make running flows really easy. Great during development and such, but perhaps less suited for scheduling or operational tasking.
Triggers inside MarkLogic only respond to insertion of data, so you'd still have to initiate an update from outside anyhow.
Scheduled Tasks inside MarkLogic have similar limitations to Crontab and Gradle. They don't do much by themselves, so you have to write code anyhow. They provide no logging by themselves, nor ways to operationally manage the tasks, other than through the Admin UI.
JAR package might depend on what JAR package you actually mean. You can create a JAR of your ml-gradle project, but that doesn't give you a lot of gain over calling Gradle itself.
Personally, I'd have a close look at the operational requirements. Think of, for instance: the need to get a status overview, interrupt schedules, loops to retry on failure, built-in logging, and facilities to send notifications when attention is needed.
HTH!
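To make the "you'd have to write scripts or code" point concrete, here is a rough Python sketch of the kind of cron-driven wrapper you would end up writing: it shells out to the Data Hub Gradle plugin for each flow and adds the logging that crontab alone won't give you. The hubRunFlow task name and the flow names are assumptions and would need to match your project.

    # Rough sketch of a cron-driven wrapper: run each Data Hub flow in order via
    # Gradle and log the outcome. The "hubRunFlow" task and the flow names are
    # assumptions; adjust to your ml-gradle / Data Hub setup.
    import logging
    import subprocess
    import sys

    logging.basicConfig(filename="flow_runs.log", level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    FLOWS = ["IngestOrders", "HarmonizeOrders"]  # placeholder flow names, in run order


    def run_flow(flow_name: str) -> bool:
        """Invoke the (assumed) hubRunFlow Gradle task for one flow."""
        result = subprocess.run(
            ["./gradlew", "hubRunFlow", f"-PflowName={flow_name}"],
            capture_output=True, text=True)
        if result.returncode != 0:
            logging.error("Flow %s failed: %s", flow_name, result.stderr.strip())
            return False
        logging.info("Flow %s completed", flow_name)
        return True


    if __name__ == "__main__":
        for flow in FLOWS:
            if not run_flow(flow):
                sys.exit(1)  # stop the chain so later flows don't run on bad data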
There are a variety of answers that will work, of course; my preference is NiFi. This keeps any scheduling overhead outside of MarkLogic, with the trade-off that you'll need to have NiFi running.

In Microsoft Dynamics 365 CRM, what is the major difference between plugins and workflows when both serve the same purpose?

Can someone please tell me which of the following has more advantages - plugin or workflow?
As the post Custom Workflows vs Plug-ins in MS CRM seems to be a little outdated, I can share my experiences with you.
Workflows:
Contain logic you provide by only "clicking" together the actions you want to be performed (like Update, Create, etc.)
Can be run "on demand"
Can often be handled by key users and do not need an explicit developer
Should not be used for complicated logic, as the interface often does not provide the possibility to add additional logic afterwards
If used for complicated logic (as stated above), refactoring or changes are often very hard to integrate!
In current cloud organisations you get the information that you SHOULD not use these anymore, but switch to MS Flow (now Power Automate). (VERY IMPORTANT!!)
Plugins:
Custom Code - so you can provide very complicated or also simple server-side logic
You need a(n experienced) developer
Can perform faster than workflows!
Nearly everything you can do with a Workflow can be done by a Plugin (or job), but not vice versa
You have the possibility to trigger the plugin as well as hand in data (parameters!), as you can create your own "Messages". (With this I mean you do not only use Update, Delete, Create, etc. as Messages for Plugins, but you can define your own Message steps by creating "Actions" in the Process section of your Dynamics organization. There you can define input AND output parameters. These custom Messages can also be triggered on demand!!! For instance by using JavaScript; a hedged Web API sketch follows this answer. Guide on how to use/create custom Messages (Actions))
In my experience, Plugins are mostly the better-suited solution if you have (even slightly) complicated matters, as workflows are far less maintainable. Simple "one-liners" can often be covered by workflows.
Nevertheless, each developer/consultant has to suggest his or her own way for the improvement/development of his/her organization.
#Community: Feel free to correct me if I am wrong anywhere or if you have different experiences.
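To illustrate the point above about custom Messages (Actions) being triggerable on demand, here is a hedged Python sketch that invokes an unbound custom Action through the Dataverse Web API. The organization URL, action name, parameter names, and token handling are all placeholders; in a real setup you would obtain the bearer token via Azure AD.

    # Hedged sketch: call a custom Action (custom Message) on demand through the
    # Dynamics 365 / Dataverse Web API. ORG_URL, ACTION, the parameter names and
    # ACCESS_TOKEN are placeholders/assumptions.
    import requests

    ORG_URL = "https://yourorg.crm.dynamics.com"  # placeholder
    ACTION = "new_EscalateCase"                   # placeholder custom Action name
    ACCESS_TOKEN = "<OAuth bearer token>"         # obtained via Azure AD in practice


    def call_custom_action(case_id: str, reason: str) -> dict:
        """POST the Action with its input parameters; returns its output parameters."""
        response = requests.post(
            f"{ORG_URL}/api/data/v9.2/{ACTION}",
            headers={
                "Authorization": f"Bearer {ACCESS_TOKEN}",
                "Content-Type": "application/json",
                "OData-MaxVersion": "4.0",
                "OData-Version": "4.0",
            },
            json={"CaseId": case_id, "Reason": reason},  # assumed input parameters
        )
        response.raise_for_status()
        return response.json() if response.content else {}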

Using AWS API Gateway+Lambda Without It Becoming a Dependency

There's no doubt about the benefits of API Gateway+Lambda for microservices.
My concern is what would happen if we decide to move off API Gateway+Lambda to ECS/Fargate, or even another Cloud.
There seems to be a consensus on using one Lambda function per route/action.
I have some theories about how to design using this approach such that the code can be unplugged from Lambda and plugged in somewhere else.
I would also like to know what others in the community have done to achieve this. Has anyone attempted to move the API off Lambda and was able to successfully do it using XXXX design? What are the lessons there?
The language should not really matter to this discussion, but we are using Python 3.
What you're facing right now has a name. It's called "vendor lock-in". Pretty much nothing you can do about it.
However, I find it useful to treat the AWS Lambda handler function as a controller in your web server. What would you do in your controller? You'd validate incoming data, pass it to the service layer, and then serialize the response from the service and pass it back to API Gateway. Long story short, your handler function should not contain business logic, which makes it easy to migrate even from serverless to servers. It can also be good because it leaves some room for optimization. If you end up seeing that your service layer architecture adds significant time to cold start, then just denormalize it into a single file. It'll work faster, but you'll sacrifice code maintainability. There is no silver bullet; software has always been about trade-offs. :)
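Since the question mentions Python 3, here is a minimal sketch of that thin-handler idea (module and function names are illustrative, not from any real project): the handler only parses and validates the event, delegates to a plain service function that knows nothing about AWS, and shapes the API Gateway proxy response.

    # Minimal sketch of the "handler as controller" idea: keep business logic in a
    # plain service module so only this thin adapter is Lambda/API Gateway specific.
    # The names (order_service, create_order) are illustrative.
    import json

    from myapp import order_service  # plain Python module, no AWS imports


    def create_order_handler(event, context):
        """API Gateway (proxy integration) -> validate -> service layer -> response."""
        try:
            body = json.loads(event.get("body") or "{}")
        except json.JSONDecodeError:
            return _response(400, {"error": "invalid JSON"})

        if "item_id" not in body:
            return _response(400, {"error": "item_id is required"})

        # All business rules live in the service layer, which never sees 'event'.
        order = order_service.create_order(item_id=body["item_id"],
                                            quantity=body.get("quantity", 1))
        return _response(201, order)


    def _response(status_code, payload):
        """Serialize into the shape API Gateway's Lambda proxy integration expects."""
        return {
            "statusCode": status_code,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload),
        }

Moving to ECS/Fargate (or another cloud) then means writing a different thin adapter, e.g. a Flask or FastAPI route, around the same order_service, rather than untangling business logic from the handler.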

Suggestion for scheduling tool(s) for building Hadoop-based data pipelines

Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons for each of them?
I have used Oozie and Airflow in the past for building a data ingestion pipeline using Pig and Hive. Currently, I am in the process of building a pipeline that looks at logs, extracts useful events, and puts them on Redshift.
I found that airflow was much easier to use/test/setup. It has a much cooler UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi or other insights regarding stability and issues are welcome.
Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
Check out the Azkaban CLI project for programmatic job creation. I have an Azkaban example workflows project on GitHub.
Airflow: Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird (see the sketch after this answer).
Luigi: OK UI, workflows are pure Python, requires solid grasp of Python coding and object oriented concepts, hence not suitable for non-programmers.
Oozie: Insane XML based job definitions. Here be dragons. ;-)
IMHO, Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.
Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn't create them), and easier to debug/fix.
When complex actions are needed you want to encapsulate them in a way that either completely succeeds or completely fails.
If you can make it idempotent (running it again creates identical results) then that’s even better.
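To show both the dependency syntax mentioned above and the date-parameterized, re-runnable style that makes backfills safe, here is a small hedged Airflow sketch; the DAG name and callables are placeholders, and details of the imports and parameters vary between Airflow versions.

    # Small hedged Airflow sketch: two tasks chained with >>, each parameterized
    # by the logical run date so re-running a day simply rewrites that day's data.
    # Names are placeholders; syntax assumes Airflow 2.x.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_events(ds, **_):
        # 'ds' is the logical date (YYYY-MM-DD); keying output by it keeps reruns idempotent
        print(f"extracting events for {ds}")


    def load_to_redshift(ds, **_):
        print(f"loading partition {ds} into Redshift")


    with DAG(
        dag_id="log_events_pipeline",      # placeholder name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=True,                      # replays missed days, i.e. a built-in backfill
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_events)
        load = PythonOperator(task_id="load", python_callable=load_to_redshift)

        extract >> load  # the dependency declaration syntax referred to above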
This post will give you an initial idea about different possible workflows
http://bytepawn.com/luigi-airflow-pinball.html

Which workflow engine should I choose for implementing dynamic reconfiguration of workflows?

I want to be able to interrupt a running workflow instance, say when a new activity is about to be invoked, and extract information both about the structure of the workflow and the data in the particular instance. Then I will consult with an external system and according to its response I will possibly alter the behaviour of the workflow. The options I would like to have are addition/removal of activities and altering parameters for the activities to be invoked.
I am currently struggling with which engine is best to go with. I have looked at WWF, Apache ODE, Oracle Workflow and Active BPEL, and as far as I understand they can all provide me with the options I need. I would really appreciate any recommendations on which one will be the easiest to work with for my purpose, and any restrictions any of the above might have that would prevent me from reaching my goal.
Thanks
I am sorry not to directly answer your question, but you may be interested in a state machine framework called Stateless created by Nicholas Blumhardt (AutoFac). I have used this instead of Windows Workflow where I needed to quickly configure my steps for a work flow. I have one configuration file that I alter and can introduce new steps into the workflow quite easily. See my SO answer here for more details.
Essentially you define a state as State<T> and this allows you to persist your state in a database easily.
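Stateless is a .NET library, so the snippet below is not its API; it is just a rough Python analogue of the same idea, a state machine whose states and transitions come from data you can edit, so steps can be added, removed, or rerouted after consulting an external system.

    # Rough Python analogue of the config-driven state machine idea (NOT the
    # Stateless API): transitions are plain data, so the workflow can be
    # reconfigured without touching the engine code.
    TRANSITIONS = {
        # (current_state, trigger) -> next_state; editing this mapping reconfigures the flow
        ("draft", "submit"): "review",
        ("review", "approve"): "done",
        ("review", "reject"): "draft",
    }


    class WorkflowInstance:
        def __init__(self, state: str = "draft"):
            self.state = state  # a single string, easy to persist in a database

        def fire(self, trigger: str) -> str:
            key = (self.state, trigger)
            if key not in TRANSITIONS:
                raise ValueError(f"{trigger!r} is not allowed in state {self.state!r}")
            self.state = TRANSITIONS[key]
            return self.state


    wf = WorkflowInstance()
    wf.fire("submit")   # draft -> review
    wf.fire("approve")  # review -> done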
