The best way to schedule||automate MarkLogic data hub flows/custom steps - gradle

I use DMSDK to ingest data; I have multiple custom flows to run following data ingestion. Instead of manually running the flows one by one, What is the best way to orchestrate MarkLogic data hub flows?
gradle, trigger or other scheduling tools?

I concur with Dave Cassel that NiFi, or perhaps something like MuleSoft, or maybe even Camel is a great way to manage running your flows. Particularly if you are talking about operational management.
To answer on other mechanisms:
Crontab doesn't connect to MarkLogic itself. You'd have to write scripts or code to make something actually happen. You won't have much control either, nor logging, unless you add that as well.
We have great plugins for Gradle that make running flows real easy. Great during development and such, but perhaps less suited for scheduling or operational tasking.
Triggers inside MarkLogic only respond to insertion of data, so you'd still have to initiate an update from outside anyhow.
Scheduled Tasks inside MarkLogic has similar limitations to Crontab and Gradle. It doesn't do much by itself, so you have to write code anyhow. It provides no logging by itself, nor ways to operationally manage the tasks, other than through Admin ui.
JAR package might depend on what JAR package you actually mean. You can create a JAR of your ml-gradle project, but that doesn't give you a lot of gain over calling Gradle itself.
Personally, I'd have a close look at the operational requirements. Think of for instance: need to get status overview, interrupt schedules, loops to retry at failure, built-in logging, and facilities to send notifications when attention is needed.
HTH!

There are a variety of answers that will work, of course; my preference is NiFi. This keeps any scheduling overhead outside of MarkLogic, with the trade-off that you'll need to have NiFi running.

Related

Schema deployment management for Athena

In order to apply devops principles to data (ugh, dataops!), things like continuous deployment need to be considered.
Hence why tools like dbDeploy exist. However dbDeploy seems to have been orphaned and is not maintained any more. In the past i've used this tool again and again, but I don't see much support for it, and I'm not sure why?
So i'm wondering just what do people use to manage and version their schemas. In particular i'm looking for something that will work with Athena (But this has a jdbc driver, so in theory any jdbc compliant tool)
I know one answer may be to switch mindset, and use the AWS Glue crawlers instead. But do people actually do that? Or are the crawlers more for POC/Quick start situations? I'm pretty sure you'll always want to override decisions the crawler makes, so how can that be handled?

Suggestion for scheduling tool(s) for building hadoop based data pipelines

Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons for each of them?
I have used oozie and airflow in the past for building a data ingestion pipeline using PIG and Hive. Currently, I am in the process of building a pipeline that looks at logs and extracts out useful events and puts them on redshift.
I found that airflow was much easier to use/test/setup. It has a much cooler UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi or other insights regarding stability and issues are welcome.
Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
Check out the Azkaban CLI project for programmatic job creation. I have an Azkaban example workflows project on GitHub.
Airflow: Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird.
Luigi: OK UI, workflows are pure Python, requires solid grasp of Python coding and object oriented concepts, hence not suitable for non-programmers.
Oozie: Insane XML based job definitions. Here be dragons. ;-)
IMHO, Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.
Simpler pipelines are better than complex pipelines: Easier to create, easier to understand (especially when you didn’t create) and easier to debug/fix.
When complex actions are needed you want to encapsulate them in a way that either completely succeeds or completely fails.
If you can make it idempotent (running it again creates identical results) then that’s even better.
This post will give you an initial idea about different possible workflows
http://bytepawn.com/luigi-airflow-pinball.html

Marathon vs Aurora and their purposes

Both Marathon and Aurora are built on Mesos and supposedly are engineered for running long running services. My questions are:
What are their differences? I have struggled in finding any good explanations regarding their key differences
Do these frameworks run anything that runs on Linux? For Marathon they state that it can run anything that "is executable in a shell" but this is sort of vague :)
Thanks!
Disclaimer: I am the VP of Apache Aurora, and have been the tech lead of the Aurora team at Twitter for ~5 years. My likely-biased opinions are my own and do not necessarily represent those of Twitter or the ASF.
Do these frameworks run anything that runs on Linux? For Marathon they
state that it can run anything that "is executable in a shell" but
this is sort of vague :)
Essentially, yes. Ultimately these systems are sophisticated machinery to execute shell code somewhere in a cluster :-)
What are their differences? I have struggled in finding any good
explanations regarding their key differences
Aurora and Marathon do indeed offer similar feature sets, both being classified as "service schedulers". In other words, you hand us instructions for how to run your application servers, and we do our best to keep them up.
I'll offer some differences in broad strokes. When it comes to shortcomings mentioned in each, I think it's safe to say that the communities are aware and intend to fix them.
Ease of use
Aurora is not easy to install. It will likely feel like you are trailblazing while setting it up. It exposes a thrift API, which means you'll need a thrift client to interact with it programmatically (a REST-like API is coming, but is vaporware at the moment), or use our command line client. Aurora has a DSL for configuration which can be daunting, but allows you to easily share templates and common patterns as you use the system more.
Marathon, on the other hand, helps you to run 'Hello World' as quickly as possible. It has great docs to do this in many environments and there's little overhead to get going. It has a REST API, making it easier to adapt to custom tools. It uses JSON for configuration, which is easy to start with but more prone to cargo culting.
Targeted use cases
Aurora has always been designed to handle a large engineering organization. The clusters at Twitter have tens of thousands of machines and hundreds of engineers using them. It is critical to Twitter's business. As a result, we take our requirements of scale, stability, and security very seriously. We make sure to only condone features that we believe are trustworthy at scale in production (for example, we have our Docker support labeled as beta because of known issues with Docker itself and the Mesos-Docker integration). We also have features like preemption that make our clusters suitable for mixing business-critical services with prototypes and experiments.
I can't make any claim for or against Marathon's scalability. On the feature front, Marathon has build out features quickly, but this can feel bleeding edge in practice (Docker support is a good example). This is not always due to Marathon itself, but also layers down the stack. Marathon does not provide preemption.
Ownership
To some, ownership and governance of a project is important. It feel that in practice it does not define the openness of a project, but for some people/companies the legal fine print can be a deal-breaker.
Marathon is owned by a company (Mesosphere)
To some, this is beneficial, to others is is not. It means that you can pay for support and features. It also means that there is something to be sold, and the project direction is ultimately decided by Mesosphere's interests.
Aurora is owned by the Apache Software Foundation
This means it is subject to the governance model of the ASF, driven by the community. Aurora does not have paying customers, and there is not currently a software shop that you can pay for development.
tl;dr If you are just getting your feet wet with running services on Mesos, I would suggest Marathon as your first port of call. It will be easier for you to get running and poke around the ecosystem. If you are forming the 'private cloud strategy' for a company, I suggest seriously considering Aurora, as it is proven and specifically designed for that.
So I've been evaluating both and this is my summary.
Aurora
[+] also handles recurring jobs
[+] finer grained, extensive file-based configuration
[+] has namespaces so multiple environments can co-exist
[-] read-only UI, no official API
[~] file based configuration and cli based execution brings overhead (which can be justified with more extensive feature set)
Marathon
[+] very easy to setup and use
[+] UI that provides control and extensive API (even with features missing from UI at the moment)
[+] event bus to listen in on api calls
[-] handles only long-running jobs
[-] does not have separate deployment-run-cleanup steps, these if necessary need to be combined in a script of one-liner
Even though Aurora has better capabilities, I prefer Marathon due to Auroras complexity/overhead and lack of UI (for control) & API
I have more experience with Marathon.
Ideological:
Marathon is a relatively tested product that is used in production at AirBnB. Aurora is an early Apache project (so YMMV).
Both are open source and active. Feel free to contribute pull requests or file issues!
Technical:
Marathon doesn't schedule batch tasks or cron jobs
Marathon has a friendly UI and better health indicators (in 0.8.x)
In regards to your second question, you can run any command or docker container, and Mesos will do the resource isolation for you. If you have 50% CentOS nodes and 50% Ubuntu nodes and you run a task that executes apt-get, the task will have a 50% chance of failure. Mesos and Marathon have no awareness of the actual machines.
Disclaimer: I don't have hands-on experience with Aurora, only with Marathon.
ad Q1: In a nutshell Apache Aurora is capable of doing what Marathon + Chronos can provide, that is, schedule both long-running services and recurring (batch) jobs; see also Aurora user guide.
ad Q2: Yes, anything. Currently based on cgroups and Docker but hey, you can roll your own.

Remote Execution in Ruby (Capistrano or MCollective) to collect cloud server performance metrics

I am looking for a way to collect data remotely from various cloud instances (EC2, Rackpsace). The Rackspace API provides no way for collecting server performance metrics (ie load average, cpu usage, memory) via it's API, otherwise this would have never been asked.
I started looking at solutions like Capistrano or Mcollective (I have also considered collectd), but I am unsure of which one would best suit my application. I am trying to avoid using ssh keys for trending purposes (I don't want to have to keep logging in to collect these metrics) The script I am writing is a Ruby script which reboots a cloud server if it's load average is over a certain number. Because these providers don't expose these metrics via their API, I am looking at a way to gather them myself, and I am new to the Ruby community so after briefing over the documentation for all of these tools, I still haven't been able to get a sense of which framework would work best, or if there are other alternatives.
It sounds like Capistrano is more suited to be a deployment tool, although it can perform remote tasks, so after I read the documentation for that it was pretty much out for the purposes of my script.
MCollective looks really attractive for what I am trying to do but it seems I would have to write my own RPC style plugin for this purpose.
I've also considered plugging into some greater monitoring system such as Nagios, Munin, Zenoss, Hyperic, etc, but I'd rather not install some large bulk monitoring system when all I want to collect is but a few simple metrics.
If your intention is to trigger certain actions based on the system performance (like restarting when cpu usage is too high), you should check out god.
I'm not sure if this is also useful when you want to generate some performance statistics over a longer time period. Personally, I'm using Munin for this, but if you don't like it maybe you can find something on Ruby Toolbox | Server Monitoring.

Which workflow engine should I chose for implementing a dynamic reconfiguration of workflows?

I want to be able to interrupt a running workflow instance, say when a new activity is about to be invoked, and extract information both about the structure of the workflow and the data in the particular instance. Then I will consult with an external system and according to its response I will possibly alter the behaviour of the workflow. The options I would like to have are addition/removal of activities and altering parameters for the activities to be invoked.
I am currently struggling with the engine it's best to go with. I have looked at WWF, Apache ODE, Oracle Workflow and Active BPEL and as far as I understand they can all provide me with the options I need. I would really appreciate any recommendations on which one will be the easiest to work with for my purpose and any restrictions either of the above might have that would prevent me from reaching my goal.
Thanks
I am sorry not to directly answer your question, but you may be interested in a state machine framework called Stateless created by Nicholas Blumhardt (AutoFac). I have used this instead of Windows Workflow where I needed to quickly configure my steps for a work flow. I have one configuration file that I alter and can introduce new steps into the workflow quite easily. See my SO answer here for more details.
Essentially you define a state as State<T> and this allows you to persist your state in a database easily.

Resources