Gradle Support for GCP Dataflow Templates?

According to Google's Dataflow documentation, Dataflow job template creation is "currently limited to Java and Maven." However, the Java documentation across GCP's Dataflow site is messy, to say the least. The 1.x and 2.x versions of Dataflow are pretty far apart in terms of details, and I have some specific code requirements that lock me into the 2.0.0r3 codebase, so I'm pretty much required to use Apache Beam. Apache is -- understandably -- quite dedicated to Maven, but institutionally my company has thrown the bulk of its weight behind Gradle, so much so that it migrated all of its Java projects over last year and has pushed back against re-introducing Maven.
However, now we seem to be at an impasse: we have a specific goal of centralizing a lot of our back-end data gathering in GCP's Dataflow, and GCP Dataflow doesn't appear to have formal support for Gradle. If it does, it's not in the official documentation.
Is there a sufficient technical basis to build Dataflow templates with Gradle, and the issue is simply that Google's docs haven't been updated to reflect it? Or is there a technical reason Gradle can't do what's being done with Maven? Is there a better guide for working with GCP Dataflow than the docs on Google's and Apache's websites? I haven't worked with Maven archetypes before, and all the searches I've done for "gradle archetypes" turn up results that are, at best, over a year old. Most of the information points to forum discussions from 2014 and Gradle 1.7rc3, but we're on Gradle 3.5. This feels like it ought to be a solved problem, but for the life of me I can't find any current information on it online.

Commandline to Run Cloud Dataflow Job With Gradle
Generic Execution
$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MyPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow" -Pdataflow-runner
Specific Example
$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MySpannerPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow --spannerInstanceId=fooInstance --spannerDatabaseId=barDatabase" -Pdataflow-runner
Explanation of Commandline
gradle clean execute uses the execute task, which allows us to easily pass command-line flags to the Dataflow pipeline. The clean task removes cached builds.
-DmainClass= specifies the Java main class, since we have multiple pipelines in a single folder. Without this, Gradle doesn't know what the main class is or where to pass the args. Note: your build.gradle file must include task execute, per below.
-Dexec.args= specifies the execution arguments, which will be passed to the pipeline. Note: your build.gradle file must include task execute, per below.
--runner=DataflowRunner and -Pdataflow-runner ensure that the Google Cloud Dataflow runner is used and not the local DirectRunner. (Note that the build.gradle below doesn't actually branch on the dataflow-runner property; see the sketch after the explanation of build.gradle for one way to make it meaningful.)
--spannerInstanceId= and --spannerDatabaseId= are just pipeline-specific flags; your pipeline likely won't have them.
build.gradle contents (NOTE: You need to populate your specific dependencies)
apply plugin: 'java'
apply plugin: 'maven'
apply plugin: 'application'
group = 'com.foo.bar'
version = '0.3'
mainClassName = System.getProperty("mainClass")
sourceCompatibility = 1.8
targetCompatibility = 1.8
repositories {
    maven { url "https://repository.apache.org/content/repositories/snapshots/" }
    maven { url "https://repo.maven.apache.org/maven2" }
}
dependencies {
    compile group: 'org.apache.beam', name: 'beam-sdks-java-core', version: '2.5.0'
    // Insert your build deps for your Beam Dataflow project here
    runtime group: 'org.apache.beam', name: 'beam-runners-direct-java', version: '2.5.0'
    runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version: '2.5.0'
}
task execute(type: JavaExec) {
    main = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    systemProperties System.getProperties()
    // Default to an empty string so the task doesn't fail when exec.args is omitted
    args System.getProperty("exec.args", "").split()
}
Explanation of build.gradle
We use the execute task (type: JavaExec) in order to easily pass runtime flags into the Java Dataflow pipeline program. For example, we can specify the main class (since we have more than one pipeline in the same folder) and we can pass specific Dataflow arguments (i.e., specific PipelineOptions).
The line of build.gradle that reads runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version: '2.5.0' is very important. It provides the Cloud Dataflow runner that allows you to execute pipelines on Google Cloud Platform.
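Note that as written above, both runners are always on the runtime classpath, so -Pdataflow-runner on the command line doesn't by itself change anything in this particular build.gradle. If you want that project property to actually control whether the Dataflow runner is pulled in, a minimal sketch (my own suggestion, not from the original answer) would be:

dependencies {
    compile group: 'org.apache.beam', name: 'beam-sdks-java-core', version: '2.5.0'
    runtime group: 'org.apache.beam', name: 'beam-runners-direct-java', version: '2.5.0'
    // Only add the Cloud Dataflow runner when the build is invoked with -Pdataflow-runner
    if (project.hasProperty('dataflow-runner')) {
        runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version: '2.5.0'
    }
}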

There's absolutely nothing stopping you from writing your Dataflow application/pipeline in Java and using Gradle to build it.
Gradle will simply produce an application distribution (e.g. ./gradlew clean distTar), which you then extract and run with the --runner=TemplatingDataflowPipelineRunner --dataflowJobFile=gs://... parameters.
It's just a runnable Java application.
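For example, a minimal sketch of that flow (the distribution name, GCS path, and project ID are placeholders, and this assumes the 1.x-era TemplatingDataflowPipelineRunner referenced above):

$ ./gradlew clean distTar
$ tar -xf build/distributions/my-pipeline.tar -C build/distributions
$ ./build/distributions/my-pipeline/bin/my-pipeline --runner=TemplatingDataflowPipelineRunner --dataflowJobFile=gs://my-bucket/templates/MyTemplate --project=my-project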
The template and all the binaries will then be uploaded to GCS, and you can execute the pipeline through the console, CLI or even Cloud Functions.
You don't even need to use Gradle. You could just run it locally and the template/binaries will be uploaded. But I'd imagine you are using a build server like Jenkins.
Maybe the Dataflow docs should read "Note: Template creation is currently limited to Java", because this feature is not available in the Python SDK yet.

Update: 7th December 2020
We can stage Dataflow templates using Gradle as well.
To stage a template, the mandatory parameters are:
project
region
gcpTempLocation (good to have if you don't have bucket-create access; if not given, one will be created automatically)
stagingLocation
templateLocation
Here is the sample command line in gradle:
gradle clean execute -DmainClass=com.something.mainclassname -Dexec.args="--runner=DataflowRunner --project=<project_id> --region=<region_name> --gcpTempLocation=gs://bucket/somefolder --stagingLocation=gs://bucket/somefolder --templateLocation=gs://bucket/somefolder"
Assumptions:
GOOGLE_APPLICATION_CREDENTIALS environment variable is set with a service account key.
Gradle is installed.
JAVA_HOME environment variable is set.
Bare-minimum dependencies are added:
compile 'org.apache.beam:beam-sdks-java-core:2.22.0'
compile 'org.apache.beam:beam-sdks-java-io-google-cloud-platform:2.22.0'
compile 'org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:2.22.0'
compile 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.22.0'
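For reference, on newer Gradle versions (7+), where implementation replaces the removed compile configuration, the same coordinates would sit in a dependencies block like this (a sketch; the versions are copied from the list above):

dependencies {
    implementation 'org.apache.beam:beam-sdks-java-core:2.22.0'
    implementation 'org.apache.beam:beam-sdks-java-io-google-cloud-platform:2.22.0'
    implementation 'org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:2.22.0'
    implementation 'org.apache.beam:beam-runners-google-cloud-dataflow-java:2.22.0'
}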

Related

Using Gradle for packing multiple artifacts from Artifactory

I have multiple non-Java artifacts stored in Artifactory that I would like to pack into a single zip/tar file.
I tried using Gradle for this with the Gradle Artifactory Plugin (https://www.jfrog.com/confluence/display/RTF/Gradle+Artifactory+Plugin) applied via the "plugins" notation. I created a separate configuration and started fighting with how to get those dependencies into one archive. This is where I started doubting whether Gradle is a good tool for the job. If it isn't, can you recommend something? If it is a good tool, where can I find an example of how to accomplish this?
I was thinking of something more advanced than a Bash script, so that it leaves good room for future extensions.
If you have all of those artifacts in a single location (same folder / path) in Artifactory, you can use the "Retrieve Folder or Repository Archive" REST API.
If you would like to stick with Gradle, I found the following in the Gradle documentation that might assist you:
task zip(dependsOn: jar, type: Zip) {
    from { configurations.runtime.allArtifacts.files }
    into(project.name + '-' + project.version)
}
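If the goal is to zip up artifacts fetched from Artifactory rather than artifacts this project builds, one approach is a custom configuration that is resolved and repackaged. A minimal sketch, where the configuration name, coordinates, and repository URL are all hypothetical placeholders:

// Custom configuration to hold the artifacts we want to package
configurations {
    bundled
}
repositories {
    maven { url 'https://artifactory.example.com/artifactory/libs-release' } // placeholder URL
}
dependencies {
    bundled 'com.example:artifact-one:1.0@zip' // placeholder coordinates
    bundled 'com.example:artifact-two:2.3@tar'
}
// Resolving the configuration downloads the files; Zip repackages them into one archive
task packageArtifacts(type: Zip) {
    from configurations.bundled
    into(project.name + '-' + project.version)
}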

Conditionally ordering tasks in Gradle

Consider a Gradle plugin that adds three tasks to a project - a buildZip task to create a distributable zip of the project, a publishZip task to publish that zip to a shared repository, and a cleanZip task to clean up any local version of the zip. For local development, cleanZip buildZip will be used frequently, but the automated build system will be running buildZip publishZip cleanZip.
One of the projects in which this plugin is being used wants to run their build using Gradle's parallel flag to allow the different parts of the project to be built in parallel. Unfortunately, this runs into a problem with the zip tasks - buildZip depends on the project actually building, but cleanZip doesn't have any dependencies so it can run right away, leading to the automated build system not being able to clean up.
Declaring any dependencies between these tasks isn't a good idea because they should be able to be run separately. Also, I can't specify mustRunAfter (at least between buildZip and cleanZip) because sometimes clean should be first and sometimes build should be first.
How can I tell Gradle what order to run these tasks in, in a way that will be honored by --parallel and isn't hardcoded to have a particular one always run before the other?
What you can do is detect whether Gradle was run with --parallel and, based on that, configure the ordering between tasks appropriately. It can be done in the following way:
println project.gradle.startParameter.parallelProjectExecutionEnabled
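Building on that, here is a minimal sketch of using the flag to impose ordering only for parallel (CI) builds, assuming the three tasks from the question already exist; the exact constraint chosen is an assumption based on the CI sequence buildZip publishZip cleanZip:

if (gradle.startParameter.parallelProjectExecutionEnabled) {
    // mustRunAfter only constrains ordering when both tasks are scheduled in the
    // same invocation, and it doesn't create a dependency between them, so local
    // 'cleanZip buildZip' runs without --parallel keep their requested order
    cleanZip.mustRunAfter buildZip, publishZip
}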

Makefile to gradle conversion for golang application

I have a Go application which exposes a REST API and logs information to a DB. I am trying to convert the Makefile to a Gradle build. Is there any default way to do this, similar to the maven2gradle plugin, or should the Gradle build file be written manually? I checked the syntactical differences between Gradle and Makefiles, but I'm still not clear on passing runtime arguments through Gradle similar to:
run: build
	./hello -conf=/apps/content/properties/prop.json -v=0 -logDest="FILE" -log_dir="/var/log/logdir"
hello is my executable and the others are the runtime arguments. This is my first attempt at migrating Make to Gradle, and I couldn't find any clear documentation. Please help.
As far as I have checked, there is no direct plugin that could do this task. As a workaround, the build execution could be written as separate tasks in Gradle and ordered accordingly. The tasks here would cover setting the Go path, installing dependencies, and building the application, each run as a command-line process from Gradle. Gradle provides support for running command-line processes, as described in the Gradle documentation. Hope it helps.
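To make that concrete, here is a minimal sketch of Gradle Exec tasks mirroring the Makefile above (the go build invocation and task names are assumptions; the run arguments are copied from the Makefile):

// Build the Go binary by shelling out to the Go toolchain
task buildGo(type: Exec) {
    commandLine 'go', 'build', '-o', 'hello'
}
// Equivalent of the Makefile's "run: build" target
task runGo(type: Exec) {
    dependsOn buildGo
    commandLine './hello',
            '-conf=/apps/content/properties/prop.json',
            '-v=0',
            '-logDest=FILE',
            '-log_dir=/var/log/logdir'
}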

How to load a specific build.gradle/gradle.properties for default build process

I have three build.gradle files with different names under the same directory:
dev.build.gradle
uat.build.gradle
prd.build.gradle
I have 4 issues:
1. "gradle build" will just use build.gradle to start the Java plugin build task, but "gradle -b dev.build.gradle" will not start the Java plugin build task.
2. gradle --help doesn't seem to have an option for loading a specific gradle.properties.
3. There is an alternative: creating three directories (dev, uat, prd) under the project root, putting the corresponding build.gradle version in each, and starting the Java plugin build process from there. I don't like this, because I just want the build.gradle or gradle.properties files in the same directory.
4. How do I copy files in Gradle without explicitly specifying the task name on the command line (gradle build copy)?
ad 1. The correct command is gradle -b dev.build.gradle build.
ad 2. If you want to use properties files other than gradle.properties, you'll have to load them on your own (e.g. using the java.util.Properties class; see the sketch below). There is also a third-party properties plugin.
ad 3. This doesn't seem to be a question.
ad 4. You should turn this into a separate question.
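For ad 2, a minimal sketch of loading an environment-specific properties file by hand with java.util.Properties (the file-naming scheme and the env project property are hypothetical):

// Pick a properties file based on -Penv=dev|uat|prd, defaulting to dev
def envName = project.hasProperty('env') ? project.env : 'dev'
def propsFile = file("${envName}.gradle.properties")
if (propsFile.exists()) {
    def props = new Properties()
    propsFile.withInputStream { props.load(it) }
    // Expose each entry as an extra project property
    props.each { key, value -> project.ext.set(key as String, value) }
}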

how to prevent gradle from downloading dependencies

We would like to have a script that does "svn update", and if the dependency.gradle file is in that list of updates, we would like to run a task that ONLY updates dependencies so the developer's machine is up to date. What would that task be? I don't see it when running "gradle tasks". I'm looking for an updatejars or something.
When we build our project, we don't want it to check for jar updates at all!!!! Mostly because that only needs to be done in 2 situations: #1 above, and when someone is updating the dependency.gradle file themselves. For the second situation, they can just run "gradle updatejars", once I know the answer to question #1, that is.
Any ideas? I am just getting into Gradle, and we really want to keep a consistent environment where our update script gets the source code AND the jars in one atomic sweep, and we are no longer bothered by checking the repositories on every build.
It would be nice to know how to do this by changing the build.gradle file if possible. If not, is there a command-line option? (I'd prefer the build.gradle route, since there I could say that compile does not depend on downloading jars.)
Regarding the second question. As far as I understand, Gradle will not attempt to do remote lookups or try to download the jar if it is already in the local cache. This should be true for jars declared with a static version, e.g. testCompile 'junit:junit:4.10'.
If you have dynamic versions, e.g. 1.+ or 1.0-SNAPSHOT, etc. then Gradle has to do a check every now and then. You can fine tune the cache expiry for such dependencies.
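For example, a minimal sketch of tuning those cache timeouts (the durations are arbitrary):

configurations.all {
    // How long to trust a resolved dynamic version such as 1.+
    resolutionStrategy.cacheDynamicVersionsFor 24, 'hours'
    // How long to trust a changing module such as a -SNAPSHOT
    resolutionStrategy.cacheChangingModulesFor 4, 'hours'
}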
To make sure Gradle does not do remote lookups, you can also use the --offline option. See this doc for details.
With regard to svn update, you have at least 3 options:
Try to use an SvnKit plugin for Gradle
Use the ant svn task (here's how to do svn checkout)
Run an external command from Gradle. Use the ExecPlugin or just implement it yourself using the Groovy API.
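For that third option, a minimal sketch of wrapping the svn update in a Gradle task (assuming the svn client is on the PATH):

// Shell out to the Subversion command-line client
task svnUpdate(type: Exec) {
    commandLine 'svn', 'update'
}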
Looks like the 1st question I can handle with the answer in this post: how to tell gradle to download all the source jars. So I can just run gradle eclipse and it will download new jars and update my classpath... nice.
