Idiomatic approach to a Go plugin-based system

I have a Go project I would like to open source but there are certain elements which are not suitable for OSS, e.g. company specific logic etc.
I have conceived of the following approach:
Interfaces are defined in the core repository.
Plugins can then be standalone repositories whose types implement the interfaces defined in core. This allows the plugins to be housed in completely separate modules and therefore have their own CI jobs etc.
Plugins are compiled into the final binary via symlinks.
This would result in a directory structure something like the following:
|- $GOPATH
   |- src
      |- github.com
         |- jabclab
         |  |- core-system
         |     |- plugins <-----------|
         |- xxx                       |
         |  |- plugin-a ------------->| ln -s
         |- yyy                       |
            |- plugin-b ------------->|
With an example workflow of:
$ go get git@github.com:jabclab/core-system.git
$ go get git@github.com:xxx/plugin-a.git
$ go get git@github.com:yyy/plugin-b.git
$ cd $GOPATH/src/github.com
$ ln -s ./xxx/plugin-a/*.go ./jabclab/core-system/plugins
$ ln -s ./yyy/plugin-b/*.go ./jabclab/core-system/plugins
$ cd jabclab/core-system
$ go build
The one issue I'm not sure about is how to make the types defined in plugins available at runtime in core. I'd rather not use reflect but can't think of a better way at the moment. If I was doing the code in one repo I would use something like:
package plugins

type Plugin interface {
    Exec(chan<- string) error
}

// Registry must be initialized; otherwise the init() functions below would
// panic when assigning into a nil map.
var Registry = make(map[string]Plugin)

// plugin_a.go
func init() { Registry["plugin_a"] = PluginA{} }

// plugin_b.go
func init() { Registry["plugin_b"] = PluginB{} }
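The core would then consume the registry at runtime with something like the sketch below (assuming the plugins package lives at github.com/jabclab/core-system/plugins; the channel handling is illustrative only):

package main

import (
    "fmt"
    "log"

    "github.com/jabclab/core-system/plugins"
)

func main() {
    out := make(chan string)
    done := make(chan struct{})
    // Drain plugin output concurrently so plugins never block on the channel.
    go func() {
        for msg := range out {
            fmt.Println(msg)
        }
        close(done)
    }()
    for name, p := range plugins.Registry {
        if err := p.Exec(out); err != nil {
            log.Printf("plugin %q failed: %v", name, err)
        }
    }
    close(out)
    <-done
}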
In addition to the above question, would this overall approach be considered idiomatic?

This is one of my favorite issues in Go. I have an open source project that has to deal with this as well (https://github.com/cpg1111/maestrod); it has pluggable DB and runtime (Docker, k8s, Mesos, etc.) clients. Prior to the plugin package that is in the master branch of Go (so it should be coming to a stable release soon), I just compiled all of the plugins into the binary and let configuration decide which to use.
As of the plugin package (https://tip.golang.org/pkg/plugin/), you can use dynamic linking for plugins: loading works similarly to C's dlopen(), and the behavior of Go's plugin package is pretty well outlined in the documentation.
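A minimal sketch of loading such a plugin with the standard library plugin package; the file name plugin_a.so and the exported symbol name Plugin are placeholders, and the plugin itself would need to be built with go build -buildmode=plugin:

package main

import (
    "fmt"
    "log"
    "plugin"
)

func main() {
    // Open a previously built plugin binary (path is a placeholder).
    p, err := plugin.Open("plugin_a.so")
    if err != nil {
        log.Fatal(err)
    }
    // Look up an exported symbol by name; the plugin decides what it exports.
    sym, err := p.Lookup("Plugin")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("loaded symbol of type %T\n", sym)
}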
Additionally I recommend taking a look at how Hashicorp addresses this by doing RPC over a local unix socket. https://github.com/hashicorp/go-plugin
The added benefit of running a plugin as a separate process, as Hashicorp's model does, is stability: if the plugin fails, the main process can handle that failure and keep running.
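This is not Hashicorp's actual go-plugin API, but a sketch of the underlying idea using only the standard library: the plugin runs as its own process and serves RPC over a local unix socket that the host dials (the socket path and the method are made up for illustration):

package main

import (
    "log"
    "net"
    "net/rpc"
)

// PluginService is the RPC surface a plugin process could expose.
type PluginService struct{}

// Exec is a stand-in for whatever work the plugin performs.
func (p *PluginService) Exec(args string, reply *string) error {
    *reply = "handled: " + args
    return nil
}

func main() {
    if err := rpc.Register(&PluginService{}); err != nil {
        log.Fatal(err)
    }
    l, err := net.Listen("unix", "/tmp/plugin-a.sock") // placeholder socket path
    if err != nil {
        log.Fatal(err)
    }
    defer l.Close()
    // The host process would rpc.Dial("unix", "/tmp/plugin-a.sock") and call
    // "PluginService.Exec"; if this process crashes, the host only sees a
    // failed RPC call rather than going down itself.
    rpc.Accept(l)
}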
I should also mention that Docker does its plugins in Go similarly, except Docker uses HTTP instead of RPC. Additionally, a Docker engineer has posted in the past about embedding a JavaScript interpreter for dynamic logic: http://crosbymichael.com/category/go.html.
The issue I wanted to point out with the sql package's pattern mentioned in the comments is that it isn't really a plugin architecture: you're still limited to whatever is in your imports. You could have multiple main.go files, but that's not a plugin; the point of a plugin is that the same program can run either one piece of code or another. What you have with things like the sql package is flexibility, where a separate package determines which DB driver to use. Nonetheless, you end up modifying code to change which driver you are using.
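For reference, a minimal sketch of that sql-package pattern (the driver registers itself in its init() via the blank import, but switching drivers still means editing the import and the driver name, so it is compile-time flexibility rather than a runtime plugin; the DSN is a placeholder):

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql" // registers the "mysql" driver in its init()
)

func main() {
    db, err := sql.Open("mysql", "user:password@/dbname") // placeholder DSN
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    if err := db.Ping(); err != nil {
        log.Fatal(err)
    }
}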
I want to add that with all of these plugin patterns, aside from compiling everything into the same binary and using configuration to choose, each plugin can have its own build, test and deployment (i.e. its own CI/CD), though it doesn't have to.

Related

Downloading data from azure storage explorer using dvc

I have an Azure blob container with data which I have not uploaded myself. The data is not locally on my computer.
Is it possible to use dvc to download the data to my computer when I haven't uploaded the data with dvc? Is it possible with dvc import-url?
I have tried using dvc pull, but can only get it to work if I already have the data locally on the computer and have used dvc add and dvc push.
And if I do it that way, then the folders on Azure are not human-readable. Is it possible to upload them in a human-readable format?
If it is not possible is there then another way to download data automatically from azure?
I'll build up on @Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".
Assumptions
Let's assume the following:
We're using a DVC pipeline, specified in an existing dvc.yaml file. The first stage in the current pipeline is called prepare.
Our data is stored on some Azure blob storage container, in a folder named dataset/. This folder follows a structure of sub-folders that we'd like to keep intact.
The Azure blob storage container has been configured in our DVC environment as a DVC 'data remote', with name myazure (more info about DVC 'data remotes' here)
High-level idea
One possibility is to start the DVC pipeline by synchronizing a local dataset/ folder with the dataset/ folder on the remote container.
This can be achieved with a command-line tool called azcopy, which is available for Windows, Linux and macOS.
As recommended here, it is a good idea to add azcopy to your account or system path, so that you can call this application from any directory on your system.
The high-level idea is:
Add an initial update_dataset stage to the DVC pipeline that checks if changes have been made in the remote dataset/ directory (i.e., file additions, modifications or removals).
If changes are detected, the update_dataset stage shall use the azcopy sync [src] [dst] command to apply the changes on the Azure blob storage container (the [src]) to the local dataset/ folder (the [dst]).
Add a dependency between update_dataset and the subsequent DVC pipeline stage prepare, using a 'dummy' file. This file should be added to (a) the outputs of the update_dataset stage; and (b) the dependencies of the prepare stage.
Implementation
This procedure has been tested on Windows 10.
Add a simple update_dataset stage to the DVC pipeline by running:
$ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"
Notice how we specify the 'dummy' file .dataset_updated as an output of the stage.
Edit the dvc.yaml file directly to modify the command of the update_dataset stage. After the modifications, the command shall (a) create the .dataset_updated file after the azcopy command - touch .dataset_updated - and (b) pass the current date and time to the .dataset_updated file to guarantee uniqueness between different update events - echo %date%-%time% > .dataset_updated.
stages:
  update_dataset:
    cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
    deps:
      - remote://myazure/dataset/
    outs:
      - .dataset_updated
  ...
I recommend editing the dvc.yaml file directly to modify the command, as I wasn't able to come up with a complete dvc stage add command that took care of everything in one go. This is due to the use of multiple commands chained by &&, special characters in the Azure connection string, and the echo expression that needs to be evaluated dynamically.
To make the prepare stage depend on the .dataset_updated file, edit the dvc.yaml file directly to add the new dependency, e.g.:
stages:
  prepare:
    cmd: <some command>
    deps:
      - .dataset_updated # add new dependency here
      - ... # all other dependencies
  ...
Finally, you can test different scenarios on your remote side - e.g., adding, modifying or deleting files - and check what happens when you run the DVC pipeline up till the prepare stage:
$ dvc repro prepare
Notes
The solution presented above is very similar to the example given in DVC's external dependencies documentation.
Instead of the az copy command, it uses azcopy sync.
The advantage of azcopy sync is that it only applies the differences between your local and remote folders, instead of 'blindly' downloading everything from the remote side when differences are detected.
This example relies on a full connection string with an SAS token, but you can probably do without it if you configure azcopy with your credentials or fetch the appropriate values from environment variables.
When defining the DVC pipeline stage, I've intentionally left out an output dependency with the local dataset/ folder - i.e. the -o dataset part - as it was causing the azcopy command to fail. I think this is because DVC automatically clears the folders specified as output dependencies when you reproduce a stage.
When defining the azcopy command, I've included the --delete-destination="true" option. This allows synchronization of deleted files, i.e. files are deleted on your local dataset folder if deleted on the Azure container.
Please bear with me, since you have a lot of questions; the answer needs a bit of structure and background to be useful. Or skip to the very end to find some new ways of addressing "Is it possible to upload them in a human-readable format?" :). Anyway, please let me know if that solves your problem, and in general it would be great to have a better (high-level) description of what you are trying to accomplish.
You are right that by default DVC structures its remote in a content-addressable way (which makes it non-human-readable). There are pros and cons to this: it's easy to deduplicate data, it's easy to enforce immutability and make sure that no one can touch the storage directly and remove something, directory names in projects stay connected to the actual project and their meaning, etc.
Some materials on this: Versioning Data and Models, my answer on how DVC structures its data, and the upcoming Data Management User Guide section (still WIP).
That said, it's clear there are downsides to this approach, especially when it comes to managing a lot of objects in the cloud (e.g. millions of images). To name a few concerns that I see a lot as a pattern:
Data has been created (and is being updated) by someone else. There is some ETL, a third-party tool, etc. We need to keep that format.
A third-party tool expects to have data in a "human-readable" way. It doesn't integrate with DVC to be able to access it indirectly via Git (one example: Label Studio needs direct links to S3).
It's not practical to move all of the data into DVC, and it doesn't make sense to instantiate all the files at once as one directory. Users need slices, usually based on some annotations (metadata), etc.
So, DVC has multiple features to deal with data in its own original layout:
dvc import-url - it'll download objects, cache them, and by default push them (dvc push) to the remote again to save them and guarantee reproducibility (this can be changed). This command creates a special .dvc file that is used to detect changes in the cloud and decide whether DVC needs to download something again. It should cover the case of "downloading data automatically from Azure".
dvc get-url - this is more or less wget, rclone or aws s3 cp, etc., with multi-cloud support. It just downloads objects.
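For example (a sketch, assuming the myazure remote and the dataset/ folder from the other answer; adjust the paths to your container):
$ dvc import-url remote://myazure/dataset/ dataset/
$ dvc get-url remote://myazure/dataset/ dataset/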
A slightly more advanced option (if you use DVC pipelines):
Similar to import-url but for DVC pipelines - external dependencies
The third (new) option is in beta phase; it's called "cloud versioning" and essentially it tries to keep the storage human-readable while still benefiting from .dvc files in Git if you need them to reference an exact version of the data.
Cloud Versioning with DVC (it's WIP as I write this; if the PR is merged it means you can find it in the docs).
The document summarizes well the approach:
DVC supports the use of cloud object versioning for cases where users prefer to retain their original filenames and directory hierarchy in remote storage, in exchange for losing the de-duplication and performance benefits of content-addressable storage. When cloud versioning is enabled, DVC will store files in the remote according to their original directory location and
filenames. Different versions of a file will then be stored as separate versions of the corresponding object in cloud storage.

Copy all Gradle dependencies without pre-registered custom task

Use Case
The use case for grabbing all dependencies (without the definition of a custom task in build.gradle) is to perform policy violation and vulnerability analysis on each of them via a templated pipeline. We are using Nexus IQ to do the evaluation.
Example
This can be done simply with Maven, by specifying the local repository to download all dependencies into and then supplying a pattern to Nexus IQ to scan. In the example below we would supply maven-dependencies/* as the scan target to Nexus IQ after rounding up all the dependencies.
mvn -B clean verify -Dmaven.repo.local=${WORKSPACE}/maven-dependencies
In order to do something similar in Gradle it seems the most popular method is to introduce a custom task into build.gradle. I'd prefer to do this in a way that doesn't require developers to implement custom tasks; it's preferred to keep those files as clean as possible. Here's one way I thought of making this happen:
Set GRADLE_USER_HOME to ${WORKSPACE}/gradle-user-home.
Run find ${WORKSPACE}/gradle-user-home -type f -wholename '*/caches/modules*/files*/**/*.*' to grab the location of all dependency resources (I'm fine with picking up non-archive files).
Copy all files found in step #2 to a gradle-dependencies folder.
Supply gradle-dependencies/* as the scan target to Nexus IQ.
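Put together, those steps might look roughly like this in a pipeline script (a sketch only; the gradlew invocation used to resolve dependencies and the exact paths are assumptions):
$ export GRADLE_USER_HOME=${WORKSPACE}/gradle-user-home
$ ./gradlew build   # any task that resolves the project's configurations
$ mkdir -p gradle-dependencies
$ find ${WORKSPACE}/gradle-user-home -type f -wholename '*/caches/modules*/files*/**/*.*' -exec cp {} gradle-dependencies/ \;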
Results
I'm super leery about doing it this way, as it seems very hacky and doesn't seem like the most sustainable solution. Is there another way that I should consider?
UPDATE #1: I've adjusted my question to allow answers that have custom tasks, just not pre-registered. Pre-registered means the custom task is already in the build.gradle file. I'll also provide my answer shortly after this update.
I'm uncertain whether Gradle has the ability to register external custom tasks, but this is how I'm getting by. I create a custom task in a file called copyAllDependencies.gradle, append the contents of that file (after replacing all newlines and instances of two or more spaces with a single space) to build.gradle when the pipeline runs, and then run gradlew copyAllDependencies. I then pass gradle-dependencies/* as the scan target to Nexus IQ.
task copyAllDependencies(type: Copy) {
    def allConfigurations = [];
    configurations.each {
        if (it.canBeResolved) {
            allConfigurations += configurations."${it.name}"
        }
    };
    from allConfigurations
    into "gradle-dependencies"
}
I can't help but feel that this isn't the most elegant solution, but it suits my needs for now.
UPDATE #1: Ultimately, I decided to go with requiring development teams to specify this custom task in their build.gradle file. There were too many nuances with echoing script contents into another file (hence the need to include ; when defining allConfigurations and iterating over all configurations). However, I am still open to answers that address the original question.

protoc command not generating all base classes (java)

I have been trying to generate the basic gRPC client and server interfaces from a .proto service definition here from the grpc official repo.
The relevant service defined in that file (from the link above) is below:
service RouteGuide {
  rpc GetFeature(Point) returns (Feature) {}
  rpc ListFeatures(Rectangle) returns (stream Feature) {}
  rpc RecordRoute(stream Point) returns (RouteSummary) {}
  rpc RouteChat(stream RouteNote) returns (stream RouteNote) {}
}
The command I run is protoc --java_out=${OUTPUT_DIR} path/to/proto/file
According to the gRPC site (specifically here), a RouteGuideGrpc.java file, which contains a base class RouteGuideGrpc.RouteGuideImplBase with all the methods defined in the RouteGuide service, is supposed to be generated by the protoc command above, but that file does not get generated for me.
Has anyone faced similar issues? Is the official documentation simply incorrect? And would anyone have any suggestion as to what I can do to generate that missing class?
This may help someone else in the future so I'll answer my own question.
I believe the java documentation for gRPC code generation is not fully up to date and the information is scattered amongst different official repositories.
So it turns out that in order to generate all the gRPC Java service base classes as expected, you need to pass an additional flag to the protoc CLI, like so: --grpc-java_out=${OUTPUT_DIR}. But in order for that additional flag to work, you need a few extra things:
The binary for the protoc plugin for gRPC Java protoc-gen-grpc-java: you can get the relevant one for your system from maven central here (the link is for v1.17.1). If there isn't a prebuilt binary available for your system, you can compile one yourself from the github repo instructions here.
Make sure the binary location is added to your PATH environment variable and the binary is renamed to "protoc-gen-grpc-java" exactly (that is the name the protoc cli expects to have in the path).
Finally, you are ready to run the correct command, protoc --java_out=${OUTPUT_DIR} --grpc-java_out=${OUTPUT_DIR} path/to/proto/file, and now the service base classes like RouteGuideGrpc.RouteGuideImplBase should be generated where they previously were not.
I hope this explanation helps someone else out in the future.
Thank you very much for this investigation. Indeed, the doc is incomplete, and people use Maven to compile everything without understanding how it really works.

Golang dependencies registering sql driver in init() causing clash

I have some Go tests that depend on some external code which has init() methods that register mysql drivers. My code also needs to register mysql drivers, and I therefore end up with a panic and the error "Register called twice for driver mysql" when running my tests.
It seems that the repo I'm dependent on has a vendors directory with the driver in it ("vendors/github.com/go-sql-driver/mysql"), and that when I run my tests the init() methods in that vendored driver are called and register the mysql driver, causing the clash.
The best option I can see would be to copy the dependency to my own vendors directory and remove the mysql driver from the dependency's vendor directory. However, I'm not keen on this as it involves duplicating my dependency and then modifying it by deleting files. Also, I'm only dependent on this for running tests, so I'm not sure I should be moving it to my vendors directory anyway.
Is there a way to prevent init() being called on a dependency's vendor dependencies?
First of all, I would ditch the dependency. If it's registering a database driver - something that dependencies really should never do - I predict there will be more problems with it. Also I suggest opening an issue.
As a workaround, you can import the library depending on whether you are in a test build or not. So you would have a file called e.g. mysqlimport.go with nothing but:

// +build !test

package mypkg // replace with your own package name

import _ "github.com/go-sql-driver/mysql"

This way this code will only be compiled when you're not testing, although you'll have to start your tests with go test -tags test.

How can I batch Kafka reads to Elasticsearch

I'm not too familiar with Kafka, but I would like to know the best way to read data in batches from Kafka so I can use the Elasticsearch Bulk API to load the data faster and more reliably.
Btw, I am using Vert.x for my Kafka consumer.
Thank you,
I cannot tell if this is the best approach or not, but when I started looking for similar functionality I could not find any readily available frameworks. I found this project:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer/tree/branch2.0
and started contributing to it as it was not doing everything I wanted, and was also not easily scalable. Now the 2.0 version is quite reliable and we use it in production in our company processing/indexing 300M+ events per day.
This is not a self-promotion :) - just sharing how we do the same type of work. There might be other options right now as well, of course.
https://github.com/confluentinc/kafka-connect-elasticsearch
Or you can try this source:
https://github.com/reachkrishnaraj/kafka-elasticsearch-standalone-consumer
Running as a standard Jar
1. Download the code into a $INDEXER_HOME dir.
2. cp $INDEXER_HOME/src/main/resources/kafka-es-indexer.properties.template /your/absolute/path/kafka-es-indexer.properties - update all relevant properties as explained in the comments.
3. cp $INDEXER_HOME/src/main/resources/logback.xml.template /your/absolute/path/logback.xml - specify the directory you want to store logs in, and adjust the max sizes and number of log files as needed.
4. Build/create the app jar (make sure you have Maven installed):
cd $INDEXER_HOME
mvn clean package
The kafka-es-indexer-2.0.jar will be created in $INDEXER_HOME/bin. All dependencies will be placed into $INDEXER_HOME/bin/lib. All JAR dependencies are linked via the kafka-es-indexer-2.0.jar manifest.
5. Edit your $INDEXER_HOME/run_indexer.sh script: make it executable if needed (chmod a+x $INDEXER_HOME/run_indexer.sh) and update the properties marked with "CHANGE FOR YOUR ENV" comments according to your environment.
6. Run the app [use JDK 1.8]:
./run_indexer.sh
I used Spark Streaming and it was quite a simple implementation using Scala.
