How to read/write/sync data on cloud with Kedro

In short: how can I save a file both locally AND on the cloud, and similarly, how can I set it to read from the local copy?
Longer description: There are two scenarios: 1) building the model and 2) serving the model through an API. When building the model, a series of analyses is run to generate features and the model. The results are written locally, and at the end everything is uploaded to S3. For serving the model, all required files generated in the first step are first downloaded.
I am curious how I can leverage Kedro here. Perhaps I can define two entries for each file in conf/base/catalog.yml, one corresponding to the local version and the second to the S3 copy. But that is perhaps not the most efficient way when I am dealing with 20 files.
Alternatively, I can upload the files to S3 using my own script and exclude the synchronization from Kedro; in other words, Kedro is unaware that copies exist in the cloud. But perhaps this approach is not the most Kedro-friendly way.

Not quite the same, but my answer here could potentially be useful.
I would suggest that the simplest approach in your case is indeed defining two catalog entries and having Kedro save to both of them (and load from the local one for an additional speed-up), which gives you the ultimate flexibility, though I do admit it isn't the prettiest.
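For instance, the two entries for a single dataset could sit side by side in the catalog; this is only a minimal sketch (the dataset type, file paths and bucket name are placeholders, not from the original project):

my_dataset.local:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/my_dataset.csv

my_dataset.s3:
  type: pandas.CSVDataSet
  filepath: s3://my-bucket/my_dataset.csv

A node would then either list both names as outputs, or you can let the hook described below add the `.s3` outputs for you.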
In terms of avoiding all your node functions needing to return two values, I'd suggest applying a decorator to nodes that you mark with a certain tag, e.g. tags=["s3_replica"], taking inspiration from the script below (stolen from a colleague of mine):
from typing import Any, Callable, Dict, List

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class S3DataReplicationHook:
    """
    Hook to replicate the output of any node tagged with `s3_replica` to S3.

    E.g. if a node is defined as:

        node(
            func=myfunction,
            inputs=['ds1', 'ds2'],
            outputs=['ds3', 'ds4'],
            tags=['tag1', 's3_replica'],
        )

    then the hook will expect to see `ds3.s3` and `ds4.s3` in the catalog.
    """

    @hook_impl
    def before_node_run(
        self,
        node: Node,
        catalog: DataCatalog,
        inputs: Dict[str, Any],
        is_async: bool,
        run_id: str,
    ) -> None:
        if "s3_replica" in node.tags:
            node.func = _duplicate_outputs(node.func)
            node.outputs = _add_local_s3_outputs(node.outputs)


def _duplicate_outputs(func: Callable) -> Callable:
    def wrapped(*args, **kwargs):
        outputs = func(*args, **kwargs)
        # Return every output twice: once for the original dataset and once
        # for the `.s3` counterpart added by `_add_local_s3_outputs`.
        if isinstance(outputs, tuple):
            return outputs + outputs
        return (outputs,) + (outputs,)

    return wrapped


def _add_local_s3_outputs(outputs: List[str]) -> List[str]:
    return outputs + [f'{o}.s3' for o in outputs]
The above is a hook, so you'd place it in your hooks.py file (or wherever you want) in your project and then import it into your settings.py file and put:
from .hooks import ProjectHooks, S3DataReplicationHook
HOOKS = (ProjectHooks(), S3DataReplicationHook())
in your settings.py.
You can be slightly cleverer with your output naming convention so that it only replicates certain outputs (for example, maybe you agree that all catalog entries that end with .local also have to have a corresponding .s3 entry, and you mutate the outputs of your node in that hook accordingly, rather than doing it for every output).
If you wanted to be even cleverer, you could inject the corresponding S3 entry into the catalog using an after_catalog_created hook rather than manually writing the S3 version of the dataset in your catalog, again following a naming convention you choose. Though I'd argue that writing out the S3 entries is more readable in the long run.
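For reference, a minimal sketch of such an after_catalog_created hook might look like the following, assuming the `.local`/`.s3` naming convention, pandas.CSVDataSet entries and a placeholder bucket name (not a drop-in implementation):

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog


class S3CatalogInjectionHook:
    """For every `<name>.local` catalog entry, register a matching `<name>.s3` entry."""

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        for name in catalog.list():
            if not name.endswith(".local"):
                continue
            base = name[: -len(".local")]
            # `my-bucket` and the CSV format are placeholders; adjust to your project.
            catalog.add(
                f"{base}.s3",
                CSVDataSet(filepath=f"s3://my-bucket/{base}.csv"),
                replace=True,
            )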

There are two ways I can think of. The simpler approach is to use a separate configuration environment (--env) for cloud and local: https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments
conf
├── base
│   └── ...
├── cloud
│   └── catalog.yml
└── my_local
    └── catalog.yml
And you can call kedro run --env=cloud or kedro run --env=my_local depending on which env you want to use.
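With this setup the same dataset name simply points to a different location in each environment, for example (a sketch; the dataset type and paths are placeholders):

# conf/cloud/catalog.yml
my_dataset:
  type: pandas.CSVDataSet
  filepath: s3://my-bucket/my_dataset.csv

# conf/my_local/catalog.yml
my_dataset:
  type: pandas.CSVDataSet
  filepath: data/01_raw/my_dataset.csv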
Another more advanced way is to use TemplatedConfigLoader https://kedro.readthedocs.io/en/stable/kedro.config.TemplatedConfigLoader.html
conf
├── base
│   └── catalog.yml
├── cloud
│   └── globals.yml   (contains `base_path: s3-prefix-path`)
└── my_local
    └── globals.yml   (contains `base_path: my_local_path`)
In catalog.yml, you can then refer to base_path like this:
my_dataset:
  filepath: ${base_path}/my_dataset
And you can call kedro run --env=cloud or kedro run --env=my_local depending on which env you want to use.
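Note that for ${base_path} to be resolved, the TemplatedConfigLoader has to be registered in the project. In Kedro 0.16/0.17-style projects that is typically done with a register_config_loader hook along these lines (a sketch, not part of the original answer; newer Kedro versions configure the config loader via settings.py instead):

from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths) -> TemplatedConfigLoader:
        # Pick up globals.yml from the active environment so that
        # ${base_path} resolves to the cloud or local prefix.
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")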

Related

Process Substitution in Ansible for Path-based Parameters

Many Ansible modules are designed to accept file paths as a parameter, but lack the possibility to supply the contents of the file directly. In cases where the input data actually comes from something other than a file, this forces one to create a temporary file somewhere on disk, write the intended parameter value into it, and then supply the path of this temporary file to the Ansible module.
For illustration purposes a real life example: the java_cert Ansible module takes the parameter pkcs12_path for the path to a PKCS12 keystore containing a keypair to be imported into a given Java keystore. Now say for example this data is retrieved through a Vault lookup, so in order to be able to supply the module with the path it demands, we must write the Vault lookup result into a temporary file, use the file's path as the parameter and then handle the secure deletion of the temporary file, seeing as the data is likely confidential.
When a situation such as this arises within the context of Shell/bash scripting, namely a command line tool's flag only supporting interaction with a file, the magic of process substitution (e.g. --file=<(echo $FILE_CONTENTS)) allows for the tool's input and output data to be linked with other commands by transparently providing a named pipe that acts as if it were a (mostly) normal file on disk.
Within Ansible, is there any comparable mechanism to replace file-based parameters with more flexible constructs that allow for the usage of data from variables or other commands? If there is no built-in method to achieve this, are there maybe 3rd-party solutions that allow for it, or that simplify workflows like the one I described? For example something like a custom lookup plugin which is supplied with the file content data and then handles, transparently and in the background, the file management (i.e. creation, writing the data, and ultimately deletion) and provides the temporary path as its return value, without the user necessarily ever having to know it.
Exemplary usage of such a plugin could be:
...
pkcs_path: "{{ lookup('as_file', '-----BEGIN PRIVATE KEY-----...-----END PRIVATE KEY----- ') }}"
...
with the plugin then creating a file under e.g. /tmp/as_file.sg7N3bX containing the textual key from the second parameter and returning this file path as the lookup result. I am however unsure how exactly the continued management of the file (especially the timely deletion of sensitive data) could be realized in such a context.
Disclaimer:
I am (obviously!) the author of the below collection which was created as a reaction to the above question
The lookup plugin was not thoroughly tested and might fail with particular modules.
Since this was a pretty good idea and nothing existed, I decided to give it a try. This all ended up in a collection now called thoteam.var_as_file which is available in a GitHub repo. I won't paste all the files in this answer as they are all available in the mentioned repo, with full README documentation on how to install, test and use it.
The global idea was the following:
Create a lookup plugin responsible for pushing new temporary files with a given content and returning a path to use them.
Clean up the created files at the end of the playbook run. For this step, I created a callback plugin which launches the cleanup action listening to v2_playbook_on_stats events.
I still have some concerns about concurrency (files yet to be cleaned are stored in a static JSON file on disk) and reliability (not sure that the stats stage happens in all situations, especially on crashes). I'm also not entirely sure using a callback for this is good practice / the best choice.
Meanwhile, this was quite fun to code and it does the job. I will see if this work gets used by others and might very well enhance all this in the next weeks (and if you have PRs to fix the already known issues, I'm happy to accept them).
Once installed and the callback plugin enabled (see https://github.com/ansible-ThoTeam/thoteam.var_as_file#installing-the-collection), the lookup can be used anywhere to get a file path containing the passed content. For example:
- name: Get a filename with the given content for later use
  ansible.builtin.set_fact:
    my_tmp_file: "{{ lookup('thoteam.var_as_file.var_as_file', some_variable) }}"

- name: Use in place in a module where a file is mandatory and you have the content in a var
  community.general.java_cert:
    pkcs12_path: "{{ lookup('thoteam.var_as_file.var_as_file', pkcs12_store_from_vault) }}"
    cert_alias: default
    keystore_path: /path/to/my/keystore.jks
    keystore_pass: changeit
    keystore_create: yes
    state: present
These are the relevant parts of the two plugin files. I removed the Ansible documentation vars (for conciseness), which you can find directly in the git repo if you wish.
plugins/lookup/var_as_file.py
from ansible.errors import AnsibleError
from ansible.plugins.lookup import LookupBase
from ansible.module_utils.common.text.converters import to_native
from ansible_collections.thoteam.var_as_file.plugins.module_utils.var_as_file import VAR_AS_FILE_TRACK_FILE
from hashlib import sha256
import tempfile
import json
import os


def _hash_content(content):
    """
    Returns the hex digest of the sha256 sum of content
    """
    return sha256(content.encode()).hexdigest()


class LookupModule(LookupBase):

    created_files = dict()

    def _load_created(self):
        if os.path.exists(VAR_AS_FILE_TRACK_FILE):
            with open(VAR_AS_FILE_TRACK_FILE, 'r') as jfp:
                self.created_files = json.load(jfp)

    def _store_created(self):
        """
        serialize the created files as json in the tracking file
        """
        with open(VAR_AS_FILE_TRACK_FILE, 'w') as jfp:
            json.dump(self.created_files, jfp)

    def run(self, terms, variables=None, **kwargs):
        '''
        terms contains the content to be written to the temporary file
        '''
        try:
            self._load_created()
            ret = []

            for content in terms:
                content_sig = _hash_content(content)
                file_exists = False

                # Check if a file was already created for this content and verify it.
                if content_sig in self.created_files:
                    if os.path.exists(self.created_files[content_sig]):
                        with open(self.created_files[content_sig], 'r') as efh:
                            if content_sig == _hash_content(efh.read()):
                                file_exists = True
                                ret.append(self.created_files[content_sig])
                            else:
                                os.remove(self.created_files[content_sig])

                # Create / Replace the file
                if not file_exists:
                    temp_handle, temp_path = tempfile.mkstemp(text=True)
                    with os.fdopen(temp_handle, 'a') as temp_file:
                        temp_file.write(content)
                    self.created_files[content_sig] = temp_path
                    ret.append(temp_path)

            self._store_created()
            return ret

        except Exception as e:
            raise AnsibleError(to_native(repr(e)))
plugins/callback/clean_var_as_file.py
from ansible.plugins.callback import CallbackBase
from ansible_collections.thoteam.var_as_file.plugins.module_utils.var_as_file import VAR_AS_FILE_TRACK_FILE
from ansible.module_utils.common.text.converters import to_native
from ansible.errors import AnsibleError
import os
import json


def _make_clean():
    """Clean all files listed in VAR_AS_FILE_TRACK_FILE"""
    try:
        with open(VAR_AS_FILE_TRACK_FILE, 'r') as jfp:
            files = json.load(jfp)

        for f in files.values():
            os.remove(f)

        os.remove(VAR_AS_FILE_TRACK_FILE)
    except Exception as e:
        raise AnsibleError(to_native(repr(e)))


class CallbackModule(CallbackBase):
    ''' This Ansible callback plugin cleans up files created by the thoteam.var_as_file.var_as_file lookup '''

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'utility'
    CALLBACK_NAME = 'thoteam.var_as_file.clean_var_as_file'
    CALLBACK_NEEDS_WHITELIST = False
    # This one doesn't work for a collection plugin.
    # Needs to be enabled anyway via the ansible.cfg callbacks_enabled option.
    CALLBACK_NEEDS_ENABLED = False

    def v2_playbook_on_stats(self, stats):
        _make_clean()
I'll be happy to get any feedback if you give it a try.

Yocto PREMIRROR/SOURCE_MIRROR_URL with url arguments (SAS_TOKEN) possible?

I successfully created a premirror for our Yocto builds on an Azure storage blob; that works if I set the access level to "Blob (Anonymous read)..".
Now I want to keep the blob completely private and allow access only via SAS tokens.
SAS_TOKEN = "?sv=2019-12-12&ss=bf&srt=co&sp=rdl&se=2020-08-19T17:38:27Z&st=2020-08-19T09:38:27Z&spr=https&sig=abcdef_TEST"
INHERIT += "own-mirrors"
SOURCE_MIRROR_URL = "https://somewhere.blob.core.windows.net/our-mirror/downloads/BASENAME${SAS_TOKEN}"
BB_FETCH_PREMIRRORONLY = "1"
In general this works, but Yocto (or, to be exact, the BitBake fetch module) will then try to fetch from https://somewhere.blob.core.windows.net/our-mirror/downloads/bash-5.0.tar.gz%3Fsv%3D2019-12-12%26ss%3Dbf%26srt%3Dco%26sp%3Drdl%26se%3D2020-08-19T17%3A38%3A27Z%26st%3D2020-08-19T09%3A38%3A27Z%26spr%3Dhttps%26sig%3Dabcdef_TEST/bash-5.0.tar.gz
which URL-encodes the special characters of the parameters, so of course the fetch will fail.
Has anybody already solved this or a similar issue?
Or is it possible to patch files inside the poky layer (namely in ./layers/poky/bitbake/lib/bb/fetch2) without changing them, so I can roll my own encodeurl function there?

intersphinx link creation issues for local files across multiple projects

I have a few different Sphinx projects that I would like to refer to each other locally (no web server). I have separate code + build directories set up for the projects and was trying out intersphinx to solve this requirement.
My questions are:
Is there a better way of referring to two or more local projects within Sphinx?
Is there a way to strip out the second build in the path?
In my configuration file I have:
intersphinx_mapping = {
    'doc1': ('../build/doc1', None)
}
so I get no issues when doing a make html. However, when I look at the reference I've created with :ref:`doc1 info <doc1:label1>`, I have in the HTML document:
file:///<root path>/build/**build**/doc1/html/doc.html#label1
So the issue is that I get two "build" directory listings - it should have been:
file:///<root path>/build/doc1/html/doc.html#label1
If I manually do this, it correctly pulls in the document.
I've also tried replacing None with '../build/doc1'. If I drop the build from the mapping, I get an error finding the objects.inv file for doc1.
I do not want to use an absolute path, since the end user getting this documentation may see it in another location, and I want this to be cross-platform...
My directory tree is essentially as follows:
Makefile
build/doc1/html
build/doc2/html
doc1
doc2
As a background, I'm trying this under Cygwin with Sphinx 1.7.5... I haven't tried Linux yet to see if I get the same behavior...
You can set a different path for your target and for your inventory.
Maybe you can try something like:
intersphinx_mapping = {
    'doc1': ('../doc1', '../build/doc1/objects.inv')
}
If you want to keep the None, it is also possible to have both:
intersphinx_mapping = {
    'doc1': ('../doc1', (None, '../build/doc1/objects.inv'))
}

Placing file inside folder of S3 bucket

I have a Spring Boot application where I am trying to place a file inside a folder of an S3 target bucket: target-bucket/targetsystem-folder/file.csv
The targetsystem-folder name will differ for each file and is retrieved from a YML configuration file.
The targetsystem-folder has to be created via code if it does not exist, and the file should be placed under that folder.
As far as I know, there is no folder concept in an S3 bucket and everything is stored as objects.
I have read in some documents that to place the file under a folder, you have to give a key-expression like targetsystem-folder/file.csv and bucket = target-bucket.
But it does not work out. I would like to achieve this using spring-integration-aws without using the AWS SDK directly:
<int-aws:s3-outbound-channel-adapter id="filesS3Mover"
        channel="filesS3MoverChannel"
        transfer-manager="transferManager"
        bucket="${aws.s3.target.bucket}"
        key-expression="headers.targetsystem-folder/headers.file_name"
        command="UPLOAD">
</int-aws:s3-outbound-channel-adapter>
Can anyone guide me on this issue?
Your problem is that the SpEL in the key-expression is wrong. Just try to start from regular Java code and imagine how you would build such a value. Then you'll figure out that you are missing a concatenation operation in your expression:
key-expression="headers.targetsystem-folder + '/' + headers.file_name"
Also, please, in the future provide more info about the error. In most cases the stack trace is very helpful.
In the project that I was working on before, I just used the Java AWS SDK provided. Then in my implementation, I did something like this:
private void uploadFileTos3bucket(String fileName, File file) {
    s3client.putObject(new PutObjectRequest("target-bucket", "/targetsystem-folder/" + fileName, file)
            .withCannedAcl(CannedAccessControlList.PublicRead));
}
I didn't create any more configuration. It automatically creates the /targetsystem-folder inside the bucket (and puts the file inside of it) if it doesn't exist; otherwise, it just puts the file inside.
You can take this answer as a reference for further explanation of the subject:
There are no "sub-directories" in S3. There are buckets and there are
keys within buckets.
You can emulate traditional directories by using prefix searches. For
example, you can store the following keys in a bucket:
foo/bar1
foo/bar2
foo/bar3
blah/baz1
blah/baz2

chef cookbook lwrp, easiest way to use new_resource.updated_by_last_action(true)

I'm writing an LWRP for Chef 10.
When that resource is run in other recipes, it should be marked as "updated_by_last_action" if something has changed; but if nothing has changed, updated_by_last_action should be false.
As an example I have the Chef documentation http://docs.opscode.com/lwrp_custom_provider.html#updated-by-last-action. In that example the template resource is wrapped inside a variable to test whether it has been changed, and the updated_by_last_action status is then set accordingly.
So my code should look something like this:
f = file new_resource.filename do
  xxx
end
new_resource.updated_by_last_action(f.updated_by_last_action?)

t = template new_resource.templatename do
  xxx
end
new_resource.updated_by_last_action(t.updated_by_last_action?)

m = mount new_resource.mountpoint do
  xxx
end
new_resource.updated_by_last_action(m.updated_by_last_action?)
But if a provider gets bigger and uses a lot of resources like template, file, directory, mount, etc., should all those resources be wrapped inside variables like in the example, just to find out whether a resource has been updated and then propagate the status that this provider has been updated?
I'm wondering if there is a simpler and cleaner way to run new_resource.updated_by_last_action(true) other than to wrap all resources inside variables. Because if I just put a new_resource.updated_by_last_action(true) inside the action before end, the LWRP is marked as updated on every Chef run, which is not optimal.
You can add use_inline_resources at the top of your LWRP provider, which delegates the updated_by_last_action handling to the inline resources.
