Data Engineering - Extract Phase - ETL

I read the book ‘Data Pipelines Pocket Reference’ by James Densmore. Like many others, the sample pipeline in the book saves the data from the extract phase as CSV files on the local drive. Is this also how it would work in production? Saving the file to the local machine first and then uploading it to a data lake or similar?

If by "local drive" you mean a user's personal PC then no, that's not what you would do in Production. You would land the data on a server or in a cloud storage location.

Related

How to create an Azure Data Factory pipeline on a local machine for dev/debug?

I am creating a service that uses a few ADF pipelines and will be responsible for processing a large number of big files.
The main goal is to avoid creating Azure services such as the database, the storage account and the pipelines on a local developer account with 50 Euro of credit. The biggest drawback of that approach is the size of the processed files: big files could burn through the credit and block the account.
The structure of the project looks like this:
web UI - API server - Azure Data Factory with pipelines (the unit used for processing files and running calculations). In order to develop and debug such a project, everything should be configurable on a local machine. The data factory pipelines will use information from the database in order to process the files and create the calculations.
I have looked at different approaches to deploying projects with ADF pipelines, but there is no general solution for this structure. Is it possible to simulate a pipeline and create a local instance of it on a developer's machine?

How can I create automatic file backups to cloud storage?

I have 4 Windows workstations that I need to back up to Google Cloud Storage, and I would like it to happen automatically. Is that possible?
On each workstation you can set up a scheduled task that regularly runs gsutil rsync (or the newer gcloud storage rsync) against a dedicated folder in Google Cloud Storage, pointing it at the correct local folder on that workstation.
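A minimal sketch of what such a scheduled task could run, assuming the Cloud SDK is installed on each workstation and that the local folder and bucket names below are placeholders:

```python
import socket
import subprocess

# Hypothetical paths; point these at the real folder and bucket.
LOCAL_DIR = r"C:\Data\to-backup"
BUCKET_PREFIX = f"gs://my-backup-bucket/{socket.gethostname()}"


def sync_to_gcs() -> None:
    """Mirror the local folder into a per-workstation prefix in Cloud Storage."""
    # On Windows the Cloud SDK ships gsutil as gsutil.cmd; adjust the command
    # name (or use shell=True) if plain "gsutil" is not found.
    subprocess.run(
        ["gsutil", "-m", "rsync", "-r", LOCAL_DIR, BUCKET_PREFIX],
        check=True,
    )


if __name__ == "__main__":
    sync_to_gcs()
```

A Windows Task Scheduler job that runs this script (or the equivalent one-line gsutil command) every few hours covers the "automatic" part, and the per-hostname prefix keeps the four workstations from overwriting each other.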

Backup strategy for Ubuntu/Laravel

I am searching for a backup strategy for my web application files.
I am hosting my (Laravel) application on an Ubuntu (18.04) server in the cloud and currently have around 80GB of storage that needs to be backed up (and this grows fast). The biggest files are around ~30MB; the rest are small jpg/txt/pdf files.
I want to make a full backup of the storage directory at least twice a day and store it as a zip file on a local server. I have two reasons for this: independence from cloud providers, and archiving.
My first backup strategy was to zip all the contents of the storage folder and rsync the zip. This goes well up to a couple of gigabytes, then the server gets completely stuck on CPU usage.
My second approach is plain rsync, but with this I can't track when a file is deleted or added.
I am looking for a good backup strategy that preferably generates zips before or after the backup and stores them, so we can browse and examine them back in time.
Strangely enough I could not find anything that suits me; I hope someone can help me out.
I agree with #RobertFridzema that the whole server becomes unresponsive when using the ZIP functionality from the Spatie package.
We had the same situation on a customer project. My suggestion is to keep the source code files in version control, back up only the dynamic/changing files with rsync (incremental works best and is fast), and create a separate database backup strategy. For example with MySQL/MariaDB: run mysqldump, encrypt the resulting file and move it to external storage as well.
If ZIP creation is still a problem, I would use storage that is already set up with RAID functionality, or if that is not possible, I would definitely not run the ZIP creation on the live server: rsync incrementally to another server and run the backup strategy there.
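A minimal sketch of that rsync-plus-mysqldump combination, assuming the storage directory, backup host, and database name below are placeholders and that the mysqldump credentials live in ~/.my.cnf; a cron entry or the Laravel scheduler can run it twice a day:

```python
import datetime
import subprocess

# Placeholder paths and hosts; adjust to the real environment.
STORAGE_DIR = "/var/www/app/storage/"
BACKUP_TARGET = "backup@backup-server:/backups/app/"
DB_NAME = "app"
DUMP_DIR = "/var/backups/db"


def backup() -> None:
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M")

    # Incremental file sync: only changed files are transferred, and
    # --delete mirrors removals so the backup reflects the real state.
    subprocess.run(["rsync", "-az", "--delete", STORAGE_DIR, BACKUP_TARGET], check=True)

    # Separate database dump, compressed on the fly; encrypt it (e.g. with
    # gpg) before shipping it off the server, as suggested above.
    dump_file = f"{DUMP_DIR}/{DB_NAME}-{stamp}.sql.gz"
    with open(dump_file, "wb") as out:
        dump = subprocess.Popen(
            ["mysqldump", "--single-transaction", DB_NAME], stdout=subprocess.PIPE
        )
        subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
        dump.stdout.close()
        if dump.wait() != 0:
            raise RuntimeError("mysqldump failed")

    subprocess.run(["rsync", "-az", dump_file, BACKUP_TARGET], check=True)


if __name__ == "__main__":
    backup()
```

Because the dumps are timestamped files, you keep the point-in-time copies the zip approach was giving you, without compressing 80GB of mostly unchanged files on the live server.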
Spatie has a package for Laravel backups that can be scheduled in the Laravel job scheduler. It will create zips of the entire project, including the storage dirs:
https://github.com/spatie/laravel-backup

How to save/back up an Amazon instance locally

I would like to lower the amount I am paying to Amazon.
There are stopped instances that I want to back up and save on my local, on-prem server.
After creating an image from the instance, is there any way I can copy the AMI to my local server and remove it from Amazon? Then, on the day I need it back, it can be transferred from my local server to Amazon to use again.
The instance was first created on Amazon. I would rather have a way to save the instance on-premise as a file, not as a virtual server.
The main issue is: how can I transfer and save the image of an instance that was created on Amazon as a file on the local server, and how can I move it back into Amazon in case I need to build the instance again?
Is there any way to do it?
Thanks a lot!
You can use some backup software (Duplicati, CloudBerry, or anything else):
Install the backup software on your EC2 instance
Make an image backup to S3 cloud storage
Install the backup software on your physical machine
Restore the image from S3 cloud storage to the physical machine, or keep it on your local storage to have the backup locally.
And last, but not least:
Good luck!)))
You would need to use the VM Import/Export tool for that. Read the docs to make sure you know how to upload the image again.
As for the cost, I am not sure exactly how Amazon bills this; that is something you have to check on your account. Once you create the image it is in your account, and even after you download it I am not sure whether AWS keeps charging you for it or not.
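The same VM Import/Export flow can be driven from the API. A rough sketch with boto3, assuming the AMI ID and bucket names are placeholders and the account already has the vmimport service role described in the docs: export the AMI to S3 as a disk image, download it to the local server, and later re-upload and import it.

```python
import boto3

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")

# 1. Export the AMI to S3 as a disk image (placeholder ID and bucket).
task = ec2.export_image(
    ImageId="ami-0123456789abcdef0",
    DiskImageFormat="VMDK",
    S3ExportLocation={"S3Bucket": "my-export-bucket", "S3Prefix": "exports/"},
)
print("export task:", task["ExportImageTaskId"])
# ...wait for the task to finish (describe_export_image_tasks); the resulting
# object key is the prefix plus the export task id plus ".vmdk".

# 2. Pull the exported file down to the local server; once it is verified you
#    can delete the S3 object and deregister the AMI to stop paying for them.
s3.download_file("my-export-bucket", "exports/export-ami-EXAMPLE.vmdk",
                 "/backups/instance.vmdk")

# 3. When the instance is needed again: upload the file back and import it as
#    a new AMI, then launch an instance from that AMI.
s3.upload_file("/backups/instance.vmdk", "my-export-bucket", "imports/instance.vmdk")
ec2.import_image(
    DiskContainers=[{
        "Format": "VMDK",
        "UserBucket": {"S3Bucket": "my-export-bucket", "S3Key": "imports/instance.vmdk"},
    }]
)
```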
You can create an image file from your current drive, but it will be quite expensive:
create another instance
attach your volume to it as a second drive
use something like dd if=/dev/xvdf of=drive.img ... to copy the volume to a file (the device name depends on how the volume is attached)
rsync / ftp / etc. the file to your local drive.
You will be billed for the second instance and for the transfer, and when you want to restore the machine you'll be billed again.
Have you checked the free tier? You get a year of free access to AWS for small instances and volumes.
You need a tool to get what you want. Take CloudBerry, for example: create an image, store it at Amazon, and then restore it when needed. This is the best option for you; there is no other way.

Temporary storage for Azure WebSites

I want to cache some cropped images and serve them without calculating them again in an Azure WebSite. When I used an Azure VM I just stored them on the D drive (the temporary drive), but I don't know where to store them now.
I could use Path.GetTempPath but I am not sure if this is the best approach.
Can you suggest where I should store my temporary files when serving from an Azure WebSite?
Azure Websites also come with a temp folder. The path is defined in the environment variable %TEMP%.
You can store your images in the App_Data folder in the root of your application, or you can use the Azure CDN for caching.
You could store the processed content on Azure Blob Storage and serve the content from there.
If what you really want is a cache you can also look into using the Azure Redis Cache.
You can use the Path.GetTempPath() and Path.GetTempFileName() functions for the temp file name, but you are limited in terms of space, so if you're doing a 10K save for every request and expect 100,000 requests at a time per server, blob storage is probably better.
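The "compute once, cache in blob storage, serve from there" idea looks roughly like this. The sketch below uses the Python azure-storage-blob SDK for brevity, with a placeholder connection string, container name, and a hypothetical crop_image routine; the same flow applies in ASP.NET with the .NET storage client.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container; real values belong in app settings.
blob_service = BlobServiceClient.from_connection_string("<connection-string>")
CONTAINER = "image-cache"


def get_cropped_image(name: str, crop: str) -> bytes:
    """Return a cropped image, computing it only on a cache miss."""
    blob = blob_service.get_blob_client(CONTAINER, f"{name}-{crop}.jpg")
    if blob.exists():
        # Cache hit: serve the previously computed crop.
        return blob.download_blob().readall()

    data = crop_image(name, crop)           # hypothetical cropping routine
    blob.upload_blob(data, overwrite=True)  # cache it for the next request
    return data
```

Unlike the site's %TEMP% folder, the cached crops survive restarts and are shared across all instances of the site, which matters once the app scales out.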
The following sample demonstrates how to store temp files in Azure, using both Path and Blob.
Doc is here: https://code.msdn.microsoft.com/How-to-store-temp-files-in-d33bbb10
Code is here: https://github.com/Azure-Samples/storage-blob-dotnet-store-temp-files/archive/master.zip
