python ssh netmiko pathos client server, processing files paths from server vs. client GIS - multiprocessing

So I'm on a Windows workstation running a Python script that does GIS processing on very large .tif files. There is a Linux server whose processing power I want to use. I've SSH'ed into the server (netmiko) and set up pathos multiprocessing to run on the node. It works great on small projects, but when I scaled it up it crashed because of memory allocation on the workstation.
I realized that the workstation was trying to load everything into memory.
I have mapped the working .tif file directory on the Ubuntu server.
How do I call and store the file paths relative to the server in Python, bypassing the workstation file directories, and call the objects relative to the worker node?
I'm currently looking into Celery with RabbitMQ.

Well, I talked to my networking guys and they built me a cluster to directly code from, woot woot. I think gRPC could have handled this too.
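For anyone landing here with the same path question, one possible approach is to keep only server-side POSIX paths in the script and translate from the Windows view of the share, so the workstation never opens the files itself. A minimal sketch, assuming a hypothetical share that is mapped as Z:\gis on Windows and mounted at /mnt/gis on the Ubuntu server (both locations and the file name are made up for illustration):

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical mount points: the same share as seen from each machine.
WINDOWS_ROOT = PureWindowsPath(r"Z:\gis")
SERVER_ROOT = PurePosixPath("/mnt/gis")


def to_server_path(windows_path: str) -> str:
    """Translate a workstation path into the matching server-side path.

    Only the path string is manipulated; the file is never opened here,
    so nothing is loaded into workstation memory.
    """
    relative = PureWindowsPath(windows_path).relative_to(WINDOWS_ROOT)
    return str(SERVER_ROOT.joinpath(*relative.parts))


if __name__ == "__main__":
    print(to_server_path(r"Z:\gis\tiles\dem_01.tif"))
```

The function handed to the worker pool would then receive these strings and open the .tif on the node, rather than receiving an already loaded raster from the workstation.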


How to send files from Local Machine to HortonBox instance running on Virtual Box?

I'm using Hortonbox 3.0.1 on VirtualBox and SSH into it using PuTTY. I have some files on my local machine (Windows 10) that I want to store in the Hadoop file system.
SSH-ing into the Hortonbox instance gives me a terminal on the instance, which means the files on the Windows machine are not visible from that terminal.
Is there any way I can put files into the HDFS instance?
I am aware of WinSCP, but it does not really serve my purpose. Using WinSCP would mean copying the file onto the VM, using my SSH session to store it in Hadoop, and then deleting the copy from the VM once it is stored on the data nodes. I might be wrong, but this seems like additional and redundant work, and I would always need buffer storage on the machine where Hadoop is running. For extremely large files this will almost certainly fail, since I would first have to store the entire file on the secondary disk and then send it to the data nodes through the name node. Is there any way to achieve this, or is the problem I'm facing due to using a Hortonbox instance? How do organizations handle sending data from several nodes to the namenode and then to the datanodes?
First, you don't send data to the namenode for it to be placed on datanodes. When you issue hdfs put commands, the only information requested from the namenode is the locations where the file's blocks should be placed; the client then writes the data to those datanodes directly.
That being said, if you want to skip SSH entirely, you need to forward the namenode and datanode ports from the VM to your host, then install and configure the hadoop fs/hdfs commands on your Windows host so that you can issue them directly from CMD.
The alternative is to use Fuse/SFTP/NFS/Samba mounts (referred to as a "shared folder" in the VirtualBox GUI) from Windows into the VM, where you could then run put without copying anything into the VM.
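The same namenode-then-datanode flow is visible in the WebHDFS REST interface, which could be another way to upload from Windows without an intermediate copy. The sketch below is only an illustration under several assumptions: WebHDFS is enabled in the sandbox, the namenode HTTP port is 50070 and is forwarded to the host (adjust it to whatever your sandbox uses), and the datanode host name returned in the redirect resolves from Windows. The function names are hypothetical.

```python
import http.client
import os
from urllib.parse import quote, urlencode, urlsplit

NAMENODE_HTTP_PORT = 50070  # assumption: adjust to the port your sandbox forwards


def create_url(namenode: str, hdfs_path: str, user: str,
               port: int = NAMENODE_HTTP_PORT) -> str:
    """Build the WebHDFS CREATE URL that is sent to the namenode."""
    query = urlencode({"op": "CREATE", "user.name": user, "overwrite": "false"})
    return f"http://{namenode}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def upload(local_path: str, namenode: str, hdfs_path: str, user: str) -> int:
    """Two-step upload; returns the HTTP status of the datanode response."""
    # Step 1: the namenode receives no file data. It only answers with a
    # redirect (307) whose Location header points at a datanode.
    first = urlsplit(create_url(namenode, hdfs_path, user))
    conn = http.client.HTTPConnection(first.hostname, first.port)
    conn.request("PUT", f"{first.path}?{first.query}")
    reply = conn.getresponse()
    reply.read()
    location = reply.getheader("Location")
    conn.close()
    if reply.status != 307 or not location:
        raise RuntimeError(f"unexpected namenode reply: {reply.status}")

    # Step 2: stream the file body from local disk straight to the datanode.
    second = urlsplit(location)
    conn = http.client.HTTPConnection(second.hostname, second.port)
    headers = {"Content-Length": str(os.path.getsize(local_path))}
    with open(local_path, "rb") as body:
        conn.request("PUT", f"{second.path}?{second.query}", body=body,
                     headers=headers)
        result = conn.getresponse()
        result.read()
    conn.close()
    return result.status
```

A call such as upload(r"C:\data\big.csv", "localhost", "/user/me/big.csv", "me") (all values hypothetical) would read the file from the Windows disk in blocks, so no staging copy inside the VM is needed.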

Can StreamSets be used to fetch data onto a local system?

Our team is exploring options for fetching data from HDFS to a local system. StreamSets was suggested to us, but no one on the team is familiar with it. Could anyone help me understand whether it fits our requirement, which is to fetch data from HDFS onto our local system?
Just an additional question.
I have set up StreamSets locally, for example on local IP xxx.xx.x.xx:18630, and it works fine on that machine. But when I try to access this URL from another machine on the network, it doesn't work, while my other applications such as Shiny Server work fine with the same mechanism.
Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!
Answering your second question, Data Collector listens on all addresses by default. There is an http.bindHost setting in the config file that you can use to restrict the addresses that Data Collector listens on, but it is commented out by default.
You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:
$ netstat -ant | grep 18630
tcp46 0 0 *.18630 *.* LISTEN
That wildcard, the * in front of 18630 in the output, means that Data Collector will accept connections on any address.
If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
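To tell a firewall or network problem apart from a bind-address problem, it can help to test the port from the other machine. A minimal sketch using only the standard library; the function name is hypothetical, and the host you pass in is whatever address Data Collector runs on:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the full TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

If can_connect(host, 18630) is True on the Data Collector machine but False from another machine on the network, while netstat shows the wildcard listener, the traffic is most likely being blocked somewhere in between.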
I believe by default StreamSets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses.
If you are using the CDH Quickstart VM, you'll need to externally forward that port.
Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. Its production deployments are comparable to Apache NiFi as offered in Hortonworks HDF.
So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.
If you want HDFS exposed as a local device, look into installing an NFS Gateway. Or you can use Streamsets to write to FTP / NFS, probably.
It's not clear what data you're trying to get, but many BI tools can perform CSV exports, and Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is the minimalist way to get data from HDFS to local. However, Hadoop typically stores many TB of data in the ideal case, and if you're working with anything smaller, dumping those results into a database is typically a better option than moving flat files around.

Download a file from HDFS cluster

I am developing an API for using HDFS as distributed file storage. I have made a REST API that allows a server to mkdir, ls, create and delete files in the HDFS cluster using WebHDFS. But since WebHDFS does not support downloading a file, are there any solutions for achieving this? I mean, I have a server that runs my REST API and communicates with the cluster. I know the OPEN operation just supports reading a text file's content, but suppose I have a file which is 300 MB in size: how can I download it from the HDFS cluster? Do you have any possible solutions? I was thinking of directly asking the datanodes for the file, but this solution is flawed, because a 300 MB file will put a huge load on my proxy server. Is there a streaming API to achieve this?
As an alternative you could make use of streamFile provided by DataNode API.
wget http://$datanode:50075/streamFile/demofile.txt
It'll not read the file as a whole, so the burden will be low, IMHO. I have tried it, but on a pseudo setup and it works fine. You can just give it a try on your fully distributed setup and see if it helps.
One way that comes to mind is to use a proxy worker, which reads the file using the Hadoop file system API and creates a normal local file, and then to provide a download link to this file. The downsides are:
Scalability of the proxy server.
Files may theoretically be too large to fit on the disk of a single proxy server.
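The disk-size downside can be softened if the proxy relays the bytes in fixed-size chunks instead of writing a full local copy first. A minimal sketch, assuming WebHDFS is reachable on namenode port 50070 (an assumption, adjust to your cluster) and relying on the OPEN operation, which as far as I know returns the raw file bytes through a redirect to a datanode. The function names and the chunk size are illustrative only.

```python
import urllib.request
from urllib.parse import quote, urlencode

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary illustrative value


def open_url(namenode: str, hdfs_path: str, port: int = 50070) -> str:
    """Build the WebHDFS OPEN URL for a file."""
    query = urlencode({"op": "OPEN"})
    return f"http://{namenode}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def relay(source, sink, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy source to sink chunk by chunk and return the bytes copied.

    At most one chunk is held in memory at a time, so a 300 MB file does
    not have to fit in the proxy's RAM or on its disk.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        total += len(chunk)
    return total


def download(namenode: str, hdfs_path: str, sink) -> int:
    """Stream an HDFS file into any writable binary object."""
    # urlopen follows the namenode's redirect to the datanode for GET.
    with urllib.request.urlopen(open_url(namenode, hdfs_path)) as response:
        return relay(response, sink)
```

Here sink could be the response stream of the REST API itself, so the proxy forwards data to the client while it is still reading from the cluster.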

images cloud-ready for openstack

I have a question about the images to mount on OpenStack.
Can I use any image of any operating system? I guess not... but why?
I found images already suitable for OpenStack, but what's the difference between a cloud-ready image and a normal image?
For instance, can I create a virtual machine with Windows desktop? If not, why?
Thank you.
Cloud-ready images have been customised by the distro maker to run well under a hypervisor such as OpenStack, EC2, kvm, and LXC (not strictly a hypervisor) instead of on physical hardware. This entails removing packages that are only needed in physical environments, such as wireless drivers, and adding packages that are useful in a cloud environment. For example, during the boot process cloud-ready images download metadata from the environment, such as hostname and networking information. This data is used to "personalise" a new instance when it boots up for the first time.
If you really want to get in to the nuts and bolts of things, the Ubuntu UEC Images page has lots of details about the composition of the Ubuntu cloud images and other information like how to build one yourself.
I'm sure you can create a virtual machine running Windows desktop, but I've never had occasion to do so. If you look at the Amazon page about Windows it's all about running server apps like SQL Server and ASP.NET apps.
As Everett Toews pointed out in a comment above, one of the main things for making an image cloud-ready is that it can retrieve data from the metadata server when it boots up. This is used for things like retrieving the SSH public key of the instance's key pair and collecting user data.
In addition to CloudInit, there's also Condenser. Or, you can roll your own. OpenStack uses the same protocol as the Amazon EC2 metadata service, so the EC2 metadata docs explain how to access this data.
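If you do roll your own, the boot-time lookup is a plain HTTP GET against the link-local metadata address. A minimal sketch, assuming the EC2-style endpoint at 169.254.169.254 is reachable from inside the instance; the helper names are hypothetical and error handling is left out.

```python
import urllib.request

# EC2-compatible metadata endpoint, only reachable from inside an instance.
METADATA_BASE = "http://169.254.169.254/latest"


def metadata_url(key: str) -> str:
    """Build the URL for one metadata key, e.g. 'hostname'."""
    return f"{METADATA_BASE}/meta-data/{key.strip('/')}"


def fetch(url: str, timeout: float = 2.0) -> str:
    """GET a metadata or user-data URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Inside an instance, fetch(metadata_url("hostname")) would return the hostname, fetch(metadata_url("public-keys/0/openssh-key")) the injected SSH key, and fetch(METADATA_BASE + "/user-data") the user data, which is roughly what tools like CloudInit do on first boot.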

How do I run my application code (PHP) across my various Amazon EC2 instances?

I've been trying to get to grips with Amazon's AWS services for a client. As is evidenced by the very n00bish question(s) I'm about to ask, I'm having a little trouble wrapping my head round some very basic things:
a) I've played around with a few instances and managed to get LAMP working just fine. The problem I'm having is that the code I place in /var/www doesn't seem to be shared across those machines. What do I have to do to achieve this? I was thinking of a shared EBS volume and changing Apache's document root?
b) Furthermore, what is the best way to upload code and assets to an EBS/S3 volume? Should I set up an instance to handle FTP to the aforementioned shared volume?
c) Finally I have a basic plan for the setup that I wanted to run by someone that actually knows what they are talking about:
DNS pointing to Load Balancer (AWS Elastic Beanstalk)
Load Balancer managing multiple AWS EC2 instances.
EC2 instances sharing code from a single EBS store.
An RDS instance to handle database queries.
Cloud Front to serve assets directly to the user.
Edit: My solution, for anyone that comes across this on Google.
Please note that my setup is not finished yet, and the bash scripts I'm providing in this explanation are probably not very good: even though I'm very comfortable with the command line, I have no experience of scripting in bash. However, it should at least show you how my setup works in theory.
All AMIs are Ubuntu Maverick i386 from Alestic.
I have two AMI Snapshots:
git - Very limited access runs git-shell so can't be accessed via SSH but hosts a git repository which can be pushed to or pulled from.
ubuntu - Default SSH account, used to administer server and deploy code.
Simple git repository hosting via ssh.
Apache and PHP, databases are hosted on Amazon RDS
Right now (this will change) this is how I deploy code to my servers:
1. Merge changes to the master branch on the local machine.
2. Stop all slave instances.
3. Use Git to push the master branch to the master server.
4. Log in as the ubuntu user via SSH on the master server and run a script which does the following:
   a) Exports (git-archive) the code from the local repository to a folder.
   b) Compresses the folder and uploads a backup of the code to S3 with a timestamp attached to the file name.
   c) Replaces the code in /var/www/ with the folder and gives it appropriate permissions.
   d) Removes the exported folder from the home directory but leaves the compressed file, containing the latest code, intact.
5. Start all slave instances. On startup they run a script:
   a) Apache does not start until it's triggered.
   b) Use scp (secure copy) to copy the latest compressed code from the master to /tmp/www.
   c) Extract the code, replace /var/www/ and give appropriate permissions.
   d) Start Apache.
I would provide code examples but they are very incomplete and I need more time. I also want to get all my assets (css/js/img) automatically pushed to S3 so they can be distributed to clients via CloudFront.
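For readers who want a starting point, the master-side step that compresses the exported folder with a timestamp in the file name could look roughly like this. It is a sketch in Python rather than the author's bash, and the function name, the "release-" prefix and the timestamp format are all assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def archive_release(export_dir: str, backup_dir: str,
                    now: Optional[float] = None) -> Path:
    """Compress export_dir into backup_dir/release-<UTC timestamp>.tar.gz.

    `now` is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    base = Path(backup_dir) / f"release-{stamp}"
    # make_archive appends the .tar.gz suffix and returns the final path.
    return Path(shutil.make_archive(str(base), "gztar", root_dir=export_dir))
```

The returned path is the file that would then be uploaded to S3 and later copied to the slaves with scp.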
EBS is like a hard drive that you can attach to one instance, basically a 1:1 mapping. S3 is the only shared storage in AWS; otherwise you will need to set up an NFS server or similar.
What you can do is put all your PHP files on S3 and then sync them down to a new instance when you start it.
I would recommend bundling a custom AMI with everything you need installed (Apache, PHP, etc.) and setting up a cron job to sync PHP files from S3 to your document root. Your workflow would be: upload files to S3, then let the server's cron job sync the files.
The rest of your setup seems pretty standard.
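The cron-driven sync could be a very small script. A minimal sketch, assuming the AWS command line tool is installed on the instance and has credentials that can read the bucket; the bucket name, prefix and function names are hypothetical.

```python
import subprocess
from typing import List


def sync_command(bucket: str, prefix: str, docroot: str) -> List[str]:
    """Build the `aws s3 sync` command that mirrors S3 into docroot."""
    cleaned = prefix.strip("/")
    source = f"s3://{bucket}/{cleaned}/" if cleaned else f"s3://{bucket}/"
    # --delete removes local files that no longer exist in the bucket.
    return ["aws", "s3", "sync", source, docroot, "--delete"]


def run_sync(bucket: str, prefix: str, docroot: str) -> None:
    """Run the sync and raise CalledProcessError if the CLI fails."""
    subprocess.run(sync_command(bucket, prefix, docroot), check=True)
```

A crontab entry would then call a script that runs run_sync("my-code-bucket", "site", "/var/www") every few minutes, and the same call at boot brings a fresh instance up to date.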
