Calling a compiled binary on Amazon MapReduce - hadoop

I'm trying to do some data analysis on Amazon Elastic MapReduce. The mapper step is a python script which includes a call to a compiled C++ binary called "./formatData". For example:
# myMapper.py
import sys
from subprocess import Popen, PIPE
inputData = sys.stdin.readline()
# ...
p1 = Popen('./formatData', stdin=PIPE, stdout=PIPE)
p1Output, _ = p1.communicate(input=inputData)
result = ...  # manipulate the formatted data
print "%s\t%s" % (result, 1)
Can I call a binary executable like this on Amazon EMR? If so, where would I store the binary (in S3?), for what platform should I compile it, and how do I ensure my mapper script has access to it (ideally it would be in the current working directory)?
Thanks!

You can call the binary that way, as long as you make sure the binary gets copied to the worker nodes correctly.
See:
https://forums.aws.amazon.com/thread.jspa?threadID=35158
for an explanation of how to use the distributed cache to make the binary accessible on the worker nodes.
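For illustration, here is a minimal sketch of what the mapper side could look like, assuming the binary has been placed in the task's working directory by the distributed cache under the link name formatData (the link name and the chmod step are assumptions; the executable bit is sometimes lost in transit):
# myMapper.py -- hedged sketch; assumes ./formatData was shipped to the
# worker via the distributed cache and appears in the working directory
import os
import stat
import sys
from subprocess import Popen, PIPE
BINARY = './formatData'  # assumed cache link name
# Restore the executable bit in case it was lost during distribution.
mode = os.stat(BINARY).st_mode
os.chmod(BINARY, mode | stat.S_IEXEC)
for line in sys.stdin:
    p = Popen([BINARY], stdin=PIPE, stdout=PIPE)
    formatted, _ = p.communicate(input=line)
    result = formatted.strip()  # placeholder for the real post-processing
    print "%s\t%s" % (result, 1)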

Related

Is there a way to change the working directory of fiddle?

I'm trying to load a C shared library within Ruby using Fiddle.
Here is a minimal example:
require 'fiddle'
require 'fiddle/import'
module Era
  extend Fiddle::Importer
  dlload './ServerApi.so'
  extern 'int era_init_lib()'
  extern 'void era_deinit_lib()'
  extern 'int era_process_request(const char* request, char** response)'
  extern 'void era_free(char* response)'
end
Era.era_init_lib
begin
  # ...
ensure
  Era.era_deinit_lib
end
The shared library loads without issues. However, when I call Era.era_init_lib, it tries to load additional libraries (Network.so and Protobuf.so). I have these files located in the current working directory (in the same directory as ServerApi.so).
However, when I try to execute the code above, I receive the following error:
! Failed to load library: /home/username/.rvm/rubies/ruby-2.6.5/bin/Network.so, error: /home/username/.rvm/rubies/ruby-2.6.5/bin/Network.so: cannot open shared object file: No such file or directory
If I place the file at the location the error describes, everything works fine.
My guess is that the C working directory of Fiddle is different from the Ruby working directory. I would like to keep the project files within the project and not in the Ruby installation directory.
How can I use Network.so from my project folder?
All the *.so files are provided by a third party. I do not have the source and as a result cannot change these files. The function signatures are provided by the documentation.
Searching for Network.so in the strace output gives me these results:
readlink("/proc/self/exe", "/home/username/.rvm/rubies/ruby-2."..., 4096) = 44
openat(AT_FDCWD, "/home/username/.rvm/rubies/ruby-2.6.5/bin/Network.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
futex(0x7fcc16666d90, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x7fcc16b44520, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "! Failed to load library: ", 26! Failed to load library: ) = 26
write(2, "/home/username/.rvm/rubies/ruby-2."..., 50/home/username/.rvm/rubies/ruby-2.6.5/bin/Network.so) = 50
write(2, ", error: ", 9, error: ) = 9
write(2, "/home/username/.rvm/rubies/ruby-2."..., 109/home/username/.rvm/rubies/ruby-2.6.5/bin/Network.so: cannot open shared object file: No such file or directory) = 109
write(2, "\n", 1) = 1
I've also written a small C program that does the same thing, and it works perfectly fine when the files are dropped into the same directory. So it might be the fault of the library, which I assume checks the location of the currently running program and then tries to load the libraries from that folder. This would explain the behavior when run as a Ruby script (since it runs as part of the Ruby program), whereas a C binary runs standalone.
For those that want to re-create the (Linux) issue: you can download the necessary files from here, which gives you the server-linux-x86_64.sh file.
Supported distros are Suse, Ubuntu, Debian, Red Hat and CentOS, but others may also work fine.
You can either run the installer, which should place the files in /opt/eset/RemoteAdministrator/Server, or, assuming most of you don't want to install the full application, run the following command:
sed '1,/^# Start of TAR\.GZ file #$/d' server-linux-x86_64.sh | sed '1d' > server-linux-x86_64.tar.gz
This removes all the installer instructions from the .sh file and leaves only the binary .tar.gz data, writing it to server-linux-x86_64.tar.gz.
Copy the files ServerApi.so, Protobuf.so and Network.so into a directory of your liking. Create a Ruby script (with the code from the question) in the same directory and run the script.
Because ServerApi.so checks /proc/self/exe for the location of all subsequent files to load, and it is very difficult to modify this target by normal means, it is easier to just modify ServerApi.so itself so that it uses something else besides proc for the source.
If we run strings ServerApi.so, we can verify that the location to check is stored inside a string in ServerApi.so:
strings ServerApi.so | grep 'proc/self/exe'
B/proc/self/exe
So now all we need to do is modify this string to something else that works for us.
The easiest way to modify the string is to replace it with something that is exactly the same length as the original. This way we do not have to worry about changing the end-of-string zero padding or accidentally changing the total size of ServerApi.so.
Here we can see a suitable candidate could be /tmp/scriptexe:
/proc/self/exe
/tmp/scriptexe <- same length
So let's do that:
sed -e 's/proc\/self\/exe/tmp\/scriptexe/' ServerApi.so > ServerApi_Mod.so
Now we can verify the change:
strings ServerApi_Mod.so | grep scriptexe
B/tmp/scriptexe
Next we need to create /tmp/scriptexe to actually point to our Ruby script:
ln -s /the/full/path/to/our/ruby/script.rb /tmp/scriptexe
Then we modify our script:
dlload './ServerApi_Mod.so'
Now we can run it as normal:
ruby script.rb
And everything should work.
If we read the strace output we see that the library obtains the current executable location from /proc/self/exe, and then searches subsequent libraries from there.
/proc/self/exe is not easily modifiable, but by using a hard link to a Ruby executable in the current directory we can trick it into pointing to a new folder.
The problem is that making a hard link requires root.
In any case, here is a self-contained solution (note that it will ask for the root password the first time you run it, in order to create the hard link).
Put this at the top of your script:
# Obtain path to current executable
exe = File.readlink("/proc/self/exe")
# Check if we are running the hard-linked version
if !exe.match(/localruby/)
  if !File.exist?('localruby')
    # Create a hard link to the current Ruby exe using sudo
    system("sudo ln #{exe} localruby")
  end
  puts "Restarting..."
  # In order to prevent an infinite busy loop in case of some mishap
  sleep 1
  # Rerun self using the hard-linked Ruby executable.
  # This will make /proc/self/exe point to the hard link, which then
  # allows the ESET library to search for .so files in the current folder.
  exec('./localruby', File.expand_path(__FILE__))
end
require 'fiddle'
require 'fiddle/import'
# ...rest of your script goes here...
A simple solution without any extra Ruby code is to just create the hard link manually, and then always run the script with ./localruby myscript.rb, instead of using the normal ruby myscript.rb.

Terraform lambda source_code_hash update with same code

I have an AWS Lambda deployed successfully with Terraform:
resource "aws_lambda_function" "lambda" {
filename = "dist/subscriber-lambda.zip"
function_name = "test_get-code"
role = <my_role>
handler = "main.handler"
timeout = 14
reserved_concurrent_executions = 50
memory_size = 128
runtime = "python3.6"
tags = <my map of tags>
source_code_hash = "${base64sha256(file("../modules/lambda/lambda-code/main.py"))}"
kms_key_arn = <my_kms_arn>
vpc_config {
subnet_ids = <my_list_of_private_subnets>
security_group_ids = <my_list_of_security_groups>
}
environment {
variables = {
environment = "dev"
}
}
}
Now, when I run the terraform plan command, it says my Lambda resource needs to be updated because the source_code_hash has changed, but I didn't update the Lambda Python codebase (which is versioned in a folder of the same repo):
~ module.app.module.lambda.aws_lambda_function.lambda
last_modified: "2018-10-05T07:10:35.323+0000" => <computed>
source_code_hash: "jd6U44lfe4124vR0VtyGiz45HFzDHCH7+yTBjvr400s=" => "JJIv/AQoPvpGIg01Ze/YRsteErqR0S6JsqKDNShz1w78"
I suppose it is because it compresses my Python sources each time and the resulting archive changes. How can I avoid that if there are no changes in the Python code? Is my hypothesis coherent, given that I didn't change the Python codebase (I mean, why does the hash change then)?
This is because you are hashing just main.py but uploading dist/subscriber-lambda.zip. Terraform compares the hash to the hash it calculates when the file is uploaded to lambda. Since the hashing is done on two different files, you end up with different hashes. Try running the hash on the exact same file that is being uploaded.
This works for me and also doesn't trigger an update on the Lambda function when the code hasn't changed
data "archive_file" "lambda_zip" {
type = "zip"
source_dir = "../dist/go"
output_path = "../dist/lambda_package.zip"
}
resource "aws_lambda_function" "aggregator_func" {
description = "MyFunction"
function_name = "my-func-${local.env}"
filename = data.archive_file.lambda_zip.output_path
runtime = "go1.x"
handler = "main"
source_code_hash = data.archive_file.lambda_zip.output_base64sha256
role = aws_iam_role.function_role.arn
timeout = 120
publish = true
tags = {
environment = local.env
}
}
I'm going to add my answer to contrast with the one @ODYN-Kon provided.
The source code hash field in resource "aws_lambda_function" is not compared to some hash of the zip you upload. Instead, the hash is merely checked against the Terraform saved state from the last time it ran. So, the next time you run Terraform, it computes the hash of the actual python file to see if it has changed. If it has, it assumes that the zip has been changed and the Lambda function resource needs to be run again. The source_code_hash can have any value you want to give it or it can be omitted entirely. You could set it to a constant of some arbitrary string, and then it would never change unless you edit your Terraform configuration.
Now, the problem there is that Terraform assumes you updated the zip file. Assuming you only have one directory or one file in the zip archive, you can use the Terraform data source archive_file to create the zip file. I have a case where I cannot use that because I need a directory and a file (JS world: source + node_modules/). But here is how you can use that:
data "archive_file" "lambdaCode" {
type = "zip"
source_file = "lambda_process_firewall_updates.js"
output_path = "${var.lambda_zip}"
}
Alternatively, you can archive an entire directory if you replace the source_file argument with source_dir = "node_modules".
Once you do this, you can reference the hash of the zip archive inside the resource "aws_lambda_function" "lambda" { ... } block as "${data.archive_file.lambdaCode.output_base64sha256}" for the field source_code_hash. Then, anytime the zip changes, the Lambda function gets updated. And the archive_file data source knows that anytime the source_file changes it must regenerate the zip.
Now, I haven't drilled down to a root cause in your case, but hopefully given some help to get to a better place. You can check the saved state of Terraform via: tf state list - which lists the items of saved state. You can find the one that matches your lambda function block and then execute tf state show <state-name>. For example, for one I am working on:
tf state show aws_lambda_function.test-lambda-networking gives about 30 lines of output, including:
source_code_hash = 2fKX9v/duluQF0H6O9+iRnID2gokhfpXIXpxyeVBUM0=
You can compare the hash via command line commands. Example on MacOS: sha256sum my-lambda.zip, where sha256sum was installed by brew install coreutils.
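Note that sha256sum prints a hex digest, while Terraform stores the base64-encoded raw digest in source_code_hash; a short Python snippet (just a convenience sketch, with the zip path as a placeholder) produces the value in the same format for a direct comparison:
# Sketch: compute the base64-encoded SHA-256 of the deployment zip, which is
# the format Terraform keeps in source_code_hash (zip path is a placeholder).
import base64
import hashlib
with open('my-lambda.zip', 'rb') as f:
    digest = hashlib.sha256(f.read()).digest()
print(base64.b64encode(digest).decode())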
As mentioned, the use of archive_file doesn't work when you have multiple elements of the zip which are not isolated to a single directory. I think that probably happens a lot, so I wish the Hashicorp guys would extend archive_file to support multiple sources. I even went looking at the Go code, but that is a rainy-day project. One variation I use is to take the source_code_hash to be "${base64sha256(file("my-lambda.zip"))}". But that still requires me to run tf twice.
As others have said, your zip should be used both in your filename and in your hash.
I want to mention that you can also get similar recreation issues if you use the wrong hash function in your Lambda definitions. For example, filesha256("file.zip") will also recreate your Lambdas every time. You have to use filebase64sha256("file.zip") (Terraform 0.11.12+) or base64sha256(file("file.zip")), as mentioned under source_code_hash here.

Run multiple iterations of one Airflow Operator

I am building a system that is supposed to list files on a remote SFTP server and then download the files locally. I want this to run in parallel such that I can initiate one job for each file to be downloaded, or an upper limit of, say, 10 simultaneous downloads.
I am new to Airflow and still not fully understanding everything. I assume there should be a solution to do this but I just can't figure it out.
This is the code; currently I download all files in one Operator, but as far as I know it is not using multiple workers.
def transfer_files():
    for i in range(1, 11):
        sftp.get(REMOTE_PATH + 'test_{}.csv'.format(i), LOCAL_PATH + 'test_{}.csv'.format(i))
Assuming you are using PythonOperator, you can start multiple PythonOperators; it would look something like this:
def get_my_file(i):
    sftp.get(REMOTE_PATH + 'test_{}.csv'.format(i), LOCAL_PATH + 'test_{}.csv'.format(i))
def transfer_files():
    for i in range(1, 11):
        task = PythonOperator(
            task_id='test_{}.csv'.format(i),
            python_callable=get_my_file,
            op_args=[i],
            dag=dag)
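For completeness, a minimal end-to-end sketch of that idea follows; the DAG name, schedule, start date, and the pre-connected sftp client are assumptions. The operators are created at module (parse) level so Airflow registers one task per file, and the DAG-level concurrency setting caps how many run at once:
# Hedged sketch: assumed Airflow 1.x import paths and a pre-connected `sftp`
# client as in the question; one task per file, at most 10 running at once.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
REMOTE_PATH = '/remote/path/'  # placeholder paths
LOCAL_PATH = '/local/path/'
def get_my_file(i):
    sftp.get(REMOTE_PATH + 'test_{}.csv'.format(i), LOCAL_PATH + 'test_{}.csv'.format(i))
dag = DAG(
    dag_id='sftp_downloads',  # assumed DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
    concurrency=10,  # upper bound on simultaneously running tasks
)
for i in range(1, 11):
    PythonOperator(
        task_id='download_test_{}'.format(i),
        python_callable=get_my_file,
        op_args=[i],
        dag=dag,
    )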

Run arbitrary ruby code in a chef cookbook

I have a simple Chef cookbook, and all it does is set the MOTD on a CentOS machine. It takes the content of /tmp/mymotd.txt and turns it into the MOTD.
I also have a simple Ruby script (a full-fledged Ruby script) that simply reads text from a web server and puts it into /tmp/mymotd.txt.
My questions are:
how do I run this Ruby script from within the cookbook?
how do I pass some parameters to the script (e.g. the address of the web server)?
Thanks a lot beforehand.
Ad 1.
You can use the cookbook's libraries directory to place your Ruby script there and declare it in a module. Example:
# includes
module MODULE_NAME
  # here some code using your script
  # Example function
  def example_function(text)
    # some code
  end
end
You can then use
include MODULE_NAME
in your recipe to import those functions and just use them like
example_function(something)
What's good is that you can also use Chef functions and resources there.
IMPORTANT INFO: Just remember that Chef runs in two phases: first all of the Ruby code is compiled, then all of the Chef resources are converged. This means you have to keep the order of execution in mind. I won't go into more detail here, since you haven't asked about it, but if you want, you can find it here.
Ad 2.
You can do this in several ways, but it seems to me that the best option for you would be to use environments. You can find more info here. Basically, you can set up the environment for the script before it runs; this way you can define some variables you will use later.
Hope this helps.

ipython notebook : how to parallelize external script

I'm trying to use parallel computing from the IPython parallel library. But I have little knowledge about it, and I find the doc difficult to read for someone who knows nothing about parallel computing.
Funnily enough, all the tutorials I found just re-use the example from the doc, with the same explanation, which from my point of view is useless.
Basically what I'd like to do is run a few scripts in the background so they are executed at the same time. In bash it would be something like:
for my_file in $(cat list_file); do
    python pgm.py "$my_file" &
done
But the bash interpreter of the IPython notebook doesn't handle background mode.
It seems the solution is to use the parallel library from IPython.
I tried :
from IPython.parallel import Client
rc = Client()
rc.block = True
dview = rc[:2] # I take only 2 engines
But then I'm stuck. I don't know how to run the same script or program twice (or more) at the same time.
Thanks.
One year later, I eventually managed to get what I wanted.
1) Create a function with what you want to do on the different CPUs. Here it just calls a script from bash with the ! IPython magic command. I guess it would also work with the call() function.
def my_func(my_file):
    !python pgm.py {my_file}
Don't forget the {} when using !
Note also that the path to my_file should be absolute, since the cluster engines run from wherever you started the notebook (when doing jupyter notebook or ipython notebook), which is not necessarily where you are.
2) Start your IPython notebook cluster with the number of CPUs you want.
Wait a couple of seconds and execute the following cell:
from IPython import parallel
rc = parallel.Client()
view = rc.load_balanced_view()
3) Get a list of the files you want to process:
files = list_of_files
4) Asynchronously map your function over all your files on the view of the engines you just created (not sure of the wording).
r = view.map_async(my_func, files)
While it's running you can do something else in the notebook (it runs in the "background"!). You can also call r.wait_interactive(), which interactively reports the number of files processed, the time spent so far, and the number of files left. This will prevent you from running other cells (but you can interrupt it).
And if you have more files than engines, no worries, they will be processed as soon as an engine finishes with one file.
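As a small aside, once everything has finished you can also collect the per-file return values from the result object (my_func as written returns nothing interesting):
r.wait()           # block until every file has been processed
outputs = r.get()  # per-file return values, in the order of `files`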
Hope this will help others!
This tutorial might be of some help:
http://nbviewer.ipython.org/github/minrk/IPython-parallel-tutorial/blob/master/Index.ipynb
Note also that I still have IPython 2.3.1; I don't know if it has changed since Jupyter.
Edit: it still works with Jupyter; see here for differences and potential issues you may encounter.
Note that if you use external libraries in your function, you need to import them on the different engines with:
%px import numpy as np
or
%%px
import numpy as np
import pandas as pd
The same goes for variables and other functions: you need to push them to the engines' namespace:
rc[:].push(dict(
    foo=foo,
    bar=bar))
If you're trying to execute some external scripts in parallel, you don't need to use IPython's parallel functionality. Replicating bash's parallel execution can be achieved with the subprocess module as follows:
import subprocess
procs = []
for i in range(10):
    procs.append(subprocess.Popen(['ls', '/Users/shad/tmp/'], stdout=subprocess.PIPE))
results = []
for proc in procs:
    stdout, _ = proc.communicate()
    results.append(stdout)
Be wary that if a subprocess generates a lot of output, it can block (the pipe buffer fills up before communicate() is called). If you print the output (results) you get:
print results
['file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n']
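If you also want to cap how many scripts run at once and sidestep that blocking concern, one possible variation (a sketch; pgm.py and the file list are placeholders) is to drive the same subprocess calls from a thread pool:
# Sketch: run the external script for each file with at most 4 at a time;
# check_output drains stdout in the worker thread, so large output does not
# stall the parent the way an unread PIPE can.
import subprocess
from concurrent.futures import ThreadPoolExecutor
def run_script(my_file):
    return subprocess.check_output(['python', 'pgm.py', my_file])
files = ['a.csv', 'b.csv', 'c.csv']  # placeholder file list
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_script, files))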
