I am building a system that is supposed to list files on a remote SFTP server and then download them locally. I want this to run in parallel, so that I can initiate one job for each file to be downloaded, or set an upper limit of, say, 10 simultaneous downloads.
I am new to Airflow and still don't fully understand everything. I assume there is a solution for this, but I just can't figure it out.
This is the code; currently I download all files in one operator, but as far as I know, it is not using multiple workers.
def transfer_files():
    for i in range(1, 11):
        sftp.get(REMOTE_PATH + 'test_{}.csv'.format(i), LOCAL_PATH + 'test_{}.csv'.format(i))
Assuming you are using PythonOperator, you can start multiple PythonOperators; it would look something like this:
def get_my_file(i):
    sftp.get(REMOTE_PATH + 'test_{}.csv'.format(i), LOCAL_PATH + 'test_{}.csv'.format(i))

def transfer_files():
    for i in range(1, 11):
        task = PythonOperator(
            task_id='test_{}.csv'.format(i),
            python_callable=get_my_file,
            op_args=[i],
            dag=dag)
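For completeness, here is a minimal sketch of what the whole DAG file might look like, with the tasks created at module level (so the scheduler can see them) and the DAG's concurrency setting used to cap simultaneous downloads at 10. This is only a sketch assuming Airflow 1.x-style imports; sftp, REMOTE_PATH and LOCAL_PATH are the objects from the question and are not defined here.

# Sketch only: sftp, REMOTE_PATH and LOCAL_PATH come from the question's setup.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x-style import

dag = DAG(
    dag_id='sftp_parallel_download',
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    concurrency=10,  # at most 10 tasks of this DAG run at the same time
)

def get_my_file(i):
    # one download per task, so the scheduler can run them on different workers
    sftp.get(REMOTE_PATH + 'test_{}.csv'.format(i), LOCAL_PATH + 'test_{}.csv'.format(i))

for i in range(1, 11):
    PythonOperator(
        task_id='download_test_{}'.format(i),
        python_callable=get_my_file,
        op_args=[i],
        dag=dag,
    )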
I have two load tests below, each in its own test case. This is using SoapUI free:
Currently I have to manually select a load test, run it, wait until it finishes, and then manually export the results before moving on to the next load test and doing the same again.
Is there a way (and if so, how) to automatically run all the load tests one by one and export each one's results to a file (test step, min, max, avg, etc.)? This would save the tester from manual intervention; they could just let the tests run while they do other work.
You can use the load test command-line runner; the doc is here.
Something like
loadtestrunner -ehttp://localhost:8080/services/MyService c:\projects\my-soapui-project.xml -r -f folder_name
Using these two options:
-r : Turns on exporting of a LoadTest statistics summary report
-f : Specifies the root folder to which test results should be exported
Then a file like LoadTest_1-statistics.txt will appear in your specified folder with the CSV statistics results.
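If you want to drive this from a script instead of typing the command by hand, here is a rough, untested Python sketch that runs the same command and then collects the exported statistics files. The endpoint, project path and folder are the same placeholders as above, and loadtestrunner is assumed to be on the PATH.

import glob
import subprocess

cmd = [
    'loadtestrunner',                                  # or the full path to loadtestrunner.bat/.sh
    '-ehttp://localhost:8080/services/MyService',
    r'c:\projects\my-soapui-project.xml',
    '-r',
    '-f', 'folder_name',
]
subprocess.check_call(cmd)

# Each load test exports a <LoadTestName>-statistics.txt file with CSV content.
for stats_file in glob.glob('folder_name/*-statistics.txt'):
    with open(stats_file) as f:
        print(stats_file)
        print(f.read())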
(inspired by the answer of @aristotll)
loadtestrunner.bat runs the following class: com.eviware.soapui.tools.SoapUITestCaseRunner
From Groovy you can call the same thing like this:
com.eviware.soapui.tools.SoapUITestCaseRunner.main([
    "-ehttp://localhost:8080/services/MyService",
    "c:\\projects\\my-soapui-project.xml",
    "-r",
    "-f",
    "folder_name"
] as String[])
But the main method calls System.exit()... and SoapUI will exit in this case.
So let's go deeper:
def res = new com.eviware.soapui.tools.SoapUITestCaseRunner().runFromCommandLine([
    "-ehttp://localhost:8080/services/MyService",
    "c:\\projects\\my-soapui-project.xml",
    "-r",
    "-f",
    "folder_name"
] as String[])
assert res == 0 : "SoapUITestCaseRunner failed with code $res"
PS: not tested, just an idea.
I wrote a simple Python script to automatically detect a file added to a directory, and then do something with that newly added file. One issue I tried to solve is how to determine whether the file copy process has completed.
1) How to detect a file added to the directory - solved:
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
2) How to detect that the file copy process has completed?
One solution is to compare file sizes, as suggested in the post below. But when I tried this method, I found it did not work: while the file was still copying (the Win7 pop-up window still showed 13 minutes left), the code indicated that the copy had already completed.
Can anyone help check why it did not work? Is there a better way to detect that a file copy has completed?
Python - How to know if a file is completed when copying from an outside process
file_size = 0
while True:
    file_info = os.stat(file_path)
    if file_info.st_size == 0 or file_info.st_size > file_size:
        file_size = file_info.st_size
        sleep(1)
    else:
        break
I came up with the following solution myself.
Rather than checking the file size, I try to open the file. If an IOError is raised, I sleep for 2 seconds and then retry.
Let me know if you have a better solution.
result = None
while result is None:
    try:
        result = open(logPath)
    except IOError:
        time.sleep(2)
result.close()
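The same idea, wrapped in a small helper with a timeout so it cannot loop forever (a sketch, assuming the Windows copy keeps the file locked until it finishes, as observed above; logPath would be passed in as the path):

import time

def wait_until_copied(path, poll_seconds=2, timeout_seconds=600):
    """Return True once `path` can be opened, i.e. the copying process released it."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            with open(path, 'rb'):
                return True               # open succeeded: the copy has (most likely) finished
        except IOError:
            time.sleep(poll_seconds)      # still locked by the copying process, retry later
    return False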
I'm trying to use parallel computing from the IPython parallel library, but I have little knowledge about it, and I find the doc difficult to read for someone who knows nothing about parallel computing.
Funnily enough, all the tutorials I found just re-use the example from the doc, with the same explanation, which from my point of view is useless.
Basically what I'd like to do is run a few scripts in the background so they are executed at the same time. In bash it would be something like:
for my_file in $(cat list_file); do
    python pgm.py $my_file &
done
But the bash interpreter of the IPython notebook doesn't handle background mode.
It seemed the solution was to use the parallel library from IPython.
I tried:
from IPython.parallel import Client
rc = Client()
rc.block = True
dview = rc[:2] # I take only 2 engines
But then I'm stuck. I don't know how to run the same script or program twice (or more) at the same time.
Thanks.
One year later, I eventually managed to get what I wanted.
1) Create a function with what you want to do on the different CPUs. Here it just calls a script from bash with the ! magic IPython command. I guess it would also work with subprocess.call() (a sketch of that variant follows below).
def my_func(my_file):
    !python pgm.py {my_file}
Don't forget the {} when using !
Note also that the path to my_file should be absolute, since the engines run where you started the notebook (when doing jupyter notebook or ipython notebook), which is not necessarily where you are.
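As a hypothetical sketch of the subprocess variant mentioned above (pgm.py and my_file are the placeholders from the question, and the path should again be absolute):

def my_func(my_file):
    import subprocess   # import inside the function so it is also available on the engines
    return subprocess.call(['python', 'pgm.py', my_file])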
2) Start your IPython notebook cluster with the number of CPUs you want.
Wait a couple of seconds and execute the following cell:
from IPython import parallel
rc = parallel.Client()
view = rc.load_balanced_view()
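Optionally, you can check that the engines actually registered before sending them work (rc.ids lists the engine ids the client can see):

print(rc.ids)   # e.g. [0, 1, 2, 3] for a 4-engine cluster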
3) Get a list of the files you want to process:
files = list_of_files
4) Asynchronously map your function over all your files using the view of the engines you just created (not sure of the wording):
r = view.map_async(my_func, files)
While it's running, you can do something else in the notebook (it runs in the "background"!). You can also call r.wait_interactive(), which interactively shows the number of files processed, the time spent so far, and the number of files left. This will prevent you from running other cells (but you can interrupt it).
And if you have more files than engines, no worries: they will be processed as soon as an engine finishes with a file.
Hope this will help others!
This tutorial might be of some help:
http://nbviewer.ipython.org/github/minrk/IPython-parallel-tutorial/blob/master/Index.ipynb
Note also that I still have IPython 2.3.1; I don't know if things have changed since Jupyter.
Edit: it still works with Jupyter; see here for differences and potential issues you may encounter.
Note that if you use external libraries in your function, you need to import them on the different engines with:
%px import numpy as np
or
%%px
import numpy as np
import pandas as pd
Same with variables and other functions; you need to push them to the engines' namespace:
rc[:].push(dict(
    foo=foo,
    bar=bar))
If you're trying to execute some external scripts in parallel, you don't need to use IPython's parallel functionality. Replicating bash's parallel execution can be achieved with the subprocess module as follows:
import subprocess

procs = []
for i in range(10):
    procs.append(subprocess.Popen(['ls', '/Users/shad/tmp/'], stdout=subprocess.PIPE))

results = []
for proc in procs:
    stdout, _ = proc.communicate()
    results.append(stdout)
Be wary that if a subprocess generates a lot of output, it can block until its output is read (the pipe buffer fills up). If you print the output (results) you get:
print results
['file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n', 'file1\nfile2\n']
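If you also want to reproduce the "only a couple at a time" behaviour from the question with plain subprocess, here is a sketch using only the standard library; pgm.py and the file names are placeholders from the question:

import subprocess
from multiprocessing.dummy import Pool  # thread pool; enough for waiting on subprocesses

def run_one(my_file):
    return subprocess.call(['python', 'pgm.py', my_file])

files = ['file_1', 'file_2', 'file_3']      # placeholder list of files
pool = Pool(2)                              # at most 2 scripts running at the same time
return_codes = pool.map(run_one, files)
pool.close()
pool.join()
print(return_codes)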
A Python GUI that I am developing executes an exe file in the same directory. I need to allow the user to open multiple instances of the GUI. This results in the same exe being called simultaneously, which raises the following error: "the process cannot access the file because it is being used by another process". I use a dedicated thread in the Python GUI to run the exe.
How can I allow multiple GUIs to run the same exe simultaneously?
I would appreciate code examples.
Following is the thread; the run method includes the execution of the exe. The exe was made using Fortran.
class LineariseThread(threading.Thread):
    def __init__(self, parent):
        threading.Thread.__init__(self)
        self._parent = parent

    def run(self):
        self.p = subprocess.Popen([exe_linearise], shell=True, stdout=subprocess.PIPE)
        print threading.current_thread()
        print "Subprocess started"
        while True:
            line = self.p.stdout.readline()
            if not line:
                break
            print line.strip()
            self._parent.status.SetStatusText(line.strip())
            # Publisher().sendMessage(('change_statusbar'), line.strip())
            sys.stdout.flush()
        if not self.p.poll():
            print " process done"
            evt_show = LineariseEvent(tgssr_show, -1)
            wx.PostEvent(self._parent, evt_show)

    def killtree(self, pid):
        print pid
        parent = psutil.Process(pid)
        print "in killtree sub: "
        for child in parent.get_children(recursive=True):
            child.kill()
        parent.kill()

    def abort(self):
        if self.isAlive():
            print "Linearisation thread is alive"
            # kill the respective subprocesses
            if not self.p.poll():
                # stop them all
                self.killtree(int(self.p.pid))
                self._Thread__stop()
                print str(self.getName()) + " could not be terminated"
        self._parent.LineariseThread_killed = True
I think I figured out a way to avoid the error. It was actually not the execution of the exe that raised the error; the error was raised when the exe accessed other files that were locked by another instance of the same exe. Therefore, I decided not to allow multiple instances of the exe to run. Instead, I allow multiple cases to be opened within a single instance. That way I can manage the process threads so as to avoid the issue mentioned above.
I should mention that the comments I received helped me study the error messages in detail and figure out what was really going on.
I'm trying to do some data analysis on Amazon Elastic MapReduce. The mapper step is a Python script that includes a call to a compiled C++ binary called "./formatData". For example:
# myMapper.py
import sys
from subprocess import *

inputData = sys.stdin.readline()
# ...
p1 = Popen('./formatData', stdin=PIPE, stdout=PIPE)
p1Output, _ = p1.communicate(input=inputData)
result = ...  # manipulate the formatted data
print "%s\t%s" % (result, 1)
Can I call a binary executable like this on Amazon EMR? If so, where should I store the binary (in S3?), for what platform should I compile it, and how do I ensure my mapper script has access to it (ideally it would be in the current working directory)?
Thanks!
You can call the binary that way, as long as you make sure the binary gets copied to the worker nodes correctly.
See:
https://forums.aws.amazon.com/thread.jspa?threadID=35158
for an explanation of how to use the distributed cache to make binary files accessible on the worker nodes.
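As a sketch (not tested on EMR), once the binary has been shipped via the distributed cache the mapper can assume it sits in the task's current working directory and defensively restore its execute bit before calling it; the surrounding names mirror the question's example:

# Sketch of a mapper that expects ./formatData in the task's working directory.
import os
import stat
import sys
from subprocess import Popen, PIPE

binary = os.path.join(os.getcwd(), 'formatData')
os.chmod(binary, os.stat(binary).st_mode | stat.S_IEXEC)  # restore the execute bit if the copy lost it

for inputData in sys.stdin:
    p1 = Popen([binary], stdin=PIPE, stdout=PIPE)
    p1Output, _ = p1.communicate(input=inputData)
    result = p1Output.strip()                    # placeholder for the real manipulation
    sys.stdout.write('%s\t%s\n' % (result, 1))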