I'm new to Dask and Parallel processing. I have several hdf5 files and I hope to run each through a function that produces a numerical output. Within the function, the hdf5 is turned into a dask array. I was wondering what would be the fastest method to parallelize the code so that each hdf5 file can run through the function at the same time. Should I be converting the hdf5 files into dask arrays outside of the function?
The question is a bit abstract but you can load the data using the read_hdf method of dask.dataframe.
Then do the required computations on it with your function (using apply or map_partitions or applymap).
You can later convert the result to dask arrays if needed.
Note that you can read several HDF5 files at once using this syntax:
dd.read_hdf('myfile.*.hdf5', '/x')
More info:
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_hdf
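For concreteness, here is a minimal sketch of that workflow; the file pattern, HDF key, and the 'value' column are placeholders, not something from the question:

import dask.dataframe as dd

# lazily load every matching HDF5 file into one dask dataframe
df = dd.read_hdf('myfile.*.hdf5', '/x')

# do your computation; nothing runs until compute() is called,
# and the partitions are processed in parallel
result = df['value'].mean().compute()

# if you really need a dask array, the dataframe's data is available as one
arr = df.values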
Related
I am working on a data wrangling problem using Python,
which processes a dirty Excel file into a clean Excel file.
I would like to process multiple input files by introducing concurrency/parallelism.
I have the following options: 1) using multithreading, 2) using the multiprocessing module, 3) using the ParallelPython module.
I have a basic idea of the three methods; I would like to know which method is best and why.
In brief, processing a SINGLE dirty Excel file currently takes 3 minutes.
Objective: to introduce parallelism/concurrency so that multiple files can be processed at once.
I am looking for the best method of parallelism to achieve this objective.
Since your process is mostly CPU-bound, multithreading won't be fast because of the GIL...
I would recommend multiprocessing or concurrent.futures, since they are a bit simpler than ParallelPython (only a bit :) )
Example (data_wrangler is your cleaning function and files is your list of input paths):

import concurrent.futures

with concurrent.futures.ProcessPoolExecutor() as executor:
    for file_path, clean_file in zip(files, executor.map(data_wrangler, files)):
        print('%s is now clean!' % file_path)
        # do something with clean_file if you want
Only if you need to distribute the load between servers would I recommend ParallelPython.
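For comparison, here is a roughly equivalent sketch with multiprocessing.Pool; data_wrangler and files are the same hypothetical function and file list as above:

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as pool:  # defaults to one worker process per CPU core
        for file_path, clean_file in zip(files, pool.map(data_wrangler, files)):
            print('%s is now clean!' % file_path)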
I've been using Julia in parallel on my computer successfully but want to increase the number of processors/workers I use, so I plan to use my departmental cluster (UCL Econ). When just using Julia on my computer, I have two separate files. FileA contains all the functions I use, including the main function funcy(x,y,z). FileB calls this function over several processors as follows:
addprocs(4)
require("FileA.jl")
solution = pmap(imw -> funcy(imw,y,z), 1:10)
When I try to run this on the cluster, the require statement does not seem to work (though I don't get an explicit error output which is frustrating). Any advice?
I have a simple text file which contains a list of folders on some FTP servers. Each line is a separate folder. Each folder contains a couple of thousand images. I want to connect to each folder, store all files inside that folder in a SequenceFile, and then remove that folder from the FTP server. I have written a simple Pig UDF for this. Here it is:
dirs = LOAD '/var/location.txt' USING PigStorage();
results = FOREACH dirs GENERATE download_whole_folder_into_single_sequence_file($0);
/* I don't need results bag. It is just a dummy bag */
The problem is that I'm not sure whether each line of input is processed in a separate mapper. The input file is not huge, just a couple of hundred lines. If it were pure Map/Reduce then I would use NLineInputFormat and process each line in a separate Mapper. How can I achieve the same thing in Pig?
Pig lets you write your own load functions, which let you specify which InputFormat you'll be using. So you could write a custom loader that uses NLineInputFormat.
That said, the job you described sounds like it would only involve a single map-reduce step. Since using Pig wouldn't reduce complexity in this case, and you'd have to write custom code just to use Pig, I'd suggest just doing it in vanilla map-reduce. If the total file size is Gigabytes or less, I'd just do it all directly on a single host. It's simpler not to use map reduce if you don't have to.
I typically use map-reduce to first load data into HDFS, and then Pig for all data processing. Pig doesn't really add any benefit over vanilla Hadoop for loading data, IMO; it's just a wrapper around InputFormat/RecordReader with additional methods you need to implement. Plus, with Pig it's technically possible that your loader will be called multiple times. That's a gotcha you don't need to worry about when using Hadoop map-reduce directly.
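If you do take the single-host route suggested above, a rough Python sketch with ftplib might look like the following; the host, credentials, and the SequenceFile packaging and folder deletion steps are assumptions left out of the sketch:

from ftplib import FTP

with open('/var/location.txt') as folder_list:
    for folder in (line.strip() for line in folder_list if line.strip()):
        ftp = FTP('ftp.example.com')   # hypothetical host
        ftp.login('user', 'passwd')    # hypothetical credentials
        ftp.cwd(folder)
        for name in ftp.nlst():        # list the files in this folder
            with open(name, 'wb') as out:
                ftp.retrbinary('RETR ' + name, out.write)
        ftp.quit()
        # ...then pack the downloaded files into a SequenceFile and
        # remove the folder from the server (not shown)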
Is it possible to somehow pass a set of files through each map function? The requirement is to process each file in parallel with different operations. I am completely new to map-reduce and I am using Java as my programming language.
If you want to get the same files as input to all mappers, for read-only access, yes: you can add your files from your main (Driver) class to what is called the Distributed Cache.
I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop".
The data I have consists of HDF files over a terabyte in size. In Hadoop, as far as I know, we should have text files as input for further processing (map-reduce tasks). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.
Or I could find a way to use raw HDF files in map-reduce programmes.
So far I have not been successful in finding any Java code which reads HDF files and extracts data from them.
If somebody has a better idea of how to work with HDF files, I would really appreciate the help.
Thanks
Ayush
Here are some resources:
SciHadoop (uses netCDF but might already have been extended to HDF5).
You can either use JHDF5 or the lower-level official Java HDF5 interface to read data out of any HDF5 file in the map-reduce task.
For your first option, you could use a conversion tool like HDF dump to dump the HDF file to text format. Otherwise, you can write a program using a Java HDF library that reads the HDF file and writes it out as a text file.
For your second option, SciHadoop is a good example of how to read scientific datasets from Hadoop. It uses the NetCDF-Java library to read NetCDF files. Hadoop does not support a POSIX API for file IO, so SciHadoop uses an extra software layer to translate the POSIX calls of the NetCDF-Java library into HDFS (Hadoop) API calls. If SciHadoop does not already support HDF files, you might have to go down a somewhat harder path and develop a similar solution yourself.
If you do not find any Java code and can work in other languages, then you can use Hadoop Streaming.
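As an illustration, here is a minimal streaming-style mapper sketch; it assumes that each input line the mapper receives is a path to an HDF5 file the worker can open locally, that h5py is installed on the workers, and that a dataset named '/data' exists. All of these are assumptions, not part of the question:

#!/usr/bin/env python
# streaming mapper sketch: reads HDF5 file paths from stdin and emits
# one key/value line per file (here, the mean of a dataset)
import sys
import h5py

for line in sys.stdin:
    path = line.strip()
    if not path:
        continue
    with h5py.File(path, 'r') as f:
        data = f['/data'][:]   # '/data' is a hypothetical dataset name
    print('%s\t%f' % (path, float(data.mean())))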
SciMATE (http://www.cse.ohio-state.edu/~wayi/papers/SciMATE.pdf) is a good option. It is developed on top of a variant of MapReduce, which has been shown to run many scientific applications much more efficiently than Hadoop.