I have hundreds of thousands of small CSV files in HDFS. Before merging them into a single dataframe, I need to add an id to each file individually (otherwise, after the merge, it won't be possible to distinguish data from different files).
Currently I rely on YARN to distribute the processes I create that add the id to each file and convert it to Parquet format. I find that no matter how I tune the cluster (in size/executors/memory), the throughput is limited to 2000-3000 files/h.
from multiprocessing.pool import ThreadPool
from pyspark.sql import functions as F

def addIdCsv(x):
    # each work item is a dict holding the file's log id and its path
    logId = x['logId']
    filePath = x['filePath']
    fLogRaw = spark.read.option('header', 'true').option('inferSchema', 'true').csv(filePath)
    fLogRaw = fLogRaw.withColumn('id', F.lit(logId))
    fLogRaw.write.mode('overwrite').parquet(filePath + '_added')

for i in range(0, numBatches):
    fileSlice = fileList[i*batchSize:(i+1)*batchSize]
    p = ThreadPool(numNodes)
    logger.info('\n\n\n --------------- \n\n\n')
    logger.info('Starting Batch : ' + str(i))
    logger.info('\n\n\n --------------- \n\n\n')
    p.map(addIdCsv, fileSlice)
You can see that my cluster is underperforming on CPU, even though the YARN manager gives it 100% access to resources.
What is the best way to solve this part of the data pipeline? What is the bottleneck?
Update 1
The jobs are evenly distributed as you can see in the event timeline visualization below.
As per cricket_007's suggestion, NiFi provides a good, easy solution to this problem that is more scalable and integrates better with other frameworks than plain Python. The idea is to read the files into NiFi before writing them to HDFS (in my case they are in S3). There is still an inherent bottleneck in reading from and writing to S3, but the throughput is around 45k files/h.
The flow looks like this.
Most of the work is done in the ReplaceText processor, which finds the end-of-line character '|' and appends the UUID and a newline.
I want to use Hadoop as a simple system for managing a grid job. (I was previously doing this with SGE and pbs/Torque but we are moving to Hadoop.) I have 1000 ZIP files, each containing 1000 files, for a total of 1M files. I want to upload them all to Amazon S3. Ideally I want to do this without putting the files in HDFS. All of the files are WWW accessible.
What I want to do is:
Have an iterator that goes from 0..999
For each map job, get the iterator and:
fetch the ZIP file (it's about 500MB, so it will be written to temp storage)
read the ZIP directory.
extract each file and upload it to Amazon S3.
I know how to do the ZIP file magic in Java and Python. My question is this: How do I create an iterator so that the mapper will get the numbers 0..999?
The output of the reducer will be the amount of time that each took to upload. I then want a second map/reduce step that will produce a histogram of the times. So I guess the correct thing is for the times and failure codes to be written into HDFS (although it seems like it would make a lot more sense just to write them to an SQL database).
I'm interested in doing this in both traditional MapReduce (preferably in Python, but I will do it in Java or Scala if I have to), and in Spark (and for that I need to do it in Scala, right?). Although I can see that there's no real advantage to doing it in Spark.
In Spark you can simply parallelize over a range:
Python
n = ... # Desired parallelism
rdd = sc.parallelize(range(1000), n)
def do_something_for_side_effects(i): ...
rdd.foreach(do_something_for_side_effects)
or
def do_something(i): ...
rdd.map(do_something).saveAsTextFile(...) # Or another save* method
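For the specific task in the question (fetch a ZIP, extract it, upload the members to S3), the per-index worker might look roughly like the sketch below. This is only an outline under assumptions not in the original post: the archive URL pattern and bucket name are made up, and it uses the requests and boto3 libraries; returning the elapsed time gives you the per-archive upload timings mentioned in the question.

import io
import time
import zipfile

import boto3
import requests

def do_something(i):
    # Hypothetical URL pattern and bucket name -- adjust to your setup.
    url = 'http://example.com/archives/batch_%04d.zip' % i
    bucket = boto3.resource('s3').Bucket('my-target-bucket')

    start = time.time()
    # Download the ~500 MB archive (stream it to temp storage instead if memory is tight).
    payload = requests.get(url).content
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        for name in zf.namelist():
            if name.endswith('/'):  # skip directory entries
                continue
            # Upload each member under a key derived from the archive index.
            bucket.put_object(Key='batch_%04d/%s' % (i, name), Body=zf.read(name))
    return i, time.time() - start

With a worker like this, rdd.map(do_something).saveAsTextFile(...) leaves one (index, seconds) record per archive for the later histogram step.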
Scala
val n: Int = ??? // Desired parallelism
val rdd = sc.parallelize(0 until 1000, n)
def doSomethingForSideEffects(i: Int): Unit = ???
rdd.foreach(doSomethingForSideEffects)
or
def doSomething(i: Int) = ???
rdd.map(doSomething).saveAsTextFile(...) // Or another save* method
I have a Java program that will process 800 images.
I decided to use Condor as a platform for distributed computing, with the aim of dividing those images across the available nodes, having them processed there, and combining the results back.
Say I have 4 nodes. I want to divide the processing so that each node handles 200 images, and then combine the end results.
I have tried executing it normally by submitting it as a Java job and stating requirements = Machine == .. (listing all nodes), but it doesn't seem to work.
How can I divide the processing and execute it in parallel?
HTCondor can definitely help you but you might need to do a little bit of work yourself :-)
There are two possible approaches that come to mind: job arrays and DAG applications.
Job arrays: as you can see from example 5 on the HTCondor Quick Start Guide, you can use the queue command to submit more than 1 job. For instance, queue 800 at the bottom of your job file would submit 800 jobs to your HTCondor pool.
What people do in this case is organize the data to process using a filename convention and exploit that convention in the job file. For instance you could rename your images as img_0.jpg, img_1.jpg, ... img_799.jpg (possibly using symlinks rather than renaming the actual files) and then use a job file along these lines:
Executable = /path/to/my/script
Arguments = /path/to/data/dir/img_$(Process)
Queue 800
When the 800 jobs run, $(Process) is automatically assigned the value of the corresponding process ID (i.e. an integer going from 0 to 799), which means that your code will pick up the correct image to process.
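As a concrete illustration of how the per-job executable consumes that argument (the program in the question is Java, but the mechanism is identical), here is a minimal Python sketch with a hypothetical process_image helper:

import sys

def process_image(path):
    ...  # your actual image-processing logic goes here

if __name__ == '__main__':
    # HTCondor passes /path/to/data/dir/img_$(Process) as the first argument.
    process_image(sys.argv[1])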
DAG: Another approach is to organize your processing in a simple DAG. In this case you could have a pre-processing script (SCRIPT PRE entry in your DAG file) organizing your input data (possibly creating symlinks named appropriately). The real job would be just like the example above.
I'm using these lines in my pig script:
set default_parallel 20;
requests = LOAD '/user/me/todayslogs.gz' USING customParser;
intermediate_results = < some-processing ... >
some_data = FOREACH intermediate_results GENERATE day, request_id, result;
STORE some_data INTO '/user/me/output_data' USING PigStorage(',');
'/user/me/todayslogs.gz' contains thousands of gzipped files, each of size 200 MB.
When the script completes, '/user/me/output_data' has thousands of tiny (<1 KB) files on HDFS.
I must read the files in '/user/me/output_data' in another Pig script for further processing, and I see that this hurts performance. The performance is even worse if the files output by some_data are gzipped.
Here's the output from the MapReduceLauncher.
2013-11-04 12:38:11,961 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases campaign_join,detailed_data,detailed_requests,fields_to_retain,grouped_by_reqid,impressions_and_clicks,minimal_data,ids_cleaned,request_id,requests,requests_only,requests_typed,xids_from_request
2013-11-04 12:38:11,961 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: requests[30,11],campaign_join[35,16],null[-1,-1],null[-1,-1],detailed_requests[37,20],detailed_data[39,16],null[-1,-1],minimal_data[49,15],null[-1,-1],ids_cleaned[62,18],grouped_by_reqid[65,21] C: R: null[-1,-1],xids_from_request[66,21],impressions_and_clicks[69,26],fields_to_retain[70,20],requests_only[67,17],request_id[68,18],requests_typed[73,17]
How do I force PigStorage to write the output into fewer output files?
The reason this is happening is that your job is map-only. There is no need for a reduce phase in the processing you do, so each mapper writes its records to its own file, and you end up with one file per mapper. If you have thousands of input files, you have thousands of output files.
The reason this goes away when you use an ORDER BY is that it triggers a reduce phase, at which point the default parallelism of 20 comes into play.
If you want to avoid this behavior, you have to force a reduce phase somehow. Since you're already doing a JOIN, you could just choose to not do this USING 'replicated'. Alternatively, if you were in a situation where you weren't doing a join, you could force it using a do-nothing GROUP BY, like so:
reduced = FOREACH (GROUP some_data BY RANDOM()) GENERATE FLATTEN(some_data);
You might want to combine multiple input files and feed them into a single mapper. The following link should help you:
http://pig.apache.org/docs/r0.10.0/perf.html#combine-files
You might want to do this for the first as well as the second script.
An alternative solution is to run a script after your job that concatenates the small files into larger files.
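For example, a small post-processing step (a rough sketch that assumes the hadoop CLI is on the path and uses made-up directory names) could pull the part files down with getmerge and push a single merged file back up:

import subprocess

src = '/user/me/output_data'        # hypothetical job output directory
merged = '/tmp/output_data_merged'  # local temp file for the merged result

# Concatenate all part files into one local file, upload it back to HDFS,
# then remove the directory full of tiny files.
subprocess.check_call(['hadoop', 'fs', '-getmerge', src, merged])
subprocess.check_call(['hadoop', 'fs', '-put', '-f', merged, src + '_merged'])
subprocess.check_call(['hadoop', 'fs', '-rm', '-r', src])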
When I try to run a Pig script that has two "store" statements writing to the same file (folder), like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode, so in most cases where you are running map-reduce programs, the output file is opened once, data is written, and then it is closed. Given this approach, you cannot write data simultaneously to the same file.
Try writing to separate files and check whether the map-reduce programs still hang. If they do, then there are some other issues.
You can look at the results and the map-reduce logs to analyze what went wrong.
[Edit:]
You can not write to the same file or append to an existing file. The HDFS Append feature is a work in progress.
To work around this you can do two things:
1) If Alert_Message_Count and Warning_Message_Count have the same schema, you could use UNION as suggested by Chris.
2) Do post-processing when the schemas are not the same, i.e. write a map-reduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps) - but I would expect some form of error message rather than it just hanging.
If you open the cluster job tracker, and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';