I am working on a cluster where I submit jobs through the qsub engine.
I am granted a maximum of 72h of computational time at once. The output of my simulation is a folder which typically contains about 1000 files (about 10 GB). I copy my output back after 71h30m of simulation, which means that everything produced after 71h30m (plus the time needed to copy?) is lost. Is there a way to make this process more efficient, i.e. to avoid having to manually estimate the time needed to copy the output back?
Also, before copying the output back I compress the files with bzip2. What resources does that use? Should I request one more node than the simulation itself needs, just to compress the files?
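One way to avoid guessing the copy time by hand is to let a small wrapper enforce the time budget itself: run the simulation, stop it a fixed margin before the walltime limit, then compress and copy. Below is a minimal Python sketch of that idea; SIM_CMD, OUTPUT_DIR, DEST and the 30-minute margin are placeholders, not anything from the actual cluster setup. Depending on the scheduler, the job may also receive a signal shortly before the limit, which a wrapper like this could trap instead of watching the clock.

import signal
import subprocess
import time
import tarfile
import shutil

WALLTIME = 72 * 3600            # granted walltime in seconds
MARGIN = 30 * 60                # time reserved for compressing + copying (a guess)
SIM_CMD = ["./my_simulation"]   # placeholder for the actual simulation binary
OUTPUT_DIR = "output"           # folder the simulation writes into
DEST = "/home/user/results/"    # placeholder destination for the archive

start = time.time()
sim = subprocess.Popen(SIM_CMD)

# Wait until the simulation finishes or the reserved margin is reached.
while sim.poll() is None and time.time() - start < WALLTIME - MARGIN:
    time.sleep(60)

if sim.poll() is None:
    sim.send_signal(signal.SIGTERM)   # ask the simulation to stop cleanly
    sim.wait()

# Compress the whole output folder with bzip2 and copy the archive back.
with tarfile.open("output.tar.bz2", "w:bz2") as tar:
    tar.add(OUTPUT_DIR)
shutil.copy("output.tar.bz2", DEST)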
There is a piece of software called "Everything". It indexes all the files on your machine and finds anything very fast, once the files are indexed.
I would expect the indexing phase to take a few minutes, but no: it takes a few seconds to index a full computer with multiple TB.
How is it possible? A simple loop over the files would take much longer.
What am I missing?
Enumerating files one by one through the official API would indeed take ages. But Everything reads the Master File Table (and later updates come from the USN Change Journal), according to the author himself, thereby bypassing the slow file-enumeration API.
"a full computer with multiple TB"
The total size of the files is not relevant, because Everything does not index file contents. MFT entries are 1 KB each, so for 100K files you can expect to read on the order of 0.1 GB to build an index from scratch (actually more because of non-file entries, but a similar order of magnitude, and of course less when updating an existing index). That's not really a lot of data after all; it should be possible to read it in under a second.
Processing 100K entries to build an index may seem like a task that could be slow, but for a sense of scale you can compare it to the (tens of) billions of instructions that a contemporary computer can execute per second. "4 GHz" does not exactly mean "4 billion instructions per second"; it is actually better than that, since even an old CPU like the original Pentium could execute multiple instructions per cycle. Based on that scale alone, it's not unthinkable to build an index of 100K entries in a few seconds. Minutes seems excessive: that would correspond to millions of instructions per item, which is bad even for an O(n log n) algorithm (the base-2 log of 100K is about 17); surely we can do better than that.
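As a quick sanity check of these estimates (the 1 KB entry size and the 4 GHz figure are the ones used above; the one-minute duration is just the "minutes seems excessive" scenario):

files = 100_000
mft_entry_bytes = 1024                  # ~1 KB per MFT entry
print(files * mft_entry_bytes / 1e9)    # ~0.1 GB of MFT data to read from scratch

clock_hz = 4e9                          # a nominal "4 GHz" core
if_it_took = 60                         # suppose indexing took a full minute
print(clock_hz * if_it_took / files)    # ~2.4 million cycles per entry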
Threading/multiprocessing can drastically improve speed; they are probably taking advantage of multiple cores. You said "a simple loop over the files", so I am assuming you are not aware of threading/multiprocessing.
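For what it's worth, here is a minimal sketch of the kind of parallelism this answer is hinting at: walking several top-level directories concurrently and counting files. This only illustrates the threading idea, not how Everything actually works (as the other answer notes, it reads the MFT instead of walking directories), and the root path is a placeholder.

import os
from concurrent.futures import ThreadPoolExecutor

def count_files(root):
    # Walk one directory tree and count the files in it.
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

# Placeholder root; split the drive into its top-level directories.
roots = [entry.path for entry in os.scandir("C:/") if entry.is_dir()]

with ThreadPoolExecutor(max_workers=8) as pool:
    print(sum(pool.map(count_files, roots)))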
We have a weekly process that archives a large number of frequently changing files into a single tar file and synchronizes it to another host using rsync as follows (resulting in a very low speedup metric, usually close to 1.00):
rsync -avr <src> <dst>
Over the years, this archive has steadily grown in size and is now over 200 GB. With the increasing file size, rsync has come to a point where it takes about 20 hours to finish the synchronization. However, deleting the file at the destination before the rsync process starts causes the transfer to complete in only about 1 hour.
I understand that rsync's delta-transfer algorithm introduces some overhead, but it seems to grow much faster than linearly with very large file sizes. If the actual transfer of bytes over the network takes 1 hour, what exactly is rsync doing in the remaining 19 hours?
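For intuition about where the time can go, here is a heavily simplified sketch of the delta-transfer idea: the receiver describes its copy of the file as per-block checksums, and the sender scans its own copy looking for blocks it can reuse. This is not rsync's actual implementation (real rsync uses a cheap rolling checksum plus a strong checksum, and scales the block size with the file size); the block size and the use of MD5 below are simplifications.

import hashlib

BLOCK = 2048  # illustrative block size

def block_signatures(old_data):
    # Receiver side: one checksum per fixed-size block of its copy.
    return {hashlib.md5(old_data[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(old_data), BLOCK)}

def delta(new_data, signatures):
    # Sender side: reuse matching blocks, otherwise fall back to literal bytes.
    ops, i = [], 0
    while i < len(new_data):
        digest = hashlib.md5(new_data[i:i + BLOCK]).hexdigest()
        if digest in signatures:
            ops.append(("copy", signatures[digest]))
            i += BLOCK
        else:
            ops.append(("literal", new_data[i:i + 1]))
            i += 1
    return ops

Even with the cheap rolling checksum that real rsync uses, the sender still has to scan its entire copy of the file and look every window position up against the receiver's block list, so the matching work grows with file size regardless of how little data actually changed.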
I'm trying to find a way to run a batch script on Windows that backs up my project directory to our local network file share server.
Example of what I would usually run:
robocopy "C:\PROJECT_FOLDER_PATH" "\\NETWORK_FOLDER_PATH" /mir
But, every now and then, my IT admin approaches me about a massive copy operation that is slowing down the network.
As my projects folder grows over time, this becomes more of an annoyance. I try to run the script only while signing off later in the day, to minimize the number of people affected in the office, but I would like to come up with a better solution.
I've written a script that uses 7-Zip to create an archive and split it into volumes of 250 MB. So now I have a folder that just contains several smaller files and no folders to worry about. But if I batch-copy all of these to the server, I'm concerned I'll still run into the same problem.
So my initial idea was to copy one file at a time every 5-10 seconds rather than all at once, but I would only want the script to run once. I know I could write a loop and rely on robocopy's /mir switch to skip files that have already been backed up, but I don't want to have to monitor the script once I start it.
I want to run the script when I'm ready to do a backup and then have it copy the files up to the network at intervals, to avoid overtaxing our small network.
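A minimal sketch of that "one file at a time, with pauses" idea, assuming the split archive volumes sit in a single flat folder; both paths below are placeholders, and a robocopy-only alternative is described in the answer that follows.

import os
import shutil
import time

SRC = r"C:\PROJECT_BACKUP"              # folder holding the 250 MB volumes
DST = r"\\SERVER\share\PROJECT_BACKUP"  # placeholder network destination

for name in sorted(os.listdir(SRC)):
    target = os.path.join(DST, name)
    if os.path.exists(target):
        continue                        # volume already on the server
    shutil.copy2(os.path.join(SRC, name), target)
    time.sleep(10)                      # give the network some breathing room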
Robocopy has a special option to throttle data traffic while copying.
/ipg:n - Specifies the inter-packet gap to free bandwidth on slow lines.
The number n is the number of milliseconds for Robocopy to wait after each block of 64 KB.
The higher the number, the slower Robocopy gets, but also: the less likely you will run into a conflict with your IT admin.
Example:
robocopy "C:\PROJECT_FOLDER_PATH" "\\NETWORK_FOLDER_PATH" /mir /ipg:50
On a file of 1 GB (about 16,000 blocks of 64 KB each), this will increase the time it takes to copy the file by 800 seconds (16,000 x 50 ms).
Suppose it normally takes 80 seconds to copy this file; this might well be the case on a 100 Mbit connection.
Then the total time becomes 80 + 800 = 880 seconds (almost 15 minutes).
The bandwidth used is 8000 Mbit / 880 sec = 9.1 Mbit/s.
This leaves more than 90 Mbit/s of bandwidth for other processes to use.
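Spelled out with the same rounded figures (16,000 blocks per GB, 8,000 Mbit per GB):

blocks = 16_000              # ~1 GB split into 64 KB blocks
gap_seconds = blocks * 0.050 # 50 ms pause after each block -> 800 s
base_seconds = 80            # plain copy time on a 100 Mbit/s line
total = base_seconds + gap_seconds
print(total)                 # 880 s, almost 15 minutes
print(8000 / total)          # ~9.1 Mbit/s of the 100 Mbit/s link actually used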
Other options you may find useful:
/rh:hhmm-hhmm - Specifies run times when new copies may be started.
/pf - Checks run times on a per-file (not per-pass) basis.
Source:
https://technet.microsoft.com/en-us/library/cc733145(v=ws.11).aspx
http://www.zeda.nl/index.php/en/copy-files-on-slow-links
http://windowsitpro.com/windows-server/robocopy-over-network
I have 1 billion rows of data (about 400GB uncompressed; about 40GB compressed) that I would like to process in map-reduce style, and I have two executables (binaries, not scripts) that can handle the "map" and "reduce" steps. The "map" step can process about 10,000 rows per second, per core, and its output is approximately 1MB in size, regardless of the size of its input. The "reduce" step can process about 50MB / second (excluding IO latency).
Assume that I can pre-process the data once, to do whatever I'd like such as compress it, break it into pieces, etc. For simplicity, assume input is plain text and each row terminates with a newline and each newline is a row terminator.
Once that one-time pre-processing is complete, the goal is to be able to execute a request within 30 seconds. So, if my only bottleneck is the map job (which I don't know will really be true; it could very well be the IO), and assuming I can do all the reduce jobs in under 5 seconds, then I would need about 425 8-core computers, all processing different parts of the input data, to complete the run in time.
Assuming you have the data, and the two map/reduce executables, and you have unlimited access to AWS or GCE, what is a solution to this problem that I can implement with the fewest lines of code and/or script (and not ignoring potential IO or other non-CPU bottlenecks)?
(As an aside, it would also be interesting to know what would execute with the fewest nodes, if different from the solution with the fewest SLOC.)
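For reference, the sizing estimate in the question works out roughly like this (a pure CPU-throughput estimate that ignores the IO caveat raised above):

rows = 1_000_000_000
rows_per_core_per_s = 10_000               # stated map throughput per core
budget_s = 30                              # end-to-end target per request

core_seconds = rows / rows_per_core_per_s  # 100,000 core-seconds of map work
cores = core_seconds / budget_s            # ~3,333 cores needed
print(cores / 8)                           # ~417 8-core machines, i.e. roughly 425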
I have a basic mapreduce question.
My input consists of many small files and I have designed a custom CombinedFileInputFormat (which is working properly).
The size of all files together is only about 100 MB for 20,000 files, but processing an individual file takes a couple of minutes (it's a heavy indexing problem), therefore I want as many map tasks as possible. Will Hadoop take care of this or do I have to enforce it, and how? In the latter case my first guess would be to manipulate the maximum split size, but I am not sure if I am on the right track. Any help greatly appreciated! (Suggestions on how best to set the split size in the latter case are also helpful.)
Some extra information to make things clearer:
There is, however, another reason I want to process multiple files per task: I want to be able to use combiners. The output of a single task only produces unique keys, but between several files there might be substantial overlap. By processing multiple files with the same map task I can implement a combiner or make use of in-mapper combining. This would definitely limit the amount of IO. The fact is that although a single file is only a couple of kilobytes in size, its output is roughly 30 * 10^6 key-value pairs, which easily adds up to a couple of gigabytes.
I don't think there is another way to allow combining (or in-mapper combining) if you have only one file per map task?
Regards, Dieter
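For illustration, here is a minimal sketch of the in-mapper combining pattern described above, written as a Hadoop Streaming mapper in Python (an assumption for the sake of the example; the original job presumably uses the Java API, but the pattern is the same: buffer partial results per key in memory and emit each key once when the task's input is exhausted).

#!/usr/bin/env python3
import sys
from collections import defaultdict

# In-mapper combining: instead of emitting one pair per occurrence,
# aggregate counts per key across all input assigned to this map task
# and emit each key exactly once at the end.
counts = defaultdict(int)

for line in sys.stdin:
    for key in line.split():    # placeholder "map" logic: one key per token
        counts[key] += 1

for key, total in counts.items():
    sys.stdout.write(f"{key}\t{total}\n")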
To get the best utilization from your long-running map tasks, you'll probably want each file to run in its own task rather than using your implementation of CombineInputFormat.
Using a combine input format is usually advisable when you have small files that are quickly processed, since it takes longer to instantiate the map task (JVM, config, etc.) than it does to process the file itself. You can alleviate this by configuring 'JVM reuse', but still, for CPU-bound tasks (as opposed to IO-bound tasks) you'll just want to run a map task for each input file.
You will, however, need your Job Tracker to have a good chunk of memory allocated to it so it can manage and track the 20k map tasks created.
Edit: In response to your updated question, if you want to use a combined input format then you'll need to set the configuration properties for min / max size per node / rack. Hadoop won't be able to do anything more intelligent than trying to keep files that are data-local or rack-local together in the same map task.