I've a issue with am machine I'm running some paralell calculations on.
Until now I assumed that core ids need to be sequentially. But on this machine I have:
lscpu
I get the following Output for:
/bin/cat /proc/cpuinfo | grep 'core id'
Since the code I'm using assumes that the number of cores is equal to max(core ids) + 1, this causes many problems. I can not change this assumption in the code easily. Therefore my questions are the following:
Are the core ids 5-7 just missing?
Or are these cores actually there but not "activated"?
Can I change this in a sense that I can rename the Ids?
Do I have something wrong in a sense that core Ids are never sequentially ordered?
What can cause such a unusual ordering?
Related
For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270 000 rows of data in an EWP file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the python script for this on a high power computer with 64 cores. Since python natively only uses 1 processor I want to use multiprocessing, which is currently working. However it does not seem very fast, even when starting 64 processes it uses roughly 10 % of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two function:
A file writer function which every 10 seconds gets all items from the result queue and writes them to the result file. This function is running in a single multiprocessing.Process process like so:
fileWriteProcess = Process(target=fileWriter,args=(resultQueue,resultFileName)).start()
A ray trace function with a unique ID which does the following:
Get an index ìdx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from resultDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
This speeds up the whole simulation process quite a bit (10 days down to 1½ day), but as said earlier the CPU is running at around 10 % capacity and the GPU is running around 25 % capacity. The computer has 512 GB ram which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that it is not synchronizing as the results are written slightly unsorted while the input EPW file is sorted.
My question is if there is a better way to do this, which might make it run faster? I can see in the source code that a boolean "hpc" is used to initiate some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.
I have a relatively simple bash script that reads from a set of static input files, stores the input in bash variables and then does a bunch of processing over said input by calling out to external scripts (e.g. written in Python, Go, other bash scripts etc.) and using the intermediate results.
Lately I have been experiencing an intermittent problem where a single character seems to be getting altered somewhere during the processing which then causes subsequent errors. Specifically, a lot of the processing I'm doing involves slicing up a list of comma-separated records, and one of the values on each line is a unix timestamp, e.g. 1354245000.
What seems to be happening is that occasionally one of these values will get altered slightly, so I end up with a timestamp like 13542458=2 or 13542458>2 or 13542458;2 coming out of one of the intermediate scripts. This then subsequently gets fed into another script, which throws an exception when it tries to parse the value to an integer.
In the title of this question, I've suggested that this might be a potential CPU/RAM error. I know the general folly in thinking errors are caused by low level things like hardware/compilers etcetera, but the nature of this particular error makes me think it may be possible, for the following reasons:
The input files are the same on each invocation of the script, and the script only fails on some invocations.
I cannot think of any sources of randomness in the source code prior to where the script is breaking. It's basically just slicing and dicing csv input.
I cannot think of any sources of concurrency in the source code -- even the Go scripts aren't actually written to run anything concurrently.
This problem has only arisen in the last week or so. Prior to this time, this error would never occur.
While I haven't documented every erroneous character, they seem to often be quite close in the ASCII table to numeric values (=, >, ; etc). That said, I guess the Hamming distance between two characters quite far apart can be small also with changes to a high order bit.
The script often breaks at a different stage on different runs. i.e. I have a number of separate Python scripts, and sometimes it'll make it past one script and then the error will be induced in another. Other times it'll be induced on an earlier script.
What I'd like to know is, is there any methodical way to either confirm or rule out a hardware error for this problem? Or if it is a hardware problem, is it possibly undetectable by the operating system?
A bit of further info on the machine:
Linux 64-bit, Ubuntu 12.04
Intel i7 processor
16GB DDR3 RAM
I'm hoping someone can either point me to a reliable way to verify whether the hardware is to blame or otherwise a sound reason as to what else might be the cause.
Try booting into Memtest to check your memory.
While it is highly unlikely that it will be hardware, if you have exhausted you standard software debug as suggested by #OliCharlesworth, here is an outline of hardware error investigation:
(1) check your log area for any `MCE` logs (machine check exceptions).
If you find any in either your log area (syslog) or sometimes in
the present working dir or /dir -- you have a hardware failure.
(2) check your log area for disk errors. e.g:
smartd[3963]: Device: /dev/sda [SAT], 34 Currently unreadable (pending) sectors
(3) check your drive integrity, e.g.: (as root) # `smartctl -a /dev/sda` if any abnormality, run:
smartctl -t short /dev/sda (change drive as required)
(4) download/install/boot to [memtest86](http://www.memtest86.com/download.htm)
(run the complete test)
If your cpu/motherboard has thrown no mce's, you have no disk error, your drive tests OK with smartctl and you have no memory errors with memtest86, then recheck the software debugging. While additional hardware errors can still be present (bad capacitors, etc..) the likelihood at this point is software. Good luck.
I use htop to view information about the processes currently running in my osx machine, also to sort them by CPU, memory usage, etc.
Is there any way to fetch the output of htop programatically in Ruby?. Also I would like to be able to use the API to sort the processes using various parameters like CPU, memory usage, etc.
I can do IO.popen('ps -a') and parse the output, but want to know if there is a better way than directly parsing the output of a system command run programmatically.
Check out sys-proctable:
require 'sys/proctable'
Sys::ProcTable.ps
To sort by starttime:
Sys::ProcTable.ps.sort_by(&:starttime)
On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N however, I find that I am wasting storage space (for the file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm so I am unable to create a master program which simply aggregates the output data to a single file. The simple solution of appending to a single fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?
1) As you say, it's not ls which is failing; it's the shell which does glob expansion before starting up ls. You can fix that problem easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
eg.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
Good luck.
Edit As requested, some more details on 2.2:
You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains it's own position into the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with options O_WRONLY|O_CREATE|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above, also just try again.
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and write all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create discontiguous files (which waste less space, but depend on the blocksize of the file).
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.
If you are only concerned by space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another possibility would be to use N files, with N greater or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonnable guarantee on the order of execution of your computations.
I have a program that creates a file of about 50MB size. During the process the program frequently rewrites sections of the file and forces the changes to disk (in the order of 100 times). It uses a FileChannel and direct ByteBuffers via fc.read(...), fc.write(...) and fc.force(...).
New text:
I have a better view on the problem now.
The problem appears to be that I use three different JVMs to modify a file (one creates it, two others (launched from the first) write to it). Every JVM closes the file properly before the next JVM is started.
The problem is that the cost of fc.write() to that file occasionally goes through the roof for the third JVM (in the order of 100 times the normal cost). That is, all write operations are equally slow, it is not just one that hang very long.
Interestingly, one way to help this is to insert delays (2 seconds) between the launching of JVMs. Without delay, writing is always slow, with delay, the writing is slow aboutr every second time or so.
I also found this Stackoverflow: How to unmap a file from memory mapped using FileChannel in java? which describes a problem for mapped files, which I'm not using.
What I suspect might be going on:
Java does not completely release the file handle when I call close(). When the next JVM is started, Java (or Windows) recognizes concurrent access to that file and installes some expensive concurrency handler for that file, which makes writing expensive.
Would that make sense?
The problem occurs on Windows 7 (Java 6 and 7, tested on two machines), but not under Linux (SuSE 11.3 64).
Old text:
The problem:
Starting the program from as a JUnit test harness from eclipse or from console works fine, it takes around 3 seconds.
Starting the program through an ant task (or through JUnit by kicking of a separate JVM using a ProcessBuilder) slows the program down to 70-80 seconds for the same task (factor 20-30).
Using -Xprof reveals that the usage of 'force0' and 'pwrite' goes through the roof from 34.1% (76+20 tics) to 97.3% (3587+2913+751 tics):
Fast run:
27.0% 0 + 76 sun.nio.ch.FileChannelImpl.force0
7.1% 0 + 20 sun.nio.ch.FileDispatcher.pwrite0
[..]
Slow run:
Interpreted + native Method
48.1% 0 + 3587 sun.nio.ch.FileDispatcher.pwrite0
39.1% 0 + 2913 sun.nio.ch.FileChannelImpl.force0
[..]
Stub + native Method
10.1% 0 + 751 sun.nio.ch.FileDispatcher.pwrite0
[..]
GC and compilation are negligible.
More facts:
No other methods show a significant change in the -Xprof output.
It's either fast or very slow, never something in-between.
Memory is not a problem, all test machines have at least 8GB, the process uses <200MB
rebooting the machine does not help
switching of virus-scanners and similar stuff has no affect
When the process is slow, there is virtually no CPU usage
It is never slow when running it from a normal JVM
It is pretty consistently slow when running it in a JVM that was started from the first JVM (via ProcessBuilder or as ant-task)
All JVMs are exactly the same. I output System.getProperty("java.home") and the JVM options via RuntimeMXBean RuntimemxBean = ManagementFactory.getRuntimeMXBean(); List arguments = RuntimemxBean.getInputArguments();
I tested it on two machines with Windows7 64bit, Java 7u2, Java 6u26 and JRockit, the hardware of the machines differs, though, but the results are very similar.
I tested it also from outside Eclipse (command-line ant) but no difference there.
The whole program is written by myself, all it does is reading and writing to/from this file, no other libraries are used, especially no native libraries. -
And some scary facts that I just refuse to believe to make any sense:
Removing all class files and rebuilding the project sometimes (rarely) helps. The program (nested version) runs fast one or two times before becoming extremely slow again.
Installing a new JVM always helps (every single time!) such that the (nested) program runs fast at least once! Installing a JDK counts as two because both the JDK-jre and the JRE-jre work fine at least once. Overinstalling a JVM does not help. Neither does rebooting. I haven't tried deleting/rebooting/reinstalling yet ...
These are the only two ways I ever managed to get fast program runtimes for the nested program.
Questions:
What may cause this performance drop for nested JVMs?
What exactly do these methods do (pwrite0/force0)? -
Are you using local disks for all testing (as opposed to any network share) ?
Can you setup Windows with a ram drive to store the data ? When a JVM terminates, by default its file handles will have been closed but what you might be seeing is the flushing of the data to the disk. When you overwrite lots of data the previous version of data is discarded and may not cause disk IO. The act of closing the file might make windows kernel implicitly flush data to disk. So using a ram drive would allow you to confirm that their since disk IO time is removed from your stats.
Find a tool for windows that allows you to force the kernel to flush all buffers to disk, use this in between JVM runs, see how long that takes at the time.
But I would guess you are hitten some iteraction with the demands of the process and the demands of the kernel in attempting to manage disk block buffer cache. In linux there is a tool like "/sbin/blockdev --flushbufs" that can do this.
FWIW
"pwrite" is a Linux/Unix API for allowing concurrent writing to a file descriptor (which would be the best kernel syscall API to use for the JVM, I think Win32 API already has provision for the same kinds of usage to share a file handle between threads in a process, but since Sun have Unix heritige things get named after the Unix way). Google "pwrite(2)" for more info on this API.
"force" I would guess that is a file system sync, meaning the process is requesting the kernel to flush unwritten data (that is currently in disk block buffer cache) into the file on the disk (such as would be needed before you turned your computer off). This action will happen automatically over time, but transactional systems require to know when the data previously written (with pwrite) has actually hit the physical disk and is stored. Because some other disk IO is dependant on knowing that, such as with transactional checkpointing.
One thing that could help is making sure you explicitly set the FileChannel to null. Then call System.runFinalization() and maybe System.gc() at the end of the program. You may need more than 1 call.
System.runFinalizersOnExit(true) may also help, but it's deprecated so you will have to deal with the compiler warnings.