Stack Overflow in Open Cobol - gnucobol

I am using Open Cobol.
I have a program that I have been running for several weeks.
Yesterday, I got the following error:
MERRILL_MAX_AMOUNTS.COB:46: libcob: Stack overflow, possible PERFORM depth exceeded
I tried going back to other versions of the same program that worked, but I am still getting the same error. I have several other programs that run fine with no problem.

If the program was running for several weeks and then ends with this error the program seems to be broken.
You get that error if a section or paragraph was PERFORMed and then (likely after a number of other statements, possibly including GO TO or PERFORMs of other sections/paragraphs) PERFORMs itself again, i.e. recursively.
In most cases this is an error.
If the same program "worked before" and now doesn't then its program flow is changed, likely because of different data being processed.
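To illustrate how unchanged code can trip the depth check once the data changes, here is a toy model in Python. It is not COBOL and not how libcob implements the check; the paragraph name and structure are hypothetical, and only the limit of 255 comes from the defaults mentioned in this answer.

```python
class PerformDepthExceeded(RuntimeError):
    """Raised by this toy model when the simulated PERFORM stack is too deep."""


def run(record_count, limit=255):
    """Simulate a paragraph that PERFORMs itself once per remaining record.

    `limit` mirrors the default depth of 255; everything else is hypothetical.
    """
    depth = 0

    def process_record(remaining):
        nonlocal depth
        depth += 1                          # entering a PERFORMed paragraph
        if depth > limit:
            raise PerformDepthExceeded(f"depth {depth} exceeds limit {limit}")
        if remaining > 1:
            process_record(remaining - 1)   # the recursive PERFORM
        depth -= 1                          # paragraph returned normally

    process_record(record_count)
    return "ok"
```

With 100 records the model finishes; with 300 it fails, although the code is identical. That matches the pattern in the question: the same program, but different input data.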
You could enable tracing of paragraphs and sections for this single program by adding -ftrace to this single program and adjusting runtime.cfg / export/set COB_SET_TRACE and COB_TRACE_FILE according to the runtime documentation.
Note: The PERFORM stack checking is only enabled upon request by -fstack-check, which is auto-enabled with --debug (all runtime checks) or -g (debugging). If you don't want this, you can disable it by explicitly specifying -fno-stack-check.
You can also adjust the number of iterations libcob considers "possibly safe" with -fstack-size=number, the current default of 255 is quite high, the maximum that can be set in a current version is 512 (artificial limit only).
In any case, I strongly suggest replacing the outdated OpenCOBOL (likely 1.1 from Feb 2009) with a current GnuCOBOL version (latest 3.1-rc1, released 19 days ago).

Related

VB6 compiled is slow when copying files

I know, VB6 is historic...ok, but...
I wrote a backup program years ago, not being satisfied with the commercial products I tested.
Now I wanted to renew it with some enhancements and new graphics; the result is quite good for me. Since the file copying process is generally rather slow, I thought I would compile it to save a few seconds... and instead the compiled version is much slower.
Here are some info:
Win10-64 (version 22H2 just upgraded)
Tested on the same PC with identical parameters
VB6 runs with admin privileges, in Win7 SP3 compatibility mode.
Even if it is not relevant here, the job was to copy a folder containing other 426 folders and 4598 files of different sizes (from 1kB to 435MB, for a total of 1.05GB), from an inside SSD disk to an external SSD disk.
The interpreted version took 7.2 sec while the compiled version ended in 18.6 sec !
I tried different native-code compilation settings, disabling all the advanced checks on ranges, integers and floats, without any notable difference.
I could accept a small difference for some unknown reason, but it is unreal to get a 2.5:1 ratio.
Any idea?
EDIT
Based on comments:
I repeated the comparison several times; the run-to-run variation (in both compiled and interpreted mode) is around +/- 1 sec.
Files are copied using FileSystemObject.CopyFile.
my admin privileges are the same for both
Again, I'm not complaining nor worried by the absolute time the copy takes, I can survive with that since it is an operation made every week and during easy hours.
What is surprising is WHY it happens.
Even the idea of compiling the program came from curiosity, since there is very little to optimize in the code; it is just a For...Next loop with very few calculations and assignments.
The program takes the directory and file info from a text-based DB created by recursively scanning the source folder, then loaded into a custom array... pretty simple.
This is done before the actual copy phase, which is what I'm investigating.

GNAT Ada runtime Exception = message EXCEPTION_STACK_OVERFLOW

I am trying to run my application after compiling it with AdaCore's GPS (GNAT Programming Studio).
I get the run time error
Exception name: STORAGE_ERROR
Message: EXCEPTION_STACK_OVERFLOW
I get these run time errors despite setting the stack size in the binder options using
-d65535 (task stack size) and
-D65535 (secondary stack size)
(I also have tried 65535k on both as well as 655m).
The application runs well when compiling it with the Aonix Object Ada compiler. In the Aonix compiler I set the
- stack size to 65535,
- the secondary stack size to 65535
- and the Task stack size to 46345.
My main aim is to port the application to the GNAT Ada compiler.
I notice -d sets the task stack size and -D the secondary stack size but I can't see where to set the main stack size, and I am assuming that this is the issue with the application, but please correct me if I am looking in the wrong direction.
Any pointers would be greatly appreciated.
Bearslumber
If the problem is indeed the main task, a workaround is to move the main procedure to the body of a helper task.
First, compile for debug (-g) (there may be other relevant options; posting wrong information is the fastest way to find them ;-) and you should get more information : the source line and file that raised the exception. Or a stack trace which you can analyze via addr2line.
That should help understand why it is raising...
Are you allocating hundreds of MB on the stack? I've got away with about 200MB in the past...
Is the raise within one of the container classes or part of the RTS?
Is the message actually misleading and a new() heap allocation failed? Other things than the stack can raise Storage_Error, and I'm not clear how or if the default handler distinguishes the cause...
We can't proceed further down this path without further information : edit it into the question and I'll take another look.
Setting stack size for the environment task is not directly possible within Gnat. It's part of gcc's interaction with the OS, and supposed to use the system's ulimit settings for the resulting executable (on Linux; other OS may vary)...
Unfortunately, around the gcc/gnat 4.5 timeframe I found evidence these were being ignored, though that may have been corrected, and I haven't revisited that problem.
I have seen Alex's answer posted elsewhere as a viable workaround if the debug trace and ulimit settings don't provide the answer, or if you need to press on instead of spending time debugging. To keep the codebase clean, I would suggest a wrapper that simply creates the necessary task and calls your current main procedure. For Aonix you simply don't need to include the wrapper file in your build.
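The wrapper idea is not specific to Ada. As a rough analogue only (Python, not GNAT code), the sketch below runs a stack-hungry entry point in a helper thread whose stack size is chosen explicitly, instead of relying on the main thread's stack. The function names and the 64 MiB figure are assumptions made for illustration.

```python
import sys
import threading


def depth(n):
    """Deliberately deep recursion standing in for a stack-hungry main procedure."""
    return 0 if n == 0 else 1 + depth(n - 1)


def run_with_big_stack(func, *args, stack_bytes=64 * 1024 * 1024):
    """Run func(*args) in a helper thread whose stack size we choose ourselves."""
    result = {}

    def wrapper():
        result["value"] = func(*args)

    old_size = threading.stack_size(stack_bytes)  # applies to threads created next
    try:
        worker = threading.Thread(target=wrapper)
        worker.start()
        worker.join()
    finally:
        threading.stack_size(old_size)            # restore the previous setting
    return result["value"]


if __name__ == "__main__":
    sys.setrecursionlimit(50_000)
    print(run_with_big_stack(depth, 20_000))
```

The original main procedure stays untouched; only the small wrapper knows about the helper thread, which is the same separation suggested above for the Ada build.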

How to debug potential CPU/RAM errors in Bash script on Linux

I have a relatively simple bash script that reads from a set of static input files, stores the input in bash variables and then does a bunch of processing over said input by calling out to external scripts (e.g. written in Python, Go, other bash scripts etc.) and using the intermediate results.
Lately I have been experiencing an intermittent problem where a single character seems to be getting altered somewhere during the processing which then causes subsequent errors. Specifically, a lot of the processing I'm doing involves slicing up a list of comma-separated records, and one of the values on each line is a unix timestamp, e.g. 1354245000.
What seems to be happening is that occasionally one of these values will get altered slightly, so I end up with a timestamp like 13542458=2 or 13542458>2 or 13542458;2 coming out of one of the intermediate scripts. This then subsequently gets fed into another script, which throws an exception when it tries to parse the value to an integer.
In the title of this question, I've suggested that this might be a potential CPU/RAM error. I know the general folly in thinking errors are caused by low level things like hardware/compilers etcetera, but the nature of this particular error makes me think it may be possible, for the following reasons:
The input files are the same on each invocation of the script, and the script only fails on some invocations.
I cannot think of any sources of randomness in the source code prior to where the script is breaking. It's basically just slicing and dicing csv input.
I cannot think of any sources of concurrency in the source code -- even the Go scripts aren't actually written to run anything concurrently.
This problem has only arisen in the last week or so. Prior to this time, this error would never occur.
While I haven't documented every erroneous character, they seem to often be quite close in the ASCII table to numeric values (=, >, ; etc). That said, I guess the Hamming distance between two characters quite far apart can be small also with changes to a high order bit.
The script often breaks at a different stage on different runs. i.e. I have a number of separate Python scripts, and sometimes it'll make it past one script and then the error will be induced in another. Other times it'll be induced on an earlier script.
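The point about Hamming distance can be checked directly. The following Python sketch (helper names are made up for illustration) lists, for each of the stray characters seen in the bad timestamps, the ASCII digits that differ from it in exactly one bit.

```python
def bit_diff(a, b):
    """Return (hamming_distance, xor_mask) between two single ASCII characters."""
    mask = ord(a) ^ ord(b)
    return bin(mask).count("1"), mask


def single_bit_neighbours(ch):
    """List the ASCII digits that differ from `ch` in exactly one bit."""
    return [d for d in "0123456789" if bit_diff(ch, d)[0] == 1]


for bad in "=>;":
    print(bad, hex(ord(bad)), single_bit_neighbours(bad))
```

Each of the three characters turns out to be one flipped bit away from at least one digit (for example `=` is 0x3D and `5` is 0x35, which differ only in bit 0x08). That is consistent with a single-bit error, though it does not prove one.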
What I'd like to know is, is there any methodical way to either confirm or rule out a hardware error for this problem? Or if it is a hardware problem, is it possibly undetectable by the operating system?
A bit of further info on the machine:
Linux 64-bit, Ubuntu 12.04
Intel i7 processor
16GB DDR3 RAM
I'm hoping someone can either point me to a reliable way to verify whether the hardware is to blame or otherwise a sound reason as to what else might be the cause.
Try booting into Memtest to check your memory.
While it is highly unlikely that it will be hardware, if you have exhausted your standard software debugging as suggested by @OliCharlesworth, here is an outline of hardware error investigation:
(1) Check your log area for any `MCE` logs (machine check exceptions). If you find any, either in your log area (syslog) or sometimes in the present working dir or /dir, you have a hardware failure.
(2) Check your log area for disk errors, e.g.:
smartd[3963]: Device: /dev/sda [SAT], 34 Currently unreadable (pending) sectors
(3) Check your drive integrity, e.g. (as root) `smartctl -a /dev/sda`; if there is any abnormality, run:
smartctl -t short /dev/sda (change drive as required)
(4) Download, install and boot [memtest86](http://www.memtest86.com/download.htm) (run the complete test).
If your cpu/motherboard has thrown no mce's, you have no disk error, your drive tests OK with smartctl and you have no memory errors with memtest86, then recheck the software debugging. While additional hardware errors can still be present (bad capacitors, etc..) the likelihood at this point is software. Good luck.

Why does CUDA code run so much faster in NVIDIA Visual Profiler?

A piece of code that takes well over 1 minute on the command line was done in a matter of seconds in NVIDIA Visual Profiler (running the same .exe). So the natural question is why? Is there something wrong with command line, or does Visual Profiler do something different and not really execute everything as on the command line?
I'm using CUBLAS, Thrust and cuRAND.
Incidentally, there's been a noticeable slowdown in compiled code on my machine very recently, even old code that previously ran quickly, hence I'm getting suspicious.
Update:
I have checked that the calculated output on command line and Visual Profiler is identical - i.e. all required code has been run in both cases.
GPU-shark indicated that my performance state was unchanged at P0 when I switched from command line to Visual Profiler.
However, GPU usage was reported at 0.0% when run with Visual Profiler, but went as high as 98% when run off command line.
Moreover, far less memory is used with Visual Profiler. When run off command line, task manager indicates usage of 650-700MB of memory (spikes at the first cudaFree(0) call). In Visual Profiler that figure goes down to ~100MB.
This is an old question, but I've just finished chasing the same issue (though the cause may not be the same).
Namely: my app achieved between 900 and 1100 frames (synchronous launches) per second when running under NVVP, but around 100-120 when running outside of the profiler.
The cause appears to be a status message I was printing to the console via cout. I had intended for this to only happen about once every 100-200 frames. Instead, it was printing the status message for every frame, and the console IO became the bottleneck.
By only printing the status message every 100 frames (though the optimal number here would depend on your application), the frame rate jumped back up to match what I was seeing in NVVP. Of course, this could also be handled in a separate CPU thread if that sort of overhead is unacceptable in your circumstances.
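As a minimal sketch of that fix (in Python rather than the C++/`cout` code the answer describes, with hypothetical names), the loop below writes a status line only every `report_every` frames instead of on every frame:

```python
import io
import sys


def render_frames(frame_count, report_every=100, out=sys.stdout):
    """Run a dummy frame loop, writing a status line only every `report_every` frames.

    Returns the number of status lines written.
    """
    lines_written = 0
    for frame in range(1, frame_count + 1):
        # ... per-frame work (kernel launches etc.) would go here ...
        if frame % report_every == 0:
            out.write(f"frame {frame}/{frame_count}\n")
            lines_written += 1
    return lines_written


if __name__ == "__main__":
    sink = io.StringIO()
    print(render_frames(1000, report_every=100, out=sink))
```

Passing the output stream as a parameter also makes it easy to silence console output entirely when measuring performance.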
NVVP has to redirect stdout to its own internal buffer in order to capture the application's output (which it shows in its console tab). It appears that NVVP's mechanism for buffering or processing that output has significantly less overhead than allowing the operating system to handle it directly. It looks like NVVP is buffering everything, and displaying it in a separate thread, or just saving a bunch of output until some threshold is reached, when it adds that buffer to its own console tab.
So, my advice would be to disable any console IO, and see if or how that affects things.
(It didn't help that VS2012 refused to profile my CUDA app. It would have been nice to see that 80% of the execution time was spent on console IO.)
Hope this helps!
This should not happen. I've never seen anything like it; probably something in your setup.
It could be that some JIT-compile step is skipped by the profiler. This could explain the difference in memory usage. Try creating a fat binary?

What can lead to failures in appending data to a file?

I maintain a program that is responsible for collecting data from a data acquisition system and appending that data to a very large (size > 4GB) binary file. Before appending data, the program must validate the header of this file in order to ensure that the meta-data in the file matches that which has been collected. In order to do this, I open the file as follows:
data_file = fopen(file_name, "rb+");
I then seek to the beginning of the file in order to validate the header. When this is done, I seek to the end of the file as follows:
_fseeki64(data_file, _filelengthi64(_fileno(data_file)), SEEK_SET);
At this point, I write the data that has been collected using fwrite(). I am careful to check the return values from all I/O functions.
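For readers who want to experiment with the same sequence, here is a sketch of it in Python rather than C. The 4-byte `MAGIC` header and the function name are hypothetical, since the real file format is not given. Because buffered writes may only report a failure when the buffer is flushed, the sketch flushes and syncs explicitly before returning.

```python
import os

MAGIC = b"DAQ1"  # hypothetical 4-byte header; the real format is not given


def append_chunk(path, chunk, expected_header=MAGIC):
    """Validate the file header, then append `chunk` and force it to disk.

    Returns the file size after the append.
    """
    with open(path, "rb+") as f:
        f.seek(0)
        header = f.read(len(expected_header))
        if header != expected_header:
            raise ValueError(f"unexpected header {header!r}")
        f.seek(0, os.SEEK_END)          # jump to the current end of file
        written = f.write(chunk)
        if written != len(chunk):
            raise OSError("short write")
        f.flush()                       # push Python's buffer to the OS
        os.fsync(f.fileno())            # ask the OS to commit to the device
        return f.tell()
```

Any I/O failure in `write`, `flush` or `fsync` is raised as an `OSError`, so no step can fail silently.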
One of the computers (Windows 7 64-bit) on which we have been testing this program intermittently shows a condition where the data appears to have been written to the file, yet neither the file's last-changed time nor its size changes. If any of the calls to fopen(), fseek(), or fwrite() fail, my program throws an exception, which aborts the data collection process and logs the error. On this machine, none of these failures seem to be occurring. What makes the matter even more mysterious is that, if a restore point is set on the host file system, the problem goes away, only to reappear at some future time.
We have tried to reproduce this problem on other machines (a Vista 32-bit operating system) but have had no success in replicating the issue (this doesn't necessarily mean anything, since the problem is so intermittent in the first place).
Has anyone else encountered anything similar to this? Is there a potential remedy?
Further Information
I have now found that the failure occurs when fflush() is called on the file and that the Win32 error returned by GetLastError() is 665 (ERROR_FILE_SYSTEM_LIMITATION). Searching Google for this error leads to a bunch of reports related to "extents" for SQL Server files. I suspect that the file system is reporting the exhaustion of some sort of journaling resource, and that this happens because we are growing a large file by repeatedly opening it, appending a chunk of data, and closing it. I am now trying to understand this particular error in the hope of coming up with a valid remedy.
The file append is failing because of a file system fragmentation limit. The question was answered in "What factors can lead to Win32 error 665 (file system limitation)?"
