Popup window in Turbo Pascal 7 - pascal

In Turbo Pascal 7 for DOS you can use the Crt unit to define a window. If you define a second window on top of the first one, like a popup, I don’t see a way to get rid of the second one except for redrawing the first one on top again.
Is there a window close technique I’m overlooking?
I’m considering keeping an array of screens in memory to make it work, but the TP IDE does popups like I want to do, so maybe it’s easy and I’m just looking in the wrong place?

I don't think there's a window-closing technique you're missing, if you mean one provided by the CRT unit.
The library Borland used for the TP7 IDE was called TurboVision (see https://en.wikipedia.org/wiki/Turbo_Vision) and it was eventually released to the public domain, but well before that, a number of 3rd-party screen handling/windowing libraries had become available and these were much more powerful than what could be achieved with the CRT unit. Probably the best known was Turbopower Software's Object Professional (aka OPro).
Afaik, these libraries (and, fairly obviously TurboVision) were all based on an in-memory representation of a framed window which could be rapidly copied in and out of the PC's video memory and, as in Windows with a capital W, they were treated as a stack with a z-order. So the process or closing/erasing the top level window was one of getting the window(s) that it had been covering to re-draw itself/themselves. Otoh, CRT had basically evolved from v. primitive origins similar to, if not based on, the old DEC VT100 display protocol and wasn't really up to the job of supporting independent, stackable window objects.
Although you may still be able to track down the PD release of TurboVision, it never really caught on as a library for developers. In an ideal world, a better place to start would be with OPro. It was apparently on SoureForge for a while, but seems to have been taken down sometime since about 2007, and these days even if you could get hold of a copy, there is a bit of a question mark over licensing. However ...
There was also a very popular freeware library available for TP by the name of the "Technojock's toolkit" and which had a large functionality overlap (including screen handling) with OPro and it is still available on github - see https://github.com/lallousx86/TurboPascal/tree/master/TotLib/TOTSRC11. Unlike OPro, I never used TechnoJocks myself, but devotees swore by it. Take a look.

Related

Extending functionality of existing program I don't have source for

I'm working on a third-party program that aggregates data from a bunch of different, existing Windows programs. Each program has a mechanism for exporting the data via the GUI. The most brain-dead approach would have me generate extracts by using AutoIt or some other GUI manipulation program to generate the extractions via the GUI. The problem with this is that people might be interacting with the computer when, suddenly, some automated program takes over. That's no good. What I really want to do is somehow have a program run once a day and silently (i.e. without popping up any GUIs) export the data from each program.
My research is telling me that I need to hook each application (assume these applications are always running) and inject a custom DLL to trigger each export. Am I remotely close to being on the right track? I'm a fairly experienced software dev, but I don't know a whole lot about reverse engineering or hooking. Any advice or direction would be greatly appreciated.
Edit: I'm trying to manage the availability of a certain type of professional. Their schedules are stored in proprietary systems. With their permission, I want to install an app on their system that extracts their schedule from whichever system they are using and uploads the information to a central server so that I can present that information to potential clients.
I am aware of four ways of extracting the information you want, both with their advantages and disadvantages. Before you do anything, you need to be aware that any solution you create is not guaranteed and in fact very unlikely to continue working should the target application ever update. The reason is that in each case, you are relying on an implementation detail instead of a pre-defined interface through which to export your data.
Hooking the GUI
The first way is to hook the GUI as you have suggested. What you are doing in this case is simply reading off from what an actual user would see. This is in general easier, since you are hooking the WinAPI which is clearly defined. One danger is that what the program displays is inconsistent or incomplete in comparison to the internal data it is supposed to be representing.
Typically, there are two common ways to perform WinAPI hooking:
DLL Injection. You create a DLL which you load into the other program's virtual address space. This means that you have read/write access (writable access can be gained with VirtualProtect) to the target's entire memory. From here you can trampoline the functions which are called to set UI information. For example, to check if a window has changed its text, you might trampoline the SetWindowText function. Note every control has different interfaces used to set what they are displaying. In this case, you are hooking the functions called by the code to set the display.
SetWindowsHookEx. Under the covers, this works similarly to DLL injection and in this case is really just another method for you to extend/subvert the control flow of messages received by controls. What you want to do in this case is hook the window procedures of each child control. For example, when an item is added to a ComboBox, it would receive a CB_ADDSTRING message. In this case, you are hooking the messages that are received when the display changes.
One caveat with this approach is that it will only work if the target is using or extending WinAPI controls.
Reading from the GUI
Instead of hooking the GUI, you can alternatively use WinAPI to read directly from the target windows. However, in some cases this may not be allowed. There is not much to do in this case but to try and see if it works. This may in fact be the easiest approach. Typically, you will send messages such as WM_GETTEXT to query the target window for what it is currently displaying. To do this, you will need to obtain the exact window hierarchy containing the control you are interested in. For example, say you want to read an edit control, you will need to see what parent window/s are above it in the window hierarchy in order to obtain its window handle.
Reading from memory (Advanced)
This approach is by far the most complicated but if you are able to fully reverse engineer the target program, it is the most likely to get you consistent data. This approach works by you reading the memory from the target process. This technique is very commonly used in game hacking to add 'functionality' and to observe the internal state of the game.
Consider that as well as storing information in the GUI, programs often hold their own internal model of all the data. This is especially true when the controls used are virtual and simply query subsets of the data to be displayed. This is an example of a situation where the first two approaches would not be of much use. This data is often held in some sort of abstract data type such as a list or perhaps even an array. The trick is to find this list in memory and read the values off directly. This can be done externally with ReadProcessMemory or internally through DLL injection again. The difficulty lies mainly in two prerequisites:
Firstly, you must be able to reliably locate these data structures. The problem with this is that code is not guaranteed to be in the same place, especially with features such as ASLR. Colloquially, this is sometimes referred to as code-shifting. ASLR can be defeated by using the offset from a module base and dynamically getting the module base address with functions such as GetModuleHandle. As well as ASLR, a reason that this occurs is due to dynamic memory allocation (e.g. through malloc). In such cases, you will need to find a heap address storing the pointer (which would for example be the return of malloc), dereference that and find your list. That pointer would be prone to ASLR and instead of a pointer, it might be a double-pointer, triple-pointer, etc.
The second problem you face is that it would be rare for each list item to be a primitive type. For example, instead of a list of character arrays (strings), it is likely that you will be faced with a list of objects. You would need to further reverse engineer each object type and understand internal layouts (at least be able to determine offsets of primitive values you are interested in in terms of its offset from the object base). More advanced methods revolve around actually reverse engineering the vtable of objects and calling their 'API'.
You might notice that I am not able to give information here which is specific. The reason is that by its nature, using this method requires an intimate understanding of the target's internals and as such, the specifics are defined only by how the target has been programmed. Unless you have knowledge and experience of reverse engineering, it is unlikely you would want to go down this route.
Hooking the target's internal API (Advanced)
As with the above solution, instead of digging for data structures, you dig for the internal API. I briefly covered this with when discussing vtables earlier. Instead of doing this, you would be attempting to find internal APIs that are called when the GUI is modified. Typically, when a view/UI is modified, instead of directly calling the WinAPI to update it, a program will have its own wrapper function which it calls which in turn calls the WinAPI. You simply need to find this function and hook it. Again this is possible, but requires reverse engineering skills. You may find that you discover functions which you want to call yourself. In this case, as well as being able to locate the location of the function, you have to reverse engineer the parameters it takes, its calling convention and you will need to ensure calling the function has no side effects.
I would consider this approach to be advanced. It can certainly be done and is another common technique used in game hacking to observe internal states and to manipulate a target's behaviour, but is difficult!
The first two methods are well suited for reading data from WinAPI programs and are by far easier. The two latter methods allow greater flexibility. With enough work, you are able to read anything and everything encapsulated by the target but requires a lot of skill.
Another point of concern which may or may not relate to your case is how easy it will be to update your solution to work should the target every be updated. With the first two methods, it is more likely no changes or small changes have to be made. With the second two methods, even a small change in source code can cause a relocation of the offsets you are relying upon. One method of dealing with this is to use byte signatures to dynamically generate the offsets. I wrote another answer some time ago which addresses how this is done.
What I have written is only a brief summary of the various techniques that can be used for what you want to achieve. I may have missed approaches, but these are the most common ones I know of and have experience with. Since these are large topics in themselves, I would advise you ask a new question if you want to obtain more detail about any particular one. Note that in all of the approaches I have discussed, none of them suffer from any interaction which is visible to the outside world so you would have no problem with anything popping up. It would be, as you describe, 'silent'.
This is relevant information about detouring/trampolining which I have lifted from a previous answer I wrote:
If you are looking for ways that programs detour execution of other
processes, it is usually through one of two means:
Dynamic (Runtime) Detouring - This is the more common method and is what is used by libraries such as Microsoft Detours. Here is a
relevant paper where the first few bytes of a function are overwritten
to unconditionally branch to the instrumentation.
(Static) Binary Rewriting - This is a much less common method for rootkits, but is used by research projects. It allows detouring to be
performed by statically analysing and overwriting a binary. An old
(not publicly available) package for Windows that performs this is
Etch. This paper gives a high-level view of how it works
conceptually.
Although Detours demonstrates one method of dynamic detouring, there
are countless methods used in the industry, especially in the reverse
engineering and hacking arenas. These include the IAT and breakpoint
methods I mentioned above. To 'point you in the right direction' for
these, you should look at 'research' performed in the fields of
research projects and reverse engineering.

How did Turbo Pascal overlays work?

I'm implementing an assemblinker for the 16-bit DCPU from the game 0x10c.
One technique that somebody suggested to me was using "overlays, like in Turbo Pascal back in the day" in order to swap code around at run time.
I get the basic idea (link overlayed symbols to same memory, swap before ref), but what was their implementation?
Was it a function that the compiler inserted before references? Was it a trap? Was the data for the overlay stored at the location of the overlay, or in a big table somewhere? Did it work well, or did it break often? Was there an interface for assembly to link with overlayed Pascal (and vice versa), or was it incompatible?
Google is giving me basically no information (other than it being a no-on on modern Pascal compilers). And, I'm just, like, five years too young to have ever needed them when they were current.
A jump table per unit whose elements point to a trap (int 3F) when not loaded. But that is for older Turbo Pascal/Borland Pascal versions (5/6), newer ones also support (286) protected mode, and they might employ yet another scheme.
This scheme means that when an overlay is loaded, no trap overhead happens anymore.
I found this link in my references: The Slithy Tove. There are other nice details there, like how call chains are handled that span multiple overlays.

Most suitable language for cheque/check printing on Windows Platform

I need to create a simple module/executable that can print checks (fill out the details). The details need to be retried from an existing Oracle 9i DB on the Windows(xp or later)
Obviously, I shall need to define the pixel format as to where the details (Name, amount, etc) are to be filled.
The major constraint is that the client needs / strongly prefers a executable , not code that is either interpreted or uses a VM. This is so that installation is extremely simple. This requirement really cannot be changed.
Now, the question is, how do I do it.
(.NET, java and python are out of the question, unless there is a way around the VMs)
I have never worked with MFC or other native windows APIs. I am also unfamiliar with GDI.
Do I have any other option? Any language that can abstract the complexities and can be packed into a x86 binary?
Also, if not then any code help with GDI would be appreciated.
The most obvious possibilities would probably be C, C++, and Delphi. There are a few others such as Ada (e.g., Gnat), but offhand I don't see a lot of reason to favor them (especially for a job this small).
At least the way I'd write this, the language would be almost irrelevant. I'd have it run almost entirely by an external configuration file that gave the name of each field, and the location where it should be printed. I'd probably use something like MM_LOMETRIC mapping mode, so Windows will handle most of the translation to real-world coordinates (and use tenths of a millimeter in the configuration file, so you can use the coordinates without any translation).
Probably the more difficult part of this would/will be the database connectivity. There are various libraries around to help out with that, so this won't be terribly difficult, but it's still not (quite) as trivial as the drawing part.

How to debug a program without a debugger? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Interview question-
Often its pretty easier to debug a program once you have trouble with your code.You can put watches,breakpoints and etc.Life is much easier because of debugger.
But how to debug a program without a debugger?
One possible approach which I know is simply putting print statements in your code wherever you want to check for the problems.
Are there any other approaches other than this?
As its a general question, its not restricted to any specific language.So please share your thoughts on how you would have done it?
EDIT- While submitting your answer, please mention a useful resource (if you have any) about any concept. e.g. Logging
This will be lot helpful for those who don't know about it at all.(This includes me, in some cases :)
UPDATE: Michal Sznajderhas put a real "best" answer and also made it a community wiki.Really deserves lots of up votes.
Actually you have quite a lot of possibilities. Either with recompilation of source code or without.
With recompilation.
Additional logging. Either into program's logs or using system logging (eg. OutputDebugString or Events Log on Windows). Also use following steps:
Always include timestamp at least up to seconds resolution.
Consider adding thread-id in case of multithreaded apps.
Add some nice output of your structures
Do not print out enums with just %d. Use some ToString() or create some EnumToString() function (whatever suits your language)
... and beware: logging changes timings so in case of heavily multithreading you problems might disappear.
More details on this here.
Introduce more asserts
Unit tests
"Audio-visual" monitoring: if something happens do one of
use buzzer
play system sound
flash some LED by enabling hardware GPIO line (only in embedded scenarios)
Without recompilation
If your application uses network of any kind: Packet Sniffer or I will just choose for you: Wireshark
If you use database: monitor queries send to database and database itself.
Use virtual machines to test exactly the same OS/hardware setup as your system is running on.
Use some kind of system calls monitor. This includes
On Unix box strace or dtrace
On Windows tools from former Sysinternals tools like http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx, ProcessExplorer and alike
In case of Windows GUI stuff: check out Spy++ or for WPF Snoop (although second I didn't use)
Consider using some profiling tools for your platform. It will give you overview on thing happening in your app.
[Real hardcore] Hardware monitoring: use oscilloscope (aka O-Scope) to monitor signals on hardware lines
Source code debugging: you sit down with your source code and just pretend with piece of paper and pencil that you are computer. Its so called code analysis or "on-my-eyes" debugging
Source control debugging. Compare diffs of your code from time when "it" works and now. Bug might be somewhere there.
And some general tips in the end:
Do not forget about Text to Columns and Pivot Table in Excel. Together with some text tools (awk, grep or perl) give you incredible analysis pack. If you have more than 32K records consider using Access as data source.
Basics of Data Warehousing might help. With simple cube you may analyse tons of temporal data in just few minutes.
Dumping your application is worth mentioning. Either as a result of crash or just on regular basis
Always generate you debug symbols (even for release builds).
Almost last but not least: most mayor platforms has some sort of command line debugger always built in (even Windows!). With some tricks like conditional debugging and break-print-continue you can get pretty good result with obscure bugs
And really last but not least: use your brain and question everything.
In general debugging is like science: you do not create it you discover it. Quite often its like looking for a murderer in a criminal case. So buy yourself a hat and never give up.
First of all, what does debugging actually do? Advanced debuggers give you machine hooks to suspend execution, examine variables and potentially modify state of a running program. Most programs don't need all that to debug them. There are many approaches:
Tracing: implement some kind of logging mechanism, or use an existing one such as dtrace(). It usually worth it to implement some kind of printf-like function that can output generally formatted output into a system log. Then just throw state from key points in your program to this log. Believe it or not, in complex programs, this can be more useful than raw debugging with a real debugger. Logs help you know how you got into trouble, while a debugger that traps on a crash assumes you can reverse engineer how you got there from whatever state you are already in. For applications that you use other complex libraries that you don't own that crash in the middle of them, logs are often far more useful. But it requires a certain amount of discipline in writing your log messages.
Program/Library self-awareness: To solve very specific crash events, I often have implemented wrappers on system libraries such as malloc/free/realloc which extensions that can do things like walk memory, detect double frees, attempts to free non-allocated pointers, check for obvious buffer over-runs etc. Often you can do this sort of thing for your important internal data types as well -- typically you can make self-integrity checks for things like linked lists (they can't loop, and they can't point into la-la land.) Even for things like OS synchronization objects, often you only need to know which thread, or what file and line number (capturable by __FILE__, __LINE__) the last user of the synch object was to help you work out a race condition.
If you are insane like me, you could, in fact, implement your own mini-debugger inside of your own program. This is really only an option in a self-reflective programming language, or in languages like C with certain OS-hooks. When compiling C/C++ in Windows/DOS you can implement a "crash-hook" callback which is executed when any program fault is triggered. When you compile your program you can build a .map file to figure out what the relative addresses of all your public functions (so you can work out the loader initial offset by subtracting the address of main() from the address given in your .map file). So when a crash happens (even pressing ^C during a run, for example, so you can find your infinite loops) you can take the stack pointer and scan it for offsets within return addresses. You can usually look at your registers, and implement a simple console to let you examine all this. And voila, you have half of a real debugger implemented. Keep this going and you can reproduce the VxWorks' console debugging mechanism.
Another approach, is logical deduction. This is related to #1. Basically any crash or anomalous behavior in a program occurs when it stops behaving as expected. You need to have some feed back method of knowing when the program is behaving normally then abnormally. Your goal then is to find the exact conditions upon which your program goes from behaving correctly to incorrectly. With printf()/logs, or other feedback (such as enabling a device in an embedded system -- the PC has a speaker, but some motherboards also have a digital display for BIOS stage reporting; embedded systems will often have a COM port that you can use) you can deduce at least binary states of good and bad behavior with respect to the run state of your program through the instrumentation of your program.
A related method is logical deduction with respect to code versions. Often a program was working perfectly at one state, but some later version is not longer working. If you use good source control, and you enforce a "top of tree must always be working" philosophy amongst your programming team, then you can use a binary search to find the exact version of the code at which the failure occurs. You can use diffs then to deduce what code change exposes the error. If the diff is too large, then you have the task of trying to redo that code change in smaller steps where you can apply binary searching more effectively.
Just a couple suggestions:
1) Asserts. This should help you work out general expectations at different states of the program. As well familiarize yourself with the code
2) Unit tests. I have used these at times to dig into new code and test out APIs
One word: Logging.
Your program should write descriptive debug lines which include a timestamp to a log file based on a configurable debug level. Reading the resultant log files gives you information on what happened during the execution of the program. There are logging packages in every common programming language that make this a snap:
Java: log4j
.Net: NLog or log4net
Python: Python Logging
PHP: Pear Logging Framework
Ruby: Ruby Logger
C: log4c
I guess you just have to write fine-grain unit tests.
I also like to write a pretty-printer for my data structures.
I think the rest of the interview might go something like this...
Candidate: So you don't buy debuggers for your developers?
Interviewer: No, they have debuggers.
Candidate: So you are looking for programmers who, out of masochism or chest thumping hamartia, make things complicated on themselves even if they would be less productive?
Interviewer: No, I'm just trying to see if you know what you would do in a situation that will never happen.
Candidate: I suppose I'd add logging or print statements. Can I ask you a similar question?
Interviewer: Sure.
Candidate: How would you recruit a team of developers if you didn't have any appreciable interviewing skill to distinguish good prospects based on relevant information?
Peer review. You have been looking at the code for 8 hours and your brain is just showing you what you want to see in the code. A fresh pair of eyes can make all the difference.
Version control. Especially for large teams. If somebody changed something you rely on but did not tell you it is easy to find a specific change set that caused your trouble by rolling the changes back one by one.
On *nix systems, strace and/or dtrace can tell you an awful lot about the execution of your program and the libraries it uses.
Binary search in time is also a method: If you have your source code stored in a version-control repository, and you know that version 100 worked, but version 200 doesn't, try to see if version 150 works. If it does, the error must be between version 150 and 200, so find version 175 and see if it works... etc.
use println/log in code
use DB explorer to look at data in DB/files
write tests and put asserts in suspicious places
More generally, you can monitor side effects and output of the program, and trigger certain events in the program externally.
A Print statement isn't always appropriate. You might use other forms of output such as writing to the Event Log or a log file, writing to a TCP socket (I have a nice utility that can listen for that type of trace from my program), etc.
For programs that don't have a UI, you can trigger behavior you want to debug by using an external flag such as the existence of a file. You might have the program wait for the file to be created, then run through a behavior you're interested in while logging relevant events.
Another file's existence might trigger the program's internal state to be written to your logging mechanism.
like everyone else said:
Logging
Asserts
Extra Output
&
your favorite task manager or process
explorer
links here and here
Another thing I have not seen mentioned here that I have had to use quite a bit on embedded systems is serial terminals.
You can cannot a serial terminal to just about any type of device on the planet (I have even done it to embedded CPUs for hydraulics, generators, etc). Then you can write out to the serial port and see everything on the terminal.
You can get real fancy and even setup a thread that listens to the serial terminal and responds to commands. I have done this as well and implemented simple commands to dump a list, see internal variables, etc all from a simple 9600 baud RS-232 serial port!
Spy++ (and more recently Snoop for WPF) are tremendous for getting an insight into Windows UI bugs.
A nice read would be Delta Debugging from Andreas Zeller. It's like binary search for debugging

What's the toughest bug you ever found and fixed? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What made it hard to find? How did you track it down?
Not close enough to close but see also
https://stackoverflow.com/questions/175854/what-is-the-funniest-bug-youve-ever-experienced
A jpeg parser, running on a surveillance camera, which crashed every time the company's CEO came into the room.
100% reproducible error.
I kid you not!
This is why:
For you who doesn't know much about JPEG compression - the image is kind of broken down into a matrix of small blocks which then are encoded using magic etc.
The parser choked when the CEO came into the room, because he always had a shirt with a square pattern on it, which triggered some special case of contrast and block boundary algorithms.
Truly classic.
This didn't happen to me, but a friend told me about it.
He had to debug a app which would crash very rarely. It would only fail on Wednesdays -- in September -- after the 9th. Yes, 362 days of the year, it was fine, and three days out of the year it would crash immediately.
It would format a date as "Wednesday, September 22 2008", but the buffer was one character too short -- so it would only cause a problem when you had a 2 digit DOM on a day with the longest name in the month with the longest name.
This requires knowing a bit of Z-8000 assembler, which I'll explain as we go.
I was working on an embedded system (in Z-8000 assembler). A different division of the company was building a different system on the same platform, and had written a library of functions, which I was also using on my project. The bug was that every time I called one function, the program crashed. I checked all my inputs; they were fine. It had to be a bug in the library -- except that the library had been used (and was working fine) in thousands of POS sites across the country.
Now, Z-8000 CPUs have 16 16-bit registers, R0, R1, R2 ...R15, which can also be addressed as 8 32-bit registers, named RR0, RR2, RR4..RR14 etc. The library was written from scratch, refactoring a bunch of older libraries. It was very clean and followed strict programming standards. At the start of each function, every register that would be used in the function was pushed onto the stack to preserve its value. Everything was neat & tidy -- they were perfect.
Nevertheless, I studied the assembler listing for the library, and I noticed something odd about that function --- At the start of the function, it had PUSH RR0 / PUSH RR2 and at the end to had POP RR2 / POP R0. Now, if you didn't follow that, it pushed 4 values on the stack at the start, but only removed 3 of them at the end. That's a recipe for disaster. There an unknown value on the top of the stack where return address needed to be. The function couldn't possibly work.
Except, may I remind you, that it WAS working. It was being called thousands of times a day on thousands of machines. It couldn't possibly NOT work.
After some time debugging (which wasn't easy in assembler on an embedded system with the tools of the mid-1980s), it would always crash on the return, because the bad value was sending it to a random address. Evidently I had to debug the working app, to figure out why it didn't fail.
Well, remember that the library was very good about preserving the values in the registers, so once you put a value into the register, it stayed there. R1 had 0000 in it. It would always have 0000 in it when that function was called. The bug therefore left 0000 on the stack. So when the function returned it would jump to address 0000, which just so happened to be a RET, which would pop the next value (the correct return address) off the stack, and jump to that. The data perfectly masked the bug.
Of course, in my app, I had a different value in R1, so it just crashed....
This was on Linux but could have happened on virtually any OS. Now most of you are probably familiar with the BSD socket API. We happily use it year after year, and it works.
We were working on a massively parallel application that would have many sockets open. To test its operation we had a testing team that would open hundreds and sometimes over a thousand connections for data transfer. With the highest channel numbers our application would begin to show weird behavior. Sometimes it just crashed. The other time we got errors that simply could not be true (e.g. accept() returning the same file descriptor on subsequent calls which of course resulted in chaos.)
We could see in the log files that something went wrong, but it was insanely hard to pinpoint. Tests with Rational Purify said nothing was wrong. But something WAS wrong. We worked on this for days and got increasingly frustrated. It was a showblocker because the already negotiated test would cause havoc in the app.
As the error only occured in high load situations, I double-checked everything we did with sockets. We had never tested high load cases in Purify because it was not feasible in such a memory-intensive situation.
Finally (and luckily) I remembered that the massive number of sockets might be a problem with select() which waits for state changes on sockets (may read / may write / error). Sure enough our application began to wreak havoc exactly the moment it reached the socket with descriptor 1025. The problem is that select() works with bit field parameters. The bit fields are filled by macros FD_SET() and friends which DON'T CHECK THEIR PARAMETERS FOR VALIDITY.
So everytime we got over 1024 descriptors (each OS has its own limit, Linux vanilla kernels have 1024, the actual value is defined as FD_SETSIZE), the FD_SET macro would happily overwrite its bit field and write garbage into the next structure in memory.
I replaced all select() calls with poll() which is a well-designed alternative to the arcane select() call, and high load situations have never been a problem everafter. We were lucky because all socket handling was in one framework class where 15 minutes of work could solve the problem. It would have been a lot worse if select() calls had been sprinkled all over of the code.
Lessons learned:
even if an API function is 25 years old and everybody uses it, it can have dark corners you don't know yet
unchecked memory writes in API macros are EVIL
a debugging tool like Purify can't help with all situations, especially when a lot of memory is used
Always have a framework for your application if possible. Using it not only increases portability but also helps in case of API bugs
many applications use select() without thinking about the socket limit. So I'm pretty sure you can cause bugs in a LOT of popular software by simply using many many sockets. Thankfully, most applications will never have more than 1024 sockets.
Instead of having a secure API, OS developers like to put the blame on the developer. The Linux select() man page says
"The behavior of these macros is
undefined if a descriptor value is
less than zero or greater than or
equal to FD_SETSIZE, which is normally
at least equal to the maximum number
of descriptors supported by the
system."
That's misleading. Linux can open more than 1024 sockets. And the behavior is absolutely well defined: Using unexpected values will ruin the application running. Instead of making the macros resilient to illegal values, the developers simply overwrite other structures. FD_SET is implemented as inline assembly(!) in the linux headers and will evaluate to a single assembler write instruction. Not the slightest bounds checking happening anywhere.
To test your own application, you can artificially inflate the number of descriptors used by programmatically opening FD_SETSIZE files or sockets directly after main() and then running your application.
Thorsten79
Mine was a hardware problem...
Back in the day, I used a DEC VaxStation with a big 21" CRT monitor. We moved to a lab in our new building, and installed two VaxStations in opposite corners of the room. Upon power-up,my monitor flickered like a disco (yeah, it was the 80's), but the other monitor didn't.
Okay, swap the monitors. The other monitor (now connected to my VaxStation) flickered, and my former monitor (moved across the room) didn't.
I remembered that CRT-based monitors were susceptable to magnetic fields. In fact, they were -very- susceptable to 60 Hz alternating magnetic fields. I immediately suspected that something in my work area was generating a 60 Hz alterating magnetic field.
At first, I suspected something in my work area. Unfortunately, the monitor still flickered, even when all other equipment was turned off and unplugged. At that point, I began to suspect something in the building.
To test this theory, we converted the VaxStation and its 85 lb monitor into a portable system. We placed the entire system on a rollaround cart, and connected it to a 100 foot orange construction extension cord. The plan was to use this setup as a portable field strength meter,in order to locate the offending piece of equipment.
Rolling the monitor around confused us totally. The monitor flickered in exactly one half of the room, but not the other side. The room was in the shape of a square, with doors in opposite corners, and the monitor flickered on one side of a diagnal line connecting the doors, but not on the other side. The room was surrounded on all four sides by hallways. We pushed the monitor out into the hallways, and the flickering stopped. In fact, we discovered that the flicker only occurred in one triangular-shaped half of the room, and nowhere else.
After a period of total confusion, I remembered that the room had a two-way ceiling lighting system, with light switches at each door. At that moment, I realized what was wrong.
I moved the monitor into the half of the room with the problem, and turned the ceiling lights off. The flicker stopped. When I turned the lights on, the flicker resumed. Turning the lights on or off from either light switch, turned the flicker on or off within half of the room.
The problem was caused by somebody cutting corners when they wired the ceiling lights. When wiring up a two-way switch on a lighting circuit, you run a pair of wires between the SPDT switch contacts, and a single wire from the common on one switch, through the lights, and over to the common on the other switch.
Normally, these wires are bundeled together. They leave as a group from one switchbox, run to the overhead ceiling fixture, and on to the other box. The key idea, is that all of the current-carrying wires are bundeled together.
When the building was wired, the single wire between the switches and the light was routed through the ceiling, but the wires travelling between the switches were routed through the walls.
If all of the wires ran close and parallel to each other, then the magnetic field generated by the current in one wire was cancelled out by the magnetic field generated by the equal and opposite current in a nearby wire. Unfortunately, the way that the lights were actually wired meant that one half of the room was basically inside a large, single-turn transformer primary. When the lights were on, the current flowed in a loop, and the poor monitor was basically sitting inside of a large electromagnet.
Moral of the story: The hot and neutral lines in your AC power wiring are next to each other for a good reason.
Now, all I had to do was to explain to management why they had to rewire part of their new building...
A bug where you come across some code, and after studying it you conclude, "There's no way this could have ever worked!" and suddenly it stops working though it always did work before.
One of the products I helped build at my work was running on a customer site for several months, collecting and happily recording each event it received to a SQL Server database. It ran very well for about 6 months, collecting about 35 million records or so.
Then one day our customer asked us why the database hadn't updated for almost two weeks. Upon further investigation we found that the database connection that was doing the inserts had failed to return from the ODBC call. Thankfully the thread that does the recording was separated from the rest of the threads, allowing everything but the recording thread to continue functioning correctly for almost two weeks!
We tried for several weeks on end to reproduce the problem on any machine other than this one. We never could reproduce the problem. Unfortunately, several of our other products then began to fail in about the same manner, none of which have their database threads separated from the rest of their functionality, causing the entire application to hang, which then had to be restarted by hand each time they crashed.
Weeks of investigation turned into several months and we still had the same symptoms: full ODBC deadlocks in any application that we used a database. By this time our products are riddled with debugging information and ways to determine what went wrong and where, even to the point that some of the products will detect the deadlock, collect information, email us the results, and then restart itself.
While working on the server one day, still collecting debugging information from the applications as they crashed, trying to figure out what was going on, the server BSoD on me. When the server came back online, I opened the minidump in WinDbg to figure out what the offending driver was. I got the file name and traced it back to the actual file. After examining the version information in the file, I figured out it was part of the McAfee anti-virus suite installed on the computer.
We disabled the anti-virus and haven't had a single problem since!!
I just want to point out a quite common and nasty bug that can happens in this google-area time:
code pasting and the infamous minus
That is when you copy paste some code with an minus in it, instead of a regular ASCII character hyphen-minus ('-').
Plus, minus(U+2212), Hyphen-Minus(U+002D)
Now, even though the minus is supposedly rendered as longer than the hyphen-minus, on certain editors (or on a DOS shell windows), depending on the charset used, it is actually rendered as a regular '-' hyphen-minus sign.
And... you can spend hours trying to figure why this code does not compile, removing each line one by one, until you find the actual cause!
May be not the toughest bug out there, but frustrating enough ;)
(Thank you ShreevatsaR for spotting the inversion in my original post - see comments)
The first was that our released product exhibited a bug, but when I tried to debug the problem, it didn't occur. I thought this was a "release vs. debug" thing at first -- but even when I compiled the code in release mode, I couldn't reproduce the problem. I went to see if any other developer could reproduce the problem. Nope. After much investigation (producing a mixed assembly code / C code listing) of the program output and stepping through the assembly code of the released product (yuck!), I found the offending line. But the line looked just fine to me! I then had to lookup what the assembly instructions did -- and sure enough the wrong assembly instruction was in the released executable. Then I checked the executable that my build environment produced -- it had the correct assembly instruction. It turned out that the build machine somehow got corrupt and produced bad assembly code for only one instruction for this application. Everything else (including previous versions of our product) produced identical code to other developers machines. After I showed my research to the software manager, we quickly re-built our build machine.
Somewhere deep in the bowels of a networked application was the line (simplified):
if (socket = accept() == 0)
return false;
//code using the socket()
What happened when the call succeeded? socket was set to 1. What does send() do when given a 1? (such as in:
send(socket, "mystring", 7);
It prints to stdout... this I found after 4 hours of wondering why, with all my printf()s taken out, my app was printing to the terminal window instead of sending the data over the network.
With FORTRAN on a Data General minicomputer in the 80's we had a case where the compiler caused a constant 1 (one) to be treated as 0 (zero). It happened because some old code was passing a constant of value 1 to a function which declared the variable as a FORTRAN parameter, which meant it was (supposed to be) immutable. Due to a defect in the code we did an assignment to the parameter variable and the compiler gleefully changed the data in the memory location it used for a constant 1 to 0.
Many unrelated functions later we had code that did a compare against the literal value 1 and the test would fail. I remember staring at that code for the longest time in the debugger. I would print out the value of the variable, it would be 1 yet the test 'if (foo .EQ. 1)' would fail. It took me a long time before I thought to ask the debugger to print out what it thought the value of 1 was. It then took a lot of hair pulling to trace back through the code to find when the constant 1 became 0.
I had a bug in a console game that occurred only after you fought and won a lengthy boss-battle, and then only around 1 time in 5. When it triggered, it would leave the hardware 100% wedged and unable to talk to outside world at all.
It was the shyest bug I've ever encountered; modifying, automating, instrumenting or debugging the boss-battle would hide the bug (and of course I'd have to do 10-20 runs to determine that the bug had hidden).
In the end I found the problem (a cache/DMA/interrupt race thing) by reading the code over and over for 2-3 days.
Not very tough, but I laughed a lot when it was uncovered.
When I was maintaining a 24/7 order processing system for an online shop, a customer complained that his order was "truncated". He claimed that while the order he placed actually contained N positions, the system accepted much less positions without any warning whatsoever.
After we traced order flow through the system, the following facts were revealed. There was a stored procedure responsible for storing order items in database. It accepted a list of order items as string, which encoded list of (product-id, quantity, price) triples like this:
"<12345, 3, 19.99><56452, 1,
8.99><26586, 2, 12.99>"
Now, the author of stored procedure was too smart to resort to anything like ordinary parsing and looping. So he directly transformed the string into SQL multi-insert statement by replacing "<" with "insert into ... values (" and ">" with ");". Which was all fine and dandy, if only he didn't store resulting string in a varchar(8000) variable!
What happened is that his "insert ...; insert ...;" was truncated at 8000th character and for that particular order the cut was "lucky" enough to happen right between inserts, so that truncated SQL remained syntactically correct.
Later I found out the author of sp was my boss.
This is back when I thought that C++ and digital watches were pretty neat...
I got a reputation for being able to solve difficult memory leaks. Another team had a leak they couldn't track down. They asked me to investigate.
In this case, they were COM objects. In the core of the system was a component that gave out many twisty little COM objects that all looked more or less the same. Each one was handed out to many different clients, each of which was responsible for doing AddRef() and Release() the same number of times.
There wasn't a way to automatically calculate who had called each AddRef, and whether they had Released.
I spent a few days in the debugger, writing down hex addresses on little pieces of paper. My office was covered with them. Finally I found the culprit. The team that asked me for help was very grateful.
The next day I switched to a GC'd language.*
(*Not actually true, but would be a good ending to the story.)
Bryan Cantrill of Sun Microsystems gave an excellent Google Tech Talk on a bug he tracked down using a tool he helped develop called dtrace.
The The Tech Talk is funny, geeky, informative, and very impressive (and long, about 78 minutes).
I won't give any spoilers here on what the bug was but he starts revealing the culprit at around 53:00.
While testing some new functionality that I had recently added to a trading application, I happened to notice that the code to display the results of a certain type of trade would never work properly. After looking at the source control system, it was obvious that this bug had existed for at least a year, and I was amazed that none of the traders had ever spotted it.
After puzzling for a while and checking with a colleague, I fixed the bug and went on testing my new functionality. About 3 minutes later, my phone rang. On the other end of the line was an irate trader who complained that one of his trades wasn’t showing correctly.
Upon further investigation, I realized that the trader had been hit with the exact same bug I had noticed in the code 3 minutes earlier. This bug had been lying around for a year, just waiting for a developer to come along and spot it so that it could strike for real.
This is a good example of a type of bug known as a Schroedinbug. While most of us have heard about these peculiar entities, it is an eerie feeling when you actually encounter one in the wild.
The two toughest bugs that come to mind were both in the same type of software, only one was in the web-based version, and one in the windows version.
This product is a floorplan viewer/editor. The web-based version has a flash front-end that loads the data as SVG. Now, this was working fine, only sometimes the browser would hang. Only on a few drawings, and only when you wiggled the mouse over the drawing for a bit. I narrowed the problem down to a single drawing layer, containing 1.5 MB of SVG data. If I took only a subsection of the data, any subsection, the hang didn't occur. Eventually it dawned on me that the problem probably was that there were several different sections in the file that in combination caused the bug. Sure enough, after randomly deleting sections of the layer and testing for the bug, I found the offending combination of drawing statements. I wrote a workaround in the SVG generator, and the bug was fixed without changing a line of actionscript.
In the same product on the windows side, written in Delphi, we had a comparable problem. Here the product takes autocad DXF files, imports them to an internal drawing format, and renders them in a custom drawing engine. This import routine isn't particularly efficient (it uses a lot of substring copying), but it gets the job done. Only in this case it wasn't. A 5 megabyte file generally imports in 20 seconds, but on one file it took 20 minutes, because the memory footprint ballooned to a gigabyte or more. At first it seemed like a typical memory leak, but memory leak tools reported it clean, and manual code inspection turned up nothing either. The problem turned out to be a bug in Delphi 5's memory allocator. In some conditions, which this particular file was duly recreating, it would be prone to severe memory fragmentation. The system would keep trying to allocate large strings, and find nowhere to put them except above the highest allocated memory block. Integrating a new memory allocation library fixed the bug, without changing a line of import code.
Thinking back, the toughest bugs seem to be the ones whose fix involves changing a different part of the system than the one where the problem occurs.
When the client's pet bunny rabbit gnawed partway through the ethernet cable. Yes. It was bad.
Had a bug on a platform with a very bad on device debugger.
We would get a crash on the device if we added a printf to the code. It then would crash at a different spot than the location of the printf. If we moved the printf, the crash would ether move or disappear. In fact, if we changed that code by reordering some simple statements, the crash would happen some where unrelated to the code we did change.
It turns out there was a bug in the relocator for our platform. the relocator was not zero initializing the ZI section but rather using the relocation table to initialze the values. So any time the relocation table changed in the binary the bug would move. So simply added a printf would change the relocation table an there for the bug.
This happened to me on the time I worked on a computer store.
One customer came one day into shop and tell us that his brand new computer worked fine on evenings and night, but it does not work at all on midday or late morning.
The trouble was that mouse pointer does not move at that times.
The first thing we did was changing his mouse by a new one, but the trouble were not fixed. Of course, both mouses worked on store with no fault.
After several tries, we found the trouble was with that particular brand and model of mouse.
Customer workstation was close to a very big window, and at midday the mouse was under direct sunlight.
Its plastic was so thin that under that circumstances, it became translucent and sunlight prevented optomechanical wheel for working :|
My team inherited a CGI-based, multi-threaded C++ web app. The main platform was Windows; a distant, secondary platform was Solaris with Posix threads. Stability on Solaris was a disaster, for some reason. We had various people who looked at the problem for over a year, off and on (mostly off), while our sales staff successfully pushed the Windows version.
The symptom was pathetic stability: a wide range of system crashes with little rhyme or reason. The app used both Corba and a home-grown protocol. One developer went so far as to remove the entire Corba subsystem as a desperate measure: no luck.
Finally, a senior, original developer wondered aloud about an idea. We looked into it and eventually found the problem: on Solaris, there was a compile-time (or run-time?) parameter to adjust the stack size for the executable. It was set incorrectly: far too small. So, the app was running out of stack and printing stack traces that were total red herrings.
It was a true nightmare.
Lessons learned:
Brainstorm, brainstorm, brainstorm
If something is going nuts on a different, neglected platform, it is probably an attribute of the environment platform
Beware of problems that are transferred from developers who leave the team. If possible, contact the previous people on personal basis to garner info and background. Beg, plead, make a deal. The loss of experience must be minimized at all costs.
Adam Liss's message above talking about the project we both worked on, reminded me of a fun bug I had to deal with. Actually, it wasn't a bug, but we'll get to that in a minute.
Executive summary of the app in case you haven't seen Adam message yet: sales-force automation software...on laptops...end of the day they dialed up ...to synchronize with the Mother database.
One user complained that every time he tried to dial in, the application would crash. The customer support folks went through all their usually over-the-phone diagnostic tricks, and they found nothing. So, they had to relent to the ultimate: have the user FedEx the laptop to our offices. (This was a very big deal, as each laptop's local database was customized to the user, so a new laptop had to be prepared, shipped to the user for him to use while we worked on his original, then we had to swap back and have him finally sync the data on first original laptop).
So, when the laptop arrived, it was given to me to figure out the problem. Now, syncing involved hooking up the phone line to the internal modem, going to the "Communication" page of our app, and selecting a phone number from a Drop-down list (with last number used pre-selected). The numbers in the DDL were part of the customization, and were basically, the number of the office, the number of the office prefixed with "+1", the number of the office prefixed with "9,,," in case they were calling from an hotel etc.
So, I click the "COMM" icon, and pressed return. It dialed in, it connected to a modem -- and then immediately crashed. I tired a couple more times. 100% repeatability.
So, a hooked a data scope between the laptop & the phone line, and looked at the data going across the line. It looked rather odd... The oddest part was that I could read it!
The user had apparently wanted to use his laptop to dial into a local BBS system, and so, change the configuration of the app to use the BBS's phone number instead of the company's. Our app was expecting our proprietary binary protocol -- not long streams of ASCII text. Buffers overflowed -- KaBoom!
The fact that a problem dialing in started immediately after he changed the phone number, might give the average user a clue that it was the cause of the problem, but this guy never mentioned it.
I fixed the phone number, and sent it back to the support team, with a note electing the guy the "Bonehead user of the week". (*)
(*) OkOkOk... There's probably a very good chance what actually happened in that the guy's kid, seeing his father dial in every night, figured that's how you dial into BBS's also, and changed the phone number sometime when he was home alone with the laptop. When it crashed, he didn't want to admit he touched the laptop, let alone broke it; so he just put it away, and didn't tell anyone.
It was during my diploma thesis. I was writing a program to simulate the effect of high intensity laser on a helium atom using FORTRAN.
One test run worked like this:
Calculate the intial quantum state using program 1, about 2 hours.
run the main simulation on the data from the first step, for the most simple cases about 20 to 50 hours.
then analyse the output with a third program in order to get meaningful values like energy, tork, momentum
These should be constant in total, but they weren't. They did all kinds of weird things.
After debugging for two weeks I went berserk on the logging and logged every variable in every step of the simulation including the constants.
That way I found out that I wrote over an end of an array, which changed a constant!
A friend said he once changed the literal 2 with such a mistake.
A deadlock in my first multi-threaded program!
It was very tough to find it because it happened in a thread pool. Occasionally a thread in the pool would deadlock but the others would still work. Since the size of the pool was much greater than needed it took a week or two to notice the first symptom: application completely hung.
I have spent hours to days debugging a number of things that ended up being fixable with literally just a couple characters.
Some various examples:
ffmpeg has this nasty habit of producing a warning about "brainfart cropping" (referring to a case where in-stream cropping values are >= 16) when the crop values in the stream were actually perfectly valid. I fixed it by adding three characters: "h->".
x264 had a bug where in extremely rare cases (one in a million frames) with certain options it would produce a random block of completely the wrong color. I fixed the bug by adding the letter "O" in two places in the code. Turned out I had mispelled the name of a #define in an earlier commit.
My first "real" job was for a company that wrote client-server sales-force automation software. Our customers ran the client app on their (15-pound) laptops, and at the end of the day they dialed up to our unix servers to synchronize with the Mother database. After a series of complaints, we found that an astronomical number of calls were dropping at the very beginning, during authentication.
After weeks of debugging, we discovered that the authentication always failed if the incoming call was answered by a getty process on the server whose Process ID contained an even number followed immediately by a 9. Turns out the authentication was a homebrew scheme that depended on an 8-character string representation of the PID; a bug caused an offending PID to crash the getty, which respawned with a new PID. The second or third call usually found an acceptable PID, and automatic redial made it unnecessary for the customers to intervene, so it wasn't considered a significant problem until the phone bills arrived at the end of the month.
The "fix" (ahem) was to convert the PID to a string representing its value in octal rather than decimal, making it impossible to contain a 9 and unnecessary to address the underlying problem.
Basically, anything involving threads.
I held a position at a company once in which I had the dubious distinction of being one of the only people comfortable enough with threading to debug nasty issues. The horror. You should have to get some kind of certification before you're allowed to write threaded code.
I heard about a classic bug back in high school; a terminal that you could only log into if you sat in the chair in front of it. (It would reject your password if you were standing.)
It reproduced pretty reliably for most people; you could sit in the chair, log in, log out... but if you stand up, you're denied, every time.
Eventually it turned out some jerk had swapped a couple of adjacent keys on the keyboard, E/R and C/V IIRC, and when you sat down, you touch-typed and got in, but when you stood, you had to hunt 'n peck, so you looked at the incorrent labels and failed.
While I don't recall a specific instance, the toughest category are those bugs which only manifest after the system has been running for hours or days, and when it goes down, leaves little or no trace of what caused the crash. What makes them particularly bad is that no matter how well you think you've reasoned out the cause, and applied the appropriate fix to remedy it, you'll have to wait for another few hours or days to get any confidence at all that you've really nailed it.
Our network interface, a DMA-capable ATM card, would very occasionally deliver corrupted data in received packets. The AAL5 CRC had checked out as correct when the packet came in off the wire, yet the data DMAd to memory would be incorrect. The TCP checksum would generally catch it, but back in the heady days of ATM people were enthused about running native applications directly on AAL5, dispensing with TCP/IP altogether. We eventually noticed that the corruption only occurred on some models of the vendor's workstation (who shall remain nameless), not others.
By calculating the CRC in the driver software we were able to detect the corrupted packets, at the cost of a huge performance hit. While trying to debug we noticed that if we just stored the packet for a while and went back to look at it later, the data corruption would magically heal itself. The packet contents would be fine, and if the driver calculated the CRC a second time it would check out ok.
We'd found a bug in the data cache of a shipping CPU. The cache in this processor was not coherent with DMA, requiring the software to explicitly flush it at the proper times. The bug was that sometimes the cache didn't actually flush its contents when told to do so.

Resources