How does a bytecode interpreter know what line a runtime error occurred on? - compilation

As of now, I am working on a language that compiles to bytecode, and then is ran by a VM. My question is, when a runtime error occurs, how does the VM know what line of the source code caused the error, as all whitespace is removed during the compilation process. One thing I would think of is to store a separate array of integers correlating to the bytecode with the line numbers within it, but that sounds extremely memory-inefficient, especially when there are a lot of instructions.

Some forms of bytecode contain information about line numbers, method names, etc. which are included to provide better debugging information. In the JVM, for example, method bytecode contains a table that maps ranges of bytecode addresses to source line numbers. That’s a more efficient way of storing it than tagging each bytecode operation with a line number, since there are typically multiple operations per line. It does use extra space, though I wouldn’t classify it as extremely inefficient.
Absent this info, there really isn’t a way for the interpreter to report anything about the original program, since as you’ve noted all that information is otherwise discarded.
This is similar to how compiled executables handle debug info. With debug symbols included, the program has tables mapping code addresses to function names and line numbers. With symbols stripped out, you just have raw instructions and data and there’s no way to reference the original code.

Related

Why does non-executed compile-time code increase Raku's bytecode size? Does it slow runtime performance?

Consider the following two programs:
unit module Comp;
say 'Hello, world!'
and
unit module Comp;
CHECK { if $*DISTRO.is-win { say 'compiling on Windows' }}
say 'Hello, world!'
Naively, I would have expected both programs to compile to exactly the same bytecode: the CHECK block specifies code to run at the end of compilation; checking a variable and then doing nothing has no effect on the run-time behavior of the program, and thus (I would have thought) shouldn't need to be included in the compiled bytecode.
However, compiling these two programs does not result in the same bytecode. Specifically, compiling the version without the CHECK block creates 24K of bytecode versus 60K for the version with it. Why is the bytecode different for these two versions? Does this difference in bytecode have (or potentially have) a runtime cost? (It seems like it must, but I want to be sure).
And one more related question: how do DOC CHECK blocks fit in with the above? My understanding is that even the compiler skips DOC CHECK blocks when it's not run with the --doc flag. Consistent with that, the bytecode for a hello-world program does not increase in size when given a DOC CHECK block like the one above. However, it does increase in size if the block includes a use statement. From that, I conclude that use is somehow special-cased and gets executed even in DOC CHECK blocks. Is that correct? If so, are there other simillarly special-cased forms I should know about?
A CHECK or BEGIN block (or other BEGIN-time constructs) may contain code that escapes. For example:
BEGIN SomeClass.^add_method('foo', anon method foo() { 42 })
Adds a method to a class, which exists beyond the bounds of the BEGIN block. That method's bytecode is therefore required in the compiled output. Currently, Rakudo conservatively includes the bytecode of everything in a BEGIN or CHECK block. It may be possible to avoid that for some simple cases in the future.
So far as the runtime cost goes, the implementation goes to some lengths to minimize the cost of bytecode that is never run (not so much for this case, but because the standard library is huge but many programs use only a fraction of it). For example:
Bytecode is mmap'd, so some unused parts of it may not actually be paged into memory
Bytecode is only validated on the first call to that frame
Frame meta-data (what lexicals does it have) is only deserialized on the first call to the frame
Unless something references it, the code object will not be deserialized
So far as use goes, its action is performed as soon as it is parsed. Being inside a DOC CHECK block does not suppress that - and in general can not, because the use might bring in things that need to be known in order to finish parsing the contents of that block.

Eclipse CLP: maximum number of constraints/variables

In the Eclipse CLP, how many constraints or variables can I define?
I am currently remodeling my scheduling problem - I need to replace a single alldifferent constraint with many atmost constraints. But since I've introduced this change, my ecl script is not working. By "not working" I mean the Eclipse CLP - eclipse.exe or the TkEclipse GUI just shuts down. Without any error message,comment or saying goodbye. Just nothing.
If I try to comment-out some constraints, the script at least gets compiled.
Has someone already bothered with this issue?
There is no specific limit on the number or variables or constraints.
But you were working with large, generated source files where clauses had thousands of subgoals. Because ECLiPSe uses a recursive descent parser, such files can cause an OS stack overflow, in particular on Windows. You could either increase the Windows stack limit, or you could break your generated code into smaller clauses, and call these in conjunction.
Generally, however, generating textual source code isn't such a great idea: it must be created, written, read, parsed, compiled, and is then executed just once. Consider instead generating a pure data file that contains only things like arrays/lists of numbers, but no variables. You can then have a generic ECLiPSe program that reads these data and uses them to create variables and constraints, usually in several loops.
For a very simple example, compare https://eclipseclp.org/examples/transport1.pl.txt (where all the data is explicit in the flat model) with
https://eclipseclp.org/examples/transport_arr.pl.txt where the model is generic and all data comes from the data/3 fact at the end (this would correspond to the generated data file).

Edit library in hex editor while preserving its integrity

I'm attempting to edit a library in hex editor, insert mode. The main point is to rename a few entries in it. If I make it in "Otherwrite" mode, everything works fine, but every time I try to add a few symbols to the end of string in "Insert" mode, the library fails to load. Anything I'm missing here?
Yes, you're missing plenty. A library follows the PE/COFF format, which is quite heavy on pointers throughout the file. (Eg, towards the beginning of the file is a table which points to the locations of each section in the file).
In the case that you are editing resources, there's the potential to do it without breaking things if you make sure you correct any pointers and sizes for anything pointing to after your edits, but I doubt it'll be easy. In the case that you are editing the .text section (ie, the code), then I doubt you'll get it done, since the operands of function calls and jumps are relative locations to their position in code - you would need to update the entire code to account for edits.
One technique to overcome this is a "code cave", where you replace a piece of the existing code with an explicit JMP instruction to some empty location (You can do this at runtime, where you have the ability to create new memory) - where you define some new code which can be of arbitrary length - then you explicitly JMP back to where you called from (+5 bytes say for the JMP opcode + operand).
Are the names you're changing them to the same length as the old names? If not, then the offsets of everything is shifted. And do any of the functions call one another? That could be another problem point. It'd be easier to obtain the source code (from the project's website if it's not in-house, or from the vendor if it's closed) and change them in that, and then recompile it. I'm curious as to why you are changing the names anyway.
DLLs are a complex binary format (ie compiled code). The compiling process turns named function calls into hard-wired references to specific positions in the file ("offsets"). Therefore if you insert characters into the middle of the file, the offsets after that point will no longer match what is actually at the position they reference, meaning that the function calls in your library will run the wrong code (if they manage to run anything at all).
Basically, the bottom line is what you're doing is always going to break stuff. If you're unlucky, it might even break it really badly and cause serious damage.
Sure - a detailed knowledge of the format, and what has to change. If you're wondering why some of your edits cause loading to fail, you are missing that knowledge.
Libraries are intended to be written by the linker for the use of the linker. They follow a well-defined format that is intended to be easy for the linker to write and read. They don't need tolerance for human input like a compiler does.
Very simply, libraries aren't intended to be modified by hex editors. It may be possible to change entries by overwriting them with names of the same length, or that may screw up an index somewhere. If you change the length of anything, you're likely breaking pointers and metadata.
You don't give any reason for wanting to do this. If it's for fun, well, it's harder than you expected. If you have another reason, you're better off getting the source, or getting somebody who has the source to rename and rebuild.

Are there any good reference implementations available for command line implementations for embedded systems?

I am aware that this is nothing new and has been done several times. But I am looking for some reference implementation (or even just reference design) as a "best practices guide". We have a real-time embedded environment and the idea is to be able to use a "debug shell" in order to invoke some commands. Example: "SomeDevice print reg xyz" will request the SomeDevice sub-system to print the value of the register named xyz.
I have a small set of routines that is essentially made up of 3 functions and a lookup table:
a function that gathers a command line - it's simple; there's no command line history or anything, just the ability to backspace or press escape to discard the whole thing. But if I thought fancier editing capabilities were needed, it wouldn't be too hard to add them here.
a function that parses a line of text argc/argv style (see Parse string into argv/argc for some ideas on this)
a function that takes the first arg on the parsed command line and looks it up in a table of commands & function pointers to determine which function to call for the command, so the command handlers just need to match the prototype:
int command_handler( int argc, char* argv[]);
Then that function is called with the appropriate argc/argv parameters.
Actually, the lookup table also has pointers to basic help text for each command, and if the command is followed by '-?' or '/?' that bit of help text is displayed. Also, if 'help' is used for a command, the command table is dumped (possible only a subset if a parameter is passed to the 'help' command).
Sorry, I can't post the actual source - but it's pretty simple and straight forward to implement, and functional enough for pretty much all the command line handling needs I've had for embedded systems development.
You might bristle at this response, but many years ago we did something like this for a large-scale embedded telecom system using lex/yacc (nowadays I guess it would be flex/bison, this was literally 20 years ago).
Define your grammar, define ranges for parameters, etc... and then let lex/yacc generate the code.
There is a bit of a learning curve, as opposed to rolling a 1-off custom implementation, but then you can extend the grammar, add new commands & parameters, change ranges, etc... extremely quickly.
You could check out libcli. It emulates Cisco's CLI and apparently also includes a telnet server. That might be more than you are looking for, but it might still be useful as a reference.
If your needs are quite basic, a debug menu which accepts simple keystrokes, rather than a command shell, is one way of doing this.
For registers and RAM, you could have a sub-menu which just does a memory dump on demand.
Likewise, to enable or disable individual features, you can control them via keystrokes from the main menu or sub-menus.
One way of implementing this is via a simple state machine. Each screen has a corresponding state which waits for a keystroke, and then changes state and/or updates the screen as required.
vxWorks includes a command shell, that embeds the symbol table and implements a C expression evaluator so that you can call functions, evaluate expressions, and access global symbols at runtime. The expression evaluator supports integer and string constants.
When I worked on a project that migrated from vxWorks to embOS, I implemented the same functionality. Embedding the symbol table required a bit of gymnastics since it does not exist until after linking. I used a post-build step to parse the output of the GNU nm tool for create a symbol table as a separate load module. In an earlier version I did not embed the symbol table at all, but rather created a host-shell program that ran on the development host where the symbol table resided, and communicated with a debug stub on the target that could perform function calls to arbitrary addresses and read/write arbitrary memory. This approach is better suited to memory constrained devices, but you have to be careful that the symbol table you are using and the code on the target are for the same build. Again that was an idea I borrowed from vxWorks, which supports both teh target and host based shell with the same functionality. For the host shell vxWorks checksums the code to ensure the symbol table matches; in my case it was a manual (and error prone) process, which is why I implemented the embedded symbol table.
Although initially I only implemented memory read/write and function call capability I later added an expression evaluator based on the algorithm (but not the code) described here. Then after that I added simple scripting capabilities in the form of if-else, while, and procedure call constructs (using a very simple non-C syntax). So if you wanted new functionality or test, you could either write a new function, or create a script (if performance was not an issue), so the functions were rather like 'built-ins' to the scripting language.
To perform the arbitrary function calls, I used a function pointer typedef that took an arbitrarily large (24) number of arguments, then using the symbol table, you find the function address, cast it to the function pointer type, and pass it the real arguments, plus enough dummy arguments to make up the expected number and thus create a suitable (if wasteful) maintain stack frame.
On other systems I have implemented a Forth threaded interpreter, which is a very simple language to implement, but has a less than user friendly syntax perhaps. You could equally embed an existing solution such as Lua or Ch.
For a small lightweight thing you could use forth. Its easy to get going ( forth kernels are SMALL)
look at figForth, LINa and GnuForth.
Disclaimer: I don't Forth, but openboot and the PCI bus do, and I;ve used them and they work really well.
Alternative UI's
Deploy a web sever on your embedded device instead. Even serial will work with SLIP and the UI can be reasonably complex ( or even serve up a JAR and get really really complex.
If you really need a CLI, then you can point at a link and get a telnet.
One alternative is to use a very simple binary protocol to transfer the data you need, and then make a user interface on the PC, using e.g. Python or whatever is your favourite development tool.
The advantage is that it minimises the code in the embedded device, and shifts as much of it as possible to the PC side. That's good because:
It uses up less embedded code space—much of the code is on the PC instead.
In many cases it's easier to develop a given functionality on the PC, with the PC's greater tools and resources.
It gives you more interface options. You can use just a command line interface if you want. Or, you could go for a GUI, with graphs, data logging, whatever fancy stuff you might want.
It gives you flexibility. Embedded code is harder to upgrade than PC code. You can change and improve your PC-based tool whenever you want, without having to make any changes to the embedded device.
If you want to look at variables—If your PC tool is able to read the ELF file generated by the linker, then it can find out a variable's location from the symbol table. Even better, read the DWARF debug data and know the variable's type as well. Then all you need is a "read-memory" protocol message on the embedded device to get the data, and the PC does the decoding and displaying.

Accurately accessing VB6 limitations

As antiquated and painful as it is - I work at a company that continues to actively use VB6 for a large project. In fact, 18 months ago we came up against the 32k identifier limit.
Not willing to give up on the large code base and rewrite everything in .NET we broke our application into a main executable and several supporting DLL files. This week we ran into the 32k limit again.
The problem we have is that no tool we can find will tell us how many unique identifiers our source is using. We have no accurate way to gauge how our efforts are reducing the number of identifiers or how close we are to the limit before we reach it.
Does anyone know of a tool that will scan the source for a project and return some accurate metrics and statistics?
OK. The Project Metrics Viewer which is part of the Project Analyzer tool from Aivosto will do exactly what you want. I've included a screenshot and also the link to the metrics list which includes numbers of variables etc.
Metrics List
(source: aivosto.com)
The company I work for also has a large VB6 project that encountered the identifier limit. I developed a way to accurately count the number of identifiers remaining, and this has been incorporated into our build process for this project.
After trying several tools without success, I finally realized that the VB6 IDE itself knows exactly how many identifiers it has remaining. In fact, the VB6 IDE throws an "out of memory" error when you add one variable past its limit.
Taking advantage of this fact, I wrote a VB6 Add-In project that first compiles the currently loaded project in the IDE, then adds uniquely named variables to the project until it throws an error. When an error is raised, it records the number of identifiers added before the error as the number of identifiers remaining.
This number is stored in file in a location known to our automated build process, which then reads this number and reports it to the development team. When it gets below a value we feel comfortable with, we schedule some refactoring time and move more code out of this project into DLL projects. We have been using this in production for several years now, and has proven to be a reliable process.
To directly answer the question, using an Add-In is the only way I know to accurately measure the number of remaining identifiers. While I cannot share the Add-In code our project is using, I can say there is not much code involved, and it did not take long to develop.
Microsoft has a decent guide for how to create an Add-In, which can get you started:
https://support.microsoft.com/en-us/kb/189468
Here are some important details specific to counting identifiers:
The VB6 IDE will not consistently throw an error when out of identifiers until the current loaded project has been compiled. Our Add-In programmatically does this before adding identifiers to guarantee an accurate count. If the project cannot be compiled, then an accurate count cannot be obtained.
There are 32,500 identifiers available to a new, empty VB6 project.
Only unique identifier names count. Two local variables with the same name in two different routines only count as one identifier.
CodeSmart by AxTools is very good.
(source: axtools.com)
Cheat - create an unused class with #### unique variables in it. Use Excel or something to generate the alphabetical unique variable names. Remove the class from the project when you hit the limit, or comment out blocks of 100 unique variables..
I'd rather lean on the compiler (which defines how many variables are too many) than on some 3rd party tool anyway.
(oh crud, sorry to necro - didn't notice the dates)
You could get this from a tool that extracted identifiers from VB6 code. Then all you'd have to do is sort the list, eliminate duplicates, and measure the list size. We have a source code search engine that breaks up source code into language tokens ("lexes"), with some of those tokens being exactly those identifiers. That would contain exactly the data you want.
But maybe there's another way to solve your problem: find out which variable names which occur rarely and replace them by a set of standard names (e.g., "temp"). So what you really want is a count of the number of each variable name so you can sort for "small numbers of references". The same lexer data can provide this information.
Then all you need is a tool to rename low-occurrence identifiers to something from the standard set. We offer obfuscators that replace one name by another that could probably do this.
[Oct 2014 update]. Just had a long conversation with somebody with this problem. It turns out there's a pretty conceptual answer on which to base a tool, and that is called register coloring, which allocates a fixed number of registers to an arbitrary number of operands. This works by computing an "interference graph" over operands; and two operands that don't "interfere" can be assigned the same register. One could use that to allocate 2^16 available variable names names to an arbitrary number of identifiers, if the interference graph isn't bad enough. My guess is that it is not. YMMV, and somebody still has to build such a tool, needing likely a VB6 parser and machinery to compute such a graph. [Check out my bio].
It seems that Compuware's DevPartner had that kind of code analysis. I don't know if the current version still supports Visual Basic 6.0. (But at least there's a 14-day trial available)

Resources