I'm trying to find a way to figure out in IDA which exports are data exports and which are real functions export.
For example, let's have a look at Microsoft's msftedit.dll's export entries:
While CreateTextServices is a real exported function:
IID_IRichEditOle is a data export and IDA fails to realize that, interpeting data as code:
Do someone know a reliable way to distinguish the two? Help will be much appreciated.
Thanks in advance.
There is no perfectly reliable way to do this for every export.
Each export only specifies an offset within the executable file -- logically, it could be treated as code or as data by any other code that references it.
As you mentioned, you could come up with heuristics to detect the type of the export in almost all of the cases, but it would be easy to come up with counterexamples that do not work for any given heuristic. Take, for instance, the rule you proposed:
The exported entry will be considered a valid exported function if there is a ret instruction in the function, and there are more than <min> valid instructions, and IDA recognizes the function's calling convention.
False negatives: You might have a function that uses tail call optimization and ends with jmp instructions rather than ret instructions. Any short function would also fail. And there are several ways that IDA can be confused into not treating the code as a function.
False positives: There could be a string in memory followed closely by a C3 or C2 like db 'BACKGAMMON0',0,0C3h -- this could logically disassemble as a valid 11-instruction function with a ret and no arguments.
The lines are blurred even further when you consider that an export could be logically treated as both code and data: Imagine that a byte sequence at an export is copied into dynamically allocated memory -- potentially even in another process -- where it is later executed as code.
Perhaps a reasonable suggestion would be to just trust IDA and treat the export as code if IDA thinks it's code. A large part of IDA's functionality is automatically guessing the logical types of data, and it's normally pretty good at it. As you've shown, sometimes it's wrong. But you can't get 100% accuracy anyway. The best you can do is balance between false negatives and false positives.
Proof of this problem's undecidability:
Whether or not an export will be executed as code is undecidable. Whether or not an export will be read as data is also undecidable. Since we cannot guarantee that either is true, distinguishing between seemingly ambiguous cases is impossible.
Proof: Assume that we have an oracle A(P,I,E) which returns 1 if program P (including all of its dependencies) executes (or reads from) export E (from any DLL loaded in the course of P's execution) with "input" (external state) I. Otherwise, it returns 0.
Let us construct a minimal program Z(P,I,E) which executes (or reads from) export E (the DLL for which is loaded into the address space) if and only if A(P,I,E) returns 0.
Now consider the result of Z(Z,I,E):
If Z(Z,I,E) executes (or reads from) export E, then A(Z,I,E) would return 1. But Z(Z,I,E) is defined to not access export E unless A(Z,I,E) returns 0. This is a contradiction.
If Z(Z,I,E) does not execute (or read from) export E, then A(Z,I,E) would return 0. But Z(Z,I,E) is defined such that it will access export E when A(Z,I,E) returns 0. This is a contradiction.
Therefore, our initial assumption that oracle A(P,I,E) exists is proven false.
But you can do better through instrumentation...
Depending on the exact problem you're trying to solve, you may be able to determine which exports are valid functions at runtime.
For example, you could write an application which debugs the program you which to analyze and places guard pages on each of the pages that contain exports you wish to hook. This means, whenever a page is access (executed/read/written to), an exception is raised, and the debugger program gains control.
The debugger could check the program context to see what type of access was made and whether it has anything to do with the export. If the access is an attempt to execute an export, it could perform some hooking functionality before returning control to the program. Otherwise, it could just return control to the program.
In either case, the PAGE_GUARD modifier is lifted after each exception, so you'd need to put it back each time.
Unsurprisingly, this would make execution of your program very slow, as any R/W/X access to any of the pages containing an export causes an expensive context switch -- this would likely include the execution of most instructions that are a part of your exported functions, along with several others that have nothing to do with them.
You could take a similar approach with other instrumentation tools, such as Pin.
Note that you may not gain information about the usage of every export through instrumentation. This is because you may need to determine what input/external state is required to cause the program to access each export in order to learn if it is used as code or as data (if at all).
Also note that both execute and read (or even write) accesses could potentially occur on the same exports.
Related
In the Eclipse CLP, how many constraints or variables can I define?
I am currently remodeling my scheduling problem - I need to replace a single alldifferent constraint with many atmost constraints. But since I've introduced this change, my ecl script is not working. By "not working" I mean the Eclipse CLP - eclipse.exe or the TkEclipse GUI just shuts down. Without any error message,comment or saying goodbye. Just nothing.
If I try to comment-out some constraints, the script at least gets compiled.
Has someone already bothered with this issue?
There is no specific limit on the number or variables or constraints.
But you were working with large, generated source files where clauses had thousands of subgoals. Because ECLiPSe uses a recursive descent parser, such files can cause an OS stack overflow, in particular on Windows. You could either increase the Windows stack limit, or you could break your generated code into smaller clauses, and call these in conjunction.
Generally, however, generating textual source code isn't such a great idea: it must be created, written, read, parsed, compiled, and is then executed just once. Consider instead generating a pure data file that contains only things like arrays/lists of numbers, but no variables. You can then have a generic ECLiPSe program that reads these data and uses them to create variables and constraints, usually in several loops.
For a very simple example, compare https://eclipseclp.org/examples/transport1.pl.txt (where all the data is explicit in the flat model) with
https://eclipseclp.org/examples/transport_arr.pl.txt where the model is generic and all data comes from the data/3 fact at the end (this would correspond to the generated data file).
NetLogo being interactive makes debugging easy, but I yet to find any tools available for setting breakpoints and stepping through code.
Please guide me if such exist. Or I can achieve the same with the current setup available.
I am not aware of such a tool if one exists. For debugging I use meaningful print statements. First I make a switch as a global parameter to set the debug mode on and off, then I add a statement to each method that prints which method updates which variable and in which order they were called (if debug mode is on).
I also use profiler extension which shows how many times each method was called and which one is the most or least time consuming one.
Not existing currently. Still, you can use one of the alternatives from above or you might take a look at user-message (https://ccl.northwestern.edu/netlogo/docs/dictionary.html#user-message), which will pop up a dialog. This will also block the execution at that step, although not providing you with a jump-to-next-line mechanism, for me this solution proved to be the best.
Another possibility is to do the debugging in any modern browser if/when NetLogo Web produces source maps. This way one can set breakpoints in the NetLogo code and use Chrome or FireFox or IE11's developer tools on the NetLogo code.
I have used user-message and inspect together as the debugging tool for NetLogo. Here is a video demo on how to use them to identify the cause of an error.
the reason for using inspect is to examine all properties of an object when we are not sure where exactly went wrong
using user-message to print out some instruction, output some outcome and halt the program
I found Netlogo somewhat difficult to debug until I discovered print statements.
I basically zone in on the module that is causing problems and I add print statements within critical blocks to inspect the state of the variables. I have find this to be an effective way to debug.
I do wish the documentation was more comprehensive, with more code examples. Perhaps some good Samaritan will take it up as a project.
NB: Note that this approach is not just a convoluted way to achieve the exact same benefit that using random-seed gives. Using random-seed is an ex-ante way to reproduce a run. However, for rare errors, it is impractical to manually change random-seed (maybe a few hundred times) until you hit by chance a run in which the error appears. This approach, instead, makes you able to reproduce the error after it occurred, potentially saving tons of time by letting you reproduce that rare run ex-post.
Feel free to download this blueprint from the Modelling Commons if anyone finds it useful and wants to save the time to set it up.
Not a NetLogo feature, but an expedient I've devised recently.
Some errors might occur only rarely (for example, every few hundred runs). If that is the case, even just reproducing the error can become time consuming.
In order to avoid this problem, the following arrangement can be used (note that the code blocks only contain very few lines of code, the vast majority is only comments).
globals [
current-seed
]
to setup
clear-all
reset-ticks
manage-randomness
end
to manage-randomness
; This procedure checks if the user wants to use a new seed given by the 'new-seed' primitive, or if
; they want to use a custom-defined seed. In the latter case, the custom-defined seed is retrieved
; from the 'custom-seed' global variable, which is set in the Interface in the input box. Such variable
; can be defined either manually by the user, or through the 'save-current-seed-as-custom' button (see
; the 'save-current-seed-as-custom' procedure below).
; In either case, this selection is mediated by the 'current-seed' global variable. This is because
; having a global variable storing the value that will be passed to the 'random-seed' primitive is the
; only way to [1] have such value displayed in the Interface, and [2] re-use such value in case the
; user wants to perform 'save-current-seed-as-custom'.
ifelse (use-custom-seed?)
[set current-seed custom-seed]
[set current-seed new-seed]
random-seed current-seed
end
to save-current-seed-as-custom
; This procedure lets the user store the seed that was used in the current run (and stored in the
; 'random-seed' global variable; see comment in 'manage-randomness') as 'custom-seed', which will
; allow such value to be re-used after turning on the 'use-custom-seed?' switch in the Interface.
set custom-seed current-seed
end
This will make it possible to reproduce the same run where a rare error occurred, just by saving that run's seed as the custom seed and switching on the switch.
To make things even more useful, the same logic can be applied to ticks: to jump exactly to the same point where the rare error occurred (maybe thousands of ticks after the start of the run), it is possible to combine the previous arrangement about seeds and the following arrangement for ticks:
to go
; The first if-statement allows the user to bring the run to a custom-defined ticks value.
; The custom-defined ticks value is retrieved from the 'custom-ticks' global variable,
; which is set in the Interface in the input box. Such variable can be defined either
; manually by the user, or through the 'save-current-ticks-as-custom' button (see the
; 'save-current-ticks-as-custom' procedure above).
if (use-custom-ticks?) AND (ticks = custom-ticks) [stop]
; Insert here the normal 'go' procedure.
tick
end
to save-current-ticks-as-custom
; This procedure lets the user store the current 'ticks' value as 'custom-ticks'. This will allow
; the user, after switching on the 'use-custom-ticks?' switch, to bring the simulation to the
; exact same ticks as when the 'save-current-ticks-as-custom' button was used. If used in combination
; with the 'save-current-seed-as-custom' button and 'use-custom-seed?' switch, this allows the user
; to surely and quickly jump to the exact same situation as when a previous simulation was interrupted.
set custom-ticks ticks
end
This will make it possible not only to quickly jump to where an otherwise rare error would occur, but also, if needed, to manually change the custom-ticks value to a few ticks earlier, in order to be able to observe how things build up before the error occurring. Something that, with rare errors, can otherwise become quite time-consuming.
NetLogo is all about keeping the code in one spot. When I run a simulation in 2D or 3D, I usually have an idea what my whole system is going to produce at timepoint X. So when I'm testing, I usually color code my agents, the "turtles", around a variable I'm tracking (like number of protein signals etc.)
It can be as simple as making them RED when the variable your wondering about is over a threshold or BLUE when under. Or you can throw in another color, maybe GREEN, so that you track when the turtles of interest fall within the "optimal" range.
I just stumbled on what appears to be a generally-known compsci keyword, "emit". But I can't find any clear definition of it in general computer science terms, nor a specific definition of an "emit()" function or keyword in any specific programming language.
I found it here, reading up on MapReduce:
https://en.wikipedia.org/wiki/MapReduce
The context of my additional searches show it has something to do with signaling and/or events. But it seems like it is just assumed that the reader will know what "emit" is and does. For example, this article on MapReduce patterns:
https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
There's no mention of what "emit" is actually doing, there are only calls to it. It must be different from other forms of returning data, though, such as "return" or simply "printf" or the equivalent, else the calls to "emit" would be calls to "return".
Further searching, I found a bunch of times that some pseudocode form of "emit" appears in the context of MapReduce. And in Node.js. And in Qt. But that's about it.
Context: I'm a (mostly) self-taught web programmer and system administrator. I'm sure this question is covered in compsci 101 (or 201?) but I didn't take that course.
In the context of web and network programming:
When we call a function the function may return a value.
When we call a function and the function is supposed to send those results to another function we will not user return anymore. Instead we use emit. We expect the function to emit the results to another function by our call.
A function can return results and emit events.
I've only ever seen emit() used when building a simple compiler in academia.
Upon analyzing the grammar of a program, you tokenize the contents of it and emit (push out) assembly instructions. (The compiler program that was written actually even contained an internal function called emit to mirror that theoretical/logical aspect of it.)
Once the grammar analysis is complete, the assembler will take the assembly instructions and generate the binary code (aka machine code).
So, I don't think there is a general CS definition for emit; however, I do know it is used in the pseudocode (and sometimes, actual code) for writing compiler programs. And that is undergraduate level computer science education in the US.
I can think of three contexts in which it's used:
Map/Reduce functions, where some input value causes 0 or more output values to go into the Reduce function
Tokenizers, where a stream of text is processed, and at various intervals, tokens are emitted
Messaging systems
I think the common thread is the "zero or more". A return provides exactly one value back from a function, whereas an "emit" is a function call that could take place zero times or several times.
In the context of the MapReduce programming model, it is said that an operation of a map nature takes an input value and emits a result, which is nothing more than a transformation of the input.
I have a chunk of lua code that I'd like to be able to (selectively) ignore. I don't have the option of not reading it in and sometimes I'd like it to be processed, sometimes not, so I can't just comment it out (that is, there's a whole bunch of blocks of code and I either have the option of reading none of them or reading all of them). I came up with two ways to implement this (there may well be more - I'm very much a beginner): either enclose the code in a function and then call or not call the function (and once I'm sure I'm passed the point where I would call the function, I can set it to nil to free up the memory) or enclose the code in an if ... end block. The former has slight advantages in that there are several of these blocks and using the former method makes it easier for one block to load another even if the main program didn't request it, but the latter seems the more efficient. However, not knowing much, I don't know if the efficiency saving is worth it.
So how much more efficient is:
if false then
-- a few hundred lines
end
than
throwaway = function ()
-- a few hundred lines
end
throwaway = nil -- to ensure that both methods leave me in the same state after garbage collection
?
If it depends a lot on the lua implementation, how big would the "few hundred lines" need to be to reliably spot the difference, and what sort of stuff should it include to best test (the main use of the blocks is to define a load of possibly useful functions)?
Lua's not smart enough to dump the code for the function, so you're not going to save any memory.
In terms of speed, you're talking about a different of nanoseconds which happens once per program execution. It's harming your efficiency to worry about this, which has virtually no relevance to actual performance. Write the code that you feel expresses your intent most clearly, without trying to be clever. If you run into performance issues, it's going to be a million miles away from this decision.
If you want to save memory, which is understandable on a mobile platform, you could put your conditional code in it's own module and never load it at all of not needed (if your framework supports it; e.g. MOAI does, Corona doesn't).
If there is really a lot of unused code, you can define it as a collection of Strings and loadstring() it when needed. Storing functions as strings will reduce the initial compile time, however of most functions the string representation probably takes up more memory than it's compiled form and what you save when compiling is probably not significant before a few thousand lines... Just saying.
If you put this code in a table, you could compile it transparently through a metatable for minimal performance impact on repeated calls.
Example code
local code_uncompiled = {
f = [=[
local x, y = ...;
return x+y;
]=]
}
code = setmetatable({}, {
__index = function(self, k)
self[k] = assert(loadstring(code_uncompiled[k]));
return self[k];
end
});
local ff = code.f; -- code of x gets compiled here
ff = code.f; -- no compilation here
for i=1, 1000 do
print( ff(2*i, -i) ); -- no compilation here either
print( code.f(2*i, -i) ); -- no compile either, but table access (slower)
end
The beauty of it is that this compiles as needed and you don't really have to waste another thought on it, it's just like storing a function in a table and allows for a lot of flexibility.
Another advantage of this solution is that when the amount of dynamically loaded code gets out of hand, you could transparently change it to load code from external files on demand through the __index function of the metatable. Also, you can mix compiled and uncompiled code by populating the "code" table with "real" functions.
Try the one that makes the code more legible to you first. If it runs fast enough on your target machine, use that.
If it doesn't run fast enough, try the other one.
lua can ignore multiple lines by:
function dostuff()
blabla
faaaaa
--[[
ignore this
and this
maybe this
this as well
]]--
end
I am aware that this is nothing new and has been done several times. But I am looking for some reference implementation (or even just reference design) as a "best practices guide". We have a real-time embedded environment and the idea is to be able to use a "debug shell" in order to invoke some commands. Example: "SomeDevice print reg xyz" will request the SomeDevice sub-system to print the value of the register named xyz.
I have a small set of routines that is essentially made up of 3 functions and a lookup table:
a function that gathers a command line - it's simple; there's no command line history or anything, just the ability to backspace or press escape to discard the whole thing. But if I thought fancier editing capabilities were needed, it wouldn't be too hard to add them here.
a function that parses a line of text argc/argv style (see Parse string into argv/argc for some ideas on this)
a function that takes the first arg on the parsed command line and looks it up in a table of commands & function pointers to determine which function to call for the command, so the command handlers just need to match the prototype:
int command_handler( int argc, char* argv[]);
Then that function is called with the appropriate argc/argv parameters.
Actually, the lookup table also has pointers to basic help text for each command, and if the command is followed by '-?' or '/?' that bit of help text is displayed. Also, if 'help' is used for a command, the command table is dumped (possible only a subset if a parameter is passed to the 'help' command).
Sorry, I can't post the actual source - but it's pretty simple and straight forward to implement, and functional enough for pretty much all the command line handling needs I've had for embedded systems development.
You might bristle at this response, but many years ago we did something like this for a large-scale embedded telecom system using lex/yacc (nowadays I guess it would be flex/bison, this was literally 20 years ago).
Define your grammar, define ranges for parameters, etc... and then let lex/yacc generate the code.
There is a bit of a learning curve, as opposed to rolling a 1-off custom implementation, but then you can extend the grammar, add new commands & parameters, change ranges, etc... extremely quickly.
You could check out libcli. It emulates Cisco's CLI and apparently also includes a telnet server. That might be more than you are looking for, but it might still be useful as a reference.
If your needs are quite basic, a debug menu which accepts simple keystrokes, rather than a command shell, is one way of doing this.
For registers and RAM, you could have a sub-menu which just does a memory dump on demand.
Likewise, to enable or disable individual features, you can control them via keystrokes from the main menu or sub-menus.
One way of implementing this is via a simple state machine. Each screen has a corresponding state which waits for a keystroke, and then changes state and/or updates the screen as required.
vxWorks includes a command shell, that embeds the symbol table and implements a C expression evaluator so that you can call functions, evaluate expressions, and access global symbols at runtime. The expression evaluator supports integer and string constants.
When I worked on a project that migrated from vxWorks to embOS, I implemented the same functionality. Embedding the symbol table required a bit of gymnastics since it does not exist until after linking. I used a post-build step to parse the output of the GNU nm tool for create a symbol table as a separate load module. In an earlier version I did not embed the symbol table at all, but rather created a host-shell program that ran on the development host where the symbol table resided, and communicated with a debug stub on the target that could perform function calls to arbitrary addresses and read/write arbitrary memory. This approach is better suited to memory constrained devices, but you have to be careful that the symbol table you are using and the code on the target are for the same build. Again that was an idea I borrowed from vxWorks, which supports both teh target and host based shell with the same functionality. For the host shell vxWorks checksums the code to ensure the symbol table matches; in my case it was a manual (and error prone) process, which is why I implemented the embedded symbol table.
Although initially I only implemented memory read/write and function call capability I later added an expression evaluator based on the algorithm (but not the code) described here. Then after that I added simple scripting capabilities in the form of if-else, while, and procedure call constructs (using a very simple non-C syntax). So if you wanted new functionality or test, you could either write a new function, or create a script (if performance was not an issue), so the functions were rather like 'built-ins' to the scripting language.
To perform the arbitrary function calls, I used a function pointer typedef that took an arbitrarily large (24) number of arguments, then using the symbol table, you find the function address, cast it to the function pointer type, and pass it the real arguments, plus enough dummy arguments to make up the expected number and thus create a suitable (if wasteful) maintain stack frame.
On other systems I have implemented a Forth threaded interpreter, which is a very simple language to implement, but has a less than user friendly syntax perhaps. You could equally embed an existing solution such as Lua or Ch.
For a small lightweight thing you could use forth. Its easy to get going ( forth kernels are SMALL)
look at figForth, LINa and GnuForth.
Disclaimer: I don't Forth, but openboot and the PCI bus do, and I;ve used them and they work really well.
Alternative UI's
Deploy a web sever on your embedded device instead. Even serial will work with SLIP and the UI can be reasonably complex ( or even serve up a JAR and get really really complex.
If you really need a CLI, then you can point at a link and get a telnet.
One alternative is to use a very simple binary protocol to transfer the data you need, and then make a user interface on the PC, using e.g. Python or whatever is your favourite development tool.
The advantage is that it minimises the code in the embedded device, and shifts as much of it as possible to the PC side. That's good because:
It uses up less embedded code space—much of the code is on the PC instead.
In many cases it's easier to develop a given functionality on the PC, with the PC's greater tools and resources.
It gives you more interface options. You can use just a command line interface if you want. Or, you could go for a GUI, with graphs, data logging, whatever fancy stuff you might want.
It gives you flexibility. Embedded code is harder to upgrade than PC code. You can change and improve your PC-based tool whenever you want, without having to make any changes to the embedded device.
If you want to look at variables—If your PC tool is able to read the ELF file generated by the linker, then it can find out a variable's location from the symbol table. Even better, read the DWARF debug data and know the variable's type as well. Then all you need is a "read-memory" protocol message on the embedded device to get the data, and the PC does the decoding and displaying.