I'm writing a Windows Java application that needs to call unsafe native code, and I need to prevent this code from having access to Java objects and JVM data structures, otherwise it might crash the JVM or hack into sensitive data. Before you ask, this native code has been verified beforehand: it can only call a few APIs and cannot contain certain instructions, so we know it won't VirtualProtect itself or other memory regions to gain more access and mess around.
Anyway, my first attempt was to wrap this code into a separate process (sandbox) and use IPC to talk with Java. There's a JNI DLL that does the IPC work on the Java side. Basically, every time we need to run unsafe native code, our Java app calls a JNI function that wakes up the sandbox using an auto-reset Windows Event, and then awaits completion. The sandbox runs the unsafe native code and wakes up the JVM using another auto-reset Windows Event, and life continues. It would be perfect if it weren't so slow.
The problem is that the unsafe native code can contain functions that perform very quick calculations and can be called millions of times from Java, hence the call overhead should be minimal. But this overhead is huge because the JVM wakes up the sandbox with a Windows Event, and vice versa when the sandbox returns. This round trip takes 8x the time of an in-process (non-IPC) solution, where the unsafe native code is wrapped in a JNI DLL (and hence the call happens in the same thread, in the same time slice).
My first guess is that when the JVM wakes up the sandbox, Windows only puts the sandbox thread on the ready set, so it runs only after some milliseconds. And the same happens when the sandbox returns. Not to mention two (possibly expensive) context switches.
Microsoft documentation here says the following:
If a higher-priority thread becomes available to run, the system ceases to execute the lower-priority thread (without allowing it to finish using its time slice), and assigns a full time slice to the higher-priority thread.
To test this theory, I assigned THREAD_PRIORITY_TIME_CRITICAL to the sandbox thread. There were some gains. Performance went from 8x to 5x the time of the in-process (non-IPC) solution. But I need more, otherwise this Java app might not get a chance to go into production!
You can help me in two ways:
Tell me if there's a faster method to wake up another process, such as forcing a context switch or performing an inter-process procedure call.
Tell me how I can protect the JVM while running unsafe native code in-process. I heard that Google Native Client does this, but I only found this documentation. If you know more, please provide links to more detailed information about how this is implemented.
I solved the problem by performing the JVM-sandbox interactions using a spinlock over a shared memory variable, accessed from the JVM through a file mapping. This question explains how to implement it in a C++ environment. Porting to Java is easy with MappedByteBuffer.
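For readers who want to see the shape of that approach, here is a minimal sketch of a spin-wait handshake over a memory-mapped file. It is not the original code: the three-int layout, the state constants, the class name SpinChannel and the "double the argument" computation are all assumptions made for illustration, and the demo emulates the two processes with two threads that each map the same temp file. It relies on the Java 9+ byte-buffer VarHandle for volatile access, since plain MappedByteBuffer reads and writes give no ordering guarantees.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SpinChannel {
    // Gives volatile int access into a ByteBuffer (Java 9+).
    private static final VarHandle INT =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    // Hypothetical layout of the shared region: state flag, argument, result.
    private static final int STATE = 0, ARG = 4, RESULT = 8, SIZE = 12;
    private static final int IDLE = 0, REQUEST = 1, DONE = 2;

    private final ByteBuffer shared;

    SpinChannel(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            shared = ch.map(FileChannel.MapMode.READ_WRITE, 0, SIZE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Caller side: publish the argument, spin until the worker is done, read the result. */
    int call(int arg) {
        INT.set(shared, ARG, arg);
        INT.setVolatile(shared, STATE, REQUEST);   // volatile write publishes ARG
        while ((int) INT.getVolatile(shared, STATE) != DONE) {
            Thread.onSpinWait();                   // busy-wait, no kernel transition
        }
        int result = (int) INT.get(shared, RESULT);
        INT.setVolatile(shared, STATE, IDLE);
        return result;
    }

    /** Worker side: spin until a request arrives, compute, publish the result. */
    void serveOne() {
        while ((int) INT.getVolatile(shared, STATE) != REQUEST) {
            Thread.onSpinWait();
        }
        int arg = (int) INT.get(shared, ARG);
        INT.set(shared, RESULT, arg * 2);          // stand-in for the real native computation
        INT.setVolatile(shared, STATE, DONE);      // volatile write publishes RESULT
    }

    /** Demo: two mappings of one file in one process emulate the two processes. */
    static int[] demo(int... args) {
        try {
            Path file = Files.createTempFile("spin", ".bin");
            file.toFile().deleteOnExit();
            SpinChannel caller = new SpinChannel(file);
            SpinChannel worker = new SpinChannel(file);
            Thread t = new Thread(() -> {
                for (int i = 0; i < args.length; i++) worker.serveOne();
            });
            t.setDaemon(true);
            t.start();
            int[] out = new int[args.length];
            for (int i = 0; i < args.length; i++) out[i] = caller.call(args[i]);
            t.join();
            return out;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `SpinChannel.demo(1, 2, 21)` sends three requests through the shared region and returns the doubled values. The trade-off is that both sides burn CPU while waiting, so this only pays off when calls are short and frequent, as in the scenario above; a production version would also want a timeout or a fallback to a blocking wait.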
Late Friday night thoughts after reading through material on how Cloudflare's V8-based "no cold start" Workers function. In short, it seems to come down to the V8 engine's just-in-time compilation of JavaScript code, and I'm wondering why this no-cold-start type of serverless function seems to exist only for JavaScript.
Is this just because architecturally when AWS Lambda / Azure Functions were launched, they were designed as a kind of even more simplified Kubernetes model, where each function exists in its own container? I would assume that was a simpler model for keeping different clients' code separate than whatever magic sauce V8 isolates provide under the hood.
So given that Java is compiled into bytecode for the JVM, which uses JIT compilation (optimising and compiling certain high-usage functions to machine code), is it therefore also technically possible to have no-cold-start Java serverless functions? As long as there is some way to load in each client's bytecode as they are invoked, on the cloud provider's server.
What are the practical challenges for this to become a reality? I'm not a big expert on all this, but can imagine perhaps:
The compiled bytecode isn't designed to be loaded in this way - it expects to be the only code being executed in a JVM
JVM optimisations aren't written to support loading multiple short-lived functions, and treat all loaded code as one massive program
The JVM, once started, doesn't support loading additional bytecode.
In principle, you could probably develop a Java-centric serverless runtime in which individual functions are dynamically loaded on-demand, and you might be able to achieve pretty good cold-start time this way. However, there are two big reasons why this might not work as well as JavaScript:
While Java is designed for JIT compiling, it has not been optimized for startup time nearly as intensely as V8 has. Today, the JVM is most commonly used in large always-on servers, where startup speed is not that important. V8, on the other hand, has always focused on a browser environment where code is downloaded and executed while a user is waiting, so minimizing startup latency is critical. (It might actually be interesting to look at an alternative Java runtime like Android's Dalvik, which has had much more reason to prioritize startup speed. Maybe it could be the basis of a really fast Java serverless environment!)
Security. V8 and other JavaScript runtimes have been designed with hostile code in mind from the beginning, and have had a huge amount of security research done on them. Java tried to target this too, in the very early days, with "applets", but that usage of Java never caught on. These days, secure sandboxing is not a major concern of Java. Because of this, it is probably too risky to run multiple Java apps that don't trust each other within the same container. And so, you are back to starting a separate container for each application.
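On the "dynamically loaded on-demand" idea: the JVM can define new classes from raw bytecode while it is running, via ClassLoader.defineClass. The sketch below is only an illustration of that mechanism, not a serverless runtime. The names TenantLoader and TenantFn are hypothetical, and the "uploaded" bytecode is a hand-assembled class file for an empty public class so that the example stays self-contained. It also shows that two loaders can each hold their own class with the same name, which is namespace separation rather than the kind of security sandbox discussed above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TenantLoader extends ClassLoader {
    TenantLoader() {
        super(TenantLoader.class.getClassLoader());
    }

    /** Turns raw bytecode received at runtime into a class owned by this loader. */
    Class<?> load(String name, byte[] bytecode) {
        return defineClass(name, bytecode, 0, bytecode.length);
    }

    /**
     * Hand-assembles the class file of an empty "public class <name>" in the
     * default package (a stand-in for bytecode a client would upload).
     */
    static byte[] emptyClass(String name) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(0xCAFEBABE);                 // magic number
            out.writeShort(0);                        // minor version
            out.writeShort(52);                       // major version (Java 8)
            out.writeShort(5);                        // constant pool count = entries + 1
            out.writeByte(7); out.writeShort(2);      // #1 Class -> #2
            out.writeByte(1); out.writeUTF(name);     // #2 Utf8 class name
            out.writeByte(7); out.writeShort(4);      // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object"); // #4 Utf8 superclass name
            out.writeShort(0x0021);                   // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                        // this_class = #1
            out.writeShort(3);                        // super_class = #3
            out.writeShort(0);                        // no interfaces
            out.writeShort(0);                        // no fields
            out.writeShort(0);                        // no methods
            out.writeShort(0);                        // no attributes
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Loading the same bytes through two separate `new TenantLoader()` instances yields two distinct Class objects that are both named TenantFn. Dropping all references to a loader and its classes makes them eligible for unloading, which is roughly how a host could evict idle functions.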
Here is a little background information. I'm working on replacing a dll that has been used in a dll injection technique via the AppInit_DLLs registry entry. Its purpose was to be present in every process and set hooks into the GDI32.dll to gather information about printing. This is kind of a funky way to get what we want. The .dll itself is over 10 years old (written in Visual Studio 97) and we'd like to replace it with something a little less invasive than an injected dll.
It appears SetWindowsHookEx() may be what we are looking for. I've been having some trouble with it, but I've also had some discussions with co-workers about whether or not this tree is worth barking up. Here are some questions that we could not determine:
When we hook a routine out of a dll, for example StartDoc() from GDI32.dll, do we really get a notification every time any other process uses that routine out of that dll? This is kind of the functionality we were getting with our injected .dll and we need the same functionality going forward.
When the hook is triggered, does the hook handling procedure run in the process space of the process that initiated the actual call, or in the process space of the process that set up the hook? My opinion is that it has to run in the process space of the process that called the routine. For example, if a program calls StartDoc() from GDI32.dll, it will have the hook handling procedure code "injected" into its space and executed. Otherwise, there would have to be some inter-process communication that automatically gets set up between the calling process and the process that set up the hook, and I just don't see that as being the case. Also, it's kind of necessary that this hook handling routine run in the process space of the calling process, since one of the things it needs to know is the name of that calling process, and I'm not sure how to get that information if it wasn't actually running in that process.
If the hook handling routine is written using the .NET managed environment, will it break when getting hooked into a process not utilizing the .NET managed environment? We'd really like to move away from C++ here and use C#, but what would happen if our hook gets called from a process that isn't managed? As stated before, I think that our hook handling procedure will run in the process that originally called the routine that was hooked. But if this is true, then I would think that we'd run into trouble if this process was not using the .NET runtime environment, but the incoming hook handling code is.
Yes.
Generally, it's the former: it executes in the context of the process whose event it is hooking.
After a successful call to SetWindowsHookEx, the operating system automatically injects the hook DLL (the one that contains the callback function) into the address space of all target processes that meet the requirements for the specified hook type. (Of course, the hooking code is not necessarily injected immediately.)
The exceptions to this general rule are the low-level keyboard and mouse hooks (WH_KEYBOARD_LL and WH_MOUSE_LL). Since those hook types are not injected into the client processes, the callback is called in the same thread that originally called SetWindowsHookEx.
That last point is important to keep in mind to answer your third question. Because the low-level keyboard and mouse hooks are the only two global hooks that do not require DLL injection, they are also the only two types of hooks that can be written in managed .NET code.
For the other hook types, your concerns expressed in the question are precisely correct. You would need to write these hook DLLs in C or C++. Of course, the rest of your application's pieces could still be written in a managed language. The only thing that matters is the hook DLL.
You might consider looking into either Microsoft Detours or EasyHook.
Recently I have been reading articles about DLL injection and I understand them fairly well.
However, what I don't understand is why APIs such as CreateRemoteThread, WriteProcessMemory (being able to write to the memory of another process) and VirtualAllocEx (being able to allocate memory in the context of another process) were implemented in the first place.
What was the original need for such APIs? Just curious.
WriteProcessMemory was made for ring3 debuggers that need to securely write process memory, most commonly for INT 3 breakpoints or user provided memory edits.
Along the same lines, CreateRemoteThread can also be used for debugging purposes. However, MSDN can enlighten us on CreateRemoteThread a bit more:
A common use of this function is to inject a thread into a process that is being debugged to issue a break. However, this use is not recommended, because the extra thread is confusing to the person debugging the application and there are several side effects to using this technique:
It converts single-threaded applications into multithreaded applications.
It changes the timing and memory layout of the process.
It results in a call to the entry point of each DLL in the process.
IIRC, CreateRemoteThread is also used by debuggers to hook an application's native exception handlers, commonly set by SetUnhandledExceptionFilter, which requires a call from within the target process as the handler is stored in the PEB.
VirtualAllocEx is just how the Windows virtual memory system operates: it needs a context to allocate in, be it the current process, a child process or a remote process. VirtualAlloc is in fact nothing more than a pass-through wrapper around the Ex variant; it just passes a special constant that indicates the handle of the calling process is to be used.
My Delphi program (NOT for .NET) on Windows 7 runs for a couple of days straight and then sort of freezes, with all of its windows painted a bluish grey color as if they were disabled. You simply don't have control over the program anymore and have to kill its process and start it up again. You don't need to reboot the system itself.
Has anyone experienced this or anything similar? If so, what did you do to resolve or try to resolve it?
Thanks,
The context of your question is very vague. We do not have any information about your application, not even its design and architecture.
Nevertheless, my (general-purpose) suggestions are the following:
If your application is not multi-threaded, do the processing in background threads, and leave the main thread free to process GDI messages;
If your application is multi-threaded, take care that all VCL access from background threads is made via a Synchronize call;
If your application is multi-threaded or uses timers, take care that no method is re-entrant (in some circumstances, you may run into a race condition);
Hunt any memory leak;
Use detailed logging of the program execution, logging all exceptions raised, to work out the context of the program hang (it may also be used on the customer side to hunt race conditions);
Download the great free tool named ProcessExplorer (now hosted by Microsoft), and check the state of your frozen program: you will see detailed information about threads, CPU use, memory, network, libraries and handles. This is a must-have for any serious debugging. In particular, track GDI handle leaks (the number of handles should remain stable);
If you did not check it already, take a look at the global Windows system event log: there may be some information here;
Perhaps a third-party component or library is responsible for the process hang: try to isolate the part of your code which may be responsible for this hang.
I have Delphi applications that have been running for months without any problem. The issue is definitely in the application code, not in the Delphi architecture (its RTL and VCL are very stable).
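The first two suggestions (do the work off the main thread, and touch the UI only from the main thread) can be illustrated with a small analogy. This is a Java sketch rather than Delphi code, kept in Java only for consistency with the other examples on this page: a single-threaded executor named "ui" stands in for the VCL main thread, and handing the result to it plays roughly the role that TThread.Synchronize (or TThread.Queue) plays in Delphi. The class name, thread names and the summing workload are all made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UiMarshalling {
    static String run() {
        // Stand-in for the main (UI) thread: all "UI" work is funnelled through it.
        ExecutorService ui = Executors.newSingleThreadExecutor(r -> new Thread(r, "ui"));
        // Background thread that does the long-running processing.
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> new Thread(r, "worker"));
        try {
            return CompletableFuture
                    // Heavy work runs on the worker, so the "ui" thread stays responsive.
                    .supplyAsync(() -> {
                        long sum = 0;
                        for (int i = 1; i <= 1000; i++) sum += i;
                        return sum;
                    }, worker)
                    // Only the finished result is handed over to the "ui" thread.
                    .thenApplyAsync(sum -> Thread.currentThread().getName() + " shows " + sum, ui)
                    .get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            ui.shutdown();
            worker.shutdown();
        }
    }
}
```

`UiMarshalling.run()` returns "ui shows 500500", which confirms that the display step ran on the "ui" thread even though the sum was computed on the worker.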
The bluish grey color is probably the default window color, meaning the window is no longer painting itself. This is consistent with the other symptom that the program no longer responds to any input. This means it isn't processing any window messages.
The easiest way to debug is to run the program in a debugger, and when it's hung just stop it and see where it's at.
If you have a memory leak you may eventually run out of memory in your process space, and it's possible that the program doesn't properly respond to that condition. Check Task Manager to see the amount of memory it's using.
Yes, I have fixed several hangs and other problems in the past years.
I used ProcessExplorer before (to view the stack), but it needs Microsoft debug symbols. And with Delphi you can only create a .map file. With map2dbg I could convert the .map to a .dbg, but this does not always work (note: .dbg is deprecated, newer versions of Microsoft debugging tools do not use them anymore).
So I made my own tool :-)
It is part of "AsmProfiler Sampling" tool:
http://code.google.com/p/asmprofiler/downloads/detailname=AsmProfiler_Sampling%20v1.0.7.13.zip
Click on the "Stack view of Process" button in the first screen.
Then select your process from the list and double click on it:
http://code.google.com/p/asmprofiler/wiki/ProcessStackViewer
Now you can view the stack trace of each thread. If the GUI does not respond, the main thread is hanging, so check the first thread. (Note: sometimes you see an "empty" stack because a function misaligned the stack for calculations etc. In that case, use the raw stack tracing algorithm to get the full stack again, with a lot of false positives, because every pointer on the stack that could possibly be a function is shown!)
Please post the stack here if you can't solve it, so we can take a look at it.
Note: it uses the jclDebug.pas unit of the JEDI library, so it can read .map and .jdbg files (also .dbg and .pdb debug files of Windows dlls) and also internal JCLDEBUG sections (embedded .jdbg file in one .exe). So you must at least build an .exe with detailed (!) map file, see Project Options -> Compiler -> Linking.
While trying to resolve process hanging on CoUninitialize() I came upon a piece of code shared by many of our projects. When a program is going to quit it first calls CoFreeUnusedLibraries(), then immediately OleUninitialize().
While the effect of OleUninitialize() is quite clear, I can't find why one would want to call CoFreeUnusedLibraries() before calling OleUninitialize(). What might be the use of this call at this specific point?
CoFreeUnusedLibraries() will trigger a call to the DllCanUnloadNow for each in-process COM DLL that exports this function. Not sure about threading issues or out-of-process COM components as it relates to this API.
Presumably, whoever wrote the code that calls CoFreeUnusedLibraries before OleUninitialize was attempting to reduce the working set and ensure cleanup.
I don't think there's much value in calling CoFreeUnusedLibraries right before application shutdown (the DLLs will get unloaded anyway).
My experience is that calling CoFreeUnusedLibraries results in crashes and hangs in 3rd party COM DLLs that never had their DllCanUnloadNow implementation tested before release (because not too many apps call this function).
You didn't provide a call stack or hint as to where the hang was occurring (did you break into a debugger to see what DLL is at the top of the stack?). My guess is that you can likely take this call out if you can't fix the offending DLL.
Docs indicate that
This function is provided for compatibility with 16-bit Windows.
Hmmm...
Have you seen this problem report? This call seems redundant to me. Maybe it leaves one or more DLLs in a state where OleUninitialize does not work properly, waiting for some state change due to the earlier call. However, the report does allude to the need to wait a while between calls:
CoFreeUnusedLibraries does not immediately release DLLs that have no active object. There is a ten minute delay for multithreaded apartments (MTAs) and neutral apartments (NAs). For single-threaded apartments (STAs), there is no delay. The ten minute delay for CoFreeUnusedLibraries is to avoid multithread race conditions caused by unloading a component DLL.
There are also comments elsewhere about a 6-minute closedown timeout when using DCOM. Is that applicable to you?