Smalltalk's runtime state - runtime

For JavaScript, there exists this excellent intro that explains the runtime state:
For Smalltalk, I have never found a similar overview of how the runtime and image snapshots are structured.
It is said that a Smalltalk image consists of objects that can send each other messages. This creates many questions:
Is only one object ever active at a time?
Is there a "root scheduler" that starts up designated "process" objects?
Does each suspended image have a list of active objects?
What happens if two active objects send a message to a third one?
Can only one message be handled at a time? What is the level of "atomicity"?
How do two active objects communicate?
Does every object have an "inbox" of messages received, but not yet processed?
Does every object have an event loop?

Is only one object ever active at a time?
Yes. The system can schedule different "processes" (instances of the class Process) running at different priorities, but these take control one at a time. Among processes of the same priority the scheduling is cooperative: a process keeps running until it explicitly yields or waits on a semaphore (an instance of the class Semaphore), or until a higher-priority process becomes ready and preempts it.
Is there a "root scheduler" that starts up designated "process" objects?
Yes, the global Processor (an instance of ProcessorScheduler) keeps and manages the prioritized list of processes that are ready to run (the others being the ones that are waiting on some semaphore).
Does each suspended image have a list of active objects?
The suspended image is nothing but the image. So, yes, it has everything in it, in particular the Processor, which knows who the activeProcess is.
What happens if two active objects send a message to a third one?
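Messages are sent one at a time. Only one process runs at any moment, so the two sends are handled one after the other (even though the running process may be interrupted by the virtual machine).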
Can only one message be handled at a time? What is the level of "atomicity"?
The level of atomicity (non-interruptibility) is essentially the bytecode: message sends, assignments, etc., that is, any operation perceived as atomic by the programmer.
How do two active objects communicate?
Objects communicate by means of message sends.
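The scheduling model described in these answers can be sketched outside Smalltalk. The following is a minimal Python sketch, not how any Smalltalk VM is implemented: the `Scheduler` class, its `fork` and `run` methods, and the `worker` generator are hypothetical names invented for illustration. It models one ready queue per priority, runs exactly one process at a time, and treats a generator's `yield` as the explicit "yield the processor" step.

```python
from collections import deque


class Scheduler:
    """Toy stand-in for the Processor: one ready queue per priority (hypothetical)."""

    def __init__(self):
        self.ready = {}   # priority -> deque of (name, generator)
        self.trace = []   # names in the order they were given control

    def fork(self, name, body, priority):
        """Make a process ready to run at the given priority."""
        self.ready.setdefault(priority, deque()).append((name, body))

    def run(self):
        """Run until no process is ready; only one process is active at a time."""
        while any(self.ready.values()):
            # Always pick the highest priority that has a ready process.
            top = max(p for p, q in self.ready.items() if q)
            name, body = self.ready[top].popleft()
            self.trace.append(name)
            try:
                next(body)                            # runs until the process yields
                self.ready[top].append((name, body))  # back of its own priority queue
            except StopIteration:
                pass                                  # the process has terminated


def worker(steps):
    """A process body that explicitly yields control `steps` times."""
    for _ in range(steps):
        yield


def demo():
    s = Scheduler()
    s.fork("low", worker(1), priority=1)
    s.fork("a", worker(1), priority=2)
    s.fork("b", worker(1), priority=2)
    s.run()
    return s.trace
```

Calling `demo()` returns the order in which the processes were given control: the two priority-2 processes alternate, and the priority-1 process only runs once they have finished.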

To some extent, the answer to your question depends on the virtual machine you are using. While many Smalltalk implementations and derivatives stick pretty close to the original concept, the details can vary a lot.
Since your question suggests you are interested in conceptual answers, I recommend reading the original Smalltalk "blue book".
Smalltalk-80: The Language and Its Implementation. By Adele Goldberg and David Robson, 1983, Addison-Wesley
The book talks in depth about the design of the system and the implementation of the core classes, and it also has a few sections at the end that provide specifications for the VM, the interpreter, the object memory, etc.


How to identify a process in Windows? Kernel and User mode

In Windows, what is the formal way of identifying a process uniquely? I am not talking about the PID, which is allocated dynamically, but a unique ID or name that is permanent to that process. I know that every program/process has a security descriptor, but it seems to hold SIDs for the logged-in user and group (not the process). We cannot use the path and name of the executable from which the process starts, as that can change.
My aim is to identify a process in kernel mode and allow it to perform certain operations. What is the easiest and best way of doing this?
Your question is too vague to answer properly. For example how could the path possibly change (without poking around in kernel memory) after creation of a process? And yes, I am aware that one could hook into the memory-mapping process during process creation to replace the image originally destined to be loaded with another. Point is that a process is merely one instance of running a given executable. And it's not clear what exact tampering attempts you want to counter here.
But from kernel mode you do have the ability to simply use the pointer to the EPROCESS structure. No need to use the PID, although that will be unique while the process is still alive.
So assuming your process uses an IRP to communicate to the driver (whether it be WriteFile, ReadFile, DeviceIoControl or something more exotic), in order to register itself, you can use IoGetCurrentProcess to get the PEPROCESS value which will be unique to the process.
While the structure itself is not officially documented, hints can be gleaned from the "Windows Internals" book (in its various incarnations), the dt (Display Type) command in WinDbg (and friends) as well as from third-party resources on the internet (e.g. here, specific to Vista).
The process objects are kept in several linked lists. So if you know the (officially undocumented!!!) layout for a particular OS version, you may traverse the lists to get from one to the next process object (i.e. EPROCESS structure).
Cautionary notes
Make sure to reference the process object by using the respective object manager routines. Otherwise you cannot be certain it is safe either to reach into these structures (which is unsafe anyway, since you cannot rely on their layout across OS versions) or to pass the pointer to functions that expect a PEPROCESS.
As a side-note: Harry Johnston is of course right to assert that a privileged user can insert arbitrary (well almost arbitrary) code into the TCB in order to thwart your protective measures. In the end it is going to be an arms race.
Also keep in mind that, similar to PIDs, the value of the PEPROCESS may theoretically be recycled. In both cases you can counter this by invalidating whatever internal state you keep in your driver that allows the process to do its magic, whenever the process goes down. Using something like PsSetCreateProcessNotifyRoutine would seem to be a good method here. That callback receives the process ID (typed as HANDLE) rather than a real handle, so translate it to a PEPROCESS value with PsLookupProcessByProcessId; ObReferenceObjectByHandle is the routine to use when you hold an actual process handle.
An alternative of countering recycling of the PID/PEPROCESS is by keeping a reference to the process object and thus keeping it in a kind of undead state (similar to not closing a handle in user mode), although the main thread may have finished.

What's the best Erlang approach to identifying a process's identity from its process ID?

When I'm debugging, I'm usually looking at about 5000 processes, each of which could be one of about 100 gen_servers, fsms, etc. If I want to know WHAT an erlang process is, I can do:
process_info(pid(0,1,0), initial_call).
And get a result like:
...which is all but useless.
More recently, I hit upon the idea (brace yourselves) of registering each process with a name that told me WHO that process represented. For example, player_1150 is the player process that represents player 1150. Yes, I end up making a couple million atoms over the course of a week-long run. (And I would love to hear comments on the drawbacks of boosting the limit to 10,000,000 atoms when my system runs with about 8GB of real memory unused, if there are any.) Doing this meant that I could, at the console of a live system, query all processes for how long their message queue was, find the top offenders, then check to see if those processes were registered and print out the atom they were registered with.
I've hit a snag with this: I'm moving processes from one node to another. Now a player process can have 3 different names: player_1158, player_1158_deprecating, player_1158_replacement. And I have to make absolutely sure I register and unregister these names with precision timing to make sure that a process is always named and that the appropriate names always exist, AND that I don't try to register a name that some dying process already holds. There is some slop room, since this is only used for console debugging of a live system. Nonetheless, the moment I started feeling like this mechanism was affecting how I develop the system (the one that moves processes around) I felt like it was time to do something else.
There are two ideas on the table for me right now. The first is an ets table that associates process IDs with their description:
ets:insert(process_descriptions, {self(), {player, 1158}}). % process_descriptions is a hypothetical named table
I don't really like that one because I have to manually keep the tables clean. When a player exits (or crashes) someone is responsible for making sure that his data are removed from the ets table.
The second alternative was to use the process dictionary, storing similar information. When my exploration of a live system led me to wonder who a process is, I could just look at his process dictionary using process_info.
I realize that none of these solutions is functionally clean, but given that the system itself is never, EVER the consumer of these data, I'm not too worried about it. I need certain debugging tools to work quickly and easily, so the behavior described is not open for debate. Are there any convincing arguments to go one way or another (other than the academic "don't use the _, it's evil" canned garbage?) I'd be happy to hear other suggestions and their justifications.
You should try out gproc, it's a very convenient application for keeping process metadata.
A process can be registered with several names and you can associate arbitrary properties to a process (where the key and value can be any erlang term). Also gproc monitors the registered processes and unregisters them automatically if they crash.
If you're debugging gen_servers and gen_fsms while they're still running, I would implement the handle_info functions for these behaviors. When you send each process a {get_info, ReplyPid} tuple, the process in question can send back a term describing its own state, what it is, etc. That way you don't have to keep track of this information outside of the process itself.
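The idea of asking a process to describe itself is not specific to Erlang. As a rough, language-neutral illustration, here is a minimal Python sketch that uses threads with a mailbox queue in place of Erlang processes. The `Worker` class, the `describe` helper and the message shapes `("get_info", reply_queue)` and `("stop",)` are assumptions made up for this example; they mirror the `{get_info, ReplyPid}` tuple above but are not any library's API.

```python
import queue
import threading


class Worker(threading.Thread):
    """A thread with a mailbox that can describe itself on request (hypothetical)."""

    def __init__(self, kind, entity_id):
        super().__init__(daemon=True)
        self.kind = kind
        self.entity_id = entity_id
        self.mailbox = queue.Queue()

    def run(self):
        while True:
            msg = self.mailbox.get()
            if msg[0] == "get_info":
                # Reply on the queue supplied by the caller, like ReplyPid.
                _, reply_to = msg
                reply_to.put({"kind": self.kind, "id": self.entity_id})
            elif msg[0] == "stop":
                return


def describe(worker, timeout=1.0):
    """Ask a worker what it is; raises queue.Empty if it does not answer in time."""
    reply = queue.Queue()
    worker.mailbox.put(("get_info", reply))
    return reply.get(timeout=timeout)
```

Because the description lives inside the worker, nothing has to be cleaned up elsewhere when the worker exits. The trade-off is that a worker stuck behind a long mailbox answers late or not at all, which is why `describe` takes a timeout.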
Isac mentions there is already a built in way to do this

Address Book thread safety and performance

My reading of the Address Book documentation and my understanding of the underlying CoreData implementation suggest that Address Book should be thread safe, and that making queries from multiple threads should pose no problems. But I'm having trouble finding any explicit discussion of thread safety in the docs. This raises a few questions:
Is it safe to use +sharedAddressBook on multiple threads for read-only access? I believe the answer is yes.
For write-access on background threads, it appears that you should use +addressBook instead (and save your changes manually). Do I understand this correctly?
Has anyone investigated the performance impact of making multiple simultaneous queries to Address Book on multiple threads? This should be very similar to the performance of making multiple CoreData queries on multiple threads. My sense is that I would gain little by making parallel queries since I assume they will serialize when they hit SQLite, but I'm not certain here.
I need to make dozens of queries (some complex) against AddressBook and am doing so on a background thread using NSOperation to avoid blocking the UI (which it currently does). My underlying question is whether it makes sense to set the max concurrent operations to a value larger than 1, and whether there is any danger in doing so if the application may also be writing to AddressBook at the same time on another thread.
Unless an API says it is thread safe, it is not. Even if the current implementation happens to be thread safe, it might not be in the future. In other words, do not use AB from multiple threads.
As an aside, what about it being CoreData based makes you think it would be thread safe? CoreData uses a thread confinement model where it is only safe to access a context on a single thread, all the objects from the context must be accessed on the same thread.
That means that sharedAddressBook will not be thread safe if it keeps an NSManagedObjectContext around to use. It would only be safe if AB creates a new context every time it needs to do something and immediately disposes of it, or if it creates a context per thread and always uses the appropriate context (probably by storing a ref to it in the threadDictionary). In either event it would not be safe to store anything as NSManagedObjects since the contexts would be constantly destroyed, which means every ABRecord would have to store an NSManagedObjectID so it could reconstitute the object in the appropriate context whenever it needed it.
Clearly all of that is possible, it may be what is done, but it is hardly the obvious implementation.
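The "one context per thread" option mentioned above can be illustrated with a minimal Python sketch. This is not how Address Book or CoreData is implemented; the `Context` class and the `current_context` and `contexts_for_threads` helpers are hypothetical, and `threading.local` plays the role of the per-thread dictionary.

```python
import threading


class Context:
    """Stand-in for a context that must only be used on its creating thread."""

    def __init__(self):
        self.owner = threading.get_ident()

    def fetch(self, key):
        # Fail loudly if the confinement rule is broken.
        if threading.get_ident() != self.owner:
            raise RuntimeError("context used off its owning thread")
        return (key, self.owner)


_local = threading.local()   # per-thread storage, like a thread dictionary


def current_context():
    """Return the calling thread's context, creating it on first use."""
    if not hasattr(_local, "ctx"):
        _local.ctx = Context()
    return _local.ctx


def contexts_for_threads(n_threads):
    """Start n_threads threads and collect the context each one was given."""
    seen = []
    lock = threading.Lock()

    def work():
        ctx = current_context()
        ctx.fetch("record")          # legal: we are on the owning thread
        with lock:
            seen.append(ctx)

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Each thread gets its own `Context`, and repeated calls on one thread return the same object. Handing a `Context` to another thread and calling `fetch` there raises an error, which is the kind of misuse the confinement model forbids.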

Can I be sure that the code I write is always executed in the same thread?

I normally work on single threaded applications and have generally never really bothered with dealing with threads. My understanding of how things work - which certainly, may be wrong - is that as long as we're always dealing with single threaded code (i.e. no forks or anything like that) it will always be executed in the same thread.
Is this assumption correct? I have a fuzzy idea that UI libraries/frameworks may spawn off threads of their own to handle GUI stuff (which accounts for the fact that the Windows task manager tells me that my 'single threaded' application is actually running on 10 threads) but I'm guessing that this shouldn't affect me?
How does this apply to COM? For instance, if I were to create an instance of a COM component in my code; and that COM component writes some information to a thread-based location (using System.Threading.Thread.GetData for instance) will my application be able to get hold of that information?
So in summary:
In single threaded code, can I be sure that whatever I store in a thread-based location can be retrievable from anywhere else in the code?
If that single threaded code were to create an instance of a COM component which stores some information in a thread-based location, can that be similarly retrievable from anywhere else?
UI usually has the opposite constraint (sadly): it's single threaded and everything must happen on that thread.
The easiest way to check if you are always in the same thread (for, say, a function) is to have an integer variable set at -1, and have a check function like (say you are in C#):
void AssertSingleThread()
{
    // Remember the first calling thread, then insist every later call comes from it.
    if (m_ThreadId < 0) m_ThreadId = Thread.CurrentThread.ManagedThreadId;
    Debug.Assert(m_ThreadId == Thread.CurrentThread.ManagedThreadId);
}
That said:
I don't really understand question #1. Why store something in a thread-based location if your purpose is to have a global scope?
About the second question, most COM code runs on a single thread and, most often, on the thread where your UI message processing lives - this is because most COM code is designed to be compatible with VB6 which is single-thread.
The reason your program has about 10 threads is that both Windows (if you use some of its features, like completion ports or some kinds of timers) and the CLR (for example for the GC or, again, some types of timers) may create threads in your process space (technically, any program with enough privileges can too).
Think about having the model of having a single dataStore class running in your mainThread that all threads can read and write their instance variables to. This will avoid a lot of problems that might arise accessing threads all over the shop.
Simple idea, until you reach the fun part of threading. Concurrency and synchronization; simply, if you have two threads that want to read and write to the same variable inside dataStore at the same time, you have a problem.
Java handles this by allowing you to declare a variable or method synchronized, allowing only one thread access at a time.
I believe some .NET objects have Lock and Synchronized methods defined on them, but I know no more than this.
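The shared data store with synchronized access described above can be sketched in a few lines. This is a minimal Python sketch rather than Java or .NET code; the `DataStore` class and the `hammer` helper are hypothetical names, and a `threading.Lock` stands in for the "only one thread at a time" guarantee.

```python
import threading


class DataStore:
    """A shared store whose reads and writes are serialized by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def add(self, key, amount):
        # The read-modify-write happens under the lock, so no update is lost.
        with self._lock:
            self._values[key] = self._values.get(key, 0) + amount

    def get(self, key):
        with self._lock:
            return self._values.get(key, 0)


def hammer(store, n_threads=8, n_iter=10_000):
    """Increment one counter from several threads and return the final value."""

    def work():
        for _ in range(n_iter):
            store.add("counter", 1)

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.get("counter")
```

With the lock in place, the final count equals `n_threads * n_iter`. Without it, two threads could read the same old value and one increment would be lost.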

Is it safe to manipulate objects that I created outside my thread if I don't explicitly access them on the thread which created them?

I am working on a cocoa software and in order to keep the GUI responsive during a massive data import (Core Data) I need to run the import outside the main thread.
Is it safe to access those objects without using locks, even though I created them in the main thread, if I don't explicitly access them from the main thread while the import thread is running?
With Core Data, you should have a separate managed object context to use for your import thread, connected to the same coordinator and persistent store. You cannot simply throw objects created in a context used by the main thread into another thread and expect them to work. Furthermore, you cannot do your own locking for this; you must at minimum lock the managed object context the objects are in, as appropriate. But if those objects are bound to by your views and controls, there are no "hooks" to which you can add that locking of the context.
There's no free lunch.
Ben Trumbull explains some of the reasons why you need to use a separate context, and why "just reading" isn't as simple or as safe as you might think, in this great post from late 2004 on the webobjects-dev list. (The whole thread is great.) He's discussing the Enterprise Objects Framework and WebObjects, but his advice is fully applicable to Core Data as well. Just replace "EC" with "NSManagedObjectContext" and "EOF" with "Core Data" in the meat of his message.
The solution to the problem of sharing data between threads in Core Data, like the Enterprise Objects Framework before it, is "don't." If you've thought about it further and you really, honestly do have to share data between threads, then the solution is to keep independent object graphs in thread-isolated contexts, and use the information in the save notification from one context to tell the other context what to re-fetch. -[NSManagedObjectContext refreshObject:mergeChanges:] is specifically designed to support this use.
I believe that this is not safe to do with NSManagedObjects (or subclasses) that are managed by a CoreData NSManagedObjectContext. In general, CoreData may do many tricky things with the state of managed objects, including firing faults related to those objects in separate threads. In particular, [NSManagedObject initWithEntity:insertIntoManagedObjectContext:] (the designated initializer for NSManagedObjects as of OS X 10.5) does not guarantee that the returned object is safe to pass to another thread.
Using CoreData with multiple threads is well documented on Apple's dev site.
The whole point of using locks is to ensure that two threads don't try to access the same resource. If you can guarantee that through some other mechanism, go for it.
Even if it's safe, it's not best practice to share data between threads without synchronizing access to those fields. It doesn't matter which thread created the object; what matters is whether more than one line of execution (thread/process) accesses the object at the same time, since that can lead to data inconsistency.
If you're absolutely sure that only one thread will ever access this object, then it'd be safe not to synchronize the access. Even then, I'd rather put synchronization in my code now than wait until later, when a change in the application introduces a second thread that shares the same data without concern for synchronizing access.
Yes, it's safe. A pretty common pattern is to create an object, then add it to a queue or some other collection. A second "consumer" thread takes items from the queue and does something with them. Here, you'd need to synchronize the queue but not the objects that are added to the queue.
It's NOT a good idea to just synchronize everything and hope for the best. You will need to think very carefully about your design and exactly which threads can act upon your objects.
Two things to consider are:
You must be able to guarantee that the object is fully created and initialised before it is made available to other threads.
There must be some mechanism by which the main (GUI) thread detects that the data has been loaded and all is well. To be thread safe this will inevitably involve locking of some kind.
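The queue hand-off pattern and the two considerations above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not Cocoa or Core Data code: the `producer`, `consumer` and `run_import` functions and the `_DONE` sentinel are hypothetical names. Only the queue is synchronized; each record is fully built before it is enqueued, and the creating thread does not touch it afterwards.

```python
import queue
import threading

_DONE = object()   # sentinel telling the consumer that no more items will come


def producer(q, n_items):
    for i in range(n_items):
        record = {"id": i, "payload": f"row-{i}"}   # fully built before hand-off
        q.put(record)                               # producer never touches it again
    q.put(_DONE)


def consumer(q, results):
    while True:
        item = q.get()
        if item is _DONE:
            return
        item["imported"] = True      # only the consumer mutates the record now
        results.append(item)


def run_import(n_items):
    """Run one producer and one consumer; return the records once both are done."""
    q = queue.Queue()                # the only synchronized object in the sketch
    results = []
    t_cons = threading.Thread(target=consumer, args=(q, results))
    t_prod = threading.Thread(target=producer, args=(q, n_items))
    t_cons.start()
    t_prod.start()
    t_prod.join()
    t_cons.join()                    # join doubles as the "import finished" signal
    return results
```

Here `join` is the mechanism by which the caller learns that the data has been loaded. A GUI application would not block its main thread like this; it would instead have the worker post a completion message back to the main thread.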
Yes, you can do it, and it will be safe ... until the second programmer comes around and does not understand the assumptions you have made. That second (or 3rd, 4th, 5th, ...) programmer is likely to start using the object in a non-safe way (in the creator thread). The problems caused could be very subtle and difficult to track down. For that reason alone, and because it's so tempting to use this object in multiple threads, I would make the object thread safe.
To clarify, (thanks to those who left comments):
By "thread safe" I mean programmatically devising a scheme to avoid threading issues. I don't necessarily mean devising a locking scheme around your object. You could find a way in your language to make it illegal (or very hard) to use the object in the creator thread, for example by limiting the scope, in the creator thread, to the block of code that creates the object. Once created, pass the object over to the user thread, making sure that the creator thread no longer has a reference to it.
For example, in C++
void CreateObject()
{
    Object* sharedObj = new Object();
    PassObjectToUsingThread(sharedObj); // this function would be system dependent
}
Then, in your creating thread, you no longer have access to the object after its creation; responsibility is passed to the using thread.
