Asynchronous interapplication communication on Mac OS X - macos

On Mac OS X, I have a process which produces JSON objects, and another intermittent process which should consume them. The producer and consumer processes are independent of each other. Objects will be produced no more often than every 5 seconds, and will typically be several hundred bytes, but may range up into megabytes sometimes. The objects should be communicated first-in-first-out. The consumer may or may not be running when the producer is producing, and may or may not read objects immediately.
My boneheaded solution is
Create a directory.
Producer writes each JSON object to a text file, names it with a serial number.
When Consumer launches, it reads and then deletes files in serial-number order, and while it is running, uses FSEvents to watch this directory for new files arriving.
Is there any easier or better way to do this?

The modern way to do this, as of Lion, is to use XPC. Unfortunately, there's no good documentation of it; there's a broad overview in the Daemons and Services guide and a primitive HeaderDoc-generated reference, but the best way to get introduced to it is to watch the session about it from last year's WWDC sessions.
With XPC, you won't have to worry about keeping serial numbers serial, having to contend for a spinning disk, or whether there's enough disk space. Indeed, you don't even have to generate and parse JSON data at all, since XPC's communication mechanism is built around JSON-esque/plist-esque container and value objects.

Assuming you want the consumer to see the old files, this is the way it's been done since the beginning of time - loathsome though it may be.
There's lots of highish tech things that look cleaner - but honestly, they just tend to add complexity and/or deployment infrastructure that add hassle. What you suggest works, and it works well, and it's easy to write and maintain. You might need some kind of sentinel files to track what you are doing for crash recovery, but that's probably about it.
Hell, most people would just poll with a sleep 5. At least you are all all up in the fsevent.
Now if it was accepable to lose the events generated when the listener wasn't around; and perf was paramount - it could get more interesting. :)

Related

Performance, writing multiple files in multiple threads to the same directory

First off I'd like to apologise if this has been answered somewhere else here on SO. Since I couldn't find any questions specific for what I need I decided to ask.
Problem: I need to receive multiple data packets from a device (scanner for example) and store these in disk. My approach was to queue the data packets and have a thread wait until there were enough items in the queue and start a thread that would consume a batch of data packets and store them in disk, the directory in which they are stored doesn't change so multiple threads can, possibly, write to the same directory at the same time.
A collegue of mine mentioned that this was a bad idea since having multiple threads write to the same disk would actually slow it down. Having googled this, I found that there was no clear answer to whether this was true or not, some stated that this isn't a problem for modern computers while others said this wouldn't be a problem if different threads were writing to separate disks like in a RAID setup.
Question: Is writing to the same disk with multiple threads (which are writing a batch of data packet files) a bad idea? Or is there actual benefits to doing so?
P.S: I'd like to add that the solution for this will be for the Windows OS.
Thanks!

What's the best erlang approach to being able to identify a processes identity from its process id?

When I'm debugging, I'm usually looking at about 5000 processes, each of which could be one of about 100 gen_servers, fsms, etc. If I want to know WHAT an erlang process is, I can do:
process_info(pid(0,1,0), initial_call).
And get a result like:
{initial_call,{proc_lib,init_p,5}}
...which is all but useless.
More recently, I hit upon the idea (brace yourselves) of registering each process with a name that told me WHO that process represented. For example, player_1150 is the player process that represents player 1150. Yes, I end up making a couple million atoms over the course of a week-long run. (And I would love to hear comments on the drawbacks of boosting the limit to 10,000,000 atoms when my system runs with about 8GB of real memory unused, if there are any.) Doing this meant that I could, at the console of a live system, query all processes for how long their message queue was, find the top offenders, then check to see if those processes were registered and print out the atom they were registered with.
I've hit a snag with this: I'm moving processes from one node to another. Now a player process can have 3 different names; player_1158, player_1158_deprecating, player_1158_replacement. And I have to make absolutely sure I register and unregister these names with precision timing to make sure that a process is always named and that the appropriate names always exist, AND that I don't try to register a name that some dying process already holds. There is some slop room, since this is only used for console debugging of a live system Nonetheless, the moment I started feeling like this mechanism was affecting how I develop the system (the one that moves processes around) I felt like it was time to do something else.
There are two ideas on the table for me right now. An ets tables that associates process ids with their description:
ets:insert(self(), {player, 1158}).
I don't really like that one because I have to manually keep the tables clean. When a player exits (or crashes) someone is responsible for making sure that his data are removed from the ets table.
The second alternative was to use the process dictionary, storing similar information. When my exploration of a live system led me to wonder who a process is, I could just look at his process dictionary using process_info.
I realize that none of these solutions is functionally clean, but given that the system itself is never, EVER the consumer of these data, I'm not too worried about it. I need certain debugging tools to work quickly and easily, so the behavior described is not open for debate. Are there any convincing arguments to go one way or another (other than the academic "don't use the _, it's evil" canned garbage?) I'd be happy to hear other suggestions and their justifications.
You should try out gproc, it's a very convenient application for keeping process metadata.
A process can be registered with several names and you can associate arbitrary properties to a process (where the key and value can be any erlang term). Also gproc monitors the registered processes and unregisters them automatically if they crash.
If you're debugging gen_servers and gen_fsms while they're still running, I would implement the handle_info functions for these behaviors. When you send each process a {get_info, ReplyPid} tuple, the process in question can send back a term describing its own state, what it is, etc. That way you don't have to keep track of this information outside of the process itself.
Isac mentions there is already a built in way to do this

How safe is it to store sessions with Redis?

I'm currently using MySql to store my sessions. It works great, but it is a bit slow.
I've been asked to use Redis, but I'm wondering if it is a good idea because I've heard that Redis delays write operations. I'm a bit afraid because sessions need to be real-time.
Has anyone experienced such problems?
Redis is perfect for storing sessions. All operations are performed in memory, and so reads and writes will be fast.
The second aspect is persistence of session state. Redis gives you a lot of flexibility in how you want to persist session state to your hard-disk. You can go through http://redis.io/topics/persistence to learn more, but at a high level, here are your options -
If you cannot afford losing any sessions, set appendfsync always in your configuration file. With this, Redis guarantees that any write operations are saved to the disk. The disadvantage is that write operations will be slower.
If you are okay with losing about 1s worth of data, use appendfsync everysec. This will give great performance with reasonable data guarantees
This question is really about real-time sessions, and seems to have arisen partly due to a misunderstanding of the phrase 'delayed write operations' While the details were eventually teased out in the comments, I just wanted to make it super-duper clear...
You will have no problems implementing real-time sessions.
Redis is an in-memory key-value store with optional persistence to disk. 'Delayed write operations' refers to writes to disk, not the database in general, which exists in memory. If you SET a key/value pair, you can GET it immediately (i.e in real-time). The policy you select with regards to persistence (how much you delay the writes) will determine the upper-bound for how much data could be lost in a crash.
Basically there are two main types available: async snapsnots and fsync(). They're called RDB and AOF respectively. More on persistence modes on the official page.
The signal handling of the daemonized process syncs to disk when it receives a SIGTERM for instance, so the data will still be there after a reboot. I think the daemon or the OS has to crash before you'll see an integrity corruption, even with the default settings (RDB snapshots).
The AOF setting uses an Append Only File that logs the commands the server receives, and recreates the DB from scratch on cold start, from the saved file. The default disk-sync policy is to flush once every second (IIRC) but can be set to lock and write on every command.
Using both the snapshots and the incremental log seems to offer both a long term don't-mind-if-I-miss-a-few-seconds-of-data approach with a more secure, but costly incremental log. Redis supports clustering out of the box, so replication can be done too it seems.
I'm using the default RDB setting myself and saving the snapshots to remote FTP. I haven't seen a failure that's caused a data loss yet. Acute hardware failure or power outages would most likely, but I'm hosted on a VPS. Slim chance of that happening :)

Best form of IPC for a decentralized roguelike?

I've got a project to create a roguelike that in some way abstracts the UI from the engine and the engine from map creation, line-of-site, etc. To narrow the focus, i first want to just get the UI (player's client) and engine working.
My current idea is to make the client basically a program that decides what one character (player, monsters) will do for its turn and waits until it can move again. So each monster has a client, and so does the player. The player's client prints the map, waits for input, sends it to the engine, and tells the player what happened. The monster's client does the same except without printing the map and using AI instead of keyboard input.
Before i go any futher, if this seems somehow an obfuscated way of doing things, my goal is to learn, not write a roguelike. It's the journy, not the destination.
And so i need to choose what form of ipc fits this model best.
My first attempt used pipes because they're simplest and i wrote a
UI for the player and a program to pipe in instructions such as
where to put the map and player. While this works, it only allows
one client--communicating through stdin and out.
I've thought about making the engine a daemon that looks in a spool
where clients, when started, create unique-per-client temp files to
give instructions to the engine and recieve feedback.
Lastly, i've done a little introductory programing with sockets.
They seem like they might be the way to go, and would allow the game
to perhaps someday be run over a net. I'd like to, if possible, use
a simpler solution, and since i'm unfamiliar with them, it's more
error prone.
I'm always open to suggestions.
I've been playing around with using these combinations for a similar problem (multiple clients talking via a single daemon on the local box, with much of the intelligence shoved off into the clients).
mmap for sharing large data blobs, with unix domain sockets, messages queues, or named pipes for notification
same, but using individual files per blob instead of munging them all together in an mmap
same, but without the files or mmap (in other words, more like conventional messaging)
In general I like the idea of breaking things up into separate executables this way -- it certainly makes testing easier, for instance. I think the choice of method comes down to usage patterns -- how large are messages, how persistent does the data in them need to be, can you afford the cost of multiple trips through the network stack for a socket-based message, that sort of thing. The fact that you're sticking to Linux makes things easy in terms of what's available -- you don't need to worry about portability of message queues, for instance.
This one's also applicable: https://stackoverflow.com/a/1428542/1264797

In an ISAPI filter, what is a good approach for a common logfile for multiple processes?

I have an ISAPI filter that runs on IIS6 or 7. When there are multiple worker processes ("Web garden"), the filter will be loaded and run in each w3wp.exe.
How can I efficiently allow the filter to log its activities in a single consolidated logfile?
log messages from the different (concurrent) processes must not interfere with each other. In other words, a single log message emitted from any of the w3wp.exe must be realized as a single contiguous line in the log file.
there should be minimal contention for the logfile. The websites may serve 100's of requests per second.
strict time ordering is preferred. In other words if w3wp.exe process #1 emits a message at t1, then process #2 emits a message at t2, then process #1 emits a message at t3, the messages should appear in proper time order in the log file.
The current approach I have is that each process owns a separate logfile. This has obvious drawbacks.
Some ideas:
nominate one of the w3wp.exe to be "logfile owner" and send all log messages through that special process. This has problems in case of worker process recycling.
use an OS mutex to protect access to the logfile. Is this high-perf enough? In this case each w3wp.exe would have a FILE on the same filesystem file. Must I fflush the logfile after each write? Will this work?
any suggestions?
At first I was going to say that I like your current approach best, because each process shares nothing, and then I realized, that, well, they are probably all sharing the same hard drive underneath. So, there's still a bottleneck where contention occurs. Or maybe the OS and hard drive controllers are really smart about handling that?
I think what you want to do is have the writing of the log not slow down the threads that are doing the real work.
So, run another process on the same machine (lower priority?) which actually writes the log messages to disk. Communicate to the other process using not UDP as suggested, but rather memory that the processes share. Also known, confusingly, as a memory mapped file. More about memory mapped files. At my company, we have found memory mapped files to be much faster than loopback TCP/IP for communication on the same box, so I'm assuming it would be faster than UDP too.
What you actually have in your shared memory could be, for starters, an std::queue where the pushs and pops are protected using a mutex. Your ISAPI threads would grab the mutex to put things into the queue. The logging process would grab the mutex to pull things off of the queue, release the mutex, and then write the entries to disk. The mutex is only protected the updating of shared memory, not the updating of the file, so it seems in theory that the mutex would be held for a briefer time, creating less of a bottleneck.
The logging process could even re-arrange the order of what it's writing to get the timestamps in order.
Here's another variation: Contine to have a separate log for each process, but have a logger thread within each process so that the main time-critical thread doesn't have to wait for the logging to occur in order to proceed with its work.
The problem with everything I've written here is that the whole system - hardware, os, the way multicore CPU L1/L2 cache works, your software - is too complex to be easily predictable by a just thinking it thru. Code up some simple proof-of-concept apps, instrument them with some timings, and try them out on the real hardware.
Would logging to a database make sense here?
I've used a UDP-based logging system in the past and I was happy with this kind of solution.
The logs are sent via UDP to a log-collector process which his in charge of saving it to a file on a regular basis.
I don't know if it can work in your high-perf context but I was satisfied with that solution in a less under-stress application.
I hope it helps.
Rather than an OS Mutex to control access to the file, you could just use Win32 file locking mechanisms with LockFile() and UnlockFile().
My suggestion is send messages asynchronously (UDP) to a process that will take charge of recording the log.
The process will :
- one thread receiver puts the messages in a queue;
- one thread is responsible for removing the messages from the queue putting in a time ordered list;
- one thread monitor messages in the list and only messages with time length greater than the minimum should be saved in file (to prevent a delayed message to be written out of order).
You could continue logging to separate files and find/write a tool to merge them later (perhaps automated, or you could just run it at the point you want to use the files.)
Event Tracing for Windows, included in Windows Vista and later, provides a nice capability for this.
Excerpt:
Event Tracing for Windows (ETW) is an efficient kernel-level tracing facility that lets you log kernel or application-defined events to a log file. You can consume the events in real time or from a log file and use them to debug an application or to determine where performance issues are occurring in the application.

Resources