EXE size bloats while using Websocketpp - boost

I've built a very basic EXE which uses the Websocketpp client. It just connects to a WebSocket server, then sends and receives a message.
I've used VS 2013.
I'm noticing that the size of the EXE is mammoth. It's like 2.3 MB for Release and 6 MB for Debug.
Any ideas as to how I can reduce the size of the EXE?

WebSocket++ author here. The sizes you quote seem about the right ballpark. Keep in mind that a "very basic sample" like the echo_server (which produces a ~1MB executable on linux) does a lot more than you might think based on the ~50 lines in the program source.
Out of the box any WebSocket++/Asio based program is a high performance event based client/server system and includes code for DNS resolution, IPv4 and IPv6, timers, SHA1/MD5 hashing, base64 encoding, UTF8 validation, logging, thread safety, and parsers for URIs, HTTP, and multiple WebSocket protocol versions. Just because you only use these capabilities to echo back messages doesn't make this a trivial program.
Some observations/notes on the topic:
1. Due to the way templates work, the code for WebSocket++, ASIO, and the STL is compiled into your program rather than sitting in an externally linked library. This may make a WebSocket++ or Asio program look artificially larger than a program that links to an external library.
2. The situation described in #1 can sometimes end up more efficient than an external library because this program will only include the parts of the library that your code actually uses, rather than all parts. I.e. If you don't instantiate a client endpoint no client code will be included. If your config disables TLS encryption, logging, or the thread safety features they will also not be included. Again due to the way templates work this can go both ways. For example: A program that includes both a client and a server will have some potentially unnecessary duplication.
3. The size of WebSocket++'s code is largely correlated to the number of different endpoint configs that you use and the options enabled in each of those configs. These represent a fixed size no matter what else your program does. If your program does little, they will make up a large proportion of the code. If your program does a lot that proportion will shrink.
4. WebSocket++ is fairly modular (though this is less well documented right now). If you are really concerned about code size (small embedded systems perhaps?) and don't actually need all the features that Asio and WebSocket++ bring out of the box, you can set up a custom config that either removes many features or replaces them with your own space optimized implementations.
   Say you only ever need to service one non-TLS connection with no DNS lookup and no security timeouts in a guaranteed single threaded program with no logging. You can implement your own network transport policy based on your native OS socket library that doesn't include all the stuff that Asio does. You can also stub out the locking/concurrency and logger policies you don't need.

Related

MPI client server connection with Singleton MPI_INIT

I want to implement (in C++) a feature, using MPI, in an existing (non-MPI) application. I am thinking of using mpich-3.4.1 for this.
I am planning to create a .so file for that feature, which the original application can link to. I initially thought of having a function in the .so file that starts with MPI_Init(), ends with MPI_Finalize() and, in between, calls all the MPI APIs required to do the parallel job. As part of the MPI job, the new feature makes the current application an MPI server by calling APIs like 'MPI_Open_port' and 'MPI_Comm_accept'. Other worker processes (possibly running on different machines) connect to this server, send/receive messages, and complete a heavy computation in parallel. The application then resumes its other non-MPI work.
It seems to me that Singleton MPI_INIT mechanism will be useful for this. I found the following page on Singleton Init:
https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/node254.htm
This page says, "A high-quality implementation will allow any process (including those not started with a ``parallel application'' mechanism) to become an MPI process by calling MPI_INIT. Such a process can then connect to other MPI processes...".
However, a comment in mpich-3.4.1/src/mpi/init/init.c says, "The MPI standard does not say what a program can do before an 'MPI_INIT' or after an 'MPI_FINALIZE'. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output."
Based on the above comment, it seems we should not have MPI_Init(NULL, NULL) and MPI_Finalize() as part of any implementation in a library. In that case, I am thinking of having the init and finalize calls in the original application's main function and making the rest of the API calls from the .so file. My original application is a large, working piece of software and, in some situations, may not need to execute my MPI feature at all.
My questions are:
(1) Does it make sense to have MPI_Init(NULL, NULL) and MPI_Finalize() called in the main function of this application, and the rest of the MPI functionality in a .so file?
(2) Once MPI_Init(NULL, NULL) is called in the main, would it interfere with the normal execution of the software in any way? Would there be any performance impact on the existing application?
(3) Is there an MPI implementation that handles this better?
(4) Is MPI a good approach to handle this requirement, or other mechanisms like ZeroMQ better? In the comments made by Wesley Bland in the following link, he says that "MPI may not be right for you if you're looking for a client/server model. Yes, it's possible, but it's not really optimized for that use case and you might have better luck using a different communication mechanism". Is that true in 2022?
client relationship within MPI server

ZeroMQ and actor model

I'm having problems scaling up an application that uses the actor model and zeromq. To put it simply: I'm trying to create thousands of threads that communicate via sockets, similar to what one would do with Erlang-style message passing. I'm not doing it for multicore/performance reasons, but because framing it this way gives me very clean code.
From a philosophical point of view it sounds as if this is what zmq developers would like to achieve, e.g.
http://zeromq.org/whitepapers:multithreading-magic
However, it seems as if there are some practical limitations. At 1024 inproc sockets I start getting the "ZMQError: Too many open files" error. TCP gives me the typical "Assertion failed: fds.size () <= FD_SETSIZE" crash.
Why do inproc sockets have this limit?
To get it to work I've had to group together items to share a socket. Is there a better way?
Is zmq just the wrong tool for this kind of job? i.e. it's still more a network library than an actor message passing library?
ZMQ uses file descriptors as the "resource unit" for inproc connections. There is a limit for file descriptors set by the OS, you should be able to modify that (found several potential avenues for Windows with a quick Google search), though I don't know what the performance impact might be.
It looks like this is related to the ZMQ library using C code that is portable among systems for opening new files, rather than Windows native code that doesn't suffer from this same limitation.
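For reference, here is a minimal sketch of inspecting and raising the per-process descriptor limit mentioned above. It assumes a POSIX system, because Python's `resource` module is not available on Windows, and the function name `raise_fd_soft_limit` is made up for illustration. It only raises the soft limit up to the existing hard limit, and it does not check whether that is enough to lift any limit inside ZeroMQ itself.

```python
import resource  # POSIX only; this module does not exist on Windows


def raise_fd_soft_limit(wanted):
    """Try to raise the soft limit on open file descriptors to `wanted`.

    Returns the (soft, hard) pair in effect afterwards. The soft limit is
    never lowered and never pushed above the hard limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY

    if soft == unlimited or soft >= wanted:
        return soft, hard  # already high enough, leave it alone

    # Cap the request at the hard limit unless the hard limit is "unlimited".
    target = wanted if hard == unlimited else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `raise_fd_soft_limit(4096)` early in the process, before creating sockets, would be the typical use. Going beyond the hard limit needs elevated privileges or an OS-level configuration change.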

Two-way communication between kernel-mode driver and user-mode application?

I need a two-way communication between a kernel-mode WFP driver and a user-mode application. The driver initiates the communication by passing a URL to the application which then does a categorization of that URL (Entertainment, News, Adult, etc.) and passes that category back to the driver. The driver needs to know the category in the filter function because it may block certain web pages based on that information. I had a thread in the application that was making an I/O request that the driver would complete with the URL and a GUID, and then the application would write the category into the registry under that GUID where the driver would pick it up. Unfortunately, as the driver verifier pointed out, this is unstable because the Zw registry functions have to run at PASSIVE_LEVEL. I was thinking about trying the same thing with mapped memory buffers, but I’m not sure what the interrupt requirements are for that. Also, I thought about lowering the interrupt level before the registry function calls, but I don't know what the side effects of that are.
You just need to have two different kinds of I/O request.
If you're using DeviceIoControl to retrieve the URLs (I think this would be the most suitable method) this is as simple as adding a second I/O control code.
If you're using ReadFile or equivalent, things would normally get a bit messier, but as it happens in this specific case you only have two kinds of operations, one of which is a read (driver->application) and the other of which is a write (application->driver). So you could just use WriteFile to send the reply, including of course the GUID so that the driver can match up your reply to the right query.
Another approach (more similar to your original one) would be to use a shared memory buffer. See this answer for more details. The problem with that idea is that you would either need to use a spinlock (at the cost of system performance and power consumption, and of course not being able to work on a single-core system) or to poll (which is both inefficient and not really suitable for time-sensitive operations).
There is nothing unstable about PASSIVE_LEVEL. Registry access must happen at PASSIVE_LEVEL, so it's not possible directly if the driver is running at a higher IRQL. You can do it by offloading to a work item, though. Lowering the IRQL is usually not recommended as it contradicts the OS's intentions.
Your protocol indeed sounds somewhat cumbersome and doing a direct app-driver communication is probably preferable. You can find useful information about this here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff554436(v=vs.85).aspx
Since the callouts are at DISPATCH, your processing has to be done either in a worker thread or a DPC, which will allow you to use ZwXXX. You should look into inverted callbacks for communication purposes, there's a good document on OSR.
I've just started poking around WFP but it looks like even in the samples that they provide, Microsoft reinjects the packets. I haven't looked into it that closely but it seems that they drop the packet and re-inject it once it has been processed. That would be enough for your user-mode engine to make the decision. You should also limit the packet capture to a specific port (80 in your case) so that you don't do extra processing that you don't need.

Multi-threaded Windows Service - Erlang

I'll describe the problem I have to solve; I'd like some suggestions on whether I'm on the right path.
The problem is:
I need to create a Windows Service application that receives a request over a socket and performs some action. The action is to execute a script (maybe in Lua or Perl). The script models the client's business rules: it queries databases, makes requests to websites, and then sends a response to the client.
There are 3 mandatory requirements:
The service will receive a lot of requests at the same time, so I'm thinking of using the worker thread model.
The service must have high throughput. I will have many requests within the same second.
Low latency: I must respond to these requests very quickly.
Every request will generate log entries. I can't write these log entries to the physical disk while the scripts execute because of the high I/O time. I will probably keep a queue in memory and have other threads consume this queue and write to disk.
In the future, it's possible that two worker threads will have to exchange messages.
I have to design a protocol for this service. I was thinking of using Thrift, but I don't know the overhead involved. Maybe I will make my own protocol.
To write the Windows service, I was thinking of using Erlang. Is it a good idea?
Does anyone have suggestions or hints for solving this problem? Which language is better for writing this service?
Yes, Erlang is a good choice if you know it or are ready to learn it. With Erlang you don't need any worker threads: just implement your server in the Erlang style and you'll get a multithreaded solution automatically.
Not sure how to turn an Erlang program into a Windows service, but it's probably doable.
Writing to the same log file from many threads is suboptimal because it requires locking. It's better to have a log-entry queue (lock-free?) and a separate thread (Erlang process?) that writes the entries to the file. BTW, are you sure that executing an external script in another language is much faster than writing a log record to the file?
It's doubtful that you'll get much better performance from your own serialization library than Thrift provides for free. Another option is Google Protocol Buffers; some claim that it's faster.
Theoretically (!) it's possible that an Erlang solution won't give you the required performance. In that case, consider a compiled language, e.g. C++, and asynchronous networking, e.g. Boost.Asio. But be prepared for that to be much more complicated than the Erlang way.
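To illustrate the log-queue idea from the third point, here is a rough sketch in Python rather than Erlang. All names (`log_writer`, `handle_request`, `run_demo`) are invented for the example. Note that `queue.Queue` is internally locked, not lock-free; the point is only that worker threads hand off their entries and never wait on the disk themselves.

```python
import queue
import threading

_STOP = object()  # sentinel telling the writer thread to finish


def log_writer(entries, path):
    """Drain the queue and append each entry to the log file."""
    with open(path, "a", encoding="utf-8") as log_file:
        while True:
            entry = entries.get()  # blocks until a worker enqueues something
            if entry is _STOP:
                break
            log_file.write(entry + "\n")


def handle_request(request_id, entries):
    """Stand-in for a worker: it only enqueues its log line, no disk I/O."""
    entries.put(f"request {request_id} handled")


def run_demo(path, n_requests=5):
    entries = queue.Queue()
    writer = threading.Thread(target=log_writer, args=(entries, path))
    writer.start()

    workers = [
        threading.Thread(target=handle_request, args=(i, entries))
        for i in range(n_requests)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    entries.put(_STOP)  # all workers are done, so nothing follows the sentinel
    writer.join()
```

Only the single writer thread touches the file, so no locking around the file itself is needed. The order of lines in the file follows the order in which workers enqueued them.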

Using gevent and multiprocessing together to communicate with a subprocess

Question:
Can I use the multiprocessing module together with gevent on Windows in an efficient way?
Scenario:
I have a gevent based Python application doing asynchronous I/O on Windows. The application is mostly I/O bound, but there are spikes of higher CPU load as well. This application would need to control a console application via its stdin and stdout. I cannot modify this console application and the user will be able to use his own custom one, only the text (line) based communication protocol is fixed.
I have a working implementation using subprocess and threads, but I would rather move the whole subprocess based communication code together with those threads into a separate process to turn the main application back to single-threaded. I plan to use the multiprocessing module for this.
Prior reading:
I have been searching the Web a lot and read some source code, so I know that the multiprocessing module uses a Pipe implementation based on named pipes on Windows. A pair of multiprocessing.Queue objects would be used to communicate with the second Python process. These queues are based on that Pipe implementation, i.e. the IPC would be done via named pipes.
The key question is, whether calling the incoming Queue's get method would block gevent's main loop or not. There's a timeout for that method, so I could make it into a loop with a small timeout, but that's not a good solution, since it would still block gevent for small time periods hurting its low I/O latency.
I'm also open to suggestions on how to circumvent the whole problem of using pipes on Windows, which is known to be hard and sometimes fragile. I'm not sure whether shared memory based IPC is possible on Windows or not. Maybe I could wrap the console application in a way which would allow communicating with the child process using network sockets, which is known to work well with gevent.
Please don't question my primary use case, if possible. Thanks.
The Queue's get method does indeed block. Using it with a timeout could potentially solve your problem, but it definitely won't be the cleanest solution and, most importantly, it will introduce extra latency for no good reason. Even if it weren't blocking, it wouldn't be a good solution either, because being non-blocking is not enough on its own: a good asynchronous call/API should integrate smoothly into the I/O framework in use, be that gevent for Python, libevent for C, or Boost.Asio for C++.
The easiest solution would be to use simple I/O by spawning your console application and attaching to its console input and output descriptors. There are two major factors to consider:
It will be extremely easy for your clients to write client applications. They will not have to work with any kind of IPC, socket or other code, which could be very hard thing for many. With this approach, application will just read from stdin and write to stdout.
It will be extremely easy to test console applications using this approach as you can manually start them, enter text into console and see results.
Gevent is a perfect fit for async read/write here.
However, the downside is that you will have to start this application, there will be no support for concurrent communication with it, and there will be no support for communication over network. There is even a good example for starters.
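A minimal sketch of the stdin/stdout approach described above, using only the standard library. The child here is a made-up stand-in for the user's console application (it answers each line in upper case), and the helper names are invented for illustration. The calls block, so inside a gevent application they would need cooperative equivalents, which this sketch does not show.

```python
import subprocess
import sys

# Hypothetical stand-in for the user's console application: it answers each
# input line with the same line in upper case.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


def start_console_app():
    """Spawn the child with pipes attached to its stdin and stdout."""
    return subprocess.Popen(
        [sys.executable, "-u", "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line buffered on our side
    )


def ask(proc, line):
    """Send one line and block until the child answers with one line."""
    proc.stdin.write(line + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")


def stop(proc):
    """Close the child's stdin (EOF ends its read loop) and wait for exit."""
    proc.stdin.close()
    code = proc.wait(timeout=10)
    proc.stdout.close()
    return code
```

Usage would look like `proc = start_console_app()`, then `ask(proc, "hello")` for each request, then `stop(proc)`. This only works if the console application flushes its output after every line, which is an assumption about the user's program.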
To keep it simple but more flexible, you can use TCP/IP sockets. If both client and server are running on the same machine, a good operating system will use IPC as the underlying implementation, so it will be fast. And if you are worried about performance in this case, you probably should not use Python at all and should look at other technologies.
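As a rough sketch of the socket variant, here is a line-based request/reply exchange over the loopback interface with plain blocking sockets. The "ack: " reply format and the function names are assumptions made up for the example; a gevent application would use its cooperative socket support instead of the thread used here.

```python
import socket
import threading


def serve_one_client(server_sock):
    """Accept one connection and answer each line with 'ack: <line>'."""
    conn, _addr = server_sock.accept()
    with conn, conn.makefile("r", encoding="utf-8") as reader, \
            conn.makefile("w", encoding="utf-8") as writer:
        for line in reader:  # ends when the client closes the connection
            writer.write("ack: " + line)
            writer.flush()


def run_demo():
    # Port 0 lets the OS pick a free port on the loopback interface.
    server_sock = socket.create_server(("127.0.0.1", 0))
    port = server_sock.getsockname()[1]
    server = threading.Thread(target=serve_one_client, args=(server_sock,))
    server.start()

    replies = []
    with socket.create_connection(("127.0.0.1", port)) as client, \
            client.makefile("r", encoding="utf-8") as reader, \
            client.makefile("w", encoding="utf-8") as writer:
        for text in ("ping", "status"):
            writer.write(text + "\n")
            writer.flush()
            replies.append(reader.readline().rstrip("\n"))

    server.join()
    server_sock.close()
    return replies
```

The same text-line protocol as with stdin/stdout is kept, so moving the console application behind a socket would not change the message format.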
An even fancier solution is to use ZeroC ICE. It is a very modern technology that allows almost seamless inter-process communication. It is a CORBA killer and very easy to use. It is heavily used by many, proven to be the fastest in its class, and rock stable. The beauty of this solution is that you can seamlessly integrate programs in many different languages, like Python, Java, C++, etc. But it will take some of your time to get familiar with the concept. If you decide to go this way, just spend a day reading through the documentation.
Hope it helps. Good luck!
Your question is already quite old. Nevertheless, I would like to recommend http://gehrcke.de/gipc which -- I believe -- would tackle the outlined challenge in a very straight-forward fashion. Basically, it allows you to integrate multiprocessing-based child processes anywhere in your application (also on Windows). Interaction with Process objects (such as calling join()) is gevent-cooperative. Via its pipe management, it allows for cooperatively blocking inter-process communication. However, on Windows, IPC currently is much less efficient than on POSIX-compliant systems (since non-blocking I/O is imitated through a thread pool). Depending on the IPC messaging volume of your application, this might or might not be of significance.

Resources