Wanting to understand purpose and count of V8's WorkerThread(s)

I have a very simple program as follows that just creates an isolate, then sleeps:
#include <libplatform/libplatform.h>
#include <v8-platform.h>
#include <v8.h>
#include <stdio.h>
#include <unistd.h>
using v8::Isolate;
int main() {
    std::unique_ptr<v8::Platform> platform = v8::platform::NewDefaultPlatform();
    v8::V8::InitializePlatform(platform.get());
    v8::V8::Initialize();
    v8::Isolate::CreateParams create_params;
    create_params.array_buffer_allocator = v8::ArrayBuffer::Allocator::NewDefaultAllocator();
    Isolate* isolate = v8::Isolate::New(create_params);
    printf("Sleeping...\n");
    usleep(1000 * 1000 * 100);
    printf("Done\n");
    return 0;
}
When I run this program, I can then check the number of threads the process has created with ps -T -p <process_id>. On my 8-core machine, V8 creates 7 extra threads, all named "V8 WorkerThread", and on my 16-core machine I get 8 instances of "V8 WorkerThread".
I am looking to understand what determines the number of extra threads V8 spawns and what the purpose of these threads is. Thanks in advance!

The number of worker threads, when not specified by the embedder (that's you!), is chosen based on the number of CPU cores. In the current implementation, the formula is number_of_worker_threads = number_of_CPU_cores - 1, capped at a maximum of 8, though this may change without notice. You can also specify your own worker thread pool size as an argument to NewDefaultPlatform.
The worker threads are used for various tasks that can be run in the background, mostly for garbage collection and optimized compilation.
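For illustration, a minimal sketch of pinning the pool size yourself; this assumes a recent V8 where thread_pool_size is the first parameter of NewDefaultPlatform (check your header, as the defaulted parameters have changed between releases):
#include <libplatform/libplatform.h>
#include <v8.h>

int main() {
    // Ask the default platform for exactly 2 background workers instead
    // of letting it derive the count from the CPU core count.
    std::unique_ptr<v8::Platform> platform =
        v8::platform::NewDefaultPlatform(/*thread_pool_size=*/2);
    v8::V8::InitializePlatform(platform.get());
    v8::V8::Initialize();
    // ps -T -p <pid> should now show at most 2 "V8 WorkerThread" entries.
    return 0;
}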

Related

Test if the program uses MPI (distributed) correctly?

How do I check that a program is using MPI when it runs? Specifically, how can I verify the program is running on multiple processors? Also, how can I figure out if my program is correctly running across multiple nodes?
I am assuming you're trying to figure out which processor/host each MPI process is running on.
You can use the MPI_Get_processor_name function to print the processor name.
Here is what your code will look like.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv)
{
    int rank, max_len;
    char processorname[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processorname, &max_len);
    printf("Hello world! I am process number: %d on processor %s\n", rank, processorname);
    MPI_Finalize();
    return 0;
}
Now, to compile the program, use mpicc -o hello_world hello_world.c.
To run the program, use mpirun -np 4 -f machinefile ./hello_world.
This will run the program as 4 processes on the machines listed in your machinefile.
You didn't tell us what you are actually looking for; your question is unclear and ambiguous, and it would be great if you could improve it. That being said, I guess you would like to know whether your processes are actually executed by distinct CPU cores.
First of all, Pooja Nilangekar explained a method to verify the distribution across a network. Within a single node, it most likely depends on the system you are running on. On Linux, you could for example make use of the /proc filesystem and check the status of the current process in /proc/self/. This pseudo filesystem offers a file stat, which contains a processor field showing the CPU id the process was last run on. You might also check /proc/self/status for the CPUs the process is allowed to run on; it may be that MPI or your scheduler places such restrictions on each process. Together with the node information from Pooja Nilangekar's answer, you can thereby obtain the running information for each process.
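A hedged sketch of that idea (Linux/glibc only; sched_getcpu() reports the same core as the processor field of /proc/self/stat):
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>   /* sched_getcpu() */
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    /* Print the host and the core this process last ran on. */
    printf("rank %d: host %s, core %d\n", rank, host, sched_getcpu());
    MPI_Finalize();
    return 0;
}
Note that the reported core is only a snapshot; the scheduler may migrate the process unless you pin it, e.g. with taskset or your MPI's binding options.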
If you cannot modify the sources to have each process report where it is running, I think the easiest way to see which cores are utilized would be top; maybe also have a look at this blog on How do I find out Linux CPU utilization?, which also mentions mpstat and sar.

Slow thread creation on Windows

I have upgraded a number crunching application to a multi-threaded program, using the C++11 facilities. It works well on Mac OS X but does not benefit from multithreading on Windows (Visual Studio 2013). Using the following toy program
#include <chrono>
#include <iostream>
#include <thread>

void t1(int& k) {
    k += 1;
}

void t2(int& k) {
    k += 1;
}

int main(int argc, const char *argv[])
{
    int a{ 0 };
    int b{ 0 };
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i) {
        std::thread thread1{ t1, std::ref(a) };
        std::thread thread2{ t2, std::ref(b) };
        thread1.join();
        thread2.join();
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time_stack = std::chrono::duration_cast<std::chrono::microseconds>(
        end_time - start_time).count();
    std::cout << "Time: " << time_stack / 10000.0 << " microseconds" << std::endl;
    std::cout << a << " " << b << std::endl;
    return 0;
}
I have discovered that it takes 34 microseconds to start a thread on Mac OS X and 340 microseconds to do the same on Windows. Am I doing something wrong on the Windows side? Is it a compiler issue?
Not a compiler problem (nor an operating system problem, strictly speaking).
It is a well-known fact that creating threads is an expensive operation. This is especially true under Windows (and used to be true under Linux as well, prior to clone).
Also, creating and joining a thread is necessarily slow and does not tell you a lot about the cost of creating a thread as such. Joining presumes that the thread has exited, which can only happen after it has been scheduled to run. Thus, your measurements include delays introduced by scheduling. As such, the times you measure are actually pretty good (they could easily be 20 times longer!).
However, it does not matter a lot whether spawning threads is slow anyway.
Creating 20,000 threads, as in your benchmark, in a real program is a serious error. While it is not strictly illegal or disallowed to create thousands (even millions) of threads, the "correct" way of using threads is to create no more threads than there are, approximately, CPU cores. One does not create very short-lived threads all the time either.
You might have a few short-lived ones, and you might create a few extra threads (which e.g. block on I/O), but you will not want to create hundreds or thousands of these. Every additional thread (beyond the number of CPU cores) means more context switches, more scheduler work, more cache pressure, and 1 MB of address space plus 64 kB of physical memory gone per thread (due to stack reserve and commit granularity).
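A minimal sketch of that sizing rule, assuming a pool of long-lived workers is what the real application needs (the lambda body is a placeholder):
#include <thread>
#include <vector>

int main()
{
    // Create one long-lived worker per core, once, instead of spawning
    // short-lived thread pairs in a loop.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;  // hardware_concurrency() may return 0 if unknown
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([] { /* pull work from a queue here */ });
    for (auto& t : workers)
        t.join();
    return 0;
}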
Now, assume you create for example 10 threads at program start; it does not matter at all whether this takes 3 milliseconds altogether. It takes several hundred milliseconds (at least) for the program to start up anyway; nobody will notice a difference.
Visual C++ uses the Concurrency Runtime (MS specific) to implement std::thread features. When you directly call any Concurrency Runtime feature/function, it creates a default runtime object (not going into the details here). The same happens when you call a std::thread function: it behaves as if a ConcRT function had been invoked.
The creation of the default runtime (or, say, scheduler) takes some time, and hence thread creation appears slow. Try creating a std::thread object and letting it run first, and only then execute the benchmarking code (the whole of the above code, for example).
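A warm-up sketch along those lines (this assumes the one-time cost really is the runtime's lazy initialization):
#include <thread>

int main()
{
    // Throwaway thread: forces the runtime/scheduler to initialize once,
    // so that cost is not charged to the timed loop below.
    std::thread warmup([] {});
    warmup.join();

    // ... now run the timed benchmark loop from the question ...
    return 0;
}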
EDIT:
Skim over this: http://www.codeproject.com/Articles/80825/Concurrency-Runtime-in-Visual-C
Do step-into debugging to see when the ConcRT library is invoked and what it is doing.

Thread-Ids in Windows greater than 0xFFFF

We have a big and old software project. In earlier days this software ran on an old OS, so it has an OS wrapper. Today it runs on Windows.
In the OS wrapper we have structs to manage threads. One member of this struct is the thread ID, but it is defined as a uint16_t. The thread IDs are generated with the Win-API createThreadEx.
For some months now, at one of our customers, thread IDs have been appearing which are greater than
numeric_limits<uint16_t>::max()
We would run into big trouble if we tried to change this member to a uint32_t. And even if we fixed it, we would have to test the fix.
So my question is: how is it possible in Windows to get thread IDs which are greater than 0xffff? What must the circumstances be to reach this?
Windows thread IDs are 32 bit unsigned integers, of type DWORD. There's no requirement for them to be less than 0xffff. Whatever thought process led you to that belief was flawed.
If you want to stress test your system to create a scenario where you have thread IDs that go above 0xffff then you simply need to create a large number of threads. To make this tenable, without running out of virtual address space, create threads with very small stacks. You can create the threads suspended too because you don't need the threads to do anything.
Of course, it might still be a little tricky to force the system to allocate that many threads. I found that my simple test application would not readily generate thread IDs above 0xffff when run as a 32 bit process, but would do so as a 64 bit process. You could certainly create a 64 bit process that would consume the low-numbered thread IDs and then allow your 32 bit process to go to work and so deal with lower numbered thread IDs.
Here's the program that I experimented with:
#include <Windows.h>
#include <iostream>

DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
    return 0;
}

int main()
{
    for (int i = 0; i < 10000; i++)
    {
        DWORD threadID;
        if (CreateThread(NULL, 64, ThreadProc, NULL, CREATE_SUSPENDED, &threadID) == NULL)
            return 1;
        std::cout << std::hex << threadID << std::endl;
    }
    return 0;
}
Re "We run in big troubles, if we try to change this member to an uint32_t. And even if we fix it, we had to test the fix."
Your current software's use of a 16-bit object to store a value that requires 32 bits is a bug. So you have to fix it, and test the fix. There are at least two practical fixes:
Changing the declaration of the id, and all uses of it.
It can really help with finding all copying of the id to introduce a dedicated type that is not implicitly convertible to integer, e.g. a C++11 based enumeration type (see the sketch after this answer).
Adding a layer of indirection.
This might be possible without changing the data, only changing the threading library implementation.
A deeper fix might be to replace the current threading with C++11 standard library threading.
Anyway, you're up for a bit of work and/or some cost.
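A sketch of that dedicated-type idea (ThreadId and FromDword are hypothetical names, not part of any existing API):
#include <cstdint>

// An enum class is not implicitly convertible to or from its underlying
// integer type, so every place that still copies the old 16-bit id
// becomes a compile error you can then fix one by one.
enum class ThreadId : std::uint32_t {};

inline ThreadId FromDword(std::uint32_t raw)
{
    return static_cast<ThreadId>(raw);
}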

Read/Write memory on OS X 10.8.2 with vm_read and vm_write

This is my code that works only on Xcode (version 4.5):
#include <stdio.h>
#include <stdint.h>
#include <string.h>   /* memcpy */
#include <unistd.h>   /* getpid */
#include <mach/mach_init.h>
#include <mach/mach_vm.h>
#include <sys/types.h>
#include <mach/mach.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <Security/Authorization.h>

int main(int argc, const char * argv[]) {
    char test[14] = "Hello World! "; // 0x7fff5fbff82a
    char value[14] = "Hello Hacker!";
    char test1[14];
    pointer_t buf;
    uint32_t sz;
    task_t task;
    task_for_pid(current_task(), getpid(), &task);
    if (vm_write(current_task(), 0x7fff5fbff82a, (pointer_t)value, 14) == KERN_SUCCESS) {
        printf("%s\n", test);
        //getchar();
    }
    if (vm_read(task, 0x7fff5fbff82a, sizeof(char) * 14, &buf, &sz) == KERN_SUCCESS) {
        memcpy(test1, (const void *)buf, sz);
        printf("%s", test1);
    }
    return 0;
}
I was also trying ptrace and other things; this is why I include other libraries too.
The first problem is that this works only in Xcode: with the debugger I can find the position (memory address) of a variable (in this case test), then I overwrite the string with the one in value, and then I copy the new value of test into test1.
I actually don't understand how vm_write works (not completely), and the same goes for task_for_pid(). The second problem is that I need to read and write the memory of another process; this is only a test to see if the functions work within the same process, and it works (only in Xcode).
How can I do that with other processes? I need to read a position (how can I find the address of "something"?); this is the first goal.
For your problems, there are solutions:
The first problem: OS X has address space layout randomization. If you want to make your memory images fixed and predictable, you have to compile your code with the no-PIE setting (with the usual OS X toolchain this is the -Wl,-no_pie linker flag). This setting (PIE = Position Independent Executable) is what allows ASLR, which "slides" the memory by some random value that changes on every instance.
Regarding "I actually don't understand how vm_write works (not completely) and the same for task_for_pid()":
The Mach APIs operate on the lower-level abstractions of "task" and "thread", which correspond roughly to the BSD "process" and "(u)thread" (there are some exceptions, e.g. kernel_task, which does not have a PID, but let's ignore that for now). task_for_pid obtains the task port (think of it as a "handle"), and if you get the port, you are free to do whatever you wish. Basically, the vm_* functions operate on any task port: you can use them on your own process (mach_task_self(), that is) or on a port obtained from task_for_pid.
task_for_pid actually doesn't necessarily require root (i.e. "sudo"). It requires getting past taskgated on OS X, which traditionally verified membership in the procmod or procview groups. You can configure taskgated (/System/Library/LaunchDaemons/com.apple.taskgated.plist) for debugging purposes. Ultimately, by the way, getting the task port will require an entitlement (the same as it now does on iOS). That said, the easiest way, rather than mucking around with system authorizations etc., is to simply become root.
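A hedged sketch of the cross-process read (run as root; read_remote is a hypothetical helper, and mach_vm_read_overwrite is used instead of vm_read so the bytes land directly in a local buffer):
#include <mach/mach.h>
#include <mach/mach_vm.h>
#include <stdint.h>
#include <sys/types.h>

/* Read len bytes at addr in the process identified by pid.
   Returns the number of bytes read, or -1 on failure. */
int read_remote(pid_t pid, mach_vm_address_t addr, void *out, size_t len)
{
    task_t task;
    if (task_for_pid(mach_task_self(), pid, &task) != KERN_SUCCESS)
        return -1;  /* needs root, or a taskgated/entitlement exception */

    mach_vm_size_t got = 0;
    if (mach_vm_read_overwrite(task, addr, len,
                               (mach_vm_address_t)(uintptr_t)out, &got) != KERN_SUCCESS)
        return -1;

    return (int)got;
}
Finding the address of "something" in the other process is the harder part: you either disable ASLR as described above, or walk the target's memory regions (e.g. with mach_vm_region) and search for a known byte pattern.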
Did you try to run your app with sudo?
You can't read/write another app's memory without sudo.

How to make pthread_cond_timedwait() robust against system clock manipulations?

Consider the following source code, which is fully POSIX compliant:
#include <stdio.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>    /* gmtime, strftime */
#include <unistd.h>  /* sleep */

int main (int argc, char ** argv) {
    pthread_cond_t c;
    pthread_mutex_t m;
    char printTime[UCHAR_MAX];
    pthread_mutex_init(&m, NULL);
    pthread_cond_init(&c, NULL);
    for (;;) {
        struct tm * tm;
        struct timeval tv;
        struct timespec ts;
        gettimeofday(&tv, NULL);
        printf("sleep (%ld)\n", (long)tv.tv_sec);
        sleep(3);
        tm = gmtime(&tv.tv_sec);
        strftime(printTime, UCHAR_MAX, "%Y-%m-%d %H:%M:%S", tm);
        printf("%s (%ld)\n", printTime, (long)tv.tv_sec);
        ts.tv_sec = tv.tv_sec + 5;
        ts.tv_nsec = tv.tv_usec * 1000;
        pthread_mutex_lock(&m);
        pthread_cond_timedwait(&c, &m, &ts);
        pthread_mutex_unlock(&m);
    }
    return 0;
}
It prints the current system date every 5 seconds; however, it sleeps for 3 seconds between getting the current system time (gettimeofday) and the condition wait (pthread_cond_timedwait).
Right after it prints "sleep (...)", try setting the system clock two days into the past. What happens? Well, instead of waiting 2 more seconds on the condition as it usually does, pthread_cond_timedwait now waits for two days and 2 seconds.
How do I fix that?
How can I write POSIX compliant code that does not break when the user manipulates the system clock?
Please keep in mind that the system clock might change even without user interaction (e.g. an NTP client might update the clock automatically once a day). Setting the clock into the future is no problem: it will only cause the sleep to wake up early, which is usually harmless and which you can easily "detect" and handle accordingly. But setting the clock into the past (e.g. because it was running in the future and NTP detected and fixed that) can cause a big problem.
PS:
Neither pthread_condattr_setclock() nor CLOCK_MONOTONIC exists on my system. Those are mandatory for the POSIX 2008 specification (part of "Base"), but most systems as of today still only follow the POSIX 2004 specification, in which these two were optional (Advanced Realtime Extension).
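For reference, on systems that do implement the POSIX 2008 clock attribute (so not the asker's system, per the PS above), the usual fix is to measure the timeout against the monotonic clock, which is immune to system clock jumps. A sketch:
#include <pthread.h>
#include <time.h>

/* Create a condition variable whose timedwait deadline is interpreted
   against CLOCK_MONOTONIC instead of the wall clock. */
int init_monotonic_cond(pthread_cond_t *c)
{
    pthread_condattr_t attr;
    pthread_condattr_init(&attr);
    pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    return pthread_cond_init(c, &attr);
}
The deadline passed to pthread_cond_timedwait must then be built from clock_gettime(CLOCK_MONOTONIC, ...) rather than gettimeofday().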
Interesting, I've not encountered that behaviour before but, then again, I'm not in the habit of mucking about with my system time that much :-)
Assuming you're doing that for a valid reason, one possible (though kludgy) solution is to have another thread whose sole purpose is to periodically kick the condition variable to wake up any threads so affected.
In other words, something like:
while (1) {
    sleep (10);
    pthread_cond_signal (&condVar);
}
Your code that's waiting for the condition variable to be kicked should be checking its predicate anyway (to take care of spurious wakeups) so this shouldn't have any real detrimental effect on the functionality.
It's a slight performance hit but once every ten seconds shouldn't be too much of a problem. It's only really meant to take care of the situations where (for whatever reason) your timed wait will be waiting a long time.
Another possibility is to re-engineer your application so that you don't need timed waits at all.
In situations where threads need to be woken for some reason, it's invariably by another thread which is perfectly capable of kicking a condition variable to wake one (or broadcasting to wake the lot of them).
This is very similar to the kicking thread I mentioned above but more as an integral part of your architecture than a bolt-on.
You can defend your code against this problem. One easy way is to have one thread whose sole purpose is to watch the system clock. You keep a global linked list of condition variables, and if the clock watcher thread sees a system clock jump, it broadcasts every condition variable on the list. Then, you simply wrap pthread_cond_init and pthread_cond_destroy with code that adds/removes the condition variable to/from the global linked list. Protect the linked list with a mutex.
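A sketch of that clock-watcher idea (all names here are hypothetical; a real version would also need removal from the list and a smarter jump heuristic):
#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>

#define MAX_CONDS 64
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t *g_conds[MAX_CONDS];
static int g_count;

/* Call this right after pthread_cond_init. */
void register_cond(pthread_cond_t *c)
{
    pthread_mutex_lock(&g_lock);
    if (g_count < MAX_CONDS)
        g_conds[g_count++] = c;
    pthread_mutex_unlock(&g_lock);
}

/* Run this in its own thread for the lifetime of the process. */
void *clock_watcher(void *arg)
{
    struct timeval last, now;
    gettimeofday(&last, NULL);
    for (;;) {
        sleep(1);
        gettimeofday(&now, NULL);
        if (now.tv_sec < last.tv_sec) {  /* clock jumped backwards */
            pthread_mutex_lock(&g_lock);
            for (int i = 0; i < g_count; i++)
                pthread_cond_broadcast(g_conds[i]);
            pthread_mutex_unlock(&g_lock);
        }
        last = now;
    }
    return NULL;
}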
