Do I always need to specify the array shape when reading from shared_memory (Python 3.8)? - multiprocessing

I am trying to use shared_memory on a distributed cluster with Python multiprocessing. Each time a worker is created on a new node, I need it to check whether a shared memory block with a particular name already exists and, if not, create it from a subset of a table from an external database. This shared memory block should then be accessible to all future workers on that node.
However, the example in the Python documentation (https://docs.python.org/3/library/multiprocessing.shared_memory.html) only seems to work when the shape of the array is known in advance:
# Attach to the existing shared memory block
existing_shm = shared_memory.SharedMemory(name='psm_21467_46075')
# Note that a.shape is (6,) and a.dtype is np.int64 in this example
c = np.ndarray((6,), dtype=np.int64, buffer=existing_shm.buf)
Is there any way to check the shape of an array in a shared_memory block without it being passed to the worker as an argument (my workers cannot communicate with each other)?
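The only workaround I can think of is to reserve a small header at the start of the block that records the length, roughly like below (load_rows is just a stand-in for whatever fetches the table subset from the database), but I am hoping there is something built in:
import numpy as np
from multiprocessing import shared_memory

HEADER_DTYPE = np.int64                       # one int64 holding the number of elements
HEADER_BYTES = np.dtype(HEADER_DTYPE).itemsize

def attach_or_create(name, load_rows, dtype=np.int64):
    # Attach to the named block, or create it from the database subset if it
    # does not exist yet. load_rows is a stand-in for the database fetch and
    # must return a 1-D numpy array. (A real deployment would also need a lock
    # or retry loop in case two workers try to create the block at once.)
    try:
        shm = shared_memory.SharedMemory(name=name)
        created = False
    except FileNotFoundError:
        data = np.asarray(load_rows(), dtype=dtype)
        shm = shared_memory.SharedMemory(name=name, create=True,
                                         size=HEADER_BYTES + data.nbytes)
        created = True

    header = np.ndarray((1,), dtype=HEADER_DTYPE, buffer=shm.buf[:HEADER_BYTES])
    if created:
        header[0] = data.shape[0]
        body = np.ndarray(data.shape, dtype=dtype,
                          buffer=shm.buf[HEADER_BYTES:HEADER_BYTES + data.nbytes])
        body[:] = data

    n = int(header[0])
    itemsize = np.dtype(dtype).itemsize
    array = np.ndarray((n,), dtype=dtype,
                       buffer=shm.buf[HEADER_BYTES:HEADER_BYTES + n * itemsize])
    return shm, array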

Related

What is a good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270,000 rows of data in an EPW file.
I have set up a scene with some panels in a module, following a tutorial on the bifacial_radiance GitHub page.
I am running the Python script for this on a high-power computer with 64 cores. Since Python natively only uses one processor, I want to use multiprocessing, which is currently working. However, it does not seem very fast: even when starting 64 processes it uses roughly 10 % of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it looks at a result file (where I store results as CSV) and compares it to the contents of the radObj.metdata object. Both the metdata and my result file use dates, so all dates which exist in the metdata but not in the result file are put in an index queue (a queue object from the multiprocessing package). I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two functions:
A file writer function which, every 10 seconds, gets all items from the result queue and writes them to the result file. This function runs in its own multiprocessing.Process, like so:
fileWriteProcess = Process(target=fileWriter, args=(resultQueue, resultFileName))
fileWriteProcess.start()
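A trimmed-down sketch of that writer (the workers put ready-made comma-separated lines on the result queue):
import time

def fileWriter(resultQueue, resultFileName):
    # Drain the result queue every ~10 seconds and append the lines to the CSV.
    while True:
        time.sleep(10)
        lines = []
        while not resultQueue.empty():   # .empty() is approximate, but fine for this pattern
            lines.append(resultQueue.get())
        if lines:
            with open(resultFileName, "a") as f:
                f.write("\n".join(lines) + "\n")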
A ray trace function with a unique ID (the process name) which does the following (see the sketch after this list):
1. Get an index idx from the index queue (described above).
2. Use this index in radObj.gendaylit(idx).
3. Create the octfile. For this I have modified the name the octfile is saved with to use a prefix which is the name of the process; this avoids all the processes using the same octfile on the SSD: octfile = radObj.makeOct(prefix=name)
4. Run an analysis:
analysis = bifacial_radiance.AnalysisObj(octfile, radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
5. Read the desired results from these dictionaries and put them in the resultQueue as a single line of comma-separated values.
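A compressed sketch of that worker, assuming the queues are multiprocessing.Queue objects; makeOct(prefix=name) refers to my modified version, not the stock API:
import queue
import bifacial_radiance

def rayTraceWorker(name, radObj, scene, indexQueue, resultQueue):
    while True:
        try:
            idx = indexQueue.get_nowait()          # step 1: next timestamp index
        except queue.Empty:
            return                                 # no work left
        radObj.gendaylit(idx)                      # step 2: sky for this timestamp
        octfile = radObj.makeOct(prefix=name)      # step 3: per-process octfile
        analysis = bifacial_radiance.AnalysisObj(octfile, radObj.basename)
        frontscan, backscan = analysis.moduleAnalysis(scene)
        frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
        # step 5: pick out the desired values and queue one CSV line
        resultQueue.put(",".join(str(v) for v in frontDict.values()))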
This all works. The processes are running after being created in a for loop.
This speeds up the whole simulation quite a bit (from 10 days down to about 1.5 days), but as said earlier the CPU is running at around 10 % capacity and the GPU at around 25 %. The computer has 512 GB of RAM, which is not an issue. The only communication with the processes is through the resultQueue and the indexQueue, which should not bottleneck the program. I can see that the processes are not running in lockstep, since the results are written slightly out of order while the input EPW file is sorted.
My question is whether there is a better way to do this which might make it run faster. I can see in the source code that a boolean "hpc" is used when initializing some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.

How to access Linux kernel data structures?

I want to print information about each process and what that process is doing at runtime, i.e. which files that process is continuously reading or writing.
For this I'm writing a kernel module.
Does anyone have an idea of how to access this information in a kernel module, i.e. how to access the process table data structures from my kernel module?
Pseudocode for the task would be like this:
1. Get each process from /proc.
2. Access the data structures of that process, i.e. the process table entries.
3. Print what that process is doing, i.e. which files it is accessing (reading or writing) at runtime.
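For comparison, I can already get this kind of information from userspace by reading /proc directly (sketch below), but I want to do the equivalent from inside a kernel module:
import os

def open_files(pid):
    # List files currently opened by pid by reading the /proc/<pid>/fd symlinks.
    fd_dir = "/proc/%d/fd" % pid
    files = []
    try:
        for fd in os.listdir(fd_dir):
            try:
                files.append(os.readlink(os.path.join(fd_dir, fd)))
            except OSError:
                pass              # fd closed between listdir() and readlink()
    except PermissionError:
        pass                      # other users' processes need root
    return files

for pid in filter(str.isdigit, os.listdir("/proc")):
    print(pid, open_files(int(pid)))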
Please take a look at this example.
It specifically shows how to create a kernel module which prints the open files of a process (and relies on the task_struct obtained from the current macro I mentioned in my comment). This can be extended to far more complicated things, all reachable through the process's task_struct.
There is a macro called for_each_process declared in /include/linux/sched.h
http://lxr.free-electrons.com/source/include/linux/sched.h#L2621
By using this macro, it is possible to traverse every process's task_struct.
http://lxr.free-electrons.com/source/include/linux/sched.h#L1343

How to define a shared (global) variable in Hadoop?

I need a shared (global) variable which is accessible from all mappers and reducers. The mappers only read the value, but the reducers change some values in it to be used in the next iteration. I know DistributedCache is a technique for this, but it only supports reading a shared value.
This is exactly what ZooKeeper was built for. ZooKeeper can keep up with lots of reads from mappers/reducers, and still be able to write something now and then.
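Roughly, the pattern looks like this with the Python kazoo client (the same idea applies with the Java API a Hadoop job would more likely use; the host and znode path here are made up):
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk-host:2181")   # assumed ZooKeeper ensemble address
zk.start()

PATH = "/myjob/shared_value"             # hypothetical znode holding the shared variable
zk.ensure_path("/myjob")
if not zk.exists(PATH):
    zk.create(PATH, b"0")

# Mapper side: read the current value.
value, _stat = zk.get(PATH)

# Reducer side: write the updated value for the next iteration.
zk.set(PATH, str(int(value) + 1).encode())

zk.stop()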
The other option would be to set values in the configuration object. However, this only persists globally for a single job, so you'd have to manage the passing of the value across jobs yourself, and you can't change it while the job is running.

Fastest way to send large blobs of data from one program to another in Windows?

I need to send large blobs of data (~10MB) from one program to another in Windows 7. I would like a method that allows for at least a gigabyte per second total throughput with very low system load. To simplify this, all blobs may be the same size, and one program may be a child process of the other.
Method 1: Memory map the same file in both programs: CreateFileMapping() / MapViewOfFile()
In this case, the memory mapped file(s) presumably contains room for several blobs in a ring buffer. There would need to be some external mechanism to synchronize access to the ring buffer.
Method 2: Create named data sections
Method 3: WriteProcessMemory (suggested by Hristo Iliev below, thanks!)
Method 4: Read/write files on a RAM disk.
Method 5: Read/write to an anonymous pipe.
Method ?: Anything else? Perhaps write over TCP, use MPI, ...
I know that memory-mapped files (method 1) are considered the standard solution to this problem :)
How fast are memory-mapped files? (rough order of magnitude)
Is there an even faster method?
How much worse is the performance of the other methods? Which of them can hit GB/sec throughput?
If using memory mapped files, what is the best way for the programs to synchronize access to the data being passed? (ie: how would the producer indicate to the consumer that a new blob is available, and how would the consumer indicate it is done with a particular blob?)
If using memory mapped files, is it better to have one file for all blobs together (ring buffer in a file), or one file for each blob (ring buffer of files)?
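For what it's worth, Method 1 is easy to prototype with Python's mmap and a Windows tagname; the polling on a flag byte below is only a stand-in for whatever real synchronization (e.g. a named event or semaphore) would answer the question above:
import mmap
import time

SIZE = 10 * 1024 * 1024 + 1      # one 10 MB blob plus a 1-byte "ready" flag
TAG = "blob_demo_mapping"        # hypothetical section name shared by both programs

def producer(blob):
    shm = mmap.mmap(-1, SIZE, tagname=TAG)   # named section backed by the paging file
    shm[1:1 + len(blob)] = blob
    shm[0] = 1                               # mark the blob as ready

def consumer():
    shm = mmap.mmap(-1, SIZE, tagname=TAG)
    while shm[0] == 0:                       # polling only for the sketch; use a named
        time.sleep(0.001)                    # event/semaphore for real synchronization
    return bytes(shm[1:SIZE])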
You could also use WriteProcessMemory and have the first process write the data directly into the address space of the second process. You'd need to develop a protocol of some kind. For example, the second process could send the virtual address of its receive buffer to the first process via a named pipe or a shared memory block; the first process then copies the data using WriteProcessMemory and, when it is finished, signals the second one via a semaphore or something similar. This ought to be the fastest way to send data between two processes, as it involves a single copy operation. The first process would need to obtain the proper rights on the second one, which should not be a problem as long as both processes belong to the same user.
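A rough ctypes sketch of the sender side of that protocol (the target's PID and receive-buffer address are assumed to have already arrived over the pipe; error handling and the semaphore signal are omitted):
import ctypes
from ctypes import wintypes

kernel32 = ctypes.windll.kernel32
kernel32.OpenProcess.restype = wintypes.HANDLE
kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
kernel32.WriteProcessMemory.argtypes = [wintypes.HANDLE, wintypes.LPVOID, wintypes.LPCVOID,
                                        ctypes.c_size_t, ctypes.POINTER(ctypes.c_size_t)]

PROCESS_VM_OPERATION = 0x0008
PROCESS_VM_WRITE = 0x0020

def push_blob(target_pid, remote_addr, blob):
    # remote_addr is the receive-buffer address the second process sent us;
    # copy the blob there in a single WriteProcessMemory call.
    handle = kernel32.OpenProcess(PROCESS_VM_OPERATION | PROCESS_VM_WRITE, False, target_pid)
    written = ctypes.c_size_t(0)
    kernel32.WriteProcessMemory(handle, remote_addr, blob, len(blob), ctypes.byref(written))
    kernel32.CloseHandle(handle)
    return written.value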

Architecture - How to efficiently crawl the web with 10,000 machines?

Let’s pretend I have a network of 10,000 machines. I want to use all those machines to crawl the web as fast as possible. All pages should be downloaded only once. In addition there must be no single point of failure, and we must minimize the amount of communication required between machines. How would you accomplish this?
Is there anything more efficient than using consistent hashing to distribute the load across all machines and minimize communication between them?
Use a distributed MapReduce system like Hadoop to divide the workspace.
If you want to be clever, or are doing this in an academic context, then try a nonlinear dimension reduction.
The simplest implementation would probably be to use a hashing function on the namespace key, e.g. the domain name or URL, and use a Chord-style distributed hash table to assign each machine a subset of the hash values to process (a small sketch follows).
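A minimal sketch of the hash-partition idea, using a plain modulo split rather than a real Chord ring (which would also handle machines joining and leaving):
import hashlib
from urllib.parse import urlsplit

N_MACHINES = 10000

def machine_for(url):
    # Hash the host name so every URL on a given site lands on the same machine.
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % N_MACHINES

print(machine_for("https://example.com/page1"))  # same machine as ...
print(machine_for("https://example.com/page2"))  # ... this one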
One idea would be to use work queues (directories or a DB), assuming you can work out storage such that it meets your criteria for redundancy.
\retrieve
\retrieve\server1
\retrieve\server...
\retrieve\server10000
\in-process
\complete
1.) All seed pages are hashed and placed in the retrieve queue, using the hash as the file root.
2.) Before putting a page in the queue, check the complete and in-process queues to make sure you don't re-queue it.
3.) Each server retrieves a random batch (1-N) of files from the retrieve queue and attempts to move them to its private queue.
4.) Files that fail the rename are assumed to have been “claimed” by another process (see the sketch after this list).
5.) Files that could be moved are to be processed; put a marker in the in-process directory to prevent re-queuing.
6.) Download the page and place the file in the \complete queue.
7.) Clean the file out of the in-process and server directories.
8.) Every 1,000 runs, check the oldest 10 in-process files by trying to move them from their server queues back into the general retrieve queue. This helps if a server hangs, and should also load-balance slow servers.
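A sketch of the claim-by-rename step (steps 3 and 4), relying on the fact that a rename on the same filesystem succeeds for exactly one caller:
import os
import random

def claim_batch(retrieve_dir, server_dir, n):
    # Try to claim up to n work files by renaming them into this server's
    # private queue; a rename that fails was claimed by another server.
    claimed = []
    candidates = os.listdir(retrieve_dir)
    random.shuffle(candidates)
    for name in candidates[:n]:
        try:
            os.rename(os.path.join(retrieve_dir, name), os.path.join(server_dir, name))
            claimed.append(os.path.join(server_dir, name))
        except OSError:
            pass                   # already claimed by another server
    return claimed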
For the retrieve, in-process and complete queues: most file systems hate millions of files in one directory, so divide storage into segments based on the characters of the hash, e.g. \abc\def\123\ would be the directory for file abcdef123FFFFFF…, if you were scaling to billions of downloads.
If you use MongoDB instead of a regular file store, many of these problems would be avoided and you could benefit from sharding etc.
