What does `setupkvm` do in xv6? What does it really mean to "set up kernel virtual memory"? - memory-management

In the copyuvm function, setupkvm is called to set up kernel virtual memory. Why do we need to set up kernel virtual memory when we are copying a user process? Why didn't we need that when we were doing allocuvm?
Code for copyuvm
// Given a parent process's page table, create a copy
// of it for a child.
pde_t*
copyuvm(pde_t *pgdir, uint sz)
{
  pde_t *d;
  pte_t *pte;
  uint pa, i, flags;
  char *mem;

  if((d = setupkvm()) == 0)
    return 0;
  for(i = 0; i < sz; i += PGSIZE){
    if((pte = walkpgdir(pgdir, (void *) i, 0)) == 0)
      panic("copyuvm: pte should exist");
    if(!(*pte & PTE_P))
      panic("copyuvm: page not present");
    pa = PTE_ADDR(*pte);
    flags = PTE_FLAGS(*pte);
    if((mem = kalloc()) == 0)
      goto bad;
    memmove(mem, (char*)P2V(pa), PGSIZE);
    if(mappages(d, (void*)i, PGSIZE, V2P(mem), flags) < 0) {
      kfree(mem);
      goto bad;
    }
  }
  return d;

bad:
  freevm(d);
  return 0;
}
and for allocuvm
int
allocuvm(pde_t *pgdir, uint oldsz, uint newsz)
{
  char *mem;
  uint a;

  if(newsz >= KERNBASE)
    return 0;
  if(newsz < oldsz)
    return oldsz;

  a = PGROUNDUP(oldsz);
  for(; a < newsz; a += PGSIZE){
    mem = kalloc();
    if(mem == 0){
      cprintf("allocuvm out of memory\n");
      deallocuvm(pgdir, newsz, oldsz);
      return 0;
    }
    memset(mem, 0, PGSIZE);
    if(mappages(pgdir, (char*)a, PGSIZE, V2P(mem), PTE_W|PTE_U) < 0){
      cprintf("allocuvm out of memory (2)\n");
      deallocuvm(pgdir, newsz, oldsz);
      kfree(mem);
      return 0;
    }
  }
  return newsz;
}

copyuvm builds a complete, brand-new page directory for the child, covering both halves of the address space (user + kernel). setupkvm supplies the kernel half, i.e. the fixed mappings every process shares, and the loop that follows copies the user pages into it.
allocuvm, on the other hand, just extends existing virtual memory (specifically the heap portion) inside a page directory that already contains the kernel mappings, so there is no need to call setupkvm there.
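For reference, here is setupkvm itself (lightly annotated from xv6's vm.c; kmap is xv6's static table of the kernel's fixed mappings: I/O space, kernel text/rodata, kernel data plus free memory, and device memory). It allocates a fresh, zeroed page directory and installs only the kernel mappings, leaving the user half empty:

// Set up the kernel part of a page table.
pde_t*
setupkvm(void)
{
  pde_t *pgdir;
  struct kmap *k;

  if((pgdir = (pde_t*)kalloc()) == 0)
    return 0;
  memset(pgdir, 0, PGSIZE);
  if (P2V(PHYSTOP) > (void*)DEVSPACE)
    panic("PHYSTOP too high");
  // Install each fixed kernel mapping described in kmap[].
  for(k = kmap; k < &kmap[NELEM(kmap)]; k++)
    if(mappages(pgdir, k->virt, k->phys_end - k->phys_start,
                (uint)k->phys_start, k->perm) < 0) {
      freevm(pgdir);
      return 0;
    }
  return pgdir;
}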

Related

Load balancing MPI multithreading for variable-complexity tasks or variable-speed nodes?

I've written an MPI code that currently multithreads by sending equal numbers of elements from each array to a different process to do work (thus, for 6 workers, the array is broken into 6 equal parts). What I would like to do is send small chunks only if a worker is ready to receive, and receive completed chunks without blocking future sends; this way if one chunk takes 10 seconds but the other chunks take 1 second, other data can be processed while waiting for the long chunk to complete.
Here's some skeleton code I've put together:
#include <mpi.h>
#include <iostream>
#include <cstdio>
#include <vector>
#include <cmath>

struct crazytaxi
{
    double a = 10.0;
    double b = 25.2;
    double c = 222.222;
};

int main(int argc, char** argv)
{
    // Initial and temp kanno vectors
    std::vector<crazytaxi> kanno;
    std::vector<crazytaxi> kanno_tmp;
    // init MPI
    MPI_Init(NULL, NULL);
    // allocate vector
    int SZ = 4200;
    kanno.resize(SZ);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    if (world_rank == 0)
    {
        for (int i = 0; i < SZ; i++) {
            kanno[i].a = 1.0*i;
            kanno[i].b = 10.0/(i+1);
        }
    }
    for (int j = 0; j < 10; j++) {
        // Make sure all processes have the same kanno vector
        if (world_rank == 0) {
            for (int i = 1; i < world_size; i++)
                MPI_Send(&kanno[0], sizeof(crazytaxi)*kanno.size(), MPI_BYTE, i, 3, MPI_COMM_WORLD);
        } else {
            MPI_Recv(&kanno[0], sizeof(crazytaxi)*kanno.size(), MPI_BYTE, 0, 3, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        // copy to tmp vector
        kanno_tmp = kanno;
        MPI_Barrier(MPI_COMM_WORLD);
        // the sender
        if (world_rank == 0) {
            unsigned p1 = 0;
            unsigned segment = 10;
            unsigned p2 = segment;
            while (p1 < SZ) {
                for (int i = 0; i < world_size; i++) {
                    // if (process #i is ready to receive)
                    //     send data in chunks of 10 to i
                    // else
                    //     continue
                }
            }
        }
        if (world_rank != 0) {
            // Receive data to be processed; p1 and p2 would be the
            // bounds of the chunk received from rank 0.
            unsigned p1 = 0, p2 = 0;
            // do some math
            for (unsigned i = p1; i < p2; i++)
                kanno_tmp[i].a = std::sqrt(kanno[i].a)/((double)i+1.0);
            // Send processed data to 0 and wait to receive new data.
        }
        // copy temp vector to kanno
        kanno = kanno_tmp;
    }
    // print some of the results
    if (world_rank == 0)
    {
        for (int i = 0; i < SZ; i += 40)
            printf("Line %d: %lg,%lg\n", i, kanno[i].a, kanno[i].b);
    }
    MPI_Finalize();
}
I can get this 90% of the way to what I want, except that my MPI_Send and MPI_Recv calls will block, or the 'master' process won't know when the 'slave' processes are ready to receive data.
Is there a way in MPI to do something like
unsigned Datapointer = [some_array_index];
while (Datapointer < array_size) {
    if (world_rank == 0) {
        for (int i = 1; i < world_size; i++)
        {
            if (<process i is ready to receive>) {
                MPI_Send([...]);
                Datapointer += 10;
            }
            if (<process i has sent data>)
                MPI_Recv([...]);
            if (Datapointer > array_size) {
                MPI_Bcast([killswitch]);
                break;
            }
        }
    }
}
MPI_Barrier(MPI_COMM_WORLD);
or is there a more efficient way to structure this for variable-complexity chunks or variable-speed nodes?
As @Gilles Gouaillardet pointed out, the key in such a scenario is MPI_ANY_SOURCE. Using it, a process can receive a message from any source. To know which process sent that message, inspect status.MPI_SOURCE on the status of the receive call.
MPI_Status status;
if (rank == 0) {
    // send initial work to all processes
    while (true) {
        MPI_Recv(buf, 32, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        // do the distribution logic
        MPI_Send(buf, 32, MPI_INT, status.MPI_SOURCE, tag, MPI_COMM_WORLD);
        // break out of the loop once the work is over, and send all the
        // processes a message to stop waiting for work
    }
}
else {
    while (true) {
        // receive work from rank 0
        MPI_Recv(buf, 32, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        // perform the computation and send back the result
        MPI_Send(buf, 32, MPI_INT, 0, tag, MPI_COMM_WORLD);
        // break out of this when asked by rank 0 via some kind of special message
    }
}
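To make this concrete, here is a minimal, self-contained sketch of the resulting master/worker loop in plain C. Rather than shipping the payload itself, it hands out chunk indices (the workers already hold the kanno vector after the broadcast step in your skeleton); the chunk size, chunk count, and the use of a negative index as a stop sentinel are all illustrative choices, not part of the original question:

#include <mpi.h>
#include <stdio.h>

enum { CHUNK = 10, NCHUNKS = 420 };   /* hypothetical sizes */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int next = 0, stop = -1, active = 0;
        MPI_Status status;
        /* Prime each worker with one chunk index (or a stop sentinel). */
        for (int i = 1; i < size; i++) {
            if (next < NCHUNKS) {
                MPI_Send(&next, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
                next++;
                active++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            }
        }
        /* Hand out the rest on demand: fast workers simply come back sooner. */
        while (active > 0) {
            int finished;
            MPI_Recv(&finished, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            if (next < NCHUNKS) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&stop, 1, MPI_INT, status.MPI_SOURCE, 0, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {
        for (;;) {
            int chunk;
            MPI_Recv(&chunk, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (chunk < 0)
                break;  /* stop sentinel */
            /* ... process elements [chunk*CHUNK, (chunk+1)*CHUNK) here ... */
            MPI_Send(&chunk, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

Because the master answers whichever worker reports in first (MPI_ANY_SOURCE), fast workers automatically receive more chunks than slow ones, which is exactly the load balancing you are after.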

CUDA string search in large file, wrong result

I am working on simple naive string search in CUDA.
I am new to CUDA. It works fine for smaller files (approx. 1MB). After I make these files bigger (Ctrl+A, Ctrl+C several times in Notepad++), my program's count is higher (about +1%) than that of
grep -o text file_name | wc -l
It is a very simple function, so I don't know what could cause this. I need it to work with larger files (~500MB).
Kernel code (gpuCount is a __device__ int global variable):
__global__ void stringSearchGpu(char *data, int dataLength, char *input, int inputLength){
    int id = blockDim.x*blockIdx.x + threadIdx.x;
    if (id < dataLength)
    {
        int fMatch = 1;
        for (int j = 0; j < inputLength; j++)
        {
            if (data[id + j] != input[j]) fMatch = 0;
        }
        if (fMatch)
        {
            atomicAdd(&gpuCount, 1);
        }
    }
}
This is calling the kernel in main function:
int blocks = 1, threads = fileSize;
if (fileSize > 1024)
{
    blocks = (fileSize / 1024) + 1;
    threads = 1024;
}
clock_t cpu_start = clock();
// kernel call
stringSearchGpu<<<blocks, threads>>>(cudaBuffer, strlen(buffer), cudaInput, strlen(input));
cudaDeviceSynchronize();
After this I just copy the result to Host and print it.
Can anyone please help me with this?
First of all, you should always check the return values of CUDA functions for errors. A good way to do so is the following:
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
Wrap your CUDA calls, such as:
gpuErrchk(cudaDeviceSynchronize());
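Note that kernel launches do not return an error code themselves, so the usual pattern is to check the launch and the subsequent synchronization separately:

stringSearchGpu<<<blocks, threads>>>(cudaBuffer, strlen(buffer), cudaInput, strlen(input));
gpuErrchk(cudaPeekAtLastError());   // catches launch-configuration errors
gpuErrchk(cudaDeviceSynchronize()); // catches errors raised while the kernel runs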
Second, your kernel accesses out-of-bounds memory. Suppose dataLength=100, inputLength=7, and id=98. In your kernel code:
if (id < dataLength) // 98 is less than 100, so the condition is true
{
    int fMatch = 1;
    for (int j = 0; j < inputLength; j++) // j runs from 0 to 6
    {
        // if j > 1 then id+j >= 100, which is out of bounds: an illegal access
        if (data[id + j] != input[j]) fMatch = 0;
    }
Change the condition so that the whole pattern fits inside the data buffer:
if (id <= dataLength - inputLength)
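Putting the fix into the kernel, a corrected version might look like this (the early break on the first mismatch is an optional optimization that is not in the original code):

__global__ void stringSearchGpu(char *data, int dataLength, char *input, int inputLength){
    int id = blockDim.x*blockIdx.x + threadIdx.x;
    if (id <= dataLength - inputLength) // the whole pattern must fit
    {
        int fMatch = 1;
        for (int j = 0; j < inputLength; j++)
        {
            if (data[id + j] != input[j]) { fMatch = 0; break; }
        }
        if (fMatch)
        {
            atomicAdd(&gpuCount, 1);
        }
    }
}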

Linux Kernel: manually modify page table entry flags

I am trying to manually mark a certain memory region of a userspace process as non-cacheable (for educational purposes, not intended to be used in production code) by setting a flag in the respective page table entries.
I have an Ubuntu 14.04 (ASLR disabled) with a 4.4 Linux kernel running on an x86_64 Intel Skylake processor.
In my kernel module I have the following function:
/*
 * Set memory region [start,end], excluding 'addr', of process with PID 'pid' as uncacheable.
 */
ssize_t set_uncachable(uint32_t pid, uint64_t start, uint64_t end, uint64_t addr)
{
    struct task_struct* ts = NULL;
    struct vm_area_struct *curr, *first = NULL;
    struct mm_struct* mm;
    pgd_t * pgd;
    pte_t * pte;
    uint64_t numpages, curr_addr;
    uint32_t level, j, i = 0;

    printk(KERN_INFO "set_uncachable called\n");

    ts = pid_task(find_vpid(pid), PIDTYPE_PID); // find task from PID
    pgd = ts->mm->pgd;                          // page table root of the task
    first = ts->mm->mmap;
    curr = first;

    if (first == NULL)
        return -1;

    do
    {
        printk(KERN_INFO "Region %3u [0x%016llx - 0x%016llx]", i, curr->vm_start, curr->vm_end);
        numpages = (curr->vm_end - curr->vm_start) / PAGE_SIZE; // PAGE_SIZE is 4K for now
        if (curr->vm_start > curr->vm_end)
            numpages = 0;
        for (j = 0; j < numpages; j++)
        {
            curr_addr = curr->vm_start + (PAGE_SIZE*j);
            pte = lookup_address_in_pgd(pgd, curr_addr, &level);
            if ((pte != NULL) && (level == 1))
            {
                printk(KERN_INFO "PTE for 0x%016llx - 0x%016llx (level %u)\n", curr_addr, pte->pte, level);
                if (curr_addr >= start && curr_addr < end && curr_addr != addr)
                {
                    // set the page entry to PAT#3
                    pte->pte |= PWT_BIT | PCD_BIT;
                    pte->pte &= ~PAT_BIT;
                    printk(KERN_INFO "PTE for 0x%016llx - 0x%016llx (level %u) -- UPDATED\n", curr_addr, pte->pte, level);
                }
            }
        }
        curr = curr->vm_next;
        if (curr == NULL)
            return -1;
        i++;
    } while (curr != first);
    return 0;
}
To test the above code I run an application that allocates a certain region in memory:
//#define BUF_ADDR_START 0x0000000008400000LL /* works */
#define BUF_ADDR_START 0x00007ffff0000000LL /* does not work */
[...]
buffer = mmap((void *)BUF_ADDR_START, BUF_SIZE, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_POPULATE, 0, 0);
if (buffer == MAP_FAILED)
{
    printf("Failed to map buffer\n");
    exit(-1);
}
memset(buffer, 0, BUF_SIZE);
printf("Buffer at %p\n", buffer);
I want to mark the buffer uncacheable using my kernel module. The code in my kernel module works for 0x8400000, but for 0x7ffff0000000 no page table entry is found (i.e. lookup_address_in_pgd returns NULL). The buffer is definitely allocated in the test program, though.
It seems like my kernel module works for low addresses (code, data, and heap sections), but not for memory mapped at higher addresses (stack, shared libraries, etc.).
Does anyone have an idea why it fails for larger addresses? Suggestions on how to implement set_uncachable more elegantly are welcome as well ;-)
Thanks!

MPI_Recv() invalid buffer pointer

I have a dynamically allocated array that is sent by rank 0 to the other ranks using MPI_Send().
On the receiving side, memory for a dynamic array is allocated using malloc().
MPI_Recv() then happens on the other ranks. At this receive call, I get an invalid buffer pointer error.
Code is conceptually similar to this:
struct graph{
    int count;
    int * array;
} a_graph;

int x = 10;
MPI_Status status;

//ONLY 2 RANKS ARE PRESENT. RANK 0 SENDS MSG TO RANK 1
if (rank == 0){
    a_graph * my_graph = malloc(sizeof(my_graph))
    my_graph->count = x;
    my_graph->array = malloc(sizeof(int)*my_graph->count);
    for(int i = 0; i < my_graph->count; i++)
        my_graph->array[i] = i;
    MPI_Send(my_graph->array,my_graph->count,int,1,0,MPI_COMM_WORLD);
    free(my_graph->array);
    free(my_graph);
}
else if (rank == 1){
    a_graph * my_graph = malloc(sizeof(my_graph))
    my_graph->count = x;
    my_graph->array = malloc(sizeof(int)*my_graph->count);
    MPI_Recv(my_graph->array,my_graph->count,int,0,0,MPI_COMM_WORLD,&status) // MPI INVALID BUFFER POINTER ERROR HAPPENS AT THIS RECV
}
I don't understand why this happens, since memory is allocated in both the sender and receiver ranks.
Below is the minimal, working, and verifiable (MWVE) example that Zulan suggested you make. Please provide an MWVE in your future questions. Anyway, you need to use the MPI datatype MPI_INT instead of int for sending and receiving.
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct graph{
    int count;
    int * array;
} a_graph;

int main()
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int x = 10;
    MPI_Status status;
    //ONLY 2 RANKS ARE PRESENT. RANK 0 SENDS MSG TO RANK 1
    if (rank == 0){
        a_graph * my_graph = malloc(sizeof(a_graph));
        my_graph->count = x;
        my_graph->array = malloc(sizeof(int)*my_graph->count);
        for(int i = 0; i < my_graph->count; i++)
            my_graph->array[i] = i;
        MPI_Send(my_graph->array, my_graph->count, MPI_INT, 1, 0, MPI_COMM_WORLD);
        free(my_graph->array);
        free(my_graph);
    }
    else if (rank == 1){
        a_graph * my_graph = malloc(sizeof(a_graph));
        my_graph->count = x;
        my_graph->array = malloc(sizeof(int)*my_graph->count);
        MPI_Recv(my_graph->array, my_graph->count, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        for (int i = 0; i < my_graph->count; ++i)
        {
            printf("%i\n", my_graph->array[i]);
        }
    }
    MPI_Finalize();
    return 0;
}
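To verify it, compile and launch with two ranks (assuming an MPI toolchain such as Open MPI or MPICH, and that the file is saved as mwve.c; both names are just examples):

mpicc mwve.c -o mwve
mpirun -np 2 ./mwve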

malloc, scope, initialization (or lack thereof)

Here's the code:
// allocation
void allocateSymbolStorage(char **pepperShakerList, char **pepperList)
{
    // allocate storage for an array of pointers
    pepperShakerList = (char **) malloc(MAX_PEPPER_SHAKERS * sizeof(char *));
    for (int i = 0; i < MAX_PEPPER_SHAKERS; i++)
    {
        if ((pepperShakerList[i] = (char *) malloc(MAX_SHAKERNAME_LENGTH * sizeof(char))) == NULL)
            fatalError("failed pepperShakerList alloc");
    }
    // allocate storage for an array of pointers
    pepperList = (char **) malloc(MAX_PEPPERS * sizeof(char *));
    for (int i = 0; i < MAX_PEPPERS; i++)
    {
        if ((pepperList[i] = (char *) malloc(MAX_PEPPER_LENGTH * sizeof(char))) == NULL)
            fatalError("failed pepperList alloc");
    }
}

void buildPepperShakers(void)
{
    char **pepperShakerList, **pepperList;
    allocateSymbolStorage(pepperShakerList, pepperList);
    // ....
    freeSymbolStorage(pepperShakerList, pepperList);
}
Here's the VS 2010 error:
warning C4700: uninitialized local variable 'pepperList' used
Here's the confusion:
Why the error if the char ** is being allocated in the allocate function? Is it a matter of the thing falling out of scope?
Assuming it's pepperList and not symbolList that you are talking about, AND assuming that the code in allocateSymbolStorage reflects what you want to do, then VC is complaining correctly.
As it stands, your code would crash, because in buildPepperShakers() you are NOT getting any values back from allocateSymbolStorage.
So your allocateSymbolStorage should be declared as:
void allocateSymbolStorage(char ***pepperShakerList, char ***pepperList)
THEN you pass the addresses of the local pointer variables in buildPepperShakers, namely pepperList and pepperShakerList, to the allocation function, so that it can THEN do the allocations as per TJD's answer. That is:
void buildPepperShakers(void) {
    char **pepperShakerList, **pepperList;
    allocateSymbolStorage(&pepperShakerList, &pepperList);
}
of course your allocateSymbolStorage body now becomes:
void allocateSymbolStorage(char ***pepperShakerList_p, char ***pepperList_p)
{
    char **pepperShakerList, **pepperList;
    // allocate storage for an array of pointers
    pepperShakerList = (char **) malloc(MAX_PEPPER_SHAKERS * sizeof(char *));
    for (int i = 0; i < MAX_PEPPER_SHAKERS; i++)
    {
        if ((pepperShakerList[i] = (char *) malloc(MAX_SHAKERNAME_LENGTH * sizeof(char))) == NULL)
            fatalError("failed pepperShakerList alloc");
    }
    // allocate storage for an array of pointers
    pepperList = (char **) malloc(MAX_PEPPERS * sizeof(char *));
    for (int i = 0; i < MAX_PEPPERS; i++)
    {
        if ((pepperList[i] = (char *) malloc(MAX_PEPPER_LENGTH * sizeof(char))) == NULL)
            fatalError("failed pepperList alloc");
    }
    // write the allocated pointers back through the out-parameters
    *pepperShakerList_p = pepperShakerList;
    *pepperList_p = pepperList;
}
and now VC should not complain. Although this is an ugly way of doing memory management of your objects :-)
This is what you are intending; you need to dereference the pointer you pass in:
*pepperShakerList = (char **) malloc(MAX_PEPPER_SHAKERS * sizeof(char *));
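The question's buildPepperShakers also calls freeSymbolStorage, which is not shown; a minimal sketch of a matching counterpart (hypothetical, mirroring the allocation above) would be:

// Hypothetical counterpart to allocateSymbolStorage: frees each string,
// then the pointer arrays themselves. Plain char ** parameters suffice
// here because nothing is written back to the caller.
void freeSymbolStorage(char **pepperShakerList, char **pepperList)
{
    for (int i = 0; i < MAX_PEPPER_SHAKERS; i++)
        free(pepperShakerList[i]);
    free(pepperShakerList);

    for (int i = 0; i < MAX_PEPPERS; i++)
        free(pepperList[i]);
    free(pepperList);
}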
