In cache coherency (specifically write-through and write-back), can the cache be updated even though it hasn't read from memory first? - caching

There are three serial steps. In serial 1 the event is blank, and in both write-through and write-back only the memory column has X. In serial 2 the event is "P reads X", and both the memory and cache columns for write-through and write-back have X. Lastly, in serial 3 the event is "P updates X": in write-through both memory and cache have X', whereas in write-back only the cache column has X' while the memory column still has X.
This image shows how write-through and write-back work. But what if, in serial 2, instead of "P reads X" the event is "P updates X"? What happens to the memory and the cache?
From what I understand, this is what will happen if serial 2 is "P updates X":
| Serial | Event       | Memory (write-through) | Cache (write-through) | Memory (write-back) | Cache (write-back) |
| ------ | ----------- | ---------------------- | --------------------- | ------------------- | ------------------ |
| 1      |             | X                      |                       | X                   |                    |
| 2      | P updates X | X'                     | X'                    | X                   | X'                 |
But I'm not really sure whether this is correct. I need clarification about this.
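To sanity-check the table, here is a minimal sketch (my own, not from the original material) that models a single location with one cache entry, assuming a write-allocate policy so that the write itself brings the block into the cache; the names are invented for illustration:

#include <iostream>
#include <string>

// One memory word plus one cache entry; an empty string means "not cached".
struct Level { std::string memory, cache; };

// Write-through with write-allocate: the update goes to the cache and to memory.
void update_write_through(Level& l, const std::string& v) { l.cache = v; l.memory = v; }

// Write-back with write-allocate: only the cache is updated; memory stays stale
// until the dirty line is eventually written back (not modelled here).
void update_write_back(Level& l, const std::string& v) { l.cache = v; }

int main() {
    Level wt{"X", ""}, wb{"X", ""};   // serial 1: only memory holds X
    update_write_through(wt, "X'");   // serial 2 is "P updates X"
    update_write_back(wb, "X'");
    std::cout << "write-through: memory=" << wt.memory << " cache=" << wt.cache << "\n"
              << "write-back:    memory=" << wb.memory << " cache=" << wb.cache << "\n";
    // Prints X'/X' for write-through and X/X' for write-back, matching the table above.
}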

Related

How to measure precisely the memory usage of the GPU (OpenACC+Managed Memory)

Which is the most precise method to measure the memory usage of the GPU of an application that is using OpenACC with Managed Memory?
I used two methods to do so: one is
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla v100 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 51C P5 11W / N/A | 10322MiB / 16160MiB | 65% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2670 G ./myapp 398MiB |
+-----------------------------------------------------------------------------+
Regarding what is printed here, what is the difference between the memory usage above (10322MiB / 16160MiB) and the one below (./myapp 398MiB)?
The other method I used is:
#include <openacc.h>

// max_mem_usage is a global high-water mark, updated on every call.
void measure_acc_mem_usage() {
    auto dev_ty = acc_get_device_type();
    // Total and currently free device memory, in bytes.
    auto dev_mem = acc_get_property(0, dev_ty, acc_property_memory);
    auto dev_free_mem = acc_get_property(0, dev_ty, acc_property_free_memory);
    auto mem = dev_mem - dev_free_mem;
    if (mem > max_mem_usage)
        max_mem_usage = mem;
}
A function I call many times during the program execution.
Neither of these methods seems to report the exact behaviour of the device (I base this statement on when saturation seems to occur: the point at which the application begins to run really slowly as the problem size increases), and they report very different values (for example, while the second method indicates 2 GB of memory usage, nvidia-smi says 16 GB).
Not sure you'll be able to get a precise value of memory usage when using CUDA Unified Memory (aka managed memory). The nvidia-smi utility will only show cudaMalloc-allocated memory, and the OpenACC property function will use cudaMemGetInfo, which isn't accurate for UM.
Bob gives a good explanation as to why here: CUDA unified memory pages accessed in CPU but not evicted from GPU
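For what it's worth, here is a minimal sketch (my own, not from the answer) of querying the device directly with cudaMemGetInfo, the call the OpenACC property lookup is said to rely on; with managed memory the free figure changes as pages migrate on demand, so treat it as indicative only:

#include <cuda_runtime.h>
#include <cstdio>

// Report used vs. total device memory as seen by the CUDA runtime.
void report_cuda_mem() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
        std::printf("used: %zu MiB / total: %zu MiB\n",
                    (total_bytes - free_bytes) >> 20, total_bytes >> 20);
    }
}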

BigQuery: Sample a varying number of rows per group

I have two tables. One has a list of items, and for each item, a number n.
item | n
--------
a | 1
b | 2
c | 3
The second one has a list of rows containing item, uid, and other columns.
item | uid | data
------------------
a | x | foo
a | x | baz
a | x | bar
a | z | arm
a | z | leg
b | x | eye
b | x | eye
b | x | eye
b | x | eye
b | z | tap
c | y | tip
c | z | top
I would like to sample, for each (item, uid) pair, n rows (arbitrarily; it's better if this is uniformly random, but it doesn't have to be). In the example above, I want to keep at most one row per user for item a, two rows per user for item b, and three rows per user for item c:
item | uid | data
------------------
a | x | baz
a | z | arm
b | x | eye
b | x | eye
b | z | tap
c | y | tip
c | z | top
ARRAY_AGG with LIMIT n doesn't work for two reasons: first, I suspect that given that n can be large (on the order of 100,000), this won't scale. The second, more fundamental problem is that n needs to be a constant.
Table sampling also doesn't seem to solve my problem, since it's per-table, and also only supports sampling a fixed percentage of rows, rather than a fixed number of rows.
Are there any other options?
Consider the solution below
select * except(n)
from rows_list
join items_list
using(item)
where true                                              -- dummy filter so QUALIFY is allowed
qualify row_number() over win <= n                      -- keep at most n rows per (item, uid)
window win as (partition by item, uid order by rand())  -- random order within each group
if applied to the sample data in your question, it produces output like the expected result shown above

Is there any calibration tool between two languages' performance?

I'm measuring the performance of two programs, A and B. A is written in Golang, B is written in Python. The important point here is that I'm interested in how the performance value increases over time, not in the absolute performance values of the two programs.
For example,
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 3 | 500 |
+------+-----+-----+
| 2 | 5 | 800 |
+------+-----+-----+
| 3 | 9 | 1300|
+------+-----+-----+
| 4 | 13 | 1800|
+------+-----+-----+
The values in columns A and B (A: 3, 5, 9, 13 / B: 500, 800, 1300, 1800) are the execution times of the programs. This execution time can be seen as performance, and the difference between the absolute performance values of A and B is very large, so comparing the slopes of the two performance graphs directly would be meaningless (Python is very slow compared to Golang).
I want to compare the performance of program A written in Golang with program B written in Python, and I'm looking for a calibration tool or formula, based on benchmarks, that estimates the execution time program A would have if it were written in Python.
Is there any way to solve this problem?
If you are interested in the relative change, you should normalize the data for each programming language. In other words, divide the Golang values by 3 and the Python values by 500 (each series by its first value).
+------+-----+-----+
| time | A | B |
+------+-----+-----+
| 1 | 1 | 1 |
+------+-----+-----+
| 2 | 1.66| 1.6 |
+------+-----+-----+
| 3 | 3 | 2.6 |
+------+-----+-----+
| 4 |4.33 | 3.6 |
+------+-----+-----+
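As a rough illustration of that normalization (a sketch, not an existing calibration tool), divide each series by its first sample so both curves start at 1 and only the relative growth remains comparable:

#include <cstddef>
#include <iostream>
#include <vector>

// Normalize a series by its first sample so only relative growth remains.
std::vector<double> normalize(const std::vector<double>& series) {
    std::vector<double> out;
    out.reserve(series.size());
    for (double v : series) out.push_back(v / series.front());
    return out;
}

int main() {
    std::vector<double> a{3, 5, 9, 13};          // Golang timings from the question
    std::vector<double> b{500, 800, 1300, 1800}; // Python timings from the question
    auto na = normalize(a), nb = normalize(b);
    for (std::size_t i = 0; i < na.size(); ++i)
        std::cout << "time " << i + 1 << ": A=" << na[i] << " B=" << nb[i] << "\n";
}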

Communication between two applications using Environment Variables

Question
How can I communicate with another program (for instance, a Windows service) through environment variables (not system or user ones)?
What do we have
Well, I have the following scheme for a data logger:
------------------------- --------------------------------
| the things to measure | | the things that do something |
------------------------- --------------------------------
| ^
| sensors | switches
V |
-------------------------------------------------------------------
| dedicated hardware |
-------------------------------------------------------------------
| ^
| | serial communication
V |
--------------- -------------
| Windows | ------------------------------------> | user |
| service | <------------------------------------ | interface |
--------------- udp communication -------------
|^ keyboard
V| and screen
--------
| user |
--------
In the current development:
the Windows service is always running when Windows is running
the user can open and close the user interface (of course :p)
the Windows service acquires data from the sensors
the user interface automatically requests data from the Windows service every 100 ms and shows it to the user via UDP communication through an implemented protocol (we call it the GetData() command and its response)
the user can send some other commands to change the data to acquire through the implemented protocol (we call it the SetSensors() command and its response)
Both the user interface and the Windows service are developed in Borland C++ Builder 6 and use the NMUDP component, from the FastNet tab, for UDP communication.
What we are thinking of doing
Because of some buffer issues, and to free the UDP channel for only the SetSensors() command and its response, we are considering that, instead of using GetData():
the Windows service would get the data from the sensors and put it into environment variables
the user interface would read them to show to the user
Scheme after doing what we are thinking
------------------------- --------------------------------
| the things to measure | | the things that do something |
------------------------- --------------------------------
| ^
| sensors | switches
V |
-------------------------------------------------------------------
| dedicated hardware |
-------------------------------------------------------------------
| ^
| | serial communication
V |
--------------- -------------
| | ------------------------------------> | |
| | environment variables | |
| | (get data from sensors) | |
| Windows | | user |
| service | | interface |
| | | |
| | ------------------------------------> | |
| | <------------------------------------ | |
--------------- udp communication -------------
(send commands to service) |^ keyboard
V| and screen
--------
| user |
--------
Is there any way to do that?
We would not use system or user environment variables, because those are written to the Windows Registry, i.e., they are saved to the hard drive and everything gets slower...
As @HansPassant said, I cannot do that directly. Although I saw some ways to do it via a memory-mapped file (a rough sketch is included after the diagram below, for reference), it is much easier to simply add one more UDP communication channel on another port. So:
------------------------- --------------------------------
| the things to measure | | the things that do something |
------------------------- --------------------------------
| ^
| sensors | switches
V |
-------------------------------------------------------------------
| dedicated hardware |
-------------------------------------------------------------------
| ^
| | serial communication
V |
--------------- -------------
| | ------------------------------------> | |
| | udp communication (port 3) | |
| | (get data from sensors) | |
| Windows | | user |
| service | | interface |
| | (port 1) | |
| | ------------------------------------> | |
| | <------------------------------------ | |
--------------- udp communication (port 2) -------------
(send commands to service) |^ keyboard
V| and screen
--------
| user |
--------
If someone provides a better solution, I'll mark it as the solution in the future.
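For completeness, here is a rough sketch of the memory-mapped-file route mentioned above, using plain Win32 calls (CreateFileMapping / MapViewOfFile). The mapping name "SensorSharedData" and the SensorData layout are invented for illustration, and real code would add a mutex or event to guard concurrent access:

#include <windows.h>
#include <cstring>

struct SensorData { double values[16]; };  // hypothetical layout

// Writer side (the Windows service): create the named shared region and update it.
void publish(const SensorData& d) {
    HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                  0, sizeof(SensorData), "SensorSharedData");
    if (!h) return;
    void* view = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(SensorData));
    if (view) {
        std::memcpy(view, &d, sizeof(SensorData));
        UnmapViewOfFile(view);
    }
    // In real code, keep 'h' open for the lifetime of the service.
}

// Reader side (the user interface): open the same region and copy the data out.
bool read_latest(SensorData& out) {
    HANDLE h = OpenFileMappingA(FILE_MAP_READ, FALSE, "SensorSharedData");
    if (!h) return false;
    void* view = MapViewOfFile(h, FILE_MAP_READ, 0, 0, sizeof(SensorData));
    bool ok = false;
    if (view) {
        std::memcpy(&out, view, sizeof(SensorData));
        UnmapViewOfFile(view);
        ok = true;
    }
    CloseHandle(h);
    return ok;
}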

The "Waiting lists problem"

A number of students want to get into sections for a class; some are already signed up for one section but want to change sections, so they all get on the wait lists. A student can get into a new section only if someone drops from that section. No student is willing to drop a section they are already in unless they can be sure to get into a section they are waiting for. The wait list for each section is first come, first served.
Get as many students into their desired sections as you can.
The stated problem can quickly devolve into a gridlock scenario. My question is: are there known solutions to this problem?
One trivial solution would be to take each section in turn, force the first student from the waiting list into the section, and then check whether anyone ends up dropping out once things are resolved (O(n) or more in the number of sections). This would work for some cases, but I think there might be better options involving forcing more than one student into a section (O(n) or more in the student count) and/or operating on more than one section at a time (O(bad) :-)
Well, this just comes down to finding cycles in the directed graph of classes, right? Each edge is a student who wants to go from one node to another, and any time you find a cycle, you delete it, because those students can resolve their needs with each other. You're finished when you're out of cycles.
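A rough sketch of that cycle search (my own illustration, assuming single-character section labels and ignoring queue order and capacities; the edges are taken from the worked example below):

#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <vector>

// Sections are nodes; each waiting student is an edge current -> wanted.
// A cycle means that group of students can swap among themselves.
using Graph = std::map<char, std::multiset<char>>;

// Depth-first search for one cycle; fills 'cycle' with the nodes on it.
bool find_cycle(const Graph& g, char node, std::vector<char>& path,
                std::set<char>& on_path, std::vector<char>& cycle) {
    path.push_back(node);
    on_path.insert(node);
    auto it = g.find(node);
    if (it != g.end()) {
        for (char next : it->second) {
            if (on_path.count(next)) {
                // Found a cycle: it is the part of the path starting at 'next'.
                cycle.assign(std::find(path.begin(), path.end(), next), path.end());
                return true;
            }
            if (find_cycle(g, next, path, on_path, cycle)) return true;
        }
    }
    path.pop_back();
    on_path.erase(node);
    return false;
}

int main() {
    // Edges from the example below: students 1 and 2 want A->D, 4 wants B->A,
    // 5 wants C->A, 7 wants D->C, 8 wants D->B (students 3 and 6 stay put).
    Graph g{{'A', {'D', 'D'}}, {'B', {'A'}}, {'C', {'A'}}, {'D', {'C', 'B'}}};
    std::vector<char> path, cycle;
    std::set<char> on_path;
    if (find_cycle(g, 'A', path, on_path, cycle)) {
        std::cout << "cycle:";
        for (char c : cycle) std::cout << ' ' << c;
        std::cout << "\n";  // e.g. "cycle: A D B" (students 1, 8 and 4 can swap)
    }
}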
OK, let's try. We have 8 students (1..8) and 4 sections. Each student is in a section and each section has room for 2 students. Most students want to switch, but not all.
In the table below, we see each student, their current section, their requested section, and their position in the queue (if any).
+------+-----+-----+-----+
| stud | now | req | que |
+------+-----+-----+-----+
| 1 | A | D | 2 |
| 2 | A | D | 1 |
| 3 | B | B | - |
| 4 | B | A | 2 |
| 5 | C | A | 1 |
| 6 | C | C | - |
| 7 | D | C | 1 |
| 8 | D | B | 1 |
+------+-----+-----+-----+
We can present this information in a graph:
+-----+ +-----+ +-----+
| C |---[5]--->1| A |2<---[4]---| B |
+-----+ +-----+ +-----+
1 | | 1
^ | | ^
| [1] [2] |
| | | |
[7] | | [8]
| V V |
| 2 1 |
| +-----+ |
\--------------| D |--------------/
+-----+
We try to find a section with a vacancy, but we find none. Because all sections are full, we need a dirty trick: let's take a random section with a non-empty queue, in this case section A, and assume it has an extra position. This means student 5 can enter section A, leaving a vacancy in section C, which is taken by student 7. This leaves a vacancy in section D, which is taken by student 2. We now have a vacancy in section A. But we assumed that section A had an extra position, so we can remove this assumption and have gained a simpler graph.
If the path never returns to section A, undo the moves and mark A as an invalid starting point. Retry with another section.
If there are no valid sections left, we are finished.
Right now we have the following situation:
+-----+ +-----+ +-----+
| C | | A |1<---[4]---| B |
+-----+ +-----+ +-----+
| 1
| ^
[1] |
| |
| [8]
V |
1 |
+-----+ |
| D |--------------/
+-----+
We repeat the trick with another random section, and this solves the graph.
If you start with several students currently not assigned, you add an extra dummy section as their starting point. Of course, this means that there must be vacancies in some sections or the problem is not solvable.
Note that due to the order in the queue, it can be possible that there is no solution.
This is actually a graph problem. You can think of each of these waiting-list dependencies as an edge in a directed graph. If this graph has a cycle, then you have one of the situations you described. Once you have identified a cycle, you can choose any point to "break" the cycle by "overfilling" one of the classes, and you will know that things will settle correctly because there was a cycle in the graph.
