How to interpret stats/show in Rebol 3

I wanted to do some profiling on an R3 script and was looking at the stats command.
But what does this information mean?
How can it be used to monitor memory usage?
>> stats/show
Series Memory Info:
node size = 16
series size = 20
5 segs = 409640 bytes - headers
4888 blks = 812448 bytes - blocks
1511 strs = 86096 bytes - byte strings
2 unis = 86016 bytes - unicode strings
4 odds = 39216 bytes - odd series
6405 used = 1023776 bytes - total used
0 free / 14075 bytes - free headers / node-space
Pool[ 0] 8B 202/ 3328: 256 ( 6%) 13 segs, 26728 total
Pool[ 1] 16B 178/ 512: 256 (34%) 2 segs, 8208 total
Pool[ 2] 32B 954/ 2560: 512 (37%) 5 segs, 81960 total
...
Pool[26] 64B 0/ 0: 128 ( 0%) 0 segs, 0 total
Pools used 654212 of 1906200 (34%)
System pool used 497664
== 1023776

It shows the internal memory-management information; I'm not sure how useful it would be for profiling the script.
Anyway, here are some explanations about the memory pools.
Most pools are for series (there is a dedicated pool for GOB!s, and some others if you're looking at the Atronix source code). To keep it simple, I will focus on the series pools here.
Internally, a series has a header and its data, which is a chunk of contiguous memory. The header holds the width and length of the series; the data holds its actual content. In R3, series are used extensively to implement block!, port!, string!, object!, etc., so managing memory in R3 is mostly a matter of allocating and destroying series. Because series differ in width and length, pools are introduced to reduce fragmentation.
When a new series is needed, its header is allocated in a special pool, and another pool is chosen for its data: the pool whose width is closest to the size of the series. E.g. a block with 3 elements will probably be allocated from a pool with a width of 128 bytes (on 32-bit systems, a block is a series with 4 elements: 3 plus 1 terminator). A pool can grow as the program runs, so it is implemented as a list of segments; new segments are allocated and appended to the list as needed (but they are never released back to the system).
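To illustrate the idea (a minimal C sketch with invented pool widths, not R3's actual allocator):

#include <stddef.h>

/* Hypothetical size classes; R3's real pool widths differ. */
static const size_t pool_width[] = { 8, 16, 32, 64, 128, 256, 512 };
enum { NPOOLS = sizeof pool_width / sizeof pool_width[0] };

/* Pick the smallest pool whose width fits the requested data size;
   return -1 to mean "too big, fall back to the system pool". */
static int choose_pool(size_t bytes)
{
    for (int i = 0; i < NPOOLS; i++)
        if (bytes <= pool_width[i])
            return i;
    return -1;
}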
Another special pool is the system pool, which is used when the required memory is big. R3 doesn't actually manage this pool beyond collecting some statistics.
When it collects garbage, it sweeps from the root context, marking everything that is reachable, then goes through the series header pool, finds all the unreachable series, and destroys them.
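Schematically, the collector works like this minimal mark-and-sweep sketch in C (all names invented; not R3's actual code):

#include <stdbool.h>
#include <stdlib.h>

typedef struct series {
    bool           marked;
    struct series *refs[8];   /* series reachable from this one */
    int            nrefs;
} series;

static void mark(series *s)
{
    if (s == NULL || s->marked) return;
    s->marked = true;
    for (int i = 0; i < s->nrefs; i++)
        mark(s->refs[i]);
}

static void sweep(series **headers, int n)
{
    for (int i = 0; i < n; i++) {
        if (headers[i] && !headers[i]->marked) {
            free(headers[i]);            /* destroy the unreachable series */
            headers[i] = NULL;
        } else if (headers[i]) {
            headers[i]->marked = false;  /* reset for the next GC cycle */
        }
    }
}

/* collect: mark everything reachable from the root, then sweep. */
static void collect(series *root, series **headers, int n)
{
    mark(root);
    sweep(headers, n);
}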

If you use stats without a refinement, you get the actual memory usage, so by comparing the usage before and after your implementations you can see which one uses less memory.
>> stats
== 1129824
>> s: make string! 1024
== ""
>> stats
== 1132064

Related

Mapping 512KB main memory into 1KB cache homework question

I'm sorry if I made an error in posting this. Please let me know if I need to change anything.
I've received my computer architecture homework back and I missed this question. My professor's explanation didn't make sense to me, and I disagree with what he told me, so I am here asking what you guys think.
Here is the question:
A computer uses 16-bit memory addresses. Main memory is 512KB, and the cache is 1KB with 32B per block. Given each of the following mapping functions, calculate the number of bits in each field of the memory address.
Here is how I worked through the direct mapping part of the problem:
Cache memory: 1KB (2^10), 16-bit memory addresses (1 word = 2B) -> 1024B/2B = 512 words, 16 words per block (32B) -> 512/16 = 32 cache memory blocks.
Main memory: 512 KB (2^19), 16-bit memory addresses (1 word = 2B) -> 524288B/2B = 256K words, 16 words per block (32B) -> 256K/16 = 16384 or 16K main memory blocks.
I understand the word field as follows: 32B per block allows for 16 16-bit words per block. This (I believe) supports that: 1 word = 16 bits = 2B -> 32B/2B = 16 words in each block. Since 16 = 2^4, this gives 4 bits for determining which word in the block, leaving 12 bits for the tag and block bits in the memory address.
Now, in order to map 16K main memory blocks directly into 32 cache memory blocks, there will have to be 512 main memory blocks mapped to each cache memory block (16K / 32 = 512).
Here is where I am confused. Doesn't this require 9 tag bits, as 2^9 = 512 (main memory blocks possibly mapped into one cache memory block)?
For the block bits, which point to a particular block in the cache, 5 bits are required: 2^5 = 32 blocks in cache memory.
This would require 18 bits in the memory address.
Here is my professor's answer for this question:
2^5 = 32 -> 5 Word bits
(1KB)/(32B) = 32 blocks -> 5 Block bits
16 – 5 – 5 = 6 Tag bits
I did not realize I could simply subtract the required block and word bits to get the tag bits. But it still doesn't make sense to me. 2^6 = 64 blocks per cache block. 64*32 gives 2048. I can't wrap my head around this. Can someone please help?
Okay, the terminology that I learnt is slightly different, but the principle should be the same for this explanation.
A cache has multiple sets (sort of like cells), and each set has either 1 cache line (containing 1 block of data) for direct mapping, or multiple cache lines (each containing 1 block of data) for n-way associative mapping.
When mapping main memory blocks into the cache, the main memory address (16 bits) is divided into 3 fields: tag, index bits, and offset bits. A memory cell is 1 byte, and a block is made up of a number of cells.
Offset bits are used to access the individual bytes of a memory block. Think of them as the offset on top of the block's base address to get the byte you want. (I assume your memory is byte-addressable rather than word-addressable, as it would be inflexible to only be able to access 2B words.) This is what your prof/textbook calls the word bits. Hence, if a block has 32 bytes, log2(block size) = 5 bits are needed to access the individual cells in the mapped block.
Index bits (in a direct-mapped cache also called block bits, since the number of sets equals the number of blocks in the cache) identify which set/cache line/cache block the main memory block is mapped to. There are 1KB/32B = 32 cache blocks in the cache. As direct mapping is used, each set contains only 1 cache block, so there are 32 sets in this cache. Thus, to access the correct set, 5 bits are needed: index bits = 5.
The tag determines whether the data block in the cache is the one we are looking for from main memory. As the main memory address is 16 bits and we already know the index and offset fields, it is easy to deduce that the tag needs 16 - 5 - 5 = 6 bits. How we determine the tag is not really a concern here, as the block size and cache size (and hence the number of sets) are given.
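To make the arithmetic concrete, here is a minimal C sketch of that field-width calculation (the log2u helper is our own; byte-addressable memory is assumed):

#include <stdio.h>

static unsigned log2u(unsigned x)
{
    unsigned bits = 0;
    while (x >>= 1) bits++;
    return bits;
}

int main(void)
{
    unsigned addr_bits  = 16;    /* 16-bit memory addresses */
    unsigned cache_size = 1024;  /* 1 KB cache              */
    unsigned block_size = 32;    /* 32 B per block          */

    unsigned offset_bits = log2u(block_size);               /* 5 */
    unsigned index_bits  = log2u(cache_size / block_size);  /* 5 */
    unsigned tag_bits    = addr_bits - offset_bits - index_bits; /* 6 */

    printf("offset = %u, index = %u, tag = %u\n",
           offset_bits, index_bits, tag_bits);
    return 0;
}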

Difference between cache way and cache set

I am trying to learn some stuff about caches. Let's say I have a 4-way 32KB cache and 1GB of RAM. Each cache line is 32 bytes. So, I understand that the RAM will be split up into 256 4096KB pages, each one mapped to a cache set, which contains 4 cache lines.
How many cache ways do I have? I am not even sure what a cache way is. Can someone explain that? I have done some searching; the best example was
http://download.intel.com/design/intarch/papers/cache6.pdf
But I am still confused.
Thanks.
The cache you are referring to is known as a set-associative cache. The whole cache is divided into sets, and each set contains 4 cache lines (hence "4-way" cache). So the relationship is:
cache size = number of sets in cache * number of cache lines in each set * cache line size
Your cache size is 32KB, it is 4-way, and the cache line size is 32B. So the number of sets is
(32KB / (4 * 32B)) = 256
If we think of the main memory as consisting of cache-line-sized chunks, then each memory region of one cache line size is called a block. So each block of main memory will be mapped to a cache line (but not always to one particular cache line, as this is a set-associative cache).
In a set-associative cache, each memory block is mapped to a fixed set in the cache, but it can be stored in any of the cache lines of that set. In your example, each memory block can be stored in any of the 4 cache lines of a set.
Memory block to cache line mapping
Number of blocks in main memory = (1GB / 32B) = 2^25
Number of blocks in each page = (4KB / 32B) = 128
Each byte address in the system can be divided into 3 parts:
Rightmost bits represent byte offset within a cache line or block
Middle bits represent to which cache set this byte(or cache line) will be mapped
Leftmost bits represent tag value
Bits needed to represent 1GB of memory = 30 (1GB = (2^30)B)
Bits needed to represent offset in cache line = 5 (32B = (2^5)B)
Bits needed to represent 256 cache sets = 8 (2^8 = 256)
So that leaves us with (30 - 5 - 8) = 17 bits for tag. As different memory blocks can be mapped to same cache line, this tag value helps in differentiating among them.
When an address is generated by the processor, the 8 middle bits of the 30-bit address are used to select the cache set. There will be 4 cache lines in that set, so the tags of all four resident cache lines are checked against the tag of the generated address for a match.
Example
If a 30-bit address is 00000000000000000-00000100-00010 ('-' separated for clarity), then
offset within the cache line is 2
set number is 4
tag is 0
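In code, the same split can be sketched in C like this (variable names are ours; field positions follow the figures above: 17-bit tag, 8-bit set, 5-bit offset):

#include <stdio.h>

int main(void)
{
    unsigned addr = (0u << 13) | (4u << 5) | 2u;  /* tag=0, set=4, offset=2 */

    unsigned offset = addr & 0x1F;         /* low 5 bits: byte within line */
    unsigned set    = (addr >> 5) & 0xFF;  /* next 8 bits: cache set       */
    unsigned tag    = addr >> 13;          /* remaining 17 bits: tag       */

    printf("tag = %u, set = %u, offset = %u\n", tag, set, offset);
    return 0;
}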
In their "Computer Organization and Design, the Hardware-Software Interface", Patterson and Hennessy talk about caches. For example, in this version, page 408 shows the following image (I have added blue, red, and green lines):
Apparently, the authors use only the term "block" (and not "line") when they describe set-associative caches. In a direct-mapped cache, the "index" part of the address addresses the line; in a set-associative cache, it indexes the set.
This visualization should fit well with @Soumen's explanation in the accepted answer.
However, the book mainly describes Reduced Instruction Set Architectures (RISC). I am personally aware of MIPS and RISC-V versions. So, if you have an x86 in front of you, take this picture with a grain of salt, more as a concept visualization than as actual implementation.
If we divide the memory into cache-line-sized chunks (i.e. 32B chunks of memory), each of these chunks is called a block. Now, when you try to access some memory address, the whole memory block (of size 32B) containing that address will be placed into a cache line.
No, each set is not responsible for 4096KB or one particular memory page. Multiple memory blocks from different memory pages can be mapped to the same cache set.

How much memory do I need to have for 100 million records

How much memory do I need to load 100 million records into memory? Suppose each record needs 7 bytes. Here is my calculation:
each record = <int> <short> <byte>
4 + 2 + 1 = 7 bytes
needed memory in GB = 7 * 100,000,000 / 1,000,000,000 = 0.7 GB
Do you see any problem with this calculation?
With 100,000,000 records, you need to allow for overhead. Exactly what and how much overhead you'll have will depend on the language.
In C/C++, for example, fields in a structure or class are aligned on specific boundaries. Details may vary depending on the compiler, but in general ints must begin at an address that is a multiple of 4, shorts at a multiple of 2, and chars can begin anywhere.
So assuming that your 4+2+1 means an int, a short, and a char, then if you arrange them in that order the structure will take 7 bytes, but at the very minimum the next instance of the structure must begin at a 4-byte boundary, so you'll have 1 pad byte in between. (I think, in fact, most C compilers require structs as a whole to begin at an 8-byte boundary, though in this case that doesn't matter.)
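A quick way to see this is to let the compiler report the padded size (a minimal C sketch; exact sizes are compiler- and ABI-dependent):

#include <stdio.h>

struct record {
    int   a;  /* 4 bytes, must be 4-byte aligned */
    short b;  /* 2 bytes */
    char  c;  /* 1 byte  */
};            /* + 1 pad byte so the next record's int stays aligned */

int main(void)
{
    printf("sizeof(struct record) = %zu\n", sizeof(struct record)); /* typically 8 */
    printf("100M records = %zu bytes\n", (size_t)100000000 * sizeof(struct record));
    return 0;
}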
Every time you allocate memory, there's some overhead for the allocation block. The allocator has to keep track of how much memory was allocated, and sometimes where the next block is. If you allocate all 100,000,000 records as one big "new" or "malloc", this overhead should be trivial. But if you allocate each one individually, then each record carries the overhead. Exactly how much that is depends on the implementation, but on one system I used it was 8 bytes per allocation. If that's the case, then here you'd need 16 bytes for each record: 8 bytes for the block header, 7 for data, 1 for padding. So it could easily take double what you expect.
Other languages will have different overhead. The easiest thing to do is probably to find out empirically: look up the system call that reports how much memory you're using, check its value, allocate a million instances, check it again, and see the difference.
If you really need just 7 bytes per structure, then you are almost right.
For memory measurements we usually use factors of 1024, so you would need
700,000,000 B / 1024² = 667.57 MiB = 0.652 GiB

Which is faster to process a 1TB file: a single machine or 5 networked machines?

Which is faster to process a 1TB file: a single machine or 5 networked machines? ("To process" refers to finding the single UTF-16 character with the most occurrences in that 1TB file.) The rate of data transfer is 1Gbit/sec, the entire 1TB file resides on 1 computer, and each computer has a quad-core CPU.
Below is my attempt at the question, using an array of longs (with array size 2^16) to keep track of the character counts. This should fit into the memory of a single machine, since 2^16 x 2^3 (size of a long) = 2^19 B = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency times cited by Jeff Dean, and I tried my best to use the best approximations I knew of. The final answers are:
Single Machine: 5.8 hrs (due to slowness of reading from disk)
5 Networked Machines: 7.64 hrs (due to reading from disk and network)
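For concreteness, the counting pass described in step 1b below can be sketched in C as follows (the file name and buffer size are illustrative):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    static uint64_t count[1 << 16];  /* one counter per code unit, 0.5 MB */
    static uint16_t buf[1 << 15];    /* 64 KB read buffer                 */
    size_t n;

    FILE *f = fopen("input-1tb.bin", "rb");
    if (!f) return 1;
    while ((n = fread(buf, sizeof buf[0], 1 << 15, f)) > 0)
        for (size_t i = 0; i < n; i++)
            count[buf[i]]++;         /* step b: overlaps the disk reads */
    fclose(f);

    unsigned best = 0;               /* step c: scan the array for the max */
    for (unsigned c = 1; c < (1 << 16); c++)
        if (count[c] > count[best]) best = c;
    printf("most frequent code unit: U+%04X\n", best);
    return 0;
}

The timing estimates below assume this single sequential pass over the data.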
1) Single Machine
a) Time to Read File from Disk --> 5.8 hrs
-If it takes 20ms to read 1MB seq from disk,
then to read 1TB from disk takes:
20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs
= 350 mins = 5.8 hrs
b) Time needed to fill array w/complete count data
--> 0 sec since it is computed while doing step 1a
-At 0.5 MB, the count array fits into L2 cache.
Since L2 cache takes only 7 ns to access,
the CPU can read & write to the count array
while waiting for the disk read.
Time: 0 sec since it is computed while doing step 1a
c) Iterate thru entire array to find max count --> 0.00625ms
-Since it takes 0.0125ms to read & write 1MB from
L2 cache and array size is 0.5MB, then the time
to iterate through the array is:
0.0125ms/MB x 0.5MB = 0.00625ms
d) Total Time
Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
2) 5 Networked Machines
a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
1TB x 1024GB/TB x 8bits/B x 1s/Gbit
= 8,192s = 137m = 2.3hr
But since the original machine keeps a fifth of the data, it
only needs to send (4/5)ths of data, so the time required is:
2.3 hr x 4/5 = 1.84 hrs
*But to send the data, the data needs to be read, which
is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
So total time = 1.84hrs + 4.64 hrs = 6.48 hrs
b) Time to fill array w/count data from original machine --> 1.16 hrs
-The original machine (that had the 1TB file) still needs to
read the remainder of the data in order to fill the array with
count data. So this requires (1/5)(answer 1a)=1.16 hrs.
The CPU time to read & write to the array is negligible, as
shown in 1b.
c) Time to fill other machine's array w/counts --> not counted
-As the file is being transferred, the count array can be
computed. This time is not counted.
d) Time required to receive 4 arrays --> (2^-6)s
-Each count array is 0.5MB
0.5MB x 4 arrays x 8bits/B x 1s/Gbit
= 2^19B x 2^2 x 2^3 bits/B x 1s/2^30bits
= 2^24/2^30 s = (2^-6)s
e) Time to merge arrays
--> 0 sec (since they can be merged while receiving)
f) Total time
Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
This is not an answer, but just a longer comment. You have miscalculated the size of the frequency array. A 1 TiB file contains 550 Gsyms, and because nothing is said about their expected frequency, you would need a count array of at least 64-bit integers (that is, 8 bytes/element). The total size of this frequency array would be 2^16 * 8 = 2^19 bytes, or just 512 KiB, not 4 GiB as you have miscalculated. It would only take ≈4.3 ms to send this data over a 1 Gbps link (protocol headers take roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but they are not widely supported). Also, this array size fits perfectly in the CPU cache.
You have grossly overestimated the time it would take to process the data and extract the frequencies, and you have also overlooked the fact that this can overlap the disk reads. In fact, updating the frequency array, which resides in the CPU cache, is so fast that the computation time is negligible, as most of it will overlap the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU, you still have only one path to the hard drive, so you would still need the full 5.8 hrs to read the data in the single-machine case.
In fact, this is an example of the kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage that can deliver many GB/s of aggregate read/write speed.
You only need to send 0.8 TB if your source machine is part of the 5.
It may not even make sense to send the data to other machines. Consider this:
In order for the source machine to send the data, it must first hit the disk to read the data into main memory before sending it over the network. If the data is already in main memory and not being processed, you are wasting that opportunity.
So, under the assumption that loading into CPU cache is much less expensive than disk-to-memory or data over the network (which is true, unless you are dealing with alien hardware), you are better off just doing it on the source machine, and the only place splitting up the task makes sense is if the "file" is somehow created/populated in a distributed way to start with.
So you should only count the disk read time of a 1 TB file, with a tiny bit of overhead for L1/L2 cache and CPU ops. The cache access pattern is optimal since it is sequential, so you only cache-miss once per piece of data.
The primary point here is that disk is the primary bottleneck which overshadows everything else.

CUDA - Blocks and Threads

I have implemented a string-matching algorithm on the GPU. The search time of the parallel version has decreased considerably compared with the sequential version of the algorithm, but with different numbers of blocks and threads I get different results.
How can I determine the number of blocks and threads to get the best results?
I think this question is hard, if not impossible, to answer, because it really depends on the algorithm and how it operates. Since I can't see your implementation, I can give you some leads:
Don't use global memory; check how you can max out the use of shared memory. Generally, get a good feel for how threads access memory and how data is retrieved, etc.
Understand how your warps operate. Sometimes threads in a warp may wait for other threads to finish if you have a 1-to-1 mapping between thread and data. So instead of this 1-to-1 mapping, you can map threads to multiple data items so that they are kept busy.
Since blocks consist of threads that are grouped into warps of 32 threads, it is best if the number of threads in a block is a multiple of 32, so that you don't get warps consisting of, say, 3 threads.
Avoid divergent paths in warps.
I hope it helps a bit.
@Chris's points are very important too, but depend more on the algorithm itself.
Check the CUDA manual about thread alignment regarding memory lookups. Shared memory arrays should also be sized as a multiple of 16.
Use coalesced global memory reads. But by algorithm design this is often the case, and using shared memory helps.
Don't use atomic operations in global memory or at all if possible. They are very slow. Some algorithms using atomic operations can be rewritten using different techniques.
Without the code shown, no one can tell you what is best or why performance changes.
The number of threads per block of your kernel is the most important value.
Important values to calculate that value are:
Maximum number of resident threads per multiprocessor
Maximum number of resident blocks per multiprocessor
Maximum number of threads per block
Number of 32-bit registers per multiprocessor
Your algorithm should be scalable across all GPUs, reaching 100% occupancy. For this, I created a helper class which automatically detects the best thread count for the GPU in use and passes it to the kernel as a DEFINE.
/**
* Number of Threads in a Block
*
* Maximum number of resident blocks per multiprocessor : 8
*
* ///////////////////
* Compute capability:
* ///////////////////
*
* Cuda [1.0 - 1.1] =
* Maximum number of resident threads per multiprocessor 768
* Optimal Usage: 768 / 8 = 96
* Cuda [1.2 - 1.3] =
* Maximum number of resident threads per multiprocessor 1024
* Optimal Usage: 1024 / 8 = 128
* Cuda [2.x] =
* Maximum number of resident threads per multiprocessor 1536
* Optimal Usage: 1536 / 8 = 192
*/
public static int BLOCK_SIZE_DEF = 96;
Example: Cuda 1.1, to reach 768 resident threads per SM
8 Blocks * 96 Threads per Block = 768 threads
3 Blocks * 256 Threads per Block = 768 threads
1 Block * 512 Threads per Block = 512 threads <- 33% of the GPU will be idle
This is also mentioned in the book:
Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)
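As a cross-check, the "optimal usage" arithmetic from the comment above can be written as a minimal C sketch (the helper name is ours; the per-capability limits are taken from that comment):

#include <stdio.h>

/* Divide the maximum resident threads per SM by the maximum resident
   blocks per SM, rounded down to a multiple of the 32-thread warp. */
static int block_size_def(int max_threads_per_sm, int max_blocks_per_sm)
{
    int bs = max_threads_per_sm / max_blocks_per_sm;
    return (bs / 32) * 32;
}

int main(void)
{
    printf("CC 1.0-1.1: %d\n", block_size_def(768, 8));  /*  96 */
    printf("CC 1.2-1.3: %d\n", block_size_def(1024, 8)); /* 128 */
    printf("CC 2.x:     %d\n", block_size_def(1536, 8)); /* 192 */
    return 0;
}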
Good programming advice:
Analyse your kernel code and write down the maximal number of threads it can handle or how many "units" it can process.
Also output your register usage and try to lower it to meet the targeted CUDA version, because if you use too many registers in your kernel, fewer blocks will be executed, resulting in lower occupancy and performance.
Example: Using Cuda 1.1 and the optimal number of 768 resident threads per SM, you have 8192 registers to use. This leads to 8192 / 768 = 10 maximum registers per thread/kernel. If you use 11, the GPU will run 1 block less, resulting in decreased performance.
Example: A matrix independent row vector normalizing kernel of mine.
/*
* ////////////////////////
* // Compute capability //
* ////////////////////////
*
* Used 12 registers, 540+16 bytes smem, 36 bytes cmem[1]
* Used 10 registers, 540+16 bytes smem, 36 bytes cmem[1] <-- with -maxregcount 10 Limit for Cuda 1.1
* I: Maximum number of Rows = max(x-dim)^max(dimGrid)
* II: Maximum number of Columns = unlimited, since they are loaded in a tile loop
*
* Cuda [1.0 - 1.3]:
* I: 65535^2 = 4.294.836.225
*
* Cuda [2.0]:
* II: 65535^3 = 281.462.092.005.375
*/
