SiteMesh takes more time with larger page sizes

Using SiteMesh version 2.4.2:
It takes 10 ms when the page size is 50 KB;
30 ms when the page size is 100 KB;
60 ms when the page size is 300 KB.
How can I reduce the time spent as the page size grows?

Related

Reduce P99 response-time latency in a low-TPS application - JBoss application server

I'm looking for ways to reduce latency / high response times at P99. The application runs on the JBoss application server. The current configuration of the system is 0.5 core and 2 GB of memory.
I suspect low TPS might be the reason for the higher P99s, because current usage of the application at peak traffic is 0.25 core, averaging about 0.025 core. Old-gen GC pauses are running at about 1 s. Heap settings are -Xmx1366m -Xms512m, with Metaspace at 250 MB.
Right now we have the Parallel GC policy; will the G1GC policy help?
What else should I consider?

Finding average memory access time (AMAT) and global miss rate

I'm quite confused about this question. I have IL1, DL1 and UL2. When I try to find AMAT, do I use the formula AMAT = Hit Time(1) + Miss Rate * (Hit Time(2) + Miss Rate * Miss Penalty), or do I also add Hit Time(3) because there are 3 miss rates?
For example: 0.4 + 0.1 * (0.8 + 0.05 * (10 + 0.02 * 48))
I used AMAT = Hit Time(1) + Miss Rate * (Hit Time(2) + Miss Rate * (Hit Time(3) + Miss Rate * Miss Penalty)).
Here is the table; the frequency is 2.5 GHz, and it is also given that 20% of all instructions are of load/store type.
By the way, is there also a way to find the global miss rate of UL2 in %? I'm quite stuck on that one too.
There are two different cache hierarchies to consider. I cannot tell from your question whether you're trying to compute AMAT for just data operations (loads & stores) or for instruction accesses plus data operations (20% of them).
The hierarchies:
Instruction Cache: IL1 backed by UL2 backed by Main Memory
Data Cache: DL1 backed by UL2 backed by Main Memory
There is a stated hit time & miss rate associated with each individual cache; this is necessary because the caches are of different construction and size (and also at different positions in the hierarchy).
All instructions participate in accessing the instruction cache, so a hit/miss there applies to every instruction regardless of its nature or type. So you can compute the AMAT for instruction access alone using the IL1 -> UL2 -> Main Memory hierarchy; be sure to use the specific hit time and miss rate for each level: 1 clk & 10% for IL1, 25 clk & 2% for UL2, and 120 clk & 0% for Main Memory.
20% of the instructions also access the data cache.
For those data accesses, you can compute that component of AMAT using the DL1 -> UL2 -> Main Memory hierarchy: here you have DL1 with 2 clk & 5%, UL2 with 25 clk & 2%, and Main Memory with 120 clk & 0%.
These numbers can be combined into an overall value that accounts for 100% of the instructions incurring the instruction-cache-hierarchy AMAT and 20% of them additionally incurring the data-cache-hierarchy AMAT.
If needed, you can convert AMAT from cycles/clocks to (nano)seconds using the 2.5 GHz clock.
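To make that combination concrete, here is a small Python sketch using only the numbers quoted in this answer (1 clk / 10% for IL1, 2 clk / 5% for DL1, 25 clk / 2% for UL2, 120 clk for main memory, 20% loads/stores, 2.5 GHz clock). Treat it as an illustration of the telescoping AMAT formula, not the only way to set it up.

# AMAT sketch using the figures quoted above (illustrative only).
CLOCK_GHZ = 2.5              # given: 2.5 GHz clock
LOAD_STORE_FRACTION = 0.20   # given: 20% of instructions are loads/stores

def amat(levels, memory_latency_cycles):
    # levels: list of (hit_time_cycles, miss_rate) from L1 downward.
    # Telescoping form: AMAT = hit_time + miss_rate * (AMAT of the next level)
    result = memory_latency_cycles
    for hit_time, miss_rate in reversed(levels):
        result = hit_time + miss_rate * result
    return result

amat_instr = amat([(1, 0.10), (25, 0.02)], 120)   # IL1 -> UL2 -> main memory
amat_data = amat([(2, 0.05), (25, 0.02)], 120)    # DL1 -> UL2 -> main memory

# Every instruction pays the instruction-fetch AMAT; 20% also pay the data AMAT.
amat_overall = amat_instr + LOAD_STORE_FRACTION * amat_data

print(f"instruction AMAT: {amat_instr:.2f} cycles")
print(f"data AMAT:        {amat_data:.2f} cycles")
print(f"overall: {amat_overall:.2f} cycles = {amat_overall / CLOCK_GHZ:.2f} ns at {CLOCK_GHZ} GHz")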

Increase the CPU speed of only one core

I would like to increase the speed of only 1 CPU core and reduce all other CPU core frequencies.
For this I have used the following steps:
disabled intel_pstate and enabled acpi
set CPU3 governor to performance and CPU0,1,2 to powersave
$ sudo cpufreq-set -c 3 -g performance
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
powersave
powersave
powersave
performance
changed the lower and upper frequency limits for each CPU so that only CPU3 runs at a higher frequency:
$ cpufreq-info | grep current
current policy: frequency should be within 1.20 GHz and 1.50 GHz.
current CPU frequency is 2.39 GHz.
current policy: frequency should be within 1.20 GHz and 1.50 GHz.
current CPU frequency is 2.55 GHz.
current policy: frequency should be within 1.20 GHz and 1.50 GHz.
current CPU frequency is 2.58 GHz.
current policy: frequency should be within 1.20 GHz and 2.20 GHz.
current CPU frequency is 2.20 GHz.
Unfortunately, I observe that all CPU core frequencies remain similar: they either all get set to around 800 MHz (when I set one core's governor to powersave) or all ranges get set to 2.20 GHz (when I set core 3's governor to performance).
I want to know whether it is even possible to raise one core's frequency while reducing the frequency of all the other cores. I came across this thread, which says my laptop may be doing symmetric multiprocessing and what I am looking for is asymmetric multiprocessing. Can this be set in the BIOS, or is there another method to selectively increase the frequency of only one core?
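For what it's worth, per-core limits can at least be requested through the cpufreq sysfs interface (the same knobs cpufreq-set manipulates). Below is a minimal Python sketch, run as root, that pins one core's scaling_min_freq/scaling_max_freq high and caps the others. Whether the hardware honours the request depends on the driver, turbo/boost behaviour, and firmware, which is exactly what appears to be overriding things here; the core index and frequencies are placeholders.

# Sketch: request per-core frequency limits via the cpufreq sysfs interface.
# Requires root; the driver/firmware may still override these requests.
import glob
import os
import re

FAST_CORE = 3          # placeholder: the core to keep fast
FAST_KHZ = 2_200_000   # placeholder: 2.20 GHz expressed in kHz
SLOW_KHZ = 1_500_000   # placeholder: 1.50 GHz cap for the remaining cores

def set_limits(cpufreq_dir, min_khz, max_khz):
    # scaling_min_freq / scaling_max_freq take values in kHz;
    # write the max first to avoid a transient min > max.
    with open(os.path.join(cpufreq_dir, "scaling_max_freq"), "w") as f:
        f.write(str(max_khz))
    with open(os.path.join(cpufreq_dir, "scaling_min_freq"), "w") as f:
        f.write(str(min_khz))

for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq")):
    cpu = int(re.search(r"cpu(\d+)", path).group(1))
    if cpu == FAST_CORE:
        set_limits(path, FAST_KHZ, FAST_KHZ)       # pin the chosen core high
    else:
        set_limits(path, SLOW_KHZ // 2, SLOW_KHZ)  # cap the other cores
    print(f"cpu{cpu}: limits written")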

Why is the CPU slower for calculations than the GPU when only memory should matter?

A modern CPU has an ethash hash rate of under 1 MH/s (source: https://ethereum.stackexchange.com/questions/2325/is-cpu-mining-even-worth-the-ether ), while GPUs easily mine at over 20 MH/s. With overclocked memory they reach rates of up to 30 MH/s.
GPUs have GDDR memory with clock rates of about 1000 MHz, while DDR4 runs at higher clock speeds. The bandwidth of DDR4 also seems to be higher (sources: http://www.corsair.com/en-eu/blog/2014/september/ddr3_vs_ddr4_synthetic and https://en.wikipedia.org/wiki/GDDR5_SDRAM )
It is said that for Dagger-Hashimoto/ethash, memory bandwidth is what matters (which also matches experience from overclocking GPUs). I find that reasonable, since the CPU/GPU only has to do 2x SHA-3 (1x Keccak-256 + 1x Keccak-512) operations (source: https://github.com/ethereum/wiki/wiki/Ethash#main-loop ).
A modern Skylake processor can compute over 100M Keccak-512 operations per second (see here: https://www.cryptopp.com/benchmarks.html ), so the core-count difference between GPUs and CPUs should not be the problem.
But why don't we get about ~50 MH/s from the 2x Keccak operations plus memory loads on a CPU?
See http://www.nvidia.com/object/what-is-gpu-computing.html for an overview of the differences between CPU and GPU programming.
In short, a CPU has a very small number of cores, each of which can do different things, and each of which can handle very complex logic.
A GPU has thousands of cores that operate pretty much in lockstep but can only handle simple logic.
Therefore the overall processing throughput of a GPU can be massively higher. But it isn't easy to move logic from the CPU to the GPU.
If you want to dive in deeper and actually write code for both, one good starting place is https://devblogs.nvidia.com/gpu-computing-julia-programming-language/.
"A modern Skylake processor can compute over 100M Keccak-512 operations per second" is incorrect; the benchmark figure is 140 MiB/s. That is mebibytes per second, and a hash operation processes more than one byte, so you need to divide the 140 MiB/s by the number of bytes being hashed.
I found an article addressing my problem (the influence of memory on the algorithm).
It's not only the computation problem (mentioned here: https://stackoverflow.com/a/48687460/2298744 ); it's also the memory bandwidth that bottlenecks the CPU.
As described in the article, every round fetches 8 KB of data for the calculation. This results in the following formula:
(Memory Bandwidth) / (DAG memory fetched per hash) = Max Theoretical Hashrate
(Memory Bandwidth) / (8 KB / hash) = Max Theoretical Hashrate
For a graphics card like the RX 470 mentioned, this results in:
(211 gigabytes / sec) / (8 kilobytes / hash) = ~26 Mhashes/sec
While for CPUs with DDR4 this results in:
(12.8 GB / sec) / (8 kilobytes / hash) = ~1.6 Mhashes/sec
or (depending on the clock speed of the RAM)
(25.6 GB / sec) / (8 kilobytes / hash) = ~3.2 Mhashes/sec
To sum up, a CPU, or even a GPU with DDR4 RAM, could not get more than ~3.2 MHash/s, since it cannot fetch the data it needs fast enough.
Source:
https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memory-hardness-explained/
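The same arithmetic as a short Python sketch, using only the bandwidth figures quoted above (8 kB per hash is the figure from the article; decimal kilobytes are used to match the numbers in the answer):

# Max theoretical ethash rate = memory bandwidth / DAG data fetched per hash.
DAG_BYTES_PER_HASH = 8_000   # ~8 kB per hash, decimal kilobytes as used above

def max_hashrate_mh(bandwidth_gb_s):
    # bandwidth in GB/s -> bytes/s, divided by bytes fetched per hash, in MH/s
    return bandwidth_gb_s * 1e9 / DAG_BYTES_PER_HASH / 1e6

for name, gb_s in [("RX 470 (GDDR5, 211 GB/s)", 211),
                   ("DDR4 (12.8 GB/s)", 12.8),
                   ("DDR4 (25.6 GB/s)", 25.6)]:
    print(f"{name}: ~{max_hashrate_mh(gb_s):.1f} MH/s")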

Average instruction time

Let's say we have an average of one page fault every 20,000,000 instructions, a normal instruction takes 2 nanoseconds, and a page fault causes the instruction to take an additional 10 milliseconds. What is the average instruction time, taking page faults into account?
Out of 20,000,000 instructions, one of them will page-fault.
Therefore, the 20,000,000 instructions will take
(2 nanoseconds * 20,000,000) + 10 milliseconds
Take that result (which is the total time for 20,000,000 instructions) and divide it by the number of instructions to get the time per instruction.
What is the average instruction time, taking page faults into account?
The average instruction time is the total time, divided by the number of instructions.
So: what's the total time for 20,000,000 instructions?
2.5 nanoseconds? Pretty simple arithmetic, I guess.
If 1 in 20,000,000 instructions causes a page fault then you have a page fault rate of:
Page Fault Rate = (1/20000000)
You can then calculate your average time per instruction:
Average Time = (1 - Page Fault Rate) * 2 ns + (Page Fault Rate * 10 ms)
Comes to 2.5 ns / instruction
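As a quick check, the arithmetic from the first answer (total time divided by instruction count) in Python:

# Average instruction time including page faults, per the numbers in the question.
INSTRUCTIONS = 20_000_000
NORMAL_NS = 2              # a normal instruction takes 2 ns
FAULT_EXTRA_NS = 10e6      # a page fault adds 10 ms = 10,000,000 ns

total_ns = INSTRUCTIONS * NORMAL_NS + FAULT_EXTRA_NS   # one fault per 20,000,000 instructions
print(total_ns / INSTRUCTIONS)   # 2.5 ns per instruction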
