I'm writing a data mover in Go: it takes data located in one data center and moves it to another data center. I figured Go would be perfect for this, given goroutines.
I notice that if I have one program running 1800 goroutines, the amount of data being transmitted is really low.
Here's the dstat printout, averaged over 30 seconds:
---load-avg--- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
1m 5m 15m |usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0.70 3.58 4.42| 10 1 89 0 0 0| 0 156k|7306k 6667k| 0 0 | 11k 6287
0.61 3.28 4.29| 12 2 85 0 0 1| 0 6963B|8822k 8523k| 0 0 | 14k 7531
0.65 3.03 4.18| 12 2 86 0 0 1| 0 1775B|8660k 8514k| 0 0 | 13k 7464
0.67 2.81 4.07| 12 2 86 0 0 1| 0 1638B|8908k 8735k| 0 0 | 13k 7435
0.67 2.60 3.96| 12 2 86 0 0 1| 0 819B|8752k 8385k| 0 0 | 13k 7445
0.47 2.37 3.84| 11 2 86 0 0 1| 0 2185B|8740k 8491k| 0 0 | 13k 7548
0.61 2.22 3.74| 10 2 88 0 0 0| 0 1229B|7122k 6765k| 0 0 | 11k 6228
0.52 2.04 3.63| 3 1 97 0 0 0| 0 546B|1999k 1365k| 0 0 |3117 2033
If I run 9 instances of the program with 200 goroutines each, I see much better performance:
---load-avg--- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
1m 5m 15m |usr sys idl wai hiq siq| read writ| recv send| in out | int csw
8.34 9.56 8.78| 53 8 36 0 0 3| 0 410B| 38M 32M| 0 0 | 41k 26k
8.01 9.37 8.74| 74 10 12 0 0 4| 0 137B| 51M 51M| 0 0 | 59k 39k
8.36 9.31 8.74| 75 9 12 0 0 4| 0 1092B| 51M 51M| 0 0 | 59k 39k
6.93 8.89 8.62| 74 10 12 0 0 4| 0 5188B| 50M 49M| 0 0 | 59k 38k
7.09 8.73 8.58| 75 9 12 0 0 4| 0 410B| 51M 50M| 0 0 | 60k 39k
7.40 8.62 8.54| 75 9 12 0 0 4| 0 137B| 52M 49M| 0 0 | 61k 40k
7.96 8.63 8.55| 75 9 12 0 0 4| 0 956B| 51M 51M| 0 0 | 59k 39k
7.46 8.44 8.49| 75 9 12 0 0 4| 0 273B| 51M 50M| 0 0 | 58k 38k
8.08 8.51 8.51| 75 9 12 0 0 4| 0 410B| 51M 51M| 0 0 | 59k 39k
The load average is a little high, but I'll worry about that later. The network traffic, though, is close to saturating the network.
I'm on Ubuntu 12.04,
8 GB of RAM,
2.3 GHz processors (so says EC2 :P).
Also, I've increased my file descriptor limit from 1024 to 10240.
I thought Go was designed for this kind of thing, or am I expecting too much of Go for this application?
Is there something trivial that I'm missing? Do I need to configure my system to maximize Go's potential?
EDIT
I guess my question wasn't clear enough. Sorry. I'm not asking for magic from Go; I know computers have limits on what they can handle.
So I'll rephrase: why is 1 instance with 1800 goroutines != 9 instances with 200 goroutines each? It's the same number of goroutines, yet significantly less performance for 1 instance compared to 9 instances.
Please note that goroutines are limited to your local machine and that channels are not natively network-enabled, i.e. your particular case is probably not hitting Go's sweet spot.
Also: what did you expect from throwing (supposedly) every transfer into its own goroutine? I/O operations tend to have their bottleneck where the bits hit the metal, i.e. the physical transfer of the data to the medium. Think of it like this: no matter how many threads (or goroutines, in this case) try to write to the network card, you still only have one network card. Most likely, hitting it with too many concurrent write calls will only slow things down, since the overhead involved increases.
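One common way to act on that observation is to cap how many transfers are in flight at once, no matter how many items there are. This is only a sketch under assumptions (the work list and the copyOne function are hypothetical stand-ins for your actual transfer code); a buffered channel acts as a counting semaphore:

package main

import "sync"

// copyOne is a hypothetical stand-in for whatever moves a single
// object between the two data centers.
func copyOne(item string) {
    // dial, read, write ...
    _ = item
}

func main() {
    items := []string{"a", "b", "c"} // hypothetical work list
    const maxInFlight = 200          // cap on concurrent transfers

    sem := make(chan struct{}, maxInFlight) // counting semaphore
    var wg sync.WaitGroup

    for _, it := range items {
        wg.Add(1)
        sem <- struct{}{} // blocks while maxInFlight transfers are already active
        go func(it string) {
            defer wg.Done()
            defer func() { <-sem }()
            copyOne(it)
        }(it)
    }
    wg.Wait()
}

Tuning maxInFlight is then a one-line change, which makes it easy to find the point where more concurrency stops helping.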
If you think this is not the problem, or you want to audit your code for performance, Go has neat built-in features for that: profiling Go programs (official Go blog).
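For reference, a minimal CPU-profiling sketch along the lines of that blog post, using the standard runtime/pprof package (the output file name is arbitrary):

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.prof")
    if err != nil {
        log.Fatal(err)
    }
    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // ... run the data mover workload here ...
}

The resulting profile can then be inspected with go tool pprof.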
But still, the actual bottleneck might well be outside your Go program and/or in the way it interacts with the OS.
Addressing your actual problem without code is pointless guessing. Post some and everyone will try their best to help you.
You will probably have to post your source code to get any real input, but just to be sure: have you increased the number of CPUs to use?
import "runtime"
func main() {
runtime.GOMAXPROCS(runtime.NumCPU())
}
We are using Cassandra 3.0.3 on AWS with 6 r3.xlarge machines (64 GB RAM, 16 cores each); the 6 machines span 2 data centers, but this particular keyspace is replicated in only one DC, and therefore on 3 nodes. We write about 300M rows into Cassandra as a weekly sync.
While loading the data, the load average shoots up to as much as 34 on one machine with 100% CPU utilization (in this case a lot of data gets rewritten); we expected it to be slow, but the performance degradation is dramatic on one of the nodes.
A snapshot of the load averages on the machines:
On Overloaded Machine:
27.47, 29.78, 30.06
On other two:
2.65, 3.95, 4.59
3.76, 2.52, 2.50
nodetool status output:
Overloaded Node:
UN 10.21.56.21 65.94 GB 256 38.7% 57f35206-f264-44ec-b588-f72883139f69 rack1
Other two Nodes:
UN 10.21.56.20 56.34 GB 256 31.9% 2b29f85c-c783-4e20-8cea-95d4e2688550 rack1
UN 10.21.56.23 51.29 GB 256 29.4% fbf26f1d-1766-4f12-957c-7278fd19c20c rack1
I can see that the SSTable count is also high and that flushed SSTables are ~15 MB in size. The heap size is 8 GB and G1GC is used.
The output of nodetool cfhistograms shows a stark difference between write and read latency, as shown below for one of the larger tables:
| Percentile | SSTables | Write Latency (micros) | Read Latency (micros) | Partition Size (bytes) | Cell Count |
|------------|----------|------------------------|-----------------------|------------------------|------------|
| 50%        | 8        | 20.5                   | 1629.72               | 179                    | 5          |
| 75%        | 10       | 24.6                   | 2346.8                | 258                    | 10         |
| 95%        | 12       | 42.51                  | 4866.32               | 1109                   | 72         |
| 98%        | 14       | 51.01                  | 10090.81              | 3973                   | 258        |
| 99%        | 14       | 61.21                  | 14530.76              | 9887                   | 642        |
| Min        | 0        | 4.77                   | 11.87                 | 104                    | 5          |
| Max        | 17       | 322381.14              | 17797419.59           | 557074610              | 36157190   |
nodetool proxyhistogram output can be found below:
Percentile Read Latency Write Latency Range Latency
(micros) (micros) (micros)
50% 263.21 654.95 20924.30
75% 654.95 785.94 30130.99
95% 1629.72 36157.19 52066.35
98% 4866.32 155469.30 62479.63
99% 7007.51 322381.14 74975.55
Min 6.87 11.87 24.60
Max 12359319.16 30753941.06 63771372.18
One weird thing that I can observe here is that the mutation counts vary by a considerable margin per machine:
MutationStage Pool Completed Total:
Overloaded Node: 307531460526
Other Node1: 77979732754
Other Node2: 146376997379
Here the overloaded node's total is ~4x Other Node1 and ~2x Other Node2. In a well-distributed keyspace with the Murmur3 partitioner, is this scenario expected?
nodetool cfstats output is attached below for reference:
Keyspace: cat-48
Read Count: 122253245
Read Latency: 1.9288832487759324 ms.
Write Count: 122243273
Write Latency: 0.02254735837284069 ms.
Pending Flushes: 0
Table: bucket_distribution
SSTable count: 11
Space used (live): 10149121447
Space used (total): 10149121447
Space used by snapshots (total): 0
Off heap memory used (total): 14971512
SSTable Compression Ratio: 0.637019014259346
Number of keys (estimate): 2762585
Memtable cell count: 255915
Memtable data size: 19622027
Memtable off heap memory used: 0
Memtable switch count: 487
Local read count: 122253245
Local read latency: 2.116 ms
Local write count: 122243273
Local write latency: 0.025 ms
Pending flushes: 0
Bloom filter false positives: 17
Bloom filter false ratio: 0.00000
Bloom filter space used: 9588144
Bloom filter off heap memory used: 9588056
Index summary off heap memory used: 3545264
Compression metadata off heap memory used: 1838192
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 557074610
Compacted partition mean bytes: 2145
Average live cells per slice (last five minutes): 8.83894307680672
Maximum live cells per slice (last five minutes): 5722
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
----------------
I can also observe in nodetool tpstats that, under peak load, the one node that is getting overloaded has pending Native-Transport-Requests:
Overloaded Node:
Native-Transport-Requests 32 11 651595401 0 349
MutationStage 32 41 316508231055 0 0
The other two:
Native-Transport-Requests 0 0 625706001 0 495
MutationStage 0 0 151442471377 0 0
Native-Transport-Requests 0 0 630331805 0 219
MutationStage 0 0 78369542703 0 0
I have also checked nodetool compactionstats; the output is 0 most of the time, and at the times when compaction is happening, the load does not increase alarmingly.
Traced it down to an issue with the data model and a kernel bug that was not patched in the kernel we used.
Some partitions in the data we were writing were large, which caused an imbalance in the write requests; since RF is 1, one server appeared to be under heavy load.
The kernel issue is described in detail here (in brief, it affects Java apps that use park/wait): datastax blog
This is fixed by a Linux commit.
I am new to Hadoop and learning Apache Flume. I installed CDH 4.7 on VirtualBox. The command below outputs the top cputime. How can I transfer the log output of this command to HDFS using Apache Flume? How do I create the Flume configuration file?
user@computer-Lenovo-IdeaPad-S510p:$ dstat -ta --top-cputime
----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- --highest-total--
time |usr sys idl wai hiq siq| read writ| recv send| in out | int csw | cputime process
27-02 13:14:32| 6 5 87 1 0 0| 216k 235k| 0 0 | 0 11B| 11k 2934 |X 29
27-02 13:14:33| 1 7 93 0 0 0| 64k 176k| 0 0 | 0 0 | 38k 3194 |X 8650
27-02 13:14:34| 2 11 87 0 0 0| 512B 188k| 0 0 | 0 0 | 24k 2612 | --enable-cra 11
27-02 13:14:35| 2 13 85 0 0 0| 45k 56k| 0 0 | 0 0 | 22k 2432 |X 11
27-02 13:14:36| 2 13 85 0 0 0|2093k 0 | 0 0 | 0 0 | 25k 3962 |VirtualBox 12
27-02 13:14:37| 1 4 95 1 0 0| 0 20k| 0 0 | 0 0 | 27k 3126 |VirtualBox 8942
27-02 13:14:38| 2 7 92 0 0 0| 0 8192B| 0 0 | 0 0 | 21k 3019 |VirtualBox 9082
27-02 13:14:39| 3 9 88 0 0 0| 512B 168k| 0 0 | 0 0 | 30k 2508 | --enable-cra 16
27-02 13:14:40| 2 13 86 0 0 0| 0 0 | 0 0 | 0 0 | 21k 2433 |VirtualBox 8041
27-02 13:14:41| 1 10 88 0 0 0| 0 0 | 0 0 | 0 0 | 19k 3191 |VirtualBox 10
27-02 13:14:42| 2 7 91 0 0 0| 32k 0 | 0 0 | 0 0 | 23k 2799 |X 8713
27-02 13:14:43| 2 7 90 1 0 0| 0 192k| 0 0 | 0 0 | 39k 2696 |X 10
27-02 13:14:44| 2 11 87 0 0 0| 0 140k| 0 0 | 0 0 | 35k 2434 |VirtualBox 8961
27-02 13:14:45| 2 11 87 0 0 0| 0 0 | 0 0 | 0 0 | 19k 2157 |VirtualBox 8126
27-02 13:14:46| 2 15 83 0 0 0| 182k 0 | 0 0 | 0 0 | 20k 3262 |VirtualBox 13^C
You can use the Flume exec source to collect the log output and the HDFS sink to store it.
The config can look like this:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = dstat -ta --top-cputime
a1.sources.r1.channels = c1
http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
http://flume.apache.org/FlumeUserGuide.html#exec-source
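For completeness, a fuller configuration sketch with the channel and the HDFS sink filled in; the agent/component names, the channel capacity, and the HDFS path are placeholders to adapt to your setup (see the two links above for the full list of properties):

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = dstat -ta --top-cputime
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# placeholder target directory on HDFS
a1.sinks.k1.hdfs.path = /flume/dstat
# write plain text instead of SequenceFiles
a1.sinks.k1.hdfs.fileType = DataStream

The agent can then be started with something like: flume-ng agent --name a1 --conf-file dstat-hdfs.conf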
I was wondering how floppy disk sectors are ordered. I am currently writing a program to access the root directory of a floppy disk (FAT12-formatted, high density). I can load it with DEBUG at sector 13h, but in assembly it is at head 1, track 0, sector 2. Why is sector 13h not at head 0, track 1, sector 1?
That's because the sectors on the other side of the disk come before the sectors on the second track of the first side.
Logical sectors 0 through 17 (11h) are found at head 0, track 0. Logical sectors 18 (12h) through 35 (23h) are found at head 1, track 0.
Logical sectors are numbered from zero up, but the sectors in a track are numbered from 1 to 18 (12h).
logical sector    head   track   physical sector   usage
 (dec)   (hex)                    (dec)   (hex)
------   -----    ----   -----   ------   -----    --------
     0      0h       0       0        1      1h    boot
     1      1h       0       0        2      2h    FAT 1
     2      2h       0       0        3      3h     |
     3      3h       0       0        4      4h     v
     4      4h       0       0        5      5h
     5      5h       0       0        6      6h
     6      6h       0       0        7      7h
     7      7h       0       0        8      8h
     8      8h       0       0        9      9h
     9      9h       0       0       10      ah
    10      ah       0       0       11      bh    FAT 2
    11      bh       0       0       12      ch     |
    12      ch       0       0       13      dh     v
    13      dh       0       0       14      eh
    14      eh       0       0       15      fh
    15      fh       0       0       16     10h
    16     10h       0       0       17     11h
    17     11h       0       0       18     12h
    18     12h       1       0        1      1h
    19     13h       1       0        2      2h    root
    20     14h       1       0        3      3h     |
    21     15h       1       0        4      4h     v
...
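The same mapping can be written as a small conversion routine. Here is a sketch in Go, just to illustrate the arithmetic for the standard 1.44 MB geometry (2 heads, 18 sectors per track, physical sectors numbered from 1):

package main

import "fmt"

// chs converts a 0-based logical sector number into head, track and
// physical sector for a 1.44 MB floppy (2 heads, 18 sectors per track).
func chs(logical int) (head, track, sector int) {
    const sectorsPerTrack = 18
    const heads = 2
    sector = logical%sectorsPerTrack + 1
    head = (logical / sectorsPerTrack) % heads
    track = logical / (sectorsPerTrack * heads)
    return
}

func main() {
    h, t, s := chs(0x13) // logical sector 13h, the start of the root directory
    fmt.Printf("head %d, track %d, sector %d\n", h, t, s) // head 1, track 0, sector 2
}

In assembly you would typically do the same divisions before passing the head/track/sector values to the BIOS disk service (INT 13h).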
So I have done this in both Python and bash, and the code I am about to post probably has a world of things wrong with it, but it is generally very basic and I cannot see a reason why it would cause this 'bug', which I will explain soon. I have done the same in Python, much more professionally and cleanly, and it also produces this error (at some point, the maths generates a negative number, which makes no sense).
#!/bin/bash
while [ 1 ];
do
zero=0
ARRAY=()
ARRAY2=()
first=`command to generate a list of numbers`
sleep 1
second=`command to generate a list of numbers`
# so now we have two data sets, 1 second between the capture of each.
for i in $first;
do
ARRAY+=($i)
done
for i in $second;
do
ARRAY2+=($i)
done
for (( c=$zero; c<=${#ARRAY2[@]}; c++ ))
do
expr ${ARRAY2[$c]} - ${ARRAY[$c]}
done
ARRAY=()
ARRAY2=()
zero=0
c=0
first=``
second=``
math=''
done
So the script grabs a set of data, waits 1 second, grabs it again, and does math on the two sets to get the difference; that difference is printed. It's very simple, and I have done it elegantly in Python too. No matter how I do it, every now and then (it could be anywhere from 3 loops in to 30 loops in) we get negative numbers, like so:
START 0 0 0 0 0 19 10 563 0
-34 19 14 2 0
-1302 1198
-532 639
-1078 1119 1 0 0
-843 33 880 0 5
-8
-13508 8773 4541 988 181
-12
-205 217
-9 7 1
-360 303 60 1 0 0
-12
-96 98 3
-870 904
-130
-2105 2264 6
-3084 1576 1650
-939 971
-2249 1150 1281
-693 9 513 142 76 expr: syntax error
Please help, I simply can't find anything about this.
Sample OUTPUT as requested:
ARRAY1 OUTPUT
1 15 1 25 25 1 2 1 3541 853 94567 42 5 1 351 51 1 11 1 13 7 14 12 3999 983 5 1938 3 8287 40 1 1 1 5253 706 1 1 1 1 5717 3 50 1 85 100376 17334 4655 1 1345 2 1 16 1777 1 3 38 23 8 32 47 781 947 1 1 206 9 1 3 2 81 2602 7 158 1 1 43 91 1 120 6589 6 2534 1092 1 6014 7 2 2 37 1 1 1 80 2 1 1270 15448 66 1 10238 1 10794 16061 4 1 1 1 9754 5617 1123 926 3 24 10 16
ARRAY2 OUTPUT
1 15 1 25 25 1 2 1 3555 859 95043 42 5 1 355 55 1 11 1 13 7 14 12 4015 987 5 1938 3 8335 40 1 1 1 5280 706 1 1 1 1 5733 3 50 1 85 100877 17396 4691 1 1353 2 1 16 1782 1 3 38 23 8 32 47 787 947 1 1 206 9 1 3 2 81 2602 7 159 1 1 43 91 1 120 6869 6 2534 1092 1 6044 7 2 2 37 1 1 1 80 2 1 1270 15563 66 1 10293 1 10804 16134 4 1 1 1 9755 5633 1135 928 3 24 10 16
START
The answer lies in Russell Uhl's comment above. Your loop runs one time too many (this is your code):
for (( c=$zero; c<=${#ARRAY2[@]}; c++ ))
do
expr ${ARRAY2[$c]} - ${ARRAY[$c]}
done
To fix it, you need to change the test condition from c <= ${#ARRAY2[@]} to c < ${#ARRAY2[@]}:
for (( c=$zero; c < ${#ARRAY2[@]}; c++ ))
do
echo $((${ARRAY2[$c]} - ${ARRAY[$c]}))
done
I've also changed the expr to use the arithmetic expansion builtin, $((...)).
The test script (sum.sh):
#!/bin/bash
zero=0
ARRAY=()
ARRAY2=()
first="1 15 1 25 25 1 2 1 3541 853 94567 42 5 1 351 51 1 11 1 13 7 14 12 3999 983 5 1938 3 8287 40 1 1 1 5253 706 1 1 1 1 5717 3 50 1 85 100376 17334 4655 1 1345 2 1 16 1777 1 3 38 23 8 32 47 7
second="1 15 1 25 25 1 2 1 3555 859 95043 42 5 1 355 55 1 11 1 13 7 14 12 4015 987 5 1938 3 8335 40 1 1 1 5280 706 1 1 1 1 5733 3 50 1 85 100877 17396 4691 1 1353 2 1 16 1782 1 3 38 23 8 32 47
for i in $first; do
ARRAY+=($i)
done
# Alternately as chepner suggested:
ARRAY2=($second)
for (( c=$zero; c < ${#ARRAY2[@]}; c++ )); do
echo -n $((${ARRAY2[$c]} - ${ARRAY[$c]})) " "
done
Running it:
samveen@precise:/tmp$ echo $BASH_VERSION
4.2.25(1)-release
samveen@precise:/tmp$ bash sum.sh
0 0 0 0 0 0 0 0 14 6 476 0 0 0 4 4 0 0 0 0 0 0 0 16 4 0 0 0 48 0 0 0 0 27 0 0 0 0 0 16 0 0 0 0 501 62 36 0 8 0 0 0 5 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 280 0 0 0 0 30 0 0 0 0 0 0 0 0 0 0 0 115 0 0 55 0 10 73 0 0 0 0 1 16 12 2 0 0 0 0
EDIT:
* Added improvements from suggestions in comments.
I think the problem must be that the two arrays don't have the same size. It's easy to reproduce that syntax error: one of the operands of the minus operator is an empty string:
$ a=5; b=3; expr $a - $b
2
$ a=""; b=3; expr $a - $b
expr: syntax error
$ a=5; b=""; expr $a - $b
expr: syntax error
$ a=""; b=""; expr $a - $b
-
Try
ARRAY=( $(command to generate a list of numbers) )
sleep 1
ARRAY2=( $(command to generate a list of numbers) )
if (( ${#ARRAY[@]} != ${#ARRAY2[@]} )); then
    echo "error: different size arrays!"
    echo "ARRAY: ${#ARRAY[@]} (${ARRAY[*]})"
    echo "ARRAY2: ${#ARRAY2[@]} (${ARRAY2[*]})"
fi
"The error occurs whenever the first array is smaller than the second" -- of course. You're looping from 0 to the array size of ARRAY2. When ARRAY has fewer elements, you'll eventually try to access an index that does not exist in the array. When you try to reference an unset variable, bash gives you the empty string.
$ a=(1 2 3)
$ b=(4 5 6 7)
$ i=2; expr ${a[i]} - ${b[i]}
-3
$ i=3; expr ${a[i]} - ${b[i]}
expr: syntax error
When reading /proc/stat, I get these values:
cpu 20582190 643 1606363 658948861 509691 24 112555 0 0 0
cpu0 3408982 106 264219 81480207 19354 0 35 0 0 0
cpu1 3395441 116 265930 81509149 11129 0 30 0 0 0
cpu2 3411003 197 214515 81133228 418090 0 1911 0 0 0
cpu3 3478358 168 257604 81417703 30421 0 29 0 0 0
cpu4 1840706 20 155376 83328751 1564 0 7 0 0 0
cpu5 1416488 15 171101 83410586 1645 13 108729 0 0 0
cpu6 1773002 7 133686 83346305 25666 10 1803 0 0 0
cpu7 1858207 10 143928 83322929 1819 0 8 0 0 0
Some sources say to read only the first four values to calculate CPU usage, while others say to read all the values.
Do I read only the first four values (user, nice, system, and idle) to calculate CPU utilization? Or do I need all the values? Or not all, but more than four? Would I need iowait, irq, or softirq?
cpu 20582190 643 1606363 658948861
Versus the entire line.
cpu 20582190 643 1606363 658948861 509691 24 112555 0 0 0
Edit: some sources also state that iowait is added into idle.
When calculating a specific process' CPU usage, does the method differ?
The man page states that it varies with architecture, and also gives a couple of examples describing how they are different:
In Linux 2.6 this line includes three additional columns: ...
Since Linux 2.6.11, there is an eighth column, ...
Since Linux 2.6.24, there is a ninth column, ...
When "some people said to only use..." they were probably not taking these into account.
Regarding whether the calculation differs across CPUs: You will find lines related to "cpu", "cpu0", "cpu1", ... in /proc/stat. The "cpu" fields are all aggregates (not averages) of corresponding fields for the individual CPUs. You can check that for yourself with a simple awk one-liner.
cpu 84282 747 20805 1615949 44349 0 308 0 0 0
cpu0 26754 343 9611 375347 27092 0 301 0 0 0
cpu1 12707 56 2581 422198 5036 0 1 0 0 0
cpu2 33356 173 6160 394561 7508 0 4 0 0 0
cpu3 11464 174 2452 423841 4712 0 1 0 0 0