Node API taking 100% CPU, node-tick-processor output looks cryptic - performance

I have a node API which normally handles traffic quite well. However, at peak times it gets into a state where it starts using 100% CPU and needs to be restarted. After a restart, it runs normally for the next few days.
Using a load testing site, I have been able to reproduce this issue. The request I am load testing is extremely simple, so I fear the problem is in a third-party library I'm using.
I'm new to debugging node and I'm not sure what to make of the following output from node-tick-processor. Can anyone decipher this?
Update: I'm running node v0.10.4
[Unknown]:
ticks total nonlib name
5 0.0%
[Shared libraries]:
ticks total nonlib name
11943 49.1% 0.0% /lib64/libc-2.12.so
10754 44.2% 0.0% /usr/local/bin/node
314 1.3% 0.0% /lib64/libpthread-2.12.so
50 0.2% 0.0% 7fff318b4000-7fff318b5000
5 0.0% 0.0% /lib64/libm-2.12.so
3 0.0% 0.0% /usr/lib64/libstdc++.so.6.0.17
[JavaScript]:
ticks total nonlib name
40 0.2% 3.2% LazyCompile: ~read tls.js:397
36 0.1% 2.8% LazyCompile: *EventEmitter.addListener events.js:126
29 0.1% 2.3% LazyCompile: *Readable.read _stream_readable.js:226
<clipped>
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 2.0% are not shown.
ticks parent name
11943 49.1% /lib64/libc-2.12.so
10754 44.2% /usr/local/bin/node
8270 76.9% LazyCompile: *use tls.js:222
5162 62.4% LazyCompile: ~read tls.js:397
5074 98.3% LazyCompile: *Readable.read _stream_readable.js:226
3396 66.9% LazyCompile: ~write tls.js:315
3396 100.0% LazyCompile: *Writable.write _stream_writable.js:155
1063 20.9% LazyCompile: *write tls.js:315
1063 100.0% LazyCompile: *Writable.write _stream_writable.js:155
370 7.3% LazyCompile: *Writable.write _stream_writable.js:155
370 100.0% LazyCompile: ~write _stream_readable.js:546
186 3.7% LazyCompile: ~callback tls.js:753
180 96.8% LazyCompile: *onclienthello tls.js:748
6 3.2% LazyCompile: ~onclienthello tls.js:748
2417 29.2% LazyCompile: *read tls.js:397
2417 100.0% LazyCompile: *Readable.read _stream_readable.js:226
2320 96.0% LazyCompile: *Writable.write _stream_writable.js:155
2315 99.8% LazyCompile: ~write _stream_readable.js:546
57 2.4% LazyCompile: ~callback tls.js:753
57 100.0% LazyCompile: *onclienthello tls.js:748
691 8.4% LazyCompile: *Readable.read _stream_readable.js:226
675 97.7% LazyCompile: *write tls.js:315
675 100.0% LazyCompile: *Writable.write _stream_writable.js:155
674 99.9% LazyCompile: ~write _stream_readable.js:546

For those still coming here: this was caused somewhere in the SSL-handling code of a core node module. I reconfigured my stack to do SSL termination at nginx and plain HTTP handling at the node level, and the problem went away entirely.

Related

Why are all the values same in ARIMA model predictions?

The data set had 1511 observations. I used the first 1400 values to fit an ARIMA model of order (1,1,9), keeping the rest for predictions. But when I look at the predictions, apart from the first 16 values, all the remaining ones are the same. Here's what I tried:
from statsmodels.tsa.arima_model import ARIMA  # assuming the older statsmodels API, given the typ='levels' call below

model2 = ARIMA(tstrain, order=(1, 1, 9))
fitted_model2 = model2.fit()
And for prediction:
start = len(tstrain)
end = len(tstrain) + len(tstest) - 1
predictions = fitted_model2.predict(start, end, typ='levels')
Here tstrain and tstest are the train and test sets.
predictions.head(30)
1400 214.097742
1401 214.689674
1402 214.820804
1403 215.621131
1404 215.244980
1405 215.349230
1406 215.392444
1407 215.022312
1408 215.020736
1409 215.021384
1410 215.021118
1411 215.021227
1412 215.021182
1413 215.021201
1414 215.021193
1415 215.021196
1416 215.021195
1417 215.021195
1418 215.021195
1419 215.021195
1420 215.021195
1421 215.021195
1422 215.021195
1423 215.021195
1424 215.021195
1425 215.021195
1426 215.021195
1427 215.021195
1428 215.021195
1429 215.021195
Please help me out here. What am I missing?

DV360 API. Create target for line item

I'm trying to create an ageRange target for a line item using the DV360 API. However, this requires a targetOptionId. Where can I get it?
(https://developers.google.com/display-video/api/reference/rest/v1/advertisers.lineItems.targetingTypes.assignedTargetingOptions/create)

How to properly interpret HeapInuse / HeapIdle / HeapReleased memory stats in golang

I want to monitor the memory usage of my golang program and clean up some internal caches if the system is running low on free memory.
The problem is that the HeapAlloc / HeapInuse / HeapReleased stats don't always add up properly (to my understanding).
I'm looking at the free system memory (+ buffers/cache), i.e. the value shown as available by the free utility:
$ free
total used free shared buff/cache available
Mem: 16123232 409248 15113628 200 600356 15398424
Swap: 73242180 34560 73207620
I also look at HeapIdle - HeapReleased, which, according to the comments in https://godoc.org/runtime#MemStats,
HeapIdle minus HeapReleased estimates the amount of memory
that could be returned to the OS, but is being retained by
the runtime so it can grow the heap without requesting more
memory from the OS.
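For reference, here is a minimal standalone sketch (an illustration, not the asker's monitoring code) of how these fields can be read with runtime.ReadMemStats:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m) // snapshot of the allocator's statistics

    const mb = 1 << 20
    fmt.Printf("HeapAlloc: %dM, HeapInuse: %dM, HeapIdle: %dM, HeapReleased: %dM\n",
        m.HeapAlloc/mb, m.HeapInuse/mb, m.HeapIdle/mb, m.HeapReleased/mb)

    // Memory retained by the runtime that could still be returned to the OS.
    fmt.Printf("HeapIdle - HeapReleased: %dM\n", (m.HeapIdle-m.HeapReleased)/mb)
}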
Now the problem: sometimes Available + HeapInuse + HeapIdle - HeapReleased exceeds the total amount of system memory. This usually happens when HeapIdle is quite high and HeapReleased is neither close to HeapIdle nor to zero:
# Start of test
Available: 15379M, HeapAlloc: 49M, HeapInuse: 51M, HeapIdle: 58M, HeapReleased: 0M
# Work is in progress
# Looks good: 11795 + 3593 = 15388
Available: 11795M, HeapAlloc: 3591M, HeapInuse: 3593M, HeapIdle: 0M, HeapReleased: 0M
# Work has been done
# Looks good: 11745 + 45 + 3602 = 15392
Available: 11745M, HeapAlloc: 42M, HeapInuse: 45M, HeapIdle: 3602M, HeapReleased: 0M
# Golang released some memory to OS
# Looks good: 15224 + 14 + 3632 - 3552 = 15318
Available: 15224M, HeapAlloc: 10M, HeapInuse: 14M, HeapIdle: 3632M, HeapReleased: 3552M
# Some other work started
# Looks SUSPICIOUS: 13995 + 1285 + 2360 - 1769 = 15871
Available: 13995M, HeapAlloc: 1282M, HeapInuse: 1285M, HeapIdle: 2360M, HeapReleased: 1769M
# 5 seconds later
# Looks BAD: 13487 + 994 + 2652 - 398 = 16735 - more than system memory
Available: 13487M, HeapAlloc: 991M, HeapInuse: 994M, HeapIdle: 2652M, HeapReleased: 398M
# This bad situation holds for quite a while, even when work has been done
# Looks BAD: 13488 + 14 + 3631 - 489 = 16644
Available: 13488M, HeapAlloc: 10M, HeapInuse: 14M, HeapIdle: 3631M, HeapReleased: 489M
# It is strange that at this moment HeapIdle - HeapReleased = 3142M,
# which is more than the 2134M of used memory reported by the "free" utility.
$ free
total used free shared buff/cache available
Mem: 16123232 2185696 13337632 200 599904 13621988
Swap: 73242180 34560 73207620
# Still bad when another set of work started
# Looks BAD: 13066 + 2242 + 1403 = 16711
Available: 13066M, HeapAlloc: 2240M, HeapInuse: 2242M, HeapIdle: 1403M, HeapReleased: 0M
# But after 10 seconds it becomes good
# Looks good: 11815 + 2325 + 1320 = 15460
Available: 11815M, HeapAlloc: 2322M, HeapInuse: 2325M, HeapIdle: 1320M, HeapReleased: 0M
I do not understand where this additional "breathing" 1.3GB (16700 - 15400) of memory comes from. The used swap space remained the same during the whole test.

Map access bottleneck in Golang

I am using Golang to implement naive Bayes classification for a dataset with over 30,000 possible tags. I have built the model and am now in the classification phase. Classifying 1000 records takes up to 5 minutes. I have profiled the code with pprof; the top 10 entries are shown below:
Total: 28896 samples
16408 56.8% 56.8% 24129 83.5% runtime.mapaccess1_faststr
4977 17.2% 74.0% 4977 17.2% runtime.aeshashbody
2552 8.8% 82.8% 2552 8.8% runtime.memeqbody
1468 5.1% 87.9% 28112 97.3% main.(*Classifier).calcProbs
861 3.0% 90.9% 861 3.0% math.Log
435 1.5% 92.4% 435 1.5% runtime.markspan
267 0.9% 93.3% 302 1.0% MHeap_AllocLocked
187 0.6% 94.0% 187 0.6% runtime.aeshashstr
183 0.6% 94.6% 1137 3.9% runtime.mallocgc
127 0.4% 95.0% 988 3.4% math.log10
Surprisingly, the map access seems to be the bottleneck. Has anyone experienced this? What other key-value data structure can be used to avoid this bottleneck? All the map access is done in the following piece of code:
func (nb *Classifier) calcProbs(data string) *BoundedPriorityQueue {
    probs := &BoundedPriorityQueue{}
    heap.Init(probs)
    terms := strings.Split(data, " ")
    for class, prob := range nb.classProb {
        condProb := prob
        clsProbs := nb.model[class]
        for _, term := range terms {
            termProb := clsProbs[term]
            if termProb != 0 {
                condProb += math.Log10(termProb)
            } else {
                condProb += -6 // math.Log10(0.000001)
            }
        }
        entry := &Item{
            value:    class,
            priority: condProb,
        }
        heap.Push(probs, entry)
    }
    return probs
}
The maps are nb.classProb, which is a map[string]float64, while nb.model is a nested map of type
map[string]map[string]float64
In addition to what @tomwilde said, another approach that may speed up your algorithm is string interning. Namely, you can avoid using a map entirely if you know the domain of keys ahead of time. I wrote a small package that will do string interning for you.
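As a generic illustration of the idea (a sketch only, not the answerer's package), interning assigns each distinct term a small integer once, so later lookups can index a slice instead of hashing strings:

// Hypothetical interner: assigns each distinct string a stable integer ID.
type Interner struct {
    ids map[string]int
}

func NewInterner() *Interner {
    return &Interner{ids: make(map[string]int)}
}

// ID returns the existing ID for s, or assigns the next free one.
func (in *Interner) ID(s string) int {
    id, ok := in.ids[s]
    if !ok {
        id = len(in.ids)
        in.ids[s] = id
    }
    return id
}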
Yes, the map access will be the bottleneck in this code: it's the most significant operation inside the two nested loops.
It's not possible to tell for sure from the code you've included, but I expect you have a limited number of classes. What you might do is number them and store the term-wise class probabilities like this:
map[string][NumClasses]float64
(i.e. for each term, store an array of class-wise probabilities [or perhaps their logs, precomputed], where NumClasses is the number of distinct classes you have).
Then, iterate over terms first, and classes inside. The expensive map lookup will be done in the outer loop, and the inner loop will be iteration over an array.
This will reduce the number of map lookups by a factor of NumClasses. It may need more memory if your data is extremely sparse.
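A minimal sketch of that restructuring, assuming the classes have been numbered 0..NumClasses-1 and the per-term log-probabilities are precomputed (the names below are illustrative, not taken from the question):

const NumClasses = 64 // illustrative; use the actual number of classes

// termLogProbs maps each term to its per-class log-probabilities.
var termLogProbs map[string][NumClasses]float64

// classLogPrior holds the prior log-probability of each class.
var classLogPrior [NumClasses]float64

func scoreClasses(terms []string) [NumClasses]float64 {
    scores := classLogPrior // array copy of the priors
    for _, term := range terms {
        if probs, ok := termLogProbs[term]; ok { // one map lookup per term
            for c := 0; c < NumClasses; c++ {
                scores[c] += probs[c]
            }
        } else {
            for c := 0; c < NumClasses; c++ {
                scores[c] += -6 // same unseen-term penalty as the original code
            }
        }
    }
    return scores
}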
The next optimisation is to use multiple goroutines to do the calculations, assuming you have more than one CPU core available.
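A sketch of that fan-out, splitting the records across a fixed number of workers (for example runtime.NumCPU()) with a channel and a sync.WaitGroup; classifyAll is a placeholder name and scoreClasses is the function sketched above:

// Assumes "strings" and "sync" are imported.
func classifyAll(records []string, workers int) {
    var wg sync.WaitGroup
    jobs := make(chan string)

    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for rec := range jobs {
                _ = scoreClasses(strings.Split(rec, " ")) // classify one record
            }
        }()
    }

    for _, rec := range records {
        jobs <- rec
    }
    close(jobs)
    wg.Wait()
}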

Pandas performance issue of dataframe column "rename" and "drop"

Below is the line_profiler record of a function:
Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s
File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1068 @profile
1069 def _rpt_join(dfa, dfb, join_type='inner'):
1070 ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
1071 'join_type' can be 'inner' or 'outer'
1072 '''
1073
1074 2 56 28.0 0.0 try: # ('STK_ID','RPT_Date') are normal column
1075 2 2936668 1468334.0 43.7 rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
1076 except: # ('STK_ID','RPT_Date') are index
1077 rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
1078
1079
1080 2 81 40.5 0.0 try: # handle 'STK_Name
1081 2 426472 213236.0 6.3 name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
1082
1083
1084 2 900584 450292.0 13.4 nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
1085
1086 2 1138140 569070.0 16.9 rst.STK_Name_x = nameseries
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
1089 except:
1090 pass
1091
1092 2 94 47.0 0.0 return rst
What surprises me are these two lines:
1087 2 596768 298384.0 8.9 rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
1088 2 722293 361146.5 10.7 rst = rst.drop(['STK_Name_y'], axis=1)
Why do simple DataFrame column "rename" and "drop" operations cost such a large share of the time (8.9% + 10.7%)? After all, the "merge" operation only costs 43.7%, and "rename"/"drop" do not look like calculation-intensive operations. How can I improve this?
