Statsd - Graphite data points reset to zero - metrics

I'm using Graphite. When I send the metric stats_count, it shows up in the metric's data points JSON, but after some time it resets to zero.
For example:
initially: [0, 1426770240] [449.968673, 1426770300]
later: [0, 1426770240] [0, 1426770300] [0, 1426770360]
retentions = 10s:7d,5m:30d,15m:5y

You're likely using the wrong aggregation function in Graphite, so it's aggregating your counts by averaging them instead of summing them.
From https://kevinmccarthy.org/blog/2013/07/18/10-things-i-learned-deploying-graphite/#aggregating-data-what:
The other thing that can trip you up is aggregating counts. What happens when we average 60 seconds of counts? Well, we lose around 1/60th of the events that happened. So for count statistics, we actually want to sum the number of times that something happened.
Check this section of the statsd docs on how to configure Graphite for statsd. The key part is:
# conf/storage-aggregation.conf:
[count]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
As you can see, the aggregation method used is sum. By default, Graphite averages metrics when downsampling.
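For reference, the same section of the statsd docs pairs this aggregation config with a storage schema whose finest retention matches statsd's flush interval (10 seconds by default). The values below follow the docs' example and are only illustrative; check the current docs for the exact numbers:
# conf/storage-schemas.conf:
[stats]
pattern = ^stats.*
retentions = 10s:6h,1min:6d,10min:1800d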

Related

How do I add noise/variability to a dataset in Python, given the CV?

Given a dataset of blood results, say cholesterol level, and knowing that the instrument that produced those results is subject to a known degree of variability, how would I add that variability back into the dataset? i.e. I want to assume the result in the original dataset is the true/mean value, and then produce new results that are subject to the known variability of the instrument.
In Excel you would use =NORM.INV(RAND(), mean, std_dev), where RAND() provides a random value between 0 and 1, "mean" is the original value, and since I have the CV I can calculate the SD. NORM.INV then provides the inverse of the cumulative normal distribution function.
I've done the following to create a new column with my new values, but would like to know if it is valid (i.e., will each row have a different random number between 0 and 1 as the probability, and is this formula equivalent to NORM.INV?):
df8000['HDL_1'] = norm.ppf(random(), loc = df8000['HDL_0'], scale = TAE_df.loc[0,'HDL'])
Thanks in advance!

Spark DataFrame suddenly becomes very slow when I reuse the old cached data iteratively too many times

The problem happens when I try to keep my cached results in a List and calculate a new DataFrame from all the data in that list in each iteration. However, even though I use an empty DataFrame and get an empty result each time, the function suddenly gets very slow after about 8~12 rounds.
Here is my code:
testLoop(Nil)
def testLoop(lastDfList: List[DataFrame]) {
  // do some dummy transformation like union and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache
  // always get 0, of course
  println(resultDf.count)
  // benchmark action
  benchmark(resultDf.count)
  testLoop(resultDf :: lastDfList)
}
The benchmark results:
1~6 round : < 200ms
7 round : 367ms
8 round : 918ms
9 round : 2476ms
10 round : 7833ms
11 round : 24231ms
I don't think GC or block eviction is the problem in my case, since I already use an empty DataFrame, but I don't know what the cause is. Do I misunderstand the meaning of cache or something?
Thanks!
After reading ImDarrenG's solution, I changed my code to be the following:
spark.sparkContext.setCheckpointDir("/tmp")
testLoop(Nil)
def testLoop(lastDfList: List[DataFrame]) {
  // do some dummy transformation like union and cache the result
  val resultDf = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }.cache
  resultDf.checkpoint()
  // always get 0, of course
  println(resultDf.count)
  // benchmark action
  benchmark(resultDf.count)
  testLoop(resultDf :: lastDfList)
}
But it still becomes very slow after a few iterations.
Here you create a list of DataFrames by adding resultDf to the beginning of lastDfList and pass that to the next iteration of testLoop:
testLoop(resultDf::lastDfList)
So lastDfList gets longer each pass.
This line creates a new DataFrame by unioning each member of lastDfList:
val resultDf = lastDfList.foldLeft(Seq[Data]().toDF){(df, lastDf) => df.union(lastDf)}.cache
Each member of lastDfList is a union of its predecessors; therefore, Spark is maintaining a lineage that becomes exponentially larger with each pass of testLoop.
I expect that the increase in time is caused by the housekeeping of the DAG. Caching the dataframes removes the need to repeat transformations, but the lineage must still be maintained by Spark.
Cached data or no, it looks like you are building a really complex DAG by unioning each DataFrame with all of its predecessors with each pass of testLoop.
You could use checkpoint to trim the lineage, and introduce some check to prevent infinite recursion.
According to the API and the code, checkpoint returns a new Dataset instead of changing the original Dataset.
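To make that concrete, here is a minimal sketch (not your exact code) of the loop keeping the Dataset returned by checkpoint(), which is eager by default in recent Spark versions, plus a simple round limit to stop the recursion. Data, benchmark and spark are taken from your snippet:

import org.apache.spark.sql.DataFrame
import spark.implicits._ // for Seq[Data]().toDF; `spark` is your SparkSession

spark.sparkContext.setCheckpointDir("/tmp")

def testLoop(lastDfList: List[DataFrame], roundsLeft: Int): Unit = {
  if (roundsLeft > 0) {
    // same dummy union as before
    val unioned = lastDfList.foldLeft(Seq[Data]().toDF) { (df, lastDf) => df.union(lastDf) }
    // checkpoint() returns a NEW DataFrame whose lineage is truncated; keep that one
    val resultDf = unioned.checkpoint()
    benchmark(resultDf.count)
    testLoop(resultDf :: lastDfList, roundsLeft - 1)
  }
}

testLoop(Nil, 20)

Because every DataFrame carried into the next round is already checkpointed, the lineage of each new union stays proportional to the length of the list instead of growing with the whole history of transformations.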

Boost Confidence of Overlapping Observations In Apache Spark

I'm fairly new to scala/spark, so forgive me if my question is elementary but I've searched everywhere and can't find the answer.
Problem
I'm trying to boost the confidence scores of a bunch of network router observations (observations of probable router types at different network junctions).
I have a type NetblockObservation that combines a device type seen on a network with an associated netblock and a confidence. The confidence is the confidence that we accurately identified the device we saw.
case class NetblockObservation(
  device_type: String,
  ip_start: Long,
  ip_end: Long,
  confidence_score: Double
)
If the confidence is above some threshold thresh, then I want that observation to be in the returned dataset. If it's below thresh, it should not be.
In addition, if I have two observations with the same device_type and one contains the other, the containee should have its confidence increased by the confidence of the container.
Example
Let's say I have 3 Netblock Observations
// 0.0.0.0/28
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 15, confidence_score: .4)
// 0.0.0.0/29
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 7, confidence_score: .4)
// 0.0.0.0/30
NetblockObservation(device_type: "x", ip_start: 0, ip_end: 3, confidence_score: .4)
With a confidence threshold of 1, I would expect a single output of NetblockObservation(device_type: "x", ip_start: 0, ip_end: 3, confidence_score: 1.2)
Explanation: I am allowed to add the confidence scores of NetblockObservations together if one is contained in the other and they have the same device_type.
I was allowed to add the confidence score of the 0.0.0.0/29 to the confidence of the 0.0.0.0/30 because the /30 is contained within the /29.
I was not allowed to add the confidence score of the 0.0.0.0/30 to the 0.0.0.0/29 because the /29 is not contained within the /30.
My (pitiful) Attempt
Failure reason: Too slow / never completed
I attempted to implement this while simultaneously learning scala/spark so I'm not sure if it's the idea or the implementation which is wrong. I think it would eventually work but after an hour, it hadn't completed on a dataset of size 300,000 (small compared to production scale) so I gave up on it.
The idea is to find the largest netblock and separate the data into netblocks which are contained and netblocks which are not contained. The netblocks which are not contained are recursively passed back into the same function. If the largest netblock has a confidence_score of 1, the entire contained dataset is disregarded and the largest is added to the return dataset. If the confidence_score is less than 1, then its confidence_score is added to everything in the contained dataset and that group is recursively passed back to the same function. Eventually, you should only be left with the data which has a confidence_score greater than 1. This algorithm also has the issue of not taking device_type into account.
def handleDataset(largestInNetData: Option[NetblockObservation], netData: RDD[NetblockObservation]): RDD[NetblockObservation] = {
  if (netData.isEmpty) spark.sparkContext.emptyRDD else largestInNetData match {
    case Some(largest) =>
      val grouped = netData.groupBy(item =>
        if (item.ip_start >= largest.ip_start && item.ip_end <= largest.ip_end) largestInNetData
        else None)
      def lookup(k: Option[NetblockObservation]) = grouped.filter(_._1 == k).flatMap(_._2)
      val nos = handleDataset(None, lookup(None))
      // Threshold is assumed to be 1
      val next = if (largest.confidence_score >= 1) spark.sparkContext.parallelize(Seq(largest)) else
        handleDataset(None, lookup(largestInNetData)
          .filter(x => x != largest)
          .map(x => x.copy(confidence_score = x.confidence_score + largest.confidence_score)))
      nos ++ next
    case None =>
      val largest = netData.reduce((a: NetblockObservation, b: NetblockObservation) => if ((a.ip_end - a.ip_start) > (b.ip_end - b.ip_start)) a else b)
      handleDataset(Option(largest), netData)
  }
}
It is a fairly involved bit of code, so here is a general algorithm that I hope will help:
1. Forget about Spark for a moment and write a Scala function, probably in the companion object for NetblockObservation, that takes a collection of them and returns the subset of that collection that is contained. You should unit test the heck out of this function, and again this is pure Scala (a rough sketch follows below).
2. Moving now to Spark, do a groupBy on your RDD[NetblockObservation] with device_type as the key, producing essentially a map of String to Iterable[NetblockObservation].
3. Filter out all the entries in the map that have a value of size 1 and have a confidence below thresh.
4. For the entries that remain, apply your function from the first step to the collections of NetblockObservations with a mapValues.
5. Do a reduceByKey or similar to simply add up the confidence_scores of the contained values.
6. Enjoy a refreshing beverage.
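To illustrate the pure-Scala piece, here is a rough sketch under my own assumptions (not code from the answer; it folds the confidence summing of the later steps into one helper): the collection already holds a single device_type (step 2's groupBy takes care of that), "contained" means the IP range lies within another observation's range, and a contained observation is boosted by the sum of its containers' confidences before the threshold is applied. The helper names are mine.

object NetblockObservation {
  // true when `inner`'s IP range lies within `outer`'s range (and they are not the same observation)
  def containedBy(inner: NetblockObservation, outer: NetblockObservation): Boolean =
    inner != outer &&
      outer.ip_start <= inner.ip_start &&
      inner.ip_end <= outer.ip_end

  // boost every observation by the confidences of all observations containing it,
  // then keep only those that reach the threshold
  def boostAndFilter(obs: Seq[NetblockObservation], thresh: Double): Seq[NetblockObservation] =
    obs.map { o =>
      val boost = obs.filter(containedBy(o, _)).map(_.confidence_score).sum
      o.copy(confidence_score = o.confidence_score + boost)
    }.filter(_.confidence_score >= thresh)
}

Applied to the three observations from the example above with thresh = 1, this keeps only the 0.0.0.0/30 observation, with confidence_score 1.2.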

Poor h2o GBM Classification Performance in a balanced binomial response

In a fairly balanced binomial classification response problem, I am observing an unusual level of error in h2o.gbm classification for determining class 0, on the training set itself. It is from a competition which is over, so the interest is only in understanding what is going wrong.
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
         0        1        Error      Rate
0        147857   234035   0.612830   =234035/381892
1        44782    271661   0.141517   =44782/316443
Totals   192639   505696   0.399260   =278817/698335
Any expert suggestions to treat the data and reduce the error are welcome.
The following approaches were tried, and the error did not decrease.
Approach 1: Selecting the top 5 important variables via h2o.varimp(gbm)
Approach 2: Converting the negative normalized variables to zero and the positive ones to 1.
#Data Definition
# Variable Definition
#Independent Variables
# ID Unique ID for each observation
# Timestamp Unique value representing one day
# Stock_ID Unique ID representing one stock
# Volume Normalized values of volume traded of given stock ID on that timestamp
# Three_Day_Moving_Average Normalized values of three days moving average of Closing price for given stock ID (Including Current day)
# Five_Day_Moving_Average Normalized values of five days moving average of Closing price for given stock ID (Including Current day)
# Ten_Day_Moving_Average Normalized values of ten days moving average of Closing price for given stock ID (Including Current day)
# Twenty_Day_Moving_Average Normalized values of twenty days moving average of Closing price for given stock ID (Including Current day)
# True_Range Normalized values of true range for given stock ID
# Average_True_Range Normalized values of average true range for given stock ID
# Positive_Directional_Movement Normalized values of positive directional movement for given stock ID
# Negative_Directional_Movement Normalized values of negative directional movement for given stock ID
#Dependent Response Variable
# Outcome Binary outcome variable representing whether price for one particular stock at the tomorrow’s market close is higher(1) or lower(0) compared to the price at today’s market close
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)
summary(train)
#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create empty response Outcome
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)
rm(train,test)
summary(combi.imp)
#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)
summary(combi.imp)
#Brute Force NA treatment by taking only complete cases without NA.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]
library(h2o)
y<-c("Outcome")
features=names(train.complete)[!names(train.complete) %in% c("Outcome")]
h2o.shutdown(prompt=F)
#Adjust memory size based on your system.
h2o.init(nthreads = -1,max_mem_size = "5g")
train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])
#Models
gbmF_model_1 = h2o.gbm(x = features,
                       y = y,
                       training_frame = train.hex,
                       seed = 1234)
h2o.performance(gbmF_model_1)
You've only trained a single GBM with the default parameters, so it doesn't look like you've put enough effort into tuning your model. I'd recommend a random grid search on GBM using the h2o.grid() function. Here is an H2O R code example you can follow.

How can one analyze the greatest percentage gain (burst) of numbers in sequence in an array?

There are algorithms for detecting the maximum subarray within an array (both contiguous and non-continguous). Most of them are based around having both negative and positive numbers, though. How is it done with positive numbers only?
I have an array of values of a stock over a consecutive range of time (let's say the array contains values for all consecutive months).
[15.42, 16.42, 17.36, 16.22, 14.72, 13.95, 14.73, 13.76, 12.88, 13.51, 12.67, 11.11, 10.04, 10.38, 10.14, 7.72, 7.46, 9.41, 11.39, 9.7, 12.67, 18.42, 18.44, 18.03, 17.48, 19.6, 19.57, 18.48, 17.36, 18.03, 18.1, 19.07, 21.02, 20.77, 19.92, 18.71, 20.29, 22.36, 22.38, 22.39, 22.94, 23.5, 21.66, 22.06, 21.07, 19.86, 19.49, 18.79, 18.16, 17.24, 17.74, 18.41, 17.56, 17.24, 16.04, 16.05, 15.4, 15.77, 15.68, 16.29, 15.23, 14.51, 14.05, 13.28, 13.49, 13.12, 14.33, 13.67, 13.13, 12.45, 12.48, 11.58, 11.52, 11.2, 10.46, 12.24, 11.62, 11.43, 10.96, 10.63, 10.19, 10.03, 9.7, 9.64, 9.16, 8.96, 8.49, 8.16, 8.0, 7.86, 8.08, 8.02, 7.67, 8.07, 8.37, 8.35, 8.82, 8.58, 8.47, 8.42, 7.92, 7.77, 7.79, 7.6, 7.18, 7.44, 7.74, 7.47, 7.63, 7.21, 7.06, 6.9, 6.84, 6.96, 6.93, 6.49, 6.38, 6.69, 6.49, 6.76]
I need an algorithm to determine for each element the single time period where it had the biggest percentage gain. This could be a time period of 1 month, some span of several months, or the entire array (e.g., 120 months), depending on the stock. I then want to output the burst, in terms of percentage gain, as well as the return (change in price over the original price; so the peak price vs the starting price in the period).
I've combined the max subarray type algorithms, but realized that this problem is a bit different; the array has no negative numbers, so those algorithms just report the entire array as the period and the sum of all elements as the gain.
The algorithms I mentioned are located here and here, with the latter being based on the Master Theorem. Hope this helps.
I'm coding in Ruby but pseudocode would be welcome, too.
I think you went the wrong way ...
I'm not familiar with Ruby, but let us build the algorithm in pseudocode using your own words:
I've got an array that contains the values of a stock over a range of time (let's say, for this example, each element is the value of the stock in a month; the array contains values for all consecutive months).
We'll name this array StockValues, its length is given by length(StockValues), assume it is 1 based (first item is retrieved with StockValues[1])
I need an algorithm to analyze the array, and determine for each element the single time period where it had the biggest percentage gain in price.
You want to know, for a given index i, at which index j with j>i we have a maximum gain in percent, i.e. when gain = 100*StockValues[j]/StockValues[i] - 100 is maximum.
I then want to output the burst, in terms of percentage gain, as well as the return (change in price over the original price; so the peak price vs the starting price in the period).
You want to retrieve the two values burst = gain = 100*StockValues[j]/StockValues[i] - 100 and return = StockValues[j] - StockValues[i]
The first step will be to loop through the array, and for each element do a second loop to find when the gain is maximum. When we find a maximum, we save the values you want in another array named Result (let us assume this array is initialized with invalid values, like burst = -1, which means no gain over any period can be found).
for i=1 to length(StockValues)-1 do
    max_gain=0
    for j=i+1 to length(StockValues) do
        gain=100*StockValues[j]/StockValues[i]-100
        if gain>max_gain then
            max_gain=gain
            Result[i].burst=gain
            Result[i].return=StockValues[j]-StockValues[i]
            Result[i].start=i
            Result[i].end=j
            Result[i].period_length=j-i+1
            Result[i].start_price=StockValues[i]
            Result[i].end_price=StockValues[j]
        end if
    end for
end for
Note that this algorithm gives the smallest period; if you replace gain>max_gain with gain>=max_gain, you'll get the longest period in case there is more than one period with the same gain value. Only positive or zero gains are listed; if there is no gain at all, Result will contain the invalid value. Only periods > 1 are listed; if periods of 1 are accepted, then the worst gain possible would be 0%, and you would have to modify the loops so that i goes to length(StockValues) and j starts at i.
This doesn't really sound like several days of work :p unless I'm missing something.
# returns array of percentage gain per period
def percentage_gain(array)
  initial = array[0]
  after = 0
  percentage_gain = []
  1.upto(array.size-1).each do |i|
    after = array[i]
    percentage_gain << (after - initial)/initial*100
    initial = after
  end
  percentage_gain
end
# returns array of amount gain $ per period
def amount_gain(array)
  initial = array[0]
  after = 0
  amount_gain = []
  1.upto(array.size-1).each do |i|
    after = array[i]
    amount_gain << (after - initial)
    initial = after
  end
  amount_gain
end
# returns the maximum amount gain found in the array
def max_amount_gain(array)
  amount_gain(array).max
end

# returns the maximum percentage gain found in the array
def max_percentage_gain(array)
  percentage_gain(array).max
end
# returns the maximum potential gain you could've made by shortselling constantly.
# i am basically adding up the amount gained when you would've hit profit.
# on days the stock loses value, i don't add them.
def max_potential_amount_gain(array)
  initial = array[0]
  after = 0
  max_potential_gain = 0
  1.upto(array.size-1).each do |i|
    after = array[i]
    if after - initial > 0
      max_potential_gain += after - initial
    end
    initial = after
  end
  max_potential_gain
end
array = [15.42, 16.42, 17.36, 16.22, 14.72, 13.95, 14.73, 13.76, 12.88, 13.51, 12.67, 11.11, 10.04, 10.38, 10.14, 7.72, 7.46, 9.41, 11.39, 9.7, 12.67, 18.42, 18.44, 18.03, 17.48, 19.6, 19.57, 18.48, 17.36, 18.03, 18.1, 19.07, 21.02, 20.77, 19.92, 18.71, 20.29, 22.36, 22.38, 22.39, 22.94, 23.5, 21.66, 22.06, 21.07, 19.86, 19.49, 18.79, 18.16, 17.24, 17.74, 18.41, 17.56, 17.24, 16.04, 16.05, 15.4, 15.77, 15.68, 16.29, 15.23, 14.51, 14.05, 13.28, 13.49, 13.12, 14.33, 13.67, 13.13, 12.45, 12.48, 11.58, 11.52, 11.2, 10.46, 12.24, 11.62, 11.43, 10.96, 10.63, 10.19, 10.03, 9.7, 9.64, 9.16, 8.96, 8.49, 8.16, 8.0, 7.86, 8.08, 8.02, 7.67, 8.07, 8.37, 8.35, 8.82, 8.58, 8.47, 8.42, 7.92, 7.77, 7.79, 7.6, 7.18, 7.44, 7.74, 7.47, 7.63, 7.21, 7.06, 6.9, 6.84, 6.96, 6.93, 6.49, 6.38, 6.69, 6.49, 6.76]
