Graphite metrics from Aggregator showing weird retention(?) behaviour

We have been using Graphite for a while now and recently changed the source of some metrics from statsd to yammer/codahale-metrics. Since our metrics generally are sent from a number of different servers, we set up Graphite's own aggregator to handle that for us.
Now the problem is that the stats for individual servers show up and behave just fine, but the aggregated stats are only ever correct for roughly the last hour. In other words, older aggregated values are somehow modified after some time. Here's an image of what it looks like:
The green line is just a sumSeries on the metrics that should have been aggregated, the blue line is what the aggregator generated. Note how the two lines only agree in the past hour.
Of course we have looked into the storage/aggregation/retention rules, but they are all really basic and should cover all metrics equally (and should not even come into effect for data that is only 1 hour old):
storage-schemas.conf
[stats]
priority = 110
pattern = .*
# store 60s for 30d, then 15 minutes 350400 (10 years)
retentions = 60:43000,900:262974
storage-aggregation.conf
[kv]
pattern = \.kv\.
xFilesFactor = 0.2
aggregationMethod = average
[counts]
pattern = \.counts\.
xFilesFactor = 0
aggregationMethod = sum
[timers]
pattern = \.timers\.
xFilesFactor = 0.2
aggregationMethod = average
[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
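As a sanity check, the retention and aggregation settings that actually ended up in a Whisper file can be inspected with the whisper Python package; a minimal sketch (the .wsp path below is a placeholder, point it at one of the aggregated metrics):
import whisper

# Hypothetical path - replace with one of your aggregated .wsp files
info = whisper.info('/opt/graphite/storage/whisper/stats/some/aggregated/metric.wsp')
print('aggregationMethod:', info['aggregationMethod'])
print('xFilesFactor:', info['xFilesFactor'])
for archive in info['archives']:
    # secondsPerPoint x points = retention window covered by this archive
    print('%ds per point, %d points, %ds of history' % (
        archive['secondsPerPoint'], archive['points'], archive['retention']))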
The configuration of the actual aggregator is probably the blind spot here; we couldn't find any really detailed documentation and mostly just left everything as it was.
carbon.conf
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
DESTINATIONS = 127.0.0.1:2004
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 10000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5

It looks like you ran into an issue that exists in the latest release (0.9.12) of Graphite and that was reported to the project's bug tracker: https://github.com/graphite-project/carbon/issues/109. The bug report also mentions a potential fix for the issue.

Related

Omnetpp.ini - How to create a loop for the host parameters

I have 1000 hosts. I need to simulate a situation in which host[0] connects to the other 999 hosts via PingApp according to a timetable.
For example
**.host[0]*.numPingApps = 999 #number of hosts
**.host[0]*.pingApp[*].typename = "PingApp"
**.host[0]*.pingApp[*].packetSize = 42 B
**.host[0]*.pingApp[*].sendInterval = 1 s
**.host[0]*.pingApp[*].srcAddr = "host[0]"
**.host[0]*.pingApp[0].destAddr = "host[1]"
**.host[0]*.pingApp[0].startTime = 0 s
**.host[0]*.pingApp[0].stopTime = 5s
**.host[0]*.pingApp[1].destAddr = "host[2]"
**.host[0]*.pingApp[1].startTime = 0.1 s
**.host[0]*.pingApp[1].stopTime = 5.1 s
**.host[0]*.pingApp[2].destAddr = "host[3]"
**.host[0]*.pingApp[2].startTime = 0.2 s
**.host[0]*.pingApp[2].stopTime = 5.2 s
**.host[0]*.pingApp[3].destAddr = "host[4]"
**.host[0]*.pingApp[3].startTime = 0.3 s
**.host[0]*.pingApp[3].stopTime = 5.3 s
and so on...
How can I create a loop that automatically sets the parameters startTime, stopTime, destAddr, and the pingApp index?
startTime and stopTime should increase by 0.1 s for every one-point increase of the pingApp index and the destination host index.
Help me please!
Thank you!
Actually, every host should have only one Ping Application. To achieve your goal you can use the following settings:
**.host[*].numApps = 1
**.host[*].app[0].typename = "PingApp"
**.host[999].app[0].destAddr = "host[0]"
**.host[*].app[0].destAddr = "host[" + string(parentIndex()+1) + "]"
**.host[*].app[0].startTime = replaceUnit (0.1*(parentIndex()), "s")
**.host[*].app[0].stopTime = replaceUnit (5 + 0.1*(parentIndex()), "s")
parentIndex() returns the index of the host in the vector of hosts; see the OMNeT++ Manual. For the last node (i.e. host[999]) destAddr is set by hand, because parentIndex()+1 would return 1000, and there is no host[1000].
The second NED function, replaceUnit(), is used to add the unit to the result of the calculation.
Here is another quasi-solution:
From the PingApp's documentation:
string destAddr = default(""); // destination address(es), separated by spaces, "*" means all IPv4/IPv6 interfaces in entire simulation
Specifying '*' allows pinging ALL configured network interfaces in the
whole simulation. This is useful to check if a host can reach ALL other
hosts in the network (i.e. routing tables were set up properly).
To specify the number of ping requests sent to a single destination address,
use the 'count' parameter. After the specified number of ping requests was
sent to a destination address, the application goes to sleep for 'sleepDuration'.
Once the sleep timer has expired, the application switches to the next destination
and starts pinging again. The application stops pinging once all destination
addresses were tested or the simulation time reaches 'stopTime'.
So if these are the only hosts in the network and you don't mind that the host pings itself too in the beginning, you can set destAddr = "*" and count = 1.
I combined the answers of @Rudi and @JerzyD. and got a working solution:
**.host[0]*.numPingApps = 999
**.host[0]*.pingApp[*].typename = "PingApp"
**.host[0]*.pingApp[*].sendInterval = 1 s
**.host[0]*.pingApp[*].packetSize = 42 B
**.host[0]*.pingApp[0..998].destAddr = "host[" + string(index()+1) + "]"
**.host[0]*.pingApp[0..998].startTime = replaceUnit (0.1 * (index()), "s")
**.host[0]*.pingApp[0..998].stopTime = replaceUnit (5 + 0.1 * (index()), "s")

Confused about the use of validation set here

In main.py of the px2graph project, the training and validation part looks like this:
splits = [s for s in ['train', 'valid'] if opt.iters[s] > 0]
start_round = opt.last_round - opt.num_rounds
# Main training loop
for round_idx in range(start_round, opt.last_round):
    for split in splits:
        print("Round %d: %s" % (round_idx, split))
        loader.start_epoch(sess, split, train_flag, opt.iters[split] * opt.batchsize)
        flag_val = split == 'train'
        for step in tqdm(range(opt.iters[split]), ascii=True):
            global_step = step + round_idx * opt.iters[split]
            to_run = [sample_idx, summaries[split], loss, accuracy]
            if split == 'train': to_run += [optim]
            # Do image summaries at the end of each round
            do_image_summary = step == opt.iters[split] - 1
            if do_image_summary: to_run[1] = image_summaries[split]
            # Start with lower learning rate to prevent early divergence
            t = 1/(1+np.exp(-(global_step-5000)/1000))
            lr_start = opt.learning_rate / 15
            lr_end = opt.learning_rate
            tmp_lr = (1-t) * lr_start + t * lr_end
            # Run computation graph
            result = sess.run(to_run, feed_dict={train_flag:flag_val, lr:tmp_lr})
            out_loss = result[2]
            out_accuracy = result[3]
            if sum(out_loss) > 1e5:
                print("Loss diverging...exiting before code freezes due to NaN values.")
                print("If this continues you may need to try a lower learning rate, a")
                print("different optimizer, or a larger batch size.")
                return
            time_str = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("{}: step {}, loss {:g}, acc {:g}".format(time_str, global_step, out_loss, out_accuracy))
            # Log data
            if split == 'valid' or (split == 'train' and step % 20 == 0) or do_image_summary:
                writer.add_summary(result[1], global_step)
                writer.flush()
        # Save training snapshot
        saver.save(sess, 'exp/' + opt.exp_id + '/snapshot')
        with open('exp/' + opt.exp_id + '/last_round', 'w') as f:
            f.write('%d\n' % round_idx)
It seems that the author only gets the result for each batch of the validation set. I am wondering: if I want to observe whether the model is improving or has reached its best performance, should I use the result on the whole validation set?
If the validation set is small enough, we could calculate the loss and accuracy on the whole validation set during training to observe the performance. However, if the validation set is too large, it is better to calculate batch-wise validation results and average them over multiple steps.
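For example, a minimal sketch of such a whole-validation-set evaluation, reusing the sess, loader, loss, accuracy, train_flag and opt names from the snippet above (depending on the graph you may also need to feed the lr placeholder), could accumulate the per-batch results and average them:
val_losses, val_accs = [], []
loader.start_epoch(sess, 'valid', train_flag, opt.iters['valid'] * opt.batchsize)
for step in range(opt.iters['valid']):
    # No optimizer in the run list, so this pass only evaluates
    out_loss, out_acc = sess.run([loss, accuracy], feed_dict={train_flag: False})
    val_losses.append(np.mean(out_loss))
    val_accs.append(out_acc)
print("validation loss %.4f, accuracy %.4f" % (np.mean(val_losses), np.mean(val_accs)))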

How to speed up the addition of a new column in pandas, based on comparisons against an existing one

I am working on a large-ish dataframe collection with some machine data in several tables. The goal is to add a column to every table which expresses the row's "class", considering its vicinity to a certain time stamp.
seconds = 1800
for i in range(len(tables)): # looping over 20 equally structured tables containing machine data
    table = tables[i]
    table['Class'] = 'no event'
    for event in events[i].values: # looping over 20 equally structured tables containing events
        event_time = event[1] # get integer time stamp
        start_time = event_time - seconds
        table.loc[(table.Time<=event_time) & (table.Time>=start_time), 'Class'] = 'event soon'
The event_times and the entries in table.Time are integers. The point is to assign the class "event soon" to all rows in a specific time frame before an event (the number of seconds).
The code takes quite long to run, and I am not sure what is to blame or what can be fixed. The number of seconds does not have much impact on the runtime, so the part where the table is actually changed is probably working fine, and it may have to do with the nested loops instead. However, I don't see how to get rid of them. Hopefully, there is a faster, more pandas-like way to go about adding this class column.
I am working with Python 3.6 and Pandas 0.19.2
You can use NumPy broadcasting to do this vectorised instead of looping.
Dummy data generation
import numpy as np
import pandas as pd

num_tables = 5
seconds = 1800

def gen_table(count):
    for i in range(count):
        times = [(100 + j)**2 for j in range(i, 50 + i)]
        df = pd.DataFrame(data={'Time': times})
        yield df

def gen_events(count, num_tables):
    for i in range(num_tables):
        times = [1E4 + 100 * (i + j)**2 for j in range(count)]
        yield pd.DataFrame(data={'events': times})

tables = list(gen_table(num_tables))  # a list of 5 DataFrames of length 50
events = list(gen_events(5, num_tables))  # a list of 5 DataFrames of length 5
Comparison
For debugging, I added a dict of verification DataFrames. They are not needed for the solution; I just used them for debugging.
verification = {}

for i, (table, event_df) in enumerate(zip(tables, events)):
    event_list = event_df['events']
    time_diff = event_list.values - table['Time'].values[:, np.newaxis]  # This is where the magic happens
    events_close = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    table['Class'] = np.where(events_close, 'event soon', 'no event')

    # The stuff after this line can be deleted since it's only used for the verification
    df = pd.DataFrame(data=time_diff, index=table['Time'], columns=event_list)
    df['event'] = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    verification[i] = df
newaxis
A good explanation of broadcasting can be found in Jake VanderPlas's ("jakevdp") Python Data Science Handbook.
table['Time'].values[:,np.newaxis]
gives a (50,1) 2-d array
array([[10000],
[10201],
[10404],
....
[21609],
[21904],
[22201]], dtype=int64)
Verification
For the first step the verification df looks like this:
events 10000.0 10100.0 10400.0 10900.0 11600.0 event
Time
10000 0.0 100.0 400.0 900.0 1600.0 True
10201 -201.0 -101.0 199.0 699.0 1399.0 True
10404 -404.0 -304.0 -4.0 496.0 1196.0 True
10609 -609.0 -509.0 -209.0 291.0 991.0 True
10816 -816.0 -716.0 -416.0 84.0 784.0 True
11025 -1025.0 -925.0 -625.0 -125.0 575.0 True
11236 -1236.0 -1136.0 -836.0 -336.0 364.0 True
11449 -1449.0 -1349.0 -1049.0 -549.0 151.0 True
11664 -1664.0 -1564.0 -1264.0 -764.0 -64.0 False
11881 -1881.0 -1781.0 -1481.0 -981.0 -281.0 False
12100 -2100.0 -2000.0 -1700.0 -1200.0 -500.0 False
12321 -2321.0 -2221.0 -1921.0 -1421.0 -721.0 False
12544 -2544.0 -2444.0 -2144.0 -1644.0 -944.0 False
....
20449 -10449.0 -10349.0 -10049.0 -9549.0 -8849.0 False
20736 -10736.0 -10636.0 -10336.0 -9836.0 -9136.0 False
21025 -11025.0 -10925.0 -10625.0 -10125.0 -9425.0 False
21316 -11316.0 -11216.0 -10916.0 -10416.0 -9716.0 False
21609 -11609.0 -11509.0 -11209.0 -10709.0 -10009.0 False
21904 -11904.0 -11804.0 -11504.0 -11004.0 -10304.0 False
22201 -12201.0 -12101.0 -11801.0 -11301.0 -10601.0 False
Small optimizations of the original code
You can shave a few lines and some assignments off the original algorithm:
for table, event_df in zip(tables, events):
    table['Class'] = 'no event'
    for event_time in event_df['events']:  # looping over the events belonging to this table
        start_time = event_time - seconds
        table.loc[table['Time'].between(start_time, event_time), 'Class'] = 'event soon'
You might shave off some more time if, instead of the strings 'no event' and 'event soon', you just used booleans, as sketched below.
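A minimal sketch of that boolean variant, reusing the broadcasting approach from above (the column name EventSoon is just an example):
for table, event_df in zip(tables, events):
    time_diff = event_df['events'].values - table['Time'].values[:, np.newaxis]
    # True where at least one event follows within `seconds`, False otherwise
    table['EventSoon'] = np.any((0 < time_diff) & (time_diff < seconds), axis=1)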

Calculate time remaining with different length of variables

I will have to admit the title of this question sucks... I couldn't come up with a better description. Let me see if I can give an example.
I have about 2700 customers who at one time had my software installed on their server; 1500 or so still do. Basically, I have an auto-diagnostics process to help weed out people who have uninstalled, or who have problems with the software that we should assist with. Currently we have a cURL request fetching their website, looking for our software and checking for a header in return.
We have 8 different statuses that are returned
GREEN - Everything works (usually pretty quick 0.5 - 2 seconds)
RED - Software not found (usually the longest from 5 - 15 seconds)
BLUE - Software found but not activated (usually from 3 - 9 seconds)
YELLOW - Server IP mismatch (usually from 1 - 3 seconds)
ORANGE - Server IP mismatch and wrong software type (usually 5 - 10 seconds)
PURPLE - Activation key incorrect (usually within 2 seconds)
BLACK - Domain returns 404 - No longer exists (usually within a second)
UNK - Connection failed (usually due to our load balancer -- VERY rare) (never encountered this yet)
Now basically what happens is a cron job starts the process by pulling the domain and product type. It then cURLs the domain and starts cycling through the status colors above.
While this is happening, we have an AJAX page that returns the results so we can keep an eye on the status. The major problem is that the Time Remaining estimate is so volatile that it is not a good estimate. Here is the current math:
# Number of accounts between NOW and when started
$completedAccounts = floor($parseData[2]*($parseData[1]/100));
# Number of seconds between NOW and when started
$completedTime = strtotime("now") - strtotime("$hour:$minute:$second");
# Avg number of seconds per account
$avgPerCompleted = $completedTime / $completedAccounts;
# Total number of remaining accounts to be scanned
$remainingAccounts = $parseData[2] - $completedAccounts;
# The total of seconds remaining for all of the remaining accounts
$remainingSeconds = $remainingAccounts * $avgPerCompleted;
$remainingTime = format_time($remainingSeconds, ":");
I could keep a count of all of the green, red, blue, etc. statuses and average how long each color takes, then use that for the average time, although I don't believe that would give much better results.
Since the times vary so much, any suggestions would be appreciated.
Thanks,
Jeff
OK, I believe I have figured it out. I had to create a class so I could calculate a simple linear regression over the timing data.
function calc() {
    $n = count($this->mDatas);
    $vSumXX = $vSumXY = $vSumX = $vSumY = 0;
    //var_dump($this->mDatas);
    $vCnt = 0; // for time-series, start at t=0
    foreach ($this->mDatas AS $vOne) {
        if (is_array($vOne)) { // x,y pair
            list($x,$y) = $vOne;
        } else { // time-series
            $x = $vCnt; $y = $vOne;
        } // fi
        $vSumXY += $x*$y;
        $vSumXX += $x*$x;
        $vSumX += $x;
        $vSumY += $y;
        $vCnt++;
    } // rof
    $vTop = ($n*$vSumXY - $vSumX*$vSumY);
    $vBottom = ($n*$vSumXX - $vSumX*$vSumX);
    $a = $vBottom != 0 ? $vTop/$vBottom : 0;
    $b = ($vSumY - $a*$vSumX)/$n;
    //var_dump($a,$b);
    return array($a,$b);
}
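For reference, calc() computes the standard least-squares slope and intercept:
a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - a \sum x_i}{n}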
I take each account and build an array of the amount of time each one takes. The array is then run through this calculation, which builds the x and y time series. Finally, I run the result through the predict function.
/** given x, return the prediction y */
function calcpredict($x) {
    list($a,$b) = $this->calc();
    $y = $a*$x+$b;
    return $y;
}
I put static values in so you could see the results:
$eachTime = array(7,1,.5,12,11,6,3,.24,.12,.28,2,1,14,8,4,1,.15,1,12,3,8,4,5,8,.3,.2,.4,.6,4,5);
$forecastProcess = new Linear($eachTime);
$forecastTime = $forecastProcess->calcpredict(5);
This overall system gives me about a 0.003 difference for 10 accounts and about a 2.6 difference for 2700 accounts. Next will be to calculate the accuracy.
Thanks for trying guys and gals

Calculate running/cumulative cost of EC2 spot instance

I often run spot instances on EC2 (for Hadoop task jobs, temporary nodes, etc.) Some of these are long-running spot instances.
It's fairly easy to calculate the cost for on-demand or reserved EC2 instances - but how do I calculate the cost incurred for a specific node (or nodes) that are running as spot instances?
I am aware that the cost for a spot instance changes every hour depending on market rate - so is there any way to calculate the cumulative total cost for a running spot instance? Through an API or otherwise?
OK I found a way to do this in the Boto library. This code is not perfect - Boto doesn't seem to return the exact time range, but it does get the historic spot prices more or less within a range. The following code seems to work quite well. If anyone can improve on it, that would be great.
import boto, datetime, time
# Enter your AWS credentials
aws_key = "YOUR_AWS_KEY"
aws_secret = "YOUR_AWS_SECRET"
# Details of instance & time range you want to find spot prices for
instanceType = 'm1.xlarge'
startTime = '2012-07-01T21:14:45.000Z'
endTime = '2012-07-30T23:14:45.000Z'
aZ = 'us-east-1c'
# Some other variables
maxCost = 0.0
minTime = float("inf")
maxTime = 0.0
totalPrice = 0.0
oldTimee = 0.0
# Connect to EC2
conn = boto.connect_ec2(aws_key, aws_secret)
# Get prices for instance, AZ and time range
prices = conn.get_spot_price_history(instance_type=instanceType,
start_time=startTime, end_time=endTime, availability_zone=aZ)
# Output the prices
print "Historic prices"
for price in prices:
    timee = time.mktime(datetime.datetime.strptime(price.timestamp,
        "%Y-%m-%dT%H:%M:%S.000Z").timetuple())
    print "\t" + price.timestamp + " => " + str(price.price)

    # Get max and min time from results
    if timee < minTime:
        minTime = timee
    if timee > maxTime:
        maxTime = timee

    # Get the max cost
    if price.price > maxCost:
        maxCost = price.price

    # Calculate total price
    if not (oldTimee == 0):
        totalPrice += (price.price * abs(timee - oldTimee)) / 3600
    oldTimee = timee
# Difference b/w first and last returned times
timeDiff = maxTime - minTime
# Output aggregate, average and max results
print "For: one %s in %s" % (instanceType, aZ)
print "From: %s to %s" % (startTime, endTime)
print "\tTotal cost = $" + str(totalPrice)
print "\tMax hourly cost = $" + str(maxCost)
print "\tAvg hourly cost = $" + str(totalPrice * 3600/ timeDiff)
I've rewritten Suman's solution to work with boto3. Make sure to use UTC datetimes with the timezone set:
def get_spot_instance_pricing(ec2, instance_type, start_time, end_time, zone):
    result = ec2.describe_spot_price_history(InstanceTypes=[instance_type], StartTime=start_time, EndTime=end_time, AvailabilityZone=zone)
    assert 'NextToken' not in result or result['NextToken'] == ''

    total_cost = 0.0
    total_seconds = (end_time - start_time).total_seconds()
    total_hours = total_seconds / (60*60)
    computed_seconds = 0
    last_time = end_time

    for price in result["SpotPriceHistory"]:
        price["SpotPrice"] = float(price["SpotPrice"])
        available_seconds = (last_time - price["Timestamp"]).total_seconds()
        remaining_seconds = total_seconds - computed_seconds
        used_seconds = min(available_seconds, remaining_seconds)
        total_cost += (price["SpotPrice"] / (60 * 60)) * used_seconds
        computed_seconds += used_seconds
        last_time = price["Timestamp"]

    # Average hourly cost over the full requested time window
    avg_hourly_cost = total_cost / total_hours
    return avg_hourly_cost, total_cost, total_hours
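A minimal usage sketch (region, availability zone, instance type and the time window below are placeholder values):
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2', region_name='us-east-1')
end = datetime.now(timezone.utc)          # timezone-aware, as required above
start = end - timedelta(days=7)
avg, total, hours = get_spot_instance_pricing(ec2, 'm5.xlarge', start, end, 'us-east-1a')
print('%.1f hours, total $%.2f, avg $%.4f/hour' % (hours, total, avg))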
You can subscribe to the spot instance data feed to get charges for your running instances dumped to an S3 bucket. Install the ec2 toolset and then run:
ec2-create-spot-datafeed-subscription -b bucket-to-dump-in
Note: you can have only one data feed subscription for your entire account.
In about an hour you should start seeing gzipped tab-delimited files show up in the bucket that look something like this:
#Version: 1.0
#Fields: Timestamp UsageType Operation InstanceID MyBidID MyMaxPrice MarketPrice Charge Version
2013-05-20 14:21:07 UTC SpotUsage:m1.xlarge RunInstances:S0012 i-1870f27d sir-b398b235 0.219 USD 0.052 USD 0.052 USD 1
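Once the files show up, summing the Charge column gives the running cost. Here is a minimal parsing sketch, assuming one of the gzipped files has been downloaded locally (the file name is hypothetical; the column positions follow the #Fields header above):
import csv, gzip

total = 0.0
with gzip.open('123456789012.2013-05-20-14.004.e29fde26.gz', 'rt') as f:  # hypothetical file name
    for row in csv.reader(f, delimiter='\t'):
        if not row or row[0].startswith('#'):
            continue                    # skip the #Version / #Fields header lines
        charge = row[7]                 # Charge field, e.g. "0.052 USD"
        total += float(charge.split()[0])
print('Cumulative charge: $%.3f' % total)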
I have recently developed a small Python library that calculates the cost of a single EMR cluster, or of a list of clusters (over a given period of days).
It also takes into account spot instances and task nodes (which may go up and down while the cluster is still running).
In order to calculate the cost it uses the bid price, which (in many cases) might not be the exact price that you end up paying for the instance.
Depending on your bidding policy, however, this price can be accurate enough.
You can find the code here: https://github.com/memosstilvi/emr-cost-calculator
