Calculate running/cumulative cost of EC2 spot instance - amazon-ec2

I often run spot instances on EC2 (for Hadoop task jobs, temporary nodes, etc.). Some of these are long-running spot instances.
It's fairly easy to calculate the cost of on-demand or reserved EC2 instances, but how do I calculate the cost incurred for a specific node (or nodes) running as spot instances?
I am aware that the price of a spot instance changes hourly depending on the market rate, so is there any way to calculate the cumulative total cost for a running spot instance? Through an API or otherwise?

OK, I found a way to do this with the Boto library. The code is not perfect - Boto doesn't seem to honor the exact time range, but it does return the historic spot prices more or less within the requested window. The following code seems to work quite well. If anyone can improve on it, that would be great.
import boto, datetime, time

# Enter your AWS credentials
aws_key = "YOUR_AWS_KEY"
aws_secret = "YOUR_AWS_SECRET"

# Details of instance & time range you want to find spot prices for
instanceType = 'm1.xlarge'
startTime = '2012-07-01T21:14:45.000Z'
endTime = '2012-07-30T23:14:45.000Z'
aZ = 'us-east-1c'

# Some other variables
maxCost = 0.0
minTime = float("inf")
maxTime = 0.0
totalPrice = 0.0
oldTimee = 0.0

# Connect to EC2
conn = boto.connect_ec2(aws_key, aws_secret)

# Get prices for instance, AZ and time range
prices = conn.get_spot_price_history(instance_type=instanceType,
                                     start_time=startTime,
                                     end_time=endTime,
                                     availability_zone=aZ)

# Output the prices
print "Historic prices"
for price in prices:
    timee = time.mktime(datetime.datetime.strptime(price.timestamp,
                                                   "%Y-%m-%dT%H:%M:%S.000Z").timetuple())
    print "\t" + price.timestamp + " => " + str(price.price)
    # Get max and min time from results
    if timee < minTime:
        minTime = timee
    if timee > maxTime:
        maxTime = timee
    # Get the max cost
    if price.price > maxCost:
        maxCost = price.price
    # Calculate total price
    if not (oldTimee == 0):
        totalPrice += (price.price * abs(timee - oldTimee)) / 3600
    oldTimee = timee

# Difference b/w first and last returned times
timeDiff = maxTime - minTime

# Output aggregate, average and max results
print "For: one %s in %s" % (instanceType, aZ)
print "From: %s to %s" % (startTime, endTime)
print "\tTotal cost = $" + str(totalPrice)
print "\tMax hourly cost = $" + str(maxCost)
print "\tAvg hourly cost = $" + str(totalPrice * 3600 / timeDiff)

I've rewritten Suman's solution to work with boto3. Make sure to pass timezone-aware UTC datetimes for start_time and end_time:
def get_spot_instance_pricing(ec2, instance_type, start_time, end_time, zone):
    result = ec2.describe_spot_price_history(InstanceTypes=[instance_type], StartTime=start_time, EndTime=end_time, AvailabilityZone=zone)
    assert 'NextToken' not in result or result['NextToken'] == ''

    total_cost = 0.0
    total_seconds = (end_time - start_time).total_seconds()
    total_hours = total_seconds / (60 * 60)
    computed_seconds = 0
    last_time = end_time

    # The history is returned newest-first; attribute each price to the
    # interval during which it was in effect.
    for price in result["SpotPriceHistory"]:
        price["SpotPrice"] = float(price["SpotPrice"])
        available_seconds = (last_time - price["Timestamp"]).total_seconds()
        remaining_seconds = total_seconds - computed_seconds
        used_seconds = min(available_seconds, remaining_seconds)
        total_cost += (price["SpotPrice"] / (60 * 60)) * used_seconds
        computed_seconds += used_seconds
        last_time = price["Timestamp"]

    avg_hourly_cost = total_cost / total_hours
    return avg_hourly_cost, total_cost, total_hours
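For reference, a minimal usage sketch of the function above (the region, availability zone and instance type are just placeholders):

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2', region_name='us-east-1')
end = datetime.now(timezone.utc)          # timezone-aware, as required above
start = end - timedelta(days=7)

avg_hourly, total, hours = get_spot_instance_pricing(
    ec2, 'm5.xlarge', start, end, 'us-east-1c')
print("Avg hourly: $%.4f, total: $%.2f over %.1f hours" % (avg_hourly, total, hours))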

You can subscribe to the spot instance data feed to get the charges for your running instances dumped to an S3 bucket. Install the EC2 command-line tools and then run:
ec2-create-spot-datafeed-subscription -b bucket-to-dump-in
Note: you can have only one data feed subscription for your entire account.
In about an hour you should start seeing gzipped, tab-delimited files show up in the bucket that look something like this:
#Version: 1.0
#Fields: Timestamp UsageType Operation InstanceID MyBidID MyMaxPrice MarketPrice Charge Version
2013-05-20 14:21:07 UTC SpotUsage:m1.xlarge RunInstances:S0012 i-1870f27d sir-b398b235 0.219 USD 0.052 USD 0.052 USD 1
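If you prefer to set this up from Python, boto3 can create the same subscription (a minimal sketch; the bucket name is the placeholder used above and must already exist):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
ec2.create_spot_datafeed_subscription(
    Bucket='bucket-to-dump-in',    # S3 bucket the feed files are written to
    Prefix='spot-data-feed/')      # optional key prefix for the feed files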

I have recently developed a small Python library that calculates the cost of a single EMR cluster, or of a list of clusters, over a given period of days.
It takes Spot instances and Task nodes into account as well (both may go up and down while the cluster is still running).
To calculate the cost I use the bid price, which (in many cases) might not be the exact price you end up paying for the instance.
Depending on your bidding policy, however, this price can be accurate enough.
You can find the code here: https://github.com/memosstilvi/emr-cost-calculator
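The core of such an estimate is essentially bid price times instance hours, summed over instance groups. A minimal sketch under that assumption (the dictionary fields below are illustrative, not the library's actual API):

# Rough bid-price-based estimate; assumes each instance group records its
# bid price, instance count and how many hours it ran.
instance_groups = [
    {'bid_price': 0.052, 'instance_count': 4, 'hours': 12.5},  # task nodes
    {'bid_price': 0.260, 'instance_count': 1, 'hours': 30.0},  # core node
]
total = sum(g['bid_price'] * g['instance_count'] * g['hours'] for g in instance_groups)
print("Estimated cluster cost: $%.2f" % total)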

Related

Quantlib Zero Coupon Inflation Swap Ignores CPI Values

I am trying to price a zero-coupon USD CPI inflation swap in QuantLib and Python. My discount curve and the NPV of the fixed leg look good, but I'm a few percentage points off compared to BBG SWPM on the NPV of the inflation leg.
One thing I've noticed is that changing the CPI values has no effect on the price of the swap, so I think I'm setting the CPI wrong, and as a result the base index of the swap is wrong. Can anyone see what I'm doing wrong here?
For reference, I've backed this out from looking at the C++ unit tests. If there's a complete Python example I would be interested to see it, but couldn't find one on my own.
import QuantLib as quantlib
import pandas as pd
start_date = quantlib.Date.from_date(pd.Timestamp(2022, 9, 6))
calc_date = quantlib.Date.from_date(pd.Timestamp(2022, 9, 6))
end_date = quantlib.Date.from_date(pd.Timestamp(2024, 9, 6))
swap_type = quantlib.ZeroCouponInflationSwap.Receiver
calendar = quantlib.TARGET()
day_count_convention = quantlib.ActualActual()
contract_observation_lag = quantlib.Period(3, quantlib.Months)
business_day_convention = quantlib.ModifiedFollowing
nominal = 10e6
fixed_rate = 0.05
cpi_json = '{"columns":[1],"index":[1640908800000,1643587200000,1646006400000,1648684800000,1651276800000,1653955200000,1656547200000,1659225600000],"data":[[277.948],[278.802],[283.716],[287.504],[289.109],[292.296],[296.311],[296.276]]}'
cpi_prints = pd.read_json(cpi_json, orient='split')
# Pretty-printed CPI:
# 1
# 2021-12-31 277.948
# 2022-01-31 278.802
# 2022-02-28 283.716
# 2022-03-31 287.504
# 2022-04-30 289.109
# 2022-05-31 292.296
# 2022-06-30 296.311
# 2022-07-31 296.276
zero_coupon_observations = pd.DataFrame(index=[0],
data={'1Y': 2.73620,
'2Y': 2.975,
'3Y': 2.967,
'4Y': 2.917,
'5Y': 2.8484})
inflation_yield_term_structure = quantlib.RelinkableZeroInflationTermStructureHandle()
inflation_index = quantlib.USCPI(True, inflation_yield_term_structure)
for date, value in cpi_prints.itertuples():
    # Setting the CPI as fixings, but no matter what I put here the NPV comes out the same.
    # Looks like the base index for the swap is not being set by the CPI prints I put here.
    inflation_index.addFixing(quantlib.Date.from_date(date), value)
inflation_rate_helpers = []
nominal_term_structure = quantlib.YieldTermStructureHandle(quantlib.FlatForward(calc_date,
0.00, # Changing this seems to have no effect
quantlib.ActualActual()))
for tenor in zero_coupon_observations.columns:
    maturity = calendar.advance(calc_date, quantlib.Period(tenor))
    quote = quantlib.QuoteHandle(quantlib.SimpleQuote(zero_coupon_observations.at[0, tenor] / 100.0))
    helper = quantlib.ZeroCouponInflationSwapHelper(quote,
                                                    contract_observation_lag,
                                                    maturity,
                                                    calendar,
                                                    business_day_convention,
                                                    day_count_convention,
                                                    inflation_index,
                                                    nominal_term_structure)
    inflation_rate_helpers.append(helper)
# Not sure how to choose this number, just taking the 1Y tenor on the calc date?
# I'm pricing a 2Y swap, and will want to price it off its start date as well
base_zero_rate = zero_coupon_observations.at[0, '1Y']/100
inflation_curve = quantlib.PiecewiseZeroInflation(calc_date,
calendar,
day_count_convention,
contract_observation_lag,
quantlib.Monthly,
inflation_index.interpolated(),
base_zero_rate,
inflation_rate_helpers,
1.0e-12,
quantlib.Linear())
inflation_yield_term_structure.linkTo(inflation_curve)
swap = quantlib.ZeroCouponInflationSwap(swap_type,
nominal,
start_date,
end_date,
calendar,
business_day_convention,
day_count_convention,
fixed_rate,
inflation_index,
contract_observation_lag)
# Leaving off the construction of the discount curve for brevity.
# NPV of the fixed legs checks out
discount_curve = ...
swap_engine = quantlib.DiscountingSwapEngine(discount_curve)
swap.setPricingEngine(swap_engine)
print(swap.NPV())

How to add a maximum travel time duration for the sum of all routes in VRP Google OR-TOOLS

I am new to programming and used Google OR-Tools to create my VRP model. In my current model, I have included a general time window and a capacity constraint per vehicle, creating a capacitated vehicle routing problem with time windows. I followed the OR-Tools guides, which include a maximum travel duration for each vehicle.
However, I want to include a maximum travel duration for the sum of all routes, whereas the maximum travel duration for each vehicle does not matter (so I set it to 100,000). Accordingly, I want to create something in the model/solution printer that tells me how many addresses could not be visited due to the constraint on the maximum travel duration for the sum of all routes. From the examples I have seen I think it should be fairly easy, but my knowledge of programming is limited, so my attempts have not been successful. Can anyone help me?
import pandas as pd
import openpyxl
import numpy as np
import math
from random import sample
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp
from scipy.spatial.distance import squareform, pdist
from haversine import haversine
#STEP - create data
# import/read excel file
data = pd.read_excel(r'C:\Users\Jean-Paul\Documents\Thesis\OR TOOLS\Data.xlsx', engine = 'openpyxl')
df = pd.DataFrame(data, columns= ['number','lat','lng']) # create dataframe with 10805 addresses + address of the depot
#print (df)
# randomly sample X addresses from the dataframe and their corresponding number/latitude/longtitude
df_sample = df.sample(n=100)
#print (df_data)
# read first row of the excel file (= coordinates of the depot)
df_depot = pd.DataFrame(data, columns= ['number','lat','lng']).iloc[0:1]
#print (df_depot)
# combine dataframe of depot and sample into one dataframe
df_data = pd.concat([df_depot, df_sample], ignore_index=True, sort=False)
#print (df_data)
#STEP - create distance matrix data
# determine distance between latitude and longtitude
df_data.set_index('number', inplace=True)
matrix_distance = pd.DataFrame(squareform(pdist(df_data, metric=haversine)), index=df_data.index, columns=df_data.index)
matrix_list = np.array(matrix_distance)
#print (matrix_distance) # create table of distances between addresses including headers
#print (matrix_list) # converting table to list of lists and exclude headers
#STEP - create time matrix data
travel_time = matrix_list / 15 * 60 # divide distance (km) by a travel speed of 15 km/h and multiply by 60 to get minutes
#print (travel_time) # converting distance matrix to travel time matrix
#STEP - create time window data
# create list for each sample - couriers have to visit this address within 0-X minutes of time using a list of lists
window_range = []
for i in range(len(df_data)):
    list = [0, 240]
    window_range.append(list) # create a list of lists with a time window range for each address
#print (window_range)
#STEP - create demand data
# create list for each sample - all addresses demand 1 parcel except the depot
demand_range = []
for i in range(len(df_data.iloc[0:1])):
    list = 0
    demand_range.append(list)
for j in range(len(df_data.iloc[1:])):
    list2 = 1
    demand_range.append(list2)
#print (demand_range)
#STEP - create fleet size data # amount of vehicles in the fleet
fleet_size = 6
#print (fleet_size)
#STEP - create capacity data for each vehicle
fleet_capacity = []
for i in range(fleet_size): # capacity per vehicle
    list = 20
    fleet_capacity.append(list)
#print (fleet_capacity)
#STEP - create data model that stores all data for the problem
def create_data_model():
    data = {}
    data['time_matrix'] = travel_time
    data['time_windows'] = window_range
    data['num_vehicles'] = fleet_size
    data['depot'] = 0 # index of the depot
    data['demands'] = demand_range
    data['vehicle_capacities'] = fleet_capacity
    return data
#STEP - creating the solution printer
def print_solution(data, manager, routing, solution):
    """Prints solution on console."""
    print(f'Objective: {solution.ObjectiveValue()}')
    time_dimension = routing.GetDimensionOrDie('Time')
    total_time = 0
    for vehicle_id in range(data['num_vehicles']):
        index = routing.Start(vehicle_id)
        plan_output = 'Route for vehicle {}:\n'.format(vehicle_id)
        while not routing.IsEnd(index):
            time_var = time_dimension.CumulVar(index)
            plan_output += '{0} Time({1},{2}) -> '.format(
                manager.IndexToNode(index), solution.Min(time_var),
                solution.Max(time_var))
            index = solution.Value(routing.NextVar(index))
        time_var = time_dimension.CumulVar(index)
        plan_output += '{0} Time({1},{2})\n'.format(manager.IndexToNode(index),
                                                    solution.Min(time_var),
                                                    solution.Max(time_var))
        plan_output += 'Time of the route: {}min\n'.format(
            solution.Min(time_var))
        print(plan_output)
        total_time += solution.Min(time_var)
    print('Total time of all routes: {}min'.format(total_time))
#STEP - create the VRP solver
def main():
    # instantiate the data problem
    data = create_data_model()
    # create the routing index manager
    manager = pywrapcp.RoutingIndexManager(len(data['time_matrix']),
                                           data['num_vehicles'], data['depot'])
    # create routing model
    routing = pywrapcp.RoutingModel(manager)
    #STEP - create demand callback and dimension for capacity
    # create and register a transit callback
    def demand_callback(from_index):
        """Returns the demand of the node."""
        # convert from routing variable Index to demands NodeIndex
        from_node = manager.IndexToNode(from_index)
        return data['demands'][from_node]
    demand_callback_index = routing.RegisterUnaryTransitCallback(
        demand_callback)
    routing.AddDimensionWithVehicleCapacity(
        demand_callback_index,
        0, # null capacity slack
        data['vehicle_capacities'], # vehicle maximum capacities
        True, # start cumul to zero
        'Capacity')
    #STEP - create time callback
    # create and register a transit callback
    def time_callback(from_index, to_index):
        """Returns the travel time between the two nodes."""
        # convert from routing variable Index to time matrix NodeIndex
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data['time_matrix'][from_node][to_node]
    transit_callback_index = routing.RegisterTransitCallback(time_callback)
    # define cost of each arc (costs in terms of travel time)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)
    # STEP - create a dimension for the travel time (TIMEWINDOW) - a dimension keeps track of quantities that accumulate over a vehicle's route
    # add time windows constraint
    time = 'Time'
    routing.AddDimension(
        transit_callback_index,
        2, # allow waiting time (does not have an influence in this model)
        100000, # maximum total route length in minutes per vehicle (does not have an influence because of the capacity constraint)
        False, # do not force start cumul to zero
        time)
    time_dimension = routing.GetDimensionOrDie(time)
    # add time window constraints for each location except the depot
    for location_idx, time_window in enumerate(data['time_windows']):
        if location_idx == data['depot']:
            continue
        index = manager.NodeToIndex(location_idx)
        time_dimension.CumulVar(index).SetRange(time_window[0], time_window[1])
    # add time window constraint for each vehicle start node
    depot_idx = data['depot']
    for vehicle_id in range(data['num_vehicles']):
        index = routing.Start(vehicle_id)
        time_dimension.CumulVar(index).SetRange(
            data['time_windows'][depot_idx][0],
            data['time_windows'][depot_idx][1])
    #STEP - instantiate route start and end times to produce feasible times
    for i in range(data['num_vehicles']):
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.Start(i)))
        routing.AddVariableMinimizedByFinalizer(
            time_dimension.CumulVar(routing.End(i)))
    #STEP - set default search parameters and a heuristic method for finding the first solution
    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
    #STEP - solve the problem with the search parameters and print the solution
    solution = routing.SolveWithParameters(search_parameters)
    if solution:
        print_solution(data, manager, routing, solution)
if __name__ == '__main__':
    main()
See Mizux's answer, which goes under the hood in the solver to build a summation cost over all vehicle route lengths:
https://stackoverflow.com/a/68756570/13773745
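A rough sketch of that idea applied to the model above: bound the sum of the 'Time' dimension's end cumuls, and make the non-depot nodes droppable so you can report how many addresses were skipped. This is only one way to express it (max_total_minutes and drop_penalty are made-up parameters), not necessarily Mizux's exact code:

# Inside main(), after time_dimension has been created:
max_total_minutes = 1000   # hypothetical budget for the sum of all routes
drop_penalty = 100000      # cost of skipping an address; tune vs. arc costs
solver = routing.solver()
solver.Add(
    solver.Sum([time_dimension.CumulVar(routing.End(v))
                for v in range(data['num_vehicles'])]) <= max_total_minutes)
# Allow nodes to be dropped so the model stays feasible under the budget.
for node in range(1, len(data['time_matrix'])):
    routing.AddDisjunction([manager.NodeToIndex(node)], drop_penalty)
# After solving, a node was dropped if its NextVar points to itself:
#   solution.Value(routing.NextVar(index)) == index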

How to speed up the addition of a new column in pandas, based on comparisons on an existing one

I am working on a large-ish dataframe collection with some machine data in several tables. The goal is to add a column to every table which expresses the row's "class", considering its vicinity to a certain time stamp.
seconds = 1800
for i in range(len(tables)): # looping over 20 equally structured tables containing machine data
    table = tables[i]
    table['Class'] = 'no event'
    for event in events[i].values: # looping over 20 equally structured tables containing events
        event_time = event[1] # get integer time stamp
        start_time = event_time - seconds
        table.loc[(table.Time<=event_time) & (table.Time>=start_time), 'Class'] = 'event soon'
The event_times and the entries in table.Time are integers. The point is to assign the class "event soon" to all rows in a specific time frame before an event (the number of seconds).
The code takes quite long to run, and I am not sure what is to blame or what can be fixed. The number of seconds does not have much impact on the runtime, so the part where the table is actually changed is probably working fine, and the nested loops may be to blame instead. However, I don't see how to get rid of them. Hopefully there is a faster, more pandas-like way to add this class column.
I am working with Python 3.6 and Pandas 0.19.2
You can use numpy broadcasting to do this vectorised instead of looping.
Dummy data generation
import numpy as np
import pandas as pd

num_tables = 5
seconds = 1800

def gen_table(count):
    for i in range(count):
        times = [(100 + j)**2 for j in range(i, 50 + i)]
        df = pd.DataFrame(data={'Time': times})
        yield df

def gen_events(count, num_tables):
    for i in range(num_tables):
        times = [1E4 + 100 * (i + j)**2 for j in range(count)]
        yield pd.DataFrame(data={'events': times})

tables = list(gen_table(num_tables)) # a list of 5 DataFrames of length 50
events = list(gen_events(5, num_tables)) # a list of 5 DataFrames of length 5
Comparison
For debugging, I added a dict of verification DataFrames; they are not needed for the solution itself.
verification = {}
for i, (table, event_df) in enumerate(zip(tables, events)):
    event_list = event_df['events']
    time_diff = event_list.values - table['Time'].values[:, np.newaxis] # This is where the magic happens
    events_close = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    table['Class'] = np.where(events_close, 'event soon', 'no event')
    # The stuff after this line can be deleted since it's only used for the verification
    df = pd.DataFrame(data=time_diff, index=table['Time'], columns=event_list)
    df['event'] = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    verification[i] = df
newaxis
A good explanation of broadcasting is in Jake VanderPlas's (jakevdp's) Python Data Science Handbook.
table['Time'].values[:,np.newaxis]
gives a (50,1) 2-d array
array([[10000],
[10201],
[10404],
....
[21609],
[21904],
[22201]], dtype=int64)
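To see the broadcasting in isolation, a tiny self-contained sketch (values are arbitrary):

import numpy as np

times = np.array([10000, 10201, 10404])          # shape (3,)
events = np.array([10000.0, 10100.0, 10400.0])   # shape (3,)
diff = events - times[:, np.newaxis]              # (3,) against (3,1) broadcasts to (3,3)
print(diff.shape)   # (3, 3): one row per Time value, one column per event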
Verification
For the first step the verification df looks like this:
events 10000.0 10100.0 10400.0 10900.0 11600.0 event
Time
10000 0.0 100.0 400.0 900.0 1600.0 True
10201 -201.0 -101.0 199.0 699.0 1399.0 True
10404 -404.0 -304.0 -4.0 496.0 1196.0 True
10609 -609.0 -509.0 -209.0 291.0 991.0 True
10816 -816.0 -716.0 -416.0 84.0 784.0 True
11025 -1025.0 -925.0 -625.0 -125.0 575.0 True
11236 -1236.0 -1136.0 -836.0 -336.0 364.0 True
11449 -1449.0 -1349.0 -1049.0 -549.0 151.0 True
11664 -1664.0 -1564.0 -1264.0 -764.0 -64.0 False
11881 -1881.0 -1781.0 -1481.0 -981.0 -281.0 False
12100 -2100.0 -2000.0 -1700.0 -1200.0 -500.0 False
12321 -2321.0 -2221.0 -1921.0 -1421.0 -721.0 False
12544 -2544.0 -2444.0 -2144.0 -1644.0 -944.0 False
....
20449 -10449.0 -10349.0 -10049.0 -9549.0 -8849.0 False
20736 -10736.0 -10636.0 -10336.0 -9836.0 -9136.0 False
21025 -11025.0 -10925.0 -10625.0 -10125.0 -9425.0 False
21316 -11316.0 -11216.0 -10916.0 -10416.0 -9716.0 False
21609 -11609.0 -11509.0 -11209.0 -10709.0 -10009.0 False
21904 -11904.0 -11804.0 -11504.0 -11004.0 -10304.0 False
22201 -12201.0 -12101.0 -11801.0 -11301.0 -10601.0 False
Small optimizations of the original approach.
You can shave a few lines and some assignments off the original algorithm:
for table, event_df in zip(tables, events):
    table['Class'] = 'no event'
    for event_time in event_df['events']: # looping over 20 equally structured tables containing events
        start_time = event_time - seconds
        table.loc[table['Time'].between(start_time, event_time), 'Class'] = 'event soon'
You might shave off some more if, instead of the strings 'no event' and 'event soon', you just used booleans, as sketched below.
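For example, in the broadcasting version above the mask could be stored directly as a boolean column (the column name 'EventSoon' is just illustrative):

# Keep the mask as booleans instead of mapping it to strings.
table['EventSoon'] = events_close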

How to calculate the number of prorated days with the Stripe API?

I am using Stripe. I would like to know how I can calculate the number of prorated days.
I want to display something like this:
1 additional seat ($9/month each - prorated for 26 days)
In the API I don't see any prorate_day item.
Bolo
Is subscription_proration_date what you are looking for? It will calculate it for you.
See more at https://stripe.com/docs/subscriptions/guide
An example of a prorated subscription in Ruby follows:
# Set your secret key: remember to change this to your live secret key in production
# See your keys here https://dashboard.stripe.com/account/apikeys
Stripe.api_key = "sk_test_9OkpsFpKa1HDHaZa7e0BeGaO"
proration_date = Time.now.to_i
invoice = Stripe::Invoice.upcoming(:customer => "cus_3R1W8PG2DmsmM9", :subscription => "sub_3R3PlB2YlJe84a",
:subscription_plan => "premium_monthly", :subscription_proration_date => proration_date)
current_prorations = invoice.lines.data.select { |ii| ii.period.start == proration_date }
cost = 0
current_prorations.each do |p|
  cost += p.amount
end
# Display the cost of these prorations invoice items to the end user,
# and actually do the update when they agree.
# To make sure that the proration is calculated the same as when it was previewed,
# you need to pass in the proration_date parameter
# later...
subscription = Stripe::Subscription.retrieve("sub_3R3PlB2YlJe84a")
subscription.plan = "premium_monthly"
subscription.proration_date = proration_date
subscription.save

Calculate time remaining with different length of variables

I will have to admit the title of this question sucks... I couldn't get the best description out. Let me see if I can give an example.
I have about 2,700 customers who at one time had my software installed on their server; 1,500 or so still do. Basically I have an auto-diagnostics process to help weed out people who have uninstalled, or who have problems with the software that we can assist with. Currently we cURL each customer's website, looking for our software's header in the return.
We have 8 different statuses that are returned
GREEN - Everything works (usually pretty quick 0.5 - 2 seconds)
RED - Software not found (usually the longest from 5 - 15 seconds)
BLUE - Software found but not activated (usually from 3 - 9 seconds)
YELLOW - Server IP mismatch (usually from 1 - 3 seconds)
ORANGE - Server IP mismatch and wrong software type (usually 5 - 10 seconds)
PURPLE - Activation key incorrect (usually within 2 seconds)
BLACK - Domain returns 404 - No longer exists (usually within a second)
UNK - Connection failed (usually due to our load balancer -- VERY rare) (we have never encountered this yet)
Now basically what happens is a cronJob will start the process by pulling the domain and product type. It will then cURL the domain and start cycling through the status colors above.
While this is happening, we have an Ajax page returning the results so we can keep an eye on the status. The major problem is that the Time Remaining estimate is so volatile that it is not a good estimate. Here is the current math:
# Number of accounts between NOW and when started
$completedAccounts = floor($parseData[2]*($parseData[1]/100));
# Number of seconds between NOW and when started
$completedTime = strtotime("now") - strtotime("$hour:$minute:$second");
# Avg number of seconds per account
$avgPerCompleted = $completedTime / $completedAccounts;
# Total number of remaining accounts to be scanned
$remainingAccounts = $parseData[2] - $completedAccounts;
# The total of seconds remaining for all of the remaining accounts
$remainingSeconds = $remainingAccounts * $avgPerCompleted;
$remainingTime = format_time($remainingSeconds, ":");
I could keep a count of all the green, red, blue, etc. results and average how long each color takes, then use that for the average time (a sketch of that idea follows below), although I don't believe that would give much better results.
With times that vary this much, any suggestions would be appreciated.
Thanks,
Jeff
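A rough sketch of the per-status averaging idea mentioned in the question (in Python purely for illustration, since the pipeline here is PHP; every name below is made up):

from collections import defaultdict

# status colour -> [total seconds observed, number of scans]
stats = defaultdict(lambda: [0.0, 0])

def record(status, seconds):
    stats[status][0] += seconds
    stats[status][1] += 1

def estimate_remaining(remaining_accounts, expected_mix):
    # expected_mix maps status -> expected fraction of the remaining
    # accounts (e.g. taken from the previous run).
    per_account = 0.0
    for status, fraction in expected_mix.items():
        total, count = stats[status]
        if count:
            per_account += fraction * (total / count)
    return remaining_accounts * per_account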
OK, I believe I have figured it out. I had to create a class so I could calculate a simple linear regression over a period of time.
function calc() {
    $n = count($this->mDatas);
    $vSumXX = $vSumXY = $vSumX = $vSumY = 0;
    //var_dump($this->mDatas);
    $vCnt = 0; // for time-series, start at t=0
    foreach ($this->mDatas AS $vOne) {
        if (is_array($vOne)) { // x,y pair
            list($x,$y) = $vOne;
        } else { // time-series
            $x = $vCnt; $y = $vOne;
        } // fi
        $vSumXY += $x*$y;
        $vSumXX += $x*$x;
        $vSumX += $x;
        $vSumY += $y;
        $vCnt++;
    } // rof
    $vTop = ($n*$vSumXY - $vSumX*$vSumY);
    $vBottom = ($n*$vSumXX - $vSumX*$vSumX);
    $a = $vBottom != 0 ? $vTop/$vBottom : 0;
    $b = ($vSumY - $a*$vSumX)/$n;
    //var_dump($a,$b);
    return array($a,$b);
}
I take each account and build up an array of the time each one takes. The array is then run through this calculation to build the x and y time series. Finally, I run the result through the predict function.
/** given x, return the prediction y */
function calcpredict($x) {
    list($a,$b) = $this->calc();
    $y = $a*$x+$b;
    return $y;
}
I put static values in so you could see the results:
$eachTime = array(7,1,.5,12,11,6,3,.24,.12,.28,2,1,14,8,4,1,.15,1,12,3,8,4,5,8,.3,.2,.4,.6,4,5);
$forecastProcess = new Linear($eachTime);
$forecastTime = $forecastProcess->calcpredict(5);
This overall system gives me about a 0.003 difference with 10 accounts and about a 2.6 difference with 2,700 accounts. Next will be to calculate the accuracy.
Thanks for trying guys and gals
