filling the holes in a time series data

filling the holes in a time series data - time

So i am trying to build one factor models with stocks and indices in R. I have 30 stocks and 16 indices in total. They are all time series from "2013-1-1" to "2014-12-31". Well at least all my stocks are. All of my indices are missing some entries here and there. For example, all of my stocks' data have the length of 522 but one indice has a length of 250, one 300, another 400 etc. But they all start from "2013-1-1" and end at "2014-12-31". Because my indice data has holes in it, i can't check correlations and build linear models with them. I can't do anything basically. So i need to fill these holes. I am thinking about filling those holes with their mean. But i don't know how to do it.I am open to other ideas of course. Can you help me? It is an important term project for me, so there is a lot on the line...

Edited based upon your comments (and to fix a mistake I made):
This is basic data management and I'm surprised that you're being required to work with timeseries data without knowing how to merge() and how to create dataframes.
Create some fake date and value data with holes in the dates:
dFA <- data.frame(seq.Date(as.Date("2014-01-01"), as.Date("2014-02-28"), 3))
names(dFA) <- "date"
dFA$vals <- rnorm(nrow(dFA), 25, 5)
Create a dataframe of dates from the min value in dFA to the max value in dFA
dFB <- as.data.frame(seq.Date(as.Date(min(dFA$date, na.rm = T), format = "%Y-%m-%d"),
as.Date(max(dFA$date, na.rm = T), format = "%Y-%m-%d"),
1))
names(dFB) <- "date"
Merge the two dataframes together
tmp <- merge(dFB, dFA, by = "date", all = T)
Change NA values in tmp$vals to whatever you want
tmp$vals[is.na(tmp$vals)] <- mean(dFA$vals)
head(tmp)
date vals
1 2014-01-01 18.48131
2 2014-01-02 24.16256
3 2014-01-03 24.16256
4 2014-01-04 28.78855
5 2014-01-05 24.16256
6 2014-01-06 24.16256
Original comment below
The easiest way to fill in the holes is with merge().
Create a new data frame with one vector as a sequence of dates that span the range of your original dataframe and the other vector with whatever you're going to fill the holes (zeroes, means, whatever). Then just merge() the two together:
merge(dFB, dFA, by = [the column with the date values], all = TRUE)

Related

Replace every cell in a matrix with the average of adjcent cells

Requirement: Must be done in-place.
For example:
Given matrix
1, 2, 3
4, 5, 6
7, 8, 9
Should replace by the average of its sum of 3*3 neighbor cells and its own:
(1+2+4+5)/4, (2+1+3+4+5+6)/6 , (3+2+6+5)/4
(1+2+5+4+7+8)/6, (1+2+3+4+5+6+7+8+9)/9, (2+3+5+6+8+9)/6
(4+5+7+8)/4, (4+5+6+7+8+9)/6, (5+6+8+9)/4
which is:
All floating number convert to int
3, 3.5(3), 4 3, 3, 4
4.5(4), 5, 5.5(5) => 4, 5, 5
6, 6.5(6), 7 6, 6, 7
I tried to just iterate over the matrix and update each cell, but I found this will affect the future calculation:
Say I update the original 1 to 3, but when I when I tried to update the original 2, the original 1 becomes 3 now.
Copying the original matrix for calculating average is a workaround but it's a bad idea, Could we achieve that without using that much space?

In most cases, you should just create a copy of the original matrix and use that for calculating the averages. Unless creating a copy of the matrix would use more memory than you have available, the overhead should be negligible.
If you have a really large matrix, you could use a "rolling" backup (in lack of a better term). Let's say you update the cells row-by-row and you are currently in row n. You don't need a backup of row n-2, as those cells are not relevant any more, and neither of row n+1, because those are still the original values. So you can just keep a backup of the previous and the current row. Whenever you advance to the next row, discard the backup of the previous row, move the backup of the current row to previous, and create a backup of the new current row.
Some pseudo-code (not taking any edge-cases into account):
previous = [] # or whatever works for the first row
for i in len(matrix):
current = copy(matrix[i])
for k in len(matrix[i]):
matrix[i][k] = previous[k-1] + ... + current[k] + ... matrix[i+1][k+1] / 9
previous = current
(You might also keep a backup of the next row, just so you can use only the backup rows for all the values instead of having to differentiate.)

You must have some kind of cache for the result data so you can keep reference to the original data. I don't think there is a way around it.
If the data set is large, you could optimize by using a smaller data buffer (like looking through a keyhole) and 'scrolling' the input matrix as you update it. In your case, you could use a buffer as small as 3x3.
It is a compromise between speed and space though. The smaller your buffer, the worse the performance will be.
To visualize the problem, starting from the top-left (0,0) of the dataset:
(result values are rounded down for simplicity)
First step: update first 4 cells (prime the buffer)
// Data Set // Data Viewport // Result Set
01,02,03,04,05 01,02,03 04,04,??
06,07,08,09,10 06,07,08 06,07,??
11,12,13,14,15 11,12,13 ??,??,??
16,17,18,19,20
21,22,23,24,25
then for each iteration..
( new values indicated with [xx] )
++ update first column in Data Set from Result Set
// Data Set // Data Viewport // Result Set
[04],02,03,04,05 01,02,03 04,04,??
[06],07,08,09,10 06,07,08 06,07,??
11 ,12,13,14,15 11,12,13 ??,??,??
16 ,17,18,19,20
21 ,22,23,24,25
++ shift Data Viewport and Result Set right 1 column
// Data Set // Data Viewport // Result Set
[04],02,03,04,05 02,03,04 04,[03],??
[06],07,08,09,10 07,08,09 07,[08],??
11 ,12,13,14,15 12,13,14 ??, ?? ,??
16 ,17,18,19,20
21 ,22,23,24,25
++ update middle column of Result Set
// Data Set // Data Viewport // Result Set
[04],02,03,04,05 02,03,04 04,[05],??
[06],07,08,09,10 07,08,09 07,[08],??
11 ,12,13,14,15 12,13,14 ??, ?? ,??
16 ,17,18,19,20
21 ,22,23,24,25
At the following iteration, the data state would be:
// Data Set // Data Viewport // Result Set
04,[04],03,04,05 03,04,05 05,[06],??
06,[07],08,09,10 08,09,10 08,[09],??
11, 12 ,13,14,15 13,14,15 ??, ?? ,??
16, 17 ,18,19,20
21, 22 ,23,24,25
.. etc
Don't forget to handle the other edge cases.
*The Data Viewport representation is just for visualization. In code, the actual viewport would be the result buffer.

Sorted Two-Way Tabulation of Many Values

I have a decent-sized dataset (about 18,000 rows). I have two variables that I want to tabulate, one taking on many string values, and the second taking on just 4 values. I want to tabulate the string values by the 4 categories. I need these sorted. I have tried several commands, including tabsort, which works, but only if I restrict the number of rows it uses to the first 603 (at least with the way it is currently sorted). If the number of rows is greater than this, then I get the r(134) error that there are too many values. Is there anything to be done? My goal is to create a table with the most common words and export it to LaTeX. Would it be a lot easier to try and do this in something like R?

Here's one way, via contract and texsave from SSC:
/* Fake Data */
set more off
clear
set matsize 5000
set seed 12345
set obs 1000
gen x = string(rnormal())
expand mod(_n,10)
gen y = mod(_n,4)
/* Collapse Data to Get Frequencies for Each x-y Cell */
preserve
contract x y, freq(N)
reshape wide N, i(x) j(y)
forvalues v=0/3 {
lab var N`v' "`v'" // need this for labeling
replace N`v'=0 if missing(N`v')
}
egen T = rowtotal(N*)
gsort -T x // sort by occurrence
keep if T > 0 // set occurrence threshold
capture ssc install texsave
texsave x N0 N1 N2 N3 using "tab_x_y.tex", varlabel replace title("tab x y")
restore
/* Check Calculations */
type "tab_x_y.tex"
tab x y, rowsort

Find the optimal ordering of elements

I have a list of logos (up to 4 color) that need to be printed. Each logo requires a setup time to mix paints needed for that logo. If I can sort the data so that two logos that use the same colors are back to back, then we will not have to mix as many colors saving money and time. Paints have a limited life span once mixed.
I am looking at a dataset like this...
Red | (Other Color)
Red | Black
(Other Color) | Black
It needs to end up in that order. That is the only order that will allow for 1 red to me made and 1 black. I've tried a few things like assigning a value to each common color, but no matter what, I can't seem to get it ordered correctly.
I used the following SQL procedure that someone wrote based on the TSP problem. (http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=172154)
Using the following test data I received the correct output
delete from routes
delete from cities
insert into cities values ('Black|Red')
insert into cities values ('Red')
insert into cities values ('Blue')
insert into cities values ('Black')
insert into cities values ('Blue|Red')
-- Numeric Value is Colors not Matching
insert into routes values ('Black|Red', 'Red', 3)
insert into routes values ('Black|Red', 'Black', 3)
insert into routes values ('Red', 'Black', 4)
insert into routes values ('Blue|Red', 'Red', 3)
insert into routes values ('Blue|Red', 'Black', 4)
insert into routes values ('Blue', 'Red', 4)
insert into routes values ('Blue', 'Black|Red', 4)
insert into routes values ('Blue', 'Black', 4)
insert into routes values ('Blue', 'Blue|Red', 3)
exec getTSPRoute 'Black'
Results:
Black->Black|Red->Red->Blue|Red->Blue->Black
The only issue is running back to the original "city" (Black returned for both the start and the end) and I have to select a "start city." If the wrong one is selected I don't end up with the most optimized route.

It looks like the travelling salesman problem (TSP). Let me explain.
First, consider an example where you have a map with four cities A, B, C and D. (I use 4 in the example but it has nothing to do with the number of colors). You want to find a route between the cities so you (1) visit each city only once and (2) the route is the shortest possible. [D,C,A,B] might be shorter that [B,A,D,C] and you want the shortest one.
Now, instead of the cities you have four logos. You want to find such an ordering of the logos, that yields a minimum cost in terms of color mixing. If you imagine that each of your logos is a point (city), and the distance between the logos is the "cost" of switching between one color set to the other, then you need to find the shortest "route" between the points. Once you have this shortest route, it tells you how you should order the logos. The "distance" between two logos L1 and L2 can be defined, for example, as a number of colors in L2 that are not in L1.
TSP it is a well known algorithmic problem. And it is hard (actually, NP-hard).
If your input is small you can find the best solution. In case of 4 logos, you have 24 posible combinations. For 10 logos, you have 3.6 million combinations and for 20 logos you get 2432902008176640000 combinations (how to read that?). So for inputs larger than 10-15 you need to use some heuristic that finds an approximate solution, which I am sure is enough for you.
What I would do is that I would create a graph of costs of color mixing and feed it to some TSP solver
EDIT:
Clarification. Not each logo is a separate point, but each set of colours in a logo is a point. That is: if you have two logos that have the same set of colours, you consider them as a single point because they will be printed together. Logos with red, blue, black are on point and logos with red, green are another point.
It's rather Hamiltonian path problem than TSP (you don't need to end with the same color set as at the beginning), but it doesn't change much
If there might be no matches in your logos, then first split your logos into disjoint groups that have no matches between them and later consider each group separately. If there are no matches between any of your logos, then you cannot do much :)
Practically, I would use python and maybe networkx library to model your problem as graph, and later I would pass it to some TSP solver. Just format the input and make some other program do all the dirty work.

For a reasonable amount of logos and colors, an easy way would be a brute-force approach in which you go through all the combinations and increase a counter each time mixing is required. After that, you sort combinations by that counter and choose the one with the lowest value.
Pseudocode
foreach combination
foreach print
foreeach color
if not previous_print.contains(color)
cost++
order combination by cost (ascending)
You didn't mention if you are using (or are about to) any kind of tool (spreadsheet, programming language, ...) in which you intended perform this sort.
Edit:
Here's a quick implementation in VB.NET. Note that the code is intentionally left long as to make it easier to read and understand.
Private Sub GoGoGo()
' Adds some logos
' This is where you add them from the database or text file or wherever
Dim logos() =
{
New String() {"Black", "Magenta", "Orange"},
New String() {"Red", "Green", "Blue"},
New String() {"Orange", "Violet", "Pink"},
New String() {"Blue", "Yellow", "Pink"}
}
' Used to store the best combination
Dim minimumPermutation
Dim minimumCost = Integer.MaxValue
' Calculate all permutations of the logos
Dim permutations = GetPermutations(logos)
' For each permutation
For i As Integer = 0 To permutations.Count() - 1
Dim permutation = permutations(i)
Dim cost = 0
' For each logo in permutation
For j As Integer = 0 To permutation.Count() - 1
Dim logo = permutation(j)
' Check whether the previous logo contains one or more colors of this logo
For Each color In logo
If (j > 0) Then
If Not permutation(j - 1).Contains(color) Then
cost += 1
End If
Else
cost += 1
End If
Next
Next
' Save the best permutation
If (i = 0 Or cost < minimumCost) Then
minimumCost = cost
minimumPermutation = permutation.Clone()
End If
Next
' Output the best permutation
For Each logo In minimumPermutation
Console.Write(logo(0) + " " + logo(1) + " " + logo(2))
Next
End Sub
Public Shared Iterator Function GetPermutations(Of T)(values As T(), Optional fromInd As Integer = 0) As IEnumerable(Of T())
If fromInd + 1 = values.Length Then
Yield values
Else
For Each v In GetPermutations(values, fromInd + 1)
Yield v
Next
For i = fromInd + 1 To values.Length - 1
SwapValues(values, fromInd, i)
For Each v In GetPermutations(values, fromInd + 1)
Yield v
Next
SwapValues(values, fromInd, i)
Next
End If
End Function
Private Shared Sub SwapValues(Of T)(values As T(), pos1 As Integer, pos2 As Integer)
If pos1 <> pos2 Then
Dim tmp As T = values(pos1)
values(pos1) = values(pos2)
values(pos2) = tmp
End If
End Sub

I suspect a Genetic Algorithm would be good for this. If you have a lot of logos, a brute force solution could take quite a while, and greedy is unlikely to produce good results.
http://en.wikipedia.org/wiki/Genetic_algorithm

Efficiently sample a data frame avoiding loops

I have a data frame which consists of a first column (experiment.id) and the rest of the columns are values associated with this experiment id. Each row is a unique experiment id. My data frame has columns in the order of 10⁴ - 10⁵.
data.frame(experiment.id=1:100, v1=rnorm(100,1,2),v2=rnorm(100,-1,2) )
This data frame is the source of my sample space. What i would like to do, is for each unique experiment.id (row) randomly sample (with replacement) one of the values v1, v2, ....,v10000 associated with this id and construct a sample s1. In each sample s1 all experiment ids are represented.
Eventually i want to perform 10⁴ samples, s1, s2, ....,s 10⁴ and calculate some statistic.
What would be the most efficient way (computationally) to perform this sampling process. I would like to avoid for loops as much as possible.
Update:
My questions in not all about sampling but also storing the samples. I guess my real question is if there is a quicker way to perform the above other than
d<-data.frame(experiment.id=1:1000, replicate (10000,rnorm(1000,100,2)) )
results<-data.frame(d$experiment.id,replicate(n=10000,apply(d[,2:10001],1,function(x){sample(x,size=1,replace=T)})))

Here is an expression that chooses one of the columns (excluding the first). It does not copy the first column, you will need to supply that as a separate step.
For a data frame d:
d[matrix(c(seq(nrow(d)), sample(ncol(d)-1, nrow(d), replace=TRUE)+1), ncol=2)]
That's one sample. To get N samples, just multiply the selection (as in John's answer):
mm <- matrix(c(rep(seq(nrow(d)), N), sample(ncol(d)-1, nrow(d)*N, replace=TRUE)+1), ncol=2)
result <- matrix(d[mm], ncol=N)
But you're going to have memory issues.

The shortest and most readable IMHO is still to use apply, but making good use of the fact that sample is vectorized:
results <- data.frame(experiment.id = d$experiment.id,
t(apply(d[, -1], 1, sample, 10000, replace = TRUE)))
If the 3 seconds it takes are too slow for your needs then I would recommend you use matrix indexing.

It's possible to do without any looping whatsoever. If you convert your columns after the first one to a matrix this gets easy because a matrix can be addressed either as [row, column] or sequentially as it's underlying vector.
mat <- as.matrix(datf[,-1])
nr <- nrow(mat); nc <- ncol(mat)
sel <- sample( 1:nc, nr, replace = TRUE )
sel <- sel + ((1:nr)-1) * nc
x <- t(mat)[sel]
seldatf <- data.frame( datf[,1], x = x )
Now, to get lots of the samples it pretty easy just multiplying the same logic.
ns <- 10 # number of samples / row
sel <- sample(1:nc, nr * ns, replace = TRUE )
sel <- sel + rep(((1:nr)-1) * nc, each = ns)
x <- t(mat)[sel]
seldatf <- cbind( datf[,1], data.frame(matrix(x, ncol = ns, byrow = TRUE)) )
It's possible that it's going to be a really big data frame if you're going to set ns <- 1e5 and you have lots of rows. You may have to watch running out of memory. I do a bit of unnecessary copying for readability reasons. You can eliminate that for memory, and speed because once you are using large amounts of memory you'll be swapping out other programs that are running. That is slow. You don't have to assign and save x, mat, or even sel. The result of not doing that would provide you about the fastest answer possible.

How to decide on weights?

For my work, I need some kind of algorithm with the following input and output:
Input: a set of dates (from the past). Output: a set of weights - one weight per one given date (the sum of all weights = 1).
The basic idea is that the closest date to today's date should receive the highest weight, the second closest date will get the second highest weight, and so on...
Any ideas?
Thanks in advance!

First, for each date in your input set assign the amount of time between the date and today.
For example: the following date set {today, tomorrow, yesterday, a week from today} becomes {0, 1, 1, 7}. Formally: val[i] = abs(today - date[i]).
Second, inverse the values in such a way that their relative weights are reversed. The simplest way of doing so would be: val[i] = 1/val[i].
Other suggestions:
val[i] = 1/val[i]^2
val[i] = 1/sqrt(val[i])
val[i] = 1/log(val[i])
The hardest and most important part is deciding how to inverse the values. Think, what should be the nature of the weights? (do you want noticeable differences between two far away dates, or maybe two far away dates should have pretty equal weights? Do you want a date which is very close to today have an extremely bigger weight or a reasonably bigger weight?).
Note that you should come up with an inverting procedure where you cannot divide by zero. In the example above, dividing by val[i] results in division by zero. One method to avoid division by zero is called smoothing. The most trivial way to "smooth" your data is using the add-one smoothing where you just add one to each value (so today becomes 1, tomorrow becomes 2, next week becomes 8, etc).
Now the easiest part is to normalize the values so that they'll sum up to one.
sum = val[1] + val[2] + ... + val[n]
weight[i] = val[i]/sum for each i

Sort dates and remove dups
Assign values (maybe starting from the farthest date in steps of 10 or whatever you need - these value can be arbitrary, they just reflect order and distance)
Normalize weights to add up to 1
Executable pseudocode (tweakable):
#!/usr/bin/env python
import random, pprint
from operator import itemgetter
# for simplicity's sake dates are integers here ...
pivot_date = 1000
past_dates = set(random.sample(range(1, pivot_date), 5))
weights, stepping = [], 10
for date in sorted(past_dates):
weights.append( (date, stepping) )
stepping += 10
sum_of_steppings = sum([ itemgetter(1)(x) for x in weights ])
normalized = [ (d, (w / float(sum_of_steppings)) ) for d, w in weights ]
pprint.pprint(normalized)
# Example output
# The 'date' closest to 1000 (here: 889) has the highest weight,
# 703 the second highest, and so forth ...
# [(151, 0.06666666666666667),
# (425, 0.13333333333333333),
# (571, 0.2),
# (703, 0.26666666666666666),
# (889, 0.3333333333333333)]

How to weight: just compute the difference of all dates and the current date
x(i) = abs(date(i) - current_date)
you can then use different expression to assign weights:
w(i) = 1/x(i)
w(i) = exp(-x(i))
w(i) = exp(-x(i)^2))
use gaussian distribution - more complicated, do not recommend
Then use normalized weights: w(i)/sum(w(i)) so that the sum is 1.
(Note that the exponential func is always used by statisticians in survival analysis)

The first thing that comes to my mind to to use a geometric series:
http://en.wikipedia.org/wiki/Geometric_series
(1/2)+(1/4)+(1/8)+(1/16)+(1/32)+(1/64)+(1/128)+(1/256)..... sums to one.
Yesterday would be 1/2
2 days ago would be 1/4
and so on

Is is the index for the i-th date.
Assign weights equal to to Ni / D.
D0 is the first date.
Ni is the difference in days between the i-th date and the first date D0.
D is the normalization factor

converts dates to yyyymmddhhmiss format (24 hours), add all these values and the total, divide by the total time, and sort by this value.
declare #data table
(
Date bigint,
Weight float
)
declare #sumTotal decimal(18,2)
insert into #Data (Date)
select top 100
replace(replace(replace(convert(varchar,Datetime,20),'-',''),':',''),' ','')
from Dates
select #sumTotal=sum(Date)
from #Data
update #Data set
Weight=Date/#sumTotal
select * from #Data order by 2 desc

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio