Multiple Time Series with Binary Outcome Prediction - algorithm

I'll preface this by saying I am extremely new to neural networks and their operation. I've done a fair bit of reading and played with a few cloud-based tools (Cortana and AWS), but beyond that I am not well versed in the algorithms, the kinds of neural networks, etc.
I'm looking for advice on what systems / tools / kinds of algorithms I can use to achieve the below.
Problem Description
I have a data set that contains time series data for a number of users. The data set can contain a variable number of unique users (probably maxing out at 150), and each user has four sets of time series data for four different variables. Example data set below:
V = Variable
User | Time | V1 | V2 | V3 | V4
1 | 12.00am | 13 | 1045 | 12.2 | 52.4
1 | 12.01am | 12 | 1565 | 11.9 | 50.3
2 | 12.00am | 2 | 15434 | 1.93 | 47.2
2 | 12.01am | 2.02 | 17434 | 1.98 | 43.1
And so on for x users and hundreds of data points for each user.
Required Output
By parsing the data, I want to be able to train the system to either give back a binary TRUE or FALSE for a user based on the input, or alternatively, a probability % of the user being TRUE.
The binary is effectively a TRUE or FALSE result. There can only be one TRUE out of all 10 users. I think getting back a % chance of being TRUE is probably the simplest form? I may be wrong.
Input Format
The end point is to have an API that I can send a data set to, which returns each user and their probability (or the binary TRUE | FALSE result).
Systems
I would prefer to be able to do this on a 3rd-party service as opposed to having to build my own systems to do the processing, but it's not a necessity.
Training Data
I have years of data to be able to train the system, hundreds of thousands of real user sets and so on.
To Wrap It Up
Looking for advice on the what and the how to predict a binary outcome from multiple time series data sets.
Really appreciate any assistance and guidance here.
Thanks
Russ

I'm working on a similar problem (I am no expert either) but I'll share my approach in case it answers your "what" part of the question.
My solution was to transform the dataset so I ended up with a problem that could be solved with traditional classification algorithms (Random Forest, boosting, ...).
This approach requires that the data is labeled. Each row of the transformed dataset will represent the information associated with each TRUE or FALSE outcome in the training dataset. Each row will be a unique event and will have:
1 column with the response
p sets of columns (one set for each of the p original variables)
k variables to indicate seasonality
Each set of the p sets of columns will consist of the variable at time t (time when you recorded the response of that row), the variable at time t-1 (lag1), ..., and the variable at time t-T (lagT).
Example:
Original dataset (I've retained only V1 and V2 and added an outcome variable)
User | Time | V1 | V2 | outcome
1 | 12.00am | 13 | 1045 | FALSE
1 | 12.01am | 12 | 1565 | TRUE
Transformed dataset
ID | V1_lag1 | V1_lag0 | V2_lag1 | V2_lag0 | outcome
event_id | 13 | 12 | 1045 | 1565 | TRUE
With this setup you could fit a model that would predict the probability of TRUE at time t for a new observation, based on V1 and V2 evaluated at time t and V1 and V2 evaluated at lag1 (t-1 min).
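For concreteness, here is a minimal sketch of that lagging step, assuming pandas (my choice; any tooling that can build lagged columns per user works) and T = 1, with column names following the tables above:

import pandas as pd

# Toy version of the original dataset, sorted by User and then Time.
df = pd.DataFrame({
    "User":    [1, 1, 2, 2],
    "Time":    ["12.00am", "12.01am", "12.00am", "12.01am"],
    "V1":      [13, 12, 2, 2.02],
    "V2":      [1045, 1565, 15434, 17434],
    "outcome": [False, True, False, False],
})

# For each variable, add lagged copies per user: lag0 is the value at
# time t (when the outcome was recorded), lag1 the value at t-1, etc.
for var in ["V1", "V2"]:
    for lag in range(2):                      # use range(T + 1) for T lags
        df[f"{var}_lag{lag}"] = df.groupby("User")[var].shift(lag)

# Rows without a full history get NaNs in their lag columns; drop them
# and keep only the model columns.
transformed = df.dropna().drop(columns=["Time", "V1", "V2"])
print(transformed)

From here each row is an independent labeled example, which is exactly what classifiers like Random Forest expect.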
You could also create new features that would describe the variables better (See Features for time series classification).
And you should incorporate the seasonality somehow if the variables show a seasonality pattern:
ID | V1_lag1 | V1_lag0 | V2_lag1 | V2_lag0 | day | hour | outcome
event_id | 13 | 12 | 1045 | 1565 | wed | 12am | TRUE
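Once the data is flat like this, getting the probability of TRUE is a one-liner in most libraries. A rough sketch with scikit-learn's RandomForestClassifier on made-up toy data shaped like the transformed table (the seasonality columns would be one-hot encoded and appended in the same way):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the transformed table: columns V1_lag1, V1_lag0,
# V2_lag1, V2_lag0, and a TRUE/FALSE outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Probability that a new observation (V1/V2 at lag1 and lag0) is TRUE.
new_row = np.array([[13.0, 12.0, 1045.0, 1565.0]])
print(model.predict_proba(new_row)[:, 1])

And since only one user per set can be TRUE, you can simply rank the users of a set by this probability and pick the top one.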

Related

Is there an algorithm for memory-efficient rank change tracking?

For a racing game, I have this highscore table (it contains more columns to indicate the race track and other things, but let's ignore that):
Rank | ID | Time | Datetime | Player
1 | 4ef9b | 8.470 | today 13:00 | Bob
2 | 23fcf | 8.470 | today 13:04 | Carol
3 | d8512 | 8.482 | today 12:47 | Alice
null | 0767c | 9.607 | today 12:51 | Alice
null | eec81 | 9.900 | today 12:55 | Bob
The Rank column is precomputed and reflects ORDER BY Time, Datetime but uniqued such that each player has only their best entry ranked. (The non-personal-best records, with null Rank, are relevant for historic graphs.)
When inserting a new highscore, there are at least two ways to update the ranks:
Insert the new row and invalidate their old rank, then periodically read the whole table using ORDER BY, deduplicate players, and issue UPDATE queries in batches of 1000 by using long lists of CASE id WHEN 4ef9b THEN rank=1, WHEN 23fcf THEN rank=2, etc., END. The database server is clever enough that if I try setting Carol's rank to 2 again, it will see it's the same and not do a disk write, so it's less-than-horribly inefficient.
Only update the rows that were changed by doing:
oldID, oldRank = SELECT ID, Rank WHERE Player=$player, Rank IS NOT NULL
newRank = SELECT MAX(Rank) WHERE time < $newTime;
UPDATE Rank+=1 WHERE Rank IS NOT NULL AND Rank >= $newRank AND Rank < $oldRank
INSERT (Rank=$newRank, Time=$newTime, Player=..., etc.)
UPDATE Rank=null WHERE ID=$oldID
I implemented the latter because it seemed to be optimally efficient (only selecting and touching rows that need changing), but this actually took the server down upon a flood of new personal best times due to new level releases. It turns out that periodically doing the less-efficient former method actually creates a lot less load, but I'd have to implement a queueing mechanism for rank invalidation which feels like extra complexity on top of inefficiency.
One problem I have is that my database calls fsync after each query (and has severe warnings against turning that off), so doing 100 row updates in one query might take 0.83 seconds, whereas doing 100 separate queries takes 100 × (fsync_time + query_time), e.g. 100 × 0.51 = 51 seconds. This might be why changing Rank rows along with every insert is such a burden on the system, so I want to batch this by storing an ordered list of (oldrank, newrank) pairs and applying them all at once.
What algorithm can be used to compute the batch update? I could select the whole ranked list from the database into a big hashmap (map[rank] = ID), apply any number of rank changes to this memory object, build big CASE WHEN strings to update a thousand rows in one query, and send those updates to the database. However, as the number of players grows, this hashmap might not fit in memory.
Is there a way to do this based on ranges of ranks, instead of individual ranks? The list of incoming changes, such as:
Bob moves from rank 500 to rank 1
Carol moves from rank 350 to rank 100
should turn into a list of changes to make for each rank:
rank[1-99] +=1
rank[100-349] += 2
rank[351-499] += 1
without having a memory object that needs O(n) memory, where n is the number of ranked scores in the database for one race track. In this case, the two changes expand to three ranges, spanning five hundred rank entries. (Changing each row in the database will still have to happen; this can probably not be helped without entirely changing the setup.)
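For reference, the expansion described above can be computed with memory proportional to the number of incoming moves rather than the number of ranked rows, using a small difference map over the boundary ranks. A rough Python sketch (assuming every move is an improvement, i.e. new rank < old rank, and that the movers' own rows are rewritten to their new ranks separately):

from collections import defaultdict

def batch_rank_updates(moves):
    """Collapse individual (old_rank, new_rank) moves into range deltas.

    Returns (lo, hi, delta) triples: every still-ranked row whose rank is
    in [lo, hi] must get `delta` added.  The movers' own old ranks are
    excluded, since those rows are set to their new ranks directly.
    """
    diff = defaultdict(int)     # boundary rank -> change in active delta
    excluded = set()            # old ranks of the movers themselves
    for old, new in moves:
        diff[new] += 1          # rows at new .. old-1 shift down by one
        diff[old] -= 1
        excluded.add(old)

    boundaries = sorted(set(diff) | excluded | {r + 1 for r in excluded})
    ranges, delta, prev = [], 0, None
    for b in boundaries:
        if prev is not None and delta and prev <= b - 1:
            ranges.append((prev, b - 1, delta))
        delta += diff.get(b, 0)
        prev = b + 1 if b in excluded else b
    return ranges

print(batch_rank_updates([(500, 1), (350, 100)]))
# -> [(1, 99, 1), (100, 349, 2), (351, 499, 1)]

The ranges only describe what must change; actually applying them as in-place UPDATE ... SET Rank = Rank + delta statements still needs care, since the WHERE clause would read the very column being modified.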
I am using a standard LAMP stack, in case that is relevant for an answer.

Calculate a date when total duration of multiple sub-intervals (within a larger interval) drops below X

I am building an expert system that will run as a web service (i.e. continuously).
Some of the rules in it are coded procedurally and deal with intervals: the rule processor maps over a set of a user's events and calculates their total duration within a certain time frame, which is defined in relative terms (like "N years ago"). This result is then compared with a required threshold to determine whether the rule passes.
So for example the rule calculates for how long you were employed from 3 years ago to 1 year ago and passes if it's more than 9 months.
I have no problem calculating the durations. The difficult part is that I need to display to the user not simply whether the particular rule passed, but also the exact date when this "true" is due to become "false". Ideally, I'd love to display one additional step ahead, i.e. when "false" switches back to "true" again, if there's data for this, of course. So on the day when the total duration of their employment for the last year drops below 6 months, the rule reruns, the result changes, and they get an email: "hey, your result has just changed, you no longer qualify, but in 5 months you will qualify once again".
| | |
_____|||1|||_______|||2|||__________|||3|||________|||4|||...
| | |
3 y. ago ---------------------- 1 y. ago Now
min 9 months work experience is required
In the example above the user qualifies, but is going to stop qualifying; we need to tell them up front: "expect this to happen in 44 days" (the system also schedules a background job for that date) and when that will reverse back to true.
| | |
____________________|1|__________________||||||||2||||||||...
| | |
3 y. ago ---------------------- 1 y. ago Now
min 9 months work experience is required
In this one the user doesn't qualify, we need to tell them when they are going to start to qualify.
| |
_____|||1|||___________|||||||2|||||||_________|||3|||____...
| |
1 y. ago ------------------------------------------ Now
at least 6 months of work experience is required
And here we need to tell them when they are due to stop qualifying: there is no event going on for them currently, so once these events roll far enough to the left, it's over until the user changes their CV and the engine re-runs with a new dataset.
I hope it's clear what I want to do. Is there a smart algorithm that can help me here? Or do I just brute-force the solution?
UPD:
The solution I am developing lies in creating a 2-dimensional graph where each point signifies a date (x-axis value) when the curve of total duration for the timeframe (y-axis value) changes direction. There are 4 such breakpoints for any given event. This graph will allow me to do a linear interpolation between two values to find when exactly the duration line crosses the threshold. I am currently writing this in Ruby.
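Here is a rough sketch of that breakpoint-and-interpolation idea, in Python rather than the Ruby I'm actually using, with all names invented: the total duration inside the sliding window is piecewise linear in the evaluation date, its slope only changes at the four breakpoints per event, so evaluating it at those breakpoints and interpolating in between gives the crossing dates.

from datetime import date, timedelta

def overlap_days(event, window_start, window_end):
    # Days of overlap between one event (start, end) and the window.
    start, end = event
    return max(0, (min(end, window_end) - max(start, window_start)).days)

def total_duration(events, t, offset_a, offset_b):
    # Total days of all events inside the window [t - offset_a, t - offset_b].
    return sum(overlap_days(ev, t - offset_a, t - offset_b) for ev in events)

def threshold_crossings(events, today, offset_a, offset_b, threshold_days):
    # The duration is piecewise linear in t; its slope only changes when an
    # event endpoint enters or leaves the window, i.e. at the four breakpoints
    # per event: start + offset_a, start + offset_b, end + offset_a,
    # end + offset_b.  Evaluate at those dates and interpolate in between.
    breakpoints = {today}
    for start, end in events:
        for endpoint in (start, end):
            for offset in (offset_a, offset_b):
                if endpoint + offset >= today:
                    breakpoints.add(endpoint + offset)
    bps = sorted(breakpoints)

    crossings = []
    for t0, t1 in zip(bps, bps[1:]):
        d0 = total_duration(events, t0, offset_a, offset_b)
        d1 = total_duration(events, t1, offset_a, offset_b)
        if (d0 - threshold_days) * (d1 - threshold_days) < 0:   # sign change
            frac = (threshold_days - d0) / (d1 - d0)
            crossings.append(t0 + timedelta(days=round(frac * (t1 - t0).days)))
    return crossings

# Example: employment periods, rule "at least 9 months (~270 days) of work
# between 3 years ago and 1 year ago".
events = [(date(2021, 1, 1), date(2021, 10, 1)),
          (date(2022, 2, 1), date(2022, 6, 1))]
print(threshold_crossings(events, date(2023, 1, 1),
                          timedelta(days=3 * 365), timedelta(days=365), 270))

The first downward crossing is the date for the "you will stop qualifying" notification; the next upward one, if any, is the "you will qualify again" date.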

training a kNN algorithm with different features for each record

I have a dataset where each record could contain a different number of features.
There are 56 features in total, and each record can contain from 1 to 56 of these features.
Each feature is like a flag: it either exists for a record or it doesn't, and if it exists there is another value, a double, that holds its value.
An example of the dataset is this.
I would like to know if it is possible to train my kNN algorithm using different features for each record, so for example one record has 3 features plus a label, another one has 4 features plus a label, etc.
I am trying to implement this in Python, but I have no idea how to do it.
Yes it is definitely possible. The one thing you need to think about is the distance measure.
The default distance used for kNN classifiers is usually Euclidean distance. However, Euclidean distance requires records (vectors) with an equal number of features (dimensions).
The distance measure you use highly depends on what you think should make records similar.
If you have a correspondence between features of two records, so you know that feature i of record x describes the same feature as feature i from record y you can adapt Euclidean distance. For example you could either ignore missing dimensions (such that they don't add to the distance if a feature is missing in one record) or penalize missing dimensions (such that a certain penalty value is added whenever a feature is missing in a record).
If you do not have a correspondence between the features of two records, then you would have to look at set distances, e.g., minimum matching distance or Hausdorff distance.
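To make the first option concrete, here is a small sketch of an adapted Euclidean distance (names made up; records are represented as dicts mapping feature index to value, so a missing feature is simply an absent key):

import math

def partial_euclidean(a, b, penalty=None):
    # Euclidean-style distance between two records given as
    # {feature_index: value} dicts.  Dimensions missing from either record
    # are ignored, or each adds a fixed penalty if `penalty` is given.
    shared = a.keys() & b.keys()
    dist_sq = sum((a[i] - b[i]) ** 2 for i in shared)
    if penalty is not None:
        missing = (a.keys() | b.keys()) - shared
        dist_sq += len(missing) * penalty ** 2
    return math.sqrt(dist_sq)

x = {0: 15.12, 2: 56.65}   # record with features 0 and 2 only
y = {1: 23.6, 2: 60.0}     # record with features 1 and 2 only
print(partial_euclidean(x, y))              # ignore missing dimensions
print(partial_euclidean(x, y, penalty=10))  # penalize them instead

A kNN classifier on top of this is then just: compute the distance from the query record to every training record, take the k smallest, and majority-vote their labels.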
Every instance in your dataset should be represented by the same number of features. If you have data with a variable number of features (e.g. each data point is a vector of x and y values where each instance has a different number of points), then you should treat the absent features as missing values.
Therefore you need to deal with missing values. For example:
Replace missing values with the mean value for each column.
Select an algorithm that is able to handle missing values such as Decision trees.
Use a model that is able to predict missing values.
EDIT
First of all you need to bring the data into a better format. Currently each feature is represented by two columns, which is not a very nice representation. Therefore I would suggest restructuring the data as follows:
+------+----------+----------+----------+-------+
|  ID  | Feature1 | Feature2 | Feature3 | Label |
+------+----------+----------+----------+-------+
|  1   |  15.12   |    ?     |  56.65   | True  |
|  2   |    ?     |  23.6    |    ?     | True  |
|  3   |    ?     |  12.3    |    ?     | False |
+------+----------+----------+----------+-------+
Then you can either replace missing values (denoted with ?) with 0 (this depends on the "meaning" of each feature) or use one of the techniques that I've already mentioned before.
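For instance, a minimal pandas sketch of the mean-imputation option on the restructured table (pandas is just my choice here; column names follow the table above):

import pandas as pd

# The restructured table, with missing features stored as NaN instead of "?".
df = pd.DataFrame({
    "Feature1": [15.12, None, None],
    "Feature2": [None, 23.6, 12.3],
    "Feature3": [56.65, None, None],
    "Label":    [True, True, False],
})

# Option 1: replace each missing value with the mean of its column.
features = df.drop(columns="Label")
imputed = features.fillna(features.mean())

# Option 2: replace missing values with a constant such as 0,
# if that matches the meaning of the feature.
# imputed = features.fillna(0)

print(imputed.join(df["Label"]))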

Most related texts/items based on common tags algorithm in Scala

I have 50M different texts as input from which the top (up to) 10 most relevant tags have been extracted.
There are ~100K distinct tags.
I would like to develop an algorithm that, given a text id T1 as input (present in the original input data set), computes the most related text id T2, where T2 is the text that has the most tags in common with T1.
id | tags
-------------
1 | A,B,C,D
2 | B,D,E,F
3 | A,B,D,E
4 | B,C,E
In the example above, the most similar id to 1 is 3, as they have three tags (A, B and D) in common.
This seems to be the same algorithm that shows the most related questions in StackOverflow.
My first idea was to map both texts and tags to integers to build a big (50M * 100K) binary matrix that is very sparse.
This matrix fits in memory, but I do not know how to use it.
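For reference, one common way to use such a sparse structure is an inverted index from tag to the ids of the texts carrying it; computing overlaps for a query then only touches texts that share at least one tag with T1. A minimal sketch (in Python for brevity; the same structure translates directly to Scala or Java collections):

from collections import defaultdict, Counter

texts = {1: {"A", "B", "C", "D"},
         2: {"B", "D", "E", "F"},
         3: {"A", "B", "D", "E"},
         4: {"B", "C", "E"}}

# Inverted index: tag -> ids of the texts that carry it.
index = defaultdict(set)
for text_id, tags in texts.items():
    for tag in tags:
        index[tag].add(text_id)

def most_related(t1):
    # Id of the text sharing the most tags with t1 (ignoring t1 itself).
    overlap = Counter()
    for tag in texts[t1]:
        for other in index[tag]:
            if other != t1:
                overlap[other] += 1
    return overlap.most_common(1)[0][0] if overlap else None

print(most_related(1))   # 3, which shares A, B and D with text 1

With at most 10 tags per text, a lookup only walks the posting lists of those 10 tags; for very common tags you would still want to cap or pre-rank their lists to stay within a few milliseconds.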
As this is for a web application, I would like to deliver the result in real time conditions (at most a few ms, with possible multi-threading).
My main languages are Scala and Java.
Thanks for your help

How to unlock all the chests in the treasure trove?

I heard the following problem in Google Code Jam. The competition has ended now, so it's okay to talk about it: https://code.google.com/codejam/contest/2270488/dashboard#s=p3
Following an old map, you have stumbled upon the Dread Pirate Larry's secret treasure trove!
The treasure trove consists of N locked chests, each of which can only be opened by a key of a specific type. Furthermore, once a key is used to open a chest, it can never be used again. Inside every chest, you will of course find lots of treasure, and you might also find one or more keys that you can use to open other chests. A chest may contain multiple keys of the same type, and you may hold any number of keys.
You already have at least one key and your map says what other keys can be found inside the various chests. With all this information, can you figure out how to unlock all the chests?
If there are multiple ways of opening all the chests, choose the "lexicographically smallest" way.
For the competition there were two datasets, a small dataset with troves of at most 20 chests, and a large dataset with troves as big as 200 chests.
My backtracking branch-and-bound algorithm was only fast enough to solve the small dataset. What's a faster algorithm?
I'm not used to algorithm competitions. I was a bit disturbed by this question: to cut branches in a general branch & bound algorithm, you need to have an idea of the kind of input you'll get.
Basically, I looked at some of the inputs provided in the small set. What happens in this set is that you end up on paths where you can't get any key of some type t: all the remaining keys of type t are in chests which have a lock of that same type t, so you are not able to access them anymore.
So you could build the following cut criterion: if there is some chest of type t still to be opened, all remaining keys of type t are inside those chests, and you don't have any more keys of this type, then you won't find a solution in this branch.
You can generalize the cut criterion. Consider a graph where the vertices are key types and there is an edge from t1 to t2 if there is still some closed chest with lock type t1 that contains a key of type t2. If you have a key of type t1, then you can open one of the chests of that type and get at least one key to one of the chests accessible via its outgoing edges. If you follow a path, you know you can open at least one chest of each lock type along the path. But if there is no path to a vertex, there is no way you will ever open a chest represented by that vertex.
So here is the cutting algorithm: compute all vertices reachable from the set of key types you have in your possession. If there are unreachable vertices for which there are still closed chests, then you cut the branch (this means you backtrack).
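In code, that reachability cut could look roughly like this (a sketch, not my actual contest program; closed_chests is a list of (lock_type, keys_inside) pairs and my_keys the key types currently held):

def branch_is_dead(closed_chests, my_keys):
    # Edge t1 -> t2 whenever a still-closed chest with lock type t1
    # contains a key of type t2.
    edges = {}
    for lock_type, keys_inside in closed_chests:
        edges.setdefault(lock_type, set()).update(keys_inside)

    # Flood-fill the key-type graph from the types we already hold.
    reachable = set(my_keys)
    frontier = list(reachable)
    while frontier:
        t = frontier.pop()
        for nxt in edges.get(t, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                frontier.append(nxt)

    # The branch is dead if some still-needed lock type is unreachable.
    needed = {lock_type for lock_type, _ in closed_chests}
    return not needed <= reachable

# No key of type 2 in hand and the only type-2 key sits behind a type-2
# lock, so this branch can be cut:
print(branch_is_dead([(1, [1]), (2, [2])], [1]))   # True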
This was enough to solve the large set. But I had to add the first condition you wrote:
if any(required > keys_across_universe):
    return False
Otherwise, it wouldn't work. This means that my solution is weak when the number of keys is very close to the number of chests.
This cut condition is not cheap. It can actually cost O(N²). But it cuts so many branches that it is definitely worth it... provided the data sets are nice (fair?).
Surprisingly this problem is solvable via a greedy algorithm. I, too, implemented it as a memoized depth-first search. Only afterwards did I notice that the search never backtracked, and there were no hits to the memoization cache. Only two checks on the state of the problem and the partial solution are necessary to know whether a particular solution branch should be pursued further. They are easily illustrated with a pair of examples.
First, consider this test case:
Chest Number | Key Type To Open Chest | Key Types Inside
--------------+--------------------------+------------------
1 | 2 | 1
2 | 1 | 1 1
3 | 1 | 1
4 | 2 | 1
5 | 2 | 2
Initial keys: 1 1 2
Here there are a total of only two keys of type 2 in existence: one in chest #5, and one in your possession initially. However, three chests require a key of type 2 to be opened. We need more keys of this type than exist, so clearly it is impossible to open all of the chests. We know immediately that the problem is impossible. I call this key counting the "global constraint." We only need to check it once. I see this check is already in your program.
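For reference, this global constraint is just a multiset comparison, something along these lines (a sketch with invented names, not the asker's actual code):

from collections import Counter

def globally_possible(chests, initial_keys):
    # chests: list of (lock_type, keys_inside).  Returns False if, over the
    # whole problem, more keys of some type are required than exist anywhere.
    required = Counter(lock_type for lock_type, _ in chests)
    available = Counter(initial_keys)
    for _, keys_inside in chests:
        available.update(keys_inside)
    return all(available[t] >= n for t, n in required.items())

# The impossible test case above: three chests need a type-2 key,
# but only two type-2 keys exist in total.
chests = [(2, [1]), (1, [1, 1]), (1, [1]), (2, [1]), (2, [2])]
print(globally_possible(chests, [1, 1, 2]))   # False

It only needs to run once, before the search starts.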
With just this check and a memoized depth-first search (like yours!), my program was able to solve the small problem, albeit slowly: it took about a minute. Knowing that the program wouldn't be able to solve the large input in sufficient time, I took a look at the test cases from the small set. Some test cases were solved very quickly while others took a long time. Here's one of the test cases where the program took a long time to find a solution:
Chest Number | Key Type To Open Chest | Key Types Inside
--------------+--------------------------+------------------
1 | 1 | 1
2 | 6 |
3 | 3 |
4 | 1 |
5 | 7 | 7
6 | 5 |
7 | 2 |
8 | 10 | 10
9 | 8 |
10 | 3 | 3
11 | 9 |
12 | 7 |
13 | 4 | 4
14 | 6 | 6
15 | 9 | 9
16 | 5 | 5
17 | 10 |
18 | 2 | 2
19 | 4 |
20 | 8 | 8
Initial keys: 1 2 3 4 5 6 7 8 9 10
After a brief inspection, the structure of this test case is obvious. We have 20 chests and 10 keys. Each of the ten key types will open exactly two chests. Of the two chests that are openable with a given key type, one contains another key of the same type, and the other contains no keys at all. The solution is obvious: for each key type, we have to first open the chest that will give us another key in order to be able to open the second chest that also requires a key of that type.
The solution is obvious to a human, but the program was taking a long time to solve it, since it didn't yet have any way to detect whether there were any key types that could no longer be acquired. The "global constraint" concerned the quantities of each type of key, but not the order in which they were to be obtained. This second constraint concerns instead the order in which keys can be obtained but not their quantity. The question is simply: for each key type I will need, is there some way I can still get it?
Here's the code I wrote to check this second constraint:
# Verify that all needed key types may still be reachable
def still_possible(chests, keys, key_req, keys_inside):
    keys = set(keys)          # set of key types currently in my possession
    chests = chests.copy()    # set of not-yet-opened chests
    # key_req is a dictionary mapping chests to the required key type
    # keys_inside is a dict of Counters giving the keys inside each chest

    def openable(chest):
        return key_req[chest] in keys

    # As long as there are chests that can be opened, keep opening
    # those chests and take the keys.  Here the key is not consumed
    # when a chest is opened.
    openable_chests = [c for c in chests if openable(c)]
    while openable_chests:
        for chest in openable_chests:
            keys |= set(keys_inside[chest])
            chests.remove(chest)
        openable_chests = [c for c in chests if openable(c)]

    # If any chests remain unopened, then they are unreachable no
    # matter what order chests are opened, and thus we've gotten into
    # a situation where no solution exists.
    return not chests         # true iff chests is empty
If this check fails, we can immediately abort a branch of the search. After implementing this check, my program ran very fast, requiring something like 10 seconds instead of 1 minute. Moreover, I noticed that the number of cache hits dropped to zero, and, furthermore, the search never backtracked. I removed the memoization and converted the program from a recursive to an iterative form. The Python solution was then able to solve the "large" test input in about 1.5 seconds. A nearly identical C++ solution compiled with optimizations solves the large input in 0.25 seconds.
A proof of the correctness of this iterative, greedy algorithm is given in the official Google analysis of the problem.
I was not able to solve this problem either. My algorithm was too slow at first; then I added some enhancements, but I guess I failed on something else:
As Valentin said, I counted the available keys to quickly discard tricky cases
Tried to ignore chests without keys inside on the first hit, skipping to the next one
Skipped solutions starting with higher-numbered chests
Checked for "key loops", i.e. cases where the available keys were not enough to open a chest (a chest containing the key for itself inside)
Performance was good (<2 secs for the 25 small cases); I manually checked the cases and it seemed to work properly, but I got an incorrect answer anyway :P
