I was wondering if it is possible to have more than one time series identifier column in the model? Let's assume I'd like to create a forecast at a product and store level (which the documentation suggests should be possible).
If I select product as the series identifier, the only options I have left for store is either a covariate or an attribute and neither is applicable in this scenario.
Would concatenating product and store and using the individual product and store code values for that concatenated ID as attributes be a solution? It doesn't feel right, but I can't see any other option - am I missing something?
Note: I understand that this feature of Vertex AI is currently in preview and that because of that the options may be limited.
There isn't an alternate way to assign 2 or more Time Series Identifiers in the Forecasting Model on Vertex AI. The "Forecasting model" is in the "Preview" Product launch stage, as you are aware, with all consequences of that fact the options are limited. Please refer to this doc for more information about the best practices for data preparation to train the forecasting model.
As a workaround, the two columns can be concatenated and assigned a Time Series Identifier on that concatenated column, as you have mentioned in the question. This way, the concatenated column carries more contextual information into the training of the model.
Just to follow up on Vishal's (correct) answer in case someone is looking this up in the future.
Yes, concatenating is the only option for now as there can only be one time series identifier (I would hope this changes in the future). Having said that, I've experimented with adding the individual identifiers in the data as categorical attributes and it works actually pretty well. This way I have forecast generated at a product/store level, but I can aggregate all forecasts for individual products and the results are not much off from the models trained on aggregated data (obviously that would depend on the demand classification and selected optimisation method amongst other factors).
Also, an interesting observation. When you include things like product descriptions, you can classify them either as categorical or text. I wasn't able to find in the documentation if the model would only use unigrams (which is what the column statistics in the console would suggest) or a number of n-grams but it is definitely something you would want to experiment with with your data. My dataset was actually showing a better accuracy when the categorical classification was used, which is a bit counter-intuitive as it feels like redundant information, although it's hard to tell as the documentation isn't very detailed. It is likely to be specific to my data set, so as I said make sure you experiment with yours.
I'm loading animation data from a glTF file. The glTF 2.0 spec mandates that animation data is stored as two arrays of data called "input" and "output". The input array stores times, in seconds, and the output array stores sampled values for the given time.
Image source
Given some time value t (say, 1.2 seconds like in the diagram) how do I find the two closest adjacent samples to interpolate? Is binary search the most efficient method?
Edit
I've found a bit of additional context. When a user suggested adding support for fixed timesteps to the glTF spec, someone replied:
You can easily sample to whatever fix frame rate you need when loading the animation data. There is no need to bloat the animation size in the transmitted data.
There are also some notes in an old Samples repository about this:
- All animations time-lines are based on a global time. However, an
application can re-base the animation times and choose to play
animations at different times.
- No support for fixed rate animations. Animations always use a time-line
with a time for each key frame, possibly with a variable delta
time in between key frames. To avoid a statefull animations system
(which would be a terrible idea) an application has to look up the
surrounding key frames from the time-line every frame. This can be
sped up by using a binary search but for long animations with many
hundreds of key frames this is still not cheap because it effectively
results in random memory access. To make matters worse there may be
a separate "animation" for every channel of a joint (scale, translation,
rotation) with a different time-line. Some models actually store a
different time-line per joint or channel, even though the actual
time-lines are the same. As a result, an expensive lookup may be
needed for every joint or even every channel, every frame. This is
particularly silly considering many models store a time-line that
is effectively fixed rate. Significant data preprocessing at load
time would be required to identify all the cases that can be optimized.
So currently, the two potential solutions I'm aware of are:
Binary search each timeline for key t and consider the closest node(s).
Re-sample all timelines at load time (or in a preprocess step) with respect to a fixed dt and use a direct indexed lookup based on t / fixed_dt.
Would be very curious to know about any alternatives.
What is the best practice for including Created By, Created Timestamp, Modified By, Modified Timestamp into a dimensional model?
The first two never change. The last two will change slowly for some data elements but rapidly for other data elements. However, I'd prefer a consistent approach so that reporting users become familiar with it.
Assume that I really only care about the most recent value; I don't need history.
Is it best to put them into a dimension knowing that, for highly-modified data, that dimension is going to change often? Or, is it better to put them into the fact table, treating the unchanging Created information much the same way a sales order number becomes a degenerate dimension?
In my answer I will assume that these ADDITIONAL Columns do NOT define the validity of the Dimensional record and that you are talking about a Slowly Changing Dimension type 1
So we are in fact talking about dimensional metadata here, about who / which process created or modified the dimensional row.
I would always put this kind of metadata in the dimension because it:
Is related to changes in the dimension. These changes happen independent of the fact table
In general it is advised to keep Fact tables as small as possible. If your Fact table would contain 5 Dimensions, this would lead to adding 5*4=20 extra columns to your fact table which will seriously bloath it and impact performance.
I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "Latent Variables" because then you can use the http://en.wikipedia.org/wiki/Em_algorithm to hillclimb to promising solutions for this problem, from random starts.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity(e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly. There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data or data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August", and a unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but have a list with both dates and whatever probabilities you think appropriate...)
beware assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.
I'm trying to think of a naming convention that accurately conveys what's going on within a class I'm designing. On a secondary note, I'm trying to decide between two almost-equivalent user APIs.
Here's the situation:
I'm building a scientific application, where one of the central data structures has three phases: 1) accumulation, 2) analysis, and 3) query execution.
In my case, it's a spatial modeling structure, internally using a KDTree to partition a collection of points in 3-dimensional space. Each point describes one or more attributes of the surrounding environment, with a certain level of confidence about the measurement itself.
After adding (a potentially large number of) measurements to the collection, the owner of the object will query it to obtain an interpolated measurement at a new data point somewhere within the applicable field.
The API will look something like this (the code is in Java, but that's not really important; the code is divided into three sections, for clarity):
// SECTION 1:
// Create the aggregation object, and get the zillion objects to insert...
ContinuousScalarField field = new ContinuousScalarField();
Collection<Measurement> measurements = getMeasurementsFromSomewhere();
// SECTION 2:
// Add all of the zillion objects to the aggregation object...
// Each measurement contains its xyz location, the quantity being measured,
// and a numeric value for the measurement. For example, something like
// "68 degrees F, plus or minus 0.5, at point 1.23, 2.34, 3.45"
foreach (Measurement m : measurements) {
field.add(m);
}
// SECTION 3:
// Now the user wants to ask the model questions about the interpolated
// state of the model. For example, "what's the interpolated temperature
// at point (3, 4, 5)
Point3d p = new Point3d(3, 4, 5);
Measurement result = field.interpolateAt(p);
For my particular problem domain, it will be possible to perform a small amount of incremental work (partitioning the points into a balanced KDTree) during SECTION 2.
And there will be a small amount of work (performing some linear interpolations) that can occur during SECTION 3.
But there's a huge amount of work (constructing a kernel density estimator and performing a Fast Gauss Transform, using Taylor series and Hermite functions, but that's totally beside the point) that must be performed between sections 2 and 3.
Sometimes in the past, I've just used lazy-evaluation to construct the data structures (in this case, it'd be on the first invocation of the "interpolateAt" method), but then if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch.
In other projects, I've required the user to explicitly call an "object.flip()" method, to switch from "append mode" into "query mode". The nice this about a design like this is that the user has better control over the exact moment when the hard-core computation starts. But it can be a nuisance for the API consumer to keep track of the object's current mode. And besides, in the standard use case, the caller never adds another value to the collection after starting to issue queries; data-aggregation almost always fully precedes query preparation.
How have you guys handled designing a data structure like this?
Do you prefer to let an object lazily perform its heavy-duty analysis, throwing away the intermediate data structures when new data comes into the collection? Or do you require the programmer to explicitly flip the data structure from from append-mode into query-mode?
And do you know of any naming convention for objects like this? Is there a pattern I'm not thinking of?
ON EDIT:
There seems to be some confusion and curiosity about the class I used in my example, named "ContinuousScalarField".
You can get a pretty good idea for what I'm talking about by reading these wikipedia pages:
http://en.wikipedia.org/wiki/Scalar_field
http://en.wikipedia.org/wiki/Vector_field
Let's say you wanted to create a topographical map (this is not my exact problem, but it's conceptually very similar). So you take a thousand altitude measurements over an area of one square mile, but your survey equipment has a margin of error of plus-or-minus 10 meters in elevation.
Once you've gathered all the data points, you feed them into a model which not only interpolates the values, but also takes into account the error of each measurement.
To draw your topo map, you query the model for the elevation of each point where you want to draw a pixel.
As for the question of whether a single class should be responsible for both appending and handling queries, I'm not 100% sure, but I think so.
Here's a similar example: HashMap and TreeMap classes allow objects to be both added and queried. There aren't separate interfaces for adding and querying.
Both classes are also similar to my example, because the internal data structures have to be maintained on an ongoing basis in order to support the query mechanism. The HashMap class has to periodically allocate new memory, re-hash all objects, and move objects from the old memory to the new memory. A TreeMap has to continually maintain tree balance, using the red-black-tree data structure.
The only difference is that my class will perform optimally if it can perform all of its calculations once it knows the data set is closed.
If an object has two modes like this, I would suggest exposing two interfaces to the client. If the object is in append mode, then you make sure that the client can only ever use the IAppendable implementation. To flip to query mode, you add a method to IAppendable such as AsQueryable. To flip back, call IQueryable.AsAppendable.
You can implement IAppendable and IQueryable on the same object, and keep track of the state in the same way internally, but having two interfaces makes it clear to the client what state the object is in, and forces the client to deliberately make the (expensive) switch.
I generally prefer to have an explicit change, rather than lazily recomputing the result. This approach makes the performance of the utility more predictable, and it reduces the amount of work I have to do to provide a good user experience. For example, if this occurs in a UI, where do I have to worry about popping up an hourglass, etc.? Which operations are going to block for a variable amount of time, and need to be performed in a background thread?
That said, rather than explicitly changing the state of one instance, I would recommend the Builder Pattern to produce a new object. For example, you might have an aggregator object that does a small amount of work as you add each sample. Then instead of your proposed void flip() method, I'd have a Interpolator interpolator() method that gets a copy of the current aggregation and performs all your heavy-duty math. Your interpolateAt method would be on this new Interpolator object.
If your usage patterns warrant, you could do simple caching by keeping a reference to the interpolator you create, and return it to multiple callers, only clearing it when the aggregator is modified.
This separation of responsibilities can help yield more maintainable and reusable object-oriented programs. An object that can return a Measurement at a requested Point is very abstract, and perhaps a lot of clients could use your Interpolator as one strategy implementing a more general interface.
I think that the analogy you added is misleading. Consider an alternative analogy:
Key[] data = new Key[...];
data[idx++] = new Key(...); /* Fast! */
...
Arrays.sort(data); /* Slow! */
...
boolean contains = Arrays.binarySearch(data, datum) >= 0; /* Fast! */
This can work like a set, and actually, it gives better performance than Set implementations (which are implemented with hash tables or balanced trees).
A balanced tree can be seen as an efficient implementation of insertion sort. After every insertion, the tree is in a sorted state. The predictable time requirements of a balanced tree are due to the fact the cost of sorting is spread over each insertion, rather than happening on some queries and not others.
The rehashing of hash tables does result in less consistent performance, and because of that, aren't appropriate for certain applications (perhaps a real-time microcontroller). But even the rehashing operation depends only on the load factor of the table, not the pattern of insertion and query operations.
For your analogy to hold strictly, you would have to "sort" (do the hairy math) your aggregator with each point you add. But it sounds like that would be cost prohibitive, and that leads to the builder or factory method patterns. This makes it clear to your clients when they need to be prepared for the lengthy "sort" operation.
Your objects should have one role and responsibility. In your case should the ContinuousScalarField be responsible for interpolating?
Perhaps you might be better off doing something like:
IInterpolator interpolator = field.GetInterpolator();
Measurement measurement = Interpolator.InterpolateAt(...);
I hope this makes sense, but without fully understanding your problem domain it's hard to give you a more coherent answer.
"I've just used lazy-evaluation to construct the data structures" -- Good
"if the user calls the "field.add()" method again, I have to completely discard those data structures and start over from scratch." -- Interesting
"in the standard use case, the caller never adds another value to the collection after starting to issue queries" -- Whoops, false alarm, actually not interesting.
Since lazy eval fits your use case, stick with it. That's a very heavily used model because it is so delightfully reliable and fits most use cases very well.
The only reason for rethinking this is (a) the use case change (mixed adding and interpolation), or (b) performance optimization.
Since use case changes are unlikely, you might consider the performance implications of breaking up interpolation. For example, during idle time, can you precompute some values? Or with each add is there a summary you can update?
Also, a highly stateful (and not very meaningful) flip method isn't so useful to clients of your class. However, breaking interpolation into two parts might still be helpful to them -- and help you with optimization and state management.
You could, for example, break interpolation into two methods.
public void interpolateAt( Point3d p );
public Measurement interpolatedMasurement();
This borrows the relational database Open and Fetch paradigm. Opening a cursor can do a lot of preliminary work, and may start executing the query, you don't know. Fetching the first row may do all the work, or execute the prepared query, or simply fetch the first buffered row. You don't really know. You only know that it's a two part operation. The RDBMS developers are free to optimize as they see fit.
Do you prefer to let an object lazily perform its heavy-duty analysis,
throwing away the intermediate data structures when new data comes
into the collection? Or do you require the programmer to explicitly
flip the data structure from from append-mode into query-mode?
I prefer using data structures that allow me to incrementally add to it with "a little more work" per addition, and to incrementally pull the data I need with "a little more work" per extraction.
Perhaps if you do some "interpolate_at()" call in the upper-right corner of your region, you only need to do calculations involving the points in that upper-right corner,
and it doesn't hurt anything to leave the other 3 quadrants "open" to new additions.
(And so on down the recursive KDTree).
Alas, that's not always possible -- sometimes the only way to add more data is to throw away all the previous intermediate and final results, and re-calculate everything again from scratch.
The people who use the interfaces I design -- in particular, me -- are human and fallible.
So I don't like using objects where those people must remember to do things in a certain way, or else things go wrong -- because I'm always forgetting those things.
If an object must be in the "post-calculation state" before getting data out of it,
i.e. some "do_calculations()" function must be run before the interpolateAt() function gets valid data,
I much prefer letting the interpolateAt() function check if it's already in that state,
running "do_calculations()" and updating the state of the object if necessary,
and then returning the results I expected.
Sometimes I hear people describe such a data structure as "freeze" the data or "crystallize" the data or "compile" or "put the data into an immutable data structure".
One example is converting a (mutable) StringBuilder or StringBuffer into an (immutable) String.
I can imagine that for some kinds of analysis, you expect to have all the data ahead of time,
and pulling out some interpolated value before all the data has put in would give wrong results.
In that case,
I'd prefer to set things up such that the "add_data()" function fails or throws an exception
if it (incorrectly) gets called after any interpolateAt() call.
I would consider defining a lazily-evaluated "interpolated_point" object that doesn't really evaluate the data right away, but only tells that program that sometime in the future that data at that point will be required.
The collection isn't actually frozen, so it's OK to continue adding more data to it,
up until the point something actually extract the first real value from some "interpolated_point" object,
which internally triggers the "do_calculations()" function and freezes the object.
It might speed things up if you know not only all the data, but also all the points that need to be interpolated, all ahead of time.
Then you can throw away data that is "far away" from the interpolated points,
and only do the heavy-duty calculations in regions "near" the interpolated points.
For other kinds of analysis, you do the best you can with the data you have, but when more data comes in later, you want to use that new data in your later analysis.
If the only way to do that is to throw away all the intermediate results and recalculate everything from scratch, then that's what you have to do.
(And it's best if the object automatically handled this, rather than requiring people to remember to call some "clear_cache()" and "do_calculations()" function every time).
You could have a state variable. Have a method for starting the high level processing, which will only work if the STATE is in SECTION-1. It will set the state to SECTION-2, and then to SECTION-3 when it is done computing. If there's a request to the program to interpolate a given point, it will check if the state is SECTION-3. If not, it will request the computations to begin, and then interpolate the given data.
This way, you accomplish both - the program will perform its computations at the first request to interpolate a point, but can also be requested to do so earlier. This would be convenient if you wanted to run the computations overnight, for example, without needing to request an interpolation.