Do I need to recreate a dataset for every pipeline run? - google-cloud-vertex-ai

I'm currently working through the tutorial "https://cloud.google.com/vertex-ai/docs/pipelines/build-pipeline" to remind myself of the Vertex pipeline format, and it uses the component:
ImageDatasetCreateOp
Every time I run the pipeline, it recreates the dataset, which takes a long time (10+ minutes). What is the recommended way to work with dataset objects? Is it really to recreate them every single pipeline run?
It's odd because I used to work with Vertex a year ago, and don't remember ever doing that, but I can't for the life of me find any reference to "create if not exists", or even how other people get around this online.
(what I tried: running the pipeline; it recreates the dataset every time, which is very time-consuming)
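For reference, the dataset step in the pipeline looks roughly like this (a sketch from memory of the linked tutorial; the import path and schema constant can differ between library versions, and the display name and GCS source are placeholders):

from kfp import dsl
from google.cloud import aiplatform
from google_cloud_pipeline_components.v1.dataset import ImageDatasetCreateOp

@dsl.pipeline(name="image-classification-pipeline")
def pipeline(project: str, gcs_source: str):
    # This step creates and imports a brand-new managed dataset on every
    # pipeline run, which is where the 10+ minutes go.
    ds_op = ImageDatasetCreateOp(
        project=project,
        display_name="my-image-dataset",  # placeholder
        gcs_source=gcs_source,
        import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification,
    )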

Related

Designing an algorithm for detecting anomalies and statistical significance for ordinal data using python

Firstly, I would like to apologise for the detailed problem statement. Being a novice, I couldn't express it in fewer words.
Environment Setup Details:
To give some background, I work at a cloud company where we have multiple servers geographically located on every continent. So we have a hierarchy like this:
Several partitions
Each partition has 7 PoPs
Each PoP has multiple nodes, all set up with redundancy
TURN servers routing traffic to each node depending on the client location
Actual clients: iOS, Android, Mac, Windows, etc.
Now, every time a user uses our product/service, they leave a rating out of 5, with 5 being outstanding. This data is stored in our databases, and we mine and analyse it to pinpoint the exact issue on any particular day.
For example, if users from Asia are giving more bad ratings this Tuesday than on a usual Tuesday, what factors could have caused it - is it something to do with the client app version, a server release, physical factors, packet loss, increased round-trip delay, etc.?
What we have done:
Until now we have been using visualization tools to track each of these metrics separately per day, to see the trends and detect issues manually.
But with the growing number of microservices, this is becoming more difficult by the day. Now we want to automate it using python/pandas.
What I want to do:
If the ratings drop on a particular day/hour, I run the script and it should do all the manual work: take all the permutations and combinations of the factors and list the exact combinations that could have led to the drop.
The second step would be to check whether the drop was statistically significant, given the varying number of ratings.
What I know:
I understand that I can do this with pandas by creating a dataframe for each predictor variable and analysing it per variable.
Then I can apply tests such as the Mann-Whitney U test, which is suited to ordinal data.
What I need help with:
But I just wanted to know if there is a better way to do it. It is perfectly fine if there is a learning curve involved; I can learn and do it. I just wanted some help choosing the right approach for this.
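A minimal sketch of the kind of breakdown described above, assuming a ratings DataFrame with hypothetical columns date, rating, continent, client, app_version and server_release, and using scipy's Mann-Whitney U test to compare the bad day against a baseline:

from itertools import combinations

from scipy.stats import mannwhitneyu

FACTORS = ["continent", "client", "app_version", "server_release"]  # hypothetical column names

def suspicious_segments(df, bad_day, baseline_days, alpha=0.05):
    """For each factor combination, test whether the bad day's ratings are lower than usual."""
    results = []
    for r in (1, 2):  # single factors and pairs; extend to deeper combinations if needed
        for combo in combinations(FACTORS, r):
            for keys, segment in df.groupby(list(combo)):
                bad = segment.loc[segment["date"] == bad_day, "rating"]
                base = segment.loc[segment["date"].isin(baseline_days), "rating"]
                if len(bad) < 10 or len(base) < 10:
                    continue  # too few ratings to draw a conclusion
                stat, p = mannwhitneyu(bad, base, alternative="less")
                if p < alpha:
                    results.append((combo, keys, len(bad), p))
    return sorted(results, key=lambda row: row[-1])  # most significant segments first

The second step the question mentions (was the drop significant given the number of ratings?) is what the p-value and the minimum-sample-size check stand in for here.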

K-Means on time series data with Apache Spark

I have a data pipeline system where all events are stored in Apache Kafka. There is an event processing layer, which consumes and transforms that data (time series) and then stores the resulting data set into Apache Cassandra.
Now I want to use Apache Spark to train some machine learning models for anomaly detection. The idea is to run the k-means algorithm on the past data, for example for every single hour of the day.
For example, I can select all events from 4pm-5pm and build a model for that interval. If I apply this approach, I will get exactly 24 models (centroids for every single hour).
If the algorithm performs well, I can reduce the size of my interval to be for example 5 minutes.
Is it a good approach to do anomaly detection on time series data?
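(Roughly, the plan sketched in code, assuming the Cassandra data has already been loaded into a Spark DataFrame called events with a hypothetical event_time timestamp column and a features vector column:)

from pyspark.ml.clustering import KMeans
from pyspark.sql import functions as F

# One model per hour of the day, trained on historical events from that hour.
models = {}
for h in range(24):
    hourly = events.filter(F.hour("event_time") == h)
    kmeans = KMeans(k=5, seed=42, featuresCol="features")  # k is a placeholder here
    models[h] = kmeans.fit(hourly)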
I have to say that the strategy is good for finding the outliers, but you need to take care of a few steps. First, using all events from every 5-minute window to create a new centroid for each interval is, I think, not a good idea,
because with too many centroids it becomes really hard to find the outliers, which is exactly what you want.
So let's look at a good strategy:
Find a good number of clusters (K) for your K-means.
This is really important: if you have too many or too few, you get a bad representation of reality. So select a good K.
Take a good training set.
You don't need to use all the data to create a model every time and every day. You should take a sample of what is "normal" for you. You don't need to include what is not normal, because that is what you want to find later. Use this set to create your model and then find the clusters.
Test it!
You need to test whether it is working well or not. Do you have any examples of what you consider strange? And a set that you know is not strange? Take those and check whether it works. Cross-validation can help with this.
So, is your idea good? Yes! It works, but make sure not to overwork the cluster. And of course you can use each day's data to train your model further, but do the centroid-finding step only once a day, and let the Euclidean distance decide what does or doesn't belong in your groups.
I hope that helped you!
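A rough sketch of that strategy with PySpark's MLlib, assuming hypothetical DataFrames normal_events (the "normal" training set) and new_events, each with a features vector column, and a threshold derived from the training distances:

import numpy as np
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# 1. Find a good K: fit a few candidates and compare silhouette scores.
def pick_k(train_df, candidates=(4, 8, 16)):
    evaluator = ClusteringEvaluator(featuresCol="features")
    scores = {}
    for k in candidates:
        model = KMeans(k=k, seed=1, featuresCol="features").fit(train_df)
        scores[k] = evaluator.evaluate(model.transform(train_df))
    return max(scores, key=scores.get)

# 2. Train only on data you consider "normal".
k = pick_k(normal_events)
model = KMeans(k=k, seed=1, featuresCol="features").fit(normal_events)
centers = [np.array(c) for c in model.clusterCenters()]

# 3. Flag points whose Euclidean distance to the nearest centroid is large.
@F.udf(DoubleType())
def dist_to_nearest_center(v):
    point = np.array(v.toArray())
    return float(min(np.linalg.norm(point - c) for c in centers))

scored = new_events.withColumn("dist", dist_to_nearest_center("features"))
anomalies = scored.filter(F.col("dist") > threshold)  # threshold chosen from training-set distances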

Algorithm to monitor file changes

What's a good way to monitor and find the optimal times when specific files on remote sites change? I want to limit how often we have to download a file by finding the pattern of when the file is generally updated...
We download files (product feeds) with data ranging from 1Mb to 200Mb on a regular basis
Some of these files are updated every hour, some a few days a week, others once a month
The files aren't always updated at the exact same time, but there's generally a pattern within a certain period
We only want to download the files when we know they've changed
We want to download the files as soon as possible after they've changed
A simple way to solve this would be to check the files using a HTTP HEAD request every hour and trigger the download when we notice a change in Last-modified or Content-Length. Unfortunately we can't rely on the HTTP headers as they're generally missing or give no indication as to the actual time/size of the file. We often have to download the whole file just to determine if it's changed.
First I thought I could write a process that checks the file every 1, 2, 4, 8, ... hours (doubling on each iteration) until it finds that the file has changed, and then just stick with that interval. This probably works, but it's not optimal.
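A minimal sketch of that doubling idea, assuming the whole file has to be downloaded and hashed to detect a change (since the headers are unreliable); the helper name and interval bounds are made up:

import hashlib
import time

import requests

def fetch_hash(url):
    """Download the file and hash its contents, since the headers can't be trusted."""
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    return hashlib.sha256(resp.content).hexdigest()

def find_interval(url, start_hours=1, max_hours=64):
    """Double the check interval until the file is seen to change, then keep that interval."""
    last_hash = fetch_hash(url)
    interval = start_hours
    while interval <= max_hours:
        time.sleep(interval * 3600)
        current = fetch_hash(url)
        if current != last_hash:
            return interval  # the file changed within this window; poll at roughly this rate
        interval *= 2
    return max_hours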
To optimize it a bit I thought of tweaking the interval number to find a sweet spot. Then all kinds of scenarios started appearing where my ideas would fail - such as weekends and public holidays when the files wouldn't be updated because people aren't at work. There is a pattern, but there are exceptions to it.
Next I started reading about "step detection" algorithms and soon realized I was way out of my depth. How do people solve these problems?
I'm guessing the solution will involve some form of data history, but I'm fumbling with how to design the algorithm that collects the data and how to derive the pattern from it. Hoping someone has dealt with this before.

When timing how long a quick process runs, how many runs should be used?

Let's say I am going to run process X and see how long it takes.
I am going to save into a database a date I ran this process, and the time it took. I want to know what to put into the DB.
Process X almost always runs under 1500ms, so this is a short process. It usually runs between 500 and 1500ms, quite a range (3x difference).
My question is, how many "runs" should be saved into the DB as a single run?
Every run saved into the DB as its own row?
5 runs, averaged, then save that time?
10 runs averaged?
20 runs, remove anything more than 2 std deviations away, and save everything inside that range?
Does anyone have any good info backing up one of these approaches?
Save the data for every run into its own row. Then later you can use and analyze the data however you like... i.e., all of the other options you listed can be performed after the fact. It's not really possible for someone else to draw meaningful conclusions about how to average/analyze the data without knowing more about what's going on.
The fastest run is the one that most accurately times only your code.
All slower runs are slower because of noise introduced by the operating system scheduler.
The variance you experience is going to differ from machine to machine, and even on identical machines, the set of runnable processes will introduce noise.
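A minimal sketch of the one-row-per-run approach, using sqlite3 and a hypothetical run_process_x callable standing in for process X:

import sqlite3
import time

conn = sqlite3.connect("timings.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs (ran_at TEXT, duration_ms REAL)")

def record_run(run_process_x):
    """Time one run and store it as its own row; aggregation can happen later in SQL."""
    start = time.perf_counter()
    run_process_x()  # hypothetical: the process being measured
    duration_ms = (time.perf_counter() - start) * 1000
    conn.execute(
        "INSERT INTO runs (ran_at, duration_ms) VALUES (datetime('now'), ?)",
        (duration_ms,),
    )
    conn.commit()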
None of the above. Bran is close, though. You should save every measurement, but don't average them. The average (arithmetic mean) can be very misleading in this type of analysis, because some of your measurements will be much longer than the others. This will happen because things can interfere with your process - even on 'clean' test systems. It can also happen because your process may not be as deterministic as you might think.
Some people think that simply taking more samples (running more iterations) and averaging the measurements will give them better data. It doesn't. The more you run, the more likely it is that you will encounter a perturbing event, making the average overly high.
A better way to do this is to run as many measurements as you can (time permitting). 100 is not a bad number, but 30-ish can be enough.
Then sort these by magnitude and graph them. Note that this is not a normal distribution. Compute some simple statistics: mean, median, min, max, lower quartile, upper quartile.
Contrary to some guidance, do not 'throw away' extreme values or 'outliers'. These are often the most interesting measurements. For example, you may establish a nice baseline and then look for departures. Understanding these departures will help you fully understand how your process works, how the system affects your process, and what can interfere with it. It will often readily expose bugs.
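A short sketch of that summary, assuming durations_ms is the full list of per-run timings in milliseconds (nothing thrown away):

import statistics

def summarize(durations_ms):
    """Sort the measurements and compute the simple statistics described above."""
    ordered = sorted(durations_ms)
    q1, median, q3 = statistics.quantiles(ordered, n=4)  # lower quartile, median, upper quartile
    return {
        "runs": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": median,
        "lower_quartile": q1,
        "upper_quartile": q3,
    }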
Depends on what kind of data you want. I'd say one row per run initially, then analyze the data and go from there. Maybe store a min/max/average of X runs if you want to consolidate it later.
http://en.wikipedia.org/wiki/Sample_size
Bryan is right - you need to investigate more. If your code has that much variance even "most" of the time, then you might have a lot of fluctuation in your test environment because of other processes, OS paging, or other factors. If not, it seems that you have code paths doing wildly varying amounts of work, and coming up with a single number per run to describe the performance of such a multi-modal system is not going to tell you much. So I'd say isolate your setup as much as possible, run at least 30 trials, and get a feel for what your performance curve looks like. Once you have that, you can use that wikipedia page to come up with a number that will tell you how many trials you need to run per code change to see if the performance has increased/decreased with some level of statistical significance.
While saying, "Save every run," is nice, it might not be practical in your case. However, I do think that storing only the average eliminates too much data. I like storing the average of ten runs, but instead of storing just the average, I'd also store the max and min values, so that I can get a feel for the spread of the data in addition to its center.
The max and min information in particular will tell you how often corner cases arise. Is the 1500ms case a one-in-1000 outlier? Or is it something that recurs on a regular basis?

Repeating "Events" (Calendar)

I'm currently working on an application that allows people to schedule "Shows" for an online radio station.
I want the ability for the user to set up a repeated event, for example:
"Manic Monday" show - Every Monday From 9-11
"Mid Month Madness" - Every Second Thursday of the Month
"This months new music" - 1st of every month.
What, in your opinion, is the best way to model this (based around an MVC/MTV structure)?
Note: I'm actually coding this in Django. But I'm more interested in the theory behind it, rather than specific implementation details.
Ah, repeated events - one of the banes of my life, along with time zones. Calendaring is hard.
You might want to model this in terms of RFC 2445. However, that may well give you far more flexibility - and complexity - than you really want.
A few things to consider:
Do you need any finer granularity than a certain time on given dates? If you need to repeat based on time as well, it becomes trickier.
Consider date corner cases such as "the 30th of every month" and what that means for leap years
Consider time corner cases such as "1.30am every day" - sometimes 1.30am may happen twice, and sometimes it may not happen at all, due to daylight saving time
Do you need to share the schedule with people in other time zones? That makes life trickier again
Do you need to represent the number of times an event occurs, or a final date on which it occurs? ("Count" or "until" basically.) You may not need either, or you may need one or both.
I realise this is a list of things to think about more than a definitive answer, but I think it's important to define the parameters of your problem before you try to work out a solution.
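To make the RFC 2445 reference concrete, here is a sketch of the question's three example schedules expressed with python-dateutil's rrule (which implements most of that RFC's recurrence model); the start date is arbitrary and the show's duration would be stored separately:

from datetime import datetime

from dateutil.rrule import rrule, WEEKLY, MONTHLY, MO, TH

start = datetime(2024, 1, 1, 9, 0)  # arbitrary series start

# "Manic Monday" - every Monday at 09:00
manic_monday = rrule(WEEKLY, byweekday=MO, dtstart=start)

# "Mid Month Madness" - the second Thursday of every month
mid_month = rrule(MONTHLY, byweekday=TH(2), dtstart=start)

# "This month's new music" - the 1st of every month
new_music = rrule(MONTHLY, bymonthday=1, dtstart=start)

# Materialise the occurrences of one series within a window
january_mondays = manic_monday.between(datetime(2024, 1, 1), datetime(2024, 2, 1), inc=True)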
From reading other posts, Martin Fowler describes recurring events the best.
http://martinfowler.com/apsupp/recurring.pdf
Someone implemented these classes for Java.
http://www.google.com/codesearch#vHK4YG0XgAs/src/java/org/chronicj/DateRange.java
I've had a thought that the repeated events should be generated when the original event is saved, using a new model. This means I'm not doing this processing every time the calendar is loaded (and it means I can also, for example, cancel one "Show" in a series), but it also means I have to limit this to a certain time frame, so if someone looked, say, a year into the future, they wouldn't see these repeated shows. At some point they'd have to (potentially) be re-generated.
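A rough sketch of that idea in Django terms, with hypothetical Show and Occurrence models, the recurrence stored as an RFC 2445 RRULE string, and a horizon beyond which occurrences are not generated yet:

from datetime import timedelta

from dateutil.rrule import rrulestr
from django.db import models
from django.utils import timezone

class Show(models.Model):
    title = models.CharField(max_length=200)
    rrule = models.TextField()          # e.g. "FREQ=WEEKLY;BYDAY=MO"
    starts_at = models.DateTimeField()  # first occurrence / DTSTART
    duration = models.DurationField()

class Occurrence(models.Model):
    show = models.ForeignKey(Show, on_delete=models.CASCADE)
    starts_at = models.DateTimeField()
    cancelled = models.BooleanField(default=False)  # lets you cancel one "Show" in a series

def generate_occurrences(show, horizon_days=90):
    """Create Occurrence rows up to a horizon; re-run periodically to extend the window."""
    rule = rrulestr(show.rrule, dtstart=show.starts_at)
    horizon = timezone.now() + timedelta(days=horizon_days)
    for dt in rule.between(show.starts_at, horizon, inc=True):
        Occurrence.objects.get_or_create(show=show, starts_at=dt)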
