Specifying threshold to use for h2o steam prediction service - h2o

Using h2o steam's prediction service for a deployed model, the default threshold that seems to be used by the prediction service is the max f1 threshold. However, in my case I would like the be able to use other thresholds (as displayed by the model when built in h2o flow) (eg. max f2 or max accuracy thresholds) like these.
Is there a way to set these thresholds in steam?
Looking at the inspector on the prediction service page, seems to shows that the logic for the predictor is from a script called "predict.js" (see below):
But I can't find where in the steam launch directory (running from local host based on these instructions) these files are (doing a file search in this directory for anything named "predict.js" returns nothing).

I believe for POJOs there's no way to change it as it's hardcoded. Don't know the answer for MOJOs.

Related

Allow Open Telemetry sampling decision at the end of the span

It seems that the current otel specification only allows to make a sampling decision based on the initial attributes.
This is a shame because I'd like to always include some high signal spans. E.g the ones with errors or long durations. These fields are typically only populated before ending a span. I.e. too late for a sampling decision under the current specs.
Is there some other approach to get what I want? Or is it reasonable to open an issue in the repo to discuss allowing this use case?
Some context for my situation:
I'm working on a fairly small project with no dedicated resources for telemetry infrastructure. Instead we're exporting spans directly from our node.js app server to honeycomb and would like to get a more complete picture of errors and long-duration requests while sampling low-signal spans to keep our cost under control.
There are certain ways you could achieve this.
Implementing your own SpanProcessor which filters out these spans. This can get hairy quickly since it breaks the trace and some span might have parentId set to span which is dropped.
Another way to achieve this is to do tail sampling i.e drop the entire trace if it matches certain criteria and there is processor for that in opentelemetry collector contrib https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor. Please note the agent/gateway deployment of collector which is doing tail sampling has to have the access to full trace and there is also some buffering done.
I think honeycomb also has some component which can be used for sampling the telemetry but I have never used it https://github.com/honeycombio/refinery.

In Google Cloud Platform «Start training» is disable

I am training a model in GCP's AutoML Natural Language Entity extraction.
I have 50+ annotations for each label but still cant start training a model.
Take a look at a screenshot of the
train section. The Start training button remains grey and cannot be selected.
Looking at the screenshot it seems as if you may be talking about training an AutoML Entity Extraction model. Then, this issue seems the same as in Unable to start training my GCP AutoML Entity Extraction model on Web UI
There are thus a couple of reasons that may result in this behavior:
Your dataset are located in a specific region (e.g. "EU") and you need to specify the proper endpoint, as shown in the official documentation.
You might need to increase the number of "Training items per label" to 100 at minimum (see Natural Language limits).
From the aforementioned post, the solution seems to be the first one.

h2o subset to "bestofFamily"

AutoML makes two learners, one that includes "all" and the other that is a subset that is "best of family".
Is there any way to not-manually save the components and stacked ensemble aggregator to disk so that that "best of family", treated as a standalone black-box, can be stored, reloaded, and used without requiring literally 1000 less valuable learners to exist in the same space?
If so, how do I do that?
While running AutoML everything runs in memory (nothing is saved to disk unless you save one of the models to disk - or apply the option of saving an object to disk).
If you just want the "Best of Family" stacked ensemble, all you have to do is save that binary model. When you save a stacked ensemble, it saves all the required pieces (base models and meta model) for you. Then you can re-load later for use with another H2O cluster when you're ready to make predictions (just make sure, if you are saving a binary model, that you can use the same version of H2O later on).
Python Example:
bestoffamily = h2o.get_model('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.save_model(bestoffamily, path = "/home/users/me/mymodel")
R Example:
bestoffamily <- h2o.getModel('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.saveModel(bestoffamily, path = "/home/users/me/mymodel")
Later on, you re-load the stacked ensemble into memory using h2o.load_model() in Python or h2o.loadModel() in R.
Alternatively, instead of using an H2O binary model, which requires an H2O cluster to be running at prediction time, you can use a MOJO model (different model format). It's a bit more work to use MOJOs, though they are faster and designed for production use. If you want to save a MOJO model instead, then you can use h2o.save_mojo() in Python or h2o.saveMojo() in R.

Is there a way to do collaborative learning with h2o.ai (flow)?

Relatively new to ML and h2o. Is there a way to do collaborative learning/training with h2o? Would prefer a way that uses the flow UI, else woud be using python.
My use case is that there would be new feature samples x=[a, b, c, d] periodically coming into a system where an h2o algorithm (say, running from a java program using a MOJO) assigns a binary class that users should be able to manually reclassify as either good(0) or bad(1), at which point these samples (with their newly assigned responses) get sent back to theh h2o algorithm to be used to further train it.
Thanks
FLOW UI is great for prototyping something very quick with H2O without writing a single like of code. You can ingest the data, build desired model and the evaluate the results. Unfortunately FLOW UI is can not be extended for the reason you asked, and FLOW is limited for that reason.
For collaborative learning you can write your whole application directly in python or R and it will work as expected.

Using Graphite/Grafana for non time based data

I have been reading through documentation and have been able to get Graphite to receive data I have been sending it. I can definitely see the benefit in tracking things like concurrent users and network load.
But now I have been tasked with implementing it on the client to show things like RAM, and CPU usage. In addition to this, I must also track actions (users buying things). Maybe I am missing a large chunk of the picture here, but I am not sure how I would do these things. Do I need a timestamp? I've also seen plugins for pie charts and such and this would indicate I could perhaps create graphs from devices with different statistics.
What am I missing?
I don't think you're missing anything.
Any data you send on to, say, InfluxDB (as that's what I've used the most) will be timestamped automatically when it arrives - unless you specify an explicit one yourself.
If you're showing, for example, CPU load you can write your query to pick up the latest value, or perhaps an average (mean value) over time, or the max or min over a period of time as appropriate.
Pie charts can also be used successfully to graph relationships over time.
The key is to create a specific query (I use SQL directly) to craft the data used for the panel type. There are excellent examples in the documentation.

Resources