Generate instances and prediction schema files for Vertex AI model evaluation (Google Cloud pipeline components) - google-cloud-vertex-ai

I built a custom model and am trying to use the Google Cloud pipeline components for model evaluation on Vertex AI. According to the model_upload_predict_evaluate notebook sample, I need to prepare instance_schema.yaml and prediction_schema.yaml files to import the unmanaged model.
How can I generate the instance and prediction schema files programmatically?

For custom models, the instance and prediction schemas are not required:
"predictSchemata": {
    "predictionSchemaUri": MODEL_URI + "/prediction_schema.yaml",
    "instanceSchemaUri": MODEL_URI + "/instance.yaml",
},

In short:
- the content of your YAML file must follow the OpenAPI 3.0.2 Schema Object (https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.0.2.md#schemaObject), as found in the [python sdk](https://github.com/googleapis/python-aiplatform/blob/e21762b703e3f6f45a40e42395b3387eee72141d/google/cloud/aiplatform_v1/types/model.py#L468);
- your YAML file must be suffixed with .yaml. I do not fully understand why, but it gave me some headaches to track this problem down, since kfp's OutputPath does not suffix its paths...

Now, to build the schema programmatically, it depends on your use case and what you start from. It is certainly feasible to build the YAML content yourself from your source data. In my case, I built a custom image that runs my predictor using Pydantic models, from which you can automatically generate an OpenAPI schema in JSON format and then transform it into a YAML file. Using a different dataset than yours (I'm using the sushi dataset), my code example is:
import json

import yaml
from pydantic import BaseModel, Field


class InstanceModel(BaseModel):
    age: int = Field(..., ge=0, le=5)
    east_west_id_now: int = Field(..., ge=0, le=1)
    east_west_id_until_15yo: int = Field(..., ge=0, le=1)
    gender: int = Field(..., ge=0, le=1)
    prefecture_id_now: int = Field(..., ge=0, le=47)
    prefecture_id_until_15yo: int = Field(..., ge=0, le=47)
    region_id_now: int = Field(..., ge=0, le=11)
    region_id_until_15yo: int = Field(..., ge=0, le=11)
    same_prefecture_id_over_time: int = Field(..., ge=0, le=1)
    time_fill_form: int = Field(..., ge=0)


schema_file_path = "..."  # where you want your yaml file

# Pydantic exports the model as a JSON schema string
model_json_string = InstanceModel.schema_json()
model_dict = json.loads(model_json_string)

# Add a minimal example instance, one entry per property
example = {}
for key in model_dict['properties']:
    example[key] = 0
model_dict['example'] = example

with open(schema_file_path, 'w', encoding='utf8') as f:
    yaml.dump(model_dict, f)
which creates the following:
example:
  age: 0
  east_west_id_now: 0
  east_west_id_until_15yo: 0
  gender: 0
  prefecture_id_now: 0
  prefecture_id_until_15yo: 0
  region_id_now: 0
  region_id_until_15yo: 0
  same_prefecture_id_over_time: 0
  time_fill_form: 0
properties:
  age:
    maximum: 5
    minimum: 0
    title: Age
    type: integer
  east_west_id_now:
    maximum: 1
    minimum: 0
    title: East West Id Now
    type: integer
  east_west_id_until_15yo:
    maximum: 1
    minimum: 0
    title: East West Id Until 15Yo
    type: integer
  gender:
    maximum: 1
    minimum: 0
    title: Gender
    type: integer
  prefecture_id_now:
    maximum: 47
    minimum: 0
    title: Prefecture Id Now
    type: integer
  prefecture_id_until_15yo:
    maximum: 47
    minimum: 0
    title: Prefecture Id Until 15Yo
    type: integer
  region_id_now:
    maximum: 11
    minimum: 0
    title: Region Id Now
    type: integer
  region_id_until_15yo:
    maximum: 11
    minimum: 0
    title: Region Id Until 15Yo
    type: integer
  same_prefecture_id_over_time:
    maximum: 1
    minimum: 0
    title: Same Prefecture Id Over Time
    type: integer
  time_fill_form:
    minimum: 0
    title: Time Fill Form
    type: integer
required:
- age
- east_west_id_now
- east_west_id_until_15yo
- gender
- prefecture_id_now
- prefecture_id_until_15yo
- region_id_now
- region_id_until_15yo
- same_prefecture_id_over_time
- time_fill_form
title: InstanceModel
type: object
Now my issue is that I have no clue where this schema is actually used: I cannot see it in the model version details / request example in the console, so I may have done something wrong after that.
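For reference, here is a rough sketch of how the generated files could be wired up: copy them to Cloud Storage next to the model artifacts and pass their URIs when uploading the model with the Vertex AI Python SDK. This is an assumption on my side rather than the notebook's exact flow; bucket, paths, project and serving image below are placeholders, and I'm assuming the instance_schema_uri / prediction_schema_uri arguments of aiplatform.Model.upload populate the predictSchemata block shown above.

# Hypothetical sketch: upload the generated schema files to GCS and reference
# them when registering the custom model. All names and URIs are placeholders.
from google.cloud import aiplatform, storage

BUCKET = "my-bucket"        # placeholder
MODEL_DIR = "models/sushi"  # placeholder -> gs://my-bucket/models/sushi

# Copy the local YAML files next to the model artifacts in Cloud Storage.
bucket = storage.Client().bucket(BUCKET)
for local_path, blob_name in [
    ("instance_schema.yaml", f"{MODEL_DIR}/instance_schema.yaml"),
    ("prediction_schema.yaml", f"{MODEL_DIR}/prediction_schema.yaml"),
]:
    bucket.blob(blob_name).upload_from_filename(local_path)

MODEL_URI = f"gs://{BUCKET}/{MODEL_DIR}"

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# The schema URIs are optional for a custom model; passing them here should
# fill in the predictSchemata of the uploaded model resource.
model = aiplatform.Model.upload(
    display_name="sushi-model",  # placeholder
    artifact_uri=MODEL_URI,
    serving_container_image_uri="REGION-docker.pkg.dev/PROJECT/REPO/my-predictor:latest",  # placeholder
    instance_schema_uri=f"{MODEL_URI}/instance_schema.yaml",
    prediction_schema_uri=f"{MODEL_URI}/prediction_schema.yaml",
)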

Related

Conditional initializations of parameters in Hydra

I'm pretty new to Hydra and was wondering if the following is possible: I have the parameter num_atom_feats in the model section, which I would like to make dependent on the feat_type parameter in the data section. In particular, if I have feat_type: type1 then I would like num_atom_feats: 22. If instead I initialize data with feat_type: type2, then I would like num_atom_feats: 200.
model:
  _target_: model.EmbNet_Lightning
  model_name: 'EmbNet'
  num_atom_feats: 22
  dim_target: 128
  loss: 'log_ratio'
  lr: 1e-3
  wd: 5e-6
data:
  _target_: data.DataModule
  feat_type: 'type1'
  batch_size: 64
  data_path: '.'
wandb:
  _target_: pytorch_lightning.loggers.WandbLogger
  name: embnet_logger
  project: ''
trainer:
  max_epochs: 1000
You can achieve this using OmegaConf's custom resolver feature.
Here's an example showing how to register a custom resolver that computes model.num_atom_feats based on the value of data.feat_type:
from omegaconf import OmegaConf

yaml_data = """
model:
  _target_: model.EmbNet_Lightning
  model_name: 'EmbNet'
  num_atom_feats: ${compute_num_atom_feats:${data.feat_type}}
data:
  _target_: data.DataModule
  feat_type: 'type1'
"""

def compute_num_atom_feats(feat_type: str) -> int:
    if feat_type == "type1":
        return 22
    if feat_type == "type2":
        return 200
    assert False

OmegaConf.register_new_resolver("compute_num_atom_feats", compute_num_atom_feats)

cfg = OmegaConf.create(yaml_data)

assert cfg.data.feat_type == 'type1'
assert cfg.model.num_atom_feats == 22

cfg.data.feat_type = 'type2'
assert cfg.model.num_atom_feats == 200
I'd recommend reading through the docs of OmegaConf, which is the backend used by Hydra.
The compute_num_atom_feats function is invoked lazily, when you access cfg.model.num_atom_feats in your Python code.
When using custom resolvers with Hydra, you can call OmegaConf.register_new_resolver either before you invoke your @hydra.main-decorated function, or from within the @hydra.main-decorated function itself. The important thing is that you call OmegaConf.register_new_resolver before you access cfg.model.num_atom_feats.
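As a minimal sketch of that last point (config_path and config_name below are placeholders for your own project layout), registering the resolver at module level before the @hydra.main-decorated function runs could look like this:

# Hypothetical Hydra entry point; "conf"/"config" are placeholders for your
# own config_path and config_name.
import hydra
from omegaconf import DictConfig, OmegaConf

def compute_num_atom_feats(feat_type: str) -> int:
    return {"type1": 22, "type2": 200}[feat_type]

# Register before any config value is accessed; doing it at module level,
# before @hydra.main runs, is the simplest option.
OmegaConf.register_new_resolver("compute_num_atom_feats", compute_num_atom_feats)

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # The resolver runs lazily here, when the interpolated value is read.
    print(cfg.model.num_atom_feats)

if __name__ == "__main__":
    main()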

How do I add a map to an array of maps in ytt?

I'm trying to add a map to an array of maps in ytt in order to modify a YAML doc.
I tried the below, but it errors out, saying it expects a map but is getting an array.
https://gist.github.com/amalagaura/c8b5c7c92402120ed76dec95dfafb276
---
id: 1
type: book
awards:
  books:
  - id: 1
    title: International Botev
    reviewers:
    - id: 2
      name: PersonB
  - id: 2
    title: Dayton Literary Peace Prize
    reviewers:
    - id: 3
      name: PersonC
#! How do I add a map to an array of maps?
#@ load("@ytt:overlay", "overlay")
#@overlay/match by=overlay.all
---
awards:
  books:
    #@overlay/match by=overlay.all, expects="1+"
    #@overlay/match missing_ok=True
    reviewers:
    #@overlay/append
    - id: 1
      name: PersonA
#@ load("@ytt:overlay", "overlay")
#! Add a map to an array of maps:
#@overlay/match by=overlay.all
---
awards:
  books:
  #@overlay/match by=overlay.all, expects="1+"
  - reviewers:
    #@overlay/append
    - id: 1
      name: Person A
You were really close in your solution; all you really needed was to make reviewers an array item. If you want to be able to add reviewers to a book that does not have that key, then you will have to add a matcher on both the array item and the map item; the gist linked below shows this overlay behavior in action.
If you have more than one #@overlay/match annotation on the same item, the last one wins. There are plans to improve this behavior: https://github.com/k14s/ytt/issues/114.
https://get-ytt.io/#gist:https://gist.github.com/gcheadle-vmware/a6243ee73fa5cc139dba870690eb15c5

Multi-line input in Apache Spark using Java

I have looked at other similar questions already asked on this site, but did not get a satisfactory answer.
I am a total newbie to Apache Spark and Hadoop. My problem is that I have an input file (35 GB) which contains multi-line reviews of merchandise from online shopping sites. The information is given in the file as shown below:
productId: C58500585F
product: Nun Toy
product/price: 5.99
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Bought it for a relative. Was not impressive.
This is one review block. There are thousands of such blocks, separated by blank lines. What I need from each block is the productId, userId and score, so I have filtered the JavaRDD to keep just the lines I need, which look like the following:
productId: C58500585F
userId: A3NM6WTIAE
score: 2.0
Code:
SparkConf conf = new SparkConf().setAppName("org.spark.program").setMaster("local");
JavaSparkContext context = new JavaSparkContext(conf);

JavaRDD<String> input = context.textFile("path");

// keep only the productId, userId and score lines
JavaRDD<String> requiredLines = input.filter(new Function<String, Boolean>() {
    public Boolean call(String s) throws Exception {
        return s.contains("productId") || s.contains("userId") || s.contains("score");
    }
});
Now I need to read these three lines as part of one (key, value) pair, and I do not know how to do that. There will only be a blank line between two blocks of reviews.
I have looked at several websites, but did not find a solution to my problem.
Can anyone please help me with this? Thanks a lot! Please let me know if you need more information.
Continuing on from my previous comments, textinputformat.record.delimiter can be used here. If the only delimiter is a blank line, then the value should be set to "\n\n".
Consider this test data:
productId: C58500585F
product: Nun Toy
product/price: 5.99
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Bought it for a relative. Was not impressive.

productId: ABCDEDFG
product: Teddy Bear
product/price: 6.50
userId: A3NM6WTIAE
profileName: Heather
helpfulness: 0/1
score: 2.0
time: 1624609
summary: not very much fun
text: Second comment.

productId: 12345689
product: Hot Wheels
product/price: 12.00
userId: JJ
profileName: JJ
helpfulness: 1/1
score: 4.0
time: 1624609
summary: Summarized
text: Some text
Then the code (in Scala) would look something like:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", "\n\n")

val raw = sc.newAPIHadoopFile("test.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)

val data = raw.map(e => {
  val m = e._2.toString
    .split("\n")
    .map(_.split(":", 2))
    .filter(_.size == 2)
    .map(e => (e(0), e(1).trim))
    .toMap
  (m("productId"), m("userId"), m("score").toDouble)
})
Output is:
data.foreach(println)
(C58500585F,A3NM6WTIAE,2.0)
(ABCDEDFG,A3NM6WTIAE,2.0)
(12345689,JJ,4.0)
I wasn't sure exactly what you wanted for output, so I just turned each record into a 3-element tuple. Also, the parsing logic could definitely be made more efficient if you need it to be, but this should give you something to work with.
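If you happen to be using PySpark rather than Java or Scala, the same textinputformat.record.delimiter idea can be sketched roughly as follows ("test.txt" and the app name are placeholders, and the parsing simply mirrors the Scala example above):

# PySpark sketch of the same approach; path and app name are placeholders.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("multiline-reviews").setMaster("local"))

# Read blank-line-delimited blocks as single records.
raw = sc.newAPIHadoopFile(
    "test.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\n\n"},
)

def parse(record):
    # record is an (offset, block_text) pair; split the block into "key: value" lines
    fields = dict(
        (k.strip(), v.strip())
        for k, v in (line.split(":", 1) for line in record[1].split("\n") if ":" in line)
    )
    return (fields["productId"], fields["userId"], float(fields["score"]))

data = raw.map(parse)
data.foreach(print)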

How can I convert a bag to an array of numeric values?

I'm trying to turn the following schema:
{
    id: chararray,
    v: chararray,
    paid: chararray,
    ts: {(ts: int)}
}
into the following JSON output:
{
    "id": "abcdef123456",
    "v": "some identifier",
    "paid": "another identifier",
    "ts": [ 1, 2, 3, 4, 5, 6 ]
}
I know how to generate the JSON output, but I can't figure out how to turn the ts attribute in my Pig schema into just the array of numeric values.
The number of items in the ts bag is not known in advance, but they all have the same schema (ts: int).
Pig doesn't have an array data type; one option could be to try something like this.
input
1 1 100 {(1),(2),(3)}
2 2 200 {(4),(5)}
3 3 300 {(1),(2),(3),(4),(5),(6)}
PigScript:
A = LOAD 'input' USING PigStorage() AS (id: chararray, v: chararray,paid: chararray,ts: {(ts: int)});
B = FOREACH A GENERATE id,v,paid,CONCAT('[',BagToString(ts,','),']') AS ts;
STORE B INTO 'output' USING JsonStorage();
Output:
{"id":"1","v":"1","paid":"100","ts":"[1,2,3]"}
{"id":"2","v":"2","paid":"200","ts":"[4,5]"}
{"id":"3","v":"3","paid":"300","ts":"[1,2,3,4,5,6]"}
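Note that with this approach ts ends up as a string ("[1,2,3]") rather than a real JSON array. If that matters downstream, one option is to post-process the JSON lines outside of Pig; a rough Python sketch (the part file name is a placeholder for whatever Pig actually writes) could be:

# Sketch: turn the stringified "ts" field from the Pig output into a real JSON array.
import json

with open("output/part-m-00000") as src, open("output_fixed.json", "w") as dst:
    for line in src:
        record = json.loads(line)
        # "[1,2,3]" -> [1, 2, 3]
        record["ts"] = [int(x) for x in record["ts"].strip("[]").split(",") if x]
        dst.write(json.dumps(record) + "\n")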

simple_fields_for performance on a table with 230 rows

We are trying to embed two simple form fields as columns in a table, and we noticed that it takes about 4.5 seconds for simple_fields_for to generate those tags. The table has 230 rows.
Performance with the simple_fields_for block commented out is 0.5 seconds; with simple_fields_for it is 5 seconds.
= simple_form_for :account, url: create_transactions_path, method: :put do |f|
  %table.table.table-striped
    ......
    ........
    ........
    %tbody
      - loans_view = loans_view(@loans)
      - loans_view.each do |lv|
        - loan = lv[:loan]
        - account = loan.account
        .......
        .......
        %td
          = lv[:amount_due]
        = f.simple_fields_for :loan, index: loan.id do |al_f|
          = al_f.simple_fields_for :account_transaction, index: account.id do |act_f|
            %td
              = act_f.input :amount, label: false, input_html: { value: account.top_up_amount }
            %td
              = act_f.input :include_for_update, as: :boolean, label: false, input_html: { checked: true }
We enabled logging and made sure that no DB calls go out and no time-consuming APIs are called inside the simple_fields_for block.
