Validate against schema fragment - ruby

I'm new to json-schema, so this might not be a relevant issue.
I'm using https://github.com/hoxworth/json-schema.
I have one big JSON file describing a lot of schemas (mostly small ones) with many $ref references between them, and I need to be able to validate data against one of these "inner" schemas. I can't find a way to do this with json-schema.
Does json-schema support this use case, or am I doing it wrong?

It appears it does. The project states that it supports JSON Schema v4, and the $ref handling can be seen in the source code, around line 265 of lib/json-schema/validator.rb:
def build_schemas(parent_schema)
  # Build ref schemas if they exist
  if parent_schema.schema["$ref"]
    load_ref_schema(parent_schema, parent_schema.schema["$ref"])
  end

protobuf oneof backwards compatibility

If I had some protobufs created with the following protobuf schema
message Foo {
  Bar1 bar_1 = 1;
  Bar2 bar_2 = 2;
}
but later on updated the protobuf schema to
message Foo {
  oneof foo {
    Bar1 bar_1 = 1;
    Bar2 bar_2 = 2;
  }
}
Will this second version be able to read the protos created with the first version?
I don't think it can.
A Foo message created with the first version of the schema can contain both a bar_1 and a bar_2.
Code generated from the second schema expects there to be only a bar_1 or only a bar_2, so regardless of whatever markers GPB puts in its wire format to denote a oneof, this code wouldn't know what to do with the surplus bar_2.
It's possible that, with the right schema syntax version (is it proto3 that makes everything optional?), a message created using code for the first schema that contains only a bar_1 or a bar_2 (made possible by the fields being optional) may be parsable by code generated from the second schema. But that would come down to how oneof is treated in the GPB wire format.
All in all, it's best not to make assumptions about wire-format compatibility between conflicting schemas. It would be easy to write a small utility that reads Foo messages created by the first schema, checks whether both fields are present, and creates fresh Foo messages under the second schema if they are not (you may have to compile the schemas with appropriately distinct namespaces configured). That way you can catch the exceptions (both fields present) and be sure to have ended up with compatible wire-format data.
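A minimal sketch of such a utility in Python, assuming the two schema versions have been compiled under distinct package names into hypothetical modules foo_v1_pb2 and foo_v2_pb2:

import foo_v1_pb2  # hypothetical: generated from the first schema
import foo_v2_pb2  # hypothetical: generated from the second (oneof) schema

def migrate(raw_bytes):
    """Re-encode an old-schema Foo under the new oneof-based schema."""
    old = foo_v1_pb2.Foo()
    old.ParseFromString(raw_bytes)

    # The oneof can hold only one of the two fields, so messages that set
    # both are the exceptional case that needs manual handling.
    if old.HasField("bar_1") and old.HasField("bar_2"):
        raise ValueError("Foo sets both bar_1 and bar_2; cannot map onto the oneof")

    new = foo_v2_pb2.Foo()
    if old.HasField("bar_1"):
        new.bar_1.CopyFrom(old.bar_1)
    elif old.HasField("bar_2"):
        new.bar_2.CopyFrom(old.bar_2)
    return new.SerializeToString()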
Yes, the second version of the protobuf schema should be able to read protobufs created with the first version. When you update a protobuf schema, the changes only affect how new protobufs are encoded and decoded; protobufs created with the previous version of the schema are still laid out according to the old schema, and the second version should still be able to read them even though the schema has changed.
However, it is worth noting that when you change a protobuf schema, you should take care to ensure the change is backwards-compatible, meaning the new schema can still read protobufs created with the old schema without losing any information. In the example you provided, the change from the first version to the second is backwards-compatible, so the second version should be able to read protobufs created with the first. A change that was not backwards-compatible would not allow this.

Possible to set file name for h2o.save_model() (rather than simply use the model_id value)?

I'm trying to save an H2O model with some specific name that differs from the model's model_id field, but something like...
h2o.save_model(model=model,
               path='/some/path/then/filename',
               force=False)
just creates a dir/file structure like
some
|__path
   |__then
      |__filename
         |__<model_id>
as opposed to
some
|__path
   |__then
      |__filename
Is this possible to do from the save_model method?
I can't / hesitate to simply change the model_id before calling the save method, because the model names have timestamps appended to them to avoid name collisions with other models that may be on the H2O cluster. I'm trying to remove these timestamps when saving to disk, and simplifying the name on the cluster before saving would create a window in which a naming collision can occur if other processes are also trying to save such a model (of, say, a different timestamp).
Any way to get this behavior or other common alternatives / workarounds?
This is currently not possible; however, I created a feature request here. There is a related question here which shows a solution for R (it could be adapted to Python). The work-around is just to rename the file manually using a few lines of R/Python code.
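A minimal work-around sketch in Python, assuming h2o.save_model() returns the path of the file it wrote, and using the path and target name from the question as placeholders:

import os
import h2o

# Let save_model() write <path>/<model_id>, then rename the file on disk.
saved_path = h2o.save_model(model=model, path='/some/path/then', force=False)
desired_path = os.path.join(os.path.dirname(saved_path), 'filename')
os.rename(saved_path, desired_path)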

Is there a supported way to get list of features used by a H2O model during its training?

This is my situation. I have over 400 features, many of which are probably useless and often zero. I would like to be able to:
train a model with a subset of those features
query that model for the features actually used to build that model
build a H2OFrame containing just those features (I get a sparse list of non-zero values for each row I want to predict.)
pass this newly constructed frame to H2OModel.predict() to get a prediction
I am pretty sure what I found is unsupported but works for now (v 3.13.0.341). Is there a more robust/supported way of doing this?
model._model_json['output']['names']
The response variable appears to be the last item in this list.
In a similar vein, it would be nice to have a supported way of finding out which H2O version the model was built under. I cannot find the version number in the JSON.
If you want to know which feature columns the model used after you have built a model you can do the following in python:
my_training_frame = your_model.actual_params['training_frame']
which will return the frame id, and then you can do
col_used = h2o.get_frame(my_training_frame)
col_used
EDITED (after comment was posted)
To get the columns use:
col_used.columns
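Putting the pieces together, here is a rough sketch of the workflow from the question, assuming a trained model your_model and an H2OFrame new_data holding the rows to score (both names are placeholders); the response column is taken to be the last entry of the names list, as noted above:

# Columns seen during training; per the question, the response variable
# is the last item in this list.
names = your_model._model_json['output']['names']
feature_cols, response_col = names[:-1], names[-1]

# Build a frame with just the feature columns and predict on it.
scoring_frame = new_data[feature_cols]
preds = your_model.predict(scoring_frame)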
Also, a quick way to check the version of a saved binary model is to try to load it into H2O: if it loads, it is the same version of H2O; if it isn't, you will get a warning.
You can also open the saved model file; the first line will list the version of H2O used to create it.
For a model saved as a MOJO you can look at the model.ini file, which lists the version of H2O.

Data abstraction in API Blueprint + Aglio?

Reading the API Blueprint specification, it seems set up to allow one to specify 'Data Structures' like:
Address
    street: 100 Main Str. (string) - street address
    zip: 77777-7777 (string) - zip / postal code
    ...
Customer:
    handle: mrchirpy (string)
    address: (address)
And then in the model, make a reference to the data structure:
Model
    [Customer][]
It seems to be set up so that, by referencing the data structure, it should generate documentation and examples in-line with the endpoints.
However, I can't seem to get it to work, nor can I find examples using "fully normalized data abstraction". I want to define my data structures once and then reference them everywhere. It seems like it might be a problem with the tooling; specifically, I'm using aglio as the rendering agent.
It seems like all this would be top-of-the-fold type stuff, so I'm confused and wondering if I'm missing something or making the wrong assumptions about what's possible here.
@zanerock, I'm the author of Aglio. The data structure support you mention is part of MSON, which was recently added to API Blueprint as a way to describe data structures / schemas. Aglio has not yet been updated to support this, but I do plan on adding the feature.

Fulltext index in Neo4j 2.0

Is there a way to
create a fulltext index with a given Lucene analyzer on a certain node type (and certain fields only),
get this index updated automatically when a node of the given type is created / deleted, and
query this index over the Cypher or the REST API?
I am using the server's Cypher/REST interface (and of course the shell, etc.), not the embedded version.
If this is not available (which I guess): Is something like this on the roadmap?
Thank you in advance!
Short answer: no
A bit longer answer:
You can write a KernelExtension adding a TransactionEventHandler that adds the fields to be fulltext-indexed to a manual index (aka legacy index).
The code should be wrapped into an unmanaged extension and deployed to the server.
There is something similar implemented in https://github.com/sarmbruster/neo4j-uuid.
The contents of the legacy index can then be accessed in Cypher using start n=node:myindex('lucene query string').

Resources