GCP Vertex AI AutoML for images requires the image data to also reside in us-central1 - google-cloud-vertex-ai

We have our image data in us-west1, but since AutoML is only available in us-central1, we are not able to use the image data where it is. We do not want to copy the data to us-central1. Is there any other option to use Vertex AI AutoML without moving the data to us-central1?
Thanks

Based on the documentation [1] that I went through, and since AutoML for image data is solely available in us-central1 (Iowa), there is no option other than copying or moving the data to us-central1.
[1] - https://cloud.google.com/vertex-ai/docs/general/locations#feature-availability
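If copying does turn out to be necessary, here is a minimal sketch of the mechanics of a one-off transfer with the google-cloud-storage client (bucket names are placeholders; the destination bucket must already exist in us-central1):
from google.cloud import storage

client = storage.Client()
src = client.bucket("my-images-us-west1")      # hypothetical source bucket
dst = client.bucket("my-images-us-central1")   # bucket created in us-central1

# copy_blob performs a server-side copy; data does not pass through the client
for blob in client.list_blobs(src):
    src.copy_blob(blob, dst, new_name=blob.name)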

Related

How to save kedro dataset in azure and still have it in memory

I want to save a Kedro memory dataset in Azure as a file and still have it in memory, as my pipeline will be using it later. Is this possible in Kedro? I tried looking at transcoding datasets, but it does not seem possible. Is there any other way to achieve this?
This may be a good opportunity to use CachedDataSet: it allows you to wrap any other dataset, but once it's read into memory, it is made available to downstream nodes without re-performing the IO operations.
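A minimal sketch of that in Python (assuming a pandas CSV on Azure Blob Storage via an abfs:// path; the import paths follow Kedro 0.17/0.18-era APIs and the credential values are placeholders):
from kedro.io import CachedDataSet
from kedro.extras.datasets.pandas import CSVDataSet

# Underlying dataset persisted to Azure (hypothetical container/path)
azure_csv = CSVDataSet(
    filepath="abfs://my-container/data/output.csv",
    credentials={"account_name": "...", "account_key": "..."},  # placeholders
)

# The first save/load hits Azure; later loads in the same run come from memory
cached = CachedDataSet(dataset=azure_csv)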
I would try explicitly saving the dataset to Azure as part of your node logic, i.e. with catalog.save(). Then you can feed the dataset to downstream nodes in memory using the standard node inputs and outputs.

In Google Cloud Platform «Start training» is disabled

I am training a model in GCP's AutoML Natural Language Entity extraction.
I have 50+ annotations for each label but still can't start training a model.
Take a look at a screenshot of the train section: the Start training button remains grey and cannot be selected.
Looking at the screenshot, it seems you are talking about training an AutoML Entity Extraction model. This issue then seems to be the same as in Unable to start training my GCP AutoML Entity Extraction model on Web UI.
There are thus a couple of reasons that may result in this behavior:
Your dataset is located in a specific region (e.g. "EU") and you need to specify the proper endpoint, as shown in the official documentation and in the sketch after this list.
You might need to increase the number of "Training items per label" to at least 100 (see Natural Language limits).
From the aforementioned post, the solution seems to be the first one.
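For the first case, a minimal sketch of pointing the Python client at the EU regional endpoint (the endpoint value is the one Google's docs give for EU-hosted AutoML resources):
from google.cloud import automl

# Without client_options the client targets the global endpoint,
# which cannot see datasets stored in the EU region
client = automl.AutoMlClient(
    client_options={"api_endpoint": "eu-automl.googleapis.com:443"}
)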

GeoTrellis: Create an Attribute Store for a Cloud-Optimized GeoTIFF hosted outside of GeoTrellis

The key performance advantage of the Cloud-Optimized GeoTIFF is the ability to retrieve raster data for a given extent while pulling only the overviews and a byte range from a remote resource.
In Python, the vsicurl and gdal.Warp abstractions make it possible to do this with just a URL and an extent:
from osgeo import gdal

# GDAL's /vsicurl/ handler fetches only the requested byte ranges over HTTP
vsicurl_url = '/vsicurl/' + url_to_cog
gdal.Warp(output_file,
          vsicurl_url,
          dstSRS='EPSG:4326',
          cutlineDSName=jsonFileSliceAoi,
          cropToCutline=True)
The newly-minted COG Spark Examples explain how to arrive at a Raster[Tile] using an AttributeStore created as a result of tiling an RDD in a previous step:
//tiling an RDD and writing out the catalog
...
// Create the reader instance to query tiles stored as a Structured COG Layer
val reader = FileCOGLayerReader(attributeStore)
// Read layer at the max persisted zoom level
// (in this case it can actually be any zoom level in the [0, zoom] range)
val layer: TileLayerRDD[SpatialKey] = reader.read[SpatialKey, Tile](LayerId("example_cog_layer", zoom))
// Let's stitch the layer into tile
val raster: Raster[Tile] = layer.stitch
The examples, release notes, and docs for COG support in GeoTrellis all confirm that there is support for tiling data and making it available for clients to consume as a COG. Does GeoTrellis also support the ability to act as a client?
How do you create a FileCOGLayerReader if you don't have a pre-existing catalog, but do have a URL that supports range requests?
We have two COG related concepts at the moment:
The first one is a GeoTrellis COG Layer: very similar to a GeoTrellis Avro layer, but instead of storing separate Avro tiles we store tiles as segments of a TIFF (or several TIFFs). To prepare such a catalog you have to go through an ingest process to structure the data and reformat it into the GeoTrellis-ready format. The result is a layer that consists of partial pyramids, where each partial pyramid is represented by a set of COGs.
The second one is a GeoTrellis Unstructured COG Layer: https://github.com/locationtech/geotrellis/blob/master/doc-examples/src/main/scala/geotrellis/doc/examples/spark/COGSparkExamples.scala#L114
The latter allows you to collect metadata about your dataset into (extent, URI) tuples and provides an interface to query it. Get familiar with the example I posted and let me know if that works for you.
Btw, RasterFoundry uses unstructured COG layers for their tile server.

h2o subset to "bestofFamily"

AutoML makes two ensemble learners: one that includes "all" models, and one that is a "best of family" subset.
Is there any way to non-manually save the component models and the stacked ensemble aggregator to disk, so that the "best of family" ensemble, treated as a standalone black box, can be stored, reloaded, and used without requiring literally 1000 less-valuable learners to exist in the same space?
If so, how do I do that?
While running AutoML, everything runs in memory; nothing is saved to disk unless you explicitly save one of the models (or apply an option that saves an object to disk).
If you just want the "Best of Family" stacked ensemble, all you have to do is save that binary model. When you save a stacked ensemble, it saves all the required pieces (base models and meta model) for you. You can then re-load it later for use with another H2O cluster when you're ready to make predictions (just make sure, if you are saving a binary model, that you use the same version of H2O later on).
Python Example:
bestoffamily = h2o.get_model('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.save_model(bestoffamily, path = "/home/users/me/mymodel")
R Example:
bestoffamily <- h2o.getModel('StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135')
h2o.saveModel(bestoffamily, path = "/home/users/me/mymodel")
Later on, you re-load the stacked ensemble into memory using h2o.load_model() in Python or h2o.loadModel() in R.
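For example, in Python (note that h2o.save_model() writes the model under path/<model_id> and returns that full path, which is what h2o.load_model() expects):
import h2o
h2o.init()

# Path as returned by h2o.save_model() above
loaded = h2o.load_model(
    "/home/users/me/mymodel/StackedEnsemble_BestOfFamily_0_AutoML_20171121_012135")
predictions = loaded.predict(test_frame)  # test_frame: an H2OFrame of new data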
Alternatively, instead of using an H2O binary model, which requires an H2O cluster to be running at prediction time, you can use a MOJO model (different model format). It's a bit more work to use MOJOs, though they are faster and designed for production use. If you want to save a MOJO model instead, then you can use h2o.save_mojo() in Python or h2o.saveMojo() in R.
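A hedged Python sketch of the MOJO export; depending on your H2O version this may instead be exposed as a model method (download_mojo() / save_mojo()) rather than a module-level function:
# Export the ensemble as a MOJO zip; returns the path of the written file.
# get_genmodel_jar=True also fetches h2o-genmodel.jar, needed to score MOJOs.
mojo_path = bestoffamily.download_mojo(path="/home/users/me/mymodel",
                                       get_genmodel_jar=True)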

Store Images to display in SOLR search results

I have built a SOLR index which contains the image thumbnail URLs, and I want to render an image along with each search result. The problem is that those images can run into millions, and I think storing the images in the index as binary data would make the index humongous.
I am seeking guidance on how to efficiently store those images after fetching them from the URLs. Should I use the plain file system and have them served by Tomcat, or should I use a JCR repository like Apache Jackrabbit?
Any guidance would be greatly appreciated.
Thank You.
I would evaluate the effective requirements before finally deciding how to persist the images.
Do you require versioning?
Are you planning to store only the images or additional metadata?
Do you have any requirements in horizontal scaling?
Do you require any image processing or scaling?
Do you need access to the image metadata?
Do you require additional tooling for managing the images?
Are you willing to invest time in learning an additional technology?
Storing the images on the file system and making them available via an image spooler implementation is the simplest way to persist them.
But if you identify some of the above-mentioned requirements (which are typical for a content repo or a DAM system), then you would end up reinventing the wheel with the filesystem approach.
The other option is using some kind of content repository. A JCR repo like Jackrabbit or its commercial implementation CRX is one option; Alfresco (which supports CMIS) would be another valid one.
Features like versioning, post-processing (scaling, ...), metadata extraction and management are supported by both of the mentioned repository solutions. But this requires you to learn a new technology, which can be time consuming, and both repository technologies can get complex.
If horizontal scaling is a requirement, I would consider a commercially supported repository implementation (CRX or Alfresco Enterprise), because the community releases lack this functionality.
Personally, I would base any decision on the above-mentioned requirements.
I have worked extensively with Jackrabbit, CRX, and Alfresco CE and EE, and I would go for Alfresco, as in my experience it scales better with larger amounts of data.
I'm not aware of an image spooling solution that fits your needs exactly, but it shouldn't be too difficult to implement one, apart from the fact that recurring scaling operations may be very resource-intensive.
I would go for the following approach if FS is enough for you:
Separate images and thumbnails into two locations: the images root folder remains, while the thumbnails folder is temporary.
Create a temporary thumbnail folder for each indexing run; all thumbnails for that run are stored under that location, and scaling can be achieved with e.g. ImageMagick.
The temporary thumbnail folder can then easily be dropped as soon as the next run has completed.
If you are planning to store millions of images, avoid putting all files in the same directory; browsing flat hierarchies with too many entries will be a nightmare.
Better to create a tree structure, e.g. by deriving it from the current datetime (year/month/day/hour/minute, e.g. 2013/06/01/08/45).
This ensures that the number of files inside the last folder does not get too big (Alfresco uses the same pattern for storing binary objects on the FS, and it has proven to work nicely).
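A minimal sketch of that directory scheme in Python (names are illustrative):
from datetime import datetime
from pathlib import Path

def thumbnail_path(root, filename):
    # e.g. <root>/2013/06/01/08/45/<filename>
    run_dir = datetime.utcnow().strftime("%Y/%m/%d/%H/%M")
    path = Path(root) / run_dir / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    return path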
