Online updating for ALS recommendation system - apache-spark-mllib

Is there a way to export the Spark ALS model after it has been trained, to PMML or some other format, so that it can be called outside of the Spark environment?
E.g., in Java, given a customer id C and a product id P, load the model file created by the Scala program and call it to get a score for (C, P).
The major reason for this question is that when the number of active users is huge, say 0.1 billion users across 100 products, there are 10 billion predictions to compute. And item-based recommendation is not an option in our case.
Not sure how people in industry handle this, especially when the model needs to be updated daily but is trained on the whole previous month's/week's data.

There is a way to save your model within the Spark environment, like this: ALSModel.save("myModelPath"). With this model you are able to score all known customer/item pairs.
I guess if you want to score outside of Spark you have to export the item & user factors into another system and compute the matrix factorization score yourself. There you are also able to update user interactions for your recommendations.
With ALSModel.userFactors and ALSModel.itemFactors you are able to extract the factors of your model.
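To make that concrete, here is a minimal PySpark sketch of exporting the factors and scoring a pair outside Spark. The paths and the scoring helper are illustrative, not part of the original question; the same can be done from the Scala API.

```python
# Minimal sketch (PySpark). Paths and helper names are illustrative.
from pyspark.ml.recommendation import ALSModel
import numpy as np

model = ALSModel.load("myModelPath")

# Each factor DataFrame has the columns: id (int), features (array<float>).
model.userFactors.write.parquet("/export/user_factors")
model.itemFactors.write.parquet("/export/item_factors")

# Outside Spark, once the factors are loaded into some key/value store,
# the score for a (customer C, product P) pair is just a dot product:
def score(user_vec, item_vec):
    return float(np.dot(user_vec, item_vec))
```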
Why would you want to score outside of Spark? You can simply precalculate your predictions and serve them online. If you want to update the recommendations at a very high frequency you have to go the suggested way. If you only want to update your model on a daily basis, I would suggest that you simply retrain the model every day.
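If precomputing is acceptable, ALSModel can generate the top-K items per user directly; a short sketch (the output path and K=10 are assumptions):

```python
# Sketch: precompute top-10 items per user and persist them for online serving.
top_k = model.recommendForAllUsers(10)   # columns: user id, recommendations
top_k.write.parquet("/export/daily_recommendations")
```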

Related

Annotation specs - AutoML (Vertex AI)

We're trying to build an image-based product search for our webshop using the Vertex AI image classification model (single label).
Currently we have around 20k products with xx images per product.
So our dataset contains 20k labels (one for each product: the product number), but on import we receive the following error message:
There are too many AnnotationSpecs in the dataset. Up to 5000 AnnotationSpecs are allowed in one Dataset. Check your csv/jsonl format with our public documentation.
Looks like not more than 5000 labels are allowed per Dataset... This quota is not really visible in the documentation - or we didn't find it.
Anyway, any ideas how we can make it work? Do we have to build 5 Datasets with 5 different Endpoints and then query every Endpoint for a match?
You can find those limits in the AutoML quotas and limits documentation.
It is possible to have multiple models for groups of products -- maybe even something like: one initial model to classify the product category (jewelry, watches, shoes, toys, etc.) and a second step with a category-specific model (to identify the specific product among toys, or among shoes, etc.). To be honest, it seems a bit hard to support, but certainly worth trying.
A second option would be training a custom model where you fine-tune some larger model (e.g. Inception, ResNet, etc.) to know all your 20k+ classes (products). It adds a bit more work at first, but once established it becomes a single model for inference, and retraining would be simpler using MLOps mechanisms (e.g. Vertex Pipelines).
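As a rough illustration of that second option, a hedged PyTorch sketch of swapping the head of a pretrained ResNet for a 20k-class classifier (the class count, freezing strategy, and hyperparameters are assumptions, and this is a custom-training sketch, not a Vertex AI API; assumes torchvision >= 0.13):

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_PRODUCTS = 20_000  # one class per product number (assumption)

# Start from an ImageNet-pretrained backbone and replace the classification head.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_PRODUCTS)

# Optionally freeze the backbone at first and train only the new head.
for name, param in backbone.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # standard multi-class training loop goes here
```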

How to limit possible arm recommendations to a subset of items?

I have the following scenario:
There is a universe of items to recommend [i1....iN] where N is quite large (say 1 million).
There are categories [c1...cK]. Each category consists of a subset of the items.
The user can go to pages which display items from a given category.
I would like to display recommended items for each category page to a user using a single bandits model across all category pages. So when I ask for a set of top-K recommendations ("actions") for category page ci, the results should be limited to the set of items available within ci.
Is there a way to do this with Vowpal Wabbit?
When you ask VW for a contextual bandit prediction, the ADF (action dependent features) form allows you to specify which actions can be chosen for that prediction. The ADF form can be read about in more detail in the VW documentation, and contrasted with the more common standard contextual bandit. This lets you ask only for predictions over the actions in the category you're currently looking at. This works because actions are defined as the set of features that compose them, so you can present any set of features per action for each prediction. This means that changing the actions between calls is not an issue.
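For illustration, a hedged sketch of the ADF format via the Python bindings. The feature names, epsilon value, and category contents are made up, and the exact Python entry point differs between VW versions (older releases use pyvw.vw instead of vowpalwabbit.Workspace):

```python
import vowpalwabbit

# --cb_explore_adf: contextual bandit with action dependent features.
vw = vowpalwabbit.Workspace("--cb_explore_adf --epsilon 0.1 --quiet")

# Only the items belonging to the current category page are presented as actions.
example = [
    "shared |User id=u42 page=c7",   # shared (context) features
    "|Action item=i101 cat=c7",      # candidate action 1
    "|Action item=i205 cat=c7",      # candidate action 2
    "|Action item=i933 cat=c7",      # candidate action 3
]

# Returns a probability mass function over the presented actions only.
pmf = vw.predict(example)
print(pmf)
```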
However, empirically we see that contextual bandits with more than ~100 actions are not very effective. Essentially, the very small exploration probabilities do not work well with the update rule.
So, it's doable but I am not sure how effective it will be.
In a situation such as this a common approach is to use another model to get a pool of recommendations and then use a contextual bandit as an L2 ranker to personalize a pool of 50 or so actions that were suggested.

Correct way to label lists in GCP AutoML entity text model

I want to create a model to extract info from PDFs containing purchase orders. I thought that I could create an AutoML text entity extraction model for that task. My main doubt is the best way to handle the article lists. How can I label each cell so that I get a list of rows in the result?
Thanks
The labeling is very important; fewer than 10 labels to start would make it easier, as you will need at least 100 entities labeled per label to train. Remember you have three sets to label: train, test, and validate. 100 for train, 30 for test, and 30 for validate should suffice.
Check the label tab often; it shows the breakdown of what has been labeled so far.
Google's documentation is a good start: https://cloud.google.com/natural-language/automl/docs/prepare
I ended up building a Java client to call predict on the model, sending it a list of files to process. The returned JSON has the entities by label for each file.
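The answer's client was Java; here is a rough equivalent sketch using the google-cloud-automl Python client. The project/model IDs are placeholders and the helper is hypothetical, so double check the call shape against the current client documentation:

```python
from google.cloud import automl

# Placeholder; replace with your own project/location/model path.
MODEL_NAME = "projects/my-project/locations/us-central1/models/TEN1234567890"

client = automl.PredictionServiceClient()

def predict_entities(text: str):
    """Send one document's text and return (label, extracted span) pairs."""
    payload = {"text_snippet": {"content": text, "mime_type": "text/plain"}}
    response = client.predict(name=MODEL_NAME, payload=payload)
    # Each annotation carries the label (display_name) and the extracted segment.
    return [(a.display_name, a.text_extraction.text_segment.content)
            for a in response.payload]
```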

Elasticsearch read model sync with write model

My application follows the CQRS strategy and separates the read model from the write model. I have a Product and multiple PurchaseOrders related to that Product.
The PurchaseOrder read model is in Elasticsearch, with the product name attached. Now if I change the product name in the write model, I need to update the productName field of all the PurchaseOrders accordingly in the read model (using Elasticsearch's bulk update API).
My question is: as I have millions of PurchaseOrders, will this productName sync be a performance issue? Any suggestions for modeling this kind of syncing?
Although I do not believe that changing a product name on existing orders is a good idea (the invoice might have been generated and the product name in the order should match the one in the invoice), the question still has merit.
You may want your PurchaseOrder to keep only the ID (and perhaps the version?) of the Product, so that you can avoid such a mass update. This approach, on the other hand, requires a call to the Product aggregate root every time you want to translate the ID of the product into its name. The impact of such a read can obviously be mitigated by using a cache.
I guess it really depends on how often each of the two circumstances occurs, and I would then optimize for the more frequent one.
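If you do keep the denormalized productName and accept the occasional mass update, a hedged sketch of the rename with update_by_query via the Python client. The index, field, and ID values are made up, and the keyword arguments shown match the 8.x elasticsearch-py client (older versions take a single body= dict):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Rename the product on every PurchaseOrder document that references it,
# as an asynchronous background task instead of one huge synchronous call.
es.update_by_query(
    index="purchase_orders",
    query={"term": {"productId": "product-123"}},
    script={
        "source": "ctx._source.productName = params.newName",
        "params": {"newName": "New product name"},
    },
    conflicts="proceed",        # skip documents updated concurrently
    wait_for_completion=False,  # returns a task id; poll it separately
)
```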

design of car booking application using elasticsearch

I need some help in designing car booking application.
There is a document with information about car (title, model, brand, info, etc.)
Problems I'm stuck with are:
How to store available booking days? (I suppose I could use nested free date range objects in an array.)
How to store price per day? (It's possible to have an individual price per day.)
Booking days and prices could change often. So the third question is: how to update them cleverly (partially), so that I don't have to read the document and then store it again? I'm looking at a scripted solution using the Update API (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html), but it looks ugly. Maybe there are other approaches?
Thanks,
Alex
With the introduction of the range datatypes, there is no need to use a real nested object, if that is what you meant.
That might also help you with storing the prices, but that could just be any object I suppose (it depends whether you want to search on that as well).
The Update API was made for exactly that use case, so that you do not need to fetch the whole document; that sounds like a plan.
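To make that concrete, a hedged sketch with the Python client: a mapping that stores the availability as date ranges and a partial update that avoids re-reading the document. The index name, fields, and dates are illustrative, and the keyword arguments follow the 8.x elasticsearch-py client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A date_range field can hold an array of availability windows,
# so no nested objects are required for the booking days.
es.indices.create(
    index="cars",
    mappings={
        "properties": {
            "title":        {"type": "text"},
            "availability": {"type": "date_range", "format": "yyyy-MM-dd"},
            "daily_prices": {"type": "object"},  # e.g. {"2024-06-01": 59.0}
        }
    },
)

# Partial update: only the changed fields are sent; the client never
# has to read the whole document and store it back.
es.update(
    index="cars",
    id="car-42",
    doc={"availability": [{"gte": "2024-06-01", "lte": "2024-06-14"}]},
)
```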
