Is it possible to train XGBoost4J-Spark with tree_method='exact'? - apache-spark-mllib

I intend to use a trained xgboost model with tree_method='exact' in a SparkML pipeline, so I need to use XGBoost4J-Spark; however, the documentation says "Distributed and external memory version only support approximate algorithm." (https://xgboost.readthedocs.io/en/latest//parameter.html). Is there any way to work around this?
Alternatively, I could train the model with the C-based xgboost and somehow convert the trained model to an XGBoostEstimator, which is a SparkML estimator and integrates seamlessly into a SparkML pipeline. Has anyone come across such a converter?
I don't mind running on a single node instead of a cluster, as I can afford to wait.
Any insights are appreciated.

So there is this way:
import ml.dmlc.xgboost4j.scala.XGBoost
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressionModel

// Load the booster trained with the C-based xgboost
val xgb1 = XGBoost.loadModel("xgb1")
// Wrap it in a SparkML-compatible regression model
val xgbSpark = new XGBoostRegressionModel(xgb1)
where xgb1 is the model trained with the C-based xgboost. There is a problem, however: their predictions don't match. I have reported the issue on the GitHub repo: https://github.com/dmlc/xgboost/issues/3190
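For reference, the single-node training side of that workaround might look like the sketch below; the training data file and the objective are placeholders rather than details from the original post:

import xgboost as xgb

# Train single-node with the exact greedy algorithm, then save in the
# binary format that XGBoost.loadModel("xgb1") reads on the Scala side
dtrain = xgb.DMatrix("train.libsvm")  # placeholder training data
params = {"tree_method": "exact", "objective": "reg:squarederror"}  # objective is an example
booster = xgb.train(params, dtrain, num_boost_round=100)
booster.save_model("xgb1")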

Related

Hugging Face - Could not load model facebook/bart-large-mnli

I just wanted to test the facebook/bart-large-mnli model, but it doesn't work and I don't know how to fix it.
The code:
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
The error message:
ValueError: Could not load model facebook/bart-large-mnli with any of the following classes: (<class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForSequenceClassification'>,)
For example, classifier = pipeline(task="sentiment-analysis", model="roberta-large-mnli") works.
What can I do? I already cleaned the disk space.
Thank you a lot!
facebook/bart-large-mnli doesn't offer a TensorFlow model at the moment. To load the PyTorch model into the pipeline, make sure you have PyTorch installed:
pip install torch
...and then re-run your code.
On Hugging Face, not all models are supported by TensorFlow. This model and (apparently) all other zero-shot pipeline models are supported only by PyTorch.
On the Hugging Face model selection page you can toggle options under Libraries to limit the model selection to the libraries you are using.
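If you have both frameworks installed and want to be explicit, you can also ask the pipeline for the PyTorch implementation directly; a minimal sketch using the pipeline's framework argument:

from transformers import pipeline

# framework="pt" forces the PyTorch weights instead of the TF ones
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli",
                      framework="pt")
print(classifier("I love this movie", candidate_labels=["positive", "negative"]))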

Download pre-trained sentence-transformers model locally

I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) to create embeddings of sentences using the pre-trained model bert-base-nli-mean-tokens. I have an application that will be deployed to a device that does not have internet access. How to save the model has already been answered here: Download pre-trained BERT model locally. Yet I'm stuck at loading the saved model from the locally saved path.
When I try to save the model using the above-mentioned technique, these are the output files:
('/bert-base-nli-mean-tokens/tokenizer_config.json',
'/bert-base-nli-mean-tokens/special_tokens_map.json',
'/bert-base-nli-mean-tokens/vocab.txt',
'/bert-base-nli-mean-tokens/added_tokens.json')
When I try to load it into memory using
tokenizer = AutoTokenizer.from_pretrained(to_save_path)
I'm getting
Can't load config for '/bert-base-nli-mean-tokens'. Make sure that:
- '/bert-base-nli-mean-tokens' is a correct model identifier listed on 'https://huggingface.co/models'
- or '/bert-base-nli-mean-tokens' is the correct path to a directory containing a config.json
You can download and load the model like this:
from sentence_transformers import SentenceTransformer
modelPath = "local/path/to/model"
model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
model.save(modelPath)
model = SentenceTransformer(modelPath)
This worked for me. You can check the SBERT documentation for details of the SentenceTransformer class [here][1].
[1]: https://www.sbert.net/docs/package_reference/SentenceTransformer.html#:~:text=class,Optional%5Bstr%5D%20%3D%20None)
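As for the original error: the files listed in the question are tokenizer files only, and loading from a directory needs a config.json in it. A minimal sketch that saves the model weights alongside the tokenizer so config.json is written; the paths and the sentence-transformers/... hub id are assumptions:

from transformers import AutoModel, AutoTokenizer

hub_id = "sentence-transformers/bert-base-nli-mean-tokens"  # assumed hub id
save_path = "./bert-base-nli-mean-tokens"                   # placeholder path

model = AutoModel.from_pretrained(hub_id)
tokenizer = AutoTokenizer.from_pretrained(hub_id)
model.save_pretrained(save_path)      # writes config.json and the weights
tokenizer.save_pretrained(save_path)  # writes vocab.txt etc.

# Later, offline:
tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModel.from_pretrained(save_path)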
There are many ways to solve this issue:
Assuming you have trained your BERT base model locally (Colab/notebook), in order to use it with the Hugging Face AutoClass, the model (along with the tokenizer, vocab.txt, config, special tokens, and TF/PyTorch weights) has to be uploaded to Hugging Face. The steps to do this are described here. Once it is uploaded, a repository will be created under your username, and then the model can be accessed as follows:
from transformers import AutoTokenizer
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("<username>/<model-name>")
The second way is to use the trained model locally, and this can be done by using pipelines. The following is an example of how to use a model trained (and saved) locally for your use case (an example from my locally trained QA model):
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

nlp_QA = pipeline('question-answering',
                  model='./abhilash1910/distilbert-squadv1',
                  tokenizer='./abhilash1910/distilbert-squadv1')
QA_inp = {
    'question': 'What is the fund price of Huggingface in NYSE?',
    'context': 'Huggingface Co. has a total fund price of $19.6 million dollars'
}
result = nlp_QA(QA_inp)
result
The third way is to use Sentence Transformers directly from the Hugging Face models repo, as sketched below.
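A minimal sketch of that third route; the sentence-transformers/... hub id is an assumption about how the model is published on the hub:

from sentence_transformers import SentenceTransformer

# Load by hub id once; the cached/saved copy can then be used offline
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
embeddings = model.encode(["This is a sentence."])
print(embeddings.shape)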
There are other ways to resolve this, but these should help. This list of pretrained models might also be useful.

Machine learning with spark, data preparation performance problem, MLeap

I have found a lot of good responses about MLeap, a library that allows fast scoring. It works on the basis of a model converted into an MLeap bundle.
But what about the data preparation stage before scoring?
Is there an effective approach to convert a Spark ML data preparation pipeline (which works during training, but inside the Spark framework) into robust, performant, optimized byte-code?
You can easily serialize your entire PipelineModel (containing both the feature engineering and the trained model) with MLeap.
NOTE: The following code is a bit old and you probably have access to a cleaner API now.
// Imports (package names may vary slightly across MLeap versions)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._ // scala-arm, provides `managed`

// MLeap PipelineModel serialization into a single .zip file
val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}

// MLeap code: deserialize the model from the local filesystem (without any Spark dependency)
val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get
Be aware that the tricky part is if you define your own Estimators/Transformers in Spark, as they will need a corresponding MLeap implementation as well.

Reusing h2o model mojo or pojo file from python

As H2O models are only reusable with the same major version of H2O they were saved with, an alternative is to save the model in MOJO/POJO format. Is there a way these saved models can be reused/loaded from Python code? Or is there any way to keep the model for further development when upgrading the H2O version?
If you want to use your model for scoring via Python, you could use either h2o.mojo_predict_pandas or h2o.mojo_predict_csv. Otherwise, if you want to load a binary model that you previously saved, you will need compatible versions.
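For example, a minimal sketch of the h2o.mojo_predict_pandas route; the file paths are placeholders:

import pandas as pd
import h2o

# Score a pandas frame against a saved MOJO without loading a binary model;
# h2o-genmodel.jar ships alongside the MOJO download
df = pd.read_csv("test_data.csv")  # placeholder test data
predictions = h2o.mojo_predict_pandas(
    dataframe=df,
    mojo_zip_path="model.zip",
    genmodel_jar_path="h2o-genmodel.jar",
)
print(predictions.head())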
Outside of H2O-3 you can look into pyjnius as Tom recommended: https://github.com/kivy/pyjnius
Another alternative is to use pysparkling, if you only need it for scoring:
from pysparkling.ml import H2OMOJOModel
# Load test data to predict
df = spark.read.parquet(test_data_path)
# Load mojo model
mojo = H2OMOJOModel.createFromMojo(mojo_path)
# Make predictions
predictions = mojo.transform(df)
# Show predictions with ground truth (y_true and y_pred)
predictions.select('your_target_column', 'prediction').show()

Using Product and Location Models, how to find "deals" near locations (Rails 3.1.1)

I have a Product model holding deals for stores (another model, Store) across a whole city. Now if someone selects a particular store, I want my view to display deals for all stores in geographically nearby areas of that store (say, within a range of 3 miles).
One way would be finding all deals on a zipcode basis, but I am wondering if there is a better way to do this. Maybe some gem...
Thanks.
Use the geokit gem: http://geokit.rubyforge.org/ . Example:
Store.find(:all, :origin => [37.792, -122.393], :within => 10)
It works with relational databases; however, it is not optimized like geospatial databases.
What you're looking for is a spatial database. You can achieve this with Postgres via PostGIS. I'd also highly recommend using GeoServer or MapServer as a front-end to PostGIS. You're going to want to do some serious reading on GIS in general. This is not a topic to cover in a single answer. You may want to spend some time poking around the OSGeo site.
If you're feeling trendy, you can use MongoDB's spatial indexes. This is probably what I would recommend if you're looking for a quick fix. Foursquare actually runs entirely on MongoDB's spatial functionality; it's what they use to find people close by. So with Mongo you could find nearby deals with something like:
db.deals.find({
  loc: {
    $near: [YOUR_X, YOUR_Y],
    $maxDistance: DEAL_DISTANCE
  }
});
This will return all deals that are within DEAL_DISTANCE of your coordinates.