Kedro context and catalog missing from Jupyter Notebook

I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my project name when selecting "New", I get the following errors:
context
NameError: name 'context' is not defined
catalog.list()
NameError: name 'catalog' is not defined
EDIT:
After running the magic command %kedro_reload I can see that my ProjectContext's init_spark_session is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but kedro still looks in the notebooks folder:
%kedro_reload
java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist
_spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment out this line from my ProjectContext, the error goes away, but I receive another one about not being able to find my Oracle driver when trying to load a dataset from the catalog:
df = catalog.load('dataset')
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
EDIT 2:
For reference:
kedro/src/project_name/context.py
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in project's conf folder."""
    # Load the spark configuration in spark.yaml using the config loader
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())
    # Initialise the spark session
    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    _spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')
kedro/conf/base/spark.yml:
# You can define spark specific configuration here.
spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar

I think a combination of the following might help you:
Generally, try to avoid manually interfering with the current working directory, so remove the os.chdir call in your notebook. Construct an absolute path where possible.
In your init_spark_session, when calling addPyFile, use an absolute path instead. self.project_path points to the root directory of your Kedro project, so you can use it to construct the path to your PyFile accordingly, e.g. _spark_session.sparkContext.addPyFile(f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg')
Not sure why you would need to add the PyFile though, but maybe you have a specific reason.
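The same reasoning likely applies to the relative spark.jars entry in spark.yml, which would explain the Oracle driver error. Here is a minimal sketch of the revised method under that assumption (the drivers folder is assumed to sit at the project root):

from pyspark import SparkConf
from pyspark.sql import SparkSession

def init_spark_session(self) -> None:
    """Initialise a SparkSession with all relative paths anchored to the project root."""
    parameters = self.config_loader.get("spark*", "spark*/**")
    # Rewrite the relative driver path from spark.yml against the project root (assumption)
    if "spark.jars" in parameters:
        parameters["spark.jars"] = f'{self.project_path}/{parameters["spark.jars"]}'
    spark_conf = SparkConf().setAll(parameters.items())
    _spark_session = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
        .getOrCreate()
    )
    _spark_session.sparkContext.setLogLevel("WARN")
    # Anchor the egg to the project root instead of the current working directory
    _spark_session.sparkContext.addPyFile(
        f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg'
    )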

Related

Load a pre-trained model from disk with Huggingface Transformers

From the documentation for from_pretrained, I understand I don't have to download the pretrained vectors every time; I can save them and load from disk with this syntax:
- a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
- (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.
So, I went to the model hub:
https://huggingface.co/models
I found the model I wanted:
https://huggingface.co/bert-base-cased
I downloaded it from the link they provided to this repository:
Pretrained model on English language using a masked language modeling
(MLM) objective. It was introduced in this paper and first released in
this repository. This model is case-sensitive: it makes a difference
between english and English.
Stored it in:
/my/local/models/cased_L-12_H-768_A-12/
Which contains:
./
../
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
So, now I have the following:
PATH = '/my/local/models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
And I get this error:
> raise EnvironmentError(msg)
E OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E
E - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E
E - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file
Similarly, when I link to the config.json directly:
PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
if state_dict is None and not from_tf:
try:
state_dict = torch.load(resolved_archive_file, map_location="cpu")
except Exception:
raise OSError(
> "Unable to load weights from pytorch checkpoint file. "
"If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
)
E OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
What should I do differently to get huggingface to use my local pretrained model?
Update to address the comments
YOURPATH = '/somewhere/on/disk/'
name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)
>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')
So all is saved, but then....
YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)
"Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.
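(As an aside, from_pretrained also accepts the directory that save_pretrained wrote to. A minimal sketch, assuming the files above were saved to YOURPATH, and using the slow tokenizer class in light of the warning about the Rust-based one:)

from transformers import TransfoXLTokenizer, TransfoXLModel

# Sketch: point from_pretrained at the save_pretrained directory itself,
# rather than passing the model name together with cache_dir
tokenizer = TransfoXLTokenizer.from_pretrained(YOURPATH, local_files_only=True)
model = TransfoXLModel.from_pretrained(YOURPATH, local_files_only=True)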
Where is the file located relative to your model folder? I believe it has to be a relative PATH rather than an absolute one. So if the file where you are writing the code is located in 'my/local/', then your code should be like so:
PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
You just need to specify the folder where all the files are, not the files directly. I think this is definitely a problem with the PATH. Try changing the style of slashes ("/" vs "\"), since these differ between operating systems. Also try prefixing with ".", like so: ./models/cased_L-12_H-768_A-12/ etc.
I had this same need and just got this working with Tensorflow on my Linux box so figured I'd share.
My requirements.txt file for my code environment:
tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3
I'm using Python 3.6.
I went to this site, which shows the directory tree for the specific huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link is to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.
I manually downloaded the following files (in some cases I had to copy/paste into Notepad++ because the download button took me to a raw version of the txt/json... odd...):
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
NOTE: Once again, all I'm using is Tensorflow, so I didn't download the Pytorch weights. If you're using Pytorch, you'll likely want to download those weights instead of the tf_model.h5 file.
I then put those files in this directory on my Linux box:
/opt/word_embeddings/bert-base-uncased/
Probably a good idea to make sure there's at least read permissions on all of these files as well with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permissions on the parent directory (the one listed above) so people can cd to this dir.
From there, I'm able to load the model like so:
tokenizer:
# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
layer/model weights:
# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
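From there, a quick sanity check that the local load worked (a sketch; the sample sentence is a placeholder):
# Sketch: run one sentence through the locally loaded tokenizer + model
inputs = tokenizer("hello world", return_tensors="tf")
outputs = bert(inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)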
This should be quite easy on Windows 10 using a relative path. Assuming your pre-trained (PyTorch-based) transformer model is in a 'model' folder in your current working directory, the following code can load your model.
from transformers import AutoModel
model = AutoModel.from_pretrained('.\model', local_files_only=True)
Please note the 'dot' in '.\model'. Omitting it will make the code fail.
In addition to the config file and vocab file, you need to add the tf/torch model (which has a .h5/.bin extension) to your directory.
In your case, the torch and tf models may be located at these URLs:
torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin
tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5
You can also find all required files in the "Files and versions" section of your model: https://huggingface.co/bert-base-cased/tree/main
The bert model folder contains these files:
config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt
But what if, instead of these, we only have:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
bert_model.ckpt.meta
vocab.txt
How do we load the model then?
Here is a short answer.
tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt',local_files_only=True)
model = BertForMaskedLM.from_pretrained('/path/to/pytorch_model.bin',config='../config.json', local_files_only=True)
Usually, config.json need not be supplied explicitly if it resides in the same directory.
You can use the simpletransformers library; check out the link for a more detailed explanation.
model = ClassificationModel(
    "bert", "dir/your_path"
)
Here I used ClassificationModel as an example. You can use the library for many other tasks as well, like question answering etc.
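For completeness, a minimal usage sketch (the model directory and sample text are placeholders):

from simpletransformers.classification import ClassificationModel

# Sketch: load a fine-tuned model from a local directory and run a prediction
model = ClassificationModel("bert", "dir/your_path", use_cuda=False)
predictions, raw_outputs = model.predict(["example sentence to classify"])
print(predictions)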

HBase Could not find or load main class org.apache.hadoop.hbase.util.HBaseConfTool

EDIT: Problem solved (see comment for explanation)
I installed HBase. When I try to run start-hbase.sh, I get these errors:
Error: Could not find or load main class org.apache.hadoop.hbase.util.HBaseConfTool
Error: Could not find or load main class org.apache.hadoop.hbase.zookeeper.ZKServerTool
My installation directory is: C:\Users\Alon\Downloads\hadoop_temp\hbase-2.2.4
And I configured HBASE_HOME to: C:\Users\Alon\Downloads\hadoop_temp\hbase-2.2.4
And also HBASE_CONF_DIR: C:\Users\Alon\Downloads\hadoop_temp\hbase-2.2.4\conf
In addition, I added C:\Users\Alon\Downloads\hadoop_temp\hbase-2.2.4\bin to the environment var Path.
JAVA_HOME=C:\Users\Alon\Downloads\jdk1.8.0_202 (as an environment variable and also in hbase-env.sh)
I would like to get your help please, since I don't know how to solve the problem.
Thank you very much.
It looks like the CLASSPATH is not picking up the libraries under $HBASE_HOME/lib.
Set the value of HBASE_HOME to the HBase installation directory and update hbase-env.sh with the JAVA_HOME variable.
Then restart HBase with start-hbase.sh.
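For illustration, the relevant hbase-env.sh lines might look like this (a sketch using the paths from the question; the forward-slash form assumes a Unix-like shell such as Cygwin or Git Bash, which running the .sh scripts on Windows implies):
# conf/hbase-env.sh
export JAVA_HOME="/c/Users/Alon/Downloads/jdk1.8.0_202"
export HBASE_HOME="/c/Users/Alon/Downloads/hadoop_temp/hbase-2.2.4"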

How do I initialize a new database-path defined in stack.yaml

The sample Docker configuration section of stack.yaml gives:
# Location of database used to track image usage, which `stack docker cleanup`
# uses to determine which images should be kept. On shared systems, it may
# be useful to override this in the global configuration file so that
# all users share a single database.
database-path: "~/.stack/docker.db"
However, when I put this in the stack.yaml for a new project and run stack setup, I get:
Aeson exception:
Error in $.docker['database-path']: failed to parse field 'docker': failed to parse field 'database-path': InvalidAbsFile "~/.stack/docker.db"
See http://docs.haskellstack.org/en/stable/yaml_configuration/
This is the only reference I could find to database-path without digging into the code.
Is database-path required?
If so: How do I initialize a .db file (to mitigate InvalidAbsFile "~/.stack/docker.db")?
It is not a matter of initializing the database. The problem is that stack does not expand the ~, so you need to use the full path, e.g. /home/dukedave/.stack/docker.db
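For example, the working entry would look like this (substituting your own home directory):
database-path: "/home/dukedave/.stack/docker.db"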

In windows, where to create the `.theanorc.txt` file and how to make theano able to see it?

I am trying to make Theano use the GPU on Windows. This tutorial suggests that I create a .theanorc directory at my home and a theanorc.txt inside it to be able to set the configuration flags before initialization.
Where should I create the theanorc.txt file (i.e. how do I find out where my home is?) and how do I make Theano able to see it?
I have tried the following script to create .theanorc and then added theanorc.txt manually inside it, but the GPU was not enabled:
import os

_theano_base_dir = os.path.expanduser('~')
if not os.access(_theano_base_dir, os.W_OK):
    _theano_base_dir = '/tmp'
_theano_dir = os.path.join(_theano_base_dir, '.theanorc')
if not os.path.exists(_theano_dir):
    os.makedirs(_theano_dir)
theano_config_path = os.path.expanduser(os.path.join(_theano_dir, 'theanorc.txt'))
print(theano_config_path)
This printed: C:\SPB_Data\.theanorc\theanorc.txt. Is C:\SPB_Data my home?
On Windows, your home directory should be C:\Users\Your_Windows_UserName. Also, if you want to create the .theanorc file without the .txt extension, you can use Notepad++.
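For reference, a minimal theanorc enabling the GPU might look like this (a sketch; the exact flags depend on your Theano and CUDA versions):
[global]
device = gpu
floatX = float32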

Specifying index path with Hibernate Search and Spring

I have a problem setting the correct path for my index. It would be great if it were inside my Spring application, since then it would still work after I deploy the application to CloudBees, I guess.
This is the object that I am trying to index:
@Entity
@Table(name="educations")
@Indexed(index="educations")
public class Education {
I have the following in servlet-context.xml:
<resources mapping="/resources/**" location="/resources/"/>
I specify the lucene index path like this:
Properties props = new Properties();
props.put("hibernate.search.default.indexBase", "resources/lucene/indexes");
entityManagerFactory.setJpaProperties(props);
This doesn't give me any error, but I can't find the folder either, which I don't understand. I tried searching for it.
I also tried:
props.put("hibernate.search.default.indexBase", "classpath:/lucene/indexes");
and
props.put("hibernate.search.default.indexBase", "/resources/lucene/indexes");
But I still can't find the folder. However, after a while of struggling with this, I tried to put it in my home directory (which might give me problems later when deploying to the cloud):
props.put("hibernate.search.default.indexBase", "/lucene/indexes");
I get the following error:
Cannot write into index directory: /lucene/indexes for index educations
So I assume it's a permission error. I tried the following in the terminal (OS X):
sudo chmod -R u+rwX /lucene/indexes/
and
sudo chmod -R 755 /lucene/indexes/
But I still get the same error. Can someone shed some light on this?
Thank you!
Edit:
After some more investigation, I am sure it is a permissions problem. If I specify the full path to the root of my Spring application, it works. I still don't know how to specify this without giving the full path.
Relative paths are relative to the directory the Java process is launched from. If you have some startup script or similar, look in the directory of that script. Absolute paths work fine, but of course you need permission to write to them.
If you want a more generic solution for your case, you could, for example, set the right directory as a system property when starting the application and read it from there when creating your Properties. Or you could try another way to determine the full path of your app at runtime.
I couldn't find where the folders were located either, so I came up with the following solution:
First, I get the location of the working directory by calling System.getProperty("user.dir"). This is OS-independent, so it works on both Linux and Windows. The working directory is the directory from which your application is launched. Next, I simply append the relative path that I want as the location for my Lucene indexes to the working directory path. Then I use that as the value for hibernate.search.default.indexBase. Now I'll always know where to look for the Lucene indexes.
Here's the code:
String luceneFilePath = System.getProperty("user.dir") + "/resources/lucene/indexes";
Properties props = new Properties();
props.put("hibernate.search.default.indexBase", luceneFilePath);
entityManagerFactory.setJpaProperties(props);
