How to create a list of catalog entries and pass them in as inputs in Kedro Pipeline - kedro

I am trying to get a list of datasets from a catalog file I have created, pass them in as the inputs of a single node to combine them, and ultimately run the pipeline on Airflow using the kedro-airflow plugin.
This works on the CLI with kedro run but seems to fail in Airflow, and I am not sure why:
# my_pipeline/pipeline.py
from kedro.config import ConfigLoader
from kedro.pipeline import Pipeline, node

def create_pipeline(**kwargs):
    conf_loader = ConfigLoader(['conf/base'])
    conf_catalog = conf_loader.get('catalog-a*')
    datasets = [key for key, value in conf_catalog.items()]
    return Pipeline([
        node(
            func=combine_data,
            inputs=datasets,
            outputs="combined_data",
            name="combined_data"
        ),
        ...  # other nodes
    ])
The error I am getting on Airflow looks something like this:
Broken dag: Given configuration path either does not exist
or is not a valid directory: 'conf/base'
This is definitely a Kedro config loader error, but I can't figure out why it only occurs when running the pipeline via Airflow. From what I have been reading, mixing in the code API like this is not advised. Is this the right way to pass in a list of datasets?
Edit
My catalog is basically a list of SQL query datasets:
dataset_1:
  type: pandas.SQLQueryDataSet
  sql: select * from my_table where created_at >= '2018-12-21 16:00:00' and partner_id=1
  credentials: staging_sql
dataset_2:
  type: pandas.SQLQueryDataSet
  sql: select * from my_table where created_at >= '2019-08-15 11:55:00' and partner_id=2
  credentials: staging_sql

I think it might fail because kedro run is executed from the project root directory, where it can find conf/base, but the create_pipeline function lives under the my_pipeline directory, so when Airflow imports it from a different working directory, Kedro's ConfigLoader cannot resolve that relative path.
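For example, a rough sketch that anchors the config path to the project root instead of the current working directory (the number of .parents levels is an assumption about your project layout):
from pathlib import Path
from kedro.config import ConfigLoader

# Resolve conf/base relative to this file so the lookup works no matter
# where the process (kedro CLI or Airflow) was started from.
PROJECT_ROOT = Path(__file__).resolve().parents[2]  # adjust to your layout
conf_loader = ConfigLoader([str(PROJECT_ROOT / "conf" / "base")])
conf_catalog = conf_loader.get("catalog-a*")
datasets = list(conf_catalog.keys())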
Another way I've done this in the past is to accept the catalog as an argument, like this:
def create_pipeline(catalog: DataCatalog = None, **kwargs) -> Pipeline:
Then you can iterate over the registered datasets, or do:
datasets = catalog.datasets
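A rough sketch of that approach (the dataset_ name prefix and the already-defined combine_data function are assumptions):
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node

def create_pipeline(catalog: DataCatalog = None, **kwargs) -> Pipeline:
    # catalog.list() returns the names of all registered datasets;
    # keep only the ones you want to combine (the prefix is a placeholder).
    datasets = [name for name in catalog.list() if name.startswith("dataset_")]
    return Pipeline([
        node(
            func=combine_data,  # your existing combining function
            inputs=datasets,
            outputs="combined_data",
            name="combined_data",
        ),
    ])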

Related

Issues with loading Maxmind Data into Clickhouse Database using a local file

I'm trying to insert MaxMind data into a ClickHouse dictionary, defining its source as a local file on the machine I am running my client from.
So to define my dictionary I use the query:
CREATE DICTIONARY usage_analytics.city_locations (
    geoname_id UInt64 DEFAULT 0,
    ...
    ...
    ...
    ...
)
PRIMARY KEY geoname_id
SOURCE(File(path '/home/ubuntu/maxmind_csv/GeoLite2-City-Locations-en.csv' format 'CSVWithNames'))
SETTINGS(format_csv_allow_single_quotes = 0)
LAYOUT(HASHED())
LIFETIME(300);
Yet I keep getting hit with this error:
Failed to load dictionary 'usage_analytics.city_locations': std::exception. Code: 1001, type: std::__1::__fs::filesystem::filesystem_error, e.what() = filesystem error: in canonical: No such file or directory [\home/ubuntu/maxmind_csv/GeoLite2-City-Locations-en.csv] [/],
According to the documentation, I have to use the absolute path, which I did by using readlink, and still it cannot find my file. I am running a ClickHouse client from a remote machine and the files are on that remote machine. Am I supposed to keep my files elsewhere?
It looks like this file is not accessible to ClickHouse. To fix it, set the right ownership on the file:
chown clickhouse:clickhouse /home/ubuntu/maxmind_csv/GeoLite2-City-Locations-en.csv
# chown -R clickhouse:clickhouse /home/ubuntu/maxmind_csv
An XML-defined dictionary can read files from any folder; a dictionary created via SQL DDL cannot.
https://clickhouse.tech/docs/en/sql-reference/dictionaries/external-dictionaries/external-dicts-dict-sources/#dicts-external_dicts_dict_sources-local_file
When dictionary with source FILE is created via DDL command (CREATE DICTIONARY ...), the source file needs to be located in user_files directory, to prevent DB users accessing arbitrary file on ClickHouse node.
/etc/clickhouse-server/config.xml
<!-- Directory with user provided files that are accessible by 'file' table function. -->
<user_files_path>/var/lib/clickhouse/user_files/</user_files_path>
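In other words, place the CSV under user_files_path and point the dictionary source there. A rough sketch of that fix using the clickhouse-driver Python package (the host, destination path, and single-column schema are placeholders, not your full definition):
import shutil
from clickhouse_driver import Client

SRC = "/home/ubuntu/maxmind_csv/GeoLite2-City-Locations-en.csv"
DST = "/var/lib/clickhouse/user_files/GeoLite2-City-Locations-en.csv"

# The file must live on the ClickHouse server, inside user_files_path,
# and be readable by the clickhouse user (see the chown above).
shutil.copy(SRC, DST)

client = Client(host="localhost")
client.execute("""
    CREATE DICTIONARY usage_analytics.city_locations (
        geoname_id UInt64 DEFAULT 0
    )
    PRIMARY KEY geoname_id
    SOURCE(FILE(path '/var/lib/clickhouse/user_files/GeoLite2-City-Locations-en.csv' format 'CSVWithNames'))
    SETTINGS(format_csv_allow_single_quotes = 0)
    LAYOUT(HASHED())
    LIFETIME(300)
""")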

Shopify GraphQL - no schema file exists even though it does

I'm trying to get started with GraphQL and Shopify. I am getting the following error:
Client for API version 2020-01 does not exist because no schema file exists at `shopify_graphql_schemas/2020-01.json`. (ShopifyAPI::GraphQL::InvalidClient)
Even though I've run this rake task:
rake shopify_api:graphql:dump SHOP_DOMAIN="shipping-inventory-sync.myshopify.com" ACCESS_TOKEN="the_right_token" API_VERSION=2020-01
And I can see the file here: https://nimb.ws/ypDVAK. When I run ShopifyAPI::GraphQL.schema_location in the console I get #<Pathname:/home/shipping-new/db/shopify_graphql_schemas>.
I am only trying to do the following to get the shop name as a test:
sesh = ShopifyAPI::Session.new(domain: "shipping-inventory-sync.myshopify.com", token: 'xx', api_version: '2020-01')
ShopifyAPI::Base.activate_session(sesh)
client = ShopifyAPI::GraphQL.client

SHOP_NAME_QUERY = client.parse <<-'GRAPHQL'
  {
    shop {
      name
    }
  }
GRAPHQL

result = client.query(SHOP_NAME_QUERY)
pp result.data.shop.name
My Shopify Initializer config file also has the same api_version here:
config.api_version = "2021-01"
How can I download the schema in order to use GraphQL for Shopify?
It seems that in your rake task you've specified API version "2020-01", but in your initializer you are specifying a different version ("2021-01"). Make the two match, e.g. set config.api_version = "2020-01" in the initializer, or dump the schema for 2021-01 as well.

Kedro context and catalog missing from Jupyter Notebook

I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my project name when selecting "New", I get the following errors:
context
NameError: name 'context' is not defined
catalog.list()
NameError: name 'catalog' is not defined
EDIT:
After running the magic command %kedro_reload I can see that my ProjectContext init_spark_session is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but Kedro still looks in the notebooks folder:
%kedro_reload
java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist
_spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment out this line in my ProjectContext, the error goes away, but I get another one about not being able to find my Oracle driver when trying to load a dataset from the catalog:
df = catalog.load('dataset')
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
EDIT 2:
For reference:
kedro/src/project_name/context.py
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in project's conf folder."""
    # Load the spark configuration in spark.yaml using the config loader
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())
    # Initialise the spark session
    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    _spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')
kedro/conf/base/spark.yml:
# You can define spark specific configuration here.
spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar
I think a combination of these might help you:
Generally, let's try to avoid manually interfering with the current working directory, so remove the os.chdir call in your notebook. Construct an absolute path where possible.
In your init_spark_session, when calling addPyFile, use an absolute path instead. self.project_path points to the root directory of your Kedro project, so you can use it to construct the path to your PyFile, e.g. _spark_session.sparkContext.addPyFile(f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg')
Not sure why you would need to add the PyFile, though; maybe you have a specific reason.
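For reference, a minimal sketch of the revised method (assuming the rest of your ProjectContext stays the same):
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in the project's conf folder."""
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())
    _spark_session = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
        .getOrCreate()
    )
    _spark_session.sparkContext.setLogLevel("WARN")
    # Anchor the egg path to the project root instead of the current working directory
    _spark_session.sparkContext.addPyFile(
        f"{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg"
    )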

Data versioning of "Hello_World" tutorial

I have added "versioned: true" in the catalog.yml file of the "hello_world" tutorial:
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv
  versioned: true
Then when I used kedro run to run the tutorial, it failed with the error below:
VersionNotFoundError: Did not find any versions for CSVDataSet
May I know what is the right way to do versioning for the iris.csv file? Thanks!
Try versioning one of the downstream outputs instead. For example, add this entry to your catalog.yml and run kedro run:
example_train_x:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/example_iris_data.csv
  versioned: true
You will then see an example_iris_data.csv directory (not a file) under data/02_intermediate. The reason example_iris_data gives you an error is that it is the starting data and there is already an iris.csv in data/01_raw, so Kedro cannot create a data/01_raw/iris.csv/ directory because of the name conflict with the existing iris.csv file.
Hope this helps :)
The reason for the error is that when Kedro tries to load the dataset, it looks for a file at data/01_raw/iris.csv/<load_version>/iris.csv and, of course, cannot find such a path. So if you really want to enable versioning for your input data, you can move iris.csv like this:
mv data/01_raw/iris.csv data/01_raw/iris.csv_tmp
mkdir data/01_raw/iris.csv
mv data/01_raw/iris.csv_tmp data/01_raw/iris.csv/<put_some_timestamp_here>/iris.csv
You wouldn't need to do that for any intermediate data, as these path manipulations are done by Kedro automatically when it saves a dataset (but not on load).
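For reference, a rough sketch (assuming a Kedro 0.17-era API) of loading the versioned dataset directly, which also illustrates the path convention described above:
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io import Version

# Version(load=None, save=None) means: load the latest timestamped subfolder,
# save into a new one named after the current timestamp.
versioned_iris = CSVDataSet(
    filepath="data/01_raw/iris.csv",
    version=Version(load=None, save=None),
)
df = versioned_iris.load()  # raises VersionNotFoundError until
                            # data/01_raw/iris.csv/<timestamp>/iris.csv exists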

Jenkins Pipeline Jira Plugin sorted input parameter

My goal is to create a Jenkins Pipeline Input Step where the user can choose from a selection of JIRA versions. I am able to generate the dropdown list of available versions, but would like the order to be reversed.
The current dropdown is as follows:
REL-3
REL-2
REL-1
And my goal is for it to become:
REL-1
REL-2
REL-3
Because of the order of the items, the dropdown automatically selects REL-3, whereas it should default to REL-1 as it is the earliest version.
I am using the Jenkins JIRA Plugin with the JiraVersionParameterDefinition to generate the list of versions.
The Jenkins Pipeline code is the following:
#!/usr/bin/env groovy

def jiraVersionForReleaseFromUserInput() {
    releases_list = [
        $class: 'JiraVersionParameterDefinition',
        jiraProjectKey: 'PROJECT_NAME',
        jiraReleasePattern: 'REL-[0-9]+',
        jiraShowRelease: 'false',
        jiraShowArchived: 'false',
    ]
    release = input message: 'Select JIRA Version to release', parameters: [releases_list]
    RELEASE_NAME = release
}
My initial thought was that JiraVersionParameterDefinition would have an argument to sort them, but I could not find one. Then I tried default fields such as sortReverseOrder, sortMode, and reverse_order, but they weren't recognized. I have also tried to create the list outside the input parameters so I could sort it beforehand, but it isn't materialised at that point; when I print it, the result is:
[$class:JiraVersionParameterDefinition, jiraProjectKey:PROJECT_NAME, jiraReleasePattern:REL-[0-9]+, jiraShowRelease:false, jiraShowArchived:false]
Am I missing a sorting parameter that is default on Jenkins Inputs? Is there a way to initialize the list before the input step? Is it possible to add a filter argument that can sort the resulting list?
