Kedro 0.17: Override globals.yml with extra params

I'm currently not able to update the globals.yml file with extra params passed at run time, as I previously did with Kedro 0.16.x. I run Kedro through run.py.
@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    session = get_current_session()
    extra_params = session.store.get('extra_params')
    loader = TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        globals_dict=extra_params,
    )
    return loader
Running the above raises the following error:
RuntimeError: There is no active Kedro session.
60 #hook_impl
61 def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
---> 62 session = get_current_session()
63 extra_params = session.store.get('extra_params')
64 loader = TemplatedConfigLoader(
My pipelines work without this hook, but then of course I can't override globals.yml at run time (which is essential).
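One way to sidestep the session lookup entirely: later releases in the 0.17.x line extend the register_config_loader hook spec so that the run-time parameters are passed straight into the hook (I don't recall exactly which patch release introduced this, so treat it as an assumption and check your version's hook specs). A minimal sketch, assuming that extended signature and the usual ProjectHooks class from the project template:

from typing import Any, Dict, Iterable, Optional

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


class ProjectHooks:
    @hook_impl
    def register_config_loader(
        self,
        conf_paths: Iterable[str],
        env: str,
        extra_params: Optional[Dict[str, Any]],
    ) -> ConfigLoader:
        # extra_params carries whatever was passed via `kedro run --params ...`
        # or KedroSession.create(..., extra_params=...); fall back to an empty
        # dict so globals.yml still loads when nothing is passed.
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict=extra_params or {},
        )

If you are pinned to a release where the hook only receives conf_paths, this sketch will not apply as written and the globals would have to be overridden another way.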

Related

When importing a NetCDF into a GeoServer data store, a non-obvious error occurs

I have a brand-new install of GeoServer 2.22 on Ubuntu 22.04, and installation was smooth. I added the official NetCDF plugin by unzipping its contents into the WEB-INF/lib/ folder, and it shows up as a type in the data store. Great!
I have a selection of NetCDFs that load successfully elsewhere (QGIS, ArcGIS Pro, Python via xarray). However, when I attempt to create a new data store, choose NetCDF, and select one of the .nc files, I get the following error message:
There was an error trying to connect to store AFDRS_FSE_curing. Do you want to save it anyway?
Original exception error:
Failed to create reader from file:efs/temp_surface.nc and hints Hints: FORCE_LONGITUDE_FIRST_AXIS_ORDER = true EXECUTOR_SERVICE = java.util.concurrent.ThreadPoolExecutor#1242674b[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] FILTER_FACTORY = FilterFactoryImpl STYLE_FACTORY = StyleFactoryImpl FEATURE_FACTORY = org.geotools.feature.LenientFeatureFactoryImpl#6bfaa0a6 FORCE_AXIS_ORDER_HONORING = http GRID_COVERAGE_FACTORY = GridCoverageFactory TILE_ENCODING = null REPOSITORY = org.geoserver.catalog.CatalogRepository#3ef5cfc5 LENIENT_DATUM_SHIFT = true COMPARISON_TOLERANCE = 1.0E-8
What am I missing here? That error message doesn't seem to highlight anything obvious as the actual error...
The NetCDFs are located in: /usr/share/geoserver/data_dir/data
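For what it's worth, GeoServer's NetCDF reader is generally much stricter about CF conventions (coordinate variables, units, dimension layout) than xarray or QGIS, so dumping the file's structure can show what the reader might be choking on. A rough diagnostic sketch, assuming the rejected file resolves to /usr/share/geoserver/data_dir/data/efs/temp_surface.nc (inferred from the error message and the data directory above) and that xarray is installed:

import xarray as xr

# Path is an assumption: the error mentions efs/temp_surface.nc and the files
# live under the GeoServer data directory.
path = "/usr/share/geoserver/data_dir/data/efs/temp_surface.nc"

ds = xr.open_dataset(path)
print(ds)  # dimensions, coordinates and data variables at a glance

# List each variable's dimensions and attributes so missing CF metadata
# (units, standard_name, coordinate variables) stands out.
for name, var in ds.variables.items():
    print(name, var.dims, dict(var.attrs))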

I can't run a correlation with ggcorrmat

I am getting this error when running a correlation matrix with the R package ggstatsplot:
ggcorrmat(data = d, type = "nonparametric", p.adjust.method = "hochberg", pch = "")
Error: 'data_to_numeric' is not an exported object from 'namespace:datawizard'
Could somebody help me?
I expected the script to run normally, as I had used it before (around July) without any errors.
Has anything changed in the ggstatsplot package?

Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors

I tried to compile a TFLite model to an Edge TPU model and ran into an error like this:
Edge TPU Compiler version 16.0.384591198
Started a compilation timeout timer of 180 seconds.
ERROR: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors.
Compilation failed: Model failed in Tflite interpreter. Please ensure model can be loaded/run in Tflite interpreter.
Compilation child process completed within timeout period.
Compilation failed!
I define my model like this:
preprocess_input = tf.keras.applications.efficientnet.preprocess_input

# tfl is assumed to be tf.keras.layers
def Model(image_size=IMG_SIZE):
    input_shape = image_size + (3,)
    inputs = tf.keras.Input(shape=input_shape)
    x = preprocess_input(inputs)
    base_model = tf.keras.applications.efficientnet.EfficientNetB0(
        input_shape=input_shape, include_top=False, weights="imagenet"
    )
    base_model.trainable = False
    x = base_model(x, training=False)
    x = tfl.GlobalAvgPool2D()(x)
    x = tfl.Dropout(rate=0.2)(x)
    outputs = tfl.Dense(90, activation='softmax')(x)
    model = tf.keras.Model(inputs, outputs)
    return model
The model summary is like this:
I convert to a TFLite model like this:
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Defining the representative dataset from training images.
def representative_dataset_gen():
    for image, label in train_dataset.take(100):
        yield [image]

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
# Using Integer Quantization.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Setting the input and output tensors to uint8.
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tflite_model = converter.convert()

if not os.path.isdir('exported'):
    os.mkdir('exported')
with open('/workspace/eff/exported/groups_1.tflite', 'wb') as f:
    f.write(tflite_model)
Environment:
Edge TPU Compiler version 16.0.384591198
Python version 3.6.9
tensorflow 1.15.3
When looking for solutions on Google, someone said you need to get rid of the preprocess_input, but I'm not sure what that means.
How can I check whether there is a dynamic-shape tensor in the model,
and how can I fix it?
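To the best of my knowledge, one way to check for dynamic shapes is to load the converted .tflite file into the TFLite interpreter and inspect the tensor details: a -1 in shape_signature marks a dynamic dimension, and in Keras conversions that is most often just the batch axis. A small sketch, reusing the groups_1.tflite path from above (the details layout described here is the TF 2.x one, so inspect with a TF 2.x install if possible):

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="/workspace/eff/exported/groups_1.tflite")

for detail in interpreter.get_input_details() + interpreter.get_output_details():
    # shape_signature keeps -1 where a dimension is dynamic; shape shows the
    # default concrete shape the interpreter would allocate.
    print(detail["name"], detail["shape"], detail.get("shape_signature"))

If the only -1 turns out to be the batch dimension, one commonly suggested fix is to pin it when building the Keras input before converting, e.g. inputs = tf.keras.Input(shape=input_shape, batch_size=1), and then re-run the conversion and the Edge TPU compiler.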

Why do Spark streaming's mini batches last longer on Windows?

I have created a topic in Kafka called "test" which has just one partition and is not replicated.
I have created a Kafka producer that writes the string "A B C A" to the topic "test" in a loop of 100000 iterations, with a 1000 ms sleep (Thread.sleep) between iterations. The key is the index of each iteration.
I have run the following code both on CentOS 7 and on Windows. I usually build a fat jar using the Maven assembly plugin and run it with spark-submit. I always have to specify the packages when submitting the jar: --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
public class StreamFromKafka {
    public static void es() throws StreamingQueryException {
        SparkSession session = SparkSession.builder().appName("streamFromKafka").master("local[*]").getOrCreate();
        String columnName = "value";
        Dataset<Row> df = session.readStream().format("kafka")
                .option("group.id", "test-consumer-group")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "test").load();
        Dataset<Row> df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").select(columnName);
        Dataset<String> words = df1.as(Encoders.STRING()).flatMap(line -> Arrays.asList(line.split(" ")).iterator(), Encoders.STRING());
        //comment1 --> StreamingQuery query0 = words.writeStream().outputMode("update").format("console").start();
        //comment2 --> query0.awaitTermination();
        Dataset<Row> wordCount = words.groupBy("value").count();
        StreamingQuery query = wordCount.writeStream().outputMode("update").format("console").start();
        query.awaitTermination();
    }
}
If I uncomment "comment1" and "comment2" in the above code, the table is printed quickly on Windows:
-------------------------------------------
Batch: 5
-------------------------------------------
+-----+
|value|
+-----+
| A|
| B|
| C|
| A|
| A|
| B|
| C|
| A|
+-----+
However, if I leave "comment1" and "comment2" commented out, the mini batches seem to take a long time on Windows.
So I can conclude that the stream DOES read from Kafka on Windows, but the group by takes a lot of time.
I left this implementation running longer on Windows than on Linux, starting yesterday evening at 20:46. The mini batches are much longer on Windows (real-time streaming is built with mini batches under the hood of the Structured Streaming API). For example, as you can see in the following picture, it takes one minute to execute two batches:
As you can see in the following picture, it takes three minutes to execute four batches:
It is quicker on Linux. Since I tried it on Linux first, I expected to see the console output sooner on Windows; when I didn't see anything, I thought it wasn't working.
I should time the mini batches on Linux in order to compare the behaviours.
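Since the plan is to time the mini batches on both systems, the streaming query's progress objects already contain per-batch timings, which makes the comparison concrete without extra instrumentation. The original code is Java, but here is a rough PySpark sketch of the same pipeline that polls lastProgress and prints the per-batch durations (the Java API exposes the same information through the corresponding lastProgress() call); it still needs the spark-sql-kafka package on submit, matching your Spark/Scala version:

import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("streamFromKafkaTimed")
    .master("local[*]")
    .getOrCreate()
)

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test")
    .load()
)

# Same word split and count as the Java version.
words = df.select(F.explode(F.split(F.col("value").cast("string"), " ")).alias("value"))
word_count = words.groupBy("value").count()

query = (
    word_count.writeStream
    .outputMode("update")
    .format("console")
    .start()
)

# Poll recent micro-batch progress; durationMs breaks each batch down into
# phases (addBatch, getBatch, triggerExecution, ...), which is what to
# compare between Windows and Linux.
for _ in range(30):
    time.sleep(10)
    progress = query.lastProgress
    if progress:
        print(progress["batchId"], progress["durationMs"])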

Using apply functions in SparkR

I am currently trying to implement some functions using SparkR version 1.5.1. I have seen older (version 1.3) examples where people used the apply function on DataFrames, but it looks like this is no longer directly available. Example:
x = c(1,2)
xDF_R = data.frame(x)
colnames(xDF_R) = c("number")
xDF_S = createDataFrame(sqlContext,xDF_R)
Now, I can use the function sapply on the data.frame object
xDF_R$result = sapply(xDF_R$number, ppois, q=10)
When I use a similar logic on the DataFrame
xDF_S$result = sapply(xDF_S$number, ppois, q=10)
I get the error message "Error in as.list.default(X) : no method for coercing this S4 class to a vector".
Can I somehow do this?
This is possible with user defined functions in Spark 2.0.
wrapper = function(df){
  out = df
  out$result = sapply(df$number, ppois, q=10)
  return(out)
}

xDF_S2 = dapplyCollect(xDF_S, wrapper)
identical(xDF_S2, xDF_R)
# [1] TRUE
Note you need a wrapper function like this because you can't pass the extra arguments in directly, but that may change in the future.
Native R functions do not support Spark DataFrames. We can use user-defined functions in SparkR to execute native R code; these run on the executors, so the required libraries must be available on all the executors.
For example, suppose we have a custom function holt_forecast which takes in a data.table as an argument.
Sample R code
sales_R_df %>%
group_by(product_id) %>%
do(holt_forecast(data.table(.))) %>%
data.table(.) -> dt_holt
To use UDFs, we need to specify the schema of the output data.frame returned by the native R method. Spark uses this schema to build the resulting Spark DataFrame.
Equivalent SparkR code
Define the schema:
dt_holt_schema <- structType(
  structField("product_id", "integer"),
  structField("audit_date", "date"),
  structField("holt_unit_forecast", "double"),
  structField("holt_unit_forecast_std", "double")
)
Execute the method:
# gapply runs the native R function on each product_id group on the executors;
# sales_sdf is assumed to be the Spark DataFrame holding the sales data.
dt_holt <- gapply(sales_sdf, "product_id", function(key, x) {
  library(data.table)
  library(lubridate)
  library(dplyr)
  library(forecast)
  sales <- data.table(x)
  y <- data.frame(key, holt_forecast(sales))
  y
}, dt_holt_schema)
Reference: https://shbhmrzd.medium.com/stl-and-holt-from-r-to-sparkr-1815bacfe1cc
