Accessing streaming data from ADLS in PyFlink - azure-blob-storage

I'm attempting a PyFlink streaming setup from ADLS and am currently trying to read a JSON file using the StreamExecutionEnvironment.from_source() method.
Here's what the code looks like:
from pyflink.datastream.functions import SourceFunction
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors.file_system import (FileSource, StreamFormat, FileSink,
                                                       OutputFileConfig, RollingPolicy, BucketAssigner)
from pyflink.common import WatermarkStrategy, Encoder, Types
from azure.storage.filedatalake import FileSystemClient
file_system = FileSystemClient.from_connection_string(connection_str, file_system_name="my_fs")
# setting the stream environment object
env = StreamExecutionEnvironment.get_execution_environment()
env.set_runtime_mode(RuntimeExecutionMode.STREAMING)
env.add_jars("file:///opt/flink/plugins/azure/flink-azure-fs-hadoop-1.16.0.jar")
env.add_classpaths("file:///opt/flink/plugins/azure/flink-azure-fs-hadoop-1.16.0.jar")
file_client = file_system.get_file_client(my_file)
input_path = 'abfss://' + file_client.url[8:]
print('URL is ===== >>>>',file_client.url)
# Source
ds = env.from_source(
    source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),
                                               input_path)
                     .process_static_file_set().build(),
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name="file_source"
)
ds.sink_to(
    sink=FileSink.for_row_format(
        base_path=output_path,
        encoder=Encoder.simple_string_encoder())
    .with_bucket_assigner(BucketAssigner.base_path_bucket_assigner())
    .build())
ds.print()
env.execute()
I am getting the below error:
Caused by: org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could not find a file system implementation for scheme 'abfs'. The scheme is directly supported by Flink through the following plugin(s): flink-fs-azure-hadoop. Please ensure that each plugin resides within its own subfolder within the plugins directory. See https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/ for more information. If you want to use a Hadoop file system for that scheme, please add the scheme to the configuration fs.allowed-fallback-filesystems. For a full list of supported file systems, please see https://nightlies.apache.org/flink/flink-docs-stable/ops/filesystems/.
The JAR file has already been added to the plugins folder as described in the documentation:
https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/
The storage account key has also been added to the config.yaml file.
Alternatively, I tried adding the source as a DataStream using:
ds = env.read_text_file(input_path)
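For reference, here is a minimal sketch (with hypothetical account and file names) of how the abfss:// URI can be built explicitly instead of being sliced out of file_client.url; the ABFS scheme expects abfss://<container>@<account>.dfs.core.windows.net/<path>:
account_name = "mystorageaccount"   # hypothetical ADLS Gen2 account name
container = "my_fs"                 # the file system / container used above
file_path = "input/data.json"       # hypothetical path of the JSON file inside the container
input_path = f"abfss://{container}@{account_name}.dfs.core.windows.net/{file_path}"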

Related

Kedro context and catalog missing from Jupyter Notebook

I am able to run my pipelines using the kedro run command without issue. For some reason, though, I can't access my context and catalog from Jupyter Notebook anymore. When I run kedro jupyter notebook and start a new (or existing) notebook using my project name when selecting "New", I get the following errors:
context
NameError: name 'context' is not defined
catalog.list()
NameError: name 'catalog' is not defined
EDIT:
After running the magic command %kedro_reload I can see that my ProjectContext's init_spark_session is looking for files in project_name/notebooks instead of project_name/src. I tried changing the working directory in my Jupyter Notebook session with %cd ../src and os.chdir('../src'), but kedro still looks in the notebooks folder:
%kedro_reload
java.io.FileNotFoundException: File file:/Users/user_name/Documents/app_name/kedro/notebooks/dist/project_name-0.1-py3.8.egg does not exist
_spark_session.sparkContext.addPyFile() is looking in the wrong place. When I comment out this line from my ProjectContext, this error goes away, but I receive another one about not being able to find my Oracle driver when trying to load a dataset from the catalog:
df = catalog.load('dataset')
java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
EDIT 2:
For reference:
kedro/src/project_name/context.py
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in project's conf folder."""
    # Load the spark configuration in spark.yaml using the config loader
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())
    # Initialise the spark session
    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    _spark_session.sparkContext.addPyFile(f'src/dist/project_name-{__version__}-py3.8.egg')
kedro/conf/base/spark.yml:
# You can define spark specific configuration here.
spark.driver.maxResultSize: 8g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true
# https://kedro.readthedocs.io/en/stable/11_tools_integration/01_pyspark.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
# JDBC driver
spark.jars: drivers/ojdbc8-21.1.0.0.jar
I think a combination of these might help you:
1. Generally, let's try to avoid manually interfering with the current working directory, so remove the os.chdir call in your notebook. Construct an absolute path where possible.
2. In your init_spark_session, when calling addPyFile, use an absolute path instead. self.project_path points to the root directory of your Kedro project, so you can use it to construct the path to your PyFile accordingly, e.g. _spark_session.sparkContext.addPyFile(f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg') (see the sketch below).
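As a minimal sketch (assuming the same ProjectContext class and attributes shown above), the revised method could look like this:
def init_spark_session(self) -> None:
    """Initialises a SparkSession using the config defined in project's conf folder."""
    parameters = self.config_loader.get("spark*", "spark*/**")
    spark_conf = SparkConf().setAll(parameters.items())
    spark_session_conf = (
        SparkSession.builder.appName(self.package_name)
        .enableHiveSupport()
        .config(conf=spark_conf)
    )
    _spark_session = spark_session_conf.getOrCreate()
    _spark_session.sparkContext.setLogLevel("WARN")
    # Build the egg path from the project root rather than the current working directory
    _spark_session.sparkContext.addPyFile(
        f'{self.project_path}/src/dist/project_name-{__version__}-py3.8.egg'
    )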
Not sure why you would need to add the PyFile though, but maybe you have a specific reason.

Import type definitions from other packages in protobuf

I am trying to create a gRPC service with a single, very basic action, GetDeployment, which takes a namespace and a name as input and returns a Kubernetes Deployment. The thing is that I do not want to define my own message for the Deployment, as it already exists in the official Kubernetes repository.
I am pretty new to gRPC and probably do not understand well enough how it works, but can I import this message into my own file so that I could then write the following .proto file?
syntax = "proto3";
package api;
import "google/api/annotations.proto";
import "k8s.io/kubernetes/pkg/api/v1/generated.proto";
message GetDeploymentOptions {
string namespace = 1;
string name = 2;
}
service AppsV1 {
rpc GetDeployment(GetDeploymentOptions) returns (k8s.io.kubernetes.pkg.api.v1.Deployment) {}
}
Thank you in advance
gRPC codegen is just a protoc plugin. It generates code for services and RPCs, but it follows the normal protobuf rules for imports.
In your example, if your file is in src/api.proto and the k8s API repo is a git submodule checked out into the thirdparty/k8s.io/api folder, you would generate the files you need by running:
root>protoc.exe -I thirdparty k8s.io/api/core/v1/generated.proto --go_out=go
root>protoc.exe -I thirdparty src/api.proto --go_out=plugins=grpc:go
The first command generates the .pb.go file containing the k8s messages, while the second command generates the .pb.go file containing your messages and your service.
Looking at the transitive imports of that file, you may also need to check out apimachinery into k8s.io/apimachinery and run protoc on that file as well.

Why doesn't Hadoop respect 'spark.hadoop.fs' properties set in pyspark? [duplicate]

This question already has answers here:
How to set hadoop configuration values from pyspark
(3 answers)
Closed 5 years ago.
There are three properties in my spark-defaults.conf that I want to be able to set dynamically:
spark.driver.maxResultSize
spark.hadoop.fs.s3a.access.key
spark.hadoop.fs.s3a.secret.key
Here's my attempt to do so:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g')
        .set('spark.hadoop.fs.s3a.access.key', '<access>')
        .set('spark.hadoop.fs.s3a.secret.key', '<secret>'))
spark = SparkSession.builder\
    .config(conf=conf)\
    .getOrCreate()
print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))
spark.stop()
Here's the output I get:
5g
<access>
<secret>
However, when I try to read a CSV file on S3 using this configuration, I get a permission denied error.
If I set the credentials via environment variables, I am able to read the file.
Why doesn't Hadoop respect the credentials specified this way?
Update:
I am aware of other Q&As relating to setting Hadoop properties in pyspark.
Here I am trying to record for posterity how you can be fooled into thinking you can set them dynamically via spark.hadoop.*, since that is the name you use to set these properties in spark-defaults.conf, and since you don't get an error when you try to set them this way.
Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this is only the case if you set it statically in spark-defaults.conf, not dynamically in pyspark.
It turns out that you can't specify Hadoop properties via:
spark.conf.set('spark.hadoop.<property>', <value>)
but you must instead use:
spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)
I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.
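As a minimal sketch (placeholder credentials, hypothetical bucket name), setting the S3A properties on the Hadoop configuration after the session is created looks like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.conf.set('spark.hadoop.fs.s3a.access.key', '<access>')  # appears to succeed, but Hadoop never sees it
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<access>')   # placeholder credentials
hadoop_conf.set('fs.s3a.secret.key', '<secret>')

df = spark.read.csv('s3a://my-bucket/path/to/file.csv')  # hypothetical bucket and path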

How to make Terraform archive_file resource pick up changes to source files?

Using TF 0.7.2 on a Win 10 machine.
I'm trying to set up an edit/upload cycle for development of my lambda functions in AWS, using the new "archive_file" resource introduced in TF 0.7.1
My configuration looks like this:
resource "archive_file" "cloudwatch-sumo-lambda-archive" {
source_file = "${var.lambda_src_dir}/cloudwatch/cloudwatchSumologic.js"
output_path = "${var.lambda_gen_dir}/cloudwatchSumologic.zip"
type = "zip"
}
resource "aws_lambda_function" "cloudwatch-sumo-lambda" {
function_name = "cloudwatch-sumo-lambda"
description = "managed by source project"
filename = "${archive_file.cloudwatch-sumo-lambda-archive.output_path}"
source_code_hash = "${archive_file.cloudwatch-sumo-lambda-archive.output_sha}"
handler = "cloudwatchSumologic.handler"
...
}
This works the first time I run it - TF creates the lambda zip file, uploads it and creates the lambda in AWS.
The problem comes with updating the lambda.
If I edit the cloudwatchSumologic.js file in the above example, TF doesn't appear to know that the source file has changed - it doesn't add the new file to the zip and doesn't upload the new lambda code to AWS.
Am I doing something wrong in my configuration, or is the archive_file resource not meant to be used in this way?
You could be seeing a bug. I'm on 0.7.7 and the issue now is that the SHA changes even when you don't make changes. HashiCorp will be updating this resource to a data source in 0.7.8:
https://github.com/hashicorp/terraform/pull/8492

wsadmin upload file from local machine to remote

I'm trying to automate the deployment process, and I want to upload some files to WAS using wsadmin (Jython). My question is whether it is possible to upload a file from my standalone wsadmin to a remote WAS server. And if so, is it possible to upload the file somewhere outside of an application (e.g. /opt/IBM/WebSphere/AppServer/temp)? I don't want to upload it to a specific profile, but to the server root.
When I'm deploying an application it copies the war/ear file to WAS, so is there some mechanism to upload a separate file?
Many thanks
AntAgent allows you to upload any file, provided that the content of the file can fit in memory:
https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.javadoc.doc/web/mbeanDocs/AntAgent.html
In wsadmin you'll need to use the invoke_jmx method of the AdminControl object.
from java.lang import String
import jarray

fileContent = 'hello!'
# Look up the AntAgent MBean on the deployment manager
antAgent = AdminControl.makeObjectName(AdminControl.queryNames('WebSphere:*,type=AntAgent,process=dmgr'))
str = String(fileContent)
bytes = str.getBytes()
# putScript(name, bytes) writes the given bytes to a file relative to the profile's temp directory
AdminControl.invoke_jmx(antAgent, 'putScript', [String('hello.txt'), bytes], jarray.array(['java.lang.String', '[B'], String))
Afterwards you'll find the 'hello.txt' file in the WAS profile's temp directory. You may use relative paths as well.
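As a small follow-on sketch (hypothetical file name, assuming the same imports and antAgent object as above), the same call can push the contents of a local file instead of a literal string:
# Hypothetical local path; AntAgent holds the whole payload in memory, so keep the file small
f = open('/tmp/myconfig.properties', 'r')
content = f.read()
f.close()

payload = String(content).getBytes()
AdminControl.invoke_jmx(antAgent, 'putScript',
                        [String('myconfig.properties'), payload],
                        jarray.array(['java.lang.String', '[B'], String))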
