Databricks cannot save stream checkpoint - spark-streaming

I'm trying to set up a stream to begin processing incoming files. It looks like Databricks is unable to save a checkpoint. I tried locations in ADLS Gen2 and in DBFS with the same result. Databricks creates the needed folder with some structure but cannot write to it. Are there any special requirements for a checkpoint location?
[screenshot: checkpoint folder contents]
Databricks Community Edition, runtime version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "parquet")
.option("cloudFiles.partitionColumns", "year, month, day")
.option("header", "true")
.schema(schema)
.load(destFolderName)
.writeStream.format("delta")
.option("checkpointLocation", checkpointPath)
.outputMode("append")
.partitionBy("year", "month", "day")
.start(outputPath)
The error:
java.lang.UnsupportedOperationException: com.databricks.backend.daemon.data.client.DBFSV1.createAtomicIfAbsent(path: Path)
at com.databricks.tahoe.store.EnhancedDatabricksFileSystemV1.createAtomicIfAbsent(EnhancedFileSystem.scala:324)
at com.databricks.spark.sql.streaming.AWSCheckpointFileManager.createAtomicIfAbsent(DatabricksCheckpointFileManager.scala:159)
at com.databricks.spark.sql.streaming.DatabricksCheckpointFileManager.createAtomicIfAbsent(DatabricksCheckpointFileManager.scala:60)
at com.databricks.sql.streaming.state.RocksDBFileManager.zipToDbfsFile(RocksDBFileManager.scala:497)
at com.databricks.sql.streaming.state.RocksDBFileManager.saveCheckpointToDbfs(RocksDBFileManager.scala:181)
at com.databricks.sql.rocksdb.CloudRocksDB.$anonfun$open$5(CloudRocksDB.scala:451)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:627)
at com.databricks.sql.rocksdb.CloudRocksDB.timeTakenMs(CloudRocksDB.scala:527)
at com.databricks.sql.rocksdb.CloudRocksDB.$anonfun$open$2(CloudRocksDB.scala:439)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:395)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:484)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:504)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:266)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:261)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:258)
at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:20)

The Auto Loader feature, which I was trying to use, is not currently available on Databricks Community Edition
(https://databricks.com/notebooks/Databricks-Data-Integration-Demo.html),
so "cloudFiles" cannot be used with Community Edition.

You can try to disable multi-cluster writes:
spark.databricks.delta.multiClusterWrites.enabled false
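If you want to try this from a notebook rather than in the cluster's Spark config, the same setting can be applied to the current session; a minimal sketch:
# Sketch: apply the setting for the current Spark session only
spark.conf.set("spark.databricks.delta.multiClusterWrites.enabled", "false")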
Check your path - please try to write to the standard DBFS managed by Databricks (for example dbfs:/local_disk0/tmp/checkpointName).
If you use your own mount, please check the Azure permissions there (the Storage Blob Data Contributor role is necessary).
Please also diagnose the read stream on its own:
df = spark.readStream(...)
display(df)

Related

HL7 FHIR IG Publisher returns java NullPointerException

I'm trying to generate documentation of my HL7 FHIR profiles using IG Publisher (publisher.jar). I'm running it on command line on macOS. I've uploaded the IG resource on Simplifier and it validates with no errors.
The problem is that I'm getting a java.lang.NullPointerException. Full output is below:
java -jar publisher.jar -ig fsh-generated/resources/ImplementationGuide-nfz.pozplus.json -tx n/a
FHIR IG Publisher Version 1.1.120 (Git# 210e48f945ad). Built 2022-05-13T15:20:39.709Z (8 days old)
Detected Java version: 11.0.10 from /Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home on Mac OS X/x86_64 (64bit). 2048MB available
dir = /Users/marcingrudzien/Library/CloudStorage/OneDrive-Osobisty/Praca/iEHReu/NFZ/POZ-Plus/FHIR/sushi-test/NfzTest, path = /Users/marcingrudzien/.gem/ruby/3.1.2/bin:/Users/marcingrudzien/.rubies/ruby-3.1.2/lib/ruby/gems/3.1.0/bin:/Users/marcingrudzien/.rubies/ruby-3.1.2/bin:/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:~/.dotnet/tools:/Library/Frameworks/Mono.framework/Versions/Current/Commands:/Applications/Postgres.app/Contents/Versions/latest/bin
Parameters: -ig fsh-generated/resources/ImplementationGuide-nfz.pozplus.json -tx n/a
Start Clock # sobota, 21 maja 2022 20:56:04 czas środkowoeuropejski letni (2022-05-21T20:56:04+02:00)
API keys loaded from /Users/marcingrudzien/fhir-api-keys.ini (00:00.0027)
Package Cache: /Users/marcingrudzien/.fhir/packages (00:00.0030)
Load Configuration from /Users/marcingrudzien/Library/CloudStorage/OneDrive-Osobisty/Praca/iEHReu/NFZ/POZ-Plus/FHIR/sushi-test/NfzTest/fsh-generated/resources/ImplementationGuide-nfz.pozplus.json (00:00.0053)
Root directory: /Users/marcingrudzien/Library/CloudStorage/OneDrive-Osobisty/Praca/iEHReu/NFZ/POZ-Plus/FHIR/sushi-test/NfzTest/fsh-generated/resources (00:00.0084)
Publishing Content Failed: null (00:00.0086)
(00:00.0087)
Use -? to get command line help (00:00.0087)
(00:00.0087)
Stack Dump (for debugging): (00:00.0088)
java.lang.NullPointerException
at org.hl7.fhir.igtools.publisher.Publisher.initializeFromJson(Publisher.java:2947)
at org.hl7.fhir.igtools.publisher.Publisher.initialize(Publisher.java:2168)
at org.hl7.fhir.igtools.publisher.Publisher.execute(Publisher.java:854)
at org.hl7.fhir.igtools.publisher.Publisher.main(Publisher.java:10144)
I hope someone here can help me. I can provide any additional information if needed.
From the stack dump, you are missing a "path" property in your JSON config file, but I highly recommend that you switch to the new setup: an ini file that points to a template and an IG resource. Use something like the sample-ig as a starter (https://github.com/FHIR/sample-ig) or see the how-to (https://build.fhir.org/ig/FHIR/ig-guidance/index.html).
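For reference, a minimal ig.ini sketch in the shape the sample-ig starter uses; the ig path below is taken from your command line, and the template id is an assumption you may need to adjust:
; ig.ini - minimal sketch, adjust the template to your project
[IG]
ig = fsh-generated/resources/ImplementationGuide-nfz.pozplus.json
template = fhir.base.template#current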

Accessing Oracle from AWS Lambda in Python

I am writing (hopefully) a simple AWS Lambda that will do an RDS Oracle SQL SELECT and email the results. So far I have been using the Lambda Management Console, but all the examples I've run across talk about making a Lambda Deployment Package. So my first question is: can I do this from the Lambda Management Console?
My next question is what to import for the Oracle DB API? In all the examples I have seen, they download and build a package with pip, but that would then seem to imply using a Deployment Package (see above). Trying to import any of the modules listed in the examples simply gives "No module named ...".
After writing the above I dug into the boto3 API reference and came up with:
import boto3
client = boto3.client('rds-data')
But it gives the error: Unknown service: 'rds-data'.
So I'm still lost.
As you can probably tell, I'm new to the Lambda environment. Any suggestions or examples would be greatly appreciated. Thanks.
This is an update of the solution using the 18c Oracle client libraries. If it wasn't for the main solution it would have taken me a lot longer to get my code working. This will hopefully help anyone who follows.
(An aside: I tried getting it working with instantclient_19_3 but went round in circles for a day, then tried instantclient_18_5 and it worked.)
Files downloaded and used
instantclient-basic-linux.x64-18.5.0.0.0dbru.zip (all files)
cx_Oracle 7.2.2 (https://cx-oracle.readthedocs.io/en/latest/release_notes.html#releasenotes)
libaio.so.1.0.1 (as described in the main answer, renamed to libaio.so.1)
This then gave the following files in the zip (lambda_function.py is my Python source code):
[screenshot: zip contents]
Apparently, AWS Lambda is using an older version of boto3, which does not have rds-data yet.
So I'm afraid you will have to create a deployment package containing a more recent version of boto3.
One way to do this would be to:
Create your lambda handler file (in this case named index.py).
import boto3

def my_handler(event, context):
    client = boto3.client('rds-data')
    print(client)
    # do stuff
    return "hello world"
Add a requirements.txt file in the same folder, which will contain something like:
awscli >= 1.16.118
boto3 >= 1.9.108
Now run this (depending on the setup on your computer, you can use pip instead of pip3) in the directory/folder of your index and requirements files:
pip3 install -r requirements.txt -t .
zip -r somezipname .
Next, upload this zip and change your handler 'entry point' to index.my_handler. The code should now run without errors.
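If you prefer the command line over the console for the upload, something along these lines should work; the function name my-oracle-report is a hypothetical placeholder:
# Sketch: push the deployment package to an existing Lambda function
aws lambda update-function-code --function-name my-oracle-report --zip-file fileb://somezipname.zip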
The older version of boto3 does not support rds-data, but you can deploy a package as a zip folder.
I recommend using cx_Oracle: install cx_Oracle with pip and include it in the zip package you upload.
Check this: How can I access Oracle from Python?
After much groaning and gnashing of teeth I have come up with a successful solution.
rds-data (as confirmed by AWS Support) only supports Aurora databases. I wish the AWS documentation mentioned this. 8{(>
Thanks to the answers above as well as Jason Landrey for hints as to the solution.
In order to access RDS/Oracle, you need to use cx_Oracle. But wait, there's more.
cx_Oracle is not in the standard Lambda environment, so you need to bring your own. My development environment is on Windows, but the Lambda environment is Linux, so you need to download the Linux (manylinux) wheel and install it into your packaging directory. I got mine from https://pypi.org/project/cx-Oracle/#files. Install locally with:
pip install cx_Oracle-7.1.2-cp37-cp37m-manylinux1_x86_64.whl -t .
You will see several files appear in the current directory. Then you need to find a Linux system, download /lib64/libaio.so.1.0.1, and name it libaio.so.1 in your packaging directory.
And then you need to download both Oracle instant client basic and SDK packages from http://www.oracle.com/technetwork/topics/linuxx86-64soft-092277.html.
Create a zip file with all these items (including your own Python source). In doing so, rename Oracle instant client files libclntsh.so.11.1 to libclntsh.so and libocci.so.11.1 to libocci.so.
Upload the zip to an S3 bucket, as the direct deploy is limited to 66 MB and this zip is a bit larger.
Create a Lambda with the appropriate IAM permissions and VPC access, install the package and it should be good to go.
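For completeness, a minimal sketch of what the handler itself might look like once the package is in place; the endpoint, credentials and query are hypothetical placeholders:
import cx_Oracle

def lambda_handler(event, context):
    # Hypothetical RDS endpoint, credentials and query - replace with your own
    dsn = cx_Oracle.makedsn("mydb.xxxxxxxx.us-east-1.rds.amazonaws.com", 1521,
                            service_name="ORCL")
    with cx_Oracle.connect(user="admin", password="secret", dsn=dsn) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT sysdate FROM dual")
            rows = cursor.fetchall()
    # Return rows as strings so the result is JSON-serializable
    return [str(row) for row in rows]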
I found that if you don't include all the instant client files you start getting Oracle errors about missing timezone and NLS information.
List of zip contents (for me, YMMV):
7996693 08/24/2013 12:30 libnnz11.so
0 03/11/2019 16:10 cx_Oracle-7.1.1.data/
0 03/11/2019 16:10 cx_Oracle-7.1.1.data/data/
0 03/11/2019 16:10 cx_Oracle-7.1.1.data/data/cx_Oracle-doc/
0 03/11/2019 16:10 cx_Oracle-7.1.1.dist-info/
1325 03/13/2019 12:35 Email.py
1805 02/19/2019 21:11 cx_Oracle-7.1.1.data/data/cx_Oracle-doc/LICENSE.txt
163 02/19/2019 21:11 cx_Oracle-7.1.1.data/data/cx_Oracle-doc/README.txt
851 02/19/2019 21:11 cx_Oracle-7.1.1.dist-info/METADATA
628 02/19/2019 21:12 cx_Oracle-7.1.1.dist-info/RECORD
109 02/19/2019 21:12 cx_Oracle-7.1.1.dist-info/WHEEL
10 02/19/2019 21:11 cx_Oracle-7.1.1.dist-info/top_level.txt
2270301 02/19/2019 21:11 cx_Oracle.cpython-37m-x86_64-linux-gnu.so
2140 03/13/2019 14:21 getSecrets.py
5560 03/12/2019 08:48 libaio.so.1
53865194 08/24/2013 12:30 libclntsh.so
118738042 08/24/2013 12:30 libociei.so
7633 03/13/2019 16:39 scheduleReports.py

Error while connecting sparklyr to remote sparkR in Rstudio

I tried following command in my local RStudio session to connect to sparkR -
sc <- spark_connect(master = "spark://x.x.x.x:7077",
spark_home = "/home/hduser/spark-2.0.0-bin-hadoop2.7", version="2.0.0", config = list())
But, I am getting following error -
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory '/home/hduser/spark-2.0.0-bin-hadoop2.7' not found
Any help?
Thanks in advance
May I ask, have you actually installed Spark into that folder?
Can you show the result of the ls command in the /home/ubuntu/ folder?
And sessionInfo() in R?
Let me share with you how I am using a custom folder structure.
It is on Windows, not Ubuntu, but I guess it won't make much of a difference.
Using the most recent dev edition
If you check on GitHub, the RStudio guys are updating sparklyr almost every day, fixing numerous reported bugs:
devtools::install_github("rstudio/sparklyr")
In my case, only the installation of sparklyr_0.4.12 resolved the problem with Spark 2.0 under Windows.
Checking Spark availability
Please check whether the version you are asking about is available:
spark_available_versions()
You should see something like the line below, which indicates that the version you intend to use is actually available for your sparklyr package.
[13] 2.0.0 2.7 spark_install(version = "2.0.0", hadoop_version = "2.7")
Installation of Spark
Just to keep things tidy, you may like to install Spark in a location other than the RStudio cache's home folder:
options(spark.install.dir = "c:/spark")
Once you are sure the desired version is available, it is time to install Spark:
spark_install(version = "2.0.0", hadoop_version = "2.7")
I'd check whether it is installed correctly (switch to ls for the shell if needed):
cd c:/spark
dir (in Win) | ls (in Ubuntu)
Now specify the location of the edition you want to use:
Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-bin-hadoop2.7')
And finally, enjoy creating the connection:
sc <- spark_connect(master = "local")
I hope it helps.

Downloading spark-csv in Windows

I am a beginner in the Spark world, and want to do my Machine Learning algorithms using SparkR.
I installed Spark in standalone mode on my laptop (Win 7 64-bit) and I am able to run Spark (1.6.1) and PySpark, and to start SparkR in Windows, following this effective guide: link. Once I started SparkR, I began with the famous flights example:
#Set proxy
Sys.setenv(http_proxy="http://user:password#proxy.companyname.es:8080/")
#Set SPARK_HOME
Sys.setenv(SPARK_HOME="C:/Users/amartinezsistac/spark-1.6.1-bin-hadoop2.4")
#Load SparkR and its library
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R", "lib"), .libPaths()))
library(SparkR)
#Set Spark Context and SQL Context
sc = sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
#Read Data
link <- "s3n://mortar-example-data/airline-data"
flights <- read.df(sqlContext, link, source = "com.databricks.spark.csv", header= "true")
Nevertheless, I receive the following error message after the last line:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160)
at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at org.apache.spark.api.r.RBackendHandler.ch
It seems the reason is that I do not have the spark-csv package installed, which can be downloaded from this page (GitHub link). Both here on Stack Overflow and on the spark-packages.org website (link), the advice is to run $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0, which is for a Linux installation.
My question is: how can I run this command from the Windows 7 cmd in order to download this package?
I also tried an alternate solution for my error message (Github) without success:
#In master you don't need spark-csv.
#CSV data source is built into SparkSQL. Just use it as follows:
flights <- read.df(sqlContext, "out/data.txt", source = "com.databricks.spark.csv", delimiter="\t", header="true", inferSchema="true")
Thanks in advance to everyone.
It is the same for Windows. When you start spark-shell from the bin directory, start it this way:
spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
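The same idea applies if you launch SparkR instead of spark-shell; from the Windows cmd prompt in the Spark bin directory, a command along these lines should pull the package (a sketch mirroring the spark-shell form above):
sparkR --packages com.databricks:spark-csv_2.11:1.4.0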

Loading com.databricks.spark.csv via RStudio

I have installed Spark-1.4.0. I have also installed its R package SparkR and I am able to use it via spark-shell and via RStudio; however, there is one difference I cannot solve.
When launching the SparkR-shell
./bin/sparkR --master local[7] --packages com.databricks:spark-csv_2.10:1.0.3
I can read a .csv file as follows:
flights <- read.df(sqlContext, "data/nycflights13.csv", "com.databricks.spark.csv", header="true")
Unfortunately, when I start SparkR via RStudio (correctly setting my SPARK_HOME) I get the following error message:
15/06/16 16:18:58 ERROR RBackendHandler: load on 1 failed
Caused by: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
I know I should load com.databricks:spark-csv_2.10:1.0.3 in a way, but I have no idea how to do this. Could someone help me?
This is the right syntax (after hours of trying):
(Note: you have to focus on the first line, and notice the double quotes.)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
library(SparkR)
library(magrittr)
# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# Java ref type org.apache.spark.sql.SQLContext id 1
# Load the flights CSV file using `read.df`. Note that we use the CSV reader Spark package here.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
My colleagues and I found the solution. We have initialized the sparkContext like this:
sc <- sparkR.init(appName="SparkR-Example",sparkEnvir=list(spark.executor.memory="1g"),sparkJars="spark-csv-assembly-1.1.0.jar")
We did not find a way to load a remote jar, so we downloaded spark-csv_2.11-1.0.3.jar. Including just this jar in sparkJars does not work, however, since its dependencies are not found locally. You can pass a list of jars as well, but we built an assembly jar containing all the dependencies. When loading this jar, it is possible to load the .csv file as desired:
flights <- read.df(sqlContext, "data/nycflights13.csv","com.databricks.spark.csv",header="true")
I downloaded Spark-1.4.0 and, via the command line, went to the directory Spark-1.4.0/R, where I built the SparkR package located in the subdirectory pkg as follows:
R CMD build --resave-data pkg
This gives you a .tar.gz file which you can install in RStudio (with devtools, you should be able to install the package in pkg as well).
In RStudio, you should set your path to Spark as follows:
Sys.setenv(SPARK_HOME="path_to_spark/spark-1.4.0")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
And you should be ready to go. I can only speak from Mac experience; I hope it helps.
If you have tried Pragith's solution above and are still having the issue, it is very possible that the CSV file you want to load is not in the current RStudio working directory. Use getwd() to check the RStudio working directory and make sure the CSV file is there.
