I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a CSV file into Spark SQL. Previously, when I would read a file from local disk in PySpark, I would do:
Spark 1.6
df = sqlContext.read \
.format('com.databricks.spark.csv') \
.option('header', 'true') \
.load('file:///C:/path/to/my/file.csv', schema=mySchema)
In the latest release I think it should look like this:
Spark 2.0
spark = SparkSession.builder \
.master('local[*]') \
.appName('My App') \
.getOrCreate()
df = spark.read \
.format('csv') \
.option('header', 'true') \
.load('file:///C:/path/to/my/file.csv', schema=mySchema)
But I am getting this error no matter how many different ways I try to adjust the path:
IllegalArgumentException: 'java.net.URISyntaxException: Relative path in
absolute URI: file:/C:/path//to/my/file/spark-warehouse'
Not sure if this is just an issue with Windows or if there is something I am missing. I was excited that the spark-csv package is now part of Spark right out of the box, but I can't seem to get it to read any of my local files anymore. Any ideas?
I was able to do some digging around in the latest Spark documentation, and I noticed a new configuration setting that I hadn't seen before:
spark.sql.warehouse.dir
So I went ahead and added this setting when I set up my SparkSession:
spark = SparkSession.builder \
.master('local[*]') \
.appName('My App') \
.config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
.getOrCreate()
That seems to set the working directory, and then I can just feed my filename directly into the CSV reader:
df = spark.read \
.format('csv') \
.option('header', 'true') \
.load('file.csv', schema=mySchema)
Once I set the Spark warehouse directory, Spark was able to locate all of my files, and my app finishes successfully now. The amazing thing is that it runs about 20 times faster than it did in Spark 1.6, so they really have done some impressive work optimizing their SQL engine. Spark it up!
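For reference, here is a minimal end-to-end sketch that puts the session config and the read together (the paths, app name, and mySchema are placeholders carried over from the snippets above):
from pyspark.sql import SparkSession
# Point spark.sql.warehouse.dir at a well-formed file:/// URI so Windows paths resolve cleanly.
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('My App') \
    .config('spark.sql.warehouse.dir', 'file:///C:/path/to/my/') \
    .getOrCreate()
# mySchema is the StructType defined elsewhere (placeholder from the snippets above);
# with the warehouse set, the relative filename resolves against that directory.
df = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('file.csv', schema=mySchema)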
Related
I'm trying to use Delta Lake through a Python program that is called by a step on an EMR cluster, but the step always fails with an unknown error. I suppose the error could be related to the delta.tables import, as the code is very simple.
Python program: test.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Spark Session creation
spark = (SparkSession.builder.appName("DeltaExercise")
.config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
)
# Importing delta
from delta.tables import *
# Reading
enem = (
spark.read.format("csv")
.option("inferSchema", True)
.option("header", True)
.option("delimiter", ";")
.load("MyBucket/raw-data/microdados_enem_2020.csv")
)
#Writing
(
enem
.write
.mode("overwrite")
.format("delta")
.partitionBy("year")
.save("MyBucket/staging/test")
)
Step in EMR cluster:
spark-submit --deploy-mode cluster --packages io.delta:delta-core_2.12:1.0.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --master yarn MYBUCKET/emr-code/pyspark/test.py
EMR config screens: (screenshots omitted)
If anyone has any tips on how to fix this, I'd appreciate it.
I found the error: it was a mistake in the EMR cluster configuration. Once I fixed it, the Delta files were created successfully.
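For what it's worth, here is a minimal sketch of how the output could be read back to confirm the Delta files once the step succeeds (same Delta configuration and placeholder path as in the question; this check is not part of the original fix):
from pyspark.sql import SparkSession
# Re-create a Delta-enabled session with the same settings used in the question.
spark = (SparkSession.builder.appName("DeltaCheck")
         .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())
# Read the Delta output back from the placeholder path used in the write above.
check = spark.read.format("delta").load("MyBucket/staging/test")
check.printSchema()
print(check.count())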
Update to the question (6/21):
Background about Simba:
The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector.
The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.
I'm trying to connect to BigQuery from PySpark (in Docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.
My code, which I just submit from VC within the Spark Docker image:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob
my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)
sc_conf = SparkConf()
sc_conf.setAppName("testApp")
sc_conf.setMaster('local[*]')
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)
spark = SparkSession \
.builder \
.master('local') \
.appName('spark-read-from-bigquery') \
.config("spark.executor.extraClassPath",my_jar_str) \
.config("spark.driver.extraClassPath",my_jar_str) \
.config("spark.jars", my_jar_str)\
.getOrCreate()
myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0,
ProjectId='ProjectId',
OAuthServiceAcctEmail="etl#dProjectId.iam.gserviceaccount.com",
OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")
pgDF = spark.read \
.format("jdbc") \
.option("url", myJDBC) \
.option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
.option("dbtable", my_query) \
.load()
I'm getting this error:
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
Are there missing jars, or is the logic wrong?
Any clue is appreciated.
To anyone who might have the same thought: I just found that Simba does not support Spark; instead, I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector.
The open issue (as of 6/23) is that I don't use Dataproc but rather standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
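In case it helps, here is a rough sketch of what reading a table with the spark-bigquery-connector from a standalone PySpark session might look like (the connector version, project/dataset/table names, and the credentials path are all assumptions, not taken from the question):
from pyspark.sql import SparkSession
# Pull the connector via spark.jars.packages; the version below is an assumption.
spark = (SparkSession.builder
         .appName("spark-read-from-bigquery")
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.21.1")
         .getOrCreate())
# Table, parent project, and credentials path are hypothetical placeholders.
df = (spark.read.format("bigquery")
      .option("credentialsFile", "/workspaces/code/secrets/etl.json")
      .option("parentProject", "ProjectId")
      .option("table", "ProjectId.babynames.names_2014")
      .load())
df.show(5)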
If ODBC also works for you, maybe this can help.
First, download and configure the ODBC driver from here:
Next, use the connection like this (note the IgnoreTransactions parameter):
import pyodbc
import pandas as pd
conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')
qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)
I had a problem with the error: Error converting value to long
My solution was to create a jar file from Java which includes a JDBC dialect:
https://github.com/Fox-sv/spark-bigquery
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"
jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())
df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')
I am using RStudio to connect to my HDFS file using SparkR. When I leave Spark analyses running overnight, I get an "R session aborted" error the next day. From Spark's documentation on SparkR (https://spark.apache.org/docs/latest/configuration.html), the default value of spark.r.backendConnectionTimeout is set to 6000s. I would like to change this value to something large so that my connection doesn't time out after the analysis is done.
I have tried the following:
sparkR.session(master = "local[*]", sparkConfig = list(spark.r.backendConnectionTimeout = 10))
sparkR.session(master = "local[*]", spark.r.backendConnectionTimeout = 10)
I get the same output for both commands:
Spark package found in SPARK_HOME: C:\Spark\spark-2.3.2-bin-hadoop2.7
Launching java with spark-submit command C:\Spark\spark-2.3.2-bin-hadoop2.7/bin/spark-submit2.cmd sparkr-shell C:\Users\XYZ\AppData\Local\Temp\3\RtmpiEaE5q\backend_port696c18316c61
Java ref type org.apache.spark.sql.SparkSession id 1
It seems that the parameter was not passed correctly. Also, I am not sure where to pass that parameter.
Any help would be appreciated.
A similar post is around, but that involves Zeppelin (how to change spark.r.backendConnectionTimeout value?).
Thanks.
I found the solution: it is to modify the spark-defaults.conf file and add the following line:
spark.r.backendConnectionTimeout = 6000000
(or whatever time limit you want)
IMPORTANT note: restart the Hadoop and YARN services, then try connecting to Spark with SparkR normally:
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local")
You can check whether the setting took effect at http://localhost:4040/environment/
I hope this is useful for other people.
I'm using Amazon EMR.
I have some log data in S3, all in the same bucket but under different subdirectories, like:
"s3://bucketname/2014/08/01/abc/file1.bz"
"s3://bucketname/2014/08/01/abc/file2.bz"
"s3://bucketname/2014/08/01/xyz/file1.bz"
"s3://bucketname/2014/08/01/xyz/file3.bz"
I'm using:
Set hive.mapred.supports.subdirectories=true;
Set mapred.input.dir.recursive=true;
When trying to load all data from "s3://bucketname/2014/08/":
CREATE EXTERNAL TABLE table1(id string, at string,
custom struct<param1:string, param2:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/';
In return I get:
OK
Time taken: 0.169 seconds
When trying to query the table:
SELECT * FROM table1 LIMIT 10;
I get:
Failed with exception java.io.IOException:java.io.IOException: Not a file: s3://bucketname/2014/08/01
Does anyone have an idea on how to solve this?
It's an EMR-specific problem; here is what I got from Amazon support:
Unfortunately Hadoop does not recursively check the subdirectories of Amazon S3 buckets. The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories.
According to this document ("Are you trying to recursively traverse input directories?")
Looks like EMR does not support recursive directory at the moment. We are sorry about the inconvenience.
This works now (May 2018)
A global EMR-wide fix is to set the following in the /etc/spark/conf/spark-defaults.conf file:
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
hive.mapred.supports.subdirectories true
Or, it can be fixed locally as in the following PySpark code:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.enableHiveSupport() \
.config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true") \
.config("hive.mapred.supports.subdirectories","true") \
.getOrCreate()
spark.sql("<YourQueryHere>").show()
The problem is the way you have specified the location
s3://bucketname/2014/08/
The Hive external table expects files to be present at this location, but it contains only folders.
Try a path like
"s3://bucketname/2014/08/01/abc/,s3://bucketname/2014/08/01/xyz/"
You need to provide the path down to the directories that contain the files.
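If listing every leaf path in LOCATION gets unwieldy, another approach (not from the original answer; the partition column names here are assumptions based on the directory layout) is to declare the table partitioned and register each leaf directory explicitly, for example via spark.sql:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("hive-partitions").enableHiveSupport().getOrCreate()
# Partitioned external table; columns match the question, partition keys (dt, source) are assumed.
# This also assumes the openx JSON SerDe jar is available on the classpath.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS table1(
        id string, at string,
        custom struct<param1:string, param2:string>)
    PARTITIONED BY (dt string, source string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://bucketname/2014/08/'
""")
# Each partition points at one leaf directory, so Hive reads files rather than folders.
spark.sql("ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt='2014-08-01', source='abc') LOCATION 's3://bucketname/2014/08/01/abc/'")
spark.sql("ALTER TABLE table1 ADD IF NOT EXISTS PARTITION (dt='2014-08-01', source='xyz') LOCATION 's3://bucketname/2014/08/01/xyz/'")
spark.sql("SELECT * FROM table1 LIMIT 10").show()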
This worked in GFV2:
$AS_HOME/bin/asadmin \
--host $AS_ADMIN_HOST \
--user $AS_ADMIN_USER \
--port $AS_ADMIN_PORT \
create-jms-resource \
--restype javax.jms.QueueConnectionFactory \
--description XA\ Queue\ Connection\ Factory \
--property Name=myXAQueueConnectionFactory:SupportsXA=true \
jms/myXAQueueConnectionFactory
But SupportsXA=true no longer works, and I can't find the equivalent in the GFV3 manuals, nor via our friend Google: how do I specify XA transactionality when using asadmin to configure the factory? Anybody out there know how?
--property ...:transaction-support=XATransaction:...
This seems to be what I needed; it works. I did not find it by searching the documentation or by Google; I deduced it by looking at the domain.xml file and taking an educated guess at the syntax.
I am now trying to figure out what property name/value pair sets the connection-validation property the way I want it.
The question has morphed into: what is the full asadmin syntax and property set for GFV3 connection factories?