Delta Lake in EMR - aws-lambda

I'm trying to use Delta Lake from a Python program that is called by a step on an EMR cluster, but the step always fails with an unknown error. I suspect the error could be related to the delta.tables import, since the code is very simple.
Python program: test.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Spark Session creation
spark = (
    SparkSession.builder.appName("DeltaExercise")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Importing delta
from delta.tables import *

# Reading
enem = (
    spark.read.format("csv")
    .option("inferSchema", True)
    .option("header", True)
    .option("delimiter", ";")
    .load("MyBucket/raw-data/microdados_enem_2020.csv")
)

# Writing
(
    enem
    .write
    .mode("overwrite")
    .format("delta")
    .partitionBy("year")
    .save("MyBucket/staging/test")
)
Step in EMR cluster:
spark-submit --deploy-mode cluster --packages io.delta:delta-core_2.12:1.0.0 --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --master yarn MYBUCKET/emr-code/pyspark/test.py
EMR config screens: (screenshots omitted)
If anyone has any tips on how to fix this, I'd appreciate it.

I found the error. It was a mistake in the EMR cluster configuration. Delta files were created successfully.
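For anyone verifying a similar setup, here is a minimal sketch that reads the written table back to confirm the Delta write succeeded. The path and the s3:// prefix are placeholders mirroring the question, not the original poster's exact values.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("DeltaReadCheck")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read back the table written by test.py; adjust the path to your own bucket.
df = spark.read.format("delta").load("s3://MyBucket/staging/test")
df.printSchema()
print(df.count())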

Related

Connect to BigQuery from pyspark using simba JDBC

Update to the question (6/21)
Background about Simba:
The Simba Google BigQuery JDBC Connector is delivered in a ZIP archive named SimbaBigQueryJDBC42-[Version].zip, where [Version] is the version number of the connector.
The archive contains the connector supporting the JDBC API version indicated in the archive name, as well as release notes and third-party license information.
I'm trying to connect to BigQuery from pyspark (Docker) using the Simba JDBC driver, with no success. I have reviewed many posts here but couldn't find a clue.
My code, which I simply submit from VC within the Spark Docker image:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SQLContext, SparkSession
import os
from glob import glob

# Collect all Simba JDBC jars shipped with the driver
my_jar = glob('/root/Downloads/BigQuery/simba_jdbc_1.2.4.1007/*.jar')
my_jar_str = ','.join(my_jar)
print(my_jar_str)

sc_conf = SparkConf()
sc_conf.setAppName("testApp")
sc_conf.setMaster('local[*]')
sc_conf.set("spark.jars", my_jar_str)
sc = pyspark.SparkContext(conf=sc_conf)

spark = SparkSession \
    .builder \
    .master('local') \
    .appName('spark-read-from-bigquery') \
    .config("spark.executor.extraClassPath", my_jar_str) \
    .config("spark.driver.extraClassPath", my_jar_str) \
    .config("spark.jars", my_jar_str) \
    .getOrCreate()

myJDBC = '''
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthType={OAuthType};ProjectId={ProjectId};OAuthServiceAcctEmail={OAuthServiceAcctEmail};OAuthPvtKeyPath={OAuthPvtKeyPath};
'''.format(OAuthType=0,
           ProjectId='ProjectId',
           OAuthServiceAcctEmail="etl#dProjectId.iam.gserviceaccount.com",
           OAuthPvtKeyPath="/workspaces/code/secrets/etl.json")

my_query = '(SELECT 1) AS t'  # placeholder: my_query was not defined in the original post

pgDF = spark.read \
    .format("jdbc") \
    .option("url", myJDBC) \
    .option("driver", "com.simba.googlebigquery.jdbc42.Driver") \
    .option("dbtable", my_query) \
    .load()
I'm getting the error:
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
Are jars missing, or is the logic wrong?
Any clue is appreciated.
To anyone who might have the same thought: I just found that Simba does not support Spark; instead I have to follow the steps in https://github.com/GoogleCloudDataproc/spark-bigquery-connector.
The open issue (as of 6/23) is that I don't use Dataproc but rather standalone Spark, so I need to figure out how to collect a consistent set of supporting jars.
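For reference, a hedged sketch of what a read through the spark-bigquery connector typically looks like. The package coordinate, connector version, and table name below are illustrative assumptions; check the connector's README for the artifact matching your Spark/Scala build.

from pyspark.sql import SparkSession

# Assumed connector coordinate/version; pick the one matching your Scala build.
spark = (
    SparkSession.builder.appName("bq-connector-example")
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.25.2")
    .getOrCreate()
)

# Read an example public table; replace with your own project.dataset.table.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.usa_names.usa_1910_2013")
    .load()
)
df.show(5)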
If ODBC also works for you, maybe this can help.
First, download and configure the Simba ODBC driver for Google BigQuery.
Next - use the connection like this (note the IgnoreTransactions parameter):
import pyodbc
import pandas as pd
conn = pyodbc.connect(r'Driver={Simba ODBC Driver for Google BigQuery};OAuthMechanism=0;Catalog=<projectID>;KeyFilePath=<path to json credentials>;Email=<email of service account>;IgnoreTransactions=1')
qry = 'select * from <path to your table>'
data = pd.read_sql(qry,conn)
I had a problem with the error: Error converting value to long
My solution was to create a jar file in Java that includes a JDBC dialect:
https://github.com/Fox-sv/spark-bigquery
from pyspark.sql import SparkSession
from py4j.java_gateway import java_import
user_email = "EMAIL"
project_id = "PROJECT_ID"
creds = "PATH_TO_FILE"
jdbc_conn = f"jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;OAuthServiceAcctEmail={user_email};ProjectId={project_id};OAuthPvtKeyPath={creds};"
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "MyDialect")
jvm.org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(jvm.MyDialect().change_dialect())
df = spark.read.jdbc(url=jdbc_conn,table='(SELECT * FROM babynames.names_2014) AS table')

Spark/YARN - not all nodes are used in spark-submit

I have a Spark/YARN cluster with 3 slaves setup on AWS.
I spark-submit a job like this:
~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --master yarn --deploy-mode cluster my.py
The final result is a file that should contain the hostnames from all the slaves in the cluster. I was expecting to get a mix of hostnames in the output file; however, I only see one hostname. That means YARN never utilizes the other slaves in the cluster.
Am I missing something in the configuration?
I have also included my spark-env.sh settings below.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop/
SPARK_EXECUTOR_INSTANCES=3
SPARK_WORKER_CORES=3
my.py
import socket
import time
from pyspark import SparkContext, SparkConf

def get_ip_wrap(num):
    return socket.gethostname()

conf = SparkConf().setAppName('appName')
sc = SparkContext(conf=conf)

data = [x for x in range(1, 100)]
distData = sc.parallelize(data)

result = distData.map(get_ip_wrap)
result.saveAsTextFile('hby%s' % str(time.time()))
After I updated the following settings in spark-env.sh, all slaves were utilized:
SPARK_EXECUTOR_INSTANCES=3
SPARK_EXECUTOR_CORES=8
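As a sketch (an assumption on my part, not from the original answer), the same resources can also be requested per application through SparkConf instead of the cluster-wide spark-env.sh settings:

from pyspark import SparkConf, SparkContext

# Illustrative values matching the answer above; tune to your node sizes.
conf = (
    SparkConf()
    .setAppName('appName')
    .set('spark.executor.instances', '3')
    .set('spark.executor.cores', '8')
)
sc = SparkContext(conf=conf)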

sparkSession/sparkContext cannot get Hadoop configuration

I am running Spark 2, Hive, and Hadoop on a local machine, and I want to use Spark SQL to read data from a Hive table.
It all works fine when Hadoop runs at the default hdfs://localhost:9000, but if I change to a different port in core-site.xml:
<name>fs.defaultFS</name>
<value>hdfs://localhost:9099</value>
then running a simple query spark.sql("select * from archive.tcsv3 limit 100").show() in spark-shell gives me the error:
ERROR metastore.RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
.....
From local/147.214.109.160 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused;
.....
I have gotten the AlreadyExistsException before, and it doesn't seem to affect the result.
I can make it work by creating a new sparkContext:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
sc.stop()
var sc = new SparkContext()
val session = SparkSession.builder().master("local").appName("test").enableHiveSupport().getOrCreate()
session.sql("show tables").show()
My question is: why did the initial sparkSession/sparkContext not get the correct configuration? How can I fix it? Thanks!
If you are using SparkSession and you want to set configuration on the Spark context, then use session.sparkContext:
val session = SparkSession
  .builder()
  .appName("test")
  .enableHiveSupport()
  .getOrCreate()

import session.implicits._

session.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
You don't need to import SparkContext or create it before the SparkSession.
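For PySpark users, a hedged sketch of the same idea: Hadoop options can also be passed at session-build time using the spark.hadoop.* prefix, so the very first context already picks them up. The port mirrors the question's core-site.xml and is only an example.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local")
    .appName("test")
    .enableHiveSupport()
    # spark.hadoop.* entries are copied into the context's Hadoop Configuration
    .config("spark.hadoop.fs.defaultFS", "hdfs://localhost:9099")
    .getOrCreate()
)
spark.sql("show tables").show()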

Spark Job error GC overhead limit exceeded [duplicate]

This question already has answers here: Error java.lang.OutOfMemoryError: GC overhead limit exceeded (22 answers). Closed 6 years ago.
I am running a Spark job and setting the following configuration in spark-defaults.sh. I made these changes on the name node; I have 1 data node, and I am working on about 2 GB of data.
spark.master spark://master:7077
spark.executor.memory 5g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
But I am getting an error saying the GC overhead limit was exceeded.
Here is the code I am working on.
import os
import sys
import unicodedata
from operator import add

try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print("Error importing Spark Modules", e)
    sys.exit(1)

# delimiter function
def findDelimiter(text):
    sD = text[1]
    eD = text[2]
    return (eD, sD)

def tokenize(text):
    sD = findDelimiter(text)[1]
    eD = findDelimiter(text)[0]
    arrText = text.split(sD)
    text = ""
    seg = arrText[0].split(eD)
    arrText = ""
    senderID = seg[6].strip()
    yield (senderID, 1)

conf = SparkConf()
sc = SparkContext(conf=conf)

textfile = sc.textFile("hdfs://my_IP:9000/data/*/*.txt")
rdd = textfile.flatMap(tokenize)
rdd = rdd.reduceByKey(lambda a, b: a + b)
rdd.coalesce(1).saveAsTextFile("hdfs://my_IP:9000/data/total_result503")
I even tried groupByKey instead of reduceByKey, but I get the same error. However, when I remove the reduceByKey or groupByKey, I do get output. Can someone help me with this error?
Should I also increase the GC size in Hadoop? And, as I said earlier, I have set driver.memory to 5g on the name node. Should I do that on the data node as well?
Try adding the settings below to your spark-defaults.sh:
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC
Tuning JVM garbage collection can be tricky, but G1GC seems to work pretty well. Worth trying!
The code you have should have worked with your configuration. As suggested earlier, try using G1GC.
Also try reducing the storage memory fraction. By default it is 60%; try reducing it to 40% or less.
You can set it by adding spark.storage.memoryFraction 0.4
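A hedged sketch (my own illustration, not from the original answers) of applying the executor-side suggestions from the job itself. Driver JVM options generally have to be set before the driver starts, so they are better left in spark-defaults.sh or passed via --conf on spark-submit.

from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("gc-tuning-example")
    # Ask executors to use the G1 collector, as suggested above
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # Legacy storage fraction knob (default 0.6); lower it to leave more room for execution
    .set("spark.storage.memoryFraction", "0.4")
)
sc = SparkContext(conf=conf)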
I was able to solve the problem. I was running Hadoop as the root user on the master node, but had configured Hadoop under a different user on the data nodes. Once I configured it under the root user on the data node as well and increased the executor and driver memory, it worked fine.

Amazon EMR and Hive: Getting a "java.io.IOException: Not a file" exception when loading subdirectories to an external table

I'm using Amazon EMR.
I have some log data in S3, all in the same bucket but under different subdirectories, like:
"s3://bucketname/2014/08/01/abc/file1.bz"
"s3://bucketname/2014/08/01/abc/file2.bz"
"s3://bucketname/2014/08/01/xyz/file1.bz"
"s3://bucketname/2014/08/01/xyz/file3.bz"
I'm using:
Set hive.mapred.supports.subdirectories=true;
Set mapred.input.dir.recursive=true;
When trying to load all data from "s3://bucketname/2014/08/":
CREATE EXTERNAL TABLE table1(id string, at string,
custom struct<param1:string, param2:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucketname/2014/08/';
In return I get:
OK
Time taken: 0.169 seconds
When trying to query the table:
SELECT * FROM table1 LIMIT 10;
I get:
Failed with exception java.io.IOException:java.io.IOException: Not a file: s3://bucketname/2014/08/01
Does anyone have an idea on how to solve this?
It's an EMR-specific problem; here is what I got from Amazon support:
Unfortunately Hadoop does not recursively check the subdirectories of Amazon S3 buckets. The input files must be directly in the input directory or Amazon S3 bucket that you specify, not in sub-directories.
According to this document ("Are you trying to recursively traverse input directories?"), it looks like EMR does not support recursive directories at the moment. We are sorry about the inconvenience.
This works now (May 2018)
A global, EMR-wide fix is to set the following in the /etc/spark/conf/spark-defaults.conf file:
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
hive.mapred.supports.subdirectories true
Or it can be fixed locally, as in the following pyspark code:
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .enableHiveSupport() \
    .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true") \
    .config("hive.mapred.supports.subdirectories", "true") \
    .getOrCreate()

spark.sql("<YourQueryHere>").show()
The problem is the way you have specified the location:
s3://bucketname/2014/08/
The Hive external table expects files to be present at that location, but it contains folders.
Try specifying paths like:
"s3://bucketname/2014/08/01/abc/,s3://bucketname/2014/08/01/xyz/"
You need to provide the path down to the files.
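As a hedged alternative sketch (not from the original answer), the leaf directories can also be registered as partitions, so the table only ever points at directories that directly contain files. Table, partition, and bucket names below mirror the question and are placeholders; the JsonSerDe jar must be on the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("add-partitions-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Partitioned variant of the table from the question
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS table1_partitioned (
        id string, at string,
        custom struct<param1:string, param2:string>)
    PARTITIONED BY (dt string, source string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://bucketname/2014/08/'
""")

# Point each partition at a leaf directory that directly contains the .bz files
spark.sql("""
    ALTER TABLE table1_partitioned ADD IF NOT EXISTS
    PARTITION (dt='2014-08-01', source='abc') LOCATION 's3://bucketname/2014/08/01/abc/'
""")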
