Problem connecting to remote PrestoDB from local machine using pyspark - hadoop

I'm trying to connect to a Presto DB installed on a remote server from my local Mac machine using pyspark; below is my code. I have downloaded the Presto driver and placed it under /user/name/Hadoop/spark-2.3.1-bin-hadoop2.7/jars (I guess this is where I'm making a mistake, but I'm not sure).
from pyspark.sql import SparkSession, HiveContext
from pyhive import presto, hive

def main():
    spark = SparkSession.builder \
        .appName("tests") \
        .enableHiveSupport() \
        .getOrCreate()
    df_presto = spark.read.format("jdbc") \
        .option("driver", "io.prestosql.jdbc.PrestoDriver") \
        .option("url", "jdbc:presto://host.com:443/hive") \
        .option("user", "user_name") \
        .option("password", "password") \
        .option("dbtable", "(select column from table_name limit 10) tmp") \
        .load()
Presto driver: presto-jdbc-340.jar
When I try to execute the code, I get the error below:
Traceback (most recent call last):
File "/Users/user_name/Hadoop/spark-2.3.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/Users/user_name/Hadoop/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.IllegalArgumentException: java.net.UnknownHostException: ip-10-120-99-149.ec2.internal;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
Any idea how I can fix this?
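One direction worth trying (a minimal sketch, not verified against this setup): the UnknownHostException names an EC2-internal host, which suggests enableHiveSupport() is failing while reaching the remote Hive metastore rather than the Presto JDBC connection itself. Skipping Hive support and passing the driver jar explicitly may get past this; the jar path below is illustrative.

from pyspark.sql import SparkSession

# Sketch only: pass the Presto JDBC jar explicitly (spark.jars here, or
# --jars / --driver-class-path on spark-submit) and skip enableHiveSupport(),
# so Spark does not try to contact the unreachable Hive metastore host.
spark = (SparkSession.builder
         .appName("tests")
         .config("spark.jars",
                 "/Users/user_name/Hadoop/spark-2.3.1-bin-hadoop2.7/jars/presto-jdbc-340.jar")
         .getOrCreate())

df_presto = (spark.read.format("jdbc")
             .option("driver", "io.prestosql.jdbc.PrestoDriver")
             .option("url", "jdbc:presto://host.com:443/hive")
             .option("user", "user_name")
             .option("password", "password")
             .option("dbtable", "(select column from table_name limit 10) tmp")
             .load())
df_presto.show()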

Related

PySpark with io.github.spark-redshift-community: BasicAWSCredentialsProvider not found

I'm trying to load data from my Redshift database using PySpark.
I'm using "io.github.spark-redshift-community" as the connector. It requires a "tempdir" parameter pointing to S3. My code looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.12:5.0.3")
findspark.add_packages("com.amazonaws:aws-java-sdk-bundle:1.12.262")
findspark.add_packages("org.apache.hadoop:hadoop-aws:3.3.4")
findspark.init()
spark = SparkSession.builder.master("local[8]").appName("Dim_Customer").getOrCreate()
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", S3_ACCESS_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", S3_SECRET_KEY)
spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3a.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
df_read_1 = spark.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", "jdbc:redshift://IP/DATABASE?user=USER&password=PASS") \
    .option("dbtable", "table") \
    .option("tempdir", "s3a://url/") \
    .option("forward_spark_s3_credentials", "true") \
    .load()
But I'm getting an error: Class org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found
I've found some sources saying to change BasicAWSCredentialsProvider to SimpleAWSCredentialsProvider, but then I get another error: NoSuchMethodError.
Could someone help me, please?
Is there any problem with the Hadoop and aws-java-sdk versions?
Thank you in advance!
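A hedged observation: the credentials provider shipped with hadoop-aws is org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider (there is no BasicAWSCredentialsProvider in that package), and a NoSuchMethodError after switching usually points at a version mismatch between hadoop-aws and the Hadoop jars bundled with the local Spark build. A minimal sketch under those assumptions; the hadoop-aws version must be changed to match your Spark's bundled Hadoop, and the keys are placeholders:

import findspark

# Assumption: hadoop-aws must be the same version as the Hadoop jars shipped
# with your Spark distribution, and aws-java-sdk-bundle must be the version
# that hadoop-aws release expects; a mismatch commonly shows up as
# NoSuchMethodError at runtime.
findspark.add_packages("io.github.spark-redshift-community:spark-redshift_2.12:5.0.3")
findspark.add_packages("com.amazonaws:aws-java-sdk-bundle:1.12.262")
findspark.add_packages("org.apache.hadoop:hadoop-aws:3.3.4")
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[8]").appName("Dim_Customer").getOrCreate()
conf = spark._jsc.hadoopConfiguration()
conf.set("fs.s3a.access.key", "<S3_ACCESS_KEY>")   # placeholder
conf.set("fs.s3a.secret.key", "<S3_SECRET_KEY>")   # placeholder
# SimpleAWSCredentialsProvider reads fs.s3a.access.key / fs.s3a.secret.key
conf.set("fs.s3a.aws.credentials.provider",
         "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")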

Unable to read json file from S3 using local PySpark

I am trying to read a JSON file from S3 using PySpark locally. Here is the code:
import os
import configparser
from pyspark.sql import SparkSession

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config.get("default", "aws_access_key_id")
secret_key = config.get("default", "aws_secret_access_key")
session_token = config.get("default", "aws_session_token")

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("ReadGoogleTrendsData") \
        .master("local[1]") \
        .getOrCreate()

    sc = spark.sparkContext

    # hadoop_conf = sc._jsc.hadoopConfiguration()
    # hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    # hadoop_conf.set("fs.s3n.awsAccessKeyId", access_key)
    # hadoop_conf.set("fs.s3n.awsSecretAccessKey", secret_key)
    # hadoop_conf.set("fs.s3n.awsSessionToken", session_token)

    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")
    hadoop_conf.set("fs.s3a.access.key", access_key)
    hadoop_conf.set("fs.s3a.secret.key", secret_key)
    hadoop_conf.set("fs.s3a.session.token", session_token)
    hadoop_conf.set("fs.s3a.connection.ssl.enabled", "true")
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    df = spark.read.json("s3a://gamma-rp/trends-data/trends-2021-07-01.json")
    df.printSchema()
    df.show(5)
I get the following error when I run this:
File "/Users/rppatwa/Downloads/Spark/spark-2.4.5-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o37.json.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: NB1WXDBQJ5S848AM, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: HeI/luU5fMpN1dehMo9+yDzMAXin2j7AZPo3STRqbvx56rDcUotMdze08cJz08s7P581ATdRmck=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:557)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:392)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
I looked up this error, and it seems I do have permission to the bucket, since I am able to use the AWS CLI cp command to download a file to my local machine.
Can you please let me know what I am missing here? Thanks so much!
It seems the issue is with your configuration. I had a similar issue before, but the following job is able to save and read data from an S3 bucket successfully.
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession \
    .builder \
    .appName("SparkExample") \
    .getOrCreate()

spark_context = spark.sparkContext
spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", '<KEY>')
spark_context._jsc.hadoopConfiguration().set("fs.s3a.secret.key", 'SECRET_KEY')
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")

schema = StructType().add("_id", StringType()) \
    .add("employer", StringType()) \
    .add("created_at", TimestampType()) \
    .add("name", StringType())

employees = [{'_id': 1,
              'employer': 'Microsoft',
              'created_at': datetime.now(),
              'name': 'Noel'},
             {'_id': 2,
              'employer': 'Apple',
              'created_at': datetime.now(),
              'name': 'Steve'},
             ]

df = spark.createDataFrame(employees, schema=schema)

df.write \
    .format("json") \
    .mode("append") \
    .save("s3a://<YOUR BUCKET>/employeesjson")

collect = spark.read.format("json").load(
    "s3a://<YOUR BUCKET>/employeesjson").collect()
print(len(collect))
I was able to resolve this issue by updating Spark from the Hadoop 2.7 build to the Hadoop 3.2 build.
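A possible extra detail, hedged: the question's configuration sets fs.s3a.session.token, but temporary credentials require the TemporaryAWSCredentialsProvider, and hadoop-aws 2.7.x does not support session tokens at all, which can surface exactly as a 403 even when the AWS CLI works. A small sketch of the relevant settings on Hadoop 2.8+/3.x, reusing hadoop_conf and the credential variables from the question's code:

# Hedged sketch: temporary credentials (access key + secret + session token)
# need the TemporaryAWSCredentialsProvider; the default provider ignores the
# session token.
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.session.token", session_token)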

PySpark Parse Kafka csv delimited data to columns using from_csv

I am new to Structured Streaming with Kafka. I am trying to convert comma-delimited data from Kafka to a DataFrame in PySpark using a schema and from_csv.
kafkaDataSchema = StructType([
    StructField("sid", StringType()), StructField("timestamp", LongType()),
    StructField("sensor", StringType()), StructField("value", StringType()),
])

kafkaStream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", self.config.get('kafka-config', 'bootstrap-servers')) \
    .option("subscribe", self.config.get('kafka-config', 'topic-list-input')) \
    .option("startingOffsets", self.config.get('kafka-config', 'startingOffsets')) \
    .load() \
    .selectExpr("CAST(value AS STRING)")

formattedStream = kafkaStream.select(from_csv(kafkaStream.value, kafkaDataSchema))
I am getting the below error:
Traceback (most recent call last):
File "main.py", line 43, in <module>
formattedStream = KafkaSource.readData(spark)
File "src.zip/src/main/sources/KafkaSource.py", line 31, in readData
File "src.zip/src/main/sources/KafkaSource.py", line 36, in formatKafkaData
File "/spark-3.1.1-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/functions.py", line 4082, in from_csv
TypeError: schema argument should be a column or string
How can I solve the issue?
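One hedged sketch of a possible fix: as the error says, PySpark's from_csv takes the schema as a string (DDL) or a Column, not a StructType, so spelling the schema out as DDL avoids the TypeError. The column names and types below are taken from the question's kafkaDataSchema, and kafkaStream is reused from the question's code:

from pyspark.sql.functions import from_csv

# DDL equivalent of the StructType above (LongType -> BIGINT)
kafkaDataSchemaDDL = "sid STRING, timestamp BIGINT, sensor STRING, value STRING"

formattedStream = kafkaStream.select(
    from_csv(kafkaStream.value, kafkaDataSchemaDDL).alias("data")
).select("data.*")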

Pyspark - error when saving data into Hive table: "unresolved operator 'InsertIntoTable HiveTableRelation'"

I use the following:
pyspark library, version 2.3.1
python, version 2.7.1
hadoop, version 2.7.3
hive, version 1.2.1000.2.6.5.30-1
spark version 2
My Hive table looks like the following:
CREATE TABLE IF NOT EXISTS my_database.my_table
(
    division STRING COMMENT 'Sample column'
)
I want to save data into Hive using pyspark. I use the following code:
from pyspark.sql import SparkSession, HiveContext

spark_session = SparkSession.builder.getOrCreate()
hive_context = HiveContext(spark_session.sparkContext)
hive_table_schema = hive_context.table("my_database.my_table").schema
df_to_save = spark_session.createDataFrame([["a"], ["b"], ["c"]], schema=hive_table_schema)
df_to_save.write.mode("append").insertInto("my_database.my_table")
But the following error occurs:
Traceback (most recent call last):
File "/home/my_user/mantis service_quality_check__global/scripts/row_counts_preprocess.py", line 147, in <module> df_to_save.write.mode("append").insertInto(hive_table_row_counts_str)
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 716, in insertInto
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"unresolved operator 'InsertIntoTable HiveTableRelation `my_database`.`my_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [division#14], false, false;;\n'InsertIntoTable HiveTableRelation `my_database`.`my_table`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [division#14], false, false\n+- LogicalRDD [division#2], false\n"
Is there someone who can help with this, please? I have been stuck on it for a few days.
I found the issue. The SparkSession has to support Hive: the method enableHiveSupport() has to be called when the Spark session is created.
The creation of the Spark session then looks like the following:
spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()

How to load Impala table directly to Spark using JDBC?

I am trying to write a spark job with Python that would open a jdbc connection with Impala and load a VIEW directly from Impala into a Dataframe. This question is pretty close but in scala: Calling JDBC to impala/hive from within a spark job and creating a table
How do I do this? There are plenty of examples for other data sources such as MySQL, PostgreSQL, etc., but I haven't seen one for Impala + Python + Kerberos. An example would be of great help. Thank you!
Tried this with information from the web but it didn't work.
SPARK Notebook
#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'
# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH
# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30
# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false
pyspark --master yarn-client \
--driver-memory 4G \
--executor-memory 2G \
# --num-executors 10 \
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
--driver-class-path $JDBC_PATH/*.jar
Python Code
properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver",
    "AuthMech": "1",
    # "KrbRealm": "EXAMPLE.COM",
    # "KrbHostFQDN": "impala.example.com",
    "KrbServiceName": "impala"
}

# imp_env is the hostname of the db, works with other impala queries ran inside python
url = "jdbc:impala:imp_env;auth=noSasl"

db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)
I received this error message (Full Error Log):
Py4JJavaError: An error occurred while calling o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver
You can use
--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')
instead of
--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
or for another approach please see my answer
The 1st approach is to use spark-submit on the impala_jdbc_connection.py script below, like: spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --class com.cloudera.impala.jdbc41.Driver impala_jdbc_connection.py
impala_jdbc_connection.py
from pyspark.sql import SparkSession

properties = {
    "drivers": "com.cloudera.impala.jdbc41.Driver"
}

# initialize the spark session
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "jar-packages-list")
    .config("spark.sql.warehouse.dir", "hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

db_df = spark.read.jdbc(url='jdbc:impala://host_ip_address:21050/database_name', table='table_name', properties=properties)
db_df.show()
The 2nd approach is not a direct import from Impala to Spark but rather a conversion of the results to a Spark DataFrame.
pip install impyla (source: https://github.com/cloudera/impyla)
Connect to Impala, fetch the results from the Impala database, and convert the result to a Spark DataFrame:
from impala.dbapi import connect

conn = connect(host='IP_ADDRESS_OF_HOST', port=21050)
cursor = conn.cursor()
cursor.execute('select * from database.table')
res = cursor.fetchall()  # convert res to spark dataframe
for data in res:
    print(data)
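To complete the "convert res to spark dataframe" step hinted at in the comment, a small hedged sketch (it assumes an existing spark session and a non-empty result set; the column names are taken from the DB-API cursor description):

# Hedged sketch: build a Spark DataFrame from the rows fetched via impyla.
columns = [desc[0] for desc in cursor.description]
spark_df = spark.createDataFrame(res, schema=columns)
spark_df.show()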
I did this in an Azure Databricks notebook after setting up the jar in the cluster libraries. I generally followed the previous post, except that the 'D' is upper case in the Driver config key. Worked great.
properties = {
    "Driver": "com.cloudera.impala.jdbc41.Driver"
}

db_df = spark.read.jdbc(url='jdbc:impala://hostname.domain.net:21050/dbname;AuthMech=3;UID=xxxx;PWD=xxxx', table='product', properties=properties)
db_df.show()
This works for me:
spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)
