Save results from AWS Athena query in Zeppelin - jdbc

I'm able to successfully execute queries on Athena via my Zeppelin notebook, however I don't understand how to save the result set.
The following code displays a table:
%athena
select * from table_name limit 5;
My goal is to save the results into a pandas dataframe, so I can do future transformations.
I'm able to save as CSV and import it manually, but this does not seem very efficient.
I'm using Zeppelin 0.8.0 and AthenaJDBC42-2.0.2.jar.

I found that the best way is to experiment with Athena and then execute the query with Spark.
So basically:
%spark.pyspark
df = pd.read_sql("select * from table_name limit 5", conn)
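The snippet above assumes a DB-API connection object conn, which the answer doesn't show how to build. One common way to create it (my assumption, not part of the original answer) is the PyAthena driver, run in the same %spark.pyspark paragraph:
import pandas as pd
from pyathena import connect  # pip install pyathena

# Hypothetical staging bucket and region -- replace with your own values
conn = connect(
    s3_staging_dir="s3://your-bucket/athena-query-results/",
    region_name="us-east-1",
)

df = pd.read_sql("select * from table_name limit 5", conn)
print(df.head())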

Related

How to run SQL query on delta table

I have a problem with the Delta Lake docs. I know that I can query a Delta table with Presto, Hive, Spark SQL and other tools, but Delta's documentation says that "You can load a Delta table as a DataFrame by specifying a table name or a path",
and that isn't clear to me. How can I run a SQL query like that?
To read data from Delta Lake tables you can use the Java API or Python without Apache Spark. See details at:
https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html
See how to use with Pandas:
pip3 install deltalake
python3
from deltalake import DeltaTable
table_path = "/opt/data/delta/my-table"  # path to the Delta table (local path or object-store URI)
# now using Pandas
df = DeltaTable(table_path).to_pandas()
df
Use the spark.sql() function
spark.sql("select * from delta.`hdfs://192.168.2.131:9000/Delta_Table/test001`").show()

How to read a table and SQL query from Oracle in Pandas?

I am completely new to Python and pandas. I want to load some tables and SQL queries from Oracle and Teradata into pandas DataFrames and analyse them.
I know we have to create connection strings for Oracle and Teradata in pandas. Can you please suggest them, and also add sample code to read both a table and a SQL query?
Thanks in advance.
I don't have an Oracle server, so I'll take Teradata as an example.
This is not the only way to do it, just one approach.
Make sure you have installed the Teradata ODBC driver. Please refer to the Teradata official website for the steps; I assume you are on Windows (since it is easy to use SQL Assistant to run queries against Teradata, and that is Windows-only). You can check the driver in the ODBC Data Source Administrator.
Install pyodbc with the command pip install pyodbc. Here is the official website.
The connection string is db_conn_str = "DRIVER=Teradata;DBCNAME={url};UID={username};PWD={pwd}"
Get a connection object: conn = pyodbc.connect(db_conn_str)
Read data from a SQL query into a DataFrame: df = pd.read_sql(sql="select * from tb", con=conn)
It is similar for Oracle: you need the driver and the right ODBC connection-string format. I know there is a Python module from Teradata that supports the connection too, but I prefer ODBC as it is more general-purpose. A sketch putting the steps together follows.
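Putting the Teradata steps together, a minimal sketch (assuming the ODBC driver is installed; host, credentials and table names below are hypothetical):
import pandas as pd
import pyodbc  # pip install pyodbc

# Hypothetical Teradata host and credentials -- replace with your own
db_conn_str = "DRIVER=Teradata;DBCNAME=td.example.com;UID=my_user;PWD=my_password"

conn = pyodbc.connect(db_conn_str)
try:
    # Read a whole table ...
    df_table = pd.read_sql("select * from my_schema.my_table", con=conn)
    # ... or the result of an arbitrary SQL query
    df_query = pd.read_sql("select col_a, count(*) as n from my_schema.my_table group by col_a", con=conn)
finally:
    conn.close()

print(df_table.head())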
Here is an Oracle example:
import pandas as pd
import cx_Oracle  # pip install cx_Oracle
from sqlalchemy import create_engine

engine = create_engine('oracle+cx_oracle://scott:tiger@host:1521/?service_name=hr')
df = pd.read_sql('select * from table_name', engine)
One way to query an Oracle DB is with a function like this one:
import pandas as pd
import cx_Oracle

def query(sql: str) -> pd.DataFrame:
    try:
        # username, password and database are assumed to be defined elsewhere
        with cx_Oracle.connect(username, password, database, encoding='UTF-8') as connection:
            dataframe = pd.read_sql(sql, con=connection)
            return dataframe
    except cx_Oracle.Error as error:
        print(error)
    finally:
        print("Fetch end")
Here, sql corresponds to the query you want to run. Since it's a string, it also supports line breaks in case you are reading the query from a .sql file, e.g.:
"SELECT * FROM TABLE\nWHERE <condition>\nGROUP BY <COL_NAME>"
or anything you need; it could also be an f-string in case you are using variables.
This function returns a pandas DataFrame with the results of the SQL string, and it keeps the column names on the DataFrame.
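A few hedged usage sketches of the query() function above (the table and file names are hypothetical):
# Inline statement
df = query("SELECT * FROM employees WHERE salary > 50000")

# Read the statement from a .sql file
with open("report.sql") as f:
    df = query(f.read())

# Build it as an f-string when you need variables
min_salary = 50000
df = query(f"SELECT * FROM employees WHERE salary > {min_salary}")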

How to execute a select query on an Oracle database using PySpark?

I have written a program using PySpark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:
sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load().show()
Now I do not want to load the entire table; I want to load only selected records. Can I specify a select query as part of this command? If yes, how?
Note: I can load the DataFrame and execute a select query on top of it, but I do not want to do that. Please help!
You can use a subquery in the dbtable option; the filter goes inside the parentheses, and the derived table needs an alias (Oracle does not accept AS for table aliases):
.option("dbtable", "(SELECT * FROM tableName WHERE x = 1) tmp")
Here is a similar question, but about MySQL.
In general, the optimizer SHOULD be able to push down any relevant select and where elements, so if you do df.where("d < 10").select("a", "b", "c"), this should in general be pushed down to Oracle. You can check it by calling df.explain(True) on the final DataFrame.
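A short PySpark sketch of both approaches (connection details and column names are hypothetical), including the explain(True) check mentioned above:
# Approach 1: push the filter into the dbtable subquery
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
      .option("dbtable", "(SELECT a, b, c, d FROM SCHEMA.TABLE WHERE x = 1) tmp")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

# Approach 2: load the table and rely on predicate pushdown
df2 = (sqlContext.read.format("jdbc")
       .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
       .option("dbtable", "SCHEMA.TABLE")
       .option("driver", "oracle.jdbc.driver.OracleDriver")
       .load()
       .where("d < 10")
       .select("a", "b", "c"))

# Inspect the physical plan to confirm the filter was pushed down to Oracle
df2.explain(True)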

Spark SQL: how to cache sql query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()?
For example:
output = sqlContext.sql("SELECT * From people")
We can use output.cache() to cache the result, but then we cannot use a SQL query to work with it.
So I want to ask: is there anything like sqlContext.cacheTable() to cache the result?
You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use CACHE TABLE table_name SQL query.
Here's an example. I've got this file on HDFS:
1|Alex|alex@gmail.com
2|Paul|paul@example.com
3|John|john@yahoo.com
Then the code in PySpark:
from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')
Now we have a table and can query it:
sqlContext.sql('select * from people').collect()
To persist it, we have 3 options:
# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()
The 1st and 2nd options are preferred, as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in a row-oriented fashion.
So going back to your question, here's one possible solution:
output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
The following is most like using .cache for RDDs and is helpful in Zeppelin or similar SQL-heavy environments:
CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query
Then you get cached reads both for subsequent usages of interesting_query and for all queries on CACHED_TABLE.
This answer is based on the accepted answer, but the power of using AS is what really makes the call useful in more constrained SQL-only environments, where you cannot .collect() or do RDD/DataFrame operations in any way.
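For completeness, a sketch of how the same CACHE TABLE ... AS SELECT pattern looks when driven from Python rather than a pure SQL paragraph (the table name and filter are hypothetical):
# Materialize and cache the interesting query under a new table name
sqlContext.sql("""
    CACHE TABLE cached_people AS
    SELECT * FROM people WHERE email LIKE '%@gmail.com'
""")

# Subsequent queries read from the cached, columnar result
sqlContext.sql("SELECT count(*) FROM cached_people").show()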

sqoop2 import very large postgreSQL table failed

I am trying to use a Sqoop transfer on CDH5 to import a large PostgreSQL table into HDFS. The whole table is about 15 GB.
First, I tried to import it using just the basic information, by entering the schema and table name, but it didn't work; I always get "GC overhead limit exceeded". I tried raising the JVM heap size in the Cloudera Manager configuration for YARN and Sqoop to the maximum (4 GB), but that didn't help.
Then I tried to use a Sqoop transfer SQL statement to transfer part of the table, adding the following SQL statement in the field:
select * from mytable where id>1000000 and id<2000000 ${CONDITIONS}
(the partition column is id).
The statement failed; in fact, any statement with my own "where" condition gives the error: "GENERIC_JDBC_CONNECTOR_0002:Unable to execute the SQL statement"
I also tried the boundary query: "select min(id), 1000000 from mytable" worked, but "select 1000000, 2000000 from mytable" (to select data further ahead) caused the Sqoop server to crash and go down.
Could someone help? How do I add a where condition, or how do I use the boundary query? I have searched in many places and haven't found any good documentation on how to write SQL statements with Sqoop2. Also, is it possible to use direct mode with Sqoop2?
Thanks
