How to run a SQL query on a Delta table - Hadoop

I have a problem with the Delta Lake docs. I know that I can query a Delta table with Presto, Hive, Spark SQL and other tools, but Delta's documentation says that "You can load a Delta table as a DataFrame by specifying a table name or a path",
and that isn't clear to me. How can I run a SQL query like that?

To read data from Delta Lake tables you can use the Java API or Python, without Apache Spark. See the details at:
https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html
Here is how to use it with pandas:
pip3 install deltalake
python3
from deltalake import DeltaTable
table_path = "/opt/data/delta/my-table" # whatever table name and object store
# now using Pandas
df = DeltaTable(table_path).to_pandas()
df

Use the spark.sql() function
spark.sql("select * from delta.`hdfs://192.168.2.131:9000/Delta_Table/test001`").show()

Related

Save results from AWS Athena query in Zeppelin

I'm able to successfully execute queries on Athena via my Zeppelin notebook, but I don't understand how to save the result set.
The following code displays a table
%athena
select * from table_name limit 5;
My goal is to save the results into a pandas dataframe, so I can do future transformations.
I'm able to save as CSV and import it manually, but this does not seem very efficient.
I'm using Zeppelin 0.8.0, and AthenaJDBC42-2.0.2.jar
I found that the best way is to experiment with Athena and execute the queries with Spark.
So basically:
%spark.pyspark
import pandas as pd
# conn is a connection to Athena created elsewhere (e.g. via the Athena JDBC driver or pyathena)
df = pd.read_sql("select * from table_name limit 5", conn)
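A minimal sketch of building that connection with the pyathena package (the bucket, region and table name are placeholders; AWS credentials are taken from the environment):

from pyathena import connect
import pandas as pd

# the staging directory is where Athena writes query results
conn = connect(s3_staging_dir="s3://my-athena-results-bucket/output/",
               region_name="us-east-1")
df = pd.read_sql("select * from table_name limit 5", conn)
print(df.head())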

How to read a table and a SQL query from Oracle in pandas?

I am completely new to Python and pandas. I want to load some tables and SQL queries from Oracle and Teradata into pandas DataFrames and analyse them.
I know we have to create connection strings to Oracle and Teradata in pandas. Can you please suggest them and also add sample code to read both a table and a SQL query?
Thanks in advance.
I don't have an Oracle server, so I'll take Teradata as an example.
This is not the only way to do it, just one approach.
Make sure you have installed the Teradata ODBC driver. Please refer to the Teradata official website for the steps; I assume you use Windows (since it is easy to use SQL Assistant to run queries against Teradata, and that is Windows-only). You can check the driver in the ODBC Data Source Administrator.
Install pyodbc with the command pip install pyodbc. Here is the official website.
The connection string is db_conn_str = "DRIVER=Teradata;DBCNAME={url};UID={username};PWD={pwd}"
Get a connection object: conn = pyodbc.connect(db_conn_str)
Read data from a SQL query into a DataFrame: df = pd.read_sql(sql="select * from tb", con=conn)
It is similar for Oracle: you need the driver and the format of the ODBC connection string. I know there is a Python module from Teradata which supports the connection too, but I prefer ODBC as it is more general purpose. A minimal end-to-end sketch of these steps follows.
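Putting the steps above together (the host, credentials and table name are placeholders):

import pyodbc
import pandas as pd

# DSN-less ODBC connection string for the Teradata driver
db_conn_str = "DRIVER=Teradata;DBCNAME={url};UID={username};PWD={pwd}".format(
    url="my-teradata-host", username="my_user", pwd="my_password")

conn = pyodbc.connect(db_conn_str)
df = pd.read_sql(sql="select * from tb", con=conn)
print(df.head())
conn.close()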
Here is an Oracle example:
import pandas as pd
import cx_Oracle  # pip install cx_Oracle
from sqlalchemy import create_engine
engine = create_engine('oracle+cx_oracle://scott:tiger@host:1521/?service_name=hr')
df = pd.read_sql('select * from table_name', engine)
One way to query an Oracle DB is with a function like this one:
import pandas as pd
import cx_Oracle
def query(sql: str) -> pd.DataFrame:
    # username, password and database are assumed to be defined elsewhere
    try:
        with cx_Oracle.connect(username, password, database, encoding='UTF-8') as connection:
            dataframe = pd.read_sql(sql, con=connection)
            return dataframe
    except cx_Oracle.Error as error:
        print(error)
    finally:
        print("Fetch end")
Here, sql corresponds to the query you want to run. Since it's a string it also supports line breaks, in case you are reading the query from a .sql file,
e.g.:
"SELECT * FROM TABLE\nWHERE <condition>\nGROUP BY <COL_NAME>"
Or anything you need... it could also be an f-string in case you are using variables.
This function returns a pandas DataFrame with the results of the SQL string you pass in,
and it keeps the column names on the DataFrame.
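For example (the table and column names are made up; username, password and database must be defined before the call):

df = query("SELECT ename, sal FROM emp WHERE deptno = 10")
print(df.head())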

How to execute a select query on an Oracle database using PySpark?

I have written a program using PySpark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:
sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load().show()
Now I do not want to load the entire table; I want to load only selected records. Can I specify a select query as part of this command? If yes, how?
Note: I can use the DataFrame and execute a select query on top of it, but I do not want to do that. Please help!!
You can use a subquery in the dbtable option:
.option("dbtable", "(SELECT * FROM tableName WHERE x = 1) tmp")
Here is a similar question, but about MySQL.
In general, the optimizer SHOULD be able to push down any relevant select and where elements, so if you do df.where("d < 10").select("a", "b", "c") then in general this should be pushed down to Oracle. You can check it by calling df.explain(True) on the final DataFrame.
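Putting it together, a sketch with the same placeholder connection details as above (the column names a, b, c, d and the x = 1 filter are made up):

# option 1: push the filter into the dbtable subquery
df = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "(SELECT * FROM SCHEMA.TABLE WHERE x = 1) tmp") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

# option 2: rely on predicate/projection pushdown from DataFrame operations
df2 = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname") \
    .option("dbtable", "SCHEMA.TABLE") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load() \
    .where("d < 10").select("a", "b", "c")

# the physical plan should show the filter and columns pushed into the JDBC scan
df2.explain(True)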

Spark SQL: how to cache sql query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()?
For example:
output = sqlContext.sql("SELECT * FROM people")
We can use output.cache() to cache the result, but then we cannot use a SQL query to deal with it.
So I want to ask: is there anything like sqlContext.cacheTable() to cache the result?
You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use CACHE TABLE table_name SQL query.
Here's an example. I've got this file on HDFS:
1|Alex|alex@gmail.com
2|Paul|paul@example.com
3|John|john@yahoo.com
Then the code in PySpark:
from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')
Now we have a table and can query it:
sqlContext.sql('select * from people').collect()
To persist it, we have 3 options:
# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()
The 1st and 2nd options are preferred, as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in a row-oriented fashion.
So going back to your question, here's one possible solution:
output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
The following is most like using .cache for RDDs, and is helpful in Zeppelin or similar SQL-heavy environments:
CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query
Then you get cached reads both for subsequent usages of interesting_query, as well as for all queries on CACHED_TABLE.
This answer is based on the accepted answer, but the power of using AS is what really made the call useful in the more constrained SQL-only environments, where you cannot .collect() or do RDD/DataFrame operations in any way.
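For instance, reusing the people table from the example above (a sketch; the table name and filter are made up):

# cache the result of a query under a new name, then query the cached table
sqlContext.sql("CACHE TABLE people_cached AS SELECT * FROM people WHERE name = 'Alex'")
sqlContext.sql("SELECT count(*) FROM people_cached").show()
sqlContext.sql("SELECT email FROM people_cached").show()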

How can I export contents of an oracle table to a file?

Getting ready to clean up some old tables which are no longer in use, but I would like to be able to archive the contents before removing them from the database.
Is it possible to export the contents of a table to a file? Ideally, one file per table.
You can use Oracle's export tool: exp
Edit:
exp name/pwd@dbname file=filename.dmp tables=tablename rows=y indexes=n triggers=n grants=n
You can easily do it using Python and the cx_Oracle module.
The Python script will extract the data to disk in CSV format.
Here's how you connect to Oracle using Python/cx_Oracle:
import cx_Oracle

constr = 'scott/tiger@localhost:1521/ORCL12'
con = cx_Oracle.connect(constr)
cur = con.cursor()
After fetching the data you can loop through the rows and save them in CSV format:
# chunks(cur), f_out and column_delimiter are assumed to be defined elsewhere
# (a chunked-fetch helper, an open output file and the field separator)
for i, chunk in enumerate(chunks(cur)):
    f_out.write('\n'.join([column_delimiter.join(str(col) for col in row) for row in chunk]))
    f_out.write('\n')
I used this approach when I wrote TableHunter-For-Oracle
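A fuller sketch of that approach (the connection string, table name and output file are placeholders):

import csv
import cx_Oracle

constr = 'scott/tiger@localhost:1521/ORCL12'
con = cx_Oracle.connect(constr)
cur = con.cursor()
cur.execute('SELECT * FROM my_table')

with open('my_table.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow([col[0] for col in cur.description])  # header row from cursor metadata
    while True:
        rows = cur.fetchmany(10000)  # fetch in chunks to keep memory usage bounded
        if not rows:
            break
        writer.writerows(rows)

cur.close()
con.close()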
