sparklyr write data to hdfs or hive

I tried using sparklyr to write data to HDFS or Hive, but was unable to find a way. Is it even possible to write an R dataframe to HDFS or Hive using sparklyr? Please note that my R and Hadoop installations run on two different servers, so I need a way to write to a remote HDFS from R.

Writing a Spark table to Hive using sparklyr:
# Copy the R data frame to Spark and register it as a temporary view
iris_spark_table <- copy_to(sc, iris, name = "iris_spark_table", overwrite = TRUE)
# Persist it as a Hive table
DBI::dbGetQuery(sc, "CREATE TABLE iris_hive AS SELECT * FROM iris_spark_table")

As of recent sparklyr versions you can use spark_write_table. Pass a name in the format database.table_name to target a specific database:
iris_spark_table <- copy_to(sc, iris, overwrite = TRUE)
spark_write_table(
  iris_spark_table,
  name = 'my_database.iris_hive',
  mode = 'overwrite'
)
Also see this SO post where I got some input on more options.

You can use sdf_copy_to to copy a dataframe into Spark, let's say as tempTable. Then use DBI::dbGetQuery(sc, "INSERT INTO TABLE MyHiveTable SELECT * FROM tempTable") to insert the dataframe records into a Hive table.

Related

how to run sql query on delta table

I have a problem with the Delta Lake docs. I know that I can query a Delta table with Presto, Hive, Spark SQL and other tools, but the Delta documentation mentions that "You can load a Delta table as a DataFrame by specifying a table name or a path".
That isn't clear to me. How can I run a SQL query like that?
To read data from Delta Lake tables, it is possible to use the Java API or Python without Apache Spark. See details at:
https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html
Here is how to use it with Pandas:
pip3 install deltalake
python3
from deltalake import DeltaTable
table_path = "/opt/data/delta/my-table" # whatever table name and object store
# now using Pandas
df = DeltaTable(table_path).to_pandas()
df
Use the spark.sql() function
spark.sql("select * from delta.`hdfs://192.168.2.131:9000/Delta_Table/test001`").show()

hive, ask for files within specific range

Suppose on HDFS I have files with the following names: data1-2018-01-01.txt, data1-2018-01-02.txt, data1-2018-01-03.txt, data1-2018-01-04.txt, data1-2018-01-06.txt
Now I want to query files based on date:
select * from mytable where date > '2018-01-03' and date < '2018-01-06';
And my question: is it possible to create an external table over just the files that satisfy my query? Or do you have any workaround?
I know I could use partitions, but they would require me to load the data manually whenever a new data set arrives.
Put those files into a directory and create a new table on top of it.
Hive also has the INPUT__FILE__NAME virtual column, which you can use for filtering:
where INPUT__FILE__NAME like '%2018-01-03%'
It is also possible to use substr or regexp_extract to get the date from the filename, then use IN or >, < to filter on it.
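For example, a query that pulls the date out of the filename and applies the range from the question could look like the sketch below. The connection details and table name are placeholders, pyhive is just one of several ways to submit HiveQL from a script, and note that backslashes must be doubled inside Hive string literals:

from pyhive import hive  # one of several HiveServer2 clients; assumes HiveServer2 is reachable

# Extract the yyyy-MM-dd part of the filename and compare it as a string.
query = r"""
SELECT *
FROM mytable
WHERE regexp_extract(INPUT__FILE__NAME, 'data1-(\\d{4}-\\d{2}-\\d{2})\\.txt', 1) > '2018-01-03'
  AND regexp_extract(INPUT__FILE__NAME, 'data1-(\\d{4}-\\d{2}-\\d{2})\\.txt', 1) < '2018-01-06'
"""

conn = hive.Connection(host="hive-server", port=10000)
cursor = conn.cursor()
cursor.execute(query)
rows = cursor.fetchall()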

How to execute select query on oracle database using pyspark?

I have written a program using pyspark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:
sqlContext.read.format("jdbc")
.option("url","jdbc:oracle:thin:user/password#dbserver:port/dbname")
.option("dbtable","SCHEMA.TABLE")
.option("driver","oracle.jdbc.driver.OracleDriver")
.load().show()
Now I do not want to load the entire table. I want to load only selected records. Can I specify a select query as part of this command? If yes, how?
Note: I could load the full DataFrame and run a select query on top of it, but I do not want to do that. Please help!
You can use a subquery in the dbtable option:
.option("dbtable", "(SELECT * FROM tableName WHERE x = 1) tmp")
Here is a similar question, but about MySQL.
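Putting the subquery into the original read call, a sketch might look like this (the column names and connection details are placeholders):

(sqlContext.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
    .option("dbtable", "(SELECT col_a, col_b FROM SCHEMA.TABLE WHERE col_c = 1) tmp")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load()
    .show())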
In general, the optimizer SHOULD be able to push down any relevant select and where elements, so if you do df.select("a", "b", "c").where("d < 10"), this should generally be pushed down to Oracle. You can check by calling df.explain(True) on the final dataframe.
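A sketch of checking the pushdown (connection details are placeholders; the columns a, b, c, d come from the example above):

df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
      .option("dbtable", "SCHEMA.TABLE")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())

# Only the selected columns and the filter should reach Oracle; look for the
# pushed filters in the printed physical plan (shown as PushedFilters in recent Spark versions).
df.select("a", "b", "c").where("d < 10").explain(True)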

Spark SQL: how to cache sql query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()?
For example:
output = sqlContext.sql("SELECT * From people")
We can use output.cache() to cache the result, but then we cannot use a SQL query to work with it.
So what I want to ask is: is there anything like sqlContext.cacheTable() to cache the result?
You should use sqlContext.cacheTable("table_name") in order to cache it, or alternatively use the CACHE TABLE table_name SQL query.
Here's an example. I've got this file on HDFS:
1|Alex|alex@gmail.com
2|Paul|paul@example.com
3|John|john@yahoo.com
Then the code in PySpark:
from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')
Now we have a table and can query it:
sqlContext.sql('select * from people').collect()
To persist it, we have 3 options:
# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()
The 1st and 2nd options are preferred, as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in a row-oriented fashion.
So going back to your question, here's one possible solution:
output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()
The following is most similar to using .cache for RDDs and is helpful in Zeppelin or similar SQL-heavy environments:
CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query
Then you get cached reads both for subsequent usages of interesting_query and for all queries on CACHED_TABLE.
This answer is based on the accepted answer, but the power of using AS is what really makes the call useful in more constrained SQL-only environments, where you cannot .collect() or do RDD/DataFrame operations at all.

How to use output of hive query in another hive query?

I want to use the output of one Hive query in another query.
set iCount = 12;
Setting a constant value like this is fine, but I don't know how to set the variable dynamically, as given below:
set iCount = select count(distinct colName) from table;
This just assigns the query text as a string, whatever query is passed. Instead of the query itself, I want the result of the query.
You can't do it that way. You could try using Oozie to automate the Hive query and the Java process you want to execute, storing the output of the Hive query in a directory that the Java program reads from.
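If Oozie is not an option, another common workaround (a sketch, not part of the answer above; the table and column names are placeholders) is to run the first query from a driver script, capture the value, and pass it back in as a Hive variable, assuming the hive CLI is on the PATH:

import subprocess

# Run the first query in silent mode and capture its single-value result.
count = subprocess.check_output(
    ["hive", "-S", "-e", "SELECT count(DISTINCT colName) FROM some_table"]
).decode().strip()

# Pass the captured value back in as a Hive variable for the second query.
subprocess.run(
    ["hive", "--hivevar", "iCount=" + count,
     "-e", "SELECT * FROM other_table WHERE some_col <= ${hivevar:iCount}"],
    check=True,
)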
