Spark SQL: how to cache a SQL query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()?
For example:
output = sqlContext.sql("SELECT * From people")
We can use output.cache() to cache the result, but then we cannot use SQL queries to work with it.
So I want to ask: is there anything like sqlContext.cacheTable() to cache the result?

You should use sqlContext.cacheTable("table_name") to cache it, or alternatively use the CACHE TABLE table_name SQL query.
Here's an example. I've got this file on HDFS:
1|Alex|alex@gmail.com
2|Paul|paul@example.com
3|John|john@yahoo.com
Then the code in PySpark:
from pyspark.sql import Row

people = sc.textFile('hdfs://sparkdemo:8020/people.txt')
people_t = people.map(lambda x: x.split('|')).map(lambda x: Row(id=x[0], name=x[1], email=x[2]))
tbl = sqlContext.inferSchema(people_t)
tbl.registerTempTable('people')
Now we have a table and can query it:
sqlContext.sql('select * from people').collect()
To persist it, we have 3 options:
# 1st - using SQL
sqlContext.sql('CACHE TABLE people').collect()
# 2nd - using SQLContext
sqlContext.cacheTable('people')
sqlContext.sql('select count(*) from people').collect()
# 3rd - using Spark cache underlying RDD
tbl.cache()
sqlContext.sql('select count(*) from people').collect()
The 1st and 2nd options are preferred as they cache the data in an optimized in-memory columnar format, while the 3rd caches it just like any other RDD, in a row-oriented fashion.
So going back to your question, here's one possible solution:
output = sqlContext.sql("SELECT * From people")
output.registerTempTable('people2')
sqlContext.cacheTable('people2')
sqlContext.sql("SELECT count(*) From people2").collect()

The following is most like using .cache for RDDs and is helpful in Zeppelin or similar SQL-heavy environments:
CACHE TABLE CACHED_TABLE AS
SELECT $interesting_query
Then you get cached reads both for subsequent uses of interesting_query and for all queries on CACHED_TABLE.
This answer is based on the accepted answer, but the power of using AS is what really makes it useful in more constrained SQL-only environments, where you cannot .collect() or perform RDD/DataFrame operations at all.
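As a concrete illustration, here is a minimal sketch of the pattern (the people table and the email filter are hypothetical stand-ins for $interesting_query):
-- people is assumed to be an already registered table
CACHE TABLE people_with_email AS
SELECT id, name, email
FROM people
WHERE email IS NOT NULL;
-- subsequent queries against the cached table read from the in-memory cache
SELECT count(*) FROM people_with_email;
SELECT name FROM people_with_email WHERE id = '1';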

Related

How to execute a select query on an Oracle database using pyspark?

I have written a program using pyspark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:
(sqlContext.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
    .option("dbtable", "SCHEMA.TABLE")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .load().show())
Now I do not want to load the entire table. I want to load only selected records. Can I specify a select query as part of this command? If yes, how?
Note: I can use a DataFrame and execute a select query on top of it, but I do not want to do that.
You can use a subquery in the dbtable option:
.option("dbtable", "(SELECT * FROM tableName WHERE x = 1) tmp")
Here is a similar question, but about MySQL.
In general, the optimizer SHOULD be able to push down any relevant select and where elements, so if you now do df.select("a","b","c").where("d<10"), this should in general be pushed down to Oracle. You can check this by calling df.explain(true) on the final DataFrame.

How much data is considered "too large" for a Hive MAPJOIN job?

EDIT: added more file size details, and some other session information.
I have a seemingly straightforward Hive JOIN query that surprisingly requires several hours to run.
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key
WHERE a.keyPart BETWEEN b.startKeyPart AND b.endKeyPart;
I'm trying to determine if the execution time is normal for my dataset and AWS hardware selection, or if I am simply trying to JOIN too much data.
Table A: ~2.2 million rows, 12MB compressed, 81MB raw, 4 files.
Table B: ~245 thousand rows, 6.7MB compressed, 14MB raw, one file.
AWS: emr-4.3.0, running on about 5 m3.2xlarge EC2 instances.
Records from A always match one or more records in B, so logically I see that at most ~500 billion rows are generated before they are pruned by the WHERE clause.
4 mappers are allocated for the job, which completes in 6 hours. Is this normal for this type of query and configuration? If not, what should I do to improve it?
I've partitioned B on the JOIN key, which yields 5 partitions, but haven't noticed a significant improvement.
Also, the logs show that the Hive optimizer starts a local map join task, presumably to cache or stream the smaller table:
2016-02-07 02:14:13 Starting to launch local task to process map join; maximum memory = 932184064
2016-02-07 02:14:16 Dump the side-table for tag: 1 with group count: 5 into file: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2016-02-07 02:14:17 Uploaded 1 File to: file:/mnt/var/lib/hive/tmp/local-hadoop/hive_2016-02-07_02-14-08_435_7052168836302267808-1/-local-10003/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (12059634 bytes)
2016-02-07 02:14:17 End of local task; Time Taken: 3.71 sec.
What is causing this job to run slowly? The data set doesn't appear too large, and the small table is well under the 25MB "small-table" limit that triggers the disabling of the MAPJOIN optimization.
A dump of the EXPLAIN output is copied on PasteBin for reference.
My session enables compression for output and intermediate storage. Could this be the culprit?
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
SET io.seqfile.compression.type=BLOCK;
My solution to this problem is to express the JOIN predicate entirely within the JOIN ON clause, as this is the most efficient way to execute a JOIN in Hive. As for why the original query was slow, I believe the mappers simply need that much time to scan the intermediate data set row by row, 100+ billion times.
Because Hive only supports equality expressions in the JOIN ON clause and rejects function calls that use both table aliases as parameters, there is no way to rewrite the original query's BETWEEN clause as an algebraic expression. For example, the following expression is illegal:
-- Only handles exclusive BETWEEN
JOIN b ON a.key = b.key
AND sign(a.keyPart - b.startKeyPart) = 1.0 -- keyPart > startKeyPart
AND sign(a.keyPart - b.endKeyPart) = -1.0 -- keyPart < endKeyPart
I ultimately modified my source data to include every value between startKeyPart and endKeyPart in a Hive ARRAY<BIGINT> data type:
CREATE TABLE LookupTable (
  key BIGINT,
  startKeyPart BIGINT,
  endKeyPart BIGINT,
  keyParts ARRAY<BIGINT>
);
Alternatively, I could have generated this value inline within my queries using a custom Java method; however, the LongStream.rangeClosed() method is only available in Java 8, which is not part of Hive 1.0.0 in AWS emr-4.3.0.
Now that I have the entire key space in an array, I can transform the array to a table using LATERAL VIEW and explode(), rewriting the JOIN as follows.
WITH b AS
(
SELECT key, keyPart, value
FROM LookupTable
LATERAL VIEW explode(keyParts) keyPartsTable AS keyPart
)
SELECT a.value1, a.value2, b.value
FROM a
JOIN b ON a.key = b.key AND a.keyPart = b.keyPart;
The end result is that the above query takes approximately 3 minutes to complete, when compared with the original 6 hours on the same hardware configuration.

Performance Issue with Oracle Merge Statements with more than 2 Million records

I am executing the MERGE statement below for an insert/update (upsert) operation.
It works fine for 1 to 2 million records, but for more than 4 to 5 billion records it takes 6 to 7 hours to complete.
Can anyone suggest an alternative or some performance tips for the MERGE statement?
merge into employee_payment ep
using (
select
p.pay_id vista_payroll_id,
p.pay_date pay_dte,
c.client_id client_id,
c.company_id company_id,
case p.uni_ni when 0 then null else u.unit_id end unit_id,
p.pad_seq pay_dist_seq_nbr,
ph.payroll_header_id payroll_header_id,
p.pad_id vista_paydist_id,
p.pad_beg_payperiod pay_prd_beg_dt,
p.pad_end_payperiod pay_prd_end_d
from
stg_paydist p
inner join company c on c.vista_company_id = p.emp_ni
inner join payroll_header ph on ph.vista_payroll_id = p.pay_id
left outer join unit u on u.vista_unit_id = p.uni_ni
where ph.deleted = '0'
) ps
on (ps.vista_paydist_id = ep.vista_paydist_id)
when matched then
update
set ep.vista_payroll_id = ps.vista_payroll_id,
ep.pay_dte = ps.pay_dte,
ep.client_id = ps.client_id,
ep.company_id = ps.company_id,
ep.unit_id = ps.unit_id,
ep.pay_dist_seq_nbr = ps.pay_dist_seq_nbr,
ep.payroll_header_id = ps.payroll_header_id
when not matched then
insert (
ep.employee_payment_id,
ep.vista_payroll_id,
ep.pay_dte,
ep.client_id,
ep.company_id,
ep.unit_id,
ep.pay_dist_seq_nbr,
ep.payroll_header_id,
ep.vista_paydist_id
) values (
seq_employee_payments.nextval,
ps.vista_payroll_id,
ps.pay_dte,
ps.client_id,
ps.company_id,
ps.unit_id,
ps.pay_dist_seq_nbr,
ps.payroll_header_id,
ps.vista_paydist_id
) log errors into errorlog (v_batch || 'EMPLOYEE_PAYMENT') reject limit unlimited;
Try using the Oracle hints:
MERGE /*+ append leading(PS) use_nl(PS EP) parallel (12) */
Also try using hints to optimize the inner using query.
Processing lots of data takes lots of time...
Here are some things that may help you (assuming there is not a problem with a bad execution plan):
Adding a where-clause in the UPDATE part to only update records when the values are actually different. If you are merging the same data over and over again and only a small subset of the data is actually modified, this will improve performance (see the sketch below).
If you indeed are processing the same data over and over again, investigate whether you can add some modification flag/date to only process new records since last time.
Depending on the kind of environment and when/who is updating your source tables, investigate whether a truncate-insert approach is beneficial. Remember to set the indexes to unusable beforehand.
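Here is a minimal sketch of the first suggestion, using a stripped-down version of the MERGE from the question (the column selection is illustrative only):
-- only update rows whose values actually differ
merge into employee_payment ep
using (
  select p.pad_id vista_paydist_id, p.pay_date pay_dte
  from stg_paydist p
) ps
on (ps.vista_paydist_id = ep.vista_paydist_id)
when matched then
  update set ep.pay_dte = ps.pay_dte
  -- add NVL-based comparisons if pay_dte can be NULL
  where ep.pay_dte <> ps.pay_dte;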
I think your best bet here is to exploit the patterns in your data. This is something Oracle does not know about, so you may have to get creative.
I was working on a similar problem, and a good solution I found was to break the query up.
The primary reason big table merges are a bad idea is the in-memory storage of the result of the using query. The PGA fills up pretty quickly, so the database starts using the temporary tablespace for sort operations and joins, and the temp tablespace, being on disk, is excruciatingly slow. Excessive temp tablespace usage can easily be avoided by splitting the query into two queries.
So the query below
merge into emp e
using (
select a,b,c,d from (/* big query here */)
) ec
on /*conditions*/
when matched then
/* rest of merge logic */
can become
create table temp_big_query as select a,b,c,d from (/* big query here */);
merge into emp e
using (
select a,b,c,d from temp_big_query
) ec
on /*conditions*/
when matched then
/* rest of merge logic */
If the using query also has CTEs and subqueries, try breaking that query up to use more temp tables like the one shown above. Also avoid parallel hints, because they mostly tend to slow the query down unless the query itself has something that can be done in parallel; try using indexes instead as much as possible. Parallel should be the last option for optimization.
I know some references are missing; please feel free to comment and add references or point out mistakes in my answer.

Write a nested select statement with a where clause in Hive

I have a requirement to do a nested select within a where clause in a Hive query. A sample code snippet is as follows:
select *
from TableA
where TA_timestamp > (select timestmp from TableB where id="hourDim")
Is this possible, or am I doing something wrong here? I am getting an error while running the above script.
To further elaborate on what I am trying to do: there is a Cassandra keyspace to which I publish statistics with a timestamp. Periodically (hourly, for example) these stats are summarized using Hive; once summarized, the data is stored separately with the corresponding hour. So when the query runs for the second time (and on consecutive runs) it should only run on the new data (i.e. timestamp > previous_execution_timestamp). I am trying to do that by storing the latest executed timestamp in a separate Hive table, and then using that value to filter out the raw stats.
Can this be achieved using Hive?
Subqueries inside a WHERE clause are not supported in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
However, often you can use a JOIN statement instead to get to the same result:
https://karmasphere.com/hive-queries-on-table-data#join_syntax
For example, this query:
SELECT a.KEY, a.value
FROM a
WHERE a.KEY IN
(SELECT b.KEY FROM B);
can be rewritten to:
SELECT a.KEY, a.val
FROM a LEFT SEMI JOIN b ON (a.KEY = b.KEY)
Looking at the business requirements underlying your question, it occurs to me that you might get more efficient results by partitioning your Hive table by hour. If the data can be written using this factor as the partition key, then your query to update the summary will be much faster and require fewer resources.
Partitions can get out of hand when they reach the scale of millions, but this seems like a case that will not tease that limitation.
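A minimal sketch of that idea, with hypothetical table and column names; each run then only scans the partition for the hour being summarized:
-- hypothetical raw stats table, partitioned by hour
CREATE TABLE raw_stats (
  metric STRING,
  val BIGINT
)
PARTITIONED BY (stat_hour STRING);
-- the hourly summarization reads only the relevant partition
SELECT metric, count(*) AS cnt
FROM raw_stats
WHERE stat_hour = '2014-06-01-13'
GROUP BY metric;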
It will work if you put in:
select *
from TableA
where TA_timestamp in (select timestmp from TableB where id="hourDim")
EXPLANATION: >, <, and = need one exact value on the right side, while here we are getting multiple values, which can only be handled with the IN clause.

Turning off an index in Oracle

To optimize SELECT queries, I run them both with and without an index and measure the difference. I run a bunch of different similar queries and try to select different data to make sure that caching doesn't throw off the results. However, on very large tables, indexes take a really long time to create, and I have several different ideas about what indexes would be appropriate.
Is it possible in Oracle (or any other database for that matter) to perform a query but tell the database to not use a certain index when performing the query? Or just turn off the index entirely, but be able to easily switch it back on without having to re-index the entire table? This would make it much easier to test, since I can create all the indexes I'm thinking about all at once, then try my queries using different ones.
Alternatively, is there any better way to go about optimizing queries on large tables and know which indexes would be best to create?
You can set index visibility in 11g:
ALTER INDEX idx1 [ INVISIBLE | VISIBLE ]
This makes it unusable by the optimizer, but Oracle still maintains the index when data is added or removed. This makes it easy to test performance with the index disabled, without having to drop and rebuild the whole index.
See the Oracle docs on index visibility for details.
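During testing you can also go the other way: an invisible index is still available to your own session if you enable a session parameter, so you can compare plans with and without it while other sessions remain unaffected. A sketch (idx1 is the index from above; the table and column names are hypothetical):
ALTER INDEX idx1 INVISIBLE;
-- other sessions now ignore idx1, but this session can still use it
ALTER SESSION SET optimizer_use_invisible_indexes = TRUE;
SELECT * FROM foobar WHERE column1 = 'result';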
You can use the NO_INDEX hint in your queries to ignore specific indexes; see the docs for further details. The SQL Access Advisor is an Oracle utility that will recommend indexing strategies.
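For example, a quick way to compare timings without touching the index itself (a sketch; foobar and idx_foobar_col1 are hypothetical names):
-- with the index available to the optimizer
SELECT * FROM foobar WHERE column1 = 'result';
-- same query, telling the optimizer to ignore one specific index
SELECT /*+ NO_INDEX(foobar idx_foobar_col1) */ *
FROM foobar
WHERE column1 = 'result';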
Well, you can write the query in such a way that it won't use the index (by using an expression instead of a plain column value).
For example:
Select * from foobar where column1 = 'result' --uses index on column1
To avoid using the index for a number or a varchar column:
Select * from foobar where column1 + 0 = 5 -- simple expression to disable the index
Select * from foobar where column1 || '' = 'result' --simple expression to disable the index
Or you can just use NVL to disable the index in the query without worrying about the column's data type
Select * from foobar where nvl(column1,column1) = 'result' --i love this way :D
Similarly, you can use index hints
like /*+ INDEX(E employee_id) */ to use indexes.
P.S. This is all paraphrased from Dan Tow's book SQL Tuning. I started reading it a few days back. :)
