Spark JDBC direct path inserts - Oracle

When writing data from Hive to Oracle using the (Py)Spark JDBC connector, I run into problems with the buffer cache on Oracle.
So my question is whether there is a way to bypass the Oracle buffer cache by using direct path inserts (as suggested here: https://renenyffenegger.ch/notes/development/databases/Oracle/architecture/instance/SGA/database-buffer-cache/index).
I was wondering if I can just use the sessionInitStatement option as described in the docs: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
Something along the lines of option("sessionInitStatement", """BEGIN execute immediate 'alter session set "<DIRECT PATH INSERT PARAMETER?>"=true'; END;""").
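Roughly what I have in mind (a minimal, untested sketch; the connection details, table names, and credentials below are placeholders, and the exact ALTER SESSION setting is precisely what I'm unsure about):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-to-oracle")
         .enableHiveSupport()
         .getOrCreate())

# Placeholder Hive source table.
df = spark.table("my_hive_db.my_hive_table")

(df.write
   .format("jdbc")
   .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB")   # placeholder connection details
   .option("dbtable", "target_schema.target_table")            # placeholder target table
   .option("user", "spark_user")                                # placeholder credentials
   .option("password", "***")
   # sessionInitStatement runs once per JDBC connection before data is written;
   # the ALTER SESSION setting for direct path inserts is left as a placeholder.
   .option("sessionInitStatement",
           """BEGIN execute immediate 'alter session set "<DIRECT PATH INSERT PARAMETER?>"=true'; END;""")
   .mode("append")
   .save())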
Another approach I'm wondering about is using Spark SQL to insert into Oracle, as described in this answer ("what's SparkSQL SQL query to write into JDBC table?"), and then specifying the /*+ APPEND */ hint for a direct path insert.
Does anyone have experience with this problem?

Related

Table insert performance bottleneck Amazon Redshift

While inserting records using batch insert (https://tool.oschina.net/uploads/apidocs/Spring-3.1.1/org/springframework/jdbc/core/simple/SimpleJdbcInsert.html#executeBatch(org.springframework.jdbc.core.namedparam.SqlParameterSource[])) into a Redshift table, the Spring framework falls back to one-by-one insertion, which takes more time.
(main) org.springframework.jdbc.support.JdbcUtils: JDBC driver does not support batch updates
Is there any way to enable batch updates for a Redshift table?
If not, is there any way to improve table insertion performance in Redshift?
I tried adding ?rewriteBatchedStatements=true to the JDBC URL - still the same.
The recommended way of doing batch inserts is to use the COPY command. Thus, the common process is to unload data from Redshift to S3 using the UNLOAD command (in case the data you want to insert comes from a query result), and then to run a COPY command referencing the data location in S3. This is far more effective than an insert.
UNLOAD ('my SQL statement')
TO 's3://my-s3-target-location'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'  -- some form of authorization is required
FORMAT PARQUET;
COPY my_target_table (col1, col2, ...)
FROM 's3://my-s3-target-location'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
FORMAT PARQUET;
Here is the documentation:
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

Truncate Oracle table using Spark

My first question here!
I'm learning Spark and so far it is awesome. Now I'm writing some DFs to Oracle using DF.write.mode("append").jdbc.
Now, I need to truncate the table since I don't want to append. If I use "overwrite" mode, it will drop the table and create a new one, but then I'll have to re-grant users access to it. Not good.
Can I do something like TRUNCATE in Oracle using Spark SQL? Open to suggestions! Thanks for your time.
There is an option to make Spark truncate the target Oracle table instead of dropping it. You can find the syntax in https://github.com/apache/spark/pull/14086:
spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
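In PySpark the same approach looks roughly like this (a minimal, untested sketch; the JDBC URL, table names, and credentials are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-with-truncate").getOrCreate()

jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB"   # placeholder connection string
props = {
    "user": "spark_user",                              # placeholder credentials
    "password": "***",
    "driver": "oracle.jdbc.OracleDriver",
}

# With truncate=true, "overwrite" issues TRUNCATE TABLE instead of DROP/CREATE,
# so existing grants on the target table are preserved.
(spark.table("my_source_table")                        # placeholder source
      .write
      .mode("overwrite")
      .option("truncate", "true")
      .jdbc(jdbc_url, "target_schema.target_table", properties=props))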
Depending on the versions of Spark, Oracle, and the JDBC driver, there are other parameters you can use to make the truncate cascade, as you can see in https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
In my experience this works on some DB engines and depends a lot on the JDBC driver you use, because not all of them support it.
Hope this helps.

Informatica: execute SQL in SQL transformation

Background: I am really new to this. I'm using Informatica Developer for PowerCenter Express, Version 9.6.1 HotFix 2.
I want to execute a T-SQL statement as one step in a workflow:
truncate table dbo.stage_customer
I tried creating a mapping and adding a SQL transformation to it, entering the above query in the SQL query window. I added the mapping to a workflow consisting of just the start, the mapping, and the end. When I validate the flow I get this error:
The group [Input] in transformation xxx must have at least one port
I have no idea what ports are needed since this (the truncate statement) basically doesn't need input or output.
Use your query "truncate table dbo.stage_customer" as a Pre-SQL command.
As Aswin suggested, use the built-in option in the session properties.
But in production environments the user may not have TRUNCATE TABLE privileges on the table. In that case, the Informatica workflow will fail if you check the truncate target table option. It is better to have a stored procedure that truncates the target table and to call that stored procedure from the Informatica mapping, to avoid workflow failures when the user has no truncate access to the database.
If you would like to truncate a target table before loading, why don't you use the built-in option present in the session properties?
Go to Workflow Manager -> open the session -> Mapping tab -> click on the target table listed on the left side -> enable the "Truncate table option" property.
To answer your question, I think you have to connect at least one input and one output port to the SQL transformation (because it is not unconnected). Just create dummy ports and try again.

How to convert CONNECT BY in Greenplum

Can anyone suggest how to convert an Oracle CONNECT BY query to Greenplum? Greenplum doesn't support recursive queries, so we cannot use WITH RECURSIVE. Is there an alternative way to rewrite the query below?
SELECT child_id, Parnet_id, LEVEL, SYS_CONNECT_BY_PATH(child_id, '/') AS HIERARCHY
FROM pathnode
START WITH Parnet_id = child_id
CONNECT BY NOCYCLE PRIOR child_id = Parnet_id;
There are ways to do this but it will be a one-off per query. You will need to create a function that loops through your pathnode table and uses RETURN NEXT to return each row. You can search this site to find examples of doing this with PostgreSQL 8.2.
Work is happening to rebase Greenplum onto PostgreSQL 8.3, 8.4, and so on. Those later PostgreSQL versions support WITH RECURSIVE, which is the ANSI SQL way to write this query, but Greenplum doesn't support it yet. Even when Greenplum does support it, I don't think it will perform all that well: the query forces looping and individual row lookups, which works great in an OLTP database but not so well in an MPP database.
I suggest you transform your data in Oracle with a VIEW and then just dump the view to a file to load into Greenplum. Having a self-referencing, N-level table will never be a good idea in an MPP database.

Connect to a Vertica database using ODBC in ASP.NET and return a DataSet

I have inserted some rows into a Vertica database using the INSERT command in the terminal. The rows show up when I read them with a SELECT command, but I am not able to see them when I connect to the database using an ODBC connection. I am also able to find the rows when I restart Vertica. Please help me solve this problem.
Did you COMMIT; after you inserted the rows? It's a simple thing, but one that I've overlooked myself many times in the past.
To elaborate a bit beyond Bobby W's response.
When you perform an insert, the data is visible to your current session. This allows a user to perform operations on 'temporary' data without affecting or corrupting what other people are doing; it is session-based data. That is why you can insert and see the data, but when connecting from a second session you are unable to see it.
To 'commit' the data to the database you need to issue the COMMIT; statement as Bobby W mentioned.
Failing to issue COMMIT; is something I've also overlooked more than a few times.
To clarify, you can see the rows after you restart? Are you connecting to the database as the same user from ODBC and vsql?
By default Vertica's isolation level is READ COMMITTED, which means other sessions read only data that has been COMMITTED. If you've inserted but not committed, then with this level other sessions cannot read the data you've inserted.
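To make the point concrete, here is a minimal sketch in Python using pyodbc (for illustration only; the same applies to the ASP.NET ODBC connection, and the DSN and table names are placeholders):

import pyodbc

# Placeholder DSN pointing at the Vertica ODBC driver.
conn = pyodbc.connect("DSN=VerticaDSN", autocommit=False)
cur = conn.cursor()

# Placeholder table and values.
cur.execute("INSERT INTO my_table (id, name) VALUES (?, ?)", (1, "alice"))

# Until this commit, only the current session can see the inserted row;
# under READ COMMITTED, other sessions read only committed data.
conn.commit()

conn.close()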
