I'm in the process of transferring data from Oracle to Hive using Talend's tHiveInput component.
My query looks like this:
SELECT
DISTINCT A.ID,
LEVEL SEQUENCE,
REGEXP_SUBSTR(A.ANEST,'[^|]+', 1, LEVEL),
DATE
FROM
( SELECT A.*
FROM tableaa A,
tablebb B
WHERE A.IDX = B.IDY
and A.DATE = B.DATE
) A
CONNECT BY INSTR(A.ANEST, '|', 1, LEVEL-1) > 0
AND PRIOR sys_guid() IS NOT NULL
Could you explain, in simple terms, what CONNECT BY INSTR does here?
And how should I write this in Hive?
Thank you.
"CONNECT BY" is a way of performing a recursive lookup, like data_id and parent_id in a row, where parent_id would point to a prior data_id in another row. This allows Oracle to rapidly construct hierarchical relationship trees and the like.
There is no native equivalent for this in Hive.
I did see one blog post for creating an external function to support something similar in Hive, which you could check out here: https://blog.pythian.com/recursion-in-hive/
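In the query from the question, CONNECT BY is not walking a parent/child hierarchy. As far as I can tell it is used as a row generator: for each row, LEVEL counts up for as long as INSTR can still find another '|' in ANEST, and REGEXP_SUBSTR picks out the LEVEL-th token. A minimal Python sketch of that behaviour follows, assuming ANEST is a non-null pipe-delimited string; the function name and sample data are made up for illustration. In Hive this kind of expansion is usually written with split() and LATERAL VIEW posexplode() rather than with recursion.

```python
import re


def split_rows(rows):
    """Emulate the row expansion: one output row per (id, level).

    rows: iterable of (id, anest) pairs, anest being a '|'-delimited string.
    """
    out = set()  # the set plays the role of SELECT DISTINCT
    for row_id, anest in rows:
        # REGEXP_SUBSTR(anest, '[^|]+', 1, n) returns the n-th non-empty token
        tokens = re.findall(r"[^|]+", anest)
        # The INSTR(...) > 0 condition allows one level per delimiter, plus level 1
        max_level = anest.count("|") + 1
        for level in range(1, max_level + 1):
            token = tokens[level - 1] if level <= len(tokens) else None
            out.add((row_id, level, token))
    return sorted(out, key=lambda r: (r[0], r[1]))


# Hypothetical sample data
sample = [(1, "A|B|C"), (2, "X")]
result = split_rows(sample)
for row in result:
    print(row)
```

Each input row becomes as many output rows as it has delimited positions, which is the shape the Hive rewrite needs to reproduce.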
Related
I want to retrieve each user's name and their responsibility_key where their end_date is null, converting the null end_date to (sysdate+1) using nvl. However, I am only able to retrieve the responsibility_key, not the name. Please help.
The error in the image says "column ambiguously defined". Take a close look. Your last END_DATE could refer to either the u alias or the table from the subquery. Change it to match the rest of your subquery (FND_USER_RESP_GROUPS_DIRECT.END_DATE).
EDIT
Your query is
select u.USER_NAME, d.responsibility_key from FND_USER u,FND_RESPONSIBILITY_VL d
where responsibility_id in(
select responsibility_id from
FND_USER_RESP_GROUPS_DIRECT WHERE END_USER_RESP_GROUPS_DIRECT.END_DATE=nvl(END_DATE,sysdate+1)) and
u.END_DATE=nvl(END_DATE,SYSDATE + 1)
;
1. The query isn't formatted, which makes it hard to read.
2. Not all columns are qualified with a table name (or alias), as mentioned in the comments.
3. The query currently uses an implicit join.
4. The query is impossible to understand without seeing the table definitions (desc [table_name]).
For points 1 and 2, a properly formatted query will look something like
select u.user_name, d.responsibility_key
from
fnd_user u,
fnd_responsibility_vl d
where
d.responsibility_id in (
select urgd.responsibility_id
from
fnd_user_resp_groups_direct urgd
where
urgd.end_date = nvl(u.end_date, sysdate+1)
) and
u.end_date = nvl(urgd.end_date, sysdate + 1)
;
This is easier to read. In addition, you can see that, without table definitions, I had to guess (see point 4) which tables the end_date columns in your query belong to. If I had to guess, so does Oracle, which means you have an ambiguity problem. To fix it, look closely at each end_date in your original query and, wherever it has no prefix, prefix it with the appropriate alias (after you have aliased all your tables).
For point 3, you can write your query more clearly with an explicit join and by using aliases for all columns. As for the explicit join I have no idea what your tables look like but one possibility is something like
select u.user_name, d.responsibility_key
from fnd_user u
join fnd_responsibility_vl d
on u.id = d.user_id
where
d.responsibility_id in (
select responsibility_id
from fnd_user_resp_groups_direct urgd
where
urgd.end_date = nvl(u.end_date, sysdate+1)
) and
u.end_date = nvl(urgd.end_date, sysdate+1)
;
If you follow these points you will get to the root of the error.
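The ambiguity itself is easy to reproduce on a small scale. The sketch below uses Python's built-in sqlite3 module with two made-up tables (users and resp_groups, standing in for the real FND tables, whose definitions I have not seen). SQLite words the error differently from Oracle's ORA-00918, but the cause and the fix are the same: an unqualified column that exists in more than one table in scope.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-ins for the real tables; both have an end_date column
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, user_name TEXT, end_date TEXT);
    CREATE TABLE resp_groups (user_id INTEGER, responsibility_id INTEGER, end_date TEXT);
    INSERT INTO users VALUES (1, 'ALICE', NULL);
    INSERT INTO resp_groups VALUES (1, 100, NULL);
""")

# Unqualified end_date: the engine cannot tell which table is meant
try:
    conn.execute(
        "SELECT user_name FROM users u "
        "JOIN resp_groups g ON u.user_id = g.user_id "
        "WHERE end_date IS NULL"
    )
    ambiguous_error = None
except sqlite3.OperationalError as exc:
    ambiguous_error = str(exc)
print(ambiguous_error)

# Every column qualified with its alias: the query runs
rows = conn.execute(
    "SELECT u.user_name, g.responsibility_id FROM users u "
    "JOIN resp_groups g ON u.user_id = g.user_id "
    "WHERE u.end_date IS NULL AND g.end_date IS NULL"
).fetchall()
print(rows)
```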
In one of my use cases, I have two tables, flow and conf. The flow table contains a list of all flight data, with columns creationdate, datafilename, and aircraftid. The conf table contains configuration information, with columns configdate, aircraftid, and configurationame. Multiple versions of a configuration can be created for one aircraft type. So, when we process a datafilename, we need to identify the aircraftid from the flow table and pick up the configuration from the conf table that was created just before the datafilename was created. I tried this:
FROM (
SELECT
F_FILE_CREATION_DATE,
F_FILE_ARCHIVED_RELATIVE_PATH,
F_FILE_ARCHIVED_NAME,
K_AIRCRAFT
from T_FLOW f )x left join
(
select c.config_date, c.aircraft_id, c.configuration from t_conf c
) y on y.aircraft_id = x.K_AIRCRAFT
select
x.F_FILE_CREATION_DATE,
x.F_FILE_ARCHIVED_RELATIVE_PATH,
x.F_FILE_ARCHIVED_NAME,
x.K_AIRCRAFT,
y.config_date,
y.aircraft_id,
y.configuration;
This picks up all the configurations created for the aircraft, which is expected, as there is no condition checking conf.config_date < flow.f_file_creation_date. I tried to include this condition like this:
FROM (
SELECT
F_FILE_CREATION_DATE,
F_FILE_ARCHIVED_RELATIVE_PATH,
F_FILE_ARCHIVED_NAME,
K_AIRCRAFT
from T_FLOW f )x join
(
select c.config_date, c.aircraft_id, c.FILEFILTER from t_conf c
) y on y.aircraft_id = x.K_AIRCRAFT where y.config_date < x.f_file_creation_date
select
x.F_FILE_CREATION_DATE,
x.F_FILE_ARCHIVED_RELATIVE_PATH,
x.F_FILE_ARCHIVED_NAME,
x.K_AIRCRAFT,
y.config_date,
y.aircraft_id,
y.filefilter;
This time it failed with the error
required (...)+ loop did not match anything at input 'where' in statement
Can someone give me a hint or two about where I am going wrong and how to fix this?
select f.f_file_creation_date
,f.f_file_archived_relative_path
,f.f_file_archived_name
,f.k_aircraft
,c.config_date
,c.aircraft_id
,c.filefilter
from t_flow as f
join (select config_date
,aircraft_id
,filefilter
,lead (config_date,1,date '3000-01-01') over
(
partition by aircraft_id
order by config_date
) as next_config_date
from t_conf
) c
on c.aircraft_id =
f.k_aircraft
where f.f_file_creation_date >= c.config_date
and f.f_file_creation_date < c.next_config_date
Please read carefully
Posting a question
When you post a data related question -
Supply a data sample: source data + required results.
It is going to be more clear than any explanation you give.
It will also supply a common background for further discussions and a way for you and others to verify the correctness of the given solutions.
Supply the size properties (records/volume) of the tables.
It is important for performance considerations and might impact the given solution.
SQL
Hive currently does not support any JOIN condition type other than equijoin (e.g. t1.X = t2.X and t1.Y = t2.Y). This is why you get an error.
If you are doing an inner join (and not outer join) then you can move the non-equijoin conditions to the WHERE clause.
Stick to ISO SQL standard. There is a conventional order for SQL clauses: SELECT-FROM-WHERE...
You gain nothing from esoteric syntax except for esoteric error messages.
There is no reason whatsoever to use sub-queries in order to narrow the column list.
Just to make it perfectly clear: there isn't any performance gain in doing that. Moreover, if it had worked as you assume (and it does not), the performance would have been worse, not better.
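To make the logic of the LEAD-based query at the top of this answer easier to follow, here is a rough Python sketch of what it computes: each configuration is treated as valid from its config_date up to (but not including) the next config_date for the same aircraft, and each file is matched to the interval its creation date falls into. The function names and sample rows are invented for illustration; only the column roles come from the question.

```python
from collections import defaultdict
from datetime import date

FAR_FUTURE = date(3000, 1, 1)  # same sentinel as date '3000-01-01' in the query


def config_intervals(confs):
    """confs: iterable of (config_date, aircraft_id, filefilter).

    Adds next_config_date per aircraft, like
    LEAD(config_date) OVER (PARTITION BY aircraft_id ORDER BY config_date).
    """
    by_aircraft = defaultdict(list)
    for config_date, aircraft_id, filefilter in confs:
        by_aircraft[aircraft_id].append((config_date, filefilter))
    out = []
    for aircraft_id, items in by_aircraft.items():
        items.sort(key=lambda item: item[0])
        for i, (config_date, filefilter) in enumerate(items):
            next_date = items[i + 1][0] if i + 1 < len(items) else FAR_FUTURE
            out.append((config_date, aircraft_id, filefilter, next_date))
    return out


def match_flows(flows, confs):
    """flows: iterable of (creation_date, file_name, aircraft_id)."""
    intervals = config_intervals(confs)
    result = []
    for creation_date, file_name, aircraft in flows:
        for config_date, aircraft_id, filefilter, next_date in intervals:
            # Equality on the aircraft is the join key; the range test is the WHERE clause
            if aircraft_id == aircraft and config_date <= creation_date < next_date:
                result.append((file_name, aircraft, config_date, filefilter))
    return result


# Hypothetical sample data
confs = [
    (date(2017, 1, 1), "AC1", "v1"),
    (date(2017, 3, 1), "AC1", "v2"),
    (date(2017, 2, 1), "AC2", "v1"),
]
flows = [
    (date(2017, 2, 15), "f1.dat", "AC1"),
    (date(2017, 3, 1), "f2.dat", "AC1"),
    (date(2016, 12, 31), "f3.dat", "AC1"),
]
matched = match_flows(flows, confs)
for row in matched:
    print(row)
```

A file created before the first configuration of its aircraft (f3.dat here) gets no match, just as the inner join in the query drops it.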
I can't reproduce your error. I guess your query is valid.
Which version of Hive do you use? I tested this query with Hive 2.1.1.
DROP TABLE IF EXISTS t_flow;
CREATE TABLE IF NOT EXISTS t_flow (
f_file_creation_date DATE
, f_file_archived_relative_path STRING
, f_file_archived_name STRING
, k_aircraft STRING
);
-- Conf table contains configuration information.
-- It has columns configdate, aircraftid, configurationame
DROP TABLE IF EXISTS t_conf;
CREATE TABLE IF NOT EXISTS t_conf (
config_date DATE
, aircraft_id STRING
, filefilter STRING
);
SELECT
x.f_file_creation_date,
x.f_file_archived_relative_path,
x.f_file_archived_name,
x.k_aircraft,
y.config_date,
y.aircraft_id,
y.filefilter
FROM
(SELECT
f_file_creation_date,
f_file_archived_relative_path,
f_file_archived_name,
k_aircraft
FROM t_flow f) x
JOIN
(SELECT
c.config_date,
c.aircraft_id,
c.filefilter
FROM t_conf c) y on y.aircraft_id = x.k_aircraft where y.config_date < x.f_file_creation_date;
I have the following query in Oracle and want to convert it to PostgreSQL. Could someone help me out with this?
SELECT user_id, user_name, reports_to, position
FROM pr_operators
START WITH reports_to = 'dpercival'
CONNECT BY PRIOR user_id = reports_to;
Something like this should work for you (SQL Fiddle):
WITH RECURSIVE q AS (
SELECT po.user_id,po.user_name,po.reports_to,po.position
FROM pr_operators po
WHERE po.reports_to = 'dpercival'
UNION ALL
SELECT po.user_id,po.user_name,po.reports_to,po.position
FROM pr_operators po
JOIN q ON q.user_id=po.reports_to
)
SELECT * FROM q;
You can read more on recursive CTEs in the docs.
Note: your design looks strange: reports_to contains string literals, yet it is being compared with user_id, which typically is of type integer.
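If you want to try the recursive CTE without a PostgreSQL instance at hand, the same WITH RECURSIVE form also runs in SQLite. Below is a small sketch using Python's built-in sqlite3 module; the table contents are made up, and user_id is assumed to be a text column, as the question's data suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample rows; only the column names come from the question
conn.executescript("""
    CREATE TABLE pr_operators (user_id TEXT, user_name TEXT, reports_to TEXT, position TEXT);
    INSERT INTO pr_operators VALUES ('dpercival', 'D. Percival', NULL, 'Director');
    INSERT INTO pr_operators VALUES ('asmith', 'A. Smith', 'dpercival', 'Manager');
    INSERT INTO pr_operators VALUES ('bjones', 'B. Jones', 'asmith', 'Analyst');
    INSERT INTO pr_operators VALUES ('cother', 'C. Other', 'someoneelse', 'Analyst');
""")

rows = conn.execute("""
    WITH RECURSIVE q AS (
        -- anchor member: plays the role of START WITH
        SELECT po.user_id, po.user_name, po.reports_to, po.position
        FROM pr_operators po
        WHERE po.reports_to = 'dpercival'
        UNION ALL
        -- recursive member: plays the role of CONNECT BY PRIOR user_id = reports_to
        SELECT po.user_id, po.user_name, po.reports_to, po.position
        FROM pr_operators po
        JOIN q ON q.user_id = po.reports_to
    )
    SELECT * FROM q
""").fetchall()

found_ids = sorted(row[0] for row in rows)
print(found_ids)
```

The result contains the direct and indirect reports of dpercival, but not dpercival himself, which matches what START WITH reports_to = 'dpercival' does in the Oracle original.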
Oracle poses a limitation on using a subquery within the select clause when creating a materialized view. When you do so, you receive the error "ORA-22818: subquery expressions not allowed here".
Due to this limitation, I've been struggling to rewrite the query and move the subquery out of the select clause. The query is recursively building a path using parent/child relationships, and I'm trying to also indicate if a particular category is a leaf category by joining the table back to itself and seeing if the record has a child.
SELECT A.PRODUCTCATEGORYID, A.PARENTCATEGORYID, SYS_CONNECT_BY_PATH(A.LABEL, ':') "PATH",
(
SELECT CASE WHEN MAX(PRODUCTCATEGORYID) IS NOT NULL THEN 0 ELSE 1 END
FROM PRODUCT_CATEGORY
WHERE parentcategoryid = A.PRODUCTCATEGORYID
) as "LEAF"
FROM PRODUCT_CATEGORY A
CONNECT BY PRIOR A.PRODUCTCATEGORYID = A.PARENTCATEGORYID
START WITH A.PARENTCATEGORYID IS NULL;
Can anyone point me in the right direction of how I should go about rewriting this so the subquery is not part of the select clause?
Thanks in advance.
You can use the CONNECT_BY_ISLEAF pseudocolumn, which returns 1 if current node is a leaf, and 0 otherwise, so your query should be rewritten like this:
SELECT A.PRODUCTCATEGORYID, A.PARENTCATEGORYID, SYS_CONNECT_BY_PATH(A.LABEL, ':') "PATH",
CONNECT_BY_ISLEAF as "LEAF"
FROM PRODUCT_CATEGORY A
CONNECT BY PRIOR A.PRODUCTCATEGORYID = A.PARENTCATEGORYID
START WITH A.PARENTCATEGORYID IS NULL;
Read more in Oracle's documentation: CONNECT_BY_ISLEAF pseudocolumn
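For readers less familiar with hierarchical queries, this is roughly what SYS_CONNECT_BY_PATH and CONNECT_BY_ISLEAF compute together, sketched in plain Python. The helper name and the sample categories are invented; the traversal is depth-first from the rows whose parent is NULL, and a node is flagged as a leaf when it has no children.

```python
def walk_categories(rows):
    """rows: list of (category_id, parent_id, label).

    Returns (category_id, parent_id, path, is_leaf) tuples in depth-first order.
    """
    children = {}
    for cat_id, parent_id, label in rows:
        children.setdefault(parent_id, []).append((cat_id, label))

    out = []

    def visit(cat_id, parent_id, label, prefix):
        path = prefix + ":" + label              # like SYS_CONNECT_BY_PATH(label, ':')
        is_leaf = 0 if children.get(cat_id) else 1   # like CONNECT_BY_ISLEAF
        out.append((cat_id, parent_id, path, is_leaf))
        for child_id, child_label in children.get(cat_id, []):
            visit(child_id, cat_id, child_label, path)

    # START WITH parentcategoryid IS NULL
    for root_id, root_label in children.get(None, []):
        visit(root_id, None, root_label, "")
    return out


# Hypothetical sample data
categories = [
    (1, None, "Books"),
    (2, 1, "Fiction"),
    (3, 1, "Science"),
    (4, 2, "Fantasy"),
]
tree = walk_categories(categories)
for row in tree:
    print(row)
```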
In MySQL, pagination can easily be implemented with a single SQL statement using the LIMIT clause, something like the following.
SELECT country_id, country_name
FROM country c
ORDER BY country_id DESC
LIMIT 4, 5;
It would retrieve rows 5 through 9 of the result set that the SQL query produces (an offset of 4, followed by 5 rows).
In Oracle, the same thing can be achieved using row numbers with a subquery, which makes the task somewhat tedious, as follows.
SELECT country_id, country_name
FROM
(SELECT rownum as row_num, country_id, country_name
FROM
(SELECT country_id, country_name
FROM country
ORDER BY country_id desc)
WHERE rownum <= 9
)
WHERE row_num >=5;
In Oracle 10g (or higher, though I'm not sure about the higher versions), this can be made somewhat easier:
SELECT country_id, country_name
FROM (SELECT country_id, country_name, row_number() over (order by country_id desc) rank
FROM country)
WHERE rank BETWEEN 5 AND 9;
In an application such as a web application, pagination is needed almost everywhere, and writing such SQL statements every time a (select) query is executed is tedious.
Suppose I have a web application written in Java. If I use the Hibernate framework, there is a direct way to do this using methods supported by Hibernate, like:
List<Country> countryList = session.createQuery("from Country order by countryId desc")
        .setFirstResult(4).setMaxResults(5).list();
but when I simply use JDBC connectivity with Oracle like,
String connectionURL = "jdbc:oracle:thin:@localhost:1521:xe";
Connection connection = null;
Statement statement = null;
ResultSet rs = null;
Class.forName("oracle.jdbc.OracleDriver").newInstance();
connection = DriverManager.getConnection(connectionURL, "root", "root");
statement = connection.createStatement();
rs = statement.executeQuery("SELECT * from country");
My question is: in this case, is there a precise way to retrieve a specified range of rows using this code, with something like the setFirstResult() and setMaxResults() methods from the preceding case? Or is the only way to achieve this to use the subqueries shown above?
Because 'No' is an answer too:
Unfortunately, you will have to use the subquery approach. I would personally use the one with the rank (the second one).
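If the repetition is the main annoyance, one option is to wrap the subquery pattern in a small helper so it is written only once. The sketch below is in Python for brevity, and the function name and parameters are my own invention; the same arithmetic carries over to a Java helper that builds the string passed to executeQuery(). It assumes first_result is a 0-based offset and max_results is the page size, mirroring the Hibernate calls in the question, and that the column, table and ordering names come from trusted code rather than user input.

```python
def paginate_with_row_number(columns, table, order_by, first_result, max_results):
    """Build a ROW_NUMBER()-based page query (hypothetical helper).

    first_result: 0-based offset of the first row to return.
    max_results:  maximum number of rows to return.
    """
    if first_result < 0 or max_results < 1:
        raise ValueError("first_result must be >= 0 and max_results >= 1")
    first_row = first_result + 1           # 1-based number of the first row on the page
    last_row = first_result + max_results  # 1-based number of the last row on the page
    cols = ", ".join(columns)
    return (
        f"SELECT {cols} FROM ("
        f"SELECT {cols}, ROW_NUMBER() OVER (ORDER BY {order_by}) rn FROM {table}"
        f") WHERE rn BETWEEN {first_row} AND {last_row}"
    )


page_sql = paginate_with_row_number(
    ["country_id", "country_name"], "country", "country_id DESC", 4, 5
)
print(page_sql)
```

With an offset of 4 and a page size of 5, the generated filter covers row numbers 5 to 9, the same rows as LIMIT 4, 5 in MySQL.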