I am trying to follow the post here to set a variable in my Hive query. Assume I have the following file in HDFS:
/home/hduser/test/hr.txt
Berg,12000
Faviet,9000
Chen,8200
Urman,7800
Sciarra,7700
Popp,6900
Paino,8790
I then created my schema on top of the data as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/test/';
I want to create 4 tiles for the table but I don't want to hardcode the number of tiles and instead want to pass it in as a variable. My code is below:
SET q1=select ceiling(count(*)/2) from employees;
SELECT lname,
salary,
NTILE(${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
However, this throws an error:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Number of tiles must be an int expression
I tried to use quotes when calling the variable, as in '${hiveconf:q1}', but that didn't seem to help. If I hardcode the number of tiles (which I am trying to avoid), the workflow will go something like this:
SELECT lname,
salary,
NTILE(4) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
which yields
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Thoughts?
When there isn't a documented way, one can combine documented features into a clean enough hack :)
Here's my attempt, using dfs commands from Hive, shell commands from Hive, the source command and whatnot. I guess it might not work out of the box with queries through HiveServer2. I would be glad if there were a prettier way.
Let's go
Basic setup
SET EMPLOYEE_TABLE_LOCATION=/home/hduser/test/;
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${hiveconf:EMPLOYEE_TABLE_LOCATION}';
SET PATH_TO_SETTINGS_FILE=hdfs:/tmp/query_to_setting;
SET FILENAME_ON_LOCAL_FS=query_to_setting.sql;
Generate a file in hdfs
with content "SET q1=<the-query-result>;"
CREATE TABLE query_to_setting_table
LOCATION '${hiveconf:PATH_TO_SETTINGS_FILE}'
AS
SELECT concat('SET q1=', ceiling(count(*)/2),'\;') from employees;
Source in the generated file as any sql-file.
First put the file to local fs since 'source' only operates on local disk...
dfs -get ${hiveconf:PATH_TO_SETTINGS_FILE}/000000_0 ${hiveconf:FILENAME_ON_LOCAL_FS};
source ${hiveconf:FILENAME_ON_LOCAL_FS};
Try the setting
hive> SET q1;
q1=4
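Why the detour through a file is needed: Hive variable substitution is purely textual, so SET q1=select ... stores the query text itself, not its result. The following is only a rough Python sketch of that behaviour (the substitute helper is hypothetical and is not Hive's implementation):

```python
import re

def substitute(query: str, conf: dict) -> str:
    """Hypothetical stand-in for Hive's ${hiveconf:name} text substitution."""
    return re.sub(r"\$\{hiveconf:(\w+)\}", lambda m: conf[m.group(1)], query)

query = "SELECT lname, NTILE(${hiveconf:q1}) OVER (ORDER BY salary DESC) FROM employees"

# Original attempt: the variable holds the query text, so NTILE receives a subquery.
bad = substitute(query, {"q1": "select ceiling(count(*)/2) from employees"})

# After sourcing the generated file: the variable holds the literal value 4.
good = substitute(query, {"q1": "4"})

print(bad)
print(good)
```

The first printed statement contains NTILE(select ...), which is what triggers the "Number of tiles must be an int expression" error; the second contains NTILE(4).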
Use the setting in a query
hive> SELECT lname,
salary,
NTILE( ${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
OK
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
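For readers wondering how 7 rows end up in 4 tiles as 2, 2, 2, 1: the usual NTILE rule gives the first (rows mod tiles) tiles one extra row. A minimal Python sketch of that rule, assuming Hive follows the standard behaviour (the output above is consistent with it); the ntile function below is illustrative only:

```python
def ntile(num_rows: int, buckets: int) -> list:
    """Return the tile number for each row position, in order."""
    base, extra = divmod(num_rows, buckets)
    tiles = []
    for b in range(1, buckets + 1):
        # The first `extra` tiles receive one additional row.
        size = base + (1 if b <= extra else 0)
        tiles.extend([b] * size)
    return tiles

employees = [("Berg", 12000), ("Faviet", 9000), ("Chen", 8200), ("Urman", 7800),
             ("Sciarra", 7700), ("Popp", 6900), ("Paino", 8790)]

# ORDER BY salary DESC, then assign tiles by position.
ranked = sorted(employees, key=lambda r: r[1], reverse=True)
for (lname, salary), quartile in zip(ranked, ntile(len(ranked), 4)):
    print(lname, salary, quartile)
```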
Optional cleanup
!rm ${hiveconf:FILENAME_ON_LOCAL_FS};
DROP TABLE query_to_setting_table;
I (using Oracle 12c, PL/SQL) need to update an existing table TABLE1 based on information stored in a table MAP. In a simplified version, MAP looks like this:
COLUMN_NAME  MODIFY
-----------  ------
COLUMN1      N
COLUMN2      Y
COLUMN3      N
...          ...
COLUMNn      Y
COLUMN1 to COLUMNn are column names in TABLE1 (but there are more columns, not just these). Now I need to update a column in TABLE1 if MODIFY in table MAP contains a 'Y' for that column's name. There are other row conditions, so what I need are UPDATE statements of the form
UPDATE TABLE1
SET COLUMNi = value_i
WHERE OTHER_COLUMN = 'xyz_i';
where COLUMNi runs through all the columns of TABLE1 which are marked with MODIFY = 'Y' in MAP. value_i and xyz_i also depend on information stored in MAP (not displayed in the example).
The table MAP is not static but changes, so I do not know in advance which columns to update. What I did so far is to generate the UPDATE-statements I need in a query from MAP, i.e.
SELECT <Text of UPDATE-STATEMENT using row information from MAP> AS SQL_STMT
FROM MAP
WHERE MODIFY = 'Y';
Now I would like to execute these statements (possibly hundreds of rows). Of course I could just copy the contents of the query into code and execute, but is there a way to do this automatically, e.g. using EXECUTE IMMEDIATE? It could be something like
BEGIN
EXECUTE IMMEDIATE SQL_STMT USING 'xyz_i';
END;
only that SQL_STMT should run through all the rows of the previous query (and 'xyz_i' varies with the row as well). Any hints how to achieve this or how one should approach the task in general?
EDIT: As response to the comments, a bit more background how this problem emerges. I receive an empty n x m Matrix (empty except row and column names, think of them as first row and first column) quarterly and need to populate the empty fields from another process.
The structure of the initial matrix changes, i.e. there may be new/deleted columns/rows and existing columns/rows may change their position in the matrix. What I need to do is to take the old version of the matrix, where I already have filled the empty spaces, and translate this into the new version. Then, the populating process merely looks if entries have changed and if so, alters them.
The situation from the question arises after I have translated the old version into the new one, before doing the delta. The new matrix, populated with the old information, is TABLE1. The delta process, over which I have no control, gives me column names and information to be entered into the cells of the matrix (this is table MAP). So I need to find the column in the matrix labeled by the delta process and then change values in rows (which ones are specified via other information provided by the delta process).
Dynamic SQL it is; here's an example, see if it helps.
This is a table whose contents should be modified:
SQL> select * from test order by id;
        ID NAME           SALARY
---------- ---------- ----------
         1 Little            100
         2                   200
         3 Foot                0
         4                     0
This is the map table:
SQL> select * from map;
COLUMN CB_MODIFY  VALUE WHERE_CLAUSE
------ ---------- ----- -------------
NAME   Y          Scott where id <= 3
SALARY N          1000  where 1 = 1
Procedure loops through all columns that are set to be modified, composes the dynamic update statement and executes it:
declare
  l_str varchar2(1000);
begin
  -- loop over every column flagged for modification in MAP
  for cur_r in (select m.column_name, m.value, m.where_clause
                from map m
                where m.cb_modify = 'Y'
               )
  loop
    -- compose: update test set <column> = '<value>' <where clause>
    -- (chr(39) is a single quote)
    l_str := 'update test set ' ||
             cur_r.column_name || ' = ' || chr(39) || cur_r.value || chr(39) || ' ' ||
             cur_r.where_clause;
    execute immediate l_str;
  end loop;
end;
/
PL/SQL procedure successfully completed.
Result:
SQL> select * from test order by id;
        ID NAME           SALARY
---------- ---------- ----------
         1 Scott             100
         2 Scott             200
         3 Scott               0
         4                     0
SQL>
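If the loop has to be driven from client code rather than PL/SQL, the same idea can be sketched in Python. This is only an illustration under assumptions: an in-memory SQLite database stands in for Oracle, and the tables are recreated with the sample data shown above. The value is passed as a bind parameter instead of being concatenated between quotes; the column name and where clause still have to be spliced into the statement text, so they must come from a trusted source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Oracle connection
conn.executescript("""
    create table test (id integer, name text, salary integer);
    insert into test values (1, 'Little', 100), (2, null, 200),
                            (3, 'Foot', 0), (4, null, 0);
    create table map (column_name text, cb_modify text, value text, where_clause text);
    insert into map values ('NAME', 'Y', 'Scott', 'where id <= 3'),
                           ('SALARY', 'N', '1000', 'where 1 = 1');
""")

# Fetch the rows that describe which columns to modify.
to_modify = conn.execute(
    "select column_name, value, where_clause from map where cb_modify = 'Y'"
).fetchall()

for column_name, value, where_clause in to_modify:
    # Identifier and where clause are spliced in; the value is bound.
    stmt = f"update test set {column_name} = ? {where_clause}"
    conn.execute(stmt, (value,))

result = conn.execute("select id, name, salary from test order by id").fetchall()
print(result)
```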
How can I extract a particular pattern of data using a SQL query?
I have a string which has data like "Valid data: emp no - 123 emp age - 23 emp type - M."
Here I want whatever follows emp age, i.e. 23. The string format will always be the same. I don't want to get emp age based on its position, as that can change. Is there any other way to get emp age?
Is there a query that will find the emp age tag in the string and then read the value that follows that tag?
If the text is always the same, you can try something like the following (SUBSTRING_INDEX is a MySQL function).
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(ColumnA, 'emp age - ', -1), ' ', 1) AS Age
FROM table;
Oracle has built-in regular expression functions. If, as you say, the format is always the same, extracting the second group of numbers will give you the outcome you desire:
select regexp_substr('Valid data: emp no - 123 emp age - 23 emp type - M.'
                   , '[0-9]+', 1, 2) as emp_age
from dual;
EM
--
23
SQL>
These functions are covered in the documentation.
Use the built-in Regular Expression functions.
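The two regular-expression strategies behave differently if the string layout ever shifts, which a quick sketch makes visible. The Python below is only an illustration of the patterns (it is not Oracle syntax): one pattern anchors on the "emp age - " label, the other takes the second run of digits as in the regexp_substr answer.

```python
import re

text = "Valid data: emp no - 123 emp age - 23 emp type - M."

# Strategy 1: anchor on the label, independent of where it sits in the string.
match = re.search(r"emp age - (\d+)", text)
age_by_label = match.group(1) if match else None

# Strategy 2: take the second group of digits (mirrors '[0-9]+', 1, 2).
age_by_position = re.findall(r"[0-9]+", text)[1]

print(age_by_label, age_by_position)
```

Anchoring on the label keeps working if another number is added before emp age, whereas counting digit groups does not.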
A weird request, maybe, but my boss wants me to create an admin version of a page we have that displays data from an Oracle query in a table.
The admin page, instead of displaying the data (the query returns 1 row), needs to return the table name and column name.
Ex: Instead of:
Name Initial
==================
Bob A
I want:
Name Initial
============================
Users.FirstName Users.MiddleInitial
I realize I can do this in code but would rather just modify the query to return the data I want so I can leave the report generation code mostly alone.
I don't want to do it in a stored procedure.
So when I spit out the data in the report using something like:
blah blah = MyDataRow("FirstName")
I can leave that as is but instead of it displaying "BOB" it would display "Users.FirstName"
And I want to do the query using select * if possible instead of listing all the columns
So for each of the columns I am querying in the * , I want to get (instead of the column value) the tablename.ColumnName or tablename|columnName
hope you are following- I am confusing myself...
pseudo:
select tablename + '.' + Columnname as WhateverTheColumnNameIs
from Table1
left join Table2 on whatever...
Join Table_Names on blah blah
Whew- after writing all this I think I will just do it on the code side.
But if you are up for it maybe a fun challenge
Oracle does not provide a built-in way (there is no pseudocolumn) to get the column names of a table as the result of a query against that table. But you might consider these two approaches:
Extract the column names from an xmltype, formed by passing a cursor expression (your query) to the xmltable() function:
-- your table
with t1(first_name, middle_name) as(
select 1,2 from dual
), -- your query
t2 as(
select * -- col1 as "t1.col1"
--, col2 as "t1.col2"
--, col3 as "t1.col3"
from t1
)
select *
from ( select q.object_value.getrootelement() as col_name
, rownum as rn
from xmltable('//*'
passing xmltype(cursor(select * from t2 where rownum = 1))
) q
where q.object_value.getrootelement() not in ('ROWSET', 'ROW')
)
pivot(
max(col_name) for rn in (1 as "name", 2 as "initial")
)
Result:
name initial
--------------- ---------------
FIRST_NAME MIDDLE_NAME
Note: In order for column names to be prefixed with table name, you need to list them
explicitly in the select list of a query and supply an alias, manually.
PL/SQL approach. Starting from Oracle 11g, you could use the dbms_sql package, specifically its describe_columns() procedure, to get the names of the columns in a cursor (your select).
This might be what you are looking for: try selecting from the data dictionary views USER_TAB_COLS or ALL_TAB_COLS.
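Since the asker considers doing this on the code side anyway: most database drivers expose the column names of an executed query, which is the client-side counterpart of describe_columns. A minimal Python sketch, assuming an in-memory SQLite database as a stand-in for Oracle and a hypothetical users table; as with the XML approach, the table prefix has to be supplied by hand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database
conn.execute("create table users (first_name text, middle_initial text)")
conn.execute("insert into users values ('Bob', 'A')")

cur = conn.execute(
    "select first_name as FirstName, middle_initial as MiddleInitial from users"
)

# cursor.description holds one entry per result column; item 0 is its name.
column_names = [d[0] for d in cur.description]

# The table prefix is added manually; the driver does not report it here.
labels = [f"Users.{name}" for name in column_names]
print(labels)
```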
I'm trying to use Hive to analyze our logs, and I have a question.
Assume we have some data like this:
A 1
A 1
A 1
B 1
C 1
B 1
How can I make it look like this in a Hive table (order is not important, I just want to merge the duplicates)?
A 1
B 1
C 1
without pre-processing it with awk/sed or something like that?
Thanks!
Step 1: Create a Hive table for the input data set.
create table if not exists table1 (fld1 string, fld2 string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
(I assumed the field separator is \t; replace it with your actual separator.)
Step 2: Run the statement below to get the merged data you are looking for.
create table table2 as select fld1,fld2 from table1 group by fld1,fld2 ;
I tried this on the input set below:
hive (default)> select * from table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
create table table4 as select fld1,fld2 from table1 group by fld1,fld2 ;
hive (default)> select * from table4;
OK
A 1
B 1
C 1
You can use an external table as well, but for simplicity I have used a managed table here.
One idea: you could create a table around the first file (called 'oldtable').
Then run something like this:
create table newtable as select field1, max(field2) as field2 from oldtable group by field1;
The idea is to get the unique values of the first field, and only one value of the second (field2 is a placeholder name for the second column). Make sense?
To merge data from two tables, you can also use UNION ALL, provided the corresponding columns have compatible types. Note that UNION ALL keeps duplicate rows, so it combines tables but does not deduplicate them.
insert overwrite table test1
select u.* from (
select x.* from t1 x
UNION ALL
select y.* from t2 y
) u;
Here we are merging the data of two tables (t1 and t2) into a single table, test1.
There's no way to pre-process the data while it's being loaded without using an external program. You could use a view if you'd like to keep the original data intact.
hive> SELECT * FROM table1;
OK
A 1
A 1
A 1
B 1
C 1
B 1
B 2 # Added to show it will group correctly with different values
hive> CREATE VIEW table2 (fld1, fld2) AS SELECT fld1, fld2 FROM table1 GROUP BY fld1, fld2;
hive> SELECT * FROM table2;
OK
A 1
B 1
B 2
C 1
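Grouping by every selected column is equivalent to SELECT DISTINCT over those columns, so either form can be used in the CTAS or view definition. A small Python sketch that checks this on the sample data, assuming an in-memory SQLite database as a stand-in for Hive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Hive, for illustration only
conn.execute("create table table1 (fld1 text, fld2 text)")
conn.executemany(
    "insert into table1 values (?, ?)",
    [("A", "1"), ("A", "1"), ("A", "1"), ("B", "1"), ("C", "1"), ("B", "1"), ("B", "2")],
)

# Deduplicate by grouping on all selected columns.
grouped = conn.execute(
    "select fld1, fld2 from table1 group by fld1, fld2 order by fld1, fld2"
).fetchall()

# The same result expressed with DISTINCT.
distinct = conn.execute(
    "select distinct fld1, fld2 from table1 order by fld1, fld2"
).fetchall()

print(grouped)
print(grouped == distinct)
```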
Question:
Is it possible to have a column name in a select statement changed based on a value in its result set?
For example, if a year value in a result set is less than 1950, name the column OldYear, otherwise name the column NewYear. The year value in the result set is guaranteed to be the same for all records.
I'm thinking this is impossible, but here was my failed attempt to test the idea:
select 1 as
(case
when 2 = 1 then "name1";
when 1 = 1 then "name2")
from dual;
You can't vary a column name per row of a result set. This is basic to relational databases. The names of columns are part of the table "header" and a name applies to the column under it for all rows.
Re comment: OK, maybe the OP Americus means that the result is known to be exactly one row. But regardless, SQL has no syntax to support a dynamic column alias. Column aliases must be constant in a query.
Even dynamic SQL doesn't help, because you'd have to run the query twice. Once to get the value, and a second time to re-run the query with a different column alias.
The "correct" way to do this in SQL is to have both columns, and have the column that is inappropriate be NULL, such as:
SELECT
CASE WHEN year < 1950 THEN year ELSE NULL END AS OldYear,
CASE WHEN year >= 1950 THEN year ELSE NULL END AS NewYear
FROM some_table_with_years;
There is no good reason to change the column name dynamically - it's analogous to the name of a variable in procedural code - it's just a label that you might refer to later in your code, so you don't want it to change at runtime.
I'm guessing what you're really after is a way to format the output (e.g. for printing in a report) differently depending on the data. In that case I would generate the heading text as a separate column in the query, e.g.:
SELECT 1 AS mydata
,case
when 2 = 1 then 'name1'
when 1 = 1 then 'name2'
end AS myheader
FROM dual;
Then the calling procedure would take the values returned for mydata and myheader and format them for output as required.
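What the calling side might look like, as a rough Python sketch: the query returns the heading text as ordinary data, and the caller decides how to print it. An in-memory SQLite database is assumed as a stand-in for Oracle, so the FROM dual clause is dropped; the report layout is purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database

# Same query shape as above: the data plus a computed heading, both as columns.
row = conn.execute(
    "select 1 as mydata, "
    "case when 2 = 1 then 'name1' when 1 = 1 then 'name2' end as myheader"
).fetchone()
mydata, myheader = row

# The caller formats the output, using the returned heading as the column title.
report = f"{myheader}\n{'-' * len(myheader)}\n{mydata}"
print(report)
```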
You will need something similar to this, which generates the text of the query to run next:
select 'select ' || CASE WHEN YEAR<1950 THEN 'OLDYEAR' ELSE 'NEWYEAR' END || ' FROM TABLE1' from TABLE_WITH_DATA;
This solution requires that you launch SQLPLUS and a .sql file from a .bat file or using some other method with the appropriate Oracle credentials. The .bat file can be kicked off manually, from a server scheduled task, Control-M job, etc...
Output is a .csv file. This also requires that you replace all commas in the output with some other character or risk column/data mismatch in the output.
The trick is that your column headers and data are selected in two different SELECT statements.
It isn't perfect, but it does work, and it's the closest to standard Oracle SQL that I've found for a dynamic column header outside of a development environment. We use this extensively to generate recurring daily/weekly/monthly reports to users without resorting to a GUI. Output is saved to a shared network drive directory/Sharepoint.
REM BEGIN runExtract1.bat file -----------------------------------------
sqlplus username/password@database @C:\DailyExtracts\Extract1.sql > C:\DailyExtracts\Extract1.log
exit
REM END runExtract1.bat file -------------------------------------------
REM BEGIN Extract1.sql file --------------------------------------------
set colsep ,
set pagesize 0
set trimspool on
set linesize 4000
column dt new_val X
select to_char(sysdate,'MON-YYYY') dt from dual;
spool c:\DailyExtracts\&X._Extract1.csv
select '&X-Project_id', 'datacolumn2-Project_Name', 'datacolumn3-Plant_id' from dual;
select
PROJ_ID
||','||
replace(PROJ_NAME,',',';')-- "Project Name"
||','||
PLANT_ID-- "Plant ID"
from PROJECTS
where ADDED_DATE >= TRUNC(SYSDATE, 'MM');
spool off
exit
/
REM ------------------------------------------------------------------
CSV OUTPUT (opened in Excel and copy/pasted):
old 1: select '&X-Project_id' 'datacolumn2-Project_Name' 'datacolumn3-Plant_id' from dual
new 1: select 'MAR-2018-Project_id' 'datacolumn2-Project_Name' 'datacolumn3-Plant_id' from dual
MAR-2018-Project_id datacolumn2-Project_Name datacolumn3-Plant_id
31415 name1 1007
31415 name1 2032
32123 name2 3302
32123 name2 3384
32963 name3 2530
33629 name4 1161
34180 name5 1173
34180 name5 1205
...
...
etc...
135 rows selected.