Input file name in Spark 1.6.0 view

I cannot use the input_file_name() function in Spark 1.6.0 views. It works in select statements or in df.withColumn("path", input_file_name()), but not in a view.
For example:
CREATE VIEW v_test AS SELECT *, input_file_name() FROM table
fails. It also fails when I use INPUT__FILE__NAME instead. Just:
SELECT *, input_file_name() FROM table
works as expected. Is this a known bug, or am I doing something wrong?
PS: I can create the view in Hive, but cannot access it from Spark as it fails with the same error: unknown function...
UPDATE:
I use Zeppelin with livy interpreter and Scala API.
The error I get from the above query to create the view is:
invalid function input_file_name
I also tried to import the function, but it had no effect.

You have to create a temp view first, as below:
df.registerTempTable("table")
and then use input_file_name() against it; it works as expected:
sqlContext.sql("select *, input_file_name() from table")
For newer versions of Spark, you can use the following API to create the temp view:
df.createOrReplaceTempView("table")
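Putting it together, a minimal end-to-end sketch in Scala (the input path /data/events, the view name events, and the column alias path are hypothetical; sqlContext is the one provided by the shell or Zeppelin):
// read some files, register them, and expose the source file per row
val df = sqlContext.read.json("/data/events")   // hypothetical input path
df.registerTempTable("events")                  // Spark 1.6 API
val withPath = sqlContext.sql("SELECT *, input_file_name() AS path FROM events")
withPath.show()
// on Spark 2.x, replace the registration with:
// df.createOrReplaceTempView("events")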
I hope the answer is helpful.

Related

Hadoop view created with CTE misbehaves

Here is the view definition (it runs fine and the view gets created):
CREATE OR REPLACE VIEW my_view
AS WITH Q1
AS (SELECT MAX(LOAD_DT) AS LOAD_DT FROM load_table WHERE UCASE(TBL_NM) = 'FACT_TABLE')
SELECT F.COLUMN1
, F.COLUMN2
FROM Q1, FACT_TABLE F
WHERE Q1.LOAD_DT = F.TRAN_DT
;
However, when I run
SELECT * from my_view;
I get the following error message:
FAILED: SemanticException Line N:M Table not found 'Q1' in definition of view my_view....etc..
It looks like Hive is trying to treat Q1 (which is a CTE) as a physical table. Any ideas how to work around this?
Thank You,
Natalia
We have faced a similar issue in our environment. To answer your question: it is a bug in Hive. Fortunately, there is a workaround. If you are using Impala and Hive with the same metastore, create the view in Impala and it will work from both Hive and Impala.
Reason:
Hive appends your database name to the CTE reference in the stored view definition, which is what causes the issue.
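If creating the view from Impala is not an option, here is a sketch of another commonly used workaround (my own suggestion, not part of the fix above): inline the CTE as a derived table, so the stored view definition contains no CTE name for Hive to mis-qualify:
CREATE OR REPLACE VIEW my_view AS
SELECT F.COLUMN1
     , F.COLUMN2
FROM (SELECT MAX(LOAD_DT) AS LOAD_DT
        FROM load_table
       WHERE UCASE(TBL_NM) = 'FACT_TABLE') Q1
   , FACT_TABLE F
WHERE Q1.LOAD_DT = F.TRAN_DT;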
Thanks,
Neo

How to execute a select query on an Oracle database using pyspark?

I have written a program using pyspark to connect to an Oracle database and fetch data. The command below works fine and returns the contents of the table:
df = (sqlContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
      .option("dbtable", "SCHEMA.TABLE")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load())
df.show()
Now I do not want to load the entire table; I want to load only selected records. Can I specify a select query as part of this command? If so, how?
Note: I could load the whole table into a DataFrame and run the select on top of it, but I do not want to do that. Please help!
You can use a subquery in the dbtable option. Keep the filter inside the parentheses, and note that Oracle does not accept AS before a table alias:
.option("dbtable", "(SELECT * FROM tableName WHERE x = 1) tmp")
Here is a similar question, but about MySQL.
In general, the optimizer should be able to push down the relevant select and where elements, so if you do df.select("a","b","c").where("d<10"), this should in general be pushed down to Oracle. You can check by calling df.explain(true) on the final DataFrame.
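For completeness, a minimal sketch in Scala (the PySpark calls are the same apart from syntax), reusing the connection options from the question; the columns x, a, b, c, d are the placeholder names from the examples above:
// 1) push the filter yourself via a subquery in dbtable
val pushed = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("dbtable", "(SELECT * FROM SCHEMA.TABLE WHERE x = 1) tmp") // no AS before the alias on Oracle
  .load()

// 2) or load the table reference and let the optimizer push the
//    projection and filter down to Oracle
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:user/password@dbserver:port/dbname")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("dbtable", "SCHEMA.TABLE")
  .load()
  .select("a", "b", "c")
  .filter("d < 10")

df.explain(true) // the physical plan lists the filters actually pushed to Oracle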

Apache Drill JDBC Plugin Doesn't Recognize Columns

I'm attempting to query a proprietary RDBMS using Apache Drill. I've created the plugin as a JDBC data source and put my JDBC jar in the jars/3rdparty directory, and I'm able to successfully run a query such as SELECT * FROM mytable.
However, if I use a column name in the query such as SELECT mycol FROM mytable, Drill returns the following error: Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 9: Column 'mycol' not found in any table. Moreover, I've noticed that my schema is entirely missing if I run SELECT * FROM INFORMATION_SCHEMA.SCHEMATA, so I have a hunch that Drill is unable to retrieve my database schema from the JDBC driver.
I'm wondering what method of the JDBC driver may be implemented incorrectly that's causing this problem. The JDBC driver has been used with other 3rd party software such as Spark with no issue.
In order to perform a query on your table, you need to prefix the name of your table with the name you gave your storage plugin. For example, if you named your storage plugin rdbms, your query should look like this:
SELECT * FROM rdbms.mytable
Your additional query SELECT * FROM INFORMATION_SCHEMA.SCHEMATA likely failed for the same reason. Try SELECT * FROM rdbms.INFORMATION_SCHEMA.SCHEMATA. And don't forget to replace rdbms with the name you gave your storage plugin.
I think you can also query Drill in the form select * from dfs.<storagePlugin>.tableName. Can you check?

SQL does not run correctly in Redshift

I want to run SQL like:
CREATE TEMPORARY TABLE tmptable AS SELECT * FROM redshift_table WHERE date > #{date};
I can run this SQL from the command line in Redshift, but when I run it from my program it does not work correctly. When I change CREATE TEMPORARY TABLE to CREATE TABLE, it works correctly.
I am using MyBatis as the O/R mapper, and the driver is:
org.postgresql.Driver
org.postgresql:postgresql:9.3-1102-jdbc41
What's wrong?
I am assuming #{date} is bound to an actual date in your real query.
Having said that, there is no reason this command should not work; it follows the syntax listed here:
http://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_AS.html
Have you tried posting it on the AWS Redshift forums? They are generally quite responsive. Please update this thread too if you find something; this is quite an interesting issue. Thanks!

Does HSQLDB support table aliases in Oracle-compatible mode?

We're using hsqldb-2.2.9 in DAO tests. HSQLDB was made compatible with Oracle (which we use in production) by setting SET DATABASE SQL SYNTAX ORA TRUE;, and we use iBATIS SQL maps.
The tests fail when the SQL contains a table alias, something like select a.name, b.code form t_a a, t_b b where a.id = b.a_id, which reports unexpected token a. We tried adding 'as' between the table and the alias; that doesn't work either. Am I missing something?
Yes, HSQLDB supports table aliases.
If you use the exact query you reported, you would get:
unexpected token: T_A
If you correct the query as commented by a_horse_with_no_name it should work. If one of the tables does not exist, you would get:
user lacks privilege or object not found: T_A
BTW, try using the latest 2.3.0 snapshot jar for better Oracle compatibility. You can find it on the support page of the website.
Uh... I think I've found the issue, and it was my own. It suddenly occurred to me that I used 'do' (the table name is t_delivery_order) as the table alias, which happens to be a reserved keyword in HSQLDB (and in SQL generally). Replacing 'do' with 'd' fixed it. Thank you all.
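For illustration (the real query is not shown above, and the column name is made up), the failing and fixed forms would look like:
-- fails: DO is a reserved word, so it cannot be used as an alias
SELECT do.id FROM t_delivery_order do;
-- works once the alias is renamed
SELECT d.id FROM t_delivery_order d;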
