I am able to query Hive and HBase individually using Drill. Now I am trying to query HBaseStorageHandler-type tables in Hive. For this, I added these properties to the Hive storage plugin in Drill:
{
"type": "hive",
"enabled": true,
"configProps": {
"hive.metastore.uris": "thrift://trinitybdClusterM02.trinitymobility.local:9083",
"javax.jdo.option.ConnectionURL": "jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true",
"hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
"fs.default.name": "hdfs://trinitybdClusterM02.trinitymobility.local:9000",
"hive.metastore.sasl.enabled": "false",
"hbase.zookeeper.quorum": "localhost",
"hbase.zookeeper.property.clientPort": "2181"
}
}
I tried a query like this:
0: jdbc:drill:zk=localhost> use hive.test;
0: jdbc:drill:zk=localhost> select * from twitter_test_nlp limit 1;
It gives this error:
Error: SYSTEM ERROR: NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setAttribute(Ljava/lang/String;[B)V
Fragment 0:0
[Error Id: fc3994f4-7d7e-475e-870b-259ac91ea81a on trinitybdClusterM02.trinitymobility.local:31010] (state=,code=0)
If anybody is using this type of table, please share which properties I have to add to query HBaseStorageHandler tables of Hive.
This problem is resolved in Drill 1.9. Drill 1.9 directly supports HBaseStorageHandler tables (Hive and HBase integrated tables) through the Hive storage plugin. It also directly supports spatial queries such as st_contains(). So if anybody has these requirements, use Drill 1.9.0.
I'm currently trying to set up a JDBC connector with the goal of reading data from an Oracle DB and pushing it to a Kafka topic using Kafka Connect. I wanted to use the "timestamp" mode:
timestamp: use a timestamp (or timestamp-like) column to detect new and modified rows. This assumes the column is updated with each write, and that values are monotonically incrementing, but not necessarily unique.
https://docs.confluent.io/kafka-connectors/jdbc/current/source-connector/source_config_options.html#mode
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"dialect.name" : "OracleDatabaseDialect",
"connection.url": "XXX",
"connection.user" : "XXX",
"connection.password" : "XXX",
"mode" : "timestamp",
"quote.sql.identifiers": "never",
"timestamp.column.name" : "LAST_UPDATE_DATE",
"query" : "select a1.ID, a1.LAST_UPDATE_DATE, b1.CODE
from a1
left join b1 on a1.ID = b1.ID
...
}
My problem is that the timestamp column in the database is defined as NUMBER(15), e.g. 20221220145930000. The connector appends a WHERE clause to my query, like this:
where a1.LAST_UPDATE_DATE > :1 and a1.LAST_UPDATE_DATE < :2 order by a1.LAST_UPDATE_DATE asc
This leads to an error message: ORA-00932: inconsistent datatypes: expected NUMBER got TIMESTAMP
Unfortunately, the database is not under my control (proprietary software). I have only read permissions.
Is there a way to set the timestamp type in this connector? I already tried using the to_timestamp() function directly in the SQL statement and an SMT (TimestampConverter), without success.
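To make the type mismatch concrete: the sample value 20221220145930000 looks like a timestamp packed into digits as yyyyMMddHHmmssSSS. That layout is an assumption inferred from the single example above, and the function name below is hypothetical. A minimal Python sketch of what such a number would decode to, under that assumption:

```python
from datetime import datetime


def decode_number_timestamp(value: int) -> datetime:
    """Decode a 17-digit number assumed to be laid out as yyyyMMddHHmmssSSS."""
    text = f"{value:017d}"
    # The first 14 digits carry the date and time down to seconds.
    base = datetime.strptime(text[:14], "%Y%m%d%H%M%S")
    # The last 3 digits are assumed to be milliseconds.
    millis = int(text[14:])
    return base.replace(microsecond=millis * 1000)


print(decode_number_timestamp(20221220145930000))
```

The connector compares the column against bound TIMESTAMP values, while the column itself holds a plain number like the one decoded here, which is consistent with the ORA-00932 message.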
I have a snappy.parquet file with a schema like this:
{
"type": "struct",
"fields": [{
"name": "MyTinyInt",
"type": "byte",
"nullable": true,
"metadata": {}
}
...
]
}
Update: parquet-tools reveals this:
############ Column(MyTinyInt) ############
name: MyTinyInt
path: MyTinyInt
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
When I try and run a stored procedure in Azure Data Studio to load this into an external staging table with PolyBase I get the error:
11:16:21Started executing query at Line 113
Msg 106000, Level 16, State 1, Line 1
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Integer cannot be cast to class parquet.io.api.Binary (java.lang.Integer is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')
The load into the external table works fine with only varchars
CREATE EXTERNAL TABLE [domain].[TempTable]
(
...
MyTinyInt tinyint NULL,
...
)
WITH
(
LOCATION = ''' + @Location + ''',
DATA_SOURCE = datalake,
FILE_FORMAT = parquet_snappy
)
The data will eventually be merged into a Data Warehouse Synapse table. In that table the column will have to be of type tinyint.
I have the same issue and a good Azure support plan, so I got an answer from Microsoft:
there is a known bug in ADF for this particular scenario: The date
type in parquet should be mapped as data type date in Sql sever
however, ADF incorrectly converts this type to Datetime2 which causes
a conflict in PolyBase. I have confirmation for the core engineering
team that this will be rectified with a fix by the end of November and
will be published directly into the ADF product.
In the meantime, as a workaround:
Create the target table with data type DATE as opposed to DATETIME2
Configure the Copy Activity Sink settings to use Copy Command as opposed to PolyBase
But even the Copy command doesn't work for me, so the only remaining workaround is bulk insert, which is extremely slow and would be a problem on big datasets.
I have JSON files saved in an S3 bucket. I am trying to load them as a DataFrame in SparkR and I am getting error logs. The following is my code. Where am I going wrong?
devtools::install_github('apache/spark@v2.2.0',subdir='R/pkg',force=TRUE)
library(SparkR)
sc=sparkR.session(master='local')
Sys.setenv("AWS_ACCESS_KEY_ID"="xxxx",
"AWS_SECRET_ACCESS_KEY"= "yyyy",
"AWS_DEFAULT_REGION"="us-west-2")
movie_reviews <-SparkR::read.df(path="s3a://bucketname/reviews_Movies_and_TV_5.json",sep = "",source="json")
I have tried all combinations of s3a, s3n, and s3, and none seems to work.
I get the following error log in my SparkR console:
17/12/09 06:56:06 WARN FileStreamSink: Error while looking for metadata directory.
17/12/09 06:56:06 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
java.lang.reflect.InvocationTargetException
For me this works:
read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")
What @Ankit said should work, but if you are trying to get something that looks more like a data frame, you need to use a select statement, i.e.
rdd<- read.df("s3://bucket/file.json", "json", header = "true", inferSchema = "true", na.strings = "NA")
Then do a printSchema(rdd) to see the structure of the data.
If you see root followed by your fields with no further indentation, you can probably go ahead and select using the names of the "columns" you want. If you see branching down your schema tree, you may have to put a headers.blah or a payload.blah in your select statement. Like this:
sdf<- SparkR::select(rdd, "headers.something", "headers.somethingElse", "payload.somethingInPayload", "payload.somethingElse")
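The dotted names in that select correspond to paths through the nested JSON records. As a rough illustration outside Spark, here is a small Python sketch that lists the leaf paths of one record and looks one up. The sample record and both helper names are made up for this example and are not part of SparkR:

```python
import json


def leaf_paths(record: dict, prefix: str = "") -> list:
    """Return the dotted path of every non-dict value in a nested record."""
    paths = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: descend and extend the dotted prefix.
            paths.extend(leaf_paths(value, path + "."))
        else:
            paths.append(path)
    return paths


def resolve(record: dict, dotted: str):
    """Follow a dotted path such as 'headers.something' into the record."""
    value = record
    for key in dotted.split("."):
        value = value[key]
    return value


# Hypothetical record shaped like the headers/payload example above.
sample = json.loads(
    '{"headers": {"something": 1}, "payload": {"somethingInPayload": "x"}}'
)
print(leaf_paths(sample))
print(resolve(sample, "headers.something"))
```

A flat file would yield paths with no dots, which matches the "root followed by no indentation" case in printSchema.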
I am using spark-sql to connect to an Oracle database and retrieve data as DataFrames. I would like to write the retrieved data to an Avro file. While writing to Avro I am seeing multiple issues. Could you help?
Here is the code -
val df = sqlContext.read.format("jdbc")
  .options(Map(
    "driver" -> "oracle.jdbc.driver.OracleDriver",
    "url" -> "jdbc:oracle:thin:user/password@host/service",
    "numPartitions" -> "1",
    "dbtable" -> "(Select * from schema.table WHERE STAGE_NUM <=39 and guid='I284ba1f9cdba11dea82ab9f4ee295c21')"))
  .load()
df.write.format("com.databricks.spark.avro").save("Outputfile")
Dependencies in my project:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.5.1</version>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>2.0.1</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.7.7</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.7</version>
</dependency>
Here is the exception information -
java.lang.RuntimeException: com.databricks.spark.avro.DefaultSource does not allow create table as select
If I use - df.write.avro("headnotes"), I get the following exception.
java.lang.IllegalAccessError: tried to access class org.apache.avro.SchemaBuilder$FieldDefault from class com.databricks.spark.avro.SchemaConverters$$anonfun$convertStructToAvro$1
Using Apache Drill v1.2 and Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit in embedded mode.
I'm curious if anyone has had any success connecting Apache Drill to an Oracle DB. I've updated the drill-override.conf with the following configurations (per documents):
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "localhost:2181",
drill.exec.sys.store.provider.local.path = "/mypath"
}
and placed the ojdbc6.jar in \apache-drill-1.2.0\jars\3rdparty. I can successfully create the storage plug-in:
{
"type": "jdbc",
"driver": "oracle.jdbc.driver.OracleDriver",
"url": "jdbc:oracle:thin:@<IP>:<PORT>:<SID>",
"username": "USERNAME",
"password": "PASSWORD",
"enabled": true
}
but when I issue a query such as:
select * from <storage_name>.<schema_name>.`dual`;
I get the following error:
Query Failed: An Error Occurred
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: From line 1, column 15 to line 1, column 20: Table '<storage_name>.<schema_name>.dual' not found [Error Id: 57a4153c-6378-4026-b90c-9bb727e131ae on <computer_name>:<PORT>].
I've tried to query other schemas and tables and get a similar result. I've also tried connecting to Teradata and get the same error. Does anyone have suggestions, or has anyone run into similar issues?
It's working with Drill 1.3 (released on 23-Dec-2015)
Plugin: name - oracle
{
"type": "jdbc",
"driver": "oracle.jdbc.driver.OracleDriver",
"url": "jdbc:oracle:thin:user/password@192.xxx.xxx.xxx:1521:orcl",
"enabled": true
}
Query:
select * from <plugin-name>.<user-name>.<table-name>;
Example:
select * from oracle.USER.SAMPLE;
Check drill's documentation for more details.
Note: Make sure you have added ojdbc7.12.1.0.2.jar (recommended in the docs) to apache-drill-1.3.0/jars/3rdparty.
It kind of works in Apache drill 1.3.
The strange thing is that I can only query the tables for which there are synonyms created...
In the command line try:
use <storage_name>;
show tables;
This will give you a list of objects that you can query - dual is not on that list ;-).
I'm using apache-drill-1.9.0, and it seems that the schema name is interpreted case-sensitively and must therefore be in upper case.
For a table user1.my_tab (which Oracle creates in upper case by default),
this works in Drill (the plugin name is oracle):
SELECT * FROM oracle.USER1.my_tab;
But this triggers an error
SELECT * FROM oracle.user1.my_tab;
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'oracle.user1.my_tab' not found
An alternative approach is to set the plugin name and the schema name with use (the owner must be upper case here as well):
0: jdbc:drill:zk=local> use oracle.USER1;
+-------+-------------------------------------------+
| ok | summary |
+-------+-------------------------------------------+
| true | Default schema changed to [oracle.USER1] |
+-------+-------------------------------------------+
1 row selected (0,169 seconds)
0: jdbc:drill:zk=local> select * from my_tab;
+------+
| X |
+------+
| 1.0 |
| 1.0 |
+------+
2 rows selected (0,151 seconds)
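If queries are generated from client code, the upper-casing rule observed above can be applied in one place. This is only a sketch of string building based on the behaviour reported for Drill 1.9.0. The helper name is hypothetical and it is not part of any Drill API:

```python
def qualify_oracle_table(plugin: str, owner: str, table: str) -> str:
    """Build a plugin.OWNER.table reference with the owner upper-cased."""
    # Only the schema (owner) part is forced to upper case, matching the
    # observation above; the plugin and table names are passed through.
    return f"{plugin}.{owner.upper()}.{table}"


query = f"SELECT * FROM {qualify_oracle_table('oracle', 'user1', 'my_tab')}"
print(query)  # SELECT * FROM oracle.USER1.my_tab
```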