Databricks External Table - azure-databricks

I have data stored on ADLS Gen2 and have two workspaces.
Primary ETL Workspace (Workspace A): prepares data from sources and stores it on ADLS (mounted to Databricks with a service principal that has the Storage Blob Data Contributor role).
Sandbox Workspace (Workspace B): consumes data from ADLS, read only (mounted to Databricks with a service principal that has the Storage Blob Data Reader role). Workspace B should only ever be able to READ from ADLS.
ADLS/
|-- curated/
|   |-- google/
|   |   |-- table1
|   |   |-- table2
|   |-- data.com/
|   |   |-- table1
|   |   |-- table2
Writing to the above location causes no issues, since we use Workspace A.
The issue appears when layering an external database/tables on top of it within Workspace B.
Steps:
The following works:
create database if not exists google_db comment 'Database for Google' location 'dbfs:/mnt/google'
The following fails:
create external table google_db.test USING DELTA location 'dbfs:/mnt/google/table1'
Message:
Error in SQL statement: AbfsRestOperationException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, PUT, https://XASASASAS.dfs.core.windows.net/google/table1/_delta_log?resource=directory&timeout=90, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:76142564-801f-000f-29cf-e6a826000000 Time:2021-12-01T16:21:33.3117578Z"
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403, PUT, https://XASASASAS.dfs.core.windows.net/google/table1/_delta_log?resource=directory&timeout=90, AuthorizationPermissionMismatch, "This request is not authorized to perform this operation using this permission. RequestId:76142564-801f-000f-29cf-e6a826000000 Time:2021-12-01T16:21:33.3117578Z"
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:237)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.createPath(AbfsClient.java:311)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createDirectory(AzureBlobFileSystemStore.java:632)
at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.mkdirs(AzureBlobFileSystem.java:481)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$mkdirs$3(DatabricksFileSystemV2.scala:748)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.s3a.S3AExceptionUtils$.convertAWSExceptionToJavaIOException(DatabricksStreamUtils.scala:66)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$mkdirs$2(DatabricksFileSystemV2.scala:746)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$withUserContextRecorded$2(DatabricksFileSystemV2.scala:941)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:455)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:279)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:271)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:455)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withUserContextRecorded(DatabricksFileSystemV2.scala:914)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.$anonfun$mkdirs$1(DatabricksFileSystemV2.scala:745)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:434)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:455)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:279)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:271)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:455)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:415)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:341)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperation(DatabricksFileSystemV2.scala:455)
at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.mkdirs(DatabricksFileSystemV2.scala:745)
at com.databricks.backend.daemon.data.client.DatabricksFileSystem.mkdirs(DatabricksFileSystem.scala:198)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1881)
at com.databricks.sql.transaction.tahoe.commands.WriteIntoDelta.write(WriteIntoDelta.scala:110)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.$anonfun$run$2(CreateDeltaTableCommand.scala:132)
at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:80)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.$anonfun$recordDeltaOperation$5(DeltaLogging.scala:115)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$4(UsageLogging.scala:434)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.spark.util.PublicDBLogging.withAttributionContext(DatabricksSparkUsageLogger.scala:19)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:279)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:271)
at com.databricks.spark.util.PublicDBLogging.withAttributionTags(DatabricksSparkUsageLogger.scala:19)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:415)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:341)
at com.databricks.spark.util.PublicDBLogging.recordOperation(DatabricksSparkUsageLogger.scala:19)
at com.databricks.spark.util.PublicDBLogging.recordOperation0(DatabricksSparkUsageLogger.scala:56)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:129)
at com.databricks.spark.util.UsageLogger.recordOperation(UsageLogger.scala:70)
at com.databricks.spark.util.UsageLogger.recordOperation$(UsageLogger.scala:57)
at com.databricks.spark.util.DatabricksSparkUsageLogger.recordOperation(DatabricksSparkUsageLogger.scala:85)
at com.databricks.spark.util.UsageLogging.recordOperation(UsageLogger.scala:402)
at com.databricks.spark.util.UsageLogging.recordOperation$(UsageLogger.scala:381)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordOperation(CreateDeltaTableCommand.scala:52)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:113)
at com.databricks.sql.transaction.tahoe.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:97)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.recordDeltaOperation(CreateDeltaTableCommand.scala:52)
at com.databricks.sql.transaction.tahoe.commands.CreateDeltaTableCommand.run(CreateDeltaTableCommand.scala:108)
at com.databricks.sql.transaction.tahoe.catalog.DeltaCatalog.com$databricks$sql$transaction$tahoe$catalog$DeltaCatalog$$createDeltaTable(DeltaCatalog.scala:162)
at com.databricks.sql.transaction.tahoe.catalog.DeltaCatalog$StagedDeltaTableV2.commitStagedChanges(DeltaCatalog.scala:413)
at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.$anonfun$writeToTable$1(WriteToDataSourceV2Exec.scala:515)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1677)
at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.writeToTable(WriteToDataSourceV2Exec.scala:500)
at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.writeToTable$(WriteToDataSourceV2Exec.scala:495)
at org.apache.spark.sql.execution.datasources.v2.AtomicCreateTableAsSelectExec.writeToTable(WriteToDataSourceV2Exec.scala:104)
at org.apache.spark.sql.execution.datasources.v2.AtomicCreateTableAsSelectExec.run(WriteToDataSourceV2Exec.scala:126)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:41)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:41)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:47)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:233)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3802)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:126)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:267)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:104)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:217)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3800)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:233)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:103)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:687)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:852)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:694)
at com.databricks.backend.daemon.driver.SQLDriverLocal.$anonfun$executeSql$1(SQLDriverLocal.scala:91)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:37)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:144)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:544)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:53)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:279)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:271)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:53)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:521)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
at java.lang.Thread.run(Thread.java:748)
at com.databricks.backend.daemon.driver.SQLDriverLocal.executeSql(SQLDriverLocal.scala:129)
at com.databricks.backend.daemon.driver.SQLDriverLocal.repl(SQLDriverLocal.scala:144)
at com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$13(DriverLocal.scala:544)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:240)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:235)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:232)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:53)
at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:279)
at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:271)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:53)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:521)
at com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:689)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:681)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:522)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:634)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:427)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:370)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:221)
at java.lang.Thread.run(Thread.java:748)
However, if the database/tables are created the following way, it goes through.
Create the database without a location:
create database if not exists google_db comment 'Database for Google'
This works:
create table google_db.test USING DELTA location 'dbfs:/mnt/google/table1'
I'm not sure why Delta/Databricks tries to write to the location when the database is defined as external with an explicit location.
The issue with the above workaround is that users can still create additional tables under "google_db", since it is a managed (internal) database. User requirements state that the data should be immutable (it is, in the sense that users cannot update or write to existing tables, but they are still able to add new tables to the database).
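To make the concern concrete, this is the kind of statement a Workspace B user can still run against the managed database, since nothing in the metastore marks it read-only (the table name is hypothetical, purely for illustration):
-- Illustrative only: a sandbox user can still register new tables inside google_db,
-- even though the existing ADLS data itself cannot be modified with the Reader SP.
create table google_db.some_new_table (id int, name string)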
Any help is appreciated.

Related

Failed while loading data from data set SQLQueryDataSet

I am receiving this error:
DataSetError: Failed while loading data from data set SQLQueryDataSet(load_args={}, sql=select * from table)
when I run (within kedro jupyter notebook):
%reload_kedro
c:\users\name.virtualenvs\pipenv_kedro\lib\site-packages\ipykernel\ipkernel.py:283:DeprecationWarning: should_run_async will not call transform_cell automatically in the future. Please pass the result to transformed_cell argument and any exception that happen during the transform in preprocessing_exc_tuple in IPython 7.17 and above.
and should_run_async(code)
2021-04-21 15:29:12,278 - kedro.framework.session.store - INFO - read() not implemented for BaseSessionStore. Assuming empty store.
2021-04-21 15:29:12,696 - root - INFO - ** Kedro project Project
2021-04-21 15:29:12,698 - root - INFO - Defined global variable context, session and catalog
2021-04-21 15:29:12,703 - root - INFO - Registered line magic run_viz
Then this:
catalog.list()
#['table', 'parameters']
catalog.load('table')
where my catalog.yml file contains:
table:
  type: pandas.SQLQueryDataSet
  credentials: secret
  sql: select * from table
  layer: raw
However, I am able to pull back the expected result when I run this (within the same kedro jupyter notebook):
from kedro.extras.datasets.pandas import SQLQueryDataSet

sql = "select * from table"
credentials = {
    "con": secret
}
data_set = SQLQueryDataSet(sql=sql, credentials=credentials)
sql_data = data_set.load()
How can I fix this error?
The discrepancy, I believe, comes from the credentials. In your catalog you had:
table:
  type: pandas.SQLQueryDataSet
  credentials: secret
but in the notebook you were testing with:
credentials = {
    "con": secret
}
The value mapped in the YAML file should match the name of an entry in credentials.yml, so something like:
# in catalog.yml
table:
  type: pandas.SQLQueryDataSet
  credentials: db_creds

# in credentials.yml
db_creds:
  con: secret

Error using Polybase to load Parquet file: class java.lang.Integer cannot be cast to class parquet.io.api.Binary

I have a snappy.parquet file with a schema like this:
{
    "type": "struct",
    "fields": [{
        "name": "MyTinyInt",
        "type": "byte",
        "nullable": true,
        "metadata": {}
    }
    ...
    ]
}
Update: parquet-tools reveals this:
############ Column(MyTinyInt) ############
name: MyTinyInt
path: MyTinyInt
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
When I try to run a stored procedure in Azure Data Studio to load this into an external staging table with PolyBase, I get the error:
11:16:21 Started executing query at Line 113
Msg 106000, Level 16, State 1, Line 1
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Integer cannot be cast to class parquet.io.api.Binary (java.lang.Integer is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')
The load into the external table works fine when the columns are only varchars:
CREATE EXTERNAL TABLE [domain].[TempTable]
(
    ...
    MyTinyInt tinyint NULL,
    ...
)
WITH
(
    LOCATION = ''' + @Location + ''',
    DATA_SOURCE = datalake,
    FILE_FORMAT = parquet_snappy
)
The data will eventually be merged into a Data Warehouse Synapse table. In that table the column will have to be of type tinyint.
I had the same issue and a good support plan in Azure, so I got an answer from Microsoft:
There is a known bug in ADF for this particular scenario: the date type in parquet should be mapped as data type date in SQL Server; however, ADF incorrectly converts this type to datetime2, which causes a conflict in PolyBase. I have confirmation from the core engineering team that this will be rectified with a fix by the end of November and will be published directly into the ADF product.
In the meantime, as a workaround:
Create the target table with data type DATE as opposed to DATETIME2
Configure the Copy Activity Sink settings to use Copy Command as opposed to PolyBase
But even the Copy command didn't work for me, so the only remaining workaround was to use Bulk insert; Bulk insert is extremely slow, though, and would be a problem on big datasets.
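For reference, a minimal sketch of the first workaround might look like the following; the table and column names here are hypothetical and not taken from the original schema:
-- Hypothetical target table: declaring the column as DATE rather than DATETIME2
-- sidesteps the ADF type-mapping conflict described above.
CREATE TABLE dbo.StagingTarget
(
    MyTinyInt  tinyint NULL,
    LoadedOn   date    NULL  -- DATE, not DATETIME2
);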

confluent - kafka-connect - JDBC source connector - ORA-00933: SQL command not properly ended

I have the following SQL query in my Kafka JDBC source connector properties file:
query=SELECT * FROM JENNY.WORKFLOW where ID = '565231'
If I run the same query in SQL Developer, it works fine and fetches the results. But if I use the same query in "jdbc_workflow_connect.properties", I get the following error:
(io.confluent.connect.jdbc.source.JdbcSourceTaskConfig:223)
[2018-09-19 12:32:15,130] INFO WorkerSourceTask{id=Workflow-DB-source-0}
Source task finished initialization and start
(org.apache.kafka.connect.runtime.WorkerSourceTask:158)
[2018-09-19 12:32:15,328] ERROR Failed to run query for table
TimestampIncrementingTableQuerier{name='null', query='SELECT * FROM
JENNY.WORKFLOW where ID = '565231'', topicPrefix='workflow_data1',
timestampColumn='null', incrementingColumn='ID'}: {}
(io.confluent.connect.jdbc.source.JdbcSourceTask:247)
java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1017)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:655)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:249)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:566)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:215)
at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:58)
at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:776)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:897)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1034)
at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3820)
at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3867)
at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1502)
at io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier.executeQuery(TimestampIncrementingTableQuerier.java:201)
at io.confluent.connect.jdbc.source.TableQuerier.maybeStartQuery(TableQuerier.java:84)
at io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier.maybeStartQuery(TimestampIncrementingTableQuerier.java:55)
at io.confluent.connect.jdbc.source.JdbcSourceTask.poll(JdbcSourceTask.java:225)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:179)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:170)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:214)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Here is my JDBC source connector properties file content:
name=Workflow-DB-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.password = ******
connection.url = jdbc:oracle:thin:@1.1.1.1:****/****
connection.user = *****
table.types=TABLE
query=SELECT * FROM JENNY.WORKFLOW where ID = '565231'
mode=incrementing
incrementing.column.name=ID
topic.prefix=workflow_data1
timestamp.delay.interval.ms=60000
transforms:createKey
transforms.createKey.type:org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields:ID
I'm using ojdbc7.jar
Observation:
If I remove the WHERE clause, the query works fine (like below):
SELECT * FROM JENNY.WORKFLOW
Please let me know if I'm doing something wrong, or whether any modifications are required in the JDBC source connector settings.
Thanks in advance.
From the documentation of the JDBC connector configuration options you may read:
If specified, the query to perform to select new or updated rows. Use this setting if you want to join tables, select subsets of columns in a table, or filter data. If used, this connector will only copy data using this query – whole-table copying will be disabled. Different query modes may still be used for incremental updates, but in order to properly construct the incremental query, it must be possible to append a WHERE clause to this query (i.e. no WHERE clauses may be used).
So if you really want to consider only the part of the table with a given ID, you must wrap the query as follows:
select * from (SELECT * FROM JENNY.WORKFLOW where ID = '565231')
But please be sure you have checked the documentation of the configuration options and understand the role of the query parameter.
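To illustrate why the wrapping matters: in incrementing mode the connector appends its own WHERE/ORDER BY clause to whatever you put in query. The statements below are an approximation of the prepared statements it sends (the ? is a bind parameter), not the connector's exact generated text:
-- With the original query, the appended clause produces a second WHERE, which Oracle rejects with ORA-00933:
SELECT * FROM JENNY.WORKFLOW WHERE ID = '565231' WHERE ID > ? ORDER BY ID ASC

-- With the wrapped query, the appended clause attaches to the outer SELECT and parses cleanly:
SELECT * FROM (SELECT * FROM JENNY.WORKFLOW WHERE ID = '565231') WHERE ID > ? ORDER BY ID ASC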

Unable to load data into parquet file format?

I am trying to parse log data into Parquet file format in Hive; the separator used is "||-||".
The sample row is
"b8905bfc-dc34-463e-a6ac-879e50c2e630||-||syntrans1||-||CitBook"
After performing the data staging I am able to get the result
"b8905bfc-dc34-463e-a6ac-879e50c2e630 syntrans1 CitBook ".
While converting the data to Parquet file format, I got this error:
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2185)
at org.apache.hadoop.hive.ql.plan.PartitionDesc.getDeserializer(PartitionDesc.java:137)
at org.apache.hadoop.hive.ql.exec.MapOperator.getConvertedOI(MapOperator.java:297)
... 24 more
This is what I have tried:
create table log (a String, b String, c String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
  "field.delim"="||-||",
  "collection.delim"="-",
  "mapkey.delim"="#"
);

create table log_par (
  a String,
  b String,
  c String
) stored as PARQUET;

insert into log_par select * from log;
Aman kumar,
To resolve this issue, run the hive query after adding the following jar:
hive> add jar hive-contrib.jar;
To add the jar permanently, do the following:
1. On the Hive Server host, create a /usr/hdp/<version>/hive/auxlib directory.
2. Copy /usr/hdp/<version>/hive/lib/hive-contrib-<version>.jar to /usr/hdp/<version>/hive/auxlib.
3. Restart the HS2 server.
For further reference, please check:
https://community.hortonworks.com/content/supportkb/150175/errororgapachehadoophivecontribserde2multidelimits.html
https://community.hortonworks.com/questions/79075/loading-data-to-hive-via-pig-orgapachehadoophiveco.html
Let me know if you face any issues.
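Putting it together, a minimal Hive session might look like the following once the jar is reachable; the jar path is illustrative, so point it at the actual hive-contrib jar on your cluster:
-- Register the SerDe jar for this session, then re-run the insert from the question.
ADD JAR hive-contrib.jar;
INSERT INTO log_par SELECT * FROM log;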

SPARK SQL (1.5.1) connect to Oracle and write to Avro

I am using spark-sql to connect to an Oracle database and getting the data back as DataFrames. I would like to write this retrieved data to an Avro file. While writing to Avro I am seeing multiple issues; could you help?
Here is the code:
val df = sqlContext.read.format("jdbc")
  .options(Map(
    "driver" -> "oracle.jdbc.driver.OracleDriver",
    "url" -> "jdbc:oracle:thin:user/password@host/service",
    "numPartitions" -> "1",
    "dbtable" -> "(Select * from schema.table WHERE STAGE_NUM <= 39 and guid = 'I284ba1f9cdba11dea82ab9f4ee295c21')"))
  .load()

df.write.format("com.databricks.spark.avro").save("Outputfile")
Dependencies that are there in my project -
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.5.1</version>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.10</artifactId>
  <version>2.0.1</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.7.7</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.7</version>
</dependency>
Here is the exception information -
java.lang.RuntimeException: com.databricks.spark.avro.DefaultSource does not allow create table as select
If I use df.write.avro("headnotes") instead, I get the following exception:
java.lang.IllegalAccessError: tried to access class org.apache.avro.SchemaBuilder$FieldDefault from class com.databricks.spark.avro.SchemaConverters$$anonfun$convertStructToAvro$1
