Opening an Access file fails during cached table creation on the generated `AND` check constraints - ucanaccess

With UCanAccess 5.0.1, either via Python or the console.sh launcher, I consistently get the following error:
net.ucanaccess.jdbc.UcanaccessSQLException: net.ucanaccess.jdbc.UcanaccessSQLException: UCAExc:::5.0.1 unexpected token: AND required: VALUE
Here is an example table from the Access file, along with example data that reproduces the issue:
CREATE TABLE `Channel_Normal_Table` (
`Test_ID` INTEGER,
`Data_Point` INTEGER,
`Test_Time` DOUBLE,
`Step_Time` DOUBLE,
`DateTime` DOUBLE,
`Step_Index` INTEGER,
`Cycle_Index` INTEGER,
`Is_FC_Data` INTEGER,
`Current` DOUBLE,
`Voltage` DOUBLE,
`Charge_Capacity` DOUBLE,
`Discharge_Capacity` DOUBLE,
`Charge_Energy` DOUBLE,
`Discharge_Energy` DOUBLE,
`dV/dt` DOUBLE,
`Internal_Resistance` DOUBLE,
`AC_Impedance` DOUBLE,
`ACI_Phase_Angle` DOUBLE,
PRIMARY KEY ( `Test_ID`, `Data_Point` )
) CHARACTER SET 'UTF8';
INSERT INTO `Channel_Normal_Table`(`Test_ID`,`Data_Point`,`Test_Time`,`Step_Time`,`DateTime`,`Step_Index`,`Cycle_Index`,`Is_FC_Data`,`Current`,`Voltage`,`Charge_Capacity`,`Discharge_Capacity`,`Charge_Energy`,`Discharge_Energy`,`dV/dt`,`Internal_Resistance`,`AC_Impedance`,`ACI_Phase_Angle`)
VALUES(3,1,300.0317741715223,299.541194887277,42439.40247685185,1,1,0,0,3.137637,0,0,0,0,0.0001462596,0,0,0),
(3,2,600.0763955656979,599.5858166480613,42439.40594907408,1,1,0,0,3.138294,0,0,0,0,0.0001493698,0,0,0),
(3,3,900.081434966953,899.5908556827076,42439.4094212963,1,1,0,0,3.138459,0,0,0,0,-6.268177e-05,0,0,0),
(3,4,1200.121875557119,1199.631295906265,42439.41289351852,1,1,0,0,3.139116,0,0,0,0,0,0,0,0),
(3,5,1500.146912722062,1499.656333804425,42439.41636574074,1,1,0,0,3.139609,0,0,0,0,0,0,0,0),
(3,6,1800.155976385471,1799.665396734618,42439.41983796296,1,1,0,0,3.139773,0,0,0,0,-7.467654e-05,0,0,0),
(3,7,2100.196447770757,2099.70586885312,42439.42331018519,1,1,0,0,3.140431,0,0,0,0,0,0,0,0),
(3,8,2400.236885428055,2399.746306510418,42439.4267824074,1,1,0,0,3.140595,0,0,0,0,-0.0004344818,0,0,0),
(3,9,2700.24609793454,2699.755518650294,42439.43025462963,1,1,0,0,3.141088,0,0,0,0,0.0002632971,0,0,0),
(3,10,2700.511304032706,2700.020725115069,42439.4302662037,1,1,0,0,3.141088,0,0,0,0,0.0002632971,0,0,0),
(3,11,2700.796365332008,0.265081817865202,42439.43028935185,2,1,0,-1.549392,3.005509,5.483e-12,5.679546262e-05,1.722e-11,0.000174544749343,-0.5108455,0,0,0),
(3,12,2701.685516834828,1.154233687293897,42439.43030092592,2,1,0,-1.549046,2.98924,5.483e-12,0.000439436350364,1.722e-11,0.001321790743707,-0.02045225,0,0,0),
(4,1,300.0226404646186,299.8219818492102,42439.40248842593,1,1,0,0,3.122866,0,0,0,0,0.0001646538,0,0,0),
(4,2,600.0631496105857,599.8624906285687,42439.40596064815,1,1,0,0,3.123359,0,0,0,0,7.479329e-05,0,0,0),
(4,3,900.0724156419196,899.8717566599028,42439.40943287036,1,1,0,0,3.123688,0,0,0,0,0,0,0,0),
(4,4,1200.112816271753,1199.912157656345,42439.41290509259,1,1,0,0,3.124346,0,0,0,0,6.983971e-05,0,0,0),
(4,5,1500.137653268423,1499.936994653014,42439.41637731482,1,1,0,0,3.12451,0,0,0,0,-5.958042e-05,0,0,0),
(4,6,1800.146877506381,1799.946218890972,42439.41984953704,1,1,0,0,3.125167,0,0,0,0,0,0,0,0),
(4,7,2100.187351091318,2099.986692109301,42439.42332175926,1,1,0,0,3.125332,0,0,0,0,0,0,0,0),
(4,8,2400.227830175383,2400.027171193366,42439.42679398148,1,1,0,0,3.125989,0,0,0,0,0.0005079957,0,0,0),
(4,9,2700.205724412479,2700.005065430463,42439.4302662037,1,1,0,0,3.126318,0,0,0,0,-0.000263683,0,0,0),
(4,10,2700.237142905431,0.031064209656103,42439.43027777778,2,1,0,-0.003326768,3.018975,0,2.8706507e-08,0,8.6664233e-08,0,0,0,0),
(4,11,2700.471202673624,0.26512434445807,42439.43028935185,2,1,0,-1.548955,2.994646,0,5.049082513e-05,0,0.000151816700694,-0.1039431,0,0,0),
(4,12,2820.572026468664,120.0068878416104,42439.43168981482,4,2,0,0,3.123359,0,0,0,0,-0.0005849777,0,0,0);
When attempting to open with console.sh, I get the following error:
Cannot execute:CREATE CACHED TABLE CHANNEL_NORMAL_TABLE(TEST_ID INTEGER,DATA_POINT INTEGER,TEST_TIME DOUBLE,STEP_TIME DOUBLE,DATETIME DOUBLE,STEP_INDEX SMALLINT,CYCLE_INDEX SMALLINT,IS_FC_DATA SMALLINT,CURRENT NUMERIC(100,7),VOLTAGE NUMERIC(100,7),CHARGE_CAPACITY DOUBLE,DISCHARGE_CAPACITY DOUBLE,CHARGE_ENERGY DOUBLE,DISCHARGE_ENERGY DOUBLE,"DV/DT" NUMERIC(100,7),INTERNAL_RESISTANCE NUMERIC(100,7),AC_IMPEDANCE NUMERIC(100,7),ACI_PHASE_ANGLE NUMERIC(100,7), check (3.4028235E+38>=CURRENT AND -3.4028235E+38<=CURRENT), check (3.4028235E+38>=VOLTAGE AND -3.4028235E+38<=VOLTAGE), check (3.4028235E+38>="DV/DT" AND -3.4028235E+38<="DV/DT"), check (3.4028235E+38>=INTERNAL_RESISTANCE AND -3.4028235E+38<=INTERNAL_RESISTANCE), check (3.4028235E+38>=AC_IMPEDANCE AND -3.4028235E+38<=AC_IMPEDANCE), check (3.4028235E+38>=ACI_PHASE_ANGLE AND -3.4028235E+38<=ACI_PHASE_ANGLE)) unexpected token: AND required: VALUE
I have tried the Python route, where it is easier to switch out the particular HSQLDB jar in use, and had the same problem with HSQLDB 2.5.0, 2.5.1, 2.5.2 and 2.6.1. I changed the HSQLDB version by swapping the jar in the UCanAccess/lib folder and updating the corresponding entry in ucanaccess_jars in the code below:
import pandas as pd
import jaydebeapi

def fetch_data(filepath: str, query: str) -> pd.DataFrame:
    ucanaccess_jars = [
        "JDBC/UCanAccess/ucanaccess-5.0.1.jar",
        "JDBC/UCanAccess/lib/commons-lang3-3.8.1.jar",
        "JDBC/UCanAccess/lib/commons-logging-1.2.jar",
        "JDBC/UCanAccess/lib/hsqldb-2.5.0.jar",  # swap version number here
        "JDBC/UCanAccess/lib/jackcess-3.0.1.jar",
    ]
    classpath = ":".join(ucanaccess_jars)
    with jaydebeapi.connect(
        "net.ucanaccess.jdbc.UcanaccessDriver",
        f"jdbc:ucanaccess://{filepath}",
        ["", ""],
        classpath,
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute(query)
            df = pd.DataFrame(cursor.fetchall(), columns=[x[0] for x in cursor.description])
    return df
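For completeness, a typical call looks like this (the file path is a placeholder):
df = fetch_data("C:/data/experiment.accdb", "SELECT * FROM Channel_Normal_Table")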
Swapping jars, understandably, breaks the console script.
Any suggestions? Is there a simple tweak to avoid generating the AND check constraints when the cached tables are created? Any hints as to whether this is an HSQLDB or a UCanAccess issue?
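One way to narrow it down might be to replay an abridged version of the generated DDL against a standalone in-memory HSQLDB, outside of UCanAccess. This is only a sketch under assumptions (the jar path, HSQLDB version and the shortened DDL are mine): if it reproduces "unexpected token: AND required: VALUE", the parse failure is on the HSQLDB side, possibly because the unquoted CURRENT column collides with HSQLDB's CURRENT VALUE FOR <sequence> syntax, which would explain "required: VALUE".
import jaydebeapi

# Abridged version of the DDL that UCanAccess generates (most columns and checks trimmed)
ddl = (
    "CREATE CACHED TABLE CHANNEL_NORMAL_TABLE("
    "TEST_ID INTEGER, CURRENT NUMERIC(100,7), "
    "check (3.4028235E+38>=CURRENT AND -3.4028235E+38<=CURRENT))"
)

# Throwaway in-memory HSQLDB, loading the same jar UCanAccess uses
with jaydebeapi.connect(
    "org.hsqldb.jdbc.JDBCDriver",
    "jdbc:hsqldb:mem:checktest",
    ["SA", ""],
    "JDBC/UCanAccess/lib/hsqldb-2.5.0.jar",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(ddl)  # does this reproduce the parse error without UCanAccess?
        print("DDL accepted by standalone HSQLDB")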

Related

Creating alias for a pandera type: DataFrame with particular pandera schema

I want to define a type to be a DataFrame with a particular pandera schema. However, when I lint this code:
import pandera as pa
from pandera.typing import DataFrame, Series

class MySchema(pa.SchemaModel):
    foo: Series[str]

MyDF = DataFrame[MySchema]

def myfun(df: MyDF) -> bool:
    return True
I get
myproject: myfile.py: note: In function "myfun":
myproject: myfile.py:10:15: error: Variable "myproject.myproject.myfile.MyDF" is not valid as a type [valid-type]
I understand I can avoid this error by specifying df: DataFrame[MySchema] in the signature of myfun instead. Is there some way I can define the alias MyDF in a valid manner?
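One thing worth trying (my own suggestion, not something from the original post) is to declare the alias explicitly with typing.TypeAlias, so mypy treats MyDF as a type alias rather than an ordinary variable; pandera also ships a mypy plugin (pandera.mypy) that may need to be enabled for DataFrame[...] to type-check.
import pandera as pa
from pandera.typing import DataFrame, Series
from typing import TypeAlias  # use typing_extensions.TypeAlias on Python < 3.10

class MySchema(pa.SchemaModel):
    foo: Series[str]

# The explicit annotation tells mypy this assignment defines a type alias
MyDF: TypeAlias = DataFrame[MySchema]

def myfun(df: MyDF) -> bool:
    return True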

Error using Polybase to load Parquet file: class java.lang.Integer cannot be cast to class parquet.io.api.Binary

I have a snappy.parquet file with a schema like this:
{
    "type": "struct",
    "fields": [{
        "name": "MyTinyInt",
        "type": "byte",
        "nullable": true,
        "metadata": {}
    },
    ...
    ]
}
Update: parquet-tools reveals this:
############ Column(MyTinyInt) ############
name: MyTinyInt
path: MyTinyInt
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=8, isSigned=true)
converted_type (legacy): INT_8
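For reference, the same physical/logical type information can also be pulled with a short pyarrow sketch (the file name is a placeholder):
import pyarrow.parquet as pq

# Printing the parquet-level schema shows each column's physical and logical type,
# the same information parquet-tools reports above
pf = pq.ParquetFile("myfile.snappy.parquet")
print(pf.schema)                    # parquet physical/logical types
print(pf.schema.to_arrow_schema())  # how pyarrow maps those types to Arrow types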
When I try and run a stored procedure in Azure Data Studio to load this into an external staging table with PolyBase I get the error:
11:16:21 Started executing query at Line 113
Msg 106000, Level 16, State 1, Line 1
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Integer cannot be cast to class parquet.io.api.Binary (java.lang.Integer is in module java.base of loader 'bootstrap'; parquet.io.api.Binary is in unnamed module of loader 'app')
The load into the external table works fine when the table contains only varchar columns:
CREATE EXTERNAL TABLE [domain].[TempTable]
(
    ...
    MyTinyInt tinyint NULL,
    ...
)
WITH
(
    LOCATION = ''' + @Location + ''',
    DATA_SOURCE = datalake,
    FILE_FORMAT = parquet_snappy
)
The data will eventually be merged into a Data Warehouse Synapse table. In that table the column will have to be of type tinyint.
I had the same issue and a support plan in Azure, so I got an answer from Microsoft:
there is a known bug in ADF for this particular scenario: the date type in parquet should be mapped as data type date in SQL Server; however, ADF incorrectly converts this type to Datetime2, which causes a conflict in PolyBase. I have confirmation from the core engineering team that this will be rectified with a fix by the end of November and will be published directly into the ADF product.
In the meantime, as a workaround:
Create the target table with data type DATE as opposed to DATETIME2
Configure the Copy Activity Sink settings to use Copy Command as opposed to PolyBase
but even the Copy command doesn't work for me, so the only remaining workaround is Bulk Insert, which is extremely slow and would be a problem on big datasets.

Airflow Failed: ParseException line 2:0 cannot recognize input near

I'm trying to run a test task on Airflow but I keep getting the following error:
FAILED: ParseException 2:0 cannot recognize input near 'create_import_table_fct_latest_values' '.' 'hql'
Here is my Airflow Dag file:
import airflow
from datetime import datetime, timedelta
from airflow.operators.hive_operator import HiveOperator
from airflow.models import DAG

args = {
    'owner': 'raul',
    'start_date': datetime(2018, 11, 12),
    'provide_context': True,
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email': ['raul.gregglino@leroymerlin.ru'],
    'email_on_failure': True,
    'email_on_retry': False
}

dag = DAG('opus_data',
          default_args=args,
          max_active_runs=6,
          schedule_interval="@daily"
          )

import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='create_import_table_fct_latest_values.hql ',
    hiveconf_jinja_translate=True,
    dag=dag
)

deps = {}

# Explicitly define the dependencies in the DAG
for downstream, upstream_list in deps.iteritems():
    for upstream in upstream_list:
        dag.set_dependency(upstream, downstream)
Here is the content of my HQL file, in case that is the issue and I just can't see it:
*I'm testing the connection to check whether the table gets created or not; then I'll try to LOAD DATA, hence the LOAD DATA is commented out.
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
    id_product STRING,
    id_model STRING,
    id_attribute STRING,
    attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED ',';

#LOAD DATA LOCAL INPATH
#'/media/windows_share/schemas/opus/fct_latest_values_20181106.csv'
#OVERWRITE INTO TABLE opus_data.fct_latest_values_new_data;
In the HQL file it should be FIELDS TERMINATED BY ',':
CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
    id_product STRING,
    id_model STRING,
    id_attribute STRING,
    attribute_value STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Also, comments in an HQL file should start with --, not #.
In addition, this looks incorrect and is what triggers the exception: hql='create_import_table_fct_latest_values.hql ' (the trailing space most likely stops Airflow from recognising the .hql extension, so the literal string is sent to Hive as a query).
Have a look at this example:
import os

# Create the full path for the file (source['hql'] holds the file name in this example)
hql_file_path = os.path.join(os.path.dirname(__file__), source['hql'])
print(hql_file_path)

run_hive_query = HiveOperator(
    task_id='run_hive_query',
    dag=dag,
    hql="""
    {{ local_hive_settings }}
    """ + "\n " + open(hql_file_path, 'r').read()
)
See here for more details.
Or put the HQL directly into the hql parameter:
hql='CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data ...'
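A sketch of that inline approach, reusing the corrected statement from above:
import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql="""
        CREATE TABLE IF NOT EXISTS opus_data.fct_latest_values_new_data (
            id_product STRING,
            id_model STRING,
            id_attribute STRING,
            attribute_value STRING
        ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    """,
    hiveconf_jinja_translate=True,
    dag=dag
)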
I managed to find the answer to my issue.
It was related to the path my HiveOperator was loading the file from. As no Variable had been defined to tell Airflow where to look, I was getting the error I mentioned in my post.
Once I had defined it using the webserver interface, my DAG started to work properly.
I also changed the file location in my DAG code, purely for organization, and this is what my HiveOperator looks like now:
import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='hql/create_import_table_fct_latest_values2.hql',
    hiveconf_jinja_translate=True,
    dag=dag
)
Thanks to (@panov.st), who helped me in person to identify my issue.
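For completeness, another way to tell Airflow where .hql files live (my own suggestion, not part of the original answer; the directory is a placeholder) is the DAG's template_searchpath parameter, so a bare file name in hql= resolves against that path:
from datetime import datetime
from airflow.models import DAG
from airflow.operators.hive_operator import HiveOperator

args = {'owner': 'raul', 'start_date': datetime(2018, 11, 12)}

# template_searchpath points Jinja at the directory that holds the .hql scripts
dag = DAG(
    'opus_data',
    default_args=args,
    schedule_interval="@daily",
    template_searchpath='/path/to/airflow/dags/hql'  # placeholder directory
)

import_lv_data = HiveOperator(
    task_id='fct_latest_values',
    hive_cli_conn_id='metastore_default',
    hql='create_import_table_fct_latest_values.hql',  # no trailing space
    hiveconf_jinja_translate=True,
    dag=dag
)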

How to merge orc files for external tables?

I am trying to merge multiple small ORC files. I came across the ALTER TABLE CONCATENATE command, but that only works for managed tables.
Hive gave me the following error when I tried to run it:
FAILED: SemanticException
org.apache.hadoop.hive.ql.parse.SemanticException: Concatenate/Merge
can only be performed on managed tables
Following are the table parameters:
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE true
EXTERNAL TRUE
numFiles 535
numRows 27051810
orc.compress SNAPPY
rawDataSize 20192634094
totalSize 304928695
transient_lastDdlTime 1512126635
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I believe your table is an external table; in that case there are two ways:
Either you can change it to a managed table (ALTER TABLE <table> SET TBLPROPERTIES('EXTERNAL'='FALSE')) and run ALTER TABLE ... CONCATENATE, then convert it back to external by setting the property to TRUE again.
Or you can create a managed table using CTAS and insert the data, then run the merge query and import the data back into the external table.
From my previous answer to this question, here is a small script in Python using PyORC to concatenate the small ORC files together. It doesn't use Hive at all, so you can only use it if you have direct access to the files and are able to run a Python script on them, which might not always be the case in managed hosts.
import pyorc
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', type=argparse.FileType(mode='wb'))
    parser.add_argument('files', type=argparse.FileType(mode='rb'), nargs='+')
    args = parser.parse_args()

    schema = str(pyorc.Reader(args.files[0]).schema)

    with pyorc.Writer(args.output, schema) as writer:
        for i, f in enumerate(args.files):
            reader = pyorc.Reader(f)
            if str(reader.schema) != schema:
                raise RuntimeError(
                    "Inconsistent ORC schemas.\n"
                    "\tFirst file schema: {}\n"
                    "\tFile #{} schema: {}"
                    .format(schema, i, str(reader.schema))
                )
            for line in reader:
                writer.write(line)

if __name__ == '__main__':
    main()
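Assuming the script is saved as concat_orc.py (the name and file paths are placeholders), it can be run against the small files like this:
python concat_orc.py -o merged.orc part-00000.orc part-00001.orc part-00002.orc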

Rscript : Why is Error in UseMethod("extract_") : being indicated when attempting to use raster::extract?

I am attempting to use the raster package's extract method to extract values from a Raster* object.
RStudioPrompt> jpnpe <- extract(jpnp, jpnb, fun = mean, na.rm = T)
where jpnp is the raster object and jpnb is a SpatialPolygonsDataFrame.
However the following error is indicated:
Error in UseMethod("extract_") :
no applicable method for 'extract_' applied to an object of class "c('RasterStack', 'Raster', 'RasterStackBrick', 'BasicRaster')"
How can I get past this error?
The issue may be due to another loaded package having a function with the same name, masking raster's extract method.
The tidyr package has an extract function which can mask raster's extract.
Confirm this by checking which packages are loaded:
>search()
[1] ".GlobalEnv" **"package:tidyr"** "package:dplyr"
[4] "package:rgeos" "package:ggplot2" "package:RColorBrewer"
[7] "package:animation" "package:rgdal" "package:maptools"
[10] **"package:raster"** "package:sp" "tools:rstudio"
[13] "package:stats" "package:graphics" "package:grDevices"
[16] "package:utils" "package:datasets" "package:methods"
[19] "Autoloads" "package:base"
You can also check which extract function is being picked up by typing the name of the function without brackets (as below; the environment line tells you which package it comes from):
> extract
function (data, col, into, regex = "([[:alnum:]]+)", remove = TRUE,
convert = FALSE, ...)
{
col <- col_name(substitute(col))
extract_(data, col, into, regex = regex, remove = remove,
convert = convert, ...)
}
<environment: namespace:tidyr>
To resolve the error, just unload the offending package; in RStudio you can use the following command:
>.rs.unloadPackage("tidyr")
and re-execute the raster extract method:
>jpnpe <- extract(jpnp, jpnb, fun = mean, na.rm = T)
