I am trying to parse JSON data into different columns in Oracle. Below is the SQL I am running. I am not able to identify why I am getting a JSON parse error. I have tried replacing non-ASCII and non-printable characters, but it is still not working:
select * from (select '{
"data": [
{
"note": "Yeah, it still wonb\u00000019t let me. Not sure why.\r\n\r\nThanks man!\r\n\r",
"id": 0
}
]
}' json_data from dual)i ,json_table( i.json_data , '$.data[*]'
COLUMNS (
ID varchar2(4000) path '$.id',
note varchar2(4000) path '$.note'
) )
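One thing worth ruling out: a JSON \uXXXX escape takes exactly four hexadecimal digits, so \u00000019 is read as \u0000 (a NUL character) followed by the literal text 0019. As a minimal sketch (not the original data, and assuming the character was meant to be the curly apostrophe U+2019), the same JSON_TABLE call parses once the escape has exactly four digits:
-- Sketch only: same query shape as above, with a four-digit \u escape.
select *
from (select '{"data":[{"note":"Yeah, it still won\u2019t let me.\r\n\r\nThanks man!","id":0}]}' json_data from dual) i,
     json_table(i.json_data, '$.data[*]'
       columns (
         id   varchar2(4000) path '$.id',
         note varchar2(4000) path '$.note'
       ));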
Hi, I am having an issue extracting fields from CLOB data. For one record I am not getting the desired output.
The record is as below:
{1:F014243107336}{2:O2021216200422XXX24563}{3:{108:O2020}{121:2c02a452-5}{433:HIT}}{4:
:4A:SEC:20200901
:4B:FC5253
:4C:20042000,
:4D:XXXXXXX
:4E:RXX
:4F:RXXXX
-}{5:{CHK:87D1003B01F7}{TNG:}}{S:{SAC:}{COP:S}}<APSECSIGN>FS3sfasdfg!==</APSECSIGN>?
I want to extract data from tag :4A: into REF_NUMBER.
I am using the SQL below to get the data.
NVL(TRIM(TRANSLATE(REGEXP_REPLACE(REGEXP_SUBSTR(dbms_lob.substr(CLOB, 4000, 1 ), ':4A.?:[^:]+(:|-\})'), ':20.?:([^:]+)(:|-\})', '\1'),CHR(10)||CHR(13), ' ')),' ') AS REF_NUMBER
The output I am getting is "SEC". However, I want to see the output as SEC:20200901.
Can anyone suggest what I am missing in my query, or provide the correct query?
A general suggestion: why don't you store your data as JSON? JSON-related functions are very fast compared to the alternatives, and your problem then becomes quite easy.
However, to answer your question:
with inputs (str) as
(
select to_clob(q'<
{1:F014243107336}{2:O2021216200422XXX24563}{3:{108:O2020}{121:2c02a452-5}{433:HIT}}{4:
:4A:SEC:20200901
:4B:FC5253
:4C:20042000,
:4D:XXXXXXX
:4E:RXX
:4F:RXXXX
-}{5:{CHK:87D1003B01F7}{TNG:}}{S:{SAC:}{COP:S}}<APSECSIGN>FS3sfasdfg!==</APSECSIGN>?
>') from dual
)
select str, regexp_substr(str,'SEC:\d+',1,1,'n') as val
from inputs;
Output: the VAL column returns SEC:20200901 (the STR column echoes the whole input).
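If the value after the :4A: tag is not always SEC followed by digits, a slightly more general sketch (assuming the value contains no spaces) uses the subexpression argument of REGEXP_SUBSTR to capture whatever follows the tag:
-- Sketch: the 6th argument (1) returns the first capture group.
select regexp_substr(str, ':4A:(\S+)', 1, 1, null, 1) as ref_number
from inputs;
On the sample record this also returns SEC:20200901.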
Updated
If you know the date is always going to be 8 digits after the :4A: tag, you can use REGEXP_SUBSTR to get the value you need. Combining it with DBMS_LOB.SUBSTR removes the tag and converts it to a string.
SELECT DBMS_LOB.SUBSTR ((REGEXP_SUBSTR (clob_val, ':4A:.*\d{8}')), 4000, 5)
FROM (SELECT EMPTY_CLOB ()
|| '{1:F014243107336}{2:O2021216200422XXX24563}{3:{108:O2020}{121:2c02a452-5}{433:HIT}}{4:
:4A:SEC:20200901
:4B:FC5253
:4C:20042000,
:4D:XXXXXXX
:4E:RXX
:4F:RXXXX
-}{5:{CHK:87D1003B01F7}{TNG:}}{S:{SAC:}{COP:S}}<APSECSIGN>FS3sfasdfg!==</APSECSIGN>?' AS clob_val
FROM DUAL);
I just started working with Oracle Text. I have already read the docs, but I really struggle to find a solution.
I am currently using progressive relaxation, but I keep getting the following error:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50920: part of phrase not itself a phrase or equivalence
29902. 00000 - "error in executing ODCIIndexStart() routine"
*Cause: The execution of ODCIIndexStart routine caused an error.
*Action: Examine the error messages produced by the indextype code and
take appropriate action.
I have two questions:
How do I escape the query in the tag? Maybe with {}? Without escaping, it won't work when I type something like COCA COLA 0,5L.
On the other hand, when the query is enclosed in the escape characters ({ }) and I try "EL K", it throws the same exception. "EL" is part of a thesaurus.
Code
Table and index creation:
CREATE TABLE "DAVID"."INVENTAR_DEV" (
"ID" VARCHAR2(16 BYTE)
NOT NULL ENABLE,
"NAME" VARCHAR2(255 BYTE)
NOT NULL ENABLE
);
CREATE INDEX "DAVID"."INV_DEV_NAME_IDX" ON
"DAVID"."INVENTAR_DEV" (
"NAME"
)
INDEXTYPE IS "CTXSYS"."CONTEXT" PARAMETERS ( 'DATASTORE CTXSYS.DEFAULT_DATASTORE FILTER CTXSYS.NULL_FILTER LEXER INVENTAR_DEV_LEXER WORDLIST INV_DEV_WORDLIST STOPLIST CTXSYS.EMPTY_STOPLIST'
);
The SELECT which I'm using:
SELECT /*+ FIRST_ROWS(150) */
XMLELEMENT("object",
XMLForest(i.id "id",
i.name "name"
)).getStringVal()
FROM david.inv_dev i
WHERE contains(i.name,
'<query>
<textquery grammar="context"> {EL KOS}
<progression>
<seq><rewrite>transform((TOKENS, "FUZZY(SYN({", "}, inv_thes), 70, 10, weight)", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "FUZZY(SYN({", "}, inv_thes), 70, 10, weight)", " AND "))</rewrite></seq>
</progression>
</textquery>
<score datatype="FLOAT" algorithm="DEFAULT"/>
<order>
<orderkey> Score DESC </orderkey>
</order>
</query>', 1) > 0;
Also created my own WORDLIST and LEXER:
BEGIN
ctx_ddl.create_preference('INVENTAR_DEV_LEXER','BASIC_LEXER');
ctx_ddl.set_attribute('INVENTAR_DEV_LEXER', 'numgroup',',');
ctx_ddl.set_attribute('INVENTAR_DEV_LEXER', 'numjoin','.');
ctx_ddl.set_attribute('INVENTAR_DEV_LEXER', 'skipjoins','.-_%:;/,()?!*+');
ctx_ddl.create_preference('INV_DEV_WORDLIST', 'BASIC_WORDLIST');
ctx_ddl.set_attribute('INV_DEV_WORDLIST','FUZZY_MATCH','GENERIC');
ctx_ddl.set_attribute('INV_DEV_WORDLIST','FUZZY_SCORE','70');
ctx_ddl.set_attribute('INV_DEV_WORDLIST','FUZZY_NUMRESULTS','10');
ctx_ddl.set_attribute('INV_DEV_WORDLIST','SUBSTRING_INDEX','FALSE');
ctx_ddl.set_attribute('INV_DEV_WORDLIST','STEMMER','NULL');
ctx_ddl.set_attribute('INV_DEV_WORDLIST','PREFIX_INDEX','TRUE');
ctx_ddl.set_attribute('INV_DEV_WORDLIST','PREFIX_MIN_LENGTH',3);
ctx_ddl.set_attribute('INV_DEV_WORDLIST','PREFIX_MAX_LENGTH',7);
Ctx_thes.create_thesaurus('inv_thes', FALSE); -- NAMEE, CASE-SENSITIVE
CTX_THES.CREATE_RELATION('inv_thes','el','SYN','elektro');
END;
Update
I realized that SYN({something}, thes) doesn't work when there are multiple words divided by spaces.
So there must be an operator between those words.
The query works with the SYN if I remove the following line from the textquery:
<seq><rewrite>transform((TOKENS, "FUZZY(SYN({", "}, inv_thes), 70, 10, weight)", " "))</rewrite></seq>
But I'm still not sure what could be the reason.
My solution to the problem was to use a custom workaround function instead of SYN and a custom query parsing function.
The workaround function for SYN:
FUNCTION f_synonyms(p_word IN VARCHAR2)
RETURN VARCHAR2
AS
CURSOR c_synonyms (p_word IN VARCHAR2)
IS
SELECT REPLACE(CTX_THES.SYN(p_word, g_thesaurus_name), '|','=')
FROM SYS.dual;
v_retVal VARCHAR(255);
BEGIN
OPEN c_synonyms(p_word);
FETCH c_synonyms INTO v_retVal;
CLOSE c_synonyms;
RETURN v_retVal;
END;
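The custom query parsing function mentioned above isn't shown in the post. Purely as a hypothetical sketch (the function name, the tokenization, and the AND-joining strategy are assumptions, not the author's actual code), it could split the raw input on spaces, expand each token through f_synonyms, and join the tokens with AND so multi-word input such as EL KOS stays a valid query:
-- Hypothetical sketch only: tokenize the input, expand each token through
-- f_synonyms (which returns word1=word2=... equivalences), and join with AND.
FUNCTION f_build_query(p_input IN VARCHAR2)
    RETURN VARCHAR2
AS
    v_token VARCHAR2(4000);
    v_query VARCHAR2(4000);
BEGIN
    FOR r IN (SELECT REGEXP_SUBSTR(p_input, '[^ ]+', 1, LEVEL) AS token
                FROM dual
             CONNECT BY REGEXP_SUBSTR(p_input, '[^ ]+', 1, LEVEL) IS NOT NULL)
    LOOP
        v_token := NVL(f_synonyms(r.token), r.token);
        v_query := CASE WHEN v_query IS NULL THEN v_token
                        ELSE v_query || ' AND ' || v_token END;
    END LOOP;
    RETURN v_query;
END;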
I am trying to load the data from a data file into the database table load_sql using SQL*Loader. I have data like below in the data file:
empid,ename
1,Raja,**Kanchi
2,Poo,**Kanchi
3,Bhasker,**Kanchi
4,Siva,**Kanchi
I have to load it into the load_sql table in the below format:
1,Raja,Kanchi
2,Poo,Kanchi
3,Bhasker,Kanchi
4,Siva,Kanchi
I have written a control file using character-manipulation functions to insert records into the third column, but I'm getting an error:
options(skip = 1,Errors = 100, direct = True)
load data
infile 'D:\SQLLDR\control.ctl'
truncate into table load_sql
when city = 'Kanchi'
fields terminated by ','
optionally enclosed by '"'
(
empid,
ename,
X filler,
city "ltrim(:city,*)"
)
I'm getting an error like:
'SQL*Loader-951: Error calling once/load initialization
ORA-02373: Error parsing insert statement for table ROOT.LOAD_SQL.
ORA-00936: missing expression'
You have some syntax errors in your control file. Try this:
options(skip = 1,Errors = 100, direct = True)
load data
infile 'D:\SQLLDR\control.ctl' <-- This doesn't look like a data file name?
truncate into table load_sql
when (city = '**Kanchi')
fields terminated by ','
optionally enclosed by '"'
(
empid,
ename,
city "ltrim(:city, '*')"
)
ltrim(:city,*)
^
|
this is invalid
Should have been
ltrim(:city, '*')
or, possibly,
replace(:city, '*', '')
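For reference, here is a quick check of what the two corrected expressions return, as a plain SQL sketch outside SQL*Loader:
-- Both expressions strip the leading asterisks from the third field.
SELECT LTRIM('**Kanchi', '*')       AS ltrim_result,    -- Kanchi
       REPLACE('**Kanchi', '*', '') AS replace_result   -- Kanchi
FROM dual;
The difference is that LTRIM only strips asterisks from the start of the value, while REPLACE removes them anywhere in the value.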
Testing my DAL with an H2 in-memory database currently doesn't work because the data type BINARY gets converted to VARBINARY:
CREATE TABLE test (
pk_user_id INT AUTO_INCREMENT(1, 1) PRIMARY KEY,
uuid BINARY(16) UNIQUE NOT NULL
);
which results in a wrong data type when I check whether the columns exist with the expected data types:
2017-03-20 16:24:48 persistence.database.Table check Unexpected column
(UUID) or type (-3, VARBINARY)
tl;dr
which results in a wrong data type
No, not the wrong type, just another label for the same type.
The binary type has five synonyms: { BINARY | VARBINARY | LONGVARBINARY | RAW | BYTEA }
All five names mean the same type, and all map to byte[] in Java.
Synonyms for datatype names
Data types are not strictly defined in the SQL world. The SQL spec defines only a few types. Many database systems define many types by many names. To make it easier for a customer to port from one database system to theirs, the database vendors commonly implement synonyms for data types to match those of their competitors where the types are compatible.
H2, like many other database systems, has more than one name for a given datatype. For a binary type where the entire value is loaded into memory, H2 defines five names for the same single data type:
{ BINARY | VARBINARY | LONGVARBINARY | RAW | BYTEA }
Similarly, H2 provides a signed 32-bit integer datatype under any of five synonyms:
{ INT | INTEGER | MEDIUMINT | INT4 | SIGNED }
So you can specify any of these five names but you will get the same effect, the same underlying datatype provided by H2.
Indeed, I myself ran code to create the column using each of those five names for the binary type. In each case, the metadata for the column name reports the datatype as VARBINARY.
While it does not really matter which of the five is used internally to track the column’s datatype, I am a bit surprised as to the use of VARBINARY because the H2 datatype documentation page heading advertises this type as BINARY. So I would expect BINARY to be used by default in the metadata. You might want to log a bug/issue for this if you really care, as it seems either the doc heading should be changed to VARBINARY or H2’s internal labelling for the datatype should be changed to BINARY.
Below is some example Java JDBC code confirming the behavior you report in your Question.
I suggest you change your datatype-checking code to look for any of the five possible names for this datatype rather than check for only one specific name.
try {
Class.forName ( "org.h2.Driver" );
} catch ( ClassNotFoundException e ) {
e.printStackTrace ( );
}
try ( Connection conn = DriverManager.getConnection ( "jdbc:h2:mem:" ) ;
Statement stmt = conn.createStatement ( ) ; ) {
String tableName = "test_";
String sql = "CREATE TABLE " + tableName + " (\n" +
" pk_user_id_ INT AUTO_INCREMENT(1, 1) PRIMARY KEY,\n" +
" uuid_ BINARY(16) UNIQUE NOT NULL\n" +
");";
// String sql = "CREATE TABLE " + tableName +
// "(" +
// " id_ INT AUTO_INCREMENT(1, 1) PRIMARY KEY, " +
// " binary_id_ BINARY(16) UNIQUE NOT NULL, " +
// " uuid_id_ UUID, " +
// " age_ INTEGER " + ")";
stmt.execute ( sql );
// List tables
DatabaseMetaData md = conn.getMetaData ( );
try ( ResultSet rs = md.getTables ( null, null, null, null ) ) {
while ( rs.next ( ) ) {
System.out.println ( rs.getString ( 3 ) );
}
}
// List columns of our table.
try ( ResultSet rs = md.getColumns ( null, null, tableName.toUpperCase ( Locale.US ), null ) ) {
System.out.println ( "Columns of table: " + tableName );
while ( rs.next ( ) ) {
System.out.println ( rs.getString ( 4 ) + " | " + rs.getString ( 5 ) + " | " + rs.getString ( 6 ) ); // COLUMN_NAME, DATA_TYPE , TYPE_NAME.
}
}
} catch ( SQLException e ) {
e.printStackTrace ( );
}
CATALOGS
COLLATIONS
…
USERS
VIEWS
TEST_
Columns of table: test_
PK_USER_ID_ | 4 | INTEGER
UUID_ | -3 | VARBINARY
Tips:
Adding a trailing underscore to all your SQL names avoids collisions with any of the over one thousand reserved words found in the SQL world. The SQL spec promises a trailing underscore will never be used by a SQL system. For example, your use of the column name uuid could conflict with H2’s UUID datatype.
Your code uuid BINARY(16) suggests you are trying to store a UUID (a 128-bit value where some bits have defined semantics). Note that H2 supports UUID natively as a data type as does Postgres and some other database systems. So change uuid_ BINARY(16) to uuid_ UUID.
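Following both tips, a sketch of the revised DDL (assuming H2's native UUID type is acceptable to your DAL):
CREATE TABLE test_ (
    pk_user_id_ INT AUTO_INCREMENT(1, 1) PRIMARY KEY,
    uuid_ UUID UNIQUE NOT NULL  -- native 128-bit UUID instead of BINARY(16)
);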
I need to compare two AVRO datasets. Ideally, I'd like something packaged like SAS PROC COMPARE that can give me:
The count of rows for each dataset
The count of rows that exist in one dataset, but not the other
Variables that exist in one dataset, but not the other
Variables that do not have the same format in the two files (I realize this would be rare for AVRO files, but would be helpful to know quickly without deciphering errors)
The total number of mismatching rows for each column, and a presentation of all the mismatches for a column or any 20 mismatches (whichever is smallest)
I've worked out one way to make sure the datasets are equivalent, but it is pretty inefficient. Let's assume we have two AVRO files with 100 rows and 5 columns (one key and four float features). If we join the tables and create new variables that are the differences between the matching features from the two datasets, then any non-zero difference indicates a mismatch in the data. From there it would be pretty easy to determine the entire list of requirements above, but it seems like there may be more efficient approaches.
AVRO files store the schema and data separately. This means that besides the AVRO file with the data you should have a schema file, usually something like *.avsc. This way your task can be split into 3 parts:
Compare the schemas. This way you can find the fields that have different data types in the two files, fields that exist in only one of them, and so on. This task is very easy and can be done outside of Hadoop, for instance in Python:
import json
schema1 = json.load(open('schema1.avsc'))
schema2 = json.load(open('schema2.avsc'))
def print_cross (s1set, s2set, message):
    for s in s1set:
        if not s in s2set:
            print message % s
s1names = set( [ field['name'] for field in schema1['fields'] ] )
s2names = set( [ field['name'] for field in schema2['fields'] ] )
print_cross(s1names, s2names, 'Field "%s" exists in table1 and does not exist in table2')
print_cross(s2names, s1names, 'Field "%s" exists in table2 and does not exist in table1')
def print_cross2 (s1dict, s2dict, message):
    for s in s1dict:
        if s in s2dict:
            if s1dict[s] != s2dict[s]:
                print message % (s, s1dict[s], s2dict[s])
s1types = dict( zip( [ field['name'] for field in schema1['fields'] ], [ str(field['type']) for field in schema1['fields'] ] ) )
s2types = dict( zip( [ field['name'] for field in schema2['fields'] ], [ str(field['type']) for field in schema2['fields'] ] ) )
print_cross2 (s1types, s2types, 'Field "%s" has type "%s" in table1 and type "%s" in table2')
Here's an example of the schemas:
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int"]},
{"name": "favorite_color", "type": ["string", "null"]},
{"name": "test", "type": "int"}
]
}
Here's the output:
[localhost:temp]$ python compare.py
Field "test" exists in table2 and does not exist in table1
Field "favorite_number" has type "[u'int', u'null']" in table1 and type "[u'int']" intable2
If the schemas are equal (and you probably don't need to compare the data if the schemas are not equal), then you can do the comparison in the following way. An easy way that works in any case: calculate an md5 hash for each row and join the two tables on the value of this hash. This will give you the number of rows that are the same in both tables, the number of rows specific to table1, and the number specific to table2. It can easily be done in Hive; here's the code of an MD5 UDF: https://gist.github.com/dataminelab/1050002
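As a sketch of that hash-based comparison in Hive (table and column names are placeholders, md5 refers to the UDF linked above, and non-string columns may need a cast before concatenation):
-- Placeholders: table1/table2 and columns col1..col3.
-- Exact counts assume rows hash to reasonably unique values.
SELECT SUM(CASE WHEN t1.h IS NOT NULL AND t2.h IS NOT NULL THEN 1 ELSE 0 END) AS rows_in_both,
       SUM(CASE WHEN t2.h IS NULL THEN 1 ELSE 0 END) AS rows_only_in_table1,
       SUM(CASE WHEN t1.h IS NULL THEN 1 ELSE 0 END) AS rows_only_in_table2
FROM      (SELECT md5(concat_ws('|', col1, col2, col3)) AS h FROM table1) t1
FULL OUTER JOIN
          (SELECT md5(concat_ws('|', col1, col2, col3)) AS h FROM table2) t2
       ON t1.h = t2.h;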
For a field-to-field comparison you have to know the primary key of the table, join the two tables on the primary key, and compare the fields.
Previously I've developed comparison functions for tables, and they usually looked like this:
Check that both tables exist and are available
Compare their schemas. If there are any mismatches in the schemas, break
If the primary key is specified:
Join both tables on the primary key using a full outer join (see the sketch at the end of this answer)
Calculate an md5 hash for each row
Output primary keys with a diagnosis (PK exists only in table1, PK exists only in table2, PK exists in both tables but the data does not match)
Get a sample of 100 rows of each problematic class, join with both tables, and output into a "mismatch example" table
If the primary key is not specified:
Calculate md5 hash for each row
Full outer join of table1 with table2 on the md5 hash value
Count the number of matching rows, the number of rows that exist in table1 only, and the number that exist in table2 only
Get a 100-row sample of each mismatch type and output it to the "mismatch example" table
Usually, developing and debugging such a function takes 4-5 business days.
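A corresponding sketch for the primary-key branch above, again with placeholder names, producing a count per diagnosis:
-- Placeholders: pk, col1, col2, table1, table2.
SELECT CASE WHEN t2.pk IS NULL THEN 'PK exists only in table1'
            WHEN t1.pk IS NULL THEN 'PK exists only in table2'
            WHEN t1.h <> t2.h  THEN 'PK in both tables, data does not match'
            ELSE 'match' END AS diagnosis,
       COUNT(*) AS row_count
FROM      (SELECT pk, md5(concat_ws('|', col1, col2)) AS h FROM table1) t1
FULL OUTER JOIN
          (SELECT pk, md5(concat_ws('|', col1, col2)) AS h FROM table2) t2
       ON t1.pk = t2.pk
GROUP BY CASE WHEN t2.pk IS NULL THEN 'PK exists only in table1'
              WHEN t1.pk IS NULL THEN 'PK exists only in table2'
              WHEN t1.h <> t2.h  THEN 'PK in both tables, data does not match'
              ELSE 'match' END;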