Hive - column type name too long - hadoop

I want to use rcongiu's hive-json-serde to store non-trivial JSON documents complying with an open standard. I've used Michael Peterson's convenient hive-json-schema generator to produce a CREATE TABLE statement that should work, except for its size.
The JSON documents I am encoding follow a well-defined schema, but the schema contains maybe a hundred fields, nested up to four levels deep. A Hive column type that matches the standard is very, very long (around 3700 characters), and when I run my generated create table statement I get the error
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
InvalidObjectException(message:Invalid column type name is too long: <the
really long type name>)
The statement looks like this:
CREATE TABLE foobar_requests (
`event_id` int,
`client_id` int,
`request` struct<very long and deeply nested struct definition>,
`timestamp` timestamp)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
Any path forward to storing these documents?

Hive has a problem with very long column definitions. By default the maximum number of characters supported is 4000, so if you really need more than that you'll have to alter the metastore database by extending the length of COLUMNS_V2.TYPE_NAME.
If you'd like to read more about the issue, see this link:
https://issues.apache.org/jira/browse/HIVE-12274
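For reference, on a MySQL-backed metastore that change could look like the sketch below. This is only an assumption about your setup; back up the metastore database first and adapt the statement to whatever database actually backs your metastore.
-- Run against the Hive metastore database itself, not through Hive.
-- MEDIUMTEXT lifts the 4000-character limit on stored column type names.
ALTER TABLE COLUMNS_V2 MODIFY TYPE_NAME MEDIUMTEXT;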

Add the following property through Ambari > Hive > Configs > Advanced > Custom hive-site:
hive.metastore.max.typename.length=14000

This issue occurs when one of the column type names is longer than the default limit of 2000 characters.
Solution:
To resolve this issue, do the following:
1. Add the following property through Ambari > Hive > Configs > Advanced > Custom hive-site: hive.metastore.max.typename.length=10000
The above value is an example; it needs to be tuned for your specific use case.
2. Save the changes, restart the services, and recreate the table.

Related

Why does Oracle change the index construction function instead of raising an error? ORA-01722: invalid number from an index on a field of type varchar2

Creating a table mySomeTable with two fields:
create table mySomeTable (
IDRQ VARCHAR2(32 CHAR),
PROCID VARCHAR2(64 CHAR)
);
Creating an index on the table's PROCID field:
create index idx_PROCID on mySomeTable(trunc(PROCID));
Inserting records:
insert into mySomeTable values ('a', '1'); -- OK
insert into mySomeTable values ('b', 'c'); -- FAIL
As you can see, a mistake was made in the index creation script: it tries to build the index on the field using the trunc() function.
trunc() is a function for working with dates or numbers, while the field has a string type.
Nevertheless, the index creation script runs successfully and creates the index without showing any warnings or errors.
The index is created on the table using the expression TRUNC(TO_NUMBER(PROCID)).
When trying to insert or update a row in the table, if PROCID cannot be converted to a number, I get the error ORA-01722: invalid number, which is actually logical.
However, knowing that I was working with a table of string columns and inserting string values, an error about conversion to a number was misleading, and I didn't understand what was happening...
Question: why does Oracle change the index expression instead of raising an error? And how can this be avoided in the future?
Oracle version 19.14
Naturally, there was only one solution: create the correct index with the correct script
create index idx_PROCID on mySomeTable(PROCID);
However, this still does not explain this Oracle behaviour to me.
Oracle doesn't know if the index declaration is wrong or the column data type is wrong. Arguably (though some may well disagree!) Oracle shouldn't try to second-guess your intentions or enforce restrictions beyond those documented in the manual - that's what user-defined constraints are for. And, arguably, this index acts as a form of pseudo-constraint. That's a decision for the developer, not Oracle.
It's legal, if usually ill-advised, to store a number in a string column. If you actually intentionally chose to store numbers as strings - against best practice and possibly just to irritate future maintainers of your code - then the index behaviour is reasonable.
A counter-question is to ask where it should draw the line - if you expect it to error on your index expression, what about something like
create index idx_PROCID on mySomeTable(
case when regexp_like(PROCID, '^\d+\.?\d*$') then trunc(PROCID) end
);
or
create index idx_PROCID on mySomeTable(
trunc(to_number(PROCID default null on conversion error))
);
You might actually have chosen to store both numeric and non-numeric data in the same string column (again, I'm not advocating that) and an index like that might then be useful - and you wouldn't want Oracle to prevent you from creating it.
Something that, to you, obviously doesn't make sense and shouldn't be allowed is much harder for software to evaluate.
Interestingly the documentation says:
Oracle recommends that you specify explicit conversions, rather than rely on implicit or automatic conversions, for these reasons:
...
If implicit data type conversion occurs in an index expression, then Oracle Database might not use the index because it is defined for the pre-conversion data type. This can have a negative impact on performance.
which is presumably why it chooses here to apply an explicit conversion when it creates the index expression (which you can see in user_ind_expressions)
But you'd get the same error if the index expression wasn't modified - there would still be an implicit conversion of 'c' to a number, and that would still throw ORA-01722. As would some strings that look like numbers if your NLS settings are incompatible.
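For reference, you can see the expression Oracle actually stored with a query like the following sketch (index name taken from the question):
SELECT index_name, column_expression
FROM user_ind_expressions
WHERE index_name = 'IDX_PROCID';
-- shows the rewritten expression, e.g. TRUNC(TO_NUMBER("PROCID")), rather than the trunc(PROCID) that was written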

ODI-1228: Task Load data-LKM SQL to Oracle- fails on the target connection

I'm working with Oracle Data Integrator, inserting data from the original source into a temp table (BI_DSA.TMP_TABLE).
ODI-1228: Task Load data-LKM SQL to Oracle- fails on the target
connection BI_DSA. Caused By: java.sql.BatchUpdateException:
ORA-12899: value too large for column
"BI_DSA"."C$_0DELTA_TABLE"."FIELD" (actual: 11, maximum: 10)
I tried changing the length of 'FIELD' to more than 10 and reverse engineering again, but it didn't work.
Is this error coming from the original source? I'm doing a replica, so I only have view privileges on it, and I believe it is, because the error comes from the C$ table.
Thanks for the help!
Solution: I had tried the length option before, as the answers suggested, but it didn't work. Then I noticed the original source had modified its field length, so I reverse engineered the source table and the problem was solved.
Greetings!
As Bobby mentioned in the comments, it might come from byte/char semantics.
The C$ tables created by the LKMs usually copy the structure of the source data. So a workaround would be to go into the model and manually increase the size of the FIELD column in the source datastore (even if it doesn't represent what is in the database). The C$ table will be created with that size on the next run.
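If you want to check whether byte/char semantics is really the cause, here is a minimal sketch (made-up table and column names, assuming an AL32UTF8 database):
-- With BYTE semantics, 10 multi-byte characters overflow a VARCHAR2(10 BYTE) column;
-- with CHAR semantics the same value fits.
CREATE TABLE demo_byte (field VARCHAR2(10 BYTE));
CREATE TABLE demo_char (field VARCHAR2(10 CHAR));
INSERT INTO demo_char VALUES ('éééééééééé'); -- 10 characters, 20 bytes: succeeds
INSERT INTO demo_byte VALUES ('éééééééééé'); -- fails with ORA-12899 (actual: 20, maximum: 10)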

Import massive table from Oracle to PostgreSQL with oracle-fdw returns ORA-01406

I work on a project to transfer data from an Oracle database to a PostgreSQL database to build a data warehouse with bash & SQL scripts. To access the Oracle database, I use the PostgreSQL extension oracle-fdw.
One of my scripts imports data from a massive table (~ 100,000,000 new rows/day). This table is partitioned and each partition contains 1 day of data. The query I use to import data looks like this:
INSERT INTO postgre_target_table (some_fields)
SELECT some_aggregated_fields -- (~150 fields)
FROM oracle_source_table
WHERE partition_id = :v_partition_id AND some_others_filters
GROUP BY primary_key;
On the DEV server the query works fine (there is much less data on this server), but in PREPROD it returns the error ORA-01406: fetched column value was truncated.
In some posts, people say that the output fields may be too small, but if I run a simple SELECT query without the INSERT or the GROUP BY I get the same error.
Another idea I found in another post is to create a view on the Oracle side, but my query uses multiple parameters that I cannot use in a view.
The last idea I found is to create an Oracle stored procedure that fills a table with the aggregated data and then import from that table, but the Oracle database is critical and my customer prefers to avoid adding more data to it.
Now I'm starting to think there's no solution, and that's not good...
PostgreSQL version : 12.4 / Oracle version : 11.2
UPDATE
It seems my problem is more complicated than I thought.
After applying the modification given by Laurenz Albe, the query runs correctly in pgAdmin, but the problem still appears when I use the psql command.
Moreover, another query seems to have the same problem. This other query does not use the same source table as the first one; it uses 4 joined tables without any partitions. The common point between these queries is their structure.
The detail I omitted in the original post is that the purpose of both queries is to pivot a table. They look like this:
SELECT osr.id,
MIN(CASE osr.category
WHEN 123 THEN
1
END) AS field1,
MIN(CASE osr.category
WHEN 264 THEN
1
END) AS field2,
MIN(CASE osr.category
WHEN 975 THEN
1
END) AS field3,
...
FROM oracle_source_table osr
WHERE osr.category IN (123, 264, 975, ...)
GROUP BY osr.id;
Now that I have detailed what the queries look like, here are some results I got with the second one without changing the value of max_long (this query is lighter than the first one):
Sometimes it works (~10%) and sometimes it fails (~90%) in pgAdmin, but it never works with the psql command
If I remove the WHERE clause, it always works
I don't understand why removing the WHERE clause changes anything; the field used in that clause is a NUMBER(6, 0) between 0 and 2500 and it is still used in the SELECT clause... Oh, and in the 4 Oracle tables used by this query there is no LONG datatype, only NUMBER.
Among the 20 queries I have, only these two have a problem; their structure is similar and I don't believe in coincidences.
Don't despair!
Set the max_long option on the foreign table big enough that all your oversized data fit.
The documentation has the details:
max_long (optional, defaults to "32767")
The maximal length of any LONG, LONG RAW and XMLTYPE columns in the Oracle table. Possible values are integers between 1 and 1073741823 (the maximal size of a bytea in PostgreSQL). This amount of memory will be allocated at least twice, so large values will consume a lot of memory.
If max_long is less than the length of the longest value retrieved, you will receive the error message
ORA-01406: fetched column value was truncated
Example:
ALTER FOREIGN TABLE my_tab OPTIONS (ADD max_long '1000000');
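If the option is already present on the foreign table, use SET instead of ADD, and you can inspect the current options through the PostgreSQL catalogs; a sketch reusing the my_tab name from the example above:
ALTER FOREIGN TABLE my_tab OPTIONS (SET max_long '1000000');
-- List the options currently defined on the foreign table:
SELECT ftoptions
FROM pg_foreign_table ft
JOIN pg_class c ON c.oid = ft.ftrelid
WHERE c.relname = 'my_tab';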

supporting multiple languages in cassandra

I am analyzing Facebook data using Cassandra, and because of that I need to store text in multiple languages in one of my columns.
I am unable to insert non-English text into Cassandra:
<stdin>:1:'ascii' codec can't encode character u'\u010c' in position 51: ordinal not in range(128)
<stdin>:1:Invalid syntax at char 7623
I browsed through the Internet and found that I need to override the encoding (link),
but I am not sure how to configure this.
Note: a single row may contain text in multiple languages.
Your column seems to be of type ascii, which only supports US-ASCII-encoded text. If you need a wider range of characters, use varchar instead (see here for details on CQL types).
To change the column type, use this ALTER TABLE statement:
ALTER TABLE my_table ALTER my_column TYPE varchar;
See here for details on ALTER TABLE.
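For illustration, once the column is text/varchar the insert goes through; a minimal sketch with made-up keyspace and table names:
-- text and varchar are synonyms in CQL and store UTF-8 encoded strings.
CREATE TABLE my_keyspace.posts (id int PRIMARY KEY, body text);
INSERT INTO my_keyspace.posts (id, body) VALUES (1, 'Čeština, 中文 and English in one row');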

ORA-12704: Unable to convert character data

I am trying to perform SET operations in Oracle across remote databases.
I am using the MINUS operator.
My query looks something like this.
SELECT NAME FROM localdb MINUS SELECT NAME from remotedb#dblink
This throws an ORA-12704 error. I understand this warrants some kind of conversion or an NLS setting.
What should I try next?
The two name columns are stored in different characters sets. This could be because of their type definitions, or it could be because the two databases are using different character sets.
You might be able to get around this by explicitly converting the field from the remote database to the character set of the local one. Try this:
SELECT NAME FROM localdb MINUS SELECT TO_CHAR(NAME) from remotedb#dblink
It seems the types of the NAME columns in those two tables are different.
Make sure the NAME column in the remotedb table has exactly the same type as NAME in the localdb table. That is mandatory when you use a MINUS operator.
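To confirm where the difference comes from, you could compare the column metadata on both sides; a sketch, using the table names and database link exactly as written in the question:
SELECT column_name, data_type, character_set_name
FROM user_tab_columns
WHERE table_name = 'LOCALDB' AND column_name = 'NAME';

SELECT column_name, data_type, character_set_name
FROM user_tab_columns@dblink
WHERE table_name = 'REMOTEDB' AND column_name = 'NAME';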
