Hive/Impala column-comments cut off after several characters - hadoop

When I look at the column comments in our Data Lake (Hadoop; the comments were added during Parquet table creation with Hive or Impala), they are cut off after ~200 characters.
Might this be a global character setting in our Hadoop system or some Hive restriction? If not, is there a way to set the maximum string length for comments during table creation? Unfortunately, I have no admin access to the system itself and therefore only restricted insight.

Column comments are stored in Hive Metastore table COLUMNS_V2, in a column called COMMENT.
Currently, the size of that column is limited to 256 characters (see MySQL metastore schema definition for Hive version 3.0.0 for example).
In the upcoming 4.0 (?) version, it seems to have been expanded to varchar(4000), but the associated Hive JIRA-4921 is still listed as unresolved and doesn't mention a target release number.
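If you can reach the metastore's backing database (which needs the admin access you don't have), this is roughly how you could confirm the limit; the sketch assumes a MySQL-backed metastore and placeholder table names, and widening the column is unsupported and may be undone by schema upgrades:
-- In Hive or Impala, the (truncated) comment is visible with:
DESCRIBE FORMATTED my_db.my_table;
-- Against the metastore's MySQL database:
SHOW COLUMNS FROM COLUMNS_V2 LIKE 'COMMENT';   -- varchar(256) in stock schemas
-- Widening it (shown only as a sketch, not a recommendation):
ALTER TABLE COLUMNS_V2 MODIFY COLUMN `COMMENT` VARCHAR(4000);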

Related

Migration issue while using DMS. Incorrect junk data for empty columns

While migrating from MySQL to Oracle using the AWS DMS service, some huge column (mediumtext) values on the source side (MySQL DB instance) are empty for 75% of the rows in a table, whereas on the target (Oracle) they were migrated with some other value (not junk values). To me it looks like the column values are copied incorrectly between rows.
Wherever there are empty values in the source-side columns, some other data was copied. Around 75% of the table data for some of the CLOB columns with empty values on the source side is incorrectly mapped to some other data on the Oracle side. We used FULL LOB mode and 10000 KB as the chunk size.
Some questions or requests:
1. Could you share the table DDL from source and target? (A quick way to pull it is sketched below.)
2. Are you sure there is no workload running on the target that could change values in the table outside the DMS process?
3. Full LOB mode migrates LOBs in chunks. Why are we specifying such a high LOB chunk size? Also, do we not know the max LOB size, so that limited LOB mode could be used?
4. Could you paste the task ARN here? I work for AWS DMS and can look to see what is going on. Once I find the root cause, I will also make sure I post an analysis here for all Stack Overflow users.
Let me know.
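For point 1, one quick way to pull the DDL on each side (schema and table names are placeholders):
-- MySQL (source):
SHOW CREATE TABLE myschema.mytable;
-- Oracle (target):
SELECT DBMS_METADATA.GET_DDL('TABLE', 'MYTABLE', 'MYSCHEMA') FROM dual;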

Space utilization in DBMS_REDEFINITION package in oracle

I am trying to create partitions on pre-existing tables in my application. While doing so, I am running into space issues in the schema's default tablespaces.
Could someone please explain how the DBMS_REDEFINITION package works in terms of space utilization? Does it require extra space to perform the task? If yes, why? And is that temporary, i.e. does it release the space after completing the redefinition?
Prompt comments on this would be highly appreciated.
When you use the DBMS_REDEFINITION package to redefine a table, you need twice the space used by your table, because DBMS_REDEFINITION copies all data from the old table to the new (interim) table. You have to drop the old table manually after a successful redefinition.
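A minimal sketch of the usual flow, assuming a source table MY_TABLE in schema MYSCHEMA and a pre-created interim table MY_TABLE_PART with the desired (e.g. partitioned) layout; all names are made up for illustration:
DECLARE
  num_errors PLS_INTEGER;
BEGIN
  -- Check that the table can be redefined (here: using the primary key).
  DBMS_REDEFINITION.CAN_REDEF_TABLE('MYSCHEMA', 'MY_TABLE',
                                    DBMS_REDEFINITION.CONS_USE_PK);

  -- Copies all existing rows into the interim table: this is where the
  -- roughly doubled space is consumed.
  DBMS_REDEFINITION.START_REDEF_TABLE('MYSCHEMA', 'MY_TABLE', 'MY_TABLE_PART');

  -- Recreate indexes, triggers, constraints and grants on the interim table;
  -- the copied indexes also take extra space while both copies exist.
  DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS('MYSCHEMA', 'MY_TABLE', 'MY_TABLE_PART',
                                          DBMS_REDEFINITION.CONS_ORIG_PARAMS,
                                          TRUE, TRUE, TRUE, FALSE, num_errors);

  -- Swap the definitions: MY_TABLE gets the new layout, MY_TABLE_PART ends up
  -- with the old one.
  DBMS_REDEFINITION.FINISH_REDEF_TABLE('MYSCHEMA', 'MY_TABLE', 'MY_TABLE_PART');
END;
/
-- Dropping the interim table afterwards releases the duplicated space.
DROP TABLE MY_TABLE_PART;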

Why do we need to move an external table to a managed Hive table?

I am new to Hadoop and learning Hive.
In Hadoop: The Definitive Guide, 3rd edition, page 428, last paragraph,
I don't understand the paragraph below regarding external tables in Hive:
"A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table."
Can anybody briefly explain what the above phrase says?
Usually the data in the initial dataset is not structured in the optimal way for queries.
You may want to modify the data (e.g. modify some columns, add columns, compute aggregations) and store it in a specific way (partitioned / bucketed / sorted, etc.) so that queries benefit from these optimizations.
The key difference between external and managed table in Hive is that data in the external table is not managed by Hive.
When you create an external table you define an HDFS directory for that table, and Hive simply "looks" into it and can read data from it, but Hive can't delete or change the data in that folder. When you drop an external table, Hive only deletes the metadata from its metastore and the data in HDFS remains unchanged.
A managed table is basically a directory in HDFS that is created and managed by Hive. Moreover, all operations that remove or change partitions, raw data, or the table itself MUST be done through Hive, otherwise the metadata in the Hive metastore may become incorrect (e.g. you manually delete a partition from HDFS but the Hive metastore still says the partition exists).
In the Definitive Guide, I think the author meant that it is a common practice to write an MR job that produces some raw data and keeps it in some folder. Then you create a Hive external table that looks into that folder, and you can safely run queries without the risk of dropping the data, etc.
In other words, you can have an MR job that produces some generic data and then use a Hive external table as a source of data for inserts into managed tables. It helps you avoid writing lots of similar, boring MR jobs and delegate this task to Hive queries: you create a query that takes data from the external table, aggregates/processes it however you want, and puts the result into managed tables.
Another use of external tables is as a source for data from remote servers, e.g. in CSV format.
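A minimal HiveQL sketch of the pattern from the book; the directory, table names and columns are made up for illustration:
-- External table over a directory produced by another process (e.g. an MR job).
CREATE EXTERNAL TABLE staging_events (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/incoming/events';

-- Managed table in a query-friendly layout (partitioned, Parquet).
CREATE TABLE events (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- The "Hive transform" step: reshape the data into the managed table.
INSERT INTO TABLE events PARTITION (dt = '2024-01-01')
SELECT event_time, user_id, payload
FROM staging_events;

-- Dropping the external table removes only its metadata; the files under
-- /data/incoming/events remain in HDFS.
DROP TABLE staging_events;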
There is no reason to move a table to managed unless you are going to enable ACID or other features supported only for managed tables.
The list of differences in features supported by managed/external tables may change in the future, so it is better to check the current documentation. Currently these features are:
ARCHIVE/UNARCHIVE/TRUNCATE/MERGE/CONCATENATE only work for managed tables
DROP deletes data for managed tables, while it only deletes metadata for external ones
ACID/Transactional only works for managed tables
Query Results Caching only works for managed tables
Only the RELY constraint is allowed on external tables
Some Materialized View features only work on managed tables
You can create both EXTERNAL and MANAGED tables on top of the same location, see this answer with more details and tests: https://stackoverflow.com/a/54038932/2700344
Data structure has nothing in common with the external/managed table type. If you want to change the structure, you do not necessarily need to change the table's managed/external type.
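To make the DROP difference and the same-location point concrete, a small sketch; the path and table names are made up, and it assumes your Hive version allows an explicit LOCATION on a managed table:
-- Two tables over the same HDFS directory:
CREATE EXTERNAL TABLE raw_events_ext (line STRING)
LOCATION '/data/shared/events';

CREATE TABLE raw_events_mgd (line STRING)
LOCATION '/data/shared/events';

-- Dropping the external table removes only its metastore entry;
-- the files under /data/shared/events stay in HDFS.
DROP TABLE raw_events_ext;

-- Dropping the managed table would also delete the files under that location,
-- pulling the data out from under any other table that points at it.
-- DROP TABLE raw_events_mgd;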
It is also mentioned in the book:
When your table is an external table, you can use other technologies like Pig, Cascading or MapReduce to process it.
You can also use multiple schemas for that dataset.
And you can create the data lazily if it is an external table.
When you decide that the dataset should be used only by Hive, make it a Hive managed table.

Best way to add column with default value while under load

When adding a column that has a default value and a NOT NULL constraint to a table, is it better to run it as a single statement or to break it into steps while the database is under load?
ALTER TABLE user ADD country VARCHAR2(4) DEFAULT 'GB' NOT NULL;
VERSUS
ALTER TABLE user ADD country VARCHAR2(2);
UPDATE user SET country = 'GB';
COMMIT;
ALTER TABLE user MODIFY country DEFAULT 'GB' NOT NULL;
Performance depends on the Oracle version you use; locks are taken either way.
If your version is <= Oracle 11.1, then #1 does the same work as #2 and is slow either way.
Beginning with Oracle 11.2, Oracle introduced a great optimization for the first statement (one command doing it all). You don't need to change the command; Oracle just behaves differently: it stores the default value only in the data dictionary instead of updating each physical row.
But I also have to say that I encountered some bugs related to this feature in the past (in Oracle 11.2.0.1):
failure of traditional import if export was done with direct=Y
merge statement can throw an ORA-600 [13013] (internal oracle error)
a performance problem in queries using such tables
I think these issues are fixed in the current version, 11.2.0.3, so I can recommend using this feature.
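If you are stuck on a pre-11.2 version and take the multi-step route, a common variation is to backfill in batches so that locks and undo stay small. A minimal sketch, using a hypothetical USERS table and an arbitrary batch size:
ALTER TABLE users ADD country VARCHAR2(2);

BEGIN
  LOOP
    -- Fill at most 10,000 not-yet-populated rows per pass, then commit.
    UPDATE users SET country = 'GB'
    WHERE country IS NULL AND ROWNUM <= 10000;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
END;
/

ALTER TABLE users MODIFY country DEFAULT 'GB' NOT NULL;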
Some time ago we evaluated possible solutions to the same problem. On our project we had to remove all indexes on the table, perform the alteration, and then restore the indexes.
If your system needs to keep using the table, then DBMS_REDEFINITION is really your only choice.

How is the TRUNCATE command in Oracle able to retrieve the structure of a table after dropping it?

The SQL command TRUNCATE in Oracle is faster than DELETE FROM table in that the TRUNCATE command first drops the specified table in its entirety and then creates a new table with the same structure (clarification may be required in case I am wrong). Since TRUNCATE is a DDL statement, it implicitly issues a COMMIT before and after its execution. If that is the case, then the table dropped by the TRUNCATE command is lost permanently, with its entire structure, from the data dictionary. In such a scenario, how is the TRUNCATE command able to first drop the table and then recreate it with the same structure?
(Note that I work for Sybase in SQL Anywhere engineering and my answer comes from my knowledge of how truncate is implemented there, but I imagine it's similar in Oracle as well.)
I don't believe the table is actually dropped and re-created; the contents are simply thrown away. This is much faster than delete from <table> because no triggers need to be executed, and rather than deleting a row at a time (both from the table and the indexes), the server can simply throw away all pages that contain rows for that table and any indexes.
I thought a truncate (amongst other things) simply resets the High Water Mark.
See: http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/statements_10007.htm#SQLRF01707
However, in
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:2816964500346433991
it is clear that the data segment changes after a truncate.
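One way to observe both points, assuming a scratch table you can create and truncate (the name is made up): OBJECT_ID identifies the table definition and DATA_OBJECT_ID the underlying segment, so comparing them before and after the truncate shows that the definition survives while the segment is reset:
CREATE TABLE trunc_demo (n NUMBER);
INSERT INTO trunc_demo VALUES (1);
COMMIT;

SELECT object_id, data_object_id
FROM   user_objects
WHERE  object_name = 'TRUNC_DEMO';

TRUNCATE TABLE trunc_demo;

-- OBJECT_ID is unchanged (the structure was never dropped), while
-- DATA_OBJECT_ID typically changes, showing the data segment was reset.
SELECT object_id, data_object_id
FROM   user_objects
WHERE  object_name = 'TRUNC_DEMO';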
