Hive managed table drop doesn't delete files on HDFS. Any solutions? - hadoop

When I drop managed tables in Hive (on Azure Databricks), the associated files on HDFS are not removed. I am getting the following error:
[Simba]SparkJDBCDriver ERROR processing query/statement. Error Code: 0, SQL state: org.apache.spark.sql.AnalysisException: Can not create the managed table('`schema`.`XXXXX`'). The associated location('dbfs:/user/hive/warehouse/schema.db/XXXXX) already exists
This issue is occurring intermittently. Looking for a solution to this.

I've started hitting this too. It was fine for the last year, so I think something is going on with the storage attachment. Perhaps enhancements going on in the background are causing issues (PaaS!). As a safeguard I'm manually deleting the directory path as well as dropping the table until I can get a decent explanation of what's going on or get a support call answered.
Use
dbutils.fs.rm("dbfs:/user/hive/warehouse/schema.db/XXXXX", true)
Be careful with that, though! Get the path wrong and it could be tragic!
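One way to reduce the risk of a wrong path is to validate it before calling the delete. The following is only a sketch: the warehouse root is taken from the error message in the question, the helper names `is_safe_table_path` and `remove_table_dir` are made up for illustration, and the actual delete function is passed in so the check can be tried outside a Databricks notebook.

```python
from typing import Callable

# Warehouse root as it appears in the error message; adjust if yours differs.
WAREHOUSE_ROOT = "dbfs:/user/hive/warehouse/"


def is_safe_table_path(path: str, root: str = WAREHOUSE_ROOT) -> bool:
    """Return True only for paths shaped like <root><database>.db/<table>."""
    if not path.startswith(root):
        return False
    parts = path[len(root):].rstrip("/").split("/")
    # Expect exactly two components: "<database>.db" and "<table>".
    if len(parts) != 2:
        return False
    db_dir, table = parts
    if not db_dir.endswith(".db") or len(db_dir) <= len(".db"):
        return False
    # Reject empty names, wildcards and parent-directory references.
    return bool(table) and table not in (".", "..") and "*" not in table


def remove_table_dir(path: str, rm: Callable[[str, bool], object]) -> None:
    """Delete a table directory recursively, but only if the path looks sane."""
    if not is_safe_table_path(path):
        raise ValueError(f"refusing to delete unexpected path: {path!r}")
    rm(path, True)  # second argument requests a recursive delete
```

In a notebook this would be called as `remove_table_dir("dbfs:/user/hive/warehouse/schema.db/XXXXX", dbutils.fs.rm)`. Anything that is not exactly one table directory under one `.db` directory is refused.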

Sometimes the metadata (the schema info of the Hive table) itself gets corrupted. Whenever we then try to drop the table we get errors, because Spark checks for the existence of the table before deleting it.
We can avoid that by using the Hive client to drop the table, as it skips the check for the table's existence.
Please refer to this wonderful Databricks documentation

Related

Pentaho Dimension Lookup/Update deadlock error

I have a Dimension Lookup/update step, and I am trying to update a table with data from JSON files, but it is failing with the following error:
2021/08/03 12:51:58 - dlu-insrting_in_table.0 - ERROR (version 9.1.0.0-324, build 9.1.0.0-324 from 2020-09-07 05.09.05 by buildguy) : Because of an error this step can't continue:
2021/08/03 12:51:58 - dlu-insrting_in_table.0 - Couldn't get row from result set
2021/08/03 12:51:58 - dlu-insrting_in_table.0 - Transaction (Process ID 78) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
This is the configuration of the Dimension Lookup/update step.
and this is part of the transformation
If I use only one copy to start the step, everything works fine, but if I use more than one it gives me the error above. The error seems to be random: sometimes it crashes after inserting two rows, other times it inserts everything without any error.
Searching the documentation and the internet didn't help much, and I was not able to fix it. I read that it could be an insertion order problem or a primary key related problem, but the data is fine (the keys are unique) and the configuration of the step seems OK. What I noticed is that it does not insert the technical key in order. I think this is because it depends on which process finishes first, but I can't find a way to force the order (assuming this is the problem).
Does anyone know what the problem is here, and how I could fix it? Thank you.
Don't run multiple copies of the Lookup/Update step. It has a commit size of 100, so if you have 2 copies of the step you have two threads concurrently trying to update the same table. Most likely one of them is locking the table (or a block of rows) that the other is trying to write, and that lock contention ends in the deadlock you see.
Why does it sometimes crash and sometimes work? It's actually a bit random: each copy receives a batch of rows to act upon, and it depends on which rows are sent to each copy and how many updates are required.
So I finally managed to solve the problem. Strictly speaking it was not related to Pentaho but to SQL Server. In particular, I had to redefine the index on the table into which the rows were inserted. You will find more details in this answer: Insert/Update deadlock with SQL Server

How to manually corrupt the Oracle CLOB data

I'm wondering if there's any way to manually corrupt the CLOB data for testing purpose.
I can find the steps for intentional block corruption, but can't find anything for individual data in a table. Can anyone help me with this?
Below is what I'm trying to do, and I need help with step 1:
1. Prepare the corrupted CLOB data
2. Run expdp and get the ORA-01555 error
3. Test whether my troubleshooting procedure works OK
Some background:
DB: Oracle 12.2.0.1 SE2
OS: Windows Server 2016
The app we're using (from a 3rd party) seems to occasionally corrupt the CLOB data when a certain type of data gets inserted in a table. We don't know what triggers it. The corruption doesn't affect the app's function, but leaving it unfixed gives the following error when running expdp for the daily backup:
ORA-01555: snapshot too old: rollback segment number
The CLOB consists of a mix of alphanumeric characters and line breaks. It gets inserted by the app; no manual insert takes place.
Fixing/replacing the app isn't an option, so we've got a fixing procedure instead.
I took this over from another engineer (who has already left), but since then the app has been working happily and no problem has occurred so far. I want to test run the fixing procedure in the DEV environment, but the app doesn't reproduce the problem for me.
So I wondered whether I could manually prepare a "broken" CLOB for testing purposes.
So this looks like it is caused by a known bug:
https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=364607910994084&parent=DOCUMENT&sourceId=833635.1&id=787004.1&_afrWindowMode=0&_adf.ctrl-state=3xr6s00uo_200
https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=364470181721910&id=846079.1&_afrWindowMode=0&_adf.ctrl-state=3xr6s00uo_53
https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=364481844925661&id=833635.1&_afrWindowMode=0&_adf.ctrl-state=3xr6s00uo_102
The main point here is that the corruption isn't caused by anything inherent in the data, but is more likely caused by something like concurrent access to the LOB by multiple updates (application or end-user behavior), or just by apparently random chance. As such, I doubt that there's any way for you to easily force this condition in order to validate your test for it.

Hive "add partition" concurrency

We have an external Hive table that is used for processing raw log file data. The files are hourly, and are partitioned by date and source host name.
At the moment we are importing files using simple python scripts that are triggered a few times per hour. The script creates sub folders on HDFS as needed, copies new files from the temporary local storage and adds any new partitions to Hive.
Today, new partitions are created using "ALTER TABLE ... ADD PARTITION ...". However, if another Hive query is running on the table it will be locked, which means that the add partition command will fail (if the query runs for long enough) since it requires an exclusive lock.
An alternative to this approach would be to use "MSCK REPAIR TABLE", which for some reason does not seem to acquire any locks on the table. However, I have gotten the impression that using repair table is not recommended for a production setting.
What is the best practice for adding Hive partitions programmatically in a concurrent environment?
What are the risks or disadvantages of using MSCK REPAIR TABLE?
Is there an explanation for the seemingly inconsistent locking behaviour of the two partition adding commands? I.e. do they have different effects on running queries?
Not a good answer, but we have the same issue and here are our findings:
In the Hive doc, https://cwiki.apache.org/confluence/display/Hive/Locking , locks seem pretty sensible: an "ADD PARTITION" will request an exclusive lock on the created partition, and a shared lock on the whole table. A SELECT query will request a shared lock on the table. So it should be fine.
However, it does not work this way, at least in CDH 5.3. According to this thread, https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/u7aM9W3pegM , this is a known behavior, probably new (I am not sure, but like the author of that thread I think the issue was not there on CDH 4.7).
So basically, we're still thinking about our partition strategy, but we will probably try to create all possible partitions in advance (before getting the data), as we know precisely the values of all future partitions (might not be the case for you).
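Creating the partitions in advance could look roughly like the sketch below, which only builds the statements and leaves running them to whatever Hive client the import scripts already use. The table name `raw_logs` and the partition columns `dt` and `host` are placeholders, not names from the question. The statements carry no LOCATION clause, so they assume the folders follow Hive's default layout under the table location.

```python
from datetime import date, timedelta
from typing import List, Sequence


def partition_statements(
    table: str,
    start: date,
    days: int,
    hosts: Sequence[str],
    date_col: str = "dt",      # hypothetical partition column names
    host_col: str = "host",
) -> List[str]:
    """Build one ADD PARTITION statement per (day, host) combination."""
    statements = []
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        for host in hosts:
            # IF NOT EXISTS makes the statement safe to re-run.
            statements.append(
                f"ALTER TABLE {table} ADD IF NOT EXISTS "
                f"PARTITION ({date_col}='{day}', {host_col}='{host}')"
            )
    return statements


if __name__ == "__main__":
    for stmt in partition_statements("raw_logs", date(2015, 1, 31), 2, ["web01", "web02"]):
        print(stmt + ";")
```

Running this once during a quiet window (for example for the next month) means the hourly import only has to copy files, so it no longer competes with long-running queries for the lock.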

Oracle: Why are all rows in all my tables duplicated?

I am experiencing a strange problem with an Oracle DB, and would like to ask if anyone has experienced a similar problem:
In ALL of my tables (user_tables) EVERY SINGLE row has been duplicated.
What kind of action could have caused such a thing?
Can I restore the previous state without cleaning every table "by hand"?
I can imagine many actions which could lead to this situation (for example running imp twice), but it doesn't matter. You should simply prevent such duplicates by means of Unique/Primary keys.
As for restoring the previous state, you may want to read about the flashback query feature:
http://docs.oracle.com/cd/B13789_01/appdev.101/b10795/adfns_fl.htm

Dropping a table partition avoiding the error ORA-00054

I need your opinion in this situation. I’ll try to explain the scenario. I have a Windows service that stores data in an Oracle database periodically. The table where this data is being stored is partitioned by date (Interval-Date Range Partitioning). The database also has a dbms_scheduler job that, among other operations, truncates and drops older partitions.
This approach has been working for some time, but recently I had an ORA-00054 error. After some investigation, the error was reproduced with the following steps:
1. Open one sqlplus session, disable auto-commit, and insert data in the partitioned table, without committing the changes;
2. Open another sqlplus session and truncate/drop an old partition (DDL operations are automatically committed, if I'm not mistaken). We will then get the ORA-00054 error.
There are some constraints worth mentioning:
I don't have DBA access to the database;
This is a legacy application and a complete refactoring isn't feasible.
So, in your opinion, is there any way of dropping these old partitions without the risk of running into an ORA-00054 error and without the intervention of the DBA? I can just delete the data, but the number of empty partitions will grow every day.
Many thanks in advance.
This error means somebody (or something) is working with the data in the partition you are trying to drop. That is, the lock is granted at the partition level. If nobody was using the partition your job could drop it.
Now you say this is a legacy app and you don't want to, or can't, refactor it. Fair enough. But there is clearly something not right if you have a process which is zapping data that some other process is using. I don't agree with @tbone's suggestion of just looping until the lock is released: you can't just get rid of data which somebody is using without establishing why they are still working with data that they apparently should not be using.
So, the first step is to find out what the locking session is doing. Why are they still amending this data your background job wants to retire? Here's a script which will help you establish which session has the lock.
Except that you "don't have DBA access to the database". Hmmm, that's a curly one. Basically this is not a problem which can be resolved without DBA access.
It seems like you have several issues to deal with. Unfortunately for you, they are political and architectural rather than technical, and there's not much we can do to help you further.
How about wrapping the truncate or drop in PL/SQL that tries the operation in a loop, waiting x seconds between tries, up to a maximum number of tries? Then use dbms_scheduler to call that procedure/function.
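The control flow of such a loop, sketched in Python rather than PL/SQL just to show the shape of it. `ResourceBusy` is a stand-in for the ORA-00054 error (a real driver raises its own exception type), and `operation` is whatever issues the truncate or drop; both names are made up for the example.

```python
import time
from typing import Callable


class ResourceBusy(Exception):
    """Stand-in for ORA-00054 (resource busy); hypothetical, for illustration."""


def retry_ddl(
    operation: Callable[[], None],
    max_tries: int = 5,
    wait_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run operation until it succeeds; return the number of attempts used.

    Re-raises ResourceBusy if the last allowed attempt still fails.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            operation()
            return attempt
        except ResourceBusy:
            if attempt == max_tries:
                raise  # give up and let the scheduler log the failure
            sleep(wait_seconds)  # wait before the next try
    raise AssertionError("unreachable")
```

The loop only retries the "resource busy" case and lets every other error through, so a genuine problem with the partition is not silently retried.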
Maybe this can help. It seems to be the same issue as the one you describe.
(ignore the comic sans, if you can) :)
