Impala Concurrent READ & Overwrite - hadoop

I noticed in one application that a concurrent READ (with INVALIDATE METADATA) and OVERWRITE of the same table causes the underlying files to become corrupt.
Is this a known scenario? I expected that while the table is being overwritten, a concurrent read would simply fail; it shouldn't be able to corrupt the table's underlying files.
Help will be appreciated!

If the files become corrupt, it shouldn't be caused by concurrent reads and writes. HDFS is a read/append-only filesystem and Impala will always write new files. When you insert, files are written to a staging directory which Impala will not read from until files are complete, at which point they are moved into the table/partition directory.
A few things to consider: if you run the insert independently of the select, are the files OK? What do you mean by "corrupt"? Does it work in Hive? What version of Impala are you running?
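For instance, to check whether the insert alone produces good files, the overwrite could be run in isolation and the table scanned afterwards from impala-shell. A minimal sketch (the table names are placeholders, not from the original question):
-- hypothetical tables, used only to run the overwrite in isolation
INSERT OVERWRITE TABLE mytable SELECT * FROM source_table;
-- make the newly moved files visible to this coordinator
REFRESH mytable;
-- if this scan already fails, the corruption is not caused by the concurrent read
SELECT count(*) FROM mytable;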

Related

SAS to HIVE2 Cloudera - Error trying to write

I get the following error while trying to write to the hive2 db:
ERROR: java.io.IOException: Could not get block locations. Source file "/tmp/sasdata-e1-...dlv - Aborting...block==null
The error appears when trying to write a new table or append rows to an existing table. I can connect to the db correctly (through a libname) and read tables from the schema, but when I try to create a new table, it gets created empty because the error above occurs.
Can someone help pls?
Thank you
Remember that Hive is mostly just a metadata store that helps you read files from HDFS. Yes, it does this through a database paradigm, but it's really operating on HDFS. Each table is created in an HDFS directory, and files are created there.
This sounds like you don't have write permission on the HDFS folder you are writing to (but you do have read permission).
To solve this problem you need to understand which user you are running as and where the data is being written.
If you are creating a simple table, check whether you can write to the Hive warehouse directory. If you are deliberately creating files in a specific HDFS folder, check that folder instead.
Here's a command to help you determine where the data is being written to.
show create table [mytable]
If it doesn't mention an HDFS location, you need to get permissions on the Hive warehouse directory (typically located at hdfs:/user/hive/warehouse, but the actual location is defined in $HIVE_HOME/conf/hive-default.xml if it's elsewhere).
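For example (a sketch in beeline/Hive; mytable is a placeholder and the LOCATION value will differ on your cluster):
SHOW CREATE TABLE mytable;
-- look for a line like: LOCATION 'hdfs://namenode:8020/user/hive/warehouse/mytable'
-- the effective warehouse directory can also be printed directly:
SET hive.metastore.warehouse.dir;
Once you know the directory, verify that the user running the load has write permission on it in HDFS.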

How to rollback truncate data in hive table

Unfortunately, I truncated a table in Hive and the trash got cleaned up. Is there any way to get the data back? Thanks.
There is a way to recover deleted file(s), but it's not recommended and, if not done properly, it can affect the cluster as well. Use this procedure with caution on a production system.
Here's a step-by-step description of how to recover accidentally deleted files:
https://community.hortonworks.com/articles/26181/how-to-recover-accidentally-deleted-file-in-hdfs.html

Can Hive table automatically update when underlying directory is changed

If I build a Hive table on top of some S3 (or HDFS) directory like so:
create external table newtable (name string)
row format delimited
fields terminated by ','
stored as textfile location 's3a://location/subdir/';
When I add files to that S3 location, the Hive table doesn't automatically update. The new data is only included if I create a new Hive table on that location. Is there a way to build a Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the Hive table automatically shows that data (without having to recreate the Hive table)?
On HDFS, each file is scanned each time the table is queried, as #Dudu Markovitz pointed out. And files in HDFS are immediately consistent.
Update: S3 is also strongly consistent now, so I removed the part about eventual consistency.
Also, there may be a problem with using statistics when querying the table after adding files; see here: https://stackoverflow.com/a/39914232/2700344
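One caveat, shown as a hedged sketch (the dt partition column below is made up; it is not part of the question's table): if the external table were partitioned, files landing in a new partition directory are not visible until that partition is registered in the metastore, for example:
-- register one new partition explicitly
ALTER TABLE newtable ADD IF NOT EXISTS PARTITION (dt='2023-01-01')
LOCATION 's3a://location/subdir/dt=2023-01-01/';
-- or let Hive discover all partition directories under the table location
MSCK REPAIR TABLE newtable;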
Everything #leftjoin says is correct, with one extra detail: S3 doesn't offer immediate consistency on listings. A new blob can be uploaded and HEAD/GET will return it, but a list operation on the parent path may not see it. This means that Hive code which lists the directory may not see the data. Using unique names doesn't fix this; only using a consistent DB like DynamoDB, which is updated as files are added/removed, does. Even there, you have added a new thing to keep in sync...

HBase truncate table

If I truncate a table in HBase, then:
1) Does it delete the data from the underlying HDFS system as well, or does it just mark the data with a deletion marker?
2) How can I make sure/verify that the data is also deleted from the underlying HDFS system?
There is currently no way to ensure that HBase table data is completely erased from the underlying filesystem. The HBase table's files may be deleted from HDFS, but that still just means that they are moved to the trash folder.
HBase tombstones data when it is deleted from tables, so scanning/getting rows does not return it and it cannot be read.
When a major compaction is run on the table, all the tombstoned data is deleted from HBase and HDFS (the native filesystem), which frees up disk space.

sqlldr corrupts my primary key after the first commit

SQL*Loader is corrupting my primary key index after the first commit of my ctl file's load. After that first commit, no matter what I set the rows value to in my control file, I get:
ORA-39776: fatal Direct Path API error loading table PE_OWNER.CLINICAL_CODE
ORA-01502: index 'PE_OWNER.CODE_PK' or partition of such index is in unusable state
SQL*Loader-2026: the load was aborted because SQL Loader cannot continue.
I'm using Oracle database and client 11.1.0.6.0.
I know the issue is not due to duplicate rows, because if I set the rows directive to a huge value, the index is not corrupted after sqlldr does a single commit for the entire file. This gives me a workaround, but it's still a little alarming...
Thanks for any guidance anyone can give.
I don't use SQL*Loader much on production tables, but from what I've read, you need to use a conventional path load.
From the SQL*Loader documentation:
When to Use a Conventional Path Load
If load speed is most important to you, you should use direct path load because it is faster than conventional path load. However, certain restrictions on direct path loads may require you to use a conventional path load. You should use a conventional path load in the following situations:
* When accessing an indexed table concurrently with the load, or when applying inserts or updates to a nonindexed table concurrently with the load
To use a direct path load (with the exception of parallel loads), SQL*Loader must have exclusive write access to the table and exclusive read/write access to any indexes.
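As a side note, after a failed direct path load the index is typically just left in the UNUSABLE state reported by ORA-01502 rather than physically damaged. A hedged Oracle SQL sketch, reusing the owner/index names from the error messages above, to confirm and repair it (re-running sqlldr with direct=false then selects the conventional path):
-- check whether the failed direct path load merely marked the index UNUSABLE
SELECT index_name, status FROM all_indexes
WHERE owner = 'PE_OWNER' AND index_name = 'CODE_PK';
-- rebuild it so later loads and queries can use it again
ALTER INDEX PE_OWNER.CODE_PK REBUILD;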
I believe the issue was that Oracle did not have time to rebuild the indices on the table in question, so I increased the batch commit size to a number larger than the number of records I was importing.
That fixed the issue.
