If I will truncate table from HBase, then
1) Does it deletes data from underlying HDFS system also or it just marks data with deletion marker ?
2) How can I make sure/verify that data is also deleted from underlying HDFS system ?
There is currently no way to ensure that HBase table data is completely erased from the underlying filesystem. The HBase table's files may be deleted from HDFS, but that still just means that they are moved to the trash folder.
Hbase tombstones data when they are deleted from tables, so scanning/getting rows does not return them and they cannot be read.
When a Major compaction is run on the table all the tombstoned data is deleted from Hbase and HDFS (native fileSystem) and frees up disk.
Related
I'm new to hive and read about it online too. But still having doubts which are not cleared.
for hive external tables, hive keep table's metadata within HDFS, but not in its warehouse which is also in HDFS. correct ?
whether its internal or external table, in both cases data of table will be available in HDFS only but NOWHERE else. Mean to say, data can taken from anywhere but has to be loaded in HDFS, because HIVE uses hadoop's processing engine to process data. Correct ?
internal table, table's metadata and table's data both will be available in HIVE's data warehouse, and this data warehouse will be at nowhere else but in HDFS only. correct ?
in external table, table's metadata and table's data both will be NOT available in HIVE's data warehouse but in HDFS. But hive must be keeping some info with itself that where is table's metadata located and where is its data located in HDFS, correct ?
Can anyone share feedback to above understanding ?
THanks
Hive uses relational database like MySQL, MariaDB, PostgreSQL, Oracle, DerbyDB(for embedded deployment only) for storing metadata (databases, tables definitions, statistics, grants, etc). See deployment modes and database requirements. Does not matter Internal or external table, the metadata are stored in the relational database.
Yes, the data is stored in HDFS, but also Hive supports integration with external databases using JDBC storage handler. Such table looks like normal Hive table, but the data is stored in some database, your queries executed in the database, predicate push-down works, you can use hive native tables with storage handler tables in single query. Also HBase storage handler is available, Kafka storage handler, etc, you can write your own storage handler.
Depending on your Hive version/vendor It is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS. Though Cloudera prefers to have managed tables in dedicated HDFS location for them, see https://stackoverflow.com/a/67073849/2700344 and does not allow to specify location for managed tables outside the warehouse root by default. Read abot the difference between managed and external tables here.
Everything seems correct except last one. When you create external table table metadata will be stored in the Hive otherwise you can not query through hive. HDFS itself keeps control of your data when you create external table. While when you create internal table Hive will be responsible. Dropping internal table drops your data and metadata but dropping external table only drops metadata from Hive. But your data will be remain inside of your file system. Thats why we are changing table types a lot as a workaround when some of our external connection is not compatible with our hive version.
I have a hive external table which creates partitions on a daily basis dynamically.
In order to free up the memory space, I'm planning to delete some of the files from hdfs.
Will the removal of files from hdfs also removes partitions of the corresponding hive table ? (or) Do we need to explicitly remove the partitions of the hive table?
You have to remove the partition separately. I recommend to delete the partition first using hive command and then delete the files.
I have an use case which required around 200 hive parquet table.
I need to load these parquet table from flat text files. But we can not directly load parquet table from flat text file.
So I am using following approach
Created a temporary managed text table.
Loaded temp table with text data.
Created external parquet table.
Loaded parquet table with text table using select query.
Dropped text file for temporary text table (but keep table in metastore).
As this approach is keeping temporary metadata (for 200 tables) in metastore. So I have second approach is that I will drop temporary text table too along with text files from hdfs. And next time re-create temporary table and delete once parquet get created.
Now, as I need to follow above steps for all 200 tables for every 2 hours, So will creating and deleting tables from metastore impact anything in cluster during production?
Which approach can impact production, keeping temporary metadata in metastore, creating and deleting tables (metadata) from hive metastore?
Which approach can impact production, keeping temporary metadata in
metastore, creating and deleting tables (metadata) from hive
metastore?
No, there is no impact, the backend of the HiveMetastore should be able to handle 200 * n Changes per hour easily. If you're unsure start with 50 tables and monitor the backend DB performance.
I am fairly new to Hadoop (HDFS and Hbase) and Hadoop Eco system (Hive, Pig, Impala etc.). I have got a good understanding of Hadoop components such as NamedNode, DataNode, Job Tracker, Task Tracker and how they work in tandem to store the data in efficient manner.
While trying to understand fundamentals of data access layer such as Hive, I need to understand where exactly a table’s data (created in Hive) gets stored? We can create external and internal table in Hive. As external tables can be in HDFS or any other file system, Hive doesnt store data for such tables in warehouse. What about internal tables? This table will be created as a directory on one of the data nodes on Hadoop Cluster. Once we load data in these tables from local or HDFS file system, are there further files getting created to store data in tables created in Hive?
Say for example:
A sample file named test_emp_feedback.csv was brought from local file system to HDFS.
A table (emp_feedback) was created in Hive with a structure similar to csv file structure. This lead to creation of a directory in Hadoop cluster say /users/big_data/hive/emp_feedback
Now once I create the table and load data in emp_feedback table from test_emp_feedback.csv
Is Hive going to create a copy of file in emp_feedback directory? Wont it cause data redundancy?
Creating a Managed table will create a directory with Same name as table name at Hive warehouse directory(Usually at /user/hive/warehouse/dbname/tablename).Also the table structure(Hive Metadata) is created in the metastore(RDBMS/HCat).
Before you load the data on the table, this directory(with the same name as table name under hive warehouse) is empty.
There could be 2 possible scenarios.
If the table is external the data is not copied to warehouse directory at all.
If the table is managed(not external), when you load your data to the table it is moved(not Copied) from current HDFS location to Hive warehouse directory9/user/hive/warehouse//). So this will not replicate the data.
Caution: It is always advisable to create external table unless the data is only used by hive. Dropping a managed table would delete the data from HDFS(Warehouse of HIVE).
HadoopGig
To answer you Question :
For External Tables:
Hive does not move the data into its warehouse directory. If the external table is dropped, then the table metadata is deleted but not the data.
For Internal tables
Hive moves data into its warehouse directory. If the table is dropped, then the table metadata and the data will be deleted.
For your reference
Difference between Internal & External tables:
For External Tables
External table stores files on the HDFS server but tables are not linked to the source file completely.
If you delete an external table the file still remains on the HDFS server.
As an example if you create an external table called “table_test” in HIVE using HIVE-QL and link the table to file “file”, then deleting “table_test” from HIVE will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.
Meta data is maintained on master node, and deleting an external table from HIVE only deletes the metadata not the data/file.
For Internal Tables
Stored in a directory based on settings in hive.metastore.warehouse.dir, by default internal tables are stored in the following directory /user/hive/warehouse you can change it by updating the location in the config file.
Deleting the table deletes the metadata and data from master-node and HDFS respectively.
Internal table file security is controlled solely via HIVE. Security needs to be managed within HIVE, probably at the schema level (depends on organization).
Hive may have internal or external tables, this is a choice that affects how data is loaded, controlled, and managed.
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schema (tables or views) at a single data set or if you are iterating through various possible schema.
Hive should not own data and control settings, directories, etc., you may have another program or process that will do those things.
You are not creating table based on existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
Source:
HDInsight: Hive Internal and External Tables Intro
Internal & external tables in Hadoop- HIVE
It would not cause data redundancy. For managed (not external) tables Hive moves the data into its warehouse directory. In your example, the data will be moved from original location on HDFS to '/users/big_data/hive/emp_feedback'.
Be careful with the removal of the managed table, it will lead to removal data on HDFS also.
You can send data in two days
A) use LOAD DATA INPATH 'file_location_of_csv' INTO TABLE emp_feedback;
Note that this command will remove content at source directory and create a internal table
OR)
B) Use copyFromLocal or put command to copy local file into HDFS and then create external table and copy the data into table. Now data won't be moved from source. You can drop external table but still source data is available.
e.g.
create external table emp_feedback (
emp_id int,
emp_name string
)
location '/location_in_hdfs_for_csv file';
When you drop an external table, it only drops the meta data of HIVE table. Data still exists at HDFS file location.
Got it. This is what I was able to understand so far.
It all depends upon which type of table is being created and where from the file is picked up. Below are possible use cases
enter image description here
I have data stored in hdfs in the below format and inserted this data in impala partition table using "alter table add partition" command.
/user/impala/subscriber_data/year=2013/month=10/day=01
/user/impala/subscriber_data/year=2013/month=10/day=02
and everything is working fine.
Now I have a new data with month and year as 10 and 01. Now I need to process this data and append this data into existing hdfs directory(year=2013/month=10/day=01).
When I try to process and insert into hdfs directory, its giving error as output directory already exists.
Is there any way to append the new data into existing hdfs directory without deleting the existing directory?
Also, how to insert the new data into existing partition using impala? (I have only table with partition on year,month,day).
to insert into existing partition, you have to drop the existing partition, and add it back with all the files that make up that partition including your new data.