Which is better, in terms of performance and long-term operation, for maintaining loaded data: managed or external tables?
By maintaining, I mean that these tables will frequently undergo the following operations on a daily basis:
Select using partitions most of the time, although some queries do not use them.
Delete specific records rather than a whole partition (for example, when a problem is found in some columns and the rows need to be deleted and inserted again). I am not sure whether this is supported for normal tables unless transactional tables are used.
Most important, files need to be merged frequently, perhaps twice a day, to combine small files and reduce the number of mappers. I know concatenate is available on managed tables and insert overwrite on external tables. Which one costs less?
It depends on your use case. External tables are recommended when the data is used across multiple applications, for example when Pig or another application processes the data alongside Hive. They are used mainly when you are reading data.
With managed tables, Hive has complete control over the data. You can, however, convert a managed table to an external one and vice versa:
alter table table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
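The statement above marks the table as external. A minimal sketch of both directions follows, reusing the placeholder name table_name from the statement above. Behavior can differ between Hive versions and configurations, so treat this as an illustration rather than a guaranteed recipe.

```sql
-- Managed -> external: Hive keeps the metadata, but dropping the table
-- no longer deletes the underlying files.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');

-- External -> managed: Hive takes ownership of the data again.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');

-- Inspect the result: the "Table Type" row of the output shows
-- MANAGED_TABLE or EXTERNAL_TABLE.
DESCRIBE FORMATTED table_name;
```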
In your case you are modifying the data frequently, so it is better for Hive to have total control over it. In this scenario, managed tables are recommended.
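The operations mentioned in the question can be sketched as follows. The table names (sales, sales_acid), columns, and partition value are hypothetical. The row-level delete assumes ACID transactions are enabled on the cluster, and older Hive versions additionally require transactional tables to be bucketed.

```sql
-- Merge small files inside one partition. CONCATENATE is supported
-- for tables stored as ORC or RCFile.
ALTER TABLE sales PARTITION (dt='2020-01-01') CONCATENATE;

-- Alternative: rewrite the partition from itself, which also
-- produces fewer, larger files but runs a full query job.
INSERT OVERWRITE TABLE sales PARTITION (dt='2020-01-01')
SELECT id, amount
FROM sales
WHERE dt = '2020-01-01';

-- Row-level deletes need a transactional (ACID) table stored as ORC.
CREATE TABLE sales_acid (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Delete specific records without touching the rest of the partition.
DELETE FROM sales_acid
WHERE dt = '2020-01-01' AND id = 42;
```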
Apart from that, managed tables are more secure than external tables, because external tables can be accessed by anyone who can reach the underlying files. With managed tables you can implement Hive-level security, which provides better control, whereas with external tables you have to implement HDFS-level security.
You can refer to the link below for a few pointers to consider:
External vs. managed tables comparison
We have transaction tables in Oracle, and for reporting purposes we need this data transferred in real time to a flat Oracle table in another database. The report performs very well when the data is placed in this flat table.
Currently we use GoldenGate for replication to the other database and a materialized view on top of it, but due to some problems we need to switch to another way of populating and maintaining this flat table. What options do we have?
It is a pretty basic requirement, but the solutions I can find are for batch processing. I would also welcome any other solutions you feel would better serve this purpose. Changing the target database to something else is also an option, as more such reports may be coming.
I earlier built a project that stored the daily data of a particular entity in an RDBMS by creating a separate table for each day and storing that day's data in it.
Now I want to move my database from the RDBMS to HBase. My question is whether I should create a single table and store the data for all days in it, or keep my earlier approach of creating an individual table for each day. I want to compare both cases in terms of HBase performance.
Sorry if the question seems foolish. Thank you.
As you mentioned, there are 2 options:
Option 1: a single table for all days' data
Option 2: multiple tables
I would prefer option 2 combined with namespaces (an important feature introduced in version 0.96) if you have a huge amount of data for a single day. This also supports multi-tenancy requirements.
See the HBase book:
A namespace is a logical grouping of tables analogous to a database in relational database systems. This abstraction lays the groundwork for upcoming multi-tenancy related features:
Quota Management (HBASE-8410) - Restrict the amount of resources (i.e. regions, tables) a namespace can consume.
Namespace Security Administration (HBASE-9206) - Provide another level of security administration for tenants.
Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset of RegionServers, thus guaranteeing a coarse level of isolation.
Below are the shell commands related to namespaces:
alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
Advantages:
Even if you use column filters, each table holds less data (one day's worth), so a full table scan will be faster than with the single-table approach (a full scan on a big table is costly).
If you want authentication and authorization on a specific table, that can also be achieved.
Limitation: you will end up with multiple scripts to manage the tables rather than a single script (option 1).
Note: with either of the options above, your rowkey design is very important for good performance and for preventing hotspotting.
For more details, look at hbase-series
I've been trying to figure out what performance impact the use of temporary tables has on an Oracle database. We want to use these tables in our ETL process to hold intermediate results. At the moment we use physical tables for this purpose and truncate these tables at the beginning of the ETL process. I understand that truncating is very expensive, so I wondered whether it would be better to use temporary tables instead.
Does anyone have experience with whether temporary tables give a performance boost in this scenario?
I found only answers about SQL Server, as in this question, and I don't know whether those recommendations also apply to Oracle.
It would be nice if someone could list the advantages and disadvantages of this feature and point out the scenarios in which it is applicable.
Thanks in advance.
First of all: truncate is not expensive; a delete with no condition is very expensive.
Second: does your temporary table have indexes? What about foreign keys?
That could affect performance.
Temporary tables work more or less as they do in SQL Server (the syntax is different, of course, for example global temporary table), and in both systems they are just tables.
You won't gain performance by using temporary tables instead of normal tables. They are essentially the same: they have a definition in the database, can have indexes, and are logged.
The only difference is that the contents of a temporary table are private to your session (for an Oracle global temporary table the definition is shared, but each session still sees only its own rows). If multiple scripts in multiple sessions refer to the same table, each one reads and writes its own data and they cannot lock each other. In that case you could gain performance, but I think that is rarely the case.
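As a minimal sketch of the session-private behavior in Oracle, the following uses a hypothetical staging table named etl_stage with made-up columns. ON COMMIT PRESERVE ROWS keeps the rows until the session ends, while ON COMMIT DELETE ROWS would clear them at each commit.

```sql
-- The definition is created once and is visible to every session.
CREATE GLOBAL TEMPORARY TABLE etl_stage (
  id      NUMBER,
  payload VARCHAR2(4000)
) ON COMMIT PRESERVE ROWS;

-- Rows inserted here are visible only to the inserting session.
INSERT INTO etl_stage (id, payload)
VALUES (1, 'intermediate result');

-- Counts only the current session's rows.
SELECT COUNT(*) FROM etl_stage;

-- Clears only the current session's rows; other sessions are unaffected.
TRUNCATE TABLE etl_stage;
```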
I am writing an application in PL/SQL that takes a .csv flat-file, reads it, does some data processing on it, and then decides which of several tables to update, insert into, or delete.
I have the option of using the UTL_FILE.GET_LINE functionality to process a single record at a time, parsing it with various REGEX tools, storing the data temporarily in some variables, and then doing work with it (making decisions, updating tables, etc.)
I also have the option of creating an external table and then stepping through it with a cursor on that external table (using a cursor FOR loop for performance). I should still be able to do all of the same things with the data (making decisions, updating tables, etc.).
I have looked around, and a couple of forums suggest that external tables are the preferred solution because they scale better, are faster, and are more reliable. I have not, however, seen an explanation of why. Oracle's documentation on UTL_FILE and external tables does not discuss why one might be faster than the other, so I'm curious whether anyone has more information or references about what would make one perform better than the other.
The performance difference is quite simple: UTL_FILE is a PL/SQL package, while external tables use the SQL*Loader code written in C.
If you have enough data, you can even read external tables in parallel with minimal effort, for instance: ALTER TABLE my_external_table PARALLEL 4;
External tables can be used in bulk mode (INSERT INTO my_table SELECT ... FROM my_external_table JOIN my_lookup_table USING (lookup_column)).
External tables can be set to transactionally safe mode (REJECT LIMIT 0), so the above INSERT either works or rolls back.
Do you need more reasons?
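To make the points above concrete, here is a sketch of an external table over a CSV file followed by a bulk insert. The directory path, file name, column list, and the target table my_table are assumptions for illustration; my_external_table and my_lookup_table reuse the names from the statements above.

```sql
-- Directory object pointing at the folder that holds the CSV file
-- (the path is hypothetical).
CREATE OR REPLACE DIRECTORY data_dir AS '/data/incoming';

-- External table that reads the file through the ORACLE_LOADER driver.
CREATE TABLE my_external_table (
  id            NUMBER,
  lookup_column VARCHAR2(30),
  amount        NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('input.csv')
)
REJECT LIMIT 0;  -- any rejected row makes the query fail

-- Bulk, set-based load: one statement instead of a row-by-row loop.
INSERT INTO my_table (id, lookup_value, amount)
SELECT e.id, l.lookup_value, e.amount
FROM my_external_table e
JOIN my_lookup_table l ON l.lookup_column = e.lookup_column;
```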
If the file has data that has a known structure/file format then external table is the way to go. UTL_FILE is at a different abstraction level - you are now just working with a file - your use of UTL_FILE will be brittle and likely introduce bugs. The deciding factor should not be performance; however I doubt you will be able to 'outperform' Oracle's external table implementation by rolling your own using REGEX and UTL_FILE.
We are trying to migrate Oracle tables to Hive and process them there.
Currently the tables in Oracle have primary key, foreign key, and unique key constraints.
Can we replicate the same in hive?
We are doing some analysis on how to implement it.
Hive indexing was introduced in Hive 0.7.0 (HIVE-417) and removed in Hive 3.0 (HIVE-18448). Please read the comments in that Jira. The feature was completely useless in Hive: these indexes were too expensive for big data. RIP.
As of Hive 2.1.0 (HIVE-13290), Hive includes support for non-validated primary and foreign key constraints. Because these constraints are not validated, an upstream system needs to ensure data integrity before the data is loaded into Hive. The constraints are useful for tools that generate ER diagrams and queries. They are also useful as self-documentation: you can easily find out what is supposed to be the PK if the table has such a constraint.
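A sketch of how such non-validated constraints are declared, using hypothetical customers and orders tables. DISABLE NOVALIDATE states explicitly that Hive records the constraint as metadata only and does not check the data.

```sql
-- Parent table with an informational primary key.
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING,
  PRIMARY KEY (customer_id) DISABLE NOVALIDATE
);

-- Child table with an informational primary key and foreign key.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE,
  PRIMARY KEY (order_id) DISABLE NOVALIDATE,
  CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
    DISABLE NOVALIDATE
);
```

Hive will still accept duplicate order_id values or orphaned customer_id values in these tables, so the integrity checks remain the job of the upstream process.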
In an Oracle database, unique, PK, and FK constraints are backed by indexes, so they work fast and are really useful. But this is not how Hive works or what it was designed for.
A quite normal scenario is that you load a very big file with semi-structured data into HDFS. Building an index on it is too expensive, and without an index the only way to check for a PK violation is to scan all the data. Normally you cannot enforce constraints in big data. An upstream process can take care of data integrity and consistency, but this does not guarantee that you will never end up with a PK violation in some big Hive table loaded from different sources.
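Such a full-scan check for PK violations can be written as a plain aggregation. The table orders and key column order_id are hypothetical names; on a large table this query reads every row, which is exactly the cost described above.

```sql
-- List key values that occur more than once, that is, PK violations.
SELECT order_id, COUNT(*) AS occurrences
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```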
Some file storage formats, such as ORC, have internal lightweight "indexes" that speed up filtering and enable predicate push down (PPD), but no PK or FK constraints are implemented using such indexes. This cannot be done because a Hive table normally consists of many such files, and the files can even have different schemas. Hive was created for petabytes, and you can process petabytes in a single run; data can be semi-structured, and files can have different schemas. Hadoop does not support random writes, which adds further complication and cost if you want to rebuild indexes.