In Greenplum, where is data temporarily inserted into tables within a function actually stored?

Our project has a database function that stores millions of records from queries into intermediate tables as part of its steps, and then joins those intermediate tables to get the final results, which are stored in a final table. The intermediate tables are standard database tables in the schema (our security team does not allow tables to be created dynamically), and the intermediate tables are truncated before the function returns. When determining the amount of database storage space that our project needs, how do these intermediate tables figure into the determination of space needed? Do I need to figure out the worst-case of how much data is stored in the final table plus these intermediate tables while these functions are executing or does the fact that the data in the intermediate tables does not remain once the function returns mean that I do not need to include the data that would be stored in the intermediate tables when calculating the overall storage needs? I know that those millions of records are stored somewhere before the table is truncated and the function returns, but I was looking for a definitive answer.
Thanks in advance

You should plan for the worst-case size of these intermediate tables. An intermediate table is an ordinary table and stores its data the same way as your target table. Data you insert into it resides on the same disks as the rest of your data (this depends on the filespace settings, but usually they are the same).
When you truncate the table, its data files are removed from the filesystem and the table's metadata is updated in pg_class.
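As a rough illustration of the sizing arithmetic, here is a minimal sketch. The table sizes are made-up numbers, and it conservatively assumes every intermediate table is at its fullest at the same moment, which may overstate the peak if your function fills and truncates them one after another.

```python
GIB = 1024 ** 3

def peak_storage_bytes(final_table_bytes, intermediate_table_bytes):
    """Worst-case footprint: the final table plus every intermediate
    table at its largest, all present on disk at the same time."""
    return final_table_bytes + sum(intermediate_table_bytes)

# Hypothetical sizes, for illustration only.
final_table = 40 * GIB
intermediate_tables = [25 * GIB, 10 * GIB, 5 * GIB]

peak = peak_storage_bytes(final_table, intermediate_tables)
print(f"plan for at least {peak // GIB} GiB")
```

The steady-state size after the function returns would be the final table only, but the disks have to hold the peak while it runs.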

Related

How to safely update hive external table

I have an external hive table and I would like to refresh the data files on a daily basis. What is the recommended way to do this?
If I just overwrite the files, and if we are unlucky enough to have some other hive queries to execute in parallel against this table, what will happen to those queries? Will they just fail? Or will my HDFS operations fail? Or will they block until the queries complete?
If availability is a concern and space isn't an issue, you can do the following:
Make a synonym for the external table (Hive has no synonym object as such, so in practice a view over the table can play this role). Make sure all queries use this synonym when accessing the table.
When loading new data, load it to a new table with a different name.
When the load is complete, point the synonym to the newly loaded table.
After an appropriate length of time (long enough for any running queries to finish), drop the previous table.
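The steps above could be scripted roughly as follows. This is only a sketch: it assumes a view stands in for the "synonym", and the view name, table names and HDFS path are hypothetical. It just builds the HiveQL strings; how you submit them (beeline, a JDBC client, etc.) is up to you.

```python
def publish_statements(view, current_table, new_table, new_location):
    """HiveQL to register freshly loaded files and repoint readers at them."""
    return [
        # New external table over the directory that holds the new files,
        # with the same schema as the table currently in use.
        f"CREATE EXTERNAL TABLE {new_table} LIKE {current_table} "
        f"LOCATION '{new_location}'",
        # Readers only ever query the view, so this is the switch-over.
        f"ALTER VIEW {view} AS SELECT * FROM {new_table}",
    ]

def cleanup_statements(old_table):
    """Run later, once queries against the old table have had time to finish."""
    return [f"DROP TABLE IF EXISTS {old_table}"]

# Hypothetical names for a daily refresh.
for stmt in publish_statements(
    "sales", "sales_20240101", "sales_20240102", "/data/sales/20240102"
):
    print(stmt + ";")
for stmt in cleanup_statements("sales_20240101"):
    print(stmt + ";")
```

Keep in mind that dropping an external table removes only the metadata, so the old data files still have to be deleted from HDFS separately.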
First of all, a table that is being accessed can hold one of two types of lock:
exclusive (if data is being written) and shared (if data is being read).
So if you run an INSERT OVERWRITE to add data to the table, other queries that access the table at that time will not be executed, because there is an exclusive lock on it. Once the INSERT OVERWRITE query completes, you can access the table again.
Please refer to the following link:
https://cwiki.apache.org/confluence/display/Hive/Locking

How Hive Partition works

I want to know how Hive partitioning works. I know the concept, but I am trying to understand how it actually works and how it stores data in the exact partition.
Let's say I have a table with a dynamic partition on year, and I have ingested data from 2013. How does Hive create the partitions and store each row in the correct partition?
If the table is not partitioned, all the data is stored in one directory in no particular order. If the table is partitioned (e.g. by year), the data is stored in separate directories, and each directory corresponds to one year.
For a non-partitioned table, when you want to fetch the data for year=2010, Hive has to scan the whole table to find the 2010 records. If the table is partitioned, Hive just reads the year=2010 directory, which is faster and more I/O-efficient.
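To make the directory layout concrete, here is a small simulation of how rows end up under one directory per partition value. It does not talk to Hive; the warehouse path and the rows are invented, and only the column=value directory naming mirrors what Hive does.

```python
from collections import defaultdict

def partition_path(table_dir, column, value):
    """Directory for one partition, named in the column=value style."""
    return f"{table_dir}/{column}={value}"

def route_rows(rows, table_dir, column):
    """Group rows by the partition column, keyed by partition directory."""
    layout = defaultdict(list)
    for row in rows:
        layout[partition_path(table_dir, column, row[column])].append(row)
    return dict(layout)

# Hypothetical rows and table location.
rows = [
    {"id": 1, "year": 2013},
    {"id": 2, "year": 2014},
    {"id": 3, "year": 2013},
]
layout = route_rows(rows, "/warehouse/events", "year")
for path in sorted(layout):
    print(path, len(layout[path]), "row(s)")

# A query filtered on year=2013 only needs to read one directory.
wanted = layout[partition_path("/warehouse/events", "year", 2013)]
```

With dynamic partitioning, a directory for a year is created the first time a row with that year is written, which is what the grouping above imitates.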
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy certain criteria.
Using partitions, it is easy to query a portion of the data.
Tables or partitions can be sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function applied to some column of the table.
Suppose you need to retrieve the details of all employees who joined in 2012. Without partitioning, the query searches the whole table for the required information. However, if you partition the employee data by year, so that each year is stored separately, the query only reads the 2012 partition and processing time is reduced.
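The bucketing idea mentioned above can be sketched like this. It is illustrative only: it assumes an integer column whose hash is the value itself, and the real hash function depends on the column type and the Hive version. The employee ids and bucket count are made up.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Pick a bucket from a hash of the column value.
    Assumption: the hash of an integer key is the integer itself; the
    mask keeps the result non-negative before taking the modulo."""
    return (key & 0x7FFFFFFF) % num_buckets

# Hypothetical employee ids spread over 4 buckets.
employee_ids = [10, 7, 12, 21]
assignment = {emp_id: bucket_for(emp_id, 4) for emp_id in employee_ids}
print(assignment)
```

Because the bucket is a pure function of the column value, rows with the same key always land in the same bucket, which is what makes bucketed sampling and joins cheaper.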

One large table partitioned and then subpartitioned or several smaller partitioned tables?

I currently have several audit tables that audit specific tables' data.
e.g. ATAB_AUDIT, BTAB_AUDIT and CTAB_AUDIT auditing inserts, updates and deletes from ATAB, BTAB and CTAB respectively.
These audit tables are partitioned by year.
As the columns in these audit tables are identical (change_date, old_value, new_value etc.) would it be beneficial to use one large audit table, add a column holding the name of the table that generated the audit record (table_name) partition it by table_name and then subpartition by year?
The database is Oracle 11g on Solaris.
Why or why not do this?
Many thanks in advance.
I would guess that performance characteristics would be quite similar with either approach. I would make this decision based solely on how you decide to model your data; that is how your application(s) wish to interact with the database. I don't think your partitioning strategy would affect this decision (at least in this example).
Both approaches are valid, but sometimes people get carried away with the single-table approach and end up putting all data in one big table. There's a name for this (anti)pattern but it slips my mind.

Can't drop Oracle index partition -- any alternative besides dropping entire index and rebuilding?

So, I have a .NET program doing batch loading of records into partitioned tables using array bound stored procedure calls via Oracle ODP.NET, but that's neither here nor there.
What I would like to know is: because I have a partitioned index on said tables, the speed of the batch load is pretty slow. I fully understand that I cannot drop an index partition, but I would obviously prefer not to have to drop and rebuild the entire index since that will take considerably more time to execute. Is this my only recourse?
Is there a fairly simple way to drop the partition itself and then rebuild the partition and index partition that would save time and go about accomplishing my goal?
Are you loading an entire partition at once? Or are you merely adding new rows to an existing partition? Are all the indexes equipartitioned with the table?
Normally, if you are loading data into a partitioned table, your partitioning scheme is chosen so that each load will put data into a fresh partition. If that is the case, you can use partition exchange to load the data. In a nutshell, you load data into an (unindexed) staging table whose structure matches the real table, you create the indexes to match the indexes on the real table, and then do
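-- Swap the loaded staging table in as the new partition.
-- INCLUDING INDEXES exchanges the local index partitions with the matching
-- indexes built on the staging table (the default is EXCLUDING INDEXES).
ALTER TABLE partitioned_table
  EXCHANGE PARTITION new_partition_name
  WITH TABLE staging_table_name
  INCLUDING INDEXES
  WITHOUT VALIDATION;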

PostgreSQL temporary tables

I need to perform a query 2.5 million times. This query generates some rows which I need to AVG(column) and then use this AVG to filter the table from all values below average. I then need to INSERT these filtered results into a table.
The only way to do such a thing with reasonable efficiency, seems to be by creating a TEMPORARY TABLE for each query-postmaster python-thread. I am just hoping these TEMPORARY TABLEs will not be persisted to hard drive (at all) and will remain in memory (RAM), unless they are out of working memory, of course.
I would like to know if a TEMPORARY TABLE will incur disk writes (which would interfere with the INSERTs, i.e. slow the whole process down).
Please note that, in Postgres, the default behaviour for temporary tables is that they are not automatically dropped, and data is persisted on commit. See ON COMMIT.
Temporary tables are, however, dropped at the end of a database session:
Temporary tables are automatically dropped at the end of a session, or
optionally at the end of the current transaction.
There are multiple considerations you have to take into account:
If you do want to explicitly DROP a temporary table at the end of a transaction, create it with the CREATE TEMPORARY TABLE ... ON COMMIT DROP syntax.
In the presence of connection pooling, a database session may span multiple client sessions; to avoid clashes in CREATE, you should drop your temporary tables -- either prior to returning a connection to the pool (e.g. by doing everything inside a transaction and using the ON COMMIT DROP creation syntax), or on an as-needed basis (by preceding any CREATE TEMPORARY TABLE statement with a corresponding DROP TABLE IF EXISTS, which has the advantage of also working outside transactions e.g. if the connection is used in auto-commit mode.)
While the temporary table is in use, how much of it will fit in memory before overflowing on to disk? See the temp_buffers option in postgresql.conf
Anything else I should worry about when working often with temp tables? A vacuum is recommended after you have DROPped temporary tables, to clean up any dead tuples from the catalog. With the default settings, the autovacuum daemon (autovacuum) will do this for you; it wakes up about once a minute (autovacuum_naptime) and vacuums tables that have accumulated enough dead tuples.
Also, unrelated to your question (but possibly related to your project): keep in mind that, if you have to run queries against a temp table after you have populated it, then it is a good idea to create appropriate indices and issue an ANALYZE on the temp table in question after you're done inserting into it. By default, the cost-based optimizer will assume that a newly created temp table has ~1000 rows, and this may result in poor performance should the temp table actually contain millions of rows.
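The drop-then-create pattern for pooled connections described above might look something like this. Note the big assumption: the sketch uses Python's built-in sqlite3 as a stand-in so that it runs without a server, so it shows the statement order only and says nothing about PostgreSQL's memory or disk behaviour. The table name scratch and the sample values are made up; against PostgreSQL you would issue the same statements through your driver and could add ON COMMIT DROP and an ANALYZE after loading.

```python
import sqlite3

def fresh_temp_table(conn, values):
    """Recreate a session-local scratch table and load values into it."""
    cur = conn.cursor()
    # Dropping first avoids a "table already exists" error when a pooled
    # connection still carries the table left behind by its previous user.
    cur.execute("DROP TABLE IF EXISTS scratch")
    cur.execute("CREATE TEMPORARY TABLE scratch (value REAL)")
    cur.executemany("INSERT INTO scratch (value) VALUES (?)",
                    [(v,) for v in values])
    conn.commit()

def at_or_above_average(conn):
    """Keep only the rows that are not below the average."""
    cur = conn.execute(
        "SELECT value FROM scratch "
        "WHERE value >= (SELECT AVG(value) FROM scratch) "
        "ORDER BY value"
    )
    return [row[0] for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
fresh_temp_table(conn, [1.0, 2.0, 3.0, 4.0])
fresh_temp_table(conn, [10.0, 20.0, 30.0])  # same session, no name clash
kept = at_or_above_average(conn)
print(kept)
```

The second call succeeds on the same connection precisely because of the DROP TABLE IF EXISTS, which is the point of the pattern.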
Temporary tables provide only one guarantee - they are dropped at the end of the session. For a small table you'll probably have most of your data in the backing store. For a large table I guarantee that data will be flushed to disk periodically as the database engine needs more working space for other requests.
EDIT:
If you're absolutely in need of RAM-only temporary tables you can create a table space for your database on a RAM disk (/dev/shm works). This reduces the amount of disk IO, but beware that it is currently not possible to do this without a physical disk write; the DB engine will flush the table list to stable storage when you create the temporary table.
