Hive: Creating smaller table from big table - hadoop

I currently have a Hive table that has 1.5 billion rows. I would like to create a smaller table (using the same table schema) with about 1 million rows from the original table. Ideally, the new rows would be randomly sampled from the original table, but getting the top 1M or bottom 1M of the original table would be ok, too. How would I do this?

As climbage suggested earlier, you could probably best use Hive's built-in sampling methods.
INSERT OVERWRITE TABLE my_table_sample
SELECT * FROM my_table
TABLESAMPLE (1m ROWS) t;
This syntax was introduced in Hive 0.11. If you are running an older version of Hive, you'll be confined to using the PERCENT syntax like so.
INSERT OVERWRITE TABLE my_table_sample
SELECT * FROM my_table
TABLESAMPLE (1 PERCENT) t;
You can change the percentage to match you specific sample size requirements.

You can define a new table with the same schema as your original table.
Then use INSERT OVERWRITE TABLE <tablename> <select statement>
The SELECT statement will need to query your original table, use LIMIT to only get 1M results.

This query will pull out top 1M rows and overwrite them in a new table.
CREATE TABLE new_table_name AS
SELECT col1, col2, col3, ....
FROM original_table
WHERE (if you want to put any condition) limit 100000;

Related

Why reading from some table in Oracle slower than other table in the same database

I am doing a simple Select col1, col2, col22 from Table1 order by col1 and the same Select statement in Table2. Select col1, col2, col22 from Table2 order by col1.
I use Pentaho ETL tool to replicate data from Oracle 19c to SQL Server. Reading from Table1 is much much slower than reading from Table2. Both have almost the same number for columns and almost the same number for rows. Both exist in the same schema. Table1 is being read at 10 rows per sec while Table2 is being read at 1000 rows a sec.
What can cause this slowness?
Are the indexes the same on the two tables? It's possible Oracle is using a fast full index scan (like a skinny version of the table) if an index covers all the relevant columns in one table, or may be using a full index scan to pre-sort by COL1. Check the execution plans to make sure the statements are using the same access methods:
explain plan for select ...;
select * from table(dbms_xplan.display);
Are the table segment sizes the same? Although the data could be the same, occasionally a table can have a lot of wasted space. For example, if the table used to contain a billion rows, and then 99.9% of the rows were deleted, but the table was never rebuilt. Compare the segment sizes with a query like this:
select segment_name, sum(bytes)/1024/1024 mb
from all_segments
where segment_name in ('TABLE1', 'TABLE2')
It depends on many factors.
The first things I would check are the table indexes:
select
uic.table_name,
uic.index_name,
utc.column_name
from USER_TAB_COLUMNS UTC,
USER_IND_COLUMNS UIC
where utc.table_name = uic.table_name
and utc.column_name = uic.column_name
and utc.table_name in ('TABLE1', 'TABLE2')
order by 1, 2, 3;

Improve DELETE query Oracle

I have a query to delete some records from a table, but take too much time.
The table is use it in a stored procedure to match another table.
Every time that the SP is executed the table is truncated and filled with 2 or 3 millions of records depending of the received parameters.
The table doesn't have any FK or constraints
The query to delete the records that I am using is:
DELETE FROM TABLE1
WHERE (fecha,hora_ini,origen,destino,tipo,valor,rowsm1) IN (
SELECT fecha_t,hora_t,origen_t,destino_t,tipo,valor,id_t
FROM TABLE2)
I try to decrease the time in execute the query creating an index based in the same columns of the query
CREATE INDEX smb1 ON table1 (fecha,hora_ini,origen,destino,tipo,valor,rowsm1);
And the query take more time to execute.
How can improve the performance of this "DELETE" query.
UPDATE
EXPLAIN PLAN OUTPUT
DELETE TABLE1
TABLE ACCESS TABLE1
TABLE ACCESS FULL TABLE1
TABLE ACCESS FULL TABLE2
TABLE ACCESS FULL TABLE2
The index you created looks like a quite big index:
CREATE INDEX smb1
ON table1 (fecha,hora_ini,origen,destino,tipo,valor,rowsm1);
Sure, this depends on the amount of data but generally I would rather look for one or two selective columns - if possible.
Don't forget, that the index data has to be read as well and if it doesn't help to speed up the query, you even loose performance.
This might for instance happen, if the table is very small, because the database reads data block by block (I think it was about 8K). A small table can be read in one step - no need to use an index here.
Or, if more or less all records are selected. In this case the table has to be read anyway.
If you want to speed up the query you should create the same index (with a good selectivity) on table2. This way the EXPLAIN PLAN will look somewhat lie this:
DELETE STATEMENT
DELETE
NESTED LOOPS SEMI
INDEX FULL SCAN
INDEX RANGE SCAN
You can switch off logging and delete the rows,
Here is an example, you can do it 2 ways,
1.) Physically chaning the table to Nologging
2.) Using Nologging hint in the delete statement.
1.) First approach
both testemp and testemp2 are same tables with same data while testemp takes over a minute , testemp2 takes only 1 second
SQL> delete from testemp;
14336 rows deleted.
Elapsed: 00:01:04.12
SQL>
SQL>
SQL> alter table testemp2 nologging;
Table altered.
Elapsed: 00:00:02.86
SQL>
SQL> delete from testemp2;
14336 rows deleted.
Elapsed: 00:00:01.26
SQL>
The table needs to be put back to logging only when we physically change the table using "Alter" command, if you are using as hint not required please see the example below
2.) Second approach
SQL> set timing on;
SQL> delete from testemp2;
14336 rows deleted.
Elapsed: 00:00:01.51
Deleting data after reinserting same data into table now with nologging;
SQL> delete /*+NOLOGGING*/ from testemp2;
14336 rows deleted.
Elapsed: 00:00:00.28
SQL> select logging from user_Tables where table_name='TESTEMP2';
LOG
---
YES

Hive - How to query a table to get its own name?

I want to write a query such that it returns the table name (of the table I am querying) and some other values. Something like:
select table_name, col1, col2 from table_name;
I need to do this in Hive. Any idea how I can get the table name of the table I am querying?
Basically, I am creating a lookup table that stores the table name and some other information on a daily basis in Hive. Since Hive does not (at least the version we are using) support full-fledged INSERTs, I am trying to use the workaround where we can INSERT into a table with a SELECT query that queries another table. Part of this involves actually storing the table name as well. How can this be achieved?
For the purposes of my use case, this will suffice:
select 'table_name', col1, col2 from table_name;
It returns the table name with the other columns that I will require.

Hive: Create New Table from Existing Partitioned Table

I'm using Amazon's Elastic MapReduce and I have a hive table created based on a series of log files stored in Amazon S3 and split in folders by day like so:
data/day=2011-09-01/log_file.tsv
data/day=2011-09-02/log_file.tsv
I am currently trying to create an additional table which filters out some unwanted activity in these log files but I can't figure out how to do this and keep getting errors such as:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
If my initial table create statement looks something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
... fields ...
)
PARTITIONED BY ( DAY STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/data/';
That initial table works fine and I've been able to query it with no problems.
How then should I create a new table that shares the structure of the previous one but simply filters out data? This doesn't seem to work.
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
FROM table1
INSERT OVERWRITE TABLE table2
SELECT * WHERE
col1 = '%somecriteria%' AND
more criteria...
;
As I've stated above, this returns:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
Thanks!
This always works for me:
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
INSERT OVERWRITE TABLE table2 PARTITION (day) SELECT col1, col2, ..., day FROM table1;
ALTER TABLE table2 RECOVER PARTITIONS;
Notice that I've added 'day' as a column in the SELECT statement. Also notice that there is an ALTER TABLE line which is necessary for Hive to become aware of the partitions that were newly created in table2.
I have never used the like option.. so thanks for showing me that. Will that actually create all of the partitions that the first table has as well? If not, that could be the issue. You could try using dynamic partitions:
create external table if not exists table2 like table1;
insert overwrite table table2 partition(part) select col1, col2 from table1;
Might not be the best solution, as I think you have to specify your columns in the select clause (as well as the partition column in the partition clause).
And, you must turn on dynamic partitioning.
I hope this helps.

Data loading in Oracle

I am facing problem in loading data. I have to copy 800,000 rows from one table to another in Oracle database.
I tried for 10,000 rows first but the time it took is not satisfactory. I tried using the "BULK COLLECT" and "INSERT INTO SELECT" clause but for both the cases response time is around 35 minutes. This is not the desired response I'm looking for.
Does anyone have any suggestions?
Anirban,
Using an "INSERT INTO SELECT" is the fastest way to populate your table. You may want to extend it with one or two of these hints:
APPEND: to use direct path loading, circumventing the buffer cache
PARALLEL: to use parallel processing if your system has multiple cpu's and this is a one-time operation or an operation that takes place at a time when it doesn't matter that one "selfish" process consumes more resources.
Just using the append hint on my laptop copies 800,000 very small rows below 5 seconds:
SQL> create table one_table (id,name)
2 as
3 select level, 'name' || to_char(level)
4 from dual
5 connect by level <= 800000
6 /
Tabel is aangemaakt.
SQL> create table another_table as select * from one_table where 1=0
2 /
Tabel is aangemaakt.
SQL> select count(*) from another_table
2 /
COUNT(*)
----------
0
1 rij is geselecteerd.
SQL> set timing on
SQL> insert /*+ append */ into another_table select * from one_table
2 /
800000 rijen zijn aangemaakt.
Verstreken: 00:00:04.76
You mention that this operation takes 35 minutes in your case. Can you post some more details, so we can see what exactly is taking 35 minutes?
Regards,
Rob.
I would agree with Rob. Insert into () select is the fastest way to do this.
What exactly do you need to do? If you're trying to do a table rename by copying to a new table and then deleting the old, you might be better off doing a table rename:
alter table
table
rename to
someothertable;
INSERT INTO SELECT is the fastest way to do it.
If possible/necessary, disable all indexes on the target table first.
If you have no existing data in the target table, you can also try CREATE AS SELECT.
As with the above, I would recommend the Insert INTO ... AS select .... or CREATE TABLE ... AS SELECT ... as the fastest way to copy a large volume of data between two tables.
You want to look up the direct-load insert in your oracle documentation. This adds two items to your statements: parallel and nologging. Repeat the tests but do the following:
CREATE TABLE Table2 AS SELECT * FROM Table1 where 1=2;
ALTER TABLE Table2 NOLOGGING;
ALTER TABLE TABLE2 PARALLEL (10);
ALTER TABLE TABLE1 PARALLEL (10);
ALTER SESSION ENABLE PARALLEL DML;
INSERT INTO TABLE2 SELECT * FROM Table 1;
COMMIT;
ALTER TABLE 2 LOGGING:
This turns off the rollback logging for inserts into the table. If the system crashes, there's not recovery and you can't do a rollback on the transaction. The PARALLEL uses N worker thread to copy the data in blocks. You'll have to experiment with the number of parallel worker threads to get best results on your system.
Is the table you are copying to the same structure as the other table? Does it have data or are you creating a new one? Can you use exp/imp? Exp can be give a query to limit what it exports and then imported into the db. What is the total size of the table you are copying from? If you are copying most of the data from one table to a second, can you instead copy the full table using exp/imp and then remove the unwanted rows which would be less than copying.
try to drop all indexes/constraints on your destination table and then re-create them after data load.
use /*+NOLOGGING*/ hint in case you use NOARCHIVELOG mode, or consider to do the backup right after the operation.

Resources