I have a large dataset on S3 saved as Parquet files, partitioned by the "last_update" column.
I want to take the top 10M records, ordered by last_update ASC.
I tried to save the resulting dataframe (below) to S3, but it just never ends.
Is there any other way to do it?
The weird thing is that I can kill it after 40 minutes in which nothing happened, start it again (with the same dataset!), and then it finishes after 4 minutes...
(The top 10M records can all be in the oldest partition or split across a few of the oldest partitions.)
Thanks!
sql_context.sql(
    """
    SELECT
        trim(col1) as col1,
        col2,
        col3
    FROM
        global_temp.my_tbl
    ORDER BY last_update asc
    LIMIT {}
    """.format(args.num_of_records)
)
Related
I am running a simple Select col1, col2, col22 from Table1 order by col1 and the same Select statement against Table2: Select col1, col2, col22 from Table2 order by col1.
I use the Pentaho ETL tool to replicate data from Oracle 19c to SQL Server. Reading from Table1 is much, much slower than reading from Table2. Both have almost the same number of columns and almost the same number of rows, and both exist in the same schema. Table1 is being read at 10 rows per second while Table2 is being read at 1000 rows per second.
What can cause this slowness?
Are the indexes the same on the two tables? It's possible Oracle is using a fast full index scan (like a skinny version of the table) if an index covers all the relevant columns in one table, or may be using a full index scan to pre-sort by COL1. Check the execution plans to make sure the statements are using the same access methods:
explain plan for select ...;
select * from table(dbms_xplan.display);
Are the table segment sizes the same? Although the data could be the same, occasionally a table can have a lot of wasted space. For example, if the table used to contain a billion rows, and then 99.9% of the rows were deleted, but the table was never rebuilt. Compare the segment sizes with a query like this:
select segment_name, sum(bytes)/1024/1024 mb
from all_segments
where segment_name in ('TABLE1', 'TABLE2')
group by segment_name;
It depends on many factors.
The first things I would check are the table indexes:
select uic.table_name,
       uic.index_name,
       utc.column_name
  from user_tab_columns utc,
       user_ind_columns uic
 where utc.table_name = uic.table_name
   and utc.column_name = uic.column_name
   and utc.table_name in ('TABLE1', 'TABLE2')
 order by 1, 2, 3;
I have a table with over 24 million log records, and now we are trying to reduce that. Due to company policy we aren't allowed to do a truncate, a move, or anything of that sort. The records have to be deleted from that table in one go, approximately 23 million rows. I don't have much experience with bulk deletes, but I was wondering if there is a way to do this without a regular delete (which times out even with multiple indexes on the table supporting the where clause). I think a bulk remove would do the trick, but I have no experience with this. I did look into a cursor that would select all the records I need to delete after a certain date, and then loop over the cursor to delete the records: roughly, select into a cursor the records from the table where createdate is after sysdate - 30, then loop over the cursor and delete.
There are a couple of options you can use:
Partition the table on createdate and drop the partitions that are older than your 30 day limit.
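A hedged sketch of that approach, assuming the table is (or can be rebuilt as) range-partitioned by createdate; the partition name is a placeholder:
-- Assumes one range partition per period on createdate; the name is a placeholder.
ALTER TABLE table_name DROP PARTITION logs_2013_01 UPDATE GLOBAL INDEXES;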
Create a new table using:
CREATE TABLE table_name2 AS
SELECT * FROM table_name WHERE createdate < SYSDATE - INTERVAL '30' DAY;
Copy the constraints, etc. from the old table and then drop the old table and rename the new table to the old table.
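A minimal sketch of that swap, assuming the constraints and indexes have already been recreated on the new table and there are no other dependent objects to carry over:
-- Swap the new, smaller table in under the old name.
DROP TABLE table_name;
ALTER TABLE table_name2 RENAME TO table_name;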
If you cannot delete all ~23 million rows in one go, then split the delete up into smaller batches:
DELETE FROM table_name
WHERE ROWID IN (
  SELECT ROWID
  FROM table_name
  WHERE createdate < SYSDATE - INTERVAL '30' DAY
  ORDER BY createdate
  FETCH FIRST 1000000 ROWS ONLY
);
and incrementally remove all the rows.
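For example, a rough PL/SQL sketch of that incremental approach (the batch size and per-batch commit are illustrative, not prescriptive):
BEGIN
  LOOP
    DELETE FROM table_name
    WHERE ROWID IN (
      SELECT ROWID
      FROM table_name
      WHERE createdate < SYSDATE - INTERVAL '30' DAY
      FETCH FIRST 100000 ROWS ONLY
    );
    EXIT WHEN SQL%ROWCOUNT = 0;   -- nothing older than 30 days is left
    COMMIT;                       -- commit between batches to keep undo small
  END LOOP;
  COMMIT;
END;
/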
I have two tables in Hive. One has around 2 million records and the other has 14 million records. I am joining these two tables, and I am also applying a UDF in the WHERE clause. It is taking too much time to perform the JOIN operation.
I have tried to run the query many times, but it runs for around 2 hours with my reducer stuck at 70%, and after that I get the exception "java.io.IOException: No space left on device" and the job gets killed.
I have tried to set the parameters as below:
set mapreduce.task.io.sort.mb=256;
set mapreduce.task.io.sort.factor=100;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.child.java.opts=-Xmx1024m;
My Query -
insert overwrite table output select col1, col2, name1, name2, col3, col4,
t.zip, t.state from table1 m join table2 t ON (t.state=m.state and t.zip=m.zip)
where matchStrings(concat(name1,'|',name2))>=0.9;
The above query takes 8 mappers and 2 reducers.
Can someone please suggest what I am supposed to do to improve performance?
That exception probably indicates that you do not have enough space in the cluster for the temporary files created by the query you are running. You should try adding more disk space to the cluster or reducing the number of rows that are joined by using a subquery to first filter the rows from each table.
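As a rough illustration of the subquery idea (it assumes the unqualified columns come from table1 and that you have some predicate to trim each side; the IS NOT NULL filters here are only placeholders):
insert overwrite table output
select m.col1, m.col2, m.name1, m.name2, m.col3, m.col4, t.zip, t.state
from (
  select col1, col2, name1, name2, col3, col4, state, zip
  from table1
  where state is not null and zip is not null   -- placeholder filter
) m
join (
  select state, zip
  from table2
  where state is not null and zip is not null   -- placeholder filter
) t ON (t.state = m.state and t.zip = m.zip)
where matchStrings(concat(m.name1, '|', m.name2)) >= 0.9;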
I am constructing a test environment. Oracle 11g is my database, and my goal is to place 80 million records in it. I will start with 1 million records, which will be loaded into a partitioned table. Is there a way to duplicate the initial partition to create 80 partitions for a grand total of 80 million records? The constraint is that this process should take no longer than two hours to generate the 80 million records.
After inserting the first partition, follow this principle:
INSERT INTO my_table (partition_column, col1, col2, col3, ...)
SELECT g.lvl, m.col1, m.col2, m.col3, ...   -- one copy of every existing row per generated level
FROM my_table m
CROSS JOIN (SELECT level AS lvl FROM dual CONNECT BY level < 80) g;
The partition_column is just an assumption about your actual partitioning. You might have to change some values in the SELECT list for the new records to be put into different partitions. It helps to turn off constraints and indexes during this insert.
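For example, a minimal sketch of that step, with placeholder constraint and index names (my_table_pk, my_table_ix):
-- Placeholder names for illustration only.
ALTER TABLE my_table DISABLE CONSTRAINT my_table_pk;
ALTER INDEX my_table_ix UNUSABLE;
-- ... run the INSERT ... SELECT above ...
ALTER INDEX my_table_ix REBUILD;
ALTER TABLE my_table ENABLE CONSTRAINT my_table_pk;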
I have a log table with a lot of information.
I would like to partition it into two: first part is the logs from the past month, since they are commonly viewed. Second part is the logs from the rest of the year (Compressed).
My problem is that all the examples of partitions were along the lines of "up until 1/1/2013", "more recent than 1/1/2013" - that is, with fixed dates...
What I am looking for/expecting is a way to define a partition on the last month, so that when the day changes, the logs from 30 days ago are "automatically" transferred to the compressed partition.
I guess I could create another table which is completely compressed and move data into it using jobs, but I was hoping for a built-in solution.
Thank you.
I think you want interval partitions based on a date. This will automatically generate the partitions for you. For example, monthly partitions would be:
create table test_data (
  created_date DATE default sysdate not null,
  store_id     NUMBER,
  inventory_id NUMBER,
  qty_sold     NUMBER
)
PARTITION BY RANGE (created_date)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
  PARTITION part_01 VALUES LESS THAN (TO_DATE('20130101', 'YYYYMMDD'))
);
As data is inserted, Oracle will put it into the proper partition or create a new partition if needed. The partition names will be a bit cryptic (SYS_xxxx), but you can use the "partition for" clause to grab only the month you want. For example:
select * from test_data partition for (to_date('20130101', 'YYYYMMDD'));
It is not possible to automatically transfer data to a compressed partition. You can, however, schedule a simple job to compress last month's partition at the beginning of every month with this statement:
ALTER TABLE some_table
MOVE PARTITION FOR (add_months(trunc(SYSDATE), -1))
COMPRESS;
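A rough sketch of scheduling that with DBMS_SCHEDULER (the job name and schedule are placeholders; here it runs at 01:00 on the first of each month):
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'compress_last_month_part',   -- placeholder name
    job_type        => 'PLSQL_BLOCK',
    job_action      => q'[BEGIN
      EXECUTE IMMEDIATE 'ALTER TABLE some_table MOVE PARTITION FOR (add_months(trunc(SYSDATE), -1)) COMPRESS';
    END;]',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=MONTHLY;BYMONTHDAY=1;BYHOUR=1',
    enabled         => TRUE
  );
END;
/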
If you wanted to stay with only two partitions (current month, and an archive for all past transactions), you could also merge partitions with ALTER TABLE ... MERGE PARTITIONS, but as far as I know it would rebuild the whole archive partition, so I would discourage doing so and stay with storing each month in its own partition.
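For completeness, a merge would look roughly like this; the partition names are placeholders (interval partitioning generates SYS_Pnnn names, so look them up in USER_TAB_PARTITIONS first):
-- Placeholder partition names; merges adjacent partitions into one archive partition.
ALTER TABLE some_table
MERGE PARTITIONS sys_p101, sys_p102
INTO PARTITION archive_part;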