ETL for processing history records - performance

I am on something like a DWH project (not quite, but still), and there is an issue we constantly run into that I suspect has a better solution. It goes like this:
We receive some big files with records containing all the states a user has been in, like:
UID | State | Date
1 | Active | 20120518
2 | Inactive | 20120517
1 | Inactive | 20120517
...
We are usually only interested in the latest state of each user. So far so good: with a little sorting we can get the data the way we want it. The only problem is that these files are usually big, like 20-60 GB, and sorting them is sometimes a pain since the sort logic isn't always straightforward.
What we generally do is load everything into our Oracle database and use intermediary tables and materialized views to get it done. Still, performance sometimes bites us.
20-60 GB might be big, but not that big. I mean, there should be a somewhat more specialised way to deal with these records, shouldn't there?
I imagine two basic ways of tackling the issue:
1) Programming outside the DBMS, with scripts and compiled code. But maybe this is not very flexible unless a larger amount of time is invested in developing something. Also, I might have to busy myself administering the box's resources, which I would rather not worry about.
2) Load everything into the DBMS (Oracle in my case) and use whatever tools it provides to sort and clip the data. This is what we do today, though I am not sure we are using all the tools, or using them the right way, for Oracle 10g.
Question is then:
You have a 60 GB file with millions of historical records like the ones above, and your user wants a table in the database with the last state for each user.
How would you do it?
Thanks!

There are two things you can do to speed up the process.
The first thing is to throw compute power at it. If you have Enterprise Edition and lots of cores you will get significant reductions in load time with parallel query.
The other thing is to avoid loading the records you don't want. This is why you mention pre-processing the file. I'm not sure there's much you can do there, unless you have access to a Hadoop cluster to run some map-reduce jobs on your file (well, reduce mainly, the structure you post is about as mapped as can be already).
But there is an alternative: external tables. External tables are tables which have their data in OS files rather than in tablespaces. And they can be parallel enabled (provided your file meets certain criteria). Find out more.
So, you might have an external table like this
CREATE TABLE user_status_external (
    uid    NUMBER(6),
    status VARCHAR2(10),
    sdate  DATE
)
ORGANIZATION EXTERNAL
( TYPE oracle_loader
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY newline
    BADFILE 'usrsts.bad'
    DISCARDFILE 'usrsts.dis'
    LOGFILE 'usrsts.log'
    FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
    (
      uid    INTEGER EXTERNAL(6),
      status CHAR(10),
      sdate  DATE 'yyyymmdd'
    )
  )
  LOCATION ('usrsts.dmp')
)
PARALLEL
REJECT LIMIT UNLIMITED;
Note that you need read and write permissions on the DATA_DIR directory object.
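If the directory object does not exist yet, creating it and granting access looks roughly like this (the OS path '/data/incoming' and the grantee etl_user are assumptions for your environment, and the CREATE needs DBA privileges):

-- Path and grantee are placeholders; READ covers the data file, WRITE the log/bad/discard files.
CREATE OR REPLACE DIRECTORY data_dir AS '/data/incoming';
GRANT READ, WRITE ON DIRECTORY data_dir TO etl_user;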
Having created the external table, you can load only the desired data into your target table with an insert statement:
insert into user_status (uid, status, last_status_date)
select sq.uid
     , sq.status
     , sq.sdate
from (
      select /*+ parallel (et,4) */
             et.uid
           , et.status
           , et.sdate
           , row_number() over (partition by et.uid order by et.sdate desc) rn
      from   user_status_external et
     ) sq
where sq.rn = 1;
Note that as with all performance advice, there are no guarantees. You need to benchmark things in your environment.
Another thing is the use of INSERT: I'm assuming these are all fresh USERIDs, as that is the scenario your post suggests. If you have a more complicated scenario then you probably want to look at MERGE or a different approach altogether.
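If some UIDs may already exist in the target, a MERGE along these lines would update them instead of inserting duplicates (a sketch only; column names are taken from the INSERT above):

MERGE INTO user_status tgt
USING (
        -- Same "latest state per user" subquery as the INSERT version.
        SELECT uid, status, sdate
        FROM  (
               SELECT et.uid, et.status, et.sdate,
                      ROW_NUMBER() OVER (PARTITION BY et.uid ORDER BY et.sdate DESC) rn
               FROM   user_status_external et
              )
        WHERE rn = 1
      ) src
ON    (tgt.uid = src.uid)
WHEN MATCHED THEN
  UPDATE SET tgt.status = src.status, tgt.last_status_date = src.sdate
WHEN NOT MATCHED THEN
  INSERT (uid, status, last_status_date)
  VALUES (src.uid, src.status, src.sdate);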
One last thing: you seem to be assuming this is a common situation, which has some standard approaches. But most data warehouses load all the data they get. They may then filter it for various different uses, data marts, etc. But they almost always maintain a history in the actual warehouse of all the distinct records. So that's why you might not get an industry standard solution.

I'd go with something along the lines of what APC said as a first go. However, I think external tables can only load data in parallel if the data is in multiple files, so you might have to split the file into several. How are the files generated? A 20-60 GB file is a real pain to deal with - can you get the file generation changed so you get, say, X 2 GB files instead?
After getting all the records into the database, you might run into problems attempting to sort 60 GB of data - it would be worth having a look at the sort stage of the query you are using to extract the latest status. In the past I have helped large sorts by hash partitioning the data on one of the fields being sorted, in this case user_id. Oracle then only has to do X smaller sorts, each of which can proceed in parallel.
So, my thoughts would be:
Try and get many smaller files generated instead of 1 big one
Using External tables, see if it is feasible to extract the data you want directly from the external tables
If not, load the entire contents of the files into a hash-partitioned table - at this stage make sure you do insert /*+ append nologging */ to avoid undo and redo generation. If your database has force_logging set to true, the nologging hint will have no effect.
Run the select on the staged data to extract only the rows you care about and then trash the staged data.
The nologging option is probably critical to you getting good performance - to load 60GB of data, you are going to generate at least 60GB of redo logs, so if that can be avoided, all the better. You would probably need to have a chat with your DBA about that!
Assuming you have lots of CPU available, it may also make sense to compress the data as you bulk load it into the staging table. Compression may well halve the size of your data on disk if it has repeating fields - the disk IO saved when writing it usually more than offsets any extra CPU consumed when loading it.
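A hedged sketch of that staging approach, reusing the external table from the earlier answer (the partition count, degree of parallelism and compression settings are assumptions to adjust for your system):

-- Hash-partitioned, compressed staging table; NOLOGGING only helps if the
-- database is not in FORCE LOGGING mode.
CREATE TABLE user_status_stage (
    uid    NUMBER(6),
    status VARCHAR2(10),
    sdate  DATE
)
NOLOGGING COMPRESS
PARTITION BY HASH (uid) PARTITIONS 16;

-- Direct-path, parallel load straight from the external table.
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(user_status_stage, 4) */ INTO user_status_stage
SELECT /*+ PARALLEL(et, 4) */ uid, status, sdate
FROM   user_status_external et;

COMMIT;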

I may be oversimplifying the problem, but why not something like:
create materialized view my_view
tablespace my_tablespace
nologging
build immediate
refresh complete on demand
with primary key
as
select uid, state, date
from (
      select /*+ parallel (t,4) */ uid, state, date,
             row_number() over (partition by uid order by date desc) rnum
      from my_table t
     )
where rnum = 1;
Then refresh fully when you need to.
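For example, a complete refresh on demand (the view name matches the DDL above):

BEGIN
  -- 'C' asks for a complete refresh of the materialized view.
  DBMS_MVIEW.REFRESH('MY_VIEW', method => 'C');
END;
/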
Edit: And don't forget to rebuild stats and probably throw a unique index on uid.

I would write a program to iterate over each record and retain only those which are more recent than the record previously seen for that user. At the end, insert the retained data into the database.
How practical that is would depend on how many users we're talking about - you could end up having to think carefully about your intermediate storage.
In general, this becomes (in pseudo-code):
foreach row in file
    savedrow = saved[row.uid]
    if savedrow is null
        save row as saved[row.uid]
    else
        if row is more desirable than savedrow
            save row as saved[row.uid]
        end
    end
end
send saved rows to database
The point is, you need to define how one row is considered to be more desirable than another. In the simple case, for a given user, the current row's date is later than that of the last row we saved. At the end, you'd have a list of rows, one per user, each of which has the most recent date you saw.
You could generalise the script or program so that the framework is separate from the code that understands each data file.
It'll still take a while, mind :-)

Related

Data cleanup in Oracle DB is taking long time for 300 billion records

Problem statement:
There is an address table in Oracle that has relationships with multiple tables such as subscriber, member, etc.
The current design is such that when there is any change in the associated tables, the record version is incremented throughout all tables.
So a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.
We need to identify and remove duplicate records, and update foreign keys in associated tables while making sure it doesn't impact the running application.
Tried solution:
We have written a script for the cleanup logic, in which a unique hash is generated for every address. If the calculated hash is already present, the address is a duplicate; we then merge it into a single address record and update the foreign keys in the associated tables.
But the problem is there are around 300 billion records in the address table, so this cleanup process is taking a lot of time and will take several days to complete.
We have tried adding an index on the hash column, but the process is still taking time.
We have also updated the insertion/query logic to use addresses as per the new structure (using the hash, and without the version), in order to take care of incoming requests in production.
We are planning to do the processing in chunks, but it will be a very long, ongoing activity.
Questions:
We would like to know if any further improvement can be made to the above approach.
Will distributed processing help here? (maybe using Hadoop, Spark/Hive/MR, etc.)
Is there some sort of tool that can be used here?
Suggestion 1
Use built-in delete parallel
delete /*+ parallel(t 8) */ mytable t where ...
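Note that the hint only parallelises the delete itself if parallel DML is enabled in the session. A rough sketch, where the address table name comes from the question and the address_hash column plus the keep-one-row-per-hash predicate are assumptions:

ALTER SESSION ENABLE PARALLEL DML;

-- address_hash is assumed to be the hash column described in the question;
-- keep the first row per hash and delete the rest (still expensive at this scale).
DELETE /*+ parallel(t 8) */ FROM address t
WHERE  t.rowid NOT IN (SELECT MIN(rowid) FROM address GROUP BY address_hash);

COMMIT;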
Suggestion 2
Use distributed processing (Hadoop, Spark/Hive) - watch out for potential contention on indexes or table blocks. It is recommended to have each process work on a logically isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
...
Suggestion 3
If more than ~30% of the table needs to be deleted, the fastest way is to create an empty table, copy all the required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course this requires downtime, and the duration greatly depends on the number of indexes - the more you have, the longer it will take.
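A hedged sketch of that approach, again assuming the address_hash column from the question:

-- Copy one surviving row per hash into a new table with minimal redo.
CREATE TABLE address_new NOLOGGING PARALLEL AS
SELECT a.*
FROM   address a
WHERE  a.rowid IN (SELECT MIN(rowid) FROM address GROUP BY address_hash);

-- Recreate indexes, constraints and grants on address_new here, then swap:
DROP TABLE address;
ALTER TABLE address_new RENAME TO address;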
P.S. There are no "magic" tools to do this. In the end they all run the same SQL commands that you can.
It is also possible to use Oracle's MERGE statement to insert the data if you keep the SQL clean.

How to avoid data duplicates in ClickHouse

I already read this but I still have questions. I only have one VM with 16 GB of RAM, 4 cores and a disk of 100 GB, with only ClickHouse and a light web api working on it.
I'm storing leaked credentials in a database:
CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64)
) ENGINE = ReplacingMergeTree
PARTITION BY first_seen
ORDER BY (user, domain, password, first_seen)
It sometimes happens that some credentials appear more than once (inside one file or across many).
My long-term objective is(was) the following:
- when inserting a credential which is already in the database, I want to keep the smaller first_seen and add the new leak id to the field leaks.
I have tried the ReplacingMergeTree engine: I inserted the same data twice ($ cat "data.csv" | clickhouse-client --query 'INSERT INTO credential FORMAT CSV') and then performed OPTIMIZE TABLE credential to force the replacing engine to do its asynchronous job, according to the documentation. Nothing happens: the data is there twice in the database.
So I wonder:
- What did I miss with the ReplacingMergeTree engine?
- How does OPTIMIZE work, and why doesn't it do what I was expecting from it?
- Is there a real solution for avoiding duplicated data on a single instance of ClickHouse?
I have already tried to do it manually. My problem is that I have 4.5 billion records in my database, and identifying duplicates inside a 100k-entry sample takes almost 5 minutes with the following query: SELECT DISTINCT user, domain, password, count() as c FROM credential WHERE has(leaks, 0) GROUP BY user, domain, password HAVING c > 1. This query obviously does not work on the 4.5b entries, as I do not have enough RAM.
Any ideas will be tried.
Multiple things are going wrong here:
You partition very granularly: you should partition by something like a month of data instead. As it stands, ClickHouse has to scan lots of files.
You don't provide the table engine with a version. The problem here is that ClickHouse is not able to work out which row should replace the other.
I suggest you use the "version" parameter of the ReplacingMergeTree, as it allows you to provide an incremental version as a number, or if this works better for you, the current DateTime (where the last DateTime always wins)
You should never design your solution to require OPTIMIZE to be called to make your data consistent in your result sets; it is not designed for this.
Clickhouse always allows you to write a query where you can provide (eventual) consistency without using OPTIMIZE beforehand.
Another reason for avoiding OPTIMIZE, besides it being really slow and heavy on your DB, is that you could end up with race conditions, where other clients of the database (or replicating ClickHouse nodes) invalidate your data between the time the OPTIMIZE finishes and the SELECT runs.
Bottomline, as a solution:
So what you should do here is, add a version column. Then when inserting rows, insert the current timestamp as a version.
Then select, for each row, only the one that has the highest version in your result set, so that you do not depend on OPTIMIZE for anything other than garbage collection.
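A minimal sketch of that layout, reusing the column names from the question (the monthly partition and the version column are the suggested changes; first_seen is dropped from the sorting key so that rows for the same credential collapse into one):

CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64),
    version DateTime
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(first_seen)
ORDER BY (user, domain, password);

-- Read the latest version of each credential without relying on OPTIMIZE.
SELECT
    user, domain, password,
    argMax(first_seen, version) AS first_seen,
    argMax(leaks, version)      AS leaks
FROM credential
GROUP BY user, domain, password;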

Why does the same select statement have different costs in Oracle?

Recently I used Oracle 11g database to do my homework. I had 12 tables, like trip_data_11 and trip_data_12.
They have the same structure and the number of records is almost the same. I created the same indexes on each table.
So for trip_data_11 table:
create index pick_add_11 on trip_data_11(pickup_longitude,pickup_latitude);
create index drop_add_11 on trip_data_11(dropoff_longitude,dropoff_latitude);
The same operation to trip_data_12.
Then I used the following select statement to select the taxi numbers per day.
SELECT
COUNT(DISTINCT(td.medallion)) AS taxi_num
FROM
SYS.TRIP_DATA_11 td
WHERE
(td.pickup_longitude >= -74.2593 AND td.pickup_longitude <= -73.7011
AND td.pickup_latitude >= 40.4770 AND td.pickup_latitude <= 40.9171
)
AND
(td.dropoff_longitude >= -74.2593 AND td.dropoff_longitude <= -73.7011
AND td.dropoff_latitude >= 40.4770 AND td.dropoff_latitude <= 40.9171
)
AND
td.trip_distance > 0
AND
td.passenger_count > 0
GROUP BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}')
ORDER BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}');
It took 38 seconds. When I changed the table name to SYS.TRIP_DATA_12, the problem appeared: it took more than 2 hours.
What's more, it did not finish. I don't know why.
Today I asked my classmate and he said: clear the cache. So I used the following statements to do it.
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
Now when I use the same select statement for SYS.TRIP_DATA_11 I get the same poor performance as SYS.TRIP_DATA_12. Why?
It seems like your classmate was having a good joke at your expense.
Clearly your query was only performing well because you had a warm buffer cache full of all the data you needed from TRIP_DATA_11. By flushing the caches you have zapped all that, and now you have the same bad performance for all tables.
Tuning queries is hard, because there are lots of possibilities. Please read the documentation on it.
To pick just one thing: you're searching ranges, which is problematic. How many rows fall between -74.2593 and -73.7011? It might be a lot more than, say, -71.00 to -68.59, even though that's a broader range. Understanding your data - its volume, its distribution and its skew - is crucial.
As a first step learn how to use EXPLAIN PLAN. Find out more. To get better plans, gather statistics on your tables and their indexes, using DBMS_STATS package. Find out more.
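For instance, something along these lines (a cut-down version of your query; the stats call assumes the table lives in your own schema rather than SYS):

-- Show the optimiser's plan for a representative query.
EXPLAIN PLAN FOR
  SELECT COUNT(DISTINCT td.medallion)
  FROM   trip_data_11 td
  WHERE  td.pickup_longitude BETWEEN -74.2593 AND -73.7011
  AND    td.pickup_latitude  BETWEEN  40.4770 AND  40.9171;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Gather fresh statistics on the table and its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER,
                                tabname => 'TRIP_DATA_11',
                                cascade => TRUE);
END;
/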
One tip. Oracle only uses one index to access a table. So it will choose pick_add_11 or drop_add_11 but not both. It will then read all the matching records from the table and filter them by the other criteria. You may get much better performance from a index designed to service this query:
create index add_11 on trip_data_11
(pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
, trip_distance
, passenger_count )
;
The select statement will execute the entire filter against this index and only touch the table to get the MEDALLION values. (You could add medallion to the index too.) Experiment with the column order. As latitude has a narrower range than longitude, it should probably go first; maybe the drop-off values should appear before the pick-up ones. You want an index in which the greatest number of related records are clustered together.
Indexes like this can be an overhead, so we wouldn't want to maintain too many of them in real life. But they are a valuable technique for tuning expensive queries which are run frequently.
Oh, and @Justin's right: don't use SYS for doing application work. Even for a school assignment you should create a fresh schema and create your tables, etc. in that.

Delete rows vs Delete Columns performance

I'm creating the datamodel for a timeseries application on Cassandra 2.1.3. We will be preserving X amount of data for each user of the system and I'm wondering what is the best approach to design for this requirement.
Option1:
Use a 'bucket' in the partition key, so data for X period goes into the same row. Something like this:
((id, bucket), timestamp) -> data
I can delete a single row at once at the expense of maintaining this bucket concept. It also limits the range I can query on timestamp, probably resulting in several queries.
Option2:
Store all the data in the same row; the N deletes are then per column.
(id, timestamp) -> data
Range queries are easy again. But what about performance after many column deletes?
Given that we plan to use TTL to let the data expire, which of the two models would deliver the best performance? Is the tombstone overhead of Option1 << Option2 or will there be a tombstone per column on both models anyway?
I'm trying to avoid burying myself in the tombstone graveyard.
I think it will all depend on how much data you plan on having for the given partition key you end up choosing, what your TTL is and what queries you are making.
I typically lean towards option #1, especially if your TTL is the same for all writes. In addition, if you are using LeveledCompactionStrategy or DateTieredCompactionStrategy, Cassandra will do a great job keeping data from the same partition in the same SSTable, which will greatly improve read performance.
If you use Option #2, data for the same partition could likely be spread across multiple levels (if using LCS) or just in general multiple sstables, which may cause you to read from a lot of SSTables, depending on the nature of your queries. There is also the issue of hotspotting, where you could overload particular cassandra nodes if you have a really wide partition.
The other benefit of #1 (which you allude to), is that you can easily delete the entire partition, which creates a single tombstone marker which is much cheaper. Also, if you are using the same TTL, data within that partition will expire pretty much at the same time.
I do agree that it is a bit of a pain to have to make multiple queries to read across multiple partitions as it pushes some complexity into the application end. You may also need to maintain a separate table to keep track of the buckets for the given id if they can not be determined implicitly.
As far as performance goes, do you see it as likely that you will need to read across partitions when your application makes queries? For example, if you have a query for 'the most recent 1000 records' and a partition is typically wider than that, you may only need to make 1 query for Option #1. However, if you want a query like 'give me all records', Option #2 may be better, as otherwise you'll need to make a query for each bucket.
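For illustration, a read under Option #1 hits a single partition (the table and column names here simply mirror the bucketed layout above, with an example bucket value):

-- PRIMARY KEY ((id, bucket), timestamp): newest 1000 records for one id and bucket.
SELECT timestamp, data
FROM   timeseries_by_bucket
WHERE  id = 1 AND bucket = 201503
ORDER  BY timestamp DESC
LIMIT  1000;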
After creating the tables you described above:
CREATE TABLE option1 (
    id bigint,
    bucket bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY ((id, bucket), timestamp)
) WITH default_time_to_live=10;

CREATE TABLE option2 (
    id bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY (id, timestamp)
) WITH default_time_to_live=10;
I inserted a test row:
INSERT INTO option1 (id,bucket,timestamp,data) VALUES (1,2015,'2015-03-16 11:24:00-0500','test1');
INSERT INTO option2 (id,timestamp,data) VALUES (1,'2015-03-16 11:24:00-0500','test2');
I waited 10 seconds, queried with tracing on, and saw identical tombstone counts for each table. So either way, that shouldn't be too much of a concern for you.
The real issue, is that if you think you'll ever hit the limit of 2 billion columns per partition, then Option #1 is the safe one. If you have a lot of data Option #1 might perform better (because you'll be eliminating the need to look at partitions that don't match your bucket), but really either one should be fine in that respect.
tl;dr;
As the issues of performance and tombstones are going to be similar no matter which option you choose, I'm thinking that Option #2 is the better one, just due to ease of querying.

overcoming 'log file sync' by design?

Advice/suggestions needed for a bit of application design.
I have an application which uses two tables. One is a staging table which many separate processes write to; once a 'group' of processes has finished, another job comes along and aggregates the results into a final table, then deletes that 'group' from the staging table.
The problem that I'm having is that when the staging table is being cleared out, lots of redo is generated and I'm seeing a lot of 'log file sync' waits in the database. This is a shared database with many other applications and this is causing some issues.
When applying the aggregate, the rows are reduced to about 1 row in the final table for every 20 rows in the staging table.
I'm thinking of getting around this by creating a table for each 'group' rather than having a single staging table. Once done, that table can just be dropped, which should result in much less redo.
I only have SE, so partitioned tables aren't an option. Faster disks for the redo probably aren't an option in the short term either.
Is this a bad idea? Any better solutions to be offered?
Thanks.
Would it be possible to solve the problem by having your process do a logical delete (i.e. set a DELETE_FLAG column in the table to 'Y') and then having a nightly process that truncates the table (potentially writing any non-deleted rows to a separate table before the truncate and then copying them back afterwards)?
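A rough sketch of that idea (the staging table and GROUP_ID names are assumptions, DELETE_FLAG comes from the suggestion above; the second half would run in a quiet nightly window):

-- Daytime: mark an aggregated group as logically deleted instead of deleting rows.
UPDATE staging SET delete_flag = 'Y' WHERE group_id = :grp;
COMMIT;

-- Nightly: set live rows aside, truncate (minimal redo compared with DELETE), restore.
CREATE TABLE staging_keep AS SELECT * FROM staging WHERE delete_flag = 'N';
TRUNCATE TABLE staging;
INSERT /*+ APPEND */ INTO staging SELECT * FROM staging_keep;
COMMIT;
DROP TABLE staging_keep;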
Are you certain that the source of the log file sync waits is that your disks can't keep up with the I/O? That's certainly possible, of course, but there are other possible causes of excessive log file sync waits including excessive commits. There is an excellent article on tuning log file sync events on the Pythian blog.
The most common cause of excessive log file syncs is too frequent commits, which are often deliberately coded in a mistaken attempt to reduce system load due to locking. You should commit only when your business transaction is complete.
Loading each group into a separate table sounds like a fine plan to reduce redo. You can truncate the individual group table following each aggregation.
Another (but I think probably worse) option is to create a new staging table with the groups that haven't been aggregated then drop the original and rename the new table to replace the staging table.
I prefer Justin's suggestion ("logical delete"), but another option to consider might be a partitioned table, if you have the EE licence. The aggregation process could drop a partition instead of deleting the rows.
