How to avoid data duplicates in ClickHouse

I already read this but I still have questions. I only have one VM with 16 GB of RAM, 4 cores and a disk of 100 GB, with only ClickHouse and a light web api working on it.
I'm storing leaked credentials in a database:
CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64)
) ENGINE ReplacingMergeTree
PARTITION BY first_seen
ORDER BY user, domain, password, first_seen
It sometimes happens that some credentials appear more than once (within a single file or across several).
My long-term objective is (was) the following:
- when inserting a credential which is already in the database, I want to keep the smaller first_seen and add the new leak id to the field leaks.
I have tried the ReplacingMergeTree engine: I inserted the same data twice ($ cat "data.csv" | clickhouse-client --query 'INSERT INTO credential FORMAT CSV') and then ran OPTIMIZE TABLE credential to force the replacing engine to do its asynchronous job, according to the documentation. Nothing happened; the data is in the database twice.
So I wonder:
- What did I miss with the ReplacingMergeTree engine?
- How does OPTIMIZE work, and why doesn't it do what I was expecting from it?
- Is there a real solution for avoiding duplicated data on a single instance of ClickHouse?
I have already tried to do it manually. My problem is that I have 4.5 billion records in my database, and identifying duplicates inside a 100k-entry sample takes almost 5 minutes with the following query:
SELECT DISTINCT user, domain, password, count() as c FROM credential WHERE has(leaks, 0) GROUP BY user, domain, password HAVING c > 1
This query obviously does not work on the 4.5 billion entries, as I do not have enough RAM.
Any ideas will be tried.

Multiple things are going wrong here:
You partition far too granularly. Partitioning by first_seen creates one partition per day; you should partition by something coarser, such as a month of data. As it stands, ClickHouse has to scan lots of files. Note also that ReplacingMergeTree only removes duplicates among rows that share the same sorting key within the same partition.
You don't provide the table engine with a version. The problem here is that ClickHouse cannot find out which row should replace the other.
I suggest you use the "version" parameter of ReplacingMergeTree, as it allows you to provide an incremental version as a number or, if this works better for you, the current DateTime (where the last DateTime always wins).
You should never design your solution to require OPTIMIZE to be called to make the data in your result sets consistent; it is not designed for this.
ClickHouse always allows you to write a query that provides (eventual) consistency without running OPTIMIZE beforehand.
The reason for avoiding OPTIMIZE, besides its being really slow and heavy on your DB, is that you could end up in race conditions, where other clients of the database (or replicating ClickHouse nodes) could invalidate your data between the moment OPTIMIZE finishes and the moment the SELECT is done.
Bottom line, as a solution:
Add a version column. Then, when inserting rows, insert the current timestamp as the version.
Then select, for each row, only the one that has the highest version in your result, so that you do not depend on OPTIMIZE for anything other than garbage collection.
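To make the "highest version wins" rule concrete, here is a minimal sketch in plain Python rather than ClickHouse SQL. It only illustrates the selection logic; the row layout, the sample values and the function name latest_per_key are assumptions made for the example.

```python
from datetime import date


def latest_per_key(rows):
    """Keep, for each (user, domain, password) key, the row with the highest version."""
    best = {}
    for row in rows:
        key = (row["user"], row["domain"], row["password"])
        if key not in best or row["version"] > best[key]["version"]:
            best[key] = row
    return list(best.values())


# Hypothetical sample rows: the same credential inserted twice with a newer version.
rows = [
    {"user": "alice", "domain": "example.com", "password": "hunter2",
     "first_seen": date(2019, 1, 5), "leaks": [1], "version": 1},
    {"user": "alice", "domain": "example.com", "password": "hunter2",
     "first_seen": date(2019, 1, 5), "leaks": [1, 7], "version": 2},
    {"user": "bob", "domain": "example.org", "password": "letmein",
     "first_seen": date(2019, 2, 1), "leaks": [3], "version": 1},
]
deduplicated = latest_per_key(rows)
```

In ClickHouse itself this kind of selection is usually expressed with argMax(column, version) under a GROUP BY on the key columns, or with the FINAL modifier on the SELECT; which one fits depends on the query and the data volume.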


Data cleanup in Oracle DB is taking long time for 300 billion records

Problem statement:
There is an address table in Oracle which has relationships with multiple tables such as subscriber, member, etc.
The current design is such that any change in an associated table increments the record version across all tables.
So a new record is added to the address table even if the same address is already present, resulting in a large number of duplicate copies.
We need to identify and remove the duplicate records and update the foreign keys in the associated tables, while making sure this doesn't impact the running application.
Tried solution:
We have written a script for the cleanup logic, in which a unique hash is generated for every address. If the calculated hash is already present, the address is a duplicate; we then merge it into a single address record and update the foreign keys in the associated tables.
But the problem is that there are around 300 billion records in the address table, so this cleanup process is taking a lot of time and will need several days to complete.
We have tried adding an index on the hash column, but the process is still slow.
We have also updated the insertion/query logic to use addresses as per the new structure (using the hash, and without the version), in order to take care of incoming requests in production.
We are planning to do the processing in chunks, but it will be a very long, ongoing activity.
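For readers who want to picture the hash-based merge described above, here is a minimal Python sketch. It is not the actual cleanup script: the normalization rule, the SHA-256 choice, the tuple layout of an address and the names address_hash and plan_merge are all assumptions for illustration.

```python
import hashlib


def address_hash(address):
    """Hash a normalized form of the address (the normalization rule is an assumption)."""
    normalized = "|".join(part.strip().lower() for part in address)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_merge(addresses):
    """Return (survivors, remap): ids to keep, and duplicate id -> surviving id."""
    first_id_by_hash = {}
    remap = {}
    for addr_id, address in addresses:
        h = address_hash(address)
        if h in first_id_by_hash:
            remap[addr_id] = first_id_by_hash[h]  # duplicate: repoint foreign keys here
        else:
            first_id_by_hash[h] = addr_id         # first occurrence survives
    return sorted(first_id_by_hash.values()), remap


# Hypothetical sample: ids 1 and 2 are the same address written slightly differently.
addresses = [
    (1, ("1 Main St", "Springfield", "12345")),
    (2, ("1 main st ", "Springfield", "12345")),
    (3, ("9 Oak Ave", "Shelbyville", "54321")),
]
survivors, remap = plan_merge(addresses)
```

The remap dictionary is what would drive the foreign key updates in the associated tables; rows whose ids appear as its keys are the ones that can be deleted afterwards.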
We would like to know whether any further improvement can be made to the above approach.
Will distributed processing help here (maybe using Hadoop, Spark, Hive, MR, etc.)?
Is there some sort of tool that can be used here?
Suggestion 1
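Use the built-in parallel delete (parallel DML must first be enabled in the session, e.g. with ALTER SESSION ENABLE PARALLEL DML):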
delete /*+ parallel(t 8) */ mytable t where ...
Suggestion 2
Use distributed processing (Hadoop, Spark, Hive), but watch out for potential contention on indexes or table blocks. It is recommended to have each process work on a logically isolated subset, e.g.
process 1 - delete mytable t where id between 1000 and 1999
process 2 - delete mytable t where id between 2000 and 2999
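The per-process id ranges can be generated rather than written by hand. The following Python sketch is only an illustration under assumptions: the helper names, the table name and the chunk size are made up, and the generated statements omit the duplicate-detection predicate that a real cleanup would add.

```python
def id_ranges(min_id, max_id, chunk_size):
    """Yield inclusive, non-overlapping (low, high) id ranges covering min_id..max_id."""
    low = min_id
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield low, high
        low = high + 1


def delete_statements(table, min_id, max_id, chunk_size):
    """Build one DELETE per range; each could be handed to a separate worker."""
    return [
        # The extra condition that selects only duplicates is left out of this sketch.
        f"delete from {table} t where id between {low} and {high}"
        for low, high in id_ranges(min_id, max_id, chunk_size)
    ]


statements = delete_statements("mytable", 1000, 3499, 1000)
```

Because the ranges never overlap, two workers never touch the same rows, which is the point of the "logically isolated subset" advice.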
Suggestion 3
If more than ~30% of the table needs to be deleted, the fastest way would be to create an empty table, copy all required rows into it, drop the original table, rename the new one, and recreate all indexes and constraints. Of course this requires downtime, and it depends greatly on the number of indexes: the more you have, the longer it will take.
P.S. There are no "magic" tools to do it. In the end they all run the same SQL commands as you can.
It is also possible to use Oracle's MERGE statement to insert data if you use plain SQL.

Postgres tsvector_update_trigger sometimes takes minutes

I have configured free text search on a table in my postgres database. Pretty simple stuff, with firstname, lastname and email. This works well and is fast.
I do, however, sometimes experience long delays when inserting a new entry into the table: the insert keeps running for minutes and also generates huge WAL files (we use the WAL files for replication).
Is there anything I need to be aware of with my free text index? Like Postgres maybe randomly restructuring it for performance reasons? My index is currently around 400 MB.
Thanks in advance!
Given the size of the WAL files, I suspect you are right that it is an index update/rebalancing that is causing the issue. However I have to wonder what else is going on.
I would recommend against storing tsvectors in separate columns. A better way is to build an index on to_tsvector()'s output. You can have multiple indexes for multiple languages if you need. So instead of a trigger that takes, say, a field called description and stores the tsvector in desc_tsvector, I would recommend just doing the following (an expression index needs the two-argument form of to_tsvector with an explicit configuration; 'english' here is an assumption):
CREATE INDEX mytable_description_tsvector_idx ON mytable USING gin (to_tsvector('english', description));
Now, if you need a consistent search interface across a whole table, there are more elegant ways of doing this using "table methods."
In general the functional index approach has fewer issues associated with it than anything else.
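Now a second thing you should be aware of is partial indexes. If you need to, you can index only records of interest. For example, if most of my queries only check the last year, I can index from a fixed cutoff date. The predicate cannot call now(), because index predicates must be immutable; the date below is a placeholder, 'english' is an assumed configuration, and the index has to be recreated as the cutoff moves:
CREATE INDEX mytable_description_tsvector_idx ON mytable USING gin (to_tsvector('english', description))
WHERE created_at > '2012-01-01';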

ETL for processing history records

I am in sort of a DWH project (not quite, but still), and there is an issue we constantly run into for which I was wondering whether there is a better solution. It goes as follows.
We receive some big files with records containing all the states a user has been in, like:
UID | State | Date
1 | Active | 20120518
2 | Inactive | 20120517
1 | Inactive | 20120517
And we are usually only interested in the latest state of each user. So far so good: with just a little sorting we could get it the way we want. The only problem is that these files are usually big, around 20-60 GB, and sorting them is sometimes a pain, since the sorting logic usually isn't so straightforward.
What we generally do is load everything into our Oracle database and use intermediary tables and materialized views to get it done. Still, sometimes performance bites us.
20-60 GB might be big, but not that big. I mean, there should be a somewhat more specialised way to deal with these records, shouldn't there?
I imagine two basic ways of tackling the issue:
1) Programming outside the DBMS, with scripts and compiled programs. But maybe this is not very flexible unless a larger amount of time is invested in developing something. Also, I might have to busy myself administering the box's resources, which I would rather not worry about.
2) Load everything into the DBMS (Oracle in my case) and use whatever tools it provides to sort and clip the data. This would be my case, though I am not sure we are using all the tools, or simply doing it the right way for Oracle 10g.
The question is then:
You have a 60 GB file with millions of historical records like the ones above, and your user wants a table in the DB with the last state for each user.
How would you guys do it?
There are two things you can do to speed up the process.
The first thing is to throw compute power at it. If you have Enterprise Edition and lots of cores you will get significant reductions in load time with parallel query.
The other thing is to avoid loading the records you don't want. This is why you mention pre-processing the file. I'm not sure there's much you can do there, unless you have access to a Hadoop cluster to run some map-reduce jobs on your file (well, reduce mainly, the structure you post is about as mapped as can be already).
But there is an alternative: external tables. External tables are tables which have their data in OS files rather than tablespaces. And they can be parallel enabled (provided your file meets certain criteria).
So, you might have an external table like this
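-- Reconstructed sketch: the '|' delimiter, LRTRIM and the uid field format are assumptions.
CREATE TABLE user_status_external (
    uid NUMBER(6),
    status VARCHAR2(10),
    sdate DATE
)
ORGANIZATION EXTERNAL
(TYPE oracle_loader
 DEFAULT DIRECTORY data_dir
 ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    BADFILE 'usrsts.bad'
    DISCARDFILE 'usrsts.dis'
    LOGFILE 'usrsts.log'
    FIELDS TERMINATED BY '|' LRTRIM
    (uid CHAR(6),
     status CHAR(10),
     sdate date 'yyyymmdd')
 )
 LOCATION ('usrsts.dmp')
)
PARALLEL;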
Note that you need read and write permissions on the DATA_DIR directory object.
Having created the external table you can load only the desired data into your target table with an insert statement:
insert into user_status (uid, status, last_status_date)
select sq.uid
     , sq.status
     , sq.sdate
from (
    select /*+ parallel (et,4) */
           et.uid
         , et.status
         , et.sdate
         , row_number() over (partition by et.uid order by et.sdate desc) rn
    from user_status_external et
) sq
where sq.rn = 1
Note that as with all performance advice, there are no guarantees. You need to benchmark things in your environment.
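To see what the row_number() filter returns on the three sample records from the question, here is a small sketch that runs the same kind of query in Python against an in-memory SQLite database instead of Oracle. The staging table name is hypothetical, dates are stored as yyyymmdd text, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical staging table standing in for the external table.
conn.execute("create table user_status_staging (uid integer, status text, sdate text)")
conn.executemany(
    "insert into user_status_staging values (?, ?, ?)",
    [(1, "Active", "20120518"), (2, "Inactive", "20120517"), (1, "Inactive", "20120517")],
)
# Number each user's rows from newest to oldest and keep only the newest one.
latest = conn.execute(
    """
    select uid, status, sdate
    from (
        select uid, status, sdate,
               row_number() over (partition by uid order by sdate desc) as rn
        from user_status_staging
    )
    where rn = 1
    order by uid
    """
).fetchall()
conn.close()
```

User 1 has two records, and only the one dated 20120518 survives the rn = 1 filter.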
Another thing is the use of INSERT: I'm assuming these are all fresh USERIDs, as that is the scenario your post suggests. If you have a more complicated scenario then you probably want to look at MERGE or a different approach altogether.
One last thing: you seem to be assuming this is a common situation, which has some standard approaches. But most data warehouses load all the data they get. They may then filter it for various different uses, data marts, etc. But they almost always maintain a history in the actual warehouse of all the distinct records. So that's why you might not get an industry standard solution.
I'd go with something along the lines of what APC said as a first go. However, I think external tables can only load data in parallel if the data is in multiple files, so you might have to cut the file into several. How are the files generated? A 20-60 GB file is a real pain to deal with; can you get the generation of the files changed so that you get X 2 GB files, for example?
After getting all the records into the database, you might run into problems attempting to sort 60GB of data - it would be worth having a look at the sort stage of the query you are using to extract the latest status. In the past I helped large sorts by hash partitioning the data on one of the fields to be sorted, in this case user_id. Then Oracle only has to do X smaller sorts, each of which can proceed in parallel.
So, my thoughts would be:
Try and get many smaller files generated instead of 1 big one
Using External tables, see if it is feasible to extract the data you want directly from the external tables
If not, load the entire contents of the files into a hash-partitioned table. At this stage make sure you do a direct-path insert (insert /*+ append */) into a table that has been created or altered as NOLOGGING, to minimise undo and redo generation; note that NOLOGGING is a table attribute, not a hint. If your database has force logging enabled, NOLOGGING will have no effect.
Run the select on the staged data to extract only the rows you care about and then trash the staged data.
The nologging option is probably critical to you getting good performance - to load 60GB of data, you are going to generate at least 60GB of redo logs, so if that can be avoided, all the better. You would probably need to have a chat with your DBA about that!
Assuming you have lots of CPU available, it may also make sense to compress the data as you bulk load it into the staging table. Compression may well halve the size of your data on disk if it has repeating fields; the disk IO saved when writing it usually more than makes up for any extra CPU consumed when loading it.
I may be oversimplifying the problem, but why not something like:
create materialized view my_view
tablespace my_tablespace
build immediate
refresh complete on demand
with primary key
as
select uid, state, date from (
    select /*+ parallel (t,4) */ uid, state, date, row_number() over (partition by uid order by date desc) rnum
    from my_table t
)
where rnum = 1;
Then refresh fully when you need to.
Edit: And don't forget to rebuild stats and probably throw a unique index on uid.
I would write a program to iterate over each record and retain only those which are more recent than record previously seen. At the end, insert the data into the database.
How practical that is would depend on how many users we're talking about - you could end up having to think carefully about your intermediate storage.
In general, this becomes (in pseudo-code):
foreach row in file
    savedrow = row saved for this row's user
    if savedrow is null
        save row
    else if row is more desirable than savedrow
        save row
send saved rows to database
The point is, you need to define how one row is considered to be more desirable than another. In the simple case, for a given user, the current row's date is later than the last row we saved. At the end, you'd have a list of rows, one per user, each of which has the most recent date you saw.
You could generalise the script or program so that the framework is separate from the code that understands each data file.
It'll still take a while, mind :-)
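As a rough illustration of that single-pass idea, here is a minimal Python sketch. It assumes the pipe-delimited layout from the question and keeps one row per user in memory; the function name latest_rows and the key/is_better callbacks are hypothetical, and they are what keeps the generic loop separate from the file-specific logic.

```python
import csv
import io


def latest_rows(lines, key, is_better):
    """Stream rows once, keeping per key only the most desirable row seen so far."""
    saved = {}
    for row in csv.reader(lines, delimiter="|"):
        row = [field.strip() for field in row]
        k = key(row)
        if k not in saved or is_better(row, saved[k]):
            saved[k] = row
    return saved


# Sample data in the question's layout; a real run would pass an open file instead.
sample = io.StringIO(
    "1 | Active | 20120518\n"
    "2 | Inactive | 20120517\n"
    "1 | Inactive | 20120517\n"
)
result = latest_rows(
    sample,
    key=lambda row: row[0],                      # group by UID
    is_better=lambda new, old: new[2] > old[2],  # yyyymmdd strings sort chronologically
)
```

Memory use grows with the number of distinct users rather than with the file size, which is the intermediate-storage concern mentioned above.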

ABAP select performance hints?

Are there general ABAP-specific tips related to performance of big SELECT queries?
In particular, is it possible to close once and for all the question of FOR ALL ENTRIES IN vs JOIN?
A few (more or less) ABAP-specific hints:
Avoid SELECT * where it's not needed, try to select only the fields that are required. Reason: Every value might be mapped several times during the process (DB Disk --> DB Memory --> Network --> DB Driver --> ABAP internal). It's easy to save the CPU cycles if you don't need the fields anyway. Be very careful if you SELECT * a table that contains BLOB fields like STRING, this can totally kill your DB performance because the blob contents are usually stored on different pages.
Don't SELECT ... ENDSELECT for small to medium result sets, use SELECT ... INTO TABLE instead.
Reason: SELECT ... INTO TABLE performs a single fetch and doesn't keep the cursor open while SELECT ... ENDSELECT will typically fetch a single row for every loop iteration.
This was a kind of urban myth - there is no performance degradation for using SELECT as a loop statement. However, this will keep an open cursor during the loop which can lead to unwanted (but not strictly performance-related) effects.
For large result sets, use a cursor and an internal table.
Reason: Same as above, and you'll avoid eating up too much heap space.
Don't ORDER BY, use SORT instead.
Reason: Better scalability of the application server.
Be careful with nested SELECT statements.
While they can be very handy for small 'inner result sets', they are a huge performance hog if the nested query returns a large result set.
Measure, Measure, Measure
Never assume anything if you're worried about performance. Create a representative set of test data and run tests for different implementations. Learn how to use ST05 and SAT.
There won't be a way to close your second question "once and for all". First of all, FOR ALL ENTRIES IN 'joins' a database table and an internal (memory) table while JOIN only operates on database tables. Since the database knows nothing about the internal ABAP memory, the FOR ALL ENTRIES IN statement will be transformed to a set of WHERE statements - just try and use the ST05 to trace this. Second, you can't add values from the second table when using FOR ALL ENTRIES IN. Third, be aware that FOR ALL ENTRIES IN always implies DISTINCT. There are a few other pitfalls - be sure to consult the on-line ABAP reference, they are all listed there.
If the number of records in the second table is small, both statements should be more or less equal in performance - the database optimizer should just preselect all values from the second table and use a smart joining algorithm to filter through the first table. My recommendation: Use whatever feels good, don't try to tweak your code to illegibility.
If the number of records in the second table exceeds a certain value, Bad Things [TM] happen with FOR ALL ENTRIES IN - the contents of the table are split into multiple sets, then the query is transformed (see above) and re-run for each set.
Another note: The "Avoid SELECT *" statement is true in general, but I can tell you where it is false.
When you are going to take most of the fields anyway, and where you have several queries (in the same program, or different programs that are likely to be run around the same time) which take most of the fields, especially if they are different fields that are missing.
This is because the App Server Data buffers are based on the select query signature. If you make sure to use the same query, then you can ensure that the buffer can be used instead of hitting the database again. In this case, SELECT * is better than selecting 90% of the fields, because you make it much more likely that the buffer will be used.
Also note that, as of the last version I tested, the ABAP DB layer wasn't smart enough to recognize SELECT A, B as being the same as SELECT B, A. This means you should always list the fields you select in the same order (preferably the table order) to make sure, again, that the data buffer on the application server is well used.
I usually follow the rules stated in this pdf from SAP: "Efficient Database Programming with ABAP"
It shows a lot of tips in optimizing queries.
This question will never be completely answered.
An ABAP statement that accesses the database is interpreted several times by different components of the whole system (SAP and DB). The behavior of each component depends on the component itself, its version and its settings. The main part of the interpretation is done in the DB adapter on the SAP side.
The only viable approach for reaching maximum performance is measurement on the particular system (SAP version, and DB vendor and version).
There are also quite extensive hints and tips in transaction SE30. It even allows you (depending on authorisations) to write code snippets of your own & measure it.
Unfortunately we can't close the "for all entries" vs join debate, as it is very dependent on how your landscape is set up, which database server you are using, the efficiency of your table indexes, etc.
The simplistic answer is: let the DB server do as much as possible. For the "for all entries" vs join question this means join. Except every experienced ABAP programmer knows that it's never that simple. You have to try different scenarios and measure, like vwegert said. Also remember to measure in your live system as well, as sometimes the hardware configuration or dataset is different enough to give entirely different results in your live system than in test.
I usually follow the following conventions:
Never do a SELECT *; select only the required fields.
Never use 'into corresponding table of'; instead, create local structures which have all the required fields.
In the where clause, try to use as many primary key fields as possible.
If the select is made to fetch a single record and all primary key fields are included in the where clause, use SELECT SINGLE; otherwise use SELECT ... UP TO 1 ROWS ... ENDSELECT.
Try to use JOIN statements to connect tables instead of using FOR ALL ENTRIES.
If FOR ALL ENTRIES cannot be avoided, ensure that the internal table is not empty, and delete the duplicate entries to increase performance.
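The last convention, together with the batching behaviour described in an earlier answer, can be pictured with a small sketch. It is written in Python against SQLite, not in ABAP, and only mimics the pattern: the orders table, the function name select_for_all_entries and the batch size are assumptions made for the example.

```python
import sqlite3


def select_for_all_entries(conn, keys, batch_size=500):
    """Fetch rows for the given customer ids in batches, mimicking FOR ALL ENTRIES."""
    unique_keys = sorted(set(keys))   # delete duplicate driver entries first
    if not unique_keys:               # never run the query with an empty driver table
        return []
    rows = []
    for start in range(0, len(unique_keys), batch_size):
        batch = unique_keys[start:start + batch_size]
        placeholders = ", ".join("?" for _ in batch)
        rows.extend(conn.execute(
            f"select customer_id, order_id from orders where customer_id in ({placeholders})",
            batch,
        ))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer_id integer, order_id integer)")
conn.executemany("insert into orders values (?, ?)", [(1, 10), (1, 11), (2, 20), (3, 30)])
found = sorted(select_for_all_entries(conn, [2, 1, 1, 2], batch_size=1))
empty = select_for_all_entries(conn, [])
conn.close()
```

Each batch becomes its own database round trip, which is why a large driver table makes this pattern expensive compared with a join executed entirely inside the database.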
Two more points in addition to the other answers:
Usually you use JOIN for two or more tables in the database, and you use FOR ALL ENTRIES IN to join database tables with a table you have in memory. If you can, JOIN.
Usually the IN operator is more convenient than FOR ALL ENTRIES IN. But the kernel translates IN into a long select statement. The length of such a statement is limited, and you get a dump when it gets too long. In this case you are forced to use FOR ALL ENTRIES IN despite the performance implications.
With in-memory database technologies, it's best if you can finish all data and calculations on the database side with JOINs and database aggregation functions like SUM.
But if you can't, at least try to avoid accessing database in LOOPs. Also avoid reading the database without using indexes, of course.

Slow Performance on Sql Express after inserting big chunks of data

We have noticed that our queries run slower on databases that had big chunks of data added (bulk insert) than on databases that had the data added record by record, with similar amounts of data.
We use SQL Server 2005 Express, and we tried reindexing all indexes without any improvement.
Do you know of some kind of structural problem on the database that can be caused by inserting data in big chunks instead of one by one?
One tip I've seen is to turn off Auto-create stats and Auto-update stats before doing the bulk insert, and afterwards to create statistics manually by one of two methods:
--generate statistics quickly using a sample of data from the table
exec sp_createstats
--generate statistics using a full scan of the table
exec sp_createstats @fullscan = 'fullscan'
You should probably also turn Auto-create and Auto-update stats back on when you're done.
Another option is to check and defrag the indexes after a bulk insert. Check out Pinal Dave's blog post.
Probably SQL Server allocated new disk space in many small chunks. When doing big transactions, it's better to pre-allocate much space in both the data and log files.
That's an interesting question.
I would have guessed that Express and non-Express have the same storage layout, so when you're Googling for other people with similar problems, don't restrict yourself to Googling for problems in the Express version. On the other hand though, bulk insert is a common-place operation and performance is important, so I wouldn't consider it likely that this is a previously-undetected bug.
One obvious question: which is the clustered index? Is the clustered index also the primary key? Is the primary key unassigned when you insert, and therefore initialized by the database? If so then maybe there's a difference (between the two insert methods) in the pattern or sequence of successive values assigned by the database, which affects the way in which the data is clustered, which then affects performance.
Something else: as well as indexes, people say that SQL Server uses statistics (which it creates as a result of running previous queries) to optimize its execution plans. I don't know any details of that, but as well as "reindexing all indexes", check the execution plans of your queries in the two test cases to ensure that the plans are identical (and/or check the associated statistics).
