How to improve performance with MonetDB on OSX? - macos

I am using monetdb on a 16GB Macbook Pro with OSX 10.10.4 Yosemite.
I execute queries with SQLWorkbenchJ (configured with a minimum of 2048M RAM).
I find the performance overall erratic:
performance is acceptable / good with small size tables (<100K rows)
abysmal with tables with many rows: a query joining two tables (8760 rows and 242K rows) with a simple sum took 1h 20m!
My 16GB of memory notwithstanding, in one run I never saw MSERVER5 using more than 35MB of RAM, 450MB in another. On the other hand the time is consumed swapping data onto disk (according to Activity Monitor over 160GB of data!).
There are a number of performance-related issues that I would like to understand better:
I have the impression that MonetDB struggles with understanding how much RAM to use / is available in OSX. How can I "force" MonetDB to use more RAM?
I use MonetDB through R. The MonetDB.R driver converts all the character fields into CLOB. I wonder if CLOBs create memory allocation issues?
I find it difficult to explain the many GBs of writes (as mentioned >150GB!!) even for index creation or temporary results. On the other hand when I create the DB and load the tables overall the DB is <50MB. Should I create an artificial integer key and set it as index?
I join 2 tables on a timestamp field (e.g. "2015/01/01 01:00") that again is seen as a text CLOB by MonetDB / MonetDB.R. Should I just convert it to integer before saving it to MonetDB?
I have configured each table with a primary key, using a field of type integer. MonetDB (as a typical columnar database) doesn't need the user to specify an index. Is there any other way to improve performance?
Any recommendation is welcome.
For clarity the two tables I join have the following layout:
Calendar # classic calendar table with one entry per hour in a year = 8760 rows
Fields: datetime, date, month, weekbyhour, monthbyday, yearbyweek, yearbymonth # all fields are CLOBs as mentioned
Activity # around 200K rows
Fields: company, department, subdepartment, function, subfunction, activityname, activityunits, datetime, duration # all CLOBs except activityunits; datetime refers to when the activity has occurred
I have tried various types of join syntax, but an example would be (`*` used for brevity)
select * from Activity as a, Calendar as b where a.datetime=b.datetime
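One of the options raised above is to convert the text timestamp to an integer before loading it. A minimal Python sketch of that conversion follows, assuming the text layout is always like "2015/01/01 01:00". The `hour_key` helper and the sample rows are hypothetical, and whether an integer key actually speeds up the MonetDB join is not established here.

```python
from datetime import datetime

# Assumed text layout of the datetime column, e.g. "2015/01/01 01:00".
TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M"
EPOCH = datetime(1970, 1, 1)


def hour_key(text: str) -> int:
    """Return whole hours since 1970-01-01 for a timestamp string (hypothetical helper)."""
    parsed = datetime.strptime(text, TIMESTAMP_FORMAT)
    return int((parsed - EPOCH).total_seconds() // 3600)


# Hypothetical rows standing in for the Calendar and Activity tables.
calendar = {hour_key("2015/01/01 01:00"): "2015/01/01"}
activity = [("acme", "2015/01/01 01:00", 3), ("acme", "2015/06/01 09:00", 5)]

# Join on the integer key instead of comparing strings.
joined = [
    (company, units, calendar[hour_key(ts)])
    for company, ts, units in activity
    if hour_key(ts) in calendar
]
```

The same integer could be stored as an extra column in both tables so that the join condition compares integers rather than CLOBs.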

Related

How to avoid data duplicates in ClickHouse

I already read this but I still have questions. I only have one VM with 16 GB of RAM, 4 cores and a disk of 100 GB, with only ClickHouse and a light web api working on it.
I'm storing leaked credentials in a database:
CREATE TABLE credential (
user String,
domain String,
password String,
first_seen Date,
leaks Array(UInt64)
) ENGINE ReplacingMergeTree
PARTITION BY first_seen
ORDER BY user, domain, password, first_seen
It sometimes happens that some credentials appear more than once (inside a file or between many).
My long-term objective is (was) the following:
- when inserting a credential which is already in the database, I want to keep the smaller first_seen and add the new leak id to the field leaks.
I have tried the ReplacingMergeTree engine: I inserted the same data twice ($ cat "data.csv" | clickhouse-client --query 'INSERT INTO credential FORMAT CSV') and then ran OPTIMIZE TABLE credential to force the replacing engine to do its asynchronous job, as described in the documentation. Nothing happens; the data is in the database twice.
So I wonder:
- what did I miss with the ReplacingMergeTree engine?
- how does OPTIMIZE work and why doesn't it do what I was expecting from it?
- is there a real solution for avoiding duplicated data on a single instance of ClickHouse?
I have already tried to do it manually. My problem is that I have 4.5 billion records in my database, and identifying duplicates inside a 100k-entry sample takes almost 5 minutes with the following query: SELECT DISTINCT user, domain, password, count() as c FROM credential WHERE has(leaks, 0) GROUP BY user, domain, password HAVING c > 1. This query obviously does not work on the 4.5 billion entries, as I do not have enough RAM.
Any ideas will be tried.
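For reference, the merge rule described above (keep the smaller first_seen, add the new leak id to leaks) can be written down in plain Python. This is only a sketch of the intended semantics with made-up sample values. The `merge_credential` function is hypothetical, and an in-memory dict is obviously not a way to handle 4.5 billion rows.

```python
from datetime import date


def merge_credential(store, user, domain, password, first_seen, leak_id):
    """Insert a credential, or merge it into the existing entry (hypothetical helper)."""
    key = (user, domain, password)
    if key not in store:
        store[key] = {"first_seen": first_seen, "leaks": [leak_id]}
        return
    entry = store[key]
    # Keep the earliest date on which the credential was seen.
    entry["first_seen"] = min(entry["first_seen"], first_seen)
    # Record the new leak id only once.
    if leak_id not in entry["leaks"]:
        entry["leaks"].append(leak_id)


store = {}
merge_credential(store, "bob", "example.com", "hunter2", date(2019, 5, 1), 7)
merge_credential(store, "bob", "example.com", "hunter2", date(2019, 3, 1), 9)
```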
Multiple things are going wrong here:
You partition very granularly. You should partition by something coarser, such as a month of data. As it stands, ClickHouse has to scan lots of files.
You don't provide the table engine with a version. The problem here is that ClickHouse is not able to find out which row should replace the other.
I suggest you use the "version" parameter of the ReplacingMergeTree, as it allows you to provide an incremental version as a number or, if this works better for you, the current DateTime (where the last DateTime always wins).
You should never design your solution to require OPTIMIZE to be called to make your data consistent in your result sets; it is not designed for this.
ClickHouse always allows you to write a query where you can provide (eventual) consistency without using OPTIMIZE beforehand.
The reason for avoiding OPTIMIZE, besides it being really slow and heavy on your DB, is that you could end up in race conditions, where other clients of the database (or replicating ClickHouse nodes) could invalidate your data between the moment OPTIMIZE finishes and the moment the SELECT is done.
Bottom line, as a solution:
What you should do here is add a version column. Then, when inserting rows, insert the current timestamp as the version.
Then select, for each row, only the one that has the highest version in your result, so that you do not depend on OPTIMIZE for anything other than garbage collection.
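The "highest version wins" selection can be sketched with generic SQL. The example below uses Python's built-in sqlite3 module as a stand-in, not ClickHouse. The integer `version` column, the sample rows, and the choice of (user, domain, password) as the key are assumptions made for illustration. In ClickHouse itself the same result is often obtained with `argMax` or the `FINAL` modifier, and this sketch does not rely on either.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for ClickHouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE credential ("
    "user TEXT, domain TEXT, password TEXT, first_seen TEXT, version INTEGER)"
)
rows = [
    ("bob", "example.com", "hunter2", "2019-05-01", 1),
    ("bob", "example.com", "hunter2", "2019-03-01", 2),  # newer version of the same key
    ("eve", "example.org", "letmein", "2019-04-01", 1),
]
conn.executemany("INSERT INTO credential VALUES (?, ?, ?, ?, ?)", rows)

# For each (user, domain, password), keep only the row with the highest version.
latest = conn.execute(
    """
    SELECT c.user, c.domain, c.password, c.first_seen, c.version
    FROM credential AS c
    JOIN (
        SELECT user, domain, password, MAX(version) AS version
        FROM credential
        GROUP BY user, domain, password
    ) AS m
      ON c.user = m.user AND c.domain = m.domain
     AND c.password = m.password AND c.version = m.version
    ORDER BY c.user
    """
).fetchall()
```

With a query of this shape, the duplicates that are still physically stored never reach the result set, whether or not a background merge has run.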

Does the number of columns in a Vertica table impact query performance?

We are working with a Vertica 8.1 table containing 500 columns and 100 000 rows.
The following query will take around 1.5 seconds to execute, even when using the vsql client straight on one of the Vertica cluster nodes (to eliminate any network latency issue) :
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource acquisition. I can't understand where the rest of the time is spent.
If I then export to a new table only the columns used by the query, and run the query on this new, smaller, table, I get a blazing fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of these columns are not referenced. Am I wrong? If so, why is it so?
Thanks in advance
It sounds like your query is running against the (default) superprojection that includes all columns. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses that instead of the superprojection. (It's a little more complicated than that, because physical location is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve it over time.
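The basic idea of "use a projection that has all the columns the query needs" can be illustrated with a toy model in Python. This is not Vertica's optimizer logic. The `pick_projection` function and the projection names are hypothetical, and real projection selection also weighs factors such as sort order and physical location.

```python
def pick_projection(projections, needed_columns):
    """Return the name of the narrowest projection that covers the query (toy model)."""
    needed = set(needed_columns)
    candidates = [
        (len(columns), name)
        for name, columns in projections.items()
        if needed <= set(columns)  # projection must contain every needed column
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical projections: the 500-column superprojection and a narrow one.
projections = {
    "my_table_super": [f"COL_{i}" for i in range(1, 501)],
    "my_table_narrow": ["COL_26", "COL_132"],
}
chosen = pick_projection(projections, ["COL_132", "COL_26"])
```

Here `chosen` is the narrow projection, because it contains both columns referenced by the COUNT(*) query above.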
I was running Vertica 8.1.0-1. It seems the issue was a bug in Vertica's query planning phase that caused a performance degradation. It was fixed in versions >= 8.1.1:
[https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm]
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.

SQL Server Full Table Scan and Load

For the purpose of this question, let's pretend I have the following table:
Transaction:
Id
ProductId
ProductName
City
State
Country
UnitCost
SellAmount
NumberOfTimesPurchased
Profit (NumberOfTimesPurchased * (SellAmount - UnitCost))
Basically, a single de-normalized table with a million plus rows in it. It is important to note that only two columns will ever be updated: Profit and NumberOfTimesPurchased. When a sale is made, the NumberOfTimesPurchased will be updated and the new profit amount will be re-calculated.
Now, I need to do some minimal reporting on this table, which consists of queries that aggregate and group. As an example:
SELECT
    City, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    [Transaction]
GROUP BY
    City;
SELECT
    State, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    [Transaction]
GROUP BY
    State;
SELECT
    Country, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    [Transaction]
GROUP BY
    Country;
SELECT
    ProductId, ProductName, AVG(UnitCost), AVG(SellAmount),
    SUM(NumberOfTimesPurchased), AVG(Profit)
FROM
    [Transaction]
GROUP BY
    ProductId, ProductName;
These queries are quick: ~1 second. However, I've noticed that under load, performance significantly drops (from 1 second up to a minute when there are 20+ concurrent requests), and I'm guessing the reason is that each query performs a full table scan.
I've attempted to use indexed views for each query, however my update statement performance takes a beating since each view needs to be rebuilt. On the same note, I've attempted to create covering indexes for each query, but again my update statement performance is not acceptable.
Assuming full table scans are the culprit, do I have any realistic options to get the query time down while keeping update performance at acceptable levels?
Note that I cannot use column store indexes (I'm using the cheaper version of Azure SQL Database). I'd also like to stay away from any sort of roll-up implementation, as I need the data available immediately.
Finally - the example above is not a completely accurate representation of my table. I have 20 or so different columns that can be 'grouped', and 6 columns that can be updated. No inserts or deletes.
Because there are no WHERE clauses on your queries, the database engine can do nothing but a table scan (or a clustered index scan, which is really the same thing). If there were covering indexes containing all the columns from your query, then the engine would prefer those. If your real queries have WHERE clauses, then appropriate indexing with those columns as the leading columns of the index might help.
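To see what "the engine would prefer a covering index" looks like in practice, here is a small sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The trimmed-down table definition and the index name `ix_city_cover` are assumptions for illustration, and SQLite's planner is not SQL Server's, so this only demonstrates the general principle.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Transaction" ('
    "Id INTEGER PRIMARY KEY, City TEXT, UnitCost REAL, SellAmount REAL, "
    "NumberOfTimesPurchased INTEGER, Profit REAL)"
)
# Hypothetical covering index: it holds every column the City report reads.
conn.execute(
    'CREATE INDEX ix_city_cover ON "Transaction" '
    "(City, UnitCost, SellAmount, NumberOfTimesPurchased, Profit)"
)

query = (
    "SELECT City, AVG(UnitCost), AVG(SellAmount), "
    "SUM(NumberOfTimesPurchased), AVG(Profit) "
    'FROM "Transaction" GROUP BY City'
)
# The fourth field of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
```

The plan reports a scan of the covering index rather than of the table. In SQL Server the non-key columns would more likely go into an INCLUDE clause, and, as the question notes, every such index has to be maintained whenever Profit or NumberOfTimesPurchased is updated.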
But I think your problem lies elsewhere. As far as concurrency goes you haven't put enough money in the meter. According to the main service tiers doc, the Basic tier for Azure SQL Database is for:
... supporting typically one single
active operation at a given time. Examples include databases used for
development or testing, or small-scale infrequently used applications.
Therefore you might want to think about splashing out for Premium edition to support both your concurrency requirement and columnstore indexes, which are perfectly suited to this type of query. Just for fun, I created a test-rig based on AdventureWorksDW2012 to try and recreate your problem which is here. Query performance was atrocious (> 20 secs). I'd be surprised if you weren't getting DTU warnings on your portal:
An upgrade to Standard (S0-S2) did boost performance so you should experiment. You could look at scaling up for busy query times and down when not required.
This table also looks a bit like a fact table, so you might want to consider refactoring this as a fact / dimensional model then use Azure Analysis Services on top to bring that sub-second performance.
Coincidentally there is a feedback item you can vote for to bring columnstore to Standard tier:
https://feedback.azure.com/forums/217321-sql-database/suggestions/6878001-make-sql-column-store-feature-available-for-standa
Recent comments suggest it is "in the work queue" as of May 2017.

select & update in both live & archive tables in the same schema

The application that I am working on currently has an archive logic where all the records older than 6 months will be moved to history tables in the same schema, but on a different table space. This is achieved using a stored procedure which is being executed daily.
For ex. TABLE_A (live, latest 6 months) ==> TABLE_A_H (archive, older than 6 months, up to 8 years).
So far no issues. Now the business has come up with a new requirement where the archived data should also be available for selects & updates. Updates can happen even for data that is a year old.
selects could be direct like,
select * from TABLE_A where id = 'something'
Or it could be open-ended query like,
select * from TABLE_A where created_date < 'XYZ'
Updates are usually for specific records.
These queries are exposed as REST services to the clients. There are possibilities of junk/null values (no way the application can sanitize the input).
The current snapshot of the DB is
PARENT_TABLE (10M records, 10-15K for each record)
CHILD_TABLE_ONE (28M records, less than 1K for each record)
CHILD_TABLE_TWO (25M records, less than 1K for each record)
CHILD_TABLE_THREE (46M records, less than 1K for each record)
CHILD_TABLE_FOUR (57M records, less than 1K for each record)
Memory is not a constraint - I can procure additional 2 TB of space if needed.
The problem is: how do I keep the response time low when it accesses the archive tables?
What are all the aspects that I should consider when building a solution?
Solution1: For direct select/update, check if the records are available in live tables. If present, perform the operation on the live tables. If not, perform the operation on the archive tables.
For open ended queries, use UNION ???
Solution2: Use month-wise partitions and keep all 8 years of data in a single set of tables? Does Oracle handle 150+ million records in a single table efficiently for select/update?
Solution3: Use NoSQL like Couchbase? Not a feasible solution at the moment because of the infra/cost involved.
Solution4: ???
Tech Stack: Oracle 11G, J2EE Application using Spring/Hibernate (Java 1.6) hosted on JBoss.
Your response will be very much appreciated.
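For concreteness, Solution 1 can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for Oracle. The two-column table layout, the sample rows, and the function names are hypothetical. UNION ALL is used instead of UNION on the assumption that a record lives in exactly one of the two tables.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Oracle.
conn = sqlite3.connect(":memory:")
for table in ("TABLE_A", "TABLE_A_H"):
    conn.execute(f"CREATE TABLE {table} (id TEXT PRIMARY KEY, created_date TEXT)")
conn.execute("INSERT INTO TABLE_A VALUES ('live-1', '2017-06-01')")
conn.execute("INSERT INTO TABLE_A_H VALUES ('old-1', '2012-01-15')")


def find_by_id(conn, record_id):
    """Direct lookup: try the live table first, then fall back to the archive."""
    for table in ("TABLE_A", "TABLE_A_H"):
        row = conn.execute(
            f"SELECT id, created_date FROM {table} WHERE id = ?", (record_id,)
        ).fetchone()
        if row is not None:
            return table, row
    return None


def find_created_before(conn, cutoff):
    """Open-ended query: combine live and archive rows with UNION ALL."""
    return conn.execute(
        "SELECT id, created_date FROM TABLE_A WHERE created_date < ? "
        "UNION ALL "
        "SELECT id, created_date FROM TABLE_A_H WHERE created_date < ? "
        "ORDER BY created_date",
        (cutoff, cutoff),
    ).fetchall()
```

Returning the table name from `find_by_id` lets a subsequent update target the table that actually holds the record.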
If I were you, I'd go with Solution 2, and ensure that you have the relevant indexes available for the types of queries you expect to be run.
Partitioning by month means that you can take advantage of partition pruning, assuming that the queries involve the column that's being partitioned on.
It also means that your existing code does not need to be amended in order to select or update the archived data.
You'll have to set up housekeeping to add new partitions though - unless you go for interval partitioning, but that has its own set of gotchas.
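Partition pruning can be pictured with a toy model in Python: rows are bucketed by (year, month), and a query on the partitioning column skips every bucket that cannot contain a match. This is only an illustration of the idea, not how Oracle implements it. The function names and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Toy "table": one list of rows per monthly partition, keyed by (year, month).
partitions = defaultdict(list)


def insert(row_id, created):
    """Place a row in the partition for its creation month."""
    partitions[(created.year, created.month)].append((row_id, created))


def select_created_before(cutoff):
    """Return matching rows and the number of partitions actually scanned."""
    scanned, result = 0, []
    for (year, month), rows in partitions.items():
        if (year, month) > (cutoff.year, cutoff.month):
            continue  # pruned: every row in this partition is too recent
        scanned += 1
        result.extend(r for r in rows if r[1] < cutoff)
    return result, scanned


insert(1, date(2012, 1, 15))
insert(2, date(2016, 7, 3))
insert(3, date(2017, 6, 1))
rows, scanned = select_created_before(date(2013, 1, 1))
```

Only one of the three partitions is scanned for this query. A query that does not filter on the partitioning column, such as the lookup by id, gets no such benefit and depends on an index instead.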

Boost simple SELECT performance of SQLite database

I'm using a SQLite database (1GB+) in which there's only one table of stacked financial tick time series. I'm using RSQLite package to connect to the database. My situation is very simple: select a subset of data from one date to the other, e.g. from 20140101 to 20140930. Therefore the query is simple too:
SELECT * FROM DATA WHERE YMD BETWEEN 20140101 AND 20140930
The data is stored on a SSD but the reading time looks still slow. Are there some ways or tips to boost the performance of such simple subsetting query?
The database has 14 columns and on automatic index.
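One thing worth checking for a range query like this is whether it can use an index on YMD. The sketch below uses Python's built-in sqlite3 module with a tiny stand-in table. The PRICE column, the sample rows, and the index name `idx_data_ymd` are hypothetical, and whether an index helps on the real 1GB+ file depends on how large a fraction of the rows the date range selects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the real 14-column table; PRICE is a hypothetical column.
conn.execute("CREATE TABLE DATA (YMD INTEGER, PRICE REAL)")
conn.executemany(
    "INSERT INTO DATA VALUES (?, ?)",
    [(20131231, 1.0), (20140101, 2.0), (20140930, 3.0), (20141001, 4.0)],
)
# Explicit index on the column used in the BETWEEN filter.
conn.execute("CREATE INDEX IF NOT EXISTS idx_data_ymd ON DATA (YMD)")

query = "SELECT * FROM DATA WHERE YMD BETWEEN 20140101 AND 20140930"
# The fourth field of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
rows = conn.execute(query).fetchall()
```

If the plan on the real database shows a full scan of DATA instead of a search using an index, that would be the first thing to look at.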

Resources