I'm using a SQLite database (1 GB+) that contains a single table of stacked financial tick time series, and I connect to it with the RSQLite package. My use case is very simple: select a subset of data from one date to another, e.g. from 20140101 to 20140930. The query is therefore simple too:
SELECT * FROM DATA WHERE YMD BETWEEN 20140101 AND 20140930
The data is stored on an SSD, but reads still seem slow. Are there any ways or tips to boost the performance of such a simple subsetting query?
The database has 14 columns and an automatic index.
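To make the question concrete, this is the kind of explicit index I am wondering about (a sketch only; idx_ymd is just a placeholder name, and I have not confirmed it is the right approach):

CREATE INDEX IF NOT EXISTS idx_ymd ON DATA (YMD);  -- explicit index on the filter column
ANALYZE;                                           -- refresh the query planner's statistics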
I am using Microsoft Excel's Power Query to pull information directly from two separate data sources (IBM DB2 and Teradata) and merge them together in an Excel worksheet. The results of the first query, from DB2, are only around 300 rows, and I want to return data from the Teradata table only where it matches those 300 rows (a left join). The Teradata table is very large (more than 5 million rows). When I build my query in Excel's Power Query, it wants to pull the entire Teradata table first before joining it with the 300 criteria rows, and due to the size of the Teradata table, it fails.
Is there a way to set things up so that the initial Power Query pull from the Teradata table incorporates the results of the first query, so that it only processes and returns the matching information?
Thank you!
For a query like that, with two different systems as the data sources, all the data will have to be pulled into Excel so that Power Query can perform the join or filter.
With SQL data sources, Power Query can use query folding to create a Select statement that incorporates filters and joins, but that can not be applied when the data is on two totally separate systems. In that case, Excel is the tool that performs the selection, and in order to do that, all the data has to be in Excel first.
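For illustration, when both tables sit on the same SQL source, the folded statement that Power Query sends looks roughly like this (table and column names here are made up):

SELECT t.*
FROM   big_teradata_table t
JOIN   small_criteria_rows c ON c.key_col = t.key_col;

No single statement like that can be sent when one table lives in DB2 and the other in Teradata, which is why everything gets downloaded first.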
If that is too big for Excel to handle, you could try Power BI and see if that makes a difference when the data is refreshed through a data gateway.
I am testing CrateDB with a data set of 80 million events sent from a web app, both as a normalized, relational solution and as a denormalized, single-table solution.
I imported all 80 million denormalized events into a table, and ran the following aggregation query:
select productName, SUM(elapsed)/60 as total_minutes from denormalized
where country_code = 'NL' AND eventType = 'mediaPlay'
group by productName
order by total_minutes desc
limit 1000;
and the query took 0.009 seconds. Wowza! CrateDB is blazing fast!
Then I imported the sessionwide docs into one table called "sessions", and all the individual event docs in each session into another table called "events", and ran the following query:
select e.productName, SUM(e.elapsed)/60 as total_minutes from sessions s
join events e ON e.sessionGroup = s.sessionGroup
where s.country_code = 'NL' AND e.eventType = 'mediaPlay'
group by e.productName
order by total_minutes desc
limit 1000;
which took 21 seconds.
My question is, is there any way to get faster relational performance, maybe by creating indexes, or changing the query somehow?
Tangential thought:
We have been using Elasticsearch for analytics, obviously denormalizing the data, and it's plenty fast, but CrateDB seems to offer everything Elasticsearch does (fast queries on denormalized data, clustering, dynamic schema, full text search), plus the additional advantages of:
better SQL support
the option to deploy relational solutions on small data sets (wonderful to standardize on one DB, no context-switching or ramp up for developers who know SQL).
What CrateDB version are you using? If it is < 3.0, then upgrading will probably speed up the join query a lot; see https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-three/.
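If you are not sure which version the cluster is running, something like this should show it (going from memory on the sys.nodes columns, so treat it as a sketch):

select name, version['number'] from sys.nodes;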
We are working with a Vertica 8.1 table containing 500 columns and 100,000 rows.
The following query takes around 1.5 seconds to execute, even when using the vsql client directly on one of the Vertica cluster nodes (to eliminate any network latency issue):
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource acquisition. I can't understand where the rest of the time is spent.
If I then export to a new table only the columns used by the query, and run the query on this new, smaller, table, I get a blazing fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of those columns are not referenced. Am I wrong? If so, why is that?
Thanks in advance
It sounds like your query is running against the (default) superprojection that includes all of the table's columns. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses that instead of the superprojection. (It's a little more complicated than that, because physical location is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve it over time.
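For example, a query-specific projection for the query above might look roughly like this (a sketch; the projection name, sort order, and segmentation clause are just one possible choice):

CREATE PROJECTION MY_TABLE_P_COLS (COL_26, COL_132)
AS SELECT COL_26, COL_132 FROM MY_TABLE
ORDER BY COL_26
SEGMENTED BY HASH(COL_26) ALL NODES;

SELECT REFRESH('MY_TABLE');  -- populate the new projection with the existing data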
I was running Vertica 8.1.0-1; it seems the issue was a bug in the query planning phase that caused a performance degradation. It was fixed in versions >= 8.1.1:
https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.
I am using monetdb on a 16GB Macbook Pro with OSX 10.10.4 Yosemite.
I execute queries with SQLWorkbenchJ (configured with a minimum of 2048M RAM).
I find the performance overall erratic:
performance is acceptable / good with small tables (<100K rows)
abysmal with tables with many rows: a query with a join of two tables (8760 rows and 242K rows) and a simple sum took 1h 20m!!
My 16 GB of memory notwithstanding, in one run I never saw mserver5 use more than 35 MB of RAM, and 450 MB in another. Instead, the time is spent swapping data to disk (over 160 GB of writes, according to Activity Monitor!).
There are a number of performance-related issues that I would like to understand better:
I have the impression that MonetDB struggles to work out how much RAM is available / should be used on OSX. How can I "force" MonetDB to use more RAM?
I use MonetDB through R. The MonetDB.R driver converts all the character fields into CLOB. I wonder if CLOBs create memory allocation issues?
I find it difficult to explain the many GBs of writes (as mentioned, >150 GB!) even for index creation or temporary results. On the other hand, when I create the DB and load the tables, the whole DB is <50 MB. Should I create an artificial integer key and set it as an index?
I join the 2 tables on a timestamp field (e.g. "2015/01/01 01:00") that, again, is seen as a text CLOB by MonetDB / MonetDB.R. Should I just convert it to an integer before saving it to MonetDB? (A sketch of what I have in mind follows after the example query below.)
I have configured each table with a primary key, using a field of type integer. MonetDB (as a typical columnar database) doesn't need the user to specify an index. Is there any other way to improve performance?
Any recommendation is welcome.
For clarity, the two tables I join have the following layouts:
Calendar # classic calendar table with one entry per hour in a year = 8760 rows
Fields: datetime, date, month, weekbyhour, monthbyday, yearbyweek, yearbymonth # all fields are CLOBs as mentioned
Activity # around 200K rows
Fields: company, department, subdepartment, function, subfunction, activityname, activityunits, datetime, duration # all CLOBs except activityunits; datetime refers to when the activity has occurred
I have tried various types of join syntax, but an example would be (`*` used for brevity):
select * from Activity as a, Calendar as b where a.datetime=b.datetime
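To make the timestamp question above concrete, this is the kind of conversion I have in mind, either to a proper TIMESTAMP or to an integer key (a sketch only; the CAST assumes the text is already in ISO 'YYYY-MM-DD HH:MM:SS' form, which mine is not, so I would have to reformat the strings, e.g. on the R side, before loading):

ALTER TABLE Calendar ADD COLUMN datetime_ts TIMESTAMP;
UPDATE Calendar SET datetime_ts = CAST(datetime AS TIMESTAMP);
-- repeat for Activity, then join on the typed columns instead of the CLOBs:
select * from Activity as a, Calendar as b where a.datetime_ts = b.datetime_ts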
The application I am working on currently has archive logic where all records older than 6 months are moved to history tables in the same schema, but in a different tablespace. This is done by a stored procedure that runs daily.
For example, TABLE_A (live, latest 6 months) ==> TABLE_A_H (archive, older than 6 months, up to 8 years).
So far no issues. Now the business has come up with a new requirement where the archived data should also be available for selects & updates. The updates can happen even on year-old data.
Selects could be direct, like
select * from TABLE_A where id = 'something'
Or they could be open-ended queries, like
select * from TABLE_A where created_date < 'XYZ'
Updates are usually for specific records.
These queries are exposed to clients as REST services. Junk/null values are possible (there is no way for the application to sanitize the input).
The current snapshot of the DB is
PARENT_TABLE (10M records, 10-15 KB per record)
CHILD_TABLE_ONE (28M records, less than 1 KB per record)
CHILD_TABLE_TWO (25M records, less than 1 KB per record)
CHILD_TABLE_THREE (46M records, less than 1 KB per record)
CHILD_TABLE_FOUR (57M records, less than 1 KB per record)
Storage is not a constraint - I can procure an additional 2 TB of space if needed.
The problem is: how do I keep the response time low when queries hit the archive tables?
What are all the aspects that I should consider when building a solution?
Solution1: For direct select/update, check if the records are available in live tables. If present, perform the operation on the live tables. If not, perform the operation on the archive tables.
For open-ended queries, use UNION ??? (see the rough sketch after Solution4 below).
Solution2: Use month-wise partitions and keep all 8 years of data in a single set of tables? Does Oracle handle selects/updates on 150+ million records in a single table efficiently?
Solution3: Use a NoSQL store like Couchbase? Not feasible at the moment because of the infrastructure/cost involved.
Solution4: ???
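The rough shape of the UNION I had in mind for Solution1's open-ended case (a sketch only, ignoring column lists and pagination):

select * from TABLE_A   where created_date < 'XYZ'
union all
select * from TABLE_A_H where created_date < 'XYZ'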
Tech Stack: Oracle 11G, J2EE Application using Spring/Hibernate (Java 1.6) hosted on JBoss.
Your response will be very much appreciated.
If I were you, I'd go with Solution 2, and ensure that you have the relevant indexes available for the types of queries you expect to be run.
Partitioning by month means that you can take advantage of partition pruning, assuming that the queries involve the column that's being partitioned on.
It also means that your existing code does not need to be amended in order to select or update the archived data.
You'll have to set up housekeeping to add new partitions though - unless you go for interval partitioning, but that has its own set of gotchas.
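A minimal sketch of what month-wise range partitioning could look like for one of the tables (table and column names are taken from the question; everything else, including the data types, is assumed):

CREATE TABLE table_a (
  id            VARCHAR2(50) NOT NULL,
  created_date  DATE         NOT NULL
  -- ... remaining columns ...
)
PARTITION BY RANGE (created_date) (
  PARTITION p_2014_01 VALUES LESS THAN (DATE '2014-02-01'),
  PARTITION p_2014_02 VALUES LESS THAN (DATE '2014-03-01')
  -- ... one partition per month; the housekeeping job adds new ones ...
);

Queries that filter on created_date can then prune to the relevant partitions, while lookups by id alone would rely on an index on id (global, or local if the partition key can be included in the predicate).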