Why does it take longer for SAS to create a dataset from a data step view using, for example, sashelp.vcolumn versus the equivalent SQL table dictionary.columns?
I did a test using fullstimer and it seems to confirm my suspicion of performance differences.
options fullstimer;

data test1;
    set sashelp.vcolumn;
    where libname = 'SASHELP' and
          memname = 'CLASS' and
          memtype = 'DATA';
run;

proc sql;
    create table test2 as
    select *
    from dictionary.columns
    where libname = 'SASHELP' and
          memname = 'CLASS' and
          memtype = 'DATA';
quit;
An excerpt from the log:
NOTE: There were 5 observations read from the data set SASHELP.VCOLUMN.
WHERE (libname='SASHELP') and (memname='CLASS') and (memtype='DATA');
NOTE: The data set WORK.TEST1 has 5 observations and 18 variables.
NOTE: DATA statement used (Total process time):
real time 0.67 seconds
user cpu time 0.23 seconds
system cpu time 0.23 seconds
memory 3820.75k
OS Memory 24300.00k
Timestamp 04/13/2015 09:42:21 AM
Step Count 5 Switch Count 0
NOTE: Table WORK.TEST2 created, with 5 rows and 18 columns.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
memory 3267.46k
OS Memory 24300.00k
Timestamp 04/13/2015 09:42:21 AM
Step Count 6 Switch Count 0
The memory used is a little higher for the SASHELP view, but the difference isn't huge. Note the time, though: it's about 22 times longer using SASHELP.VCOLUMN than the SQL dictionary table. Surely that can't just be due to the relatively small difference in memory usage.
At @Salva's suggestion, I resubmitted the code in a new SAS session, this time running the SQL step before the data step. The memory and time differences are even more pronounced:
                | sql       | sashelp
----------------+-----------+-----------
real time       | 0.28 sec  | 1.84 sec
user cpu time   | 0.00 sec  | 0.25 sec
system cpu time | 0.00 sec  | 0.24 sec
memory          | 3164.78k  | 4139.53k
OS Memory       | 10456.00k | 13292.00k
Step Count      | 1         | 2
Switch Count    | 0         | 0
Some (if not all) of this is the difference in overhead between PROC SQL and the DATA step. For example:
proc sql;
    create table test2 as
    select *
    from sashelp.vcolumn
    where libname = 'SASHELP' and
          memname = 'CLASS' and
          memtype = 'DATA';
quit;
This is also very fast.
The SAS documentation page about DICTIONARY tables gives some information that is likely the main explanation:
When querying a DICTIONARY table, SAS launches a discovery process
that gathers information that is pertinent to that table. Depending on
the DICTIONARY table that is being queried, this discovery process can
search libraries, open tables, and execute views. Unlike other SAS
procedures and the DATA step, PROC SQL can mitigate this process by
optimizing the query before the discovery process is launched.
Therefore, although it is possible to access DICTIONARY table
information with SAS procedures or the DATA step by using the SASHELP
views, it is often more efficient to use PROC SQL instead.
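As a side note (not from the quoted documentation), you can see exactly which columns a dictionary table exposes, including the libname, memname, and memtype columns used in the WHERE clauses above, by describing it in PROC SQL; a minimal sketch:

/* Writes the definition of DICTIONARY.COLUMNS to the SAS log. */
proc sql;
    describe table dictionary.columns;
quit;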
In my experience, using the SASHELP views is slower than using PROC DATASETS. This is even more noticeable if you have a lot of libraries assigned, especially external ones:
10 proc datasets lib=sashelp noprint;
11 contents data=class out=work.test2;
12 quit;
NOTE: The data set WORK.TEST2 has 5 observations and 40 variables.
NOTE: PROCEDURE DATASETS used (Total process time):
real time 0.01 seconds
user cpu time 0.00 seconds
system cpu time 0.01 seconds
memory 635.12k
OS Memory 9404.00k
Timestamp 14.04.2015 kl 10.22
Related
I have a 160 GB SQLite database. There is an index on the id and stage columns of table hwmp.
CREATE INDEX idx_hwmp_id ON hwmp (id, stage);
When I do a count of rows the query returns in 0.09 seconds.
sqlite> select count (*) from hwmp where id = 2000 and stage = 4;
59397
Run Time: real 0.091 user 0.000074 sys 0.080494
However, for a select of all columns, the real time is 85 seconds. The user and system time combined are only 2.5 seconds. Why would the real time be so high?
select * from hwmp where id = 2000 and stage = 4;
Run Time: real 85.420 user 0.801639 sys 1.754250
How can I fix it? Another query, on a separate 300 MB sqlite3 database, used to return in 20 ms; today it was taking 652 ms.
Run Time: real 0.652 user 0.018766 sys 0.010595
There is something wrong with the Linux environment today. I downloaded the same SQLite database to my Mac and it ran quickly.
Run Time: real 0.028 user 0.005990 sys 0.010420
It is using the index:
sqlite> explain query plan select * from hwmp where id = 78 and stage = 4;
QUERY PLAN
`--SEARCH hwmp USING INDEX idx_hwmp_id (id=? AND stage=?)
Run Time: real 0.005 user 0.000857 sys 0.000451
The relevant setting is pragma cache_size = 200000, i.e. 200,000 pages of 4,096 bytes. After setting that, the first run of the query takes approximately 3 s and the second run takes 0.28 s. Phew.
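For reference, a minimal sketch of applying that setting in a session (the 4096-byte page size is an assumption; check it first and size the cache to the RAM you can spare):

PRAGMA page_size;            -- check the actual page size of this database
PRAGMA cache_size = 200000;  -- 200000 pages * 4096 bytes is roughly 780 MB of page cache
SELECT * FROM hwmp WHERE id = 2000 AND stage = 4;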
The cache setting improved the performance for a while. We are working off an AWS Linux VM with an EBS SSD attached, and there seems to be a problem in that environment as well: the query times on my Mac are about 6.3 times faster than in the AWS Linux / EBS environment.
I have created multiple posts in the forum about my performance problem, but now that I have run some tests and gathered all the information that is needed, I'm creating this post.
I have performance issues with two big tables that are located on a remote Oracle database. I'm running the query:
insert into local_postgresql_table select * from oracle_remote_table;
The first table has 45M records and its size is 23 GB. Importing the data from the remote Oracle database takes 1 hour and 38 minutes. After that I create 13 regular indexes on the table, at about 10 minutes per index, which is 2 hours and 10 minutes in total.
The second table has 29M records and its size is 26 GB. Importing the data from the remote Oracle database takes 2 hours and 30 minutes. Creating the indexes takes 1 hour and 30 minutes (single-column indexes take about 5 minutes each, multi-column indexes about 11 minutes each).
These operations are very problematic for me and I'm searching for a way to improve the performance. The parameters I have set:
min_parallel_relation_size = 200MB
max_parallel_workers_per_gather = 5
max_worker_processes = 8
effective_cache_size = 2500MB
work_mem = 16MB
maintenance_work_mem = 1500MB
shared_buffers = 2000MB
RAM : 5G
CPU CORES : 8
- I tried running select count(*) on the table in both Oracle and PostgreSQL; the running times are almost equal.
- Before importing the data I drop the indexes and the constraints.
- I tried copying a 23 GB file from the Oracle server to the PostgreSQL server; it took 12 minutes.
Please advise how I can continue. How can I improve this operation?
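One adjustment commonly tried for this kind of bulk load (an illustration, not something from the original post) is to give the session that builds the indexes more memory and to build independent indexes from separate sessions so they run in parallel; a sketch with a hypothetical index name and column:

-- Session-level setting for the index build only; assumes the RAM is available.
SET maintenance_work_mem = '1500MB';
-- Hypothetical index; running several such statements from separate sessions
-- builds independent indexes concurrently instead of one after another.
CREATE INDEX idx_local_col1 ON local_postgresql_table (col1);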
We are analyzing SQL statements on an Oracle 12c database. We noticed that the following statement got faster after being run several times. How can it be explained that it improves when executed a second and third time?
SELECT COUNT (*)
FROM asset
WHERE ( ( (status NOT IN ( 'x1', 'x2', 'x3'))
AND ( (siteid = 'xxx')))
AND (EXISTS
(SELECT siteid
FROM siteauth a, groupuser b
WHERE a.groupname = b.groupname
AND b.userid = 'xxx'
AND a.siteid = asset.siteid)))
AND ( (assetnum LIKE '5%'));
First run: 24 sec.
Second run: 17 sec.
Third run: 7 sec.
Fourth run: 7 sec.
Tuned by using the result cache: 0.003 sec.
Oracle does not cache query results by default, but it does cache the data blocks used by the query. Also, 12c has features such as "adaptive execution plans" and "cardinality feedback" which can cause execution plan changes between executions even if table statistics were not recalculated.
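For reference, the result cache mentioned in the question is usually requested per statement with the RESULT_CACHE hint; a simplified sketch (the EXISTS subquery from the original statement is omitted for brevity):

-- The hint asks Oracle to cache the final result set of this query, so repeat
-- executions return it directly as long as the underlying data is unchanged.
SELECT /*+ RESULT_CACHE */ COUNT(*)
FROM asset
WHERE status NOT IN ('x1', 'x2', 'x3')
  AND siteid = 'xxx'
  AND assetnum LIKE '5%';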
Oracle fetches data from disk into memory. The second time you run the query, the data is found in memory, so no disk reads are necessary, resulting in faster query execution.
The database is "warmed up".
We are currently running Oracle 11g and I am looking into whether we need to gather statistics after a large import. We have statistics_level set to 'TYPICAL'. Based on this, I'm thinking that we do NOT need to update statistics:
Starting with Oracle Database 11g, the MONITORING and NOMONITORING
keywords have been deprecated and statistics are collected
automatically.
https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables005.htm
However, after creating my database and running my modest import (hundreds of thousands to millions of records in a handful of tables, plus the creation of a number of indexes), all of the tables affected by the import show null for last_analyzed and stale_stats in the query below.
select
table_name,
stale_stats,
last_analyzed
from
dba_tab_statistics
where
owner = 'MY_SCHEMA'
order by
last_analyzed desc, table_name asc
;
Should I expect certain queries to have poor performance in this state?
Should I expect statistics gathering to eventually run and last_analyzed and stale_stats to eventually be populated (the documentation suggests that these values are updated about every three hours by default)?
It has been my experience that for moderately sized databases (tables with millions, but fewer than tens of millions, of records) mucking around with stats is not necessary and generally causes more problems than it solves. Is this generally the case?
* * * NOTES ON OUR RESOLUTION * * *
We were using this:
analyze table my_table compute statistics
We switched to this:
dbms_stats.gather_table_stats('MY_SCHEMA', 'MY_TABLE');
The analyze table statement took about 1 minute 30 seconds in one environment and about 15-20 minutes in the second environment.
The gather_table_stats call took about 30 seconds to 1 minute in both of the instances we were able to examine.
Our plan moving forward is to switch our analyze table statements to gather_table_stats calls.
STATISTICS_LEVEL and gathering table/index statistics are entirely different things. STATISTICS_LEVEL affects whether row source statistics are gathered during statement execution, which lets you compare the optimizer's estimates with the actual values for each step in the DBMS_XPLAN.DISPLAY_CURSOR output.
So table/index statistics are used for execution plan optimization, while STATISTICS_LEVEL controls the gathering of execution statistics while a plan is being executed, and it is mostly for diagnostic purposes.
When last_analyzed is null, it means that table statistics haven't been gathered yet.
stale_stats says whether the stats are considered fresh or stale, i.e. whether they will be gathered automatically next time. The default threshold is 10 percent: if you gather table statistics and then insert/update/delete less than 10 percent of the rows, the statistics are still considered fresh; once you reach 10 percent of modified rows they become stale.
By default, Oracle gathers table/index statistics automatically during a maintenance window that is configured when the database is created. DBAs usually reconfigure it if there are specific requirements.
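For reference, the 10 percent threshold is the STALE_PERCENT statistics preference and can be inspected or overridden per table with DBMS_STATS; a sketch with placeholder schema and table names:

-- Show the current staleness threshold for one table.
SELECT DBMS_STATS.GET_PREFS('STALE_PERCENT', 'MY_SCHEMA', 'MY_TABLE') FROM dual;
-- Lower it to 5 percent so the automatic task treats the stats as stale sooner.
BEGIN
    DBMS_STATS.SET_TABLE_PREFS('MY_SCHEMA', 'MY_TABLE', 'STALE_PERCENT', '5');
END;
/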
Regarding STATISTICS_LEVEL, with the default value TYPICAL it looks like this:
HUSQVIK#hq_pdb_tcp> select * from dual;
D
-
X
HUSQVIK#hq_pdb_tcp> SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------
SQL_ID a5ks9fhw2v9s1, child number 0
-------------------------------------
select * from dual
Plan hash value: 272002086
-------------------------------------------
| Id | Operation | Name | E-Rows |
-------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | TABLE ACCESS FULL| DUAL | 1 |
-------------------------------------------
Note
-----
- Warning: basic plan statistics not available. These are only collected when:
* hint 'gather_plan_statistics' is used for the statement or
* parameter 'statistics_level' is set to 'ALL', at session or system level
We don't see anything more than the estimated number of rows. If you set ALTER SESSION SET statistics_level = ALL, then:
HUSQVIK#hq_pdb_tcp> ALTER SESSION SET statistics_level = ALL;
HUSQVIK#hq_pdb_tcp> select * from dual;
D
-
X
HUSQVIK#hq_pdb_tcp> SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------------------
SQL_ID a5ks9fhw2v9s1, child number 1
-------------------------------------
select * from dual
Plan hash value: 272002086
------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 3 |
| 1 | TABLE ACCESS FULL| DUAL | 1 | 1 | 1 |00:00:00.01 | 3 |
------------------------------------------------------------------------------------
Now we also see the actual number of rows, the time taken to execute each step, and the number of consistent reads (the Buffers column).
With more complex queries you will get much more information than this. You should check the documentation at https://docs.oracle.com/database/121/ARPLS/d_xplan.htm
Also be aware that row source statistics sampling is not done for every row but, by default, every 128 rows (this can be changed using the undocumented _rowsource_statistics_sampfreq parameter).
(Husqvik thoroughly explained the meaning of the columns and parameters, this answer only addresses how to gather statistics.)
Statistics should be manually gathered after any significant* change to a table. Oracle has had a good default, automatic statistics gathering process since 11g. But even with that system there are still at least two good reasons to gather statistics manually. The default statistics gathering auto-task is normally meant for slowly changing OLTP tables, not fast-changing data warehouse tables.
Significant data changes can easily lead to significant performance problems. If the tables are going to be used right after they are loaded then they need good statistics immediately.
A common problem in ETL processes is when tables go from 1 row to a million rows. The optimizer thinks there is still only one row in large tables and uses lots of nested loops joins instead of hash joins. Those algorithms work well in different contexts; without good statistics Oracle does not know the correct context.
It's important to note that a NULL LAST_ANALYZED is not the worst case scenario. When there are no statistics at all, Oracle will use dynamic sampling to generate quick statistics estimates. The worst case is when the statistics job ran last night when the table is empty; Oracle thinks it has good statistics when it really doesn't.
The statistics auto-task may not be able to keep up with large changes. The statistics auto-task is a low-priority, single-threaded process. If there are too many large tables left to the automatic process it may not be able to process them during the maintenance window.
The bad news is that developers can't ignore optimizer statistics. The DBAs can't just handle it later. It might help to read some of the chapters from the manuals, such as Managing Optimizer Statistics.
The good news is that Oracle 11g finally has sensible default settings. You usually don't need to muck around with the parameters. In most cases there's a simple rule to follow: if the table changed significantly, run this:
dbms_stats.gather_table_stats('SCHEMA_NAME', 'TABLE_NAME');
*: "Significant" is a subjective word. A change is normally significant in terms of relative size, not absolute. Adding one million rows to a table is significant if the table currently has one row, but not if the table has a billion rows.
I'm using DataStax Community v 2.1.2-1 (AMI v 2.5) with preinstalled default settings, and I have this table:
CREATE TABLE notificationstore.note (
    user_id text,
    real_time timestamp,
    insert_time timeuuid,
    read boolean,
    PRIMARY KEY (user_id, real_time, insert_time))
    WITH CLUSTERING ORDER BY (real_time DESC, insert_time ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND default_time_to_live = 20160;
The other configurations are:
I have 2 nodes on m3.large instances, each with 1 x 32 GB SSD.
I'm facing timeouts on this particular table even with consistency set to ONE.
I increased the heap space to 3 GB (the instance has 8 GB of RAM).
I increased the read timeout to 10 seconds.
select count (*) from note where user_id = 'xxx' limit 2; // errors={}, last_host=127.0.0.1.
I am wondering if the problem could be with the time to live, or whether there is any other configuration or tuning that matters here.
The data in the database is pretty small.
Also, this problem does not occur right after inserting; it happens only after some time (more than 6 hours).
Thanks.
[Copying my answer from here because it's the same environment/problem: amazon ec2 - Cassandra Timing out because of TTL expiration.]
You're running into a problem where the number of tombstones (deleted values) passes a threshold, and the query then times out.
You can see this if you turn on tracing and then try your select statement, for example:
cqlsh> tracing on;
cqlsh> select count(*) from test.simple;
activity | timestamp | source | source_elapsed
---------------------------------------------------------------------------------+--------------+--------------+----------------
...snip...
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 23:36:59,324 | 172.31.0.85 | 123932
Scanned 1 rows and matched 1 | 23:36:59,325 | 172.31.0.85 | 124575
Timed out; received 0 of 1 responses for range 2 of 4 | 23:37:09,200 | 172.31.13.33 | 10002216
You're kind of running into an anti-pattern for Cassandra where data is stored for just a short time before being deleted. There are a few options for handling this better, including revisiting your data model if needed. Here are some resources:
The cassandra.yaml configuration file - See section on tombstone settings
Cassandra anti-patterns: Queues and queue-like datasets
About deletes
For your sample problem, I tried lowering the gc_grace_seconds setting to 300 (5 minutes). That causes the tombstones to be cleaned up more frequently than with the default of 10 days, but that may or may not be appropriate for your application. Read up on the implications of deletes and adjust as needed for your application.
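For reference, gc_grace_seconds is a per-table property; a minimal sketch for the table in the question (5 minutes shown, which only makes sense if repairs run at least that often):

-- Tombstones older than gc_grace_seconds become eligible for removal at the
-- next compaction; keep this longer than your repair interval to avoid
-- resurrecting deleted data.
ALTER TABLE notificationstore.note WITH gc_grace_seconds = 300;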