I am sure there's something stupid from my side going on here.
I have a Java application where I need to query a collection of 2.5 million objects repeatedly, so I put them into an in-memory DB.
For this purpose I tried out HSQLDB v2.4.1 and H2 v1.4.198.
For both I use exactly the same CREATE TABLE statement:
String createRateTable = "CREATE MEMORY TABLE INTEREST_RATES " +
"(EFFECTIVE_DATE DATE not NULL, "
+ "INTEREST_RATE DOUBLE, "
+ "INTEREST_RATE_CD BIGINT, "
+ "INTEREST_RATE_TERM BIGINT, "
+ "INTEREST_RATE_TERM_MULT VARCHAR(50),"
+ "PRIMARY KEY (EFFECTIVE_DATE, INTEREST_RATE_CD, INTEREST_RATE_TERM, INTEREST_RATE_TERM_MULT))";
The only difference is the connection; I either use
con = DriverManager.getConnection("jdbc:h2:mem:ftp", "SA", "");
or
con = DriverManager.getConnection("jdbc:hsql:mem:ftp", "SA", "");
A frequently executed query, for example, is this one:
SELECT *
from INTEREST_RATES
where INTEREST_RATE_CD = ?
and EFFECTIVE_DATE = (SELECT MIN(EFFECTIVE_DATE)
from INTEREST_RATES
where INTEREST_RATE_CD = ?)
Now...
with HSQLDB the application finishes within about 2 minutes,
while with H2 it's still not done after more than 8 minutes.
What's wrong with my H2 setup? It seems as if no index is created there, unlike HSQLDB, which builds one from the PRIMARY KEY. What else could be the problem?
Each DB engine has its own implementation and can behave differently in some scenarios.
I would try the following query; I think the index can be used more efficiently, and it returns the same result:
SELECT *
from INTEREST_RATES
where INTEREST_RATE_CD = ?
order by INTEREST_RATE_CD, EFFECTIVE_DATE
limit 1;
In the "order by" the first column is not really required as is already filtered but as explained here is needed so the PRIMARY KEY is correctly used by the engine in the execution plan.
I am a beginner with SAS and am trying to create a table with the code below. The code has now been running for 3 hours. The dataset is quite large (150,000 rows), yet when I insert a different date it runs in 45 minutes. The date I have inserted is valid under date_key. Any suggestions on why this may be / what I can do? Thanks in advance.
proc sql;
create table xyz as
select monotonic() as rownum ,*
from x.facility_yz
where (Fac_Name = 'xyz' and (Ratingx = 'xyz' or Ratingx is null) )
and Date_key = '20000101'
;
quit;
I tried running it again, but hit the same problem.
Is your dataset coming from an external database? A SAS dataset of this size should not take nearly this long to query; it should be almost instant. If it is external, you may be able to take advantage of indexing. Try to find out what the database is indexed on and use that as a first pass. You might also consider using a data step rather than SQL with the monotonic() function.
For example, assume it is indexed by date:
data xyz1;
set x.facility_xyz;
where date_key = '20000101';
run;
Then you can filter this final dataset within SAS itself. 150,000 rows is nothing for a SAS dataset, assuming there aren't hundreds of variables making it large. A SAS dataset this size should run lightning fast when querying.
data xyz2;
set xyz1;
where fac_name = 'xyz' AND (Ratingx = 'xyz' or Ratingx = ' ');
rownum = _N_;
run;
Or, you could try it all in one pass while still taking advantage of the index:
data xyz;
set x.facility_xyz;
where date_key = '20000101';
if(fac_name = 'xyz' AND (Ratingx = 'xyz' or Ratingx = ' ') );
rownum+1;
run;
You could also try rearranging your where statement to see if you can take advantage of compound indexing:
data xyz;
set x.facility_xyz;
where date_key = '20000101'
AND fac_name = 'xyz'
AND (Ratingx = 'xyz' or Ratingx = ' ')
;
rownum = _N_;
run;
More importantly, only keep variables that are necessary. If you need all of them then that is okay, but consider using the keep= or drop= dataset options to only pull what you need. This is especially important when talking with an external database.
What kind of libname do you use?
If you are running implicit pass-through using SAS functions, that would explain why it takes so long.
If you are using a SAS/ACCESS to XXX module, first add options to understand what is going on: options sastrace=',,,d' sastraceloc=saslog;
You should probably use explicit pass-through, i.e. write the RDBMS's native SQL, to avoid automatic translation of your code.
I'm using an Oracle database. How can I detect duplicate data even when the text differs in upper/lower case?
Assume my table already contains: Production
Now I want to add: production (lower case); it should be detected as a duplicate. In my current case it was not detected and was inserted.
Here is the sample query:
SELECT * FROM tb_departments WHERE DEPARTMENT_NAME = '" . $getDepartmentName . "';
Anyone have an idea?
You can use the UPPER (or LOWER) function, which converts your string to upper (or lower) case, e.g.
SELECT * FROM tb_departments WHERE UPPER(DEPARTMENT_NAME) = UPPER('" . $getDepartmentName . "');
As a small variation, you could upper-case your input string in the code and use
SELECT * FROM tb_departments WHERE UPPER(DEPARTMENT_NAME) = '" . $yourUpperDepartmentName . "';
Moreover, I suggest you use query parameters instead of injecting the parameter string ($getDepartmentName) directly into your query.
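As a sketch of the parameterized approach: the snippet in the question looks like PHP string concatenation, but the idea is the same in any client library. Shown here with Java/JDBC; the connection URL and credentials are hypothetical placeholders, while the table and column names are taken from the question.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DepartmentLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical Oracle connection details.
        String url = "jdbc:oracle:thin:@//dbhost:1521/ORCL";
        try (Connection con = DriverManager.getConnection(url, "appuser", "secret")) {
            // Case-insensitive match; the user input is bound as a parameter, never concatenated into the SQL.
            String sql = "SELECT * FROM tb_departments WHERE UPPER(DEPARTMENT_NAME) = UPPER(?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, "production");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("DEPARTMENT_NAME"));
                    }
                }
            }
        }
    }
}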
In Tableau 9.2, is it possible to generate a random sample of records? If so, how could I do this? Say I have a field called Field1, and I intend to select only 20% of the records. So far, I have found how to generate a random integer in Tableau, though it is bewildering that Tableau does not already have a function for this:
Random Seed
(DATEPART('second', NOW()) + 1) * (DATEPART('minute', NOW()) + 1) * (DATEPART('hour', NOW()) + 1) * (DATEPART('day', NOW()) + 1)
Random Number
((PREVIOUS_VALUE(MIN([Seed])) * 1140671485 + 12820163) % (2^24))
Random Int
INT([Random Number] / (2^24) * [Random Upper Limit]) + 1
So how could I create a calculated field to only show random records that make up 20% of Field1?
When you make an extract, there is a dialog panel where you can filter records and specify rolling up to visible dimensions.
For at least some data sources, you can also specify a limit of the number of records (say grab the first 2000 records) or a random percentage (say, 10% of the records)
Then you can work with the small extract quickly to design your viz, and remove the extract or refresh it with all the data when you are ready. I don't think every data source supports the random selection, though.
There is a random number function in Tableau, but it is hidden and doesn't appear on the list of available functions.
It is "random()". It generates a uniformly distributed number between 0 and 1.
It isn't documented but it works. See, for example, this previous answer: how to generate pseudo random numbers and row-count in Tableau
I ended up solving my issue on the back-end in my MS Access database, with the following MS Access SQL query inside an MS Access VBA macro I made:
value1 = "some_value"
fieldName = "[my_field_name]"
sqlQuery = "SELECT [my_table].* " & _
" INTO new_table_name" & _
" FROM [my_table] " & _
" WHERE [some_field] = '" & value1 & "'" & _
" ORDER BY Rnd(-(100000*" & fieldName & ")*Time())"
Debug.Print sqlQuery
CurrentDb.Execute sqlQuery
I ended up deciding that something like this would be best left to the back-end and to leave the visual analytics to Tableau.
We are using Cassandra for log collecting.
About 150,000 - 250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc and special indexed column 'indexTimestamp'.
This column contains time rounded to hours.
So, when we want to get some records, we use get_indexed_slices() with first IndexExpression by indexTimestamp ( with EQ operator ) and then some other IndexExpressions - by host, errorlevel, etc.
When getting records just by indexTimestamp everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra works for a long time (more than 15-20 seconds) and then throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first gets all records by the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow here? For a given indexTimestamp there are no more than 250,000 records. Shouldn't it be possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine ( Windows 7 ) with 4 CPUs and 4 GBs memory.
You have to bear in mind that Cassandra is very bad with this kind of query. Secondary-index queries are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact Cassandra is not a DB you can query. It is a key-value storage system. To understand that please go there and have a quick look: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucketed rows and ranged slice queries.
Let's say you have the object
user : {
name : "XXXXX"
country : "UK"
city : "London"
postal_code :"N1 2AC"
age : "24"
}
and of course you want to query by city OR by age (AND & OR would be yet another data model).
Then you would have to save your data like this, assuming the name is a unique id:
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for age search.
the range query for age EQ 24 would be
get_range_slice(row= "bucket_20_to_25", from = "24-", to = "24=")
As a note, 'minus' is meant here as 'underscore' - 1 and 'equals' as 'underscore' + 1, effectively giving you all the columns that start with "24_".
This also allows you to query for ages between 21 and 24, for example.
Hope it was useful.
I have a database that contains 250,000 records. I am using a DataReader to loop the records and export to a file. Just looping the records with a DataReader and no WHERE conditions is taking approx 22 minutes. I am only selecting two columns (the id and a nvarchar(max) column with about 1000 characters in it).
Does 22 minutes sound correct for SQL Server Express? Would the 1GB of RAM or 1CPU have an impact on this?
22 minutes sounds way too long for a single basic (non-aggregating) SELECT against 250K records (even 22 seconds sounds awfully long for that to me).
To say why, it would help if you could post some code and your schema definition. Do you have any triggers configured?
With 1K characters in each record (2KB), 250K records (500MB) should fit within SQL Express' 1GB limit, so memory shouldn't be an issue for that query alone.
Possible causes of the performance problems you're seeing include:
Contention from other applications
Having rows that are much wider than just the two columns you mentioned
Excessive on-disk fragmentation of either the table or the DB MDF file
A slow network connection between your app and the DB
Update: I did a quick test. On my machine, reading 250K 2KB rows with a SqlDataReader takes under 1 second.
First, create test table with 256K rows (this only took about 30 seconds):
CREATE TABLE dbo.data (num int PRIMARY KEY, val nvarchar(max))
GO
DECLARE @txt nvarchar(max)
SET @txt = N'put 1000 characters here....'
INSERT dbo.data VALUES (1, @txt);
GO
INSERT dbo.data
SELECT num + (SELECT COUNT(*) FROM dbo.data), val FROM dbo.data
GO 18
Test web page to read data and display the statistics:
using System;
using System.Collections;
using System.Data.SqlClient;
using System.Text;
public partial class pages_default : System.Web.UI.Page
{
protected override void OnLoad(EventArgs e)
{
base.OnLoad(e);
using (SqlConnection conn = new SqlConnection(DAL.ConnectionString))
{
using (SqlCommand cmd = new SqlCommand("SELECT num, val FROM dbo.data", conn))
{
conn.Open();
conn.StatisticsEnabled = true;
using (SqlDataReader reader = cmd.ExecuteReader())
{
while (reader.Read())
{
}
}
StringBuilder result = new StringBuilder();
IDictionary stats = conn.RetrieveStatistics();
foreach (string key in stats.Keys)
{
result.Append(key);
result.Append(" = ");
result.Append(stats[key]);
result.Append("<br/>");
}
this.info.Text = result.ToString();
}
}
}
}
Results (ExecutionTime in milliseconds):
IduRows = 0
Prepares = 0
PreparedExecs = 0
ConnectionTime = 930
SelectCount = 1
Transactions = 0
BytesSent = 88
NetworkServerTime = 0
SumResultSets = 1
BuffersReceived = 66324
BytesReceived = 530586745
UnpreparedExecs = 1
ServerRoundtrips = 1
IduCount = 0
BuffersSent = 1
ExecutionTime = 893
SelectRows = 262144
CursorOpens = 0
I repeated the test with SQL Enterprise and SQL Express, with similar results.
Capturing the "val" element from each row increased ExecutionTime to 4093 ms (string val = (string)reader["val"];). Using DataTable.Load(reader) took about 4600 ms.
Running the same query in SSMS took about 8 seconds to capture all 256K rows.
Your results from running exec sp_spaceused myTable provide a potential hint:
rows = 255,000
reserved = 1994320 KB
data = 1911088 KB
index_size = 82752 KB
unused = 480 KB
The important thing to note here is reserved = 1994320 KB (data alone is 1911088 KB), meaning your table is some 1866 MB, roughly 7.5 KB per row rather than the ~2 KB the two columns you mentioned would suggest. When reading fields that are not indexed (an NVARCHAR(MAX) column cannot be indexed), SQL Server must read the entire row into memory before restricting the columns, so you easily run past the 1 GB RAM limit.
As a simple test, delete the last (or first) 150k rows and try the query again to see what performance you get.
A few questions:
Does your table have a clustered index on the primary key (is it the id field or something else)?
Are you sorting on a column that is not indexed, such as the nvarchar(max) field?
In the best scenario for you, your PK is id and is also the clustered index, and you either have no ORDER BY or you ORDER BY id.
Assuming your nvarchar(max) field is named comments:
SELECT id, comments
FROM myTable
ORDER BY id
This will work fine, but it requires you to read all the rows into memory (it will only do one pass over the table). Since comments is NVARCHAR(MAX) and cannot be indexed, and the table is about 2 GB, SQL Server will have to load the table into memory in parts.
Likely what is happening is you have something like this:
SELECT id, comments
FROM myTable
ORDER BY comment_date
where comment_date is an additional field that is not indexed. In this case SQL Server is unable to sort all the rows in memory and ends up paging the table in and out of memory several times, which is likely causing the problem you are seeing.
A simple solution in this case is to add an index to comment_date.
But suppose that is not possible because you only have read access to the database; then another solution is to make a local temp table of the data you want, using the following:
DECLARE @T TABLE
(
id BIGINT,
comments NVARCHAR(MAX),
comment_date date
)
INSERT INTO @T SELECT id, comments, comment_date FROM myTable
SELECT id, comments
FROM @T
ORDER BY comment_date
If this doesn't help, additional information is required: please post your actual query along with your entire table definition and what the indexes are.
Beyond all of this, run the following after you restore backups, to rebuild indexes and statistics; you could just be suffering from corrupted statistics (which happens when you back up a fragmented database and then restore it to a new instance):
EXEC [sp_MSforeachtable] @command1="RAISERROR('UPDATE STATISTICS(''?'') ...',10,1) WITH NOWAIT UPDATE STATISTICS ? "
EXEC [sp_MSforeachtable] @command1="RAISERROR('DBCC DBREINDEX(''?'') ...',10,1) WITH NOWAIT DBCC DBREINDEX('?')"
EXEC [sp_MSforeachtable] @command1="RAISERROR('UPDATE STATISTICS(''?'') ...',10,1) WITH NOWAIT UPDATE STATISTICS ? "