I've been developing a .NET project against Oracle (version 10.2) for the last couple of months, using VARCHAR2 for my string fields. This was fine: page refreshes while navigating were never more than half a second, even though it's quite a data-intensive project. The data is referenced from two different schemas, one a centralised store of data and one my own. Now the centralised schema will be changing to be Unicode-compliant (but hasn't yet), so all its VARCHAR2 fields will become NVARCHAR2. In preparation for this I changed all the fields in my schema to NVARCHAR2, and since then performance has been horrible: up to 30-40 second page refreshes.
Could this be because VARCHAR2 fields in the centralised schema are joined against NVARCHAR2 fields in my schema in some stored procedures? I know NVARCHAR2 is twice the size of VARCHAR2, but that wouldn't explain such a sudden, massive change. As I said, any tips on what to look for would be great; if I haven't explained the scenario well enough, do ask for more information.
Firstly, do a
select * from v$nls_parameters where parameter like '%SET%';
Character sets can be complicated. You can have single-byte character sets, fixed-size multi-byte character sets and variable-sized multi-byte character sets. See the Unicode descriptions here.
Secondly, if you are joining a string in a single-byte character set to a string in a two-byte character set, you have a choice. You can do a binary/byte comparison (which generally won't match anything when comparing a single-byte character set against a two-byte one). Or you can do a linguistic comparison, which generally means some CPU cost, as one value is converted into the other, and often the failure to use an index.
Indexes are ordered: A, B, C and so on. But a character like Ä may fall in different places depending on the linguistic order. Say the index structure puts Ä between A and B, but then you do a linguistic comparison whose language puts Ä after Z; in that case the index can't be used. (Remember your condition could be a BETWEEN rather than an =.)
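The effect of binary versus linguistic ordering can be sketched outside Oracle. This Python snippet is only an illustration, not Oracle's collation logic: it contrasts codepoint order with a naive accent-folding key.

```python
import unicodedata

words = ["Apfel", "Zebra", "Äpfel", "Banane"]

# Binary/codepoint order: 'Ä' (U+00C4) sorts after 'Z' (U+005A).
binary_order = sorted(words)

# A naive "linguistic" key: strip accents so 'Ä' collates with 'A'.
def fold(s):
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c)).casefold()

linguistic_order = sorted(words, key=fold)

print(binary_order)      # ['Apfel', 'Banane', 'Zebra', 'Äpfel']
print(linguistic_order)  # ['Apfel', 'Äpfel', 'Banane', 'Zebra']
```

An index built in one of these orders is useless for a range scan performed under the other, which is exactly the situation described above.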
In short, you'll need a lot of preparation, both in your schema and the central store, to enable efficient joins between different charactersets.
It is difficult to say anything based on what you have provided. Did you check whether the estimated cardinalities and/or the explain plan changed when you changed the datatype to NVARCHAR2? You may want to read the following blog post to see if you can find a lead:
http://joze-senegacnik.blogspot.com/2009/12/cbo-oddities-in-determing-selectivity.html
It is likely no longer able to use indexes that it previously could. As Narendra suggests, check the explain plan to see what changed. It is possible that once the centralised store is changed, the indexes will again be usable. I suggest testing that path.
Setting the NLS_LANG parameter properly is essential to correct data conversion. The character set specified by NLS_LANG should reflect the encoding used by the client operating system; setting it correctly enables proper conversion from the client code page to the database character set. When NLS_LANG and the database character set are the same, Oracle assumes the data being sent or received is already encoded in the database character set, so no validation or conversion is performed, which can corrupt data if a conversion was actually necessary.
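A rough Python illustration of what happens when the declared client encoding doesn't match the bytes actually sent (the mojibake you get when NLS_LANG lies); this is an analogy, not Oracle code:

```python
# A client actually sends UTF-8 bytes, but its declared encoding
# (the NLS_LANG analogue) claims to match the database's Latin-1,
# so no conversion is performed on the way in or out.
original = "café"
utf8_bytes = original.encode("utf-8")   # b'caf\xc3\xa9'

# The raw bytes are stored verbatim and later interpreted as Latin-1:
stored = utf8_bytes.decode("latin-1")
print(stored)  # cafÃ©  (mojibake)

# With the client encoding declared correctly, the round trip is clean:
print(utf8_bytes.decode("utf-8"))  # café
```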
I'm working with an SQLite DB where all columns are of NVARCHAR data type.
Coming from an MS SQL background I know that NVARCHAR has additional baggage associated with it, so my first impulse is to refactor most column types to have concrete string lengths enforced (most are under 50 chars long).
But at the same time I know that SQLite treats things a bit differently.
So my question is should I change/refactor the column types? And by doing so is there anything to gain in terms of disk space or performance in SQLite?
DB runs on Android/iOS devices.
Thanks!
You should read https://www.sqlite.org/datatype3.html.
CHARACTER(20), VARCHAR(255), VARYING CHARACTER(255), NCHAR(55), NATIVE CHARACTER(70), NVARCHAR(100), TEXT and CLOB are all treated as TEXT.
Also SQLite does not enforce lengths.
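Both behaviours are easy to see from Python's built-in sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name NVARCHAR(10))")

# SQLite ignores the declared length: 40 characters go in without error.
con.execute("INSERT INTO t VALUES (?)", ("x" * 40,))
(value,) = con.execute("SELECT name FROM t").fetchone()
print(len(value))  # 40

# The declared type only picks an affinity; NVARCHAR(10) maps to TEXT.
(storage,) = con.execute("SELECT typeof(name) FROM t").fetchone()
print(storage)  # text
```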
So instead of digging through cryptic documentation I did a bit of experimenting with column types.
The database I'm working with has well over 1 million records, with most columns as NVARCHAR, so any change of column datatype was easily seen in the file-size delta.
Here are the results I found in effort to reduce DB size:
NVARCHAR:
Biggest savings came from switching column types, where possible, from NVARCHAR to plain INT or FLOAT. On an 80 MB database file the savings were in megabytes, very noticeable. With some additional refactoring I dropped the size down to 47 MB.
NVARCHAR vs. VARCHAR:
Made very little difference, perhaps a few KB on a database of 80 MB.
NVARCHAR vs. OTHER String Types:
Switching between the various string-based types made almost no difference. As the documentation points out, all string types are stored the same way in SQLite: as TEXT.
INT vs. OTHER Numeric Types:
No difference here; SQLite stores the values the same way regardless of which numeric type you declare.
Indexes based on NVARCHAR columns also took up more space; once rebuilt on INT columns I shed a few MB.
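The affinity difference behind those savings can be checked with typeof() via Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (as_text NVARCHAR, as_int INT)")
con.execute("INSERT INTO t VALUES (?, ?)", ("1234567", 1234567))

# The same value costs 7 bytes as TEXT, while the INTEGER storage
# class packs it into a handful of bytes (1, 2, 3, 4, 6 or 8).
row = con.execute("SELECT typeof(as_text), typeof(as_int) FROM t").fetchone()
print(row)  # ('text', 'integer')
```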
Do I need to set a length for every POCO property in Entity Framework Code First? If I don't set StringLength or MaxLength/MinLength on a property, it will be nvarchar(max). How bad is nvarchar(max)? Should I just leave it alone during the development stage and improve it before production?
You should define a Max length for each property where you want to restrict the length. Note that the nvarchar(max) data type is different from the nvarchar(n) datatype, where n is a number from 1-4000. The max version that you get when you define no max length is meant for large blocks of text, like paragraphs and the like. It can handle extremely large lengths, and so the data is stored separately from the rest of the fields of the record. nvarchar(n), on the other hand, is stored inline with the rest of the rows.
It's probably best to go ahead and set those values the way you want them now rather than waiting. Choose values as large as you will ever need, so you never have to increase them. nvarchar(n) stores its data efficiently; for example, an nvarchar(200) does not necessarily take up 200 characters' worth of space, only enough to store what is actually put into it, plus a couple of extra bytes to record its length.
So whenever possible, you should set a limit on your entity's text fields.
NVARCHAR is a variable-length type, so it consumes only the space you actually need. NCHAR, on the other hand, allocates all the space it requires up front rather than on demand as NVARCHAR does.
MSDN advises to use nvarchar when the sizes of the column data entries are probably going to vary considerably.
It's the way to go, in my view, in the early stages of a project. You can tune it when needed.
According to a blog post I read, nvarchar(max) behaves the same as nvarchar(n) until the actual value exceeds 4000 characters (the in-row limit is about 8 KB, and wide characters use two bytes each); once it passes that size it behaves pretty much like ntext. So personally I don't see any good reason to avoid the nvarchar(max) data type.
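The arithmetic behind that 4000-character threshold, assuming two bytes per character (BMP-only UTF-16) and SQL Server's roughly 8060-byte in-row limit:

```python
# SQL Server stores nvarchar as UTF-16: two bytes per character
# (for characters in the Basic Multilingual Plane).
chars = 4000
bytes_needed = chars * 2
print(bytes_needed)  # 8000, just under the ~8060-byte in-row limit

# One more character and the value can no longer be stored in-row,
# which is where nvarchar(max) switches to ntext-like off-row storage.
print((chars + 1) * 2)  # 8002
```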
Is there a reason why Oracle is case sensitive and others like SQL Server, and MySQL are not by default?
I know there are ways to enable/disable case sensitivity, but it just seems weird that Oracle differs from the other databases.
I'm also trying to understand the reasons for case sensitivity. I can see how "Table" and "TaBlE" could be considered either equivalent or distinct, but is there an example where case sensitivity would actually make a difference?
I'm somewhat new to databases and am currently taking a class.
By default, Oracle identifiers (table names, column names, etc.) are case-insensitive. You can make them case-sensitive by using quotes around them (eg: SELECT * FROM "My_Table" WHERE "my_field" = 1). SQL keywords (SELECT, WHERE, JOIN, etc.) are always case-insensitive.
On the other hand, string comparisons are case-sensitive (eg: WHERE field='STRING' will only match columns where it's 'STRING') by default. You can make them case-insensitive by setting NLS_COMP and NLS_SORT to the appropriate values (eg: LINGUISTIC and BINARY_CI, respectively).
Note: when querying data dictionary views (eg: dba_tables), the names will be in upper case if you created them without quotes, and the string comparison rules explained in the previous paragraph apply there.
Some databases (Oracle, IBM DB2, PostgreSQL, etc.) perform case-sensitive string comparisons by default; others (SQL Server, MySQL) are case-insensitive by default. (SQLite's = comparisons are case-sensitive under its default BINARY collation, though its LIKE is case-insensitive for ASCII.) This isn't standardized by any means, so just be aware of what your database's settings are.
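SQLite makes a handy sandbox for seeing what difference a collation makes; via Python's sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT COLLATE NOCASE)")
con.execute("INSERT INTO t VALUES ('String', 'String')")

# Default BINARY collation: the comparison is case-sensitive.
case_sensitive = con.execute(
    "SELECT count(*) FROM t WHERE a = 'STRING'").fetchone()[0]

# NOCASE collation: ASCII letters compare case-insensitively.
case_insensitive = con.execute(
    "SELECT count(*) FROM t WHERE b = 'STRING'").fetchone()[0]

print(case_sensitive, case_insensitive)  # 0 1
```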
Oracle actually treats field and table names in a case-insensitive manner unless you use quotes around the identifiers. If you create a table without quotes around the name, for example CREATE TABLE MyTable..., the resulting table name is converted to upper case (i.e. MYTABLE) and treated case-insensitively. SELECT * FROM MYTABLE, SELECT * FROM MyTable and SELECT * FROM myTabLe will all match MYTABLE (note the lack of quotes around the table name). Here is a nice article that discusses this in more detail and compares databases.
Keep in mind too that for SQL Server, case sensitivity is based on the collation. The default collation is case-insensitive, but this can be changed to a case-sensitive one. A similar example is why default Oracle databases use a Western European character set when a Unicode character set is required for global applications that handle non-ASCII characters. I think it's just a vendor preference.
If I had to guess, I'd say for historical/backwards-compatibility reasons.
Oracle first came out in the late 1970s, and it was likely computationally expensive with the technology of the time to do the extra work for case-insensitive searches, so they just opted for exact matches.
For some applications case sensitivity is important and for others it isn't. Whichever DBMS you use, business requirements should determine whether you need case sensitivity. I wouldn't worry too much about the default.
We are going to migrate an application to support Unicode, and have to choose between a Unicode character set for the whole database, or Unicode columns stored in N[VAR]CHAR2.
We know that we will no longer be able to index column contents with Oracle Text if we choose NVARCHAR2, because Oracle Text can only index columns based on the CHAR type.
Apart from that, are other major differences likely to arise as we make use of Oracle's capabilities?
Also, is it likely that new features in future versions of Oracle will support either CHAR columns or NCHAR columns, but not both?
Thank you for your answers.
Note following Justin's answer:
Thank you for your answer. I will discuss your points, applied to our case:
Our application is usually alone on the Oracle database and takes care of the data itself. Other software that connect to the database are limited to Toad, Tora or SQL developer.

We also use SQL*Loader and SQL*Plus to communicate with the database for basic statements or to upgrade between versions of the product. We have not heard of any specific problem with all those software regarding NVARCHAR2.

We are also not aware that database administrators among our customers would like to use other tools on the database that could not support data on NVARCHAR2, and we are not really concerned whether their tools might disrupt; after all they are skilled in their job and may find other tools if necessary.

Your last two points are more insightful for our case. We do not use many built-in packages from Oracle but it still happens. We will explore that problem.
Could we also expect performance breakage if our application (that is compiled under Visual C++), that uses wchar_t to store UTF-16, has to perform encoding conversions on all processed data?
If you have anything close to a choice, use a Unicode character set for the entire database. Life in general is just blindingly easier that way.
There are plenty of third party utilities and libraries that simply don't support NCHAR/ NVARCHAR2 columns or that don't make working with NCHAR/ NVARCHAR2 columns pleasant. It's extremely annoying, for example, when your shiny new reporting tool can't report on your NVARCHAR2 data.
For custom applications, working with NCHAR/ NVARCHAR2 columns requires jumping through some hoops that working with CHAR/ VARCHAR2 Unicode encoded columns does not. In JDBC code, for example, you'd constantly be calling the Statement.setFormOfUse method. Other languages and frameworks will have other gotchas; some will be relatively well documented and minor others will be relatively obscure.
Many built-in packages will only accept (or return) a VARCHAR2 rather than a NVARCHAR2. You'll still be able to call them because of implicit conversion but you may end up with character set conversion issues.
In general, being able to avoid character set conversion issues within the database, and relegating those issues to the edge where the database actually sends or receives data from a client, makes developing an application much easier. It's enough work to debug character set conversion issues that result from network transmission; figuring out that some data got corrupted when a stored procedure concatenated a VARCHAR2 with an NVARCHAR2 and stored the result in a VARCHAR2 before it was sent over the network can be excruciating.
Oracle designed the NCHAR/ NVARCHAR2 data types for cases where you are trying to support legacy applications that don't support Unicode in the same database as new applications that are using Unicode and for cases where it is beneficial to store some Unicode data with a different encoding (i.e. you have a large amount of Japanese data that you would prefer to store using the UTF-16 encoding in a NVARCHAR2 rather than the UTF-8 encoding). If you are not in one of those two situations, and it doesn't sound like you are, I would avoid NCHAR/ NVARCHAR2 at all costs.
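The UTF-16 vs. UTF-8 size trade-off for Japanese text is easy to check in Python (the byte counts below assume BMP characters):

```python
text = "日本語のテキスト"  # eight Japanese characters

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print(len(utf8))   # 24: three bytes per character in UTF-8
print(len(utf16))  # 16: two bytes per character in UTF-16
```

This is the storage argument for NVARCHAR2 with mostly East Asian data; for mostly ASCII data the sizes reverse.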
Responding to your followups
"Our application is usually alone on the Oracle database and takes care of the data itself. Other software that connect to the database are limited to Toad, Tora or SQL developer."
What do you mean "takes care of the data itself"? I'm hoping you're not saying that you've configured your application to bypass Oracle's character set conversion routines and that you do all the character set conversion yourself.
I'm also assuming that you are using some sort of API/ library to access the database even if that is OCI. Have you looked into what changes you'll need to make to your application to support NCHAR/ NVARCHAR2 and whether the API you're using supports NCHAR/ NVARCHAR2? The fact that you're getting Unicode data in C++ doesn't actually indicate that you won't need to make (potentially significant) changes to support NCHAR/ NVARCHAR2 columns.
"We also use SQL*Loader and SQL*Plus to communicate with the database for basic statements or to upgrade between versions of the product. We have not heard of any specific problem with all those software regarding NVARCHAR2."
Those applications all work with NCHAR/ NVARCHAR2. NCHAR/ NVARCHAR2 introduce some additional complexities into scripts particularly if you are trying to encode string constants that are not representable in the database character set. You can certainly work around the issues, though.
"We are also not aware that database administrators among our customers would like to use other tools on the database that could not support data on NVARCHAR2 and we are not really concerned whether their tools might disrupt, after all they are skilled in their job and may find other tools if necessary."
While I'm sure that your customers can find alternate ways of working with your data, if your application doesn't play nicely with their enterprise reporting tool or their enterprise ETL tool or whatever desktop tools they happen to be experienced with, it's very likely that the customer will blame your application rather than their tools. It probably won't be a show stopper, but there is also no benefit to causing customers grief unnecessarily. That may not drive them to use a competitor's product, but it won't make them eager to embrace your product.
"Could we also expect performance breakage if our application (that is compiled under Visual C++), that uses wchar_t to store UTF-16, has to perform encoding conversions on all processed data?"
I'm not sure what "conversions" you're talking about. This may get back to my initial question about whether you're stating that you are bypassing Oracle's NLS layer to do character set conversion on your own.
My bottom line, though, is that I don't see any advantages to using NCHAR/ NVARCHAR2 given what you're describing. There are plenty of potential downsides to using them. Even if you can eliminate 99% of the downsides as irrelevant to your particular needs, however, you're still facing a situation where at best it's a wash between the two approaches. Given that, I'd much rather go with the approach that maximizes flexibility going forward, and that's converting the entire database to Unicode (AL32UTF8 presumably) and just using that.
We have a requirement to store character data in different languages in the same DB schema. Oracle 10g is our DB. I am hoping that someone who has already done this can give me more specific instructions on how to i18n-enable an Oracle 10g database. We just need to store data from multiple locales, plus collation support at the DB level (hoping all major DBs support this). We don't need formatting of dates, datetimes, numbers, currency, etc.
I read some documentation on Oracle's i18n support but am somewhat confused by the many NLS_* properties. Should I be using NLS_LANG or NLS_LANGUAGE or NLS_CHARACTERSET...?
Assuming that you are building the database from scratch, not trying to retrofit an existing database (which introduces other problems):
In the database, you need to ensure that the database character set supports all the characters you want to store. Presumably, that means setting the NLS_CHARACTERSET of the database to AL32UTF8. Personally, I prefer to set NLS_LENGTH_SEMANTICS to CHAR as well. That changes the default behavior of a VARCHAR2(n) to allocate n characters of storage rather than n bytes. Since AL32UTF8 is a variable-length character set, using byte semantics is generally problematic: you either have to declare fields that are three times as long, or end up with different users being able to enter different numbers of characters in the same field.
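The byte-vs-character mismatch is easy to demonstrate in Python, using UTF-8 here as a stand-in for AL32UTF8:

```python
text = "über-cool"  # 9 characters

# CHAR semantics: a VARCHAR2(9 CHAR) column holds this value.
print(len(text))                  # 9

# BYTE semantics: 'ü' takes two bytes in UTF-8 (AL32UTF8), so the same
# value needs 10 bytes and would overflow a VARCHAR2(9 BYTE) column.
print(len(text.encode("utf-8")))  # 10
```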
NLS_LANG is a client setting. That identifies the character set that the client is going to request the data be converted into. That generally depends on the code page of the operating system.