Full text indexing on a column in a foreign table

My entire database is in InnoDB. I love the features, hands down. However, it doesn't allow full-text indexing on TEXT-type columns. So I have to take my current TEXT column from my main table (InnoDB), create a MyISAM table, and reference back to the original table. But because MyISAM doesn't allow FK constraints, I realize I've created a potential weakness. If the original table's index changes, it won't cascade down into the MyISAM table. Conversely, if I create a FK link from the original table to the MyISAM table and the MyISAM row is deleted, then I am linking to a nonexistent entry. The data consistency check is simply not there.
In short, InnoDB got me too comfortable and dependent on FK constraints for my own good.
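For illustration, a minimal sketch of the companion-table setup described above; the table and column names are hypothetical:

-- MyISAM companion table holding only the searchable text,
-- keyed by the InnoDB table's primary key
CREATE TABLE article_search (
  article_id INT UNSIGNED NOT NULL,
  body TEXT NOT NULL,
  PRIMARY KEY (article_id),
  FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- full-text query, joined back to the InnoDB table by hand
SELECT a.*
FROM articles a
JOIN article_search s ON s.article_id = a.id
WHERE MATCH(s.body) AGAINST ('search terms');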

I would consider not using the MyISAM full-text indexing at all, and instead using a proper search engine alongside your DB. Lucene/Solr, Sphinx and Xapian seem to be the leading choices (I've only used Lucene/Solr myself).
see this question for more :)
edit: also this question.

If you are using some sort of framework, the framework can control the referential integrity for you. CakePHP does a nice job of this with their Model classes.


Check all table columns for a value

OK, tricky question. I am trying to figure out where a database schema is storing a particular pointer. I know the pointer value, I just don't know what table or column it is in. I know the pointer is 123123123. How do I check all table columns to see if any of them have that value?
Thanks.
In H2 you can use full-text search, but then you would need to add all the tables to the search scope and index them.
If you only need to index primary keys, then it might be better, but you still need to come up with individual FT_CREATE_INDEX() calls for each table. You can automate this with several languages or with ETL tools (like Scriptella).
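For a rough idea of what that looks like in H2 (the table names here are placeholders), the built-in full-text functions can be wired up like this:

-- enable H2's native full-text search
CREATE ALIAS IF NOT EXISTS FT_INIT FOR "org.h2.fulltext.FullText.init";
CALL FT_INIT();

-- one call per table you want searchable; NULL means index all columns
CALL FT_CREATE_INDEX('PUBLIC', 'ORDERS', NULL);
CALL FT_CREATE_INDEX('PUBLIC', 'CUSTOMERS', NULL);

-- returns a query string locating each row that contains the value
SELECT * FROM FT_SEARCH('123123123', 0, 0);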
If you have enough disk space, you could dump the SQL from your DB and use a viewer for big files like glogg.
The advantage of the first solution is that it needs no external tools, but you have to work out a specific indexing script for any existing or new table. The second solution is a one-time fix.
I use SQL Search from RedGate. It's free and it helps you find any text anywhere in the database.
https://www.red-gate.com/products/

Why no primary key

I have inherited a database with tables that lack primary keys. It's an OLTP database. One of the tables in question has ~300k records and has no primary key implemented, even though examining the rest of the schema tells me one column is used as a primary key, i.e. it is replicated in another table with an identical name, etc. In other words, this is not an 'end of line' table.
This database also does not implement FKs.
My question is - is there ANY valid reason for a table (in Oracle for that matter) NOT to have a primary key?
I think a PK is mandatory in almost all cases. Lots of reasons exist, but I'll cover some of them.
It prevents duplicate rows from being inserted.
Rows will be referenced elsewhere, so the table needs a key for that.
I have seen very few cases where tables are made without a PK (e.g. tables for logs).
Not specific to Oracle, but I recall reading about one such use case where MySQL was highly customized for a dam (electricity generation) project, I think. The input data from sensors arrived at something like 100-1000 records per second. They were using timestamps for each record, so they didn't need a primary key (like the logs/logging case mentioned in another answer here).
So good reasons would be:
Overhead, in the case of high-frequency transactions
Necessity, or the lack of it, in a particular case
"Uniqueness" maintained or inferred by the application, not by the DB
In a normalized table, if every record needs to be unique and every field is referenced in other tables, then having a PK mainly adds index overhead, especially if the PK would never actually be used in any SQL query (IMHO I disagree with this, but it's possible). The table should still have a unique index encompassing all the fields, though.
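To illustrate that last point, a sketch of a table whose uniqueness is carried by a composite unique constraint rather than a surrogate PK; the table and columns are hypothetical:

create table sensor_readings
( sensor_id   number not null
, reading_ts  timestamp not null
, reading_val number not null
, constraint sensor_readings_uq unique (sensor_id, reading_ts) )
/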
Bad reasons are infinite :-)
The most frequent bad reason which is actually responsible for the lack of a primary key is when DBs are designed by application/code-developers with little or no DB experience, who want to (or think they should) handle all data constraints in the application.
Any valid reason? I'd say "No"--I'm a database guy--but there are places that insist on using the database as a dumb data store. They usually implement all integrity "constraints" in application code.
Putting integrity constraints into application code isn't usually done to improve performance. In fact, if you built one database that enforces all the known constraints, and you built another with functionally identical constraints only in application code, the first one would almost certainly run rings around the second one.
Instead, application-level constraints usually hope to increase flexibility. (And, in the process, some of the known constraints are usually dropped, which appears to improve performance.) If it becomes inconvenient to enforce certain constraints in order to bulk load some scruffy data, an application programmer can just side-step the application-level constraints for a little while, then clean up the data when it's more convenient.
I'm not a DB expert, but I remember a conversation with a friend who worked in the Oracle apps dept. who told me that this was done to handle emergencies. If there was a problem in some report being generated which you could fix by putting in a row, DB-level constraints often stand in your way. They generally implemented things like unique primary keys in the application rather than the database. It was inefficient, but it was enough for them and much more manageable in a disaster recovery scenario.
You need a primary key to enforce uniqueness for a subset of its columns (useful if you need to refer to individual rows). It also speeds up certain queries because of the index associated to it.
If you do not need that index, or that uniqueness constraint, then you may not need a primary key (the index does not come free).
An example that comes to mind is logging tables, which just record some data (that is never updated or queried for individual records).
There is a small overhead when inserting into a table with an index, and a primary key requires an index. The downside of going without one, of course, is that finding a particular row becomes very costly.

Removal of foreign key constraints, Referential integrity and Hibernate

My colleague mentioned that our client's DBA proposed the removal of all foreign key constraints in our project's Oracle DB schema. Initially I did not agree with the decision. I am a developer, not a DBA, so later I realized that there could be some reasons behind the decision. So I am trying to get the pros and cons of this decision.
Project info:
Spring application with Hibernate persistence.
Oracle 10g DB.
There are batch jobs that use only SQL*Loader or plain JDBC.
Here is my list of pros and cons (Please correct me if I am wrong)
Pros:
Since application persistence is managed by Hibernate, foreign key cascading is not necessary; it is handled by Hibernate with the appropriate cascade option.
Hibernate's DELETE action (including the delete cascade option) removes the foreign key table's records before removing the primary key record (i.e. to avoid referential integrity issues). This behavior is the same for the no-foreign-key, foreign-key and foreign-key-with-cascade cases, but adding a foreign key will unnecessarily slow down Oracle's delete operations.
Cons
Hibernate provides a mechanism for managing associations between objects and cascading operations within an association, but it never provides the complete referential integrity solution that the DB has.
Referential integrity is still required for the batch jobs that use only SQL*Loader or plain JDBC.
Guys, I need your advice on this. If any of you are a DBA, please provide the DBA-side reasons.
Thank you.
I have never heard such a proposal from a DBA before! From an application developer, yes, but never from a Database Administrator. It beggars belief.
Tom Kyte has said many times (for example here): applications come and go, but data is forever.
In my own experience, I have worked on Oracle databases that are 20+ years old. They started out in Oracle 6 and got migrated up to 10G or 11g over the years - the same data. But the applications that sat on top? First they were Forms 3.0, then in some cases they got migrated to C++, in some got re-built in Forms 6i, in some rebuilt in Application Express. ADF is another possibility of course; or perhaps a SOA architecture...
What's so special about the current application development tool that it suddenly takes over Oracle's job as the DBMS?
I've worked on databases in projects that decided to drop referential integrity constraints.
We had to write "QC script" to detect orphaned rows with respect to every table relationship (orphaned rows would have been prevented by a foreign key constraint).
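As a sketch of what such a QC check looks like (the table and column names here are illustrative), each relationship gets a query along these lines:

-- find child rows whose parent no longer exists
select lin.*
from order_lines lin
left join orders ord on ord.order_id = lin.order_id
where ord.order_id is null;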
Then when (not if) they occurred, we had to have policies for how to resolve the orphans. Choices included the following:
Delete orphaned rows.
Archive orphaned rows.
Update any orphaned foreign key values to NULL.
Update any orphaned foreign key values to some existing value in the parent table.
Live with the anomalies. Write more code to exclude orphans from reports. Maybe a set of VIEWs over all the tables?
You might want to schedule a recurring weekly meeting with the stakeholders of this database to review the QC script report, and decide what to do with each of the orphaned rows.
No framework can enforce referential integrity as reliably as constraints that run in the database. Only the database can provide truly atomic changes and ensure consistency.
Since database constraints are guaranteed they can, in some circumstances, allow additional optimizations.
For example, say you have a view
CREATE VIEW orders_vw AS
SELECT ord.order_id, ord.customer_id, lin.product_id
FROM orders ord JOIN order_lines lin on ord.order_id = lin.order_id
Then you have a query that does a SELECT product_id FROM orders_vw WHERE order_id = :val
With the integrity enforced, the database knows that any order_id in order_lines has exactly one row in the parent table and, since no values from the orders table are actually selected, it can save work by not visiting the orders table.
Without the constraint, the database can't be sure that an entry in order_lines has a parent, so it has to do the extra work of visiting the orders table to check it.
Depending on your query patterns, you may find removing constraints actually increases the workload on the DB.
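For context, a minimal sketch of tables such a view might sit on (the columns beyond those used in the view are assumptions); the enabled, validated foreign key on a NOT NULL column is what allows the optimizer to skip the join:

create table orders
( order_id    number not null
, customer_id number not null
, constraint orders_pk primary key (order_id) )
/
create table order_lines
( order_id   number not null
, line_no    number not null
, product_id number not null
, constraint order_lines_pk primary key (order_id, line_no)
, constraint order_lines_fk foreign key (order_id) references orders (order_id) )
/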
Usually, foreign key removal is where database performance optimization starts. It's a kind of trade-off: you give up guaranteed integrity at the DBMS level and have to manage it yourself (which is fairly easy with Hibernate but requires you to be very accurate in plain SQL), and in return you get faster data modifications, since foreign key checks are quite expensive.

Oracle Data Versioning/Partitioning Strategies/Best Practices

Not sure if the subject entirely conveys what I'm trying to achieve, but let me explain:
We are building an application that uses Oracle as the storage backend. Each year, last year's dataset will be "archived", and a new instance created and populated from scratch.
What are the options to do this within the same schema?
Keep version information on a record level (we presume this will be too slow for our use-case).
Keep version information on a table level, so for each new version, we will re-create all the tables but with a new version prefix. (We like this solution, since we can do it all in code).
?
Is there not something like partitions/personalities/namespaces available that will allow us to achieve this in Oracle?
My Oracle experience is rather limited, so any assistance will be greatly appreciated!
The RDBMS conceptual model is not very good at maintaining temporal versions of data. So it is not just Oracle which is lacking in this regard.
I am unclear why you think keeping version information at the record level will be too slow. Too slow in creating a new version? Or too slow when it comes to data retrieval during regular operations?
Here is how you could do it. Given a table CUSTOMERS with a business key of CUSTOMER_REF I might normally build it like this (I am using abbreviated syntax rather than best practice for reasons of space):
create table customers
( id number not null primary key
, customer_ref number not null unique
, name varchar2(30) not null )
/
The versioned equivalent would look like this:
create table customers
( id number not null primary key
, customer_ref number not null
, version_number number
, name varchar2(30) not null
, constraint whatever unique (customer_ref, version_number) )
/
This works by keeping VERSION_NUMBER null for the current version, and only populating it at archival time. Any lookup is going to have to include "and version_number is null". This will be a bit of a pain and you may need to include the column in any additional indexes you build.
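For example, a current-version lookup against that table would look something like this (the customer_ref value is just illustrative):

select id, customer_ref, name
from customers
where customer_ref = 12345
and version_number is null;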
Obviously maintaining all versions of the records in the same table will increase the size of your tables, which might have an effect on performance. Oracle's Partitioning option can definitely help here. It also would give you a neat way of creating next year's set of data. However, it is a chargeable extra on top of the Enterprise License, so it is an expensive option. Find out more..
The most time consuming aspect of this will be managing foreign key relationships in the new version of the table. Presuming you choose to use synthetic primary keys, the archival process will have to generate new IDs and then painstakingly cascade them to their dependent records in the new versions of referencing foreign keys.
Thinking about this makes discrete tables for each version seem very attractive. For ease of use I would keep the current version un-prefixed, so that archiving becomes simply a process of
create table customers_n as select * from customers;
You might want to avoid downtime while creating the versioned tables. In that case you could use materialized views to capture the tables' state during the run-up to the archival switchover. When the clock strikes twelve you can switch off the refresh. (caveat: this is thinking on the fly, I have never done anything like this so try before you buy.)
One pertinent advantage of multiple tables (and Partitioning) is that you can move the archived records to a READ ONLY tablespace. This not only preserves them from unwanted change, it also means you can exclude them from subsequent backups.
edit
I notice you have commented that the archived data can occasionally be amended. In that case moving it to READ ONLY tablespaces is not a goer.
The only thing I will add to what APC said is regarding your asking for "namespaces".
A namespace in Oracle is a schema, whereby you can have the same object name(s) in each schema.
Of course this all depends on how your app must access multiple versions, but I would lean towards a different schema for each year before I would use some sort of naming convention to maintain versions of tables in the same schema. The reason is, eventually you will have nightmares. At least with different schemas, all DDL can be the same, all references to objects will be the same, and tools like ER modellers and query tools will work within the context of that schema. Data models change, so at some point you may need to run some compare tools, and if all your tables are named oddly with some sort of version suffix, that won't work well.
Also, a schema can be copied/moved quickly with export or Data Pump using the fromuser/touser or remap_schema options, so you won't need much code, except to clean any of last year's data out of the new version.
I find schemas are very useful as "containers" and most apps I host only have schema level privileges, so I'm guaranteed the app can be easily and quickly moved from instance to instance, or multiple copies of the app can be hosted side-by-side on the same instance.
Might the schema change between years? For example, in 2010 you have fifteen columns but in 2011 you add a sixteenth.
If so, will the same application work on both 2010 and 2011 data?
If the schema is static, I'd go for a single table with a 'YEAR' column and use VPD/RLS/FGAC to apply a YEAR = '2010' predicate.
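A rough sketch of that approach, assuming a YEAR_COL column has been added to the table; all names are hypothetical and the predicate is hard-coded for brevity:

-- policy function returning the predicate to append to queries
create or replace function year_policy_fn
( p_schema in varchar2
, p_object in varchar2 )
return varchar2
as
begin
  return 'year_col = 2010';
end;
/
-- attach the policy so the predicate is applied automatically
begin
  dbms_rls.add_policy(
    object_schema   => 'APP',
    object_name     => 'CUSTOMERS',
    policy_name     => 'customers_year_policy',
    function_schema => 'APP',
    policy_function => 'YEAR_POLICY_FN',
    statement_types => 'SELECT' );
end;
/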
I'd only worry about partitioning if performance was a problem.
1) Interval-partition by year on some date field already in the row.
2) Or add such a column at the end of each table and populate it with a sequence and trigger,
3) then partition by yearly interval on that column.
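A sketch of the first option (this needs the Partitioning option and interval partitioning, i.e. 11g or later; the table and column names are hypothetical):

create table customers
( id           number not null primary key
, customer_ref number not null
, name         varchar2(30) not null
, created_date date not null )
partition by range (created_date)
interval (numtoyminterval(1, 'YEAR'))
( partition p_2010 values less than (date '2011-01-01') )
/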

Is there a performance hit by adding nonenforced foreign keys to a SQL Server 2008 database?

I'm working with a database and I want to start using LINQ To SQL with it. The database doesn't have any FKs inside of it right now for performance reasons. We are inserting millions of rows at a time to the DB which is why there aren't any FKs.
So I'm thinking I'm going to add nonenforced FKs to the database to describe the relationships between the tables for my LINQ To SQL but I don't want there to be a performance hit by adding nonenforced foreign keys.
Does anyone know what the effect of this might be?
Update: I'm using LINQ-To-SQL for the non-performance-intensive stuff. 80% of the data access is through stored procs on production. But for writing unit tests and other non-performance-critical tasks, LINQ-To-SQL makes data access really easy.
Update: Here is how you add a nonenforced FK
ALTER TABLE [dbo].[ACI] WITH NOCHECK ADD CONSTRAINT [FK_ACI_CustomerInformation] FOREIGN KEY([ACIOI])
REFERENCES [dbo].[CustomerInformation] ([ACI_OI])
NOT FOR REPLICATION
GO
ALTER TABLE [dbo].[ACI] NOCHECK CONSTRAINT [FK_ACI_CustomerInformation]
GO
The answer can be different for different environments (data/logs on same drive, tempdb on same drive, lots of cache vs little, etc) so the best way to find this out is to benchmark. Create two identical databases, one with fk's and one without. Do your normal million-row-load into each database, and measure your transactions per second. That way you'll know for sure in your own environment.
Foreign keys will create non-clustered indexes in your table, which will improve performance of joins on foreign keys.
Extra indexes will decrease the performance of your insert/update/delete/merge statements and will increase table sizes.
http://msdn.microsoft.com/en-us/library/ms191195.aspx
Even when created with NOT FOR REPLICATION the indexes are still present and SQL Server will need to maintain them.
In your case I would either:
- use foreign keys and take performance hit
or
- not use foreign keys in production (goodbye data integrity) and run my tests against a copy of production database for which I would create foreign keys.
It may have some impact, especially at those volumes.
However I would test this on a similar system first, so you can measure the impact, if any.
To be honest though, I would probably use hand written stored procedures for this, so you can optimize them as required, instead of using LINQ to SQL.
I realize this is an old question, but I want to comment on how bad a practice it is to create a FK that is not enforced on existing data. If in fact there is a need for a foreign key, you need to fix any bad data before adding the foreign key (which should have been added at design time), not try to ignore it. All you are doing is masking your very serious data integrity problem by refusing to notice it and do something about it. There is the occasional need to do this due to changed requirements, but it should not be considered as a first choice of techniques when adding a foreign key to a table that has data. Finding and fixing the bad data should be.
Data that has no relationship to the PK is useless. If I had an order table with a customer id that no longer existed in the customer table, how would I know who ordered the product? Of course this is why the FKs should have been enforced from the beginning, whether you do million-row inserts or not. I do multi-million row inserts through SSIS on a daily basis to many, many tables that have foreign keys; to use this as a reason for not setting them up in the first place indicates a lack of understanding of database design. Sacrificing your data integrity for speed is ALWAYS a poor idea. Without data integrity, your database is unreliable and therefore useless.
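In that spirit, here is a sketch of the checked alternative to the NOCHECK script in the question, reusing its table and column names; first find (and fix or remove) the offending rows, then add the key so it is enforced and trusted:

-- rows in ACI whose referenced customer no longer exists
SELECT a.*
FROM [dbo].[ACI] a
LEFT JOIN [dbo].[CustomerInformation] c ON c.[ACI_OI] = a.[ACIOI]
WHERE c.[ACI_OI] IS NULL AND a.[ACIOI] IS NOT NULL
GO
-- once the data is clean, add the constraint so it is checked and trusted
ALTER TABLE [dbo].[ACI] WITH CHECK ADD CONSTRAINT [FK_ACI_CustomerInformation] FOREIGN KEY([ACIOI])
REFERENCES [dbo].[CustomerInformation] ([ACI_OI])
GO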
