Does Core Data/SQLite compress redundant information? - cocoa

I want to use Core Data (probably with SQLite backing) to store a large database. Much of the string data will be the same between numerous rows. Does Core Data/SQLite see such redundancy, and automatically save space in the db files?
Do I need to make sure that the same text in different rows is the same string object before adding it to the db? If so, how do I detect that a new piece of text matches something anywhere in the existing db?

No, Core Data does not attempt to analyze your data to avoid duplication. If you want to save 10 million objects with the same attributes, you'll get 10 million copies.
If you want to avoid creating duplicate instances, you need to do a fetch for matching instances before creating a new one. The general approach is:
1. Fetch objects matching the new data, according to whatever standard indicates a duplicate for your app. Use a predicate with the fetch that contains the attribute(s) that you don't want to duplicate.
2. If you find anything, either (a) update the instances you find with any new values you have, or (b) if there are no new values, do nothing.
3. If you don't find anything, create a new instance.
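The fetch-then-create approach can be sketched outside Core Data as well. The following is a minimal illustration using Python's built-in sqlite3 module rather than the Core Data API; the person table, its columns, and the find_or_create helper are hypothetical names chosen for the example, and "same name" is assumed to be what counts as a duplicate.

```python
import sqlite3

def find_or_create(conn, name, city):
    """Return the id of the row for `name`, creating or updating it as needed."""
    # Fetch anything that matches the attribute we do not want duplicated.
    row = conn.execute(
        "SELECT id, city FROM person WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        # Nothing matched: create a new row.
        cur = conn.execute(
            "INSERT INTO person (name, city) VALUES (?, ?)", (name, city)
        )
        return cur.lastrowid
    row_id, old_city = row
    if old_city != city:
        # A match exists but has stale values: update it in place.
        conn.execute("UPDATE person SET city = ? WHERE id = ?", (city, row_id))
    # Otherwise there are no new values and nothing to do.
    return row_id

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
first = find_or_create(conn, "Ada", "London")
second = find_or_create(conn, "Ada", "Paris")  # same name: updates, no new row
count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```

In Core Data the same shape would use a fetch request with a predicate in place of the SELECT.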

Application-layer logic can help reduce space at the cost of application complexity.
Say your name field can contain either an integer or a string. (SQLite's weak typing makes this easy to do.)
If it holds a string, that's the name right there.
If it holds an integer, look it up in a name table, using the integer as the key.
Of course you have to create that name table, either on the fly as data is inserted, or with a once-in-a-while trawl through the data for new names that are worth surrogating in this way.
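As a rough sketch of that idea against raw SQLite (not through Core Data), the example below uses Python's sqlite3 module. The item and name_lookup tables and both helper functions are hypothetical; the name column is declared without a type so that SQLite keeps whatever type each value arrives with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `name` has no declared type, so each value keeps its own storage type.
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name)")
conn.execute("CREATE TABLE name_lookup (id INTEGER PRIMARY KEY, text TEXT UNIQUE)")

def store_name(conn, text, shared=False):
    """Insert an item; shared names are stored once and referenced by integer key."""
    if not shared:
        conn.execute("INSERT INTO item (name) VALUES (?)", (text,))
        return
    # Build the name table on the fly: add the text only if it is new.
    conn.execute("INSERT OR IGNORE INTO name_lookup (text) VALUES (?)", (text,))
    key = conn.execute(
        "SELECT id FROM name_lookup WHERE text = ?", (text,)
    ).fetchone()[0]
    conn.execute("INSERT INTO item (name) VALUES (?)", (key,))

def read_name(conn, item_id):
    """Return the name for an item, following the integer key if there is one."""
    value = conn.execute(
        "SELECT name FROM item WHERE id = ?", (item_id,)
    ).fetchone()[0]
    if isinstance(value, int):
        return conn.execute(
            "SELECT text FROM name_lookup WHERE id = ?", (value,)
        ).fetchone()[0]
    return value

store_name(conn, "a very common name", shared=True)
store_name(conn, "a very common name", shared=True)
store_name(conn, "a rare name")
```

The space saving comes from the two shared rows holding a small integer instead of the full text; the cost is the extra lookup in read_name.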

Related

How to identify if a schema in database (its structure/metadata) has changed or not

I need to identify whether a schema in the database has any change in metadata, such as changed table columns, changed procedure/package PL/SQL code, added/deleted triggers, etc. I've tried running expdp with content=metadata_only and calculating a checksum of the dump. But this doesn't work, because the checksum changes every time even though the database is unchanged. How can I identify whether a schema in the database (its structure) has changed or not? Do I have to export the plain-text metadata instead? Thx.
If you only need to know who did what when, use database auditing.
If you only need to know something might have changed, but don't care what and are okay with the possibility of the change not being significant, you can use the last_ddl_time from dba_objects and compare it to the last maximum value you got on the previous check. This can be done either at the schema or object level.
If you really do need to generate a delta and know for certain that something changed, you have two choices:
1. Construct data dictionary queries against all application dictionary views (lots of work, because there are a lot of views: columns, tables, partitions, subpartitions, indexes, index partitions, index subpartitions, lobs, lob partitions, etc.)
2. (Recommended) Use dbms_metadata to extract the DDL of the entire schema. See this answer for a query that will export almost every object you would likely care about.
Using either approach, you can then compare old/new strings or use a hash function (e.g. dbms_crypto.hash) to compute a hash value and compare that. I wrote a schema upgrade tool that does exactly this: it surgically identifies and upgrades individual objects that are different from some template source schema. I use dbms_metadata to look for diffs on the hash values. You will, however, need to set certain transforms to omit clauses you don't care about and that could have arbitrary changes, or mask them with regexp_replace after the fact (e.g. a sequence will contain the current value, which will always be different, and you don't want to see this as a change). It can be a bit of work.
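To illustrate the mask-then-hash comparison, here is a minimal sketch in Python, done client-side rather than with dbms_crypto and regexp_replace. The DDL strings are made-up stand-ins for dbms_metadata output, and the only volatile clause masked is the sequence's START WITH value; a real schema would likely need more patterns.

```python
import hashlib
import re

# Assumed volatile clause: a sequence's START WITH reflects its current value.
VOLATILE = re.compile(r"START WITH \d+", re.IGNORECASE)

def ddl_fingerprint(ddl: str) -> str:
    """Hash DDL text after collapsing whitespace and masking volatile clauses."""
    normalized = " ".join(ddl.split())
    masked = VOLATILE.sub("START WITH <masked>", normalized)
    return hashlib.sha256(masked.encode("utf-8")).hexdigest()

def changed_objects(old: dict, new: dict) -> set:
    """Names whose fingerprint differs, including objects added or dropped."""
    names = old.keys() | new.keys()
    return {name for name in names if old.get(name) != new.get(name)}

# Two snapshots of a hypothetical schema: object name -> fingerprint.
old = {
    "ORDER_SEQ": ddl_fingerprint('CREATE SEQUENCE "ORDER_SEQ" INCREMENT BY 1 START WITH 21 CACHE 20'),
    "ORDERS": ddl_fingerprint('CREATE TABLE "ORDERS" ("ID" NUMBER)'),
}
new = {
    "ORDER_SEQ": ddl_fingerprint('CREATE SEQUENCE "ORDER_SEQ" INCREMENT BY 1 START WITH 41 CACHE 20'),
    "ORDERS": ddl_fingerprint('CREATE TABLE "ORDERS" ("ID" NUMBER, "NOTE" VARCHAR2(100))'),
}
changed = changed_objects(old, new)  # the sequence is ignored, the table is not
```

Storing one fingerprint per object, rather than one for the whole schema, is what makes it possible to say which object changed.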

ADOX Rearrange Or Insert Columns Rather than Append them in Access Vb6, VB.Net or CSharp

I need to insert a field in the middle of current fields of a database table. I'm currently doing this in VB6 but may get the green light to do this in .net. Anyway I'm wondering since Access gives you the ability to "insert" fields in the table is there a way to do this in ADOX? If I had to I could step back and use DAO, but not sure how to do it there either.
If you're wondering why I want to do this: the application's database has changed over time, and I'm being asked to create an upgrade program for some of the installations with older versions.
Any help would be great.
This should not be necessary. Use the correct list of fields in your queries to retrieve them in the required order.
BUT, if you really need to do that, the only way I know is to create a new table with the fields in the required order, read the data from the old table into the new one, delete the old table and rename the new table as the old one.
I hear you: in Access the order of the fields is important.
If you need a comprehensive way to work with ADOX, your go-to place is Allen Browne's website. I have used it to go from novice to pro in handling Access database changes. Here it is: www.AllenBrowne.com. Go to Access Tips, then scroll down to ADOX Code.
That is also where I normally refer people with doubts about capabilities of Access as a database :)
In your case, you will juggle through creating a new table with the new field in the right position, copying data to the new table, applying properties to the fields, deleting original table, renaming the new table to the required (original) name.
That is the correct order. Do not apply field properties before copying the data. Some indexes and key properties may not be applied when the fields already have data.
Over time, I have automated this so I just run an application to detect and implement the required changes for me. But that took A LOT of work-weeks.
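The sequence of steps can be sketched as follows. This uses Python's sqlite3 module with SQLite standing in for Access, so it shows the order of operations only, not the ADOX or DAO calls; the customer table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Ada', '555-0100')")

# 1. Create a new table with the new field in the right position.
conn.execute(
    "CREATE TABLE customer_new (id INTEGER, name TEXT, email TEXT, phone TEXT)"
)
# 2. Copy the data, naming columns explicitly on both sides.
conn.execute(
    "INSERT INTO customer_new (id, name, phone) "
    "SELECT id, name, phone FROM customer"
)
# 3. Apply indexes and key properties only after the data is in place.
conn.execute("CREATE UNIQUE INDEX customer_id_idx ON customer_new (id)")
# 4. Delete the original table, then give the new one the original name.
conn.execute("DROP TABLE customer")
conn.execute("ALTER TABLE customer_new RENAME TO customer")
conn.commit()

columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
rows = conn.execute("SELECT * FROM customer").fetchall()
```

After this runs, email sits between name and phone, and existing rows carry a NULL in the new field.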

Best-performing method for associating arbitrary key/value pairs with a table row in a Postgres DB?

I have an otherwise perfectly relational data schema in place for my Postgres 8.4 DB, but I need the ability to associate arbitrary key/value pairs with several of my tables, with the assigned keys varying by row. Key/value pairs are user-generated, so I have no way of predicting them ahead of time or wrangling orderly schema changes.
I have the following requirements:
Key/value pairs will be read often, written occasionally. Reads must be reasonably fast.
No (present) need to query off of the keys or values. (But it might come in handy some day.)
I see the following possible solutions:
The Entity-Attribute-Value pattern/antipattern. Annoying, but the annoyance would be generally offset by my ORM.
Storing key/value pairs as serialized JSON data on a text column. A simple solution, and again the ORM comes in handy, but I can kiss my future self's need for queries good-bye.
Storing key/value pairs in some other NoSQL db--probably a key/value or document store. ORM is no help here. I'll have to manage the separate queries (and looming data integrity issues?) myself.
I'm concerned about query performance, as I hope to have a lot of these some day. I'm also concerned about programmer performance, as I have to build, maintain, and use the darned thing. Is there an obvious best approach here? Or something I've missed?
That's precisely what the hstore datatype is for in PostgreSQL.
http://www.postgresql.org/docs/current/static/hstore.html
It's really fast (you can index it) and quite easy to handle. The only drawback is that you can only store character data, but you'd have that problem with the other solutions as well.
Indexes support "exists" operator, so you can query quite quickly for rows where a certain key is present, or for rows where a specific attribute has a specific value.
And with 9.0 it got even better because some size restrictions were lifted.
hstore is generally a good solution for that, but personally I prefer to use plain key:value tables: one table with definitions, another table with values, a relation to bind values to definitions, and a relation to bind values to a particular record in the other table.
Why am I against hstore? Because it's like the registry pattern, which is often mentioned as an example of an anti-pattern. You can put anything there, it's hard to validate whether it's still needed, and when loading a whole row (in an ORM especially), the whole hstore is loaded, which can contain much junk and make very little sense. Not to mention that you need to convert the hstore data type into your language's type and convert it back again when saving. So you get some overhead from type conversion.
So I'm actually trying to convert all hstores at the company I work for into simple key:value tables. It's not that hard a task though, because the structures kept here in hstore are huge (or at least big), and reading/writing an object creates a huge overhead of function calls. As a result, a simple task like "select * from base_product where id = 1;" makes the server sweat and hits performance badly. I want to point out that the performance issue is not caused by the db, but by Python having to convert the results received from Postgres several times, while key:value tables do not require such conversion.
As you do not control the data, do not try to overcomplicate this.
create table sometable_attributes (
    sometable_id int not null references sometable(sometable_id),
    -- reject empty keys
    attribute_key varchar(50) not null check (length(attribute_key) > 0),
    attribute_value varchar(5000) not null,
    -- at most one value per key per row
    primary key (sometable_id, attribute_key)
);
This is like EAV, but without attribute_keys table, which has no added value if you do not control what will be there.
For speed you should periodically run "cluster sometable_attributes using sometable_attributes_idx" (where sometable_attributes_idx is an index that leads with sometable_id), so that all attributes for one row are physically close.
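As a rough illustration of how such a table is written to and read from, the sketch below recreates it with Python's sqlite3 module. SQLite stands in for Postgres here, so the column types and the INSERT OR REPLACE statement are SQLite-specific; set_attribute and get_attributes are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (sometable_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE sometable_attributes (
        sometable_id INTEGER NOT NULL REFERENCES sometable(sometable_id),
        attribute_key TEXT NOT NULL CHECK (length(attribute_key) > 0),
        attribute_value TEXT NOT NULL,
        PRIMARY KEY (sometable_id, attribute_key)
    )
""")
conn.execute("INSERT INTO sometable (sometable_id) VALUES (1)")

def set_attribute(conn, row_id, key, value):
    """Write one key/value pair; the composite key makes a repeat write overwrite."""
    conn.execute(
        "INSERT OR REPLACE INTO sometable_attributes VALUES (?, ?, ?)",
        (row_id, key, value),
    )

def get_attributes(conn, row_id):
    """Read every key/value pair for one row in a single query."""
    rows = conn.execute(
        "SELECT attribute_key, attribute_value FROM sometable_attributes "
        "WHERE sometable_id = ?",
        (row_id,),
    )
    return dict(rows.fetchall())

set_attribute(conn, 1, "color", "red")
set_attribute(conn, 1, "color", "blue")  # same key: replaces the old value
set_attribute(conn, 1, "size", "L")
attributes = get_attributes(conn, 1)
```

The common read path is a single lookup on the leading column of the primary key, which fits the "read often, written occasionally" requirement.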

Thread-safe unique entity instance in Core Data

I have a Message entity that has a messageID property. I'd like to ensure that there's only ever one instance of a Message entity with a given messageID. In SQL, I'd just add a unique constraint to the messageID column, but I don't know how to do this with Core Data. I don't believe it can be done in the data model itself, so how do you go about it?
My initial thought is to use a validation method to do a fetch on the NSManagedObject's context for the ID, see if it finds anything but itself, and if so, fails the validation. I suspect this will work - but I'm worried about the performance of something like that. I went through a lot of effort to minimize the fetch requests needed for the entire import routine, and having it validate by performing a fetch for every single new message entity seems a bit excessive. I can get all pre-existing objects I need and identify all the new objects I need to insert into the store using just two fetch queries before I do the actual work of importing and connecting everything together. This would add a fetch to every single update or insert in addition to those two - which would seem to eliminate any performance advantage I had by pre-processing the import data in the first place!
The main reason this is an issue is that the importer can (potentially) run several batches concurrently on several threads and may include some overlapping/duplicate data that needs to ultimately result in just one object in the store and not duplicate entries. Is there a reasonable way to do this and does what I'm asking for make sense for Core Data?
The only way to guarantee uniqueness is to do a fetch. Fortunately you can just do a -countForFetchRequest:error: and check to see if it is zero or not. That is the least expensive way to guarantee uniqueness at this time.
You can probably accomplish this in the validation or run it in the loop that is processing the data. Personally I would do it above the creation of the NSManagedObject so that you do not have the unnecessary allocs when a record already exists.
I don't think there is a way to easily guarantee an attribute is unique without doing a lot of work on your own. You can, of course use CFUUIDCreate to create a globally unique UUID, which should be unique, even in a multithreaded environment. But...
The objectID (type NSManagedObjectID) of all managed objects is guaranteed to be unique within the persistent store coordinator. Since you can add arbitrarily many persistent stores to the coordinator, this guarantee basically guarantees that the objectIDs are globally unique. Why don't you use the objectID as your messageID? You can't, of course, change the objectID once it's assigned (and it won't get assigned until the context containing the inserted object is saved; until then it will be a temporary but still unique ID).
So you have an NSManagedObjectContext for each thread, backed by the same persistent store, is that correct? And before you save the NSManagedObjectContext, you'd like to make sure the messageID is unique, that is, that you are not updating an existing row, and that it is not in one of the other contexts, correct?
Given that model (correct me if I misunderstand), I think you'd be better served having one object that manages access to the persistent store. That way, all threads would update one context and you can do your validation in there, using Marcus's -countForFetchRequest:error: suggestion. Granted, that places a bottleneck on this operation.
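The single-gatekeeper idea can be sketched in a language-neutral way. The Python example below is not Core Data code: MessageStore is a hypothetical class in which a dictionary stands in for the one shared context and a lock serializes the check-then-insert step.

```python
import threading

class MessageStore:
    """Single owner of the store; every thread goes through insert_if_new."""

    def __init__(self):
        self._lock = threading.Lock()
        self._messages = {}  # messageID -> payload, standing in for the shared context

    def insert_if_new(self, message_id, payload):
        # The existence check and the insert happen under one lock, so no
        # other thread can slip in between them.
        with self._lock:
            if message_id in self._messages:
                return False
            self._messages[message_id] = payload
            return True

    def count(self):
        with self._lock:
            return len(self._messages)

def import_batch(store, ids, results):
    """Import one batch and record how many messages were actually new."""
    results.append(sum(store.insert_if_new(i, f"body {i}") for i in ids))

store = MessageStore()
results = []
batches = [range(0, 60), range(40, 100), range(20, 80)]  # overlapping IDs
threads = [
    threading.Thread(target=import_batch, args=(store, batch, results))
    for batch in batches
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The lock is the bottleneck mentioned above: batches can be parsed concurrently, but they take turns at the point of insertion.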
Just to add my 2 cents: I think inconsistencies will occur sooner or later anyway, and the only way to mitigate them seems to be to do it at the application level with rather complex code.
So in my case I decided to allow duplicate values for what are supposed to be "unique" fields.
I added code, however, that detects these problems later (e.g. when a fetch that should return 1 object returns more than 1) and fixes them when they occur (usually by deleting).
It's a "go ahead, make a mistake, I'll fix it later for you" strategy.
This is not ideal, of course, but a valid way to attack this problem, imho.
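A detect-and-repair pass of this kind might look like the sketch below, written with Python's sqlite3 module rather than Core Data. The message table is hypothetical, and keeping the row with the lowest id is an assumption; the answer only says that duplicates are usually deleted.

```python
import sqlite3

def remove_duplicate_messages(conn):
    """Keep the oldest row for each message_id and delete the rest."""
    cur = conn.execute("""
        DELETE FROM message
        WHERE id NOT IN (SELECT MIN(id) FROM message GROUP BY message_id)
    """)
    return cur.rowcount  # number of duplicate rows removed

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (id INTEGER PRIMARY KEY, message_id TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO message (message_id, body) VALUES (?, ?)",
    [("a", "first"), ("a", "dup"), ("b", "only"), ("a", "dup again")],
)
removed = remove_duplicate_messages(conn)
remaining = conn.execute(
    "SELECT message_id, body FROM message ORDER BY id"
).fetchall()
```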

Referencing object's identity before submitting changes in LINQ

Is there a way of knowing the ID of the identity column of a record inserted via InsertOnSubmit beforehand, e.g. before calling the data context's SubmitChanges?
Imagine I'm populating some kind of hierarchy in the database, but I wouldn't want to submit changes on each recursive call of each child node (e.g. if I had Directories table and Files table and am recreating my filesystem structure in the database).
I'd like to do it this way: I create a Directory object, set its name and attributes, then InsertOnSubmit it into the DataContext.Directories collection, then reference Directory.ID in its child Files. Currently I need to call SubmitChanges to insert the directory into the database so that the database mapping fills in its ID column. But this creates a lot of transactions and accesses to the database, and I imagine that if I did this inserting in a batch, the performance would be better.
What I'd like to do is to somehow use Directory.ID before committing changes, create all my File and Directory objects in advance and then do one big submit that puts everything into the database. I'm also open to solving this problem via a stored procedure; I assume the performance would be even better if all operations were done directly in the database.
One way to get around this is to not use an identity column. Instead build an IdService that you can use in the code to get a new Id each time a Directory object is created.
You can implement the IdService by having a table that stores the last id used. When the service starts up have it grab that number. The service can then increment away while Directory objects are created and then update the table with the new last id used at the end of the run.
Alternatively, and a bit safer, when the service starts up have it grab the last id used and then update the last id used in the table by adding 1000 (for example). Then let it increment away. If it uses 1000 ids then have it grab the next 1000 and update the last id used table. Worst case is you waste some ids, but if you use a bigint you aren't ever going to care.
Since the Directory id is now controlled in code you can use it with child objects like Files prior to writing to the database.
Simply putting a lock around id acquisition makes this safe to use across multiple threads. I've been using this in a situation like yours. We're generating a ton of objects in memory across multiple threads and saving them in batches.
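A minimal sketch of such an IdService, in Python rather than C#, is shown below. The class and the two callbacks are hypothetical: load_last_id and save_last_id stand in for reading and updating the "last id used" table, and the lock only protects threads within one process. If several processes share the table, the read-and-update step would also need to be atomic in the database.

```python
import threading

class IdService:
    """Hands out ids from blocks reserved in a persistent 'last id used' store."""

    def __init__(self, load_last_id, save_last_id, block_size=1000):
        self._load = load_last_id    # hypothetical callback: read last id used
        self._save = save_last_id    # hypothetical callback: write last id used
        self._block_size = block_size
        self._lock = threading.Lock()
        self._next = 1
        self._limit = 0              # no block reserved yet

    def _reserve_block(self):
        # Claim a whole block up front, so a crash wastes ids but never reuses one.
        last = self._load()
        self._save(last + self._block_size)
        self._next = last + 1
        self._limit = last + self._block_size

    def next_id(self):
        with self._lock:
            if self._next > self._limit:
                self._reserve_block()
            value = self._next
            self._next += 1
            return value

# A dict stands in for the table that stores the last id used.
table = {"last_id": 0}
service = IdService(
    load_last_id=lambda: table["last_id"],
    save_last_id=lambda value: table.update(last_id=value),
    block_size=3,
)
ids = [service.next_id() for _ in range(7)]
```

Because the parent's id is known as soon as the object is created, child objects can reference it before anything is written to the database.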
This blog post will give you a good start on saving batches in Linq to SQL.
Not sure off the top of my head if there is a way to run a straight SQL query in LINQ, but this query will return the current identity value of the specified table.
USE [database];
GO
DBCC CHECKIDENT ("schema.table", NORESEED);
GO
