Core Data's Limits, can Core Data be used as a Serverside Technology?

Core Data's Limits, can Core Data be used as a Serverside Technology? - cocoa

I've found no clear answer so far, but maybe I've searched the wrong way.
My Question is, can Core Data to be used as a Persitence Storage for a Server Project? Where are Core Data's Limits, how much Data can be handled with Core Data and SQLite? SQLite should handle a lot of Data very well according to their website. I know of a properitary Java Persitence Manager with an Oracle DB as Storage that handles Millions of Entries and 3000 Clients without Problems. For my own Project I wonder if I can use Core Data on the Server Side for User Mangament and intern microblogging, texting with up to 5000 clients. Will it handle such big amounts of Data or do I have to manage something like that myself? Does anyone happend to have experience with huge amounts if Data and Core Data?
Thank you
twickl

I wouldn't advise using Core Data for a server side project. Core Data was designed to handle the data of individual, object-oriented applications therefore it lacks many of the common features of dedicated server software such as easily handling multiple simultaneous accesses.
Really, the only circumstance where I would advise using it is when the server side logic is very complex and the number of users small. For example, if you wanted to write an in house web app and have almost all the logic on the server, then Core Data might serve well.
Apple used to have WebObjects which was a package to manage servers using an object-oriented DB much like Core Data. (Core Data was inspired by a component of WebObjects called Enterprise Objects.) However, IIRC Apple no longer supports WebObjects for external use.
Your better off using one of the many dedicated server packages out there than trying to roll your own.

I have no experience using Core Data in the manner you describe, but my understanding of the architecture leads me to believe that it could be used, depending on how you plan to query and manipulate the data.
Core Data is very good at maintaining an object graph and using faults to bring parts into memory as needed. In that manner, it could be good on a server for reducing memory requirements even with a large data set.
Core Data is not very good at manipulating collections of objects without loading them into memory, making a change, and writing them back out to disk. Brent Simmons wrote a blog post about this, where he decide to stop using Core Data for some of his RSS reader's model objects because an operation like "mark all as read" didn't scale. While you would like to be able to say something like UPDATE articles SET status = 'read', Core Data must load each article, set its status property, then write it back to disk.
This isn't because Apple engineers are stupid, but because the query layer can't make assumptions about the storage layer (you could be using XML instead of SQLite) and it also must take into account cascading changes and the fact that some article objects may already be loaded into memory and will need to be updated there.
Note that you can also write your own storage providers for Core Data, see Aaron Hillegass's BNRPersistence project. So if Core Data was "mostly good" you might be able to improve on it for your application.
So, a possible answer to your question is that Core Data may be appropriate to your application, as long as you do not need to rely on batch updates to large number of objects. In general, no algorithm or data structure is appropriate for every scenario. Engineering is about wisely choosing between trade-offs. You won't find anything that works well for many clients in every case. It always matters what you are doing.

Related

What about performance in core data?

I would like to write an application based on core data, but I don't know if it's worth, there will be several tables of a few or several thousand data.
What is the situation at the front?

A few thousand records isn't that many in the grand scheme of things, and so is likely to be fine. Though without knowing exactly what you want to do with the data or what platform you're running it on, it's difficult to be sure.

The important thing to remember about Core Data is that it isn't primarily a persistence API i.e. one primarily concerned with getting data onto and off disk like SQL. It is primarily an API for creating the entire model layer for a Model-View-Controller (MVC) design app. As such, it provides a complete data management solution from persistence to object-graph management to integration with the UI.
Core Data is such a comprehensive solution that in Cocoa using bindings, it is possible to create entire apps without writing any custom code.
Any performance you might hypothetically lose in persistence operations with Core Data is almost always overshadowed by the performance gains of the object-graph management and UI integration.

Core Data cloud sync - need help with logic

I'm in the middle of brainstorming a cloud sync solution for a Core Data app that I am currently developing. I'm planning to open source the code for this once its done, for anyone to use with their Core Data apps, so input from the community on how this system should work is much appreciated :-) Here's what I'm thinking:
Server Side
Storage Provider
As with all cloud sync systems, storage is a major piece of the puzzle. There are many ways to handle this. I could set up my own server for storage, or use a service like Amazon S3, but because I'm starting out with $0 capital, at this moment, a paid storage solution isn't a viable option. After some thought, I decided to settle with Dropbox (an already well established cloud sync application and storage provider). The pros of using Dropbox are:
It's free (for a limited amount of space)
In addition to being a storage service, it also handles cloud sync
They recently released an Objective-C SDK which makes it much easier to interface with it in Mac and iPhone apps
In case I decide to switch to a different storage provider in the future, I intend to add "services" to this cloud sync framework, basically allowing anyone to create a service class to interface with their choice of storage provider, which can then simply be plugged into the framework.
Storage Structure
This is a really difficult part to figure out, so I need as much input as I can here. I've been thinking about a structure like this:
CloudSyncFramework
======> [app name]
==========> devices
=============> (device id)
================> deviceinfo
================> changeset
==========> entities
=============> (entity name)
================> (object id)
A quick explanation of this structure:
The master "CloudSyncFramework" (name undecided) folder will contain separate folders for each app that uses the framework
Each app folder contains a devices folder and an entities folder
The devices folder will contain a folder for each device that is registered with the account. The device folder will be named according to the device ID, obtained using something like [[UIDevice currentDevice] uniqueIdentifier] (on iOS) or a serial number (on Mac OS).
Each device folder contains two files: deviceinfo and changeset. deviceinfo contains information about the device (e.g. OS version, last sync date, model, etc.) and the changeset file contains information about objects that have changed since the device last synchronized. Both files will just be simple NSDictionaries archived into files using NSKeyedArchiver.
Each Core Data entity has a subfolder under the entities folder
Under each entity folder, every object that belongs to that entity will have a separate file. This file will contain a JSON dictionary with the key-value pairs.
Simultaneous Sync
This is one of the areas where I am almost completely clueless. How would I handle 2 devices connecting and syncing with the cloud at the same time? There seems to be a high risk of things getting out of sync here, or even data corruption.
Handling migrations
Once again, another clueless area here. How would I handle migrations of the Core Data managed object model? The easiest thing to do here seems to be just to wipe the cloud data store clean and upload a new copy of the data from a device which has undergone the migration process, but this seems somewhat risky, and there may be a better way.
Client Side
Converting NSManagedObjects into JSON
Converting attributes into JSON isn't a very hard task (theres lots of code for it floating around the web). Relationships are the key problem here. In this stackoverflow post, Marcus Zarra posts code in which the relationship objects themselves are added to the JSON dictionary. However, he mentions that this can cause an infinite loop depending on the structure of the model, and I'm not sure if this would work with my method, because I store each object as an individual file.
I've been trying to find a way to get an ID as a string for an NSManagedObject. Then I could save relationships in JSON as an array of IDs. The closest thing I found was [[managedObject objectID] URIRepresentation], but this isn't really an ID for an object, its more of a location for the object in the persistent store, and I don't know if its concrete enough to use as a reference for an object.
I suppose I could generate a UUID string for each object and save it as an attribute, but I'm open for suggestions.
Syncing changes to the cloud
The first (and still best) solution that popped into my head for this was to listen for the NSManagedObjectContextObjectsDidChangeNotification to get a list of changed objects, then update/delete/insert those objects in the cloud data store. After the changes have been saved, I would need to update the changeset file for every other registered device to reflect the newly changed objects.
One problem that comes up here is, how would I handle a failed or interrupted sync?. One idea I have is to first push changes to a temporary directory on the cloud, then once that has been confirmed as successful, to merge it with the master data on the cloud so that an interruption in the middle of the sync won't corrupt data. Then I would save records of the objects that need to be updated in the cloud into a plist file or something, to be pushed during the next time the app is connected to the internet.
Retrieving changed objects
This is fairly simple, the device downloads its changeset file, figures out which objects need to be updated/inserted/deleted, then acts accordingly.
And that sums up my thoughts for the logic that this system will use :-) Any insight, suggestions, answers to problems, etc. is greatly appreciated.
UPDATE
After lots of thinking, and reading TechZens suggestions, I have come up with some modifications to my concept.
The largest change I've thought up is to make each device have a separate data store in the cloud. Basically, every time the managed object context saves (thanks TechZen), it will upload the changes to that device's data store. After those changes are updated, it will create a "changeset" file with change details, and save it into the changeset folders of the OTHER devices that are using the application. When the other devices connect to sync, they will go through the changeset folder and apply each changeset to the local data store, then update their respective data stores in the cloud as well.
Now, if a new device is registered with the account, it will find the newest copy of the data out of all the devices and download that for use as its local storage. This solves the problem of simultaneous sync and reduces the chances for data corruption because there is no "central" data store, each devices touches only its data and just updates changes rather than every device accessing and modifying the same data at the same time.
There's some obvious conflict situations to deal with, mainly in relation to deleting objects. If a changeset is downloading instructing the app to delete an object that is currently being edited, etc. there needs to be ways to deal with this.

You want to look at this pessimistic take on cloud sync: Why Cloud Sync Will Never Work.
It covers a lot of the issues that you are wrestling with. Many of them are largely intractable.
It is very, very, very difficult to synchronize information period. Adding in different devices, different operating systems, different data structures, etc snowballs the complexity often fatally. People have been working on variants of this problem since the 70s and things really haven't improve much.
The fundamental problem is that if you leave the system flexible and customizable, then the complexity of synchronizing all the variations explodes exponentially as a function of the number of customization. If you make it rigid, you can sync but you are limited in what you can sync.
How would I handle 2 devices
connecting and syncing with the cloud
at the same time?
If you figure that out, you will be rich. It's a big issue for current cloud sync providers. They real problem here is that your not "syncing" your merging. Software sucks at merging because its very hard to establish a predefined rule set to describe all the possible merges.
The simplest system is to establish either a canonical device or a device hierarchy such that the system always knows which input to choose. This however, destroys flexibility.
How would I handle migrations of the
Core Data managed object model?
The migration of the Core Data model is largely irrelevant to the server. That's something that Core Data manages internally to itself. Model migration updates the model i.e. the entity graph, not the actual data.
Converting NSManagedObjects into JSON
Modeling relationships is hard especially with tools that don't support it as easily as Core Data does. However, the URI of a permanent managed object ID is supposed to serve as a UUID that nails the object down to a specific location in a specific store on a specific device. It's not technically guaranteed to be universally unique but its close enough for all practical purposes.
Syncing changes to the cloud
I think you're confusing implementation details of Core Data with the cloud itself. If you use NSManagedObjectContextObjectsDidChangeNotification you will evoke network traffic every time the observed context changes regardless of whether those changes are persisted or not. Depending on the app, this could drive connections thousands of times in a few minutes. Instead, you only want to sync when context is saved at the most.
One problem that comes up here is, how
would I handle a failed or interrupted
sync?
You don't commit changes until the sync completes. This is a big problem and leads to corrupt data. Again, you can have flexibility, complexity and fragility or inflexibility, simplicity and robustness.
Retrieving changed objects: This is
fairly simple, the device downloads
its changeset file, figures out which
objects need to be
updated/inserted/deleted, then acts
accordingly
It's only simple if you have an inflexible data structure. Describing changes to a flexible data structure is a nightmare.
Not sure if I have helped any. None of the problems have elegant solutions. Most designer end up with rigidity and/or slow, brute force iterative merging.

Take a serious look at RestKit.
It is an open source project that aims to help with integrating iOS apps with cloud data, including but not limited to the scenario where there is a core-data model for that data on the client.
I have recently started to use it in one of my projects, and found it to be quite useful. In the core-data scenario, you implement declarative mappings between your data model and the content you GET from and POST to the server, and it takes care of things like injecting objects from the cloud into your client model, posting new objects to the server and incorporating server-generated objects IDs into your client-side model, doing all of this in a background thread and taking care of all the core-data context threading issues and so on.
RestKit by no means is a mature product, but is has a fairly good foundation and quite a few things that can use help from other contributors. Especially, if your goal is to create an open source solution, it would be great to contribute and improve something like this rather than re-invent a new solution. Unless of course, your see serious differences between what you have in mind and other existing solutions :-)

Since this post was current, there are several new options available. It is possible to develop a solution, and there are apps shipping with these solutions.
Here is a short list of the main Core Data sync options:
Apple's native Core Data/iCloud sync. (Had a rocky start. Seems better now.)
TICDS
Wasabi Sync, a paid service.
Simperium (Seems abandoned.)
ParcelKit with Dropbox Datastore API
Ensembles, the most recent. (Disclosure: I am the founder of the project)

It's like Apple answered my question for me with the announcement of the iCloud SDKs, which come complete with Core Data integration. Win!

Performance problems with external data dependencies

I have an application that talks to several internal and external sources using SOAP, REST services or just using database stored procedures. Obviously, performance and stability is a major issue that I am dealing with. Even when the endpoints are performing at their best, for large sets of data, I easily see calls that take 10s of seconds.
So, I am trying to improve the performance of my application by prefetching the data and storing locally - so that at least the read operations are fast.
While my application is the major consumer and producer of data, some of the data can change from outside my application too that I have no control over. If I using caching, I would never know when to invalidate the cache when such data changes from outside my application.
So I think my only option is to have a job scheduler running that consistently updates the database. I could prioritize the users based on how often they login and use the application.
I am talking about 50 thousand users, and at least 10 endpoints that are terribly slow and can sometimes take a minute for a single call. Would something like Quartz give me the scale I need? And how would I get around the schedular becoming a single point of failure?
I am just looking for something that doesn't require high maintenance, and speeds at least some of the lesser complicated subsystems - if not most. Any suggestions?

This does sound like you might need a data warehouse. You would update the data warehouse from the various sources, on whatever schedule was necessary. However, all the read-only transactions would come from the data warehouse, and would not require immediate calls to the various external sources.
This assumes you don't need realtime access to the most up to date data. Even if you needed data accurate to within the past hour from a particular source, that only means you would need to update from that source every hour.
You haven't said what platforms you're using. If you were using SQL Server 2005 or later, I would recommend SQL Server Integration Services (SSIS) for updating the data warehouse. It's made for just this sort of thing.
Of course, depending on your platform choices, there may be alternatives that are more appropriate.
Here are some resources on SSIS and data warehouses. I know you've stated you will not be using Microsoft products. I include these links as a point of reference: these are the products I was talking about above.
SSIS Overview
Typical Uses of Integration Services
SSIS Documentation Portal
Best Practices for Data Warehousing with SQL Server 2008

Client-side caching in Rich Internet Applications

I'm starting to step into unfamiliar territory with regards to performance improvement and our RIA (Rich Internet Application) built with GWT. For those unfamiliar with GWT, essentially when deployed it's just pure JavaScript. We're interfacing with the server side using a REST-style XML web service via XMLHttpRequest.
Our XML is un-marshalled into JavaScript objects and used within the application to represent the data model behind the interface. When changes occur, the model is updated and marshalled back to XML and sent back to the server.
I've learned the number one rule of performance (in terms of user experience) is to make as few requests as possible. Obviously this brings up the possibility of caching. Caching is great for static data but things get tricky in a multi-user system where data on the server may be changing. Also, use of "Last-Modified" and "If-Modified-Since" requests don't quite do enough since we'd like to avoid unnecessary requests altogether.
I'm trying to figure out if caching data in the browser is even right for us before researching the approaches. I hope someone has tread this path before. I'm looking for similar approaches, lessons learned, things to avoid, etc.
I'm happy to provide more specific info if needed...

For GWT, if performance matters that much to you, you get better performance by sending all the data you need in a single request, instead of querying multiple small data. I would recommend against client-side data caching as there are lots of issues like keeping the data in sync with the database.
Besides, you already have a good advantage with GWT over traditional html apps. Unless you are dealing with special data (eg: does not become stale too quickly - implies mostly-read queries) I found out that there is no special need for caching. You are better off doing a service-layer caching, since most of the time should come of server-side processing.
If you can provide more details about the nature of the app, maybe some different conclusions can be taken.

Reporting vs. Coding - thoughts?

Recently I had a project in which I had to get some data from particular software system to a portlet. The software used a database, and I spent a fair bit of time modeling the data I wanted and then creating a web service so that my portlet could grab the information.
Then it suddenly struck me that I was wasting my time. I grabbed BIRT, tossed it into a portlet, and then just wrote some reports that directly grabbed the necessary data from the database. I was done in an afternoon.
I understand that reporting is a one way street, but this got me thinking. Reporting tools can be very effective for creating reports (duh) from your actual data, but when you're doing this you're bypassing your model which except in simple cases is not a direct representation of your data as it exists in your database.
If you're writing a data-intensive application and require the ability to perform non-trivial reporting, do you bypass your application and use something like BIRT or Crystal Reports? How do you manage these tools as part of your overall process? Do you consider the reports you write as being part of your application and treat them as such? A report is a view and a model and a controller (if you will) all in one big mess, how do you deal with and interpret and plan for that?
Revised question: it's possible and even common that a report will perform some business calculations that in a perfect world you would like to have contained in your application. This can lead to a mismatch of information given back to the user. On the other hand, reporting tools make it so easy to gather and display information that it's hard to take a purist's approach and do everything from within the application. Are there any good techniques for ensuring that the data in your reports matches the data that you might be showing in the regular GUI?

I see reporting as simply another view on the data, not a view/model/controller in one (well, maybe a view and controller in one).
We have our reports (built in sql 2008 reporting services) consume a service in our application layer to get data (keeping with our standard, that data access is in a repository). These functions could do a simple query or handle very complex processing that would be a nightmare in your reporting evironment or a stored procedure. In practice, we find this takes no longer than coding up some one-off stored procedure that will, as your system grows and grows, become a nightmare to maintain.
Treating reporting as simply a one-off or not integrating into your application design is a huge mistake.

Reporting is crucial. Reporting is mostly crucial to share values collected in one system to external users, e.g. users not directly using the system (eg management for sales figures). So reporting is a lot more than just displaying facts and figures and is something central to almost every system that drives a commercial.
At least the more advanced systems allow you to enhance them: with your own reusable "controls". Even a way back can be implemented - if you just use the correct plugins. Once I wrote a system to send emails out of a report, because the system did not allow for change. It worked - though it was not meant to be used that way ;)
Reports make a good part of the application, and you gain a lot freedom if you make reports changeable for your customers. Sometimes you come up with more possibilities than you thought of when you built the system in the first place.
So yes, for me reporting is part of the system.

Reports are part of your app but because they are generally something a user will have strong ideas about than, say, your data capture UI, I'd sacrifice purity for convenience/speed of delivery and get back to "real" coding... :-)
As soon as you've done a report, users want another one or change the colour or optional grouping or more filtering or... something that takes you away from whizzier stuff... so I don't bust a gut maintaining purity.

This is a fine line indeed. You don't want to spend too much time building reports (that users want you to change all the time anyway) but you don't want to duplicate logic by putting business logic into your reports! With our reporting products at Data Dynamimcs I think we have reached a happy medium between these two tradeoffs.
By using the ObjectDataProvider (see links below for more info) you can bind the report directly to business objects (plain old objects) so you don't have to bypass your business layer for getting data. At the same time we provide a way to reference and use functions from other libraries in your report. This way if you have some code configured already to do some business logic calculations you can reuse those functions directly within your report. You can see an example of this in the links below too.
Binding to Objects for your Data (see "Object Provider" section): http://www.datadynamics.com/Help/ddReports/ddrconDataSetAndObjectDataSource.html
Adding Custom Code to your reports Walkthrough: http://www.datadynamics.com/Help/ddReports/ddrwlkCustomCode.html
Using Custom Assemblies (referencing shared libraries/dlls from your report): http://www.datadynamics.com/Help/ddReports/ddrconCustomCode.html, and http://www.datadynamics.com/Help/ddReports/ddrtskCreatingAnInstanceMethod.html
Scott Willeke
Data Dynamics / GrapeCity

The way I've always worked with reports is to consider part reports as part of the code-base, and stored in the source along with the application. In some contexts, reports are more important than the application, in that management makes business decisions off of report data, having the wrong information can cause them to cancel a product line, cancel a campaign, or fire a sales person. Obviously, this depends highly on your management and your application.
Regarding keeping your model consistent, this is a bit trickier question. One way to ensure consistent model between reports and your application is to use stored procedures (or views) to retrieve data, depending on your application's architecture.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio