ModifiedNodeDoesNotExistException ODL while writing statistic to operational datastore - opendaylight

Scenario is application is writing data directly to device and after 4 seconds they are writing to config datastore. Now in the time gap there is statistic collection triggered which will collect the data written and will write it to operational datastore.
My question is whether the data should be present in config datastore before statistic is collected or before the same data is being written to operational datastore.

You are not saying which ODL release you are using, nor (more importantly) which application features you have installed are using, but when you write "statistic collection triggered" that sounds like openflowplugin project related? I wonder if https://git.opendaylight.org/gerrit/#/c/66207/ which was proposed just today (not yet reviewed and merged) has anything to do with your problem, and may fix it... If not, you would need to provide more details about you are writing to - as far as I understand, the ModifiedNodeDoesNotExistException basically just means something like that what you want to write to meanwhile was concurrently already removed.

Related

Difference between Apache NiFi and StreamSets

I am planning to do a class project and was going through few technologies where I can automate or set the flow of data between systems and found that there are couple of them i.e. Apache NiFi and StreamSets ( to my knowledge ). What I couldn't understand is the difference between them and use-cases where they can be used? I am new to this and if anyone can explain me a bit would be highly appreciated. Thanks
Suraj,
Great question.
My response is as a member of the open source Apache NiFi project management committee and as someone who is passionate about the dataflow management domain.
I've been involved in the NiFi project since it was started in 2006. My knowledge of Streamsets is relatively limited so I'll let them speak for it as they have.
The key thing to understand is that NiFi was built to do one really important thing really well and that is 'Dataflow Management'. It's design is based on a concept called Flow Based Programming which you may want to read about and reference for your project 'https://en.wikipedia.org/wiki/Flow-based_programming'
There are already many systems which produce data such as sensors and others. There are many systems which focus on data processing like Apache Storm, Spark, Flink, and others. And finally there are many systems which store data like HDFS, relational databases, and so on. NiFi purely focuses on the task of connecting those systems and providing the user experience and core functions necessary to do that well.
What are some of those key functions and design choices made to make that effective:
1) Interactive command and control
The job of someone trying to connect systems is to be able to rapidly and efficiently interact with the constant streams of data they see. NiFi's UI allows you do just that as the data is flowing you can add features to operate on it, fork off copies of data to try new approaches, adjust current settings, see recent and historical stats, helpful in-line documentation and more. Almost all other systems by comparison have a model that is design and deploy oriented meaning you make a series of changes and then deploy them. That model is fine and can be intuitive but for the dataflow management job it means you don't get the interactive change by change feedback that is so vital to quickly build new flows or to safely and efficiently correct or improve handling of existing data streams.
2) Data Provenance
A very unique capability of NiFi is its ability to generate fine grained and powerful traceability details for where your data comes from, what is done to it, where its sent and when it is done in the flow. This is essential to effective dataflow management for a number of reasons but for someone in the early exploration phases and working a project the most important thing this gives you is awesome debugging flexibility. You can setup your flows and let things run and then use provenance to actually prove that it did exactly what you wanted. If something didn't happen as you expected you can fix the flow and replay the object then repeat. Really helpful.
3) Purpose built data repositories
NiFi's out of the box experience offers very powerful performance even on really modest hardware or virtual environments. This is because of the flowfile and content repository design which gives us the high performance but transactional semantics we want as data works its way through the flow. The flowfile repository is a simple write ahead log implementation and the content repository provides an immutable versioned content store. That in turn means we can 'copy' data by only ever adding a new pointer (not actually copying bytes) or we can transform data by simply reading from the original and writing out a new version. Again very efficient. Couple that with the provenance stuff I mentioned a moment ago and it just provides a really powerful platform. Another really key thing to understand here is that in the business of connecting systems you don't always get to dictate things like size of data involved. The NiFi API was built to honor that fact and so our API lets processors do things like receive, transform, and send data without ever having to load the full objects in memory. These repositories also mean that in most flows the majority of processors do not even touch the content at all. However, you can easily see from the NiFi UI precisely how many bytes are actually being read or written so again you get really helpful information in establishing and observing your flows. This design also means NiFi can support back-pressure and pressure-release naturally and these are really critical features for a dataflow management system.
It was mentioned previously by the folks from the Streamsets company that NiFi is file oriented. I'm not really sure what the difference is between a file or a record or a tuple or an object or a message in generic terms but the reality is when data is in the flow then it is 'a thing that needs to be managed and delivered'. That is what NiFi does. Whether you have lots of really high speed tiny things or you have large things and whether they came from a live audio stream off the Internet or they come from a file sitting on your harddrive it doesn't matter. Once it is in the flow it is time to manage and deliver it. That is what NiFi does.
It was also mentioned by the Streamsets company that NiFi is schemaless. It is accurate that NiFi does not force conversion of data from whatever it is originally to some special NiFi format nor do we have to reconvert it back to some format for follow-on delivery. It would be pretty unfortunate if we did that because what this means is that even the most trivial of cases would have problematic performance implications and luckily NiFi does not have that problem. Further had we gone that route then it would mean handling diverse datasets like media (images, video, audio, and more) would be difficult but we're on the right track and NiFi is used for things like that all the time.
Finally, as you continue with your project and if you find there are things you'd like to see improved or that you'd like to contribute code we'd love to have your help. From https://nifi.apache.org you can quickly find information on how to file tickets, submit patches, email the mailing list, and more.
Here are a couple of fun recent NiFi projects to checkout:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer
https://twitter.com/KayLerch/status/721455415456882689
Good luck on the class project! If you have any questions the users#nifi.apache.org mailing list would love to help.
Thanks
Joe
Both Apache NiFi and StreamSets Data Collector are Apache-licensed open source tools.
Hortonworks does have a commercially supported variant called Hortonworks DataFlow (HDF).
While both have a lot of similarities such as a web-based ui, both are used for ingesting data there are a few key differences. They also both consist of a processors linked together to perform transformations, serialization, etc.
NiFi processors are file-oriented and schemaless. This means that a piece of data is represented by a FlowFile (this could be an actual file on disk, or some blob of data acquired elsewhere). Each processor is responsible for understanding the content of the data in order to operate on it. Thus if one processor understands format A and another only understands format B, you may need to perform a data format conversion in between those two processors.
NiFi can be run standalone, or as a cluster using its own built-in clustering system.
StreamSets Data Collector (SDC) however, takes a record based approach. What this means is that as data enters your pipeline it (whether its JSON, CSV, etc) it is parsed into a common format so that the responsibility of understanding the data format is no longer placed on each individual processor and any processor can be connected to any other processor.
SDC also runs standalone, and also a clustered mode, but it runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have.
NiFi has been around for about the last 10 years (but less than 2 years in the open source community).
StreamSets was released to the open source community a little bit later in 2015. It is vendor agnostic, and as far as Hadoop goes Hortonworks, Cloudera, and MapR are all supported.
Full Disclosure: I am an engineer who works on StreamSets.
They are very similar for data ingest scenarios.
Apache NIFI(HDP) is more mature and StreamSets is more lightweight.
Both are easy to use, both have strong capability. And StreamSets could easily
They have companies behind, Hortonworks and Cloudera.
Obviously there are more contributors working on NIFI than StreamSets, of course, NIFI have more enterprise deployments in production.
Two of the key differentiators between the two IMHO are.
Apache NiFi is a Top Level Apache project, meaning it has gone through the incubation process described here, http://incubator.apache.org/policy/process.html, and can accept contributions from developers around the world who follow the standard Apache process which ensures software quality. StreamSets, is Apache LICENSED, meaning anyone can reuse the code, etc. But the project is not managed as an Apache project. In fact, in order to even contribute to Streamsets, you are REQUIRED to sign a contract. https://streamsets.com/contributing/ . Contrast this with the Apache NiFi contributor guide, which wasn't written by a lawyer. https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-HowtocontributetoApacheNiFi
StreamSets "runs atop Spark on YARN/Mesos instead, leveraging existing cluster resources you may have." which imposes a bit of restriction if you want to deploy your dataflows further toward the Edge where the Devices that are generating the data live. Apache MiniFi, a sub-project of NiFi can run on a single Raspberry Pi, while I am fairly confident that StreamSets cannot, as YARN or Mesos require more resources than a Raspberry Pi provides.
Disclosure: I am a Hortonworks employee

Realtime one-way mirroring of a SQLite database

I am dealing with a 3rd party application that's running a SQLite 3 database with WAL (Write-Ahead Logging) on a local computer, and I'm looking to mirror that database (read only, this is a one-way mirroring) to another system. The challenge is that I'm running in a separate process, which seems to complicate things somewhat.
The database is being created and opened with a normal locking mode so there's no problem reading it from another process, but I'm trying to either find an existing implementation or get some pointers on where to get started. My understanding, based on other posts is that the standard sqlite update hooks (such as sqlite3_update_hook) will not work out of process.
A key issue is speed, I'd like to ideally be able to detect each update as soon as it happens and begin transmitting it. This means that most polling options would be out of the question, but even if they were, how would you detect the most recent changes?
I'm seeing two files that look promising: the actual WAL file (foo.db-wal), and that memory mapped index file (foo.db-shm). I'm hoping that those two contain the information I need to: A. Detect when changes occur in the database and B. Be able to grab just the incremental changes since the last update.
But a pointer to some existing solution would be much preferred... :-)
SymmetricDS might be the solution for you

Can/Should I disable the cache expiry when backing data store is unavailable?

I'm just started out with Ehcache, and it seems pretty good so far. I'm using it in a simplistic fashion to speed up reads against a database, but I wonder whether I can also use it to let the application stay up if the database is unavailable for short periods. (Update - my context is a application with high-availability modules that only read from the database)
It seems like I could do that by disabling expiry in the event of a database read problem, and re-enabling it when a read works again.
What do you think? Is that a reasonable approach or have I missed something? If it's a fair approach, any tips for how best to implement appreciated.
Update - ehcache supports a dynamically configurable option to un/set the cache to 'eternal'. This seems to do what I need.
Interesting question - usually, the answer would be "it depends".
Firstly, if you have database reliability problems, I'd invest time and energy in fixing them, rather than applying a bandaid solution.
Secondly, most applications need both reading and writing to work - it doesn't seem to make sense to keep your app up for reads only.
However, if your app has a genuine "read only" function, and there's a known and controlled reason for database down time (e.g. backups), then yes, you can use your cache to keep the application up and running while the database is down. I would do this by extending the cache periods, rather than trying to code specific edge cases. For instance, you might have a background process which checks whether the database is available and swaps in a different configuration file when there's trouble.

Core Data's Limits, can Core Data be used as a Serverside Technology?

I've found no clear answer so far, but maybe I've searched the wrong way.
My Question is, can Core Data to be used as a Persitence Storage for a Server Project? Where are Core Data's Limits, how much Data can be handled with Core Data and SQLite? SQLite should handle a lot of Data very well according to their website. I know of a properitary Java Persitence Manager with an Oracle DB as Storage that handles Millions of Entries and 3000 Clients without Problems. For my own Project I wonder if I can use Core Data on the Server Side for User Mangament and intern microblogging, texting with up to 5000 clients. Will it handle such big amounts of Data or do I have to manage something like that myself? Does anyone happend to have experience with huge amounts if Data and Core Data?
Thank you
twickl
I wouldn't advise using Core Data for a server side project. Core Data was designed to handle the data of individual, object-oriented applications therefore it lacks many of the common features of dedicated server software such as easily handling multiple simultaneous accesses.
Really, the only circumstance where I would advise using it is when the server side logic is very complex and the number of users small. For example, if you wanted to write an in house web app and have almost all the logic on the server, then Core Data might serve well.
Apple used to have WebObjects which was a package to manage servers using an object-oriented DB much like Core Data. (Core Data was inspired by a component of WebObjects called Enterprise Objects.) However, IIRC Apple no longer supports WebObjects for external use.
Your better off using one of the many dedicated server packages out there than trying to roll your own.
I have no experience using Core Data in the manner you describe, but my understanding of the architecture leads me to believe that it could be used, depending on how you plan to query and manipulate the data.
Core Data is very good at maintaining an object graph and using faults to bring parts into memory as needed. In that manner, it could be good on a server for reducing memory requirements even with a large data set.
Core Data is not very good at manipulating collections of objects without loading them into memory, making a change, and writing them back out to disk. Brent Simmons wrote a blog post about this, where he decide to stop using Core Data for some of his RSS reader's model objects because an operation like "mark all as read" didn't scale. While you would like to be able to say something like UPDATE articles SET status = 'read', Core Data must load each article, set its status property, then write it back to disk.
This isn't because Apple engineers are stupid, but because the query layer can't make assumptions about the storage layer (you could be using XML instead of SQLite) and it also must take into account cascading changes and the fact that some article objects may already be loaded into memory and will need to be updated there.
Note that you can also write your own storage providers for Core Data, see Aaron Hillegass's BNRPersistence project. So if Core Data was "mostly good" you might be able to improve on it for your application.
So, a possible answer to your question is that Core Data may be appropriate to your application, as long as you do not need to rely on batch updates to large number of objects. In general, no algorithm or data structure is appropriate for every scenario. Engineering is about wisely choosing between trade-offs. You won't find anything that works well for many clients in every case. It always matters what you are doing.

Core Data cloud sync - need help with logic

I'm in the middle of brainstorming a cloud sync solution for a Core Data app that I am currently developing. I'm planning to open source the code for this once its done, for anyone to use with their Core Data apps, so input from the community on how this system should work is much appreciated :-) Here's what I'm thinking:
Server Side
Storage Provider
As with all cloud sync systems, storage is a major piece of the puzzle. There are many ways to handle this. I could set up my own server for storage, or use a service like Amazon S3, but because I'm starting out with $0 capital, at this moment, a paid storage solution isn't a viable option. After some thought, I decided to settle with Dropbox (an already well established cloud sync application and storage provider). The pros of using Dropbox are:
It's free (for a limited amount of space)
In addition to being a storage service, it also handles cloud sync
They recently released an Objective-C SDK which makes it much easier to interface with it in Mac and iPhone apps
In case I decide to switch to a different storage provider in the future, I intend to add "services" to this cloud sync framework, basically allowing anyone to create a service class to interface with their choice of storage provider, which can then simply be plugged into the framework.
Storage Structure
This is a really difficult part to figure out, so I need as much input as I can here. I've been thinking about a structure like this:
CloudSyncFramework
======> [app name]
==========> devices
=============> (device id)
================> deviceinfo
================> changeset
==========> entities
=============> (entity name)
================> (object id)
A quick explanation of this structure:
The master "CloudSyncFramework" (name undecided) folder will contain separate folders for each app that uses the framework
Each app folder contains a devices folder and an entities folder
The devices folder will contain a folder for each device that is registered with the account. The device folder will be named according to the device ID, obtained using something like [[UIDevice currentDevice] uniqueIdentifier] (on iOS) or a serial number (on Mac OS).
Each device folder contains two files: deviceinfo and changeset. deviceinfo contains information about the device (e.g. OS version, last sync date, model, etc.) and the changeset file contains information about objects that have changed since the device last synchronized. Both files will just be simple NSDictionaries archived into files using NSKeyedArchiver.
Each Core Data entity has a subfolder under the entities folder
Under each entity folder, every object that belongs to that entity will have a separate file. This file will contain a JSON dictionary with the key-value pairs.
Simultaneous Sync
This is one of the areas where I am almost completely clueless. How would I handle 2 devices connecting and syncing with the cloud at the same time? There seems to be a high risk of things getting out of sync here, or even data corruption.
Handling migrations
Once again, another clueless area here. How would I handle migrations of the Core Data managed object model? The easiest thing to do here seems to be just to wipe the cloud data store clean and upload a new copy of the data from a device which has undergone the migration process, but this seems somewhat risky, and there may be a better way.
Client Side
Converting NSManagedObjects into JSON
Converting attributes into JSON isn't a very hard task (theres lots of code for it floating around the web). Relationships are the key problem here. In this stackoverflow post, Marcus Zarra posts code in which the relationship objects themselves are added to the JSON dictionary. However, he mentions that this can cause an infinite loop depending on the structure of the model, and I'm not sure if this would work with my method, because I store each object as an individual file.
I've been trying to find a way to get an ID as a string for an NSManagedObject. Then I could save relationships in JSON as an array of IDs. The closest thing I found was [[managedObject objectID] URIRepresentation], but this isn't really an ID for an object, its more of a location for the object in the persistent store, and I don't know if its concrete enough to use as a reference for an object.
I suppose I could generate a UUID string for each object and save it as an attribute, but I'm open for suggestions.
Syncing changes to the cloud
The first (and still best) solution that popped into my head for this was to listen for the NSManagedObjectContextObjectsDidChangeNotification to get a list of changed objects, then update/delete/insert those objects in the cloud data store. After the changes have been saved, I would need to update the changeset file for every other registered device to reflect the newly changed objects.
One problem that comes up here is, how would I handle a failed or interrupted sync?. One idea I have is to first push changes to a temporary directory on the cloud, then once that has been confirmed as successful, to merge it with the master data on the cloud so that an interruption in the middle of the sync won't corrupt data. Then I would save records of the objects that need to be updated in the cloud into a plist file or something, to be pushed during the next time the app is connected to the internet.
Retrieving changed objects
This is fairly simple, the device downloads its changeset file, figures out which objects need to be updated/inserted/deleted, then acts accordingly.
And that sums up my thoughts for the logic that this system will use :-) Any insight, suggestions, answers to problems, etc. is greatly appreciated.
UPDATE
After lots of thinking, and reading TechZens suggestions, I have come up with some modifications to my concept.
The largest change I've thought up is to make each device have a separate data store in the cloud. Basically, every time the managed object context saves (thanks TechZen), it will upload the changes to that device's data store. After those changes are updated, it will create a "changeset" file with change details, and save it into the changeset folders of the OTHER devices that are using the application. When the other devices connect to sync, they will go through the changeset folder and apply each changeset to the local data store, then update their respective data stores in the cloud as well.
Now, if a new device is registered with the account, it will find the newest copy of the data out of all the devices and download that for use as its local storage. This solves the problem of simultaneous sync and reduces the chances for data corruption because there is no "central" data store, each devices touches only its data and just updates changes rather than every device accessing and modifying the same data at the same time.
There's some obvious conflict situations to deal with, mainly in relation to deleting objects. If a changeset is downloading instructing the app to delete an object that is currently being edited, etc. there needs to be ways to deal with this.
You want to look at this pessimistic take on cloud sync: Why Cloud Sync Will Never Work.
It covers a lot of the issues that you are wrestling with. Many of them are largely intractable.
It is very, very, very difficult to synchronize information period. Adding in different devices, different operating systems, different data structures, etc snowballs the complexity often fatally. People have been working on variants of this problem since the 70s and things really haven't improve much.
The fundamental problem is that if you leave the system flexible and customizable, then the complexity of synchronizing all the variations explodes exponentially as a function of the number of customization. If you make it rigid, you can sync but you are limited in what you can sync.
How would I handle 2 devices
connecting and syncing with the cloud
at the same time?
If you figure that out, you will be rich. It's a big issue for current cloud sync providers. They real problem here is that your not "syncing" your merging. Software sucks at merging because its very hard to establish a predefined rule set to describe all the possible merges.
The simplest system is to establish either a canonical device or a device hierarchy such that the system always knows which input to choose. This however, destroys flexibility.
How would I handle migrations of the
Core Data managed object model?
The migration of the Core Data model is largely irrelevant to the server. That's something that Core Data manages internally to itself. Model migration updates the model i.e. the entity graph, not the actual data.
Converting NSManagedObjects into JSON
Modeling relationships is hard especially with tools that don't support it as easily as Core Data does. However, the URI of a permanent managed object ID is supposed to serve as a UUID that nails the object down to a specific location in a specific store on a specific device. It's not technically guaranteed to be universally unique but its close enough for all practical purposes.
Syncing changes to the cloud
I think you're confusing implementation details of Core Data with the cloud itself. If you use NSManagedObjectContextObjectsDidChangeNotification you will evoke network traffic every time the observed context changes regardless of whether those changes are persisted or not. Depending on the app, this could drive connections thousands of times in a few minutes. Instead, you only want to sync when context is saved at the most.
One problem that comes up here is, how
would I handle a failed or interrupted
sync?
You don't commit changes until the sync completes. This is a big problem and leads to corrupt data. Again, you can have flexibility, complexity and fragility or inflexibility, simplicity and robustness.
Retrieving changed objects: This is
fairly simple, the device downloads
its changeset file, figures out which
objects need to be
updated/inserted/deleted, then acts
accordingly
It's only simple if you have an inflexible data structure. Describing changes to a flexible data structure is a nightmare.
Not sure if I have helped any. None of the problems have elegant solutions. Most designer end up with rigidity and/or slow, brute force iterative merging.
Take a serious look at RestKit.
It is an open source project that aims to help with integrating iOS apps with cloud data, including but not limited to the scenario where there is a core-data model for that data on the client.
I have recently started to use it in one of my projects, and found it to be quite useful. In the core-data scenario, you implement declarative mappings between your data model and the content you GET from and POST to the server, and it takes care of things like injecting objects from the cloud into your client model, posting new objects to the server and incorporating server-generated objects IDs into your client-side model, doing all of this in a background thread and taking care of all the core-data context threading issues and so on.
RestKit by no means is a mature product, but is has a fairly good foundation and quite a few things that can use help from other contributors. Especially, if your goal is to create an open source solution, it would be great to contribute and improve something like this rather than re-invent a new solution. Unless of course, your see serious differences between what you have in mind and other existing solutions :-)
Since this post was current, there are several new options available. It is possible to develop a solution, and there are apps shipping with these solutions.
Here is a short list of the main Core Data sync options:
Apple's native Core Data/iCloud sync. (Had a rocky start. Seems better now.)
TICDS
Wasabi Sync, a paid service.
Simperium (Seems abandoned.)
ParcelKit with Dropbox Datastore API
Ensembles, the most recent. (Disclosure: I am the founder of the project)
It's like Apple answered my question for me with the announcement of the iCloud SDKs, which come complete with Core Data integration. Win!

Resources