Core Data or SQLite for fast search? - macOS

This is a description of the application I want to build and I'm not sure whether to use Core Data or Sqlite (or something else?):
Single user, desktop, not networked, only one frontend accessing the data storage
User occasionally enters some data; no bulk data importing or large inserts
Simple data model: an entity with up to 20-30 attributes
User searches in the data (about 50k records max.)
Search takes place mostly in attribute values; I'm not looking up keys, but searching for text within the values
Writing the data is nothing I see as critical: it happens infrequently and with small amounts of data. The text search in the attributes has to be blazingly fast; a user would expect almost instant results. This is absolutely critical.
I would rather go with Core Data, but is this a scenario CD can handle?
Thanks
-Fish

Core Data can handle this scenario. But because you're looking for blazingly fast full text search, you'll have to do some extra work. Session 211 of WWDC 2013 goes into depth about how to do this (slides 117-131). You'll probably want to have a separate Entity with text search tokens: all of the findable words in your dataset.
Although one of the FTS extensions is available in Apple's deployment of SQLite, it's not exposed in Core Data.

Related

How to validate a BLOB object in Oracle

I have BLOB data (PDF file attachments) in a table.
For us, it's too expensive to write Java or other code to read the BLOB for validation.
Is there any shortcut / easy / less expensive way to validate my BLOB? Any command(s) to read metadata and validate the BLOB?
I would like to check whether the BLOB object is corrupted or not.
That's not something you should do in the database. A BLOB is a binary file which is interpreted by the appropriate client software (Adobe Reader, MS Word, whatever). As far as the database is concerned it's a black box. So your application ought to validate the file before it uploads it into the database.
However, there is a workaround. You can build an Oracle Text CONTEXT index on your BLOB column. CONTEXT is really designed for free-text searching of documents, but building the index is a way to prove that the uploaded file is readable.
The snag with CONTEXT indexes is that they aren't transactional: normally there's a background job which indexes new documents, but for this purpose you would probably want to call CTX_DDL.SYNC_INDEX() as part of the upload, to give the user timely feedback. The Oracle Text documentation has the details.
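As a rough sketch of what that could look like, assuming a hypothetical attachments table holding the uploaded PDFs (all names below are made up, and you need the usual privileges on CTX_DDL):

CREATE TABLE attachments (
  id        NUMBER PRIMARY KEY,
  file_name VARCHAR2(255),
  doc       BLOB
);

-- Oracle Text CONTEXT index on the BLOB column
CREATE INDEX attachments_doc_ctx ON attachments (doc)
  INDEXTYPE IS CTXSYS.CONTEXT;

-- After each upload, synchronise the index so the new document is
-- filtered immediately instead of waiting for the background job
BEGIN
  CTX_DDL.SYNC_INDEX('attachments_doc_ctx');
END;
/

-- Documents the filter could not read show up in the errors view
SELECT err_textkey, err_text
FROM   ctx_user_index_errors
WHERE  err_index_name = 'ATTACHMENTS_DOC_CTX';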
I will reiterate that Text is a workaround, and expensive in terms of database resources. The index itself will consume space and the indexing process requires time and CPU cycles. That's a big investment unless you're going to work with the document inside the database.

How is social media data unstructured data?

I recently began reading up on big data and how there are tools like Hadoop or BigInsights that can manage both structured and unstructured data.
Social Media Analytics is something that can be done on BigInsights, and it takes unstructured data and analyzes/structures it accordingly.
This got me wondering: how is social media data unstructured? For example, the information on tweets can be retrieved with the Twitter REST API and is returned to you in a structured JSON format.
So isn't Social Media data already structured? If so why do you need a platform that manages mainly unstructured data?
Some also make the distinction "semi-structured".
But the point is the ability to query the data. Yes, tweets etc. usually have some structure, but it's not helpful for analysis.
Given an ugly SQL schema, you could indeed run a query like
SELECT AVG(TweetID) FROM Twitter;
but that functionality is useless in practice. And that is probably why the data is best considered unstructured: you do not benefit from squeezing it into a relational schema.
Beware of buzzword bingo with big data, though. More often than not, "supports unstructured data" actually means "does not benefit from structure in your data (by using indexes) but rereads the data every time".
It's not only about getting the tweets. The real value of the data is knowing what is being tweeted about. Consider Facebook, where people can comment on any picture or video. We need a platform to know how many comments are positive about the video, how many are criticizing it, how many are genuine feedback, and how many offer suggestions for making it better. You also need to know how many times the video has been shared and liked, and who those people are, who liked it and who disliked it. So many varieties of data are collected, and that is why it is all called unstructured data.

IndexedDB Access Speed and Efficiency

I'm developing an RPG in Dart, and I'm going to use IndexedDB for data persistence.
I will have two databases: one for read-only access and one for read-write access where save games will be stored. I was just wondering if I should read required data directly from the database or cache it in Maps. I could potentially have several hundred records that need to be pulled from the read-only database (enemies, game maps, etc.), and I thought pulling everything from the database might be less efficient than using Dart's Maps.
Oh, also, each database will be stored in a map, with object stores as nested maps inside that map.
Should I read directly from the database, or should I put everything into a Map and read from there?
EDIT: Forgot to mention, the read-only database will be initialised with data from a JSON file located on the user's machine, not through AJAX.
I am confident that hundreds of records will present you no issue in IndexedDB. IDB was designed with that kind of scale in mind, and its async APIs, while vexing for novices, keep your app responsive by design.
I am working on a demo designed to push IDB further than it should go, and have some easy-to-reach statistics for you. These are gets on a single index in a single store on a database.
Gets are blazing fast in IndexedDB. The issue with IDB at scale is typically writes.
One thousand success callbacks and one complete callback finished in under a second.
Ten thousand success callbacks and one complete callback took about 5 seconds.
More than fifty thousand success callbacks fired in less than a minute.
Writes are much slower: bursty at first, then slow after minutes and dog slow after hours. That's with any schema, but you'd likely have multiple indexes on location (both latitude and longitude at least, I imagine), so your writes will be especially slow (more indexes means more work to maintain them on inserts and updates).
The schema layout used for the stats above matters just as much as the stats themselves: make sure to design your schema according to how you need to access it.
I would go with direct database access, monitor the performance, and then optimize where notable gains are to be expected. Premature optimization is seldom a good idea.

Multidimensional data types

So I was thinking... Imagine you have to write a program that would represent a schedule of a whole college.
That schedule has several dimensions (e.g.):
time
location
individual(s) attending it
lecturer(s)
subject
You would have to be able to display the schedule from several standpoints:
everything held in one location in a certain timeframe
everything attended by an individual in a certain timeframe
everything lectured by a certain lecturer in a certain timeframe
etc.
How would you save such data, and yet keep the ability to view it from different angles?
The only way I could think of was to save it in every form you might need it in:
E.g. you have a folder "students", in which each student has a file containing when, why, and where he has to be. However, you also have a folder "locations", in which each location has a file containing who has to be there, when, and why. The more angles you have, the more the size-to-information ratio increases.
But that seems highly inefficient, space-wise.
Is there any other way?
My knowledge of JavaScript is 0, but I wonder if such things would be possible with it, even in this space-inefficient form.
If not that, I wonder if it would work in any other standard language (C++, C#, Java, etc.), primarily Java...
EDIT: Could this be done using a MySQL database?
Basically, you are trying to first store data and then present it under different views.
SQL databases were made exactly for that: on one side you build a schema and instantiate it in a database to store your data (the language for this is called the Data Definition Language, DDL), then you query the data with the query language (SQL) to produce what you call "views". There are even "view" objects in SQL databases, so these views can be built inside the database (rather than having to code the request in the user code).
MySQL can do that for sure. Note that it is also possible to compile a SQL engine to JavaScript (SQLite, for example) and use local web storage to hold the data.
There is another aspect to your question: optimization of the queries. While SQL can do most of the request work for your views, it is sometimes preferable to create actual copies of the request results in so-called "data marts" (this is called de-normalizing a request), so that the hard work of selecting rows and computing aggregate/group functions is done once per period of time (imagine a specific view that changes only on Mondays), and requesters then just read those results. In this case it is important to separate, at least semantically, primary data from secondary data (and for performance and user-rights reasons, physical separation is often a good idea).
Note that since you cited MySQL I wrote about SQL, but almost any database technology (hierarchical, object-oriented, XML...) could do what you are trying to do, as long as the particular implementation you use is flexible enough for your data and requests.
So in short:
I would use a SQL database to store the data
make appropriate views / requests
if I need huge request performance, make appropriate de-normalized data available
the language is not important there, any will do
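As a minimal sketch of what that could look like in MySQL (table, column and view names are purely illustrative):

CREATE TABLE lecturer (id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE location (id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE student  (id INT PRIMARY KEY, name VARCHAR(100));

-- one row per scheduled session; the "dimensions" are foreign keys
CREATE TABLE session (
  id          INT PRIMARY KEY,
  subject     VARCHAR(100),
  starts_at   DATETIME,
  ends_at     DATETIME,
  location_id INT,
  lecturer_id INT,
  FOREIGN KEY (location_id) REFERENCES location(id),
  FOREIGN KEY (lecturer_id) REFERENCES lecturer(id)
);

-- many-to-many: which individuals attend which session
CREATE TABLE attendance (
  session_id INT,
  student_id INT,
  PRIMARY KEY (session_id, student_id),
  FOREIGN KEY (session_id) REFERENCES session(id),
  FOREIGN KEY (student_id) REFERENCES student(id)
);

-- one "angle" as a view: everything held in a location
CREATE VIEW schedule_by_location AS
SELECT l.name AS location, s.subject, s.starts_at, s.ends_at
FROM   session s JOIN location l ON l.id = s.location_id;

-- everything held in one location in a certain timeframe
SELECT * FROM schedule_by_location
WHERE  location = 'Room 101'
  AND  starts_at BETWEEN '2024-03-04 00:00' AND '2024-03-08 23:59';

The point is that the data is stored only once; every additional angle (per student, per lecturer, ...) is just another view or query joining the same tables, not another copy of the data.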

How to access data in Dynamics CRM?

What is the best way in terms of speed of the platform and maintainability to access data (read only) on Dynamics CRM 4? I've done all three, but interested in the opinions of the crowd.
Via the API
Via the webservices directly
Via DB calls to the views
...and why?
My thoughts normally center around DB calls to the views but I know there are purists out there.
Given both requirements I'd say you want to call the views. Properly crafted SQL queries will fly.
Going through the API is required if you plan to modify data, but it isn't the fastest approach around because it doesn't allow deep loading of entities. For instance, if you want to look at customers and their orders, you'll have to load both up individually and then join them manually, whereas a SQL query will already have the data joined.
Never mind that the TDS stream is a lot more efficient than the SOAP messages used by the API and web services.
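For example, the "customers and their orders" read could be done as one set-based query against the filtered views; the view and column names below assume the stock account and salesorder entities in CRM 4, so adjust them to your own schema:

SELECT a.name,
       o.ordernumber,
       o.totalamount
FROM   FilteredAccount    AS a
JOIN   FilteredSalesOrder AS o ON o.customerid = a.accountid
WHERE  a.statecode = 0;   -- active accounts only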
UPDATE
I should point out, in regard to the views and the CRM database in general: CRM does not optimize the indexes on the tables or views for custom entities (how could it?). So if you have a truckload entity that you look up by destination all the time, you'll need to add an index for that property. Depending upon your application it could make a huge difference in performance.
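As a sketch, for the hypothetical truckload entity above, the missing index might be added along these lines (table and column names are made up; custom attributes normally live in the entity's ExtensionBase table):

CREATE NONCLUSTERED INDEX IX_new_truckload_destination
ON dbo.new_truckloadExtensionBase (new_destination);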
I'll add to Jake's comment by saying that querying against the tables directly instead of the views (*base & *extensionbase) will be even faster.
In order of speed it'd be:
direct table query
view query
filtered view query
api call
Direct table updates:
I disagree with Jake that all updates must go through the API. The correct statement is that going through the API is the only supported way to do updates. There are in fact several instances where directly modifying the tables is the most reasonable option:
One time imports of large volumes of data while the system is not in operation.
Modification of specific fields across large volumes of data.
I agree that this sort of direct modification should only be a last resort when the performance of the API is unacceptable. However, if you want to modify a boolean field on thousands of records, doing a direct SQL update to the table is a great option.
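A sketch of that kind of bulk flip, again with made-up entity and field names, and only something to run while the system is offline and backed up, since it bypasses the platform entirely:

UPDATE dbo.new_truckloadExtensionBase
SET    new_isarchived = 1
WHERE  new_deliverydate < '2013-01-01';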
Relative Speed
I agree with XVargas as far as relative speed.
Unfiltered Views vs Tables: I have not found the performance advantage to be worth the hassle of manually joining the base and extension tables.
Unfiltered views vs Filtered views: I recently was working with a complicated query which took about 15 minutes to run using the filtered views. After switching to the unfiltered views this query ran in about 10 seconds. Looking at the respective query plans, the raw query had 8 operations while the query against the filtered views had over 80 operations.
Unfiltered Views vs API: I have never compared querying through the API against querying views, but I have compared the cost of writing data through the API vs inserting directly through SQL. Importing millions of records through the API can take several days, while the same operation using insert statements might take several minutes. I assume the difference isn't as great during reads but it is probably still large.
