NSDictionary, NSArray, NSSet and efficiency - cocoa

I've got a text file, with about 200,000 lines. Each line represents an object with multiple properties. I only search through one of the properties (the unique ID) of the objects. If the unique ID I'm looking for is the same as the current object's unique ID, I'm gonna read the rest of the object's values.
Right now, each time I search for an object, I just read the whole text file line by line, create an object for each line and see if it's the object I'm looking for - which is basically the most inefficient way to do the search. I would like to read all those objects into memory, so I can later search through them more efficiently.
The question is, what's the most efficient way to perform such a search? Is a 200,000-entry NSArray a good way to do this (I doubt it)? How about an NSSet? With an NSSet, is it possible to only search for one property of the objects?
Thanks for any help!
-- Ry

@yngvedh is correct in that an NSDictionary has O(1) lookup time (as is expected for a map structure). However, after doing some testing, you can see that NSSet also has O(1) lookup time. Here's the basic test I did to come up with that: http://pastie.org/933070
Basically, I create 1,000,000 strings, then time how long it takes me to retrieve 100,000 random ones from both the dictionary and the set. When I run this a few times, the set actually appears to be faster...
dict lookup: 0.174897
set lookup: 0.166058
---------------------
dict lookup: 0.171486
set lookup: 0.165325
---------------------
dict lookup: 0.170934
set lookup: 0.164638
---------------------
dict lookup: 0.172619
set lookup: 0.172966
In your particular case, I'm not sure either of these will be what you want. You say that you want all of these objects in memory, but do you really need them all, or do you just need a few of them? If it's the latter, then I would probably read through the file and create an object ID to file offset mapping (i.e., remember where each object ID is in the file). Then you could look up which ones you want and use the file offset to jump to the right spot in the file, parse that line, and move on. This is a job for NSFileHandle.
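The offset-mapping idea above can be sketched in a few lines. This is a minimal illustration in Python rather than Cocoa (where NSFileHandle would do the seeking), and it assumes a hypothetical line format of "<id>,<other fields>"; the function names are made up for the example.

```python
def id_of_line(line):
    # Hypothetical format: b"<id>,<rest of the fields>\n"
    return line.split(b",", 1)[0].decode("utf-8")

def build_offset_index(path, id_of=id_of_line):
    """Scan the file once, mapping each object's ID to the byte offset of its line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()          # position of the line about to be read
            line = f.readline()
            if not line:               # end of file
                break
            index[id_of(line)] = offset
    return index

def read_record(path, index, object_id):
    """Jump straight to the line for object_id; return None for an unknown ID."""
    offset = index.get(object_id)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n")
```

Only the ID-to-offset map stays in memory, so the cost per lookup is one seek and one line parse instead of a full scan.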

Use NSDictionary to map from IDs to objects. That is: use the ID as key and the object as value. NSDictionary is the only collection class which supports efficient key lookup (or key lookup at all).
A dictionary is a different kind of collection than the other collection classes. It is an associative collection (it maps IDs to objects in your case), whereas the others are simply containers for multiple objects. NSSet holds unordered unique objects and NSArray holds ordered objects (and may hold duplicates).
UPDATE:
To avoid reallocations as you read the entries, use the dictionaryWithCapacity: method. If you know the (approximate) number of entries prior to reading them you can use it to preallocate a big enough dictionary.
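As a rough illustration of the ID-as-key approach, here is a sketch using a Python dict in place of NSDictionary. The `Record` type and the "<uid>,<name>,<score>" line format are assumptions made up for the example, not part of the question.

```python
from dataclasses import dataclass

@dataclass
class Record:
    uid: str
    name: str
    score: int

def parse_line(line):
    # Hypothetical format: "<uid>,<name>,<score>"
    uid, name, score = line.rstrip("\n").split(",")
    return Record(uid, name, int(score))

def load_by_id(lines):
    """Build the ID -> object map once; later lookups go through the hash table."""
    return {rec.uid: rec for rec in map(parse_line, lines)}

records = load_by_id(["a1,Alice,30", "b2,Bob,41"])
hit = records.get("b2")    # the Record for Bob
miss = records.get("zz")   # None, because no object has that ID
```

The file is parsed once up front; after that a search by ID no longer touches the other objects.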

200,000 objects sounds like you might run into memory constraints, depending on the size of the objects and your target environment. One other thing you may want to consider is to convert the data into an SQLite database, and then index the columns you want to do lookups on. This would provide a good compromise between efficiency and resource consumption, as you would not have to load the full set into memory.
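A minimal sketch of the SQLite route, using Python's standard sqlite3 module. The table name, column names and the "<uid>,<name>,<score>" line format are hypothetical; declaring the ID as PRIMARY KEY is one way to get an index on the lookup column.

```python
import sqlite3

def build_db(lines, path=":memory:"):
    """Load the text lines into a table whose uid column is indexed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE objects (uid TEXT PRIMARY KEY, name TEXT, score INTEGER)"
    )

    def rows():
        for line in lines:
            # Hypothetical format: "<uid>,<name>,<score>"
            uid, name, score = line.rstrip("\n").split(",")
            yield uid, name, int(score)

    conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows())
    conn.commit()
    return conn

def find(conn, uid):
    """Return the matching row as a tuple, or None if the ID is unknown."""
    cur = conn.execute(
        "SELECT uid, name, score FROM objects WHERE uid = ?", (uid,)
    )
    return cur.fetchone()
```

With a file path instead of ":memory:", the conversion only has to run once and each later search reads just the matching row.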

Related

Redux/React state normalization - why maintain a separate array of IDs?

Following the tutorial by Dan Abramov here: https://egghead.io/lessons/javascript-redux-normalizing-the-state-shape
He doesn't seem to explain the benefit of maintaining an extra reducer with an array of todo IDs (allIds). Would it not be easier to have just the one byId reducer and use Object.keys or Object.values to iterate over it?
The sample Todo app shows a list of todos, in the order in which they were created. It's not possible to retrieve that ordered list in a way that is guaranteed to work across browsers using an Object and Object.keys.
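JS object property order is not a reliable record of insertion order, whereas arrays have an explicit order. Older specifications left enumeration order up to the implementation, and current engines list integer-like keys in ascending numeric order ahead of other string keys. So the ordering of the output of Object.keys() is not guaranteed to have any relationship to the order in which the keys were added. The array allows the reducer to display the todos in the order in which they were added.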
Theoretically you could use a Map, as the keys in a Map are ordered. However, there's no way to re-order the contents of a Map. With an array you could re-order the IDs without needing to touch the todo objects themselves.
In other words, the array data structure is better suited to storing ordered lists than both Object and Map.
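The split between a lookup table and an ordered ID list can be sketched as follows. This is a Python analogy, not Redux code, and the function names are made up. Python dicts happen to preserve insertion order, so the sketch mainly illustrates the re-ordering benefit: only the ID list changes.

```python
def add_todo(state, todo):
    """Return a new state with the todo stored by ID and its ID appended to the order list."""
    return {
        "by_id": {**state["by_id"], todo["id"]: todo},
        "all_ids": state["all_ids"] + [todo["id"]],
    }

def move_to_front(state, todo_id):
    """Re-order by rewriting only the ID list; the todo objects are not touched."""
    rest = [i for i in state["all_ids"] if i != todo_id]
    return {"by_id": state["by_id"], "all_ids": [todo_id] + rest}

def visible_todos(state):
    """Resolve the ordered IDs back into todo objects for display."""
    return [state["by_id"][i] for i in state["all_ids"]]
```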

Algorithm and data structure to store First name and last name

Is there an efficient way to store first names and last names in a data structure so that we can look up using either first or last name? I would consider a binary search tree keyed on first name. It would be efficient to search by first name, but it wouldn't be efficient when searching by last name. We could also add one more BST keyed on last name. Any ideas on how to implement this efficiently?
What if the question is
String names[] = { "A B","C D"};
A requirement is to be able to extend this directory dynamically at runtime,
without persistent storage. The directory can eventually grow to hundreds or
thousands of names and must be searchable by first or last name.
Now we can't have hash tables to store. Any ideas?
Two hash tables: one from first name to person, and one from last name to person.
Simple is best.
Why not put both first and last names in a trie?
As a bonus, this way you can even get suggestions on partial names by traversing all leaves after current node (maybe on an asynchronous call)
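A minimal sketch of the trie suggestion, assuming names are split on whitespace and matched case-insensitively (both are assumptions for the example). Each name part is inserted separately, and a prefix search collects every full name stored at or below the prefix node.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.people = []   # full names with a name part ending at this node

class NameTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, full_name):
        """Insert every part of the name (first and last) into the same trie."""
        for part in full_name.lower().split():
            node = self.root
            for ch in part:
                node = node.children.setdefault(ch, TrieNode())
            node.people.append(full_name)

    def search_prefix(self, prefix):
        """Return all full names having a part that starts with prefix."""
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:                      # walk the whole subtree under the prefix
            current = stack.pop()
            found.extend(current.people)
            stack.extend(current.children.values())
        return sorted(set(found))
```

Searching with a complete name part works the same way as searching with a partial one, which is where the suggestions for partial names come from.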
Your idea is pretty good, but here's another option: how about implementing two hash tables?
The first hash table would use first names as a key, and the associated value would either be the last name or a pointer to a Name object. The second hash table would use last names as keys, with the first names or pointers to Name as the values.
Personally, for choosing the values, I would go for a pointer to a Name object, since this method would be more applicable in case you'd like to store even more information (e.g. date of birth, etc.)
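The two-table idea might look like the sketch below. It assumes names are not unique, so each key maps to a list of shared `Name` objects; the class and method names are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Name:
    first: str
    last: str

class Directory:
    def __init__(self):
        self.by_first = defaultdict(list)   # first name -> Name objects
        self.by_last = defaultdict(list)    # last name  -> Name objects

    def add(self, first, last):
        """Store one Name object and reference it from both tables."""
        person = Name(first, last)
        self.by_first[first].append(person)
        self.by_last[last].append(person)
        return person

    def find_first(self, first):
        return list(self.by_first.get(first, []))

    def find_last(self, last):
        return list(self.by_last.get(last, []))
```

Both tables point at the same objects, so extra fields added to `Name` later are reachable from either lookup.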
Also, see Does Java have a HashMap with reverse lookup?…, which is specific to Java but the discussion on the data structures is relevant to any language.
Note that structures such as Bidirectional Sorted Maps also allow range searches (which dual hash tables don't).
If you need to search only by first name or only by last name, then yes, two hashmaps are the best (and notice you're not duplicating the data, you're partitioning it). But if you don't mind mixing them, put both first and last names in a single hashmap and don't differentiate between the two.

Query core data store based on a transient calculated value

I'm fairly new to the more complex parts of Core Data.
My application has a core data store with 15K rows. There is a single entity.
I need to display a subset of those rows in a table view filtered on a calculated search criteria, and for each row displayed add a value that I calculate in real time but don't store in the entity.
The calculation needs to use a couple of values supplied by the user.
A hypothetical example:
Entity: contains fields "id", "first", and "second"
User inputs: 10 and 20
Search / Filter Criteria: only display records where the entity field "id" is a prime number between the two supplied numbers. (I need to build some sort of complex predicate method here I assume?)
Display: all fields of all records that meet the criteria, along with a derived field (not in the core data entity) that is the sum of the "id" field and a random number, so each row in the tableview would contain 4 fields:
"id", "first", "second", -calculated value-
From my reading / Googling it seems that a transient property might be the way to go, but I can't work out how to do this given that the search criteria and the resultant property need to calculate based on user input.
Could anyone give me any pointers that will help me implement this code? I'm pretty lost right now, and the examples I can find in books etc. don't match my particular needs well enough for me to adapt them as far as I can tell.
Thanks
Darren.
The first thing you need to do is to stop thinking in terms of fields, rows and columns, as none of those structures are actually part of Core Data. In this case, it is important because Core Data supports arbitrarily complex fetches but the sqlite store does not. So, if you use a sqlite store, your fetches are restricted to those supported by SQLite.
In this case, predicates aimed at SQLite can't perform complex operations such as calculating whether an attribute value is prime.
The best solution for your first case would be to add a boolean attribute of isPrime and then modify the setter for your id attribute to calculate whether the set id value is prime or not and then set isPrime accordingly. That will be stored in the SQLite store and can be fetched against, e.g. isPrime==YES &&((first<=%@) && (second>=%@))
The second case would simply use a transient property for which you would supply a custom getter to calculate its value when the managed object was in memory.
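The two suggestions above can be sketched outside Core Data as a plain Python class. This is only an analogy, not the NSManagedObject API: the setter keeps a stored `is_prime` flag in sync with the ID, and the derived value is computed on demand and never stored. The 0 to 9 range for the random number is an assumption for the example.

```python
import random

def is_prime(n):
    """Trial division; fine for a value checked once per assignment."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

class Item:
    def __init__(self, id_value, first, second):
        self.first = first
        self.second = second
        self.id = id_value             # goes through the setter below

    @property
    def id(self):
        return self._id

    @id.setter
    def id(self, value):
        self._id = value
        self.is_prime = is_prime(value)   # stored flag, cheap to filter on later

    def derived(self, rng=random):
        """Transient value: computed when asked for, never stored."""
        return self._id + rng.randint(0, 9)

def primes_between(items, low, high):
    """Filter on the stored flag instead of testing primality per query."""
    return [it for it in items if it.is_prime and low <= it.id <= high]
```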
One often overlooked option is to not use an sqlite store but to use an XML store instead. If the amount of data is relatively small e.g. a few thousand text attributes with a total memory footprint of a few dozen meg, then an XML store will be super fast and can handle more complex operations.
SQLite is sort of the stunted stepchild in Core Data. It is useful for large data sets and low memory, but with memory becoming ever more plentiful, it's losing its edge. I find myself using it less these days. You should consider whether you need sqlite in this particular case.

What to prefer in GQL; StringListProperty or ListProperty?

I am building an application with a many to many relationship;
An item of entity 'Picture' can be linked to any number of Galleries ('Gallery'). And of course a Gallery can hold any number of Pictures.
So, following the Google Suggestion here, I will use a List at 'Picture' which holds the foreign keys of 'Gallery'. This is the BigTable approach.
(The old-style Relational DB approach would be to have a table / entity in between 'Picture' and 'Gallery'.)
Here's my question: When storing the Key, should I go for a "StringListProperty" on 'Picture' or would a "ListProperty(db.Key)" work better?
One reason I see for a StringList would be that I could also store values other than Keys, but on the other hand that would be dirty style anyway. But I am also pretty sure that Google suggested not to use more than one List on an entity because the index(es) will explode. So this would leave me a back door.
As for the ListProperty with type "Key" one point would be the automatic verification, if the value is actually a Key.
As it is very easy to convert Strings to Keys and vice versa, I don't see any reason to prefer one of the List types here.
When it comes to performance issues, I have no idea on how I could test this - but it looks like this will be the main factor in this decision.
Curious about your input. Especially if someone has tested the performance on this or would be so kind and do it.
Cheers,
//Hannes
Use a db.ListProperty(db.Key) if you're intending to store lists of keys. They will be stored in a binary representation, which is more compact than the string representation you would use in a string list.
You're right that mixing keys with other objects in a list is messy. Having multiple lists in an entity is fine, as long as you don't index more than one of them in the same custom index - that is what causes exploding indexes.
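To see why combining two list properties in one custom index is the problem, here is a simplified model of the entry count, not datastore code: such an index needs one entry per combination of values, one taken from each list. The `tags` list is a hypothetical second list property.

```python
from itertools import product

def composite_index_entries(*list_properties):
    """One index row per combination of values, one value taken from each list."""
    return list(product(*list_properties))

galleries = ["g1", "g2", "g3"]
tags = ["sunset", "beach"]

# Indexed together: 3 * 2 rows for a single entity.
rows = composite_index_entries(galleries, tags)

# Indexed separately: 3 rows and 2 rows, so growth is additive rather than multiplicative.
separate = len(composite_index_entries(galleries)) + len(composite_index_entries(tags))
```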
Use db.ListProperty(db.Key); this will make fetching the data easier than with strings. Suppose Picture is the name of your entity and the Gallery model has a property pic_list of type db.ListProperty(db.Key) that contains the keys of Picture entities. Then Picture.get(//GalleryObject//.pic_list) will fetch all the Picture entities.

Enumerate indexes on an Extensible Storage Engine (ESENT) table

Background
I'm writing an adapter for ESE to .NET and LINQ in a Google Code project called eselinq. One important function I can't seem to figure out is how to get a list of indexes defined for a table. I need to be able to list available indexes so the LINQ part can automatically determine when indexes can be used. This will allow much more efficient plans for user queries if appropriate indexes can be found.
There are two related functions for querying index information:
JetGetTableIndexInfo - get index information by tableID
JetGetIndexInfo - get index information by tableName
These only differ in how the related table is specified (name or tableid). It sounds like these would support the function I want but all the info levels seem to require that I already have a certain index to query information for. The only exception is JET_IdxInfoCount, but that only counts how many indexes are present.
JET_IdxInfo with its JET_INDEXLIST sounds plausible but it only lists the columns on a specific index.
Alternatives
I am aware that I could get the index information another way, like annotations on .NET types corresponding to database tables, or by requiring an index mapping be provided ahead of time. I think there's enough introspection implemented to make everything else work out of the box without the user supplying extra information, except for this one function.
Another option may be to examine the system tables to find related index objects, but this would mean depending on an undocumented interface.
To satisfy this question, I want a supported method of enumerating the indexes (just the name would be sufficient) on a table.
You are correct about JetGetTableIndexInfo and JetGetIndexInfo and JET_IdxInfo. The twist is that the data is returned in a somewhat complex form: a temporary table is returned containing a row for the index and then a row for each column in the table. To just get the index names you will need to skip the column rows (the column count is given by the value of the columnidcColumn column in the first row).
For a .NET example of how to decipher this, look at the ManagedEsent project. In the MetaDataHelpers.cs file there is a method called GetIndexInfoFromIndexlist that extracts all the data from the temporary table.
