Check for duplication inside a plist - macos

I'm making a word game, and storing the words in a plist that I'm editing manually.
It now contains more than 600 words. When I write in new ones I'm always concerned, that the words already exist. I always have to search before I type a new one in, and this is really slows down the whole process.
Is there any way that I can check the list for duplication when I type in immediately, or check the whole plist for duplication and delete the duplicated words?

Make yourself a helper tool.
The tool will read in the array of strings from the plist and convert it to a set. Then it will accept new strings as input, adding them to the set. Duplicates simply won't be added. Convert the set back to an array before writing the file out.

Related

Writing hash information to file and reloading it automatically on program startup?

I wrote a little program that creates a hash called movies. Then I can add, update, delete, and display all current movies in the hash by typing the title.
Instead of having it start a new hash each time and save anything added to a file, and, when updated or deleted, update or delete the key, value pair from the file, I want the program to auto-load the file on startup and create it if it doesn't exist.
I have no idea how to go about doing this.
After reading a lot of the comments I have decided that maybe I should do this with SQL instead, seems like a much better approach!
You can't store Ruby objects directly on the disk; you will first need to convert them to some sequence of bytes (i.e. a string). This is called serialization, and there are several different ways to do it and several different formats the data could be in. I think I would recommend JSON, but you might also want to try YAML or Marshal.
Any of those libraries will allow you to convert your hash into a string and allow you to convert that same string back into a hash. Then you can use Ruby's File class to save and load that string from the disk.
This should get you pointed in the right direction. From here you can search for more specific things like "how do I convert a hash to JSON" or "how do I write a string to a file".
You have the ability to marshal your code in a few ways.
YAML if you would like to use a gem, or JSON. There is also a built in Marshal
RI tells us:
Marshal
(from ruby site)
----------------------------------------------------------------------------- The marshaling library converts collections of Ruby objects into a
byte stream, allowing them to be stored outside the currently active
script. This data may subsequently be read and the original objects
reconstituted.
Marshaled data has major and minor version numbers stored along with
the object information. In normal use, marshaling can only load data
written with the same major version number and an equal or lower minor
version number. If Ruby's ``verbose'' flag is set (normally using -d,
-v, -w, or --verbose) the major and minor numbers must match exactly. Marshal versioning is independent of Ruby's version numbers. You can
extract the version by reading the first two bytes of marshaled data.
And I will leave it at that for Marshal. But there is a bit more documentation there.
You can also use IO#puts to write to a file, and then modify that file to load later, which I use sometimes for config settings. Why use YAML or another external source, when Ruby is easy enough to have a user modify? You use YAML when it needs to be more generally accessible, as the Tin Man points out.
For example this file is the sample file, but is intended for interactive editing (with constraints, of course) but it is simply valid Ruby. And it gets read by a Ruby program, and is a valid object (in this case a Hash stored in a constant.)

Filtering elements while reading .plist array into NSArray

First post - hope I'm doing it right!
I have a file, lexicon.plist, containing an array of about 250K words. I want to load all words of length 'n' into an NSArray.
I know about the NSArray instance method:
(id)initWithContentsOfFile:(NSString *)aPath
but I don't see any way to intervene in the process of reading the file into the NSArray. The only solution I can see is to first load the entire lexicon into one NSArray, and then run through it in a loop selecting the elements of length 'n'.
I'm very new at Cocoa, but I have come across some methods that perform some sort of iterative task, that accept a "block" of code that is invoked at each iteration. I was wondering if such a functional variant of initWithContentsOfFile might exist, or how else I might iteratively read an array from a .plist file and filter the elements I'm interested in.
[And if you're wondering if this might be a case of premature optimization - it is ;-) But I'd still like to know.]
.plist files are basically XML files so you can use an NSXMLParser on it and filter out the elements of interest.
If you want to load a filtered selection of a saved data, you should use a SQL repository using SQLite, for example.
Plain files can only be fully loaded in memory.

Ruby library for manipulating XML with minimal diffs?

I have an XML file (actually a Visual C# project file) that I want to manipulate using a Ruby script. I want to read the XML into memory, do some work on them that includes changing some attributes and some text (fixing up some path references), and then write the XML file back out. This isn't so hard.
The hard part is, I want the file I write to look the same as the file I read in, except where I made changes. If the input file used double quotes, I want the output to use double quotes. If the input had a space before />, I want the output to do the same. Basically, I want the output to be the same as the input, except where I explicitly made changes (which, in my case, will only be to attribute values, or to the text content of an element).
I want minimal diffs because this project file is checked into version control -- and because the next time I make a change in Visual Studio, it's going to rewrite it in its preferred format anyway. I want to avoid checking in a bunch of meaningless diffs that will then be changed back again in the near future. I also want to avoid having to open the project in Visual Studio, make a change, and save, before I can commit my Ruby script's changes. I want my Ruby script to just make its changes, nothing more.
I originally just parsed the file with regexes, but ran into cases where I really needed an XML library because I needed to know more about child elements. So I switched to REXML. But it makes the following undesirable changes to my formatting:
It changes all the attributes from double quotes to single quotes.
It escapes all the apostrophes inside attribute values (changing them to ').
It removes the space before />.
It sorts each element's attributes alphabetically, rather than preserving the original order.
I'm working around this by doing a bunch of gsub calls on REXML's output, but is there a Ruby XML-manipulation library that's a better fit for "minimal diff" scenarios?
You can build your own SAX parser (using Nokogiri, for example, it's very easy and I recommend to use it) to parse your XML file, change some data in it, and flush the processed XML file with your own customized, built from scratch, XML generator. The bad news is, you have to build a tiny XML library and generator routine in this case, so it is not an ordinary task.
Another way: don't build the SAX parser, but write an XML generator. Parse XML with your favourite library, change what you need to change and generate anything you want. You just need to recursively walk through all nodes in your document and output them within your conventions.

SHA-1 hash for storing Files

After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.
I have no idea what this means however, all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this ruby script, and I change the file's content (which changes the hash), how do I know where the file is stored then?
My question is then, what are the basics of implementing a SHA-1/file-storage system?
If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?
I'm just thinking about how to create a generic file storing system like GoogleDocs, Flickr, Youtube, DropBox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll 99% of the time do file storing from now on", so I can stop thinking about building a solid/consistent way to store files and get onto some real problems.
First of all, if the contents of the files are changing, filename from SHA-digest approach is not very suitable, because the name and location of the file in filesystem must change when the contents of the file changes.
Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.
When you have a digest, for example, 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and filename from the digest. For example, you split the first few characters from the digest to directory structure and rest of the characters to file name. For example:
00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
This way you only need to store the SHA-1 digest of the file to database. You can then always find out the right location and the name of the file.
Directories usually also have maximum number of files they can contain, for example maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files to same directory. Also using hashing like this make sure that every directory has about the same number of files, you won't get into situation where all your files are in same directory.
The idea is not to change the file content, but rather its name (and path), by using a hash value.
Changing the content with a hash would be disastrous since a hash is normally not reversible.
I'm not sure of the motivivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash appraoch:
the file names on the disk is uniform
the upper or lower parts of the hash value can be used to name the directories and hence distribute the files relatively uniformely
the name becomes a code, making it difficult for someone to
a) guess a file name
b) categorize pictures (would someone steal the hard drive content)
be able to retrieve the filename and location from the file contents itself (assuming the hash comes from such content. (not quite sure which use case would involve this... a bit contrieved...)
The general interest of using a hash is that unlike a file name, a hash is meaningless, and therefore one would require the database to relate images and "bibliographic" type data (name of uploader, date of upload, tags, ...)
In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...
Furthermore... some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the refernced SO question) and this could be seen as wasteful, by making the file names longer than they need to be, and hence putting more stress on the file system (bigger directories...)
One advantage I see with storing files using their hash is that the file data only needs to be stored once and then can be referenced multiple times within your database. This will save you space if you have a different users uploading the exact same file.
However the downside to this is when a user deletes what they think is there file from your app, you can't just physically delete the file from disk because other users that uploaded the same exact file may still be using it.
The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.
So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.
This has the additional benefit of detecting duplicate uploads.

What is the best way to edit the middle of an existing flat file?

I have tool that creates variables for a simulation. The current workflow involves hand copying those variables into the simulation input file. The input file is a standard flat file, i.e. not binary or XML. I would like to automate the addition of the variables to the flat input file.
The variables copy over existing variables in the file, e.g.
New Variables:
Length 10
Height 20
Depth 30
Old Variables:
...
Weight 100
Age 20
Length 10
Height 20
Depth 30
...
Would like to have the old variables copy over the new variable. They are 200 lines into the flat input file.
Thanks for any insights.
P.S. This is on Windows.
If you're stuck using flat, then you're stuck using the old fashioned way of updating them: read from original, write to temp file, either write the original row or change the data and then write that. To add data, write it to the temp file at the appropriate point; to delete data, simply don't copy it from the original file.
Finally, close both files and rename the temp file to the original file name.
Alternatively, it might be time to think about a little database.
For something like this I'd be looking at a simple template engine. You'd have a base template with predefined marker tokens instead of variable values and then just pass the values required to your engine along with the template and it will spit out the resultant file, all present and correct. There are a number of Open Source template engines available in Java that would meet your needs, I imagine such things are also available in your language of choice. You could even roll your own without too much difficulty.
Note that under Unix, one would probably look at using mmap() because you can then use functions such as memmove() to move the data around and add new data or truncate() the result if the file is then smaller (you may also want to use truncate() to grow the file).
Under MS-Windows, you have the MapViewOfFileEx() function to do the same thing. The API is different, though,
and I'm not exactly sure what happens or how to grow/shrink the file (MSDN also includes a truncate()-like function and maybe that works).
Of course, it's important to use memcpy() or memmove() properly to not overwrite the wrong data or go outside the buffer.

Resources