How to store a tree-like structure in the STL - data-structures

I need to read some data from a file that follows a tree-like structure (bookmarks; I guess you can say they follow the Windows Explorer directory/file structure).
After reading, this data needs to be applied to an identical file. Is there a quick STL structure I can use to both load the tree info from the first file and then traverse it against the second file and write it out?
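The STL has no dedicated tree container, but a recursive node type built on the standard containers covers this case. A minimal sketch (the names and payload are illustrative, not from the question):

#include <iostream>
#include <string>
#include <vector>

// One node per bookmark folder/entry. std::vector is one of the few
// standard containers that allows an incomplete element type, so this
// recursive definition is well-formed.
struct Node {
    std::string name;            // e.g. a bookmark or folder title
    std::vector<Node> children;  // subtrees, in file order
};

// Depth-first traversal, e.g. for writing the tree back out.
void write(const Node& n, std::ostream& os, int depth = 0) {
    os << std::string(depth * 2, ' ') << n.name << '\n';
    for (const auto& c : n.children)
        write(c, os, depth + 1);
}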

Related

Process 100K image files with bash

Here is the script that optimizes the JPEG images: https://github.com/kormoc/imgopt/blob/master/imgopt
There is a CMS with image files (not mine).
I assume there is a complicated structure of subdirectories, and the script just recursively finds all image files in a given folder.
The question is: how do I mark already-processed files so that on the next run the script skips them instead of touching them again?
I don't know when the owners will add new files and want them processed. I also don't think renaming is a good choice.
I was thinking about a hash table or associative array that would be filled from a text file at startup. But is it OK to have a 100K-item array in bash? That seems complicated for a script.
Any other ideas about optimization are also welcome.
I think the easiest thing to do is to output a marker file with a similar name for each processed image.
For example, after image1.jpg has been processed, create an empty file with a related name, e.g. .image1.jpg.processed.
Then when your script runs, for the current image NAME.EXT it just checks whether a file .NAME.EXT.processed exists. If that file doesn't exist, you know the image still needs to be processed. No memory issues and no hash table needed, granted you will end up with 100K extra empty files.
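The question is about bash, but the marker-file logic is the same in any language. As an illustrative sketch only (not the original script), here it is in C++ with std::filesystem; the image root path is an assumption:

#include <filesystem>
#include <fstream>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    const fs::path root = "images"; // assumed CMS image root
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file() || entry.path().extension() != ".jpg")
            continue;
        // The marker lives next to the image: .NAME.EXT.processed
        const fs::path marker = entry.path().parent_path() /
            ("." + entry.path().filename().string() + ".processed");
        if (fs::exists(marker))
            continue; // already handled on an earlier run
        std::cout << "processing " << entry.path() << '\n';
        // ... run the optimizer here, then drop the empty marker:
        std::ofstream{marker};
    }
}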

What structure of a flat file would be the most efficient for a tree representation of a folder listing?

Given a folder on the local file system, what I need to do is this:
1) Get the recursive listing of all the subfolders/files in it.
2) Output this listing to a flat text file.
3) Recreate this folder structure in a tree representation.
So what information do I need to store in that file, and in what form, to achieve this efficiently?
Efficient in this case means spending as little time as possible creating the tree structure, given a potentially large number of subfolders/files.
Obviously I would need to keep track of the parent-child relationships between folders, and perhaps record things like file extensions and size.
I can use the facilities of Windows at the command line and/or other software so there is no limitation there.
This question might end up with someone recommending a library for the third step and working back from there, and I don't mind that, as long as the rest of the question is addressed.
Assumption: able to code C#
Regarding your comments: this is not really an algorithm question, just a plain implementation question.
To create a tree structure of your file system in memory, I'd use a recursive approach, just like the answer here.
And once you have the structure in memory, serialize it with Json.Net:
string json = JsonConvert.SerializeObject(treeStructure, Formatting.Indented);
This produces a text representation of the tree you already created. Once you have the string, you can save it wherever you need to.
To re-create the tree structure from the Json string, you use Json.Net again:
TreeView treeStructure = JsonConvert.DeserializeObject<TreeView>(json);
And this should be enough.
You can represent the folders as a Json string:
{"name" : "folder_name1", "children" : [{"name" : "folder_name2", "children" : []}, {"name" : "folder_name3", "children" : [{"name" : "folder_name4", "children" : []}]}]}
You can then use any Json library to parse this string into a Json tree, which you can then traverse to generate your internal tree representation. Some libraries will even automatically (de)serialize the internal tree representation to/from a Json string.
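For instance, with a C++ JSON library such as nlohmann/json (an assumption for illustration; any JSON library offers the equivalent), parsing and walking that string looks like:

#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

using json = nlohmann::json;

// Recursively walk the {"name": ..., "children": [...]} structure,
// printing each folder indented by its depth.
void walk(const json& node, int depth = 0) {
    std::cout << std::string(depth * 2, ' ')
              << node.at("name").get<std::string>() << '\n';
    for (const auto& child : node.at("children"))
        walk(child, depth + 1);
}

int main() {
    const std::string s =
        R"({"name":"folder_name1","children":[{"name":"folder_name2","children":[]}]})";
    walk(json::parse(s));
}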

pig load udf for loading files from several sub directories

I want to write a custom load UDF in Pig for loading files from a directory structure.
The directory structure is like an email directory. It has a root directory called maildir. Inside this are the sub-directories of individual mail holders. Inside every mail account holder's directory are several sub-directories like inbox, sent, trash, etc.
e.g. maildir/mailholdername1/inbox/1.txt
maildir/mailholdername2/sent/1.txt
I want to read only the inbox files from all the mailholdername sub-directories.
I am not able to understand:
what should be passed to the load UDF as a parameter, and
how the entire directory structure should be parsed so that only the respective inbox files are read.
I want to process one file at a time, perform some data extraction, and load it as one record. Hence if there are 10 files, I get a relation having 10 records.
Further, I want to do some operations on these inbox files and extract some data.
Because you have a defined folder structure that doesn't have variable depth, I think it's as simple as passing the following pattern as your input path:
A = LOAD 'maildir/*/inbox/1.txt' USING PigStorage('\t') AS (f1,f2,f3);
You probably don't need to write your own UDF for this; the default PigStorage loader should be able to handle the files, assuming they are in some delimited format (the example above assumes 3 tab-delimited fields).
If there are multiple txt files in each inbox, use *.txt rather than 1.txt. Finally, if the maildir root directory is not in your user's home directory, use the absolute path to the folder, say /data/maildir/*/inbox/*.txt.

What is a sensible data-structure for allowing efficient synchronisation between two root paths?

I am working on an application that involves maintaining consistency between two local directories. Specifically, the directories should be identical, with the exception that all files in one of the directories are modified in some particular way (this part is not important to my question).
While running, my application runs two processes that listen for changes occurring under each of the paths, and performs relevant operations to bring them back in sync when necessary.
As for my specific question: I'm looking for advice on the trickier situation of starting the application. At this point, each process needs to check every file/folder under the path it is looking after, to see if anything has changed in any way while the application was not running. (Let us assume that the application cannot be notified by the OS of anything that happened while it was shut down, and thus will need to check every file/folder directly.)
Each process will have access to (and maintain) a persistent data-structure of all files/folders under its designated path. I was thinking that the following should be held within the data-structure for each file and folder:
File/folder name;
File hash (CRC32);
File/folder last modification date; and
File/folder size.
These pieces of information will obviously help to check for any changes to files/folders, but what is the best way to store them?
It seems to me that one sensible way to approach the situation of an application start is for each process to recursively scan through all files/folders under its designated path, and compare the metadata for each file scanned to the metadata stored in its data-structure. Then the processes should also iterate through the data-structures to look for things that have been removed from the paths. Some cases that may be encountered during this process are:
file modified (file name found in data-structure, but hash differs);
file added (no identical filename or hash found in data-structure);
file renamed (file with same hash exists in data-structure, but not with same filename);
folder added (no folder name in data-structure);
folder removed (folder name in data-structure, but not under path);
folder renamed (tricky one).
So, what's the best data-structure to use for this task? In my head I'm thinking of some form of sorted associative array, e.g., a red-black tree, which stores file and folder objects. Each file object contains name, hash, and mod-date attributes, while each folder object contains name and children attributes, where children is another associative array holding everything underneath. Given the path to an arbitrary file, e.g., /foo/bar/file.txt, you begin at the root (foo), look up bar, and so on until you reach file.txt's parent object.
Another alternative I can think of is to store everything flat, with one red-black tree where each key is the full path to a file/folder and the value is the file/folder object. This would probably be quicker for retrieval, but it won't be possible to detect renamed files/folders without iterating through all values anyway, which sounds expensive. With the first approach, identifying a rename may only involve checking a portion of the data-structure rather than all of it.
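For concreteness, here is a sketch of the first (nested) alternative in C++; the names are assumptions for illustration, and std::map is the sorted associative container (a red-black tree in common implementations):

#include <cstdint>
#include <ctime>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Metadata held for each file, per the list above.
struct FileMeta {
    std::uint32_t crc32;    // content hash (CRC32)
    std::time_t   modified; // last modification date
    std::uint64_t size;
};

// A folder holds files and subfolders, each keyed by name.
// unique_ptr keeps the recursive type complete for std::map.
struct Folder {
    std::map<std::string, FileMeta> files;
    std::map<std::string, std::unique_ptr<Folder>> folders;
};

// Walk /foo/bar/... from the root, one component at a time.
const Folder* find(const Folder& root,
                   const std::vector<std::string>& components) {
    const Folder* cur = &root;
    for (const auto& c : components) {
        const auto it = cur->folders.find(c);
        if (it == cur->folders.end()) return nullptr;
        cur = it->second.get();
    }
    return cur;
}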
Sorry the above ideas aren't terribly well thought out. What's the state of the art in this area, and are there any well-trodden approaches to these types of problems?
You're modelling a filesystem, so it's quite natural to use a hierarchical data structure. After all, you don't need to compare the file at dir1\dir2\foo.txt to dir3\bar.txt, right? You didn't mention file moves between directories as something you're tracking.
So, the data structure could be:
interface IFSEntry {
    // C# interface members are implicitly abstract, so no
    // "pure virtual" markers are needed here.
    string Name { get; }
    DateTime CreationDate { get; }
    bool Compare(IFSEntry other);
    void UpdateFrom(IFSEntry other);
    bool WasRenamed(Dictionary<string, IFSEntry> possibleOriginals, out string oldName);
    // ...
}

class File : IFSEntry {
    // ...
}

class Directory : IFSEntry {
    // Children keyed by name for fast lookup.
    private Dictionary<string, IFSEntry> children;
    // ...
}
The Directory implementations of UpdateFrom and Compare would recurse down their children.
File renames would be relatively easy to detect by comparing CRCs. You'd miss files that changed in both places and were renamed. You could add a CRC dictionary to the Directory class if the time to run the comparisons proves to be a performance problem.
For directory moves, if the child files also changed, then you've got a fuzzy logic situation. It would be best to have a merge tool that the user would operate for that situation.
If a file changes in both places, you also need a user-facing merge strategy if conflicting changes occur. I'd argue that is always a good idea, just to let the user eyeball that the document didn't lose coherence.

Arbitrary sort key in filesystem

I have a pet project in which I'm building a text-to-HTML translator. I keep the content and the converted output in a directory tree, mirroring the structure via the filesystem hierarchy. Chapters go into directories and subchapters go into subdirectories. I get the chapter headings from the directory and file names. I want to keep all data in files, with no database or the like.
Kind of a keep-it-simple approach; no need to deal with metadata.
All works well, except for the sort order of the directories and files to be included. I need an arbitrary sort key for directories and files in my application; it would determine the order in which the content goes into the output.
I have two solutions, both not really good:
1) Prepend directories and files with a sort key (e.g. "01_") and strip it from the output file names so as not to pollute them. That works badly for directories, since they must keep the key data so as not to break the directory structure. That ends with an ugly "01_Introduction"...
2) Put a config file into each directory with information on how to sort the directory's content, to be used by my application. That is error-prone and breaks the keep-it-simple, no-metadata approach.
Do you have an idea? What would you do?
If your goal is to effectively avoid metadata, then I'd go with some variation of option 1.
I really do not find 01_Introduction to be ugly, at all.
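If you do strip the key for the output, the transformation is small. A sketch, assuming keys of the form NN_ (the function name is made up for illustration):

#include <iostream>
#include <regex>
#include <string>

// Strip a leading sort key such as "01_" from a file or directory
// name when generating output; the source tree keeps its keys.
std::string stripSortKey(const std::string& name) {
    static const std::regex key{R"(^\d+_)"};
    return std::regex_replace(name, key, "");
}

int main() {
    std::cout << stripSortKey("01_Introduction") << '\n'; // prints "Introduction"
}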
