I'm developing a desktop app in `Electron` which allows a "non-pro" user to import (copy) images from their local drive into a project directory they created earlier. Through the platform dialog (macOS or Windows), the user can select single or multiple images, or single or multiple directories, which may also include sub-directories.
I know how to handle the coding, but I am stumped on a strategy to avoid naming conflicts, particularly since images may come from cameras that use a simple naming scheme, so batch imports from different camera sessions can contain files with the same names.
For a simple example, a user could pick both of the "DCIM" directories below, or select files with the same name from within each of those directories.
This is likely a very common programming issue and there must be some solutions which people smarter than me have come up with – but I don't know what this problem is called, in order to search for them.
The solution that I've seen is to look for a naming conflict and then append something to the name of the thing being imported, before the extension. So you'll see files named foo.txt, foo-001.txt, foo-002.txt, and so on.
If you expect a great many conflicts, the appended text should be random instead of sequential. That's because it takes 51 duplicate checks before settling on foo-050.txt, but on average only 2.0000214334705... checks before settling on something like foo-kyc.txt. The performance difference can be quite noticeable after many conflicts across many files.
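The idea is language-agnostic (in Electron you would do the same with Node's fs module); here is a minimal C++17 sketch of the random-suffix variant, assuming a destination directory and source filename are already known:

// Sketch only: resolve an import-name collision by appending a short random
// suffix before the extension, e.g. IMG_0001.jpg -> IMG_0001-4k7q.jpg.
#include <filesystem>
#include <random>
#include <string>

namespace fs = std::filesystem;

// Returns a path inside destDir that does not collide with an existing file.
fs::path uniqueDestination(const fs::path& destDir, const fs::path& sourceName) {
    fs::path candidate = destDir / sourceName;
    if (!fs::exists(candidate)) return candidate;

    const std::string stem = sourceName.stem().string();
    const std::string ext  = sourceName.extension().string();

    // A random suffix keeps the number of existence checks near constant
    // even when many files with the same camera name are imported.
    static std::mt19937 rng{std::random_device{}()};
    static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);

    do {
        std::string suffix(4, ' ');
        for (char& c : suffix) c = alphabet[pick(rng)];
        candidate = destDir / (stem + "-" + suffix + ext);
    } while (fs::exists(candidate));

    return candidate;
}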
Related
We have a P4 repo where we have multiple source and compiled data files alongside each other.
Normally this wouldn't be a problem, but I work across multiple teams, so there are certain folders where I need the source files, while in the vast majority of cases I don't (and it's literally terabytes of data).
Is it possible in P4 to set up a filter, either on the stream or on the workspace, so that I have something along the lines of:
include root/
exclude *.max
exclude *.ma
exclude *.mb
include root/Engine/Data/Models/Reference/*.max
include root/Engine/Data/Models/Reference/*.ma
include root/Engine/Data/Models/Reference/*.mb
i.e. exclude all .max and Maya (.ma/.mb) files by default, but include them in the Reference folder and its subfolders (bonus points if this can be done from the P4V UI).
I know there's a way to do this file by file, but that's not an option since there are several hundred files in there and new ones are added every few days by artists.
Thanks in advance
From a best-practices point of view (in terms of both conceptual simplicity of workspace mappings and performance of the Perforce server), it would be preferable to organize the depot in such a way that the source and compiled files are in separate folders, rather than only being separable by file extension and/or name. Presumably if the compiled files went into a "generated" folder you could simply map that entire folder as the general rule, add in the specific source folders you need the source for, and call it a day, without needing any tricky overlapping exclude/include logic.
Streams flatly do not allow the level of granularity you're describing; you can "ignore" an extension across the board, and you can include/exclude individual folders, but you can't mix and match those rules as you describe; this is a forcing function for simpler depot structures. Streams are meant to encourage/enforce best practices, and were built in part as a way to constrain users from building arbitrarily complex client views that have historically been shown to be difficult to support.
In a "classic" client view (i.e. a client where the View is constructed manually rather than being auto-generated based on a stream) you do still have that arbitrary level of flexibility, and can build mappings like:
//depot/root/... //my-client/...
-//depot/root/....max //my-client/....max
-//depot/root/....ma //my-client/....ma
-//depot/root/....mb //my-client/....mb
//depot/root/Engine/Data/Models/Reference/*.max //my-client/Engine/Data/Models/Reference/*.max
//depot/root/Engine/Data/Models/Reference/*.ma //my-client/Engine/Data/Models/Reference/*.ma
//depot/root/Engine/Data/Models/Reference/*.mb //my-client/Engine/Data/Models/Reference/*.mb
with the caveat that the performance does not scale arbitrarily -- in particular, combining a large number of rules like this with a similar set of rules in the protection table can significantly slow down server operations that need to join the two mappings (which is most of them).
Vista introduced the IFileDialog::SetFilter method, which allowed me to set up a filter that is called for every potential filename to decide whether it should be shown to the user.
Microsoft removed that in Windows 7, and didn't support it in XP.
I am trying to customize our Open File dialog so that I can control which files are displayed to the end user. These files are marked internally with a product code; there isn't anything in the filename itself to filter on (hence file-extension filters are not useful here; I need to actually interrogate each file to see whether it falls within the extra filter parameters that our users specified).
I would guess that Microsoft removed the SetFilter interface because too often it was too slow. I can imagine all sorts of similar ideas to this one which don't scale well for networks and cloud storage and what have you.
However, I need to know if there is an alternative interface that accomplishes the same goal, or if I really am restricted to only looking at the file extension for filtering purposes in my File dialogs?
Follow-up:
After looking further into CDN_INCLUDEITEM, which requires the pre-Vista version of OPENFILENAME, I have found that it is the most useless API imaginable. It only filters NON-filesystem objects. In other words, you can't use it to filter files. Or folders. The very things one would filter 99.99% of the time in a file open or save dialog. Unbelievable!
There is a very old article by Paul DiLascia which offers the technique of removing each offending filename from the list view control each time the list view is updated.
However, I know from bitter experience that the list view can update over time. If you're looking at a large folder (many items) or the connection is a bit slow (heavily loaded server and/or large number of files), then the files are added to the dialog piecemeal. So one would have to filter out offending filenames repeatedly.
In fact, our current customized file-open dialog uses a timer to look at the view's list of filenames periodically to see if any files of a given pattern exist, in order to enable another control. Otherwise it's possible to check for the existence of these files, find none, and a moment later the view updates with more filenames, yet no events are sent to your dialog to indicate that the view has changed. In fact, my experience from writing and maintaining code for the common-controls file dialogs over the years has been that Microsoft is not very clueful when it comes to how to write such a thing. Events are incomplete, sent at unhelpful times, repeated when not necessary, and whole classes of useful notifications don't exist.
Sadly, I think I might have to give up on this idea, unless someone has a thought as to how I might keep up with the view spontaneously changing while the user is trying to interact with it (i.e. it would be awkward to go deleting entries from the list view and changing the user's visual position, highlighted files, scroll position, etc.).
You need to initialise the callbacks for your CFileDialog, then process the CDN_INCLUDEITEM notification code to include or exclude items.
You can also check this great article; the author uses some other approaches in addition to callbacks.
As you have already discovered, starting in Windows 7 it is no longer possible to filter out files from being displayed based on content, only file extension. You can, however, validate that the user's selected file(s) are acceptable to you before allowing the dialog to close, and if they are not then display a message box to the user and keep the dialog open. That is the best you will be able to do unless you create your own custom dialog.
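For the Vista-style common item dialog, that validation hook is IFileDialogEvents::OnFileOk: returning S_FALSE keeps the dialog open after you've told the user why. A minimal sketch, where IsAcceptableProductFile() is a hypothetical stand-in for your product-code check and only the single-selection case is handled:

// Sketch: veto unacceptable selections from IFileDialogEvents::OnFileOk.
#include <windows.h>
#include <shobjidl.h>

// Hypothetical: opens the file and checks it for our internal product code.
bool IsAcceptableProductFile(PCWSTR path);

class DialogEventHandler : public IFileDialogEvents
{
    LONG m_ref = 1;
public:
    // IUnknown boilerplate
    IFACEMETHODIMP QueryInterface(REFIID riid, void** ppv)
    {
        if (riid == IID_IUnknown || riid == __uuidof(IFileDialogEvents)) {
            *ppv = static_cast<IFileDialogEvents*>(this);
            AddRef();
            return S_OK;
        }
        *ppv = nullptr;
        return E_NOINTERFACE;
    }
    IFACEMETHODIMP_(ULONG) AddRef() { return InterlockedIncrement(&m_ref); }
    IFACEMETHODIMP_(ULONG) Release()
    {
        ULONG ref = InterlockedDecrement(&m_ref);
        if (ref == 0) delete this;
        return ref;
    }

    // Called when the user presses Open; returning S_FALSE vetoes the close.
    IFACEMETHODIMP OnFileOk(IFileDialog* pfd)
    {
        IShellItem* psi = nullptr;
        if (FAILED(pfd->GetResult(&psi))) return S_OK;
        PWSTR path = nullptr;
        bool ok = SUCCEEDED(psi->GetDisplayName(SIGDN_FILESYSPATH, &path))
                  && IsAcceptableProductFile(path);
        if (path) CoTaskMemFree(path);
        psi->Release();
        if (!ok) {
            MessageBoxW(nullptr, L"That file was not produced by this product.",
                        L"Open", MB_OK | MB_ICONWARNING);
            return S_FALSE;   // keep the dialog open
        }
        return S_OK;
    }

    // The remaining notifications are not interesting here.
    IFACEMETHODIMP OnFolderChanging(IFileDialog*, IShellItem*) { return S_OK; }
    IFACEMETHODIMP OnFolderChange(IFileDialog*) { return S_OK; }
    IFACEMETHODIMP OnSelectionChange(IFileDialog*) { return S_OK; }
    IFACEMETHODIMP OnShareViolation(IFileDialog*, IShellItem*, FDE_SHAREVIOLATION_RESPONSE*) { return S_OK; }
    IFACEMETHODIMP OnTypeChange(IFileDialog*) { return S_OK; }
    IFACEMETHODIMP OnOverwrite(IFileDialog*, IShellItem*, FDE_OVERWRITE_RESPONSE*) { return S_OK; }
};

// Usage: DWORD cookie; pfd->Advise(new DialogEventHandler(), &cookie);
// show the dialog, then pfd->Unadvise(cookie).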
My dearest stackoverflowers,
I want to access the serialized data contained in files with extensions that are strange, to me at least. The bulk of the data seems to be in a .st and an .idt file.
The program is meant to be run on Windows, and the unix file command gives me only false positives. Any ideas on either what these extensions mean or on how to investigate and extract their contents?
Below I provide the entire list of extensions in the hope that somebody recognizes them. Googling also gives me false positives; for example, .st is commonly used for Atari emulation files.
Thanks in advance!
.cix
.cmp
.cnt
.dam
.das
.drf
.idt
.irc
.lxp
.mp
.mbr
.str
.vlf
.rpf
.st
Some general advice on how to approach this:
One way to approach this is to use a site like http://filext.com/ to try to figure out where the files came from. This can be tough, because it's not like there's a file extension standard anywhere - anyone can use any extension, so you're going to have a lot of conflicts/disambiguation issues to solve.
Sometimes you can get lucky: if you open the files in a plain text editor, you can occasionally see readable string data, which helps identify the general sort of data contained in a file and therefore narrows down the possible sources. For example, I have often helped people who received an email attachment with no extension figure out what file type it was using this technique, then add the file extension and open it in the appropriate program.
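If you'd rather not eyeball the raw bytes in an editor, a tiny `strings`-style scan does the same job. This is just a sketch: it prints runs of printable ASCII from whatever file you pass on the command line, which is often enough to spot vendor names or format markers:

// Sketch: print runs of printable ASCII from an unknown binary file,
// similar to the Unix `strings` tool.
#include <cctype>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
    if (argc < 2) {
        std::cerr << "usage: peek <file>\n";
        return 1;
    }
    std::ifstream in(argv[1], std::ios::binary);
    std::string run;
    const std::size_t minLen = 4;   // ignore tiny accidental runs
    char c;
    while (in.get(c)) {
        if (std::isprint(static_cast<unsigned char>(c))) {
            run += c;
        } else {
            if (run.size() >= minLen) std::cout << run << '\n';
            run.clear();
        }
    }
    if (run.size() >= minLen) std::cout << run << '\n';
    return 0;
}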
There are also sites like http://www.oldversion.com/ that keep old versions of programs that you can (typically) download for free. This is especially helpful if the data you're working with was created 5+ years ago in a proprietary program that is no longer available or purchasable from the vendor who created it.
Once you have a good idea of which files belong to which programs, you're probably going to spend a lot of time trying to find online resources describing the structure of the files. If that isn't available, you may be able to get a copy of the original program; if the program won't open the files you're interested in, or you still want raw access to the data, then try generating some sample output files from data you input yourself and go Rosetta Stone on it, comparing your known file to the original file.
From there, the additional knowledge you'll probably want is what language/compiler the software was written in, which can give you a lead on what code libraries were used to serialize the data in the first place. Once you know all that, it's a matter of reading through any available documentation on the serialization process and then writing a deserializer.
The one thing this technique won't solve: if you're dealing with corrupt or truncated data files, it may be very difficult to tell the difference between a damaged file and a file structure you simply have wrong. The "Rosetta Stone" technique might be helpful in that case.
Depending on how many different pieces of source software you're talking about, this sounds like a pretty big project. Good luck!
I am making a protocol, client and server which provide file transfer functionality similar to FTP (among other features). One difference between my protocol and FTP is that I would like to store a copy of the remote server's directory structure in a local cache. The server will only be running on Windows (written in C++) so any applicable Win32 API calls would be appreciated (if any). When initially connected, the client requests the immediate children (both files and directories, just like "ls" or "dir" with no options), then when a user navigates into a directory, this step repeats with the new parent like you might expect.
Of course, most of the time, if the same directory of a given server is requested twice by a client, the directory's contents will be the same. Therefore I would like to cache the results of each directory listing on the client. I would like a simple way of implementing this, but it would need to take into account expiring cache entries because of file/directory access and modification time and name changes, which is the tricky part. I would ideally like something which would enable almost instant directory listings by the client, with something like a hash which takes into account not only file contents, but also changes in subdirectories' contents' filenames, data, modification and access dates, etc.
This is NOT something that could completely rely on FileSystemWatcher (or similar) objects because it would need to maintain this cache even if the program is only run occasionally. Of course these would be nice to help maintain the cache, but that's only part of the problem.
My best(?) idea so far is using FindFirstFile() and FindNextFile(), and sorting (somehow), concatenating and hashing values found in the WIN32_FIND_DATA structs (with file contents maybe), and using that as a token for expiration (just to indicate change in any of these fields). Then I would have one of these tokens for each directory. When a directory is requested, the server would hash everything and compare that to the cached hash provided by the client, and if it's different, return the normal data, otherwise an HTTP 304 equivalent. Is there a less elaborate way of doing something like this? Does "directory last modified date" take into account every one of its subdirectories' files' modification dates under all circumstances? I'm sure that the built-in Windows indexing service has something like this but ideally I wouldn't need to rely on it.
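A minimal sketch of that token idea, assuming the fingerprint only needs to reflect the names, sizes, attributes and last-write times of a single directory's immediate children (recursing into subdirectories or mixing in file contents are straightforward extensions); the combiner here is plain FNV-1a rather than a cryptographic hash:

// Sketch: fingerprint one directory's immediate children from WIN32_FIND_DATA.
// Entries are sorted by name first so enumeration order can't change the hash.
#include <windows.h>
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

static void fnv1a(uint64_t& h, const void* data, size_t len)
{
    const unsigned char* p = static_cast<const unsigned char*>(data);
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
}

uint64_t DirectoryFingerprint(const std::wstring& dir)
{
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return 0;

    std::vector<WIN32_FIND_DATAW> entries;
    do {
        if (wcscmp(fd.cFileName, L".") != 0 && wcscmp(fd.cFileName, L"..") != 0)
            entries.push_back(fd);
    } while (FindNextFileW(h, &fd));
    FindClose(h);

    std::sort(entries.begin(), entries.end(),
              [](const WIN32_FIND_DATAW& a, const WIN32_FIND_DATAW& b) {
                  return wcscmp(a.cFileName, b.cFileName) < 0;
              });

    uint64_t hash = 1469598103934665603ULL;   // FNV offset basis
    for (const auto& e : entries) {
        fnv1a(hash, e.cFileName, wcslen(e.cFileName) * sizeof(wchar_t));
        fnv1a(hash, &e.ftLastWriteTime, sizeof(e.ftLastWriteTime));
        fnv1a(hash, &e.nFileSizeHigh, sizeof(e.nFileSizeHigh));
        fnv1a(hash, &e.nFileSizeLow, sizeof(e.nFileSizeLow));
        fnv1a(hash, &e.dwFileAttributes, sizeof(e.dwFileAttributes));
    }
    return hash;
}

The client would send its cached fingerprint along with the listing request; if the server computes the same value it answers with the 304-style "not modified" reply, otherwise it returns the fresh listing. (As to the question above: on NTFS a directory's last-write time changes when entries are added, removed, or renamed directly in it, but changes deeper in the tree do not propagate upward, so per-directory tokens like this are safer than trusting an ancestor's timestamp.)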
Because this service is for file sharing, something involving hashes would be especially nice so that I could automatically and efficiently find other people who are sharing a given file, but that's less of a concern than hosing the disk during the hash calculation.
I'm wondering what others who are more experienced than I am with programming would do to solve this problem (rsync and subversion have solved similar problems but not identical).
You're asking a lot of a File System Implementation of Very Little Brain (with apologies to A. A. Milne).
This is actually well-trodden ground, and you'd do well to look at the existing literature on distributed filesystems. AFS comes to mind as an example of a very well-studied approach.
I doubt you'll be able to come up with something useful and accurate without doing some serious homework. Put another way, 'twould be folly to ignore all the prior art.
What should I keep in mind when migrating from processing many small data files to a few large data files in ruby?
Background: I'm a bioinformatician who is processing next generation sequencing data, which produces about one million sequences per run. I previously saved each one of the million sequences to its own file, and did a few processing steps to each sequence, producing a couple of files for each sequence. Unfortunately, having a couple of million files is making file input and output a major bottleneck (and also makes backup slow). (Having millions of files is also discouraged in answers to this question)
I considered using sqlite to store each file, but I want to avoid this option if possible, to avoid adding dependencies.
I suspect that I should write one and only one module for handling the large files, and let all of the processing scripts (which run as independent processes) use this module whenever they want to do input or output. Providing the processing classes with a filestream created with StringIO may be useful for this, as that way they don't need to know how the large files work.
In order to avoid having to read an entire large file when getting input (I want processing of each sequence to be an independent process, so that an analysis of one sequence can't corrupt the analysis of another sequence), I'll have to keep track of where I'm up to in the large input file. Although more sophisticated inter-process communication techniques exist, I might merely use a temporary file to store the character position for IO#seek.
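The offset-bookmark idea is language-agnostic (in Ruby it is just IO#seek plus IO#tell); here is a minimal C++ sketch of the pattern, with hypothetical file names, one record per line, and no locking:

// Sketch: resume reading a large record file from where the previous run stopped.
// "sequences.dat" and "sequences.dat.pos" are hypothetical names.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    const std::string dataFile = "sequences.dat";
    const std::string posFile  = "sequences.dat.pos";

    // Load the saved offset, defaulting to the start of the file.
    std::streamoff offset = 0;
    if (std::ifstream savedPos(posFile); savedPos) savedPos >> offset;

    std::ifstream in(dataFile, std::ios::binary);
    in.seekg(offset);                       // equivalent of IO#seek

    std::string record;
    if (std::getline(in, record)) {
        // ... hand `record` to the independent processing step ...
        std::cout << "processing: " << record << '\n';

        // Remember where the next run should pick up (equivalent of IO#tell).
        std::ofstream posOut(posFile, std::ios::trunc);
        posOut << static_cast<std::streamoff>(in.tellg());
    }
    return 0;
}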
I'll also have to keep in mind that I won't really be able to run multiple processes at once if they're writing to the same file, and that the large file handler will need to flush its output regularly.
I don't know the details of your situation, but the application you are describing -- I want to store a million things and I'd like to access them quickly and flexibly -- sounds like a DB to me. By avoiding tools like sqlite you aren't necessarily avoiding dependencies; you might be trading one kind of dependency for another.
If you do have to roll your own file-based solution, you don't necessarily have to go from one extreme to the other. What about 1000 medium-sized files, dispersed across 10 subdirectories? And those medium-sized files could be .tar archives or something similar (directories in disguise) that, from the point of view of your code, might behave a lot like the 1 million little files you're used to handling. In addition, those .tar files will remain accessible directly from the command-line without any special software.
Maybe those are crazy ideas, but if you're going to avoid a DB and instead whip together something quick and practical, consider options that don't require you to build the moral equivalent of your own DB system.
If this is just a case of storing "a bunch of files", you might just need a simple key/value store like BDB, which you could later swap out for an RDBMS such as MySQL or SQLite, or another key/value store like Tokyo Cabinet.
Any reason for SQLite being such a problem? A robust data storage mechanism might be a much better approach than the 'pile of files' system.