How can I check if a string is a valid file name for windows using R?

How can I check if a string is a valid file name for windows using R? - windows

I've been writing a program in R that outputs randomization schemes for a research project I'm working on with a few other people this summer, and I'm done with the majority of it, except for one feature. Part of what I've been doing is making it really user friendly, so that the program will prompt the user for certain pieces of information, and therefore know what needs to be randomized. I have it set up to check every piece of user input to make sure it's a valid input, and give an error message/prompt the user again if it's not. The only thing I can't quite figure out is how to get it to check whether or not the file name for the .csv output is valid. Does anyone know if there is a way to get R to check if a string makes a valid windows file name? Thanks!

These characters aren't allowed: /\:*?"<>|. So warn the user if it contains any of those.
Some other names are also disallowed: COM, AUX, NUL, COM1 to COM9, LPT1 to LPT9.
You probably want to check that the filename is valid using a regular expression. See this other answer for a Java example that should take minimal tweaking to work in R.
https://stackoverflow.com/a/6804755/134830
You may also want to check the filename length (260 characters for maximum portability, though longer names are allowed on some systems).
Finally, in R, if you try to create a file in a directory that doesn't exist, it will still fail, so you need to split the name up into the filename and directory name (using basename and dirname) and try to create the directory first, if necessary.
That said, David Heffernan gives good advice in his comment to let Windows do the wok in deciding whether or not it can create the file: you don't want to erroneously tell the user that a filename is invalid.
You want something a little like this:
nice_file_create <- function(filename)
{
directory_name <- dirname(filename)
if(!file.exists(directory_name))
{
ok <- dir.create(directory_name)
if(!ok)
{
warning("The directory of that path could not be created.")
return(invisible())
}
}
tryCatch(
file.create(filename),
error = function(e)
{
warning("The file could not be created.")
}
)
}
But test it thoroughly first! There are all sorts of edge cases where things can fall over: try UNC network path names, "~", and paths with "." and ".." in them.

I'd suggest that the easiest way to make sure a filename is valid is to use fs::path_sanitize().
It removes control characters, reserved characters, and Windows-reserved filenames, truncating the string at 255 bytes in length.

Related

Shell help text syntax for repeatable group of arguments

I'm writing a help output for a Bash script. Currently it looks like this:
dl [m|r]… (<file>|<URL> [m|r|<index>]…)…
The meaning that I'm trying to convey (and elsewhere describe with words) is that (after a potential "m" and/or "r") there can be an endless list of sets of arguments. The first argument in each set is always a file or URL and the further arguments can each be "m", "r" or a number. After that, it starts over with a file or URL and so on.
In my special case, I could just write this:
dl [m|r]… (<file>|<URL>) (<file>|<URL>|m|r|<index>)…
This works, because listing a URL and then another URL with nothing in between is allowed, as well as listing an arbitrarily long chain of "m"s (it's just useless to do so) and pretty much any other combination.
But what if that wasn't the case? What if I had for example a command like this:
change (<from> <to>)…
…which would be used e.g. like this:
change from1 to1 from2 to2 from3 to3
Would the bracket syntax be correct here? I just guessed it based on the grouping of (a|b), but I wasn't able to find any documentation that uses this for multiple, non-exclusive arguments that belong together. Is there even a standard for this?

Darwin Streaming Server install problems os x

My problem is the same as the one mentioned in this answer. I've been trying to understand the code and this is what I learned:
It is failing in the file parse_xml.cgi, tries to get messages (return $message{$name}) from a file named messages (located in the html_en directory).
The $messages value comes from the method GetMessageHash in file adminprotocol-lib.pl:
sub GetMessageHash
{
return $ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"}
}
The $ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"} is set in the file streamingadminserver.pl:
$ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"} = $messages{"en"}
I dont know anything about Perl so I have no idea of what the problem can be, for what I saw $messages{"en"} has the correct value (if I do print($messages{"en"}{'SunStr'} I get the value "Sun")).
However, if I try to do print($ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"}{'SunStr'} I get nothing. Seems like $ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"} is not set
I tried this simple example and it worked fine:
$ENV{"HELLO"} = "hello";
print($ENV{"HELLO"});
and it works fine, prints "hello".
Any idea of what the problem can be?

Looks like $messages{"en"} is a HashRef: A pointer to some memory address holding a key-value-store. You could even print the associated memory address:
perl -le 'my $hashref = {}; print $hashref;'
HASH(0x1548e78)
0x1548e78 is the address, but it's only valid within the same running process. Re-run the sample command and you'll get different addresses each time.
HASH(0x1548e78) is also just a human-readable representation of the real stored value. Setting $hashref2="HASH(0x1548e78)"; won't create a real reference, just a copy of the human-readable string.
You could easily proof this theory using print $ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"} in both script.
Data::Dumper is typically used to show the contents of the referenced hash (memory location):
use Data::Dumper;
print Dumper($messages{"en"});
# or
print Dumper($ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"});
This will also show if the pointer/reference could be dereferenced in both scripts.
The solution for your problem is probably passing the value instead of the HashRef:
$ENV{"QTSSADMINSERVER_EN_SUN"} = $messages{"en"}->{SunStr};
Best Practice is using a -> between both keys. The " or ' quotes for the key also optional if the key is a plain word.
But passing everything through environment variables feels wrong. They might not be able to hold references on OSX (I don't know). You might want to extract the string storage to a include file and load it via require.
See http://www.perlmaven.com/ or http://learn.perl.org for more about Perl.

fix code:
$$ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"} = $messages{"en"};
sub GetMessageHash
{
return $$ENV{"QTSSADMINSERVER_EN_MESSAGEHASH"};
}
ref:
https://github.com/guangbin79/dss6.0.3-linux-patch

Increment Serial Number using EXIF

I am using ExifTool to change the camera body serial number to be a unique serial number for each image in a group of images numbering several hundred. The camera body serial number is being used as a second place, in addition to where the serial number for the image is in IPTC, to put the serial number as it takes a little more effort to remove.
The serial number is in the format ###-###-####-#### where the last four digits is the number to increment. The first three groups of digits do not change for each batch I run. I only need to increment that last group of digits.
EXAMPLE
I if I have 100 images in my first batch, they would be numbered:
811-010-5469-0001, 811-010-5469-0002, 811-010-5469-0003 ... 811-010-5469-0100
I can successfully drag a group of images onto my ExifTool Shortcut that has the values
exiftool(-SerialNumber='001-001-0001-0001')
and it will change the Exif SerialNumber Tag on the images, but have not been successful in what to add to this to have it increment for each image.
I have tried variations on the below without success:
exiftool(-SerialNumber+=001-001-0001-0001)
exiftool(-SerialNumber+='001-001-0001-0001')
I realize most likely ExifTool is seeing these as numbers being subtracted in the first line and seeing the second line as a string. I have also tried:
exiftool(-SerialNumber+='1')
exiftool(-SerialNumber+=1)
just to see if I can even get it to increment with a basic, single digit number. This also has not worked.
Maybe this cannot be incremented this way and I need to use ExifTool from the command line. If so, I am learning the command line/powershell (Windows), but am still weak in this area and would appreciate some pointers to get started there if this is the route I need to take. I am not afraid to use the command line, just would need a bit more hand holding then normal for a starting point. I also am learning Linux and could do this project from there but again, not afraid to use it, just would need a bit more hand holding to get it done.
I do program in PHP, JavaScript and other languages so code is not foreign to me. Just experience in writing it for the command-line.
If further clarification is needed, please let me know in the comments.
Your help and guidance is appreciated!

You'll probably have to go to the command line rather than rely upon drag and drop as this command relies upon ExifTool's advance formatting.
Exiftool "-SerialNumber<001-001-0001-${filesequence;$_=sprintf('%04d', $_+1 )}" <FILE/DIR>
If you want to be more general purpose and to use the original serial number in the file, you could use
Exiftool "-SerialNumber<${SerialNumber}-${filesequence;$_=sprintf('%04d', $_+1 )}" <FILE/DIR>
This will just add the file count to the end of the current serial number in the image, though if you have images from multiple cameras in the same directory, that could get messy.
As for using the command line, you just need to rename to remove the commands in the parens and then either move it to someplace in the command line's path or use the full path to ExifTool.
As for clarification on your previous attempts, the += option is used with numbers and with lists. The SerialNumber tag is usually a string, though that could depend upon where it's being written to.

If I understand your question correctly, something like this should work:
1..100 | % {
$sn = '811-010-5469-{0:D4}' -f $_
# apply $sn
}
or like this (if you iterate over files):
$i = 1
Get-ChildItem 'C:\some\folder' -File | % {
$sn = '811-010-5469-{0:D4}' -f $i
# update EXIF data of current file with $sn
$i++
}

boost::filesystem normalize filename

I need normalize file names such that it don't contain any non-portable characters in it. There is portable_file_name but that just checks and returns bool. I need to anyhow convert the given string to a portable name to create files.
Is there any reusable works ?

I assume that you mean some characters (*:;\"?<>/\|) are acceptable as file name or path name characters on some operating systems (Mac OS 9 for instance) but are not acceptable on others (such as Windows XP). Is that correct?
If so, you should probably do the character conversion yourself. I've done this in the past by using a regex to find and replace the unacceptable file name characters with a dash or something that works on all target operating systems. Then, you may safely use the files on both.

Try this:
boost::filesystem3::path portable_file_name;
portable_file_name.normalize();

The best I could come up with so far is:
for (auto &c:name)
{
char test[] = { c,0 };
if (!boost::filesystem::portable_file_name(test))
{
c = '_';
}
}

One of the important steps in doing this is converting file names like ./file or ones pointing to symbolic links into filenames that work on other platforms which might not have these concepts. Boost v1.48.0+ actually has the following functions to do so:
path canonical(const path& p, const path& base = current_path());
path canonical(const path& p, system::error_code& ec);
path canonical(const path& p, const path& base, system::error_code& ec);
This usually involves converting relative paths into absolute ones. These functions are also often called before performing security checks (e.g. is the requested file within the web-root directory of a web server?).
Note that cannonical() requires the file to exist.

How does Windows determine/handle the DOS short name of any given file?

I have a folder with these files:
alongfilename1.txt <--- created first
alongfilename3.txt <--- created second
When I run DIR /x in command prompt, I see these short names assigned:
ALONGF~1.TXT alongfilename1.txt
ALONGF~2.TXT alongfilename3.txt
Now, if I add another file:
alongfilename1.txt
alongfilename2.txt <--- created third
alongfilename3.txt
I see this:
ALONGF~1.TXT alongfilename1.txt
ALONGF~3.TXT alongfilename2.txt
ALONGF~2.TXT alongfilename3.txt
Fine. It seems to be assigning the "~#" according to the date/time that I created the file. Is this correct?
Now, if I delete "alongfilename1.txt", the other two files keep their short names.
ALONGF~3.TXT alongfilename2.txt
ALONGF~2.TXT alongfilename3.txt
When will that ID (in this case, ~1) be released for use in another shortname. Will it ever?
Also, is it possible that a file on my machine has a short name of X, whereas the same file has a short name of Y on another machine? I'm particularly concerned for installations whose custom actions utilize DOS short names.
Thanks, guys.

If I were you, I would never rely on any version of any file system driver (be it Microsoft's, be it another OS's) to be consistent about the algorithm it uses to generate short file names. The exact behavior of the Microsoft Fastfat and NTFS drivers is not "officially" documented (except as somewhat high level overviews) thus are not part of the API contract. What works today might not work tomorrow if you update the driver.
In addition, there is absolutely no requirement that short names contain tilde characters - see for example this post by Raymond Chen.
There's a treasure trove of info to be found about this topic in the MSDN blogs - for example:
Registry key to force Windows to use short filenames
NTFS curiosities (Part I): Short file names
Also, do not rely on the sole presence of alphanumerical characters. Look at the Linux VFAT driver which says, for example, that any combination of uppercase letters, digits, and the following characters is valid: $ % ' ` - # { } ~ ! # ( ) & _ ^. NTFS will operate in compatibility mode with that...

The short filename is created with the file. The algorithm works like this (usually, but see moocha's reply):
counter = 1
stripped_filename = strip_dots(strip_non_ascii_characters(filename))
shortfn = first_6_characters(stripped_filename)
while (file_exists(shortfn + "~" + counter + "." + extension)) {
increment counter by 1
if more digits are added to counter, shorten shortfn by 1
/* e.g. if counter comes to 9 and shortf~9.txt is taken. try short~10.txt next */
}
This means that once the file is created, it will keep its short name until it's deleted.
As soon as the file is deleted, the short name may be used again.
If you move the file somewhere else, it may get a new short name (e.g. you're moving c:\somefilewithlongname.txt ("c:\somefi~1.txt") to d:\stuff\somefilewithlongname.txt, if there's d:\stuff\somefileelse.txt ("d:\stuff\somefi~1.txt"), the short name of the moved file will be somefi~2.txt). It seems that the short name is only persistent within a given directory on a given machine.
So: the short filenames will be generated by the filesystem, usually by the method outlined above. It is better to assume that short filenames are not persistent, as c:\longfi~1.txt on one machine might be "c:\longfilename.txt", whereas on another it might be "c:\longfish_story.txt"; also, when a file is deleted, the short name is immediately available again.

I believe MSDOS stores the association between the long and the short name in a per directory file.
It does not depends on the date/time.
If you move your files in a new directory... this will reset the algo mentionned by Piskvor applies itself again
In the new directory (after a move), you will get:
ALONGF~1.TXT alongfilename1.txt
ALONGF~2.TXT alongfilename2.txt
ALONGF~3.TXT alongfilename3.txt
even though alongfilename2.txt has initially been created third.

This link says how NTFS does it. I would guess it's still the same idea on more recent version.
In Windows 2000, both FAT and NTFS use
the Unicode character set for their
names, which contain several forbidden
characters that MS-DOS cannot read. To
generate a short MS-DOS-readable file
name, Windows 2000 deletes all of
these characters from the LFN and
removes any spaces. Because an
MS-DOS-readable file name can have
only one period, Windows 2000 also
removes all extra periods from the
file name. Next, Windows 2000
truncates the file name, if necessary,
to six characters and appends a tilde
( ~ ) and a number. For example, each
non-duplicate file name is appended
with ~1 . Duplicate file names end
with ~2 , then ~3, and so on. After
the file names are truncated, the file
name extensions are truncated to three
or fewer characters. Finally, when
displaying file names at the command
line, Windows 2000 translates all
characters in the file name and
extension to uppercase.

When the files are provided by a network server which is running Samba, then the short names are generated by the server, and they do not follow a predictable pattern.
So it is not safe to assume that you can predict the form of the short name.
G:\>dir /x *.txt
Directory of G:\
08/25/2009 12:34 PM 1,848 S2XYYV~1.TXT strace_output.txt
03/01/2010 05:32 PM 325,428 TEY7IH~O.TXT tomcat-dump-march-1.txt
03/11/2010 12:01 AM 5,811 DI356A~S.TXT ddmget-output.txt
01/23/2009 01:03 PM 313,880 DLA94Q~K.TXT ddm-log-fn.txt
04/20/2010 07:42 PM 7,491 A50QZP~A.TXT april-20-2010.txt

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio