What is the best character to use as a delimiter in a custom batch syntax?

I've written a little program to download images to different folders from the web. I want to create a quick and dirty batch file syntax and was wondering what the best delimiter would be for the different variables.
The variables might include urls, folder paths, filenames and some custom messages.
So are there any characters that cannot be used for the first three? That would be the obvious choice to use as a delimiter. How about the good old comma?
Thanks!

You can use either:
A control character: control characters almost never appear in URLs, folder paths or filenames. Tab (\t) is probably the best choice here.
Some combination of characters which is unlikely to occur in your data, for example #s#.
Tab is the generally preferred choice though.
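As a rough illustration of the tab-delimited approach, here is a minimal Python sketch. The field order (url, folder, filename, message) and the sample lines are assumptions made up for this example, not part of the original question.

```python
# Hypothetical batch format: one download job per line, fields separated by a tab.
# The field order below is an assumption for illustration only.
FIELDS = ["url", "folder", "filename", "message"]

def parse_batch(text):
    """Return one dict per non-empty line of tab-delimited batch text."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():  # skip blank lines
            continue
        jobs.append(dict(zip(FIELDS, line.split("\t"))))
    return jobs

sample = (
    "http://example.com/a.png\tC:\\images\\cats\ta.png\tFirst image, with a comma\n"
    "\n"
    "http://example.com/b.png\tC:\\images\\dogs\tb.png\tSecond image\n"
)
jobs = parse_batch(sample)
```

Commas, colons and slashes inside the fields need no special handling here, because only the tab separates fields.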

Why not just use something that exists already? There are one or two choices: Perl, Python, Ruby, bash, sh, csh, Groovy, ECMAScript or, heaven forbid, Windows scripting files.
I can't see what you'd gain by writing yet another batch file syntax.

Tabs. And then expand or compress any tabs found in the text.

Choose a delimiter that has the least chance of collision with the names of any variable that you may have (which precludes #, /, : etc). The comma (,) looks good to me (unless your custom message has a few) or < and > (subject to previous condition).
However, you may also need to 'escape' delimiter characters occurring as part of the variables you want to delimit.
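One possible way to escape the delimiter, sketched in Python. The choice of a backslash as the escape character and the function names are assumptions for illustration; any character works as long as it is escaped too.

```python
def escape_field(field, delim=",", esc="\\"):
    """Escape the escape character first, then the delimiter."""
    return field.replace(esc, esc + esc).replace(delim, esc + delim)

def join_fields(fields, delim=",", esc="\\"):
    """Build one delimited line from a list of raw field values."""
    return delim.join(escape_field(f, delim, esc) for f in fields)

def split_fields(line, delim=",", esc="\\"):
    """Split a line on unescaped delimiters and undo the escaping."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == esc and i + 1 < len(line):
            current.append(line[i + 1])  # take the escaped character literally
            i += 2
        elif ch == delim:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields

original = ["http://example.com/a.png", "C:\\images", "Hello, world"]
line = join_fields(original)
restored = split_fields(line)
```

Because Windows paths contain backslashes, the escape character itself has to be doubled when writing, which is what `escape_field` does before it touches the delimiter.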

This sounds like a really bad idea. There is no need to create yet another (data-representation) language; there are plenty of existing ones which might fit your needs. In addition to Ruby, Perl, etc., you may want to consider YAML.
Designing good syntax for this sort of thing is difficult and fraught with peril. Does reinventing the wheel ring a bell?

I would use '|'
It's one of the rarest characters.

How about String.fromCharCode(1) ?

Related

Automatic gettext translation generator for testing (pseudolocalization)

I'm currently in the process of making a site i18n-aware, marking hardcoded strings as translatable.
I wonder if there's any automated tool that would let me browse the site and quickly see which strings are marked and which still aren't. I saw a few projects like django-i18n-helper that try to highlight translated strings using HTML facilities, but this doesn't work well with JavaScript.
So I thought FДЦЖ CУЯILLIC, 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯 or ʇxǝʇ uʍop-ǝpısdn (or something along those lines) should do the trick. Easy to distinguish visually, still readable, yet doesn't depend on any rich text formatting besides Unicode support.
The problem is, I can't find any readily available tool that'd eat gettext .po/.pot file(s) and spew out such a translation. Still, I think the idea is pretty obvious, so there must be something out there already.
In my case I'm using Python/Django, but I suppose this question applies to anything that uses gettext-compatible library. The only thing the tool should be aware of, is that there could be HTML fragments in translation strings.
The msgfilter program will let you run your translations through any program you want. It works especially well with GNU sed.
For example, to turn all your translations into uppercase (HTML is mostly case-insensitive, so this should work):
msgfilter -i django.po sed -e 's/\(.*\)/\U\1/'
The only strings in your app that have lowercase letters in them would then be the hardcoded ones.
If you really want to do faux cyrillic, you just have to write a program or script that reads Latin and outputs that, and feed that program to msgfilter instead of sed.
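A minimal sketch of such a filter in Python, assuming msgfilter hands each translation to the filter on standard input and reads the result from standard output. The look-alike table is a small invented sample, and the tag pattern is a simplification that only protects things shaped like `<...>`; format placeholders such as %(name)s and HTML entities are not protected here and would need extra handling.

```python
import re
import sys

# Illustrative look-alike table (an assumption); extend or replace it to taste.
FAUX = str.maketrans({
    "A": "Д", "B": "Б", "E": "Э", "N": "И", "R": "Я", "W": "Ш",
    "b": "ь", "e": "э", "n": "и", "r": "я", "w": "ш",
})

# Simplified notion of an HTML tag: anything between < and >.
TAG = re.compile(r"(<[^>]*>)")

def fauxify(text):
    """Translate text outside HTML tags, leaving the tags themselves intact."""
    parts = TAG.split(text)
    # With a capturing group, re.split keeps the tags at the odd indices.
    return "".join(p if i % 2 else p.translate(FAUX) for i, p in enumerate(parts))

def main():
    # Read one message from stdin and write the transformed message to stdout.
    sys.stdout.write(fauxify(sys.stdin.read()))

if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as faux.py, it could be plugged in with something like `msgfilter -i django.po -o faux.po python3 faux.py`.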
If your distribution has a talkfilters package, it might provide a few programs that might be useful in this specific case. All of these should work as msgfilter filters. (My personal favorite is chef. Bork bork bork!)
I haven't tried this myself yet, but I found the podebug tool from the Translate Toolkit. Based on the documentation (the flipped and unicode rewrite options), it looks like exactly the tool I wished for.

How to define aspell word delimiters?

Aspell considers words with underscores or dashes as two, e.g. cloud-based is spell checked as "cloud" and "based". Is there any way to specify the word delimiters so as to exclude dash and underscore?
If I understand the question correctly, Aspell cannot do exactly what you want (to my knowledge). This has to do with conditional compound word treatment, which is on Aspell's TODO list.
On the same list it is mentioned that Hunspell does a better job with compound words, so it might be a viable alternative if you're not bound to Aspell.
OpenOffice uses Hunspell for spellchecking, so it is easy to find out whether it fits your requirements. It does, at least, work for the "cloud-based" example, and does NOT consider all hyphenated words unconditional compounds, i.e. "based-cloud" would not be considered a spelling error.
Aspell is unable to do what you want at this point. The interface it uses for handling words with symbols in them is not sophisticated enough to handle such a case. More information on this is listed here.
Sorry that this cannot be solved up to this point, unless you want to implement your own interface. I would recommend using Hunspell as Mikhail suggested.

Why should tags be space-delimited?

I'm working on a Web application that uses tags like Stackoverflow tags. I've noticed that a lot of sites that use tags make them space-delimited, which disallows tags like...
"favorite recipes"
Instead they enforce this...
"favorite-recipes" | "favorite_recipes" | "FavoriteRecipes"
If the tags were comma-delimited, an item could have a set of tags like...
"cats, birds, favorite recipes, horses"
I have to decide on the policy for my app.
I guess I like the idea of space-delimited, but if my users aren't programmers they might be more comfortable with the familiar idea of commas denoting a list.
Why are comma-delimited tags unusual? Is there a major downside to them?
I think space delimited tags feel more like typing and you aren't syntactically bound to commas or pipes etc...
I also think that not allowing spaces in tags is kind of nice because you don't have to worry about turning them into URLs; they are already ready to go.
Online tools like Delicious use space delimiting for tags, but other tools like WordPress allow spaces in tags, so they require commas. If you do things this way, you will have to create tag "slugs" to make sure they work cleanly in a URL (e.g. "my tag" would become "my-tag").
Remember, allowing too much freedom in your tag creation can result in some pretty crazy tags like "Things I like to do on a Saturday afternoon..." etc.
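The tag "slug" idea can be sketched in a few lines of Python. This is a minimal version that assumes ASCII-only slugs are acceptable; characters outside a-z and 0-9 are simply dropped, which would not suit every site.

```python
import re

def slugify(tag):
    """Lowercase the tag and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", tag.lower()))

examples = [slugify("my tag"), slugify("  Favorite   Recipes! ")]
```

Here `examples` ends up as `["my-tag", "favorite-recipes"]`, so the display form of a tag can keep its spaces while the URL uses the slug.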
I think the idea is that a tag should be short and to the point.
If you go to type in "favorite recipes", when it doesn't work, you think to yourself "Oh, I should redo this." So you make "favorites" "recipes" instead.
If it was something you really needed, like "pork roast", then it makes sense to make it "pork-roast", but only after you've thought about really joining those words. Perhaps you shouldn't though - perhaps it should be "pork" "roast" so it shows up under searches for pork and searches for roast.
tl;dr It's for user experience and searching so they don't enter something that can't be easily searched for.
Yes, I also think that space-separated tags reduce complexity. And you can enter them faster, because you have the big space key to separate them. Maybe the brain is also less loaded, because it doesn't need to think about where to put a comma delimiter and where not.
I built an app with tags once, and used commas. The only down side to commas is that you have to do more thorough checking of empty tags. For instance in the example:
"George Bush, Bill Clinton, Barack Obama, "
If someone posts the tags with a comma at the end followed by a space, that lone space will generally get added to the database as a tag.
You cannot fix this by simply deleting every space, because that would turn Bill Clinton into BillClinton.
However, you can require each tag to be at least a certain number of characters long to ensure there are no empty tags. This will not catch three or four spaces in a row, though.
Another thing to note is that humans generally put a space after each comma, before the next word.
So in the example above there is a space after each comma and before each president. This space will be included in the database making the tag:
" Bill Clinton"
instead of
"Bill Clinton"
as the user probably attempted.
Once again you can eliminate both beginning and trailing spaces, but there is more server side code to implement in this case.
If you just use spaces then you can eliminate any unwanted characters like commas etc, and put the tags into an array using space characters to separate them.
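The extra server-side checks described above for comma-delimited tags are small in practice. A minimal Python sketch, with a made-up function name, that trims each piece, collapses repeated spaces and drops empty or duplicate tags:

```python
def parse_comma_tags(raw):
    """Split on commas, normalize whitespace, drop empty and duplicate tags."""
    tags = []
    for piece in raw.split(","):
        tag = " ".join(piece.split())  # trims both ends and collapses runs of spaces
        if tag and tag not in tags:
            tags.append(tag)
    return tags

tags = parse_comma_tags("George Bush, Bill Clinton, Barack Obama, ")
```

With the example input from the answer, `tags` is `["George Bush", "Bill Clinton", "Barack Obama"]`: the trailing comma produces no empty tag, and no tag keeps a leading space.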
I actually prefer tags that are comma or semi-colon delimited for two basic reasons:
It's more natural (user-centric not techie-centric)
It reduces redundancy, as the example noted: "favorite-recipes" "favorite_recipes" "FavoriteRecipes"
And I'm afraid some of the other answers sort of make it sound as if web developers are (how shall I say this) less willing to do the work than, for instance, database developers - who've been addressing some of these same issues for years. (I've coded for web and client-server databases.)

Delimiter for meta data in Windows file name

I'm working on maintenance of an application that transfers a file to another system and uses a structured filename to include meta data including a language code. The current app uses a two character language code and a dash/hyphen for a delimiter.
Ex. Canada-EN-ProdName-ProdCode.txt
I'm converting it to use IETF language codes, so the dash delimiter won't do and I need a replacement. I'm trying to choose a delimiter that avoids future errors and am considering the tilde ~.
Ex. Canada~en-GB~ProdName~ProdCode.txt
This will be used only on Windows Server 2003+ systems. I certainly didn't come up with this system of parsing a filename to get meta data. Unfortunately, I can't include this in the file itself, and the destination system is expecting the language code to be in IETF format with the dash.
Any thoughts on potential issues with using the tilde in the filename, or perhaps a better character to use? I'm just looking for a second opinion in case I'm overlooking a possible failure. I believe Windows will use the tilde when shortening a long filename to 8.3 format, but I don't see that as an issue here as the OSs can handle long filenames.
The tilde is probably fine, but what's wrong with the good old underscore _ ? It has no special meaning on either windows or unix, and makes names that are relatively easy to read. If there are no other special considerations, I would avoid the tilde solely out of paranoia, since windows does use it as a special character sometimes, as you mentioned.
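Whichever character is chosen, the parsing side can stay the same if the delimiter is a parameter. A minimal Python sketch, assuming the four-part layout from the question; the field names are invented for illustration:

```python
import os

# Field names are assumptions based on the example filename in the question.
FIELDS = ["country", "language", "product_name", "product_code"]

def parse_filename(filename, delimiter="~"):
    """Split the filename (without extension) into its metadata fields."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    parts = stem.split(delimiter)
    if len(parts) != len(FIELDS):
        # A delimiter inside a product name or code would land here.
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}: {filename!r}")
    return dict(zip(FIELDS, parts))

with_tilde = parse_filename("Canada~en-GB~ProdName~ProdCode.txt")
with_underscore = parse_filename("Canada_en-GB_ProdName_ProdCode.txt", delimiter="_")
```

Both calls return the same dictionary, and the dash inside en-GB survives because it is no longer the delimiter. The length check is there to catch names where the delimiter also appears inside a field.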
For anyone reading this question, I would strongly recommend anything but the tilde in the file name, or at least careful testing for speed problems in any .NET path work where a tilde is present.
I used this as a file name delimiter some time ago. I couldn't understand why simply getting a list of files from the folders was taking so long. It was a number of years later (having written a lot of speed-up code that had marginal advantage) that I discovered there is a problem (with DirectoryInfo(path).Name in .NET at least) where the simple existence of the tilde was forcing the underlying code to jump through a lot of hoops.
The slow down was substantial (it was over a network so I had thought it was bandwidth/Network issues for a fair while)
I understand this is a legacy overhang for when alternative short versions of filenames could be used for Windows files.
I am now stuck with the tilde in these file names but, given that the problem lay in some of the .NET path functions (I don't actually know if it still does), I could work around it by spotting a tilde and creating my own answers when it existed rather than passing it through.
If in any doubt just run speed tests with and without the tilde in filenames for say just 500-1,000 files.

Should tags use comma or space

What is your opinion on whether a tagging user interface widget should require commas or spaces as the delimiter? For example, this site uses spaces, requiring multi-word tags to use a hyphen. I assumed this was some design suggestion from Joel; but then I realized that Facebook and WordPress use commas.
So what should it be? Or does it not matter much? Let's suppose the users of this widget are generally computer literate but not terribly so.
I would try to think about the domain of the tags and figure out what is the likelihood that potential tags would contain spaces.
For example, most things on this site are single word or acronyms, so it's not difficult to use spaces.
On the other hand, when tagging Facebook photos, for example, an average tag is something like "spring break", "frat party", "random hookup", "secretary of state", etc. So dealing with space interpretation or with quotes is more difficult, hence commas make more sense.
I'm not familiar with a specific rule.
If you're thinking of tag clouds though, spaces make less sense.
Be fault-tolerant, if possible. For example, would it work to use whatever is provided? The following two inputs could result in the same, if parsed nicely:
foo bar "hello world"
foo, bar, hello world
Both would result in three obvious tags.
I realize that this would make it hard to parse the following input unambiguously:
hello world
In that case, I'd probably read two distinct tags.
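A minimal Python sketch of this fault-tolerant parsing, under the assumption that any comma in the input means "comma-delimited" and that otherwise spaces separate tags unless a tag is quoted. The function name is made up for this example.

```python
import shlex

def parse_tags(raw):
    """Accept either comma-delimited tags or space-delimited tags with quotes."""
    if "," in raw:
        pieces = raw.split(",")
    else:
        try:
            pieces = shlex.split(raw)  # honors "quoted multi-word tags"
        except ValueError:
            pieces = raw.split()  # unbalanced quote: fall back to plain splitting
    # Normalize whitespace inside each tag and drop empty ones.
    return [" ".join(p.split()) for p in pieces if p.strip()]

quoted = parse_tags('foo bar "hello world"')
commas = parse_tags("foo, bar, hello world")
ambiguous = parse_tags("hello world")
```

Both `quoted` and `commas` come out as `["foo", "bar", "hello world"]`, while `ambiguous` is read as the two tags `["hello", "world"]`, matching the behavior described above.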
commas. it is more natural. you can use words that include spaces more easily. other solutions seem too complicated for human beings (maybe not for programmers but they think different - remember the "u" in gui stands for "user")
i would go for a comma as it is more natural to separate multiple word tags by commas than to use hyphens or other less usable replacement techniques
I don't think it matters. I think for a programming site, most of your tags will not be multi-word so it makes sense to use a space delimiter. But I think you could make a very compelling argument either way and it really just comes down to personal choice.
