Trick to use file paths with spaces in Mallet (Terminal, OS X)? - macos

Is there a trick to be able to use file paths with spaces in Mallet through the terminal on mac?
For example, all of the following give me errors:
escaping the space
./bin/mallet import-dir --input /Volumes/Macintosh\ HD/Users/MY_NAME/Desktop/en --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
double quotes, no escapes
./bin/mallet import-dir --input "/Volumes/Macintosh HD/Users/MY_NAME/Desktop/en" --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
double quotes, with the space escaped
./bin/mallet import-dir --input "/Volumes/Macintosh\ HD/Users/MY_NAME/Desktop/en" --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
and finally single quotes, with the space escaped
./bin/mallet import-dir --input '/Volumes/Macintosh\ HD/Users/MY_NAME/Desktop/en' --output /Users/MY_NAME/Desktop/en.mallet --remove-stopwords TRUE --keep-sequence TRUE
They all want to treat the folder as multiple folders, split on the space:
Labels =
/Volumes/Macintosh\
HD/Users/MY_NAME/Desktop/en
Exception in thread "main" java.lang.IllegalArgumentException: /Volumes/Macintosh\ is not a directory.
at cc.mallet.pipe.iterator.FileIterator.<init>(FileIterator.java:108)
at cc.mallet.pipe.iterator.FileIterator.<init>(FileIterator.java:145)
at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:322)
Is there any way around this, other than renaming all of my files with spaces to underscores? (I understand that I don't need to type /Volumes/Macintosh\ HD/... but can just start at /Users. This was just an example.)

The issue is that import-dir is designed to take multiple directories as input. The argument parser would need a way to distinguish this use case from the "escaped space" use case, keeping in mind that Windows paths can end in \.
The best way to support both cases might be to add a --single-input option that would take its argument as a single string.
I also find that the spreadsheet-style import-file command is almost always preferable to working with directories.
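As a rough sketch of what feeding import-file could look like: the loop below flattens each document in a directory into one line of the form "name label text", which is the layout the Mallet import page describes for import-file. The directory, the file names, and the label "en" are made up for illustration. Spaces in file names are replaced with underscores so that the instance name stays a single token.

```shell
# Hypothetical corpus directory whose path contains a space.
workdir=$(mktemp -d)
corpus="$workdir/Macintosh HD/en"
mkdir -p "$corpus"
printf 'first line\nsecond line\n' > "$corpus/doc one.txt"
printf 'another document\n' > "$corpus/doc2.txt"

# One instance per line: name <TAB> label <TAB> text.
examples="$workdir/en.tsv"
: > "$examples"
for f in "$corpus"/*.txt; do
    name=$(basename "$f" | tr ' ' '_')   # instance name without spaces
    text=$(tr '\n' ' ' < "$f")           # flatten the document onto one line
    printf '%s\t%s\t%s\n' "$name" "en" "$text" >> "$examples"
done
cat "$examples"

# The resulting file could then be imported with something like:
#   ./bin/mallet import-file --input "$examples" --output en.mallet
```

Only the quoted shell variables ever see the path with the space; Mallet itself would receive a single space-free file name.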

As a workaround you could:
(1) write some code to read the directory contents and generate a single examples file for use with:
bin/mallet import-file
Here's the Mallet quick-start page for importing, which describes the import-file version: http://mallet.cs.umass.edu/import.php
(2) Generate a symbolic link to the folder in a location without any spaces in it
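A minimal sketch of option (2), using a throwaway directory in place of the real corpus. The paths and the link name en_link are hypothetical; the point is only that the link's own path contains no space.

```shell
# Hypothetical directory whose path contains a space (stand-in for the real corpus).
workdir=$(mktemp -d)
src="$workdir/Macintosh HD/en"
mkdir -p "$src"
printf 'hello world\n' > "$src/doc1.txt"

# Create a symbolic link to it at a location without spaces.
link="$workdir/en_link"
ln -s "$src" "$link"

# The files are reachable through the link.
ls "$link"

# The link could then be passed to Mallet in place of the original path:
#   ./bin/mallet import-dir --input "$link" --output en.mallet
```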

Related

Renaming the file Directory which contains Space based on CSV in Shell

I need to rename the files inside the folder that has a space in it eg(Deco/main library/file1.txt )
code:
while IFS="," read orig new pat
do
mv -v $pat$new $pat$orig
done < new.csv
csv file:
newname,file1.txt,Deco/main\\\ library/
error:
mv: invalid option -- '\'
Welcome to Stackoverflow!
First: Use quotes around variable expansions. Except on very rare occasions, you should always write "$foo" instead of $foo, because with the latter the shell is supposed to (and will) interpret spaces in the variable's value as word delimiters, which you rarely want. In your case especially, you do not want it.
Second: Your CSV file seems to contain backslashes to quote the spaces. And some additional step seems to have added another level of quotation, so that you now end up with three backslashes and a space for each original space. If this really is the case (please double check that what you wrote in your question is correct, otherwise my answer doesn't fit), you need to unquote this before you can use it.
There are security issues involved in using eval, so do not use it lightly (this disclaimer is necessary whenever proposing to use eval), but if you have trust in the input you are handling to not contain any nastinesses, then you can do this using this code:
while IFS="," read orig new pat
do
eval eval mv -v "$pat$new" "$pat$orig"
done < new.csv
Using this, two levels of quotation are evaluated (that's what eval does) before the mv command is executed.
I strongly suggest doing a dry run first by adding echo before the mv. The commands are then merely printed instead of executed.
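If the backslashes in the path column are the only escaping present, a possible alternative that avoids eval is to read the line raw and strip the backslashes before quoting the result. This is only a sketch under that assumption, run against a throwaway directory; the layout mirrors the CSV line from the question.

```shell
# Throwaway copy of the situation from the question.
workdir=$(mktemp -d)
mkdir -p "$workdir/Deco/main library"
touch "$workdir/Deco/main library/file1.txt"
printf '%s\n' 'newname,file1.txt,Deco/main\\\ library/' > "$workdir/new.csv"

# read -r keeps the backslashes as-is; tr then removes all of them.
while IFS=, read -r orig new pat
do
    pat=$(printf '%s' "$pat" | tr -d '\\')
    # Quoted expansions keep "main library" as part of one argument.
    mv -v -- "$workdir/$pat$new" "$workdir/$pat$orig"
done < "$workdir/new.csv"
```

This would not work for paths that legitimately contain backslashes, since every backslash is removed.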

addsuffix behaviour in tcsh complete

I am working with a Makefile for conversion of documents.
To specify which document to convert, I have to give folder names in 2 make-variables:
NAME and DATE.
The directory structure is /data/$(NAME)/$(DATE)
NAME may contain numbers and characters.
DATE follows this format: YYYYMMDD_XXXXXXXX where X is a hex-char.
I want to make complete suggest the NAME and DATE variables in tcsh (mandatory use on site), because it is annoying to enter those random X-chars.
I ended up having the following to suggest me the NAME variable:
'c#{NAME}=#D:/data#' \
'C/N*/(NAME=)/'
This works as expected as long as I do unset addsuffix.
make N[TAB] » make NAME=[TAB] » make NAME=10001.1
If addsuffix is set, the cursor will be after the trailing whitespace in make NAME=.
For complete a new word starts, so it does not suggest me the directories for NAME then.
If I go to NAME= then, it also adds the trailing / to folder name, which is not needed.
Is there a way to disable this behaviour for these completions?
tcsh.org states:
addsuffix If set, filename completion adds `/' to the end of directories and a space to the end of normal files when they are matched exactly. Set by default.
Obviously I want to keep the behaviour (as set by the user) for other completion.
The [suffix] is what I want to be void.
from documentation:
complete [command [word/pattern/list[:select]/[[suffix]/] ...]]
So I end up with
complete make \
'c#NAME=#D:/data/#' \
'c#DATE=#`echo $COMMAND_LINE | sed -f /data/sandbox/sbulka/tmp/sed-tmp.sed | xargs ls`#' \
'C/N*/(NAME=)//' \
'n/NAME=/(DATE=)//'
The sed is in a file so I don't have to bother quoting. Looks like this:
s/^.*NAME=\([^ ]*\).*$/\/data\/\1/
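To illustrate what that sed expression does, here it is applied to a sample command line in a POSIX shell. The sample line is made up; in the completion it would come from $COMMAND_LINE.

```shell
# Hypothetical command line, standing in for tcsh's $COMMAND_LINE.
command_line='make NAME=10001.1 DATE='

# Same expression as in the sed file: keep only the NAME value and prefix it with /data/.
name_dir=$(printf '%s\n' "$command_line" | sed 's/^.*NAME=\([^ ]*\).*$/\/data\/\1/')
printf '%s\n' "$name_dir"   # prints /data/10001.1
```

The resulting directory is what xargs ls then lists to produce the DATE candidates.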

How to write a script to fetch the address of the links to .rar files on a webpage?

Have a pile of 50 .rar files on a web server and I want to download them all.
And, the names of the files have nothing in common other than .rar.
I wanted to try aria2 to download all of them altogether, but I think I need to write a script to fetch the addresses of all the .rar files.
I have no idea how to start writing the script. Any hint will be appreciated.
You can try to play with wget with -A parameter in your shell script:
wget -r "https://foo/" -P /tmp -A "*.rar"
Here is an explanation of what -A does
Specify comma-separated lists of file name suffixes or patterns to accept or reject (see Types of Files). Note that if any of the wildcard characters, ‘*’, ‘?’, ‘[’ or ‘]’, appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix. In this case, you have to enclose the pattern into quotes to prevent your shell from expanding it, like in ‘-A "*.mp3"’ or ‘-A '*.mp3'’.
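If you would rather collect the addresses yourself and hand them to aria2, one rough way is to pull the href values out of the page's HTML. The sketch below works on a made-up HTML snippet and assumes the links appear as double-quoted href attributes; real pages may need a proper HTML parser.

```shell
# Sample HTML standing in for the downloaded index page (hypothetical content).
html='<a href="files/alpha.rar">a</a> <a href="notes.txt">n</a>
<a href="https://foo/b/beta.rar">b</a>'

# Extract every double-quoted href value that ends in .rar, one per line.
rar_links=$(printf '%s\n' "$html" \
    | grep -oE 'href="[^"]*\.rar"' \
    | sed -E 's/^href="(.*)"$/\1/')
printf '%s\n' "$rar_links"

# Saved to a file, the list could be passed to aria2 with: aria2c -i links.txt
```

Relative links such as files/alpha.rar would still need the site's base URL prepended before downloading.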

Why do bash parameter expansions cause an rsync command to operate differently?

I am attempting to run an rsync command that will copy files to a new location. If I run the rsync command directly, without any parameter expansions on the command line, rsync does what I expect
$ rsync -amnv --include='lib/***' --include='arm-none-eabi/include/***' \
--include='arm-none-eabi/lib/***' --include='*/' --exclude='*' \
/tmp/from/ /tmp/to/
building file list ... done
created directory /tmp/to
./
arm-none-eabi/
arm-none-eabi/include/
arm-none-eabi/include/_ansi.h
...
arm-none-eabi/lib/
arm-none-eabi/lib/aprofile-validation.specs
arm-none-eabi/lib/aprofile-ve.specs
...
lib/
lib/gcc/
lib/gcc/arm-none-eabi/
lib/gcc/arm-none-eabi/4.9.2/
lib/gcc/arm-none-eabi/4.9.2/crtbegin.o
...
sent 49421 bytes received 6363 bytes 10142.55 bytes/sec
total size is 423195472 speedup is 7586.32 (DRY RUN)
However, if I enclose the filter arguments in a variable, and invoke the command using that variable, different results are observed. rsync copies over a number of extra directories I do not expect:
$ FILTER="--include='lib/***' --include='arm-none-eabi/include/***' \
--include='arm-none-eabi/lib/***' --include='*/' --exclude='*'"
$ rsync -amnv ${FILTER} /tmp/from/ /tmp/to/
building file list ... done
created directory /tmp/to
./
arm-none-eabi/
arm-none-eabi/bin/
arm-none-eabi/bin/ar
...
arm-none-eabi/include/
arm-none-eabi/include/_ansi.h
arm-none-eabi/include/_syslist.h
...
arm-none-eabi/lib/
arm-none-eabi/lib/aprofile-validation.specs
arm-none-eabi/lib/aprofile-ve.specs
...
bin/
bin/arm-none-eabi-addr2line
bin/arm-none-eabi-ar
...
lib/
lib/gcc/
lib/gcc/arm-none-eabi/
lib/gcc/arm-none-eabi/4.9.2/
lib/gcc/arm-none-eabi/4.9.2/crtbegin.o
...
sent 52471 bytes received 6843 bytes 16946.86 bytes/sec
total size is 832859156 speedup is 14041.53 (DRY RUN)
If I echo the command that fails, it generates the exact command that succeeds. Copying the output, and running directly gives me the expected result.
There is obviously something I'm missing about how bash parameter expansion works. Can somebody please explain why the two different invocations produce different results?
The shell parses quotes before expanding variables, so putting quotes in a variable's value doesn't do what you expect -- by the time they're in place, it's too late for them to do anything useful. See BashFAQ #50: I'm trying to put a command in a variable, but the complex cases always fail! for more details.
In your case, it looks like the easiest way around this problem is to use an array rather than a plain text variable. This way, the quotes get parsed when the array is created, each "word" gets stored as a separate array element, and if you reference the variable properly (with double-quotes and [@]), the array elements get included in the command's argument list without any unwanted parsing:
filter=(--include='lib/***' --include='arm-none-eabi/include/***' \
--include='arm-none-eabi/lib/***' --include='*/' --exclude='*')
rsync -amnv "${filter[@]}" /tmp/from/ /tmp/to/
Note that arrays are available in bash and zsh, but not all other POSIX-compatible shells. Also, I lowercased the filter variable name -- recommended practice to avoid colliding with the shell's special variables (which are all uppercase).
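A small demonstration of the difference, using a helper function in place of rsync. The function name first_arg and the shortened filter list are made up for illustration, and the example needs bash or zsh because of the array.

```shell
# Helper that prints the first argument it receives, exactly as received.
first_arg() { printf '%s\n' "$1"; }

# Quotes stored inside a string are just characters; they are not parsed again.
filter_string="--include='lib/***' --exclude='*'"
from_string=$(first_arg $filter_string)

# Quotes in an array assignment are parsed once, when the array is created.
filter_array=(--include='lib/***' --exclude='*')
from_array=$(first_arg "${filter_array[@]}")

printf '%s\n' "$from_string"   # prints --include='lib/***'
printf '%s\n' "$from_array"    # prints --include=lib/***
```

With the string variable, rsync receives a pattern that starts and ends with a literal single quote, so the pattern no longer matches the intended paths.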
I like to break the arguments onto separate lines, for convenience's sake:
ROPTIONS=(
-aNHXxEh
--delete
--fileflags
--exclude-from="$EXCLUDELIST"
--delete-excluded
--force-change
--stats
--protect-args
)
and then call it thusly:
rsync "${ROPTIONS[@]}" "$SOURCE" "$DESTINATION"

How to delete files like 'Incoming11781rKD'

I have a programme that generates files with names like "Incoming11781Arp". The name always starts with Incoming, followed by 5 digits, followed by 3 characters that can be any mix of upper-case letters, lower-case letters, digits, or the special character _. Examples: Incoming11781_pi, Incoming11781rKD.
How can I delete them using a script run from a cron job please? I've tried -
#!/bin/bash
file=~/Mail/Incoming******
rm "$file";
but it failed saying that there was no matching file or directory.
You mustn't double-quote the variable reference for pathname expansion to occur - if you do, the wildcard characters are treated as literals.
Thus:
rm $file
Caveat: ~/Mail/Incoming****** doesn't work the way you think it does and will potentially match more files than intended, as it is equivalent to ~/Mail/Incoming*, meaning that any file that starts with Incoming will match.
To only match files starting with Incoming that are followed by exactly 6 characters, use ~/Mail/Incoming??????, as @Jidder suggests in a comment.
Note that you could make your glob (pattern) even more specific:
file=~/Mail/Incoming[0-9][0-9][0-9][0-9][0-9][[:alnum:]_][[:alnum:]_][[:alnum:]_]
See the bash manual for a description of pathname expansion and pattern syntax: http://www.gnu.org/software/bash/manual/bashref.html#index-pathname-expansion.
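Here is a sketch of how such a pattern behaves, run against a throwaway directory rather than ~/Mail. The file names are made up, and the pattern uses [[:alnum:]_] for the last three characters on the assumption that digits can appear there too, as the question describes.

```shell
# Throwaway directory standing in for ~/Mail.
maildir=$(mktemp -d)
touch "$maildir/Incoming11781rKD" "$maildir/Incoming11781_pi" \
      "$maildir/Incoming117812ab" "$maildir/Incoming_keep_me" \
      "$maildir/Incoming123456789"

# Incoming + exactly 5 digits + exactly 3 letters, digits, or underscores.
deleted=0
for f in "$maildir"/Incoming[0-9][0-9][0-9][0-9][0-9][[:alnum:]_][[:alnum:]_][[:alnum:]_]; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern if nothing matched
    rm -- "$f"
    deleted=$((deleted + 1))
done

echo "deleted $deleted file(s)"
ls "$maildir"
```

Because the glob sits directly in the for loop, it is expanded without ever being stored in a quoted variable, and file names containing spaces would still be handled one at a time.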
You can achieve the same effect with the find command...
$ directory=~/Mail/
$ file_pattern='Incoming*'
$ find "${directory}" -name "${file_pattern}" -delete
The first two lines define the directory and the file pattern separately; the find command then deletes any matching files inside that directory, including its subdirectories. Note that the tilde must stay unquoted in the assignment, because ~ is not expanded inside quotes.
