gsutil: Argument list too long - bash

I am trying to upload many thousands of files to Google Cloud Storage, with the following command:
gsutil -m cp *.json gs://mybucket/mydir
But I get this error:
-bash: Argument list too long
What is the best way to handle this? I can obviously write a bash script to iterate over different numbers:
gsutil -m cp 92*.json gs://mybucket/mydir
gsutil -m cp 93*.json gs://mybucket/mydir
gsutil -m cp ...*.json gs://mybucket/mydir
But the problem is that I don't know in advance what my filenames are going to be, so writing that command isn't trivial.
Is there either a way to handle this with gsutil natively (I don't think so, from the documentation), or a way to handle this in bash where I can list say 10,000 files at a time, then pipe them to the gsutil command?

Eric's answer should work, but another option would be to rely on gsutil's built-in wildcarding, by quoting the wildcard expression:
gsutil -m cp "*.json" gs://mybucket/mydir
To explain more: The "Argument list too long" error is coming from the shell, which has a limited size buffer for expanded wildcards. By quoting the wildcard you prevent the shell from expanding the wildcard and instead the shell passes that literal string to gsutil. gsutil then expands the wildcard in a streaming fashion, i.e., expanding it while performing the operations, so it never needs to buffer an unbounded amount of expanded text. As a result you can use gsutil wildcards over arbitrarily large expressions. The same is true when using gsutil wildcards over object names, so for example this would work:
gsutil -m cp "gs://my-bucket1/*" gs://my-bucket2
even if there are a billion objects at the top-level of gs://my-bucket1.
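To see the difference quoting makes without touching a real bucket, here is a minimal sketch. The scratch directory and the `count_args` helper are made up for illustration and are not part of gsutil; they only show what the shell hands to a command in each case.

```shell
# Hypothetical scratch directory; no gsutil or bucket is involved.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.json b.json c.json

# Illustrative helper (not part of gsutil): reports how many arguments it got.
count_args() { echo "$#"; }

unquoted_count=$(count_args *.json)   # the shell expands the glob first
quoted_count=$(count_args "*.json")   # the literal pattern is passed through

echo "unquoted: $unquoted_count argument(s), quoted: $quoted_count argument(s)"
```

With thousands of files the unquoted form is the one that can overflow the argument list, while the quoted form always passes a single short string.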

If your filenames are safe from newlines you could use gsutil cp's ability to read from stdin like
find . -maxdepth 1 -type f -name '*.json' | gsutil -m cp -I gs://mybucket/mydir
or if you're not sure if your names are safe and your find and xargs support it you could do
find . -maxdepth 1 -type f -name '*.json' -print0 | xargs -0 -I {} gsutil -m cp {} gs://mybucket/mydir

Here's a way you could do it, using xargs to limit the number of files that are passed to gsutil at once. Null bytes are used to prevent problems with spaces or newlines in the filenames.
printf '%s\0' *.json | xargs -0 sh -c 'copy_all () {
    gsutil -m cp "$@" gs://mybucket/mydir
}
copy_all "$@"' sh
Here we define a function which is used to put the file arguments in the right place in the gsutil command. The trailing sh fills $0, so every filename ends up in "$@". This whole process should happen the minimum number of times required to process all arguments, passing the maximum number of filename arguments possible each time.
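The batching can be made visible with a small sketch that swaps gsutil for echo. The scratch directory and the tiny batch size are assumptions chosen for the demo; in real use you would leave `-n` out so xargs packs as many names as fit.

```shell
# Hypothetical scratch directory with five sample files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch 1.json 2.json 3.json 4.json 5.json

# -n 2 forces small batches purely to make the batching visible.
# The trailing "sh" becomes $0, so all file names land in "$@".
batches=$(printf '%s\0' *.json |
  xargs -0 -n 2 sh -c 'echo "$# files: $*"' sh)

echo "$batches"
```

Each output line corresponds to one invocation of the inner command, which is where a single gsutil call would run.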
Alternatively you can define the function separately and export it before calling xargs (this is bash-specific):
copy_all () {
    gsutil -m cp "$@" gs://mybucket/mydir
}
export -f copy_all
printf '%s\0' *.json | xargs -0 bash -c 'copy_all "$@"' bash

Related

How to copy files found with grep on OSX

I want to copy files I've found with grep on an OSX system, where the cp command doesn't have a -t option.
A previous post's solution for doing something like this relied on the -t flag in cp. However, like that poster, I want to take the file list I receive from grep and then execute a command over it, something like:
grep -lr "foo" --include=*.txt * 2>/dev/null | xargs cp -t /path/to/targetdir
Less efficient than cp -t, but this works:
grep -lr "foo" --include=*.txt * 2>/dev/null |
xargs -I{} cp "{}" /path/to/targetdir
Explanation:
For filenames | xargs cp -t destination, xargs changes the incoming filenames into this format:
cp -t destination filename1 ... filenameN
i.e., it only runs cp once (actually, once for every few thousand filenames -- xargs breaks the command line up if it would be too long for the shell).
For filenames | xargs -I{} cp "{}" destination, on the other hand, xargs changes the incoming filenames into this format:
cp "filename1" destination
...
cp "filenameN" destination
i.e., it runs cp once for each incoming filename, which is much slower. For a large number (e.g., >10k) of very small (e.g., <10k) files, I'd guess it could even be thousands of times slower. But it does work :)
PS: Another popular technique is to use find's -exec action instead of xargs, e.g., https://stackoverflow.com/a/5241677/1563960
Yet another option is, if you have admin privileges or can persuade your sysadmin, to install the coreutils package as suggested here, and follow the steps but for cp rather than ls.
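If the per-file cost matters, one possible middle ground is to keep batching but let a small sh -c wrapper put the destination last, which works without cp -t. This is a sketch under assumptions: the scratch directories and sample files are invented for the demo, and it relies on grep's --null option and xargs -0 being available (both GNU and BSD versions generally have them).

```shell
# Hypothetical scratch directories and sample files.
src=$(mktemp -d)
dest=$(mktemp -d)
printf 'foo\n' > "$src/a.txt"
printf 'foo bar\n' > "$src/b c.txt"
printf 'nothing\n' > "$src/d.txt"

# --null ends each file name with a NUL byte, matching xargs -0.
# Inside sh -c: $0 is the placeholder "sh", $1 is the destination, and the
# names appended by xargs follow; shift drops the destination from "$@".
grep -lr --null --include='*.txt' "foo" "$src" |
  xargs -0 sh -c 'dest=$1; shift; cp "$@" "$dest"' sh "$dest"

ls "$dest"
```

cp then runs once per batch rather than once per file, and names containing spaces survive intact.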

Using touch and sed within a find -ok command

I have some wav files. For each of those files I would like to create a new text file with the same name (obviously with the wav extension being replaced with txt).
I first tried this:
find . -name *.wav -exec 'touch $(echo '{}" | sed -r 's/[^.]+\$/txt/')" \;
which outputted
< touch $(echo {} | sed -r 's/[^.]+$/txt/') ... ./The_stranglers-Golden_brown.wav > ?
Then find complained after I hit y key with:
find: ‘touch $(echo ./music.wav | sed -r 's/[^.]+$/txt/')’: No such file or directory
I figured out I was using a pipe and actually needed a shell. I then ran:
find . -name *.wav -exec sh -c 'touch $(echo "'{}"\" | sed -r 's/[^.]+\$/txt/')" \;
Which did the job.
Actually, I do not really get what is being done internally, but I guess a shell is spawned for every file, right? I fear this is costly in memory.
Then, what if I need to run this command on a large number of files and directories?
Is there a more efficient way to do this?
Basically I need to transform the current file's name and feed it to the touch command.
Thank you.
This find command with shell parameter expansion will do the trick for you. You don't need sed at all.
find . -type f -name "*.wav" -exec sh -c 'x=$1; file="${x##*/}"; woe="${file%.*}"; touch "${woe}.txt"; ' sh {} \;
The idea, part by part:
x=$1 holds each path returned by find.
file="${x##*/}" strips the directory part of the path, leaving only the file name (filename.ext).
woe="${file%.*}" stores the name without its extension; the new file is then created from that name with a .txt extension.
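To make the expansions concrete, here is a small sketch using a made-up path. The `same_dir` variable is my own addition, shown because stripping the directory first means touch creates the .txt file in the current directory rather than next to the .wav file.

```shell
# Example path; the directory name is hypothetical.
x="./music/The_stranglers-Golden_brown.wav"

file="${x##*/}"          # remove the longest prefix ending in "/"
woe="${file%.*}"         # remove the shortest suffix starting with "."
same_dir="${x%.*}.txt"   # keep the directory, swap only the extension

echo "$file"
echo "$woe"
echo "$same_dir"
```

If the text file should sit beside the original, the last form is probably the one you want.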
EDIT
Parameter expansion sets us free from using Command substitution $() sub-process and sed.
After looking at sh man page, I figured out that the command up above could be simplified.
Synopsis -c [-aCefnuvxIimqVEbp] [+aCefnuvxIimqVEbp] [-o option_name] [+o option_name] command_string [command_name [argument ...]]
...
-c Read commands from the command_string operand instead of from the standard input. Special parameter 0 will be set from the command_name operand and the positional parameters ($1, $2, etc.) set from the remaining argument operands.
We can directly pass the file path, skipping the shell's name (which is useless inside the script anyway). So {} is passed as the command_name $0 which can be expanded right away.
We end up with a cleaner command.
find . -name '*.wav' -exec sh -c 'touch "${0%.*}".txt ;' {} \;
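On the worry about spawning one shell per file: a possible variant, sketched here with an invented scratch directory, ends -exec with + instead of \; so find hands many paths to a single sh, which loops over them. Note that this form goes back to using a placeholder for $0 so the paths arrive in the positional parameters.

```shell
# Hypothetical scratch tree with two sample files.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/album one"
touch "$demo_dir/a.wav" "$demo_dir/album one/b c.wav"

# "sh" fills $0; the paths found by find become $1, $2, ...
# "for f" with no list iterates over those positional parameters.
find "$demo_dir" -type f -name '*.wav' -exec sh -c '
  for f; do
    touch "${f%.*}.txt"
  done' sh {} +

find "$demo_dir" -name '*.txt'
```

touch itself still runs once per file here, but only one shell is started per batch of paths.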

xargs command length limits

I am using jsonlint to lint a bunch of files in a directory (recursively). I wrote the following command:
find ./config/pages -name '*.json' -print0 | xargs -0I % sh -c 'echo Linting: %; jsonlint -V ./config/schema.json -q %;'
It works for most files but some files I get the following error:
Linting: ./LONG_FILE_NAME.json
fs.js:500
return binding.open(pathModule._makeLong(path), stringToFlags(flags), mode);
^
Error: ENOENT, no such file or directory '%'
It appears to fail for long filenames. Is there a way to fix this? Thanks.
Edit 1:
Found the problem.
-I replstr
Execute utility for each input line, replacing one or more occurrences
of replstr in up to replacements (or 5 if no -R flag is specified)
arguments to utility with the entire line of input. The resulting
arguments, after replacement is done, will not be allowed to grow
beyond 255 bytes; this is implemented by concatenating as much of the
argument containing replstr as possible, to the constructed arguments
to utility, up to 255 bytes. The 255 byte limit does not apply to
arguments to utility which do not contain replstr, and furthermore, no
replacement will be done on utility itself. Implies -x.
Edit 2:
Partial solution. Supports longer file names than before but still not as long as I need.
find ./config/pages -name '*.json' -print0 | xargs -0I % sh -c 'file=%; echo Linting: $file; jsonlint -V ./config/schema.json -q $file;'
On BSD like systems (e.g. Mac OS X)
If you happen to be on a mac or freebsd etc. your xargs implementation may support option -J which does not suffer from the argument size limits imposed on option -I.
Excerpt from the man page
-J replstr
If this option is specified, xargs will use the data read from standard input to replace the first occurrence of replstr instead of appending that data after all other arguments. This option will not affect how many arguments will be read from input (-n), or the size of the command(s) xargs will generate (-s). The option just moves where those arguments will be placed in the command(s) that are executed. The replstr must show up as a distinct argument to xargs. It will not be recognized if, for instance, it is in the middle of a quoted string. Furthermore, only the first occurrence of the replstr will be replaced. For example, the following command will copy the list of files and directories which start with an uppercase letter in the current directory to destdir:
/bin/ls -1d [A-Z]* | xargs -J % cp -Rp % destdir
If you need to refer to the replstr multiple times (*points up* TL;DR -J only replaces first occurrence) you can use this pattern:
echo hi | xargs -J{} sh -c 'arg=$0; echo "$arg $arg"' "{}"
=> hi hi
POSIX compliant method
The POSIX compliant method of doing this would be to use some other tool, e.g. sed, to construct the code you want to execute and then use xargs to just specify the utility. When no replstr is used in xargs the 255 byte limit does not apply. xargs POSIX spec
find . -type f -name '*.json' -print |
sed "s_^_-c 'file=\\\"_g;s_\$_\\\"; echo \\\"Definitely over 255 byte script..$(printf "a%.0s" {1..255}): \\\$file\\\"; wc -l \\\"\\\$file\\\"'_g" |
xargs -L1 sh
This of course largely defeats the purpose of xargs to begin with, but can still be used to leverage e.g. parallel execution using xargs -L1 -P10 sh, which is quite widely supported, though not POSIX.
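A simpler route that avoids replstr altogether may be to keep the script text fixed and let xargs append each filename as an argument, which the script reads as $1. The sketch below is an assumption-laden demo: the scratch directory and the deliberately long names are invented, and wc -c stands in for the real jsonlint call.

```shell
# Hypothetical scratch tree whose full path is well over 255 bytes.
demo_dir=$(mktemp -d)
long_dir="$demo_dir/$(head -c 100 /dev/zero | tr '\0' 'b')"
long_file="$long_dir/$(head -c 200 /dev/zero | tr '\0' 'a').json"
mkdir -p "$long_dir"
printf '{}\n' > "$long_file"

# The script text never changes; each name arrives as $1 ("sh" fills $0),
# so no replstr substitution and none of its size limits are involved.
# wc -c is only a stand-in for the real linter invocation.
output=$(find "$demo_dir" -name '*.json' -print0 |
  xargs -0 -n 1 sh -c 'echo "Linting: $1"; wc -c < "$1"' sh)

echo "$output"
```

Because the name is passed as data rather than pasted into the script, quotes or spaces in filenames also stop being a problem.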
Use -exec in find instead of piping to xargs.
find ./config/pages -name '*.json' -exec echo Linting: {} \; -exec jsonlint -V ./config/schema.json -q {} \;
The limit on xargs's command line length is imposed by the system (not an environment) variable ARG_MAX. You can check it like:
$ getconf ARG_MAX
2097152
Surprisingly, there does not seem to be a way to change it, barring kernel modification.
Even more surprising, xargs is capped by default at a much lower value, which you can raise with the -s option. Still, ARG_MAX is not the value you can pass to -s: according to man xargs you need to subtract the size of the environment plus some "headroom" (no idea why). To find the actual number, use the following command (alternatively, passing an arbitrarily big number to -s will produce a descriptive error):
$ xargs --show-limits 2>&1 | grep "limit on argument length (this system)"
POSIX upper limit on argument length (this system): 2092120
So you need to run … | xargs -s 2092120 …, e.g. with your command:
find ./config/pages -name '*.json' -print0 | xargs -s 2092120 -0I % sh -c 'echo Linting: %; jsonlint -V ./config/schema.json -q %;'
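For a rough feel of where the gap between ARG_MAX and the usable size comes from, the following sketch subtracts the current environment size. It is only an approximation under my own assumptions: it ignores the extra headroom xargs reserves, so the number xargs itself reports is the one to trust.

```shell
# System-wide limit on the combined size of arguments and environment.
arg_max=$(getconf ARG_MAX)

# Approximate size of the current environment in bytes.
env_bytes=$(env | wc -c | tr -d ' ')

# Rough upper bound; xargs additionally subtracts its own headroom.
approx_room=$((arg_max - env_bytes))

echo "ARG_MAX=$arg_max environment=$env_bytes approx_room=$approx_room"
```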

Find recursive/xargs/cp/awk/sed/single quote in single quote together in a one-liner

I am trying to create a shell one liner to find all jpegs in a directory recursively. Then I want to copy them all out to an external directory, while renaming them according to their date and time and then append a random integer in order to avoid overwrites with images that have the same timestamp.
First Attempt:
find /storage/sdcard0/tencent/MicroMsg/ -type f -iname '*.jpg' -print0 | xargs -0 sh -c 'for filename; do echo "$filename" && cp "$filename" $(echo /storage/primary/legacy/image3/$(stat $filename |awk '/Mod/ {print $2"_"$3}'|sed s/:/-/g)_$RANDOM.jpg);done' fnord
Among other things, the above doesn't work because there are the single quotes of the awk within the sh -c single quotes.
The second attempt should do the same thing without sh -c, but gives me this error on stat:
stat: can't stat '': No such file or directory
/system/bin/sh: file: not found
Second Attempt:
find /storage/sdcard0/tencent/MicroMsg/ -type f -iname '*.jpg' -print0 | xargs -0 file cp "$file" $(echo /storage/primary/legacy/image3/$(stat "$file" | awk '/Mod/ {print $2"_"$3}'|sed s/:/-/g)_$RANDOM.jpg)
I think the problem with the second attempt may be too many subshells?
Can anyone help me know where I'm going wrong here?
On another note: if anyone knows how to preserve the actual modified date/time stamps when copying a file, I would love to throw that in here.
Thank you Thank you
Were it my problem, I'd create a script — call it filecopy.sh — like this:
TARGET="/storage/primary/legacy/image3"
for file in "$@"
do
    basetime=$(date +'%Y-%m-%d.%H-%M-%S' -d @$(stat -c '%Y' "$file"))
cp "$file" "$TARGET/$basetime.$RANDOM.jpg"
done
The basetime line runs stat to get the modification time of the file in seconds since The Epoch, then uses that with date to format the time as a modified ISO 8601 format (using - in place of :, and . in place of T). This is then used to create the target file name, along with a semi-random number.
Then the find command becomes simply:
SOURCE="/storage/sdcard0/tencent/MicroMsg"
find "$SOURCE" -type f -iname '*.jpg' -exec /path/to/filecopy.sh {} +
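The timestamp step can be tried in isolation with the sketch below. It assumes GNU stat and GNU date (the -c and -d @ forms are not available on BSD or macOS in this shape) and bash for $RANDOM; the scratch directories and the fixed sample time are invented. It also uses cp -p, which should answer the side question about preserving the modification time on the copy.

```shell
# Hypothetical scratch directories and one sample file with a known mtime.
src=$(mktemp -d)
target=$(mktemp -d)
file="$src/photo.jpg"
touch -t 201407221030.00 "$file"   # 2014-07-22 10:30:00 local time

# Epoch seconds from stat, reformatted by date (GNU tools assumed).
basetime=$(date +'%Y-%m-%d.%H-%M-%S' -d "@$(stat -c '%Y' "$file")")

# -p asks cp to preserve the modification time on the copy.
cp -p "$file" "$target/$basetime.$RANDOM.jpg"

ls "$target"
```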
Personally, I'd not bother to try making it work without a separate shell script. It could be done, but it would not be trivial:
SOURCE="/storage/sdcard0/tencent/MicroMsg"
find "$SOURCE" -type f -iname '*.jpg' -exec bash -c \
'TARGET="/storage/primary/legacy/image3"
for file in "$@"
do
    basetime=$(date +%Y-%m-%d.%H-%M-%S -d @$(stat -c %Y "$file"))
cp "$file" "$TARGET/$basetime.$RANDOM.jpg"
done' command {} +
I've taken some liberties in that by removing the single quotes that I used in the main shell script. They were optional, but I'd use them automatically under normal circumstances.
If you have GNU Parallel > version 20140722 you can run:
find . | parallel 'cp {} ../destdir/{= $a = int(10000*rand); $_ = `date -r "$_" +%FT%T"$a"`; chomp; =}'
It will work on file names containing ' and space, but fail on file names containing ".
All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizable:
Run the same program on many files
Run the same program for every line in a file
Run the same program for every block in a file
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
A personal installation does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Is there such a thing as inline bash scripts?

I want to do something on the lines of:
find -name *.mk | xargs "for i in $@ do mv i i.aside end"
I realize that there might be more than one error in this, but I'd like to specifically know about this sort of inline command definition that I can pass to xargs.
This particular command isn't a great example, but you can use an "inline shell script" by giving sh -c 'here is the script' as a command. You can give it arguments, which will be "$@" inside the script, but there's a catch: the first argument after 'here is the script' goes to $0 inside the script, so you have to put an extra word there or you'll lose the first argument.
find . -name '*.mk' -exec sh -c 'for i; do mv "$i" "$i.aside"; done' fnord '{}' +
Another fun feature I took advantage of there is the fact that for loops iterate over the command line arguments by default: for i; do ... is equivalent to for i in "$@"; do ...
I reiterate, the above command is convoluted and slow compared to the many other methods of doing the bulk mv. I'm posting it only to show some cool syntax.
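Both points, the $0 catch and the implicit loop over the arguments, can be checked with a quick sketch. The word placeholder and the sample arguments are arbitrary choices for the demo.

```shell
# With an extra word for $0, every real argument reaches the loop.
kept=$(sh -c 'for i; do echo "arg: $i"; done' placeholder one "two words")

# Without it, the first argument is consumed as $0 and never looped over.
lost=$(sh -c 'for i; do echo "arg: $i"; done' one "two words")

echo "with placeholder:"
echo "$kept"
echo "without placeholder:"
echo "$lost"
```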
There's no need for xargs here
find . -name '*.mk' -exec mv {} {}.aside \;
I'm not sure what the semantics of your for loop should be, but blindly coding it would give something like this:
find . -name '*.mk' | while IFS= read -r file
do
    for i in "$file"; do mv "$i" "$i.aside"; done
done
If the body is used in multiple places, you can also use bash functions.
Some versions of find require a starting path, e.g. . for the current directory.
The star * must be quoted or escaped so the shell does not expand it.
You can put echo in front of the command to check what it will do before running it for real:
find . -name '*.mk' -print0 | xargs -0i sh -c "echo mv '{}' '{}.aside'"
man xargs
/-i
man sh
/-c
I'm certain you could do this in a nice manner, but since you requested xargs:
find . -name "*.mk" | xargs -I% mv % %.aside
Looping over filenames makes no sense, since you can only rename one at a time. Using inline ugliness is not necessary, but I could not make it work with the pipe and either eval or bash -c.

Resources