Error while searching for a string using xargs

I am trying to grep for a string as below, but I am running into the error shown below. Can anyone suggest how to fix it?
find . | xargs grep 'bin data doesn't exist for HY11' -sl
Error:-
xargs: unmatched single quote; by default quotes are special to xargs unless you use the -0 option

Your grep pattern contains a quotation mark!
Use double quotes around the pattern: "bin data doesn't exist for HY11" rather than 'bin ... HY11'.
You also want to add -print0 to the find command, and -0 to xargs.
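Putting those two fixes together, the corrected pipeline would look something like this (keeping the -s and -l flags from the original command):
find . -print0 | xargs -0 grep -sl "bin data doesn't exist for HY11"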
The better way is to do this all directly:
find . -type f -exec grep -H "bin data doesn't exist for HY11" "{}" "+"
That doesn't even need xargs.

If you have GNU Parallel you can run:
find . | parallel -X -q grep "bin data doesn't exist for HY11" -sl
All new computers have multiple cores, but most programs are serial in nature and will therefore not use the multiple cores. However, many tasks are extremely parallelizable:
Run the same program on many files
Run the same program for every line in a file
Run the same program for every block in a file
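For instance (illustrative one-liners with made-up filenames, not part of the original answer):
parallel gzip ::: *.log                # same program on many files
cat urls.txt | parallel wget           # same program for every line in a file
cat bigfile | parallel --pipe wc -l    # same program for every block in a file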
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU.
GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
A personal installation does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel

Related

pipe '|' doesn't seem to be liked on my ansible script

I have a task to count how many files exist. Is there something wrong with the syntax?
task:
  - name: "Getting Local File Count"
    kubernetes.core.k8s_exec:
      namespace: "{{namespace}}"
      pod: "{{pod}}"
      command: "find '{{local_dir}}' -type f | wc -l"
    register: command_status
but after executing the playbook I get
"find: paths must precede expression: |",
That's a common mistake for people coming from the world of CMD in docker, which has two forms: free text and exec. In kubernetes descriptors it is always the exec form, so if you want shell helpers such as pipes, &&, functions, or redirection, you must invoke them explicitly via sh -c style constructs.
The bad news is that the ansible module's command: argument doesn't accept a list[str]; instead it runs the string through shlex.split, so you have to either copy the complicated command into the pod and just use command: /path/to/my/script.sh, or take your chances with shlex, which does seem to understand sh -c quoting:
>>> shlex.split("""sh -c "find '{{local_dir}}' -type f | wc -l" """)
['sh', '-c', "find '{{local_dir}}' -type f | wc -l"]
making your ansible parameter look like:
command: >-
  sh -ec "find '{{local_dir}}' -type f | wc -l"
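Putting that back into the original task, a sketch of the corrected entry (reusing the variable names from the question) might look like:
- name: "Getting Local File Count"
  kubernetes.core.k8s_exec:
    namespace: "{{namespace}}"
    pod: "{{pod}}"
    command: >-
      sh -ec "find '{{local_dir}}' -type f | wc -l"
  register: command_status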
As always when using jinja2 templates in a shell context, the safer choice is to use the | quote filter rather than wrapping the value in single quotes and hoping nobody puts a single quote in local_dir, although in this case you will need to be extra cautious since it is a shell literal inside a shell literal :-(

Tesseract OCR large number of files

I have around 135000 .TIF files (1.2KB to 1.4KB) sitting on my hard drive. I need to extract text out of those files. If I run tesseract as a cron job I am getting 500 to 600 per hour at the most. Can anyone suggest strategies so I can get at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by Mark, but I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
   if [ -f /mnt/ramdisk/output/$2.txt ]
   then
      echo skipping $2
      return
   fi
   tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
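For example, this just prints the tesseract commands that would be run, without running them:
parallel --dry-run 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif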
As you have 135,000 files it will probably overflow your command line length - you can check with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file after processing to a subdirectory called processed so that it won't get done again on restarting, or you could test for the existence of the corresponding txt file before processing any TIF like this:
#!/bin/bash
doit() {
   if [ -f "${2}.txt" ]; then
      echo Skipping $1...
      return
   fi
   tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
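The mv variant mentioned earlier is not shown in the answer, but a minimal sketch (using processed as the subdirectory name suggested above) might be:
#!/bin/bash
mkdir -p processed
doit() {
   # OCR the file, and only move it out of the way if tesseract succeeded
   tesseract "$1" "$2" > /dev/null 2>&1 && mv "$1" processed/
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif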
If you have millions of files, you could consider using multiple machines in parallel. Just make sure you have ssh logins to each of the machines on your network, and then run across 4 machines, including the localhost, like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
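A fuller multi-machine sketch (my own, not from the answer above) could combine the find pipeline with GNU Parallel's --trc option, which transfers each input file to the remote machine, returns the named result file and cleans up afterwards; it assumes tesseract is installed on remote1, remote2 and remote3:
find . -iname \*.tif -print0 |
   parallel -0 -S :,remote1,remote2,remote3 --trc {.}.txt --bar 'tesseract {} {.} > /dev/null 2>&1'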

Parse filenames into touch command on OS X

I have a large number of photos on my machine where I'd like to parse the standard naming convention I have created for each file, and then pipe it to the touch command.
For example, I have these files:
2016-08-06-00h28m34.jpg
2016-08-06-00h28m35.jpg
2016-08-06-00h28m36.jpg
I would like to generate (and then run) the following commands:
touch -t 201608060028.34 2016-08-06-00h28m34.jpg
touch -t 201608060028.35 2016-08-06-00h28m35.jpg
touch -t 201608060028.36 2016-08-06-00h28m36.jpg
I can do this manually in a text editor, but it's extremely time-consuming, due to the number of files in each directory. I could also do it in C# and run it over my LAN, but that seems like overkill. Heck, I can even do this in SQL Server, but ... it's OS X and I'm sure there's a simple command-line thing I'm missing.
I've looked at Windows version of Unix touch command to modify a file's modification date from filename part, and Split Filename Up to Define Variables, but I can't seem to figure out how to add in the period for the seconds portion of the script, plus I don't want to add the batch script to each of the hundreds of folders I have.
Any assistance will be greatly appreciated.
Simple Option
I presume you are trying to set the filesystem time to match the EXIF capture time of thousands of photos. There is a tool for that, which runs on OSX, Linux (and Windows if you must). It is called jhead and I installed it on OSX using homebrew with:
brew install jhead
There may be other ways to install it - jhead website.
Please make a back up before trying this, or try it out on a small subset of your files, as I may have misunderstood your needs!
Basically the command to set the filesystem timestamp to match the EXIF timestamp on a single file is:
jhead -ft SomeFile.jpg
So, if you wanted to set the timestamps for all files in $HOME/photos/tmp and all subdirectories, you would do:
find $HOME/photos/tmp -iname \*.jpg -exec jhead -ft {} \;
Option not requiring any extra software
Failing that, you could do it with Perl which is installed on OSX by default anyway:
find . -name \*.jpg | perl -lne 'my $a=$_; s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" '
which gives this sort of output on my machine:
touch -t 201608060028.34 "./2016-08-06-00h28m34.jpg"
touch -t 201608060028.35 "./2016-08-06-00h28m35.jpg"
touch -t 201501060028.35 "./tmp/2015-01-06-00h28m35.jpg"
and if that looks good on your machine, you could send those commands into bash to be executed like this:
find . -name \*.jpg | perl -lne 'my $a=$_;s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" ' | bash -x
And, for the Perl purists out there, yes, I know Perl could do the touch itself and save invoking a whole touch process per file, but that would require modules and explanation and a heap of other extraneous stuff that is not really necessary for a one-off, or occasional operation.

How to run fswatch to call a program with static arguments?

I used to use fswatch v0.0.2 like so (in this instance, to run the django test suite when a file changed):
$>fswatch . 'python manage.py test'
this works fine.
I wanted to exclude some files that were causing the test to run more than once per save (Sublime text was saving a .tmp file, and I suspect .pyc files were also causing this)
So I upgraded fswatch to enable the -e mode.
However, the way fswatch works has changed, which is causing me trouble: it is now meant to be used in a pipe, like so:
$>fswatch . | xargs -n1 program
I can't figure out how to pass in arguments to the program here. e.g. this does not work:
$>fswatch . | xargs -n1 python manage.py test
nor does this:
$>fswatch . | xargs -n1 'python manage.py test'
How can I do this without packaging up my command in a bash script?
The fswatch documentation (the Texinfo manual, the wiki, or the README) has examples of how this is done:
$ fswatch [opts] -0 -o path ... | xargs -0 -n1 -I{} your full command goes here
Pitfalls:
xargs -0, fswatch -0: use them to make sure paths containing newlines are interpreted correctly.
fswatch -o: use it to have fswatch "bubble" all the events in the set into a single one, printing only the number of records in the set.
-I{}: specifying a placeholder is the trick you missed; it lets xargs interpret your command arguments correctly in those cases where you do not want the record (in this case, since -o was used, the number of records in the set) to be passed down to the command being executed.
An alternative answer that doesn't fight xargs' default reason for being, which is passing its input along as arguments to the command to be run:
fswatch . | (while read; do python manage.py test; done)
Which is still a bit wordy/syntaxy, so I have created a super simple bash script fswatch-do that simplifies things for me:
#!/bin/bash
(while read; do "$@"; done)
usage:
fswatch -r -o -e 'pyc' somepath | fswatch-do python manage.py test someapp.SomeAppTestCase

xargs with command that open editor leaves shell in weird state

I tried to make an alias for committing several different git projects. I tried something like
cat projectPaths | \
xargs -I project git --git-dir=project/.git --work-tree=project commit -a
where projectPaths is a file containing the paths to all the projects I want to commit. This seems to work for the most part, firing up vi in sequence for each project so that I can write a commit msg for it. I do, however, get a msg:
"Vim: Warning: Input is not from a terminal"
and afterward my terminal is weird: it doesn't show the text I type and doesn't seem to output any newlines. When I enter "reset" things pretty much go back to normal, but clearly I'm doing something wrong.
Is there some way to get the same behavior without messing up my shell?
Thanks!
Using the simpler example of
ls *.h | xargs vim
here are a few ways to fix the problem:
xargs -a <( ls *.h ) vim
or
vim $( ls *.h | xargs )
or
ls *.h | xargs -o vim
The first example uses the xargs -a (--arg-file) flag which tells xargs to take its input from a file rather than standard input. The file we give it in this case is a bash process substitution rather than a regular file.
Process substitution takes the output of the command contained in <( ), places it in a file descriptor, and then substitutes that file descriptor; in this case the substituted command would be something like xargs -a /dev/fd/63 vim.
The second command uses command substitution: the commands are executed in a subshell and their stdout is substituted.
The third command uses the xargs --open-tty (-o) flag, which the man page describes thusly:
Reopen stdin as /dev/tty in the child process before executing the
command. This is useful if you want xargs to run an interactive
application.
If you do use it the old way and want to get your terminal to behave again you can use the reset command.
The problem is that since you're running xargs (and hence git and hence vim) in a pipeline, its stdin is taken from the output of cat projectPaths rather than the terminal; this is confusing vim. Fortunately, the solution is simple: add the -o flag to xargs, and it'll start git (and hence vim) with input from /dev/tty, instead of its own stdin.
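Applied to the original command, that would be something like (assuming your xargs supports -o):
cat projectPaths | xargs -o -I project git --git-dir=project/.git --work-tree=project commit -a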
The man page for GNU xargs shows a similar command for emacs:
xargs sh -c 'emacs "$@" < /dev/tty' emacs
(in this command, the second "emacs" is the "dummy string" that wisbucky refers to in a comment to this answer)
and says this:
Launches the minimum number of copies of Emacs needed, one after the
other, to edit the files listed on xargs' standard input. This example
achieves the same effect as BSD's -o option, but in a more flexible and
portable way.
Another thing to try is using -a instead of cat:
xargs -a projectPaths -I project git --git-dir=project/.git --work-tree=project commit -a
or some combination of the two.
If you have GNU Parallel http://www.gnu.org/software/parallel/ installed you should be able to do this:
cat projectPaths |
parallel -uj1 git --git-dir={}/.git --work-tree={} commit -a
In general this works too:
cat filelist | parallel -Xuj1 $EDITOR
in case you want to edit more than one file at a time (and you have set $EDITOR to your favorite editor).
-o for xargs (as mentioned elsewhere) only works for some versions of xargs; notably, it did not work for GNU xargs when this was written, though newer GNU xargs has since added --open-tty (-o).
Watch the intro video to learn more about GNU Parallel http://www.youtube.com/watch?v=OpaiGYxkSuQ
Interesting! I see the exact same behaviour on Mac as well, doing something as simple as:
ls *.h | xargs vim
Apparently, it is a problem with vim:
http://talideon.com/weblog/2007/03/xargs-vim.cfm
