In bash, how to find files that contain the string "test", excluding binary files

find . -type f |xargs grep string |awk -F":" '{print $1}' |uniq
The command above gets the names of all files that contain the given string, but the result includes binary files.
The problem is how to exclude the binary files.
Thank you all.

If I understand properly, you want to get the name of all the files in the directory and its subdirectories that contain the string string, excluding binary files.
Reading grep's friendly manual, I was able to catch this:
-I Process a binary file as if it did not contain matching data;
this is equivalent to the --binary-files=without-match option.
Amazing!
Now how about I get rid of find. Is this possible with just grep? Oh, two lines below, still in the funky manual, I read this:
-R, -r, --recursive
Read all files under each directory, recursively; this is
equivalent to the -d recurse option.
That seems great, doesn't it?
How about getting only the file name? Still in grep's funny manual, I read:
-l, --files-with-matches
Suppress normal output; instead print the name of each input
file from which output would normally have been printed. The
scanning will stop on the first match. (-l is specified by
POSIX.)
Yay! I think we're done:
grep -IlR 'string' .
Remarks.
I also tried to find make me a sandwich in the manual, but my version of grep doesn't seem to support it. YMMV.
The manual is located at man grep.
As William Pursell rightly comments, the -R and -I switches are not available in all implementations of grep. If your grep possesses the make me a sandwich option, it will very likely support the -R and -I switches. YMMV.
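If you still want find in the picture (for instance to pre-filter by age or size, which grep can't do), the two combine naturally; here is a sketch (the -mtime -7 filter is purely illustrative, and it again assumes a grep with -I):
find . -type f -mtime -7 -exec grep -Il 'string' {} +
The {} + form hands the files to grep in batches rather than starting one process per file.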

The version of Unix I work with does not support grep's -I or -R options.
I tried this command:
file `find ./` | grep text | cut -d: -f1 | xargs grep "test"
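A variant of the same idea that survives spaces in file names lets find hand the paths to a small shell loop instead of relying on word splitting (a sketch; the *text*) pattern is an assumption about what your file(1) implementation prints for text files):
find . -type f -exec sh -c 'for f; do case $(file -b "$f") in *text*) grep -l "test" "$f";; esac; done' sh {} +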

Related

How to print the file names from which I grep some lines

I'm trying to get some lines from several json files using the following code:
cat $(find ./*/*/folderA/*DTI*.json) | grep -i -E '(phaseencodingdirection|phaseencodingaxis)' > phase_direction
It worked! The problem is that I don't know which line comes from which file.
With find ./*/*/preprocessing/*DTI*.json -type f -printf "%f\n" I can print those names, but they appear at the end and not in order with their respective phaseencodingdirection|phaseencodingaxis extracted lines.
I don't know how to combine those commands so that each extracted line is printed together with the name of the file it came from.
Could you help me?
the problem is that I don't know which line comes from which file
Well no, you don't, because you have concatenated the contents of all the files into a single stream. If you want to be able to identify at the point of pattern matching which file each line comes from then you have to give that information to grep in the first place. Like this, for example:
find ./*/*/folderA/*DTI*.json |
xargs grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' > phase_direction
The xargs program converts lines read from its standard input into arguments to the specified command (grep in this case). The -H option to grep causes it to list the filename of each match along with the matching line itself.
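If any of the matched paths can contain whitespace, a NUL-delimited variant of the same pipeline is safer (a sketch assuming a find and xargs that support -print0 and -0, as the GNU tools do):
find ./*/*/folderA/*DTI*.json -print0 |
xargs -0 grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' > phase_direction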
Alternatively, this variation on the same thing is a little simpler, and closer in some senses to the original:
grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' \
$(find ./*/*/folderA/*DTI*.json) > phase_direction
That takes xargs out of the picture, and moves the command substitution directly to the argument list of grep.
But now observe that if the pattern ./*/*/folderA/*DTI*.json does not match any directories then find isn't actually doing anything useful for you. There is then no directory recursion to be done, and you haven't specified any tests, so the command substitution will simply expand to all the paths that match the pattern, just like the pattern would do if expanded without find. Thus, this is probably best of all:
grep -i -E -H '(phaseencodingdirection|phaseencodingaxis)' \
./*/*/folderA/*DTI*.json > phase_direction
Use the filenames as arguments to grep rather than cat.
grep -i -H -E '(phaseencodingdirection|phaseencodingaxis)' $(find ./*/*/folderA/*DTI*.json) > phase_direction
The -H option forces grep to include filenames in the output even if there's only one file.
But since your arguments to find are filenames, not directories to search recursively, there's no need to use it at all. Just pass the wildcard directly to grep. There's also no need to begin with ./. Any non-absolute pathname is interpreted relative to the current directory.
grep -i -H -E '(phaseencodingdirection|phaseencodingaxis)' */*/folderA/*DTI*.json > phase_direction
You may use recursive grep:
grep -iER 'phaseencodingdirection|phaseencodingaxis' --include='*DTI*.json' */*/folderA

I want to pipe grep output to sed for input

I'm trying to pipe the output of grep to sed so that sed only edits specific files. I don't want sed to rewrite a file without actually changing it (which would update its modified date).
I'm searching with grep and writing with sed. That's it
The thing I am trying to change is a dash, but not the normal type: "-" is normal, "–" isn't.
The code I currently have:
sed -i 's/– foobar/- foobar/g' * ; perl-rename 's/– foobar/- foobar/' *'– foobar'*
Sorry about the trouble, I'm inexperienced.
Are you sure about what you want to achieve? Let me explain you:
grep "string_in_file" <filelist> | sed <sed_script>
This first shows the lines containing "string_in_file", each preceded by its filename.
If you pipe this into sed, it will just show you the result of the sed script on screen; it will not change the files themselves. In order to do that, you need the following:
grep -l "string_in_file" <filelist> | sed <sed_script_on_file>
grep -l gives you only the filenames, so sed_script_on_file needs to be a command that reads each of those files and alters it.
Thank you all for helping, and sorry for the slow response.
After a bit of fiddling with the command, I got it:
grep -l 'old' * | xargs -d '\n' sed -i 's/old/new/'
This should only touch files that contain old and leave all other files.
This might be what you're trying to do if your file names don't contain newlines:
grep -l -- 'old' * | xargs sed -i 's/old/new/'
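If the file names might even contain newlines, GNU grep can emit NUL-terminated names that pair with xargs -0 (a sketch assuming GNU grep, xargs, and sed):
grep -lZ -- 'old' * | xargs -r0 sed -i 's/old/new/'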

Grep/Sed/Awk Options

How could you grep or use sed or awk to parse for a dynamic length substring? Here are some examples:
I need to parse out everything except for the "XXXXX.WAV" in these strings, but the strings are not a set length.
Sometimes its like this:
{"filename": "/assets/JFM/imaging/19001.WAV"},
{"filename": "/assets/JFM/imaging/19307.WAV"},
{"filename": "/assets/JFM/imaging/19002.WAV"}
And sometimes like this:
{"filename": "/assets/JFM/LN_405999/101.WAV"},
{"filename": "/assets/JFM/LN_405999/102.WAV"},
{"filename": "/assets/JFM/LN_405999/103.WAV"}
Is there a good dynamic way to parse out just the XXXXX.WAV part? Maybe start at the last "/" and parse until the closing quote?
Edit:
Expected output like this:
19001.WAV
19307.WAV
19002.WAV
Or:
101.WAV
102.WAV
103.WAV
Just use grep as proposed in comments:
grep -o '[^/]\{1,\}\.WAV' yourfile
If the WAV file's base name always consists of digits, this is more explicit (same result):
grep -o '[0-9]\{1,\}\.WAV'
Assuming there are [ and ] lines at the beginning and end of your file, it looks like your input is JSON, in which case I would recommend installing and using jq rather than text-based utilities, and doing something like this:
jq -r '.[]|.filename|split("/")[-1]'
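For example, if the objects shown above are wrapped in a JSON array stored in a file (files.json is just an illustrative name), the invocation and output would be:
jq -r '.[]|.filename|split("/")[-1]' files.json
19001.WAV
19307.WAV
19002.WAV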
But failing that, any of the tools listed will work just fine.
grep -o '[^/]*\.WAV'
or
sed -ne 's,.*/\([^/]*\.WAV\).*$,\1,p'
or
awk -F'"' '/WAV/ {split($4,a,"/"); print a[length(a)]}'
In each case there are a variety of other possible solutions as well.
Or with sed
$ sed 's,.*/,,; s,".*,,' x
101.WAV
102.WAV
103.WAV
Explanation:
s,.*/,, - delete everything up to and including the rightmost /
s,".*,, - delete everything starting with the leftmost " to the end of the line
another awk
awk -F'[/"]' '{print $(NF-1)}' file
19001.WAV
19307.WAV
19002.WAV
Try this -
awk -F'[{":}/]' '{print $(NF-2)}' f
19001.WAV
19307.WAV
19002.WAV
OR
egrep -o '[[:digit:]]{5}.WAV' f
19001.WAV
19307.WAV
19002.WAV
OR
egrep -o '[[:digit:]]{5}.[[:alpha:]]{3}' f
19001.WAV
19307.WAV
19002.WAV
You can easily adjust the digit and character counts in the egrep versions to match other examples, but the awk version works fine for both cases.
All of the programs you listed use regex to parse the names, so I will show you an example using grep, being probably the most basic one for this case.
There are a couple of options, depending on the exact way you define the XXX part before the ".wav".
Option 1, as you pointed out is just the file name, i.e., everything after the last slash:
grep -hoi "[^/]\+\.WAV"
This reads as "any character besides slash" ([^/]) repeated at least once (\+), followed by a literal .WAV (\.WAV).
Option 2 would be to only grab the digits before the extension:
grep -hoi "[[:digit:]]\+\.WAV"
OR
grep -hoi "[0-9]\+\.WAV"
These read as "digits" ([[:digit:]] and [0-9] mean the same thing) repeated at least once (\+), followed by a literal .WAV (\.WAV).
In all cases, I recommend using the flags -h, -o, -i, which I have concatenated into a single option -hoi. -h suppresses the file name from the output. -o makes grep only output the portion that matches. -i makes the match case insensitive, so should your extension ever change to .wav instead of .WAV, you'll be fine.
Also, in all cases, the input is up to you. You can pipe it in from another program, which will look like
program | grep -hoi "[^/]\+\.WAV"
You can get it from a file using stdin redirection:
grep -hoi "[^/]\+\.WAV" < somefile.txt
Or you can just pass the filename to grep:
grep -hoi "[^/]\+\.WAV" somefile.txt
awk -F/ '{print substr($5,1,7)}' file
101.WAV
102.WAV
103.WAV
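A variant in the same spirit that avoids hard-coding both the field position and the substring length, so it works for both sample formats (a sketch: it takes the last /-separated field and strips everything from the quote onward):
awk -F/ '{sub(/".*/, "", $NF); print $NF}' file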

Getting last element of a path (different from #10124314 as basename falls over)

I need to process a couple of thousand PDF files, sorted alphabetically on their filename, ideally from bash. So from my simple perspective I need to walk a tree of files, stripping off the path as I go, and then do various grepping, sorting, etc.
Having seen an answer to a similar question I've tried doing a
tim@MERLIN:~/Documents/Scanned$ basename `find ./ -print`
but that gets messed up by some directory names which have spaces in them - e.g. there is one called General Letters which acts like a chicken-bone in the works and results in
basename: extra operand ‘Letters’
Try 'basename --help' for more information.
I can't see a way to get find to strip out the pathname and I would prefer to use find given its plethora of options to filter on age, size etc. Nor can I see any way to get basename to cope gracefully with spaces in this context.
I considered using cut but I can't work out how to get cut to give me the last field by doing something like cut -d/ <whatever>. I'm sure there must be an easy way to do it: some sort of in-line sed or awk script?
I don't particularly want the buggeration of writing a perl/Python script to do it for me as I know I should be able to do it from the command line.
So any simple tips or suggestions?
Updated/Solved
Many thanks to Cyrus; the solution is
tim@MERLIN:~/Documents/Scanned$ find . -name '*.pdf' -printf '%f\n' | sort
(Note the quotes around *.pdf: they stop the shell from expanding the pattern before find sees it.)
Try this:
find ./ -printf '%f\n'
%f: File's name with any leading directories removed (only the last element).
Here is a working solution using awk:
find ./ | awk -F'/' '{ print $NF }'
It simply uses / as the delimiter and prints the last field of each line.
Or with grep:
find ./ | grep -oE "[^/]+$"
Through sed,
find ./ | sed 's/.*\/\(.*\)$/\1/g'
If you want to get a list of pathnames (recursively) but want to sort them by filename (not by pathname), you can use:
find . -printf '%f|%p\n' | sort -k 1 -t'|' | cut -d'|' -f2-
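This is the classic decorate-sort-undecorate trick: each line carries filename|fullpath, sort orders the lines on the filename part before the |, and cut strips the decoration back off, leaving the full paths sorted by file name.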
You need GNU find for this (standard on Linux; not the default on OS X).
Without GNU find, you can do the same with:
find . -print | sed 's:\(.*\)/\(.*\)$:\2\|\1/\2:' | sort -k 1 -t'|' | cut -d'|' -f2-
(Assuming there is no \n in the filenames)

Use lines in a file as filenames for grep?

I have a file which contains filenames (and the full path to them) and I want to search for a word within all of them.
some pseudo-code to explain:
grep keyword <all files specified in files.txt>
or
cat files.txt > grep keyword
cat files.txt | grep keyword
the problem is that I can only get grep to search the filenames, not the contents of the actual files.
cat files.txt | xargs grep keyword
or
grep keyword `cat files.txt`
or (equivalent to previous but harder to mis-read)
grep keyword $(cat files.txt)
should do the trick.
Pitfalls:
If files.txt contains file names with spaces, either solution will malfunction, because "This is a filename.txt" will be interpreted as four files, "This", "is", "a", and "filename.txt". A good reason why you shouldn't have spaces in your filenames, ever.
There are ways around this, but none of them is trivial. (find ... -print0 / xargs -0 is one of them.)
The second (cat) version can result in a very long command line (which might fail when exceeding the limits of your environment). The first (xargs) version handles long input automatically; xargs offers several options to control the details.
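For the spaces-in-names pitfall specifically, GNU xargs can also read one name per line (a GNU-specific sketch; names containing newlines are still out of reach):
xargs -d '\n' grep keyword < files.txt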
Both of the answers from DevSolar work (tested on Linux Ubuntu), but the xargs version is preferable if there may be many files, since it will avoid running into command line length limits.
so:
cat files.txt | xargs grep keyword
is the way to go
tr '\n' '\0' <files.txt | LANG=C xargs -r0 grep -F keyword
tr delimits the names with the NUL character so that spaces are not significant (note the corresponding -0 option to xargs).
xargs -r will start a single grep process for a "large" number of files, but not start any grep process if there are no files.
LANG=C means use quick routines for matching, rather than slow locale ones
grep -F means use quick string matching rather than slow regular expression matching
bash, ksh & zsh version:
grep keyword $(<files.txt)
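In those shells, $(<files.txt) is a built-in equivalent of $(cat files.txt) that saves the extra cat process; the word-splitting pitfalls described above still apply.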
It's been a long time since I last wrote a bash shell script, but you could store the result of the first grep (the one finding all filenames) in an array and iterate over it, issuing further grep commands.
A good starting point is the bash scripting guide.
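A minimal sketch of that array idea (assuming bash 4+ for mapfile; it copes with spaces in names, though not with embedded newlines):
mapfile -t files < files.txt    # one array element per line of files.txt
grep keyword "${files[@]}"      # quoted expansion keeps names with spaces intact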
