How to treat files as a huge string with macOS grep?

I'm using the following command on Ubuntu to list all files containing a given pattern:
for f in *; do if grep -zoPq "foo\nbar" $f; then echo $f; fi; done
But on macOS, I'm getting the following error:
grep: invalid option -- z
Unlike GNU grep, macOS grep has no -z option to treat a file as one big string.
Is there an option in macOS grep equivalent to -z? If not, what alternative can I use to get the same result?

-P (Perl regex) is supported only by GNU grep, not by the BSD grep found on macOS.
You can either use Homebrew to install GNU grep or use this equivalent awk command:
awk 'FNR == 1 {p = ""} p ~ /foo$/ && /^bar/ {print FILENAME; nextfile} {p = $0}' *
Note that this eliminates the need for a shell for loop. The FNR == 1 rule resets p at the start of each file, so a match cannot straddle the end of one file and the start of the next.
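To see the previous-line technique in action, here is a minimal sketch on throwaway files. The file names and the list_foo_bar helper are made up for the demo, and it uses a seen flag instead of nextfile on the assumption that not every awk provides nextfile.

```shell
# Hypothetical demo files in a temporary directory.
demo_dir=$(mktemp -d)
printf 'x foo\nbar y\n' > "$demo_dir/a.txt"   # foo ends a line, bar starts the next: match
printf 'foo bar\n'      > "$demo_dir/b.txt"   # both on one line: no match
printf 'tail foo\n'     > "$demo_dir/c.txt"   # ends with foo ...
printf 'bar head\n'     > "$demo_dir/d.txt"   # ... but bar is in the NEXT file: no match

# Print each file once if a line ending in "foo" is directly followed
# by a line starting with "bar". prev and seen are reset per file.
list_foo_bar() {
  awk 'FNR == 1 { prev = ""; seen = 0 }
       !seen && prev ~ /foo$/ && /^bar/ { print FILENAME; seen = 1 }
       { prev = $0 }' "$@"
}

matches=$(cd "$demo_dir" && list_foo_bar *)
echo "$matches"
```

Only a.txt is listed; c.txt and d.txt show why the per-file reset matters.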

You can install pcregrep via Homebrew, and then use it with the -M option. From its man page:
By default, each line that matches a pattern is copied to the
standard output, and if there is more than one file, the file name is
output at the start of each line, followed by a colon. However, there
are options that can change how pcregrep behaves. In particular, the
-M option makes it possible to search for patterns that span line
boundaries. What defines a line boundary is controlled by the -N
(--newline) option.

With ripgrep
rg -lU 'foo\nbar'
This lists all files in the current directory that contain foo\nbar. The -U option enables matching across multiple lines. Unlike grep -z, ripgrep does not read the whole file in one shot, so this is safe to use even on large input files.
ripgrep searches recursively by default. Use rg -lU --max-depth 1 'foo\nbar' if you don't want to search subdirectories.
Note, however, that by default ripgrep ignores:
files and directories that match rules in ignore files such as .gitignore
hidden files and directories
binary files
You can change that with:
-u or --no-ignore
-uu or --no-ignore --hidden
-uuu or --no-ignore --hidden --binary

It seems you are searching for files that contain the sequence foo\nbar. With GNU awk (brew install gawk), you can set the record separator RS to this sequence and check whether the first record was terminated by it:
gawk 'BEGIN{RS="foo\nbar"}{exit (RT!=RS)}' file
This splits the file into records separated by RS. If the first record is terminated by that separator, the sequence occurs in the file and gawk exits with code 0; otherwise it exits with code 1. The behaviour is the same as the proposed grep.
If you just want the files listed, you can do:
gawk 'BEGIN{RS="foo\nbar"}(RT==RS){print FILENAME}{nextfile}' *

Related

How do I get rid of “--” line separator when using grep

I'm using the commands given below for splitting my fastq file into two separate paired end reads files:
grep '#.*/1' -A 3 24538_7#2.fq >24538_7#2_1.fq
grep '#.*/2' -A 3 24538_7#2.fq >24538_7#2_2.fq
But grep automatically inserts a -- line separator between the entries, which makes the output an invalid fastq file that can't be processed further.
So I want to get rid of the line separator (--).
PS: I've found answers for Linux, but I'm using macOS, and they didn't work in the Mac terminal.
You can use the --no-group-separator option to suppress it (in GNU grep).
Alternatively, you could use (GNU) sed:
sed '\|#.*/1|,+3!d'
This deletes all lines other than those matching #.*/1 and the three lines that follow each match.
For macOS sed, you could use
sed -n '\|#.*/1|{N;N;N;p;}'
but this gets unwieldy quickly for more context lines.
Another approach would be to chain grep with itself:
grep '#.*/1' -A 3 file.fq | grep -v "^--"
The second grep keeps only the lines that do not match (-v) ^--, that is, it drops the lines starting with --. A bare -- pattern can be interpreted as a command-line option and would need awkward escaping such as "[-][-]", which is why I put the ^ there.
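When more context lines are needed than the sed N;N;N form handles comfortably, an awk counter is one more option that works with the stock macOS tools. A minimal sketch on a made-up three-record fastq file (the read names and sequences are invented for the demo):

```shell
fq=$(mktemp)
printf '%s\n' '@r1#2/1' ACGT + IIII \
              '@r1#2/2' TTTT + JJJJ \
              '@r2#2/1' GGGG + KKKK > "$fq"

# On a header match, arm a counter for 4 lines (the header plus the 3
# lines after it). Lines are printed while the counter is positive,
# so no "--" separator is ever produced.
reads_1=$(awk '/#.*\/1/ { left = 4 } left > 0 { print; left-- }' "$fq")
echo "$reads_1"
```

Changing 4 to any other number adjusts the amount of trailing context. As with the grep version, this assumes the pattern only ever matches header lines.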

grep some expression from some logs and only select the ones with positive results

I'm using 'zgrep' to find some IPs in some apache/nginx logs and I need a way to keep only the positive results.
I'm using this:
for i in `cat /var/tmp/list_of_ip.txt`; do
zgrep -arcH $i /webstats/some_website/*/*.2012-10-1{7..9}* ;
done
There are lots of log files. I just want to know which ones return positive results.
Most of them will print 0 at the end, indicating that nothing was found, and some will print the number of matches. How can I select only those and write the results to a file?
Thanks.
You want to avoid decompressing the log files more than once each, so you should use:
zgrep -l -F -f /var/tmp/list_of_ip.txt /webstats/some_website/*/*.2012-10-1{7..9}*
This will decompress the log files once, run grep in fgrep mode (-F) and read the list of words to look for (IP addresses) from the file (-f /var/tmp/list_of_ip.txt), and list only the file names of the files that contain one or more of the matching IP addresses (-l). fgrep mode looks for words without metacharacters; if you need metacharacters, I think you can use grep -E (with the -f option) instead. You can add the -r option of your -arcH set as you wish; the others are redundant in this context.
Some casual testing done on Mac OS X 10.7.5 with zgrep (gzip) 1.3.12 and grep (GNU grep) 2.5.1, as reported by the --version options.
Think I've got it: just toss another grep in there that filters out the results with ':0' at the end!
for i in `cat /var/tmp/list_of_ip.txt`
do zgrep -arcH "$i" /webstats/some_website/*/*.2012-10-1{7..9}* | grep -v ':0$'
done
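For a feel of what the -l -F -f combination does, here is a small sketch that uses plain grep on uncompressed, made-up log files so it runs anywhere; per the answer above, zgrep takes the same options for compressed logs.

```shell
log_dir=$(mktemp -d)
printf '10.0.0.5\n192.168.1.9\n'   > "$log_dir/ips.txt"       # hypothetical IP list
printf 'GET / from 10.0.0.5\n'     > "$log_dir/access.1.log"
printf 'GET / from 172.16.0.1\n'   > "$log_dir/access.2.log"
printf 'GET /x from 192.168.1.9\n' > "$log_dir/access.3.log"

# -f ips.txt : read the search strings from the file, one per line
# -F         : treat them as fixed strings (the dots are literal)
# -l         : print only the names of files with at least one match
hits=$(cd "$log_dir" && grep -l -F -f ips.txt access.*.log)
echo "$hits"
```

Each log is scanned once for all addresses, and appending > some_file to the grep command would write the list to a file. Note that fixed-string matching is still substring matching, so 10.0.0.5 would also hit a line containing 10.0.0.50.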

How to archive files under certain dir that are not text files in Mac OS?

Hey, guys, I used zip command, but I only want to archive all the files except *.txt. For example, if two dirs file1, file2; both of them have some *.txt files. I want archive only the non-text ones from file1 and file2.
tl;dr: How to tell linux to give me all the files that don't match *.txt
$ zip -r zipfile -x'*.txt' folder1 folder2 ...
Move to your desired directory and run:
ls | grep -P '\.(?!txt$)' | zip -@ zipname
This will create a zipname.zip file containing everything but .txt files. In short, what it does is:
List all files in the directory, one per line (the -1 option forces this, but it is not needed here because it is the default when the output is a pipe rather than the terminal).
Keep only the lines that do not end in .txt. Note that grep uses a Perl regular expression (option -P, GNU grep only) so that the negative lookahead can be used.
Zip the list read from stdin (-@) into the zipname file.
Update
The first method I posted fails for files with two . characters, as I described in the comments. For some reason, though, I forgot about grep's -v option, which prints only what doesn't match the regex. Also, go ahead and include the case-insensitive option.
ls | grep -vi '\.txt$' | zip -@ zipname
Simple, use bash's Extended Glob option like so:
#!/bin/bash
shopt -s extglob
zip -some -options !(*.txt)
Edit
This isn't as good as the -x builtin option to zip but my solution is generic across any command that may not have this nice feature.
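Another way to build the list without relying on a bash-only glob is find with a negated -name test. A minimal sketch with made-up folder contents; it only prints the list and leaves the zip step as a comment.

```shell
work_dir=$(mktemp -d)
mkdir -p "$work_dir/folder1" "$work_dir/folder2"
touch "$work_dir/folder1/notes.txt" "$work_dir/folder1/photo.jpg" \
      "$work_dir/folder2/readme.TXT" "$work_dir/folder2/data.csv"

# -type f          : regular files only
# ! -name '*.txt'  : drop names ending in .txt (case-sensitive, so
#                    readme.TXT is kept)
non_txt=$(cd "$work_dir" && find folder1 folder2 -type f ! -name '*.txt' | LC_ALL=C sort)
echo "$non_txt"

# To archive the result, the same list could be piped to: zip archive -@
```

Unlike ls, find descends into subdirectories, so nested non-text files are picked up as well.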

Case-insensitive search and replace with sed

I'm trying to use SED to extract text from a log file. I can do a search-and-replace without too much trouble:
sed 's/foo/bar/' mylog.txt
However, I want to make the search case-insensitive. From what I've googled, it looks like appending i to the end of the command should work:
sed 's/foo/bar/i' mylog.txt
However, this gives me an error message:
sed: 1: "s/foo/bar/i": bad flag in substitute command: 'i'
What's going wrong here, and how do I fix it?
Update: Starting with macOS Big Sur (11.0), sed does support the I flag for case-insensitive matching, so the command in the question should now work (BSD sed doesn't report its version, but you can go by the date at the bottom of the man page, which should be March 27, 2017 or more recent). A simple example:
# BSD sed on macOS Big Sur and above (and GNU sed, the default on Linux)
$ sed 's/ö/#/I' <<<'FÖO'
F#O # `I` matched the uppercase Ö correctly against its lowercase counterpart
Note: I (uppercase) is the documented form of the flag, but i works as well.
Similarly, starting with macOS Big Sur (11.0) awk now is locale-aware (awk --version should report 20200816 or more recent):
# BSD awk on macOS Big Sur and above (and GNU awk, the default on Linux)
$ awk '{ print tolower($0) }' <<<'FÖO'
föo # non-ASCII character Ö was properly lowercased
The following applies to macOS up to Catalina (10.15):
To be clear: On macOS, sed - which is the BSD implementation - does NOT support case-insensitive matching - hard to believe, but true. The formerly accepted answer, which itself shows a GNU sed command, gained that status because of the perl-based solution mentioned in the comments.
To make that Perl solution work with foreign characters as well, via UTF-8, use something like:
perl -C -Mutf8 -pe 's/öœ/oo/i' <<< "FÖŒ" # -> "Foo"
-C turns on UTF-8 support for streams and files, assuming the current locale is UTF-8-based.
-Mutf8 tells Perl to interpret the source code as UTF-8 (in this case, the string passed to -pe); this is the shorter equivalent of the more verbose -e 'use utf8;'. Thanks, Mark Reed.
(Note that using awk is not an option either, as awk on macOS (i.e., BWK awk and BSD awk) appears to be completely unaware of locales altogether - its tolower() and toupper() functions ignore foreign characters (and sub() / gsub() don't have case-insensitivity flags to begin with).)
A note on the relationship of sed and awk to the POSIX standard:
BSD sed and awk limit their functionality mostly to what the POSIX sed and
POSIX awk specs mandate, whereas their GNU counterparts implement many more extensions.
Editor's note: This solution doesn't work on macOS (out of the box), because it only applies to GNU sed, whereas macOS comes with BSD sed.
Capitalize the 'I'.
sed 's/foo/bar/I' file
Another workaround for sed on Mac OS X is to install gsed from MacPorts or Homebrew and then create the alias sed='gsed'.
If you are doing pattern matching first, e.g.,
/pattern/s/xx/yy/g
then you want to put the I after the pattern:
/pattern/Is/xx/yy/g
Example:
echo Fred | sed '/fred/Is//willma/g'
returns willma; without the I, it returns the string untouched (Fred).
The sed FAQ addresses the closely related case-insensitive search. It points out that a) many versions of sed support a flag for it and b) it's awkward to do in sed, you should rather use awk or Perl.
But to do it in POSIX sed, they suggest three options (adapted for substitution here):
Convert to uppercase and store original line in hold space; this won't work for substitutions, though, as the original content will be restored before printing, so it's only good for insert or adding lines based on a case-insensitive match.
Maybe the possibilities are limited to FOO, Foo and foo. These can be covered by
s/FOO/bar/;s/[Ff]oo/bar/
To search for all possible matches, one can use bracket expressions for each character:
s/[Ff][Oo][Oo]/bar/
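The bracket-expression form is easy to try out, since it needs nothing beyond POSIX sed. A quick sketch with invented input lines:

```shell
# Each letter is spelled out in both cases, so every case variant of
# "foo" matches without any case-insensitivity flag.
result=$(printf 'foo\nFOO\nfOo\nfood\n' | sed 's/[Ff][Oo][Oo]/bar/')
echo "$result"
```

All four lines are rewritten (the last one becomes bard), at the cost of a pattern that grows with every letter.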
The Mac version of sed seems a bit limited. One way to work around this is to use a Linux container (via Docker) that has a usable version of sed:
cat your_file.txt | docker run -i busybox /bin/sed -r 's/[0-9]{4}/****/Ig'
Use following to replace all occurrences:
sed 's/foo/bar/gI' mylog.txt
I had a similar need, and came up with this:
This command simply finds all the files:
grep -i -l -r foo ./*
This one excludes this_shell.sh (in case you put the command in a script called this_shell.sh), tees the output to the console so you can see what happened, and then runs sed on each file name found to replace the text foo with bar:
grep -i -l -r --exclude "this_shell.sh" foo ./* | tee /dev/fd/2 | while read -r x; do sed -b -i 's/foo/bar/gi' "$x"; done
I chose this method because I didn't like having the timestamps changed on files that were not modified. Feeding sed the grep result means only the files containing the target text are touched (which likely improves performance as well).
Be sure to back up your files and test before using. It may not work in some environments for files with embedded spaces. (?)
Following should be fine:
sed -i 's/foo/bar/gi' mylog.txt

Using sed to mass rename files

Objective
Change these filenames:
F00001-0708-RG-biasliuyda
F00001-0708-CS-akgdlaul
F00001-0708-VF-hioulgigl
to these filenames:
F0001-0708-RG-biasliuyda
F0001-0708-CS-akgdlaul
F0001-0708-VF-hioulgigl
Shell Code
To test:
ls F00001-0708-*|sed 's/\(.\).\(.*\)/mv & \1\2/'
To perform:
ls F00001-0708-*|sed 's/\(.\).\(.*\)/mv & \1\2/' | sh
My Question
I don't understand the sed code. I understand what the substitution
command
$ sed 's/something/mv'
means. And I understand regular expressions somewhat. But I don't
understand what's happening here:
\(.\).\(.*\)
or here:
& \1\2/
The former, to me, just looks like it means: "a single character,
followed by a single character, followed by any length sequence of a
single character"--but surely there's more to it than that. As far as
the latter part:
& \1\2/
I have no idea.
First, I should say that the easiest way to do this is to use the
prename or rename commands.
On Ubuntu, OSX (Homebrew package rename, MacPorts package p5-file-rename), or other systems with perl rename (prename):
rename s/0000/000/ F0000*
or on systems with rename from util-linux-ng, such as RHEL:
rename 0000 000 F0000*
That's a lot more understandable than the equivalent sed command.
But as for understanding the sed command, the sed manpage is helpful. If
you run man sed and search for & (using the / command to search),
you'll find it's a special character in s/foo/bar/ replacements.
s/regexp/replacement/
Attempt to match regexp against the pattern space. If success‐
ful, replace that portion matched with replacement. The
replacement may contain the special character & to refer to that
portion of the pattern space which matched, and the special
escapes \1 through \9 to refer to the corresponding matching
sub-expressions in the regexp.
Therefore, \(.\) matches the first character, which can be referenced by \1.
Then . matches the next character, which is always 0.
Then \(.*\) matches the rest of the filename, which can be referenced by \2.
The replacement string puts it all together using & (the original
filename) and \1\2 which is every part of the filename except the 2nd
character, which was a 0.
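Running the substitution on a single name, without piping the result to sh, makes the roles of &, \1 and \2 visible. A small sketch using one of the file names from the question:

```shell
name='F00001-0708-RG-biasliuyda'

# \(.\)  captures "F" as \1
# .      consumes the first "0" and is not captured
# \(.*\) captures the rest, "0001-0708-RG-biasliuyda", as \2
# &      stands for the whole match, here the full original name
generated=$(printf '%s\n' "$name" | sed 's/\(.\).\(.*\)/mv & \1\2/')
echo "$generated"
```

The output is the mv command that the original pipeline would hand to sh.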
This is a pretty cryptic way to do this, IMHO. If for
some reason the rename command was not available and you wanted to use
sed to do the rename (or perhaps you were doing something too complex
for rename?), being more explicit in your regex would make it much
more readable. Perhaps something like:
ls F00001-0708-*|sed 's/F0000\(.*\)/mv & F000\1/' | sh
Being able to see what's actually changing in the
s/search/replacement/ makes it much more readable. Also it won't keep
sucking characters out of your filename if you accidentally run it
twice or something.
You've had your sed explanation; now you can use just the shell, with no need for external commands:
for file in F0000*
do
echo mv "$file" "${file/#F0000/F000}"
# ${file/#F0000/F000} means replace the pattern that starts at beginning of string
done
I wrote a small post with examples on batch renaming using sed a couple of years ago:
http://www.guyrutenberg.com/2009/01/12/batch-renaming-using-sed/
For example:
for i in *; do
mv "$i" "$(echo "$i" | sed "s/regex/replace_text/")";
done
If the regex contains groups (e.g. \(subregex\)), then you can use them in the replacement text as \1, \2, etc.
The easiest way would be:
for i in F00001*; do mv "$i" "${i/F00001/F0001}"; done
or, portably,
for i in F00001*; do mv "$i" "F0001${i#F00001}"; done
This replaces the F00001 prefix in the filenames with F0001.
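The portable form can be tried safely in a scratch directory first. A minimal sketch, with the directory created just for the demo and two of the file names from the question:

```shell
rename_dir=$(mktemp -d)
touch "$rename_dir/F00001-0708-RG-biasliuyda" "$rename_dir/F00001-0708-CS-akgdlaul"

(
  cd "$rename_dir" || exit 1
  for i in F00001*; do
    # ${i#F00001} removes the leading F00001 and leaves the tail,
    # which is then glued onto the new F0001 prefix.
    mv "$i" "F0001${i#F00001}"
  done
)

renamed=$(ls "$rename_dir")
echo "$renamed"
```

Putting echo in front of mv turns the loop into a dry run that only prints the commands.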
credits to mahesh here: http://www.debian-administration.org/articles/150
The sed command
s/\(.\).\(.*\)/mv & \1\2/
means to replace:
\(.\).\(.*\)
with:
mv & \1\2
just like a regular sed command. However, the parentheses, & and the \N markers (\1, \2, ...) change it a little.
The search string matches (and remembers as pattern 1) the single character at the start, followed by a single character, followed by the rest of the string (remembered as pattern 2).
In the replacement string, you can refer to these matched patterns to use them as part of the replacement. You can also refer to the whole matched portion as &.
So what that sed command is doing is creating a mv command based on the original file (for the source) and character 1 and 3 onwards, effectively removing character 2 (for the destination). It will give you a series of lines along the following format:
mv F00001-0708-RG-biasliuyda F0001-0708-RG-biasliuyda
mv abcdef acdef
and so on.
Using perl rename (a must have in the toolbox):
rename -n 's/0000/000/' F0000*
Remove -n switch when the output looks good to rename for real.
There are other tools with the same name which may or may not be able to do this, so be careful.
The rename command that is part of the util-linux package won't.
If you run the following command (GNU)
$ rename
and you see perlexpr, then this seems to be the right tool.
If not, to make it the default (usually already the case) on Debian and derivatives like Ubuntu:
$ sudo apt install rename
$ sudo update-alternatives --set rename /usr/bin/file-rename
For archlinux:
pacman -S perl-rename
For RedHat-family distros:
yum install prename
The 'prename' package is in the EPEL repository.
For Gentoo:
emerge dev-perl/rename
For *BSD:
pkg install gprename
or p5-File-Rename
For Mac users:
brew install rename
If you don't have this command with another distro, search your package manager to install it or do it manually:
cpan -i File::Rename
Old standalone version can be found here
man rename
This tool was originally written by Larry Wall, the creator of Perl.
The backslash-paren stuff means, "while matching the pattern, hold on to the stuff that matches in here." Later, on the replacement text side, you can get those remembered fragments back with "\1" (first parenthesized block), "\2" (second block), and so on.
If all you're really doing is removing the second character, regardless of what it is, you can do this:
s/.//2
but your command is building a mv command and piping it to the shell for execution.
This is no more readable than your version:
find . -type f | sed -n 'h;s/.//4;x;s/^/mv /;G;s/\n/ /g;p' | sh
The fourth character is removed because find is prepending each filename with "./".
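For anyone puzzling over the hold-space commands, here is the same sed script applied to a single find-style path, with each step annotated. The path is one of the names from the question with ./ in front; nothing is piped to sh.

```shell
# Step by step:
#   h         copy the line into the hold space
#   s/.//4    delete the 4th character (the first 0 after "./F")
#   x         swap: pattern space = original name, hold space = new name
#   s/^/mv /  put "mv " in front of the original name
#   G         append a newline plus the hold space (the new name)
#   s/\n/ /g  turn that newline into a space
#   p         print the finished command (-n suppresses everything else)
cmd=$(printf '%s\n' './F00001-0708-VF-hioulgigl' |
      sed -n 'h;s/.//4;x;s/^/mv /;G;s/\n/ /g;p')
echo "$cmd"
```

The numeric flag in s/.//4 is what picks the fourth match of . on the line, which is the same idea as s/.//2 for a name without the ./ prefix.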
Here's what I would do:
for file in *.[Jj][Pp][Gg] ;do
echo mv -vi \"$file\" `jhead $file|
grep Date|
cut -b 16-|
sed -e 's/:/-/g' -e 's/ /_/g' -e 's/$/.jpg/g'` ;
done
Then if that looks ok, add | sh to the end. So:
for file in *.[Jj][Pp][Gg] ;do
echo mv -vi \"$file\" `jhead $file|
grep Date|
cut -b 16-|
sed -e 's/:/-/g' -e 's/ /_/g' -e 's/$/.jpg/g'` ;
done | sh
for i in *; do mv "$i" "$(echo "$i" | sed 's/AAA/BBB/')"; done
The parentheses capture particular strings for use by the backslashed numbers.
ls F00001-0708-*|sed 's|^F0000\(.*\)|mv & F000\1|' | bash
Some examples that work for me:
$ tree -L 1 -F .
.
├── A.Show.2020.1400MB.txt
└── Some Show S01E01 the Loreming.txt
0 directories, 2 files
## remove "1400MB" (I: ignore case) ...
$ for f in *; do mv 2>/dev/null -v "$f" "`echo $f | sed -r 's/.[0-9]{1,}mb//I'`"; done;
renamed 'A.Show.2020.1400MB.txt' -> 'A.Show.2020.txt'
## change "S01E01 the" to "S01E01 The"
## \U& : change (here: regex-selected) text to uppercase;
## note also: no need here for `\1` in that regex expression
$ for f in *; do mv 2>/dev/null "$f" "`echo $f | sed -r "s/([0-9] [a-z])/\U&/"`"; done
$ tree -L 1 -F .
.
├── A.Show.2020.txt
└── Some Show S01E01 The Loreming.txt
0 directories, 2 files
$
2>/dev/null suppresses extraneous output (warnings ...)
reference [this thread]: https://stackoverflow.com/a/2372808/1904943
change case: https://www.networkworld.com/article/3529409/converting-between-uppercase-and-lowercase-on-the-linux-command-line.html

Resources