Why does grep not function as expected with a large file? [duplicate] - bash

grep returns
Binary file test.log matches
For example
echo "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log # in zsh
echo -e "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log # in bash
grep re test.log
I want the result to show line1 and line3 (two lines in total).
Is it possible to use tr to convert the unprintable data into readable data, so that grep works again?

grep -a
It can't get simpler than that.
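For the test.log from the question, that is all it takes (output sketched here; note that the stray NUL and CR bytes are passed through to the terminal unmodified):
$ grep -a re test.log
line1 re
line3 re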

One way is to simply treat binary files as text anyway, with grep --text but this may well result in binary information being sent to your terminal. That's not really a good idea if you're running a terminal that interprets the output stream (such as VT/DEC or many others).
Alternatively, you can send your file through tr with the following command:
tr '\000-\011\013-\037\177-\377' '.' <test.log | grep whatever
This changes anything below a space character (except newline) and anything above 126 into a . character, leaving only the printables.
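Applied to the question's test.log, the pipeline would look roughly like this (a sketch; the NUL and CR bytes each become a dot):
$ tr '\000-\011\013-\037\177-\377' '.' <test.log | grep re
line1 re ..
line3 re.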
If you want every "illegal" character replaced by a different one, you can use something like the following C program, a classic standard input filter:
#include <stdio.h>

int main (void) {
    int ch;
    while ((ch = getchar()) != EOF) {
        if ((ch == '\n') || ((ch >= ' ') && (ch <= '~'))) {
            putchar (ch);
        } else {
            printf ("{{%02x}}", ch);
        }
    }
    return 0;
}
This will give you {{NN}}, where NN is the hex code for the character. You can simply adjust the printf for whatever style of output you want.
You can see the program in action here:
pax$ printf 'Hello,\tBob\nGoodbye, Bob\n' | ./filterProg
Hello,{{09}}Bob
Goodbye, Bob

You could run the data file through cat -v, e.g.:
$ cat -v tmp/test.log | grep re
line1 re ^@^M
line3 re^M
which could then be further post-processed to remove the junk; this is the closest analogue to your question about using tr for the task.
-v simply tells cat to display non-printing characters.
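For instance, a rough post-processing sketch that strips the ^@ and trailing ^M markers cat -v produces (assuming the example file from the question):
$ cat -v test.log | grep re | sed -e 's/\^@//g' -e 's/\^M$//'
line1 re
line3 re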

You can use "strings" to extract strings from a binary file, for example
strings binary.file | grep foo

You can force grep to look at binary files with:
grep --binary-files=text
You might also want to add -o (--only-matching) so you don't get tons of binary gibberish that will bork your terminal.
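Put together for the question's example, that might look like this (a sketch):
$ grep --binary-files=text -o re test.log
re
re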

Starting with Grep 2.21, binary files are treated differently:
When searching binary data, grep now may treat non-text bytes as line
terminators. This can boost performance significantly.
So what happens now is that with binary data, non-text bytes are treated as line terminators in addition to newlines. If you want to change this behavior, you can:
use --text. This will ensure that only newlines are line terminators.
use --null-data. This will ensure that only null bytes are line terminators.
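For example (a sketch, assuming GNU grep 2.21 or later):
grep --text re test.log        # only newlines terminate lines
grep --null-data re test.log   # only NUL bytes terminate lines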

grep -a will force grep to search and output from a file that grep thinks is binary.
grep -a re test.log

As James Selvakumar already said, grep -a does the trick. -a or --text forces grep to handle the input stream as text.
See Manpage http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
try
cat test.log | grep -a somestring

You can do:
strings test.log | grep -i somestring
This will give the output as readable strings for grep to match.

Here's what I used on a system that didn't have the strings command installed:
cat yourfilename | tr -cd "[:print:]"
This prints the text and removes unprintable characters in one fell swoop, unlike "cat -v filename" which requires some postprocessing to remove unwanted stuff. Note that some of the binary data may be printable so you'll still get some gibberish between the good stuff. I think strings removes this gibberish too if you can use that.
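One caveat: [:print:] does not include the newline character, so the command above also deletes newlines and flattens the whole file onto a single line. To keep the line structure, add \n to the set (GNU tr):
tr -cd "[:print:]\n" < yourfilename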

You can also try the Word Extractor tool. Word Extractor can be used with any file on your computer to separate the strings that contain human text / words from binary code (exe applications, DLLs).

Related

bash tr extra operand on redirect to another file

I have a small script that downloads a value from a web page.
Before anyone loses their mind because I am not using an HTML parser: besides the headers, the whole web page has only 3 lines of text between one pair of pre tags. I am just after the number values - that is it.
</head><body><pre>
sym
---
12300
</pre></body></html>
This is the script:
#!/bin/bash
wget -O foocounthtml.txt "http://foopage"
tr -d "\n" foocounthtml.txt > foocountnonewlines.txt
Anyhow, the tr command is throwing an error:
tr: extra operand ‘foocounthtml.txt’
Only one string may be given when deleting without squeezing repeats.
Try 'tr --help' for more information.
Yes, I could use sed for in-place modification with the -i flag. However, I am perplexed by this tr error. Redirecting tr output works fine from the command line, but not in a script.
The 'tr' command operates on SETs of text rather than files. From the man page:
NAME
tr - translate or delete characters
SYNOPSIS
tr [OPTION]... SET1 [SET2]
DESCRIPTION
Translate, squeeze, and/or delete characters from standard input, writing to standard output.
...
SETs are specified as strings of characters. Most represent themselves. Interpreted sequences are:
So tr is expecting the actual content you want to operate on rather than the name of the target file. You can simply pipe the file's contents to tr for the results you want:
cat foocounthtml.txt | tr -d "\n" > foocountnonewlines.txt
or, as @CharlesDuffy points out, it would be faster to read directly from the file:
tr -d "\n" < foocounthtml.txt > foocountnonewlines.txt

Trying to remove non-printable characters (junk values) from a UNIX file

I am trying to remove non-printable characters (e.g. ^@) from the records in my file. Since the volume of records in the file is too big, using cat in a loop is not an option, as the loop takes too much time.
I tried using
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME
but still the ^@ characters are not removed.
Also I tried using
awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE
but it also did not help.
Can anybody suggest some alternative way to remove non-printable characters?
I used tr -cd, but it removes accented characters, which are required in the file.
Perhaps you could go with the complement of [:print:], which contains all printable characters:
tr -cd '[:print:]' < file > newfile
If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):
sed 's/[^[:print:]]//g' file
Remove all control characters first:
tr -dc '\007-\011\012-\015\040-\376' < file > newfile
Then try your string:
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile
I believe that what you see as ^@ is in fact a zero byte, \0.
The tr filter above will remove those as well.
strings -1 file... > outputfile
seems to work. The strings program prints all runs of printable characters, in this case runs of length at least 1 (the -1 argument). It effectively removes all the non-printable characters.
"man strings" will provide the documentation.
I was searching for this for a while and found a rather simple solution:
The package ansifilter does exactly this. All you need to do is just pipe the output through it.
On Mac:
brew install ansifilter
Then:
cat file.txt | ansifilter

Get first N chars and sort them

I have a requirement where I need to fetch the first four characters from each line of a file and sort them.
I tried the way below, but it's not sorting the characters within each line:
cut -c1-4 simple_file.txt | sort -n
Output using the above:
appl
bana
uoia
Expected output:
alpp
aabn
aiou
sort isn't the right tool for the job in this case, as it is used to sort lines of input, not the characters within each line.
I know you didn't tag the question with perl but here's one way you could do it:
perl -F'' -lane 'print(join "", sort @F[0..3])' file
This uses the -a switch to auto-split each line of input on the delimiter specified by -F (in this case, an empty string, so each character is its own element in the array @F). It then sorts the first 4 characters of the array using the standard string comparison order. The result is joined together on an empty string.
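For instance, with some of the sample words from this page (a quick check):
$ printf 'apple\nbanana\nuoiea\n' | perl -F'' -lane 'print(join "", sort @F[0..3])'
alpp
aabn
eiou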
Try defining two helper functions:
explodeword () {
    test -z "$1" && return
    echo ${1:0:1}
    explodeword ${1:1}
}
sortword () {
    echo $(explodeword $1 | sort) | tr -d ' '
}
Then
cut -c1-4 simple_file.txt | while read -r word; do sortword $word; done
will do what you want.
The sort command is used to sort files line by line; it's not designed to sort the contents of a line. It's not impossible to make sort do what you want, but it would be a bit messy and probably inefficient.
I'd probably do this in Python, but since you might not have Python, here's a short awk command (GNU awk, which provides asort) that does what you want.
awk '{split(substr($0,1,4),a,"");n=asort(a);s="";for(i=1;i<=n;i++)s=s a[i];print s}'
Just put the name of the file (or files) that you want to process at the end of the command line.
Here's some data I used to test the command:
data
this
is a
simple
test file
a
of
apple
banana
cat
uoiea
bye
And here's the output:
hist
ais
imps
estt
a
fo
alpp
aabn
act
eiou
bey
Here's an ugly Python one-liner; it would look a bit nicer as a proper script rather than as a Bash command line:
python -c "import sys;print('\n'.join([''.join(sorted(s[:4])) for s in open(sys.argv[1]).read().splitlines()]))"
In contrast to the awk version, this command can only process a single file, and it reads the whole file into RAM to process it, rather than processing it line by line.
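Here is a sketch of a line-by-line variant that avoids reading the whole file into memory (same output otherwise):
python -c "
import sys
for line in open(sys.argv[1]):
    print(''.join(sorted(line.rstrip('\n')[:4])))
" simple_file.txt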

Colorized grep -- viewing the entire file with highlighted matches

I find grep's --color=always flag to be tremendously useful. However, grep only prints lines with matches (unless you ask for context lines). Given that each line it prints has a match, the highlighting doesn't add as much capability as it could.
I'd really like to cat a file and see the entire file with the pattern matches highlighted.
Is there some way I can tell grep to print every line being read regardless of whether there's a match? I know I could write a script to run grep on every line of a file, but I was curious whether this was possible with standard grep.
Here are some ways to do it:
grep --color 'pattern\|$' file
grep --color -E 'pattern|$' file
egrep --color 'pattern|$' file
The | symbol is the OR operator. Either escape it using \ or tell grep that the search text has to be interpreted as regular expressions by adding -E or using the egrep command instead of grep.
The search text "pattern|$" is actually a trick: it will match lines that have pattern OR lines that have an end. Because all lines have an end, all lines are matched, but the end of a line isn't actually any character, so it won't be colored.
To also pass the colored parts through pipes, e.g. towards less, provide the always parameter to --color:
grep --color=always 'pattern\|$' file | less -r
grep --color=always -E 'pattern|$' file | less -r
egrep --color=always 'pattern|$' file | less -r
Here's something along the same lines. Chances are, you'll be using less anyway, so try this:
less -p pattern file
It will highlight the pattern and jump to the first occurrence of it in the file.
You can jump to the next occurrence with n and to the previous occurrence with p. Quit with q.
I'd like to recommend ack -- better than grep, a power search tool for programmers.
$ ack --color --passthru --pager="${PAGER:-less -R}" pattern files
$ ack --color --passthru pattern files | less -R
$ export ACK_PAGER_COLOR="${PAGER:-less -R}"
$ ack --passthru pattern files
I love it because it defaults to recursive searching of directories (and does so much smarter than grep -r), supports full Perl regular expressions (rather than the POSIXish regex(3)), and has a much nicer context display when searching many files.
You can use my highlight script from https://github.com/kepkin/dev-shell-essentials
It's better than grep because you can highlight each match with its own color.
$ command_here | highlight green "input" | highlight red "output"
You can also create an alias. Add this function to your .bashrc (or .bash_profile on macOS):
function grepe {
    grep --color -E "$1|$" $2
}
You can now use it like this: "ifconfig | grepe inet" or "grepe css index.html".
(PS: don't forget to source ~/.bashrc to reload it in the current session.)
Use the colout program: http://nojhan.github.io/colout/
It is designed to add color highlights to a text stream. Given a regex and a color (e.g. "red"), it reproduces the text stream with matches highlighted, e.g.:
# cat logfile but highlight instances of 'ERROR' in red
colout ERROR red <logfile
You can chain multiple invocations to add multiple different color highlights:
tail -f /var/log/nginx/access.log | \
colout ' 5\d\d ' red | \
colout ' 4\d\d ' yellow | \
colout ' 3\d\d ' cyan | \
colout ' 2\d\d ' green
Or you can achieve the same thing by using a regex with N groups (parenthesised parts of the regex), followed by a comma separated list of N colors.
vagrant status | \
colout \
  '(^.+ running)|(^.+suspended)|(^.+not running)' \
  green,yellow,red
The -z option for grep is also pretty slick!
cat file1 | grep -z "pattern"
As grep -E '|pattern' has already been suggested, I just want to clarify that it's possible to highlight a whole line too.
For example, tail -f somelog | grep --color -E '| \[2\].*' (note the empty alternative at the start of the pattern, -E '|...') highlights every matching line in full.
I use rcg from "Linux Server Hacks", O'Reilly. It's perfect for what you want and can highlight multiple expressions each with different colours.
#!/usr/bin/perl -w
#
# regexp coloured glasses - from Linux Server Hacks from O'Reilly
#
# eg: rcg "fatal" "BOLD . YELLOW . ON_WHITE" /var/adm/messages
#
use strict;
use Term::ANSIColor qw(:constants);

my %target = ( );
while (my $arg = shift) {
    my $clr = shift;
    if (($arg =~ /^-/) || !$clr) {
        print "Usage: rcg [regex] [color] [regex] [color] ...\n";
        exit(2);
    }
    #
    # Ugly, lazy, pathetic hack here. [Unquote]
    #
    $target{$arg} = eval($clr);
}
my $rst = RESET;
while (<>) {
    foreach my $x (keys(%target)) {
        s/($x)/$target{$x}$1$rst/g;
    }
    print;
}
I added this to my .bash_aliases:
highlight() {
    grep --color -E "$1|\$"
}
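Usage is then, for example (any stream works, since the function reads stdin):
$ tail -f /var/log/syslog | highlight error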
The sed way
As there are already a lot of different solutions, but none show sed, and because sed is lighter and quicker than grep, I prefer to use sed for this kind of job:
sed 's/pattern/\o033[47;31;1m&\o033[0m/' file
This may seem less intuitive, so here is how it works:
\o033 is the sed syntax for generating the character with octal code 033 -> Escape.
(Some shells and editors also allow entering <Ctrl>-<V> followed by <Esc>, to type the character directly.)
Esc [ 47 ; 31 ; 1 m is an ANSI escape code: background grey, foreground red, and bold face.
& re-prints the matched pattern.
Esc [ 0 m returns the colors to default.
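A quick way to see it in action (assuming GNU sed, which understands \o033):
$ echo 'a pattern in a line' | sed 's/pattern/\o033[47;31;1m&\o033[0m/'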
You could also highlight the entire line, but mark the pattern as red:
sed -E <file -e \
's/^(.*)(pattern)(.*)/\o33[30;47m\1\o33[31;1m\2\o33[0;30;47m\3\o33[0m/'
Dynamic tail -f, following logfiles
One advantage of using sed: you can send an alarm beep to the console using the bell ASCII character 0x07. I often use sed like:
sudo tail -f /var/log/kern.log |
sed -ue 's/[lL]ink .*\([uU]p\|[dD]own\).*/\o33[47;31;1m&\o33[0m\o7/'
-u stands for unbuffered. This ensures that each line is processed immediately.
So I will hear a beep instantly when I connect or disconnect my Ethernet cable.
Of course, instead of the link up pattern, you could watch for USB in the same file, or even search for from=.*alice@bobserver.org in /var/log/mail.log (if you're Charlie, anxiously awaiting an email from Alice ;)...
To highlight patterns while viewing the whole file, the h tool can do this.
Plus, it uses different colors for different patterns.
cat FILE | h 'PAT1' 'PAT2' ...
You can also pipe the output of h to less -R for better reading.
To grep and use 1 color for each pattern, cxpgrep could be a good fit.
Use ripgrep, aka rg: https://github.com/BurntSushi/ripgrep
rg --passthru...
Color is the default:
rg -t tf -e 'key.*tfstate' -e dynamodb_table
--passthru
Print both matching and non-matching lines.
Another way to achieve a similar effect is by modifying your pattern to
match the empty string.
For example, if you are searching using rg foo then using
rg "^|foo" instead will emit every line in every file searched, but only
occurrences of foo will be highlighted.
This flag enables the same behavior without needing to modify the pattern.
Sacrilege, granted, but grep has gotten complacent.
brew/apt/rpm/whatever install ripgrep
You'll never go back.
Another dirty way:
grep -A80 -B80 --color FIND_THIS IN_FILE
I did an
alias grepa='grep -A80 -B80 --color'
in my .bashrc.
Here is a shell script that uses Awk's gsub function to replace the text you're searching for with the proper escape sequence to display it in bright red:
#! /bin/bash
awk -v str="$1" 'BEGIN{repltext=sprintf("%c[1;31;40m&%c[0m", 0x1B,0x1B);}{gsub(str,repltext); print}' "$2"
Use it like so:
$ ./cgrep pattern [file]
Unfortunately, it doesn't have all the functionality of grep.
For more information, you can refer to the article "So You Like Color" in Linux Journal.
One other answer mentioned grep's -Cn switch, which includes n lines of context. I sometimes do this with n=99 as a quick-and-dirty way of getting [at least] a screenful of context when the egrep pattern seems too fiddly, or when I'm on a machine on which I've not installed rcg and/or ccze.
I recently discovered ccze which is a more powerful colorizer. My only complaint is that it is screen-oriented (like less, which I never use for that reason) unless you specify the -A switch for "raw ANSI" output.
+1 for the rcg mention above. It is still my favorite since it is so simple to customize in an alias. Something like this is usually in my ~/.bashrc:
alias tailc='tail -f /my/app/log/file | rcg send "BOLD GREEN" receive "CYAN" error "RED"'
Alternatively you can use The Silver Searcher and do
ag <search> --passthrough
I use the following command for a similar purpose:
grep -C 100 searchtext file
This tells grep to print up to 100 lines of context before and after the highlighted search text.
It might seem like a dirty hack.
grep "^\|highlight1\|highlight2\|highlight3" filename
This means: match the beginning of the line (^), or highlight1, or highlight2, or highlight3. As a result, all highlight* pattern matches are highlighted, even when they occur on the same line.
Ok, this is one way,
wc -l filename
will give you the line count -- say NN, then you can do
grep -C NN --color=always filename
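Or combined into one line (a sketch; GNU tools assumed, so wc -l prints a bare number):
grep -C "$(wc -l < filename)" --color=always pattern filename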
If you want to highlight several patterns with different colors, see this bash script.
Basic usage:
echo warn error debug info 10 nil | colog
You can change patterns and colors while running pressing one key and then enter key.
Here's my approach, inspired by @kepkin's solution:
# Adds ANSI colors to matched terms, similar to grep --color but without
# filtering unmatched lines. Example:
# noisy_command | highlight ERROR INFO
#
# Each argument is passed into sed as a matching pattern and matches are
# colored. Multiple arguments will use separate colors.
#
# Inspired by https://stackoverflow.com/a/25357856
highlight() {
    # color cycles from 0-5 (shifted to 31-36), i.e. r,g,y,b,m,c
    local color=0 patterns=()
    for term in "$@"; do
        patterns+=("$(printf 's|%s|\e[%sm\\0\e[0m|g' "${term//|/\\|}" "$(( color+31 ))")")
        color=$(( (color+1) % 6 ))
    done
    sed -f <(printf '%s\n' "${patterns[@]}")
}
This accepts multiple arguments (but doesn't let you customize the colors). Example:
$ noisy_command | highlight ERROR WARN
Is there some way I can tell grep to print every line being read
regardless of whether there's a match?
Option -C999 will do the trick in the absence of an option to display all context lines. Most other grep variants support this too. However: 1) no output is produced when no match is found, and 2) this option has a negative impact on grep's efficiency: when the -C value is large, this many lines may have to be temporarily stored in memory for grep to determine which lines of context to display when a match occurs. Note that grep implementations do not load whole input files but rather read a few lines at a time or use a sliding window over the input. The "before" part of the context has to be kept in a window (memory) to output the "before" context lines later when a match is found.
A pattern such as ^|PATTERN or PATTERN|$ or any empty-matching sub-pattern for that matter such as [^ -~]?|PATTERN is a nice trick. However, 1) these patterns don't show non-matching lines highlighted as context and 2) this can't be used in combination with some other grep options, such as -F and -w for example.
So none of these approaches is satisfying to me. I'm using ugrep, an enhanced grep with option -y to efficiently display all non-matching output as color-highlighted context lines. Other grep-like tools such as ag and ripgrep also offer a pass-through option, but ugrep is compatible with GNU/BSD grep and offers a superset of grep options like -y and -Q. For example, here is what option -y shows when combined with -Q (an interactive query UI to enter patterns):
ugrep -Q -y FILE ...
Also try:
egrep 'pattern1|pattern2' FILE.txt | less -Sp 'pattern1|pattern2'
This will page the output with the pattern(s) highlighted.
