Trying to remove non-printable characters (junk values) from a UNIX file - bash

I am trying to remove non-printable characters (e.g. ^@) from the records in my file. The file holds so many records that looping over the output of cat is not an option; the loop takes too long.
I tried using
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME
but the ^@ characters are still not removed.
I also tried using
awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE
but that did not help either.
Can anybody suggest an alternative way to remove non-printable characters?
I also tried tr -cd, but it removes accented characters, which are required in the file.

Perhaps you could go with the complement of [:print:], which contains all printable characters (add \n to the set, otherwise tr deletes the newlines as well):
tr -cd '[:print:]\n' < file > newfile
If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):
sed 's/[^[:print:]]//g' file

Remove all control characters first:
tr -dc '\007-\011\012-\015\040-\376' < file > newfile
Then try your string:
sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' newfile
I believe the character you are seeing is in fact a NUL byte (\0), which editors and cat -v display as ^@.
The tr filter above removes those as well.
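A minimal sketch of the byte-range idea above, assuming the file is UTF-8 (or another ASCII-compatible encoding) and that only ASCII control bytes count as junk. The function name strip_control_bytes is made up for illustration. Because bytes 0x80 and above are left alone, accented characters survive:

```shell
# Hypothetical helper: delete ASCII control bytes (NUL through BS, VT, FF,
# SO through US, and DEL) while keeping tab, newline and carriage return.
# Bytes >= 0x80 are not touched, so multi-byte UTF-8 characters stay intact.
strip_control_bytes() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037\177'
}

# Sample record: "café", a NUL byte, " ok", a 0x01 byte, newline.
printf 'caf\303\251\000 ok\001\n' | strip_control_bytes
# prints: café ok
```

For a real file this would be run as strip_control_bytes < file > newfile.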

strings -1 file... > outputfile
seems to work. The strings program extracts every run of printable characters, in this case of length 1 or more (the -1 argument), and prints it. That effectively removes all the non-printable characters.
man strings provides the documentation.

Was searching for this for a while & found a rather simple solution:
The package ansifilter does exactly this. All you need to do is just pipe the output through it.
On Mac:
brew install ansifilter
Then:
cat file.txt | ansifilter

Related

How to extract only the English words and leaving the Devanagari words in bash script?

The text file is like this,
#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर#_
The desired text file should be like,
#
1
8IU
underscore
$that
%redyellow
$
#_
This is what I have tried so far, using awk
awk -F"[अ-ह]*" '{print $1}' filename.txt
And the output that I am getting is,
#
1
$that
%red
$
and with awk -F"[अ-ह]*" '{print $1,$2}' filename.txt I am getting output like this,
#
1 े
ं
ो
$that
%red yellow
$ ि
ं
Is there any way to solve this in a bash script?
Using perl:
$ perl -CSD -lpe 's/\p{Devanagari}+//g' input.txt
#
1
8IU
underscore
$that
%redyellow
$
#_
-CSD tells perl that standard streams and any opened files are encoded in UTF-8. -p loops over input files printing each line to standard output after executing the script given by -e. If you want to modify the file in place, add the -i option.
The regular expression matches any codepoints assigned to the Devanagari script in the Unicode standard and removes them. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.
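Using awk you can do the following (gsub rather than sub, so that every non-ASCII run on a line is removed, not only the first):
awk '{gsub(/[^\x00-\x7F]+/, "")} 1' file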
#
1
8IU
underscore
$that
%redyellow
See the gawk documentation on bracket expressions: https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html
[\x00-\x7F] matches all values numerically between 0 and 127, which is the defined range of the ASCII character set. The complemented list [^\x00-\x7F] matches any character outside the ASCII range.
tr is a very good fit for this task:
LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt
LC_ALL=C selects the POSIX C locale, in which only the ASCII character set is valid.
tr is then told to delete (-d) the complement (-c) of [:cntrl:][:graph:], i.e. every character that is neither a control character nor a visible (graphic) character. Because the C locale knows only ASCII, all non-ASCII characters are discarded. Note that the space character belongs to neither class, so it is deleted too; use '[:cntrl:][:print:]' if spaces must be kept.
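A byte-range variation on the same tr idea, shown here only as a sketch: it assumes the input is UTF-8, where every byte of a non-ASCII character is 0x80 or above. The function name ascii_only is hypothetical.

```shell
# Hypothetical helper: delete every byte >= 0x80, i.e. all bytes that make up
# multi-byte UTF-8 characters, and leave ASCII (spaces included) untouched.
ascii_only() {
    LC_ALL=C tr -d '\200-\377'
}

# Two lines taken from the sample file in the question.
printf '%s\n' '%redएकyellow' '$thatऔर' | ascii_only
# prints:
# %redyellow
# $that
```

Working on bytes avoids any dependence on the locale's idea of a character, at the cost of removing all non-ASCII text rather than Devanagari only.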

Replace special character with sed

I'm trying to replace a special character with sed: every Þ should be replaced with ;
The lines of the file look like this, for example:
0370ÞA020Þ4000011600ÞRED USADOÞ0,00Þ20190414
0370ÞA020Þ4000011601ÞRED USADOÞ0,00Þ20190414
0370ÞA020Þ4000011602ÞRED USADOÞ0,00Þ20190414
Thanks!
Edit
It worked; solved.
Thanks!!!
Try this; a simple substitution works for me:
sed 's/Þ/;/g'
That's the job tr was created to do but look at these results:
$ tr 'Þ' ';' < file
0370;;A020;;4000011600;;RED USADO;;0,00;;20190414
0370;;A020;;4000011601;;RED USADO;;0,00;;20190414
0370;;A020;;4000011602;;RED USADO;;0,00;;20190414
$ sed 's/Þ/;/g' < file
0370;A020;4000011600;RED USADO;0,00;20190414
0370;A020;4000011601;RED USADO;0,00;20190414
0370;A020;4000011602;RED USADO;0,00;20190414
tr seems to treat every Þ as two separate characters: Þ is encoded as two bytes in UTF-8, and a tr that works byte by byte (GNU tr among them) replaces each byte with its own ;. sed may see the same two bytes, but because it replaces a regexp match with a string rather than mapping one set of characters onto another, it still does what you want. So take this as a warning about using tr to replace non-ASCII characters. YMMV!
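A small sketch that makes the byte count visible, assuming the script itself is saved as UTF-8. The names thorn_bytes and thorn_to_semicolon are invented for this example.

```shell
# Count the bytes that one Þ (U+00DE) occupies; in UTF-8 that is 0xC3 0x9E.
# tr -d ' ' strips the padding that some wc implementations add.
thorn_bytes=$(printf '%s' 'Þ' | LC_ALL=C wc -c | tr -d ' ')
echo "$thorn_bytes"    # 2

# Hypothetical helper: sed matches the whole byte sequence of Þ at once,
# so each Þ becomes exactly one semicolon.
thorn_to_semicolon() {
    sed 's/Þ/;/g'
}

printf '%s\n' '0370ÞA020Þ4000011600' | thorn_to_semicolon
# prints: 0370;A020;4000011600
```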
If your data is in a file named d, try GNU sed:
sed -E 'y/Þ/;/' d

Why grep not function as expected with large file? [duplicate]

grep returns
Binary file test.log matches
For example
echo "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log # in zsh
echo -e "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log # in bash
grep re test.log
I want the result to show line1 and line3 (two lines in total).
Is it possible to use tr to convert the unprintable data into readable data, so that grep works again?
grep -a
It can't get simpler than that.
One way is to simply treat binary files as text anyway, with grep --text but this may well result in binary information being sent to your terminal. That's not really a good idea if you're running a terminal that interprets the output stream (such as VT/DEC or many others).
Alternatively, you can send your file through tr with the following command:
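tr '\000-\011\013-\037\177-\377' '.' <test.log | grep whatever
This will change anything less than a space character (except newline) and anything greater than 126 into a . character, leaving only the printables. The set is written without surrounding square brackets because tr would treat them as literal [ and ] characters and translate those too.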
If you want every "illegal" character replaced by a different one, you can use something like the following C program, a classic standard input filter:
#include <stdio.h>
int main (void) {
    int ch;
    while ((ch = getchar()) != EOF) {
        if ((ch == '\n') || ((ch >= ' ') && (ch <= '~'))) {
            putchar (ch);
        } else {
            printf ("{{%02x}}", ch);
        }
    }
    return 0;
}
This will give you {{NN}}, where NN is the hex code for the character. You can simply adjust the printf for whatever style of output you want.
You can see that program in action here:
pax$ printf 'Hello,\tBob\nGoodbye, Bob\n' | ./filterProg
Hello,{{09}}Bob
Goodbye, Bob
You could run the data file through cat -v, e.g.
$ cat -v tmp/test.log | grep re
line1 re ^@^M
line3 re^M
which could be then further post-processed to remove the junk; this is most analogous to your query about using tr for the task.
-v simply tells cat to display non-printing characters.
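A quick way to see the caret notation in isolation, as a sketch: the sample bytes are built with printf so the result does not depend on which shell's echo is in use. The variable name visible is arbitrary.

```shell
# Feed cat -v one line containing a NUL byte and a carriage return.
# cat -v renders control bytes in caret notation: NUL as ^@, CR as ^M.
visible=$(printf 'line1 re \000\r\n' | cat -v)
echo "$visible"    # line1 re ^@^M
```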
You can use "strings" to extract strings from a binary file, for example
strings binary.file | grep foo
You can force grep to look at binary files with:
grep --binary-files=text
You might also want to add -o (--only-matching) so you don't get tons of binary gibberish that will bork your terminal.
Starting with Grep 2.21, binary files are treated differently:
When searching binary data, grep now may treat non-text bytes as line
terminators. This can boost performance significantly.
So what happens now is that with binary data, all non-text bytes
(including newlines) are treated as line terminators. If you want to change this
behavior, you can:
use --text. This will ensure that only newlines are line terminators
use --null-data. This will ensure that only null bytes are line terminators
grep -a will force grep to search and output from a file that grep thinks is binary.
grep -a re test.log
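A self-contained sketch of the grep -a approach on data shaped like the question's test.log. It writes to a temporary file instead of test.log, and the variable names tmp and match_count are assumptions made for the example.

```shell
# Recreate the sample data: three CRLF-terminated lines, the first one
# containing a NUL byte, which is what makes grep classify the file as binary.
tmp=$(mktemp)
printf 'line1 re \000\r\nline2\r\nline3 re\r\n' > "$tmp"

# -a (--text) makes grep treat the file as text; -c counts matching lines.
match_count=$(grep -a -c 're' "$tmp")
echo "$match_count"    # 2

# Print the matching lines, dropping the NUL and CR bytes so that nothing
# unexpected reaches the terminal.
grep -a 're' "$tmp" | tr -d '\000\r'

rm -f "$tmp"
```

Filtering the matches through tr after grep, rather than before, keeps the search itself unchanged and only cleans what is displayed.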
As James Selvakumar already said, grep -a does the trick: -a (or --text) forces grep to handle the input stream as text.
See Manpage http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
try
cat test.log | grep -a somestring
You can do
strings test.log | grep -i somestring
strings extracts the readable text from the file and passes it to grep.
Here's what I used on a system that didn't have the strings command installed:
cat yourfilename | tr -cd '[:print:]\n'
This prints the text and removes unprintable characters in one fell swoop, unlike cat -v filename, which requires some post-processing to remove unwanted stuff. The \n in the set keeps the line breaks, which tr would otherwise delete as well. Note that some of the binary data may be printable, so you'll still get some gibberish between the good stuff. I think strings removes this gibberish too, if you can use that.
You can also try Word Extractor tool. Word Extractor can be used with any file in your computer to separate the strings that contain human text / words from binary code (exe applications, DLLs).

Removing Unicode Line Separator "U+2028" in Bash

I have a text file with a unicode line separator (hex code 2028).
I want to remove it using bash (I see implementations for Python, but not for this language). What command could I use to transform the text file (output4.txt) to lose the unicode line separator?
See in vim below: (screenshot not included)
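A tr command such as the following is not suitable here, because tr translates individual bytes (and GNU tr does not interpret \x escapes) instead of matching the three-byte UTF-8 sequence as a unit:
tr '\xE2\x80\xA8' ' ' < inFile > outFile
Working solution (thanks to OP for finding this), which relies on bash's $'...' quoting to produce the raw bytes:
sed -i.old $'s/\xE2\x80\xA8/ /g' inFile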
I noticed that in your screenshot you already have the file open in vim, so why not just do the substitution there?
In vim you could do
:%s/(seebelow)//g
For the (seebelow) part, type:
Ctrl-V u2028
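You can probably use sed. Assuming GNU sed and a UTF-8 encoded file, in which U+2028 is stored as the three bytes E2 80 A8:
sed 's/\xE2\x80\xA8//g' <file_in.txt >file_out.txt
To overwrite the original file:
sed -i 's/\xE2\x80\xA8//g' file.txt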
Edit: (See chepner's comment) You should make sure that you have the correct bytes, depending on the encoding, and then use sed to delete them. You could use e.g. od -t x1 for looking at the hex dump and figuring out the encoding.
This worked for me
sed $'s/\u2028//g' file_in.txt > file_out.txt
Note: other questions use the term <U+2028>
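For shells without $'...' quoting, the same substitution can be sketched with printf building the bytes. This assumes a UTF-8 encoded file; the names line_sep and replace_line_sep are hypothetical.

```shell
# Build the three UTF-8 bytes of U+2028 (E2 80 A8 in hex) with printf's
# octal escapes, which any POSIX shell understands.
line_sep=$(printf '\342\200\250')

# Hypothetical helper: replace each U+2028 with a single space.
replace_line_sep() {
    sed "s/${line_sep}/ /g"
}

# Inspect the bytes first with a hex dump, then run the substitution.
printf 'a\342\200\250b\n' | od -An -t x1
printf 'a\342\200\250b\n' | replace_line_sep    # prints: a b
```

For a real file this would be run as replace_line_sep < file_in.txt > file_out.txt.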
