How to remove control characters in a delimited file? - bash

I am just wondering what is the best way to remove control characters from a delimited file using sed/awk in bash. Thanks.

You can use the character class [:cntrl:] with GNU sed:
sed 's/[[:cntrl:]]//g' file.txt
From here:
‘[:cntrl:]’
Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
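A quick sanity check (assuming GNU sed and a shell whose printf supports \t and \xHH escapes, such as bash):
$ printf 'foo\tbar\x07baz\n' | sed 's/[[:cntrl:]]//g'
foobarbaz
The tab and the BEL character are control characters and are removed; the trailing newline survives because sed strips it from the pattern space before running the script.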

Related

Replace special character with sed

I'm trying to replace a special character with sed; the character is Þ, which I want to replace with ;.
The lines of the file look like this, for example:
0370ÞA020Þ4000011600ÞRED USADOÞ0,00Þ20190414
0370ÞA020Þ4000011601ÞRED USADOÞ0,00Þ20190414
0370ÞA020Þ4000011602ÞRED USADOÞ0,00Þ20190414
Thanks!
Edit
It worked and solved my problem.
Thanks!!!
Try this - a simple substitution works for me:
sed 's/Þ/;/g'
That's the job tr was created to do, but look at these results:
$ tr 'Þ' ';' < file
0370;;A020;;4000011600;;RED USADO;;0,00;;20190414
0370;;A020;;4000011601;;RED USADO;;0,00;;20190414
0370;;A020;;4000011602;;RED USADO;;0,00;;20190414
$ sed 's/Þ/;/g' < file
0370;A020;4000011600;RED USADO;0,00;20190414
0370;A020;4000011601;RED USADO;0,00;20190414
0370;A020;4000011602;RED USADO;0,00;20190414
tr treats every Þ as two characters: in UTF-8, Þ is encoded as the two bytes 0xC3 0x9E, and tr maps a set of bytes to a set of bytes, so each byte becomes its own ;. sed, on the other hand, converts a regexp match to a replacement string, so even if it also sees Þ as 2 characters wide it still produces a single ; per match and does what you want. So just an interesting warning about trying to use tr to replace non-ASCII characters - YMMV!
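You can confirm the byte-level view, assuming a UTF-8 locale:
$ printf 'Þ' | od -An -tx1
 c3 9e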
If your data is in file 'd', try GNU sed:
sed -E 'y/Þ/;/' d

Remove "../" from text file using sed

I have a text file containing text such as this
../path-to-image/folder1/image.jpg path-to-another-image/folder2/image.png
I would like to remove the "../" part and obtain
path-to-image/folder1/image.jpg path-to-another-image/folder2/image.png
I have tried using sed with
sed -i 's#../##g' file.txt
But I obtain the following:
path-to-imafoldeimage.jpg path-to-another-imafoldeimage.png
All the slashes and some other characters were removed and thus the path to my images was broken.
I looked up how to make it match exactly the string using
\<\>
sed 's#\<../\>#%%#g' file.txt
But the output is identical to input. Is there a way to remove "../" using sed? I need this from command line since I have about 10 files with similar path structures which I will copy into a bunch of directories. Meaning I can't do this manually.
The . character has special meaning in regex syntax and needs to be escaped.
Either [.] (creating a character class of size one) or \. will suffice; I strongly advise the former, as it works properly in a wider array of quoting contexts. Thus:
sed -i 's#[.][.]/##g' file.txt
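A quick check on the sample line from the question:
$ printf '../path-to-image/folder1/image.jpg\n' | sed 's#[.][.]/##g'
path-to-image/folder1/image.jpg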
Dots are special characters in regex. They mean any character (except a newline). So you need to escape them with backslashes in the sed command:
sed -i 's#\.\./##g' file.txt
Do sed -i 's/\.\.\///g' file.txt
's/\.\.\///g' replaces ../ with an empty string, following the syntax 's/pattern/replacement/g'.
\.\.\/ escapes the dots and the slash: the dots because they are special characters in regex, and the slash because it would otherwise be read as the delimiter of the s command. The escaped sequence \.\.\/ matches the literal string ../.
The following two slashes surround the replacement string, which is empty in this case.
Edit:
For easier legibility (and to avoid escaping the slash):
sed -i 's#\.\./##g' file.txt. This is much closer to your initial attempt, and as a revised explanation, \.\./ translates to ../, as the slash no longer needs to be escaped. The dots are still special characters and must be escaped with the backslash.

Remove junk characters from a utf-8 file in Unix

I'm getting junk chars (<9f>, <9d>, etc.), control chars (^Z, ^M, etc.) and NUL chars (^@) in a file. I was able to remove the control and NUL chars from the file but couldn't eliminate the junk characters. Could anyone suggest a way to remove these junk chars?
Control characters are being removed using the following command:
sed 's/\x1a//g;s/\xef\xbf\xbd//g'
Null characters are removed using the below command
tr -d '\000'
Also, please suggest a single command to remove all three types of garbled characters mentioned above.
Thanks in advance
Strip "unusual" unicode characters
In the comments you mention that you want to strip control characters while keeping the Greek characters, so the tr solution below does not suit. One option is sed, which offers Unicode support and whose [[:alpha:]] class also matches alphabetic characters outside ASCII. You first need to set LC_CTYPE to specify which characters fall into the [[:alpha:]] range. For German with umlauts, that's e.g.
LC_CTYPE=de_DE.UTF-8
Then you can use sed to strip out everything which is not a letter or punctuation:
sed 's/[^[:alpha:];\ -#]//g' < junk.txt
What \ -# does: it matches all characters in the ASCII range between space and # (see an ASCII table). sed has a [[:punct:]] class, but unfortunately it also matches a lot of junk, so \ -# is needed.
You may need to play around a little with LC_CTYPE; setting it to a UTF-8 locale, I could match Greek characters, but not Japanese.
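For example, with a glibc UTF-8 locale installed (the exact locale name is an assumption; adjust to your system):
$ printf 'abc\x01 αβγ!\n' | LC_CTYPE=en_US.UTF-8 sed 's/[^[:alpha:];\ -#]//g'
abc αβγ!
The control character \x01 is dropped, while the Greek letters survive via [[:alpha:]] and the space and ! survive via the \ -# range.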
If you only care about ascii
If you only care about regular ascii characters you can use tr: First you convert the file to a "one byte per character" encoding, since tr does not understand multibyte characters, e.g. using iconv.
Then, I'd advise you use a whitelist approach (as opposed to the blacklist approach you have in your question) as it's a lot easier to state what you want to keep, than what you want to filter out.
This command should do it:
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
This pipeline does two things. First, iconv converts the input to latin1 (a single byte per char), with -c silently dropping every character that has no latin1 equivalent. Then tr -cd deletes every byte outside the whitelist \11\12\40-\176. The numbers there are octal (have a look at e.g. an ASCII table): \11 is tab, \12 is newline, and \40-\176 is all characters which are commonly considered "normal". Be aware that this also strips things like umlauts or special characters in your language which you might want to keep!
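Putting the two steps together on a sample string containing U+009F, one of the junk characters from the question:
$ printf 'ab\xc2\x9fcd\n' | iconv -c -f utf-8 -t latin1 | tr -cd '\11\12\40-\176'
abcd
iconv turns the two-byte UTF-8 sequence into the single latin1 byte \237, which tr then deletes because it falls outside the whitelist.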

In sed, why can't I use a range of hexadecimal escapes in square brackets?

I'm trying to run:
sed 's/[\xE0-\xEF]/_/g'
but am getting a complaint about an "invalid collation character". What's wrong with my range of characters in the square brackets?
Try to set the LC_ALL environment variable to the C locale (aka the POSIX locale):
LC_ALL=C sed 's/[\xE0-\xEF]/_/g'
In a UTF-8 locale, the bytes \xE0-\xEF are not valid characters on their own, so sed rejects the range as an invalid collation element; in the C locale every byte is a character, so the range is accepted. Note that it works fine even in a UTF-8 locale with standard ASCII ranges: sed 's/[\x41-\x42]/_/g'
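For example (GNU sed; the input bytes here are arbitrary):
$ printf 'A\xE1B\xEFC\n' | LC_ALL=C sed 's/[\xE0-\xEF]/_/g'
A_B_C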
Here's a way with tr:
tr "\340-\357" "_" < input > output
(those are octal values for the hex codes you provided).

Remove non-English and accented characters from a flat file using Unix shell script

I have a file which contains a lot of accented and some wildcard (?, *) characters. How do I replace these characters with spaces in Unix (using sed or a similar utility)? I tried sed but somehow it ignores the accented characters.
Thanks
Using GNU sed, you can replace everything outside a whitelist of octal escapes with a space, for example keeping only tab, newline, carriage return and printable ASCII:
sed 's/[^\o11\o12\o15\o40-\o176]/ /g' inputfile
Note that those are letter "o" rather than digit zero after the backslashes.
This isn't a terribly specific answer, but it should give you a few keywords to search for.
First, the easy bit. It's straightforward to have sed match regexp metacharacters literally. For example:
% echo 'one tw? f*ur' | sed 's/\*/ /'
one tw? f ur
% echo 'one tw? f*ur' | sed 's/[*?]/ /'
one tw  f*ur
%
Handling the non-ASCII characters is messier.
Some sed implementations can handle non-ASCII characters, usually in Unicode files. Some can't. Unfortunately, it may not be obvious from your sed's manpage which kind you have. Life is hard.
One thing you'll have to find out is what encoding the input file is in. A unicode file will be encoded in one or other of UTF-8 or UTF-16 (or possibly one of a couple of less common ones). This isn't the place for an expansion of unicode and encodings, but those are the keywords to scan the manpages for....
Even if you can't find a sed which can handle unicode, then you might be able to use perl, python, or some other scripting language to do the processing -- these generally have regexp engines which can do unicode. The perl -n option creates an implicit loop which might make the transformation you want a one-liner.
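As a rough sketch of that approach, here is a Perl one-liner that replaces every non-ASCII character with a space (assuming UTF-8 input; the file name is a placeholder):
perl -CSD -pe 's/[^\x00-\x7F]/ /g' input.txt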
If your input document is in a different (non-unicode) encoding, such as one of the ISO-8859 ones, then I would guess that the best thing to do would be to convert it to UTF-8 using something like iconv, and proceed from there.
If your accented characters are single-byte you can use tr with character sets to accomplish this. If you can identify a range of characters to match, that's probably easiest. Note that tr takes octal escapes, so the byte range 192-255 is written \300-\377:
tr '\300-\377' ' ' < infile > outfile
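For instance, on a latin1 file (a minimal sketch; in latin1, é is byte 0xE9, octal \351, inside the \300-\377 range):
$ printf 'r\xe9sum\xe9\n' | tr '\300-\377' ' '
r sum
Each accented byte becomes a space, including a trailing one after the m.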
If you're dealing with larger-than-8-bit characters, awk and sed can probably handle it, but you need to make sure your inputs are properly quoted. Try using the decimal or hexadecimal representations instead of the characters themselves.
