How do I remove special symbols such as & from the file? - bash

I've been trying to clean up my huge xml file (> 6gb) with tr util. The goal is to get rid of all invalid characters and also to get rid of such things as &nbsp;, &amp;, &gt; and etc.
Here is my current implementation:
cat input.xml | tr -dc '[:print:]' > output.xml
But it only removes invalid characters. Do you have any suggestions how to achieve it with tr util?

tr probably won't work
tr is only for replacing individual characters or character classes. Your examples &nbsp;, &amp;, and &gt; are strings. We'll need another tool.
Here's an example with perl
$ cat input.xml
<xml><tag>&nbsp;hello&amp;, &gt;world!</tag></xml>
$ cat input.xml | perl -p -e 's/&.*?;//g'
<xml><tag>hello, world!</tag></xml>
Explanation:
perl -p -e 's/&.*?;//g'
perl -------------------- Run a perl program
-p ----------------- Sets up a loop around our program
-e -------------- Use what comes next as a line of our program
's/&.*?;//g' - Our program, which is a perl regular expression.
- Explanation below:
' ------------ Quotes prevent shell expansion/interpolation.
s ----------- Start a string substitution.
/ ---------- Use '/' as the command separator.
& --------- Matches literal ampersand (&),
. -------- followed by any character (.),
*? ------ any number of times, as few as possible (non-greedy),
; ----- up to the next literal semicolon (;).
// --- Replaces the matching text with the characters between the slashes (i.e. nothing at all)
g -- Allows matching the pattern multiple times per line
' - Quotes prevent shell expansion/interpolation
Note that I'm assuming a pattern of [AMPERSAND(&), SOMETHING, SEMICOLON(;)] based on the example strings you provided.
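To see what the non-greedy ? buys us, compare the two variants on a made-up line containing two entities:
$ echo 'A&amp;B&gt;C' | perl -p -e 's/&.*?;//g'
ABC
$ echo 'A&amp;B&gt;C' | perl -p -e 's/&.*;//g'
AC
Without the ?, the .* greedily runs from the first & to the last ; and takes the B with it.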
You could extend that program to also remove your invalid characters, but I'd just continue to use tr for that. It's faster at least on my system.
So putting it all together you get
cat input.xml | perl -p -e 's/&.*?;//g' | tr -dc '[:print:]' > output.xml
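If you'd rather avoid perl, a roughly equivalent sed-only pipeline should also work (a sketch, assuming every entity-like sequence ends with a semicolon, as above):
sed 's/&[^;]*;//g' input.xml | tr -dc '[:print:]' > output.xml
Here [^;]* plays the same role as the non-greedy .*? in the perl version: it cannot run past the first semicolon.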

Open the file in Notepad++ and use the Replace option.

A character escape is a way of representing a character in source code using only ASCII characters. In HTML you can escape the euro sign € in the following ways.
Format     Name
&#x20AC;   hexadecimal numeric character reference
&#8364;    decimal numeric character reference
&euro;     named character reference
In CSS syntax you would use one of the following.
Format Notes
\20AC must be followed by a space if the next character is one of a-f, A-F, 0-9
\0020AC must be 6 digits long, no space needed (but can be included)
A trailing space is treated as part of the escape, so use 2 spaces if you actually want to follow the escaped character with space. If using escapes in CSS identifiers, see the additional rules below.
Because you should use UTF-8 for the character encoding of the page, you won't normally need to use character escapes. You may, however, find them useful to represent invisible or ambiguous characters or characters that would otherwise interact in undesirable ways with the surrounding source code or text.

Related

UNIX change all the file extension for a list of files

I am a total beginner in this area so sorry if it is a dumb question.
In my shell script I have a variable named FILES, which holds the path to log files, like that:
FILES="./First.log ./Second.log logs/Third.log"
and I want to create a new variable with the same files but different extension, like that:
NEW_FILES="./First.txt ./Second.txt logs/Third.txt"
So I run this command:
NEW_FILES=$(echo "$FILES" | tr ".log" ".txt")
But I get this output:
NEW_FILES="./First.txt ./Secxnd.txt txts/Third.txt"
# ^^^
I understand the . character is a special character, but I don't know how I can escape it. I have already tried to add a \ before the period but to no avail.
tr replaces characters with other characters. When you write tr .log .txt it replaces . with ., l with t, o with x, and g with t.
To perform string replacement you can use sed 's/pattern/replacement/g', where s means substitute and g means globally (i.e., replace multiple times per line).
NEW_FILES=$(echo "$FILES" | sed 's/\.log/.txt/g')
You could also perform this replacement directly in the shell without any external tools.
NEW_FILES=${FILES//\.log/.txt}
The syntax is similar to sed, but the pattern here is a shell glob rather than a regular expression, and the two slashes indicate a global replacement; with a single slash only the first match would be replaced.
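For example, with the FILES value from the question:
$ FILES="./First.log ./Second.log logs/Third.log"
$ echo "${FILES//\.log/.txt}"
./First.txt ./Second.txt logs/Third.txt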
tr is not the tool you need. The goal of tr is to change characters on a 1-by-1 basis. You probably did not see it, but Second must have been changed to Secxnd.
I think sed is better.
NEW_FILES=$(sed 's/\.log/.txt/g' <<< $FILES)
It searches for the \.log regular expression and replaces it with the .txt string. Please note the \. in the regex: it matches the literal dot character . and nothing else, whereas an unescaped . would match any character.
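To see why that escaping matters, here is a small made-up example; with an unescaped dot, .log also matches "alog" inside "catalog":
$ echo "catalog.log" | sed 's/.log/.txt/g'
cat.txt.txt
$ echo "catalog.log" | sed 's/\.log/.txt/g'
catalog.txt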

Remove junk characters from a utf-8 file in Unix

I'm getting junk chars (<9f>, <9d>, <9d> etc.), CNTRL chars (^Z, ^M etc.) and NULL chars (^#) in a file. However, I was able to remove the CNTRL and NULL chars from the file but couldn't eliminate the junk characters. Could anyone suggest a way to remove these junk chars?
Control characters are being removed using the following command:
sed 's/\x1a//g;s/\xef\xbf\xbd//g'
Null characters are removed using the below command
tr -d '\000'
Also, please suggest a single command to remove all the above mentioned 3 types of garbled characters.
Thanks in Advance
Strip "unusual" unicode characters
In the comments you mention that you want to block out control characters while keeping the Greek characters, so the tr solution below does not suit. One option is sed, which offers Unicode support and whose [[:alpha:]] class also matches alphabetical characters outside ASCII. You first need to set LC_CTYPE to specify which characters fall into the [[:alpha:]] range. For German with umlauts, that's e.g.
LC_CTYPE=de_DE.UTF-8
Then you can use sed to strip out everything which is not a letter or punctuation:
sed 's/[^[:alpha:];\ -#]//g' < junk.txt
What \ -# does: it matches all characters in the ASCII range between space and # (see an ASCII table). sed has a [[:punct:]] class, but unfortunately it also matches a lot of junk, so \ -# is needed.
You may need to play around a little with LC_CTYPE; with it set to a plain UTF-8 locale I could match Greek characters, but not Japanese.
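Both steps can be combined into a single invocation, for instance (a sketch: it assumes the de_DE.UTF-8 locale is actually installed, and clean.txt is just a placeholder output name):
LC_CTYPE=de_DE.UTF-8 sed 's/[^[:alpha:];\ -#]//g' < junk.txt > clean.txt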
If you only care about ascii
If you only care about regular ASCII characters you can use tr. First convert the file to a "one byte per character" encoding, e.g. with iconv, since tr does not understand multibyte characters.
Then, I'd advise you use a whitelist approach (as opposed to the blacklist approach you have in your question) as it's a lot easier to state what you want to keep, than what you want to filter out.
This command should do it:
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
This pipeline first converts to latin1 (a single byte per character); the -c flag makes iconv silently drop any character it cannot represent. The tr stage then deletes every byte outside the whitelist \11\12\40-\176, so everything above codepoint 127 disappears as well (be aware that this also strips things like umlauts or other special characters in your language which you might want to keep!). The numbers are octal; have a look at e.g. an ASCII table: \11 is tab, \12 is newline (line feed), and \40-\176 is the range of characters commonly considered "normal" (space through ~).
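A quick sanity check with a made-up sample string (the accented character and the control character \032 are removed, plain ASCII survives):
$ printf 'héllo\032world\n' | iconv -c -f utf-8 -t latin1 | tr -cd '\11\12\40-\176'
hlloworld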

Replace All first 4 spaces with a tab

I am doing some documentation work, and I have a tree structure like this:
A
    BB
    C C
        DD
How can I replace just all the occurrences of 2 spaces in the head of the line with '-', like:
A
--BB
--C C
----DD
I have tried sed 's/  /-/g', but this replaces all occurrences of 2 spaces; also sed 's/^  /-/g', this just replaces the first occurrence of 2 spaces. How can I do this?
The regular expression for four spaces at beginning of line is /^    / where I put the slashes just to demarcate the expression (they are not part of the actual regular expression, but they are used as delimiters by sed).
sed 's/^    /\t/' file
In recent sed versions, you can add an -i option to modify file in-place (that is, sed will replace the file with the modified file); on *BSD (including OSX), you need -i '' with an empty option argument.
The \t escape code for tab is also not universally supported; if that is a problem, your shell probably allows you to type a literal tab by prefixing it with ctrl-V.
(Your question title says "tab" but your question asks about dashes. To replace with two dashes, replace \t in the replacement part of the script with --, obviously.)
If you are trying to generalize to "any groups of two spaces at beginning of line should be replaced by a dash", this is not impossible to do in sed, but I would recommend Perl instead:
perl -pe 's%^((?:  )+)% "-" x (length($1) / 2)%e' file
This captures the match into $1; the inner parenthesized expression matches two spaces and the + quantifier says to match that as many times as possible. The /e flag allows us to use Perl code in the replacement; this piece of code repeats the character "-" as many times as the captured expression was repeated, which is conveniently equal to half its length.
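For illustration, here is the one-liner run on the sample tree from the question (fed via printf here instead of a file):
$ printf 'A\n    BB\n    C C\n        DD\n' | perl -pe 's%^((?:  )+)% "-" x (length($1) / 2)%e'
A
--BB
--C C
----DD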

Dynamic delimiter in Unix

Input:-
echo "1234ABC89,234" # A
echo "0520001DEF78,66" # B
echo "46545455KRJ21,00"
From the above strings, I need to split the characters to get the alphabetic field and the number after that.
From "1234ABC89,234", the output should be:
ABC
89,234
From "0520001DEF78,66", the output should be:
DEF
78,66
I have many strings that I need to split like this.
Here is my script so far:
echo "1234ABC89,234" | cut -d',' -f1
but it gives me 1234ABC89 which isn't what I want.
Assuming that you want to discard leading digits only, and that the letters will be all upper case, the following should work:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\n\2/'
This works fine with GNU sed (I have 4.2.2), but other sed implementations might not like the \n, in which case you'll need to substitute something else.
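If your sed does not accept \n in the replacement, one portable workaround (a sketch) is to embed a literal newline, escaped with a backslash:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\
\2/'
ABC
89,234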
Depending on the version of sed you can try:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1\n\2/'
or:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1$\2/' | tr '$' '\n'
DEF
78,66
Explanation: the regular expression rewrites the input into the expected output, except that instead of the newline it puts a "$" sign, which we then replace with a newline using the tr command.
Where do the strings come from? Are they read from a file (or other source external to the script), or are they stored in the script? If they're in the script, you should simply reformat the data so it is easier to manage. Therefore, it is sensible to assume they come from an external data source such as a file, or that they are piped to the script.
You could simply feed the data through sed:
sed 's/^[0-9]*\([A-Z]*\)/\1 /' |
while read alpha number
do
…process the two fields…
done
The only trick to watch there is that if you set variables in the loop, they won't necessarily be visible to the script after the done. There are ways around that problem — some of which depend on which shell you use. This much is the same in any derivative of the Bourne shell.
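A runnable sketch of that pattern, assuming the strings live in a file called input.txt and that "processing" simply means printing the two fields (both assumptions for the sake of the example):
sed 's/^[0-9]*\([A-Z]*\)/\1 /' input.txt |
while read -r alpha number
do
    # hypothetical processing: just show what landed in each field
    printf 'letters=%s number=%s\n' "$alpha" "$number"
done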
You said you have many strings like this, so I recommend, if possible, saving them to a file such as input.txt:
1234ABC89,234
0520001DEF78,66
46545455KRJ21,00
On your command line, try this sed command reading input.txt as file argument:
$ sed -E 's/([[:digit:]]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt
ABC 89,234
DEF 78,66
KRJ 21,00
How it works
uses -E for extended regular expressions to save on typing; otherwise, for grouping, we would have to escape the parentheses as \( and \)
uses ( and ) for grouping and captures three groups:
the first matches digits; + specifies one or more of them. Oddly, using [0-9] results in an extra blank space in the results, so use the POSIX class [[:digit:]]
the second searches for POSIX alphabetical characters, regardless of lowercase or uppercase, and {3} specifies to match exactly 3 of them
the last group matches ., meaning any character, with + for one or more times
\2\t\3 then returns group 2 and group 3, with a tab separator
Thus you are able to extract two separate fields per line, just separated by tab, for easier manipulation later.

With Sed how can I replace a random word without removing data on the same line

I am trying to use sed replace a string that has a randomly generated alphanumeric. It is prefixed with a fixed word with special characters in it.
{abcd}RandomAlphanumric
I can easily replace the {abcd}, but I don't know how to replace the Random Alphanumeric without removing other words or data on the same line. I am able to accomplish exactly what I need with the following sed command, but this doesn't seem like a safe command to use in all cases. Is there a cleaner way to do this?
sed -e 's/{abcd}.........../new_myvar/g'
This will replace every string that starts with {abcd} followed by any number of alphanumeric characters:
sed -e 's/{abcd}[[:alnum:]]*/new_myvar/g'
[[:alnum:]] matches any alphanumeric character and [[:alnum:]]* matches zero or more of such characters. Because sed is greedy, it will match as many alphanumeric characters as possible.
Example
Consider this test file:
$ cat file
{abcd}RandomAlphanumric
begin {abcd}adfCvr1243C end
Then, our output is:
$ sed -e 's/{abcd}[[:alnum:]]*/new_myvar/g' file
new_myvar
begin new_myvar end
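One more made-up line shows that the match stops at the first non-alphanumeric character, so any trailing data on the line is preserved:
$ echo "foo {abcd}adfCvr1243C-suffix bar" | sed -e 's/{abcd}[[:alnum:]]*/new_myvar/g'
foo new_myvar-suffix bar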
