sed command to remove invalid xml character not working - bash

I am really new to regex, and I was following other StackOverflow answers to build a sed command that removes invalid XML characters.
sed -ie 's/[^\u0009\r\n\u0020-\uD7FF\uE000-\uFFFD\ud800\udc00-\udbff\udfff]//g' myfile.xml
When I run this, it deletes a bunch of letters. For example, in the word "company" it deletes o, m, p, a, y, etc., mostly lowercase letters.
Either something is wrong with my regex, or sed isn't treating it as a regex. Could you please help me? Thank you.

Related

sed to remove section of text from a variable

So I think I've cracked the regex but just can't crack how to get sed to make the changes. I have a variable which is this:
MAKEVAR = EPICS_BASE=$CI_PROJECT_DIR/3.16/base IPAC=$CI_PROJECT_DIR/3.16/support/ipac SNCSEQ=$CI_PROJECT_DIR/3.16/support/seq
(All one line). But I want to delete the particular section defining IPAC so my regex looks like this:
(IPAC.+\s)
I know from using this tool that that should be correct:
https://www.regextester.com/98103
However when I run different iterations of trying out sed like:
sed 's/(IPAC.+\s)/\&/g' <<< "$MAKEVAR"
And then echo out MAKEVAR, the IPAC section still exists.
How can I update a particular section of text in a shell variable to remove a section beginning with IPAC up until the next space?
Thanks in advance
regextester (or any other online tool) is a great way to verify that a regexp works in that online tool. Unfortunately, that doesn't mean it will work in any given command-line tool. In particular, your regexp includes \s, which is specific to PCREs and some GNU tools. It also uses (...) to delineate capture groups, but that syntax is only used in EREs and PCREs, not in the BREs that sed supports by default, where you'd have to use \(...\). Finally, your replacement text is \&, which tells sed to replace the string that matches the regexp with a literal &, when in fact you just want to remove it.
This is how to do what I think you're trying to do using any sed:
$ sed 's/IPAC[^ ]* //' <<< "$MAKEVAR"
EPICS_BASE=$CI_PROJECT_DIR/3.16/base SNCSEQ=$CI_PROJECT_DIR/3.16/support/seq
Nevermind, found a workaround:
MAKEVAR=$(sed -E 's/(IPAC.+ipac)//' <<<"$MAKEVAR")
Use a shorter version:
MAKEVAR=$(sed 's/IPAC.*ipac//' <<< "$MAKEVAR")
IPAC.*ipac matches everything from the first IPAC to the last ipac on the line, and the matched text is removed.
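A minimal sketch comparing the two substitutions above, assuming a POSIX shell and a hypothetical literal value for MAKEVAR (single-quoted so that $CI_PROJECT_DIR is not expanded). sed never changes the variable itself, so its output has to be captured and assigned. printf with a pipe is used instead of <<< only so that the sketch also runs in shells without here-strings; trimmed and greedy are illustrative names.

```shell
# Hypothetical sample value; single quotes keep $CI_PROJECT_DIR literal.
MAKEVAR='EPICS_BASE=$CI_PROJECT_DIR/3.16/base IPAC=$CI_PROJECT_DIR/3.16/support/ipac SNCSEQ=$CI_PROJECT_DIR/3.16/support/seq'

# sed only writes to stdout, so capture the result with $(...).
# Remove "IPAC", everything up to the next space, and that space.
trimmed=$(printf '%s\n' "$MAKEVAR" | sed 's/IPAC[^ ]* //')

# Greedy variant: remove from the first "IPAC" to the last "ipac".
greedy=$(printf '%s\n' "$MAKEVAR" | sed 's/IPAC.*ipac//')

printf '%s\n' "$trimmed"
printf '%s\n' "$greedy"
```

The greedy form leaves both surrounding spaces in place, while the [^ ]* form consumes the trailing one. The [^ ]* form only matches when the IPAC entry is followed by a space, so it would need adjusting if IPAC were the last word.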

Command garbled - solaris sed doesn't like my regex

I struggle with regexes at the best of times, but having honed this one on a regex test site, I can see it should work. However, when I put it into sed on Solaris, it gives me a "command garbled" error:
cat p.csv | sed -e 's/(([^,]+,){8})([^,]+)(,.*$)/\3/g'
I can't understand what is wrong with this. If I use xxx instead of the capture group I just get the full input, which makes even less sense to me!
My regex is supposed to allow me to extract a column of a csv file - I have reasons for wanting to use sed and regex.
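One likely cause, offered as an assumption since the exact sed version is not stated: the default Solaris sed only understands basic regular expressions (BREs), where ( ) and { } must be backslash-escaped and + is not an operator, so \3 refers to a group that does not exist. Below is a sketch of the same expression rewritten as a BRE, using a hypothetical 11-column row and an illustrative variable name, ninth.

```shell
# Hypothetical CSV row with 11 columns.
row='c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11'

# BRE version: groups and intervals are escaped, and [^,][^,]* stands in
# for [^,]+ because + is not an operator in BREs.
# Group 1 = first eight columns, group 2 = inner repeat, group 3 = ninth column.
ninth=$(printf '%s\n' "$row" |
  sed -e 's/\(\([^,][^,]*,\)\{8\}\)\([^,][^,]*\),.*$/\3/')

printf '%s\n' "$ninth"
```

Because the whole line is matched and replaced with group 3, only the ninth column is printed.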

Command grouping in sed

I do not understand the command grouping in sed scripts. We use curly braces to group commands. I found some information in the first answer to the following question: Using multiple sed commands. But I still do not understand this properly. Could someone please explain this to me?
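If you use
/Number/ s/N/n/;s/r//
then the address /Number/ applies only to the first command, so s/r// runs on every line and removes the first r from all lines, not only those containing Number. But if you use
/Number/{s/N/n/;s/r//;}
then both commands are grouped under the address, so the r is removed only from lines containing Number.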
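A small sketch of the difference, run on two made-up input lines; the variable names input, ungrouped and grouped are illustrative only.

```shell
# Two sample lines: only the first contains "Number".
input='Number three
Word three'

# No braces: /Number/ guards only s/N/n/; s/r// runs on every line.
ungrouped=$(printf '%s\n' "$input" | sed '/Number/ s/N/n/;s/r//')

# Braces: both commands run only on lines matching /Number/.
grouped=$(printf '%s\n' "$input" | sed '/Number/{s/N/n/;s/r//;}')

printf '%s\n' "$ungrouped"
printf '%s\n' "$grouped"
```

In the ungrouped run the second line loses its first r ("Wod three"), while in the grouped run it is left unchanged.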

How to insert a line just after the first regex match with sed?

I have a few CmakeLists.txt files and I would like to insert another include right after a known include. So, here's what I've got:
include_directories(src include)
And, here's what I would like to end up with
include_directories(src include)
include_directories("${CMAKE_INSTALL_PREFIX}/include")
Any ideas on the best way to do this? I'm assuming sed would make the most sense, but I'm open to alternatives.
[edit] Found a duplicate question.
You can use this simple sed command with inline editing:
sed -i.bak '/include_directories(src include)/a\
include_directories("${CMAKE_INSTALL_PREFIX}/include")
' CmakeLists.txt
This uses the a command, which appends a new line after each line that matches the pattern.
-i.bak enables in-place editing of the input file and keeps a backup with the .bak suffix.
If you're not satisfied with the answer to the duplicate question, you could try this:
sed '/include_directories(src include)/s/$/\
include_directories(\"${CMAKE_INSTALL_PREFIX}\/include\")/' filename
This might or might not work, depending on your shell. If it doesn't, the lesson is to attempt simple things first, then build up.
sed is an excellent tool for simple substitutions on a single line but for anything else you're better off with awk:
awk '{print} /include_directories\(src include\)/{print "include_directories(\"${CMAKE_INSTALL_PREFIX}/include\")"}' file
Another awk variation:
awk '/include_directories\(src include\)/{$0=$0 "\ninclude_directories(\"${CMAKE_INSTALL_PREFIX}/include\")"}8' file
If the pattern is found, a newline and the new text are appended to the current line. The trailing 8 is an always-true condition, so every line is then printed.
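The commands above add the new line after every matching line. Below is a sketch of an awk variant that stops after the first match, as the question title asks. The sample input and the names cmake_src, new_line, new and done are illustrative only. index() does a plain substring search, which avoids having to escape the parentheses.

```shell
# Hypothetical file contents with the include appearing twice.
cmake_src='project(demo)
include_directories(src include)
add_executable(demo main.c)
include_directories(src include)'

# Single quotes keep ${CMAKE_INSTALL_PREFIX} literal for CMake to expand later.
new_line='include_directories("${CMAKE_INSTALL_PREFIX}/include")'

# Print every line; after the first line containing the known include,
# print the new line once and set a flag so later matches are skipped.
result=$(printf '%s\n' "$cmake_src" | awk -v new="$new_line" '
  { print }
  !done && index($0, "include_directories(src include)") { print new; done = 1 }
')

printf '%s\n' "$result"
```

To apply this to a real file, the printf pipeline would be replaced by the file name as awk's argument, with the output redirected to a temporary file.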

search a pattern in each line and append it at the end of that line

I have a file with the following entries:
folder1/a_b.csv folder1/generated/
folder2/folder3/a_b1.csv folder12/generated/
folder4/b_c.csv folder123/generated/
folder5/d.csv folder1/new_folder/generated/
folder6/12.csv folder/anotherfolder/morefolder/evenmorefolder/generated/
I want to copy the csv file name from each line, paste them at the end of that line and append it with ".org". Hence, the changed file would look like
folder1/a_b.csv folder1/generated/a_b.csv.org
folder2/folder3/a_b1.csv folder12/generated/a_b1.csv.org
folder4/b_c.csv folder123/generated/b_c.csv.org
folder5/d.csv folder1/new_folder/generated/d.csv.org
folder6/12.csv folder/anotherfolder/morefolder/evenmorefolder/generated/12.csv.org
Basically, I am looking for a command in vim or sed using which I can search a pattern in each line and append it at the end of that line. Is it possible?
Thanks in advance.
Vim
Here's how to do this in Vim:
:%s/\([^/]*\.csv\)\( .*\)/&\1.org/
This global (:%) substitution matches the filename (a run of characters other than /, ending in .csv) and captures it with \(...\). It then matches the rest of the line and captures that, too.
As a replacement, first keep the original match & (or \0), then append the first capture (\1) with the additional suffix.
sed
Though the regular expression syntax is somewhat different than in Vim, the identical expression can be used with sed:
sed -e 's/\([^/]*\.csv\)\( .*\)/&\1.org/' input
Alternatives
It looks like you want to do file renaming in batches. On Linux, the mmv command-line tool is well suited for that; you'll probably find many similar tools on the web, too.
This might work for you (GNU sed):
sed -r 's|([^/ ]*) .*|&\1.org|' file
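A quick check of the basic-regex sed expression from the answer above against a few of the sample entries, including the nested-folder case. The variable names entries and appended are illustrative.

```shell
# Three of the sample lines from the question.
entries='folder1/a_b.csv folder1/generated/
folder2/folder3/a_b1.csv folder12/generated/
folder6/12.csv folder/anotherfolder/morefolder/evenmorefolder/generated/'

# \1 captures the file name (no slashes, ending in .csv); & is the whole
# match, so the replacement keeps the line and appends "<name>.org".
appended=$(printf '%s\n' "$entries" |
  sed -e 's/\([^/]*\.csv\)\( .*\)/&\1.org/')

printf '%s\n' "$appended"
```

Since [^/]* cannot cross a slash, the capture always starts after the last / of the first field, so nested source folders are not copied to the end of the line.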

Resources