I have a scenario where I want to collapse runs of multiple double quotes inside the data into a single double quote. Because the input is comma-delimited and every column is enclosed in double quotes "", I ran into an issue, explained below.
The sample data looks like this:
"int","","123","abd"""sf123","top"
So, the output would be:
"int","","123","abd"sf123","top"
I tried the approach below, but only the first occurrence is handled, and I am not sure what the issue is:
sed -ie 's/,"",/,"NULL",/g;s/""/"/g;s/,"NULL",/,"",/g' inputfile.txt
Step 1: replace every ,"", with ,"NULL",
Step 2: collapse every run of multiple quotes (""" or "" or """") into a single "
Step 3: revert step 1 by replacing ,"NULL", with ,"",
But only the first occurrence gets changed and the rest stays as it was, as shown below.
If the input is:
"int","","","123","abd"""sf123","top"
the output is coming as:
"int","","NULL","123","abd"sf123","top"
But, the output should be:
"int","","","123","abd"sf123","top"
You may try this perl with a lookahead:
perl -pe 's/("")+(?=")//g' file
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
Where input is:
cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
Breakup:
("")+: Match 1+ pairs of double quotes
(?="): If those pairs are followed by a single "
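The lookahead helps because it checks the following " without consuming it. By contrast, a likely reason the original three-step sed misses adjacent empty fields is that matches under the g flag cannot overlap: ,"", consumes the comma that the next empty field would need. A minimal sketch of step 1 on its own, using the second sample line from the question (the variable names are only for illustration):

```shell
# Sample line from the question, with two adjacent empty fields.
line='"int","","","123","abd"""sf123","top"'

# Step 1 of the original command, run by itself.
step1=$(printf '%s\n' "$line" | sed 's/,"",/,"NULL",/g')

# Only one of the two empty fields is protected: the first match has
# already consumed the comma that the second field would need.
printf '%s\n' "$step1"
# prints: "int","NULL","","123","abd"""sf123","top"
```

Because one empty field is left unprotected, the later quote-collapsing step can still alter it.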
Using sed
$ sed -E 's/(,"",)?"+(",)?/\1"\2/g' input_file
"int","","123","abd"sf123","top"
"int","","NULL","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any version of awk.
awk '
BEGIN{ FS=OFS="," }
{
  for(i=1;i<=NF;i++){
    if($i!~/^""$/){
      gsub(/"+/,"\"",$i)
    }
  }
}
1
' Input_file
Explanation: Set the field separator and the output field separator to , for all lines of Input_file. Then loop over each field of the line; if a field is not empty (that is, not exactly ""), globally replace every run of one or more " with a single ". Finally, print the line.
With sed you could use a group to match one or more pairs of "" followed by a single ".
Then, in the replacement, use a single ".
sed -E 's/("")+"/"/g' file
For this content
$ cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
The output is
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
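As a quick check of this expression, the sketch below wraps it in a helper (collapse_quotes is a name made up here) and runs it on a few strings. It also shows a case worth knowing about: the pattern needs pairs plus one more quote, so a run of exactly four quotes is reduced to two rather than one.

```shell
# Hypothetical helper around the sed expression shown above.
collapse_quotes() {
  sed -E 's/("")+"/"/g'
}

# Five quotes in a row (two pairs plus one) collapse to a single quote.
odd=$(printf '%s\n' '"123"""""abcs"' | collapse_quotes)

# Four quotes in a row: one pair plus one quote is replaced, one quote is left over.
even=$(printf '%s\n' 'a""""b' | collapse_quotes)

# An empty field is followed by a comma, not a quote, so it is left alone.
empty=$(printf '%s\n' '"int","","123"' | collapse_quotes)

printf '%s\n' "$odd" "$even" "$empty"
```

If even-length runs can occur inside a field, the awk approach, which collapses any run within a non-empty field, may be the safer choice.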
sed s'#"""#"#' file
That works. I will demonstrate another method though, which you may also find useful in other situations.
#!/bin/sh -x
cat > ed1 <<EOF
4s/"""/"/
wq
EOF
cp file stack
cat stack | tr ',' '\n' > f2
ed -s f2 < ed1
cat f2 | tr '\n' ',' > stack
rm -v ./f2
rm -v ./ed1
The point of this is that if you have a big CSV record all on one line and you want to edit a specific field whose number you know, you can convert all the commas to newlines and use the field number as a line number to substitute within that field, append after it, or insert before it with ed, and then convert the result back to CSV.
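The same field-as-line idea can be sketched without temporary files for a single-line record, using sed's line addressing in place of ed and paste to rejoin the fields. This is an illustrative variant rather than the script above; it assumes that no field contains a comma or a newline and that the target is field 4 of the sample record.

```shell
# Single-line sample record from the question; field 4 holds the triple quote.
record='"int","","123","abd"""sf123","top"'

# 1. tr puts one field on each line.
# 2. sed edits field 4 by addressing line 4.
# 3. paste joins the lines back together with commas.
edited=$(printf '%s\n' "$record" |
  tr ',' '\n' |
  sed '4s/"""/"/' |
  paste -s -d ',' -)

printf '%s\n' "$edited"
# prints: "int","","123","abd"sf123","top"
```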
I have two files. One file contains a pattern that I want to match in a second file. I want to use that pattern to print between that pattern (included) up to a specified character (not included) and then concatenate into a single output file.
For instance,
File_1:
a
c
d
and File_2:
>a
MEEL
>b
MLPK
>c
MEHL
>d
MLWL
>e
MTNH
I have been using variations of this loop:
while read $id;
do
sed -n "/>$id/,/>/{//!p;}" File_2;
done < File_1
hoping to obtain something like the following output:
>a
MEEL
>c
MEHL
>d
MLWL
But have had no such luck. I have played around with grep/fgrep awk and sed and between the three cannot seem to get the right (or any output). Would someone kindly point me in the right direction?
Try:
$ awk -F'>' 'FNR==NR{a[$1]; next} NF==2{f=$2 in a} f' file1 file2
>a
MEEL
>c
MEHL
>d
MLWL
How it works
-F'>'
This sets the field separator to >.
FNR==NR{a[$1]; next}
While reading in the first file, this creates a key in array a for every line in file1.
NF==2{f=$2 in a}
For every line in file 2 that has two fields, this sets variable f to true if the second field is a key in a or false if it is not.
f
If f is true, print the line.
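For anyone who wants to try this without creating the files by hand, here is a self-contained reproduction with the sample data from the question. The temporary directory and the variable names are assumptions of this sketch, and the in test is parenthesized only for readability.

```shell
# Build the two sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' a c d > "$workdir/file1"
printf '%s\n' '>a' MEEL '>b' MLPK '>c' MEHL '>d' MLWL '>e' MTNH > "$workdir/file2"

# First file: remember each id. Second file: on every header line, set the
# flag f according to whether the id is known, then print while f is true.
result=$(awk -F'>' 'FNR==NR{a[$1]; next} NF==2{f=($2 in a)} f' \
  "$workdir/file1" "$workdir/file2")

printf '%s\n' "$result"
rm -rf "$workdir"
```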
A plain (GNU) sed solution. Files are read only once. It is assumed that the lines in File_1 contain no characters that would need escaping in a sed expression.
pat=$(sed ':a; $!{N;ba;}; y/\n/|/' File_1)
sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" File_2
Explanation:
The first call to sed generates a regular expression to be used in the second call to sed and stores it in the variable pat. The aim is to avoid repeatedly reading the entire File_2 for each line of File_1. It just "slurps" File_1 and replaces newline characters with | characters. So the sample File_1 becomes a string with the value a|c|d. The regular expression a|c|d matches if at least one of the alternatives (a, c, d for this example) matches (alternation with | requires extended regular expressions, hence the -E option in the second call).
The second sed expression, ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}", could be converted to pseudo code like this:
begin:
read next line (from File_2) or quit on end-of-file
label_a:
if line begins with `>` followed by one of the alternatives in `pat` then
label_b:
print the line
read next line (from File_2) or quit on end-of-file
if line begins with `>` goto label_a else goto label_b
else goto begin
Let me try to explain why your approach does not work well:
You need to say while read id instead of while read $id.
The sed command />$id/,/>/{//!p;} will exclude the lines which start with >.
Then you might want to say something like:
while read id; do
sed -n "/^>$id/{N;p}" File_2
done < File_1
Output:
>a
MEEL
>c
MEHL
>d
MLWL
But the code above is inefficient because it reads File_2 as many times as the count of the id's in File_1.
Please try the elegant solution by John1024 instead.
If ed is available, and since the shell is involved:
#!/usr/bin/env bash
mapfile -t to_match < file1.txt
ed -s file2.txt <<-EOF
g/\(^>[${to_match[*]}]\)/;/^>/-1p
q
EOF
This runs ed only once, not once per pattern from file1. For example, if file1 contains a to z, ed will not run 26 times.
Requires bash4+ because of mapfile.
How it works
mapfile -t to_match < file1.txt
Saves each line of file1.txt as an element of an array named to_match.
ed -s file2.txt points ed at file2.txt; the -s flag suppresses diagnostics such as the byte count ed normally prints after reading the file (the same number wc -c file reports).
<<-EOF A here document, shell syntax.
g/\(^>[${to_match[*]}]\)/;/^>/-1p
g means search the whole file aka global.
( ) capture group, it needs escaping because ed only supports BRE, basic regular expression.
^> If line starts with a > the ^ is an anchor which means the start.
[ ] is a bracket expression; it matches any single character listed inside it, in this case the contents of the array "${to_match[*]}"
; Include the next address/pattern
/^>/ Match a leading >
-1 step back one line from that match, so the range ends just before the next >.
p print the lines in that range.
q quit ed
I am creating a simple script that converts a custom markup to TeX macros:
? What are four kinds of animals?
- elephants
- tigers
- bears
- fish
e
This becomes:
\QUESTION{What are four kinds of animals?}{
\ANSWER{elephants}
\ANSWER{tigers}
\ANSWER{bears}
\ANSWER{fish}
}
I have used a simple syntax to replace the items at the front:
sed 's#^? #\\QUESTION{#' file > temp1
sed 's#^\- #\\ANSWER{#' temp1 > temp2
sed 's#^e #\}{#' temp2 > temp3
How do I get it to also add the }{ to the end when "?" is found at the beginning, and add } to the end when "-" is found at the beginning of the line?
Match the whole line instead of its beginning, and use a replacement pattern referencing the content of the line:
sed -e 's#^? \(.*\)#\\QUESTION{\1}{#' -e 's#^- \(.*\)#\\ANSWER{\1}#' -e 's#^e#}#'
In this command \(...\) are capturing groups and \1 refers to their content.
I also took the liberty of regrouping your multiple substitutions in a single sed command.
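To see the combined command at work, the sketch below wraps it in a function (convert_markup is a hypothetical name) that reads the markup on standard input. It assumes the closing marker is a line consisting only of e, so that pattern is anchored at both ends here.

```shell
# Hypothetical wrapper: reads the custom markup on stdin, writes TeX macros.
convert_markup() {
  sed -e 's#^? \(.*\)#\\QUESTION{\1}{#' \
      -e 's#^- \(.*\)#\\ANSWER{\1}#' \
      -e 's#^e$#}#'
}

# Shortened sample input from the question.
converted=$(printf '%s\n' \
  '? What are four kinds of animals?' \
  '- elephants' \
  '- tigers' \
  'e' | convert_markup)

printf '%s\n' "$converted"
```

Reading from standard input keeps the function usable in a pipeline, which avoids the temp1, temp2, temp3 files from the question.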
Like this:
sed -E 's/^(\? )(.*)/\\QUESTION{\2}{/;t;s/^- (.*)/\\ANSWER{\1}/;t;s/^e/}/' file
Explanation:
s/^(\? )(.*)/\\QUESTION{\2}{/ Handle lines starting with ?
t means no further actions if the above s command replaced something
s/^- (.*)/\\ANSWER{\1}/ Handle lines starting with -
t means no further actions if the above s command replaced something
s/^e/}/ Handle lines starting with e.
You can "speed it up" a bit by reordering the commands by the complexity of the search pattern, like this:
sed -E 's/^e/}/;t;s/^- (.*)/\\ANSWER{\1}/;t;s/^(\? )(.*)/\\QUESTION{\2}{/' file
But yeah, probably micro-optimization.
You can try this sed too :
sed '/^- /s//\\ANSWER{/;/^e/s///;s/$/}/;/^? /{s//\\QUESTION{/;s/$/{/}' infile
sed '
/^- /s//\\ANSWER{/ # line with -
/^e/s/// # line with e
s/$/}/ # add } at the end of each line
/^? / { # line with ?
s//\\QUESTION{/
s/$/{/
}
' infile
I have a file that contains data like this
word0:secondword0
word1:secondword1
word2:secondword2
word3:secondword3
word4:secon:word4
I'd like to use sed to split that content to give me only the second word after the first colon.
The end result would look like
secondword0
secondword1
secondword2
secondword3
secon:word4
Notice how the last word has a second colon that is part of the word.
How would I write such a script that splits on only the first colon but retains the rest?
The following sed command may help:
sed 's/\([^:]*\):\(.*\)/\2/' Input_file
Output will be as follows.
secondword0
secondword1
secondword2
secondword3
secon:word4
This can be done with gnu grep
grep -Po ':\K.*' <<END
word0:secondword0
word1:secondword1
word2:secondword2
word3:secondword3
word4:secon:word4
END
: matches the first occurrence of :, \K keeps that : out of the match, and .* matches the rest of the line. -P enables Perl-compatible regular expressions (needed for \K) and -o outputs only the matched part.
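Outside of sed and grep, two other common ways to split on the first colon only are cut with an open-ended field range and shell parameter expansion. This is a sketch of alternatives, not part of the answers above; the sample string comes from the question and the variable names are illustrative.

```shell
# The tricky sample line: the second colon belongs to the value.
line='word4:secon:word4'

# cut: print field 2 through the last field, so later colons are kept.
with_cut=$(printf '%s\n' "$line" | cut -d: -f2-)

# Parameter expansion: strip the shortest prefix that ends in a colon.
with_expansion=${line#*:}

printf '%s\n' "$with_cut" "$with_expansion"
# both print: secon:word4
```

The parameter expansion avoids starting an external process, which can matter when it runs once per line inside a shell loop.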
I'm reading from stdin line by line strings like:
<xml version="1.0" encoding="UTF-8">\n<Datanode ....
I need to get rid of that \n , it is not a newline, just a nasty sequence.
I need to read it from a pipe, process it, and pipe it further.
Usually tr or cut helps, but for this sequence I cannot find a way: they either do not remove it, or they remove some other "n"s from the XML string as well.
So you want to remove the string made of '\' followed by 'n' ok?
Something like this should work:
... | sed 's/\\n//' | ...
or this if you want to remove multiple sequences:
... | sed 's/\\n//g' | ...
And, if you want to anchor the sequence to be removed:
... | sed 's/>\\n</></' | ...
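The difference from tr can be seen in a small sketch: tr treats its argument as a set of single characters, while sed matches the two-character sequence. The input below is a shortened, made-up stand-in for the real stream, with the Datanode element closed off for the example.

```shell
# Literal backslash followed by n, not a newline (single quotes keep it as is).
input='<xml version="1.0" encoding="UTF-8">\n<Datanode>'

# tr works on single characters: this deletes every backslash and every n.
with_tr=$(printf '%s\n' "$input" | tr -d '\\n')

# sed removes only the sequence backslash followed by n.
with_sed=$(printf '%s\n' "$input" | sed 's/\\n//g')

printf '%s\n' "$with_tr" "$with_sed"
# prints: <xml versio="1.0" ecodig="UTF-8"><Dataode>
# prints: <xml version="1.0" encoding="UTF-8"><Datanode>
```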
UPDATE
In case you don't want to remove the sequence '\' 'n' but replace it with a real newline (and I did notice your tag osx), you might want to use the following:
... | sed -e 's/\\n/\'$'\n/' | ...
I'm assuming here that your document isn't valid XML on account of containing a text node outside the root, which would explain why you can't use conventional XML-centric tools.
To truly use only bash, and do this in a manner that's safe against corrupting your file (it performs the replacement only for the exact header text, and only at the very start of the input):
correct_xml_header() {
local bad_header correct_header content
bad_header='<xml version="1.0" encoding="UTF-8">\n'
correct_header='<?xml version="1.0" encoding="UTF-8"?>'
IFS= read -r -d '' content
if [[ $content = "$bad_header"* ]]; then
content=${correct_header}${content#"$bad_header"}
fi
printf '%s' "$content"
}
You can then pipe through this function:
generate_bad_xml | correct_xml_header | consume_good_xml
If you want to add a literal newline, add $'\n' to the end of the definition of correct_header, as in:
correct_header='<?xml version="1.0" encoding="UTF-8"?>'$'\n'
Note that I'm also changing <xml ...> to <?xml ...?>, which is a change similarly necessary to make this tool's output parse correctly with XML-compliant tools.