Replacing quotation marks with "``" and "''" - bash

I have a document containing many " marks, and I want to convert it for use in TeX.
TeX uses two ` marks for the opening quotation mark and two ' marks for the closing quotation mark.
I only want to change these when " appears an even number of times on a single line (e.g. there are 2, 4, or 6 "'s on the line). For example:
"This line has 2 quotation marks."
--> ``This line has 2 quotation marks.''
"This line," said the spider, "Has 4 quotation marks."
--> ``This line,'' said the spider, ``Has 4 quotation marks.''
"This line," said the spider, must have a problem, because there are 3 quotation marks."
--> (unchanged)
My sentences never break across lines, so there is no need to check on multiple lines.
There are few quotes with single quotes, so I can manually change those.
How can I convert these?

This is my one-liner, which works for me (it needs GNU awk for gensub):
awk -F\" '{if((NF-1)%2==0){res=$0;for(i=1;i<NF;i++){to="``";if(i%2==0){to="'\'\''"}res=gensub("\"", to, 1, res)};print res}else{print}}' input.txt >output.txt
And here is the long version of this one-liner, with comments:
BEGIN {
    FS = "\"" # set the field separator to a double quote before any input is read
}
{
    if ((NF-1) % 2 == 0) { # if the number of double quotes in the line is even
        res = $0 # save the original line in res
        for (i = 1; i < NF; i++) { # for each double quote
            to = "``" # replace the current occurrence of the double quote with ``
            if (i % 2 == 0) { # if it is a closing quote, replace it with ''
                to = "''"
            }
            # replace the first remaining " in res with to and save the result in res
            res = gensub("\"", to, 1, res)
        }
        print res # print the resulting line
    } else {
        print # print the original line when there is nothing to change
    }
}
You may run this script by:
awk -f replace-quotes.awk input.txt >output.txt
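gensub() is a GNU awk extension. For an awk that lacks it, the same even-count rule can be sketched with the standard sub() and gsub() functions. This is only an illustration: the function name tex_quotes and the variable names lq and rq are made up here and are not part of the answer above.

```shell
#!/usr/bin/env bash
# tex_quotes: hypothetical helper; reads text on stdin, writes converted text to stdout.
tex_quotes() {
  awk -v lq='``' -v rq="''" '
    {
      copy = $0
      n = gsub(/"/, "", copy)          # gsub returns how many double quotes the line holds
      if (n > 0 && n % 2 == 0)         # only touch lines with an even number of quotes
        for (i = 1; i <= n; i++)
          sub(/"/, (i % 2 ? lq : rq))  # odd occurrences open, even occurrences close
      print
    }'
}
```

It could be run as tex_quotes < input.txt > output.txt; lines with an odd number of quotes pass through unchanged.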

Here's my one-liner using repeated sed's:
cat file.txt | sed -e 's/"\([^"]*\)"/`\1`/g' | sed '/"/s/`/\"/g' | sed -e 's/`\([^`]*\)`/``\1'\'''\''/g'
(note: it won't work correctly if there are already back-ticks (`) in the file but otherwise should do the trick)
EDIT:
Removed back-tick bug by simplifying, now works for all cases:
cat file.txt | sed -e 's/"\([^"]*\)"/``\1'\'\''/g' | sed '/"/s/``/"/g' | sed '/"/s/'\'\''/"/g'
With comments:
cat file.txt |                         # read file
sed -e 's/"\([^"]*\)"/``\1'\'\''/g' |  # initial replace
sed '/"/s/``/"/g' |                    # revert `` to " on lines with extra "
sed '/"/s/'\'\''/"/g'                  # revert '' to " on lines with extra "

Using awk
awk '{n=gsub("\"","&")}!(n%2){while(n--){Q=(n%2?"`":q);sub("\"",Q Q)}}1' q=\' in
Explanation
awk '{
n=gsub("\"","&") # set n to the number of quotes in the current line
}
!(n%2){ # if there is an even number of quotes
while(n--){ # as long as we have double quotes left (n is already decremented inside the loop)
Q=(n%2?"`":q) # odd n: opening quote, use a backtick; even n: closing quote, use a single quote
sub("\"",Q Q) # replace the next double quote with two of whatever Q is
}
}1 # print every line, changed or not'
q=\' in # set the q variable to a single quote and pass the file 'in' as input
Using sed
sed '/^\([^"]*"[^"]*"[^"]*\)*$/s/"\([^"]*\)"/``\1'\'\''/g' in

This might work for you:
sed 'h;s/"\([^"]*\)"/``\1'\'\''/g;/"/g' file
Explanation:
Make a copy of the original line: h
Replace pairs of "'s: s/"\([^"]*\)"/``\1'\'\''/g
Check for a leftover (odd) " and, if found, revert to the original line: /"/g
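The three steps can be tried on the sample lines from the question. The following is a small sketch, not a definitive implementation; the wrapper name quote_pairs is hypothetical, and the closing '' is spliced into the sed script from a double-quoted shell string to keep the quoting readable.

```shell
#!/usr/bin/env bash
# quote_pairs: hypothetical wrapper; reads stdin, writes stdout.
quote_pairs() {
  # h     save the untouched line in the hold space
  # s     turn every "..." pair into ``...''
  # /"/g  if a stray " is still present, restore the saved line
  sed -e 'h' -e 's/"\([^"]*\)"/``\1'"''"'/g' -e '/"/g'
}

printf '%s\n' \
  '"This line has 2 quotation marks."' \
  '"This line," said the spider, must have a problem, because there are 3 quotation marks."' |
  quote_pairs
```

The first line comes out with TeX quotes; the second still contains a " after the substitution, so the g command puts the original back.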

Related

How to replace text in file between known start and stop positions with a command line utility like sed or awk?

I have been tinkering with this for a while but can't quite figure it out. A sample line within the file looks like this:
"...~236 characters of data...Y  YYY.  Y...many more characters of data"
How would I use sed or awk to replace spaces with a B character only between positions 236 and 246? In that example string it starts at character 29 and ends at character 39 within the string. I would want to preserve all the text preceding and following the target chunk of data within the line.
For clarification based on the comments, it should be applied to all lines in the file and expected output would be:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
With GNU awk:
$ awk -v FIELDWIDTHS='29 10 *' -v OFS= '{gsub(/ /, "B", $2)} 1' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
FIELDWIDTHS='29 10 *' means 29 characters for first field, next 10 characters for second field and the rest for third field. OFS is set to empty, otherwise you'll get space added between the fields.
With perl:
$ perl -pe 's/^.{29}\K.{10}/$&=~tr| |B|r/e' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
^.{29}\K match and ignore first 29 characters
.{10} match 10 characters
e flag to allow Perl code instead of string in replacement section
$&=~tr| |B|r convert space to B for the matched portion
Use this Perl one-liner with substr and tr. Note that this uses the fact that you can assign to substr, which changes the original string:
perl -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file > out_file
To change the file in-place, use:
perl -i.bak -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
I would use GNU AWK in the following way. For simplicity's sake, say file.txt contains
S o m e s t r i n g
and we want to change the spaces from position 5 (inclusive) to position 10 (inclusive); then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=5;i<=10;i+=1)$i=($i==" "?"B":$i);print}' file.txt
output is
S o mBeBsBt r i n g
Explanation: I set the field pattern (FPAT) to any single character and the output field separator (OFS) to the empty string, so every field holds a single character and no superfluous spaces appear when printing. I use a for loop to access the desired fields; for each one I check whether it is a space, and if it is I assign B, otherwise I keep the original value. Finally I print the whole changed line.
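Where GNU awk is not available, the same head/middle/tail split can be sketched in plain bash with substring expansion. The function name fill_range and its two arguments (1-based, inclusive positions) are assumptions made for this example only.

```shell
#!/usr/bin/env bash
# fill_range FROM TO: hypothetical helper; reads lines on stdin and replaces
# spaces with B between character positions FROM and TO (1-based, inclusive).
fill_range() {
  local from=$1 to=$2 line head mid tail
  while IFS= read -r line || [ -n "$line" ]; do
    head=${line:0:from-1}          # everything before FROM
    mid=${line:from-1:to-from+1}   # the FROM..TO slice
    tail=${line:to}                # everything after TO
    printf '%s%s%s\n' "$head" "${mid// /B}" "$tail"
  done
}

printf 'S o m e s t r i n g\n' | fill_range 5 10
```

For the sample line this prints S o mBeBsBt r i n g, matching the output shown above. A shell loop like this is slower than awk on large files, so it is mainly useful for small inputs.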
Using GNU awk:
awk -v strt=29 -v end=39 '{ ram=substr($0,strt,(end-strt+1));gsub(" ","B",ram);print substr($0,1,(strt-1)) ram substr($0,(end+1)) }' file
Explanation:
awk -v strt=29 -v end=39 '{ # Pass the start and end character positions as strt and end respectively
ram=substr($0,strt,(end-strt+1)); # Extract the 29th to the 39th characters of the line (inclusive) into the variable ram
gsub(" ","B",ram); # Replace spaces with B in ram
print substr($0,1,(strt-1)) ram substr($0,(end+1)) # Rebuild the line incorporating ram and print the result
}' file
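This is certainly a suitable task for perl, and it saddens me that my perl has become so rusty that this is the best I can come up with at the moment (it reads one character at a time, so positions are counted from the start of the file and only the first line is handled):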
perl -e 'local $/=\1;while(<>) { s/ /B/ if $. >= 236 && $. <= 246; print }' input;
Another awk but using FS="":
$ awk 'BEGIN{FS=OFS=""}{for(i=29;i<=39;i++)sub(/ /,"B",$i)}1' file
Output:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
Explained:
$ awk ' # yes awk yes
BEGIN {
FS=OFS="" # set empty field delimiters
}
{
for(i=29;i<=39;i++) # between desired indexes
sub(/ /,"B",$i) # replace space with B
# if($i==" ") # couldve taken this route, too
# $i="B"
}1' file # implicit output
With sed :
sed '
H
s/\(.\{236\}\)\(.\{11\}\).*/\2/
s/ /B/g
H
g
s/\n//g
s/\(.\{236\}\)\(.\{11\}\)\(.*\)\(.\{11\}\)/\1\4\3/
x
s/.*//
x' infile
When you have an input string without \r, you can use:
sed -r 's/(.{236})(.{10})(.*)/\1\r\2\r\3/;:a;s/(\r.*) (.*\r)/\1B\2/;ta;s/\r//g' input
Explanation:
First put \r around the area that you want to change.
Next introduce a label to jump back to.
Next replace a space between 2 markers.
Repeat until all spaces are replaced.
Remove the markers.
In your case, where the length doesn't change, you can do without the markers.
Replace a space after 236..245 characters and try again when it succeeds.
sed -r ':a; s/^(.{236})([^ ]{0,9}) /\1\2B/;ta' input
This might work for you (GNU sed):
sed -E 's/./&\n/245;s//\n&/236;h;y/ /B/;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Divide the problem into 2 lines, one with spaces and one with B's where there were spaces.
Then using pattern matching make a composite line from the two lines.
N.B. The newline can be used as a delimiter as it is guaranteed not to be in sed's pattern space.

How to get a number with variable number of digits from a string in a file using bash script?

I have the following file:
APP_VERSION.ts
export const APP_VERSION = 1;
This is the only content of that file, and the APP_VERSION variable will be incremented as needed.
So, the APP_VERSION could be a single digit number or multiple digit number, like 15 or 999, etc.
I need to use that value in one of my bash scripts.
use-app-version.sh
APP_VERSION=`cat src/constants/APP_VERSION.ts`
echo $APP_VERSION
I know I can read it with cat. But how can I parse that string so I can get exactly the APP_VERSION value, whether it's 1 or 999, for example.
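sed -En 's/(^.*APP_VERSION[^[:digit:]]*)([[:digit:]]+)(;.*$)/\2/p' src/constants/APP_VERSION.ts
Using sed, split the line into three sections delimited by the opening and closing parentheses. Substitute the second section (the version value) for the whole line and print it. The first section stops at the first digit so that a multi-digit value such as 999 is captured whole.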
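The same capture-group idea can be sketched in bash alone with the =~ operator and BASH_REMATCH. The function name read_app_version and the exact pattern are assumptions for illustration; the pattern expects the line to look like the one in the question.

```shell
#!/usr/bin/env bash
# read_app_version FILE: hypothetical helper; prints the number assigned to
# APP_VERSION on the first line of FILE, or returns 1 if the line does not match.
read_app_version() {
  local file=$1 line
  local re='APP_VERSION[[:space:]]*=[[:space:]]*([0-9]+)'
  IFS= read -r line < "$file" || [ -n "$line" ] || return 1
  if [[ $line =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"   # first parenthesised group: the digits
  else
    return 1
  fi
}
```

It could be used as APP_VERSION=$(read_app_version src/constants/APP_VERSION.ts).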
You may use this awk:
app_ver=$(awk -F '[[:blank:];=]+' '$(NF-2) == "APP_VERSION" {print $(NF-1)}' src/constants/APP_VERSION.ts)
echo "$app_ver"
1
You can concat some commands to remove everything else:
APP_VERSION=`cat src/constants/APP_VERSION.ts | awk -F '=' '{print $2}' | tr -d ' ' | tr -d ';'`
1 - cat reads the whole file
2 - awk prints everything after '='
3 - tr removes the space
4 - tr removes the ;
A simple
APP_VERSION=$(grep --text -Eo '[0-9]+' src/constants/APP_VERSION.ts)
should be enough
With bash only:
APP_VERSION=$(cat src/constants/APP_VERSION.ts)
APP_VERSION=${APP_VERSION%;}
APP_VERSION=${APP_VERSION/*= }
Line 2 removes the trailing ';', line 3 removes everything before "= ".
Alternatively, you could set APP_VERSION as an array, take 5th element, and remove trailing ';'.
Or, another solution, using IFS:
IFS='=;' read a APP_VERSION < src/constants/APP_VERSION.ts
In this version, the space will remain before version number.
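One way to avoid the leftover space is to let read split on spaces as well as on = and ;. This is a sketch under the assumption that the line always has the shape export const APP_VERSION = N; and the names parse_version_line, name and value are placeholders invented for the example.

```shell
#!/usr/bin/env bash
# parse_version_line LINE: hypothetical helper; prints the value assigned to APP_VERSION.
parse_version_line() {
  local name value
  # With space, = and ; all in IFS, the words are: export const APP_VERSION N
  # The trailing _ swallows anything left over.
  IFS=' =;' read -r _ _ name value _ <<< "$1"
  [ "$name" = APP_VERSION ] && printf '%s\n' "$value"
}

parse_version_line 'export const APP_VERSION = 15;'
```

To read from the file, the line could be passed as parse_version_line "$(< src/constants/APP_VERSION.ts)".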
Assuming that the task can be rephrased to "extract the digits from a file", there are a few options:
Delete all characters that aren't digits with tr:
version=$(tr -cd '[:digit:]' < infile)
Use grep to match all digits and retain nothing but the match:
version=$(grep -Eo '[[:digit:]]+' infile)
Read file into string and delete all non-digits with just Bash:
contents=$(< infile)
version=${contents//[![:digit:]]}

Make sed to ignore multi-line single or double quoted block

Suppose I have a shell script with the following content:
echo "This is a single-line text"
echo "
Examples: 1
2
3
4
"
Now what I want is to cut out the excess space from the beginning of each line:
I'm no expert with sed, so what I've tried so far is sed -i 's|^ ||' file, but this also matches inside the multi-line quoted block, which I don't want.
sed -i 's|^ ||' file ends up with:
echo "This is a single-line text"
echo "
Examples: 1
2
3
4
"
But I expected it to be like:
echo "This is a single-line text"
echo "
Examples: 1
2
3
4
"
So how can I make sed ignore such a block? I'm okay with an awk-based solution as well.
Thank you.
Assumptions:
the next-to-last line consists of 4 spaces + "; these spaces should not be removed since they are inside the quoted text block
the last line consists solely of 4 spaces and will be trimmed to an empty line
don't have to worry about any edge cases (see KamilCuk's comment)
One awk idea based on keeping track of the number of double quotes (") we encounter:
awk '
/^    / { if ( qtcnt % 2 == 0 ) # if current line starts with 4 spaces and we
# have seen an even number of double quotes
# prior to this line (ie, we are outside
# of a double quoted string) then ...
$0=substr($0,5) # remove the 4 spaces from the current line
}
{ print $0 } # print the current line
{ n=split($0,arr,"\"") # split the current line on double quotes and
# get a count of the number of fields
if ( n >= 1 ) # if number of fields >= 1 (ie, line is not empty;
# n - 1 is the number of double quotes) then ...
qtcnt += n - 1 # increment our quote counter
}
' indent.dat
NOTES:
this will erroneously count double quotes in the following situations ...
escaped double quotes (\")
single-quoted double quotes (awk -F'"' ...)
double quotes that show up in comments (# this is a double quote ("))
If the print line is changed to print "."$0"." (use periods as visual delimiters) the following is generated:
.echo "This is a single-line text".
..
.echo ".
.Examples: 1.
. 2.
. 3.
. 4.
. ".
..
As coded (sans the periods) the following is generated:
echo "This is a single-line text"
echo "
Examples: 1
2
3
4
"
NOTE: the last line is empty/blank
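The quote-parity bookkeeping can also be sketched in plain bash. Note the assumptions: this version strips all leading blanks (not exactly 4 spaces) from lines outside a quoted block, the function name strip_indent is made up, and it shares the caveats listed above about escaped, single-quoted, and commented double quotes.

```shell
#!/usr/bin/env bash
# strip_indent: hypothetical helper; reads a script on stdin and removes leading
# blanks from every line that starts outside a double-quoted block.
strip_indent() {
  local line quotes in_quote=0
  while IFS= read -r line || [ -n "$line" ]; do
    if [ "$in_quote" -eq 0 ]; then
      line=${line#"${line%%[![:blank:]]*}"}   # drop the leading run of blanks
    fi
    printf '%s\n' "$line"
    quotes=${line//[!\"]/}                    # keep only the double quotes
    if [ $(( ${#quotes} % 2 )) -eq 1 ]; then  # odd count flips inside/outside
      in_quote=$(( 1 - in_quote ))
    fi
  done
}
```

It could be run as strip_indent < file > file.new; unlike sed -i it does not edit the file in place.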
With GNU awk for gensub() and RT:
$ cat tst.awk
BEGIN { RS="\""; ORS="" }
NR%2 { $0 = gensub(/(^|\n)[[:blank:]]+/,"\\1","g") }
{ print gensub(/\n[[:blank:]]+$/,"\n",1) RT }
$ awk -f tst.awk file
echo "This is a single-line text"
echo "
Examples: 1
2
3
4
"
or with any POSIX awk:
$ cat tst.awk
BEGIN { RS=ORS="\"" }
NR > 1 { print prev }
NR%2 {
sub(/^[[:blank:]]+/,"")
gsub(/\n[[:blank:]]+/,"\n")
}
!(NR%2) {
sub(/\n[[:blank:]]+$/,"\n")
}
{ prev = $0 }
END { printf "%s", prev }
$ awk -f tst.awk file
echo "This is a single-line text"
echo "
Examples: 1
2
3
4
"
Caveat: any solution will be fragile unless you write a parser for shell language that can understand when " is within strings, within scripts, escaped, etc.

How should I use sed to delete specific strings and allow duplicates with more characters?

I have generated a list of files, and it has 17417 lines like:
./usr
./usr/share
./usr/share/mime-info
./usr/share/mime-info/libreoffice7.0.mime
./usr/share/mime-info/libreoffice7.0.keys
./usr/share/appdata
./usr/share/appdata/libreoffice7.0-writer.appdata.xml
./usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
./usr/share/appdata/libreoffice7.0-draw.appdata.xml
./usr/share/appdata/libreoffice7.0-impress.appdata.xml
./usr/share/appdata/libreoffice7.0-base.appdata.xml
./usr/share/appdata/libreoffice7.0-calc.appdata.xml
./usr/share/applications
./usr/share/applications/libreoffice7.0-xsltfilter.desktop
./usr/share/applications/libreoffice7.0-writer.desktop
./usr/share/applications/libreoffice7.0-base.desktop
./usr/share/applications/libreoffice7.0-math.desktop
./usr/share/applications/libreoffice7.0-startcenter.desktop
./usr/share/applications/libreoffice7.0-calc.desktop
./usr/share/applications/libreoffice7.0-draw.desktop
./usr/share/applications/libreoffice7.0-impress.desktop
./usr/share/icons
./usr/share/icons/gnome
./usr/share/icons/gnome/16x16
./usr/share/icons/gnome/16x16/mimetypes
./usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
The thing is, I want to delete lines like:
./usr
./usr/share
./usr/share/mime-info
./usr/share/appdata
./usr/share/applications
./usr/share/icons
./usr/share/icons/gnome
./usr/share/icons/gnome/16x16
./usr/share/icons/gnome/16x16/mimetypes
and the "." at the start, so the result must look like:
/usr/share/mime-info/libreoffice7.0.mime
/usr/share/mime-info/libreoffice7.0.keys
/usr/share/appdata/libreoffice7.0-writer.appdata.xml
/usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
/usr/share/appdata/libreoffice7.0-draw.appdata.xml
/usr/share/appdata/libreoffice7.0-impress.appdata.xml
/usr/share/appdata/libreoffice7.0-base.appdata.xml
/usr/share/appdata/libreoffice7.0-calc.appdata.xml
/usr/share/applications/libreoffice7.0-xsltfilter.desktop
/usr/share/applications/libreoffice7.0-writer.desktop
/usr/share/applications/libreoffice7.0-base.desktop
/usr/share/applications/libreoffice7.0-math.desktop
/usr/share/applications/libreoffice7.0-startcenter.desktop
/usr/share/applications/libreoffice7.0-calc.desktop
/usr/share/applications/libreoffice7.0-draw.desktop
/usr/share/applications/libreoffice7.0-impress.desktop
/usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
Is this possible using sed, or is it more practical to use another tool?
With your list in the filename list, you could do:
sed -n 's/^[.]//;/\/.*[._].*$/p' list
Where:
sed -n suppresses printing of pattern-space; then
s/^[.]// is the substitution form that simply removes the first character '.' from each line; then
/\/.*[._].*$/p matches lines that contain a '.' or '_' somewhere after a '/' (in this list, only the file names do), with p causing that line to be printed.
Example Use/Output
$ sed -n 's/^[.]//;/\/.*[._].*$/p' list
/usr/share/mime-info/libreoffice7.0.mime
/usr/share/mime-info/libreoffice7.0.keys
/usr/share/appdata/libreoffice7.0-writer.appdata.xml
/usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
/usr/share/appdata/libreoffice7.0-draw.appdata.xml
/usr/share/appdata/libreoffice7.0-impress.appdata.xml
/usr/share/appdata/libreoffice7.0-base.appdata.xml
/usr/share/appdata/libreoffice7.0-calc.appdata.xml
/usr/share/applications/libreoffice7.0-xsltfilter.desktop
/usr/share/applications/libreoffice7.0-writer.desktop
/usr/share/applications/libreoffice7.0-base.desktop
/usr/share/applications/libreoffice7.0-math.desktop
/usr/share/applications/libreoffice7.0-startcenter.desktop
/usr/share/applications/libreoffice7.0-calc.desktop
/usr/share/applications/libreoffice7.0-draw.desktop
/usr/share/applications/libreoffice7.0-impress.desktop
/usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
Note, with a sed that does not allow chaining of expressions with ';' you would need:
sed -n -e 's/^[.]//' -e '/\/.*[._].*$/p' list
Assuming you want to delete the line(s) whose pathname is contained in other pathname(s), would you please try:
sort -r list.txt | awk ' # sort the list in the reverse order
{
sub("^\\.", "") # remove leading dot
s = prev; sub("/[^/]+$", "", s) # remove the rightmost slash and following characters
if (s != $0) print # if s != $0, it means $0 is not a substring of the previous line
prev = $0 # keep $0 for the next line
}'
Result:
/usr/share/mime-info/libreoffice7.0.mime
/usr/share/mime-info/libreoffice7.0.keys
/usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
/usr/share/applications/libreoffice7.0-xsltfilter.desktop
/usr/share/applications/libreoffice7.0-writer.desktop
/usr/share/applications/libreoffice7.0-startcenter.desktop
/usr/share/applications/libreoffice7.0-math.desktop
/usr/share/applications/libreoffice7.0-impress.desktop
/usr/share/applications/libreoffice7.0-draw.desktop
/usr/share/applications/libreoffice7.0-calc.desktop
/usr/share/applications/libreoffice7.0-base.desktop
/usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
/usr/share/appdata/libreoffice7.0-writer.appdata.xml
/usr/share/appdata/libreoffice7.0-impress.appdata.xml
/usr/share/appdata/libreoffice7.0-draw.appdata.xml
/usr/share/appdata/libreoffice7.0-calc.appdata.xml
/usr/share/appdata/libreoffice7.0-base.appdata.xml
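If the original order of the list should be kept, the same "drop every path that is a parent of another path" idea can be sketched as a two-pass awk that reads the list twice. The function name leaf_paths is hypothetical. One assumption to keep in mind: an empty directory has no children in the list, so it would be kept as if it were a file.

```shell
#!/usr/bin/env bash
# leaf_paths LISTFILE: hypothetical helper; prints entries that are not a parent
# directory of any other entry, with the leading dot removed.
leaf_paths() {
  awk '
    NR == FNR {                       # first pass: record every ancestor directory
      p = $0
      while (sub("/[^/]*$", "", p))   # strip the last path component
        parent[p] = 1
      next
    }
    !($0 in parent) {                 # second pass: keep entries that are not ancestors
      sub("^[.]", "")                 # remove the leading dot
      print
    }' "$1" "$1"
}
```

With the list stored in a file named list, this would be called as leaf_paths list.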

Replace the last character in string

How can I just replace the last character (it's a }) from a string? I need everything before the last character but replace the last character with some new string.
I tried many things with awk and sed but didn't succeed.
For example:
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
}'
should become:
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
\\cf2 Its red now
}'
This replaces the last occurrence of:
}
with
\\cf2 Its red now
}
sed would do this:
# replace '}' in the end
echo '\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural \f0 }' | sed 's/}$/\\cf2 Its red now}/'
# replace any last character
echo '\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural \f0 }' | sed 's/\(.\)$/\\cf2 Its red now\1/'
Replacing the trailing } could be done like this (with $ as the PS1 prompt and > as the PS2 prompt):
$ str="...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
> \\f0
> }"
$ echo "$str"
...\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0
}
$ echo "${str%\}}\cf2 It's red now
}"
...\tx4535\tx5102\tx5669\tx6236\tx6803\pardirnatural
\f0
\cf2 It's red now
}
$
The first 3 lines assign your string to my variable str. The next 4 lines show what's in the string. The 2 lines:
echo "${str%\}}\cf2 It's red now
}"
contain a (grammar-corrected) substitution of the material you asked for, and the last lines echo the substituted value.
Basically, ${str%tail} removes the string tail from the end of $str; I remember % ends in 't' for tail (and the analogous ${str#head} has hash starting with 'h' for head).
See shell parameter expansion in the Bash manual for the remaining details.
If you don't know the last character, you can use a ? metacharacter to match the end instead:
echo "${str%?}and the extra"
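Putting the pieces together, inserting text in front of the last character (whatever it is) could look like the sketch below. The function name insert_before_last is invented for this example.

```shell
#!/usr/bin/env bash
# insert_before_last STRING INSERTION: hypothetical helper; prints STRING with
# INSERTION placed just before its final character.
insert_before_last() {
  local s=$1 ins=$2
  local last=${s: -1}                           # the final character (note the space before -1)
  printf '%s%s%s' "${s%?}" "$ins" "$last"       # ${s%?} is s without that character
}

str=$'\\f0\n}'
insert_before_last "$str" $'\\cf2 Its red now\n'
```

The space in ${s: -1} matters: without it bash would read :- as the "use default value" operator.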
First make a string with newlines
str=$(printf "%s\n%s\n%s" '\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural' '\\f0' "}'")
Now you look for the last } in your string and replace it including a newline.
The $ makes sure it will only replace it at the last line, & stands for the matched string.
echo "${str}" |sed '$ s/}[^}]$/\\\\cf2 Its red now\n&/'
The above solution only works when the } is at the last line. It becomes more difficult when you also want to support str2:
str2=$(printf "Extra } here.\n%s\nsome other text" "${str}")
You cannot match the } on the last line. Removing the address $ for the last line will result in replacing all } characters (I added a } at the beginning of str2). You only want to replace the last one.
The ..../1 flag limits the replacement to the first match on a line. Replacing the last } rather than the first is done by reversing the order of the lines with tac. Since you will tac again after the replacement, you need to use a different order in your sed replacement string.
echo "${str2}" | tac |sed 's/}[^}]$/&\n\\\\cf2 Its red now/1' |tac
In awk:
$ awk ' BEGIN { RS=OFS=FS="" } $NF="\\\\cf2 Its red now\n}"' file
RS="" puts awk in paragraph mode, so records are separated by blank lines (change it to suit your needs)
OFS=FS="" separates characters each to its own field
$NF="\\\\cf2 Its red now\n}" replaces the char in the last field ($NF=}) with the quoted text
awk '{sub(/\\f0/,"\\f0\n\\\\\cfs Its red now")}1' file
...\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural
\\f0
\\cfs Its red now
}'

Resources