Remove single spaces from lines, but leave multiple spaces alone - shell

I have the following input:
a a f aa
aa aa a h o
f j
The above input has single spaces as well as multiple spaces. I need to remove only the single ones, not the others, i.e. the output should be
aaf aa
aa aaaho
f j

Perl to the rescue!
perl -pe 's/(?<! ) (?! )//g' < input
(?<! ) is a negative look-behind assertion. It succeeds only if the text immediately before the current position does not match the enclosed pattern, i.e. in this case, the preceding character is not a space.
(?! ) is a negative look-ahead assertion. It works the same way, but looks to the right.

Lex to the rescue!
$ cat x.l
%%
" "" "+ ECHO;
" " /* do nothing = don't echo */
%%
int yywrap(void)
{
return 1;
}
int main(void)
{
yylex();
return 0;
}
$ lex x.l
$ cc lex.yy.c -o spacedel
$ ./spacedel < in
aaf aa
aa aaaho
f j
This lex specification generates a C program that echoes every sequence of two or more spaces and discards single spaces. The default rule for unmatched characters is to echo them, which is exactly what we want.
(You could append . ECHO; if you want to make this explicit.)

This might work for you (GNU sed):
sed 's/\b\s\b//g' file
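This removes a single whitespace character surrounded by word boundaries, so it only applies when the characters on both sides are word characters (letters, digits or underscore).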


Delete the lines from file between pattern match

How can I delete all the lines between two patterns in a file using sed?
Here the patterns are //test and //endtest, and the file content is:
blah blah blah
c
f
f
[
]
//test
all text to be deleted
line1
line2
xyz
amv
{
//endtest
l
dsf
dsfs
Expected result:
blah blah blah
c
f
f
[
]
//test
//endtest
l
dsf
dsfs
This is a common feature of sed:
sed '/^\/\/test$/,/^\/\/endtest/d'
Because / delimits the regex, each literal / inside the regex has to be escaped.
If you want to keep marks (as requested):
sed '/^\/\/test$/,/^\/\/endtest/{//!d}'
Explanation:
Have a look at info sed, search for sed address -> Regexp Addresses and Range Addresses.
Inside { ... }, the empty regex // matches either boundary of the range:
The empty regular expression '//' repeats the last regular
expression match (the same holds if the empty regular expression is
passed to the 's' command).
! means not, and d deletes the line, so //!d deletes every line in the range except the two marker lines.
Alternative: You could write:
sed '/^\/\/\(end\)\?test$/,//{//!d}'
or
sed -E '/^\/\/(end)?test$/,//{//!d}'
These work the same way, but take care: the effect can be reversed if a stray //endtest appears before the first opening pattern (//test).
All of this was done using GNU sed 4.4.
Under macOS (BSD sed), I've successfully dropped the wanted lines with this syntax:
sed '/^\/\/test$/,/^\/\/endtest/{/^\/\/\(end\)\{0,1\}test$/!d;}'
or
sed -E '/^\/\/test$/,/^\/\/endtest/{/^\/\/(end)?test$/!d;}'
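A quick way to check the range-plus-negated-match form above is to run it on a tiny input. This is only a sketch: strip_between_markers is a made-up name, and it uses the explicit inner regex (the macOS variant) on the assumption that it behaves the same under GNU sed.

```shell
#!/bin/sh
# Hypothetical helper: delete everything between //test and //endtest,
# keeping both marker lines. Reads stdin, writes stdout.
strip_between_markers() {
    sed -E '/^\/\/test$/,/^\/\/endtest/{/^\/\/(end)?test$/!d;}'
}

printf '%s\n' 'keep 1' '//test' 'drop me' '{' '//endtest' 'keep 2' |
    strip_between_markers
```

The two marker lines survive because they match the inner regex, so the negated d is never applied to them.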
With awk:
$ awk '/\/\/endtest/{p=0} !p; /\/\/test/{p = 1}' file
blah blah blah
c
f
f
[
]
//test
//endtest
l
dsf
dsfs
If your data is in a file named 'd', try GNU sed:
sed -E '/\/\/test/,/\/\/endtest/{/\/\/.*test/!d}' d

How to do an insertion of text before a multi-line regex using sed or awk?

Given the following input (not literally what follows, but shown with some meta notation):
... any content can be above the match ...
# ... optional comment above the match ...
# ... optional comment above the match can have spaces before it ...
"<key>": ... any content can follow ...
... any content can be below the match ...
where the match is ^\s*"<key>": where the <key> is a placeholder for an actual string. Note that comments are matched by ^\s*#.*.
I want to insert a string of text before the matched <key> and before any comments that are immediately above the matched <key>. There may be a variable number of comments, or none at all.
I've come up with a solution using sed; however, it is very ugly because it uses a tr hack. I'm hoping for a simpler solution using either sed or awk.
First, here's a test case:
test.txt:
{
# 1a
# 2a
"key1": true,
# 1b
# 2b
"key2": false,
}
Now my present solution involves sed and translating all newlines to a delimiter character ($'\x01') to make it easier to do multi-line operations. My example involves a regex that matches multiple comment lines followed by a key-value pair.
# The string to insert before the match
s='# 1x
# 2x
"keyx": null,
'
# Define the key before which to do the insertion:
Key='key2'
# Normalize that string: s -> ns
ns="$(printf '%s' "$s" | tr '\n' $'\x01')"
# Normalize test.txt
tr '\n' $'\x01' < test.txt |
# Perform the multi-line insertion
sed "s/\(^\|\x01\)\(\(\s*#[^\x01]*\x01\)*\)\(\s*\"$Key\":\)/\1$ns\2\4/" |
# Return to standard form with newlines
tr $'\x01' '\n'
The above code when executed with the test.txt input produces the correct and expected output:
{
# 1a
# 2a
"key1": true,
# 1x
# 2x
"keyx": null,
# 1b
# 2b
"key2": false,
}
How might I improve on what I've done above using sed or awk to make for more maintainable code? Specifically:
Is there another way to do this using sed without the tr hack above?
Is there a simpler way to do this using awk?
Following your update that the input could include either no or varying amounts of comments, this is the edit (due to some problems editing it, I'm having to edit out v1, so if you want it back leave a comment.)
sed doesn't do loops or if/elses really, just labels and branches, so trying to pick a range of lines is a bit more complicated it seems. Or at least for my knowledge level.
export key='key2'
s='# 1x\n# 2x\n"keyx": null,\n'
key_pattern='[[:space:]]*"'"$key"'":'
sed -n '
/'"$key_pattern"'/ {
:b; i\
'"$s"'
p; d
}
/^[[:space:]]*#/ {
h; :a; n; H
/^[[:space:]]*#/ ba
/'"$key_pattern"'/ { x; bb; }
x; p; d;
}
p
'
This script handles three cases. First, where the key_pattern matches on its own (no comments before it):
/'"$key_pattern"'/ { # here :b creates label b,
:b; i\ # and inserts
'"$s"' # the contents of this line
p; d # print then delete from buffer and start next line
}
When a group of comments is followed by the key_pattern:
/^[[:space:]]*#/ { # if comment found
h; # copy pattern space into hold space
:a; # create label a
n; H # get next line, append to hold space.
/^[[:space:]]*#/ ba # if new line is comment, goto `a`
/'"$key_pattern"'/ { x; bb; } # else if our pattern retrieve hold
# and goto `b`
x; p; d; # retrieve hold space, print and delete
}
And finally, when the line doesn't match anything else:
p; # print line and start next.
The following code comes with these assumptions:
Blank line between keys and data
Curly braces not elsewhere
awk '/key2/{$0 = "# 1x\n# 2x\n\"keyx\": null,\n\n"$0}ORS = RT' RS='[{}\n]\n' input_file
The main focus here is on setting up the RS value so that it delimits each record. Note that RT is specific to GNU awk.
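For comparison, the same insertion can be sketched in plain awk by buffering each run of comment lines until the next non-comment line shows whether the key follows. This is an illustrative sketch rather than one of the posted answers: insert_before_key and its argument order are made up, the key is treated as a regex fragment, and the text is passed through the environment so that it may contain newlines.

```shell
#!/bin/sh
# Hypothetical helper: insert_before_key KEY TEXT FILE
# Prints FILE with TEXT inserted before the line matching "KEY": and
# before any comment lines directly above it.
insert_before_key() {
    KEY="$1" INS="$2" awk '
        BEGIN { pat = "^[ \t]*\"" ENVIRON["KEY"] "\":" }
        /^[ \t]*#/ { buf = buf $0 "\n"; next }   # hold a run of comment lines
        $0 ~ pat { printf "%s", ENVIRON["INS"] } # key found: emit new text first
        { printf "%s", buf; buf = ""; print }    # flush held comments, then the line
        END { printf "%s", buf }                 # comments at end of file
    ' "$3"
}
```

With the s variable and test.txt from the question, insert_before_key key2 "$s" test.txt should give the expected output shown there, with no tr round trip.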

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ASCII file "table.txt" which contains lines of whitespace-separated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1]=$2; } }
{ line=$0; for(i in subs)
{ gsub(i,subs[i],line); }
print line;
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Who can explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task, where I have a large file containing ASCII-encoded phonetic symbols which I want to convert into LaTeX code. The ASCII encoding of the symbols contains {$,&,-,%,[a-z],[0-9],...}.)
Any comments and suggestions are welcome!
PS:
Of course in this application for a substitution table.txt:
aa ab
a 1
the original string "aa" should be converted into "ab" and not "1b". That is, a string produced by applying a rule must be left untouched.
How to account for that?
The order of the loop for (i in subs) is undefined by default.
In newer versions of awk you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders for details about that.
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
Also note the caveats with getline listed on http://awk.info/?tip/getline and consider not calling it manually but instead using NR==FNR{...} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays one mapping input file line number to the patterns to match and another mapping patterns to replacements. Then looping over the line number array will get you the pattern and the pattern can be used in the second array to get the replacement (for gsub).
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
BEGIN {i=0; while (getline < file) { subs[i]=$1; repl[i++]=$2}
n = i}
{ for(i=0;i<n;i++) { gsub(subs[i],repl[i]); }
print tolower($0);
}
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;
my %subs;
BEGIN{
open my $f, '<', 'table.txt' or die "table.txt:$!";
while(<$f>) {
my ($k,$v) = split;
$subs{$k}=$v;
}
}
while(<>) {
while(my($k, $v) = each %subs) {
s/\b$k\b/$v/g;
}
print;
}
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in inverse order by length of target string (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
If you have tcc installed, you can use the following shell function, which processes the file of substitutions into a lex-generated scanner, then compiles and runs it using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
tcc -run <(
{
printf %s\\n "%option 8bit noyywrap nounput" "%%"
sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
s/((\\\\)*)\\?"/\1\\"/g;
s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
"$1"
printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
} | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. POSIX-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, GNU make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without the .txt extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:
%: %.exe
	@$(<D)/$(<F)
%.exe: %.txt
	@{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
	sed -r \
	's/((\\\\)*)(\\?)$$/\1\3\3/; #\
	s/((\\\\)*)\\?"/\1\\"/g; #\
	s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
	"$<"; \
	printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
	} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$@" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
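The longest-target-first, single-pass idea described above can also be sketched in plain awk, without generating a scanner. This is a minimal illustration and not the answer's method: substitute_once is a made-up name, targets are treated as literal strings with no escape sequences, and the table file is assumed to be non-empty.

```shell
#!/bin/sh
# Hypothetical helper: substitute_once TABLE < text
# TABLE holds "target replacement" pairs, one per line (assumed non-empty).
substitute_once() {
    awk '
        NR == FNR {                          # first file: load the table
            if (NF == 2) {
                map[$1] = $2
                if (length($1) > maxlen) maxlen = length($1)
            }
            next
        }
        {
            out = ""; i = 1; n = length($0)
            while (i <= n) {
                hit = 0
                # try the longest candidate first at position i
                for (len = maxlen; len >= 1 && !hit; len--) {
                    s = substr($0, i, len)
                    if (length(s) == len && (s in map)) {
                        out = out map[s]     # emit the replacement, never rescan it
                        i += len
                        hit = 1
                    }
                }
                if (!hit) { out = out substr($0, i, 1); i++ }
            }
            print out
        }
    ' "$1" -
}
```

Because the output is built separately from the input, text produced by one rule is never matched by another, and the order of lines in the table does not matter.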
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
while ( sstart = index(tail,old) ) {
$0 = $0 substr(tail,1,sstart-1) new
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
Adding some RE metacharacters to table.txt shows that they are treated just like every other character. This also shows how to run the script when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement requires a solution like this:
$ cat substitute.awk
NR==FNR {
if (NF==2) {
strings[++numStrings] = $1
old2new[$1] = $2
}
next
}
{
delete news
for (stringNr=1; stringNr<=numStrings; stringNr++) {
old = strings[stringNr]
new = old2new[old]
slength = length(old)
tail = $0
$0 = ""
charPos = 0
while ( sstart = index(tail,old) ) {
charPos += sstart
news[charPos] = new
$0 = $0 substr(tail,1,sstart-1) RS
tail = substr(tail,sstart+slength)
}
$0 = $0 tail
}
numChars = split($0, olds, "")
$0 = ""
for (charPos=1; charPos <= numChars; charPos++) {
$0 = $0 (charPos in news ? news[charPos] : olds[charPos])
}
print
}
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

Deleting characters from a column if they appear fewer than 20 times

I have a CSV file with two columns:
cat # c a t
dog # d o g
bat # b a t
To simplify communication, I've used English letters for this example, but I'm dealing with CJK in UTF-8.
I would like to delete any character appearing in the second column, which appears on fewer than 20 lines within the first column (characters could be anything from numbers, letters, to Chinese characters, and punctuation, but not spaces).
For e.g., if "o" appears on 15 lines in the first column, all appearances of "o" are deleted from the second column. If "a" appears on 35 lines in the first column, no change is made.
The first column must not be changed.
I don't need to count multiple appearances of a letter on a single line. For e.g. "robot" has 2 o's, but this detail is not important, only that "robot" has an "o", so that is counted as one line.
How can I delete the characters that appear less than 20 times?
Here is a script using awk. Change the var num to be your frequency cutoff point. I've set it to 1 to show how it works against a small sample file. Note how f is still deleted even though it shows up three times on a single line. Also, passing the same input file twice is not a typo.
awk -v num=1 '
BEGIN { OFS=FS="#" }
FNR==NR{
split($1,a,"")
for (x in a)
if(a[x] != " " && !c[a[x]]++)
l[a[x]]++
delete c
next
}
!flag++{
for (x in l)
if (l[x] <= num)
cclass = cclass x
}
{
gsub("["cclass"]", " " , $2)
}1' ./infile.csv ./infile.csv
Sample Input
$ cat ./infile
fff # f f f
cat # c a t
dog # d o g
bat # b a t
Output
$ ./delchar.sh
fff #
cat # a t
dog #
bat # a t
Perl solution:
#!/usr/bin/perl
use warnings;
use strict;
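open my $IN, '<:utf8', $ARGV[0] or die $!;
binmode STDOUT, ':utf8';                     # write UTF-8 back out
my %chars;
while (<$IN>) {
    chomp;
    my @cols = split /#/;
    my %linechars;
    undef @linechars{ split //, $cols[0] };  # unique characters of column 1 on this line
    $chars{$_}++ for keys %linechars;        # count lines, not occurrences
}
seek $IN, 0, 0;
my @remove = grep $chars{$_} < 20, keys %chars;
# quotemeta keeps characters such as ] or ^ literal inside the class
my $remove_reg = '[' . join(q{}, map quotemeta, @remove) . ']';
warn $remove_reg;
while (<$IN>) {
    my @cols = split /#/;
    $cols[1] =~ s/$remove_reg//g;
    print join '#', @cols;
}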
I am not sure how whitespace should be handled, so you might need to adjust the script.
One approach is:
cut -d " " -f "$column" "$file" | sed -e 's/\.//g' -e 's/,//g' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
where $file is your text file and $column is the number of the column whose frequencies you want. This prints each distinct entry together with its frequency.
You can then loop over the results whose count is above your threshold and grep for them in the whole file.

Replacing quotation marks with "``" and "''"

I have a document containing many " marks, but I want to convert it for use in TeX.
TeX uses 2 ` marks for the beginning quote mark, and 2 ' mark for the closing quote mark.
I only want to make changes to these when " appears on a single line in an even number (e.g. there are 2, 4, or 6 "'s on the line). For e.g.
"This line has 2 quotation marks."
--> ``This line has 2 quotation marks.''
"This line," said the spider, "Has 4 quotation marks."
--> ``This line,'' said the spider, ``Has 4 quotation marks.''
"This line," said the spider, must have a problem, because there are 3 quotation marks."
--> (unchanged)
My sentences never break across lines, so there is no need to check on multiple lines.
There are few quotes with single quotes, so I can manually change those.
How can I convert these?
This is my one-liner, which works for me:
awk -F\" '{if((NF-1)%2==0){res=$0;for(i=1;i<NF;i++){to="``";if(i%2==0){to="'\'\''"}res=gensub("\"", to, 1, res)};print res}else{print}}' input.txt >output.txt
And here is a long version of this one-liner with comments:
BEGIN {
    FS = "\""                        # set the field separator to a double quote before the first line is read
}
{
    if ((NF-1) % 2 == 0) {           # the line has an even number of double quotes
        res = $0                     # save the original line in res
        for (i = 1; i < NF; i++) {   # for each double quote
            to = "``"                # odd-numbered quote opens: replace with ``
            if (i % 2 == 0) {        # even-numbered quote closes: replace with ''
                to = "''"
            }
            # replace the first remaining " in res with to (gensub requires GNU awk)
            res = gensub("\"", to, 1, res)
        }
        print res                    # print the converted line
    } else {
        print                        # print the original line when there is nothing to change
    }
}
You may run this script by:
awk -f replace-quotes.awk input.txt >output.txt
Here's my one-liner using repeated sed's:
cat file.txt | sed -e 's/"\([^"]*\)"/`\1`/g' | sed '/"/s/`/\"/g' | sed -e 's/`\([^`]*\)`/``\1'\'''\''/g'
(note: it won't work correctly if there are already back-ticks (`) in the file but otherwise should do the trick)
EDIT:
Removed back-tick bug by simplifying, now works for all cases:
cat file.txt | sed -e 's/"\([^"]*\)"/``\1'\'\''/g' | sed '/"/s/``/"/g' | sed '/"/s/'\'\''/"/g'
With comments:
cat file.txt # read file
| sed -e 's/"\([^"]*\)"/``\1'\'\''/g' # initial replace
| sed '/"/s/``/"/g' # revert `` to " on lines with extra "
| sed '/"/s/'\'\''/"/g' # revert '' to " on lines with extra "
Using awk
awk '{n=gsub("\"","&")}!(n%2){while(n--){n%2?Q=q:Q="`";sub("\"",Q Q)}}1' q=\' in
Explanation
awk '{
n=gsub("\"","&") # set n to the number of quotes in the current line
}
!(n%2){ # if there are even number of quotes
while(n--){ # as long as we have double-quotes
n%2?Q=q:Q="`" # alternate Q between a backtick and single quote
sub("\"",Q Q) # replace the next double quote with two of whatever Q is
}
}1 # print out all other lines untouched'
q=\' in # set the q variable to a single quote and pass the file 'in' as input
Using sed
sed '/^\([^"]*"[^"]*"[^"]*\)*$/s/"\([^"]*\)"/``\1'\'\''/g' in
This might work for you:
sed 'h;s/"\([^"]*\)"/``\1''\'\''/g;/"/g' file
Explanation:
Make a copy of the original line h
Replace pairs of "'s s/"\([^"]*\)"/``\1''\'\''/g
Check for odd " and if found revert to original line /"/g
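The hold-space trick explained above can be tried on two sample lines. This is a sketch only: tex_quotes is a made-up wrapper name, and the shell quoting is rearranged slightly so that the script sed receives is h;s/"\([^"]*\)"/``\1''/g;/"/g.

```shell
#!/bin/sh
# Hypothetical wrapper: convert "..." pairs to ``...'' on lines with an
# even number of double quotes; leave other lines unchanged.
tex_quotes() {
    # h    : save the original line in the hold space
    # s    : turn each "..." pair into ``...''
    # /"/g : a leftover " means an odd count, so restore the saved line
    sed 'h;s/"\([^"]*\)"/``\1'"''"'/g;/"/g'
}

printf '%s\n' '"This line," said the spider, "Has 4 quotation marks."' \
              '"This line," said the spider, has 3 quotation marks."' |
    tex_quotes
```

The first line comes out with both pairs converted, while the second is printed exactly as it went in.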
