In what cases should I use the dollar sign with single quotes in transliterate (e.g. tr -d '\n'), or in any other command? - bash

Say I'm trying to delete newlines or carriage returns. I notice that when I use transliterate to delete the newline characters with tr -d '\n', I get the same results as if I were to use tr -d $"\n" or tr -d $'\n'. What's the difference?
I'm not sure how the same applies in sed or grep because they are more complicated. So, I'm trying to figure out tr first as that seems to be a simpler bash program.

tr does its own escaping:
When you write tr -d '\n', tr receives the two characters \ and n; the tr program itself recognises that \n escape sequence and substitutes a newline.
When you write tr -d $'\n', Bash converts the \n to an actual newline character before tr ever runs, so tr receives a literal newline and deletes it directly. (With tr -d $"\n", the $"..." form only requests locale translation of the string; since there is no translation for it, it behaves like plain "\n", so tr again sees \ and n and does its own escaping.)
If you're experimenting to understand what the shell does, it's probably worth writing a short C program to print out each argument letter by letter - something like:
#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    /* Ignore argv[0] - the program name is not interesting */
    for (i = 1; i < argc; ++i) {
        char *p = argv[i];
        printf("argv[%d] =", i);
        while (*p)
            printf(" %3d", (int)*p++);
        printf("\n");
    }
    return 0;
}
This one prints in decimal, but it's easy to change it to use hex or octal. Running it with $'\n' \n "\n" as arguments gives:
argv[1] = 10
argv[2] = 110
argv[3] = 92 110
showing that in the first case, Bash passes a single newline character, in the second case, just the 'n', and in the final case, both '\' and 'n'.
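If you'd rather not compile anything, a rough shell equivalent is possible with od (a sketch; showargs is a made-up name, and the exact column formatting depends on your od implementation):

# Dump each argument's bytes as unsigned decimal values, one argument per line.
showargs() {
    i=1
    for arg in "$@"; do
        printf 'argv[%d] =' "$i"
        printf '%s' "$arg" | od -An -tu1
        i=$((i + 1))
    done
}

showargs $'\n' \n "\n"

Run that way it should print 10, then 110, then 92 110, matching the C program's output.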

Exactly how do backslashes work within backticks?

From the Bash FAQ:
Backslashes (\) inside backticks are handled in a non-obvious manner:
$ echo "`echo \\a`" "$(echo \\a)"
a \a
$ echo "`echo \\\\a`" "$(echo \\\\a)"
\a \\a
But the FAQ does not break down the parsing rules that lead to this difference. The only relevant quote from man bash I found was:
When the old-style backquote form of substitution is used, backslash retains its literal meaning except when followed by $, `, or \.
The "$(echo \\a)" and "$(echo \\\\a)" cases are easy enough: Backslash, the escape character, is escaping itself into a literal backlash. Thus every instance of \\ becomes \ in the output. But I'm struggling to understand the analogous logic for the backtick cases. What is the underlying rule and how does the observed output follow from it?
Finally, a related question... If you don't quote the backticks, you get a "no match" error:
$ echo `echo \\\\a`
-bash: no match: \a
What's happening in this case?
Update
Re: my main question, I have a theory for a set of rules that explains all the behavior, but still don't see how it follows from any of the documented rules in bash. Here are my proposed rules....
Inside backticks, a backslash in front of a character simply returns that character. I.e., a single backslash has no effect. And this is true for all characters, except backslash itself and backticks. In the case of backslash itself, \\ becomes an escaping backslash. It will escape its next character.
Let's see how this plays out in an example:
a=xx
echo "`echo $a`" # prints the value of $a
echo "`echo \$a`" # single backslash has no effect: equivalent to above
echo "`echo \\$a`" # escaping backslash make $ literal
prints:
xx
xx
$a
Let's analyze the original examples from this perspective:
echo "`echo \\a`"
Here the \\ produces an escaping backslash, but when we "escape" a we just get back a, so it prints a.
echo "`echo \\\\a`"
Here the first pair \\ produces an escaping backslash which is applied to \, producing a literal backslash. That is, the first 3 \\\ become a single literal \ in the output. The remaining \a just produces a. Final result is \a.
The logic is actually quite simple. Let's look at the bash source code (4.4) itself:
subst.c:9273
case '`': /* Backquoted command substitution. */
  {
    t_index = sindex++;
    temp = string_extract(string, &sindex, "`", SX_REQMATCH);
    /* The test of sindex against t_index is to allow bare instances of
       ` to pass through, for backwards compatibility. */
    if (temp == &extract_string_error || temp == &extract_string_fatal)
      {
        if (sindex - 1 == t_index)
          {
            sindex = t_index;
            goto add_character;
          }
        last_command_exit_value = EXECUTION_FAILURE;
        report_error(_("bad substitution: no closing \"`\" in %s"), string + t_index);
        free(string);
        free(istring);
        return ((temp == &extract_string_error) ? &expand_word_error
                                                : &expand_word_fatal);
      }
    if (expanded_something)
      *expanded_something = 1;
    if (word->flags & W_NOCOMSUB)
      /* sindex + 1 because string[sindex] == '`' */
      temp1 = substring(string, t_index, sindex + 1);
    else
      {
        de_backslash(temp);
        tword = command_substitute(temp, quoted);
        temp1 = tword ? tword->word : (char *)NULL;
        if (tword)
          dispose_word_desc(tword);
      }
    FREE(temp);
    temp = temp1;
    goto dollar_add_string;
  }
As you can see, it calls the function de_backslash(temp) on the string, which updates the string in place. The code of that function is below.
subst.c:1607
/* Remove backslashes which are quoting backquotes from STRING. Modifies
   STRING, and returns a pointer to it. */
char *
de_backslash(string)
    char *string;
{
  register size_t slen;
  register int i, j, prev_i;
  DECLARE_MBSTATE;

  slen = strlen(string);
  i = j = 0;
  /* Loop copying string[i] to string[j], i >= j. */
  while (i < slen)
    {
      if (string[i] == '\\' && (string[i + 1] == '`' || string[i + 1] == '\\' ||
                                string[i + 1] == '$'))
        i++;
      prev_i = i;
      ADVANCE_CHAR(string, slen, i);
      if (j < prev_i)
        do
          string[j++] = string[prev_i++];
        while (prev_i < i);
      else
        j = i;
    }
  string[j] = '\0';
  return (string);
}
The above does a simple thing: if there is a \ character and the next character is \, a backtick, or $, then it skips the \ and copies the following character.
So, converting it to Python for simplicity:
text = r"\\\\$a"
slen = len(text)
i = 0
j = 0
data = ""
while i < slen:
if (text[i] == '\\' and (text[i + 1] == '`' or text[i + 1] == '\\' or
text[i + 1] == '$')):
i += 1
data += text[i]
i += 1
print(data)
The output of the same is \\$a. And now let's test the same in bash:
$ a=xxx
$ echo "$(echo \\$a)"
\xxx
$ echo "`echo \\\\$a`"
\xxx
I did some more research to find the reference and the rule for what is happening. The GNU Bash Reference Manual states:
When the old-style backquote form of substitution is used, backslash
retains its literal meaning except when followed by ‘$’, ‘`’, or ‘\’.
The first backquote not preceded by a backslash terminates the command
substitution. When using the $(command) form, all characters between
the parentheses make up the command; none are treated specially.
In other words, \\, \$, and \` inside `` are processed by the shell parser before the command substitution runs. Everything else is passed to the command substitution for processing.
Let's step through each example from the question. After the # I put how the command substitution was processed by the CLI parser before `` or $() is executed.
Your first example explained.
$ echo "`echo \\a`" # echo \a
a
$ echo "$(echo \\a)" # echo \\a
\a
Your second example explained:
$ echo "`echo \\\\a`" # echo \\a
\a
$ echo "$(echo \\\\a)" # echo \\\\a
\\a
Your third example:
a=xx
$ echo "`echo $a`" # echo xx
xx
$ echo "`echo \$a`" # echo $a
xx
echo "`echo \\$a`" # echo \$a
$a
Your third example using $()
$ echo "$(echo $a)" # echo $a
xx
$ echo "$(echo \$a)" # echo \$a
$a
$ echo "$(echo \\$a)" # echo \\$a
\xx
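If you want to watch this happen without digging through the source, xtrace is handy: commands run inside a command substitution inherit set -x, so bash shows what the inner command actually received (a rough sketch; the exact trace formatting depends on your bash version):

a=xx
set -x
echo "`echo \$a`"    # subshell runs: echo $a  (the single backslash is dropped), so it expands to xx
echo "`echo \\$a`"   # subshell runs: echo \$a (\\ collapsed to \), so echo prints $a literally
set +x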

awk substitution ascii table rules bash

I want to perform a hierarchical set of (non-recursive) substitutions in a text file.
I want to define the rules in an ascii file "table.txt" which contains lines of blank space tabulated pairs of strings:
aaa 3
aa 2
a 1
I have tried to solve it with an awk script "substitute.awk":
BEGIN { while (getline < file) { subs[$1]=$2; } }
{
    line = $0;
    for (i in subs) { gsub(i, subs[i], line); }
    print line;
}
When I call the script giving it the string "aaa":
echo aaa | awk -v file="table.txt" -f substitute.awk
I get
21
instead of the desired "3". Permuting the lines in "table.txt" doesn't help. Can anyone explain what the problem is here, and how to circumvent it? (This is a simplified version of my actual task, where I have a large file containing ascii-encoded phonetic symbols which I want to convert into LaTeX code. The ascii encoding of the symbols contains $, &, -, %, [a-z], [0-9], ....)
Any comments and suggestions are welcome!
PS:
Of course, in this application, for a substitution table.txt of:
aa ab
a 1
an original string "aa" should be converted into "ab" and not "1b". That means a string which was yielded by applying a rule must be left untouched.
How to account for that?
The order of the loop for (i in subs) is undefined by default.
In newer versions of awk you can use PROCINFO["sorted_in"] to control the sort order. See section 12.2.1 Controlling Array Traversal and (the linked) section 8.1.6 Using Predefined Array Scanning Orders for details about that.
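For example, with GNU awk 4.0 or later you could do something along these lines (a sketch, not the original script; substitute-sorted.awk is a made-up name, and note that descending string order only happens to put the longer patterns first for this particular table):

# substitute-sorted.awk (requires gawk 4.0+)
BEGIN {
    PROCINFO["sorted_in"] = "@ind_str_desc"   # for this table, "aaa" sorts before "aa" before "a"
    while ((getline line < file) > 0) {
        split(line, f, " ")
        subs[f[1]] = f[2]
    }
}
{
    for (i in subs)
        gsub(i, subs[i])
    print
}

echo aaa | gawk -v file="table.txt" -f substitute-sorted.awk   # should print 3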
Alternatively, if you can't or don't want to do that you could store the replacements in numerically indexed entries in subs and walk the array in order manually.
To do that you will need to store both the pattern and the replacement in the value of the array and that will require some care to combine. You can consider using SUBSEP or any other character that cannot be in the pattern or replacement and then split the value to get the pattern and replacement in the loop.
Also note the caveats/etc. with getline listed on http://awk.info/?tip/getline and consider not using that manually but instead using NR==FNR{...} and just listing table.txt as the first file argument to awk.
Edit: Actually, for the manual loop version you could also just keep two arrays one mapping input file line number to the patterns to match and another mapping patterns to replacements. Then looping over the line number array will get you the pattern and the pattern can be used in the second array to get the replacement (for gsub).
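A rough illustration of that last idea (a hypothetical script, just to show the shape of it): one array maps the table's line number to the pattern, the other maps the pattern to its replacement, and the loop walks the line numbers in order.

# substitute-ordered.awk (illustrative name)
NR == FNR {                 # first file: the replacement table
    pat[FNR] = $1           # table line number -> pattern
    repl[$1] = $2           # pattern -> replacement
    npat = FNR
    next
}
{
    for (i = 1; i <= npat; i++)
        gsub(pat[i], repl[pat[i]])
    print
}

echo aaa | awk -f substitute-ordered.awk table.txt -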
Instead of storing the replacements in an associative array, put them in two arrays indexed by integer (one array for the strings to replace, one for the replacements) and iterate over the arrays in order:
BEGIN { i = 0
        while (getline < file) { subs[i] = $1; repl[i++] = $2 }
        n = i }
{
    for (i = 0; i < n; i++) { gsub(subs[i], repl[i]); }
    print;
}
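Invoked the same way as the original script, the rules are now applied in the order they appear in table.txt (an illustrative run):

$ echo aaa | awk -v file="table.txt" -f substitute.awk
3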
It seems like perl's zero-width word boundary is what you want. It's a pretty straightforward conversion from the awk:
#!/usr/bin/env perl
use strict;
use warnings;

my %subs;
BEGIN {
    open my $f, '<', 'table.txt' or die "table.txt:$!";
    while (<$f>) {
        my ($k, $v) = split;
        $subs{$k} = $v;
    }
}

while (<>) {
    while (my ($k, $v) = each %subs) {
        s/\b$k\b/$v/g;
    }
    print;
}
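Saved as, say, substitute.pl (name made up), it reads table.txt from the current directory and filters stdin to stdout:

$ echo aaa | perl substitute.pl
3

The \b word boundaries also help with the follow-up requirement: with the second table ("aa ab" / "a 1"), "aa" becomes "ab" and the a inside "ab" is not rewritten again, because \b stops "a" from matching inside a longer run of word characters.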
Here's an answer pulled from another StackExchange site, from a fairly similar question: Replace multiple strings in a single pass.
It's slightly different in that it does the replacements in inverse order by length of target string (i.e. longest target first), but that is the only sensible order for targets which are literal strings, as appears to be the case in this question as well.
If you have tcc installed, you can use the following shell function, which processes the file of substitutions into a lex-generated scanner, which it then compiles and runs using tcc's compile-and-run option.
# Call this as: substitute replacements.txt < text_to_be_substituted.txt
# Requires GNU sed because I was too lazy to write a BRE
substitute () {
  tcc -run <(
    {
      printf %s\\n "%option 8bit noyywrap nounput" "%%"
      sed -r 's/((\\\\)*)(\\?)$/\1\3\3/;
              s/((\\\\)*)\\?"/\1\\"/g;
              s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
          "$1"
      printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"
    } | lex -t)
}
With gcc or clang, you can use something similar to compile a substitution program from the replacement list, and then execute that program on the given text. Posix-standard c99 does not allow input from stdin, but gcc and clang are happy to do so provided you tell them explicitly that it is a C program (-x c). In order to avoid excess compilations, we use make (which needs to be gmake, Gnu make).
The following requires that the list of replacements be in a file with a .txt extension; the cached compiled executable will have the same name with a .exe extension. If the makefile were in the current directory with the name Makefile, you could invoke it as make repl (where repl is the name of the replacement file without a text extension), but since that's unlikely to be the case, we'll use a shell function to actually invoke make.
Note that in the following file, the whitespace at the beginning of each line starts with a tab character:
substitute.mak
.SECONDARY:

%: %.exe
	@$(<D)/$(<F)

%.exe: %.txt
	@{ printf %s\\n "%option 8bit noyywrap nounput" "%%"; \
	   sed -r \
	     's/((\\\\)*)(\\?)$$/\1\3\3/; #\
	      s/((\\\\)*)\\?"/\1\\"/g; #\
	      s/^((\\.|[^[:space:]])+)[[:space:]]*(.*)/"\1" {fputs("\3",yyout);}/' \
	     "$<"; \
	   printf %s\\n "%%" "int main(int argc, char** argv) { return yylex(); }"; \
	} | lex -t | c99 -D_POSIX_C_SOURCE=200809L -O2 -x c -o "$@" -
Shell function to invoke the above:
substitute() {
gmake -f/path/to/substitute.mak "${1%.txt}"
}
You can invoke the above command with:
substitute file
where file is the name of the replacements file. (The filename must end with .txt but you don't have to type the file extension.)
The format of the input file is a series of lines consisting of a target string and a replacement string. The two strings are separated by whitespace. You can use any valid C escape sequence in the strings; you can also \-escape a space character to include it in the target. If you want to include a literal \, you'll need to double it.
If you don't want C escape sequences and would prefer to have backslashes not be metacharacters, you can replace the sed program with a much simpler one:
sed -r 's/([\\"])/\\\1/g' "$<"; \
(The ; \ is necessary because of the way make works.)
a) Don't use getline unless you have a very specific need and fully understand all the caveats, see http://awk.info/?tip/getline
b) Don't use regexps when you want strings (yes, this means you cannot use sed).
c) The while loop needs to constantly move beyond the part of the line you've already changed or you could end up in an infinite loop.
You need something like this:
$ cat substitute.awk
NR==FNR {
    if (NF==2) {
        strings[++numStrings] = $1
        old2new[$1] = $2
    }
    next
}
{
    for (stringNr=1; stringNr<=numStrings; stringNr++) {
        old = strings[stringNr]
        new = old2new[old]
        slength = length(old)
        tail = $0
        $0 = ""
        while ( sstart = index(tail,old) ) {
            $0 = $0 substr(tail,1,sstart-1) new
            tail = substr(tail,sstart+slength)
        }
        $0 = $0 tail
    }
    print
}
$ echo aaa | awk -f substitute.awk table.txt -
3
$ echo aaaa | awk -f substitute.awk table.txt -
31
and adding some RE metacharacters to table.txt to show they are treated just like every other character and showing how to run it when the target text is stored in a file instead of being piped:
$ cat table.txt
aaa 3
aa 2
a 1
. 7
\ 4
* 9
$ cat foo
a.a\aa*a
$ awk -f substitute.awk table.txt foo
1714291
Your new requirement requires a solution like this:
$ cat substitute.awk
NR==FNR {
    if (NF==2) {
        strings[++numStrings] = $1
        old2new[$1] = $2
    }
    next
}
{
    delete news
    for (stringNr=1; stringNr<=numStrings; stringNr++) {
        old = strings[stringNr]
        new = old2new[old]
        slength = length(old)
        tail = $0
        $0 = ""
        charPos = 0
        while ( sstart = index(tail,old) ) {
            charPos += sstart
            news[charPos] = new
            $0 = $0 substr(tail,1,sstart-1) RS
            tail = substr(tail,sstart+slength)
        }
        $0 = $0 tail
    }
    numChars = split($0, olds, "")
    $0 = ""
    for (charPos=1; charPos <= numChars; charPos++) {
        $0 = $0 (charPos in news ? news[charPos] : olds[charPos])
    }
    print
}
$ cat table.txt
1 a
2 b
$ echo "121212" | awk -f substitute.awk table.txt -
ababab

Replacing quotation marks with "``" and "''"

I have a document containing many " marks, but I want to convert it for use in TeX.
TeX uses 2 ` marks for the beginning quote mark, and 2 ' marks for the closing quote mark.
I only want to make changes to these when " appears on a single line in an even number (e.g. there are 2, 4, or 6 "'s on the line). For example:
"This line has 2 quotation marks."
--> ``This line has 2 quotation marks.''
"This line," said the spider, "Has 4 quotation marks."
--> ``This line,'' said the spider, ``Has 4 quotation marks.''
"This line," said the spider, must have a problem, because there are 3 quotation marks."
--> (unchanged)
My sentences never break across lines, so there is no need to check on multiple lines.
There are few quotes with single quotes, so I can manually change those.
How can I convert these?
This is my one-liner, which works for me:
awk -F\" '{if((NF-1)%2==0){res=$0;for(i=1;i<NF;i++){to="``";if(i%2==0){to="'\'\''"}res=gensub("\"", to, 1, res)};print res}else{print}}' input.txt >output.txt
And there is long version of this one-liner with comments:
BEGIN {
    FS = "\""                      # set field separator to double quote
}
{
    if ((NF-1) % 2 == 0) {         # if the count of double quotes in the line is even
        res = $0                   # save original line to res variable
        for (i = 1; i < NF; i++) { # for each double quote
            to = "``"              # replace current occurrence of double quote by ``
            if (i % 2 == 0) {      # if it's a closing quote, replace by ''
                to = "''"
            }
            # replace " by to in res and save result to res
            res = gensub("\"", to, 1, res)
        }
        print res                  # print the resulting line
    } else {
        print                      # print original line when nothing to change
    }
}
You may run this script by:
awk -f replace-quotes.awk input.txt >output.txt
Here's my one-liner using repeated sed's:
cat file.txt | sed -e 's/"\([^"]*\)"/`\1`/g' | sed '/"/s/`/\"/g' | sed -e 's/`\([^`]*\)`/``\1'\'''\''/g'
(note: it won't work correctly if there are already back-ticks (`) in the file but otherwise should do the trick)
EDIT:
Removed back-tick bug by simplifying, now works for all cases:
cat file.txt | sed -e 's/"\([^"]*\)"/``\1'\'\''/g' | sed '/"/s/``/"/g' | sed '/"/s/'\'\''/"/g'
With comments:
cat file.txt # read file
| sed -e 's/"\([^"]*\)"/``\1'\'\''/g' # initial replace
| sed '/"/s/``/"/g' # revert `` to " on lines with extra "
| sed '/"/s/'\'\''/"/g' # revert '' to " on lines with extra "
Using awk
awk '{n=gsub("\"","&")}!(n%2){while(n--){n%2?Q="`":Q=q;sub("\"",Q Q)}}1' q=\' in
Explanation
awk '{
    n=gsub("\"","&")      # set n to the number of double quotes in the current line
}
!(n%2){                   # if there is an even number of quotes
    while(n--){           # as long as there are double quotes left to replace
        n%2?Q="`":Q=q     # alternate Q: a backtick for opening quotes, a single quote for closing ones
        sub("\"",Q Q)     # replace the next double quote with two of whatever Q is
    }
}1                        # print all lines, changed or untouched
' q=\' in                 # set the q variable to a single quote and pass the file 'in' as input
Using sed
sed '/^\([^"]*"[^"]*"[^"]*\)*$/s/"\([^"]*\)"/``\1'\'\''/g' in
This might work for you:
sed 'h;s/"\([^"]*\)"/``\1''\'\''/g;/"/g' file
Explanation:
Make a copy of the original line h
Replace pairs of "'s s/"\([^"]*\)"/``\1'\'\''/g
Check for odd " and if found revert to original line /"/g

Removing control / special characters from log file

I have a log file captured by tclsh which captures all the backspace characters (ctrl-H, shows up as "^H") and color-setting sequences (eg. ^[[32m .... ^[[0m ). What is an efficient way to remove them?
^[...m
This one is easy since I can just do "sed -i s/^[.*m//g" to remove them.
^H
Right now I have "sed -i s/.^H//", which "applies" a backspace, but I have to keep looping this until there are no more backspaces.
while [ logfile == `grep -l ^H logfile` ]; do sed -i s/.^H// logfile ; done;
"sed -i s/.^H//g" doesn't work because it would match consecutive backspaces. This process takes 11 mins for my log file with ~6k lines, which is too long.
Any better ways to remove the backspace?
You could always write a simple pipeline command to implement the backspace stripping, something like this:
#include <stdio.h>
#include <stdlib.h>

#define BUFFERSIZE 10240

int main(int argc, char* argv[])
{
    int c;
    int buf[BUFFERSIZE];
    int pos = 0;
    while ((c = getchar()) != EOF)
    {
        switch (c)
        {
        case '\b':
        {
            if (pos > 0)
                pos--;
            break;
        }
        case '\n':
        {
            int i;
            for (i = 0; i < pos; ++i)
                putchar(buf[i]);
            putchar('\n');
            pos = 0;
            break;
        }
        default:
        {
            buf[pos++] = c;
            break;
        }
        }
    }
    return 0;
}
I've only given the code a minimal test and you may need to adjust the buffer size depending on how big your lines are. It might be an idea to assert that pos is < BUFFERSIZE after pos++ just to be safe!
Alternatively you could maybe implement something similar with the Tcl code that captures the log file in the first place; but without knowing how that works it's a bit hard to say.
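For reference, a typical way to use a filter like this (file names are made up):

cc -O2 strip_backspaces.c -o strip_backspaces
./strip_backspaces < logfile > logfile.clean && mv logfile.clean logfile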
You could try:
sed -i s/[^^H]^H//g
This might or might not work in one go, but should at least be faster than one at a time as you seem to be doing now.
Did you know that “sed” doesn't just do substitutions? The commands of a sed script have to be on separate lines though (or at least they do on the version of sed I've got on this machine).
sed -i bak 's/^[[^^]]*m//g
: again
s/[^^H]^H//g
t again' logfile
The : sets up a label (again in this case) and t branches to a label if any substitutions have been performed (since the start/last branch). Wrapping those round a suitable s gets the substitution applied until it can't any more.
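If the label/branch mechanism is new to you, here is a tiny unrelated example of the same :/t idiom (a toy sketch, written with separate -e expressions so it also works on seds that want each command on its own line): it keeps shrinking a run of x's, one character per pass, until no substitution succeeds.

$ echo xxxxxxxx | sed -e ':again' -e 's/xx/x/' -e 't again'
x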
Just to put it out here, I ended up doing this. It's not a pretty solution and not as flexible as Jackson's answer but does what I need in my particular case. I basically use the inner loop to generate the match string for sed.
# "Applies" up to 10 consecutive backspaces
for i in {10..1}; do
match=""
for j in `seq 1 $i`; do
match=".${match}^H"
done;
# Can't put quotes around s//g or else backspaces are evaluated
sed -i s/${match}//g ${file-to-process}
done;

sed: how to replace CR and/or LF with "\r" "\n", so any file will be in one line

I have files like
aaa
bbb
ccc
I need them to sed into aaa\r\nbbb\r\nccc
It should work for both unix and windows files, replacing the line endings with \n or \r\n accordingly.
The problem is that sed adds \n at the end of line but keeps lines separated. How can I fix it?
These two commands together should do what you want:
sed ':a;N;$!ba;s/\r/\\r/g'
sed ':a;N;$!ba;s/\n/\\n/g'
Pass your input file through both to get the output you want. There's probably a way to combine them into a single expression.
Stolen and Modified from this question:
How can I replace a newline (\n) using sed?
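For completeness, the two can be folded into one invocation along the same lines (a sketch with the same GNU sed caveats; "yourfile" is a placeholder):

sed ':a;N;$!ba;s/\r/\\r/g;s/\n/\\n/g' yourfile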
It's possible to merge lines in sed, but personally, I consider needing to change line breaks a sign that it's time to give up on sed and use a more powerful language instead. What you want is one line of perl:
perl -e 'undef $/; while (<>) { s/\n/\\n/g; s/\r/\\r/g; print $_, "\n" }'
or 12 lines of python:
#! /usr/bin/python
import fileinput
from sys import stdout
first = True
for line in fileinput.input(mode="rb"):
    if fileinput.isfirstline() and not first:
        stdout.write("\n")
    if line.endswith("\r\n"): stdout.write(line[:-2] + "\\r\\n")
    elif line.endswith("\n"): stdout.write(line[:-1] + "\\n")
    elif line.endswith("\r"): stdout.write(line[:-1] + "\\r")
    first = False
if not first: stdout.write("\n")
or 10 lines of C to do the job, but then a whole bunch more because you have to process argv yourself:
#include <stdio.h>

void process_one(FILE *fp)
{
    int c;
    while ((c = getc(fp)) != EOF)
        if (c == '\n') fputs("\\n", stdout);
        else if (c == '\r') fputs("\\r", stdout);
        else putchar(c);
    fclose(fp);
    putchar('\n');
}

int main(int argc, char **argv)
{
    FILE *cur;
    int i, consumed_stdin = 0, rv = 0;
    if (argc == 1) /* no arguments */
    {
        process_one(stdin);
        return 0;
    }
    for (i = 1; i < argc; i++)
    {
        if (argv[i][0] == '-' && argv[i][1] == 0)
        {
            if (consumed_stdin)
            {
                fputs("cannot read stdin twice\n", stderr);
                rv = 1;
                continue;
            }
            cur = stdin;
            consumed_stdin = 1;
        }
        else
        {
            cur = fopen(argv[i], "rb");
            if (!cur)
            {
                perror(argv[i]);
                rv = 1;
                continue;
            }
        }
        process_one(cur);
    }
    return rv;
}
awk '{printf("%s\\r\\n",$0)} END {print ""}' file
tr -s '\r' '\n' <file | unix2dos
EDIT (it's been pointed out that the above misses the point entirely!)
tr -s '\r' '\n' <file | perl -pe 's/\s+$/\\r\\n/'
The tr gets rid of empty lines and dos line endings. The pipe means two processes—good on modern hardware.
