How Does Bash Tokenize Scripts? - bash

Coming from a C++ background, it always seems like magic to me that some whitespace has an effect on the validity or semantics of a script. Here's an example:
echo a 2 > &1
bash: syntax error near unexpected token `&'
echo a 2 >&1
a 2
echo a 2>&1
a
echo a 2>& 1
a
Looking at this didn't help much. My main problem is that the behavior does not feel consistent, and it leaves me in a state of confusion.
I'm trying to find out how bash tokenizes its scripts. A general description thereof to clear up any confusion would be appreciated.
Edit:
I am NOT looking for redirections specifically. They just came up as example. Other examples:
A="something"
A = "something"
if [$x = $y];
if [ $x = $y ];
Why isn't there a space necessary between ] and ;? Why does assignment require an immediate equal sign? ...

2>&1 is a single operator token, so any whitespace that breaks it up will change the meaning of the command. It just happens to be a parameterized token, which means the shell will further tokenize it to determine what exactly the operator does. The general form is n>&m, where n is the file descriptor you are redirecting, and m is the descriptor you are copying to. In this case, you are saying that the standard error (2) of the command should be copied to whatever standard output (1) is currently open on.
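For illustration, here is the n>&m form in action (a sketch; out.log is a made-up file name):
ls /nonexistent > out.log 2>&1   # stdout goes to out.log, then stderr is copied to it too
ls /nonexistent 2>&1 > out.log   # order matters: stderr was copied from the old stdout
                                 # (the terminal) before stdout was redirected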

The examples you gave have the behavior they do for good reason.
Redirection sources default to FD 1. Thus, >&1 is legitimate syntax on its own -- it redirects FD 1 to FD 1 -- meaning that allowing whitespace before the > would make the syntax ambiguous: the parser couldn't tell whether the preceding token was a word in its own right or a redirection source.
Nothing other than an FD number is valid after >&, unless you're on a very new bash which allows a variable to be dereferenced to retrieve an FD number. In any event, anything immediately following >& is known to be a file descriptor, so allowing optional whitespace there creates no ambiguity.
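As a minimal sketch of that newer variable form (assuming bash 4.1 or later; log_fd and out.log are made-up names):
exec {log_fd}>out.log     # bash picks a free descriptor and stores its number in log_fd
echo "hello" >&"$log_fd"  # redirect to the descriptor number held in the variable
exec {log_fd}>&-          # close it again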
a = 1 is parsed as a legitimate command, not a syntax error: It runs the command a with the first argument = and the second argument 1. Disallowing whitespace within assignments eliminates this ambiguity. Similarly, a= foo has a separate and distinct meaning: It exports an environment variable a with an empty value while running the command foo. Relaxing the whitespace rules would disallow both of these legitimate commands.
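A quick side-by-side (a and foo are made-up names; foo must exist as a command for the last line to succeed):
a=1      # assignment: the variable a gets the value 1
a = 1    # runs the command a with the two arguments = and 1
a= foo   # runs the command foo with a exported as an empty variable for its duration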
[ is a command, not special syntax known to the parser; thus, [foo tries to find a command (named, say, /usr/bin/[foo), requiring whitespace.
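You can verify this yourself (output shown is typical of bash on a common Linux system):
type [         # => [ is a shell builtin
ls /usr/bin/[  # most systems also ship it as an external program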
; takes precedence in the parser as a statement separator, rather than being treated as part of a word, unless quoted or escaped. The same is true of & (another separator), or a newline.
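A short illustration (nothing here is special to any particular command):
echo a; echo b   # the unquoted ; separates two commands
echo "a;b"       # quoted, the ; is just a character inside one argument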
The thing is, there's no single general rule which will explain all this; you need to read and learn the language syntax. Fortunately, there's not very much syntax: Almost all commands are "simple commands", which follow very simple and clear rules. You're asking about, and we're explaining, some of the exceptions to that; there are other exceptions, such as [[ ]] in bash, but they're small enough in total that they can be learned.
Other suggested resources:
http://aosabook.org/en/bash.html (The Architecture of Open Source Applications; chapter on bash)
http://mywiki.wooledge.org/BashParser (Wooledge wiki high-level description of the parser -- though this focuses more on expansion rules than tokenization)
http://mywiki.wooledge.org/BashGuide (an introductory guide to bash syntax in general, written with more of a focus on accuracy and best practices than some competing materials).

Related

Why does bash use/need so many input redirect symbols?

I am curious as to the nature and purpose of using multiple "<" characters to satisfy certain bash redirections. When is each of the <, <<, <<< syntaxes correct/preferred? And under what conditions? Shouldn't a single "<" be sufficient for a properly written command, function, or subroutine? In Unix 'everything' is a file, so why mask this with process substitution? Isn't that already just a mask for the natural (grouping) capability of any shell? Or is it in some cases just a matter of proper order of execution?
Efficiency and performance always have trade-offs, as do readability, writability, and ease of use. I'm an old dog trying to learn new tricks. Ten lines of code I understand, performing the same task as one line of code I do not understand, is worth the trade-off to me. In my years of scripting, I have had very few situations that required writing to non-volatile storage, unless it was intended to be left there "permanently".
I have not seen such reference for output. A single ">" will create/overwrite a file. A double ">>" will create/append a file. Is there a ">>>" for output too? This is a redundant question. I am only interested in the input redirect.
In simple words, they all have different meanings.
< Redirection of input
<< Here Document
<<< Here String (variant of here document)
Examples
< Redirection of input
grep foo < a-file.txt
This redirects the contents of a-file.txt to grep's standard input. grep searches for occurrences of string 'foo' in file a-file.txt.
<< Here Document
grep foo <<EOF
foo
foobar
baz
bar
EOF
Notice the EOF right after << and in the last line. From man bash:
This type of redirection instructs the shell to read input from the current source until a line containing only delimiter (with no trailing blanks) is seen.
So effectively, grep gets the string enclosed by the two EOFs as input.
<<< Here String (variant of here document)
grep foo <<<"foobar"
You could see this as a "single line" here document (<<). grep gets the string "foobar" as input.
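To see all three side by side feeding the same command (assuming a-file.txt contains a line with foo in it):
grep foo < a-file.txt   # input comes from a file
grep foo <<EOF          # input comes from the script itself, up to the EOF line
foobar
EOF
grep foo <<<"foobar"    # input is a single string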
Shouldn't a single "<" be sufficient for a properly written command, function, or subroutine?
So, which variant is the correct one to use depends on your use case and is independent of the command you're using, as your shell (most likely bash) will take care of the redirection.
I recommend section 3.6 Redirection of bash's manual for further reading. The sections concerning <, << and <<< are 3.6.1, 3.6.6, 3.6.7: https://www.gnu.org/software/bash/manual/bash.html#Redirections

Does "untyped" mean the same as "dynamically typing"? [duplicate]

This question was closed as a duplicate of:
Does "untyped" also mean "dynamically typed" in the academic CS world?
According to Advanced Bash-Scripting Guide,
bash variables are untyped:
Unlike many other programming languages, Bash does not segregate its variables by "type." Essentially, Bash variables are character
strings, but, depending on context, Bash permits arithmetic
operations and comparisons on variables. The determining factor is
whether the value of a variable contains only digits.
The link also gives examples.
Does "untyped" mean the same as the concept of "dynamically typing"
in programming languages? If not, what are the relations and
differences between the two?
To lighten the burden of keeping track of variable types in a script, Bash does permit declaring variables. For example, declare a variable to be of integer type with declare -i myvariable.
Is this what is called "typed" variables? Does "typed" mean the same as the concept of "static typing"?
Most of this has been well answered here...
Does "untyped" also mean "dynamically typed" in the academic CS world?
by at least two people who are very familiar with the matter. To most of us who have not studied type systems to that level, 'untyped' means dynamically typed, but in academic circles that is a misnomer; see the post above. 'Untyped' actually means there are no types at all, as in assembly. Bash is typed: it figures out its types at runtime. Let's take the following passage from the Advanced Bash Scripting Guide:
http://tldp.org/LDP/abs/html/untyped.html
Unlike many other programming languages, Bash does not segregate its
variables by "type." Essentially, Bash variables are character
strings, but, depending on context, Bash permits arithmetic operations
and comparisons on variables. The determining factor is whether the
value of a variable contains only digits.
Bash figures out that something is a number at runtime, i.e. it is dynamically typed.
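A small illustration of that context-dependence (the variable name is made up):
x=5
echo $(( x + 1 ))   # 6: an arithmetic context treats the value as a number
x=hello
echo "${x} world"   # hello world: a string context treats it as text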
In assembler on a 64-bit machine, I can store any 8 bytes in a register and decrement it. The processor doesn't check whether those bytes were characters or anything else; there is no context about the thing it's about to decrement. It just decrements the 64 bits, without checking or inferring anything about the type of the value.
Perl is not an untyped language, but the following code might make it seem as if it treats everything as an integer:
#!/usr/bin/perl
use strict;
use warnings;
my $foo = "1";
my $bar = $foo + 1;
print("$bar\n");
$foo was assigned a string but was incremented? Does this mean Perl is untyped because, based on context, it does what you want it to do? I don't think so.
This differs from Python, which will actually give you an error if you try the same thing:
Traceback (most recent call last):
File "py.py", line 2, in <module>
bar = foo + 1
If Python is dynamically typed and Perl is dynamically typed, why do we see different behavior? Is it because their type systems differ, or because their type conversion semantics differ? In assembly, do we have type conversion instructions that change a string to an integer or vice versa?
Bash has different type conversion rules:
#!/bin/bash
set -e
MYVAR=WTF
let "MYVAR+=1"
echo "MYVAR == $MYVAR";
This will assign 1 to MYVAR instead of producing an error: in an arithmetic context, bash treats the string value WTF as a variable name to look up, an unset name evaluates to integer zero, and the increment then yields 1. Bash is performing type conversion, which means it is typed.
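To see that bash really resolves the string as a variable name rather than blindly treating it as zero, try this variation (WTF and MYVAR are of course made-up names):
WTF=41
MYVAR=WTF
let "MYVAR+=1"
echo "$MYVAR"   # 42: WTF was looked up as a variable inside the arithmetic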
For anyone still believing that Bash is untyped, try this...
#!/bin/bash
declare -i var1=1
var1=2367.1
You should get something like this...
foo.sh: line 3: 2367.1: syntax error: invalid arithmetic operator (error token is ".1")
But the following shows no such error
#!/bin/bash
var1=2367.1
The output of the following
#!/bin/bash
var1=2367.1
echo "$var1"
let "var1+=1"
echo "$var1"
is the same error, even without declaring a type:
2367.1
foo.sh: line 4: let: 2367.1: syntax error: invalid arithmetic operator (error token is ".1")
2367.1
A much better example is this
#!/bin/bash
arg1=1234
arg2=abc
if [ $arg1 -eq $arg2 ]; then
echo "wtf";
fi
Why do I get this...
foo.sh: line 5: [: abc: integer expression expected
Bash is asking me for an integer expression.
Bash is dynamically typed, or more correctly, a dynamically checked language. I've already added a long answer; this is the short one.
#!/bin/bash
arg1=1234
arg2=abc
if [ $arg1 -eq $arg2 ]; then
echo "wtf";
fi
gives this error message....
foo.sh: line 5: [: abc: integer expression expected
The fact that I get an error telling me I have in some way made a mistake with regard to types means that something is checking types.

How does : <<'END' work in bash to create a multi-line comment block?

I found a great answer for how to comment in bash script (by #sunny256):
#!/bin/bash
echo before comment
: <<'END'
bla bla
blurfl
END
echo after comment
The single quotes around the END delimiter are important; otherwise, things inside the block such as $(command) would be parsed and executed.
This may be ugly, but it works, and I'm keen to know what it means. Can anybody explain it simply? I did already find an explanation that : is a no-op, a synonym for true. But it still doesn't make sense to me why one would call a no-op or true here anyway...
I'm afraid this explanation is less "simple" and more "thorough", but here we go.
The goal of a comment is to be text that is not interpreted or executed as code.
Originally, the UNIX shell did not have a comment syntax per se. It did, however, have the null command : (once an actual binary program on disk, /bin/:), which ignores its arguments and does nothing but indicate successful execution to the calling shell. Effectively, it's a synonym for true that looks like punctuation instead of a word, so you could put a line like this in your script:
: This is a comment
It's not quite a traditional comment; it's still an actual command that the shell executes. But since the command doesn't do anything, surely it's close enough: mission accomplished! Right?
The problem is that the line is still treated as a command beyond simply being run as one. Most importantly, lexical analysis - parameter substitution, word splitting, and such - still takes place on those destined-to-be-ignored arguments. Such processing means you run the risk of a syntax error in a "comment" crashing your whole script:
: Now let's see what happens next
echo "Hello, world!"
#=> hello.sh: line 1: unexpected EOF while looking for matching `''
That problem led to the introduction of a genuine comment syntax: the now-familiar # (which was first introduced in the C shell created at BSD). Everything from # to the end of the line is completely ignored by the shell, so you can put anything you like there without worrying about syntactic validity:
# Now let's see what happens next
echo "Hello, world!"
#=> Hello, world!
And that's How The Shell Got Its Comment Syntax.
However, you were looking for a multi-line (block) comment, of the sort introduced by /* (and terminated by */) in C or Java. Unfortunately, the shell simply does not have such a syntax. The normal way to comment out a block of consecutive lines - and the one I recommend - is simply to put a # in front of each one. But that is admittedly not a particularly "multi-line" approach.
Since the shell supports multi-line string-literals, you could just use : with such a string as an argument:
: 'So
this is all
a "comment"
'
But that has all the same problems as single-line :. You could also use backslashes at the end of each line to build a long command line with multiple arguments instead of one long string, but that's even more annoying than putting a # at the front, and more fragile since trailing whitespace breaks the line-continuation.
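For completeness, the backslash variant mentioned above looks like this (remember that any trailing whitespace after a backslash silently breaks the continuation):
: this comment \
  continues across \
  several lines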
The solution you found uses what is called a here-document. The syntax some-command <<whatever causes the following lines of text - from the line immediately after the command, up to but not including the next line containing only the text whatever - to be read and fed as standard input to some-command. Here's an alternate shell implementation of "Hello, world" which takes advantage of this feature:
cat <<EOF
Hello, world
EOF
If you replace cat with our old friend :, you'll find that it ignores not only its arguments but also its input: you can feed whatever you want to it, and it will still do nothing (and still indicate that it did that nothing successfully).
However, the contents of a here-document do undergo string processing. So just as with the single-line : comment, the here-document version runs the risk of syntax errors inside what is not meant to be executable code:
#!/bin/sh -e
: <<EOF
(This is a backtick: `)
EOF
echo 'In modern shells, $(...) is preferred over backticks.'
#=> ./demo.sh: line 2: bad substitution: no closing "`" in `
The solution, as seen in the code you found, is to quote the end-of-document "sentinel" (the EOF or END or whatever) on the line introducing the here document (e.g. <<'EOF'). Doing this causes the entire body of the here-document to be treated as literal text - no parameter expansion or other processing occurs. Instead, the text is fed to the command unchanged, just as if it were being read from a file. So, other than a line consisting of nothing but the sentinel, the here-document can contain any characters at all:
#!/bin/sh -e
: <<'EOF'
(This is a backtick: `)
EOF
echo 'In modern shells, $(...) is preferred over backticks.'
#=> In modern shells, $(...) is preferred over backticks.
(It is worth noting that the way you quote the sentinel doesn't matter - you can use <<'EOF', <<E"OF", or even <<EO\F; all have the same result. This is different from the way here-documents work in some other languages, such as Perl and Ruby, where the content is treated differently depending on the way the sentinel is quoted.)
Notwithstanding any of the above, I strongly recommend that you instead just put a # at the front of each line you want to comment out. Any decent code editor will make that operation easy - even plain old vi - and the benefit is that nobody reading your code will have to spend energy figuring out what's going on with something that is, after all, intended to be documentation for their benefit.
It is called a Here Document: a block of text that the shell feeds as standard input to another command or program.
The string following << is the marker that determines the end of the block. If you send that input to the no-op command :, nothing happens, which is why the construct can be used as a comment block.
That's heredoc syntax. It's a way of defining multi-line string literals.
As the answer at your link explains, the single quotes around END disable interpolation, similar to the way single-quoted strings disable interpolation in regular bash strings.
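A quick demonstration of the difference (the date call is just an example):
cat <<EOF    # unquoted sentinel: the body is expanded first
Today is $(date)
EOF
cat <<'EOF'  # quoted sentinel: the body is passed through literally
Today is $(date)
EOF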

multiple replacements on a single variable

For the following variable:
var="/path/to/my/document-001_extra.txt"
i need only the parts between the / [slash] and the _ [underscore].
Also, the - [dash] needs to be stripped.
In other words: document 001
This is what I have so far:
var="${var##*/}"
var="${var%_*}"
var="${var/-/ }"
which works fine, but I'm looking for a more compact substitution pattern that would spare me the triple var=...
Use of sed, awk, cut, etc. would perhaps make more sense for this, but I'm looking for a pure bash solution.
Needs to work under GNU bash, version 3.2.51(1)-release
After editing your question to talk about patterns instead of regular expressions, I'll now show you how to actually use regular expressions in bash :)
[[ $var =~ ^.*/(.*)-(.*)_ ]] && var="${BASH_REMATCH[@]:1:2}"
Parameter expansions like you were using previously unfortunately cannot be nested in bash (unless you use ill-advised eval hacks, and even then it will be less clear than the line above).
The =~ operator performs a match between the string on the left and the regular expression on the right. Parentheses in the regular expression define match groups. If a match is successful, the exit status of [[ ... ]] is zero, and so the code following the && is executed. (Reminder: don't confuse the "0=success, non-zero=failure" convention of process exit statuses with the common Boolean convention of "0=false, 1=true".)
BASH_REMATCH is an array parameter that bash sets following a successful regular-expression match. The first element of the array contains the full text matched by the regular expression; each of the following elements contains the contents of the corresponding capture group.
The ${foo[@]:x:y} parameter expansion produces y elements of the array, starting with index x. In this case, it's just a short way of writing ${BASH_REMATCH[1]} ${BASH_REMATCH[2]}. (Also, while var=${BASH_REMATCH[*]:1:2} would have worked as well, I tend to use @ anyway to reinforce the fact that you almost always want @ instead of * in other contexts.)
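Putting it all together as a runnable sketch:
var="/path/to/my/document-001_extra.txt"
if [[ $var =~ ^.*/(.*)-(.*)_ ]]; then
    echo "${BASH_REMATCH[0]}"     # /path/to/my/document-001_ (full match)
    echo "${BASH_REMATCH[1]}"     # document (first group)
    echo "${BASH_REMATCH[2]}"     # 001 (second group)
    var="${BASH_REMATCH[@]:1:2}"  # document 001
fi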
Both of the following should work correctly. Though the second is sensitive to misplaced characters (if you have a / or - after the last _ it will fail).
var=$(IFS=_ read s _ <<<"$var"; IFS=-; echo ${s##*/})
var=$(IFS=/-_; a=($var); echo "${a[@]:${#a[@]} - 3:2}")

Bash command groups: Why do curly braces require a semicolon?

I know the difference in purpose between parentheses () and curly braces {} when grouping commands in bash.
But why does the curly brace construct require a semicolon after the last command, whereas for the parentheses construct, the semicolon is optional?
$ while false; do ( echo "Hello"; echo "Goodbye"; ); done
$ while false; do ( echo "Hello"; echo "Goodbye" ); done
$ while false; do { echo "Hello"; echo "Goodbye"; }; done
$ while false; do { echo "Hello"; echo "Goodbye" }; done
bash: syntax error near unexpected token `done'
$
I'm looking for some insight as to why this is the case. I'm not looking for answers such as "because the documentation says so" or "because it was designed that way". I'd like to know why it was designed this way. Or maybe whether it is just a historical artifact?
This may be observed in at least the following versions of bash:
GNU bash, version 3.00.15(1)-release (x86_64-redhat-linux-gnu)
GNU bash, version 3.2.48(1)-release (x86_64-apple-darwin12)
GNU bash, version 4.2.25(1)-release (x86_64-pc-linux-gnu)
Because { and } are only recognized as special syntax if they are the first word in a command.
There are two important points here, both of which are found in the definitions section of the bash manual. First, is the list of metacharacters:
metacharacter
A character that, when unquoted, separates words. A metacharacter is a blank or one of the following characters: ‘|’, ‘&’, ‘;’, ‘(’, ‘)’, ‘<’, or ‘>’.
That list includes parentheses but not braces (neither curly nor square). Note that it is not a complete list of characters with special meaning to the shell, but it is a complete list of characters which separate tokens. So { and } do not separate tokens, and will only be considered tokens themselves if they are adjacent to a metacharacter, such as a space or a semi-colon.
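A two-line illustration of the difference (b here is deliberately not a real command):
echo a;b    # ; separates tokens: bash runs echo a, then tries to run a command named b
echo a{b    # { does not separate: a{b is a single word passed to echo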
Although braces are not metacharacters, they are treated specially by the shell in parameter expansion (eg. ${foo}) and brace expansion (eg. foo.{c,h}). Other than that, they are just normal characters. There is no problem with naming a file {ab}, for example, or }{, since those words do not conform to the syntax of either parameter expansion (which requires a $ before the {) or brace expansion (which requires at least one comma between { and }). For that matter, you could use { or } as a filename without ever having to quote the symbols. Similarly, you can call a file if, done or time without having to think about quoting the name.
These latter tokens are "reserved words":
reserved word
A word that has a special meaning to the shell. Most reserved words introduce shell flow control constructs, such as for and while.
The bash manual doesn't contain a complete list of reserved words, which is unfortunate, but they certainly include the Posix-designated:
! { }
case do done elif else
esac fi for if in
then until while
as well as the extensions implemented by bash (and some other shells):
[[ ]]
function select time
These words are not the same as built-ins (such as [), because they are actually part of the shell syntax. The built-ins could be implemented as functions or shell scripts, but reserved words cannot because they change the way that the shell parses the command line.
There is one very important feature of reserved words, which is not actually highlighted in the bash manual but is made very explicit in Posix (from which the above lists of reserved words were taken, except for time):
This recognition [as a reserved word] shall only occur when none of the characters is quoted and when the word is used as:
The first word of a command …
(The full list of places where reserved words are recognized is slightly longer, but the above is a pretty good summary.) In other words, reserved words are only reserved when they are the first word of a command. And, since { and } are reserved words, they are only special syntax when they appear as the first word of a command.
Example:
ls } # } is not a reserved word. It is an argument to `ls`
ls;} # } is a reserved word; `ls` has no arguments
There is lots more I could write about shell parsing, and bash parsing in particular, but it would rapidly get tedious. (For example, the rule about when # starts a comment and when it is just an ordinary character.) The approximate summary is: "don't try this at home"; really, the only thing which can parse shell commands is a shell. And don't try to make sense of it: it's just a random collection of arbitrary choices and historical anomalies, many but not all based on the need to not break ancient shell scripts with new features.
