I've been working on a small assembler which uses flex; however, the flex rule list is getting reasonably long. Ideally, I'd like to solve this by splitting the rules into several files which can be included into the primary lex file. My searching has turned up nothing of relevance, which leads me to believe this functionality may not exist. If it doesn't, I'd be curious whether anybody has alternate suggestions. My only current alternative is to write a quick tool which preprocesses the lex file and builds a new one. This isn't the prettiest solution, but I suppose it does work.
So this question boils down to two questions:
Is there a way to include additional rules with flex?
If not, what alternatives would you suggest (if they differ from the one I already proposed)?
I am not strictly against moving to a different lexer if there is a compelling reason to do so. However, in that case the lexer needs to be able to generate C or C++; either can be merged into this project with ease. I do consider this option a last resort, though.
flex certainly doesn't include any functionality similar to the C preprocessor's #include directive.
Also, using the C preprocessor to preprocess scanner definitions would be a bit awkward, because scanner descriptions commonly include preprocessor commands to be transcribed into the output file, and the C preprocessor doesn't have any mechanism to conditionally retain such directives.
However, there is nothing stopping you from simply concatenating several files to produce a scanner definition:
flex -o scanner.c scanner.options scanner.definitions keyword_rules.l other_rules.l
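If your flex version doesn't accept multiple input files on the command line, the same effect can be had by concatenating them yourself and letting flex read standard input (the file names here are just placeholders):

cat scanner.options scanner.definitions keyword_rules.l other_rules.l | flex -o scanner.c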
So I've been playing with various solutions for a while now and finally have one I am particularly happy with. I ended up using bash to quickly implement a "flex_include" script.
Before I explain the syntax additions here is the script I came up with:
#!/bin/bash
# Expand lines of the form:  <<INCLUDE>> "file"  by splicing the named file in place.
# IFS= keeps leading whitespace intact, which matters inside flex rules and actions.
while IFS= read -r line
do
    if [[ $line =~ "<<INCLUDE>>" ]]; then
        # the file name is whatever sits between the first pair of double quotes
        file=$(echo "$line" | cut -d'"' -f2)
        while IFS= read -r line2
        do
            echo "$line2"
        done < "$file"
    else
        echo "$line"
    fi
done < "$1"
This allows for including files with the syntax <<INCLUDE>> "my_file.l" within your lex file. I chose a naming convention similar to that of <<EOF>>, so it fits into the flex syntax reasonably well. Usage of the script is reasonably simple, but there is a caveat -- pipes don't work directly. I don't know why, but two extra lines would be generated at the top of lex.yy.c. I found that process substitution worked just fine, though: flex <(./flex_include.sh mips.l). This script is very forgiving in syntax, so keep in mind that it will accept more than it should: it looks for a line containing <<INCLUDE>> and then takes the quoted string on that line; everything else on that line is ignored and discarded.
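For illustration, if mips.l contained a line such as <<INCLUDE>> "registers.l" (registers.l being a hypothetical rule file), the build could look like either of the following:

./flex_include.sh mips.l > mips_expanded.l    # expand the includes into a temporary file...
flex mips_expanded.l
flex <(./flex_include.sh mips.l)              # ...or do it in one step via process substitution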
For context, I'm trying to create an overly simplified version of bash: not a full script interpreter, just a small interpreter for a series of commands and operators (|, ||, &&, <, >, <<, >>, $, $?). The mental model I used, in a nutshell, is:
Lexer + Expander: in the first stage I used a simple state machine to lex the input into tokens and store data (commands, arguments, redirection files, etc.); I also expand environment variables and handle lexical errors (which is as simple as checking finite states of valid characters).
Parser: in the second stage I intend to create an AST out of the tokens + data, and handle parsing errors.
Executor: finally, I'll execute the AST.
Now I'm at the parser stage, and I'm trying to think about how I might handle parsing errors. The thought I had is that, out of the whole range of possible statements, it seems very difficult to check the validity of an input because the range is too big, or at least that's what I think. I'm sure there's some generalized solution for the problem. Why am I sure? Because bash has done it.
For example, this statement:
$ < $FILE || && > outfile
From the lexer's point of view it's all bright and shiny, but it's surely not valid input from the parser's perspective. Now, one possible solution is to check whether there's a command token in the input; if not, it's invalid. But what about this one:
$ || ls > $FILE && cat < $FILE
Again, all valid lexemes, but an unparsable statement. Maybe that too could be checked with a rule like "if the line starts with an OR or AND token, error."
Now the specific question is: how exactly does bash parse these combinations of commands and operators? Either there's some sort of more generalized solution, or I'm left with if/else error checking against inputs that I think are invalid, which honestly seems stupid and cumbersome.
Most of the complexity of shell parsing is in the tokenisation, although you certainly don't need to worry about all of the complications which have crept in over the years. The grammar itself is pretty simple; it's designed to be parsed by a parser generated with a tool like Bison (or some other yacc derivative), and that's precisely how Bash works.
The various syntactic rules recognised by Bash are scattered throughout the Bash manual, but the grammar is based on the standard shell grammar specified in the Posix standard, which is probably an easier starting point. In that document, the grammar is included as what is basically a Yacc input file (without any of the semantic actions necessary for an actual implementation); you can find it at the end of section 2.10. Make sure to read the initial part of that section, though, because it contains important information about how tokens are classified. Also, take note of section 2.3, token recognition.
Between these two sections you'll find a precise description of shell quoting rules and the various expansions which are done prior to parsing (or, better said, intermingled with parsing because command substitution makes the whole process recursive.) You might not want to absorb all of that on a first reading, although it will also help you be more effective in your use of the shell.
Bash implements a lot more features, but probably most or all of them go beyond your needs.
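A quick way to see that grammar at work is to ask bash itself to parse, but not execute, the statements from the question; both are rejected by the parser even though every token is individually fine:

bash -n -c '< $FILE || && > outfile'        # bash reports a syntax error near the unexpected token `&&'
bash -n -c '|| ls > $FILE && cat < $FILE'   # syntax error near the unexpected token `||'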
#choroba has the right idea - to understand exactly how Bash parses scripts you need to look at the source of Bash. There are basically fractal rules of thumb for how Bash works in increasingly complex cases, and any description short enough to fit in a SO response is probably not detailed enough to give you the full picture.
According to the Bash Reference Manual, the Bash scripting language is constituted of 4 distinct subclasses of syntactic elements:
built-in commands (alias, cd)
reserved words (if, function)
parameters and variables ($, IFS)
functions (abort, end-of-file - activated with keybindings such as Ctrl-d)
Apart from reading the manual, I became genuinely curious whether there is a programmatic way to list or generate all such keywords, at least from one of the above categories. I think this could be useful in some contexts. Sometimes I wish I could see all the options available to me for what I can write at any given moment, and having that information as data, instead of a formatted manual, is convenient, focused, and can be edited, in case you want to strike out commands you know well, or ones that are too obscure for now.
My understanding is that Bash takes input on stdin and passes it to the running shell process. When code is distributed in a production-ready form, it is compiled, so it runs faster. Unlike with a Python REPL, you don’t have access to the Bash source code from within Bash, so writing a program that searches through source files for the various defined commands is not a very direct route. I mean that if you wanted to list all functions, Python has the dir() function, which programmatically looks for function names in the namespace; I don’t think Bash can do that. I don't think it has a special syntax in its source files that makes it easy to find and identify all the keywords. Instead, they will be found if you simply enter them - like cd will “find” the program cd because $PATH returns the path to that command - but there’s no special way to discover them.
Or am I wrong? Technically, you could run a “brute force” search by generating every combination of symbols of every length and record when you did not get “error: unknown command” as a response.
Is there any other clever programmatic way to do this?
> I mean I want to see a list of every symbol or string that the bash compiler
Bash is not a compiler. It and every other shell I know are interpreters of various languages.
> recognises and knows what to do with, including commands like “ls” or just a symbol like “*”. I also want to see the inputs and outputs for each symbol, i.e., some commands are executed in the shell prompt by themselves, but what data type do they return?
All commands executed by the shell have an exit status, which is a number between 0 and 255. This is as close to a "return type" as you get. Many of them also produce idiosyncratic output to one or two streams (a standard output stream and a standard error stream) under some conditions, and many have other effects on the shell environment or operating environment.
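For instance, following grep's documented exit-status convention:

grep -q root /etc/passwd
echo $?    # 0 if a matching line was found, 1 if not, greater than 1 on error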
> And some require a certain data type to standard input.
I can't think of a built-in utility whose expected input is well characterized as having a particular data type. That's not really a stream-oriented concept.
> I want to do this just as a rigorous way to study the language.
If you want to rigorously study the language, then you should study its manual, where everything you describe has already been compiled. You might also want to study the POSIX shell command language manual for a slightly different perspective, which is more thorough in some areas, though what it documents differs in a few details from Bash's default behavior.
If you want to compile your own summary of Bash syntax and behavior, then those are the best source materials for such an effort.
You can get a list of all reserved words and syntactic elements of bash using this trick:
help -s '*' | cut -d: -f1
Or more accurately:
help -s \* | awk -F ': ' 'NR>2&&!/variables/{print $1}'
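If your bash has the compgen builtin, it can also enumerate most of these categories directly as plain data (option letters per the bash manual; the output naturally depends on your current shell session):

compgen -b              # builtin commands
compgen -k              # reserved words (keywords)
compgen -a              # aliases defined in this shell
compgen -A function     # names of currently loaded shell functions
compgen -v              # variable names
compgen -c | sort -u    # every command name resolvable right now (a long list)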
I find makefiles very useful, and the header of each recipe
<target> : [dependencies]
is helpful. Within a recipe, the prefixes @ and - are useful, as are the automatically-defined variables like $@ and $?. However, besides that, I find the way of coding the actual recipe to be strange and unhelpful. There are so many questions on StackOverflow along the lines of "how to do this in a makefile" for something that's simple (or at least more familiar) to do in bash.
Is there a reason why the recipe contents are not just interpreted as a regular shell script? Reading the manual pages, there seem to be many tools with functionality equivalent to a shell script but with different syntax. I end up specifying .ONESHELL and escaping $ with $$, or sometimes just calling a script from the recipe when I can't figure out how to make it work in a makefile. My question is whether this is just unfortunate design, or whether there are important features of makefiles that force them to be designed this way?
I don't really know how to answer your question. Probably that means it's not really appropriate for StackOverflow.
The requirement for using $$ instead of $ is obvious. The reasoning for using a separate shell for each logical line of a makefile instead of passing the entire recipe to a single shell, is less clear. It could have worked either way, and this is the way it was chosen.
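Conceptually, a two-line recipe without .ONESHELL behaves roughly as if each line were handed to its own shell, which is easy to mimic by hand:

/bin/sh -c 'VAR=hello'              # first recipe line: VAR exists only inside this shell
/bin/sh -c 'echo "VAR is: $VAR"'    # second recipe line runs in a fresh shell, so VAR is empty

(In an actual recipe the second line would be written with $$VAR so that make passes a literal $ through to the shell.)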
There is one advantage to the way it works now, although maybe most people don't care about it: you only have to indent the first recipe line with TAB, if you use backslash newline to continue each line. If you don't use backslash newline, then every line has to be indented with TAB else you don't know where the recipe ends.
If your question is, could Stuart Feldman have made very different syntax decisions that would have made it easier to write long/complex recipes in makefiles, then sure. Choosing a more obscure character than $ as a variable introducer would reduce the amount of escaping (although, shell scripting uses pretty much every special character somewhere so "reduce" is the best you can do). Choosing an explicit "start/stop" character sequence for recipes would make it simpler to write long recipes, possibly at the expense of some readability.
But that's not how it was done.
In the Z shell there's a handy command that returns a list of all available functions. The command is, conveniently, called functions. I cannot find a similar alternative in Bash. I threw together a quick & dirty (and wholly unacceptable) function to approximately do the same thing, but it has at least one glaring problem: since it relies on parsing files you must either list all the files to look in (which may become stale) or give an expression (which is guaranteed to give files you don't want to look in, such as .bash_history).
Here's the function, since I know someone will ask for it if I don't post it, but I'm pretty sure it's a dead end, or at least the wrong approach.
functions() {
    grep "^function " "$HOME/."{bashrc,bash_profile,aliases,functions,projects,variables} | sort | sed -e 's/{//' | uniq
}
I could improve on this wrong-headed approach by parsing .bash_profile and getting a list of all sourced files and then parsing them for functions, but by the time you add the following complications into the mix, it's really not worth it:
You can source files with . or source.
I also happen to use a function to source files, which checks for the file's existence first.
You could easily source after && or ;: it's not necessarily the first or only thing on a line.
You have to account for the fact that functions don't necessarily have the keyword function before them.
You can omit the () after the function name.
There are probably other complicating factors I haven't thought of.
Fundamentally this is wrong because it is parsing files rather than reporting what is loaded in memory.
Is there any reasonable way to do this—get a list of all functions loaded in memory—in Bash? It seems like an enormous omission, if not.
(And for those looking for duplicate questions, this one is very different, as it's asking for a way to list only those functions that come from a specific file.)
Use typeset -f in bash. In zsh, functions is just a synonym for the same command.
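If you only want the names rather than the full bodies, declare -F (capital F) prints one line per loaded function; a small sketch (my_helper is a placeholder name):

declare -F                       # prints "declare -f my_helper" and so on, one line per loaded function
declare -F | awk '{print $3}'    # just the function names
typeset -f my_helper             # the full body of a single loaded function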
I'm looking for a command-line code formatter that can be used for bash code. It must be configurable and preferably usable from the command line.
I have a big project in bash, which I need to use Q in mind for. So far I am happy with a program written in python by Paul Lutus (a remake of his previous version in Ruby).
See http://arachnoid.com/python/beautify_bash_program.html (also cloned here https://github.com/ewiger/beautify_bash).
but I would like to learn of any serious alternative to this tool, if one exists. Requirements: it should offer robust enough performance and behavior when treating/parsing rather complicated code.
PS I believe full parsing of bash code is generally complicated because there exists no official language grammar (but please correct me if I am wrong about it).
You could give shfmt a try. It implements its own shell parser including Bash support, so it's more robust than plaintext-based tools.
And both the parser and printer are available as Go packages, so it should be easy to write a 20-line Go program to manipulate or play with shell code.
Please note that I'm the author, so the advice may be a bit biased :)
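For reference, typical invocations look something like the following (flag names per shfmt's documentation; double-check them against your installed version):

shfmt -l -w script.sh    # rewrite script.sh in place, listing it if anything changed
shfmt -i 4 < script.sh   # print a copy reformatted with 4-space indentation to stdout
shfmt -d .               # show diffs for every shell file found under the current directory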
You can script vim to run "gg=G", which means "re-indent the whole file".
I discovered that the type builtin will print functions in a formatted manner.
#!/usr/bin/env bash
# Wrap stdin in a dummy function, load it, then let the type builtin pretty-print it.
source <(cat <(echo 'wrapper() {') - <(echo '}'))
# Drop type's header lines and the closing brace, then peel off one level of indentation.
type wrapper | tail -n +4 | head -n -1 | sed 's/^ //g'
https://github.com/bas080/flush
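Hypothetical usage, assuming the snippet above is saved as pretty.sh and made executable (the file name is mine, not the author's):

echo 'if [ -f /etc/passwd ];then echo found;fi' | ./pretty.sh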
On the contrary, the shell does have a rigorous grammar.
It is described both in English, in the ISO standard and in the documentation for Bash and other shells, and in formal terms in the parse.y file in the Bash source tree.
What makes it "hard" is that, where one normally thinks of, say, a quoted string as a single lexical token, in the shell every metacharacter is a separate lexical token, so the meaning of a character can change depending on its grammatical context.
So the parsing tokens do not align with the "shell words" that a user thinks of, and a simple quoted string is at least 3 tokens.
The implementations typically take some shortcuts, using multiple lexical analysers chosen according to whether the grammar is inside quotes, inside a numeric context, or outside both.
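The word/token mismatch is visible even from the user's side: the argument below is a single shell word as far as the user is concerned, yet the lexer sees several differently-quoted pieces inside it:

printf '<%s>\n' foo"bar"'baz'    # prints <foobarbaz>: one word assembled from three quoting styles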