How does Bash parse multi-flag commands?

I'm trying to create an overly simplified version of bash, and I've tried splitting the program into "lexer + expander, parser, executor".
In the lexer I store my data (commands, flags, files) and create tokens out of them. My procedure is simply to loop through the given input character by character and use a state machine to handle states; a state is either a special character, an alphanumeric character, or a space.
When I'm in an alphanumeric state, I'm at a command. I know where the next flag starts when I reach the alphanumeric state again or when input[i] == '-'. The problem is with multi-flag commands.
For example:
$ ls -la | grep "*.c"
I successfully get the commands ls and grep, and the flags -la and *.c.
However, with multi-flag commands like:
$ sed -i "*.bak" "s/a/b/g" file1 file2
It seems very difficult to me, and I can't figure out how to know where the flags of a specific command end. So my question is: how does bash parse these multi-flag commands? Any suggestions regarding my problem would be appreciated!

The shell does not attempt to parse command arguments; that's the responsibility of the utility. The range of possible command argument syntaxes, both in use and potentially useful, is far too great to attempt that.
On Unix-like systems, the shell identifies individual arguments from the command line, mostly by splitting at whitespace but also taking into account the use of quotes and a variety of other transformations, such as "glob expansion". It then makes a vector of these arguments ("argv") and passes the vector to execve, which hands them to the newly created process.
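To make the argv idea concrete, here is a minimal sketch in Python rather than in the asker's implementation language. It uses the standard shlex module, which only approximates POSIX word splitting and quote removal; it performs no glob expansion, variable expansion, or pipeline handling. The helper name to_argv is made up for this example.

```python
import shlex

def to_argv(command_line):
    """Split one simple command into an argv-style list, honouring quotes.

    Nothing here decides which words are "flags": every word is just an
    argument, and interpreting them is left to the program being run.
    """
    return shlex.split(command_line)

argv = to_argv('sed -i "*.bak" "s/a/b/g" file1 file2')
print(argv)  # ['sed', '-i', '*.bak', 's/a/b/g', 'file1', 'file2']
```

The first element is the program to run and the remaining elements are handed over untouched, so the lexer only needs a single "word" token type plus the operators (|, <, > and so on).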
On Windows systems, the shell doesn't even do that. It just hands over the command-line as a string, and leaves it to the command-line tool to do everything. (In order to provide a modicum of compatibility, there's an intermediate layer which is called by the application initialization code, which eventually calls main(). This does some basic argument-splitting, although its quoting algorithm is quite a bit simplified from that used by a Unix shell.)
No command-line shell that I know of attempts to identify command-line flags. And neither should you.
For a bit of extracurricular reading, here's the description of shell parsing from the Posix standard: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html. Trying to implement all that goes far beyond the requirements given to you for this assignment, and I'm certainly not recommending that you do that. But it might still be interesting, and understanding it will help you immensely if you start using a shell.
Alternatively, you could try reading the Bash manual, which might be easier to understand. Note that Bash implements a lot of extensions to the Posix standard.

Related

How to more accurately detect whether a file is a Ruby script on Linux

I know that checking for the .rb extension works.
I know the Linux file command also works in some cases.
But neither is accurate enough to decide whether a file is a Ruby script.
EDIT:
What I want to do is count the lines of Ruby I wrote more accurately,
with a shell one-liner like the following:
find . -name '*.rb' | xargs -n100 cat | grep -v '^\s*#' | wc -l
But in fact I also wrote some executable Ruby scripts and other files, e.g.
.rake, Gemfile, Capfile, jbuilder, etc.
Thanks
Use ruby -c. From the man page:
-c Causes Ruby to check the syntax of the script and exit
This will tell you if the file is a valid Ruby script without executing it. If it is, it will print "Syntax OK" to STDOUT and exit with status code 0; otherwise it will print a syntax error to STDERR and exit with a nonzero code. (You can of course suppress the messages using I/O redirection, e.g. &>/dev/null.)
Of course, false positives are possible (the fact that a file is valid Ruby doesn't necessarily mean it was intended to be a Ruby script), but unlikely except with very short files.
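As a rough sketch of how the exit status of ruby -c could be used from a script, the Python helper below runs a checker command on a file and reports whether it exited with status 0. The function name passes_syntax_check and its checker parameter are invented for this example; it assumes a ruby executable is on the PATH and raises FileNotFoundError otherwise.

```python
import subprocess

def passes_syntax_check(path, checker=("ruby", "-c")):
    """Return True if running `checker path` exits with status 0.

    With the default checker this asks Ruby to parse the file without
    executing it. The messages on stdout and stderr are discarded.
    """
    result = subprocess.run(
        [*checker, path],
        stdout=subprocess.DEVNULL,  # hides "Syntax OK"
        stderr=subprocess.DEVNULL,  # hides syntax error messages
    )
    return result.returncode == 0
```

A True result is still subject to the false positives described above: it only says that the file parses as Ruby.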
What you want is impossible. For example, the following is a valid, and semantically identical program in at least Ruby, PHP, Scala, and Perl:
print("Hello");
It is also valid in Python, although semantically slightly different: it prints a newline (i.e. it prints the string "Hello\n") while the others don't (the others print "Hello" without a newline).
It is also at least syntactically valid in ECMAScript, and may be semantically equivalent assuming a suitable print function exists in the standard library.
It is probably valid in a lot more languages than that; some that I can think of are AmbientTalk, Atomy, CoffeeScript, Converge, Dart, Dylan, E, Elixir, Falcon, Fancy, Groovy, Hack, Io, Ioke, Julia, Lua, Monte, Neko, Pico, Pike, and Seph. It is also a valid fragment, although not a complete program, in at least Perl6, C, C++, Objective-C, Objective-C++, D, Java, C♯, Spec♯, Sing♯, M♯, Cω, X♯, Kotlin, Ceylon, and Rust.
There is no way of knowing whether this is a Ruby program except asking the person who wrote it.

Prog Challenge - Find paths to files called from configuration files or scripts

I have no idea how to do that, so I come here for help :) Here is what I'd need. I need to parse some configuration files or bash/sh scripts on a Red Hat Linux system, and look for the paths to the files/commands/scripts meant to be executed by them. The configuration files can have different syntax or be using different languages.
Here are the files I have to look at:
Config scripts:
/etc/inittab
/var/spool/cron/root
/var/spool/cron/tabs/root
/etc/crontab
/etc/xinetd.conf
Files located under /etc/cron.d/* recursively
Bash / Sh scripts:
Files located under /etc/init.d/* or /etc/rc.d/* recursively. These folders contain only shell scripts so maybe all the other files listed above need separate treatment.
Now here are the challenges that I can think of:
The paths within the files may be absolute or relative;
The paths within the files may be at the beginning of lines or preceded by a character such as a space, colon or semicolon;
File paths expressed as arguments to commands/scripts must be ignored;
Paths to directories must be ignored;
Shell functions or built-in commands must be ignored.
Some examples (extracted from /etc/init.d/avahi-daemon):
if [ -s /etc/localtime ]; then
cp -fp /etc/localtime /etc/avahi/etc >/dev/null 2>&1
-> Only /bin/cp and /bin/[ must be returned in the snippet above (they are the only commands actually executed)
AVAHI_BIN=/usr/sbin/avahi-daemon
$AVAHI_BIN -r
-> /usr/sbin/avahi-daemon must be returned, but only because the variable is called after.
Note that I do not have access to the actual filesystem, I just have a copy of the files to parse.
After writing this up, I realize how complicated it is and unlikely to have a 100% working solution... But if you like programming challenges :)
The good part is I can use any scripting language: bash/sh/grep/sed/awk, php, python, perl, ruby or a combination of these.
I tried to start writing up in PHP but I am struggling to get coherent results.
Thanks!
The language you use to implement this doesn't matter. What matters is that the problem is undecidable, because it is equivalent to the halting problem.
Just as we know that it is impossible to determine if a program will halt, it is impossible to know if a program will call another program. For example, you may think your script will invoke X then Z, but if X never returns, Z will never be invoked. Also, you may not notice that your script invokes Y, because the string Y may be determined dynamically and never actually appear in the program text.
There are other problems which may stymie you along the way, too, such as:
python -c 'import subprocess; subprocess.call("ls")'
Now you need not only a complete parser for Bash, but also for Python. Not to mention solve the halting problem in Python.
In other words, what you want is not possible. To make it feasible you would have to significantly reduce the scope of the problem, e.g. "Find everything starting with /usr/bin or /bin that isn't in a comment". And it's unclear how useful that would be.
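To illustrate what the reduced-scope version might look like, and where it falls short, here is a minimal Python sketch. The function name naive_find_paths, the regular expression, and the sample script are all invented for this example; the comment stripping is deliberately crude.

```python
import re

# Paths that start with /usr/bin/ or /bin/ and are not part of a longer path.
PATH_RE = re.compile(r"(?<![\w/])(?:/usr/bin|/bin)/[\w.+-]+")

def naive_find_paths(script_text):
    """Return /bin and /usr/bin paths that appear outside of comments."""
    found = []
    for line in script_text.splitlines():
        # Crude comment stripping: cut at the first '#'. This is wrong for a
        # '#' inside quotes or in "$#", one reason this is only a heuristic.
        code = line.split("#", 1)[0]
        found.extend(PATH_RE.findall(code))
    return found

sample = "\n".join([
    "#!/bin/sh",
    "# /bin/false is only mentioned in a comment",
    "/bin/cp -fp /etc/localtime /etc/avahi/etc",
    'TOOL=/usr/bin/"gr""ep"',   # the name is assembled by the shell at run time
    "$TOOL -q root /etc/passwd",
])
print(naive_find_paths(sample))  # ['/bin/cp']
```

The script would run /usr/bin/grep, but that string never appears in the text, so the scan misses it. This is the "determined dynamically" case described above.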

Interpretation of additional arguments to Ruby's Kernel::system method

Why does the first excerpt succeed and the second fail?
system 'emacs', '--batch', '--quick', '--eval="(require \'package)"'
system 'emacs --batch --quick --eval="(require \'package)"'
(If it matters, I'm executing the code on Mac OS X Mountain Lion with Ruby version 1.8.7 and Emacs version 22.1.1.)
First of all, those two system calls are different in ways that you may not expect. A quick example will probably explain the difference better than a bunch of words and hand waving. Start with a simple shell script:
#!/bin/sh
echo $1
I'll call that pancakes.sh because I like pancakes more than foo. Then we can step into irb and see what's going on:
>> system('./pancakes.sh --where-is="house?"')
--where-is=house?
>> system('./pancakes.sh', '--where-is="house?"')
--where-is="house?"
Do you see the significant difference? The single argument form of system hands the whole string to /bin/sh for processing and /bin/sh will deal with the double quotes in its own way so the program being called will never see them. The multi-argument form of system doesn't invoke /bin/sh to process the command line so the arguments are passed as-is with double quotes intact.
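The same distinction exists in other languages. As a hedged illustration, here is a Python analogue (not the Ruby code from the question) that uses subprocess on a POSIX system: one call goes through /bin/sh, the other passes the argument vector directly. The child process is a tiny Python one-liner standing in for pancakes.sh, and the helper names are made up.

```python
import shlex
import subprocess
import sys

# Child program that prints its first argument, standing in for pancakes.sh.
CHILD = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

def run_via_shell(arg_text):
    """Build one command string and let /bin/sh split and unquote it."""
    command = " ".join(shlex.quote(part) for part in CHILD) + " " + arg_text
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.stdout.strip()

def run_direct(arg):
    """Pass the argument straight to the child, with no shell in between."""
    done = subprocess.run(CHILD + [arg], capture_output=True, text=True)
    return done.stdout.strip()

print(run_via_shell('--where-is="house?"'))  # --where-is=house?
print(run_direct('--where-is="house?"'))     # --where-is="house?"
```

As with Ruby's system, the shell removes the double quotes in the first call, while the second call delivers them to the child unchanged.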
Back to your system calls. The first one will send this exact argument to emacs (note that Ruby will take care of converting \' to just '):
--eval="(require 'package)"
and emacs will try to evaluate "(require 'package)"; that looks more like a string than an elisp snippet to me and evaluating a string literal doesn't do much of anything. Your second will send this to emacs:
--eval=(require 'package)
and emacs will complain that it
Cannot open load file: package
Note that my elisp knowledge is buried under about 20 years of rust and forgetfulness so some of the emacs details may be a bit off.

Converting a history command into a shell script

This is sort of one of those things that I figured a lot of people would use a lot, but I can't seem to find any people who have written about this sort of thing.
I find that a lot of times I do a lot of iteration on a command-line one-liner and when I end up using it a lot, or anticipate wanting to use it in the future, or when it becomes cumbersome to work with in one line, it generally is a good idea to turn the one-liner into a shell script and stick it somewhere reasonable and easily accessible like ~/bin.
It's obviously too cumbersome to use any sort of roundabout method involving a text editor to get this done, and it's possible to simply do it on the shell, for instance in zsh typing
echo "#!/usr/bin/env sh" > ~/bin/command_from_history_number_523.sh && echo !523 >> ~/bin/command_from_history_number_523.sh
followed by pressing Tab to inject the !523rd command and somehow shoehorning it into an acceptable string to be saved.
This is particularly cumbersome and has at minimum three problems:
Does not work in bash as it does not complete the !523
Requires some manual inspection and string escapement
Requires too much typing such as the script name must be entered twice
So it looks like I need to do some meta shell scripting here.
I think a good solution would function under both bash and zsh, and it should probably work by taking two arguments, an integer for the history command number and a name for the shell script to poop out in a hardcoded directory which contains that one command. Furthermore, under bash, it appears that multi-line commands are treated as separate commands, but I'm willing to assume that we only care about one-liners here and I only use zsh anyway at this point.
The stumbling block here is that I think I'll still be running shell scripts through bash even when using zsh, so it likely won't be able to parse zsh's history files. I may need to make this into two separate programs then.
Update: I agree with @Floris's comment that direct use of commands like !! would be helpful, though I am not sure how to make this work. Suppose I have the usage be
mkscript command_number_24 !24
This is inadequate because mkscript will receive the expanded contents of the 24th command. If the 24th command contains any file globs or the like, they will already have been expanded. This is bad; I basically want the contents of the history file, i.e. the raw command string. I guess this can be worked around by manually implementing those shortcuts in here. Or just screw it and take an integer argument.
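function mkscript() {
    # $1: history number, $2: name of the script to create in ~/bin
    echo '#!/bin/bash' > ~/bin/"$2"
    # '!' stays in single quotes so it is not history-expanded while typing this in
    history -p '!'"$1" >> ~/bin/"$2"
    chmod +x ~/bin/"$2"   # make the new script executable
}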
Only tested in Bash.
Update from OP: In zsh I can accomplish this with fc -l $2 $2

What is the standard usage argument style?

I'm making some command-line tools for some research I'm doing. I'd like these tools to follow commonly used conventions regarding command line programs in Unix.
Should I use flags or just list parameters?
program one two three
program -a one -b two -c three
Where in the list of commands does the input file normally go, or is it better to < it into the program?
What about the output filename?
Should I specify the file extension for the output format, or have my program automatically put the correct extension on?
When the user enters an invalid command, is there a prototypical "correct usage" message?
Is "--help" or "-h" required?
Also, is there some sort of header file I can include that would help with managing these?
If you're looking for a "standard", then you could do worse than look at GNU's Standards for Command Line Interfaces. Other standards are available.
As far as coding for this goes, take a look at boost::program_options. Not only will this save you rolling a lot of your own code, but it does a good job of formatting the options for presenting to the user (the prototypical "correct usage" message you asked for).
In answer to your specific questions:
Where in the list of commands does the input file normally go, or is it better to < it into the program?
I would expect these to come at the end of a command line. Like in GNU grep. If you are only processing one file and would like to make stdin available as an input source, that would not surprise most users.
If your command processes lots of files, then it would be unusual to have to specify a switch before the filenames. Think cat.
What about the output filename?
A -o or --output option is fairly common. If your program takes exactly one input and one output, then program inputfile outputfile would not surprise many users. If no output file is specified, perhaps you'll output to stdout; that would not be unusual behaviour and would allow your users to pipe the output through other commands (such as grep, less, etc.). They could also redirect stdout to a file using >.
Should I specify the file extension for the output format, or have my program automatically put the correct extension on?
This is probably a matter for debate. If I specified an output filename, I would expect to find that file created (or replaced, after a prompt) without the program changing the name.
When the user enters an invalid command, is there a prototypical "correct usage" message?
Using GNU grep as an example again:
grep: unrecognized option '--incorrect'
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
This wouldn't surprise too many users and points them in the right direction if they've made a typo without swamping them with information.
Is "--help" or "-h" required?
That depends on your customer! I find it frustrating when this option isn't available.
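The answer above recommends boost::program_options for C++. As a rough illustration of the same conventions with a different tool, here is a minimal sketch using Python's standard argparse module. The program name and the option and argument names are placeholders chosen for this example.

```python
import argparse

def build_parser():
    # "program", "files" and "--output" are placeholder names for this sketch.
    parser = argparse.ArgumentParser(
        prog="program",
        description="Process the given files, or standard input if none are given.",
    )
    parser.add_argument(
        "-o", "--output", default="-",
        help="file to write the result to (default: stdout)",
    )
    parser.add_argument(
        "files", nargs="*",
        help="input files, given last on the command line (default: stdin)",
    )
    return parser

args = build_parser().parse_args(["-o", "result.txt", "one", "two"])
print(args.output, args.files)  # result.txt ['one', 'two']
```

argparse adds -h and --help on its own, and on an unrecognized option it prints a short usage line plus an error message and exits with a nonzero status, which is close to the grep behaviour shown above.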
Generally speaking, flags are for providing options and parameters are for passing information. If you take input and output files as command-line arguments, use flags like -i and -o so that their order does not matter. -h is required if you want to (and need to) provide documentation.

Resources