Finding variable and function memory addresses in shell

In shell programming, is there any way to find the addresses of variables and functions?
Something like in C or C++, where printing &x gives you the memory address of x.

It depends on what you want. Most shells only support symbolic (or "soft") references: variables that contain the names of other variables. Korn shell 93 has name references, created using typeset -n, but the address is not exposed. Take a look at that in man ksh; it might help.
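For illustration, here is a minimal ksh93 sketch of a name reference (bash 4.3 and later has a similar declare -n); note that the reference acts as an alias for the variable, not as an exposed address:
typeset -n ref=x    # ref now refers to the variable x
x=hello
print "$ref"        # prints: hello
ref=world
print "$x"          # prints: world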
But even if you could get the memory address of a variable, what would you do with it? Shell variables are not laid out in a primitive manner as in C, so pointer arithmetic would be of no use. You could not iterate through an array or character string just by incrementing an address. The best you could do is investigate the memory of that particular version of that particular shell.
Since the shells are mostly written in C, you could always compile the shell source with debugging symbols (-g) and attach a debugger. No idea what that would buy you though, so my question has to be: why?
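If you did want to go that route, a rough sketch might look like this (the version number and tarball name are just placeholders):
# build bash from source with debug symbols, then run it under a debugger
tar xf bash-5.2.tar.gz && cd bash-5.2
./configure CFLAGS='-g -O0'
make
gdb ./bash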
Edit: from a comment I see you want the stack size. I'm pretty sure that shell variables will be allocated either on the heap or in the environment block.

Related

How can I generate a list of every valid syntactic operator in Bash including input and output?

According to the Bash Reference Manual, the Bash scripting language consists of 4 distinct subclasses of syntactic elements:
built-in commands (alias, cd)
reserved words (if, function)
parameters and variables ($, IFS)
functions (abort, end-of-file - activated with keybindings such as Ctrl-d)
Apart from reading the manual, I became curious whether there is a programmatic way to list out or generate all such keywords, at least from one of the above categories. I think this could be useful in some contexts. Sometimes I wish I could see all the options available to me for what I can write in any given moment, and having that information as data, instead of a formatted manual, is convenient, focused, and can be edited, in case you want to strike out commands you know well, or that are too obscure for now.
My understanding is that Bash takes input on stdin and passes it to the running shell process. When code is distributed in a production-ready form, it is compiled, so it runs faster. Unlike using a Python REPL, you don’t have access to the Bash source code from within Bash, so there is no very direct route such as writing a program that searches through source files to find the various defined commands. I mean that if you wanted to list all functions, Python has the dir() function, which programmatically looks for function names in the namespace. But I don’t think Bash can do that. I think it doesn’t have a special syntax in its source files which makes it easy to find and identify all the keywords. Instead, they will be found if you simply enter them - like cd will “find” the program cd because $PATH returns the path to that command - but there’s no special way to discover them.
Or am I wrong? Technically, you could run a “brute force” search by generating every combination of symbols of every length and record when you did not get “error: unknown command” as a response.
Is there any other clever programmatic way to do this?
I mean I want to see a list of every symbol or string that the bash
compiler
Bash is not a compiler. It and every other shell I know are interpreters of various languages.
recognises and knows what to do with, including commands like
“ls” or just a symbol like “*”. I also want to see the inputs and
outputs for each symbol, i.e., some commands are executed in the shell
prompt by themselves, but what data type do they return?
All commands executed by the shell have an exit status, which is a number between 0 and 255. This is as close to a "return type" as you get. Many of them also produce idiosyncratic output to one or two streams (a standard output stream and a standard error stream) under some conditions, and many have other effects on the shell environment or operating environment.
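For example, you can inspect the exit status of the last command through the special parameter $?:
grep root /etc/passwd > /dev/null
echo "$?"    # 0 if a match was found, 1 if no match, 2 on error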
And some
require a certain data type to standard input.
I can't think of a built-in utility whose expected input is well characterized as having a particular data type. That's not really a stream-oriented concept.
I want to do this just as a rigorous way to study the language.
If you want to rigorously study the language, then you should study its manual, where everything you describe has already been compiled. You might also want to study the POSIX shell command language manual for a slightly different perspective, which is more thorough in some areas, though what it documents differs in a few details from Bash's default behavior.
If you want to compile your own summary of Bash syntax and behavior, then those are the best source materials for such an effort.
You can get a list of all reserved words and syntactic elements of bash using this trick:
help -s '*' | cut -d: -f1
Or more accurately:
help -s \* | awk -F ': ' 'NR>2&&!/variables/{print $1}'
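Alternatively, the compgen builtin can list several of these categories directly; a quick sketch (see help compgen for the full set of actions):
compgen -k            # reserved words (if, then, while, ...)
compgen -b            # builtin commands (cd, alias, ...)
compgen -A function   # currently defined shell functions
compgen -c            # every command name currently resolvable, builtins included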

How to find variable names in Bash source code

I'm writing an experimental Bash module system that would allow local function namespaces, and my first idea was to write a Bash function parser that would read the function code line by line and prepend each function/variable name with <module-name>. (i.e. function func in module module would become module.func - which could again be imported in another module like module_2.module.func and so on; variables inside functions would be name-mangled - variable var within function func in module module would become __module_func_var).
However, in order to do that, I need a way to detect which names are variables and replace all their occurrences in the function with the transformed import name. Trivial cases like variable=[...] are easily parsable, but there are countless other cases where it's not so trivial - what about while read variable; do [...] done and variable2="asdf${variable//_/+}"?
It seems to me that in order to do this I need to dive into the parsing mechanisms of Bash or read a book on programming languages - but where do I start in order to achieve what I have explained above?
I need a way to detect which names are variables
I'm sorry to say this, but in general it's impossible.
Supporting only the static cases where variables can occur is possible but very tricky. Consider only variable assignments: Besides x= there are declare x=, printf -v x, read x, mapfile x, readarray x and probably many more. Even mature tools like shellcheck still have problems parsing all these cases correctly (for instance, see this issue).
However, even if you mastered parsing all the static cases correctly, there could still be dynamic variables, for instance:
x=$(someCommand)
declare "$x=something"
In this example you cannot know the name of the new variable without executing someCommand. Other things which are equally bad (or even worse) are bash's indirection operator ${!x}, implicit indirection in arithmetic contexts (e.g. x=y; echo $((x))), and eval.
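A quick illustration of those two indirection forms:
y=5
x=y
echo "${!x}"     # indirection: prints 5, the value of the variable named by x
echo "$((x))"    # arithmetic context: x evaluates to y, which evaluates to 5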
tl;dr: The only way to get all the variables in a script is to interpret/execute the script.
But here comes another problem: Executing the script is also not an option if there is non-determinism (declare "$(tr -cd a-z < /dev/urandom | head -c1)=..."). Note that user input is also non-deterministic (read x; declare "var$x=..."). You would have to write a static analyzer. But this is also not an option because of the halting problem. From the halting problem we can deduce that it is (in general) impossible to tell whether a given bash script has a finite number of variables.
To implement your module system you could use another approach. For instance, if someone wants to implement a module for your framework, then they have to specify the functions/variables in that module in an easily parsable format, something like the sketch below.
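Purely as a hypothetical sketch (the file name and the exports array are made up for illustration), a module could declare its public names up front so the loader never has to parse function bodies:
# mymodule.sh - hypothetical module file for such a framework
# the module lists its public names in a plain array the loader can read
exports=(func helper_var)
helper_var=42
func() {
    echo "hello from func"
}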

What is the rationale behind variable assignment without space in bash script

I am trying to write an automated process for AWS that requires some JSON processing and other things in a bash script. I am following a few blogs on bash scripting and I found this:
a=b
with the following note:
There is no space on either side of the equals ( = ) sign. We
also leave off the $ sign from the beginning of the variable name when
setting it
This is ugly and very difficult to read, and compared to other scripting languages it is easy for a user to make a mistake when writing a bash script by leaving a space in between. I think everyone likes to write clean and readable code; this restriction is surely bad for readability.
Can you explain why? An explanation with examples would be highly appreciated.
It's because otherwise the syntax would be ambiguous. Consider this command line:
cat = foo
Is that an assignment to the variable cat, or running the command cat with the arguments "=" and "foo"? Note that "=" and "foo" are both perfectly legal filenames, and therefore reasonable things to run cat on. Shell syntax settles this in favor of the command interpretation, so to avoid this interpretation you need to leave out the spaces. cat =foo has the same problem.
On the other hand, consider:
var= cat
Is that the command cat run with the variable var set to the empty string (i.e. a shorthand for var='' cat), or an assignment to the shell variable var? Again, the shell syntax favors the command interpretation so you need to avoid the temptation to add spaces.
There are many places in shell syntax where spaces are important delimiters. Another commonly-messed-up place is in tests, where if you leave out any of the spaces in:
if [ "$foo" = "$bar" ]
...it will lead to a different meaning, which might cause an error, or might just silently do the wrong thing.
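For instance, dropping the spaces collapses the whole test into a single word, which [ treats as a non-empty string and therefore as true:
foo=a bar=b
if [ "$foo"="$bar" ]; then echo "looks equal"; fi   # prints "looks equal" even though a != b
if [ "$foo" = "$bar" ]; then echo "equal"; fi       # correctly prints nothing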
What I'm getting at is that shell syntax does not allow you to arbitrarily add or remove spaces to improve readability. Don't even try, you'll just break things.
What you need to understand is that the shell language and syntax is old. Really old. The first version of the UNIX shell with variables was the Bourne shell which was designed and implemented in 1977. Back then, there were few precedents. (AFAIK, just the Thompson shell, which didn't support variables according to the manual entry.)
The rationale for the design decisions made in the 1970s is ... lost in the mists of time. The design decisions were made by Steve Bourne and colleagues working at Bell Labs on v6 UNIX. They probably had no idea that their decisions would still be relevant 40+ years later.
The Bourne shell was designed to be general purpose and simple to use ... compared with the alternative of writing programs in C. And small. It was an outstanding success in those terms.
However, any language that is successful has the "problem" that it gets widely adopted. And that makes it more difficult to fix any issues (real or perceived) that may arise. Any proposal to change a language needs to be balanced against the impact of that change on existing users / uses of the language. You don't want to break existing programs or scripts.
Irrespective of arguments about whether spaces around = should be allowed in a shell variable assignment, changing this would break millions of shell scripts. It is just not going to happen.
Of course, Linux (and UNIX before it) allow you to design and implement your own shell. You could (in theory) replace the default shell. It is just a lot of work.
And there is nothing stopping you from writing your scripts in another scripting language (e.g. Python, Ruby, Perl, etc) or designing and implementing your own scripting language.
In summary:
We cannot know for sure why they designed the shell with this syntax for variable assignment, but it is moot anyway.
Reference:
Evolution of shells in Linux: a history of shells.
It prevents ambiguity in a lot of cases. Otherwise, if you have a statement foo = bar, it could then either mean run the foo program with = and bar as arguments, or set the foo variable to bar. When you require that there are no spaces, now you've limited ambiguity to the case where a program name contains an equals sign, which is basically unheard of.
I agree with #StephenC, and here's some more context with sources:
Unix v6 from 1975 did not have an environment; there was just an exec syscall that took a program and a string array of arguments. The system sh, written by Thompson, did not support variables, only single-digit numbered arguments like $1 (probably why $12 to this day is interpreted as ${1}2).
Unix v7 from 1979, emboldened by advances in hardware, added a ton of features including a second string array to the exec call. The man page described it like this, which is still how it works to this day:
An array of strings called the environment is made available by exec(2) when a process begins. By convention these strings have the form name=value
The system sh, now written by Bourne, worked much like the v6 shell, but now allowed you to specify these environment strings in the same format in front of commands (because what other format would you use?). The simplistic parser essentially split words by spaces, and flagged a word as destined for a variable if it contained a = and all preceding characters were alphanumeric.
Thanks to Unix v7's incredible popularity, forks and clones copied a lot of things including this behavior, and that's what we're still seeing today.

What are the minimum required environment variables?

I am writing a shell.
With the execvpe system call, I can run a program and control its environment. What are the minimum values I need to pass through here?
Alternatively, I understand that child processes should have a copy of their parent's environment, possibly with some values added. While testing my shell, I am running it from within bash from within my terminal from within a window manager, etc etc. What are the bare basics that I can assume are in my environment? If I were to run my shell straight from a TTY (the "lowest level", as far as I understand), what can I expect?
That’s a very broad question. To a certain extent, programs should be able to run with no environment at all.
“X” display (i.e., GUI) programs need to know where they are supposed to display. This information is usually provided through the DISPLAY environment variable, but can also be passed on the command line. There are probably other environment variables that are essential (or nearly so) to “X” programs; it’s been a while since I’ve looked under that hood.
Any program that needs to use special characteristics of your terminal needs the TERM environment variable. “Special characteristics” means being able to set colors (as ls and grep can do, subject to options), move around the screen (like vi / vim), or even know the size of the screen (like less). Note that the size of the screen may also be available through LINES and COLUMNS.
Any program that needs to know the date and time as perceived / understood by the user needs to know the time zone (TZ) - although, if you’re willing to work with absolute time (GMT / UTC), you don’t need this.
etc.
The minimum that you need is a working PATH variable. Any extras beyond that depend on what programs you want to execute.
POSIX has a list of commonly used environment variables; very few programs use more than a few of those.
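As a quick sanity check (the directories assume a typical Linux layout), a child process started with nothing but PATH can still run ordinary commands:
env -i PATH=/usr/bin:/bin sh -c 'ls / > /dev/null && echo "ls ran fine"'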
Generally, if you're using execvp*, you're not giving full pathnames for the executables. That makes your programs much simpler: you do not have to provide a full pathname for each executable, as is needed by plain execv. POSIX describes these functions as
int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
and (referring to the parameters of the various exec* functions):
The argument path points to a pathname that identifies the new process image file.
The argument file is used to construct a pathname that identifies the new process image file. If the file argument contains a slash character, the file argument shall be used as the pathname for this file. Otherwise, the path prefix for this file is obtained by a search of the directories passed as the environment variable PATH (see XBD Environment Variables). If this environment variable is not present, the results of the search are implementation-defined.
and (remember that "file" is referring to execvp rather than execv, so the environ variable applies to the search using PATH for the "file" parameter):
For those forms not containing an envp pointer (execl(), execv(), execlp(), and execvp()), the environment for the new process image shall be taken from the external variable environ in the calling process.
So... you could technically remove the entire PATH variable, but the result would be implementation-defined.
The minimum necessary environment is empty. You don't need anything.
e.g.
$ env -i env
$
We can see that env -i has created a blank environment.
We can take this further:
$ env -i /bin/bash
sweh#server:/home/sweh$ env
LS_COLORS=
PWD=/home/sweh
SHLVL=1
_=/usr/bin/env
We can see that bash has set a few variables, but nothing was inherited.
Now such an environment may break some things; e.g. a missing TERM variable means that vi or less may not work properly
$ less foo
WARNING: terminal is not fully functional
foo (press RETURN)
So, really, you need to determine what programs you expect to run inside the environment and what their needs are.

Print addresses of all local variables in C

I want to print the addresses of all the local and global variables which are being used in a function, at different points of execution of a program and store them in a file.
I am trying to use gdb for this same.
The "info local" command prints the values of all local variables. I need something to print the addresses in a similar way. Is there any built in command for it?
Edit 1
I am working on a gcc plugin which generates a points-to graph at compile time.
I want to verify if the graph generated is correct, i.e. if the pointers do actually point to the variables, which the plugin tells they should be pointing to.
We want to validate this points-to information on large programs with thousands of lines of code. We will be validating this information using a program and not manually. There are several local and global variables in each function, therefore adding printf statements after every line of code is not possible.
There is no built-in command to do this. There is an open feature request in gdb bugzilla to have a way to show the meaning of all the known slots in the current stack frame, but nobody has ever implemented this.
This can be done with a bit of gdb scripting. The simplest way is to use Python to iterate over the Blocks of the selected Frame. Then in each such Block, you can iterate over all the variables, and invoke info addr on the variable.
Note that printing the address with print &var will not always work. A variable does not always have an address -- but, if the variable exists, it will have a location, which is what info addr will show.
One simple way these ideas can differ is if the compiler decides to put the variable into a register. There are more complicated cases as well, though, for example the compiler can put the variable into different spots at different points in the function; or can split a local struct into its constituent parts and move them around.
By default info addr tries to print something vaguely human-readable. You can also ask it to just dump the DWARF location expressions if you need that level of detail.
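As a rough command-line sketch (the program name, the breakpoint and the variable name are placeholders), you can also drive this from the shell without writing a full Python script:
# stop in some_function, list its locals, then show where one of them is stored
gdb -batch -ex 'break some_function' -ex run \
    -ex 'info locals' -ex 'info address some_var' ./myprog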
Programmatically (in C/C++) you use the & operator to get the address of a variable (assuming it's not a pointer):
#include <stdio.h>
int main(void) {
    int a = 42;                  /* variable declaration */
    printf("%d\n", a);           /* print the value of the variable */
    printf("%p\n", (void *)&a);  /* print the address of the variable */
}
The same goes for (gdb): just use &.
Plus, the question has already been answered here (and not only there).
