yacc, only applying rule once - shell

I'm trying to write a shell using yacc and lex and I'm running into some problems with my I/O redirectors. Currently, I can use the < and > operators fine and in any order, but my problem is that I can redirect twice with no error, such as "ls > log > log2".
My rule code is below, can anyone give me some tips on how to fix this? Thanks!
io_mod:
        iomodifier_opt io_mod
        |
        ;

iomodifier_opt:
        GREAT WORD {
                printf(" Yacc: insert output \"%s\"\n", $2);
                Command::_currentCommand._outFile = $2;
        }
        |
        LESS WORD {
                printf(" Yacc: insert input \"%s\"\n", $2);
                Command::_currentCommand._inputFile = $2;
        }
        | /* can be empty */
        ;
EDIT: After talking to my TA, I learned that I did not actually need to have only 1 modifier for my command and that I actually can have multiple copies of the same I/O redirection.

There are two approaches:
(1) Modify the grammar so that you can only have one of each kind of modifier:
io_mod_opt: out_mod in_mod | in_mod out_mod | in_mod | out_mod | ;
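For option (1), out_mod and in_mod would simply be the question's two existing alternatives split into their own rules; a minimal sketch reusing the actions from the question:

out_mod: GREAT WORD { Command::_currentCommand._outFile = $2; } ;
in_mod:  LESS WORD  { Command::_currentCommand._inputFile = $2; } ;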
(2) Modify the clause handler to count the modifiers and report an error if there's more than one:
GREAT WORD {
        if (already_have_output_file()) {
                error("too many output files: \"%s\"\n", $2);
        } else {
                /* record output file */
        }
}
Option (2) seems likely to lead to better error messages and a simpler grammar.
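As a hedged sketch of what option (2) might look like wired into the question's grammar (the _numOutFiles counter is invented for illustration and is not part of the original Command class):

iomodifier_opt:
        GREAT WORD {
                /* hypothetical counter added to Command for this check */
                if (Command::_currentCommand._numOutFiles++ > 0) {
                        error("too many output files: \"%s\"\n", $2);
                } else {
                        Command::_currentCommand._outFile = $2;
                }
        }

and similarly for the LESS WORD alternative with an input-file counter.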

There's also a third approach - don't fret. Bash (under Cygwin) does not generate an error for:
ls > x > y
It creates x and then y and ends up writing to y.
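That behaviour falls out of how a shell typically applies redirections: each target is opened (creating or truncating it) and then duplicated onto file descriptor 1 in turn, so the last one wins. A minimal sketch of that mechanism (illustrative only, not bash's actual code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Apply "ls > x > y": open each target in order and dup it onto stdout.
 * Both files get created/truncated, but only the last one remains fd 1. */
static int redirect_stdout(const char *files[], int n)
{
    for (int i = 0; i < n; i++) {
        int fd = open(files[i], O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0) { perror(files[i]); return -1; }
        dup2(fd, STDOUT_FILENO);   /* stdout now refers to files[i] */
        close(fd);
    }
    return 0;                      /* exec the command after this */
}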

I realize this might be just an exercise to learn lex and yacc, but otherwise the first question to ask is why you want to use lex and yacc at all. Any usual shell command language has a pretty simple grammar; what are you gaining from using an LALR generator?
Well, other than complexity, difficulty generating good error messages, and code bulk, I mean.

Related

How does an interpretive program (e.g. Perl, shell) work?

I propose a hypothesis: 1. the operating system creates a process space to start the interpreter; 2. the interpreter creates a new process space for the program that needs to be interpreted, and translates the first statement into machine language; 3. the first statement finishes executing and control returns to the interpreter; 4. the interpreter translates the next statement, dynamically modifying and creating new instructions. But I can't convince myself of this; I can't really grasp the concept of interpreting and executing.
Here's an example interpreter:
while (<>) {
    my ($cmd, @args) = split;
    if    ($cmd eq '...') { ... }
    elsif ($cmd eq '...') { ... }
    elsif ($cmd eq '...') { ... }
    else                  { ... }
}
This points out that the interpreted program isn't run in a separate process from the interpreter.
This also points out there isn't necessarily any translation to machine language.
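To illustrate the same point in C (a deliberately toy sketch, not how Perl or any real shell is implemented): the interpreter is just a loop that reads a statement, decides what it means, and performs the effect itself, all inside one process and with no translation to machine code.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    while (fgets(line, sizeof line, stdin)) {      /* read one "statement" */
        char *cmd = strtok(line, " \t\n");
        if (!cmd) continue;
        if (strcmp(cmd, "print") == 0) {           /* dispatch on the command */
            char *arg = strtok(NULL, "\n");
            puts(arg ? arg : "");
        } else if (strcmp(cmd, "quit") == 0) {
            break;
        } else {
            fprintf(stderr, "unknown command: %s\n", cmd);
        }
    }
    return 0;   /* the interpreted "program" ran inside this same process */
}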
Do note that Perl is a compiled language rather than an interpreted one.
$ perl -MO=Concise,-exec -e'print("Hello, world!\n");'
1 <0> enter
2 <;> nextstate(main 1 -e:1) v:{
3 <0> pushmark s
4 <$> const[PV "Hello, world!\n"] s
5 <#> print vK
6 <#> leave[1 ref] vKP/REFC
-e syntax OK
That said, the compiled form is not native instructions. There are different ways this can be handled, but Perl effectively interprets these. The following is that interpreter:
int
Perl_runops_standard(pTHX)
{
    OP *op = PL_op;
    PERL_DTRACE_PROBE_OP(op);
    while ((PL_op = op = op->op_ppaddr(aTHX))) {
        PERL_DTRACE_PROBE_OP(op);
    }
    PERL_ASYNC_CHECK();

    TAINT_NOT;
    return 0;
}
(Copied from run.c in the Perl source.)
The ops are really data structures arranged in a linked list (with other pointers for jumps) rather than a stream of bytes encoding instructions. The above loop traverses the list, executing the function associated with each op. Each of these functions returns the address of the next op to execute, and that is what forms the program.
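A toy version of that shape in C, just to illustrate the loop (the types and names here are invented, not Perl's actual ones):

#include <stdio.h>

typedef struct op Op;
struct op {
    Op *(*fn)(Op *self);   /* does the work, returns the next op to run */
    Op *next;              /* plain list successor the fn may follow */
    const char *arg;
};

static Op *op_print(Op *self) { fputs(self->arg, stdout); return self->next; }
static Op *op_done(Op *self)  { (void)self; return NULL; }

int main(void)
{
    Op done  = { op_done,  NULL,   NULL };
    Op world = { op_print, &done,  "world!\n" };
    Op hello = { op_print, &world, "Hello, " };

    /* the same shape as Perl_runops_standard: run an op, follow its return value */
    Op *op = &hello;
    while ((op = op->fn(op)))
        ;
    return 0;
}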
Some languages probably take a similar approach. Other languages definitely take a different approach.

How can I find both identical and similar strings in a particular field in a text file in Linux?

My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters from that of the line beginning 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because their third fields are not similar to any other.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this:
20?7G?N?RAL
These differences could be arbitrarily in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
you can use this script
$ more hamming.awk
function hamming(x,y,  xs,ys,min,max,h) {
    if (x == y) return 0;
    else {
        nx = split(x, xs, "");
        mx = split(y, ys, "");
        min = nx < mx ? nx : mx;
        max = nx < mx ? mx : nx;
        for (i = 1; i <= min; i++) if (xs[i] != ys[i]) h++;
        return h + (max - min);
    }
}

BEGIN {FS=OFS="\t"}

NR==FNR {
    if ($3 in a) nrs[NR];
    for (k in a)
        if (hamming(k, $3) < 4) {
            nrs[NR];
            nrs[a[k]];
        }
    a[$3] = NR;
    next
}

FNR in nrs
Usage:
$ awk -f hamming.awk file{,}
It's a double-scan algorithm that finds the Hamming distance (the one you described) between keys. Note that it's an O(n^2) algorithm, so it may not be suitable for very large data sets; however, I'm not sure any other algorithm can do better.
NB: An additional note, based on a comment I missed in the post: this algorithm compares the keys character by character, so displacements won't be identified. For example, 123 and 23 will give a distance of 3.
Levenshtein distance, aka "edit distance", suits your task best. The Perl script below requires installing the module Text::Levenshtein (for Debian/Ubuntu do: sudo apt install libtext-levenshtein-perl).
use Text::Levenshtein qw(distance);

$maxdist = shift;
@ll = (<>);
@k = map {
    $k = (split /\t/, $_)[2];
    # $k =~ s/O/0/g;
    $k;                        # value stored for this element of @k
} @ll;
for ($i = 0; $i < @ll; ++$i) {
    for ($j = 0; $j < @ll; ++$j) {
        if ($i != $j and distance($k[$i], $k[$j]) < $maxdist) {
            print $ll[$i];
            last;
        }
    }
}
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in @karakfa's post, but the matching is more flexible.
Also note the commented line # $k =~ s/O/0/g;. If you uncomment it, then all O's in the key will become 0's, which repairs keys damaged by an O->0 mix-up. When working with damaged data I always use small rules like this to fix the data gradually, refining the rules from run to run, to the point where the data is almost perfect and fuzzy matching is no longer needed.

if else statement in yacc double execution

How do we implement if-else in yacc?
I tried this:
|IF log THEN AffectationI ELSE AffectationI {if ($2) $$=$4; else $$=$6;}
but $4 and $6 both execute at the same time,
knowing that
AffectationI is equal to Var=3
yacc generates a parser, not a general program evaluator, so if you want to execute the program you're parsing, you need to implement something that can do that. The simplest approach is to have your parser build a tree and not evaluate anything, and then have a separate evaluator that "executes" the code in the tree by traversing it. That way you can readily skip parts of the tree that should not be executed -- or repeatedly traverse parts of the tree that can be evaluated multiple times, like loops.
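A minimal sketch of that tree-walking idea in C (the node layout and the mk_if helper name are invented for illustration; in the real parser the grammar actions would build these nodes rather than evaluate anything):

typedef enum { NODE_NUM, NODE_IF } NodeType;

typedef struct Node Node;
struct Node {
    NodeType type;
    int      value;                     /* NODE_NUM: a literal value */
    Node    *cond, *then_b, *else_b;    /* NODE_IF: the three subtrees */
};

/* Evaluate a subtree; only the branch actually taken is ever visited. */
static int eval(Node *n)
{
    switch (n->type) {
    case NODE_NUM:
        return n->value;
    case NODE_IF:
        return eval(n->cond) ? eval(n->then_b) : eval(n->else_b);
    }
    return 0;
}

The rule from the question would then do something like { $$ = mk_if($2, $4, $6); } (mk_if being a hypothetical node constructor), and nothing runs until the evaluator walks the finished tree.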
Alternatively, you can have a global "condition" flag that controls whether things should be executed or not and manipulate that in your actions. With this approach an if statement becomes something like:
statement:
    IF expression
        { $$ = condition_flag;                        // save the previous condition
          if (condition_flag) condition_flag = $2; }
    THEN statement_list
        { if ($3) condition_flag = !condition_flag; }
    ELSE statement_list
        { condition_flag = $3; }                      // restore previous condition

Is there any difference between the terms line of code and statement?

Is there any difference between the terms 'line of code' and 'statement' in programming languages?
Yes, check the following:
int a = 0;
while (a < 100)
{
    cout <<
        "Is it ok" <<
        a <<
        "is the current value of 'a'" << endl;
    a++;
}
The above code snippet has:
Lines of code: 9
Simple statements: 3
Compound statements: 1
A line of code is basically counted by line endings, while a statement is the group of code you write to produce an expected output.
For example, a conditional using if-else can be written across multiple lines of code, or in a single line using a ternary expression.
Sample:
if ($var == 0) {
    echo "this is zero";
} else {
    echo "not a zero";
}
That is basically 5 lines of code. With a ternary you can go like this:
echo $var == 0 ? "this is zero" : "not a zero";
As you see, the result is the same, only it is written in 1 line of code.
Hope this helps you moving forward.
Lines of code presumably refers to lines of content in a source file, i.e. the number of lines in a file. On the other hand, a given statement of code can, and often does, exceed a single line. For example, consider the following Java statement from the Javadoc for streams:
int sum = widgets.stream()
                 .filter(b -> b.getColor() == RED)
                 .mapToInt(b -> b.getWeight())
                 .sum();
One statement spans four physical lines in the editor. However, we could inline the whole thing into a single line.

Does awk support dynamic user-defined variables?

awk supports this:
awk '{print $(NF-1);}'
but not for user-defined variables:
awk '{a=123; b="a"; print $($b);}'
by the way, shell supports this:
a=123;
b="a";
eval echo \${$b};
How can I achieve my purpose in awk?
OK, since some of us like to eat spaghetti through their nose, here is some actual code that I wrote in the past :-)
First of all, getting self-modifying code in a language that does not support it is extremely non-trivial.
The idea for allowing dynamic variables and function names in a language that does not support them is very simple: at some point in the program, you want a dynamic anything to self-modify your code and resume execution
from where you left off. An eval(), that is.
This is all very trivial if the language supports eval() or an equivalent. However, awk has no such function. Therefore you, the programmer, have to provide an interface for it.
To allow all this to happen, you have three main problems:
How to get at our own code so we can modify it
How to load the modified code and resume from where we left off
Finding a way for the interpreter to accept our modified code
How to get at our own code so we can modify it
Here is some example code, suitable for direct execution.
This is the infrastructure that I inject for environments running gawk, as it requires PROCINFO:
echo ""| awk '
function push(d){stack[stack[0]+=1]=d;}
function pop(){if(stack[0])return stack[stack[0]--];return "";}
function dbg_printarray(ary , x , s,e, this , i ){
x=(x=="")?"A":x;for(i=((s)?s:1);i<=((e)?e:ary[0]);i++){print x"["i"]=["ary[i]"]"}}
function dbg_argv(A ,this,p){
A[0]=0;p="/proc/"PROCINFO["pid"]"/cmdline";push(RS);RS=sprintf("%c",0);
while((getline v <p)>0)A[A[0]+=1]=v;RS=pop();close(p);}
{
print "foo";
dbg_argv(A);
dbg_printarray(A);
print "bar";
}'
Result:
foo
A[1]=[awk]
A[2]=[
function push(d){stack[stack[0]+=1]=d;}
function pop(){if(stack[0])return stack[stack[0]--];return "";}
function dbg_printarray(ary , x , s,e, this , i ){
x=(x=="")?"A":x;for(i=((s)?s:1);i<=((e)?e:ary[0]);i++){print x"["i"]=["ary[i]"]"}}
function dbg_argv(A ,this,p){
A[0]=0;p="/proc/"PROCINFO["pid"]"/cmdline";push(RS);RS=sprintf("%c",0);
while((getline v <p)>0)A[A[0]+=1]=v;RS=pop();close(p);}
{
print "foo";
dbg_argv(A);
dbg_printarray(A);
print "bar";
}]
bar
As you can see, as long as the OS does not play with our args and /proc/ is available, it is possible
to read our own code. This may appear useless at first, but we need it for push/pop of our stack,
so that our execution state can be embedded within the code, so we can save/resume and survive OS shutdowns/reboots.
I have left out the OS detection function and the bootloader (written in awk), because, if I published that,
kids could build platform-independent polymorphic code, and it is easy to cause havoc with it.
How to load the modified code and resume from where we left off
Now, normally you have push() and pop() for registers, so you can save your state, modify your own code, and resume from where you left off. A call and reading your own stack is the typical way to get the
memory address.
Unfortunately, in awk, under normal situations we cannot use pointers (without a lot of dirty work)
or registers (unless you can inject other stuff along the way).
However, you need a way to suspend and resume your code.
The idea is simple. Instead of letting awk control your loops, while and if-else conditions,
recursion depth, and which function you are in, your code should:
Keep a stack, a list of variable names, and a list of function names, and manage them yourself.
Just make sure that your code always calls self_modify( bool ) constantly, so that even upon sudden failure,
as soon as the script is re-run, we can enter self_modify( bool ) and resume our state.
When you want to self-modify your code, you must provide custom-made
write_stack() and read_stack() routines that write out the state of the stack as a string, read the
values back out of the string embedded in the code, and resume the execution state.
Here is a small piece of code that demonstrates the whole flow
echo ""| awk '
function push(d){stack[stack[0]+=1]=d;}
function pop(){if(stack[0])return stack[stack[0]--];return "";}
function dbg_printarray(ary , x , s,e, this , i ){
x=(x=="")?"A":x;for(i=((s)?s:1);i<=((e)?e:ary[0]);i++){print x"["i"]=["ary[i]"]"}}
function _(s){return s}
function dbg_argv(A ,this,p){
A[0]=0;p="/proc/"PROCINFO["pid"]"/cmdline";push(RS);RS=sprintf("%c",0);
while((getline v <p)>0)A[A[0]+=1]=v;RS=pop();close(p);}
{
_(BEGIN_MODIFY"|");print "#foo";_("|"END_MODIFY)
dbg_argv(A);
sub( \
"BEGIN_MODIFY\x22\x5c\x7c[^\x5c\x7c]*\x5c\x7c\x22""END_MODIFY", \
"BEGIN_MODIFY\x22\x7c\x22);print \"#"PROCINFO["pid"]"\";_(\x22\x7c\x22""END_MODIFY" \
,A[2])
print "echo \x22\x22\x7c awk \x27"A[2]"";
print "function bar_"PROCINFO["pid"]"_(s){print \x22""doe\x22}";
print "\x27"
}'
Result:
Exactly the same as our original code, except
_(BEGIN_MODIFY"|");print "65964";_("|"ND_MODIFY)
and
function bar_56228_(s){print "doe"}
at the end of the code.
Now, this may seem useless, as we are only replacing the code print "#foo"; with our pid.
But it becomes useful when there are multiple _() with separate MAGIC strings to identify BLOCKS,
and a custom-made multi-line string replacement routine instead of sub().
You must provide BLOCKS for the stack, the function list, and the execution point, as a bare minimum.
And notice that the last line contains the bar function.
This itself is just a string, but when this code repeatedly gets executed, notice that
function bar_56228_(s){print "doe"}
function bar_88128_(s){print "doe"}
...
and it keeps growing. While the example is intentionally made so that it does nothing useful,
if we provide a routine to call bar_pid_(s) instead of that print "foo" code,
suddenly it means we have eval() on our hands :-)
Now, isn't eval() useful :-)
Don't forget to provide a custom-made remove_block() function so that the code maintains
a reasonable size, instead of growing every time you execute.
Finding a way for the interpreter to accept our modified code
Normally calling a binary is trivial. However, doing so from within awk is difficult.
You may say system() is the way.
There are two problems with that:
system() may not work in some environments
it blocks while you are executing code, thus you cannot perform recursive calls and keep the user happy at the same time.
If you must use system(), ensure that it does not block.
A normal call to system("sleep 20 && echo from-sh & ") will not work.
The solution is simple:
echo ""|awk '{print "foo";E="echo ep ; sleep 20 && echo foo & disown ; "; E | getline v;close(E);print "bar";}'
Now you have an async system() call that does not block :-)
Not at the moment. However, if you provide a wrapper, it is (in a somewhat hacky and dirty way) possible.
The idea is to use the @ operator, introduced in recent versions of gawk.
This @ operator is normally used to call a function by name.
So if you had
function foo(s){ print "Called foo " s }
function bar(s){ print "Called bar " s }
{
    var = "";
    if (today_i_feel_like_calling_foo) {
        var = "foo";
    } else {
        var = "bar";
    }
    @var("arg");    # This calls function foo(), or function bar(), with "arg"
}
Now, this is useful on its own.
Assuming we know the var names beforehand, we can write a wrapper to indirectly modify and obtain vars:
function get(varname, this, call){call="get_"varname;return @call();}
function set(varname, arg, this, call){call="set_"varname; @call(arg);}
So now, for each var name you want to provide access to by name, you declare these two functions:
function get_my_var(){return my_var;}
function set_my_var(arg){my_var = arg;}
And perhaps, somewhere in your BEGIN{} block,
BEGIN{ my_var = ""; }
To declare it for global access.
Then you can use
get("my_var");
set("my_var", "whatever");
This may appear useless at first, however there are perfectly good use cases, such as
keeping a linked list of vars by holding one var's name in another var's array, and such.
It works for arrays too, and to be honest, I use this for nesting and linking arrays within
arrays, so I can walk through multiple arrays as if using pointers.
You can also write configure scripts that refer to var names inside awk this way,
in effect having an interpreter-inside-an-interpreter type of thing, too...
Not the best way to do things; however, it gets the job done, and I do not have to worry about
null pointer exceptions, or GC, and such :-)
The $ notation is not a mark for variables, as in shell, PHP, Perl etc. It is rather an operator, which receives an integer value n and returns the n-th column from the input. So, what you did in the first example is not the setting/getting of a variable dynamically but rather a call to an operator/function.
As stated by commenters, you can achieve the behavior you are looking for with arrays:
awk '{a=123; b="a"; v[b] = a; print v[b];}'
I had a similar problem to solve: loading settings from a '.ini' file. I used arrays to set the variables dynamically.
It works with awk or gawk, on Linux or Windows (GnuWin32):
gawk -v Settings_File="my_settings_file.ini" -f awk_script.awk <processing_file>
[my_settings_file.ini]
#comment
first_var=foo
second_var=bar
[awk_script.awk]
BEGIN {
    FS = "=";
    while ((getline < Settings_File) > 0) {
        if ($0 !~ /^[#;]|^(\s*)$/) {
            var_array[$1] = $2;
        }
    }
    print var_array["first_var"];
    print var_array["second_var"];
    if (var_array["second_var"] == "bar") {
        print "works!";
    }
}
{
    #more processing
}
END {
    #finish processing
}
