Split text from bash variable - bash

I have a variable which has groups of numbers. It looks like this:
foo 3
foo 5
foo 2
bar 8
bar 8
baz 2
qux 3
qux 5
...
I would like to split this data so I can work on one 'group' at a time. I feel this would be achievable with a loop somehow. The end goal is to take the mean of each group, such that I could have:
foo 3.33
bar 8.50
baz 5.00
qux 4.00
...
This mean taking has been implemented already, but I've brought it up so the context is known.
It's important to note that each group (eg. foo, bar, baz) is of arbitrary length.
How would I go about splitting up these groups?

I would use awk (tested with the GNU version, gawk, but I believe it's portable) for both the collecting and the averaging. As a standard POSIX utility, awk should be available on just about any system where bash is installed.
# print_avg.awk
{
    sums[$1]   += $2
    counts[$1] += 1
}
END {
    for (key in sums)
        print key, sums[key] / counts[key]
}
data.txt:
foo 3
foo 5
bar 8
bar 8
baz 2
qux 3
qux 5
Run it like:
$ awk -f print_avg.awk data.txt
foo 4
baz 2
qux 4
bar 8
(awk's for (key in sums) iterates in an unspecified order, which is why the groups print unsorted; pipe through sort if the order matters.)
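If the data really is in a bash variable rather than a file, you can feed the same script with a here-string, and a printf in the END block gives the two-decimal means shown in the question. A minimal sketch (the variable name data is just an assumption for illustration):

```shell
data='foo 3
foo 5
foo 2
bar 8
bar 8'

awk '
{
    sums[$1]   += $2
    counts[$1] += 1
}
END {
    for (key in sums)
        printf "%s %.2f\n", key, sums[key] / counts[key]
}' <<< "$data"
```

As above, the groups come out in whatever order the awk implementation's internal hashing produces; append | sort if you need a stable order.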


Remove initial directives from preprocessor output

I have the following in test.c:
#if 1
foo boo bar
#endif
Then I run gcc like this:
gcc -E test.c -o test.pp
This is the test.pp output:
# 1 "test.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "test.c"
foo boo bar
Is there a way, using only gcc flags, to remove these # [something] linemarkers so that the preprocessor output is (in this case) just foo boo bar?

How would you structure Alpha Nodes in a Rete Network that has a rule with two conditions found in other rules?

Let's say I have three rules:
When Object's foo property is 1, output "foo"
When Object's bar property is 1, output "bar"
When Object's foo property is 1 and bar property is 1, output "both foo and bar"
What would the structure of alpha nodes look like for this scenario? I've seen examples where, given rules 1 and 2, it might look like:
       foo == 1 - "foo"
root <
       bar == 1 - "bar"
And, given 3:
root - foo == 1 - bar == 1 - "both foo and bar"
And, given 3 and 1:
"foo"
root - foo == 1 <
bar == 1 - "both foo and bar"
Given 3, 2 and 1, would it look something like:
       foo == 1 - "foo"
root <
                  "bar"
       bar == 1 <
                  foo == 1 - "both foo and bar"
or
      foo == 1 - "foo"
     /
root -- bar == 1 - "bar"
     \
      foo == 1 - bar == 1 - "both foo and bar"
Or some other way?
If you are sharing nodes and preserving the order in which properties are tested, it would look like this:
       bar == 1 - "bar"
root <
                  "foo"
       foo == 1 <
                  bar == 1 - "both foo and bar"

Paste files conditionally with bash if and awk loop

I have a list of files that I want to paste to a master file (bar) if some awk condition is fulfilled.
for foo in *;
do
if awk '*condition* {exit 1}' $foo
then
:
else
paste $foo > bar
fi
done
However, it looks like only the last pasted file is in bar. Shouldn't paste add new columns to bar every time, without overwriting all the data completely?
File1    File2    Expected_Output    Actual_Output
1 4      1 NaN    1 4  1 NaN        1 NaN
2 5      2 7      2 5  2 7          2 7
3 6      3 8      3 6  3 8          3 8
Your paste command truncates and rewrites the file bar on every iteration of the loop (that is what the > redirection does), which is why at the end only the last file is in bar.
declare -a FILES=()
for foo in *;
do
if awk '*condition* {exit 1}' "$foo"
then
:
else
FILES+=("$foo")
fi
done
paste "${FILES[#]}" > bar
This code accumulates all filenames that match your condition in an array named FILES, and calls paste only once, expanding all filenames into individual, quoted arguments (this is what "${FILES[#]}" does) and redirecting output to the bar file.
Additionally, you can replace the whole if/then/else block with:
awk '*condition* {exit 1}' "$foo" || FILES+=("$foo")
The || expresses a condition: because Bash short-circuits logical operators, the statement on the right is only executed if awk returns a non-zero exit code.
Please note that I quoted "$foo" (when passing it to awk) in case your file names contain special characters such as spaces.
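To make the fix concrete, here is a self-contained sketch. The file names and the condition ("the second column contains NaN") are invented for illustration; substitute your real *condition*:

```shell
cd "$(mktemp -d)"
printf '1 4\n2 5\n3 6\n'    > file1
printf '1 NaN\n2 7\n3 8\n'  > file2
printf '1 NaN\n2 9\n3 10\n' > file3

declare -a FILES=()
for foo in file1 file2 file3     # enumerate explicitly so bar itself is never scanned
do
    # hypothetical condition: a NaN somewhere in the second column
    awk '$2 == "NaN" {exit 1}' "$foo" || FILES+=("$foo")
done
paste "${FILES[@]}" > bar        # single paste call, one column set per matching file
cat bar
```

Here file2 and file3 match the condition, so bar ends up with their columns side by side (tab-separated), while file1 is skipped.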

Efficient non-greedy method of returning multiple lines between patterns

I have a file like this:
bar 1
foo 1
how now
manchu 50
foo 2
brown cow
manchu 55
foo 3
the quick brown
manchu 1
bar 2
foo 1
fox jumped
manchu 8
foo 2
over the
manchu 20
foo 3
lazy dog
manchu 100
foo 4
manchu 5
foo 5
manchu 7
bar 3
bar 4
I want to search 'manchu 55' and receive:
FOONUMBER=2
(The foo # above 'manchu 55')
BARNUMBER=1
(The bar # above that foo)
PHRASETEXT="brown cow"
(The text on the line above 'manchu 55')
So I can ultimately output:
brown cow, bar 1, foo 2.
Thus far I've accomplished this with some really ugly grep code like:
FOONUMBER=`grep -e "manchu 55" -e ^" foo" -e ^"bar" | grep -B 1 "manchu 55" | grep "foo" | awk '{print $2}'`
BARNUMBER=`grep -e ^" foo $FOONUMBER" -e ^"bar" | grep -B 1 "foo $FOONUMBER" | grep "bar" | awk '{print $2}'`
PHRASETEXT=`grep -B 1 "manchu 55" | grep -v "manchu 55"`
There are 3 problems with this code:
It makes me cringe because I know it's bad
It's slow; I have to go through hundreds of thousands of entries and it's taking too long
Sometimes, as with bar 2 and foo 4 and 5 in my example, there is no text above the 'manchu' line. In that case it incorrectly returns a foo, which is not what I want.
I suspected I could do this with sed, doing something like:
FOONUMBER=`sed -n '/foo/,/manchu 55/p' | grep foo | awk '{print $2}'`
Unfortunately sed is too greedy. I've been reading on AWK and state machines, which seems like it might be a better way to do this, but I still don't understand it well enough to set it up.
As you may have gathered by now, programming is not what I do for a living; this has simply been thrust upon me. I'm hoping to rewrite what I already have to be more efficient, and hopefully not too complicated, since some other poor sod without a programming degree will probably end up supporting any changes to it at some future date.
With awk:
awk -v nManchu=55 -v OFS=", " '
    $1 == "bar" {bar = $0}    # store the most recently seen "bar" line
    $1 == "foo" {foo = $0}    # store the most recently seen "foo" line
    $1 == "manchu" && $2 == nManchu {print prev, bar, foo}
    {prev = $0}               # remember the previous line
' file
outputs
brown cow, bar 1, foo 2
Running with "nManchu=100" outputs
lazy dog, bar 2, foo 3
This has the advantage of only taking a single pass through the file, instead of parsing the file 3 times to get "bar", "foo" and the prev line.
I would suggest
sed -n '/foo/ { s/.*foo\s*//; h }; /manchu 55/ { x; p }' filename
This is very simple:
/foo/ { # if you see a line with "foo" in it,
s/.*foo\s*// # isolate the number
h # and put it in the hold buffer
}
/manchu 55/ { # if you see a line with "manchu 55" in it,
x # exchange hold buffer and pattern space
p # and print the pattern space.
}
This will then print the last number seen after a foo before the manchu 55 line. The bar number can be extracted essentially the same way, and for the phrase text you could use
sed -n '/manchu 55/ { x; p }; h'
to get the line held before manchu 55 is seen. Or possibly
sed -n '/manchu 55/ { x; p }; s/^\s*//; h'
to remove leading white spaces in such a line.
If you are certain that only one manchu 55 line exists in the file or you only want the first match, you can replace x; p with x; p; q. The q will then quit directly after the result is printed.
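Tying the pieces together: the three values can also be captured into the question's shell variables in a single awk pass. This is a sketch, not code from either answer; the | separator and the prev-reset on foo lines (which handles the "no text above the manchu" case) are my own assumptions:

```shell
cd "$(mktemp -d)"
cat > file <<'EOF'
bar 1
foo 1
how now
manchu 50
foo 2
brown cow
manchu 55
EOF

IFS='|' read -r PHRASETEXT BARNUMBER FOONUMBER < <(
    awk -v n=55 '
        $1 == "bar" {bar = $2; next}
        $1 == "foo" {foo = $2; prev = ""; next}   # reset: no phrase seen yet for this foo
        $1 == "manchu" && $2 == n {print prev "|" bar "|" foo; exit}
        {prev = $0}
    ' file
)
echo "$PHRASETEXT, bar $BARNUMBER, foo $FOONUMBER"
```

When a manchu line follows its foo directly, PHRASETEXT comes back empty instead of wrongly echoing the foo line, which addresses problem 3 above.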

Bash - omit lines starting with a mis-spelled word (using hunspell)

I have a file words.txt in which each line is a word, followed by a TAB, followed by an integer (which represents the word's frequency). I want to generate a new file containing only those lines where the word is spelled correctly.
Using cat words.txt | hunspell -1 -G > ok_words.txt I can get a list of correct words, but how can I also include the remainder of each line (ie the TAB and the number)?
Input:
adwy 27
bird 10
cat 12
dog 42
erfgq 9
fish 2
Desired Output:
bird 10
cat 12
dog 42
fish 2
The easiest way would be to use the join command:
$ join words.txt ok_words.txt
bird 10
cat 12
dog 42
fish 2
or to preserve tabs:
$ join -t $'\t' words.txt ok_words.txt
bird 10
cat 12
dog 42
fish 2
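One caveat: join expects both inputs to be sorted on the join field (true here, since the word list is alphabetical). If sorting is not guaranteed, an awk two-file lookup does the same filtering in any order. A sketch with the question's data inlined (the ok_words.txt contents stand in for what hunspell -1 -G would emit):

```shell
cd "$(mktemp -d)"
printf 'adwy\t27\nbird\t10\ncat\t12\ndog\t42\nerfgq\t9\nfish\t2\n' > words.txt
printf 'bird\ncat\ndog\nfish\n' > ok_words.txt

# First file fills the lookup table; second file prints lines whose word is in it
awk -F'\t' 'NR == FNR {ok[$1]; next} $1 in ok' ok_words.txt words.txt
```

The matching lines are printed untouched, so the TAB and the frequency survive without any extra work.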
