Computing the size of array in text file in bash - bash

I have a text file that sometimes, but not always, contains an array with a unique name, like this:
unique_array=(1,2,3,4,5,6)
I would like to find the size of the array (6 in the above example) when it exists, and skip it or return -1 if it doesn't exist.
Grepping the file tells me whether the array exists, but not its size.
The array can span multiple lines, like this:
unique_array=(1,2,3,
4,5,6,
7,8,9,10)
Some of the elements in the array can be negative as in
unique_array=(1,2,-3,
4,5,6,
7,8,-9,10)

awk -v RS=\) -F, '/unique_array=\(/ {print /[0-9]/?NF:0}' file.txt
-v RS=\) - delimit records by ) instead of newlines
-F, - delimit fields by , instead of whitespace
/unique_array=\(/ - look for a record containing the unique identifier
/[0-9]/?NF:0 - if the record contains a digit, print the number of fields (i.e. commas+1), otherwise 0
There is a bad bug in the code above: commas preceding the array may be erroneously counted. A fix is to truncate the prefix:
awk -v RS=\) -F, 'sub(/.*unique_array=\(/,"") {print /[0-9]/?NF:0}' file.txt

Your specifications are woefully incomplete, but guessing a bit as to what you are actually looking for, try this at least as a starting point.
awk '/^unique_array=\(/ { in_array = 1; n = 0; sub(/^unique_array=\(/, "") }
in_array && /\)/ { sub(/\).*/, ""); quit = 1 }
in_array { sub(/,$/, ""); n += split($0, arr, ",");
if (quit) { print n; in_array = quit = n = 0 } }' file
We keep a state variable in_array which tells us whether we are currently in a region which contains the array. This gets set to 1 when we see the beginning of the array, and back to 0 when we see the closing parenthesis. At this point, we remove the closing parenthesis and everything after it, and set a second variable quit to trigger the finishing logic in the next condition. The last condition performs two tasks; it adds the items from this line to the count in n, and then checks if quit is true; if it is, we are at the end of the array, and print the number of elements.
This will simply print nothing if the array was not found. You could embellish the script to set a different exit code or print -1 if you like, but these details seem like unnecessary complications for a simple script.
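If you do want the -1, here is a minimal sketch of that embellishment in POSIX awk. The function name array_size is made up for illustration, and the sketch assumes only the first occurrence of the array matters and that the elements themselves never contain parentheses.

```shell
# array_size is a hypothetical helper name; it reads the named file(s) or stdin.
array_size() {
    awk '
        !in_array && /unique_array=\(/ {
            in_array = 1
            sub(/.*unique_array=\(/, "")   # drop everything up to the "("
        }
        in_array {
            closed = sub(/\).*/, "")       # ")" ends the array; drop what follows
            gsub(/[ \t]/, "")              # ignore stray blanks
            sub(/,$/, "")                  # a trailing comma just continues the list
            if ($0 != "") n += split($0, parts, ",")
            if (closed) { found = 1; exit }
        }
        END { print (found ? n + 0 : -1) }  # -1 when the array never appeared
    ' "$@"
}

printf 'x unique_array=(1,2,-3,\n4,5,6,\n7,8,-9,10) y\n' | array_size   # prints 10
printf 'no array here\n' | array_size                                    # prints -1
```

An empty array, unique_array=(), comes out as 0 with this version, so it stays distinguishable from "not found".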

I think what you probably want is this, using GNU awk for multi-char RS and RT and word boundaries:
$ awk -v RS='\\<unique_array=[(][^)]*)' 'RT{exit} END{print (RT ? gsub(/,/,"",RT)+1 : -1)}' file

With your shown samples please try following awk.
awk -v RS= '
{
while(match($0,/\<unique_array=[(][^)]*\)/)){
line=substr($0,RSTART,RLENGTH)
gsub(/[[:space:]]*\n[[:space:]]*|(^|\n)unique_array=\(|(\)$|\)\n)/,"",line)
print gsub(/,/,"&",line)+1
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file

Using sed and declare -a. The test file is like this:
$ cat f
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
Testing:
$ declare -a "$(sed -n '/unique_array=(/,/)/s/,/ /gp' f | \
sed 's/.*\(unique_array\)/\1/;s/).*/)/;
s/`.*`//g')"
$ echo "${unique_array[@]}"
1 2 3 4 5 6 7 8 9 10
And then you can do whatever you want with "${unique_array[@]}"; its size is ${#unique_array[@]}.

With GNU grep or similar that support -z and -o options:
grep -zo 'unique_array=([^)]*)' file.txt | tr -dc =, | wc -c
-z - (effectively) treat file as a single line
-o - only output the match
tr -dc =, - strip everything except = and ,
wc -c - count the result
Note: both one- and zero-element arrays will be treated as being size 1. Will return 0 rather than -1 if not found.

here's an awk solution that works with gawk, mawk 1/2, and nawk :
TEST INPUT
saa
dfsaf
sdgdsag unique_array=(1,2,3,
4,5,6,
7,8,9,10) sdfgadfg
sdgs
sdgs
sfsaf(sdg)
CODE
{m,n,g}awk '
BEGIN { __ = "-1:_ERR_NOT_FOUND_"
RS = "^$" (_ = OFS = "")
FS = "(^|[ \t-\r]?)unique[_]array[=][(]"
___ = "[)].*$|[^0-9,.+-]"
} $!NF = NR < NF ? $(gsub(___,_)*_) : __'
OUTPUT
1,2,3,4,5,6,7,8,9,10

Related

awk to get first column if the a specific number in the line is greater than a digit

I have a data file (file.txt) contains the below lines:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I expect to get the first column ($1) only if the ETA= hour is greater than 15; here, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F '[,TPF=]' '{print $1}' but it prints the whole line that has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's three-argument match() with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which has 2 capturing groups whose values are saved into the array arr. If the 2nd element of arr is greater than 15, the 1st element of arr is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use string functions: index to find where ETA= starts, then substr to get the 2 characters after it (the offset is 4 because ETA= is 4 characters long and index gives the start position). I add 0 to convert to a number, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
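If that assumption is too strict, here is a sketch of the same index/substr idea that skips rows without ETA= and accepts any number of digits. The helper name eta_over and the sample rows are made up for illustration; only POSIX awk features are used.

```shell
# eta_over is a hypothetical helper: eta_over MIN [FILE...]
eta_over() {
    min=$1; shift
    awk -v min="$min" '
        { pos = index($0, "ETA=") }                        # 0 when the tag is absent
        pos && substr($0, pos + 4) + 0 > min { print $1 }  # +0 keeps only the leading digits
    ' "$@"
}

printf '%s\n' \
    '123 pro=tegs, ETA=12:00, team=xyz' \
    '345 pro=rbs, team=abc,ETA=23:00' \
    '456 team=efg, pro=bvy,ETA=9:30,dom=sss.co.uk' \
    '789 team=efg, pro=bvy' | eta_over 15
# prints 345
```

Taking the rest of the line with substr and adding 0 lets awk stop at the first non-digit, so 9:30 is read as 9 and 23:00 as 23.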
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[]) below and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
v["ETA"]+0 > 15 {
print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
FS = "[, =]+"
OFS = ","
}
{
delete v
for ( i=2; i<NF; i+=2 ) {
v[$i] = $(i+1)
}
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. Probably separately split the line to obtain the first column, space-separated.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one). Adding 0 forces a numeric conversion, which simply ignores any non-numeric text after the number at the beginning of the field; without it, a field that does not look like a number would be compared as a string. So for example, on the first line, we are actually literally checking if 12:00, team=xyz,user1=tom,dom=dby.com is larger than 15 but it effectively checks if 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
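The conversion awk applies when a string is used in arithmetic can be checked in isolation. This is only a small demo; leading_number is a made-up helper name.

```shell
# leading_number is a hypothetical helper name, used only for this demo.
leading_number() {
    # Adding 0 makes awk convert the text to a number, keeping the leading numeric part.
    printf '%s\n' "$1" | awk '{ print $0 + 0 }'
}

leading_number '12:00, team=xyz,user1=tom,dom=dby.com'   # prints 12
leading_number '23:00'                                    # prints 23
leading_number 'no digits'                                # prints 0
```

Text with no leading number converts to 0, so a row with an empty or malformed ETA can never pass a "greater than 15" test.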
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

how to find continuous blank lines and convert them to one

I have a file, a, that contains some consecutive blank lines (more than one); see below:
cat a
1
2
3
4
5
So first I want to know whether consecutive blank lines exist. I tried
cat a | grep '\n\n\n'
but it outputs nothing. So I have to do it this way:
vi a
:set list
/\n\n\n
So I want to know whether another shell command could easily do this.
Then, if there are two or more consecutive blank lines, I want to convert them to one; see below:
1
2
3
4
5
At first I tried the command below:
sed 's/\n\n\(\n\)*/\n\n/g' a
It does not work, so I tried this:
cat a | tr '\n' '$' | sed 's/$$\(\$\)*/$$/g' | tr '$' '\n'
This time it works. I would also like to know whether there is another way to do this.
Well, if your cat implementation supports
-s, --squeeze-blank
suppress repeated empty output lines
then it is as simple as
$ cat -s a
1
2
3
4
5
Also, less supports -s as well, and can number lines with -N (cat uses -n for that).
remark: lines containing only blanks will not be suppressed.
If your cat does not support -s then you could use:
awk 'NF||p; {p=NF}'
or if you want to guarantee a blank line after every record, including at the end of the output even if none was present in the input, then:
awk -v RS= -v ORS='\n\n' '1'
If your input contains lines of all white space and you want them to be treated just like lines of non white space (like cat -s does, see the comments below) then:
awk '/./||p; {p=/./}'
and to guarantee a blank line at the end of the output:
awk '/./||p; {p=/./} END{if (p) print ""}'
This GNU awk command squeezes each run of blank lines down to a single blank line:
awk -v RS= '{printf "%s%s", $0, ORS (RT ~ /\n{2,}/ ? ORS : "")}' file
1
2
3
4
5
This awk is using:
-v RS=: sets an empty record separator (paragraph mode), so records are separated by runs of one or more blank lines
printf "%s%s", $0, ORS: prints each record followed by a single line break
(RT ~ /\n{2,}/ ? ORS : ""): prints an additional line break if the matched record separator (RT, GNU awk only) contains 2 or more line breaks
You may use perl as well in slurp mode:
perl -0777 -pe 's/\R{2,}/\n\n/g' file
1
2
3
4
5
Command breakup:
-0777 Slurp mode to read entire file
's/\R{2,}/\n\n/g' Match 2 or more line breaks and replace by 2 line breaks
You can --squeeze-repeats with tr and then use sed to insert just a new line:
<a tr -s '\n' | sed 'G'
A very quick way is using awk
awk 'BEGIN{RS="";ORS="\n\n"}1'
How does this work:
awk knows the concept of records (which are lines by default), and you can define a record by its record separator RS. If you set the value of RS to an empty string, it will match any run of empty lines as a record separator. The value ORS is the output record separator. It states which separator should be printed between two consecutive records. This is set to two <newline> characters. Finally, the statement 1 is a shorthand for {print $0}, which prints the current record followed by the output record separator ORS.
note: This will, just as cat -s keep lines with only blanks as actual lines and will not suppress them.
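To see the effect on a concrete input, here is a small sketch that wraps the one-liner in a function (squeeze_blank is a made-up name) and feeds it a sample with runs of two and three blank lines.

```shell
# squeeze_blank is a hypothetical wrapper around the paragraph-mode one-liner.
squeeze_blank() {
    awk 'BEGIN { RS = ""; ORS = "\n\n" } 1' "$@"
}

# Input: 1, two blank lines, 2, 3, three blank lines, 4
printf '1\n\n\n2\n3\n\n\n\n4\n' | squeeze_blank
# Output: 1, one blank line, 2, 3, one blank line, 4, one blank line
```

Note the blank line after the final record: ORS is printed after every record, including the last one.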
Another awk solution:
awk 'NF' ORS="\n\n" a
1
2
3
4
5
It checks if the line is not empty by testing if NF (number of fields) is not zero. If it matches, the line is printed as the default action. ORS (output record separator) is set to 2 newline characters, so there is an empty line after each non-empty line.
1) awk solution
$ echo "a\n\n\nb\n\n\nc\n\n\n" | awk 'BEGIN{b=0} /^$/{b=1;next} {printf "%s%s\n", b==1?"\n":"",$0} {b=0} END{printf "%s",b==1?"\n":""}'
a
b
c
$
2) sed solution
sed '
/^$/{ ${ p; d; }; H; d; }
/^$/!{ x; s/^\(\n\{1,\}\)$/\1/; ts; Tf; }
:s { x; s/\(.*\)/\n\1/; x; s/.*//; x; p; d; }
:f { x; p; d; }
'
SED Explanation:
/^$/{ ${ p; d; }; H; d; }
--If input is blank, if it is the last line, just print, else append to the holdspace and delete the pattern space and start new cycle
/^$/!{ x; s/^\(\n\{1,\}\)$/\1/; ts; Tf; }
--If input is not blank, exchange content of the p space and h space and check if h space contains \n. if yes, jump to s, if not jump to f
:s { x; s/\(.*\)/\n\1/; x; s/.*//; x; p; d; }
--If blank lines are present in h space, then append \n to p space, then clear hold space , then print p space and delete p space
:f { x; p; d; }
--If blank lines are absent in h space, then print p space and delete p space

How to grep the last occurrence of a line pattern

I have a file with contents
x
a
x
b
x
c
I want to grep the last occurrence,
x
c
when I try
sed -n "/x/,/b/p" file
it lists all the lines, beginning x to c.
I'm not sure if I got your question right, so here are some shots in the dark:
Print last occurrence of x (regex):
grep x file | tail -1
Alternatively:
tac file | grep -m1 x
Print file from first matching line to end:
awk '/x/{flag = 1}; flag' file
Print file from last matching line to end (prints all lines in case of no match):
tac file | awk '!flag; /x/{flag = 1};' | tac
grep -A 1 x file | tail -n 2
-A 1 tells grep to print one line after a match line
with tail you get the last two lines.
or in a reversed way:
tac file | grep -B 1 x -m1 | tac
Note: You should make sure your pattern is "strong" enough so it gets you the right lines. i.e. by enclosing it with ^ at the start and $ at the end.
This might work for you (GNU sed):
sed 'H;/x/h;$!d;x' file
Saves the last x and what follows in the hold space and prints it out at end-of-file.
not sure how to do it using sed, but you can try awk
awk '{a=a"\n"$0; if ($0 == "x"){ a=$0}} END{print a}' file
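A variant of the same buffering idea, as a sketch: it takes a regex instead of the fixed string x and prints nothing when there is no match at all. The function name from_last_match is made up for illustration.

```shell
# from_last_match is a hypothetical helper: from_last_match REGEX [FILE...]
from_last_match() {
    pat=$1; shift
    awk -v pat="$pat" '
        $0 ~ pat { buf = ""; seen = 1 }   # a newer match discards the older tail
        seen     { buf = buf $0 "\n" }    # collect from the match onwards
        END      { printf "%s", buf }     # empty when nothing matched
    ' "$@"
}

printf '%s\n' x a x b x c | from_last_match '^x$'
# prints:
# x
# c
```

The whole tail after each match is held in memory until a later match replaces it, which is fine for small files but worth keeping in mind for large ones.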
POSIX vi (or ex or ed), in case it is useful to someone
Done in Command mode, of course
:set wrapscan
Go to the first line and just search Backwards!
1G?pattern
Slower way, without :set wrapscan
G$?pattern
Explanation:
G go to the last line
Move to the end of that line $
? search Backwards for pattern
The first backwards match will be the same as the last forward match
Either way, you may now delete all lines above current (match)
:1,.-1d
or
kd1G
You could also delete to the beginning of the matched line prior to the line deletions with d0 in case there were multiple matches on the same line.
POSIX awk, as suggested at
get last line from grep search on multiple files
awk '(FNR==1)&&s{print s; s=""}/PATTERN/{s=$0}END{if(s) print s}'
if you wanna do awk in truly hideous one-liner fashion but getting awk to resemble closer to functional programming paradigm syntax without having to keep track when the last occurrence is
mawk/mawk2/gawk 'BEGIN { FS = "=7713[0-9]+="; RS = "^$";
} END { print ar0[split($(0 * sub(/\n.+$/,"",$NF)), ar0, ORS)] }'
Here i'm employing multiple awk short-hands :
sub(/\n.+$/, "", $NF) # trimming all extra rows after pattern
g/sub() returns # of substitutions made, so multiplying that by 0 forces the split() to be splitting $0, the full file, instead.
split() returns # of items in the array (which is another way of saying the position of last element), so even though I've already trimmed out the trailing \n, i still can directly print ar0[split()], knowing that ORS will fill in the missing trailing \n.
That's why this code looks like i'm trying to extract array items before the array itself is defined, but due to flow of logic needed, the array will become defined by the time it reaches print.
Now if you want something simpler, these 2 also work
mawk/gawk 'BEGIN { FS="=7713[0-9]+="; RS = "^$"
} END { $NF = substr($NF, 1, index($NF, ORS));
FS = ORS; $0 = $0; print $(NF-1) }'
or
mawk/gawk '/=7713[0-9]+=/ { lst = $0 } END { print lst }'
I didn't use the same x|c requirements as OP just to showcase these work regardless of whether you need fixed-strings or regex based matches.
The above solutions only work for one single file, to print the last occurrence for many files (say with suffix .txt), use the following bash script
#!/bin/bash
for fn in *.txt
do
result=$(grep 'pattern' "$fn" | tail -n 1)
echo "$result"
done
where 'pattern' is what you would like to grep.

How to efficiently sum two columns in a file with 270,000+ rows in bash

I have two columns in a file, and I want to automate summing both values per row
for example
read write
5 6
read write
10 2
read write
23 44
I then want to sum the "read" and "write" of each row. Eventually, after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to get rid of the column headers per row, which, as stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a line.
I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line
lines=`grep -v READ $x|wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $2 + $3}'`
echo $line_num
line_num=$[$line_num+1]
arr_num=$[$arr_num+1]
done
However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
Use awk instead and take advantage of modulus function:
awk '!(NR%2){print $1+$2}' infile
awk is probably faster, but the idiomatic bash way to do this is something like:
while read -a line; do # read each line one-by-one, into an array
# use arithmetic expansion to add col 1 and 2
echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)
Note the file input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.
The <( ) process substitution is used in case variables set in the while loop are needed outside the loop. Otherwise a | pipe could be used.
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:
awk '
NR%2 == 1 {next}
NR == 2 {max = $1+$2; next}
$1+$2 > max {max = $1+$2}
END {print max}
' filename
You could also use a pipeline with tools that implicitly loop over the input like so:
grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE
This assumes there are spaces between your read and write data values.
Why not run:
awk 'NR==1 { print "sum"; next } { print $1 + $2 }'
You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.
You can use Perl or Python instead of awk if you prefer.
Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
Assuming that it's always one 'header' row followed by one 'data' row:
awk '
BEGIN{ max = 0 }
{
if( NR%2 == 0 ){
sum = $1 + $2;
if( sum > max ) { max = sum }
}
}
END{ print max }' input.txt
Or simply trim out all lines that do not conform to what you want:
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
BEGIN{ max = 0 }
{
sum = $1 + $2;
if( sum > max ) { max = sum }
}
END{ print max }'
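The filtering can also be done inside awk, so the file is read by a single process. This is a sketch under the assumption that data rows consist of two non-negative integers and everything else is a header to skip; max_sum is a made-up name.

```shell
# max_sum is a hypothetical helper: max_sum [FILE...]
max_sum() {
    awk '
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {   # data rows only; headers are skipped
            sum = $1 + $2
            if (!seen || sum > max) { max = sum; seen = 1 }
        }
        END { if (seen) print max }            # print nothing if there were no data rows
    ' "$@"
}

printf '%s\n' 'read write' '5 6' 'read write' '10 2' 'read write' '23 44' | max_sum   # prints 67
```

To put the value in a file, redirect the output, for example max_sum input.txt > max.txt (both file names are placeholders).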

Grab nth occurrence in between two patterns using awk or sed

I have an issue where I want to parse through the output from a file and I want to grab the nth occurrence of text in between two patterns preferably using awk or sed
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text in between category and done, essentially the output would be
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing by using the -n option. Gather up lines between category and done. Store a counter in the hold space and when it reaches 3 print the collection in the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
Try doing this :
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or more cryptic :
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big :
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
If your file doesn't contain any null characters, here's one way using GNU sed. This will find the third occurrence of a pattern range. However, you can easily modify this to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. Match the word 'category' at the start of a line. Swap the pattern space with the hold space and prepend a null character to the pattern space, which acts as a counter. In the example, if the pattern space then contains exactly three null characters, swap the original line back in from the hold space. Now create a loop and print the contents of the pattern space until the last pattern is matched. When this last pattern is found, sed will quit. If it's not found, sed will read the next line of input and continue in its loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
rec = rec $0 ORS
if (/^done$/) {
if (++cnt == tgt) {
printf "%s",rec
exit
}
fnd = 0
}
}
' file
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record relative to n will be off by one as the first record refers to what precedes the first RS.
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not search for the ending pattern. One way to do this, which hasn't been covered by the other answers, is with getline (note that there are some caveats with awk getline):
<file awk '
/^category$/ {
v = $0
while(!/^done$/) {
if(!getline)
exit
v = v ORS $0
}
if(++nr == n)
print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3
