bash to identify and verify file headers


Using the tab-delimited file below, I am trying to validate the header (line 1) by counting its fields and storing that number in a variable $header for use in a couple of if statements. If $header equals 10, the file has the expected number of fields; if $header is less than 10, the script should print "file is missing header for:" with the missing header fields printed underneath. The bash seems close, and the awk works perfectly by itself, but I cannot seem to use it in the if. Thank you :).
file.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
bash
for f in /home/cmccabe/Desktop/validate/*.txt; do
bname=`basename $f`
pref=${bname%%.txt}
header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output # detect header row in file and store in header and write to output
if [[ $header == "10" ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file is missing header for:" # missing header field ...in file not-validated
echo "$header"
fi # close if.... else
done >> ${pref}_output
desired output for file.txt
file has expected number of fields
desired output for file2.txt
file is missing header for:
Input

You can use awk if you like, but bash is more than capable of handling the comparison of the first-line fields on its own. If you maintain an array of expected field names, you can easily split the first line into fields, compare the count against the expected number of fields, and output the names of the missing fields whenever a file yields fewer fields than expected.
The following is a short example that takes filenames as arguments (for a large number of files you would need to read the filenames from stdin, or use xargs, as required). The script simply reads the first line of each file, separates the line into fields, checks the field count, and outputs any missing fields in a short error message:
#!/bin/bash
declare -i header=10 ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
"Chr"
"Start"
"End"
"Ref"
"Alt"
"Freq"
"Qual"
"Score"
"Input" )
for i in "$@"; do ## for each file given as argument
read -r line < "$i" ## read first line from file into 'line'
oldIFS="$IFS" ## save current Internal Field Separator (IFS)
IFS=$'\t' ## set IFS to word-split on '\t'
fldarray=( $line ); ## fill 'fldarray' with fields in line
IFS="$oldIFS" ## restore original IFS
nfields=${#fldarray[@]} ## get number of fields in 'line'
if (( nfields < header )) ## test against header
then
printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
for j in "${fields[@]}" ## for each expected field
do ## check against those in line, if not present print
[[ $line =~ $j ]] || printf " %s" "$j"
done
printf "\n\n" ## tidy up with newlines
fi
done
Example Input
$ cat dat/hdr.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
$ cat dat/hdr2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
$ cat dat/hdr3.txt
Index Chr Start End Alt Freq Qual Score Input
1 1 1 100 - 1 GOOD 10 .
2 2 20 200 C .002 STRAND BIAS 2 .
3 2 270 400 GG .036 GOOD 6 .
Example Use/Output
$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input
error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref
Look things over. While awk can do many things bash cannot do on its own, bash is more than capable of parsing text.
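One caveat about the script above: [[ $line =~ $j ]] is a substring match, so an expected name such as Ref would also count as present if the header only contained, say, a hypothetical Reference column. A minimal sketch of a whole-field comparison follows. The function name missing_headers is an assumption made for illustration, and the associative array requires bash 4 or later.

```shell
#!/usr/bin/env bash
# missing_headers FILE NAME... (hypothetical helper, not part of the answer)
# Prints every expected NAME that is not a complete field in the first,
# tab-delimited line of FILE. Needs bash >= 4 for associative arrays.
missing_headers() {
    local file=$1
    shift
    local -a found=()
    local -A seen=()
    local name

    # Split only the first line on tabs; IFS is changed for this read alone.
    IFS=$'\t' read -r -a found < "$file"

    # Remember each header name that is actually present.
    for name in "${found[@]}"; do
        seen[$name]=1
    done

    # Report the expected names that were never seen.
    for name in "$@"; do
        [[ -n ${seen[$name]:-} ]] || printf '%s\n' "$name"
    done
}
```

Called as missing_headers file2.txt Index Chr Start End Ref Alt Freq Qual Score Input, it would be expected to print Input for the sample file2.txt from the question, provided the file really is tab-delimited.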

Here is one in GNU awk (nextfile):
$ awk '
FNR==NR {
for(n=1;n<=NF;n++)
a[$n]
nextfile
}
NF==(n-1) {
print FILENAME " file has expected number of fields"
nextfile
}
{
for(i=1;i<=NF;i++)
b[$i]
print FILENAME " is missing header for: "
for(i in a)
if(i in b==0)
print i
nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for:
Input
The first file processed by the script defines the headers (in a) that the following files should have and compares them (in b) against it.

This piece of code will do exactly what you are asking. Let me know if it works for you.
for f in ./*.txt; do
[[ $( head -1 $f | awk '{ print NF}' ) -eq 10 ]] && echo "File $f has all the fields on its header" || echo "File $f is missing " $( echo "Index Chr Start End Ref Alt Freq Qual Score Input $( head -1 $f )" | tr ' ' '\n' | sort | uniq -c | awk '/1 / {print $2}' );
done
Output :
File ./file2.txt is missing Input
File ./file.txt has all the fields on its header

Related

Processing of the data from a big number of input files

My AWK script processes each log file in the folder "${results}", looks for a pattern (the number on the first line of the ranking table) and prints it on one line together with the filename of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested with up to 50k logs), but fails for a large number of input logs (e.g. 130k logs), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script so that it can process any number of input logs?

If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
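As an illustration of such a predicate, -maxdepth 1 keeps find from descending into sub-folders. It is assumed to be available here (GNU and BSD find both support it, but it is absent from older POSIX specifications). The helper name list_top_level_logs is hypothetical, and the sketch only lists the matching files rather than running awk on them.

```shell
#!/usr/bin/env bash
# list_top_level_logs DIR REP (hypothetical helper)
# Lists DIR/*_repREP.log without searching the sub-folders of DIR.
# -maxdepth is assumed to be supported by the installed find.
list_top_level_logs() {
    local dir=$1 rep=$2
    find "$dir" -maxdepth 1 -type f -name "*_rep${rep}.log"
}
```

The same -maxdepth 1 can be placed right after "$results" in the find ... -exec awk command.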

Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
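To make this concrete, the limit can be queried with getconf, and the batching that xargs performs can be observed on a tiny input. This is only an illustrative sketch: -n 2 is used to force small batches (normally xargs sizes the batches itself so that each stays under the limit), and -0 is assumed to be supported, as it is in GNU and BSD xargs.

```shell
#!/usr/bin/env bash
# Size limit, in bytes, for the arguments plus environment passed to exec().
getconf ARG_MAX

# printf is a bash builtin, so expanding a huge glob for it does not go
# through exec() and cannot trigger "Argument list too long".
# xargs then starts the command once per batch of NUL-separated names.
printf '%s\0' a.log b.log c.log d.log e.log | xargs -0 -n 2 echo batch:
```

With five names and batches of two, echo runs three times; the last run receives only e.log.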
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s\n' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'

With GNU AWK you can alter ARGC and ARGV to make it read additional files. Consider the following simple example. Let the content of filelist.txt be
file1.txt
file2.txt
file3.txt
and let the content of these files be uno, dos and tres respectively; then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: while reading the first file, i.e. while the row number within the file (FNR) equals the global row number (NR), I store each line in ARGV under the key NR+1 (ARGV[1] is already filelist.txt) and increase ARGC by 1. I then instruct GNU AWK to go to the next line, so no other action is taken. For the other files I print the filename followed by the whole line.
(tested in GNU Awk 5.0.1)

How to make that variable $f defined how much "Freq" will be printed from column number 3?

I need help with my bash script. I have a problem with this code:
for v in $(seq 1 $f)); do echo $(grep "Freq" freq.log) | awk '{print$3}')
because this command prints the whole of column 3 $f times, whereas it should print only $f values of "Freq" from column 3.
I don't know how to make the variable $f define how many "Freq" values are printed from column 3. The file contains plenty of "Freq" lines, but I need just $f of them.
For completeness, here is the whole script (the Polish messages read "The number of atoms in the given molecule is" and "The number of Freq values the script will read is"):
#!/bin/bash
e=$(grep "atomic number" freq.log | tail -1 | awk '{print$2}')
echo "Liczba atomow znajdujacyh sie w podanej czasteczce wynosi: $e"
f=$(bc <<< "($e*3-6)/3")
echo "Liczba wartosci Freq, ktore wczyta skrypt to $f"
for v in $(seq 1 $f); do
echo "$(grep "Freq" freq.log | awk '{print$3}')"
done
Sample input data file; geometry optimization calculations in GAUSSIAN
A A A
Frequencies -- 182.1477 202.8948 227.7144
Red. masses -- 6.6528 8.2622 6.3837
Frc consts -- 0.1300 0.2004 0.1950
IR Inten -- 0.8602 0.4870 1.2090
NAtoms= 35 NActive= 35 NUniq= 35 SFac= 1.00D+00 NAtFMM= 60 NAOKFM=F Big=F

Here is your bash script converted to a single awk script:
awk script script.awk
/atomic number/{ # for each line matching regEx pattern "atomic number"
e = $2; # store current 2nd field in variable e
}
/Freq/{ # for each line matching regEx pattern "Freq"
freqArr[fr++]=$3; # add 3rd field to array freqArr, increment array counter fr
}
END { # after the whole input file has been scanned
print "Liczba atomow znajdujacyh sie w podanej czasteczce wynosi: " e;
f = ( ((e * 3) - 6) / 3 ); # calculate variable f
print "Liczba wartosci Freq, ktore wczyta skrypt to " f;
for (i = 0; i < f && i < fr; i++) { # walk the first f elements of freqArr
print freqArr[i]; # print the stored 3rd field
}
}
run command
awk -f script.awk freq.log
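If you would rather stay close to the original shell script, a possible fix is to drop the for loop and cut the list of values off after $f lines with head. This is a minimal sketch, assuming that "$f values" means the third field of the first $f lines matching Freq. The function name first_freqs is made up for illustration.

```shell
#!/usr/bin/env bash
# first_freqs COUNT LOGFILE (hypothetical helper)
# Prints the 3rd field of the first COUNT lines that contain "Freq".
first_freqs() {
    local count=$1 log=$2
    grep 'Freq' "$log" | awk '{print $3}' | head -n "$count"
}
```

In the script from the question it would replace the whole loop: first_freqs "$f" freq.log.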

delete entries at certain indices in space delimited text file

I have a .txt file with numeric indices of certain 'outlier' data points, each on their own line, called by $outlier_file:
1
7
30
43
48
49
56
57
65
Using the following code, I can successfully remove certain files (volumes of neuroimaging data in this case) by using while + read.
while read outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm $DWI_path/$1/vol000*"$outlier".nii.gz;
done < $outlier_file
However, I also need to remove the numbers located at these 'outlier' indices from another text file stored in $bvec_file, which has 69 columns & 3 rows. Within each row, the numbers are space delimited. So e.g., for this example, I need to remove all 3 rows of column 1, 7, 30, etc. and then save this version with the outliers removed into a new *.txt file.
0 0.9988864166 -0.0415925034 -0.06652866169 -0.6187155495 0.2291534462 0.8892356214 0.7797364286 0.1957395685 0.9236669465 -0.5400265342 -0.3845263463 -0.4903989539 0.4863306385 -0.6496130843 0.5571164636 0.8110081715 0.9032142094 -0.3234596075 -0.1551409525 -0.806059879 0.4811597826 -0.7820757748 -0.9528881463 0.1916556621 -0.007136403284 -0.2459431735 -0.7915263574 -0.1938049261 -0.1578786349 0.8688043633 -0.5546072294 -0.4019951732 0.2806154851 0.3478762022 0.9548067252 -0.9696777541 -0.4816255837 -0.7962240023 0.6818610905 0.7097978218 0.6739686799 0.1317547111 -0.7648252249 -0.1456021218 -0.5948047487 0.0934205064 0.5268769564 -0.8618324858 -0.3721029232 -0.1827616535 0.691353613 0.4159071597 0.4605505287 0.1312199424 0.426674893 -0.4068291509 0.7167859082 0.2330824665 0.01909161256 -0.06375254731 -0.5981122948 -0.2672253674 0.6875472994 0.2302943724 0 0 0 0
0 0.04258194557 0.9988207007 0.6287131425 0.7469024143 0.5528476637 0.3024964957 0.1446931241 0.9305823612 0.1675139932 0.8208211337 0.8238722992 0.5983722761 0.4238174961 0.639429196 0.1072148887 0.5551578885 0.003337599176 0.511740508 0.9516619405 0.3851404227 0.8526321065 0.1390947346 0.2030449535 0.7759459569 0.165587903 0.9523372297 0.5801228933 0.3277276562 0.7413928896 0.442482978 0.2320585706 0.1079269171 0.1868672655 0.1606136006 0.2968573235 0.1682337977 0.8745679247 0.5989061899 0.4172933119 0.01746934331 0.5641480832 0.7455469091 0.3471016571 0.8035001467 0.5870623128 0.361107261 0.8192579877 0.4160218909 0.5651330299 0.4070513153 0.7221181184 0.714223583 0.6971767133 0.4937978446 0.4232911691 0.8011701162 0.2870385494 0.9016941521 0.09688949547 0.9086826131 0.2631932421 0.152678096 0.6295753848 0.9712458578 0 0 0 0
0 -0.02031513434 -0.02504539005 -0.7747862425 0.2435730944 0.8011542666 0.343155766 -0.6091592581 -0.3093581909 -0.3446424728 -0.1860752773 -0.4163819443 -0.6336083058 0.7641081337 -0.4112580017 -0.8234841915 0.1845683194 0.4291770641 -0.7959243273 -0.2650864686 0.449371034 -0.203724703 0.6074620459 0.2253373638 -0.6009791836 -0.9861692137 0.1804598471 0.1922068008 -0.9246806119 0.6522353256 -0.2222336438 0.7990992685 -0.9092588527 -0.9414539684 0.9236803664 0.0148272357 -0.1772637652 0.05628269894 -0.08566629406 -0.6007759525 0.7041888058 0.4769729119 0.6532997034 -0.5427364139 -0.5772239915 0.5491494803 0.9278330427 0.2263117816 -0.290121617 0.7363179158 0.8949343019 -0.02399176716 0.5629439653 -0.5493977074 -0.8596191107 -0.7992328333 0.4388809483 0.6354737076 0.3641705918 0.9951120218 0.412591228 -0.75696169 0.9514620339 -0.3618197699 0.06038199928 0 0 0 0
The furthest I've gotten with one approach is using awk to index the right columns (just printing them for now), but I can only get this to work if I call $1 (i.e., the numeric index of the first outlier column):
awk -F ' ' '{print $1}' $bvec_file
If I try to refer to the value in $outlier, it doesn't work. Instead, this prints the entire contents of $bvec_file
while read outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm $DWI_path/$1/vol000*"$outlier".nii.gz;
# Remove outlier #'s from bvec file
awk -F ' ' '{print $outlier}' $bvec_file
done < $outlier_file
I am completely stuck on how to get this done. Any advice would be greatly appreciated.

To delete the outliers from bvec_file after the loop and only delete the ones where the associated file was successfully removed:
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
while IFS= read -r outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm "$DWI_path/$1"/vol000*"$outlier".nii.gz &&
echo "$outlier"
done < "$outlier_file" |
awk '
NR==FNR { os[$0]; next }
{
for (o in os) {
$o=""
}
$0=$0; $1=$1
}
1' - "$bvec_file" > "$tmp" &&
mv "$tmp" "$bvec_file"
Or to delete the outliers one at a time as the files are removed:
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
while IFS= read -r outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm "$DWI_path/$1"/vol000*"$outlier".nii.gz &&
# Remove outlier #'s from bvec file
awk -v o="$outlier" '{$o=""; $0=$0; $1=$1} 1' "$bvec_file" > "$tmp" &&
mv "$tmp" "$bvec_file"
done < <(sort -rnu "$outlier_file")
Always quote your shell variables (see https://mywiki.wooledge.org/Quotes). The && at the end of each line ensures that the next command only runs if the previous commands succeeded.
The magical incantation in the awk script does the following. Let's say your input is a b c and the outlier field is field number 2, b:
$ echo 'a b c'
a b c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; print NF ":", $0}'
3: a c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; $0=$0; print NF ":", $0}'
2: a c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; $0=$0; $1=$1; print NF ":", $0}'
2: a c
The $o="" sets the field value to null, the $0=$0 forces awk to resplit $0 into fields so it effectively deletes field 2 (as opposed to the previous step, which set it to null but left it in existence), and the $1=$1 rebuilds $0 from its fields, replacing every FS (any contiguous chain of white space chars, including the 2 blanks now between a and c) with OFS (a single blank char).
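The second script reads the indices through sort -rnu, i.e. highest first. Here is a short sketch of why that ordering matters when fields are deleted one at a time: once a field is removed, every field to its right moves one position to the left, so ascending order would hit the wrong columns. The wrapper name drop_field is hypothetical.

```shell
#!/usr/bin/env bash
# drop_field N (hypothetical wrapper around the awk one-liner above)
# Deletes field N from every line read on stdin.
drop_field() {
    awk -v o="$1" '{$o=""; $0=$0; $1=$1} 1'
}

# Goal: delete the original fields 1 and 3 ("a" and "c") from "a b c d".
echo 'a b c d' | drop_field 1 | drop_field 3   # ascending: prints "b c" (wrong, "d" was hit)
echo 'a b c d' | drop_field 3 | drop_field 1   # descending: prints "b d"
```

The first script does not have this problem because it blanks all outlier fields before resplitting the line.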

bash routine to return the page number of a given line number from text file

Consider a plain text file containing page-breaking ASCII control character "Form Feed" ($'\f'):
alpha\n
beta\n
gamma\n\f
one\n
two\n
three\n
four\n
five\n\f
earth\n
wind\n
fire\n
water\n\f
Note that each page has a random number of lines.
I need a bash routine that returns the page number of a given line number from a text file containing the page-breaking ASCII control character.
After a long time researching the solution I finally came across this piece of code:
function get_page_from_line
{
local nline="$1"
local input_file="$2"
local npag=0
local ln=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( ++npag ))
ln=$(echo -n "$page" | wc -l)
total=$(( total + ln ))
if [ $total -ge $nline ]; then
echo "${npag}"
return
fi
done < "$input_file"
echo "0"
return
}
But, unfortunately, this solution proved to be very slow in some cases.
Any better solution ?
Thanks!

The idea to use read -d $'\f' and then to count the lines is good.
This version might appear inelegant: if nline is less than or equal to the number of lines in the file, the file is read twice (once by wc and once, partially, by head).
Give it a try, because it is super fast:
function get_page_from_line ()
{
local nline="${1}"
local input_file="${2}"
if [[ $(wc -l "${input_file}" | awk '{print $1}') -lt nline ]] ; then
printf "0\n"
else
printf "%d\n" $(( $(head -n ${nline} "${input_file}" | grep -c "^"$'\f') + 1 ))
fi
}
Performance of awk is better than the above bash version. awk was created for such text processing.
Give this tested version a try:
function get_page_from_line ()
{
awk -v nline="${1}" '
BEGIN {
npag=1;
}
{
if (index($0,"\f")>0) {
npag++;
}
if (NR==nline) {
print npag;
linefound=1;
exit;
}
}
END {
if (!linefound) {
print 0;
}
}' "${2}"
}
When \f is encountered, the page number is increased.
NR is the current line number.
----
For the record, there is another bash version.
This version uses only built-in commands to count the lines in the current page.
The speedtest.sh that you provided in the comments showed it to be a little bit ahead (20 sec approx.), which makes it roughly equivalent to your version:
function get_page_from_line ()
{
local nline="$1"
local input_file="$2"
local npag=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( npag + 1 ))
IFS=$'\n'
for line in ${page}
do
total=$(( total + 1 ))
if [[ total -eq nline ]] ; then
printf "%d\n" ${npag}
unset IFS
return
fi
done
unset IFS
done < "$input_file"
printf "0\n"
return
}

awk to the rescue!
awk -v RS='\f' -v n=09 '$0~"^"n"." || $0~"\n"n"." {print NR}' file
3
updated anchoring as commented below.
$ for i in $(seq -w 12); do awk -v RS='\f' -v n="$i" \
'$0~"^"n"." || $0~"\n"n"." {print n,"->",NR}' file; done
01 -> 1
02 -> 1
03 -> 1
04 -> 2
05 -> 2
06 -> 2
07 -> 2
08 -> 2
09 -> 3
10 -> 3
11 -> 3
12 -> 3

A script of similar length can be written in bash itself to locate and respond to the <form-feed> characters embedded in a file. (It will work in a POSIX shell as well, with a substitute for the string indexing and expr for the math.) For example,
#!/bin/bash
declare -i ln=1 ## line count
declare -i pg=1 ## page count
fname="${1:-/dev/stdin}" ## read from file or stdin
printf "\nln:pg text\n" ## print header
while IFS= read -r l; do ## read each line, keeping leading whitespace
if [ "${l:0:1}" = $'\f' ]; then ## if form-feed found (quoted so empty lines are safe)
((pg++))
printf "<ff>\n%2s:%2s '%s'\n" "$ln" "$pg" "${l:1}"
else
printf "%2s:%2s '%s'\n" "$ln" "$pg" "$l"
fi
((ln++))
done < "$fname"
Example Input File
The simple input file with embedded <form-feed>'s was created with:
$ echo -e "a\nb\nc\n\fd\ne\nf\ng\nh\n\fi\nj\nk\nl" > dat/affex.txt
Which when output gives:
$ cat dat/affex.txt
a
b
c
d
e
f
g
h
i
j
k
l
Example Use/Output
$ bash affex.sh <dat/affex.txt
ln:pg text
1: 1 'a'
2: 1 'b'
3: 1 'c'
<ff>
4: 2 'd'
5: 2 'e'
6: 2 'f'
7: 2 'g'
8: 2 'h'
<ff>
9: 3 'i'
10: 3 'j'
11: 3 'k'
12: 3 'l'

With Awk, you can set RS (the record separator, default newline) to form feed (\f) and FS (the input field separator, default any sequence of horizontal whitespace) to newline (\n), and obtain the number of lines as the number of "fields" in a "record", which is a "page".
The placement of form feeds in your data will produce some empty lines within a page so the counts are off where that happens.
awk -F '\n' -v RS='\f' '{ print NF }' file
You could reduce the number by one if $NF == "", and perhaps pass in the number of the desired page as a variable:
awk -F '\n' -v RS='\f' -v p="2" 'NR==p { print NF - ($NF == "") }' file
To obtain the page number for a particular line, just feed head -n number to the script, or loop over the numbers until you have accrued the sum of lines.
line=1
page=1
for count in $(awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' file); do
old=$line
((line += count))
echo "Lines $old through $((line - 1)) are on page $page"
((page++))
done
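The head -n variant mentioned above could look like the following sketch. It assumes, as in the question's sample, that each form feed sits at the start of the first line of a new page. The function name page_of_line is hypothetical, and no check is made that the file actually has that many lines.

```shell
#!/usr/bin/env bash
# page_of_line LINE FILE (hypothetical helper)
# Keeps the first LINE lines and counts the form-feed-separated records
# in them; the last record is the page that contains line LINE.
page_of_line() {
    local nline=$1 file=$2
    head -n "$nline" "$file" | awk -v RS='\f' 'END { print NR }'
}
```

For the sample file in the question, page_of_line 4 file would report page 2, because line 4 is the first line after the first form feed.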

This gnu awk script prints the "page" for the linenumber given as command line argument:
BEGIN { ffcount=1;
search = ARGV[2]
delete ARGV[2]
if (!search ) {
print "Please provide linenumber as argument"
exit(1);
}
}
$1 ~ search { printf( "line %s is on page %d\n", search, ffcount) }
/[\f]/ { ffcount++ }
Use it like awk -f formfeeds.awk formfeeds.txt 05 where formfeeds.awk is the script, formfeeds.txt is the file and '05' is a linenumber.
The BEGIN rule deals mostly with the command line argument. The other rules are simple rules:
$1 ~ search applies when the first field matches the commandline argument stored in search
/[\f]/ applies when there is a formfeed

Bash script, command - output to array, then print to file

I need advice on how to achieve this output:
myoutputfile.txt
Tom Hagen 1892
State: Canada
Hank Moody 1555
State: Cuba
J.Lo 156
State: France
output of mycommand:
/usr/bin/mycommand
Tom Hagen
1892
Canada
Hank Moody
1555
Cuba
J.Lo
156
France
I'm trying to achieve this with the following shell script:
IFS=$'\r\n' GLOBIGNORE='*' :; names=( $(/usr/bin/mycommand) )
for name in ${names[@]}
do
#echo $name
echo ${name[0]}
#echo ${name:0}
done
Thanks

Assuming you can always rely on the command to output groups of 3 lines, one option might be
/usr/bin/mycommand |
while read name;
read year;
read state; do
echo "$name $year"
echo "State: $state"
done
An array isn't really necessary here.
One improvement could be to exit the loop if you don't get all three required lines:
while read name && read year && read state; do
# Guaranteed that name, year, and state are all set
...
done
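Filled in, the loop above might look like this sketch. The function name format_records is hypothetical, and printf stands in for /usr/bin/mycommand so that the example can be run as is. read -r is used so that backslashes in the input are kept literally; with the default IFS, read also strips leading and trailing blanks from each line.

```shell
#!/usr/bin/env bash
# format_records (hypothetical helper): reads groups of three lines
# (name, year, state) on stdin and prints them in the requested layout.
# An incomplete trailing group is silently dropped.
format_records() {
    local name year state
    while read -r name && read -r year && read -r state; do
        printf '%s %s\nState: %s\n' "$name" "$year" "$state"
    done
}

# Sample input standing in for the output of /usr/bin/mycommand.
printf '%s\n' '  Tom Hagen' '  1892' '  Canada' '  Hank Moody' '  1555' '  Cuba' |
    format_records
```

To write the result to a file, redirect the function's output, e.g. /usr/bin/mycommand | format_records > myoutputfile.txt.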

An easy one-liner (not tuned for performance):
/usr/bin/mycommand | xargs -d '\n' -L3 printf "%s %s\nState: %s\n"
It reads 3 lines at a time from the pipe and then passes them to a new instance of printf which is used to format the output.
If you have whitespace at the beginning (it looks like that in your example output), you may need to use something like this:
/usr/bin/mycommand | sed -e 's/^\s*//g' | xargs -d '\n' -L3 printf "%s %s\nState: %s\n"

#!/bin/bash
COUNTER=0
/usr/bin/mycommand | while read LINE
do
if [ $COUNTER = 0 ]; then
NAME="$LINE"
COUNTER=$(($COUNTER + 1))
elif [ $COUNTER = 1 ]; then
YEAR="$LINE"
COUNTER=$(($COUNTER + 1))
elif [ $COUNTER = 2 ]; then
STATE="$LINE"
COUNTER=0
echo "$NAME $YEAR"
echo "State: $STATE"
fi
done

chepner's pure bash solution is simple and elegant, but slow with large input files (loops in bash are slow).
Michael Jaros' solution is even simpler, if you have GNU xargs (verify with xargs --version), but also does not perform well with large input files (external utility printf is called once for every 3 input lines).
If performance matters, try the following awk solution:
/usr/bin/mycommand | awk '
{ ORS = (NR % 3 == 1 ? " " : "\n")
gsub("^[[:blank:]]+|[[:blank:]]*\r?$", "") }
{ print (NR % 3 == 0 ? "State: " : "") $0 }
' > myoutputfile.txt
NR % 3 gives the position of each input line within its group of 3 consecutive lines: 1 for the 1st line, 2 for the 2nd, and 0(!) for the 3rd.
{ ORS = (NR % 3 == 1 ? " " : "\n") determines ORS, the output-record separator, based on that index: a space for line 1, and a newline for lines 2 and 3; the space ensures that line 2 is appended to line 1 with a space when using print.
gsub("^[[:blank:]]+|[[:blank:]]*\r?$", "") strips leading and trailing whitespace from the line - including, if present, a trailing \r, which your input seems to have.
{ print (NR % 3 == 0 ? "State: " : "") $0 } prints the trimmed input line, prefixed by "State: " only for every 3rd input line, and implicitly followed by ORS (due to use of print).
