delete entries at certain indices in space delimited text file - bash

I have a .txt file, referenced by $outlier_file, with the numeric indices of certain 'outlier' data points, each on its own line:
1
7
30
43
48
49
56
57
65
Using while + read in the following code, I can successfully remove the corresponding files (volumes of neuroimaging data in this case):
while read outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm $DWI_path/$1/vol000*"$outlier".nii.gz;
done < $outlier_file
However, I also need to remove the numbers located at these 'outlier' indices from another text file stored in $bvec_file, which has 69 columns & 3 rows. Within each row, the numbers are space delimited. So, e.g., for this example I need to remove columns 1, 7, 30, etc. from all 3 rows, and then save this version with the outliers removed into a new *.txt file.
0 0.9988864166 -0.0415925034 -0.06652866169 -0.6187155495 0.2291534462 0.8892356214 0.7797364286 0.1957395685 0.9236669465 -0.5400265342 -0.3845263463 -0.4903989539 0.4863306385 -0.6496130843 0.5571164636 0.8110081715 0.9032142094 -0.3234596075 -0.1551409525 -0.806059879 0.4811597826 -0.7820757748 -0.9528881463 0.1916556621 -0.007136403284 -0.2459431735 -0.7915263574 -0.1938049261 -0.1578786349 0.8688043633 -0.5546072294 -0.4019951732 0.2806154851 0.3478762022 0.9548067252 -0.9696777541 -0.4816255837 -0.7962240023 0.6818610905 0.7097978218 0.6739686799 0.1317547111 -0.7648252249 -0.1456021218 -0.5948047487 0.0934205064 0.5268769564 -0.8618324858 -0.3721029232 -0.1827616535 0.691353613 0.4159071597 0.4605505287 0.1312199424 0.426674893 -0.4068291509 0.7167859082 0.2330824665 0.01909161256 -0.06375254731 -0.5981122948 -0.2672253674 0.6875472994 0.2302943724 0 0 0 0
0 0.04258194557 0.9988207007 0.6287131425 0.7469024143 0.5528476637 0.3024964957 0.1446931241 0.9305823612 0.1675139932 0.8208211337 0.8238722992 0.5983722761 0.4238174961 0.639429196 0.1072148887 0.5551578885 0.003337599176 0.511740508 0.9516619405 0.3851404227 0.8526321065 0.1390947346 0.2030449535 0.7759459569 0.165587903 0.9523372297 0.5801228933 0.3277276562 0.7413928896 0.442482978 0.2320585706 0.1079269171 0.1868672655 0.1606136006 0.2968573235 0.1682337977 0.8745679247 0.5989061899 0.4172933119 0.01746934331 0.5641480832 0.7455469091 0.3471016571 0.8035001467 0.5870623128 0.361107261 0.8192579877 0.4160218909 0.5651330299 0.4070513153 0.7221181184 0.714223583 0.6971767133 0.4937978446 0.4232911691 0.8011701162 0.2870385494 0.9016941521 0.09688949547 0.9086826131 0.2631932421 0.152678096 0.6295753848 0.9712458578 0 0 0 0
0 -0.02031513434 -0.02504539005 -0.7747862425 0.2435730944 0.8011542666 0.343155766 -0.6091592581 -0.3093581909 -0.3446424728 -0.1860752773 -0.4163819443 -0.6336083058 0.7641081337 -0.4112580017 -0.8234841915 0.1845683194 0.4291770641 -0.7959243273 -0.2650864686 0.449371034 -0.203724703 0.6074620459 0.2253373638 -0.6009791836 -0.9861692137 0.1804598471 0.1922068008 -0.9246806119 0.6522353256 -0.2222336438 0.7990992685 -0.9092588527 -0.9414539684 0.9236803664 0.0148272357 -0.1772637652 0.05628269894 -0.08566629406 -0.6007759525 0.7041888058 0.4769729119 0.6532997034 -0.5427364139 -0.5772239915 0.5491494803 0.9278330427 0.2263117816 -0.290121617 0.7363179158 0.8949343019 -0.02399176716 0.5629439653 -0.5493977074 -0.8596191107 -0.7992328333 0.4388809483 0.6354737076 0.3641705918 0.9951120218 0.412591228 -0.75696169 0.9514620339 -0.3618197699 0.06038199928 0 0 0 0
The furthest I've gotten with one approach is using awk to index the right columns (just printing them for now), but I can only get this to work if I hard-code $1 (i.e., the numeric index of the first outlier column):
awk -F ' ' '{print $1}' $bvec_file
If I try to refer to the value in $outlier, it doesn't work; instead, this prints the entire contents of $bvec_file:
while read outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm $DWI_path/$1/vol000*"$outlier".nii.gz;
# Remove outlier #'s from bvec file
awk -F ' ' '{print $1}' $bvec_file
done < $outlier_file
I am completely stuck on how to get this done. Any advice would be greatly appreciated.

To delete the outliers from $bvec_file after the loop, and only the ones whose associated file was successfully removed:
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
while IFS= read -r outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm "$DWI_path/$1"/vol000*"$outlier".nii.gz &&
echo "$outlier"
done < "$outlier_file" |
awk '
NR==FNR { os[$0]; next }
{
for (o in os) {
$o=""
}
$0=$0; $1=$1
}
1' - "$bvec_file" > "$tmp" &&
mv "$tmp" "$bvec_file"
Or to delete the outliers one at a time as the files are removed:
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
while IFS= read -r outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm "$DWI_path/$1"/vol000*"$outlier".nii.gz &&
# Remove outlier #'s from bvec file
awk -v o="$outlier" '{$o=""; $0=$0; $1=$1} 1' "$bvec_file" > "$tmp" &&
mv "$tmp" "$bvec_file"
done < <(sort -rnu "$outlier_file")
Always quote your shell variables (see https://mywiki.wooledge.org/Quotes). The && at the end of each line ensures the next command only runs if the previous commands succeeded.
The magical incantation in the awk script does the following - let's say your input is a b c and the outlier field is field number 2, b:
$ echo 'a b c'
a b c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; print NF ":", $0}'
3: a c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; $0=$0; print NF ":", $0}'
2: a c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; $0=$0; $1=$1; print NF ":", $0}'
2: a c
The o="" sets the field value to null, the $0=$0 forces awk to resplit $0 into fields so it effectively deletes field 2 (as opposed to the previous step which set it to null but it still existed as such), and the $1=$1 recombines $0 from it's fields replacing every FS (any contiguous chain of white space chars including the 2 blanks now between a and c) with OFS (a single blank char).

Related

Processing data from a large number of input files

My AWK script processes each log file from the folder "${results}", looking for a pattern (a number found on the first line of the ranking table) and printing it on one line together with the filename of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested with up to 50k logs), but does not work for a large number of input logs (e.g. 130k), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script to be able to process any number of input logs?
If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
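The -exec ... + terminator makes find batch as many filenames as will fit under the ARG_MAX limit into each awk invocation, so awk is executed only a handful of times no matter how many log files there are.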
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
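For reference, you can query the limit itself with getconf:
$ getconf ARG_MAX    # prints the limit in bytes, e.g. 2097152 on many Linux systems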
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
or if you need awk to process all input files in one call for some reason (xargs may split the list across several awk invocations, so per-run state would not span all files) and your file names don't contain newlines:
printf '%s\n' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'
When using GNU AWK you can alter ARGC and ARGV to instruct GNU AWK to read additional files. Consider the following simple example: let filelist.txt contain
file1.txt
file2.txt
file3.txt
and let the content of these files be, respectively, uno, dos and tres; then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: when reading the first file, i.e. while the line number within the file (FNR) equals the global line number (NR), I add the line to ARGV under the key NR+1 (ARGV[1] is already filelist.txt) and increase ARGC by 1; next then makes GNU AWK go to the next line so no other action is taken. For the other files I print the filename followed by the whole line.
(tested in GNU Awk 5.0.1)
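To reproduce the example, the three data files and the file list can be created like this (same names and contents as above):
printf '%s\n' file1.txt file2.txt file3.txt > filelist.txt
echo uno > file1.txt
echo dos > file2.txt
echo tres > file3.txt
after which the awk one-liner above prints the three filename/content pairs.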

Process code lines within a specific pattern boundary from a text file separately?

I have used this script to extract all the occurrences of function data between the name __libc_memalign and : from "file1" into "file2". Now file2 contains multiple (6000) groups of code within this pattern. How can I iterate through each group in "file2" and process each group?
awk '/__libc_memalign/ {p=1;print;next} /:/ && p {p=0;print} p' file1.out >file2
sample input
0 0xc40840 : __libc_memalign
0 0x40bac0 0x7ffe493d0d50 W
0 0x40bac2 0x7ffe493d0d48 W
0 0x40bac4 0x7ffe493d0d40 W
..
0 0xc40840 : __libc_memalign
0 0x40bac0 0x7ffe493d0d50 R
0 0x40bac2 0x7ffe493d0d48 R
0 0x40bac4 0x7ffe493d0d40 R
....
0 0xc40840 : __libc_memalign
0 0x40bab0 0x7ffe493b0d50 W
0 0x40bab2 0x7ffe493dbd48 R
0 0x40bac4 0x7ffe493d0d40 W
It's not really clear what you mean by "group" or "process" but hopefully at least this could nudge you in the right direction.
Assuming there are no empty lines in your groups, add a separator between them; then loop over sequences between empty lines. Your Awk script already seems to put an empty line when it finishes a group, so you can simply
awk '/__libc_memalign/ {p=1; print; next}
/:/ && p {p=0; print} p' file1.out |
while true; do
while read -r line; do
case $line in '') break;; esac
echo "$line"
done |
# Pipe the collected group into "process"
process
done
This is fairly clumsy, and can probably be refactored significantly. If you don't particularly need the intermediate results, maybe simply
awk '/__libc_memalign/ {
p=1; cmd = "process" print | cmd; next}
/:/ && p { p=0; close(cmd) }
p { print | cmd }' file1.out
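If "process each group" instead means giving each group its own file, a simple variant (a sketch; the group*.txt names are illustrative) is to split file2 at the header lines, since every group starts with a __libc_memalign line:
awk '/__libc_memalign/ { if (out) close(out); out = "group" ++n ".txt" }
     out { print > out }' file2
Each groupN.txt can then be processed independently.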

bash to identify and verify file headers

Using the tab-delimited file below I am trying to validate header line 1 and then store that field count in a variable $header to use in a couple of if statements. If $header equals 10 then the file has the expected number of fields, but if $header is less than 10 the file is missing headers, and the missing header fields are printed underneath. The bash seems close, and if I use the awk by itself it seems to work perfectly, but I cannot seem to use it in the if. Thank you :).
file.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
bash
for f in /home/cmccabe/Desktop/validate/*.txt; do
bname=`basename $f`
pref=${bname%%.txt}
header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output # detect header row in file and store in header and write to output
if [[ $header == "10" ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file is missing header for:" # missing header field ...in file not-validated
echo "$header"
fi # close if.... else
done >> ${pref}_output
desired output for file.txt
file has expected number of fields
desired output for file2.txt
file is missing header for:
Input
You can use awk if you like, but bash is more than capable of handling the first line fields comparison on its own. If you maintain an array of expected field names, you can then easily split the first line into fields, compare against the expected number of fields, and output the identity of the missing field if you read less than the expected number of fields from any given file.
The following is a short example that takes filenames as arguments (you need to take filenames from stdin for a large number of files, or use xargs, as required). The script simply reads the first line in each file, separates the line into fields, checks the field count, and outputs any missing fields in a short error message:
#!/bin/bash
declare -i header=10 ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
"Chr"
"Start"
"End"
"Ref"
"Alt"
"Freq"
"Qual"
"Score"
"Input" )
for i in "$#"; do ## for each file given as argument
read -r line < "$i" ## read first line from file into 'line'
oldIFS="$IFS" ## save current Internal Field Separator (IFS)
IFS=$'\t' ## set IFS to word-split on '\t'
fldarray=( $line ); ## fill 'fldarray' with fields in line
IFS="$oldIFS" ## restore original IFS
nfields=${#fldarray[@]} ## get number of fields in 'line'
if (( nfields < header )) ## test against header
then
printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
for j in "${fields[#]}" ## for each expected field
do ## check against those in line, if not present print
[[ $line =~ $j ]] || printf " %s" "$j"
done
printf "\n\n" ## tidy up with newlines
fi
done
Example Input
$ cat dat/hdr.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
$ cat dat/hdr2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
$ cat dat/hdr3.txt
Index Chr Start End Alt Freq Qual Score Input
1 1 1 100 - 1 GOOD 10 .
2 2 20 200 C .002 STRAND BIAS 2 .
3 2 270 400 GG .036 GOOD 6 .
Example Use/Output
$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input
error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref
Look things over; while awk can do many things bash cannot on its own, bash is more than capable of parsing text.
Here is one in GNU awk (nextfile):
$ awk '
FNR==NR {
for(n=1;n<=NF;n++)
a[$n]
nextfile
}
NF==(n-1) {
print FILENAME " file has expected number of fields"
nextfile
}
{
for(i=1;i<=NF;i++)
b[$i]
print FILENAME " is missing header for: "
for(i in a)
if(i in b==0)
print i
nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for:
Input
The first file processed by the script defines the headers (in a) that the following files should have and compares them (in b) against it.
This piece of code will do exactly what you are asking. Let me know if it works for you.
for f in ./*.txt; do
[[ $( head -1 $f | awk '{ print NF}' ) -eq 10 ]] && echo "File $f has all the fields on its header" || echo "File $f is missing " $( echo "Index Chr Start End Ref Alt Freq Qual Score Input $( head -1 $f )" | tr ' ' '\n' | sort | uniq -c | awk '/1 / {print $2}' );
done
Output :
File ./file2.txt is missing Input
File ./file.txt has all the fields on its header
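The same pipeline, unrolled for readability (functionally identical):
for f in ./*.txt; do
    if [[ $( head -1 "$f" | awk '{print NF}' ) -eq 10 ]]; then
        echo "File $f has all the fields on its header"
    else
        # every header present in both the expected list and the file shows up
        # twice; 'uniq -c' therefore marks the missing ones with a count of 1
        echo "File $f is missing " $( echo "Index Chr Start End Ref Alt Freq Qual Score Input $( head -1 "$f" )" |
            tr ' ' '\n' | sort | uniq -c | awk '/1 / {print $2}' )
    fi
done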

Sorting on multiple columns w/ an output file per key

I'm uncertain as to how I can use the until loop inside a while loop.
I have an input file of 500,000 lines that look like this:
9 1 1 0.6132E+02
9 2 1 0.6314E+02
10 3 1 0.5874E+02
10 4 1 0.5266E+02
10 5 1 0.5571E+02
1 6 1 0.5004E+02
1 7 1 0.5450E+02
2 8 1 0.5696E+02
11 9 1 0.6369E+02
.....
And what I'm hoping to achieve is to sort the numbers in the first column in numerical order so that I can pull all the similar lines (e.g. lines that start with the same number) into new text files "cluster${i}.txt". From there I want to sort the fourth column of each "cluster${i}.txt" file in numerical order. After sorting I would like to write the first row of each sorted "cluster${i}.txt" file into a single output file. A sample output of "cluster1.txt" would look like this:
1 6 1 0.5004E+02
1 7 1 0.5450E+02
1 11 1 0.6777E+02
....
as well as an output.txt file that would look like this:
1 6 1 0.5004E+02
2 487 1 0.3495E+02
3 34 1 0.0344E+02
....
Here is what I've written:
#!/bin/bash
input='input.txt'
i=1
sort -nk 1 $input > 'temp.txt'
while read line; do
awk -v var="$i" '$1 == var' temp.txt > "cluster${i}.txt"
until [[$i -lt 20]]; do
i=$((i+1))
done
done
for f in *.txt; do
sort -nk 4 > temp2.txt
head -1 temp2.txt
rm temp2.txt
done > output.txt
This only takes one line, if your sort -n knows how to handle exponential notation:
sort -nk 1,4 <in.txt | awk '{ of="cluster" $1 ".txt"; print $0 >>of }'
...or, to also write the first line for each index to output.txt:
sort -nk 1,4 <in.txt | awk '
{
if($1 != last) {
print $0 >"output.txt"
last=$1
}
of="cluster" $1 ".txt";
print $0 >of
}'
Consider using an awk implementation -- such as GNU awk -- which will cache file descriptors, rather than reopening each output file for every append; this will greatly improve performance.
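If your awk does not cache descriptors, the sorted input still lets you keep just one output file open at a time: close the previous cluster file whenever the key in column 1 changes (a sketch of that idea):
sort -nk 1,4 <in.txt | awk '
    $1 != last { if (of) close(of); of = "cluster" $1 ".txt"; last = $1 }
    { print > of }'
Because the input is sorted on column 1, each cluster file is opened and closed exactly once.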
By the way, let's look at what was wrong with the original script:
It was slow. Really, really slow.
Starting a new instance of awk 20 times for every line of input (because the whole point of while read is to iterate over individual lines, so putting an awk inside a while read is going to run awk at least once per line) is going to have a very appreciable impact on performance. Not that it was actually doing this, because...
The while read line outer loop was reading from stdin, not temp.txt or input.txt.
Thus, the script was hanging if stdin didn't have anything written on it, or wasn't executing the contents of the loop at all if stdin pointed to a source with no content like /dev/null.
The inner loop wasn't actually processing the line read by the outer loop. line was being read, but all of temp.txt was being operated on.
The awk wasn't actually inside the inner loop, but rather was inside the outer loop, just before the inner loop. Consequently, it wasn't being run 20 times with different values for i, but run only once per line read, with whichever value for i was left over from previously executed code.
Whitespace is important to how commands are parsed. [[foo]] is wrong; it needs to be [[ foo ]].
To "fix" the inner loop, to do what I imagine you meant to write, might look like this:
# this is slow and awful, but at least it'll work.
while IFS= read -r line; do
i=0
until [[ $i -ge 20 ]]; do
awk -v var="$i" '$1 == var' <<<"$line" >>"cluster${i}.txt"
i=$((i+1))
done
done <temp.txt
...or, somewhat better (but still not as good as the solution suggested at the top):
# this is a somewhat less awful.
for (( i=0; i<=20; i++ )); do
awk -v var="$i" '$1 == var' <temp.txt >"cluster${i}.txt"
head -n 1 "cluster${i}.txt"
done >output.txt
Note how the redirection to output.txt is done just once, for the whole loop -- this means we're only opening the file once.

Find nth row using AWK and assign it to a variable

Okay, I have two files: one is a baseline and the other is a generated report. I have to validate that a specific string in both files matches; it is not just a single word, see the example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criterion here is to find a unique number such as "56633223223" and retrieve 1 line above and 3 lines below it. I can do that on both the base file and the report, and then compare whether they match. On the whole I need a shell script for this.
Since the strings above and below are unique but the line counts vary, I have put them in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
echo "all good"
else
echo "error"
fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
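For example, feeding it one well-formed line and one two-column line (illustrative input) exercises both branches:
printf '%s\n' '56633223223 1 5' '56633223224 1' |
while read -r word1 word2 word3 junk; do
    if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
        echo "all good"
    else
        echo "error"
    fi
done
which prints all good followed by error.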
I PROMISE you, you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any reasonable solution to what it sounds like your problem is, so it's hard to suggest anything sensible.
If you tell us what you are trying to do AS A WHOLE with sample input and expected output for your whole process then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now lets take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
numPre[$1] = $2
numSuc[$1] = $3
}
ARGIND==2 {
oldLine[FNR] = $0
if ($0 in numPre) {
oldHitFnr[$0] = FNR
}
}
ARGIND==3 {
newLine[FNR] = $0
if ($0 in numPre) {
newHitFnr[$0] = FNR
}
}
END {
for (str in numPre) {
if ( str in oldHitFnr ) {
if ( str in newHitFnr ) {
for (i=-numPre[str]; i<=numSuc[str]; i++) {
oldFnr = oldHitFnr[str] + i
newFnr = newHitFnr[str] + i
if (oldLine[oldFnr] != newLine[newFnr]) {
print str, "mismatch at old line", oldFnr, "new line", newFnr
print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
}
}
}
else {
print str, "is present in old file but not new file"
}
}
else if (str in newHitFnr) {
print str, "is present in new file but not old file"
}
}
}
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputting that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new", and the file "actlist" said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
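For reference, the usual workaround in other awks is a counter bumped at the first record of each file (with the caveat that empty files produce no records and so are skipped):
awk 'FNR == 1 { argind++ } { print argind, FILENAME, $0 }' actlist old new
In tst.awk, each ARGIND == n test would become argind == n with that FNR == 1 rule added at the top.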
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
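A quick run over two illustrative lines exercises both branches:
$ printf '%s\n' 'a b c' 'd e' | awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}'
Words are:a b c
Line 2 is having 2 Words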
I have given the solution as per the requirement.
awk '{ # awk starts from here and reads the file line by line
if (NF == 3) # check whether the current line has 3 fields; NF is the number of fields in the current line
{ word1=$1; # if the current line has exactly 3 fields, the 1st field is assigned to word1
word2=$2; # the 2nd field is assigned to word2
word3=$3; # the 3rd field is assigned to word3
print word1, word2, word3} # print all 3 fields
}' filename.txt >> output.txt # these 3 fields are redirected to a file which can be used for further processing
This is as per the requirement; there are many other ways of doing this, but it was asked using awk.
