extracting values from text file using awk - bash

I have 100 text files which look like this:
File title
4
Realization number
variable 2 name
variable 3 name
variable 4 name
1 3452 4538 325.5
The first number on the 7th line (1) is the realization number, which SHOULD relate to the file name. i.e. The first file is called file1.txt and has realization number 1 (as shown above). The second file is called file2.txt and should have realization number 2 on the 7th line. file3.txt should have realization number 3 on the 7th line, and so on...
Unfortunately every file has realization=1, where they should be incremented according to the file name.
I want to extract variables 2, 3 and 4 from the 7th line (3452, 4538 and 325.5) in each of the files and append them to a summary file called summary.txt.
I know how to extract the information from 1 file:
awk 'NR==7,NR==7{print $2, $3, $4}' file1.txt
Which, correctly gives me:
3452 4538 325.5
My first problem is that this command doesn't seem to give the same results when run from a bash script on multiple files.
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR=7,NR==7{print $2, $3, $4}' File$((i)).txt
done
I get multiple lines being printed to the screen when I use the above script.
Secondly, I would like to output those values to the summary file along with the CORRECT preceeding realization number. i.e. I want a file that looks like this:
1 3452 4538 325.5
2 4582 6853 158.2
...
100 4865 3589 15.15
Thanks for any help!

You can simplify some things and get the result you're after:
#!/bin/bash
for ((i=1;i<=100;i++))
do
echo $i $(awk 'NR==7{print $2, $3, $4}' File$i.txt)
done
You really don't want to assign to NR=7 (as you did) and you don't need to repeat the NR==7,NR==7 either. You also really don't need the $((i)) notation when $i is sufficient.
If all the files are exactly 7 lines long, you can do it all in one awk command (instead of 100 of them):
awk 'NR%7==0 { print ++i, $2, $3, $4}' Files*.txt

Notice that you have only one = in your bash script. Does all the files have exactly 7 lines? If you are only interested in the 7th line then:
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR==7{print $2, $3, $4}' File$((i)).txt
done
Since your realization number starts from 1, you can simply add that using nl command.
For example, if your bash script is called s.sh then:
./s.sh | nl > summary.txt
will get you the result with the expected lines in summary.txt

Here's one way using awk:
awk 'FNR==7 { print ++i, $2, $3, $4 > "summary.txt" }' $(ls -v file*)
The -v flag simply sorts the glob by version numbers. If your version of ls doesn't support this flag, try: ls file* | sort -V instead.

Related

Separating rows in a text file based on the column

Given a text file below, I want to separate the rows whose the value in second column is zero and put those rows in a separate file. Since the values in the second column are starting from 0 to 83, I would like to have this approach for every value. I have written the code below but it is not working as it should be and every output file generated is empty. Can anyone tell me what am I doing wrong?
for i in {0..83}; do awk ' $2=="$i" {print}' combined-all.txt > combined-all-$i.txt; done
here is part of the text file
Subj02 19 000274 000318
Subj01 83 000319 000362
Subj03 18 000363 000414
Subj04 83 000415 000447
Subj05 17 000448 000490
Subj06 0 000491 000540
...
Or you can use awk var assignment
for i in {0..83}; do awk -v i=$i '$2==i' combined-all.txt > combined-all-$i.txt; done
awk loops through files, try to use awk without a loop.
awk '{print >> "combined-all-" $2 ".txt"}' combined-all.txt
EDIT: Inputfile is combined-all.txt, not combined-all-$i.txt
... not using awk a lot these days... but this works:
for i in {0..83}; do awk -F" " '{ if ($2=='"$i"') {print}}' combined-all.txt > combined-all-$i.txt; done
Note the '"$1"'

Extract lines from file2 that exist in file1 using a loop

I am very new at shell scripting and I am having some trouble with the following task:
I want to extract lines from file2 that are found also in file1 and extract those lines to a new file3. I am only allowed to use loops for this (I know it works with the basic grep command, but I need to find a way with a loop)
File1
John 5 red books
Ashley 4 yellow music
Susan 8 green films
File2
John
Susan
Desired output for file3 would be:
John 5 red books
Susan 8 green films
The desired output has to be found using bash script and a loop. I have tried the following loop, but I am missing some lines in the results by using this:
while read line
do
grep "${line}" $file1
done < $file2 >> file3.txt
If anyone has any thoughts on how to improve my script or any new ideas (again using loops) it would be greatly appreciated. Thank you!
Looping here is a good educational exercise but it isn't ideal for this in the real world.
Technically, this AWK solution works and uses a loop, but I'm guessing it's not what your instructor is looking for:
awk 'NR == FNR { find[$1]=1; next } find[$1]' File2 File1 >File3
I've swapped the order of the files so the file with the data (File1) is loaded after the file listing what we want (File2).
This starts with a condition that ensures we're on the first file AWK reads (NR is the "number of records" (lines) seen so far across all inputs and FNR is the current file's number of records, so since this clause requires them to be the same value, it can only fire on the first input file). It sets a hash (a data structure with key/value pairs, a.k.a. an associative array or dictionary) whose key is the value of the first column ($1) on the line so we can extract it later, then next skips the later stanza for that input line.
When the code loops through the next file (File1), the first clause does not fire and instead the first column of input is looked up in the find hash. If it is present, its value is 1 and that evaluates to true, so we print the value. (A clause with no action implies { print })
See Toby Speight's answer for a native bash answer with only builtins. It uses loops and hashes. You'll likely find that solution is slower on larger data sets.
Since you're using Bash, you could create an associative array from File2, and use that to check membership. Something like (untested):
read -a names <File2
local -A n
for i in "${names[#]}"
do n["$i"]="$i"
done
while read -r name rest
do [ "${n[$name]}" ] && echo "$name $rest"
done <File1 >file3
Awk solution:
awk 'NR==FNR{ arr[$0]="";next } { for (i in arr) { if (i == $1 ) { print $0 } } }' file2 file1
First we create an array of with the data in file2. We then use this to check the first space delimited piece of data and print if there is a match,
With awk :
$ awk 'NR==FNR{ a[$1];next } $1 in a' file2 file1`
With grep:
$ grep -F -f file2 file1

match awk column value to a column in another file

I need to know if I can match awk value while I am inside a piped command. Like below:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
from here I need to check if the computed value $4*10^10+$6 is present (matches to) in any of the column value of another file. If it is present then print, else just move forward.
File where value needs to be matched is as below:
a,b,c,d,e
1,2,30000000000,3,4
I need to match with the 3rd column of the above file.
I would ideally like this to be in the same command, because if this check is not applied, it prints more than 100 million rows (and a large file).
I have already read this question.
Adding more info:
Breaking my command into parts
part1-command:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "Something:"
part1-output(just showing 1 iteration output):
Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463
part2-command Now I use awk
awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
part2-command output: currently below values are printed (see how i multiplied 1*10^10+10588429 and got 10010588429
1,10588429,10010588429,1491539456372358463
3,12394810,30012394810,1491539456372359082
1,10588430,10010588430,1491539456372366413
Now here I need to put a check (within the command [near awk]) to print only if 10010588429 was present in another file (say another_file.csv as below)
another_file.csv
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
output should only be
1,10588429,10010588429,1491539456372358463
1,10588430,10010588430,1491539456372366413
So for every row of awk we check entry in file2 column C
Using the associative array approach in previous question, include a hyphen in place of the first file to direct AWK to the input stream.
Example:
grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]'
'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}
NR==FNR {
query[$4*10^10+$6]=$4*10^10+$6;
out[$4*10^10+$6]=$4 FS $6 FS $4*10^10+$6 FS $8;
next
}
query[$3]==$3 {
print out[$3]
}' - another_file.csv > output.csv
More info on the merging process in the answer cited in the question:
Using AWK to Process Input from Multiple Files
I'll post a template which you can utilize for your computation
awk 'BEGIN {FS=OFS=","}
NR==FNR {lookup[$3]; next}
/sometext/ {c=4}
c&&c--&&/somemoretext/ {value= # implement your computation here
if(value in lookup)
print "what you want"}' lookup.file FS=':' grep.files...
here awk loads up the values in the third column of the first file (which is comma delimited) into the lookup array (a hashmap in disguise). For the next set of files, sets the delimiter to : and similar to grep -A3 looks within the 3 distance of the first pattern for the second pattern, does the computation and prints what you want.
In awk you can have more control on what column your pattern matches as well, here I replicated grep example.
This is another simplified example to focus on the core of the problem.
awk 'BEGIN{for(i=1;i<=1000;i++) print int(rand()*1000), rand()}' |
awk 'NR==FNR{lookup[$1]; next}
$1 in lookup' perfect.numbers -
first process creates 1000 random records, and second one filters the ones where the first fields is in the look up table.
28 0.736027
496 0.968379
496 0.404218
496 0.151907
28 0.0421234
28 0.731929
for the lookup file
$ head perfect.numbers
6
28
496
8128
the piped data is substituted as the second file at -.
You can pipe your grep or awk output into a while read loop which gives you some degree of freedom. There you could decide on whether to forward a line:
grep -A3 "sometext" | grep "somemoretext" | while read LINE; do
COMPUTED=$(echo $LINE | awk -F '[:|]' 'BEGIN{OFS=","}{print $4,$6,$4*10^10+$6,$8}')
if grep $COMPUTED /the/file/to/search &>/dev/null; then
echo $LINE
fi
done | cat -

Print text between two lines (from list of line numbers in file) in Unix [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
I have a sample file which has thousands of lines.
I want to print text between two line numbers in that file. I don't want to input line numbers manually, rather I have a file which contains list of line numbers between which text has to be printed.
Example : linenumbers.txt
345|789
999|1056
1522|1366
3523|3562
I need a shell script which will read line numbers from this file and print the text between each range of lines into a separate (new) file.
That is, it should print lines between 345 and 789 into a new file, say File1.txt, and print text between lines 999 and 1056 into a new file, say File2.txt, and so on.
considering your target file has only thousands of lines. here is a quick and dirty solution.
awk -F'|' '{system("sed -n \""$1","$2"p\" targetFile > file"NR)}' linenumbers.txt
the targetFile is your file containing thousands of lines.
the oneliner does not require your linenumbers.txt to be sorted.
the oneliner allows line range to be overlapped in your linenumbers.txt
after running the command above, you will have n filex files. n is the row counts of linenumbers.txt x is from 1-n you can change the filename pattern as you want.
Here's one way using GNU awk. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
BEGIN {
# set the field separator
FS="|"
}
# for the first file in the arguments list
FNR==NR {
# add the row number and field one as keys to a multidimensional array with
# a value of field two
a[NR][$1]=$2
# skip processing the rest of the code
next
}
# for the second file in the arguments list
{
# for every element in the array's first dimension
for (i in a) {
# for every element in the second dimension
for (j in a[i]) {
# ensure that the first field is treated numerically
j+=0
# if the line number is greater than the first field
# and smaller than the second field
if (FNR>=j && FNR<=a[i][j]) {
# print the line to a file with the suffix of the first file's
# line number (the first dimension)
print > "File" i
}
}
}
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR][$1]=$2; next } { for (i in a) for (j in a[i]) { j+=0; if (FNR>=j && FNR<=a[i][j]) print > "File" i } }' numbers.txt file.txt
If you have an 'old' awk, here's the version with compatibility. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
BEGIN {
# set the field separator
FS="|"
}
# for the first file in the arguments list
FNR==NR {
# add the row number and field one as a key to a pseudo-multidimensional
# array with a value of field two
a[NR,$1]=$2
# skip processing the rest of the code
next
}
# for the second file in the arguments list
{
# for every element in the array
for (i in a) {
# split the element in to another array
# b[1] is the row number and b[2] is the first field
split(i,b,SUBSEP)
# if the line number is greater than the first field
# and smaller than the second field
if (FNR>=b[2] && FNR<=a[i]) {
# print the line to a file with the suffix of the first file's
# line number (the first pseudo-dimension)
print > "File" b[1]
}
}
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR,$1]=$2; next } { for (i in a) { split(i,b,SUBSEP); if (FNR>=b[2] && FNR<=a[i]) print > "File" b[1] } }' numbers.txt file.txt
I would use sed to process the sample data file because it is simple and swift. This requires a mechanism for converting the line numbers file into the appropriate sed script. There are many ways to do this.
One way uses sed to convert the set of line numbers into a sed script. If everything was going to standard output, this would be trivial. With the output needing to go to different files, we need a line number for each line in the line numbers file. One way to give line numbers is the nl command. Another possibility would be to use pr -n -l1. The same sed command line works with both:
nl linenumbers.txt |
sed 's/ *\([0-9]*\)[^0-9]*\([0-9]*\)|\([0-9]*\)/\2,\3w file\1.txt/'
For the given data file, that generates:
345,789w > file1.txt
999,1056w > file2.txt
1522,1366w > file3.txt
3523,3562w > file4.txt
Another option would be to have awk generate the sed script:
awk -F'|' '{ printf "%d,%dw > file%d.txt\n", $1, $2, NR }' linenumbers.txt
If your version of sed will allow you to read its script from standard input with -f - (GNU sed does; BSD sed does not), then you can convert the line numbers file into a sed script on the fly, and use that to parse the sample data:
awk -F'|' '{ printf "%d,%dw > file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f - sample.data
If your system supports /dev/stdin, you can use one of:
awk -F'|' '{ printf "%d,%dw > file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/stdin sample.data
awk -F'|' '{ printf "%d,%dw > file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/fd/0 sample.data
Failing that, use an explicit script file:
awk -F'|' '{ printf "%d,%dw > file%d.txt\n", $1, $2, NR }' linenumbers.txt > sed.script
sed -n -f sed.script sample.data
rm -f sed.script
Strictly, you should deal with ensuring the temporary file name is unique (mktemp) and removed even if the script is interrupted (trap):
tmp=$(mktemp sed.script.XXXXXX)
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15
awk -F'|' '{ printf "%d,%dw > file%d.txt\n", $1, $2, NR }' linenumbers.txt > $tmp
sed -n -f $tmp sample.data
rm -f $tmp
trap 0
The final trap 0 allows your script to exit successfully; omit it, and you script will always exit with status 1.
I've ignored Perl and Python; either could be used for this in a single command. The file management is just fiddly enough that using sed seems simpler. You could also use just awk, either with a first awk script writing an awk script to do the heavy duty work (trivial extension of the above), or having a single awk process read both files and produce the required output (harder, but far from impossible).
If nothing else, this shows that there are many possible ways of doing the job. If this is a one-off exercise, it really doesn't matter very much which you choose. If you will be doing this repeatedly, then choose the mechanism that you like. If you're worried about performance, measure. It is likely that converting the line numbers into a command script is a negligible cost; processing the sample data with the command script is where the time is taken. I would expect sed to excel at that point; I've not measured to confirm that it does.
You could do the following
# myscript.sh
linenumbers="linenumber.txt"
somefile="afile"
while IFS=\| read start end ; do
echo "sed -n '$start,${end}p;${end}q;' $somefile > $somefile-$start-$end"
done < $linenumbers
run it like so sh myscript.sh
sed -n '345,789p;789q;' afile > afile-345-789
sed -n '999,1056p;1056q;' afile > afile-999-1056
sed -n '1522,1366p;1366q;' afile > afile-1522-1366
sed -n '3523,3562p;3562q;' afile > afile-3523-3562
then when you're happy do sh myscript.sh | sh
EDIT Added William's excellent points on style and correctness.
EDIT Explanation
The basic idea is to get a script to generate a series of shell commands that can be checked for correctness first before being executed by "| sh".
sed -n '345,789p;789q; means use sed and don't echo each line (-n) ; there are two commands saying from line 345 to 789 p(rint) the lines and the second command is at line 789 q(uit) - by quitting on the last line you save having sed read all the input file.
The while loop reads from the $linenumbers file using read, read if given more than one variable name populates each with a field from the input, a field is usually separated by space and if there are too few variable names then read will put the remaining data into the last variable name.
You can put the following in at your shell prompt to understand that behaviour.
ls -l | while read first rest ; do
echo $first XXXX $rest
done
Try adding another variable second to the above to see what happens then, it should be obvious.
The problem is your data is delimited by |s and that's where using William's suggestion of IFS=\| works as now when reading from the input the IFS has changed and the input is now separated by |s and we get the desired result.
Others can feel free to edit,correct and expand.
To extract the first field from 345|789 you can e.g use awk
awk -F'|' '{print $1}'
Combine that with the answers received from your other question and you will have a solution.
This might work for you (GNU sed):
sed -r 's/(.*)\|(.*)/\1,\2w file-\1-\2.txt/' | sed -nf - file

Need an awk script or any other way to do this on unix

i have small file with around 50 lines and 2 fields like below
file1
-----
12345 8373
65236 7376
82738 2872
..
..
..
i have some around 100 files which are comma"," separated as below:
file2
-----
1,3,4,4,12345,,,23,3,,,2,8373,1,1
each file has many lines similar to the above line.
i want to extract from all these 100 files whose
5th field is eqaul to 1st field in the first file and
13th field is equal to 2nd field in the first file
I want to search all the 100 files using that single file?
i came up with the below in case of a single comma separated file.i am not even sure whether this is correct!
but i have multiple comma separated files.
awk -F"\t|," 'FNR==NR{a[$1$2]++;next}($5$13 in a)' file1 file2
can anyone help me pls?
EDIT:
the above command is working fine in case of a single file.
Here is another using an array, avoiding multiple work files:
#!/bin/awk -f
FILENAME == "file1" {
keys[$1] = ""
keys[$2] = ""
next
}
{
split($0, fields, "," )
if (fields[5] in keys && fields[13] in keys) print "*:",$0
}
I am using split because the field seperator in the two files are different. You could swap it around if necessary. You should call the script thus:
runit.awk file1 file2
An alternative is to open the first file explicitly (using "open") and reading it (readline) in a BEGIN block.
Here is a simple approach. Extract each line from the small file, split it into fields and then use awk to print lines from the other files which match those fields:
while read line
do
f1=$(echo $line | awk '{print $1}')
f2=$(echo $line | awk '{print $2}')
awk -v f1="$f1" -v f2="$f2" -F, '$5==f1 && $13==f2' file*
done < small_file

Resources