Comparing 2 files content with specific values in shellScript - shell

I am very new to shell scripting and need help.
I have two files:
Student.txt
001, Peter, class3
002, Mohit, class4
and so on...
Marks.txt
001, History, 45
001, Maths, 55
002, computer, 76
002, Maths, 96
and so on...
I want to read the first field (the roll number) of each line in Student.txt, i.e. 001 and 002 in my example, and then search Marks.txt for lines whose first field is that roll number AND whose second field is "History" (condition: $1 == roll no && $2 == History).
I have been going through the awk command and tried the following, but I have not been able to put together a complete solution:
awk -F "," '{ print $1 }' Student.txt
awk -F "," '{ print $1, $2 }' Marks.txt

This command first collects the roll numbers from Marks.txt that have History in the second column, and then prints the lines from Student.txt whose roll number is in that list:
awk 'BEGIN {FS=", "} FNR==NR{ if ($2=="History"){a[$1];} next} $1 in a' Marks.txt Student.txt
Output:
001, Peter, class3
EDIT: see this for more info on processing multiple files with awk: Using AWK to Process Input from Multiple Files
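The two-file idiom in the command above can be tried on the sample data as follows. This is only a minimal sketch: the scratch directory, the `subject` variable, and the `wanted` array name are illustrative choices, not part of the original answer.

```shell
# Recreate the sample files in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' '001, Peter, class3' '002, Mohit, class4' > "$tmp/Student.txt"
printf '%s\n' '001, History, 45' '001, Maths, 55' \
              '002, computer, 76' '002, Maths, 96' > "$tmp/Marks.txt"

# FNR==NR is true only while the first file (Marks.txt) is being read:
# remember each roll number whose subject matches, then skip to the next line.
# For the second file (Student.txt), print lines whose roll number was remembered.
result=$(awk -F', ' -v subject="History" '
    FNR == NR { if ($2 == subject) wanted[$1]; next }
    $1 in wanted
' "$tmp/Marks.txt" "$tmp/Student.txt")

echo "$result"
rm -rf "$tmp"
```

Passing the subject with -v keeps the awk program unchanged when a different subject is needed.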

Related

How to subtract 2 numbers in 2 different files?

I have 2 files that are reports of size of the Databases (1 file is from yesterday, 1 from today).
I want to see how the size of each database changed, so I want to calculate the difference.
File looks like this:
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8336089","797419","896120","server3"
"ASIA","3740156","3170088","570068","354000","server5"
"AFRICA","4871331","4101711","769620","318412","server4"
Other file is the same, only the numbers are different.
I want to see how the database size changed (so ONLY column "Use MB").
I guess I cannot use "diff" or "awk" since the numbers may change dramatically each day. The only good 'algorithm' I can think of is to subtract the numbers between the 5th and 6th double quote ("). How do I do that?
You can do this (using awk):
paste file1 file2 -d ',' |awk -F ',' '{gsub(/"/, "", $3); gsub(/"/, "", $9); print $3 - $9}'
paste puts the two files side by side, line by line, separated by a comma (-d ','). So you will have:
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname","DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8336089","797419","896120","server3","EUROPE","9133508","8336089","797419","896120","server3"
...
gsub(/"/, "", $3) removes the quotes around column 3
And finally we print column 3 minus column 9
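As a rough illustration of the paste approach, the sketch below runs it on a shortened copy of the sample data. It additionally skips the header row with NR > 1 and prints the database name, which the command above does not do. It assumes both files list the databases in the same order and that no field contains a comma; the scratch paths are made up for the example.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8336089","797419","896120","server3"
"ASIA","3740156","3170088","570068","354000","server5"
EOF
cat > "$tmp/file2" <<'EOF'
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8335089","797419","896120","server3"
"ASIA","3740156","3170058","570068","354000","server5"
EOF

# NR > 1 skips the header row; fields 1-6 come from file1, 7-12 from file2.
# The quotes are stripped from the name and from both "Use MB" values.
diffs=$(paste -d ',' "$tmp/file1" "$tmp/file2" |
    awk -F ',' 'NR > 1 {
        gsub(/"/, "", $1); gsub(/"/, "", $3); gsub(/"/, "", $9)
        print $1, $3 - $9
    }')

echo "$diffs"
rm -rf "$tmp"
```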
Maybe I missed something, but I don't see why you could not use awk, as it can do exactly what you describe:
The only good 'algorithm' I can think of is to subtract numbers between 5th and 6th double quote ("), how do I do that?
Let's say that file1 is :
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8336089","797419","896120","server3"
"ASIA","3740156","3170088","570068","354000","server5"
"AFRICA","4871331","4101711","769620","318412","server4"
And file2 is :
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8335089","797419","896120","server3"
"ASIA","3740156","3170058","570068","354000","server5"
"AFRICA","4871331","4001711","769620","318412","server4"
Command
awk -F'[",]' 'FNR>1&&NR==FNR{db[$2]=$8;next}FNR>1{print $2, db[$2]-$8}' file1 file2
gives you result :
EUROPE 1000
ASIA 30
AFRICA 100000
You can also use this answer to deal more properly with quotechars on awk.
If your awk version does not support a regular expression as the field separator, you can strip the quotes with sed first (the <( ) process substitution requires bash):
awk -F, 'FNR>1&&NR==FNR{db[$1]=$3;next}FNR>1{print $1, db[$1]-$3}' <(sed 's,",,g' file1) <(sed 's,",,g' file2)

How can I make a script that calls awk in a loop over k/v pairs faster?

I have a large number of text files that I would like to loop through. While looping, I would like to find the lines that match a list of strings and extract each match to a separate folder. I have a variable "ij" that needs to be split into "i" and "j" to match two columns. For example, 2733 needs to be split into 27 and 33. The script searches each text file and extracts every line that has an i and j of 2733.
The problem is that I have nearly 100 different strings, so it takes about 35 hours to get through all of them.
Is there any way to extract all of the strings to separate files in just one pass? I am trying to loop through a text file, extract all the lines that are in my list of strings, output them to their own folder, and then move on to the next text file.
I am currently using the "awk" command to accomplish this.
list="2741 2740 2739 2738 2737 2641 2640 2639 2638 2541 2540 2539 2538 2441 2440 2439 2438 2341 2340 2339 2241 2240 2141"
for string in $list
do
for i in ${string:0:2}
do
for j in ${string:2:2}
do
awk -v i=$i -v j=$j '$2==j && $3==i {print $0}' $datadir/*.txt >"${fileout}${i}_${j}_Output.txt"
done
done
done
So I did this:
# for each 4 digits in the list
# add "a[" and "];" before and after the four numbers
# so awk array is "a[2741]; a[2740]; a[2739]; ...."
awkarray=$(<<<"$list" sed -E 's/[0-9]{4}/a[&];/g')
awk -v fileout="$fileout" '
BEGIN {'"$awkarray"'}
($3 $2) in a {
print $0 > (fileout $3 "_" $2 "_Output.txt")
}
' "$datadir"/*.txt
So first I transform the list so that it can be loaded as an array in awk. The array has only indexes and its elements have no values, which is enough to check whether an index exists. Your loop matches $3 against the first two digits (i) and $2 against the last two (j), so I check whether the concatenation of $3 and $2 exists in the array; if it does, the line is redirected to the matching file name.
Remember to quote your variables. $datadir/*.txt may not work when datadir contains spaces; use "$datadir"/*.txt. The newlines in the awk script above can be removed, so if you prefer a one-liner:
awk -v fileout="$fileout" 'BEGIN {'"$(<<<"$list" sed -E 's/[0-9]{4}/a[&];/g')"'} ($3 $2) in a { print $0 > (fileout $3 "_" $2 "_Output.txt") }' "$datadir"/*.txt
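For comparison, here is a small self-contained sketch of the same single-pass idea. It swaps in a different way of loading the keys: the list is passed with -v and split() inside awk rather than being turned into awk source with sed. The question's loop tests $2 == j (the last two digits) and $3 == i (the first two), so the sketch builds the lookup key as $3 followed by $2. The sample data, the shortened list, and the `wanted` and `keys` names are made up for the example.

```shell
tmp=$(mktemp -d)
datadir="$tmp/data"; outdir="$tmp/out"
mkdir -p "$datadir" "$outdir"
printf '%s\n' 'x 41 27 foo' 'y 33 27 bar' 'z 40 26 baz' > "$datadir/sample.txt"

list="2741 2640"        # shortened list of "ij" keys
fileout="$outdir/"      # output prefix, as in the question

# BEGIN: turn the space-separated list into array indexes.
# Main rule: if the line's i/j pair is wanted, write it to that pair's file.
awk -v list="$list" -v fileout="$fileout" '
    BEGIN { n = split(list, keys, " "); for (k = 1; k <= n; k++) wanted[keys[k]] }
    ($3 $2) in wanted {
        out = fileout $3 "_" $2 "_Output.txt"
        print > out
    }
' "$datadir"/*.txt

first=$(cat "$outdir/27_41_Output.txt")
second=$(cat "$outdir/26_40_Output.txt")
count=$(ls "$outdir" | wc -l)
echo "$first"; echo "$second"
rm -rf "$tmp"
```

Every input file is read once, and each matching line goes straight to its own output file, instead of rescanning all files once per key.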

awk change one column in file, where column changes position in different files

I have text files that result from various processing steps, so depending on the order of the steps, the order of the columns and the length of each line change from one file to the next.
so file1 would be:
moo 100.35 blah 9 85 0.0038
moo 93.8 bluu 10 85 0.0042
and file2 would be:
125.2 129.3 moo 0.23
123.5 125.3 moo 0.23
and I would like to change it to:
1_horatio 100.35 blah 9 85 0.0038
2_horatio 93.8 bluu 10 85 0.0042
and
125.2 129.3 1_clarence 0.23
123.5 125.3 2_clarence 0.23
where the number in the new name for moo is incremented for each row. The name is an input variable.
Here's what I've been trying so far:
newnam=$1
awk -v nnam=$newnam 'BEGIN{ count=1 } {imgn=count"_"nam; print imgn,$2,$3,$4 count++ }' $2 > $3
which I then need to change to:
newnam=$1
awk -v nam=$newnam 'BEGIN{ count=1 } {imgn=count"_"nam; print $1,$2,imgn,$4 count++ }' $2 > $3
I'd like to be able to pass the column number as a variable, and not have to worry about how many columns there are. There can be up to 50 columns, with up to a million rows.
Is there a way to do this in awk? Or bash with awk?
I believe what you can do is something like this,
awk '{$col=count"_"name; count++}1' name="clarence" col=3 <file>
Here we make use of the following awk features:
Assigning to a field $n rebuilds $0.
The pattern 1 is always true, so it triggers the default action {print $0}.
The operator $expr returns the field whose number is given by expr.
Update: to have the counter start at 1, one can rewrite this as:
awk '{count++; $col=count"_"name}1' name="clarence" col=3 <file>
which can be shortened as:
awk '{$col=++count"_"name}1' name="clarence" col=3 <file>
due to the usage of the pre-increment operator ++var. But count is now nothing more than the record number NR, so this is equivalent:
awk '{$col=NR"_"name}1' name="clarence" col=3 <file>
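The final form can be tried on the second sample file like this. It is a minimal sketch: the scratch path and the shell variables `name` and `col` are placeholders for the script's real inputs.

```shell
tmp=$(mktemp -d)
printf '%s\n' '125.2 129.3 moo 0.23' '123.5 125.3 moo 0.23' > "$tmp/file2"

name="clarence"   # new base name
col=3             # number of the column to overwrite

# $col selects the field whose number is stored in col; NR supplies the counter.
# The trailing 1 is an always-true pattern that prints the rebuilt record.
renamed=$(awk -v name="$name" -v col="$col" '{ $col = NR "_" name } 1' "$tmp/file2")

echo "$renamed"
rm -rf "$tmp"
```

One side effect to be aware of: assigning to a field makes awk rebuild the record with the output field separator, so runs of spaces or tabs in the input come out as single spaces.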

find duplicate data between files in unix

I have 2 files with contents:-
file1:
918802944821 919968005200 kushinagar
919711354546 919211999924 delhi
915555555555 916666666666 kanpur
919711354546 915686524578 hehe
918802944821 4752168549 hfhkjh
file2:-
919211999924 919711354546 ghaziabad
919999999999 918888888888 lucknow
912222222222 911111111111 chandauli
918802944821 916325478965 hfhjdhjd
Now notice that number1 and number2 are interchanged between file1 and file2. I want to print only this duplicate line on the screen. To be more specific, I want only the duplicated numbers or lines printed on the screen; for example, 8888888888 and 7777777777 are duplicated in the two files. I want either just these two numbers or the whole line on the screen.
Using awk you can do:
awk 'FNR==NR{a[$1,$2]++;next} a[$2,$1]' f1 f2
7777777777 8888888888 pqr
EDIT: Based on your edited question you can do:
awk 'FNR==NR{a[$1]++;b[$2]++;next} a[$1] || b[$1] {print $1} a[$2] || b[$2]{print $2}' f1 f2
919211999924
919711354546
918802944821
kent$ awk 'NR==FNR{a[$2 FS $1]=1;next}a[$1 FS $2]{print $1,$2}' f1 f2
7777777777 8888888888
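The swapped-pair lookup can be checked against a shortened copy of the sample files from the question. This is a sketch of the same idea as the first command above; it uses the `(i, j) in array` membership test so that looking up a pair does not create an array entry, and the `seen` array name and scratch paths are illustrative.

```shell
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
918802944821 919968005200 kushinagar
919711354546 919211999924 delhi
915555555555 916666666666 kanpur
EOF
cat > "$tmp/f2" <<'EOF'
919211999924 919711354546 ghaziabad
919999999999 918888888888 lucknow
918802944821 916325478965 hfhjdhjd
EOF

# First file: record each (number1, number2) pair.
# Second file: print the line if its pair, reversed, was seen in the first file.
swapped=$(awk '
    FNR == NR { seen[$1, $2]; next }
    ($2, $1) in seen
' "$tmp/f1" "$tmp/f2")

echo "$swapped"
rm -rf "$tmp"
```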

extracting values from text file using awk

I have 100 text files which look like this:
File title
4
Realization number
variable 2 name
variable 3 name
variable 4 name
1 3452 4538 325.5
The first number on the 7th line (1) is the realization number, which SHOULD relate to the file name. i.e. The first file is called file1.txt and has realization number 1 (as shown above). The second file is called file2.txt and should have realization number 2 on the 7th line. file3.txt should have realization number 3 on the 7th line, and so on...
Unfortunately every file has realization=1, where they should be incremented according to the file name.
I want to extract variables 2, 3 and 4 from the 7th line (3452, 4538 and 325.5) in each of the files and append them to a summary file called summary.txt.
I know how to extract the information from 1 file:
awk 'NR==7,NR==7{print $2, $3, $4}' file1.txt
Which, correctly gives me:
3452 4538 325.5
My first problem is that this command doesn't seem to give the same results when run from a bash script on multiple files.
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR=7,NR==7{print $2, $3, $4}' File$((i)).txt
done
I get multiple lines being printed to the screen when I use the above script.
Secondly, I would like to output those values to the summary file along with the CORRECT preceding realization number, i.e. I want a file that looks like this:
1 3452 4538 325.5
2 4582 6853 158.2
...
100 4865 3589 15.15
Thanks for any help!
You can simplify some things and get the result you're after:
#!/bin/bash
for ((i=1;i<=100;i++))
do
echo $i $(awk 'NR==7{print $2, $3, $4}' File$i.txt)
done
You really don't want to assign to NR=7 (as you did) and you don't need to repeat the NR==7,NR==7 either. You also really don't need the $((i)) notation when $i is sufficient.
If all the files are exactly 7 lines long, you can do it all in one awk command (instead of 100 of them):
awk 'NR%7==0 { print ++i, $2, $3, $4}' File*.txt
Notice that you have only one = in your bash script. Do all the files have exactly 7 lines? If you are only interested in the 7th line, then:
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR==7{print $2, $3, $4}' File$((i)).txt
done
Since your realization number starts from 1, you can simply add that using nl command.
For example, if your bash script is called s.sh then:
./s.sh | nl > summary.txt
will get you the result with the expected lines in summary.txt
Here's one way using awk:
awk 'FNR==7 { print ++i, $2, $3, $4 > "summary.txt" }' $(ls -v file*)
The -v flag simply sorts the glob by version numbers. If your version of ls doesn't support this flag, try: ls file* | sort -V instead.
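A variation that avoids depending on the order of the file list is sketched below. Instead of counting with ++i, it takes the realization number from the digits in FILENAME and sorts the result numerically afterwards. This is a swapped-in technique, not the approach of any answer above, and it assumes the only digits in each file name are the realization number; the three generated sample files are made up for the example.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Build three 7-line sample files; line 7 always carries realization number 1.
for n in 1 2 10; do
    printf '%s\n' 'File title' '4' 'Realization number' \
        'variable 2 name' 'variable 3 name' 'variable 4 name' \
        "1 ${n}00 ${n}01 ${n}.5" > "file$n.txt"
done

# FNR restarts at 1 for every input file, so FNR == 7 hits line 7 of each one.
# Strip the non-digits from FILENAME to recover the number, then sort numerically.
awk 'FNR == 7 { n = FILENAME; gsub(/[^0-9]/, "", n); print n, $2, $3, $4 }' \
    file*.txt | sort -n > summary.txt

summary=$(cat summary.txt)
echo "$summary"
cd - > /dev/null && rm -rf "$tmp"
```

The glob expands in lexical order (file1.txt, file10.txt, file2.txt), which is why the number is read from the name and the output is sorted afterwards.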
