Grab similar data and do the math operation on the grabbed data - shell

I need to look up each student's attendance values in the table below and sum them per student.
I have a dump with the data below inside file_dump.txt:
[days Leaves PERCENTAGE student_attendance]
194 1.3 31.44% student1.entry2
189 1.3 30.63% student2._student2
138 0.9 22.37% student3.entry2
5 0.0 0.81% student3._student3
5 0.0 0.81% student1._student1
I need to find the student1 rows in the table above using a Linux command (grep or other commands) and then sum student1.entry2 and student1._student1 together, that is 194 + 5 = 199.
How can I do this on the Linux command line?

Awk is eminently suitable for small programming tasks like this.
awk -v student="student1" '$4 ~ "^" student "\." { sum += $1 }
END { print sum }' file
The -v option lets you pass in a value for an Awk variable from the command line; we do that to provide a value for the variable student. The first line checks whether the fourth field $4 begins with that variable immediately followed by a dot, and if so, adds the first field $1 to the variable sum. (Conveniently, uninitialized variables spring to life with a default value of zero / empty string.) This gets repeated for each input line in the file. Then the END block gets executed after the input file has been exhausted, and we print the accumulated sum.
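For example, run against the question's file_dump.txt, the command should print the expected 199 (194 + 5):
$ awk -v student="student1" '$4 ~ "^" student "\\." { sum += $1 } END { print sum }' file_dump.txt
199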
If you want to save this in an executable script, you might want to allow the caller to specify the student to search for:
#!/bin/sh
# Fail if $1 is unset
: ${1?Syntax: $0 studentname}
awk -v student="$1" '$4 ~ "^" student "\." { sum += $1 }
END { print sum }' file
Notice how the $ variables inside the single quotes are Awk field references (the single quotes protect the script's contents from the shell's variable interpolation etc. facilities), whereas the one with double quotes around it gets replaced by the shell with the value of the first command-line argument.
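For instance, assuming the script is saved as sum_attendance.sh (an arbitrary name) and the data sits in a file literally named file, as the script expects:
$ chmod +x sum_attendance.sh
$ ./sum_attendance.sh student1
199
Calling it without an argument makes the ${1?...} expansion abort the script and print the Syntax message to standard error (the exact wording varies between shells).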

Related

(sed/awk) extract values text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is a few separate calls to sed whose values I save, followed by a Python script that combines the data from the different files into a single csv file. However, this is quite slow and I want to speed it up.
The file let's call it my_file_1.txt has a structure that looks something like this
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct, as the N will gobble up some "start value" lines, leading to wrong results. It feels like it could be done more easily with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. I should have specified the awk version for this, as gensub is a GNU awk (gawk) extension and is not available in all awks.
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
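Saved as, e.g., processor.awk and run with gawk -f processor.awk my_file_1.txt, this should reproduce exactly the rows asked for above:
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2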
I accepted this answer because it is the most understandable.
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'
It's not clear how you'd get exactly the output you provided from the input you provided, but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
OFS = ","
print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
print FILENAME, f["start"], f["stop"], f["epoch"], ++run
delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2
awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
The last field is referenced by a variable named NF for "number of fields." $NF is the last field, $(NF-1) is the second-to-last field, etc.
A "BEGIN" section will be executed before any input file is read, and it can be used to initialize variables (which are implicitly initialized to 0).
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = $NF # when a line contains "start value" store the last field as startValue
}
/epoch/ {
epoch = $NF
}
/stop value/ {
stopValue = $NF
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed: the first formats all but the file name, and the second embeds the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record: the first is the file name and the second the remainder of the csv record.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.
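The s///e modifier used above is specific to GNU sed: after the substitution, the pattern space is executed as a shell command and replaced by its output, which is how the run counter kept in the hold space gets incremented via expr. A standalone illustration (not part of the solution):
$ echo 41 | sed 's/.*/expr & + 1/e'
42
The F command is likewise GNU-only; it prints the name of the current input file, which is what the second invocation folds into each csv row.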

Using awk command to compare values on separate lines?

I am trying to build a bash script that uses the awk command to go through a sorted tab-separated file, line-by-line and determine if:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
If this is true, I want to add the values from fields 1 and 3 from the line and then field 4 from the next line to a file. For context, after sorting, the input file looks like:
molecule gene start end strand
ERR2661861.3269 JN051170.1 11330 10778 minus
ERR2661861.3269 JN051170.1 11904 11348 minus
ERR2661861.3269 JN051170.1 12418 11916 minus
ERR2661861.3269 JN051170.1 13000 12469 minus
ERR2661861.3269 JN051170.1 13382 13932 plus
ERR2661861.3269 JN051170.1 13977 14480 plus
ERR2661861.3269 JN051170.1 14491 15054 plus
ERR2661861.3269 JN051170.1 15068 15624 plus
ERR2661861.3269 JN051170.1 15635 16181 plus
Thus, in this example, the script should find the statement true when comparing lines 4 and 5 and append the following line to a file:
ERR2661861.3269 13000 13382
The script that I have thus far is:
# test input file
file=Eg2.1.txt.out
#sort the file by 'molecule' field, then 'start' field
sort -k1,1 -k3n $file > sorted_file
# create output file and add 'molecule' 'start' and 'end' headers
echo molecule$'\t'start$'\t'end >> Test_file.txt
# for each line of the input file, do this
for i in $sorted_file
do
# check to see if field 1 on current line is the same as field 1 on next line AND if field 5 on current line is "minus" AND if field 5 on next line is "plus"
if [awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}'] && [awk '{if(NR==i) print $5}' == "minus"] && [awk '{if(NR==i+1) print $5}' == "plus"];
# if this is true, then get the 1st and 3rd fields from current line and 4th field from next line and add this to the output file
then
mol=awk '{if(NR==i) print $1}'
start=awk '{if(NR==i) print $3}'
end=awk '{if(NR==i+1) print $4}'
new_line=$mol$'\t'$start$'\t'$end
echo new_line >> Test_file.txt
fi
done
The first part of the bash script works as I want it but the for loop does not seem to find any hits in the sorted file. Does anyone have any insights or suggestions for why this might not be working as intended?
Many thanks in advance!
Explanation why your code does not work
For a better solution to your problem see karakfa's answer.
String comparison in bash needs spaces around [ and ]
Bash interprets your command ...
[awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}']
... as the command [awk with the arguments {if(NR..., ==, awk, and {if(NR...]. On your average system there is no command named [awk, therefore this should fail with an error message. Add a space after [ and before ].
awk wasn't executed
[ awk = awk ] just compares the literal string awk. To execute the commands and compare their outputs use [ "$(awk)" = "$(awk)" ].
awk is missing the input file
awk '{...}' tries to read input from stdin (the user, in your case). Since you want to read the file, add it as an argument: awk '{...}' sorted_file
awk '... NR==i ...' is not referencing the i from bash's for i in
awk does not know about your bash variable. When you write i in your awk script, that i will always have the default value 0. To pass a variable from bash to awk use awk -v i="$i" .... Also, it seems like you assumed for i in would iterate over the line numbers of your file. Right now, this is not the case, see the next paragraph.
for i in $sorted_file is not iterating the file sorted_file
You called your file sorted_file. But when you write $sorted_file you reference a variable that wasn't declared before. Undeclared variables expand to the empty string, therefore you iterate nothing.
You probably wanted to write for i in $(cat sorted_file), but that would iterate over the file's content word by word, not over the line numbers. Also, the unquoted $() can cause unforeseen problems depending on the file content. To iterate over the line numbers, use for i in $(seq $(wc -l < sorted_file)) (note the <; wc -l sorted_file would print the file name after the count and confuse seq).
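Putting those fixes together, purely as an illustration (this per-line shell loop rescans the file for every line and is very slow; the awk answers below avoid it entirely):
for i in $(seq $(wc -l < sorted_file)); do
    mol=$(awk -v i="$i" 'NR == i { print $1 }' sorted_file)
    echo "line $i: molecule $mol"
done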
This will do the last step; it assumes the data is sorted on the key and that "minus" comes before "plus".
$ awk 'NR==1{next} $1==p && f && $NF=="plus"{print p,v,$3} {p=$1; v=$3; f=$NF=="minus"}' sortedfile
ERR2661861.3269 13000 13382
Note that awk has an implicit loop; there is no need to force it to iterate externally.
The best thing to do when comparing adjacent lines in a stream using awk, or any other program for that matter, is to store the relevant data of that line and then compare as soon as both lines have been read, like in this awk script (wrapped in braces so it forms an action that runs for every input line):
{
    molecule = $1
    strand = $5
    if (molecule == last_molecule)
        if (last_strand == "minus")
            if (strand == "plus")
                print $1, end, $4
    last_molecule = molecule
    last_strand = strand
    end = $3
}
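Saved as, say, compare.awk (my name for it), it would run as:
awk -f compare.awk sorted_file
Note that it follows the literal bullet points and prints field 4 of the "plus" line; the expected-output example in the question actually shows field 3 of that line (13382), which is what karakfa's answer prints.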
You essentially described a proto-program in your bullet points:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
You have everything needed to write a program in Perl, awk, ruby, etc.
Here is a Perl version:
perl -lanE 'if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") {say join("\t", @F[0..2])}
$l0=$F[0]; $l4=$F[4];' sorted_file
The -lanE part enables auto split (like awk) and auto loop and compiles the text as a program;
The if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") tests your three bullet points (but Perl arrays are 0-indexed, so 'first' is 0 and the fifth is 4); eq is Perl's string comparison, which is what we want for these fields.
The $l0=$F[0]; $l4=$F[4]; saves the current values of fields 1 and 5 to compare on the next pass through the loop. (Both awk and Perl allow comparisons with variables that don't exist yet, which is why $l0 and $l4 can be used in a comparison before being set on the first pass. In most other languages, such as Ruby, they need to be initialized first...)
Here is an awk version, same program essentially:
awk '($1==l1 && l5=="minus" && $5=="plus"){print $1 "\t" $2 "\t" $3}
{l1=$1;l5=$5}' sorted_file
Ruby version:
ruby -lane 'BEGIN{l0=l4=""}
puts $F[0..2].join("\t") if (l0==$F[0] && l4=="minus" && $F[4]=="plus")
l0=$F[0]; l4=$F[4]
' sorted_file
All three print:
ERR2661861.3269 JN051170.1 13382
My point is that you very effectively understood and stated the problem you were trying to solve. That is 80% of solving it! All you then needed is the idiomatic details of each language.

Matching pairs using Linux terminal

I have a file named list.txt containing (supplier,product) pairs, and I must show the number of products for each supplier together with their names, using the Linux terminal.
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to a file (here called file) and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
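If the report has to match the requested format exactly, the datamash output can be reshaped with a small awk postprocessor. This is only a sketch; note that -s and unique sort the groups and values alphabetically, so the ordering differs from the question's sample:
datamash -s -t ':' groupby 1 count 2 unique 2 < file |
awk -F: '{ print $1": "$2; gsub(/,/, " ", $3); print $1": "$3 }'
which prints:
dairy: 2
dairy: cheese milk
grocery: 2
grocery: apples pears
stationery: 3
stationery: paper pen rubber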
The following pipeline should do the job
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, to ease the subsequent processing
the cryptic sed joins the lines that share a supplier
awk counts the items per supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline, up to but not including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline, then restarts the cycle with whatever remains.
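For the sample input, the sort and sed stages should leave one joined line per supplier (in reverse sorted order), which the final awk then expands into the count line plus the items line:
stationery: paper pen rubber
grocery: apples pears
dairy: milk cheese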
This should be close to the awk-only code I was referring to:
< list.txt awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
# we use the word before the : as the key of an associative array
count[$1] += 1 # increment the count for the given supplier
items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
for (supp in items) # iterate on the suppliers and print the result
print supp": " count[supp], "\n"supp":" items[supp]
}' list.txt
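For the sample list.txt this prints something like the following (the for (supp in items) iteration order is unspecified, so the group order may vary):
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese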

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
if myfile(myline,2) ~= myfile(myline+1,2)
mynewfile(mynewline,:) = myfile(myline,:);
mynewline = mynewline+1;
myline = myline+1;
else
myline = myline+1;
end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk '($2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: an awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for each line. A BEGIN block is executed before the first line. Each line is split into fields, which are accessible as $1, $2, and so on; the full line is in $0.
Here I compare the second field to the previous value; if it does not match, I print the whole previous line. In all cases I store the current line into line and the second field into prev.
And if you really want it right, be careful with the float comparisons - something like abs($2 - prev) < eps for equality (there is no abs in awk, you need to define it yourself, and eps is some small enough number). POSIX awk does compare two numeric-looking values numerically; string comparison would also happen to be safe with this data, since equal values are spelled identically.
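A minimal sketch of that idea (my variant of the one-liner above; eps is an assumed tolerance you would tune to your data):
<data awk -v eps=1e-6 '
    function abs(x) { return x < 0 ? -x : x }
    # print the stored previous line when column 2 changes by more than eps
    NR > 1 && abs($2 - prev) >= eps { print line }
    { line = $0; prev = $2 }'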
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try the following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and second field, comparing them in the next cycle with the current line's values. The && field check is useful to avoid a blank line at the beginning of the file, where $2 != field alone would match because the variable is empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

How to use unix grep and the output together?

I am new to unix commands. I have a file named server.txt which has 100 fields; the first line of the file is a header.
I want to look at fields 99 and 100 only.
Field 99 is just some numbers; field 100 is a string.
The delimiter between fields is a space.
My goal is to extract every token in the string (field 100) with grep and a regex,
then output field 99 together with every token extracted from the string,
and skip the first 1000 lines of my records.
----server.txt--
... ... ,field99,field100
... ... 5,"hi are"
... ... 3,"how is"
-----output.txt
header1,header2
5,hi
5,are
3,how
3,is
So I have some idea, but I don't know how to combine all the scripts.
Here is some of my thinking:
sed 1000d server.txt cut -f99,100 -d' ' >output.txt
grep | /[A-Za-z]+/|
Sounds more like a job for awk.
awk -F, 'NR <= 1000 { next; }
{ gsub(/^\"|\"$/, "", $100); split($100, a, / /);
for (v=1; v<=length(a); ++v) print $99, a[v]; }' server.txt >output.txt
The general form of an awk program is a sequence of condition { action } expressions. The first line has the condition NR <= 1000, where NR is the current line number. If the condition is true, the next action skips to the next input line. Otherwise, we fall through to the next expression, which does not have a condition; so it is unconditional, running for all input lines which reach it. It first cleans out the double quotes around the 100th field's value, and then splits it on spaces into the array a. The for loop then loops over this array, printing the 99th field's value and the v-th element of the array, starting with v=1 and going up through the end of the array.
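If the output should be comma-separated like the output.txt sample, here is a slight variant (my adjustment: it sets OFS and uses split's return value, which is a bit more portable than calling length on an array):
awk -F, -v OFS=, 'NR <= 1000 { next }
  { gsub(/^\"|\"$/, "", $100)   # strip the surrounding double quotes
    n = split($100, a, / /)     # split the string field on spaces
    for (v = 1; v <= n; ++v) print $99, a[v] }' server.txt >output.txt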
The input file format is sort of cumbersome. The gsub and split stuff could be avoided with a slightly more sane input format. If you are new to awk, you should probably go look for a tutorial.
If you only want to learn one scripting language, I would suggest Perl or Python over awk, but it depends on your plans and orientation.
