Columns addition - ruby

I am trying to sum the columns (the values between the commas) grouped by date and time.
Example:
RG Data,2015/02/27,18:02:07,"0","52",50.0,5.3,44.7,5.6,100.0,0.23,0.03,0.20,6.3,4.5
RG Data,2015/02/27,18:02:07,"1","52",36.9,22.3,14.6,39.9,100.0,0.59,0.16,0.43,7.5,29.9
RG Data,2015/02/27,18:03:06,"0","52",21.2,0.7,20.5,50.0,100.0,0.08,0.00,0.08,0.0,4.2
RG Data,2015/02/27,18:03:06,"1","52",245.6,233.4,12.2,73.7,100.0,2.08,1.83,0.25,8.0,21.4
... more lines after...
Output:
RG Data,2015/02/27,18:02:07,86.9,27.6,59.3,....
RG Data,2015/02/27,18:03:06,266.8,234.1,....
where 86.9 comes from 50.0 (1st line) + 36.9 (2nd line), and so on for each column.
My attempt with awk:
for TIME in $(awk -F ',|/' '{print $4","$5}' FILE | sort -u); do echo -n "$TIME,"; awk -F ',' "/$TIME/ {SUM += \$6} END { print SUM }" FILE; done
Many thanks for any help.

This awk one-liner produces something close to the desired output:
$ awk -F, '{k=$1FS$2FS$3; seen[k]; for(i=6;i<=NF;++i) sum[k,i]+=$i} END{for(i in seen){printf "%s,",i; for(j=6;j<=NF;++j) printf "%s%s",sum[i,j],(j<NF?FS:RS)}}' file
RG Data,2015/02/27,18:03:06,266.8,234.1,32.7,123.7,200,2.16,1.83,0.33,8,25.6
RG Data,2015/02/27,18:02:07,86.9,27.6,59.3,45.5,200,0.82,0.19,0.63,13.8,34.4
The variable k is the key, which is made up of the first, second, and third columns of each line, joined on the field separator FS (a comma in this case). The array seen keeps track of every key k that is encountered.
The loop goes through each field from the sixth one to the last, adding to an element of the sum array, whose key is composed of the line's key k (the first three fields, as in seen) and the current field number.
Once the file has been processed, loop through the seen array and print out all of the corresponding elements of the sum array. Note that for (i in seen) visits the keys in an unspecified order, which is why the two output lines above appear swapped.
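For readability, here is the same logic laid out as a multi-line script (an equivalent sketch of the one-liner above; file is a placeholder for your data file):
awk -F, '
{
    k = $1 FS $2 FS $3             # key: e.g. "RG Data,2015/02/27,18:02:07"
    seen[k]                        # record every key encountered
    for (i = 6; i <= NF; ++i)      # sum each numeric column per key
        sum[k, i] += $i
}
END {
    # NF here is the field count of the last input line
    for (k in seen) {              # keys come out in an unspecified order
        printf "%s", k
        for (j = 6; j <= NF; ++j)
            printf "%s%s", FS, sum[k, j]
        print ""
    }
}' file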

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA hour is greater than 15; for the sample above, only the first column of the 2nd and 3rd lines should be printed.
345
456
I tried cat file.txt | awk -F '[,TPF=]' '{print $1}' but it prints the whole line for rows that have ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. If the 2nd element of arr is greater than 15, the 1st element of arr is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
    print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index, to find where ETA= is, and substr, to get the 2 characters after ETA=; the offset 4 is used because ETA= is 4 characters long and index gives the start position. I use +0 to convert the result to an integer and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
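As a small illustration of those two string functions (a standalone check using the second sample line; not part of the solution itself):
awk 'BEGIN {
    s = "345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00"
    p = index(s, "ETA=")               # character position where "ETA=" begins
    hour = substr(s, p + 4, 2) + 0     # the two digits right after "ETA=", as a number
    print hour, (hour > 15 ? "keep" : "skip")
}'
This prints 23 keep.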
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in the future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. You probably want to split the line separately, on whitespace, to obtain the first column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one); adding 0 forces a numeric comparison, which simply ignores any non-numeric text after the number at the beginning of the field. So, for example, on the first line we are literally comparing 12:00, team=xyz,user1=tom,dom=dby.com against 15, but it effectively checks whether 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
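A quick way to see that coercion in isolation (just a demonstration, not part of the solution; the string is lifted from the first sample line):
$ awk 'BEGIN { v = "12:00, team=xyz"; print v+0, (v+0 > 15) }'
12 0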
Using awk, you could match ETA= followed by 1 or more digits, then take the match without the ETA= part, check if the number is greater than 15, and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
    if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
    if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

How to loop through a character array to match array items from one column and extract full rows from a tab separated text file

I have a tab-separated text file whose first column is a subject ID (characters), followed by another 23 columns (all numeric values, except the second column, which is also characters).
There are 700 rows (subjects), but I want to extract a subset of 200 rows by matching against the subject ID column.
I have tried using grep and sed and awk with various combinations but I have not been successful. Some things that have failed include:
for sub in ${subjects[@]}; do
grep $sub | \
sed < baseline_subs.txt > baseline_moresubs.txt;
done
and
awk '{if ($1 == $sub) { print } }' baseline_moresubs.txt
Please help
Suggestion 1 awk script: scanning baseline_moresubs.txt ${#subjects[@]} times
Scan the same input file many times, each time screening for one subject.
for sub in ${subjects[@]}; do
# screen baseline_moresubs.txt for 1st field,
awk "\$1 == \"$sub\" {print}" baseline_moresubs.txt
done
Suggestion 2 awk script: scanning baseline_moresubs.txt once
Map the subjects[] array into an awk associative array (dictionary) subjDict. Then scan each line for any of the subjects.
awk -v inpArr="${subjects[*]}" '
BEGIN {                                      # pre-processing input
    split(inpArr, arr);                      # map inpArr into arr indexed 1,2,3...
    for (i in arr) subjDict[arr[i]] = 1;     # map arr into dictionary
}
$1 in subjDict {print}                       # if 1st field is in the dictionary, print the line
' baseline_moresubs.txt
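For example, if the subject IDs were defined in bash like this (hypothetical IDs), the -v inpArr="${subjects[*]}" assignment above passes them to awk as one space-separated string:
subjects=( "subj001" "subj017" "subj042" )   # hypothetical subject IDs
echo "${subjects[*]}"                        # -> subj001 subj017 subj042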

Shell command to sum up numbers across similar lines of text in a file

I have a file with thousands of lines, each containing a number followed by a line of text. I'd like to add up the numbers for the lines whose text is similar. I'd like unique lines to be output as well.
For example:
25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee
The output would be:
37 cup of coffee
75 sign on the dotted
30 take a test
Any suggestions on how this could be achieved in the Unix shell?
I looked at "Shell command to sum integers, one per line?" but that is about summing up a column of numbers across all lines of a file, not across similar text lines only.
There is no need for multiple processes and pipes. awk alone is more than capable of handling the entire job (and will be orders of magnitude faster on large files). With awk, simply append each of the fields 2 through NF into a string and use that string as the index for summing the numbers from field 1 in an array. Then, in the END section, simply output the contents of the array. Presuming your data is stored in file, you could do:
awk '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    str=""
}
END {
    for (i in a) print a[i], i
}' file
Above, the first for loop simply appends all fields from 2 through NF to str, and a[str] += $1 sums the values from field 1 into array a using str as the index. That ensures the values for similar lines are summed. In the END section, you simply loop over each element of the array, outputting the element value (the sum) and then the index (the original str built from fields 2 through NF).
Example Use/Output
Just take what is above, select it, and then middle-mouse paste it into a command line in the directory where your file is located (change the name of file to your data file name)
$ awk '{
> for (i=2; i<=NF; i++)
> str = str " " $i
> a[str] += $1
> str=""
> }
> END {
> for (i in a) print a[i], i
> }' file
30 take a test
37 cup of coffee
75 sign on the dotted
If you want the lines sorted in a different order, just add | sort [options] after the filename to pipe the output to sort. For example for output in the order you show, you would use | sort -k 2 and the output would be:
37 cup of coffee
75 sign on the dotted
30 take a test
Preserving Original Order Of Strings
Pursuant to your comment about preserving the original order in which the lines of text appear in the input file, you can keep a second array in which the strings are stored in the order they are seen, using a sequential index. In the example below, the o array (order array) stores each unique string (fields 2 through NF) and the variable n is used as a counter. A loop over the array checks whether the string has already been stored; if so, next avoids storing it again and jumps to the next input record. In END, a for (i = 0; i < n; i++) loop then outputs the information from both arrays in the order the strings were first seen in the original file, e.g.
awk -v n=0 '{
    for (i=2; i<=NF; i++)
        str = str " " $i
    a[str] += $1
    for (i = 0; i < n; i++)
        if (o[i] == str) {
            str=""
            next;
        }
    o[n++] = str;
    str=""
}
END {
    for (i = 0; i < n; i++) print a[o[i]], o[i]
}' file
Output
37 cup of coffee
75 sign on the dotted
30 take a test
Here is a simple awk script that does the task:
script.awk
{ # for each input line
    inpText = substr($0, length($1)+2);           # read the input text after the 1st field
    inpArr[inpText] = inpArr[inpText] + 0 + $1;   # accumulate the 1st field in the array
}
END { # post processing
    for (i in inpArr) {        # for each element in inpArr
        print inpArr[i], i;    # print the sum and the key
    }
}
input.txt
25 cup of coffee
75 sign on the dotted
28 take a test
2 take a test
12 cup of coffee
running:
awk -f script.awk input.txt
output:
75 sign on the dotted
37 cup of coffee
30 take a test
Using datamash is relatively succinct. First use sed to change the first space to a tab (for this job datamash must have one, and only one, tab separator), then use -s -g2 to sort and group by the 2nd field (i.e. "cup" etc.), then use sum 1 to add up the first-column numbers per group, and it's done. Well, not quite: datamash prints the group key first, so the number column ends up in the 2nd field; reverse moves it back to the 1st field:
sed 's/ /\t/' file | datamash -s -g2 sum 1 | datamash reverse
Output:
37 cup of coffee
75 sign on the dotted
30 take a test
You can do the following (assume the name of the file is file.txt):
for key in $(sort -k2 -u file.txt | cut -d ' ' -f2)
do
    cat file.txt | grep $key | awk '{s+=$1} END {print $2 "\t" s}'
done
Explanation:
1. get all unique keys (cup of coffee, sign on the dotted, take a test):
sort -k2 -u file.txt | cut -d ' ' -f2
2. grep all lines with unique key from the file:
cat file.txt | grep $key
3. Sum the lines using awk, where $1 is the number column and $2 is the key:
awk '{s+=$1} END {print $2 "\t" s}'
Put everything in a for loop and iterate over the unique keys.
Note: if a key can be a substring of another key, for example "coffee" and "cup of coffee", you will need to change step 2 to grep with a stricter regex (see the sketches below).
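Two possible sketches for that stricter matching (not part of the original loop; adjust to your data). grep -w matches the key only as a whole word, and the -E variant anchors the key so it must start right after the leading number:
grep -w "$key" file.txt
grep -E "^[0-9]+ +$key" file.txt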
you mean something like this?
#!/bin/bash
# define a dictionary
declare -A dict
# loop over all lines
while read -r line; do
    # read first word as value and the rest as text
    IFS=' ' read value text <<< "$line"
    # use 'text' as key, get value for 'text', default 0
    [ ${dict[$text]+exists} ] && dictvalue="${dict[$text]}" || dictvalue=0
    # sum value
    value=$(( $dictvalue + value ))
    # save new value in dictionary
    dict[$text]="$value"
done < data.txt

# loop over dictionary, print sum and text
for key in "${!dict[@]}"; do
    printf "%s %s\n" "${dict[$key]}" "$key"
done
output
37 cup of coffee
75 sign on the dotted
30 take a test
Another version based on the same logic as the one mentioned above by @David.
Changes: it omits the explicit loops to speed up the process.
awk '
{
    text = substr($0, index($0,$2))                    # everything from field 2 onward (the text part)
    if (!(text in text_sums)) { texts[i++] = text }    # remember each text in first-seen order
    text_sums[text] += $1                              # accumulate field 1 per text
}
END {
    for (j = 0; j < i; j++) print text_sums[texts[j]], texts[j]
}' input.txt
Explanation:
substr returns the string starting at field 2, i.e. the text part.
The array texts stores each text at a consecutive integer index, but only if it is not already present in text_sums.
text_sums keeps adding field 1 for the corresponding text.
The reason for a separate array that stores each text against a consecutive integer index is to preserve the order in which the texts were first seen, so the END loop can print them back in that same consecutive order.
See Array Intro in the GNU awk manual. Its footnote says:
The ordering will vary among awk implementations, which typically use hash tables to store array elements and values.
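As a side note (GNU awk only, and not part of the answer above), gawk also lets you impose a traversal order on for (i in ...) via PROCINFO["sorted_in"]. For instance, sorting the output alphabetically by text, which for this sample happens to coincide with the desired order:
awk '
{
    text = substr($0, index($0, $2))         # text part, as in the script above
    text_sums[text] += $1                    # sum field 1 per text
}
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk only: scan keys in ascending string order
    for (t in text_sums) print text_sums[t], t
}' input.txt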

Indexing variable created by awk in bash

I'm having some trouble indexing a variable (consisting of 1 line with 4 values) derived from a text file with awk.
In particular, I have a text-file containing all input information for a loop. Every row contains 4 specific input values, and every iteration makes use of a different row of the input file.
Input file looks like this:
/home/hannelore/TVB-pipe_local/subjects/CON02T1/ 10012 100000 1001 --> used for iteration 1
/home/hannelore/TVB-pipe_local/subjects/CON02T1/ 10013 7200 1001 --> used for iteration 2
...
From this input text file, I identified the different columns (path, seed, count, target), and then I wanted to index these variables in each iteration of the loop. However, index 0 returns the entire variable, and higher indices return nothing. Using awk, cut, or IFS on this variable, I wasn't able to split it. Can anyone help me with this?
Some code that I used:
seed=$(awk '{print $2}' $input_file)
--> extract column information from input file, this works
seedsplit=$(awk '{print $2}' $seed)
seedsplit=$(cut -f2 -d ' ' $seed)
Thank you in advance!
Kind regards,
Hannelore
If I understand you correctly, you want to extract the values from the input file row by row.
while read a b c d; do echo "var1:" ${a}; done < file
will print
var1: /home/hannelore/TVB-pipe_local/subjects/CON02T1/
var1: /home/hannelore/TVB-pipe_local/subjects/CON02T1/
Similarly, you can access the other values in b, c, and d.
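For example, naming all four columns as described in the question (path, seed, count and target are just descriptive variable names; $input_file is the question's input file):
while read -r path seed count target; do
    echo "this iteration uses: path=$path seed=$seed count=$count target=$target"
done < "$input_file"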
If you want an array, then use array assignment notation:
seed=( $(awk '{print $2}' $input_file) )
Now you will have the words from each line of output from awk in a separate array element.
col1=( $(awk '{print $1}' $input_file) )
col3=( $(awk '{print $3}' $input_file) )
Now you have three arrays which can be indexed in parallel.
for i in $(seq 0 $(( ${#col1[@]} - 1 )))
do
    echo "${col1[$i]} in col1; ${seed[$i]} in col2; ${col3[$i]} in col3"
done

Check number in another file if in range using shell script

I have two files (fileA and fileB). fileA contains a list of numbers and fileB contains the number ranges.
fileA
446646452
000000001
63495980020
fileB (range_from and range_to)
22400208, 22400208
446646450, 446646450
63495980000, 63495989999
OUTPUT MUST BE
63495980020
In an SQL script it would be something like:
select *
from fileB
where 446646452 between Range_from and Range_To
How can I do it using shell script?
Per clarification from the OP, each value in fileA should be checked against all ranges in fileB to see if it falls into at least one range.
>= and <= logic for range checking is assumed (i.e., values that coincide with the range endpoints are included).
awk -F', +' '
# 1st pass (fileB): read the lower and upper range bounds
FNR==NR { lbs[++count] = $1+0; ubs[count] = $2+0; next }
# 2nd pass (fileA): check each line against all ranges.
{
    for (i=1; i<=count; ++i) {
        if ($1+0 >= lbs[i] && $1+0 <= ubs[i]) { print; next }
    }
}
' fileB fileA
awk is used to read both files, using separate passes:
FNR==NR is true for all lines from fileB; parallel arrays for the lower bounds (lbs) and upper bounds (ubs) of the ranges are built up; thanks to next, no further processing is applied to fileB lines.
The subsequent {...} block is then only applied to lines from fileA.
Each value from fileA is checked against all ranges, and as soon as a match is found, the input line is printed and processing proceeds to the next line.
To ensure that all tokens involved are treated as numbers, +0 is added to them.
Printing Numbers That Match Any of the Ranges
$ awk 'FNR==NR{low[NR]=$1+0; hi[NR]=$2+0;next} {for (i in low)if ($1>low[i] && $1<hi[i]){print $1;next}}' fileB fileA
63495980020
How it works
FNR==NR{low[NR]=$1+0; hi[NR]=$2+0;next}
When reading in the first file, fileB, save low end of the range in the array low and the high end in the array hi.
for (i in low)if ($1>low[i] && $1<hi[i]){print $1;next}
When reading in the second file, fileA, check the number against each range. If it satisfies any of the ranges, then print it and skip to the next line.
Printing Numbers That Match Their Respective Range
$ paste fileA fileB | awk '$1>$2+0 && $1<$3+0{print $1}'
63495980020
Note that only 63495980020 is printed. 446646452 is not between 22400208 and 22400208, so it is omitted.
How it works
The utility paste combines the files like this:
$ paste fileA fileB
446646452 22400208, 22400208
000000001 446646450, 446646450
63495980020 63495980000, 63495989999
The first column is the number we are interested in while the second column is the low value of the range and the third the high value. We want to print the first value, $1, if it is between the second and third. To test if it is bigger than the second, we might try:
$1>$2
However, to assure that awk is treating the fields as numbers, not strings, we perform addition on one of the numbers like this:
$1>$2+0
Similarly, to test if the first number is smaller than the third:
$1<$3+0
Putting those two tests together with a print command yields:
$1>$2+0 && $1<$3+0 {print $1}
This test does strictly between. Depending on your requirements, you may prefer:
$1>=$2+0 && $1<=$3+0 {print $1}
Old-fashioned script
sed 's/,[[:space:]]*/ /' fileB \
| while read LowVal HighVal
  do
      while read ThisLine
      do
          [ ${ThisLine} -ge ${LowVal} ] && [ ${ThisLine} -le ${HighVal} ] && echo "${ThisLine}"
      done < fileA
  done
