select columns from the files and save the output - bash

I am new to programming. I have many files in a directory, as shown below, and each file consists of two columns of data.
TS.TST_X.1990-11-22
TS.TST_Y.1990-11-22
TS.TST_Z.1990-11-22
TS.TST_X.1990-12-30
TS.TST_Y.1990-12-30
TS.TST_Z.1990-12-30
First I want to take only the second column of each group of files sharing the same date (the names differ only in the X, Y, Z part), e.g. TS.TST_X.1990-11-22, TS.TST_Y.1990-11-22, TS.TST_Z.1990-11-22, and save the combined output in a file named TSTST19901122.
Similarly for the TS.TST_X.1990-12-30, TS.TST_Y.1990-12-30, TS.TST_Z.1990-12-30 files, saving the output as TSTST19901230.
For example:
if the files contain the following (shown side by side):
TS.TST_X.1990-11-22 TS.TST_Y.1990-11-22 TS.TST_Z.1990-11-22
1 2 1 3.4 1 2.1
2 5 2 2.4 2 4.2
3 2 3 1.2 3 1.0
4 4 4 2.4 4 3.5
5 8 5 6.3 5 1.8
Then the output file TSTST19901122 would look like this:
2 3.4 2.1
5 2.4 4.2
2 1.2 1.0
4 2.4 3.5
8 6.3 1.8
I tried this code:
#!/bin/sh
for file in /home/min/data/*
do
awk '{print $2}' "$file"
done
But my code only prints the second column of every file, one after another; it doesn't produce the expected output. So I need the experts' help here.

Hope the example below helps you get started. Next time you post on SO, make sure you post your input properly so that it is easy for readers to help you:
[akshay@db1 tmp]$ cat test.sh
#!/usr/bin/env bash
# sort -u with dot as the field separator
# keeps one representative filename per date
while IFS= read -r f; do
# create a glob pattern like TS.TST_*.1990-11-22
i=$(sed 's/_[^.]/_*/' <<<"$f");
# modify outfile if you want any extension suffix etc
outfile=$(sed 's/[^[:alnum:]]//g' <<<"$i")".txt";
# filename expansion with unquoted variable
# finally use awk to print whatever you want
paste $i | awk 'NR>1{for(i=2; i<=NF; i+=2)printf "%s%s", $(i), (i<NF ? OFS : ORS)}' >"$outfile"
done < <(printf '%s\n' TS.TST* | sort -t'.' -u -k3,3)
[akshay@db1 tmp]$ bash test.sh
[akshay@db1 tmp]$ cat TSTST19901122.txt
2 3.4 2.1
5 2.4 4.2
2 1.2 1.0
4 2.4 3.5
8 6.3 1.8
Input:
[akshay@db1 tmp]$ ls TS.TST* -1
TS.TST_X.1990-11-22
TS.TST_Y.1990-11-22
TS.TST_Z.1990-11-22
[akshay@db1 tmp]$ for i in TS.TST*; do cat "$i"; done
TS.TST_X.1990-11-22
1 2
2 5
3 2
4 4
5 8
TS.TST_Y.1990-11-22
1 3.4
2 2.4
3 1.2
4 2.4
5 6.3
TS.TST_Z.1990-11-22
1 2.1
2 4.2
3 1.0
4 3.5
5 1.8
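For reference, the grouping pipeline in the script yields one representative filename per date, which the two sed calls then turn into a glob and an output name. A quick sketch of what it prints, assuming files for both dates in the question are present:
$ printf '%s\n' TS.TST* | sort -t'.' -u -k3,3
TS.TST_X.1990-11-22
TS.TST_X.1990-12-30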

EDIT: Since the OP has mentioned in comments that the actual file names are a little different, adding a solution for that here (as per the OP there are only 3 types of files, differing in year and month):
for file in TS.TST_BHE*
do
year=${file/*\./}
year=${year//-/}
yfile=${file/BHE/BHN}
zfile=${file/BHE/BHZ}
outfile="TSTST.$year"
##echo $file $yfile $zfile
paste "$file" "$yfile" "$zfile" | awk '{print $2,$4,$6}' > "$outfile"
done
Explanation: here is a detailed walkthrough of the above.
for file in TS.TST_BHE*
##Loop over the TS.TST_BHE* files; the variable file holds the current filename.
do
year=${file/*\./}
##Create year by stripping everything up to and including the last dot (leaving 1990-11-22).
year=${year//-/}
##Remove all - characters from year (giving 19901122).
yfile=${file/BHE/BHN}
##Replace BHE with BHN in file and save the result as yfile.
zfile=${file/BHE/BHZ}
##Replace BHE with BHZ in file and save the result as zfile.
outfile="TSTST.$year"
##Build outfile as TSTST. followed by the year value.
##echo $file $yfile $zfile
paste "$file" "$yfile" "$zfile" | awk '{print $2,$4,$6}' > "$outfile"
##Use paste to concatenate the 3 files (BHE, BHN and BHZ) side by side, printing only the 2nd, 4th and 6th fields.
done
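For reference, here is how those parameter expansions behave on a single sample name (a sketch you can paste into an interactive bash to verify):
file=TS.TST_BHE.1990-11-22
echo "${file/*\./}"      # 1990-11-22 (the greedy *\. match strips through the last dot)
year=${file/*\./}
echo "${year//-/}"       # 19901122
echo "${file/BHE/BHN}"   # TS.TST_BHN.1990-11-22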
Could you please try the following, based on the OP's comment that we can simply paste the Input_files together without checking the first column's value:
for file in TS.TST_X*
do
year=${file/*\./}
year=${year//-/}
yfile=${file/X/Y}
zfile=${file/X/Z}
outfile="TSTST.$year"
###echo $file $yfile $zfile ##Just to print variable values(optional)
paste "$file" "$yfile" "$zfile" | awk '{print $2,$4,$6}' > "$outfile"
done
For the shown samples the output will be as follows; the above will generate a file named TSTST.19901122:
cat TSTST.19901122
2 3.4 2.1
5 2.4 4.2
2 1.2 1.0
4 2.4 3.5
8 6.3 1.8

The input files can be recreated as follows:
cat <<EOF >TS.TST_X.2000-11-22
1 2
2 5
3 2
4 4
5 8
EOF
cat <<EOF >TS.TST_Y.2000-11-22
1 3.4
2 2.4
3 1.2
4 2.4
5 6.3
EOF
cat <<EOF >TS.TST_Z.2000-11-22
1 2.1
2 4.2
3 1.0
4 3.5
5 1.8
EOF
cat <<EOF >TS.TST_X.1990-11-22
1 2
2 5
3 2
4 4
5 8
EOF
cat <<EOF >TS.TST_Y.1990-11-22
1 3.4
2 2.4
3 1.2
4 2.4
5 6.3
EOF
cat <<EOF >TS.TST_Z.1990-11-22
1 2.1
2 4.2
3 1.0
4 3.5
5 1.8
EOF
When run with the following script:
# get the filenames
find . -maxdepth 1 -name "TS.TST*" -printf "%f\n" |
# meh, sort them, so it looks nice
sort |
# group files according to suffix after the dot
awk -F. '
{ a[$3]=a[$3]" "$0 }
END{ for (i in a) print i, a[i] }
' |
# here we have: YYYY-MM-DD filename1 filename2 filename3
# let's transform it into TSTSTYYYYMMDD filename{1,2,3}
sed -E 's/^([0-9]{4})-([0-9]{2})-([0-9]{2})/TSTST\1\2\3/' |
while IFS=' ' read -r new f1 f2 f3; do
# get second column from all files
# the files arrive in sorted (X, Y, Z) order because the input list was sorted above
paste "$f1" "$f2" "$f3" | awk '{print $2,$4,$6}' > "$new"
done
# just output
for i in TSTST*; do echo "$i"; cat "$i"; done
Generates the following output:
TSTST19901122
2 3.4 2.1
5 2.4 4.2
2 1.2 1.0
4 2.4 3.5
8 6.3 1.8
TSTST20001122
2 3.4 2.1
5 2.4 4.2
2 1.2 1.0
4 2.4 3.5
8 6.3 1.8
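For orientation, the stream feeding the while loop looks like this for the sample files, one line per date (there are actually two spaces after the output name, left over from the leading separator in a[$3], but read's word splitting absorbs them):
TSTST19901122 TS.TST_X.1990-11-22 TS.TST_Y.1990-11-22 TS.TST_Z.1990-11-22
TSTST20001122 TS.TST_X.2000-11-22 TS.TST_Y.2000-11-22 TS.TST_Z.2000-11-22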
I would advise doing some research on basic shell commands. Read the documentation for find. Read an introduction to awk and sed scripting. Read a good introduction to bash, and learn how to iterate over, sort, merge and filter lists of files in bash. Also read up on how to read a stream line by line.

Related

Cannot print in awk command in bash script

I am trying to read values from a file and print specific items into a variable which I will use later.
cat /dir1/file1 | while read blmbline2
do
BLMBFILE2=`print $blmbline2 | awk '{$1=""; print $0}'`
echo $BLMBFILE2
done
When I run that same code at the command line, it runs as expected, but when I run it in a bash script called testme.sh, I get this error:
./testme.sh: line 3: print: command not found
If I run print by itself at the command prompt, I don't get an error (just a blank line).
If I run "bash" and then print at the command prompt, I get command not found.
I can't figure out what I'm doing wrong. Can someone advise?
Update: I see some other posts that say to use echo or printf. Is there a difference I need to be concerned about when using one of those in bash?
Since awk can read files, you may be able to do away with the cat | while read and just use awk. Using a sample file containing:
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
Declare your bash array variable and populate it with the output from awk:
arr=() ; arr=($(awk '{$1=""; print $0}' /dir1/file1))
Use the following to display array size and contents:
printf "array length: %d\narray contents: %s\n" "${#arr[#]}" "${arr[*]}"
Output:
array length: 30
array contents: 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6 2 3 4 5 6
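Note that the unquoted $( ) relies on word splitting, so every whitespace-separated token becomes its own element, which is why the length is 30 rather than 6. If you would rather keep one array element per input line, a sketch using mapfile (requires bash 4+):
mapfile -t arr < <(awk '{$1=""; sub(/^ /,""); print}' /dir1/file1)
printf "array length: %d\n" "${#arr[@]}"   # 6, one element per line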
Change print to echo in your shell script. With printf you can format the data and with echo it will print the entire line of the file. Also, create an array so you can store multiple items:
BLMBFILE2=()
while IFS= read -r REPLY
do
BLMBFILE2+=("$(echo "$REPLY" | awk '{$1=""; print $0}')")
echo "$BLMBFILE2"
done < /dir1/file1
echo "Items found:"
for value in "${BLMBFILE2[@]}"
do
echo $value
done
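On the echo-versus-printf question from the update: echo prints its arguments followed by a newline, while printf applies an explicit format string, so printf is the safer choice when data may begin with a dash or contain backslashes. A small illustration:
line='-n some text'
echo $line              # unquoted: echo eats -n as an option and drops the newline
printf '%s\n' "$line"   # always prints the literal line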

How to add multi column data using Bash

I have a data file, say input.dat, which looks like the following:
1 2 4 6
2 3 6 9
3 4 8 12
I want the data in the 2nd, 3rd and 4th columns to be added up and printed to an output.dat file like the following:
1 12
2 18
3 24
How can this be achieved in bash?
Using awk you can do this:
awk '{print $1, $2+$3+$4}' input.dat > output.dat
If you prefer plain bash it can be done like this (at least if the numbers are integers): run bash sum.sh < input.dat, where sum.sh is:
sum.sh
while read -r v1 v2 v3 v4;
do
echo $v1 $(( v2 + v3 + v4 ))
done
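A quick check against the sample data (assuming the script is saved as sum.sh next to input.dat):
$ bash sum.sh < input.dat
1 12
2 18
3 24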

Randomly sample lines retaining commented header lines

I'm attempting to randomly sample lines from a (large) file, while always retaining a set of "header lines". Header lines are always at the top of the file and unlike any other lines, begin with a #.
The actual file format I'm dealing with is a VCF, but I've kept the question general.
Requirements:
Output all header lines (identified by a # at line start)
The command / script should (have the option to) read from STDIN
The command / script should output to STDOUT
For example, consider the following sample file (file.in):
#blah de blah
1
2
3
4
5
6
7
8
9
10
An example output (file.out) would be:
#blah de blah
10
2
5
3
4
It is capable of reading from STDIN (I can cat the contents of file.in into the rest of the command); however, it writes to a named file rather than STDOUT:
cat file.in | tee >(awk '$1 ~ /^#/' > file.out) | awk '$1 !~ /^#/' | shuf -n 5 >> file.out
By using process substitution (thanks Tom Fenech), both commands are seen as files.
Then using cat we can concatenate these "files" together and output to STDOUT.
cat <(awk '/^#/' file) <(awk '!/^#/' file | shuf -n 10)
Input
#blah de blah
1
2
3
4
5
6
7
8
9
10
Output
#blah de blah
1
9
8
4
7
2
3
10
6
5
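Note that the answer above reads file twice, so it cannot take its input from a pipe directly. If the data must arrive on STDIN (which can only be read once), a minimal sketch is to buffer it in a temporary file first (sample-vcf.sh is a hypothetical name):
#!/usr/bin/env bash
# sample-vcf.sh: print header lines, then 5 random non-header lines
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT
cat > "$tmp"    # slurp STDIN into the temp file
cat <(awk '/^#/' "$tmp") <(awk '!/^#/' "$tmp" | shuf -n 5)
Usage: cat file.in | ./sample-vcf.sh > file.out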

Count and sum up specific decimal number (bash,awk,perl)

I have a tab-delimited file, and every time the first column contains a number rather than a letter I want to add a fixed decimal amount (1.5) to a running sum, printing the running total for every row from the first to the last.
I have an example file which looks like this:
It has 8 rows:
1st-column 2nd-Column
a ship
1 name
b school
c book
2 blah
e blah
3 ...
6 ...
Now I want my script to read line by line and, whenever it finds a number, add 1.5 to the sum, giving output for the first column like this:
0
1.5
1.5
1.5
3
3
4.5
6
my script is:
#!/bin/bash
for c in {1..8}
do
awk 'NR==$c { if (/^[0-9]/) sum+=1.5} END {print sum }' file
done
but I don't get any output!
Thanks for your help in advance.
If the running total shown in your expected output is what you want, then you can do:
$ awk '$1~/^[[:digit:]]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
use warnings;
use strict;
my $sum = 0;
while (<DATA>) {
my $data = (split)[0]; # 1st column
$sum += 1.5 if $data =~ /^\d+$/;
print "$sum\n";
}
__DATA__
a ship
1 name
b school
c book
2 blah
e blah
3 ...
6 ...
Why not just use awk:
awk '{if (/^[0-9]+[[:blank:]]/) sum+=1.5} {print sum+0 }' file
Edited to simplify based on jaypal's answer: anchor the number and handle both tabs and spaces.
How about
perl -lane 'next unless $F[0]=~/^\d+$/; $c+=1.5; END{print $c}' file
Or
awk '$1~/^[0-9]+$/{c+=1.5}END{print c}' file
These only produce the final sum, as your script would have done. If you want to show the numbers as they grow, use:
perl -lane 'BEGIN{$c=0}$c+=1.5 if $F[0]=~/^\d+$/; print "$c"' file
Or
awk 'BEGIN{c=0}{if($1~/^[0-9]+$/){c+=1.5}{print c}}' file
I'm not sure whether you're multiplying the first field by 1.5, or adding 1.5 to a sum every time there's a number in $1 and ignoring the contents of the line otherwise. Here are both in awk, using your sample data as the contents of "file".
$ awk '$1~/^[0-9]+$/{val=$1*1.5}{print val+0}' file
0
1.5
1.5
1.5
3
3
4.5
9
$ awk '$1~/^[0-9]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
Or, here you go in ksh (bash has no floating-point arithmetic, so this needs ksh or zsh), assuming the data is on STDIN:
#!/usr/bin/ksh
sum=0
while read a b
do
[[ "$a" == +([0-9]) ]] && (( sum += 1.5 ))
print $sum
done
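Since bash itself cannot do floating-point arithmetic, an equivalent bash loop has to delegate the addition, for example to awk (a sketch, reading STDIN the same way):
#!/usr/bin/env bash
sum=0
while read -r a b; do
    # add 1.5 via awk whenever the first field is all digits
    [[ $a =~ ^[0-9]+$ ]] && sum=$(awk -v s="$sum" 'BEGIN{print s+1.5}')
    echo "$sum"
done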

setting awk variables through inlining

I've got this:
./awktest -v fields=`cat testfile`
which ought to set the fields variable to '1 2 3 4 5', which is all that testfile contains.
It returns:
gawk: ./awktest:9: fatal: cannot open file `2' for reading (No such file or directory)
When I do this, it works fine:
./awktest -v fields='1 2 3 4 5'
printing fields at the time of error yields:
1
printing fields in the second instance yields:
1 2 3 4 5
When I try it with 12345 instead of 1 2 3 4 5, it works fine in both cases, so it's a problem with the whitespace. What is the problem, and how do I fix it?
This is most likely not an awk question: your shell is the likely culprit.
For example, if awktest is:
#!/bin/bash
i=1
for arg in "$@"; do
printf "%d\t%s\n" $i "$arg"
((i++))
done
Then you get:
$ ./awktest -v fields=`cat testfile`
1 -v
2 fields=1
3 2
4 3
5 4
6 5
You see that the file contents are not being handled as a single word.
Simple solution: use double quotes on the command line:
$ ./awktest -v fields="$(< testfile)"
1 -v
2 fields=1 2 3 4 5
The $(< file) construct is a bash shortcut for `cat file` that does not need to spawn an external process.
Or, read the first line of the file in the awk BEGIN block
awk '
BEGIN {getline fields < "testfile"}
rest of awk program ...
'
./awktest -v fields="`cat testfile`"
#note that:
#./awktest -v fields='`cat testfile`'
#does not work
