get subset of table based on unique column values - bash

I am looking for a bash/awk/sed solution to get subsets of a table based on unique column values. For example, if I have:
chrom1 333
chrom1 343
chrom2 380
chrom2 501
chrom1 342
chrom3 102
I want to be able to split this table into 3:
chrom1 333
chrom1 343
chrom1 342
chrom2 380
chrom2 501
chrom3 102
I know how to do this in R using the split command, but I am specifically looking for a bash/awk/sed solution.
Thanks

I don't know if this awk is of any use, but it will create 3 separate files based on the unique column values:
awk '{print >> $1; close($1)}' file
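As a quick check (a minimal sketch; since the first column is used directly as a filename, this assumes its values are filename-safe):

```shell
# Build a scratch copy of the sample table and split it.
rm -rf /tmp/split_demo && mkdir /tmp/split_demo && cd /tmp/split_demo
printf '%s\n' 'chrom1 333' 'chrom1 343' 'chrom2 380' \
              'chrom2 501' 'chrom1 342' 'chrom3 102' > file
# Append each record to a file named after its first field;
# close() keeps the number of open file descriptors low.
awk '{print >> $1; close($1)}' file
cat chrom1
# -> chrom1 333
#    chrom1 343
#    chrom1 342
```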

An alternative awk which keeps the original order of records within each block:
$ awk '{a[$1]=a[$1]?a[$1] ORS $0:$0}
END{for(k in a) print a[k] ORS ORS}' file
generates
chrom1 333
chrom1 343
chrom1 342
chrom2 380
chrom2 501
chrom3 102
There are 2 trailing empty lines at the end, but they are not displayed in the formatted output.

Using sort and awk:
sort -k1,1 file | awk 'NR>1 && p != $1{print ORS} {p=$1} 1'
EDIT: If you want to keep the original order of records from the input file, then use:
awk -v ORS='\n\n' '!($1 in a){a[$1]=$0; ind[++i]=$1; next}
{a[$1]=a[$1] RS $0}
END{for(k=1; k<=i; k++) print a[ind[k]]}' file
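With a reshuffled sample where chrom2 appears first, this version emits the groups in first-seen order (ind[] records each key the first time it is seen):

```shell
printf '%s\n' 'chrom2 380' 'chrom1 333' 'chrom2 501' 'chrom1 343' > /tmp/reordered.txt
# Groups come out in the order their keys first appear: chrom2, then chrom1.
awk -v ORS='\n\n' '!($1 in a){a[$1]=$0; ind[++i]=$1; next}
{a[$1]=a[$1] RS $0}
END{for(k=1; k<=i; k++) print a[ind[k]]}' /tmp/reordered.txt
# -> chrom2 380
#    chrom2 501
#
#    chrom1 333
#    chrom1 343
```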

Create the input list file.txt:
(
cat << EOF
chrom1 333
chrom1 343
chrom2 380
chrom2 501
chrom1 342
chrom3 102
EOF
) > file.txt
Transformation:
cat file.txt | cut -d" " -f1 | sort -u | while read c
do
cat file.txt | grep "^$c" | sort
echo
done
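Run against the sample table, the loop prints each group sorted, followed by a blank line. Two hedged tweaks in this sketch: read -r avoids backslash mangling, and anchoring the grep with a trailing space stops chrom1 from also matching a hypothetical chrom10. Note that the file is re-scanned once per group, so this is simple but quadratic in the number of groups.

```shell
printf '%s\n' 'chrom1 333' 'chrom1 343' 'chrom2 380' \
              'chrom2 501' 'chrom1 342' 'chrom3 102' > /tmp/file.txt
# For each unique key, pull out and sort its records.
cut -d" " -f1 /tmp/file.txt | sort -u | while read -r c
do
    grep "^$c " /tmp/file.txt | sort
    echo
done
```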

Related

How to combine more than two files into one new file with a specific name using bash

I have many files.
List of file names:
p004c01.txt
p004c05.txt
p006c01.txt
p006c02.txt
p007c01.txt
p007c03.txt
p007c04.txt
...
$cat p004c01.txt
#header
122.5 -0.256 547
123.6 NaN 325
$cat p004c05.txt
#header
122.1 2.054 247
122.2 -1.112 105
$cat p006c01.txt
#header
99 -0.200 333
121.4 -1.206 243
$cat p006c02.txt
#header
122.5 2.200 987
99 -1.335 556
I want the files to be like this:
file1
$cat p004.txt
122 -0.256 547
122 2.054 247
122 -1.112 105
file2
$cat p006.txt
122.5 2.200 987
121.4 -1.206 243
99 -1.335 556
99 -0.200 333
And the other files too.
Files that contain the same value (?) in
p????cxx.txt
go into the same new file.
I tried one by one file like this
cat p004* | sed '/#/d'| sort -k 1n | sed '/NaN/d' |awk '{print substr($1,2,3),$2,$3,$4,$5}' > p004.txt
Can anyone help me with a simple script for all the data?
Thank you :)
Perhaps this will work for you:
for f in {001..999}; do tail -q -n +2 p"$f"c* > p"$f".txt; done 2>/dev/null
(The -q keeps GNU tail from inserting "==> file <==" headers when it reads several files; without it those headers would end up inside the merged output.)
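A small sanity check on two of the sample files (this assumes GNU tail; -q suppresses the "==> file <==" banners tail prints when given several files, which would otherwise land inside p004.txt):

```shell
rm -rf /tmp/merge_demo && mkdir /tmp/merge_demo && cd /tmp/merge_demo
printf '#header\n122.5 -0.256 547\n123.6 NaN 325\n'   > p004c01.txt
printf '#header\n122.1 2.054 247\n122.2 -1.112 105\n' > p004c05.txt
# tail -n +2 drops the #header line of every matching file.
for f in 004; do tail -q -n +2 p"$f"c* > p"$f".txt; done
cat p004.txt
# -> 122.5 -0.256 547
#    123.6 NaN 325
#    122.1 2.054 247
#    122.2 -1.112 105
```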

A UNIX Command to Find the Name of the Student who has the Second Highest Score

I am new to Unix programming. Could you please help me solve this question?
For example, if the input file has the below content:
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
The output will be
ABC
I tried something like this
sort -k3,3 -rn -t" " | head -n2 | awk '{print $2}'
Using awk
awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}'
Demo:
$cat file.txt
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
$awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}' file.txt
ABC
$
Explanation:
NR>1 --> Skip first record
{arr[$3]=$2} --> Create an associative array with marks as the index and name as the value
END <-- executed after the last record has been read
n=asorti(arr,arr_sorted) <-- Sort array arr on its index values (i.e. marks) and save the result in arr_sorted. n = number of elements in the array
print arr[arr_sorted[n-1]] <-- n-1 points to the second-last value in arr_sorted (i.e. the second-highest mark); print the corresponding name from arr
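Note that asorti is a GNU awk (gawk) extension. A portable sketch for POSIX awk tracks the two highest scores in a single pass instead (assuming scores are distinct and non-negative):

```shell
cat > /tmp/scores.txt << 'EOF'
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
EOF
# Keep the best and second-best score seen so far; print the runner-up's name.
awk 'NR > 1 {
       if ($3 + 0 > max)       { max2 = max; name2 = name1; max = $3 + 0; name1 = $2 }
       else if ($3 + 0 > max2) { max2 = $3 + 0; name2 = $2 }
     }
     END { print name2 }' /tmp/scores.txt
# -> ABC
```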
Your attempt is 90% correct; it only needs a single change:
sort -k3,3 -rn -t" " file.txt | head -n2 | tail -n1 | awk '{print $2}'
head -n2 keeps the two highest-scoring lines, and tail -n1 then picks the second of them. (head -n1 alone would print the name with the highest score, XYZ, not the second highest.)

Select first two columns from tab-delimited text file and substitute with '_' character

I have a sample input file as follows
RF00001 1c2x C 3 118 77.20 1.6e-20 1 119 f29242
RF00001 1ffk 9 1 121 77.40 1.4e-20 1 119 8e2511
RF00001 1jj2 9 1 121 77.40 1.4e-20 1 119 f29242
RF00001 1k73 B 1 121 77.40 1.4e-20 1 119 8484c0
RF00001 1k8a B 1 121 77.40 1.4e-20 1 119 93c090
RF00001 1k9m B 1 121 77.40 1.4e-20 1 119 ebeb30
RF00001 1kc8 B 1 121 77.40 1.4e-20 1 119 bdc000
I need to extract the second and third columns from the text file and substitute the tab with '_'
Desired output file :
1c2x_C
1ffk_9
1jj2_9
1k73_B
1k8a_B
1k9m_B
1kc8_B
I am able to print the two columns by :
awk -F" " '{ print $2,$3 }' input.txt
but I am unable to substitute the tab with '_' using the following command:
awk -F" " '{ print $2,'_',$3 }' input.txt
Could you please try the following.
awk '{print $2"_"$3}' Input_file
2nd solution:
awk 'BEGIN{OFS="_"} {print $2,$3}' Input_file
3rd solution: Adding a sed solution.
sed -E 's/[^ ]* +([^ ]*) +([^ ]*).*/\1_\2/' Input_file
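If the input really is tab-delimited, GNU cut can also do this without awk (--output-delimiter is a GNU extension; a trimmed two-line, three-column sample is used here):

```shell
printf 'RF00001\t1c2x\tC\n'  > /tmp/hits.txt
printf 'RF00001\t1ffk\t9\n' >> /tmp/hits.txt
# Select fields 2 and 3 and join them with an underscore.
cut -f2,3 --output-delimiter='_' /tmp/hits.txt
# -> 1c2x_C
#    1ffk_9
```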

Addition in awk failing

I am using the following code snippet, where I export the shell variables into awk as follows:
half_buffer1=$((start_buffer/2))
half_buffer2=$((end_buffer/2))
echo $line | awk -v left="$half_buffer1" -v right="$half_buffer2" 'BEGIN {print $1"\t"$2-left"\t"$3+right"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}'
However, for the variable 'right', awk at times appears to subtract from $3 instead of adding 'right' to it.
Observe that the following provides the "wrong" answers:
$ echo 1 2 3 4 5 | awk -v left=10 -v right=20 'BEGIN {print $1"\t"$2-left"\t"$3+right"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}'
-10 20
To get the right answers, remove BEGIN:
$ echo 1 2 3 4 5 | awk -v left=10 -v right=20 '{print $1"\t"$2-left"\t"$3+right"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}'
1 -8 23 4 5
The problem is that the BEGIN block is executed before any input is read. Consequently, the variables $1, $2, etc., do not yet have useful values.
If BEGIN is removed, the code is executed on each line read. This gives you the answers that you want.
Examples
Using real input lines from the comments:
$ echo ID1 14389398 14389507 109 + ABC 608 831 | awk -v left=10 -v right=20 '{print $1"\t"$2-left"\t"$3+right"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}'
ID1 14389388 14389527 109 + ABC 608 831
$ echo ID1 14390340 14390409 69 + ABC 831 32 – | awk -v left=10 -v right=20 '{print $1"\t"$2-left"\t"$3+right"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}'
ID1 14390330 14390429 69 + ABC 831 32
Also, this shell script:
start_buffer=10
end_buffer=100
half_buffer1=$((start_buffer/2))
half_buffer2=$((end_buffer/2))
echo ID1 14390340 14390409 69 + ABC 831 32 – | awk -v left="$half_buffer1" -v right="$half_buffer2" '{print $1"\t"$2-left"\t"$3+right"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}'
produces this output:
ID1 14390335 14390459 69 + ABC 831 32

retrieve and add two numbers of files

In my file I have the following structure:
A | 12 | 10
B | 90 | 112
C | 54 | 34
I have to add column 2 and column 3 and print the result with column 1.
Output:
A | 22
B | 202
C | 88
I can retrieve the two columns but don't know how to add them.
What I did is:
cut -d ' | ' -f3,5 myfile.txt
How do I add those columns and display the result?
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
echo $f1 "|" $((f2+f3))
done < file
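A quick run of the loop above (echoing $f1 unquoted relies on word splitting to trim the trailing space that IFS="|" leaves on the field; $((f2+f3)) ignores the surrounding spaces on its own):

```shell
printf '%s\n' 'A | 12 | 10' 'B | 90 | 112' 'C | 54 | 34' > /tmp/nums.txt
# Split each line on "|" and print the key with the sum of the two numbers.
while IFS="|" read -r f1 f2 f3
do
    echo $f1 "|" $((f2+f3))
done < /tmp/nums.txt
# -> A | 22
#    B | 202
#    C | 88
```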
You can do this easily with awk:
awk '{print $1, " | ", ($3+$5)}' myfile.txt
will work, perhaps.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS and OFS differ here because of the spacing in the desired output
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.
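Putting the explanation together in a runnable form (FS="|" leaves the trailing space on $1, e.g. "A ", which is why OFS is "| " rather than " | "):

```shell
printf '%s\n' 'A | 12 | 10' 'B | 90 | 112' 'C | 54 | 34' > /tmp/nums2.txt
# $1 is "A " (trailing space kept), OFS supplies the "| ", and $2+$3
# coerces the padded fields " 12 " and " 10" to numbers before adding.
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' /tmp/nums2.txt
# -> A | 22
#    B | 202
#    C | 88
```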
