I have a huge file (my_file.txt) with ~ 8,000,000 lines that looks like this:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs374183434 0 NA -2.22383195384362
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
I want to find the duplicates based on the first three columns and then remove the line with the lower value in the 7th column. The first part I can accomplish with:
awk -F"\t" '!seen[$2, $3]++' my_file.txt
But I don't know how to do the part about removing the duplicate with the lower value. The desired output would be this one:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
Speed is an issue, so I could use awk, sed or another bash command.
Thanks
$ awk '(i=$1 FS $2 FS $3) && !(i in seventh) || seventh[i] < $7 {seventh[i]=$7; all[i]=$0} END {for(i in all) print all[i]}' my_file.txt
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13110 13110 rs540538026 0 NA -1.33177622457982
Thanks to @fedorqui for the advanced indexing. :D
Explained:
(i=$1 FS $2 FS $3) && !(i in seventh) || $7 > seventh[i] { # set index to first 3 fields
# AND if index not yet stored in array
# OR the seventh field is greater than the previous value of the seventh field by the same index:
seventh[i]=$7 # new biggest value
all[i]=$0 # store that record
}
END {
for(i in all) # for all stored records of the biggest seventh value
print all[i] # print them
}
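Note that for (i in all) does not guarantee any particular output order. If you need the surviving lines in their original order, one option (a sketch of the same idea) is to remember the line number of each key's first appearance, print it in front of every kept record, then sort on it and strip it off:
awk '{ i = $1 FS $2 FS $3 }
     !(i in seventh) || $7 > seventh[i] {
         seventh[i] = $7; all[i] = $0
         if (!(i in first)) first[i] = NR      # remember the line number of the first appearance
     }
     END { for (i in all) print first[i], all[i] }' my_file.txt |
sort -k1,1n | cut -d' ' -f2-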
I have many TSV files in a directory, each with only three columns. I want to merge all of them based on the value of the first column (the columns have headers that I need to maintain); if the value is present, the values of the corresponding second and third columns must be added, and if the value is missing in any file, NA should be added instead, and so on (see example). Files might have different numbers of lines and are not ordered by the first column, although that can easily be fixed with sort.
I have tried join, but that works nicely for only two files. Can join be expanded to handle all files in a directory? Here is an example with just three files:
S01.tsv
Accesion Val S01
AJ863320 1 0.2
AM930424 1 0.3
AY664038 2 0.5
S02.tsv
Accesion Val S02
AJ863320 2 0.8
AM930424 1 0.25
EU236327 1 0.14
EU434346 2 0.2
S03.tsv
Accesion Val S03
AJ863320 5 0.2
EU236327 1 0.5
EU434346 2 0.3
Outfile should be:
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
AM930424 1 0.3 0.25 NA
AY664038 2 0.5 NA NA
EU236327 1 NA 0.14 0.5
EU434346 2 NA 0.2 0.3
OK, I've tried with awk by taking help from here, but was not successful:
BEGIN { OFS="\t" } # tab separated columns
FNR==1 { f++ } # counter of files
{
a[0][$1]=$1 # reset the key for every record
for(i=2;i<=NF;i++) # for each non-key element
a[f][$1]=a[f][$1] $i ( i==NF?"":OFS ) # combine them to array element
}
END { # in the end
for(i in a[0]) # go thru every key
for(j=0;j<=f;j++) # and all related array elements
printf "%s%s", a[j][i], (j==f?ORS:OFS)
} # output them, nonexistent will output empty
I would harness GNU AWK for this task in the following way. Let S01.tsv content be
Accesion Val S01
AJ863320 1 0.2
AM930424 1 0.3
AY664038 2 0.5
and S02.tsv content be
Accesion Val S02
AJ863320 2 0.8
AM930424 1 0.25
EU236327 1 0.14
EU434346 2 0.2
and S03.tsv content be
Accesion Val S03
AJ863320 5 0.2
EU236327 1 0.5
EU434346 2 0.3
then
awk 'BEGIN { OFS="\t" }
NR==1 { title = $1 OFS $2 }
{ arr[$1 OFS $2][FILENAME] = $3 }
END {
    print title, arr[title]["S01.tsv"], arr[title]["S02.tsv"], arr[title]["S03.tsv"]
    delete arr[title]
    for (i in arr) {
        print i, "S01.tsv" in arr[i] ? arr[i]["S01.tsv"] : "NA", "S02.tsv" in arr[i] ? arr[i]["S02.tsv"] : "NA", "S03.tsv" in arr[i] ? arr[i]["S03.tsv"] : "NA"
    }
}' S01.tsv S02.tsv S03.tsv
gives output
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
EU236327 1 NA 0.14 0.5
AM930424 1 0.3 0.25 NA
EU434346 2 NA 0.2 0.3
AY664038 2 0.5 NA NA
Explanation: I am storing the data in the 2D array arr, keyed by the values of the 1st and 2nd columns concatenated with the output field separator (first dimension) and by the filename (second dimension). The values stored in the array are the values from the 3rd column. After the data is collected, I start by printing the title (header) row, which I then delete from the array; then I iterate over the first dimension of the array, and for each element I print the key followed by the value from each file, or NA if there was no value. Observe that I use an in check rather than testing the truthiness of the value itself, as the latter would turn 0 values into NAs. Disclaimer: this solution assumes you accept any order of output rows beyond the header; if that does not hold, do not use this solution.
(tested in GNU Awk 5.0.1)
Using GNU awk for arrays of arrays and sorted_in:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
FNR == 1 {
    if ( NR == 1 ) {
        numCols = split($0,hdrs)
    }
    else {
        hdrs[++numCols] = $3
    }
    next
}
{
    accsValsCols2ss[$1][$2][numCols] = $3
}
END {
    for ( colNr=1; colNr<=numCols; colNr++ ) {
        printf "%s%s", hdrs[colNr], (colNr<numCols ? OFS : ORS)
    }
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for ( acc in accsValsCols2ss ) {
        PROCINFO["sorted_in"] = "#ind_num_asc"
        for ( val in accsValsCols2ss[acc] ) {
            printf "%s%s%s", acc, OFS, val
            for ( colNr=3; colNr<=numCols; colNr++ ) {
                s = ( colNr in accsValsCols2ss[acc][val] ? accsValsCols2ss[acc][val][colNr] : "NA" )
                printf "%s%s", OFS, s
            }
            print ""
        }
    }
}
$ awk -f tst.awk S01.tsv S02.tsv S03.tsv
Accesion Val S01 S02 S03
AJ863320 1 0.2 NA NA
AJ863320 2 NA 0.8 NA
AJ863320 5 NA NA 0.2
AM930424 1 0.3 0.25 NA
AY664038 2 0.5 NA NA
EU236327 1 NA 0.14 0.5
EU434346 2 NA 0.2 0.3
I am looking for some options in Unix (maybe awk or sed) through which I can replace the last column in my .fam file with the last column (V8) of a .txt file. Something similar to the merge function in R.
My .fam file looks like this
20481 20481 0 0 2 -9
20483 20483 0 0 1 1
20488 20488 0 0 2 1
20492 20492 0 0 1 1
and my .txt file looks like this.
V1 V2 V3 V4 V6 V7_Pheno V8
2253792 20481 NA DNA 1 Yes 2
2253802 20483 NA DNA 4 Yes 2
2253816 20488 NA DNA 0 No 1
2253820 20492 NA DNA 4 Yes 2
My outcome.fam file should look like this:
20481 20481 0 0 2 2
20483 20483 0 0 1 2
20488 20488 0 0 2 1
20492 20492 0 0 1 2
paste merges the lines, and awk allows you to select columns, so
paste foo.fam bar.txt | awk '{ print $1 " " $2 " " $3 " " $4 " " $5 " " $13 }'
should do what you want.
If you want to suppress the header line of the .txt file, you can call tail to skip the first line:
tail -n +2 bar.txt
You can then integrate it into your command line (assuming you use bash):
paste foo.fam <(tail -n +2 bar.txt) | awk '{ print $1 " " $2 " " $3 " " $4 " " $5 " " $13 }'
awk can do it alone.
$: awk 'BEGIN{ getline < "f.txt" }
{ gsub("[^ ]+$",""); l=$0; getline < "f.txt"; print l$7; }' f.fam
20481 20481 0 0 2 2
20483 20483 0 0 1 2
20488 20488 0 0 2 1
20492 20492 0 0 1 2
The BEGIN reads the header record from the .txt.
Then, for each line of the .fam, strip off the last field and save the rest to l.
getline used this way also splits the new line into fields, so print l$7 prints the shortened record from the .fam followed by the last field from the .txt.
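Note that the getline approach assumes both files list the samples in the same order. If that is not guaranteed, looking the value up by ID is safer; a rough sketch (assuming the ID is column 2 of the .txt and column 1 of the .fam):
awk 'NR==FNR { if (FNR > 1) pheno[$2] = $NF; next }   # read f.txt: map ID (V2) -> last column (V8)
     $1 in pheno { $NF = pheno[$1] }                   # replace the last .fam field when the ID is known
     1' f.txt f.fam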
I could do this easily in R with grepl and row indexing, but wanted to try it in the shell. I have a text file that looks like what I have below. I would like to find the rows where the first column matches TWGX, and wherever it matches, concatenate column 1 and column 2 separated by _ and make that the value of both column 1 and column 2.
text:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP 10064-8036056040 0 0 0 -9
TWGX-MAP 11570-8036056502 0 0 0 -9
TWGX-MAP 11680-8036055912 0 0 0 -9
This is the result I want:
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9
The regex /TWGX/ selects the lines containing that string and applies the action that follows. The 1 is an awk shorthand that will print both the modified and unmodified lines.
$ awk 'BEGIN{FS=OFS="\t"} /TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}1' file
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 2 1
NIALOAD NIALOAD 0 0 1 1
TWGX-MAP_10064-8036056040 TWGX-MAP_10064-8036056040 0 0 0 -9
TWGX-MAP_11570-8036056502 TWGX-MAP_11570-8036056502 0 0 0 -9
TWGX-MAP_11680-8036055912 TWGX-MAP_11680-8036055912 0 0 0 -9
BEGIN { FS = OFS = "\t" }
# Just once, before processing the file, set FS (field separator) and OFS (output field separator) to be the tab character
/TWGX/ {tmp = $1 "_" $2; $1 = $2 = tmp}
# For every line that contains a match for TWGX create a mashup of the first two columns, and assign it to each of columns 1 and 2. (Note that in awk string concatenation is done by simply putting expressions next to one another)
1
# This is an awk idiom that consists of the pattern 1, which is always true. By not explicitly specifying an action to go with that pattern, the default action of printing the whole line will be executed.
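If your file is actually space-separated rather than tab-separated (the sample above looks that way), a sketch of the same idea that relies on awk's default field splitting and anchors the match to the first column:
awk '$1 ~ /^TWGX/ { tmp = $1 "_" $2; $1 = $2 = tmp } 1' file
Matching on $1 also avoids touching lines that merely contain TWGX somewhere else.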
I am trying to merge multiple files using the keys from a main file.
My main file is like this
cat files.txt
which has the keys I want to compare against:
1
2
3
4
5
6
7
8
9
10
11
The other input files are like this:
cat f1.txt
1 : 20
3 : 40
5 : 40
7 : 203
cat f2.txt
3 : 45
4 : 56
9 : 23
I want output like this:
f1 f2 ....
1 20 NA
2 NA NA
3 40 45
4 56 NA
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 23 NA
10 NA NA
11 NA NA
I tried this, but I am not able to print the non-matching keys:
awk -F':' 'NF>1{a[$1] = a[$1]$2}END{for(i in a){print i""a[i]}}' files.txt *.txt
1 20
3 40 45
4 56
5 40
7 203
9 23
Can someone please guide me on what is missing here?
Complex GNU awk solution (will cover any number of files, considering system resources):
awk 'BEGIN{
    PROCINFO["sorted_in"]="#ind_num_asc"; h=" ";
    for(i=2;i<ARGC;i++) h=(i==2)? h ARGV[i]: h OFS ARGV[i]; print h
}
NR==FNR{ a[$1]; next }
{ b[ARGIND][$1]=$3 }
END{
    for(i in a) {
        printf("%d",i);
        for(j in b) printf("%s%s",OFS,(i in b[j])? b[j][i] : "NA"); print ""
    }
}' files.txt *.txt
Example output:
f1 f2
1 20 NA
2 NA NA
3 40 45
4 NA 56
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 NA 23
10 NA NA
11 NA NA
PROCINFO["sorted_in"]="#ind_num_asc" - sorting mode (numerically in ascending order)
for(i=2;i<ARGC;i++) h=(i==2)? h ARGV[i]: h OFS ARGV[i] - iterating through the script arguments, collecting the filenames.
ARGC and ARGV make the command-line arguments available to your program
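For instance, you can see what ARGC and ARGV hold for an invocation like the one above with a quick check (ARGV[0] is the program name, which is why the header loop starts at i=2); the output is along these lines:
awk 'BEGIN{ for(i=0;i<ARGC;i++) print i, ARGV[i] }' files.txt f1.txt f2.txt
0 awk
1 files.txt
2 f1.txt
3 f2.txt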
$ cat awk-file
NR==FNR{
    l=NR
    next
}
NR==FNR+l{
    split(FILENAME,f1,".")
    a[$1]=$3
    next
}
NR==FNR+l+length(a){
    split(FILENAME,f2,".")
    b[$1]=$3
    next
}
END{
    print "",f1[1],f2[1]
    for(i=1;i<=l;i++){
        print i,(a[i]!="")?a[i]:"NR",(b[i]!="")?b[i]:"NR"
    }
}
$ awk -v OFS='\t' -f awk-file files.txt f1.txt f2.txt
f1 f2
1 20 NR
2 NR NR
3 40 45
4 NR 56
5 40 NR
6 NR NR
7 203 NR
8 NR NR
9 NR 23
10 NR NR
11 NR NR
I have modified the answer for your further question.
If you have a 3rd, 4th, ... nth file, add n new blocks as follows,
NR==FNR+l+length(a)+...+length(n){
split(FILENAME,fn,".")
n[$1]=$3
}
And in your END block,
END{
print "",f1[1],f2[1],...,fn[1]
for(i=1;i<=l;i++){
print i,(a[i]!="")?a[i]:"NR",(b[i]!="")?b[i]:"NR",...,(n[i]!="")?n[i]:"NR"
}
}
$ cat tst.awk
ARGIND < (ARGC-1) { map[ARGIND,$1] = $NF; next }
FNR==1 {
    printf "%-2s", ""
    for (fileNr=1; fileNr<ARGIND; fileNr++) {
        fileName = ARGV[fileNr]
        sub(/\.txt$/,"",fileName)
        printf "%s%s", OFS, fileName
    }
    print ""
}
{
    printf "%-2s", $1
    for (fileNr=1; fileNr<ARGIND; fileNr++) {
        printf "%s%s", OFS, ((fileNr,$1) in map ? map[fileNr,$1] : "NA")
    }
    print ""
}
$ awk -f tst.awk f1.txt f2.txt files.txt
f1 f2
1 20 NA
2 NA NA
3 40 45
4 NA 56
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 NA 23
10 NA NA
11 NA NA
The above uses GNU awk for ARGIND, with other awks just add a line FNR==1{ARGIND++} at the start of the script.
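For example, the portable version would begin like this (a sketch; the remaining rules of tst.awk stay exactly as above):
FNR==1 { ARGIND++ }                              # emulate GNU awk's ARGIND: bump it at the first line of each input file
ARGIND < (ARGC-1) { map[ARGIND,$1] = $NF; next }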
Using awk and sort -n for sorting the output:
$ awk -F" *: *" '
NR==FNR {
a[$1]; next }
FNR==1 {
for(i in a)
a[i]=a[i] " NA"
h=h OFS FILENAME
}
{
match(a[$1]," NA")
a[$1]=substr(a[$1],1,RSTART-1) OFS $2 substr(a[$1],RSTART+RLENGTH)
}
END {
print h
for(i in a)
print i a[i]
}' files f1 f2 |sort -n
f1 f2
1 20 NA
2 NA NA
3 40 45
4 56 NA
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 23 NA
10 NA NA
11 NA NA
Pitfalls: 1. sort will fail on the header in certain situations. 2. Since the first NA is replaced with the value of $2, your data can't contain strings starting with NA. That could probably be circumvented by matching / NA( |$)/ instead, but it would require more checking in the code, so choose your NA placeholder carefully. :D
Edit:
Running it with, for example, four files:
$ awk '...' files f1 f2 f1 f2 | sort -n
1 20 20 NA NA
2 NA NA NA NA
3 40 45 40 45
4 56 56 NA NA
5 40 40 NA NA
6 NA NA NA NA
7 203 203 NA NA
8 NA NA NA NA
9 23 23 NA NA
10 NA NA NA NA
11 NA NA NA NA
Please use the script below.
FILESPATH is the directory containing your input files (f1.txt, f2.txt, ...).
INPUT is the main input file (files.txt).
script.sh
FILESPATH=/home/ubuntu/work/test/
INPUT=/home/ubuntu/work/files.txt
i=0
while read line
do
    FILES[ $i ]="$line"
    (( i++ ))
done < <(ls $FILESPATH/*.txt)
for file in "${FILES[@]}"
do
    echo -n " ${file##*/}"
done
echo ""
while IFS= read -r var
do
    echo -n "$var "
    for file in "${FILES[@]}"
    do
        VALUE=`grep "$var " $file | cut -d ' ' -f3`
        if [ ! -z "$VALUE" ]; then
            echo -n "$VALUE "
        else
            echo -n "NA "
        fi
    done
    echo ""
done < "$INPUT"
You can use printf instead of echo to get better formatting of the output.
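For example (a sketch; the column width of 8 is an arbitrary choice), this also folds the empty-value check into the parameter expansion:
printf '%-8s' "${VALUE:-NA}"    # left-align each value in a fixed-width column, printing NA when VALUE is empty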
This can be done via a simple loop and echo statements.
#!/bin/bash
NA=" NA"
i=0
#print header module start
header[i]=" "
for file in `ls f[0-9].txt`;
do
    first_part=`echo $file|cut -d. -f1`
    i=$((i+1))
    header[i]=$first_part
done
echo ${header[@]}
#print header module end
#print elements start
for element in `cat files.txt`;
do
    var=$element
    for file in `ls f[0-9].txt`;
    do
        var1=`grep -w ${element} $file`
        if [[ ! -z $var1 ]] ; then
            field2=`echo $var1|cut -d":" -f2`
            var="$var$field2"
        else
            var="$var$NA"
        fi
    done
    echo $var
done
#print elements end
I have a log file with lots of unnecessary information. The only important part of that file is a table which describes some statistics. My goal is to have a script which will accept a column name as argument and return the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0
NOTE: Some notes
......
Skipped ......
The expected usage: script.sh Attr1
Expected output:
888
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" ${logfile} | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' ${logfile} | \
cut -f${field} | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
assign command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively.
with wc -w, count the quantity of words within the line that contains both NAME and ${attribute} (the field index) and assign it to field
with sed:
  suppress automatic printing (-n) and enable extended regular expressions (-r)
  find lines between the NAME and NOTE lines, inclusive
  delete the lines that match NAME and NOTE
  translate each contiguous run of whitespace to a single tab and print the result
cut using the field index
paste all numbers as an infix summation
evaluate the infix summation via bc
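As a quick illustration of the paste/bc idiom at the end of the pipeline (the numbers are the Attr1 column of the sample log):
$ printf '885\n1\n2\n' | paste -sd+ | bc
888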
Quick and dirty (without any other spec)
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $( CountCol) } END{ print S + 0 }' YourFile
With a column name:
awk -v ColName='Attr1' '/^[^[:blank:]]/ && NF == 6 { for(i=1;i<=NF;i++){ if ($i == ColName) CountCol = i } } /^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) } END{ print S + 0 }' YourFile
You should add a header/trailer filter to avoid noisy lines (a flag would suit this perfectly), but for lack of information about the surrounding structure to set such a flag, I use a simple field count (assuming text fields count as 0 and therefore do not change the sum).
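A sketch of such a flag, assuming the table is delimited by the NAME header and the NOTE trailer as in the sample log:
awk -v ColName='Attr1' '
    /^NAME/ { for(i=1;i<=NF;i++) if ($i == ColName) CountCol = i; inTable = 1; next }   # header: locate the column, raise the flag
    /^NOTE/ { inTable = 0 }                                                             # trailer: drop the flag
    inTable && CountCol && !/^-+$/ { S += $(CountCol) }                                 # sum data rows, skip the dashed rulers
    END { print S + 0 }
' YourFile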
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182