How to replace a column in a .fam file with a column from a .txt file in unix - bash

I am looking for a way in Unix (maybe awk or sed) to replace the last column in my .fam file with the last column (V8) of a .txt file, something similar to the merge function in R.
My .fam file looks like this
20481 20481 0 0 2 -9
20483 20483 0 0 1 1
20488 20488 0 0 2 1
20492 20492 0 0 1 1
and my .txt file looks like this.
V1 V2 V3 V4 V6 V7_Pheno V8
2253792 20481 NA DNA 1 Yes 2
2253802 20483 NA DNA 4 Yes 2
2253816 20488 NA DNA 0 No 1
2253820 20492 NA DNA 4 Yes 2
My outcome.fam file should look like this:
20481 20481 0 0 2 2
20483 20483 0 0 1 2
20488 20488 0 0 2 1
20492 20492 0 0 1 2

paste merges the lines, and awk lets you select columns, so
paste foo.fam bar.txt | awk '{ print $1 " " $2 " " $3 " " $4 " " $5 " " $13 }'
should do what you want.
If you want to suppress the header line of the .txt file, you can use tail to skip the first line:
tail -n +2 bar.txt
You can then integrate it into your command line (assuming you use bash):
paste foo.fam <(tail -n +2 bar.txt) | awk '{ print $1 " " $2 " " $3 " " $4 " " $5 " " $13 }'

awk can do it alone.
$ awk 'BEGIN{ getline < "f.txt" }
{ gsub("[^ ]+$",""); l=$0; getline < "f.txt"; print l$7; }' f.fam
20481 20481 0 0 2 2
20483 20483 0 0 1 2
20488 20488 0 0 2 1
20492 20492 0 0 1 2
The BEGIN block reads and discards the header record of the .txt file.
Then, for each line of the .fam file, the last field is stripped off and the remainder is saved in l.
getline used this way also splits the new record into fields, so print l$7 prints the shortened .fam record followed by the last field from the .txt line.
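If the two files are not guaranteed to be in the same order, a keyed lookup is safer than relying on line position. Here is a sketch of the usual two-file awk idiom, reading the .txt first and keying on its ID column (field 2):

```shell
# Build a map from the ID column (field 2) of f.txt to its last field,
# then overwrite the last field of each f.fam line via that map.
awk 'NR==FNR { pheno[$2] = $NF; next }   # first file: f.txt (header maps "V2"->"V8", harmless)
     $2 in pheno { $NF = pheno[$2] }     # second file: f.fam
     { print }' f.txt f.fam
```

Because the lookup is by ID, this also works when the files have different orderings or extra rows.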

Count and percentage

Using columns 4 and 2, the code creates a report like the output file shown below. My code works fine, but I believe it can be made shorter :).
I have a doubt about the split part:
CNTLM = split ("20,30,40,60", LMT)
It works, but it would be better to use exactly the values "10,20,30,40" that appear in column 4.
4052538693,2910,04-May-2018-22,10
4052538705,2910,04-May-2018-22,10
4052538717,2910,04-May-2018-22,10
4052538729,2911,04-May-2018-22,20
4052538741,2911,04-May-2018-22,20
4052538753,2912,04-May-2018-22,20
4052538765,2912,04-May-2018-22,20
4052538777,2914,04-May-2018-22,10
4052538789,2914,04-May-2018-22,10
4052538801,2914,04-May-2018-22,30
4052539029,2914,04-May-2018-22,20
4052539041,2914,04-May-2018-22,20
4052539509,2915,04-May-2018-22,30
4052539521,2915,04-May-2018-22,30
4052539665,2915,04-May-2018-22,30
4052539677,2915,04-May-2018-22,10
4052539689,2915,04-May-2018-22,10
4052539701,2916,04-May-2018-22,40
4052539713,2916,04-May-2018-22,40
4052539725,2916,04-May-2018-22,40
4052539737,2916,04-May-2018-22,40
4052539749,2916,04-May-2018-22,40
4052539761,2917,04-May-2018-22,10
4052539773,2917,04-May-2018-22,10
Here is the code I use to get the desired output.
printf " Code 10 20 30 40 Total\n" > header
dd=`cat header | wc -L`
awk -F"," '
BEGIN {CNTLM = split ("20,30,40,60", LMT)
cmdsort = "sort -nr"
DASHES = sprintf ("%0*d", '$dd', _)
gsub (/0/, "-", DASHES)
}
{for (IX=1; IX<=CNTLM; IX++) if ($4 <= LMT[IX]) break
CNT[$2,IX]++
COLTOT[IX]++
LNC[$2]++
TOT++
}
END {
print DASHES
for (l in LNC)
{printf "%5d", l | cmdsort
for (IX=1; IX<=CNTLM; IX++) {printf "%9d", CNT[l,IX]+0 | cmdsort
}
printf " = %6d" RS, LNC[l] | cmdsort
}
close (cmdsort)
print DASHES
printf "Total"
for (IX=1; IX<=CNTLM; IX++) printf "%9d", COLTOT[IX]+0
printf " = %6d" RS, TOT
print DASHES
printf "PCT "
for (IX=1; IX<=CNTLM; IX++) printf "%9.1f", COLTOT[IX]/TOT*100
printf RS
print DASHES
}
' file
The output file I got:
Code 10 20 30 40 Total
----------------------------------------------------
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
----------------------------------------------------
Total 9 6 4 5 = 24
----------------------------------------------------
PCT 37.5 25.0 16.7 20.8
----------------------------------------------------
I'd appreciate it if the code can be improved.
without the header and cosmetics...
$ awk -F, '{a[$2,$4]++; k1[$2]; k2[$4]}
END{for(r in k1)
{printf "%5s", r;
for(c in k2) {k1[r]+=a[r,c]; k2[c]+=a[r,c]; printf "%10d", a[r,c]+0}
printf " =%7d\n", k1[r]};
printf "%5s", "Total";
for(c in k2) {sum+=k2[c]; printf "%10d", k2[c]}
printf " =%7d", sum}' file | sort -nr
2917 2 0 0 0 = 2
2916 0 0 0 5 = 5
2915 2 0 3 0 = 5
2914 2 2 1 0 = 5
2912 0 2 0 0 = 2
2911 0 2 0 0 = 2
2910 3 0 0 0 = 3
Total 9 6 4 5 = 24
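One caveat with the compact version: for (c in k2) visits the columns in an unspecified order, so the 10/20/30/40 columns are not guaranteed to come out in ascending order. A sketch that pins the column order by splitting a fixed bucket list (the bucket values are taken from the sample data, and the Total rows are omitted):

```shell
# Count per (code, bucket), then print the buckets in a fixed order.
awk -F, '
BEGIN { n = split("10,20,30,40", bucket, ",") }
{ a[$2, $4]++; rows[$2]++ }
END {
    for (r in rows) {
        line = sprintf("%5s", r)
        for (i = 1; i <= n; i++)
            line = line sprintf("%10d", a[r, bucket[i]] + 0)
        print line sprintf(" =%7d", rows[r])
    }
}' file | sort -nr
```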

How to find sum of elements in column inside of a text file (Bash)

I have a log file with lots of unnecessary information. The only important part of that file is a table which describes some statistics. My goal is to have a script which accepts a column name as an argument and returns the sum of all the elements in the specified column.
Example log file:
.........
Skipped....
........
WARNING: [AA[409]: Some bad thing happened.
--- TOOL_A: READING COMPLETED. CPU TIME = 0 REAL TIME = 2
--------------------------------------------------------------------------------
----- TOOL_A statistics -----
--------------------------------------------------------------------------------
NAME Attr1 Attr2 Attr3 Attr4 Attr5
--------------------------------------------------------------------------------
AAA 885 0 0 0 0
AAAA2 1 0 2 0 0
AAAA4 0 0 2 0 0
AAAA8 0 0 2 0 0
AAAA16 0 0 2 0 0
AAAA1 0 0 2 0 0
AAAA8 0 0 23 0 0
AAAAAAA4 0 0 18 0 0
AAAA2 0 0 14 0 0
AAAAAA2 0 0 21 0 0
AAAAA4 0 0 23 0 0
AAAAA1 0 0 47 0 0
AAAAAA1 2 0 26 0 0
NOTE: Some notes
......
Skipped ......
The expected usage: script.sh Attr1
Expected output:
888
I've tried to find something with sed/awk but failed to figure out a solution.
tldr;
$ cat myscript.sh
#!/bin/sh
logfile=${1}
attribute=${2}
field=$(grep -o "NAME.\+${attribute}" ${logfile} | wc -w)
sed -nre '/NAME/,/NOTE/{/NAME/d;/NOTE/d;s/\s+/\t/gp;}' ${logfile} | \
cut -f${field} | \
paste -sd+ | \
bc
$ ./myscript.sh mylog.log Attr3
182
Explanation:
assign command-line arguments ${1} and ${2} to the logfile and attribute variables, respectively.
with wc -w, count the number of words in the matched span from NAME up to ${attribute}; that word count is the attribute's field index, assigned to field
with sed
suppress automatic printing (-n) and enable extended regular expressions (-r)
find lines between the NAME and NOTE lines, inclusive
delete the lines that match NAME and NOTE
translate each contiguous run of whitespace to a single tab and print the result
cut using the field index
paste all numbers as an infix summation
evaluate the infix summation via bc
Quick and dirty (without any other spec)
awk -v CountCol=2 '/^[^[:blank:]]/ && NF == 6 { S += $( CountCol) } END{ print S + 0 }' YourFile
with column name
awk -v ColName='Attr1' '/^NAME/ { for(i=1;i<=NF;i++) if ($i == ColName) CountCol = i } /^[^[:blank:]]/ && NF == 6 && CountCol { S += $(CountCol) } END{ print S + 0 }' YourFile
You should add a header/trailer filter to avoid noisy lines (a flag suits this perfectly), but lacking information about the structure needed to set such a flag, I use a simple field count (assuming text fields evaluate to 0, so they don't change the sum when counted).
$ awk -v col='Attr3' '/NAME/{for (i=1;i<=NF;i++) f[$i]=i} col in f{sum+=$(f[col]); if (!NF) {print sum+0; exit} }' file
182

Format output in bash

I have output in bash that I would like to format. Right now my output looks like this:
1scom.net 1
1stservicemortgage.com 1
263.net 1
263.sina.com 1
2sahm.org 1
abac.com 1
abbotsleigh.nsw.edu.au 1
abc.mre.gov.br 1
ableland.freeserve.co.uk 1
academicplanet.com 1
access-k12.org 1
acconnect.com 1
acconnect.com 1
accountingbureau.co.uk 1
acm.org 1
acsalaska.net 1
adam.com.au 1
ada.state.oh.us 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
adelphia.net 1
aecom.yu.edu 1
aecon.com 1
aetna.com 1
agedwards.com 1
ahml.info 1
The problem with this is none of the numbers on the right line up. I would like them to look like this:
1scom.net 1
1stservicemortgage.com 1
263.net 1
263.sina.com 1
2sahm.org 1
Would there be any way to make them look like this without knowing exactly how long the longest domain is? Any help would be greatly appreciated!
The code that outputted this is:
grep -E -o -r "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" $ARCHIVE | sed 's/.*@//' | uniq -ci | sort | sed 's/^ *//g' | awk ' { t = $1; $1 = $2; $2 = t; print; } ' > temp2
ALIGNMENT:
Just use the column command and that's it:
cat /path/to/your/file | column -t
For more details on the column command, refer to http://manpages.ubuntu.com/manpages/natty/man1/column.1.html
EDITED:
View file in terminal:
column -t < /path/to/your/file
(as noted by anishsane)
Export to a file:
column -t < /path/to/your/file > /output/file
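If column is not available, a two-pass awk sketch does the same alignment: the first pass measures the widest first field, and the second pads with a format string built from that width (the file is read twice, hence it is named twice):

```shell
# Pass 1 (NR==FNR): record the width of the longest first field.
# Pass 2: left-justify field 1 in a column of that width.
awk 'NR==FNR { if (length($1) > w) w = length($1); next }
     { printf "%-" w "s %s\n", $1, $2 }' file file
```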

Print only '+' or '-' if string matches (with two conditions)

I would like to add two additional conditions to the code I have: print '+' only if, in File2, field 5 is greater than 35 and field 7 is greater than 90.
Code:
while read -r line
do
grep -q "$line" File2.txt && echo "$line +" || echo "$line -"
done < File1.txt
Input file 1:
HAPS_0001
HAPS_0002
HAPS_0005
HAPS_0006
HAPS_0007
HAPS_0008
HAPS_0009
HAPS_0010
Input file 2 (tab-delimited):
Query DEG_ID E-value Score %Identity %Positive %Matching_Len
HAPS_0001 protein:plasmid:149679 3.00E-67 645 45 59 91
HAPS_0002 protein:plasmid:139928 4.00E-99 924 34 50 85
HAPS_0005 protein:plasmid:134646 3.00E-98 915 38 55 91
HAPS_0006 protein:plasmid:111988 1.00E-32 345 33 54 86
HAPS_0007 - - 0 0 0 0
HAPS_0008 - - 0 0 0 0
HAPS_0009 - - 0 0 0 0
HAPS_0010 - - 0 0 0 0
Desired output (tab-delimited):
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
Thanks!
This should work:
$ awk '
BEGIN {FS = OFS = "\t"}
NR==FNR {if($5>35 && $7>90) a[$1]++; next}
{print (($1 in a) ? $0 FS "+" : $0 FS "-")}' f2 f1
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
join File1.txt <(tail -n +2 File2.txt) | awk '
$2 = ($5 > 35 && $7 > 90) ? "+" : "-" { print $1, $2 }'
You don't care about the second field in the output, so overwrite it with the appropriate sign for the output.
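The trick here is that the assignment $2 = (…) ? "+" : "-" serves as the pattern: its value is always the non-empty string "+" or "-", which awk treats as true, so the { print $1, $2 } action fires for every line after field 2 has already been rewritten. A minimal standalone sketch of the idiom (the sample data is made up):

```shell
# The assignment's value ("+" or "-") is a non-empty, non-numeric string,
# so the pattern is always true and the action prints every rewritten line.
printf 'a 40 x 91\nb 10 x 50\n' |
awk '$2 = ($2 > 35 && $4 > 90) ? "+" : "-" { print $1, $2 }'
# -> a +
#    b -
```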

sum of column in text file using shell script

I have a file like this:
1814 1
2076 2
2076 1
3958 1
2076 2
2498 3
2858 2
2858 1
1818 2
1814 1
2423 1
3588 12
2026 2
2076 1
1814 1
3576 1
2005 2
1814 1
2107 1
2810 1
I would like to generate a report like this:
1814 3
2076 6
3958 1
2858 3
Basically, calculate the total for each unique value in column 1.
Using awk:
awk '{s[$1] += $2} END{ for (x in s) print x, s[x] }' input
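Note that for (x in s) visits the keys in an unspecified order; if you want the report sorted by key, pipe through sort:

```shell
# Sum field 2 per unique field-1 key, then sort numerically by key.
awk '{ s[$1] += $2 } END { for (x in s) print x, s[x] }' input | sort -n
```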
Pure Bash:
declare -a sum
while read key val ; do
((sum[key]+=val))
done < "$infile"
for key in ${!sum[@]}; do
printf "%4d %4d\n" $key ${sum[$key]}
done
The output is sorted:
1814 4
1818 2
2005 2
2026 2
2076 6
2107 1
2423 1
2498 3
2810 1
2858 3
3576 1
3588 12
3958 1
Perl solution:
perl -lane '$s{$F[0]} += $F[1] }{ print "$_ $s{$_}" for keys %s' INPUT
Note that the output is different from the one you gave.
sum totals for each primary key (integers only)
for key in $(cut -d' ' -f1 test.txt | sort -u)
do
echo $key $(echo $(grep "^$key " test.txt | cut -d' ' -f2 | tr \\n +)0 | bc)
done
simply sum a column of integers
echo $(cut -d' ' -f2 test.txt | tr \\n +)0 | bc
