How to extract a string following a pattern using Unix (macOS)
I have a file with the following tab-separated columns.
Jun-AP1(bZIP)/K562-cJun-ChIP-Seq(GSE31477)/Homer 12.88% 4926.5 9.08%
Maz(Zf)/HepG2-Maz-ChIP-Seq(GSE31477)/Homer 52.08% 25510.3 47.00%
Bach2(bZIP)/OCILy7-Bach2-ChIP-Seq(GSE44420)/Homer 10.81% 4377 8.06%
Atf3(bZIP)/GBM-ATF3-ChIP-Seq(GSE33912)/Homer 28.73% 13346.9 24.59%
TEAD4(TEA)/Tropoblast-Tead4-ChIP-Seq(GSE37350)/Homer 40.43% 19549.3 36.01%
In the first column, I want to extract the string up to the first bracket and keep the rest of the columns the same.
For instance, I need the output as shown below.
Jun-AP1 12.88% 4926.5 9.08%
Maz 52.08% 25510.3 47.00%
Bach2 10.81% 4377 8.06%
Atf3 28.73% 13346.9 24.59%
TEAD4 40.43% 19549.3 36.01%
Thank you.
I would start with
sed 's/([^ ]*//'
where the character inside [^ ] is an actual (literal) tab character.
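If typing a literal tab is awkward, bash and zsh can spell it with $'...' quoting; this is a variant of the same command under that assumption:

sed $'s/([^\t]*//' file

The shell turns \t into a real tab before sed sees the script, so the expression still means "delete from the first opening bracket up to the tab".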
Another option is awk:

awk '{sub(/\(.*Homer/,"")}{print $1,$2,$3,$4}' file
Jun-AP1 12.88% 4926.5 9.08%
Maz 52.08% 25510.3 47.00%
Bach2 10.81% 4377 8.06%
Atf3 28.73% 13346.9 24.59%
TEAD4 40.43% 19549.3 36.01%
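If the output should stay tab-separated like the input instead of collapsing to single spaces, a small awk variant along these lines is an option (a sketch, not part of the original answers):

awk 'BEGIN{FS=OFS="\t"} {sub(/\(.*/, "", $1); print}' file

Here sub(/\(.*/, "", $1) trims field 1 from its first opening bracket onward, and setting OFS to a tab keeps the remaining columns tab-delimited.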
Related
Unix/bash/Shell: How to Find Files from a List and Merge Them into One File
I would like to merge specific files (XXXXXXX_Abstract_TOC.txt, XXXXXXX_Chapter1.txt, XXXXXXX_Chapter2.txt, XXXXXXX_Chapter3.txt, XXXXXXX_Chapter4.txt, XXXXXXX_Conclusion.txt) into one file, based on specific numbers that come from a text file (/util_files/list_NRPs.txt). Note: each X is a [0-9] digit.

list_NRPs.txt contains:

0030001
0030002
0030004
...

The /All_Files folder contains files as follows:

0030001_Abstract_TOC.txt 0030001_Chapter1.txt 0030001_Chapter2.txt 0030001_Chapter3.txt 0030001_Chapter4.txt 0030001_Conclusion.txt
0030002_Abstract_TOC.txt 0030002_Chapter1.txt 0030002_Chapter2.txt 0030002_Chapter3.txt 0030002_Chapter4.txt 0030002_Conclusion.txt
0030004_Abstract_TOC.txt 0030004_Chapter1.txt 0030004_Chapter2.txt 0030004_Chapter3.txt 0030004_Chapter4.txt 0030004_Conclusion.txt
...

For each XXXXXXX from list_NRPs.txt I would like to merge XXXXXXX_Abstract_TOC.txt, XXXXXXX_Chapter1.txt, XXXXXXX_Chapter2.txt, XXXXXXX_Chapter3.txt, XXXXXXX_Chapter4.txt and XXXXXXX_Conclusion.txt into XXXXXXX_All.txt, so that afterwards the /All_Files folder would contain:

0030001_Abstract_TOC.txt 0030001_Chapter1.txt 0030001_Chapter2.txt 0030001_Chapter3.txt 0030001_Chapter4.txt 0030001_Conclusion.txt 0030001_All.txt
0030002_Abstract_TOC.txt 0030002_Chapter1.txt 0030002_Chapter2.txt 0030002_Chapter3.txt 0030002_Chapter4.txt 0030002_Conclusion.txt 0030002_All.txt
0030004_Abstract_TOC.txt 0030004_Chapter1.txt 0030004_Chapter2.txt 0030004_Chapter3.txt 0030004_Chapter4.txt 0030004_Conclusion.txt 0030004_All.txt
...

I would like to start with

cat ../util_files/list_NRPs.txt | xargs

but I do not know how to proceed. How can I do that?
You can use globbing to concatenate the files matching each line of list_NRPs.txt:

while read -r ch; do
    cat "/All_Files/$ch"* > "/All_Files/${ch}_All.txt"
done < /util_files/list_NRPs.txt
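To finish the xargs idea the question starts from, something along these lines should also work; a sketch that assumes the IDs in the list are plain digits, so the -I{} substitution inside sh -c needs no extra quoting:

# one cat invocation per ID read from the list
xargs -I{} sh -c 'cat /All_Files/{}_*.txt > /All_Files/{}_All.txt' < /util_files/list_NRPs.txt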
Append count after each match
Sample input:

>Sample GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample GJVT7LS03DJOYJ
AAACTCC
>Sample GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG

What I want out:

>Sample_1 GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample_2 GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample_3 GJVT7LS03DJOYJ
AAACTCC
>Sample_4 GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample_5 GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG

In other words, I want to append a count (preceded by "_") to each line that matches a pattern ("Sample" in this case). Any sed/awk/etc. one-liners for this task?
One way:

$ awk '/^>/{$1=$1"_"++i}1' file
>Sample_1 GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample_2 GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample_3 GJVT7LS03DJOYJ
AAACTCC
>Sample_4 GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample_5 GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG
One possible attempt is as follows:

$ awk 'BEGIN{a=1}/Sample/ {$1=$1"_"a; a++}1' file
>Sample_1 GJVT7LS03DEUKL
AAACTCCGCAATGCGCGCAAGC
>Sample_2 GJVT7LS03CXJ53
AAACTCCGCAATGCGCGCAAGCGTGACGGGG
>Sample_3 GJVT7LS03DJOYJ
AAACTCC
>Sample_4 GJVT7LS03DMERH
AAACTCCGCAATGCGCGCAAGCGTGACGGGGGGAC
>Sample_5 GJVT7LS03DN2RB
AAACTCCGCAATGCGCGCAAGCGTGACGG

For each line containing "Sample" we update the first field by appending "_" followed by the value of the variable a, which is initially set to 1 and is then incremented by one.
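Since the question invites any sed/awk/etc. one-liner, a Perl take on the same idea might look like this (a sketch, assuming every header starts with ">" followed by a non-space name):

perl -pe 's/^(>\S+)/$1 . "_" . ++$i/e' file

The /e modifier evaluates the replacement as code, so each header line gets "_" plus a running counter appended to its first word, while sequence lines pass through unchanged.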
Alphanumeric sort in Vim
Suppose I have a list in a text file, one entry per line, as follows:

TaskB_115 TaskB_19 TaskB_105 TaskB_13 TaskB_10 TaskB_0_A_1 TaskB_17 TaskB_114 TaskB_110 TaskB_0_A_5 TaskB_16 TaskB_12 TaskB_113 TaskB_15 TaskB_103 TaskB_2 TaskB_18 TaskB_106 TaskB_11 TaskB_14 TaskB_104 TaskB_112 TaskB_107 TaskB_0_A_4 TaskB_102 TaskB_100 TaskB_109 TaskB_101 TaskB_0_A_2 TaskB_0_A_3 TaskB_116 TaskB_1_A_0 TaskB_111 TaskB_108

If I sort in Vim with the command %sort, it gives me this output:

TaskB_0_A_1 TaskB_0_A_2 TaskB_0_A_3 TaskB_0_A_4 TaskB_0_A_5 TaskB_10 TaskB_100 TaskB_101 TaskB_102 TaskB_103 TaskB_104 TaskB_105 TaskB_106 TaskB_107 TaskB_108 TaskB_109 TaskB_11 TaskB_110 TaskB_111 TaskB_112 TaskB_113 TaskB_114 TaskB_115 TaskB_116 TaskB_12 TaskB_13 TaskB_14 TaskB_15 TaskB_16 TaskB_17 TaskB_18 TaskB_19 TaskB_1_A_0 TaskB_2

But I would like to have the output as follows:

TaskB_0_A_1 TaskB_0_A_2 TaskB_0_A_3 TaskB_0_A_4 TaskB_0_A_5 TaskB_1_A_0 TaskB_2 TaskB_10 TaskB_11 TaskB_12 TaskB_13 TaskB_14 TaskB_15 TaskB_16 TaskB_17 TaskB_18 TaskB_19 TaskB_100 TaskB_101 TaskB_102 TaskB_103 TaskB_104 TaskB_105 TaskB_106 TaskB_107 TaskB_108 TaskB_109 TaskB_110 TaskB_111 TaskB_112 TaskB_113 TaskB_114 TaskB_115 TaskB_116

Note that I just wrote this list to demonstrate the problem; I could generate the list already sorted, but I want to be able to sort other things like this inside Vim as well.
From Vim's help for :sort:

With [n] sorting is done on the first decimal number in the line (after or inside a {pattern} match). One leading '-' is included in the number.

So try this command:

sor n

You don't need the %; :sort sorts all lines if no range is given.

EDIT: as commented by the OP, if you have

TaskB_0_A_1
TaskB_0_A_2
TaskB_0_A_4
TaskB_0_A_3
TaskB_0_A_5
TaskB_1_A_0

you could try

sor n /.*_\ze\d*/

or

sor nr /\d*$/

EDIT2: for the newly edited question, this line may give you the expected output based on your example data:

sor nr /\d*$/|sor n
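Another route, not from the original answer: if the external sort on your system supports version sort (-V in GNU coreutils and in recent BSD sort), you can filter the whole buffer through it from Vim:

:%!sort -V

This replaces the buffer with the version-sorted lines, which for the example data gives the requested ordering in one pass.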
How to replace the last comma in a line with a string in Unix
I am trying to insert a string into every line except the first and last lines of a file, but I am not able to get it done. Can anyone give me a clue how to achieve this? Thanks in advance. In short: how do I replace the last comma in a line with a string xxxxx (except for the first and last rows) using Unix?

Original file:

00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,01
99,SRI,FF,28

Expected file:

00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,xxxxx231100,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,xxxxx231300,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,xxxxx231900,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,xxxxx232200,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,xxxxx232400,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,xxxxx232700,01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
awk can be quite useful for manipulating data files like this one. Here's a one-liner that does more or less what you want: it prepends the string "xxxxx" to the twelfth field of each input line that has at least twelve fields.

$ awk 'BEGIN{FS=OFS=","}NF>11{$12="xxxxx"$12}{print}' 16006747.txt
00,SRI,BOM,FF,000004,20120808030100,20120907094412,"GTEXPR","SRIVIM","8894-7577","SRIVIM#GTEXPR."
10,SRI,FF,NMNN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYPO
10,SRI,FF,NMNN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,NMNN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,CCL,,101400,xxxxx02CHLSU
10,SRI,FF,BUN,0800,NMJWJB,U,NM,PAR,101540,101700,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,MCE,101840,101900,xxxxx01
10,SRI,FF,BUN,0800,NMJWJB,U,NM,SSS,102140,102200,xxxxx09
10,SRI,FF,BUN,0800,NMJWJB,U,NM,FSS,102600,,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,CCL,,103700,xxxxx01CHLSU
10,SRI,FF,BUN,0802,NMJWJB,U,NM,PAR,103940,104000,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,MCE,104140,104200,xxxxx01
10,SRI,FF,BUN,0802,NMJWJB,U,NM,SSS,104440,104500,xxxxx09
10,SRI,FF,BUN,0802,NMJWJB,U,NM,FSS,105000,,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,GEB,,230900,xxxxx02BLYSU
10,SRI,FF,BUN,3112,NMNSME,U,NM,TCM,231040,231100,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UPW,231240,231300,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,UFG,231700,231900,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,FTG,232140,232200,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BOR,232340,232400,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,BAY,232640,232700,xxxxx01
10,SRI,FF,BUN,3112,NMNSME,U,NM,RWD,233400,,xxxxx01
99,SRI,FF,28
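If the first and last lines must be skipped by their position rather than by their field count, a sed variant along the following lines is another possibility (a sketch, not from the original answer): 1b and $b branch past the substitution for the first and last lines, and the s command inserts xxxxx after the last comma of every other line.

sed '1b
$b
s/,\([^,]*\)$/,xxxxx\1/' 16006747.txt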
Text processing for IPv4 dotted decimal notation conversion to /8 or /16 format
I have an input file that contains a list of IP addresses and an ip_count (some parameter that I use internally). The file looks somewhat like this:

202.124.127.26 2135869
202.124.127.25 2111217
202.124.127.17 2058082
202.124.127.16 2014958
202.124.127.20 1949323
202.124.127.24 1933773
202.124.127.27 1932076
202.124.127.22 1886466
202.124.127.18 1882955
202.124.127.21 1803528
202.124.127.23 1786348
119.224.129.200 1776592
119.224.129.211 1639325
202.124.127.19 1479198
119.224.129.201 1145426
202.49.175.110 1133354
119.224.129.210 1119525
68.232.45.132 1085491
119.224.129.209 1015078
131.203.3.8 857951
202.162.73.4 817197
207.123.58.125 785326
202.7.6.18 762603
117.121.253.254 718022
74.125.237.120 710448
68.232.44.219 693002
202.162.73.2 671559
205.128.75.126 611301
119.161.91.17 604393
119.224.129.202 559930
8.27.241.126 528862
74.125.237.152 517516
8.254.9.254 514341

As you can see, the IP addresses themselves are unsorted, so I use the sort command on the file to sort them:

cat address_count.txt | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > sorted_address.txt

This gives me an output with the IP addresses in sorted order. Partial output of that file is shown below:

4.23.63.126 15731
4.26.254.254 320705
4.27.8.254 25174
8.12.129.50 176141
8.12.223.125 11800
8.19.32.65 15854
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
8.26.195.253 28568
8.27.152.253 104446
8.27.233.125 115856
8.27.235.126 16102
8.27.235.254 25628
8.27.238.254 108485
8.27.240.125 169262
8.27.241.126 528862
8.27.241.252 197302
8.27.248.125 14926
8.254.9.254 514341
12.129.210.71 89663
15.192.45.21 20139
15.192.45.26 35265
15.193.0.148 10313
15.193.113.29 40318
15.201.49.136 14243
15.240.238.52 57163
17.250.248.95 28166
23.33.125.13 19179
23.33.125.37 17953
31.151.163.60 72709
38.99.42.37 192356
38.99.68.180 41251
38.99.68.181 10272
38.104.237.74 74012
38.108.112.103 37034
38.108.112.115 69698
38.108.112.121 92173
38.108.112.122 99230
38.112.63.238 39958
38.119.130.62 42159
46.4.28.22 19769

Now I want to parse this file and convert the addresses to aaa.bbb.ccc.0/8 format and aaa.bbb.0.0/16 format, and I also want to count the number of occurrences of the IPs in each subnet. I want to do this using bash, and I am open to using sed or awk. How do I achieve this?

For example, this input portion:

8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809

should produce 8.19.240.0/8 and 8.20.213.0/8, and similarly for the /16 domains. I also want to count the occurrences of machines in each subnet; for example, the first subnet above should have the count 4 in the next column beside it, and it should also add up the already displayed counts, i.e. (11013 + 11915 + 31541 + 23304), in another column:

8.19.240.0/8 4 (11013 + 11915 + 31541 + 23304)
8.20.213.0/8 6 (96434 + 108191 + 170058 + 23512 + 10420 + 24809)

It would be great if someone could suggest a way to achieve this.
The main problem here is that, without the routing table from the individual moments the packets arrived, you have no idea what netblock they were originally in. Sure, you can put them in the classful blocks they would belong to in a classful routing situation, but all that gives you is a nicer presentation (and, admittedly, a shorter file). Furthermore, your example looks a bit broken: you have a bunch of IP addresses in 8.0.0.0/8, and you are aggregating them into what look like /24 routes while presenting them with a /8 at the end.

Nonetheless, in awk you can use sub() to do text replacement (or use index() to find occurrences of ".", or split() to split at the dots). From there it should be relatively easy to drop the last octet, append ".0/24", and use that as a key to update an IP-count and a hit-count dictionary; then drop the last two octets, append ".0.0/16", and do the same (all arrays in awk are associative, so essentially dicts). There is no need to sort in advance: when you loop through the result you will get the keys in a random order, but on average there will be fewer of them, so sorting afterwards will be cheaper. I seem to not have an awk at hand, so I cannot give you a code example.
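Since that answer could not include code, here is a minimal awk sketch of the approach it describes (the /24 and /16 labels and the output layout are assumptions, not something the question specifies):

awk '{
    split($1, o, ".")                       # o[1]..o[4] are the four octets
    k24 = o[1] "." o[2] "." o[3] ".0/24"    # /24 key: drop the last octet
    k16 = o[1] "." o[2] ".0.0/16"           # /16 key: drop the last two octets
    hits24[k24]++; total24[k24] += $2       # machine count and summed ip_count per /24
    hits16[k16]++; total16[k16] += $2       # same per /16
}
END {
    for (k in hits24) print k, hits24[k], total24[k]
    for (k in hits16) print k, hits16[k], total16[k]
}' sorted_address.txt

Piping the output through sort afterwards gives an ordered listing, which, as noted above, is cheaper than sorting the raw file first.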
This might work for you:

awk '{a=$1;sub(/\.[^.]*$/,"",a);ac[a]++;at[a]+=$2};END{for(x in ac)print x".0/8",ac[x],at[x]}' file

This prints the .0/8 addresses. To get the .0/16 addresses, duplicate the code, i.e. b=a; sub(/\.[^.]*$/,"",b); ba[b]++ etc., and add a matching print loop in the END block.
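Spelling out that "duplicate the code" hint, the combined command might look like this (a sketch that keeps this answer's variable names and the question's .0/8 labelling, even though the aggregation is really per /24):

awk '{ a = $1; sub(/\.[^.]*$/, "", a); ac[a]++; at[a] += $2     # per-/24 totals, labelled .0/8
       b = a;  sub(/\.[^.]*$/, "", b); ba[b]++; bt[b] += $2 }   # per-/16 totals
     END { for (x in ac) print x ".0/8",    ac[x], at[x]
           for (y in ba) print y ".0.0/16", ba[y], bt[y] }' file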