How to extract part of a line by regexp in a Linux shell? Let's say I have a file where every line contains an IP address, but at a different position on each line. What is the simplest way to extract those IP addresses using common Unix command-line tools?
You could use grep to pull them out.
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' file.txt
Most of the examples here will match on 999.999.999.999 which is not technically a valid IP address.
The following will match on only valid IP addresses (including network and broadcast addresses).
grep -E -o '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' file.txt
Omit the -o if you want to see the entire line that matched.
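As a quick illustration of the -o flag (the sample log line and file name here are made up):

```shell
# -o prints only the matched IP; drop -o and grep prints the whole line.
echo 'Aug 1 sshd: refused connect from 203.0.113.9' > /tmp/demo.log
grep -E -o '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' /tmp/demo.log
# prints: 203.0.113.9
```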
This works fine for me in access logs.
egrep -o '([0-9]{1,3}\.){3}[0-9]{1,3}' access_log
Let's break it down part by part.
[0-9]{1,3} means one to three occurrences of the range mentioned in the brackets; in this case 0-9. So it matches patterns like 10 or 183.
Followed by a '.'. We need to escape it, because '.' is a metacharacter with special meaning in regular expressions (it matches any character).
So now we are at patterns like '123.' or '12.'.
This pattern repeats itself three times (with the '.'), so we enclose it in parentheses:
([0-9]{1,3}\.){3}
And lastly, the pattern repeats itself, but this time without the '.'. That is why we kept it separate in the third step: [0-9]{1,3}
If the IPs are at the beginning of each line, as in my case, use:
egrep -o '^([0-9]{1,3}\.){3}[0-9]{1,3}'
where '^' is an anchor that restricts the match to the start of a line.
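A small demonstration of the anchor, with made-up log lines:

```shell
# Only the line that *starts* with an IP produces a match.
printf '%s\n' '10.1.2.3 GET /index.html' 'referer=10.9.9.9' > /tmp/anchor_demo.txt
grep -Eo '^([0-9]{1,3}\.){3}[0-9]{1,3}' /tmp/anchor_demo.txt
# prints: 10.1.2.3
```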
I usually start with grep, to get the regexp right.
# [multiple failed attempts here]
grep '[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*' file # good?
grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file # good enough
Then I'd try and convert it to sed to filter out the rest of the line. (After reading this thread, you and I aren't going to do that anymore: we're going to use grep -o instead)
sed -ne 's/.*\([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\).*/\1/p' file # FAIL
That's when I usually get annoyed with sed for not using the same regexes as anyone else. So I move to perl.
$ perl -nle '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ and print $&'
Perl's good to know in any case. If you've got a teeny bit of CPAN installed, you can even make it more reliable at little cost:
$ perl -MRegexp::Common=net -nE '/$RE{net}{IPV4}/ and say $&' file(s)
You can use sed. But if you know perl, that might be easier, and more useful to know in the long run:
perl -ne '/(\d+\.\d+\.\d+\.\d+)/ && print "$1\n"' < file
I wrote a little script to see my log files better, it's nothing special, but might help a lot of the people who are learning perl. It does DNS lookups on the IP addresses after it extracts them.
You can use some shell helpers I made:
https://github.com/philpraxis/ipextract
I've included them here for convenience:
#!/bin/sh
ipextract ()
{
egrep --only-matching -E '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
}
ipextractnet ()
{
egrep --only-matching -E '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/[[:digit:]]+'
}
ipextracttcp ()
{
egrep --only-matching -E '[[:digit:]]+/tcp'
}
ipextractudp ()
{
egrep --only-matching -E '[[:digit:]]+/udp'
}
ipextractsctp ()
{
egrep --only-matching -E '[[:digit:]]+/sctp'
}
ipextractfqdn ()
{
egrep --only-matching -E '[a-zA-Z0-9]+[a-zA-Z0-9.-]*\.[a-zA-Z]{2,}'
}
Load it / source it (when stored in ipextract file) from shell:
$ . ipextract
Use them:
$ ipextract < /etc/hosts
127.0.0.1
255.255.255.255
$
For some example of real use:
ipextractfqdn < /var/log/snort/alert | sort -u
dmesg | ipextractudp
For those who want a ready-made solution for extracting IP addresses from an Apache log and listing how many times each address has visited the website, use this line:
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' error.log | sort | uniq -c | sort -nr > occurences.txt
Nice method to ban hackers. Next you can:
Delete lines with fewer than 20 visits
Using a regexp, cut everything up to the single space, so only the IP addresses remain
Using a regexp, cut the last 1-3 digits of each IP address, so only the network addresses remain
Add "deny from " (with a trailing space) at the beginning of each line
Save the resulting file as .htaccess
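The steps above can be sketched in one awk pass. This is a hedged sketch: the file names and the 20-visit threshold are illustrative, and the input is assumed to be the "count IP" lines produced by the grep|sort|uniq -c pipeline:

```shell
# Illustrative input in the "count IP" shape produced by uniq -c:
printf '  25 203.0.113.7\n   3 198.51.100.9\n' > occurences.txt
awk '$1 >= 20 {                     # keep IPs with at least 20 visits
        ip = $2                     # keep only the IP address column
        sub(/[0-9]+$/, "", ip)      # drop the last octet -> network address
        print "deny from " ip       # prepend "deny from " and a space
     }' occurences.txt > htaccess_deny.txt
cat htaccess_deny.txt
# prints: deny from 203.0.113.
```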
grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}"
I'd suggest perl. (\d+\.\d+\.\d+\.\d+) should probably do the trick (note the escaped dots; an unescaped '.' matches any character).
EDIT: Just to make it more like a complete program, you could do something like the following (not tested):
#!/usr/bin/perl -w
use strict;
while (<>) {
if (/(\d+\.\d+\.\d+\.\d+)/) {
print "$1\n";
}
}
This handles one IP per line. If you have more than one IP per line, you need to use the /g option. man perlretut gives you a more detailed tutorial on regular expressions.
All of the previous answers have one or more problems. The accepted answer allows IP numbers like 999.999.999.999. The currently second most upvoted answer requires zero-prefixing, such as 127.000.000.001 or 008.008.008.008, instead of 127.0.0.1 or 8.8.8.8. Apama has it almost right, but that expression requires the IP number to be the only thing on the line: no leading or trailing space is allowed, nor can it select IPs from the middle of a line.
I think the correct regex can be found on http://www.regextester.com/22
So if you want to extract all ip-adresses from a file use:
grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" file.txt
If you don't want duplicates use:
grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" file.txt | sort | uniq
Please comment if there are still problems with this regex. It is easy to find many wrong regexes for this problem; I hope this one has no real issues.
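A quick sanity check that this pattern skips out-of-range addresses (the sample input is made up):

```shell
# The out-of-range 999.999.999.999 yields no match at all;
# only the valid address is extracted.
echo 'good 192.168.0.1 bad 999.999.999.999' |
  grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
# prints: 192.168.0.1
```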
Everyone here is using really long-handed regular expressions, but actually understanding POSIX regex lets you use a small grep command like this for printing IP addresses.
grep -Eo "(([0-9]{1,3})\.){3}([0-9]{1,3})"
(Side note: this doesn't exclude invalid IPs, but it is very simple.)
I have tried all the answers, but all of them had one or more problems. I list a few:
Some detected 123.456.789.111 as a valid IP
Some don't detect 127.0.00.1 as a valid IP
Some don't detect IPs that start with zero, like 08.8.8.8
So here I post a regex that works on all of the above conditions.
Note: I have extracted more than 2 million IPs without any problem with the following regex.
(?:(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)\.){3}(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)
I wrote an informative blog article about this topic: How to Extract IPv4 and IPv6 IP Addresses from Plain Text Using Regex.
In the article there's a detailed guide of the most common different patterns for IPs, often required to be extracted and isolated from plain text using regular expressions.
This guide is based on CodVerter's IP Extractor source code tool for handling IP addresses extraction and detection when necessary.
If you wish to validate and capture IPv4 Address this pattern can do the job:
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
or to validate and capture IPv4 Address with Prefix ("slash notation"):
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?/[0-9]{1,2})\b
or to capture subnet mask or wildcard mask:
(255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)
or to filter out subnet mask addresses you do it with regex negative lookahead:
\b((?!(255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)))(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
For IPv6 validation you can go to the article link I have added at the top of this answer.
Here is an example for capturing all the common patterns (taken from CodVerter's IP Extractor Help Sample):
If you wish you can test the IPv4 regex here.
You could use awk, as well. Something like ...
awk '{i=1; if (NF > 0) do {if ($i ~ /regexp/) print $i; i++;} while (i <= NF);}' file
It may require cleaning; it's just a quick and dirty response to show basically how to do it with awk.
The awk example above didn't work for me, and I needed to do it with awk specifically, so I came up with this method:
$ awk '{match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/); ip = substr($0,RSTART,RLENGTH); print ip}' your_sample_file.log
You can also just use pipes if you're getting the data from somewhere else. Eg, ipconfig
I also realized the method matches invalid IP addresses.
Here is an extended version that only matches valid IPv4 Addresses:
$ awk 'match($0, /(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/) {print substr($0, RSTART, RLENGTH)}' sample_file.log
Hope it helps someone else.
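A different sketch of the same idea: match any dotted quad loosely, then validate each octet numerically in awk instead of encoding the 0-255 range into the regex (the file name and input are illustrative):

```shell
printf 'x 999.1.2.3 y 8.8.8.8\n' > sample_file.log
awk '{ s = $0
       # repeatedly find dotted quads, then range-check each octet
       while (match(s, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
           ip = substr(s, RSTART, RLENGTH)
           split(ip, o, ".")
           ok = 1
           for (i = 1; i <= 4; i++) if (o[i] + 0 > 255) ok = 0
           if (ok) print ip
           s = substr(s, RSTART + RLENGTH)   # continue after this match
       } }' sample_file.log
# prints only: 8.8.8.8
```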
It's a REALLY brute-force type of solution, and I haven't had time to handle things like subnet masks.
Since many awk variants lack backreferences in regexes, range notation {n,m}, gawk's FPAT, and an array target for match(), I had to try my best to emulate some of that functionality here.
The regex itself is very basic, and it's very much intentional, since each of candidates that passed through the first layer filter will then be fed into the ip4 validation function to ensure values are in range.
Additionally, I use a second array to handle the duplicate scenario (although it's only de-duped in the ASCII string sense - leading zeros, for now, will show up multiple times for each unique ASCII string representation of it).
I know it's ultra brute-force and unseemly of a solution - there's only so much lemonade I can make out of the lemons I have.
echo "${bbbbbbbb}" \
\
| mawk 'function validIP4(_,__,___) {
__^=__=___=4;--__
if(--___!=gsub("[.]","_",_)) {
return !___ }
++___
do {
if ((+_<-_)||(__<+_)||(--___<-___)) {
_="[|]"
break
} } while (sub("^[^_]+[_]","",_))
return _!="[|]"
} BEGIN { FS = RS = "^$"
__=(__= (__="[0]*([012]?[0-9])?[0-9][.]")__)__
sub("...$","",__)
} END {
gsub(/[^0-9.]+/,OFS)
gsub(__,"=&~")
gsub(/[~][^0-9.=~]+[=]/,"~=")
gsub(/^[^=~]+[=~]|[=~][^=~]+$/,"")
split($(_<_),___,"[=~]+")
for(_ in ___) {
if ( ! (____[__=___[_]]++)) {
if (validIP4(__)) {
print (__) } } } }' \
\
| gsort -t'.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n \
| gcat -n \
| rs -t -c$'\n' -C= 0 4 \
| column -s= -t \
| lgp3 5
1 00.69.84.243 76 23.108.43.3 151 79.127.56.148 226 172.241.192.165
2 00.71.110.228 77 23.108.43.19 152 80.48.119.28 227 172.245.220.154
3 00.105.215.18 78 23.108.43.55 153 80.76.60.2 228 175.196.182.58
4 00.123.2.171 79 23.108.43.94 154 80.244.229.102 229 176.74.9.62
5 00.123.228.2 80 23.108.43.120 155 81.8.52.78 230 176.214.97.55
6 00.201.223.164 81 23.108.43.208 156 83.166.241.233 231 177.128.44.131
7 01.51.106.70 82 23.108.43.244 157 85.25.4.28 232 177.129.53.114
8 01.144.14.232 83 23.108.75.98 158 85.25.91.156 233 178.88.185.2
9 01.148.85.50 84 23.108.75.164 159 85.25.91.161 234 180.180.171.123
10 01.174.10.170 85 23.225.64.59 160 85.25.117.171 235 180.183.15.198
11 02.64.120.219 86 36.37.177.186 161 85.25.150.32 236 180.250.153.129
12 02.68.128.214 87 36.94.161.219 162 85.25.201.22 237 181.36.230.242
13 02.129.196.242 88 37.48.82.87 163 85.195.104.71 238 181.191.141.43
14 02.134.127.15 89 37.144.180.52 164 85.208.211.163 239 182.253.186.140
15 03.28.246.130 90 41.65.236.56 165 85.209.149.130 240 185.24.233.208
16 03.73.194.2 91 41.65.251.86 166 88.119.195.35 241 185.61.152.137
17 03.80.77.1 92 41.79.65.241 167 91.107.15.221 242 185.74.7.51
18 03.81.77.194 93 41.161.92.138 168 91.188.246.246 243 185.93.205.236
19 03.97.200.52 94 41.164.68.42 169 93.184.8.74 244 185.138.114.113
20 3.120.173.144 95 41.164.68.194 170 94.16.15.100 245 186.3.85.131
21 03.134.97.233 96 41.205.24.155 171 94.75.76.3 246 186.5.117.82
22 03.148.72.192 97 43.255.113.232 172 94.228.204.229 247 186.46.168.42
23 03.150.113.147 98 45.5.68.18 173 95.181.150.121 248 186.96.50.39
24 03.159.46.18 99 45.5.68.25 174 95.181.151.105 249 186.154.211.106
25 03.162.181.132 100 45.43.63.230 175 110.74.200.177 250 186.167.48.138
26 03.177.45.7 101 45.67.212.99 176 112.163.123.242 251 186.202.176.153
27 03.177.45.10 102 45.67.230.13 177 113.161.59.136 252 186.233.186.60
28 03.177.45.11 103 45.71.203.110 178 115.87.196.88 253 186.251.71.193
29 03.217.169.100 104 45.87.249.80 179 116.212.155.229 254 187.217.54.84
30 03.232.215.194 105 45.122.233.76 180 117.54.114.101 255 188.94.225.177
31 04.208.138.14 106 45.131.213.170 181 117.54.114.102 256 188.95.89.81
32 04.244.75.205 107 45.158.158.29 182 117.54.114.103 257 188.133.153.143
33 5.39.189.39 108 45.179.193.70 183 119.82.241.21 258 188.138.89.50
34 05.149.219.201 109 45.183.142.126 184 120.72.20.225 259 188.138.90.226
35 5.149.219.201 110 45.184.103.68 185 121.1.41.162 260 188.166.218.243
36 5.189.229.42 111 45.184.155.7 186 123.31.30.100 261 190.128.225.115
37 07.151.182.247 112 45.189.113.63 187 125.25.33.241 262 190.217.7.73
38 07.154.221.245 113 45.189.117.237 188 125.25.206.28 263 190.217.19.243
39 07.244.242.103 114 45.192.141.247 189 133.242.146.103 264 192.3.219.94
40 08.177.248.47 115 45.229.32.190 190 137.74.93.21 265 192.99.38.64
41 08.177.248.213 116 45.250.65.15 191 137.184.57.245 266 192.140.42.83
42 08.177.248.217 117 46.99.146.232 192 139.5.151.182 267 192.155.107.59
43 8.210.83.33 118 46.243.220.70 193 139.59.233.24 268 192.254.104.201
44 8.213.128.19 119 46.246.80.6 194 139.255.58.212 269 194.5.193.183
45 8.213.128.30 120 47.74.114.83 195 140.238.19.26 270 194.114.128.149
46 8.213.128.41 121 47.88.79.154 196 151.106.13.221 271 194.233.67.98
47 8.213.128.106 122 47.91.44.217 197 151.106.18.126 272 194.233.69.41
48 8.213.128.123 123 47.243.75.115 198 152.26.229.67 273 194.233.73.103
49 8.213.128.131 124 47.254.28.2 199 152.32.143.109 274 194.233.73.104
50 8.213.128.149 125 49.156.47.162 200 153.122.106.94 275 194.233.73.105
51 8.213.128.152 126 50.195.227.153 201 153.122.107.129 276 194.233.73.107
52 8.213.128.158 127 50.235.149.74 202 154.85.35.235 277 194.233.73.109
53 8.213.128.171 128 50.250.56.129 203 154.95.36.182 278 194.233.88.38
54 8.213.128.172 129 51.68.199.120 204 154.236.162.59 279 195.80.49.3
55 8.213.128.202 130 51.77.141.29 205 154.236.168.179 280 195.80.49.4
56 8.213.128.214 131 51.81.32.81 206 154.236.177.101 281 195.80.49.5
57 8.213.129.23 132 51.159.3.223 207 154.236.179.226 282 195.80.49.6
58 8.213.129.36 133 51.178.182.23 208 157.100.26.69 283 195.80.49.7
59 8.213.129.51 134 54.80.246.241 209 159.65.69.186 284 195.80.49.253
60 8.213.129.57 135 61.9.48.169 210 159.65.133.175 285 195.80.49.254
61 8.213.129.243 136 61.9.53.157 211 159.203.13.121 286 195.158.30.232
62 8.214.41.50 137 62.75.219.49 212 160.16.242.164 287 197.149.247.82
63 8.218.213.95 138 62.75.229.77 213 161.22.34.142 288 197.243.20.178
64 09.200.156.102 139 62.78.84.159 214 164.132.137.241 289 198.46.200.70
65 13.237.147.45 140 62.138.8.42 215 167.71.207.46 290 198.229.231.13
66 20.47.108.204 141 62.204.35.69 216 167.86.81.208 291 212.112.113.178
67 20.113.24.12 142 63.161.104.189 217 167.249.180.42 292 212.154.234.46
68 23.19.7.136 143 66.29.154.103 218 168.205.100.36 293 212.174.44.87
69 23.19.10.93 144 66.29.154.105 219 169.57.1.85 294 213.32.75.44
70 23.81.127.253 145 69.163.252.140 220 170.81.35.26 295 213.230.69.193
71 23.105.78.193 146 76.118.227.8 221 170.83.60.19 296 213.230.71.230
72 23.105.78.252 147 77.83.86.65 222 170.155.5.235 297 213.230.90.106
73 23.105.86.52 148 77.83.87.217 223 171.233.151.214 298 221.159.192.122
74 23.108.42.228 149 77.104.97.3 224 172.241.156.1 299 222.158.197.138
75 23.108.42.238 150 77.236.243.125 225 172.241.192.104 300 222.252.23.5
grep '^[0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[,].*$\|^.*[,][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[,].*$\|^.*[,][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}$' ip_address.txt
Let's assume the file is comma-delimited and the IP address may appear at the beginning, the end, or somewhere in the middle of a line.
The first regexp looks for an exact match of an IP address at the beginning of the line.
The second regexp, after the alternation, looks for an IP address in the middle. We match it in such a way that the number sequences must be exactly 1 to 3 digits, so false IPs like 12345.12.34.1 are excluded.
The third regexp looks for the IP address at the end of the line.
I wanted to get only IP addresses that began with "10", from any file in a directory:
grep -o -nr "\b10\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\b" /var/www
If you are not given a specific file and you need to extract IP addresses, we have to search recursively.
The grep command searches text or files for a given pattern and displays the matching string.
grep -roE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
-r We can search the entire directory tree i.e. the current directory and all levels of sub-directories. It denotes recursive searching.
-o Print only the matching string
-E Use extended regular expression
If we had not used the second grep command after the pipe, we would have got the IP address along with the path of the file where it was found.
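Alternatively, grep's -h flag suppresses the filename prefix directly, which avoids the second grep (the directory name below is illustrative):

```shell
# -h drops the "path:" prefix that -r normally adds to each match.
mkdir -p /tmp/iprecurse && echo 'server 192.0.2.10 up' > /tmp/iprecurse/a.log
grep -rhoE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/iprecurse
# prints: 192.0.2.10
```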
For CentOS 6.3:
ifconfig eth0 | grep 'inet addr' | awk '{print $2}' | awk 'BEGIN {FS=":"} {print $2}'
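A single awk can replace the grep|awk|awk chain. The "inet addr:" label is the net-tools format used on CentOS 6; it is demonstrated here on a captured line rather than live ifconfig output:

```shell
# Treat runs of colons/spaces as one separator; the address is field 4.
line='          inet addr:10.0.0.5  Bcast:10.0.0.255  Mask:255.255.255.0'
printf '%s\n' "$line" | awk -F'[: ]+' '/inet addr/ {print $4}'
# prints: 10.0.0.5
```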
I am trying to parse an input file (my test file has 4 lines) and then query an online biological database. However, my loop seems to stop after returning the first result.
#!/bin/bash
if [ "$1" = "" ]; then
echo "No input file to parse given. Give me a BLAST output file"
else
file=$1
#Extracts GI from each result and stores it on temp file.
rm -rf /home/chris/TEMP/tempfile.txt
awk -F '|' '{printf("%s\n",$2);}' "$file" >> /home/chris/TEMP/tempfile.txt
#gets the species from each gi.
input="/home/chris/TEMP/tempfile.txt"
while read -r i
do
echo GI:"$i"
/home/chris/EntrezDirect/edirect/esearch -db protein -query "$i" | /home/chris/EntrezDirect/edirect/efetch -format gpc | /home/chris/EntrezDirect/edirect/xtract -insd source organism | cut -f2
done < "$input"
rm -rf /home/chris/TEMP/tempfile.txt
fi
For example, my only output is
GI:751637161
Pseudomonas stutzeri group
whereas I should have 4 results. Any help appreciated and thanks in advance.
This is the format of the sample input:
TARA042SRF022_1 gi|751637161|ref|WP_041104882.1| 40.4 151 82 2 999 547 1 143 2.8e-21 110.9
TARA042SRF022_2 gi|1057355277|ref|WP_068715547.1| 62.7 263 96 1 915 133 80 342 7.1e-96 358.6
TARA042SRF022_3 gi|950462516|ref|WP_057369049.1| 38.3 47 29 0 184 44 152 198 5.1e+01 36.2
TARA042SRF022_4 gi|918428433|ref|WP_052479609.1| 37.5 48 29 1 525 668 192 238 6.1e+01 37.0
It would appear that read -r i is returning with a non-zero exit status on its second call, indicating that there is no more data to be read from the input file. This usually means that a command inside the while loop is also reading from standard input, and is consuming the remainder of the file before read has a chance.
The only candidate here is esearch, as echo does not read from standard input and the other commands are all reading from the previous command in the pipeline. Redirect standard input for esearch so that it does not consume your input data inadvertently.
while read -r i
do
echo GI:"$i"
/home/chris/EntrezDirect/edirect/esearch -db protein -query "$i" < /dev/null |
/home/chris/EntrezDirect/edirect/efetch -format gpc |
/home/chris/EntrezDirect/edirect/xtract -insd source organism |
cut -f2
done < "$input"
Use cut to extract columns from an ASCII file; use the -d option to set the delimiter and -f to specify the column. Wrap everything in a loop like so:
$ cat data.txt
TARA042SRF022_1 gi|751637161|ref|WP_041104882.1| 40.4 151 82 2 999 547 1 143 2.8e-21 110.9
TARA042SRF022_2 gi|1057355277|ref|WP_068715547.1| 62.7 263 96 1 915 133 80 342 7.1e-96 358.6
TARA042SRF022_3 gi|950462516|ref|WP_057369049.1| 38.3 47 29 0 184 44 152 198 5.1e+01 36.2
TARA042SRF022_4 gi|918428433|ref|WP_052479609.1| 37.5 48 29 1 525 668 192 238 6.1e+01 37.0
$ cat t.sh
#!/bin/bash
for gi in $(cut -d"|" -f 2 data.txt); do
echo $gi
done
$ bash t.sh
751637161
1057355277
950462516
918428433
Edit:
I cannot reproduce the problem, but I suspect it is linked to newlines and/or the use of a temp file. My suggestion avoids both; it does not answer your literal question, but it should solve the underlying problem.
Hi - I am looking for a bash/awk/sed solution to get subsets of a table based on unique column values. For example, if I have:
chrom1 333
chrom1 343
chrom2 380
chrom2 501
chrom1 342
chrom3 102
I want to be able to split this table into 3:
chrom1 333
chrom1 343
chrom1 342
chrom2 380
chrom2 501
chrom3 102
I know how to do this in R using the split command, but I am specifically looking for a bash/awk/sed solution.
Thanks
I don't know if this awk is of any use, but it will create 3 separate files based on the unique column values:
awk '{print >> $1; close($1)}' file
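A quick run of that one-liner on a sample of the table above; it creates one file per unique first-column value in the current directory:

```shell
rm -f chrom1 chrom2                      # start clean, since >> appends
printf '%s\n' 'chrom1 333' 'chrom2 380' 'chrom1 343' > table.txt
awk '{print >> $1; close($1)}' table.txt # one output file per $1 value
cat chrom1
# prints: chrom1 333
#         chrom1 343
```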
An alternative awk, which keeps the original order of records within each block:
$ awk '{a[$1]=a[$1]?a[$1] ORS $0:$0}
END{for(k in a) print a[k] ORS ORS}' file
generates
chrom1 333
chrom1 343
chrom1 342
chrom2 380
chrom2 501
chrom3 102
There are 2 trailing empty lines at the end, but they are not displayed in the formatted output.
Using sort and awk:
sort -k1,1 file | awk 'NR>1 && p != $1{print ORS} {p=$1} 1'
EDIT: If you want to keep original order of records from input file then use:
awk -v ORS='\n\n' '!($1 in a){a[$1]=$0; ind[++i]=$1; next}
{a[$1]=a[$1] RS $0}
END{for(k=1; k<=i; k++) print a[ind[k]]}' file
Create the input list file.txt:
(
cat << EOF
chrom1 333
chrom1 343
chrom2 380
chrom2 501
chrom1 342
chrom3 102
EOF
) > file.txt
Transformation:
cut -d" " -f1 file.txt | sort -u | while read c
do
cat file.txt | grep "^$c" | sort
echo
done
I have the following test file
Kmax Event File - Text Format
1 4 1000
65 4121 9426 12312
56 4118 8882 12307
1273 4188 8217 12309
1291 4204 8233 12308
1329 4170 8225 12303
1341 4135 8207 12306
63 4108 8904 12300
60 4106 8897 12307
731 4108 8192 12306
...
ÿÿÿÿÿÿÿÿ
In this file I want to delete the first two lines and apply some mathematical calculations. For instance, each column i will become $i-(i-1)*number. A script that does this is the following
#!/bin/bash
if test $1 ; then
if [ -f $1.evnt ] ; then
rm -f $1.dat
sed -n '2p' $1.evnt | (read v1 v2 v3
for filename in $1*.evnt ; do
echo -e "Processing file $filename"
sed '$d' < $filename > $1_tmp
sed -i '/Kmax/d' $1_tmp
sed -i '/^'"$v1"' '"$v2"' /d' $1_tmp
cat $1_tmp >> $1.dat
done
v3=`wc -l $1.dat | awk '{print $1}' `
echo -e "$v1 $v2 $v3" > .$1.dat
rm -f $1_tmp)
else
echo -e "\a!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
echo -e " Event file $1.evnt doesn't exist !!!!!!"
echo -e "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
fi
else
echo -e "\a!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
echo -e "!!!!! Give name for event files !!!!!"
echo -e "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
fi
awk '{print $1, $2-4096, $3-(2*4096), $4-(3*4096)}' $1.dat >$1_Processed.dat
rm -f $1.dat
exit 0
The file won't always have 4 columns. Is there a way to read the number of columns, print this number and apply those calculations?
EDIT The idea is to have an input file (*.evnt) and convert it to *.dat, or any other ASCII file (it doesn't really matter), which will only include the numeric columns, and then apply the calculation $i=$i-(i-1)*number. In addition it will keep the number of columns in a variable that will be used in another program. For instance, in the above file number=4096, and a sample output file is the following
65 25 1234 24
56 22 690 19
1273 92 25 21
1291 108 41 20
1329 74 33 15
1341 39 15 18
63 12 712 12
60 10 705 19
731 12 0 18
while in the console I will get the message There are 4 detectors.
Finally a new file_processed.dat will be produced, where file is the initial name of awk's input file.
The way it should be executed is the following
./myscript <filename>
where <filename> is the name without the format. For instance, the files will have the format filename.evnt so it should be executed using
./myscript filename
Let's start with this to see if it's close to what you're trying to do:
$ numdet=$( awk -v num=4096 '
NR>2 && NF>1 {
out = FILENAME "_processed.dat"
for (i=1;i<=NF;i++) {
$i = $i-(i-1)*num
}
nf = NF
print > out
}
END {
printf "There are %d detectors\n", nf | "cat>&2"
print nf
}
' file )
There are 4 detectors
$ cat file_processed.dat
65 25 1234 24
56 22 690 19
1273 92 25 21
1291 108 41 20
1329 74 33 15
1341 39 15 18
63 12 712 12
60 10 705 19
731 12 0 18
$ echo "$numdet"
4
Is that it?
Using awk
awk 'NR<=2{next}{for (i=1;i<=NF;i++) $i=$i-(i-1)*4096}1' file
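Applied to the first rows of the sample file (with number=4096), for example:

```shell
# First two lines are the header and the "1 4 1000" record; they are skipped.
printf '%s\n' 'Kmax Event File - Text Format' '1 4 1000' \
              '65 4121 9426 12312' '56 4118 8882 12307' > evt.txt
awk 'NR<=2{next}{for (i=1;i<=NF;i++) $i=$i-(i-1)*4096}1' evt.txt
# prints: 65 25 1234 24
#         56 22 690 19
```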