Bash - Sort a file based on a list in another file

I know that similar questions have been asked about sorting a file by a specific column, but none of them seem to answer my question.
My Input file looks like
OHJ07_1_contig_10 0 500 130 500 500 1.0000000
OHJ07_1_contig_10 500 1000 180 500 500 1.0000000
OHJ07_1_contig_10 1000 1500 171 500 500 1.0000000
OHJ07_1_contig_10 1500 2000 79 380 500 0.7600000
OHJ07_1_contig_10 2000 2500 62 500 500 1.0000000
OHJ07_1_contig_10 2500 3000 96 500 500 1.0000000
OHJ07_1_contig_10 3000 3500 76 500 500 1.0000000
OHJ07_1_contig_10 3500 4000 87 500 500 1.0000000
OHJ07_1_contig_10 4000 4500 60 500 500 1.0000000
OHJ07_1_contig_10 4500 5000 64 500 500 1.0000000
OHJ07_1_contig_10 5000 5468 213 468 468 1.0000000
OHJ07_1_contig_100 0 500 459 500 500 1.0000000
OHJ07_1_contig_100 500 1000 156 500 500 1.0000000
OHJ07_1_contig_100 1000 1314 77 305 314 0.9713376
OHJ07_1_contig_1000 0 500 239 500 500 1.0000000
OHJ07_1_contig_1000 500 1000 226 500 500 1.0000000
OHJ07_1_contig_1000 1000 1500 238 500 500 1.0000000
OHJ07_1_contig_1000 1500 2000 263 500 500 1.0000000
The program that generated it sorted the output alphanumerically by the name in the first column, but I would like to sort it based on a list of names in another file, keeping all the other data. The other file has other information, such as the contig length in column 2 (this file was produced with samtools faidx).
OHJ07_1_contig_25270 888266 96530655 60 61
OHJ07_1_contig_36751 583964 120924448 60 61
OHJ07_1_contig_44057 504884 134192571 60 61
OHJ07_1_contig_21721 415942 87354744 60 61
OHJ07_1_contig_46339 411691 143341916 60 61
OHJ07_1_contig_44022 330441 133783765 60 61
Since each name has a different number of entries in the first file, what's the easiest way to deal with this? Preferably using bash.
I haven't tried anything because I have no idea how to tackle this.

I would prepend each line of the file that determines the order (from now on called the index) with its line number. There is a way to do this with awk; I used the answer written here https://superuser.com/questions/10201/how-can-i-prepend-a-line-number-and-tab-to-each-line-of-a-text-file (assuming your index file is named index and your data file is named data.txt):
awk '{printf "%d,%s\n", NR, $0}' < index > index-numbered
This way, index-numbered gives you the correspondence between the arbitrary name order you decided on and line numbers.
You can then use a while loop over the data file that prefixes each line with the matching index line number and a comma, keeping the name and the rest of the line, for example:
57,OHJ07_1_contig_46339 411691 143341916 60 61
This lets you sort on the first field, the number, which translates your arbitrary order into a numeric order.
The while loop that creates a new data file with the numbers described above:
while read -r line
do
    key=$(echo "$line" | awk '{print $1}')            # the contig name is the first whitespace-separated field
    n=$(grep ",$key " index-numbered | cut -d, -f1)   # look up its line number; the comma and trailing space avoid partial matches (contig_10 vs contig_100)
    echo "$n,$line" >> indexed-data.txt
done < data.txt
Then you can simply sort your modified data file (indexed-data.txt) with sort, using the inserted line number as the sort key:
sort -k1 -n -t, indexed-data.txt >sorted-data.txt
If you want to hide the line numbers in the final output, you can filter them out by modifying the preceding command like this:
sort -k1 -n -t, indexed-data.txt | cut -d, -f2 > sorted-data.txt
Your final output will be in the file sorted-data.txt.
I'm sure this is not the best solution; maybe others can answer better than I can.
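If you'd rather avoid the per-line grep, the same join-and-sort idea can be expressed in a single awk pass. This is only a sketch, assuming the index file is named index and the data file data.txt as above; lines whose names are missing from the index would sort first.
awk 'NR==FNR {order[$1] = NR; next}   # first pass over index: remember each name's rank
     {print order[$1] "\t" $0}        # second pass over data: prefix every line with that rank
' index data.txt | sort -n -k1,1 | cut -f2-   # sort by rank, then drop the helper column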

Related

Extract IP address from a string containing an IP address and other alphanumeric characters

How to extract a text part by regexp in a Linux shell? Let's say I have a file where every line contains an IP address, but at a different position. What is the simplest way to extract those IP addresses using common Unix command-line tools?
You could use grep to pull them out.
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' file.txt
Most of the examples here will match on 999.999.999.999 which is not technically a valid IP address.
The following will match on only valid IP addresses (including network and broadcast addresses).
grep -E -o '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' file.txt
Omit the -o if you want to see the entire line that matched.
This works fine for me in access logs.
cat access_log | egrep -o '([0-9]{1,3}\.){3}[0-9]{1,3}'
Let's break it part by part.
[0-9]{1,3} means one to three occurrences of the range mentioned in []. In this case it is 0-9, so it matches patterns like 10 or 183.
Followed by a '.'. We need to escape this, because '.' is a metacharacter with special meaning in regular expressions.
So now we are at patterns like '123.', '12.' etc.
This pattern repeats itself three times (with the '.'), so we enclose it in parentheses:
([0-9]{1,3}\.){3}
And lastly the pattern repeats itself, but this time without the '.'. That is why we kept it separate in the third step: [0-9]{1,3}
If the IPs are at the beginning of each line, as in my case, use:
egrep -o '^([0-9]{1,3}\.){3}[0-9]{1,3}'
where '^' is an anchor that tells the regex to match at the start of the line.
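For instance, a quick sanity check of the assembled pattern on a made-up log line (the address is just an example):
$ echo "client 203.0.113.42 connected on port 8080" | egrep -o '([0-9]{1,3}\.){3}[0-9]{1,3}'
203.0.113.42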
I usually start with grep, to get the regexp right.
# [multiple failed attempts here]
grep '[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*' file # good?
grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file # good enough
Then I'd try and convert it to sed to filter out the rest of the line. (After reading this thread, you and I aren't going to do that anymore: we're going to use grep -o instead)
sed -ne 's/.*\([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\).*/\1/p' # FAIL
That's when I usually get annoyed with sed for not using the same regexes as anyone else. So I move to perl.
$ perl -nle '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ and print $&'
Perl's good to know in any case. If you've got a teeny bit of CPAN installed, you can even make it more reliable at little cost:
$ perl -MRegexp::Common=net -nE '/$RE{net}{IPV4}/ and say $&' file(s)
You can use sed. But if you know perl, that might be easier, and more useful to know in the long run:
perl -ne '/(\d+\.\d+\.\d+\.\d+)/ && print "$1\n"' < file
I wrote a little script to see my log files better, it's nothing special, but might help a lot of the people who are learning perl. It does DNS lookups on the IP addresses after it extracts them.
You can use some shell helper I made:
https://github.com/philpraxis/ipextract
I've included them here for convenience:
#!/bin/sh
ipextract ()
{
egrep --only-matching -E '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
}
ipextractnet ()
{
egrep --only-matching -E '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/[[:digit:]]+'
}
ipextracttcp ()
{
egrep --only-matching -E '[[:digit:]]+/tcp'
}
ipextractudp ()
{
egrep --only-matching -E '[[:digit:]]+/udp'
}
ipextractsctp ()
{
egrep --only-matching -E '[[:digit:]]+/sctp'
}
ipextractfqdn ()
{
egrep --only-matching -E '[a-zA-Z0-9]+[a-zA-Z0-9\-\.]*\.[a-zA-Z]{2,}'
}
Load it / source it (when stored in ipextract file) from shell:
$ . ipextract
Use them:
$ ipextract < /etc/hosts
127.0.0.1
255.255.255.255
$
Some examples of real use:
ipextractfqdn < /var/log/snort/alert | sort -u
dmesg | ipextractudp
For those who want a ready-made solution for getting IP addresses from an Apache log and listing how many times each IP address has visited the website, use this line:
grep -Eo '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' error.log | sort | uniq -c | sort -nr > occurences.txt
A nice method to ban abusive hosts. Next you can (a rough sketch of these steps follows the list):
Delete lines with less than 20 visits
Using regexp cut till single space so you will have only IP addresses
Using regexp cut 1-3 last numbers of IP addresses so you will have only network addresses
Add deny from and a space at the beginning of each line
Put the result file as .htaccess
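A rough sketch of those steps in one awk call, assuming the occurences.txt produced above, a threshold of 20 visits, and Apache 2.2-style "deny from" syntax; adjust to taste:
# keep IPs with at least 20 hits, trim the last octet to get the network, emit deny rules
awk '$1 >= 20 { sub(/\.[0-9]+$/, "", $2); print "deny from " $2 }' occurences.txt | sort -u > .htaccess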
grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}"
I'd suggest perl. (\d+\.\d+\.\d+\.\d+) should probably do the trick.
EDIT: Just to make it more like a complete program, you could do something like the following (not tested):
#!/usr/bin/perl -w
use strict;
while (<>) {
if (/(\d+\.\d+\.\d+\.\d+)/) {
print "$1\n";
}
}
This handles one IP per line. If you have more than one IP per line, you need to use the /g option. man perlretut gives you a more detailed tutorial on regular expressions.
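For example, a minimal sketch of that multi-match variant (not part of the original answer):
perl -ne 'print "$1\n" while /(\d+\.\d+\.\d+\.\d+)/g' file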
All of the previous answers have one or more problems. The accepted answer allows IP numbers like 999.999.999.999. The currently second most upvoted answer requires prefixing with 0, such as 127.000.000.001 or 008.008.008.008, instead of 127.0.0.1 or 8.8.8.8. Apama has it almost right, but that expression requires that the IP number is the only thing on the line: no leading or trailing space is allowed, nor can it select IPs from the middle of a line.
I think the correct regex can be found on http://www.regextester.com/22
So if you want to extract all ip-adresses from a file use:
grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" file.txt
If you don't want duplicates use:
grep -Eo "(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])" file.txt | sort | uniq
Please comment if there are still problems in this regex. It is easy to find many wrong regexes for this problem; I hope this one has no real issues.
Everyone here is using really long-winded regular expressions, but actually understanding POSIX regex will allow you to use a small grep command like this for printing IP addresses.
grep -Eo "(([0-9]{1,3})\.){3}([0-9]{1,3})"
(Side note: this doesn't exclude invalid IPs, but it is very simple.)
I have tried all the answers, but all of them had one or more problems; I list a few of them here.
Some detected 123.456.789.111 as a valid IP.
Some don't detect 127.0.00.1 as a valid IP.
Some don't detect IPs that start with a zero, like 08.8.8.8.
So here I post a regex that works in all of the above conditions.
Note: I have extracted more than 2 million IPs without any problem with the following regex.
(?:(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)\.){3}(?:1\d\d|2[0-5][0-5]|2[0-4]\d|0?[1-9]\d|0?0?\d)
I wrote an informative blog article about this topic: How to Extract IPv4 and IPv6 IP Addresses from Plain Text Using Regex.
In the article there's a detailed guide of the most common different patterns for IPs, often required to be extracted and isolated from plain text using regular expressions.
This guide is based on CodVerter's IP Extractor source code tool for handling IP addresses extraction and detection when necessary.
If you wish to validate and capture IPv4 Address this pattern can do the job:
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
or to validate and capture IPv4 Address with Prefix ("slash notation"):
\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?/[0-9]{1,2})\b
or to capture subnet mask or wildcard mask:
(255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)
or to filter out subnet mask addresses you do it with regex negative lookahead:
\b((?!(255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)[.](255|254|252|248|240|224|192|128|0)))(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
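The (?: … ) groups and the negative lookahead are Perl-style constructs, so with grep these patterns generally need PCRE support (grep -P, where available; that's an assumption about your grep build). A sketch using the first pattern:
grep -Po '\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)[.]){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b' file.txt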
For IPv6 validation you can go to the article link I have added at the top of this answer.
Here is an example for capturing all the common patterns (taken from CodVerter's IP Extractor Help Sample):
If you wish you can test the IPv4 regex here.
You could use awk, as well. Something like ...
awk '{i=1; if (NF > 0) do {if ($i ~ /regexp/) print $i; i++;} while (i <= NF);}' file
It may require cleaning; it's just a quick and dirty response to show basically how to do it with awk.
The awk example above didn't work for me, and I needed to do it with awk specifically, so I came up with this method:
$ awk '{match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/); ip = substr($0,RSTART,RLENGTH); print ip}' your_sample_file.log
You can also just use pipes if you're getting the data from somewhere else. Eg, ipconfig
I also realized the method matches invalid IP addresses.
Here is an extended version that only matches valid IPv4 Addresses:
$ awk 'match($0, /(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/) {print substr($0, RSTART, RLENGTH)}' sample_file.log
Hope it helps someone else.
It's a REALLY brute-force type of solution, and I haven't had time to handle things like subnet masks.
Since many awk variants lack backreferences in regex, {n,m} range notation, the FPAT ability, and an array target for match(), I have to try my best to emulate some of that functionality here.
The regex itself is very basic, and that is intentional, since each of the candidates that passes the first-layer filter is then fed into the IPv4 validation function to ensure the values are in range.
Additionally, I use a second array to handle the duplicate scenario (although it's only de-duplicated in the ASCII string sense: for now, the same address with different leading-zero spellings will show up once per spelling).
I know it's an ultra brute-force and unseemly solution; there's only so much lemonade I can make out of the lemons I have.
echo "${bbbbbbbb}" \
\
| mawk 'function validIP4(_,__,___) {
__^=__=___=4;--__
if(--___!=gsub("[.]","_",_)) {
return !___ }
++___
do {
if ((+_<-_)||(__<+_)||(--___<-___)) {
_="[|]"
break
} } while (sub("^[^_]+[_]","",_))
return _!="[|]"
} BEGIN { FS = RS = "^$"
__=(__= (__="[0]*([012]?[0-9])?[0-9][.]")__)__
sub("...$","",__)
} END {
gsub(/[^0-9.]+/,OFS)
gsub(__,"=&~")
gsub(/[~][^0-9.=~]+[=]/,"~=")
gsub(/^[^=~]+[=~]|[=~][^=~]+$/,"")
split($(_<_),___,"[=~]+")
for(_ in ___) {
if ( ! (____[__=___[_]]++)) {
if (validIP4(__)) {
print (__) } } } }' \
\
| gsort -t'.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n \
| gcat -n \
| rs -t -c$'\n' -C= 0 4 \
| column -s= -t \
| lgp3 5
1 00.69.84.243 76 23.108.43.3 151 79.127.56.148 226 172.241.192.165
2 00.71.110.228 77 23.108.43.19 152 80.48.119.28 227 172.245.220.154
3 00.105.215.18 78 23.108.43.55 153 80.76.60.2 228 175.196.182.58
4 00.123.2.171 79 23.108.43.94 154 80.244.229.102 229 176.74.9.62
5 00.123.228.2 80 23.108.43.120 155 81.8.52.78 230 176.214.97.55
6 00.201.223.164 81 23.108.43.208 156 83.166.241.233 231 177.128.44.131
7 01.51.106.70 82 23.108.43.244 157 85.25.4.28 232 177.129.53.114
8 01.144.14.232 83 23.108.75.98 158 85.25.91.156 233 178.88.185.2
9 01.148.85.50 84 23.108.75.164 159 85.25.91.161 234 180.180.171.123
10 01.174.10.170 85 23.225.64.59 160 85.25.117.171 235 180.183.15.198
11 02.64.120.219 86 36.37.177.186 161 85.25.150.32 236 180.250.153.129
12 02.68.128.214 87 36.94.161.219 162 85.25.201.22 237 181.36.230.242
13 02.129.196.242 88 37.48.82.87 163 85.195.104.71 238 181.191.141.43
14 02.134.127.15 89 37.144.180.52 164 85.208.211.163 239 182.253.186.140
15 03.28.246.130 90 41.65.236.56 165 85.209.149.130 240 185.24.233.208
16 03.73.194.2 91 41.65.251.86 166 88.119.195.35 241 185.61.152.137
17 03.80.77.1 92 41.79.65.241 167 91.107.15.221 242 185.74.7.51
18 03.81.77.194 93 41.161.92.138 168 91.188.246.246 243 185.93.205.236
19 03.97.200.52 94 41.164.68.42 169 93.184.8.74 244 185.138.114.113
20 3.120.173.144 95 41.164.68.194 170 94.16.15.100 245 186.3.85.131
21 03.134.97.233 96 41.205.24.155 171 94.75.76.3 246 186.5.117.82
22 03.148.72.192 97 43.255.113.232 172 94.228.204.229 247 186.46.168.42
23 03.150.113.147 98 45.5.68.18 173 95.181.150.121 248 186.96.50.39
24 03.159.46.18 99 45.5.68.25 174 95.181.151.105 249 186.154.211.106
25 03.162.181.132 100 45.43.63.230 175 110.74.200.177 250 186.167.48.138
26 03.177.45.7 101 45.67.212.99 176 112.163.123.242 251 186.202.176.153
27 03.177.45.10 102 45.67.230.13 177 113.161.59.136 252 186.233.186.60
28 03.177.45.11 103 45.71.203.110 178 115.87.196.88 253 186.251.71.193
29 03.217.169.100 104 45.87.249.80 179 116.212.155.229 254 187.217.54.84
30 03.232.215.194 105 45.122.233.76 180 117.54.114.101 255 188.94.225.177
31 04.208.138.14 106 45.131.213.170 181 117.54.114.102 256 188.95.89.81
32 04.244.75.205 107 45.158.158.29 182 117.54.114.103 257 188.133.153.143
33 5.39.189.39 108 45.179.193.70 183 119.82.241.21 258 188.138.89.50
34 05.149.219.201 109 45.183.142.126 184 120.72.20.225 259 188.138.90.226
35 5.149.219.201 110 45.184.103.68 185 121.1.41.162 260 188.166.218.243
36 5.189.229.42 111 45.184.155.7 186 123.31.30.100 261 190.128.225.115
37 07.151.182.247 112 45.189.113.63 187 125.25.33.241 262 190.217.7.73
38 07.154.221.245 113 45.189.117.237 188 125.25.206.28 263 190.217.19.243
39 07.244.242.103 114 45.192.141.247 189 133.242.146.103 264 192.3.219.94
40 08.177.248.47 115 45.229.32.190 190 137.74.93.21 265 192.99.38.64
41 08.177.248.213 116 45.250.65.15 191 137.184.57.245 266 192.140.42.83
42 08.177.248.217 117 46.99.146.232 192 139.5.151.182 267 192.155.107.59
43 8.210.83.33 118 46.243.220.70 193 139.59.233.24 268 192.254.104.201
44 8.213.128.19 119 46.246.80.6 194 139.255.58.212 269 194.5.193.183
45 8.213.128.30 120 47.74.114.83 195 140.238.19.26 270 194.114.128.149
46 8.213.128.41 121 47.88.79.154 196 151.106.13.221 271 194.233.67.98
47 8.213.128.106 122 47.91.44.217 197 151.106.18.126 272 194.233.69.41
48 8.213.128.123 123 47.243.75.115 198 152.26.229.67 273 194.233.73.103
49 8.213.128.131 124 47.254.28.2 199 152.32.143.109 274 194.233.73.104
50 8.213.128.149 125 49.156.47.162 200 153.122.106.94 275 194.233.73.105
51 8.213.128.152 126 50.195.227.153 201 153.122.107.129 276 194.233.73.107
52 8.213.128.158 127 50.235.149.74 202 154.85.35.235 277 194.233.73.109
53 8.213.128.171 128 50.250.56.129 203 154.95.36.182 278 194.233.88.38
54 8.213.128.172 129 51.68.199.120 204 154.236.162.59 279 195.80.49.3
55 8.213.128.202 130 51.77.141.29 205 154.236.168.179 280 195.80.49.4
56 8.213.128.214 131 51.81.32.81 206 154.236.177.101 281 195.80.49.5
57 8.213.129.23 132 51.159.3.223 207 154.236.179.226 282 195.80.49.6
58 8.213.129.36 133 51.178.182.23 208 157.100.26.69 283 195.80.49.7
59 8.213.129.51 134 54.80.246.241 209 159.65.69.186 284 195.80.49.253
60 8.213.129.57 135 61.9.48.169 210 159.65.133.175 285 195.80.49.254
61 8.213.129.243 136 61.9.53.157 211 159.203.13.121 286 195.158.30.232
62 8.214.41.50 137 62.75.219.49 212 160.16.242.164 287 197.149.247.82
63 8.218.213.95 138 62.75.229.77 213 161.22.34.142 288 197.243.20.178
64 09.200.156.102 139 62.78.84.159 214 164.132.137.241 289 198.46.200.70
65 13.237.147.45 140 62.138.8.42 215 167.71.207.46 290 198.229.231.13
66 20.47.108.204 141 62.204.35.69 216 167.86.81.208 291 212.112.113.178
67 20.113.24.12 142 63.161.104.189 217 167.249.180.42 292 212.154.234.46
68 23.19.7.136 143 66.29.154.103 218 168.205.100.36 293 212.174.44.87
69 23.19.10.93 144 66.29.154.105 219 169.57.1.85 294 213.32.75.44
70 23.81.127.253 145 69.163.252.140 220 170.81.35.26 295 213.230.69.193
71 23.105.78.193 146 76.118.227.8 221 170.83.60.19 296 213.230.71.230
72 23.105.78.252 147 77.83.86.65 222 170.155.5.235 297 213.230.90.106
73 23.105.86.52 148 77.83.87.217 223 171.233.151.214 298 221.159.192.122
74 23.108.42.228 149 77.104.97.3 224 172.241.156.1 299 222.158.197.138
75 23.108.42.238 150 77.236.243.125 225 172.241.192.104 300 222.252.23.5
cat ip_address.txt | grep '^[0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[,].*$\|^.*[,][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[,].*$\|^.*[,][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}[.][0-9]\{1,3\}$'
Let's assume the file is comma-delimited and the IP address can be at the beginning, the end, or somewhere in the middle of a line.
The first regexp looks for an exact match of an IP address at the beginning of the line.
The second regexp, after the 'or', looks for an IP address in the middle. We match it in such a way that the number following the comma must be exactly 1 to 3 digits, so false IPs like 12345.12.34.1 are excluded.
The third regexp looks for the IP address at the end of the line.
I wanted to get only IP addresses that began with "10", from any file in a directory:
grep -o -nr "[10]\{2\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" /var/www
If you are not given a specific file and you need to extract IP addresses, then we need to do it recursively.
The grep command searches a text or file for a given pattern and displays the matched string.
grep -roE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
-r We can search the entire directory tree i.e. the current directory and all levels of sub-directories. It denotes recursive searching.
-o Print only the matching string
-E Use extended regular expression
If we had not used the second grep after the pipe, we would have got the IP address along with the path of the file it was found in.
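For illustration, with a hypothetical file and address (the second grep strips the "path:" prefix that -r adds):
$ grep -roE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' .
./conf/app.cfg:10.0.0.5
$ grep -roE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' . | grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
10.0.0.5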
For CentOS 6.3:
ifconfig eth0 | grep 'inet addr' | awk '{print $2}' | awk 'BEGIN {FS=":"} {print $2}'

How to do cumulative and consecutive sums for every column in a tab file (UNIX environment)

I have a tabulated file, something like this:
Q8VYA50 210 69 2 8 3
Q8VYA50 208 69 1 2 8 3
Q9C8G30 316 182 4 4 7
P335430 657 98 1 10 7
What I would like to do is apply a cumulative sum from the 4th column up to NF, printing in every column the result of the sum up to that column and keeping the original values of the previous columns, if any. So the desired output would be:
Q8VYA50 210 69 2 10 13
Q8VYA50 208 69 1 3 11 14
Q9C8G30 316 182 4 8 15
P335430 657 98 1 11 18
I have tried to do it in different ways by means of a sum function inside an awk script, including a for loop specifying the fields where the cumulative sum must be applied. However, the result obtained is wrong.
Is there some way to do it correctly in Unix (Bash)? Thanks in advance!
This is one way I have tried, @Inian:
gawk 'BEGIN {FS=OFS="\t"} {
for (i=4;i<=NF;i++)
{
sum[i]+=$i; print $1,$2,$3,$i
}
}' "input_file"
Another way is to do it for every column manually: $4, $5+$4, $6+$5+$4, $7+$6+$5+$4 and so on, but I think that is a "seedy" method.
The following awk may help you here.
awk '{for(i=5;i<=NF;i++){$i+=$(i-1)}} 1' OFS="\t" Input_file
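For readability, the same one-liner expanded with comments (a sketch; the behaviour is unchanged):
awk '{
    for (i = 5; i <= NF; i++)   # from the 5th column onward...
        $i += $(i - 1)          # ...add the previous (already-summed) column
    print                       # the bare 1 in the one-liner is shorthand for this print
}' OFS="\t" Input_file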

Labelling of text files in bash loop

This script should feed values to the program through the command line and store the terminal output in corresponding files. Since I cannot put a decimal-point number as part of the file's label, I intend to multiply each kappa by 10000 to turn it into an integer and use that as the label, but I did it wrong in the code and I don't know how to do it properly. How does it work? Thank you!
#!/bin/bash
for kappa in $(seq 0.0001 0.000495 0.01);
do
kappa_10000 = $kappa * 10000;
for seed in {1..50};
do
./two_defects $seed $kappa > "equilibration_step_seed${seed}_kappa${kappa_10000}.txt";
done
done
Bash does not do floating-point calculation, as pointed out by @Inian. A program like bc must be called, and its output can be stored directly in a variable as follows:
for kappa in $(seq 0.0001 0.000495 0.01)
do
kappa_10000=$(echo "$kappa*10000/1" | bc)
echo "$kappa_10000"
done
The output in the terminal would be
1
5
10
15
20
25
30
35
40
45
50
55
60
65
70
75
80
85
90
95
100
The /1 division must be added so that bc truncates the result to an integer (bc's default scale is 0).
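Folding that fix back into the original script might look like this (a sketch; ./two_defects and the file-name pattern are taken from the question):
#!/bin/bash
for kappa in $(seq 0.0001 0.000495 0.01); do
    kappa_10000=$(echo "$kappa*10000/1" | bc)   # e.g. 0.0001 -> 1 (bc truncates with default scale 0)
    for seed in {1..50}; do
        ./two_defects "$seed" "$kappa" > "equilibration_step_seed${seed}_kappa${kappa_10000}.txt"
    done
done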

Removing unwanted characters and empty lines with SED, TR and/or awk

I need to remove some unknown characters and remaining empty lines from a file. It should be simple, and I'm feeling really stupid that I haven't managed it yet.
Here's the file contents (readable):
136;2014-09-07 13:41:25;2014-09-07 13:41:55
136;2014-09-07 13:41:55;2014-09-07 13:42:25
136;2014-09-07 13:42:25;2014-09-07 13:42:55
(empty line)
(empty line)
For some reason, this file comes with several unwanted/unknown chars. The HEX is:
fffe 3100 3300 3600 3b00 3200 3000 3100 3400 2d00 3000 3900 :..1.3.6.;.2.0.1.4.-.0.9.
2d00 3000 3700 2000 3100 3300 3a00 3400 3100 3a00 3200 3500 :-.0.7. .1.3.:.4.1.:.2.5.
3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000 :;.2.0.1.4.-.0.9.-.0.7. .
3100 3300 3a00 3400 3100 3a00 3500 3500 0d00 0a00 3100 3300 :1.3.:.4.1.:.5.5.....1.3.
3600 3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700 :6.;.2.0.1.4.-.0.9.-.0.7.
2000 3100 3300 3a00 3400 3100 3a00 3500 3500 3b00 3200 3000 : .1.3.:.4.1.:.5.5.;.2.0.
3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300 3a00 :1.4.-.0.9.-.0.7. .1.3.:.
3400 3200 3a00 3200 3500 0d00 0a00 3100 3300 3600 3b00 3200 :4.2.:.2.5.....1.3.6.;.2.
3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300 :0.1.4.-.0.9.-.0.7. .1.3.
3a00 3400 3200 3a00 3200 3500 3b00 3200 3000 3100 3400 2d00 ::.4.2.:.2.5.;.2.0.1.4.-.
3000 3900 2d00 3000 3700 2000 3100 3300 3a00 3400 3200 3a00 :0.9.-.0.7. .1.3.:.4.2.:.
3500 3500 0d00 0a00 0000 0d00 0a00 :5.5...........
So, as you can see, the first 2 bytes are xFF and xFE and there is an x00 after each character. The line endings are 0D00 + 0A00: carriage return and linefeed (\r\n), each with the extra x00.
I wanted to remove those x00 and the first 2 bytes xFFxFE and the last 4, and convert the CRLF to LF.
I could do that by using head, tail and tr:
tr -d '\15\00' < 2014.log | tail -c +3 | head -c -2 > 3.log
The problem is, I'm not sure if the file will always arrive like this, so I need to build a more generic method. I ended up with:
sed 's/\xFF\xFE//g; s/\x00//g; s/\x0D//g' 2014.log > 2.log
or
tr -d '\377\376\00\15' < 2014.log > 2.log
Now I need to remove the last two empty lines, which as I said in the beginning, should be easy, but I can't accomplish that.
I've tried:
sed '/^\s*$/d'
sed '/^$/d'
awk 'NF > 0'
egrep -v "^$"
Other stuff
But in the end it removes only one of the blank lines; I still have one x0A at the end. I tried to replace the pair x0Ax0A with sed, even using \n\n, but it didn't work.
I can't remove all \n because I need the normal lines; I just want to remove them when they appear at least twice in sequence. Again, I could use tail or head to remove it, but I would be assuming that all files arrive that way, and that's not true.
I see it as a simple find and replace stuff, but it seems it doesn't work that way when we are working with linefeeds.
For information purposes:
file -i 2014-09-07-13-46-51.log
2014-09-07-13-46-51.log: application/octet-stream; charset=binary
It's not recognized as a text file... this file is extracted from a Flash shared object (.sol).
As new files may not be like this and may arrive as normal text files, I can't simply cut the files; I need to treat only those that are problematic.
The "fffe" at the beginning of the file is a byte order mark (http://en.wikipedia.org/wiki/Byte_order_mark) and for me an indication that you have a unicode type file. In that kind of file 'normal' ascii characters are represented by 2 bytes.
In another stackoverflow question/aswer the file is first converted to UTF-8... (grepping binary files and UTF16)
I finally made it, but I really didn't like the solution. I replaced all linefeeds with another character, like a pipe (|), then removed them when I found two in sequence (||), and then converted the pipes (|) back to \n:
sed 's/\xFF\xFE//g; s/\x00//g; s/\x0D//g' 2014.log | tr '\n' '|' | sed 's/||//g;' | sed 's/|/\x0A/g' > 5.log
-- @Luciano
Wow I solved the problem by that time but forgot to answer, so here it is!
Using only tr command I could accomplish that like this:
tr -d '\377\376\015\000\277\003' < logs.csv | tr -s '\n'
tr removed all the unwanted characters and empty lines, and it was really, really fast, much faster than the options using sed and awk
If you just want the ASCII characters out of the file you might try iconv
You probably can identify the file's encoding with file -i
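For example, one possible sketch (assuming the file really is UTF-16 with a BOM, as the hex dump suggests; clean.log is just an illustrative name):
iconv -f UTF-16 -t UTF-8 2014.log | tr -d '\r\000' | sed '/^$/d' > clean.log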
I know you asked for sed, tr or awk, but on the off chance it will change your mind, this is how easy it is to get Perl to do the heavy lifting:
perl -e 'open my $fh, "<:encoding(utf16)", $ARGV[0] or die "Error reading $ARGV[0]: $!"; while (<$fh>) { s{\x0d\x0a}{\n}g; s{\x00\n}{}g; print $_; }' input_filename

Search for a value in a file and remove subsequent lines

I'm developing a shell script but I am stuck with the below part.
I have the file sample.txt:
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
6 100 200
7 100 200
I want to search the S.No column in sample.txt. For example, if I'm searching for the value 5, I need only the rows up to 5; I don't want the rows where the value in S.No is larger than 5.
The output, output.txt, must look like:
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
Print the first line and any other line where the first field is less than or equal to 5:
$ awk 'NR==1||$1<=5' file
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
Using perl:
perl -ane 'print if $F[0] <= 5' file
And the sed solution
n=5
sed "/^$n[[:space:]]/q" filename
The sed q command exits after printing the current line
The suggested awk relies on column 1 being numerically sorted. A generic awk that fulfills the question title would be:
gawk -v p=5 '$1==p {print; exit} {print}'
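Applied to the sample file above, it would print the following (redirect to output.txt as before if you want a file):
$ gawk -v p=5 '$1==p {print; exit} {print}' sample.txt
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200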
However, in this situation, sed is better IMO. Use -i to modify the input file.
sed '6q' sample.txt > output.txt
