How can I get the SOA serial number from a file with sed? - bash

I store my SOA data for multiple domains in a single file that gets $INCLUDEd by zone files. I've written a small sed script that is supposed to get the serial number, increment it, then re-save the SOA file. It all works properly as long as the SOA file is in the proper format, with the entire record on one line, but it fails as soon as the record gets split into multiple lines.
For example, this works as input data:
# IN SOA dnsserver. hostmaster.example.net. ( 2013112202 21600 900 691200 86400 )
But this does not:
# IN SOA dnsserver. hostmaster.example.net. (
2013112202 ; Serial number
21600 ; Refresh every day, 86400 is 1 day
900 ; Retry refresh every 15 min
691200 ; Expire every 8 days
86400 ) ; Minimum TTL 1 day
I like comments, and I would like to spread things out. But I need my script to be able to find the serial number so that I can increment it and rewrite the file.
The SED that works on the single line is this:
SOA=$(sed 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' $SOAfile)
But for multi-line ... I'm a bit lost. I know I can join lines with N, but how do I know if I even need to? Do I need to write separate sed scripts based on some other analysis I do of the original file?
Please help! :-)

I wouldn't use sed for this. While you might be able to brute-force something, it would require a large amount of concentration to come up with it, and it would look like line noise, and so be almost unmaintainable afterwards.
What about this in awk?
The easiest way might be to split your records based on the # character, like so:
SOA=$(awk 'BEGIN{RS="#"} NR==2{print $6}' $SOAfile)
But that will break if you have comments containing # before the uncommented line, or if you have any comments between the # and the serial number. You could make a pipe to avoid these issues...
SOA=$(sed 's/;.*//;/^#/p;1,/^#/d' $SOAfile | awk 'BEGIN{RS="#"} NR==2{print $6}')
It may seem redundant to remove comments and strip the top of the file, but there could be other lines like #include which (however unlikely) could contain your record separator.
Or you could do something like this in pure awk:
SOA=$(awk -v field=6 '/^#/ { if($2=="IN"){field++} for(i=1;i<field;i++){if(i==NF){field=field-NF;getline;i=1}} print $field}' $SOAfile)
Or, broken out for easier reading:
awk -v field=6 '
/^#/ {
    if ($2=="IN") { field++ }
    for (i=1; i<field; i++) {
        if (i==NF) { field=field-NF; getline; i=1 }
    }
    print $field
}' $SOAfile
This is flexible enough to handle any line splitting you might have, as it keeps counting fields across multiple lines. It also adjusts the field number depending on whether your zone segment contains the optional "IN" keyword.
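For instance, run against the multi-line sample from the question (assuming it is saved as soa.txt), it should print just the serial:
$ awk -v field=6 '/^#/ { if($2=="IN"){field++} for(i=1;i<field;i++){if(i==NF){field=field-NF;getline;i=1}} print $field}' soa.txt
2013112202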
A pure-sed solution would, instead of counting fields, use the first string of digits after an open bracket after your /^#/, like this:
SOA=$(sed -n '/^#/,/^[^;]*)/H;${;x;s/.*#[^(]*([^0-9]*//;s/[^0-9].*//;p;}' $SOAfile)
Looks like line noise, right? :-) Broken out for easier reading, it looks like this:
/^#/,/^[^;]*)/H             # "Hold" the meaningful part of the file...
${                          # Once we reach the end...
    x                       # Swap the hold space into the pattern space
    s/.*#[^(]*([^0-9]*//    # Remove stuff ahead of the serial
    s/[^0-9].*//            # Remove stuff after the serial
    p                       # And print.
}
The idea here is that starting from the first line that begins with #, we copy the file into sed's hold space, then at the end of the file, do some substitutions to strip out all the text up to the serial number, and then after the serial number, and print whatever remains.
All of these work on single line and multi line zone SOA records I've tested with.
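Once you have the serial in $SOA, the increment-and-rewrite step the question asks about could look like this (a minimal sketch, assuming the serial string appears exactly once in the file and that your sed supports -i):
NEWSOA=$((SOA + 1))
sed -i.bak "s/$SOA/$NEWSOA/" "$SOAfile"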

You can try the following - it's your original sed program preceded by commands to first read all input lines, if applicable:
SOA=$(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' \
"$SOAfile")
This form will work with both single- and multi-line input files.
Multi-line input files are first read as a whole before applying the substitutions.
Note: The awkward separate -e options are needed to keep FreeBSD happy with respect to labels and branching commands, which need a literal \n for termination - using separate -e options is a more readable alternative to splicing in literal newlines with $'\n'.
Alternative solution, using awk:
SOA=$(awk -v RS='#' '$1 == "IN" && $2 == "SOA" { print $6 }' "$SOAfile")
Again, this will work with both single- and multi-line record definitions.
The only constraint is that comments must not precede the serial number.
Additionally, if a file contained multiple records, the above would collect ALL serial numbers, separated by a newline each.
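If in that case you only want the first record's serial, a small tweak stops awk after the first match (same constraints as above):
SOA=$(awk -v RS='#' '$1 == "IN" && $2 == "SOA" { print $6; exit }' "$SOAfile")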

Why sed? grep is simplest in this case:
grep -A1 -e '#.*SOA' "$SOAfile" | grep -oe '[0-9]*'
or (maybe better):
grep -A1 -e '#.*SOA' "$SOAfile" | grep 'Serial number' | grep -oe '[0-9]*'

This might work for you (GNU sed):
sed -nr '/# IN SOA/{/[0-9]/!N;s/[^0-9]+([0-9]+).*/\1/p}' file
For lines that contain # IN SOA: if the line contains no digits, append the next line. Then extract the first run of digits from the line(s).

Related

How to detect some pattern with grep -f on a file in terminal, and extract those lines without the pattern

I'm using the Mac terminal.
I have a txt file with one column of 9 IDs, allofthem.txt, where every ID starts with "rs":
rs382216
rs11168036
rs9296559
rs9349407
rs10948363
rs9271192
rs11771145
rs11767557
rs11
Also, I have another txt file, useful.txt, with those IDs that were useful in an analysis I did. It looks the same, one column with several rows of IDs, but with fewer IDs, only 5.
rs9349407
rs10948363
rs9271192
rs11
Problem: I want to generate a new txt file with the non-useful ones (the ones that appear in allofthem.txt but not in useful.txt).
I want to do the inverse of:
grep -f useful.txt allofthem.txt
I want to use some systematic way of deleting all the IDs in useful.txt and obtaining a file with the remaining ones. Maybe with awk or sed, but I can't see it. Can you help me, please? Thanks in advance!
Desired output:
rs382216
rs11168036
rs9296559
rs11771145
rs11767557
The -v option does the inverse for you:
grep -vxf useful.txt allofthem.txt > remaining.txt
The -x option matches the whole line in allofthem.txt, not parts of it.
As @hek2mgl rightly pointed out, you need -F if you want to treat the content of useful.txt as fixed strings and not patterns:
grep -vxFf useful.txt allofthem.txt > remaining.txt
Make sure your files have no leading or trailing white spaces - they could affect the results.
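If you need to normalise them first, a quick in-place cleanup could look like this (a sketch, assuming GNU sed for -i):
sed -i 's/^[[:space:]]*//;s/[[:space:]]*$//' useful.txt allofthem.txt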
I recommend using awk:
awk 'FNR==NR{patterns[$0];next} !($0 in patterns)' useful.txt allofthem.txt
Explanation:
FNR==NR is true as long as we are reading useful.txt. We create an index in patterns for every line of useful.txt. next stops further processing.
!($0 in patterns) runs, because of the previous next statement, only on the lines of allofthem.txt. It checks for every line of that file whether it is a key in patterns. If the line is not a key, awk prints it.
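For example, to produce the requested file directly:
awk 'FNR==NR{patterns[$0];next} !($0 in patterns)' useful.txt allofthem.txt > remaining.txt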

Use sed to count periods, commas, and numbers?

I have a file that looks like this:
19.217.179.33,175.176.12.8
253.149.205.57,174.210.221.195
222.118.178.218,255.99.100.202
241.55.199.243,167.98.204.104
38.224.198.117,21.11.184.68
Each line is 2 IP addresses, separated by a comma. So, each line should meet these requirements:
Has 1 comma.
Has 6 periods.
Has ONLY numbers, commas, and periods.
If a line is missing a period, has more or fewer than one comma, contains a letter, is blank, or anything like that, it isn't correct. Basically I just want to use sed or something similar to loop through each line in the file and make sure each of them meets the above requirements.
Is this something that can be done with sed? I know you can use it to delete lines that do or don't contain matching strings, but I wasn't sure about counting specific characters or verifying that a line only has certain characters.
Any help would be greatly appreciated. Thanks!
I think grep is a better tool for this. You just want to ensure that each line matches a particular regex, so invert the grep with -v and label the input invalid if any line gets output. Something like:
grep -qvE '^([0-9]{1,3}\.){3}[0-9]{1,3},([0-9]{1,3}\.){3}[0-9]{1,3}$' input || echo input is valid
You can simplify that a bit:
IP='([0-9]{1,3}\.){3}[0-9]{1,3}'
grep -qvE "^$IP,$IP$" input || echo input is valid
Or if you are more interested in invalid data:
grep -qvE "^$IP,$IP$" input && echo input is invalid
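If you would rather see the offending lines themselves, drop -q and add -n to show them with their line numbers:
grep -nvE "^$IP,$IP$" input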
What I'd do is think up a regular expression that matches the 'proper' lines and omit those from the output. Like this:
sed -r '/^([0-9]{1,3}\.){3}[0-9]{1,3},([0-9]{1,3}\.){3}[0-9]{1,3}$/d' file
Everything that remains is a wrong line.
Here's the recipe in more detail:
[0-9]{1,3} between one and three digits
\. a literal period (a bare period is a wildcard and matches any character)
(...){3} three repetitions of something, so together
([0-9]{1,3}\.){3}[0-9]{1,3} makes up something that looks like an IP address. (Though note that it doesn't enforce the <256 rule, so 999.999.999.999 matches.)
/^ ... $/ the match needs to start at the beginning of the line and run until its end.
'/ ... /d' print everything except lines that match what's inside the two slashes
-r is needed to recognise the {1,3} syntax.
This will find and print the lines that are wrong. If you want to delete the wrong lines, you can easily invert this:
sed -i.bak -n -r '/^([0-9]{1,3}\.){3}[0-9]{1,3},([0-9]{1,3}\.){3}[0-9]{1,3}$/p' file
-i.bak means keep a backup, but overwrite the input file
-n means don't output anything unless expressly directed to output, and
/ ... /p output all the lines that match this regex.
If you would like to display only information about the file contents' correctness, you can use this command:
sed -n -r '/^([0-9]{1,3}\.){3}[0-9]{1,3},([0-9]{1,3}\.){3}[0-9]{1,3}$/!{a \
FILE IS INCORRECT
;q;};$aFILE IS OK' file
It's a modified version of @chw21's answer, but displays only the information text:
FILE IS INCORRECT, or
FILE IS OK.

replace strings in file from a reference list

There are a few threads that seem to be asking the same question as I'm interested in here, but some of the answers seem to be tricky to generalise (or I'm not smart enough). e.g.
how to replace strings in file based on values from another file? (example inside)
Replacing strings in file, using patterns from another file
I have some complicated files that look like this:
((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);
(These are Newick format phylogenetic trees in case anyone is interested)
I need to change all the ID keys (the bits that look like XXX_YYYYY) in this file and am not sure what the best approach would be.
They need to be replaced by the 'group' (operon) they belong to, and so I was thinking that making an index file of sorts would be the way to go, so for example, PLT_01696 gets replaced with group_1 say:
Keyfile:
PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
....
PAU_02074 group_2
So I think the best way to do this is to pass a file to sed or some equivalent, have it look for the entry in column one, and replace it with whatever I've paired it with in column two? This file will have about 350 individual keys in the end, which will end up sorted into around 12 groups.
And the file would end up looking like:
((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,group_1:0.08160284537473952438)98:0.04771898687201567291,.....
I'm open to alternative suggestions, this just seemed most apparent to me. This is on Ubuntu 14.04 so any solution is fair game really, but I'm much more au fait with bash (and a bit of perl).
One solution in such cases is to write a sed script that writes the sed script you want to execute. It appears that operons are preceded by either ( or , and are always followed by :. So, given your file containing mappings such as:
PLT_01736 group_1
then for each line in that file you want to create a sed operation that looks like:
s/\([,(]\)PLT_01736:/\1group_1:/g
where the g might not be necessary (I don't know if a given operon can appear more than once in a single line). The initial character class captures the ( or , and the \( and \) remember that, and it's followed by the specific ID key, and the colon; the replace operation outputs the remembered character, the replacement text and the colon. The advantage of tracking the preceding and following characters is that if by some mischance you have operons PLT_00100 and PLT_001001 (where one operon is a prefix of the other), tracking the surrounding characters ensures the correct match. Otherwise, you have to ensure that the longest matches appear first in the script, which is fiddlier (sort -r probably sorts that out, but …).
Hence, assuming the mappings are in a file mapping.data, you can use:
sed 's%\([A-Z]*_[0-9]*\) *\(.*\)%s/\\([,(]\\)\1:/\\1\2:/g%' mapping.data > script.sed
sed -f script.sed newick.phylogenetic.tree.data > transformed.data
This uses % in the generating s%%% operation, outputting s/// (it requires some care). The search part of the s%%% looks for zero or more upper-case letters, an underscore, and zero or more digits, capturing that with the \( and \); followed by one or more spaces, followed by some other characters which are also captured. If the ID keys can have a different structure, then change the matching regex appropriately. I assume that the input data is 'clean' so there's no need to worry about only processing lines with exactly three letters, and underscore and exactly five digits, and there's no trailing blanks. With the two parts (key ID and replacement) isolated, it is just necessary to generate the output s/// command, remembering to double up the backslashes that must appear in the output.
Given your input data and list of keys, the output I get is:
((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);
I'll bite. Let's call the script phylo.awk:
NR==FNR { pattern[NR] = $1; replacement[NR] = $2; count++; next }
{
    for (i = 1; i <= count; i++) {
        sub(pattern[i], replacement[i])
    }
    print $0
}
Then say:
awk -f phylo.awk patterns data
#!/bin/bash
while read -r i; do                  # enter your loop
    a=$(echo "$i" | cut -d" " -f1)   # get what to find
    b=$(echo "$i" | cut -d" " -f2)   # get what to replace it with
    sed -i "s/$a/$b/g" input.txt     # find and replace; -i means "in place"
done <ref.txt                        # the file you're looping through
input:
((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);
ref:
PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
PAU_02074 group_2
output:
((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);

Dynamic delimiter in Unix

Input:
echo "1234ABC89,234" # A
echo "0520001DEF78,66" # B
echo "46545455KRJ21,00"
From the above strings, I need to split the characters to get the alphabetic field and the number after that.
From "1234ABC89,234", the output should be:
ABC
89,234
From "0520001DEF78,66", the output should be:
DEF
78,66
I have many strings that I need to split like this.
Here is my script so far:
echo "1234ABC89,234" | cut -d',' -f1
but it gives me 1234ABC89 which isn't what I want.
Assuming that you want to discard leading digits only, and that the letters will be all upper case, the following should work:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\n\2/'
This works fine with GNU sed (I have 4.2.2), but other sed implementations might not like the \n, in which case you'll need to substitute something else.
Depending on the version of sed you can try:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1\n\2/'
or:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1$\2/' | tr '$' '\n'
DEF
78,66
Explanation: the regular expression rewrites the input into the expected output, except that instead of a newline it inserts a "$" sign, which we then replace with a newline using the tr command.
Where do the strings come from? Are they read from a file (or other source external to the script), or are they stored in the script? If they're in the script, you should simply reformat the data so it is easier to manage. Therefore, it is sensible to assume they come from an external data source such as a file or being piped to the script.
You could simply feed the data through sed:
sed 's/^[0-9]*\([A-Z]*\)/\1 /' |
while read alpha number
do
…process the two fields…
done
The only trick to watch there is that if you set variables in the loop, they won't necessarily be visible to the script after the done. There are ways around that problem — some of which depend on which shell you use. This much is the same in any derivative of the Bourne shell.
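One common workaround in bash is process substitution, which keeps the loop in the current shell so the variables survive (a sketch, assuming the data lives in input.txt):
while read -r alpha number
do
    last_alpha=$alpha    # set inside the loop...
done < <(sed 's/^[0-9]*\([A-Z]*\)/\1 /' input.txt)
echo "$last_alpha"       # ...and still visible here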
You said you have many strings like this, so I recommend, if possible, saving them to a file such as input.txt:
1234ABC89,234
0520001DEF78,66
46545455KRJ21,00
On your command line, try this sed command reading input.txt as file argument:
$ sed -E 's/([[:digit:]]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt
ABC 89,234
DEF 78,66
KRJ 21,00
How it works
uses -E for extended regular expressions to save on typing; otherwise, for grouping, we would have to escape the parentheses as \( and \)
uses grouping with ( and ); the pattern searches for three groups:
first, digits; + specifies one or more of them. Oddly, using [0-9] produced an extra blank space in the results, so the POSIX class [[:digit:]] is used instead
the next group searches for POSIX alphabetical characters, regardless of whether lowercase or uppercase, and {3} specifies exactly 3 of them
the last group searches for ., meaning any character, with + for one or more times
\2\t\3 then returns group 2 and group 3, with a tab separator
Thus you are able to extract two separate fields per line, just separated by tab, for easier manipulation later.
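For instance, the tab-separated output pipes cleanly into a shell loop (a sketch, assuming bash for IFS=$'\t'):
sed -E 's/([[:digit:]]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt |
while IFS=$'\t' read -r letters number
do
    echo "letters=$letters number=$number"
done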

Given the number of tokens, how to find corresponding closest number of lines within a file?

Given a huge file which contains text (one sentence per line), the task is to extract N tokens (for example, 100 million tokens out of 3 billion). Since I cannot break a sentence into parts, I need to find the closest number of lines that contains the given number of tokens.
I tried following code:
perl -p -e 's/\n/ #/g' huge_file | cut -d' ' -f1-100000000 | grep -o ' #' | wc -w
which replaces each newline with the symbol ' #' (basically joining the sentences into a single line) and counts the occurrences of ' #', which should correspond to the number of sentences (huge_file doesn't contain the '#' symbol). However, grep cannot process such a long line and gives a 'grep: memory exhausted' error. Is there any other efficient way to accomplish the task, one that would also work for very large files?
I had a bit of a hard time understanding what you're asking. But I think you're tackling it very badly. Running perl as a super-sed, then cut, then grep then wc is horribly inefficient.
If I understand correctly, you want as many lines as it takes to get at least 100M words.
Why not instead:
#!/usr/bin/env perl
use strict;
use warnings;

my $wordcount = 0;

# use the 'magic' filehandle - reads piped input or a file named on the
# command line ('myscript.pl somefilename') - just like sed/grep
while (<>) {
    # split on whitespace and count the fields - i.e. the words on this line
    $wordcount += scalar split;

    # chomp; here if you don't want the line feed

    # print the current line
    print;

    # bail out once the word count reaches the threshold
    last if $wordcount >= 100_000_000;

    # NB: $. is the line number, if you wanted to stop after a certain
    # number of lines instead
}

# the content has already been printed with line feeds intact;
# this prints the precise count of the words printed
print $wordcount, " words printed\n";
This will iterate your file, and as soon as you've seen 100M words, it'll bail out - meaning you no longer have to read the whole file, nor do you need to invoke a daisy chain of commands.
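Usage would be something like this (a sketch; firstwords.pl is a hypothetical name, and note that the final "words printed" line goes to standard output too, so send it to STDERR in the script if that matters):
perl firstwords.pl huge_file > kept_lines.txt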
This would work as a one-liner if you're really insistent (though note that with -p, last skips the implicit print of the line that crosses the threshold):
perl -p -e '$wordcount += scalar split; last if $wordcount > 100_000_000;'
Again - I couldn't quite tell what the significance of line feeds and # symbols were, so I haven't done anything with them. But s/\n/ #/ works fine in the above code block, as does chomp; to remove trailing linefeeds if that's what you're after.
