How to specify amount of characters per line on a fasta file - bioinformatics

I have a fasta file that looks like this:
>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTATGAATCCAGTA
TGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGCCGCCTCGTGTTTCACGGCCT
CAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTCACTTCTACTCTTGGCAGCAGTGGCCAGCTC
ATATGCCGCTGCACAAAGGAAACTGCTGACACCGGTGACAGTGCTTACTGCGGTTGTCACTTGTGAGTAC
However, I need the file to have 60 characters per line. It should look like this:
>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
GAATCCAGTATGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGC
CGCCTCGTGTTTCACGGCCTCAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTC
ACTTCTACTCTTGGCAGCAGTGGCCAGCTCATATGCCGCTGCACAAAGGAAACTGCTGAC
I tried to use fold -w 60 myfile.fasta > out.fa to change my file, but the output is not what I expected: fold wraps each existing 70-character line on its own instead of re-wrapping the concatenated sequence. The output file looks like this:
>abc
AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTAT
GAATCCAGTA
TGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGCCGCCTCGTGT
TTCACGGCCT
CAGTTAGTACCGTTGTGACCGCCACCGGCTTGGCCCTCTCACTTCTACTCTTGGCAGCAG
TGGCCAGCTC
ATATGCCGCTGCACAAAGGAAACTGCTGACACCGGTGACAGTGCTTACTGCGGTTGTCAC
TTGTGAGTAC
ACACGCACCATTTACAATGCATGATGTTCGTGAGATTGATCTGTCTCTAACAGTTCACTT
Is there another way I can manipulate my fasta file to get it to the format I need?

Do not reinvent the wheel. Use common bioinformatics tools, preferably open-source ones. For example, you can use the seqtk tool like so:
seqtk seq -l N infile > outfile
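Applied to the file from the question (assuming it is saved as myfile.fasta, as in the fold attempt), that would be:
seqtk seq -l 60 myfile.fasta > out.fa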
EXAMPLES:
seq1">
$ echo -e ">seq1\nACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG\nACTG" | seqtk seq -l 60
>seq1
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTG
seq1\nACTG\nACTG">
$ echo -e ">seq1\nACTG\nACTG" | seqtk seq -l 60
>seq1
ACTGACTG
To install these tools, use conda, specifically miniconda, for example:
conda create --channel bioconda --name seqtk seqtk
conda activate seqtk
# ... use seqtk here ...
conda deactivate
REFERENCES:
seqtk: https://github.com/lh3/seqtk
conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

I have listed two options below. Hope it helps.
seqkit seq input.fasta -w 60
Instructions for seqkit are here.
And it's very easy to install using conda:
https://anaconda.org/bioconda/seqkit
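For example, a minimal sketch of installing and running it with conda (assuming the bioconda channel is reachable; input.fasta and output.fasta are placeholder names):
conda install -c bioconda seqkit
seqkit seq -w 60 input.fasta > output.fasta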
A second option is to use Python's textwrap module. You could first parse the FASTA file with Biopython, then use textwrap on the sequence string.
from Bio import SeqIO
import textwrap

for seq_record in SeqIO.parse("input.fasta", "fasta"):
    dna = str(seq_record.seq)
    fasta_record = textwrap.fill(dna, width=60)
    print(">" + seq_record.id)
    print(fasta_record)

With Python (note: this assumes the file contains a single FASTA record):
with open("yourfile", "r") as f:
    text = f.read().split("\n", 1)
text[1] = text[1].replace("\n", "")
text = text[0] + "\n" + "\n".join(text[1][i:i+60] for i in range(0, len(text[1]), 60))
print(text)

Perl to the rescue!
perl -lne 'sub out { print substr $buff, 0, 60, "" while $buff }
           if (/^>/) { out(); print }
           else { $buff .= $_ }
           END { out() }
' file.fasta
-n reads the file line by line and runs the code for each line;
-l removes newlines from input and adds them to print;
We store the non-header lines in $buff;
When a header line (or end of file) arrives, we print the buffer 60 characters at a time.

Bash: decode string with url escaped hex codes [duplicate]

I'm looking for a way to turn this:
hello &lt; world
to this:
hello < world
I could use sed, but how can this be accomplished without using cryptic regex?
Try recode (archived page; GitHub mirror; Debian page):
$ echo '&lt;' | recode html..ascii
<
Install on Linux and similar Unix-y systems:
$ sudo apt-get install recode
Install on Mac OS using:
$ brew install recode
With perl:
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
With php from the command line:
cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'
An alternative is to pipe through a web browser -- such as:
echo '&#33;' | w3m -dump -T text/html
This worked great for me in cygwin, where downloading and installing distributions are difficult.
This answer was found here
Using xmlstarlet:
echo 'hello &lt; world' | xmlstarlet unesc
A python 3.2+ version:
cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:
sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'
Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a bash script that web-scrapes SE answers and compares them to local code files, described here: Ask Ubuntu -
Code Version Control between local files and Ask Ubuntu answers
Edit June 26, 2017
Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.
Here's the function:
LineOut="" # Make global
HTMLtoText () {
LineOut=$1 # Parm 1= Input line
# Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g;
# s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g;
# s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
LineOut="${LineOut//&nbsp;/ }"
LineOut="${LineOut//&amp;/&}"
LineOut="${LineOut//&lt;/<}"
LineOut="${LineOut//&gt;/>}"
LineOut="${LineOut//&quot;/'"'}"
LineOut="${LineOut//&#39;/"'"}"
LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
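A hypothetical call of the function above, just to show the convention (input as the first argument, result in the global LineOut; the sample string is only an illustration):
HTMLtoText 'Fish &amp; Chips &lt;b&gt;bold&lt;/b&gt;'
echo "$LineOut"    # prints: Fish & Chips <b>bold</b>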
On macOS, you can use the built-in command textutil (which is a handy utility in general):
echo '👋 hello &lt; world 🌐' | textutil -convert txt -format html -stdin -stdout
outputs:
👋 hello < world 🌐
Supporting the unescaping of all HTML entities with sed substitutions alone would require an impractically long list of commands, because every Unicode code point has at least two corresponding HTML escapes (a decimal and a hexadecimal numeric reference).
But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):
#!/bin/sh
htmlEscDec2Hex() {
file=$1
[ ! -r "$file" ] && file=$(mktemp) && cat >"$file"
printf -- \
"$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
$(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')
[ x"$1" != x"$file" ] && rm -f -- "$file"
}
htmlHexUnescape() {
printf -- "$(
sed 's/\\/\\\\/g;s/%/%%/g
;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}
htmlEscDec2Hex "$1" | htmlHexUnescape \
| sed -f named_entities.sed
Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility’s. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.
This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:
<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
const referenceURL = 'https://html.spec.whatwg.org/entities.json';
function writeln(element, text) {
element.appendChild( document.createTextNode(text) );
element.appendChild( document.createElement("br") );
}
(async function(container) {
const json = await (await fetch(referenceURL)).json();
container.innerHTML = "";
writeln(container, "#!/usr/bin/sed -f");
const addLast = [];
for (const name in json) {
const characters = json[name].characters
.replace("\\", "\\\\")
.replace("/", "\\/");
const command = "s/" + name + "/" + characters + "/g";
if ( name.endsWith(";") ) {
writeln(container, command);
} else {
addLast.push(command);
}
}
for (const command of addLast) { writeln(container, command); }
})( document.getElementById("sed-script") );
</script>
</body></html>
Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.
Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.
For example, if for some reason the data needs to be processed in chunks (which might be the case if printf is not a shell built-in and the data to process is large), one could use it as:
nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done
Explanation of the script follows.
There are three types of escape sequences that need to be supported:
&#D; where D is the decimal value of the escaped character’s Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character’s Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.
The &N; escapes are supported by the generated named_entities.sed script which simply performs the list of substitutions.
The central piece of this method for supporting the code point escapes is the printf utility, which is able to:
print numbers in hexadecimal format, and
print characters from their code point’s hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).
The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf’s \u/\U escapes, then uses the second feature to print the unescaped characters.
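For instance, a quick end-to-end check might look like this (a sketch; it assumes named_entities.sed has been generated as described above and that printf supports \u escapes):
$ printf '%s\n' 'section sign: &#167; and &#xA7;' | ./html_unescape.sh
section sign: § and §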
I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
But it produced an unequal number of lines on plain text files (and I don't know Perl well enough to debug it).
I like the python answer given in https://stackoverflow.com/a/42672936/1506477 --
python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
but it creates a list [ ... for l in sys.stdin ] in memory, which is prohibitive for large files.
Here is another easy, Pythonic way without buffering everything in memory, using awkg.
$ echo 'hello &lt; : &quot; world' | \
awkg -b 'from html import unescape' 'print(unescape(R0))'
hello < : " world
awkg is a Python-based awk-like line processor. You can install it with pip (https://pypi.org/project/awkg/):
pip install awkg
-b is the equivalent of awk's BEGIN{} block, which runs once at the beginning.
Here we just did from html import unescape.
Each line record is in the R0 variable, for which we did
print(unescape(R0))
Disclaimer:
I am the maintainer of awkg
I have created a sed script based on the list of entities, so it should handle most of them.
sed -f htmlentities.sed < file.html
My original answer received comments that recode does not work for UTF-8-encoded HTML files. This is correct: recode supports only HTML 4. The encoding HTML is an alias for HTML_4.0:
$ recode -l | grep -iw html
HTML-i18n 2070 RFC2070
HTML_4.0 h h4 HTML
The default encoding for HTML 4 is Latin-1. This changed in HTML 5, whose default encoding is UTF-8. This is the reason why recode does not work for HTML 5 files.
HTML 5 defines the list of entities here:
https://html.spec.whatwg.org/multipage/named-characters.html
The definition includes a machine readable specification in JSON format:
https://html.spec.whatwg.org/entities.json
The JSON file can be used to perform a simple text replacement. The following example is a self-modifying Perl script, which caches the JSON specification in its DATA chunk.
Note: for some obscure compatibility reasons, the specification allows some entities without a terminating semicolon. Because of that, the entities are sorted by length in reverse order, to make sure that the longer, correct entities are replaced first and are not clobbered by the semicolon-less variants.
#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use LWP::Simple;
use JSON::Parse qw(parse_json);
my $entities;
INIT {
if (eof DATA) {
my $data = tell DATA;
open DATA, '+<', $0;
seek DATA, $data, 0;
my $entities_json = get 'https://html.spec.whatwg.org/entities.json';
print DATA $entities_json;
truncate DATA, tell DATA;
close DATA;
$entities = parse_json ($entities_json);
} else {
local $/ = undef;
$entities = parse_json (<DATA>);
}
}
local $/ = undef;
my $html = <>;
for my $entity (sort { length $b <=> length $a } keys %$entities) {
my $characters = $entities->{$entity}->{characters};
$html =~ s/$entity/$characters/g;
}
print $html;
__DATA__
Example usage:
$ echo '😊 &amp; ٱلْعَرَبِيَّة' | ./html5-to-utf8.pl
😊 & ٱلْعَرَبِيَّة
With Xidel:
echo 'hello &lt; : &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world

Nextflow: how should I truncate a fastq file at line X? Process fails with error 141

I'm trying to create a process where a gzipped FASTQ file is truncated at 900k lines. If the file has fewer than 900k lines, the process should run but leave the file unmodified; if it has more than 900k lines, the new file should contain only the first 900k. I've tried piping the zcat of the file into sed and into head, but the process fails with error code 141, which from what I've found has to do with pipes.
My code, using head and paired end reads is as follows:
process TRUNCATE_FASTQ {
input:
set val(sample), val(single_end), path(reads) from ch_reads
output:
path foo into ch_output
script:
"""
zcat ${reads[0]} | head -900000 > truncated_file_1
gzip truncated_file_1
mv truncated_file_1.gz truncated_${reads[0]}
zcat ${reads[1]} | head -900000 > truncated_file_2
gzip truncated_file_2
mv truncated_file_2.gz truncated_${reads[1]}
"""
}
My code using sed is: (inputs and outputs are the same, only the script changes)
script:
sed -i -n '1,900000p' ${reads[0]}
sed -i -n '1,900000p' ${reads[1]}
So my question is: how should I modify the fastq file and truncate it inside the process? My aim is just to carry out the whole pipeline after this process with a smaller fastq file.
Thanks
You might have added something like the following to your nextflow.config:
process {
shell = [ '/bin/bash', '-euo', 'pipefail' ]
}
You get an error because zcat has had its destination closed early, and wasn't able to (successfully) write a decompressed file. It has no way of knowing that this was intentional, or if something bad happened (e.g. out of disk space, network error, etc).
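To see the failure mode in isolation, here is a quick sketch (it assumes test.fastq.gz is large enough that zcat is still writing when head exits):
set -o pipefail
zcat test.fastq.gz | head -n 1 > /dev/null
echo $?    # typically prints 141, i.e. 128 + SIGPIPE (13)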
If you'd rather not make changes to your shell options, you could instead use a tool that reads the entire input stream. This is what your awk solution does, which is the same as:
zcat test.fastq.gz | awk 'NR<=900000 { print $0 }'
For each input record, AWK tests the current record number (NR) to see whether it is less than or equal to 900000. If it is, the whole line is printed to stdout; if not, AWK just moves on to the next record. The problem with this solution is that it will be slow if your input FASTQ is really big, because the entire file is still read. So you might be tempted to use:
zcat test.fastq.gz | awk 'NR>900000 { exit } { print }'
But this will work much like the original head solution, and will produce an error with your current shell options. A better solution (which lets you keep your current shell options) is to instead use process substitution to avoid the pipe:
head -n 900000 < <(zcat test.fastq.gz)
Your Nextflow code might look like:
nextflow.enable.dsl=2

params.max_lines = 900000

process TRUNCATE_FASTQ {

    input:
    tuple val(sample), path(reads)

    script:
    def (fq1, fq2) = reads

    """
    head -n ${params.max_lines} < <(zcat "${fq1}") | gzip > "truncated_${fq1}"
    head -n ${params.max_lines} < <(zcat "${fq2}") | gzip > "truncated_${fq2}"
    """
}

workflow {
    TRUNCATE_FASTQ( Channel.fromFilePairs('*.{1,2}.fastq.gz') )
}
If you have Python, which has native gzip support, you could also use pysam for this. For example:
params.max_records = 225000

process TRUNCATE_FASTQ {

    container 'quay.io/biocontainers/pysam:0.16.0.1--py37hc334e0b_1'

    input:
    tuple val(sample), path(reads)

    script:
    def (fq1, fq2) = reads

    """
    #!/usr/bin/env python
    import gzip
    import pysam

    def truncate(fn, num):
        with pysam.FastxFile(fn) as ifh:
            with gzip.open(f'truncated_{fn}', 'wt') as ofh:
                for idx, entry in enumerate(ifh):
                    if idx >= num:
                        break
                    print(str(entry), file=ofh)

    truncate('${fq1}', ${params.max_records})
    truncate('${fq2}', ${params.max_records})
    """
}
Or using Biopython:
params.max_records = 225000

process TRUNCATE_FASTQ {

    container 'quay.io/biocontainers/biopython:1.78'

    input:
    tuple val(sample), path(reads)

    script:
    def (fq1, fq2) = reads

    """
    #!/usr/bin/env python
    import gzip
    from itertools import islice

    from Bio import SeqIO

    def xopen(fn, mode='r'):
        if fn.endswith('.gz'):
            return gzip.open(fn, mode)
        else:
            return open(fn, mode)

    def truncate(fn, num):
        with xopen(fn, 'rt') as ifh, xopen(f'truncated_{fn}', 'wt') as ofh:
            seqs = islice(SeqIO.parse(ifh, 'fastq'), num)
            SeqIO.write(seqs, ofh, 'fastq')

    truncate('${fq1}', ${params.max_records})
    truncate('${fq2}', ${params.max_records})
    """
}
So!
After some testing it seems that, as @tripleee said, there was an error signal from head.
I do not fully understand the reasons behind this, but this answer, found in another question, seems to have solved my problem.
So the final code is something like this:
process TRUNCATE_FASTQ {
input:
set val(sample), val(single_end), path(reads) from ch_reads
output:
path foo into ch_output
script:
zcat ${reads[0]} | awk '(NR<=900000)' > truncated_file_1
gzip truncated_file_1

Bash script csv manipulation optimization

I have a 2-million-line CSV file, and I want to replace the second column of each line with a value unique to that string; the column is filled with usernames. The long process I've got below does work, but it takes a while.
It doesn't have to be hashed, but hashing seemed like a sure way to avoid discrepancies when the next file comes along.
I'm by no means a coder, and I was wondering if there is any way I could optimize the process, although I understand the best way to do this would be in some sort of scripting language.
#!/bin/bash
#Enter Filename to Read
echo "Enter File Name"
read filename
#Extracts Usernames from file
awk -F "\"*,\"*" '{print $2}' $filename > usernames.txt
#Hashes Usernames using SHA256
cat usernames.txt | while read line; do echo -n $line|openssl sha256 |sed 's/^.* //'; done > hashedusernames.txt
#Deletes usernames out of first file
cat hash.csv | cut -d, -f2 --complement > output.txt
#Pastes hashed usernames to end of first file
paste -d , output.txt hashedusernames.txt > output2.txt
#Moves everything back into place
awk -F "\"*,\"*" '{print $1","$4","$2","$3}' output2.txt > final.csv
Example file (there are 7 columns in all, but only 3 are shown):
Time Username Size
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
You could do this in Perl easily in a few lines. The following program uses the Crypt::Digest::SHA256 module, which you need to install from CPAN or from your OS repository if they have it.
The program assumes input from the DATA section, which we typically do around here to include example data in an MCVE.
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';
while (my $line = <DATA>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print join ',', @fields;
}
__DATA__
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
It prints the following output.
2017-01-01T14:53.45,g8EPHWc3L1ln_lfRhq8elyOUgsiJm6BtTtb_GVt945s,12345
2016-01-01T13:42.56,jwXsws2dJq9h_R08zgSIPhufQHr8Au8_RmniTQbEKY4,54312
2015-01-01T12:34.34,mkrKXbM1ZiPiXSSnWYNo13CUyzMF5cdP2SxHGyO7rgQ,54123
To make it read a file that is supplied as a command line argument and write to a new file with the .new extension, you can use it like this:
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';
open my $fh_in, '<', $ARGV[0] or die $!;
open my $fh_out, '>', "$ARGV[0].new" or die $!;
while (my $line = <$fh_in>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print $fh_out join ',', @fields;
}
Run it as follows:
$ perl foo.pl example.csv
Your new file will be named example.csv.new.
Yet another Python solution, focused on speed but also on maintainability.
#!/usr/bin/python3
import argparse
import hashlib
import re

parser = argparse.ArgumentParser(description='CSV swapper')
parser.add_argument(
    '-f',
    '--file',
    dest='file_path',
    type=str,
    required=True,
    help='The CSV file path.')

def hash_user(users, user):
    try:
        return users[user]
    except KeyError:
        id_ = int(hashlib.md5(user.encode('utf-8')).hexdigest(), 16)
        users[user] = id_
        return id_

def main():
    args = parser.parse_args()
    username_extractor = re.compile(r',([\s\S]*?),')
    users = {}
    counter = 0
    templ = ',{},'
    with open(args.file_path) as file:
        with open('output.csv', 'w') as output:
            line = file.readline()
            while line:
                try:
                    counter += 1
                    if counter == 1:
                        continue
                    username = username_extractor.search(line).groups()[0]
                    hashuser = hash_user(users, username)
                    output.write(username_extractor.sub(
                        templ.format(hashuser), line)
                    )
                except StopIteration:
                    break
                except:
                    print('Malformed line at {}'.format(counter))
                finally:
                    line = file.readline()

if __name__ == '__main__':
    main()
There are still some points that could be optimized, but the central ideas are to "try" instead of checking first, and to cache user hashes so that repeated users do not have to be re-digested.
Also, will you run this on a multi-core host? This could easily be improved using threads.
This Python program might do what you want. You can pass the filenames to convert on the command line:
$ python this_program.py file1.csv file2.csv
import fileinput
import csv
import sys
import hashlib
class stdout:
    def write(self, *args):
        sys.stdout.write(*args)

input = fileinput.input(inplace=True, backup=".bak", mode='rb')
reader = csv.reader(input)
writer = csv.writer(stdout())

for row in reader:
    row[1] = hashlib.sha256(row[1]).hexdigest()
    writer.writerow(row)
Since you used awk in your original attempt, here's a simpler approach in awk
awk -F"," 'BEGIN{i=0;}
{if (unique_names[$2] == "") {
unique_names[$2]="Unique"i;
i++;
}
$2=unique_names[$2];
print $0}'
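A hypothetical invocation (assuming the awk program above is saved as anonymise.awk; input.csv and output.csv are placeholder names):
awk -F"," -f anonymise.awk input.csv > output.csv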

Return only sequence of lines with specific patterns

I'm stuck on this task. I have a list like this:
(...)
distName="PLMN-PLMN/MRBTS-4130/LNBTS-4130/FTM-1/IPNO-1"
"btsId">4130<
IpAddress">10.52.71.38</p>
(...)
And I'm producing a final file like this:
MRBTS-4130,4130,10.52.71.38
But sometimes some parts are missing and the file looks like this:
distName="PLMN-PLMN/MRBTS-4130/LNBTS-4130/FTM-1/IPNO-1"
"btsId">4130<
distName="PLMN-PLMN/MRBTS-4132/LNBTS-4132/FTM-1/IPNO-1"
"btsId">4132<
IpAddress">10.52.71.38</p>
distName="PLMN-PLMN/MRBTS-4135/LNBTS-4135/FTM-1/IPNO-1"
"btsId">4135<
distName="PLMN-PLMN/MRBTS-4138/LNBTS-4138/FTM-1/IPNO-1"
And in my final file I would like to have only lines like this:
MRBTS-4132,4132,10.52.71.38
So I would like to search only for groups of lines where:
first line has a distName
second line has btsId
third line has IpAddress
Groups with a different sequence, like:
first distName
second btsId
third distName again
should simply be rejected.
I currently have this code:
grep -E "MRBTS|btsId|IpAddress" topology.xml > temp_list
id_list=(`grep -E "btsId" temp_list | grep -o '[0-9]*'`)
ip_list=(`grep -E "IpAddress" temp_list | grep -E -o "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"`)
id_size=${#id_list[*]}
for (( e=0; e<$id_size; e++ ))
do
echo "MRBTS-${id_list[e]};${ip_list[e]}" >> id_list
done
But as you can see, it accepts some groups with missing lines, and I would like to avoid that.
With gawk:
awk -v RS='distName=' -F "[<>/]" 'NR!=1{print $2","$7","$10}' file.txt
In the end it was much easier to solve this in Python with the ElementTree library. The code is:
from xml.etree import ElementTree
import os

HOME = os.environ['HOME']

with open(HOME+'/TF/topo/topo.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.findall('.//{raml20.xsd}managedObject'):
    btsId = None
    Ip = None
    for p in node.findall('{raml20.xsd}p'):
        if p.attrib.get('name') == 'btsId':
            btsId = p.text
        elif p.attrib.get('name') == 'IpAddress':
            Ip = p.text
    if btsId and Ip:
        print "MRBTS-" + btsId + ";" + Ip

Splitting large text file on every blank line

I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
asdasd #299 yadayada 60 40
content
content
contend done
...and so on
A typical information table in my file has anywhere between 10 and 40 rows.
I would like this file to be split into n smaller files, where n is the number of content tables.
That is
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
would be its own separate file (whateverN.txt),
and
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
again a separate file whateverN+1.txt and so forth.
It seems like awk or Perl are nifty tools for this, but having never used them before, I find the syntax kind of baffling.
I found these two questions that almost correspond to my problem, but failed to modify the syntax to fit my needs:
Split text file into multiple files & How can I split a text file into multiple text files? (on Unix & Linux)
How should one modify the command-line inputs so that they solve my problem?
Setting RS to null tells awk to use one or more blank lines as the record separator. Then you can simply use NR to set the name of the file corresponding to each new record:
awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
RS:
This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.
$ cat file.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
asdasd #299 yadayada 60 40
content
content
contend done
$ awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
$ ls whatever-*.txt
whatever-1.txt whatever-2.txt whatever-3.txt
$ cat whatever-1.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
$ cat whatever-2.txt
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
$ cat whatever-3.txt
asdasd #299 yadayada 60 40
content
content
contend done
$
You could use the csplit command:
csplit \
--quiet \
--prefix=whatever \
--suffix-format=%02d.txt \
--suppress-matched \
infile.txt /^$/ {*}
POSIX csplit only uses short options and doesn't know --suffix-format and --suppress-matched, so this requires GNU csplit.
This is what the options do:
--quiet – suppress output of file sizes
--prefix=whatever – use whatever instead of the default xx filename prefix
--suffix-format=%02d.txt – append .txt to the default two digit suffix
--suppress-matched – don't include the lines matching the pattern on which the input is split
/^$/ {*} – split on pattern "empty line" (/^$/) as often as possible ({*})
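With a three-record input like the sample above, this should leave you with three pieces (a sketch of the expected result with GNU csplit):
$ ls whatever*
whatever00.txt  whatever01.txt  whatever02.txt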
Perl has a useful feature called the input record separator, $/.
This is the 'marker' for separating records when reading a file.
So:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
my $count = 0;
while ( my $chunk = <> ) {
open ( my $output, '>', "filename_".$count++ ) or die $!;
print {$output} $chunk;
close ( $output );
}
Just like that. The <> is the 'magic' filehandle: it reads piped data or the files specified on the command line (opening and reading them). This is similar to how sed or grep work.
This can be reduced to a one liner:
perl -00 -pe 'open ( $out, ">", "filename_".++$n ); select $out;' yourfilename_here
You can use this awk,
awk 'BEGIN{file="content"++i".txt"} !NF{file="content"++i".txt";next} {print > file}' yourfile
(OR)
awk 'BEGIN{i++} !NF{++i;next} {print > "filename"i".txt"}' yourfile
More readable format:
BEGIN {
file="content"++i".txt"
}
!NF {
file="content"++i".txt";
next
}
{
print > file
}
In case you get "too many open files" error as follows...
awk: whatever-18.txt makes too many open files
input record number 18, file file.txt
source line number 1
You may need to close each newly created file before creating a new one, as follows.
awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
Since it's Friday and I'm feeling a bit helpful... :)
Try this. If the file is as small as you imply, it's simplest to just read it all at once and work in memory.
use strict;
use warnings;
# slurp file
local $/ = undef;
open my $fh, '<', 'test.txt' or die $!;
my $text = <$fh>;
close $fh;
# split on double new line
my @chunks = split(/\n\n/, $text);
# make new files from chunks
my $count = 1;
for my $chunk (@chunks) {
open my $ofh, '>', "whatever$count.txt" or die $!;
print $ofh $chunk, "\n";
close $ofh;
$count++;
}
The perl docs can explain any individual commands you don't understand but at this point you should probably look into a tutorial as well.
awk -v RS="\n\n" '{for (i=1;i<=NR;i++); print > i-1}' file.txt
Sets record separator as blank line, prints each record as a separate file numbered 1, 2, 3, etc. Last file (only) ends in blank line.
Try this bash script also
#!/bin/bash
i=1
fileName="OutputFile_$i"
while read line ; do
if [ "$line" == "" ] ; then
((++i))
fileName="OutputFile_$i"
else
echo "$line" >> "$fileName"
fi
done < InputFile.txt
You can also try split -p "^$" (note that the -p pattern option is available in BSD/macOS split, not GNU split).
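A sketch of such an invocation on BSD/macOS split (the whatever prefix is just an illustration; output pieces are named whateveraa, whateverab, and so on):
split -p '^$' infile.txt whatever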
