How to remove lines with same domain - bash

I have a large txt file of 1 million lines.
Example:
http://e-planet.ru/hosting/
http://www.anelegantchaos.org/
http://site.ru/e3-den-vtoroj/
https://escrow.webmoney.ru/about.aspx
http://e-planet.ru/feedback.html
How can I clean it of lines with duplicate domains?
I need to remove one of http://e-planet.ru/hosting/ or http://e-planet.ru/feedback.html.

I didn't understand your question at first. Here is an awk one-liner:
awk -F'/' '!a[$3]++' myfile
Test input:
http://e-planet.ru/hosting/
http://www.anelegantchaos.org/
http://site.ru/e3-den-vtoroj/
https://escrow.webmoney.ru/about.aspx
http://e-planet.ru/feedback.html
https://escrow.webmoney.ru/woopwoop
httpp://whatever.com/slk
Output:
http://e-planet.ru/hosting/
http://www.anelegantchaos.org/
http://site.ru/e3-den-vtoroj/
https://escrow.webmoney.ru/about.aspx
httpp://whatever.com/slk
Here, the second occurrences of http://e-planet.ru/ and https://escrow.webmoney.ru/ are removed.
This script splits each line using / as the separator and compares the 3rd column (the domain) to see whether it has already been seen. A line is printed only on the first occurrence of its domain. Note that this only works if ALL URLs are preceded by some protocol followed by //. The double slash is important because it is what makes the 3rd column the domain.
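For example, splitting one of the sample lines on / shows the domain landing in the 3rd field:
$ echo 'http://e-planet.ru/hosting/' | awk -F'/' '{print $3}'
e-planet.ru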

use strict;
use warnings;

open my $in, '<', 'in.txt' or die $!;

my %seen;
while (<$in>) {
    chomp;
    # capture the domain: everything between "://" and the next "/"
    my ($domain) = m{https?://(.+?)/};
    next unless defined $domain;    # skip lines without a protocol://domain/ prefix
    print "$_\n" unless $seen{$domain}++;
}

Sorry, I'm not able to reply to fugu's post.
I think the problem might be that you have more than one URL on one line, so try this out:
use strict;
use warnings;

open my $in, '<', 'in.txt' or die $!;

my %seen;
while (<$in>) {
    chomp;
    # handle several URLs on one line by splitting on whitespace
    for (split /\s+/) {
        my ($domain) = m{https?://(.+?)/};
        next unless defined $domain;
        print "$_\n" unless $seen{$domain}++;
    }
}

If all you care about are the domains of those URIs, then I suggest that you filter them out first.
Then it's a simple process of sorting them and specifying you only want unique entries:
perl -lne 'print $1 if m{//(.+?)/}' file | sort | uniq > uniq_domains.txt
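If you prefer, sort -u combines the sort and uniq steps into one:
perl -lne 'print $1 if m{//(.+?)/}' file | sort -u > uniq_domains.txt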

Related

bash: transform scaffold fasta

I have a fasta file with the following sequences:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
And I want to split all the sequences in the file that contain a run of multiple N, at the site where the run occurs, and make two sequences out of each.
Expected solution:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
my (inelegant) approach would be this:
perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'
Of course, that also replaces the N in the sequence headers and creates headers without a sequence. As a bonus, it would be nice to number the new 'contigs' from 1 to x in case there are multiple sequences containing N.
What do you suggest?
I'd suggest using split instead of trying to get a regex just right, and doing it in a script instead of a brittle and crammed "one"-liner.
use warnings;
use strict;
use feature 'say';

my $file = shift @ARGV;
die "Usage: $0 filename\n" if !$file;  # also check submitted $file

my $content = do {  # or: my $content = Path::Tiny::path($file)->slurp;
    local $/;
    open my $fh, '<', $file or die "Can't open $file: $!";
    <$fh>;
};

my @f = grep { /\S/ } split /(?<!>)NN+/, $content;

say shift @f;

my $cnt;
for (@f) {
    say "\n>contig", (++$cnt), ":\n$_";
}
This slurps the file into $content since NN+ can span multiple lines; the Path::Tiny module can make that cleaner. The first element of the resulting array needs no >contig, so it is shifted away.
The negative lookbehind (?<!...) makes the regex in split's separator pattern match NN+ only when not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive N which are not right after > then you need to refine this.
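As a quick self-contained check (a made-up header and sequence, not your real data), a run of N right after > is left alone while a run inside the sequence splits it:
$ perl -E 'say for grep { /\S/ } split /(?<!>)NN+/, ">NN_header\nACGTNNNNNNTTTT"'
>NN_header
ACGT
TTTT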
I expanded your perl one-liner a bit:
cat file.fasta | \
perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
The first part removes the newlines between bases; the second part replaces each continuous run of 'N'.

Splitting very long (4GB) string with new lines

I have a file that is supposed to be JSON objects, one per line. Unfortunately, a miscommunication happened with the creation of the file, and the JSON objects only have a space between them, not a new-line.
I need to fix this by replacing every instance of } { with }\n{.
Should be easy for sed or Perl, right?
sed -e "s/}\s{/}\n{/g" file.in > file.out
perl -pe "s/}\s{/}\n{/g" file.in > file.out
But file.in is actually 4.4 GB which seems to be causing a problem for both of these solutions.
The sed command finishes with a halfway-correct file, but file.out is only 335 MB and contains only about the first 1/10th of the input file, cutting off in the middle of a line. It's almost like sed just quit in the middle of the stream. Maybe it's trying to load the entire 4.4 GB file into memory, running out of stack space at around 300 MB, and silently killing itself.
The Perl command errors with the following message:
[1] 2904 segmentation fault perl -pe "s/}\s{/}\n{/g" file.in > file.out
What else should I try?
Unlike the earlier solutions, this one handles {"x":"} {"}.
use strict;
use warnings;
use feature qw( say );

use JSON::XS qw( );

use constant READ_SIZE => 64*1024*1024;

my $j_in  = JSON::XS->new->utf8;
my $j_out = JSON::XS->new;

binmode STDIN;
binmode STDOUT, ':encoding(UTF-8)';

while (1) {
    my $rv = sysread(\*STDIN, my $block, READ_SIZE);
    die($!) if !defined($rv);
    last if !$rv;

    $j_in->incr_parse($block);
    while (my $o = $j_in->incr_parse()) {
        say $j_out->encode($o);
    }
}

die("Bad data") if $j_in->incr_text !~ /^\s*\z/;
The default input record separator in Perl is \n, but you can change it to any character you want. For this problem, you could use { (octal 173).
perl -0173 -pe 's/}\s{/}\n{/g' file.in > file.out
perl -ple 'BEGIN{$/=qq/} {/;$\=qq/}\n{/}undef$\ if eof' <input >output
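A quick sanity check of the -0173 version on a tiny, made-up input:
$ printf '{"a":1} {"b":2} {"c":3}' | perl -0173 -pe 's/}\s{/}\n{/g'
{"a":1}
{"b":2}
{"c":3}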
Assuming your input doesn't contain } { pairs in other contexts that you do not want replaced, all you need is:
awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
e.g.
$ printf '{foo} {bar}' | awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
{foo}
{bar}
The above uses GNU awk for multi-char RS and RT and will work on any size input file as it does not read the whole file into memory at one time, just each } {-separated "line" one at a time.
You may read input in blocks/chunks and process them one by one.
use strict;
use warnings;

binmode(STDIN);
binmode(STDOUT);

my $CHUNK  = 0x2000;  # 8 kiB
my $buffer = '';

while ( sysread(STDIN, $buffer, $CHUNK, length($buffer)) ) {
    $buffer =~ s/\}\s\{/}\n{/sg;
    if ( length($buffer) > $CHUNK ) {          # more than one chunk buffered
        syswrite(STDOUT, $buffer, $CHUNK);     # write FIRST of the buffered chunks
        substr($buffer, 0, $CHUNK, '');        # remove FIRST of the buffered chunks from the buffer
    }
}
syswrite(STDOUT, $buffer) if length($buffer);

Extracting the first two characters from a file in perl into another file

I'm having a little bit of trouble with my code below. I'm trying to figure out how to open up all these text files (.csv files ending in .DIS, each containing a single line), get the first two characters (these are all numbers) from them, and print them into another file of the same name with a ".number" suffix. Some of these .DIS files don't have anything in them, in which case I want to print "0".
Lastly, I would like to go through each original .DIS file and delete the first 3 characters -- I did this through bash.
my @DIS = <*.DIS>;

foreach my $file (@DIS) {
    my $name   = $file;
    my $output = "$name.number";
    open(INHANDLE, "< $file") || die("Could not open file");
    while (<INHANDLE>) {
        open(OUT_FILE, ">$output") || die;
        my $line = $_;
        chomp($line);
        my $string = $line;
        if ($string eq "") {
            print "0";
        } else {
            print substr($string, 0, 2);
        }
    }
    system("sed -i 's/\(.\{3\}\)//' $file");
}
When I run this code, I get a list of numbers that are all concatenated together, and empty .DIS.number files. I'm rather new to Perl, so any help would be appreciated!
When I run this code, I get a list of numbers that are all concatenated together, and empty .DIS.number files.
This is because of this line.
print substr($string,0,2);
print defaults to printing to STDOUT (ie. the screen). You need to give it the filehandle to print to.
print OUT_FILE substr($string,0,2);
They're being concatenated because print just prints what you tell it to; it won't put newlines in for you (there are some global variables which can change this; don't mess with them). You have to add the newline yourself.
print OUT_FILE substr($string,0,2), "\n";
As a final note, when working with files in Perl I would suggest using lexical filehandles, Path::Tiny, and autodie. They will avoid a great number of classic problems working with files in Perl.
I suggest you do it like this.
Each *.DIS file is opened and its contents read into $text. Then a regex substitution is used to remove the first three characters from the string while capturing the first two in $1.
If the substitution succeeded, the contents of $1 are written to the number file; otherwise the original file was empty (or shorter than two characters) and a zero is written instead. The remaining contents of $text are then written back to the *.DIS file.
use strict;
use warnings;
use v5.10.1;
use autodie;

for my $dis_file ( glob '*.DIS' ) {

    my $text = do {
        open my $fh, '<', $dis_file;
        <$fh>;
    };

    my $num_file = "$dis_file.number";

    open my $dis_fh, '>', $dis_file;
    open my $num_fh, '>', $num_file;

    if ( defined $text and $text =~ s/^(..).?// ) {
        print $num_fh "$1\n";
        print $dis_fh $text;
    }
    else {
        print $num_fh "0\n";
        print $dis_fh "-\n";
    }
}
This awk script extracts the first two chars of each file into its own .number file. Empty files are expected to contain one empty line, per the spec.
awk 'FNR==1{pre=substr($0,1,2);pre=length(pre)==2?pre:0; print pre > FILENAME".number"}' *.DIS
This will remove the first 3 chars
cut -c 4-
A bash for loop will work better to do both, though we'll need to modify the awk script a little bit:
for f in *.DIS;
do awk 'NR==1{pre=substr($0,1,2);$0=length(pre)==2?pre:0; print}' $f > $f.number;
cut -c 4- $f > $f.cut;
done
Explanation: loop through all files matching *.DIS; for the first line of each file, take the first two chars (1,2) of the line ($0) and assign them to pre. If the length of pre is not two (the line is empty or has only one char), set the line to 0, otherwise set it to pre; then print the line. The output file name is the input file name with a .number suffix appended. The $0 assignment is a trick to save a couple of keystrokes, since print without arguments prints $0; otherwise you could provide the argument explicitly.
Ideally you should quote "$f", since file names may contain spaces; see the sketch below.
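A minimal sketch of the same loop with the quoting applied (otherwise unchanged):
for f in *.DIS; do
    # first line: first two chars, or 0 if the line is empty/too short
    awk 'NR==1{pre=substr($0,1,2);$0=length(pre)==2?pre:0; print}' "$f" > "$f.number"
    # drop the first 3 chars of the original
    cut -c 4- "$f" > "$f.cut"
done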

Taking multiple header (rows matching condition) and convert into a column

Hello, I have a file that contains multiple headers that I need turned into column values. The file looks like this:
Day1
1,Smith,London
2,Bruce,Seattle
5,Will,Dallas
Day2
1,Mike,Frisco
4,James,LA
I would like the file to end up looking like this:
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
The file doesn't have sequential numbers before the names, and it doesn't have the same number of records after each "Day" header.
Does anyone have any ideas on how to accomplish this using the command-line?
In awk
awk -F, 'NF==1{a=$0;next}{print a","$0}' file
This checks whether the number of fields is 1; if it is, it saves that line in a variable and skips to the next input line, so the second block is not run for header lines.
For each line that has more than one field, it prints the saved variable followed by a comma and the line.
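For example, on a trimmed-down version of the sample input:
$ printf 'Day1\n1,Smith,London\nDay2\n4,James,LA\n' | awk -F, 'NF==1{a=$0;next}{print a","$0}'
Day1,1,Smith,London
Day2,4,James,LA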
And in sed
sed -n '/,/!{h};/,/{x;G;s/\n/,/;p;s/,.*//;x}' file
Broken down for MrBones' wild ride.
sed -n '
/,/!{h}; // If the line does not contain a comma overwrite buffer with line
/,/{ // If the line contains a comma, do everything inside the brackets
x; // Exchange the line for the held in buffer
G; // Append buffer to line
s/\n/,/; // Replace the newline with a comma
p; // Print the line
s/,.*//; // Remove everything after the first comma
x // exchange line for hold buffer to put title back in buffer for the next line.
}' file // The file you are using
In essence it saves the lines without a comma, i.e. the headers. Then, if a line is not a header, it switches the current line with the saved header and appends the now-switched line to the end of the header. As it is appended with a newline, the next statement replaces that newline with a comma. Then the line is printed. Next, to recover the header, everything after it is removed and it is swapped back into the hold buffer, ready for the next line.
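For example, against the first few lines of the sample input:
$ printf 'Day1\n1,Smith,London\n2,Bruce,Seattle\n' | sed -n '/,/!{h};/,/{x;G;s/\n/,/;p;s/,.*//;x}'
Day1,1,Smith,London
Day1,2,Bruce,Seattle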
sed '/^Day/ {h;d;}
G;s/\(.*\)\n\(.*\)/\2,\1/
' YourFile
POSIX compliant.
It prints nothing if there is not at least one data line after a Day header.
Blank lines are treated as data.
awk '{if ( $0 ~ /^Day/ ) Head = $0; else print Head "," $0}' YourFile
It uses the Day line as a separator and its content as the header to prepend to the following lines.
Perl solution:
#! /usr/bin/perl
use warnings;
use strict;

my $header;
while (<>) {                   # Read line by line.
    if (/,/) {                 # If the line contains a comma,
        print "$header,$_";    # prepend the header.
    } else {
        chomp;                 # Remove the newline.
        $header = $_;          # Remember the header.
    }
}
Another sed version
sed -n '/Day[0-9]\+/{h;b end};{G;s/\(.*\)\n\(.*\)/\2,\1/;p;:end}'
Perl
$ perl -F, -wlane ' if(@F eq 1){$s=$F[0]; next}print "$s,$_"' file
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
This Perl one-line program will do as you ask. It requires Perl v5.14 or better
perl -ne'tr/,// ? print $c,$_ : ($c = s/\s*\z/,/r)' myfile.txt
for earlier versions of perl, use
perl -ne'tr/,// ? print $c,$_ : ($c = $_) =~ s/\s*\z/,/' myfile.txt
output
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
Another perl example, this time using $/ to separate each record.
use strict;
use warnings;

local $/ = "Day";
while (<>) {
    next unless my ($num) = m/^(\d+)/;
    for ( split /\n/ ) {
        print "Day${num},$_\n" if m/,/;
    }
}

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none contain whitespace) already present in the text files, which need to be replaced with the corresponding strings from the second column of the same row.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop was used wrongly, but I can't find the error. When I execute the script below, I get the command line prompt, which makes me think that something has happened. When I check the text files, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
mv $textfile1.tmp $textfile1
sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter, or for in-place modification of a lot of files. As a side effect, it will load the replacements and build the Aho-Corasick-style match automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) as in your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
    ? do {
        shift;
        my $backup_extension = $1;
        my $backup_name = $backup_extension =~ /\*/
            ? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
            : sub { shift . $backup_extension };
        my $oldargv = '-';
        sub {
            if ( $ARGV ne $oldargv ) {
                rename( $ARGV, $backup_name->($ARGV) );
                open( ARGVOUT, '>', $ARGV );
                select(ARGVOUT);
                $oldargv = $ARGV;
            }
        };
    }
    : sub { };

die "$0: File with replacements required." unless @ARGV;

my ( $re, %replace );
do {
    my $filename = shift;
    open my $fh, '<', $filename;
    %replace = map { chomp; split ',', $_, 2 } <$fh>;
    close $fh;
    $re = join '|', map quotemeta, keys %replace;
    $re = qr/($re)/;
};

while (<>) {
    $in_place->();
    s/$re/$replace{$1}/g;
}
continue { print }
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with a backup name containing a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You should convert your CSV file to a sed.script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
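For example, assuming the csv really is comma-separated (the sample in the question shows space-separated pairs, which would need a different field separator), the generated sed.script contains one substitution per row:
$ awk -F, '{print "s/" $1 "/" $2 "/g";}' replace.csv
s/a_1/b_1/g
s/a_2/b_2/g
s/a_3/b_3/g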
And then you will be able to do a one pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you wanna do.
BTW, sorry, but I do not understand what is wrong with your script; it should work unless your CSV file has more than 2 columns.
