Given the number of tokens, how to find corresponding closest number of lines within a file? - bash

Given a huge file which contains text (one sentence per line), the task is to extract N tokens (for example 100 million tokens out of 3 billion), since I can not break a sentence into parts, I need to find closest number of lines that contains given number of tokens.
I tried following code:
perl -p -e 's/\n/ #/g' huge_file | cut -d' ' -f1-100000000 | grep -o ' #' | wc -w
which replace newline symbol with symbol ' #' (we are basically joining sentences into single line) and counting number of symbol ' #' which should correspond to number of sentence (huge_file doesn't contain '#' symbol). However, grep can not process the large line and giving 'grep: memory exhausted' error. Is there any other efficient way for accomplishing task, that would also work for very large files?

I had a bit of a hard time understanding what you're asking. But I think you're tackling it very badly. Running perl as a super-sed, then cut, then grep then wc is horribly inefficient.
If I understand correctly, you want as many lines as it takes to get at least 100M words.
Why not instead:
#!/usr/bin/env perl
use strict;
use warnings;
my $wordcount = 0;
#use 'magic' filehandle - read piped input or
#command line specified 'myscript.pl somefilename' - just like sed/grep
while ( <> )
#split on whitespace, count number of fields. Or words in this case.
$wordcount += scalar split;
#chomp; if you don't want the line feed here
#print current line
print;
#bail out if our wordcount is above a certain number.
last if $wordcount >= 100_000_000
#NB $. is line number if you wanted to just do a certain number of lines.
}
#already printed the content with line feeds intact.
#this prints the precise count we've printed.
print $wordcount," words printed\n";
This will iterate your file, and as soon as you've seen 100M words, it'll bail out - meaning you no longer have to read the whole file, nor do you need to invoke a daisy chain of commands.
This would oneliner if you're really insistent:
perl -p -e '$wordcount += scalar split; last if $wordcount > 100_000_000;'
Again - I couldn't quite tell what the significance of line feeds and # symbols were, so I haven't done anything with them. But s/\n/ #/ works fine in the above code block, as does chomp; to remove trailing linefeeds if that's what you're after.

Related

degenerate positions in motifs -bash

I am new to coding and writing a shell script which searches for motifs in protein sequence files and prints their location if present.
But these motifs have degenerate positions.
For example,
A motif can be (psi, psi,x, psi) where psi=(I, L or V) and x can be any of the 20 amino acids.
I would search a set of sequences for the occurrence of this motif. However, my protein sequences are exact sequences, i.e. they have no ambiguity, like:
>
MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGL
Would like the search for the all possible exact instances of the motif in the protein sequence which is present in fasta file.
I have a rough code which I know is wrong.
#!/usr/bin/bash
x=(A C G H I L M P S T V D E F K N Q R W Y)
psi=(I L V)
alpha=(D E)
motif1=($psi,$psi,$x,$psi)
for f in *.fasta ; do
if grep -q "$motif1" <$f ; then
echo $f
grep "^>" $f | tr -d ">"
grep -v ">" $f | grep -aob "$motif1"
fi
done
Appreciate any help in finding my way.
Thanks in advance!
The shell is an excellent tool for orchestrating other tools, but it's not particularly well suited to analyzing the contents of files.
A common arrangement is to use the shell to run Awk over a set of files, and do the detection logic in Awk instead. (Other popular tools are Python and Perl; I would perhaps tackle this in Python if I were to start from scratch.)
Regardless of the language of your script, you should avoid code duplication; refactor to put parameters in variables and then run the code with those parameters, or perhaps move the functionality to a function and then call it with different parameters. For example,
scan () {
local f
for f in *.fasta; do
# Inefficient: refactor to do the grep only once, then decide whether you want to show the output or not
if grep -q "$1" "$f"; then
# Always, always use double quotes around file names
echo "$f"
grep "^>" "$f" | tr -d ">"
grep -v ">" "$f" | grep -aob "$1"
fi
done
}
case $motif in
1) scan "$SIM_Type_1";; # Notice typo in variable name
2) scan "$SIM_Type_2";; # Ditto
3) scan "$SIM_Type_3";; # Ditto
4) scan "$SIM_Type_4";; # Ditto
5) scan "$SIM_TYPE_5";; # Notice inconsistent variable name
alpha) scan "$SIM_Type_alpha";;
beta) scan "$SIM_Type_beta";;
esac
You are declaring the _*Type_* variables (or occasionally *_TYPE_* -- the shell is case sensitive, and you should probably use the same capitalization for all the variables just to make it easier for yourself) as arrays, but then you are apparently attempting to use them as regular scalars. I can only guess as to what you intend for the variables to actually contain; but I'm guessing you want something like
# These are strings which contain regular expressions
x='[ACGHILMPSTVDEFKNQRWY]'
psi='[ILV]'
psi_1='[IV]'
alpha='[DE]'
# These are strings which contain sequences of the above regexes
# The ${variable} braces are not strictly necessary, but IMHO help legibility
SIM_Type_1="${psi}${psi}${x}${psi}"
SIM_Type_2="${psi}${x}${psi}${psi}"
SIM_Type_3="${psi}${psi}${psi}${psi}"
SIM_Type_4="${x}${psi}${x}${psi}"
SIM_TYPE_5="${psi}${alpha}${psi}${alpha}${psi}"
SIM_Type_alpha="${psi_1}${x}${psi_1}${psi}"
SIM_Type_beta="${psi_1}${psi_1}.${x}${psi}"
# You had an empty spot here ^ I guessed you want to permit any character?
If you really wanted these to be arrays, the way to access the contents of the array is "${array[#]}" but then that will not produce something we can directly pass to grep or Awk so I went with declaring these as strings containing regular expressions for the motifs.
But to reiterate, Awk is probably a better language for this, so let's refactor scan to be an Awk script.
# This replaces the function definition above
scan () {
awk -v re="$1" '{ if (/^>/) label=$0
else if (idx = match($0, re, result) {
if (! printed) { print FILENAME; printed = 1 }
print len + idx ":" result[0]
}
len += 1+length($0) # one for the newline
}
# Reset printed if we skip to a new file
FNR == 1 { printed = 0 }' *.fasta
}
The main complication here is reimplementing the grep -b byte offset calculation. If that is not strictly necessary (perhaps a line number would suffice?) then the Awk script can be reduced to a somewhat more trivial one.
Your use of grep -a suggests that perhaps your input files contain DOS newlines. I think this will work fine regardless of this.
The immediate benefit of this refactoring is that we avoid scanning the potentially large input file twice. We only scan the file once, and print the file name on the first match.
If you want to make the script more versatile, this is probably also a better starting point than the grep | tr solution you had before. But if the script does what you need, and the matches are often near the beginning of the input file, or the input files are not large, perhaps you don't actually want to switch to Awk after all.
Notice also that like your grep logic, this will not work if a sequence is split over several lines in the FASTA file and the match happens to straddle one of the line breaks.
Finally, making the script prompt for interactive input is a design wart. I would suggest you accept the user choice as a command-line argument instead.
motif=$1
So you'd use this as ./scriptname alpha to run the alpha regex against your FASTA files.
Another possible refactoring would be to read all the motif regexs into a slightly more complex Awk script and print all matches for all of them in a form which then lets you easily pick the ones you actually want to examine in more detail. If you have a lot of data to process, looping over it only once is a huge win.

using xargs as an argument for cut

Say i have a file a.txt containing a word, followed by a number, followed by a newline on
and 3
now 2
for 2
something 7
completely 8
different 6
I need to select the nth char from every word (specified by the number next to the word)
cat a.txt | cut -d' ' -f2 | xargs -i -n1 cut a.txt -c {}
I tried this command, which selects the numbers and uses xargs to put them into the -c option from cut, but the cut command gets executed on every line, instead of a.txt being looped (which I had expected to happen) How can I resolve this problem?
EDIT: Since it seems to be unclear, i want to select a character from a word. The character which I need to select can be found next to the word, for example:
and 3, will give me d. I want to do this for the entire file, which will then form a word :)
A pure shell solution:
$ while read word num; do echo ${word:$((num-1)):1}; done < a.txt
d
o
o
i
e
r
This is using a classic while; do ... ; done shell loop and the read builtin. The general format is
while read variable1 variable2 ... variableN; do something; done < input_file
This will iterate over each line of your input file splitting it into as many variables as you've given. By default, it will split at whitespace but you can change that by changing the $IFS variable. If you give a single variable, the entire line will be saved, if you give more, it will populate as many variables as you give it and save the rest in the last one.
In this particular loop, we're reading the word into $word and the number into $num. Once we have the word, we can use the shell's string manipulation capabilities to extract a substring. The general format is
${string:start:length}
So, ${string:0:2} would extract the first two characters from the variable $string. Here, the variable is $word, the start is the number minus one (this starts counting at 0) and the length is one. The result is the single letter at the position given by the number.
I would suggest that you used awk:
awk '{print substr($1,$2,1)}' file
substr takes a substring of the first field starting from the number contained in the second field and of length 1.
Testing it out (using the original input from your question):
$ cat file
and 3
now 2
for 2
something 7
completely 8
different 6
$ awk '{print substr($1,$2,1)}' file
d
o
o
i
e
r

How can I get the SOA serial number from a file with sed?

I store my SOA data for multiple domains in a single file that gets $INCLUDEd by zone files. I've written a small sed script that is supposed to get the serial number, increment it, then re-save the SOA file. It all works properly as long as the SOA file is in the proper format, with the entire record on one line, but it fails as soon as the record gets split into multiple lines.
For example, this works as input data:
# IN SOA dnsserver. hostmaster.example.net. ( 2013112202 21600 900 691200 86400 )
But this does not:
# IN SOA dnsserver. hostmaster.example.net. (
2013112202 ; Serial number
21600 ; Refresh every day, 86400 is 1 day
900 ; Retry refresh every 15 min
691200 ; Expire every 8 days
86400 ) ; Minimum TTL 1 day
I like comments, and I would like to spread things out. But I need my script to be able to find the serial number so that I can increment it and rewrite the file.
The SED that works on the single line is this:
SOA=$(sed 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' $SOAfile)
But for multi-line ... I'm a bit lost. I know I can join lines with N, but how do I know if I even need to? Do I need to write separate sed scripts based on some other analysis I do of the original file?
Please help! :-)
I wouldn't use sed for this. While you might be able to brute-force something, it would require a large amount of concentration to come up with it, and it would look like line noise, and so be almost unmaintainable afterwards.
What about this in awk?
The easiest way might be to split your records based on the # character, like so:
SOA=$(awk 'BEGIN{RS="#"} NR==2{print $6}' $SOAfile)
But that will break if you have comments containing # before the uncommented line, or if you have any comments between the # and the serial number. You could make a pipe to avoid these issues...
SOA=$(sed 's/;.*//;/^#/p;1,/^#/d' $SOAfile | awk 'BEGIN{RS="#"} NR==2{print $6}')
It may seem redundant to remove comments and strip the top of the file, but there could be other lines like #include which (however unlikely) could contain your record separator.
Or you could do something like this in pure awk:
SOA=$(awk -v field=6 '/^#/ { if($2=="IN"){field++} for(i=1;i<field;i++){if(i==NF){field=field-NF;getline;i=1}} print $field}' $SOAfile)
Or, broken out for easier reading:
awk -v field=6 '
/^#/ {
if ($2=="IN") {field++;}
for (i=1;i<field;i++) {
if(i==NF) {field=field-NF;getline;i=1;}
}
print $field; }' $SOAfile
This is flexible enough to handle any line splitting you might have, as it counts to field along multiple lines. It also adjusts the field number based on whether your zone segment contains the optional "IN" keyword.
A pure-sed solution would, instead of counting fields, use the first string of digits after an open bracket after your /^#/, like this:
SOA=$(sed -n '/^#/,/^[^;]*)/H;${;x;s/.*#[^(]*([^0-9]*//;s/[^0-9].*//;p;}' $SOAfile)
Looks like line noise, right? :-) Broken out for easier reading, it looks like this:
/^#/,/^[^;]*)/H # "Hold" the meaningful part of the file...
${ # Once we reach the end...
x # Copy the hold space back to the main buffer
s/.*#[^(]*([^0-9]*// # Remove stuff ahead of the serial
s/[^0-9].*// # Remove stuff after the serial
p # And print.
}
The idea here is that starting from the first line that begins with #, we copy the file into sed's hold space, then at the end of the file, do some substitutions to strip out all the text up to the serial number, and then after the serial number, and print whatever remains.
All of these work on single line and multi line zone SOA records I've tested with.
You can try the following - it's your original sed program preceded by commands to first read all input lines, if applicable:
SOA=$(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' \
"$SOAfile")
This form will work with both single- and multi-line input files.
Multi-line input files are first read as a whole before applying the substitutions.
Note: The awkward separate -e options are needed to keep FreeBSD happy with respect to labels and branching commands, which need a literal \n for termination - using separate -e options is a more readable alternative to splicing in literal newlines with $'\n'.
Alternative solution, using awk:
SOA=$(awk -v RS='#' '$1 == "IN" && $2 == "SOA" { print $6 }' "$SOAfile")
Again, this will work with both single- and multi-line record definitions.
The only constraint is that comments must not precede the serial number.
Additionally, if a file contained multiple records, the above would collect ALL serial numbers, separated by a newline each.
Why sed? grep is simplest in this case:
grep -A1 -e '#.*SOA' 1 | grep -oe '[0-9]*'
or: (maybe better):
grep -A1 -e '#.*SOA' 1 | grep 'Serial number' | grep -oe '[0-9]*'
This might work for you (GNU sed):
sed -nr '/# IN SOA/{/[0-9]/!N;s/[^0-9]+([0-9]+).*/\1/p}' file
For lines that contain # IN SOA if the line contains no numbers append the next line. Then extract the first sequence of numbers from the line(s).

'grep +A': print everything after a match [duplicate]

This question already has answers here:
How to get the part of a file after the first line that matches a regular expression
(12 answers)
Closed 7 years ago.
I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after: http://www.yahoo.com, results looks like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3 http://www.yahoo.com
But I don't know how to get the file after line number 3. Also, I know there is a flag in grep -A print the lines after your match. However, you need to specify how many lines you want after the match. I am wondering is there something to get around that issue. Like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line with yahoo, we set the variable y=1, and then skip that line (the next command will jump to the next line, thus skip any further processing on the current line). Without the next command, the line yahoo will be printed.
The second part is a short hand for:
y != 0 { print }
Which means, for each line, if variable y is non-zero, we print that line. In AWK, if you refer to a variable, that variable will be created and is either zero or empty string, depending on context. Before encounter yahoo, variable y is 0, so the script does not print anything. After encounter yahoo, y is 1, so every line after that will be printed.
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1); a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes, meaning "the first line that matches this regexp". (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the -e directive means "from the first appearance of a line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
awk '/yahoo/ ? c++ : c' file1
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get index of the "yahoo" word
index=`grep -n "yahoo" filepath | cut -d':' -f1`
# Get the total number of lines in the file
totallines=`wc -l filepath | cut -d' ' -f1`
# Subtract totallines with index
result=`expr $total - $index`
# Gives the desired output
grep -A $result "yahoo" filepath

How to remove duplicate phrases from a document?

Is there a simple way to remove duplicate contents from a large textfile? It would be great to be able to detect duplicate sentences (as separated by "." or even better to find duplicates of sentence fragments (such as 4-word pieces of text).
Removing duplicate words is easy enough, as other people have pointed out. Anything more complicated than that, and you're into Natural Language Processing territory. Bash isn't the best tool for that -- you need a slightly more elegant weapon for a civilized age.
Personally, I recommend Python and it's NLTK (natural language toolkit). Before you dive into that, it's probably worth reading up a little bit on NLP so that you know what you actually need to do. For example, the "4-word pieces of text" are known as 4-grams (n-grams in the generic case) in the literature. The toolkit will help you find those, and more.
Of course, there are probably alternatives to Python/NLTK, but I'm not familiar with any.
Remove duplicate phrases an keeping the original order:
nl -w 8 "$infile" | sort -k2 -u | sort -n | cut -f2
The first stage of the pipeline prepends every line with line number to document the original order. The second stage sorts the original data with the unique switch set.
The third restores the original order (sorting the 1. column). The final cut removes the first column.
You can remove duplicate lines (which have to be exactly equal) with uniq if you sort your textfile first.
$ cat foo.txt
foo
bar
quux
foo
baz
bar
$ sort foo.txt
bar
bar
baz
foo
foo
quux
$ sort foo.txt | uniq
bar
baz
foo
quux
Apart from that, there's no simple way of doing what you want. (How will you even split sentences?)
You can use grep with backreferences.
If you write grep "\([[:alpha:]]*\)[[:space:]]*\1" -o <filename> it will match any two identical words following one another. I.e. if the file content is this is the the test file , it will output the the.
(Explanation [[:alpha:]] matches any character a-z and A-Z, the asterisk * after it means that may appear as many times as it wants, the \(\) is used for grouping to backreference it later, then [[:space:]]* matches any number of spaces and tabs, and finally \1 matches the exact sequence that was found, that was enclosed in \(\)brackets)
Likewise, if you want to match a group of 4 words, that is repeated two times in a row, the expression will look like grep "\(\([[:alpha:]]*[[:space]]*\)\{4\}[[:space:]]*\1" -o <filename> - it will match e.g. a b c d a b c d.
Now we need to add an arbitrary character sequence inbetween matches. In theory this should be done with inserting .* just before backreference, i.e. grep "\(\([[:alpha:]]*[[:space]]*\)\{4\}.*\1" -o <filename>, but this doesn't seem to work for me - it matches just any string and ignores said backreference
The short answer is that there's no easy method. In general any solution needs to first decide how to split the input document into chunks (sentences, sets of 4 words each, etc) and then compare them to find duplicates. If it's important that the ordering of the non-duplicate elements by the same in the output as it was in the input then this only complicates matters further.
The simplest bash-friendly solution would be to split the input into lines based on whatever criteria you choose (e.g. split on each ., although doing this quote-safely is a bit tricky) and then use standard duplicate detection mechanisms (e.g. | uniq -c | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}' and then, for each resulting line, remote the text from the input.
Presuming that you had a file that was properly split into lines per "sentence" then
uniq -c lines_of_input_file | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}' | while IFS= read -r match ; do sed -i '' -e 's/'"$match"'//g' input_file ; done
Might be sufficient. Of course it will break horribly if the $match contains any data which sed interprets as a pattern. Another mechanism should be employed to perform the actual replacement if this is an issue for you.
Note: If you're using GNU sed the -E switch above should be changed to -r
I just created a script in python, that does pretty much what I wanted originally:
import string
import sys
def find_all(a_str, sub):
start = 0
while True:
start = a_str.find(sub, start)
if start == -1: return
yield start
start += len(sub)
if len(sys.argv) != 2:
sys.exit("Usage: find_duplicate_fragments.py some_textfile.txt")
file=sys.argv[1]
infile=open(file,"r")
text=infile.read()
text=text.replace('\n','') # remove newlines
table = string.maketrans("","")
text=text.translate(table, string.punctuation) # remove punctuation characters
text=text.translate(table, string.digits) # remove numbers
text=text.upper() # to uppercase
while text.find(" ")>-1:
text=text.replace(" "," ") # strip double-spaces
spaces=list(find_all(text," ")) # find all spaces
# scan through the whole text in packets of four words
# and check for multiple appearances.
for i in range(0,len(spaces)-4):
searchfor=text[spaces[i]+1:spaces[i+4]]
duplist=list(find_all(text[spaces[i+4]:len(text)],searchfor))
if len(duplist)>0:
print len(duplist),': ',searchfor
BTW: I'm a python newbie, so any hints on better python practise are welcome!

Resources