How to remove duplicate phrases from a document? - bash

Is there a simple way to remove duplicate content from a large text file? It would be great to be able to detect duplicate sentences (as separated by ".") or, even better, to find duplicates of sentence fragments (such as 4-word pieces of text).

Removing duplicate words is easy enough, as other people have pointed out. Anything more complicated than that, and you're into Natural Language Processing territory. Bash isn't the best tool for that -- you need a slightly more elegant weapon for a civilized age.
Personally, I recommend Python and its NLTK (Natural Language Toolkit). Before you dive into that, it's probably worth reading up a little on NLP so that you know what you actually need to do. For example, the "4-word pieces of text" are known as 4-grams (n-grams in the generic case) in the literature. The toolkit will help you find those, and more.
Of course, there are probably alternatives to Python/NLTK, but I'm not familiar with any.
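That said, if you want to experiment in the shell first, a rough sketch of a 4-gram counter with standard tools might look like this (infile.txt is a hypothetical input; punctuation and case handling are deliberately naive here):
grep -oE '[[:alnum:]]+' infile.txt |       # one word per line, punctuation dropped
awk 'NR >= 4 { print w1, w2, w3, $0 }      # emit a sliding window of 4 words
     { w1 = w2; w2 = w3; w3 = $0 }' |
sort | uniq -cd | sort -rn                 # count and list only repeated 4-grams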

Remove duplicate lines while keeping their original order:
nl -w 8 "$infile" | sort -k2 -u | sort -n | cut -f2
The first stage of the pipeline prepends a line number to every line to record the original order. The second stage sorts the original data on the second field, with the unique switch set to drop duplicates.
The third restores the original order (sorting numerically on the first field). The final cut removes the first field.
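A quick check with a throwaway file (GNU coreutils; with -u, the first occurrence of each line in the input is the one that survives):
$ printf 'foo\nbar\nfoo\nbaz\nbar\n' > infile.txt
$ nl -w 8 infile.txt | sort -k2 -u | sort -n | cut -f2
foo
bar
baz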

You can remove duplicate lines (which have to be exactly equal) with uniq if you sort your textfile first.
$ cat foo.txt
foo
bar
quux
foo
baz
bar
$ sort foo.txt
bar
bar
baz
foo
foo
quux
$ sort foo.txt | uniq
bar
baz
foo
quux
Apart from that, there's no simple way of doing what you want. (How will you even split sentences?)

You can use grep with backreferences.
If you write grep "\([[:alpha:]]*\)[[:space:]]*\1" -o <filename> it will match any two identical words following one another, i.e. if the file content is this is the the test file, it will output the the.
(Explanation: [[:alpha:]] matches any character a-z and A-Z, the asterisk * after it means it may appear any number of times, the \(\) is used for grouping so it can be backreferenced later, then [[:space:]]* matches any number of spaces and tabs, and finally \1 matches the exact sequence that was found between the \(\) brackets.)
Likewise, if you want to match a group of 4 words repeated twice in a row, the expression will look like grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\)[[:space:]]*\1" -o <filename> - it will match e.g. a b c d a b c d.
Now we need to allow an arbitrary character sequence in between the matches. In theory this should be done by inserting .* just before the backreference, i.e. grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\).*\1" -o <filename>, but this doesn't seem to work for me - it matches just any string and ignores said backreference

The short answer is that there's no easy method. In general, any solution needs to first decide how to split the input document into chunks (sentences, sets of 4 words each, etc.) and then compare them to find duplicates. If it's important that the ordering of the non-duplicate elements be the same in the output as it was in the input, then this only complicates matters further.
The simplest bash-friendly solution would be to split the input into lines based on whatever criteria you choose (e.g. split on each ".", although doing this quote-safely is a bit tricky), then use standard duplicate-detection mechanisms (e.g. | uniq -c | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}'), and then, for each resulting line, remove the text from the input.
Presuming that you had a file that was properly split into lines per "sentence" then
uniq -c lines_of_input_file | sort -n | sed -E -ne '/^[[:space:]]+1/!{s/^[[:space:]]+[0-9]+ //;p;}' | while IFS= read -r match ; do sed -i '' -e 's/'"$match"'//g' input_file ; done
Might be sufficient. Of course it will break horribly if the $match contains any data which sed interprets as a pattern. Another mechanism should be employed to perform the actual replacement if this is an issue for you.
Note: If you're using GNU sed the -E switch above should be changed to -r
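As a rough sketch of the splitting step (assuming a hypothetical input.txt and naive "."-splitting, without the quote-safety mentioned above), the duplicated sentences could be listed with:
tr '.' '\n' < input.txt |                   # one "sentence" per line
  sed -e 's/^[[:space:]]*//' -e '/^$/d' |   # trim leading whitespace, drop empty lines
  sort | uniq -d                            # show only lines that occur more than once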

I just created a script in Python that does pretty much what I originally wanted:
import string
import sys

def find_all(a_str, sub):
    # yield the index of every non-overlapping occurrence of sub in a_str
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub)

if len(sys.argv) != 2:
    sys.exit("Usage: find_duplicate_fragments.py some_textfile.txt")

file = sys.argv[1]
infile = open(file, "r")
text = infile.read()
text = text.replace('\n', '')  # remove newlines
table = string.maketrans("", "")
text = text.translate(table, string.punctuation)  # remove punctuation characters
text = text.translate(table, string.digits)  # remove numbers
text = text.upper()  # to uppercase
while text.find("  ") > -1:
    text = text.replace("  ", " ")  # collapse double spaces
spaces = list(find_all(text, " "))  # find all spaces

# scan through the whole text in packets of four words
# and check for multiple appearances
for i in range(0, len(spaces) - 4):
    searchfor = text[spaces[i] + 1:spaces[i + 4]]
    duplist = list(find_all(text[spaces[i + 4]:len(text)], searchfor))
    if len(duplist) > 0:
        print len(duplist), ': ', searchfor
BTW: I'm a Python newbie, so any hints on better Python practice are welcome!

Related

How to count the number of files each pattern of a group appears in a file?

I am having problems when trying to count the number of times a specific pattern appears in a file (let's call it B). In this case, I have a file with 30 patterns (let's call it A), and I want to know how many lines contain that pattern.
With only one pattern it is quite simple:
grep "pattern" file | wc -l
But with a file full of them I am not able to figure out how it may work. I already tried this:
grep -f "fileA" "fileB" | wc -l
Nevertheless, it gives me the total times all patterns appear, not each one of them (that's what I desire to get).
Thank you so much.
Count matches per literal string
If you simply want to know how often each pattern appears, and each of your patterns is a fixed string (not a regex), use:
grep -oFf needles.txt haystack.txt | sort | uniq -c
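For example, with two hypothetical files:
$ printf '%s\n' foo bar > needles.txt
$ printf 'foo bar foo\nbar\n' > haystack.txt
$ grep -oFf needles.txt haystack.txt | sort | uniq -c
      2 bar
      2 foo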
Count matching lines per literal string
Note that the above is slightly different from your formulation "I want to know how many lines contain that pattern", as one line can contain multiple matches. If you really have to count matching lines per pattern instead of matches per pattern, things get a little trickier:
grep -noFf needles.txt haystack.txt | sort -u | cut -d: -f2- | sort | uniq -c
(The -n line-number prefix makes each line/match pair unique; the second sort regroups the matches so that uniq -c counts them correctly.)
Count matching lines per regex
If the patterns are regexes, you probably have to iterate over the patterns, as grep's output only tells you that (at least) one pattern matched, but not which one.
# this will be very slow if you have many patterns
while IFS= read -r pattern; do
    printf '%8d %s\n' "$(grep -ce "$pattern" haystack.txt)" "$pattern"
done < needles.txt
... or use a different tool/language like awk or perl.
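For instance, a rough awk sketch (same hypothetical needles.txt and haystack.txt, one regex per line):
awk 'NR == FNR { pat[++n] = $0; next }                       # first file: collect patterns
     { for (i = 1; i <= n; i++) if ($0 ~ pat[i]) cnt[i]++ }  # second file: count matching lines
     END { for (i = 1; i <= n; i++) printf "%8d %s\n", cnt[i], pat[i] }' needles.txt haystack.txt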
Note on overlapping matches
You did not formulate any precise requirements, so I went with the simplest solution for each case. The first two solutions and the last solution behave differently when multiple patterns match (part of) the same substring.
grep -f needles.txt matches each substring at most once, so some matches might be "missed" (whether they count as "missed" depends on your requirements),
whereas iterating grep -e pattern1; grep -e pattern2; ... may match the same substring multiple times.

Bash script output text between first match and 2nd match only [duplicate]

I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I cannot seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.
Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a more modern regex flavor. Fortunately, Perl regex for this context is pretty easy to get:
perl -pe 's|(http://.*?/).*|\1|'
In this specific case, you can get the job done without using a non-greedy regex.
Try the negated character class [^/]* instead of the non-greedy .*?:
sed 's|\(http://[^/]*/\).*|\1|g'
With sed, I usually implement non-greedy search by searching for anything except the separator, up until the separator:
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'
Output:
http://www.suon.co.uk
this is:
-n: don't output lines unless explicitly printed
s/<pattern>/<replace>/p: search, match pattern, replace and print
use ; as the command separator instead of / to make it easier to type, so s;<pattern>;<replace>;p
remember a match between brackets \( ... \), later accessible with \1, \2 ...
match http://
followed by anything in brackets [], where [ab/] would mean either a or b or /
a first ^ in [] means not, so followed by anything but the thing in the []
so [^/] means anything except the / character
* repeats the previous part, so [^/]* means characters except /
so far sed -n 's;\(http://[^/]*\) means: search for http:// followed by any characters except /, and remember what you've found
we want to search until the end of the domain, so stop at the next /; add another / at the end: sed -n 's;\(http://[^/]*\)/', but we want to match the rest of the line after the domain too, so add .*
now the match remembered in group 1 (\1) is the domain, so replace the matched line with the stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'
If you want to include the slash after the domain as well, then add one more slash in the group to remember:
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'
output:
http://www.suon.co.uk/
Simulating lazy (un-greedy) quantifier in sed
And all other regex flavors!
Finding first occurrence of an expression:
POSIX ERE (using -r option)
Regex:
(EXPRESSION).*|.
Sed:
sed -r 's/(EXPRESSION).*|./\1/g' # global `g` modifier should be on
Example (finding first sequence of digits):
$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12
How does it work?
This regex benefits from an alternation |. At each position the engine tries to pick the longest match (this is POSIX behavior, followed by a couple of other engines as well), which means it goes with . until a match is found for (EXPRESSION).*. But order is important too.
Since the global flag is set, the engine tries to continue matching character by character up to the end of the input string or our target. As soon as the left side of the alternation matches and captures (EXPRESSION), the rest of the line is immediately consumed by .* as well. We now hold our value in the first capturing group.
POSIX BRE
Regex:
\(\(\(EXPRESSION\).*\)*.\)*
Sed:
sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
Example (finding first sequence of digits):
$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12
This one is like the ERE version but with no alternation involved. That's all. At each single position the engine tries to match a digit.
If one is found, the following digits are consumed and captured and the rest of the line is matched immediately; otherwise, since * means
zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot . to match a single character, and this process continues.
Finding first occurrence of a delimited expression:
This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; \
s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
Input string:
foobar start block #1 end barfoo start block #2 end
EDE (end-delimiter expression): end
SDE (start-delimiter expression): start
$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g' <<< 'foobar start block #1 end barfoo start block #2 end'
Output:
start block #1 end
The first regex \(end\).* matches and captures the first end delimiter end and substitutes the whole match with the captured characters, which
is the end delimiter. At this stage our output is: foobar start block #1 end.
Then the result is passed to the second regex \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character
if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.
Directly answering your question
Using approach #2 (delimited expression) you should select two appropriate expressions:
EDE: [^:/]\/
SDE: http:
Usage:
$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
Output:
http://www.suepearson.co.uk/
Note: this will not work with identical delimiters.
sed does not support a "non-greedy" operator.
You have to use the [] operator to exclude / from the match.
sed 's,\(http://[^/]*\)/.*,\1,'
P.S. there is no need to backslash "/".
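For example:
$ echo "http://www.suepearson.co.uk/product/174/71/3816/" | sed 's,\(http://[^/]*\)/.*,\1,'
http://www.suepearson.co.uk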
sed - non greedy matching by Christoph Sieghart
The trick to getting non-greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it, and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar
Non greedy matching
% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
Non-greedy solution for more than a single character
This thread is really old, but I assume people still need it.
Let's say you want to kill everything till the very first occurrence of HELLO. You cannot say [^HELLO]...
So a nice solution involves two steps, assuming you can spare a unique word that you are not expecting in the input, say top_sekrit.
In this case we can:
s/HELLO/top_sekrit/ #will only replace the very first occurrence
s/.*top_sekrit// #kill everything till end of the first HELLO
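For example, combining both steps (top_sekrit is just an arbitrary marker assumed absent from the input):
$ echo "one two HELLO three HELLO four" | sed 's/HELLO/top_sekrit/; s/.*top_sekrit//'
 three HELLO four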
Of course, with a simpler input you could use a smaller word, or maybe even a single character.
HTH!
This can be done using cut:
echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3
Another way, not using regex, is the fields/delimiter method, e.g.:
string="http://www.suepearson.co.uk/product/174/71/3816/"
echo "$string" | awk -F"/" '{print $1,$2,$3}' OFS="/"
sed certainly has its place, but this is not one of them!
As Dee has pointed out: just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:
url="http://www.suepearson.co.uk/product/174/71/3816/"
protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)
gives you:
protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"
As you can see, this is a much more flexible approach.
(all credit to Dee)
sed -E 's|(http://[^/]+/).*|\1|'
There is still hope to solve this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to eliminate all the unnecessary parts of the string, like this:
sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
-r: Use extended regex (for + and unescaped parenthesis)
":loop": Define a new label named "loop"
-e: add commands to sed
"t loop": Jump back to label "loop" if there was a successful substitution
The only problem here is that it will also cut the last separator character ('/'), but if you really need it you can simply put it back after the "loop" has finished; just append this additional command at the end of the previous command line:
-e "s,$,/,"
sed -E interprets regular expressions as extended (modern) regular expressions.
Update: -E on macOS, -r in GNU sed (recent GNU sed accepts -E as well).
Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc.). The second group is the domain:
echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"
If you're not familiar with grouping, it's worth reading up on it first.
I realize this is an old entry, but someone may find it useful.
As a full domain name may not exceed a total length of 253 characters, you can replace .* with .\{1,253\}
This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar>, so for example this input:
$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV
should become this output:
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
To do that you convert foo and bar to individual characters and then use the negation of those characters between them:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
In the above:
s/#/#A/g; s/{/#B/g; s/}/#C/g is converting { and } to placeholder strings that cannot exist in the input so those chars then are available to convert foo and bar to.
s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
s/#C/}/g; s/#B/{/g; s/#A/#/g is converting the placeholder strings back to their original characters.
Note that the above does not rely on any particular string not being present in the input, as it manufactures such strings in the first step, nor does it care which occurrence of any particular regexp you want to match, since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want, and/or combine it with sed's numeric match operator, e.g. to only replace the 2nd occurrence:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
Have not yet seen this answer, so here's how you can do this with vi or vim:
vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null
This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.
I like using vi sometimes for super-complicated regexes, because (1) perl is dead/dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes from my day-to-day document editing.
Since PCRE is also tagged here, we could use GNU grep with the non-greedy (lazy) .*? in the regex, which matches up to the first, nearest occurrence, as opposed to .* (which is greedy and runs to the last occurrence of the match).
grep -oP '^http[s]?:\/\/.*?/' Input_file
Explanation: we use grep's -o and -P options, where -P enables PCRE regex. The regex matches http or https followed by :// and then up to the next occurrence of /; since we used .*?, it stops at the first / after http(s)://. Only the matched part of the line is printed.
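For example, with the URL from the question:
$ grep -oP '^http[s]?:\/\/.*?/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk/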
echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'
Don't bother, I got it on another forum :)
sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too
Here is something you can do with a two step approach and awk:
A=http://www.suepearson.co.uk/product/174/71/3816/
echo "$A" | awk '
{
    var = gensub(/\//, "||", 3, $0)   # replace the 3rd "/" with a unique marker (gensub is gawk-specific)
    sub(/\|\|.*/, "", var)            # delete everything from the marker onward
    print var
}'
Output:
http://www.suepearson.co.uk
Hope that helps!
Another sed version:
sed 's|/[[:alnum:]].*||' file.txt
It matches / followed by an alphanumeric character (so not another forward slash), as well as the rest of the characters up to the end of the line, and then replaces it with nothing (i.e. deletes it). Note that on a full http:// URL the second slash of // is itself followed by an alphanumeric character, so this is only safe on input without the protocol prefix.
#Daniel H (concerning your comment on andcoz's answer, although a long time ago): deleting trailing zeros works with
sed -E 's,([[:digit:]]\.[[:digit:]]*[1-9])0*$,\1,g'
it's about clearly defining the matching conditions ...
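For instance:
$ echo "3.14159000" | sed -E 's,([[:digit:]]\.[[:digit:]]*[1-9])0*$,\1,g'
3.14159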
You should also think about the case where there are no matching delimiters. Do you want to output the line or not? My examples here do not output anything if there is no match.
You need the prefix up to the 3rd /, so select a string of any length not containing /, followed by /, two times, then a string of any length not containing /, then match / followed by any string, and then print the selection. This idea works with any single-char delimiters.
echo http://www.suepearson.co.uk/product/174/71/3816/ | \
sed -nr 's,(([^/]*/){2}[^/]*)/.*,\1,p'
Using sed commands you can do fast prefix dropping or delim selection, like:
echo 'aaa #cee: { "foo":" #cee: " }' | \
sed -r 't x;s/ #cee: /\n/;D;:x'
This is a lot faster than eating a char at a time.
t x: jump to label x if there was a successful substitution previously. s/ #cee: /\n/: replace the first delimiter with \n. D: delete up to the first \n and restart. On the restarted cycle the t triggers, so the remainder jumps to the end and is printed.
If there are start and end delimiters, it is easy to remove end delimiters until you reach the nth-2 element you want and then do the D trick, remove after the end delimiter, jump to delete if there is no match, remove before the start delimiter, and print. This only works if start/end delimiters occur in pairs.
echo 'foobar start block #1 end barfoo start block #2 end bazfoo start block #3 end goo start block #4 end faa' | \
sed -r 't x;s/end//;s/end/\n/;D;:x;s/(end).*/\1/;T y;s/.*(start)/\1/;p;:y;d'
If you have access to GNU grep, then you can utilize perl regex:
grep -Po '^https?://([^/]+)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk
Alternatively, to get everything after the domain use
grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/
The following solution works for matching / working with multiply-present (chained; tandem; compound) HTML or other tags. For example, I wanted to edit HTML code to remove <span> tags that appeared in tandem.
Issue: regular sed regex expressions greedily matched over all the tags from the first to the last.
Solution: non-greedy pattern matching (per discussions elsewhere in this thread; e.g. https://stackoverflow.com/a/46719361/1904943).
Example:
echo '<span>Will</span>This <span>remove</span>will <span>this.</span>remain.' | \
sed 's/<span>[^>]*>//g' ; echo
This will remain.
Explanation:
s/<span> : find <span>
[^>]* : followed by any number of characters that are not >
> : until you find a >
//g : replace every such string present with nothing.
Addendum
I was trying to clean up URLs, but I was running into difficulty matching / excluding a word - href - using the approach above. I briefly looked at negative lookarounds (Regular expression to match a line that doesn't contain a word) but that approach seemed overly complex and did not provide a satisfactory solution.
I decided to replace href with ` (backtick), do the regex substitutions, then replace ` with href.
Example (formatted here for readability):
printf '\n
<a aaa h href="apple">apple</a>
<a bbb "c=ccc" href="banana">banana</a>
<a class="gtm-content-click"
data-vars-link-text="nope"
data-vars-click-url="https://blablabla"
data-vars-event-category="story"
data-vars-sub-category="story"
data-vars-item="in_content_link"
data-vars-link-text
href="https:example.com">Example.com</a>\n\n' |
sed 's/href/`/g ;
s/<a[^`]*`/\n<a href/g'
<a href="apple">apple</a>
<a href="banana">banana</a>
<a href="https:example.com">Example.com</a>
Explanation: basically as above. Here,
s/href/`/g : replace href with ` (backtick)
s/<a : find the start of the <a tag
[^`]* : followed by any number of characters that are not ` (backtick)
` : until you find a `
/\n<a href/g : replace each of those found with <a href, on a new line
Unfortunately, as mentioned, this is not supported in sed.
To overcome this, I suggest using the next best thing (actually better, even): vim's sed-like capabilities.
Define in .bash_profile:
vimdo() { vim $2 --not-a-term -c "$1" -es +"w >> /dev/stdout" -cq! ; }
That will create a headless vim to execute a command.
Now you can do for example:
echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -
to filter out python in $PATH.
Use - to take input from the pipe in vimdo.
While most of the syntax is the same, Vim features more advanced constructs, and \{-} is the standard non-greedy match; see :help regexp.

replace strings in file from a reference list

There are a few threads that seem to be asking the same question as I'm interested in here, but some of the answers seem to be tricky to generalise (or I'm not smart enough). e.g.
how to replace strings in file based on values from another file? (example inside)
Replacing strings in file, using patterns from another file
I have some complicated files that look like this:
((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);
(These are Newick format phylogenetic trees in case anyone is interested)
I need to change all the ID keys (the bits that look like XXX_YYYYY) in this file and am not sure what the best approach would be.
They need to be replaced by the 'group' (operon) they belong to, and so I was thinking that making an index file of sorts would be the way to go, so for example, PLT_01696 gets replaced with group_1 say:
Keyfile:
PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
....
PAU_02074 group_2
So I think the best way to do this is to pass a file to sed or some equivalent, and get it to read and look for the entry in column one and replace it with whatever I've paired it with in column two. This file will have about 350 individual keys in the end, which will end up sorted into around 12 groups.
And the file would end up looking like:
((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,group_1:0.08160284537473952438)98:0.04771898687201567291,.....
I'm open to alternative suggestions, this just seemed most apparent to me. This is on Ubuntu 14.04 so any solution is fair game really, but I'm much more au fait with bash (and a bit of perl).
One solution in such cases is to write a sed script that writes the sed script you want to execute. It appears that operons are preceded by either ( or , and are always followed by :. So, given your file containing mappings such as:
PLT_01736 group_1
then for each line in that file you want to create a sed operation that looks like:
s/\([,(]\)PLT_01736:/\1group_1:/g
where the g might not be necessary (I don't know if a given operon can appear more than once in a single line). The initial character class captures the ( or , and the \( and \) remember that, and it's followed by the specific ID key, and the colon; the replace operation outputs the remembered character, the replacement text and the colon. The advantage of tracking the preceding and following characters is that if by some mischance you have operons PLT_00100 and PLT_001001 (where one operon is a prefix of the other), tracking the surrounding characters ensures the correct match. Otherwise, you have to ensure that the longest matches appear first in the script, which is fiddlier (sort -r probably sorts that out, but …).
Hence, assuming the mappings are in a file mapping.data, you can use:
sed 's%\([A-Z]*_[0-9]*\) *\(.*\)%s/\\([,(]\\)\1:/\\1\2:/g%' mapping.data > script.sed
sed -f script.sed newick.phylogenetic.tree.data > transformed.data
This uses % in the generating s%%% operation, outputting s/// (it requires some care). The search part of the s%%% looks for zero or more upper-case letters, an underscore, and zero or more digits, capturing that with the \( and \); followed by one or more spaces, followed by some other characters, which are also captured. If the ID keys can have a different structure, then change the matching regex appropriately. I assume that the input data is 'clean', so there's no need to worry about only processing lines with exactly three letters, an underscore and exactly five digits, and there are no trailing blanks. With the two parts (key ID and replacement) isolated, it is just necessary to generate the output s/// command, remembering to double up the backslashes that must appear in the output.
Given your input data and list of keys, the output I get is:
((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);
I'll bite. Let's call the script phylo.awk:
NR==FNR { pattern[NR] = $1; replacement[NR] = $2; count++; next }
{
    for (i = 1; i <= count; i++) {
        sub(pattern[i], replacement[i])
    }
    print $0
}
Then say:
awk -f phylo.awk patterns data
#!/bin/bash
while read -r i; do                    # enter your loop
    a=$(echo "$i" | cut -d" " -f1)     # get what to find
    b=$(echo "$i" | cut -d" " -f2)     # get what to replace with
    sed -i "s/$a/$b/g" input.txt       # find and replace; -i is "in place"
done <ref.txt                          # the file you're looping through
input:
((PLT_01736:0.06834090301258281819,(((PLT_01758:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((PAU_02074:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,PLT_01696:0.01562657531699716829);
ref:
PLT_01696 group_1
PLT_01736 group_1
PLT_01758 group_1
PAU_02074 group_2
output:
((group_1:0.06834090301258281819,(((group_1:0.04822932915066913823,PLT_01716:0.08160284537473952438)98:0.04771898687201567291,((group_2:0.04683560272944583408,PAK_01896:0.02826787310445108212)95:0.03010698277052889504,PLT_02424:0.06991513512243620332)99:0.06172493035971356873)90:0.05291396820697712167,((PAK_02014:0.00000187538096058579,PAU_02206:0.00721521619460035492)100:0.43252725913741080221,((PLT_02568:0.06262043352060168988,(PAU_01961:0.02293694470289835488,PAK_01787:0.01049771144617094552)98:0.05833869619359682152)100:0.65266156617675985530,(PAK_03203:0.06403695571262699171,PAU_03392:0.03453883849938884504)99:0.10276841868475847241)2:0.14443958710162313475)10:0.20176450294539299835)9:0.01245548664398392694)92:0.05176685581730120639,(PAK_02606:0.03709141633854080161,PAU_02775:0.01796540370573110335)57:0.01492069367348663675,group_1:0.01562657531699716829);

Dynamic delimiter in Unix

Input:
echo "1234ABC89,234" # A
echo "0520001DEF78,66" # B
echo "46545455KRJ21,00"
From the above strings, I need to split the characters to get the alphabetic field and the number that follows it.
From "1234ABC89,234", the output should be:
ABC
89,234
From "0520001DEF78,66", the output should be:
DEF
78,66
I have many strings that I need to split like this.
Here is my script so far:
echo "1234ABC89,234" | cut -d',' -f1
but it gives me 1234ABC89 which isn't what I want.
Assuming that you want to discard leading digits only, and that the letters will be all upper case, the following should work:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\n\2/'
This works fine with GNU sed (I have 4.2.2), but other sed implementations might not like the \n, in which case you'll need to substitute something else.
Depending on the version of sed you can try:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1\n\2/'
or:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1$\2/' | tr '$' '\n'
DEF
78,66
Explanation: the regular expression replaces the input with the expected output, except that instead of the newline it puts a "$" sign, which we then replace with a newline using the tr command.
Where do the strings come from? Are they read from a file (or other source external to the script), or are they stored in the script? If they're in the script, you should simply reformat the data so it is easier to manage. Therefore, it is sensible to assume they come from an external data source such as a file or being piped to the script.
You could simply feed the data through sed:
sed 's/^[0-9]*\([A-Z]*\)/\1 /' |
while read alpha number
do
    …process the two fields…
done
The only trick to watch there is that if you set variables in the loop, they won't necessarily be visible to the script after the done. There are ways around that problem — some of which depend on which shell you use. This much is the same in any derivative of the Bourne shell.
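One bash-specific way around that problem is process substitution, which keeps the loop in the current shell (a sketch, assuming the strings live in a hypothetical input.txt):
count=0
while read -r alpha number; do
    printf 'alpha=%s number=%s\n' "$alpha" "$number"
    count=$((count + 1))
done < <(sed 's/^[0-9]*\([A-Z]*\)/\1 /' input.txt)
echo "count=$count"   # still visible: the loop ran in the current shell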
You said you have many strings like this, so I recommend, if possible, saving them to a file such as input.txt:
1234ABC89,234
0520001DEF78,66
46545455KRJ21,00
On your command line, try this sed command reading input.txt as file argument:
$ sed -E 's/([[:digit:]]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt
ABC 89,234
DEF 78,66
KRJ 21,00
How it works
uses -E for extended regular expressions to save on typing; otherwise, for example for grouping, we would have to escape \(
uses grouping with ( and ), searching for three groups:
firstly digits; + specifies one or more of them. Oddly, using [0-9] resulted in an extra blank space in the results, so the POSIX class [[:digit:]] is used instead
the next is to search for POSIX alphabetical characters, regardless of whether lowercase or uppercase, and {3} specifies to search for 3 of them
the last group searches for ., meaning any character, with + for one or more times
\2\t\3 then returns group 2 and group 3, with a tab separator
Thus you are able to extract two separate fields per line, just separated by tab, for easier manipulation later.

String comparison precedence in Bash

The following example will compare all files in a directory to the input string ($string) and return the matching filename. It is not a very elegant or efficient way of accomplishing that. For speed purposes I modified the for condition to only compare files that start with the first word of $string.
The problem with this script is the following - I have two files in the directory:
Foo Bar.txt
Foo Bar Foo.txt
and I compare them to the string "Foo Bar 09.20.2010". This will return both files in that directory, as both match. But I need to return only the file that matches the string most exactly - in our example it should be Foo Bar.txt.
Also if you have better ideas how to solve this problem please post your ideas as I am not that proficient in scripting yet and I am sure there are better and maybe even easier ways of doing this.
#!/bin/bash
string="Foo Bar 09.20.2010"
for file in /path/to/directory/$(echo "$string" | awk '{print $1}')*; do
    filename="${file##*/}"
    filename="${filename%.*}"
    if [[ $(echo "$string" | grep -i "^$filename") ]]; then
        result="$file"
        echo "$result"
    fi
done
Here is a breakdown of what I want to achieve: two files in the directory to match against two strings; Correct/Incorrect in brackets indicates whether the result was as I expected/wanted or not.
2 files in the directory (extensions stripped off for matching):
Foo Bar.txt
Foo Bar Foo.txt
To compare against 2 Strings:
Foo Bar Random Additional Text
Foo Bar Foo Random Additional Text
Results:
compare "Foo Bar"(.txt) against Foo Bar Random Additional Text -> Match (Correct)
compare "Foo Bar"(.txt) against Foo Bar Foo Random Additional Text -> Match (Incorrect)
compare "Foo Bar Foo"(.txt) against Foo Bar Random Additional Text -> NOT Match (Correct)
compare "Foo Bar Foo"(.txt) against Foo Bar Foo Random Additional Text -> Match (Correct)
Thank you everyone for your answers.
Correct me if I'm wrong, but it appears that your script is equivalent to:
ls /path/to/directory/"$string"*
If you only want one file name out of it, you can use head. Since ls lists files alphabetically you'll get the first one in alphabetical order.
(Notice that when ls's output is piped to another program it prints one file name per line, making it easier to process than its normal column-based output.)
ls /path/to/directory/"$string"* | head -1
For the shortest match try something like the following, which uses an awkward combination of awk, sort -n, and cut to order the lines from shortest to longest and then print the first one.
ls /path/to/directory/"$string"* |
awk '{print length($0) "\t" $0}' | sort -n | head -1 | cut -f 2-
A lot of your echo and awk calls are superfluous. To get all the files that begin with your match, you can simply evaluate "$string"*.
e.g. both
echo "$string"*
and
ls "$string"*
Will generate your lists. (In a pipe, echo will have them space-separated, and ls will have them newline-separated).
The next step is to realize that given this, as you have defined it, your extra constraint of "most exact match" is equivalent to the shortest matching filename.
To find the shortest string in a set of strings in bash (I'd prefer perl myself, but let's stick with the constraint of doing it in bash):
for fn in "/path/to/$string"*; do
    echo $(echo "$fn" | wc -c) "$fn"
done | sort -n | head -1 | cut -f2- -d' '
The for loop loops over the expanded filenames. The echo prepends the length of each name to the name. We then pipe the entire output into sort -n and head -1 to get the shortest name, and cut -f2- -d' ' strips the length off of it (taking the second field onward, with a space as the field separator).
The key with shell programming is knowing your building blocks, and how to combine them. With clever combinations of sort, head, tail, and cut you can do a lot of pretty sophisticated processing. Throw in sed and uniq and you are already able to do some quite impressive things.
That being said, I usually only use the shell for things like this "on-the-fly" -- for anything that I might want to re-use and that is at all complex I would be much more likely to use perl.
