How do I call sort so that the following list is the result for any shuffled ordering of its lines? In particular, standard sort sorts [0-9] before [A-Za-z], but I need [A-Za-z] before [0-9]. I read the manpage, but nothing seems to fit. I read that the locale influences the sorting of individual characters, but which locale is the right one?
01_abc
02_abc
02_01_abc
02_02_abc
02_02_01_abc
02_02_02_abc
02_02_03_abc
02_03_abc
03_abc
04_abc
It's pretty ugly, but given a file data containing the sample data, and a program shuffle that randomizes the sequence of the lines in the file, and sed and sort, you can use:
shuffle data |
sed -e 's/[a-z][a-z]*/a:&/g' -e 's/[0-9][0-9]/d:&/g' |
sort |
sed 's/[ad]://g'
The first sed tags strings of letters with a: (for alpha), and strings of digits with d: (for digit), where the crucial point is that a comes before d in any plausible (ascending) sort order. This means that the data gets sorted with letters before digits. The second sed removes the tags.
Example of the steps:
$ shuffle data
02_03_abc
04_abc
02_02_abc
02_02_02_abc
03_abc
02_02_01_abc
02_01_abc
02_abc
02_02_03_abc
01_abc
$ shuffle data | sed -e 's/[a-z][a-z]*/a:&/g' -e 's/[0-9][0-9]/d:&/g'
d:01_a:abc
d:04_a:abc
d:02_d:02_d:02_a:abc
d:02_a:abc
d:03_a:abc
d:02_d:03_a:abc
d:02_d:02_a:abc
d:02_d:02_d:03_a:abc
d:02_d:02_d:01_a:abc
d:02_d:01_a:abc
$ shuffle data | sed -e 's/[a-z][a-z]*/a:&/g' -e 's/[0-9][0-9]/d:&/g' | sort
d:01_a:abc
d:02_a:abc
d:02_d:01_a:abc
d:02_d:02_a:abc
d:02_d:02_d:01_a:abc
d:02_d:02_d:02_a:abc
d:02_d:02_d:03_a:abc
d:02_d:03_a:abc
d:03_a:abc
d:04_a:abc
$ shuffle data | sed -e 's/[a-z][a-z]*/a:&/g' -e 's/[0-9][0-9]/d:&/g' | sort | sed 's/[ad]://g'
01_abc
02_abc
02_01_abc
02_02_abc
02_02_01_abc
02_02_02_abc
02_02_03_abc
02_03_abc
03_abc
04_abc
$
It is important to tag the letter sequences before tagging the number sequences; otherwise the d in each d: tag would itself be tagged as a letter sequence. Note that each time it is run (at least for small numbers of runs), shuffle produces a different permutation of the data.
This technique, modifying the input so that the key can be sorted, can be applied to other sort operations too. Sometimes (often, even), it is better to prefix the key data to the unmodified line, separating the parts with, for example, a tab. This makes it easier to remove the sort key. For example, if you need to sort dates with alphabetic month names, you may well need to map the month names to numbers.
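For comparison, here is a minimal Python sketch of the same tag-then-sort idea. The function name letters_first_key is made up for illustration, and the sketch tags [A-Za-z] runs rather than only [a-z]; instead of editing the lines and stripping the tags afterwards, it uses the tagged string as a sort key, so the original lines are never modified.

```python
import random
import re

def letters_first_key(line):
    """Build a sort key in which letter runs sort before digit runs."""
    # Tag letters first; otherwise the 'd' of the 'd:' tag would be tagged too.
    tagged = re.sub(r"[A-Za-z]+", lambda m: "a:" + m.group(0), line)
    return re.sub(r"[0-9]+", lambda m: "d:" + m.group(0), tagged)

expected = [
    "01_abc", "02_abc", "02_01_abc", "02_02_abc", "02_02_01_abc",
    "02_02_02_abc", "02_02_03_abc", "02_03_abc", "03_abc", "04_abc",
]

shuffled = expected[:]
random.shuffle(shuffled)          # stands in for the shuffle program
result = sorted(shuffled, key=letters_first_key)
print("\n".join(result))
```

Python compares the key strings by code point, so no locale is involved here.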
I am having problems when trying to count the number of times a specific pattern appears in a file (let's call it B). In this case, I have a file with 30 patterns (let's call it A), and I want to know how many lines contain that pattern.
With only one pattern it is quite simple:
grep "pattern" file | wc -l
But with a file full of them I am not able to figure out how it may work. I already tried this:
grep -f "fileA" "fileB" | wc -l
Nevertheless, it gives me the total times all patterns appear, not each one of them (that's what I desire to get).
Thank you so much.
Count matches per literal string
If you simply want to know how often each pattern appears and each of your patterns is a fixed string (not a regex), use ...
grep -oFf needles.txt haystack.txt | sort | uniq -c
Count matching lines per literal string
Note that the above is slightly different from your formulation "I want to know how many lines contain that pattern", as one line can have multiple matches. If you really have to count matching lines per pattern instead of matches per pattern, then things get a little bit trickier:
grep -noFf needles.txt haystack.txt | sort | uniq | cut -d: -f2- | sort | uniq -c
Count matching lines per regex
If the patterns are regexes, you probably have to iterate over the patterns, as grep's output only tells you that (at least) one pattern matched, but not which one.
# this will be very slow if you have many patterns
while IFS= read -r pattern; do
printf '%8d %s\n' "$(grep -ce "$pattern" haystack.txt)" "$pattern"
done < needles.txt
... or use a different tool/language like awk or perl.
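As one possible "different language" version of the loop above, here is a small Python sketch that counts, for each regex, the number of lines in which it matches at least once. The function name count_matching_lines and the sample data are invented for the example, and Python's re syntax is not identical to grep's, so patterns may need adjusting.

```python
import re

def count_matching_lines(patterns, lines):
    """Return {pattern: number of lines in which the pattern matches}."""
    compiled = [(p, re.compile(p)) for p in patterns]
    counts = {p: 0 for p in patterns}
    for line in lines:
        for pattern, regex in compiled:
            if regex.search(line):
                counts[pattern] += 1   # count the line once per pattern
    return counts

# Hypothetical stand-ins for needles.txt and haystack.txt.
patterns = ["ab+", "c$"]
lines = ["abbc", "xab ab", "none", "c"]

for pattern, n in count_matching_lines(patterns, lines).items():
    print(f"{n:8d} {pattern}")
```

This reads the haystack only once, whereas the shell loop rereads it for every pattern.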
Note on overlapping matches
You did not formulate any precise requirements, so I went with the simplest solutions for each case. The first two solutions and the last solution behave differently in case multiple patterns match (part of) the same substring.
grep -f needles.txt matches each substring at most once. Therefore some matches might be "missed" (the interpretation of "missed" depends on your requirements),
whereas iterating grep -e pattern1; grep -e pattern2; ... might match the same substring multiple times.
From the manual of the command sort
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
Versions:
sort: GNU coreutils 5.93
OS: MAC OSX 10.11.6
Bash: GNU bash 3.2.57(1)
Terminal: 2.6.1
It does not quite help me to understand how to use this option. I've seen patterns like -k1 -k2 and -k1,2 (see this post), -k1.2 and -k1.2n (see this post) and -k3 -k1 -k4 (see this post).
How does the flag --key (-k) work for the command sort?
I only have a vague intuition about what can be done with the option -k but if it is handy to consider an example, I would be happy for you to consider numerically (-n) sorting the following input by the numbers that directly follow the word "row". If two records have the same value after the word "row", then sorting could be done numerically on the value that follows the letter "G".
H3_row24_G500.txt
H3_row32_G1000.txt
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
H3_row68_G999.txt
H3_row68_G500.txt
The expected output is
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G500.txt
H3_row24_G999.txt
H3_row32_G1000.txt
H3_row68_G500.txt
H3_row68_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
The . specifies a starting position within a single field. You want to sort numerically on fields 2 (starting at character 4) and 3 (starting at character 2). The following should work:
sort -t_ -k2.4n -k3.2n tmp.txt
-t_ specifies the field separator
The first key is 2.4n
The second key, if the first keys are equal, is 3.2n
Technically, .txt is part of field 3, but when you ask for numeric sorting, the trailing non-digit characters are ignored.
(More correctly, -k2.4,2n -k3.2,3n prevents any additional fields from being included in each key; I think the simpler form shown above works because any overlap is "overwritten", as it were. n prevents field 3 by itself from being treated as a number, and there is no field 4.)
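To make the field and character offsets concrete, here is a rough Python equivalent of -t_ -k2.4n -k3.2n. The helper names leading_int and row_then_g are made up for this sketch, and leading_int only loosely imitates how -n reads a leading number and ignores what follows.

```python
import re

def leading_int(text):
    """Read the leading digits of text, ignoring the rest (0 if there are none)."""
    match = re.match(r"\d+", text)
    return int(match.group(0)) if match else 0

def row_then_g(name):
    fields = name.split("_")                # like -t_
    return (leading_int(fields[1][3:]),     # like -k2.4n: field 2, from character 4
            leading_int(fields[2][1:]))     # like -k3.2n: field 3, from character 2

names = [
    "H3_row24_G500.txt", "H3_row32_G1000.txt", "H3_row9_G999.txt",
    "H3_row9_G1000.txt", "H3_row24_G999.txt", "H3_row102_G500.txt",
    "H3_row2400_G999.txt", "H3_row68_G999.txt", "H3_row68_G500.txt",
]

ordered = sorted(names, key=row_then_g)
print("\n".join(ordered))
```

Fields and characters are counted from 1 in sort but from 0 in Python, hence fields[1][3:] for "field 2, character 4".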
From the manpage:
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number
and C a character position in the field; both are origin 1, and the stop position defaults
to the line's end. If neither -t nor -b is in effect, characters in a field are counted
from the beginning of the preceding whitespace. OPTS is one or more single-letter
ordering options [bdfgiMhnRrV], which override global ordering options for that key. If no key
is given, use the entire line as the key. Use --debug to diagnose incorrect key usage.
The implication is that sort splits lines into fields. The period separator is used to offset into the field. With _ as your separator, you'd use an offset of 4.
In this case, the field delimiter isn't whitespace and so you would need to specify it using the -t option.
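By default, sort compares lines using the locale's collation order, and it looks like you want these sorted numerically. The -n switch does this. Note that the command below sorts only on the number after "row"; it does not break ties numerically on the value after "G".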
sort -t _ -k 2.4 -n
This isn't really a programming question, but here goes:
If you're using GNU sort, your desired output can be achieved by sort -V:
$ echo 'H3_row24_G500.txt
H3_row32_G1000.txt
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
H3_row68_G999.txt
H3_row68_G500.txt' | sort -V
H3_row9_G999.txt
H3_row9_G1000.txt
H3_row24_G500.txt
H3_row24_G999.txt
H3_row32_G1000.txt
H3_row68_G500.txt
H3_row68_G999.txt
H3_row102_G500.txt
H3_row2400_G999.txt
That's because -V compares numeric and general string segments separately and H, 3, _row are the same in all lines.
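The idea of comparing digit runs as numbers and everything else as text can be sketched in Python as follows. This is only an approximation of what -V does, since GNU version sort has extra rules (for suffixes, for example) that are not reproduced here; natural_key is a name chosen for the example.

```python
import re

def natural_key(name):
    """Split name into alternating text and digit runs; digit runs become ints."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions always hold the digit runs,
    # so keys of different lines stay comparable element by element.
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]

names = [
    "H3_row24_G500.txt", "H3_row32_G1000.txt", "H3_row9_G999.txt",
    "H3_row9_G1000.txt", "H3_row24_G999.txt", "H3_row102_G500.txt",
    "H3_row2400_G999.txt", "H3_row68_G999.txt", "H3_row68_G500.txt",
]

version_sorted = sorted(names, key=natural_key)
print("\n".join(version_sorted))
```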
I am working on an RNA-Seq data set consisting of around 24000 rows (genes) and 1100 columns (samples), which is tab separated. For the analysis, I need to choose a specific gene set. Is there a method to extract rows based on row number? That would be easier for me than using the gene names.
Below is an example of the data (4x4):
gene Sample1 Sample2 Sample3
A1BG 5658 5897 6064
AURKA 3656 3484 3415
AURKB 9479 10542 9895
From this, say, I want rows 1, 3 and 4, with no specific pattern to the selection.
I have also asked on biostars.org.
You may use a for-loop to build the sed options like below
var=-n
for i in 1 3,4 # Put your space separated ranges here
do
var="${var} -e ${i}p"
done
sed $var filename
Note: In any case, the requirement mentioned here would still be a pain, as it involves too much typing.
Say you have a file, or a program that generates a list of the line numbers you want, you could edit that with sed to make it into a script that prints those lines and passes it to a second invocation of sed.
In concrete terms, say you have a file called lines that says which lines you want (or it could equally be a program that generates the lines on its stdout):
1
3
4
You can make that into a sed script like this:
sed 's/$/p/' lines
1p
3p
4p
Now you can pass that to another sed as the commands to execute:
sed -n -f <(sed 's/$/p/' lines) FileYouWantLinesFrom
This has the advantage of being independent of the maximum length of arguments you can pass to a script because the sed commands are in a pseudo-file, i.e. not passed as arguments.
If you don't like/use bash and process substitution, you can do the same like this:
sed 's/$/p/' lines | sed -n -f /dev/stdin FileYouWantLinesFrom
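If a scripting language is an option, the same selection can be sketched in a few lines of Python. The function name select_lines is invented for this example, and the table is just the sample data from the question with tabs restored.

```python
def select_lines(lines, wanted):
    """Yield the lines whose 1-based line number is in wanted, in file order."""
    wanted = set(wanted)
    for number, line in enumerate(lines, start=1):
        if number in wanted:
            yield line

table = [
    "gene\tSample1\tSample2\tSample3",
    "A1BG\t5658\t5897\t6064",
    "AURKA\t3656\t3484\t3415",
    "AURKB\t9479\t10542\t9895",
]

for row in select_lines(table, [1, 3, 4]):
    print(row)
```

For a real file, an open file object can be passed in place of the list, because the function only iterates over its first argument.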
I have two lists. I need to determine which word from the first list appears most frequently in the second list. The first, list1.txt, contains a list of words, sorted alphabetically, with no duplicates. I have used some scripts which ensure that each word appears on a unique line, e.g.:
canyon
fish
forest
mountain
river
The second file, list2.txt is in UTF-8 and also contains many items. I have also used some scripts to ensure that each word appears on a unique line, but some items are not words, and some might appear many times, e.g.:
fish
canyon
ocean
ocean
ocean
ocean
1423
fish
109
fish
109
109
ocean
The script should output the most frequently matching item. For example, if run with the 2 files above, the output would be “fish”, because that word from list1.txt occurs most often in list2.txt.
Here is what I have so far. First, it searches for each word and creates a CSV file with the matches:
#!/bin/bash
# For each word in list1.txt, count the lines of list2.txt that start with it.
while IFS= read -r line
do
count=$(grep -c "^$line" list2.txt)
echo "$line,$count" >> found.csv
done < ./list1.txt
After that, found.csv is sorted descending by the second column. The output is the word appearing on the first line.
I do not think that this is a good script, though, because it is not very efficient, and there might not be a single most frequent matching item. For example:
If there is a tie between 2 or more words, e.g. “fish”, “canyon”, and “forest” each appear 5 times, while no other appear as often, the output would be these 3 words in alphabetical order, separated by commas, e.g.: “canyon,fish,forest”.
If none of the words from list1.txt appears in list2.txt, then the output is simply the first word from the file list1.txt, e.g. “canyon”.
How can I create a more efficient script which finds which word from the first list appears most often in the second?
You can use the following pipeline:
grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1
-F tells grep to treat the patterns as literal strings rather than regexes, and -f tells it to read them from list1.txt. The rest sorts the matches, counts duplicates, and sorts them according to the number of occurrences. The last part selects the last line, i.e. the most common one (plus the number of occurrences).
awk 'FNR==NR{a[$1]=0;next}($1 in a){a[$1]++}END{for(i in a)print a[i],i}' file1 file2 | sort -rn|head -1
Assuming 'list1.txt' is sorted, I would use Unix join:
sort list2.txt | join -1 1 -2 1 list1.txt - | sort |\
uniq -c | sort -n | tail -n1
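None of the pipelines above handle the tie and no-match cases described in the question. A minimal Python sketch that does, assuming whole-line (exact) matches are what is wanted, could look like this; most_frequent is a name chosen for the example.

```python
from collections import Counter

def most_frequent(words, items):
    """Return the word(s) from words that occur most often in items."""
    wanted = set(words)
    counts = Counter(item for item in items if item in wanted)
    if not counts:
        return words[0]                 # nothing matched: first word of list1
    best = max(counts.values())
    # On a tie, join the winners alphabetically, separated by commas.
    return ",".join(sorted(w for w, n in counts.items() if n == best))

list1 = ["canyon", "fish", "forest", "mountain", "river"]
list2 = ["fish", "canyon", "ocean", "ocean", "ocean", "ocean", "1423",
         "fish", "109", "fish", "109", "109", "ocean"]

print(most_frequent(list1, list2))
```

The second list is read only once, and each membership test is a set lookup, so this avoids running grep once per word.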
Is there a simple way to remove duplicate contents from a large text file? It would be great to be able to detect duplicate sentences (as separated by ".") or, even better, to find duplicates of sentence fragments (such as 4-word pieces of text).
Removing duplicate words is easy enough, as other people have pointed out. Anything more complicated than that, and you're into Natural Language Processing territory. Bash isn't the best tool for that -- you need a slightly more elegant weapon for a civilized age.
Personally, I recommend Python and its NLTK (Natural Language Toolkit). Before you dive into that, it's probably worth reading up a little bit on NLP so that you know what you actually need to do. For example, the "4-word pieces of text" are known as 4-grams (n-grams in the generic case) in the literature. The toolkit will help you find those, and more.
Of course, there are probably alternatives to Python/NLTK, but I'm not familiar with any.
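To give an idea of what working with n-grams looks like, here is a small sketch in plain Python (deliberately without NLTK, so nothing here is NLTK's API). It splits on whitespace only and lowercases the text; repeated_ngrams is a name made up for the example.

```python
from collections import Counter

def repeated_ngrams(text, n=4):
    """Return {n-gram: count} for every n-word sequence seen more than once."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return {gram: count for gram, count in counts.items() if count > 1}

sample = "the quick brown fox jumps and the quick brown fox sleeps"
print(repeated_ngrams(sample))
```

Real text would also need punctuation handling and sentence splitting, which is where a toolkit starts to pay off.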
Remove duplicate phrases while keeping the original order:
nl -w 8 "$infile" | sort -k2 -u | sort -n | cut -f2
The first stage of the pipeline prefixes every line with its line number to record the original order. The second stage sorts on the original data with the unique switch set, which drops the duplicates.
The third stage restores the original order by sorting numerically on the first column. The final cut removes the line numbers.
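The same order-preserving deduplication can be sketched in Python, relying on the fact that dictionaries keep insertion order. The function name unique_in_order is invented here. This version always keeps the first occurrence of each line.

```python
def unique_in_order(lines):
    """Drop repeated lines, keeping the first occurrence of each."""
    return list(dict.fromkeys(lines))

lines = ["foo", "bar", "quux", "foo", "baz", "bar"]
print(unique_in_order(lines))
```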
You can remove duplicate lines (which have to be exactly equal) with uniq if you sort your textfile first.
$ cat foo.txt
foo
bar
quux
foo
baz
bar
$ sort foo.txt
bar
bar
baz
foo
foo
quux
$ sort foo.txt | uniq
bar
baz
foo
quux
Apart from that, there's no simple way of doing what you want. (How will you even split sentences?)
You can use grep with backreferences.
If you write grep "\([[:alpha:]]*\)[[:space:]]*\1" -o <filename> it will match any two identical words following one another. I.e. if the file content is this is the the test file , it will output the the.
(Explanation: [[:alpha:]] matches any letter a-z or A-Z; the asterisk * after it means that it may appear any number of times; \(\) is used for grouping, so that the group can be backreferenced later; [[:space:]]* matches any number of spaces and tabs; and finally \1 matches the exact sequence that was captured by the \(\) group.)
Likewise, if you want to match a group of 4 words that is repeated twice in a row, the expression will look like grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\)[[:space:]]*\1" -o <filename> - it will match e.g. a b c d a b c d.
Now we need to allow an arbitrary character sequence in between the matches. In theory this should be done by inserting .* just before the backreference, i.e. grep "\(\([[:alpha:]]*[[:space:]]*\)\{4\}\).*\1" -o <filename>, but this doesn't seem to work for me - it matches just about any string and ignores said backreference, probably because every part of the group may match the empty string, which the backreference then matches trivially.
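For what it's worth, the doubled-word case can be sketched with Python's re module. This is a different regex dialect from grep's, and the names DOUBLED_WORD and doubled_words are made up for the example. Requiring at least one letter (+ instead of *) and anchoring on word boundaries keeps the group from matching the empty string.

```python
import re

# One or more letters, whitespace, then the same letters again as a whole word.
DOUBLED_WORD = re.compile(r"\b([A-Za-z]+)\s+\1\b")

def doubled_words(text):
    """Return every 'word word' repetition found in text."""
    return [match.group(0) for match in DOUBLED_WORD.finditer(text)]

print(doubled_words("this is the the test file"))
```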
The short answer is that there's no easy method. In general, any solution needs to first decide how to split the input document into chunks (sentences, sets of 4 words each, etc.) and then compare them to find duplicates. If it's important that the ordering of the non-duplicate elements be the same in the output as it was in the input, then this only complicates matters further.
The simplest bash-friendly solution would be to split the input into lines based on whatever criteria you choose (e.g. split on each ., although doing this quote-safely is a bit tricky), then use standard duplicate detection mechanisms (e.g. | sort | uniq -c | sort -n | sed -E -ne '/^[[:space:]]+1 /!{s/^[[:space:]]+[0-9]+ //;p;}'), and then, for each resulting line, remove the text from the input.
Presuming that you had a file that was properly split into lines per "sentence" then
sort lines_of_input_file | uniq -c | sort -n | sed -E -ne '/^[[:space:]]+1 /!{s/^[[:space:]]+[0-9]+ //;p;}' | while IFS= read -r match ; do sed -i '' -e 's/'"$match"'//g' input_file ; done
Might be sufficient. Of course it will break horribly if the $match contains any data which sed interprets as a pattern. Another mechanism should be employed to perform the actual replacement if this is an issue for you.
Note: If you're using GNU sed, drop the empty '' argument after -i; older GNU sed versions also need -r instead of -E.
I just created a script in python, that does pretty much what I wanted originally:
import string
import sys

def find_all(a_str, sub):
    """Yield the start index of each non-overlapping occurrence of sub."""
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

if len(sys.argv) != 2:
    sys.exit("Usage: find_duplicate_fragments.py some_textfile.txt")

file = sys.argv[1]
with open(file, "r") as infile:
    text = infile.read()

text = text.replace('\n', '')  # remove newlines
# remove punctuation characters and numbers
table = str.maketrans("", "", string.punctuation + string.digits)
text = text.translate(table)
text = text.upper()  # to uppercase
while text.find("  ") > -1:
    text = text.replace("  ", " ")  # strip double-spaces

spaces = list(find_all(text, " "))  # find all spaces

# scan through the whole text in packets of four words
# and check for multiple appearances.
for i in range(0, len(spaces) - 4):
    searchfor = text[spaces[i] + 1:spaces[i + 4]]
    duplist = list(find_all(text[spaces[i + 4]:len(text)], searchfor))
    if len(duplist) > 0:
        print(len(duplist), ': ', searchfor)
BTW: I'm a Python newbie, so any hints on better Python practice are welcome!