Find lines that have partial matches - bash

So I have a text file that contains a large number of lines. Each line is one long string with no spacing, however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies that the first 4 numbers/letters of the line coincide to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which coincides with a specific instrument. However these lines are not identical. The program I'm using can not do its calculation if there are duplicates of the same station (ie, there are two 1435* stations). I need to find a way to search through my text files and identify if there are any duplicates of the partial strings that represent the stations within the file so that I can delete one or both of the duplicates. If I could have BASH script output the number of the lines containing the duplicates and what the duplicates lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples of this. Your help is appreciated.

If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...

If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
a[substr($0,1,4)]++ # put prefixes to array and count them
}
END { # in the end
for (i in a) { # go thru all indexes
if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
}
}

Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.

Use following Python script(syntax of python 2.7 version used)
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
line_count += 1
if device.has_key(line[:4]):
device[line[:4]] = device[line[:4]] + "," + str(line_count)
else:
device[line[:4]] = str(line_count)
f1.close()
print device
here the script reads each line and initial 4 character of each line are considered as device name and creates a key value pair device with key representing device name and value as line numbers where we find the string(device name)
following would be output
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!

Related

Awk : accommodating the blank cell values

I need to calculate the size of deleted file size per user from lsof command in bash.
But there are few rows which has third column as blank causing issues to sum up values.
For example, in the attached image I need to show the total size by each user type for deleted files, now because of few blank cells in third column the column count is not coming as right and hence the resulting values are not correct too.
I tried using few options to replace blank cells with some dummy text but that is not working well, so need suggestion to solve this problem and also any command to show the resulting size in human readable format ?
I tried to add the output by user type with following command
lsof|grep -i deleted| awk '{a[$5] +=$8} END{for (i in a) print i, a[$i]}'
Above command did not give the right results, so I tried below command to replace blank cells with a dummy text
lsof|grep -i deleted| awk '!$3{$3="NA"}{a[$5] +=$8} END{for (i in a) print i, a[$i]}'
That did not work, so I tried using if condition
lsof|grep -i deleted| awk '{if($3 =="") $3="NA"; a[$5] +=$8} END{for (i in a) print i, a[$i]}'
Assuming you are interested in file owner/size/name, here is a python script (test.py) which can get them :
import re
import sys
user_last_column = 0
for line in sys.stdin:
if user_last_column:
if re.search(r'\(deleted\)$', line):
print("%s %s %s" % (re.sub(r'.* ', '', line[:user_last_column]),
re.sub(r'.* ', '', line[:size_last_column]),
line[name_first_column:][:-11]))
else: # Process first row which is header
user_last_column = line.find('USER') + 4
size_last_column = line.find('SIZE/OFF') + 8
name_first_column = line.find('NAME')
Call it with :
lsof | python test.py | sort -u # [sort -u] to remove duplicates
or
python test.py < /tmp/sample
Explanation
The main thing is to find positions (in characters) of the three pieces of info.

Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

I want to filter my sequences which has more than 8 same consecutive nucleotides like "GGGGGGGG", "CCCCCCCC", etc in my fastq files.
How should I do that?
The quick and incorrect way, which might be close enough: grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq.
This will miss blocks where the 8-mer is split across two lines (e.g. the first line ends with AAAA and the second starts with AAAA). It also assumes the output has blocks of 4 lines each.
The proper way: write a little program (in Python, or a language of your choice) which buffers one FASTQ block (e.g. 4 lines) and checks that the concatenation of the previous (buffered) block's sequence and the current block's sequence do not have an 8-mer as above. If that's the case, then output the buffered block.
I ended up to use below codes in R and solved my problem.
library(ShortRead)
fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)
trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC",
sread(reads_fq), invert = TRUE)]
writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)
You can use the Python package biotite for it (https://www.biotite-python.org).
Let's say you have the following FASTQ file:
#Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G#?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7#+6!FDC&
+4=4+,#2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
#Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,#'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?#2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'#",8:#5%18&+3>'8:28+:5F0);E9<=,+
This is a script, that should do the work:
import biotite.sequence.io.fastq as fastq
import biotite.sequence as seq
# 'GGGGGGGG', 'CCCCCCCC', etc.
consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"]
fastq_file = fastq.FastqFile("Sanger")
fastq_file.read("example.fastq")
# Iterate over sequence entries in file
for header in fastq_file:
sequence = fastq_file.get_sequence(header)
# Iterative over each of the consecutive sequences
for consecutive_nuc in consecutive_nucs:
# Find all indices, where a match was found
matches = seq.find_subsequence(sequence, consecutive_nuc)
if len(matches) > 0:
# If any match was found report it
print(
f"Found '{consecutive_nuc}' "
f"in sequence '{header}' at position {matches[0]}"
)
This is the output:
Found 'CCCCCCCC' in sequence 'Read:01' at pos 8

Using awk to format text

I'm getting hard times understanding how to achieve what I want using awk and after searching for quite some time, I couldn't find the solution I'm looking for.
I have an input text that looks like this:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line
I want to properly format the weird lines between ' (' and ')'. The expected output is as follow:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Looking up on stack overflow I found this :
How to select lines between two marker patterns which may occur multiple times with awk/sed
So what I'm using now is echo $text | awk '/ \(/{flag=1;next}/\)/{flag=0}flag'
Which almost works except it filters out the non-matching lines, here's the output produced by this very last command:
(Element 4)
(Element 1, span 1 to Element 5, span 4)
Anyone knows how-to do this? I'm open to any suggestion, including not-using awk if you know better.
Bonus point if you teach me how to remove syntaxic coloration on my question code blocks :)
Thanks a billion times
Edit: Ok, so I accepted #EdMorton's solution as he provided something using awk (well, GNU awk). However, I'm currently using #aaron's sed voodoo incantations with great success and will probably continue doing so until I hit anything new on that specific usecase.
I strongly suggest reading EdMorton's explanation, last paragraph made my day. If anyone passing by has good ressources regarding awk/sed they can share, feel free to do so in the comments.
Here's how I would do it with GNU sed :
s/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}
Which, for those who don't speak gibberish, means :
remove the leading spaces from lines that start with spaces and an opening bracket
test if the line now start with an opening bracket. If that's the case, do the following :
mark this spot as the label l, which denotes the start of a loop
add a line from the input to the pattern space
test if you now have a closing bracket in your pattern space
if so, jump to the label e
(if not) jump to the label l
mark this spot as the label e, which denotes the end of the code
remove the linefeeds from the pattern space
(implicitly print the pattern space, whether it has been modified or not)
This can probably be refined, but it does the trick :
$ echo """Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line """ | sed 's/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}'
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Edit : if you can disable history expansion (set +H), this sed command is nicer : s/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}
sed is for simple substitutions on individual lines, that is all. If you try to do anything else with it then you are using constructs that became obsolete in the mid-1970s when awk was invented, are almost certainly non-portable and inefficient, are always just a pile of indecipherable arcane runes, and are used today just for mental exercise.
The following uses GNU awk for multi-char RS, RT and the \s shorthand for [[:space:]] and works by simply isolating the (...) strings and then doing whatever you want with them:
$ cat tst.awk
BEGIN {
RS="[(][^)]+[)]" # a regexp for the string you want to isolate in RT
ORS="" # disable appending of newlines so we print as-is
}
{
gsub(/\n[[:blank:]]+$/,"\n") # remove any blanks before RT at the start of each line
sub(/\(\s+/,"(",RT) # remove spaces after ( in RT
sub(/\s+\)/,")",RT) # remove spaces before ) in RT
gsub(/\s+/," ",RT) # compress each chain of spaces to one blank char in RT
print $0 RT # print the result
}
$ awk -f tst.awk file
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
If you're considering using a sed solution for this also consider how you would enhance it if/when you have the slightest requirements change. Any change to the above awk code would be trivial and obvious while a change to the equivalent sed code would require first sacrificing a goat under a blood moon then breaking out your copy of the Rosetta Stone...
It's doable in awk, and maybe there's a slicker way than this. It looks for lines between and including those containing only blanks and either an open or close parenthesis, and processes them specially. Everything else it just prints:
awk '/^ *\( *$/,/^ *\) *$/ {
sub(/^ */, "");
sub(/ *$/, "");
if ($1 ~ /[()]/) hold = hold $1; else hold = hold " " $0
if ($0 ~ /\)/) {
sub(/\( /, "(", hold)
sub(/ \)/, ")", hold)
print hold
hold = ""
}
next
}
{ print }' data
The variable hold is initially empty.
The first pair of sub calls strip leading and trailing blanks (copying the data from the question, there's a blank after span 1 to). The if adds the ( or ) to hold without a space, or the line to hold after a space. If the close parenthesis is present, remove the space after the open parenthesis and before the close parenthesis, print hold, and reset hold to empty. Always skip the rest of the script with next. The rest of the script is { print } — print unconditionally, often written 1 by minimalists.
The file data is copy'n'paste from the data in the question.
Output:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
The 'Another Line' (with capital L) has a trailing blank because the data in the question does.
With awk
$ cat fmt.awk
function rem_wsp(s) { # remove white spaces
gsub(/[\t ]/, "", s)
return s
}
function beg() {return rem_wsp($0)=="("}
function end() {return rem_wsp($0)==")"}
function dump_block() {
print "(" block ")"
}
beg() {
in_block = 1
next
}
end() {
dump_block()
in_block = block = ""
next
}
in_block {
if (length(block)>0) sep = " "
block = block sep $0
next
}
{
print
}
END {
if (in_block) dump_block()
}
Usage:
$ awk -f fmt.awk fime.dat

how do i make my code read random lines 37 different times?

def pick_random_line
chosen_line = nil
File.foreach("id'sForCascade.txt").each_with_index do |line, id|
chosen_line = line if rand < 1.0/(id+1)
end
return chosen_line
end`enter code here
Hey, i'm trying to make that code pick 37 different lines. So how would I do that i'm stuck and confused.
Assuming you don't want the same line to repeat more than once, I would do it in one line like this:
File.read("test.txt").split("\n").shuffle.first(37)
File.read("test.txt") reads the entire file.
split("\n") splits the file to lines based on the \n delimiter (I assume your file is textual and have lines separated by new line character).
shuffle is a very convenient method of Array that shuffles the lines randomly. You can read about it here:
http://docs.ruby-lang.org/en/2.0.0/Array.html#method-i-shuffle
Finally, first(37) gives you the first 37 lines out of the shuffled array. These are guaranteed to be random from the shuffle operation.
You can do something like this:
input_lines = File.foreach("test.txt").map(&:to_s)
output_lines = []
37.times do
output_lines << input_lines.delete_at(rand(input_lines.length))
end
puts output_lines
This will ensure that you aren't grabbing duplicate lines and you don't need to do any fancy checking.
However, if your file is less than 37 lines this may cause a problem, it also assumes that your file exists.
EDIT:
What is happening is the rand call is now changing the range on which it is called based on the size of the input lines. And since you are deleting at an index when you take the line out, the length shrinks and you do not risk duplicating lines.
If you want to save relatively few lines from a large file, reading the entire file into an array (and then randomly selecting lines) could be costly. It might be better to count the number of lines in the file, randomly select line offsets and then save the lines at those offsets to an array. This approach is no more difficult to implement than the former one, but makes the method more robust, even if the files in the current application are not overly large.1
Suppose your filename were given by FName. Here are three ways to count the numbers of lines in the file:
Count lines, literally
cnt = File.foreach(FName).reduce(0) { |c,_| c+1 }
Use $.
File.foreach(FName) {}
cnt = $.
On Unix-family computers, shell-out to the operating system
cnt = %x{wc -l #{FName}}.split.first.to_ii
The third option is very fast.
Random offsets (base 1) for n lines to be saved could be computed as follows:
lines = (1..cnt).to_a.sample(n).sort
Saving the lines at those offsets to an array is straightforward; for example:
File.foreach(FName).with_object([]) do |line,a|
if lines.first == $.
a << line
lines.shift
break a if lines.empty?
end
end
Note that $. #=> 1 after the first line is first line is read, and $. is incremented by 1 after each successive line is read. (Hence base 1 for line offsets.)
1 Moreover, many programmers, not just Rubiests, are repelled by the idea of amassing large numbers of anything and then discarding all but a few.

Get random text block from file using bash

what is the simplest way of reading a random block of characters from a text file using bash?
A block is a set of characters which begin with X and end with X, where X is a character sequence, usually it will be "\n\n"
We can assume that file has short lines, less than 200 characters each.
Blocks don't have more than 20 lines.
I have seen threads like get random line, get text from between two tokens, but it's not exacly what I need.
I can write a simple program in C that will read how many blocks are in file, get a random number from a given range and then search for a block with this ID, but there must be an easier way.
Example:
X = "\n\n"
File: (the .'s are not in the file, I used them to make "empty" line at the begginning and end of code)
.
first line
second line and some other text
fourth line
sixth line
seventh line, more textęęę
.
Running the script for first time, output:
fourth line
Running the script for the second time, output:
first line
second line and some other text
Yours faithfully,
user2420535
To get a uniformly random block from a file of blank-line-separated blocks in one pass,
awk -v RS='\n\n' '
BEGIN { srand(); }
rand() < 1.0/NR { s=$0; }
END { print s; }
' file
This is a simple case of Reservoir Sampling.

Resources