Copy lines from one file to another, skipping multiple occurrences - bash

I have a big file, results.txt, that I want to take certain lines out of and put into another file. The data I want to take out are two variables, omega and alpha. However, in results.txt there are two occurrences of omega and alpha for each set of data, and I only want the second set. I am not sure how to proceed. I know I should use sed, but I don't know how, since I have only found help about replacing lines using sed. Any help would be appreciated. Thank you very much.
--- Sorry I was on mobile when I asked the question. Didn't know how to insert code. ---
So my file looks something like
Very big list of useless output
.
.
.
Results 1:
Omega = 121
Distance = 18.7037218936
Alpha = -1.05958217593e-05
Result 5 = 18983
Result 6 = 1231.903
-------------------------
Results 1:
Omega = 121
Distance = 18.7037218936
Alpha = -1.05958217593e-05
Result 5 = 18983
Result 6 = 1231.903
-------------------------
Second useless output for the next data set
.
.
.
The next data set begins after both sets of results. I have 600 data sets. I want to print Omega and Alpha from the second set of results of each dataset to some other file, preferably in two columns, though I don't know whether that is possible.
I have tried using sed but the documentation I have found only talks about replacing words I searched for. Thanks for any help!

Made a test file for you:
$ cat > results.txt
foo
alpha 1
omega 1
foo
alpha 2
omega 2
foo
$ tac results.txt|grep -m 1 alpha; tac results.txt |grep -m 1 omega
alpha 2
omega 2
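For the actual results.txt layout shown in the question, a rough awk sketch, assuming the second results block of each dataset is always the even-numbered occurrence of Omega/Alpha and that Alpha follows Omega within a block (the output name omega_alpha.txt is just an example):
# Keep every second "Omega =" / "Alpha =" line (the second results block of
# each dataset) and print the pair as two columns.
awk '/^Omega = / { if (++o % 2 == 0) omega = $3 }
     /^Alpha = / { if (++a % 2 == 0) print omega, $3 }' results.txt > omega_alpha.txt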

Related

Filter sequences with more than 8 same consecutive nucleotides in a fastq file?

I want to filter my sequences which have more than 8 identical consecutive nucleotides, like "GGGGGGGG", "CCCCCCCC", etc., in my fastq files.
How should I do that?
The quick and incorrect way, which might be close enough: grep -E -B1 -A2 'A{8}|C{8}|G{8}|T{8}' yourfile.fastq.
This will miss blocks where the 8-mer is split across two lines (e.g. the first line ends with AAAA and the second starts with AAAA). It also assumes the file has blocks of 4 lines each.
The proper way: write a little program (in Python, or a language of your choice) which buffers one FASTQ block (e.g. 4 lines) and checks that the concatenation of the previous (buffered) block's sequence and the current block's sequence does not contain an 8-mer as above. If that's the case, then output the buffered block.
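If the reads are standard 4-line FASTQ records (sequence on the second line of each record), a sketch of that record-buffering idea with awk could look like the following; note it does not handle sequences wrapped across multiple lines, and the output name filtered.fastq is just an example:
# Buffer each 4-line record and print it only if its sequence line
# contains no run of 8 identical nucleotides.
awk 'NR % 4 == 2 { keep = ($0 !~ /AAAAAAAA|CCCCCCCC|GGGGGGGG|TTTTTTTT/) }
     { buf[NR % 4] = $0 }
     NR % 4 == 0 && keep { print buf[1]; print buf[2]; print buf[3]; print buf[0] }' \
    yourfile.fastq > filtered.fastq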
I ended up using the R code below, which solved my problem.
library(ShortRead)
fq <- FastqFile("/Users/path/to/file")
reads_fq <- readFastq(fq)
trimmed_fq <- reads_fq[grep("GGGGGGGG|TTTTTTTTT|AAAAAAAAA|CCCCCCCCC",
                            sread(reads_fq), invert = TRUE)]
writeFastq(trimmed_fq, "new_name_for_fq.fastq", compress = FALSE)
You can use the Python package biotite for it (https://www.biotite-python.org).
Let's say you have the following FASTQ file:
@Read:01
CCCAAGGGCCCCCCCCCACTGCGATCACCTGGTTGCTGCCGGGAAAGGAGACCCAGGAGGTGAAACGGACTGGTGAATTG
CGGGGGTAGATATGGCGGGTGACACAAAAACATATAATCGGGCC
+
.+.+:'-FEAC-4'4CA-3-5#/4+?*G#?,<)<E&5(*82C9FH4G315F*DF8-4%F"9?H5535F7%?7#+6!FDC&
+4=4+,#2A)8!1B#,HA18)1*D1A-.HGAED%?-G10'6>:2
@Read:02
AACACTACTTCGCTGTCGCCAAAGGTTGGTGTAGGTCGGACTTCGAATTATCGATACTAGTTAGTAGTACGTCGCGTGGC
GTCAGCTCGTATGCTCTCAGAACAGGGAGAACTAGCACCGTAAGTAACCTAGCTCCCAAC
+
6%9,#'4A0&%.19,1E)E?!9/$.#?(!H2?+E"")?6:=F&FE91-*&',,;;$&?#2A"F.$1)%'"CB?5$<.F/$
7055E>#+/650B6H<8+A%$!A=0>?'#",8:#5%18&+3>'8:28+:5F0);E9<=,+
This is a script that should do the job:
import biotite.sequence.io.fastq as fastq
import biotite.sequence as seq

# 'GGGGGGGG', 'CCCCCCCC', etc.
consecutive_nucs = [seq.NucleotideSequence(nuc * 8) for nuc in "ACGT"]
fastq_file = fastq.FastqFile("Sanger")
fastq_file.read("example.fastq")
# Iterate over the sequence entries in the file
for header in fastq_file:
    sequence = fastq_file.get_sequence(header)
    # Iterate over each of the consecutive sequences
    for consecutive_nuc in consecutive_nucs:
        # Find all indices where a match was found
        matches = seq.find_subsequence(sequence, consecutive_nuc)
        if len(matches) > 0:
            # If any match was found, report it
            print(
                f"Found '{consecutive_nuc}' "
                f"in sequence '{header}' at position {matches[0]}"
            )
This is the output:
Found 'CCCCCCCC' in sequence 'Read:01' at position 8

How do I concatenate lines from a text file into one big string?

I have an input file that looks like this (without such big spaces between lines):
3 4
ATCGA
GACTTACA
AACTGTA
ATC
...and I need to concatenate all lines except for the first "3 4" line. Is there a simple solution? I've tried manipulating getline() somehow, but that has not worked for me.
Edit: The number of lines will not be known initially, so it will have to be done recursively.
If you want to concatenate 2 lines into 1 line, then you can easily concatenate with "+",
e.g:
String a = "WAQAR MUGHAL";
String b = "check";
System.out.println(a + b);
System.out.println("WAQAR MUGHAL" + "CHECK");
Output:
WAQAR MUGHALcheck
WAQAR MUGHALCHECK
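For the original file-based problem, joining every line except the first "3 4" line into one big string, a minimal shell sketch could also do it (assuming the input is in a file called input.txt):
# drop the first line, then delete all newlines
tail -n +2 input.txt | tr -d '\n'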

Automatically increment filename VideoWriter MATLAB

I have MATLAB set to record three webcams at the same time. I want to capture and save each feed to a file and automatically increment the file name, so that it becomes experiment_0001.avi, followed by experiment_0002.avi, etc.
My code looks like this at the moment
set(vid1,'LoggingMode','disk');
set(vid2,'LoggingMode','disk');
avi1 = VideoWriter('X:\ABC\Data Collection\Presentations\Correct\ExperimentA_002.AVI');
avi2 = VideoWriter('X:\ABC\Data Collection\Presentations\Correct\ExperimentB_002.AVI');
set(vid1,'DiskLogger',avi1);
set(vid2,'DiskLogger',avi2);
and I am incrementing the 002 each time.
Any thoughts on how to implement this efficiently?
Thanks.
Don't forget MATLAB has some roots in the C programming language. That means things like sprintf will work.
Since you are printing out an integer value zero-padded to 3 digits, you would need something like sprintf('%03d',n): the % means there is a value to print that isn't literal text, the 0 means zero-pad on the left, the 3 means pad to 3 digits, and the d means the value itself is an integer.
Just use sprintf in place of the plain string. The s means "string print formatted", so it outputs a string. Here is an idea of what you might do:
set(vid1,'LoggingMode','disk');
set(vid2,'LoggingMode','disk');
for n = 1:max_num_captures
    % backslashes must be doubled inside a sprintf format string
    avi1 = VideoWriter(sprintf('X:\\ABC\\Data Collection\\Presentations\\Correct\\ExperimentA_%03d.AVI',n));
    avi2 = VideoWriter(sprintf('X:\\ABC\\Data Collection\\Presentations\\Correct\\ExperimentB_%03d.AVI',n));
    set(vid1,'DiskLogger',avi1);
    set(vid2,'DiskLogger',avi2);
end

Substituting string labels by integer IDs and back

My data files contain lines with the first entity being a string label followed by features. For example:
MEMO |f write down this note
CALL |f call jim's cell
The problem is that Vowpal Wabbit accepts only integer labels. How can I quickly change from string labels to unique integer IDs and back? That is, quickly modify the data file to:
1 |f write down this note
2 |f call jim's cell
... and back when needed.
For my sample dataset I did it manually for each class using sed, but this seriously breaks my workflow.
cat input.data | perl -nale '$i=$m{$F[0]}; $i or $i=$m{$F[0]}=++$n; $F[0]=$i; print "@F"; END{warn "$_ $m{$_}\n" for sort {$m{$a}<=>$m{$b}} keys %m}' > output.data 2> mapping.txt
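To go back from integer IDs to the original labels, the mapping file written to stderr above can be reused. This is only a sketch, assuming mapping.txt holds "LABEL ID" pairs as produced by the one-liner and that the file to translate back (predictions.txt is a hypothetical name) has the integer ID as its first field:
# Build an ID -> label lookup from mapping.txt, then rewrite the first field.
awk 'NR == FNR { label[$2] = $1; next } { $1 = label[$1]; print }' \
    mapping.txt predictions.txt > predictions.labeled.txt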

sed: flexible template w/ line number constraint

Problem
I need to insert text of arbitrary length (# of lines) into a template while maintaining an exact number of total lines.
Sample source data file:
You have a hold available for pickup as of 2012-01-13:
Title: Really Long Test Title Regarding Random Gibberish. Volume 1, A-B, United States
and affiliated territories, United Nations, countries of the world
Author: Barrel Roll Morton
Title: How to Compromise Free Speech Using Everyday Tools. Volume XXVI
Author: Lamar Smith
#end-of-record
You have a hold available for pickup as of 2012-01-13:
Title: Selling Out Democracy For Fun and Profit. Volume 1, A-B, United States
Author: Lamar Smith
Copy: 12
#end-of-record
Sample Template ( simplified for brevity ):
<%CUST-NAME%>
<%CUST-ADDR%>
<%CUST-CTY-ZIP%>
<%TITLES GO HERE%>
<%STORE-NAME%>
<%STORE-ADDR%>
<%STORE-CTY-ZIP%>
At this point I use bash's 'mapfile' to load the source file record by record using the /^#end-of-record/ regex ...so far so good.
Then I pull predictable aspects of each record according to the line on which they occur, then process the info using a series of sed search-and-replace statements.
The Hang-Up
So the problem is the unknown number of 'title' records that could occur.
How can I accommodate an unknown number of titles and always have output
of precisely 65 lines?
Given that title records always occur starting on line 8, I can pull the
titles easily with:
sed -n '8,$p' test-match.txt
However, how can I insert this within an allotted space, ex, between <%CUST-CTY-ZIP%> and <%STORE-NAME%> without pushing the store info out of place in the template?
My idea so far:
-first send the customer info through:
Ex.
sed 's/<%CUST-NAME%>/Benedict Arnold/' template.txt
-Append title records
???
-Then the store/location info
sed "s/<%STORE-NAME%>/Smith's House of Greasy Palms/" template.txt
I have code and functions for this stuff if interested, but this post is 'windy' as it is.
I just need help with inserting the title records while maintaining the position of the following text and the total line count of 65.
UPDATE
I've decided to change tactics. I'm going to create placeholders in the template for all available lines between the customer and store info, then (see the sketch after this list):
Test whether the line is null in the source.
If yes, replace the placeholder with nothing, leaving the line ending, so the line number is maintained.
If not null, replace the placeholder with the text, again maintaining line numbers and line endings in the template.
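A rough bash sketch of that placeholder idea, with made-up names: the numbered placeholders <%TITLE-01%> through <%TITLE-55%>, the count of 55 spare lines, and the file name filled-template.txt are all hypothetical, and titles containing characters special to sed (such as | or &) would need extra escaping:
# Fill numbered title placeholders from line 8 onward of the source record;
# lines past the last title come back empty, so the total line count is kept.
for i in $(seq 1 55); do
    printf -v tag '<%%TITLE-%02d%%>' "$i"
    line=$(sed -n "$((i + 7))p" test-match.txt)
    sed -i "s|$tag|$line|" filled-template.txt
done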
Eventually, I plan to invest some time looking closer at Triplee's suggestion regarding Perl. The Perl way really does look simpler and easier to maintain if I'm going to be stuck with this project long term.
This might work for you:
cat <<! >titles.txt
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> Title 1
> Title 2
> Title 3
> Title 4
> Title 5
> Title 6
> !
cat <<! >template.txt
> <%CUST-NAME%>
> <%CUST-ADDR%>
> <%CUST-CTY-ZIP%>
>
> <%TITLES GO HERE%>
>
> <%STORE-NAME%>
> <%STORE-ADDR%>
> <%STORE-CTY-ZIP%>
> !
sed '1,7d;:a;$!{N;ba};:b;G;s/\n[^\n]*//5g;tc;bb;:c;s/\n/\\n/g;s|.*|/<%TITLES GO HERE%>/c\\&|' titles.txt |
sed -f - template.txt
<%CUST-NAME%>
<%CUST-ADDR%>
<%CUST-CTY-ZIP%>
Title 1
Title 2
Title 3
Title 4
Title 5
<%STORE-NAME%>
<%STORE-ADDR%>
<%STORE-CTY-ZIP%>
This pads/squeezes the titles to 5 lines (s/\n[^\n]*//5g); if you want fewer or more, change the 5 to the number desired.
This will give you five lines of output regardless of the number of lines in titles.txt:
sed -n '$s/$/\n\n\n\n\n/;8,$p' test-match.txt | head -n 5
Another version:
sed -n '8,$N; ${s/$/\n\n\n\n\n/;s/\(\([^\n]*\n\)\{4\}\).*/\1/p}' test-match.txt
Use one less than the number of lines you want (4 in this example will cause 5 lines of output).
Here's a quick proof of concept using Perl formats. If you are unfamiliar with Perl, I guess you will need some additional help with how to get the values from two different files, but it's quite doable, of course. Here, the data is simply embedded into the script itself.
I set the $titles format to 5 lines instead of the proper value (58 or something?) in order to make this easier to try out in a terminal window, and to demonstrate that the output is indeed truncated when it is longer than the allocated space.
#!/usr/bin/perl
use strict;
use warnings;
use vars (qw($cust_name $cust_addr $cust_cty_zip $titles
$store_name $store_addr $store_cty_zip));
my $fmtline = '#' . '<' x 78;
my $titlefmtline = '^' . '<' x 78;
my $empty = '';
my $fmt = join ("\n$fmtline\n", 'format STDOUT = ',
'$cust_name', '$cust_addr', '$cust_cty_zip', '$empty') .
("\n$titlefmtline\n" . '$titles') x 5 . #58
join ("\n$fmtline\n", '', '$empty',
'$store_name', '$store_addr', '$store_cty_zip');
#print $fmt;
eval "$fmt\n.\n";
$titles = <<____HERE;
Title: Really Long Test Title Regarding Random Gibberish. Volume 1, A-B, United States
and affiliated territories, United Nations, countries of the world
Author: Barrel Roll Morton
Title: How to Compromise Free Speech Using Everyday Tools. Volume XXVI
Author: Lamar Smith
____HERE
# Preserve line breaks -- ^<< will fill lines, but preserves line breaks on \r
$titles =~ s/\n/\r\n/g;
while (<DATA>) {
    chomp;
    ($cust_name, $cust_addr, $cust_cty_zip, $store_name, $store_addr, $store_cty_zip)
        = split (",");
    write STDOUT;
}
__END__
Charlie Bravo,23 Alpa St,Delta ND 12345,Spamazon,98 Spamway,Atlanta GA 98765
The use of $empty to get an empty line is pretty ugly, but I wanted to keep the format as regular as possible. I'm sure it could be avoided, but at the cost of additional code complexity IMHO.
If you are unfamiliar with Perl, the use strict is a complication, but a practical necessity; it requires you to declare your variables either with use vars or my. It is a best practice which helps immensely if you try to make changes to the script.
Here documents with <<HERE work like in shell scripts; they allow you to create a multi-line string easily.
The x operator is for repetition; 'string' x 3 is 'stringstringstring' and ("list") x 3 is ("list", "list", "list"). The dot operator is string concatenation; that is, "foo" . "bar" is "foobar".
Finally, the DATA filehandle allows you to put arbitrary data in the script file itself after the __END__ token which signals the end of the program code. For reading from standard input, use <> instead of <DATA>.
