How to concatenate many files using their basenames? - bash

I study genetic data from 288 fish samples (Fish_one, Fish_two ...)
I have four files per fish, each with a different suffix.
e.g. for sample_name Fish_one:
file 1 = "Fish_one.1.fq.gz"
file 2 = "Fish_one.2.fq.gz"
file 3 = "Fish_one.rem.1.fq.gz"
file 4 = "Fish_one.rem.2.fq.gz"
I would like to apply the following concatenation instructions to all my samples, maybe using a text file containing a list of all the sample names that would be provided to a loop?
cp sample_name.1.fq.gz sample_name.fq.gz
cat sample_name.2.fq.gz >> sample_name.fq.gz
cat sample_name.rem.1.fq.gz >> sample_name.fq.gz
cat sample_name.rem.2.fq.gz >> sample_name.fq.gz
In the end, I would have only one file per sample, ideally in a different folder.
I would be very grateful to receive a bit of help on this one, even though I'm sure the answer is quite simple for a non-novice!
Many thanks,
Noé

In the first place, the name of the cat command is mnemonic for "concatenate". It accepts multiple command-line arguments naming sources to concatenate together to the standard output, which is exactly what you want to do. It is poor form to use a cp and three cats where a single cat would do.
In the second place, although you certainly could use a file of name stems to drive the operation you describe, it's likely that you don't need to go to the trouble to create or maintain such a file. Globbing will probably do the job satisfactorily. As long as there aren't any name stems that need to be excluded, then, I'd probably go with something like this:
for f in *.rem.1.fq.gz; do
    stem=${f%.rem.1.fq.gz}
    cat "$stem".{1,2,rem.1,rem.2}.fq.gz > "${other_dir}/${stem}.fq.gz"
done
That recognizes the groups present in the current working directory by the members whose names end with .rem.1.fq.gz. It extracts the common name stem from that member's name, then concatenates the four members to the correspondingly-named output file in the directory identified by ${other_dir}. It relies on brace expansion to form the arguments to cat, so as to minimize code and (IMO) improve clarity.
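If you do prefer to drive the loop from a text file of sample names, as you suggested, a minimal sketch could look like the following (the samples.txt file, with one sample name per line, and the combined/ output directory are assumptions to adapt):
#!/bin/bash
out_dir=combined                   # hypothetical output directory
mkdir -p "$out_dir"
while IFS= read -r stem; do
    cat "$stem".{1,2,rem.1,rem.2}.fq.gz > "${out_dir}/${stem}.fq.gz"
done < samples.txt                 # one sample name (e.g. Fish_one) per line
Either way, concatenating the .fq.gz files directly is fine: a stream of concatenated gzip members is itself a valid gzip file, so downstream tools can read the result without recompression.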

Related

Shell help text syntax for repeatable group of arguments

I'm writing a help output for a Bash script. Currently it looks like this:
dl [m|r]… (<file>|<URL> [m|r|<index>]…)…
The meaning that I'm trying to convey (and elsewhere describe with words) is that (after a potential "m" and/or "r") there can be an endless list of sets of arguments. The first argument in each set is always a file or URL and the further arguments can each be "m", "r" or a number. After that, it starts over with a file or URL and so on.
In my special case, I could just write this:
dl [m|r]… (<file>|<URL>) (<file>|<URL>|m|r|<index>)…
This works, because listing a URL and then another URL with nothing in between is allowed, as well as listing an arbitrarily long chain of "m"s (it's just useless to do so) and pretty much any other combination.
But what if that wasn't the case? What if I had for example a command like this:
change (<from> <to>)…
…which would be used e.g. like this:
change from1 to1 from2 to2 from3 to3
Would the bracket syntax be correct here? I just guessed it based on the grouping of (a|b), but I wasn't able to find any documentation that uses this for multiple, non-exclusive arguments that belong together. Is there even a standard for this?

Rule-specific wildcards in Snakemake

I often find when adding rules to my workflow that I need to split large jobs up into batches. This means that my input/output files will branch out across temporary sets of batches for some rules before consolidating again into one input file for a later rule. For example:
rule all:
    input:
        expand("final_output/{sample}.counts", sample=config["samples"]) ## this final output relates to blast rule in that it will feature a column defining transcript type
...
rule batch_prep:
    input: "transcriptome.fasta"
    output: expand("blast_input_{X}.fasta", X=[1,2,3,4,5])
    script: "scripts/split_transcriptome.sh"

rule blast:
    input: "blast_input_{X}.fasta",
    output: "output_blast.txt"
    script: "scripts/blastx.sh"
...
rule rsem:
    input:
        "transcriptome.fasta",
        "{sample}.fastq"
    output:
        "final_output/{sample}.counts"
    script:
        "scripts/rsem.sh"
In this simplified workflow, snakemake -n would show a separate rsem job for each sample (as expected, from wildcards set in rule all). However, blast would give a WildcardError stating that
Wildcards in input files cannot be determined from output files:
'X'
This makes sense, but I can't figure out a way for the Snakefile to submit separate jobs for each of the 5 batches above using the one blast template rule. I can't make separate rules for each batch, as the number of batches will vary with the size of the dataset. It seems it would be useful if I could define wildcards local to a rule. Does such a thing exist, or is there a better way to solve this issue?
I hope I understood your problem correctly, if not, feel free to correct me:
So, you want to call the rule blast for every "blast_input_{X}.fasta"?
Then, the batch wildcard would need to be carried over into the output.
rule blast:
    input: "blast_input_{X}.fasta"
    output: "output_blast_{X}.txt"
    script: "scripts/blastx.sh"
If you then later want to merge the batches again in another rule, just use expand in the input of that rule.
    input: expand("output_blast_{X}.txt", X=your_batches)
    output: "merged_blast_output.txt"

How can I save the first parts of a line in a list/array and then sort them depending on the second part?

I have a school project that gives me several lines of text in a file like this:
team1-team2:2-1
team3-team1:2-2
etc
It wants me to determine which team won (or drew) and then make a league table with them, awarding points for wins/draws.
This is my first time using bash. What I did was save the team1/team2 names in a variable and then do the same for the goals. How should I make the table? I managed to make my script create a new file that saves all the team names (checking for duplicates), but I don't know how to continue. Should I make an array for each team, saving their results in it? And then how do I implement the rankings, for example
team1 3p
team2 1p
etc.
I'm not asking for actual code, just a guide as to how I should implement it. Is making a new file the right move? Should I try making a new array with the teams instead? Or something else?
The problem can be divided into 3 parts:
1. Read the input data into memory in a format that can be manipulated easily.
2. Manipulate the data in memory.
3. Output the results in the desired format.
When reading the data into memory, you might decide to read all the data in one go before manipulating it. Or you might decide to read the input data one line at a time and manipulate each line as it is read. When using shell scripting languages, like bash, the second option usually results in simpler code.
The most important decision to make here is how you want to structure the data in memory. You normally want to avoid duplication of data, and you usually want a data structure that is easy to transform into your desired output. In this case, the most logical data structure is an associative array, using the team name as the key.
Assuming that you have to use bash, here is a framework for you to build upon:
#!/bin/bash
declare -A results
while IFS=':-' read -r team1 team2 score1 score2; do
    if [ "${score1}" -gt "${score2}" ]; then
        ((results[${team1}]+=2))
    elif [ ...next test... ]; then
        ...
    else
        ...
    fi
done < scores.txt
# Now you have an associative array containing the points for each team.
# You can either output it as it stands, or sort it by piping through the
# 'sort' command.
for key in "${!results[@]}"; do
    echo ...
done
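For the output-and-sort step sketched in the comments above, one possibility (just one way of doing it) is to print "team points" pairs and pipe them through sort:
for key in "${!results[@]}"; do
    printf '%s %d\n' "$key" "${results[$key]}"
done | sort -k2,2nr    # sort by points, highest first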
I would use awk for this
AWK is an interpreted programming language (AWK stands for Aho, Weinberger, Kernighan) designed for text processing and typically used as a data extraction and reporting tool. AWK is used largely with Unix systems.
Using pure bash scripting is often messy for that kind of job.
Let me show you how easy it can be using awk
Input file : scores.txt
team1-team2:2-1
team3-team1:2-2
Code :
awk -F'[:-]' '              # set delimiters to ":" or "-"
{
    if ($3 > $4)      { teams[$1] += 3 }       # first team gets 3 points
    else if ($3 < $4) { teams[$2] += 3 }       # second team gets 3 points
    else { teams[$1] += 1; teams[$2] += 1 }    # both teams get 1 point
}
END {                       # after scanning the input file
    for (team in teams) {
        print(team OFS teams[team])            # print total points per team
    }
}' scores.txt | sort -rnk 2 > ranking.txt      # sort by number of points
Output (ranking.txt):
team1 4
team3 1

Find matched and unmatched records when the position of the key-word is unknown [closed]

I have two files, FILE1 & FILE2, and let's say both have a fixed record length of 30 characters. I need to find the records from FILE1 & FILE2 which contain the string 'COBOL', where the position of this key-word is unknown and changes for every record. To be more clear, below is a sample layout.
FILE1 :
NVWGNLVKAKOIVCOBOLLKASVOIWSNVS
SOSIVSNAVIS7780HLSVHSKSCOBOL56
ZXCVBNMASDFGHJJKKLIIUYYTRREEWQ
1234567890COBOL1234556FCVHJJHH
COBOL1231231231231231341234334
FILE2 :
123456789012345678901234567890
COBOL1231231231231231341234334
GYKCHYYIIHHFTIUIHGJUGTUHGFUYHG
Can anyone explain to me how to do this using SORT or JOINKEYS, and also by using a COBOL program?
I need two output files.
Output FILE-OP1 : (which contain all the records having COBOL key-word from file1 & file2)
NVWGNLVKAKOIVCOBOLLKASVOIWSNVS
SOSIVSNAVIS7780HLSVHSKSCOBOL56
1234567890COBOL1234556FCVHJJHH
COBOL1231231231231231341234334
COBOL1231231231231231341234334
Output File-OP2 (which contain only matching records with COBOL key-word from file1 & file2)
COBOL1231231231231231341234334
An example, pseudo-codeish, Cobol:
Open File1
Read File1 into The-Record
Perform until End-Of-File
    Perform varying II from 1 by 1
            until II > length of The-Record
        If The-Record (II:5) = 'COBOL'
            Display "Found COBOL at position " II
        End-If
    End-Perform
    Read File1 into The-Record
End-Perform
Repeat for file2 with the same program pointed at your other file.
As this sounds homework-y, I've left several little quirks that you will need to fix in that code, but you should see where it blows up or fails and be able to resolve those reasonably easily.
If you need to do some sort of matching and dropping between the two files, that is a different animal and you need to get your rules for it. Are you trying to match the files that have "COBOL" located in the same position or something? What behavior do you expect?
For your FILE1, SORT it on the entire input data, only including records which contain COBOL and appending a sequence number (you show your output in the original sequence). If there can be duplicate records, SORT on the sequence-number you attach as well.
Similar for FILE2.
The SORT for each program can be stand-alone (DFSORT or SyncSORT) or within a COBOL program.
You then "match" the files, here's a useful bit of pseudo-code from Bruce Martin: https://stackoverflow.com/a/22950005/1927206
Logically after the match, you then need to SORT both outputs on the sequence-number alone, and after that remove the sequence-numbers.
Remembering that you only need to know whether COBOL is present in the data: if using COBOL for the first two SORTs, you have a variety of ways to locate the word COBOL (and remember that you only need to know that it is there, not where it is or how many times it may be there). As Joe Zitzelberger showed, you can use a one-byte reference-modification, but be careful not to go beyond the data with your PERFORM VARYING (use compiler option SSRANGE if you are unclear what I mean); you can use INSPECT; UNSTRING; STRING; define your data with an OCCURS, for a length of five, and use an index for a one-byte table; use OCCURS DEPENDING ON; do it "byte at a time"; etc.
This is a little bit like free-format number handling.
You can use "SS" in DFSORT to find records containing cobol.
Step 1. read both infiles, produce one outfile OP-1
INCLUDE COND=(1,30,SS,EQ,C'COBOL')
Step2. produce a work file in the same way as step 1. using only File 1.
Step3. produce a work file in the same way as step 1. using only File 2.
Run joinkeys on these two to find matches. ==> outfile OP-2
Essentially this strategy serves to eliminate non qualifying rows from the join.

automate creating sql statements using scripting tool

I often have a task a bit like this: insert a large number of users into the users table with similar properties. Not always that simple, but in general: a list of strings -> a list of corresponding SQL statements.
My usual solution is this: with the list of usernames in Excel, I use a formula to generate a load of insert statements
=concatenate("insert into users values(username .......'",A1,"'.....
and then I fill down the formula to get all the insert rows.
This works, but sometimes the statement is long, sometimes it includes a few different steps for each value, and cramming it all into an Excel formula and getting all the wrapping quotes right is a pain.
I'm wondering if there is a better way. What I really want is to be able to have a template file, template.txt:
insert into users
([username],
[company] ...
)
values('<template tag1>...
and then, using some magic command-line tool, simply be able to type something like
command_line> make_big_file_using_template template.txt /values [username1 username2]
/output: bigfile.txt
and this gives me a big file with the template repeated for each username value with the tag replaced with the username.
So does such a command exist, or are my expectations of command-line tools too high? Any freely available Windows tool will do. I could whip up a C# program to do this in not too much time, but I feel like there must be an easy-to-use tool out there already.
This is trivial using a PowerShell script. PowerShell allows inline variables in strings, so you could do something like:
$Tag1 = 'blah'
$Tag2 = 'foo'
$SQLHS = #"
INSERT INTO users
([username],
[company],...)
VALUES
('$tag1', '$tag2'...)
"#
set-content 'C:\Mynewfile.txt' -value $SQLHS
The #"...."# is a here-string, which makes it very easy to write readable code without escaping quotes and such.
The above could be very easily modified to accept parameters for the various tags and another for the output file, or to run for a set of values located in another .txt or .csv file as inputs.
EDIT:
To modify it to accept parameters, you can just add a param() block at top:
param($outfile, $tag1, $tag2, $tag3)
Then use those $variables in your script:
set-content "$outfile" -value $SQLHS
