Having trouble creating and saving a text file in shell scripting - bash

I am trying to save a few lines into an output file called Temporal_regions.txt in my folder, so I can continue with further processing.
I took a look at some other posts, and they suggested doing it this way:
echo "some file content" > /path/to/outputfile
So my returned output looks like this:
1 3 13579 586 Right-Temporal 72 73 66 54
2 5 24680 587 Left-Temporal 89 44 65 56
3 7 34552 599 Right-Temporal 72 75 66 54
4 8 24451 480 Left-Temporal 68 57 47 66
*All of these lines were individually returned as output by running grep on another file (call it TR.stats).
If I want to store these outputs in the .txt file (call it TemperolRegion.txt), I would have to create the output file first, right? What would be the command to do this?
Then I can just use the suggested way above to store the output in the .txt file?
grep "Left-Temporal" TR.stats > /path/to/TemperolRegion.txt
I can't seem to get the commands right.

It sounds like you are missing the directories along the path. In Linux, you have to create the parent directory before you can create a file inside it; the file itself does not need to be created in advance, because the > redirection creates (or overwrites) it for you. You can create every directory along the path to the parent by using mkdir with the -p option.
mkdir -p /path/to
grep "Left-Temporal" TR.stats > /path/to/TemperolRegion.txt
Another problem you might have is that you are not in the same directory as TR.stats. In that case you should use the absolute path for that file as well.
grep "Left-Temporal" /path/to/TR.stats > /path/to/TemperolRegion.txt

Related

Upload directory as whole instead of contents in s3

I have folder path /images/2020/05/ which contains subdirectories:
ls /images/2020/05/
1 2 3 4 ... 30
I want to feed directory names to a bash script that syncs them to S3, e.g.
./sync.sh 11 12 13 14 29 30
such that those directories are uploaded to the s3 bucket imgbucket2021/myimages/.
#!/bin/bash
workdir="/images/2020/05"
# sync each directory name passed on the command line
for subdir in "$@"; do
  /usr/bin/docker run --rm -it -v /root/.aws:/root/.aws -v ${workdir}:/root/ amazon/aws-cli s3 sync /root/${subdir} s3://imgbucket2021/myimages/
done
With the above script, the image files inside each directory are uploaded directly to the bucket path instead of the whole folder (e.g. 11 12 13 14 29 30 with their contents).
Is there some way (without using --exclude, --include) to upload a whole directory and its contents?
So that listing the bucket (imgbucket2021/myimages/) would show entries like 11 12 13 14 29 30.
edit:
Some commands only perform operations on the contents of a local directory or S3 prefix/bucket. Adding or omitting a forward slash or back slash to the end of any path argument, depending on its type, does not affect the results of the operation.
https://docs.aws.amazon.com/cli/latest/reference/s3/
It looks like I will have to use the --include and --exclude flags.
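For reference, a sketch of another approach that avoids --include/--exclude entirely: append the subdirectory name to the destination prefix, since aws s3 sync copies the contents of the source directory into whatever destination prefix it is given. The bucket and paths below are the ones from the question; treat this as a sketch rather than a verified fix:
#!/bin/bash
workdir="/images/2020/05"
# each directory ends up under its own prefix in the bucket
for subdir in "$@"; do
  /usr/bin/docker run --rm -it -v /root/.aws:/root/.aws -v ${workdir}:/root/ amazon/aws-cli s3 sync /root/${subdir} s3://imgbucket2021/myimages/${subdir}/
done
With the destination written this way, listing imgbucket2021/myimages/ should show the prefixes 11 12 13 14 29 30.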

Incomplete paste in a file (using cat) when copy selection from terminal

I am trying to copy output from the MobaXterm terminal into a file in Ubuntu 20.04 running on Windows 10 with WSL 2.
Steps I perform:
I select the lines I want to copy.
cat > file
Paste (with Middle-Click, Shift-Ins, Right click menu & Paste)
Ctrl-D to finish the input for the cat command
The results are not complete/reliable. I created several files using different copy & paste methods, and the files obtained have different sizes (even when using the same method). See below:
wc AftnRG.trace.log.*
233 1704 13751 AftnRG.trace.log.console
233 1819 14570 AftnRG.trace.log.consoleMc
233 1734 13940 AftnRG.trace.log.consoleMcCc
233 1689 13625 AftnRG.trace.log.consoleMcCd
233 1759 14129 AftnRG.trace.log.consoleMcCd2
233 1749 14066 AftnRG.trace.log.consoleMp
233 1713 13814 AftnRG.trace.log.consoleSi
234 1756 14134 AftnRG.trace.log.consolecp
233 1704 13688 AftnRG.trace.log.consolesi
Legend: Mc - middle click, Mp - menu Paste, Si - Shift-Insert, Cp - menu Copy Paste, Cd - Ctrl-D, Cc - Ctrl-C
The paste looks complete but data in the file is not.
What am I doing wrong?
How can I get the complete clipboard data into a file?
P.S. I remember a similar situation when using ssh between native RedHat machines.
Regarding the question of how to obtain the complete data: I found that pasting into vim and then saving the file lost no information.
It is still unclear why cat is not working as expected.
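One workaround that bypasses terminal pasting entirely on WSL 2 is to read the Windows clipboard directly with PowerShell. This is a sketch assuming powershell.exe is reachable from the WSL PATH (it usually is) and that the selection actually landed on the Windows clipboard; the output file name is a placeholder, and tr strips the CR line endings that Get-Clipboard emits:
powershell.exe -NoProfile -Command Get-Clipboard | tr -d '\r' > clipboard.txt
Since the data never passes through the terminal's paste handling, that handling cannot truncate it.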

Reorder lines near the beginning of a huge text file (>20G)

I am a vim user and can use some basic awk and bash commands. I now have a text (vcf) file larger than 20G. What I want is to move line #69 to just below line #66:
$ less huge.vcf
...
66 ##contig=<ID=9,length=124595110>
67 ##contig=<ID=X,length=171031299>
68 ##contig=<ID=Y,length=91744698>
69 ##contig=<ID=MT,length=16299>
...
What I want is:
...
66 ##contig=<ID=9,length=124595110>
67 ##contig=<ID=MT,length=16299>
68 ##contig=<ID=X,length=171031299>
69 ##contig=<ID=Y,length=91744698>
...
I tried to open and edit it using vim (with the LargeFile plugin installed), but it still does not work very well.
The easy approach is to copy the section you want to edit out of your file, modify it in place, then copy it back in.
# extract the first hundred lines
head -n 100 huge.txt >start.txt
# modify that extracted subset
vim start.txt
# copy that section back into the beginning of the larger file
dd if=start.txt of=huge.txt conv=notrunc
Note that this only works if your edits don't change the size of the section being modified. That is to say -- make sure that start.txt has the exact same size in bytes after being modified that it had before.
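A quick way to check that constraint before running dd (a sketch assuming GNU coreutils stat is available):
# print the size of start.txt in bytes; run this before and after editing
stat -c %s start.txt
If the two numbers differ, dd will either leave stale bytes behind or overwrite the lines that follow the edited section.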
Here's an awk version:
$ awk 'NR>=3 && NR<=4{b=b (b==""?"":ORS) $0;next}1;NR==5 {print b}' file
...
66 ##contig=<ID=9,length=124595110>
69 ##contig=<ID=MT,length=16299>
67 ##contig=<ID=X,length=171031299>
68 ##contig=<ID=Y,length=91744698>
...
You need to change the line numbers in the code, though: 3 -> 67, 4 -> 68 and 5 -> 69, and redirect the output to a new file. If you'd like it to edit the file in place, use -i inplace with GNU awk.
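With those substitutions in place, the command for the file in the question might look like this (huge.vcf is the file name from the question; reordered.vcf is an assumed output name):
awk 'NR>=67 && NR<=68{b=b (b==""?"":ORS) $0;next}1;NR==69{print b}' huge.vcf > reordered.vcf
This buffers lines 67-68, prints line 69 in their place, then flushes the buffer, so the MT contig line ends up directly below line 66.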

orthAgogue incorrectly processing BLAST files

Need to recruit the help of any budding bioinformaticians that are lurking in the shadows here.
I am currently in the process of formatting some .fasta files for use in a set of grouping programs, but I cannot for the life of me get them to work. First things first, all the files have to have a 3 or 4 character name, such as the following:
PP41.fasta
PP59.fasta
PPBD.fasta
...etc...
The files must have headers for each gene sequence that look like so: >xxxx|yyyyyyyyyy, where xxxx is the same 3 or 4 letter 'taxon' identifier as in the file names above and yyyyyyyyyy is a numerical identifier for each of the proteins within each taxon (the pipe symbol can also be replaced with an _, as below). I then cat all of these into one file, which has headers that look correct, like so:
>PP49_00001
MIENFNENNDMSDMFWEVEKGTGEVINLVPNTSNTVQPVVLMRLGLFVPTLKSTKRGHQG
EMSSMDATAELRQLAIVKTEGYENIHITGARLDMDNDFKTWVGIIHSFAKHKVIGDAVTL
SFVDFIKLCGIPSSRSSKRLRERLGASLRRIATNTLSFSSQNKSYHTHLVQSAYYDMVKD
TVTIQADPKIFELYQFDRKVLLQLRAINELGRKESAQALYTYIESLPPSPAPISLARLRA
RLNLRSRVTTQNAIVRKAMEQLKGIGYLDYTEIKRGSSVYFIVHARRPKLKALKSSKSSF
KRKKETQEESILTELTREELELLEIIRAEKIIKVTRNHRRKKQTLLTFAEDESQ*
>PP49_00002
MQNDIILPINKLHGLKLLNSLELSDIELGELLSLEGDIKQVSTGNNGIVVHRIDMSEIGS
FLIIDSGESRFVIKAS*
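For reference, the concatenation step mentioned above might look like the following (a sketch; it assumes every per-taxon file matches PP*.fasta and that allproteins.fasta is the combined file passed to formatdb below):
cat PP*.fasta > allproteins.fasta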
Next step is to construct a blast database which I do as follows, using the formatdb tool of NCBI Blast:
formatdb -i allproteins.fasta -p T -o T
This produces a set of files for the database. Next I conduct an all-vs-all BLAST of the concatenated proteins against the database I made of them, like so; this outputs a tabular file, which I suspect is where my issues are beginning to arise:
blastall -p blastp -d allproteins.fasta -i allproteins.fasta -a 6 -F '0 S' -v 100000 -b 100000 -e 1e-5 -m 8 -o plasmid_allvall_blastout
These files have 12 columns and look like the sample below. It appears correct to me, but my supervisor suspects the error is in the blast file - I don't know what I'm doing wrong, however.
PP49_00001 PP51_00025 100.00 354 0 0 1 354 1 354 0.0 552
PP49_00001 PP49_00001 100.00 354 0 0 1 354 1 354 0.0 552
PP49_00001 PPTI_00026 90.28 288 28 0 1 288 1 288 3e-172 476
PP49_00001 PPNP_00026 90.28 288 28 0 1 288 1 288 3e-172 476
PP49_00001 PPKC_00016 89.93 288 29 0 1 288 1 288 2e-170 472
PP49_00001 PPBD_00021 89.93 288 29 0 1 288 1 288 2e-170 472
PP49_00001 PPJN_00003 91.14 79 7 0 145 223 2 80 8e-47 147
PP49_00002 PPTI_00024 100.00 76 0 0 1 76 1 76 3e-50 146
PP49_00002 PPNP_00024 100.00 76 0 0 1 76 1 76 3e-50 146
PP49_00002 PPKC_00018 100.00 76 0 0 1 76 1 76 3e-50 146
So, this is where the problems really begin. I now pass the above file to a program called orthAgogue, which analyses the paired sequences above using parameters laid out in the manual (still no idea if I'm doing anything wrong) - all I know is that the several output files produced are all just nonsense/empty.
The command looks like so:
orthAgogue -i plasmid_allvsall_blastout -t 0 -p 1 -e 5 -O .
Any and all ideas welcome! (Hope I've covered everything - sorry about the long post!)
EDIT: I never did manage to find a solution to this and had to use an alternative piece of software. If admins wish to close this, please do, unless it is worth keeping open for someone else (though I suspect it's a pretty niche issue).
I first discovered this issue (with orthAgogue) today:
Though my reply may be old, I hope it helps future users.
The issue is due to a missing parameter: it seems you forgot to specify the separator with -s '_', i.e. the following set of command-line parameters should do the trick*:
orthAgogue -i plasmid_allvsall_blastout -t 0 -p 1 -e 5 -O . -s '_'
(* Under the assumption that your input file is a tab-separated file of columns.)
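To sanity-check that '_' really is the character separating the taxon code from the protein identifier in the BLAST output, here is a quick shell sketch (the file name is taken from the command above; adjust it if yours differs):
cut -f1 plasmid_allvsall_blastout | cut -d'_' -f1 | sort -u
This should print one 3-4 character taxon code per line (PP49, PPBD, ...); if it prints something else, the value passed to -s does not match your headers.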
A brief update after a comment made by Joe:
In brief, the problem described in the initial error report (by Joe) is (in most cases) not a bug. Instead it is one of the core properties of the Inparanoid algorithm which orthAgogue implements: if your ortholog result file is empty (though constructed), this (in most cases) implies that there is no reciprocal best match between a protein pair from two different taxa/species.
One (of many) explanations for this could be that your blastp scores are too similar, a case where I would suggest a combined tree-based/homology clustering as in TREEFAM.
Therefore, when I receive your data, I'll send it to one of the biologists I'm working with, with the goal of identifying the proper tool for your data: hope my last comment makes your day ;)
Ole Kristian Ekseth, developer of orthAgogue

Reading a text file in Ruby gives wrong output

I am not an experienced Ruby programmer, so bear with me. I have a problem with this specific text file containing two lines (this issue shows up only on occasion):
trim(0, 15447)
0, 15447
I am trying to read these two lines with the following code:
File.open(trim).each do |line|
puts line
end
I normally obtain the expected output, but here I get only one line, with some characters missing:
0, 1544715447)
If I want to check the character codes, I get this:
irb(main):120:0> File.open(trim).each do |line|
irb(main):121:1* puts '========================'
irb(main):122:1> puts line
irb(main):123:1> puts '........................'
irb(main):124:1> puts line.each_byte {|c| print c, ' ' }
irb(main):125:1> end
========================
0, 1544715447)
........................
116 114 105 109 40 48 44 32 49 53 52 52 55 41 13 48 44 32 49 53 52 52 55 trim(0,0, 15447
=> #<File:E:\Public\Public_videos\Soccer\1995_0129_odp_es\950129-ODP_&m3_trim30.txt>
I frankly don't understand what is going on, as I don't see any hidden characters, and this happens randomly but consistently with some files.
Any suggestion to help me understand or avoid this issue would be greatly appreciated.
What happened is that your file had two "lines" separated by a carriage return character, not a linefeed.
You showed the bytes in your file as
116 114 105 109 40 48 44 32 49 53 52 52 55 41 13 48 44 32 49 53 52 52 55
That 13 is a carriage return, which is sometimes "displayed" by the writer going back to the start of the line it is writing.
So first it wrote out
trim(0, 15447)
then it went back to the start of the same line and wrote
0, 15447
overlaying the initial line! What do you end up with?
0, 1544715447)
Your "problem" is probably best fixed by reencoding that text file of yours to use a better way to separate lines. On Unix systems, including OSX these days, the line terminator is character 10 - known as LINE FEED. Windows uses the two-character combination 13 10 (CR LF). Only old Mac systems to my knowledge used the 13.
Many text editors today will allow you to select a "line ending" option, so you might be able to just open that file, then save it using a different line ending option. FWIW my guess is that you are using Windows now, which is known for rendering CRs and LFs differently than *Nix systems.
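If you would rather fix the file from the command line, here is a minimal shell sketch, assuming the file really does use bare carriage returns (byte 13) as its only separator; the file names are placeholders:
tr '\r' '\n' < original.txt > fixed.txt
Every CR becomes an LF, so reading the result line by line in Ruby will yield the two lines you expect.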
