How to use psiblast with the UniRef50 protein database and run 300M+ queries the fastest way, with 36 nodes and 52 cores each - bioinformatics

I am trying to use psiblast to query 300M+ records, but even running 6 queries on Slurm takes 55+ minutes. What is the best way to run all the queries?
psiblast="./ncbi-blast-2.13.0+/bin/psiblast" #set path where your ncbi blast unzipped and bin folder
makeblastdb="./ncbi-blast-2.13.0+/bin/makeblastdb"
db="./db"
dbpath=~/sandbox/blast/db
path_to_fasta_file="./fasta/file_1.fsa"
query_file="./fasta/test.fsa"
echo $path_to_fasta_file
generate_database=true #set this to false after first time generating your database
type="prot"
#generate database
if $generate_database
then
echo "Generating Database"
$makeblastdb -dbtype ${type} -in ${path_to_fasta_file} -out ${db}/mydb.fasta -parse_seqids
echo "Database Generated Successfully"
fi
output="./output/results.xml"
#runblaster
runblaster=true #set this to true when you want to run your psi blaster
if $runblaster
then
echo "PSI Blaster started"
$psiblast -db ${dbpath}/mydb.fasta -query ${query_file} -out ${output} -outfmt 5
echo "PSI Blaster Completed"
fi
The script above is my initial code.
It runs, but now I have to process all of the UniRef50 queries with the resources available through Slurm.
I have split the uniref50.fasta file into 6 chunks and tried running them with sbatch, but I am still not getting a performance improvement. How can I split my tasks so that they use the maximum available nodes and cores and complete the job?
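The chunking step described above can be sketched as follows. This is a minimal sketch, assuming awk is available and the chunk count stays in the tens or low hundreds; split_fasta and the chunk_<k>.fsa naming are hypothetical names chosen for illustration, not part of BLAST or Slurm.

```shell
#!/bin/bash
# split_fasta FILE N OUTDIR: distribute FASTA records round-robin into N chunk files.
# Hypothetical helper; the function name and chunk_<k>.fsa naming are illustrative only.
split_fasta() {
    local fasta=$1 chunks=$2 outdir=$3
    mkdir -p "$outdir"
    awk -v n="$chunks" -v dir="$outdir" '
        # A header line starts a new record: pick the next chunk file in rotation.
        /^>/ { rec++; out = sprintf("%s/chunk_%d.fsa", dir, (rec - 1) % n) }
        # Header and sequence lines both go to the chunk chosen for this record.
        out != "" { print > out }
    ' "$fasta"
}
```

Each chunk could then be handled by one Slurm array task, for example sbatch --array=0-35 --cpus-per-task=52 job.sh, where the hypothetical job.sh runs psiblast with -num_threads "$SLURM_CPUS_PER_TASK" on chunk_${SLURM_ARRAY_TASK_ID}.fsa. The right number of chunks depends on the cluster and on how long a single chunk takes.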

Related

PowerShell update variable while script is running

How do I update a variable in PowerShell while the script is running?
I have a situation where the script continuously monitors the size of a disk and compares it with a number in a text file on a shared drive (say Z:\quota\software-share-size.txt). If the number in the text file is greater than the size of the disk it monitors, it sends out an email asking to expand the disk to the new size given in the text file. But once the script starts, it does not pull in the new number from the file, and I don't want to stop and start the script to load new content from the text file. Help please.
Maybe this can help you:
while($true)
{
#Here I pull my disk (C:) information (I use localhost for my computer, but you can use an IP or a file with multiple IPs)
$diskinfo = Get-WmiObject Win32_LogicalDisk -ComputerName localhost | where {$_.DeviceId -eq 'C:'}
#Here you pull only the freespace value with Gb format (default is byte)
$freespace = $diskinfo.freespace/1Gb
$freespace = [int]$freespace
#here you pull the "limit number" off your file, must be in Gb and only the number is written in the file.
$limit=Get-Content -Path "B:\Dev\filewithsizeingo.txt"
$limit = [int]$limit
if ($freespace -gt $limit) #The free diskspace is greater than the limit
{
Write-Host "Diskfreespace is Above Limit" -f Green
}
elseif ($freespace -lt $limit) #The free diskspace is lower than the limit
{
Write-Host "Free diskspace below limit" -f Red
#break
#code mail sending
}
Start-Sleep 1
}
Because it's a loop, you can modify filewithsizeingo.txt without stopping the script, and the script will refresh the free disk space and the limit value on each iteration.
In the elseif branch, you can insert a break and add the code that sends the email (which I do not know how to write yet). Don't forget the break, or it will send a mail every second.
I hope it helps, or at least gives you fresh ideas (I'm a beginner with PowerShell, so the code CAN be improved).

How to make script in bash aware that a server is still busy installing/configuring and wait for reboot?

The issue / dilemma
I am currently busy creating a script to kickstart servers (with CentOS 6.x and CentOS 7.x) remotely. So far the script is working, but hangs on one minor thing. Well actually it does not hang, but it does not give detailed information about what is happening. In other words, I am not getting the correct information back in bash about the job being finished correctly.
I have tried various things, however it's hanging with the following message (which is being repeated endlessly):
servername is still installing and configuring packages...
PING 100.125.150.175 (100.125.150.175) 56(84) bytes of data.
64 bytes from 100.125.150.175: icmp_seq=1 ttl=63 time=0.152 ms
64 bytes from 100.125.150.175: icmp_seq=2 ttl=63 time=0.157 ms
64 bytes from 100.125.150.175: icmp_seq=3 ttl=63 time=0.157 ms
64 bytes from 100.125.150.175: icmp_seq=4 ttl=63 time=0.143 ms
64 bytes from 100.125.150.175: icmp_seq=5 ttl=63 time=0.182 ms
--- 100.125.150.175 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 120025ms
rtt min/avg/max/mdev = 0.143/0.158/0.182/0.015 ms
servername is still installing and configuring packages...
PING 100.125.150.175 (100.125.150.175) 56(84) bytes of data.
64 bytes from 100.125.150.175: icmp_seq=1 ttl=63 time=0.153 ms
64 bytes from 100.125.150.175: icmp_seq=2 ttl=63 time=0.132 ms
64 bytes from 100.125.150.175: icmp_seq=3 ttl=63 time=0.142 ms
etc....
So for some reason it does not continue to the next line of code or perform the next action. Since it's only feedback to me (or another user), it's not a major issue. But it would be nice to get this working and to get (detailed) information back about the current progress or what the script/server is actually doing at the moment. Unfortunately, this is not the case for the last piece of code above.
This is the current code snippet I have (yes, it's a mess):
while true;
do
#ping -c3 -i3 $HWNODEIP > /dev/null
#ping -c5 -i30 $HWNODEIP > /dev/null
ping -c5 -i30 $HWNODEIP
if [ $? -eq 1 ] || [ $? -eq 2 ] || [ $? -eq 68 ]
then
echo -e " "
echo -e "Kickstart part II also done. $HOSTNAME will be rebooted one more time."
sleep 5
######return 0
echo -e " "
printf "%s" "Waiting for $HOSTNAME to come back online: "
while ! ping -c 1 -n -w 30 $HWNODEIP &> /dev/null
do
printf "%c" "."
#sleep 10
done
echo -e " "
echo -e "Reboot is done and $HOSTNAME is back online. Performing final check. Please wait..."
sleep 10
echo -e " "
sudo /usr/local/collectHWdata.pl $HWNODEIP
ssh root@$HWNODEIP "while ! test -e /root/kickstart-DONE; do sleep 3; done; echo KICKSTART IS DONE\!"
echo -e " "
exit
else
echo -e " "
echo -e "$HOSTNAME is still installing and configuring packages..."
fi
done
Sidenote: I removed > /dev/null #5 for debugging (not that it helped)
I am guessing I am using things incorrectly, and I am by no means an experienced scripter; I can only do minor stuff, but of course I am doing my best. I have been fooling around with this since last week and still have no result on this part.
What am I trying to achieve?
The server is rebooted after the selected CentOS version, creating partitions and setting up the network. This all works. The above snippet is after that reboot. Now it will install packages I selected, configure various things (like Nagios) and install/compile certain PERL modules. And a few other minor things.
This is done correctly in the background. I wanted to make the script (the above piece of code) aware that the server is still busy installing things and such. Since I lack the knowledge to do that, I decided on a different approach: check if the server is online (in other words, that it's still installing). As long as the server is online, it's obviously still installing/configuring things. After that is done, the server will reboot once more, and the script should then perform the final 2 commands (as seen in my snippet). However (here is the problem) it never runs those commands, even though the kickstart is completely done.
So I am guessing I am doing something wrong and even might messed up things (or got confused by doing so). Maybe someone has an idea, solution or a completely different approach to tackle and fix this problem (or at least I hope so).
Other things I have tried so far? Well, I tried a variety of ping commands and I also tried nc (netcat), but also without a good result. Every single time I hit a brick wall with the last 2 commands, and it keeps pinging instead of showing that the kickstart is done... I think I have spent several hours (since last week) on this already without getting anywhere.
So I am hoping someone can take a look at this and tell me what I am doing wrong and maybe there is a better approach (other than pinging a server) to see if it's still busy. Maybe a (remote) check on yum, perl or a service, so that the script knows it's still busy.
Sorry for the long post, but I know when I provide as much information as possible including code examples and results, this is more "appreciated". So I am hoping I provided adequate information. If not, let me know. I will try to add as much information as I can. As always I am always willing to learn or change my approach.
Thank you already for reading my post!
As noted in the comments under the question:
The server may already have rebooted by the time ping -c5 -i30 $HWNODEIP finishes. The command sends 5 packets (-c flag), waiting 30 seconds between packets (-i interval flag). That is 4 intervals of 30 seconds, so about 120 seconds, which matches the "time 120025ms" in the ping statistics above. A server could reboot just fine within 2 minutes, especially if there's an SSD in use. So try lowering the total time it takes this command to complete.
[ $? -eq 68 ] is probably unnecessary. $HWNODEIP is just ip address, and exit code 68 is for domain name not being resolved, which doesn't apply to IP addresses.
The if statement could be simplified to
if ! ping -c5 -i30 "$HWNODEIP"
These are minor suggestions, probably not bulletproof. As confirmed by the OP in the comments, lowering the interval helps. There are other small improvements that could be made (like quoting variables), but that's outside the scope of the question, so I'll leave it for now.
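The idea of keeping each check short can be sketched as a small polling helper. This is a minimal sketch, not the OP's script: wait_until is a hypothetical function, and the try counts and delays in the commented usage are assumed values.

```shell
#!/bin/bash
# wait_until STATE TRIES DELAY PROBE...: run PROBE repeatedly until it
# succeeds (STATE=up) or fails (STATE=down). Returns 1 if TRIES attempts run out.
# Hypothetical helper for illustration; not part of the original script.
wait_until() {
    local want=$1 tries=$2 delay=$3
    shift 3
    local n
    for ((n = 0; n < tries; n++)); do
        if "$@" > /dev/null 2>&1; then
            [ "$want" = up ] && return 0
        else
            [ "$want" = down ] && return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Example probe: one packet with a 2-second deadline, so a reboot is hard to miss.
# wait_until down 600 1 ping -c 1 -w 2 "$HWNODEIP" &&
#     wait_until up 600 1 ping -c 1 -w 2 "$HWNODEIP" &&
#     echo "back online"
```

Because the probe is passed in as a command, the same helper could wrap an ssh test for /root/kickstart-DONE instead of ping.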

Nmap - RTTVAR has grown to over 2.3 seconds, decreasing to 2.0

I have a script that I'm using to build a config for icinga2. The network is large, multiple /13's large. When I run the script I keep getting the RTTVAR has grown to over 2.3 seconds, decreasing to 2.0 error. I've tried raising my gc_thresh and breaking up the subnets. I've dived through the little info from google and can't seem to find a fix. If anyone has any ideas, I'd really appreciate it. I'm on Ubuntu 16.04
My script:
# Find devices and create IP list
i=72
while [ $i -lt 255 ]
do
echo "$(date) - Scanning xx.$i.0.0/16" >> files/scan.log
nmap -sn --host-timeout 5 xx.$i.0.0/16 -oG - | awk '/Up$/{print $2}' >> files/ip-list
let i=i+1
done
My /etc/sysctl.conf
# Force gc to clean-up quickly
net.ipv4.neigh.default.gc_interval = 3600
# Set ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600
# Setup DNS threshold for arp
net.ipv4.neigh.default.gc_thresh3 = 8192
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh1 = 2048
Edit: added host-timeout 5 removed -n
I suggest using a ping scan. If you want an overall view of your network, you can use
nmap -sP -n
It decreases the time a little compared to plain nmap -sn, because -n skips reverse DNS resolution (-sP itself is just the older name for -sn); you can check it with small examples.
As I said in a comment, use --host-timeout and --max-retries; that will improve your performance.
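Put together, the flags mentioned above could be wrapped like this. It is only a sketch: scan_subnet and the NMAP override variable are hypothetical, and the 5s timeout and single retry are assumed starting values rather than tuned numbers.

```shell
#!/bin/bash
# scan_subnet SUBNET: ping-scan SUBNET and print the addresses reported as up.
# Hypothetical wrapper; set NMAP to point at a different nmap binary if needed.
scan_subnet() {
    # -sn: host discovery only, -n: no reverse DNS,
    # --host-timeout / --max-retries: bound the time spent on each host.
    "${NMAP:-nmap}" -sn -n --host-timeout 5s --max-retries 1 "$1" -oG - |
        awk '/Up$/ {print $2}'
}

# Usage inside the original loop:
# scan_subnet "xx.$i.0.0/16" >> files/ip-list
```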

How to script sfdisk or parted for multiple partitions?

For QA purposes I need to be able to partition a drive via a bash script up to 30 or more partitions for both RHEL and SLES.
I have attempted to do this in BASH with fdisk via a "here document" which works but as you can guess blows up in various steps. I assume this is because of timing of the input commands occurring at the wrong times and getting out of sync. 1 out of 10 times my script will work correctly.
I have looked at parted and sfdisk and don't really understand how to use these tools.
I have only ever used fdisk.
My issue is that with fdisk you can state something like "new partition +1gb" over and over and this works for me because in my script I don't need to keep track of prior partitions or remaining space or do any calculations. Every time I run this function this just makes an additional 1gb partition from any unused space.
Is there a way to use parted or sfdisk (or any other tool that would already be a part of these distros) so that I could script a loop from 1 to x operations without having to do this accounting of remaining space? Does anyone have any examples they could share?
Update
Here is an example of one of my functions. At the start of the script we ask the user for the number of partitions to create, their size (the same for all of them), and the type of FS, if any. This function creates partitions 1 through 3; a different function handles the extended (4th) partition, and another handles the 5th to the nth.
As I said before, this script is fully functional; my problem is that at times the commands being sent to fdisk seem to arrive at the wrong timing, which then breaks the entire script thereby breaking any automation.
So the commands are being sent like this:
n
p
1
+1M
w
I have been reading up on fdisk and have learned that it is not well suited to scripting. What I am seeing is that in script mode fdisk might still be asking for p while my script already thinks it's time to send the 1.
The thing about fdisk that worked for me is that after you specify the partition number it has already calculated the next free sector, so all I have to do at this point is send a blank line for my start and then the +1M for my total size. Parted and sfdisk don't appear to work this way from what I can tell, and I am still too new at this to understand how to automate those tools.
Create1to3Primary_Func() {
Size=\+$partSize\MB
for i in {1..3}
do
echo " this loop i= $i"
echo "Creating Partition $i on $targetFull as $targetFull$i using Create1to3Primary_Func()"
rm -f /tmp/myScript
echo -e "n" >> /tmp/myScript
echo -e "p" >> /tmp/myScript
echo -e "$i" >> /tmp/myScript
echo -e " " >> /tmp/myScript
echo -e "$Size" >> /tmp/myScript
echo -e "w" >> /tmp/myScript
echo -e "EOF" >> /tmp/myScript
fdisk $targetFull < /tmp/myScript
echo " sleeping Create1to3Primary_Func()"
sleep 4s
if [ "$RawOrFs" == "f" ]; then
mkfsCMD="mkfs.$fsType"
mkfsFullTarget="$targetFull$i"
cmdline="$mkfsCMD $mkfsFullTarget -L 'Partition$i'"
echo "Creating $fsType File System on $mkfsFullTarget"
$cmdline
fi
void="/mnt/mymnt$i"
if [ ! -d $void ] ; then
echo "Creating Mount Point /mnt/mymnt$i"
void="/mnt/mymnt$i"
mkdir $void
fi
echo "Part Probe on $targetFull "
partprobe $targetFull ; sleep 4s
done
}
I'm not sure I understand what you really want, but you may be interested in the fact that sfdisk can dump a partition layout and use this layout to partition other disks. For instance:
sfdisk -d /dev/sda > mydiskpartitionslayout
Then in your script (take care of course) you can specify
sfdisk /dev/sdx < mydiskpartitionslayout
sfdisk
sfdisk is a scriptable version of fdisk.
It is part of util-linux, just like fdisk, so availability should be the same.
A partition table with a single partition that takes the whole disk can be
created with:
echo 'type=83' | sudo sfdisk /dev/sdX
and more complex partition tables are explained below.
To generate an example script, get the setup of one of your disks:
sudo sfdisk -d /dev/sda > sda.sfdisk
Sample output on my Lenovo T430 Windows 7 / Ubuntu dual boot:
label: dos
label-id: 0x7ddcbf7d
device: /dev/sda
unit: sectors
/dev/sda1 : start= 2048, size= 3072000, type=7, bootable
/dev/sda2 : start= 3074048, size= 195430105, type=7
/dev/sda3 : start= 948099072, size= 28672000, type=7
/dev/sda4 : start= 198504446, size= 749594626, type=5
/dev/sda5 : start= 198504448, size= 618891264, type=83
/dev/sda6 : start= 940277760, size= 7821312, type=82
/dev/sda7 : start= 817397760, size= 61437952, type=83
/dev/sda8 : start= 878837760, size= 61437500, type=83
Once you have the script saved to a file, you can apply it to sdX with:
sudo sfdisk /dev/sdX < sda.sfdisk
For sfdisk input, you can just omit the device names, and use lines of type:
start= 2048, size= 3072000, type=7, bootable
They are just ignored if present, and the device name is taken from the command line argument.
Some explanations:
header lines: all optional:
label: type of partition table. dos (MBR) is the old and widely supported one, gpt the new shiny thing.
unit: only sectors is supported. 1 sector usually equals 512 bytes. Find with cat /sys/block/sda/queue/hw_sector_size See also: https://unix.stackexchange.com/questions/2668/finding-the-sector-size-of-a-partition
device: informative only I think
partition lines:
start: offset inside the disk at which the partition starts.
start has very good defaults, and can often be omitted:
on the first line, start is 2048, i.e. 1 MiB (2048 * 512 bytes), which is a sane default for disk compatibility
later start values default to the first unallocated position
size: man sfdisk says: The default value of size indicates "as much as possible". So to fill the disk with a single partition use: /dev/sda : start=2048, type=83
type: magic byte stored on the boot sector for each partition entry. Possible values: https://en.wikipedia.org/wiki/Partition_type On this example we observe:
7 (sda1, 2 and 3): filesystems that Windows supports. Preinstalled Windows stuff and Lenovo recovery partitions. sudo blkid labels help identify them.
5 (sda4): extended primary partition, which will contain other logical partitions (because we can only have 4 primary partitions with MBR)
83(sda5, 7, and 8): partitions which Linux supports. For me one home, and two roots with different Ubuntu versions
82 (sda6): swap
fdisk can also read sfdisk scripts with the I command, which "sources" them during an interactive fdisk session, allowing you further customization before writing the partition.
Tested on Ubuntu 16.04, sfdisk 2.27.1.
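Since start and the device name can be omitted, a loop can emit one size-only line per partition and let sfdisk do the accounting of free space. A minimal sketch, assuming the newer sfdisk script syntax (util-linux 2.26 or later); gen_layout is a hypothetical helper, and choosing label gpt is my assumption to sidestep the 4-primary-partition limit of MBR:

```shell
#!/bin/bash
# gen_layout COUNT SIZE: print an sfdisk script with COUNT partitions of SIZE each.
# start= is omitted on every line, so each partition begins at the first
# unallocated position. Hypothetical helper; "label: gpt" is an assumption
# made here to avoid the 4-primary-partition limit of dos/MBR.
gen_layout() {
    local count=$1 size=$2 i
    echo "label: gpt"
    for ((i = 1; i <= count; i++)); do
        echo "size=$size"
    done
}

# Usage (destructive, double-check the device name first):
# gen_layout 30 1GiB | sudo sfdisk /dev/sdX
```

Printing the layout to a file first and reading it before piping it to sfdisk is a cheap way to check what will be written.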
Format and populate the partitions of an image file without sudo
This is a good way to learn to use sfdisk without blowing up your hard disks: How to create a multi partition SD disk image without root privileges?
An approach I like (which I saw in this article) is to "script" the fdisk input directly, since it's smarter than sfdisk about creating a partition "until the end of the disk" or "2 GB large". Example:
echo "d
1
d
2
d
3
n
p
1

+2G
n
p
2


w
" | fdisk /dev/sda
This script deletes up to 3 existing partitions, creates a 2 GB partition (e.g. swap) and then creates a partition that would extend over the remaining disk space.
In contrast, if a partition layout was created and used in sfdisk, the script would not cover the whole disk if more space was available.
Automating repetitive tasks is the norm in automation work, and we need a way to feed answers to interactive programs automatically if we are to include them in a script.
This is where a program called "Expect" steps in. For Red Hat based systems, execute the command below to install Expect:
yum install expect
For Debian based or Ubuntu, execute the below command.
apt-get install expect
Below is an Expect script to create a partition on /dev/sdc:
#!/usr/bin/expect
log_file -a "/tmp/expect.log"
set timeout 600
spawn /sbin/fdisk /dev/sdc
expect "Command (m for help): " { send "n\r" }
expect "p primary partition (1-4)"
expect "" { send "p\r" }
expect "Partition number (1-4): " { send "1\r" }
expect "First cylinder (1-133544, default 1): " { send "1\r" }
expect ": " { send "\r" }
expect "Command (m for help): " { send "w\r" }
interact

fastest hashing in a unix environment?

I need to examine the output of a certain script 1000s of times on a unix platform and check if any of it has changed from before.
I've been doing this:
(script_stuff) | md5sum
and storing this value. I actually don't really need "md5", JUST a simple hash function whose result I can compare against a stored value to see if it has changed. It's okay if there is an occasional false positive.
Is there anything better than md5sum that works faster and generates a fairly usable hash value? The script itself generates a few lines of text - maybe 10-20 on average to max 100 or so.
I had a look at fast md5sum on millions of strings in bash/ubuntu - that's wonderful, but I can't compile a new program. Need a system utility... :(
Additional "background" details:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I have no idea what the use of such a system would be, I'm just doing this as a job for someone else...
The cksum utility calculates a non-cryptographic CRC checksum.
How big is the output you're checking? A hundred lines max. I'd just save the entire original file then use cmp to see if it's changed. Given that a hash calculation will have to read every byte anyway, the only way you'll get an advantage from a checksum type calculation is if the cost of doing it is less than reading two files of that size.
And cmp won't give you any false positives or negatives :-)
pax> echo hello >qq1.txt
pax> echo goodbye >qq2.txt
pax> cp qq1.txt qq3.txt
pax> cmp qq1.txt qq2.txt >/dev/null
pax> echo $?
1
pax> cmp qq1.txt qq3.txt >/dev/null
pax> echo $?
0
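Applied to the monitoring task, the cmp approach could look like the sketch below. check_changed, the .last/.new file naming, and the state directory are hypothetical; note that in this sketch the very first run, with no stored copy yet, is reported as a change.

```shell
#!/bin/bash
# check_changed NAME STATE_DIR: read the current output on stdin, compare it with
# the copy stored by the previous run, and keep the new copy for next time.
# Returns 0 if the output changed (or no previous copy exists), 1 if identical.
# Hypothetical helper; names and directory layout are illustrative.
check_changed() {
    local name=$1 dir=$2
    local old="$dir/$name.last" new="$dir/$name.new"
    mkdir -p "$dir"
    cat > "$new"
    if [ -f "$old" ] && cmp -s "$old" "$new"; then
        rm -f "$new"
        return 1
    fi
    mv -f "$new" "$old"
    return 0
}

# Usage (on_change.sh is a placeholder for the script to trigger):
# dig example.com +short | sort | check_changed example.com /var/tmp/dns-state &&
#     ./on_change.sh example.com
```

The sort in the usage line is there because a resolver may return the same records in a different order between queries, which would otherwise look like a change.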
Based on your question update:
I've been asked to monitor the DNS record of a set of 1000 or so domains and immediately call certain other scripts if there has been any change. I intend to do a dig xyz +short statement and hash its output and store that, and then check it against a previously stored value. Any change will trigger the other script, otherwise it just goes on. Right now, we're planning on using cron for a set of these 1000, but can think completely differently for "seriously heavy" usage - ~20,000 or so.
I'm not sure you need to worry too much about the file I/O. The following script executed dig microsoft.com +short 5000 times first with file I/O then with output to /dev/null (by changing the comments).
#!/bin/bash
rm -rf qqtemp
mkdir qqtemp
((i = 0))
while [[ $i -ne 5000 ]] ; do
#dig microsoft.com +short >qqtemp/microsoft.com.$i
dig microsoft.com +short >/dev/null
((i = i + 1))
done
The elapsed times at 5 runs each are:
File I/O | /dev/null
----------+-----------
3:09 | 1:52
2:54 | 2:33
2:43 | 3:04
2:49 | 2:38
2:33 | 3:08
After removing the outliers and averaging, the results are 2:49 for the file I/O and 2:45 for the /dev/null. The time difference is four seconds for 5000 iterations, only 1/1250th of a second per item.
However, since an iteration over the 5000 takes up to three minutes, that's how long it will take maximum to detect a problem (a minute and a half on average). If that's not acceptable, you need to move away from bash to another tool.
Given that a single dig only takes about 0.012 seconds, you should theoretically do 5000 in sixty seconds assuming your checking tool takes no time at all. You may be better off doing something like this in Perl and using an associative array to store the output from dig.
Perl's semi-compiled nature means that it will probably run substantially faster than a bash script and Perl's fancy stuff will make the job a lot easier. However, you're unlikely to get that 60-second time much lower just because that's how long it takes to run the dig commands.
