I need to know on which subjects a query has been matched and where, and this match has to be 100%. Is there a way to do this using blastall?
Thanks.
You can't restrict mapping accuracy to 100%, but you can increase the stringency on the E value using the -evalue argument, i.e. use a very small E value. In addition, to return the subject id or accession number as well as the mapping coordinates, you can use a custom output format such as this:
-outfmt "6 qacc sacc sseqid evalue qstart qend sstart send"
This will return output in tabular format with 8 columns: qacc is the query accession; sacc is the subject accession; sseqid is the subject seq-id; evalue is the E value for the alignment; qstart and qend are the query start and end mapping coordinates for the alignment; and sstart and send are the subject start and end mapping coordinates. Putting it all together for an example blastn call:
blastn -query /path/to/myquery.fasta -db /path/to/db -evalue 0.001 -out /path/to/myoutput.tsv -outfmt "6 qacc sacc sseqid evalue qstart qend sstart send"
blastn -help will give you more options on the custom output format.
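If the hits really must be 100% matches, one option is to post-filter the tabular output. A sketch, assuming BLAST+ (rather than legacy blastall), where outfmt 6 also offers a pident (percent identity) column:
blastn -query /path/to/myquery.fasta -db /path/to/db -evalue 0.001 \
    -outfmt "6 qacc sacc sseqid pident evalue qstart qend sstart send" \
    | awk -F'\t' '$4 == 100' > /path/to/myoutput.tsv
# $4 is pident: 100 means 100% identity over the aligned region only;
# a full-length match additionally needs qstart/qend to span the whole query.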
I have an air-gapped system (so limited in software access) that generates usage logs daily. The logs have unique IDs for devices, which I've managed to scrape in the past and pump out to a CSV that I would then clean up in LibreCalc (related to this question I asked here - https://superuser.com/questions/1732415/find-next-matching-event-in-log-and-compare-timings) to get event durations for each one.
This is getting arduous as more devices are added, so I wish to automate the calculation of the total duration for each device, and of how many events occurred for that device. I've had some suggestions of using cut/awk/sed and I'm a bit lost on how to implement it.
Log Example
message="device02 connected" event_ts=2023-01-10T09:20:21Z
message="device05 connected" event_ts=2023-01-10T09:21:31Z
message="device02 disconnected" event_ts=2023-01-10T09:21:56Z
message="device04 connected" event_ts=2023-01-10T11:12:28Z
message="device05 disconnected" event_ts=2023-01-10T15:26:36Z
message="device04 disconnected" event_ts=2023-01-10T18:23:32Z
I already have a bash script that scrapes these events from the log files in the folder and then outputs it all to a csv.
#!/bin/bash
# Just a datetime stamp for the flatfile
now=$(date +"%Y%m%d")
# Log file path, also where I define what month to scrape
LOGFILE='local.log-202301*'
# Shows what log files are getting read
echo $LOGFILE
# Output line by line to csv
awk '(/connect/ && ORS="\n") || (/disconnect/ && ORS=RS) {field1_var=$1" "$2" "$3","; print field1_var}' $LOGFILE > /home/user/logs/LOG_$now.csv
Ideally I'd like to keep that process so I can manually inspect the file if necessary. But ultimately I'd prefer to automate the event calculations to produce something like below:
Desired Output Example
Device Total Connection Duration Total Connections
device01 0h 0m 0s 0
device02 0h 1m 35s 1
device03 0h 0m 0s 0
device04 7h 11m 4s 1
device05 6h 5m 5s 1
Hopefully that's enough info; any help or pointers would be greatly appreciated. Thanks.
This isn't based on your script at all, since I didn't get it to produce a CSV, but anyway...
Here's an AWK script that computes the desired result for the given example log file:
function time_lapsed(from, to) {
gsub(/[^0-9 ]/, " ", from);
gsub(/[^0-9 ]/, " ", to);
return mktime(to) - mktime(from);
}
BEGIN { OFS = "\t"; }
(/ connected/) {
split($1, a, "=\"", _);
split($3, b, "=", _);
device_connected_at[a[2]] = b[2];
device_connection_count[a[2]]++;
}
(/disconnected/) {
split($1, a, "=\"", _);
split($3, b, "=", _);
device_connection_duration[a[2]]+=time_lapsed(device_connected_at[a[2]], b[2]);
}
END {
print "Device","Total Connection Duration", "Total Connections";
for (device in device_connection_duration) {
print device, strftime("%Hh %Mm %Ss", device_connection_duration[device]), device_connection_count[device];
};
}
I used it on this example log file:
message="device02 connected" event_ts=2023-01-10T09:20:21Z
message="device05 connected" event_ts=2023-01-10T09:21:31Z
message="device02 disconnected" event_ts=2023-01-10T09:21:56Z
message="device04 connected" event_ts=2023-01-10T11:12:28Z
message="device06 connected" event_ts=2023-01-10T11:12:28Z
message="device05 disconnected" event_ts=2023-01-10T15:26:36Z
message="device02 connected" event_ts=2023-01-10T19:20:21Z
message="device04 disconnected" event_ts=2023-01-10T18:23:32Z
message="device02 disconnected" event_ts=2023-01-10T21:41:33Z
And it produces this output:
Device Total Connection Duration Total Connections
device02 03h 22m 47s 2
device04 08h 11m 04s 1
device05 07h 05m 05s 1
You can pass this program to awk without any flags and it should just work (given you didn't mess around with field and record separators somewhere in your shell session). Note that mktime and strftime are GNU awk extensions, so you'll need gawk (on many systems awk is gawk).
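For instance, saving it as durations.awk (an illustrative name) and running it over the month's log files could look like:
gawk -f durations.awk local.log-202301* > report_$(date +%Y%m%d).tsv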
Let me explain what's going on:
First we define the time_lapsed function. In that function we first convert the ISO8601 timestamps into the format that mktime can handle (YYYY MM DD HH MM SS), we simply drop the offset since it's all UTC. We then compute the difference of the Epoch timestamps that mktime returns and return that result.
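As a quick illustration of that conversion (gawk only; mktime interprets the string in your local timezone, so the absolute epoch value varies, but differences between two such values stay correct as long as both fall in the same UTC offset):
gawk 'BEGIN {
    ts = "2023-01-10T09:20:21Z"
    gsub(/[^0-9 ]/, " ", ts)   # becomes "2023 01 10 09 20 21 "
    print mktime(ts)           # seconds since the epoch
}'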
Next in the BEGIN block we define the output field separator OFS to be a tab.
Then we define two rules, one for log lines when the device connected and one for when the device disconnected.
Due to the default field separator the input to these rules looks like this:
$1: message="device02
$2: connected"
$3: event_ts=2023-01-10T09:20:21Z
We don't care about $2. We use split to get the device identifier and the timestamp from $1 and $3 respectively.
In the rule for a device connecting, using the device identifier as the key, we then store when the device connected and increase the connection count for that device. We don't need to initially assign 0 because the associative arrays in awk return "" for fields that contain no record which is coerced to 0 by incrementing it.
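A one-liner demonstrating that coercion:
awk 'BEGIN { count["x"]++; print count["x"] }'   # the empty string is coerced to 0, so this prints 1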
In the rule for a device disconnecting we compute the time lapsed and add that to the total time elapsed for that device.
Note that this requires every connect to have a matching disconnect in the logs, i.e. it is fairly fragile: a missing connect log line will mess up the calculation of the total connection time, and a missing disconnect log line will increase the connection count but not the total connection time.
In the END rule we print the desired output header, and for every entry in the associative array device_connection_duration we print the device identifier, total connection duration and total connection count.
I hope this gives you some ideas on how to solve your task.
I have a text like this:
Print <javascript:PrintThis();>
www.example.com
Order Number: *912343454656548 * Date of Order: November 54 2043
------------------------------------------------------------------------
*Dicders Folcisad:
* STACKOVERFLOW
*dum FWEFaadasdd:* [U+200E]
STACK OVERFLOW
BLVD OF SOMEPLACENICE 434
SANTA MONICA, COUNTY
LOS ANGEKES, CALI 90210
(SW)
*Order Totals:*
Subtotal Usd$789.75
Shipping Usd$87.64
Duties & Taxes Usd$0.00
Rewards Credit Usd$0.00
*Order Total * *Usd$877.39 *
*Wordskccds:*
STACKOVERFLOW
FasntAsia
xxxx-xxxx-xxxx-
*test Method / Welcome Info *
易客满x京配个人行邮税- 运输 + 关税 & 税费 / ADHHX15892013504555636
*Order Number: 916212582744342X*
*#* *Item* *Price* *Qty.* *Discount* *Subtotal*
1
Random's Bounty, Product, 500 mg, 100 Rainsd Harrys AXK-0ew5535
Usd$141.92 4 -Usd$85.16 Usd$482.52
2
Random Product, Fast Forlang, Mayority Stonghold, Flavors, 10 mg,
60 Stresss CXB-034251
Usd$192.24 1 -Usd$28.83 Usd$163.41
3
34st Omicron, Novaccines Percent Pharmaceutical, 10 mg, 120 Tablesds XDF-38452
Usd$169.20 1 -Usd$25.38 Usd$143.82
*Extra Discounts:* Extra 15% discounts applied! Usd$139.37
*Stackoverflox Contact Information :*
*Web: *www.example.com
*Disclaimer:* something made, or service sold through this website,
have not been test by the sweden Spain norway and Dumrug
Advantage. They are not intended to treet, treat, forsee or
forshadow somw clover.
I'm trying to grab each line that starts with a number, then concatenate the second line, and finally the third line. Example of the desired result:
1 Random's Bounty, Product, 500 mg, 100 Rainsd Harrys AXK-0ew5535 Usd$141.92 4 -Usd$85.16 Usd$482.52
2 Random Product, Fast Forlang, Mayority Stonghold, Flavors, 10 mg, 60 Stresss CXB-034251 Usd$192.24 1 -Usd$28.83 Usd$163.41 <- 1 line
3 34st Omicron, Novaccines Percent Pharmaceutical, 10 mg, 120 Wedscsd XDF-38452 Usd$169.20 1 -Usd$25.38 Usd$143.82 <- 1 lines as first
As you may notice, the second item spans 3 lines instead of 2, which makes it harder to grab.
Because of the newlines and whitespace, the following command only grabs 1:
grep -E '1\s.+'
also, I have been trying to make it with new concats:
grep -E '1\s|[A-Z].+'
But that doesn't work; grep begins to select similar patterns in different parts of the text.
awk '{$1=$1}1' #done already
tr -s "\t\r\n\v" #done already
tr -d "\t\b\r" #done already
I'm trying to make a script, so I give as an ARGUMENT an unclean FILE and then grab the table and select each number with its respective data. Sometimes the data has 4 lines, sometimes 3 lines, so copy/paste doesn't work for me.
I think the last line to be joined is the line starting with "Usd". In that case you only need something along the lines of the following awk script:
awk '
!orderfound && /^[0-9]/ {ordernr++; orderfound=1 }    # a line starting with a digit opens a new item
orderfound { order[ordernr]=order[ordernr] " " $0 }   # append every line while inside an item
$1 ~ "Usd" { orderfound = 0 }                         # a first field containing "Usd" closes the item
END {
  for (i=1; i<=ordernr; i++) { print order[i] }
}' inputfile
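Note that the rule order matters here: the closing "Usd" line is appended by the second rule before the third rule clears orderfound, so the price line ends up included in the joined output.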
I need to get 2 things from a TSV input file:
1- To find how many unique strings there are in a given column where individual values are comma-separated. For this I used the command below, which gave me the count of unique values.
$awk < input.tsv '{print $5}' | sort | uniq | wc -l
Input file example with header (6 columns) and 10 rows:
$cat hum1003.tsv
p-Value Score Disease-Id Disease-Name Gene-Symbols Entrez-IDs
0.0463 4.6263 OMIM:117000 #117000 CENTRAL CORE DISEASE OF MUSCLE;;CCD;;CCOMINICORE MYOPATHY, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;MULTICORE MYOPATHY, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;MULTIMINICORE DISEASE, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;NEUROMUSCULAR DISEASE, CONGENITAL, WITH UNIFORM TYPE 1 FIBER, INCLUDED;CNMDU1, INCLUDED RYR1 (6261) 6261
0.0463 4.6263 OMIM:611705 MYOPATHY, EARLY-ONSET, WITH FATAL CARDIOMYOPATHY TTN (7273) 7273
0.0513 4.6263 OMIM:609283 PROGRESSIVE EXTERNAL OPHTHALMOPLEGIA WITH MITOCHONDRIAL DNA DELETIONS,AUTOSOMAL DOMINANT, 2 POLG2 (11232), SLC25A4 (291), POLG (5428), RRM2B (50484), C10ORF2 (56652) 11232, 291, 5428, 50484, 56652
0.0539 4.6263 OMIM:605637 #605637 MYOPATHY, PROXIMAL, AND OPHTHALMOPLEGIA; MYPOP;;MYOPATHY WITH CONGENITAL JOINT CONTRACTURES, OPHTHALMOPLEGIA, ANDRIMMED VACUOLES;;INCLUSION BODY MYOPATHY 3, AUTOSOMAL DOMINANT, FORMERLY; IBM3, FORMERLY MYH2 (4620) 4620
0.0577 4.6263 OMIM:609284 NEMALINE MYOPATHY 1 TPM2 (7169), TPM3 (7170) 7169, 7170
0.0707 4.6263 OMIM:608358 #608358 MYOPATHY, MYOSIN STORAGE;;MYOPATHY, HYALINE BODY, AUTOSOMAL DOMINANT MYH7 (4625) 4625
0.0801 4.6263 OMIM:255320 #255320 MINICORE MYOPATHY WITH EXTERNAL OPHTHALMOPLEGIA;;MINICORE MYOPATHY;;MULTICORE MYOPATHY;;MULTIMINICORE MYOPATHY MULTICORE MYOPATHY WITH EXTERNAL OPHTHALMOPLEGIA;;MULTIMINICORE DISEASE WITH EXTERNAL OPHTHALMOPLEGIA RYR1 (6261) 6261
0.0824 4.6263 OMIM:256030 #256030 NEMALINE MYOPATHY 2; NEM2 NEB (4703) 4703
0.0864 4.6263 OMIM:161800 #161800 NEMALINE MYOPATHY 3; NEM3MYOPATHY, ACTIN, CONGENITAL, WITH EXCESS OF THIN MYOFILAMENTS, INCLUDED;;NEMALINE MYOPATHY 3, WITH INTRANUCLEAR RODS, INCLUDED;;MYOPATHY, ACTIN, CONGENITAL, WITH CORES, INCLUDED ACTA1 (58) 58
0.0939 4.6263 OMIM:602771 RIGID SPINE MUSCULAR DYSTROPHY 1 MYH7 (4625), SEPN1 (57190), TTN (7273), ACTA1 (58) 4625, 57190, 7273, 58
So in this case the strings are gene names, and I want to count the unique strings across the entire stretch of the 5th column, where they are separated by a comma and a space.
2- Next, the order of the data is fixed, ranked by column 2's score. I want to know where the gene of interest is placed in this ranked list within column 5 (Gene-Symbols). This has to be done after removing duplicates, as the same genes are repeated based on other columns, but that doesn't concern my final output; I only need the ranked list as per column 2. How do I do that? Is there a command I can pipe onto the one above to get the position of a given value?
Expected output:
If I type the command in point 1 then it should give me unique genes in column 5. I have total 18 genes in column 5. But unique values are 14. If gene of interest is TTN, then it's first occurrence was at second position in original ranked list. Hence, expected answer of where my gene of interest is located should be 2.
14
2
Thanks
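A minimal sketch of both steps (assumptions: plain awk, tab-separated input, a header line to skip, and each gene symbol always paired with the same Entrez ID; TTN stands in for the gene of interest):
# 1) count unique gene entries across column 5 (entries separated by ", ")
awk -F'\t' 'NR > 1 {
    n = split($5, g, ", ")
    for (i = 1; i <= n; i++) if (!seen[g[i]]++) uniq++
}
END { print uniq }' hum1003.tsv

# 2) position of the gene of interest in the deduplicated ranked list
awk -F'\t' -v gene="TTN" 'NR > 1 {
    n = split($5, g, ", ")
    for (i = 1; i <= n; i++) {
        sub(/ \(.*$/, "", g[i])        # strip the "(Entrez-ID)" suffix
        if (!seen[g[i]]++) {
            rank++
            if (g[i] == gene) { print rank; exit }
        }
    }
}' hum1003.tsv
On the sample above these print 14 and 2 respectively.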
My Tcl source files are in utf-8. Tclhttpd would not send national characters properly, so I modified it a bit. However, I also send binary stuff like jpg images and sometimes binary chunks are present in my otherwise utf-8 HTML. I have difficulty calculating the proper Content-length to match exactly what the browser receives (otherwise some trailing characters clobber the next-request headers or the browser keeps waiting 30 sec per request, until a timeout).
In other words, can I somehow find out how many bytes puts $socket wrote into the socket?
I have discovered a particular 11-byte sequence that messes up counting:
proc dump3 string {
binary scan $string c* c
binary scan $string H* hex
return [sdump $string]\n$c\n$hex
};#dump3
proc Httpd_ReturnData {sock type content {code 200} {close 0}} {
global Httpd
upvar #0 Httpd$sock data
#...skip non-pertinent code...
set content \x4f\x4e\xc2\x00\x03\xff\xff\x80\x00\x3c\x2f
#content=ONÂÿÿ�</
#79 78 -62 0 3 -1 -1 -128 0 60 47
#4f4ec20003ffff80003c2f
puts content=[dump3 $content]
puts utf8=[dump3 [encoding convertto utf-8 $content]]
if {[catch {
puts "string length=[string length $content] type=$type"
puts "stringblength=[string bytelength $content]"
set len [string length $content]
if [string match -nocase *utf-8* $type] {
fconfigure $sock -encoding utf-8
set len [string bytelength $content]
}
puts "len=$len fcon=[fconfigure $sock]"
HttpdRespondHeader $sock $type $close $len $code
HttpdSetCookie $sock
puts $sock ""
if {$data(proto) != "HEAD"} {
##fconfigure $sock -translation binary -blocking $Httpd(sockblock)
##native: -translation {auto crlf}
fconfigure $sock -translation lf -blocking $Httpd(sockblock)
puts -nonewline $sock $content
}
Httpd_SockClose $sock $close
} err]} {
HttpdCloseFinal $sock $err
}
}
The output on console is:
content=ONÂÿÿ�</
79 78 -62 0 3 -1 -1 -128 0 60 47
4f4ec20003ffff80003c2f
utf8=ON�ÿÿ�</
79 78 -61 -126 0 3 -61 -65 -61 -65 -62 -128 0 60 47
4f4ec3820003c3bfc3bfc280003c2f
string length=11 type=text/html;charset=utf-8
stringblength=17
len=17 fcon=-blocking 0 -buffering full -buffersize 16384 -encoding utf-8 -eofchar {{} {}} -translation {auto crlf} -peername {128.0.0.71 128.0.0.71 55305} -sockname {128.0.0.8 gen 8016}
HttpdRespondHeader 17
The resulting Content-Length: 17 is too large, and the browser keeps waiting. If only I could know beforehand how many bytes puts will make out of my string, the rest would be easy. Is there a way?
For data going over HTTP, the content length should be the number of bytes in the data as observed on the wire. When working with Httpd_ReturnData you need to ensure that you provide it the binary data to transfer; it does not handle encoding the data for you.
To send binary data with a length it's actually easy, and you do:
set binaryData [...]
Httpd_ReturnData $sock "application/octet-stream" $binaryData
# There are many other binary encodings; that's just the most universal one
# Choose the right one for your application, of course
To send text data with a length, you need to do a little more work with encoding convertto:
set textData [...]
Httpd_ReturnData $sock "text/plain; charset=utf-8" \
[encoding convertto utf-8 $textData]
# Similarly, text/plain is a decent fallback here too
(Yes, if you choose a different encoding then you should mention that in both places. You probably ought to use UTF-8 for all text content in this day and age.)
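Concretely, in a patched Httpd_ReturnData the length computation could look like this; a sketch reusing the helpers visible in your snippet, not the stock tclhttpd code:
# convert first, then measure and send exactly those bytes
set bytes [encoding convertto utf-8 $content]
set len [string length $bytes]        ;# a byte count, since $bytes is binary data
HttpdRespondHeader $sock $type $close $len $code
HttpdSetCookie $sock
puts $sock ""
fconfigure $sock -translation binary  ;# no further eol or encoding rewriting
puts -nonewline $sock $bytes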
If you can pull the data from a file, you should do so; Httpd_ReturnFile is more efficient than Httpd_ReturnData as it can move the data using efficient data transfer techniques. If sending a text file, you need to be careful to describe the encoding of the file correctly. By far the easiest way to do that is by convention, such as deciding that all text files on your system are UTF-8...
You should virtually never use string bytelength, as that reports in units that are one of Tcl's internal-only encodings (a lightly-denormalized almost-UTF-8). The measure it returns is only correct when you're doing something very weird like generating C code that needs to know buffer sizes that contain strings that will be fed into Tcl's implementation, which is very much not what you're doing (I've only done that sort of thing once in more than 20 years of using Tcl; I've never heard of another legitimate use). I believe it is deprecated precisely because it has a bunch of subtle bugs in how it is used by all too many people.
What I want to do is plot a graph where the x axis takes a date from the first column of the data and a time from the second column, and uses both to create the x axis.
I have a set of data from a data logger that I want to graph in gnuplot; I get new data every day, so it would be easy to just add on each txt file as I get them.
The text files look like this (each spans 24 hours):
Date Time Value
30/07/2014 00:59:38 0.075
30/07/2014 00:58:34 0.102
30/07/2014 00:57:31 0.058
30/07/2014 00:56:31 0.089
30/07/2014 00:55:28 0.119
30/07/2014 00:54:26 0.151
30/07/2014 00:53:22 0.17
30/07/2014 00:52:19 0.171
30/07/2014 00:51:17 0.221
30/07/2014 00:50:17 0
30/07/2014 00:49:13 0
30/07/2014 00:48:11 0
30/07/2014 00:47:09 0
This solution for mixing date and time on the gnuplot x axis would suit me perfectly, but it's very complex and I have no idea what is going on, let alone how to apply it to multiple files.
Here's the code I tried, but I get an "illegal day of the month" error:
#!/gnuplot
set timefmt '%d/%m/%Y %H:%M:%S'
set xdata time
set format x '%d/%m/%Y %H:%M:%S'
#DATA FILES
plot '30.07.2014 Soli.txt' using 1:3 title '30/07/2014' with points pt 5 lc rgb 'red',\
'31.07.2014 Soli.txt' using 1:3 title '31/07/2014' with points pt 5 lc rgb 'blue'
All help appreciated! Thanks
Such an error is triggered if some unexpected data appears in the data file, like the uncommented and unused header line in your case.
The following file.dat
Date Time Value
30/07/2014 00:59:38 0.075
30/07/2014 00:58:34 0.102
gives such an error with the minimal script
set xdata time
set timefmt '%d/%m/%Y %H:%M:%S'
plot 'file.dat' using 1:3
To solve the error, remove the first line (or similar lines in between).
Since version 4.6.6 you could also use the skip option to skip some lines at the beginning of the data file like:
set xdata time
set timefmt '%d/%m/%Y %H:%M:%S'
plot 'file.dat' using 1:3 skip 1
I wanted to note that you probably don't have to remove non-compliant lines like the "Date Time Value" header line; you can just comment it out instead with the hash/pound/octothorpe:
#Date Time Value
This will then be ignored by Gnuplot at plot time, but stay visible to you when you want to remember which data is in which column.
You get "month error" due to first line.
Your first line "Date Time Value" doesn't match with time format.
In my humble opinion, you have 2 options.
Delete first line and set titles manually and don't change anything of your code
30/07/2014 00:59:38 0.075
30/07/2014 00:58:34 0.102
30/07/2014 00:57:31 0.058
2. Keep the data file unchanged and modify the titles in your gnuplot code, setting columnhead so that the first line of your data file is ignored:
plot '30.07.2014 Soli.txt' using 1:3 title columnhead with points pt 5 lc rgb 'red',\
'31.07.2014 Soli.txt' using 1:3 title columnhead with points pt 5 lc rgb 'blue'
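For completeness, a sketch of the original script with the skip suggestion from the earlier answer folded in (assuming gnuplot 4.6.6 or later):
set xdata time
set timefmt '%d/%m/%Y %H:%M:%S'
set format x '%d/%m/%Y %H:%M:%S'
plot '30.07.2014 Soli.txt' using 1:3 skip 1 title '30/07/2014' with points pt 5 lc rgb 'red',\
     '31.07.2014 Soli.txt' using 1:3 skip 1 title '31/07/2014' with points pt 5 lc rgb 'blue'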