Sort large text file by length of a column - sorting
I have a roughly 2 GB FASTA (text) file that needs to be sorted by the length of its 4th column. It looks like this:
MERCURE:174:C0UT3ACXX:5:2316:18091:100842/1 + dogpremirnas 4910 AAAAAAAAAA DDC#BBDDDD 0 3:T>A,9:T>A
MERCURE:174:C0UT3ACXX:5:2316:18110:100902/1 + dogpremirnas 4909 AAAAAAAAAA DDDDDBDDBD 0 0:G>A,4:T>A
MERCURE:174:C0UT3ACXX:5:2316:18153:100840/1 - dogpremirnas 2269 TTTTTTTTTTT BDDB>9<#A>< 0 5:C>T,9:C>T
MERCURE:174:C0UT3ACXX:5:2316:18259:100924/1 + dogpremirnas 833 ACCGATCTCGTA CHHFCC8ACBBB 0 6:G>C,7:C>T,8:T>C
MERCURE:174:C0UT3ACXX:5:2316:18344:100886/1 + dogpremirnas 11734 AAAAAAAAAA DCDCDDDDDD 0 4:C>A,9:G>A
MERCURE:174:C0UT3ACXX:5:2316:18415:100878/1 + dogpremirnas 4909 AAAAAAAAAA BDDCDDDDDB 0 0:G>A,4:T>A
MERCURE:174:C0UT3ACXX:5:2316:18442:100808/1 + dogpremirnas 11734 AAAAAAAAAA DDDDDDDDDB 0 4:C>A,9:G>A
MERCURE:174:C0UT3ACXX:5:2316:18461:100754/1 + dogpremirnas 4914 AAAAAAAAAA DDDDDDDBDB 0 5:T>A,6:T>A
MERCURE:174:C0UT3ACXX:5:2316:18464:100926/1 + dogpremirnas 833 ACCGATCTCGTA HHHFCC/=CBBB 0 6:G>C,7:C>T,8:T>C
and needs to be sorted by the length of that column. The man page of the sort command says I can specify a key, but gives no indication of how to sort by a field's length.
I only need the lines that have more than 20 symbols in the 4th column. Unfortunately, the software that produced this output (bowtie) doesn't offer such an option either.
Any suggestions would be very welcome.
Thanks.
I like awk for working with column data like this:
awk 'length($5) > 20' /path/to/input > outputfile

(Note that awk numbers whitespace-separated fields from 1, so the sequence column in your sample is $5, even though you describe it as the 4th column.)
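Since you also want the file actually sorted by that length, not just filtered, one common shell idiom is decorate-sort-undecorate: prepend the length as a numeric sort key, sort on it, then cut the key back off. A sketch combining it with the filter (assuming whitespace-separated fields with the sequence in field 5; /path/to/input and outputfile are placeholders):

```shell
# keep lines whose 5th field is longer than 20 characters,
# then sort them by that length (ascending)
awk 'length($5) > 20 {print length($5), $0}' /path/to/input \
  | sort -k1,1n \
  | cut -d' ' -f2- > outputfile
```

For a 2 GB file this stays memory-friendly, since sort spills to temporary files as needed.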
Invalid syntax loop in Stata
I'm trying to run a for loop to make a balance table in Stata (comparing the demographics of my dataset with national-level statistics). For this, I'm prepping my dataset and attempting to calculate the percentages/averages for some key demographics.

preserve
rename unearnedinc_wins95 unearninc_wins95
foreach var of varlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019 { //continuous or binary; to put categorical vars use kwallis test
    dis "for variable `var':"
    tabstat `var'
    summ `var'
    local `var'_samplemean=r(mean)
}
clear
set obs 11
gen var=""
gen sample=.
gen F=.
gen pvalue=.
replace var="% Female" if _n==1
replace var="Age" if _n==2
replace var="% Non-white" if _n==3
replace var="HH size" if _n==4
replace var="% Parent" if _n==5
replace var="% Employed" if _n==6
replace var="Savings stock ($)" if _n==7
replace var="Debt stock ($)" if _n==8
replace var="Earned income last mo. ($)" if _n==9
replace var="Unearned income last mo. ($)" if _n==10
replace var="% Under FPL 2019" if _n==11
foreach col of varlist sample {
    replace `col'=100*round(`fem_`col'mean', 0.01) if _n==1
    replace `col'=round(`age_`col'mean') if _n==2
    replace `col'=100*round(`nonwhite_`col'mean', 0.01) if _n==3
    replace `col'=round(`hhsize_`col'mean', 0.1) if _n==4
    replace `col'=100*round(`parent_`col'mean', 0.01) if _n==5
    replace `col'=100*round(`employed_`col'mean', 0.01) if _n==6
    replace `col'=round(`savings_wins95_`col'mean') if _n==7
    replace `col'=round(`debt_wins95_`col'mean') if _n==8
    replace `col'=round(`earnedinc_wins95_`col'mean') if _n==9
    replace `col'=round(`unearninc_wins95_`col'mean') if _n==10
    replace `col'=100*round(`underfpl2019_`col'mean', 0.01) if _n==11
}

In the second half of the loop, I keep getting an 'invalid syntax' error.
For context, in the first half of the loop (before clearing the dataset), the code stores the average values of the variables as a macro (`var'_samplemean). Can someone help me out and mend this loop?

My sample data:

clear
input byte fem float(age nonwhite) byte(hhsize parent) float employed double(savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95) float underfpl2019
1 35 1 6 1 1 0 2500 0 0 0
0 40 0 4 1 1 0 10000 1043 0 0
0 40 0 4 1 1 0 20000 2400 0 0
0 40 0 4 1 1 .24 20000 2000 0 0
0 40 0 4 1 1 10 . 2600 0 0

Thanks!
Thanks for sharing the snippet of data. Apart from the fact that the variable unearninc_wins95 has already been renamed in your sample data, the code runs fine for me without returning an error. That being said, the columns for your F-statistics and p-values are empty once the loop at the bottom of your code completes.

As far as I can see, there is no local/varlist called sample which you're attempting to call with the line foreach col of varlist sample {. This could be because you haven't included it in your code, in which case please do, or it could be because you haven't created the local/varlist sample, in which case this could well be the source of your error message.

Taking a step back, there are more efficient ways of achieving what I think you're after. For example, you can get (part of) what you want using the package stat2data (if you don't have it installed already, run ssc install stat2data from the command prompt). You can then run the following code:

stat2data fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019, saving("~/yourstats.dta") stat(count mean)

which returns:

preserve
use "~/yourstats.dta", clear
. list, sep(11)

     +----------------------------+
     |        _name   sN    smean |
     |----------------------------|
  1. |          fem    5       .2 |
  2. |          age    5       39 |
  3. |     nonwhite    5       .2 |
  4. |       hhsize    5      4.4 |
  5. |       parent    5        1 |
  6. |     employed    5        1 |
  7. | savings_wins    5    2.048 |
  8. |  debt_wins95    4    13125 |
  9. | earnedinc_wi    5   1608.6 |
 10. | unearninc_wi    5        0 |
 11. | underfpl2019    5        0 |
     +----------------------------+

restore

This is missing the empty F-statistic and p-value variables you created in your code above, but you can always add them in the same way you have with gen F=. and gen pvalue=.. The presence of these variables, though, indicates you want to run some tests at some point and then fill the cells with values from them. I'd offer advice on how to do this, but it's not obvious to me from your code what you want to test.
If you can clarify this I will try and edit this answer to include that.
This doesn't answer your question directly; as others gently point out, the question is hard to answer without a reproducible example. But I have several small comments on your code which are better presented in this form.

Assuming that all the variables needed are indeed present in the dataset, I would recommend something more like this:

local myvars fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019

local desc `" "% Female" "Age" "% Non-white" "HH size" "% Parent" "% Employed" "Savings stock ($)" "Debt stock ($)" "Earned income last mo. ($)" "Unearned income last mo. ($)" "% Under FPL 2019" "'

gen variable = ""
gen mean = ""

local i = 1
foreach var of local myvars {
    summ `var', meanonly
    local this : word `i' of `desc'
    replace variable = "`this'" in `i'
    if inlist(`i', 1, 3, 5, 6, 11) {
        replace mean = strofreal(100 * r(mean), "%2.0f") in `i'
    }
    else if `i' == 4 {
        replace mean = strofreal(r(mean), "%2.1f") in `i'
    }
    else replace mean = strofreal(r(mean), "%2.0f") in `i'
    local ++i
}

This has not been tested. Points arising include:

Using in is preferable for what you want over testing the observation number with if.

round() is treacherous for rounding to so many decimal places. Most of the time you will get what you want, but occasionally you will get bizarre results arising from the fact that Stata works in binary, like any equivalent program. It is safer to treat rounding as a problem in string manipulation and use display formats as offering precisely what you want.

If the text you want to show is just the variable label for each variable, this code could be simplified further. The code hints at intent to show other stuff, which is easily done compatibly with this design.
Group values in 1-50, 51-100, 101-150, etc. in Crystal Reports
I have a list of values (selected from a DB) that starts from 1, but the end of the range is unknown. I want to group these values in Crystal Reports so that values between 1 and 50 form one group, then 51 to 100, then 101 to 150, and so on, up to the maximum value. How should I group?

For example, the selected values of the column rate are 1, 1.6, 2, 56, 71.1, 61.9, 109, 118, etc. I want to group like:

rate range(1-50)
1
1.6
2
-------------------------------
rate range(51-100)
56
71.1
61.9
--------------------
rate range(101-150)
109
118

etc., but I don't know the exact max value of the list.
Create a formula similar to this logic, and group the report on that formula:

NumberVar YourValue := 300;
NumberVar FromValue := 50*Floor(YourValue/50);
NumberVar ToValue := FromValue + 50;
ToText(FromValue,0,"") & "-" & ToText(ToValue,0,"");
Create the following formula and group the report on that formula:

Floor(({yourTable.Value}-1)/50)+1

It returns the group number. Examples:

for the value 23 it returns 1
for the value 150.99 it returns 3
for the value 543 it returns 11

Use the following formula as the group title:

"rate range(" & CStr({#GroupNumber}*50-49, "#") & "-" & CStr({#GroupNumber}*50, "#") & ")"
How can I extract parts of one column and append them to other columns?
I have a large .csv file that I need to extract information from and add this information to another column. My csv looks something like this:

file_name,#,Date,Time,Temp (°C) ,Intensity
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,

I want to create two new columns that contain data from the "file_name" column: I want to extract the one or two digits after the text "trap", and I want to extract the c or the u, and create new columns with this data. The data should look something like this after processing:

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12

I suspect the way to do this is with awk and a regular expression, but I'm not sure how to implement the regular expression. How can I extract parts of one column and append them to other columns?
Using sed you can do this:

sed -E '1s/.*/&,can_und,trap_no/; 2,$s/trap([0-9]+)([a-z]).*/&\2,\1/' file.csv

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
gawk approach:

awk -F, 'NR==1{ print $0,"can_und,trap_no" }
         NR>1{ match($1,/^trap([0-9]+)([a-z])/,a); print $0 a[2],a[1] }' OFS="," file

The output:

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12

NR==1{ print $0,"can_und,trap_no" } - prints the header line
match($1,/^trap([0-9]+)([a-z])/,a) - matches the number following the word trap and the suffix letter that follows it
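The three-argument form of match() used above is a gawk extension; with a strictly POSIX awk the same two values can be pulled out with match() plus substr(). A rough sketch under the same assumptions (every data line's first field starts with trap, some digits, then a single letter):

```shell
awk -F, 'NR==1 { print $0, "can_und,trap_no"; next }
{
  match($1, /^trap[0-9]+[a-z]/)          # sets RSTART/RLENGTH, e.g. for "trap12u"
  head   = substr($1, 1, RLENGTH)
  letter = substr(head, RLENGTH, 1)      # the trailing c/u
  number = substr(head, 5, RLENGTH - 5)  # digits between "trap" and the letter
  print $0 letter, number
}' OFS="," file
```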
With sed, it can be done like this:

sed 's/trap\([[:digit:]]\+\)\(.\)\(.*\)$/trap\1\2\3\2,\1/' file

Use sed -i ... to replace it in the file.
Using the python pandas reader, because python is awesome for numerical analysis.

First, I had to modify the data header row so that the columns were consistent, by appending 3 commas:

file_name,#,Date,Time,Temp (°C) ,Intensity,,,

There is probably a way to tell pandas to ignore the column differences, but I am yet a noob.

Python code to read your data into columns and create 2 new columns named 'cu_int' and 'cu_char', which contain the parsed elements of the filenames:

import pandas


def main():
    df = pandas.read_csv("file.csv")
    df['cu_int'] = 0    # Add the new columns to the data frame.
    df['cu_char'] = ' '

    for index, df_row in df.iterrows():
        file_name = df['file_name'][index].strip()
        trap_string = file_name.split("_")[0]  # Get the file_name string prior to the underscore
        numeric_offset_beg = len("trap")  # Parse the number following the 'trap' string.
        numeric_offset_end = len(trap_string) - 1  # Leave off the 'c' or 'u' char.
        numeric_value = trap_string[numeric_offset_beg:numeric_offset_end]
        cu_value = trap_string[len(trap_string) - 1]
        df.at[index, 'cu_int'] = int(numeric_value)  # Assign to this row only, not the whole column.
        df.at[index, 'cu_char'] = cu_value

    # The pandas dataframe is ready for number crunching.
    # For now just print it out:
    print(df)


if __name__ == "__main__":
    main()

The printed output (note there are inconsistencies in the data set posted - see row 1 as an example):

$ python read_csv.py
                     file_name   #      Date        Time  Temp (°C)   Intensity  Unnamed: 6  Unnamed: 7  Unnamed: 8  cu_int cu_char
0  trap12u_10733862_150809.txt   1  05/28/15  06:00:00.0     20.424       215.3         NaN         NaN         NaN      12       u
1  trap12u_10733862_150809.txt   2  05/28/15  07:00:00.0     21.091         1.0       130.2         NaN         NaN      12       u
2  trap12u_10733862_150809.txt   3  05/28/15  08:00:00.0     26.195         3.0       100.0         NaN         NaN      12       u
3  trap11u_10733862_150809.txt   4  05/28/15  09:00:00.0     25.222         3.0       444.5         NaN         NaN      11       u
4  trap11u_10733862_150809.txt   5  05/28/15  10:00:00.0     26.195         3.0       100.0         NaN         NaN      11       u
5  trap11u_10733862_150809.txt   6  05/28/15  11:00:00.0     25.902         2.0       927.8         NaN         NaN      11       u
6  trap11u_10733862_150809.txt   7  05/28/15  12:00:00.0     25.708         2.0       325.0         NaN         NaN      11       u
7  trap12c_10733862_150809.txt   8  05/28/15  13:00:00.0     26.292         3.0       100.0         NaN         NaN      12       c
8  trap12c_10733862_150809.txt   9  05/28/15  14:00:00.0     26.390         2.0        66.7         NaN         NaN      12       c
9  trap12c_10733862_150809.txt  10  05/28/15  15:00:00.0     26.097         1.0       463.9         NaN         NaN      12       c
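For what it's worth, the row-by-row loop above can be replaced with a single vectorized pandas call: Series.str.extract applies the regular expression to the whole column at once, which is both shorter and much faster on large files. A sketch, recreating two rows of the sample file.csv inline so it is self-contained:

```python
import pandas

# recreate a small file.csv so the sketch runs on its own
with open("file.csv", "w") as fh:
    fh.write("file_name,#,Date,Time,Temp (°C) ,Intensity,,,\n")
    fh.write("trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,\n")
    fh.write("trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,\n")

df = pandas.read_csv("file.csv")
# one regex pass over the whole column: the digits after "trap", then the c/u letter
parts = df["file_name"].str.extract(r"^trap(\d+)([a-z])", expand=True)
df["cu_int"] = parts[0].astype(int)
df["cu_char"] = parts[1]
print(df)
```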
How can I convert the qblast XML output into the NCBI BLAST -outfmt 17?
I started my project with standalone NCBI BLAST and used the -outfmt 17 option. For my purposes that formatting is extremely helpful. However, I had to switch to Biopython, and I'm now using qblast to align my sequences against the NCBI NT database. Can I save/convert the qblast XML output into a format comparable to the standalone NCBI BLAST -outfmt 17 format? Thank you very much for your help! Cheers, Philipp
I'm going to assume you meant -outfmt 7 and that you need an output with columns.

from Bio.Blast import NCBIWWW, NCBIXML

# This is the BLASTN query, which returns an XML handle in a StringIO
r = NCBIWWW.qblast(
    "blastn", "nr",
    "ACGGGGTCTCGAAAAAAGGAGAATGGGATGAGAAGGATATATGGGTAGTGTCATTTTTTAACTTGCAGAT" +
    "TTCATCCTAGTCTTCCAGTTATCGTTTCCTAGCACTCCATGTTCCCAAGATAGTGTCACCACCCCAAGGA" +
    "CTCTCTCTCATTTTCTTTGCCTGGGCCCTCTTTCTACTGAGGAGTCGTGGCCTTCCATCAGTAGAAGCCG",
    expect=1E-5)

# Now we read that XML, extracting the info
for record in NCBIXML.parse(r):
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            cols = "{}\t" * 10
            print(cols.format(hsp.positives / hsp.align_length,
                              hsp.align_length,
                              hsp.align_length - hsp.positives,
                              hsp.gaps,
                              hsp.query_start,
                              hsp.query_end,
                              hsp.sbjct_start,
                              hsp.sbjct_end,
                              hsp.expect,
                              hsp.score))

Outputs something like:

1   210   0    0   1   210   89250   89459   8.73028e-102   420.0
0   206   19   2   5   210   46259   46462   5.16461e-73    314.0
1   210   0    0   1   210   68822   69031   8.73028e-102   420.0
0   206   19   2   5   210   25825   26028   5.16461e-73    314.0
1   210   0    0   1   210   65887   66096   8.73028e-102   420.0
...
High & Low Numbers From A String (Ruby)
Good evening, I'm trying to solve a problem on Codewars:

In this little assignment you are given a string of space separated numbers, and have to return the highest and lowest number.

Example:

high_and_low("1 2 3 4 5")  # return "5 1"
high_and_low("1 2 -3 4 5") # return "5 -3"
high_and_low("1 9 3 4 -5") # return "9 -5"

Notes:
All numbers are valid Int32, no need to validate them.
There will always be at least one number in the input string.
Output string must be two numbers separated by a single space, and highest number is first.

I came up with the following solution; however, I cannot figure out why the method is only returning "542" and not "-214 542". I also tried using #at, #shift and #pop, with the same result. Is there something I am missing? I hope someone can point me in the right direction. I would like to understand why this is happening.

def high_and_low(numbers)
  numberArray = numbers.split(/\s/).map(&:to_i).sort
  numberArray[-1]
  numberArray[0]
end

high_and_low("4 5 29 54 4 0 -214 542 -64 1 -3 6 -6")

EDIT

I also tried this and receive a failed test "Nil":

def high_and_low(numbers)
  numberArray = numbers.split(/\s/).map(&:to_i).sort
  puts "#{numberArray[-1]}" + " " + "#{numberArray[0]}"
end
When you omit the return statement, a Ruby method returns only the value of the last expression evaluated in its body; here that is numberArray[0], so only one number comes back. To return both values as an Array, write:

def high_and_low(numbers)
  numberArray = numbers.split(/\s/).map(&:to_i).sort
  return numberArray[0], numberArray[-1]
end

high_and_low("4 5 29 54 4 0 -214 542 -64 1 -3 6 -6")
# => [-214, 542]
Using sort would be inefficient for big arrays. Instead, use Enumerable#minmax:

numbers.split.map(&:to_i).minmax # => [-214, 542]

Or use Enumerable#minmax_by if you'd like the results to remain strings:

numbers.split.minmax_by(&:to_i) # => ["-214", "542"]
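Since the kata expects a single string with the highest number first, minmax drops straight into a complete solution:

```ruby
def high_and_low(numbers)
  # minmax returns [lowest, highest] in one pass over the array
  min, max = numbers.split.map(&:to_i).minmax
  "#{max} #{min}"
end

puts high_and_low("4 5 29 54 4 0 -214 542 -64 1 -3 6 -6")  # prints "542 -214"
```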