SortedEffects R Package - sorting

I study the association between air pollution levels and the weekly number of cases of a respiratory disease controlling for various socioeconomic characteristics at the census block level in a state. I do not attempt to examine the causal effect of pollution - just the association between air pollution and the respiratory disease.
I want to explore the heterogeneity of the effect of pollution on the respiratory disease incidence across census blocks. I know that I can implement the partial (sorted) effects method and classification analysis using the SortedEffects R package. However, my main variable of interest, the level of pollution, is continuous and not binary. In this case, does it still make sense for me to use the package (its spe command, etc.)?
If I set var_type = "continuous", the spe command gives me the following error: "Error in FUN(left, right) : non-numeric argument to binary operator".
If I set var_type = "binary", which is not the case in 'real life', the command starts working, but then it gives new errors: "Error in quantile.default(t_hat, 1 - alpha) : missing values and NaN's not allowed if 'na.rm' is FALSE. In addition: Warning message: In predict.lm(model_fit, newdata = d1) : prediction from a rank-deficient fit may be misleading"
I do not know what I am doing wrong.
I am quite new to R, so sorry in advance.
Thank you.

Related

Creating balanced bootstrap resamples in caret

I'm using caret to compare models for a classification problem with nested CV: V-fold in the outer loop and bootstrap (500 replicates) in the inner loop. I get this warning after training knn:
Warning: There were missing values in resampled performance measures.
I believe this happens because some resamples have zero items of the class of interest in the holdout sample, yielding NA for Sensitivity and ROC. My question is: is there any way to ensure that items from this class are present in every bootstrap resample, similar to what the createDataPartition function does? (I believe this is also called a stratified bootstrap.)
If not, how should we proceed with this? (In terms of comparing model performance on the same resamples)
Thanks!
I couldn't find a way to do this within caret, but here is a workaround using the rsample package. The idea is to compute the resamples beforehand, convert them to caret format, and pass them to the trainControl function via its index and indexOut arguments.
library(rsample)
library(caret)
# Stratified bootstrap resamples, then conversion to caret's index format
indices <- bootstraps(train, times = 50, strata = "class_of_interest")
indices <- rsample2caret(indices)
train_control <- trainControl(method = "boot", number = 50, index = indices$index, indexOut = indices$indexOut)
Hope this helps.
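To make the idea behind the workaround concrete, here is a minimal, language-agnostic sketch in Python of what a stratified bootstrap does: it resamples with replacement inside each class, so every in-bag sample keeps the original class counts. The function name stratified_bootstrap and the toy labels are hypothetical and are not part of caret or rsample. Note that stratifying controls the in-bag sample only; the out-of-bag (holdout) set will almost always contain the rare class, but that is not strictly guaranteed.

```python
import random
from collections import defaultdict


def stratified_bootstrap(labels, times, seed=0):
    """Return a list of (in_bag, out_of_bag) index lists.

    Hypothetical helper: sampling is done with replacement within each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    resamples = []
    for _ in range(times):
        in_bag = []
        for members in by_class.values():
            # Draw with replacement inside the class, keeping its original size.
            in_bag.extend(rng.choices(members, k=len(members)))
        chosen = set(in_bag)
        # Rows never drawn form the holdout set for this resample.
        out_of_bag = [i for i in range(len(labels)) if i not in chosen]
        resamples.append((sorted(in_bag), out_of_bag))
    return resamples


# Toy data (made up): 90 negatives and 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10
resamples = stratified_bootstrap(labels, times=50)
```

Each pair plays the same role as one entry of index and indexOut in the trainControl call above.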

How to make forecast using the fpp2 package?

This is the first time I am using the fpp2 package to make a linear forecast. I have successfully installed the package. However, I am getting an error when using the commands.
I have already converted the data in time series using ts command.
library(SPEI)
library(fpp2)
m<- read.delim("D:/PHD_UOM/PHD_Dissertation/PhD/PhD_R/mydata/mruspi.txt")
head(m)
y<- spi(ts(m$mru,freq=12,start=c(1971,1)), end=c(2019,12),scale =12)
y
forecast(y,12)
naive(y,12)
forecast(y,12)
Error in is.constant(y) :
'list' object cannot be coerced to type 'double'
naive(y,12)
Error in x[, (1 + cs[i]):cs[i + 1]] <- xx :
incorrect number of subscripts on matrix
The problem seems to be with the output of spi: according to the manual, it returns an object of class spi, which is probably not suitable as input to the forecast function. You might need to use the fitted component of the spi object instead: y$fitted
For the documentation of SPEI package (specifically p. 6-7): SPEI Documentation
There is a newer version of fpp, called fpp3. I recommend installing fpp3 for starters:
install.packages("fpp3")
There is an excellent book that demonstrates how to use fpp3, called Forecasting: Principles and Practice. It can be purchased via Amazon, or read for free online from the author: https://otexts.com/fpp3/. I am working my way through the book on my own (not in a class); it is very clear and extremely well written, and I strongly recommend it for learning forecasting.
I am unable to load the library spei, R returns this error:
"package ‘spei’ is not available for this version of R"
If you are able to update R and fpp, then an example of making a linear forecast would be:
library(fpp3)
library(tidyverse)
us_change %>%
model(TSLM(Unemployment ~ Consumption + Production + Savings + season() + trend())) %>%
report()
You can learn more about linear regression using fpp3 here: https://otexts.com/fpp3/regression-intro.html

Limiting BART HuggingFace Model to complete sentences of maximum length

I'm implementing BART on HuggingFace, see reference: https://huggingface.co/transformers/model_doc/bart.html
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
def baseBart(ARTICLE_TO_SUMMARIZE):
    inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
    return [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness with my summaries so I am setting max_length=25. In doing so though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX 2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary contains only coherent sentences with complete thoughts and remains concise? If possible, I'd prefer not to run a regex on the summarized output and cut off any text after the last period, but to have the BART model itself produce complete sentences within the maximum length.
I tried setting truncation=True in the model but that didn't work.
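For reference, the post-processing fallback mentioned in the question (cutting off any text after the last period) can be sketched in a few lines of plain Python, without a regex. This is only the fallback, not a way to make the model itself stop at a sentence boundary. The helper name trim_to_last_sentence is hypothetical, and the sketch assumes that ".", "!" and "?" end sentences, so abbreviations such as "Dr." would mislead it.

```python
def trim_to_last_sentence(text, terminators=".!?"):
    """Drop any trailing fragment after the last sentence terminator."""
    cut = max(text.rfind(ch) for ch in terminators)
    # If no terminator is found, return the text unchanged rather than emptying it.
    return text if cut == -1 else text[: cut + 1]


# Second example from the question, joined onto one line.
summary = (
    "There is normal mineralization and alignment. "
    "No fracture or osseous lesion is identified. The ankle mort"
)
print(trim_to_last_sentence(summary))
```

With the second example above, the dangling "The ankle mort" fragment is removed and the two complete sentences are kept.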

Adjusting the MAFFT command line algorithm to better account for gaps

I've been attempting to use the MAFFT command line tool as a means to identify coding regions within a genome. My general process is to align the amino acid consensus sequence of a gene to a translated reading frame of a target sequence. My method has been largely successful. However, I've noticed some peculiar alignments which will unfortunately impede my annotation method. The following is one such example (Note - I've also included a pairwise alignment from the Pairwise2 Biopython module to demonstrate my desired output. Unfortunately, the computation time for Pairwise2 is nearly 20 times slower than MAFFT command line):
import io
import os
from time import time
from Bio import AlignIO
from Bio import pairwise2
from Bio.SubsMat import MatrixInfo as matlist
from Bio.pairwise2 import format_alignment
from Bio.Align.Applications import MafftCommandline
startTime = time()
sample_tList = [['>Frame 1', 'RIGVGSIPRHLYCQELPLAQPKTCCAETPFRDSPLQGRLGVCPHLASGVALLYGLSTPLTMSGILDRCTCTPNARVFMAEGQVYCTRCLSARSLLPLNLQVPELGVLGLFYRPEEPLRWTLPRAFPTVECSPAGACWLSAIFPIARMTSGNLNFQQRMVRVAAEIYRAGQLTPAVLKVLQVYERGCRWYPIVGPVPGVGVYANSLHVSDKPFPGATHVLTNLPLPQRPKPEDFCPFECAMADVYDIGHGAVMFVAGGKVSWAPRGGDEVRFETVPEELKLIANRLHISFPPHHLVDMSKFAFIVPGSGVSLRVEHQHGCLPADIVPKGNCWWCLFDLLPPGVQNREIRYANQFGYQTKHGVSGKYLQRRLQINGLRAVTDTHGPIVVQYFSVKESWIRHFRLAGEPSLPGFEDLLRIRVESNTSPLADKDEKIFRFGSHKWYGAGKRARKARSGATTTVAHRASSARETRQAKKHEGVDANNAAHLEHYSPPAEGNCGWHCISAIVNRMVNSNFETTLPERVRPSDDWATDEDFVNTIQILRLPAALDRNGACKSAKYVLKLEGEHWTVSVAPGMSPSLLPLECVQGCCEHKGGLGSPDAVEVSGFDPTCLDRLAEVMHLPSSVIPAALAEMSNNSDRPASLVNTAWTVSQFYARHTGGNHRDQVRLGKIISLCQVIEECCCHQNKTNRATPEEVAAKIDQYLRGATSLEECLIKLERVSPPSAADTSFDWNVVLPGVEAAGPTTEQPHANQCCAPVPVVTQEPLDKDSVPLTAFSLSNCYYPAQGDEVRHRERLNSVLSKLEEVVLEEYGLMPTGLGPRPVLPSGLDELKDQMEEDLLKLANAQATSEMMALAAEQVDLKAWVKSYPRWIPPPPPPKVQPRRMKPVKSLPENKPVPAPRRKVRSDPGKSILAVGGPLNFSTPSELVTPLGEPVLMPASQHVSRPVTPLSEPAPVPAPRRIVSRPMTPLSEPTFVFAPWRKSQQVEEANPAAATLTCQDEPLDLSASSQTEYEAYPLAPLENIGVLEAGGQEAEEVLSGISDILDNTNPAPVSSSSSLSSVKITRPKYSAQAIIDSGGPCSGHLQKEKEACLRIMREACDAARLGDPATQEWLSHMWDRVDVLTWRNTSVYQAFRTLDGRFGFLPKMILETPPPYPCGFVMLPHTPTPSVSAESDLTIGSVATEDVPRILGKTENTGNVLNQKPLALFEEEPVCDQPAKDSRTLSRESGDSTTAPPVGTGGAGLPTDLPPLDGVDADGGGLLRTAKGKAERFFDQLSRQVFNIVSHLPVFFSHLFKSDSGYSPGDWGFAAFTLFCLFLCYSYPFFGFAPLLGVFSGSSRRVRMGVFGCWLAFAVGLFKPVSDPVGAACEFDSPECRNILHSFELLKPWDPVRSLVVGPVGLGLAILGRLLGGARYIWHFLLRLGIVADCILAGAYVLSQGRCKKCWGSCIRTAPNEIAFNVFPFTRATRSSLIDLCDRFCAPKGMDPIFLATGWRGCWTGQSPIEQPSEKPIAFAQLDEKRITARTVVSQPYDPNQAVKCLRVLQAGGAMVAEAVPKVVKVSAIPFRAPFFPTGVKVDPECRIVVDPDTFTTALRSGYSTTNLVLGVGDFAQLNGLKIRQISKPSGGGPHLIAALHVACSMVLHMLAGVYVTAVGSCGTGTSDPWCANPFAVPGYGPGSLCTSRLCISQHGLTLPLTALVAGFGLQEIALVVLIFVSIGGMAHRLSCKADMLCILLAIASYVWVPLTWLLCVFPCWLRWFSLHPLTILWLVFFLISVNMPSGILAVVLLVSLWLLGRYTNIAGLVTPYDIHHYTSGPRGVAALATAPDGTYLAAVRRAALTGRTMLFTPSQLGSLLEGAFRTRKPSLNTVNVVGSSMGSGGVFTIDGRIKCVTAAHVLTGNSARVSGVGFNQMLDFDVKGDFAIADCPNWQGVAPKTQFCGDGWTGRAYWLTSSGVEPGVIGDGFAFCFT
ACGDSGSPVITEAGELVGVHTGSNKQGGGIVTRPSGQFCNVTPIKLSELSEFFAGPKVPLGDVKVGSHIIKDTSEVPSDLCALLAAKPELEGGLSTVQLLCVFFLLWRMMGHAWTPLVAVGFFILNEVLPAVLVRSVFSFGMFALSWLTPWSAQVLMIRLLTAALNRNRVSLIFYSLGAVTGFVADLATTQGHPLQAVMNLSTYAFLPRMMVVTSPVPAIACGVVHLLAIILYLFKYRCLHHVLVGDGAFSAAFFLRYFAEGKLREGVSQSCGMSHESLTGALAIKLSDEDLDFLTKWTDFKCFVSASNMRNAAGQFIEAAYAKALRIELAQLVQVDKVRGTLAKLEAFADTVAPQLSPGDIVVALGHTPVGSIFDLKVGSTKHTLQAIETRVLAGSKMTVARVVDPTPAPPPAPVPIPLPPKVLENGPNAWGGEDRLNKRKRRRMEAVGIFVMDGKKYQKFWDKNSGDVFYEEVHNSTDEWECLRAGDPADFDPETGIQCGHVTIEDKVYNVFTSPSGRRFLVPANPENRRIQWEAARLSVEQALGMMNVDGELTAKELEKLKRIIDKLQGLTKEQCLNCPPVAPAVVAAAWLLLRQRKNFTTGPSPDLTKWPVRLSRTRSSTTNIRLPNRLMVVLCSCAPLFLRLMSSPALMHLLSYLPATGRETLGLMARFGILRPRPPKRKSHLVRKYRLVTLGAVTHLKLVSLISCTLLGATLSGKEFYRIQGLETYLTEPPVTLEAQCMRLPASRPMLLRLMGVPSWPQPCPPVLSCMYRPFQRPSLIILILGLTALNSQSTVVRMLLGTSPNTICPPKALFCLEFFALCGSTCLPMWVSARPFIGLPLTLPRILWLEMGTDFQPRIFRASLKSTFCAHRLCEKTGKLLLLVPSRSSIVGRRRLGQYLALITLRWPTGQRVVLPRASKRHSTRPSPSEKTNLRNYILQFAGALKLILHPAIDPHLQLSAGSLPIFFMNSPVLKSIYRRTCLTAVTTYWLRSPARLREAACRLATRLPPCQTPFTAYMHSTWCSVTLKVVTLMAFCFCKTSSLRTCSRFNPSSIQTTSCCMPSLPPCQITTGGLNITLCVSKRTQRRQPQTRHHFVAGMGVSSLTVTGFLRPSPTIRQAMSLNTTPRRLQYLWTAVLVSMILSGLKSSWLVRSAPARTVTASQARRSSCPCGKNSGPIMKGRSPECAGTAEPRLRTPLPVASTSVLTTPISTSIVLSSGVATRRVLALVVSVNLPWEKAQVLWMRCNKSRISLRGLSCMWSRVSPLLTQVDTKLAADSPLGVASGETKLTCQTVIMPVPPCSPLVKRSTWSLSPPTCCAAGSSSVPPALGKHTGSSNRSRMVMSFTRQLTRPCLTLGLWGCAGSTSQRVRRCNSLPPLVPARGFASWPAVGVLVRIPFWTKQRIAITLMSGFLAKPPLPAEISNNSTRWVLTLIAMFLTSCLRPNRPSGDSDRISVMPSNQITGTNLCPWSTQPVPRWTNLSGMGKSSPPTTGTERTAPSLSTPVKVPHLMWLHCICPLKIHSTGNEPLLLSPGQDMQSSCMTHTGNCRACLIFLRKAHPSTSQCSVTSSSYIEITKNARLLRLAMEINSGLQTSALILSAPFVQIWKGRAPRSPKLHITWGSISHLIHSLLNSQQNSHPTGPWQPRTMKSGLIGWLPAFAPSINIAARALVQAIWWAPRCFAPQGLCHTTSQNLLGARLKCFLRQSSAPAELRIAGSTSMIGSEKLLSPSHMPSLATSKALPVGDVITSPPDTFRASFLRNQLRSGFLAPEKLQRQFAHQMCTSQILKRTSTQRPSPSAGKCWILEKSDWSGKTRRPIFNLKAAISPGINLQATPHTSEFLLILQCIWTPAWALPFATGGLLGPPIGELTSRSPLMITVPKSFCLVHTMVKCLQGTKFWRARSSRLTTQGTNTLGDLNRIQRICTSLLGMVRTGRIIMKRFGRARKGKFIRLLPPASFIFPRALSLNQLATEMKWGLCRASLTKLVNFLWMLSR
NFWCPLLISSYFWPFCLASPSPAGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGILWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISNLDVVAHFQHLAAIEAETYKYLASRLPMLHNLRMTGSNVTIVYNSTLNQVFAIFPTSGSRPRLHDSQQWLIAVHSSIFSSVVASCTLFVVLWLRIPMLRSVFGFRWLGAIFLLNSRITRCVRLASPGRPLLRSMNPVGLFGAGGMTDAVRTTMTNGSWFRLASAKATPVFTPGWRSCHSATRPSSIPRYLGGTVKFMLTSRTNSFAPSTTGRTPPCLAMTTFQPYFRPTTNIRSTAVIGFTNGCAPSFPLGWFMFRGFSGVRLQAMFQFKSFRHQDQHYRSIRLCCPPGHQLPVWRLAPSDGSQELSVPHGDRDTRVHHHHSQCHRELFTFFSPHAFLLPFLCFDEKGIQSGIWQCVRHRGCVCLYQLRPTCQGVHPTLLGSRSCATASFHDTDHEVGNRFSLSFCHPTGNLNVQVCWGNAPRAVTRNCFLCGVSCRSVLLCSSTPAATAALIFSFITRYVSMAQIGWQKDLTGQWRLLSFFLCLTLFPMEHSPPAIFLTRLVSLCPPPGSITGGMSVVSMRSVLWLRFASSLGLRRTACPGATLVLDTPTSFWTLRADSIVGGRPLLRKGVRLKSRVTSTSKELCLMVPWQPLPEFQRNNGVVSRRLLPHGSTKGAFGVFHYLYASDDICSKGKSRPTARASAPFDLPELCFYLRVHDIRALSEHKGRAHYGGSSCTSLGGVLSHRNLEIHHLQMPFVLARPQVHSGPCPPRRKCRGLSSDCGKPRICRPASRLHYGRHIGARVEKPRVGWQKSCTGSGKPCQICQITTASSKRERRGTASQSISCARCWVRSSPNKTSPEARDRGRKIIREARRSPIFLRLKKMSGTTSPLVSGNCVCRRSRLPLTRAPGHVPCQIQGGVTLWSLVCRRIILCASASQHHPQHDELAFFGHLGVMIGRMCGEWHLTLCLVTYSIRATVWGSLIGENHAAAIKKKKKKKK'], ['>ORF2_GP2', 'MKWGLCKASLTKLANFLWMLSRSFWCPLLISSYFWPFCLASQSPVGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGVLWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISGLDVVAHFQHLAAIEAETCKYLASRLPMLHNLRLTGSNVTIVYNSTLDQVFAIFPTPGSRPKLHDFQQWLIAVHSSIFSSVAASCTLFVVLWLRIPMLRSVFGFRWLGATFLLNSW']]
ex_file = open("newTempFile112233.fasta", "w")
for items in sample_tList:
    ex_file.write(items[0] + "\n")
    ex_file.write(items[1] + "\n")
ex_file.close()
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
#mafft_cline = MafftCommandline(mafft_exe, input=in_file, localpair=True, lexp=-1.5, lop=0.5)
stdout, stderr = mafft_cline()
print(stdout)
test_align = AlignIO.read(io.StringIO(stdout), "fasta")
#print(test_align)
os.remove("newTempFile112233.fasta")
print('Total time = ' + str(time() - startTime))
startTime = time()
matrix = matlist.blosum62
pWise_align = pairwise2.align.localds(sample_tList[0][1], sample_tList[1][1], matrix, -6, -1)
print(format_alignment(*pWise_align[0]))
print('Total time = ' + str(time() - startTime))
I've attempted to change the MAFFT command line alignment algorithm by referencing the help document (http://mafft.cbrc.jp/alignment/software/manual/manual.html). I don't get any error messages, but the alignment output does not change. I'm unsure what adjustments need to be made. I believe that by increasing the gap extension penalty (which is zero by default), the alignment will be improved. I haven't been able to find many documentation examples where custom variables are used when using MAFFT command line on this forum or through Google search. Help is much appreciated. For reference, documentation on the Pairwise2 alignment parameters can be found here: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
Managed to figure out a possible solution. The alignment of the example sequences provided results in a long terminal/end gap which should not be present. Changing the MAFFT alignment algorithm using localpair, lexp, and lop had no effect (causing me a good deal of confusion). However, I have noticed differences in the alignment output when each input sequence is reversed. Oddly, the only way I was able to remove the terminal/end gap was to set the lop (gap opening penalty) to a lesser amount relative to lexp (gap extension penalty). I suspect my solution is niche and may not be applicable to other similar occurrences of terminal gaps. Changing the alignment settings also likely reduces the optimal alignment.
Going forward, I plan to use an automated process to run alignments of consensus sequences to raw sequences. In the event I detect irregularities with the alignment output (specifically terminal gaps), I'll attempt to reverse the input sequences and apply custom alignment settings. I suppose if that isn't a consistent solution, I'll figure out a way to refine the alignment output directly.
For anyone curious, I used a lexp value of -1.5 and a lop value of 0.5 (now included as a commented-out line in my example code).
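The terminal-gap check described above could start from something like the following minimal sketch, which counts the gap characters at each end of one aligned sequence (for example, a row of the FASTA alignment MAFFT prints). The function name terminal_gaps is hypothetical, and deciding how many terminal gaps count as an irregularity is left open because it depends on the sequences being aligned.

```python
def terminal_gaps(aligned_seq, gap="-"):
    """Return (leading, trailing) gap counts for one aligned sequence."""
    without_leading = aligned_seq.lstrip(gap)
    leading = len(aligned_seq) - len(without_leading)
    # Trailing gaps are measured on what remains, so an all-gap row is not counted twice.
    trailing = len(without_leading) - len(without_leading.rstrip(gap))
    return leading, trailing


# Made-up aligned row: 2 leading gaps, 1 internal gap, 3 trailing gaps.
print(terminal_gaps("--MKW-GL---"))
```

Internal gaps are ignored on purpose, since only the ends of the alignment are of interest here.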

AMPL Syntax Error - Greater than or equal to issue

I am modeling a schedule optimization problem in AMPL and am using Gurobi as the solver.
In this problem, I have declared a set of schedules from 1 through 1000, and called this set "Schedules".
Each schedule has a layer (a parameter called "layer" has been created) with a value ranging from 1 to 4. This is a preference or hierarchy of the days off that this particular schedule has.
I want several constraints that determine how many schedules of each preference are available. For instance, I want at least 170 of the schedules to have a preference layer of 1. I wrote the following line to do so:
subject to Preference1: sum {j in Schedules: layer[j]=1} >= L1Demand;
Where L1Demand is set to 170. However, when I go to include the model file in the ampl window, I get the following error:
syntax error
context:
subject to Preference1: sum {j in Schedules: layer[j]=1} >>> >= <<< L1Demand;
I don't understand why this is throwing a syntax error. I may be missing something very basic or obvious, but can anyone tell me why this would be happening? Thank you very much.
You should specify an argument for sum, for example:
sum {j in Schedules: layer[j]=1} x[j]
where x is some variable indexed over Schedules.
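As an analogy only (this is Python, not AMPL), the same rule applies to Python's sum: the filter says which indices to include, but there still has to be an expression to add up for each one. The data below are made up, and x stands for a hypothetical 0/1 selection variable indexed over the schedules, as in the answer.

```python
# Made-up data: 10 schedules with layers cycling through 1..4.
schedules = range(1, 11)
layer = {j: 1 + (j % 4) for j in schedules}

# Hypothetical decision values: 1 if schedule j is selected, else 0.
x = {j: 1 for j in schedules}

# The summand x[j] is what gets added; the condition only filters the indices.
layer1_count = sum(x[j] for j in schedules if layer[j] == 1)

L1_demand = 2  # stand-in for L1Demand in the question
meets_demand = layer1_count >= L1_demand
```

Leaving out x[j] would leave nothing to sum, which is what the AMPL parser reports when it reaches >= without having seen a summand.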

Resources