Adjusting the MAFFT command line algorithm to better account for gaps - bioinformatics

I've been attempting to use the MAFFT command line tool to identify coding regions within a genome. My general process is to align the amino acid consensus sequence of a gene to a translated reading frame of a target sequence. This method has been largely successful. However, I've noticed some peculiar alignments which will unfortunately impede my annotation method. The following is one such example (Note: I've also included a pairwise alignment from the Pairwise2 Biopython module to demonstrate my desired output. Unfortunately, Pairwise2 takes nearly 20 times longer to run than the MAFFT command line):
import io
import os
from time import time
from Bio import AlignIO
from Bio.SubsMat import MatrixInfo as matlist
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.Align.Applications import MafftCommandline
startTime = time()
sample_tList = [['>Frame 1', 'RIGVGSIPRHLYCQELPLAQPKTCCAETPFRDSPLQGRLGVCPHLASGVALLYGLSTPLTMSGILDRCTCTPNARVFMAEGQVYCTRCLSARSLLPLNLQVPELGVLGLFYRPEEPLRWTLPRAFPTVECSPAGACWLSAIFPIARMTSGNLNFQQRMVRVAAEIYRAGQLTPAVLKVLQVYERGCRWYPIVGPVPGVGVYANSLHVSDKPFPGATHVLTNLPLPQRPKPEDFCPFECAMADVYDIGHGAVMFVAGGKVSWAPRGGDEVRFETVPEELKLIANRLHISFPPHHLVDMSKFAFIVPGSGVSLRVEHQHGCLPADIVPKGNCWWCLFDLLPPGVQNREIRYANQFGYQTKHGVSGKYLQRRLQINGLRAVTDTHGPIVVQYFSVKESWIRHFRLAGEPSLPGFEDLLRIRVESNTSPLADKDEKIFRFGSHKWYGAGKRARKARSGATTTVAHRASSARETRQAKKHEGVDANNAAHLEHYSPPAEGNCGWHCISAIVNRMVNSNFETTLPERVRPSDDWATDEDFVNTIQILRLPAALDRNGACKSAKYVLKLEGEHWTVSVAPGMSPSLLPLECVQGCCEHKGGLGSPDAVEVSGFDPTCLDRLAEVMHLPSSVIPAALAEMSNNSDRPASLVNTAWTVSQFYARHTGGNHRDQVRLGKIISLCQVIEECCCHQNKTNRATPEEVAAKIDQYLRGATSLEECLIKLERVSPPSAADTSFDWNVVLPGVEAAGPTTEQPHANQCCAPVPVVTQEPLDKDSVPLTAFSLSNCYYPAQGDEVRHRERLNSVLSKLEEVVLEEYGLMPTGLGPRPVLPSGLDELKDQMEEDLLKLANAQATSEMMALAAEQVDLKAWVKSYPRWIPPPPPPKVQPRRMKPVKSLPENKPVPAPRRKVRSDPGKSILAVGGPLNFSTPSELVTPLGEPVLMPASQHVSRPVTPLSEPAPVPAPRRIVSRPMTPLSEPTFVFAPWRKSQQVEEANPAAATLTCQDEPLDLSASSQTEYEAYPLAPLENIGVLEAGGQEAEEVLSGISDILDNTNPAPVSSSSSLSSVKITRPKYSAQAIIDSGGPCSGHLQKEKEACLRIMREACDAARLGDPATQEWLSHMWDRVDVLTWRNTSVYQAFRTLDGRFGFLPKMILETPPPYPCGFVMLPHTPTPSVSAESDLTIGSVATEDVPRILGKTENTGNVLNQKPLALFEEEPVCDQPAKDSRTLSRESGDSTTAPPVGTGGAGLPTDLPPLDGVDADGGGLLRTAKGKAERFFDQLSRQVFNIVSHLPVFFSHLFKSDSGYSPGDWGFAAFTLFCLFLCYSYPFFGFAPLLGVFSGSSRRVRMGVFGCWLAFAVGLFKPVSDPVGAACEFDSPECRNILHSFELLKPWDPVRSLVVGPVGLGLAILGRLLGGARYIWHFLLRLGIVADCILAGAYVLSQGRCKKCWGSCIRTAPNEIAFNVFPFTRATRSSLIDLCDRFCAPKGMDPIFLATGWRGCWTGQSPIEQPSEKPIAFAQLDEKRITARTVVSQPYDPNQAVKCLRVLQAGGAMVAEAVPKVVKVSAIPFRAPFFPTGVKVDPECRIVVDPDTFTTALRSGYSTTNLVLGVGDFAQLNGLKIRQISKPSGGGPHLIAALHVACSMVLHMLAGVYVTAVGSCGTGTSDPWCANPFAVPGYGPGSLCTSRLCISQHGLTLPLTALVAGFGLQEIALVVLIFVSIGGMAHRLSCKADMLCILLAIASYVWVPLTWLLCVFPCWLRWFSLHPLTILWLVFFLISVNMPSGILAVVLLVSLWLLGRYTNIAGLVTPYDIHHYTSGPRGVAALATAPDGTYLAAVRRAALTGRTMLFTPSQLGSLLEGAFRTRKPSLNTVNVVGSSMGSGGVFTIDGRIKCVTAAHVLTGNSARVSGVGFNQMLDFDVKGDFAIADCPNWQGVAPKTQFCGDGWTGRAYWLTSSGVEPGVIGDGFAFCFT
ACGDSGSPVITEAGELVGVHTGSNKQGGGIVTRPSGQFCNVTPIKLSELSEFFAGPKVPLGDVKVGSHIIKDTSEVPSDLCALLAAKPELEGGLSTVQLLCVFFLLWRMMGHAWTPLVAVGFFILNEVLPAVLVRSVFSFGMFALSWLTPWSAQVLMIRLLTAALNRNRVSLIFYSLGAVTGFVADLATTQGHPLQAVMNLSTYAFLPRMMVVTSPVPAIACGVVHLLAIILYLFKYRCLHHVLVGDGAFSAAFFLRYFAEGKLREGVSQSCGMSHESLTGALAIKLSDEDLDFLTKWTDFKCFVSASNMRNAAGQFIEAAYAKALRIELAQLVQVDKVRGTLAKLEAFADTVAPQLSPGDIVVALGHTPVGSIFDLKVGSTKHTLQAIETRVLAGSKMTVARVVDPTPAPPPAPVPIPLPPKVLENGPNAWGGEDRLNKRKRRRMEAVGIFVMDGKKYQKFWDKNSGDVFYEEVHNSTDEWECLRAGDPADFDPETGIQCGHVTIEDKVYNVFTSPSGRRFLVPANPENRRIQWEAARLSVEQALGMMNVDGELTAKELEKLKRIIDKLQGLTKEQCLNCPPVAPAVVAAAWLLLRQRKNFTTGPSPDLTKWPVRLSRTRSSTTNIRLPNRLMVVLCSCAPLFLRLMSSPALMHLLSYLPATGRETLGLMARFGILRPRPPKRKSHLVRKYRLVTLGAVTHLKLVSLISCTLLGATLSGKEFYRIQGLETYLTEPPVTLEAQCMRLPASRPMLLRLMGVPSWPQPCPPVLSCMYRPFQRPSLIILILGLTALNSQSTVVRMLLGTSPNTICPPKALFCLEFFALCGSTCLPMWVSARPFIGLPLTLPRILWLEMGTDFQPRIFRASLKSTFCAHRLCEKTGKLLLLVPSRSSIVGRRRLGQYLALITLRWPTGQRVVLPRASKRHSTRPSPSEKTNLRNYILQFAGALKLILHPAIDPHLQLSAGSLPIFFMNSPVLKSIYRRTCLTAVTTYWLRSPARLREAACRLATRLPPCQTPFTAYMHSTWCSVTLKVVTLMAFCFCKTSSLRTCSRFNPSSIQTTSCCMPSLPPCQITTGGLNITLCVSKRTQRRQPQTRHHFVAGMGVSSLTVTGFLRPSPTIRQAMSLNTTPRRLQYLWTAVLVSMILSGLKSSWLVRSAPARTVTASQARRSSCPCGKNSGPIMKGRSPECAGTAEPRLRTPLPVASTSVLTTPISTSIVLSSGVATRRVLALVVSVNLPWEKAQVLWMRCNKSRISLRGLSCMWSRVSPLLTQVDTKLAADSPLGVASGETKLTCQTVIMPVPPCSPLVKRSTWSLSPPTCCAAGSSSVPPALGKHTGSSNRSRMVMSFTRQLTRPCLTLGLWGCAGSTSQRVRRCNSLPPLVPARGFASWPAVGVLVRIPFWTKQRIAITLMSGFLAKPPLPAEISNNSTRWVLTLIAMFLTSCLRPNRPSGDSDRISVMPSNQITGTNLCPWSTQPVPRWTNLSGMGKSSPPTTGTERTAPSLSTPVKVPHLMWLHCICPLKIHSTGNEPLLLSPGQDMQSSCMTHTGNCRACLIFLRKAHPSTSQCSVTSSSYIEITKNARLLRLAMEINSGLQTSALILSAPFVQIWKGRAPRSPKLHITWGSISHLIHSLLNSQQNSHPTGPWQPRTMKSGLIGWLPAFAPSINIAARALVQAIWWAPRCFAPQGLCHTTSQNLLGARLKCFLRQSSAPAELRIAGSTSMIGSEKLLSPSHMPSLATSKALPVGDVITSPPDTFRASFLRNQLRSGFLAPEKLQRQFAHQMCTSQILKRTSTQRPSPSAGKCWILEKSDWSGKTRRPIFNLKAAISPGINLQATPHTSEFLLILQCIWTPAWALPFATGGLLGPPIGELTSRSPLMITVPKSFCLVHTMVKCLQGTKFWRARSSRLTTQGTNTLGDLNRIQRICTSLLGMVRTGRIIMKRFGRARKGKFIRLLPPASFIFPRALSLNQLATEMKWGLCRASLTKLVNFLWMLSR
NFWCPLLISSYFWPFCLASPSPAGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGILWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISNLDVVAHFQHLAAIEAETYKYLASRLPMLHNLRMTGSNVTIVYNSTLNQVFAIFPTSGSRPRLHDSQQWLIAVHSSIFSSVVASCTLFVVLWLRIPMLRSVFGFRWLGAIFLLNSRITRCVRLASPGRPLLRSMNPVGLFGAGGMTDAVRTTMTNGSWFRLASAKATPVFTPGWRSCHSATRPSSIPRYLGGTVKFMLTSRTNSFAPSTTGRTPPCLAMTTFQPYFRPTTNIRSTAVIGFTNGCAPSFPLGWFMFRGFSGVRLQAMFQFKSFRHQDQHYRSIRLCCPPGHQLPVWRLAPSDGSQELSVPHGDRDTRVHHHHSQCHRELFTFFSPHAFLLPFLCFDEKGIQSGIWQCVRHRGCVCLYQLRPTCQGVHPTLLGSRSCATASFHDTDHEVGNRFSLSFCHPTGNLNVQVCWGNAPRAVTRNCFLCGVSCRSVLLCSSTPAATAALIFSFITRYVSMAQIGWQKDLTGQWRLLSFFLCLTLFPMEHSPPAIFLTRLVSLCPPPGSITGGMSVVSMRSVLWLRFASSLGLRRTACPGATLVLDTPTSFWTLRADSIVGGRPLLRKGVRLKSRVTSTSKELCLMVPWQPLPEFQRNNGVVSRRLLPHGSTKGAFGVFHYLYASDDICSKGKSRPTARASAPFDLPELCFYLRVHDIRALSEHKGRAHYGGSSCTSLGGVLSHRNLEIHHLQMPFVLARPQVHSGPCPPRRKCRGLSSDCGKPRICRPASRLHYGRHIGARVEKPRVGWQKSCTGSGKPCQICQITTASSKRERRGTASQSISCARCWVRSSPNKTSPEARDRGRKIIREARRSPIFLRLKKMSGTTSPLVSGNCVCRRSRLPLTRAPGHVPCQIQGGVTLWSLVCRRIILCASASQHHPQHDELAFFGHLGVMIGRMCGEWHLTLCLVTYSIRATVWGSLIGENHAAAIKKKKKKKK'], ['>ORF2_GP2', 'MKWGLCKASLTKLANFLWMLSRSFWCPLLISSYFWPFCLASQSPVGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGVLWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISGLDVVAHFQHLAAIEAETCKYLASRLPMLHNLRLTGSNVTIVYNSTLDQVFAIFPTPGSRPKLHDFQQWLIAVHSSIFSSVAASCTLFVVLWLRIPMLRSVFGFRWLGATFLLNSW']]
with open("newTempFile112233.fasta", "w") as ex_file:
    for items in sample_tList:
        ex_file.write(items[0] + "\n")
        ex_file.write(items[1] + "\n")
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
#mafft_cline = MafftCommandline(mafft_exe, input=in_file, localpair=True, lexp=-1.5, lop=0.5)
stdout, stderr = mafft_cline()
print(stdout)
test_align = AlignIO.read(io.StringIO(stdout), "fasta")
#print(test_align)
os.remove("newTempFile112233.fasta")
print('Total time = ' + str(time() - startTime))
startTime = time()
matrix = matlist.blosum62
pWise_align = pairwise2.align.localds(sample_tList[0][1], sample_tList[1][1], matrix, -6, -1)
print(format_alignment(*pWise_align[0]))
print('Total time = ' + str(time() - startTime))
I've attempted to change the MAFFT command line alignment algorithm by referencing the help document (http://mafft.cbrc.jp/alignment/software/manual/manual.html). I don't get any error messages, but the alignment output does not change. I'm unsure what adjustments need to be made. I believe that by increasing the gap extension penalty (which is zero by default), the alignment will be improved. I haven't been able to find many documentation examples where custom variables are used when using MAFFT command line on this forum or through Google search. Help is much appreciated. For reference, documentation on the Pairwise2 alignment parameters can be found here: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Managed to figure out a possible solution. The alignment of the example sequences provided results in a long terminal/end gap which should not be present. Changing the MAFFT alignment algorithm using localpair, lexp, and lop had no effect (causing me a good deal of confusion). However, I have noticed differences in the alignment output when each input sequence is reversed. Oddly, the only way I was able to remove the terminal/end gap was to set the lop (gap opening penalty) to a lesser amount relative to lexp (gap extension penalty). I suspect my solution is niche and may not be applicable to other similar occurrences of terminal gaps. Changing the alignment settings also likely reduces the optimal alignment.
Going forward, I plan to use an automated process to run alignments of consensus sequences to raw sequences. In the event I detect irregularities with the alignment output (specifically terminal gaps), I'll attempt to reverse the input sequences and apply custom alignment settings. I suppose if that isn't a consistent solution, I'll figure out a way to refine the alignment output directly.
For anyone curious, I used a lexp value of -1.5 and a lop value of 0.5 (now included in a commented-out line in my example code).
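The terminal-gap check mentioned above could be sketched roughly as follows. This is only an illustration: the helper names terminal_gaps and has_long_terminal_gap, the max_allowed threshold, and the sample string are hypothetical, and the aligned row is assumed to use "-" as the gap character, as in FASTA output.

```python
def terminal_gaps(aligned_seq, gap_char="-"):
    """Return (leading, trailing) counts of gap characters in one aligned row."""
    stripped_left = aligned_seq.lstrip(gap_char)
    leading = len(aligned_seq) - len(stripped_left)
    trailing = len(stripped_left) - len(stripped_left.rstrip(gap_char))
    return leading, trailing


def has_long_terminal_gap(aligned_seq, max_allowed, gap_char="-"):
    """True if either end of the row has more than max_allowed gap characters."""
    leading, trailing = terminal_gaps(aligned_seq, gap_char)
    return leading > max_allowed or trailing > max_allowed


# Hypothetical aligned row, not real MAFFT output.
example_row = "---MKWGLC--KASL-----"
print(terminal_gaps(example_row))
print(has_long_terminal_gap(example_row, max_allowed=4))
```

Which row to inspect and what threshold counts as "irregular" depend on the data. When a short consensus is aligned to a full reading frame, long terminal gaps in the consensus row are expected, so the check probably belongs on the target row or on a trimmed region.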

Related

Limiting BART HuggingFace Model to complete sentences of maximum length

I'm implementing BART on HuggingFace, see reference: https://huggingface.co/transformers/model_doc/bart.html
Here is the code from their documentation that works in creating a generated summary:
from transformers import BartModel, BartTokenizer, BartForConditionalGeneration
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
def baseBart(ARTICLE_TO_SUMMARIZE):
    inputs = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=1024, return_tensors='pt')
    # Generate Summary
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=25, early_stopping=True)
    return [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0]
I need to impose conciseness with my summaries so I am setting max_length=25. In doing so though, I'm getting incomplete sentences such as these two examples:
EX1: The opacity at the left lung base appears stable from prior exam.
There is elevation of the left hemidi
EX 2: There is normal mineralization and alignment. No fracture or
osseous lesion is identified. The ankle mort
How do I make sure that the predicted summary consists only of coherent, complete sentences and remains concise? If possible, I'd prefer not to run a regex on the summarized output and cut off any text after the last period, but rather have the BART model itself produce complete sentences within the maximum length.
I tried setting truncation=True in the model but that didn't work.
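For comparison, the post-processing fallback mentioned above (cutting off any text after the last sentence-ending punctuation) could look roughly like this. It is a sketch of that fallback only, not a way to make the model itself stop at sentence boundaries, and the function name trim_to_last_sentence is hypothetical.

```python
import re


def trim_to_last_sentence(text):
    """Drop any trailing fragment after the last '.', '!' or '?' that ends a sentence."""
    # A sentence end is punctuation followed by whitespace or the end of the string.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text  # no complete sentence found; leave the text unchanged
    return text[: matches[-1].end()]


summary = ("The opacity at the left lung base appears stable from prior exam. "
           "There is elevation of the left hemidi")
print(trim_to_last_sentence(summary))
```

A simple pattern like this will also cut at abbreviations such as "Dr." when they are followed by a space, so it is only a rough safeguard.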

Configure larger limit of columns on Format module

The Format module
The Format module is used to model and combine pretty printers with a syntactic extension that allows typed formats and it helps a lot when you are writing something like a code generator or a data structure printer.
The problem
However, there is a limit of 78 columns that is initialized on the margin of the formatter and will pull to the left anything that takes more than this limit.
I'm printing a lighter version of a Yojson.Basic.json program using the Format module, but when the input is too large, the output is collapsed, and that is not really "pretty".
Preview
Here is how it is formatted when it is short:
Here is how it is formatted when the indentation becomes too large
I've been trying to exceed and configure this limit to 120 columns, but didn't have any success.
What have I tried?
Using Format.pp_set_margin ppf 120 to reconfigure
Using Format.pp_set_max_indent to a larger value
But they don't seem to have any effect, and there is no easily available documentation about this limit. I discovered it only by reading the source code.
What am I doing?
let string_of_cst program =
  let ppf = Format.str_formatter in
  (* I've enabled colors. *)
  Format.pp_set_tags ppf colors;
  Format.pp_set_formatter_tag_functions ppf with_colors;
  (* [print_json] is my printer. *)
  print_json ppf program;
  (* Get string out of printer. *)
  Format.flush_str_formatter ()
How can I configure a larger limit?
The issue is that the values for margin and max_indent are implicitly constrained to satisfy 1 < max_indent < margin, and the function set_max_indent silently fails and does nothing if this constraint is not respected.
To avoid this issue in OCaml ≥ 4.08, you can use the new set_geometry function, which requires setting both values simultaneously and fails with an exception if the requested max_indent is greater than the margin.
Otherwise, you should always set both values at the same time, and always in this order: margin first, max_indent second. If you don't know which value to choose for max_indent, margin - 10 is generally an alright choice.

Save a figure to file with specific resolution

In an old version of my code, I used to call hardcopy() with a given resolution, i.e.:
frame = hardcopy(figHandle, ['-d' renderer], ['-r' num2str(round(pixelsperinch))]);
For reference, hardcopy saves a figure window to file.
Then I would typically perform:
ZZ = rgb2gray(frame) < 255/2;
se = strel('disk',diskSize);
ZZ2 = imdilate(ZZ,se); %perform dilation.
Surface = bwarea(ZZ2); %get estimated surface (in pixels)
This worked until I switched to Matlab 2017, in which the hardcopy() function is deprecated and we are left with the print() function instead.
I am unable to extract the data from the figure handle at a specific resolution using print. I've tried many things, including:
frame = print(figHandle, '-opengl', strcat('-r',num2str(round(pixelsperinch))));
But it doesn't work. How can I overcome this?
EDIT
I don't want to save or create a figure file; my aim is to extract the data from the figure in order to measure a surface after a dilation process. I just want to keep this information, and since I'm processing a LOT of different trajectories (approx. 1e7 in total), I don't want to save each file to disk (this is costly in terms of execution time). I'm running this code on a remote server (without a graphics card).
The issue I'm struggling with is: "One or more output arguments not assigned during call to "varargout"."
getframe() does not allow for setting a specific resolution (it uses current resolution instead as far as I know)
EDIT2
OK, I figured out how to do it: you need to pass the '-RGBImage' argument, like this:
frame = print(figHandle, ['-' renderer], ['-r' num2str(round(pixelsperinch))], '-RGBImage');
It also accepts a custom resolution and renderer, as specified in the documentation.
I think you must specify formattype too (-dtiff in my case). I've tried this in Matlab 2016b with no problem:
print(figHandle,'-dtiff', '-opengl', '-r600', 'nameofmyfig');
EDIT:
If you need the CData just find the handle of the corresponding axes and get its CData
f = findobj('Tag','mytag')
Then depending on your matlab version use:
mycdata = get(f,'CData');
or directly
mycdata = f.CData;
EDIT 2:
You can set the tag of your image programmatically and then do what I said previously:
a = imshow('peppers.png');
set(a,'Tag','mytag');

Is it easy to modify this Python code to use pandas, and would it help if I did?

I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine; however, it is very, very slow. A CSV I tried with 100 million lines took around 28 hours to complete. I did some googling, and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am pretty much a novice when it comes to Python, I am unsure whether using pandas would actually help at all and, if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file, it has 3 columns, first column is an IP address, second is a url and the third is a timestamp.
def parseCsvToDict(filepath):
    with open(filepath) as f:
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next() # skip header line
        for row in csv_data:
            if len(row) == 3: # Some lines in the csv have more/less than the 3 fields they should have, so this is a cheat to get the script working, ignoring any wrong data
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp) # convert each time to epoch
                if current_ip not in ip_dict.keys():
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip].keys():
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
    return(ip_dict)
Once the above function has finished, the data is passed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas may increase the speed and would it require a complete rewrite or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames, header=0)  # header=0 replaces the file's own header line with colnames
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Note that this assumes you have enough RAM. In your code, you are already assuming you have enough memory for the dictionary, but the latter may be significantly smaller than the memory used by the above, for two reasons.
If it is because most lines are dropped, then just parse the csv by chunks: arguments skiprows and nrows are your friends, and then pd.concat
If it is because IPs/URLs are repeated, then you will want to transform IPs and URLs from normal columns to indices: parse by chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
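The groupby hint can be illustrated with a small sketch. The sample data and the column names ip, url and timestamp are hypothetical, and the timestamps are assumed to be numeric epoch values already (the question's convert_time step is left out). Note that pandas' std() uses ddof=1 by default, whereas numpy.std uses ddof=0, so ddof=0 is passed explicitly to match the original calculation.

```python
import io

import pandas as pd

# Hypothetical sample data; timestamps are assumed to be epoch seconds.
csv_text = """ip,url,timestamp
10.0.0.1,/a,100
10.0.0.1,/a,104
10.0.0.1,/b,50
10.0.0.2,/a,10
10.0.0.2,/a,20
10.0.0.2,/a,30
"""

df = pd.read_csv(io.StringIO(csv_text))

# One standard deviation per (ip, url) pair.
# ddof=0 gives the population standard deviation, matching numpy.std's default.
std_per_pair = df.groupby(["ip", "url"])["timestamp"].std(ddof=0)
print(std_per_pair)
```

The result is a Series indexed by (ip, url), so a single pair can be looked up with std_per_pair.loc[("10.0.0.1", "/a")].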
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your numbers, you are reading 100000000 / 28 / 60 / 60, i.e. approximately 1000 lines per second. That is not really slow, but I believe that just reading such a big file can be a bottleneck.
So take a look at this performance comparison of how to read a huge file. Basically a guy suggests that doing this:
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass # do something
can give you something like a 3x read speedup. I also suggest trying defaultdict instead of your "if the key is not in the dict, create a list, otherwise append" logic.
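The defaultdict suggestion could be sketched as follows. This is a minimal illustration, not a drop-in replacement: the function name parse_rows and the sample data are hypothetical, timestamps are assumed to be numeric epoch values already (the question's convert_time step is omitted), and the sketch targets Python 3, using statistics.pstdev in place of numpy.std (both compute the population standard deviation by default). As a side note, in Python 2 a test like `key in d.keys()` builds a list and scans it, whereas `key in d` is a hash lookup.

```python
import csv
import io
import statistics
from collections import defaultdict


def parse_rows(lines):
    """Group timestamps by (ip, uri), skipping rows without exactly 3 fields."""
    grouped = defaultdict(list)  # missing keys start as an empty list
    reader = csv.reader(lines)
    next(reader, None)  # skip header line
    for row in reader:
        if len(row) != 3:
            continue  # ignore malformed rows, as the original code does
        ip, uri, timestamp = row
        grouped[(ip, uri)].append(float(timestamp))
    return grouped


# Hypothetical sample input, including one malformed row.
sample = io.StringIO(
    "ip,url,timestamp\n"
    "10.0.0.1,/a,100\n"
    "10.0.0.1,/a,104\n"
    "bad,row\n"
    "10.0.0.2,/b,7\n"
)
grouped = parse_rows(sample)
stds = {key: statistics.pstdev(values) for key, values in grouped.items()}
print(stds)
```

Keying on the (ip, uri) tuple avoids the nested dictionary and the two membership checks per row.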
And last, not related to python: working in data-analysis, I have found an amazing tool for working with csv/json. It is csvkit, which allows to manipulate csv data with ease.
In addition to what Salvador Dali said in his answer: If you want to keep as much of the current code of your script, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)

SPSS syntax for naming individual analyses in output file outline

I have created syntax in SPSS that gives me 90 separate iterations of general linear model, each with slightly different variations fixed factors and covariates. In the output file, they are all just named as "General Linear Model." I have to then manually rename each analysis in the output, and I want to find syntax that will add a more specific name to each result that will help me identify it out of the other 89 results (e.g. "General Linear Model - Males Only: Mean by Gender w/ Weight covariate").
This is an example of one analysis from the syntax:
USE ALL.
COMPUTE filter_$=(Muscle = "BICEPS" & Subj = "S1" & SMU = 1 ).
VARIABLE LABELS filter_$ 'Muscle = "BICEPS" & Subj = "S1" & SMU = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0). FILTER BY filter_$.
EXECUTE.
GLM Frequency_Wk6 Frequency_Wk9
Frequency_Wk12 Frequency_Wk16
Frequency_Wk20
/WSFACTOR=Time 5 Polynomial
/METHOD=SSTYPE(3)
/PLOT=PROFILE(Time)
/EMMEANS=TABLES(Time)
/CRITERIA=ALPHA(.05)
/WSDESIGN=Time.
I am looking for syntax to add to this that will name this analysis as: "S1, SMU1 BICEPS, GLM" Not to name the whole output file, but each analysis within the output so I don't have to do it one-by-one. I have over 200 iterations at times that come out in a single output file, and renaming them individually within the output file is taking too much time.
Making an assumption that you are exporting the models to Excel (please clarify otherwise).
There is an undocumented command (OUTPUT COMMENT TEXT) that you can use here. There is also a custom extension, TEXT, designed to achieve the same thing, but it needs to be explicitly downloaded via:
Utilities-->Extension Bundles-->Download And Install Extension Bundles--->TEXT
You can use OUTPUT COMMENT TEXT to assign a title/descriptive text just before the output of the GLM model (in the example below I have used FREQUENCIES as an example).
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
oms /select all /if commands=['output comment' 'frequencies'] subtypes=['comment' 'frequencies']
/destination format=xlsx outfile='C:\Temp\ExportOutput.xlsx' /tag='ExportOutput'.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - jobcat".
freq jobcat.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - gender".
freq gender.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - minority".
freq minority.
omsend tag=['ExportOutput'].
You could use TITLE command here also but it is limited to only 60 characters.
You would have to change the OMS tags appropriately if using TITLE or TEXT.
Edit:
Given the OP wants to actually add a title to the left hand pane in the output viewer, a solution for this is as follows (credit to Albert-Jan Roskam for the Python code):
First, save the Python file "editTitles.py" to a valid Python search path (in my case, for example: "C:\ProgramData\IBM\SPSS\Statistics\23\extensions").
#editTitles.py
import tempfile, os, sys
import SpssClient

def _titleToPane():
    """See titleToPane(). This function does the actual job"""
    outputDoc = SpssClient.GetDesignatedOutputDoc()
    outputItemList = outputDoc.GetOutputItems()
    textFormat = SpssClient.DocExportFormat.SpssFormatText
    filename = tempfile.mktemp() + ".txt"
    for index in range(outputItemList.Size()):
        outputItem = outputItemList.GetItemAt(index)
        if outputItem.GetDescription() == u"Page Title":
            # Export the title text to a temporary file and read it back
            outputItem.ExportToDocument(filename, textFormat)
            with open(filename) as f:
                outputItem.SetDescription(f.read().rstrip())
            os.remove(filename)
    return outputDoc

def titleToPane(spv=None):
    """Copy the contents of the TITLE command of the designated output document
    to the left output viewer pane"""
    try:
        outputDoc = None
        SpssClient.StartClient()
        if spv:
            SpssClient.OpenOutputDoc(spv)
        outputDoc = _titleToPane()
        if spv and outputDoc:
            outputDoc.SaveAs(spv)
    except:
        print("Error filling TITLE in Output Viewer [%s]" % sys.exc_info()[1])
    finally:
        SpssClient.StopClient()
Re-start SPSS Statistics and run below as a test:
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
title="##Model##: jobcat".
freq jobcat.
title="##Model##: gender".
freq gender.
title="##Model##: minority".
freq minority.
begin program.
import editTitles
editTitles.titleToPane()
end program.
The TITLE command will initially add a title to the main output viewer (right-hand side), but the Python code will then transfer that text to the output tree structure in the left-hand pane. As mentioned already, TITLE is capped at 60 characters; a warning will be triggered to highlight this.
This editTitles.py approach is the closest you are going to get to include a descriptive title to identify each model. To replace the actual title "General Linear Model." with a custom title would require scripting knowledge and would involve a lot more code. This is a simpler alternative approach. Python integration required for this to work.
Also consider using:
SPLIT FILE SEPARATE BY <list of filter variables>.
This will automatically produce filter labels in the left hand pane.
This is easy to use for mutually exclusive filters but even if you have overlapping filters you can re-run multiple times (and have filters applied to get as close to your desired set of results).
For example:
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
sort cases by jobcat minority.
split file separate by jobcat minority.
freq educ.
split file off.
