Native Vim Random number script - random

I know that there are various ways to get random numbers, eg, from the shell. However, I'm running vim on an android phone with very little compiled in. Also, it does not have to be rigorously random. The point is, what's an interesting, or concise, or fast (that is, with vim native functions), or short way to get a sequence of reasonably good random numbers in Vim?

Try something like
function Rand()
    return str2nr(matchstr(reltimestr(reltime()), '\v\.@<=\d+')[1:])
endfunction
I know of no better option than using one of the time functions (there are two of them: reltime() and localtime(), but the latter is updated only once per second). I would prefer either to avoid random numbers or to use pyeval('random.randint(1, 10)') (preceded by :python import random), because the shell is slow and I don't trust time-based solutions.
Note: the documentation says that the format of the item returned by reltime() depends on the system, so I am using reltimestr() rather than doing something with reltime()[1], which looks as if it contains nanoseconds.

I've recently played around with random numbers in Vim script myself. Here are some resources that I found in the process.
No Vim script
By all means, use an external random number generator if you can. As a rule, they are better and faster than anything that could be done in Vim script.
For example, try
:python import random; print random.randrange(1, 7)
:echo system('echo $RANDOM')
another scripting language, for example Ruby
Libraries
Vim script libraries. These hopefully strive to provide decent quality RNG implementations.
vital.vim is an excellent and comprehensive library created by the vim-jp user group. Their random number generator sports an impressive array of functionality and is the best pure Vim script RNG I know of. vital.vim uses an Xorshift algorithm. Check it out!
Rolling a die with vital.vim:
let Random = vital#of('vital').import('Random')
echo Random.range(1, 7)
vim-rng is a small random number generator plugin. It exports a couple of global functions that rely on a multiply-with-carry algorithm. This project seems to be a work in progress.
Rolling a die with rng:
echo RandomNumber(1, 6)
magnum.vim is my own little big integer library. I've recently added a random number generator that generates integers of any size. It uses the XORSHIFT-ADD algorithm.
Rolling a die with magnum.vim:
let six = magnum#Int(6)
echo magnum#random#NextInt(six).Add(magnum#ONE).Number()
Rndm has been around for much longer than the other libraries. Its functionality is exposed as a couple of global functions. Rolling a die with Rndm:
echo Urndm(1, 6)
Discussion and snippets
Finally, a few links to insightful discussion and Vim script snippets.
ZyX's reltime snippet on this page.
loreb's vimprng project on GitHub has an impressive number of RNG implementations in Vim script. Very useful.
This old mailing list discussion has a couple of Vim script snippets. The first one given by Bee-9 is limited to 16 bit but I found it quite effective. Here it is:
let g:rnd = localtime() % 0x10000
function! Random(n) abort
    let g:rnd = (g:rnd * 31421 + 6927) % 0x10000
    return g:rnd * a:n / 0x10000
endfunction
Another script, found in a person named Bart's personal config files.
Episode 57 on Vimcasts.org discusses Vim's 'expression register' and refers to random number examples throughout. Refers to this Stackoverflow question and ZyX's snippet. Recommended.
The Vim wiki on wikia has an article 'Jump to a random line' that has a few resources not mentioned yet.
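As a rough illustration of the 16-bit generator by Bee-9 quoted above, here is a Python sketch of the same recurrence. The class name `Lcg16`, the helper `period` and the fixed seeds are illustrative assumptions only; the constants 31421 and 6927 are taken from the snippet. Since the multiplier is congruent to 1 modulo 4 and the increment is odd, the state should run through all 65536 values before repeating, which `period()` checks by brute force.

```python
class Lcg16:
    """Python port of the 16-bit linear congruential snippet (illustrative)."""

    MODULUS = 0x10000  # 2**16, same modulus as in the Vim script version

    def __init__(self, seed):
        # The Vim snippet seeds from localtime(); a caller-supplied seed is
        # used here so that the sequence is reproducible.
        self.state = seed % self.MODULUS

    def random(self, n):
        """Return an integer in range(n), mirroring Random(n)."""
        self.state = (self.state * 31421 + 6927) % self.MODULUS
        return self.state * n // self.MODULUS


def period(seed=0):
    """Count the steps until the internal state returns to its start."""
    gen = Lcg16(seed)
    start = gen.state
    steps = 0
    while True:
        gen.random(1)
        steps += 1
        if gen.state == start:
            return steps


if __name__ == "__main__":
    die = Lcg16(seed=12345)
    print([die.random(6) + 1 for _ in range(10)])  # ten die rolls
    print(period())  # length of the state cycle
```

A cycle of 65536 states is fine for casual use such as rolling dice, but consecutive outputs are fully determined by the 16-bit state, so the sequence is easy to predict.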

Based on others' answers and other resources from the internet, I have written
two functions to generate a random integer in a given range [Low, High].
Both functions take two arguments, Low and High, and return a
random number in this range.
Combine Python and Vim script
The first function combines Python and Vim script.
" generate a random integer from range [Low, High] using Python
function! RandInt(Low, High) abort
" if you use Python 3, the python block should start with `python3` instead of
" `python`, see https://github.com/neovim/neovim/issues/9927
python3 << EOF
import vim
import random
# using vim.eval to import variable outside Python script to python
idx = random.randint(int(vim.eval('a:Low')), int(vim.eval('a:High')))
# using vim.command to export variable inside Python script to vim script so
# we can return its value in vim script
vim.command("let index = {}".format(idx))
EOF
return index
endfunction
Pure Vim script
The second function I propose uses pure vim script:
function! RandInt(Low, High) abort
    " the fractional part printed by reltimestr() is in microseconds
    let l:usec = str2nr(matchstr(reltimestr(reltime()), '\v\.\zs\d+'))
    return l:usec % (a:High - a:Low + 1) + a:Low
endfunction
Use luaeval() (Neovim only)
The third way to generate a random number is to use Lua via luaeval().
" math.randomseed() is needed to make the random() function generate different numbers
" on each use. Otherwise, the first number it generates seems to be the same every time.
call luaeval('math.randomseed(os.time())')
let num = luaeval('math.random(1, 10)')
If you want to generate random numbers for non-serious purposes, you may use
these methods as a starting point.


Adjusting the MAFFT command line algorithm to better account for gaps

I've been attempting to use the MAFFT command line tool as a means to identify coding regions within a genome. My general process is to align the amino acid consensus sequence of a gene to a translated reading frame of a target sequence. My method has been largely successful. However, I've noticed some peculiar alignments which will unfortunately impede my annotation method. The following is one such example (Note - I've also included a pairwise alignment from the Pairwise2 Biopython module to demonstrate my desired output. Unfortunately, the computation time for Pairwise2 is nearly 20 times slower than MAFFT command line):
import io
import os
from time import time
from Bio import AlignIO
from Bio.SubsMat import MatrixInfo as matlist
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from Bio.Align.Applications import MafftCommandline
startTime = time()
sample_tList = [['>Frame 1', 'RIGVGSIPRHLYCQELPLAQPKTCCAETPFRDSPLQGRLGVCPHLASGVALLYGLSTPLTMSGILDRCTCTPNARVFMAEGQVYCTRCLSARSLLPLNLQVPELGVLGLFYRPEEPLRWTLPRAFPTVECSPAGACWLSAIFPIARMTSGNLNFQQRMVRVAAEIYRAGQLTPAVLKVLQVYERGCRWYPIVGPVPGVGVYANSLHVSDKPFPGATHVLTNLPLPQRPKPEDFCPFECAMADVYDIGHGAVMFVAGGKVSWAPRGGDEVRFETVPEELKLIANRLHISFPPHHLVDMSKFAFIVPGSGVSLRVEHQHGCLPADIVPKGNCWWCLFDLLPPGVQNREIRYANQFGYQTKHGVSGKYLQRRLQINGLRAVTDTHGPIVVQYFSVKESWIRHFRLAGEPSLPGFEDLLRIRVESNTSPLADKDEKIFRFGSHKWYGAGKRARKARSGATTTVAHRASSARETRQAKKHEGVDANNAAHLEHYSPPAEGNCGWHCISAIVNRMVNSNFETTLPERVRPSDDWATDEDFVNTIQILRLPAALDRNGACKSAKYVLKLEGEHWTVSVAPGMSPSLLPLECVQGCCEHKGGLGSPDAVEVSGFDPTCLDRLAEVMHLPSSVIPAALAEMSNNSDRPASLVNTAWTVSQFYARHTGGNHRDQVRLGKIISLCQVIEECCCHQNKTNRATPEEVAAKIDQYLRGATSLEECLIKLERVSPPSAADTSFDWNVVLPGVEAAGPTTEQPHANQCCAPVPVVTQEPLDKDSVPLTAFSLSNCYYPAQGDEVRHRERLNSVLSKLEEVVLEEYGLMPTGLGPRPVLPSGLDELKDQMEEDLLKLANAQATSEMMALAAEQVDLKAWVKSYPRWIPPPPPPKVQPRRMKPVKSLPENKPVPAPRRKVRSDPGKSILAVGGPLNFSTPSELVTPLGEPVLMPASQHVSRPVTPLSEPAPVPAPRRIVSRPMTPLSEPTFVFAPWRKSQQVEEANPAAATLTCQDEPLDLSASSQTEYEAYPLAPLENIGVLEAGGQEAEEVLSGISDILDNTNPAPVSSSSSLSSVKITRPKYSAQAIIDSGGPCSGHLQKEKEACLRIMREACDAARLGDPATQEWLSHMWDRVDVLTWRNTSVYQAFRTLDGRFGFLPKMILETPPPYPCGFVMLPHTPTPSVSAESDLTIGSVATEDVPRILGKTENTGNVLNQKPLALFEEEPVCDQPAKDSRTLSRESGDSTTAPPVGTGGAGLPTDLPPLDGVDADGGGLLRTAKGKAERFFDQLSRQVFNIVSHLPVFFSHLFKSDSGYSPGDWGFAAFTLFCLFLCYSYPFFGFAPLLGVFSGSSRRVRMGVFGCWLAFAVGLFKPVSDPVGAACEFDSPECRNILHSFELLKPWDPVRSLVVGPVGLGLAILGRLLGGARYIWHFLLRLGIVADCILAGAYVLSQGRCKKCWGSCIRTAPNEIAFNVFPFTRATRSSLIDLCDRFCAPKGMDPIFLATGWRGCWTGQSPIEQPSEKPIAFAQLDEKRITARTVVSQPYDPNQAVKCLRVLQAGGAMVAEAVPKVVKVSAIPFRAPFFPTGVKVDPECRIVVDPDTFTTALRSGYSTTNLVLGVGDFAQLNGLKIRQISKPSGGGPHLIAALHVACSMVLHMLAGVYVTAVGSCGTGTSDPWCANPFAVPGYGPGSLCTSRLCISQHGLTLPLTALVAGFGLQEIALVVLIFVSIGGMAHRLSCKADMLCILLAIASYVWVPLTWLLCVFPCWLRWFSLHPLTILWLVFFLISVNMPSGILAVVLLVSLWLLGRYTNIAGLVTPYDIHHYTSGPRGVAALATAPDGTYLAAVRRAALTGRTMLFTPSQLGSLLEGAFRTRKPSLNTVNVVGSSMGSGGVFTIDGRIKCVTAAHVLTGNSARVSGVGFNQMLDFDVKGDFAIADCPNWQGVAPKTQFCGDGWTGRAYWLTSSGVEPGVIGDGFAFCFT
ACGDSGSPVITEAGELVGVHTGSNKQGGGIVTRPSGQFCNVTPIKLSELSEFFAGPKVPLGDVKVGSHIIKDTSEVPSDLCALLAAKPELEGGLSTVQLLCVFFLLWRMMGHAWTPLVAVGFFILNEVLPAVLVRSVFSFGMFALSWLTPWSAQVLMIRLLTAALNRNRVSLIFYSLGAVTGFVADLATTQGHPLQAVMNLSTYAFLPRMMVVTSPVPAIACGVVHLLAIILYLFKYRCLHHVLVGDGAFSAAFFLRYFAEGKLREGVSQSCGMSHESLTGALAIKLSDEDLDFLTKWTDFKCFVSASNMRNAAGQFIEAAYAKALRIELAQLVQVDKVRGTLAKLEAFADTVAPQLSPGDIVVALGHTPVGSIFDLKVGSTKHTLQAIETRVLAGSKMTVARVVDPTPAPPPAPVPIPLPPKVLENGPNAWGGEDRLNKRKRRRMEAVGIFVMDGKKYQKFWDKNSGDVFYEEVHNSTDEWECLRAGDPADFDPETGIQCGHVTIEDKVYNVFTSPSGRRFLVPANPENRRIQWEAARLSVEQALGMMNVDGELTAKELEKLKRIIDKLQGLTKEQCLNCPPVAPAVVAAAWLLLRQRKNFTTGPSPDLTKWPVRLSRTRSSTTNIRLPNRLMVVLCSCAPLFLRLMSSPALMHLLSYLPATGRETLGLMARFGILRPRPPKRKSHLVRKYRLVTLGAVTHLKLVSLISCTLLGATLSGKEFYRIQGLETYLTEPPVTLEAQCMRLPASRPMLLRLMGVPSWPQPCPPVLSCMYRPFQRPSLIILILGLTALNSQSTVVRMLLGTSPNTICPPKALFCLEFFALCGSTCLPMWVSARPFIGLPLTLPRILWLEMGTDFQPRIFRASLKSTFCAHRLCEKTGKLLLLVPSRSSIVGRRRLGQYLALITLRWPTGQRVVLPRASKRHSTRPSPSEKTNLRNYILQFAGALKLILHPAIDPHLQLSAGSLPIFFMNSPVLKSIYRRTCLTAVTTYWLRSPARLREAACRLATRLPPCQTPFTAYMHSTWCSVTLKVVTLMAFCFCKTSSLRTCSRFNPSSIQTTSCCMPSLPPCQITTGGLNITLCVSKRTQRRQPQTRHHFVAGMGVSSLTVTGFLRPSPTIRQAMSLNTTPRRLQYLWTAVLVSMILSGLKSSWLVRSAPARTVTASQARRSSCPCGKNSGPIMKGRSPECAGTAEPRLRTPLPVASTSVLTTPISTSIVLSSGVATRRVLALVVSVNLPWEKAQVLWMRCNKSRISLRGLSCMWSRVSPLLTQVDTKLAADSPLGVASGETKLTCQTVIMPVPPCSPLVKRSTWSLSPPTCCAAGSSSVPPALGKHTGSSNRSRMVMSFTRQLTRPCLTLGLWGCAGSTSQRVRRCNSLPPLVPARGFASWPAVGVLVRIPFWTKQRIAITLMSGFLAKPPLPAEISNNSTRWVLTLIAMFLTSCLRPNRPSGDSDRISVMPSNQITGTNLCPWSTQPVPRWTNLSGMGKSSPPTTGTERTAPSLSTPVKVPHLMWLHCICPLKIHSTGNEPLLLSPGQDMQSSCMTHTGNCRACLIFLRKAHPSTSQCSVTSSSYIEITKNARLLRLAMEINSGLQTSALILSAPFVQIWKGRAPRSPKLHITWGSISHLIHSLLNSQQNSHPTGPWQPRTMKSGLIGWLPAFAPSINIAARALVQAIWWAPRCFAPQGLCHTTSQNLLGARLKCFLRQSSAPAELRIAGSTSMIGSEKLLSPSHMPSLATSKALPVGDVITSPPDTFRASFLRNQLRSGFLAPEKLQRQFAHQMCTSQILKRTSTQRPSPSAGKCWILEKSDWSGKTRRPIFNLKAAISPGINLQATPHTSEFLLILQCIWTPAWALPFATGGLLGPPIGELTSRSPLMITVPKSFCLVHTMVKCLQGTKFWRARSSRLTTQGTNTLGDLNRIQRICTSLLGMVRTGRIIMKRFGRARKGKFIRLLPPASFIFPRALSLNQLATEMKWGLCRASLTKLVNFLWMLSR
NFWCPLLISSYFWPFCLASPSPAGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGILWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISNLDVVAHFQHLAAIEAETYKYLASRLPMLHNLRMTGSNVTIVYNSTLNQVFAIFPTSGSRPRLHDSQQWLIAVHSSIFSSVVASCTLFVVLWLRIPMLRSVFGFRWLGAIFLLNSRITRCVRLASPGRPLLRSMNPVGLFGAGGMTDAVRTTMTNGSWFRLASAKATPVFTPGWRSCHSATRPSSIPRYLGGTVKFMLTSRTNSFAPSTTGRTPPCLAMTTFQPYFRPTTNIRSTAVIGFTNGCAPSFPLGWFMFRGFSGVRLQAMFQFKSFRHQDQHYRSIRLCCPPGHQLPVWRLAPSDGSQELSVPHGDRDTRVHHHHSQCHRELFTFFSPHAFLLPFLCFDEKGIQSGIWQCVRHRGCVCLYQLRPTCQGVHPTLLGSRSCATASFHDTDHEVGNRFSLSFCHPTGNLNVQVCWGNAPRAVTRNCFLCGVSCRSVLLCSSTPAATAALIFSFITRYVSMAQIGWQKDLTGQWRLLSFFLCLTLFPMEHSPPAIFLTRLVSLCPPPGSITGGMSVVSMRSVLWLRFASSLGLRRTACPGATLVLDTPTSFWTLRADSIVGGRPLLRKGVRLKSRVTSTSKELCLMVPWQPLPEFQRNNGVVSRRLLPHGSTKGAFGVFHYLYASDDICSKGKSRPTARASAPFDLPELCFYLRVHDIRALSEHKGRAHYGGSSCTSLGGVLSHRNLEIHHLQMPFVLARPQVHSGPCPPRRKCRGLSSDCGKPRICRPASRLHYGRHIGARVEKPRVGWQKSCTGSGKPCQICQITTASSKRERRGTASQSISCARCWVRSSPNKTSPEARDRGRKIIREARRSPIFLRLKKMSGTTSPLVSGNCVCRRSRLPLTRAPGHVPCQIQGGVTLWSLVCRRIILCASASQHHPQHDELAFFGHLGVMIGRMCGEWHLTLCLVTYSIRATVWGSLIGENHAAAIKKKKKKKK'], ['>ORF2_GP2', 'MKWGLCKASLTKLANFLWMLSRSFWCPLLISSYFWPFCLASQSPVGWWSFASDWFAPRYSVRALPFTLSNYRRSYEAFLSQCQVDIPTWGVKHPLGVLWHHKVSTLIDEMVSRRMYRIMEKAGQAAWKQVVSEATLSRISGLDVVAHFQHLAAIEAETCKYLASRLPMLHNLRLTGSNVTIVYNSTLDQVFAIFPTPGSRPKLHDFQQWLIAVHSSIFSSVAASCTLFVVLWLRIPMLRSVFGFRWLGATFLLNSW']]
ex_file = open("newTempFile112233.fasta", "w")
for items in sample_tList:
    ex_file.write(items[0] + "\n")
    ex_file.write(items[1] + "\n")
ex_file.close()
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
#mafft_cline = MafftCommandline(mafft_exe, input=in_file, localpair=True, lexp=-1.5, lop=0.5)
stdout, stderr = mafft_cline()
print(stdout)
test_align = AlignIO.read(io.StringIO(stdout), "fasta")
#print(test_align)
os.remove("newTempFile112233.fasta")
print('Total time = ' + str(time() - startTime))
startTime = time()
matrix = matlist.blosum62
pWise_align = pairwise2.align.localds(sample_tList[0][1], sample_tList[1][1], matrix, -6, -1)
print(format_alignment(*pWise_align[0]))
print('Total time = ' + str(time() - startTime))
I've attempted to change the MAFFT command line alignment algorithm by referencing the help document (http://mafft.cbrc.jp/alignment/software/manual/manual.html). I don't get any error messages, but the alignment output does not change. I'm unsure what adjustments need to be made. I believe that by increasing the gap extension penalty (which is zero by default), the alignment will be improved. I haven't been able to find many documentation examples where custom variables are used when using MAFFT command line on this forum or through Google search. Help is much appreciated. For reference, documentation on the Pairwise2 alignment parameters can be found here: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
Managed to figure out a possible solution. The alignment of the example sequences provided results in a long terminal/end gap which should not be present. Changing the MAFFT alignment algorithm using localpair, lexp, and lop had no effect (causing me a good deal of confusion). However, I have noticed differences in the alignment output when each input sequence is reversed. Oddly, the only way I was able to remove the terminal/end gap was to set the lop (gap opening penalty) to a lesser amount relative to lexp (gap extension penalty). I suspect my solution is niche and may not be applicable to other similar occurrences of terminal gaps. Changing the alignment settings also likely reduces the optimal alignment.
Going forward, I plan to use an automated process to run alignments of consensus sequences to raw sequences. In the event I detect irregularities with the alignment output (specifically terminal gaps), I'll attempt to reverse the input sequences and apply custom alignment settings. I suppose if that isn't a consistent solution, I'll figure out a way to refine the alignment output directly.
For anyone curious, I used a lexp value of -1.5 and a lop value of 0.5 (now included in a commented-out line in my example code).
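The plan of detecting terminal gaps automatically could start from a small helper like the sketch below. The function name `gap_profile` is hypothetical, and the code assumes that each aligned row is available as a plain string with `-` as the gap character (for example `str(record.seq)` for each record of the alignment parsed by AlignIO). What counts as an "irregular" terminal gap is not specified above, so any threshold applied to these counts is left to the caller.

```python
def gap_profile(aligned_row, gap="-"):
    """Return (leading, trailing, internal) gap counts for one aligned row.

    A row made up entirely of gap characters is reported as all-leading.
    """
    if aligned_row.strip(gap) == "":
        return len(aligned_row), 0, 0
    leading = len(aligned_row) - len(aligned_row.lstrip(gap))
    trailing = len(aligned_row) - len(aligned_row.rstrip(gap))
    internal = aligned_row.count(gap) - leading - trailing
    return leading, trailing, internal


if __name__ == "__main__":
    # Toy rows, not taken from the MAFFT output discussed above.
    for row in ["--MKWG-LC---", "MKWGLC", "------"]:
        print(row, gap_profile(row))
```

Comparing the leading and trailing counts between the forward and reversed runs would be one way to decide which alignment to keep.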

Editing thickness in postscript (.ps or .eps) figures via unix shell commands?

I have many figures (graphs) in PostScript (.eps) format whose plotted lines I wish to thicken.
I found the following code, but the output file is no different from the input. I was wondering what I was doing wrong.
The code:
# get list of all arguments
set args = ($*)
# if not enough arguments, complain.
if ($#args < 2) then
    echo "Usage: ps_thicken ps_file factor"
    echo "Thickens all lines in a PostScript file by changing the linewidth macro."
    echo "Result goes to standard output."
    exit 1
endif
sed -e "s/^\/lw {\(.*\) div setlinewidth/\/lw {$2 mul \1 div setlinewidth/" $1
Now to execute this from my command line, I use the command (filename is ps_thicken, and has appropriate permissions):
./ps_thicken old_file.eps 10 > new_thick_file.eps
I thought this would make everything 10x thicker, but it doesn't change anything.
Any help would be greatly appreciated; I'm pretty new to shell scripting!
PostScript is a programming language, so it isn't really possible to make changes in an automated fashion like this. At least not without writing a PostScript program to do so!
Note that setlinewidth isn't a 'macro' (PostScript doesn't have macros); it's an operator. What the sed code you've posted does (if I recall sed well enough) is look for the definition of /lw and replace it with a modified version. The problem with that is that /lw is a function declaration in one particular PostScript program. Most PostScript programs won't have (or use) a function called 'lw'.
You would be much better off prepending something like this to the PostScript program code:
/oldsetlinewidth /setlinewidth load def
/setlinewidth {2 div oldsetlinewidth} bind def
That will define (in the current dictionary) a function called 'setlinewidth'. Now, if the following program simply uses the current definition of setlinewidth when creating its own functions, it will use the redefined one above, which in this case has the effect of dividing all line widths by 2. Obviously, to increase the width you would use something like 2 mul instead of 2 div.
Note that this is by no means foolproof: it's entirely possible for a PostScript program to explicitly load the definition of setlinewidth from systemdict, and you can't replace that (at least not easily) because systemdict is read-only.
However, it's unlikely that an EPS program would pull such tricks, so that should probably work well enough for you.
[based on comments]
Hmm, do you mean 'failed to import' into an application, or something else?
If you're loading the EPS into an application then simply putting that code in front of it will break it. EPS (unlike PostScript) is required to follow some rules, so to modify it successfully you will have to follow them. This includes skipping over any EPS preview.
This is not really a trivial exercise. Your best bet is probably to run the files through Ghostscript; you can do a lot by harnessing a PostScript interpreter to do the work.
Start with the 2 lines of PostScript above in a file, then run the EPS file you want to 'modify' through Ghostscript, using the eps2write device. That will produce a new EPS which has the changes 'baked in'.
Eg (assuming the linewidth modifying code is in 'lw.ps'):
gs -sDEVICE=eps2write -o out.eps lw.ps file.eps
But be aware that the resulting EPS is a completely rewritten program and will bear no relation to the original. In particular any preview thumbnail will be lost.

is it easy to modify this python code to use pandas and would it help if i did?

I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations. It works absolutely fine, but it is very, very slow. A CSV I tried with 100 million lines took around 28 hours to complete. I did some googling and it appears that using the pandas module might make this quicker.
I have posted part of the code below. Since I am pretty much a novice when it comes to Python, I am unsure whether using pandas would actually help at all and, if it did, whether the function would need to be completely rewritten.
Just some context for the CSV file, it has 3 columns, first column is an IP address, second is a url and the third is a timestamp.
def parseCsvToDict(filepath):
    with open(filepath) as f:
        ip_dict = dict()
        csv_data = csv.reader(f)
        f.next() # skip header line
        for row in csv_data:
            if len(row) == 3: # Some lines in the csv have more/fewer than the 3 fields they should have, so this is a cheat to get the script working by ignoring any wrong data
                current_ip, URI, current_timestamp = row
                epoch_time = convert_time(current_timestamp) # convert each time to epoch
                if current_ip not in ip_dict.keys():
                    ip_dict[current_ip] = dict()
                if URI not in ip_dict[current_ip].keys():
                    ip_dict[current_ip][URI] = list()
                ip_dict[current_ip][URI].append(epoch_time)
        return(ip_dict)
Once the above function has finished the data is parsed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas may increase the speed and would it require a complete rewrite or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames)
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Note that this assumes you have enough RAM. In your code, you are already assuming you have enough memory for the dictionary, but the latter may be significantly smaller than the memory used by the above, for two reasons.
If it is because most lines are dropped, then just parse the CSV in chunks: the arguments skiprows and nrows are your friends, followed by pd.concat.
If it is because IPs/URLs are repeated, then you will want to transform IPs and URLs from normal columns to indices: parse in chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your data, you read 100000000. / 28 / 60 / 60, or approximately 1000 lines per second. That is not really slow, but I believe that just reading such a big file can cause a problem.
So take a look at this performance comparison of how to read a huge file. Basically, the author suggests that doing this:
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass # do something
can give you roughly a 3x read speedup. I also suggest you try defaultdict instead of your "if the key is not in the dict, create an empty list, then append" pattern.
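The defaultdict suggestion could look roughly like the sketch below. It is written for Python 3 rather than the 2.7 of the question, the function name `group_times` is made up for illustration, and it assumes the timestamps are already numeric (the question's convert_time step is not shown, so `float()` stands in for it). It also keys on an (ip, uri) tuple instead of nesting dictionaries, which is a design choice of this sketch and not something taken from the question.

```python
import csv
import io
from collections import defaultdict
from statistics import pstdev


def group_times(rows):
    """Group timestamps by (ip, uri), skipping rows without exactly 3 fields."""
    grouped = defaultdict(list)
    for row in rows:
        if len(row) != 3:
            continue  # malformed line, ignored as in the original script
        ip, uri, timestamp = row
        # float() is a stand-in for the question's convert_time()
        grouped[(ip, uri)].append(float(timestamp))
    return grouped


if __name__ == "__main__":
    sample = io.StringIO(
        "ip,url,time\n"
        "10.0.0.1,/a,100\n"
        "10.0.0.1,/a,104\n"
        "10.0.0.2,/b,50\n"
        "broken,line\n"
    )
    reader = csv.reader(sample)
    next(reader)  # skip the header line
    for key, times in group_times(reader).items():
        # pstdev is the population standard deviation, like numpy.std's default
        print(key, pstdev(times))
```

With a defaultdict the membership tests disappear entirely; in Python 2 this also avoids building a fresh list of keys on every `in ip_dict.keys()` check.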
And last, not related to python: working in data-analysis, I have found an amazing tool for working with csv/json. It is csvkit, which allows to manipulate csv data with ease.
In addition to what Salvador Dali said in his answer: if you want to keep as much of your script's current code as possible, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)

Format statement with unknown columns

I am attempting to use fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues. One, I would prefer the elements to be tightly grouped (variable column width), two, I do not know how to define a variable number of array elements in the format statement, and three, the array elements can span a large range--maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definition throws an error (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want it to when I do WRITE(50,*), except that it's width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.
The format closest to what you want is f0.3; this gives no spaces and a fixed number of decimal places. I think if you also want to lop off trailing zeros you'll need to do a good bit of work.
The 'n' in your format statement can be larger than the number of data values, so one (old-school) approach is to put a big number there, e.g. 100000. Modern Fortran does have some syntax to specify indefinite repeat; I'm sure someone will offer that up.
----edit
The unlimited repeat is, as you might guess, an asterisk, and is evidently "brand new" in F2008.
In order to make sure that no space occurs between the entries in your line, you can write them separately into character variables and then print them out using the adjustl() function in Fortran:
program csv
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nn = 3
  real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]
  integer :: ii
  character(30) :: buffer(nn+2), myformat
  ! Create format string with appropriate number of fields.
  write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"
  ! You should execute the following lines in a loop for every line you want to output
  write(buffer(1), "(F20.2)") 1.0_dp ! a_float
  write(buffer(2), "(F20.2)") 2.0_dp ! a_different_float
  do ii = 1, nn
    write(buffer(2+ii), "(F20.3)") floatarray(ii)
  end do
  write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)
end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical format for the different entries, if you wish.

Bash randomizing variables

I'm working on a bash script to basically play Rock Paper Scissors against the CPU.
The problem I'm having is that I can't get it to randomly pick between variables, instead it just picks the first variable noted. Here is the section of code that needs work:
r="rock"
p="paper"
s="scissors"
RPS=$r||$p||$s #The line that needs to be fixed
#rps=$r||$p||$s works but only outputs rock...
echo $RPS
I've tried looking for ways to do this on forums but google only pops up forums for randomly picking lines from another file and not within the file itself.
declare -a values=(rock paper scissors)
echo "${values[$(( $RANDOM % ${#values[*]} ))]}"
How about something like this:
choices=(rock paper scissors) # Define an array with 3 choices
RPS=${choices[$RANDOM%3]} # Pick one at random
Discussion
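Bash has a built-in variable called $RANDOM, which returns a random integer between 0 and 32767 each time it is referenced. Taking that value modulo the number of choices gives a valid index into the array.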
