TatSu Parsing Performance

I've implemented a grammar in TatSu for parsing a description of a quantum program in the Quipper ASCII format (link). The parser works, but it is slow for the files I'm looking at (about 10 kB to 1 MB in size; see the resources directory): it takes approximately 10-30 seconds to parse some files. The grammar is very straightforward, and it should be possible to parse these files fairly quickly. One thing I've tried is adding cuts wherever possible to make sure there is no unneeded backtracking. The grammar is specified as:
##grammar :: Quipper
##whitespace :: /[^\S\n]*/
# The root of any program
start::BCircuit = circuit:circuit subroutines:{subroutine} {newline} $ ;
# A sequence of gates
circuit::Circuit =
    "Inputs:" ~ inputs:arity
    gatelist:{gate newline}
    "Outputs:" outputs:arity
    ;
# "Function" definitions
subroutine::Subroutine =
    newline
    "Subroutine:" ~ name:string newline
    "Shape:" shape:string newline
    "Controllable:" controllable:("yes"|"no"|"classically") newline
    circuit:circuit
    ;
# Wires and their types.
arity = #:",".{type_assignment}+ newline ;
type_assignment::TypeAssignment = number:int ":" type:("Qbit"|"Cbit") ;
# Gate control
control_app = [controlled:controlled] [no_control:no_control];
controlled::Controlled = "with controls=[" ~ controls:",".{int}+ "]" ;
no_control::NoControl = "with nocontrol" ;
# All gates
gate =
    | qgate
    | qrot
    | gphase
    | cnot
    | cgate
    | cswap
    | qprep
    | qunprep
    | qinit
    | cinit
    | cterm
    | qmeas
    | qdiscard
    | cdiscard
    | dterm
    | subroutine_call
    | comment
    ;
# Gate definitions
qgate::QGate::Gate = "QGate[" ~ name:string "]" inverse:["*"] "(" qubit:int ")" > control_app;
qrot::QRot::Gate = "QRot[" ~ string "," double "](" int ")" ;
gphase::GPhase::Gate = "Gphase() with t=" ~ timestep:double >control_app "with anchors=[" ~ wires:",".{wire} "]" ;
cnot::CNo::Gate = "CNot(" ~ wire:wire ")" >control_app ;
cgate::CGat::Gate = "CGate[" ~ name:string "]" inverse:["*"] "(" wires:",".{wire} ")" no_control;
cswap::CSwap::Gate = "CSwap(" ~ wire1:wire "," wire2:wire ")" >control_app ;
qprep::QPrep::Gate = "QPrep(" ~ wire:wire ")" no_control ;
qunprep::QUnprep::Gate = "QUnprep(" ~ wire:wire ")" no_control ;
qinit::QInit::Gate = state:("QInit0" | "QInit1") ~ "(" wire:wire ")" no_control;
cinit::CInit::Gate = state:("CInit0" | "CInit1") ~ "(" wire:wire ")" no_control;
qterm::QTerm::Gate = state:("QTerm0" | "QTerm1") ~ "(" wire:wire ")" no_control;
cterm::CTerm::Gate = state:("CTerm0" | "CTerm1") ~ "(" wire:wire ")" no_control;
qmeas::QMeas::Gate = "QMeas(" ~ wire:wire ")" ;
qdiscard::QDiscard::Gate = "QDiscard(" ~ wire:wire ")" ;
cdiscard::CDiscard::Gate = "CDiscard(" ~ wire:wire ")" ;
dterm::DTerm::Gate = state:("DTerm0" | "Dterm1") ~ "(" wire:wire ")" ;
subroutine_call::SubCall::Gate = "Subroutine" ~ ["(x" repetitions:int ")"]
    "[" name:string ", shape" shape:string "]"
    inverse:["*"]
    "(" inputs:",".{int}+ ")"
    "-> (" outputs:",".{int}+ ")"
    >control_app ;
comment::Comment::Gate = "Comment[" ~ text:string "](" wires:",".{wire}+ ")" ;
# Reference to an input wire and a textual description
wire::Wire = qubit:int ":" text:string ;
# Literals
string = '"'#:String_literal'"'; # Don't include the quotes.
String_literal::str = /[^"]+/ ;
int::int = /([+|-])?\d+/ ;
double::float = /(-)?\d+\.\d+e(-)?\d+/ ;
newline = /\n/ ;
I've tried profiling the code to find bottlenecks in the performance, but time is spent all over the place. I generate a parser with tatsu grammar.ebnf and a model with tatsu -g, which I then use in test cases to parse the input files. Performance results using standard Python 3.6.4, sorted by tottime, for parsing all files in resources/PF:
Tue Feb 27 13:35:58 2018 parser_profiling
4787639497 function calls (4611051402 primitive calls) in 3255.157 seconds
Ordered by: internal time
List reduced from 326 to 30 due to restriction <30>
ncalls tottime percall cumtime percall filename:lineno(function)
144386670 129.008 0.000 491.328 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:245(_eat_regex)
312540554 92.592 0.000 216.890 0.000 {built-in method builtins.isinstance}
15701720/80 88.741 0.000 3249.970 40.625 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:561(_invoke_rule)
34327680 87.957 0.000 370.060 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/ast.py:14(__init__)
57815550 76.052 0.000 380.881 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:322(matchre)
95427920 75.086 0.000 149.629 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:208(goto)
67455280 74.390 0.000 124.299 0.000 /Users/eddie/dev/quippy/.venv/bin/../lib/python3.6/abc.py:178(__instancecheck__)
45115140 69.378 0.000 671.723 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:260(next_token)
57815550 67.317 0.000 143.516 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:329(_scanre)
32497610 63.583 0.000 69.483 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/ast.py:115(__hasattribute__)
15701720/80 63.262 0.000 3249.976 40.625 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:501(_call)
18626000 61.979 0.000 454.538 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:644(_try)
68655360 57.334 0.000 57.334 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/ast.py:103(__setattr__)
57815550 54.505 0.000 54.505 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
126339356 49.908 0.000 49.908 0.000 /Users/eddie/dev/quippy/.venv/bin/../lib/python3.6/_weakrefset.py:70(__contains__)
72092420 48.541 0.000 170.810 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:211(move)
33410950/4018330 47.143 0.000 334.748 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/objectmodel.py:149(_adopt_children)
48867260 46.582 0.000 724.214 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:237(_next_token)
37932280 44.424 0.000 91.042 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:277(_add_cst_node)
48128890 44.030 0.000 58.493 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:256(eat_eol_comments)
36470510 43.970 0.000 234.524 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/ast.py:48(update)
24351260 43.216 0.000 120.680 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/ast.py:60(set)
18626000 41.821 0.000 545.252 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:666(_option)
75390600 40.445 0.000 40.445 0.000 {built-in method builtins.getattr}
79996390 39.995 0.000 52.367 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:223(_pos)
246258630 39.345 0.000 39.345 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:155(pos)
38860150/25041530 38.587 0.000 118.596 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/objectmodel.py:108(__cn)
95427954 38.319 0.000 38.319 0.000 {built-in method builtins.min}
15701720/80 38.148 0.000 3249.974 40.625 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:536(_recursive_call)
32060560 37.741 0.000 45.157 0.000 /usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/contextlib.py:59(__init__)
Even the largest contributor (_eat_regex) cannot explain the 50-minute run time. I also cannot find anything about optimizing for speed in the TatSu documentation.
Some more insight may be gained by looking at the cumulative time spent in functions. I've sorted the profiling results by cumtime:
Tue Feb 27 13:35:58 2018 parser_profiling
4787639497 function calls (4611051402 primitive calls) in 3255.157 seconds
Ordered by: cumulative time
List reduced from 326 to 30 due to restriction <30>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 3273.337 3273.337 {built-in method builtins.exec}
1 0.000 0.000 3273.337 3273.337 <string>:1(<module>)
1 0.005 0.005 3273.337 3273.337 test_model.py:95(test_optimizer)
80 0.002 0.000 3272.954 40.912 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:182(parse)
15701720/80 31.630 0.000 3249.977 40.625 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:47(wrapper)
15701720/80 63.262 0.000 3249.976 40.625 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:501(_call)
15701720/80 38.148 0.000 3249.974 40.625 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:536(_recursive_call)
15701720/80 88.741 0.000 3249.970 40.625 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:561(_invoke_rule)
80 0.001 0.000 3174.181 39.677 /Users/eddie/dev/quippy/_parser.py:82(_start_)
380/240 0.034 0.000 3169.206 13.205 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:762(_closure)
220 0.000 0.000 3164.452 14.384 /Users/eddie/dev/quippy/_parser.py:87(block2)
220 0.003 0.000 3087.977 14.036 /Users/eddie/dev/quippy/_parser.py:121(_subroutine_)
624800/800 15.209 0.000 3076.119 3.845 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:746(_repeat)
220 0.002 0.000 3024.411 13.747 /Users/eddie/dev/quippy/_parser.py:101(_circuit_)
2578040/1272030 7.935 0.000 2970.683 0.002 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:732(_isolate)
1875860 3.029 0.000 2839.993 0.002 /Users/eddie/dev/quippy/_parser.py:108(block2)
1875860 11.120 0.000 2483.056 0.001 /Users/eddie/dev/quippy/_parser.py:217(_gate_)
1875860 25.944 0.000 1793.816 0.001 /Users/eddie/dev/quippy/_parser.py:256(_qgate_)
48867260 46.582 0.000 724.214 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:237(_next_token)
45115140 69.378 0.000 671.723 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:260(next_token)
14443300 35.990 0.000 577.765 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:598(_invoke_semantic_rule)
49442230/22727050 22.427 0.000 575.361 0.000 {built-in method builtins.next}
18626000 41.821 0.000 545.252 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:666(_option)
48128890 19.695 0.000 497.263 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:249(eat_whitespace)
32060560/13779580 12.309 0.000 493.588 0.000 /usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/contextlib.py:79(__enter__)
144386670 129.008 0.000 491.328 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/buffering.py:245(_eat_regex)
14443300 34.625 0.000 460.941 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/semantics.py:76(_default)
18626000 61.979 0.000 454.538 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:644(_try)
17463860 36.024 0.000 415.958 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/contexts.py:606(_token)
4018330 18.630 0.000 410.825 0.000 /Users/eddie/dev/quippy/.venv/lib/python3.6/site-packages/tatsu/objectmodel.py:19(__init__)
We can see that quite some time is spent choosing which gate is being parsed (it is a large choice expression, after all). A lot of time is also spent in qgate, which is not surprising, since these nodes dominate the input files.
What are possible optimizations I can add to the grammar or parsing so that it can parse these files more quickly?

The problem may be the ##whitespace definition, because it matches the empty string.
You could try:
##whitespace :: /[^\S\n]+/
If that works, and the documentation misled you into using a normal closure for whitespace, please file an issue on the issue tracker.
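To see why the empty match matters, here is a minimal sketch using Python's re module. This is an illustration of the semantic difference only, not of what TatSu's buffering.py literally does: a Kleene-closure whitespace pattern "succeeds" with a zero-width match at every position, even in input with no whitespace at all, so the parser does that work before every token; the positive closure simply fails when there is nothing to eat.

```python
import re

star = re.compile(r'[^\S\n]*')   # the original: also matches the empty string
plus = re.compile(r'[^\S\n]+')   # the fix: needs at least one whitespace char

text = 'QGate["not"](2)'

# The closure version matches at a position with no whitespace...
assert star.match(text) is not None
# ...but it is a zero-width match that consumes nothing.
assert star.match(text).end() == 0

# The positive closure fails immediately on the same input,
assert plus.match(text) is None
# while still consuming real whitespace when it is present.
assert plus.match('  x').end() == 2
```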

Related

numpy - what is recurser in numpy/core/arrayprint.py and why does it cost time?

Question
What is recurser in numpy/core/arrayprint.py and why does it cost time? I'd appreciate a pointer to a resource.
Background
I noticed that calculating softmax, exp(X) / sum(exp(X)), was taking time, so I ran the profiler.
from typing import Union

import cProfile
import numpy as np

def softmax(X: Union[np.ndarray, float]) -> Union[np.ndarray, float]:
    C = np.max(X, axis=-1, keepdims=True)
    exp = np.exp(X - C)  # subtract the max to prevent overflow
    return exp / np.sum(exp, axis=-1, keepdims=True)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    softmax(X)
profiler.disable()
profiler.print_stats(sort="cumtime")
Apparently it spends the majority of its time in arrayprint.py, especially in recurser. Hence I wonder what arrayprint is and whether there is a way to improve the performance.
129000/3000 0.335 0.000 1.106 0.000 arrayprint.py:718(recurser)
Entire profiler output.
2419006 function calls (2275006 primitive calls) in 2.158 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1000 0.136 0.000 2.158 0.002 functions.py:173(softmax)
3000 0.006 0.000 1.966 0.001 arrayprint.py:1473(_array_str_implementation)
3000 0.013 0.000 1.960 0.001 arrayprint.py:516(array2string)
3000 0.013 0.000 1.926 0.001 arrayprint.py:461(wrapper)
3000 0.022 0.000 1.908 0.001 arrayprint.py:478(_array2string)
3000 0.005 0.000 1.111 0.000 arrayprint.py:709(_formatArray)
129000/3000 0.335 0.000 1.106 0.000 arrayprint.py:718(recurser)
3000 0.016 0.000 0.677 0.000 arrayprint.py:409(_get_format_function)
3000 0.005 0.000 0.651 0.000 arrayprint.py:366(<lambda>)
3000 0.012 0.000 0.646 0.000 arrayprint.py:836(__init__)
3000 0.112 0.000 0.632 0.000 arrayprint.py:863(fillFormat)
108000 0.368 0.000 0.588 0.000 arrayprint.py:947(__call__)
216000 0.395 0.000 0.395 0.000 {built-in method numpy.core._multiarray_umath.dragon4_positional}
111000 0.053 0.000 0.323 0.000 arrayprint.py:918(<genexpr>)
111000 0.075 0.000 0.249 0.000 arrayprint.py:913(<genexpr>)
126000 0.110 0.000 0.152 0.000 arrayprint.py:695(_extendLine)
17000 0.034 0.000 0.134 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
24000 0.040 0.000 0.096 0.000 {built-in method builtins.max}
21000/3000 0.043 0.000 0.094 0.000 arrayprint.py:324(_leading_trailing)
8000 0.017 0.000 0.085 0.000 fromnumeric.py:70(_wrapreduction)
960000 0.078 0.000 0.078 0.000 {built-in method builtins.len}
4000 0.005 0.000 0.071 0.000 <__array_function__ internals>:2(amax)
8000 0.062 0.000 0.062 0.000 {method 'reduce' of 'numpy.ufunc' objects}
4000 0.008 0.000 0.062 0.000 fromnumeric.py:2589(amax)
9000 0.008 0.000 0.037 0.000 <__array_function__ internals>:2(concatenate)
6000 0.013 0.000 0.034 0.000 _ufunc_config.py:32(seterr)
111000 0.021 0.000 0.029 0.000 arrayprint.py:922(<genexpr>)
111000 0.020 0.000 0.027 0.000 arrayprint.py:923(<genexpr>)
3000 0.004 0.000 0.025 0.000 _ufunc_config.py:433(__enter__)
3000 0.003 0.000 0.025 0.000 <__array_function__ internals>:2(amin)
108000 0.021 0.000 0.021 0.000 {method 'split' of 'str' objects}
1000 0.002 0.000 0.021 0.000 <__array_function__ internals>:2(sum)
3000 0.004 0.000 0.020 0.000 fromnumeric.py:2714(amin)
3000 0.009 0.000 0.018 0.000 arrayprint.py:60(_make_options_dict)
1000 0.003 0.000 0.018 0.000 fromnumeric.py:2105(sum)
3000 0.003 0.000 0.015 0.000 _ufunc_config.py:438(__exit__)
6000 0.012 0.000 0.013 0.000 _ufunc_config.py:132(geterr)
18000 0.008 0.000 0.012 0.000 index_tricks.py:727(__getitem__)
3000 0.007 0.000 0.007 0.000 arrayprint.py:358(_get_formatdict)
3000 0.007 0.000 0.007 0.000 {built-in method builtins.locals}
27000 0.006 0.000 0.006 0.000 {method 'rstrip' of 'str' objects}
24000 0.006 0.000 0.006 0.000 {built-in method builtins.isinstance}
6000 0.006 0.000 0.006 0.000 {built-in method numpy.seterrobj}
8000 0.005 0.000 0.005 0.000 fromnumeric.py:71(<dictcomp>)
3000 0.002 0.000 0.004 0.000 _asarray.py:14(asarray)
12000 0.003 0.000 0.003 0.000 {built-in method numpy.geterrobj}
1000 0.002 0.000 0.003 0.000 __init__.py:1412(debug)
3000 0.002 0.000 0.002 0.000 arrayprint.py:65(<dictcomp>)
11000 0.002 0.000 0.002 0.000 {method 'items' of 'dict' objects}
3000 0.002 0.000 0.002 0.000 {built-in method numpy.array}
12000 0.002 0.000 0.002 0.000 {built-in method builtins.issubclass}
3000 0.002 0.000 0.002 0.000 {method 'update' of 'dict' objects}
9000 0.002 0.000 0.002 0.000 multiarray.py:143(concatenate)
3000 0.002 0.000 0.002 0.000 _ufunc_config.py:429(__init__)
3000 0.002 0.000 0.002 0.000 {method 'discard' of 'set' objects}
1000 0.001 0.000 0.001 0.000 __init__.py:1677(isEnabledFor)
3000 0.001 0.000 0.001 0.000 {built-in method builtins.id}
3000 0.001 0.000 0.001 0.000 {method 'add' of 'set' objects}
3000 0.001 0.000 0.001 0.000 arrayprint.py:827(_none_or_positive_arg)
3000 0.001 0.000 0.001 0.000 {built-in method _thread.get_ident}
3000 0.001 0.000 0.001 0.000 {method 'copy' of 'dict' objects}
4000 0.001 0.000 0.001 0.000 fromnumeric.py:2584(_amax_dispatcher)
3000 0.001 0.000 0.001 0.000 fromnumeric.py:2709(_amin_dispatcher)
1000 0.000 0.000 0.000 0.000 fromnumeric.py:2100(_sum_dispatcher)
1 0.000 0.000 0.000 0.000 __init__.py:214(_acquireLock)
1 0.000 0.000 0.000 0.000 __init__.py:223(_releaseLock)
1 0.000 0.000 0.000 0.000 __init__.py:1663(getEffectiveLevel)
1 0.000 0.000 0.000 0.000 {method 'acquire' of '_thread.RLock' objects}
1 0.000 0.000 0.000 0.000 {method 'release' of '_thread.RLock' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
recurser in python3.8/site-packages/numpy/core/arrayprint.py
def _formatArray(a, format_function, line_width, next_line_prefix,
                 separator, edge_items, summary_insert, legacy):
    """formatArray is designed for two modes of operation:
    1. Full output
    2. Summarized output
    """
    def recurser(index, hanging_indent, curr_width):
        """
        By using this local function, we don't need to recurse with all the
        arguments. Since this function is not created recursively, the cost is
        not significant
        """
        axis = len(index)
        axes_left = a.ndim - axis
        if axes_left == 0:
            return format_function(a[index])

        # when recursing, add a space to align with the [ added, and reduce the
        # length of the line by 1
        next_hanging_indent = hanging_indent + ' '
        if legacy == '1.13':
            next_width = curr_width
        else:
            next_width = curr_width - len(']')

        a_len = a.shape[axis]
        show_summary = summary_insert and 2*edge_items < a_len
        if show_summary:
            leading_items = edge_items
            trailing_items = edge_items
        else:
            leading_items = 0
            trailing_items = a_len

        # stringify the array with the hanging indent on the first line too
        s = ''

        # last axis (rows) - wrap elements if they would not fit on one line
        if axes_left == 1:
            # the length up until the beginning of the separator / bracket
            if legacy == '1.13':
                elem_width = curr_width - len(separator.rstrip())
            else:
                elem_width = curr_width - max(len(separator.rstrip()), len(']'))

            line = hanging_indent
            for i in range(leading_items):
                word = recurser(index + (i,), next_hanging_indent, next_width)
                s, line = _extendLine(
                    s, line, word, elem_width, hanging_indent, legacy)
                line += separator

            if show_summary:
                s, line = _extendLine(
                    s, line, summary_insert, elem_width, hanging_indent, legacy)
                if legacy == '1.13':
                    line += ", "
                else:
                    line += separator

            for i in range(trailing_items, 1, -1):
                word = recurser(index + (-i,), next_hanging_indent, next_width)
                s, line = _extendLine(
                    s, line, word, elem_width, hanging_indent, legacy)
                line += separator

            if legacy == '1.13':
                # width of the separator is not considered on 1.13
                elem_width = curr_width
            word = recurser(index + (-1,), next_hanging_indent, next_width)
            s, line = _extendLine(
                s, line, word, elem_width, hanging_indent, legacy)

            s += line

        # other axes - insert newlines between rows
        else:
            s = ''
            line_sep = separator.rstrip() + '\n'*(axes_left - 1)

            for i in range(leading_items):
                nested = recurser(index + (i,), next_hanging_indent, next_width)
                s += hanging_indent + nested + line_sep

            if show_summary:
                if legacy == '1.13':
                    # trailing space, fixed nbr of newlines, and fixed separator
                    s += hanging_indent + summary_insert + ", \n"
                else:
                    s += hanging_indent + summary_insert + line_sep

            for i in range(trailing_items, 1, -1):
                nested = recurser(index + (-i,), next_hanging_indent,
                                  next_width)
                s += hanging_indent + nested + line_sep

            nested = recurser(index + (-1,), next_hanging_indent, next_width)
            s += hanging_indent + nested

        # remove the hanging indent, and wrap in []
        s = '[' + s[len(hanging_indent):] + ']'
        return s

    try:
        # invoke the recursive part with an initial index and prefix
        return recurser(index=(),
                        hanging_indent=next_line_prefix,
                        curr_width=line_width)
    finally:
        # recursive closures have a cyclic reference to themselves, which
        # requires gc to collect (gh-10620). To avoid this problem, for
        # performance and PyPy friendliness, we break the cycle:
        recurser = None
Update
My mistake: I had placed a logger statement with non-lazy (eager) string formatting, so the arrays were being formatted for printing on every call.
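For anyone hitting the same trap, here is a sketch of the difference with the standard logging module. With eager formatting (f-string or %), the array is converted to a string, which is what invokes arrayprint, even when the log level would discard the message; with lazy %-style arguments, formatting is deferred until logging decides the record will actually be emitted.

```python
import logging

import numpy as np

logging.basicConfig(level=logging.INFO)  # DEBUG messages are disabled
log = logging.getLogger(__name__)

X = np.arange(1000)

# Eager: the f-string stringifies X (running array2string/recurser)
# *before* logging even looks at the level and drops the message.
log.debug(f"softmax input: {X}")

# Lazy: the %s substitution only happens if DEBUG is enabled,
# so arrayprint is never invoked here.
log.debug("softmax input: %s", X)
```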
It looks like something is formatting array(s) for printing, but I don't see where that's happening in your profiling.
Here's what I get when I use ipython's profiling magic:
In [7]: %prun softmax(np.arange(1000))
22 function calls in 0.001 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.000 0.000 <ipython-input-2-4abb95d104cf>:1(softmax)
1 0.000 0.000 0.001 0.001 {built-in method builtins.exec}
2 0.000 0.000 0.000 0.000 fromnumeric.py:71(<dictcomp>)
2 0.000 0.000 0.000 0.000 fromnumeric.py:70(_wrapreduction)
1 0.000 0.000 0.000 0.000 {built-in method numpy.arange}
1 0.000 0.000 0.000 0.000 <__array_function__ internals>:2(amax)
2 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
1 0.000 0.000 0.000 0.000 fromnumeric.py:2111(sum)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 fromnumeric.py:2617(amax)
1 0.000 0.000 0.000 0.000 <__array_function__ internals>:2(sum)
2 0.000 0.000 0.000 0.000 {method 'items' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
1 0.000 0.000 0.000 0.000 fromnumeric.py:2612(_amax_dispatcher)
1 0.000 0.000 0.000 0.000 fromnumeric.py:2106(_sum_dispatcher)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}

Parse large stdin in Ruby

I need to write a script to parse large input data (30 GB). I need to extract all the numbers from the stdin text and output them in descending order.
Example of usage:
cat text_file_30gb.txt | script
This is what I currently use for parsing:
numbers = []
$stdin.each_line do |line|
  numbers += line.scan(/\d+/).map(&:to_i)
end
numbers.uniq!.sort!.reverse!
But when I passed text from a 60 MB file to the script, it took 50 minutes to parse.
Is there a way to speed up the script?
UPD. Profiling result:
%self total self wait child calls name
95.42 5080.882 4848.293 0.000 232.588 1 IO#each_line
3.33 169.246 169.246 0.000 0.000 378419 String#scan
0.26 15.148 13.443 0.000 1.705 746927 <Class::Time>#now
0.18 9.310 9.310 0.000 0.000 378422 Array#uniq!
0.15 14.446 7.435 0.000 7.011 378423 Array#map
0.14 7.011 7.011 0.000 0.000 8327249 String#to_i
0.10 5.179 5.179 0.000 0.000 378228 Array#sort!
0.03 1.508 1.508 0.000 0.000 339416 String#%
0.03 1.454 1.454 0.000 0.000 509124 Symbol#to_s
0.02 0.993 0.993 0.000 0.000 48488 IO#write
0.02 1.593 0.945 0.000 0.649 742077 Numeric#quo
0.01 0.649 0.649 0.000 0.000 742077 Fixnum#fdiv
0.01 0.619 0.619 0.000 0.000 509124 String#intern
0.01 0.459 0.459 0.000 0.000 315172 Fixnum#to_s
0.01 0.453 0.453 0.000 0.000 746927 Fixnum#+
0.01 0.383 0.383 0.000 0.000 72732 Array#reject
0.01 16.100 0.307 0.000 15.793 96976 *Enumerable#inject
0.00 15.793 0.207 0.000 15.585 150322 *Array#each
...
Thanks for the excellent problem.
I couldn't dig into it for long, but here is what I can see as a quick fix, which brings the 50-minute mark down to 11 minutes, at least 4.5 times faster.
require 'ruby-prof'

def profile(&block)
  RubyProf::FlatPrinter.new(RubyProf.profile(&block)).print($stdout)
end

numbers = []
profile do
  $stdin.each_line do |line|
    line.scan(/\d+/) { |digit| numbers << digit.to_i }
  end
  numbers.uniq!.sort!.reverse!
end
The reason is pretty simple: += on an array allocates a new array instead of pushing the new values onto the existing one. The quick fix is using << instead. That alone cut most of the lag.
Still, there may be other significant glitches if you juggle a larger file set; I don't have anything off the top of my head.
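To build intuition for that allocation difference, here is the same pattern sketched in Python (an illustration only, not Ruby's internals; Ruby's a += b is likewise sugar for a = a + b): rebinding the accumulator to a freshly built concatenation copies everything collected so far on every iteration, which is quadratic overall, while growing the existing list in place is linear.

```python
# Quadratic: each iteration allocates a brand-new list and copies the
# whole accumulator (the analogue of Ruby's `numbers += ...`).
def collect_copying(chunks):
    numbers = []
    for chunk in chunks:
        numbers = numbers + chunk
    return numbers

# Linear: grow the existing list in place (the analogue of `numbers << ...`).
def collect_in_place(chunks):
    numbers = []
    for chunk in chunks:
        numbers.extend(chunk)
    return numbers

chunks = [[1, 2], [3], [4, 5, 6]]
assert collect_copying(chunks) == collect_in_place(chunks) == [1, 2, 3, 4, 5, 6]
```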

Optimizing Ruby Arrays or Hashes

I have a program that produces simulated typing. The program asks the user for the location of the file and the file name along with its extension. It then breaks the file down with an iteration and puts the characters into an array.
def file_to_array(file)
  empty = []
  File.foreach("#{file}") do |line|
    empty << line.to_s.split('')
  end
  return empty.flatten!
end
When the program runs, it sends the keys to the text area to simulate typing via win32ole.
After 5,000 characters there is too much memory overhead and the program begins to slow down; the further past 5,000 characters it gets, the slower it goes. Is there a way this can be optimized?
--EDIT--
require 'benchmark'

def file_to_array(file)
  empty = []
  File.foreach(file) do |line|
    empty << line.to_s.split('')
  end
  return empty.flatten!
end

def file_to_array_2(file)
  File.read(file).split('')
end

file = 'xxx'
Benchmark.bm do |results|
  results.report { print file_to_array(file) }
  results.report { print file_to_array_2(file) }
end
user system total real
0.234000 0.000000 0.234000 ( 0.787020)
0.218000 0.000000 0.218000 ( 1.917185)
I did my own benchmark and profile; here is the code:
#!/usr/bin/env ruby

require 'benchmark'
require 'rubygems'
require 'ruby-prof'

def ftoa_1(path)
  empty = []
  File.foreach(path) do |line|
    empty << line.to_s.split('')
  end
  return empty.flatten!
end

def ftoa_2(path)
  File.read(path).split('')
end

def ftoa_3(path)
  File.read(path).chars
end

def ftoa_4(path)
  File.open(path) { |f| f.each_char.to_a }
end

GC.start
GC.disable

Benchmark.bm(6) do |x|
  1.upto(4) do |n|
    x.report("ftoa_#{n}") { send("ftoa_#{n}", ARGV[0]) }
  end
end

1.upto(4) do |n|
  puts "\nProfiling ftoa_#{n} ...\n"
  result = RubyProf.profile do
    send("ftoa_#{n}", ARGV[0])
  end
  RubyProf::FlatPrinter.new(result).print($stdout)
end
And here is my result:
user system total real
ftoa_1 2.090000 0.160000 2.250000 ( 2.250350)
ftoa_2 1.540000 0.090000 1.630000 ( 1.632173)
ftoa_3 0.420000 0.080000 0.500000 ( 0.505286)
ftoa_4 0.550000 0.090000 0.640000 ( 0.630003)
Profiling ftoa_1 ...
Measure Mode: wall_time
Thread ID: 70190654290440
Fiber ID: 70189795562220
Total: 2.571306
Sort by: self_time
%self total self wait child calls name
83.39 2.144 2.144 0.000 0.000 103930 String#split
12.52 0.322 0.322 0.000 0.000 1 Array#flatten!
3.52 2.249 0.090 0.000 2.159 1 <Class::IO>#foreach
0.57 0.015 0.015 0.000 0.000 103930 String#to_s
0.00 2.571 0.000 0.000 2.571 1 Global#[No method]
0.00 2.571 0.000 0.000 2.571 1 Object#ftoa_1
0.00 0.000 0.000 0.000 0.000 1 Fixnum#to_s
* indicates recursively called methods
Profiling ftoa_2 ...
Measure Mode: wall_time
Thread ID: 70190654290440
Fiber ID: 70189795562220
Total: 1.855242
Sort by: self_time
%self total self wait child calls name
99.77 1.851 1.851 0.000 0.000 1 String#split
0.23 0.004 0.004 0.000 0.000 1 <Class::IO>#read
0.00 1.855 0.000 0.000 1.855 1 Global#[No method]
0.00 1.855 0.000 0.000 1.855 1 Object#ftoa_2
0.00 0.000 0.000 0.000 0.000 1 Fixnum#to_s
* indicates recursively called methods
Profiling ftoa_3 ...
Measure Mode: wall_time
Thread ID: 70190654290440
Fiber ID: 70189795562220
Total: 0.721246
Sort by: self_time
%self total self wait child calls name
99.42 0.717 0.717 0.000 0.000 1 String#chars
0.58 0.004 0.004 0.000 0.000 1 <Class::IO>#read
0.00 0.721 0.000 0.000 0.721 1 Object#ftoa_3
0.00 0.721 0.000 0.000 0.721 1 Global#[No method]
0.00 0.000 0.000 0.000 0.000 1 Fixnum#to_s
* indicates recursively called methods
Profiling ftoa_4 ...
Measure Mode: wall_time
Thread ID: 70190654290440
Fiber ID: 70189795562220
Total: 0.816140
Sort by: self_time
%self total self wait child calls name
99.99 0.816 0.816 0.000 0.000 2 IO#each_char
0.00 0.000 0.000 0.000 0.000 1 File#initialize
0.00 0.000 0.000 0.000 0.000 1 IO#close
0.00 0.816 0.000 0.000 0.816 1 <Class::IO>#open
0.00 0.000 0.000 0.000 0.000 1 IO#closed?
0.00 0.816 0.000 0.000 0.816 1 Global#[No method]
0.00 0.816 0.000 0.000 0.816 1 Enumerable#to_a
0.00 0.816 0.000 0.000 0.816 1 Enumerator#each
0.00 0.816 0.000 0.000 0.816 1 Object#ftoa_4
0.00 0.000 0.000 0.000 0.000 1 Fixnum#to_s
* indicates recursively called methods
The conclusion is that ftoa_3 is the fastest when GC is turned off, but I would recommend ftoa_4 because it uses less memory and thus reduces the number of GC runs. If you turn GC on, you will see that ftoa_4 is the fastest.
From the profile results, you can see the program spends most of its time in String#split in both ftoa_1 and ftoa_2. ftoa_1 is the worst because String#split runs many times (once for each line), and Array#flatten! also takes a lot of time.
Yes, this can be optimized (written shorter, with fewer assignments and fewer method calls):
def file_to_array(file)
  File.read(file).split('')
end
This works because file is already a string, so there is no need for the string interpolation "#{file}". File.read returns the whole file, which removes the need to iterate over each line. Without the iteration there is no need for the temporary empty array, the flatten!, or the appending with <<. And there is no need for the explicit return in your example.
Update: It is not clear from your question what you are optimizing for: performance, memory usage, or readability. Since I was surprised by your benchmark results, I ran my own, and I think my solution is faster than yours.
But the results might differ with different Ruby versions (I used Ruby 2.3), input file sizes, numbers of lines, or numbers of iterations run in the benchmark.
def file_to_array_1(file)
  empty = []
  File.foreach("#{file}") do |line|
    empty << line.to_s.split('')
  end
  return empty.flatten!
end

def file_to_array_2(file)
  File.read(file).split('')
end

require 'benchmark'

# file = '...' # a path to a file with about 26KB data in about 750 lines
n = 1000
Benchmark.bmbm(15) do |x|
  x.report("version 1 :") { n.times do; file_to_array_1(file); end }
  x.report("version 2 :") { n.times do; file_to_array_2(file); end }
end
# Rehearsal ---------------------------------------------------
# version 1 : 11.970000 0.110000 12.080000 ( 12.092841)
# version 2 : 8.150000 0.120000 8.270000 ( 8.267420)
# ----------------------------------------- total: 20.350000sec
# user system total real
# version 1 : 11.940000 0.100000 12.040000 ( 12.045505)
# version 2 : 8.130000 0.110000 8.240000 ( 8.248707)
# [Finished in 40.7s]

Python any() + generator expression

According to the blog post here, any() plus a generator expression should run quicker than a for loop, and the reasoning there seems to make sense.
But when I tried to use this method (albeit on another function), it seemed to take longer to run than an explicit for loop.
def with_loop(a, b):
    for x in xrange(1, b):
        if x * a % b == 1:
            return True
    return False

def with_generator(a, b):
    return any(x * a % b == 1 for x in xrange(1, b))
Basically, the code loops through all the numbers from 1 to b - 1 to find out whether the number a has a modular inverse modulo b.
The times I got from running the functions are:
>>> from timeit import Timer as T
>>> T(lambda : with_generator(100, 300)).repeat(number = 100000)
[3.4041796334919923, 3.6303230626526215, 3.6714475531563266]
>>> T(lambda : with_loop(100, 300)).repeat(number = 100000)
[2.1977450660490376, 2.2083902291327604, 2.1905292602997406]
>>> T(lambda : with_generator(101, 300)).repeat(number = 100000)
[1.213779524696747, 1.2228346702509043, 1.2216941170634072]
>>> T(lambda : with_loop(101, 300)).repeat(number = 100000)
[0.7431202233722161, 0.7444361146951906, 0.7525384471628058]
with_generator(100,300) returns False and with_generator(101,300) returns True.
It seems that with_generator always takes longer to run than with_loop. Is there a reason for this?
EDIT:
Is there any other shorter or more elegant way of rewriting with_loop so that we achieve similar or better performance? Thanks!
Context
I think that
any() + generator expression should run quicker than a for loop
means that any() does not have to generate all the values, whereas a loop over a prebuilt list does:
>>> T(lambda : any([x * 101 % 300 == 1 for x in xrange(1, 300)])).repeat(number = 100000)
[5.7612644951345935, 5.742304846931542, 5.746804810873488]
>>> T(lambda : any(x * 101 % 300 == 1 for x in xrange(1, 300))).repeat(number = 100000)
[2.1652204281427814, 2.1640463131248886, 2.164674290446399]
So the quote does not mean that a hand-written loop can never match the performance of a generator.
It means that building a list first generates every element, while any() over a generator only consumes elements until the first match.
Your function with_loop already short-circuits in exactly the same way the generator does, so you cannot expect different behaviour from it.
To put it more clearly: any(list) is slower than any(generator) because the list is built in full before any() looks at it. Your with_loop is the hand-written equivalent of any(generator), not of any(list).
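The short-circuiting can be made concrete by counting evaluations. Below is an illustrative sketch (written for Python 3's range; the question's code uses Python 2's xrange): a generator fed to any() stops at the first True, while a list comprehension evaluates every element before any() even starts.

```python
calls = []

def check(x):
    # Record each evaluation, then apply the same predicate as above.
    calls.append(x)
    return x * 101 % 300 == 1

any(check(x) for x in range(1, 300))    # generator: stops at the first hit
gen_evals = len(calls)

calls = []
any([check(x) for x in range(1, 300)])  # list: every element is built first
list_evals = len(calls)

print(gen_evals, list_evals)  # 101 299 (the inverse of 101 mod 300 is 101)
```

The generator version evaluates only 101 elements before any() returns True; the list version always evaluates all 299.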
Original Question
>>> profile.run("""T(lambda : with_loop(101, 300)).repeat(number = 100000)""")
600043 function calls (600040 primitive calls) in 6.133 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
3 0.000 0.000 0.000 0.000 :0(append)
6 0.000 0.000 0.000 0.000 :0(clock)
3 0.000 0.000 0.000 0.000 :0(disable)
3 0.000 0.000 0.000 0.000 :0(enable)
3 0.000 0.000 0.000 0.000 :0(globals)
1 0.000 0.000 0.000 0.000 :0(hasattr)
3 0.000 0.000 0.000 0.000 :0(isenabled)
2 0.000 0.000 0.000 0.000 :0(isinstance)
1 0.000 0.000 0.000 0.000 :0(range)
1 0.005 0.005 0.005 0.005 :0(setprofile)
300000 0.579 0.000 5.841 0.000 <string>:1(<lambda>)
4/1 0.000 0.000 6.128 6.128 <string>:1(<module>)
300000 5.262 0.000 5.262 0.000 <string>:1(with_loop)
1 0.000 0.000 6.133 6.133 profile:0(T(lambda : with_loop(101, 300)).repeat(number = 100000))
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.000 0.000 timeit.py:121(__init__)
3 0.000 0.000 0.000 0.000 timeit.py:143(setup)
3 0.000 0.000 6.128 2.043 timeit.py:178(timeit)
1 0.000 0.000 6.128 6.128 timeit.py:201(repeat)
1 0.000 0.000 0.000 0.000 timeit.py:94(_template_func)
3 0.287 0.096 6.128 2.043 timeit.py:96(inner)
>>> profile.run("""T(lambda : with_generator(101, 300)).repeat(number = 100000)""")
31500043 function calls (31500040 primitive calls) in 70.531 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
300000 30.898 0.000 67.590 0.000 :0(any)
3 0.000 0.000 0.000 0.000 :0(append)
6 0.000 0.000 0.000 0.000 :0(clock)
3 0.000 0.000 0.000 0.000 :0(disable)
3 0.000 0.000 0.000 0.000 :0(enable)
3 0.000 0.000 0.000 0.000 :0(globals)
1 0.000 0.000 0.000 0.000 :0(hasattr)
3 0.000 0.000 0.000 0.000 :0(isenabled)
2 0.000 0.000 0.000 0.000 :0(isinstance)
1 0.000 0.000 0.000 0.000 :0(range)
1 0.000 0.000 0.000 0.000 :0(setprofile)
300000 0.667 0.000 70.222 0.000 <string>:1(<lambda>)
4/1 0.000 0.000 70.531 70.531 <string>:1(<module>)
300000 1.629 0.000 69.555 0.000 <string>:6(with_generator)
30600000 37.027 0.000 37.027 0.000 <string>:7(<genexpr>)
1 0.000 0.000 70.531 70.531 profile:0(T(lambda : with_generator(101, 300)).repeat(number = 100000))
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.000 0.000 timeit.py:121(__init__)
3 0.000 0.000 0.000 0.000 timeit.py:143(setup)
3 0.000 0.000 70.531 23.510 timeit.py:178(timeit)
1 0.000 0.000 70.531 70.531 timeit.py:201(repeat)
1 0.000 0.000 0.000 0.000 timeit.py:94(_template_func)
3 0.309 0.103 70.531 23.510 timeit.py:96(inner)
Re-entering the generator for every element (30,600,000 &lt;genexpr&gt; calls in total) is much slower than a plain for loop.
If you know how many elements the list contains, you can unroll the test entirely:
l[0] * 101 % 300 == 1 or l[1] * 101 % 300 == 1 or l[2] * 101 % 300 == 1 or ....
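Regarding the EDIT: a has a modular inverse modulo b exactly when gcd(a, b) == 1, so the whole search loop can be replaced by a gcd test. This shortcut is not part of the original answer; it is a sketch using Python 3's math.gcd, and it assumes b > 1 as in the question's examples.

```python
from math import gcd

def has_inverse(a, b):
    # a*x % b == 1 has a solution iff a and b are coprime (for b > 1),
    # so no loop over candidate inverses is needed at all.
    return gcd(a, b) == 1

print(has_inverse(100, 300))  # False, like with_loop(100, 300)
print(has_inverse(101, 300))  # True, like with_loop(101, 300)
```

This runs in O(log min(a, b)) instead of O(b), so it is both shorter and faster than either original version.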

Why is this CoffeeScript faster than this Ruby script?

I was solving a problem that asked me to find the sum of all even Fibonacci numbers under 4,000,000, and I noticed that the CoffeeScript below executed faster than the Ruby below.
CoffeeScript
sum = 0
f = (n) ->
  if n == 0 then return 0
  else if n == 1 then return 1
  else return f(n-1) + f(n-2)
console.log "Starting..."
for i in [1..4000000]
  ff = f(i)
  break if ff >= 4000000
  sum += ff if ff % 2 == 0
  console.log ".." + ff
console.log "Sum:" + sum
Ruby
sum = 0
def F(n)
  if n.eql? 0
    return 0
  elsif n.eql? 1
    return 1
  else
    return F(n-1) + F(n-2)
  end
end
puts "Starting..."
4000000.times do |num|
  the_num = F(num)
  break if the_num >= 4000000
  sum += the_num if the_num % 2 == 0
  puts ".." + the_num.to_s
end
puts "Sum:" + sum.to_s
It took the Ruby script nearly 6-8 seconds to find all even Fibonacci numbers under 4,000,000, while the Node.js run of the compiled CoffeeScript completed in roughly 0.2 seconds. Why is this?
$ ruby --version
ruby 2.1.0p0 (2013-12-25 revision 44422) [x86_64-darwin12.0]
$ node --version
v0.10.25
Let's profile it using ruby-prof:
# so_21908065.rb
sum = 0
def F(n)
  # using == instead of eql? speeds up from ~ 7s to ~ 2s
  if n == 0
    return 0
  elsif n == 1
    return 1
  else
    return F(n-1) + F(n-2)
  end
end
puts "Starting..."
1.upto(4000000) do |num|
  the_num = F(num)
  break if the_num >= 4000000
  sum += the_num if the_num % 2 == 0
  puts ".." + the_num.to_s
end
puts "Sum:" + sum.to_s
$ ruby-prof so_21908065.rb
%self total self wait child calls name
16.61 27.102 27.102 0.000 0.000 87403761 Fixnum#==
9.57 15.615 15.615 0.000 0.000 48315562 Fixnum#-
4.92 8.031 8.031 0.000 0.000 24157792 Fixnum#+
0.00 0.000 0.000 0.000 0.000 70 IO#write
0.00 163.123 0.000 0.000 163.123 1 Integer#upto
0.00 163.122 0.000 0.000 163.122 48315596 *Object#F
0.00 0.000 0.000 0.000 0.000 35 IO#puts
0.00 0.000 0.000 0.000 0.000 34 Fixnum#to_s
0.00 0.001 0.000 0.000 0.000 35 Kernel#puts
0.00 0.000 0.000 0.000 0.000 34 String#+
0.00 163.123 0.000 0.000 163.123 2 Global#[No method]
0.00 0.000 0.000 0.000 0.000 34 Fixnum#>=
0.00 0.000 0.000 0.000 0.000 2 IO#set_encoding
0.00 0.000 0.000 0.000 0.000 33 Fixnum#%
0.00 0.000 0.000 0.000 0.000 1 Module#method_added
* indicates recursively called methods
So the main culprits are Fixnum#==, Fixnum#- and Fixnum#+.
As @HolgerJust noted, a warmed-up JRuby run would probably be much faster, thanks to JIT optimization.
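The profile also shows why there are ~48 million arithmetic calls at all: the doubly recursive F makes an exponential number of calls. Independent of Ruby versus Node, an iterative Fibonacci makes the interpreter speed almost irrelevant; here is a sketch of the same computation (in Python, for brevity, not in either of the question's languages):

```python
def even_fib_sum(limit):
    # Sum the even Fibonacci numbers below `limit`, iterating once
    # per Fibonacci term instead of recursing exponentially.
    a, b = 0, 1
    total = 0
    while b < limit:
        if b % 2 == 0:
            total += b
        a, b = b, a + b
    return total

print(even_fib_sum(4_000_000))  # → 4613732
```

This visits only the ~33 Fibonacci numbers below 4,000,000, so it finishes in microseconds in any of the three languages.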

Resources