How to control procedure memory in Memgraph? - memgraphdb

I am trying to control procedure memory usage with the following code:
match (n)-[e]->(m)
with collect(e) as edges
call super_awesome_module.do_something(edges) MEMORY UNLIMITED YIELD * RETURN *;
I get the following error message: Client received exception: line 3:37 mismatched input 'MEMORY' expecting {<EOF>, ';'}
What is wrong with my code?

The query is missing the keyword PROCEDURE. Your Cypher query should look like this:
MATCH (n)-[e]->(m)
WITH collect(e) AS edges
CALL super_awesome_module.do_something(edges)
PROCEDURE MEMORY UNLIMITED
YIELD * RETURN *;

Related

Should I throw an error or correct the values?

Suppose we create a list class with a function that eliminates elements from position a to position b.
The class is supposed to be used by other programmers (like std::list).
Example:
list values: {0,1,2,3,4,5,6}, and we call this function with (begin = 2, end = 5).
This would change the list to {0,1,6}.
If the user calls the function with end > size of the list, is it better to just reassign end = size and delete up to the last element, or to throw an exception like out_of_range?
This is a good question about programming standards. If someone calls your method with end > size, they have technically violated the preconditions of your function. It is possible the programmer called your function thinking it did something else, such as eliminating all list values between the values they gave. If your function does not throw an exception, they will not know anything has gone wrong until a logical error appears later. The best practice when given incorrect parameters is to throw an exception explaining what the caller did wrong. It puts more of a burden on the person using your function, but it saves them more trouble later.
For an input argument with rightIndex > list.size(), you have a few options:
ERROR: Exception (Array index out of range)
WARNING: rightIndex is truncated to end of the list at Line 5:56
ASSERTION FAILED (Error): assert(rightIndex >= 0 && rightIndex < list.size());
Reference Mismatch Error: Reference provided by rightIndex is not matched with the function deleteBetween(leftIndex, rightIndex){ ... }
Segmentation fault (SIGSEGV): invalid memory reference / access violation, from trying to read or write a memory area that you do not have access to.
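To make the throw-on-bad-input contract concrete, here is a minimal Python sketch; the class and method names are hypothetical, chosen for illustration, and the inclusive begin/end semantics follow the example above:

```python
class BoundedList:
    """A list wrapper whose delete_between raises instead of silently
    clamping out-of-range indices."""

    def __init__(self, values):
        self._values = list(values)

    def delete_between(self, begin, end):
        # Validate preconditions up front and fail loudly, so the caller
        # learns about the bad argument at the call site, not later.
        if not (0 <= begin <= end < len(self._values)):
            raise IndexError(
                f"delete_between({begin}, {end}) out of range "
                f"for size {len(self._values)}")
        # Inclusive range, matching the example: (2, 5) removes indices 2..5.
        del self._values[begin:end + 1]

    def values(self):
        return list(self._values)
```

With the example data, `BoundedList([0, 1, 2, 3, 4, 5, 6]).delete_between(2, 5)` leaves {0, 1, 6}, while an `end` beyond the size raises `IndexError` immediately.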

SAS: What's the optimal way to find the sum of a column by another column?

I want to find the best way to perform a group-by in SAS so I can run some benchmarks. The simplest two ways I can think of are PROC SQL and PROC MEANS. Here is the example in PROC SQL:
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
I can think of a couple of ways to make this run faster:
use sasfile command to load the data into memory first
create an index on id
Are there any other options I can use? Any SAS options I should turn on to make this run as fast as possible? I am not tied to proc sql nor proc means, so if there are faster ways then I would love to know about it!!!
My set up code is as below
options macrogen;
options obs=max sortsize=max source2 FULLSTIMER;
options minoperator SASTRACE=',,,d' SASTRACELOC=SASLOG;
options compress = binary NOSTSUFFIX;
options noxwait noxsync;
options LRECL=32767;
proc fcmp outlib=work.myfunc.sample;
function RandBetween(min, max);
return (min + floor((1 + max - min) * rand("uniform")));
endsub;
run;
options cmplib=work.myfunc;
data RandInt;
do i = 1 to 250000000;
id = RandBetween(1, 2500000);
val = rand("uniform");
output;
end;
drop i;
run;
My SAS comparison macros are as below
%macro sasbench(dosql = N); %macro _; %mend;
%if &dosql. = Y %then %do;
proc sql noprint; /* took 6 mins */
create table summ as select
id,
sum(val)
from
randint
group by
id
;
quit;
%end;
proc means data=randint sum noprint;
var val ;
class id;
output out = summmeans(drop=_type_ _freq_) sum = /autoname;
run;
%mend;
%sasbench();
/**/
/*sasfile randint load;*/
/*%sasbench();*/
/*sasfile randint close;*/
proc datasets lib=work;
modify randint;
INDEX CREATE id / nomiss;
run;
%sasbench();
sasfile is only a benefit if the entire data set can fit within the session's RAM limits and if the data set is going to be used more than once. I suppose this would make sense if your benchmark includes multiple runs / different techniques on the same sasfile.
An index on id would help if the data is unsorted by id. When the data set is presorted by id, the id column metadata will have the sortedby flag set, which a procedure can use for its own internal optimization; however, there is no guarantee. As for indexes, use option msglevel=i to get informational messages in the log about index selection during processing.
The fastest way is direct addressing, but it requires enough RAM to hold an array indexed by the largest id value:
array ids(250000000) _temporary_;
ids(id) + value;
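The same idea in a short Python sketch (the function name is mine, for illustration): the id doubles as the array index, so each row costs one O(1) update, at the price of allocating a slot for every possible id:

```python
def sum_by_id_direct(rows, max_id):
    """Sum value by id via direct addressing: one accumulator slot per
    possible id value. Feasible only when max_id is small enough for
    the array to fit in RAM."""
    totals = [0.0] * (max_id + 1)
    for id_, value in rows:
        totals[id_] += value  # counterpart of the SAS sum statement ids(id) + value
    return totals
```

For example, `sum_by_id_direct([(2, 1.5), (1, 2.0), (2, 0.5)], 3)` returns `[0.0, 2.0, 2.0, 0.0]`.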
The next fastest way is probably hand-coded array-based hashing; search the SAS conference proceedings for papers by Paul Dorfman.
The next fastest way is probably the hash component object with the suminc option.
Edit: the DATA step below was revised to align with the comments.
data demo_data;
do rownum = 1 to 1000;
id = ceil(100*ranuni(123)); * NOTE: 100 different groups, disordered;
value = ceil(1000*ranuni(123)); * NOTE: want to sum value over group, for demonstration individual values integers from 1..1000;
output;
end;
run;
data _null_;
if 0 then set demo_data(keep=id value); %* prep pdv ;
length total 8; %* prep keysum variable ;
call missing (total); %* prevent warnings ;
declare hash ids (ordered:'a', suminc:'value', keysum:'total'); %* ordered ensures keys will be sorted ascending upon output ;
ids.defineKey('id');
*ids.defineData('id'); * omitting defineData implicitly adds only the keys as data - only data and keysum variables are written by the output method;
ids.defineDone();
* read all records and touch each hash key in order to perform tacit total+value summation;
do until (end);
set demo_data end=end;
if ids.find() ne 0 then ids.add();
end;
ids.output(dataset:'sum_value_over_id'); * save the summation of each key combination;
stop;
run;
Note: There can be only one keysum variable.
If the suminc variable was set to be always 1 instead of value, then the keysum would be the count instead of the total.
Obtaining both sum and count over group via hash would require an explicit defineData for a count and sum variable and slightly different statements, such as:
declare hash ids (ordered:'a');
...
ids.defineData('id', 'count', 'total');
...
if ids.find() ne 0 then do; count=0; total=0; end;
count+1;
total+value;
ids.replace();
...
However, if value is known to be always a natural number, and the group size is known to be < 10^k for some limit k, you could numerically encode the count by using a suminc of value + 10^-k and numerically decode the count by processing the output data with count = (total - int(total)) * 10^k.
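The encoding trick can be sanity-checked in a few lines of Python (K and the helper names are mine, for illustration): each row adds its natural-number value plus 10^-K, so the integer part of the total carries the sum and the fractional part carries the count, provided every group has fewer than 10^K rows:

```python
K = 3            # assumes every group has fewer than 10**K rows
INC = 10 ** -K   # the per-row extra, i.e. a suminc of value + 10**-K

def encode(values):
    # Each row contributes its natural-number value plus INC.
    return sum(v + INC for v in values)

def decode(total):
    s = int(total)                    # integer part: the sum of the values
    n = round((total - s) * 10 ** K)  # fractional part: the row count
    return s, n
```

For example, `decode(encode([5, 7]))` gives `(12, 2)`: a sum of 12 over 2 rows. The `round` absorbs floating-point noise, which is why K must leave enough headroom.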
For sorted data the fastest way is most likely a DOW loop with accumulation.
proc sort data=foo;
by id;
run;
data sum_value_over_id_v2(keep=id total);
do until (last.id);
set foo;
by id;
total = sum(total, value);
end;
run;
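For comparison, the same by-group pattern over pre-sorted rows can be sketched in Python, with itertools.groupby standing in for the BY-group logic (the function name is mine):

```python
from itertools import groupby
from operator import itemgetter

def sum_by_id_sorted(rows):
    """Single pass over (id, value) rows already sorted by id, like the
    DOW loop: one output pair per id group."""
    return [(id_, sum(v for _, v in grp))
            for id_, grp in groupby(rows, key=itemgetter(0))]
```

As with the BY statement, unsorted input silently produces one output row per run of equal ids rather than per distinct id, so the sort step matters.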
You will likely find that I/O is largest component of performance.
The best answer varies dramatically by the application. In your example, PROC SQL at least on my machine significantly outperforms PROC MEANS, but there are plenty of cases where it will not do so. It's able to in this case because it's building hash tables behind the scenes, more than likely, which are quite fast - a single pass through the data is all that's needed.
You certainly could speed things up by putting your full dataset into memory with SASFILE, if you have room to store the whole thing. You would have to have it in memory to begin with, though, more than likely; just reading it into memory for this purpose alone wouldn't really help since you're doing that read anyway.
As Richard notes, there are a bunch of ways to do this. I think PROC SQL will often be the fastest or similar to the fastest in simple cases, both because it's multithreaded (as opposed to data step being single threaded) and because it's got a fast hash table backend.
PROC MEANS is also usually going to be competitive; the case you show in the example is almost a worst case for it, since it has a huge number of class variable levels, so it may be creating a temporary table on disk. It's also multithreaded. Reduce the class variable levels to 2,500 instead of 2,500,000 and PROC MEANS comes out a bit faster than PROC SQL (but within the margin of error).
Data step accumulation, either in a hash table or a DoW loop, will sometimes outperform both of the above, and sometimes not, again depending on the data. Here it does outperform slightly. The code for data step accumulation tends to be a bit more complex, which is why I'd usually discourage it unless the savings is substantial (having more code to maintain is worse, typically). PROC MEANS and PROC SQL require less maintenance and less to understand. But in applications where performance is critical and these solutions happen to be superior, it may be worth it to go this route, especially if the data step is helpful. Of course, the hash table method is limited to fitting the results in memory, though usually that's manageable.
Ultimately, I would encourage you to use whatever method is easiest to maintain but still gives sufficient performance; and when possible try to be self consistent with other code. If most of your code is in SQL, that is probably fine. SASFILE and indexes probably won't be needed, unless you're doing more complicated things than you present above. Summation is actually more work than I/O in many cases. Don't overcomplicate it, ultimately: programmer hours and difficulty of QA is something that should trump basic performance, unless you're talking several hours' difference. And if you are, then just run tests on your actual use case and see what works best.
If you assume the data is sorted, then this is another solution:
data sum_value_over_id_v2(keep=id total);
set a.randint(keep=id val);
by id;
total + val;
if last.id then do;
output;
total = 0;
end;
drop val;
run;

SAS proc IML error: Not enough memory to store all matrices

Good morning,
I am trying to program the following simple function in SAS using PROC IML, but I get the error "Not enough memory to store all matrices". I am trying to read two matrices, one called "matriz_product" and the other "matriz_segment"; these tables have dimensions of 21 × (more than) 1,000,000, and the values are characters. After reading these matrices, I want to create one vector from each of the tables, where the column picked is the one specified in position (another vector that I read).
The code is the following:
proc iml;
use spain.Tabla_product;
read all var {a_def_prdt1 b_def_prdt2 c_def_prdt3 d_def_prdt4 e_def_prdt5 f_def_prdt6 g_def_prdt7 h_def_prdt8 i_def_prdt9 j_def_prdt10 k_def_prdt11 l_def_prdt12 m_def_prdt13 n_def_prdt14 o_def_prdt15 p_def_prdt16 q_def_prdt17 r_def_prdt18 s_def_prdt19 t_def_prdt20} into matrizProduct;
use spain.Tabla_segment;
read all var {a_def_sgmt1 b_def_sgmt2 c_def_sgmt3 d_def_sgmt4 e_def_sgmt5 f_def_sgmt6 g_def_sgmt7 h_def_sgmt8 i_def_sgmt9 j_def_sgmt10 k_def_sgmt11 l_def_sgmt12 m_def_sgmt13 n_def_sgmt14 o_def_sgmt15 p_def_sgmt16 q_def_sgmt17 r_def_sgmt18 s_def_sgmt19 t_def_sgmt20} into matrizsegment;
use spain.contratonodato;
read all var {posi} into position;
n=nrow(matrizsegment);
DEF_PRDT=j(n,1,"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");
DEF_SGMT=j(n,1,"zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz");
do i =1 to n;
DEF_PRDT[i,1]=matrizproduct[i,position[i]];
DEF_SGMT[i,1]=matrizsegment[i,position[i]];
end;
create contratosnodato_modi var {"DEF_SGMT" "DEF_PRDT"};
append;
run;
Thank you very much.
Base SAS reads row-by-row and so rarely runs out of memory. PROC IML reads the entire dataset into memory and so can easily run out of memory for larger datasets. For this reason, I only use PROC IML when absolutely necessary (e.g. doing matrix multiplication), and when I do, I will:
Chunk datasets into smaller pieces that will fit into memory, and do these sequentially.
Optimise algorithms to be able to run within constraints - for example, exploit the structure of a matrix I need to invert to avoid inverting the whole matrix.
Fortunately in this case you don't even appear to need proc IML at all - what you're trying to do can be done in a data step. Try this:
data contratosnodato_modi;
format DEF_PRDT $40. DEF_SGMT $40.;
set spain.Tabla_product;
set spain.Tabla_segment;
set spain.contratonodato;
array product {20} a_def_prdt1 b_def_prdt2 c_def_prdt3 d_def_prdt4 e_def_prdt5 f_def_prdt6 g_def_prdt7 h_def_prdt8 i_def_prdt9 j_def_prdt10 k_def_prdt11 l_def_prdt12 m_def_prdt13 n_def_prdt14 o_def_prdt15 p_def_prdt16 q_def_prdt17 r_def_prdt18 s_def_prdt19 t_def_prdt20;
array segment {20} a_def_sgmt1 b_def_sgmt2 c_def_sgmt3 d_def_sgmt4 e_def_sgmt5 f_def_sgmt6 g_def_sgmt7 h_def_sgmt8 i_def_sgmt9 j_def_sgmt10 k_def_sgmt11 l_def_sgmt12 m_def_sgmt13 n_def_sgmt14 o_def_sgmt15 p_def_sgmt16 q_def_sgmt17 r_def_sgmt18 s_def_sgmt19 t_def_sgmt20;
DEF_PRDT = product{posi};
DEF_SGMT = segment{posi};
keep DEF_PRDT DEF_SGMT;
run;
Here I'm reading all data in at once, storing the columns of interest as arrays and only accessing the columns specified in the position dataset.
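The array lookup in that data step amounts to per-row positional indexing; a rough Python equivalent (function and argument names invented for illustration) looks like this:

```python
def pick_by_position(product_rows, segment_rows, positions):
    """For each row, select the column given by posi from both tables.
    SAS arrays are 1-based, hence the pos - 1 adjustment."""
    out = []
    for prod, seg, pos in zip(product_rows, segment_rows, positions):
        out.append((prod[pos - 1], seg[pos - 1]))
    return out
```

Because the rows are consumed one at a time, only the current row of each table needs to be in memory, which is exactly why the data-step version sidesteps the PROC IML memory error.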

Where is the syntax error within this SAS view code?

data work.temp work.error / view = work.temp;
infile rawdata;
input Xa Xb Xc;
if Xa=. then output work.errors;
else output work.temp;
run;
It says there's a syntax error in the DATA statement, but I can't find where ...
The error is a typo in the OUTPUT statement. You are trying to write observations to ERRORS but the data statement only defined ERROR.
It is a strange construct and not something I would recommend, but it looks like it will work. When you exercise the view TEMP it will also generate the dataset ERROR.
67 data x; set temp; run;
NOTE: The infile RAWDATA is:
Filename=...
NOTE: 2 records were read from the infile RAWDATA.
The minimum record length was 5.
The maximum record length was 5.
NOTE: View WORK.TEMP.VIEW used (Total process time):
real time 0.32 seconds
cpu time 0.01 seconds
NOTE: The data set WORK.ERROR has 1 observations and 3 variables.
NOTE: There were 1 observations read from the data set WORK.TEMP.
NOTE: The data set WORK.X has 1 observations and 3 variables.

how to optimize contains in oracle sql query

I need to use oracle Contains function in a query like this:
select *
from iindustrialcasehistory B
where CONTAINS(B.ItemTitle, '%t1%') > 0 OR
CONTAINS(B.ItemTitle, '%t2%') > 0
I've defined a context index for the ItemTitle column, but the execution time is about a minute, whereas I need it to execute in less than a second!
Thanks in advance for any advice on reducing the execution time!
CONTAINS searches for all appearances of a substring in a string; that's why it is better to use instr() instead, because it searches only for the first occurrence of the substring, which should be faster.
Then you can build a function-based index on Instr(B.ItemTitle, 't1') + Instr(B.ItemTitle, 't2').
And use this function value > 0 in the query after that.
You can see more details about using index with instr function here.