We have a prototype of Esper running, but the performance is considerably lacking. I suspect this is my fault somehow rather than an inherent issue with Esper, so I am looking for help locating where my performance issue is.
I am running a single instance of the Esper service with the heap configured as -Xmx6G -Xms1G (I have tried various combinations of these values), and it has 4 CPU cores available. No other services are running at the time of these tests, only Esper, Kafka, and ZooKeeper.
I am using Akka Streams to stream events into Esper. The service is very simple: it streams in from Kafka and inserts the events into the Esper runtime, which has 3 EPStatements, tested and working. There is one listener, attached to all 3 statements, which outputs the matched events back to Kafka.
Some things I've tried to isolate where the performance issue is:
1. Remove some EPStatements
2. Remove all EPStatements
3. Remove the Listener
4. Remove EPStatements and Listener
5. Remove the Esper .sendEvent(...) call (this improves performance significantly, so it seems to be an Esper issue rather than an Akka issue)
Only the last item above, removing the sendEvent call, yielded any significant, observable performance benefit.
Below is an example query we run through Esper. It is tested and works, I have read the performance-tuning section of the documentation, and it seems fine to me. All my queries follow a similar format:
select * from EsperEvent#time(5 minutes)
match_recognize (
partition by asset_id
measures A as event1, B as event2, C as event3
pattern (A Z* B Z* C)
interval 10 seconds or terminated
define
A as A.eventtype = 13 AND A.win_EventID = "4624" AND A.win_LogonType = "3",
B as B.eventtype = 13 AND B.win_EventID = "4672",
C as C.eventtype = 13 AND (C.win_EventID = "4697" OR C.win_EventID = "7045")
)
Some code:
Here is my akka stream:
kafkaConsumer
.via(parsing) // Parse the json event to a POJO for esper. Have tried without this step also, no performance impact
.via(esperFlow) // mapAsync call to sendEvent(...)
//Here I am using kafka to measure the flow throughput rate. This is where I establish my throughput rate, based on the rate messages are written to "esper_flow_through" topic.
.map(rec => new ProducerRecord[Array[Byte], String]("esper_flow_through", Serialization.write(rec)))
.runWith(sink)
esperFlow (Parallelism = 4 by default):
val esperFlow = Flow[EsperEvent]
.mapAsync(Parallelism)(event => Future {
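// NOTE: Future {} requires an implicit ExecutionContext in scope (not shown here);
// up to Parallelism (4) concurrent futures all call into the one shared Esper runtime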
engine.getEPRuntime.sendEvent(event)
event
})
Listener:
override def update(newEvents: Array[EventBean], oldEvents: Array[EventBean], statement: EPStatement, epServiceProvider: EPServiceProvider): Unit = Future {
logger.info(s"Received Listener updates: Query Name: ${statement.getName} ---- ${newEvents.map(_.getUnderlying)}, $oldEvents")
statement.getName match {
case "SERVICE_INSTALL" => serviceInstall.increment(newEvents.length)
case "ADMIN_GROUP" => adminGroup.increment(newEvents.length)
case "SMB_SHARE" => smbShare.increment(newEvents.length)
}
newEvents.map(_.getUnderlying.toString).toList
.foreach(queryMatch => {
val record: ProducerRecord[Array[Byte], String] = new ProducerRecord[Array[Byte], String]("esper_output", queryMatch)
producer.send(record)
})
}
Performance observations:
Input stream has a rate of ~2.4k events per second.
Esper is unable to keep up from the beginning, maxing out at ~600 events per second.
Esper throughput then gradually decreases.
Eventually Esper throughput stabilises below 100 events per second.
Profiling shows nothing out of sorts.
The rate seems very low, so I am assuming I am missing something with regard to the Esper configuration?
Our target throughput is ~10k events per second. We are a long way from this, and a similar POC in Spark gets closer to that target.
Update:
Following user650839's comments below, I was able to improve my throughput to a steady 1k events per second. Both of these queries produce the same throughput:
select * from EsperEvent(eventtype = 13 and win_EventID in ("4624", "4672", "4697", "7045"))#time(5 minutes)
match_recognize (
partition by asset_id
measures A as event1, B as event2, C as event3
pattern (A B C)
interval 10 seconds or terminated
define
A as A.eventtype = 13 AND A.win_EventID = "4624" AND A.win_LogonType = "3",
B as B.eventtype = 13 AND B.win_EventID = "4672",
C as C.eventtype = 13 AND (C.win_EventID = "4697" OR C.win_EventID = "7045"))
create context NetworkLogonThenInstallationOfANewService
start EsperEvent(eventtype = 13 AND win_EventID = "4624" AND win_LogonType = "3")
end pattern [
b=EsperEvent(eventtype = 13 AND win_EventID = "4672") ->
c=EsperEvent(eventtype = 13 AND (win_EventID = "4697" OR win_EventID = "7045"))
where timer:within(5 minutes)
]
context NetworkLogonThenInstallationOfANewService select * from EsperEvent output when terminated
However 1k per second is still too slow for our needs.
The match-recognize is ill-defined. An A-event, B-event, or C-event can also be a Z-event, since anything matches Z (Z is undefined). Therefore a huge number of combinations is possible: for 4 incoming events there are already something like 1*2*3*4 = 24 combinations that the match-recognize keeps track of. Match-recognize tracks all possible combinations, and when something matches it sorts and ranks the combinations and outputs all/any/some of them. Match-recognize may be a poor choice here; alternatively, define Z as something that does not also match A/B/C.
Instead of match-recognize I would go with a context that initiates with an A-event and terminates with a C-event, with "output when terminated".
Also, the way you designed the query, the time window retains all events. You could do better:
select * from EsperEvent(eventtype = 13 and win_EventID in ("4624", "4672", "4697", "7045"))#time(5 minutes)
match_recognize (
.........
define
A as A.win_EventID = "4624" AND A.win_LogonType = "3",
B as B.win_EventID = "4672",
C as C.win_EventID = "4697" OR C.win_EventID = "7045"
)
Notice that the EsperEvent(eventtype=13 ....) filter discards events before they enter the time window. There is a performance tip in the docs about using filter criteria to remove unwanted events early.
EDIT: A mistake here is measuring I/O throughput and Esper throughput as one. Remove the I/O: test Esper on its own, through the Esper APIs, with data your code produces. Once confident, add the I/O back.
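For example, a minimal harness along these lines (a sketch only; it assumes the same Esper client API your code already uses, and makeTestEvent is a hypothetical factory that builds the same EsperEvent POJOs your parser produces):
import com.espertech.esper.client.{Configuration, EPServiceProviderManager}

object EsperThroughputBench extends App {
  // Build a runtime with just the statement under test: no Kafka, no Akka.
  val config = new Configuration()
  config.addEventType("EsperEvent", classOf[EsperEvent]) // the POJO your parser emits
  val engine = EPServiceProviderManager.getDefaultProvider(config)
  engine.getEPAdministrator.createEPL(
    """select * from EsperEvent(eventtype = 13 and win_EventID in ("4624", "4672", "4697", "7045"))#time(5 minutes)""")

  // makeTestEvent(i) is hypothetical: build the events in memory ahead of time,
  // so the measurement covers Esper only.
  val events = (1 to 1000000).map(makeTestEvent)
  val runtime = engine.getEPRuntime
  val start = System.nanoTime()
  events.foreach(e => runtime.sendEvent(e))
  val secs = (System.nanoTime() - start) / 1e9
  println(f"${events.size / secs}%.0f events/sec")
}
If this harness hits your 10k/sec target, the bottleneck is in the I/O path rather than in Esper.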
Related
I'm using GPT-J locally on an Nvidia RTX 3090 GPU. Currently, I'm using the model in the following way:
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", pad_token='<|endoftext|>', eos_token='<|endoftext|>', truncation_side='left')
model = GPTJForCausalLM.from_pretrained(
"EleutherAI/gpt-j-6B",
revision="float16",
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
use_cache=True,
gradient_checkpointing=True,
)
model.to('cuda')
prompt = tokenizer(text, return_tensors='pt', truncation=True, max_length=2048)
prompt = {key: value.to('cuda') for key, value in prompt.items()}
out = model.generate(**prompt,
n=1,
min_length=16,
max_new_tokens=75,
do_sample=True,
top_k=35,
top_p=0.9,
batch_size=512,
temperature=0.75,
no_repeat_ngram_size=4,
clean_up_tokenization_spaces=True,
use_cache=True,
pad_token_id=tokenizer.eos_token_id
)
res = tokenizer.decode(out[0])
As input to the model I'm using 2048 tokens, and I produce 75 tokens as output. The latency is around 4-5 seconds. In a blog post I've read that latency can be improved by using pipelines and that tokenization can be a bottleneck.
Can the tokenization be improved in my code, and would using a pipeline reduce the latency? Are there any other things I can do to reduce the latency?
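For context, this is roughly what I understand the pipeline route to look like (a sketch only, reusing the model and tokenizer objects built above; the generation kwargs are carried over from my current call):
from transformers import pipeline

# Wrap the already-loaded model and tokenizer; nothing is re-downloaded.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,  # first CUDA device
)

out = generator(
    text,
    max_new_tokens=75,
    do_sample=True,
    top_k=35,
    top_p=0.9,
    temperature=0.75,
    no_repeat_ngram_size=4,
)
print(out[0]["generated_text"])
As far as I know, AutoTokenizer already returns the fast (Rust-backed) tokenizer by default, so I am unsure how much of the 4-5 seconds tokenization can account for.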
In my experience, MATLAB performs ROS publish/subscribe operations slowly for some reason. I work with components defined as object classes, as shown below, where I made a test class. Normally, objects of comparable structure are used to control mobile robots.
To quantify performance, I timed a few operations and got the following results:
1x publishing a message + 1x simple subscriber callback: 3.7 ms
Simply counting in a callback (per count): 2.1318e-03 ms
Creating a new message with msg1 = rosmessage(obj.publisher) adds 3.6-4.3 ms per iteration
Pinging myself indicated a communication latency of 0.05 ms
The time required for a simple publish plus the start of a subscriber callback seems oddly long.
I want multiple system components as objects in my workspace, such that they respond to ROS topic updates or timer events. The PC used for testing is not a monster, but it should not be garbage either.
Do you also think the times shown are unnecessarily large? They barely allow publishing a single topic at 200 Hz without doing anything else. Normally I have multiple lower-frequency topics (e.g. 20 Hz), but the total time consumed becomes significant.
Do you know any practices to make the system operate more quickly?
What do you think of this OOP style of building control-system components in general?
classdef subpubspeedMonitor < handle
% Use: call in matlab console, after initializing ros:
%
% SPM1 = subpubspeedMonitor()
%
% This will create an object which starts a set repetitive task upon creation
% and finally destructs itself after posting results in console.
properties
node
subscriber
publisher
timestart
messagetotal
end
methods
function obj = subpubspeedMonitor()
obj.node = ros.Node('subspeedmonitor1');
obj.subscriber = ros.Subscriber(obj.node,'topic1','sensor_msgs/NavSatFix',{@obj.rosSubCallback});
obj.publisher = ros.Publisher(obj.node,'topic1','sensor_msgs/NavSatFix');
obj.timestart = tic;
obj.messagetotal = 0;
msg1 = rosmessage(obj.publisher);
% Choose to evaluate subscriber + publisher loop or just counting
if 1
send(obj.publisher,msg1);
else
countAndDisplay(obj)
end
end
%% Test method one: repetitive publishing and subscribing
function rosSubCallback(obj,~,msg_) % ~3.7 ms per loop for a simple publish+subscribe action
% Latency to self is 0.05ms on average, according to "pinging" in terminal
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); % this line adds 4.3000ms per loop
msg_.Longitude = 51; % this line adds 0.25000 ms per loop
send(obj.publisher,msg_)
else
% Display some results
timepassed = toc(obj.timestart);
time_per_pubsub = timepassed/obj.messagetotal
delete(obj);
end
end
%% Test method two: simply counting
function countAndDisplay(obj) % this costs 2.1318e-03 ms(!) per loop
obj.messagetotal = obj.messagetotal+1;
if obj.messagetotal <10000
%msg1 = rosmessage(obj.publisher); %adds 3.6ms per loop
%i = 1% adds 5.7532e-03 ms per loop
%msg1 = rosmessage("std_msgs/Bool"); %adds 1.5ms per loop
countAndDisplay(obj);
else
% Display some results
timepassed = toc(obj.timestart);
time_per_count_FCN = timepassed/obj.messagetotal
delete(obj);
end
end
%% Destructor
function delete(obj)
delete(obj.subscriber)
delete(obj.publisher)
delete(obj.node)
end
end
end
I am trying to speed up the calculation of multiple operations that I add as columns to a PySpark data frame, and I came across the sparkbyexamples article on performance tuning. I am considering how to use the cache() and spark.sql.shuffle.partitions solutions.
Would cache be appropriate for code that first joins multiple data frames and then adds calculations over different windows?
What happens when the cached data frame is reassigned (see below)?
Example:
df = dfA.join(dfB, on = ['key'], how ='left') # should I add .cache here?
w_u = Window.partitionBy('user')
w_m = Window.partitionBy(['user','month']).orderBy('month')\
.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
MLAB = ['val1','val2'] # example to indicate that I run similar operations multiple times
for mlab in MLAB:
percent_50 = F.expr('percentile_approx('+mlab+',0.5)')
df = df.withColumn(mlab+'_md', percent_50.over(w_u)) # what happens to the cache when I reassign df?
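As far as I understand it, the caching I am considering would look like the sketch below (df_joined is my own name here; cache() is lazy, so nothing is stored until the first action runs):
df_joined = dfA.join(dfB, on=['key'], how='left').cache() # lazy: not materialized yet
df_joined.count() # first action materializes the cache

df = df_joined
for mlab in MLAB:
    percent_50 = F.expr('percentile_approx(' + mlab + ',0.5)')
    # each withColumn builds a new plan on top of the cached parent;
    # reassigning the df variable reuses the cache, it does not invalidate it
    df = df.withColumn(mlab + '_md', percent_50.over(w_u))
# later, after the downstream actions have run: df_joined.unpersist()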
Afterwards I am adding additional operations that include aggregations, such as:
w = Window.partitionBy('userId') # per-user window used in the aggregates below
radius_df = (df
# number of visits per stop
.groupby('userId', 'locationId').agg(F.count(F.lit(1)).alias('n_i'),
F.first('locationLongitude').alias('locationLongitude'),
F.first('locationLatitude').alias('locationLatitude'))
#compute center of mass (lat/lon) per user
.withColumn('center_lon', F.avg(F.col('locationLongitude')).over(w))
.withColumn('center_lat', F.avg(F.col('locationLatitude')).over(w))
# compute total visits
.withColumn('N', F.sum(F.col('n_i')).over(w))
# compute (r_i - r_cm)
.withColumn('distance', distance(F.col('locationLatitude'), F.col('locationLongitude'), F.col('center_lat'), F.col('center_lon')))
# compute n_i(r_i - r_cm)^2 / N
.withColumn('distance2', F.col('n_i') * (F.col('distance') * F.col('distance')) / F.col('N'))
# compute sum(n_i(r_i - r_cm)^2)
.groupBy('userId').agg(F.sum(F.col('distance2')).alias('sum_dist2'))
# square root
.withColumn('radius_gyr', F.sqrt(F.col('sum_dist2')))
.select('userId','radius_gyr')
)
df_f = df.join(radius_df.dropDuplicates(), on='userId', how='left')
I am open to any suggestions on how to speed up the code. Many thanks.
I have a big dataset divided into files.
I would like to read and process my data one file at a time, and for this I have this Keras generator:
def myGenerator():
while 1:
rnd = random.randint(1,200)
strRnd = str(rnd)
lenRnd = len(strRnd)
rndPadded = strRnd.rjust(5, '0')
nSearchesInBatch = 100
f = "path/part-" + rndPadded + "*" #read one block of data
data = sqlContext.read.load(f).toPandas()
imax = int(data.shape[0]/nSearchesInBatch) #number of batches that will be created sequentially from the generator
for i in range(imax):
data_batch = data[i*nSearchesInBatch:(i+1)*nSearchesInBatch]
features = data_batch['features']
output = data_batch['output']
yield features, output
The problem is that the reading takes the biggest share of the time (each file is around 200 MB), and in the meanwhile the GPU sits waiting. Is it possible to pre-read the next batch while the GPU is training on the previous one?
At the moment one file is read and split into steps (the inner loop); the CPUs sit idle while the GPU trains, and as soon as the epoch finishes the GPU goes idle and the CPUs start reading (which takes 20-30 seconds).
Any solution to parallelize this?
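One approach I have been considering (a sketch only, assuming a recent TensorFlow with the tf.data API; the shapes and dtypes are placeholders that would need adjusting to the real data) is to wrap the generator in a tf.data.Dataset and let prefetch() overlap the file reads with training:
import tensorflow as tf

# Wrap the existing generator; prefetch() keeps batches ready on the CPU
# while the GPU trains on the current one, overlapping I/O with compute.
dataset = tf.data.Dataset.from_generator(
    myGenerator,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.float32), # features (placeholder shape/dtype)
        tf.TensorSpec(shape=(None,), dtype=tf.float32), # output (placeholder shape/dtype)
    ),
)
dataset = dataset.prefetch(tf.data.AUTOTUNE) # read ahead while the GPU is busy

model.fit(dataset, epochs=10, steps_per_epoch=100) # placeholder numbers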
I will have to admit the title of this question sucks... I couldn't get the best description out. Let me see if I can give an example.
I have about 2700 customers on whose server my software was installed at one time; 1500 or so still have it. Basically what I have going on is an auto-diagnostics process to help weed out people who have uninstalled, or who have problems with the software that we can assist with. Currently we have a cURL fetch of their website, looking for a header our software returns.
We have 8 different statuses that are returned:
GREEN - Everything works (usually pretty quick 0.5 - 2 seconds)
RED - Software not found (usually the longest from 5 - 15 seconds)
BLUE - Software found but not activated (usually from 3 - 9 seconds)
YELLOW - Server IP mismatch (usually from 1 - 3 seconds)
ORANGE - Server IP mismatch and wrong software type (usually 5 - 10 seconds)
PURPLE - Activation key incorrect (usually within 2 seconds)
BLACK - Domain returns 404 - No longer exists (usually within a second)
UNK - Connection failed (usually due to our load balancer -- VERY rare; never encountered yet)
Now basically what happens is a cron job starts the process by pulling the domain and product type. It then cURLs the domain and starts cycling through the status colors above.
While this is happening, we have an AJAX page that returns the results so we can keep an eye on the status. The major problem is that the Time Remaining estimate is so volatile that it is not a good estimate. Here is the current math:
# Number of accounts completed since we started (total accounts x percent complete)
$completedAccounts = floor($parseData[2]*($parseData[1]/100));
# Number of seconds between NOW and when started
$completedTime = strtotime("now") - strtotime("$hour:$minute:$second");
# Avg number of seconds per account
$avgPerCompleted = $completedTime / $completedAccounts;
# Total number of remaining accounts to be scanned
$remainingAccounts = $parseData[2] - $completedAccounts;
# The total of seconds remaining for all of the remaining accounts
$remainingSeconds = $remainingAccounts * $avgPerCompleted;
$remainingTime = format_time($remainingSeconds, ":");
I could keep a count of all of the green, red, blue, etc. results and average how long each color takes, then use that for the average time, although I don't believe that would give much better results.
With times that vary this much, any suggestions would be appreciated.
Thanks,
Jeff
OK, I believe I have figured it out. I had to create a class so I could calculate a simple linear regression over a period of time.
function calc() {
    $n = count($this->mDatas);
    $vSumXX = $vSumXY = $vSumX = $vSumY = 0;
    //var_dump($this->mDatas);
    $vCnt = 0; // for time-series, start at t=0
    foreach ($this->mDatas AS $vOne) {
        if (is_array($vOne)) { // x,y pair
            list($x,$y) = $vOne;
        } else { // time-series
            $x = $vCnt; $y = $vOne;
        } // fi
        $vSumXY += $x*$y;
        $vSumXX += $x*$x;
        $vSumX += $x;
        $vSumY += $y;
        $vCnt++;
    } // rof
    $vTop = ($n*$vSumXY - $vSumX*$vSumY);
    $vBottom = ($n*$vSumXX - $vSumX*$vSumX);
    $a = $vBottom != 0 ? $vTop/$vBottom : 0;
    $b = ($vSumY - $a*$vSumX)/$n;
    //var_dump($a,$b);
    return array($a,$b);
}
I take each account and build an array of the time each one takes. The array then runs through this calculation, which builds the x and y time series. Finally, I run the result through the predict function.
/** given x, return the prediction y */
function calcpredict($x) {
list($a,$b) = $this->calc();
$y = $a*$x+$b;
return $y;
}
I put static values in so you could see the results:
$eachTime = array(7,1,.5,12,11,6,3,.24,.12,.28,2,1,14,8,4,1,.15,1,12,3,8,4,5,8,.3,.2,.4,.6,4,5);
$forecastProcess = new Linear($eachTime);
$forecastTime = $forecastProcess->calcpredict(5);
This overall system gives me about a 0.003 difference over 10 accounts and about a 2.6 difference over 2700 accounts. Next up will be calculating the accuracy.
Thanks for trying guys and gals