How to make script execution slow? - performance

I have a task: I need to select data from "TABLE_FROM", modify it, and insert it into "TABLE_TO". The main problem is that the script must run in production and shouldn't hurt the live site's performance, but "TABLE_FROM" contains hundreds of millions of rows. I'm going to run the script using Node.js. What techniques are used to solve this kind of problem? That is, how can I make the script run "slowly", or in other words "softly", to prevent DB and CPU overload?
Time of script execution is irrelevant. I use Cassandra DB.

Sample code:
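var OFFSET = 0;
var BATCHSIZE = 100;
var TIMEOUT = 1000;
function fetchPush() {
    // fetch the next batch from TABLE_FROM
    var rows = fetch(OFFSET, BATCHSIZE);
    // stop once there is nothing left to copy
    if (rows.length === 0) return;
    // push to TABLE_TO
    push(rows);
    // advance to the next batch
    OFFSET += rows.length;
    // do the next batch after a timeout
    setTimeout(fetchPush, TIMEOUT);
}
fetchPush();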
Here I'm assuming that fetch and push are blocking calls. For asynchronous processing you could (obviously) use async.
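For the asynchronous case, the same idea can be sketched with promises and async/await. This is a minimal sketch under assumptions, not a definitive implementation: fetchBatch, pushBatch, and the option names batchSize and pauseMs are hypothetical placeholders for whatever your database driver provides. It also assumes offset-based paging; as far as I know CQL has no OFFSET clause, so with Cassandra you would more likely carry the driver's paging state from one batch to the next.

```javascript
// Resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Copies rows in small batches and pauses between batches.
// fetchBatch(offset, size) and pushBatch(rows) are hypothetical async
// functions supplied by the caller; they are not part of any real driver.
async function copyThrottled(fetchBatch, pushBatch, { batchSize = 100, pauseMs = 1000 } = {}) {
  let offset = 0;
  for (;;) {
    const rows = await fetchBatch(offset, batchSize);
    // An empty batch means the source is exhausted.
    if (rows.length === 0) break;
    await pushBatch(rows);
    offset += rows.length;
    // Give the database and CPU some breathing room before the next batch.
    await sleep(pauseMs);
  }
  // Total number of rows copied.
  return offset;
}
```

Because each batch is awaited before the pause starts, at most one batch is in flight at any time, so batchSize and pauseMs together cap the load the script puts on the database.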

Related

Caching is working for one hour while it should be for days

I have created an API using .NET Core 2.0. This API is connected to an Oracle database to retrieve the needed data. One of the functions takes too much time, so I decided to use caching in order to retrieve the data faster.
Function description: Get ranking
Caching period: Data should be renewed in cache memory each Monday
I am using IMemoryCache, but the problem is that the data is not being cached for multiple days. It lasts only for one hour; after that, the data is retrieved from the database again, which takes too much time (10 s). Below is my code:
var dateNow = DateTime.Now;
int diff = 7; // if today is Monday then should add 7 days to get next Monday date
if (dateNow.DayOfWeek != DayOfWeek.Monday) {
    var daysToStartWeek = dateNow.DayOfWeek - DayOfWeek.Monday;
    diff = (7 - (daysToStartWeek)) % 7;
}
var nextMonday = dateNow.AddDays(diff).Date;
var totalDays = (nextMonday - dateNow).TotalDays;
if (_cache.TryGetValue("GetRanking", out IEnumerable<GetRankingStruct> objRanking))
{
    return Ok(objRanking);
}
var dp = new DataProvider(Configuration);
var response = dp.GetRanking(userName, asAtDate);
_cache.Set("GetRanking", response, TimeSpan.FromDays(diff));
return Ok(response);
Could this be related to the token lifetime, since it's only 1 hour?
Firstly - have you tried checking to see if your worker process is being restarted? You don't specify how you are hosting your application but, obviously, if the application (worker process) is restarted your memory cache will be empty.
If your worker process is restarting, then you could load the cache on startup.
Secondly - I believe that the implementation may choose to empty the cache due to inactivity or memory constraints. You can set the priority of an entry to never remove - https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.caching.memory.cacheitempriority?view=dotnet-plat-ext-3.1
I believe you set this per entry, through the Priority property of a MemoryCacheEntryOptions object passed to _cache.Set, while cache-wide settings go in the MemoryCacheOptions object passed to the constructor of the memory cache https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.caching.memory.memorycache.-ctor?view=dotnet-plat-ext-3.1#Microsoft_Extensions_Caching_Memory_MemoryCache__ctor_Microsoft_Extensions_Options_IOptions_Microsoft_Extensions_Caching_Memory_MemoryCacheOptions__.
Finally - I assume you've made your _cache object static so it is shared by all instances of your class. (Or made the controller, if that's what it is, a singleton).
These are my suggestions.
Good luck.

Spark: Measure performance of UDF on large dataset

I want to measure the performance of a UDF on a large dataset. The Spark SQL is:
spark.sql("SELECT my_udf(value) as results FROM my_table")
The UDF returns an array. The issue I'm facing is how to make this execute without returning the data to the driver. I need an action, but anything that returns the full dataset (e.g. collect) will crash the driver, and with show/take(n) the calculation doesn't run for all rows. So how can I trigger the calculation without returning all the data to the driver?
I think the closest you can get to only running your UDF for measuring timing would be something like the code below. The general idea is to use caching to try to remove data loading time from your measurement, and then use a foreach that does nothing to make Spark run your UDF.
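import org.apache.spark.sql.functions.udf
// Needed for toDF and the $"..." column syntax (already imported in spark-shell)
import spark.implicits._
val myFunc: String => Int = _.length
val myUdf = udf(myFunc)
val data = Seq("a", "aa", "aaa", "aaaa")
val df = sc.parallelize(data).toDF("text")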
// Cache to remove data loading from measurements as much as possible
// Also, do a foreach no-op action to force the data to load and cache before our test
df.cache()
df.foreach(row => {})
// Run the test, grabbing before and after time
val start = System.nanoTime()
val udfDf = df.withColumn("udf_column", myUdf($"text"))
// Force spark to run your UDF and do nothing with the result so we don't include any writing time in our measurement
udfDf.rdd.foreach(row => {})
// Get the total elapsed time
val elapsedNs = System.nanoTime() - start

Select Count very slow using EF with Oracle

I'm using EF 5 with an Oracle database.
I'm doing a select count on a table with a specific parameter. When I use EF, the query returns the value 31, as expected, but the result takes about 10 seconds to be returned.
using (var serv = new Aperam.SIP.PXP.Negocio.Modelos.SIP_PA())
{
    var teste = (from ens in serv.PA_ENSAIOS_UM
                 where ens.COD_IDENT_UNMET == "FBLDY3840"
                 select ens).Count();
}
If I execute the simple query below, the result is the same (31), but it is returned in 500 milliseconds.
SELECT
    count(*)
FROM
    PA_ENSAIOS_UM
WHERE
    COD_IDENT_UNMET = 'FBLDY3840'
Is there a way to improve the performance when I'm using EF?
Note: There are 13,000,000 rows in this table.
Here are some things you can try:
Capture the query that is being generated and see if it is the same as the one you are using. Details can be found here, but essentially, you instantiate your DbContext (let's call it "_context") and then set the Database.Log property to the logging method. (Note that Database.Log was introduced in EF 6, so as far as I know it is not available on EF 5.) It's fine if this method doesn't actually do anything: you can just set a breakpoint in there and see what's going on.
So, as an example: define a logging function (I have a static class called "Logging" which uses NLog to write to files):
public static void LogQuery(string queryData)
{
    if (string.IsNullOrWhiteSpace(queryData))
        return;
    var message = string.Format("{0}{1}",
        queryData.Trim().Contains(Environment.NewLine) ?
        Environment.NewLine : "", queryData);
    _sqlLogger.Info(message);
    _genLogger.Trace($"EntityFW query (len {message.Length} chars)");
}
Then, when you create your context, point its log to LogQuery:
_context.Database.Log = Logging.LogQuery;
When you do your tests, remember that the first run is often the slowest because the server has to actually do the work, while subsequent runs often use cached data. Try running your tests 2-3 times back to back and see whether the two timings converge.
I don't know if it generates the same query or not, but try this other form (which should be functionally equivalent, but may give a better time):
var teste = serv.PA_ENSAIOS_UM.Count(ens=>ens.COD_IDENT_UNMET == "FBLDY3840");
I'm wondering if the version you have pulls data from the DB and THEN counts it. If so, this other syntax may leave all the work to be done at the server, where it belongs. Not sure, though, esp. since I haven't ever used EF with Oracle and I don't know if it behaves the same as SQL or not.

Querying RavenDb with max 30 requests error

I just want to get some ideas from anyone who has encountered similar problems, and to hear how you came up with a solution.
Basically, we have around 10K documents stored in RavenDB, and we need to allow users to filter and search those documents. I am aware that RavenDB has a maximum page size of 1024, so in order for the filter and search to work, I need to do my own paging. But my solution gives me the following error:
The maximum number of requests (30) allowed for this session has been reached.
I have tried many different ways of disposing of the session, both by wrapping it in a using block and by explicitly calling Dispose after every call to RavenDB, with no success.
Does anyone know how to get around this issue? What's the best practice for this kind of scenario?
var pageSize = 1024;
var skipSize = 0;
var maxSize = 0;
using (_documentSession)
{
    maxSize = _documentSession.Query<LogEvent>().Count();
}
while (skipSize < maxSize)
{
    using (_documentSession)
    {
        var events = _documentSession.Query<LogEvent>().Skip(skipSize).Take(pageSize).ToList();
        _documentSession.Dispose();
        //building finalPredicate codes..... which i am not providing here....
        results.AddRange(events.Where(finalPredicate.Compile()).ToList());
        skipSize += pageSize;
    }
}
Raven limits the number of requests (Load, Query, ...) to 30 per session. This behavior is documented.
I can see that you dispose of the session in your code, but I don't see where you recreate it. In any case, loading data the way you intend to is not a good idea.
We use indexes and paging and never load more than 1024 documents at a time.
If you're expecting thousands of documents, or your predicate logic doesn't work as an index, and you don't care how long your query will take, use the unbounded results API.
var results = new List<LogEvent>();
var query = session.Query<LogEvent>();
using (var enumerator = session.Advanced.Stream(query))
{
    while (enumerator.MoveNext())
    {
        if (predicate(enumerator.Current.Document)) {
            results.Add(enumerator.Current.Document);
        }
    }
}
Depending on the number of documents, this can use a lot of RAM.

How much data can be saved in one Parse.Object.saveAll request? And how many requests are used for one Parse.Object.saveAll call?

Recently, I have been running some tests on parse.com, and I am now facing a problem using Parse.Object.saveAll in a background job.
According to the parse.com documentation, a background job can run for 15 minutes. I have set up a background job to pour data into the database using the following code:
Parse.Cloud.job("createData", function(request, status) {
    var Dummy = Parse.Object.extend("dummy");
    var batchSaveArr = [];
    for (var i = 0; i < 50000; i++) {
        var obj = new Dummy();
        // genMessage() is a function to generate a random string with 5 characters long
        obj.set("message", genMessage());
        obj.set("numValue", Math.floor(Math.random() * 1000));
        batchSaveArr.push(obj);
    }
    Parse.Object.saveAll(batchSaveArr, {
        success: function(list) {
            status.success("success");
        },
        error: function(error) {
            status.error(error.message);
        }
    });
});
Although the job pours data into the database, its main purpose is to test the function Parse.Object.saveAll. When I run this job, the error "This application has exceeded its request limit." appears in the log. However, the analysis page shows that the request count is less than or equal to 1. I only run this job in Parse, and no other requests are made while the background job is running.
It seems that there is some problem with Parse.Object.saveAll, or maybe I have misunderstood this function.
Is anyone facing the same problem?
How much data can be saved in one Parse.Object.saveAll request?
How many requests are used for one Parse.Object.saveAll call?
I have asked the question on Facebook, and the reply was quite disappointing.
Please follow the link:
https://developers.facebook.com/bugs/439821726158766/
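Since the question is how much a single saveAll call can carry, one way to probe it is to split the array into smaller chunks and save them one after another, varying the chunk size until the error goes away. The sketch below is only an illustration under assumptions: chunk and saveInChunks are hypothetical helpers, saveChunk stands for any function that returns a promise (for example a wrapper around Parse.Object.saveAll, if the SDK version in use returns one), and no particular chunk size is known to be safe.

```javascript
// Splits `items` into consecutive arrays of at most `size` elements.
function chunk(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Saves the objects one chunk at a time, waiting for each chunk to finish
// before starting the next. saveChunk is a hypothetical caller-supplied
// function that takes an array and returns a promise.
async function saveInChunks(objects, chunkSize, saveChunk) {
  let saved = 0;
  for (const part of chunk(objects, chunkSize)) {
    await saveChunk(part);
    saved += part.length;
  }
  // Number of objects handed to saveChunk in total.
  return saved;
}
```

Logging how many chunks succeed before the request-limit error appears would at least show whether the limit depends on the number of objects or on the number of calls.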
