How to Spam Filter Gmail Messages by Recipient Address? - filter

I use the dot feature (m.yemail#gmail.com instead of myemail#gmail.com) to give emails for questionable sites so that I can easily spot spam from my address being sold.
I made this function and set it to trigger every 30 minutes to automatically filter these.
function moveSpamByAddress(){
var addresses = ["m.yemail#gmail.com"]
var threads = GmailApp.getInboxThreads();
for (var i = 0; i < threads.length; i++){
var messages = threads[i].getMessages();
for (var ii = 0; ii<messages.length; ii++){
for (var iii = 0; iii<addresses.length; iii++){
if (messages[ii].getTo().indexOf(addresses[iii]) > -1){
threads[i].moveToSpam()
}
}
}
}
}
This works, but I noticed that this runs slower than I would expect it to (but my expectation may be unreasonable) given that my inbox only contains 50 messages and I am only currently filtering one address. Is there a way to increase execution speed?
Also are there any penalties for running scripts too often? I see that I have the option to trigger a script every minute, and that would increase the likelihood of filtering a message before I see it, but it would also run the scripts uselessly significantly more times.

You can do this using native gmail filters plus apps script.
Script time quotas varies from 1 to 6 hours depending on account type.
To improve performance, first check getInboxUnreadCount and return inmediately if zero.
If you use a 1minute trigger, make sure to use a lock to avoid one timer starting while the other runs. If the lock is in use simply return.
First, make a gmail filter so when "to" matches your special address, apply a special label like "mySpam"
Second, make an apps script with my suggestions above, plus your code no longer needs to search so much, now you just need to find emails with that label (a single api call) and .moveToSpam
There shouldnt be that many at any time in the label if the script runs often.

Related

Form Recognizer Heavy Workload

My use case is the following :
Once every day I upload 1000 single page pdf to Azure Storage and process them with Form Recognizer via python azure-form-recognizer latest client.
So far I’m using the Async version of the client and I send the 1000 coroutines concurrently.
tasks = {asyncio.create_task(analyse_async(doc)): doc for doc in documents}
pending = set(tasks)
# Handle retry
while pending:
# backoff in case of 429
time.sleep(1)
# concurrent call return_when all completed
finished, pending = await asyncio.wait(
pending, return_when=asyncio.ALL_COMPLETED
)
# check if task has exception and register for new run.
for task in finished:
arg = tasks[task]
if task.exception():
new_task = asyncio.create_task(analyze_async(doc))
tasks[new_task] = doc
pending.add(new_task)
Now I’m not really comfortable with this setup. The main reason being the unpredictable successive states of the service in the same iteration. Can be up then throw 429 then up again. So not enough deterministic for me. I was wondering if another approach was possible. Do you think I should rather increase progressively the transactions. Start with 15 (default TPS) then 50 … 100 until the queue is empty ? Or another option ?
Thx
We need to enable the CORS and make some changes to that CORS to make it available to access the heavy workload.
Follow the procedure to implement the heavy workload in form recognizer.
Make it for page blobs here for higher and best performance.
Redundancy is also required. Make it ZRS for better implementation.
Create a storage account to upload the files.
Go to CORS and add the URL required.
Set the Allowed origins to https://formrecognizer.appliedai.azure.com
Go to containers and upload the documents.
Upload the documents. Use the container and blob information to give as the input for the recognizer. If the case is from Form Recognizer studio, the size of the total documents is considered and also the number of characters limit is there. So suggested to use the python code using the container created as the input folder.

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an apache beam pipeline to GCP dataflow in a DEV environment and everything worked well. Then I deployed it to production in Europe environment (to be specific - job region:europe-west1, worker location:europe-west1-d) where we get high data velocity and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in the EventToSession/GroupPairsByKey. In this step there are thousands of events under the droppedDueToLateness counter and the dataFreshness keeps increasing (increasing since when I deployed it). All steps before this one operates good and all steps after are affected by it, but doesn't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K keys to 200K keys per second (depends on time of day), which seems quite a lot to me. The cpu utilization doesn't go over the 70% and I am using streaming engine. Number of workers most of the time is 2. Max worker memory capacity is 32GB while the max worker memory usage currently stands on 23GB. I am using e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicious is the huge amount of keys being processed in the EventToSession/GroupPairsByKey step. But on the other, session is usually related to a single customer so google should expect handle this amount of keys to handle per second, no?
Would like to get suggestions how to solve the dataFreshness and events droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event))
.withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
.apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
.apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
.discardingFiredPanes()
.withAllowedLateness(Duration.standardDays(30)))
.apply("GroupPairsByKey", GroupByKey.create())
.apply("CreateCollectionOfValuesOnly", Values.create())
.apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
regarding constantly increasing data freshness: as long as allowing late data to arrive a session window, that specific window will persist in memory. This means that allowing 30 days late data will keep every session for at least 30 days in memory, which obviously can over load the system. Moreover, I found we had some ever-lasting sessions by bots visiting and taking actions in websites we are monitoring. These bots can hold sessions forever which also can over load the system. The solution was decreasing allowed lateness to 2 days and use bounded sessions (look for "bounded sessions").
regarding events dropped due to lateness: these are events that on time of arrival they belong to an expired window, such window that the watermark has passed it's end (See documentation for the droppedDueToLateness here). These events are being dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data so the solution was to check each event's timestamp before it is going to the sessions part and stream to the session part only events that won't be dropped - events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest will be written to BigQuery without the session data (Apparently apache beam drops an event if the event's timestamp is before event_arrival_time - (gap_duration + allowed_lateness) even if there is a live session this event belongs to...)
p.s - in the bounded sessions part where he demonstrates how to implement a time bounded session I believe he has a bug allowing a session to grow beyond the provided max size. Once a session exceeded the max size, one can send late data that intersects this session and is prior to the session, to make the start time of the session earlier and by that expanding the session. Furthermore, once a session exceeded max size it can't be added events that belong to it but don't extend it.
In order to fix that I switched the order of the current window span and if-statement and edited the if-statement (the one checking for session max size) in the mergeWindows function in the window spanning part, so a session can't pass the max size and can only be added data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
List<IntervalWindow> sortedWindows = new ArrayList<>();
for (IntervalWindow window : c.windows()) {
sortedWindows.add(window);
}
Collections.sort(sortedWindows);
List<MergeCandidate> merges = new ArrayList<>();
MergeCandidate current = new MergeCandidate();
for (IntervalWindow window : sortedWindows) {
MergeCandidate next = new MergeCandidate(window);
if (current.intersects(window)) {
if ((current.union == null || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis())) {
current.add(window);
continue;
}
}
merges.add(current);
current = next;
}
merges.add(current);
for (MergeCandidate merge : merges) {
merge.apply(c);
}
}

How to measure execution time of Vulkan pipeline

Summary
I wish to be able to measure time elapsed in milliseconds, on the GPU, of running the entire graphics pipeline. The goal: To be able to save benchmarks before/after optimizing the code (next step would be mipmapping textures) to see improvements. This was really simple in OpenGL, but I'm new to Vulkan, and could use some help.
I have browsed related existing answers (here and here), but they aren't really of much help. And I cannot find code samples anywhere, so I dare ask here.
Through documentation pages I have found a couple of functions that I think I should be using, so I have in place something like this:
1: Creating query pool
void CreateQueryPool()
{
VkQueryPoolCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
createInfo.pNext = nullptr; // Optional
createInfo.flags = 0; // Reserved for future use, must be 0!
createInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
createInfo.queryCount = mCommandBuffers.size() * 2; // REVIEW
VkResult result = vkCreateQueryPool(mDevice, &createInfo, nullptr, &mTimeQueryPool);
if (result != VK_SUCCESS)
{
throw std::runtime_error("Failed to create time query pool!");
}
}
I had the idea of queryCount = mCommandBuffers.size() * 2 to have space for a separate query timestamp before and after rendering, but I have no clue whether this assumption is correct or not.
2: Recording command buffers
// recording command buffer i:
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i);
// render pass ...
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i);
vkCmdCopyQueryPoolResults(/* many parameters here */);
I'm looking for a couple of clarifications:
What is the concequence of writing to the same query index? Do I need two separate query pools - one for before render time and one for after render time?
How should I handle synchronization? I assume having a separate query for each command buffer.
For the destination buffer containing the query result, is it good enough to store somewhere with "host visible bit", or do I need staging memory for "device visible only"? I'm a bit lost on this one as well.
I have not been able to find any online examples of how to measure render time, but I just assume it's such a common task that surely there must be an example out there somewhere.
So, thanks to #karlschultz, I managed to get something working. So in case other people will be looking for the same answer, I decided to post my findings here. For the Vulkan experts out there: Please let me know if I make obvious mistakes, and I will correct them here!
Query Pool Creation
I fill out a VkQueryPoolCreateInfo struct as described in my question, and let its queryCount field equal twice the number of command buffers, to store space for a query before and after rendering.
Important here is to reset all entries in the query pool before using the queries, and to reset a query after writing to it. This necessitates a few changes:
1) Asking graphics queue if timestamps are supported
When picking the graphics queue family, the struct VkQueueFamilyProperties has a field timestampValidBits which must be greater than 0, otherwise the queue family cannot be used for timestamp queries!
2) Determining the timestamp period
The physical device contains a special value which indicates the number of nanoseconds it takes for a timestamp query to be incremented by 1. This is necessary to interpret the query result as e.g. nanoseconds or milliseconds. That value is a float, and can be retrieved by calling vkGetPhysicalDeviceProperties and looking at the field VkPhysicalDeviceProperties.limits.timestampPeriod.
3) Asking for query reset support
During logical device creation, one must fill out a struct and add it to the pNext chain to enable the host query reset feature:
VkDeviceCreateInfo createInfo{};
VkPhysicalDeviceHostQueryResetFeatures resetFeatures;
resetFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_HOST_QUERY_RESET_FEATURES;
resetFeatures.pNext = nullptr;
resetFeatures.hostQueryReset = VK_TRUE;
createInfo.pNext = &resetFeatures;
4) Recording command buffers
Timestamp queries should be outside the scope of the render pass, as seen below. It is not possible to measure running time of a single shader (e.g. fragment shader), only the entire pipeline or whatever is outside the scope of the render pass, due to (potential) temporal overlap of pipeline stages.
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i * 2);
vkCmdBeginRenderPass(/* ... */);
// render here...
vkCmdEndRenderPass(mCommandBuffers[i]);
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i * 2 + 1);
5) Retrieving query result
We have two methods for this: vkCmdCopyQueryPoolResults and vkGetQueryPoolResults. I chose to go with the latter since is greatly simplifies the setup and does not require synchronization with GPU buffers.
Given that I have a swapchain index (in my scenario same is command buffer index!), I have a setup like this:
void FetchRenderTimeResults(uint32_t swapchainIndex)
{
uint64_t buffer[2];
VkResult result = vkGetQueryPoolResults(mDevice, mTimeQueryPool, swapchainIndex * 2, 2, sizeof(uint64_t) * 2, buffer, sizeof(uint64_t),
VK_QUERY_RESULT_64_BIT);
if (result == VK_NOT_READY)
{
return;
}
else if (result == VK_SUCCESS)
{
mTimeQueryResults[swapchainIndex] = buffer[1] - buffer[0];
}
else
{
throw std::runtime_error("Failed to receive query results!");
}
// Queries must be reset after each individual use.
vkResetQueryPool(mDevice, mTimeQueryPool, swapchainIndex * 2, 2);
}
The variable mTimeQueryResults refers to an std::vector<uint64_t> which contains a result for each swapchain. I use it to calculate an average rendering time each second by using the timestamp period determined in step 2).
And one must not forget to cleanup to query pool by calling vkDestroyQueryPool.
There are a lot of details omitted, and for a total Vulkan noob like me this setup was frightening and took several days to figure out. Hopefully this will spare someone else the headache.
More info in documentation.
Writing to the same query index is bad because you are overwriting your "before" timestamp with the "after" timestamp at the same query index. You might want to change the last parameter in your write timestamp calls to i * 2 for the "before" call and to i * 2 + 1 for the "after". You are already allocating 2 timestamps for each command buffer, but only using half of them. This scheme ends up producing a pair of before/after timestamps for each command buffer i.
I don't have any experience using vkCmdCopyQueryPoolResults(). If you can idle your queue, then after idle, call vkGetQueryPoolResults() which will probably be much easier for what you are doing here. It copies the query results back into host memory and you don't have to mess with synchronizing writes to another buffer and then mapping/reading it back.

While Recording in Squish using python, how to set the application to sleep for sometime between 2 consecutive activities?

I am working on an application where there are read only screens.
To test whether the data is being fetched on screen load, i want to set some wait time till the screen is ready.
I am using python to record the actions. Is there a way to check the static text on the screen and set the time ?
You can simply use
snooze(time in s).
Example:
snooze(5)
If you want to wait for a certain object, use
waitForObject(":symbolic_name")
Example:
type(waitForObject(":Welcome.Button"), )
The problem is more complicated if your objects are created dynamically. As my app does. In this case, you should create a while function that waits until the object exists. Here, maybe this code helps you:
def whileObjectIsFalse(objectID):
# objectID = be the symbolic name of your object.
counter = 300
objectState = object.exists(objectID)
while objectState == False:
objectState = object.exists(objectID)
snooze(0.1)
counter -= 1
if counter == 0:
return False
snooze(0.2)
In my case, even if I use snooze(), doesn't work all the time, because in some cases i need to wait 5 seconds, in other 8 or just 2. So, presume that your object is not created and tests this for 30 seconds.
If your object is not created until then, then the code exits with False, and you can tests this to stop script execution.
If you're using python, you can use time.sleep() as well

MongoDB Geospatial Load More Between HTTP Requests

AcaniUsers loads the first 20 users in MongoDB (on Heroku via Sinatra) closest to me from my iPhone. I want to add a Load More button that will load the next 20 users closest to me. Keep in mind, my location and the locations of the users on my phone may have changed. I was thinking of switching from Sinatra to Node.js and opening a WebSocket, so I could have realtime updates of the presences & locations of the users on my phone, but think I should save that challenge for a next iteration. Basically, how should I implement the load more functionality?
To paginate queries in MongoDB you can use a combination of limit() and skip().
So, the first query will be:
your_query.limit(20)
Then if you want to load the second 20 (you will have to remember the first query somewhere):
your_query.skip(20).limit(20)
btw I suggest you to execute in the first place the query with a limit higher than 20 and put in the cache the result you don't display. When requested, just get them from the cache (you can store it in the user session). If the position change, restart from scratch and re-query the db invalidating the cache.
think of it more as a client side question: use subscriptions based on the current group - encode the group into a geo-square if possible (more efficient than circle, I think?) - periodically (t) executes an operation that checks the locations of each user and simply sends them out with a group id to match the subscriptions
actually...to build your subscription groups, just use the geonear command on all of your subscribers
- build a hash of your subscribers and their groups
- each subscriber is subscribed to one group and themselves (for targeted communication => indicate that a specific subscriber should change their subscription)
- iterate through the results i number of times where i is the number of individuals in an update group
- execute an action that checks the current value of j, the group number for a specific subscriber, against the new j value - if there is a change, notify the subscriber on the subsriber's private channel
- notifications synchronously follow subscriber adjustments
something like:
var pageSize;
// assign pageSize in method call
var documents = collection.Find(query);
var max = documents.Size();
for (int i = 0; i == max ; i++)
{
var level = i*pageSize;
if (max / level > 1)
{
documents.Skip(pageSize);
}
else
{
documents.Skip(pageSize).Limit(level);
break;
}
}
:)

Resources