I am working with multiple multiprocessing pools using starmap_async, which was working fine, but the main issue is: how can I kill only a particular pool?
The app uses a tkinter UI.
When I select Camp1 and run it, it picks up the list related to that campaign and runs without issue.
self.pool = Pool(processes=10)
p = list(itertools.product(searchengines, selecteddomains, searchterms))
params = [(a, api, name_of_camp, da, pidfileloc, proxies,
logfileloc, hours, domainfileloc,
siteregistration, mozapi, mozsecret) for a in p]
self.pool.starmap_async(Expired_Search, params)
I can't get the pool ID to terminate the process. Is there any way to get the pool ID so I can terminate the processes of the campaign I selected?
Any help would be appreciated. Thanks
Found a solution with multiprocessing.Process. Thanks to @AKX.
self.executor = Process(target=Expired_Search, args=(p, api, name_of_camp, da,
pidfileloc, proxies, logfileloc, hours,
domainfileloc, siteregistration, mozapi, mozsecret))
self.executor.start()
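For completeness, here is a minimal sketch (not part of the original solution) of how that per-campaign Process could then be stopped, assuming self.executor is the handle stored above for the selected campaign:

if self.executor is not None and self.executor.is_alive():
    self.executor.terminate()  # forcefully stop the child running this campaign
    self.executor.join()       # wait for it to exit so its resources are released
    self.executor = None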
I am running something along the lines of the following:
results = queries.map do |query|
  begin
    Neo4j::Session.query(query)
  rescue Faraday::TimeoutError
    nil
  end
end
After a few iterations I get an unrescued Faraday::TimeoutError: too many connection resets (due to Net::ReadTimeout - Net::ReadTimeout) and Neo4j needs switching off and on again.
I believe this is because the queries themselves aren't aborted - i.e. the connection times out but Neo4j carries on trying to run my query. I actually want to time them out, so simply increasing the timeout window won't help me.
I've had a scout around and it looks like I can find my queries and abort them via the Neo4j API, which will be my next move.
Am I right in my diagnosis? If so, is there a recommended way of managing queries (and aborting them) from neo4jrb?
Rebecca is right about managing queries manually. Though if you want Neo4j to automatically stop queries within a certain time period, you can set this in your neo4j conf:
dbms.transaction.timeout=60s
You can find more info in the docs for that setting.
The Ruby gem is using Faraday to connect to Neo4j via HTTP and Faraday has a built-in timeout which is separate from the one in Neo4j. I would suggest setting the Neo4j timeout as a bit longer (5-10 seconds perhaps) than the one in Ruby (here are the docs for configuring the Faraday timeout). If they both have the same timeout, Neo4j might raise a timeout before Ruby, making for a less clear error.
Query management can be done through Cypher. You must be an admin user.
To list all queries, you can use CALL dbms.listQueries;.
To kill a query, you can use CALL dbms.killQuery('ID-OF-QUERY-TO-KILL');, where the ID is obtained from the list of queries.
The previous statements must be executed as a raw query; it does not matter whether you are using an OGM, as long as you can input queries manually. If there is no way to manually input queries, and there is no way of doing this in your framework, then you will have to access the database using some other method in order to execute the queries.
So thanks to Brian and Rebecca for useful tips about query management within Neo4j. Both of these point the way to viable solutions to my problem, and Brian's explicitly lays out steps for achieving one via Neo4jrb so I've marked it correct.
As both answers assume, the diagnosis I made IS correct - i.e. if you run a query from Neo4jrb and the HTTP connection times out, Neo4j will carry on executing the query and Neo4jrb will not issue any instruction for it to stop.
Neo4jrb does not provide a wrapper for any query management functionality, so simply setting a transaction timeout seems most sensible and probably what I'll adopt. Actually intercepting and killing queries is also possible, but this means running your query on one thread so that you can look up its queryId in another. This is the somewhat hacky solution I'm working with atm:
class QueryRunner
  DEFAULT_TIMEOUT = 70

  def self.query(query, timeout_limit = DEFAULT_TIMEOUT)
    new(query, timeout_limit).run
  end

  def initialize(query, timeout_limit)
    @query = query
    @timeout_limit = timeout_limit
  end

  def run
    start_time = Time.now.to_i
    # Run the query in a background thread so we can look up its queryId from here
    Thread.new { @result = Neo4j::Session.query(@query) }
    sleep 0.5
    return @result if @result
    # Find the running query's ID via dbms.listQueries
    id = if query_ref = Neo4j::Session.query("CALL dbms.listQueries;").to_a.find { |x| x.query == @query }
           query_ref.queryId
         end
    while @result.nil?
      if (Time.now.to_i - start_time) > @timeout_limit
        puts "killing query #{id} due to timeout"
        Neo4j::Session.query("CALL dbms.killQuery('#{id}');")
        @result = []
      else
        sleep 1
      end
    end
    @result
  end
end
I have recently started working with Python's multiprocessing library and decided that the Pool() and apply_async() approach is the most suitable for my problem. The code is quite long, but for this question I've compressed everything that isn't related to the multiprocessing into functions.
Background information
Basically, my program is supposed to take some data structure and send it to another program that will process it and write the results to a txt file. I have several thousand of these structures (N*M), and there are big chunks (M) that are independent and can be processed in any order. I created a worker pool to process these M structures before retrieving the next chunk. In order to process one structure, a new thread has to be created for the external program to run. The time spent outside the external program during processing is less than 20%, so if I check the Task Manager, I can see the external program running under Processes.
Actual problem
This works very well for a while, but after many processed structures (any number between 5000 and 20000) the external program suddenly stops showing up in the Task Manager and the Python children run at their individual peak performance (~13% CPU) without producing any more results. I don't understand what the problem might be. There is plenty of RAM left, and each child only uses around 90 MB. It is also really weird that it works for quite some time and then stops. If I use Ctrl-C, it stops after a few minutes, so it is semi-unresponsive to user input.
One thought I had was that when the timed-out external program thread is killed (which happens every now and then), maybe something isn't closed properly so that the child process is waiting for something it cannot find anymore? And if so, is there any better way of handling timed-out external processes?
from multiprocessing import Pool, TimeoutError

N = 500        # Number of chunks of data that can be multiprocessed
M = 80         # Independent chunks of data
timeout = 100  # Higher than any of the values for dataStructures.timeout

if __name__ == "__main__":
    results = [None]*M
    savedData = []
    with Pool(processes=4) as pool:
        for iteration in range(N):
            dataStructures = [generate_data_structure(i) for i in range(M)]
            # ---Process data structures---
            for iS, dataStructure in enumerate(dataStructures):
                results[iS] = pool.apply_async(processing_func, (dataStructure,))
            # ---Extract processed data---
            for iR, result in enumerate(results):
                try:
                    processedData = result.get(timeout=timeout)
                except TimeoutError:
                    print("Got TimeoutError.")
                if processedData.someBool:
                    savedData.append(processedData)
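One detail worth noting in the extraction loop above (an observation, not necessarily the cause of the stall): if result.get() raises TimeoutError, processedData keeps its value from the previous iteration, or is undefined on the very first timeout. A small guard, sketched here, avoids that:

try:
    processedData = result.get(timeout=timeout)
except TimeoutError:
    print("Got TimeoutError.")
    processedData = None  # avoid reusing the previous iteration's value
if processedData is not None and processedData.someBool:
    savedData.append(processedData)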
Here are also the functions that create the new thread for the external program.
import subprocess as sp
import win32api as wa
import threading


def processing_func(dataStructure):
    # Call another program that processes the data, and wait until it is finished/timed out
    timedOut = RunCmd(dataStructure.command).start_process(dataStructure.timeout)
    # Read the data from the other program, stored in a text file
    if not timedOut:
        processedData = extract_data_from_finished_thread()
    else:
        processedData = 0.
    return processedData


class RunCmd(threading.Thread):
    CREATE_NO_WINDOW = 0x08000000

    def __init__(self, cmd):
        threading.Thread.__init__(self)
        self.cmd = cmd
        self.p = None

    def run(self):
        self.p = sp.Popen(self.cmd, creationflags=self.CREATE_NO_WINDOW)
        self.p.wait()

    def start_process(self, timeout):
        self.start()
        self.join(timeout)
        timedOut = self.is_alive()
        # Kill the external process if the timeout limit is reached
        if timedOut:
            wa.TerminateProcess(self.p._handle, -1)
            self.join()
        return timedOut
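As for handling timed-out external programs, one possibly cleaner variant (a sketch only, not tested against this setup) is to let subprocess enforce the timeout itself, which avoids the watcher thread and the private _handle attribute:

import subprocess as sp

CREATE_NO_WINDOW = 0x08000000  # same Windows-only flag as in RunCmd

def run_with_timeout(cmd, timeout):
    # Run the external command; kill it if it exceeds `timeout` seconds.
    # Returns True if the command timed out, mirroring RunCmd.start_process().
    p = sp.Popen(cmd, creationflags=CREATE_NO_WINDOW)
    try:
        p.wait(timeout=timeout)
        return False
    except sp.TimeoutExpired:
        p.kill()   # terminate the external program
        p.wait()   # reap it so no handle is left open
        return True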
I am looking into parallelization of url requests onto one single webserver in python for the first time.
I would like to use requests_futures for this task as it seems that one can really split up processes onto several cores with the ProcessPoolExecutor.
The example code from the module documentation is:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)
The above code works for me; however, I need some help getting it customized to my needs.
I want to query the same server, let's say, 50 times (e.g. 50 different httpbin.org/get?... requests). What would be a good way to split these up onto different futures other than just defining future_one, ..._two and so on?
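For the "50 requests" part, a common pattern is simply to build the futures in a loop and collect the results afterwards. A minimal sketch (params={'foo': i} is just a hypothetical stand-in for your 50 different query strings):

from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession

session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))

# fire off all 50 requests; each call returns immediately with a future
futures = [session.get('http://httpbin.org/get', params={'foo': i}) for i in range(50)]

# collect the responses; result() blocks until that particular request is done
for future in futures:
    response = future.result()
    print(response.status_code)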
I am thinking about using different processes. According to the module documentation, it should be just a change in the first three lines of the above code:
from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
If I execute this I get the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
How do I get this running properly?
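Not a confirmed fix for the requests_futures case, but one frequent cause of BrokenProcessPool (particularly on Windows) is that a process pool re-imports the main module in its children, so the executor and the work submitted to it need to live behind an entry-point guard. Here is a sketch of the same example with that guard (whether FuturesSession itself is compatible with ProcessPoolExecutor pickling is a separate question not settled here):

from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession

def main():
    session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
    future_one = session.get('http://httpbin.org/get')
    print(future_one.result().status_code)

if __name__ == '__main__':
    main()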
I've got a small little Ruby script that pores over 80,000 or so records.
The processor and memory load involved for each record is smaller than a smurf ball's, but it still takes about 8 minutes to walk all the records.
I'd thought to use threading, but when I gave it a go, my DB ran out of connections. Sure, that was when I attempted to connect 200 times, and really I could limit it better than that. But when I push this code up to Heroku (where I have 20 connections for all workers to share), I don't want to risk blocking other processes because this one ramped up.
I have thought of refactoring the code so that it combines all the SQL, but that is going to feel really, really messy.
So I'm wondering: is there a trick to letting the threads share connections? Given that I don't expect the connection variable to change during processing, I am actually somewhat surprised that the thread fork needs to create a new DB connection.
Well, any help would be super cool (just like me). Thanks.
SUPER CONTRIVED EXAMPLE
Below is a 100% contrived example. It does demonstrate the issue.
I am using ActiveRecord inside a very simple thread. It seems each thread is creating its own connection to the database. I base that assumption on the warning message that follows.
START_TIME = Time.now

require 'rubygems'
require 'erb'
require 'yaml'
require 'active_record'

@environment = 'development'
@dbconfig = YAML.load(ERB.new(File.read('config/database.yml')).result)
ActiveRecord::Base.establish_connection @dbconfig[@environment]

class Product < ActiveRecord::Base; end

ids = Product.pluck(:id)
p "after pluck #{Time.now.to_f - START_TIME.to_f}"

threads = []
ids.each do |id|
  threads << Thread.new { Product.where(:id => id).update_all(:product_status_id => 99) }
  if threads.size > 4
    threads.each(&:join)
    threads = []
    p "after thread join #{Time.now.to_f - START_TIME.to_f}"
  end
end
p "#{Time.now.to_f - START_TIME.to_f}"
OUTPUT
"after pluck 0.6663269996643066"
DEPRECATION WARNING: Database connections will not be closed automatically, please close your
database connection at the end of the thread by calling `close` on your
connection. For example: ActiveRecord::Base.connection.close
. (called from mon_synchronize at /Users/davidrawk/.rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/monitor.rb:211)
.....
"after thread join 5.7263710498809814" #THIS HAPPENS AFTER THE FIRST JOIN.
.....
"after thread join 10.743254899978638" #THIS HAPPENS AFTER THE SECOND JOIN
See this gem https://github.com/mperham/connection_pool and this answer: Why not use shared ActiveRecord connections for Rspec + Selenium? A connection pool might be what you need.
The other option would be to use https://github.com/eventmachine/eventmachine and run your tasks in an EM.defer block, in such a way that DB access happens in the callback block (within the reactor) in a non-blocking way.
Alternatively, and as a more robust solution, go for a lightweight background-processing queue such as beanstalkd; see https://www.ruby-toolbox.com/categories/Background_Jobs for more options. This would be my primary recommendation.
EDIT: Also, you probably don't have 200 cores, so creating 200+ parallel threads and DB connections doesn't really speed up the process (it actually slows it down). See if you can find a way to partition your problem into a number of sets equal to your number of cores + 1 and solve it that way; this is probably the simplest solution to your problem.
We are trying to track down the cause of a performance problem.
We have a table with a single row that contains a primary key and a counter. Within a transaction we read the value of the counter, increment the value by one and save the new value.
The read and update is done using Entity Framework, we use a serializable transaction scope, we need to ensure that a counter value is read once only.
Most of the time this takes 0.1 seconds; sometimes it takes over 1 second. We have not been able to find any pattern as to why this happens.
Has anyone else experienced variable performance when using transaction scope? Would it help to drop using transaction scope and set the transaction directly on the connection?
I remember commenting on this question a long time ago, but recently some developers in my shop have started using TransactionScope, and have also run into performance issues. While trying to search for some information, this came up pretty high in the Google Search results.
The issue we ran into was that chaining commands (INSERTs, etc.) with BeginChain apparently does not work when using a TransactionScope (at least on the version we are running: Client v9.7.4.4 connecting to DB2 z/OS v10).
I thought I would leave a workaround for the issue we ran into (slow performance when running lots [1k+] of INSERTs under TransactionScope, which ran fine when the scope was removed and chaining was allowed). I'm not really sure whether it helps the original question directly, but there are some options in the IBM.Data.DB2.dll classes that allow you to update rows using a DB2DataAdapter and an underlying DataTable.
Here's some example code in VB.NET:
Private Sub InsertByBulk(tableName As String, insertCollection As List(Of Object))
    Dim curTimestamp = Date.Now
    Using scope = New TransactionScope
        'Something that opens a connection to DB2, may vary
        Using conn = GetDB2Connection()
            'Dumb query to get DataTable from the ResultSet
            Dim sql = String.Format("SELECT * FROM {0} WHERE 1=0", tableName)
            Using adapter = New DB2DataAdapter(sql, conn)
                Using table As New DataTable
                    adapter.FillSchema(table, SchemaType.Source)
                    For Each item In insertCollection
                        Dim row = table.NewRow()
                        row("ID") = item.Id
                        row("CHAR_FIELD") = item.CharField
                        row("QUANTITY") = item.Quantity
                        row("UPDATE_TIMESTAMP") = curTimestamp
                        table.Rows.Add(row)
                    Next
                    Using bc = New DB2BulkCopy(conn)
                        bc.DestinationTableName = tableName
                        bc.WriteToServer(table)
                    End Using 'BulkCopy
                End Using 'DataTable
            End Using 'DataAdapter
        End Using 'Connection
        scope.Complete()
    End Using
End Sub
We have now solved this problem.
The root of the problem was that the DB2 provider does not support transaction promotion. This results in Transaction Scope using MSDTC distributed transactions for everything.
We replaced the use of Transaction Scope with transactions set on the database connection.
Composite services that included the code in the question were then reduced from 3 seconds to 0.3 seconds.