I call the IShellItemImageFactory::GetImage method for each item in a folder from a number of background threads. The code looks like this:
HRESULT GetImage(IShellItemImageFactory* pImgFactory,
                 SIZE Size, COLORREF BkColor, HBITMAP *phbm)
{
    if (pImgFactory == 0 || phbm == 0)
        return E_POINTER;

    *phbm = 0;
    HBITMAP hBmp = 0;
    HRESULT hr = pImgFactory->GetImage(Size, SIIGBF_BIGGERSIZEOK, &hBmp);
    if (SUCCEEDED(hr) && hBmp)
    {
        *phbm = CPicture::StretchBitmap(hBmp, Size, BkColor);
    }
    return hr;
}
Sometimes all 16 threads get stuck inside the call to pImgFactory->GetImage, each for a different item. They are all stuck in one and the same place, which can be seen in the provided stack. I checked that different threads process different items. What can cause such a strange phenomenon?
EDIT:
After David Heffernan's response I realized that the IShellItemImageFactory interface itself may not be thread-safe. Our threading subsystem automatically initializes each thread as STA (by calling CoInitialize(0)). But maybe for IShellItemImageFactory I need MTA threads. Is there a way to discover the CLSID of IShellItemImageFactory's coclass, so that I can look up its threading requirements in the Registry?
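For reference, the per-thread change itself would be trivial; a minimal sketch (ThumbnailThreadProc is a made-up name, and whether MTA actually helps here is exactly what I'd like to find out):

#include <objbase.h>

// Sketch of a worker thread initialized as MTA instead of STA; whether
// this changes IShellItemImageFactory's behaviour remains to be tested.
unsigned __stdcall ThumbnailThreadProc(void* /*param*/)
{
    HRESULT hr = CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    if (FAILED(hr))
        return 1;
    // ... obtain IShellItemImageFactory and call GetImage here ...
    CoUninitialize();
    return 0;
}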
EDIT2:
Probably, our threading mechanism is somehow related to the problem. In this specific case we use an engine that we call a "Job Queue". It is a non-blocking FIFO queue whose elements describe a job. The description contains pointers to the job's algorithm and the job's data. Typically (but not necessarily) the main thread puts jobs into the queue. A free thread from the thread pool may take an element from the queue and perform the job. This mechanism has worked well for us for a couple of years already. But maybe it somehow affects icon extraction; maybe I wasn't sufficiently precise in defining the icon-extraction algorithm and data. I don't know how to determine that.
The phenomenon is not really a deadlock. It happens when extracting images of subfolders in a folder that Windows has not yet cached. Suppose there is a folder F with 20 subfolders F1, F2, ..., F20, each of which contains 20+ other subfolders and files. If folder F is not cached, extracting the 20 images of its subfolders may take 15 - 25 min (on my 8-core Win10 computer). Once all the images have been extracted, repeated requests for images in that folder F are performed quickly.
Assume I have multiple processes writing large files (20 GB+). Each process writes its own file; assume that a process writes x MB at a time, then does some processing, then writes x MB again, and so on.
What happens is that this write pattern causes the files to become heavily fragmented, since blocks from different files get allocated consecutively (interleaved) on the disk.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then setting the correct size before it is closed. But an application accessing these files remotely, which is able to parse these in-progress files, obviously sees zeroes at the end of the file and takes much longer to parse it.
I do not have control over this reading application, so I can't optimize it to take the zeroes at the end into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' contig utility, or even implement a custom "defragmenter" that would process my files and consolidate their blocks.
Another, more drastic solution would be to implement a minifilter driver that reports a "fake" file size.
But obviously both solutions listed above are far from optimal. So I would like to know whether there is a way to give the filesystem a file-size hint, so that it "reserves" consecutive space on the drive but still reports the right file size to applications.
Obviously, writing larger chunks at a time also helps with fragmentation, but it still does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed, I made a small test:
LARGE_INTEGER size;
LARGE_INTEGER a;
char buf = 'A';
DWORD written = 0;
DWORD tstart;

std::cout << "creating file\n";
tstart = GetTickCount();
HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ,
                       NULL, CREATE_ALWAYS, 0, NULL);
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, &a, FILE_BEGIN);
SetEndOfFile(f);                       // extend EOF to 100 MB
printf("file extended, elapsed: %d\n", GetTickCount() - tstart);
getchar();

printf("writing 'A' at the end\n");
tstart = GetTickCount();
SetFilePointer(f, -1, NULL, FILE_END); // seek to the last byte
WriteFile(f, &buf, 1, &written, NULL); // forces zero-fill up to this offset
printf("written: %d bytes, elapsed: %d\n", written, GetTickCount() - tstart);
When the application is executed and waits for a keypress after SetEndOfFile, I examined the on-disk NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However, the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated.
When pressing Enter to allow the test to continue (and after waiting for quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated.
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much if answers/comments also provided an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture).
Windows file systems maintain two public sizes for file data, which are reported by the FileStandardInformation info class:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
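A minimal sketch of this, reusing the handle f from the test above and with error handling reduced to a printf:

FILE_ALLOCATION_INFO info = {};
info.AllocationSize.QuadPart = 100000000LL; // bytes of clusters to reserve
if (!SetFileInformationByHandle(f, FileAllocationInfo, &info, sizeof(info)))
    printf("SetFileInformationByHandle failed: %lu\n", GetLastError());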
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.
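A hedged sketch of that last point, again with the handle f; it assumes SeManageVolumePrivilege is already enabled in the process token and that the file is open for GENERIC_WRITE:

LARGE_INTEGER size;
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, NULL, FILE_BEGIN);
SetEndOfFile(f);                         // EOF must be extended first
if (!SetFileValidData(f, size.QuadPart)) // then raise ValidDataLength
    printf("SetFileValidData failed: %lu\n", GetLastError());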
What is the best way to check which index a loop is currently executing without slowing down the process too much?
For example I want to find all long fancy numbers and have a loop like
for (long i = 1; i > 0; i++) {
    //block
}
and I want to know, in real time, which i is currently executing.
Several ways I know of to do this in the block are printing i every time, checking if (i % 10000 == 0), or adding a listener.
Which one of these ways is the fastest? Or what do you do in similar cases? Is there any way to access the value of i manually?
Most of my recent experience is with Java, so I'd write something like this
import java.util.concurrent.atomic.AtomicLong;

public class Example {
    public static void main(String[] args) {
        AtomicLong atomicLong = new AtomicLong(1); // initialize to 1
        LoopMonitor lm = new LoopMonitor(atomicLong);
        Thread t = new Thread(lm);
        t.setDaemon(true); // don't let the monitor keep the JVM alive
        t.start(); // start LoopMonitor
        while (atomicLong.get() > 0) {
            long l = atomicLong.getAndIncrement(); // equivalent to long l = atomicLong++ if atomicLong were a primitive
            //block
        }
    }

    private static class LoopMonitor implements Runnable {
        private final AtomicLong atomicLong;

        public LoopMonitor(AtomicLong atomicLong) {
            this.atomicLong = atomicLong;
        }

        public void run() {
            while (true) {
                try {
                    System.out.println(atomicLong.longValue()); // print the current value
                    Thread.sleep(1000); // sleep for one second
                } catch (InterruptedException ex) {}
            }
        }
    }
}
Most AtomicLong implementations can be set in one clock cycle even on 32-bit platforms, which is why I used it here instead of a primitive long (you don't want to inadvertently print a half-set long); look into your compiler / platform details to see if you need something like this, but if you're on a 64-bit platform then you can probably use a primitive long regardless of which language you're using. The modified for loop doesn't take much of an efficiency hit - you've replaced a primitive long with a reference to a long, so all you've added is a pointer dereference.
It won't be easy, but probably the only way to probe the value without affecting the process is to access the loop variable in shared memory with another thread. Threading libraries vary from one system to another, so I can't help much there (on Linux I'd probably use pthreads). The "monitor" thread might do something like probe the value once a minute, sleep()ing in between, and so allowing the first thread to run uninterrupted.
To get reporting at essentially no cost (on multi-CPU computers): make your index a "global" property (class-wide, for instance), and have a separate thread read and report the index value.
This report could be timer-based (5 times per second or so).
Note: you may also need a boolean stating "are we in the loop?".
Volatile and Caches
If you're going to be doing this in, say, C / C++ and use a separate monitor thread as previously suggested, then you'll have to make the global/static loop variable volatile. You don't want the compiler deciding to use a register for the loop variable. Some toolchains make that assumption anyway, but there's no harm in being explicit about it.
And then there's the small issue of caches. A separate monitor thread nowadays will end up on a separate core, and that'll mean that the two separate cache subsystems will have to agree on what the value is. That will unavoidably have a small impact on the runtime of the loop.
Real real time constraint?
So that begs the question of just how real-time your loop actually is. I doubt that your timing constraint is such that you're depending on it running within a specific number of CPU clock cycles. Two reasons: a) no modern OS will ever come close to guaranteeing that, you'd have to be running on bare metal; b) most CPUs these days vary their own clock rate behind your back, so you can't count on a specific number of clock cycles corresponding to a specific real-time interval.
Feature rich solution
So assuming that your real time requirement is not that constrained, you may wish to do a more capable monitor thread. Have a shared structure protected by a semaphore which your loop occasionally updates, and your monitor thread periodically inspects and reports progress. For best performance the monitor thread would take the semaphore, copy the structure, release the semaphore and then inspect/print the structure, minimising the semaphore locked time.
The only advantage of this approach over that suggested in previous answers is that you could report more than just the loop variable's value. There may be more information from your loop block that you'd like to report too.
Mutex semaphores in, say, C on Linux are pretty fast these days. Unless your loop block is very lightweight the runtime overhead of a single mutex is not likely to be significant, especially if you're updating the shared structure every 1000 loop iterations. A decent OS will put your threads on separate cores, but for the sake of good form you'd make the monitor thread's priority higher than the thread running the loop. This would ensure that the monitoring does actually happen if the two threads do end up on the same core.
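Here's a rough sketch of that monitor-thread pattern using C++11 primitives, with a plain mutex standing in for the semaphore; the Progress structure, the reporting interval and all the names are illustrative assumptions rather than a definitive implementation:

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

struct Progress { long index = 0; bool running = true; };

Progress g_progress;  // the shared structure described above
std::mutex g_mutex;   // stands in for the semaphore

void monitor()
{
    for (;;) {
        Progress snapshot;
        {
            std::lock_guard<std::mutex> lk(g_mutex); // minimise locked time:
            snapshot = g_progress;                   // copy, then report unlocked
        }
        std::printf("i = %ld\n", snapshot.index);
        if (!snapshot.running) return;
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}

int main()
{
    std::thread t(monitor);
    // bounded here so the demo terminates; the question's loop uses i > 0
    for (long i = 1; i <= 100000000; ++i) {
        // ... the real work ("block") goes here ...
        if (i % 1000 == 0) { // touch the shared state only occasionally
            std::lock_guard<std::mutex> lk(g_mutex);
            g_progress.index = i;
        }
    }
    {
        std::lock_guard<std::mutex> lk(g_mutex);
        g_progress.running = false;
    }
    t.join();
}

The loop only takes the lock once every 1000 iterations, so the steady-state overhead is little more than a predictable branch.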
I've been working on an embedded OS for ARM. However, there are a few things I didn't understand about the architecture, even after referring to the ARMARM and the Linux source.
Atomic operations.
The ARM ARM says that Load and Store instructions are atomic and that their execution is guaranteed to complete before an interrupt handler executes. I verified this by looking at
arch/arm/include/asm/atomic.h:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v,i) (((v)->counter) = (i))
However, the problem comes in when I want to manipulate this value atomically using the CPU instructions (atomic_inc, atomic_dec, atomic_cmpxchg etc.), which use LDREX and STREX for ARMv7 (my target).
The ARMARM doesn't say anything about interrupts being blocked in this section, so I assume an interrupt can occur between the LDREX and the STREX. The thing it does mention is locking the memory bus, which I guess is only helpful for MP systems where there can be more CPUs trying to access the same location at the same time. But for UP (and possibly MP), if a timer interrupt (or IPI for SMP) fires in this small window between LDREX and STREX, the exception handler executes and possibly changes the CPU context and returns to a new task. However, the shocking part comes in now: it executes CLREX, thereby removing any exclusive lock held by the previous thread. So how is using LDREX and STREX any better than using LDR and STR for atomicity on a UP system?
I did read something about an exclusive lock monitor, so I have a possible theory: when the thread resumes and executes the STREX, the OS monitor causes this call to fail, which can be detected, and the loop can be re-executed using the new value in the process (branch back to LDREX). Am I right here?
The idea behind the load-linked/store-exclusive paradigm is that if the store follows very soon after the load, with no intervening memory operations, and if nothing else has touched the location, the store is likely to succeed, but if something else has touched the location the store is certain to fail. There is no guarantee that stores will not sometimes fail for no apparent reason; if the time between load and store is kept to a minimum, however, and there are no memory accesses between them, a loop like:
do
{
    new_value = __LDREXW(dest) + 1;
} while (__STREXW(new_value, dest));
can generally be relied upon to succeed within a few attempts. If computing the new value based on the old value requires some significant computation, one should rewrite the loop as:
do
{
    old_value = *dest;
    new_value = complicated_function(old_value);
} while (CompareAndStore(dest, new_value, old_value) != 0);
... Assuming CompareAndStore is something like:
uint32_t CompareAndStore(uint32_t *dest, uint32_t new_value, uint32_t old_value)
{
    do
    {
        if (__LDREXW(dest) != old_value) return 1; // failure
    } while (__STREXW(new_value, dest));
    return 0;
}
This code will have to rerun its main loop if something changes *dest while the new value is being computed, but only the small loop will need to be rerun if the __STREXW fails for some other reason (which is hopefully not too likely, given that there will only be about two instructions between the __LDREXW and the __STREXW).
Addendum
An example of a situation where "compute new value based on old" could be complicated is one where the "values" are effectively references to a complex data structure. Code may fetch the old reference, derive a new data structure from the old, and then update the reference. This pattern comes up much more often in garbage-collected frameworks than in "bare metal" programming, but there are a variety of ways it can come up even when programming bare metal. Normal malloc/calloc allocators are not generally thread-safe/interrupt-safe, but allocators for fixed-size structures often are. If one has a "pool" of some power-of-two number of data structures (say 256), one could use something like:
#define FOO_POOL_SIZE_SHIFT 8
#define FOO_POOL_SIZE (1 << FOO_POOL_SIZE_SHIFT)
#define FOO_POOL_SIZE_MASK (FOO_POOL_SIZE - 1)

void do_update(void)
{
    // The foo_pool_alloc() method should return a slot number in the lower bits and
    // some sort of counter value in the upper bits so that once some particular
    // uint32_t value is returned, that same value will not be returned again unless
    // there are at least (UINT_MAX)/(FOO_POOL_SIZE) intervening allocations (to avoid
    // the possibility that while one task is performing its update, a second task
    // changes the thing to a new one and releases the old one, and a third task gets
    // given the newly-freed item and changes the thing to that, such that from the
    // point of view of the first task, the thing never changed.)
    uint32_t new_thing = foo_pool_alloc();
    uint32_t old_thing;
    do
    {
        // Capture old reference
        old_thing = foo_current_thing;
        // Compute new thing based on old one
        update_thing(&foo_pool[new_thing & FOO_POOL_SIZE_MASK],
                     &foo_pool[old_thing & FOO_POOL_SIZE_MASK]);
    } while (CompareAndSwap(&foo_current_thing, new_thing, old_thing) != 0);
    foo_pool_free(old_thing);
}
If there will not often be multiple threads/interrupts/whatever trying to update the same thing at the same time, this approach should allow updates to be performed safely. If a priority relationship will exist among the things that may try to update the same item, the highest-priority one is guaranteed to succeed on its first attempt, the next-highest-priority one will succeed on any attempt that isn't preempted by the highest-priority one, etc. If one was using locking, the highest-priority task that wanted to perform the update would have to wait for the lower-priority update to finish; using the CompareAndSwap paradigm, the highest-priority task will be unaffected by the lower one (but will cause the lower one to have to do wasted work).
Okay, I got the answer from their website.
If a context switch schedules out a process after the process has performed a Load-Exclusive but before it performs the Store-Exclusive, the Store-Exclusive returns a false negative result when the process resumes, and memory is not updated. This does not affect program functionality, because the process can retry the operation immediately.
I have a case where many threads all concurrently generate data that is ultimately written to one long, serial file stream. I need to somehow serialize these writes so that the stream gets written in the right order.
i.e., I have an input queue of 2048 jobs j0..jn, each of which produces a chunk of data oi. The jobs run in parallel on, say, eight threads, but the output blocks have to appear in the stream in the same order as the corresponding input blocks; the output file has to be in the order o0 o1 o2 ...
The solution to this is pretty self-evident: I need some kind of buffer that accumulates and writes the output blocks in the correct order, similar to a CPU reorder buffer in Tomasulo's algorithm, or to the way that TCP reassembles out-of-order packets before passing them to the application layer.
Before I go code it, I'd like to do a quick literature search to see if there are any papers that have solved this problem in a particularly clever or efficient way, since I have severe realtime and memory constraints. I can't seem to find any papers describing this, though; a Scholar search on every permutation of [threads, concurrent, reorder buffer, reassembly, io, serialize] hasn't yielded anything useful. I feel like I must just not be searching for the right terms.
Is there a common academic name or keyword for this kind of pattern that I can search on?
The Enterprise Integration Patterns book calls this a Resequencer (p282/web).
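For illustration, here is a minimal resequencer sketch in C++; the class shape, the std::map buffer, and the names are my own assumptions, not taken from the book. Workers call deliver() with their job's sequence number from any thread; a single writer repeatedly calls next() and writes what it returns.

#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

class Resequencer {
public:
    // Called by workers, in any order, from any thread.
    void deliver(uint64_t index, std::string chunk)
    {
        std::lock_guard<std::mutex> lk(m_mutex);
        m_pending.emplace(index, std::move(chunk));
        m_cv.notify_one();
    }

    // Called by the single writer: blocks until the next-in-order
    // chunk is available, then returns it.
    std::string next()
    {
        std::unique_lock<std::mutex> lk(m_mutex);
        m_cv.wait(lk, [this] {
            return !m_pending.empty() && m_pending.begin()->first == m_next;
        });
        std::string chunk = std::move(m_pending.begin()->second);
        m_pending.erase(m_pending.begin());
        ++m_next;
        return chunk;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::map<uint64_t, std::string> m_pending; // out-of-order buffer
    uint64_t m_next = 0;                       // next index to emit
};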
Actually, you shouldn't need to accumulate the chunks. Most operating systems and languages provide a random-access file abstraction that would allow each thread to independently write its output data to the correct position in the file without affecting the output data from any of the other threads.
Or are you writing to a truly serial output file, like a socket?
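If the file really is seekable, a positional write needs no reorder buffer at all. A sketch (POSIX flavour; write_chunk, the fixed chunk_size, and the job-index-to-offset mapping are assumptions for illustration):

#include <unistd.h> // pwrite (POSIX)
#include <vector>

// Hypothetical helper: writes job `job_index`'s output at its reserved
// offset. Assumes fixed-size chunks so offsets are known up front.
void write_chunk(int fd, size_t job_index, const std::vector<char>& data,
                 size_t chunk_size)
{
    // pwrite takes an explicit offset, so concurrent writers never race
    // on a shared file position.
    pwrite(fd, data.data(), data.size(),
           static_cast<off_t>(job_index * chunk_size));
}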
I wouldn't use a reorderable buffer at all, personally. I'd create one 'job' object per job, and, depending on your environment, either use message passing or mutexes to receive completed data from each job in order. If the next job isn't done, your 'writer' process waits until it is.
I would use a ring buffer that has the same length as the number of threads you are using. The ring buffer would also have the same number of mutexes.
The ring buffer must also know the id of the last chunk it has written to the file. It is equivalent to the 0 index of your ring buffer.
On adding to the ring buffer, you check if you can write, i.e. index 0 is set; you can then write more than one chunk at a time to the file.
If index 0 is not set, simply lock the current thread to wait. -- You could also have a ring buffer 2-3 times the length of your number of threads and lock only when appropriate, i.e. when enough jobs to fill the buffer have been launched.
Don't forget to update the last chunk written, though ;)
You could also use double buffering when writing to the file.
Have the output queue contain futures rather than the actual data. When you retrieve an item from the input queue, immediately post the corresponding future onto the output queue (taking care to ensure that this preserves the order --- see below). When the worker thread has processed the item it can then set the value on the future. The output thread can read each future from the queue, and block until that future is ready. If later ones become ready early this doesn't affect the output thread at all, provided the futures are in order.
There are two ways to ensure that the futures on the output queue are in the correct order. The first is to use a single mutex for reading from the input queue and writing to the output queue. Each thread locks the mutex, takes an item from the input queue, posts the future to the output queue and releases the mutex.
The second is to have a single master thread that reads from the input queue, posts the future to the output queue, and then hands the item off to a worker thread to execute.
In C++ with a single mutex protecting the queues this would look like:
#include <thread>
#include <mutex>
#include <future>
#include <queue>

struct work_data {};
struct result_data {};

std::mutex queue_mutex;
std::queue<work_data> input_queue;
std::queue<std::future<result_data> > output_queue;

result_data process(work_data const&); // do the actual work

void worker_thread()
{
    for(;;) // substitute an appropriate termination condition
    {
        std::promise<result_data> p;
        work_data data;
        {
            std::lock_guard<std::mutex> lk(queue_mutex);
            if(input_queue.empty())
            {
                continue;
            }
            data=input_queue.front();
            input_queue.pop();
            std::promise<result_data> item_promise;
            output_queue.push(item_promise.get_future());
            p=std::move(item_promise);
        }
        p.set_value(process(data));
    }
}

void write(result_data const&); // write the result to the output stream

void output_thread()
{
    for(;;) // or whatever termination condition
    {
        std::future<result_data> f;
        {
            std::lock_guard<std::mutex> lk(queue_mutex);
            if(output_queue.empty())
            {
                continue;
            }
            f=std::move(output_queue.front());
            output_queue.pop();
        }
        write(f.get());
    }
}
I am pretty sure I am suffering from memory leakage, but I haven't 100% nailed down how it's happening.
The application I've written downloads 2 images from a URL and queues each set of images, called a transaction, into a queue to be popped off by the user interface and displayed. The images are pretty big, averaging about 2.5 MB. So as a way of speeding up the user interface and making it more responsive, I pre-load each transaction's images into wxImage objects and store them.
When the user pops off another transaction, I feed the preloaded image into a window object that then converts the wxImage into a bitmap and DC blits to the window. The window object is then displayed on a panel.
When the transaction is finished by the user, I destroy the window object (presumably the window goes away, as does the bitmap) and the transaction data structure is overwritten with 'None'.
However, depending on how many images I've preloaded, whether the queue size is set large and it's done all at once, or whether I let a small queue size sit over time, it eventually crashes. I really can't let this happen... :)
Does anyone see any obvious logical errors in what I'm doing? Does Python garbage-collect? I don't have much experience with having to deal with memory issues.
[edit] Here is the code ;) This is the code related to the thread that downloads the images. It is instantiated in the main thread that runs the GUI; the download thread's main function is the fill_queue function:
def fill_queue(self):
    while True:
        if (self.len() < self.maxqueuesize):
            try:
                trx_data = self.download_transaction_data(self.get_url)
                for trx in trx_data:
                    self.download_transaction_images(trx)
                    if self.valid_images([trx['image_name_1'], trx['image_name_2']]):
                        trx = self.pre_load_images(trx)
                        self.append(trx)
            except IOError, error:
                print "Received IOError while trying to download transactions or images"
                print "Error Received: ", error
            except Exception, ex:
                print "Caught general exception while trying to download transactions or images"
                print "Error Received: ", ex
        else:
            time.sleep(1)

def download_transaction_images(self, data):
    """ Method will download all the available images for the provided transaction """
    for (a, b) in data.items():
        if (b) and (a == "image_name_1" or a == "image_name_2"):
            modified_url = self.images_url + self.path_from_filename(b)
            download_url = modified_url + b
            local_filepath = self.cache_dir + b
            urllib.urlretrieve(download_url, local_filepath)
    urllib.urlcleanup()

def download_transaction_data(self, trx_location):
    """ Method will download transaction data and return a parsed list of hash structures """
    page = urllib.urlopen(trx_location)
    data = page.readlines()
    page.close()
    trx_list = []
    trx_data = {}
    for line in data:
        line = line.rstrip('|!\n')
        if re.search('id=', line):
            fields = re.split('\|', line)
            for jnd in fields:
                pairs = jnd.split('=')
                trx_data[pairs[0]] = pairs[1]
            trx_list.append(trx_data)
    return trx_list

def pre_load_images(self, trx):
    """ Method will create a wxImage and load it into memory to speed the image display """
    path1 = self.cache_dir + trx['image_name_1']
    path2 = self.cache_dir + trx['image_name_2']
    image1 = wx.Image(path1)
    image2 = wx.Image(path2)
    trx['loaded_image_1'] = image1
    trx['loaded_image_2'] = image2
    return trx

def valid_images(self, images):
    """ Method verifies that the image path is valid and image is readable """
    retval = True
    for i in images:
        if re.search('jpg', i) or re.search('jpeg', i):
            imagepath = self.cache_dir + i
            if not os.path.exists(imagepath) or not wx.Image.CanRead(imagepath):
                retval = False
        else:
            retval = False
    return retval
Also, I'd like to add that sometimes, just before the crash, I get peculiar errors in my console. They look like corrupt-image errors, but the images are not corrupted, and the error has happened at all stages on all images.
Application transferred too few scanlines
[2009-09-08 11:12:03] Error: JPEG: Couldn't load - file is probably corrupted.
[2009-09-08 11:12:11] Debug: ....\src\msw\dib.cpp(134): 'CreateDIBSection' failed with error 0x00000000 (the operation completed successfully.).
These errors can happen a la carte, or all together. What I think is happening is that at some point the memory becomes corrupted, and anything that happens next, whether I load a new transaction, load an image, or do a cropping operation, takes a dive.
So, unfortunately, after trying out the suggestion of moving the pre-loading call to wxImage into the main GUI thread, I am still getting the error. Again, it occurs after too many images have been loaded into memory, or if they sit in memory for too long. Then, when I attempt to crop an image, I get a memory error: something is corrupting. Either (in the former case) I am using too much memory or don't have enough, which makes no sense because I've increased my paging file size to astronomical proportions, or (in the latter case) the length of time is causing a leak or corruption.
The only way I can think to go at this point is to use a debugger. Are there any easy ways to debug a wxPython application? I would like to see the memory usage in particular.
The main reason why I think I need to preload the images is that if I call wxImage on each image (I show two at a time) every time I load a 'transaction', the interface from one transaction to the next is very slow and clunky. If I load them into memory it's very fast, but then I get my memory error.
Two thoughts:
You do not mention whether the downloading is running in a separate thread (actually, now I see that it is; I should read more closely). I'm pretty sure that wx.Image is not thread-safe, so if you are instantiating wx.Images in a non-GUI thread, that could lead to trouble like this. (This is almost certainly the issue; most wx classes/objects/functions are not thread-safe.)
I've been bitten by nasty IncRef/DecRef bugs in wxPython (due to the underlying C++ bindings) before, mostly associated with wx.Grid and associated classes. While I don't know of any with wx.Image, it wouldn't surprise me to find out you may be required to manually manage memory the way you sometimes have to with wx.Grid.
Edit
You need to instantiate the wx.Image in the GUI thread, not the downloading thread (your code above looks like it is currently instantiating it in the non-GUI thread). In general this is almost always going to cause lots of problems in any GUI toolkit. You can search the wxPython mailing list for lots of emails where this is the case. Personally, I would do this (a rough sketch of the event hand-off follows the list):
Queue for download urls.
Thread to download images.
Have the downloading thread place the file's disk location (watch out for race conditions!) in a separate queue and post a custom wx.Event (thread-safe with the wx.PostEvent function) to the App thread.
Have the GUI thread pop the file locations and instantiate wx.Image ----> wx.Bitmap (maybe with wx.CallAfter to process when the App is idle).
Display (Blit) as needed.
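For what it's worth, here is a rough sketch of that hand-off in the underlying wxWidgets C++ API, which wxPython wraps (wxThreadEvent and wxQueueEvent require wxWidgets 2.9+; MyFrame, NotifyImageReady, and OnImageReady are illustrative names, and in wxPython the equivalent is a custom event posted with wx.PostEvent):

#include <wx/wx.h>

class MyFrame : public wxFrame
{
public:
    MyFrame() : wxFrame(NULL, wxID_ANY, "Viewer")
    {
        Bind(wxEVT_THREAD, &MyFrame::OnImageReady, this);
    }

    // Callable from the download thread: posts only the file path,
    // never touches GUI objects.
    void NotifyImageReady(const wxString& path)
    {
        wxThreadEvent* evt = new wxThreadEvent();
        evt->SetString(path);
        wxQueueEvent(this, evt); // thread-safe hand-off to the GUI thread
    }

private:
    void OnImageReady(wxThreadEvent& evt)
    {
        // Instantiate the image in the GUI thread only.
        wxImage img(evt.GetString());
        if (img.IsOk())
            m_bitmap = wxBitmap(img); // then Blit/display as needed
    }

    wxBitmap m_bitmap;
};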