PyTorch: wrapping multiple records in one file?

Is there a standard way of encoding multiple records (in this case, data from multiple .png or .jpeg images) in one file that PyTorch can read? Something similar to TensorFlow's "TFRecord" or MXNet's "RecordIO", but for PyTorch.
I need to download image data from S3 for inference, and it's much slower if my image data is in many small .jpg files rather than fewer files.
Thanks.

One option is to store batches of images together in a single .npz file. NumPy's np.savez lets you save multiple arrays in a single file (use np.savez_compressed if you also want compression). Then load the file as NumPy arrays and use torch.from_numpy to convert them to tensors.
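A minimal sketch of that approach, assuming the images are decoded with Pillow; the file names here are hypothetical:

    import numpy as np
    import torch
    from PIL import Image

    # Hypothetical file names for illustration.
    paths = ["img_000.jpg", "img_001.jpg", "img_002.jpg"]

    # Pack a batch of decoded images into one .npz archive.
    arrays = {f"img_{i}": np.asarray(Image.open(p)) for i, p in enumerate(paths)}
    np.savez("batch_000.npz", **arrays)  # or np.savez_compressed(...)

    # Later (e.g. after downloading batch_000.npz from S3), load and convert.
    with np.load("batch_000.npz") as batch:
        tensors = [torch.from_numpy(batch[k]) for k in batch.files]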

Related

Efficient way to loop through a large number of files, convert them to webp and save the timestamp

I have a folder with about 750'000 images. Some images will change over time and new images will also be added every now and then. The folder structure is about 4-5 levels deep, with a maximum of 70'000 images in a single folder.
I now want to write a script that can do the following:
Loop through all the files
Check if the file is new (has not yet been converted) or changed since the last conversion
Convert the file from jpg or png to webp if the above rules apply
My current solution is a Python script that writes the conversion times into an SQLite database. It works, but is really slow. I also thought about doing it in PowerShell for (presumably) better performance, but had no efficient way of storing the conversion times.
What language would you recommend? Is there another way to convert jpg to webp without having to externally call the cwebp command from within my script?
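For what it's worth, a minimal sketch of doing the conversion in-process with Pillow, using file mtimes instead of a database to detect new or changed files (the paths and quality setting are hypothetical, and Pillow must have been built with WebP support):

    from pathlib import Path
    from PIL import Image  # assumes Pillow was built with WebP support

    src_root = Path("images")  # hypothetical source tree
    dst_root = Path("webp")    # hypothetical output tree

    for src in src_root.rglob("*"):
        if src.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        dst = (dst_root / src.relative_to(src_root)).with_suffix(".webp")
        # Skip files whose WebP copy is already newer than the source.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        Image.open(src).save(dst, "WEBP", quality=80)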

Predicting file size of output file when merging PDFs (using GhostScript or similar)

This is what I am trying to achieve:
I have several hundred small PDF files of varying sizes. I need to merge them into chunks that are close to, but no larger than, a certain target file size.
I am familiar with gs as well as pdftk (though I prefer to use gs).
Does anyone know a way of predicting the file size of the merged output PDF beforehand, so that I can use it to select the files to be included in the next chunk?
I am not aware of something like a --dry-run option for gs...
(If there is no other way, I guess I would have to make an estimate based on the sum of the input file sizes and go for trial and error.)
Thank you in advance!
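If it does come down to the fallback mentioned in the last paragraph, a rough greedy grouping by summed input sizes might look like this (the target size, fudge factor, and input folder are hypothetical and would need trial-and-error tuning):

    from pathlib import Path

    TARGET = 20 * 1024 * 1024  # hypothetical 20 MB target per chunk
    FUDGE = 0.95               # merged output rarely equals the raw sum; tune by trial

    chunks, current, current_size = [], [], 0
    for pdf in sorted(Path("pdfs").glob("*.pdf")):  # hypothetical input folder
        size = pdf.stat().st_size
        if current and current_size + size > TARGET * FUDGE:
            chunks.append(current)
            current, current_size = [], 0
        current.append(pdf)
        current_size += size
    if current:
        chunks.append(current)
    # Each entry in `chunks` can then be passed to gs or pdftk for merging.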

Batch convert .dwg to pdf/image by deleting certain layers

I have some .dwg files as input, and I want raster images as output that contain only certain layers. I found one method for parsing a single .dwg file, but I want to know whether there are any methods or APIs that could speed up this process, because I need to construct a dataset.
Thanks,
Zizhao

Writing a byte array of tags to tiff file using libTIFF

I was wondering, is there any way to write an array of bytes as a whole data block (a number of tags) to .tiff file using libTIFF?
As far as I know, a .tiff file is not streamable due to the arbitrary location of its data blocks (IFDs). But, as I assume, the data inside such a block is written in a predefined order. What I am trying to accomplish is to write the whole EXIF properties byte block from a JPEG file into the "Exif IFD" of my TIFF.
So, is there any function like TIFFSetField(), that populates the whole data block (IFD)?
You can write a TIFF data block of your own, and inside that block the data is formatted with a "private" format (this is the case, if memory serves, for RichTIFFIPTC). What you cannot do is send several tags to the TIFF object and expect them to end up in any particular order.
I believe that Photoshop, among others, always writes a fixed length data object as a single tag, and then rewrites its innards at leisure.
Because the EXIF tag collection and the TIFF tag collection overlap, though, you cannot do this and still have your tags readable by libTIFF:
[tag1],[tag2],[tag3] ---> [privateTiffLongObject] --> not re-readable
[tag1],[tag2],[tag3] ---> [Tiff2],[Tiff3],[Tiff1] --> re-readable
That said, what is it that you're trying to accomplish? To simply shuttle tags from a JPEG file to a TIFF file, I'd venture that exiftool should be enough. I have often employed a workflow like the following:
(image) --> exiftool --> XML --> XML parsers -->
--> exiftool --> (new image)
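If the XML round trip isn't needed, a minimal sketch of the direct variant of that workflow, calling exiftool's -TagsFromFile option from Python (the file names are hypothetical, and exiftool is assumed to be on the PATH):

    import subprocess

    src = "photo.jpg"  # hypothetical source JPEG
    dst = "photo.tif"  # hypothetical target TIFF

    # Copy all EXIF tags from the JPEG into the TIFF, letting exiftool
    # handle the IFD layout instead of writing it through libTIFF.
    subprocess.run(
        ["exiftool", "-overwrite_original", "-TagsFromFile", src, "-exif:all", dst],
        check=True,
    )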
Of course, if you need to do this for a large batch of images, performance may become a problem. That issue can be tackled more easily with RAM disks and SSD devices, though.
"Hacking" the TIFF format might leave you with files that are efficiently written and correctly handled by the software tools you now have, but won't be compatible with some other tool elsewhere -- and this might be discovered after you've done weeks of work.
Apparently not. The documentation states:
The TIFF specification requires that all information except an 8-byte
header can be placed anywhere in a file. In particular, it is
perfectly legitimate for directory information to be written after the
image data itself. Consequently TIFF is inherently not suitable for
passing through a stream-oriented mechanism such as UNIX pipes.
Software that require that data be organized in a file in a particular
order (e.g. directory information before image data) does not
correctly support TIFF. libtiff provides no mechanism for controlling
the placement of data in a file; image data is typically written
before directory information.

Compress a large number of small files in different folders

In our application, we save some images in different folders like:
1
2
3
4
...
500
...
And inside each folder there is a large number of images, each about 5 KB - 20 KB in size.
Now we have found that when we try to transfer these files, we have to compress them first using WinRAR; however, this takes far too much time: around two hours to compress one parent folder.
In fact, the images in the application are map images, like Google Maps tiles.
So I wonder: is there a good way to save/transfer this large number of small files?
Images like that are likely already compressed, so you will get little gain in bandwidth use (and hence transfer speed) from the compression step.
If the compression process is taking a long time because your CPU is busy, try instead just creating a plain tar file (which joins all the files into one archive without applying any compression). I don't know about WinRAR, but most other compression tools (like 7-Zip) can generate a tar file, so I'm guessing WinRAR can too.
If you regularly transfer the whole set of files but only a small number are added or changed each time, you might want to look into other transfer methods like rsync. You don't describe either of your environments, so I can't tell whether it is available to you, but if it is, rsync does an excellent job of transferring only the changes (speeding up the transfer significantly). It also uses a single connection, so you don't get hit by the per-file latency of FTP and other protocols: one file follows the previous one down the same connection, as if the parts being transferred had been tarred together, so you don't need that extra step to pack the files at one end (and unpack them at the other).
Those images are already compressed. However, to increase transfer speed, you might try using rar in 'archive' mode. This does the same thing as tar: concatenates all the files together into one big file. Don't use any compression in your archive format.
Maybe you can use a fast compression library like Snappy. However, it can only compress a single file, and you surely don't want to transfer each file separately. I'd create an uncompressed TAR archive for that.
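A minimal sketch of that uncompressed-tar step with Python's tarfile module (the paths are hypothetical):

    import tarfile
    from pathlib import Path

    src_root = Path("tiles")  # hypothetical parent folder holding 1, 2, 3, ... 500, ...
    # "w" mode writes a plain tar with no compression applied.
    with tarfile.open("tiles.tar", "w") as tar:
        tar.add(src_root, arcname=src_root.name)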
