Extracting images from multiple PDF files using hexapdf: "No such file or directory @ rb_sysopen" (Ruby)

I'm working on my master's thesis and have to extract images from about 500 PDF files. Some people recommended hexapdf for this. I was able to install Ruby and hexapdf, but now I'm stuck getting the images out of the PDFs, since I don't have a coding background. Any tips?
Thanks in advance.
I tried the basic command on a single PDF to see what would happen, running 'hexapdf images' followed by the PDF name, but the result was 'No such file or directory @ rb_sysopen'.

The error No such file or directory @ rb_sysopen means that the file you are trying to open does not exist at the path you gave. Here that is most likely the PDF you are trying to extract images from.
Check that you are following the hexapdf documentation and that the path to your PDF is correct. A relative path is resolved against the directory you run the command or script from. If the file with your code and the PDF are in the same directory and you run your code from that directory, you would do something like:
require 'hexapdf'
doc = HexaPDF::Document.open('my_pdf_document_filename.pdf')
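If the file is somewhere else on the machine, it is usually easiest to use a full (absolute) file path instead of a relative one. Its exact form depends on your system (e.g. /Users/username/thesis/image_processing/files/my_pdf_document_filename.pdf). The same applies on the command line: either change into the folder that contains the PDF before running hexapdf images, or pass the full path to the PDF.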
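For the case of about 500 PDFs, a loop along the following lines could call the hexapdf command once per file, always with an absolute path. This is a minimal sketch under stated assumptions, not a tested recipe: the folder names pdfs and extracted_images and the helper names pdf_files_in, extract_command and extract_all are hypothetical, and it assumes that hexapdf images --extract without an index extracts every image and that --prefix sets the prefix of the output files. Please confirm both options with hexapdf help images for your installed version.

```ruby
require 'fileutils'

# Returns the absolute paths of all .pdf files directly inside `dir`, sorted.
def pdf_files_in(dir)
  Dir.glob(File.join(File.expand_path(dir), '*.pdf')).sort
end

# Builds the argument list for one `hexapdf images` call.
# ASSUMPTION: `--extract` with no index extracts all images and `--prefix`
# sets the output file name prefix; verify with `hexapdf help images`.
def extract_command(pdf_path, output_dir)
  prefix = File.join(output_dir, File.basename(pdf_path, '.pdf'))
  ['hexapdf', 'images', '--extract', '--prefix', prefix, pdf_path]
end

# Runs the command for every PDF in `pdf_dir` and reports failures.
def extract_all(pdf_dir, output_dir)
  FileUtils.mkdir_p(output_dir)
  pdf_files_in(pdf_dir).each do |pdf|
    ok = system(*extract_command(pdf, output_dir))
    # `system` returns nil if hexapdf was not found, false if it failed.
    warn "hexapdf failed for #{pdf}" unless ok
  end
end

# Example call (hypothetical folder names):
# extract_all('pdfs', 'extracted_images')
```

Because pdf_files_in expands every path first, the script can be started from any directory without running into the "No such file or directory" error caused by a wrong relative path.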

Related

Biopython: SeqIO.parse() FileNotFoundError

I'm new to bioinformatics and Biopython, so I'm having some difficulties.
I was reading the Biopython (SeqIO) documentation, but when I try to execute some SeqIO.parse() commands I get a FileNotFoundError.
For example, I want to get the "example.fasta" file (which I don't have on my PC). I try to do it with this command:
from Bio import SeqIO
for record in SeqIO.parse("example.fasta", "fasta"):
    print(record.id)
But all I get is FileNotFoundError: [Errno 2] No such file or directory
Can someone help me with this?
FileNotFoundError occurs when the code tries to open a file on your computer and does not find it.
This can happen because you simply do not have the file, because the name contains a typo, or because the path to the file is not correct. This is an important notion: the path must be either absolute or relative to the current working directory (usually the directory from which you ran the Python script).
As suggested in the comments on your question, you seem to expect SeqIO.parse to fetch the file for you. This is not the case. The first argument you give to this function ("example.fasta" in the example) is the path to an existing file that you want to parse, that is, to interpret its content and make it available to the rest of your program in a convenient form.
So to get this example working, you first need a FASTA file. If you do not already have one, you can download one manually from GenBank, or find one in the Biopython installation (if you installed it from source and know where the source code is located), for instance Tests/Quality/example.fasta.

How do I display local image in markdown?

Does anyone know how to display a local image in markdown? I don't want to set up a webserver for that.
I tried the following in Markdown, but it doesn't work:
![image](files/Users/jzhang/Desktop/Isolated.png)
I suspect the path is not correct. As mentioned by user7412219, Ubuntu and Windows handle paths differently. Try putting the image in the same folder as your notebook and use:
![alt text](Isolated.png "Title")
On Windows, the desktop should be at: C:\Users\jzhang\Desktop
The following works with a relative path to an image in a subfolder next to the document:
![image info](./pictures/image.png)
Solution for Unix-like operating systems.
Step by step:
Create a directory named, for example, Images and put all the images that the Markdown will render into it.
For example, put example.png into Images.
Then load example.png from the Images directory with:
![title](Images/example.png)
Note: the Images directory must be located in the same directory as your Markdown text file, which has the .md extension.
To add an image to a Markdown file, the simplest option is to keep the .md file and the image in the same directory. In my case the .md file was in the doc folder, so I moved the image into that folder as well. After that, write the following syntax in the .md file:
![alt text](filename)
like ![Car Image](car.png)
This has worked for me.
The best solution is to provide a path relative to the folder where the md document is located.
A browser probably has trouble resolving the absolute path of a local file. That can be solved by accessing the file through a web server, but even then the image path has to be right.
Having a folder at the same level as the document, containing all the images, is the cleanest and safest solution.
It will load on GitHub, locally, and on a local web server.
images_folder/img.jpg < works
/images_folder/img.jpg < this will work on web servers only (please read the note below)
Note: with the absolute path, the image will be accessible only through a URL like this: http://hostname.doesntmatter/image_folder/img.jpg
If the image file name contains parentheses (round brackets), it won't display:
![alt text](Isolated(1).png)
Rename the image to remove the brackets:
![alt text](Isolated-1.png)
Update:
If you have spaces in the file path, consider renaming it too, or, if you use JavaScript, encode it using
encodeURIComponent(imagePath)
Also, try to always save images and other files with lowercase names. It's a good habit to develop, though that is just my personal view.
Adding a local image worked for me like so: ![alt text](file://IMG_20181123_115829.jpg)
Without the file:// prefix it did not work (Win10, Notepad++ with the MarkdownViewer++ add-on).
Edit: I found out it also works with HTML tags, and that is way better:
<img src="file://IMG_20181123_115829.jpg" alt="alt text" width="200"/>
Edit2: In Atom editor it only works without the file:// prefix. What a mess.
Depending on your tool, you can also inject HTML into Markdown.
<img src="./img/Isolated.png">
This assumes your folder structure is:
├── img
│   └── Isolated.png
└── README.md
Edited:
Working for me (for a local image):
![system schema](doc/systemDiagram.jpg)
tree
├── doc
│   └── systemDiagram.jpg
└── README.md
The Markdown file README.md is at the same level as the doc directory.
In your case, your Markdown file should be at the same level as the files directory.
Working for me (absolute url with raw path)
![system schema](https://server/group/jobs/raw/master/doc/systemDiagram.jpg)
NOT working for me (url with blob path)
![system schema](https://server/group/jobs/blob/master/doc/systemDiagram.jpg)
Just add the image's path relative to the Markdown file:
![localImage](./client/src/assets/12.png)
This worked for me on Ubuntu:
![Image](/home/gps/Pictures/test.png "a title")
Markdown file is in:
/home/gps/Documents/Markdown/
Image file is in:
/home/gps/Pictures/
To my knowledge, in VS Code on Linux, a local image is displayed correctly only when it is in the same folder as your .md file,
i.e. only ![](image.jpg) or ![](./image.jpg) will work.
Even an absolute path like ![](/home/bala/image.jpg) doesn't work.
In Jupyter Notebook Markdown, you can use
<img src="RelPathofFolder/File" style="width:800px;height:300px;">
Another possible reason a local image is not displayed is unintentional indentation of the image reference, i.e. spaces before ![alt text](file).
This turns it into a code block instead of an image inclusion. Just remove the leading spaces.
You may find the reference-style syntax, similar to reference links in Markdown, handy, especially when your text displays the same image many times:
![optional text description of the image][number]
[number]: URL
For example:
![][1]
![This is an optional description][2]
[1]: /home/jerzy/ComputerScience/Parole/Screenshot_2020-10-13_11-53-29.png
[2]: /home/jerzy/ComputerScience/Parole/Screenshot_2020-10-13_11-53-30.png
I've had problems inserting images in R Markdown. If I use the full path, C:/Users/Me/Desktop/Project/images/image.png, it tends to work. Otherwise, I have to put the R Markdown file either in the same directory as the image or in an ancestor directory of it. It appears that the declared knitting directory is ignored when referencing images.
Either put the image in the same folder as the markdown file or use a relative path to the image.
Just copy the image and then paste it; you will get output like this:
![image.png](attachment:image.png)
The basic syntax is ![Image description](Any_Image_of_your_choice.png "title"). In my case, the image name contained spaces and I wrote ![Image description](Any\ Image\ of\ your\ choice.png) instead of ![Image description](Any_Image_of_your_choice.png), and it was not working. So check the image directory and make sure the image name doesn't contain spaces; if it does, replace them with underscores (_).
I faced this issue while using Markdown in a Jupyter notebook on Ubuntu 18.04.
I found a solution:
a) Example Internet:
![image info e.g. Alt](URL Internet to Images.jpg "Image Description")
b) Example local Image:
![image Info](file:///<Path to your File><image>.jpg "Image Description")
![image Info](file:///C:/Users/<name>/Pictures/<image>.jpg "Image Description")

Read from a tar.gz file without saving the unpacked version

I have a tar.gz file saved on disk and I want to leave it packed there, but I need to open one file within the archive, read from it and save some information somewhere.
File structure:
base_folder
  file_i_need.txt
  other_folder
    other_file
Code (it is not much; I tried ten million different ways and this is what is left):
require 'rubygems/package'
require 'zlib'

def self.open_file(file)
  uncompressed_file = Gem::Package::TarReader.new(Zlib::GzipReader.open(file))
  uncompressed_file.rewind
end
When I run it in a console I get
#<Gem::Package::TarReader:0x007fbaac178090>
and I can run commands on the entries. I just haven't figured out how to open an entry and read from it without saving it unpacked to disk. I mainly need the string from the text file.
Any help appreciated. I might just be missing something...
TarReader is Enumerable and yields Entry objects.
That said, to retrieve the text content of a file by its name, one might do:
uncompressed = Gem::Package::TarReader.new(Zlib::GzipReader.open(file))
text = uncompressed.detect do |f|
  f.full_name == 'base_folder/file_i_need.txt'
end.read
#⇒ Hello, I’m content of the text file, located inside gzipped tar
Hope it helps.
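One thing to watch with the snippet above is that detect returns nil when no entry matches, so calling read on the result would raise NoMethodError. The following is a minimal sketch of a variant that returns nil in that case and closes the gzip stream when done. The method name read_entry_from_tar_gz and the archive name in the usage comment are hypothetical.

```ruby
require 'rubygems/package'
require 'zlib'

# Returns the content of the entry named `entry_name` inside the gzipped tar
# at `archive_path`, or nil if there is no such file entry.
# Nothing is unpacked to disk; the gzip stream is closed by the block form.
def read_entry_from_tar_gz(archive_path, entry_name)
  Zlib::GzipReader.open(archive_path) do |gz|
    tar = Gem::Package::TarReader.new(gz)
    tar.each do |entry|
      # Skip directories and other non-file entries.
      return entry.read if entry.file? && entry.full_name == entry_name
    end
  end
  nil
end

# Example call (hypothetical archive name):
# text = read_entry_from_tar_gz('archive.tar.gz', 'base_folder/file_i_need.txt')
# puts text unless text.nil?
```

The string comes back with binary encoding, so for text in a known encoding it may be worth calling force_encoding('UTF-8') on the result.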

openFile with pandoc 1.13.2 - Windows 8.1

Sorry for my English (this is my first post on this forum, and my question is perhaps stupid).
I ran into a problem converting an HTML file to a PDF file with pandoc.
Here is what I typed in the console:
set Path=%Path%;C:\Users\nicolas\AppData\Local\Pandoc
(adding the Pandoc directory to the Path)
followed by
pandoc --data-dir=C:\Users\nicolas\Desktop essai.html -o essai.pdf
As indicated, my file is on the Desktop, but I got the following error:
pandoc: essai.html: openFile: does not exist (No such file or directory)
I get the same error if I run the following (with essai.html in the same folder as pandoc.exe):
pandoc essai.html -o essai.pdf
Do you have any idea what causes my problem? (To be precise, the name of the file I want to convert is correct.)
Remark: My original goal was to create a PDF faithful to the nice HTML file generated by IPython Notebook via pandoc, but I hit the same kind of problem when I try to convert a .ipynb file to PDF with nbconvert.
I finally solved my problem by giving the full paths to my files. Note that --data-dir sets pandoc's user data directory; it does not change where pandoc looks for input files. (In the end I used wkhtmltopdf, which is simpler to use and gives a good result.)

Listing the contents of a LZMA compressed file?

Is it possible to list the contents of an LZMA file (.7z) without uncompressing the whole file? Also, can I extract a single file from the LZMA file?
My problem: I have a 30GB .7z file that uncompresses to >5TB. I would like to manipulate the original .7z file without needing to do a full uncompress.
Yes. Start with XZ Utils. There are Perl and Python APIs.
You can find the file you want from the headers. Each file is compressed separately, so you can extract just the one you want.
Download lzma922.tar.bz2 from the LZMA SDK files page on SourceForge, then extract the files and open C/Util/7z/7zMain.c. There you will find routines to extract a specific file from a .7z archive. You don't need to extract all the data from all the entries; the example code shows how to extract just the one you are interested in. The same code has logic to list the entries without extracting all the compressed data.
I solved this problem by installing 7-Zip (https://www.7-zip.org/) and using the l (list) command. For example:
7z l file.7z
The output has some descriptive information and the list of files in the archive. Then I call this from Python using the subprocess library:
import subprocess
# List the archive contents; check=True raises CalledProcessError if 7z exits with an error
result = subprocess.run(["7z", "l", "file.7z"], stdout=subprocess.PIPE, check=True)
output = result.stdout.decode("utf-8")
Don't forget to make sure the program 7z is accessible in your PATH variable. I had to do this manually in Windows.
