How to convert an image to a table

I have an image of a table (in my case a .gif) and want to extract the table from it (ideally as .ods).
Is there any way to do so? Doing it manually is ruled out, since the table has more than 1000 rows and 6 columns.
Here is a part of the image / table:

You will be able to get most of it through OCR, but you'll need to manually verify the data and fix some inaccuracies that will be there. It definitely won't be perfect.
First thing to do is to ensure you have a good quality image for the OCR software:
Here's what I did with your sample png (I'm using Windows):
I opened the image in The Gimp.
Removed the orange/blue backgrounds:
a) Select -> By Color and clicked the blue background
b) I held down Shift and clicked the orange background (this will add it to the current selection)
c) Edit -> Fill With BG Color (this sets it to white)
d) Ctrl-Shift-A to cancel the selection
I removed the partially cut off '305' line:
a) I used the Rectangular Select tool button from the palette and filled the selection with BG Color, as above
Let's remove the table border:
a) Click the 'Fuzzy Select' tool button from the palette
b) Click somewhere on the table border (you should see the 'marching ants' instead of the border)
c) Edit -> Fill With BG Color
d) Ctrl-Shift-A to cancel the selection again
We need to increase the number of pixels that the numbers use so that the OCR can better detect their shapes
a) Image -> Scale Image. I chose to scale by 1000% with Linear Interpolation (the other interpolations won't work as well)
Download and install Tesseract from GitHub
a) At the command prompt type (include the double-quotes to cope with spaces within the path, & change your paths as necessary):
"D:\Program Files (x86)\Tesseract-OCR\tesseract" "d:\temp\your_image.png" "d:\temp\your_txt_file_output"
The output will be a text file with an appended .txt extension. It will still have a few artifacts, but we can easily correct those in Notepad++ (or similar):
a) The commas were seen as full-stops, so I did a Find and Replace of "." with "," (I'm assuming you don't have any decimal points in the data!)
b) There were some spaces before a few commas, so I did Find and Replace " ," with "," (note I included a space before the comma in the Find)
c) There were still some spaces in the numbers, so I did a Find and Replace of " " with "" (a space with an empty replace)
This gave the following result:
298 299 300 301 302 303 304
910,820,000 920,820,000 930,820,000 941,820,000
952,820,000 983,820,000 9?4,820,000 210,000
220,000 220,000 220,000 220,000 220,000
220,000 2,500 2,500 3,000 3,000
3,000 3,000 3,000 19,000 19,000
20,000 20,000 20,000 20,000 20,000
Note the question mark in the place of 7 in the second block of text. Things like that still need to be tidied up.
Lastly, you'd copy and paste the rows of text into your spreadsheet etc.
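The text clean-up and the final copy into a spreadsheet could also be scripted. The following is only a sketch in Python, not what was actually done above. It assumes, as in the sample result, that the OCR read the table column by column, that every column has the same number of rows, and that the data contains no decimal points. The function names and the n_rows parameter are made up for illustration, and only spaces next to a comma are removed, so other stray spaces inside a number would still need fixing by hand.

```python
import csv
import io
import re


def clean_ocr_text(raw: str) -> str:
    """Apply the find-and-replace fixes from the steps above."""
    # Full stops were misread commas (assumes no decimal points in the data).
    text = raw.replace(".", ",")
    # Remove stray spaces or tabs on either side of a comma inside a number.
    return re.sub(r"[ \t]*,[ \t]*", ",", text)


def columns_to_rows(text: str, n_rows: int) -> list[list[str]]:
    """Regroup column-by-column OCR output into table rows.

    Assumption: the tokens arrive one whole column after another,
    each column holding exactly n_rows values.
    """
    tokens = text.split()
    if len(tokens) % n_rows != 0:
        raise ValueError("token count is not a multiple of the row count")
    columns = [tokens[i:i + n_rows] for i in range(0, len(tokens), n_rows)]
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialise rows as CSV; values containing commas are quoted."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

For the sample above, n_rows would be 7. The resulting CSV text can be saved to a file and opened in a spreadsheet application. Misread characters such as the question mark still need a manual check.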

I wanted to post another option I finally found online.
https://convertio.co/es/ocr/
That said, I think K Scandrett's answer deserves to be the accepted one, since it doesn't rely on a URL that might go down.

If this is a one-time or rare need, you are a Windows user, and you have Microsoft Excel installed, Excel supports extracting the data from an image into a worksheet. Follow this link for the complete reference.


Why do my table borders look weird in PDF viewers?

I generated a table with iText7 (C#):
var cell = new Cell().Add(new Paragraph(headers[c]).SetFont(font).SetFontColor(ColorConstants.WHITE).SetFontSize(size).SetBold());
cell.SetBackgroundColor(color);
cell.SetTextAlignment(iText.Layout.Properties.TextAlignment.CENTER);
cell.SetPadding(0);
cell.SetBorder(new SolidBorder(1));
table.AddCell(cell);
The document contains the table, but at certain scalings it looks weird at the edges:
Taking a closer look on the image above:
If however I change the zoom in the viewer directly, it looks OK:
How do I get rid of these unnecessary parts from the border?
I'm attaching the resulting PDF here for reference:
Download sample PDF
I also noticed that on iText KB pages, there is this kind of behavior:
https://kb.itextpdf.com/home/it7kb/faq/how-do-i-change-the-border-color-of-a-pdfpcell
See the red and blue bars' left edges:
This behaviour is not uncommon in PDF or other print drivers where vectors are printed rather than plotter definitions; the stray line ends are often called "dangles". It would be worse if the line cap were round or square rather than butt, and a "mitre" join cannot apply here (see below). The overlap is intentional, to ensure both lines are fully covered. On a laser drum printer that may be desirable overkill, but it is disastrous on any inkjet or screen. It looks like the cell is not bordered by a box but by shared straight vectors. Again, this is often a desirable optimisation, but not when the line weight is not honoured. So the result depends on whether the viewer uses the correct thickness.
All desktop PDF viewers I tested (including Chrome and Firefox) showed the lines correctly, as a clean overlap without "dangles". Acrobat has a reputation for undesirably thickening or thinning its standard defined lines depending on its user settings.

Find the number of displayed lines, with folding in Ace

Within an Ace editor, it is easy to find the number of lines in the edited document with the following:
myEditor.session.getLength();
But languages like JSON or XML can be "folded." That is, children properties or elements can be collapsed so only one single line is displayed for the parent.
Is there a way to get the number of lines actually displayed? Something like the following:
myEditor.session.getVisibleLength();
Note: the ultimate goal is to have an editor that adapts its height on the page to the content it displays (if lines are collapsed, then it should shrink, and if collapsed lines are expanded again, it should increase its height.)
UPDATE: After a user's response, I use the following. This is not the answer to the specific question I asked above, but rather the perfect answer to what I was trying to achieve overall:
const myEditor = ace.edit(elem, {minLines: 5, maxLines: 50});
To automatically change the height of the editor use maxLines option, but don't set it to a very large value as performance depends on the number of displayed lines.

How to convert xlsx to pdf on one page

I have a 13-column xlsx file and I want to convert it to PDF.
I use this code: "soffice", "--headless", "--convert-to", "pdf", filepath, "--outdir", outpath.
The conversion works, but there are too many columns, so they are spread across four pages.
I need them all to appear on one page.
The output is also in portrait orientation, and I need it in landscape.
Thanks
XLSX printout settings (used for PDF export) are part of the file contents, so here is the same file saved with different settings but converted with the same export command. (--convert-to implies --headless, so that flag is not generally needed.) The author decides each cell's content and shape, and also sets how many rows and columns will fit on a standard page such as A4 portrait or A4 landscape. Thus only a macro can change the print layout area. The best that may be possible externally is to scale the output up or down onto bigger or smaller paper.
soffice --convert-to pdf:calc_pdf_Export "DataTables example Default.xlsx"
soffice --convert-to pdf:calc_pdf_Export "DataTables example A3.xlsx"
You need to change the layout for printing and export in the preview screen. If you want 13 columns, set the print area from cell A1 to column M, row Y, where Y is your desired number of rows (whatever their variable height may be).
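If the conversion is launched from a script, the export command above can be wrapped as shown here. This is a minimal Python sketch, assuming soffice is on the PATH. The function names are made up for illustration. It does not change the page layout: as described above, the print area and orientation have to be saved in the xlsx file first.

```python
import subprocess
from pathlib import Path


def build_convert_command(xlsx_path: str, out_dir: str,
                          soffice: str = "soffice") -> list[str]:
    """Build the soffice command line for a Calc-to-PDF export."""
    return [
        soffice,
        "--convert-to", "pdf:calc_pdf_Export",  # filter used in the answer above
        "--outdir", out_dir,
        xlsx_path,
    ]


def convert_to_pdf(xlsx_path: str, out_dir: str) -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    # check=True raises CalledProcessError if soffice exits with an error.
    subprocess.run(build_convert_command(xlsx_path, out_dir), check=True)
    return Path(out_dir) / (Path(xlsx_path).stem + ".pdf")
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces, such as "DataTables example A3.xlsx".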

How to make two rows of words as big as one word in InDesign?

I'm not sure how to express it, so I posted a picture in the link below.
It should look like this
Just enter the text on 3 lines like so:
MORE
AT
THE HALL
Then adjust the point sizes, leading, kerning, etc. to create the aesthetic you want.
In this case line 1 and 3 could have full justification.
You can scale the text instead (as shown in the Character panel in the attached snapshot), because changing the font size also moves the baseline and causes the text to shift downward.
These attributes are also exposed via scripting.

iTextSharp stamper wraps text

I'm using iTextSharp to fill in some stamper AcroFields.
stamper.AcroFields.SetField("Title", "Lipsum");
I created the PDF in Illustrator and the form fields with Adobe Acrobat X Pro. The problem is that although the text fields are the width of the page, in the saved PDF the text wraps at about one third of the width.
Another question: is it possible to have the text field auto-size in height, or is there a way to handle text overflow?
1) I'd like to see that PDF. I suspect the fields aren't as wide as you think they are.
2) You can set a field's font size to zero to enable "auto sizing", which works both within Reader and iText. However, it sizes to the actual field size, not what you think it might be.
I'm guessing you drew a spiffy form field background in Illustrator, then put a field over it in Acrobat Pro, but didn't size the field width to match the spiffy illustrator background. Could be wrong, but that's my hunch.
That's the flattened PDF. Can I see the original with the form field still intact? Sorry I wasn't more specific. Nonetheless, I can learn a little from reading this PDF:
Looking at the bounding boxes for the flattened field XObject and its internal clipping rectangle, it looks like it should be using most of the page:
The page is ~600 points wide by ~850 tall.
The flattened field XObject is ~560 points wide by ~100 tall.
I wonder if there are some non-standard carriage return characters in your text that iText picks up on but Acrobat does not...
Anyway, I'd like to see the unflattened PDF. Filled in is good, but not flattened.
Okay, looked at the template. I don't see anything that would cause the line breaking you're seeing... which makes me think my second guess was right: new line characters.
Looking at the text layout code might give me a hint. Each of your lines of text goes like this (for example):
1 0 0 1 2 88.24 Tm 0 g (Die Semmerrolle der l{e4}nge nach zu einer grossen Roulade)Tj
n n n n n n Tm: text matrix
g: gray (0 g: black)
(...)Tj: show text
That's consistent with the code path when you set a text field value in the trunk of iText (and the most recent release[s]). That code (ColumnText) is quite good at breaking text properly, and used all over the place. The bounding box is correct (as shown in a couple places of the flattened PDF).
Check your input.
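One way to check the input is to scan the string for line-break characters before passing it to SetField. The sketch below is in Python rather than C#, purely to illustrate the idea. The set of characters treated as line breaks is an assumption, not something taken from the iText source, and the function name is made up.

```python
# Characters commonly treated as line breaks (an assumed list, for illustration).
LINE_BREAK_CHARS = {"\n", "\r", "\x0b", "\x0c", "\x85", "\u2028", "\u2029"}


def find_line_breaks(value: str) -> list[tuple[int, str]]:
    """Return (index, code point) for every line-break character in value."""
    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(value)
        if char in LINE_BREAK_CHARS
    ]
```

An empty result means none of these characters are present. Otherwise the indexes show where the text would be forced onto a new line.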
