How to get the value displayed in an image created from a Base64-encoded string? - ruby

I have a Base64-encoded string that is used as the source of an image element on a website, and I need to extract the value shown in that image. Is there any tool that can pull this information out of the image produced by decoding the Base64 string? An example is here:
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAALQAAAArCAYAAADR/aKkAAABUElEQVR42u3cW27EIBAFUfa/abKCiRJ7+kFzrsQnFLYKgy3wWiIiIiIiIiIiIiIiIiIin7L/UKr7VnHdC/cM7n5RqgZW9gCOYOMGcPeXyi03uep6cZOF3ofd4FU4iHEDuf+pHHHRHZ8YkUsf3GBudCefthW5hn7S7jf6gpvDLX1L/a1O9vSn7vl1W3+ByBQ6uj5uDvcqoU3fs7mEJhahJwod3Q5uDvcqoX0Cm81tLzOhcQlNaEK3HTWExo12LfvNk9C4JVtMF6GJdZLQE/ZDE4vQpTITGjd18/8iNLFOErrDWUJC44Zs+q86GEto3FdtdTvlbS8H7hiZM4W2620Gt+Uyg9C4r9voKHPUrOHkyPATK5vQzudNqdtZ5sj+mb6HcndQ6S60T2BDuR2EruT7g9Ew7u1Cr8JZCDfhN2A3Cr1WzyUVbsFUMEXo6pdk3IfcH7oMZxgkIFxpAAAAAElFTkSuQmCC">
This renders an image containing the value 210000, but I need some way, if possible, to get that actual value back as text.
If the only answer is some sort of OCR technology, any advice on where to start, specifically related to embedding this in a Ruby script, would be greatly appreciated.
Thanks in advance!

I have just tried your example with Tesseract, an open-source OCR engine, on an Ubuntu command line.
Going from base64 to digits looked like this:
$ base64 -d image.base64 | tesseract - - digits | sed -e 's/\ //g'
With the output:
210000
I'm afraid I don't know how that would integrate with Ruby, but I hope this helps you.

I had to solve this myself today, here's what I came up with:
require 'base64'
# Decode the data URI to a PNG file, then OCR it in single-word mode, keeping only digits
File.open('im.png','wb'){|f| f << Base64.decode64(src.sub('data:image/png;base64,',''))}
num = `tesseract -psm 8 -l eng im.png - digits`.gsub(/\D/,'')
I had to apt-get install tesseract-ocr and download the traineddata:
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
sudo mv -v eng.traineddata /usr/share/tesseract-ocr/tessdata/
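For reference, the same flow can be driven without a temporary file by piping the decoded bytes straight through tesseract's stdin and stdout (the - - arguments used in the first answer). A rough Ruby sketch, assuming tesseract and its English trained data are installed as described above:

require 'base64'

# Decode the data URI and pipe the PNG bytes straight through tesseract
def ocr_digits(src)
  png = Base64.decode64(src.sub('data:image/png;base64,', ''))
  text = IO.popen(['tesseract', '-', '-', 'digits'], 'r+') do |io|
    io.write(png)
    io.close_write
    io.read
  end
  text.gsub(/\D/, '')  # e.g. "210000"
end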

Related

Smart Quotes and Ligatures in pandoc

I have a file text.txt which contains very basic latex/markdown. For example, it might be the following.
Here is some basic maths: $f(x) = ax + b$ defines a straight line, often called a "linear" function---but it's not _actually_ a linear function, eg $f(0) \ne 0$.
I would like to convert this into HTML using WebTeX. However, I don't want smart quotes (" should be output as plain straight quotes, not curly ones) or smart dashes (--- should come through literally as three dashes, not an em dash).
It seems that the smart extension is the relevant thing here (pandoc manual, github 1, github 2). However, I can't quite work out the correct syntax. I have tried, for example, the following.
pandoc text.txt -f markdown-smart -t markdown-smart -s --webtex -o tex.html
Unfortunately this doesn't work.
I solved this while writing the question, so I'll post the answer below! (Spoiler alert: simply remove -t markdown-smart.)
Simply remove -t markdown-smart.
pandoc text.txt -f markdown-smart -s --webtex -o tex.html
I believe that -t markdown-smart means "output to Markdown, with the smart extension disabled". But we are not trying to output Markdown; we want HTML. If you view the version produced with -t, you will see that it contains the Markdown code for embedding the rendered images, and pasting it into a Markdown editor should display them.
To get HTML, simply remove that option.

Add part of filename as PDF metadata using bash script and exiftool

I have about 600 books in PDF format where the filename is in the format:
AuthorForename AuthorSurname - Title (Date).pdf
For example:
Foo Z. Bar - Writing Scripts for Idiots (2017)
Bar Foo - Fun with PDFs (2016)
The metadata is unfortunately missing for pretty much all of them, so when I import them into Calibre the Author field is blank.
I'm trying to write a script that will take everything that appears before the '-', remove the trailing space, and then add it as the author in the PDF metadata using exiftool.
So far I have the following:
for i in "*.pdf";
do exiftool -author=$(echo $i | sed 's/-.*//' | sed 's/[ \t]*$//') "$i";
done
When trying to run it, however, the following is returned:
Error: File not found - Z.
Error: File not found - Bar
Error: File not found - *.pdf
0 image files updated
3 files weren't updated due to errors
What about the -author= part is breaking here? Could someone please enlighten me?
You don't need to script this. In fact, doing so will be much slower than letting exiftool do it by itself, as that would require exiftool to start up once for every file.
Try this
exiftool -ext pdf '-author<${filename;s/\s+-.*//}' /path/to/target/directory
Breakdown:
-ext pdf: process only PDF files.
-author: the tag to copy to.
<: the "copy from another tag" option; in this case, the filename is treated as a pseudo-tag.
${filename;s/\s+-.*//}: copy from the filename, but first run a regex on it; here it finds one or more spaces, a dash, and the rest of the name, and removes them.
Add -r if you want to recurse into subdirectories. Add -overwrite_original to avoid making backup files with _original appended to the filename.
The error with your first command was that the value you wanted to assign contained spaces and needed to be enclosed in quotes.

Grep every word from a file starting a pattern

So I have a file, let's call it "page.html". Within this file, there are some links/file paths I want to extract. I've been working in bash trying to get this right but can't seem to do it. The words/links/paths I want to grab all start with "/funny/hello/there/". The goal is for all these words to go to the terminal so I can use them.
This is kinda what I've tried so far, with no luck:
grep -E '^/funny/hello/there/' page.html
and
grep -Po '/funny/hello/there/.*?' page.html
Any help would be greatly appreciated, Thanks.
Here is sample data from the file:
<td data-title="Blah" class="Blah" >
fdsksldjfah
</td>
My output gives me all the different lines that look like this:
fdsksldjfah
The "/fkljaskdjfl" are all something different though.
What I want the output to look like:
/funny/hello/there/fkljaskdjfl
/funny/hello/there/kfjasdflas
/funny/hello/there/kdfhakjasa
You can use this grep command:
grep -o "/funny/hello/there/[^'\"[:blank:]]*" page.html
However, one should avoid parsing HTML using shell utilities and use a dedicated HTML DOM parser instead.
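For example, in Ruby a few lines of Nokogiri will do it. This is only a sketch: it assumes the paths sit in href attributes of <a> tags, which the sample above doesn't actually show.

require 'nokogiri'

# Parse the page and collect every href that starts with the prefix
doc = Nokogiri::HTML(File.read('page.html'))
paths = doc.xpath('//a[starts-with(@href, "/funny/hello/there/")]').map { |a| a['href'] }
puts paths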

Search in a webpage using bash

I am trying to retrieve a webpage, search it for some pattern, retrieve that value and do some calculations with it. My problem is, I can't seem to figure out how to search for the pattern in a given string.
Let's say I retrieve a page like this:
content=$(curl -L http://google.com)
Now I want to search for a value I'm interested in, which is basically an HTML tag.
<div class="digits">123,456,789</div>
Now, I did try to find this using sed. My attempt looked like this:
n=$(echo "$content"|sed '<div class=\"digits\">(\\d\\d,\\d\\d\\d,\\d\\d\\d)</div>')
I want to pull that value every, let's say, 10 minutes, save it, and estimate when 124,xxx,xxx will be reached.
My problem is I don't really know how to save those values, but I think I can figure that out on my own. I'm more interested in how to retrieve that substring, as I always get an error because of the "<".
I hope someone is able and willing to help me :)
Better to use a proper parser with XPath:
xmllint --html --xpath '//*[@class="digits"]' http://domain.tld/
But it seems that the example URL you gave in the comments doesn't contain this class name. You can check that by first running:
curl -Ls url | grep -oP '<div\s+class="digits">\K[^<]+'
It's best to use a proper parser as @sputnick suggested.
Or you can try something like this:
curl -L url | perl -ne '/<div class="digits">([\d,]+)<.div>/ && print "$1\n"'
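As for saving the values over time, here is one rough sketch (in Ruby, since the rest of this page leans that way) that could be run from cron every 10 minutes. The URL and output file name are placeholders, and it assumes the value really does sit in a <div class="digits"> element:

require 'net/http'
require 'time'

url = URI('http://example.com/stats')   # placeholder URL
html = Net::HTTP.get(url)

# Pull the digits out of <div class="digits">123,456,789</div>
if html =~ %r{<div class="digits">([\d,]+)</div>}
  value = $1.delete(',').to_i
  # Append one "timestamp,value" line per run for later estimation
  File.open('digits.csv', 'a') { |f| f.puts "#{Time.now.iso8601},#{value}" }
end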

Finding number of graphics cards in Red Hat: weird error

So I know how to find the number of video cards, but in a Ruby script I wrote I had this small method to determine it:
def getNumCards
  _numGpu = %x{lspci | grep VGA}.split("\n").size
end
But I have determined I need to search for 3D as well as VGA, so I changed it to:
def getNumCards
  _numGpu = %x{lspci | grep "VGA\|3D"}.split("\n").size
end
But I am finding it returns 0 when I run the second. If I run the command on its own on the command line, it shows me 3 video cards (one onboard VGA and two NVIDIA Tesla cards that come up as 3D cards). I am not sure what is happening in the split part that may be messing something up.
Any help would be awesome!
Cheers
From man grep:
-E, --extended-regexp
...
egrep is the same as grep -E.
So, egrep should help.
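The likely reason the second version returns 0: inside %x{...} Ruby handles \| like a double-quoted-string escape and passes a plain | to the shell, and basic grep matches | literally, so nothing matches. With grep -E (egrep), | is an alternation operator and no escaping is needed. A sketch of the method with that change:

def getNumCards
  # -E makes | an alternation operator, so no backslash escaping is needed
  _numGpu = %x{lspci | grep -E "VGA|3D"}.split("\n").size
end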
I'd go after this information one of two ways.
The almost-purely command-line version would be:
def getNumCards
  # Backslashes are doubled because backtick strings treat \b as a backspace escape
  `lspci | grep -P '\\b(?:VGA|3D)\\b' | wc -l`.to_i
end
which lets the OS do almost all the work, except for the final conversion to an integer.
-P '\b(?:VGA|3D)\b' is a Perl regex that says "find a word-break, then look for VGA or 3D, followed by another word-break". That'll help avoid any hits due to the targets being embedded in other strings.
The more-Ruby version would be:
def getNumCards
  `lspci`.split("\n").grep(/\b(?:VGA|3D)\b/).count
end
It does the same thing, only all in Ruby.
