Decoding weird symbols in a text file - text-files

One of my text files has become seemingly encrypted with strange symbols such as
���,
ל�9I,
I'm wondering if there's a way to decode this? I've looked around for different ways and thought I was getting close a few times but I never found anything.
This is for a personal college project so Im hoping there's a way to convert this to a human readable form?

Related

All my barrels are named index...and its confusing

Im working in VSCode in a large project with lots of barrels. It's bothersome and unnecessarily confusing that all my barrels are called index. The file path helps to differentiate but I would like to be able to more easily differentiate between barrels. Is this possible?
Idk if this is where I would ask this question, but i searched all over the internet and couldn't find anything.

How to convert PDF to PDF/A-1a using ghostscript? What conditions are needed to convert to PDF/A-1a?

I already did a lot of research and realized that clear information about "How to generate PDF/A-1a" or "...convert to PDF/A-1a" is really rare. I found some information to convert to PDF/A-1a via GhostScript, but I didn't make it to get it working. So, maybe there are some necessary conditions for the data missing in the first place. Conditions like propper metadata of the PDF, structured data for readability by a screen reader, alternative text for pictures, and a declaration of the given language of the text. I need a proper working GhostScript command with the corresponding gs version and the mandatory file conditions to generate or even convert to PDF/A-1a. PDF/A-1b means nothing to me because I'm already able to convert to that.
Thanks for any help.

Rainmeter: How to concatenate strings

I am getting data from a broken RSS feed that gives me wrong link. I wanted to fix this link so I made this code:
<link.*>(.*)&.*tid(.*)</link>
and the link could be like:
www.somedomain.com/?value=50&burrrdurrrr;tid=120
But the real working link is in this form:
www.somedomain.com/?value=50&tid=120
The thing that I'm asking is if my measure thing looks like this:
[FeedURL]
Measure=Plugin
Plugin=Plugins\WebParser.dll
Url=[Feed]
StringIndex=2 ;now I only get www.somedomain.com/?value=50
Substitute=#SubstituteFeed#
How am I supposed to concatenate the strings together to complete the url?
I'm guessing rather than &burrrdurrrr;, the link has &, which is how you have to write & in an HTML or XML file.
If that's the case, you just need to set the DecodeCharacterReference option, as described in this handy-looking tutorial. Another option mentioned there is Substitute, which would be able to strip it out even if it really was &burrrdurrrr;.
None of this is a particularly sensible way of dealing with HTML or XML - a much better approach would be a plugin which actually parsed the document structure and let you reference nodes using XPath or CSS rules - but you work with what you've got, I guess. (I've never heard of this "Rainmeter" before, despite its claim to be "the best known and most popular desktop customization program for Windows"; maybe because nobody else calls their program that, instead almost universally using the word "widget"?)

Reliably parse unpredictable CSV formats

I'm looking for some kind of solution to a problem we're having importing CSV files with Ruby. We keep running into all kinds of exceptions ranging from malformed lines to line ending problems. Right now we're using FasterCSV and have this hacky exception catching solution to try different combinations of delimiters and quotation styles. I don't like it.
All in all, it's an inelegant solution and it seems like this shouldn't be something we should have to deal with. I'm looking for a lib, in any language, that I can point to a file and it'll just figure out how it's formatted and give me the data I need from any CSV.
thanks
The Python CSV package is pretty good at this. However, when dealing with unpredictable CSV formats, I expect you'll have to do maintenance no matter what library you pick.

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:
Try to sanitize the file first by re-distilling it.
Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this certainly is a matter of hours, days or weeks (depending on your knowledge about the PDF fileformat internals... I suspect you don't have much experience of that yet).
If "2." works, you may halfway be done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest to use Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest to first try pdftotext.exe (from the XPDF folks). There are other, a bit more inconvenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages but only 1-10 (for proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameter. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses "custom encoding vector") you should ask a new question, describing the details of your findings so far. Then you need to resort bigger calibres to shoot down the problem.
At the very least, could anyone point
me to a Ruby PDF library for this
task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure if any of these solutions is actually suitable for your problem, especially that you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text in predetermined locations. It may not be possible to determine what are rows and what are columns, but it may depend on the PDF itself.
The java libraries are the most robust, and may do more than just extract text. So I would look into JRuby and iText or PDFbox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn ruby library? link text

Resources