Marking or Tagging Non-structured Data - macos

I'm not entirely sure how to term this, but I've searched several phrases and haven't found what I need.
I have a whole lot of unstructured data that I need to get into a database. I used to do the heavy lifting with Needlebase and just clean up the data from there. But now that it's no more, I'm in want of a good way to quickly grab pieces of text beyond select, copy, paste, lather, rinse, repeat.
Ideally it would be something where I could select some text and a popup asks what it is (from a user-defined list: title, start time, image path, etc.) and then marks it as such. Naturally I would need to be able to mark the beginning and end of a record (all row data is consecutive, just not in an easily parseable format).
I could probably write something in a few hours that would do this, but I don't want to reinvent the wheel if something exists. I'm on OS X, but I'd be interested in software for any platform.
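As a rough illustration of the marking idea described above, here is a minimal sketch of the data model such a tool might produce. It assumes the marks are stored as character offsets into the raw text; the function name `build_rows` and the span layout are hypothetical, not taken from any existing program.

```python
from typing import Dict, List, Tuple

RecordSpan = Tuple[int, int]       # (start, end) offsets of one record
FieldMark = Tuple[int, int, str]   # (start, end, label) of one marked field


def build_rows(text: str,
               record_spans: List[RecordSpan],
               field_marks: List[FieldMark]) -> List[Dict[str, str]]:
    """Group labelled text selections into one dict per marked record."""
    rows = []
    for rec_start, rec_end in sorted(record_spans):
        row = {}
        for start, end, label in field_marks:
            # A field belongs to the record whose span fully contains it.
            if rec_start <= start and end <= rec_end:
                row[label] = text[start:end].strip()
        rows.append(row)
    return rows
```

Each resulting dict maps a label from the user-defined list to the selected text, which is a shape that is straightforward to insert into a database table.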

Is your data in HTML format? If so, you can use Jsoup.

Related

PDF - Edit raw text without special paid tool

Is there a way to edit the raw text from a PDF without any special paid software?
So there are PDFs with highlightable text. I assume that the text is stored somewhere in the file.
I tried to just drag & drop a PDF into VS Code, but it only showed me unknown characters and a little bit of metadata text; if I edit the metadata, the file mostly gets corrupted.
Apart from that, I could not find any of the text content of my PDF in the VS Code editor.
Does anyone know of a solution, like inspecting and changing the source code somehow, without special software? I want to edit the contents, not the metadata.
(I use macOS)
The text you see on a PDF page can be constructed in dozens of different ways; there are millions of users, potentially using hundreds if not thousands of different methods.
Update
The question is about macOS, but for a native cross-platform approach you need to work in mime text/pdf to be universally useful. By way of example of how that's possible, specifically on Windows, you can write a PDF line by line using, say, cmd. Here is a snippet of what was a few dozen lines :-)
echo %%PDF-1.0>demo.pdf
echo %%µ¶µ¶>>demo.pdf
echo/>>demo.pdf
for %%Z in (demo.pdf) do set "FZ1=%%~zZ"
echo 1 0 obj>>demo.pdf
echo ^<^</Type/Catalog/Pages 2 0 R^>^>>>demo.pdf
echo endobj>>demo.pdf
echo/>>demo.pdf
For the fuller "feature creep" version, now more than 100 lines and counting, see
https://github.com/GitHubRulesOK/MyNotes/raw/master/MAKE-PDF.cmd
However, although plain text is the simplest form, it is rarely used except to prove the conceptual point that it is possible. The rest of the time, "special software" as you call it (a PDF generator/editor) will be used to compress the file objects, most frequently as different optimized binary streams.
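To illustrate why the page text does not show up in a text editor, the following sketch compresses a hypothetical content stream with zlib, which is the deflate method behind the common /FlateDecode filter. The stream contents here are invented for the example, and a real file may use other filters or encodings entirely.

```python
import zlib

# A hypothetical page content stream: text operators that draw one string.
content = b"BT /F1 12 Tf 72 720 Td (Hello from a PDF page) Tj ET\n" * 3

# Roughly what a PDF generator does before writing a /FlateDecode stream.
compressed = zlib.compress(content)

# Searching the compressed bytes for the visible words is what an editor
# search effectively does on the raw file.
print(b"Hello" in compressed)

# Inflating the stream recovers the readable operators again.
print(zlib.decompress(compressed).decode("latin-1"))
```

Even after inflating a stream, the text may still be unreadable if the font uses a custom encoding, so this only covers the simplest case.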
So some text may be scanned pixels, while other text may be line shapes that look like letters; at other times it is plain letters without fonts but with a named style, or letters with the font included (embedded) in the file (the preferred option).
In many ways each page may be built differently from the others, so in general no two PDFs will use the same structure, unless, like a bank statement, they use a format that does not change much from month to month, even if the balance wobbles about.
So in summary, the tool that will work best is the one that covers every single permutation that Adobe dreamed of and still keeps the result a valid Adobe PDF.
Thus Acrobat PRO 3D is on my shelf (even if not used from one year to the next).
There are many cheaper editors; the ones I use more often for small mods are Tracker Xchange and FreePDF PRO, and both have different limitations.
Your choices on macOS will be more limited, so search for the best one you are willing to pay for.

Converting All Blocks to Lines and Text

When I receive a drawing, I wish to remove all definitions from previous drafters, such as blocks, styles, layers, groups, xrefs, etc. in order to retain only primitives: texts, lines and arcs, in summary, a single flat drawing.
This is a very routine activity, and I've found many dissimilar answers across the internet, often involving non-standard, non-canonical combinations of the following commands:
LAYMRG, PURGE
AUDIT
SELECTSIMILAR
WBLOCK
EXPLODE, XPLODE
DIMSTYLE, BATTMAN
DXFOUT, WMFOUT, DXFIN, WMFIN
BURST
Unfortunately, after applying most of them, the result sometimes retains many non-purgeable objects, including:
Non-explodable blocks,
Dimensions with their own styles,
Blocks losing their text attributes (by XPLODE),
Changed fonts (by WMFOUT),
Does AutoCAD have some canonical way to do this?
I think it's not so easy. If there is such a command, I don't know of it, but...
In the situation you described, you should attach the drawing you receive as an external reference (XRef). That way you can display the drawing darker or lighter without making so many changes to it. Also, if you get a new version of the file, for example because the architect made some changes, you don't need to do anything except perhaps reload the file, and the new version is displayed.
You will have two separate files:
base, for example architecture
branch, for example electrical, HVAC, and so on. Your work.
Of course you can think about a script (an SCR file or LISP) that runs all the commands you want with a single command. Creating such a script is not very complicated, but in my opinion using XRef is easy and flexible enough.

How to use "move..." verb to move sheets in Numbers?

I'm trying to figure out how to re-position sheets in Numbers. There is no way to insert things at a specific location, so I am hoping that I can find another way. The move verb drew my attention (it is in the Numbers dictionary); however, there is little or no information about it: no examples, no usage scenarios, not even which object types it works with.
Any insight in the context of the title?
The move in the Numbers dictionary is part of the Standard Suite, which typically works with files. I have tried using it to move text items and tables from one sheet to another, but it always fails. It is probably something they hope to provide functionality for some day.

Localizing spreadsheet cell names

I'm working on an application that has a spreadsheet-like interface. There is a grid of cells. Rows are numbered, and letters are used for the columns. So "names" like A2 and Q17 refer to cells in the grid.
I know I can use GetLocaleInfo(Ex) with LOCALE_SNATIVEDIGITS to get the appropriate digits for the user's locale, so I can format the row numbers. But I don't see something comparable for the locale-appropriate "alphabet".
I could imagine the same question arising for things like word processors that have an outline mode and need to be able to enumerate some list items with letters.
I've been poring over the Windows NLS APIs, and I don't see anything like LOCALE_SNATIVEALPHABET, nor do I see an API like EnumLocaleAlphabet. Am I missing such an API or am I stuck rolling my own?
Personally, I haven't heard of such an API. The closest to what you are asking for would be ICU uchar's UBlockCode, but it still won't give you a concrete alphabet.
By the way, I don't think it is actually necessary to localize cell names unless you localize the whole user interface. But in that case you may simply ask translators to provide valid cell symbols.
And this is probably what you should do, because some writing systems do not have the concept of an alphabet at all. That is, it is called a script, not an alphabet. For example, I don't think it would be a good idea to use Arabic for cell symbols (which glyph variant in that case?), nor would I use Chinese (all possible ideograms?).
My suggestion is: leave it to the translators. If they want to localize it, that is OK; if they don't, just trust them. They really should know their craft.
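If the column symbols do come from translators, the naming logic itself can stay locale-independent. The sketch below assumes the translated symbols arrive as a plain string of single characters (for example from a resource file); the function names `column_name` and `cell_name` are hypothetical and not part of any Windows or ICU API.

```python
LATIN = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def column_name(index: int, alphabet: str = LATIN) -> str:
    """Return the spreadsheet-style name for a zero-based column index."""
    if index < 0:
        raise ValueError("column index must be non-negative")
    base = len(alphabet)
    symbols = []
    n = index + 1
    while n > 0:
        # Bijective numeration: there is no zero digit, so shift by one.
        n, remainder = divmod(n - 1, base)
        symbols.append(alphabet[remainder])
    return "".join(reversed(symbols))


def cell_name(row: int, column: int, alphabet: str = LATIN) -> str:
    """Combine a zero-based row and column into a name such as Q17."""
    return f"{column_name(column, alphabet)}{row + 1}"
```

With the default alphabet, `cell_name(16, 16)` gives `Q17`; passing a different string as `alphabet` changes only the column symbols. Formatting the row number with locale-specific digits is left out of this sketch.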

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOCs) that I want to classify, so I can search over them according to certain tags. These tags could either be my own (I assign the tags to the document) or be extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static function to train on a collection of documents, where each document requires a category tag and the text block to train on. Then, you can initialize the DocumentCategorizerME with the trained model and begin classifying all the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
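The tag-a-few, train, then classify-the-rest workflow described above can be sketched without any library. Note that this is not OpenNLP and not its maximum-entropy model: it is a tiny naive Bayes stand-in using only the Python standard library, and the function names and sample categories are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(samples):
    """Build a model from an iterable of (category, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in samples:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return {"words": dict(word_counts), "docs": dict(doc_counts),
            "vocab": vocabulary}


def classify(model, text):
    """Return the category with the highest naive Bayes log-probability."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, doc_count in model["docs"].items():
        counts = model["words"][category]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # category prior
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

In practice the text would first have to be extracted from the PDF and DOC files, and a handful of manually tagged samples per category would take the place of the toy strings used to try this out.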
This post on extracting keywords and classifying webpages is related and may be helpful. In your example it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use, I would definitely recommend giving it a look.
