Pretty text tables & trees

I'm wondering what SO uses (if anything) to create text-based tables and trees? I'm running Notepad++, but I'm thinking of making a change.
When I'm not whiteboarding, I'm in Notepad++, but creating both trees and tables there is a painstaking process. I've seen some quick scripts around for CLI-driven processes, like file system output, but I'm looking for something that lets one quickly create arbitrary tables and trees (self-contained GUI or file import, perhaps) and dump them to text. My searches turned up no plugins for Notepad++.
I have various graphical modeling tools, but I prefer monospaced text (I'm an ASCII art kid, not to mention it's handy for inclusion in script docs), so there's no sense in mentioning Visio (blech) or the plethora of others (unless they happen to support this sort of functionality).
+- Thank       +---------+----------------+
|              | Any     | Suggestions    |
+- You         +---------+----------------+
|  |           | Are     | Certainly      |
|  +- Very     +---------+----------------+
|  |  Welcome  |         |                |
+- Much        +---------+----------------+
Note: Running Win7x64 + Cygwin

I'm not positive about trees, but I have used the Perl Text::FormatTable module before and have found it very helpful in automating output from scripts into tables.
For straight editing yourself, I'd recommend org-mode for Emacs, which has a fantastic ASCII table-editing mode.

Related

Bulk Uploading to MediaWiki (Hierarchical Structure)

I have lots of Markdown files, each contained in a folder with the same name as the Markdown file. I use Pandoc to generate the MediaWiki file in a Rendered folder.
For example:
ComputerScience
|
+- ComputerScience.md
|
+- Rendered
|  |
|  +- ComputerScience.wiki
|
+- Image
|  |
|  +- Computer.png
|
+- Resource
   |
   +- Algorithms.pdf
Every Markdown file has its own folder, which contains other folders such as Image and Resource that are linked from the Markdown file. To explain my structure, let me call the structure above a ComputerScience Container; each Markdown file has such a container. These containers are organized hierarchically: several containers can exist in a folder (which I'll call a SuperFolder), and a SuperFolder can contain another SuperFolder. For example (the Markdown folders are marked as Container):
Computer Science
|
+- Computer Science Container
|
+- Algorithms
|  |
|  +- Algorithms Container
|  |
|  +- DataStructure Container
|
+- Architecture Container
In the above, the Computer Science SuperFolder consists of Containers as well as another SuperFolder called Algorithms.
How can I upload this kind of hierarchical structure into a local MediaWiki?
Also, I would like to keep editing the Markdown files and regenerating the MediaWiki files; I hope to update the wiki using a script.
Any suggestions on how I should approach this?
If you want to learn a tool that's flexible and can be reused in the future, I'd look at pywikibot. If you just want a quick one-off that can be used from bash, and the wiki is local, use edit.php.
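A minimal sketch of the pywikibot route, assuming pywikibot is installed and user-config.py already points at the local wiki; the folder path and the "SuperFolder/Container/Page" title scheme are illustrative assumptions, not something MediaWiki requires:
# Sketch only: walk the SuperFolders, pick up each rendered .wiki file and
# save it as a page whose title encodes the folder hierarchy.
from pathlib import Path
import pywikibot

site = pywikibot.Site()                  # the local wiki from user-config.py
root = Path("/home/me/notes")            # hypothetical folder holding the SuperFolders

for wiki_file in root.rglob("Rendered/*.wiki"):
    container = wiki_file.parent.parent  # the Container folder (holds the .md, Rendered, Image, ...)
    # e.g. "Computer Science/Computer Science Container/ComputerScience"
    title = "/".join(container.relative_to(root).parts + (wiki_file.stem,))
    page = pywikibot.Page(site, title)
    page.text = wiki_file.read_text(encoding="utf-8")
    page.save(summary="Bulk upload of rendered Markdown files")
Images under the Image folders would need a separate upload step, and re-running the script after regenerating the .wiki files with Pandoc simply overwrites the pages with the new content.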

Extract abstract / full text from scientific literature given DOI or Title

There are quite a lot of tools to extract text from PDF files [1-4]. However, the problem with most scientific papers is that it is hard to get access to the PDF directly, mostly because you have to pay for them. There are also tools that provide easy access to a paper's information, such as metadata or BibTeX [5-6]. What I want is to take a step further and go beyond just the BibTeX/metadata:
Assuming that there is no direct access to the publications' PDF files, is there any way to obtain at least the abstract of a scientific paper given the paper's DOI or title? Through my search I found that there have been some attempts [7] at a similar purpose. Does anyone know a website/tool that can help me obtain/extract the abstract or full text of scientific papers? If there are no such tools, can you give me some suggestions for how I should go about solving this problem?
Thank you
[1] https://stackoverflow.com/questions/1813427/extracting-information-from-pdfs-of-research-papers
[2] https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf
[3] https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf?lq=1
[4] https://stackoverflow.com/questions/14291856/extracting-article-contents-from-pdf-magazines?rq=1
[5] https://stackoverflow.com/questions/10507049/get-metadata-from-doi
[6] https://github.com/venthur/gscholar
[7] https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar
You can have a look at the Crossref text and data mining (TDM) service (http://tdmsupport.crossref.org/). The organization provides a RESTful API for free, and more than 4,000 publishers contribute to the TDM service.
You can find some examples from the link below:
https://github.com/CrossRef/rest-api-doc/blob/master/rest_api_tour.md
But to give a very simple example:
If you go to the link
http://api.crossref.org/works/10.1080/10260220290013453
you will see that, besides some basic metadata, there are two other fields, license and link: the former tells you under what kind of license the publication is provided, and the latter gives the URL of the full text. In our example, the license metadata shows a Creative Commons (CC) license, which means the text is free to use for TDM purposes. By searching for publications with CC licenses within Crossref, you can access hundreds of thousands of publications together with their full texts. From my latest research, I can say that Hindawi is the friendliest publisher; they alone provide more than 100K publications under a CC license. One last thing: full texts may be provided in either XML or PDF format. The XML versions are highly structured and thus easy to extract data from.
To sum up, you can automatically access many full texts through the Crossref TDM service by employing their API and simply issuing a GET request. If you have further questions, do not hesitate to ask.
Cheers.
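If you would rather script that lookup than open the URL in a browser, here is a rough sketch using Python's requests library (an assumption; any HTTP client works) to pull out the license and link fields described above:
# Sketch only: query the public Crossref REST API for one DOI and print
# the license and full-text link metadata, if present.
import requests

doi = "10.1080/10260220290013453"
message = requests.get("https://api.crossref.org/works/" + doi, timeout=30).json()["message"]

for lic in message.get("license", []):
    print("license:", lic.get("URL"))
for link in message.get("link", []):
    # content-type tells you whether the full text is offered as XML or PDF
    print("full text:", link.get("content-type"), link.get("URL"))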
Crossref may be worth checking. They allow members to include abstracts with the metadata, but it's optional, so coverage isn't comprehensive. According to their helpdesk when I asked, they had abstracts available for around 450,000 registered DOIs as of June 2016.
If an abstract exists in their metadata, you can get it using their UNIXML format. Here's one specific example:
curl -LH "Accept:application/vnd.crossref.unixref+xml" http://dx.crossref.org/10.1155/2016/3845247
If the article is on PubMed (which contains around 25 million documents), you can use the Python package Entrez to retrieve the abstract.
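Assuming "the Python package Entrez" means Biopython's Bio.Entrez module, a minimal sketch for pulling an abstract out of PubMed looks roughly like this; the search term and e-mail address are placeholders:
# Rough sketch with Biopython's Bio.Entrez: search PubMed for a title
# (a DOI string also works as a search term) and fetch the abstract as text.
from Bio import Entrez

Entrez.email = "you@example.com"     # NCBI asks for a contact address

handle = Entrez.esearch(db="pubmed", term="CRISPR-Cas9 genome editing", retmax=1)
ids = Entrez.read(handle)["IdList"]

if ids:
    fetch = Entrez.efetch(db="pubmed", id=ids[0], rettype="abstract", retmode="text")
    print(fetch.read())              # includes the abstract when PubMed has one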
Using curl (this works on my Linux system):
curl http://api.crossref.org/works/10.1080/10260220290013453 2>&1 | # doi after works
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g' | # substitute paragraph tags
sed 's/<[^>]*>/ /g' # remove other tags
# add "echo" to show unicode characters
echo -e $(curl http://api.crossref.org/works/10.1155/2016/3845247 2>&1 | # doi after works
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g' | # substitute paragraph tags
sed 's/<[^>]*>/ /g') # remove other tags
Using R:
library(rcrossref)
cr_abstract(doi = '10.1109/TASC.2010.2088091')

Text editor to view giant log files

As I have not yet set up a log-rotation solution, I have a 3 GB (38-million-line) log file in which I need to find some information from a certain date. As using cat | grep is horribly slow, and using my current editor (Large Text File Viewer) is equally slow, I was wondering: is there any text editor that works well for viewing log files of more than 35 million lines? I could just use the cat | grep solution and leave it running overnight, but with millions of errors to sort through there has to be a better way.
You might want to try using grep by itself:
grep 2011-04-09 logfile.txt
instead of needlessly using cat:
cat logfile.txt | grep 2011-04-09
When dealing with large amounts of data, this can make a difference.
Interesting reading is a Usenet posting from last year: why GNU grep is fast.
Since you are on Windows, you should really try multiple implementations of grep. Not all implementations of grep are equal. There are some truly awful implementations.
It is not necessary to use cat: grep can read directly from the log file, unless it is locked against being shared with readers.
grep pattern logfile > tmpfile
should do the trick. Then you can use almost any editor to examine the selected records, assuming the pattern is reasonably selective.
I don't think you're going to get any faster than grep alone (as others have noted, you don't need the cat).
I personally find "more" and "less" useful (for smaller files). The reason is that sometimes a pattern will get you to the general vicinity of where you want to be (i.e. a date and time), and then you can scroll through the file from that point.
The "/" key is the search command for regular expressions in more.

Store and query a mapping in a file, without re-inventing the wheel

If I were using Python, I'd use a dict. If I were using Perl, I'd use a hash. But I'm using a Unix shell. How can I implement a persistent mapping table in a text file, using shell tools?
I need to look up mapping entries based on a string key, and query one of several fields for that key.
Unix already has colon-separated records for mappings like the system passwd table, but there doesn't appear to be a tool for reading arbitrary files formatted in this manner. So people resort to:
key=foo
fieldnum=3
value=$(cat /path/to/mapping | grep "^$key:" | cut -d':' -f$fieldnum)
but that's pretty long-winded. Surely I don't need to make a function to do that? Hasn't this wheel already been invented and implemented in a standard tool?
Given the conditions, I don't see anything hairy in your approach, but consider awk for extracting the data. The awk approach allows for picking only the first or the last entry, or imposing arbitrary additional conditions:
value=$(awk -F: "/^$key:/{print \$$fieldnum}" /path/to_mapping)
Once bundled in a function, it's not that scary. :)
I'm afraid there's no better way, at least within POSIX, but you may also have a look at the join command.
Bash supports arrays, which are not exactly the same thing. See for example this guide.
area[11]=23
area[13]=37
area[51]=UFOs
echo ${area[11]}
See this Linux Journal article for the associative arrays available in Bash >= 4.0. For other versions of Bash you can fake it:
hput () {
    eval hash"$1"='$2'
}
hget () {
    eval echo '${hash'"$1"'#hash}'
}
# then
hput a blah
hget a   # yields blah
Your example is one of several ways to do this using shell tools. Note that cat is unnecessary.
key=foo
fieldnum=3
filename=/path/to/mapping
value=$(grep "^$key:" "$filename" | cut -d':' -f$fieldnum)
Sometimes join comes in handy, too.
AWK, Python, Perl, sed and various XML, JSON and YAML tools as well as databases such as MySQL and SQLite can also be used, of course.
Without using them, everything else can sometimes be convoluted. Unfortunately, there isn't any "standard" utility. I would say that the answer posted by pooh comes closest. AWK is especially adept at dealing with plain-text fields and records.
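For comparison, the dict-based route the question mentions is only a few lines of Python; the file path and field layout below are the same hypothetical ones used in the shell examples:
# Sketch of the Python/dict approach, reading the same colon-separated
# mapping file as the shell examples above.
def load_mapping(path):
    table = {}
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(":")
            if fields[0]:                 # skip blank lines
                table[fields[0]] = fields # key -> whole record
    return table

mapping = load_mapping("/path/to/mapping")
print(mapping["foo"][2])                  # fieldnum=3, counted from 1 as cut does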
The answer in this case appears to be: no, there's no widely available implementation of the 'passwd' file format for the general case, and some wheel re-invention is necessary.

How do you convert character case in UNIX accurately? (assuming i18N)

I'm trying to get a feel for how to manipulate characters and character sets in UNIX accurately given the existence of differing locales - and to do so without requiring special tools outside of standard UNIX items.
My research has shown me the problem of the German sharp-s character: one character changes into two - and there are other problems. Using tr is apparently a very bad idea. The only alternative I see is this:
echo StUfF | perl -n -e "print lc($_);"
but I'm not certain that will work, and it requires Perl - not a bad requirement necessarily, but a very big hammer...
What about awk and grep and sed and ...? That, more or less, is my question: how can I be sure that text will be lower-cased in every locale?
Perl's lc/uc work fine for most languages, but they won't handle Turkish correctly; see this bug report of mine for details. If you don't need to worry about Turkish, Perl is good to go.
You can't be sure that text will be correct in every locale. That's not possible; there are always some errors in software libraries' implementations of i18n-related stuff.
If you're not afraid of using C++ or Java, you may take a look at ICU, which implements a broad set of collation, normalization, and related rules.
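As a concrete illustration of the sharp-s and Turkish behaviour described above (shown in Python purely because it makes the full Unicode case mapping easy to demonstrate; it is not one of the tools proposed in the answers):
# The German sharp-s: one character becomes two on upcasing, and the
# round trip does not bring it back.
print("straße".upper())    # STRASSE
print("STRASSE".lower())   # strasse
print("ß".casefold())      # ss  (what case-insensitive comparison needs)

# The Turkish problem: a locale-independent lowercase of "I" is plain "i",
# while Turkish expects the dotless "ı" - which is what locale-aware
# libraries such as ICU handle.
print("I".lower())         # i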

Resources