formatting file using bash - bash

I have a directory (Confidential) which contains a bunch of text files.
Confidential
:- Secret-file1.txt
:- Secret-file2.txt
:- Secret-file3.txt
I want to produce another text file (Summary.txt) with a text width of, say, 80 and with the following formatting
Secret-file1 - This file describes various secret activities of
               organization Secret-Organization-1
Secret-file2 - This file describes various secret activities of
               organization Secret-Organization-2. This summarizes
               their activities from year 2001.
Secret-file3 - This file describes various secret activities of
               organization Secret-Organization-3. This summarizes
               their activities from year 2024.
Here the second column is right-aligned and copied from the first line of the corresponding text file. For example, "Secret-file1.txt" looks like this
This file describes various secret activities of organization Secret-Organization-1.
XXXXXXXXXXXXXXXXX BUNCH of TEXT TILL EOF XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
How can I do that? I am looking at various options in bash (e.g., sed, awk, grep, your-preferred-bash-built-in).
Thanks
A

This is the simplest thing that came to my mind. Since you didn't write what you tried, I'm leaving possible tweaks to you, but I believe this is a good start ;)
for file in *; do printf '%s\t\t%s\n' "$file" "$(head -n 1 "$file")"; done

You can do this cleanly with a few lines of Python:
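#!/usr/bin/env python3
import glob
import textwrap
from os.path import basename
# Continuation lines are indented to line up under the description column
INDENT = ' ' * 22
# sorted() keeps the order stable; glob.glob() returns files in arbitrary order
for filename in sorted(glob.glob("Confidential/*.txt")):
    with open(filename, 'r') as secret:
        # Wrap the first line of the file, then strip the indent of the
        # first wrapped line so it follows directly after the file name
        print("{:20s}- {}\n".format(
            basename(filename),
            '\n'.join(textwrap.wrap(secret.readline(),
                                    width=74,
                                    initial_indent=INDENT,
                                    subsequent_indent=INDENT)).strip()),
            end="")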
prints
Secret-file1.txt    - This file describes various secret activities of
                      organization Secret-Organization-1
Secret-file2.txt    - This file describes various secret activities of
                      organization Secret-Organization-2. This summarizes
                      their activities from year 2001.
Secret-file3.txt    - This file describes various secret activities of
                      organization Secret-Organization-3. This summarizes
                      their activities from year 2024.
It’s not shell, but it’s going to be faster because you’re not forking a bunch of processes, and you’re not going to spend a ton of time with string-formatting and writing loops to indent the text when the textwrap module can do it for you.
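If you want the layout from the question (file name without the extension, output written to Summary.txt), here is a minimal sketch along the same lines. The function names format_entry and write_summary are my own, not from any library, and I am assuming the first line of each file is the description.

```python
import textwrap
from pathlib import Path


def format_entry(label: str, description: str, width: int = 80) -> str:
    """Return 'label - description' wrapped to `width` with a hanging indent."""
    prefix = f"{label} - "
    return textwrap.fill(
        description.strip(),
        width=width,
        initial_indent=prefix,
        # continuation lines line up under the start of the description
        subsequent_indent=" " * len(prefix),
    )


def write_summary(directory: str, summary_path: str, width: int = 80) -> None:
    """Write one wrapped entry per .txt file, using each file's first line."""
    entries = []
    for path in sorted(Path(directory).glob("*.txt")):
        with path.open() as handle:
            first_line = handle.readline()
        # path.stem drops the .txt extension, e.g. "Secret-file1"
        entries.append(format_entry(path.stem, first_line, width))
    Path(summary_path).write_text("\n".join(entries) + "\n")
```

Calling write_summary("Confidential", "Summary.txt") would then produce the summary file. The hanging indent here follows the length of each label, so it only lines up across entries when the file names have the same length.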

Take a look at the fmt command in Unix. It can reformat your document in a specific width and even control indentations.
It's been a long while since I used it. However, it can follow indents, set width, etc. I have a feeling it may do what you want.
Another command to look at is pr. By default, pr breaks text into pages and adds page numbers, but you can turn all of that off. This is another command that may be able to munge your text the way you want.

Related

PDF - Edit raw text without special paid tool

Is there a way to edit the raw text from a PDF without any special paid software?
So there are PDFs with highlightable text. I assume that the text is stored somewhere in the file.
I tried to just drag & drop a PDF into vscode, but it only showed me unknown characters and a little bit of meta text. If I edit the meta-infos, the file mostly gets corrupted.
Apart from that, I could not find any of the text contents of my desired PDF in vscode-editor.
Does someone know if there is a solution like inspecting and changing the source code somehow without a special software? I want to edit the contents; not the meta-infos.
(I use macOS)
The text you see on a PDF page can be constructed in dozens of different ways; there are millions of users, using potentially hundreds if not thousands of different methods.
Update
The question is about macOS, but to be universally useful across platforms you need to work with the PDF as plain text. As an example of how that is possible, on Windows you can write a PDF line by line using, say, cmd. Here is a snippet of what was a few dozen lines :-)
echo %%PDF-1.0>demo.pdf
echo %%µ¶µ¶>>demo.pdf
echo/>>demo.pdf
for %%Z in (demo.pdf) do set "FZ1=%%~zZ"
echo 1 0 obj>>demo.pdf
echo ^<^</Type/Catalog/Pages 2 0 R^>^>>>demo.pdf
echo endobj>>demo.pdf
echo/>>demo.pdf
For the fuller version, which through "feature creep" has now grown to more than 100 lines and counting, see
https://github.com/GitHubRulesOK/MyNotes/raw/master/MAKE-PDF.cmd
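Since the question is about macOS, the same idea can be sketched in Python, which runs there too. This is only a rough illustration and not a translation of the script above: the function name build_minimal_pdf is made up, the page size and the Helvetica font are my assumptions, and the text must not contain parentheses or backslashes because no escaping is done.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Build a one-page, uncompressed PDF that shows `text` (a sketch)."""
    # Content stream: select font F1 at 24 pt, move to (72, 720), draw text
    stream = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("latin-1")
    objects = [
        b"<</Type/Catalog/Pages 2 0 R>>",
        b"<</Type/Pages/Kids[3 0 R]/Count 1>>",
        b"<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]"
        b"/Contents 4 0 R/Resources<</Font<</F1 5 0 R>>>>>>",
        b"<</Length %d>>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        # The cross-reference table needs the byte offset of every object
        offsets.append(len(out))
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<</Size %d/Root 1 0 R>>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)
```

Writing the result with open("demo.pdf", "wb") gives a file you can open in vscode and read as text, because nothing in it is compressed.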
However, although plain text could be the simplest, it is rarely used except to prove the conceptual point that it is possible. The rest of the time, "special software" as you call it (a PDF generator/editor) will be used to compress the file objects, most frequently as different optimal binary streams.
So some text may be scanned pixels whilst other text may be line shapes that look like letters, or at other times plain letters without fonts but a named style, or even letters with the font included (embedded) in the file (the preferred option).
In many ways each page may be built differently from the others, and thus no two PDFs will generally use the same structure, unless, like a bank statement, they use a format that does not change much from month to month, even if the balance wobbles about.
So in summary, the tool that will work best is the one that covers every single permutation that Adobe dreamed of and still keeps the result a valid Adobe PDF.
Thus Acrobat PRO 3D is on my shelf (even if not used from one year to the next).
There are many cheaper editors; the ones I use more often for small mods are Tracker Xchange and FreePDF PRO, and both have different limitations.
Your choices for macOS will be more limited, so search for the best you are willing to pay for.

Line breaks don't show up right in YAML?

I am looking at a YAML config file for a database, and all I see is a big jumble of text. However, when I use my keyboard's arrow keys to navigate around, I notice that there is occasionally a spot where the cursor gets stuck and requires me to press the arrow key two times instead of one, as if a character were missing. I am currently assuming that this is a line break that only YAML parsers can read. When I force a line break by pressing ENTER, the YAML parser does not understand the config file anymore. How can I get past this limitation without using a non-Windows program? This line break has a hex value of 0A.
As requested, a snippet of what the current YAML text looks like and what I would like it to look like can be found at the links below (due to StackExchange's limited use of indents). Note that these are two different files for a game's configuration. The API for the parser is here.
What I would like the config to look like
What the config currently looks like
It has also come to my attention that the second link might show it as a YAML file since it registers the line-break as a line break. However, the chunk below might give you an idea of what it looks like to me.
RWtorchLight: Version 1.2 made by MYCRAFTisbest
indent1: ''
NOTE: 'The Meta data valuse is the number after the :'
For Example: Black wool, put 35 in Light_Block and 15 in Meta Data
Light_Block: 89
Meta_Deta_LB: 0
IMPORTANT: The torch and boots are not compatable with Meta Data yet
Torch_Item: 50
Helmet_Item: 314
Boot_Item: 317
indent2: ''
Torch_Use: true
Helmet_Use: true
Boot_Use: true
T-or-T Mode: Will create dim light when wearing pumpkin and all below features
Trick-or-Treat Mode: true
C of C: Chance of Cookie is the chance of how often trick-or-treaters get candy
Set to: '"0" for no chance'
Chance of Cookie: 5000
N of C: 'Will randomly chose a number between 1 and # when Cookies are received'
Number of Cookies: 5
BACKGROUND
After reviewing your question and the associated discussion in comments, a likely case is your YAML file is being corrupted either by:
notepad.exe;
your FTP/SFTP/Web page/whatever used for uploading the text; OR
a combination of both of the above
PROBLEM
YAML syntax is whitespace and indentation sensitive, and using MSFT notepad.exe is not recommended because it may not support the encoding specified in your YAML file.
Since YAML uses whitespace to delimit the data, any kind of modification to the text that is not consistent with the original encoding and whitespace of the original YAML will potentially render the file unusable.
This is one of the aspects of YAML that makes it potentially more brittle than alternative formats, such as JSON or XML.
SOLUTION
Use another editor such as Notepad++ (as recommended in the comments) or, if you do not have sufficient privileges to install another text editor, use an online text editor such as editpad (http://www.editpad.org/) to edit and save the YAML to a local file on your machine.
After saving the file to your local machine using a text editor besides notepad.exe, upload your file using an option that does not apply any kind of text filter to the text.
For example, some websites strip out characters from user-uploaded text to prevent things like data corruption and security risks.
STEP BY STEP
start with a known well-formed YAML file, such as the one you specified in "What I would like the config to look like"
paste it into Notepad++ (local machine) or editpad (web-based editor)
modify the YAML file so it matches the settings you want
save your modifications to the original file
upload the file via SFTP or other means that preserves the original encoding
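The hex value 0A mentioned in the question is a bare line feed (LF), while Windows line breaks are the pair 0D 0A (CR LF). Older versions of notepad.exe only break lines on CR LF, which would explain the "jumble". To check which line endings a file really contains before and after editing, a minimal Python sketch like the following may help. The function names are my own, and whether the game's parser accepts CR LF is an assumption you would need to verify, so keep a backup.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CR LF pairs, bare LF and bare CR bytes in raw file content."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,  # CR not followed by LF
    }


def to_lf(data: bytes) -> bytes:
    """Convert CR LF line endings to bare LF (the style the file came with)."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert to CR LF so editors that only understand CR LF show line breaks."""
    # Normalise first so existing CR LF pairs are not turned into CR CR LF
    return to_lf(data).replace(b"\n", b"\r\n")
```

Read the file in binary mode (open(path, "rb")) so Python does not translate the line endings itself. If the parser turns out to need bare LF, run to_lf on the edited file before uploading it.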

Finding RNAs and information in a region

I want to find novel and known RNAs and transcripts in a sequence of about 10 KB. What is the easiest way to start using bioinformatics tools if that sequence is not well annotated in the Ensembl and UCSC browsers? Are spliced ESTs and RNA sequencing data one option? I am new to bioinformatics, so your suggestions would be useful to me.
Thanks in advance
I am a bit unclear on what exactly your desired end-product or output would look like. But I might suggest doing multiple sequence alignments and looking for those with high scores. Chances are that this 10KB sequence will contain some of those known sequences but they won't match exactly, so I think you want a program that gives you alignment scores and not just simple matches. I use Perl in combination with Clustal to make alignments. Basically, you will need to make .fasta or .aln files with both the 10KB sequence and a known sequence of interest, according to those file formats' respective conventions. You can use the GUI version of Clustal if you are not too programming savvy. If you want to use Perl, here is a script I wrote for aligning a whole directory of .fasta files. It can perform many alignments in one fell swoop. NOTE: you must edit the clustal executable path in the last line (system call) to match its location on your computer for this script to function.
#!/usr/bin/perl
use strict;
use warnings;
print "Please type the directory containing the protein fasta files to align (end the directory path with a / or this will fail!): ";
my $directory = <STDIN>;
chomp $directory;
opendir (DIR,$directory) or die $!;
# keep only .fasta inputs; this also skips "." and ".." and earlier _align.fasta outputs
my @file = grep { /\.fasta$/ && !/_align\.fasta$/ } readdir DIR;
closedir DIR;
my $add="_align.fasta";
foreach my $file (@file) {
    my $infile = "$directory$file";
    # strip the extension to build the output file name
    (my $fileprefix = $infile) =~ s/\.[^.]+$//;
    my $outfile="$fileprefix$add";
    # edit this path so it points at the clustalw2 executable on your machine
    system "/Users/Wes/Desktop/eggNOG_files/clustalw-2.1-macosx/clustalw2 -INFILE=$infile -OUTFILE=$outfile -OUTPUT=FASTA";
}
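If preparing the input files by hand is a hurdle, here is a minimal Python sketch for writing a FASTA file that holds both the 10KB region and one known sequence. The helper names and the 60-character line width are my own choices for illustration, not something Clustal requires.

```python
def fasta_record(name: str, sequence: str, line_width: int = 60) -> str:
    """Format one FASTA record: a '>' header line, then wrapped sequence lines."""
    # Remove whitespace and line breaks that may have come from copy/paste
    cleaned = "".join(sequence.split()).upper()
    chunks = [cleaned[i:i + line_width] for i in range(0, len(cleaned), line_width)]
    return f">{name}\n" + "\n".join(chunks) + "\n"


def fasta_pair(region_name: str, region_seq: str,
               known_name: str, known_seq: str) -> str:
    """Put the region of interest and one known sequence into a single FASTA text."""
    return fasta_record(region_name, region_seq) + fasta_record(known_name, known_seq)
```

Saving the result of fasta_pair(...) with a .fasta extension into the directory gives the script above one input file per comparison.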
Do you have a linux server or computer or are you relying on web and windows-based programs?
To align RNA-seq reads, people generally use splice read aligners like Tophat, although BLAST would probably work too.
Initially I wrote a long response explaining how to do this in Linux, but I've just realised that Galaxy might be a much easier solution for a beginner. Galaxy is an online bioinformatics tool with a very user-friendly interface; it's particularly designed for beginners. You can sign up and log in at this website: https://main.g2.bx.psu.edu/
There are tutorials on how to do things (see 'Help' menu) but my basic workflow for your experiment would go something like this:
Log into Galaxy
Upload RNA-seq reads, EST reads and 10K genome sequence
In the menu on the left, click to expand "NGS-RNA sequencing", then click "Tophat for Illumina" (assuming your RNA-seq reads are Illumina fastq reads)
Align your RNA-seq reads using Tophat, make sure to select your 10K sequence as the reference genome.
Try aligning your EST reads with one of the programs. I'm not sure how successful this will be; Tophat isn't designed to work with long sequences, so you might have to have a bit of a play or be a bit creative to get this working.
Use Cufflinks to create annotation for novel gene models, based on your RNA-seq reads and/or EST sequences.
Regarding viewing the output, I'm not sure what is available for a custom reference sequence on Windows, you might have to do a bit of research. For Linux/Mac, I'd recommend IGV.

"descript.ion" file spec?

There appears to be a somewhat standard "descript.ion" file in Windows programs universe which provides meta data for all/some of the files in a given directory.
I know there are various programs which write this file (example: NewsBin, UseNet downloader) and read it (Example: "FAR", a file manager mimicking old Norton Commander).
I'm writing my own file indexer, and would like to add the ability to parse and use the info from "descript.ion" files.
The problem I have is that I have not been able to find an actual spec for the file, despite much googling.
I reverse engineered it as best I could, but I'm not certain whether I captured 100% of the possible details, so I figured I'd ask SO.
Here are example lines from the file:
"Rus Song1.mp3" SovietMus 1/2, rus_song@gmail.com, Fri Aug 08 00:46:27 2008
RusSong2.mp3 SovietMus 2/2, rus_song@gmail.com, Fri Aug 08 01:46:22 2008
As it seems the structure is:
First "token" is a file name.
If the token starts with any letter but double quote, the token ends at the first space character.
If the token starts with the double quote, the end of token is the following double quote
(Not sure what happens if filename contains a double quote. IIRC it's illegal in Windows filesystems, so escaping the quote may be a moot question.)
Last token (end of line to the very last comma moving backwards) is a timestamp.
Second to last token (the very last comma to second-to-last comma moving backwards) is the name of the poster from the Usenet newsgroup. I'm not quite sure what happens in generic format since the only descript.ion files I saw were from NewsBin that is obviously Usenet centric.
Everything in between is a description, in NewsBin's case coming from post's subject.
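The structure described above can be sketched as a small Python parser. This only encodes the reverse-engineered rules from this question, not an official spec; the names Entry, parse_line and split_newsbin are hypothetical, and the NewsBin split assumes the description always ends in ", poster, timestamp".

```python
from typing import NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    filename: str
    description: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one descript.ion line into file name and description."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    if line.startswith('"'):
        # Quoted name: runs up to the next double quote
        end = line.find('"', 1)
        if end == -1:
            return None  # unterminated quote, treated as malformed here
        name, rest = line[1:end], line[end + 1:]
    else:
        # Unquoted name: runs up to the first whitespace character
        parts = line.split(None, 1)
        name, rest = parts[0], parts[1] if len(parts) > 1 else ""
    return Entry(name, rest.strip())


def split_newsbin(description: str) -> Optional[Tuple[str, str, str]]:
    """Split a NewsBin-style description into (subject, poster, timestamp)."""
    parts = description.rsplit(",", 2)  # split on the last two commas only
    if len(parts) != 3:
        return None
    subject, poster, timestamp = (part.strip() for part in parts)
    return subject, poster, timestamp
```

Splitting from the right means commas inside the subject are kept, which matches the "moving backwards" reading of the last two tokens.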
QUESTIONS:
Does anyone know of a bit more official "descript.ion" file spec/documentation?
(or, at least, have your own knowledge of those files and can verify my spec)
Does anyone know of any other programs that read or write this file?
Thanks!
The description files on my system are from Total Commander as well. They follow the basic spec mentioned in the other answers:
Filename Text I typed to describe the file
"Long filename" Some text
Each line ends in a normal Windows line break.
In addition, the program stores multi-line comments as follows:
Filename This is the first line\\nSecond line\\nLast line\x04\xc2
Here, I mean that the descript.ion file contains a backslash and a letter 'n' where I typed a line break, and two special characters 04 C2 at the end of the comment. In addition, the line is ended by a Windows line break 0D 0A.
Apparently, the two extra characters at the end of the line signal the end of a multiline comment. If I remove them, the comment is rendered as a single line in the GUI, and the '\n' sequences are displayed literally.
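Based only on what I observed above, decoding such a comment could look like this minimal Python sketch. The function name decode_comment is made up, and I am assuming the 04 C2 marker always sits at the very end of the comment.

```python
MULTILINE_MARKER = b"\x04\xc2"  # the two trailing bytes observed above


def decode_comment(raw: bytes) -> bytes:
    """Turn a stored comment into its displayed form.

    Only comments that end with the marker are treated as multi-line;
    otherwise a backslash followed by 'n' is left as literal text.
    """
    if raw.endswith(MULTILINE_MARKER):
        body = raw[: -len(MULTILINE_MARKER)]
        # backslash + 'n' in the file stands for a real line break
        return body.replace(b"\\n", b"\n")
    return raw
```

The input is the comment part of the line as bytes, with the file name and the trailing 0D 0A already removed.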
The original usage of DESCRIPT.ION was to provide longer more descriptive names to 8.3 filenames; all it had was the shortname and a longer description. As you've found, others have co-opted the name with varying formats and usages. Frankly speaking, I don't think you'll find any specific commonality among the various usages.
Format is simple: FileName remainder of the line is a description of the file
https://jpsoft.com/ascii/descfile.txt
(Wayback Machine)
The descript.ion file is used extensively in the file management utility "Total Commander", shareware available at www.ghisler.com. From version 7.5 of TC, it can have a length of 4096 bytes. I have been using it extensively to annotate my files without any issues. You may look up different users' experiences at the Total Commander users forum.
The answer above looks correct to me; just an addition:
from http://filext.com/file-extension/ION
The ION file type is primarily associated with '4DOS'. Note: Norton Utilities also uses 4DOS.
http://www.optimasc.com/products/fileid/4dos-descext.pdf
Collected links to 4DOS description-aware programs of all kind and 4DOS tools.
http://www.4dos.info/4tools.htm
http://drupal.org/node/289988

Do standard windows .ini files allow comments?

Are comments allowed in Windows ini files? (...assuming you're using the GetPrivateProfileString api functions to read them...)
[Section]
Name=Value ; comment
; full line comment
And, is there a proper spec of the .INI file format anywhere?
Thanks for the replies - However maybe I wasn't clear enough. It's only the format as read by Windows API Calls that I'm interested in. I know other implementations allow comments, but it's specifically the MS Windows spec and implementation that I need to know about.
Windows INI API support for:
Line comments: yes, using semi-colon ;
Trailing comments: No
The authoritative source is the Windows API function that reads values out of INI files
GetPrivateProfileString
Retrieves a string from the specified section in an initialization file.
The reason "full line comments" work is because the requested value does not exist. For example, when parsing the following ini file contents:
[Application]
UseLiveData=1
;coke=zero
pepsi=diet ;gag
#stackoverflow=splotchy
Reading the values:
UseLiveData: 1
coke: not present
;coke: not present
pepsi: diet ;gag
stackoverflow: not present
#stackoverflow: splotchy
Update: I used to think that the number sign (#) was a pseudo line-comment character. The reason using leading # works to hide stackoverflow is because the name stackoverflow no longer exists. And it turns out that semi-colon (;) is a line-comment.
But there is no support for trailing comments.
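To make the results above easier to follow, here is a toy Python model of the lookup behaviour. It is not the Windows API and does not call it; it only mimics what I described: a leading semi-colon hides the line, a leading # simply becomes part of the key name, and anything after the = is returned as the value, trailing "comment" included. The function name ini_lookup is made up.

```python
from typing import Optional

SAMPLE = """[Application]
UseLiveData=1
;coke=zero
pepsi=diet ;gag
#stackoverflow=splotchy
"""


def ini_lookup(text: str, section: str, key: str) -> Optional[str]:
    """Return the value for `key` in `section`, or None if it is not present."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or full-line comment
        if line.startswith("[") and "]" in line:
            current = line[1:line.index("]")].strip()
            continue
        if current is None or current.lower() != section.lower():
            continue  # outside the requested section
        name, separator, value = line.partition("=")
        if separator and name.strip().lower() == key.lower():
            return value.strip()  # trailing ';...' stays in the value
    return None
```

With SAMPLE, looking up pepsi gives "diet ;gag" and looking up #stackoverflow gives "splotchy", the same as the list of values above.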
I have seen comments in INI files, so yes. Please refer to this Wikipedia article. I could not find an official specification, but that is the correct syntax for comments, as many game INI files had this as I remember.
Edit
The API returns the Value and the Comment (forgot to mention this in my reply). Just construct an example INI file and call the API on this (with comments) and you can see how this is returned.
USE A SEMI-COLON AT BEGINNING OF LINE --->> ; <<---
Ex.
; last modified 1 April 2001 by John Doe
[owner]
name=John Doe
organization=Acme Widgets Inc.
I like the analysis of @Ian Boyd, because it is based on the official GetPrivateProfileString() method of Microsoft.
In my attempts of writing a Microsoft compatible INI parser, I'm having a closer look at the said Microsoft API and for comments I found out:
you can have line comments using semicolon
the semicolon needn't be the first character of the line; it can be preceded by space, tab or vertical tab
you can have trailing "comments" after a section even without semicolon. It's probably not intended to be a comment, but the parser will ignore it.
values outside a section cannot be accessed (at least I did not find a way), effectively making them useless except for commenting purposes
certainly abuse, but the parser overflows at 65536 characters, so anything after that will not be part of the value either. I would not rely on this, since Microsoft could fix this in later versions of Windows. Also, it's not very useful as a comment when you don't see it.
Example:
this=cannot be accessed
[section]this=is ignored
;this=is a line comment
;this=is a comment preceded by spaces
key=value <... 65530 spaces ...>this=cannot be parsed
Yes, comments are allowed.
The way to comment is to put ; on a new line of its own, rather than after the content on the same line, which is allowed in some other file types.
Let me show you an example:
I use an .ini file to pass some parameters for my training file when I use SUMO software. If I write like this:
width_layers = 400 ;the number of neurons per layer in the neural network.
I will get an error message which is
ValueError: invalid literal for int() with base 10: '400 ;the number of neurons per layer in the neural network.'
I have to create a line for that, which is
width_layers = 400
;the number of neurons per layer in the neural network.
Then, it will work. Hope it helps in detail!
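The error message looks like it comes from Python, so my guess (an assumption, I have not checked the SUMO tooling) is that the file is read with Python's configparser rather than the Windows API. A small sketch of the difference, with a made-up section name [training]:

```python
import configparser

INI_TEXT = """\
[training]
width_layers = 400 ;the number of neurons per layer in the neural network.
"""


def read_width(inline_comments: bool) -> int:
    """Read width_layers with or without inline ';' comment support."""
    # By default configparser only recognises full-line comments;
    # inline_comment_prefixes turns on stripping of trailing ones.
    kwargs = {"inline_comment_prefixes": (";",)} if inline_comments else {}
    parser = configparser.ConfigParser(**kwargs)
    parser.read_string(INI_TEXT)
    return parser.getint("training", "width_layers")
```

read_width(False) raises the same kind of ValueError as above, while read_width(True) returns 400. If you cannot change how the file is read, keeping the comment on its own line is the safe choice.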

Resources