How to detect a blank line between filled lines in .txt file and convert it to a single tab

I am running a bash/.dat script (Mac terminal), and part of it converts each line return into a TAB (to get it ready for importing nicely into Excel). The problem is that I also want to remove all extra blank lines, except for a single blank line when it comes between two filled lines. So...
Line pre-A is blank
Line A has text
Line B has text
Line C is blank
Line D has text
Line E is blank
Line F is blank
Line C above would become a TAB and Line E and F (and pre-A) would be deleted. Also, sometimes there is a blank line before Line A (labelled Line pre-A above), so I'd want it removed but not replaced with a TAB.
So the result would be:
Line A text [TAB] Line B text [TAB] [TAB] Line D text
...and it'd be OK if Line D text was followed by a [TAB]. Make sense? Is this doable and, if so, how?
Thanks!

If Perl is an option, would you please try:
perl -0777pe 's/^\n+//; s/\n{3,}/\t/g; s/\n/\t/g' file.txt
The -0777 option tells perl to slurp the whole file at once, so the substitutions can see the newline characters between lines.
The -p option runs the code over the input and prints the result; -e supplies the code on the command line.
The first substitution, s/^\n+//, removes any leading blank line(s).
The next, s/\n{3,}/\t/g, converts three or more consecutive newline characters (meaning two or more blank lines) into a single tab character.
The last, s/\n/\t/g, converts each remaining newline character into a tab character.
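If Perl is not available, roughly the same rules can be sketched with awk. This is only a sketch under assumptions: the function name lines_to_tabs is made up for illustration, whitespace-only lines are treated as blank, and, like the Perl one-liner, a run of two or more blank lines between filled lines adds no extra tab.

```shell
#!/bin/sh
# Hypothetical helper: print the filled lines of file "$1" as one tab-separated line.
lines_to_tabs() {
  awk '
    NF == 0 {                          # blank (or whitespace-only) line
      if (seen) blanks++               # only count blanks after the first filled line
      next
    }
    {
      if (seen) printf "\t"            # separator between two filled lines
      if (blanks == 1) printf "\t"     # exactly one blank line in between: one empty cell
      printf "%s", $0
      seen = 1
      blanks = 0
    }
    END { if (seen) printf "\n" }      # finish the output line; trailing blanks are dropped
  ' "$1"
}
```

Usage would be something like lines_to_tabs file.txt > file.tsv. Unlike the Perl version, this does not leave a trailing tab after the last filled line.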

Related

How to search for special shell characters (in Linux) from one massive file in another without changing them

I've got two massive files with millions of lines.
In the first file1 one of the lines is
Oz5,z!F,k"H,#$5,#%J,$&L,m'F,o(H,6X),c*7
and in the 2nd file2 there are many lines containing the above one, e.g.,
Oz5,z!F,k"H,#$5,#%J,$&L,m'F,o(H,6X),c*7.X5t,&&***b,ccc
I want to search for the lines from file1 in file2, and I face two problems:
1. The search itself clashes with the special characters in any shell (sh, bash, csh, ...):
!F,k"H,#$5,#%J,$: event not found
I also tried egrep, awk, ack, ... with the same result.
How can I get around that? The aforementioned nature of the strings to be searched does not allow me to treat them in any obvious way. E.g., I do not see how I can possibly substitute something for, say, "!", because if I introduce "\!" that would clash with "\!", which is also a string in file1 and file2. Note that all printable ASCII characters in all combinations appear in file1 and file2.
What I would apparently need is a shell (perhaps a virtual one) which has no special characters. Is there such a Unix shell?
2. How do I take the lines from file1 one by one, search for them in file2, and extract the matching lines from file2 into file3?

I solved the problem in the following way.
All shells and the search tools in them, as well as most editors (like vi, vim), have special characters built in. But not Emacs.
I used an Emacs macro as follows:
Split the Emacs window into 3 sub-windows, one atop another. Put file1 in the top one, file2 in the middle, and the output file (file3) in the bottom one. Start the macro with "C-x (" with the cursor at the beginning of file1. Copy the line. Go to the beginning of the next line. Go to file2: C-x o. Search for the copied line. Copy the first line found that contains the line from file1. Go to the beginning of file2. Go to file3. Paste the line from file2. Go to the next line. Go to file1. Close the macro with "C-x )". Repeat the macro as many times as there are remaining lines (say n) in file1: M-n C-x e (M = Esc).
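As an alternative to the editor macro, the search strings can be kept away from the shell entirely. The "event not found" message comes from the shell's history expansion of "!" on the command line; a string that is only ever read from a file is never parsed by the shell. A minimal sketch, assuming a grep that supports -F (fixed strings) and -f (read patterns from a file); the function name find_literal_lines is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: copy every line of "$2" that contains any line of "$1" into "$3".
find_literal_lines() {
  # -F: treat each pattern as a literal string, not a regular expression
  # -f: read the patterns from a file, one per line
  grep -F -f "$1" "$2" > "$3"
}
```

Called as find_literal_lines file1 file2 file3. Note two differences from the macro: this writes every matching line of file2 (in file2's order), not just the first match per pattern, and an empty line in file1 would match every line of file2. With millions of patterns it may also need a fair amount of memory.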

How do I delete all lines from a file after (and including) a line that contains a defined string in a Bash script?

I'm hacking about a text file in the middle of a Bash script (on an RPI3B+ with OSMC installed) and trying to crop a file at the first line that contains the text "BLAH DE BLAH" (deleting everything in the same file after and including the first line it finds that text on).
For example (in the file filename.text):
This is the first line
This is the second line
This is the third line containing "BLAH DE BLAH"
This is the fourth line
This is the fifth line
Required output (in the file filename.text):
This is the first line
This is the second line
I've tried to investigate awk and sed related posts, but I'm finding it all so confusing as I can't find anything that does exactly what I need (some split at certain line numbers, some from the command line not a bash script, some before and after certain strings)... and I'm stuck. As you can see, I can't even work out how to format this post properly (my head hurts so much)!
Any help appreciated - thanks!

Looks like
sed '/BLAH DE BLAH/Q'
would do the job in GNU sed.
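Note that the sed command above writes the shortened text to standard output, and Q is a GNU extension. If GNU sed is not available, or the file itself has to be changed from inside a script, something along these lines may work instead. It is a sketch that swaps in awk with a literal substring test; the function name truncate_at_match is hypothetical, and backslashes in the search text would be interpreted by awk -v.

```shell
#!/bin/sh
# Hypothetical helper: cut file "$1" at the first line containing the literal text "$2".
truncate_at_match() {
  tmp=$(mktemp) || return 1
  # Print lines until one contains the text, then stop without printing it.
  awk -v pat="$2" 'index($0, pat) { exit } { print }' "$1" > "$tmp" &&
    cat "$tmp" > "$1"          # write back in place, keeping the file's permissions
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

In the script this would be called as truncate_at_match filename.text 'BLAH DE BLAH'.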

Modify a line below a specific line

I have a big file like this small example:
>ENSG00000002587|ENST00000002596
ATGGCCGCGCTGCTCCTGGGCGCGGTGCTGCTGGTGGCCCAGCCCCAGCTAGTGCCTTCC
>ENSG00000004059|ENST00000000233
ATGGGCCTCACCGTGTCCGCGCTCTTTTCGCGGATCTTCGGGAAGAAGCAGATGCGGATT
>ENSG00000003249|ENST00000002501
ATGGAGCCCCCGGAGGGCGCCGGCACCGGAGAGATCGTTAAGGAGGCTGAGGTGCCGCAG
GCTGCGCTGGGCGTCCCAGCCCAGGGGACAGGGGACAATGGCCACACGCCTGTGGAGGAG
>ENSG00000048028|ENST00000003302
ATGACTGCGGAGCTGCAGCAGGACGACGCGGCCGGCGCGGCAGACGGCCACGGCTCGAGC
TGCCAAATGCTGTTAAATCAACTGAGAGAAATCACAGGCATTCAGGACCCTTCCTTTCTC
CATGAAGCTCTGAAGGCCAGTAATGGTGACATTACTCAGGCAGTCAGCCTTCTCACTGAT
I want to remove the first 5 characters of every line that comes directly below a line starting with >.
I do not know how to do that on the command line. Do you know?
Here is the expected output:
>ENSG00000002587|ENST00000002596
CGCGCTGCTCCTGGGCGCGGTGCTGCTGGTGGCCCAGCCCCAGCTAGTGCCTTCC
>ENSG00000004059|ENST00000000233
CCTCACCGTGTCCGCGCTCTTTTCGCGGATCTTCGGGAAGAAGCAGATGCGGATT
>ENSG00000003249|ENST00000002501
GCCCCCGGAGGGCGCCGGCACCGGAGAGATCGTTAAGGAGGCTGAGGTGCCGCAG
GCTGCGCTGGGCGTCCCAGCCCAGGGGACAGGGGACAATGGCCACACGCCTGTGGAGGAG
>ENSG00000048028|ENST00000003302
TGCGGAGCTGCAGCAGGACGACGCGGCCGGCGCGGCAGACGGCCACGGCTCGAGC
TGCCAAATGCTGTTAAATCAACTGAGAGAAATCACAGGCATTCAGGACCCTTCCTTTCTC
CATGAAGCTCTGAAGGCCAGTAATGGTGACATTACTCAGGCAGTCAGCCTTCTCACTGAT

sed -E '/^>/{N;s/\n.{5}/\n/}' file
find line starting with >
join that line with next
replace newline and five chars with just newline
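The sed command relies on \n in the replacement being turned into a newline, which GNU sed does but the BSD sed shipped with macOS may not. The same three steps can be sketched in awk, which avoids that difference. The function name trim_first_sequence_line is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: drop the first 5 characters of the line right after each ">" header.
trim_first_sequence_line() {
  awk '
    /^>/ { print; after_header = 1; next }   # header: print unchanged, flag the next line
    after_header {                           # first sequence line after a header
      $0 = substr($0, 6)                     # keep everything from character 6 onward
      after_header = 0
    }
    { print }                                # all sequence lines, trimmed or not
  ' "$1"
}
```

Usage would be trim_first_sequence_line file > trimmed_file; lines further below a header are passed through untouched.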

Remove carriage return end of variable

I'm getting really strange output from this program. What is the carriage return doing, and how do I remove it? Why is the single quote at the end missing, and why is the letter "T" missing? How do I write the code to correct this?
Code I'm using:
#!/bin/bash
export DATABASE_LIST="/opt/halogen/crontab/etc/db_stat_list.cfg"
export v3=""
while read -r USERID ORACLE_SID2
do
v3="This is '${ORACLE_SID2}' "
echo $v3
done < <(tac $DATABASE_LIST)
output
'his is 'OT1SL80
'his is 'OT1SL010
The file I'm reading from is not corrupt and is small one with two lines
[oracle#ot1sldbm001v test2]$ cat /opt/halogen/crontab/etc/db_stat_list.cfg
asp_dba/dba OT1SL010
asp_dba/dba OT1SL80
Thank you

Your DATABASE_LIST file is in DOS/Windows format, with carriage return + linefeed at the end of each line. Unix uses just linefeed as a line terminator, so Unix tools treat the carriage return as part of the content of the line. When echo prints that carriage return, the terminal moves the cursor back to the start of the line, so the closing single quote is printed over the leading "T". You can keep this from being a problem by telling the read command to treat the carriage return as whitespace (like spaces, tabs, etc.), since read automatically removes whitespace from the beginning and end of lines:
...
while IFS="$IFS"$'\r' read -r USERID ORACLE_SID2
...
Note that since this assignment to IFS (which basically lists the whitespace characters) is a prefix to the read command, it only applies to that one command and doesn't have to be set back to normal afterward.
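Another way to get the same effect is to strip the carriage return from the variable after reading it. This is a minimal sketch, not the approach from the answer above; the function name print_sids and the lowercase variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "user sid" pairs from file "$1", tolerating DOS line endings.
print_sids() {
  cr=$(printf '\r')                  # a literal carriage return character
  while read -r userid sid; do
    sid=${sid%"$cr"}                 # remove one trailing carriage return, if present
    printf "This is '%s'\n" "$sid"
  done < "$1"
}
```

Converting the file once (for example with dos2unix, if it is installed) would remove the need for either workaround.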

What changes when a file is saved in Kedit for windows that the unix2dos command doesn't do?

So I have a strange question. I have written a script that re-formats data files. I basically create new files with the right column order, spacing, and such. I then unix2dos these files (the program I am formatting these files for is DIPS for windows, and I assume that the files should be ansi). When I go to open the files in the DIPS Program however an error occurs and the file won't open.
When I create the same kind of data file through the DIPS program and open it in note pad, it matches exactly with the data files I have created with my script.
On the other hand if I open the data files that I have created with my script in Kedit first, save them, and then open them in the DIPS program everything works.
My question is what could saving in Kedit possibly do that unix2dos does not?
(Also if I try using note pad or word pad to save instead of Kedit the file doesn't open in DIPS)
Here is what was created using the diff command in unix
"
1,16c1,16
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
--
* This file is generated by Dips for Windows.
* The following 2 lines are the Title of this file.
Cobre Panama
Drill Hole B11106-GT
Number of Traverses: 0
Global Orientation is:
DIP/DIPDIRECTION
0.000000 (Declination)
NO QUANTITY
Number of extra columns are: 0
18c18
--
440c440
--
442c442
-1
-1
"
Any help would be appreciated! Thanks!

Okay! Figured it out.
Simply put, when you unix2dos your file, you do not strip any space characters between the last letter in a line and the line break character. When saving in Kedit, those spaces are stripped.
In my script I had a poor programming practice: I was writing a string like this:
echo "This is an example string       " >> outfile.txt
The character count is 32 (the 25 characters of the sentence plus 7 trailing spaces), and if you could see the line break character (chr(10)), the line would read:
This is an example string
If you unix2dos outfile.txt the line looks the same as above but with a different break line character. However when you place the file into Kedit and save it, now the character count is 25 and the line looks like this;
This is an example string
This occurs because Kedit does not preserve spaces at the end of a line. It places the return or line break character at the last letter or "non space" character in a line.
So programs that read their input literally, like DIPS (I'm guessing) or the more widely used AutoCAD scripting, will have a real problem with extra spaces before the return character. Basically, in AutoCAD scripting a space in a line is treated as a return character. So if you have ten extra spaces at the end of a line, they are treated the same as ten returns instead of the one you probably intended.
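To get the same effect as saving in Kedit without opening an editor, the trailing spaces can be stripped in the script before running unix2dos. A minimal sketch, assuming a sed that supports the POSIX [[:blank:]] character class; the function name strip_trailing_blanks is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print file "$1" with trailing spaces and tabs removed from every line.
strip_trailing_blanks() {
  # [[:blank:]]*$ matches any run of spaces/tabs directly before the end of the line
  sed 's/[[:blank:]]*$//' "$1"
}
```

It would be used as strip_trailing_blanks outfile.txt > clean.txt, followed by unix2dos clean.txt. The order matters: once the file has DOS line endings, the carriage return sits between the spaces and the end of the line, and this pattern no longer matches.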

unix2dos converts the line-break characters at the end of each line from Unix line breaks (10) to DOS line breaks (13, 10).
Kedit could possibly change the encoding of the file (e.g. from ANSI to UTF-8).
You can change the encoding of a file with the iconv utility (on a Linux box).
