using sed to change <CR><LF> to a symbol - windows

I am working on Windows Vista with GnuWin32 (sed 4.2.1 and core utilities 5.3.0). I also have the ActivePerl 5.14.2 package.
I have a large multi record file. The end of each record in the file is denoted with four dollar signs ($$$$). Within each logical record are many "CRLF."
I would like to replace all instances of CRLF with a symbol such as |+|. Then I will replace $$$$ with CRLF. The result: one record per row for import into Excel for further manipulation.
I've tried several methods for transforming CRLF to |+| but without success.
For example, one method was: sed -e "s/[\r\n]/|+|/g" source_file_in target_file_out
Another method used tr -d to delete \r and then a second statement: sed -e "s/\n/|+|/g" source_file_in target_file_out
The tr statement worked; the sed statement did not.
I've read the following articles but don't see how to adapt them to replace \r\n with a symbol like |+|.
sed: how to replace CR and/or LF with "\r" "\n", so any file will be in one line
Replace string that contains CRLF?
How can I replace a newline (\n) using sed?
If this problem cannot be solved easily using sed (and tr), then I'll use Perl if someone shows me how.
Thank you Ed for your recommendation.
The awk script is not yet working completely, so I'll add some missing detail with the hope that you can fine tune your recommendation.
First, I'm running gawk v3.1.6.2962. I believe there may be differences in awk implementations, so this may be a useful bit of information.
Next, some more information about the type of data and origin of the data.
The data is about chemicals (text data that is input to a stereo-chemical drawing program).
The chemical files are in an .sdf format.
When I open "133711.sdf" in NotePad++ (using View/Show symbol/Show all characters), I see data that is shown in the screen shot:
https://dl.dropbox.com/u/3094317/_master_1_screen_shot_.png
As you see, LF only - no CR.
I believe this means that the origin of the .sdf files is a UNIX system.
Next, I run the Windows command COPY *.sdf _master_2_.txt. That creates the very large file-of-files that I want to parse into records.
_master_2_.txt has the same structure as 133711.sdf - LF only; no CR.
Then, I run your awk recommendation in a .BAT file. I need to replace your single quotes with double quotes because Microsoft made me.
awk -v FS="\r\n" -v OFS="|+|" -v RS="\$\$\$\$" -v ORS="\r\n" "{$1=$1}1" C:_master_2_.txt >C:\output.txt
I've attached a screenshot of output.txt:
https://dl.dropbox.com/u/3094317/output.txt.png
As you can see, the awk command did not successfully replace "\r\n" with "|+|".
Further, Windows created the output.txt with CRLF.
It did successfully replace the four $ with CRLF.
Is this information adequate to update your awk recommendation to handle the Windows-related issues?

Try this with GNU awk:
awk -v FS='\r\n' -v OFS='|+|' -v RS='\\$\\$\\$\\$' -v ORS='\r\n' '{$1=$1}1' file
I see from your updated question that you're on Windows. To avoid ridiculous quoting rules and issues, put this in a file named "whatever.awk":
BEGIN{FS="\r\n"; OFS="|+|"; RS="\\$\\$\\$\\$"; ORS="\r\n"} {$1=$1}1
and run it as
awk -f whatever.awk file
and see if that does what you want.
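For comparison, here is a minimal Python sketch of the same transformation. It is not a drop-in replacement for the awk script. It assumes the whole file fits in memory and that every record ends with $$$$; the function name flatten_records and its parameters are made up for illustration. It accepts both LF and CRLF line breaks inside a record, which matters because the question reports LF-only input.

```python
import re


def flatten_records(text, joiner="|+|", terminator="$$$$"):
    """Return one CRLF-terminated row per record.

    Line breaks inside a record (LF or CRLF) are replaced by `joiner`.
    """
    rows = []
    for record in text.split(terminator):
        # Drop the single line break that follows the previous terminator
        # and the one that precedes this record's terminator.
        record = re.sub(r"\A\r?\n", "", record)
        record = re.sub(r"\r?\n\Z", "", record)
        if not record:
            continue
        rows.append(joiner.join(re.split(r"\r?\n", record)))
    if not rows:
        return ""
    return "\r\n".join(rows) + "\r\n"
```

When reading and writing the files, opening them with open(path, newline="") keeps Python from translating line endings on its own.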

Related

how to extract all prefix words from an ispell .mwl file in bash

I have a huge ispell .mwl file and I want to remove all the ispell suffixes to generate a simple text-only word dictionary
using unix ispell, bash or perl commands.
Are there ispell command options to do that?
(in unix, the .mwl.gz files are located in the /usr/share/ispell/ directory)
A short, non-exhaustive extract of the file:
a/MRSY
A'asia
a'body
a'thing
aaa
AAAS
Aaberg/M
Aachen/M
Aaedon/M
AAeE
AAeE's
aaerially
aaerialness
Aaerope/M
AAgr/M
aah/DGS
aal/MS
Aalborg
Aalesund
aalii/MS
Aaliyah/M
Aalst/M
Aalto
aam
Aandahl/M
Aani/M
Aaqbiye/M
Aar/MN
Aara/M
Aarau
aardvark/MS
aardwolf/M
aardwolves
Aaren/M
Aargau
aargh
Aarhus
Aarika/M
aarogramme
I'm not sure what you mean by suffix but I'll assume it's the part following the / or ' in your sample text. You can do this with a simple pipeline from Bash.
cat something.mwl | perl -pe 's{[/\x27].*$}{}; ' > stripped_something.txt
The -p switch wraps the code in a loop: each input line is read into $_, the code runs on it, and $_ is then printed. Notice I put \x27 for the apostrophe in the regex; escaping it on the command line is a big pain. If any other characters start a suffix, you can add them to the character class.
You can do any other work on the line before printing it out this way too.
See the perlrun documentation for more about the -p switch.
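The same idea can be sketched in Python. This is only an illustration: the function name strip_suffixes and the cut_chars parameter are hypothetical, and the default mirrors the assumption made above that a suffix starts at the first / or apostrophe.

```python
import re


def strip_suffixes(lines, cut_chars="/'"):
    """Yield each word with everything from the first cut character removed."""
    pattern = re.compile("[" + re.escape(cut_chars) + "].*$")
    for line in lines:
        word = pattern.sub("", line.rstrip("\r\n"))
        if word:
            yield word
```

Passing cut_chars="/" keeps words such as A'asia intact and removes only the flags after the slash.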

How to replace the character "F" in a huge .txt file with the return command?

I have a pretty large .txt file with data (8MB) and the data lines are separated with the character F.
To analyze this data I need to replace the letter F with the Return command.
This is how my file looks:
-0.27, -0.21, 9.56, 78.86, 47.79, 0.02F0.07, -0.35, 9.47, 78.73, 47.74, 0.05F-0.20, -0.43, 10.60, 79.00, 47.79, 0.07F-0.49, -0.14, 10.44, 76.84, 47.70, 0.10.. and so on
This is how it should look:
-0.27, -0.21, 9.56, 78.86, 47.79, 0.02
0.07, -0.35, 9.47, 78.73, 47.74, 0.05
-0.20, -0.43, 10.60, 79.00, 47.79, 0.07
-0.49, -0.14, 10.44, 76.84, 47.70, 0.10
... and so on
I have both macOS and Windows available. I already tried it with Excel, but the file seems to be too large and Excel just crashes. Any advice?
Try EditPad Lite on Windows. It is a Notepad replacement that can handle big files.
Enable regular expressions first (search->search options). After that you can open search and replace, and replace F with \r\n (a line break).
You can use TextEdit on a Mac. Use the find and replace option. It was very fast in the test I tried: I used a 5 MB file and it ran in a few seconds. Refer to the earlier Ask Different question 'How to use find and replace to replace a character with new line' to see how to enter a newline character in the find and replace dialog.
In MacOS, give this a try.
Using translate characters command
tr F '\n' < input.txt > output.txt
The result will be stored in a separate file. If you don't need a new file, just remove > output.txt from the command and the result will be displayed in the console.
Using stream editor command
sed -i '' $'s/F/\\\n/g' test.txt
The sed command does the same operation using a regex. This replaces the contents of the original file. To keep a backup of the file, give an extension as the argument to -i (e.g. -i '.backup' creates the backup file test.txt.backup).
For more info, do man tr and man sed in your mac terminal.
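If neither tool is at hand, the replacement can also be sketched in Python, which works the same way on macOS and Windows. The function name replace_delimiter and the file names are placeholders. Reading in chunks keeps memory use low, and this is safe only because the delimiter is a single character and so cannot be split across two chunks.

```python
def replace_delimiter(src_path, dst_path, old="F", new="\n", chunk_size=1 << 20):
    """Copy src_path to dst_path, replacing every `old` character with `new`."""
    if len(old) != 1:
        raise ValueError("this sketch only handles a single-character delimiter")
    # newline="" stops Python from translating line endings on its own.
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk.replace(old, new))
```

A call such as replace_delimiter("input.txt", "output.txt") would write the converted data to a new file and leave the original untouched.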

How to remove newline breaks in fields in Unix file without using perl

I have a file which has newline breaks in one of the fields.
eg:
See third line :
"A"|"USD"|"123"|"AIRPROMOTION"|"EXPIRE"
"B"|"USD"|"456"|"AIRPROMOTION"|"EXPIRE"
"C"|"USD"|"789
"|"AIRPROMOTION"|"EXPIRE"
I tried the command perl -p00e 's/\n"|//g' which worked just fine for a small file. But my file is huge (~100MB) and it gives a 'Segmentation fault' error.
What are the other options?
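The segmentation fault most likely comes from -00. That switch enables paragraph mode, which reads up to the next blank line, and since your file has no blank lines the whole file is processed as one huge record (effectively slurp mode). Don't do that. Instead, read the file line by line.
Try this
perl -ne 'chomp; if (/^"\|/) { print } else { print "\n" if $. > 1; print } END { print "\n" }' file.txt
The script prints each line without its newline. A line that starts with "| is a continuation, so it is printed directly after the previous line; any other line is preceded by a newline that closes the previous record.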
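Try this! It joins a line to the previous one only when it starts with "| (note that the loop still reads the whole file into sed's pattern space before substituting):
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n"|/"|/g' input_file > output_file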
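A streaming alternative can be sketched in Python, in case neither tool copes with the file size. It holds only one logical record in memory at a time. The function name join_broken_lines is hypothetical, and the sketch assumes, as in the sample, that a continuation line always starts with "| and that no genuine record does.

```python
def join_broken_lines(lines, continuation_prefix='"|'):
    """Yield logical records, gluing continuation lines onto the previous line."""
    pending = None
    for line in lines:
        line = line.rstrip("\r\n")
        if pending is not None and line.startswith(continuation_prefix):
            # The line break was inside a field: append without a separator.
            pending += line
        else:
            if pending is not None:
                yield pending
            pending = line
    if pending is not None:
        yield pending
```

Because it is a generator, it can be fed an open file object directly and its output written line by line to a second file.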
I would use the Notepad++ Replace function (find \r\n\r\n and replace it with \r\n).
If you don't have it, you can download Notepad++ for free; it is a very useful application with many uses.
In the View menu, select Show Symbol and check Show All Characters.
Press Ctrl+H, or open the Search menu and select the Replace... option. Set Search Mode to Extended so that \r\n is interpreted.
Type \r\n\r\n in Find what:
Type \r\n in Replace with:
Click the Replace All button.
PS: The text you have supplied is not just LF; it is CRLF, which is \r\n. You can try your method. Remember that you want to replace only CRLFCRLF with one CRLF; otherwise you will lose all your CRLFs and all your text will appear on one line.

Excel saves tab delimited files without newline (UNIX/Mac os X)

This is a common issue I have and my solution is a bit brash. So I'm looking for a quick fix and explanation of the problem.
The problem is that when I decide to save a spreadsheet in excel (mac 2011) as a tab delimited file it seems to do it perfectly fine. Until I try to parse the file line by line using Perl. For some reason it slurps the whole document in one line.
My brutish solution is to open the file in a web browser and copy and paste the information into the tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline in the end of the file before doing this fix and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
~Thanks!~
The problem is the character codes that mark the end of a line on different systems. Windows systems commonly use CarriageReturn+LineFeed (CRLF), *NIX systems use only a LineFeed (LF), and classic Mac OS used a lone CarriageReturn (CR).
These can be represented in a regex as \r\n, \n and \r (respectively).
Sometimes you need to convert the line endings before you can process a text file line by line. Try this for DOS-to-UNIX in perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed (type ^M as Ctrl-V Ctrl-M so that it is a literal carriage return, not a caret followed by M):
$ sed 's/^M$//' input.txt > output.txt
Found a pretty simple solution to this. Copy data from Excel to clipboard, paste it into a google spreadsheet. Download google spreadsheet file as a 'tab-separated values .tsv'. This gets around the problem and you have tab delimiters with an end of line for each line.
Yet another solution:
for a tab-delimited file, save the document as a 'Windows Formatted Text (.txt)' file type
for a comma-separated file, save the document as a 'Windows Comma Separated (.csv)' file type
Perl has a useful regex pattern, \R, which matches any common line ending. It matches any vertical whitespace (the same as \v) or the CR LF combination, so it is equivalent to \r\n|\v.
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which gives you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead).
Another option is the PerlIO::eol module. It provides a new Perl IO layer that normalizes line endings no matter what the contents of the file are.
Once you have loaded the module with use PerlIO::eol you can use it in an open statement
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform.
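For readers who are not tied to Perl, the same "accept any line ending" idea can be sketched in Python. The function name split_tsv is made up for illustration. The sketch relies on str.splitlines(), which treats \r, \n and \r\n (plus a few rarer separators) as line boundaries, so it behaves much like splitting on \R.

```python
def split_tsv(data, encoding="utf-8"):
    """Split raw tab-delimited bytes into rows, whatever the line-ending style."""
    text = data.decode(encoding)
    # splitlines() handles CR, LF and CRLF alike; empty lines are skipped.
    return [line.split("\t") for line in text.splitlines() if line]
```

The file would need to be read in binary mode (open(path, "rb").read()) so that no line-ending translation happens before the split.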

bash templating

I have a template containing a placeholder, LINKS,
and a data file, links.txt, with one URL per line.
How can I substitute LINKS with the content of links.txt in bash?
If I do
#!/bin/bash
LINKS=$(cat links.txt)
sed "s/LINKS/$LINK/g" template.xml
Two problems:
$LINKS has the content of links.txt without newlines
sed: 1: "s/LINKS/http://test ...": bad flag in substitute command: '/'
sed is not escaping the // in the links.txt file
Thanks
Use some better language instead. I'd write a solution for bash + awk... but that's simply too much effort to go into. (See http://www.gnu.org/manual/gawk/gawk.html#Getline_002fVariable_002fFile if you really want to do that)
Just use any language where you don't have to mix control and content text. For example in python:
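#!/usr/bin/env python3
# Read both files, then substitute the placeholder with the list of links.
with open('links.txt') as f:
    links = f.read()
with open('template.xml') as f:
    template = f.read()
print(template.replace('LINKS', links))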
Watch out if you're trying to force sed solution with some other separator - you'll get into the same problems unless you find something disallowed in urls (but are you verifying that?) If you don't, you already have another problem - links can contain < and > and break your xml.
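The XML concern in the last paragraph can be handled by escaping each link before substituting it. The following is a minimal sketch under that assumption; the function name fill_template is hypothetical, and xml.sax.saxutils.escape from the standard library only escapes &, < and >, which is enough for element content but not for attribute values.

```python
from xml.sax.saxutils import escape


def fill_template(template, links, placeholder="LINKS"):
    """Replace `placeholder` with the XML-escaped links, one per line."""
    escaped = [escape(link.strip()) for link in links if link.strip()]
    return template.replace(placeholder, "\n".join(escaped))
```

Here links can be any iterable of strings, for example an open links.txt file object.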
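You can do this using ed:
ed template.xml <<EOF
/LINKS/r links.txt
?LINKS?d
w output.txt
EOF
The first command finds the line containing LINKS and inserts the contents of links.txt after it.
The second command searches backwards for the LINKS line and deletes it (this assumes that no URL in links.txt contains the text LINKS). Deleting first and then reading with .r would put the links one line too low, because after a delete the current line is the line that followed the deleted one.
The third command writes the result to output.txt (if you omit output.txt, the edits are saved to template.xml).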
Try running sed twice. On the first run, replace / with \/. The second run will be the same as what you currently have.
The character following the 's' in the sed command becomes the delimiter, so you'll want to use a character that is not present in the value of $LINK. For example, you could try a comma:
sed "s,LINKS,${LINK}\n,g" template.xml
Note that I also added a \n to add an additional newline (GNU sed interprets \n in the replacement; BSD sed does not).
Another option is to escape the forward slashes in $LINK, possibly using sed. If you don't have guarantees about the characters in $LINK, this may be safer.
