This is a common issue I have and my solution is a bit brash. So I'm looking for a quick fix and explanation of the problem.
The problem is that when I save a spreadsheet in Excel (Mac 2011) as a tab delimited file, it seems to work perfectly fine until I try to parse the file line by line using Perl. For some reason Perl reads the whole document as one line.
My brutish solution is to open the file in a web browser and copy and paste the information into the tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline at the end of the file before doing this fix and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
Thanks!
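The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a CarriageReturn+LineFeed (CRLF), *NIX systems use only a LineFeed (LF), and classic Mac OS used only a CarriageReturn (CR). Excel for Mac 2011 appears to still write CR-only line endings when saving as tab delimited text, so Perl, which splits on LF by default, sees the whole file as one line.
These characters can be represented in RegEx as \r\n, \n, or \r (respectively).
Sometimes, to hash through a text file, you need to convert the New Line characters. Try this for DOS-to-UNIX in Perl (use s/\r\n?/\n/g instead to also catch CR-only files):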
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed (here ^M stands for a literal carriage return, typed as Ctrl-V followed by Ctrl-M):
$ sed 's/^M$//' input.txt > output.txt
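For readers who would rather not depend on shell quoting, the same normalization can be sketched in Python. This is only an illustration and not part of the original answer; the function name normalize_newlines is made up for the example. It treats CRLF and lone CR alike, so it also covers the CR-only files that classic Mac tools produce.

```python
import re


def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF (Windows) and lone CR (classic Mac) line endings to LF."""
    # \r\n? matches a CRLF pair as one unit, or a lone CR.
    # Existing LF characters are left untouched.
    return re.sub(rb"\r\n?", b"\n", data)


if __name__ == "__main__":
    # One CR-only line, one CRLF line, one LF line.
    sample = b"a\tb\rc\td\r\ne\tf\n"
    print(normalize_newlines(sample))
```

Working on bytes avoids any newline translation that a text-mode file handle might apply before the substitution runs.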
Found a pretty simple solution to this. Copy the data from Excel to the clipboard and paste it into a Google spreadsheet. Download the Google spreadsheet as 'tab-separated values (.tsv)'. This gets around the problem and you have tab delimiters with an end of line for each line.
Yet another solution ...
For a tab-delimited file, save the document as the 'Windows Formatted Text (.txt)' file type.
For a comma-separated file, save the document as the 'Windows Comma Separated (.csv)' file type.
Perl has a useful regex pattern \R which will match any common line ending. It actually matches any vertical whitespace -- the same as \v -- or the CR LF combination, so it's the same as \r\n|\v.
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which will give you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead).
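As a rough cross-check of the idea, here is a Python sketch of splitting on any common line ending. It is an approximation of \R, not an exact equivalent: it only covers CRLF, CR and LF. The names LINE_ENDING and split_records are invented for this example.

```python
import re

# CRLF must come first so it is consumed as a single terminator
# rather than as a CR followed by an LF.
LINE_ENDING = re.compile(r"\r\n|\r|\n")


def split_records(text):
    """Split text into records on CRLF, CR or LF, dropping the terminators."""
    records = LINE_ENDING.split(text)
    # A trailing terminator leaves one empty string at the end; drop it,
    # which mirrors how Perl's split discards trailing empty fields.
    if records and records[-1] == "":
        records.pop()
    return records


if __name__ == "__main__":
    print(split_records("a\tb\r\nc\td\re\tf\n"))
```

When reading the text from a file for this purpose, opening it with newline="" keeps Python from translating the line endings before the split sees them.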
Another option is the PerlIO::eol module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are.
Once you have loaded the module with use PerlIO::eol you can use it in an open statement:
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles:
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform.
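For comparison only, Python's built-in text mode offers a similar normalizing behaviour through universal newlines, so a sketch of the same "read lines from any platform" idea might look like the following. The helper name read_lines and the temporary file are assumptions made for the demonstration.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file whose lines may end in CR, LF or CRLF."""
    # newline=None (the default) turns on universal newlines: \r, \n and
    # \r\n are all translated to \n as the file is read.
    with open(path, "r", newline=None, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


if __name__ == "__main__":
    # Write a CR-only file, like the one Excel for Mac 2011 produces.
    fd, demo_path = tempfile.mkstemp(suffix=".tsv")
    os.write(fd, b"a\tb\rc\td\r")
    os.close(fd)
    print(read_lines(demo_path))
    os.remove(demo_path)
```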
I have a pretty large .txt file with data (8 MB) in which the data lines are separated by the character F.
To analyze this data I need to replace each letter F with a line break.
This is how my file looks:
-0.27, -0.21, 9.56, 78.86, 47.79, 0.02F0.07, -0.35, 9.47, 78.73, 47.74, 0.05F-0.20, -0.43, 10.60, 79.00, 47.79, 0.07F-0.49, -0.14, 10.44, 76.84, 47.70, 0.10.. and so on
This is how it should look:
-0.27, -0.21, 9.56, 78.86, 47.79, 0.02
0.07, -0.35, 9.47, 78.73, 47.74, 0.05
-0.20, -0.43, 10.60, 79.00, 47.79, 0.07
-0.49, -0.14, 10.44, 76.84, 47.70, 0.10
... and so on
I have both macOS and Windows available. I already tried it with Excel, but the file seems to be too large and Excel just crashes. Any advice?
Try EditPad Lite on Windows. It's a text editor that is able to handle big files.
You have to enable regular expressions (search->search options) for this to work correctly. After that you can open the search and replace F with \r\n (a new line).
You can use TextEdit on a Mac. Use the find and replace option. It is very fast in the test I tried: I used a 5 MB file and it ran in a few seconds. Refer to the previous question in Ask Different 'How to use find and replace to replace a character with new line' to see how to enter a newline character in the find and replace option.
In MacOS, give this a try.
Using translate characters command
tr F '\n' < input.txt > output.txt
The result will be stored in a separate file. If no new file is needed, just remove > output.txt from the command and it will display the result in the console.
Using stream editor command
sed -i '' $'s/F/\\\n/g' test.txt
The sed command will do the same operation using a regex. This replaces the contents of the original file. To create a backup of the file, give an extension as the argument to -i (for example, -i '.backup' creates a backup file named test.txt.backup).
For more info, do man tr and man sed in your mac terminal.
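If neither editor nor shell is convenient, the same replacement can be sketched in Python, which runs on both macOS and Windows. This is an illustration under assumptions: the function name replace_separator and the chunk size are arbitrary, and the file is assumed to be plain ASCII as in the sample. Reading in chunks keeps memory use small, and because the separator is a single character it can never be split across two chunks.

```python
import os
import tempfile


def replace_separator(src_path, dst_path, sep="F", chunk_size=1 << 20):
    """Copy src_path to dst_path, turning every `sep` into a line feed."""
    with open(src_path, "r", encoding="ascii") as src, \
            open(dst_path, "w", encoding="ascii", newline="\n") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk.replace(sep, "\n"))


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "input.txt")
    dst = os.path.join(workdir, "output.txt")
    with open(src, "w", encoding="ascii") as fh:
        fh.write("-0.27, -0.21, 0.02F0.07, -0.35, 0.05F-0.20, -0.43, 0.07")
    replace_separator(src, dst)
    with open(dst, encoding="ascii") as fh:
        print(fh.read())
```

Like the tr command, this writes to a separate output file and leaves the input untouched.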
I have numerous files with extension .awesome containing lines like the following:
something =
[51,42,12]
Where "something =" is in all the files, as well as "[" (the numbers vary).
I would like to get rid of the newline, but don't know how. I came across tr, but worry it would replace all newlines. My files contain multiple newlines that I would like to retain (I only want to change this one). I've been able to find and replace successfully in the past with sed, but am having trouble specifically with the special characters (\n and =). In addition, I'm reading that sed works line by line and cannot handle something like this.
Any guidance would be appreciated.
GNU sed solution:
Sample test.awesome file contents:
some text
another text
something =
[51,42,12]
text
text
The job:
sed '/something =/{N; s/\n/ /;}' test.awesome
The output:
some text
another text
something = [51,42,12]
text
text
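The logic of that sed command (on a matching line, pull in the next line and replace the newline between them with a space) can be sketched in Python as well. This is only an illustration; the function name join_after_marker is made up, and the marker text is taken from the question.

```python
def join_after_marker(lines, marker="something ="):
    """Join each line containing `marker` with the line that follows it."""
    out = []
    it = iter(lines)
    for line in it:
        if marker in line:
            # Consume the next line, like sed's N command.
            following = next(it, None)
            if following is not None:
                line = line + " " + following
        out.append(line)
    return out


if __name__ == "__main__":
    sample = ["some text", "another text", "something =", "[51,42,12]", "text", "text"]
    for row in join_after_marker(sample):
        print(row)
```

All other newlines are preserved because only the line right after a marker line is consumed.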
I have a file which has newline breaks in one of the fields.
eg:
See third line :
"A"|"USD"|"123"|"AIRPROMOTION"|"EXPIRE"
"B"|"USD"|"456"|"AIRPROMOTION"|"EXPIRE"
"C"|"USD"|"789
"|"AIRPROMOTION"|"EXPIRE"
I tried the command perl -p00e 's/\n"|//g', which worked just fine for a small file. But my file is huge (~100MB) and it gives a 'Segmentation fault' error.
What are the other options?
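The likely reason for the segmentation fault is the -00 switch. It enables paragraph mode, and because your file has no blank lines the whole file is read into memory as a single record, just as in slurp mode. Don't do that. Instead, read the file line by line.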
Try this
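perl -ne 'chomp; if (defined $prev && /^"\|/) { $prev .= $_ } else { print "$prev\n" if defined $prev; $prev = $_ } END { print "$prev\n" if defined $prev }' file.txt
In the above script $prev holds the record read so far. A line that starts with "| is a continuation, so it is appended to $prev; any other line first prints the pending record and then becomes the new $prev.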
Try this! Should work like a charm!
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n"|/"|/g' input_file > output_file
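The line-by-line approach can also be sketched in Python. This is a hedged illustration, not one of the original answers: the generator name merge_continuations is invented, and it assumes (as in the sample) that a continuation line always starts with "| while a real record never does. It holds only one pending record in memory, so file size should not matter.

```python
def merge_continuations(lines, prefix='"|'):
    """Yield records, gluing any line that starts with `prefix` onto the previous one."""
    pending = None
    for line in lines:
        line = line.rstrip("\r\n")
        if pending is not None and line.startswith(prefix):
            # Continuation of a field that was broken by a stray newline.
            pending += line
        else:
            if pending is not None:
                yield pending
            pending = line
    if pending is not None:
        yield pending


if __name__ == "__main__":
    sample = [
        '"A"|"USD"|"123"|"AIRPROMOTION"|"EXPIRE"\n',
        '"B"|"USD"|"456"|"AIRPROMOTION"|"EXPIRE"\n',
        '"C"|"USD"|"789\n',
        '"|"AIRPROMOTION"|"EXPIRE"\n',
    ]
    for record in merge_continuations(sample):
        print(record)
```

Passing an open file handle as `lines` streams the input, since iterating over a file yields one line at a time.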
I would use the Notepad++ Replace function (find \r\n\r\n and replace it with \r\n).
If you don't have it, you can download Notepad++ for free; it is a very useful application with many uses.
In the View menu select Show Symbol and check Show All Characters.
Press Ctrl+H or click on the Search menu and select the Replace... option. Set the Search Mode to Extended so that \r\n is interpreted as a line ending.
Type \r\n\r\n in Find what:
Type \r\n in Replace with:
Click on the Replace All button.
PS: The text you have supplied is not just LF, it is CRLF, which is \r\n. You can try your method. Remember you want to replace only CRLFCRLF with one CRLF, otherwise you will lose all your CRLFs and all your text will appear on one line.
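The same single-pass replacement can be sketched in Python for anyone without Notepad++. The function name collapse_double_crlf is made up for the example, and the data is handled as bytes so that the CRLF pairs are not translated on the way in.

```python
def collapse_double_crlf(data: bytes) -> bytes:
    """Replace every CRLFCRLF pair with a single CRLF (one pass)."""
    # Only doubled line endings are touched, so single CRLFs survive
    # and the text does not collapse onto one line.
    return data.replace(b"\r\n\r\n", b"\r\n")


if __name__ == "__main__":
    print(collapse_double_crlf(b"first\r\n\r\nsecond\r\n\r\nthird\r\n"))
```

As with a single Replace All, a run of three or more consecutive CRLFs still leaves a doubled line ending behind, so the call may need to be repeated on such input.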
I have a file with the following entries:
folder1/a_b.csv folder1/generated/
folder2/folder3/a_b1.csv folder12/generated/
folder4/b_c.csv folder123/generated/
folder5/d.csv folder1/new_folder/generated/
folder6/12.csv folder/anotherfolder/morefolder/evenmorefolder/generated/
I want to copy the csv file name from each line, paste them at the end of that line and append it with ".org". Hence, the changed file would look like
folder1/a_b.csv folder1/generated/a_b.csv.org
folder2/folder3/a_b1.csv folder12/generated/a_b1.csv.org
folder4/b_c.csv folder123/generated/b_c.csv.org
folder5/d.csv folder1/new_folder/generated/d.csv.org
folder6/12.csv folder/anotherfolder/morefolder/evenmorefolder/generated/12.csv.org
Basically, I am looking for a command in vim or sed using which I can search a pattern in each line and append it at the end of that line. Is it possible?
Thanks in advance.
Vim
Here's how to do this in Vim:
:%s/\([^/]*\.csv\)\( .*\)/&\1.org/
This global (:%) substitution matches the filename (characters that don't contain /, ending in .csv), and captures \(...\) it. It then matches the rest of the line, and captures that, too.
As a replacement, first keep the original match & (or \0), then append the first capture (\1) with the additional suffix.
sed
Though the regular expression syntax is somewhat different than in Vim, the identical expression can be used with sed:
sed -e 's/\([^/]*\.csv\)\( .*\)/&\1.org/' input
Alternatives
It looks like you want to do file renaming in batches. On Linux, the mmv command-line tool is well suited for that; you'll probably find many similar tools on the web, too.
This might work for you (GNU sed):
sed -r 's|([^/ ]*) .*|&\1.org|' file
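To check the pattern outside of Vim or sed, the substitution can be sketched in Python with the same regular expression as the Vim answer. This is just an illustration; the function name append_org_name is hypothetical.

```python
import re

# Group 1: the file name (no slashes) ending in .csv.
# Group 2: the rest of the line, starting at the separating space.
PATTERN = re.compile(r"([^/]*\.csv)( .*)")


def append_org_name(line):
    """Append the csv file name plus '.org' to the end of the line."""
    # Keep the whole match (like & in sed), then add group 1 and the suffix.
    return PATTERN.sub(lambda m: m.group(0) + m.group(1) + ".org", line, count=1)


if __name__ == "__main__":
    for entry in (
        "folder1/a_b.csv folder1/generated/",
        "folder2/folder3/a_b1.csv folder12/generated/",
    ):
        print(append_org_name(entry))
```

Lines should be passed without their trailing newline, because the suffix is appended at the end of the matched text.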
Am working on Windows Vista with GnuWin32 (sed 4.2.1 and core utilities 5.3.0). Also have ActivePerl 5.14.2 package.
I have a large multi record file. The end of each record in the file is denoted with four dollar signs ($$$$). Within each logical record are many "CRLF."
I would like to replace all instances of CRLF with a symbol such as |+|. Then I will replace $$$$ with CRLF. The result: one record per row for import into Excel for further manipulation.
I've tried several methods for transforming CRLF to |+| but without success.
For example, one method was: sed -e "s/[\r\n]/|+|/g" source_file_in target_file_out
Another method used tr -d to delete \r and then a second statement: sed -e "s/\n/|+|/g" source_file_in target_file_out
The tr statement worked; the sed statement did not.
I've read the following articles but don't see how to adapt them to replace \r\n with a symbol like |+|.
sed: how to replace CR and/or LF with "\r" "\n", so any file will be in one line
Replace string that contains CRLF?
How can I replace a newline (\n) using sed?
If this problem cannot be solved easily using sed (and tr), then I'll use Perl if someone shows me how.
Thank you Ed for your recommendation.
The awk script is not yet working completely, so I'll add some missing detail with the hope that you can fine tune your recommendation.
First, I'm running gawk v3.1.6.2962. I believe there may be differences in awk implementations, so this may be a useful bit of information.
Next, some more information about the type of data and origin of the data.
The data is about chemicals (text data that is input to a stereo-chemical drawing program).
The chemical files are in an .sdf format.
When I open "133711.sdf" in NotePad++ (using View/Show symbol/Show all characters), I see data that is shown in the screen shot:
https://dl.dropbox.com/u/3094317/_master_1_screen_shot_.png
As you see, LF only - no CR.
I believe this means that the origin of the .sdf files is a UNIX system.
Next, I run the Windows command COPY *.sdf _master_2_.txt. That creates the very large file-of-files that I want to parse into records.
_master_2_.txt has the same structure as 133711.sdf - LF only; no CR.
Then, I run your awk recommendation in a .BAT file. I need to replace your single quotes with double quotes because Microsoft made me.
awk -v FS="\r\n" -v OFS="|+|" -v RS="\$\$\$\$" -v ORS="\r\n" "{$1=$1}1" C:_master_2_.txt >C:\output.txt
I've attached a screenshot of output.txt:
https://dl.dropbox.com/u/3094317/output.txt.png
As you can see, the awk command did not successfully replace "\r\n" with "|+|".
Further, Windows created the output.txt with CRLF.
It did successfully replace the four $ with CRLF.
Is this information adequate to update your awk recommendation to handle the Windows-related issues?
Try this with GNU awk:
awk -v FS='\r\n' -v OFS='|+|' -v RS='\\$\\$\\$\\$' -v ORS='\r\n' '{$1=$1}1' file
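I see from your updated question that you're on Windows and that the input uses LF only. To avoid ridiculous quoting rules and issues, put this in a file named "whatever.awk" (FS makes the CR optional here so that it matches either kind of line ending):
BEGIN{FS="\r?\n"; OFS="|+|"; RS="\\$\\$\\$\\$"; ORS="\r\n"} {$1=$1}1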
and run it as
awk -f whatever.awk file
and see if that does what you want.
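If awk keeps fighting the Windows environment, the same two-step transformation can be sketched in Python. This is an illustration under assumptions, not the answerer's method: the function name flatten_records is invented, line breaks inside a record may be LF or CRLF, and unlike the awk script it trims the line break that directly follows each $$$$ so that records do not start with |+|.

```python
import re


def flatten_records(text, record_sep="$$$$", field_sep="|+|"):
    """Put each $$$$-terminated record on one CRLF-terminated row."""
    records = text.split(record_sep)
    # Whatever follows the last $$$$ is usually just a line break; drop it.
    if records and records[-1].strip() == "":
        records.pop()
    rows = []
    for record in records:
        # Trim line breaks at the record edges, then replace the inner ones.
        body = record.strip("\r\n")
        rows.append(re.sub(r"\r\n|\r|\n", field_sep, body))
    return "".join(row + "\r\n" for row in rows)


if __name__ == "__main__":
    sample = "name one\n  1  2\nM  END\n$$$$\nname two\n  3  4\nM  END\n$$$$\n"
    print(repr(flatten_records(sample)))
```

When reading and writing the real files, opening them with newline="" keeps Python from translating the line endings a second time.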