Extracting the first two characters from a file in perl into another file - bash

I'm having a little bit of trouble with my code below -- I'm trying to figure out how to open up all these text files (.csv files that end in DIS that all have one line in them) and get the first two characters (these are all numbers) from them and print them into another file of the same name, with a ".number" suffix. Some of these .DIS files don't have anything in them, in which case I want to print "0".
Lastly, I would like to go through each original .DIS file and delete the first 3 characters -- I did this through bash.
my @DIS = <*.DIS>;
foreach my $file (@DIS){
    my $name = $file;
    my $output = "$name.number";
    open(INHANDLE, "< $file") || die("Could not open file");
    while(<INHANDLE>){
        open(OUT_FILE,">$output") || die;
        my $line = $_;
        chomp ($line);
        my $string = $line;
        if ($string eq ""){
            print "0";
        } else {
            print substr($string,0,2);
        }
    }
    system("sed -i 's/\(.\{3\}\)//' $file");
}
When I run this code, I get a list of numbers that are concatenated together, and empty .DIS.number files. I'm rather new to Perl, so any help would be appreciated!

When I run this code, I get a list of numbers that are concatenated together, and empty .DIS.number files.
This is because of this line.
print substr($string,0,2);
print defaults to printing to STDOUT (i.e. the screen). You need to give it the filehandle to print to.
print OUT_FILE substr($string,0,2);
They're being concatenated because print just prints what you tell it to, it won't put newlines in for you (there are some global variables which can change this, don't mess with them). You have to add the newline yourself.
print OUT_FILE substr($string,0,2), "\n";
As a final note, when working with files in Perl I would suggest using lexical filehandles, Path::Tiny, and autodie. They will avoid a great number of classic problems working with files in Perl.
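As a minimal sketch of the question's task using those ideas (this assumes the Path::Tiny CPAN module is installed; Path::Tiny throws its own exceptions, so autodie mainly covers any plain open calls you add later):
use strict;
use warnings;
use autodie;      # plain open/close/print calls die on failure automatically
use Path::Tiny;   # CPAN module for concise file reading/writing

for my $file ( glob '*.DIS' ) {
    my $line = path($file)->slurp;   # read the whole one-line file
    chomp $line;
    # first two characters, or 0 for an empty (or one-character) file
    my $num = length($line) >= 2 ? substr( $line, 0, 2 ) : 0;
    path("$file.number")->spew("$num\n");
}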

I suggest you do it like this
Each *.dis file is opened and the contents read into $text. Then a regex substitution is used to remove the first three characters from the string, capturing the first two in $1.
If the substitution succeeded, the contents of $1 are written to the number file; otherwise the original file is empty (or shorter than two characters) and a zero is written instead. The remaining contents of $text are then written back to the *.dis file. For example, given a one-line file containing 12345, $1 captures 12 and the substitution leaves 45 in $text, so 12 goes to the .number file and 45 back to the .DIS file.
use strict;
use warnings;
use v5.10.1;
use autodie;
for my $dis_file ( glob '*.DIS' ) {

    my $text = do {
        open my $fh, '<', $dis_file;
        <$fh>;
    };

    my $num_file = "$dis_file.number";

    open my $dis_fh, '>', $dis_file;
    open my $num_fh, '>', $num_file;

    if ( defined $text and $text =~ s/^(..).?// ) {
        print $num_fh "$1\n";
        print $dis_fh $text;
    }
    else {
        print $num_fh "0\n";
        print $dis_fh "-\n";
    }
}

This awk script extracts the first two chars of each file into its own file. Empty files are expected to have one empty line, based on the spec.
awk 'FNR==1{pre=substr($0,1,2);pre=length(pre)==2?pre:0; print pre > FILENAME".number"}' *.DIS
This will remove the first 3 chars
cut -c 4-
A bash for loop will be better for doing both, though we'll need to modify the awk script a little bit
for f in *.DIS; do
    awk 'NR==1{pre=substr($0,1,2);$0=length(pre)==2?pre:0; print}' $f > $f.number
    cut -c 4- $f > $f.cut
done
Explanation: loop through all files in *.DIS; for the first line of each file, try to get the first two chars (1,2) of the line ($0) and assign them to pre. If the length of pre is not two (the line is empty or has only 1 char), set the line to 0, otherwise use pre; print the line. The output file name is the input file name with a .number suffix appended. The $0 assignment is a trick to save a couple of keystrokes, since print without arguments prints $0; otherwise you could provide the argument.
Ideally you should quote "$f", since the file name may contain spaces...

Related

How to make and name multiple text files after using the cut command?

I have about 50 data text files that I need to remove several columns from.
I have been using the cut command to remove and rename them individually but I will have many more of the files and need a way to do it large scale.
Currently I have been using:
cut -f1,6,7,8 filename.txt >> filename_Fixed.txt
And I am able to remove the columns from all the files using:
cut -f1,6,7,8 *.txt
But I'm only able to get all the output in the terminal or I can write it to a single text file.
What I want is to edit several files using cut to remove the required columns:
filename1.txt
filename2.txt
filename3.txt
filename4.txt
.
.
.
And get the edited output to write to individual files:
filename_Fixed1.txt
filename_Fixed2.txt
filename_Fixed3.txt
filename_Fixed4.txt
.
.
.
But haven't been able to find a way to write the output to new text files. I'm new to using the command line and not much of a coder, so maybe I don't know what terms to search for? I haven't even been able to find anything doing google searches that has helped me. It seems like it should be simple, but I am struggling.
In desperation, I did try this bit of code, knowing it wouldn't work:
cut -f1,6,7,8 *.txt >> ( FILENAME ".fixed" )
I found the portion after ">>" nested in an awk command that output multiple files.
I also tried (again knowing it wouldn't work) to wild card the output files but got an ambiguous redirect error.
Did you try a for loop?
for f in *.txt ; do
    cut -f 1,6,7,8 "$f" > "$(basename "$f" .txt)_fixed.txt"
done
(N.B. I can't try the basename now, you can replace it with "${f}_fixed")
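If Perl is an option, here is a rough one-liner sketch of the same idea (it assumes tab-separated columns, which is cut's default, borrows the _fixed naming from the loop above, and needs Perl 5.10+ for the // operator; $ARGV holds the name of the file currently being read):
perl -F'\t' -lane '
    if ( $ARGV ne ( $cur // "" ) ) {              # just moved to a new input file?
        $cur = $ARGV;
        ( my $out = $cur ) =~ s/\.txt$/_fixed.txt/;
        open $fh, ">", $out or die "$out: $!";    # one output file per input file
    }
    print {$fh} join "\t", @F[0,5,6,7];           # fields 1,6,7,8
' *.txt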
You can also process it all in awk itself which would make the process much more efficient, especially for large numbers of files, for example:
awk '
NF < 8 {
print "contains less than 8 fields: ", FILENAME
next
}
{ fn=FILENAME
idx=match(fn, /[0-9]+.*$/)
if (idx == 0) {
print "no numeric suffix for file: ", fn
next;
}
newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx)
print $1,$6,$7,$8 > newfn
}
' *.txt
Which contains two rules (the expressions between {...}). The first:
NF < 8 {
print "contains less than 8 fields: ", FILENAME
next
}
simply checks that the file contains at least 8 fields (since you want field 8 as your last field). If the file contains less than 8 fields, it just skips to the next file in your list.
The second rule:
{ fn=FILENAME
idx=match(fn, /[0-9]+.*$/)
if (idx == 0) {
print "no numeric suffix for file: ", fn
next;
}
newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx)
print $1,$6,$7,$8 > newfn
}
fn=FILENAME stores the current filename as fn to cut down typing,
idx=match(fn, /[0-9]+.*$/) locates the index where the numeric suffix for the filename begins (e.g. where "3.txt" starts),
if (idx == 0) then a numeric suffix was not found, warn, and move on to the next file,
newfn=substr(fn,1,idx-1) "_Fixed" substr(fn,idx) form the new filename from the non-numeric prefix (e.g. "filename"), add "_Fixed" with string-concatenation and then add the numeric suffix, and finally
print $1,$6,$7,$8 > newfn print fields (columns) 1,6,7,8 redirecting output to the new filename.
For more information on each of the string-functions used above, see the GNU awk User's Guide - 9.1.3 String-Manipulation Functions
If I understand what you were attempting, this should be able to handle as many files as you have -- so long as the files have a numeric suffix to place "_Fixed" before in the filename and each file has at least 8 fields (columns). You can just copy/middle-mouse-paste the entire command at the command-line to test.

Remove multiple lines where string occurs and concatenate

I'm new to Bash/Perl and trying to remove multiple lines in a text file where a string occurs. To remove a single line so far I have:
perl -ne '/somestring/ or print' /usr/file.txt > /usr/file1.tmp
To remove lines containing a second string I use:
perl -ne '/anotherstring/ or print' /usr/file.txt > /usr/file2.tmp
How can I concatenate file and file2.tmp?
Or how can I modify the command to remove multiple lines where somestring and anotherstring occur?
How can I concatenate file and file2.tmp?
That could be done with
cat file file2.tmp >> file3.tmp
Or if by file you mean file1.tmp,
cat file1.tmp file2.tmp >> file3.tmp
However, that is different from what you're describing in the rest of your question (i.e. removing any line where any of two patterns appears). That could be done by chaining your commands:
perl -ne '/somestring/ or print' /usr/file.txt > /usr/file1.tmp
perl -ne '/anotherstring/ or print' /usr/file1.tmp > /usr/file2.tmp
You can use a pipe to get rid of the intermediate file file1.tmp:
perl -ne '/somestring/ or print' /usr/file.txt | perl -ne '/anotherstring/ or print' > /usr/file2.tmp
This can also be done by using grep (assuming your strings don't make use of any Perl-specific regex features):
grep -v somestring /usr/file.txt | grep -v anotherstring > /usr/file2.tmp
Finally, you can combine the filtering into one command/regex:
perl -ne '/somestring|anotherstring/ or print' /usr/file.txt > /usr/file2.tmp
Or using grep:
grep -v 'somestring\|anotherstring' /usr/file.txt > /usr/file2.tmp
I had some fun with your program, and wrote a highly dynamic Perl program
to print the matches or non-matches for words in each line of any user-defined file, and then write the requested lines (those which match, or those which do not) to the screen and to a new user-defined outfile.
We will be parsing this file: iris_dataset.csv:
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.8,3,1.4,0.3,"setosa"
5.1,3.8,1.6,0.2,"setosa"
4.6,3.2,1.4,0.2,"setosa"
7,3.2,4.7,1.4,"versicolor"
6.4,3.2,4.5,1.5,"versicolor"
6.9,3.1,4.9,1.5,"versicolor"
6.6,3,4.4,1.4,"versicolor"
5.5,2.4,3.7,1,"versicolor"
6.3,3.3,6,2.5,"virginica"
5.8,2.7,5.1,1.9,"virginica"
7.1,3,5.9,2.1,"virginica"
6.3,2.9,5.6,1.8,"virginica"
5.9,3,5.1,1.8,"virginica"
It's a comma-separated value file. You could see each column of items more nicely if you were looking at this file in a spreadsheet. What we will be looking for is the Species column of the file, so the possible items to match are "setosa", "versicolor", and "virginica".
My program first asks for the file that you want to read from.
In this case, it's iris_dataset.csv, though it could be any file. Then you write the name of a file that you would want to write to. I call it new_iris.csv, but you can call it anything.
Then we tell the program how many items we are looking for, so if there are 3 items I can type: setosa, versicolor, virginica in any order. If there are two I can type only two items, and if there is one, then I can type only setosa or versicolor or virginica in this example file.
Then we are asked if we want to KEEP the lines which match our items,
or if we want to REMOVE the lines of the file which match our items. If we keep the matches, we get the lines which match those items printed to the screen and to our outfile. If we select remove, we get the lines which do not match those items printed to the screen and to our file. If we select neither KEEP nor REMOVE, then we get an error message and our new empty outfile is deleted since it contains nothing.
#!/usr/bin/env perl
# Program: perl_matching.pl
use strict; # Means that we have to explicitly declare our variables with "my", "our" or "local" as we want their scope defined.
use warnings; # We want to know if and if where errors are showing up in our program.
use feature 'say'; # Like print, but with automatic ending newline.
use feature 'switch'; # Perl given:when switch statement.
no warnings 'experimental'; # Perl has something against switch.
########### This block of code right here is basically equivalent to a unix ls command ##############
opendir(DIR, "."); # Opens the current working directory
my @files = readdir(DIR); # Reads all files in the current working directory into an array @files.
closedir(DIR); # Now that we have the array of files, we can close our current working directory.
say "Here are the list of files in your current working directory";
foreach(@files){print "$_\t";} # $_ is the default variable for each item in an array.
########### It is not critical to run the program ####################
say "\nGive me your filename to read from, extensions and all ..."; # It would be a good idea to have your filename in yoru working directory.
chomp(my $file_read = <STDIN>); # This makes the filename dynamic from user input.
say "Give me your filename to write to, extensions and all ...";
chomp(my $file_write = <STDIN>); # results will be printed to this file, and standard output. # chomp removes newlines from standard input.
# ' < ' to read from, and '>', to write to ...
# Opening your file to read from:
open(my $filehandle_read, '<', $file_read) or die "Problem reading file $file_read because $!";
# Open your file to write to.
open(my $filehandle_write, '>', $file_write) or die "Problem writing file $file_write because $!";
say "How many matches are you going to give me?";
my $match_num = <STDIN>;
say "Okay give me the matches now, pressing Enter key between each match.";
my $i = 1; # This is our incrementer between matches.
my $matches; # This is each match presented line by line.
my @match_list; # This is our array (list) of $matches
while($i <= $match_num)
{
$matches = <STDIN>; # One match at a time from standard input.
push @match_list, $matches; # Pushes all individual $matches into a list @match_list
$i = $i + 1; # Increase the incrementor by one so this loop doesn't last forever.
}
chomp(@match_list);
undef($matches); # I am clearing each match, so that I can redefine this variable.
$matches = join('|', @match_list); # "|" is part of a regular expression which means "or" for each item in this scalar $matches.
say "This is what your redefined matches variable looks like: $matches";
say "Now you get a choice for your matches";
say "KEEP or REMOVE?"; # if you type Keep (case insensitive) you print only the matches to the new file. If you type Remove (case insensitive) you print only the lines to the newfile which do not contain the matches.
chomp(my $choice = <STDIN>);
my @lines_all = <$filehandle_read>; # The filehandle contains everything in the file, so we can pull all lines of the file to read into an array, where each item in the array is each line of the file opened for reading.
close $filehandle_read; # we can now close the filehandle for the file for reading since we just pulled all the information from it.
# We grep for the matching " =~ " lines of our file to read.
my @lines_matching = grep{$_ =~ m/$matches/} @lines_all;
# We grep for the non-matching " !~ " lines of our file to read.
# Note: $_ is a default variable for every item in the array.
my @lines_not_matching = grep{$_ !~ m/$matches/} @lines_all;
# This is a Perl style switch statement.
# Note: A given::when::when::default switch statement.
# is basically equivalent to ...
# while::if::elsif::else statement.
# In this switch statement only one choice is performed,
# which one depends on if you said "Keep" or "Remove" in your choice.
given($choice)
{
when($choice =~ m/Keep/i) # "i" is for case-insensitive, so Keep, KEEP, kEeP, etc are valid.
{
say @lines_matching; # Print the matching lines to the screen.
print $filehandle_write @lines_matching; # Print the matching lines to the file.
close $filehandle_write; # Close the file now that we are done with it.
}
when($choice =~ m/Remove/i)
{
say @lines_not_matching; # Print the lines that do not match to the screen.
print $filehandle_write @lines_not_matching; # Print the lines that do not match to the file.
close $filehandle_write; # Close the file now that we are done with it.
}
default
{
say "You must have selected a choice other than Keep or Remove. Don't do that!";
close $filehandle_write; # Close the file now that we are done with it.
unlink($file_write) or warn "Could not unlink file $file_write"; # If you selected neither keep nor remove, we delete the new file to write to as it contains nothing.
}
}
Here is the script in action:
I ask to Remove the lines which contain versicolor and setosa, so only the lines which contain virginica will be printed to the screen and to my outfile which I called new_iris.csv. Again, I asked for 2 items. Note: As in my program, you can type the words Keep or Remove in any case insensitive manner.
>perl perl_matching.pl
Here are the list of files in your current working directory
. .. iris_dataset.csv perl_matching.pl
Give me your filename to read from, extensions and all ...
iris_dataset.csv
Give me your filename to write to, extensions and all ...
new_iris.csv
How many matches are you going to give me?
2
Okay give me the matches now, pressing Enter key between each match.
setosa
versicolor
This is what your redefined matches variable looks like: setosa|versicolor
Now you get a choice for your matches
KEEP or REMOVE?
Remove
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
6.3,3.3,6,2.5,"virginica"
5.8,2.7,5.1,1.9,"virginica"
7.1,3,5.9,2.1,"virginica"
6.3,2.9,5.6,1.8,"virginica"
5.9,3,5.1,1.8,"virginica"
So only those lines which do not contain the words setosa or versicolor are printed to our file: new_iris.csv:
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
6.3,3.3,6,2.5,"virginica"
5.8,2.7,5.1,1.9,"virginica"
7.1,3,5.9,2.1,"virginica"
6.3,2.9,5.6,1.8,"virginica"
5.9,3,5.1,1.8,"virginica"
I completely enjoy playing with standard input in Perl.
You can use my script to only print the lines of the file which contain
setosa. (You only ask for 1 match.)

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
values.txt:
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've made as many attempts at using awk/sed and so on as I could think of, but have failed to produce a command that performs the desired action.
Any guidance is appreciated. Thank you!
Read each line from values.txt
Split that line into 2 words
Use sed for each line to replace the 1st word with the 2nd word in all files in the myFiles/ directory
Note: I've used bash parameter expansion to split the line (${line% *} etc.), assuming values.txt is a space-separated 2-column file. If that's not the case, you may use awk or cut to split the line.
while read -r line;do
    sed -i "s/${line% *}/${line#* }/g" myFiles/* # '-i' edits files in place and 'g' replaces all occurrences; ${line% *} is the find word, ${line#* } the replacement
done < values.txt
You can do what you want with awk.
#! /usr/bin/awk -f
# snarf in first file, values.txt
FNR == NR {
    subs[$1] = $2
    next
}
# apply replacements to subsequent files
{
    for( old in subs ) {
        while( start = index($0, old) ) {   # find the next occurrence of old in the line
            len = length(old)
            # splice the replacement in; assumes a replacement never re-contains its search string
            $0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
        }
    }
    print
}
When you invoke it, put values.txt as the first file to be processed.
Option One:
create a python script
with open('filename', 'r') as infile, etc., read the values.txt file into a python dict with 'from' as key and 'to' as value; close the infile.
use os.listdir or glob to iterate over the files in the wanted directory; for each one, either popen "sed 's/from/to/g'" or read in the file, iterating over all the lines, doing the find/replace on each line.
Option Two:
bash script
read in a from/to pair
invoke
perl -p -i -e 's/from/to/g' dirname/*.txt
done
The second is probably easier to write, but offers less exception handling.
It's called 'Perl PIE' and it's a relatively famous hack for doing find/replace in lots of files at once.
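To sketch that option fully in Perl (a hypothetical standalone version, assuming values.txt is space-separated and the targets live in myFiles/ as in the question), you can load the pairs into a hash and let $^I do the in-place editing:
#!/usr/bin/perl
use strict;
use warnings;

# build the find => replace map from values.txt
my %map;
open my $vals, '<', 'values.txt' or die "Can't read values.txt: $!";
while ( my $pair = <$vals> ) {
    my ( $from, $to ) = split ' ', $pair;   # split on whitespace; also drops the newline
    $map{$from} = $to if defined $to;
}
close $vals;

$^I   = '';                  # edit files in place with no backup, like perl -i
@ARGV = glob 'myFiles/*';    # the files the <> loop will read
while ( my $line = <> ) {
    for my $from ( keys %map ) {
        $line =~ s/\Q$from\E/$map{$from}/g;   # \Q...\E: treat the find string literally
    }
    print $line;             # with $^I set, this is written back into the current file
}
Like the sed loop above, this applies every pair to every file; unlike sed, the \Q...\E quoting keeps dots, stars, and slashes in the find column from being interpreted as regex syntax.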

use grep and awk to transfer data from .srt to .csv/xls

I got an interesting project to do! I'm thinking about converting an srt file into a csv/xls file.
An srt file would look like this:
1
00:00:00,104 --> 00:00:02,669
Hi, I'm shell-scripting.
2
00:00:02,982 --> 00:00:04,965
I'm not sure if it would work,
but I'll try it!
3
00:00:05,085 --> 00:00:07,321
There must be a way to do it!
while I want to output it into a csv file like this:
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
So as you can see, a subtitle can take up more than one row. My thinking would be to use grep to put the srt data into the xls, and then use awk to format the xls file.
What do you guys think? How am I supposed to write it? I tried
$grep filename.srt > filename.xls
It seems that all the data including the time codes and the subtitle words ended up all in column A of the xls file...but I want the words to be in column B...How would awk be able to help with the formatting?
Thank you in advance! :)
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS=","; q="\""; s=q OFS q }
{
split($2,a,/ .* /)
print q $1 s a[1] s a[2] s $3 q
for (i=4;i<=NF;i++) {
print "", "", "", q $i q
}
}
$ awk -f tst.awk file
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work,"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
I think something like this should do it quite nicely:
awk -v RS= -F'\n' '
{
    sub(" --> ","\x7c",$2)                 # change "-->" to "|"
    printf "%s|%s|%s\n",$1,$2,$3           # print scene, time start, time stop, description
    for(i=4;i<=NF;i++)printf "|||%s\n",$i  # print remaining lines of description
}' file.srt
The -v RS= sets the Record Separator to blank lines. The -F'\n' sets the Field Separator to new lines.
The sub() replaces the "-->" with a pipe symbol (|).
The first three fields are then printed separated by pipes, and then there is a little loop to print the remaining lines of description, inset by three pipe symbols to make them line up.
Output
1|00:00:00,104|00:00:02,669|Hi, I'm shell-scripting.
2|00:00:02,982|00:00:04,965|I'm not sure if it would work,
|||but I'll try it!
3|00:00:05,085|00:00:07,321|There must be a way to do it!
As I am feeling like having some more fun with Perl and Excel, I took the above output and parsed it in Perl to write a real Excel XLSX file. Of course, there is no real need to use both awk and Perl, so ideally one would re-cast the awk and integrate it into the Perl, since the latter can write Excel files while the former cannot. Anyway, here is the Perl.
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $DEBUG=0;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $row=0;
while(my $line=<>){
    $row++;                              # move down a line in Excel worksheet
    chomp $line;                         # strip trailing newline
    my @f=split /\|/, $line;             # split fields of line into array @f[], on pipe symbols (|)
    for(my $j=0;$j<scalar @f;$j++){      # loop through all fields
        my $cell= chr(65+$j) . $row;     # calculate Excel cell, starting at A1 (65="A")
        $worksheet->write($cell,$f[$j]); # write to spreadsheet
        printf "%s:%s ",$cell,$f[$j] if $DEBUG;
    }
    printf "\n" if $DEBUG;
}
$workbook->close;
Output: the script writes the spreadsheet result.xlsx (screenshot omitted).
My other answer was half awk and half Perl, but, given that awk can't write Excel spreadsheets whereas Perl can, it seems daft to require you to master both awk and Perl when Perl is perfectly capable of doing it all on its own... so here goes in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $ExcelRow=0;
local $/ = ""; # set paragraph mode, so we read till next blank line as one record
while(my $para=<>){
    $ExcelRow++;                               # move down a line in Excel worksheet
    chomp $para;                               # strip trailing newline
    my @lines=split /\n/, $para;               # split paragraph into lines on linefeed character
    my $scene = $lines[0];                     # pick up scene number from first line of para
    my ($start,$end)=split / --> /,$lines[1];  # pick up start and end time from second line
    my $cell=sprintf("A%d",$ExcelRow);         # work out cell
    $worksheet->write($cell,$scene);           # write scene to spreadsheet column A
    $cell=sprintf("B%d",$ExcelRow);            # work out cell
    $worksheet->write($cell,$start);           # write start time to spreadsheet column B
    $cell=sprintf("C%d",$ExcelRow);            # work out cell
    $worksheet->write($cell,$end);             # write end time to spreadsheet column C
    $cell=sprintf("D%d",$ExcelRow);            # work out cell
    $worksheet->write($cell,$lines[2]);        # write description to spreadsheet column D
    for(my $i=3;$i<scalar @lines;$i++){        # output additional lines of description
        $ExcelRow++;
        $cell=sprintf("D%d",$ExcelRow);        # work out cell
        $worksheet->write($cell,$lines[$i]);
    }
}
$workbook->close;
Save the above on a file called srt2xls and then make it executable with the command:
chmod +x srt2xls
Then you can run it with
./srt2xls < SomeFile.srt
and it will give you this spreadsheet called result.xlsx
Since you want to convert the srt into csv, below is an awk command
awk '{gsub(" --> ","\x22,\x22");if(NF!=0){if(j<3)k=k"\x22"$0"\x22,";else{k="\x22"$0"\x22 ";l=1}j=j+1}else j=0;if(j==3){print k;k=""}if(l==1){print ",,,"k ;l=0;k=""}}' inputfile > output.csv
Detailed view of the awk:
awk '{
    gsub(" --> ","\x22,\x22");
    if(NF!=0)
    {
        if(j<3)
            k=k"\x22"$0"\x22,";
        else
        {
            k="\x22"$0"\x22 ";
            l=1
        }
        j=j+1
    }
    else
        j=0;
    if(j==3)
    {
        print k;
        k=""
    }
    if(l==1)
    {
        print ",,,"k;
        l=0;
        k=""
    }
}' inputfile > output.csv
Take the output.csv to a Windows platform, open it with Microsoft Excel, and save it with an .xls extension.

Taking multiple headers (rows matching a condition) and converting them into a column

Hello, I have a file with multiple headers in it that I need to have turned into column values. The file looks like this:
Day1
1,Smith,London
2,Bruce,Seattle
5,Will,Dallas
Day2
1,Mike,Frisco
4,James,LA
I would like the file to end up looking like this:
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
The file doesn't have sequential numbers before the names and it doesn't have the same quantity of records after the "Day" Header.
Does anyone have any ideas on how to accomplish this using the command-line?
In awk
awk -F, 'NF==1{a=$0;next}{print a","$0}' file
Checks if the number of fields is 1; if it is, it sets a variable to that line and, via next, skips the second block.
For each line that doesn't have exactly 1 field, it prints the saved variable and the line.
And in sed
sed -n '/,/!{h};/,/{x;G;s/\n/,/;p;s/,.*//;x}' file
Broken down for MrBones' wild ride.
sed -n '
/,/!{h}; // If the line does not contain a comma overwrite buffer with line
/,/{ // If the line contains a comma, do everything inside the brackets
x; // Exchange the line for the held in buffer
G; // Append buffer to line
s/\n/,/; // Replace the newline with a comma
p; // Print the line
s/,.*//; // Remove everything after the first comma
x // exchange line for hold buffer to put title back in buffer for the next line.
}' file // The file you are using
In essence it saves the lines without a comma, i.e. the headers. Then, if it's not a header, it switches the current line with the saved header and appends the now-switched line to the end of the header. As it is appended with a newline, the next statement replaces that with a comma. Then the line is printed. Next, to recover the header, everything after it is removed and it is swapped back into the hold buffer, ready for the next line.
sed '/^Day/ {h;d;}
G;s/\(.*\)\n\(.*\)/\2,\1/
' YourFile
POSIX compliant
Prints nothing if there is not at least 1 data line after a Day
Blank lines are treated as data
awk '{if ( $0 ~ /^Day/ ) Head = $0; else print Head "," $0}' YourFile
Uses a line starting with Day as the header, and prefixes its content to each following line.
Perl solution:
#! /usr/bin/perl
use warnings;
use strict;
my $header;
while (<>) { # Read line by line.
if (/,/) { # If the line contains a comma,
print "$header,$_"; # prepend the header.
} else {
chomp; # Remove the newline.
$header = $_; # Remember the header.
}
}
Another sed version
sed -n '/Day[0-9]\+/{h;b end};{G;s/\(.*\)\n\(.*\)/\2,\1/;p;:end}'
Perl
$ perl -F, -wlane 'if(@F == 1){$s=$F[0]; next} print "$s,$_"' file
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
This Perl one-line program will do as you ask: tr/,// counts the commas on a line without changing it, so data lines are printed with the saved header $c prepended, while header lines are stored into $c with their trailing whitespace replaced by a comma. It requires Perl v5.14 or better (for the /r substitution modifier)
perl -ne'tr/,// ? print $c,$_ : ($c = s/\s*\z/,/r)' myfile.txt
for earlier versions of perl, use
perl -ne'tr/,// ? print $c,$_ : ($c = $_) =~ s/\s*\z/,/' myfile.txt
output
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
Another Perl example, this time using $/ to separate each record.
use strict;
use warnings;
local $/ = "Day";
while (<>) {
    next unless my ($num) = m/^(\d+)/;
    for ( split /\n/ ) {
        print "Day${num},$_\n" if m/,/;
    }
}
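A usage sketch, assuming you save the script as (say) day.pl: run perl day.pl file, and the header-prefixed lines are printed to standard output, since the script reads whatever files are named on the command line through the <> operator.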
