Remove multiple lines where string occurs and concatenate - bash

I'm new to Bash/Perl and trying to remove multiple lines in a text file where a string occurs. To remove a single line so far I have:
perl -ne '/somestring/ or print' /usr/file.txt > /usr/file1.tmp
To remove a second line I use:
perl -ne '/anotherstring/ or print' /usr/file.txt > /usr/file2.tmp
How can I concatenate file and file2.tmp?
Or how can I modify the command to remove multiple lines where somestring and anotherstring occur?

How can I concatenate file and file2.tmp?
That could be done with
cat file file2.tmp >> file3.tmp
Or if by file you mean file1.tmp,
cat file1.tmp file2.tmp >> file3.tmp
However, that is different from what you're describing in the rest of your question (i.e. removing any line where any of two patterns appears). That could be done by chaining your commands:
perl -ne '/somestring/ or print' /usr/file.txt > /usr/file1.tmp
perl -ne '/anotherstring/ or print' /usr/file1.tmp > /usr/file2.tmp
You can use a pipe to get rid of the intermediate file file1.tmp:
perl -ne '/somestring/ or print' /usr/file.txt | perl -ne '/anotherstring/ or print' > /usr/file2.tmp
This can also be done by using grep (assuming your strings don't make use of any Perl-specific regex features):
grep -v somestring /usr/file.txt | grep -v anotherstring > /usr/file2.tmp
Finally, you can combine the filtering into one command/regex:
perl -ne '/somestring|anotherstring/ or print' /usr/file.txt > /usr/file2.tmp
Or using grep:
grep -v 'somestring\|anotherstring' /usr/file.txt > /usr/file2.tmp
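As a rough sketch of the single-command approach, the filter can be wrapped in a small shell function. The name drop_matching is made up for illustration. It assumes the strings should be matched literally (-F) and passes one -e option per string, which avoids relying on the `\|` alternation syntax (a GNU extension in basic regular expressions).

```shell
#!/bin/sh
# drop_matching FILE STRING...   (hypothetical helper name)
# Prints every line of FILE that contains none of the given fixed strings.
drop_matching() {
    [ "$#" -ge 2 ] || { echo "usage: drop_matching FILE STRING..." >&2; return 2; }
    file=$1
    shift
    # Turn "s1 s2 ..." into "-e s1 -e s2 ...": append a pair, drop the original.
    for pat in "$@"; do
        set -- "$@" -e "$pat"
        shift
    done
    grep -v -F "$@" -- "$file"
}
```

It could then be called as drop_matching /usr/file.txt somestring anotherstring > /usr/file2.tmp.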

I had some fun with your problem and wrote a more dynamic Perl program. It reads any user-specified file, selects the lines that match (or do not match) a set of words, and writes those lines both to the screen and to a new user-specified outfile.
We will be parsing this file: iris_dataset.csv:
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.8,3,1.4,0.3,"setosa"
5.1,3.8,1.6,0.2,"setosa"
4.6,3.2,1.4,0.2,"setosa"
7,3.2,4.7,1.4,"versicolor"
6.4,3.2,4.5,1.5,"versicolor"
6.9,3.1,4.9,1.5,"versicolor"
6.6,3,4.4,1.4,"versicolor"
5.5,2.4,3.7,1,"versicolor"
6.3,3.3,6,2.5,"virginica"
5.8,2.7,5.1,1.9,"virginica"
7.1,3,5.9,2.1,"virginica"
6.3,2.9,5.6,1.8,"virginica"
5.9,3,5.1,1.8,"virginica"
It's a comma-separated values (CSV) file.
You could see each column more clearly by opening this file in a spreadsheet. We will be matching on the Species column, so the possible items to match are "setosa", "versicolor", and "virginica".
My program first asks for the file that you want to read from.
In this case, it's iris_dataset.csv, though it could be any file. Then you give the name of a file to write to. I call it new_iris.csv, but you can call it anything.
Then we tell the program how many items we are looking for. With 3 items I can type setosa, versicolor and virginica in any order; with two I type only two of them; and with one I type only setosa or versicolor or virginica in this example file.
Then we are asked if we want to KEEP the lines which match our items,
or if we want to REMOVE the lines of the file which match our items. If we keep the matches, the lines which match those items are printed to the screen and to our outfile. If we select remove, the lines which do not match those items are printed to the screen and to our outfile. If we select neither KEEP nor REMOVE, we get an error message and the new, empty outfile is deleted since it contains nothing.
#!/usr/bin/env perl
# Program: perl_matching.pl
use strict; # Means that we have to explicitly declare our variables with "my", "our" or "local" as we want their scope defined.
use warnings; # We want to know if and where problems are showing up in our program.
use feature 'say'; # Like print, but with automatic ending newline.
use feature 'switch'; # Perl given:when switch statement.
no warnings 'experimental'; # Perl has something against switch.
########### This block of code right here is basically equivalent to a Unix ls command ##############
opendir(DIR, "."); # Opens the current working directory
my @files = readdir(DIR); # Reads all files in the current working directory into an array @files.
closedir(DIR); # Now that we have the array of files, we can close our current working directory.
say "Here are the list of files in your current working directory";
foreach(@files){print "$_\t";} # $_ is the default variable for each item in an array.
########### It is not critical to run the program ####################
say "\nGive me your filename to read from, extensions and all ..."; # It would be a good idea to have your file in your working directory.
chomp(my $file_read = <STDIN>); # This makes the filename dynamic from user input.
say "Give me your filename to write to, extensions and all ...";
chomp(my $file_write = <STDIN>); # results will be printed to this file, and standard output. # chomp removes newlines from standard input.
# ' < ' to read from, and '>', to write to ...
# Opening your file to read from:
open(my $filehandle_read, '<', $file_read) or die "Problem reading file $file_read because $!";
# Open your file to write to.
open(my $filehandle_write, '>', $file_write) or die "Problem writing file $file_write because $!";
say "How many matches are you going to give me?";
chomp(my $match_num = <STDIN>); # chomp removes the trailing newline so the count is a clean number.
say "Okay give me the matches now, pressing Enter key between each match.";
my $i = 1; # This is our incrementer between matches.
my $matches; # This is each match presented line by line.
my @match_list; # This is our array (list) of $matches
while($i <= $match_num)
{
$matches = <STDIN>; # One match at a time from standard input.
push @match_list, $matches; # Pushes all individual $matches into a list @match_list
$i = $i + 1; # Increase the incrementer by one so this loop doesn't last forever.
}
chomp(@match_list);
undef($matches); # I am clearing each match, so that I can redefine this variable.
$matches = join('|', @match_list); # " | " is part of a regular expression which means "or" for each item in this scalar matches.
say "This is what your redefined matches variable looks like: $matches";
say "Now you get a choice for your matches";
say "KEEP or REMOVE?"; # if you type Keep (case insensitive) you print only the matches to the new file. If you type Remove (case insensitive) you print only the lines to the newfile which do not contain the matches.
chomp(my $choice = <STDIN>);
my @lines_all = <$filehandle_read>; # The filehandle contains everything in the file, so we can pull all lines of the file to read into an array, where each item in the array is each line of the file opened for reading.
close $filehandle_read; # we can now close the filehandle for the file for reading since we just pulled all the information from it.
# We grep for the matching " =~ " lines of our file to read.
my @lines_matching = grep{$_ =~ m/$matches/} @lines_all;
# We grep for the non-matching " !~ " lines of our file to read.
# Note: $_ is a default variable for every item in the array.
my @lines_not_matching = grep{$_ !~ m/$matches/} @lines_all;
# This is a Perl style switch statement.
# Note: A given::when::when::default switch statement.
# is basically equivalent to ...
# while::if::elsif::else statement.
# In this switch statement only one choice is performed,
# which one depends on if you said "Keep" or "Remove" in your choice.
given($choice)
{
when($choice =~ m/Keep/i) # "i" is for case-insensitive, so Keep, KEEP, kEeP, etc are valid.
{
say @lines_matching; # Print the matching lines to the screen.
print $filehandle_write @lines_matching; # Print the matching lines to the file.
close $filehandle_write; # Close the file now that we are done with it.
}
when($choice =~ m/Remove/i)
{
say @lines_not_matching; # Print the lines that do not match to the screen.
print $filehandle_write @lines_not_matching; # Print the lines that do not match to the file.
close $filehandle_write; # Close the file now that we are done with it.
}
default
{
say "You must have selected a choice other than Keep or Remove. Don't do that!";
close $filehandle_write; # Close the file now that we are done with it.
unlink($file_write) or warn "Could not unlink file $file_write"; # If you selected neither keep nor remove, we delete the new file to write to as it contains nothing.
}
}
Here is the script in action:
I ask to Remove the lines which contain versicolor and setosa, so only the lines which contain virginica will be printed to the screen and to my outfile which I called new_iris.csv. Again, I asked for 2 items. Note: As in my program, you can type the words Keep or Remove in any case insensitive manner.
>perl perl_matching.pl
Here are the list of files in your current working directory
. .. iris_dataset.csv perl_matching.pl
Give me your filename to read from, extensions and all ...
iris_dataset.csv
Give me your filename to write to, extensions and all ...
new_iris.csv
How many matches are you going to give me?
2
Okay give me the matches now, pressing Enter key between each match.
setosa
versicolor
This is what your redefined matches variable looks like: setosa|versicolor
Now you get a choice for your matches
KEEP or REMOVE?
Remove
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
6.3,3.3,6,2.5,"virginica"
5.8,2.7,5.1,1.9,"virginica"
7.1,3,5.9,2.1,"virginica"
6.3,2.9,5.6,1.8,"virginica"
5.9,3,5.1,1.8,"virginica"
So only those lines which do not contain the words setosa and versicolor are printed to our file: new_iris.csv:
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
6.3,3.3,6,2.5,"virginica"
5.8,2.7,5.1,1.9,"virginica"
7.1,3,5.9,2.1,"virginica"
6.3,2.9,5.6,1.8,"virginica"
5.9,3,5.1,1.8,"virginica"
I completely enjoy playing with standard input in Perl.
You can use my script to only print the lines of the file which contain
setosa. (You only ask for 1 match.)
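For comparison, the same KEEP/REMOVE choice can be sketched in a few lines of shell around grep. This is only an illustration, not a replacement for the interactive script: the function name filter_lines is made up, and the pattern is assumed to be an extended regular expression such as setosa|versicolor.

```shell
#!/bin/sh
# filter_lines MODE REGEX FILE   (hypothetical helper name)
# MODE "keep" prints the lines matching REGEX, MODE "remove" prints the rest.
filter_lines() {
    case $1 in
        [Kk][Ee][Ee][Pp])
            grep -E -e "$2" -- "$3" ;;
        [Rr][Ee][Mm][Oo][Vv][Ee])
            grep -v -E -e "$2" -- "$3" ;;
        *)
            echo "mode must be keep or remove" >&2
            return 2 ;;
    esac
}
```

With the iris file above, filter_lines remove 'setosa|versicolor' iris_dataset.csv > new_iris.csv would leave the header and the virginica rows.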

Related

Grep list (file) from another file

Im new to bash and trying to extract a list of patterns from file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Suggested output is smth like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
the code i tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But output is only filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each line belongs to and have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" base.csv
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >> ("output." a[j])
next }
}' File1.txt base.csv
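The anchoring concern mentioned above can be sketched with grep alone, assuming a grep that supports -w (GNU and BSD grep do; it is not required by POSIX). The function name group_by_pattern is made up for illustration. Each pattern is treated as a fixed string (-F) and must match as a whole word, so ABC no longer matches inside 123DEABCXYZ.

```shell
#!/bin/sh
# group_by_pattern PATTERN_FILE DATA_FILE   (hypothetical helper name)
# Prints each pattern as a header, followed by the lines containing it as a whole word.
group_by_pattern() {
    while IFS= read -r pat; do
        [ -n "$pat" ] || continue              # skip blank pattern lines
        printf '%s\n' "$pat"                   # header: the pattern itself
        # grep exits with status 1 when nothing matches; that is not an error here.
        grep -F -w -e "$pat" -- "$2" || :
    done < "$1"
}
```

For the files in the question this would be called as group_by_pattern File1.txt base.csv > output.txt.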
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
grep -e "${q}" base.csv >>"output.${i}" # -e: $q is a pattern, not a file of patterns
echo >>"output.${i}" # blank line between groups
done < "${i}"
done
Here is one that splits the words from file2 (comma-separated, with quotes and spaces stripped off) into an array (word[]) and stores the record names (line 1 etc.) in it, comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
Tried both variants above but kept getting various errors ("do" expected) or misbehavior (it gets the names of pattern blocks, e.g. ABC, BDF, but no lines).
Gave up for a while and then eventually tried another way.
Since the base goal was to cycle through pattern list files, search for the patterns in a huge file and write out specific columns from the lines found, I simply wrote
for i in *txt # cycle through files w/ patterns
do
grep -F -f "$i" bigfile.csv >> "${i}.out1" # greps all patterns from current file
cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2" # cuts columns of interest and writes them out to another file
done
I'm aware that this code could be improved using some fancy pipeline features, but it works perfectly as is; hope it'll help somebody in a similar situation. You can easily add some echoes to write out pattern list names as I initially requested.

Can I find similar named files ignoring case, dashes, spaces or other characters?

EDIT 2:
lets say I have 2 directories one contains:
/dir1/Test File Name.txt
/dir1/This is anotherfile.txt
/dir1/And-Another File.txt
Directory 2 looks like:
/dir2/test-File_Name.txt
/dir2/test file_Name.txt
/dir2/This Is another file.txt
/dir2/And another_file.txt
How can I find (or match) files that are named similar, in this example file 1 from dir1 would match with file 1 and 2 on dir2 and so on
Trying to do this in bash. Say I have a file named "Test File 1.txt" I want to find any file that is named similar like:
test-file 1.txt
test file 1.txt
Test-file-1.txt
test-file_1.zip
etc etc
I can ignore case with find ./files/ -maxdepth 1 -iname $FILE but don't know how to ignore all the other characters.
Is there a way I can do this in bash?
EDIT:
Sorry, I forgot to mention that I need to iterate on all files, the file name is not always the same, I just used an example.
so it could be named "Test File 1.txt" or it could also be named something completely different "Something Else.txt"
So I want to look for all similar named files using a complete file name as base, but this file name can be different, hope I make more sense.
If Perl is your option, please try the following:
perl -e '
@files1 = glob "dir1/*";
@files2 = glob "dir2/*";
foreach (@files2) {
$f2 = $_;
s#.*/##; # remove directory name
# s#\..*?$##; # remove extension (wrong)
s#\.[^.]*$##; # remove extension (corrected)
s#[\W_]#[\\W_]?#g; # replace non-alphanumeric chars
$pat = $_ . "\\.\\w+\$";
# print $pat, "\n"; # uncomment to see the regex pattern
foreach $f1 (@files1) {
if ($f1 =~ m#/$pat#i) {
print "$f1 <=> $f2\n";
}
}
}'
Output:
dir1/And-Another File.txt <=> dir2/And another_file.txt
dir1/Test File Name.txt <=> dir2/test file_Name.txt
dir1/Test File Name.txt <=> dir2/test-File_Name.txt
dir1/This is anotherfile.txt <=> dir2/This Is another file.txt
[Explanations]
The concept is to generate a regex pattern on the fly from a filename
in one directory and match it with the files in the other directory.
File extension is replaced with a pattern which matches it.
Non-alphanumeric character and underscore are replaced with a pattern
which matches them including the case the character is missing so that
anotherfile and another file match.
i option added to the pattern enables case-insensitive match.
You can see the generated regex by uncommenting the noted line.
The possible problem is we can not generate a pattern which matches with
another file from the filename anotherfile. In other words, the
matching is one-directional. A possible workaround is to neglect non-alphanumeric characters and underscores at all in matching. It may result in unexpected overmatching depending on the word and punctuation. We will need to specifically define the similarity to step further.
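The workaround mentioned above (ignoring non-alphanumeric characters and underscores entirely) can be sketched in plain shell. This is one possible definition of "similar", not the method used in the Perl code: the function names normalize_name and match_dirs are made up, the extension is ignored, and ASCII filenames are assumed. As noted, it can overmatch.

```shell
#!/bin/sh
# normalize_name PATH   (hypothetical helper name)
# Strips the directory and the last extension, then drops every
# non-alphanumeric character and lowercases what is left.
normalize_name() {
    base=${1##*/}      # remove directory part
    base=${base%.*}    # remove last extension, if any
    printf '%s' "$base" | tr -cd '[:alnum:]' | tr '[:upper:]' '[:lower:]'
}

# match_dirs DIR1 DIR2   (hypothetical helper name)
# Prints every pair of files whose normalized names are identical.
match_dirs() {
    for f1 in "$1"/*; do
        [ -e "$f1" ] || continue               # directory empty or missing
        for f2 in "$2"/*; do
            [ -e "$f2" ] || continue
            if [ "$(normalize_name "$f1")" = "$(normalize_name "$f2")" ]; then
                printf '%s <=> %s\n' "$f1" "$f2"
            fi
        done
    done
}
```

Called as match_dirs dir1 dir2, this compares names in both directions at once, so anotherfile and another file are treated as the same name.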
[Edit]
In order to get the result back to bash variables, please try:
while read -r -d "" line; do
# do something with the bash variable "line"
echo "$line"
done < <(
perl -e '
@files1 = glob "dir1/*";
@files2 = glob "dir2/*";
foreach (@files2) {
$f2 = $_;
s#.*/##; # remove directory name
# s#\..*?$##; # remove extension (wrong)
s#\.[^.]*$##; # remove extension (corrected)
s#[\W_]#[\\W_]?#g; # replace non-alphanumeric chars
$pat = $_ . "\\.\\w+\$";
# print $pat, "\n"; # uncomment to see the regex pattern
foreach $f1 (@files1) {
if ($f1 =~ m#/$pat#i) {
push(@result, "$f1 <=> $f2");
# if you want just the list of filenames, comment out the line above
# and uncomment the line below
#push(@result, $f1, $f2);
}
}
}
print join("\0", @result) . "\0";
')
The results are stored in the bash variable line, one result per loop iteration.
If you want to tweak the output format, please modify the line push(@result, ...).
[EDIT]
Modified to work with the following filename pairs:
"Sample Filename.txt" <=> "Sample Filename (100).txt"
"Sample.Filename.txt" <=> "Sample Filename.txt"
Here's the updated code:
while read -r -d "" line; do
# do something with the bash variable "line"
echo "$line"
done < <(
perl -e '
@files1 = glob "dir1/*";
@files2 = glob "dir2/*";
foreach (@files2) {
$f2 = $_;
s#.*/##; # remove directory name
s#\.[^.]*$##; # remove extension
s#\s*\(.*?\)##; # remove parenthesis if any
s#\s*\[.*?\]##; # remove square bracket if any
s#[\W_]#[\\W_]?#g; # replace non-alphanumeric chars
$pat = $_ . "\\s?((\\(.*?\\))|(\\[.*?\\]))?" . "\\.\\w+\$";
#print $pat . "\n"; # uncomment to see the regex pattern
foreach $f1 (@files1) {
if ($f1 =~ m#/$pat#i) {
push(@result, "$f1 <=> $f2");
# if you want just the list of filenames, comment out the line above
# and uncomment the line below
#push(@result, $f1, $f2);
}
}
}
print join("\0", @result) . "\0";
')

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
values.txt:
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've taken as many attempts at using awk/sed and so on as I can think logical, but have failed to produce a command that performs the action desired.
Any guidance is appreciated. Thank you!
Read each line from values.txt
Split that line in 2 words
Use sed for each line to replace the 1st word with the 2nd word in all files in the myFiles/ directory
Note: I've used bash parameter expansion to split the line (${line% *} etc) , assuming values.txt is space separated 2 columnar file. If it's not the case, you may use awk or cut to split the line.
while read -r line;do
sed -i "s/${line% *}/${line#* }/g" myFiles/* # '-i' edits files in place and 'g' replaces all occurrences of patterns
done < values.txt
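The same loop can be sketched without sed -i, whose syntax differs between GNU and BSD sed, by writing to a temporary file and moving it into place. The function name replace_all is made up for illustration, and the sketch assumes the values contain no characters that are special to sed (such as /, & or regex metacharacters). Here read splits each line into the find value and the replace value.

```shell
#!/bin/sh
# replace_all VALUES_FILE FILE...   (hypothetical helper name)
# VALUES_FILE holds one "find replace" pair per line, separated by whitespace.
replace_all() {
    values=$1
    shift
    while read -r old new; do
        [ -n "$old" ] || continue              # skip blank lines
        for f in "$@"; do
            # Write the edited copy next to the file, then replace the original.
            sed "s/$old/$new/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        done
    done < "$values"
}
```

It would be called as replace_all values.txt myFiles/*. Pairs are applied in file order, so a replacement value that equals a later find value will be replaced again.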
You can do what you want with awk.
#! /usr/bin/awk -f
# snarf in first file, values.txt
FNR == NR {
subs[$1] = $2
next
}
# apply replacements to subsequent files
{
for( old in subs ) {
while( index($0, old) ) { # note: loops forever if the replacement itself contains the search text
start = index($0, old)
len = length(old)
$0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
}
}
print
}
When you invoke it, put values.txt as the first file to be processed.
Option One:
create a python script
Using with open('filename', 'r') as infile, read the values.txt file into a Python dict with 'from' as key and 'to' as value; the with block closes the file for you.
Use os.listdir (or glob) to list the wanted directory and iterate over the files. For each one, either run sed 's/from/to/g' through subprocess, or read the file in, iterating over all the lines and doing the find/replace on each line.
Option Two:
bash script
read in a from/to pair
invoke
perl -p -i -e 's/from/to/g' dirname/*.txt
done
The second is probably easier to write but has less exception handling.
It's called 'Perl PIE' and it's a relatively famous hack for doing find/replace in lots of files at once.

Removing lines between tags in a text file

I have many text files containing annotations. The original text is marked with lines containing the words:
START OF TEXT OF PASSAGE 1
END OF TEXT OF PASSAGE 1
Obviously I can search each document for the phrase START OF TEXT and delete everything up to it. Then search for END OF TEXT and start selecting text for deletion until I get to the next START OF TEXT.
I have come up with this design so far:
#!/bin/bash
a="START OF PROJECT"
b="END OF PROJECT"
while read line; do
if line contains a; do
while read line; do
'if line does not contain b'
'append the line to output.txt'; fi
done
done
fi
done
Perhaps there is an easier way using sed, awk, grep and pipes?
'for every document' 'loop through it doing this' ('find the original text between START and END' | >> output.txt)
Unfortunately I am poor at bash and ignorant of sed/awk.
The reason for this is that I am assembling a huge text document that is a concatenation of thousands of marked up documents – each of which contains some annotated passages.
In Python:
import re
with open('in.txt') as f, open('out.txt', 'w') as output:
output.write('\n'.join(re.findall(r'START OF TEXT(.*?)END OF TEXT', f.read(), re.DOTALL)))  # re.DOTALL lets . match newlines, so multi-line passages are captured
This reads the input, searches for all matches that begin and end with the necessary markers, captures the text of interest in a group, joins all those groups on a linefeed, and writes that to the result file.
Pretty easy to do with awk. You would create a script (I'll call it yank.awk) containing this:
#!/usr/bin/awk -f
/START OF PROJECT/ { capture = 1; next }
/END OF PROJECT/ { capture = 0 }
capture == 1 { print }
and then run it like so:
awk -f yank.awk in.txt > output.txt
Could also do with sed and grep:
sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT' > output.txt
(Another Python solution)
You can have itertools.groupby group lines together based on a boolean value - just use a global flag to keep track of whether you are in a block or not, and then use groupby to group the lines that are in or out of blocks. Then just discard the ones that are not blocks:
sample_lines = """
lskdjflsdkjf
sldkjfsdlkjf
START OF TEXT
Asdlkfjlsdkfj
Bsldkjf
Clsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
START OF TEXT
Dsdlkfjlsdkfj
Esldkjf
Flsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
""".splitlines()
from itertools import groupby
in_block = False
def is_in_block(line):
global in_block
if line.startswith("END OF TEXT"):
in_block = False
ret = in_block
if line.startswith("START OF TEXT"):
in_block = True
return ret
for lines_are_text,lines in groupby(sample_lines, key=is_in_block):
if lines_are_text:
print(list(lines))
gives:
['Asdlkfjlsdkfj', 'Bsldkjf', 'Clsdkjf']
['Dsdlkfjlsdkfj', 'Esldkjf', 'Flsdkjf']
See that first group has the lines that start with A, B, and C, and the second group is made up of those lines starting with D, E, and F.
It sounds like the specific solution you need is:
awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
See https://stackoverflow.com/a/18409469/1745001 for other ways to select text from files.
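To show why the order of the three rules matters, here is the same flag technique spelled out as a small shell function. The name between_markers is made up, and anchoring the markers to the start of the line with ^ is an assumption about the input.

```shell
#!/bin/sh
# between_markers FILE...   (hypothetical helper name)
# Prints the lines between each START/END marker pair, without the markers.
between_markers() {
    awk '
        /^END OF TEXT/   { inside = 0 }   # tested first, so the END line is not printed
        inside           { print }        # print only while the flag is set
        /^START OF TEXT/ { inside = 1 }   # set last, so the START line is not printed
    ' "$@"
}
```

For many documents it could be called as between_markers doc1.txt doc2.txt > output.txt. If a file ends without its END marker, the flag carries over into the next file.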
Use Perl's Flip-Flop Operator to Print Text Between Markers
Given a corpus like:
START OF TEXT OF PASSAGE 1
foo
END OF TEXT OF PASSAGE 1
START OF TEXT OF PASSAGE 2
bar
END OF TEXT OF PASSAGE 2
you can use the Perl flip-flop operator to process within a range of lines. For example, from the shell prompt:
$ perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/corpus
foo
bar
Basically, this short Perl script loops through your input. When it finds your start and end tags, it throws away the tags themselves and prints everything else in between.
Usage Notes
The line breaks between passages in the corpus are for readability. It doesn't matter if your real corpus has no line breaks between passages, so long as the text markers always start at the beginning of the line as shown in your original post. If that assumption doesn't hold true, then you will need to adjust the regular expressions used to identify the start and end of your passages.
You can pass multiple files to the Perl script. Again, it makes no practical difference as long as you don't exceed the length limit of your shell.
If you want the final output to go to somewhere other than standard output, just use shell redirection. For example:
perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/file1 /tmp/file2 /tmp/file3 > /tmp/output
You can use sed as follows:
sed -n '/^START OF TEXT/,/^END OF TEXT/{/^\(START\|END\) OF TEXT/!p}' infile
or, with extended regular expressions (-r):
sed -rn '/^START OF TEXT/,/^END OF TEXT/{/^(START|END) OF TEXT/!p}' infile
-n prevents sed from printing as a default. The rest works as follows:
/^START OF TEXT/,/^END OF TEXT/ { # For lines between these two matches
/^\(START\|END\) OF TEXT/!p # If the line does NOT match, print it
}
This works with GNU sed and might require some tweaking to run with other seds.

Extracting the first two characters from a file in perl into another file

I'm having a little bit of trouble with my code below -- I'm trying to figure out how to open up all these text files (.csv files that end in DIS that all have one line in them) and get the first two characters (these are all numbers) from them and print them into another file of the same name, with a ".number" suffix. Some of these .DIS files don't have anything in them, in which case I want to print "0".
Lastly, I would like to go through each original .DIS file and delete the first 3 characters -- I did this through bash.
my @DIS = <*.DIS>;
foreach my $file (@DIS){
my $name = $file;
my $output = "$name.number";
open(INHANDLE, "< $file") || die("Could not open file");
while(<INHANDLE>){
open(OUT_FILE,">$output") || die;
my $line = $_;
chomp ($line);
my $string = $line;
if ($string eq ""){
print "0";
} else {
print substr($string,0,2);
}
}
system("sed -i 's/\(.\{3\}\)//' $file");
}
When I run this code, I get a list of numbers are concatenated together and empty .DIS.number files. I'm rather new to Perl, so any help would be appreciated!
When I run this code, I get a list of numbers are concatenated together and empty .DIS.number files.
This is because of this line.
print substr($string,0,2);
print defaults to printing to STDOUT (ie. the screen). You need to give it the filehandle to print to.
print OUT_FILE substr($string,0,2);
They're being concatenated because print just prints what you tell it to, it won't put newlines in for you (there are some global variables which can change this, don't mess with them). You have to add the newline yourself.
print OUT_FILE substr($string,0,2), "\n";
As a final note, when working with files in Perl I would suggest using lexical filehandles, Path::Tiny, and autodie. They will avoid a great number of classic problems working with files in Perl.
I suggest you do it like this
Each *.dis file is opened and the contents read into $text. Then a regex substitution is used to remove the first three characters from the string and capture the first two in $1
If the substitution succeeded then the contents of $1 are written to the number file, otherwise the original file is empty (or shorter than two characters) and a zero is written instead. The remaining contents of $text are then written back to the *.dis file
use strict;
use warnings;
use v5.10.1;
use autodie;
for my $dis_file ( glob '*.DIS' ) {
my $text = do {
open my $fh, '<', $dis_file;
<$fh>;
};
my $num_file = "$dis_file.number";
open my $dis_fh, '>', $dis_file;
open my $num_fh, '>', $num_file;
if ( defined $text and $text =~ s/^(..).?// ) {
print $num_fh "$1\n";
print $dis_fh $text;
}
else {
print $num_fh "0\n";
print $dis_fh "-\n";
}
}
This awk script extracts the first two characters of each file into its own file. Empty files are expected to have one empty line, based on the spec.
awk 'FNR==1{pre=substr($0,1,2);pre=length(pre)==2?pre:0; print pre > FILENAME".number"}' *.DIS
This will remove the first 3 chars
cut -c 4-
A Bash for loop is a better way to do both, for which we need to modify the awk script a little bit:
for f in *.DIS;
do awk 'NR==1{pre=substr($0,1,2);$0=length(pre)==2?pre:0; print}' $f > $f.number;
cut -c 4- $f > $f.cut;
done
Explanation: loop through all files matching *.DIS. For the first line of each file, try to get the first two characters (1,2) of the line ($0) and assign them to pre. If the length of pre is not two (the line is either empty or has only 1 character), set the line to 0; otherwise set it to pre. Then print the line; the output file name is the input file name with a .number suffix appended. The $0 assignment is a trick to save a couple of keystrokes, since print without arguments prints $0; otherwise you can provide the argument.
Ideally you should quote "$f" since it may contain space in file name...
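A quoted variant of that loop might look like the sketch below. It swaps awk for head and cut, which is my substitution rather than part of the answer above. The function name split_dis is made up, and it assumes each file holds at most one line, as described in the question.

```shell
#!/bin/sh
# split_dis DIR   (hypothetical helper name)
# For each DIR/*.DIS file, writes the first two characters (or 0) to FILE.number
# and everything after the first three characters to FILE.cut.
split_dis() {
    for f in "$1"/*.DIS; do
        [ -e "$f" ] || continue                   # no .DIS files at all
        prefix=$(head -n 1 "$f" | cut -c 1-2)     # first two characters of line 1
        [ "${#prefix}" -eq 2 ] || prefix=0        # empty or too short: write 0
        printf '%s\n' "$prefix" > "$f.number"
        cut -c 4- "$f" > "$f.cut"                 # drop the first three characters
    done
}
```

Running split_dis . processes the current directory. Every expansion of $f is quoted, so file names containing spaces are handled.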
