Delete n duplicate lines in a file - sorting

1. Briefly
I have a large text file (14 MB). I need to remove text blocks from it that consist of 5 duplicate lines.
It would be nice if this could be done with any free (gratis) method.
I use Windows, but Cygwin solutions would also be welcome.
2. Settings
1. File structure
I have a file test1.md. It consists of repeating blocks. Each block has 10 lines. Structure of the file (using PCRE regular expressions):
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
test1.md contains no lines or text other than these 10-line blocks. It has no blank lines and no blocks with more or fewer than 10 lines.
2. Example content of file
Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
As can be seen in the example, test1.md has repeated 7-line blocks. In the example, these blocks are:
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
and
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
3. Expected behavior
I need to remove all repeated blocks. In my example I need to get:
Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
If 7 lines duplicate 7 lines that already appeared earlier in my file, the duplicate 7 lines are removed.
If 1 line (or 2 to 4 lines) duplicates lines that already appeared earlier in my file, the duplicate is not removed. In the example, the words Sasha, Kazan, Chistopol and Katya are duplicated, but they are not removed.
4. Did not help
Googling
I found that the Unix commands sort, sed and awk can solve similar tasks, but I could not find out how to solve my task with these commands.
5. Do not offer
Please do not suggest removing each text block manually. I may have a few thousand different duplicate text blocks, and removing them all by hand would take a lot of time.

Here's a simple solution to your problem (if you have access to GNU sed, sort and uniq):
sed 's/^Millionaire/\x0&/' file | sort -z -k4 | uniq -z -f3 | tr -d '\000'
A little explanation is in order:
since all your blocks begin with the word/line Millionaire, we can use that to split the file in (variably long) blocks by prepending a NUL character to each Millionaire;
then we sort those NUL-separated blocks (using the -z flag), but ignoring the first 3 fields (in this case lines: Millionaire, \d+, QUESTION|ID...), using the -k/--key option with the start position being field 4 (in your case line 4) and the stop position being the end of the block;
after sorting, we can filter-out the duplicates with uniq, again using the NUL delimiter instead of newline (-z), and ignoring the first 3 fields (with -f/--skip-fields);
finally, we remove NUL delimiters with tr.
In general, a solution like this for removing duplicate blocks should work whenever there's a way to split the file into blocks. Note that block equality can be defined on a subset of fields (as we did above).
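Note that the pipeline above reorders the blocks and drops a duplicate block together with its three header lines, whereas the question's expected output keeps the file order and every header. A minimal awk sketch of that behaviour follows. It assumes (as the question states) that every block is exactly 10 lines, 3 header lines followed by a 7-line payload; the function name dedupe_blocks is only illustrative and does not come from any of the answers.

```shell
# Hypothetical helper (not from the answers above): keep the 3 header lines of
# every 10-line block, but print the 7-line payload only the first time it is seen.
dedupe_blocks() {
  awk '
    { pos = (NR - 1) % 10 }            # 0-based position inside the current block
    pos < 3 { print; next }            # Millionaire / id / QUESTION lines: always kept
    { body = body $0 ORS }             # collect the payload lines
    pos == 9 {                         # last line of the block: decide once
      if (!(body in seen)) { seen[body]; printf "%s", body }
      body = ""
    }
  ' "$@"
}

# Usage (writes to a new file instead of editing in place):
#   dedupe_blocks test1.md > test1.dedup.md
```

Because the payloads are remembered in an awk array, the file is read once and the original order is preserved.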

You can use Sublime Text's find and replace feature with the following regex:
Replace What: \A(?1)*?((^.*$\n){5})(?1)*?\K\1+
Replace With:
(i.e. replace with nothing)
This will find a block of 5 lines that exists later on in the document, and remove that duplicate/second occurrence of those 5 lines (and any immediately adjacent to it), leaving the others (i.e. the original 5 lines that are duplicates, and all other lines) untouched.
Unfortunately, due to the nature of the regex, you will need to perform this operation multiple times to remove all duplicates. It may be easier to keep invoking "Replace" than "Replace All" and having to re-open the panel each time. (Somehow the \K works as expected here, despite a report of it not working with "Replace".)

Please find below code for Windows PowerShell. The code is not in any way optimized. Please change test.txt in the code below to your file name and make sure the working directory is the one tha. The output is a CSV file that you can open in Excel, sort in order, and then delete the first column to remove the index. I have no idea why that index appears or how to get rid of it. This was my first attempt with Windows PowerShell and I could not find the syntax to declare a string array with a fixed size. Nevertheless, it works.
$d=Get-Content test.txt
$chk=@{};
$tot=$d.Count
$unique=@{}
$g=0;
$isunique=1;
for($i=0;$i -lt $tot){$isunique=1;
$chk[0]=$d[$i]
$chk[1]=$d[$i+1]
$chk[2]=$d[$i+2]
$chk[3]=$d[$i+3]
$chk[4]=$d[$i+4]
$i=$i+5
for($j=0;$j -lt $unique.count){
if($unique[$j] -eq $chk[0]){
if($unique[$j+1] -eq $chk[1]){
if($unique[$j+2] -eq $chk[2]){
if($unique[$j+3] -eq $chk[3]){
if($unique[$j+4] -eq $chk[4]){
$isunique=0
break
}
}
}
}
}
$j=$j+5
}
if ($isunique){
$unique[$g]=$chk[0]
$unique[$g+1]=$chk[1]
$unique[$g+2]=$chk[2]
$unique[$g+3]=$chk[3]
$unique[$g+4]=$chk[4]
$g=$g+5;
}
}
$unique | out-file test2.csv
![Screenshot] http://imgur.com/a/ZP9T5
People with PowerShell experience, please optimize the code. I tried .Contains, .Add, etc. but did not get the desired result. Hope it helped.

Here's an awk+sed method that can meet your requirement.
$ sed '0~5 s/$/\n/g' file | awk -v RS= '!($0 in a){a[$0];print}'
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
Another question.
Sasha
Kazan
Chistopol
Katya

Your requirements aren't clear wrt what to do with overlapping blocks of 5 lines, how to deal with blocks of less than 5 lines at the end of the input, and various other edge cases so here's one way to identify the blocks of 5 (or less) lines that are duplicated:
$ cat tst.awk
{
for (i=1; i<=5; i++) {
blockNr = NR - i + 1
if ( blockNr > 0 ) {
blocks[blockNr] = (blockNr in blocks ? blocks[blockNr] RS : "") $0
}
}
}
END {
for (blockNr=1; blockNr in blocks; blockNr++) {
block = blocks[blockNr]
print "----------- Block", blockNr, (seen[block]++ ? "***** DUP *****" : "ORIG")
print block
}
}
$ awk -f tst.awk file
----------- Block 1 ORIG
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 2 ORIG
Sasha
Kristina
Katya
Valeria
Where Sasha live?
----------- Block 3 ORIG
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
----------- Block 4 ORIG
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
----------- Block 5 ORIG
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 6 ORIG
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 7 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
----------- Block 8 ORIG
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
----------- Block 9 ORIG
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
----------- Block 10 ORIG
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
----------- Block 11 ***** DUP *****
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 12 ORIG
Sasha
Kristina
Katya
Valeria
Another question.
----------- Block 13 ORIG
Kristina
Katya
Valeria
Another question.
Sasha
----------- Block 14 ORIG
Katya
Valeria
Another question.
Sasha
Kazan
----------- Block 15 ORIG
Valeria
Another question.
Sasha
Kazan
Chistopol
----------- Block 16 ORIG
Another question.
Sasha
Kazan
Chistopol
Katya
----------- Block 17 ORIG
Sasha
Kazan
Chistopol
Katya
Where Sasha live?
----------- Block 18 ORIG
Kazan
Chistopol
Katya
Where Sasha live?
St. Petersburg
----------- Block 19 ORIG
Chistopol
Katya
Where Sasha live?
St. Petersburg
Kazan
----------- Block 20 ORIG
Katya
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 21 ***** DUP *****
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 22 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 23 ORIG
Kazan
Novgorod
Chistopol
----------- Block 24 ORIG
Novgorod
Chistopol
----------- Block 25 ORIG
Chistopol
and you can build on that to:
print the lines that haven't already been printed from within each ORIG block by using their blockNr plus the current line number in that block (hint: split(block,lines,RS)) and
figure out how to deal with your unspecified requirements.

Related

Bash Scripting - Variable Concatenation

Completely new to Linux and Bash scripting and I've been experimenting with the following script :
declare -a names=("Liam" "Noah" "Oliver" "William" "Elijah")
declare -a surnames=("Smith" "Johnson" "Williams" "Brown" "Jones")
declare -a countries=()
readarray countries < $2
i=5
id=1
while [ $i -gt 0 ]
do
i=$(($i - 1))
rname=${names[$RANDOM % ${#names[@]}]}
rsurname=${surnames[$RANDOM % ${#surnames[@]}]}
rcountry=${countries[$RANDOM % ${#countries[@]}]}
rage=$(($RANDOM % 5))
record="$id $rname $rsurname $rcountry"
#record="$id $rname $rsurname $rcountry $rage"
echo $record
id=$(($id + 1))
done
The script above produces the following result :
1 Liam Williams Andorra
2 Oliver Jones Andorra
3 Noah Brown Algeria
4 Liam Williams Albania
5 Oliver Williams Albania
but the problem becomes apparent when the line record="$id $rname $rsurname $rcountry" is commented out and the line record="$id $rname $rsurname $rcountry $rage" is active. The exact output of the second execution is:
4William Johnson Albania
2Elijah Smith Albania
2Oliver Brown Argentina
0William Williams Argentina
3Oliver Brown Angola
The file I am reading the countries from looks like this :
Albania
Algeria
Andorra
Angola
Argentina
Could you provide an explanation of why this happens?
Your countries input file has DOS-style <cr><lf> (carriage-return line-feed) line endings.
When you read lines from the file, each element of the countries array ends up looking like somename<cr>, and when printed the <cr> moves the cursor back to the beginning of the line, so the contents of $rage end up overwriting the beginning of the line.
The fix is to convert your countries input to use Unix-style (<lf> only) line endings. You can do this with dos2unix -n <inputfile> <outputfile>, for example.
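If dos2unix is not installed, tr can strip the carriage returns as well. A small sketch; the helper name strip_cr is made up for illustration, and the readarray line in the comment assumes bash 4 or newer:

```shell
# Hypothetical helper: print a file with every carriage return removed.
strip_cr() {
  tr -d '\r' < "$1"
}

# In the script from the question the array could then be filled like this
# (-t drops the trailing newline from each element):
#   readarray -t countries < <(strip_cr "$2")
```

This leaves the input file untouched and only cleans the data as it is read.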

Is it possible to take 2 columns from File 1, find them in File 2 and extract the relevant lines from File 2 into File 3?

I have 2 text files. File1 has about 1,000 lines and File2 has 20,000 lines. An extract of File1 is as follows:
/BBC Micro/Thrust
/Amiga/Alien Breed Special Edition '92
/Arcade-Vertical/amidar
/MAME (Advance)/mario
/Arcade-Vertical/mspacman
/Sharp X68000/Bubble Bobble (1989)(Dempa)
/BBC Micro/Chuckie Egg
An extract of File2 is as follows:
005;005;Arcade-Vertical;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
Alien 8 (Japan);Alien 8 (Japan);msx;;1987;Nippon Dexter Co., Ltd.;Action;1;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
Bubble Bobble (Japan);Bubble Bobble (Japan);msx2;;;;;;;;;;;;;;
Buffy the Vampire Slayer - Wrath of the Darkhul King (USA, Europe);Buffy the Vampire Slayer - Wrath of the Darkhul King (USA, Europe);Nintendo Game Boy Advance;;2003;THQ;Action;;;;;;;;;;
mario;mario;FBA;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Thunder Blade (1988)(U.S. Gold)[128K];Thunder Blade (1988)(U.S. Gold)[128K];ZX Spectrum;;;;;;;;;;;;;;
Thunder Mario v0.1 (SMB1 Hack);Thunder Mario v0.1 (SMB1 Hack);Nintendo NES Hacks 2;;;;;;;;;;;;;;
Thrust;Thrust;Vectrex;;;;;;;;;;;;;;
In File3 (the output file), using grep, sed, awk or a bash script, I would like to achieve the following output:
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
This is similar to a previous question I asked but not the same. I specifically want to avoid the possibility of Thrust;Thrust;Vectrex;;;;;;;;;;;;;; being recorded in File 3.
Using sudo awk -F\; 'NR==FNR{a[$1]=$0;next}$1 in a{print a[$1]}', I found that Thrust;Thrust;Vectrex;;;;;;;;;;;;;; was recorded in File 3 instead of Thrust;Thrust;BBC Micro;;;;;;;;;;;;;; (the latter being the output I'm seeking).
Equally, mario;mario;FBA;;;;;;;;;;;;;; won't appear in File3 because it does not match /MAME (Advance)/mario as "MAME (Advance)" doesn't match. That is good. The same for Bubble Bobble (Japan);Bubble Bobble (Japan);msx2;;;;;;;;;;;;;; which doesn't match either "Sharp X68000" or "Bubble Bobble (1989)(Dempa)".
Using awk and an associative array, you can use this:
awk '
BEGIN {
if ( ARGC != 3 ) exit(1);
FS="/";
while ( (getline < ARGV[2]) > 0 ) mfggames[$2"/"$3]=1;
FS=";";
ARGC=2;
}
mfggames[$3"/"$1]
' file2 file1
Output:
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Sorted per file1 solution (as per comment request):
awk '
BEGIN {
if ( ARGC != 3 ) exit(1);
FS="/";
while ( (getline < ARGV[2]) > 0 ) mfggames[$2"/"$3]=++order;
FS=";";
ARGC=2;
}
mfggames[$3"/"$1] { print(mfggames[$3"/"$1] FS $0); }
' file2 file1 | sort -n | cut -d ';' -f 2-
Output:
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
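The same lookup can also be sketched with the common two-file awk idiom (NR==FNR), reading File1 first and building the key from fields 3 and 1 of File2. This is an alternative under the assumption that File1 lines always have the form /system/title and that File1 is not empty; the function name match_games is illustrative, and the output follows File2's order:

```shell
# Hypothetical helper: print the File2 lines whose "system/title" pair is listed in File1.
match_games() {
  awk '
    NR == FNR { sub(/^\//, ""); want[$0]; next }   # File1: "/BBC Micro/Thrust" -> key "BBC Micro/Thrust"
    { split($0, f, ";") }                          # File2: f[1] = title, f[3] = system
    (f[3] "/" f[1]) in want                        # print only exact system/title matches
  ' "$1" "$2"
}

# Usage:
#   match_games File1 File2 > File3
```

Since the key contains both the system and the title, a line such as Thrust;Thrust;Vectrex is not selected when File1 only lists /BBC Micro/Thrust.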

Sorting by Date in Shell

Good day. I've been trying to sort the following data from a txt file using a shell script, but so far I've been unable to do so.
Here is what the data in the file looks like:
Name:ID:Date
Clinton Mcdaniel:100:16/04/2016
Patience Mccarty:101:18/03/2013
Carol Holman:102:24/10/2013
Roth Lamb:103:11/02/2015
Chase Gardner:104:14/06/2014
Jacob Tucker:105:05/11/2013
Maite Barr:106:24/04/2014
Acton Galloway:107:18/01/2013
Helen Orr:108:10/05/2014
Avye Rose:109:07/06/2014
What I want to do is sort this by Date instead of by name or ID.
When I execute the following code I get this:
Code:
sort -t "/" -k3.9 -k3.4 -k3
Result:
Acton Galloway:107:18/01/2013
Amaya Lynn:149:11/08/2013
Anne Sullivan:190:12/01/2013
Bruno Hood:169:01/08/2013
Cameron Phelps:187:17/11/2013
Carol Holman:102:24/10/2013
Chaney Mcgee:183:11/09/2013
Drew Fowler:173:28/07/2013
Hadassah Green:176:17/01/2013
Jacob Tucker:105:05/11/2013
Jenette Morgan:160:28/11/2013
Lael Aguirre:148:29/05/2013
Lareina Morin:168:06/05/2013
Laura Mercado:171:06/06/2013
Leonard Richard:154:02/06/2013
As you can see, it only sorts by the year; the months and everything else are still out of place. Does anyone know how to correctly sort this by date?
EDIT:
Well, I've found how to do it, answer below:
Code: sort -n -t":" -k3.9 -k3.4,3.5 -k3
Result:
Anne Sullivan:190:12/01/2013
Hadassah Green:176:17/01/2013
Acton Galloway:107:18/01/2013
Nasim Gonzalez:163:18/01/2013
Patience Mccarty:101:18/03/2013
Sacha Stevens:164:01/04/2013
Lareina Morin:168:06/05/2013
Lael Aguirre:148:29/05/2013
Leonard Richard:154:02/06/2013
Laura Mercado:171:06/06/2013
Drew Fowler:173:28/07/2013
Bruno Hood:169:01/08/2013
Virginia Puckett:144:08/08/2013
Moses Mckay:177:09/08/2013
Amaya Lynn:149:11/08/2013
Chaney Mcgee:183:11/09/2013
Willa Bond:153:22/09/2013
Oren Flores:184:27/09/2013
Olga Buckley:181:11/10/2013
Carol Holman:102:24/10/2013
Jacob Tucker:105:05/11/2013
Veda Gillespie:125:09/11/2013
Thor Workman:152:12/11/2013
Cameron Phelps:187:17/11/2013
Jenette Morgan:160:28/11/2013
Mason Contreras:129:29/12/2013
Martena Sosa:158:30/12/2013
Vivian Stevens:146:20/01/2014
Benedict Massey:175:02/03/2014
Macey Holden:127:01/04/2014
Orla Estrada:174:06/04/2014
Maite Barr:106:24/04/2014
Helen Orr:108:10/05/2014
Randall Colon:199:27/05/2014
Avye Rose:109:07/06/2014
Cleo Decker:117:12/06/2014
Chase Gardner:104:14/06/2014
Mark Lynn:113:21/06/2014
Geraldine Solis:197:24/06/2014
Thor Wheeler:180:25/06/2014
Aimee Martin:192:21/07/2014
Gareth Cervantes:166:26/08/2014
Serena Fernandez:122:24/09/2014
The sort you are using will fail for any date before year 2000 (e.g. 1999 will sort after 2098). Continuing from your question in the comment, you currently show
sort -n -t":" -k3.9 -k3.4,3.5 -k3
You should use
sort -n -t":" -k3.7 -k3.4,3.5 -k3.1,3.2
Explanation:
Your -t separates the fields on each colon (':'). Each -k takes a KEYDEF of the form f[.c][opts] (field, character position within the field, then optional ordering options; you need no separate options here). Your date field (field 3) is:
d d / m m / y y y y
1 2 3 4 5 6 7 8 9 0 -- chars counting from 1 in field 3
So you first sort by -k3.9 (starting at the 9th character in field 3), which is only the last two digits of the 4-digit year. You really want to sort on -k3.7 (which is the start of the 4-digit year).
You next sort by the month (characters 4,5) which is fine.
Lastly, you sort on -k3 (which fails to limit the characters considered). Just as you have limited the sort on the month to chars 4,5, you should limit the sort of the days to characters 1,2.
Putting that together gives you sort -n -t":" -k3.7 -k3.4,3.5 -k3.1,3.2. Hope that answers your question from the comment.
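A variant of the same command gives every key an explicit end position and its own numeric flag, so that each key covers exactly the year, the month or the day. This is a sketch only; sort_by_date is an illustrative name:

```shell
# Sort Name:ID:dd/mm/yyyy records by year, then month, then day.
sort_by_date() {
  sort -t: -k3.7,3.10n -k3.4,3.5n -k3.1,3.2n "$@"
}

# Usage:
#   sort_by_date data.txt
```

Limiting each key this way means the year comparison never looks past the four year digits, even if more fields are appended to the lines later.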
You're hamstrung by your (terrible, IMO) date format. Here's a bit of a Schwartzian transform:
awk -F'[:/]' '{printf "%s%s%s %s\n", $NF, $(NF-1), $(NF-2), $0}' file | sort -n | cut -d' ' -f2-
That extracts the year, month, day and adds it as a separate word to the start of each line. Then you can sort quite simply. Then discard that date.

Grep text between patterns using bash [closed]

I need to sort data which resides in a txt file. The sample data is as follows:
======
Jhon
Doe
score -
------
======
Ann
Smith
score +
------
======
Will
Marrow
score -
------
And I need to extract only sections where score + is defined. So the result should be
======
Ann
Smith
score +
------
I would try this one:
$ grep -B3 -A1 "score +" myfile
It means... grep three lines Before and one line After "score +".
Sed can do it as follows:
sed -n '/^======/{:a;N;/\n------/!ba;/score +/p}' infile
======
Ann
Smith
score +
------
where -n prevents printing, and
/^======/ { # If the pattern space starts with "======"
:a # Label to branch to
N # Append next line to pattern space
/\n------/!ba # If we don't match "------", branch to :a
/score +/p # If we match "score +", print the pattern space
}
Things could be more properly anchored with /\n------$/, but there are spaces at the end of the lines, and I'm not sure if those are real or copy-paste artefacts, but this works for the example data.
give this oneliner a try:
awk -v RS="==*" -F'\n' '{p=0;for(i=1;i<=NF;i++)if($i~/score \+/)p=1}p' file
with the given data, it outputs:
Ann
Smith
score +
------
The idea is to treat all lines delimited by ====... as one multi-line record, check whether the record contains the search pattern, and print the record if it does.
With GNU awk for multi-char RS:
$ awk -v RS='=+\n' '/score \+/' file
Ann
Smith
score +
------
Given:
$ echo "$txt"
======
Jhon
Doe
score -
------
======
Ann
Smith
score +
------
======
Will
Marrow
score -
------
You can create a toggle-type match in awk to print only the section that you wish:
$ echo "$txt" | awk '/^=+/{f=1;s=$0;next} /^score \+/{f=2} f {s=s"\n"$0} /^-+$/ {if(f==2) {print s} f=0}'
======
Ann
Smith
score +
------
Use Grep Context Flags
Assuming you have a truly fixed-format file, you can just use fgrep (or GNU or BSD grep with the speedy --fixed-strings flag) along with the --before-context and --after-context flags. For example:
$ fgrep -A1 -B3 'score +' /tmp/foo
======
Ann
Smith
score +
------
The flags will find your match, and include the three lines before and one line after each match. This gives you the output you're after, but with a lot less complexity than a sed or awk script. YMMV.

Bash: Cat a file in between characters

I've tried various solutions to find a good way to get through a file beginning with a specific word, and ending with a specific word.
Let's say I have a file named states.txt containing:
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
I want to cat states.txt and get the following states that begin with Idaho and end with South Dakota.
I also want to ignore the fact that the states are in alphabetical order (the actual file contents I am going for are not in such order).
The result should look like:
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Thank you for your time and patience on this one. I appreciate any help offered.
awk '/Idaho/{f=1} f; /South Dakota/{f=0}' file
See Explain awk command for many more awk range idioms.
Don't get into the habit of using /start/,/end/ as it makes trivial things very slightly briefer but requires a complete rewrite or duplicate conditions for even the slightest requirements change (e.g. not printing the bounding lines).
For example given this input file:
$ cat file
a
b
c
d
e
to print the lines between b and d inclusive and then excluding either or both bounding lines:
$ awk '/b/{f=1} f; /d/{f=0}' file
b
c
d
$ awk 'f; /b/{f=1} /d/{f=0}' file
c
d
$ awk '/b/{f=1} /d/{f=0} f;' file
b
c
$ awk '/d/{f=0} f; /b/{f=1}' file
c
Try that if your starting point was awk '/b/,/d/' file and notice the additional language constructs and duplicate conditions required:
$ awk '/b/,/d/' file
b
c
d
$ awk '/b/,/d/{if (!/b/) print}' file
c
d
$ awk '/b/,/d/{if (!/d/) print}' file
b
c
$ awk '/b/,/d/{if (!(/b/||/d/)) print}' file
c
Also, it's not obvious at all but an insidious bug crept into the above. Note the additional "b" that's now between "c" and "d" in this new input file:
$ cat file
a
b
c
b
d
e
and try again to exclude the first bounding line from the output:
$ awk 'f; /b/{f=1} /d/{f=0}' file
c
b
d
-> SUCCESS
$ awk '/b/,/d/{if (!/b/) print}' file
c
d
-> FAIL
You ACTUALLY need to write something like this to keep using a range and exclude the first bounding line
$ awk '/b/,/d/{if (c++) print; if (/d/) c=0}' file
c
b
d
but by then it's obviously getting kinda silly and you'd rewrite it to just use a flag like my original suggestion.
Use sed with a pattern range:
sed '/^Idaho$/,/^South Dakota$/!d' filename
Or awk with the same pattern range:
awk '/^Idaho$/,/^South Dakota$/' filename
In both cases, the ^ and $ match the beginning and end of the line, respectively, so ^Virginia$ matches only if the whole line is Virginia (i.e., West Virginia is not matched).
Or, if you prefer fixed-string matching over regex matching (it doesn't make a difference here but might in other circumstances):
awk '$0 == "Idaho", $0 == "South Dakota"' filename
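The flag idiom and the exact string comparison can be combined, with the boundary lines passed in as awk variables. A minimal sketch; the function name print_between is illustrative, and since awk's -v processes backslash escapes in the values, boundaries containing backslashes would need extra care:

```shell
# Hypothetical helper: print every run of lines from a line equal to START
# through the next line equal to STOP, both inclusive.
print_between() {
  awk -v start="$1" -v stop="$2" '
    $0 == (start "") { f = 1 }   # switch printing on at the start line
    f                            # print while the flag is set
    $0 == (stop "")  { f = 0 }   # switch off after printing the end line
  ' "$3"
}

# Usage:
#   print_between "Idaho" "South Dakota" states.txt
```

Concatenating an empty string to each boundary forces a string comparison, so numeric-looking boundaries such as 1 and 1.0 are not treated as equal.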
# all bash
list=$(cat file.txt)
start="Idaho"
stop="South Dakota"
# drop everything up to and including the first "Idaho"
fst=${list#*"$start"}
# drop the last "South Dakota" and everything after it
snd=${fst%"$stop"*}
result="$start$snd$stop"
echo "$result"
See http://tldp.org/LDP/abs/html/string-manipulation.html

Resources