I've tried various solutions without finding a good way to print the portion of a file beginning with a specific word and ending with another specific word.
Let's say I have a file named states.txt containing:
Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Washington
West Virginia
Wisconsin
Wyoming
I want to cat states.txt and get the slice of states that begins with Idaho and ends with South Dakota.
Note that the solution should not rely on the states being in alphabetical order (the actual file I am working with is not sorted).
The result should look like:
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Thank you for your time and patience on this one. I appreciate any help offered.
awk '/Idaho/{f=1} f; /South Dakota/{f=0}' file
See Explain awk command for many more awk range idioms.
Don't get into the habit of using /start/,/end/ as it makes trivial things very slightly briefer but requires a complete rewrite or duplicate conditions for even the slightest requirements change (e.g. not printing the bounding lines).
For example given this input file:
$ cat file
a
b
c
d
e
to print the lines between b and d inclusive and then excluding either or both bounding lines:
$ awk '/b/{f=1} f; /d/{f=0}' file
b
c
d
$ awk 'f; /b/{f=1} /d/{f=0}' file
c
d
$ awk '/b/{f=1} /d/{f=0} f;' file
b
c
$ awk '/d/{f=0} f; /b/{f=1}' file
c
Try that if your starting point was awk '/b/,/d/' file and notice the additional language constructs and duplicate conditions required:
$ awk '/b/,/d/' file
b
c
d
$ awk '/b/,/d/{if (!/b/) print}' file
c
d
$ awk '/b/,/d/{if (!/d/) print}' file
b
c
$ awk '/b/,/d/{if (!(/b/||/d/)) print}' file
c
Also, it's not obvious at all but an insidious bug crept into the above. Note the additional "b" that's now between "c" and "d" in this new input file:
$ cat file
a
b
c
b
d
e
and try again to exclude the first bounding line from the output:
$ awk 'f; /b/{f=1} /d/{f=0}' file
c
b
d
-> SUCCESS
$ awk '/b/,/d/{if (!/b/) print}' file
c
d
-> FAIL
You ACTUALLY need to write something like this to keep using a range and exclude the first bounding line:
$ awk '/b/,/d/{if (c++) print; if (/d/) c=0}' file
c
b
d
but by then it's obviously getting kinda silly and you'd rewrite it to just use a flag like my original suggestion.
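To round this off: that same tricky input is no problem for the flag approach even when excluding both bounding lines (a quick sketch; the file is rebuilt inline so the snippet is self-contained):

```shell
# Rebuild the tricky input with a second "b" between "c" and "d"
printf 'a\nb\nc\nb\nd\ne\n' > file

# Flag idiom, excluding both bounding lines: clear the flag before
# printing, set it after, so neither "b" nor "d" themselves print
awk '/d/{f=0} f; /b/{f=1}' file
```

This prints `c` and then the inner `b`, exactly the lines strictly between the bounds, with no duplicate conditions needed.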
Use sed with a pattern range:
sed '/^Idaho$/,/^South Dakota$/!d' filename
Or awk with the same pattern range:
awk '/^Idaho$/,/^South Dakota$/' filename
In both cases, the ^ and $ match the beginning and end of the line, respectively, so ^Virginia$ matches only if the whole line is Virginia (i.e., West Virginia is not matched).
Or, if you prefer fixed-string matching over regex matching (it doesn't make a difference here but might in other circumstances):
awk '$0 == "Idaho", $0 == "South Dakota"' filename
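The exact-match style also combines naturally with the flag idiom from the earlier answer, should you later need to include or exclude the bounding lines independently (a sketch; the printf merely recreates a slice of states.txt so the snippet is self-contained):

```shell
# Recreate a slice of states.txt for demonstration
printf 'Hawaii\nIdaho\nIllinois\nSouth Dakota\nTennessee\n' > states.txt

# Exact whole-line comparisons combined with the flag idiom,
# so "West Virginia" could never trigger a "Virginia" bound
awk '$0 == "Idaho"{f=1} f; $0 == "South Dakota"{f=0}' states.txt
```

This prints Idaho through South Dakota inclusive, and moving the `f;` term around adjusts which bounding lines are printed, just as in the regex version.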
# all bash (no external tools needed)
list=$(cat file.txt)
start="Idaho"
stop="South Dakota"
fst=${list#*$start}      # strip everything up to and including the first "Idaho"
snd=${fst%$stop*}        # strip everything from the last "South Dakota" onward
result="$start$snd$stop"
echo "$result"
See http://tldp.org/LDP/abs/html/string-manipulation.html
I have the following bash script called bank_scpt.txt:
#!/bin/bash
anz="$1"
wp="$2"
# anz fixed cost search patterns:
anz_fc="^Aver"
# wp fixed cost search patterns:
wp_fc="2degrees"
# Preparation to get anz file ready for concatenation.
anz="$(awk -v r="$anz_fc" 'BEGIN{FS=OFS="\t"} NR>1 {split($7,a,"/"); print a[3]"-"a[2]"-"a[1], $6, $2, $3, $4, "az" OFS ($6 > 0 ? "vi" : $2~r ? "fc" : "vc")}' "$anz" | column -s $'\t' -t)"
# Preparation to get wp file ready for concatenation.
wp="$(awk -v r="$wp_fc" 'BEGIN{FS="," ; OFS="\t"} NR>1 && $3~r {gsub(/"/, "", $0) ; split($1,a,"/"); print a[3]"-"a[2]"-"a[1], $2, $3, $4, $5, "wp", "fc"}' "$wp" | column -s $'\t' -t)"
echo "$anz" "$wp" |head -n 4
echo "$anz" "$wp" |tail -n 4
The idea behind this script is to concatenate two bank account txt files: anz.txt and wp.txt
When I run:
./bank_scpt.txt anz.txt wp.txt
I get the following desired output (Please note az and wp in column six indicate the bank text files the records come from az = anz.txt and wp = wp.txt):
2021-03-31 -8.50 Monthly A/C Fee az vc
2021-03-31 -250.00 Rutherford & Bond 4835******** 8848 C az vc
2021-03-31 -131.60 Avery Johnson Avery Johnso 592315 az fc
2021-03-31 50.00 Collins Tf 127 Driver Crescent az vi
2020-12-29 -71.50 2degrees Mobile Ltd DIRECT DEBIT 2365653 wp fc
2021-01-27 -70.00 2degrees Mobile Ltd DIRECT DEBIT 2365653 wp fc
2021-02-26 -70.00 2degrees Mobile Ltd DIRECT DEBIT 2365653 wp fc
2021-03-26 -70.00 2degrees Mobile Ltd DIRECT DEBIT 2365653 wp fc
However, when I use a regex such as wp_fc="^2degr", I get the following output (the wp.txt file is completely ignored):
2021-03-31 -8.50 Monthly A/C Fee az vc
2021-03-31 -250.00 Rutherford & Bond 4835******** 8848 C az vc
2021-03-31 -131.60 Avery Johnson Avery Johnso 592315 az fc
2021-03-31 50.00 Collins Tf 127 Driver Crescent az vi
2020-04-09 64.40 Body Corporate Batchelor 1010 & 1036 az vi
2020-04-09 17.25 A D & C H Bailey Aron Bailey az vi
2020-04-06 46.00 Jm Lymburn 13 Thornley Titahi az vi
2020-04-02 17.25 A D & C H Bailey Aron Bailey az vi
My question is why am I able to use anz_fc="^Aver" but not wp_fc="^2degr"? And how can I change the second awk command so I can indeed use wp_fc="^2degr"?
I include here an excerpt of the original files:
head -n 5 anz.txt
Type Details Particulars Code Reference Amount Date ForeignCurrencyAmount ConversionCharge
Bank Fee Monthly A/C Fee -8.50 31/03/2021
Eft-Pos Rutherford & Bond 4835******** 8848 C 210331123119 -250.00 31/03/2021
Payment Avery Johnson Avery Johnso 592315 Labour -131.60 31/03/2021
Bill Payment Collins Tf 127 Driver Crescent I1600 50.00 31/03/2021
head -n 5 wp.txt
Date,Amount,Other Party,Description,Reference,Particulars,Analysis Code
01/04/2020,478.26,"ACC","Salary",,"ACC WKLY CMP","TO 02Apr2020"
02/04/2020,-7.50,"Edorne Labog","AUTOMATIC PAYMENT",,"Christian","Netflix"
02/04/2020,-150.00,"Christian rent cover","AUTOMATIC PAYMENT",,"146 Coromand",
26/03/2021,-70.00,"2degrees Mobile Ltd","DIRECT DEBIT","2365653",,"10009701292"
Please note that wp.txt is a csv file that I saved as a txt file.
As some fields of wp.txt are enclosed in double quotes, I assume
the field which starts with 2degrees will be quoted as well. (Although your
provided wp.txt unfortunately misses the crucial 2degrees lines.)
The condition $3~r in your awk script is therefore testing a field that
begins with a literal double quote against the pattern ^2degr, which fails.
Then modify the line:
wp_fc="^2degr"
to something like:
wp_fc="^\"2degr"
then it will work.
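If you'd rather keep the pattern free of the escaped quote, another option is to strip the quotes from the field before testing it (a minimal sketch using a hypothetical two-line wp.txt; only the gsub differs from the original test):

```shell
# Recreate a minimal two-line excerpt of wp.txt (hypothetical data)
printf '%s\n' 'Date,Amount,Other Party' \
              '26/03/2021,-70.00,"2degrees Mobile Ltd"' > wp.txt

# Remove the double quotes from field 3 before matching,
# so the unescaped pattern ^2degr works unchanged
awk -F',' -v r="^2degr" 'NR>1 {gsub(/"/, "", $3); if ($3 ~ r) print $3}' wp.txt
```

This way all the search patterns can be written without worrying about which fields the CSV export happened to quote.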
As side notes:
It is always recommended to post a consistent set of input file(s),
your script, the result, and your expected result. Your provided
input files are not related to your initially posted output at all,
so we cannot reproduce the problem.
It's better to avoid giving an executable script a .txt suffix.
It works, but it is confusing.
I have 2 text files. File1 has about 1,000 lines and File2 has 20,000 lines. An extract of File1 is as follows:
/BBC Micro/Thrust
/Amiga/Alien Breed Special Edition '92
/Arcade-Vertical/amidar
/MAME (Advance)/mario
/Arcade-Vertical/mspacman
/Sharp X68000/Bubble Bobble (1989)(Dempa)
/BBC Micro/Chuckie Egg
An extract of File2 is as follows:
005;005;Arcade-Vertical;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
Alien 8 (Japan);Alien 8 (Japan);msx;;1987;Nippon Dexter Co., Ltd.;Action;1;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
Bubble Bobble (Japan);Bubble Bobble (Japan);msx2;;;;;;;;;;;;;;
Buffy the Vampire Slayer - Wrath of the Darkhul King (USA, Europe);Buffy the Vampire Slayer - Wrath of the Darkhul King (USA, Europe);Nintendo Game Boy Advance;;2003;THQ;Action;;;;;;;;;;
mario;mario;FBA;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Thunder Blade (1988)(U.S. Gold)[128K];Thunder Blade (1988)(U.S. Gold)[128K];ZX Spectrum;;;;;;;;;;;;;;
Thunder Mario v0.1 (SMB1 Hack);Thunder Mario v0.1 (SMB1 Hack);Nintendo NES Hacks 2;;;;;;;;;;;;;;
Thrust;Thrust;Vectrex;;;;;;;;;;;;;;
In File3 (the output file), using grep, sed, awk or a bash script, I would like to achieve the following output:
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
This is similar to a previous question I asked but not the same. I specifically want to avoid the possibility of Thrust;Thrust;Vectrex;;;;;;;;;;;;;; being recorded in File 3.
Using sudo awk -F\; 'NR==FNR{a[$1]=$0;next}$1 in a{print a[$1]}', I found that Thrust;Thrust;Vectrex;;;;;;;;;;;;;; was recorded in File 3 instead of Thrust;Thrust;BBC Micro;;;;;;;;;;;;;; (the latter being the output I'm seeking).
Equally, mario;mario;FBA;;;;;;;;;;;;;; won't appear in File3 because it does not match /MAME (Advance)/mario as "MAME (Advance)" doesn't match. That is good. The same for Bubble Bobble (Japan);Bubble Bobble (Japan);msx2;;;;;;;;;;;;;; which doesn't match either "Sharp X68000" or "Bubble Bobble (1989)(Dempa)".
Using awk and an associative array, you can do this:
awk '
BEGIN {
  if ( ARGC != 3 ) exit(1);
  FS="/";
  # read file1 ("/System/Game" lines) into an array keyed by "System/Game"
  while ( (getline < ARGV[2]) > 0 ) mfggames[$2"/"$3]=1;
  FS=";";
  ARGC=2;   # drop file1 from the argument list so only file2 is processed
}
mfggames[$3"/"$1]
' file2 file1
Output:
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Sorted per file1 solution (as per comment request):
awk '
BEGIN {
  if ( ARGC != 3 ) exit(1);
  FS="/";
  # remember the order in which each "System/Game" appears in file1
  while ( (getline < ARGV[2]) > 0 ) mfggames[$2"/"$3]=++order;
  FS=";";
  ARGC=2;   # drop file1 from the argument list so only file2 is processed
}
mfggames[$3"/"$1] { print(mfggames[$3"/"$1] FS $0); }
' file2 file1 | sort -n | cut -d ';' -f 2-
Output:
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
I am trying to do a line replace given contexts on two sides of a split. This seems much easier to do in python but my entire pipeline is in bash so I would love to stick to tools like sed, awk, grep, etc.
For example:
split_0 = split('\t')[0]
split_1 = split('\t')[1]
if (a b c in split_0 AND w x y z in split_1):
split_1 = split_1.replace('w x y z', 'w x_y z')
I can use awk to do splits like this:
awk -F '\t' '{print$1}'
But I don't know how to do this on both sides simultaneously in order to satisfy both conditions. Any help would be greatly appreciated.
Example input/output:
This is one example, and I have many rules like this, but basically: given a line with "ex" on the left side and "ih g z" on the right side, I want to substitute ih g z with ih g_z.
input: exam ih g z ae m
output: exam ih g_z ae m
I could do a brutal sed like:
sed 's/\(.*ex.*\t.*\)ih g z\(.*\)/\1ih g_z\2/g'
but this seems ugly and I am sure there is a much better way to do this. *I am not totally sure if the "\t" works that way in sed.
awk to the rescue!
awk -F'\t' '$1~/ex/ && $2~/ih g z/{sub("g z","g_z")}1' file
This applies the conditions to fields 1 and 2 (tab-delimited) and replaces the string once.
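A quick way to try it out (the printf emits a literal tab between the two sides; this is only a demonstration, not part of any larger pipeline):

```shell
# Input line with a real tab separating the left and right contexts
printf 'exam\tih g z ae m\n' |
awk -F'\t' '$1~/ex/ && $2~/ih g z/{sub("g z","g_z")}1'
```

This prints `exam	ih g_z ae m`: the conditions are tested against the split fields, while sub() rewrites the first "g z" in the whole record.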
If you have a bunch of these replacement rules, it's better to not hard code them in the script
$ awk -F'\t' -v OFS='\t' 'NR==FNR{lr[NR]=$1; rr[NR]=$2;
ls[NR]=$3; rs[NR]=$4; next}
{for(i=1; i<=length(lr); i++)
if($1~lr[i] && $2~rr[i])
{gsub(ls[i],rs[i],$2);
print;
next}}1' rules file
111 2b2b2b
222 333u33u
4 bbb5az
9 nochange
where
$ head rules file
==> rules <==
1 2 a b
2 3 z u
4 5 e b
==> file <==
111 2a2a2a
222 333z33z
4 eee5az
9 nochange
Note that only the first applicable rule is used, the replacement is applied to the second field only, and it is applied multiple times (gsub). Both files need to be tab-delimited.
I have a folder which has files with the following contents.
ATOM 9 CE1 PHE A 1 70.635 -26.989 98.805 1.00 39.17 C
ATOM 10 CE2 PHE A 1 69.915 -26.416 100.989 1.00 42.21 C
ATOM 11 CZ PHE A 1 -69.816 26.271 -99.622 1.00 40.62 C
ATOM 12 N PRO A 2 -69.795 30.848 101.863 1.00 44.44 N
In some files, the 7th column appears as follows.
ATOM 9 CE1 PHE A 1 70.635-26.989 98.805 1.00 39.17 C
ATOM 10 CE2 PHE A 1 69.915-26.416 100.989 1.00 42.21 C
ATOM 11 CZ PHE A 1 -69.816-26.271 -99.622 1.00 40.62 C
ATOM 12 N PRO A 2 -69.795-30.848 101.863 1.00 44.44 N
I would like to extract the name of files which have the above type of lines. What is the easy way to do this?
Referring to Erik E. Lorenz's answer,
you can simply do:
grep -l '\s-\?[0-9.]\+-[0-9.]\+\s' dir/*
from grep manpage
-l
(The letter ell.) Write only the names of files containing selected
lines to standard output. Pathnames are written once per file searched.
If the standard input is searched, a pathname of (standard input) will
be written, in the POSIX locale. In other locales, standard input may be
replaced by something more appropriate in those locales.
A combination of grep and cut works for me:
grep -H -m 1 '\s-\?[0-9.]\+-[0-9.]\+\s' dir/* | cut -d: -f1
This performs the following steps:
for every file in dir/*, find the first match (-m 1) of two adjacent numbers separated by only a dash
print it with the filename prepended (-H). Should be the default anyway.
extract the file name using cut
This is fast since it only looks for the first line match. If there's other places with two adjacent numbers, consider changing the regex.
Edit:
This doesn't match scientific notation and may falsely report contents such as '.-.', for example in comments. If you're dealing with one of them, you have to expand the regex.
awk 'NF > 10 && $1 ~ /^[[:upper:]]+$/ && $2 ~ /^[[:digit:]]+/ { print FILENAME; nextfile }' *
Will print files that have more than 10 fields in which the first field is all uppercase letters and the second field starts with digits.
Using GNU awk for nextfile:
awk '$7 ~ /[0-9]-[0-9]/{print FILENAME; nextfile}' *
or more efficiently since you just need to test the first line of each file if all lines in a given file have the same format:
awk 'FNR==1{if ($7 ~ /[0-9]-[0-9]/) print FILENAME; nextfile}' *
I'm new at bash scripting. I tried the following:
filename01 = ''
if [ $# -eq 0 ]
then
filename01 = 'newList01.txt'
else
filename01 = $1
fi
I get the following error:
./smallScript02.sh: line 9: filename01: command not found
./smallScript02.sh: line 13: filename01: command not found
I imagine that I am not treating the variables correctly, but I don't know how. Also, I am trying to use grep to extract the second and third words from a text file. The file looks like:
1966 Bart Starr QB Green Bay Packers
1967 Johnny Unitas QB Baltimore Colts
1968 Earl Morrall QB Baltimore Colts
1969 Roman Gabriel QB Los Angeles Rams
1970 John Brodie QB San Francisco 49ers
1971 Alan Page DT Minnesota Vikings
1972 Larry Brown RB Washington Redskins
Any help would be appreciated
When you assign variables in bash, there must be no spaces on either side of the = sign.
# good
filename01="newList01.txt"
# bad
filename01 = "newlist01.txt"
For your second problem, use awk, not grep. The following will extract the second and third fields from each line of the file whose name is stored in $filename01:
awk '{print $2, $3}' "$filename01"
In bash (and other bourne-type shells), you can use a default value if a variable is empty or not set:
filename01=${1:-newList01.txt}
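A small sketch of how that expansion behaves with and without an argument (demo is just a hypothetical wrapper function for illustration):

```shell
# ${1:-default} expands to $1 if it is set and non-empty,
# otherwise to the default value
demo() {
    filename01=${1:-newList01.txt}
    echo "$filename01"
}

demo                # prints: newList01.txt
demo mylist.txt     # prints: mylist.txt
```

This replaces the whole if/else block from the question with a single assignment.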
I'd recommend spending some time with the bash manual: http://www.gnu.org/software/bash/manual/bashref.html
Here's a way to extract the names in pure bash:
while read -r first second third rest; do
echo "$second $third"
done < "$filename01"