List all occurences of a particular string or number in the file : UNIX - shell

I have multiple files containing phone numbers in a format XXX XXX XXXX or +XX XXXXXX XXXX or maybe XXXXXXXXXX. Can I list all the phone numbers containing particular no. '845' against each file name.
Currently I am using:-
egrep -H 845 *

This might do what you want (uses GNU for word delimiters):
$ cat file
now is the 123 845 1234 winter of +01 238456 5432
our 1845234567 discontent.
$ cat tst.awk
{
while ( match($0,/(\<(([0-9]{3} ){2}[0-9]{4}|[0-9]{10})|[+][0-9]{2} [0-9]{6} [0-9]{4})\>/) ) {
tgt = substr($0,RSTART,RLENGTH)
$0 = substr($0,RSTART+RLENGTH)
if ( tgt ~ /845/ ) {
print tgt
}
}
}
$ awk -f tst.awk file
123 845 1234
+01 238456 5432
1845234567
If not, edit your question to provide some sample input and expected output.

You can try this:
awk '/845/ {print FILENAME,FNR,$0}' *
Or this:
grep -rF '845' *
This does not give you hits only on phone number, but all line with this nubers.
You does not tell if this is middle of a number, start or end.

To list the surrounding numbers around the match but nothing else,
grep -Ho '[0-9]*845[0-9]*' files
If your phone numbers can contain spaces and punctuation, maybe add those to the character classes; but be careful so as not to make the regex match two adjacent phone numbers in one go. (If they are always separated by text not in the character class, you're fine.)
(There is nothing -E in this particular regex so I'm not using egrep.)

Related

bash: substract constant number after prefix

I have a large text file with many entries like this:
/locus_tag="PREFIX_05485"
including the empty spaces in the beginning. Unfortunately, the first identifier does not start with 00001.
The only part in this line that is changing is the number.
I would like to change the PREFIX (this I can do easily with sed), but I also want to decrease the number so it looks like this:
/locus_tag="myNewPrefix_00001"
(the next entry should be ..."myNewPrefix_00002" and so on). Alternatively, the entry could also be without leading zeros.
As far as I know, sed cannot calculate (like substracting a constant number). Any ideas how I can solve that?
Thank you very much. If the question is unclear, please let me know and I will try to improve it.
EDIT: Sometimes the same number occurs twice (this should also be the case in the modified file, for instance
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12346"
/locus_tag="PREFIX_12347"
should be in the end
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00002"
/locus_tag="myNewPrefix_00003"
You may use awk:
awk -v pf='myNewPrefix' 'BEGIN{FS=OFS="="}
$1 ~ /\/locus_tag$/ && split($2, a, /_/) == 2 {
$2 = sprintf("\"%s_%05d\"", pf, (a[2] in seen ? i : ++i)); seen[a[2]]
} 1' file
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00001"
/locus_tag="myNewPrefix_00002"
/locus_tag="myNewPrefix_00003"
Check this Perl one liner
/tmp> cat littlebird.txt
abcdef
/locus_tag="PREFIX_12345"
hello hai
/locus_tag="PREFIX_12345"
/locus_tag="PREFIX_12346"
/locus_tag="PREFIX_12347"
123 456
end
/tmp> perl -pe 'BEGIN{$r=qr/PREFIX_(.+)["]/} if(/$r/) {$kv{$1}++;$kv{$1}==1 and $kv2{$1}=sprintf("%04d",++$i) for(keys %kv) } s/$r/PREFIX_$kv2{$1}/g ' littlebird.txt
abcdef
/locus_tag="PREFIX_0001
hello hai
/locus_tag="PREFIX_0001
/locus_tag="PREFIX_0002
/locus_tag="PREFIX_0003
123 456
end
/tmp>

How to replace last n characters in the kth occurence of a line containing a certain substring using sed or awk?

Suppose I have a file that resembles the following format:
\\ Random other lines \\
...
27861NA+ NA+89122 13.480 11.554 10.082
27862NA+ NA+89123 2.166 5.896 10.108
27863NA+ NA+89124 8.289 6.843 3.090
27864NA+ NA+89125 12.972 5.936 4.498
27865CL- CL-89126 13.914 2.125 12.915
27866CL- CL-89127 12.050 13.907 3.559
...
\\ Random other lines \\
I am trying to find a way of replacing the last 24 characters of each line with a string that I have prepared, for the first 3 instances of lines in the file that contain the string "NA+".
For example, my output would ideally look like:
\\ Random other lines \\
...
27861NA+ NA+89122 my first string hello
27862NA+ NA+89123 my second string foo
27863NA+ NA+89124 my final string bar $$
27864NA+ NA+89125 12.972 5.936 4.498
27865CL- CL-89126 13.914 2.125 12.915
27866CL- CL-89127 12.050 13.907 3.559
...
\\ Random other lines \\
So far, I have found a sed command that will remove the last 24 characters from every line in the file:
sed 's/.\{24\}$//' myfile.txt
And also an awk command that will return the kth line that contains the desired substring:
awk '/NA+/{i++}i==1' myfile.txt
Does anyone have an idea about how I could replace the last 24 characters in the 1st, 2nd, and 3rd lines of my file that each contain a certain substring?
With single awk:
awk -v str="my string" '!f && /NA\+/{ f=1; n=NR+3 }n && n>NR{ $4=$5=""; $3=str }1' myfile.txt
string="my first string hello"
awk -v string="$string" '{ if ( $0 ~ "NA" ) {cnt++} if (cnt < 4 ) { print substr($0,1,length($0)-23)string } else { print }}' NA
Using awk, set a string and pass it awk with -v. Search for strings containing NA and the increment the variable cnt. When cnt is less that 4, print the everything but the last 23 characters adding the string passed to the end. Otherwise print the line.
This might work for you (GNU sed):
sed '/NA+/{x;s/\n/&/3;x;ta;H;s/.\{24\}$/some string/;b;:a;n;ba}' file
This uses the hold space (HS) to keep a count of the number of lines the script has seen of the required string (NA+). Once it has seen n (in this case n=3) such lines it just prints the remainder of the file.

Using awk to print a new column without apostrophes or spaces

I'm processing a text file and adding a column composed of certain components of other columns. A new requirement to remove spaces and apostrophes was requested and I'm not sure the most efficient way to accomplish this task.
The file's content can be created by the following script:
content=(
john smith thomas blank 123 123456 10
jane smith elizabeth blank 456 456123 12
erin "o'brien" margaret blank 789 789123 9
juan "de la cruz" carlos blank 1011 378943 4
)
# put this into a tab-separated file, with the syntactic (double) quotes above removed
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "${content[#]}" >infile
This is what I have now, but it fails to remove spaces and apostrophes:
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 tolower(substr($2,0,3)); }' infile > outfile
This throws an error "sub third parameter is not a changeable object", which makes sense since I'm trying to process output instead of input, I guess.
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 sub("'\''", "",tolower(substr($2,0,3))); }' infile > outfile
Is there a way I can print a combination of column 6 and part of column 2 in lower case, all while removing spaces and apostrophes from the output to the new column? Worst case scenario, I can just create a new file with my first command and process that output with a new awk command, but I'd like to do it in one pass is possible.
The second approach was close, but for order of operations:
awk -F "\t" '
BEGIN { OFS="\t"; }
{
var=$2;
sub("['\''[:space:]]", "", var);
var=substr(var, 0, 3);
print $1,$2,$3,$5,$6,$7,$6 var;
}
'
Assigning the contents you want to modify to a variable lets that variable be modified in-place.
Characters you want to remove should be removed before taking the substring, since otherwise you shorten your 3-character substring.
It's a guess since you didn't provide the expected output but is this what you're trying to do?
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
abbr = $2
gsub(/[\047[:space:]]/,"",abbr)
abbr = tolower(substr(abbr,1,3))
print $1,$2,$3,$5,$6,$7,$6 abbr
}
$ awk -f tst.awk infile
john smith thomas 123 123456 10 123456smi
jane smith elizabeth 456 456123 12 456123smi
erin o'brien margaret 789 789123 9 789123obr
juan de la cruz carlos 1011 378943 4 378943del
Note that the way to represent a ' in a '-enclosed awk script is with the octal \047 (which will continue to work if/when you move your script to a file, unlike if you relied on "'\''" which only works from the command line), and that strings, arrays, and fields in awk start at 1, not 0, so your substr(..,0,3) is wrong and awk is treating the invalid start position of 0 as if you had used the first valid start position which is 1.
The "sub third parameter is not a changeable object" error you were getting is because sub() modifies the object you call it with as the 3rd argument but you're calling it with a literal string (the output of tolower(substr(...))) and you can't modify a literal string - try sub(/o/,"","foo") and you'll get the same error vs if you used var="foo"; sub(/o/,"",var) which is valid since you can modify the content of variables.

AWK between 2 patterns - first occurence

I am having this example of ini file. I need to extract the names between 2 patterns Name_Z1 and OBJ=Name_Z1 and put them each on a line.
The problem is that there are more than one occurences with Name_Z1 and OBJ=Name_Z1 and i only need first occurence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a more ample script in bash and i am stuck on this. I prefer awk to do the job.
My greatly appreciation for who can help me. Thank you!
For Wintermute solution: The [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; in the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
An (g)awk way that doesn't need a set number of fields(although i have assumed that contains; will always be on the line you need the names from.
(g)awk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/) - Increments x when Z1 is found. Also part of a
condition so x must exist to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after
up to the |. Stores the capture in a. Again a
condition so must succeed to continue.
gsub(";","\n",a[1]) - Substitutes all the ; for newlines in the capture
group a[1].
{print a[1];exit}' - If all conditions are met then print a[1] and exit.
This way should work in (m)awk
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)
print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
Awk solution would be
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field seperator as ;
/Name_Z1/{f++} matches the line with pattern /Name_Z1/ If matched increment {f++}
f==1 && /Names/{print $2,$3,$4} is same as if f == 1 and maches pattern Name with line if true, then print the the columns 2 3 and 4 (delimted by ;)
OFS="\n" sets the output filed seperator as \n new line
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data in group of blocks.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" ' # By setting RS to nothing, one record equals one block. Then FS is set to one line as a field
/^\[Name_Z1\]/ { # Search for block with [Name_Z1]
n=split($3,a,";") # Split field 3, the names and store number of fields in variable n
for (i=2;i<=n;i++) # Loop from second to last field
print a[i] # Print the fields
exit # Exits after first find
' file
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names;/ {
gsub ("Names;","",$0);
gsub(";","\n",$0);
print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all ; semi-colon characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith

Bash script replace two fields in a text file using variables

This should be a simple fix but I cannot wrap my head around it at the moment.
I have a comma-delimited file called my_course that contains a list of courses with some information about them.
I need to get user input about the last two fields and change them accordingly.
Each field is constructed like:
CourseNumber,CourseTitle,CreditHours,Status,Grade
Example file:
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
I get the user input for 3 things: Course Number, Status (0 or 1), and Grade (A,B,C,N/A)
So far I have tried matching the line containing the course number and changing the last two fields. I haven't been about to figure out how to modify the last two fields using sed so I'm using this horrible jumble of awk and sed:
temporary=$(awk -v status=$status -v grade=$grade '
BEGIN { FS="," }; $(NF)=""; $(NF-1)="";
/'$cNum'/ {printf $0","status","grade;}' my_course)
sed -i "s/CSC$cNum.*/$temporary/g" my_course
The issue that I'm running into here is the number of fields in the course title can range from 1 to 4 so I can't just easily print the first n fields. I've tried removing the last two fields and appending the new values for status and grade but that isn't working for me.
Note: I have already done checks to ensure that the user inputs valid data.
Use a simple awk-script:
BEGIN {
FS=","
OFS=FS
}
$0 ~ course {
$(NF-1)=status
$NF=grade
} {print}
and on the cmd-line, set three parameters for the various parameters like course, status and grade.
in action:
$ cat input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC3210" -vstatus="1" -vgrade="A" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,1,A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC1010" -vstatus="1" -vgrade="B" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,1,B
It doesn't matter how much commas you have in course name as long as you look only at last two commas:
sed -i "/CSC$cNum/ s/.,[^,]*$$/$status,$grade/"
The trick is to use $ in pattern to match the end of line. $$ because of double quotes.
And don't bother building the "temporary" line - apply substitution only to line that matches course number.

Resources