search and replace specific positions of a fixed-length file - shell

I have several fixed length files where every position or position range is reserved for a particular field. The first few fields are year, term, name, DOB, gender...
Year starts in position 1 and is of length 2
Term starts in position 3 and is of length 1
Name starts in position 4 and is of length 35
DOB starts in position 39 and is of length 6
Gender starts in position 45 and is of length 1
...
This is true for all files. Not all fields are always present. For example, the Name field may be 35 blanks/white spaces because it was not reported. The same may be true of other fields.
I need to search the Name field (whether it has a value or not) and replace its contents with a dummy string, which could be 'xxxxxxxx'. The replacement must not exceed 35 characters, and after the replacement the positions of all fields must be unchanged.
All files have 80 fields.
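For reference, the Name field occupies columns 4 through 38, which you can inspect with cut:
cut -c4-38 File_name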
Sample file containing 3 lines. Each line begins with 182:
182 1 405080711 001 0425594
07 5 4170000000000000 00000000000000000000000000000000000000000000 0000
9 05000002
182 1 205080712 001 0480201
07 5 3300000000000000 00000000000000000000000000000000000000000000 0000
05000004
182 2 005080713 001 0425824
07 5 3080000000000000 00000000000000000000000000000000000000000000 0000
05000005
I am using the following sed command to replace a blank Name field with the below string.
However, this overwrites all fields prior to Name, which starts at position 4:
sed -E 's/^(.{3})(.{36})/First Name of the student-Last Name/' File_name
I am open to using any other command, such as awk.
Actual white spaces between fields may not show here due to auto-formatting. In sample line 1 above there are actually 41 spaces between "182" and "1".
Appreciate any help.

Using perl:
perl -pe 'BEGIN { $name = sprintf "%-35.35s", "xxxxxxxx" }
          substr($_, 3, 35) = $name' input.txt
perl uses 0-based indexes, so this replaces the 35 characters of each input line starting at the 4th character with the value xxxxxxxx, padded with enough trailing spaces to make 35 characters total (and truncated to 35 if it is longer). The modified line is then printed to standard output. Use perl -i -pe '...' input.txt to modify the file in place.
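Equivalently, perl's four-argument substr takes the replacement as its last argument instead of using the lvalue form:
perl -pe 'substr($_, 3, 35, sprintf("%-35.35s", "xxxxxxxx"))' input.txt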
Or a similar awk version:
awk 'BEGIN { name = sprintf("%-35.35s", "xxxxxxxx") }
     { printf "%s%s%s\n", substr($0, 1, 3), name, substr($0, 39) }' input.txt
awk's substr doesn't have a way to replace part of a string like perl's does, so this one extracts the parts before and after the Name field and prints them all out with the new name value. Not as elegant, but it gets the job done.
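For completeness, the original sed attempt can be fixed in the same spirit: capture the first 3 characters and substitute a replacement padded to exactly 35 characters. A minimal sketch, generating the padding with printf so the field widths stay intact:
pad=$(printf '%-35.35s' 'xxxxxxxx')        # pad/truncate to exactly 35 characters
sed -E "s/^(.{3}).{35}/\1$pad/" File_name  # keep columns 1-3, overwrite columns 4-38
This assumes the dummy string contains no characters special to sed, which holds for 'xxxxxxxx' and spaces.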

Related

Parse output of command and store in variables

I need to parse the output of the mmls command and store multiple values in variables using a BASH script.
Specifically, I need to store: the sector size (512 in the example below) and the start values (0, 0, 63, 224910, 240975 in the example below). Since the second set of values represents partitions, the number of values captured can vary.
mmls /mnt/E01Mnt/RAW/ewf1
DOS Partition Table
Offset Sector: 0
Units are in 512-byte sectors
     Slot      Start        End          Length       Description
000: Meta      0000000000   0000000000   0000000001   Primary Table (#0)
001: -------   0000000000   0000000062   0000000063   Unallocated
002: 000:000   0000000063   0000224909   0000224847   NTFS / exFAT (0x07)
003: 000:001   0000224910   0000240974   0000016065   DOS FAT12 (0x01)
004: -------   0000240975   0000250878   0000009904   Unallocated
Here's a start:
$ awk '/^Units/{print $4+0} /^[0-9]/{print $3+0}' file
512
0
0
63
224910
240975
Try to solve the rest yourself and then let us know if you have questions.
Explanation: file is a file containing your sample input. You can replace awk '{script}' file with command | awk '{script}' if your input comes from the output of some command rather than being stored in a file.
^ is the universal regexp metacharacter for start of string, while /.../ in awk means "find this regexp". So the above looks for lines that start with the text shown (i.e. Units or digits) and then prints the 4th or 3rd space-separated field after adding zero to it to remove any trailing non-digits or leading zeros. See man awk.
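A quick illustration of the +0 trick on the Units line:
$ echo 'Units are in 512-byte sectors' | awk '{print $4+0}'
512
Here $4 is the string 512-byte; adding zero coerces it to a number, and the conversion stops at the first non-numeric character.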
You need a bit of awk to start with.
values=( $(mmls /mnt/E01Mnt/RAW/ewf1 | awk '
/^Units are in/{match($4,/^[[:digit:]]+/,ss); print ss[0]}
NR>6{print $4}'
) )
Now you have a values array which contains both the sector size (first element) and the start values (subsequent elements). We can do some array manipulation to separate the individual elements.
secsize=${values[0]}   # size of sector
declare -a sv          # sv for start values
for ((i = 1; i < ${#values[@]}; i++))
do
  sv+=( "${values[i]}" )
done
echo "${sv[@]}"        # print start values
unset values           # You don't need values anymore.
Note: Requires GNU awk.
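As an aside, the loop above can be replaced with bash's array slicing, which copies every element from index 1 onward in one step:
secsize=${values[0]}
sv=( "${values[@]:1}" ) # all elements except the first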

In bash, I want to generate, from a set of words, a fixed 4-character output for each word that always matches

I have these words:
Frank_Sinatra
Dean_Martin
Ray_Charles
I want to generate 4 characters for each word which will always match that word and never change, e.g.:
frk ) Frank_Sinatra
dnm ) Dean_Martin
Ray ) Ray_Charles
and it shall always produce these same characters when I run it again (not random).
note:
Something like this:
String   32-bit checksum   8-bit checksum
ABC      326 0x146          70 0x46
ACB      410 0x19A         154 0x9A
BAC      350 0x15E          94 0x5E
BCA      450 0x1C2         194 0xC2
CAB      399 0x18F         143 0x8F
CBA      256 0x100           0 0x00
http://www.flounder.com/checksum.htm
Look at this command:
echo -n Frank_Sinatra | md5sum
d0f7287be11d7bbfe53809088ea3b009 -
but instead of that long string, I wanted just 4 unique characters.
I did it like this:
echo -n "Frank_Sinatra" | md5sum > foo ; sed -i 's/./&\n#/4' foo
grep -v "#" foo > bar
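A simpler way to get the same deterministic prefix is to cut the hash directly. A minimal sketch, assuming a file words.txt with one word per line and that the first 4 hex digits of the md5 happen to be unique across your list:
while read -r word; do
  printf '%s ) %s\n' "$(printf '%s' "$word" | md5sum | cut -c1-4)" "$word"
done < words.txt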
I'm not going to write the entire program for you, but I can share an algorithm that can accomplish this. I can't guarantee that it is the most optimized algorithm.
Problem
Generate a 3-letter identifier for each line in a text file that is unique, such that grep will only match with the intended line.
Assumption
There exists a 3-letter identifier for each line such that grep will only match that line.
Algorithm
For every line in text file
Grab a permutation of the line, run grep on the file using that permutation.
If grep returns more than one line, get a new permutation of the line and go back to the previous step.
If grep returns only one line and that line matches our current line, we found a proper identifier. Store this identifier.
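A rough bash sketch of this algorithm, using substrings of each line as the candidate identifiers rather than arbitrary permutations (words.txt is a stand-in for your word list):
while read -r line; do
  id=""
  # try each 3-character substring of the line as a candidate identifier
  for ((i = 0; i <= ${#line} - 3; i++)); do
    cand=${line:i:3}
    # accept the candidate if grep matches exactly one line: this one
    if [ "$(grep -cF -- "$cand" words.txt)" -eq 1 ]; then
      id=$cand
      break
    fi
  done
  printf '%s ) %s\n' "${id:-???}" "$line"
done < words.txt
If no unique substring exists for a line (i.e. the assumption above fails), the sketch prints ??? for it.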

Using sed to go to a specific line, change pattern then print all between line and another pattern

So I need to change a specific line in a big text file using something found one line earlier. What the text looks like:
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: [0-9][0-9][0-9][0-9][0-9] SOME TEXT
Tél. :
numbers
Fax :
numbers
"----------------------"
What I've found so far is (I believe I'm almost done):
K=0
while [ $K -lt 11519 ]; do
let K=K+1
L=`head -n $K file_that_contains_line_numbers_I_want.txt | tail -1`
M=`expr $L - 2`
dept=`head -n $L filename.txt | tail -1 | sed -e 's/Adresse:.*Code Postal: //' -e 's/[0-9]\{3\} .*//'`
sed -n ""$M"{s/Tél. :/$dept/; /----------------------/p; q}" filename.txt >>newfile.csv
done
Here $dept is the first two digits after Code Postal:.
What doesn't yet work is the last sed bit: I want the end file to look like the old file, just with the "Tél." part changed to $dept.
New file:
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
90
numbers
Fax :
numbers
"----------------------"
Obviously this pattern repeats with different names, but sometimes the Tél. line and the ones below it are not there.
tl;dr: I want to change a pattern in a file with something found one line up, with the thing found one line up changing each time.
If you can find a different way to get $dept into a different line, I would be very happy to hear about it.
I know my code is far from the most efficient, but I only learned about sed a week ago.
Thanks in advance for helping me/correcting me.
EDIT: As I've been asked to provide some input, here it is :
Nom: JOHN DOE
Société: APERTURE SCIENCE
Adresse: 37 RUE OF PARIS CS 30112 Code Postal: 51726 REIMS CEDEX
Tél. :
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: OLIVER TWIST
Société: NASA
Adresse: 40 RUE DU GINGEMBRE CS 70999 Code Postal: 67009 STRASBOURG CEDEX
Tél. :
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: BARACK OBAMA
Société: WHITE HOUSE
Adresse: 124 BOULEVARD DE GAULLE Code Postal: 75017 PARIS
Tél. :
12 34 56 78 90
"----------------------"
Output I want to achieve :
Nom: JOHN DOE
Société: APERTURE SCIENCE
Adresse: 37 RUE OF PARIS CS 30112 Code Postal: 51726 REIMS CEDEX
51
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: OLIVER TWIST
Société: NASA
Adresse: 40 RUE DU GINGEMBRE CS 70999 Code Postal: 67009 STRASBOURG CEDEX
67
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: BARACK OBAMA
Société: WHITE HOUSE
Adresse: 124 BOULEVARD DE GAULLE Code Postal: 75017 PARIS
75
12 34 56 78 90
"----------------------"
With sed:
$ sed '/.*Code Postal: \([0-9][0-9]\).*/{p;s//\1/;n;d}' file
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
90
numbers
Fax :
numbers
"----------------------"
/.*Code Postal: \([0-9][0-9]\).*/ : search for a line containing Code Postal: followed by two digits
p : print the matching line (i.e. duplicate the line containing "Code Postal")
s//\1/ : the empty pattern in s// reuses the last regexp, so this replaces the whole matching line with the captured digits (\([0-9][0-9]\))
n : read the next line (the "Tél." line), which d then deletes
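The empty-regexp trick can be seen on its own; sed reuses the most recently used regular expression when the pattern of s/// is empty:
$ echo 'Code Postal: 51726 REIMS' | sed '/.*Code Postal: \([0-9][0-9]\).*/s//\1/'
51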
I've just seen your edit; you can achieve that with:
sed '/.*Code Postal: \([0-9][0-9]\).*/{p;s//\1/;N;/[0-9]/s/\n/ /;s/Tél\. : *//}' file
Note that the dept number will be output on its own line in the "OLIVER TWIST" block (because Tél. : is on a single line there, as in the first block).
You do not provide sample input to check against, but this should work:
/Code Postal:/ {
match($0, /Code Postal: *([0-9][0-9])/, result);
dept = result[1];
}
/^Tél/ { $2 = dept }
{ print }
Save the code to a file, then call awk -f file input_file (the three-argument form of match() requires GNU awk). It works like this: if the line matches "Code Postal", save the first two digits of the postal code in the variable dept. If the line starts with "Tél", replace the second field with the value of dept. Then print every line.
Here is my guess as to what you are trying to accomplish.
awk 'NR==FNR { # Store line numbers in a[]
a[$1] = $1; next }
FNR in a { m=1 } # We are in match range
/^"-+"$/ { m=0 } # Separator line: we are out of range
m && /^Adresse.*Code Postal:/ { c=substr($NF, 1, 2); $NF = 90000 }
m && /^Tél\. :$/ { $0 = c }
{ print }' file_that_contains_line_numbers_I_want.txt filename > filename.new
This contains some common Awk idioms. The following is a really brief sketch of the script in human terms.
NR is the current line number overall, and FNR is the line number within the current file. When these are equal, it means you are reading the first input file. In this case, we read the line number into the array a and skip to the next line.
If we fall through, we are reading the second file. When we see a line number which is present in a, we set the flag m to a true (non-zero) value to indicate that we are in a region where a substitution should take place. When we see the dashed lines, we clear it, because this marks the end of the current record.
Finally, if we are in one of the targeted records (m is true) we look for the patterns and perform the requested extraction and substitution. NF is the number of fields in the current line, and $ selects a field, so $NF = 90000 replaces the last field on the line; and $0 is the entire input line, so when we see Tél. : we replace the whole line with the extracted code.
At the end of the script, we print whatever we are reading; the next in the first block skips the rest of the script, so we are printing only when we are in the second file. The resulting output should (hopefully!) be the result you require.
This should be orders of magnitude faster than reading the same file over and over again, and should work as long as the first file contains fewer than millions of line numbers (assuming modern hardware; if you have a really small machine with limited memory and no swap, maybe tens of thousands).
It sounds like this might be what you want, using GNU awk for the 3rd arg to match():
$ awk 'match($0,/.*Code Postal: *([0-9][0-9])/,a){$0=$0 ORS a[1]} !/^Tél/' file
or gawk or mawk for gensub():
$ awk '{$0=gensub(/.*Code Postal: *([0-9][0-9]).*/,"&\n\\1",1)} !/^Tél/' file
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
90
numbers
Fax :
numbers
"----------------------"
The above was run on this input file:
$ cat file
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
Tél. :
numbers
Fax :
numbers
"----------------------"
The above matches the stated regexp, saves the captured 2 digits in a[1], and adds that, preceded by a newline (ORS), to the end of the current line before printing that and any other line that doesn't start with Tél.
Read Effective Awk Programming, 4th Edition, by Arnold Robbins if you'll be doing any text manipulation in UNIX.

bash: find pattern in one file and apply some code for each pattern found

I created a script that auto-logs in to a router and checks the current CPU load; if the load exceeds a certain threshold, it needs to print the current CPU value to standard output.
I would like to search the script output for a certain pattern (the value 80 in this case, which is the threshold for high CPU load) and then, for each instance of the pattern, check whether the current value is greater than 80; if it is, print the 5 lines before the pattern followed by the line containing the pattern.
Question 1: how do I loop over each instance of the pattern and apply some code to each of them separately?
Question 2: how do I print n lines before the pattern followed by x lines after the pattern?
For example, I used awk to search for the pattern "health" and print the 6 lines after it, as below:
awk '/health/{x=NR+6}(NR<=x){print}' ./logs/CpuCheck.log
I would like to do the same for the pattern "80", this time printing 5 lines before it and one line after, but only if $3 (representing the current CPU load) exceeds the value 80.
Below is the output of the auto-login script (file name: CpuCheck.log):
ABCD-> show health xxxxxxxxxx
* - current value exceeds threshold
                                  1 Min   1 Hr   1 Hr
Cpu                Limit   Curr    Avg     Avg    Max
-----------------+-------+------+------+-----+----
01                   80      39     36     36     47
WXYZ-> show health xxxxxxxxxx
* - current value exceeds threshold
                                  1 Min   1 Hr   1 Hr
Cpu                Limit   Curr    Avg     Avg    Max
-----------------+-------+------+------+-----+----
01                   80      29     31     31     43
Thanks in advance for the help
Rather than use awk, you could use the -B and -A switches to grep, which print a number of lines before and after a pattern match:
grep -E -B 5 -A 1 '^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])' CpuCheck.log
The pattern matches lines which start with some numbers, followed by spaces, followed by 80, followed by a number between 81 and 100. The -E switch enables extended regular expressions (EREs), which are needed if you want to use the + character to mean "one or more". If your version of grep doesn't support EREs, you can instead use the slightly more verbose \{1,\} syntax:
grep -B 5 -A 1 '^[0-9]\{1,\}[[:space:]]\{1,\}80[[:space:]]\{1,\}\(100\|9[0-9]\|8[1-9]\)' CpuCheck.log
If grep isn't an option, one alternative would be to use awk. The easiest way would be to store all of the lines in a buffer:
awk 'f-->0;{a[NR]=$0}/^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])/{for(i=NR-5;i<=NR;++i)print i, a[i];f=1}'
This stores every line in an array a. When the third column is greater than 80, it prints the previous 5 lines from the array. It also sets the flag f to 1, so that f-->0 is true for the next line, causing it to be printed.
Originally I had opted for a comparison $3>80 instead of the regular expression but this isn't a good idea due to the varying format of the lines.
If the log file is really big, meaning that reading the whole thing into memory is unfeasible, you could implement a circular buffer so that only the previous 5 lines were stored, or alternatively, read the file twice.
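To see the f-->0 idiom in isolation: f is post-decremented, so the test is true exactly once after a match sets f to 1, printing just the following line.
$ printf 'a\nb\nc\n' | awk 'f-->0; /a/{f=1}'
b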
Unfortunately, awk is stream-oriented and doesn't have a simple way to get the lines before the current line. But that doesn't mean it isn't possible:
awk '
BEGIN {
bufferSize = 6;
}
{
buffer[NR % bufferSize] = $0;
}
$2 == 80 && $3 > 80 {
# print the five lines before the match and the line with the match
for (i = 1; i <= bufferSize; i++) {
print buffer[(NR + i) % bufferSize];
}
}
' ./logs/CpuCheck.log
I think the easiest way is with awk, reading the file twice.
This should use essentially 0 memory except whatever is used to store the line numbers.
If there is only one occurrence:
awk 'NR==FNR&&$2=="80"{to=NR+1;from=NR-5}NR!=FNR&&FNR<=to&&FNR>=from' file{,}
If there is more than one occurrence:
awk 'NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
NR!=FNR{for(i in to)if(FNR<=to[i]&&FNR>=from[i]){print;next}}' file{,}
Input/output
Input
1
2
3
4
5
6
7
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
19
20
Output
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
How it works
NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
In the first file, if the second field is 80, set to and from to the record number plus or minus whatever offsets you want.
Increment the occurrence variable x.
NR!=FNR
In the second file
for(i in to)
For each occurrence
if(FNR<=to[i]&&FNR>=from[i]){print;next}
If the current record number (in this file) is between this occurrence's to and from, print the line. The next prevents the line from being printed multiple times if occurrences of the pattern are close together.
file{,}
Pass the file twice as two args; the brace expansion file{,} expands to file file.
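You can check the expansion directly in the shell:
$ echo file{,}
file file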

bash- searching for a string in a file and returning all the matching positions

I have a fasta file (imagine a txt file in which even lines are sequences of characters and odd lines are sequence ids). I would like to search for a string in the sequences and get the position of each matching substring as well as its id. Example:
Input:
>111
AACCTTGG
>222
CTTCCAACC
>333
AATCG
Search for "CC". Output:
3 111
4 8 222
$ awk -F'CC' 'NR%2==1{id=substr($0,2);next} NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}' file
3 111
4 8 222
Explanation:
-F'CC'
awk breaks input lines into fields. We instruct it to use the sequence of interest, CC in this example, as the field separator.
NR%2==1{id=substr($0,2);next}
On odd number lines, we save the id to variable id. The assumption is that the first character is > and the id is whatever comes after. Having captured the id, we instruct awk to skip the remaining commands and start over with the next line.
NF>1{x=1+length($1); b=x; for (i=2;i<NF;i++){x+=length(FS $i); b=b " " x}; print b,id}
If awk finds only one field on an input line (NF==1), that means no field separators were found, and we ignore those lines.
For the rest of the lines, we calculate the positions of each match in x and then save each value of x found in the string b.
Finally, we print the match locations, b, and the id.
This will print the start position of each match, followed by the id.
awk 'NR%2==1{t=substr($0,2)}{z=a="";while(y=match($0,"CC")){a=a?a" "z+y:z+y;$0=substr($0,z=(y+RLENGTH));z-=1}}a{print a,t }' file
Neater:
awk '
NR%2==1{t=substr($0,2)}
{
z=a=""
while ( y = match($0,"CC") ) {
a=a?a" "z+y:z+y
$0=substr($0,z=(y+RLENGTH))
z-=1
}
}
a { print a,t }' file
Output:
3 111
4 8 222
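For comparison, a rough perl sketch of the same idea, collecting 1-based start positions with the /g match operator (non-overlapping matches, like the awk versions):
perl -ne 'if (/^>(.*)/) { $id = $1; next }
          my @pos;
          # pos() is the offset just past each match, so for the
          # 2-character pattern CC the 1-based start is pos() - 1
          push @pos, pos() - 1 while /CC/g;
          print "@pos $id\n" if @pos;' file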
