I am confused about this bash script line

I am trying to convert a bash script to Python for an intern project; basically, the script parses a table and prints the information as an HTML document.
This line is confusing me. TMP is a temporary file holding the output of lsload, which prints a table of server host info.
# Force header text to lowercase
tr '[:upper:]' '[:lower:]' <${TMP} |head --lines=+1 |sed -e 's/[ \t]\+/ /g' >${H_TMP}
Okay, well the first tr command is converting the header text from uppercase to lowercase. I'm not really sure what the head command is doing. And I am confused as to what the sed is doing as well. Could anyone clarify what is going on in this line?
As a bonus, does anyone have ideas as to how I can convert this to Python?
EDIT: Okay, I seem to understand what sed is doing; it is converting any amount of spaces or tabs to just a single space. Just confused about head now.

You should be able to find the documentation for any Unix command easily by searching for its man page.
http://man7.org/linux/man-pages/man1/head.1.html
Any basic introduction to the Unix command line will also reveal that head reads the first n lines of a text file, and tail correspondingly reads the last n lines of a text file.
The entire snippet corresponds to
import os
import re

with open(os.environ['TMP']) as inputfile, open(os.environ['H_TMP'], 'w') as outputfile:
    for line in inputfile:
        # sed 's/[ \t]+/ /g' is re.sub(...)
        # tr ... is lower()
        line = re.sub(r'\s+', ' ', line).lower()
        outputfile.write(line)
        # head --lines=1 -- quit after a single line
        break
The regex escape \s matches many different whitespace characters; if your input is plain ASCII, it behaves the same as the simple character class [ \t]. Whether you need to match strictly those two characters, or also handle other (e.g. Unicode) whitespace, is something only you can decide.
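If you decide you do want the narrower behaviour, a minimal variant of the substitution line above would be:
# match only literal spaces and tabs, exactly like sed 's/[ \t]\+/ /g'
line = re.sub(r'[ \t]+', ' ', line).lower()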
For maximum compactness, you could reduce this down to
with open(os.environ['TMP']) as inputfile, open(os.environ['H_TMP'], 'w') as outputfile:
    outputfile.write(re.sub(r'\s+', ' ', inputfile.readline()).lower())
If you want to read a fixed number of lines where that number is not 1, maybe look at enumerate():
with open(os.environ['TMP']) as inputfile, open(os.environ['H_TMP'], 'w') as outputfile:
    for lineno, line in enumerate(inputfile, 1):
        line = re.sub(r'\s+', ' ', line).lower()
        outputfile.write(line)
        if lineno == 234:
            break

Related

Newline is not '\n'

I have a text file which was created by Matlab (I don't have the source code), and was in the form:
a b c d
e f g h
I used
sed -i '' $'s/\t/,/g' filename
to replace all the tabs with commas and ended up with a file that looks like this:
a,b,c,d
e,f,g,h
then, I tried to remove all the line breaks using
tr '\n' ' ' < filename
It gave me only the last line, but when I manually edited the text file by placing the cursor at the end of the line, pressing "del" and then "enter", and re-running the code, it worked fine.
So, the newline in the text file is probably not symbolized by \n, what other chars are there to symbolize line breaks?
P.S. If I run the tr line on the file before I remove the tabs, I get empty output.
Thank you.
Sounds like your newlines are \r\n (Windows-style ones). One option would be to remove them first using this command:
tr -s '\r\n' ' ' < file
The -s switch squeezes the output, so each run of carriage returns and newlines is replaced by a single space rather than several. Thanks to glenn jackman for pointing this out.
Guessing your intention slightly, you may want to use something like this, to replace all spaces including line breaks with commas:
tr -s '[:space:]' ',' < file
You could then pipe this to sed to remove the trailing comma if you wanted.
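For comparison, a rough Python sketch of the same clean-up (the file name file is just a placeholder) could look like this:
import re

# read the whole file; Python's universal newlines already turn \r\n into \n
with open('file') as f:
    text = f.read()

# collapse every run of whitespace (including line breaks) into one comma,
# mirroring tr -s '[:space:]' ',', then drop the trailing comma
print(re.sub(r'\s+', ',', text).strip(','))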

Dynamic delimiter in Unix

Input:-
echo "1234ABC89,234" # A
echo "0520001DEF78,66" # B
echo "46545455KRJ21,00"
From the above strings, I need to split the characters to get the alphabetic field and the number after that.
From "1234ABC89,234", the output should be:
ABC
89,234
From "0520001DEF78,66", the output should be:
DEF
78,66
I have many strings that I need to split like this.
Here is my script so far:
echo "1234ABC89,234" | cut -d',' -f1
but it gives me 1234ABC89 which isn't what I want.
Assuming that you want to discard leading digits only, and that the letters will be all upper case, the following should work:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\n\2/'
This works fine with GNU sed (I have 4.2.2), but other sed implementations might not like the \n, in which case you'll need to substitute something else.
Depending on the version of sed you can try:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1\n\2/'
or:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1$\2/' | tr '$' '\n'
DEF
78,66
Explanation: the regular expression rewrites the input into the expected output, except that instead of the newline it puts a "$" sign, which we then replace with a newline using the tr command.
Where do the strings come from? Are they read from a file (or some other source external to the script), or are they stored in the script? If they're stored in the script, you should simply reformat the data so it is easier to manage. So it is sensible to assume they come from an external data source such as a file, or are piped to the script.
You could simply feed the data through sed:
sed 's/^[0-9]*\([A-Z]*\)/\1 /' |
while read alpha number
do
    …process the two fields…
done
The only trick to watch there is that if you set variables in the loop, they won't necessarily be visible to the script after the done. There are ways around that problem — some of which depend on which shell you use. This much is the same in any derivative of the Bourne shell.
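For comparison, a rough Python version of that loop (reading the strings from standard input, purely as an example) might look like this:
import re
import sys

for line in sys.stdin:
    # strip the leading digits, keep the letters and the trailing number
    m = re.match(r'[0-9]*([A-Z]*)([0-9].*)', line.strip())
    if m:
        alpha, number = m.group(1), m.group(2)
        # ...process the two fields...
        print(alpha, number)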
You said you have many strings like this, so I recommend, if possible, saving them to a file such as input.txt:
1234ABC89,234
0520001DEF78,66
46545455KRJ21,00
On your command line, try this sed command reading input.txt as file argument:
$ sed -E 's/([[:digit:]]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt
ABC 89,234
DEF 78,66
KRJ 21,00
How it works
uses -E for extended regular expressions to save on typing; otherwise, for grouping, we would have to escape the parentheses as \( and \)
uses grouping with ( and ), searching for three groups:
firstly digits; + specifies one or more of them. Oddly, using [0-9] results in an extra blank space above the results, so use the POSIX class [[:digit:]]
the next group searches for POSIX alphabetical characters, whether lowercase or uppercase, and {3} specifies exactly 3 of them
the last group searches for ., meaning any character, with + for one or more times
\2\t\3 then returns group 2 and group 3, separated by a tab
Thus you are able to extract two separate fields per line, just separated by tab, for easier manipulation later.
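And if you ever need the same transformation outside of sed, a small Python sketch (using input.txt as above; the regex is the same idea, with [A-Za-z] standing in for the POSIX class) would be:
import re

with open('input.txt') as f:
    for line in f:
        # digits, then exactly three letters, then the rest -> "letters<TAB>rest"
        print(re.sub(r'\d+([A-Za-z]{3})(.+)', r'\1\t\2', line.rstrip('\n')))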

bash script: write string with double quotes and blanks to file

I am trying to use sed to read a line from an ASCII file, parse it, and write it, slightly changed, to a defined line number in an output file.
The line format in the input file is as follows:
linenumber:designator,"variable text content"
e.g.
3:string1,"this is text of string 1"
So the outfile should look as follows in line 3:
string1,"this is text of string 1"
The line includes the double quotes and the blanks. All old lines are moved one line down.
The user is responsible for providing a proper input file regarding the order of lines, and has to consider that lines in the output file are moved down with each new line from the input file. The script does not know about any order except for the line number given in the input file.
A script shall read all lines and put the content of those lines into an outputfile at the given line numbers
including double quotes and blanks
without the line number part and the colon
The command I use successfully with the shell is e.g.:
sed -i '3istring1,"this is text of string 1"' outfile
No trouble with quotes, double quotes and blanks there.
Using the bash script
while read line
do
    linenum=$(echo $line | cut -f1 -d:)
    linestr=$(echo $line | cut -f2 -d:)
    sedcmd="sed -i '"
    sedcmd=${sedcmd}${linenum}
    sedcmd=${sedcmd}i
    sedcmd=${sedcmd}${linestr}
    sedcmd=${sedcmd}"' outfile"
    echo "---> $sedcmd"
    $sedcmd
done < script/new_records.txt
shows exactly the same sed command with echo but returns with:
sed: -e expression #1, char 1: unknown command: `''
Apparently executing the sed command from within a bash script is different from executing it directly in the bash shell.
I tried a variety of escape sequences ("\" before quotes, double quotes and blanks), but rather randomly, and none of them was successful.
What do I have to do in order to write the string including blanks and double quotes to a specified line in a text file?
# Assuming OutFile exists and has enough lines
while read ThisLine
do
    LineNum=$(echo "${ThisLine}" | cut -f1 -d ":" )
    echo "${ThisLine##*:}" > /tmp/LineContent.txt
    sed -i -n "${LineNum} !{p;b;};r /tmp/LineContent.txt" OutFile
done < script/new_records.txt
Not the best solution, because it makes a lot of assumptions that could break: that the output file has enough lines, that reading each line causes no problems (what about escaped characters in the quoted string?), and so on.
Okay, I'll give it a shot. If I understand what you're trying to do correctly, and if you're certain the input file is not malformed, then
sed -i -f <(sed 's/:/i/' insertions.txt) datafile.txt
is the most straightforward way. This works because with an input specification of
number:text
all one has to do is replace the : with an i to get a sed command that says: "When handling line number, insert text". The <() bit is bash-style process substitution, which expands to the name of a FIFO from which the output of the command can be read.
It might be prudent to guard against mistakes by saying something like
sed -i -f <(sed '/^[0-9]\+:/!d; s/:/i/' insertions.txt) datafile.txt
This removes all lines from insertions.txt that don't begin with a number followed by a colon because those are obviously broken.
Note that this all-in-one-go approach treats line numbers as they were in the input file. That is to say, given an insertions file with content
2:foo,"bar "
4:baz,"qux "
baz,"qux " will appear in line 5 of the output (before line 4 of the input). If this is not desired, sed will have to be called multiple times to handle each insertion individually, as in
while read insertion; do
    sed -i "${insertion/:/i}" datafile.txt
done < insertions.txt
${insertion/:/i} is another bashism that replaces the first : in a shell variable with i and expands to the result, i.e., if insertion=1:2:3, then ${insertion/:/i} is 1i2:3.
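If the quoting keeps fighting you, a small Python sketch doing the same insertions one at a time (file names taken from the question) may be easier to reason about:
# splice each "number:text" entry into outfile, one insertion at a time,
# so line numbers behave like the sed loop above
with open('script/new_records.txt') as spec:
    for entry in spec:
        num, _, text = entry.rstrip('\n').partition(':')
        with open('outfile') as f:
            lines = f.readlines()
        lines.insert(int(num) - 1, text + '\n')   # like sed's "NUMi" (insert before)
        with open('outfile', 'w') as f:
            f.writelines(lines)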

How can I get the SOA serial number from a file with sed?

I store my SOA data for multiple domains in a single file that gets $INCLUDEd by zone files. I've written a small sed script that is supposed to get the serial number, increment it, then re-save the SOA file. It all works properly as long as the SOA file is in the proper format, with the entire record on one line, but it fails as soon as the record gets split into multiple lines.
For example, this works as input data:
# IN SOA dnsserver. hostmaster.example.net. ( 2013112202 21600 900 691200 86400 )
But this does not:
# IN SOA dnsserver. hostmaster.example.net. (
2013112202 ; Serial number
21600 ; Refresh every day, 86400 is 1 day
900 ; Retry refresh every 15 min
691200 ; Expire every 8 days
86400 ) ; Minimum TTL 1 day
I like comments, and I would like to spread things out. But I need my script to be able to find the serial number so that I can increment it and rewrite the file.
The SED that works on the single line is this:
SOA=$(sed 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' $SOAfile)
But for multi-line ... I'm a bit lost. I know I can join lines with N, but how do I know if I even need to? Do I need to write separate sed scripts based on some other analysis I do of the original file?
Please help! :-)
I wouldn't use sed for this. While you might be able to brute-force something, it would require a large amount of concentration to come up with it, and it would look like line noise, and so be almost unmaintainable afterwards.
What about this in awk?
The easiest way might be to split your records based on the # character, like so:
SOA=$(awk 'BEGIN{RS="#"} NR==2{print $6}' $SOAfile)
But that will break if you have comments containing # before the uncommented line, or if you have any comments between the # and the serial number. You could make a pipe to avoid these issues...
SOA=$(sed 's/;.*//;/^#/p;1,/^#/d' $SOAfile | awk 'BEGIN{RS="#"} NR==2{print $6}')
It may seem redundant to remove comments and strip the top of the file, but there could be other lines like #include which (however unlikely) could contain your record separator.
Or you could do something like this in pure awk:
SOA=$(awk -v field=6 '/^#/ { if($2=="IN"){field++} for(i=1;i<field;i++){if(i==NF){field=field-NF;getline;i=1}} print $field}' $SOAfile)
Or, broken out for easier reading:
awk -v field=6 '
/^#/ {
    if ($2=="IN") {field++;}
    for (i=1;i<field;i++) {
        if(i==NF) {field=field-NF;getline;i=1;}
    }
    print $field; }' $SOAfile
This is flexible enough to handle any line splitting you might have, as it counts to field along multiple lines. It also adjusts the field number based on whether your zone segment contains the optional "IN" keyword.
A pure-sed solution would, instead of counting fields, use the first string of digits after an open bracket after your /^#/, like this:
SOA=$(sed -n '/^#/,/^[^;]*)/H;${;x;s/.*#[^(]*([^0-9]*//;s/[^0-9].*//;p;}' $SOAfile)
Looks like line noise, right? :-) Broken out for easier reading, it looks like this:
/^#/,/^[^;]*)/H           # "Hold" the meaningful part of the file...
${                        # Once we reach the end...
    x                     # Copy the hold space back to the main buffer
    s/.*#[^(]*([^0-9]*//  # Remove stuff ahead of the serial
    s/[^0-9].*//          # Remove stuff after the serial
    p                     # And print.
}
The idea here is that starting from the first line that begins with #, we copy the file into sed's hold space, then at the end of the file, do some substitutions to strip out all the text up to the serial number, and then after the serial number, and print whatever remains.
All of these work on single line and multi line zone SOA records I've tested with.
You can try the following - it's your original sed program preceded by commands to first read all input lines, if applicable:
SOA=$(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/.*#.*SOA[^0-9]*//;s/[^0-9].*//' \
"$SOAfile")
This form will work with both single- and multi-line input files.
Multi-line input files are first read as a whole before applying the substitutions.
Note: The awkward separate -e options are needed to keep FreeBSD happy with respect to labels and branching commands, which need a literal \n for termination - using separate -e options is a more readable alternative to splicing in literal newlines with $'\n'.
Alternative solution, using awk:
SOA=$(awk -v RS='#' '$1 == "IN" && $2 == "SOA" { print $6 }' "$SOAfile")
Again, this will work with both single- and multi-line record definitions.
The only constraint is that comments must not precede the serial number.
Additionally, if a file contained multiple records, the above would collect ALL serial numbers, separated by a newline each.
Why sed? grep is simplest in this case:
grep -A1 -e '#.*SOA' 1 | grep -oe '[0-9]*'
or: (maybe better):
grep -A1 -e '#.*SOA' 1 | grep 'Serial number' | grep -oe '[0-9]*'
This might work for you (GNU sed):
sed -nr '/# IN SOA/{/[0-9]/!N;s/[^0-9]+([0-9]+).*/\1/p}' file
For lines that contain # IN SOA: if the line contains no numbers, append the next line. Then extract the first sequence of digits from the line(s).
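If you ever want to sidestep sed and awk entirely, a rough Python sketch of the same extraction (the file name soa.db is only a placeholder) could be:
import re

with open('soa.db') as f:
    # drop ";" comments and flatten the record onto one line
    text = ' '.join(line.split(';')[0].strip() for line in f)

# the serial is the first run of digits after "SOA ... ("
m = re.search(r'SOA.*?\(\s*(\d+)', text)
if m:
    print(m.group(1))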

How to insert a new line character after a fixed number of characters in a file

I am looking for a bash or sed script (preferably a one-liner) with which I can insert a new line character after a fixed number of characters in huge text file.
How about something like this? Here 20 is the number of characters before the newline, and temp.txt is the file to operate on:
sed -e "s/.\{20\}/&\n/g" < temp.txt
Let N be a shell variable representing the count of characters after which you want a newline. If you want to continue the count across lines:
perl -0xff -pe 's/(.{'$N'})/$1\n/sg' input
If you want to restart the count for each line, omit the -0xff argument.
Because I can't comment directly (too little reputation), here is an additional hint on the answers above:
I prefer the sed command (it does exactly what I want), and I also tested the POSIX command fold. But there is a small difference between the two commands for the original problem:
If you have a flat file of n fixed-width records (without any linefeed characters) and use the sed command (with the record width as the number, 20 in @Kristian's answer), you get n lines when counting with wc. If you use the fold command, you only get n-1 lines with wc!
This difference is sometimes important to know: if your input file doesn't contain any newline character, you get one after the last record with sed and none with fold.
If you mean you want to insert your newline after a number of characters counted across the whole file, e.g. after every 30th character in the whole file:
gawk 'BEGIN{ FS=""; ch=30 }
{
    for(i=1;i<=NF;i++){
        c+=1
        printf "%s", $i
        if (c==ch){
            print ""
            c=0
        }
    }
    print ""
}' file
If you mean you want to insert the newline at a specific position in each line, e.g. after the 5th character of each line:
gawk 'BEGIN{ FS=""; ch=5 }
{
    print substr($0,1,ch) "\n" substr($0,ch+1)
}' file
Append an empty line after a line with exactly 42 characters
sed -ie '/^.\{42\}$/a\
' huge_text_file
This might work for you:
echo aaaaaaaaaaaaaaaaaaaax | sed 's/./&\n/20'
aaaaaaaaaaaaaaaaaaaa
x
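And for completeness, the whole-file variant is only a few lines of Python (20 and temp.txt are just the example values from above):
# insert a newline after every N characters, counting across the whole file
N = 20
with open('temp.txt') as f:
    data = f.read().replace('\n', '')
print('\n'.join(data[i:i + N] for i in range(0, len(data), N)))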
