Change specific line of text using variable line number in bash

I'm working on a project for a home server I'm running where I need to change the text on a specific line in a configuration file. The line I need to change has a unique attribute name with varying values; however, sometimes new lines are added to the file, so the position of the line I want to change moves from time to time.
To write and test the script, I picked a random config file on my PC, which looks like this:
# cat kvirc4.ini
# KVIrc configuration file
[Main]
LocalKvircDirectory=C:\Users\Aulus\AppData\Roaming\KVIrc4\
SourcesDate=538186288
I wrote a script to change the content on the 3rd line to "changed". If I write the script with the line number specified directly, it works as expected, with the output file showing what was in the source file with that one line changed.
# cat bashscript.sh
#!/bin/bash
awk '{ if (NR == 3) print "different"; else print $0}' kvirc4.ini.bak > output
However, once I add a command to find the line number and save it to a variable, which I then use in awk to specify the line, the script no longer makes the change to that line, leaving the original version of the file. I added the echo you see below to confirm my variable has the correct value, and it does. It seems to me that the problem is likely that my awk command is not correctly reading the line number, causing the if statement to remain false, but I'm still a bit new to bash and awk, so I'm not sure how to fix it. Can anyone offer advice that will help me find a solution?
# cat bashscript.sh
#!/bin/bash
lineno=$(awk '/LocalKvircDirectory/{ print NR; exit }' kvirc4.ini.bak);lineno=$(printf '%b' $lineno)
echo $lineno
# awk 'NR==$lineno {$0="changed"} { print }' kvirc4.ini.bak
awk '{ if (NR == "$lineno") print "different"; else print $0}' kvirc4.ini.bak > output

You are using bash parameter-expansion style ("$lineno") inside the awk code, and awk knows nothing about it.
Your "$lineno" is never expanded by the shell because it is single-quoted.
You have several options:
You can use different quoting to make the "$lineno" work:
awk '{ if (NR == '"$lineno"') print ···
Use awk variable initialization on the command line:
awk -v lineno="$lineno" '{ if (NR==lineno) print ···
Get the value of a bash variable that has been exported (i.e., is in the environment) using awk's ENVIRON array:
export lineno
awk '{ if (NR==ENVIRON["lineno"]) print ···
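For example, a complete version of the second option, as a minimal sketch assuming the same kvirc4.ini.bak input from the question:
#!/bin/bash
# find the line that contains the unique attribute name
lineno=$(awk '/LocalKvircDirectory/{ print NR; exit }' kvirc4.ini.bak)
# pass the shell variable into awk with -v instead of quoting tricks
awk -v lineno="$lineno" '{ if (NR == lineno) print "different"; else print $0 }' kvirc4.ini.bak > output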

Related

How to add an empty line at the end of these commands?

I am in a situation where I have so many fastq files that I want to convert to fasta.
Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^#/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
And the output files are correctly fasta files.
However, I get a problem in the next step. When I merge files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line, I found that a file's header had been put at the end of the previous file's sequence, and it turns out like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta, so that when I merge them the records are stacked correctly and not appended to the end of the previous file's sequence?
So how can I add a blank line to the end of the file within the
scripts that convert fastq to fasta?
Using GNU sed, I would replace
cat $1 >> final.fasta
with
sed '$a\\n' $1 >> final.fasta
Explanation: the sed expression means: at the last line ($), append a newline (\n); this action is undertaken before the default action of printing. If you prefer GNU AWK, then you can get the same behavior in the following way:
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test either solution, as you do not provide enough information for that. I assume the line above is somewhere inside a loop and that $1 is always the name of a file that exists in the current working directory.
If the only thing you need is an extra blank line, and the input files are within 1.5 GB in size, then just do:
awk NF=NF RS='^$' FS='\n' OFS='\n'
This should work for mawk 1/2, gawk, and nawk, and maybe others as well. The reason it works, despite appearing not to do anything special, is that the extra \n comes from ORS.
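If it helps to see the pieces end-to-end, here is a minimal sketch (untested; the file names are placeholders) that performs the fastq-to-fasta conversion from the question and appends the trailing blank line via END { print "" } in the same awk invocation, once per input file:
for fq in *.fastq; do
    # convert fastq to fasta, then add one blank line at the end of this file's output
    awk '{ if (NR%4==1) printf(">%s\n", substr($0,2)); else if (NR%4==2) print }
         END { print "" }' "$fq" >> final.fasta
done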

The for loop overwrites or duplicates entries

Say I have 250 files, from which I need to extract certain information and store it in a text file.
I have tried a for loop in the shell as follows:
text= 'home/path/tothe/textfiles'
for sam in $(find ${text} -name \*_PG.tsv);do
#echo ${sam}
awk '{if($2=="ID") print FILENAME"\t""yes""\t""SAP""\t""LUFTA"}' ${sam}
done >> ${text}/metadata.txt
With the > operator the output text file is overwritten, and with >> the output text file gets duplicate entries.
I would like to know what I should change to get rid of these issues. Thanks for any suggestions!
I think you can do this with a single invocation of awk:
path=home/path/tothe/textfiles
awk -v OFS='\t' '$2 == "ID" {
    print FILENAME, "yes", "SAP", "LUFTA"
}' "$path"/*_PG.tsv > "$path"/metadata.txt
Be careful with your variable assignments: there should be no spaces around the =.
Use the shell to expand the list of files, without find.
Pass the full list of files as arguments to awk, instead of looping over them one by one.
Set the Output Field Separator OFS instead of writing \t to separate your fields.
Redirect the output to the metadata file.
I assume that your awk script is behaving as you expect - I removed the useless if since awk scripts are written like condition { action }. I guess you only want one line of output per file, so you can probably add an exit inside the block to avoid processing the rest of the file.
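One caveat with exit: when all the files are passed to a single awk invocation, exit stops processing every remaining file, not just the current one. A minimal sketch using nextfile instead, assuming GNU awk (where nextfile skips straight to the next input file):
awk -v OFS='\t' '$2 == "ID" {
    print FILENAME, "yes", "SAP", "LUFTA"
    nextfile   # done with this file, move on to the next one
}' "$path"/*_PG.tsv > "$path"/metadata.txt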

Performance issues in shell script

I have a 200 MB tab-separated text file with millions of rows. In this file, I have a column with multiple locations like US, UK, AU, etc.
Now I want to break this file up on the basis of this column. Though this code is working fine for me, I am facing a performance issue: it takes more than an hour to split the file into multiple files based on locations. Here is the code:
#!/bin/bash
read -p "Please enter the file to split " file
read -p "Enter the Col No. to split " col_no
#set -x
header=`head -1 $file`
cnt=1
while IFS= read -r line
do
    if [ $((cnt++)) -eq 1 ]
    then
        echo "$line" >> /dev/null
    else
        loc=`echo "$line" | cut -f "$col_no"`
        f_name=`echo "file_"$loc".txt"`
        if [ -f "$f_name" ]
        then
            echo "$line" >> "$f_name";
        else
            touch "$f_name";
            echo "file $f_name created.."
            echo "$line" >> "$f_name";
            sed -i '1i '"$header"'' "$f_name"
        fi
    fi
done < $file
The logic applied here is that we read the entire file only once and, depending on the location, create the per-location files and append the data to them.
Please suggest necessary improvements in the code to enhance its performance.
The following is sample data, separated by colons instead of tabs. The country code is in the 4th column:
ID1:ID2:ID3:ID4:ID5
100:abcd:TEST1:ZA:CCD
200:abcd:TEST2:US:CCD
300:abcd:TEST3:AR:CCD
400:abcd:TEST4:BE:CCD
500:abcd:TEST5:CA:CCD
600:abcd:TEST6:DK:CCD
312:abcd:TEST65:ZA:CCD
1300:abcd:TEST4153:CA:CCD
There are a couple of things to bear in mind:
Reading files using while read is slow
Creating subshells and executing external processes is slow
This is a job for a text processing tool, such as awk.
I would suggest that you used something like this:
# save first line
NR == 1 {
    header = $0
    next
}
{
    filename = "file_" $col ".txt"
    # if country code has changed
    if (filename != prev) {
        # close the previous file
        close(prev)
        # if we haven't seen this file yet
        if (!(filename in seen)) {
            print header > filename
        }
        seen[filename]
    }
    # print whole line to file
    print >> filename
    prev = filename
}
Run the script using something along the following lines:
awk -v col="$col_no" -f script.awk file
where $col_no is a shell variable containing the column number with the country codes.
If you don't have too many different country codes, you can get away with leaving all the files open, in which case you can remove the call to close(prev).
You can test the script on the sample provided in the question like this:
awk -F: -v col=4 -f script.awk file
Note that I've added -F: to change the input field separator to :.
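With the sample data above, you should end up with one file per country code (file_ZA.txt, file_US.txt, file_AR.txt, file_BE.txt, file_CA.txt and file_DK.txt), each starting with the header line; file_CA.txt, for example, would contain:
ID1:ID2:ID3:ID4:ID5
500:abcd:TEST5:CA:CCD
1300:abcd:TEST4153:CA:CCD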
I think Tom is on the right track, but I'd simplify this a little.
Awk is magical in some ways. One of those ways is that it will keep all its input and output file handles open unless you explicitly close them. So if you create a variable containing an output file name, you can simply redirect to your variable and trust that awk will send the data to the place you've specified and eventually close the output file when it runs out of input to process.
(N.B. an extension of this magic is that in addition to redirects, you can maintain multiple PIPES. Imagine if you were to cmd="gzip -9 > file_"$4".txt.gz"; print | cmd)
The following splits your file without adding a header to each output file.
awk -F: 'NR>1 {out="file_"$4".txt"; print > out}' inp.txt
If adding the header is important, a little more code is required. But not much.
awk -F: 'NR==1{h=$0;next} {out="file_"$4".txt"} !(out in files){print h > out; files[out]} {print > out}' inp.txt
Or, because this one-liner is now a bit long, we can split it out for explanation:
awk -F: '
NR==1 {h=$0;next} # Capture the header
{out="file_"$4".txt"} # Capture the output file
!(out in files){ # If we haven't seen this output file before,
print h > out; # print the header to it,
files[out] # and record the fact that we've seen it.
}
{print > out} # Finally, print our line of input.
' inp.txt
I tested these two scripts successfully on the input data you provided in your question. With this type of solution, there is no need to sort your input data -- your output in each file will be in the order in which that subset's records appeared in your input data.
Note: different versions of awk will permit you to open different numbers of open files. GNU awk (gawk) has a limit in the thousands -- significantly more than the number of countries you might have to deal with. BSD awk version 20121220 (in FreeBSD) appears to run out after 21117 files. BSD awk version 20070501 (in OS X El Capitan) is limited to 17 files.
If you're not confident in your potential number of open files, you can experiment with your version of awk using something like this:
mkdir -p /tmp/i
awk '{o="/tmp/i/file_"NR".txt"; print "hello" > o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
You can also test the number of open pipes:
awk '{o="cat >/dev/null; #"NR; print "hello" | o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
(If you have a /dev/yes or something that just spits out lines of text ad nauseam, that would be better than using /dev/random for input.)
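There is no /dev/yes on typical systems, but the standard yes(1) command will spit out lines forever and can stand in for it; a minimal sketch of the same open-files test using it:
mkdir -p /tmp/i
yes hello | awk '{o="/tmp/i/file_"NR".txt"; print > o; printf "\r%d ",NR > "/dev/stderr"}'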
I haven't previously come across this limit in my own awk programming because when I've needed to create many many output files, I've always used gawk. :-P

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[#]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
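For instance, a minimal sketch of that approach, assuming the original awk command and a scratch directory (csv_out is a made-up name):
mkdir -p csv_out && cd csv_out    # a clean directory, so globbing only sees the newly created files
awk -F'","' 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' ../input_file.csv
for i in *.csv
do
    echo "$i"    # or run your follow-up script on "$i"
done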
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
    header=$0
    next
}
!($7 in files) {
    files[$7]=sprintf("sport-%s.csv", $7)
    print header > files[$7]
}
{
    print > files[$7]
}
END {
    printf("declare -a sportlist=( ")
    for (sport in files) {
        printf("\"%s\" ", sport)
    }
    printf(")\n")
}
The idea here is that we store the filename for each sport in the array files[], keyed by sport name. (You can format the filename inside sprintf() as you see fit.) We step through the file, printing a header line whenever we meet a new sport with no recorded filename. Then for non-header lines, we print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside command substitution, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
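A minimal sketch of that temp-file variant, assuming the awk script above is saved as an executable called sports.awk (a made-up name):
tmpfile=$(mktemp)                       # a real temporary file, as suggested above
./sports.awk inputfile.csv > "$tmpfile"
. "$tmpfile"                            # defines the sportlist array in the current shell
rm -f "$tmpfile"
echo "${sportlist[@]}"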
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray filesArray < <( awk ... )
If the files might have newlines in them, too, then things get tricky...
If your file is not large, you can run another script to get the unique $7 elements, for example:
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values; you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, here for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question, then adjust the awk script to suit (you might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
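Once the array is populated, a minimal usage sketch (the command run on each file is a placeholder):
for f in "${array[@]}"; do
    echo "processing $f"    # replace with whatever you want to run on each file
done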

Meaning of this shell script line with awk

I read this line of script in the book Linux Device Drivers. What does it do?
major=$(awk "\\$2= =\"$module\" {print \\$1}" /proc/devices)
as in context:
#!/bin/sh
module="scull"
device="scull"
mode="664"
# invoke insmod with all arguments we got
# and use a pathname, as newer modutils don't look in . by default
/sbin/insmod ./$module.ko $* || exit 1
# remove stale nodes
rm -f /dev/${device}[0-3]
major=$(awk "\\$2= =\"$module\" {print \\$1}" /proc/devices)
mknod /dev/${device}0 c $major 0
....
A better way to write this would be:
major=$(awk -v mod=$module '$2==mod{print $1}' /proc/devices)
I read this too but that line was not working for me. I had to modify it to
major=$(awk "\$2 == \"$module\" {print \$1}" /proc/devices)
The first part \$2 == \"$module\" is the pattern. When this is satisfied, that is, the second column is equal to "scull", the command print \$1 is executed which prints the first column. This value is stored in the variable major.
The $ signs need to be escaped so that they are passed through to awk as-is.
/proc/devices lists the currently configured character and block devices along with their major numbers.
Expanding a few variables in your context, and fixing the syntax error in the equality, the command looks like this:
awk '$2=="scull" {print $1}' /proc/devices
This means "if the value of the second column is scull, then output the first column."
This command is run via command substitution, $(...), and its output is assigned to the variable major.
The explanation of the purpose is in the book:
The script to load a module that has been assigned a dynamic number can, therefore, be written using a tool such as awk to retrieve information from /proc/devices in order to create the files in /dev.
Note that in the distributed examples, the line in scull_load matches Vivek's correction.
