Add a column to any position in a file in unix [using awk or sed] - shell

I'm looking for alternatives / a more intelligent one-liner for the following command, which should add a value at a requested column number.
The sed command below works properly for adding the value 4 as the 4th column.
[Need: I have files containing 1000 records, and I often need to insert a column at an arbitrary position.]
My approach is suitable for small cases only, since it hard-codes the neighbouring values.
cat 1.txt
1|2|3|5
1|2|3|5
1|2|3|5
1|2|3|5
sed -i 's/1|2|3|/1|2|3|4|/g' 1.txt
cat 1.txt
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
Thanks in advance.

Field Separators
http://www.gnu.org/software/gawk/manual/html_node/Field-Separators.html
String Concatenation
http://www.gnu.org/software/gawk/manual/html_node/Concatenation.html
Default pattern and action
http://www.gnu.org/software/gawk/manual/html_node/Very-Simple.html
awk -v FS='|' -v OFS='|' '{$3=$3"|"4} 1' 1.txt
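The same one-liner parameterized, so the position and value are not hard-coded (col and val are illustrative variable names, not part of the original answer):
awk -v FS='|' -v OFS='|' -v col=3 -v val=4 '{$col = $col OFS val} 1' 1.txt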

One way using awk. Pass two arguments to the script: the column number and the value to insert. The script increments the number of fields (NF), shifts fields down from the last one to the indicated position, and inserts the new value there.
Run this command:
awk -v column=4 -v value="four" '
BEGIN {
    FS = OFS = "|";
}
{
    for ( i = NF + 1; i > column; i-- ) {
        $i = $(i-1);
    }
    $i = value;
    print $0;
}
' 1.txt
With the following output:
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5

One way using coreutils and process substitution:
f=1.txt
paste -d'|' \
  <(cut -d'|' -f1-3 "$f") \
  <(yes 4 | head -n "$(wc -l < "$f")") \
  <(cut -d'|' -f4- "$f")
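A variation for the middle column that avoids counting lines: let sed turn every input line into the constant value, so the generated column has the right number of rows by construction (a sketch of the same approach):
paste -d'|' \
  <(cut -d'|' -f1-3 "$f") \
  <(sed 's/.*/4/' "$f") \
  <(cut -d'|' -f4- "$f")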

Another way, using sed:
sed 's/3|/3|4|/' 1.txt
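A more general sed for the same job keys on field position instead of field values; here \{3\} is the number of fields to skip and 4| is the inserted column (a sketch, assuming GNU sed):
sed 's/^\(\([^|]*|\)\{3\}\)/\14|/' 1.txt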

Related

Bash - issue with deleting rows containing null values

I have a .csv file which needs to be modified in the following way: for each column in the file, check if that column contains any null entries. If it does, it gets removed from the file. Otherwise, that column stays. I attempted to solve this problem using the following script:
cp file-original.csv file-tmp.csv

for (( i=1; i<=65; i++ )); do
    for var in $(cut -d, -f$i file-tmp.csv); do
        if [ -n $var ]; then
            continue
        else
            cut -d, --complement -f$i file-tmp.csv > file-tmp.csv
            break
        fi
    done
done
I'm assuming that the issue lies in saving the result of each iteration to a file which is also being iterated over (file-tmp.csv). However, I'm not sure on how to circumvent this.
You have to use a temp file as in
cut -d, --complement -f$i file-tmp.csv > tmp.csv && mv tmp.csv file-tmp.csv
for var in $(cut -d, -f$i file-tmp.csv) is buggy: you won't be able to detect an empty line like this, because word splitting will just skip over it.
You could avoid all the file copies in the first place by keeping track of the columns you want to drop, and then drop them all in one go:
for i in {1..65}; do
    if grep -q '^$' <(cut -d, -f "$i" file-original.csv); then
        drop+=("$i")
    fi
done

cut -d, --complement -f "$(IFS=,; echo "${drop[*]}")" file-original.csv \
    > file-tmp.csv
This uses grep to see if a column contains an empty line, avoiding the slow loop and the word splitting bug.
After the for loop, the drop array contains all the column numbers we want to drop, and $(IFS=,; echo "${drop[*]}") prints them as a comma separated list.
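The subshell-plus-IFS idiom is worth knowing on its own; the array contents here are just an example:
drop=(2 5 7)
(IFS=,; echo "${drop[*]}")   # prints 2,5,7 without changing IFS in the parent shell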
$ cat foo.csv
a,,c,d
a,b,,d
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    # first pass: remember every column that has an empty field anywhere
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if ($inFldNr ~ /^$/) {
            skip[inFldNr]
        }
    }
    next
}
FNR==1 {
    # second pass, first line: map output field numbers to the input fields we keep
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if ( !(inFldNr in skip) ) {
            out2in[++numOutFlds] = inFldNr
        }
    }
}
{
    # second pass: print the kept fields in order
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2in[outFldNr]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk foo.csv foo.csv
a,d
a,d
Looking at your question, I found a very simple answer using only the grep command and a temporary output file.
Assume your CSV file is called test.csv. The following creates a file test1.csv in which all lines containing a null value have been eliminated:
grep -v null test.csv > test1.csv
The -v option inverts grep's matching, printing only the lines that do not contain null. The output can be redirected to another file, which can then replace the original test.csv.
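Note that this works row-wise, matching the literal string null anywhere in the line; a quick demo on invented data:
$ cat test.csv
1,foo,3
1,null,3
$ grep -v null test.csv > test1.csv
$ cat test1.csv
1,foo,3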

How to take multiple argument in bash and pass them to awk?

I am writing a function that removes leading/trailing space from a column and, if the column has no value left, replaces it with NULL.
The function works fine for one column, but how can I modify it for multiple columns?
Function :
#cat trimfunction
#!/bin/bash
function trim
{
    vCol=$1   ### input column number
    vFile=$2  ### input file name
    var3=/home/vipin/temp   ### temp file
    awk -v col="${vCol}" -f /home/vipin/colf.awk ${vFile} > $var3   ### operation
    mv -f $var3 $vFile   ### forcefully mv
}
AWK script :
#cat colf.awk
#!/bin/awk -f
BEGIN { FS = OFS = "|" }
{
    gsub(/^[ \t]+|[ \t]+$/, "", $col)   ### trim leading/trailing space from the requested column
}
{
    if ($col == "") print $1, "NULL", $3; else print $0   ### print NULL if the column is now empty
}
Input file : leading/trailing/white space in 2nd column
#cat filename.txt
1| 2016-01|00000321|12
2|2016-02 |000000432|13
3|2017-03 |000004312|54
4| |000005|32
5|2017-05|00000543|12
Script :
#cat script.sh
. /home/vipin/trimfunction
trim 2 filename.txt
Output file : leading/trailing/white space removed in 2nd column
#./script.sh
#cat filename.txt
1|2016-01|00000321|12
2|2016-02|000000432|13
3|2017-03|000004312|54
4|NULL|000005
5|2017-05|00000543|12
If the input file is like below (white/leading/trailing space in the 2nd and 5th columns):
1|2016-01|00000321|12|2016-01 |00000
2|2016-02 |000000432|13| 2016-01|00000
3| 2017-03|000004312|54| |00000
4| |000005|2016-02|0000
5|2017-05 |00000543|12|2016-02 |0000
How to achieve the below output (all leading/trailing space trimmed and whitespace-only fields replaced with NULL in the 2nd and 5th columns), called something like trim 2 5 filename.txt ### passing two column numbers as input
1|2016-01|00000321|12|2016-01|00000
2|2016-02|000000432|13|2016-01|00000
3|2017-03|000004312|54|NULL|00000
4|NULL|000005|2016-02|0000
5|2017-05|00000543|12|2016-02|0000
This will do what you said you wanted:
$ cat tst.sh
file="${!#}"                # the last argument is the file name
cols=( "$@" )               # the remaining arguments are the column numbers
unset cols[$(( $# - 1 ))]   # drop the file name from the column list
awk -v cols="${cols[*]}" '
BEGIN {
    split(cols,c)
    FS=OFS="|"
}
{
    for (i in c) {
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",$(c[i]))
        sub(/^$/,"NULL",$(c[i]))
    }
    print
}' "$file"
$ ./tst.sh 2 5 file
1|2016-01|00000321|12|2016-01|00000
2|2016-02|000000432|13|2016-01|00000
3|2017-03|000004312|54|NULL|00000
4|NULL|000005|2016-02|0000
5|2017-05|00000543|12|2016-02|0000
but if what you REALLY wanted was to operate on ALL fields instead of specific ones then of course there's a simpler solution.
Never do cmd file > tmp; mv tmp file, by the way; always do cmd file > tmp && mv tmp file instead (note the &&) so you only overwrite your original file if the command succeeded. Also, always quote your shell variables unless you have a very specific purpose in mind by not doing so and fully understand all of the implications: use "$file", not $file. Google it.
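For example (cut stands in for any command here):
cut -d, -f2 file > tmp; mv tmp file     # if cut fails, file is clobbered anyway
cut -d, -f2 file > tmp && mv tmp file   # file is only replaced on success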
You can pass a list of columns to modify as a parameter. Create these files:
$ cat trim.awk
BEGIN {
    split(c, a)
    FS = OFS = "|"
}
{
    for (i in a) {
        i = a[i]
        gsub(/^[ \t]+|[ \t]+$/, "", $i)
        if (!length($i)) $i = "NULL"
    }
    print
}
and
$ cat filename.txt
1|2016-01|00000321|12|2016-01 |00000
2|2016-02 |000000432|13| 2016-01|00000
3| 2017-03|000004312|54| |00000
4| |000005|2016-02|0000
5|2017-05 |00000543|12|2016-02 |0000
Usage:
awk -v c="2 5" -f trim.awk filename.txt
If managing leading/trailing spaces is all you want to do, you probably don't want all that AWK code.
cat q1.txt | tr -s ' ' | sed 's/|\ |/|NULL|/g' | sed 's/\ //g' should do.
Break-down
tr -s ' ' : Squeeze multiple spaces into one
sed 's/|\ |/|NULL|/g' : Replace all "| |" with "|NULL|"
sed 's/\ //g' : Replace all spaces with empty string.
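One caveat, shown on an invented input line: the final sed strips every space, including spaces inside field values, so this is only safe when fields never contain embedded spaces:
$ echo '1|a b |2' | tr -s ' ' | sed 's/|\ |/|NULL|/g' | sed 's/\ //g'
1|ab|2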

AWK: Compare two CSV files

I have two CSV files and I want to compare them using AWK and generate a new file.
file1.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc123","C:/pro/xyz"
"abc124","C:/pro/in"
file2.csv:
"no","loc"
"abc121","C:/pro/in"
"abc122","C:/pro/abc"
"abc125","C:/pro/xyz"
"abc126","C:/pro/in"
output.csv:
"file1","file2","Diff"
"abc121","abc121","Match"
"abc122","abc122","Match"
"abc123","","Unmatch"
"abc124","","Unmatch"
"","abc125","Unmatch"
"","abc126","Unmatch"
One way with awk:
script.awk:
BEGIN {
    FS = ","
}
NR>1 && NR==FNR {   # first file: remember each id, skipping the header
    a[$1] = $2
    next
}
FNR>1 {             # second file: report Match/Unmatch, and drop ids seen in both
    print ($1 in a) ? $1 FS $1 FS "Match" : "\"\"" FS $1 FS "Unmatch"
    delete a[$1]
}
END {               # whatever is left in a[] appeared only in the first file
    for (x in a) {
        print x FS "\"\"" FS "Unmatch"
    }
}
Output:
$ awk -f script.awk file1.csv file2.csv
"abc121","abc121",Match
"abc122","abc122",Match
"","abc125",Unmatch
"","abc126",Unmatch
"abc124","",Unmatch
"abc123","",Unmatch
I didn't use awk alone, but if I understood the gist of what you're asking correctly, I think this long one-liner should do it...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 file1.csv file2.csv | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e '1d' -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'
Description:
The join portion takes the two CSV files, joins them on the first column (default behavior of join) and outputs all four fields (-o 1.1 2.1 1.2 2.2), making sure to include rows that are unmatched for both files (-a 1 -a 2).
The awk portion takes that output and replaces the combination of the 3rd and 4th columns with either "Match" or "Unmatch", based on whether they do in fact match. I had to make an assumption on this behavior based on your example.
The sed portion deletes the "no","loc" header from the output (-e '1d') and replaces empty fields with open-close quote marks (-e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'). This last part might not be necessary for you.
EDIT:
As tripleee points out, the above fails if the two initial files are unsorted. Here's an updated command to fix that. It punts the header line and sorts each file before passing them to join...
join -t, -a 1 -a 2 -o 1.1 2.1 1.2 2.2 <( sed 1d file1.csv | sort ) <( sed 1d file2.csv | sort ) | awk -F, '{ if ( $3 == $4 ) var = "\"Match\""; else var = "\"Unmatch\"" ; print $1","$2","var }' | sed -e 's/^,/"",/' -e 's/,$/,""/' -e 's/,,/,"",/g'

get Nth line in file after parsing another file

I have a large file like this:
foo:43:sdfasd:daasf
bar:51:werrwr:asdfa
qux:34:werdfs:asdfa
foo:234:dfasdf:dasf
qux:345:dsfasd:erwe
...............
here the 1st column (foo, bar, qux, etc.) gives file names, and the 2nd column (43, 51, 34, etc.) gives line numbers. I want to print the Nth line (specified by the 2nd column) of each file (specified by the 1st column).
How can I automate this in the unix shell?
Actually, the file above is generated while compiling, and I want to print the warning lines in the code.
-Thanks,
while IFS=: read name line rest
do
    head -n "$line" "$name" | tail -n 1
done < input.txt
while IFS=: read file line message; do
    echo "$file:$line - $message:"
    sed -n "${line}p" "$file"
done < yourfilehere
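If the input files are large, it can be worth letting sed quit right after printing so it does not scan the rest of each file; a small variant of the sed line above with the same output:
sed -n "${line}{p;q}" "$file"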
awk 'NR==4 {print}' yourfilename
or
cat yourfilename | awk 'NR==4 {print}'
The above will print the 4th line of your file. You can change the number as per your requirement.
Just in awk, but probably worse performance than the answers by @kev or @MarkReed.
However, it does process each file just once. Requires GNU awk (for asort).
gawk -F: '
BEGIN {OFS=FS}
{
    files[$1] = 1
    lines[$1] = lines[$1] " " $2
    msgs[$1, $2] = $3
}
END {
    for (file in files) {
        split(lines[file], l, " ")
        n = asort(l)
        count = 0
        for (i=1; i<=n; i++) {
            # read forward to the wanted line; count tracks lines consumed so far
            while (count < l[i]) {
                count++
                getline line < file
            }
            print file, l[i], msgs[file, l[i]]
            print line
        }
        close(file)
    }
}
'
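For reference, one way to run it, assuming the listing from the question is saved as input.txt and the program body above as getlines.awk (both file names are assumptions):
gawk -F: -f getlines.awk input.txt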
This might work for you:
sed 's/^\([^:]*\):\([^:]*\).*/sed -n "\2p" \1/' file |
sort -k4,4 |
sed ':a;$!N;s/^\(.*\)\(".*\)\n.*"\(.*\)\2/\1;\3\2/;ta;P;D' |
sh
sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt
qux 34
-n: no output by default
-r: extended regular expressions (simplifies using the parens)
3{...;p}: on line 3, do {...} and print at the end
s: substitute foo:bar:baz:... with foo bar
So to work with the values:
fnUln=$(sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt)
fn=$(echo ${fnUln/ */})
ln=$(echo ${fnUln/* /})
sed -n "${ln}p" "$fn"

finding pattern in a file

I have a txt file of 500 rows and one column.
The column in each row appears somewhat like this (as an example I am pasting two rows):
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB,chr22:49368010-49368760_NM_152247_CPT1B,chr22:49368010-49368760_NM_152253_CHKB
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB
What I want to extract from each row are the values starting with NM_ or NR_
like
row 1 has NR_021492 NM_005198 NM_152247 NM_152253
row 2 has NR_021492 NM_005198
...
in a tab-delimited file.
Any suggestions for a bash command line?
Try:
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g'
Presuming GNU sed.
So
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g' your_file > tab_delimited_file
EDIT: Updated to not leave a trailing tab character on each row.
EDIT 2: Updated again to work for any chr-then-number sequence.
grep "NM" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NM_/'
grep "NR" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NR_/'
sed 's/.*\(NR\)/\1/' file
Use a regular expression to remove everything before the NR.
awk -F '[,:_-]' '{
for (i=1; i<NF; i++)
if ($i == "NR" || $i == "NM")
printf("%s_%s ", $i, $(i+1))
print ""
}'
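The question asks for tab-delimited output; a variant of the same loop that joins the matches on each row with tabs instead of trailing spaces (a sketch; the input file name is assumed):
awk -F '[,:_-]' '{
    sep = ""
    for (i=1; i<NF; i++)
        if ($i == "NR" || $i == "NM") {
            printf("%s%s_%s", sep, $i, $(i+1))
            sep = "\t"
        }
    print ""
}' yourfile.txt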
This will also work, but will print each match on its own line: egrep -o 'N[RM]_[0-9]+'
