Bash - issue with deleting columns containing null values

I have a .csv file which needs to be modified in the following way: for each column in the file, check if that column contains any null entries. If it does, it gets removed from the file. Otherwise, that column stays. I attempted to solve this problem using the following script:
cp file-original.csv file-tmp.csv
for (( i=1; i<=65; i++ )); do
    for var in $(cut -d, -f$i file-tmp.csv); do
        if [ -n $var ]; then
            continue
        else
            cut -d, --complement -f$i file-tmp.csv > file-tmp.csv
            break
        fi
    done
done
I'm assuming that the issue lies in saving the result of each iteration to a file which is also being iterated over (file-tmp.csv). However, I'm not sure how to circumvent this.

You have to use a temp file, as in:
cut -d, --complement -f$i file-tmp.csv > tmp.csv && mv tmp.csv file-tmp.csv
Redirecting cut back into the file it is still reading (> file-tmp.csv) truncates that file before cut gets to read it, which is why your version loses data.
for var in $(cut -d, -f$i file-tmp.csv) is buggy too: you won't be able to detect an empty field like this, because word splitting just skips over the empty lines that cut produces for empty fields.
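To see the word-splitting problem in action (a minimal sketch, with printf standing in for the cut output):
$ printf 'a\n\nb\n' | { for var in $(cat); do echo "[$var]"; done; }
[a]
[b]
The empty line never reaches the loop body, so the [ -n ] test never gets a chance to notice it.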
You could avoid all the file copies in the first place by keeping track of the columns you want to drop, and then drop them all in one go:
drop=()
for i in {1..65}; do
    if grep -q '^$' <(cut -d, -f "$i" file-original.csv); then
        drop+=("$i")
    fi
done
cut -d, --complement -f "$(IFS=,; echo "${drop[*]}")" file-original.csv \
    > file-tmp.csv
This uses grep to see if a column contains an empty line, avoiding the slow loop and the word splitting bug.
After the for loop, the drop array contains all the column numbers we want to drop, and $(IFS=,; echo "${drop[*]}") prints them as a comma separated list.
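For example, with a hypothetical drop array:
$ drop=(2 5 7)
$ (IFS=,; echo "${drop[*]}")
2,5,7
The subshell keeps the IFS change from leaking into the rest of the script.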

$ cat foo.csv
a,,c,d
a,b,,d
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {                               # first pass over the file
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if ($inFldNr ~ /^$/) {
            skip[inFldNr]               # referencing creates the index: mark this column as having an empty entry
        }
    }
    next
}
FNR==1 {                                # second pass, first line: map output fields to surviving input fields
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if ( !(inFldNr in skip) ) {
            out2in[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2in[outFldNr]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk foo.csv foo.csv
a,d
a,d

Looking at your question, I found a very simple answer, using only the grep command and output to a temporary file.
Assume your CSV file is called test.csv. The following creates a file test1.csv which eliminates all of the lines containing a null value:
grep -v null test.csv > test1.csv
The -v option inverts grep's matching, printing only the lines that do not contain the literal string null. (Note this removes whole rows containing "null", which is different from removing columns with empty entries.) The output can be redirected to another file, and then you can replace the original test.csv.
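To replace the original in place, the same temp-file pattern from above applies (a sketch; the && ensures test.csv is only overwritten if grep succeeded):
grep -v null test.csv > test1.csv && mv test1.csv test.csv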

Related

Comparing 2 files with a for loop in bash

I am trying to compare the values in 2 files. For each row in Summits3.txt I want to define the value in Column 1 as "Chr" and then find the rows in generef.txt which have my value for "Chr" in column 2.
Then I would like to output some info about that row from generef.txt to out.txt and then repeat until the end.
I am using the following script:
#!/bin/bash
IFS=$'\n'
for i in $(cat Summits3.txt)
do
    Chr=$(echo "$i" | awk '{print $1}')
    awk -v var="$Chr" '{
        if ($2==""'${Chr}'"")
            print $2, $3
    }' generef.txt > out.txt
done
It "works", but it's only comparing values from the last line of Summits3.txt. It seems like it's not looping through the awk bit.
Anyway please help if you can!
I think you might be looking for something like this:
awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt > out.txt
Basically you read column one from the first file into an array (the array index is your Chr value; the value is empty), then for the second file you print only the rows where the second column is in the array's index set. FNR is the row number within the file currently being processed; NR is the row number across all rows processed so far, which is why FNR == NR is only true for the first file. This is a general look-up command I use for pulling out genes or variants from one file that are present in the other.
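For illustration, with hypothetical sample files:
$ cat Summits3.txt
chr1 100 200
chr3 250 300
$ cat generef.txt
geneA chr1 1000
geneB chr2 2000
geneC chr3 3000
$ awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt
chr1 1000
chr3 3000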
In your code above, you would need to append to out.txt (>> out.txt) instead of overwriting it on every iteration. But then you have to make sure to reset out.txt before each run.
Besides using external scripts inside a loop (which is expensive), the first thing we see is that you redirect your output to a file from inside the loop. The output file is recreated each time, so change it to append (>>), or better, move the redirection outside the loop.
When you want to use a loop, try this:
while read -r Chr other; do
    cut -d" " -f2,3 generef.txt | grep -E "^${Chr} "
done < Summits3.txt > out.txt
When you want to avoid the loop (needed for large input files), awk or some combination of commands can be used. This first attempt can produce false matches:
grep -f <(cut -d" " -f1 Summits3.txt) <(cut -d" " -f2,3 generef.txt)
You only want matches of the complete Chr field, anchored at the start of the line and ending at a space (I assume space is the field separator):
grep -f <(cut -d" " -f1 Summits3.txt| sed 's/.*/^& /') <(cut -d" " -f2,3 generef.txt)
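For illustration, if the first column of Summits3.txt held chr1 and chr3, the sed would feed grep this pattern file (each pattern anchored with ^ and ending in a space):
^chr1 
^chr3 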

How to take multiple arguments in bash and pass them to awk?

I am writing a function in which I am replacing the leading/trailing space from a column and, if there is no value in the column, replacing it with NULL.
The function is working fine for one column, but how can I modify it for multiple columns?
Function :
#cat trimfunction
#!/bin/bash
function trim
{
    vCol=$1                  ### input column name
    vFile=$2                 ### input file name
    var3=/home/vipin/temp    ### temp file
    awk -v col="${vCol}" -f /home/vipin/colf.awk ${vFile} > $var3    ### operation
    mv -f $var3 $vFile       ### forcefully mv
}
AWK script :
#cat colf.awk
#!/bin/awk -f
BEGIN { FS=OFS="|" }
{
    gsub(/^[ \t]+|[ \t]+$/, "", $col)    ### trim leading/trailing space from the given column
}
{ if ($col=="") { print $1, "NULL", $3 } else print $0 }    ### replace an emptied value with NULL
Input file : leading/trailing/white space in 2nd column
#cat filename.txt
1| 2016-01|00000321|12
2|2016-02 |000000432|13
3|2017-03 |000004312|54
4| |000005|32
5|2017-05|00000543|12
Script :
#cat script.sh
. /home/vipin/trimfunction
trim 2 filename.txt
Output file : leading/trailing/white space removed in 2nd column
#./script.sh
#cat filename.txt
1|2016-01|00000321|12
2|2016-02|000000432|13
3|2017-03|000004312|54
4|NULL|000005
5|2017-05|00000543|12
If the input file is like below (white/leading/trailing space in the 2nd and 5th columns of the file):
1|2016-01|00000321|12|2016-01 |00000
2|2016-02 |000000432|13| 2016-01|00000
3| 2017-03|000004312|54| |00000
4| |000005|2016-02|0000
5|2017-05 |00000543|12|2016-02 |0000
How to achieve the output below (all leading/trailing space trimmed and whitespace-only values replaced with NULL in the 2nd and 5th columns) with something like trim 2 5 filename.txt, passing two column names as input?
1|2016-01|00000321|12|2016-01|00000
2|2016-02|000000432|13|2016-01|00000
3|2017-03|000004312|54|NULL|00000
4|NULL|000005|2016-02|0000
5|2017-05|00000543|12|2016-02|0000
This will do what you said you wanted:
$ cat tst.sh
file="${!#}"                  # the last argument is the file name
cols=( "$@" )                 # start from all arguments...
unset cols[$(( $# - 1 ))]     # ...and drop the file name, leaving the column numbers
awk -v cols="${cols[*]}" '
BEGIN {
    split(cols,c)
    FS=OFS="|"
}
{
    for (i in c) {
        gsub(/^[[:space:]]+|[[:space:]]+$/,"",$(c[i]))
        sub(/^$/,"NULL",$(c[i]))
    }
    print
}' "$file"
$ ./tst.sh 2 5 file
1|2016-01|00000321|12|2016-01|00000
2|2016-02|000000432|13|2016-01|00000
3|2017-03|000004312|54|NULL|00000
4|NULL|000005|2016-02|0000
5|2017-05|00000543|12|2016-02|0000
but if what you REALLY wanted was to operate on ALL fields instead of specific ones then of course there's a simpler solution.
By the way, never do cmd file > tmp; mv tmp file; always do cmd file > tmp && mv tmp file instead (note the &&) so you only overwrite your original file if the command succeeded. Also, always quote your shell variables unless you have a very specific purpose in mind by not doing so and fully understand all of the implications, so use "$file", not $file. Google it.
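A minimal illustration of the difference (the command is hypothetical):
awk -f script.awk file > tmp; mv tmp file     # if awk fails, a partial tmp still clobbers file
awk -f script.awk file > tmp && mv tmp file   # mv only runs if awk succeeded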
You can pass a list of columns to modify as a parameter. Create these files:
$ cat trim.awk
BEGIN {
    split(c, a)
    FS = OFS = "|"
}
{
    for (i in a) {
        i = a[i]
        gsub(/^[ \t]+|[ \t]+$/, "", $i)
        if (!length($i)) $i = "NULL"
    }
    print
}
and
$ cat filename.txt
1|2016-01|00000321|12|2016-01 |00000
2|2016-02 |000000432|13| 2016-01|00000
3| 2017-03|000004312|54| |00000
4| |000005|2016-02|0000
5|2017-05 |00000543|12|2016-02 |0000
Usage:
awk -v c="2 5" -f trim.awk filename.txt
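With the filename.txt shown above, this prints:
1|2016-01|00000321|12|2016-01|00000
2|2016-02|000000432|13|2016-01|00000
3|2017-03|000004312|54|NULL|00000
4|NULL|000005|2016-02|0000
5|2017-05|00000543|12|2016-02|0000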
If managing leading/trailing spaces is all you want to do, you probably don't need all that awk code.
cat q1.txt | tr -s ' ' | sed 's/|\ |/|NULL|/g' | sed 's/\ //g' should do.
Break-down:
tr -s ' ' : squeeze runs of spaces into one
sed 's/|\ |/|NULL|/g' : replace every "| |" with "|NULL|"
sed 's/\ //g' : remove all remaining spaces
Note that the last step deletes every space on the line, so this is only safe when no field legitimately contains interior spaces.

Shell script: copying columns by header in a csv file to another csv file

I have a csv file which I'll be using as input with a format looking like this:
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
The key attributes of the input file are that each "value" will have a variable number of statistics, but the statistic type and "value" will always be separated by a "-". I then want to output the statistics of all the "values" to separate csv files.
The output would then look something like this:
value1.csv
xvalue,value1-avg,value1-median
1,3,4
value2.csv
xvalue,value2-avg
1,20
I've tried finding solutions to this, but all I can find are ways to copy by the column number, not the header name. I need to be able to use the header names to append the associated statistics to each of the output csv files.
Any help is greatly appreciated!
P.S. the output file may have already been written to during previous runs of this script, meaning the code should append to the output file
Untested but should be close:
awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        print $1 outstr[outfile] >> outfile
}
' inFile.csv
Note that deleting a whole array with delete(outstr) is gawk-specific. With other awks you can use split("",outstr) to get the same effect.
Note that this appends the output you wanted to existing files, BUT that means you'll get the header line repeated on every execution. If that's an issue, tell us how to know when to generate the header line or not, but the solution I THINK you'll want would look something like this:
awk -F, '
NR==1 {
    for (i=2;i<=NF;i++) {
        outfile = $i
        sub(/-.*/,".csv",outfile)
        outfiles[i] = outfile
    }
    for (outfile in outfiles) {
        exists[outfile] = ( ((getline tmp < outfile) > 0) && (tmp != "") )
        close(outfile)
    }
}
{
    delete(outstr)
    for (i=2;i<=NF;i++) {
        outfile = outfiles[i]
        outstr[outfile] = outstr[outfile] FS $i
    }
    for (outfile in outstr)
        if ( (NR > 1) || !exists[outfile] )
            print $1 outstr[outfile] >> outfile
}
' inFile.csv
Just figure out the name associated with each column and use that mapping to manipulate the columns. If you're trying to do this in awk, you can use associative arrays to store the column names and the columns they correspond to; the same trick works in ksh93 or bash. If you're using perl or python or ruby or ... you can...
Or push the column headers into an array to map names to column numbers.
Either way, you then have a list of column headers, which can be manipulated however you need.
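A minimal sketch of that name-to-number mapping in bash 4+ (the file name input.csv is hypothetical; the header values are taken from the sample above):
#!/bin/bash
# map each header name in the first row to its 1-based column number
declare -A colnum
IFS=, read -r -a headers < input.csv
for i in "${!headers[@]}"; do
    colnum["${headers[$i]}"]=$(( i + 1 ))
done
echo "${colnum[value1-avg]}"    # prints 2 for the sample header row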
The solution I have found most useful for this kind of problem is to first retrieve the column numbers using an awk script (encapsulated in a shell function) and then follow with a cut command. This technique turns into a very concise, general and fast solution that can take advantage of co-processing. The non-append case is cleaner, but here is an example that handles the complication of the append you mentioned:
#! /bin/sh
fields() {
    LC_ALL=C awk -F, -v pattern="$1" '{
        j=0; split("", f)
        for (i=1; i<=NF; i++) if ($(i) ~ pattern) f[j++] = i
        if (j) {
            printf("%s", f[0])
            for (i=1; i<j; i++) printf(",%s", f[i])
        }
        exit 0
    }' "$2"
}
cut_fields_with_append() {
    if [ -s "$3" ]
    then
        cut -d, -f `fields "$1" "$2"` "$2" | sed '1 d' >> "$3"
    else
        cut -d, -f `fields "$1" "$2"` "$2" > "$3"
    fi
}
cut_fields_with_append '^[^-]+$|1-' values.csv value1.csv &
cut_fields_with_append '^[^-]+$|2-' values.csv value2.csv &
cut_fields_with_append '^[^-]+$|3-' values.csv value3.csv &
wait
The result is as you would expect:
$ ls
values values.csv
$ cat values.csv
xValue,value1-avg,value1-median,value2-avg,value3-avg,value3-median
1,3,4,20,14,20
$ ./values
$ ls
value1.csv value2.csv value3.csv values values.csv
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
$ ./values
$ cat value1.csv
xValue,value1-avg,value1-median
1,3,4
1,3,4
$ cat value2.csv
xValue,value2-avg
1,20
1,20
$ cat value3.csv
xValue,value3-avg,value3-median
1,14,20
1,14,20
$

How to print a variable value as the last column in a CSV file

I have a list of CSV files, and I have to append a variable value (determined dynamically; it will change) as the last column in each CSV file.
Here is the code:
addProgramtypeID () {
    for csv in $1
    do
        file_name="$csv"
        echo $file_name
        f=`echo $file_name | cut -d '_' -f3 | cut -d '.' -f1`
        echo $f
        k=`grep -i $f Program_type.csv | cut -d ',' -f3`
        echo $k
        awk '{ print $0 "," "'"$k"'" }' "$csv" > tempfile && mv tempfile "$csv"
    done
}
addProgramtypeID "T_H_EDCGO.csv"
addProgramtypeID "T_H_EDCGO.csv"
As of now, the variable value $k is being printed in the 1st column of the CSV file, and it is also removing the first 2 characters of the first column in the file. My requirement is that the variable value should always come as the last column in the CSV file.
Input:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
If, say, $k=2, the output would be:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,2
123,3,334,234,3,2
545,2,444,456,5,2
Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
Assuming there is nothing nasty in your CSV file, you can use awk as follows:
for csv_file in $ALL_MY_FILES
do
    cat "$csv_file" | awk 'BEGIN{FS=","}; {print($(NF))}'
done
Or even just:
cat $ALL_MY_FILES | awk 'BEGIN{FS=","}; {print($(NF))}'
Both of these will print the last column of all the csv files. The results from each CSV are just appended together (is that really what you want?).
The difficulties are on the awk side: this is completely unaware of things like quoted strings or extra whitespace. My recommendation is to try the line above, see what goes wrong (if anything) and then start tweaking.
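For reference, against the question's sample T_H_EDCGO.csv this prints only the existing last column; it does not append anything:
$ awk 'BEGIN{FS=","}; {print($(NF))}' T_H_EDCGO.csv
C_ID
3
5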
It looks like what you want is just:
$ cat tst.sh
addProgramtypeID () {
    csv="$1"
    awk -v csv="$csv" '
        BEGIN { FS=OFS=","; split(csv,csvA,/[_.]/); f=csvA[3] }
        NR==FNR { if ($0 ~ f) { k = $3 }; next }
        { print $0, k }
    ' Program_type.csv "$csv" > tempfile && mv tempfile "$csv"
}
addProgramtypeID "T_H_EDC.csv"
$ cat Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
$ ./tst.sh
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,1
123,3,334,234,3,1
545,2,444,456,5,1
but it's hard to tell, since your posted sample input could not produce your posted desired output, so I had to make some up.
By the way, if ($0 ~ f) should probably just be if ($1 == f); I just copied what your original grep -i $f logic would do.

Add a column to any position in a file in unix [using awk or sed]

I'm looking for alternatives / a more intelligent one-liner for the following command, which should add a value at a requested column position.
The following sed command works properly for adding the value 4 as the 4th column.
[Need: I have files containing 1000 records, and I often need to insert a column in between, at any position.]
My approach is suitable for smaller scale only.
cat 1.txt
1|2|3|5
1|2|3|5
1|2|3|5
1|2|3|5
sed -i 's/1|2|3|/1|2|3|4|/g' 1.txt
cat 1.txt
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
Thanks in advance.
Field Separators
http://www.gnu.org/software/gawk/manual/html_node/Field-Separators.html
String Concatenation
http://www.gnu.org/software/gawk/manual/html_node/Concatenation.html
Default pattern and action
http://www.gnu.org/software/gawk/manual/html_node/Very-Simple.html
awk -v FS='|' -v OFS='|' '{$3=$3"|"4} 1' 1.txt
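With the original 1.txt (rows of 1|2|3|5), this prints:
$ awk -v FS='|' -v OFS='|' '{$3=$3"|"4} 1' 1.txt
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
Assigning to $3 forces awk to rebuild the record with OFS, and the trailing 1 is the shortest always-true pattern, triggering the default print action.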
One way using awk: pass two variables to the script, the column number and the value to insert. The script shifts fields from the last one down to the indicated position, then inserts the new value there.
Run this command:
awk -v column=4 -v value="four" '
    BEGIN {
        FS = OFS = "|";
    }
    {
        for ( i = NF + 1; i > column; i-- ) {
            $i = $(i-1);
        }
        $i = value;
        print $0;
    }
' 1.txt
With following output:
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5
1|2|3|four|5
One way using coreutils and process substitution:
f=1.txt
paste -d'|' \
<(cut -d'|' -f1-3 $f ) \
<(yes 4 | head -n`wc -l < $f`) \
<(cut -d'|' -f4- $f )
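Given the original 1.txt, this produces:
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
1|2|3|4|5
yes 4 | head -n`wc -l < $f` generates one 4 per input line, which paste splices between the two cut streams.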
Or simply with sed, extending the question's own substitution approach:
sed 's/3|/3|4|/' 1.txt
