Command to remove all but select columns for each file in unix directory - bash

I have a directory with many files in it and want to edit each file to only contain a select few columns.
I have the following code which will only print the first column
for i in /directory_path/*.txt; do awk -F "\t" '{ print $1 }' "$i"; done
but if I try to edit each file by adding >'$I' as below then I lose all the information in my files
for i in /directory_path/*.txt; do awk -F "\t" '{ print $1 }' "$i" > "$i"; done
However I want to be able to remove all but a select few columns in each file for example 1 and 3.

Given:
cat file
1 2 3
4 5 6
You can do in place editing with sed:
sed -i.bak -E 's/^([^[:space:]]*).*/\1/' file
cat file
1
4
If you want freedom to work with multiple columns and have in place editing, use GNU awk that also supports in place editing:
gawk -i inplace '{print $1, $3}' file
cat file
1 3
4 6
If you only have POSIX awk or wanted to use cut you generally do this:
Modify the file with awk, cut, sed, etc
Redirect the output to a temp file
Rename the temp file back to the original file name.
Like so:
awk '{print $1, $3}' file >tmp_file; mv tmp_file file
Or with cut:
cut -d ' ' -f 1,3 file >tmp_file; mv tmp_file file
To do a loop on files in a directory, you would do:
for fn in /directory_path/*.txt; do
awk -F '\t' '{ print $1 }' "$fn" >tmp_file
mv tmp_file "$fn"
done

Just to add a little more to #dawg's perfectly well working answer according to my use case.
I was dealing with CSVs, and standard CSV can have , in some values as long as it's in double quotes like for example, the below-mentioned row will be a valid CSV row.
col1,col2,col2
1,abc,"abc, inc"
But the command above was treating the , between the double quotes as delimiter too.
Also, the output file delimiter wasn't specified in the command.
These are the modifications I had to make for it handle the above two problems:
for fn in /home/ubuntu/dir/*.csv; do
awk -F ',' '{ FPAT = "([^,]*)|(\"[^\"]+\")"; OFS=","; print $1,$2 }' "$fn" >tmp_file
mv tmp_file "$fn"
done
The OSF delimiter will be the diameter of the output/result file.
The FPAT handles the case of , between quotation mark.
The regex and the information for that is mentioned ins awk's official documentation in section 4.7 Defining Fields by Content.
I was led to that solution through this answer.

Related

AWK remove blank lines and append empty columns to all csv files in the directory

Hi I am looking for a way to combine all the below commands together.
Remove blank lines in the csv file (comma delimited)
Add multiple empty columns to each line up to 100th column
Perform action 1 & 2 on all the files in the folder
I am still learning and this is the best I could get:
awk '!/^[[:space:]]*$/' x.csv > tmp && mv tmp x.csv
awk -F"," '($100="")1' OFS="," x.csv > tmp && mv tmp x.csv
They work out individually but I don't know how how to put them together and I am looking for ways to have it run through all the files under the directory.
Looking for concrete AWK code or shell script calling AWK.
Thank you!
An example input would be:
a,b,c
x,y,z
Expected output would be:
a,b,c,,,,,,,,,,
x,y,z,,,,,,,,,,
you can combine in one script without any loops
$ awk 'BEGIN{FS=OFS=","} FNR==1{close(f); f=FILENAME".updated"} NF{$100=""; print > f}' files...
it won't overwrite the original files.
You can pipe the output of the first to the other:
awk '!/^[[:space:]]*$/' x.csv | awk -F"," '($100="")1' OFS="," > new_x.csv
If you wanted to run the above on all the files in your directory, you would do:
shopt -s nullglob
for f in yourdirectory/*.csv; do
awk '!/^[[:space:]]*$/' "${f}" | awk -F"," '($100="")1' OFS="," > new_"${f}"
done
The shopt -s nullglob is so that an empty directory won't give you a literal *. Quoted from a good source for about looping through files
With recent enough GNU awk you could:
$ gawk -i inplace 'BEGIN{FS=OFS=","}/\S/{NF=100;$1=$1;print}' *
Explained:
$ gawk -i inplace ' # using GNU awk and in-place file editing
BEGIN {
FS=OFS="," # set delimiters to a comma
}
/\S/ { # gawk specific regex operator that matches any character that is not a space
NF=100 # set the field count to 100 which truncates fields above it
$1=$1 # edit the first field to rebuild the record to actually get the extra commas
print # output records
}' *
Some test data (the first empty record is empty, the second empty record has a space and a tab, trust me bro):
$ cat file
1,2,3
1,2,3,4,5,6,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
Output of cat file after the execution of the GNU awk program:
1,2,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100

Duplicate first column of multiple text files in bash

I have multiple text files each containing two columns and I would like to duplicate the first column in each file in bash to have three columns in the end.
File:
sP100227 1
sP100267 1
sP100291 1
sP100493 1
Output file:
sP100227 sP100227 1
sP100267 sP100267 1
sP100291 sP100291 1
sP100493 sP100493 1
I tried:
txt=path/to/*.txt
echo "$(paste <(cut -f1-2 $txt) > "$txt"
Could you please try following. Written and tested with shown samples in GNU awk. This will add fields to only those lines which have 2 fields in it.
awk 'NF==2{$1=$1 OFS $1} 1' Input_file
In case you don't care of number of fields and simply want to have value of 1st field 2 times then try following.
awk '{$1=$1 OFS $1} 1' Input_file
OR if you only have 2 fields in your Input_file then we need not to rewrite the complete line we could simply print them as follows.
awk '{print $1,$1,$2}' Input_file
To save output into same Input_file itself append > temp && mv temp Input_file for above solutions(after testing).
Use a temp file, with cut -f1 and paste, like so:
paste <(cut -f1 in_file) in_file > tmp_file
mv tmp_file in_file
Alternatively, use a Perl one-liner, like so:
perl -i.bak -lane 'print join "\t", $F[0], $_;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
The default delimiter in cut and paste is TAB, but your file looks to be space-separated.
You can't use the same file as input and output redirection, because when the shell opens the file for output it truncates it, so there's nothing for the program to read. Write to a new file and then rename it.
Your paste command is only being given one input file. And there's no need to use echo.
paste -d' ' <(cut -d' ' -f1 "$txt") "$txt" > "$txt.new" && mv "$txt.new" "$txt"
You can do this more easily using awk.
awk '{print $1, $0}' "$txt" > "$txt.new" && mv "$txt.new" "$txt"
GNU awk has an in-place extension, so you can use that if you like. See Save modifications in place with awk
Try sed -Ei 's/\s*(\S+)\s+/\1 \1 /1' $txt if your fields are separated by strings of one or more whitespace characters. This used the Stream Editor (sed) replaces (s///1) the first string of non-space characters (\S+) followed by a string of whitespace characters (\s+) with the same thing repeated with intervening spaces(\1 \1 ). It keeps the rest of the line. The -E to sed means use extended pattern matching (+, ( vs. \(). The -i means do it in-place, replacing the file with the output.
You could use awk and do awk '{ printf "%s %s\n",$1,$0 }'. This takes the first whitespace-delimited field ($1) and follows it with a space and the whole line ($0) followed by a newline. This is a little clearer than sed but it doesn't have the advantage of being in-place.
If you can guarantee they are delimited by only one space, with no leading spaces, you can use paste -d' ' <(cut -d' ' -f1 ${txt}) ${txt} > ${txt}.new; mv ${txt}.new ${txt}. The -d' ' sets the delimiter to space for both cut and paste. You know this but for others -f1 means extract the first -d-delimited field. The mv command replaces the input with the output.

How to capture first column values of a command?

I am new to shell scripting. I am trying to write a script that is suppose to run a command and use for loop to capture first column of the output and do further processing.
command: tst get files
output of this command is something like
NAME COUNT ADMIN
FileA.txt 30 adminA
FileB.txt 21 local
FileC.txt 9 local
FileD.txt 90 adminA
Here is what I have tried so far : UPDATED also want to run additional commands
#!/bin/bash
for f in $(tst get files)
do
echo "FILE :[${f}]"
tst setprimary ${f} && tst get dataload
done
the output I am seeing is something like
FILE :[NAME]
FILE :[COUNT]
FILE :[ADMIN]
FILE :[FileA.txt]
FILE :[30]
FILE :[adminA]
FILE :[FileB.txt]
FILE :[21]
FILE :[local]
FILE :[FileC.txt]
FILE :[9]
FILE :[local]
FILE :[FileD.txt]
FILE :[90]
FILE :[adminA]
I am looking for an output something like
FILE :[FileA.txt]
FILE :[FileB.txt]
FILE :[FileC.txt]
FILE :[FileD.txt]
What should I modify in the shell script to only capture NAME column values? Am I executing the tst get files command correctly in the for loop or is there a better way to execute a command and loop thru the results?
EDIT (Samuel Kirschner): you can do without the for loop entirely and just use awk to print the lines you're interested in
tst get files | awk 'NR > 1 {print "FILE :[" $1 "]"}'
If you want to keep the for loop for some reason and just extract the file name from the lines while skipping the header, you have a few choices. Awk is probably the easiest because of the NR builtin variable (which counts lines) and automatic field-splitting ($1 refers to the first field in the line, for instance), but you can use sed and cut as well.
You can use awk 'NR > 1 {print $1}' to get the first column (using any whitespace character as a delimiter while skipping the first line) or sed 1d | cut -d$'\t' -f1. Note that $'\t' is bash-specific syntax for a literal tab character, if your file is padded with spaces rather than using tabs to delimit fields, you can't use the sed ... | cut ... example.
i.e.
#!/bin/bash
for f in $(tst get files | awk 'NR > 1 {print $1}')
do
echo "FILE :[${f}]"
done
or
#!/bin/bash
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
echo "FILE :[${f}]"
done
to avoid unnecessary splitting in the for loop. It's best to set IFS to something specific outside the loop body to prevent 'a file with whitespace.txt' from being broken up.
OLD_IFS=IFS
IFS=$'\n\t'
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
echo "FILE :[${f}]"
done
You can just do:
tst get files | awk 'NR > 1 { printf "FILE :[%s]\n", $1 }'
Update: To answer extended problem as per comments below by OP:
while read -r file _; do
tst setprimary "$file" && tst get dataload
done < <(tst get files)
Or perl:
tst ... | perl -lanE 'say "File: [$F[0]]" if $.>1'
the variable $. contains the current line number

Shell script copying all columns of text file instead of specified ones

I trying to copy 3 columns from one text file and paste them into a new text file. However, whenever I execute this script, all of the columns in the original text file get copied. Here is the code I used:
cut -f 1,2,6 PROFILES.1.0.profile > compiledfile.txt
paste compiledfile.txt > myNewFile
Any suggestions as to what I'm doing wrong? Also, is there a simpler way to do this? Thanks!
Let's suppose that the input is comma-separated:
$ cat File
1,2,3,4,5,6,7
a,b,c,d,e,f,g
We can extract columns 1, 2, and 6 using cut:
$ cut -d, -f 1,2,6 File
1,2,6
a,b,f
Note the use of option -d, to specify that the column separator is a comma.
By default, cut uses a tab as the column separator. If the separator in your file is anything else, you must use the -d option.
Using awk
awk -vFS=your_delimiter_here -vOFS=your_delimiter_here 'print $1,$2,$6' PROFILES.1.0.profile > compiledfile.txt
should do it.
For comma separated fields the solution would be
awk -vFS=, -vOFS=, '{print $1,$2,$6}' PROFILES.1.0.profile > compiledfile.txt
FS is an awk builtin variable which stands for field-separator.
Similarly OFS stands for output-field-separator.
And the handy -v option with awk helps you assign a value to variable.
You could use awk to this.
awk -F "delimiter" '
{
print $1,$2 ,$3 #Where $1,$2 and so are column numbers
}' filename > newfile

How to increment number in a file

I have one file with the date like below,let say file name is file1.txt:
2013-12-29,1
Here I have to increment the number by 1, so it should be 1+1=2 like..
2013-12-29,2
I tried to use 'sed' to replace and must be with variables only.
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i 's\$oldnum\$newnum\g' file1.txt
But I get an error from sed syntax, is there any way for this. Thanks in advance.
Sed needs forward slashes, not back slashes. There are multiple interesting issues with your use of '\'s actually, but the quick fix should be (use double quotes too, as you see below):
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i "s/$oldnum\$/$newnum/g" file1.txt
However, I question whether sed is really the right tool for the job in this case. A more complete single tool ranging from awk to perl to python might work better in the long run.
Note that I used a $ end-of-line match to ensure you didn't replace 2012 with 2022, which I don't think you wanted.
usually I would like to use awk to do jobs like this
following is the code might work
awk -F',' '{printf("%s\t%d\n",$1,$2+1)}' file1.txt
Here is how to do it with awk
awk -F, '{$2=$2+1}1' OFS=, file1.txt
2013-12-29,2
or more simply (this will file if value is -1)
awk -F, '$2=$2+1' OFS=, file1.txt
To make a change to the change to the file, save it somewhere else (tmp in the example below) and then move it back to the original name:
awk -F, '{$2=$2+1}1' OFS=, file1.txt >tmp && mv tmp file1.txt
Or using GNU awk, you can do this to skip temp file:
awk -i include -F, '{$2=$2+1}1' OFS=, file1.txt
Another, single line, way would be
expr cat /tmp/file 2>/dev/null + 1 >/tmp/file
this works if the file doesn't exist or if the file doesnt contain a valid number - in both cases the file is (re)created with a value of 1
awk is the best for your problem, but you can also do the calculation in shell
In case you have more than one rows, I am using loop here
#!/bin/bash
IFS=,
while read DATE NUM
do
echo $DATE,$((NUM+1))
done < file1.txt
Bash one liner option with BC. Sample:
$ echo 3 > test
$ echo 1 + $(<test) | bc > test
$ cat test
4
Also works:
bc <<< "1 + $(<test)" > test

Resources