Pad/Fill missing columns in CSV file (using tabs) - bash

I have some CSV files with TAB as separator. The lines have a variable number of columns and I want to normalize that.
I need exactly, say, 10 columns, so effectively I want to add empty columns up to the 10th column in case a line has fewer.
Also, I would like to loop over all files in a folder and update each file in place, not just print the output or write to a new file.
I can manage to do it with commas like this:
awk -F, '{$10=""}1' OFS=',' file.txt
But when I change it to \t, it breaks and adds too many columns:
awk -F, '{$10=""}1' OFS='\t' file.txt
Any inputs?

If you have GNU awk (sometimes called gawk), this will make sure that you have ten columns, and it won't erase the tenth if it is already there:
awk -F'\t' -v OFS='\t' '{NF=10}1' file >file.tmp && mv file.tmp file
Awk users value brevity, and a further simplification, as suggested by JID, is possible. Since, under awk, the assignment NF=10 evaluates to true, we can set NF to 10 at the same time that we cause the line to be printed:
awk -F'\t' -v OFS='\t' 'NF=10' file >file.tmp && mv file.tmp file
macOS: On a Mac, the default awk is the BSD version, but GNU awk (gawk) can be installed using brew install gawk.
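To loop over a whole folder and update each file in place, as asked in the question, you can wrap the same command in a shell loop (a sketch; /YourFolder and the .csv extension are assumptions):
for f in /YourFolder/*.csv; do
    awk -F'\t' -v OFS='\t' 'NF=10' "$f" >"$f.tmp" && mv "$f.tmp" "$f"
done
With GNU awk 4.1 or later you can skip the temporary file entirely:
gawk -i inplace -F'\t' -v OFS='\t' 'NF=10' /YourFolder/*.csv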

find /YourFolder -name "*.csv" -exec sed -i 's/$/\t\t\t\t\t\t\t\t\t/;s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/' {} \;
The find takes all your CSV files.
The sed:
-i edits in place, avoiding a temporary file
the two s commands add 9 tabs to each line, then keep only the first 10 fields (separated by 9 tabs), as the quick check below shows
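To see what the two s commands do to a single short line, here is a quick check with GNU sed (a sketch; the three-field input line is made up, and cat -A shows tabs as ^I):
$ printf 'a\tb\tc\n' | sed 's/$/\t\t\t\t\t\t\t\t\t/;s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/' | cat -A
a^Ib^Ic^I^I^I^I^I^I^I$
The line now has exactly 10 fields separated by 9 tabs.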
A version that only changes lines that are not already compliant:
find /YourFolder -name "*.csv" -exec sed -i '/^\([^\t]*\t\)\{9\}[^\t]*$/ ! {
s/$/\t\t\t\t\t\t\t\t\t/
s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/
}' {} \;
Auto-adapting the column number:
# change the 2 occurrences of "9" to the desired number of columns minus 1
find /YourFolder -name "*.csv" -exec sed -i ':cycle
/^\([^\t]*\t\)\{9\}[^\t]*$/ ! {
# optimize by putting several \t in the s command on the line below
s/$/\t/
s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/
b cycle
}' {} \;
You can optimize for your case by adding several \t per cycle instead of one (the best count is roughly the average number of missing columns), as in the sketch below.
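For example, if lines are typically missing about three columns, the cycle could append three tabs per pass; the trimming s command still cuts the line back to 10 fields once it overshoots (a sketch based on the command above):
find /YourFolder -name "*.csv" -exec sed -i ':cycle
/^\([^\t]*\t\)\{9\}[^\t]*$/ ! {
s/$/\t\t\t/
s/^\(\([^\t]*\t\)\{9\}[^\t]*\).*/\1/
b cycle
}' {} \;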

Related

Command to remove all but select columns for each file in unix directory

I have a directory with many files in it and want to edit each file to only contain a select few columns.
I have the following code which will only print the first column
for i in /directory_path/*.txt; do awk -F "\t" '{ print $1 }' "$i"; done
but if I try to edit each file by adding > "$i" as below, then I lose all the information in my files
for i in /directory_path/*.txt; do awk -F "\t" '{ print $1 }' "$i" > "$i"; done
However, I want to be able to remove all but a select few columns in each file, for example 1 and 3.
Given:
cat file
1 2 3
4 5 6
You can do in place editing with sed:
sed -i.bak -E 's/^([^[:space:]]*).*/\1/' file
cat file
1
4
If you want the freedom to work with multiple columns and still edit in place, use GNU awk, which supports in-place editing:
gawk -i inplace '{print $1, $3}' file
cat file
1 3
4 6
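Since the files in the question are tab-delimited, the same idea with tab as both the input and output separator would look like this (a sketch):
gawk -i inplace -F '\t' -v OFS='\t' '{print $1, $3}' file
The same command accepts a glob such as /directory_path/*.txt, so no explicit loop is needed.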
If you only have POSIX awk or want to use cut, you generally do this:
Modify the file with awk, cut, sed, etc
Redirect the output to a temp file
Rename the temp file back to the original file name.
Like so:
awk '{print $1, $3}' file >tmp_file; mv tmp_file file
Or with cut:
cut -d ' ' -f 1,3 file >tmp_file; mv tmp_file file
To loop over the files in a directory, you would do:
for fn in /directory_path/*.txt; do
awk -F '\t' '{ print $1 }' "$fn" >tmp_file
mv tmp_file "$fn"
done
Just to add a little more to @dawg's perfectly working answer, based on my use case.
I was dealing with CSVs, and standard CSV can have , inside values as long as the value is in double quotes; for example, the row below is a valid CSV row.
col1,col2,col2
1,abc,"abc, inc"
But the command above was treating the , between the double quotes as a delimiter too.
Also, the output file delimiter wasn't specified in the command.
These are the modifications I had to make for it to handle the above two problems:
for fn in /home/ubuntu/dir/*.csv; do
awk 'BEGIN{ FPAT = "([^,]*)|(\"[^\"]+\")"; OFS = "," } { print $1,$2 }' "$fn" >tmp_file
mv tmp_file "$fn"
done
The OFS value will be the delimiter of the output/result file.
The FPAT handles the case of , between quotation marks.
The regex and further information are given in awk's official documentation, in section 4.7 Defining Fields by Content.
I was led to that solution through this answer.
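As a quick illustration of what FPAT buys you (GNU awk only; the sample row is made up):
$ echo '1,abc,"abc, inc"' | gawk 'BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"; OFS=","} {print $1, $3}'
1,"abc, inc"
The quoted field is kept intact, and the comma inside the quotes is no longer treated as a separator.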

AWK remove blank lines and append empty columns to all csv files in the directory

Hi, I am looking for a way to combine all the commands below.
Remove blank lines in the csv file (comma delimited)
Add multiple empty columns to each line, up to the 100th column
Perform actions 1 & 2 on all the files in the folder
I am still learning and this is the best I could get:
awk '!/^[[:space:]]*$/' x.csv > tmp && mv tmp x.csv
awk -F"," '($100="")1' OFS="," x.csv > tmp && mv tmp x.csv
They work individually, but I don't know how to put them together, and I am looking for a way to run this on all the files under the directory.
Looking for concrete AWK code or shell script calling AWK.
Thank you!
An example input would be:
a,b,c
x,y,z
Expected output would be:
a,b,c,,,,,,,,,,
x,y,z,,,,,,,,,,
You can combine them in one script without any loops:
$ awk 'BEGIN{FS=OFS=","} FNR==1{close(f); f=FILENAME".updated"} NF{$100=""; print > f}' files...
It won't overwrite the original files; the results go to copies with a .updated suffix.
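If you then want to replace the originals with the updated copies, one possible follow-up step is (a sketch; the directory name is an assumption):
for f in yourdirectory/*.csv.updated; do mv -- "$f" "${f%.updated}"; done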
You can pipe the output of the first to the other:
awk '!/^[[:space:]]*$/' x.csv | awk -F"," '($100="")1' OFS="," > new_x.csv
If you wanted to run the above on all the files in your directory, you would do:
shopt -s nullglob
for f in yourdirectory/*.csv; do
awk '!/^[[:space:]]*$/' "${f}" | awk -F"," '($100="")1' OFS="," > "$(dirname "$f")/new_$(basename "$f")"
done
The shopt -s nullglob is so that an empty directory won't give you the literal glob pattern instead of file names, as illustrated below. This comes from a good source about looping through files.
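For illustration, with a hypothetical empty directory named emptydir (a sketch):
$ shopt -s nullglob
$ for f in emptydir/*.csv; do echo "found: $f"; done    # prints nothing
$ shopt -u nullglob
$ for f in emptydir/*.csv; do echo "found: $f"; done
found: emptydir/*.csv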
With recent enough GNU awk you could:
$ gawk -i inplace 'BEGIN{FS=OFS=","}/\S/{NF=100;$1=$1;print}' *
Explained:
$ gawk -i inplace ' # using GNU awk and in-place file editing
BEGIN {
FS=OFS="," # set delimiters to a comma
}
/\S/ { # \S is a gawk-specific regex escape matching any non-whitespace character, so this skips blank lines
NF=100 # set the field count to 100 which truncates fields above it
$1=$1 # edit the first field to rebuild the record to actually get the extra commas
print # output records
}' *
Some test data (besides the three visible rows, the file also contains two blank records that don't show up clearly below: one truly empty and one holding just a space and a tab):
$ cat file
1,2,3
1,2,3,4,5,6,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
Output of cat file after the execution of the GNU awk program:
1,2,3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100

Applying awk pattern to all files with same name, outputting each to a new file

I'm trying to recursively find all files with the same name in a directory, apply an awk pattern to them, and then output a new, updated version of each file to the directory where that file lives.
I thought it was better to use a for loop than xargs, but I don't know exactly how to make this work...
for f in $(find . -name FILENAME.txt );
do awk -F"\(corr\)" '{print $1,$2,$3,$4}' ./FILENAME.txt > ./newFILENAME.txt $f;
done
Ultimately I would like to be able to remove multiple strings from the file at once using -F, but I'm also not sure how to do that using awk.
Also, is there a way to remove "(cor*)" where the * represents a wildcard? I'm not sure how to do that while keeping the escape sequences for the parentheses.
Thanks!
To use (corr*) as a field separator where * is a glob-style wildcard, try:
awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
For example:
$ echo '1(corr)2(corrTwo)3(corrThree)4' | awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}'
1 2 3 4
To apply this command to every file under the current directory named FILENAME.txt, use:
find . -name FILENAME.txt -execdir sh -c 'awk -F'\''[(]corr[^)]*[)]'\'' '\''{print $1,$2,$3,$4}'\'' "$1" > ./newFILENAME.txt' Awk {} \;
Notes
Don't use:
for f in $(find . -name FILENAME.txt ); do
If any file or directory name contains whitespace or other shell-active characters, the results will be an unpleasant surprise; a safer loop is sketched below.
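If you prefer a shell loop over -execdir, a safer bash pattern reads NUL-delimited names from find (a sketch, keeping the newFILENAME.txt output name used above):
find . -name FILENAME.txt -print0 |
while IFS= read -r -d '' f; do
    awk -F'[(]corr[^)]*[)]' '{print $1,$2,$3,$4}' "$f" > "$(dirname "$f")/newFILENAME.txt"
done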
Handling both parens and square brackets as field separators
Consider this test file:
$ cat file.txt
1(corr)2(corrTwo)3[some]4
To eliminate both types of separators and print the first four columns:
$ awk -F'[(]corr[^)]*[)]|[[][^]]*[]]' '{print $1,$2,$3,$4}' file.txt
1 2 3 4

Add filename to output of an xargs and awk command

I have a directory full of .txt files, each of which has two columns and many rows (>10000). For each of these files, I am trying to find the maximum value in the second column, and print the corresponding entry in columns 1 and 2 to an output file. For this, I have a working awk command.
find ./ -name "*.txt" | xargs -I FILE awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' FILE >> out.txt
However, I would also like to print the name of the corresponding input file with each pair of numbers. The output would look something like:
file1.txt datum1 max1
file2.txt datum2 max2
For this, I tried to draw inspiration from this similar question:
add filename to beginning of file using find and sed,
but I couldn't quite get a working solution. My best effort so far looks something like this
find ./ -name "*.txt" | xargs -I FILE echo FILE | awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' FILE >> out.txt
but I get the error:
awk: can't open file FILE
source line number 1
I tried various other approaches which are probably a few characters away from being correct:
(1)
find ./ -name "*.txt" | xargs -I FILE -c "echo FILE ; awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' FILE" >> out.txt
(2)
find ./ -name "*.txt" -exec sh -c "echo {} && awk '{if(max<$2){max=$2;datum=$1}}END{print datum, max}' {}" \; >> out.txt
I don't mind what command is used (xargs or exec or whatever), I only really care about the output.
If all the .txt files are in the current directory, try (GNU awk):
awk '{if(max=="" || max<$2+0){max=$2;datum=$1}}ENDFILE{print FILENAME, datum, max; max=""}' *.txt
If you want to search both the current directory and all its subdirectories for .txt files, then try:
find . -name '*.txt' -exec awk '{if(max=="" || max<$2+0){max=$2;datum=$1}}ENDFILE{print FILENAME, datum, max; max=""}' {} +
Because modern find has an -exec action, the command xargs is rarely needed anymore.
How it works
{if(max=="" || max<$2+0){max=$2;datum=$1}}
This finds the maximum of column 2 and saves it along with the corresponding value in column 1.
ENDFILE{print FILENAME, datum, max; max=""}
After the end of each file is reached, this prints the filename and column 1 and column 2 from the line with the maximum column 2.
Also, at the end of each file, max is reset to an empty string.
Example
Consider a directory with these three files:
$ cat file1.txt
1 1
2 2
$ cat file2.txt
3 12
5 14
4 13
$ cat file3.txt
1 0
2 1
Our command produces:
$ awk '{if(max=="" || max<$2+0){max=$2;datum=$1}}ENDFILE{print FILENAME, datum, max; max=""}' *.txt
file1.txt 2 2
file2.txt 5 14
file3.txt 2 1
BSD awk
If we cannot use ENDFILE, try:
$ awk 'FNR==1 && NR>1{print f, datum, max; max=""} max=="" || max<$2+0{max=$2;datum=$1;f=FILENAME} END{print f, datum, max}' *.txt
file1.txt 2 2
file2.txt 5 14
file3.txt 2 1
Because one awk process can analyze many files, this approach should be fast.
FNR==1 && NR>1{print f, datum, max; max=""}
Every time that we start a new file, we print the maximum from the previous file.
In awk, FNR is the line number within the current file and NR is the total number of lines read so far. When FNR==1 && NR>1, that means we have finished at least one file and have started on the next.
max=="" || max<$2+0{max=$2;datum=$1;f=FILENAME}
Like before, we capture the maximum of column 2 and the corresponding datum from column 1. We also record the filename as variable f.
END{print f, datum, max}
After we finish reading the last file, we print its maximum line.
If you have 10,000 files of 100,000 lines each, you will be waiting quite a long time if you start a new invocation of awk for each and every file like this, because you will have to create 10,000 processes:
find . -name \*.txt -exec awk ....
I created some test files and found that the above takes just over 5 minutes on my iMac.
So, I decided to see what all those lovely Intel cores and all that lovely flash disk that I paid Apple so dearly for might be able to do using GNU Parallel.
Basically, it will run as many jobs in parallel as your CPU has cores - probably 4 or 8 on a decent Mac, and it can tag output lines with the parameters it supplied to the command:
parallel --tag -q awk 'BEGIN{max=$2;d=$1} $2>max {max=$2;d=$1} END{print d,max}' ::: *.txt
That produces the same results and now runs in 1 minute 22 seconds, nearly a 4x speedup - not bad! But we can do better... as it stands above, we are still invoking a new awk for every file, so 10,000 awks, but in parallel, 8 at a time. It would be better to pass as many files as the OS permits to each of our 8 awks that run in parallel. Luckily, GNU Parallel will work out how many that is for us, with the -X option:
parallel -X -q gawk 'BEGINFILE{max=$2;d=$1} $2>max {max=$2;d=$1} ENDFILE{print FILENAME,d,max}' ::: *.txt
That now takes 49 seconds, but note that I am using gawk for ENDFILE/BEGINFILE and not the --tag option because each awk invocation is now receiving many hundreds of files rather than just one.
GNU Parallel and gawk can be easily installed on a Mac with Homebrew. You just go to the Homebrew website and copy and paste the one-liner into your terminal. Then you have a proper package manager on macOS and access to thousands of quality, useful, well-managed packages.
Once you have homebrew installed, you can install GNU Parallel with:
brew install parallel
and you can install gawk with:
brew install gawk
If you don't want a package manager, it's worth noting that GNU Parallel is just a Perl script and macOS ships with Perl anyway. So, you can also install it very simply with:
(wget -O - pi.dk/3 || curl pi.dk/3/ ) | bash
Note that if your filenames are longer than about 25 characters, you will hit the limit of 262,144 characters on the argument length and get an error message telling you the argument list is too long. If that happens, just feed the names on stdin like this:
find . -name \*.txt -print0 | parallel -0 -X -q gawk 'BEGINFILE{max=$2;d=$1} $2>max {max=$2;d=$1} ENDFILE{print FILENAME,d,max}'
find . -name '*.txt' | xargs -n 1 -I FILE awk '(FNR==1) || (max<$2){max=$2;datum=$1} END{print FILENAME, datum, max}' FILE >> out.txt
find . -name '*.txt' -exec awk '(FNR==1) || (max<$2){max=$2;datum=$1} END{print FILENAME, datum, max}' {} \; >> out.txt

Removing the first line of each file from a wildcard?

I am trying to copy about 100 CSVs into a PostgreSQL database. The CSVs aren't formed perfectly for the database, so I have to do some editing, which I am trying to do on the fly with piping.
Because each CSV file has a header, I need to remove the first line to prevent the headers from being copied into the database as an entity.
My attempt at this was the following:
sed -e "s:\.00::g" -e "s/\"\"//g" *.csv | tail -n +2 | cut -d "," -f1-109 |
psql -d intelliflight_pg -U intelliflight -c "\COPY flights FROM stdin WITH DELIMITER ',' CSV"
The problem I'm having with this is that the pipeline treats the *.csv files as a single stream, so it only removes the first line of the first file and leaves the headers in the rest of the files alone.
How can I get this to remove the first line of each individual file retrieved by the *.csv wildcard?
You can combine the sed and tail steps and use find to have per-file processing, then pipe the output of that to cut and psql:
find -name '*.csv' -exec sed '1d;s/\.00//g;s/""//g' {} \; | cut ...
This uses sed to remove the first line from each file and perform the substitutions on the remaining lines. Each file is processed separately, and the combined output is piped to cut and the rest of your commands.
Notice the single quotes around the sed argument, simplifying things somewhat with the quoting.
This also processes .csv files in subdirectories; if you don't want that, you have to limit recursion depth with
find -maxdepth 1 -name etc.
I can't test it right now, but this should do it:
awk 'BEGIN {FS = OFS = ","}
FNR == 1 {next}
{
    gsub(/\.00/, "")
    gsub(/""/, "")
    NF = 109
    print
}' *.csv | psql ..
The NF = 109 line will drop any fields after the 109th; setting OFS along with FS keeps the output comma-separated.
