Using awk to split a CSV file by column - macOS

I have a CSV file that I need to split by date. I've tried using the AWK code listed below (found elsewhere).
awk -F"," 'NR>1 {print $0 >> ($1 ".csv"); close($1 ".csv")}' file.csv
I've tried running this in the terminal on both OS X and Debian. In both cases there's no error message (so the command seems to run), but there's also no output: no output files, and no response at the command line.
My input file has ~6k rows of data that looks like this:
date,source,count,cost
2013-01-01,by,36,0
2013-01-01,by,42,1.37
2013-01-02,by,7,0.12
2013-01-03,by,11,4.62
What I'd like is for a new CSV file to be created containing all of the rows for a particular date. What am I overlooking?

I've resolved this. Following the logic of this thread, I checked my line endings with the file command and learned that the file had old-style Mac line terminators (bare CR characters). I opened my input CSV file with TextWrangler and saved it again with Unix-style line endings. Once I did that, the awk command listed above worked as expected: it took ~5 seconds to create 63 new CSV files broken out by date.
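That also explains the silent failure: awk's default record separator is a newline, so a file containing only carriage returns looks like one giant single record, and the NR>1 guard skips record 1, leaving nothing to print. If you'd rather fix the endings from the shell than from an editor, a minimal sketch (the file name is a placeholder): run file file.csv, which in this situation reports something like "ASCII text, with CR line terminators", and then convert with
tr '\r' '\n' < file.csv > file-unix.csv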

To retrieve information from a log file that uses ";" as the separator, I use:
grep "END SESSION" filename.log | cut -d";" -f2
where
-d, --delimiter=DELIM use DELIM instead of TAB for field delimiter
-f, --fields=LIST select only these fields; also print any line
that contains no delimiter character, unless
the -s option is specified
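As a worked example with a made-up log line, grep keeps only the matching lines and cut then prints their second ;-delimited field:
echo "END SESSION;user42;42s" | grep "END SESSION" | cut -d";" -f2
user42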


How to add an empty line at the end of these commands?

I have many fastq files that I want to convert to fasta. Since they all belong to the same sample, I would like to merge the resulting fasta files into a single file.
I tried running these two commands:
sed -n '1~4s/^#/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
Either way, the output is correctly a fasta file.
However, I run into a problem in the next step, when I merge the files with this command:
cat $1 >> final.fasta
The final file looks correct at first glance, but when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line, I found that one file's header had been appended to the end of the previous file's sequence, like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta, so that when I merge the files they stack correctly instead of being glued to the end of the previous file's sequence?
I would use GNU sed. Replace
cat $1 >> final.fasta
with
sed '$a\\n' $1 >> final.fasta
Explanation: the sed expression means: at the last line ($), append a newline (\n); this action is carried out before the default action of printing. If you prefer GNU AWK, you can get the same behavior the following way:
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test either solution, as you don't provide enough information for that. I assume the line above sits somewhere inside a loop, and that $1 is always the name of a file in the current working directory.
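A quick sanity check of the awk variant (a sketch with made-up input; printf deliberately omits the trailing newline to mimic a fasta file that ends mid-line, and GNU cat -A marks each line end with $):
printf '>seq1\nGGCTTAAACAGCATT' | awk '{print}END{print ""}' | cat -A
>seq1$
GGCTTAAACAGCATT$
$
The sequence gets its final newline back and a blank line follows, so the next file's header starts on a line of its own.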
If the only thing you need is an extra blank line, and the input files are within 1.5 GB in size, then just do:
awk NF=NF RS='^$' FS='\n' OFS='\n'
This should work for mawk 1/2, gawk, and nawk, and maybe others as well. It works, despite appearing not to do anything special, because RS='^$' slurps the whole file in as a single record, NF=NF rebuilds that record (and, being non-zero, triggers the default print), and the extra \n comes from the ORS that print appends.
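For example, with made-up two-line input (GNU cat -A marks line ends with $):
printf 'ACGT\nTTGA\n' | awk NF=NF RS='^$' FS='\n' OFS='\n' | cat -A
ACGT$
TTGA$
$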

Cut a string of text after a character from a column on each line of a CSV, keeping the other columns, and printing to a new file

I have a CSV file with a first column that reads:
/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/BER5_OSSD_F008071.csv.0.01.out.csv
Followed by additional columns listing counts pulled from other CSV files.
What I want is to remove "/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/" from each line without affecting any other part of the file.
I've tried using sed, grep, and cut, but those only seem to print the output to the terminal, or to a new file containing just that part of the line and not the rest of the columns. Can I remove the "/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/" and keep everything else the same?
You can use awk to get this job done.
The code below will replace the string /Users/swilki/Desktop/Africa_OSSD/OSSD_Output/ with the empty string "" and update the file in place via the inplace option (which requires GNU awk 4.1+).
yourfile.csv is the input file.
awk -i inplace '{sub(/\/Users\/swilki\/Desktop\/Africa_OSSD\/OSSD_Output\//,"")}1' yourfile.csv
The above will remove the "/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/" and keep everything else the same.
Output of yourfile.csv:
BER5_OSSD_F008071.csv.0.01.out.csv
Option 2: if you want to print to a new file instead, the code below will write the replaced contents to the new file your_newfile.csv:
awk '{sub(/\/Users\/swilki\/Desktop\/Africa_OSSD\/OSSD_Output\//,"")}1' yourfile.csv >your_newfile.csv
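If escaping all those slashes in the regex gets unwieldy, here is a variant sketch (untested against your data) that passes the prefix in as a plain string variable and strips it with index/substr, so no regex escaping is needed:
awk -v p='/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/' 'index($0,p)==1{$0=substr($0,length(p)+1)}1' yourfile.csv > your_newfile.csv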

using awk on the command line to add a column to the end of a CSV

I have downloaded this CSV file: https://www.nhsbsa.nhs.uk/sites/default/files/2020-05/Dispensing%20Data%20Jan%2020%20-%20CSV.csv
and am trying to add a column onto the end with a value of "null" for every row.
I have tried using this awk command:
awk 'BEGIN{FS=OFS=","}{print $0 OFS "null"}' ogfile.csv > newfile.csv
but it appears to be adding a new row after every row, with the second column having a field of "null":
(screenshot: new rows instead of a new column)
can anyone help me understand why this is happening?
Your source file has DOS/Windows line endings. When one sees anomalous output, this is a good first item to check. Two solutions:
Use a utility such as dos2unix to remove the unwanted \r character from your input file. dos2unix is available on most distributions.
or,
Modify your awk command to recognize and remove the offending characters:
awk 'BEGIN{RS="\r\n"; FS=OFS=","}{print $0 OFS "null"}' ogfile.csv
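To see what's actually happening, you can make the stray carriage return visible (a sketch with a made-up row; GNU cat -A shows \r as ^M and line ends as $):
printf 'a,b\r\n' | awk 'BEGIN{FS=OFS=","}{print $0 OFS "null"}' | cat -A
a,b^M,null$
The null really is appended to the same line; it's the ^M in front of it that makes spreadsheet tools and viewers render it as the start of a new row.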

Convert many txt files to xls files with bash script

I'm trying to convert many text files to xls files. The style of the txt files is as follows:
"Name";"Login";"Role"
"Max Muster";"Bla102";"user"
"Heidi Held";"Held100";"admin"
I tried to work with this bash script:
for file in *.txt; do
tr ";" "," < "$file" | paste -d, <(seq 1 $(wc < "$file")) - > "${file%.*}.xls"
soffice --headless --convert-to xls:"MS Excel 95" filename.xls "${file%.*}.xls"
done
with this, I lose the header row, and I also get a column with many Chinese characters, but the rest looks okay:
攀挀琀 | Max Muster | Bla102 | user
氀愀猀 | Heidi Held | Held100 | admin
How can I get rid of these Chinese characters and keep the header row?
The question unfortunately does not provide enough details to be sure exactly what the issues are, but we have identified at least the following in the comments.
Apparently, the input file contains DOS carriage returns.
Apparently, soffice attempted to read the file as UTF-16, which is what produced the essentially random Chinese characters. (The characters could be anything; it's just more probable that a random Unicode BMP character will be in a Chinese/Japanese block.)
With those observations and a refactoring of the existing script, try
for file in *.txt; do
    awk -F ';' 'BEGIN { OFS="," }
    FNR==1 {
        # Add UTF-8 BOM
        printf "\357\273\277"
        # Generate header line for soffice to discard
        for (i=1; i<=NF; i++) printf "bogus%s", (i==NF ? "\n" : OFS)
    }
    { sub(/\015/, ""); print FNR, $0 }' "$file" > "${file%.*}.xls"
    soffice --headless --convert-to xls:"MS Excel 95" "${file%.*}.xls"
done
In so many words, the Awk script splits each input line on semicolons (-F ';') and sets the output field separator OFS to a comma. On the first line, we emit a BOM and a synthetic header line for soffice to discard, so that the file's real header line is kept as a regular data line in the output. The sub takes care of removing any DOS carriage return character, and the variable FNR is the current input line's line number, which supplies the leading line-number column.
I'm not sure whether the BOM or the bogus header line is strictly necessary, or whether you instead need to pass some additional options to make soffice treat the input as proper UTF-8. Perhaps you also need to include LC_ALL=C somewhere in the pipeline.
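Either way, it's worth verifying the raw input before converting; file(1) will report both suspected problems if they are present (the output shown is illustrative):
file yourfile.txt
yourfile.txt: Little-endian UTF-16 Unicode text, with CRLF line terminators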

Bash extract parts from string and create csv

I have a document with over a million of the following strings, and I would like to create some new structures by extracting some parts and writing them to a CSV file. What's the quickest way to do this?
document/0006-291X(85)91157-X
I would like to have a file with, on each line, the original string and the extracted parts:
document/0006-291X(85)91157-X;0006-291X;85
You can try this awk one-liner:
awk -F "[/()]" -v OFS=';' '{print $0,$(NF-2),$(NF-1)}' your-file
It splits each line into fields, taking /, (, and ) as delimiters. Then it prints the whole line, the third-from-last field, and the second-from-last field. The option -v OFS=';' sets a semicolon as the output field separator.
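Applied to the sample string, it should produce exactly the requested output:
printf 'document/0006-291X(85)91157-X\n' | awk -F "[/()]" -v OFS=';' '{print $0,$(NF-2),$(NF-1)}'
document/0006-291X(85)91157-X;0006-291X;85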
