How to process many csv files into many separate dat files - bash

I have several thousand CSV files I wish to reformat. They all have a standard filename with an incrementing integer, e.g. file_1.csv, file_2.csv, file_3.csv, and they all have the same format:
CH1
s,Volts
-1e-06,-0.0028,
-9.998e-07,-0.0032,
-9.99e-07,-0.0036,
Each file is 10,002 lines long. I want to remove the header and separate the two columns into separate files. I have the following code, which produces the results I want for a single input file:
tail -10000 file_1.csv |
awk -F, '{print $1 > "s.dat"; print $2 > "Volts.dat"}'
However, I want something that will produce the equivalent files for each CSV file, say by replacing s.dat with s_$i.dat or similar, but I'm not sure how to go about this, or how to take in each CSV file in a loop rather than stating it explicitly as file_1.csv.

awk to the rescue!
awk -F, 'FNR>2{print $1 > "s_"FILENAME".dat";
print $2 > "Volts_"FILENAME".dat"}' file*
or, reading the column names from the data files themselves:
$ awk -F, 'FNR==2{s="_"FILENAME".dat";h1=$1s;h2=$2s}
FNR>2{print $1 > h1; print $2 > h2}' file*
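If you would rather drive this with an explicit shell loop and key the output names to the number in each filename, a minimal sketch (assuming the file_<n>.csv naming from the question, and building the output names in variables for portability) would be:
for f in file_*.csv
do
    i=${f#file_}; i=${i%.csv}    # numeric index taken from the filename
    tail -n +3 "$f" |
    awk -F, -v i="$i" '{s="s_" i ".dat"; v="Volts_" i ".dat"; print $1 > s; print $2 > v}'
done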

Related

Converting from tsv to fasta

I have a bunch of TSV files in my folder, and for every one of them I would like to get a FASTA file where the header after the '>' sign is the name of the file.
My TSV files have 5 columns and no header. Thus:
input file called "A.coseq.table_headless.tsv":
HIV1B-pol-seed 15 MAX 1959 GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
output file called "A.fasta"
>A_MAX
GTAACAGACTCACAATATGCATTAGGAATCATTCAAGC
I want to run the script in bash for all the files at once, and I have this script, which does not work because of the curly braces inside the awk print statement:
for sample in `ls *coseq.table_headless.tsv`
do
base1=$(basename $sample "coseq.table_headless.tsv")
awk '{print ">"${base1}"_"$3"\n"$5}' ${base1}coseq.table_headless.tsv > ${base1}fasta
done
Any idea how to correct this code?
Thank you very much
If the basename is the part up to the first ".", you can get rid of the loop as well:
awk '{split(FILENAME,base,".");
print ">" base[1] "_" $3 "\n" $5 > base[1]".fasta"}' *coseq.table_headless.tsv
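If you do want to keep the shell loop from your script, the usual fix is to pass the shell variable into awk with -v rather than splicing it into the single-quoted program; a minimal sketch (note the suffix handed to basename now includes the leading dot, so base is just "A"):
for sample in *coseq.table_headless.tsv
do
    base=$(basename "$sample" .coseq.table_headless.tsv)
    awk -v b="$base" '{print ">" b "_" $3 "\n" $5}' "$sample" > "${base}.fasta"
done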
The other solutions posted so far have a few issues:
not closing the files as they're written will produce "too many open files" errors unless you use GNU awk,
calculating the output file name every time a line is read, rather than once when the input file is opened, is inefficient, and
using a parenthesized expression on the right side of output redirection is undefined behavior and so will only work in some awks (including GNU awk).
This will work robustly and efficiently in all awks:
awk '
FNR==1 { close(out); f=FILENAME; sub(/\..*/,"",f); pfx=">"f"_"; out=f".fasta" }
{ print pfx $3 ORS $5 > out }
' *coseq.table_headless.tsv
Another awk solution:
awk '{ pfx=substr(FILENAME,1,index(FILENAME,".")-1);
printf(">%s_%s\n%s\n",pfx,$3,$5) > pfx".fasta" }' *coseq.table_headless.tsv
pfx contains the first part of the filename (up to the first ".").

Fastest way to extract a column and then find its uniq items in a large delimited file

Hoping for help. I have a 3-million-line file, data.txt, delimited with "|", e.g.:
"4"|"GESELLSCHAFT FUER NUCLEONIC & ELECT MBH"|"DE"|"0"
"5"|"IMPEX ESSEN VERTRIEB VON WERKZEUGEN GMBH"|"DE"|"0"
I need to extract the 3rd column ("DE") and then limit it to its unique values. Here is what I've come up with (gawk and gsort because I'm running macOS and only have the "--parallel" option via GNU sort):
gawk -F "|" '{print $3}' data.txt \
| gsort --parallel=4 -u > countries.uniq
This works, but it isn't very fast. I have similar tasks coming up with some even larger (11M record) files, so I'm wondering if anyone can point out a faster way.
I hope to stay in shell, rather than say, Python, because some of the related processing is much easier done in shell.
Many thanks!
awk is tailor-made for such tasks. Here is minimal awk logic that could do the trick for you.
awk -F"|" '!($3 in arr){print} {arr[$3]++} END{ for (i in arr) print i}' logFile
The logic: as awk processes each line, it records the value in $3 only if it has not seen that value before. The above prints both the lines whose $3 value is seen for the first time and, at the end, the unique entries from $3.
If you want the unique lines only, you can exclude the END block:
awk -F"|" '!($3 in arr){print} {arr[$3]++}' logFile > uniqueLinesOnly
If you want only the unique values from the file, remove the print in the main block:
awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' logFile > uniqueEntriesOnly
You can check how fast it is on an 11M-record file. You can write the output to a new file using the redirection operator.
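For example, to time the unique-values variant on your data and write the result to the same output file as in your question, something like this should do:
time awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i }' data.txt > countries.uniq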

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the name of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called myarray into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[@]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
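For instance, a sketch of that idea, writing into a hypothetical split_out/ directory so the glob cannot pick up input_file.csv itself (the field separator and header handling mirror your original command):
mkdir -p split_out
awk -v dir=split_out -F"\",\"" '
NR==1 { header = $0 }
NF>1 && NR>1 {
    f = dir "/" $7 ".csv"
    if (!seen[$7]++) print header >> f   # header once per sport
    print >> f
    close(f)                             # keep the open-file count low
}' input_file.csv
for i in split_out/*.csv
do
    echo "$i"
done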
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
    header=$0
    next
}
!($7 in files) {
    files[$7]=sprintf("sport-%s.csv", $7)
    print header > files[$7]
}
{
    print > files[$7]
}
END {
    printf("declare -a sportlist=( ")
    for (sport in files) {
        printf("\"%s\" ", sport)
    }
    printf(")\n")
}
The idea here is that we key the array files[] by sport name and store the generated filename for each sport as its value. (You can format the filename inside sprintf() as you see fit.) We step through the file, writing a header line whenever we meet a new sport with no recorded filename; then every non-header line is printed to the file looked up by its sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside command substitution, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray filesArray < <( awk ... )
If the files might have newlines in them, too, then things get tricky...
If your file is not large, you can run another script to get the unique $7 elements; for example,
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values; you can change it to your file-name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, here for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
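If you then want to run another script on each generated file, you can just loop over that array (your_other_script below is a placeholder for whatever processing you have in mind):
for f in "${array[@]}"
do
    ./your_other_script "$f"
done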

awk Splitting huge file creates error "too many open files" [duplicate]

I am just splitting a very large CSV file into parts. Whenever I run the following command, the file doesn't split completely; instead it returns the following error. How can I avoid this and split the whole file?
awk -F, '{print > $2}' test1.csv
awk: YY1 makes too many open files
input record number 31608, file test1.csv
source line number 1
Just close the files after writing:
awk -F, '{print > $2; close($2)}' test1.csv
You must have a lot of lines. Are you sure that the second column repeats enough to group those records into individual files? Anyway, awk is holding the files open until the end. You'll need a process that can close the file handles when they are not in use.
Perl to the rescue. Again.
#!/usr/bin/perl
use strict;
use warnings;

while ( <> ) {
    my @content = split /,/, $_;
    open( my $out, '>>', $content[1] ) or die "whoops: $!";
    print $out $_;
    close $out;
}
usage: script.pl your_monster_file.csv
This outputs each entire line into a file in the current directory named after the value of the second CSV column, assuming no quoted fields etc.

Joining specific parts of text from two files in third?

My question is again about Linux shell programming, and this time I have two text files, each with about 17,000 lines.
In the first file I have lines of this form:
[*] 11004, e01c5dee8efb188af91fb989a1039a12, isabelleann86@yahoo.com
And each line of the second file has this form:
e01c5dee8efb188af91fb989a1039a12:nathan09
Now I want to create a third file from these two, with lines of the form:
isabelleann86@yahoo.com:nathan09
But note: the hash e01c5dee8efb188af91fb989a1039a12 must correspond to the matching lines in both the first and second file, not something like pairing email_1 with password_3421.
The email comes from file one and the password from file two, where the lines share the same hash value.
I know it is probably possible with a grep/awk combination, but I just do not know how to put it together.
Here's one way using awk with multiple delimiters:
awk -F "[ ,:]+" 'FNR==NR { a[$3]=$4; next } $1 in a { print a[$1], $2 }' OFS=":" file1 file2 > file3
Results; contents of file3:
isabelleann86@yahoo.com:nathan09
Using awk
awk 'NR==FNR{a[$(NF-1)]=$NF;next}
$1"," in a {print a[$1","] FS $NF}' file1 FS=: file2
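To write the result into the third file, append a redirect just as in the first solution:
awk 'NR==FNR{a[$(NF-1)]=$NF;next}
$1"," in a {print a[$1","] FS $NF}' file1 FS=: file2 > file3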
