How to split a CSV file into multiple files based on column value - bash

I have a CSV file that could look like this:
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
There could be more or fewer rows, and I need to split the file into multiple .dat files, each containing the rows that share the same value in the second column. (I will then make a bar chart for each .dat file.) In this case there should be two files:
data1.dat
name1;1;11880
name2;1;260.483
name3;1;3355.82
name4;1;4179.48
data2.dat
name1;2;10740.4
name2;2;1868.69
name3;2;341.375
name4;2;4783.9
Is there any simple way of doing it with bash?

You can use awk to generate a file containing only a particular value of the second column:
awk -F ';' '($2==1){print}' data.dat > data1.dat
Just change the value in the $2== condition.
Or, if you want to do this automatically, just use:
awk -F ';' '{print > ("data"$2".dat")}' data.dat
which will output to files containing the value of the second column in the name.
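If the second column takes many distinct values, some awk implementations can run out of open file descriptors with the one-liner above. A minimal sketch of a variant that closes each output file right after writing to it, assuming the input is called data.csv (a made-up name for this demo) and that no data*.dat files exist beforehand, because >> appends:

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened sample input from the question (data.csv is an assumed name).
cat > data.csv <<'EOF'
name1;1;11880
name2;1;260.483
name1;2;10740.4
name2;2;1868.69
EOF

# Append each row to data<column 2>.dat and close the file straight away,
# so only one output file is open at any time.
awk -F ';' '{ out = "data" $2 ".dat"; print >> out; close(out) }' data.csv

cat data1.dat
```

Because the files are opened in append mode, running the command twice duplicates the rows; remove the old data*.dat files first if that matters.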

Try this:
while IFS=";" read -r a b c; do echo "$a;$b;$c" >> "data${b}.dat"; done <file


Pass number of for loop elements to external command

I'm using a for loop to iterate through .txt files in a directory and grab specified rows from the files. The output is then passed to the pr command to print it as a table. Everything works fine, but I'm specifying the number of columns in the table manually, which is cumbersome when the number of files is not constant.
The command I'm using:
for f in *txt; do awk -F"\t" 'FNR ~ /^(2|6|9)$/{print $2}' $f; done | pr -ts --column 4
How should I modify the command to replace '4' with the number of elements (files) in the loop?
Edit:
The fundamental question was whether the number of matching files can be passed to a command outside the loop. Judging by the solutions, I guess it is not possible to work around the problem. Until I reached this conclusion, the structure of the files was not really relevant.
Taking the above into account, however, I'm providing the file structure below.
Sample file.txt:
Irrelevant1 text
Placebo 1222327
Irrelevant1 text
Irrelevant2 text
Irrelevant3 text
Treatment1 105956
Irrelevant1 text
Irrelevant2 text
Treatment2 49271
Irrelevant1 text
Irrelevant2 text
The for loop generates the following from 4 *txt files:
1222327
105956
49271
969136
169119
9672
1297357
237210
11581
1189529
232095
13891
Expected pr output using a dynamically generated --column 4:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
Assumptions:
all input files generate the same number of output lines (otherwise we can add some code to keep track of the max number of lines and generate blank columns as needed)
Setup (columns are tab-delimited):
$ grep -n xxx f[1-4].txt
f1.txt:6:xxx 1222327
f1.txt:9:xxx 105956
f1.txt:24:xxx 49271
f2.txt:6:xxx 969136
f2.txt:9:xxx 169119
f2.txt:24:xxx 9672
f3.txt:6:xxx 1297357
f3.txt:9:xxx 237210
f3.txt:24:xxx 11581
f4.txt:6:xxx 1189529
f4.txt:9:xxx 232095
f4.txt:24:xxx 13891
One idea using awk to dynamically build the 'table' (replaces OP's current for loop):
awk -F'\t' '
FNR==1 { c=0 }
FNR ~ /^(6|9|24)$/ { ++c ; arr[c]=arr[c] (FNR==NR ? "" : " ") $2 }
END { for (i=1;i<=c;i++) print arr[i] }
' f[1-4].txt | column -t -o ' '
NOTE: we let column take care of pretty-printing the table with a single space separating the columns (-o is an option of the util-linux version of column); otherwise we could add some more code to awk to right-pad the columns with spaces.
This generates:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
You could run ls and pipe the output to wc -l, then assign that number to a variable and use the variable in your command.
num=$(ls *.txt | wc -l)
The count goes to pr, not to awk, so the shell variable can be used directly, for example --column "$num". If that does not work, respond back and I'll try to find a different answer.
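A sketch of the whole pipeline with the file count computed up front. It uses a bash array instead of ls | wc -l to count the files, and assumes bash, GNU pr, and that every file yields the same number of lines. The sample files f1.txt and f2.txt and the variable names files, ncols and table are made up for the demonstration:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two tab-delimited sample files; lines 2, 6 and 9 carry the values of interest.
for i in 1 2; do
    for n in 1 2 3 4 5 6 7 8 9; do
        printf 'label%s\t%s%s\n' "$n" "$i" "$n"
    done > "f$i.txt"
done

files=(*.txt)            # expand the glob once, into an array
ncols=${#files[@]}       # number of matching files = number of table columns

table=$(for f in "${files[@]}"; do
    awk -F'\t' 'FNR ~ /^(2|6|9)$/ { print $2 }' "$f"
done | pr -ts --columns="$ncols")

printf '%s\n' "$table"
```

Expanding the glob into an array means the loop and the count are guaranteed to see the same list of files.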

AWK post-procession of multi-column data

I am working with a set of txt files, each containing multi-column information on a single line. Within my bash script I use the following AWK expression to take the filename of each txt file together with the number from the 5th column and save them in two-column format in a results.CSV file (the output is piped to SED, which removes the file's path and extension from the final CSV file):
awk '-F, *' '{if(FNR==2) printf("%s| %s \n", FILENAME,$5) }' ${tmp}/*.txt | sed 's|\/Users/gleb/Desktop/scripts/clusterizator/tmp/||; s|\.txt||' >> ${home}/"${experiment}".csv
obtaining something like this (for 5 txt files) as the CSV:
lig177_cl_5.2| -0.1400
lig331_cl_3.5| -8.0000
lig394_cl_1.9| -4.3600
lig420_cl_3.8| -5.5200
lig550_cl_2.0| -4.3200
How can I modify my AWK expression to exclude "_cl_x.x" from the name of each txt file and to add the name of the CSV as a comment on the first line of the resulting CSV file:
# results.CSV
lig177| -0.1400
lig331| -8.0000
lig394| -4.3600
lig420| -5.5200
lig550| -4.3200
Based on the rest of the pipe, I think you want to do something like this and get rid of the sed invocation.
awk -F', *' 'FNR==2 {f=FILENAME;
sub(/.*\//,"",f);
sub(/_.*/ ,"",f);
printf("%s| %s\n", f, $5) }' "${tmp}"/*.txt >> "${home}/${experiment}.csv"
this will convert
/Users/gleb/Desktop/scripts/clusterizator/tmp/lig177_cl_5.2.txt
to
lig177
The pattern replacement is generic:
/path/to/the/file/filename_otherstringshere...
will yield only filename, that is, the part from the last / character to the first _ character after it. This relies on the greedy matching of regex patterns.
For the header line with the output file name, it's easier to write it before the awk call, since it's only one line. The leading # makes it a comment, as in the desired output:
$ echo "# ${experiment}.csv" > "${home}/${experiment}.csv"
$ awk ... >> "${home}/${experiment}.csv"
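Putting the two steps together, a self-contained sketch might look like the following. The scratch directories, the value of experiment and the contents of the sample files are assumptions made for the demonstration. The header is written with a leading "# " so that it matches the comment line in the desired output.

```shell
tmp=$(mktemp -d)         # stands in for ${tmp} from the question
home=$(mktemp -d)        # stands in for ${home}
experiment=results       # assumed experiment name

# Sample input: the value of interest is field 5 on line 2.
printf 'h1, h2, h3, h4, h5\na, b, c, d, -0.1400\n' > "$tmp/lig177_cl_5.2.txt"
printf 'h1, h2, h3, h4, h5\na, b, c, d, -8.0000\n' > "$tmp/lig331_cl_3.5.txt"

out="$home/$experiment.csv"
echo "# $experiment.csv" > "$out"      # header comment; truncates any old file

awk -F', *' 'FNR == 2 {
    f = FILENAME
    sub(/.*\//, "", f)      # drop everything up to the last slash
    sub(/_.*/, "", f)       # drop everything from the first underscore
    printf("%s| %s\n", f, $5)
}' "$tmp"/*.txt >> "$out"

cat "$out"
```

Writing the header with > and the data with >> means a rerun starts from a fresh file instead of appending to the previous results.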

Take first column from 1.csv and find in second 2.csv

I want to read two CSV files (A.csv and B.csv) and write a new CSV file, new.csv, with a status for each row. I want to do that with a shell script.
A.csv:
Inputfile_name,Date
abc.csv,2018/11/26 16.38.54
bbc.csv,2018/11/26 15.28.11
B.csv:
Outputfile_name,Date
abc_SUCCESS.csv,2018/11/26 17.20.11
bbc_FAIL.csv,2018/11/26 16.28.11
new.csv:
Inputfile_name,Date,Outputfile_name,Date,Status
abc.csv,2018/11/26 16.38.54,abc_SUCCESS.csv,2018/11/26 17.20.11,SUCCESS
bbc.csv,2018/11/26 15.28.11,bbc_FAIL.csv,2018/11/26 16.28.11,FAIL
Like so?
$ paste -d, A.csv B.csv | sed -e 's/\(SUCCESS\|FAIL\).*/&,\1/'
Inputfile_name,Date,Outputfile_name,Date
abc.csv,2018/11/26 16.38.54,abc_SUCCESS.csv,2018/11/26 17.20.11,SUCCESS
bbc.csv,2018/11/26 15.28.11,bbc_FAIL.csv,2018/11/26 16.28.11,FAIL
paste can concatenate the contents of two files linewise. And with sed you can do a search+replace operation for adding SUCCESS or FAIL at the end of each line.
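Note that the header line in the output above does not receive the Status column. A hedged sketch that also handles the header, assuming both files list their rows in the same order (paste pairs lines purely by position) and that every output file name ends in _SUCCESS.csv or _FAIL.csv:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > A.csv <<'EOF'
Inputfile_name,Date
abc.csv,2018/11/26 16.38.54
bbc.csv,2018/11/26 15.28.11
EOF

cat > B.csv <<'EOF'
Outputfile_name,Date
abc_SUCCESS.csv,2018/11/26 17.20.11
bbc_FAIL.csv,2018/11/26 16.28.11
EOF

# Line 1: append the new column name.
# Other lines: copy the SUCCESS/FAIL token from the output file name to the end.
paste -d, A.csv B.csv |
    sed -E -e '1s/$/,Status/' -e '2,$s/_(SUCCESS|FAIL)\.csv.*/&,\1/' > new.csv

cat new.csv
```

If the two files might list their entries in a different order, joining on the file name would be safer than relying on line position.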

Adding an extra value into CSV data, according to filename

Let's say I have files with the following type of filename format:
CO#ATH2000.dat, CO#MAR2000.dat
Each of these has data like the following:
....
"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
Now I also have the following .sh file that merges ALL those .dat files into one single output .dat file.
for filename in `ls CO#*`; do
cat $filename >> CO#combined.dat
done
Now here is the problem. In CO#combined.dat I want each line to start with a 'standard' value determined by the filename, placed before the existing values. For example, lines from files with ATH in the filename should start with 3, and lines from files with MAR in the filename should start with 22,.
So CO#combined.dat should look something like this:
....
3,"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
3,"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
22,"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
22,"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
So in conclusion I want the script to do the above procedure!
Thanks in advance!
With awk you can take advantage of the built-in FILENAME variable along with the fact that you can supply multiple files to a given invocation. awk processes each file in turn, setting FILENAME to the name of the file whose records are currently being read.
With that you can set your prefix according to whatever pattern you wish to search for in the file name. Finally you can print the prefix and the original record.
Here's a demonstration on simplified versions of your sample input:
$ cat CO\#ATH2000.dat
1
2
3
$ cat CO\#MAR2000.dat
A
B
C
$ awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO*.dat
3,1
3,2
3,3
22,A
22,B
22,C
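When there are more than two codes, a lookup table keeps the mapping in one place. A sketch, assuming the three-letter code always sits in characters 4 to 6 of the file name and that the files are passed without a directory prefix. The array name code, the fallback value NA and the output name combined.out are illustrative choices:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '1\n2\n' > 'CO#ATH2000.dat'
printf 'A\nB\n' > 'CO#MAR2000.dat'

awk '
BEGIN   { code["ATH"] = 3; code["MAR"] = 22 }     # file-name code -> prefix
FNR == 1 {                                         # first record of each file
    key = substr(FILENAME, 4, 3)                   # e.g. "ATH" from CO#ATH2000.dat
    pre = (key in code) ? code[key] : "NA"         # NA marks an unknown code
}
{ print pre "," $0 }
' CO#*.dat > combined.out

cat combined.out
```

The output name deliberately does not match the CO#*.dat glob, so a second run does not feed the combined file back into itself.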
It can be done simply with a bash loop:
for f in CO#*; do
case ${f:3:3} in
ATH) k=3 ;;
*) k=22 ;;
esac
sed "s/^/$k,/" "$f" >> all
done
${f:3:3} uses bash substring expansion to extract the code (ATH or MAR) from the filename; case converts the code to its numeric counterpart (anything other than ATH maps to 22); sed inserts the numeric value and a comma at the beginning of each line.

combine lines of csv in bash

I want to create a new CSV file for each city by combining several CSV files. Each file has rows and columns, and one column holds the city names, which repeat in all the CSV files.
For example,
I have files named after the date (YYYYMMDD): 20140713.csv, 20140714.csv, 20140715.csv...
They all have the same structure, with the same number of rows and columns. For example, 20140713.csv:
City, Data, TMinreal, TMaxreal, TMinext, TMaxext, DiffTMin, DiffTMax
Milano,20140714,19.0,28.8,18,27,1,1.8
Rome,20140714,18.1,29.3,14,29,4.1,0.3
Pisa,20140714,10.8,27.5,8,29,2.8,-1.5
Venecia,20140714,21.1,29.1,16,27,5.1,2.1
I want to combine all these CSV files and get one CSV file per city, named after the city (for example Milano.csv), containing the information about that city from all the combined CSV files.
For example, if I combine 20140713.csv, 20140714.csv and 20140715.csv, Milano.csv would contain:
Milano,20140713,19.0,28.8,18,26,1,2.8
Milano,20140714,19.0,28.8,20,27,-1,1.8
Milano,20140715,21.0,26.8,19,27,2,-0.2
Any idea? Thank you.
untested, but this should work:
awk -F, 'FNR==1{next} {file = $1".csv"; print > file}' 20*.csv
You can have this bash script:
#!/bin/bash
for FILE; do
{
read ## Skip header
while IFS=, read -r A B; do
echo "$A,$B" >> "$A".csv
done
} < "$FILE"
done
Then run as:
bash script.sh file1.csv file2.csv ...
