Splitting CSVs into files named for one of the columns - bash

I have CSVs like this:
apple,file1.txt
banana,file1.txt
carrot,file2.txt
How can I get it to place all of the items from the left column into files named with the items in the right column? E.g. file1.txt would contain this list:
apple
banana
So far, I have this:
while read -r line
do
firstcolumn=$(echo "$line" | awk -F',' '{print $1}')
secondcolumn=$(echo "$line" | awk -F',' '{print $2}')
done < Text/selection.csv

One way using awk:
awk 'BEGIN { FS = "," } { print $1 >> $2 }' infile
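One caveat: awk keeps every output file open, so if the right-hand column names many distinct files you can hit the process's open-file limit. A variant of the same one-liner (a sketch of the same idea) closes each file after writing, at the cost of reopening it for each line:
awk 'BEGIN { FS = "," } { print $1 >> $2; close($2) }' infile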

This should work, though note that it stores the whole file in an array first, keeps only one entry per distinct first column, and writes the lines out in arbitrary order -
awk -F, '{a[$1]=$2} END{for (i in a) print i > a[i]}' file
Test:
[jaypal:~/Temp] cat file
apple,file1.txt
banana,file1.txt
carrot,file2.txt
[jaypal:~/Temp] awk -F, '{a[$1]=$2} END{for (i in a) print i > a[i]}' file
[jaypal:~/Temp] ls file*
file file1.txt file2.txt
[jaypal:~/Temp] cat file1.txt
apple
banana
[jaypal:~/Temp] cat file2.txt
carrot
Update:
You can also do something like this (awk's > truncates an output file only when it is first opened and appends afterwards, so a single pass writes each target file fresh) -
awk -F, '{print $1 > $2}' INPUT_FILE

Pure Bash, under the assumption that all target files are empty or non-existent:
while IFS=',' read -r item file ; do
echo "$item" >> "$file"
done < "$infile"
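If that assumption doesn't hold, one sketch (assuming bash 4+ for associative arrays) truncates each target file the first time it appears:
declare -A seen
while IFS=',' read -r item file ; do
    if [[ -z ${seen[$file]} ]]; then
        : > "$file"          # truncate on first sight
        seen[$file]=1
    fi
    echo "$item" >> "$file"
done < "$infile"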

sed loves this stuff (though beware: piping generated commands to sh will break if a field contains spaces or shell metacharacters)...
sed "s%\(.*\),\(.*\)%echo \1 >> \2 %" inputfile.txt | sh

Related

bash paste: loop through pairs of files based on wildcard, generate separate output files

I am trying to get the paste command to loop through pairs of files, pasting them together and outputting each pair as a unique file.
I've tried a lot of things, here are a few:
for i in *_temp4.csv; do paste *_temp4.csv *_temp44.csv > ${i}_out.csv; done
#Each output contains each input file (rather than pairs). Obviously this is because of the * wildcard
for i in *_temp2.csv_temp4.csv; do paste $_temp2_temp4.csv $_temp3_temp44.csv > ${i}_out.csv; done
no error, empty output files
for i in *_temp2.csv_temp4.csv; do paste ${_temp2_temp4.csv} ${_temp3_temp44.csv} > ${i}_out.csv; done
output:
combo15.awk: line 12: ${_temp2_temp4.csv}: bad substitution
I think I must be missing something very basic about how $ gets used, but I've been googling all night to no avail.
Here is my entire code, for context, although I don't see why the previous lines should influence anything about this.
for i in *.dat; do awk 'NR > 23 { print }' ${i} > ${i}_temp1.csv; done
for i in *_temp1.csv; do awk 'BEGIN{OFS=FS=","}$2==0{$2="between"}BEGIN{OFS=FS=","}$2==1{$2="lego"}BEGIN{OFS=FS=","}$2==2{$2="pin"}BEGIN{OFS=FS=","}$2==3{$2="dice"}BEGIN{OFS=FS=","}$2==4{$2="jack"}BEGIN{OFS=FS=","}$2==8{$2="escape"}{print}' ${i} > ${i}_temp2.csv; done
for i in *_temp2.csv; do awk -v OFS="," '{$4 = $1 - prev1; prev1 = $1; print;}' ${i} > ${i}_temp3.csv; done
for i in *_temp2.csv; do awk -F "," 'BEGIN{print "new line"}{print $2}' ${i} > ${i}_temp4.csv; done
for i in *_temp3.csv; do awk -F "," '{print $5}' ${i} > ${i}_temp44.csv; done
for i in *_temp2.csv_temp4.csv; do paste $_temp2_temp4.csv $_temp3_temp44.csv > ${i}_out.csv; done
Your problem is that the names of your files grow uncontrollably: each pass appends another suffix to names that already carry the previous ones. Also, $_temp2_temp4.csv expands the (unset, hence empty) variable $_temp2_temp4 followed by the literal .csv; to build a name from the loop variable you need ${i}_temp4.csv. This change should solve the problem:
for i in *.dat; do awk 'NR > 23 { print }' ${i} > ${i}_temp1.csv; done
for i in *.dat; do awk 'BEGIN{OFS=FS=","}$2==0{$2="between"}BEGIN{OFS=FS=","}$2==1{$2="lego"}BEGIN{OFS=FS=","}$2==2{$2="pin"}BEGIN{OFS=FS=","}$2==3{$2="dice"}BEGIN{OFS=FS=","}$2==4{$2="jack"}BEGIN{OFS=FS=","}$2==8{$2="escape"}{print}' ${i}_temp1.csv > ${i}_temp2.csv; done
for i in *.dat; do awk -v OFS="," '{$4 = $1 - prev1; prev1 = $1; print;}' ${i}_temp2.csv > ${i}_temp3.csv; done
for i in *.dat; do awk -F "," 'BEGIN{print "new line"}{print $2}' ${i}_temp2.csv > ${i}_temp4.csv; done
for i in *.dat; do awk -F "," '{print $5}' ${i}_temp3.csv > ${i}_temp44.csv; done
for i in *.dat; do paste ${i}_temp4.csv ${i}_temp44.csv > ${i}_out.csv; done
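A possible further cleanup (a sketch, assuming the same *.dat inputs) is to strip the .dat suffix once with parameter expansion, so every derived name is built from a single base and all steps can live in one loop:
for i in *.dat; do
    base=${i%.dat}    # e.g. run1.dat -> run1 (hypothetical name)
    awk 'NR > 23' "$i" > "${base}_temp1.csv"
    # ... the remaining steps read and write ${base}_tempN.csv the same way
    paste "${base}_temp4.csv" "${base}_temp44.csv" > "${base}_out.csv"
done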

How to split a text file on a delimiter into multiple files in unix?

I have a text file that looks like this:
input_file
1|abc
2|def
3|ghi
n|etc...
I need to split this up into two files on the pipe delimiter. So this is the expected output:
File_1:
1
2
3
n
File_2:
abc
def
ghi
etc
I do not know how many lines the input file will have. How do you achieve this in ksh or bash?
Thank you.
awk would be suitable for this task:
awk -F\| '{print $1 > "File_1"; print $2 > "File_2"}' input_file
This splits your text on the "|" and prints each column to the respective file (the backslash in -F\| only stops the shell from treating | as a pipe).
If there were more than two fields, you may prefer to use a loop instead:
awk -F\| '{for(i=1;i<=NF;++i) print $i > "File_" i}' input_file
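For the sample input above, this produces both files in one pass:
$ awk -F\| '{for(i=1;i<=NF;++i) print $i > "File_" i}' input_file
$ cat File_1
1
2
3
n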
cut -d '|' -f 1 input_file > File_1
cut -d '|' -f 2 input_file > File_2
Only with bash:
while IFS='|' read -r A B; do echo "$A" >>File_1; echo "$B" >>File_2; done <input_file
Here is another solution using cut; the cat is not strictly necessary, but it keeps the pipeline reading left to right:
cat input_file | cut -d '|' -f1 > File_1
cat input_file | cut -d '|' -f2 > File_2
Or you can put them together in one line with tee and process substitution (bash-specific), reading the input only once:
cat input_file | tee >(cut -d '|' -f1 > File_1) | cut -d '|' -f2 > File_2

Ignore empty fields

Given this file
$ cat foo.txt
,,,,dog,,,,,111,,,,222,,,333,,,444,,,
,,,,cat,,,,,555,,,,666,,,777,,,888,,,
,,,,mouse,,,,,999,,,,122,,,133,,,144,,,
I can print the first non-empty field (which lands in $5) like so
$ awk -F, '{print $5}' foo.txt
dog
cat
mouse
However, I would like to ignore those empty fields so that I can refer to it like this
$ awk -F, '{print $1}' foo.txt
You can use a regex as the field separator, like this:
$ awk -F',+' '{print $2}' file
dog
cat
mouse
Similarly, you can use $3, $4 and so on. $1 cannot be used in this case because each record begins with the delimiter.
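You can see why by printing $1 between brackets; the leading run of commas leaves it empty:
$ awk -F',+' '{print "[" $1 "]"}' foo.txt
[]
[]
[]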
With GNU awk, you can instead describe the fields themselves (anything that is not a comma) using FPAT:
$ awk '{print $1}' FPAT='[^,]+' foo.txt
dog
cat
mouse
You can squeeze multiple repetitions of the delimiter into one with tr -s ',':
$ tr -s ',' < your_file
,dog,111,222,333,444,
,cat,555,666,777,888,
,mouse,999,122,133,144,
And then you can access dog, etc. with (the leading comma is kept, so the first field is still empty, hence $2):
$ tr -s ',' < your_file | awk -F, '{print $2}'
dog
cat
mouse
The same in Perl, autosplitting on runs of commas:
perl -anF,+ -e 'print "$F[1]\n"' foo.txt
dog
cat
mouse
This one is still awk, but you get to use $1 instead of $2:
awk -F, '{gsub(/^,*|,*$/,"");gsub(/,+/,",");print $1}' your_file
tested below:
> cat temp
,,,,dog,,,,,111,,,,222,,,333,,,444,,,
,,,,cat,,,,,555,,,,666,,,777,,,888,,,
,,,,mouse,,,,,999,,,,122,,,133,,,144,,,
execution:
> awk -F, '{gsub(/^,*|,*$/,"");gsub(/,+/,",");print $1}' temp
dog
cat
mouse

get the file name from the path

I have a file file.txt with the following structure:
./a/b/c/sdsd.c
./sdf/sdf/wer/saf/poi.c
./asd/wer/asdf/kljl.c
./wer/asdfo/wer/asf/asdf/hj.c
How can I get only the .c file names, without the paths?
i.e., my output will be
sdsd.c
poi.c
kljl.c
hj.c
You can do this simply with awk:
Set the field separator FS="/" and $NF will print the last field of every record.
awk 'BEGIN{FS="/"} {print $NF}' file.txt
or
awk -F/ '{print $NF}' file.txt
Or, you can do it with cut and the unix command rev: reverse each line, cut the first /-delimited field (the reversed filename), and reverse it back:
rev file.txt | cut -d '/' -f1 | rev
You can use the basename command:
basename /a/b/c/sdsd.c
will give you sdsd.c
For a list of files in file.txt, this will do:
while IFS= read -r line; do basename "$line"; done < file.txt
Using sed:
$ sed 's|.*/||g' file
sdsd.c
poi.c
kljl.c
hj.c
The simplest one ($NF is the last column of the current line):
awk -F/ '{print $NF}' file.txt
or using bash & parameter expansion (${file##*/} strips the longest prefix matching */, i.e. everything up to the last slash):
while read -r file; do echo "${file##*/}"; done < file.txt
or bash with basename:
while read -r file; do basename "$file"; done < file.txt
OUTPUT
sdsd.c
poi.c
kljl.c
hj.c
Perl solution:
perl -F/ -ane 'print $F[-1]' your_file
Also you can use sed:
sed 's/.*[/]//g' your_file

Having trouble with awk

I am trying to assign the output of an awk statement to a variable, and I am getting an error. Here is the code:
for i in `checksums.txt` do
md=`echo $i|awk -F'|' '{print $1}'`
file=`echo $i|awk -F'|' '{print $2}'`
done
Thanks
for i in `checksums.txt` do
This will try to execute checksums.txt, which is very probably not what you want. If you want the contents of that file, do:
for i in $(<checksums.txt) ; do
md=$(echo $i|awk -F'|' '{print $1}')
file=$(echo $i|awk -F'|' '{print $2}')
# ...
done
(This is not optimal, and will not do what you want if the file has lines with spaces in them, but at least it should get you started.)
You don't need external programs for this:
while IFS=\| read m f; do
printf 'md is %s, filename is %s\n' "$m" "$f"
done < checksums.txt
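For example, with a hypothetical checksums.txt (the hash here is the md5 of an empty file):
$ cat checksums.txt
d41d8cd98f00b204e9800998ecf8427e|empty.txt
$ while IFS=\| read m f; do printf 'md is %s, filename is %s\n' "$m" "$f"; done < checksums.txt
md is d41d8cd98f00b204e9800998ecf8427e, filename is empty.txt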
Edited as per the new requirement (printing duplicate checksums).
Given the file is already sorted, you could use uniq, comparing only the first 33 characters (assuming GNU uniq and a 32-character md5 hash plus the | delimiter):
uniq -Dw33 checksums.txt
If GNU uniq is not available, you can use awk
(this version doesn't require a sorted input):
awk 'END {
for (M in m)
if (m[M] > 1)              # only checksums that occur more than once
print M, "==>", f[M]
}
{
m[$1]++                            # count occurrences of each checksum
f[$1] = f[$1] ? f[$1] FS $2 : $2   # collect the filenames that share it
}' checksums.txt
Or split the line with set (note that this mangles fields containing spaces or glob characters):
while read -r line
do
set -- $(echo "$line" | tr '|' ' ')
echo md is $1, file is $2
done < checksums.txt
