comparing csv files - shell

I want to write a shell script to compare two .csv files. The first contains filename,path; the second contains filename,path,target. I want to compare the two files and output the target wherever a file from the first .csv exists in the second .csv.
Ex.
a.csv
build.xml,/home/build/NUOP/project1
eesX.java,/home/build/adm/acl
b.csv
build.xml,/home/build/NUOP/project1,M1
eesX.java,/home/build/adm/acl,M2
ddexse3.htm,/home/class/adm/33eFg
I want the output to be something like this.
M1 and M2
Please help
Thanks,

If you don't necessarily need a shell script, you can easily do it in Python like this:
import csv

# Collect every (filename, path) pair from the first file.
seen = set()
with open('a.csv') as f:
    for row in csv.reader(f):
        seen.add(tuple(row))

# Print the target wherever the first two fields match a seen pair.
with open('b.csv') as f:
    for row in csv.reader(f):
        if tuple(row[:2]) in seen:
            print(row[2])
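Saved as, say, compare.py (a name chosen here just for illustration), it prints the targets for the sample data:
$ python3 compare.py
M1
M2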

If those M1 and M2 values are always in fields 3 and 5, you can try this:
awk -F"," 'FNR==NR{
split($3,b," ")       # first word of field 3, e.g. M1
split($5,c," ")       # first word of field 5, e.g. M2
a[$1]=b[1]" "c[1]     # remember both against the filename
next
}
($1 in a){
print "found: " $1" "a[$1]
}' file2.txt file1.txt
output
$ cat file2.txt
build.xml,/home/build/NUOP/project1,M1 eesX.java,/home/build/adm/acl,M2 ddexse3.htm,/home/class/adm/33eFg
filename, blah,M1 blah, blah, M2 blah , end
$ cat file1.txt
build.xml,/home/build/NUOP/project1 eesX.java,/home/build/adm/acl
$ ./shell.sh
found: build.xml M1 M2
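For the record layout actually shown in the question (one filename,path record per line in a.csv and filename,path,target in b.csv), a simpler awk join should do — a minimal sketch, assuming exact string matches on the first two fields:
awk -F, 'FNR==NR { seen[$1","$2]; next } ($1","$2) in seen { print $3 }' a.csv b.csv
With the sample files this prints M1 and M2, one per line.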

Try http://sourceforge.net/projects/csvdiff/
Quote:
csvdiff is a Perl script to diff/compare two csv files with the possibility to select the separator. Differences will be shown like: "Column XYZ in record 999" is different. After this, the actual and the expected result for this column will be shown.

Related

How Can I Use Sort or another bash cmd To Get 1 line from all the lines if 1st 2nd and 3rd Field are The same

I have a file named file.txt
$cat file.txt
1. /abc/cde/go/ftg133333.jpg
2. /abc/cde/go/ftg24555.jpg
3. /abc/cde/go/ftg133333.gif
4. /abt/cte/come/ftg24555.jpg
5. /abc/cde/go/ftg133333.jpg
6. /abc/cde/go/ftg24555.pdf
MY GOAL: To get only one line from any group of lines whose first, second and third PATH are the same and which have the same file EXTENSION.
Note each PATH is separated by a forward slash "/". E.g. in the first line of the list, the first PATH is abc, the second PATH is cde and the third PATH is go.
The file EXTENSION is .jpg, .gif, .pdf... always at the end of the line.
HERE IS WHAT I TRIED
sort -u -t '/' -k1 -k2 -k3
My thoughts
Using / as a delimiter gives me 4 fields in each line. Sorting with "-u" will remove all but one line with a unique first, second and third field/PATH. But obviously, this doesn't take the EXTENSION (jpg, pdf, gif) into account.
MY QUESTION
I need a way to keep only one of the lines if the first, second and third fields are the same and the EXTENSION is the same, using "/" as the delimiter to divide each line into fields. I want to output the result to another file, say file2.txt.
In file2.txt, how do I add a word, say "KALI", before the extension on each line, so it looks something like /abc/cde/go/ftg133333KALI.jpg, using line 1 of file.txt above as an example.
Desired Output
/abc/cde/go/ftg133333KALI.jpg
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
COMMENT
Line 1, 2 & 5 have the same 1st, 2nd and 3rd field, with the same file extension ".jpg", so only line 1 should be in the output.
Line 3 is in the output even though it has the same 1st, 2nd and 3rd field as 1, 2 and 5, because the extension ".gif" is different.
Line 4 has a different 1st, 2nd and 3rd field, hence it is in the output.
Line 6 is in the output even though it has the same 1st, 2nd and 3rd field as 1, 2 and 5, because the extension ".pdf" is different.
$ awk '{ # using awk
n=split($0,a,/\//) # split by / to get all path components
m=split(a[n],b,".") # split last by . to get the extension
}
m>1 && !seen[a[2],a[3],a[4],b[m]]++ { # if ext exists and is unique with 3 1st dirs
for(i=2;i<=n;i++) # loop component parts and print
printf "/%s%s",a[i],(i==n?ORS:"")
}' file
Output:
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
I split by / separately from .s in case there are .s in dir names.
Missed the KALI part:
$ awk '{
n=split($0,a,/\//)
m=split(a[n],b,".")
}
m>1&&!seen[a[2],a[3],a[4],b[m]]++ {
for(i=2;i<n;i++)
printf "/%s",a[i]
for(i=1;i<=m;i++)
printf "%s%s",(i==1?"/":(i==m?"KALI.":".")),b[i]
print ""
}' file
Output:
/abc/cde/go/ftg133333KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg24555KALI.pdf
Using awk:
$ awk -F/ '{ split($5, ext, "\\.")
if (!(($2,$3,$4,ext[2]) in files)) files[$2,$3,$4,ext[2]]=$0
}
END { for (f in files) {
sub("\\.", "KALI.", files[f])
print files[f]
}}' input.txt
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
/abc/cde/go/ftg133333KALI.jpg
(The shuffled order is expected: for (f in files) visits keys in an unspecified order.)
Another awk:
$ awk -F'[./]' '!a[$2,$3,$4,$NF]++' file
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
assumes . doesn't exist in directory names (not necessarily true in general).
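If you also want the KALI insertion with this compact version, one way (an untested sketch, under the same no-dots-in-directories assumption) is:
awk -F'[./]' '!a[$2,$3,$4,$NF]++ { sub(/\.[^.]*$/, "KALI&"); print }' file
Here sub(/\.[^.]*$/, "KALI&") prepends KALI to the final .extension before printing; the dedup key is evaluated on the original line first.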

Read and sum occurrence lines in bash

I have a file containing the lines below, with fields separated by commas:
filename.txt
usernameA,10,10
usernameB,20,20
usernameA,10,10
usernameB,20,20
usernameC,10,10
I just want to parse the file and add up the numbers per username when it occurs multiple times, so the result should be:
usernameA=40
usernameB=80
usernameC=20
How can I achieve this result using a Bash script?
Thank you,
$ awk -F, '{a[$1]+=$2+$3}END{for(x in a)print x "=" a[x]}' file
usernameA=40
usernameB=80
usernameC=20
This works for the given example.
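Since the question asks for a Bash script specifically, here is a pure-Bash variant using an associative array — a sketch that assumes bash 4+ and exactly two numeric fields after each username:
#!/bin/bash
declare -A sum                   # username -> running total
while IFS=, read -r user n1 n2; do
    (( sum[$user] += n1 + n2 ))  # add both numeric fields
done < filename.txt
for user in "${!sum[@]}"; do
    echo "$user=${sum[$user]}"
done
As with the awk version's for-in loop, the output order of an associative array is unspecified.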

Adding an extra value into CSV data, according to filename

Let's say I have the following filename formats:
CO#ATH2000.dat, CO#MAR2000.dat
Each of these has data like the following:
....
"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
Now I also have the following .sh file that merges ALL those .dat files into one single output .dat file:
for filename in `ls CO#*`; do
cat $filename >> CO#combined.dat
done
Now here is the problem. I want each line inside CO#combined.dat to have a 'standard' value, before the start of the values, according to the source filename. For example, I want lines from each file with ATH in its filename to start with 3, and lines from each file with MAR in its filename to start with 22,.
So the CO#combined.dat should be something like this:
....
3,"12-02-1984",3.8,4.1,3.8,3.8,3.8,3.7,4.1,4.3,3.8,4.1,5.0,4.8,4.5,4.3,4.3,4.3,4.1,4.5,4.3,4.3,4.3,4.5,4.3,4.1
3,"13-02-1984",3.7,4.3,4.3,4.3,4.1,4.3,4.5,4.8,4.8,5.0,5.2,5.0,5.2,5.2,5.2,4.8,4.8,4.8,4.8,4.8,4.8,4.8,4.5,4.3
20,"14-02-1984",3.8,4.1,3.8,3.8,3.8,3.8,3.8,4.2,4.5,4.5,4.1,3.6,3.6,3.4,3.4,3.2,3.4,3.2,3.2,3.2,2.9,2.7,2.5,2.2
20,"15-02-1984",2.2,2.2,2.0,2.0,2.0,1.8,2.1,2.6,2.6,2.5,2.4,2.4,2.4,2.5,2.7,2.7,2.6,2.6,2.7,2.6,2.8,2.8,2.8,2.8
..........
So in conclusion, I want the script to do the above procedure!
Thanks in advance!
With awk you can take advantage of the built-in FILENAME variable along with the fact that you can supply multiple files to a given invocation. awk processes each file in turn, setting FILENAME to the name of the file whose records are currently being read.
With that you can set your prefix according to whatever pattern you wish to search for in the file name. Finally you can print the prefix and the original record.
Here's a demonstration on simplified versions of your sample input:
$ cat CO\#ATH2000.dat
1
2
3
$ cat CO\#MAR2000.dat
A
B
C
$ awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO*.dat
3,1
3,2
3,3
22,A
22,B
22,C
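Applied to the real files from the question, the same one-liner becomes (a sketch, assuming the CO#ATH2000.dat / CO#MAR2000.dat names above; run it before CO#combined.dat exists, or the glob will pick the combined file up on a second pass):
awk 'FILENAME ~ /MAR/ {pre=22} FILENAME ~ /ATH/ {pre=3} { print pre "," $0 }' CO#*.dat > CO#combined.dat
Note that pre simply carries over from the previous file if a name matches neither pattern.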
This can also be done simply with bash and sed:
for f in CO#*; do
    case ${f:3:3} in
        ATH) k=3 ;;
        *) k=22 ;;
    esac
    sed "s/^/$k,/" "$f" >> all
done
${f:3:3} extracts the code ATH or MAR from the filename (it's the bash substring syntax); case converts the code to its numerical counterpart; sed inserts the numerical value and a comma at the beginning of each line.
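For example:
$ f='CO#ATH2000.dat'
$ echo "${f:3:3}"
ATH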

combine lines of csv in bash

I want to create a new csv file for each city by combining several csv files that share the same rows and columns; one column holds the name of the city, which repeats across all the csv files...
For example,
I have files named after the date, YYYYMMDD: 20140713.csv, 20140714.csv, 20140715.csv...
They have the same structure, the same numbers of rows and columns; for example, 20140713.csv:
City, Data, TMinreal, TMaxreal, TMinext, TMaxext, DiffTMin, DiffTMax
Milano,20140714,19.0,28.8,18,27,1,1.8
Rome,20140714,18.1,29.3,14,29,4.1,0.3
Pisa,20140714,10.8,27.5,8,29,2.8,-1.5
Venecia,20140714,21.1,29.1,16,27,5.1,2.1
I want to combine all these csv files and get one csv file per city, such as Milano.csv, holding the information about that city from all the combined csv files.
For example, if I combine 20140713.csv, 20140714.csv, 20140715.csv, for Milano.csv
Milano,20140713,19.0,28.8,18,26,1,2.8
Milano,20140714,19.0,28.8,20,27,-1,1.8
Milano,20140715,21.0,26.8,19,27,2,-0.2
Any idea? Thank you.
untested, but this should work:
awk -F, 'FNR==1{next} {file = $1".csv"; print > file}' 20*.csv
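If each per-city file should keep the header row as well, a variant (an untested sketch) is:
awk -F, 'FNR==1 {hdr=$0; next}
         {f=$1".csv"; if (!(f in seen)) {print hdr > f; seen[f]}; print > f}' 20*.csv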
You can have this bash script:
#!/bin/bash
for FILE; do
    {
        read  ## Skip header
        while IFS=, read -r A B; do
            echo "$A,$B" >> "$A".csv
        done
    } < "$FILE"
done
Then run as:
bash script.sh file1.csv file2.csv ...

Using the first field in AWK as file name

The dataset is one big file with three columns: a section ID, something irrelevant, and a line of text. An example could look like the following:
A01 001 This is a simple test.
A01 002 Just for exemplary purpose.
A01 003
A02 001 This is another text
I want to use the first column (in this example A01 and A02, which represent different texts) as the file name, whose content is everything on that line after the second column.
The example above should result in two files, one named A01 with content:
This is a simple test.
Just for exemplary purpose.
and another one A02 with content:
This is another text
My questions are:
Is AWK the appropriate program for this task? Or are there more convenient ways of doing this?
How would this task be done?
awk is perfect for this kind of task. If you do not mind having some leading spaces, you can use:
awk '{f=$1; $1=$2=""; print > f}' file
This empties the first and second fields and then prints the whole line into the file f, whose name was saved earlier from the first field.
And if those spaces are bothersome, you can delete them with sub(/^ +/, "") (a plain sub(" ", "") would only remove the first of the two leading spaces):
awk '{f=$1; $1=$2=""; sub(/^ +/, ""); print > f}' file
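Note that the A01 003 line has no text after the second column, so the commands above write an empty line into A01. To match the expected output exactly, you can skip such lines — a sketch:
awk 'NF>2 {f=$1; $1=$2=""; sub(/^ +/, ""); print > f}' file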
Bash will work too, though probably more slowly than awk if that's a concern:
while read -r id num line; do
    [[ $line ]] && echo "$line" >> "$id"
done < file
The [[ $line ]] test also skips records with no text, such as A01 003.
