join file based on two columns for ALL files in directory - bash

I have four files in my directory: say a.txt; b.txt; c.txt; d.txt. I would like to join every file with all other files based on two common columns (i.e. join a.txt with b.txt, c.txt and d.txt; join b.txt with a.txt, c.txt and d.txt; join c.txt with a.txt, b.txt and d.txt). To do this for two of the files I can do:
join -j 2 <(sort -k2 a.txt) <(sort -k2 b.txt) > a_b.txt
How do I write this in a loop for all files in the directory? I've tried the code below but that's not working.
for i j in *; do join -j 2 <(sort -k2 $i) <(sort -k2 $j) > ${i_j}.txt
Any help/direction would be helpful! Thank you.

This might be a way to do it:
#!/bin/bash
files=( *.txt )
for i in "${files[@]}";do
for j in "${files[@]}";do
if [[ "$i" != "$j" ]];then
join -j 2 <(sort -k2 "$i") <(sort -k2 "$j") > "${i%.*}_$j"
fi
done
done
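Note that join matches on a single field, so the loop above joins on column 2 only. If the match really has to involve two columns, one option is to let awk build a composite key. The sketch below is one way to do that, assuming whitespace-separated files whose key columns default to 1 and 2; the function name join_on_two_cols and the column defaults are illustrative and not taken from the question.

```bash
#!/bin/bash
# Hypothetical helper: join FILE1 and FILE2 on two columns at once.
# Usage: join_on_two_cols FILE1 FILE2 [COL_A COL_B]   (columns default to 1 and 2)
join_on_two_cols() {
    local f1=$1 f2=$2 a=${3:-1} b=${4:-2}
    awk -v a="$a" -v b="$b" '
        # First file: remember each row under its two-column key
        # (if a key repeats, the last row with that key wins).
        NR == FNR { row[$a, $b] = $0; next }
        # Second file: print both rows when the key was seen in the first file.
        ($a, $b) in row { print row[$a, $b], $0 }
    ' "$f1" "$f2"
}
```

Inside the loop, join_on_two_cols "$i" "$j" would take the place of the join line. Unlike join, this does not need sorted input.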

Related

Merge two files using Paste, but after the N row

I have two xlsx files. They are different, with only one thing in common: the date. I must convert them to csv and merge them together.
file1
01/01/2013;horse;penguin
02/01/2013;cat;dog
03/01/2013;frog;whale
04/01/2013;mouse;bird
[...]
until nowadays, may 2017
No animals were hurt in writing this sample.
file2
14/02/2013;banana;cherry
15/02/2013;apple;mango
16/02/2013;orange;strawberry
[...]
until nowadays, may 2017
This is the result I wish to achieve:
But the dates are in epoch (here I leave them not epoch, so you can read them).
01/01/2013;horse;penguin
02/01/2013;cat;dog
03/01/2013;frog;whale
04/01/2013;mouse;bird
[...]
13/02/2013;fish;elephant
14/02/2013;bear;owl;banana;cherry
15/02/2013;monkey;bat;apple;mango
[...]
The following is the script I made.
1) the dates need to be in epoch format
2) sheet2 does not contain the date; the date is printed in the final file for both, and I use the date from sheet1
#!/bin/bash
# VARS #
XLSX=$1
SHEET1="sheet1"
SHEET2="sheet2"
P_PATH=/tmp/extract
EXTRACTCSV=$P_PATH/extract.csv
TMP_CSV=$P_PATH/temp.csv
CSV_SPLIT=$P_PATH/processed.csv
CSV_FINAL=$P_PATH/${XLSX}.csv
# START #
[ -d $P_PATH ] || mkdir -p $P_PATH
rm -rfv $P_PATH/*
########################
# ssconvert on sheet 1 #
########################
ssconvert --export-type=Gnumeric_stf:stf_assistant -O 'sheet='$SHEET1' separator=; format=automatic eol=unix' ${XLSX} ${EXTRACTCSV}"."${SHEET1}
if [ $? -gt 0 ]; then
echo "Ssconvert on $SHEET1 failed. Exiting."
exit
fi
########################
# ssconvert on sheet 2 #
########################
ssconvert --export-type=Gnumeric_stf:stf_assistant -O 'sheet='$SHEET2' separator=; format=automatic eol=unix' ${XLSX} ${EXTRACTCSV}"."${SHEET2}
if [ $? -gt 0 ]; then
echo "Ssconvert on $SHEET2 failed. Exiting."
exit
fi
######################
# Processing SHEET 1 #
######################
cat ${EXTRACTCSV}"."${SHEET1} | awk -F';' '{print $1";"$2";"$6}' > ${TMP_CSV}"."${SHEET1}
# Modify to EPOCH #
while read line; do
colDate=$(echo $line | awk -F';' '{print $1}')
colB=$(echo $line | awk -F';' '{print $2}' )
colF=$(echo $line | awk -F';' '{print $3}' )
# Skip when date not set
if [ -z ${colDate} ]; then
continue
fi
epoch_date=$(date +%s -ud ${colDate})
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}
done <${TMP_CSV}"."${SHEET1}
######################
# Processing SHEET 2 #
######################
cat ${EXTRACTCSV}"."${SHEET2} | awk -F';' '{print $12";"$14";"$17}' > ${CSV_SPLIT}.${SHEET2}
##########################
# Merge the csv together #
##########################
paste -d ';' ${CSV_SPLIT}.${SHEET1} ${CSV_SPLIT}.${SHEET2} | column -t > ${CSV_FINAL}
My Request:
The final command, the one to merge the 2 files together:
paste -d ';' ${CSV_SPLIT}.${SHEET1} ${CSV_SPLIT}.${SHEET2} | column -t > ${CSV_FINAL}
works, but the second file is pasted starting at the row of 01/01/2013.
I don't know how to modify the logic of this script so that the 2nd file is pasted starting from the row of 14/02/2013.
Can anyone help me?
It looks like you want to sort and merge the files by date.
File1:
sort -n -k3 -k2 -k1 -t '/' -o File1.sorted File1
File2:
sort -n -k3 -k2 -k1 -t '/' -o File2.sorted File2
Merge:
sort -n -m -k3 -k2 -k1 -t '/' -o result.sorted File1.sorted File2.sorted
OR as a single line using process substitution:
sort -n -m -k3 -k2 -k1 -t '/' <(sort -n -k3 -k2 -k1 -t '/' File1) <(sort -n -k3 -k2 -k1 -t '/' File2)
-n sorts the fields numerically instead of lexically.
-m merges two already-sorted files.
-k sorts by year, then month, then day (fields 3, 2, 1 respectively).
-t sets the delimiter.
EXAMPLE:
sort -m -k3 -k2 -k1 -t '/' <(sort -k3 -k2 -k1 -t '/' t2) <(sort -k3 -k2 -k1 -t '/' t1)
12/01/2012;banana;pear
15/02/2013;apple;mango
14/02/2013;banana;cherry
02/01/2013;cat;dog
03/01/2013;frog;whale
01/01/2013;horse;penguin
04/01/2013;mouse;bird
16/02/2013;orange;strawberry
13/03/2015;mango;papaya
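One thing worth checking in the commands above: a key written as -k3 runs from field 3 to the end of the line, which is why the example output is ordered by the text after the year rather than by month and day. A minimal sketch that bounds each key to a single field, assuming dd/mm/yyyy dates at the start of each line (the function name sort_by_date is illustrative):

```bash
#!/bin/bash
# Sort lines that start with dd/mm/yyyy by year, then month, then day.
# Each key is limited to one field (3,3 / 2,2 / 1,1) and compared numerically;
# for field 3 ("2013;horse;penguin") the numeric comparison uses the leading 2013.
sort_by_date() {
    sort -t '/' -k3,3n -k2,2n -k1,1n "$@"
}

# Example: sort_by_date File1 File2 > result.sorted
```

With no file arguments the function reads standard input, like sort itself.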
This is how I solved it:
if [ $epoch_date -le 1360713600 ]; then
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}.part1
else
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}
fi
[...]
##########################
# Merge the csv together #
##########################
cat ${CSV_SPLIT}.${SHEET1}.part1 > ${CSV_FINAL}
paste -d ';' ${CSV_SPLIT}.${SHEET1} ${CSV_SPLIT}.${SHEET2} | column -t >> ${CSV_FINAL}
I split file1 into 2 parts while reading it: one part contains the dates and values before 14 Feb, the other part contains the rest.
And well.. easy.
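An alternative that avoids hard-coding the cut-off date is to match rows by their date field instead of by line position. The following is only a sketch, assuming both files are ;-separated with the date (or epoch value) in the first field, as in the sample file1 and file2 from the question; the function name append_by_date is made up for illustration.

```bash
#!/bin/bash
# append_by_date FILE1 FILE2
# Print every row of FILE1; when FILE2 has a row with the same first field,
# append that row's remaining fields.
append_by_date() {
    awk -F';' -v OFS=';' -v f2="$2" '
        BEGIN {
            # Load FILE2 into a lookup table keyed by its first field.
            while ((getline line < f2) > 0) {
                i = index(line, ";")
                if (i == 0) continue
                extra[substr(line, 1, i - 1)] = substr(line, i + 1)
            }
        }
        # FILE1 rows: append the FILE2 fields when the date matches.
        { print (($1 in extra) ? $0 OFS extra[$1] : $0) }
    ' "$1"
}
```

Rows of FILE2 whose date never occurs in FILE1 are dropped, which matches the desired output shown in the question but may not suit every case.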

Control if a new file incoming in a folder with "comm"

I'm using comm in an infinite loop to see whether a new file has arrived in a folder, but I don't get the difference between the two listings. For example, when a file "a" arrives, I see this output:
a a.out a.txt b.txt test.cpp testshell.sh
a.out a.txt b.txt test.cpp testshell.sh
My code is this:
#! /bin/ksh
ls1=$(ls);
echo $ls1 > a.txt;
while [[ 1 > 0 ]] ; do
ls2=$(ls);
echo $ls2 > b.txt;
#cat b.txt;
#sort b.txt > b.txt;
#diff -u a.txt b.txt;
#diff -a --suppress-common-lines -y a.txt b.txt
comm -3 a.txt b.txt;
printf "\n";
ls1=$ls2;
echo $ls1 > a.txt;
#cat a.txt;
#sleep 2;
#sort a.txt > a.txt;
done
THANKS
#! /bin/ksh
set -vx
PreCycle="$( ls -1 )"
while true
do
ThisCycle="$( ls -1 )"
printf '%s\n' "${PreCycle}" "${ThisCycle}" | sort | uniq -u
PreCycle="${ThisCycle}"
sleep 10
done
This gives the added and removed entries without using any files. It could report only the new files in the same way, but uniq -f 1 failed (I don't understand why) when used on a list prefixed with + and - depending on the source.
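A likely reason the script in the question prints whole listings is that echo $ls1 without quotes flattens the listing onto a single line, so comm ends up comparing two one-line files. A sketch that keeps one name per line and uses comm -13 to print only the names that are new, assuming bash (or another shell with process substitution) and filenames without newlines; new_entries and watch_dir are illustrative names:

```bash
#!/bin/bash
# new_entries OLD_LISTING NEW_LISTING
# Print names present in NEW_LISTING but not in OLD_LISTING (one name per line).
new_entries() {
    [ -n "$2" ] || return 0     # nothing can be new in an empty listing
    LC_ALL=C comm -13 <(printf '%s\n' "$1" | LC_ALL=C sort) \
                      <(printf '%s\n' "$2" | LC_ALL=C sort)
}

# Polling loop built on it; the 2-second interval is an arbitrary choice.
watch_dir() {
    local prev cur
    prev=$(ls -1)
    while true; do
        sleep 2
        cur=$(ls -1)
        new_entries "$prev" "$cur"
        prev=$cur
    done
}
```

Sorting both sides under the same locale matters because comm expects its inputs in the collation order it uses itself.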

Taking line intersection of several files

I see comm can do 2 files and diff3 can do 3 files. I want to do for more files (5ish).
One way:
comm -12 file1 file2 >tmp1
comm -12 tmp1 file3 >tmp2
comm -12 tmp2 file4 >tmp3
comm -12 tmp3 file5
This process could be turned into a script
comm -12 $1 $2 > tmp1
for i in $(seq 3 1 $# 2>/dev/null); do
comm -12 tmp`expr $i - 2` $(eval echo '$'$i) >tmp`expr $i - 1`
done
if [ $# -eq 2 ]; then
cat tmp1
else
cat tmp`expr $i - 1`
fi
rm tmp*
This seems like poorly written code, even to a newbie like me. Is there a better way?
It's quite a bit more convoluted than it has to be. Here's another way of doing it.
#!/bin/bash
# Create some temp files to avoid trashing and deleting tmp* in the directory
tmp=$(mktemp)
result=$(mktemp)
# The intersection of one file is itself
cp "$1" "$result"
shift
# For each additional file, intersect with the intermediate result
for file
do
comm -12 "$file" "$result" > "$tmp" && mv "$tmp" "$result"
done
cat "$result" && rm "$result"
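Note that comm expects its inputs to be sorted. If the files are not sorted, one alternative is to count in how many files each distinct line occurs. This is a sketch of that different technique (awk counting rather than comm), with intersect_lines as an illustrative name:

```bash
#!/bin/bash
# intersect_lines FILE...
# Print the lines that occur in every FILE, in sorted order.
intersect_lines() {
    awk -v n="$#" '
        FNR == 1 { idx++ }                      # a new input file starts
        # Count each distinct line at most once per file.
        !((idx, $0) in seen) { seen[idx, $0] = 1; count[$0]++ }
        # Keep only lines that were counted in all n files.
        END { for (line in count) if (count[line] == n) print line }
    ' "$@" | sort
}

# Example: intersect_lines file1 file2 file3 file4 file5
```

This holds all distinct lines in memory, so the comm chain is probably the better fit for very large, already-sorted files.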

Merge files with sort -m and give error if files not pre-sorted?

need some help out here.
I have two files,
file1.txt >
5555555555
1111111111
7777777777
file2.txt >
0000000000
8888888888
2222222222
4444444444
3333333333
when I run,
$ sort -m file1.txt file2.txt > file-c.txt
the output file-c.txt contains the merged contents of file1 and file2, but it is not sorted.
file-c.txt >
0000000000
5555555555
1111111111
7777777777
8888888888
2222222222
4444444444
3333333333
When this happens I need an error saying that the files (file1 and file2) are not sorted and that they cannot be merged until they have been sorted. So when I run $ sort -m file1.txt file2.txt > file-c.txt, I should get an error saying that it cannot merge file1 and file2 into file-c because they are not yet sorted.
Hope you guys understand me :D
If I understand what you're asking, you could do this:
DIFF1=$(diff <(cat file1.txt) <(sort file1.txt))
DIFF2=$(diff <(cat file2.txt) <(sort file2.txt))
if [ "$DIFF1" != "" ]; then
echo 'file1 is not sorted'
elif [ "$DIFF2" != "" ]; then
echo 'file2 is not sorted'
else
sort -m file1.txt file2.txt
fi
This works in Bash (and other shells) and does the following:
Set the DIFF1 variable to the output of a diff of a cat and a sort of file1 (this will be empty if the cat and the sort are the same, meaning the file is sorted)
Set the DIFF2 variable in the same manner as DIFF1 but for file2
Do a simple if .. elif .. else to check and see whether file1 AND file2 are sorted, and if so do a command line sort of the two
Is this what you were looking for?
EDIT: Alternately, per @twalberg, if your version of sort supports it, you can do this:
if ! sort -c file1.txt
then echo 'file1 is not sorted'
elif ! sort -c file2.txt
then echo 'file2 is not sorted'
else
sort -m file1.txt file2.txt
fi
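The same check extends to any number of files and can report the problem through the exit status, which is what a calling script would usually test. A sketch, assuming a sort that supports -c (the option is specified by POSIX); the function name merge_if_sorted and the message text are illustrative:

```bash
#!/bin/bash
# merge_if_sorted FILE...
# Merge the files with sort -m, but only if every one of them is already sorted.
merge_if_sorted() {
    local f
    for f in "$@"; do
        # sort -c exits non-zero when the file is out of order.
        if ! sort -c "$f" 2>/dev/null; then
            echo "error: $f is not sorted; refusing to merge" >&2
            return 1
        fi
    done
    sort -m "$@"
}

# Example: merge_if_sorted file1.txt file2.txt > file-c.txt || echo "merge failed"
```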

Nested for loop comparing files

I am trying to write a bash script that looks at two files with the same name, each in a different directory.
I know this can be done with diff -r, however, I would like to take everything that is in the second file that is not in the first file and output it into an new file (also with the same file name)
I have written a (nested) loop with a grep command but it's not good and gives back a syntax error:
#!/bin/bash
FILES=/Path/dir1/*
FILES2=/Path/dir2/*
for f in $FILES
do
for i in $FILES2
do
if $f = $i
grep -vf $i $f > /Path/dir3/$i
done
done
Any help much appreciated.
try this
#!/bin/bash
cd /Path/dir1/
for f in *; do
comm -13 <(sort "$f") <(sort "/Path/dir2/$f") > "/Path/dir3/$f"
done
if syntax in shell is
if test_command;then commands;fi
commands are executed if test_command exit code is 0
if [ $f = $i ] ; then grep ... ; fi
but in your case it will be more efficient to get the file name
for i in $FILES; do
f=/Path/dir2/`basename $i`
grep
done
finally, maybe this will be more efficient than grep -v
comm -13 <(sort $f) <(sort $i)
comm -13 will get everything which is in the second and not in first ; comm without arguments generates 3 columns of output : first is only in first, second only in second and third what is common.
-13 or -1 -3 removes first and third column
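Putting these pieces together with quoting, so that filenames containing spaces also work, could look like the sketch below. It assumes bash and takes the three directories as arguments; the function name only_in_second is illustrative.

```bash
#!/bin/bash
# only_in_second DIR1 DIR2 OUTDIR
# For each file name present in both DIR1 and DIR2, write the lines that occur
# in the DIR2 copy but not in the DIR1 copy to OUTDIR/<name>.
only_in_second() {
    local dir1=$1 dir2=$2 out=$3 f name
    mkdir -p "$out"
    for f in "$dir1"/*; do
        name=${f##*/}                                  # strip the directory part
        # Skip names that are not regular files in both directories.
        [ -f "$f" ] && [ -f "$dir2/$name" ] || continue
        comm -13 <(sort "$f") <(sort "$dir2/$name") > "$out/$name"
    done
}

# Example: only_in_second /Path/dir1 /Path/dir2 /Path/dir3
```

Because both inputs are sorted first, the output lines come out in sorted order rather than in the order of the original file.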
#!/bin/bash
DIR1=/Path/dir1
DIR2=/Path/dir2
DIR3=/Path/dir3
for f in $DIR1/*
do
for i in $DIR2/*
do
if [ "$(basename $f)" = "$(basename $i)" ]
then
grep -vf "$i" "$f" > "$DIR3/$(basename $i)"
fi
done
done
This assumes no special characters (e.g. whitespace) in filenames; use double quotes if that is unacceptable:
a=/path/dir1
b=/path/dir2
for i in $a/*; do test -e $b/${i##*/} &&
diff $i $b/${i##*/} | sed -n '/^< /s///p'; done

Resources