How to compare two files word by word and store the differing words in a result file - shell

Suppose there are two files:
File1.txt
My name is Anamika.
File2.txt
My name is Anamitra.
I want result file storing:
Result.txt
Anamika
Anamitra
I use PuTTY, so I can't use wdiff. Is there any alternative?

Not my greatest script, but it works. Others might come up with something more elegant.
#!/bin/bash
if [ $# -ne 2 ]
then
    echo "Arguments: file1 file2"
    exit 1
fi
file1=$1
file2=$2
# Do this for both files
for F in "$file1" "$file2"
do
    if [ ! -f "$F" ]
    then
        echo "ERROR: $F does not exist."
        exit 2
    else
        # Create a temporary file with every word from the file
        # (start from an empty file so a stale .tmp is not appended to)
        : > "${F}.tmp"
        for w in $(cat "$F")
        do
            echo "$w" >> "${F}.tmp"
        done
    fi
done
# Compare the temporary files, since they are now 1 word per line
# The grep keeps only the diff lines that start with > or <
# The awk keeps only the word (i.e. removes < or >)
# The sed removes any character that is not alphanumeric.
# Removes a . at the end for example
diff "${file1}.tmp" "${file2}.tmp" | grep -E '^[<>]' | awk '{print $2}' | sed 's/[^a-zA-Z0-9]//g' > Result.txt
# Cleanup!
rm -f "${file1}.tmp" "${file2}.tmp"
This uses a trick with the for loop. If you loop with for over the contents of a file (for w in $(cat file)), it iterates over each word, NOT each line as beginners in bash tend to believe. Here that is actually useful, since it turns each file into one word per line.
Ex: file content == This is a sentence.
After the for loop is done, the temporary file will contain:
This
is
a
sentence.
Then it is trivial to run diff on the files.
One last detail: your sample output did not include a . at the end, hence the sed command to keep only alphanumeric characters.
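The same one-word-per-line idea can probably be sketched without the for loop, using tr to turn every run of whitespace into a newline. This is only an illustration: the temporary directory and the .words file names are assumptions, and the sample sentences are the ones from the question.

```shell
# Sketch: split both files into one word per line with tr, then diff them.
workdir=$(mktemp -d)
printf 'My name is Anamika.\n'  > "$workdir/File1.txt"
printf 'My name is Anamitra.\n' > "$workdir/File2.txt"

# tr -s squeezes repeated whitespace so no empty lines are produced
for f in File1 File2; do
    tr -s '[:space:]' '\n' < "$workdir/$f.txt" > "$workdir/$f.words"
done

# Keep only the changed words and strip non-alphanumeric characters
diff "$workdir/File1.words" "$workdir/File2.words" |
    grep -E '^[<>]' |
    awk '{print $2}' |
    sed 's/[^a-zA-Z0-9]//g' > "$workdir/Result.txt"

cat "$workdir/Result.txt"
```

With the two sample sentences, Result.txt should end up holding Anamika and Anamitra on separate lines.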

Related

How to only concatenate files with same identifier using bash script?

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files that share an ID (and, if possible, move the original files to another directory), e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that perhaps a solution would be to place all files to be processed in a new directory and then, in a second step, move the files with the appended "merged" back to the original directory, or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and I can think how it should work, but I can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of the file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!
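while read -r line
do
if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt 1 ]]
then
# ${line} needs braces, otherwise bash looks for a variable named line_
cat "${line}"_*.txt >> "${line}_merged.txt"
fi
# sort -u so that each prefix is processed only once
done <<< "$(for i in *_*.txt; do echo "$i"; done | awk -F_ '{ print $1 }' | sort -u)"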
List the files matching *_*.txt and pipe the names into awk, which prints the part before the first "_". Feed those prefixes to a while loop. For each prefix, use find to count the matching files; if there is more than one, cat the files with that prefix into a merged file.
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done
A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative: it handles one file at a time. To detect whether a prefix has only one file, we have to look at all the files at once. So first use an awk script to join the filenames that share a prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
# first array element is the prefix
prefix=${files[0]}
unset 'files[0]'
# the rest are the files
case "${#files[@]}" in
0) echo super error ;;
# one file - preserve the filename
1) cat "${files[@]}" > "$outdir"/"${files[1]}" ;;
# more files - use a _merged.txt suffix
*) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt" ;;
esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
Executes the command echo with the environment variable IDList exported and set to empty with one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (ie. * -> list of files) is not executed inside " double quotes.
To assign the result of a command to a variable, use command substitution: var=$( something something | something )
IDList${k+n}_S*.txt
${var+pattern} is a parameter expansion; it does not add two variables together. It expands to pattern when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-pattern}, which is similar.
To add two numbers, use arithmetic expansion: $((k + n)).
awk -F'[_.]' '{$1}'
$1 on its own does nothing useful here. To print the first field, print it explicitly: {print $1}.
Remember to check your scripts with http://shellcheck.net
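The three points above (command substitution, arithmetic expansion, and what ${var+word} really does) can be tried out in a few lines. The values of k and n and the file name S100_R1.txt are made up for illustration.

```shell
# Sketch: the expansions discussed above, with made-up values.
k=100
n=1

# Command substitution captures the output of a pipeline in a variable
id=$(echo "S${k}_R1.txt" | awk -F'[_.]' '{print $1}')

# Arithmetic expansion adds the two numbers
next=$((k + n))

# ${var+word} expands to word only when var is set; it is not addition
when_set=${k+used}
unset missing
when_unset=${missing+used}

printf '%s %s %s [%s]\n' "$id" "$next" "$when_set" "$when_unset"
# prints: S100 101 used []
```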
A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames and an associative array (which is supported by bash since the version 4.0) in order to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done
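To try the approach above without touching real data, something like the following could be run in a scratch directory. It is a sketch only: the temporary directory, the sample file contents, and the explicit counter arithmetic are assumptions, and it needs bash 4.0 or later for the associative array.

```shell
#!/bin/bash
# Sketch: dry run of the associative-array approach in a scratch directory.
work=$(mktemp -d)
backupdir="$work/original_files"
mkdir "$backupdir"

# Made-up sample files; each one contains its own name
for name in S100_R1 S100_R2 S111_1_R1 S111_R1 S111_R2 S333_R1; do
    echo "$name" > "$work/$name.txt"
done

declare -A count                      # id -> number of files with that id
for path in "$work"/*_*.txt; do
    file=${path##*/}                  # strip the directory part
    id=${file%%_*}                    # everything before the first underscore
    count[$id]=$(( ${count[$id]:-0} + 1 ))
done

for id in "${!count[@]}"; do
    if (( count[$id] > 1 )); then
        mv "$work/$id"_*.txt "$backupdir"
        cat "$backupdir/$id"_*.txt > "$work/${id}_merged.txt"
    fi
done

ls "$work"
```

Here S333_R1.txt is left where it is because its id occurs only once, while the S100 and S111 originals are moved to original_files and replaced by their merged files.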

Extract a line from a text file using grep?

I have a text file called log.txt that logs a file name and the path it was taken from, so something like this:
2.txt
/home/test/etc/2.txt
Basically, the file name and its previous location. I want to use grep to grab the file's directory, save it in a variable, and move the file back to its original location.
for var in "$@"
do
if grep "$var" log.txt
then
# code if found
else
# code if not found
fi
This just prints 2.txt and its directory to the console, since the directory line also contains 2.txt.
thanks.
Maybe flip the logic to make it more efficient?
f=''
while read -r line
do case "$line" in
*/*) [[ -e "$f" ]] && mv "$f" "$line";; # a path: move the remembered file back
*) f="$line";; # a bare name: remember it
esac
done < log.txt
That walks through all the files in the log and, if they exist locally, moves them back. It should be functionally the same without a grep per file.
If the name is always the same then why save it in the log at all?
If it is, then
while read prev
do f="${prev##*/}" # strip the path info
[[ -e "$f" ]] && mv "$f" "$prev"
done < <( grep / log.txt )
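A self-contained way to try the path-only loop is sketched below. The scratch directories (trash and orig) and the file contents are assumptions for illustration; the log has the same two-line layout as in the question.

```shell
# Sketch: restore files listed in log.txt, using only the path lines.
work=$(mktemp -d)
mkdir "$work/trash" "$work/orig"

# A "deleted" file and a log in the question's format (name, then old path)
echo data > "$work/trash/2.txt"
printf '%s\n' '2.txt' "$work/orig/2.txt" > "$work/trash/log.txt"

# Only the lines containing a slash are paths
grep / "$work/trash/log.txt" |
while read -r prev
do
    f=${prev##*/}                     # strip the path info
    if [ -e "$work/trash/$f" ]; then
        mv "$work/trash/$f" "$prev"   # move the file back
    fi
done

ls "$work/orig"
```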
Having the file names on the same line would significantly simplify your script. But maybe try something like
# Convert from command-line arguments to lines
printf '%s\n' "$@" |
# Pair up with entries in file
awk 'NR==FNR { f[$0]; next }
FNR%2 { if ($0 in f) p=$0; else p=""; next }
p { print "mv \"" p "\" \"" $0 "\"" }' - log.txt |
sh
Test it by replacing sh with cat and see what you get. If it looks correct, switch back.
Briefly, something similar could perhaps be pulled off with printf '%s\n' "$@" | grep -A 1 -Fxf - log.txt but you end up having to parse the output to pair up the output lines anyway.
Another solution:
for f in `grep -v "/" log.txt`; do
grep "/$f" log.txt | xargs -I{} cp $f {}
done
grep -q (for "quiet") stops the output
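As a rough sketch of how grep -q fits the question: test for the name quietly, then capture the path line in a variable with command substitution. The temporary log file and the variable names dest and status are assumptions; -F and -x (fixed string, whole line) are added so that the name line matches but the path line does not.

```shell
# Sketch: quiet test with grep -q, then capture the logged path.
log=$(mktemp)
printf '%s\n' '2.txt' '/home/test/etc/2.txt' > "$log"

var=2.txt
if grep -qFx "$var" "$log"
then
    # The path line is the one that contains "/<name>"
    dest=$(grep -F "/$var" "$log" | head -n 1)
    status=found
else
    dest=''
    status=missing
fi

echo "$status $dest"
# prints: found /home/test/etc/2.txt
```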

Add filename of each file as a separator row when merging into a single file Bash Script

I have the current script which combines all the CSV files in a folder into a single CSV file and it works great. I need to add functionality to add the filename of the original csv's as a header row for each data block so I know which section is which.
Can someone assist? This is not my strong point and I am in over my head.
#!/bin/bash
OutFileName="./Data/all/all.csv" # Fix the output name
i=0 # Reset a counter
for filename in ./Data/all/*.csv; do
if [ "$filename" != "$OutFileName" ] ; # Avoid recursion
then
if [[ $i -eq 0 ]] ; then
head -1 $filename > $OutFileName # Copy header if it is the first file
fi
tail -n +2 $filename >> $OutFileName # Append from the 2nd line each file
i=$(( $i + 1 )) # Increase the counter
fi
done
I will be automating this with a Run Shell Script action in Apple Automator.
Thank you for any help.
This is one of the files that are imported and output example
Example of current input file
Once combined, I need the filename where the headers are.
When you want to generate something like ...
Header1,Header2,Header3
file1.csv
a,b,c
x,y,z
file2.csv
1,2,3
9,9,9
file3.csv
...
... then you just have to insert an echo "$filename" >> "$OutFileName" in front of the tail command. Here is an updated version of your script with some minor improvements.
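#!/bin/bash
out="./Data/all/all.csv"
i=0
rm -f "$out"
for file in ./Data/all/*.csv; do
    [ "$file" = "$out" ] && continue    # skip the output file itself
    (( i++ == 0 )) && head -n 1 "$file" # copy the header from the first file only
    echo "$file"                        # separator row: the source filename
    tail -n +2 "$file"                  # data rows without the header
done > "$out"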
There is no concept of "header line" other than the first line of the CSV file. What you can do is add a new column.
I've switched to Awk because it simplifies the script considerably. Your original would be literally a one-liner.
awk -F , 'NR==1 { OFS=FS; $(NF+1) = "Filename" }
FNR>1{ $(NF+1) = FILENAME }1' all/*.csv >all.csv
Not saving the output in the same directory as the inputs removes the pesky corner case handling.
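A small sketch of the extra-column idea, run against two made-up CSV files in a temporary directory. Unlike the one-liner above, it also drops the header line of every file after the first; that choice, like the file names and contents, is an assumption for illustration.

```shell
# Sketch: merge CSV files and append the source filename as a last column.
dir=$(mktemp -d)
printf 'h1,h2\na,b\n' > "$dir/file1.csv"
printf 'h1,h2\n1,2\n' > "$dir/file2.csv"

merged=$(cd "$dir" && awk -F, -v OFS=, '
    NR == 1  { print $0, "Filename"; next }  # header of the first file
    FNR == 1 { next }                        # skip headers of later files
             { print $0, FILENAME }          # data row plus its source file
' file1.csv file2.csv)

printf '%s\n' "$merged"
# prints:
# h1,h2,Filename
# a,b,file1.csv
# 1,2,file2.csv
```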

Loop through filenames and delete last n characters

I have many fq.gz files in a directory and I want to loop through each filename, delete the last 8 characters and print this to a sample name text file
eg.
984674-TAATGAGC-GCTGTAGA--R1.fq.gz
984674-TAATGAGC-GCTGTAGA--R2.fq.gz
will become:
984674-TAATGAGC-GCTGTAGA--
984674-TAATGAGC-GCTGTAGA--
At the moment I have a bash script to put each filename into an array and then print the array to a sample.txt file. I have then tried to mv the filename to the new desired filename:
#!/bin/bash
declare -a FILELIST
for f in *.gz; do
#FILELIST[length_of_FILELIST + 1]=filename
FILELIST[${#FILELIST[@]}+1]=$(echo "$f");
done
printf '%s\n' "${FILELIST[@]:1:10}" > sample.txt
sample_info=sample.txt
sample_files=($(cut -f 1 "$sample_info"))
for file in "${sample_files[@]}"
do
mv "$file" "${file::(-8)}"
echo $file
done
But the script isn't removing any characters. Can you help?
To loop through each filename, delete the last 8 characters and print the result, would this work for you? (A negative length in ${var:offset:length} needs bash 4.2 or later.)
for i in *fq.gz
do
echo "${i:0:-8}"
done
This uses suffix removal (${var%%pattern}), assuming you want exactly 8 characters removed from the end:
for n in *.fq.gz
do
echo "${n%%??.fq.gz}"
done
For a test,
$ n="984674-TAATGAGC-GCTGTAGA--R1.fq.gz"
$ echo "${n%%??.fq.gz}"
984674-TAATGAGC-GCTGTAGA--
OR
$ echo "${n%%????????}"
984674-TAATGAGC-GCTGTAGA--
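Putting the suffix removal together with the original goal of writing the names to a sample file, a minimal sketch could look like this. The temporary directory and the empty placeholder files are assumptions; only the file names matter here.

```shell
# Sketch: strip the last 8 characters of each name and save to sample.txt.
dir=$(mktemp -d)
touch "$dir/984674-TAATGAGC-GCTGTAGA--R1.fq.gz" \
      "$dir/984674-TAATGAGC-GCTGTAGA--R2.fq.gz"

for f in "$dir"/*.fq.gz
do
    name=${f##*/}                     # file name without the directory
    printf '%s\n' "${name%????????}"  # eight ? remove the last 8 characters
done > "$dir/sample.txt"

cat "$dir/sample.txt"
```

Both files share the same stem, so the line 984674-TAATGAGC-GCTGTAGA-- appears twice; piping the loop through sort -u would keep one copy if unique sample names are wanted.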

Using a pipe to read a file, run script and write to the same file

I need to write a one-line script that takes a file and rewrites it so that each line containing the word "word" ends with the number of words on that line. I may use a helper script that does whatever I want.
My problem is that after I run the script, the file I passed to it is empty.
This is the one line script:
#!/bin/bash
cat $1 | ./words_num word | cat $1
words_num
#!/bin/bash
while read line; do
temp=`echo $line | grep $1 | wc -l`
if (($temp==1)); then
word_cnt=`echo $line | wc -w`
echo "$line $word_cnt"
else
echo "$line"
fi
done
For example, before the file is:
bla bla blaa word
words blaa
bla bla
after file:
bla bla blaa word 4
words blaa 2
bla bla
Can you help?
The one-liner:
cat $1 | ./words_num word | cat $1
is peculiar. It is approximately equivalent to:
cat $1 | ./words_num word >/dev/null; cat $1
which is unlikely to be the intended result. It is also a candidate for a UUOC (Useless Use of cat) award.
If the intention is to overwrite the original file with the amended version, then you should probably write:
./words_num word < "$1" > tmp.$$; mv tmp.$$ "$1"
If you want to see the results on the screen as well, then:
./words_num word < "$1" | tee tmp.$$; mv tmp.$$ "$1"
Both these will leave a temporary file around if interrupted. You can avoid that with:
#!/bin/bash
trap "rm -f tmp.$$; exit 1" 0 1 2 3 13 15
./words_num word < "$1" | tee tmp.$$
mv tmp.$$ "$1"
trap 0
The trap sets signal handlers (EXIT, HUP, INT, QUIT, PIPE, TERM) and removes the temporary file (if it exists) and exits with a failure status. The trap 0 at the end cancels the exit trap so the command exits successfully.
As for the words_num script, that seems to call for awk rather than shell:
#!/bin/bash
[ $# == 0 ] && { echo "Usage: $0 word [file ...]" >&2; exit 1; }
word=$1
shift
awk "/$word/"' { print $0, NF; next } { print }' "$@"
You can reduce that if you're into code golfing your awk scripts, but I prefer clarity to sub-par code. It looks for lines containing the word, prints the line along with the number of fields in the line, and moves to the next line. If the line doesn't match, it is simply printed. The assignment and shift mean that "$@" contains all the other arguments to words_num, and awk will automatically cycle through the named files, or read standard input if no files are named.
The script should check that the given word does not contain any slashes as that will mess up the regex (it would be OK to replace each one that appears with [/], a character class containing only a slash). That level of bullet-proofing is left for the interested user.
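One way to sidestep the slash problem altogether is sketched below: the word is passed to awk with -v and matched with index(), which does a plain substring search rather than a regex match. That is a swapped-in technique, not the approach above, and the temporary files and sample lines are taken from the question purely for illustration.

```shell
# Sketch: append the word count to matching lines, then replace the file.
file=$(mktemp)
printf '%s\n' 'bla bla blaa word' 'words blaa' 'bla bla' > "$file"

word=word
tmp=$(mktemp)

# index($0, w) is non-zero when the line contains w as a plain substring
awk -v w="$word" 'index($0, w) { print $0, NF; next } { print }' "$file" > "$tmp" &&
    mv "$tmp" "$file"

cat "$file"
# prints:
# bla bla blaa word 4
# words blaa 2
# bla bla
```

Writing to a temporary file first and renaming it afterwards is what keeps the original from being emptied while it is still being read.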
cat $1 | ./words_num word | tee $1
