Compare File headers using Awk or Cmp - bash

I have many flat files in one directory.
Each file has a header and some data in it.
I want to compare the header of one file with the headers of all the other files in that directory.
This can be achieved with a shell script, but I want to do it as a one-liner.
I tried it using the awk command, but it compares the whole file, not just the header.
for i in `ls -1 *a*` ; do cmp a.dat $i ; done
Can someone please let me know how I can do that?
Also, can it be achieved using awk?
I just need to check whether the headers match or not.

I would try this: grab the first line of every file, extract the unique lines, and count them. The result should be one.
number_uniq=$( sed '1q' * | sort -u | wc -l )
That won't tell you which file is different.
files=(*)
reference_header=$( sed '1q' "${files[0]}" )
for file in "${files[@]:1}"; do
    if [[ "$reference_header" != "$( sed '1q' "$file" )" ]]; then
        echo "wrong header: $file"
    fi
done
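Since the question also asks about awk, here is a sketch of a pure-awk version (assuming GNU awk for nextfile, so only each file's header line is read, and assuming your files match *.dat):
awk 'FNR == 1 { if (NR == 1) ref = $0; else if ($0 != ref) print "wrong header: " FILENAME; nextfile }' *.dat
The first file on the command line serves as the reference, just like ${files[0]} above.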

For what you describe, you can use md5 or cksum to take a signature of the bytes in the header.
Given 5 files (note that File 4.txt does not match):
$ for fn in *.txt; do echo "$fn:"; cat "$fn"; printf "\n\n"; done
File 1.txt:
what a great ride! it is a lovely day
/tmp/files/File 1.txt
File 2.txt:
what a great ride! it is a lovely day
/tmp/files/File 2.txt
File 3.txt:
what a great ride! it is a lovely day
/tmp/files/File 3.txt
File 4.txt:
what an awful ride! it is a horrible day
/tmp/files/File 4.txt
reference.txt:
what a great ride! it is a lovely day
/tmp/files/reference.txt
You can use md5 to get a signature and then check whether the other files are the same.
First get the reference signature:
$ sig=$(head -1 reference.txt | md5)
$ echo $sig
549560de062a87ec69afff37abe18d8f
Then loop through the files:
for fn in *.txt; do
    if [[ "$sig" != "$(head -1 "$fn" | md5)" ]]; then
        echo "header of \"$fn\" does not match"
    fi
done
Prints:
header of "File 4.txt" does not match

Related

How to compare 2 files word by word and storing the different words in result output file

Suppose there are two files:
File1.txt
My name is Anamika.
File2.txt
My name is Anamitra.
I want result file storing:
Result.txt
Anamika
Anamitra
I use PuTTY, so I can't use wdiff. Is there any other alternative?
Not my greatest script, but it works. Others might come up with something more elegant.
#!/bin/bash
if [ $# != 2 ]
then
    echo "Arguments: file1 file2"
    exit 1
fi
file1=$1
file2=$2
# Do this for both files
for F in "$file1" "$file2"
do
    if [ ! -f "$F" ]
    then
        echo "ERROR: $F does not exist."
        exit 2
    else
        # Create a temporary file with every word from the file, one per line
        for w in $(cat "$F")
        do
            echo "$w" >> "${F}.tmp"
        done
    fi
done
# Compare the temporary files, since they are now 1 word per line.
# The grep keeps only the lines diff marks with > or <.
# The awk keeps only the word (i.e. removes < or >).
# The sed removes any character that is not alphanumeric,
# e.g. a . at the end of a word.
diff "${file1}.tmp" "${file2}.tmp" | grep -E "^[<>]" | awk '{print $2}' | sed 's/[^a-zA-Z0-9]//g' > Result.txt
# Cleanup!
rm -f "${file1}.tmp" "${file2}.tmp"
This uses a trick with the for loop. If you use for to loop over a file's contents, it loops over each word, NOT each line, as beginners in bash tend to believe. Here that is actually a nice thing to know, since it transforms the files into one word per line.
Ex: file content == This is a sentence.
After the for loop is done, the temporary file will contain:
This
is
a
sentence.
Then it is trivial to run diff on the files.
One last detail: your sample output did not include a . at the end, hence the sed command to keep only alphanumeric characters.
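As an aside, the temporary files can also be produced without the loop, using tr to squeeze all whitespace into newlines; a sketch of the same pipeline (assuming GNU tr):
tr -s '[:space:]' '\n' < File1.txt > File1.txt.tmp
tr -s '[:space:]' '\n' < File2.txt > File2.txt.tmp
diff File1.txt.tmp File2.txt.tmp | grep -E '^[<>]' | awk '{print $2}' | sed 's/[^a-zA-Z0-9]//g' > Result.txt
rm -f File1.txt.tmp File2.txt.tmp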

Extract a line from a text file using grep?

I have a text file called log.txt, and it logs the file name and the path it was gotten from. So something like this:
2.txt
/home/test/etc/2.txt
Basically the file name and its previous location. I want to use grep to grab the file's directory, save it as a variable, and move the file back to its original location.
for var in "$@"
do
    if grep "$var" log.txt
    then
        # code if found
    else
        # code if not found
    fi
done
This just prints the 2.txt line and its directory line to the console, since the directory line also contains 2.txt.
Thanks.
Maybe flip the logic to make it more efficient?
f=''
while read prev
do  case "$prev" in
    */*) [[ -e "$f" ]] && mv "$f" "$prev";;   # path line: move the remembered file back
    *)   f="$prev";;                          # bare name line: remember the name
    esac
done < log.txt
That walks through all the entries in the log and, if the named file exists locally, moves it back. It should be functionally the same, without one grep per file.
If the name is always just the base name of the path, why save it in the log at all? If it is, then:
while read prev
do  f="${prev##*/}"   # strip the path info
    [[ -e "$f" ]] && mv "$f" "$prev"
done < <( grep / log.txt )
Having the file names on the same line as the paths would significantly simplify your script. But maybe try something like:
# Convert from command-line arguments to lines
printf '%s\n' "$@" |
# Pair up with entries in file
awk 'NR==FNR { f[$0]; next }
     FNR%2 { if ($0 in f) p=$0; else p=""; next }
     p { print "mv \"" p "\" \"" $0 "\"" }' - log.txt |
sh
Test it by replacing sh with cat and see what you get. If it looks correct, switch back.
Briefly, something similar could perhaps be pulled off with printf '%s\n' "$@" | grep -A 1 -Fxf - log.txt, but you end up having to parse the output to pair up the lines anyway.
Another solution:
for f in $(grep -v "/" log.txt); do
    grep "/$f" log.txt | xargs -I{} cp "$f" {}
done
In your original if test, grep -q (for "quiet") suppresses the output.
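To tie this back to the asker's loop shape, a minimal sketch (assuming every path line in log.txt contains a slash and ends with the file name; dest is a hypothetical variable name):
for var in "$@"; do
    dest=$(grep -F "/$var" log.txt | head -n 1)   # first path line mentioning this name
    if [ -n "$dest" ]; then
        mv "$var" "$dest"
    else
        echo "no entry for $var in log.txt"
    fi
done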

Add filename of each file as a separator row when merging into a single file Bash Script

I have the current script, which combines all the CSV files in a folder into a single CSV file, and it works great. I need to add functionality that inserts the filename of each original CSV as a header row for its data block, so I know which section is which.
Can someone assist? This is not my strong point and I am in over my head.
#!/bin/bash
OutFileName="./Data/all/all.csv"   # Fix the output name
i=0                                # Reset a counter
for filename in ./Data/all/*.csv; do
    if [ "$filename" != "$OutFileName" ]; then   # Avoid recursion
        if [[ $i -eq 0 ]]; then
            head -1 "$filename" > "$OutFileName"     # Copy header if it is the first file
        fi
        tail -n +2 "$filename" >> "$OutFileName"     # Append from the 2nd line of each file
        i=$(( i + 1 ))                               # Increase the counter
    fi
done
I will be automating this and running the shell script in Apple Automator.
Thank you for any help.
(Screenshots of an example input file and the desired combined output were attached here; once combined, the filename should appear where each block's headers are.)
When you want to generate something like ...
Header1,Header2,Header3
file1.csv
a,b,c
x,y,z
file2.csv
1,2,3
9,9,9
file3.csv
...
... then you just have to insert an echo "$filename" >> "$OutFileName" in front of the tail command. Here is an updated version of your script with some minor improvements.
#!/bin/bash
out="./Data/all/all.csv"
i=0
rm -f "$out"
for file in ./Data/all/*.csv; do
    [ "$file" = "$out" ] && continue    # the redirection below recreates $out, so skip it
    (( i++ == 0 )) && head -1 "$file"   # copy the header from the first file only
    echo "$file"                        # filename separator row
    tail -n +2 "$file"                  # data rows
done > "$out"
There is no concept of a "header line" in CSV other than the first line of the file. What you can do instead is add a new column.
I've switched to Awk because it simplifies the script considerably; your original script then becomes essentially a one-liner.
awk -F , 'NR==1 { OFS=FS; $(NF+1) = "Filename" }
FNR==1 && NR>1 { next }
FNR>1 { $(NF+1) = FILENAME }1' all/*.csv > all.csv
The FNR==1 && NR>1 rule skips the repeated header line at the top of each file after the first. Not saving the output in the same directory as the inputs removes the pesky corner-case handling.
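For illustration, given two hypothetical inputs all/a.csv and all/b.csv that both start with the header Header1,Header2, all.csv would contain something like:
Header1,Header2,Filename
a,b,all/a.csv
x,y,all/a.csv
1,2,all/b.csv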

Grep-ing a list of filename against a csv list of names

I have a CSV file containing a list of ids (numbers), one per row. Let's call that file ids.csv.
In a directory I have a large number of files named "file_123456_smth.csv", where 123456 is an id that may be found in the ids CSV file.
Now, what I am trying to achieve: compare the names of the files with the ids stored in ids.csv. If 123456 is found in ids.csv, then the filename should be displayed.
What I've tried:
ls -a | xargs grep -L cat ../../ids.csv
Of course this does not work, but it gives an idea of my direction.
Let's see if I understood you correctly...
$ cat ids.csv
123
456
789
$ ls *.csv
file_123_smth.csv file_321_smth.csv file_789_smth.csv ids.csv
$ ./c.sh
123 found in file_123_smth.csv
789 found in file_789_smth.csv
where c.sh looks like this:
#!/bin/bash
ID="ids.csv"
for file in *.csv
do
    if [[ $file =~ file ]]   # just do the filtering on files
    then                     # containing the actual string "file"
        id=$(cut -d_ -f2 <<< "$file")
        grep -q "$id" "$ID" && echo "$id found in $file"
    fi
done
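Note that grep -q "$id" also matches ids that merely contain the id as a substring (id 123 would match a line 1234). A sketch using an exact whole-line match, with parameter expansion instead of cut:
for f in file_*_smth.csv; do
    id=${f#file_}    # strip the leading "file_"
    id=${id%%_*}     # strip everything from the next "_" on
    grep -qx "$id" ids.csv && echo "$id found in $f"
done
The glob file_*_smth.csv assumes the names follow exactly that pattern; widen it to file_*.csv if they vary.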

head output till a specific line

I have a bunch of files in the following format.
A.txt:
some text1
more text2
XXX
more text
....
XXX
.
.
XXX
still more text
text again
Each file has at least 3 lines that start with XXX. Now, for each file A.txt I want to write all the lines till the 3rd occurrence of XXX (in the above example it is till the line before still more text) to file A_modified.txt.
I want to do this in bash and came up with grep -n -m 3 -w "^XXX$" * | cut -d: -f2 to get the corresponding line number in each file.
Is it possible to use head along with these line numbers to generate the required output?
PS: I know a simple Python script would do the job, but I am trying to do it in bash for no specific reason.
A simpler method would be to use awk. Assuming there's nothing but files of interest in your present working directory, try:
for i in *; do awk 'c < 3; /^XXX$/ { c++ }' "$i" > "$i.modified"; done
Or, to stop reading as soon as the third XXX has been printed, if your files are very big:
for i in *; do awk '{ print } /^XXX$/ && ++c == 3 { exit }' "$i" > "$i.modified"; done
head -n will print out the first 'n' lines of the file
#!/bin/sh
for f in *.txt; do
    echo "searching $f"
    line_number=$(grep -n -m 3 -w "^XXX$" "$f" | cut -d: -f1 | tail -1)
    # line_number now stores the line number of the 3rd XXX
    # now dump out the first 'line_number' lines of this file
    head -n "$line_number" "$f"
done
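To write each result to A_modified.txt, as the question asks, rather than to stdout, redirect the head output; a sketch:
head -n "$line_number" "$f" > "${f%.txt}_modified.txt"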
