Bash: comparing two different files with different fields

I am not sure if this is possible, but I want to compare two character values from two different files. If they match, I want to print the value of field 2 from one of the files. Here is an example:
# File 1
Date D
Tamb B
# File 2
F gge0001x gge0001y gge0001z
D 12-30-2006 12-30-2006 12-30-2006
T 14:15:20 14:15:55 14:16:27
B 15.8 16.1 15
Here is my thinking behind the problem:
if [ (field2 from file1) == (field1 from file2) ] ; then
    echo (field1 from file1) and (field2 from file2) on the same line
which would print out "Date 12-30-2006"
"Tamb 15.8"
" ... "
and continue through every line of file 1, printing any matches found. I am assuming some sort of array will be involved. Any thoughts on whether this logic is correct and whether it is even possible?

This reformats file2 based on the abbreviations found in file1:
$ awk 'FNR==NR{a[$2]=$1;next;} $1 in a {print a[$1],$2;}' file1 file2
Date 12-30-2006
Tamb 15.8
How it works
FNR==NR{a[$2]=$1;next;}
This reads each line of file1 and saves the information in array a.
In more detail, NR is the number of lines that have been read in so far and FNR is the number of lines that have been read in so far from the current file. So, when NR==FNR, we know that awk is still processing the first file. Thus, the array assignment, a[$2]=$1 is only performed for the first file. The statement next tells awk to skip the rest of the code and jump to the next line.
$1 in a {print a[$1],$2;}
Because of the next statement, above, we know that, if we get to this line, we are working on file2.
If field 1 of file2 matches any saved field 2 from file1 (i.e. it is a key in the array a), then print a reformatted version of the line.
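Since the question asks about bash: here is a minimal pure-bash sketch of the same lookup, assuming bash 4+ (for associative arrays) and the example filenames file1 and file2:
#!/bin/bash
declare -A label
# Pass 1: map each abbreviation (field 2 of file1) to its name (field 1)
while read -r name abbr; do
    [[ -n $abbr ]] && label[$abbr]=$name
done < file1
# Pass 2: for each line of file2 whose key is known, print the name and the first data column
while read -r key value rest; do
    [[ -n ${label[$key]:-} ]] && echo "${label[$key]} $value"
done < file2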

Related

How to make a table using bash shell?

I have multiple text files, each containing its own single column. I would like to combine them into one text file, laid out like a table rather than one long column.
I tried 'paste' and 'column', but it did not make the shape that I wanted.
When I used paste with two text files, it made a nice table.
paste height_1.txt height_2.txt > test.txt
The trouble starts from three or more text files.
paste height_1.txt height_2.txt height_3.txt > test.txt
At a glance, it seems fine. But when I plot each column of test.txt in gnuplot (p "test.txt"), I get an unexpected graph that differs from the original data, especially in its last part.
The shape of the table is ruined in a strange way in test.txt, which makes the graph look wrong.
How could I make a well-structured table in a text file with the bash shell?
Is bash not suited to this kind of work?
If so, I will try it with Python.
Height files are extracted from other *.csv files using awk.
Thank you so much for reading this question.
awk with simple concatenation can take the records from as many files as you have and join them in a single output file for further processing. You provide the multiple input files for awk to read, concatenate each record using FNR (the per-file record number) as an index, and then use the END rule to print the combined records from all files.
For example, given 3 data files, say data1.txt through data3.txt, each with one integer per row:
$ cat data1.txt
1
2
3
$ cat data2.txt
4
5
6
(7-9 in data3.txt, and presuming you have an equal number of records in each input file)
You could do:
awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $1 : $1} END {for (i=1; i in a; i++) print a[i]}' data1.txt data2.txt data3.txt
(using a tab above with "\t" for the separator between columns of the output file -- you can change to suit your needs)
The result of the command above would be:
1 4 7
2 5 8
3 6 9
(note: this is what you would get with paste data1.txt data2.txt data3.txt, but presuming you have input that is giving paste problems, awk may be a bit more flexible)
Or using a "," as the separator, you would receive:
1,4,7
2,5,8
3,6,9
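For reference, the comma version is the same command with only the separator changed:
awk '{a[FNR]=(FNR in a) ? a[FNR] "," $1 : $1} END {for (i=1; i in a; i++) print a[i]}' data1.txt data2.txt data3.txt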
If your data files have more fields than a single integer and you want to compile all fields from each file, you can assign $0 to the array instead of the first field $1 (a sketch follows the formatted script below).
Spaced and formatted in multi-line format (for easier reading), the same awk script would be
awk '
{
    a[FNR] = (FNR in a) ? a[FNR] "\t" $1 : $1
}
END {
    for (i = 1; i in a; i++)
        print a[i]
}
' data1.txt data2.txt data3.txt
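As noted above, the whole-record variant simply swaps $1 for $0; a sketch, assuming data files with the same number of records:
awk '{a[FNR] = (FNR in a) ? a[FNR] "\t" $0 : $0} END {for (i=1; i in a; i++) print a[i]}' data1.txt data2.txt data3.txt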
Look things over and let me know if I misunderstood your question, or if you have further questions about this approach.

Use awk to compare file entry as well as condition

I have a file1 in the below format:
14-02-2017
one 01/02/2017
two 31/01/2017
three 14/02/2017
four 01/02/2017
five 03/02/2017
six 01/01/2017
And file2 in the below format:
11-02-2017
one 01/01/2017
two 31/01/2017
three 14/02/2017
four 11/01/2017
Requirement: I want to copy, replace (or add if necessary) those files mentioned in file1, from some location to the location where file2 resides, whose date (in column 2) is greater than the date mentioned in file2. It is guaranteed that under no circumstances will file2 have a program's date greater than that of file1 (but the dates can be equal). Also, the entries missing from file2 (but present in file1) shall be copied.
So in this case, the files one, four, five and six shall be copied from some location to the file2 location after the script executes.
awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' $file2 $file1 > common
# File 1, column 2
f1c2=($(cut -d' ' -f2 -s common))
# File 2, column 2
f2c2=($(cut -d' ' -f2 -s $file2))
for x in "${f1c2[@]}"
do
    for y in "${f2c2[@]}"
    do
        if [[ $x > $y || $x == $y ]]
        then
            # copy file pointed to by field $1 in "common" to the file2 path
            break
        fi
    done
done
I was thinking of a way to use awk itself to do the comparison efficiently and create the file "common", so that "common" would contain the latest files from file1 plus the entries missing from file2. That way I would just need to copy every file mentioned in "common" without further checks.
I was trying to add an if block inside awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' $file2 $file1 > common, but I couldn't figure out how to address file1 column 2 and file2 column 2 for the comparison.
To get the date-compared diff list, you can try this:
awk 'NR==FNR {a[$1]=$2; next}
$1 in a {split($2,b,"/"); split(a[$1],c,"/");
if(b[3]""b[2]""b[1] >= c[3]""c[2]""c[1]) delete a[$1]}
END {for(k in a) print k,a[k]}' file1 file2
six 01/01/2017
four 01/02/2017
five 03/02/2017
one 01/02/2017
and operate on the result for copying the files (a sketch of that step follows the explanation below)...
Explanation
Given file 1, we want to remove the entries whose date field is less than or equal to the date of the matching entry in file 2.
NR==FNR {a[$1]=$2; next} caches the contents of file 1.
$1 in a (now scanning the second file) fires if a matching record exists in file 1.
split($2,b,"/")... splits the date fields so that we can reorder them to year-month-day for a natural string comparison.
if(b[3]...) delete a[$1] deletes the entry if the file 2 date is greater than or equal to the one in file 1.
END... prints the remaining entries, which satisfy the requirement.
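For the copy step itself, a minimal sketch (the source and destination paths below are placeholders, and the diff list is assumed to have been redirected to a file named common):
src=/path/to/source        # placeholder: where the files currently live
dest=/path/to/file2_dir    # placeholder: the directory containing file2
while read -r name _; do
    cp "$src/$name" "$dest/"
done < common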
Parsing 2 files simultaneously with awk is hard, so I suggest another algorithm:
- merge the files
- filter to keep the relevant lines
Have a look at the comm and join commands. Here is an example:
comm -23 <(sort file1) <(sort file2)
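And a join-based sketch, leaning on the question's guarantee that a date in file2 is never greater than the matching date in file1 (so a plain inequality test suffices):
# -a 1 keeps file1 entries with no match in file2; unpaired data lines
# have 2 fields, paired lines have 3 (name, file1 date, file2 date)
join -a 1 <(sort file1) <(sort file2) |
awk 'NF==2 || $2!=$3 {print $1, $2}'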

How to remove lines that appear only once in a file using bash

How can I remove the lines that appear only once in a file in bash?
For example, file foo.txt has:
1
2
3
3
4
5
after processing the file, only
3
3
will remain.
Note the file is sorted already.
If your duplicated lines are consecutive, you can use uniq:
uniq -D file
from the man pages:
-D print all duplicate lines
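If your uniq lacks -D (it is a GNU extension), a sketch using POSIX uniq and grep (with a bash process substitution) works for sorted input:
# uniq -d emits one copy of each duplicated line; grep -Fx then pulls
# every occurrence back out of the original (sorted) file
grep -Fxf <(uniq -d foo.txt) foo.txt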
Just loop the file twice:
$ awk 'FNR==NR {seen[$0]++; next} seen[$0]>1' file file
3
3
The first pass counts how many times each line occurs: seen[$0] keeps track of it in an array.
The second pass prints the lines that appear more than once.
Using single pass awk:
awk '{freq[$0]++} END{for(i in freq) for (j=1; freq[i]>1 && j<=freq[i]; j++) print i}' file
3
3
Using freq[$0]++ we count and store the frequency of each line.
In the END block, if the frequency is greater than 1, we print those lines as many times as their frequency.
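One caveat: for (i in freq) visits keys in an unspecified order, so if more than one distinct line is duplicated, the groups may print in a different order than they appear in the input.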
Using awk, single pass:
$ awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1' foo.txt
3
3
If the file is unordered, the output will happen in the order duplicates are found in the file due to the solution not buffering values.
Here's a POSIX-compliant awk alternative to the GNU-specific uniq -D:
awk '++seen[$0] == 2; seen[$0] >= 2' file
This turned out to be just a shorter reformulation of James Brown's helpful answer.
Unlike uniq, this command doesn't strictly require the duplicates to be grouped, but the output order will only be predictable if they are.
That is, if the duplicates aren't grouped, the output order is determined by the relative ordering of the 2nd instances in each set of duplicates, and in each set the 1st and 2nd instances will be printed together.
For unsorted (ungrouped) data (and if preserving the input order is also important), consider:
fedorqui's helpful answer (elegant, but requires reading the file twice)
anubhava's helpful answer (single-pass solution, but a little more cumbersome).

File comparison

I am a beginner, looking for a basic shell script to solve what looks like a simple problem:
I have one long file, file A, shown below.
I would like to generate a new file (target file C) that is essentially file A, but with an extra field on the first line, say "Comment", where every line whose first field matches any of the items in column 1 of file B is marked "SHARED". Files A and B are CSV files.
I have tried awk and a basic shell script that is easier for me to understand, but I could not get either to work. I could generate a blank target file, with the first line containing the 3 fields if necessary.
File A
"Part Number","Description"
"1468896-1","MCD-MXSER-21-P-X-0209"
"1495581-1","MC-P-15S5127854ST1"
"1497458-3","MC -N1-P-569RT1"
File B
"1466826-1"
"1495582-1"
"1495581-1"
Desired target file C
"Part Number","Description","Comment"
"1468896-1","MCD-MXSER-21-P-X-0209"
"1495581-1","MC-P-15S5127854ST1","SHARED"
"1497458-3","MC -N1-P-569RT1"
This one-liner should do the job:
awk -F, -v c='"Comment"' -v s='"SHARED"'
'NR==FNR{a[$1]=1;next}FNR==1{$0=$0 FS c}FNR>1&&a[$1]{$0=$0 FS s}7' fileb filea
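A quick walkthrough: NR==FNR{a[$1]=1;next} caches the part numbers of fileb; FNR==1{$0=$0 FS c} appends the "Comment" header field to the first line of filea; FNR>1&&a[$1]{$0=$0 FS s} appends the "SHARED" field to matching data lines; and the trailing 7 is common awk shorthand: any non-zero constant is a true pattern, so the default action prints every (possibly modified) line.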
If you want to do it in bash:
#!/bin/bash
# Emit the header with the extra "Comment" field, then process the data rows
read -r header < fileA
echo "$header,\"Comment\""
tail -n +2 fileA | while IFS=, read -r f1 line
do
    if grep -qw "$f1" fileB ; then
        echo "$f1,$line,\"SHARED\""
    else
        echo "$f1,$line"
    fi
done
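Note that this runs one grep per line of fileA, so on large files the awk approaches here will be considerably faster.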
You can do it like this:
awk -F, 'FNR==NR{a[i++]=$1;next} {extra="";for(t in a)if($1==a[t])extra=",\"SHARED\"";print $0 extra}' fileB fileA
You will see both fileA and fileB are passed into awk. The processing in {} following FNR==NR only applies to fileB. It stores the first element of each line in the array a[] and then skips to the next line.
The processing in the second set of {} only applies to fileA. It presets a string called extra to nothing, then tests whether the first field of the current record is in array a[]. If it is, it sets extra to ,"SHARED". It then prints the current record followed immediately by extra, which may or may not be ,"SHARED".
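The linear scan for(t in a) can also be replaced with awk's direct membership test; a sketch that keeps the same behavior (keys stored as array indexes):
awk -F, 'FNR==NR{a[$1];next} {print $0 (($1 in a) ? ",\"SHARED\"" : "")}' fileB fileA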

Number of total records

I'm quite new to AWK. I just discovered the FNR variable, and I wonder if it is possible to get the total number of records before processing the file?
That is, the value FNR would have at the end of the file.
I just need it to do something like
awk 'FNR<TOTALRECORDS-4 {print}'
in order to delete the last 4 lines of a file.
Thanks
If you merely want to print all but the last 4 lines of a file, use a different tool. But if you are doing some other processing with awk and need to incorporate this, just store the lines in a buffer and print them as needed. That is, keep the most recent 4 lines, and each time you read a new line past the fourth, print the oldest line in the buffer. For example:
awk 'NR>4 { print a[i%4]} {a[i++%4]=$0}' input
This keeps 4 lines in the array a. While we are in the first 4 lines of the file, we do nothing but store the line in a. Once we are on a line greater than 4, the first thing we do is print the line from 4 lines back (stored in a at index i%4). You can put commands that manipulate $0 between these two action statements as needed.
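A quick way to sanity-check the buffer logic (this prints 1 through 6):
$ seq 10 | awk 'NR>4 { print a[i%4] } { a[i++%4]=$0 }'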
To remove the last 4 lines from a file, you can just use head (note that the negative count is a GNU extension, not POSIX):
head -n -4 somefile > outputfile
