Use awk to compare file entry as well as condition - bash

I have a file1 in the below format:
14-02-2017
one 01/02/2017
two 31/01/2017
three 14/02/2017
four 01/02/2017
five 03/02/2017
six 01/01/2017
And file2 in the below format:
11-02-2017
one 01/01/2017
two 31/01/2017
three 14/02/2017
four 11/01/2017
Requirement: I want to copy, replace (or add if necessary) those files mentioned in file1, from some location to the location where file2 resides, whose date (in column 2) is greater than the date mentioned in file2. It is guaranteed that under no circumstances will file2 have a program's date greater than that of file1 (but they can be equal). Also, the file entries missing in file2 (but present in file1) shall be copied as well.
So in this case, the files one, four, five and six shall be copied from some location to the file2 location after the script executes.
awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' $file2 $file1 > common
# File 1, column 2
f1c2=($(cut -f2 -s common))
# File 2, column 2
f2c2=($(cut -f2 -s $file2))
for x in "${f1c2[@]}"
do
    for y in "${f2c2[@]}"
    do
        if [[ ! "$x" < "$y" ]]
        then
            # copy the file named by field $1 in "common" to the file2 path
            break
        fi
    done
done
I was thinking of a way to use awk itself to do the comparison efficiently and create the file "common", so that "common" would contain the newer entries from file 1 plus the entries missing from file 2. That way, I would just need to copy every file mentioned in "common" without any further checks.
I was trying to add an if block inside awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' $file2 $file1 > common, but I couldn't figure out how to address file 1's column 2 and file 2's column 2 for the comparison.

To get the date-compared diff list, you can try this:
awk 'NR==FNR {a[$1]=$2; next}
     $1 in a {split($2,b,"/"); split(a[$1],c,"/");
              if(b[3]""b[2]""b[1] >= c[3]""c[2]""c[1]) delete a[$1]}
     END {for(k in a) print k,a[k]}' file1 file2
six 01/01/2017
four 01/02/2017
five 03/02/2017
one 01/02/2017
and operate on the result for copying files...
Explanation
Given file 1, we want to remove the entries whose date field is less than or equal to the date of the matching entry in file 2.
NR==FNR {a[$1]=$2; next} cache the contents of file 1
$1 in a (now scanning the second file) if a record exists in file 1
split($2,b,"/")... split the date fields so that we can change the order to year-month-day for natural-order comparison
if(b[3]...) delete a[$1] if the file 2 date is greater than or equal to the one in file 1, delete the entry
END... print remaining entries, which will satisfy the requirement.
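To then copy the files, the awk output can feed a read loop directly. A minimal sketch, assuming the source and destination directories live in hypothetical variables SRC and DEST:

awk 'NR==FNR {a[$1]=$2; next}
     $1 in a {split($2,b,"/"); split(a[$1],c,"/");
              if(b[3]""b[2]""b[1] >= c[3]""c[2]""c[1]) delete a[$1]}
     END {for(k in a) print k,a[k]}' file1 file2 |
while read -r name date; do
    cp "$SRC/$name" "$DEST/"    # copies newer or missing files, overwriting or adding as needed
done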

Parsing two files simultaneously with awk is hard, so I suggest another algorithm:
- merge the files
- filter to keep the relevant lines
I suggest having a look at the "comm" and "join" commands. Here is an example:
comm -23 <(sort file1) <(sort file2)
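For instance, a merge-then-filter version of that algorithm might look like the sketch below (assuming the lone date header lines are stripped first; join pairs rows by name, -a 1 keeps names present only in file1, and awk keeps rows whose file1 date, rewritten as year-month-day, is strictly greater or has no file2 counterpart):

join -a 1 <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort) |
awk '{split($2,b,"/"); split($3,c,"/");
      if (NF < 3 || b[3]""b[2]""b[1] > c[3]""c[2]""c[1]) print $1, $2}'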

Related

Merge two .csv files into one and keep just one header - in bash

I have two .csv files in the same directory with the same number of columns and I want to combine them into one file, but keep just one header from the first file. The file name is always different, only the prefix remains the same:
orderline_123456.csv
Order_number,Quantity,Price
100,10,25.3
101,15,30.2
orderline_896524.csv
Order_number,Quantity,Price
102,20,12.33
103,3,3.4
The output file should be like:
file_load.csv
Order_number,Quantity,Price
100,10,25.3
101,15,30.2
102,20,12.33
103,3,3.4
This was already in the shell script, because until now I only needed to load one file, but now I have to merge two files:
awk '(NR-1)%2{$1=$1}1' RS=\" ORS=\" orderline_*.csv >> file_to_load.csv
I tried changing it into
awk 'FNR == 1 && NR != 1 {next} (NR-1)%2{$1=$1}1' RS=\" ORS=\" orderline_*.csv >> file_to_load.csv
but I get the header twice in the output.
Could you please help me? What exactly should the command look like? I need to keep how it was defined before.
Thank you!
You're looking for
awk 'NR == 1 || FNR > 1' file ...
NR is the count of all records seen across all files, and FNR is the record number within the current file, so the condition keeps the first line of the first file (the header) and skips the first line of every later file.
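Applied to the files in the question, that would be, for example:

awk 'NR == 1 || FNR > 1' orderline_*.csv > file_load.csv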
Sometimes the solution is to divide the task into easy steps.
1. Get the first line, which is the header, and store it in a variable
https://stackoverflow.com/a/2439587/3957754
header=$(head -n 1 file1.csv)
2. Get all lines of a file except the first one
How to tail all lines except first row
body=$(tail -n+2 file1.csv)
Repeat this for both files
3. Concatenate the header and the n bodies
csv_merger.sh
header=$(head -n 1 file1.csv)
body1=$(tail -n+2 file1.csv)
body2=$(tail -n+2 file2.csv)
echo "$header" > merged.csv
echo "$body1" >> merged.csv
echo "$body2" >> merged.csv
Result: merged.csv contains the header once, followed by both bodies.
You could extend this script to handle more files.
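A sketch of one possible generalization, assuming every input file shares the same header:

# write the header once, then append each file's body
first=1
for f in orderline_*.csv; do
    if [ "$first" -eq 1 ]; then
        head -n 1 "$f" > merged.csv
        first=0
    fi
    tail -n +2 "$f" >> merged.csv
done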
Using csvstack from the handy csvkit package is one way to merge CSV files with the same columns:
$ csvstack orderline_123456.csv orderline_896524.csv > file_load.csv

How to make a table using bash shell?

I have multiple text files that each hold a single column. I want to combine them into one text file, side by side like a table, not as one long column.
I tried paste and column, but they did not produce the shape that I wanted.
When I used the paste with two text files, it made a nice table.
paste height_1.txt height_2.txt > test.txt
The trouble starts from three or more text files.
paste height_1.txt height_2.txt height_3.txt > test.txt
At a glance, it seems fine. But when I plot each column of test.txt in gnuplot (p "test.txt"), I get an unexpected graph that differs from the original data, especially in its last part.
The shape of the table is ruined in a strange way in test.txt, which makes the graph look weird.
How can I make a well-structured table in a text file with the bash shell?
Is bash unsuited to this task?
If so, I will try this with Python.
Height files are extracted from other *.csv files using awk.
Thank you so much for reading this question.
awk with simple concatenation can take the records from as many files as you have and join them together into a single output file for further processing. You simply give awk the multiple input files to read, concatenate each record using FNR (the per-file record number) as an index, and then use the END rule to print the combined records from all files.
For example, given 3 data files, data1.txt - data3.txt, each with one integer per row, e.g.
$ cat data1.txt
1
2
3
$ cat data2.txt
4
5
6
(7-9 in data3.txt, and presuming you have an equal number of records in each input file)
You could do:
awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $1 : $1} END {for (i in a) print a[i]}' data1.txt data2.txt data3.txt
(using a tab above with "\t" for the separator between columns of the output file -- you can change to suit your needs)
The result of the command above would be:
1 4 7
2 5 8
3 6 9
(note: this is what you would get with paste data1.txt data2.txt data3.txt, but presuming you have input that is giving paste problems, awk may be a bit more flexible)
Or, using a "," as the separator, you would get:
1,4,7
2,5,8
3,6,9
If your data file has more fields than a single integer and you want to compile all fields in each file, you can assign $0 to the array instead of the first field $1.
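That whole-line variant of the one-liner would be, for example:

awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $0 : $0} END {for (i in a) print a[i]}' data1.txt data2.txt data3.txt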
Spaced and formatted in multi-line form (for easier reading), the original one-liner would be
awk '
{
a[FNR] = (FNR in a) ? a[FNR] "\t" $1 : $1
}
END {
for (i in a)
print a[i]
}
' data1.txt data2.txt data3.txt
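One caveat worth noting (a general awk behavior, not specific to this answer): for (i in a) does not guarantee numeric order in every awk, so if row order matters a counted loop is safer, e.g.

awk '{a[FNR] = (FNR in a) ? a[FNR] "\t" $1 : $1; if (FNR > n) n = FNR}
     END {for (i = 1; i <= n; i++) print a[i]}' data1.txt data2.txt data3.txt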
Look things over and let me know if I misunderstood your question, or if you have further questions about this approach.

How can I delete the lines in a text file that exist in another text file [duplicate]

I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove all the addresses that appear in file B from file A?
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this question might have been asked before, but I only found one command online, and it gave me an error about a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files or only in file 2, leaving just the lines unique to file 1. If the files are not sorted, pipe them through sort first...
See the comm man page for details.
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-line operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
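With the example files created above, for instance:

remove-lines B A    # A is rewritten in place and now holds: b, a, 01, b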
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
NR==FNR{a[$0];next} idiom is for storing the first file in an associative array as keys for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file's line counter (FNR).
a[$0] adds the current line to the associative array as key, note that this behaves like a set, where there won't be any duplicate values (keys)
!($0 in a) we're now in the next file(s); in is a containment test, here checking whether the current line is in the set we populated from the first file, and ! negates the condition. What is missing here is the action, which by default is {print} and is usually not written explicitly.
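Written with that default action made explicit, the same command would be:

awk 'NR==FNR{a[$0];next} !($0 in a){print}' fileB fileA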
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
with a slight change it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
You can use this even if your files are not sorted:
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > tmp && mv tmp file-a
(The output goes through a temporary file because > file-a would truncate file-a before diff reads it.)
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
man diff
for more details
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, and speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N=$N -v lookup="$LOOKUP" '
BEGIN { while ( getline < lookup ) { dictionary[$0]=$0 } }
!($N in dictionary) {print}'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
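A hypothetical invocation (blacklist.txt and data.txt are made-up names here) that drops the rows of data.txt whose second column appears in blacklist.txt:

# N=2 compares column 2; N=0 would compare whole lines
N=2 LOOKUP=blacklist.txt
awk -v N=$N -v lookup="$LOOKUP" '
BEGIN { while ( getline < lookup ) { dictionary[$0]=$0 } }
!($N in dictionary) {print}' data.txt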
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
You can use:
diff fileA fileB | grep "^<" | cut -c3- > tmp && mv tmp fileA
(Lines unique to fileA are prefixed with "<" in diff's output; writing through tmp avoids truncating fileA while diff is still reading it.) This will work for files that are not sorted as well.
To add to the Python answer above, here is a faster solution:
python -c '
lines_to_remove = None
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'
Raising the power of set subtraction.
To get the file left after removing the lines that appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the output of a website through grep to remove the navigation elements, using lynx! You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use grep, comm or join command.
grep works well only for small files. Use -v along with -f:
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2.
comm is a utility command that works on lexically sorted files. It takes two files as input and produces three text columns as output: lines only in the first file; lines only in the second file; and lines in both files. You can suppress printing of any column with the -1, -2 or -3 options accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality join on the specified files (which must also be sorted). Its -v option likewise allows removing common lines between two files.
join -v1 -v2 file1 file2
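Since comm and join both expect lexically sorted input, unsorted files can be handled with process substitution, e.g.:

comm -1 -3 <(sort file2) <(sort file1)
join -v1 -v2 <(sort file1) <(sort file2)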

How do I merge 2 files if a different column in each file matches and both files are in csv/double-quote separated formats?

I've got 2 csv/double quote separated files. Column 26 in file 1 and column 2 in file 2 both contain domains and if I run the following
awk -F'"' '{print $26}' file1.csv
awk -F'"' '{print $2}' file2.txt
Then I can see that file 1 has 6 domains and file 2 has 3 domains.
All of the domains in file 2 are also in file 1.
I'd like to generate a new file containing all columns of file 1 plus all of the columns of file 2 whenever column 2 in file 2 matches column 26 in file 1.
Also, I'm pretty sure that column 26 is always the last column in file 1 but file 2 can have any number of columns.
Does anyone know how can I do this in bash, awk, sed or similar?
@Bruce: Try:
awk -F'"' 'FNR==NR{A[$26]=$0;next} ($2 in A){print A[$2] FS $0}' file1 file2
So here I am checking the condition FNR==NR, which is TRUE only while the first file, file1, is being read. In that phase I create an array named A indexed by field $26, set its value to the current line, and use next to skip all further statements. Then, while reading file2, I check whether its $2 is present in file1's array A and, if so, print array A's value followed by the current line's value.
Kindly provide a sample Input_file and the expected output in case the above doesn't meet your requirements.

Bash comparing two different files with different fields

I am not sure whether this is possible, but I want to compare two character values from two different files. If they match, I want to print the field value in slot 2 from one of the files. Here is an example.
# File 1
Date D
Tamb B
# File 2
F gge0001x gge0001y gge0001z
D 12-30-2006 12-30-2006 12-30-2006
T 14:15:20 14:15:55 14:16:27
B 15.8 16.1 15
Here is the logic I have in mind:
if [ (field2) from (file1) == (field1) from (file2) ] ; do
echo (field1 from file1) and also (field2 from file2) on the same line
which prints out "Date 12-30-2006"
"Tamb 15.8"
" ... "
and continue through every line of file 1, printing any matches. I am assuming some sort of array will be involved. Any thoughts on whether this logic is correct and whether this is even possible?
This reformats file2 based on the abbreviations found in file1:
$ awk 'FNR==NR{a[$2]=$1;next;} $1 in a {print a[$1],$2;}' file1 file2
Date 12-30-2006
Tamb 15.8
How it works
FNR==NR{a[$2]=$1;next;}
This reads each line of file1 and saves the information in array a.
In more detail, NR is the number of lines that have been read in so far and FNR is the number of lines that have been read in so far from the current file. So, when NR==FNR, we know that awk is still processing the first file. Thus, the array assignment, a[$2]=$1 is only performed for the first file. The statement next tells awk to skip the rest of the code and jump to the next line.
$1 in a {print a[$1],$2;}
Because of the next statement, above, we know that, if we get to this line, we are working on file2.
If field 1 of file2 matches any field 2 of file1 (that is, it is a key in a), then print a reformatted version of the line.
