I have two files, File1 and File2.
File1:
1 a
2 b
File2:
1 a
2 c
3 d
I would like to generate a file that has the following:
1 a
2 c
3 d
That is, each line of File2 should either be inserted into File1 or update the matching line, much like the UPSERT feature in SQL.
I'm guessing here, since the question is a bit vague. Anyway, here's something in awk that uses the first field as a key under which to store the line; the stored value is overwritten whenever the same key is seen again, so the last file wins:
$ awk '{a[$1]=$0}END{for (i in a) print a[i]}' f1 f2
1 a
2 c
3 d
EDIT: The new version takes an arbitrarily wide file instead of being tied to two fields.
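One thing to be aware of: awk's for (i in a) loop makes no guarantee about output order. If you want the result ordered by key (assuming numeric keys, as in the example), you can pipe it through sort:
$ awk '{a[$1]=$0}END{for (i in a) print a[i]}' f1 f2 | sort -n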
I'm looking for the most efficient way to sum X columns of floats, where each column is stored in a distinct file.
All files have exactly the same number of lines (a few hundred).
I do not know the number X in advance.
Example with X=3:
File1:
0.5
0
...
File2:
0
1.5
...
File3:
1.1
2
...
I'd like to generate a file, say sum_files:
1.6
3.5
...
Any efficient way to do this in awk or bash? (There are solutions using ad hoc Python scripts, but I'm wondering how this can be done in awk or bash.)
Thanks!
I would harness GNU AWK's FNR built-in variable for this task in the following way:
awk '{arr[FNR]+=$1}END{for(i=1;i<=FNR;i+=1){print arr[i]}}' file1 file2 file3
Explanation: for each line, increase the value in array arr, under the key given by the line number within the current file, by the value of the 1st field. After processing all the files, print the values stored in arr. Note that FNR can be used as the limit of the for loop because all the files have an equal number of lines.
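Since X is not known in advance, the same command can be pointed at a shell glob instead of listing the files explicitly (assuming they all match a pattern such as file*):
awk '{arr[FNR]+=$1}END{for(i=1;i<=FNR;i+=1){print arr[i]}}' file* > sum_files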
Reading one line from each file and joining them with a delimiter is one of the things paste(1) does well. Pass the result on to bc(1) to get the sums:
paste -d+ file1 file2 file3 | bc -l
Output:
1.6
3.5
Let's say I have two files like this:
File1:
A B C
File2:
D C B
The result file should be like: A B C D (order doesn't matter).
I could google this if I knew exactly the name of this operation (it probably has one; to me it looks like an OR).
Using the Linux commands merge or cat (cat file1 file2 > file3) outputs every single line, giving A B C D C B, but the man pages of those two commands do not mention anything helpful for this purpose. I'd like an elegant solution like [command] [parameter] file1 file2 > file3; I could write a bash script to do it, but that seems pretty overkill.
This will concatenate, then sort, then remove duplicate lines:
LC_ALL=C sort -u input1.txt input2.txt > output.txt
When you do not need the output sorted, you can skip that step:
awk '{a[$0]} END {for (key in a) print key;}' file[12]
I have 2 large files (F1 and F2) with 200k+ rows each, and currently I am comparing each record in F1 against F2 to look for records unique only to F1, then comparing F2 to F1 to look for records unique only to F2.
I am doing this by reading each line of one file in a 'while' loop and then running 'grep' with that line against the other file to see if a match is found.
This process takes about 3 hours to complete if there are no mismatches, and can take 6+ hours if there are a large number of mismatches (the files barely matching, so 200k+ mismatches).
Is there any way I can rewrite this script to accomplish the same thing in less time?
I have tried rewriting the script to use sed to delete a line from F2 whenever a match is found, so that when comparing F2 to F1 only the values unique to F2 remain; however, calling sed for every line of F1 does not seem to improve the performance much.
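Roughly, the loop looks something like this (a sketch of the approach described above; the exact grep flags are placeholders):
# print lines of F1 that have no exact match anywhere in F2
while IFS= read -r line; do
    grep -qxF -- "$line" F2 || printf '%s\n' "$line"
done < F1
Each iteration re-scans all of F2, which is what makes the whole thing quadratic.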
Example:
F1 contains:
A
B
E
F
F2 contains:
A
Y
B
Z
The output I'm expecting is when comparing F1 to F2:
E
F
And then comparing F2 to F1:
Y
Z
You want comm:
$ cat f1
A
B
E
F
$ cat f2
A
Y
B
Z
$ comm <(sort f1) <(sort f2)
		A
		B
E
F
	Y
	Z
Column 1 of the comm output contains the lines unique to f1, column 2 the lines unique to f2, and column 3 the lines found in both f1 and f2.
The parameters -1, -2, and -3 suppress the corresponding output. For example, if you want only the lines unique to f1, you can filter out the other columns:
$ comm -23 <(sort f1) <(sort f2)
E
F
Note that comm requires sorted input, which I supply in these examples using bash process substitution (<()). If you're not using bash, pre-sort into temporary files.
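For example, a rough equivalent using temporary files (the .sorted names are just placeholders) is:
sort f1 > f1.sorted
sort f2 > f2.sorted
comm -23 f1.sorted f2.sorted    # lines unique to f1
comm -13 f1.sorted f2.sorted    # lines unique to f2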
Have you tried Linux's diff?
Some useful options are -i, -w, -u, and -y.
In that case, though, the files would have to be in the same order (you could sort them first).
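For example, a sketch using the same process substitution as the comm answer above, so the inputs are sorted on the fly:
diff <(sort F1) <(sort F2)
In diff's default output, lines prefixed with < appear only in F1 and lines prefixed with > appear only in F2.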
If the sort order of the output is not important and you are only interested in the combined set of lines that appear in just one of the two files, you can do:
sort F1 F2 | uniq -u
Grep is going to use compiled code to do the entirety of what you want if you simply treat one or the other of your files as a pattern file.
grep -vFx -f F1.txt F2.txt:
Y
Z
grep -vFx -f F2.txt F1.txt:
E
F
Explanation:
-v - print lines not matching those in the "pattern file" specified with -f
-F - interpret patterns as fixed strings and not regexes (gleaned from this question, which I was reading to see if there was a practical limit to this; I am curious whether it will work with large line counts in both files)
-x - match entire lines
Sorting is not required - you get the resulting unique lines in the order they appear. This method takes longer because it cannot assume the inputs are sorted, but if you are looking at multiline records, sorting really trashes the context. The performance is okay if the files are similar, because grep -v skips a line as soon as it matches any line in the "pattern" file. If the files are highly dissimilar, the performance is very slow, because it's checking every pattern against every line before finally printing it.
I am not sure if this is possible, but I want to compare two character values from two different files. If they match, I want to print out the field value in slot 2 from one of the files. Here is an example:
# File 1
Date D
Tamb B
# File 2
F gge0001x gge0001y gge0001z
D 12-30-2006 12-30-2006 12-30-2006
T 14:15:20 14:15:55 14:16:27
B 15.8 16.1 15
Here is my thinking behind what I want to do:
if [ (field2) from (file1) == (field1) from (file2) ] ; do
echo (field1 from file1) and also (field2 from file2) on the same line
which prints out "Date 12-30-2006"
"Tamb 15.8"
" ... "
and continue running through every line of file 1, printing out any matches there are. I am assuming some sort of array will be involved. Any thoughts on whether this is the correct logic, and whether this is even possible?
This reformats file2 based on the abbreviations found in file1:
$ awk 'FNR==NR{a[$2]=$1;next;} $1 in a {print a[$1],$2;}' file1 file2
Date 12-30-2006
Tamb 15.8
How it works
FNR==NR{a[$2]=$1;next;}
This reads each line of file1 and saves the information in array a.
In more detail, NR is the number of lines that have been read in so far and FNR is the number of lines that have been read in so far from the current file. So, when NR==FNR, we know that awk is still processing the first file. Thus, the array assignment a[$2]=$1 is performed only for the first file. The statement next tells awk to skip the rest of the code and jump to the next line.
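If it helps to see the difference, a quick way to watch the two counters (just an illustration, not part of the solution) is:
$ awk '{print FILENAME, NR, FNR}' file1 file2
NR keeps climbing across both files, while FNR resets to 1 when awk starts reading file2.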
$1 in a {print a[$1],$2;}
Because of the next statement, above, we know that, if we get to this line, we are working on file2.
If field 1 of file2 matches any field 2 of file1, then print a reformatted version of the line.
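If you want to keep all of the data columns rather than just the first one (an assumption about the desired output, since the question only shows one value per label), a small variant is to replace field 1 with the label and print the whole line:
$ awk 'FNR==NR{a[$2]=$1;next;} $1 in a {$1=a[$1]; print;}' file1 file2
Date 12-30-2006 12-30-2006 12-30-2006
Tamb 15.8 16.1 15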
Suppose I have two lists of numbers in files f1 and f2, one number per line. I want to see how many numbers in the first list are not in the second and vice versa. Currently I am using grep -f f2 -v f1 and then repeating this with a shell script. This is pretty slow (quadratic time hurts). Is there a nicer way of doing this?
I like 'comm' for this sort of thing.
(The files need to be sorted.)
$ cat f1
1
2
3
$ cat f2
1
4
5
$ comm f1 f2
		1
2
3
	4
	5
$ comm -12 f1 f2
1
$ comm -23 f1 f2
2
3
$ comm -13 f1 f2
4
5
$
Couldn't you just put each number on its own line and then diff(1) them? You might need to sort the lists beforehand for that to work properly, though.
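For example, to get the counts directly (a sketch; it assumes bash process substitution, otherwise pre-sort into temporary files first):
diff <(sort f1) <(sort f2) | grep -c '^<'    # how many numbers are only in f1
diff <(sort f1) <(sort f2) | grep -c '^>'    # how many numbers are only in f2
In diff's default output, lines from the first file are prefixed with < and lines from the second file with >, so grep -c just counts them.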
In the special case where one file is a subset of the other, the following:
cat f1 f2 | sort | uniq -u
would list the lines only in the larger file. And of course piping to wc -l will show the count.
However, that isn't exactly what you described.
This one-liner serves my particular needs often, but I'd love to see a more general solution.