I need to read two files, each containing a single column of decimal numbers, and add them line by line into a third file. This I can do with bash and bc.
Problem:
In some cases these two files may contain non-numeric values. When I come across a non-numeric value I need to detect that it is not a number and skip the line, even if the other file has a valid number on that line, and then continue adding the remaining lines. I would like to do it in Bash itself.
Example below:
file1
1.1
2.89
Nan
4.32
file2
2.1
2.1
42.6
1.1
File3 (result file)
3.2
4.99
5.42
Use:
paste -d+ file1 file2 | bc 2>/dev/null >file3
GNU bc reports a syntax error for the Nan+42.6 line (silenced by 2>/dev/null) and simply emits nothing for it, so only the valid sums reach file3.
I would use awk and paste as:
paste file1 file2 | awk '/^[0-9]*\.?[0-9]+[ \t]+[0-9]*\.?[0-9]+$/ { print $1+$2 }' > File3
This prints the sum only when both fields look like plain decimal numbers (the nested ([0-9]*)* groups in the original regex were redundant).
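If you prefer an explicit validity test to one long regex, a small helper function reads more clearly. A sketch along the same lines (it also accepts an optional sign, which the regex above does not):
paste file1 file2 | awk '
    # A field counts as numeric if it is an optionally signed integer
    # or decimal; anything else (e.g. "Nan") makes the line be skipped.
    function is_num(s) { return s ~ /^[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)$/ }
    is_num($1) && is_num($2) { print $1 + $2 }
' > File3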
Related
I'm looking for the most efficient way to sum X columns of floats, each stored in a distinct file.
All files have exactly the same number of lines (a few hundred).
I do not know in advance the number X.
Example with X=3:
File1:
0.5
0
...
File2:
0
1.5
...
File3:
1.1
2
...
I'd like to generate a file, say sum_files:
1.6
3.5
...
Any efficient way to do this in awk or bash? (There exist solutions using ad hoc Python scripts, but I'm wondering how this can be done in awk or bash.)
Thanks!
I would harness GNU AWK's FNR built-in variable for this task in the following way:
awk '{arr[FNR]+=$1}END{for(i=1;i<=FNR;i+=1){print arr[i]}}' file1 file2 file3
Explanation: for each line, add the value of the 1st field to the entry of array arr whose key is the line number within the current file (FNR). After processing all files, print the values stored in arr. Note that FNR can serve as the loop limit in the END block only because all files have the same number of lines.
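If the equal-length guarantee ever goes away, a small variant (a sketch) tracks the longest file explicitly; arr[i]+0 forces missing entries to print as 0:
awk '{arr[FNR]+=$1; if (FNR>max) max=FNR} END{for(i=1;i<=max;i+=1) print arr[i]+0}' file1 file2 file3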
Reading one line from each file and joining them with a delimiter is one of the things paste(1) does well. Pass the result on to bc(1) to get the sums:
paste -d+ file1 file2 file3 | bc -l
Output:
1.6
3.5
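Since X is not known in advance, you can let a shell glob supply all the files at once (assuming they share a naming pattern; file* here is an assumption):
paste -d+ file* | bc -l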
I have been using this to delete lines from first file that are in second file (difference).
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file2.txt file1.txt >output.txt
This works perfectly for lines that are exactly the same, and it is also fast on files with millions of lines.
Now I have stumbled upon a situation where some lines in the first file are similar to lines in the second file but not exactly the same: they carry an extra 8-9 characters prepended at the beginning, yet are identical from there to the end of the line, like this:
file1
8952aa182685763d30758c730de536a9907f96e7
5e46468f50df8e410b0372dc8a550c0cec33d8bc
11111111-954f94fa00c220c40a49b37816c9146
5dd0a2058734e2c3e039f3a814fc86789474c65e
2222222-s54b2c1d6176b0aae91d85545670aa7a
file2
5e46468f50df8e410b0372dc8a550c0cec33d8bc
954f94fa00c220c40a49b37816c9146
s54b2c1d6176b0aae91d85545670aa7a
Wanted result:
8952aa182685763d30758c730de536a9907f96e7
5dd0a2058734e2c3e039f3a814fc86789474c65e
I tried to find a solution but so far have not succeeded; if this has already been solved somewhere, please share a link. Thanks in advance.
The easiest way to find the lines in file1 without a partial match in file2 is:
grep -v -f file2 file1
Here the -v inverts the match, using the lines listed in file2 as patterns against the lines in file1, resulting in:
8952aa182685763d30758c730de536a9907f96e7
5dd0a2058734e2c3e039f3a814fc86789474c65e
I believe what you are really after is the following:
$ awk -F'-' '(FNR==NR){a[$NF]; next}!($NF in a)' file2 file1
This splits each line into fields separated by -. So for file1, the $NF value (the last field) is given by
8952aa182685763d30758c730de536a9907f96e7 -> 8952aa182685763d30758c730de536a9907f96e7
5e46468f50df8e410b0372dc8a550c0cec33d8bc -> 5e46468f50df8e410b0372dc8a550c0cec33d8bc
11111111-954f94fa00c220c40a49b37816c9146 -> 954f94fa00c220c40a49b37816c9146
5dd0a2058734e2c3e039f3a814fc86789474c65e -> 5dd0a2058734e2c3e039f3a814fc86789474c65e
2222222-s54b2c1d6176b0aae91d85545670aa7a -> s54b2c1d6176b0aae91d85545670aa7a
This is exactly the string you want to match against file2, which is also referenced with $NF since each of its lines contains a single field. This could, however, be problematic if the lines naturally contain more hyphens.
This might be better than the grep solution, since grep matches anywhere in the line and can therefore remove false positives. Imagine lines in file1 that look like:
xxs54b2c1d6176b0aae91d85545670aa7axxxxxx
yyys54b2c1d6176b0aae91d85545670aa7ayyyyy
zzzzs54b2c1d6176b0aae91d85545670aa7azzzz
grep -v -f file2 file1 would remove all of these, whereas the awk approach above would not.
You could also address the problem differently by restating it as:
Don't show the lines of file1 where a line of file2 matches the end of the file1 line.
This can be solved with awk in the following way:
$ awk '(FNR==NR){a[$0]; next}
       {for (str in a) {p=index($0,str); if (p && p+length(str)-1==length($0)) next}}
       {print}' file2 file1
Any line of file1 that ends in one of the file2 strings is skipped with next; everything else is printed, which gives exactly the wanted result.
We could have used match instead of index, but match interprets its pattern as an ERE, so if str contained any special ERE characters it would not do what we want.
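An equivalent suffix test compares the tail of the line directly with substr; a sketch:
awk '(FNR==NR){a[$0]; next}
     {for (str in a)
          if (length($0) >= length(str) &&
              substr($0, length($0)-length(str)+1) == str) next}
     {print}' file2 file1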
I have a large file A (consisting of email addresses), one address per line. I also have another file B that contains another set of addresses.
Which command would I use to remove from file A all the addresses that appear in file B?
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this question has probably been asked before, but I only found one command online, and it gave me an error about a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-2 and -3 suppress the lines that appear only in file2 and the lines that appear in both files, leaving only the lines unique to file1. If the files are not sorted, pipe them through sort first...
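For unsorted files, bash process substitution keeps it a one-liner:
comm -23 <(sort file1) <(sort file2)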
See the comm man page for more details.
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-line operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
The NR==FNR{a[$0];next} idiom stores the first file in an associative array, whose keys serve a later "contains" test.
NR==FNR checks whether we're scanning the first file: the global line counter (NR) equals the per-file line counter (FNR) only there.
a[$0] adds the current line to the associative array as a key; note that this behaves like a set, so there are no duplicate keys.
!($0 in a) runs while we're in the next file(s); in is a containment test that checks whether the current line is in the set populated from the first file, and ! negates the condition. The action is omitted here; it defaults to {print} and is usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
with a slight change it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
You can also do this with diff, provided the common lines appear in the same relative order in both files (sorted files qualify):
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > tmp && mv tmp file-a
Don't redirect straight onto file-a: the shell would truncate it before diff reads it, hence the temporary file.
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
man diff
for more details
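By the same token, swapping the formats prints the lines that are present only in file-b:
diff file-a file-b --new-line-format="%L" --old-line-format="" --unchanged-line-format=""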
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, and speed is assured by awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N="$N" -v lookup="$LOOKUP" '
    BEGIN { while ( (getline line < lookup) > 0 ) { dictionary[line]=line } }
    !($N in dictionary) {print}'
(Checking getline's return value against 0 avoids an infinite loop if the lookup file cannot be opened.)
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
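Here, for instance, is a sketch of that whitespace-trimming variant:
awk -v N="$N" -v lookup="$LOOKUP" '
    # Compare after stripping leading/trailing blanks on both sides.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    BEGIN { while ( (getline line < lookup) > 0 ) dictionary[trim(line)] }
    !(trim($N) in dictionary) {print}'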
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
for line in f.readlines():
lines_to_remove.add(line.strip())
with open("file A", "r") as f:
for line in [line.strip() for line in f.readlines()]:
if line not in lines_to_remove:
print(line)
'
You can use:
diff fileA fileB | grep "^<" | cut -c3- > tmp && mv tmp fileA
The < lines in diff's output are the ones unique to fileA, and cut strips the "< " prefix. Use a temporary file, since redirecting straight onto fileA would truncate it before diff reads it. This also works on unsorted files, as long as the common lines appear in the same relative order in both.
Just to add to the Python answer above, here is a faster solution:
python -c '
lines_to_remove = None
with open("partial file") as f:
lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
for line in remaining_lines:
f.write(line + "\n")
'
Raising the power of set subtraction. Note that, being set-based, this discards duplicate lines and does not preserve the input order.
To get the file after removing the lines which appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the text dump of a website through grep to remove the navigation elements, using lynx! You can replace the lynx command with cat FileA, and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use grep, comm or join command.
grep is only practical for small pattern files, since every line of file1 is checked against every pattern. Use -v along with -f:
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2.
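Combining this with the -F and -x flags from the grep -Fvxf answer earlier avoids both regex interpretation and substring false positives:
grep -vxFf file2 file1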
comm is a utility command that works on lexically sorted files. It
takes two files as input and produces three text columns as output:
lines only in the first file; lines only in the second file; and lines
in both files. You can suppress printing of any column by using -1, -2
or -3 option accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality
join on the specified files (these must be sorted too). Its -v option
allows removing the common lines between two files.
join -v1 -v2 file1 file2
This prints the lines unique to either file (the symmetric difference).
I have a file1 in the below format:
14-02-2017
one 01/02/2017
two 31/01/2017
three 14/02/2017
four 01/02/2017
five 03/02/2017
six 01/01/2017
And file2 in the below format:
11-02-2017
one 01/01/2017
two 31/01/2017
three 14/02/2017
four 11/01/2017
Requirement: I want to copy, replace (or add if necessary) those files mentioned in file1, from some location to the location where file2 resides, whose date (in column 2) is greater than the date recorded in file2. It is guaranteed that under no circumstances will file2 have a program's date greater than that in file1 (they can be equal, though). Entries missing from file2 (but present in file1) shall also be copied.
So in this case, the files one, four, five and six shall be copied from that location to file2's location after the script runs.
awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' "$file2" "$file1" > common
# File 1, column 2 (the files are space-separated, so tell cut so)
f1c2=($(cut -d' ' -f2 -s common))
# File 2, column 2
f2c2=($(cut -d' ' -f2 -s "$file2"))
for x in "${f1c2[@]}"
do
    for y in "${f2c2[@]}"
    do
        if [[ $x > $y || $x == $y ]]
        then
            # copy the file named in field 1 of "common" to file2's path
            break
        fi
    done
done
I was thinking of a way to use awk itself to do the comparison efficiently and create the file "common", so that "common" would contain the newer entries from file1 plus the entries missing from file2. That way I would just need to copy every file mentioned in "common" without further checks.
I tried adding an if block inside awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' $file2 $file1 > common, but I couldn't figure out how to refer to column 2 of file1 and column 2 of file2 for the comparison.
To get the date-compared diff list you can try this:
awk 'NR==FNR {a[$1]=$2; next}
$1 in a {split($2,b,"/"); split(a[$1],c,"/");
if(b[3]""b[2]""b[1] >= c[3]""c[2]""c[1]) delete a[$1]}
END {for(k in a) print k,a[k]}' file1 file2
six 01/01/2017
four 01/02/2017
five 03/02/2017
one 01/02/2017
and operate on the result for copying files...
Explanation
Given file1, we want to remove the entries whose date field is not newer than the matching entry in file2.
NR==FNR {a[$1]=$2; next} cache the contents of file 1
$1 in a (now scanning the second file) checks whether a record for this program exists in file1
split($2,b,"/")... splits the date fields so that we can reorder them to year-month-day for natural string comparison
if(b[3]...) delete a[$1] if the file 2 date is greater or equal to the one in file 1, delete the entry
END... print remaining entries, which will satisfy the requirement.
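To finish the job, the awk output can drive the copy step. A sketch, where awk '...' stands for the command above and SRC_DIR and DEST_DIR are placeholders for the unspecified source location and file2's directory:
awk '...' file1 file2 | while read -r prog date; do
    cp "$SRC_DIR/$prog" "$DEST_DIR/"    # SRC_DIR and DEST_DIR are assumptions
done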
Parsing two files simultaneously with awk is hard, so I suggest another algorithm:
- merge the files
- filter to keep the relevant lines
Have a look at the "comm" and "join" commands. Here is an example:
comm -23 <(sort file1) <(sort file2)
I have two files like this:
File1
114.4.21.198,cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
114.4.21.205,cl_id=1O3M7A7Q0S3C6h85902g7b3h7_101pf
114.4.21.205,cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
114.4.21.213,cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
File2
cl_id=1B3O7M6C8T4O1b559i2g930m0_1165d
cl_id=1X3J7M6J0W5S9535180h90302_101p5
cl_id=1G3D7X6V6A7R81356e3g527m9_101nl
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1L3J7R7O0F0L74954h2g495h8_117qk
cl_id=1J3W7P7H0S3L6g85900g736h6_101ps
cl_id=1W3C7Z7W0U3J6795197g177j9_117p1
cl_id=1I3A7J7N0M3W6e950i7g2g2i0_1020h
cl_id=1Q3Y7Q7J0M3E62953e5g3g5k0_117p6
I want to find the cl_id values that exist in file1 but not in file2, and print the first field from file1 (the IP address) for those lines.
It should look like this:
114.4.21.198
114.4.21.205
114.4.21.205
114.4.21.213
114.4.23.70
114.4.21.201
114.4.21.211
120.172.168.36
I have tried awk, grep, diff and comm, but nothing came close. Please tell me the correct command to do this.
Thanks
One proper way to do that is:
grep -vFf file2 file1 | sed 's|,cl_id.*$||'
I do not see how you get your output. Where does 120.172.168.36 come from?
Here is one solution for the comparison:
awk -F, 'NR==FNR {a[$0]++;next} !a[$2] {print $1}' file2 file1
114.4.21.205
(The test has to be against $2, the cl_id field of file1; with the sample data, only the second line of file1 carries a cl_id that is absent from file2.)
Feed both files into AWK or perl with field separator=",". If there are two fields, add the fields to a dictionary/map/two arrays/whatever ("file1Lines"). If there is just one field (this is file 2), add it to a set/list/array/whatever ("file2Lines"). After reading all input:
Loop over the file1Lines. For each element, check whether the key part is present in file2Lines. If not, print the value part.
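A sketch of that idea in awk, using NF to tell the two inputs apart:
awk -F, '
    NF==2 { ip[FNR]=$1; id[FNR]=$2; n=FNR }   # file1: IP,cl_id pairs
    NF==1 { seen[$1] }                        # file2: bare cl_id lines
    END   { for (i=1; i<=n; i++) if (!(id[i] in seen)) print ip[i] }
' file1 file2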
This seems like what you want to do and might work, efficiently:
grep -Ff file2.txt file1.txt | cut -f1 -d,
First the grep takes the lines from file2.txt to use as patterns, and finds the matching lines in file1.txt. The -F is to use the patterns as literal strings rather than regular expressions, though it doesn't really matter with your sample.
Finally the cut takes the first column from the output, using , as the column delimiter, resulting in a list of IP addresses.
The output is not exactly the same as your sample, but the sample didn't make sense anyway, as it contains addresses that were not in any of the input files. Note also that this prints the IPs whose cl_id is present in file2; if you want the ones that are not in file2 (as the question asks), add -v to the grep.