I'm trying to join several files, which look like below
file1
DATE;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;
2014M01;AZ;PO;
2013M12;WT;UF;
file2
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;
2014M02;BA;LA;
2014M01;BR;ON;
I'm trying to merge them to have the following results
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;
2014M02;BA;LA;
2014M01;BR;ON;AZ;PO;
2013M12;WT;UF;
or
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;
2014M02;BA;LA;;
2014M01;BR;ON;AZ;PO;
2013M12;;WT;UF;
I tried join, but it complains that filenameX is not sorted.
If you have any ideas, they are welcome.
Best.
Will this work for you:
$ awk '
BEGIN{FS=OFS=";"}
NR==FNR{a[$1]=$0;next}
{$0=($1 in a)?a[$1] $2 FS $3:$0; delete a[$1]}1;END{for(x in a) print a[x]}' file2 file1
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME
2014M01;BR;ON;AZ;PO
2013M12;WT;UF;
2014M02;BA;LA;
We set the field separators (Input and Output) to ;
We scan the first file given on the command line (file2) and build an array indexed by column 1, storing the entire line as the value.
Once that file is done, we start reading the second file (file1). If its first column is present in our array, we append fields 2 and 3 of the current line to the line stored in the array; otherwise the line is printed unchanged. Either way, we delete the array item.
Once all lines of the second file are processed, we loop through the array to see if there are any items left. If so, we print them.
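If the padded form of the result is wanted (the second layout in the question), the same two-pass idea can be extended. The following is only a sketch under a few assumptions: every file has exactly two data columns after DATE, the temporary directory and the array names left, right and order are made up for the illustration, and missing columns are padded with empty fields on every row.

```shell
# Sample inputs copied from the question (trailing ';' kept).
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
DATE;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;
2014M01;AZ;PO;
2013M12;WT;UF;
EOF
cat > "$dir/file2" <<'EOF'
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;
2014M02;BA;LA;
2014M01;BR;ON;
EOF

merged=$(awk '
BEGIN { FS = OFS = ";" }
# First file on the command line (file2): remember its two columns and the key order.
NR == FNR { left[$1] = $2 OFS $3; order[++n] = $1; next }
# Second file (file1): remember its columns; keys not seen before go to the end.
{ right[$1] = $2 OFS $3; if (!($1 in left)) order[++n] = $1 }
END {
    for (i = 1; i <= n; i++) {
        k = order[i]
        l = (k in left)  ? left[k]  : OFS   # a lone separator stands for two empty fields
        r = (k in right) ? right[k] : OFS
        print k, l, r
    }
}' "$dir/file2" "$dir/file1")
printf '%s\n' "$merged"
```

With the sample data this prints the header, then 2014M02;BA;LA;; then 2014M01;BR;ON;AZ;PO and finally 2013M12;;;WT;UF, so every row has five fields.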
Bash has this wonderful feature that lets you sort both files inline:
$ join -t ';' -a 1 -a 2 -o 0 1.2 1.3 2.2 2.3 <(sort -n file1 ) <(sort -n file2)
DATE;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM
2013M12;WT;UF;;
2014M01;AZ;PO;BR;ON
2014M02;;;BA;LA
Explanation:
-t ';': use ; as both input and output separator.
-a 1 -a 2: also print unpairable lines from both file1 and file2.
-o 0 1.2 1.3 2.2 2.3: each line is formatted as 0 (the join field), 1.2 (2nd field of file1), 1.3 (3rd field of file1), etcetera.
<(sort -n file1): numeric sort file1 via bash process substitution.
<(sort -n file2): numeric sort file2 via bash process substitution.
For details on bash process substitution, see: http://tldp.org/LDP/abs/html/process-sub.html.
Looking to perform an inner join on two different text files. Basically I'm looking for the inner join equivalent of the GNU join program. Does such a thing exist? If not, an awk or sed solution would be most helpful, but my first choice would be a Linux command.
Here's an example of what I'm looking to do
file 1:
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB
0|App - CSCE Certificate LUA|Admit Type
1|App - CSCE Certificate LUA|Alias 1
2|App - CSCE Certificate LUA|Alias 2
3|App - CSCE Certificate LUA|Alias 3
4|App - CSCE Certificate LUA|Alias 4
file 2:
Alien Registration Card LUA
Results:
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB
Here's an awk option, so you can avoid the bash dependency (for portability):
$ awk -F'|' 'NR==FNR{check[$0];next} $2 in check' file2 file1
How does this work?
-F'|' -- sets the field separator
'NR==FNR{check[$0];next} -- if the total record number matches the file record number (i.e. we're reading the first file provided), then we populate an array and continue.
$2 in check -- If the second field was mentioned in the array we created, print the line (which is the default action if no actions are provided).
file2 file1 -- the files. Order is important due to the NR==FNR construct.
Shouldn't file2 contain LUA at the end?
If yes, you can still use join:
join -t'|' -12 <(sort -t'|' -k2 file1) file2
Looks like you just need
grep -F -f file2 file1
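One thing to keep in mind when choosing between the grep and awk answers: grep -F -f matches the string anywhere in the line, while the awk version compares the second field exactly. A small sketch of the difference, using a shortened copy of the question's data plus one hypothetical extra row (the row starting with 9 is invented for the illustration):

```shell
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
0|App - CSCE Certificate LUA|Admit Type
9|Old Alien Registration Card LUA v2|Hypothetical extra row
EOF
printf '%s\n' 'Alien Registration Card LUA' > "$dir/file2"

# Substring match: also picks up the hypothetical row 9.
by_grep=$(grep -F -f "$dir/file2" "$dir/file1")
# Exact match on field 2: only rows whose second field equals a line of file2.
by_awk=$(awk -F'|' 'NR==FNR{check[$0];next} $2 in check' "$dir/file2" "$dir/file1")

printf 'grep:\n%s\nawk:\n%s\n' "$by_grep" "$by_awk"
```

Here grep returns three rows and awk returns two. Which behaviour is right depends on whether partial matches should count.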
You may modify this script:
while IFS= read -r line; do
grep -- "$line" file1 # or whatever you want to do with the "$line" variable
done < file2
The while loop reads file2 line by line and passes each line to grep, which searches for it in file1. Quoting "$line" keeps a line that contains spaces as a single pattern instead of splitting it into several arguments. There may be some extra output, which can be removed with grep options.
You can use the paste command to combine files:
paste [option] source files [>destination file]
for your example it would be
paste file1.txt file2.txt >result.txt
I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove all the addresses that appear in file B from the file A.
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this is a question that might have been asked more often, but I only found one command online that gave me an error with a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that are in both files, or only in file 2. If the files are not sorted, pipe them through sort first...
See the man page here
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash automation for in-line operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
GitHub upstream.
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
NR==FNR{a[$0];next} idiom is for storing the first file in an associative array as keys for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the current file line counter (FNR).
a[$0] adds the current line to the associative array as key, note that this behaves like a set, where there won't be any duplicate values (keys)
!($0 in a) we're now in the next file(s), in is a contains test, here it's checking whether current line is in the set we populated in the first step from the first file, ! negates the condition. What is missing here is the action, which by default is {print} and usually not written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
with a slight change it can clean multiple lists and create cleaned versions.
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > FILENAME".clean"}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
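You can do this reliably when both files are sorted the same way. Write the result to a new file and then rename it; redirecting straight to file-a would truncate it before diff reads it.
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a.new
mv file-a.new file-a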
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
man diff
for more details
This refinement of @karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N="$N" -v lookup="$LOOKUP" '
BEGIN { while ( (getline line < lookup) > 0 ) { dictionary[line] } }
!($N in dictionary) {print}'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
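As a usage sketch of the field-based comparison, assume a hypothetical input whose second column holds the address and a lookup file with one address per line (the file names and addresses below are invented for the example). The loop tests the return value of getline so that a missing lookup file ends the loop instead of spinning forever.

```shell
dir=$(mktemp -d)
printf '%s\n' 'alice@example.com' 'carol@example.com' > "$dir/lookup"
printf '%s\n' '1 alice@example.com' '2 bob@example.com' '3 carol@example.com' > "$dir/input"

N=2   # compare on the second column; N=0 would compare whole lines
kept=$(awk -v N="$N" -v lookup="$dir/lookup" '
BEGIN { while ((getline line < lookup) > 0) dictionary[line] }   # load the lookup keys
!($N in dictionary)                                              # print rows whose key is absent
' "$dir/input")
printf '%s\n' "$kept"
```

Only the row for bob@example.com survives, because the other two addresses appear in the lookup file.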
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
You can use -
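diff fileA fileB | grep "^<" | cut -c3- > fileA.new
Lines prefixed with < exist only in fileA. Write to a new file rather than to fileA itself, which the shell would truncate before diff reads it. The result is reliable only if the lines common to both files appear in the same order, for example when both files are sorted.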
Just to add to the Python answer to the user above, here is a faster solution:
python -c '
lines_to_remove = None
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'
This harnesses the power of set subtraction. Note that sets keep neither the original line order nor duplicate lines.
To get the file after removing the lines that appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that pipes the output of a website and removes the navigation elements using grep and lynx. You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use grep, comm or join command.
grep is best suited to small files, since every line of file2 becomes a search pattern. Use -v along with -f.
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2.
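Without -F and -x, each line of file2 is treated as a regular expression that may match anywhere in a line, so more lines can be dropped than intended. A short sketch of the difference, on made-up one-column data:

```shell
dir=$(mktemp -d)
printf '%s\n' b 1 a 0 01 > "$dir/file1"
printf '%s\n' 0 1 > "$dir/file2"

# Patterns are regular expressions matched anywhere in the line: "01" is dropped too.
loose=$(grep -vf "$dir/file2" "$dir/file1")
# Fixed strings that must match the whole line: "01" is kept.
exact=$(grep -Fvxf "$dir/file2" "$dir/file1")

printf 'grep -vf:\n%s\ngrep -Fvxf:\n%s\n' "$loose" "$exact"
```

The first command leaves b and a; the second leaves b, a and 01.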
comm is a utility command that works on lexically sorted files. It
takes two files as input and produces three text columns as output:
lines only in the first file; lines only in the second file; and lines
in both files. You can suppress printing of any column by using -1, -2
or -3 option accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality
join on the specified files. Its -v option also lets you remove
common lines between two files.
join -v1 -v2 file1 file2
I have a file1 in the below format:
14-02-2017
one 01/02/2017
two 31/01/2017
three 14/02/2017
four 01/02/2017
five 03/02/2017
six 01/01/2017
And file2 in the below format:
11-02-2017
one 01/01/2017
two 31/01/2017
three 14/02/2017
four 11/01/2017
Requirement: I want to copy (replacing or adding as necessary) the files mentioned in file1 from some location to the location where file2 resides, whenever their date (in column 2) is greater than the date mentioned in file2. It is guaranteed that under no circumstances will file2 have a program's date greater than that in file1 (but it can be equal). The entries missing from file2 (but present in file1) shall also be copied.
So that in this case, the files one, four, five and six shall be copied from some location to the file2 location, after the script execution
awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' $file2 $file1 > common
# File 1, column 2
f1c2=($(cut -f2 -s $common))
# File 2, column 2
f2c2=($(cut -f2 -s $file2))
for x in "${f1c2[@]}"
do
for y in "${f2c2[@]}"
do
if [[ ! $x < $y ]]   # string comparison: true when x >= y
then
# copy file pointed by field $1 in "common" to file2 path
break
fi
done
done
I was thinking of a way to use awk itself efficiently to do the comparison task to create the file "common". So that the file "common" will contain latest files in file 1, plus the missing entries in file 2. Following this way, I just need to copy all files mentioned in the file "common" without any concerns
I was trying to add some if block inside awk -F' ' 'NR==FNR{c[$1]++;next};c[$1] > 0' $file2 $file1 > common, but I couldn't figure out how to address file1 column2 and file 2 column2 for comparing.
to get the date compared diff list you can try this
awk 'NR==FNR {a[$1]=$2; next}
$1 in a {split($2,b,"/"); split(a[$1],c,"/");
if(b[3]""b[2]""b[1] >= c[3]""c[2]""c[1]) delete a[$1]}
END {for(k in a) print k,a[k]}' file1 file2
six 01/01/2017
four 01/02/2017
five 03/02/2017
one 01/02/2017
and operate on the result for copying files...
Explanation
Given file 1, we want to remove the entries whose date is less than or equal to that of the matching entry in file 2.
NR==FNR {a[$1]=$2; next} caches the contents of file 1
$1 in a (now scanning the second file) checks whether the record also exists in file 1
split($2,b,"/")... splits the date fields so that we can reorder them as year-month-day for natural-order comparison
if(b[3]...) delete a[$1] deletes the entry if the file 2 date is greater than or equal to the one in file 1
END... print remaining entries, which will satisfy the requirement.
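The same comparison can also be done in a single pass over file 1, which keeps the original order of the entries. This is a sketch of a variation, not the command above: it reads file 2 first, skips the single-field date line at the top of each file (which would otherwise be stored like any other entry), and uses a helper ymd() and an array old whose names are made up here.

```shell
dir=$(mktemp -d)
cat > "$dir/file1" <<'EOF'
14-02-2017
one 01/02/2017
two 31/01/2017
three 14/02/2017
four 01/02/2017
five 03/02/2017
six 01/01/2017
EOF
cat > "$dir/file2" <<'EOF'
11-02-2017
one 01/01/2017
two 31/01/2017
three 14/02/2017
four 11/01/2017
EOF

to_copy=$(awk '
# ymd() turns dd/mm/yyyy into yyyymmdd so that string comparison orders dates correctly.
function ymd(d,   p) { split(d, p, "/"); return p[3] p[2] p[1] }
NF != 2 { next }                        # skip the single-field date lines
NR == FNR { old[$1] = ymd($2); next }   # first file given: file2
!($1 in old) || ymd($2) > old[$1] { print $1 }   # missing from file2, or newer in file1
' "$dir/file2" "$dir/file1")
printf '%s\n' "$to_copy"
```

For the sample data this lists one, four, five and six, which are the entries the question expects to be copied.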
Parsing two files simultaneously with awk is hard, so I suggest another approach:
- merge the files
- filter to keep the relevant lines
I suggest having a look at the "comm" and "join" commands. Here is an example:
comm -23 <(sort file1) <(sort file2)
I'm automating a workflow with a bash script on Mac OSX. In this workflow, I'd like to add a command that deletes a header from my table (.txt) file that is tab delimited. It looks as follows:
header1 header2 header3
a 1
b 2
c 3
d 4
e 5
f 6
As you can see, the third column, named header3, is empty.
I've noted this post or this one but I don't understand the arguments.
Could you suggest a line of code that automatically deletes the third column, or (even better) deletes the header called 'header3'?
awk is designed to work with whitespace-separated text columns:
awk '{print $1 "\t" $2}' input.txt > output.txt
I found the answer here in Table 2C.
sed s/header3//g input.txt > output.txt
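To drop the column by its header name rather than by position, something along these lines may work. It is a sketch that assumes the file really is tab-delimited and that the header appears once in the first line; the variable names drop, col, out and sep are made up for the example.

```shell
dir=$(mktemp -d)
printf 'header1\theader2\theader3\na\t1\t\nb\t2\t\n' > "$dir/input.txt"

trimmed=$(awk -F'\t' -v OFS='\t' -v drop='header3' '
NR == 1 { for (i = 1; i <= NF; i++) if ($i == drop) col = i }   # find the column to drop
{
    out = ""; sep = ""
    for (i = 1; i <= NF; i++)
        if (i != col) { out = out sep $i; sep = OFS }            # rebuild the row without it
    print out
}' "$dir/input.txt")
printf '%s\n' "$trimmed"
```

If no header matches, col stays unset and every row is printed unchanged.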
I have 2 files, one contains this :
file1.txt
632121S0 126.78.202.250 1
131145S0 126.178.20.250 1
the other contain this : file2.txt
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
313359S2 126.137.37.250 OBS
I want to end up with a third file which contains :
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
Only the lines which start with the same string in both files. I can't remember how to do it. I tried several grep, egrep and find commands, but I still cannot get it to work properly...
Can you help please ?
You can use this awk:
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
It is based on the idea of two file processing, by looping through files as this:
first loop through first file, storing the first field in the array a.
then loop through second file, checking if its first field is in the array a. If that is true, the line is printed.
To do this with grep, you need to use a process substitution:
grep -f <(cut -d' ' -f1 file1.txt) file2.txt
grep -f uses a file as a list of patterns to search for within file2. In this case, instead of passing file1 unaltered, process substitution is used to output only the first column of the file.
If you have a lot of these lines, then the utility join would likely be useful.
join - join lines of two files on a common field
Here's a set of examples.
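As one illustration on the question's data, here is a sketch of the join approach. join expects both inputs sorted on the join field, so the files are sorted into temporary copies first (the temporary directory and the .sorted names are assumptions of the example), and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
dir=$(mktemp -d)
cat > "$dir/file1.txt" <<'EOF'
632121S0 126.78.202.250 1
131145S0 126.178.20.250 1
EOF
cat > "$dir/file2.txt" <<'EOF'
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
313359S2 126.137.37.250 OBS
EOF

# Sort both files on the first field, then keep only file2's columns for matching keys.
LC_ALL=C sort -k1,1 "$dir/file1.txt" > "$dir/f1.sorted"
LC_ALL=C sort -k1,1 "$dir/file2.txt" > "$dir/f2.sorted"
joined=$(LC_ALL=C join -o 2.1,2.2,2.3 "$dir/f1.sorted" "$dir/f2.sorted")
printf '%s\n' "$joined"
```

The output is in sorted key order rather than in the original file order, which is the main trade-off compared with the awk answer.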