Deleting lines from a file with binary pattern strings - bash

I've got two files. File A contains N lines of text, and File B contains a binary pattern of 0s and 1s, one per line, that is also N lines long.
I want to delete each line from File A whose line number matches a line in File B that contains a 0.
I've read that awk might be a good tool for this, but I have no idea how to use it.
The files are long, around 2000 lines for example (they are video traces).
For example:
File A:
Line 1: 123456
Line 2: 789012
Line 3: 345678
Line 4: 901234
File B:
Line 1: 1
Line 2: 0
Line 3: 0
Line 4: 1
After the execution:
File A:
Line 1: 123456
Line 2: 901234

You can use paste and cut for this:
paste fileB fileA | grep '^1' | cut -f2-
paste fileB fileA - pastes file contents side by side, delimited by a tab
grep '^1' - keeps only the lines that start with 1
cut -f2- - extracts the content that we need
Both cut and paste use tab as the default delimiter.
This is very similar to Benjamin's solution. A small advantage here is that it would work even if fileA were to have more than one field per line.

Assuming the "Line 1:" prefixes don't really exist in your input files, all you need is:
awk 'NR==FNR{a[NR]=$0;next} a[FNR]' fileB fileA
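To see the two-pass idea on the question's sample data, here is a minimal sketch. It assumes the files hold only the values (no "Line N:" prefixes); the scratch directory and the array name flag are made up for the illustration, and it compares against 1 explicitly rather than relying on truthiness.

```shell
# Sample data from the question, written to a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' 123456 789012 345678 901234 > "$tmp/fileA"
printf '%s\n' 1 0 0 1 > "$tmp/fileB"

# Pass 1 (NR==FNR is true only for the first file): store each flag by line number.
# Pass 2: print a fileA line only when the flag stored for that line number is 1.
result=$(awk 'NR==FNR { flag[NR] = $0; next } flag[FNR] == 1' "$tmp/fileB" "$tmp/fileA")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the first and fourth lines of fileA survive, because those are the positions flagged with 1 in fileB.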

You could use a decorate – filter – undecorate pattern:
paste fileA fileB | grep -v '0$' | cut -f1
This prints the lines of each file next to each other (paste), then filters out the lines that end with 0 (grep), then removes the part that came from the second file (cut).
This breaks if fileA contains the delimiter used for paste and cut (a tab by default). To avoid that, we could either swap the files (see codeforester's answer) or resort to something like
paste fileA fileB | sed -n '/1$/s/\t.$//p'
(if line ends with 1, remove tab and last character, then print) or
paste fileA fileB | grep -Po '.*(?=\t1$)'
(match only lines ending in 1, use zero-width look-ahead to exclude tab and 1 from match); the last solution requires a grep that supports Perl compatible regular expressions (PCRE) such as GNU grep.

Lots of interesting answers here. Here's a bash one:
while IFS= read -r -u3 line; IFS= read -r -u4 bool; do
((bool == 1)) && printf "%s\n" "$line"
done 3<fileA 4<fileB
This will be much slower than other solutions.

Another paste/awk solution. If a tab appears in the data, pick another delimiter.
paste file2 file1 | awk -F'\t' '$1{print $2}'

A single awk command can read from both files.
awk '(getline flag < "fileB") > 0 && flag' fileA
After reading each line from fileA, read a line from fileB into a variable flag and test if its integer value is true or not. For true values, the line from fileA is printed.
Depending on your version of awk, you may need to use int(flag) or flag+0 to force the value to be treated as an integer rather than an ordinary non-empty string.
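A minimal sketch of the getline approach with the numeric coercion applied, run on the question's sample data. The scratch directory and the awk variable name flags (passed with -v so the path is not hard-coded) are assumptions for the illustration.

```shell
# Sample data from the question, written to a scratch directory (hypothetical paths).
tmp=$(mktemp -d)
printf '%s\n' 123456 789012 345678 901234 > "$tmp/fileA"
printf '%s\n' 1 0 0 1 > "$tmp/fileB"

# For every line of fileA, pull the next line of the flags file into "flag".
# "flag + 0" forces a numeric test, so a flag of 0 counts as false.
result=$(awk -v flags="$tmp/fileB" '(getline flag < flags) > 0 && flag + 0' "$tmp/fileA")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike the array-based version, this keeps only one flag in memory at a time, at the cost of relying on both files staying in step line by line.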

EDIT: Per @codeforester's comment, if the "Line 1:" and "Line 2:" prefixes are not part of your File1 and File2, then the following may help.
awk 'FNR==NR{a[FNR]=$0;next} $0!=0{print a[FNR]}' filea fileb
Second solution: read fileb first, then filea.
awk 'FNR==NR{if($0!=0){a[FNR]=$0};next} a[FNR]' fileb filea
An alternative to the first solution, in case the OP's files really do contain the "Line 1:", "Line 2:" strings.
The following awk may help here too.
awk '
FNR==NR{
a[FNR]=$NF;
next}
$NF!=0{
printf("%s%s\n","Line " ++count": ",a[FNR])
}' filea fileb

paste and sed combo:
paste -d'\n' fileB fileA | sed -n '/^1$/{n;p}'
123456
901234
You interleave the files:
1
123456
0
789012
0
345678
1
901234
Then you use sed to print each line that directly follows a line containing only a 1. However, this misbehaves if fileA itself has entries that consist only of a 1. If that is the case, use the following sed command, which applies the test only to odd-numbered lines (the 1~2 address form is a GNU sed extension):
paste -d'\n' fileB fileA | sed -n '1~2{/^1$/{n;p}}'

Related

Unix/bash :Print filename as first string before each line in a log file

I'm looking for help on how to prepend the name of a file as the first string in each row of that file.
The files contain only content. I am trying to merge 2 files so that each line of the merged output starts with the name of the file it came from, followed by the original row. Let's consider 2 files named FileA and FileB. FileA has 2 lines, FileB has 2 lines.
FileA
Tom is Cat
Jerry is Mouse
FileB
Cat is Tom
Mouse is Jerry
Expected Output of merged file
FileA Tom is Cat
FileA Jerry is Mouse
FileB Cat is Tom
FileB Mouse is Jerry.
I am struggling to find a solution to this. Please help
Use sed to substitute the filename at the beginning of each line of the file:
sed 's/^/FileA /' fileA >> mergedFile
sed 's/^/FileB /' fileB >> mergedFile
For an arbitrary number of files, you can loop over the filenames (read here from file.txt, one name per line) and construct the sed substitution command dynamically from the variable that holds each filename.
while read -r f
do
sed "s|^|$f |" "$f"
done < file.txt > merge.txt
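Here is the loop run end to end on the question's sample data, as a sketch. It assumes file.txt is a plain list of the files to merge (that list, and the scratch directory, are made up for the illustration) and that the filenames contain no |, & or backslash characters, which would be interpreted by sed.

```shell
tmp=$(mktemp -d)
result=$(
  cd "$tmp" || exit 1
  printf '%s\n' 'Tom is Cat' 'Jerry is Mouse' > FileA
  printf '%s\n' 'Cat is Tom' 'Mouse is Jerry' > FileB
  printf '%s\n' FileA FileB > file.txt   # hypothetical list of files to merge

  # Prefix every line of each listed file with that file's name.
  while IFS= read -r f; do
    sed "s|^|$f |" "$f"
  done < file.txt > merge.txt
  cat merge.txt
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

The redirection sits on the whole loop, so merge.txt is opened once and every sed invocation appends to it in order.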
Using awk and brace expansion.
awk '{print FILENAME, $0}' file{A,B} | tee mergefile
The file names can be anything. If yours don't fit the brace expansion, just pass them as arguments to awk:
awk '{print FILENAME, $0}' filefoo filebar filemore ...
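This can also be done with grep if your grep has the -H option (note that grep separates the name from the line with a colon, and the pattern . skips empty lines):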
grep -H . fileA fileB
Again, the filenames can be anything.
In the awk example above, tee sends the output to both stdout and mergefile.
If you prefer ripgrep over grep these two commands produce the same output:
$ grep --with-filename '' File*
$ rg --with-filename --no-heading --no-line-number '' File*
FileA:Tom is Cat
FileA:Jerry is Mouse
FileB:Cat is Tom
FileB:Mouse is Jerry

Compare column1 in File with column1 in File2, output {Column1 File1} that does not exist in file 2

Below is my file 1 content:
123|yid|def|
456|kks|jkl|
789|mno|vsasd|
and this is my file 2 content
123|abc|def|
456|ghi|jkl|
789|mno|pqr|
134|rst|uvw|
The only thing I want to compare between File 1 and File 2 is column 1. Based on the files above, the output should be only:
134|rst|uvw|
Line-by-line comparisons are not the answer, since columns 2 and 3 contain different things; only column 1 contains exactly the same values in both files.
How can I achieve this?
Currently I'm using this in my code:
#sort FILEs first before comparing
sort $FILE_1 > $FILE_1_sorted
sort $FILE_2 > $FILE_2_sorted
for oid in $(cat $FILE_1_sorted |awk -F"|" '{print $1}');
do
echo "output oid $oid"
#for every oid in FILE 1, compare it with oid FILE 2 and output the difference
grep -v diff "^${oid}|" $FILE_1 $FILE_2 | grep \< | cut -d \ -f 2 > $FILE_1_tmp
You can do this in Awk very easily!
awk 'BEGIN{FS=OFS="|"}FNR==NR{unique[$1]; next}!($1 in unique)' file1 file2
Awk works by processing input lines one at a time. It also provides two special clauses, BEGIN{} and END{}, which enclose actions to be run before and after the input is processed.
So the part BEGIN{FS=OFS="|"} runs before any file processing happens. FS and OFS are special variables in Awk that hold the input and output field separators. Since you have provided a file that is delimited by |, you need to set FS="|" to parse it, and OFS="|" to print it back with |.
The main part of the command comes after the BEGIN clause. The condition FNR==NR is true only while processing the first file argument, because NR keeps track of the line number across both files combined and FNR only within the current file. So for each line of the first file, $1 is stored as a key in the array called unique. When the second file is processed, the part !($1 in unique) prints only those lines whose $1 value is not in the array, which drops the lines whose key also appears in the first file.
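The difference between the two counters is easy to check by printing them. The sketch below uses the sample files from the question; the scratch directory and the array name seen are made up for the illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' '123|yid|def|' '456|kks|jkl|' '789|mno|vsasd|' > "$tmp/file1"
printf '%s\n' '123|abc|def|' '456|ghi|jkl|' '789|mno|pqr|' '134|rst|uvw|' > "$tmp/file2"

# NR counts lines across all inputs; FNR restarts at 1 for each file.
counters=$(awk '{ print NR, FNR }' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$counters"

# Remember column 1 of file1, then print file2 lines whose column 1 was never seen.
result=$(awk -F'|' 'FNR==NR { seen[$1]; next } !($1 in seen)' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The two counters agree for the three lines of file1 and diverge from the fourth input line onward, which is why FNR==NR selects only the first file.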
Here is another one liner that uses join, sort and grep
join -t"|" -j 1 -a 2 <(sort -t"|" -k1,1 file1) <(sort -t"|" -k1,1 file2) |\
grep -E -v '.*\|.*\|.*\|.*\|'
join does two things here. It pairs all lines from both files with matching keys and, with the -a 2 option, also prints the unmatched lines from file2.
Since join requires input files to be sorted, we sort them.
Finally, grep removes all lines that contain four or more | characters, which are exactly the joined (matched) lines.
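As a possible simplification (not part of the answer above), join can also be asked to print only the unpairable lines of the second file with -v 2, which makes the grep step unnecessary. A sketch on the question's sample data, using temporary sorted copies instead of process substitution; the scratch paths are made up:

```shell
tmp=$(mktemp -d)
printf '%s\n' '123|yid|def|' '456|kks|jkl|' '789|mno|vsasd|' > "$tmp/file1"
printf '%s\n' '123|abc|def|' '456|ghi|jkl|' '789|mno|pqr|' '134|rst|uvw|' > "$tmp/file2"

# join needs both inputs sorted on the join field, using the same collation.
LC_ALL=C sort -t'|' -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t'|' -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# -v 2 prints only the lines of the second file that have no match in the first.
result=$(LC_ALL=C join -t'|' -v 2 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Setting LC_ALL=C for both sort and join keeps their notion of ordering consistent, which join relies on.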

Using cut and grep commands in unix

I have a file (file1.txt) with text as:
aaa,,,,,
aaa,10001781,,,,
aaa,10001782,,,,
bbb,10001783,,,,
My file2 contents are:
11111111
10001781
11111222
I need to look up the second field of each file1 line in file2 and delete the line from file1 if it matches. So the output will be:
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
Can I use grep and cut commands for this?
This prints lines from file1.txt only if the second field is not in file2:
$ awk -F, 'FNR==NR{a[$1]=1; next;} !a[$2]' file2 file1.txt
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
How it works
This works by reading file2 and keeping track of all lines seen in an associative array a. Then, each line in file1.txt is printed only if its column 2 is not in a. In more detail:
FNR==NR{a[$1]=1; next;}
When reading file2, set a[$1] to 1 to signal that we have seen the value on this line. We then instruct awk to skip the rest of the commands and start over on the next line.
This section is only run for file2 because file2 is listed first on the command line and FNR==NR only when we are reading the first file listed on the command line. This is because FNR is the number of lines read from the current file and NR is the total number of lines read so far. These two are equal only for the first file.
!a[$2]
When reading file1.txt, a[$2] evaluates to true if column 2 was seen in file2. Since ! is negation, !a[$2] evaluates to true when column 2 was not seen. When this evaluates to true, the line is printed.
Alternative
This is the same logic, expressed in a slightly different style, as suggested in the comments by Tom Fenech:
$ awk -F, 'FNR==NR{a[$1]; next;} !($2 in a)' file2 file1.txt
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
Solution with grep
$ grep -vf file2 file1.txt
aaa,,,,,
aaa,10001782,,,,
bbb,10001783,,,,
John1024's awk solution would be faster for large files though.
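One difference between the two approaches is worth checking before choosing: grep -f matches each pattern anywhere on the line, while the awk version looks only at column 2. The sketch below adds a made-up third line to file1.txt in which the ID appears in column 3 to show the effect; the scratch directory and variable names are also assumptions.

```shell
tmp=$(mktemp -d)
# The third line is a made-up case: the ID appears in column 3, not column 2.
printf '%s\n' 'aaa,,,,,' 'aaa,10001781,,,,' 'ccc,10001799,10001781,,,' > "$tmp/file1.txt"
printf '%s\n' 10001781 > "$tmp/file2"

# grep treats each line of file2 as a pattern and matches it anywhere on the line.
with_grep=$(grep -vf "$tmp/file2" "$tmp/file1.txt")
# awk compares the IDs against column 2 only.
with_awk=$(awk -F, 'FNR==NR { seen[$1]; next } !($2 in seen)' "$tmp/file2" "$tmp/file1.txt")

printf 'grep keeps:\n%s\n' "$with_grep"
printf 'awk keeps:\n%s\n' "$with_awk"
rm -rf "$tmp"
```

Here grep also deletes the ccc line, while awk keeps it, so the grep form is only equivalent when the IDs cannot occur in other fields.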

How to fill empty lines from one file with corresponding lines from another file, in BASH?

I have two files, file1.txt and file2.txt. Each has an identical number of lines, but some of the lines in file1.txt are empty. This is easiest to see when the content of the two files is displayed in parallel:
file1.txt     file2.txt
cat           bear
fish          eagle
spider        leopard
              snail
catfish       rainbow trout
              snake
              koala
rabbit        fish
I need to assemble these files together, such that the empty lines in file1.txt are filled with the data found in the lines (of the same line number) from file2.txt. The result in file3.txt would look like this:
cat
fish
spider
snail
catfish
snake
koala
rabbit
The best I can do so far, is create a while read -r line loop, create a counter that counts how many times the while loop has looped, then use an if-conditional to check if $line is empty, then use cut to obtain the line number from file2.txt according to the number on the counter. This method seems really inefficient.
Sometimes file2.txt might contain some empty lines. If file1.txt has an empty line and file2.txt also has an empty line in the same place, the result is an empty line in file3.txt.
How can I fill the empty lines in one file with corresponding lines from another file?
paste file1.txt file2.txt | awk -F '\t' '$1 { print $1 ; next } { print $2 }'
Here is the way to handle these files with awk:
awk 'FNR==NR {a[NR]=$0;next} {print (NF?$0:a[FNR])}' file2 file1
cat
fish
spider
snail
catfish
snake
koala
rabbit
First it stores every line of file2 in array a, using the record number as the index.
Then it prints file1, but it tests whether file1 contains data for each record.
If there is data for this record, it uses it; if not, it takes the line from file2.
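The question also asks what happens when both files are empty at the same position. A small sketch with made-up data (shorter than the question's sample) suggests the awk approach covers that case too; the scratch directory and the array name fill are assumptions.

```shell
tmp=$(mktemp -d)
# Line 2 is empty only in file1; line 3 is empty in both files.
printf '%s\n' cat '' '' rabbit > "$tmp/file1.txt"
printf '%s\n' bear snail '' fish > "$tmp/file2.txt"

# Store file2 by line number, then print each file1 line, or the stored
# file2 line when the file1 line has no fields (NF is 0).
result=$(awk 'FNR==NR { fill[NR] = $0; next } { print (NF ? $0 : fill[FNR]) }' \
  "$tmp/file2.txt" "$tmp/file1.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Line 2 is filled with snail from file2, and line 3 stays empty because the fallback line is empty as well. Note that NF is also 0 for lines that contain only whitespace, so those are treated as empty.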
One with getline (harmless in this case) :
awk '{getline p<f; print NF?$0:p; p=x}' f=file2 file1
Just for fun:
paste file1.txt file2.txt | sed -E 's/^\t//g' | cut -f1
This deletes tabs that are at the beginning of a line (those missing from file1) and then takes the first column.
(For OSX, \t doesn't work in sed, so to get the TAB character, you type ctrl-V then Tab)
A solution without awk:
paste -d"#" file1 file2 | sed 's/^#\(.*\)/\1/' | cut -d"#" -f1
Here is a Bash only solution.
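# Read both files into arrays, one element per line (empty lines included).
for i in 1 2; do
  while IFS= read -r line; do
    if [ "$i" -eq 1 ]; then
      arr1+=("$line")
    else
      arr2+=("$line")
    fi
  done < "file${i}.txt"
done
# For each index, print the line from file1 unless it is empty.
for r in "${!arr1[@]}"; do
  if [[ -n ${arr1[$r]} ]]; then
    echo "${arr1[$r]}"
  else
    echo "${arr2[$r]}"
  fi
done > file3.txt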

grep "output of cat command - every line" in a different file

Sorry, the title of this question is a little confusing, but I couldn't think of anything else.
I am trying to do something like this
cat fileA.txt | grep `awk '{print $1}'` fileB.txt
fileA contains 100 lines while fileB contains 100 million lines.
What I want is to take each id from fileA, grep for that id in a different file (fileB), and print the matching line.
e.g fileA.txt
1234
1233
e.g.fileB.txt
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
Expected output is
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
Getting rid of cat and awk altogether:
grep -f fileA.txt fileB.txt
awk alone can do that job well:
awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' fileA fileB
see the test:
kent$ head a b
==> a <==
1234
1233
==> b <==
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
kent$ awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' a b
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
EDIT
add explanation:
-F'|' # use | as the field separator (this matters for fileB; fileA has a single field per line)
'NR==FNR{a[$0];next;} # save the lines of fileA as keys in array a
$1 in a # if $1 (the 1st field) of the current fileB line is in array a, print that line
There are further details I cannot explain here, sorry, for example how awk handles two files and what NR and FNR are. I suggest trying this awk line in case the accepted answer doesn't work for you. If you want to dig a little deeper, read some awk tutorials.
If the ids are on distinct lines, you could use the -f option in grep like this:
cut -d "|" -f1 < fileB.txt | grep -F -f fileA.txt
The cut command ensures that grep searches only the first field. Note that, as a result, this pipeline prints only the matching ids, not the full lines from fileB.txt.
From the man page:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line.
The empty file contains zero patterns, and therefore matches nothing.
(-f is specified by POSIX.)
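If you want to stay with grep -f but still print full lines and match only the first field, one option is to generate anchored patterns from fileA.txt first. This is a sketch, not part of the answers above: it assumes the ids contain no regular-expression metacharacters, and the patterns file, the scratch directory, and the extra 51234 line are made up for the illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 1234 1233 > "$tmp/fileA.txt"
# The last line is a made-up case where an id occurs inside a longer number.
printf '%s\n' '1234|asdf|2012-12-12' '5555|asdd|2012-11-12' \
  '1233|fvdf|2012-12-11' '51234|zzzz|2012-10-10' > "$tmp/fileB.txt"

# Turn each id into a pattern anchored to the first field, e.g. "^1234|".
# In grep's default basic regular expressions an unescaped "|" is a literal character.
sed 's/.*/^&|/' "$tmp/fileA.txt" > "$tmp/patterns"

plain=$(grep -f "$tmp/fileA.txt" "$tmp/fileB.txt")
anchored=$(grep -f "$tmp/patterns" "$tmp/fileB.txt")

printf 'plain -f:\n%s\n' "$plain"
printf 'anchored -f:\n%s\n' "$anchored"
rm -rf "$tmp"
```

The plain form also returns the 51234 line because 1234 occurs inside it; the anchored patterns return only the two lines whose first field is exactly one of the ids.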

Resources