I have a file containing rows like
010203040506 azerty
020304050607 qwerty
and another file with rows like
xxxxxxxxxxxxxx 010203040506 yyyyyyyyyyyyyyyyyyyy
=> Files have millions of rows
=> How can I make something "like a diff": get the first file's rows, but without the lines whose number appears somewhere in a line of the second file?
Like in my example the result should be
020304050607 qwerty
Thanks to KamilCuk's answer and this link https://www.gnu.org/software/coreutils/manual/html_node/Paired-and-unpaired-lines.html, I found that join -v 1 file1 file2 shows only the unpaired lines.
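For example, a minimal sketch (assuming the number is the first field of file1 and the second field of file2, and that both inputs are sorted on the join field, as join requires):
join -v 1 -1 1 -2 2 <(sort -k1,1 file1) <(sort -k2,2 file2)
This prints only the lines of file1 whose number has no match in the second field of file2.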
I am inserting an SQL query result into a CSV file. At the end of the file, two extra rows get added as shown below. Can anyone please tell me how to delete these two rows (the blank row and '168 rows selected.')?
Also, it would be great if these rows were not inserted into the CSV file in the first place when writing the SQL query result to the CSV file.
pvm:0a12dd82,pvm:0a12dd84,TechnicalErrorProcess,21-JUN-19 07.01.58.560000 AM,pi_halted,38930,1
pvm:0a12dd77,pvm:0a12dd79,TechnicalErrorProcess,20-JUN-19 12.36.27.384000 PM,pi_halted,1572846,1
pvm:0a12dd6t,pvm:0a12dd6v,TechnicalErrorProcess,20-JUN-19 12.05.22.145000 PM,pi_halted,38929,1
pvm:0a12dd4h,pvm:0a12dd4l,TechnicalErrorProcess,17-JUN-19 07.11.43.522000 AM,pi_halted,9973686,1
168 rows selected.
For MSSQL, prepend the following before the select query:
set nocount on; select ...
I'm not sure if that will work for other databases.
Filter the output of the command above to exclude the last two lines.
I see two ways of doing it with a bash command:
head --lines=-2
Or
sed -e '/rows selected/d' -e '/^ *$/d'
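For example, a minimal sketch, assuming the query output was already spooled to a file named result.csv (a placeholder name):
sed -e '/rows selected/d' -e '/^ *$/d' result.csv > result_clean.csv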
You can specify a negative number for the -n parameter of the head command:
-n, --lines=[-]K
print the first K lines instead of the first 10; with the leading '-', print all but the last K lines of each file
so:
head -n -2 input-file.txt > output-file.txt
I'm pretty new to Bash scripting and I have a problem to solve. I have a file that looks like this:
>atac
ATTGGCAATTAAATTCTTTT
>lipa
ATTACCAAGTAAATTCTTTT
.
.
.
where all even lines have the same length but can contain different characters, and I need to remove, from each even line, a series of positions listed in a .txt file. The .txt file contains only a list of numbers, one per line, corresponding to the positions to be removed, and looks like this:
3
5
8
10
11
The expected output must keep the same length for every even line, but in each of them the positions listed in the .txt file must have been deleted.
Any suggestions?
If the "position" in the txt file indicates always the index of the original string, this awk-oneliner will help you:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x=""}7' your.txt FS="" OFS="" file
>atac
ATGCATAATTCTTTT
>lipa
ATACAGAATTCTTTT
We mark the deleted chars with "-" so that you can verify that the result is correct:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x="-"}7' txt FS="" OFS="" file
>atac
AT-G-CA-T--AATTCTTTT
>lipa
AT-A-CA-G--AATTCTTTT
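For readability, here is the same logic written out as a commented awk script (a sketch; positions.txt and sequences.fa are placeholder names, and the empty FS makes awk treat every character of the sequence lines as its own field):
awk '
    NR==FNR  { del[$0]; next }            # 1st file: remember the 1-based positions to delete
    FNR%2==0 { for (p in del) $p = "" }   # even lines of the 2nd file: blank out those character-fields
    { print }                             # print every line of the 2nd file
' positions.txt FS="" OFS="" sequences.fa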
Need this in bash.
In a Linux directory, I will have a CSV file. For this example, the file will have 6 rows.
Main_Export.csv
1,2,3,4,8100_group1,6,7,8
1,2,3,4,8100_group1,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
I need to parse this file's 5th field (first four chars only) and take each row with 8100 (for example) and put those rows in a new file. Same with all other groups that exist, across the entire file.
Each new file can only contain the rows for its group (one file with the rows for 8100, one file for the rows with 3100, etc.)
Each filename needs to have that group# prepended to it.
The first four characters could be any numeric value, so I can't check these against a list - there are like 50 groups, and maintenance can't be done on this if a group # changes.
When parsing the fifth field, I only care about the first four characters
So we'd start with: Main_Export.csv and end up with four files:
Main_Export_$date.csv (unchanged)
8100_filenameconstant_$date.csv
3100_filenameconstant_$date.csv
5400_filenameconstant_$date.csv
I'm not sure of the rules of the site, or whether I have to try this myself first and then post. I'll come back once I have an idea - but I'm at a total loss. Reading up on awk right now.
If I have understood your problem well, this is very easy...
You can just:
$ awk -F, '{fifth=substr($5, 1, 4) ; print > (fifth "_mysuffix.csv")}' file.csv
or just:
$ awk -F, '{print > (substr($5, 1, 4) "_mysuffix.csv")}' file.csv
And you will get several files like:
$ cat 3100_mysuffix.csv
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
or...
$ cat 5400_mysuffix.csv
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
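If the date and a fixed name part are needed in the file names, as in the question, a sketch could be (assuming the date format and the literal filenameconstant part are placeholders to adjust):
$ d=$(date +%Y%m%d)
$ awk -F, -v d="$d" '{print > (substr($5, 1, 4) "_filenameconstant_" d ".csv")}' Main_Export.csv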
I have a file as below
cat file
a 1
a 2
b 3
I want to delete the 'a 1' row and the 'a 2' row, as their first column is the same.
I tried cat file|uniq -f 1 and I'm getting the desired output, but I want to delete these lines from the file itself.
awk 'NR==FNR{a[$1]++;next}a[$1]==1{print}' file file
This one-liner works for your needs, no matter whether your file is sorted or not.
Some explanation:
This one-liner processes the file twice. The first pass records, in a hashtable (key: 1st column, value: number of occurrences), how many times each first column appears. The second pass checks whether the 1st column has value == 1 in the hashtable and, if yes, prints the line, because those lines are unique with respect to column 1.
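Since the goal is to remove the lines from the file itself, a common pattern is to write to a temporary file and then replace the original (a sketch, assuming a temporary name like file.tmp is acceptable):
awk 'NR==FNR{a[$1]++;next}a[$1]==1{print}' file file > file.tmp && mv file.tmp file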
I have a list of IDs like so:
11002
10995
48981
And a tab delimited file like so:
11002 Bacteria;
10995 Metazoa
I am trying to delete all lines in the tab delimited file containing one of the IDs from the ID list file. For some reason the following won't work and just returns the same complete tab delimited file without any line removed whatsoever:
grep -v -f ID_file.txt tabdelimited_file.txt > New_tabdelimited_file.txt
I also tried numerous other combinations with grep, but I'm currently drawing a blank here.
Any idea why this is failing?
Any help would be greatly appreciated
Since you tagged this with awk, here is one way of doing it:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{ids[$1]++;next}!($1 in ids)' idFile tabFile > new_tabFile
BTW, your grep command is correct. Just double check that your file is not formatted with Windows line endings.
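A quick way to check for and strip Windows line endings before matching (a sketch; if carriage returns are present they will show up at the end of each line):
cat -A ID_file.txt | head        # with GNU cat, clean lines end in $, Windows lines end in ^M$
tr -d '\r' < ID_file.txt > ID_file_unix.txt
grep -v -f ID_file_unix.txt tabdelimited_file.txt > New_tabdelimited_file.txt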