Deleting the lines which match the output of another unix command - shell

I have a file as below
cat file
a 1
a 2
b 3
I want to delete the rows a 1 and a 2, because their first column is the same.
I tried cat file | uniq -f 1 and I'm getting the desired output, but I want to delete these lines from the file itself.

awk 'NR==FNR{a[$1]++;next}a[$1]==1{print}' file file
This one-liner does what you need, whether or not your file is sorted.
Some explanation:
The one-liner processes the file twice. On the first pass it records, in a hash table (key: first column, value: number of occurrences), how often each first-column value appears. On the second pass it checks whether the count for the current line's first column is 1 and, if so, prints the line, because such lines are unique with respect to column 1.
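The one-liner only prints the surviving lines; it does not change the file. A minimal sketch of how the file itself could be updated, assuming it is acceptable to write to a temporary file (file.tmp is a name chosen here for illustration) and then move it over the original:

```shell
# Demo in a scratch directory; "file" mirrors the sample from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a 1\na 2\nb 3\n' > file

# Pass 1 counts each first-column value; pass 2 keeps lines whose count is 1.
# The result goes to a temporary file, which replaces the original only if
# awk succeeded.
awk 'NR==FNR { count[$1]++; next } count[$1] == 1' file file > file.tmp &&
    mv file.tmp file

cat file   # prints: b 3
```

Redirecting straight back to file (awk ... file file > file) would truncate the input before awk reads it, which is why the temporary file is used.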

Related

diff two files based on part of lines

I have a file containing rows like
010203040506 azerty
020304050607 qwerty
and another file with rows like
xxxxxxxxxxxxxx 010203040506 yyyyyyyyyyyyyyyyyyyy
=> Files have millions of rows
=> How can I make something "like a diff" : get the first files rows but without the lines for which the number is in a line in the second file ?
Like in my example the result should be
020304050607 qwerty
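Thanks to KamilCuk's answer and this link https://www.gnu.org/software/coreutils/manual/html_node/Paired-and-unpaired-lines.html I found that join -v 1 file1 file2 shows only the unpaired lines of the first file. Note that join expects both files to be sorted on the join field, and that the join field defaults to the first field of each file; with the sample above the number is the second field of the second file, so -2 2 would be needed as well.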
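A minimal sketch of the join approach on the sample rows from the question, assuming the number is always field 1 of the first file and field 2 of the second. The names file1.sorted and file2.sorted are made up for this example:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '010203040506 azerty\n020304050607 qwerty\n' > file1
printf 'xxxxxxxxxxxxxx 010203040506 yyyyyyyyyyyyyyyyyyyy\n' > file2

# join expects both inputs sorted on their join field, using the same
# collation as join itself, hence LC_ALL=C everywhere.
LC_ALL=C sort -k1,1 file1 > file1.sorted
LC_ALL=C sort -k2,2 file2 > file2.sorted

# -1 1 / -2 2: the key is field 1 of file1 and field 2 of file2.
# -v 1: print only the file1 lines that have no partner in file2.
unpaired=$(LC_ALL=C join -v 1 -1 1 -2 2 file1.sorted file2.sorted)
echo "$unpaired"   # prints: 020304050607 qwerty
```

With millions of rows the two sorts dominate the run time, but sort spills to disk when needed, so memory use stays bounded.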

Find position of the first occurrence of a substring in a file

I have a very large file, which is made of only one line (no CR at all).
I have several occurrences of the same pattern (let's say here that the pattern is ABCDE).
I want to return the starting position, or the starting column, of the first character of the first occurrence of this pattern...
for example, if this is the data in the file :
123456ABCDEF456987ABCDEFjhkhkhkhABCDEF
I want to return 7 as the starting column of the first occurrence of the pattern...
thanks community :-)
Use awk index() function:
awk -v pattern="ABCDE" '{print index($0,pattern)}' file
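For illustration, here is the same call run against the sample data from the question, with one small addition that is not in the answer: index() returns 0 when the substring is absent, so the sketch prints "not found" in that case instead of 0.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# One single line with no trailing newline, as in the question.
printf '123456ABCDEF456987ABCDEFjhkhkhkhABCDEF' > file

# index() is 1-based and treats the pattern as a plain string, not a regex.
pos=$(awk -v pattern="ABCDE" '
    { p = index($0, pattern) }
    p { print p; found = 1; exit }
    END { if (!found) print "not found" }
' file)
echo "$pos"   # prints: 7
```

Keep in mind that awk reads the whole record into memory, so a very large single-line file needs roughly that much RAM.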
Use the "C" option of "split", so there will be no need to repair the files afterwards.
-C, --line-bytes=SIZE
put at most SIZE bytes of lines per output file

Replace a part of a file by a part of another file

I have two files containing a lot of floating numbers. I would like to replace one of the floating numbers from file 1 by a floating number from File 2, using lines and characters to find the numbers (and not their values).
A lot of topics on the subject, but I couldn't find anything that uses a second file to copy the values from.
Here are examples of my two files:
File1:
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
File2:
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869
In this particular example, I would like to replace the first number of the second line of File1 (2.64895E-01) by the second floating number written on line 5 of File2 (0.65728944).
Note: the value of the numbers will change according to which file I consider, so I have to identify the numbers by their positions inside the files.
I am very new to bash scripting and have only used the sed command so far to modify my files.
Any help is welcome :)
Thanks a lot for your inputs!
It's not hard to do it in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:
awk 'NR==5 {val=$3} NR>FNR {if (FNR==2) $1=val; print}' file2 file1
Explanation: read file2 first, and store the third field of the 5th record in the variable val (the first part: NR==5 {val=$3}). The third field is used because the first field of that line is the label AND01, so the second floating number is $3. Then read file1 and print every line, but replace the first field of the second record with the value stored in val (FNR is the record number within the current file, and NR is the total number of records read from all files so far, so NR>FNR is true only while reading file1).
In general, an awk program consists of pattern { actions } sequences. pattern is a condition under which a series of actions gets executed. $1..$NF are variables holding the field values, and each line (record) is split into fields on the field separator (FS variable, or -F'..' option), which defaults to a single space, meaning that fields are separated by runs of spaces and tabs.
The result (output):
14 4
0.65728944 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
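Since the numbers have to be located by position, it may help to pass the positions in as awk variables. The sketch below is one way to do that; the variable names src_line, src_field, dst_line and dst_field are invented for this example, and the sample files are shortened copies of the ones in the question. Note that src_field is 3 here because the first field on line 5 of file2 is the label AND01, so the second floating number is the third field.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > file1 <<'EOF'
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
EOF
cat > file2 <<'EOF'
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
EOF

# First file on the command line (file2): remember the source value.
# Second file (file1): overwrite the destination field, then print every line.
result=$(awk -v src_line=5 -v src_field=3 -v dst_line=2 -v dst_field=1 '
    NR == FNR { if (FNR == src_line) val = $src_field; next }
    FNR == dst_line { $dst_field = val }
    { print }
' file2 file1)
echo "$result"
```

Assigning to a field makes awk rebuild that line with single spaces between fields, so any column alignment on the modified line is lost.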

parse CSV, Group all rows containing string at 5th field, export each group of rows to file with filename <group>_someconstant.csv

Need this in bash.
In a linux directory, I will have a CSV file. Arbitrarily, this file will have 6 rows.
Main_Export.csv
1,2,3,4,8100_group1,6,7,8
1,2,3,4,8100_group1,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
I need to parse this file's 5th field (first four chars only) and take each row with 8100 (for example) and put those rows in a new file. Same with all other groups that exist, across the entire file.
Each new file can only contain the rows for its group (one file with the rows for 8100, one file for the rows with 3100, etc.)
Each filename needs to have that group# prepended to it.
The first four characters could be any numeric value, so I can't check these against a list - there are like 50 groups, and maintenance can't be done on this if a group # changes.
When parsing the fifth field, I only care about the first four characters
So we'd start with: Main_Export.csv and end up with four files:
Main_Export_$date.csv (unchanged)
8100_filenameconstant_$date.csv
3100_filenameconstant_$date.csv
5400_filenameconstant_$date.csv
I'm not sure about the rules of the site, or whether I have to try this for myself first and then post. I'll come back once I have an idea, but I'm at a total loss. Reading up on awk right now.
If I have understood your problem correctly, this is very easy...
You can just:
$ awk -F, '{fifth=substr($5, 1, 4) ; print > (fifth "_mysuffix.csv")}' file.csv
or just:
$ awk -F, '{print > (substr($5, 1, 4) "_mysuffix.csv")}' file.csv
And you will get several files like:
$ cat 3100_mysuffix.csv
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
or...
$ cat 5400_mysuffix.csv
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
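To get closer to the file names in the question, the suffix can be built in the shell and passed to awk. This is only a sketch: the date format %Y%m%d is an assumption, and filenameconstant is the placeholder from the question. Because some awk implementations limit how many files can be open at once, and the question mentions about 50 groups, the sketch appends and closes the file after every line, which is slower but avoids that limit.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > Main_Export.csv <<'EOF'
1,2,3,4,8100_group1,6,7,8
1,2,3,4,8100_group1,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
EOF

stamp=$(date +%Y%m%d)   # assumed date format

awk -F, -v suffix="_filenameconstant_${stamp}.csv" '
{
    out = substr($5, 1, 4) suffix   # first four characters of field 5
    print >> out                    # append, so reopening keeps earlier rows
    close(out)                      # release the file handle immediately
}' Main_Export.csv

ls
```

Because the output is appended, running the command twice on the same day doubles the rows, so old output files should be removed before a re-run.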

Delete lines in a file based on first row

I am trying to work on a whole series of txt files (actually .out files, but they behave like space-delimited txt files). I want to delete certain lines in the text, based on the value in a column named in the first row.
So for example:
ID VAR1 VAR2
1 8 9
2 4 1
3 3 2
I want to delete all the lines with VAR1 < 0.5.
I found a way to do this manually in Excel, but with 350+ files this is going to be a long night, and there are surely more efficient ways to do this. I have already worked on this set of files in the terminal (OSX).
This is a typical job for awk, the venerable language for file manipulation.
What awk does is match each line in a file against a condition, and run an action for it. It also allows for easy elementary parsing of line columns. In this case, you want to test whether the second column is less than 0.5, and if so not print that line. Otherwise, print the line (in effect this removes the lines for which the variable is less than 0.5).
Your variable is in column 2, which in awk is referred to as $2. Each full line is referred to by the variable $0.
So you would do something like this:
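# Keep the header (line 1) and every row whose second column is at least 0.5;
# rows with VAR1 < 0.5 are not printed.
NR == 1 || $2 >= 0.5 {
    print $0
}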
Or something like that; I haven't used awk for a while. The code above is an awk script. Apply it to your file and redirect the output to a new file, which will then contain only the lines where the second column is not less than 0.5.
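With 350+ files the same filter can be run in a shell loop. The sketch below assumes every file has a single header line and VAR1 in the second column; the file names run1.out and run2.out, their contents, and the output directory filtered are made up for the demonstration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical sample inputs.
printf 'ID VAR1 VAR2\n1 8 9\n2 0.2 1\n3 3 2\n' > run1.out
printf 'ID VAR1 VAR2\n1 0.4 5\n2 0.5 6\n' > run2.out

# Write results to a separate directory so the originals stay untouched
# and a second run does not pick up its own output.
mkdir -p filtered
for f in *.out; do
    # Keep the header line and every row whose second column is >= 0.5.
    awk 'NR == 1 || $2 >= 0.5' "$f" > "filtered/$f"
done

cat filtered/run1.out
```

awk is started once per file here, so NR == 1 matches the header of each file.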
