Extract a range of rows, with overlap, using sed - macOS

I have a (dummy) file that looks like this:
header
1
2
3
4
5
6
7
8
9
10
And I need a command that would give me separate files made of blocks of four rows, with two overlapping rows between consecutive blocks. So I would have something like this:
1
2
3
4
3
4
5
6
5
6
7
8
7
8
9
10
So here is what I've got (it is not much, sorry):
tail -n +2 file | sed -n '1,4p' > window1.txt
But I don't know how to apply this over the whole file, with an overlap.
Thanks in advance.

This might work for you (GNU sed and split):
sed -nr '1{N;N;N};:a;p;$q;s/^.*\n.*\n(.*\n.*)$/\1/;N;N;ba' file | split -dl4
EDIT:
To make this programmable use:
sed -nr ':a;$!{N;s/[^\n]+/&/4;Ta};p;$q;s/.*((\n[^\n]*){2})$/\1/;D' file |
split -dl4 file-name-prefix
Here 4 is the number of lines per file and 2 is the number of overlapping lines.
file-name-prefix is your chosen file name prefix, to which split appends numbers (see man split).
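Since the question mentions macOS, whose stock sed and split are BSD versions lacking the GNU extensions used above, a portable awk alternative may be easier. This is only a sketch under stated assumptions: block size n=4 and overlap o=2 as in the example, output names windowN.txt invented here, and the whole input buffered in memory:
tail -n +2 file | awk -v n=4 -v o=2 '
{ buf[NR] = $0 }                        # buffer every data line
END {
  step = n - o                          # advance by block size minus overlap
  for (start = 1; start <= NR; start += step) {
    out = sprintf("window%d.txt", ++w)  # window1.txt, window2.txt, ...
    end = start + n - 1
    if (end > NR) end = NR              # clip the final block
    for (i = start; i <= end; i++) print buf[i] > out
    close(out)
    if (end == NR) break                # stop once the last line is written
  }
}'
The tail -n +2 strips the header line, as in the attempt in the question.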

Related

How do I delete a row from a CSV in bash using sed or awk if a string exists in two columns?

I'm a beginner-to-mid-level bash scripter, and I'm not very familiar with working with CSV files in the terminal.
From the hours of research I've spent on this, I'm guessing sed or awk will be my best bet; I'm just not certain of the best way to accomplish this.
The CSV is as follows:
Owner,id,permission.deleted,permission.displayName,permission.domain,permission.emailAddress,permission.id,permission.photoLink,permission.role,permission.type
owner@domain.com,some_file_id,False,Display Name,domain.com,writer@domain.com,permissionidnumber,,writer,user
owner@domain.com,some_file_id,False,Display Name,domain.com,owner@domain.com,permissionidnumber,url,owner,user
My goal is to remove any lines where the owner is granted permissions from the original csv.
Ideally, I'd like something along the lines of "If Column A (Owner) matches Column F (permission.emailAddress), delete the line"
Desired Output - Replace existing CSV with:
Owner,id,permission.deleted,permission.displayName,permission.domain,permission.emailAddress,permission.id,permission.photoLink,permission.role,permission.type
owner@domain.com,some_file_id,False,Display Name,domain.com,writer@domain.com,permissionidnumber,,writer,user
The command I'm running needs to read the permissions from the CSV, and I'm removing the owner row since they retain ownership anyway; if I try to grant permissions to them again, they receive an email, and I'm trying to avoid spamming my users.
If I can't match two columns within the CSV and delete the line from there, I can probably grab the owner@domain.com address and set it as a variable to use, if that's easier. I just have to run this against ~100 unique users, so the more I can automate, the better.
Using any awk in any shell on every Unix box, the following will execute orders of magnitude faster than a shell read loop, with far simpler and briefer code:
awk -F, '$1 != $6' file
For example:
$ awk -F, '$1 != $6' file
Owner,id,permission.deleted,permission.displayName,permission.domain,permission.emailAddress,permission.id,permission.photoLink,permission.role,permission.type
owner@domain.com,some_file_id,False,Display Name,domain.com,writer@domain.com,permissionidnumber,,writer,user
To modify the original file with GNU awk use:
awk -i inplace -F, '$1!=$6' file
or with any awk:
awk -F, '$1!=$6' file > tmp && mv tmp file
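Since the question mentions running this against roughly 100 users, a plain shell loop over per-user CSVs combines naturally with the portable rewrite idiom. A sketch only; the user_*.csv naming scheme is a hypothetical, not from the question:
for f in user_*.csv; do                               # hypothetical per-user file names
  awk -F, '$1 != $6' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done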
Maybe awk.
$: cat x
1 2 3 4 5 6 7
3 4 3 4 5 3 7
5 6 3 4 5 6 7
7 8 3 4 5 6 7
9 0 3 4 5 9 7
awk '$1 == $6 { next } 1' x
1 2 3 4 5 6 7
5 6 3 4 5 6 7
7 8 3 4 5 6 7

Bash: Quickly find diff of two files of second column only if the first column is the same

I have two large files, around 7 GB each. I would like to find the lines where the second column differs, but only when the value in the first column is the same in both files. The two files are sorted but can have different numbers of lines.
The first file looks like this: (1.txt)
5 5
6 6
7 7
8 8
9 9
The second file looks like this: (2.txt):
3 3
4 4
5 5
6 6
7 4
8 4
9 9
The output should look like this:
7 4
8 4
Right now I have this one-liner, but I am not sure if it can go faster:
mawk 'NR==FNR{a[$1]=$2; next} ($1 in a) && a[$1]!=$2' 1.txt 2.txt
If the files are sorted on the join key, the easiest (and fastest) approach is:
$ join file1 file2 | awk '$2!=$3{print $1,$3}'
7 4
8 4
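For reference, the intermediate output of join on the sample files pairs each matching key with both second columns; keys 3 and 4, which appear only in 2.txt, are dropped:
$ join 1.txt 2.txt
5 5 5
6 6 6
7 7 4
8 8 4
9 9 9
Because join streams two sorted files, memory use stays constant regardless of file size, whereas the awk array approach must hold all of 1.txt in memory.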

Extract blocks of lines with sed

How would one use sed to extract n lines from a file every m lines?
Say my textfile looks like this:
myfile.dat:
1
2
3
4
5
6
7
8
9
10
Say that I want to extract blocks of three lines and then skip two lines, throughout the entire file, so that my output looks like this:
output.dat:
1
2
3
6
7
8
Any suggestions on how one could achieve this with sed?
Edit:
For my example I could just have used
sed -n 'p;n;p;n;p;n;n' myfile.dat > output.dat
or with GNU sed (not preferred due to portability)
sed '1~5b;2~5b;3~5b;d' myfile.dat > output.dat
However, I typically want to print blocks of 2450 lines from a file with 49,002,450 lines, such that my output file contains 247,450 lines.
This might work for you (GNU sed):
sed -n '1~5,+2p' file
Starting at line 1, print every fifth line (1, 6, 11, ...) and the two lines following each.
An alternative:
sed -n 'N;N;p;n;n' file
In your case the below would work. It uses a range pattern: printing starts on lines where NR%5 is 1 and stops on lines where NR%5 is 3:
awk 'NR%5==1, NR%5==3' myfile.dat
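For the larger case mentioned in the edit, writing the commands out by hand is impractical. A parameterized sketch that prints the first n lines of every m-line cycle, shown here with the example's values n=3, m=5 (for the real file you would set n=2450 and m to the full cycle length):
awk -v n=3 -v m=5 '(NR - 1) % m < n' myfile.dat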

Linux command to remove lines containing a duplicated value in a text file?

If I have a text file with the following form
1 1
1 3
3 4
2 2
5 7
...
Is there a Linux command that can give me the following result?
1 3
3 4
5 7
...
So, I want to delete the lines 1 1 and 2 2.
Yes, you can use something like:
awk '$1!=$2{print}' inputfilename
or the slightly less verbose (thanks to ooga):
awk '$1!=$2' inputfilename
which uses the "missing action means print" feature of awk.
Both these awk commands print lines where the columns don't match, and throw away everything else.
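For clarity, a bare pattern with no action is exactly equivalent to spelling out the default action:
awk '$1!=$2 { print $0 }' inputfilename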

Using grep to split a line, then search for lines with a number greater than 3

Let's say I have a file like:
thing1(space)thing2(space)thing3(space)thing4
E.g.
1 apple 3 4
3 banana 3 8
3 pear 11 12
13 cheeto 15 16
Can I only show lines where thing3 is greater than 3? (i.e. pear and cheeto)
I can easily do this in python, but can we do this in the shell? Maybe with awk? I'm still researching this.
You can do that easily with awk, if that is an option available to you:
awk '$3>3' inputFile
$ cat file
1 apple 3 4
3 banana 3 8
3 pear 11 12
13 cheeto 15 16
$ awk '$3>3' file
3 pear 11 12
13 cheeto 15 16
awk by default splits each line into fields delimited by whitespace and assigns them to variables that can be referenced by column number ($1, $2, and so on). In your case you reference the third field as $3.
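To see the field splitting at work, printing only the third field of the sample file gives:
$ awk '{ print $3 }' file
3
3
11
15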
