I'm automating a workflow with a bash script on Mac OS X. In this workflow, I'd like to add a command that deletes a header from my tab-delimited table (.txt) file, which looks as follows:
header1 header2 header3
a 1
b 2
c 3
d 4
e 5
f 6
As you can see, the third column, named header3, is empty.
I've looked at related posts, but I don't understand the arguments.
Could you suggest a line of code that automatically deletes the third column, or (even better) deletes the header called 'header3'?
awk is designed to work with whitespace-separated text columns:
awk '{print $1 "\t" $2}' input.txt > output.txt
I found the answer in Table 2C of a reference:
sed 's/header3//g' input.txt > output.txt
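Note that this deletes the literal text header3 wherever it appears, but leaves the tab in front of it. A variant restricted to the header row that also removes the preceding tab (a sketch relying on bash's $'…' quoting to get a literal tab into the script, since BSD sed on Mac OS X does not understand \t):
sed $'1s/\theader3$//' input.txt > output.txt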
My problem is the following: I have multiple tab-separated files (A, B, C and D), each containing 40 columns, of which the first 10 are always the same (all files also have the same number of rows). To have one file instead of four separate ones, I want to create a new file that contains the first 10 columns once (they are the same in all files), followed by column 25 of each of files A, B, C and D, since I'm not interested in the other columns.
So my output file should look like this:
column_1 column_2 column_3 .... column_9 column_10 column_25_A column_25_B column_25_C column_25_D
So far I was able to create a new file containing column_1 to column_10 using the following command:
awk -v FS='\t' -v OFS='\t' '{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' file_A.txt > output_file.txt
However, I cannot manage to now append the desired columns from the other files. I've tried the paste command as well as this one:
awk -v FS='\t' -v OFS='\t' '{print $25}' file_A.txt >> output_file
The above command, however, only shows me the correct column when I omit the redirection; with >> the column is appended below the existing lines rather than beside them.
What do I have to do in order to append the desired columns from one file to another using awk? Or is this not possible?
untested
$ paste <(cut -f1-10,25 fileA) <(cut -f25 fileB) <(cut -f25 fileC) <(cut -f25 fileD)
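Each <(…) is a bash process substitution that feeds paste as if it were a file: the first cut extracts the ten shared columns plus column 25 from fileA, the other three pull column 25 from the remaining files, and paste glues the four streams together line by line with tabs. This assumes all files have the same number of rows, which they do here.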
I have a large file A (consisting of emails), one line for each mail. I also have another file B that contains another set of mails.
Which command would I use to remove all the addresses that appear in file B from file A?
So, if file A contained:
A
B
C
and file B contained:
B
D
E
Then file A should be left with:
A
C
Now I know this is a question that might have been asked more often, but I only found one command online that gave me an error with a bad delimiter.
Any help would be much appreciated! Somebody will surely come up with a clever one-liner, but I'm not the shell expert.
If the files are sorted (they are in your example):
comm -23 file1 file2
-2 suppresses lines that appear only in file2 and -3 suppresses lines that appear in both files, leaving just the lines unique to file1. If the files are not sorted, pipe them through sort first...
See man comm for details.
grep -Fvxf <lines-to-remove> <all-lines>
works on non-sorted files (unlike comm)
maintains the order
is POSIX
Example:
cat <<EOF > A
b
1
a
0
01
b
1
EOF
cat <<EOF > B
0
1
EOF
grep -Fvxf B A
Output:
b
a
01
b
Explanation:
-F: use literal strings instead of the default BRE
-x: only consider matches that match the entire line
-v: print non-matching
-f file: take patterns from the given file
This method is slower on pre-sorted files than other methods, since it is more general. If speed matters as well, see: Fast way of finding lines in one file that are not in another?
Here's a quick bash function for in-place operation:
remove-lines() (
remove_lines="$1"
all_lines="$2"
tmp_file="$(mktemp)"
grep -Fvxf "$remove_lines" "$all_lines" > "$tmp_file"
mv "$tmp_file" "$all_lines"
)
usage:
remove-lines lines-to-remove remove-from-this-file
See also: https://unix.stackexchange.com/questions/28158/is-there-a-tool-to-get-the-lines-in-one-file-that-are-not-in-another
awk to the rescue!
This solution doesn't require sorted inputs. You have to provide fileB first.
awk 'NR==FNR{a[$0];next} !($0 in a)' fileB fileA
returns
A
C
How does it work?
The NR==FNR{a[$0];next} idiom stores the first file's lines as keys of an associative array for a later "contains" test.
NR==FNR checks whether we're scanning the first file, where the global line counter (NR) equals the per-file line counter (FNR).
a[$0] adds the current line to the associative array as a key; note that this behaves like a set, where there won't be any duplicate keys.
!($0 in a) applies once we're in the next file(s): in is a containment test, here checking whether the current line is in the set we populated from the first file, and ! negates the condition. What is missing here is the action, which defaults to {print} and usually isn't written explicitly.
Note that this can now be used to remove blacklisted words.
$ awk '...' badwords allwords > goodwords
With a slight change it can clean multiple lists and create cleaned versions:
$ awk 'NR==FNR{a[$0];next} !($0 in a){print > (FILENAME".clean")}' bad file1 file2 file3 ...
Another way to do the same thing (also requires sorted input):
join -v 1 fileA fileB
In Bash, if the files are not pre-sorted:
join -v 1 <(sort fileA) <(sort fileB)
You can also do this with diff, even if your files are not sorted:
diff file-a file-b --new-line-format="" --old-line-format="%L" --unchanged-line-format="" > file-a.new && mv file-a.new file-a
(Write to a temporary file: redirecting straight back to file-a would truncate it before diff gets to read it.)
--new-line-format is for lines that are in file b but not in a
--old-.. is for lines that are in file a but not in b
--unchanged-.. is for lines that are in both.
%L makes it so the line is printed exactly.
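Note that the --new/--old/--unchanged-line-format options are GNU diff extensions; the stock BSD diff shipped with OS X does not support them.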
See man diff for more details.
This refinement of karakfa's nice answer may be noticeably faster for very large files. As with that answer, neither file need be sorted, but speed is assured by virtue of awk's associative arrays. Only the lookup file is held in memory.
This formulation also allows for the possibility that only one particular field ($N) in the input file is to be used in the comparison.
# Print lines in the input unless the value in column $N
# appears in a lookup file, $LOOKUP;
# if $N is 0, then the entire line is used for comparison.
awk -v N="$N" -v lookup="$LOOKUP" '
BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0] = $0 } }
!($N in dictionary) { print }'
(Another advantage of this approach is that it is easy to modify the comparison criterion, e.g. to trim leading and trailing white space.)
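A hypothetical invocation (file names assumed for illustration), dropping rows of data.txt whose second field appears as a line in blacklist.txt:
awk -v N=2 -v lookup="blacklist.txt" '
BEGIN { while ( (getline < lookup) > 0 ) { dictionary[$0] = $0 } }   # keys are whole lines of the lookup file
!($N in dictionary) { print }' data.txt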
You can use Python:
python -c '
lines_to_remove = set()
with open("file B", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file A", "r") as f:
    for line in [line.strip() for line in f.readlines()]:
        if line not in lines_to_remove:
            print(line)
'
You can use:
diff fileA fileB | grep '^<' | cut -c3- > fileA.new && mv fileA.new fileA
Lines prefixed with < in the diff output are the ones present only in fileA; cut strips the prefix. This works for files that are not sorted as well, as long as matching lines appear in the same relative order. (Again, write to a temporary file first: redirecting straight to fileA would truncate it before diff reads it.)
Just to add to the Python answer above, here is a faster solution:
python -c '
lines_to_remove = None
with open("partial file") as f:
    lines_to_remove = {line.rstrip() for line in f.readlines()}
remaining_lines = None
with open("full file") as f:
    remaining_lines = {line.rstrip() for line in f.readlines()} - lines_to_remove
with open("output file", "w") as f:
    for line in remaining_lines:
        f.write(line + "\n")
'
Raising the power of set subtraction. (Note that sets drop duplicate lines and do not preserve their order.)
To get the file that remains after removing the lines that appear in another file:
comm -23 <(sort bigFile.txt) <(sort smallfile.txt) > diff.txt
Here is a one-liner that dumps a website with lynx and removes the navigation elements with grep! You can replace lynx with cat FileA and unwanted-elements.txt with FileB.
lynx -dump -accept_all_cookies -nolist -width 1000 https://stackoverflow.com/ | grep -Fxvf unwanted-elements.txt
To remove common lines between two files you can use the grep, comm or join commands.
grep is only practical for small files, since every line of file2 becomes a separate pattern that is matched against each line of file1. Use -v along with -f.
grep -vf file2 file1
This displays lines from file1 that do not match any line in file2. (Without -F and -x, the lines of file2 are treated as regular expressions and can also match substrings.)
comm is a utility command that works on lexically sorted files. It
takes two files as input and produces three text columns as output:
lines only in the first file; lines only in the second file; and lines
in both files. You can suppress printing of any column by using the -1, -2 or -3 options accordingly.
comm -1 -3 file2 file1
This displays lines from file1 that do not match any line in file2.
Finally, there is join, a utility command that performs an equality join on the specified files (it, too, expects sorted input). Its -v option also allows removing common lines between two files.
join -v1 -v2 file1 file2
I have multiple (1086) files (.dat) and in each file I have 5 columns and 6384 lines.
I have a single file named "info.txt" which contains 2 columns and 6883 lines. First column gives the line numbers (to delete in .dat files) and 2nd column gives a number.
1 600
2 100
3 210
4 1200
etc...
I need to read in info.txt, find every line number whose value in the 2nd column is less than 300 (so 2 and 3 in the above example). Then I need to feed these line numbers to sed, awk, or grep and delete those lines from each .dat file. (So I would delete the 2nd and 3rd row of every .dat file in the above example.)
A more general form of the question would be (I suppose):
How do I read numbers from a file, then use them as the numbers of the rows to be deleted from multiple files?
I am using bash but ksh help is also fine.
sed -i "$(awk '$2 < 300 { print $1 "d" }' info.txt)" *.dat
The awk script creates a simple sed script to delete the selected lines; that script is then run on all the *.dat files.
(If your sed lacks the -i option, you will need to write to a temporary file in a loop. On OSX and some *BSD you need -i "" with an empty argument.)
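With the sample info.txt above, the command substitution expands to this two-line sed script (rows 1 and 4 have second-column values of 300 or more, so only rows 2 and 3 qualify):
2d
3d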
This might work for you (GNU sed):
sed -rn 's/^(\S+)\s*([1-9]|[1-9][0-9]|[12][0-9][0-9])$/\1d/p' info.txt |
sed -i -f - *.dat
This builds a script of the lines to delete from the info.txt file and then applies it to the .dat files.
N.B. the regexp is for numbers ranging from 1 to 299 as per OP request.
# create action list
while read LineRef Index
do
if [ "${Index}" -lt 300 ]
then
ActionReq="${ActionReq}${LineRef} b
"
fi
done < info.txt
# apply action on files
for EachFile in YourListSelectionOf.dat
do
sed -i -n -e "${ActionReq}
p" "${EachFile}"
done
(Not tested, no Linux here.) The limitation is that sed itself cannot test whether the second value is below 300, which is why the shell loop builds the action list; awk is more efficient at this kind of operation.
I use sed in the second loop to avoid reading/writing each file once per line to delete. The second loop could probably be avoided by giving sed the whole list of files at once instead of going file by file.
This should create new .dat files with _new.dat appended to the original name, but I haven't tested it:
awk 'FNR==NR{if($2<300)a[$1]=$1;next}
!(FNR in a){print > (FILENAME"_new.dat")}' info.txt *.dat
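One caveat: with 1086 input files this creates 1086 output files, and some awk implementations limit the number of simultaneously open files; GNU awk juggles descriptors itself, but in other awks you may need to close() each _new.dat file once awk moves on to the next input.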
I have 2 files. One, file1.txt, contains this:
632121S0 126.78.202.250 1
131145S0 126.178.20.250 1
The other one, file2.txt, contains this:
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
313359S2 126.137.37.250 OBS
I want to end up with a third file which contains :
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
Only the lines which start with the same string in both files. I can't remember how to do it. I tried several grep, egrep and find commands, but I still cannot use them properly...
Can you help please ?
You can use this awk:
$ awk 'FNR==NR {a[$1]; next} $1 in a' f1 f2
632121S0 126.78.202.250 OBS
131145S0 126.178.20.250 OBS
It is based on the idea of two-file processing, looping through the files like this:
first, loop through the first file, storing the first field in the array a;
then loop through the second file, checking if its first field is in the array a. If it is, the line is printed.
To do this with grep, you need to use a process substitution:
grep -f <(cut -d' ' -f1 file1.txt) file2.txt
grep -f uses a file as a list of patterns to search for within file2. In this case, instead of passing file1 unaltered, process substitution is used to output only the first column of the file.
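One caveat: the patterns are unanchored regular expressions, so a key could in principle match in the middle of a line, and a dot would match any character. A stricter sketch that escapes the dots and anchors each key to the start of the line, followed by a space:
grep -f <(cut -d' ' -f1 file1.txt | sed 's/[.]/\\./g; s/^/^/; s/$/ /') file2.txt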
If you have a lot of these lines, then the utility join would likely be useful.
join - join lines of two files on a common field
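For this question's data, a sketch (assuming space-separated input sorted on the join field, which is field 1 by default; the -o list picks which fields to print, here the three fields of file2):
join -o 2.1,2.2,2.3 <(sort file1.txt) <(sort file2.txt)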
I have a file that contains some information spanning multiple lines. In order for certain other bash scripts I have to work property, I need this information to all be on a single line. However, I obviously don't want to remove all newlines in the file.
What I want to do is replace newlines, but only between all pairs of STARTINGTOKEN and ENDINGTOKEN, where these two tokens are always on different lines (but never get jumbled up together, it's impossible for instance to have two STARTINGTOKENs in a row before an ENDINGTOKEN).
I found that I can remove newlines with
tr "\n" " "
and I also found that I can match patterns over multiple lines with
sed -e '/STARTINGTOKEN/,/ENDINGTOKEN/!d'
However, I can't figure out how to combine these operations while leaving the remainder of the file untouched.
Any suggestions?
Are you looking for this?
awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
example:
kent$ cat file
foo
bar
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm
5
6
7
nnn ENDINGTOKEN
8
9
kent$ awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
foo
bar
STARTINGTOKEN xx12ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm567nnn ENDINGTOKEN
8
9
This seems to work:
sed -e '/STARTINGTOKEN/{ :next ; /ENDINGTOKEN/!{N;b next;}; s/\n//g;}' "yourfile"
Once it finds the starting token it loops, picking up lines until it finds the ending token, then removes all the embedded newlines and prints the joined line; everything outside a token pair passes through untouched. Then it repeats.
Using awk:
awk '$0 ~ /STARTINGTOKEN/ || l {l=sprintf("%s%s", l, $0); if ($0 ~ /ENDINGTOKEN/) {print l; l=""}; next}
{print}' input.file
(Lines outside the token pairs are printed unchanged.)
This might work for you (GNU sed):
sed '/STARTINGTOKEN/!b;:a;$bb;N;/ENDINGTOKEN/!ba;:b;s/\n//g' file
or:
sed -r '/(START|END)INGTOKEN/,//{/STARTINGTOKEN/{h;d};H;/ENDINGTOKEN/{x;s/\n//gp};d}' file
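In the second version, the first line of each token pair is copied to the hold space (h) and deleted (d); subsequent lines are appended to the hold space (H); on the ending token the accumulated block is swapped back into the pattern space (x), its newlines are removed, and the result is printed (s/\n//gp), while the final d suppresses the default printing for everything inside the range.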