Use awk to create index of words from file - shell

I'm learning UNIX for school and I'm supposed to create a command line that takes a text file and generates a dictionary index showing the words (excluding articles and prepositions) and the lines where each one appears in the file.
I found a problem similar to mine at https://unix.stackexchange.com/questions/169159/how-do-i-use-awk-to-create-an-index-of-words-in-file?newreg=a75eebee28fb4a3eadeef5a53c74b9a8 The problem is that when I run the solution
$ awk '
{
gsub(/[^[:alpha:] ]/,"");
for(i=1;i<=NF;i++) {
a[$i] = a[$i] ? a[$i]", "FNR : FNR;
}
}
END {
for (i in a) {
print i": "a[i];
}
}' file | sort
The output contains special characters (which I don't want) like:
-Quiero: 21
Sancho,: 2, 4, 8
How can I remove all the special characters and exclude articles and prepositions?

$ echo This is this test. | # some test text
awk '
BEGIN{
x["a"];x["an"];x["the"];x["on"] # the stop words
OFS=", " # list separator to a
}
{
for(i=1;i<=NF;i++) # list words in a line
if($i in x==0) { # if word is not a stop word
$i=tolower($i) # lowercase it
gsub(/^[^a-z]|[^a-z]$/,"",$i) # remove leading and trailing non-alphabets
a[$i]=a[$i] (a[$i]==""?"":OFS) NR # add record number to list
}
}
END { # after file is processed
for(i in a) # in no particular order
print i ": " a[i] # ... print elements in a
}'
this: 1, 1
test: 1
is: 1
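For your actual file you would do the same thing: list the articles and prepositions you want excluded as the stop words and pipe the result through sort. Below is a sketch along those lines; the stop-word list is only an example (extend it for your text), file is your input, and unlike the snippet above it lowercases and strips punctuation before the stop-word test, so forms such as "De" or "en," are still excluded:
awk '
BEGIN {
    # example stop words: a few Spanish articles and prepositions (extend as needed)
    split("el la los las un una unos unas de en a con por para", stop)
    for (i in stop) x[stop[i]]
    OFS = ", "
}
{
    for (i = 1; i <= NF; i++) {
        w = tolower($i)                              # lowercase first
        gsub(/^[^[:alpha:]]+|[^[:alpha:]]+$/, "", w) # strip leading/trailing punctuation
        if (w != "" && !(w in x))                    # skip empty tokens and stop words
            a[w] = a[w] (a[w] == "" ? "" : OFS) FNR  # append the line number
    }
}
END {
    for (w in a) print w ": " a[w]
}' file | sort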

Related

Editing text in Bash

I am trying to edit text in Bash. I got to a point where I am no longer able to continue and I need help.
The text I need to edit:
Symbol Name Sector Market Cap, $K Last Links
AAPL
Apple Inc
Computers and Technology
2,006,722,560
118.03
AMGN
Amgen Inc
Medical
132,594,808
227.76
AXP
American Express Company
Finance
91,986,280
114.24
BA
Boeing Company
Aerospace
114,768,960
203.30
The text I need:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
I already tried:
sed 's/$/,/' BIPSukol.txt > BIPSukol1.txt | awk 'NR==1{print}' BIPSukol1.txt | awk '(NR-1)%5{printf "%s ", $0;next;}1' BIPSukol1.txt | sed 's/.$//'
But it doesn't quite do the job.
(BIPSukol1.txt is the name of the file I am editing.)
The biggest problem you have is that you do not have consistent delimiters between your fields. Some have commas, some don't, and some are just a combination of three fields that happen to run together.
The tool you want is awk. It will allow you to treat the first line differently and then condition the output that follows with convenient counters you keep within the script. In awk you write rules (what comes between the outer {...}) and awk applies your rules in the order they are written. This allows you to "fix up" your haphazard format and arrive at the desired output.
The first rule, FNR==1, is applied to the 1st line only. It loops over the fields, finds the problematic "Market Cap $K" heading, treats it as a single field, and skips beyond it to output the remaining headings. It stores n = NF - 3 (NF is 8 for your header, so n is 5, the number of data lines per Symbol), zeroes count, and skips to the next record.
When count==n, the next rule is triggered; it outputs the records stored in the a[] array, zeroes count, and deletes the a[] array for refilling.
The next rule is applied to every record (line) of input from the 2nd on. It simply removes any extra whitespace from the fields by forcing awk to recalculate them with $1 = $1, and then stores the record in the array, incrementing count.
The last rule, END, is a special rule that runs after all records are processed (it lets you sum final tallies or output final lines of data). Here it is used to output the records that remain in a[] when the end of the file is reached.
Putting it all together:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
for (i=1;i<=n;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
delete a
count = 0
}
{
$1 = $1
a[++count] = $0
}
END {
for (i=1;i<=count;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
}
' file
Example Use/Output
Note: you can simply select-copy the script above and middle-mouse-paste it into an xterm whose working directory contains file (you will need to change file to whatever your input filename is).
$ awk '
> FNR==1 {
> for (i=1;i<=NF;i++)
> if ($i == "Market") {
> printf ",Market Cap $K"
> i = i + 2
> }
> else
> printf (i>1?",%s":"%s"), $i
> print ""
> n = NF-3
> count = 0
> next
> }
> count==n {
> for (i=1;i<=n;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> delete a
> count = 0
> }
> {
> $1 = $1
> a[++count] = $0
> }
> END {
> for (i=1;i<=count;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> }
> ' file
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
(note: it is unclear why you want the "Links" heading included since there is no information for that field -- but that is how your desired output is specified)
More Efficient No Array
You always have afterthoughts that creep in after you post an answer, no different than remembering a better way to answer a question as you are walking out of an exam, or thinking about the one additional question you wish you had asked after you excuse a witness or rest your case at trial. (There was some song that captured it: a little bit ironic.)
The following does essentially the same thing, but without using arrays. Instead it simply formats and outputs the information as it goes, rather than buffering it in an array to output all at once. It was one of those afterthoughts:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
print ""
count = 0
}
{
$1 = $1
printf (++count>1?",%s":"%s"), $0
}
END { print "" }
' file
(same output)
With your shown samples, could you please try the following (written and tested in GNU awk). Judging by your attempts, after the header of Input_file you want to join every 5 lines into a single line.
awk '
BEGIN{
OFS=","
}
FNR==1{
NF--
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
OR if your awk doesn't support NF-- then try the following.
awk '
BEGIN{
OFS=","
}
FNR==1{
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +Links( +)?$/,"",lastPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
NOTE: Your header/first line needs special handling because we can't simply turn every run of spaces into a comma there, so this solution takes care of it separately, as per the shown samples.
With GNU awk, if your first line is always the same:
echo 'Symbol,Name,Sector,Market Cap $K,Last,Links'
awk 'NR>1 && NF=5' RS='\n ' ORS='\n' FS='\n' OFS=',' file
Output:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
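Spelled out with comments, that one-liner reads roughly as follows. This is only a sketch: it assumes, as the RS setting implies, that in the real input each symbol's block of lines starts with a space-indented line, so that a newline followed by a space marks the start of a new record:
awk '
BEGIN {
    RS = "\n "   # record separator: newline followed by a space, i.e. one record per block
    FS = "\n"    # each line inside a block becomes a field
    OFS = ","    # fields are re-joined with commas when the record is rebuilt
    ORS = "\n"   # records are printed one per line
}
NR > 1 && NF = 5 # skip the header record; NF=5 keeps five fields, rebuilds $0 with OFS, and is nonzero, so the record prints
' file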

Extract common lines from multiple text files and display original line numbers

What I want:
Extract the common lines from n large files.
Append the original line numbers from each file.
Example:
File1.txt has the following content
apple
banana
cat
File2.txt has the following content
boy
girl
banana
apple
File3.txt has the following content
foo
apple
bar
The output should be a different file
1 3 2 apple
1, 3 and 2 in the output are the original line numbers in File1.txt, File2.txt and File3.txt where the common line apple appears.
I have tried using grep -nf File1.txt File2.txt File3.txt, but it returns
File2.txt:3:apple
File3.txt:2:apple
Associate each unique line, in an array, with a space-separated list of the line numbers at which it is seen in each file, and at the end print that list next to the line if the line is found in all three files.
awk '{
n[$0] = n[$0] FNR OFS
c[$0]++
}
END {
for (r in c)
if (c[r] == 3)
print n[r] r
}' file1 file2 file3
If the number of files is unknown, refer to Ravinder's answer, or just replace the hardcoded 3 in the END block with ARGC-1 as shown there.
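For example, keeping the same rules and only generalizing the END block (ARGC-1 is the number of file operands, since ARGC also counts the program name):
END {
    for (r in c)
        if (c[r] == ARGC - 1)
            print n[r] r
}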
GNU awk specific approach that works with any number of files:
#!/usr/bin/gawk -f
BEGINFILE {
nfiles++
}
{
lines[$0][nfiles] = FNR
}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (line in lines) {
if (length(lines[line]) == nfiles) {
for (file = 1; file <= nfiles; file++)
printf "%d\t", lines[line][file]
print line
}
}
}
Example:
$ ./showlines file[123].txt
1 3 2 apple
Could you please try the following, written and tested with GNU awk. One can make use of the ARGC value, which gives the total number of arguments passed to the awk program (the program name plus the file names), so ARGC-1 is the number of files.
awk '
{
a[$0]=(a[$0]?a[$0] OFS:"")FNR
count[$0]++
}
END{
for(i in count){
if(count[i]==(ARGC-1)){
print i,a[i]
}
}
}
' file1.txt file2.txt file3.txt
A Perl solution:
perl -ne '
$h{$_} .= "$.\t"; # append current line number and tab character to value in a hash with key current line
$. = 0 if eof; # reset line number when end of file is reached
END{
while ( ($k,$v) = each %h ) { # loop over hash entries
if ( $v =~ y/\t// == 3 ) { # if value contains 3 tabs
print $v.$k # print value concatenated with key
}
}
}' file1.txt file2.txt file3.txt

Format output if column 2 field has more than one value

I have data which is colon-delimited, as shown:
Joe:23;23;56:zz
Jim:44;44:cz
Rob:45;98:fc
If column 2 has more than one value, those lines need to print separately.
Duplicates should be removed so that only unique values print.
I tried this to remove duplicates:
sort -u -t : -k 2,2 file_name
Output:
Joe:23;23;56:zz
Jim:44;44:cz
Rob:45;98:fc
Desired Output:
Jim:44:cz
The lines below need to print separately because column 2 has more than one value (or we can append this output to another file):
Joe:23;56:zz
Rob:45;98:fc
Could you please try the following. This will create 2 output files: one with the lines whose 2nd column ends up with a single value, and the other with the lines whose 2nd column has more than one value. The output file names will be out_file_two_cols and out_file_more_than_two_cols; you could change them as per your need.
awk '
BEGIN{
FS=OFS=":"
}
{
delete a
val=""
num=split($2,array,";")
for(j=1;j<=num;j++){
if(!a[array[j]]++){
val=(val?val ";":"")array[j]
}
}
$2=val
num=split($2,array,";")
}
num==1{
print > ("out_file_two_cols")
next
}
{
print > ("out_file_more_than_two_cols")
}
' Input_file
Explanation: setting the field separator and output field separator to : for all lines of Input_file in the BEGIN section. Then in the main section, deleting the array named a and emptying the variable val, both of which are used later in the program; clearing them avoids carrying their values over from the previous line.
Splitting the 2nd field into an array on the delimiter ; and storing the total number of elements in the variable num. Then running a for loop from 1 to num to traverse all the elements of the 2nd field.
If the current element is not yet present in array a, appending it to the variable val; this is repeated for all elements of the 2nd field, so val collects only the unique values.
Then assigning the value of val to the 2nd column. Splitting the new 2nd column again, so num now tells us how many unique elements it has.
Finally, if num is 1, meaning the edited 2nd field has only one element, printing the line to one output file, else printing it to the other output file.
Here's a little Ruby script:
#!/usr/bin/env ruby
input = File.new "file_name"
single = File.new "one_value", "w"
multiple = File.new "two_value", "w"
input.each do |line|
fields = line.split ":"
value = fields[1].split(";").uniq.sort
fields[1] = value.join ";"
new_line = fields.join ":"
if value.size == 1
single << new_line
else
multiple << new_line
end
end
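Run it from the directory containing file_name; given the sample data above, checking the two output files should show:
$ cat one_value
Jim:44:cz
$ cat two_value
Joe:23;56:zz
Rob:45;98:fc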
Another awk:
$ awk 'function join() {
s=sep="";
for(k in b) {s=s sep k; sep=";"}
return s}
BEGIN {FS=OFS=":"}
{n=split($2,a,";");
delete b;
for(i=1;i<=n;i++) b[a[i]]
$2=join()
if(length(b)==1) print;
else {multi=multi ORS $0}}
END {print "\nmultiple values:" multi}' file
Jim:44:cz
multiple values:
Joe:23;56:zz
Rob:45;98:fc
Source data:
$ cat colon.dat
Joe:23;23;56:zz
Jim:44;44:cz
Rob:45;98:fc
One awk solution:
awk -F":" ' # input field separator is colon
BEGIN { OFS=FS } # output field separator is colon
{ n=split($2,arr,";") # split field 2 by semi-colon
m=n # copy our array count
delete seen # reset seen array
for ( i=1 ; i<=n ; i++ ) { # loop through array indices
if ( arr[i] in seen ) { # if entry has been seen then ...
delete arr[i] # remove from array and ....
m-- # decrement our total count
}
else {
seen[arr[i]] # otherwise add element to the seen array
}
}
outf="single.out" # output file for single entries
if ( m >= 2 ) {
outf="multiple.out" # output file for multiple entries
}
printf "%s%s", $1, OFS > outf # print header
sep="" # separator for first field is empty string
for ( i in arr ) { # print remaining array elements
printf "%s%s", sep, arr[i] > outf
sep=";" # set separator to semi-colon for fields 2+
}
printf "%s%s\n", OFS, $3 > outf # print trailer
}
' colon.dat
NOTE: The comments can be removed to declutter the code.
The above generates the following:
$ cat single.out
Jim:44:cz
$ cat multiple.out
Joe:23;56:zz
Rob:45;98:fc

AWK find if line is newline or #

I have the following; it's ignoring the lines with just # but not those that are empty (lines containing just a newline).
Do you know of a way I can hit two birds with one stone?
I.e. if a line doesn't contain more than 1 character, then delete it.
function check_duplicates {
awk '
FNR==1{files[FILENAME]}
{if((FILENAME, $0) in a) dupsInFile[FILENAME]
else
{a[FILENAME, $0]
dups[$0] = $0 in dups ? (dups[$0] RS FILENAME) : FILENAME
count[$0]++}}
{if ($0 ~ /#/) {
delete dups[$0]
}}
#Print duplicates in more than one file
END{for(k in dups)
{if(count[k] > 1)
{print ("\n\nDuplicate line found: " k) " - In the following file(s)"
print dups[k] }}
printf "\n";
}' $SITEFILES
awk '
NR {
b[$0]++
}
$0 in b {
if ($0 ~ /#/) {
delete b[$0]
}
if (b[$0]>1) {
print ("\n\nRepeated line found: "$0) " - In the following file"
print FILENAME
delete b[$0]
}
}' $SITEFILES
}
The expected input is usually as follows.
#File Path's
/path/to/file1
/path/to/file2
/path/to/file3
/path/to/file4
#
/more/paths/to/file1
/more/paths/to/file2
/more/paths/to/file3
/more/paths/to/file4
/more/paths/to/file5
/more/paths/to/file5
In this case, /more/paths/to/file5 occurs twice and should be flagged as such.
However, there are also many newlines, which I'd rather ignore.
Er, it also has to be awk; I'm doing a tonne of post-processing and don't want to vary from awk for this bit, if that's okay :)
It really seems to be a bit tougher than I would have expected.
Cheers,
Ben
You can combine both checks into a single regex.
if ($0 ~ /#|^$/) {
delete dups[$0]
}
OR
To be more specific you can write
if ($0 ~ /^#?$/) {
delete dups[$0]
}
What it does
^ matches the start of the line.
#? matches zero or one #.
$ matches the end of the line.
So ^$ matches empty lines and ^#$ matches lines containing only #.
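For instance, a minimal sketch (separate from your full post-processing) that skips empty and #-only lines before looking for duplicates could look like this:
awk '
/^#?$/ { next }    # ignore empty lines and lines containing only "#"
{ count[$0]++ }    # count every remaining line across all files
END {
    for (line in count)
        if (count[line] > 1)
            print "Duplicate line found: " line
}' $SITEFILES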

Sequence length of FASTA file

I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
My desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
I need a small modification in order to deal with multiple sequence lines.
I also need a way to get the total number of sequences and the total length. Any suggestion will be welcome... in Bash or awk, please. I know that it is easy to do in Perl/BioPerl and, actually, I already have a script that does it that way.
An awk / gawk solution can be composed of three stages:
Every time a header is found, these actions should be performed:
Print the previous seqlen if it exists.
Print the tag.
Initialize seqlen.
For the sequence lines we just need to accumulate the total.
Finally, at the END stage, we print the remaining seqlen.
Commented code:
awk '/^>/ { # header pattern detected
if (seqlen){
# print previous seqlen if exists
print seqlen
}
# print the tag
print
# initialize sequence
seqlen = 0
# skip further processing
next
}
# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa
A one-liner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
For the totals:
awk '/^>/ { if (seqlen) {
print seqlen
}
print
seqtotal+=seqlen
seqlen=0
seq+=1
next
}
{
seqlen += length($0)
}
END{print seqlen
print seq" sequences, total length " seqtotal+seqlen
}' file.fa
A quick way with any awk would be this:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta
You might also be interested in BioAwk, an adapted version of awk which is tuned to processing FASTA files:
bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line. It's no longer a one-liner, so the downside is you'll have to save it as a script file.
It also parses the sequence id out of the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then you can select a specific sequence by its id by setting the variable target, like so: $ awk -f seqlen.awk -v target=chrM seq.fa.
BEGIN {
OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
if (seqlen) {
# Only print info for this sequence if no target was given
# or its id matches the target.
if (! target || id == target) {
print id, seqlen;
}
}
# Get sequence id:
# 1. Split header on whitespace (fields[1] is now ">id")
split($0, fields);
# 2. Get portion of first field after the starting ">"
id = substr(fields[1], 2);
seqlen = 0;
next;
}
{
seqlen = seqlen + length($0);
}
END {
if (! target || id == target) {
print id, seqlen;
}
}
"seqkit" is a quick way:
seqkit fx2tab --length --name --header-line sequence.fa
