How to print certain fields in a column if one of the fields is less than a certain value? - sorting

I have a .txt file that contains data about 100 colleges in the format
{COLLEGE NAME} {CITY, STATE} {RANK} {TUITION} {IN STATE TUITION} {ENROLLMENT}
For example, here are two lines:
YeshivaUniversity "New York, NY" 66 "$40,670 " "2,744"
FordhamUniversity "New York, NY" 60 "$47,317 " "8,855"
There are 98 more lines, and the output should list all the colleges with tuition less than $30,000.
Assuming that the field separator is a space, how could I print the {COLLEGE NAME} {CITY, STATE} {TUITION} of colleges with {TUITION} less than $30,000? Is this possible with awk or sort?
I have tried some combinations of awk and the <= operator, but I get an error every time. For example
$ awk -F" " '{print $1, $2, $4<=30000}' data1a.txt
gives me a syntax error.

Using GNU awk, since it's got FPAT:
$ gawk '
BEGIN {
    FPAT="([^ ]*)|(\"[^\"]+\")"
}
{
    tuition=$4                  # separate 4th column for cleaning
    gsub(/[^0-9]/,"",tuition)   # clean non-digits off
    if(tuition+0<30000)         # compare numerically
        print                   # and output
}'
Output for the sample data is empty, since both sample colleges have a tuition above $30,000.
(Next time, please post a sample that contains both positive and negative cases.)
Also, it was mentioned in the comments that the data is delimited by a single space yet a university name may itself contain a space. That wasn't the case anymore when I saw your question, but it could be tackled by counting the fields from the end, i.e. $4 would become $(NF-1).
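For example, here is a minimal sketch of that count-from-the-end variant, printing only the {COLLEGE NAME} {CITY, STATE} {TUITION} fields the question asks for. It assumes the tuition stays the second-to-last field and that the college name is a single token, as in the posted sample; data1a.txt is the asker's file name.
$ gawk '
BEGIN {
    FPAT="([^ ]*)|(\"[^\"]+\")"
}
{
    tuition=$(NF-1)             # tuition counted from the end of the line
    gsub(/[^0-9]/,"",tuition)   # strip $, commas and spaces
    if(tuition+0<30000)         # compare numerically
        print $1, $2, $(NF-1)
}' data1a.txt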

Related

How to get paragraphs of text by index number

I am wondering if there is a way to get paragraphs of text (source file would be a pyx file) by number as sed does with lines
sed -n ${i}p
At this moment I'd be interested to use awk with:
awk '/custom-pyx-tag\(/,/\)custom-pyx-tag/'
but I can't find documentation or examples about that.
I'm also trying to trim "\r\n" with gsub(/\r\n/,"; ") in the same awk command, but it doesn't work and I can't figure out why.
Any hint would be much appreciated, thanks.
EDIT:
This is just one example and not my exact need but I would need to know how to do it for a multipurpose project
Let's take the case that I have exported the ID3Tags of a huge collection of audio files and these have been stored in a pyx-like format, so in the end I will have a nice big file with this pattern repeating for each file in the collection:
audio-genre(
blablabla
)audio-genre
audio-artist(
bla.blabla
)audio-artist
audio-album(
bla-bla-bla
)audio-album
audio-track-num(
0x
)audio-track-num
audio-track-title(
bla.bla-bla
)audio-track-title
audio-lyrics(
blablablablabla
bla.bla.bla.bla
blah-blah-blah
blabla-blabla
)audio-lyrics
...
Now if I want to extract the artist of the 1234th audio file I can use:
awk '/audio-artist\(/, /)audio-artist/' | sed '/audio-artist/d' | sed -n 1234p
so, being one line, it can be obtained with sed, but I don't know how to get an entire paragraph given its index; for example, if I want to get the lyrics of the 6543rd file, how could I do it?
In the end it is just a question of whether there is a command equivalent to
sed -n ${num}p
but to be used for paragraphs
awk -v indx=1234 '
BEGIN {
    RS=""
}
{
    split($0,arr,"audio-artist")
    for (i=2;i<=length(arr);i=i+2) {
        gsub("[()]","",arr[i])
        arts[cnt+=1]=arr[i]
    }
}
END {
    print arts[indx]
}' audioartist
One liner:
awk -v indx=1234 'BEGIN {RS=""} NR==1 { split($0,arr,"audio-artist");for (i=2;i<=length(arr);i=i+2) { gsub("[()]","",arr[i]);arts[cnt+=1]=arr[i] } } END { print arts[indx] }' audioartist
Using awk, and the file called audioartist, we consume the file as one record by setting the record separator (RS) to "". We then split the whole file into an array arr, based on the separator audio-artist. We loop through the array arr from index 2 to the end in steps of 2, strip out the opening and closing parentheses, and build another array called arts with an incrementing count as the index and the stripped artist as the value. At the end we print the arts entry specified by the passed indx variable (in this case 1234).
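The same pattern generalizes to whole multi-line blocks. For example, the lyrics of the 6543rd file could be pulled the same way; this is a sketch that assumes the pyx-like layout shown above, no blank lines inside the tagged blocks, and a placeholder input file name id3dump.pyx:
awk -v indx=6543 '
BEGIN { RS="" }                      # paragraph mode, as in the answer above
{
    split($0,arr,"audio-lyrics")     # split on the tag name
    for (i=2;i<=length(arr);i=i+2) { # every even chunk is a tagged block
        gsub("[()]","",arr[i])       # strip the surrounding parentheses
        lyr[cnt+=1]=arr[i]
    }
}
END { print lyr[indx] }' id3dump.pyx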

Matching pairs using Linux terminal

I have a file named list.txt containing (supplier, product) pairs, and I must show the number of products from every supplier and their names, using the Linux terminal.
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to a file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
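If you need the exact two-lines-per-supplier layout from the question, the datamash output can be reshaped with a short awk step; this is a sketch built on the command above, and the ordering follows datamash's sorting:
datamash -s -t ':' groupby 1 count 2 unique 2 < file |
awk -F: '{ gsub(/,/," ",$3); print $1": "$2; print $1": "$3 }'
which prints, for example, dairy: 2 on one line and dairy: cheese milk on the next.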
The following pipeline should do the job
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the subsequent processing
the cryptic sed joins the lines that share a common supplier
awk counts the items per supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space;
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to and including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the awk-only code I was referring to:
< os awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{                          # for each line
    # we use the word before the : as the key of an associative array
    count[$1] += 1                  # increment the count for the given supplier
    items[$1] = items[$1] " " $2    # concatenate the current item to the previous ones
}
END {                               # after processing the whole file
    for (supp in items)             # iterate on the suppliers and print the result
        print supp": " count[supp], "\n"supp":" items[supp]
}'

Print Filename and Substring to csv For Each File in a Directory

I've been trying to teach myself awk to accomplish the following, but haven't had much success.
I have a directory with several text files:
JV-01_S01_L007_R2_002_RepetitiveText_ToRemove.txt
JV-26_S48_L_RepetitiveText_ToRemove.txt
...
The structure of each text file is as follows. The numbers may change, but the accompanying text will always remain the same.
JV-01_S01_L007_R2_002_RepetitiveText_ToRemove.txt
4620178 reads; of these:
4620178 (100.00%) were unpaired; of these:
1226814 (26.55%) aligned 0 times
3040861 (65.82%) aligned exactly 1 time
352503 (7.63%) aligned >1 times
73.45% overall alignment rate
JV-26_S48_L_RepetitiveText_ToRemove.txt
1601831 reads; of these:
1601831 (100.00%) were unpaired; of these:
58800 (3.67%) aligned 0 times
1344724 (83.95%) aligned exactly 1 time
198307 (12.38%) aligned >1 times
96.33% overall alignment rate
For each file in this directory, I want to compile a csv with:
Sample Total_Reads Uniquely_Mapped_Reads Multi_Mapped_Reads Unmapped_Reads
JV-01_S01_L007_R2_002 4620178 3040861 352503 1226814
JV-26_S48_L 1601831 1344724 198307 58800
...
Is there any way to do this with a single for loop with awk? I was trying to use the match function.
For instance, if I could specify that match searches within a specific line, and then scans from left to right for a substring composed of any number of digits until a space is found, that would grab the substring of interest for that line.
Something along the lines of:
for file in *.txt
do
awk 'FNR == 1 {print FILENAME, match(NR==1, \d), match(NR==4, \d), match(NR==5, \d), match(NR==3, \d) } ' $file >> Names.csv
Here's an easy way, but it requires GNU awk for multi-char RS.
You can read the file as a single record using the trick here. Then you just need to print out the fields you want (and this does depend on your assertion that the text is fixed)
$ awk -v RS="^$" '{print FILENAME, $1, $16, $22, $11}' jv-01 jv-26
jv-01 4620178 3040861 352503 1226814
jv-26 1601831 1344724 198307 58800
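To run this over the whole directory and build Names.csv with the requested header, you could wrap it in a loop along these lines (a sketch: the header text and the _RepetitiveText suffix stripping come from the question, while the glob pattern and the loop itself are assumptions about how you'd drive it):
printf 'Sample,Total_Reads,Uniquely_Mapped_Reads,Multi_Mapped_Reads,Unmapped_Reads\n' > Names.csv
for file in *_RepetitiveText_ToRemove.txt; do
    gawk -v RS='^$' -v OFS=',' '{
        sample = FILENAME
        sub(/_RepetitiveText.*/, "", sample)   # strip the fixed suffix from the file name
        print sample, $1, $16, $22, $11        # total, unique, multi, unmapped
    }' "$file"
done >> Names.csv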
Could you please try the following, written and tested with the shown samples.
awk '
BEGIN{
    print "Sample Total_Reads Uniquely_Mapped_Reads Multi_Mapped_Reads Unmapped_Reads"
}
FNR==1{
    if(total_reads){
        print file,total_reads,Uniquely_Mapped_Reads,Multi_Mapped_Reads,Unmapped_Reads
    }
    total_reads=Uniquely_Mapped_Reads=Multi_Mapped_Reads=Unmapped_Reads=""
    sub(/_RepetitiveText.*/,"",FILENAME)
    file=FILENAME
}
/reads; of these/{
    total_reads=$1
    next
}
/aligned exactly 1 time/{
    Uniquely_Mapped_Reads=$1
    next
}
/aligned >1 times/{
    Multi_Mapped_Reads=$1
    next
}
/aligned [0-9]+ times/{
    Unmapped_Reads=$1
}
END{
    if(total_reads){
        print file,total_reads,Uniquely_Mapped_Reads,Multi_Mapped_Reads,Unmapped_Reads
    }
}
' *.txt | column -t

How to remove part of the middle of a line/string by matching two known patterns in front and behind variable text to be removed

How to remove part of the middle of a line/string by matching two known patterns, one in front of text to be removed and one behind the text to be removed?
I have a Linux text file with thousands of one-line, comma-delimited records. Unfortunately, not all records have the same format. Each line may have as many as four comma-delimited fields, of which only the first and last are constant; the two middle fields may or may not be present.
Examples of existing line (record) formats. Messy data, but the first field is always present, as is the last field, which starts with the word ADDED.
FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE
FNAME LNAME, ADDED TO DB DATE
FNAME LNAME, SOME COMMENT, ADDED TO DB DATE
FNAME LNAME, JOINED DATE, ADDED TO DB DATE
The objective is to keep field one including the comma, throw away everything between the first comma and the word "ADDED", keep "ADDED" and everything that follows to the end of the line, and insert a space between the first comma and the word ADDED.
For each line, parse from the start of the line to the first comma (keep this).
Parse the rest of the line up to the space before the word "ADDED" and throw it away.
Keep everything from the space before the word "ADDED" to the end of the line, and concatenate the first part and last part to form one record per line with two fields separated by a comma and a space.
(if record is already in desired format, change nothing)
Final file to look like:
FNAME LNAME, ADDED TO DB DATE
or
Fred Flintstone, ADDED on January 1st 2015 By Barney Rubble
Thanks!
If you don't care about blank lines:
awk '{print $1,$NF}' FS=, OFS=, input
(Blank lines will be output as a single comma)
If you want to just skip blank lines, use:
awk 'NF>1{print $1,$NF}' FS=, OFS=, input
If you want to keep them:
awk '{printf( "%s%s\n", $1, NF>1 ? ","$NF : "")}' FS=, OFS=, input
Note that this will not ensure a single space after the comma, but will retain the spacing of the final field as in the original file (that is, if there are 3 spaces after the final comma in the original, you'll get 3 in the output). It's not clear to me from the description, but that seems like desirable behavior.
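If a single space after the comma is actually required, the spacing can be normalized by letting the field separator swallow any spaces that follow a comma and letting OFS put exactly one space back. This is a sketch building on the same approach, with the same placeholder input file name:
awk -F', *' -v OFS=', ' 'NF>1 { print $1, $NF }' input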
A Perl solution
perl -ne 'print join ", ", (split /,\s*/)[0,-1]' myfile
or
perl -pe 's/,.*(?=,)//' myfile
Both of those solutions work fine for me with the data you have given, but you may like to try
perl -pe 's/,.*(?=,\s*ADDED)//' myfile
You can use a backreference:
sed 's/\(^[^,]*,\).* ADDED/\1 ADDED/' file
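For instance, against a line constructed from the formats in the question, it behaves like this (illustrative run with GNU sed):
$ echo 'Fred Flintstone, SOME COMMENT, JOINED DATE, ADDED on January 1st 2015 By Barney Rubble' | sed 's/\(^[^,]*,\).* ADDED/\1 ADDED/'
Fred Flintstone, ADDED on January 1st 2015 By Barney Rubble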
One more approach with awk could help here.
awk -F, '{val=$1;sub(/FNAME.*\,/,",");print val $0}' Input_file
Here I set the field separator to a comma, save the first field in a variable named val, substitute everything from FNAME through the last comma with a single comma in the current line, and then print the value of val followed by the edited current line.
Using perl
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, "<", "file.txt" or die "$!: couldn't open file\n";
while (<$fh>) {
    chomp;                                   # drop the trailing newline
    my @arr = split(/,/);
    my $text = $arr[0] . ", " . $arr[$#arr];
    print "$text\n";
}

remove special character in a csv unix and fix the new line

Below is my sample data in the csv:
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In the above, one field contains good data along with junk data, and the line is split onto a new line.
I want to remove these special characters (due to the special characters and spaces, the line was moved to the next line) as well as merge the split line back into a single line.
Currently I am using something like the below, which is taking a lot of time:
tr -cd '\11\12\15\40-\176' | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' MY_FILE.csv > MY_FILE.csv.tmp
I attached a screenshot of the original data in the file.
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (i.e. would fail to trap an extra line break just after a random quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";sub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break is between the last comma and the content of the last field because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, '                       # Set comma as the field separator...
NF<10 {                         # For any lines that have fewer than 10 fields...
    $0=last $0                  # Insert the last "saved" line here,
    last=$0                     # and save the newly joined line for the next round.
}
NF<10 {                         # If we still have fewer than 10 fields,
    next                        # stop here and read the next input line.
}
{
    last=""                     # Otherwise clear the saved line, and finally
    sub(/[^[:print:]]/,"")      # substitute an empty string for the non-printable,
}
1' inputfile                    # and print the current line.
