I have 2 files that are reports of the sizes of the databases (one file is from yesterday, one from today).
I want to see how the size of each database changed, so I want to calculate the difference.
The file looks like this:
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8336089","797419","896120","server3"
"ASIA","3740156","3170088","570068","354000","server5"
"AFRICA","4871331","4101711","769620","318412","server4"
The other file is the same, only the numbers are different.
I want to see how the database size changed (so ONLY the "Use MB" column).
I guess I cannot use "diff" or "awk" options since the numbers may change dramatically each day. The only good 'algorithm' I can think of is to subtract the numbers between the 5th and 6th double quote ("); how do I do that?
You can do this (using awk):
paste -d ',' file1 file2 | awk -F ',' '{gsub(/"/, "", $3); gsub(/"/, "", $9); print $3 - $9}'
paste puts the two files side by side, separated by a comma (-d ','), so you will have:
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname","DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8336089","797419","896120","server3","EUROPE","9133508","8336089","797419","896120","server3"
...
gsub(/"/, "", $3) strips the double quotes from field 3, and the second gsub does the same for field 9.
Finally we print field 3 minus field 9.
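Note that the header line also goes through this pipeline; since "Use MB" evaluates to 0 in a numeric context, it simply prints a 0. If you want to skip the header, a small variation on the same command (same sample files assumed) would be:
paste -d ',' file1 file2 | awk -F ',' 'NR>1{gsub(/"/, "", $3); gsub(/"/, "", $9); print $3 - $9}'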
Maybe I missed something, but I don't get why you couldn't use awk, as it can totally do this:
The only good 'algoritm' I can think of is to subtract numbers between
5th and 6th double quote ("), how do I do that?
Let's say that file1 is:
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8336089","797419","896120","server3"
"ASIA","3740156","3170088","570068","354000","server5"
"AFRICA","4871331","4101711","769620","318412","server4"
And file2 is:
"DATABASE","Alloc MB","Use MB","Free MB","Temp MB","Hostname"
"EUROPE","9133508","8335089","797419","896120","server3"
"ASIA","3740156","3170058","570068","354000","server5"
"AFRICA","4871331","4001711","769620","318412","server4"
The command
awk -F'[",]' 'NR>1&&NR==FNR{db[$2]=$8;next}FNR>1{print $2, db[$2]-$8}' file1 file2
gives you this result:
EUROPE 1000
ASIA 30
AFRICA 100000
You can also look at other answers on how to handle quote characters more robustly in awk.
If your version of awk does not support multiple field delimiters, you can try this:
awk -F, 'NR>1&&NR==FNR{db[$1]=$3;next}FNR>1{print $1, db[$1]-$3}' <(sed 's,",,g' file1) <(sed 's,",,g' file2)
I need to find two numbers from lines which look like this
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that don't contain ":", so we can search for the colon to find the lines with numbers. Every line has different numbers.
I need to find both numbers after ":", which are separated by "-", then subtract the first number from the second one and print the result on the screen for each line.
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For the example I showed above, my result would be:
4899
Because it is the result of 458800 - 453901 = 4899
I can't figure it out on my own and would appreciate some help
With GNU awk: separate the row into multiple columns using the : and - separators. In each row containing :, subtract the contents of column 2 from the contents of column 3 and print the result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899
Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899
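If you also want to see which line each result came from (this goes a little beyond what was asked, so treat it as an optional sketch), print the part before the colon as well:
awk -F '[:-]' '/:/{print $1, $3-$2}' file
which, for the sample line above, prints >Chr14 4899.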
Bash 4.4.0
Ubuntu 16.04
I have several columns in a CSV file that are all capital letters, and some are lowercase. Some columns have only one word while others may have 50 words. At the moment I convert column by column with 2 commands, and it is quite taxing on the server when the file has 50k lines.
Example:
#-- Place the header line in a temp file
head -n 1 "$tmp_input1" > "$tmp_input3"
#-- Remove the header line in the original file
tail -n +2 "$tmp_input1" > "$tmp_input1-temp" && mv "$tmp_input1-temp" "$tmp_input1"
#-- Change the words in the 11th column to lower case, then change the first letter to upper case
awk -F"," 'BEGIN{OFS=","} {$11 = tolower($11); print}' "$tmp_input4" > "$tmp_input5"
sed -i "s/\b\(.\)/\u\1/g" "$tmp_input5"
#-- Change the words in the 12th column to lower case, then change the first letter to upper case
awk -F"," 'BEGIN{OFS=","} {$12 = tolower($12); print}' "$tmp_input5" > "$tmp_input6"
sed -i "s/\b\(.\)/\u\1/g" "$tmp_input6"
#-- Change the words in the 13th column to lower case, then change the first letter to upper case
awk -F"," 'BEGIN{OFS=","} {$13 = tolower($13); print}' "$tmp_input6" > "$tmp_input7"
sed -i "s/\b\(.\)/\u\1/g" "$tmp_input7"
cat "$tmp_input7" >> "$tmp_input3"
Is it possible to do multiple columns in a single command?
Here is an example of the csv file:
"dealer_id","vin","conditon","stocknumber","make","model","year","broken","trim","bodystyle","color","interiorcolor","interiorfabric","engine","enginedisplacement","engineaspiration","engineText","transmission","drivetrain","mpgcity","mpghighway","mileage","cylinders","fuelconditon","optiontext","description","titlestatus","warranty","price","specialprice","window_sticker_price","mirrorhangerprice","images","ModelCode","PackageCodes"
"JOHNVANC04A","2C4RC1N73JR290946","N","JR290946","Chrysler","Pacifica","2018","","Hybrid Limited FWD","Mini-van, Passenger","Brilliant BLACK Crystal PEARL Coat","","..LEATHER SEATS..","V6 Cylinder Engine","3.6L","","","AUTOMATIC","FWD","0","0","553","6","H","..1-SPEED A/T..,..AUTO-OFF HEADLIGHTS..,..BACK-UP CAMERA..,..COOLED DRIVER SEAT..,..CRUISE CONTROL..","======KEY FEATURES INCLUDE: . LEATHER SEATS. THIRD ROW SEAT. QUAD BUCKET SEATS. REAR AIR. HEATED DRIVER SEAT.","","0","41680","","48830","","http://i.autoupktech.com/c640/9c40231cbcfa4ef89425d108e4e3a410.jpg",http://i.autoupnktech.com/c640/9c40231cbcfa4ef89425d108e4e3a410.jpg","RUES53","AAX,AT2,DFQ,EH3,GWM,WPU"
Here's a snippet of the above columns refined
Column 11 should be - "Brilliant Black Crystal Pearl Coat"
Column 13 should be - "Leather Seats"
Column 16 should be - "Automatic"
Column 23 should be - "1-Speed A/T,Auto-Off Headlights,Back-up Camera"
Column 24 should be - "Key Features Include: Leather Seats,Third Row Seat"
Keep in mind, the double-quotes surrounding the columns can't be stripped. I only need to convert certain columns and not the entire file. Here's an example of the columns 11, 13, 16, 23 and 24 converted.
"Brilliant Black Crystal Pearl Coat","Leather Seats","Automatic","1-Speed A/T,Auto-Off Headlights,Back-up Camera","Key Features Include: Leather Seats,Third Row Seat"
Just to add another option, here is a one-liner using just sed:
sed -i -e 's/.*/\L&/' -e 's/[a-z]*/\u&/g' filename
And here is a proof of concept:
$ cat testfile
jUSt,a,LONG,list of SOME,RAnDoM WoRDs
ANother LIne
OneMore,LiNe
$ sed -e 's/.*/\L&/' -e 's/[a-z]*/\u&/g' testfile
Just,A,Long,List Of Some,Random Words
Another Line
Onemore,Line
$
If you want to convert just the headers of the CSV file (first line), just replace s with 1s on both search patterns.
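For instance, applied only to line 1 of the same test file (a sketch of the substitution just described):
sed -e '1s/.*/\L&/' -e '1s/[a-z]*/\u&/g' testfile
This title-cases the first line and leaves the rest of the file untouched.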
You can find an excellent article explaining the magic: sed – Convert to Title Case.
Here is another alternative (off-topic here, I know) in Python 3:
import csv
from pathlib import Path
infile = Path('infile.csv')
outfile = Path('outfile.csv')
titled_cols = [10, 12, 15, 22, 23]
titled_data = []
with infile.open() as fin, outfile.open('w', newline='') as fout:
    for row in csv.reader(fin, quoting=csv.QUOTE_ALL):
        for i, col in enumerate(row):
            if i in titled_cols:
                row[i] = col.title()  # write the title-cased value back into the row
        titled_data.append(row)
    csv.writer(fout, quoting=csv.QUOTE_ALL).writerows(titled_data)
Just define the columns you want to be title-cased in titled_cols (columns have zero-based indexes) and it will do what you want.
I guess infile and outfile are self-explanatory and outfile will contain the modified version of your original file.
I hope it helps.
You could create a user-defined function and apply it to the columns you need to modify.
awk -F, 'function toproper(s) { return toupper(substr(s, 1, 1)) tolower(substr(s, 2, length(s))) } {printf("%s,%s,%s,%s\n", toproper($1), toproper($2), toproper($3), toproper($4));}'
Input:
FOO,BAR,BAZ,ETC
Output:
Foo,Bar,Baz,Etc
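As a sketch of how this could be adapted to touch only specific columns (say 11, 13 and 16 from the question), assuming plain, unquoted comma-separated fields like the answer's own input (file.csv is just a placeholder name):
awk -F, 'function toproper(s) { return toupper(substr(s, 1, 1)) tolower(substr(s, 2)) } BEGIN{OFS=","} { $11 = toproper($11); $13 = toproper($13); $16 = toproper($16); print }' file.csv
Like the original one-liner, this capitalizes only the first letter of each whole field, not of every word inside it.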
Assuming the fields of the csv file are not quoted by double quotes,
meaning that we can simply split a record on commas and whitespaces, how
about a Perl solution:
perl -pe 's/(^|(?<=[,\s]))([^,\s])([^,\s]*)((?=[,\s])|$)/\U$2\L$3/g' input.csv
input.csv:
Bash,4.4.0,Ubuntu,16.04
I have several columns in a CSV file,that, are, all capital letters
and some are lowercase.
Some columns have only,one,word,while others may have 50 words.
output:
Bash,4.4.0,Ubuntu,16.04
I Have Several Columns In A Csv File,That, Are, All Capital Letters
And Some Are Lowercase.
Some Columns Have Only,One,Word,While Others May Have 50 Words.
This version uses AWK to do the job:
This is the command (change file to your filename)
awk -F"," 'BEGIN{OFS=","}{ for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""tolower(substr($i,2,length($i)))}print $0}' file | awk -F" " 'BEGIN{OFS=" "} { for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""substr($i,2,length($i))}print $0}'
The test:
cat file
pepe is cool,ASDASD ASDAS,and no podpoiaops
awk -F"," 'BEGIN{OFS=","}{ for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""tolower(substr($i,2,length($i)))}print $0}' file | awk -F" " 'BEGIN{OFS=" "} { for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""substr($i,2,length($i))}print $0}'
Pepe Is Cool,Asdasd Asdas,And No Podpoiaops
Explanation
BEGIN{OFS=","} tells awk how to output the line.
The for statement uses NF, the built-in variable holding the number of fields in each line.
The substr calls take the first letter of each field, change its case, and the result is assigned back to the field.
The whole row is printed with print $0.
Finally, the second awk splits the lines produced by the first one, but this time using spaces as the separator. This way it detects every individual word in the file and upper-cases its first character.
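To make the two stages visible, running only the first awk on the test line should give:
awk -F"," 'BEGIN{OFS=","}{ for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""tolower(substr($i,2,length($i)))}print $0}' file
Pepe is cool,Asdasd asdas,And no podpoiaops
The second awk then upper-cases the first letter of every space-separated word, producing the final result shown above.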
Hope it helps
I need to print 2 columns after a specific string (in my case it is 64). There can be multiple instances of 64 within the same CSV row; however, the next instance will not occur within 3 columns of the previous occurrence. The output for each instance should be on its own line and unique. The problem is that the specific string does not fall in the same column for all rows. Every row holds fairly dynamic data and there is no header in the CSV. Let's say the input file is as below (it's just a sample; the actual file has approx 300 columns & 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have tried to search many threads but could not find anything closely related. Please help.
We can use a pipe of two commands... the first puts each 64 at the start of a line, and the second prints the first three columns whenever it sees a leading 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
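The same one-liner spread out with comments (the logic is unchanged; note that, as written, it prints every occurrence and does not de-duplicate):
awk -F, '{
  for (i = 0; ++i <= NF; ) {
    if ($i == "64") a = 4               # seeing 64 opens a 3-field window
    if (--a > 0) s = s ? s "," $i : $i  # while the window is open, append the field
    if (a == 1) { print s; s = "" }     # after 64 plus the next 2 fields, print and reset
  }
}' file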
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming columns are separated by blank space or comma
a sort -u is needed for uniqueness (it's possible in sed too, but it would mean adding another "simple" action of the same kind)
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
if($i==64)
{k=$i FS $(++i) FS $(++i);
if (!a[k]++)
print k
}
}' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
PS: your sample output doesn't match the given input.
I have a CSV file which has 4 columns. I want to:
print the first 10 items of each column
only print the items in the third column
My method is to pipe the first awk command into another, but I didn't get exactly what I wanted:
awk 'NR < 10' my_file.csv | awk '{ print $3 }'
The only missing thing was the -F.
awk -F "," 'NR < 10' my_file.csv | awk -F "," '{ print $3 }'
You don't need to run awk twice.
awk -F, 'NR<=10{print $3}'
This prints the third field for every line whose record number (line) is less than or equal to 10.
Note that < is different from <=. The former matches records one through nine, the latter matches records one through ten. If you need ten records, use the latter.
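A quick way to check the difference (just an illustration, using seq as stand-in input):
seq 15 | awk 'NR<=10' | wc -l    # prints 10
seq 15 | awk 'NR<10' | wc -l     # prints 9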
Note that this will walk through your entire file, so if you want to optimize your performance:
awk -F, '{print $3} NR==10{exit}'
This will print the third column and then, as soon as it reaches record number 10, exit. This does not step through your entire file.
Note also that awk's "CSV" matching is very simple; awk does not understand quoted fields, so the record:
red,"orange,yellow",green
has four fields, two of which have double quotes in them. YMMV depending on your input.
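A quick way to see that behaviour (just an illustration of the point above, not part of the answer's command):
echo 'red,"orange,yellow",green' | awk -F, '{print NF; print $2}'
prints 4 and then "orange, showing that the second field still carries its opening quote.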
I have 100 text files which look like this:
File title
4
Realization number
variable 2 name
variable 3 name
variable 4 name
1 3452 4538 325.5
The first number on the 7th line (1) is the realization number, which SHOULD relate to the file name, i.e. the first file is called file1.txt and has realization number 1 (as shown above). The second file is called file2.txt and should have realization number 2 on the 7th line, file3.txt should have realization number 3 on the 7th line, and so on...
Unfortunately every file has realization number 1, whereas they should be incremented according to the file name.
I want to extract variables 2, 3 and 4 from the 7th line (3452, 4538 and 325.5) in each of the files and append them to a summary file called summary.txt.
I know how to extract the information from 1 file:
awk 'NR==7,NR==7{print $2, $3, $4}' file1.txt
Which, correctly gives me:
3452 4538 325.5
My first problem is that this command doesn't seem to give the same results when run from a bash script on multiple files.
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR=7,NR==7{print $2, $3, $4}' File$((i)).txt
done
I get multiple lines being printed to the screen when I use the above script.
Secondly, I would like to output those values to the summary file along with the CORRECT preceding realization number, i.e. I want a file that looks like this:
1 3452 4538 325.5
2 4582 6853 158.2
...
100 4865 3589 15.15
Thanks for any help!
You can simplify some things and get the result you're after:
#!/bin/bash
for ((i=1;i<=100;i++))
do
echo $i $(awk 'NR==7{print $2, $3, $4}' File$i.txt)
done
You really don't want NR=7 (an assignment) where you meant NR==7, and you don't need to repeat the NR==7,NR==7 either. You also really don't need the $((i)) notation when $i is sufficient.
If all the files are exactly 7 lines long, you can do it all in one awk command (instead of 100 of them):
awk 'NR%7==0 { print ++i, $2, $3, $4 }' File*.txt
(Keep in mind that the shell glob sorts names lexically, so File10.txt comes before File2.txt; see the ls -v note in the last answer if the numbering matters.)
Notice that you have only one = in your bash script. Do all the files have exactly 7 lines? If you are only interested in the 7th line, then:
#!/bin/bash
for ((i=1;i<=100;i++));do
awk 'NR==7{print $2, $3, $4}' File$((i)).txt
done
Since your realization number starts from 1, you can simply add it using the nl command.
For example, if your bash script is called s.sh then:
./s.sh | nl > summary.txt
will get you the result with the expected lines in summary.txt
Here's one way using awk:
awk 'FNR==7 { print ++i, $2, $3, $4 > "summary.txt" }' $(ls -v file*)
The -v flag simply sorts the glob by version numbers. If your version of ls doesn't support this flag, try: ls file* | sort -V instead.