I have this situation.
In my script, I have to use the hdparm command on a specific partition and extract the MB/s value it calculates.
I'm able to achieve this thanks to the use of grep and a regex; so, if with
sudo hdparm -tT /dev/xvda1
the output is:
/dev/xvda1:
Timing cached reads: 12596 MB in 1.99 seconds = 6320.55 MB/sec
Timing buffered disk reads: 594 MB in 3.01 seconds = 197.12 MB/sec
with
sudo hdparm -tT /dev/xvda1 | grep -Po '.* \K[0-9.]+'
the results are:
6320.55
197.12
Now, the next request is to print data in a different way.
The desired output is:
/dev/xvda1: 6320.55 MB/sec, 197.12 MB/sec
But I don't know how to obtain this; summarizing, what is requested is to print the partition and, on a single line, the extracted MB/s values.
Seems like your last question was an XY problem.
If you want to append MB/sec anyways there is no need to remove it in the first place. Extracting 6320.55 MB/sec would have been a lot easier than extracting just 6320.55.
Anyways, an awk script is probably the best solution here:
awk -F' = ' '{a[NR]=$NF} END {printf "%s %s, %s\n", a[1], a[2], a[3]}'
If you don't need exactly that format, the script can be simplified to:
awk -F' = ' '{printf "%s ", $NF}'
which prints /dev/xvda1: 6320.55 MB/sec 197.12 MB/sec.
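Putting the pieces together, the full pipeline would be something like this (a sketch that assumes hdparm prints exactly the three lines shown above):
sudo hdparm -tT /dev/xvda1 | awk -F' = ' '{a[NR]=$NF} END {printf "%s %s, %s\n", a[1], a[2], a[3]}'
which should print /dev/xvda1: 6320.55 MB/sec, 197.12 MB/sec. If your hdparm emits a leading blank line before the device name, filter it out first, for example by inserting grep . between hdparm and awk.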
I have a follow-up question to this question. Is the value of the total physical memory always shown in kB? I ask because I would like to print it in GB, and I use this command
grep MemTotal /proc/meminfo | awk '{$2=$2/(1024^2); print $2}'
I'm not sure whether I should add an if statement to check that grep MemTotal /proc/meminfo really reports a kB value rather than some other unit.
Any help would be appreciated.
You need not use grep + awk; you could do this in a single awk itself. For explanation's sake, I have combined your attempted grep logic into the awk code itself. In the awk program I check whether the 1st field is MemTotal: and the 3rd field is kB, and if so print the 2nd field's value in GB (taken from the OP's attempted code itself).
awk '$1=="MemTotal:" && $3=="kB"{print $2/(1024^2)}' /proc/meminfo
Or, if you want to make the kB match in the 3rd field case-insensitive, try the following code:
awk '$1=="MemTotal:" && $3~/^[kK][bB]$/{print $2/(1024^2)}' /proc/meminfo
Is the value of the total physical memory always shown in KB?
Yes, the unit kB is fixed in the kernel code. See: 1 and 2
If you assume the MemTotal: entry is always the first line of /proc/meminfo, it is possible to get the gigabyte value without spawning a sub-shell or external commands, using only POSIX-shell grammar that works with ksh, ash, dash, zsh, or bash:
#!/usr/bin/env sh
IFS=': ' read -r _ memTotal _ < /proc/meminfo
printf 'Total RAM: %d GB\n' "$((memTotal / (1024 * 1024)))"
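If you would rather not rely on MemTotal: being the first line, a minimal POSIX-shell sketch (the loop below is my own illustration, not part of the original answer) can scan for it instead:
#!/usr/bin/env sh
# Scan /proc/meminfo for the MemTotal: line instead of assuming it comes first
while IFS=': ' read -r key value _; do
    if [ "$key" = "MemTotal" ]; then
        printf 'Total RAM: %d GB\n' "$((value / (1024 * 1024)))"
        break
    fi
done < /proc/meminfo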
I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers, the one by user @anubhava being the one I found most straightforward and elegant. For the sake of clarity I am going to repeat some information from the original question.
I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file with only the rows matching each category.
My input file dataset.csv is something like this:
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
I am using the following to get the results I want (again, this is thanks to @anubhava).
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
This works as expected, but I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not even at 20% of the whole process.
I am running this on Windows 10 with an AMD Ryzen PRO 3500 200MHz, 4 Cores, 8 Logical Processors, 16GB of memory and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and memory at 51%. I am running awk inside a Cygwin64 terminal.
I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.
As always, I appreciate any advice.
with sorting:
awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' <(sort -t, -nk2 dataset.csv)
or with gawk (which allows an effectively unlimited number of open file descriptors):
gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv
This is the right way to do it using any awk:
$ tail -n +2 file | sort -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'
The reason I say this is the right approach is that it doesn't rely on the 2nd field of the header line coming before the data values when sorted, and it doesn't require awk to test NR > 1 for every line of input. It doesn't need an array to store $2s or any other values, and it only keeps 1 output file open at a time (the more files open at once, the slower any awk will run, especially gawk once you get past the limit of open files supported by other awks, since gawk then has to start opening/closing the files in the background as needed). It also doesn't require you to empty existing output files before you run it, it will do that automatically, and it only does the string concatenation to create the output file name once per output file, not once per line.
Just like the currently accepted answer, the sort above could reorder the input lines that have the same $2 value. Add -s if that's undesirable and you have GNU sort; with other sorts you would need to replace the tail with a different awk command and add another sort argument.
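For example, with GNU sort the stable variant is just the same pipeline with -s added (nothing else changes):
tail -n +2 file | sort -s -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'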
I want to write the amount of RAM used to a file in a bash script.
If you run the command free, you get the following output:
             total       used       free     shared    buffers     cached
Mem:          7930       4103       3826          0         59       2060
-/+ buffers/cache:       1983       5946
Swap:        15487          0      15487
I want to pull the used value out and write it to a file, something like
MemUsed: 4103
I have tried variations of
cat free | grep used' uniq >> ramInfo.txt but have been unable to get it correct.
I am completely new to shell scripts so forgive me if this is relatively simple.
You can do this and you will get the used value (it is the 3rd field of the Mem: line):
free -h | awk '/^Mem/{print $3}'
You can also get the free memory in kilobytes from /proc/meminfo:
awk -F':' '/MemFree/{print $2}' /proc/meminfo | sed 's/^ *//g;s/ *$//g'
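To get exactly the MemUsed: line from the question appended to a file, a minimal sketch (using the ramInfo.txt name from the question and the traditional free layout shown above) could be:
free | awk '/^Mem:/{print "MemUsed: " $3}' >> ramInfo.txt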
I'm here again! I would like to optimise my bash script in order to lower the time spent on each loop iteration.
Basically what it does is:
getting a piece of info (a read ID and a position) from a TSV file
using that information to look it up with awk in another file
printing the matching line and exporting it
My issues are:
1) the files are 60GB compressed files: I need software to uncompress them (I'm actually trying to uncompress one now, and I'm not sure I'll have enough space)
2) it is slow to search through the file anyway
My ideas to improve it:
0) as said, if possible I'll decompress the file
1) using GNU parallel with parallel -j 0 ./extract_awk_reads_in_bam.sh ::: reads_id_and_pos.tsv, but I'm unsure it works as expected: I'm only cutting the time per lookup from 36 min to 16, so just a factor of 2.5? (I have 16 cores)
2) splitting my list of info to look up into several files and launching them in parallel (but that may be redundant with GNU parallel?)
3) sorting the BAM file by read name, and exiting awk after having found 2 matches (there can't be more than 2)
Here is the rest of my bash script; I'm really open to ideas to improve it, but I'm not sure I'm a superstar in programming, so maybe keeping it simple would help? :)
My bash script :
#!/bin/bash
while IFS=$'\t' read -r READ_ID_WH POS_HOTSPOT; do
echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}" >> /data/bismark2/reads_done_so_far.txt
echo "$(date -Iseconds) read id is : ${READ_ID_WH} with position ${POS_HOTSPOT}"
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v read_id="$READ_ID_WH" -v pos_hotspot="$POS_HOTSPOT" '$1==read_id {printf $0 "\t%s\twh_genome",pos_hotspot}'| head -2 >> /data/bismark2/export_reads_mapped.tsv
done <"$1"
My tsv file has a format like :
READ_ABCDEF\t1200
Thank you a lot ++
TL;DR
Your new script will be:
#!/bin/bash
samtools view -@ 2 /data/bismark2/aligned_on_nDNA/bamfile.bam | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
You are reading the entire file for each of the inputs. It is better to look for all of them at the same time: start by extracting the interesting reads and then, on that subset, apply the second transformation.
samtools view -# 2 "$bam" | grep -f <(awk -F$'\t' '{print $1}' "$1") > "$sam"
Here you are getting all the reads with samtools and searching for all the terms that appear in the -f parameter of grep. That parameter is a file that contains the first column of the search input file. The output is a SAM file with only the reads that are listed in the search input file.
Finally, use awk for adding the extra information:
awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {print $0, st_array[$1], "wh_genome"}' "$sam"
Open the search input file with awk at the beginning and read its contents into an array (st_array)
Set the Output Field Separator to the tabulator
Traverse the sam file and add the extra information from the pre-populated array.
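For reference, here is one way the two-step variant could be wired into a small script. This is only a sketch: the $bam, $sam and output paths below are assumptions taken from the question, and $1 is still the TSV of read IDs and positions; adjust them to your setup.
#!/bin/bash
# $1 is the TSV with read IDs and positions, as in the original script
bam=/data/bismark2/aligned_on_nDNA/bamfile.bam
sam=/data/bismark2/interesting_reads.sam
# Step 1: keep only the reads whose IDs appear in the first column of the TSV
samtools view -@ 2 "$bam" | grep -f <(awk -F$'\t' '{print $1}' "$1") > "$sam"
# Step 2: append the position and the wh_genome tag to each kept read
awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {print $0, st_array[$1], "wh_genome"}' "$sam" > /data/bismark2/export_reads_mapped.tsv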
I'm proposing this schema because I feel like grep is faster than awk for doing the search, but the same result can be obtained with awk alone:
samtools view -# 2 "$bam" | awk -v st="$1" 'BEGIN {OFS="\t"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "wh_genome"}}'
In this case, you only need to add a conditional to identify the interesting reads and get rid of the grep.
In any case, you don't need to re-read the file more than once or to decompress it before working with it.
I would like to write a bash script that writes the current CPU utilization to a file "logfile". I am using an Intel® Core™ i7-4500U CPU @ 1.80GHz × 4 and Ubuntu 15.10.
I have seen similar questions asked in this forum already; however, not all of my questions were answered completely. Through my research I came up with two possible ways of achieving my goal. The first one is
mpstat | grep "all" | awk '{ print $3 + $5; }' >> logfile
(adding user CPU and system CPU) and my second candidate is
mpstat | grep "all" | awk '{ print 100 - $12; }' >> logfile
(100 - %idle CPU). Which one of those two is the right one for me if I am interested in the total CPU utilization (so all components that count in some form as CPU usage should be included)?
Another question: from what I have learned by reading other threads, I think my second candidate
mpstat | grep "all" | awk '{ print 100 - $12; }' >> logfile
should be quite accurate. However, when I open the "System Monitor" and watch the "CPU History", I observe significantly different CPU utilization. Another thing is that the values in the System Monitor are very dynamic (CPU varies between 4% and 18%), whereas over the same period the output of the second command remains almost constant. Does someone have an explanation for that?
Many thanks for all comments!
This happens because mpstat's first line shows an average value calculated since the system booted (which will be much more "stable" - will tend to change less and less as time goes by ).
Quote from mpstat man page:
The interval parameter specifies the amount of time in seconds
between each report. A value of 0 (or no parameters at all)
indicates that processors statistics are to be reported for the time
since system startup (boot).
If you add an interval parameter, you will start to get back live numbers, which should more closely match your System Monitor output (try executing mpstat 1 vs. the plain mpstat).
Therefore, this Bash line should do the trick:
mpstat 1 1 | grep "all" | awk '{ print 100 - $NF; exit; }' >> logfile
and, to do it without grep (saving the extra process spawn):
mpstat 1 1 | awk '/all/{ print 100 - $NF; exit; }' >> logfile
(changed $12 to $NF for the case when the first line has a time and shifts the arguments over; with $NF we consistently get the last value, which is the idle value)
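If the goal is to keep appending samples to the logfile over time, a simple wrapper loop around that line would do; this is just a sketch, and the 60-second interval is an illustration, not something from the question:
#!/bin/bash
# Append one CPU-utilization sample per minute to the logfile
while true; do
    mpstat 1 1 | awk '/all/{ print 100 - $NF; exit; }' >> logfile
    sleep 60
done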