value of total physical memory in linux OS - bash

I have a follow-up question to this question. Is the value of the total physical memory always shown in KB? I would like to print it in GB, and I use this command:
grep MemTotal /proc/meminfo | awk '{$2=$2/(1024^2); print $2}'
I'm not sure whether I should add an if statement to check that grep MemTotal /proc/meminfo reports the value in KB rather than some other unit.
Any help would be appreciated

You don't need grep + awk; a single awk can do this. For the sake of explanation, I have folded your grep into the awk code itself. The awk program checks that the 1st field is MemTotal: and the 3rd field is kB, then prints the 2nd field's value in GB (the conversion is taken from the OP's own code).
awk '$1=="MemTotal:" && $3=="kB"{print $2/(1024^2)}' /proc/meminfo
Or, if you want the kB match on the 3rd field to be case-insensitive, try the following code:
awk '$1=="MemTotal:" && $3~/^[kK][bB]$/{print $2/(1024^2)}' /proc/meminfo

Is the value of the total physical memory always shown in KB?
Yes, the unit kB is fixed in the kernel code. See: 1 and 2

If you assume the MemTotal: entry is always the first line of /proc/meminfo, you can get the gigabyte value without spawning a sub-shell or external commands, using only POSIX-shell grammar that works with ksh, ash, dash, zsh, or bash:
#!/usr/bin/env sh
IFS=': ' read -r _ memTotal _ < /proc/meminfo;
printf 'Total RAM: %d GB\n' "$((memTotal / 1048576))"
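If you would rather not assume that MemTotal: is the first line, the same read-based idea extends to a loop. This is only a minimal sketch: the function name mem_total_gib and its optional file argument are made up for illustration, and it still relies on the kB unit discussed above.

```shell
#!/usr/bin/env sh
# mem_total_gib [FILE]: print MemTotal in whole GiB from a meminfo-style file.
# The name and the FILE argument are illustrative, not part of any standard tool.
mem_total_gib() {
    file=${1:-/proc/meminfo}
    while IFS=': ' read -r key value unit; do
        if [ "$key" = "MemTotal" ] && [ "$unit" = "kB" ]; then
            # kB -> GiB: divide by 1024^2 (integer division truncates)
            echo "$((value / 1048576))"
            return 0
        fi
    done < "$file"
    # No MemTotal line with a kB unit was found
    return 1
}
```

Shell arithmetic is integer-only, so the result is truncated; use awk as in the other answer if you need a fractional value.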


Efficient search pattern in large CSV file

I recently asked how to use awk to filter and output based on a searched pattern. I received some very useful answers; the one by user @anubhava was the one I found most straightforward and elegant. For the sake of clarity I am going to repeat some information from the original question.
I have a large CSV file (around 5GB). I need to identify 30 categories (in the action_type column) and create a separate file with only the rows matching each category.
My input file dataset.csv is something like this:
action,action_type, Result
up,1,stringA
down,1,stringB
left,2,stringC
I am using the following to get the results I want (again, this is thanks to @anubhava).
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
This works as expected, but I have found it quite slow. It has been running for 14 hours now and, based on the size of the output files compared to the original file, it is not even at 20% of the whole process.
I am running this on Windows 10 with an AMD Ryzen PRO 3500 200MHz, 4 Cores, 8 Logical Processors, 16GB of memory and an SSD drive. I am using GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.1.0, GNU MP 6.2.0). My CPU is currently at 30% and memory at 51%. I am running awk inside a Cygwin64 Terminal.
I would love to hear some suggestions on how to improve the speed. As far as I can see it is not a capacity problem. Could it be the fact that this is running inside Cygwin? Is there an alternative solution? I was thinking about Silver Searcher but could not quite work out how to do the same thing awk is doing for me.
As always, I appreciate any advice.
With sorting:
awk -F, 'NR > 1{if(!seen[$2]++ && fn) close(fn); fn = $2 "_dataset.csv"; print >> fn}' < <(sort -t, -nk2 dataset.csv)
Or with gawk (which puts no limit on the number of open file descriptors):
gawk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn;}' dataset.csv
This is the right way to do it using any awk:
$ tail -n +2 file | sort -t, -k2,2n |
awk -F, '$2!=p{close(out); out=$2"_dataset.csv"; p=$2} {print > out}'
The reason I say this is the right approach is it doesn't rely on the 2nd field of the header line coming before the data values when sorted, doesn't require awk to test NR > 1 for every line of input, doesn't need an array to store $2s or any other values, and only keeps 1 output file open at a time (the more files open at once the slower any awk will run, especially gawk once you get past the limit of open files supported by other awks as gawk then has to start opening/closing the files in the background as needed). It also doesn't require you to empty existing output files before you run it, it will do that automatically, and it only does string concatenation to create the output file once per output file, not once per line.
Just like the currently accepted answer, the sort above could reorder the input lines that have the same $2 value - add -s if that's undesirable and you have GNU sort, with other sorts you need to replace the tail with a different awk command and add another sort arg.
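To try the pipeline above on the sample data from the question, here is a small sketch. The wrapper name split_by_type is hypothetical, and the explicit out != "" guard is an addition so that close() is never called before a file has been opened.

```shell
# split_by_type FILE: write rows to <action_type>_dataset.csv in the current directory.
# Hypothetical wrapper around the tail | sort | awk pipeline from the answer above.
split_by_type() {
    tail -n +2 "$1" |          # drop the header line
    sort -t, -k2,2n |          # group rows by the 2nd field
    awk -F, '
        $2 != p {                          # 2nd field changed: switch output file
            if (out != "") close(out)      # keep only one output file open
            out = $2 "_dataset.csv"
            p = $2
        }
        { print > out }                    # ">" truncates on first open, then appends
    '
}

# Usage (in the directory where the output files should go):
#   split_by_type dataset.csv
```

Because each output file is opened exactly once, print > out is enough and stale files from a previous run are overwritten.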

How do I trim whitespace, but not newlines, from subshell output in bash?

There are many tens, maybe a hundred or more previous questions that seem "identical" to this already here, but after extensive search, I found NOTHING that even came close to working - though I did learn quite a lot - and so I decided to just RTFM and figure this out on my own.
The Problem
I wanted to search the output of a ps auxwww command to find processes of interest, and the issue was that I can't just simply use cut to find the exact data from them that I wanted. ps, it turns out, tries to columnate the output, adding either extra spaces or tabs that get in the way of using cut to get the correct data.
So, since I'm not a master at bash, I did a search... The answers I found were all focused on either variables - a "backup strategy" from my point of view that itself didn't solve the whole problem - or they only trimmed leading or trailing space or all "whitespace" including newlines. NOPE, Won't Work For Cut! And, neither will removing trailing newlines and so forth.
So, restated, the question is, how do we efficiently end up with the white space defined as simply a single space between other characters without eliminating newlines?
Below, I will give my answer, but I welcome others to give theirs - who knows, maybe someone has a better answer?!
Answer:
At least MY answer - please leave your own, too! - was to do this:
ps auxwww | grep <program> | tr -s [:blank:] | cut -d ' ' -f <field_of_interest>
This worked great!
Obviously, there are many ways to adapt this to other needs.
As an alternative to all of the pipes and grep with cut, you could simply use awk. The benefit of using awk with the default field-separator (FS) set to break on whitespace is that it treats any amount of whitespace between fields as a single separator.
So using awk does away with the need for tr -s to "squeeze" whitespace to define fields. Further, awk gives far greater control over field matching using regular expressions, rather than having to rely on grep of a full line and cut to locate a predetermined field number (though to some extent you will still have to tell awk which field of the ps output you are interested in).
Using bash, you can also eliminate the pipe | by using process substitution to send the output of ps auxwww to awk on stdin using redirection, e.g. awk ... < <(ps auxwww) for a single tidy command line.
To get your "program" and "field_of_interest" into awk you have two options. You can initialize awk variables using the -v var=value option (multiple -v options can be given), or you can use the BEGIN rule to initialize the variables. The only difference is that with -v you can provide a shell variable for value and no whitespace is allowed around the = sign, while within BEGIN any whitespace is ignored.
So in your case a couple of examples to get the virtual memory size for firefox processes, you could use:
awk -v prog="firefox" -v fnum="5" '
$11 ~ prog {print $fnum}
' < <(ps auxwww)
(above if you had myprog=firefox as a shell variable, you could use -v prog="$myprog" to initialize the prog variable for awk)
or using the BEGIN rule, you could do:
awk 'BEGIN {prog = "firefox"; fnum = "5"}
$11 ~ prog {print $fnum }
' < <(ps auxwww)
In each command above, it locates the COMMAND field from ps (field 11) and checks whether it contains firefox and if so it outputs field no. 5 the virtual memory size used by each process.
Both work fine as one-liners as well, e.g.
awk -v prog="firefox" -v fnum="5" '$11 ~ prog {print $fnum}' < <(ps auxwww)
Don't get me wrong, the pipeline is perfectly fine, it will just be slow. For short commands with limited output there won't be much difference, but when the output is large, awk will provide orders of magnitude improvement over having to tr and grep and cut reading over the same records three times.
The reason is that each pipe, and the command on either side of it, requires a separate process to be spawned by the shell. So minimizing their use improves the efficiency of what your script is doing. If the data is small, as are the processes, there isn't much of a difference. However, if you are reading a 3G file 3 times over, that is the difference in orders of magnitude: hours versus minutes or seconds.
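To compare the two approaches side by side without depending on what ps prints on your machine, here is a small sketch. The sample text is made up to look like ps auxwww output, and the function names sample_ps, vsz_pipeline and vsz_awk are hypothetical.

```shell
# Made-up stand-in for `ps auxwww` output, so the example runs anywhere.
sample_ps() {
    cat <<'EOF'
USER   PID %CPU %MEM    VSZ   RSS TTY STAT START TIME COMMAND
alice 4242  0.5  1.2 123456  7890 ?   Sl   10:00 0:01 firefox
bob     17  0.0  0.1   2048   512 ?   S    09:00 0:00 bash
EOF
}

# Pipeline version: grep, tr and cut each run as a separate process.
vsz_pipeline() {
    grep "$1" | tr -s '[:blank:]' ' ' | cut -d ' ' -f 5
}

# awk version: one process; the default FS treats runs of blanks as one separator.
vsz_awk() {
    awk -v prog="$1" -v fnum=5 '$11 ~ prog { print $fnum }'
}

sample_ps | vsz_pipeline firefox
sample_ps | vsz_awk firefox
```

Both calls print the VSZ column (field 5) of the firefox line. On real ps output, replace sample_ps with ps auxwww.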
I had to use single quotes on CentOS Linux to get tr working as described above:
ps -o ppid= $$ | tr -d '[:space:]'
You can reduce the number of pipes using this Perl one-liner, which uses Perl regexes instead of a separate grep process. This combines grep, tr and cut in a single command, with an easy way to manipulate the output (@F is the array of fields, 0-indexed):
Examples:
# Start an example process to provide the input for `ps` in the next commands:
/Applications/Emacs.app/Contents/MacOS/Emacs-x86_64-10_14 --geometry 109x65 /tmp/foo &
# Print single space-delimited output of `ps` for all emacs processes:
ps auxwww | perl -lane 'print "@F" if $F[10] =~ /emacs/i'
# Prints:
# bar 72144 0.0 0.5 4610272 82320 s006 SN 11:15AM 0:01.31 /Applications/Emacs.app/Contents/MacOS/Emacs-x86_64-10_14 --geometry 109x65 /tmp/foo
# Print emacs PID and file name opened with emacs:
ps auxwww | perl -lane 'print join "\t", @F[1, -1] if $F[10] =~ /emacs/i'
# Prints:
# 72144 /tmp/foo
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)

Get total CPU usage from the terminal on a Mac? [duplicate]

I've seen the same question asked for Linux and Windows but not for the Mac (terminal). Can anyone tell me how to get the current processor utilization in %, so that an example output would be 40%? Thanks
This works on a Mac (includes the %):
ps -A -o %cpu | awk '{s+=$1} END {print s "%"}'
To break this down a bit:
ps is the process status tool. Most *nix like operating systems support it. There are a few flags we want to pass to it:
-A means all processes, not just the ones running as you.
-o lets us specify the output we want. In this case, all we want is the %cpu column of ps's output.
This will get us a list of every process's CPU usage, like
0.0
1.3
27.0
0.0
We now need to add up this list to get a final number, so we pipe ps's output to awk. awk is a pretty powerful tool for parsing and operating on text. We just simply add up the numbers, then print out the result, and add a "%" on the end.
Adding up all those CPU % can give a number > 100% (probably multiple cores).
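If you want a figure that stays between 0 and 100%, one option is to divide the sum by the number of logical cores. This is a rough sketch, assuming you pass the core count in yourself (on macOS, sysctl -n hw.ncpu should report it); the function name cpu_percent_per_core is made up.

```shell
# cpu_percent_per_core NCPU: read a %cpu column on stdin, print the sum divided by NCPU.
# Hypothetical helper; the header line ("%CPU") counts as 0 in awk arithmetic.
cpu_percent_per_core() {
    awk -v n="$1" '
        { s += $1 }
        END { if (n > 0) printf "%.1f%%\n", s / n }   # skip output if the core count is missing
    '
}

# Usage on a Mac (assumption: hw.ncpu holds the logical core count):
#   ps -A -o %cpu | cpu_percent_per_core "$(sysctl -n hw.ncpu)"
```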
Here's a simpler method, although it comes with some problems:
top -l 2 | grep -E "^CPU"
This gives 2 samples, the first of which is nonsense (because it calculates CPU load between samples).
Also, you need to use RegEx like (\d+\.\d*)% or some string functions to extract values, and add "user" and "sys" values to get the total.
(From How to get CPU utilisation, RAM utilisation in MAC from commandline)
Building on previous answers from @Jon R. and @Rounak D, the following line prints the sum of the user and system values, with the percent sign appended. I have tested this value and I like that it tracks roughly with the percentages shown in the macOS Activity Monitor.
top -l 2 | grep -E "^CPU" | tail -1 | awk '{ print $3 + $5"%" }'
You can then capture that value in a variable in script like this:
cpu_percent=$(top -l 2 | grep -E "^CPU" | tail -1 | awk '{ print $3 + $5"%" }')
PS: You might also be interested in the output of uptime, which shows system load.
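To check the awk field arithmetic without running top, you can feed it a line in the shape top prints. The sample line below is made up, and the function name sum_user_sys is hypothetical.

```shell
# sum_user_sys: read top-style output on stdin and print user + sys for each CPU line.
# awk uses the leading numeric part of fields such as "5.26%", so no stripping is needed.
sum_user_sys() {
    awk '/^CPU/ { print $3 + $5 "%" }'
}

# Made-up sample line in the "CPU usage: X% user, Y% sys, Z% idle" layout:
echo 'CPU usage: 5.26% user, 10.52% sys, 84.22% idle' | sum_user_sys
# prints 15.78%
```

With real data, top -l 2 | sum_user_sys | tail -1 would keep only the second, meaningful sample.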
Building upon @Jon R's answer, we can pick up the user CPU utilization through some simple pattern matching:
top -l 1 | grep -E "^CPU" | grep -Eo '[^[:space:]]+%' | head -1
And if you want to get rid of the last % symbol as well,
top -l 1 | grep -E "^CPU" | grep -Eo '[^[:space:]]+%' | head -1 | sed 's/%//'
top -F -R -o cpu
-F Do not calculate statistics on shared libraries, also known as frameworks.
-R Do not traverse and report the memory object map for each process.
-o cpu Order by CPU usage
You can do this.
printf "$(ps axo %cpu | awk '{ sum+=$1 } END { printf "%.1f\n", sum }' | tail -n 1),"

Unix(AIX): tail: unable to malloc memory Error, when using pipes

My requirement is to chop off the header and trailer records from a large file, I'm using a file of size 2.5GB with 1.8 million records. For doing so, I'm executing:
head -n $((count-1)) largeFile | tail -n $((count-2)) > outputFile
Whenever I select count>=725,000 records (size=1,063,577,322), the prompt is returning an error:
tail:unable to malloc memory
I assumed that the pipe buffer went full and tried:
head -n 1000000 largeFile | tail -n 720000 > outputFile
which should also fail since I'm passing count > 725000 to head, but it generated the output.
Why is this so? Since head generates the same amount of data (or more), both commands should fail, yet the failure depends on the count passed to tail. Isn't it the case that head first writes into the pipe and then tail uses the pipe as input? If not, how is parallelism supported here, since tail works from the end, which is not known until head completes execution? Please correct me; I've assumed a lot of things here.
PS: For the time being I've used grep to remove header and trailer. Also, ulimit on my machine returns:
pipe (512 byte) 64 {32 KB}
Thanks guys...
Just do this instead:
awk 'NR>2{print prev} {prev=$0}' largeFile > outputFile
it'll only store 1 line in memory at a time so no need to worry about memory issues.
Here's the result:
$ seq 5 | awk 'NR>2{print prev} {prev=$0}'
2
3
4
I did not test this with a large file, but it will avoid a pipe.
sed '1d;$d' largeFile > outputFile
Ed Morton and Walter A have already given workable alternatives; I'll take a stab at explaining why the original is failing. It's because of the way tail works: tail will read from the file (or pipe), starting at the beginning. It stores the last lines seen, and then when it reaches the end of the file, it outputs the stored lines. That means that when you use tail -n 725000, it needs to store the last 725,000 lines in memory, so it can print them when it reaches the end of the file. If 725,000 lines (most of a 2.5GB file) won't fit in memory, you get a malloc ("memory allocate") error.
Solution: use a process that doesn't have to buffer most of the file before outputting it, as both Ed and Walter's solutions do. As a bonus, they both trim the first line in the same process.
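Both streaming alternatives can be checked on a tiny input before running them on the 2.5GB file. This is a minimal sketch; the function names are made up for illustration.

```shell
# Drop the first and last line of the input while holding only one line in memory.
strip_first_last_awk() {
    awk 'NR > 2 { print prev } { prev = $0 }' "$@"
}

# Same result with sed: delete line 1 and the last line ($).
strip_first_last_sed() {
    sed '1d;$d' "$@"
}

seq 5 | strip_first_last_awk    # prints 2, 3 and 4 on separate lines
seq 5 | strip_first_last_sed    # same output
```

Neither command needs to know the line count in advance, so the head | tail arithmetic from the question goes away as well.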

Performant way of displaying the number of unique column entries in a set of files?

I'm attempting to pipe a large number of files into a sequence of commands that displays the number of unique entries in a given column of those files. I'm inexperienced with the shell, but after a short while I was able to come up with this:
awk '{print $5 }' | sort | uniq | wc -l
This sequence of commands works fine for a small number of files, but takes an unacceptable amount of time to execute on my target set. Is there a set of commands that can accomplish this more efficiently?
You can count unique occurrences of values in the fifth field in a single pass with awk:
awk '{if (!seen[$5]++) ++ctr} END {print ctr}'
This creates an array of the values in the fifth field and increments the ctr variable if the value has never been seen before. The END rule prints the value of the counter.
With GNU awk, you can alternatively just check the length of the associative array in the end:
awk '{seen[$5]++} END {print length(seen)}'
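To convince yourself that the single-pass version agrees with the original sort | uniq | wc -l pipeline, you can run both on a small sample. This is only a sketch: the function names are hypothetical, and the ctr + 0 is an addition so that empty input prints 0 instead of an empty line.

```shell
# Original approach: extract field 5, sort, de-duplicate, count lines.
count_unique_sorted() {
    awk '{ print $5 }' "$@" | sort | uniq | wc -l | tr -d ' '
}

# Single pass: count a value only the first time it is seen.
count_unique_awk() {
    awk '!seen[$5]++ { ctr++ } END { print ctr + 0 }' "$@"
}

# Made-up sample with two distinct values in field 5:
printf 'a b c d x\na b c d y\na b c d x\n' | count_unique_awk    # prints 2
```

The awk version keeps one array entry per distinct value, so its memory use grows with the number of unique entries rather than with the number of input lines.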
Benjamin has supplied the good oil, but depending on just how much data is to be stored in the array, it may pay to pass the data to wc anyway:
awk '!_[$5]++' file | wc -l
The shortest and fastest I could come up with using awk, though it is not far from the previous version by @BenjaminW. I think it is a bit faster (the difference would only matter on a very large file) because the test is made earlier in the process:
awk '!E[$5]++{c++}END{print c}' YourFile
It works with all awk versions.
GNU datamash has a countunique operation for columns:
datamash -W countunique 5
