How to extract some fields from the real-time output of a command in a Bash script

I want to extract some fields from the output of the command xentop. It is like the top command: it provides an ongoing, real-time view of CPU usage, memory usage, and so on.
If I run the command in batch mode and save its output to a file, it looks like this:
NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0 -----r 13700 33.0 7127040 85.9 no limit n/a 8 0 0 0 0 0 0 0 0 0 0
fed18 -----r 738 190.6 1052640 12.7 1052672 12.7 3 1 259919 8265 1 0 82432 22750 2740966 1071672 0
and running this
cat file | tr '\r' '\n' | sed 's/[0-9][;][0-9][0-9][a-zA-Z]/ /g' | col -bx | awk '{print $1,$4,$6}'
on this file gives me what I want
NAME CPU(%) MEM(%)
Domain-0 33.0 85.9
fed18 190.6 12.7
but my script doesn't work on the real-time output of xentop. I even tried to run xentop just once by setting the iteration option to 1 (xentop -i 1), but it does not work!
How can I feed the output of xentop to my script as non-real-time output?

xentop may not be sending any output to the standard output stream; there are several ways of drawing on the screen without using stdout. A quick Google search didn't provide much information about how it works internally.

I use xentop version 1.0 on XenServer 7.0:
[root@xen] xentop -V
xentop 1.0
[root@xen] cat /etc/centos-release
XenServer release 7.0.0-125380c (xenenterprise)
If you want to save the xentop output, you can do it with the '-b' (batch mode) and '-i' (number of iterations before exiting) options:
[root@xen] xentop -b -i 1
NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0 -----r 132130 0.0 4194304 1.6 4194304 1.6 16 0 0 0 0 0 0 0 0 0 0
MY_VM --b--- 5652 0.0 16777208 6.3 16915456 6.3 4 0 0 0 1 - - - - - 0
[root@xen] xentop -b -i 1 > output.txt
[root@xen] cat output.txt
NAME STATE CPU(sec) CPU(%) MEM(k) MEM(%) MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS VBD_OO VBD_RD VBD_WR VBD_RSECT VBD_WSECT SSID
Domain-0 -----r 132130 0.0 4194304 1.6 4194304 1.6 16 0 0 0 0 0 0 0 0 0 0
MY_VM --b--- 5652 0.0 16777208 6.3 16915456 6.3 4 0 0 0 1 - - - - - 0
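Putting the two pieces together: in batch mode the output should be plain text with no terminal escape sequences, so the tr/sed/col stages are no longer needed. A minimal sketch, assuming the same column layout as in the question (NAME in column 1, CPU(%) in column 4, MEM(%) in column 6):
# run one batch-mode iteration of xentop and keep only NAME, CPU(%) and MEM(%)
xentop -b -i 1 | awk '{print $1, $4, $6}'
The first printed line is the header row (NAME CPU(%) MEM(%)), followed by one line per domain, as in the desired output above.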

Related

Splitting a large file containing multiple molecules

I have a file that contains 10,000 molecules. Each molecule ends with the keyword $$$$. I want to split the main file into 10,000 separate files so that each file contains only one molecule. Each molecule has a different number of lines. I have tried sed on test_file.txt as:
sed '/$$$$/q' test_file.txt > out.txt
input:
$ cat test_file.txt
ashu
vishu
jyoti
$$$$
Jatin
Vishal
Shivani
$$$$
output:
$ cat out.txt
ashu
vishu
jyoti
$$$$
I can loop through the whole main file to create 10,000 separate files, but how do I delete from the main file the last molecule that was just moved to a new file? Or please suggest a better method, which I believe exists. Thanks.
Edit1:
$ cat short_library.sdf
untitled.cdx
csChFnd80/09142214492D
31 34 0 0 0 0 0 0 0 0999 V2000
8.4660 6.2927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.4660 4.8927 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2124 2.0951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4249 2.7951 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
30 31 1 0 0 0 0
31 26 1 0 0 0 0
M END
> <Mol_ID> (1)
1
> <Formula> (1)
C22H24ClFN4O3
> <URL> (1)
http://www.selleckchem.com/products/Gefitinib.html
$$$$
Dimesna.cdx
csChFnd80/09142214492D
16 13 0 0 0 0 0 0 0 0999 V2000
2.4249 1.4000 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
3.6415 2.1024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8540 1.4024 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4904 1.7512 0.0000 Na 0 3 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
1 14 2 0 0 0 0
M END
> <Mol_ID> (2)
2
> <Formula> (2)
C4H8Na2O6S4
> <URL> (2)
http://www.selleckchem.com/products/Dimesna.html
$$$$
Here's a simple solution with standard awk:
LANG=C awk '
    { mol = (mol == "" ? $0 : mol "\n" $0) }
    /^\$\$\$\$\r?$/ {
        outFile = "molecule" ++fn ".sdf"
        print mol > outFile
        close(outFile)
        mol = ""
    }
' input.sdf
If you have csplit from GNU coreutils:
csplit -s -z -n5 -fmolecule test_file.txt '/^$$$$$/+1' '{*}'
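Here -s suppresses csplit's byte-count output, -z removes empty pieces, -n5 uses five-digit suffixes, and -fmolecule sets the file name prefix, so the pieces come out as molecule00000, molecule00001, and so on. A quick hedged sanity check, assuming every molecule (including the last) ends with a $$$$ line:
grep -c '^\$\$\$\$' test_file.txt    # number of molecules in the input
ls molecule* | wc -l                 # should print the same number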
This will do the whole job directly in bash:
molsplit.sh
#!/bin/bash
filenum=0
end=1
while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    if [[ "$line" = '$$$$' ]]; then
        end=1
        exec 3>&-
    fi
done
Input is read from stdin, though that would be easy enough to change. Something like this:
./molsplit.sh < test_file.txt
ADDENDUM
From subsequent commentary, it seems that the input file being processed has Windows line endings, whereas the processing environment's native line ending format is UNIX-style. In that case, if the line-termination style is to be preserved then we need to modify how the delimiters are recognized. For example, this variation on the above will recognize any line that starts with $$$$ as a molecule delimiter:
#!/bin/bash
filenum=0
end=1
while read -r line; do
    if [[ $end -eq 1 ]]; then
        end=0
        filenum=$((filenum + 1))
        exec 3>"molecule${filenum}.sdf"
    fi
    echo "$line" 1>&3
    case $line in
        '$$$$'*) end=1; exec 3>&-;;
    esac
done
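Alternatively, if the carriage returns do not need to be preserved in the output files, it may be simpler to normalize the line endings once and reuse the original script unchanged. A small sketch, assuming test_file.txt has Windows line endings:
# strip the carriage returns first, then split with the unmodified molsplit.sh
tr -d '\r' < test_file.txt > test_file.unix.txt
./molsplit.sh < test_file.unix.txt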
In the awk variant below, the same statement that sets the current output file name also closes the previous one. close(_)^_ here is the same as close(_)^0, which ensures the filename always increments for the next file, even if the close() action resulted in an error.
If the output file naming scheme allows for leading zeros, then change that bit to close(_)^(_<_), which ALWAYS results in a 1 for any possible string or number, including all forms of zero, the empty string, infinities, and NaNs.
mawk2 'BEGIN {
    getline __ < (_ = "/dev/null")
    ORS = RS = "[$][$][$][$][\r]?" (FS = RS)
    __ *= gsub("[^$\n]+", __, ORS)
} NF {
    print > (_ = "mol" (__ += close(_)^_) ".txt")
}' test_file.txt
The first part, the getline from /dev/null, neither sets $0 or NF nor modifies NR or FNR, but its existence ensures that the first time close(_) is called it doesn't error out.
gcat -n mol12345.txt
1 Shivani
2 jyoti
3 Shivani
4 $$$$
It was reasonably speedy: from a 5.60 MB synthetic test file it created 187,710 files in 11.652 seconds.

Grep rows from top command based on a condition

[xxxxx@xxxx3 ~]$ top
top - 16:29:00 up 197 days, 19:06, 12 users, load average: 19.16, 21.08, 21.58
Tasks: 3668 total, 21 running, 3646 sleeping, 0 stopped, 1 zombie
Cpu(s): 14.1%us, 6.8%sy, 0.0%ni, 79.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264389504k total, 53305000k used, 211084504k free, 859908k buffers
Swap: 134217720k total, 194124k used, 134023596k free, 12854016k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19938 jai_web 20 0 3089m 2.9g 7688 R 100.0 1.1 0:10.26 Engine
19943 jai_web 20 0 3089m 2.9g 7700 R 100.0 1.1 0:10.14 Engine
20147 jai_web 20 0 610m 454m 3556 R 78.4 0.2 0:02.54 java
77169 jai_web 20 0 9414m 1.4g 29m S 21.3 0.6 38:51.69 java
20160 jai_web 20 0 362m 196m 3336 R 16.7 0.1 0:00.54 java
272287 jai_web 20 0 20.1g 2.0g 5784 S 15.1 0.8 165:39.50 java
26597 jai_web 20 0 6371m 134m 3444 S 9.6 0.1 429:41.97 java
From the snippet of top output above, I want to grep the PIDs that belong to the 'java' process and have a TIME+ value greater than 10:00:00,
so I am expecting grep output as below:
77169 jai_web 20 0 9414m 1.4g 29m S 21.3 0.6 **38:51.69** java
272287 jai_web 20 0 20.1g 2.0g 5784 S 15.1 0.8 **165:39.58** java
26597 jai_web 20 0 6371m 134m 3444 S 9.6 0.1 **429:41.97** java
I have tried the following:
top -p "$(pgrep -d ',' java)"
But that doesn't satisfy my condition. Please assist.
I would just do this for a one-time analysis:
$ top -n 1 -b | awk '$NF=="java" && $(NF-1) >= "10:00.00"'
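One caveat: that is a plain string comparison, so a value such as 9:41.97 would also compare greater than "10:00.00". If that matters, a hedged variation that treats TIME+ (assumed to be in top's default minutes:seconds.hundredths format, in the next-to-last field) as a number of minutes:
# keep java rows whose TIME+ minutes part is at least 10
top -b -n 1 | awk '$NF == "java" { split($(NF-1), t, ":"); if (t[1] + 0 >= 10) print }'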
OK, here is what I came up with.
You need to get the output of top, filter only the java lines, then check each line to see whether the TIME+ value is bigger than your limit. Here is what I did:
#!/bin/bash
#
tmpfile="/tmp/top.output"
top -b -o TIME -n 1 | grep java >"$tmpfile"
# filter each line and keep only the ones where TIME+ is bigger than a certain value
limit=10
while read -r line
do
    # take the line and keep only the 11th field, which is the TIME+ value;
    # in that time value, keep only the first number (the minutes)
    timevalue=$(echo "$line" | awk '{print $11}' | cut -d':' -f1)
    # compare timevalue to the limit we set
    if [ "$timevalue" -gt "$limit" ]
    then
        # output the entire line
        echo "$line"
    fi
done <"$tmpfile"
# cleanup
rm -f "$tmpfile"
The trick here is to extract the TIME+ value and keep only its first number (the minutes); the other digits are not significant, as long as the minutes exceed 10.
Someone might know of a way to do it via grep alone, but I doubt it; I have never seen conditionals in grep.

incomplete output variable stored

I'm working on a small script. The script uses a command from an EMC NAS storage array; the main idea is to store the output of one command in a variable and use it in another command.
nameserver="$(nas_server -list -all | awk 'NR == 3 {print $6}')"
serverparam1="$(server_param "$nameserver" -facility NDMP -list)"
echo "$serverparam1"
So this command, nas_server -list -all | awk 'NR == 3 {print $6}', returns "server_3".
The idea is to store the name "server_3" and use it in this other command:
server_param server_3 -facility NDMP -list
The problem with all this is that the printed output is not "server_3"; I only get "ver_3", and I don't know why this is happening.
This is the output of the terminal:
[nasadmin@xxxx ~]$ ./test.sh
: ver_3
: unknown hostver_3
This is the output from server_param:
[nasadmin@xxxx ~]$ server_param server_3 -facility NDMP -list
server_3 :
param_name facility default current configured
maxProtocolVersion NDMP 4 4
scsiReserve NDMP 0 0
DHSMPassthrough NDMP 0 0
CDBFsinfoBufSizeInKB NDMP 1024 1024
noxlt NDMP 0 0
bufsz NDMP 128 128
convDialect NDMP 8859-1 8859-1
concurrentDataStreams NDMP 4 4
includeCkptFs NDMP 1 1
md5 NDMP 1 1
snapTimeout NDMP 5 5
dialect NDMP
forceRecursiveForNonDAR NDMP 0 0
excludeSvtlFs NDMP 1 1
tapeSilveringStr NDMP ts ts
portRange NDMP 1024-65535 1024-65535
snapsure NDMP 0 0
v4OldTapeCompatible NDMP 1 1
[nasadmin@xxxx ~]$ nas_server -list -all
id type acl slot groupID state name
1 1 0 2 0 server_2
2 4 0 3 0 server_3
id acl server mountedfs rootfs name
1 0 1 17 13 TEST_VDM-1
2 0 1 16 14 TEST_VDM-2
Thanks
This worked for me:
nameserver="$(nas_server -list -all | awk 'NR == 5 {print $6}')"
nameserver1="$(echo "$nameserver" | dos2unix)"
serverparam1="$(server_param "$nameserver1" -facility NDMP -list)"
echo "$serverparam1"
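For reference, the truncated ": ver_3" output is the classic symptom of a trailing carriage return in the captured value: the CR sends the cursor back to column 0, so the rest of the line prints over the beginning of the name. If dos2unix is not available, or does not act as a stdin filter on your system, a hedged equivalent using tr:
# strip any carriage return from the captured name before reusing it
nameserver="$(nas_server -list -all | awk 'NR == 5 {print $6}' | tr -d '\r')"
server_param "$nameserver" -facility NDMP -list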

Wildcard symbol with grep -F

I have the following file
0 0
0 0.001
0 0.032
0 0.1241
0 0.2241
0 0.42
0.0142 0
0.0234 0
0.01429 0.01282
0.001 0.224
0.098 0.367
0.129 0
0.123 0.01282
0.149 0.16
0.1345 0.216
0.293 0
0.2439 0.01316
0.2549 0.1316
0.2354 0.5
0.3345 0
0.3456 0.0116
0.3462 0.316
0.3632 0.416
0.429 0
0.42439 0.016
0.4234 0.3
0.5 0
0.5 0.33
0.5 0.5
Notice that the two columns are sorted ascending, first by the first column and then by the second one. The minimum value is 0 and the maximum is 0.5.
I would like to count the number of lines that are:
0 0
and store that number in a file called "0_0". In this case, this file should contain "1".
Then, the same for those that are:
0 0.0*
For example,
0 0.032
And call it "0_0.0" (it should contain "2"), and so on for all combinations, considering only the first decimal digit (0 0.1*, 0 0.2* ... 0.0* 0, 0.0* 0.0* ... 0.5 0.5).
I am using this loop:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i" "$j"" file | wc -l > "$i"_"$j"
done
done
rm 0_0 #this 0_0 output is badly done, the good way is with the next command, which accepts \n
pcregrep -M "0 0\n" file | wc -l > 0_0
The problem is that, for example, the line
0.0142 0
will not be recognized by the iteration "0.0 0", since there are digits after the "0.0". Removing the -F option from grep in order to match all numbers that start with "0.0" will not work either, since the point will then be treated as a wildcard symbol, and therefore, for example, in the iteration "0.1 0" the line
0.0142 0
will be counted, because 0.0142 matches the pattern 0"anything"1.
I hope I am making myself clear!
Is there any way to include a wildcard symbol with grep -F, like in:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
grep -F ""$i"* "$j"*" file | wc -l > "$i"_"$j"
done
done
(Please notice the asterisks after the variables in the grep command).
Thank you!
Don't use shell loops just to manipulate text; that's what the guys who invented shell also invented awk to do. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
It sounds like all you need is:
awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{ for (pair in cnt) {print cnt[pair] > pair; close(pair)} }' file
That will be vastly more efficient than your nested shell loops approach.
Here's what it'll be outputting to the files it creates:
$ awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{for (pair in cnt) print pair "\t" cnt[pair]}' file
0.0_0.3 1
0_0.4 1
0.5_0 1
0.2_0.5 1
0.4_0.3 1
0.0_0 2
0.1_0.0 1
0.3_0 1
0.1_0.1 1
0.1_0.2 1
0.3_0.0 1
0_0 1
0.1_0 1
0.5_0.3 1
0.4_0 1
0.3_0.3 1
0.2_0.0 1
0_0.0 2
0.5_0.5 1
0.3_0.4 1
0.2_0.1 1
0.0_0.0 1
0_0.1 1
0_0.2 1
0.4_0.0 1
0.2_0 1
0.0_0.2 1
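As for the literal question in the title: grep -F treats every character of the pattern as a literal, so there is no way to embed a wildcard in it. If you did want to stay with a grep loop, one hedged workaround is to drop -F and make only the dot literal by putting it inside a bracket expression, then allow trailing digits explicitly:
# [.] matches a literal dot, [0-9]* allows further decimal digits,
# and grep -c counts the matching lines directly, replacing "| wc -l"
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5; do
    for j in 0 0.0 0.1 0.2 0.3 0.4 0.5; do
        grep -c "^${i/./[.]}[0-9]* ${j/./[.]}[0-9]*$" file > "${i}_${j}"
    done
done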

Search for a String in 1000 files and each file size is 1GB

I am working on SunOS (which is slightly brain-dead). Below is the disk throughput for this Solaris machine:
bash-3.00$ iostat -d 1 10
sd0 sd1 sd2 sd3
kps tps serv kps tps serv kps tps serv kps tps serv
0 0 0 551 16 8 553 16 8 554 16 8
0 0 0 701 11 25 0 0 0 1148 17 33
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
Problem Statement
I have around 1000 files and each file is about 1 GB in size. I need to find a string in all these 1000 files and also determine which files contain that particular string. I am working with the Hadoop File System, and all 1000 files are in HDFS.
All 1000 files are under the real-time folder, so if I do the following I will get all 1000 files, and I need to find which of them contain a particular string.
bash-3.00$ hadoop fs -ls /apps/technology/b_dps/real-time
So for the above problem statement, I am using the command below, which finds all the files containing the particular string:
hadoop fs -ls /apps/technology/b_dps/real-time | awk '{print $8}' | while read -r f; do hadoop fs -cat "$f" | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo "$f"; done
So in the above case it finds all the files which contain the string cec7051a1380a47a4497a107fecb84c1. It is working fine for me and I am able to get the names of the files which contain the particular string.
My question is:
The problem with the above command is that it is very, very slow. Is there any way to parallelize the above command, or otherwise make it search the files a lot faster?
Any suggestions will be appreciated.
You could write a simple MapReduce job to achieve this if you want. You don't actually need any reducers though, so the number of reducers would be set to zero. This way you can make use of the parallel processing power of MapReduce and chunk through the files much faster than a serial grep.
Just set up a Mapper that can be configured to search for the string you want. You will probably read in the files using the TextInputFormat, split the line and check for the values you are searching for. You can then write out the name of the current input file for the Mapper that matches.
Update:
To get going on this, you could start with the standard word count example: http://wiki.apache.org/hadoop/WordCount. You can remove the Reducer and just modify the Mapper. It reads the input a line at a time, where the line is contained in the value as a Text object. I don't know what format your data is in, but you could even just convert the Text to a String and run a hardcoded .contains("") against that value to find the string you're searching for (for simplicity, not speed or best practice). You just need to work out which file the Mapper was processing when you get a hit and then write out that file's name.
You can get a hint from the Grep class. It comes with the distribution, in the examples folder.
./bin/hadoop jar hadoop-mapred-examples-0.22.0.jar grep input output regex
For details on the implementation of this class, you can go to the directory src/examples/org/apache/hadoop/examples that comes with the distribution.
So you can do this in your main class:
Job searchjob = new Job(conf);
FileInputFormat.setInputPaths(searchjob, new Path("input directory in hdfs"));
searchjob.setMapperClass(SearchMapper.class);
searchjob.setCombinerClass(LongSumReducer.class);
searchjob.setReducerClass(LongSumReducer.class);
In your SearchMapper.class you can do this:
public void map(K key, Text value,
                OutputCollector<Text, LongWritable> output,
                Reporter reporter) throws IOException {
    // "pattern" is assumed to be a precompiled java.util.regex.Pattern
    // holding the search string
    String text = value.toString();
    Matcher matcher = pattern.matcher(text);
    if (matcher.find()) {
        output.collect(key, value);
    }
}
If you have 1000 files, is there any reason to use a finely-grained parallelization technique? Why not just use xargs or GNU parallel and split the work over the files, instead of splitting the work within a file?
Also, it looks like you are grepping a literal string (not a regex); you can use the -F grep flag to search for string literals, which may speed things up, depending on how grep is implemented/optimized.
I haven't worked with MapReduce specifically, so this post may or may not be on point.
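For what it's worth, here is a hedged sketch of that xargs idea against the same HDFS listing used in the question; it assumes GNU xargs (for the -P flag) and runs eight hadoop fs -cat | grep pipelines at a time:
# list the files, then test each one in parallel; -qF makes grep stop at the
# first literal match, and only the names of matching files are printed
hadoop fs -ls /apps/technology/b_dps/real-time | awk '{print $8}' |
    xargs -P 8 -I{} sh -c \
        'hadoop fs -cat {} | grep -qF cec7051a1380a47a4497a107fecb84c1 && echo {}'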
