grep a string between two patterns multiple instances in a file? - bash

I'm new to bash scripting and I need to make a script that goes through log files of jobs that ran and extracts certain values, such as the memory used and the memory requested, so I can compare the memory used against the memory requested.
As a starting point, I'm simply trying to get a grep command that will pull out a value sitting between two patterns in a file.
The file looks something like this:
20200429:04/29/2020 04:25:32;S;1234567.vpbs3;user=xx group=xxxxxx=_xxx_xxx_xxxx jobname=xx_xxxxxx queue=xxx ctime=1588148732 qtime=1588148732 etime=1588148732 start=1588148732 exec_host=xxx2/1*8 exec_vnode=(xx2:mem=402653184kb:ncpus=8) Resource_List.mem=393216mb Resource_List.ncpus=8 Resource_List.nodect=1 Resource_List.place=free Resource_List.preempt_targets=NONE Resource_List.Qlist=xxxq Resource_List.select=1:mem=393216mb:ncpus=8 Resource_List.walltime=24:00:00 resource_assigned.mem=402653184kb resource_assigned.ncpus=8
The mem values are what I need to extract. There are multiple jobs and dates, so the file goes on with many more blocks like this one, with different dates and numbers.
From going through similar questions online, I've come up with:
egrep -Eo 'Resource_List.mem=.{1,50}' sampleoutput.txt | cut -d "=" -f 2-
and I get multiple lines like this:
393216mb Resource_List.ncpus=8 Resource_List.nodec
and I'm stuck on how to get only that '393216mb', as I've never really used grep or cut much. Any suggestions, even if it's not using grep, would be greatly appreciated!

Use:
grep -o -E 'Resource_List.mem=[^\ ]+|resource_assigned.mem=[^\ ]+'
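For example, run against the sample file from the question and piped through cut to drop the key names (the cut step is an addition here, not part of the answer itself), the sample line above would yield just the two values:
grep -o -E 'Resource_List.mem=[^ ]+|resource_assigned.mem=[^ ]+' sampleoutput.txt | cut -d '=' -f 2
# 393216mb
# 402653184kb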

Very close! The . is a wildcard; you want to match digits instead.
egrep -Eo 'Resource_List.mem=[0-9]*..' sampleoutput.txt

Related

How to use grep/awk/sed to print until a certain character?

I am a complete beginner at shell scripting and I am trying to iterate through a set of JSON files and extract a certain field out of each one. Each JSON file has a "country":"xxx" field. In each JSON file there are 10k occurrences of the field, all with the same country name, so I need only the first occurrence, and I can do that using "-m 1".
I tried to use grep for this but could not figure out how to extract the whole field, including the country name, from each file at the first occurrence.
for FILE in *.json;
do
grep -o -a -m 1 -h -r '"country":"' $FILE;
done
I tried adding another pipe with the pattern below, but it did not work:
| egrep -o '^[^"]+'
Actual Output:
"country":"
"country":"
"country":"
Desired Output:
"country:"romania"
"country:"united kingdom"
"country:"tajikistan"
but I need the whole thing. Any help would be great. Thanks
There is one general answer to the question "I only want the first occurrence", and that answer is:
... | head -n 1
This means: whatever you do, take the head (the first lines); the -n switch lets you say how many lines you want (one in this case).
The same can be done for the last occurrence(s), but then you use tail instead of head (it also takes the -n switch).
After trying many things, I found the pattern I was looking for.
grep -Po '"country":.*?[^\\]",' $FILE | head -n 1;
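Putting that together with the loop from the question gives a minimal sketch of the whole thing (the -P flag requires GNU grep; everything else is taken from the question and the line above):
for FILE in *.json; do
    # print the first "country":"..." field of each file, e.g. "country":"romania",
    grep -Po '"country":.*?[^\\]",' "$FILE" | head -n 1
done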

Is there any faster way to grep billions of mismatch patterns in more than one file?

I wrote a script that calculates all possible mismatch patterns (depending on the case), like the two below (please look at the grep command), and writes them into an output .sh file with billions of lines like this one:
LC_ALL=C grep -ch "AAAAAAAC[A-Z][A-Z][A-Z][A-Z]CGA[A-Z][A-Z]G\|C[A-Z][A-Z]TCG[A-Z][A-Z][A-Z][A-Z]GTTTTTTT" regions_A regions_B
The next step is to execute all these billions of grep lines and write the output.
To run it as fast as I can, I restrict matching to ASCII (all my characters are ASCII) using LC_ALL=C. Moreover, I split the huge grep file into 16 parts and run them separately using 16 threads.
Does anybody know any faster method to grep my patterns?
Any help would be appreciated.
Thank you in advance!
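For reference, the split-into-16-parts-and-run-in-parallel step described above could look roughly like this (a sketch only; the generated-script name, the chunk prefix, and the availability of GNU split and xargs -P are assumptions, not details from the question):
# split the generated file of grep commands into 16 roughly equal chunks, keeping lines intact
split -n l/16 all_greps.sh chunk_
# run the 16 chunks concurrently, one shell per chunk
printf '%s\n' chunk_* | xargs -n 1 -P 16 bash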

Defining a variable using head and cut

This might be an easy question; I'm new to bash and haven't been able to find the solution to my problem.
I'm writing the following script:
for file in `ls *.map`; do
    ID=${file%.map}
    convertf -p ${ID}_par    # this is a program that I use, no problem
    NAME=head -n 1 ${ID}.ind | cut -f1 -d":"    # This step is the problem: I can't seem to get a proper NAME assignment. I just want the first column of the first line of the file ${ID}.ind
It gives me the error
line 5: bad substitution
any help?
Thanks!
There are a couple of issues in your code:
for file in `ls *.map` does not do what you want. It will fail e.g. if any of the filenames contains a space or *, but there's more. See http://mywiki.wooledge.org/BashPitfalls#for_i_in_.24.28ls_.2A.mp3.29 for details.
You should just use for file in *.map instead.
ALL_UPPERCASE names are generally used for system variables and built-in shell variables. Use lowercase for your own names.
That said,
for file in *.map; do
id="${file%.map}"
convertf -p "${id}_par"
name="$(head -n 1 "${id}.ind" | cut -f1 -d":")"
...
looks like it would work. We just use $( cmd ) to capture the output of a command in a string.
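A quick stand-alone illustration of that command substitution (the file name here is made up):
# capture the first colon-separated field of the first line of the file
name="$(head -n 1 sample.ind | cut -f1 -d':')"
echo "first field: $name"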

How do I delete all rows with a blank space in the third column within a file?

So, I have a file containing the results of some calculations I've run over the past few weeks, which I intend to plot. It is basically a bunch of rows with the format "x" "y" "f(x,y)", like this:
1.7 4.7 -460.5338556921
1.7 4.9 -460.5368762353
1.7 5.5
However, some lines, exemplified by the last one, contain a blank in the third column, resulting from failed calculations. I'd still like to plot the viable points, but as there are thousands of points (and therefore rows), that task can't easily be accomplished by hand. I'd like to know how to make a script or program (I'd prefer a shell script, but I'll gladly go along with whatever works) that identifies those lines and deletes them. Does anyone know a way to do it?
awk '$3' <filename>
or better
awk 'NF > 2' <filename>   # safer in case an entry in column 3 happens to be zero
This will do the job!
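With the sample rows from the question, only the complete lines survive:
printf '1.7 4.7 -460.5338556921\n1.7 4.9 -460.5368762353\n1.7 5.5\n' | awk 'NF > 2'
# 1.7 4.7 -460.5338556921
# 1.7 4.9 -460.5368762353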
The simplest form of grep command, which should be understood by just about any grep implementation these days:
grep -v '^[^[:space:]]*[[:space:]]*[^[:space:]]*[[:space:]]*$' <filename>
With grep:
grep ' .* [^ ]' file
or using ERE:
grep -E '\s\S+\s\S' file
I would use:
perl -lanE 'print if @F==3 && /^[\d\s\.+-]+$/' file
It will print only lines:
which contain 3 fields
and contain only numbers, spaces, and the characters . + -
I do not know how you are going to plot. You could take one of the grep or awk solutions above and pipe all valid lines into your plotting application.
When you need to call a program for each set of values, you can skip the invalid lines when you are reading the values:
while read -r x y fxy; do
if [ -n "${fxy}" ]; then
myplotter "$x" "$y" "${fxy}"
fi
done < file

How to use grep to filter out words like food and foot but not like foody or footed

Hey people, so I know what I want to do, which is to use the grep command to filter out words like foot, food, and fool from a dictionary file but still retain words like footed and foodilicous.
so this is the code I have so far
cat /home1/02836/sulstice/dictionary.txt | grep -E foo | grep -vE '^foo'
The cat command is just pulling in the dictionary .txt file, which is just a bunch of words.
For the last command, I feel like there should be something I can add to say: ^foo, then if there is one more character followed by the end of the word, omit that too.
There must be a way to do this with grep; anyone got one?
Thank you
With GNU grep, you can use the -w flag to restrict the match to full words, so:
grep -w foo[[:alpha:]] /home1/02836/sulstice/dictionary.txt
will match full words which consist of foo plus one letter.
Note that there is no need for cat. You can tell grep which file(s) to search in.
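For example, given a small file containing just the words mentioned in the question, one per line, the command keeps only the foo-plus-one-letter words:
printf 'food\nfoot\nfool\nfooted\nfoodilicous\n' | grep -w 'foo[[:alpha:]]'
# food
# foot
# fool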
Assuming that dictionary.txt has a single word per line, you should be able to just use an anchored pattern like ^foo.$
