shell command deal with only one column [closed] - bash

I'm really curious: is there any tool that can assist shell text-processing programs by cutting one column out, feeding it to a text-processing program, and then pasting it back?
For example, I have a file:
3f27,tom,17
6d44,jack,19
139a,jerry,7
I want to change field 2 by removing all of the vowels a, e, i, o, u.
I know there are many ways to work around this problem, but why not face it head on?
I want a tool like:
deal-only -d"," -f2 sed 's/[aeiou]//g'
This is cleaner and more powerful.
So, does anybody know of such a tool, or a similar solution?
If not, I want to create one.
As I said above, I know sed or awk can handle the problem above well.
But when you meet a more complex problem, sed or awk alone cannot save you.
deal-only -d"," -f2 ./ip2country.rb
Here, I want to transform column 2 from an IP address to a country.
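For illustration, here is a minimal sketch of what such a wrapper could look like in plain bash with cut and paste. The name deal_only is made up here, and the sketch assumes a single-character delimiter, a target field that is neither the first nor the last, and a filter that prints exactly one output line per input line:
# deal_only: hypothetical sketch of the wrapper described above
# usage: deal_only DELIM FIELD COMMAND [ARGS...] < input > output
deal_only() {
  local delim=$1 field=$2 tmp
  shift 2
  tmp=$(mktemp) || return 1
  cat > "$tmp"                       # buffer stdin so it can be cut three times
  paste -d"$delim" \
    <(cut -d"$delim" -f1-$((field-1)) "$tmp") \
    <(cut -d"$delim" -f"$field" "$tmp" | "$@") \
    <(cut -d"$delim" -f$((field+1))- "$tmp")
  rm -f "$tmp"
}
# example (with the sample rows in data.txt):
#   deal_only "," 2 sed 's/[aeiou]//g' < data.txt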

Using awk (note: gensub() is a GNU awk extension, so this needs gawk):
# script.awk
BEGIN { FS="," }
{print $1 "," gensub("[aeiou]+", "", "g", $2) "," $3}
Then:
awk -f script.awk < data.txt
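For the sample data above, this prints:
3f27,tm,17
6d44,jck,19
139a,jrry,7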

You may use the coprocess feature of bash (see e.g. here):
#!/bin/bash
coproc stdbuf -oL sed 's/[aeiou]//g'
while IFS="," read -r a b c ; do
  echo "${b}" >&${COPROC[1]}
  read -r -u ${COPROC[0]} b2
  echo "${a},${b2},${c}"
done
Some random notes:
- this is not POSIX
- the standard output of the process which filters the column data *MUST* be line-buffered or unbuffered (this is the stdbuf -oL part -- see the section "Buffering" in the above-mentioned document)
- (AFAIK) the same effect can be achieved by spawning a background process and using I/O redirection
- (AFAIK) two named pipes linked to a single external "resource-heavy" process's input/output should work as well (see the sketch below)
- I am not 100% sure this is the best way, but it does work for me
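To illustrate the named-pipes variant from the notes above, here is a rough sketch (an illustration only, using the same sed filter; the FIFO names are picked with mktemp just for the example):
#!/bin/bash
# sketch of the named-pipe variant: one long-running filter, two FIFOs
in_fifo=$(mktemp -u) out_fifo=$(mktemp -u)
mkfifo "$in_fifo" "$out_fifo"
# the filter's output must stay line-buffered here as well
stdbuf -oL sed 's/[aeiou]//g' < "$in_fifo" > "$out_fifo" &
exec 3>"$in_fifo" 4<"$out_fifo"
while IFS="," read -r a b c ; do
  echo "${b}" >&3        # send field 2 to the filter
  read -r -u 4 b2        # read back the filtered value
  echo "${a},${b2},${c}"
done
exec 3>&- 4<&-           # closing the write end lets sed exit on EOF
rm -f "$in_fifo" "$out_fifo"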
Good luck!

Related

Apply sed command to multiple files with similar names with increasing numbering [closed]

I have 20 text files named as follows:
samp_100.out, samp_200.out, samp_300.out, ... ,samp_2000.out.
The naming is consistent and the numbering increases by 100 up to 2000. I want to write a short script to (1) delete the first line of each file, and (2) apply the following command to each of the files, keeping the name the same but changing the extension to .csv:
sed 's/ \+/,/g' ifile.txt > ofile.csv
I am assuming I need to use a for loop, but I am not sure how to iterate through the file names.
This might work for you (GNU sed and parallel):
parallel "sed '1d;s/ \+/,/g' samp_{}.out > samp_{}.csv" ::: {100..2000..100}
Use GNU sed, GNU parallel and brace expansion to delete the first line and replace each run of one or more spaces with a comma in the desired files, writing the results to new .csv files.
Alternative:
for i in {100..2000..100}
do sed '1d;s/ \+/,/g' samp_${i}.out > samp_${i}.csv
done

How to rewrite a bad shell script to understand how to perform similar tasks? [closed]

So, I wrote a bad shell script (according to several questions, one of which I asked) and now I am wondering which way to go to perform the same, or similar, task(s).
I honestly have no clue about which tool may be best for what I need to achieve and I hope that, by understanding how to rewrite this piece of code, it will be easier to understand which way to go.
There we go:
# read reference file line by line
while read -r linE;
do
# field 2 will be grepped
pSeq=`echo $linE | cut -f2 -d" "`
# field 1 will be used as filename to store the grepped things
fName=`echo $linE | cut -f1 -d" "`
# grep the thing in a very big file
grep -i -B1 -A2 "^"$pSeq a_very_big_file.txt | sed 's/^--$//g' | awk 'NF' > $dir$fName".txt"
# grep the same thing in another very big file and append it to the same file as above
grep -i -B1 -A2 "^"$pSeq another_very_big_file.txt | sed 's/^--$//g' | awk 'NF' >> $dir$fName".txt"
done < reference_file.csv
At this point I am wondering... how can I achieve the same result without using a while loop to read reference_file.csv? What is the best way to go about solving similar problems?
EDIT: when I mention the two very_big_files, I am talking about files > 5 GB.
EDIT II: this is the format of the files:
reference_file.csv:
object pattern
oj1 ptt1
oj2 ptt2
... ...
ojN pttN
a_very_big_file and another_very_big_file:
>head1
ptt1asequenceofcharacters
+
asequenceofcharacters
>head2
ptt1anothersequenceofcharacters
+
anothersequenceofcharacters
>headN
pttNathirdsequenceofcharacters
+
athirdsequenceofcharacters
Basically, I search for each pattern in the two files, then I need to get the line above and the two lines below each match. Of course, not all the lines in the two files match the patterns in reference_file.csv.
Global Maxima
Efficient bash scripts are typically very creative and not something you can achieve by incrementally improving a naive solution.
The most important part of finding efficient solutions is to know your data. Every restriction you can make allows optimizations. Some examples that can make a huge difference:
- The input is sorted or data in different files has the same order.
- The elements in a list are unique.
- One of the files to be processed is way bigger than the others.
- The symbol X never appears in the input or only appears at special places.
- The order of the output does not matter.
When I try to find an efficient solution, my first goal is to make it work without an explicit loop. For this, I need to know the available tools. Then comes the creative part of combining these tools. To me, this is like assembling a jigsaw puzzle without knowing the final picture. A typical mistake here is similar to the XY problem: After you assembled some pieces, you might be fooled into thinking you'd know the final picture and search for a piece Y that does not exist in your toolbox. Frustrated, you implement Y yourself (typically by using a loop) and ruin the solution.
If there is no right piece for your current approach, either use a different approach or give up on bash and use a better scripting/programming language.
Local Maxima
Even though you might not be able to get the best solution by improving a bad solution, you still can improve it. For this you don't need to be very creative if you know some basic anti-patterns and their better alternatives. Here are some typical examples from your script:
Some of these might seem very small, but starting a new process is way more expensive than one might suppose. Inside a loop, the cost of starting a process is multiplied by the number of iterations.
Extract multiple fields from a line
Instead of calling cut for each individual field, use read to read them all at once. Anti-pattern:
while read -r line; do
  field1=$(echo "$line" | cut -f1 -d" ")
  field2=$(echo "$line" | cut -f2 -d" ")
  ...
done < file
Better:
while read -r field1 field2 otherFields; do
  ...
done < file
Combinations of grep, sed, awk
Everything grep (in its basic form) can do, sed can do better. And everything sed can do, awk can do better. If you have a pipe of these tools you can combine them into a single call.
Some examples of (in your case) equivalent commands, one per line:
sed 's/^--$//g' | awk 'NF'
sed '/^--$/d'
grep -vFxe--
grep -i -B1 -A2 "^$pSeq" | sed 's/^--$//g' | awk 'NF'
awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}'
Multiple grep on the same file
You want to read files at most once, especially if they are big. With grep -f you can search multiple patterns in a single run over one file. If you just wanted to get all matches, you would replace your entire loop with
grep -i -B1 -A2 -f <(cut -f2 -d' ' reference_file | sed 's/^/^/') \
a_very_big_file another_very_big_file
But since you have to store different matches in different files ... (see next point)
Know when to give up and switch to another language
Dynamic output files
Your loop generates multiple files. The typical command line utils like cut, grep and so on only generate one output. I know only one standard tool that generates a variable number of output files: split. But that does not filter based on values, but on position. Therefore, a non-loop solution for your problem seems unlikely. However, you can optimize the loop by rewriting it in a different language, e.g. awk.
Loops in awk are faster ...
time awk 'BEGIN{for(i=0;i<1000000;++i) print i}' >/dev/null # takes 0.2s
time for ((i=0;i<1000000;++i)); do echo $i; done >/dev/null # takes 3.3s
seq 1000000 > 1M
time awk '{print}' 1M >/dev/null # takes 0.1s
time while read -r l; do echo "$l"; done <1M >/dev/null # takes 5.4s
... but the main speedup will come from something different. awk has everything you need built into it, so you don't have to start new processes. Also ... (see next point)
Iterate the biggest file
Reduce the number of times you have to read the biggest files. So instead of iterating reference_file and reading both big files over and over, iterate over the big files once while holding reference_file in memory.
Final script
To replace your script, you can try the following awk script. This assumes that ...
- the filenames (first column) in reference_file are unique
- the two big files do not contain > except for the header
- the patterns (second column) in reference_file are not prefixes of each other.
If this is not the case, simply remove the break.
awk -v dir="$dir" '
  # first file (reference_file): remember the filename and pattern of each row
  FNR==NR {max++; file[max]=$1; pat[max]=$2; next}
  # big files: with RS=">" and FS="\n", each record is one entry and $2 is the line starting with the pattern
  {
    for (i=1;i<=max;i++)
      if ($2~"^"pat[i]) {
        # re-add the ">" consumed by the record separator and append the whole record
        printf ">%s", $0 > dir"/"file[i]
        break
      }
  }' reference_file RS=\> FS=\\n a_very_big_file another_very_big_file
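One extra assumption on my part: $dir must be set and the directory must already exist before running the command, since awk will not create it when redirecting output, e.g.:
dir=out_dir
mkdir -p "$dir"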

How can I remove digits from these strings? [closed]

I have a text file containing a few string values:
PIZZA_123
CHEESE_PIZZA_785
CHEESE_PANEER_PIZZA_256
I need to remove the numeric suffixes from these values to get the following output. The tricky part for me is that the numeric values are random every time. I need to remove them and write only the string values to a file.
CHEESE_PIZZA
CHEESE_PANEER_PIZZA
What is an easy way to do this?
sed 's/_[0-9]*$//' file > file2
Will do it.
There's more than one way to do it. For example, since the numbers always seem to be in the last field, we can just cut off the last field with a little help from the rev util. Suppose the input is pizza.txt:
rev pizza.txt | cut -d _ -f 2- | rev
Since this uses two utils and two pipes, it's not more efficient than sed. The sole advantage for students is that regex isn't necessary -- the only text needed is the _ as a field separator.
You can use the script below for this.
#!/bin/bash
V1=PIZZA_123
V2=CHEESE_PIZZA_785
V3=CHEESE_PANEER_PIZZA_256
# digits act as word separators, so the unquoted expansions below drop them
IFS=0123456789
echo $V1>tem.txt
echo $V2>>tem.txt
echo $V3>>tem.txt
echo "here are the values:"
# trim the last three characters of each line (relies on the suffix being exactly three digits)
sed 's/...$//' tem.txt
rm -rf tem.txt

Bash: Grep Line Numbers to Correspond to AWK NR [closed]

I suspect I am going about this the long way, but please bear with me; I am new to Bash, grep and awk ...
The summary of my problem is that line numbers in grep do not correspond to the actual line numbers in the file. From what I gather, empty lines are discarded in the line numbering. I would prefer not to iterate through every line in a file to ensure 100% coverage.
What I am trying to do is grab a segment of lines from a file and process them using grep and awk
The grep call gets a list of line numbers since there could be more than one instance of a 'starting position' in a file:
startLnPOSs=($(cat "$c" | grep -e '^[^#]' | grep --label=root -e '^[[:space:]]start-pattern' -n -T | awk '{print $1}'))
Then using awk I iterate from a starting point until an 'end' token is encountered.
declarations=($(cat "$c" | awk "$startLnPos;/end-pattern/{exit}" ))
To me this looks a bit like an XY problem, as you are showing us what you are doing to solve a problem but not actually outlining the problem itself.
So, at a guess, I am thinking you want to return all the items between the start/end patterns to your array (which may also be wrong, but again we do not know the overall picture).
So what you could do is:
declarations=($(awk '/start-pattern/,/end-pattern/' "$c"))
Or with sed (exactly the same):
declarations=($(sed -n '/start-pattern/,/end-pattern/p' "$c"))
Depending on whether you want those boundary lines included or not, the commands may need to be altered a little (see the sketch below).
Was this the kind of thing you were looking to do?
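For example, to exclude the start and end lines themselves, a rough awk variant (a sketch, using the same placeholder patterns as above) would be:
declarations=($(awk '/start-pattern/{f=1; next} /end-pattern/{f=0} f' "$c"))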

Pipes & xargs => top [closed]

I am trying to use pipes and xargs to start top with a particular pid, but I cannot get it to work and I don't know why:
ps aux|grep ProgramName|awk '{print $2}'|head -n1|xargs top -pid
I get the correct pid printed to screen if I stop after head -n1, and manually adding that pid to the command top -pid XXX also works, but running the whole line as one command just does not return the top screen.
What am I doing wrong here?
EDIT: yes, "-pid" is indeed correct (further checking on the remote shell revealed it is actually a macOS system, not a Linux one)
What am I doing wrong here?
Several things:
- You are using grep and awk in the same pipeline. Since awk does pattern matching, there is no reason to use grep as a separate process.
- You are using awk and head in the same pipeline. Since awk can control the number of items it prints, there is no need to use head.
- Your grep command will find both the indicated program, and the grep program.
- You are using xargs to provide a single command line argument. Either backticks or $() is a better choice.
- top takes a -p switch, not a -pid switch. (At least on my computer.)
Adding it all up, try:
$ top -p $(ps aux | awk '/ProgramName/ && ! /awk/ { print $2; exit; }')
Your problem is:
- the arg to top should be "-p" not "-pid"
- xargs is for running non-interactive programs
Try this:
top -p "$(pgrep ProgramName | head -n 1)"
or
top -p "$(pgrep --oldest ProgramName)"
or
top -p "$(pgrep --newest ProgramName)"
