I have a tab-separated file with the following format:
January Jay RESERVED 4
February Jay RESERVED 5
March Jay SUBMITTED 6
December Jay USED 7
What I would like to do is insert a blank line between groups of lines, wherever the value in the third column changes.
For this example, I would like this output:
January Jay RESERVED 4
February Jay RESERVED 5

March Jay SUBMITTED 6

December Jay USED 7
If your data is in a file called stuff:
lastVal=""; while read -r i; do thisVal=$(echo "$i" | cut -d$'\t' -f3); if [ "$lastVal" != "$thisVal" ]; then echo ""; lastVal=$thisVal; fi; echo "$i"; done < stuff
Here's a version of the same command that you can use as a script. See usage below.
#!/bin/bash
lastVal=""
while read -r i; do
    thisVal=$(echo "$i" | cut -d$'\t' -f3)
    if [ "$lastVal" != "$thisVal" ]; then
        echo ""
        lastVal=$thisVal
    fi
    echo "$i"
done
If you name the script myScript.bash, you can use it one of these two ways:
cat yourfile | /path/to/myScript.bash
or
/path/to/myScript.bash < yourfile
Note that if you want to insert a literal tab at the Bash prompt, you can press Ctrl+V and then hit Tab; Ctrl+V lets you insert other special characters too. So if you'd rather use a literal tab as the delimiter in the cut -d'...' part instead of $'\t', hit Ctrl+V and then Tab (that works in a Linux terminal, not in a browser).
Awk can do this quite handily:
awk -F $'\t' '{print (v==$3 ? $0 : "\n"$0); v=$3}' foo.txt
awk is designed to work with whitespace-separated columns of data, so the third column is available as $3. All we do is check whether the value has changed, and print an extra newline when it has.
This doesn't check for "unique" values, but only a change in the value from the previous line. From what I can tell, that's the same thing as the answer you accepted.
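If the leading blank line bothers you (both the loop answer and this awk print one before the first group, because the comparison variable starts out empty), a small variation skips it:

awk -F'\t' 'NR > 1 && v != $3 { print "" } { print; v = $3 }' foo.txt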
I'm stuck on a simple problem of finding a pattern in a string. I've never been comfortable with sed or regex in general.
I'm trying to get the number in the second column in one variable, and the number in the third column in another variable. The numbers are separated by tabs:
Here's what I have now :
while read line
do
    middle="$(echo "$line" | sed 's/([0-9]+)\t\([0-9]+\)\t([0-9]+)\\.([0-9]+)/\1/')"
    last="$(echo "$line" | sed 's/([0-9]+)\t([0-9]+)\t\([0-9]+)\\.([0-9]+\)/\1/')"
done
Here is the text :
11 1545 0.026666
12 1633 0.025444
13 1597 0.026424
14 1459 0.025634
I know there are simpler tools than sed, so feel free to suggest them in your answer.
Thanks.
This functionality is built into read.
while read first second third more; do
…
done
By default, read splits its input into whitespace-separated fields. Each variable receives one field, except the last one which receives whatever remains on the line. This matches your requirement provided there aren't any empty columns.
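For example, assuming your numbers are in a file called data.txt, a minimal sketch:

while read -r first middle last; do
    echo "middle is $middle, last is $last"
done < data.txt

For the first line of your data this prints: middle is 1545, last is 0.026666.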
Use AWK to save yourself:
while read -r line; do
    middle="$(awk '{print $2}' <<< "$line")"
    last="$(awk '{print $3}' <<< "$line")"
done
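Note that this spawns two awk processes per input line, which gets slow on large files; a single awk call per line, or plain read as in the previous answer, avoids that overhead.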
I am trying to add some headers to a txt file. Actually, I have found a script already, but I want to edit a part of it.
Script (you can also find it here if you like: https://ucdavis-bioinformatics-training.github.io/2017-June-RNA-Seq-Workshop/thursday/counts.html):
for x in 03-alignment/*/*ReadsPerGene.out.tab; do
    s=`basename $x | cut -f1 -d_`
    echo $s
done | paste -s > header.txt
The thing is that I want to paste -s, but starting from the second column rather than the first.
I thought done | awk $2 | paste -s | header.txt could help, but it doesn't.
Any ideas how to add the headers starting directly from the 2nd column, please?
So, using the script above, I get this output:
L004_AQAU-19 L004_AQAU-20 L004_AQAU-21 etc.
ALOMY0G001 0 0
ALOMY0G002 0 10
ALOMY0G003 20 15
ALOMY0G004 4 5 etc.
But I want to get:
L004_AQAU-19 L004_AQAU-20 etc.
ALOMY0G001 0 0
ALOMY0G002 0 10
ALOMY0G003 20 15
ALOMY0G004 4 5
etc.
You can make use of cut to get a range from a tab delimited line of words.
... paste -s | cut -f2- > header.txt
If you want to skip the first column (i.e. leave the first column without a header), just introduce an empty field with a tab at the start:
... paste -s | sed 's/^/\t/' > header.txt
Here I use sed to perform a substitution. You can replace with s/A/B/g, which replaces A with B (the g indicates that this should be done more than once per line). These "patterns" are regular expressions (regex). In a regex, ^ matches the beginning of the line. So here, I replace ^ (the beginning of the line) with \t (a tab). (paste uses tab by default; use e.g. paste -d, -s to join with commas instead.)
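Putting it together with the loop from the question, the whole thing could look like this (an untested sketch; the \t in the replacement text requires GNU sed):

for x in 03-alignment/*/*ReadsPerGene.out.tab; do
    s=$(basename "$x" | cut -f1 -d_)
    echo "$s"
done | paste -s | sed 's/^/\t/' > header.txt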
I apologize in advance if the solution to my problem is very straightforward and obvious, as I'm very new to shell scripting. For a program I'm working on, I need to update the contents of another file that was previously created. For example, say this is one of the files to be updated, student_1.item:
student_1 Sally Johnson
3 9
Mr. Ortiz
I am to create another bash file that asks for the name of the file to be updated, and then prompts the user for the following one at a time:
Student Name:
Student Number:
Grade:
Age:
Teacher:
The user is able to leave any of the above blank, and whatever isn't filled in isn't changed in the original student_1.item file. Whatever is filled out, however, should be changed and updated to whatever the user put in.
I believe that I'd need to understand the concept of environment variables, but I'm a little stuck. Would I first need to read the lines into variables from student_1.item and then export any changed variables back into student_1.item?
Again, my apologies if this is a silly question. Any help is appreciated!
Sounds a bit complicated, but here is an untested solution: it reads the file line by line and assigns the lines to three variables. After that, the variables are parsed to get the individual student values.
# line number
n=0
# read the file line by line
while read -r line; do
    # assign the line to a variable based on its number
    case $n in
        0) firstline=$line ;;
        1) secondline=$line ;;
        2) thirdline=$line ;;
    esac
    # increment the line number
    n=$((n+1))
done < student_1.item
# parse each line variable
student_number=$(echo "$firstline" | cut -d' ' -f1)
student_name=$(echo "$firstline" | cut -d' ' -f2,3)
student_grade=$(echo "$secondline" | cut -d' ' -f1)
student_age=$(echo "$secondline" | cut -d' ' -f2)
student_teacher=$thirdline
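From there, you could prompt for each field and write the file back in its original three-line layout, e.g. (an untested sketch; a field left blank keeps its old value):

# prompt; keep the old value when the user enters nothing
read -rp "Student Name: " input
[ -n "$input" ] && student_name=$input
read -rp "Grade: " input
[ -n "$input" ] && student_grade=$input
# ...same pattern for student number, age and teacher...

# rewrite the file in its original layout
{
    echo "$student_number $student_name"
    echo "$student_grade $student_age"
    echo "$student_teacher"
} > student_1.item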
I have a script that runs a command using a string taken from each line of a file, and appends the command's output to the matching line to produce an updated file.
for CLIENT in `cat /home/"$ID"/file.txt | awk '{print $3}'`
do
    sed "/$CLIENT/ s/$/ $(sudo bpgetconfig -g $CLIENT -L | grep -i "version name")/" /home/"$ID"/file.txt >> /home/"$ID"/updated_file.txt
done
The output prints the entire file once for each line of input, with only the matching line updated.
How do I change it so that only the matching line is sent to the new file?
The input file contains lines similar to below:
"Host OS" "OS Version" "Hostname"
I want to run a script that will use the hostname to run a command and grab details about an application on the host and then print only the application version to the end of the line with the host in it:
"Host OS" "OS Version" "Hostname" "Application Version
What you're doing is very fragile (e.g. it'll break if the string represented by $CLIENT appears on other lines, or multiple times on 1 line, or as a substring, or contains regexp metachars, or...) and inefficient (you're reading file.txt once per iteration of the loop instead of once in total) and employing anti-patterns (e.g. using a for loop to read lines of input, plus the UUOC, plus deprecated backticks, etc.)
Instead, let's say the command you wanted to run was printf '%s' 'the_third_string' | wc -c to replace each third string with the count of its characters. Then you'd do:
while read -r a b c rest; do
printf '%s %s %s %s\n' "$a" "$b" "$(printf '%s' "$c" | wc -c)" "$rest"
done < file
or if you had more to do and so it was worth using awk:
awk '{
cmd = "printf \047%s\047 \047" $3 "\047 | wc -c"
if ( (cmd | getline line) > 0 ) {
$3 = line
}
close(cmd)
print
}' file
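Note the close(cmd) inside the loop: awk keys open pipes by the command string, so without it a command that repeats on a later line would try to read from the already-exhausted pipe, getline would return 0, and $3 would be left unchanged.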
For example given this input (courtesy of Rabbie Burns):
When chapman billies leave the street,
And drouthy neibors, neibors, meet;
As market days are wearing late,
And folk begin to tak the gate,
While we sit bousing at the nappy,
An' getting fou and unco happy,
We think na on the lang Scots miles,
The mosses, waters, slaps and stiles,
That lie between us and our hame,
Where sits our sulky, sullen dame,
Gathering her brows like gathering storm,
Nursing her wrath to keep it warm.
We get:
$ awk '{cmd="printf \047%s\047 \047"$3"\047 | wc -c"; if ( (cmd | getline line) > 0 ) $3=line; close(cmd)} 1' file
When chapman 7 leave the street,
And drouthy 8 neibors, meet;
As market 4 are wearing late,
And folk 5 to tak the gate,
While we 3 bousing at the nappy,
An' getting 3 and unco happy,
We think 2 on the lang Scots miles,
The mosses, 7 slaps and stiles,
That lie 7 us and our hame,
Where sits 3 sulky, sullen dame,
Gathering her 5 like gathering storm,
Nursing her 5 to keep it warm.
The immediate answer is to use sed -n to not print every line by default, and add a p command where you do want to print. But running sed in a loop is nearly always the wrong thing to do.
The following avoids the useless cat, the don't read lines with for antipattern, the obsolescent backticks, and the loop; but without knowledge of what your files look like, it's rather speculative. In particular, does command need to run for every match separately?
file=/home/"$ID"/file.txt
pat=$(awk '{ printf "\\|%s", $3 }' "$file")
sed -n "/${pat#\\|}/ s/$/ $(command)/p" "$file" >> /home/"$ID"/updated_file.txt
The main beef here is collecting all the patterns we want to match into a single regex, and then running sed only once.
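To see what that assembly produces, here is a quick demonstration with hypothetical hostnames in the third field:

printf '%s\n' 'x y host-a' 'x y host-b' 'x y host-c' > file.txt
pat=$(awk '{ printf "\\|%s", $3 }' file.txt)
echo "$pat"          # \|host-a\|host-b\|host-c
echo "${pat#\\|}"    # host-a\|host-b\|host-c

The stripped result serves as the sed address /host-a\|host-b\|host-c/; note that \| alternation in a basic regex is a GNU sed extension.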
If command needs to be run uniquely for each line, this will not work out of the box. Maybe then turn back to a loop after all. If your task is actually to just run a command for each line in the file, try
while read -r line; do
# set -- $line
# client=$3
printf "%s " "$line"
command
done <file >>new_file
I included but commented out commands to extract the third field into $client before you run command.
(Your private variables should not have all-uppercase names; those are reserved for system variables.)
Perhaps in fact this is all you need:
while read -r os osver host; do
printf "%s " "$os" "$osver" "$host"
command "$host" something something
done </home/"$ID"/file.txt >/home/"$ID"/updated_file.txt
This assumes that the output of command is a well-formed single line of output with a final newline.
This might work for you (GNU sed, bash/dash):
echo "command () { expr length \"\$1\"; }" >> funlib
sed -E 's/^((\S+\s){2})(\S+)(.*)/. .\/funlib; echo "\1$(command "\3")\4"/e' file
As an example of a command, I create a function called command and append it to a file funlib in the current directory.
The sed invocation sources funlib and runs the command function in the RHS of the substitution command, within an interpolated string displayed by the echo command; this is made possible by the evaluation flag e.
N.B. The evaluation uses the dash shell or whatever the /bin/sh is symlinked to.
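For instance, on a small file of space-separated fields, each third field gets replaced by its length (a quick demonstration; expected output shown as a comment):

printf '%s\n' 'a bb ccc ddd' > file
echo 'command () { expr length "$1"; }' > funlib
sed -E 's/^((\S+\s){2})(\S+)(.*)/. .\/funlib; echo "\1$(command "\3")\4"/e' file
# a bb 3 ddd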
I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range I want starts at the position given by the value in the second column and spans the next 10 characters. I will give an example below.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File from which I want to extract the variable ranges of characters (just one very long line with no spaces) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range on a new line. So it would always keep a range of 10 characters, but with different start points, and those start points are set by the values in the second column of the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
    p1=$i;
    p2=`expr "$1" + 10`
    cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
    set $line
    p2=`expr "$2" + 10`
    cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
There's no need for cut here; dd can do the job of indexing into a file, and reading only the number of bytes you want. (Note that status=none is a GNUism; on other platforms you may need to leave it out and redirect stderr instead if you want to suppress informational logging.)
while read -r name index _; do
    dd if=file2.txt bs=1 skip="$index" count=10 status=none
    printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
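Here FNR==NR is true only while awk reads the first file named (file2), so a ends up holding the whole sequence; then, for each line of file1, substr(a,$2+1,10) pulls the 10 characters starting just past position $2 (substr is 1-indexed, hence the +1).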
If file2.txt is not too large, then you can read it into memory and use Bash substrings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
    echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to #CharlesDuffy for the tip to read data without a useless cat, and the while loop.)
One way to solve it:
#!/bin/bash
while read line; do
    pos=$(echo "$line" | cut -f2 -d' ')
    x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
    echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very approachable for someone who is new to bash. It uses tools that are very versatile, although a poor fit if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line inside the loop is pretty easy: it just extracts the number from each line of file1.txt. The second line uses the very nice tools head and tail. Usually they are used with lines rather than characters; nevertheless, I print the first pos + 10 characters with head, and the result is piped into tail, which prints the last 10 characters.
Thanks to #CharlesDuffy for improvements.