Suppose I have the following data:
# the values are literally as shown; I want to reshape the data exactly as below
0 a
1 b
2 c
0 d
1 e
2 f
0 g
1 h
2 i
...
And I would like to reshape the data such that it is:
0 a d g ...
1 b e h ...
2 c f i ...
Is this possible using the Unix/bash toolkit, without writing a complex composition of commands?
Yes, I can trivially do this inside a programming language; the idea is NOT to "just" do that. If some cat X.csv | rs [magic options] sort of solution exists (and rs, the reshape utility, would be great, except it isn't working for me here on Debian Stretch), that is what I am looking for.
Otherwise, an answer that involves a composition of commands or a script is out of scope: I already have that, but would rather not use it.
Using GNU datamash:
$ datamash -s -W -g 1 collapse 2 < file
0 a,d,g
1 b,e,h
2 c,f,i
Options:
-s sort
-W use whitespace (spaces or tabs) as delimiters
-g 1 group on the first field
collapse 2 print comma-separated list of values of the second field
To convert the tabs and commas to space characters, pipe the output to tr:
$ datamash -s -W -g 1 collapse 2 < file | tr '\t,' ' '
0 a d g
1 b e h
2 c f i
Bash version:
function reshape {
    local index number key
    declare -A result
    while read -r index number; do
        result[$index]+=" $number"
    done
    for key in "${!result[@]}"; do
        echo "$key${result[$key]}"
    done
}
reshape < input
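Note that bash does not guarantee any particular iteration order for associative-array keys, so the rows may come out unordered. If you want them sorted as in the example (this sort step is an addition, not part of the function above), you could pipe the result through sort:
reshape < input | sort -n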
We just need to make sure the input file is in Unix format (LF line endings).
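If the file might have Windows (CRLF) line endings, one simple way to normalise it first, just as a sketch, is to strip the carriage returns with tr:
$ tr -d '\r' < input > input.unix
$ reshape < input.unix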
Suppose I have a file containing numbers like:
1 4 7
2 5 8
and I want to add 1 to all of these numbers, making the output:
2 5 8
3 6 9
Is there a simple one-line command (e.g. awk) to achieve this?
Try the following:
awk '{for(i=1;i<=NF;i++){$i=$i+1}} 1' Input_file
EDIT: As per the OP's request for a solution without a loop, here is one (written for the shown sample only). With the number of fields hardcoded:
awk -v RS='[ \n]' '{ORS=NR%3==0?"\n":" ";print $0+1}' Input_file
OR
Without hardcoding the number of fields:
awk -v RS='[ \n]' -v col=$(awk 'FNR==1{print NF}' Input_file) '{ORS=NR%col==0?"\n":" ";print $0+1}' Input_file
Explanation: In the first EDIT solution I have hardcoded the number of fields by mentioning 3 there. In the OR solution I instead create a variable named col, set by a small awk call that prints the field count of the very first line of Input_file (only FNR==1 matches, so only that line contributes). Coming to the main code: I set the record separator (RS) to a space or a newline, so every number is its own record and can be incremented without a loop; the output record separator (ORS) is a space after each incremented value, and a newline is printed only when the record number (NR) is exactly divisible by col, which is why col holds the number of fields per line.
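For the shown sample, the inner awk that sets col simply prints the field count of the first line:
$ awk 'FNR==1{print NF}' Input_file
3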
In native bash (no awk or other external tool needed):
#!/usr/bin/env bash
while read -r -a nums; do        # read a line into an array, splitting on spaces
  out=( )                        # initialize an empty output array for that line
  for num in "${nums[@]}"; do    # iterate over the input array...
    out+=( "$(( num + 1 ))" )    # ...and add n+1 to the output array.
  done
  printf '%s\n' "${out[*]}"      # then print that output array with a newline following
done <in.txt >out.txt            # with input from in.txt and output to out.txt
You can do this using GNU awk (RT, used below, is a gawk variable holding the text that matched RS):
awk -v RS="[[:space:]]+" '{$0++; ORS=RT} 1' file
2 5 8
3 6 9
If you don't mind Perl:
perl -pe 's/(\d+)/$1+1/eg' file
Substitute any run of one or more digits (\d+) with that number ($1) plus 1. The /e modifier evaluates the replacement as an expression, and /g applies the substitution to every match on each line, so effectively throughout the file.
As mentioned in the comments, the above only works for positive integers - per the OP's original sample file. If you wanted it to work with negative numbers, decimals and still retain text and spacing, you could go for something like this:
perl -pe 's/([-]?[.0-9]+)/$1+1/eg' file
Input file
Some column headers # words
1 4 7 # a comment
2 5 cat dog # spacing and stray words
+5 0 # plus sign
-7 4 # minus sign
+1000.6 # positive decimal
-21.789 # negative decimal
Output
Some column headers # words
2 5 8 # a comment
3 6 cat dog # spacing and stray words
+6 1 # plus sign
-6 5 # minus sign
+1001.6 # positive decimal
-20.789 # negative decimal
I have a long tab-delimited CSV file, and I am trying to fill empty cells in a column with a value that appears later in that same column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads, like this awk solution, but the problem is that the value I want is not before the empty cell but after it (a kind of .FillUp in Excel).
Additional information:
the input file may have a different number of lines
"A" and "B" in the input file may be at different rows and are not evenly spaced
the second column may hold only two values
the last cell in the second column may have no value
[EDIT] For the last two rows in input.txt, B is known to be the value for the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values (filling down in the reversed file is the same as filling up in the original), and then reverse it again. The -v V=B seeds the fill value for the trailing blank cells of the original file, which come first once it is reversed.
The following lets you process the file in a single pass, as long as you know the first value to fill. It should be quite a bit faster than reversing the file twice:
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes the second column only ever contains 'A', 'B', or blank.
I have a set of 10000 text files (file1.txt, file2.txt, ..., file10000.txt). Each one has a different number of rows. I'd like to know the average number of rows across these 10000 files, excluding the last row of each file. For example:
File1:
a
b
c
d
last
File2:
a
b
c
last
File3:
a
b
c
d
e
last
Here I should obtain 4 as the result. I tried with Python, but it takes too much time to read all the files. How could I do this with a shell script?
Here's one way:
touch file{1..3}.txt
(file1.txt has one line, file2.txt two lines, and so on...)
$ for i in {1..3}; do wc -l file${i}.txt; done | awk '{sum+=$1}END{print sum/NR}'
2
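The one-liner above averages the full line count of each file. If you also need to exclude each file's last line, as the question asks, here is a sketch that assumes GNU awk (ENDFILE is a gawk extension) and that the files match file*.txt:
$ gawk 'ENDFILE { total += FNR - 1; files++ } END { print total / files }' file*.txt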
So I have a series of scripts that generate intermediary text files along the way as a means of storing information across different scripts. Essentially the scripts detect rows within data that have been approved by the user for removal. The line numbers that are to be removed from the source file are stored in a file.
For example, say I have a source data file like this:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a4,b4,c4,d4
a5,b5,c5,d5
a6,b6,c6,d6
a7,b7,c7,d7
And the intermediary file would contain something like this:
1 3 4 5 6
Which would result, when the script is run, in an output data file as follows:
a2,b2,c2,d2
a7,b7,c7,d7
This all works fine; there is nothing to fix in this code. The problem is that when I'm dealing with actual data files, there are sometimes literally thousands of numbers stored in the intermediary file for removal. This means I can't use a loop, because it would take a massive amount of time, and my current method of using sed gets overloaded with an error: too many arguments. Many of the line numbers are consecutive, so here's where I get to my question:
Is there a way in bash or awk to detect whether a series of space-separated numbers are consecutive?
I can sort out everything beyond that; I'm just stumped on how I could do this in one step (or a short series of steps). My plan, if I can detect consecutive values, is to change the intermediary file from:
1 3 4 5 6
To:
1 3-6
And then I'll be able to write code that will run on each range of values in a more manageable way.
If possible I'd like to avoid looping through each value and checking individually whether or not it's one step above the previous value, since I'm dealing with tens of thousands of numbers in a list.
If this is not possible in bash/awk, is there another way to accomplish this task to reduce the overall number of arguments passed to my script and greatly reduce the chances of encountering an error for too many arguments?
What about this?
BEGIN {
    getline < "intermediate.txt"
    split($0, skippedlines, " ")
    skipindex = 1
}
{
    if (skippedlines[skipindex] == NR)
        ++skipindex;
    else
        print
}
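Assuming the program above is saved as skip.awk and the source data is in source.csv (both names chosen here just for illustration), with the line numbers in intermediate.txt as hardcoded in the BEGIN block, you would run it like this:
$ awk -f skip.awk source.csv > output.csv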
Use cat, join, and cut:
Files infile and ids:

infile:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a4,b4,c4,d4
a5,b5,c5,d5
a6,b6,c6,d6
a7,b7,c7,d7

ids:
1
3
4
5
6
Removal of selected lines:
$ join -v 2 ids <(cat -n infile) | cut -f 2 -d ' '
a2,b2,c2,d2
a7,b7,c7,d7
What's going on:
First, the initial file receives an id on each line, with cat -n infile;
then, the resulting file is joined on the first column with the file holding the ids;
only non-matching lines from second file are printed -- join -v 2;
the first column, with the ids, is removed;
and, it's a neat shell one-liner (:
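For clarity, this is roughly what the join output looks like before the final cut, using the sample files above; the cut step then drops the leading id column:
$ join -v 2 ids <(cat -n infile)
2 a2,b2,c2,d2
7 a7,b7,c7,d7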
In case your file with ids is written as a single line, you can still make use of the above one-liner; simply add a translation of the ids file, as follows:
$ join -v 2 <(tr ' ' '\n' < ids) <(cat -n infile) | cut -f 2 -d ' '
@jmihalicza's answer nicely uses awk to solve the whole problem of selecting the lines from the source file that match those in the intermediate file. For completeness, the following awk program reduces the list of individual line numbers to ranges where possible, which I think answers the original question:
# collect every number from the input into the array lin
{
    for (j = 1; j <= NF; j++) {
        lin[i++] = $j;
    }
}

# walk the array, merging runs of consecutive numbers into start-end ranges
END {
    start = lin[0];
    j = 1;
    while (j <= i) {
        end = start
        while (lin[j] == (lin[j-1]+1)) {
            end = lin[j++];
        }
        if ((end+0) > (start+0)) {
            printf "%d-%d ",start,end
        } else {
            printf "%d ",start
        }
        start = lin[j++];
    }
}
Given this script, which I've called merge.awk and a file testlin.txt as follows:
1 3 4 5 6 9 10 11 13 15
... we can do this:
$ awk -f merge.awk <testlin.txt
1 3-6 9-11 13 15
This might work for you (GNU sed):
sed -r 's/\S+/&d/g;s/\s+/\n/g' intermediate_file | sed -f - source_file
Change the intermediate file into a sed script.
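For the sample intermediate file 1 3 4 5 6, the first sed produces the following delete script, which the second sed then applies to the source file:
$ sed -r 's/\S+/&d/g;s/\s+/\n/g' intermediate_file
1d
3d
4d
5d
6d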