I have lots of text files that contain columns of numeric values (the number of columns differs from file to file). I use MATLAB to store each file's content like this:
id1 = fopen('textfile.txt','r');
A = fscanf(id1,'%f',[1 Inf]);
fclose(id1);
I wanted to know whether there is a simple way to do the same in a bash script.
A simple equivalent of fscanf in Bash is the read builtin:
read -r A
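For example, to walk through the file line by line (a sketch, using the textfile.txt from the question):
while read -r A; do
    echo "line: $A"
done < textfile.txt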
If, on the other hand, there are multiple columns of values, then awk can be used to extract the n-th column:
awk '{print $n}' < input > output
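For instance, to pull one column of textfile.txt into a Bash array in one go (a sketch; column 2 is just an example, and mapfile needs Bash 4+):
mapfile -t A < <(awk '{print $2}' textfile.txt)   # column number 2 is an example
echo "${A[0]}"                                    # first value of that column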
Not the simplest way imaginable, but you could use Bash arrays (Bash 4 and up).
First, read the file using newline as separator:
IFS_prev="$IFS"; IFS=$'\n';
A=($(cat "textfile.txt"))
IFS="$IFS_prev"
then, to refer to the jth element in the ith row, use this:
row=(${A[i]}) # extract ith row and split on spaces
element=${row[j]} # extract jth element
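Putting it together, a small loop over all rows could look like this (a sketch; ${A[i]} is deliberately left unquoted so the row splits on whitespace):
for ((i = 0; i < ${#A[@]}; i++)); do
    row=(${A[i]})
    echo "row $i has ${#row[@]} values, first: ${row[0]}"
done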
Related
I have a file with lines of a format XXXXXX_N where N is some number. For example:
41010401_1
42023920_3
45788_1
I would like to add N-1 lines before every line where N>1, so that I have lines for the specified XXXXXX value with all N values up to and including the original N:
41010401_1
42023920_1
42023920_2
42023920_3
45788_1
I thought about doing it with sed, but I'm not sure how to conditionally append a different number of lines with different values based on what sed reads.
Is sed even the correct command to deal with this problem?
Any help would be appreciated.
One way in awk is to set the field separator to underscore and, whenever the 2nd field is greater than 1, print all the missing records in a loop, like below.
$ awk 'BEGIN{FS=OFS="_"} $2>1{for(i=1;i<$2;i++) print $1,i} 1' file
41010401_1
42023920_1
42023920_2
42023920_3
45788_1
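The same program spread over several lines with comments, purely for readability:
awk '
BEGIN { FS = OFS = "_" }        # split and join fields on "_"
$2 > 1 {                        # when N > 1 ...
    for (i = 1; i < $2; i++)    # ... first print the missing 1 .. N-1 lines
        print $1, i
}
1                               # then print the original line unchanged
' file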
I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denotes chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split each line on either a space or : as the field separator and group the lines by the word after the colon.
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the individual words on the line, and the part !unique[$2]++ builds a hash table keyed on the value of $2. The count is incremented every time a value is seen in $2, so on later occurrences the negation ! makes the condition false and prevents the line from being printed again.
Defining the field separator as a regex with the -F flag might not be supported on all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The above assumes you want to deduplicate the file based on the word after the :; to deduplicate based only on the first column as a whole, just do
awk '!unique[$1]++' file
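If sorted output is acceptable, sort alone can also deduplicate on the first field, which is close to what you already tried; drop the second key and, with GNU/BSD sort, add -s so that the first input line of each key is the one kept:
sort -s -t ' ' -k1,1 -u input > output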
Since your input data is pretty simple, the command is going to be very easy.
sort file.txt | uniq -w7
This just sorts the file and does a unique on the first 7 characters. The first 7 characters of your data are digits; if letters appear, add -i to the command to ignore case.
I am trying to write a personal web scraper for fun in unix. I have scraped a list of names and saved them to a file called "names". I then read the file into an array with mapfile index < names, and inside a while [ $count -lt ... ] loop I access a single element as ${index[$count]}.
However I am having trouble because mapfile added a trailing space to every element of the array, something like "AAPL ". I am wondering how to use a combination of sed, grep, and awk to trim the whitespace and, if possible, save the elements back into the array.
Thanks.
Assuming you are filling the array from a file, you can make use of a bracket expression with sed:
mapfile -t index < <(sed 's/[[:space:]]*//g' names)
Alternatively, read can be another approach:
read -a index <<< $(sed 's/[[:space:]]*//g' names)
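If you would rather strip only trailing whitespace (in case a name can contain internal spaces), anchoring the pattern should do it, e.g.:
mapfile -t index < <(sed 's/[[:space:]]*$//' names)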
I want to combine the data of, say, 3 files that have the same columns and data types into a single file, which I can then use for further processing.
Currently I have to process the files one after the other, so I am looking for a solution that I can write in a script to combine all the files into one single file.
For ex:
File 1:
mike,sweden,2015
tom,USA,1522
raj,india,455
File 2:
a,xyz,155
b,pqr,3215
c,lmn,3252
Expected combined file 3:
mike,sweden,2015
tom,USA,1522
raj,india,455
a,xyz,155
b,pqr,3215
c,lmn,3252
Kindly help me with this.
Answer to the original form of the question:
As @Lars states in a comment on the question, it looks like a simple concatenation of the input files is desired, which is precisely what cat is for (and even named for):
cat file1 file2 > file3
To fulfill the requirements you added later:
#!/bin/sh
# Concatenate the input files and sort them with duplicates removed
# and save to output file.
cat "$1" "$2" | sort -u > "$3"
Note, however, that you can combine the concatenation and sorting into a single step, as demonstrated by Jean-Baptiste Yunès's answer:
# Sort the input files directly with duplicates removed and save to output file.
sort -u "$1" "$2" > "$3"
Note that using sort is the simplest way to eliminate duplicates.
If you don't want sorting, you'll have to use a different, more complex approach, e.g. with awk:
#!/bin/sh
# Process the combined input and only
# output the first occurrence in a set of duplicates to the output file.
awk '!seen[$0]++' "$1" "$2" > "$3"
!seen[$0]++ is a common awk idiom to only print the first in a set of duplicates:
seen is an associative array that is filled with each input line ($0) as the key (index), with each element created on demand.
This implies that all lines from a set of duplicates (even if not adjacent) refer to the same array element.
In a numerical context, awk's variable values and array elements are implicitly 0, so when a given input line is seen for the first time and the post-increment (++) is applied, the resulting value of the element is 1.
Whenever a duplicate of that line is later encountered, the value of the array element is incremented.
The net effect is that for any given input line !seen[$0]++ returns true if the input line is seen for the first time, and false for each of its duplicates, if any. Note that ++, due to being a post-increment, is only applied after !seen[$0] is evaluated.
! negates the value of seen[$0], causing a value of 0 (which is false in a Boolean context) to yield true, and any nonzero value (encountered for duplicates) to yield false.
!seen[$0]++ is an instance of a so-called pattern in awk: a condition evaluated against the input line that determines whether the associated action (a block of code) should be executed. Here there is no action, in which case awk implicitly just prints the input line whenever !seen[$0]++ evaluates to true.
The overall effect is: Lines are printed in input order, but for lines with duplicates only the first instance is printed, effectively eliminating duplicates.
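A tiny illustration with made-up input:
$ printf '%s\n' a b a c b | awk '!seen[$0]++'
a
b
c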
Note that this approach can be problematic with large input files with few duplicates, because most of the data must then be held in memory.
A script like:
#!/bin/sh
sort "$1" "$2" | uniq > "$3"
should do the trick. Sort will sort the concatenation of the two files (two first args of the script), pass the result to uniq which will remove adjacent identical lines and push the result into the third file (third arg of the script).
If your files follow the same naming convention (say file1, file2, file3 ... fileN), then you can use this to combine them all.
cat file* > combined_file
Edit: Script to do the same, assuming you are passing the file names as parameters
#!/bin/sh
cat "$1" "$2" "$3" | uniq > combined_file
Now you can display combined_file if you want. Or access it directly.
I have a csv file with a number of columns. I am trying to replace the second column with the second to last column from the same file.
For example, if I have a file, sample.csv
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
I want to output:
1,5,3,4,5,6
a,e,c,d,e,f
g,k,i,j,k,l
Can anyone help me with this task? Also note that I will be discarding the last two columns afterwards with cut, so I am open to splitting the csv file first so that I can replace the column in one csv file with a column from another csv file, whichever is easier to implement. Thanks in advance for any help.
How about this simpler awk:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1)}'1 sample.csv
EDIT: Noticed that you also want to discard last 2 columns. Use this awk one-liner:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-2}'1 sample.csv
In bash:
while IFS=, read -r -a arr; do
    arr[1]="${arr[4]}"
    printf -v output "%s," "${arr[@]}"
    printf "%s\n" "${output%,}"
done < sample.csv
Pure bash solution, using IFS in a funny way:
# Set IFS globally; you'll see why this is the funny part
IFS=,
while read -ra a; do
    a[1]=${a[@]: -2:1}
    echo "${a[*]}"
done < file.csv
Setting the IFS variable globally is used twice: once in the read statement so that each field is split on a comma, and once in the line echo "${a[*]}", where "${a[*]}" expands to the fields of the array a separated by IFS... which is a comma!
Another special thing: you mentioned the second-to-last field, and that's exactly what ${a[@]: -2:1} expands to (mind the space between : and -2), so you don't have to count your number of fields.
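A quick way to convince yourself of that expansion:
$ a=(1 2 3 4 5 6); echo "${a[@]: -2:1}"
5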
Caveat. CSV files really need a dedicated CSV parser, which is difficult to implement. This answer (and, I guess, all the other answers that don't use a genuine CSV parser) might break if a field contains a comma, e.g.,
1,2,3,4,"a field, with a coma",5
If you want to discard the last two columns, don't use cut, but use this instead:
IFS=,
while read -ra a; do
    ((${#a[@]}>=2)) || continue   # skip lines with fewer than two fields
    a[1]=${a[@]: -2:1}
    echo "${a[*]::${#a[@]}-2}"
done < file.csv
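With the sample.csv from the question saved as file.csv, this loop should print:
1,5,3,4
a,e,c,d
g,k,i,j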