How would one remove, for example, the 3rd column from a CSV file directly from the command line of the Mac terminal? I understand that
cut -d',' -f3 data.csv
prints the 3rd column to the terminal, but I want the 3rd column to be removed from the dataset entirely. How can I do this via the terminal?
Try
cut -d',' -f1-2,4- data.csv
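Note that cut only writes to standard output, so to actually drop the column from the dataset you have to redirect to a new file and then replace the original. A minimal sketch, assuming a simple CSV with no quoted fields that contain commas (the sample data and the temporary directory are made up for illustration):

```shell
# Sketch: remove column 3 from a CSV by writing to a new file.
# Assumes no quoted fields containing commas; the sample data is illustrative.
tmpdir=$(mktemp -d)
printf 'id,name,secret,city\n1,Ann,x1,Oslo\n2,Bob,x2,Rome\n' > "$tmpdir/data.csv"

# Keep fields 1-2 and 4 onward. Write to a separate file first, because
# redirecting onto the input file would truncate it before cut reads it.
cut -d',' -f1-2,4- "$tmpdir/data.csv" > "$tmpdir/data.new.csv"
mv "$tmpdir/data.new.csv" "$tmpdir/data.csv"

result=$(cat "$tmpdir/data.csv")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The intermediate file is the important part: `cut ... data.csv > data.csv` would leave you with an empty file.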
All the examples seem a bit tricky if you are trying to delete multiple fields, or if you only want to see a few columns. So instead I simply list the columns I want to keep. If I only want columns 1, 2 and 5, I'd do this:
cut -d, -f 1,2,5 hugeData.csv
NB: -d sets whatever the separator is in the file. In the example above it is a comma ,
My grep/regex is rusty, but if the number of columns is fixed, then you can simply write a pattern for each quote pair (and its contents), and then replace the line with all but the third pair (with a tool such as sed, since grep itself only matches). That's a clunky way of doing it, but it should nonetheless get the job done.
man grep
and page down to the REGULAR EXPRESSIONS section for help with how to specify.
I have a query in a shell script that gives me results like:
article;20200120
fruit;22
fish;23
I execute that report every day. When I execute the query the next day, I would like it to show me output like this:
article;20200120;20200121
fruit;22;11
fish;23;12
I execute this report with PostgreSQL in a Linux shell script. The CSV output is generated by redirecting the output with ">>".
Any help to achieve that would be appreciated.
Thanks
This might be somewhat fragile, but it sounds like what you want can be accomplished with cut and paste.
Let's start with two files we want to join:
$ cat f1.csv
article;20200120
fruit;22
fish;23
$ cat f2.csv
article;20200121
fruit;11
fish;12
We first use cut to strip the header column (the first field) from the second file, then send that into paste with the first file to combine corresponding lines:
$ cut -d ';' -f 2- f2.csv | paste -d ';' f1.csv -
article;20200120;20200121
fruit;22;11
fish;23;12
Parsing that command line, the -d ';' tells cut to use semicolons as the delimiter (the default is tab), and -f 2- says to print the second and later fields. f2.csv is the input file for cut. Then the -d ';' similarly tells paste to use semicolons to join the lines, and f1.csv - are the two files to paste together, in that order, with - representing the input piped in using the | shell operator.
Now, like I say, this is somewhat fragile. We're not matching the lines based on the header information, only their line number from the start of the file. If some fields are optional, or the set of fields changes over time, this will silently produce garbage. One way to mitigate that would be to first call cut -d ';' -f 1 on each of the input files and insist the results are the same before combining them.
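That check could look something like the following. This is only a sketch: it recreates the two sample files in a temporary directory, and the file names keys1 and keys2 are made up for the example.

```shell
# Sketch: refuse to combine the files unless their first columns match.
tmpdir=$(mktemp -d)
printf 'article;20200120\nfruit;22\nfish;23\n' > "$tmpdir/f1.csv"
printf 'article;20200121\nfruit;11\nfish;12\n' > "$tmpdir/f2.csv"

# Extract the first column of each file for comparison.
cut -d ';' -f 1 "$tmpdir/f1.csv" > "$tmpdir/keys1"
cut -d ';' -f 1 "$tmpdir/f2.csv" > "$tmpdir/keys2"

# cmp -s is silent and returns success only if the two files are identical.
if cmp -s "$tmpdir/keys1" "$tmpdir/keys2"; then
    combined=$(cut -d ';' -f 2- "$tmpdir/f2.csv" | paste -d ';' "$tmpdir/f1.csv" -)
    printf '%s\n' "$combined"
else
    echo "first columns differ; not combining" >&2
    combined=""
fi
rm -rf "$tmpdir"
```

This still only guards against mismatched rows; it does not try to reorder or merge them.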
I'm on mac terminal.
I have a txt file with one column with 9 IDs, allofthem.txt, where every ID starts with "rs":
rs382216
rs11168036
rs9296559
rs9349407
rs10948363
rs9271192
rs11771145
rs11767557
rs11
Also, I have another txt file, useful.txt, with those IDs that were useful in an analysis I did. It looks the same, one column with several rows of IDs, but with fewer IDs, only 4.
rs9349407
rs10948363
rs9271192
rs11
Problem: I want to generate a new txt file with the non-useful ones (the ones that appear in allofthem.txt but not in useful.txt).
I want to do the inverse of:
grep -f useful.txt allofthem.txt
I want to use some systematic way of deleting all the IDs in useful.txt and obtain a file with the remaining ones. Maybe with awk or sed, but I can't see it. Can you help me, please? Thanks in advance!
Desired output:
rs382216
rs11168036
rs9296559
rs11771145
rs11767557
The -v option does the inverse for you:
grep -vxf useful.txt allofthem.txt > remaining.txt
The -x option matches the whole line in allofthem.txt, not parts of it.
As @hek2mgl rightly pointed out, you need -F if you want to treat the content of useful.txt as fixed strings and not patterns:
grep -vxFf useful.txt allofthem.txt > remaining.txt
Make sure your files have no leading or trailing white spaces - they could affect the results.
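To see why -x matters here, the following sketch rebuilds the two files from the question in a temporary directory and compares the result with and without -x. The variable names loose and strict are just for illustration.

```shell
# Sketch: effect of -x when excluding the IDs listed in useful.txt.
tmpdir=$(mktemp -d)
printf '%s\n' rs382216 rs11168036 rs9296559 rs9349407 rs10948363 \
    rs9271192 rs11771145 rs11767557 rs11 > "$tmpdir/allofthem.txt"
printf '%s\n' rs9349407 rs10948363 rs9271192 rs11 > "$tmpdir/useful.txt"

# Without -x, the string rs11 also matches inside rs11168036,
# rs11771145 and rs11767557, so those lines are wrongly removed.
loose=$(grep -vFf "$tmpdir/useful.txt" "$tmpdir/allofthem.txt")

# With -x, a line is removed only if it equals a listed ID exactly.
strict=$(grep -vxFf "$tmpdir/useful.txt" "$tmpdir/allofthem.txt")

printf 'without -x:\n%s\n\nwith -x:\n%s\n' "$loose" "$strict"
rm -rf "$tmpdir"
```

Only the -x version reproduces the desired output from the question.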
I recommend using awk:
awk 'FNR==NR{patterns[$0];next} !($0 in patterns)' useful.txt allofthem.txt
Explanation:
FNR==NR is true as long as we are reading useful.txt. We create an index in patterns for every line of useful.txt. next skips the remaining rules for that line.
!($0 in patterns) runs, because of the previous next statement, only on lines of allofthem.txt. It checks for every line of that file whether it is a key in patterns. If it is not, the expression evaluates to true and awk prints that line.
So, I have a file which contains the results of some calculations I've run in the past weeks. I've collected the results in a file which I intend to plot. It is basically a bunch of rows with the format "x" "y" "f(x,y)", like this:
1.7 4.7 -460.5338556921
1.7 4.9 -460.5368762353
1.7 5.5
However, some lines, exemplified by the last one, contain a blank space in the 3rd column, resulting from failed calculations. I'd still like to plot the viable points, but, as there are thousands of points (and therefore rows), that task can't be accomplished easily by hand. I'd like to know how to make a script or program (I'd prefer a shell script, but I'll gladly go along with whatever works) which identifies those lines and deletes them. Does anyone know a way to do it?
awk '$3' <filename>
or better
awk 'NF > 2' <filename> # also keeps rows where the column-3 entry happens to be zero
This will do the job!
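The difference between the two awk commands shows up as soon as a valid result is exactly zero. A small sketch with made-up data (the file name results.txt is an assumption):

```shell
# Sketch: awk '$3' tests the VALUE of field 3, awk 'NF > 2' tests the field COUNT.
tmpdir=$(mktemp -d)
printf '1.7 4.7 -460.5338556921\n1.7 4.9 0\n1.7 5.5 \n' > "$tmpdir/results.txt"

# A third field of 0 is numerically false, so this drops the second row too.
by_value=$(awk '$3' "$tmpdir/results.txt")

# Counting fields keeps every row that has a third column at all.
by_count=$(awk 'NF > 2' "$tmpdir/results.txt")

printf 'by value:\n%s\n\nby count:\n%s\n' "$by_value" "$by_count"
rm -rf "$tmpdir"
```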
A simple grep command, using only POSIX character classes so that it should work with practically any grep these days:
grep -v '^[^[:space:]]*[[:space:]]*[^[:space:]]*[[:space:]]*$' <filename>
With grep:
grep ' .* [^ ]' file
or using -E (note that \s and \S are not part of POSIX ERE, though GNU grep supports them):
grep -E '\s\S+\s\S' file
I would use:
perl -lanE 'print if @F==3 && /^[\d\s\.+-]+$/' file
This will print only lines which:
contain 3 fields
and contain only digits, whitespace, and the characters . + -
I do not know how you are going to plot. You would probably want a grep or awk solution, piping all valid lines into your plotting application.
When you need to call a program for each set of values, you can skip the invalid lines when you are reading the values:
while read -r x y fxy; do
    if [ -n "${fxy}" ]; then
        myplotter "$x" "$y" "${fxy}"
    fi
done < file
I have a csv file I am looking at in bash that I am trying to manipulate. There are several things that I have/am trying to edit. Structure is like so where the first row are the column(field) headers
cat,dog,hippopotamus,zebra
1,,3,2
three species, five species,only one,multiple
at,home, at, home, wild, wild
How can I edit the field (column) names in the csv?
head -1 test.csv
shows what the field (column) names are, but it still has the commas in it as well and this doesn't allow for field name changing at all.
The other part about this is that I want to only edit titles that are greater than 8 characters in length, in which case I will just take the first 8 characters. I'm guessing I would use some sort of loop based on string length but since I don't know how to even edit the field name of just one column I'm not sure how to do this. In scenario above, changing hippopotamus to hippopot.
How can I replace empty cells in the csv to NA or NULL?
sed -i 's/ /NULL/g'
I thought this would work, but it doesn't.
Some of the cells have commas within them, messing with the , delimiter. I used the code below and it seems to work, but is there a better/safer way to do this?
sed -i "s/, /_/g"
Or in a similar situation, if multiple columns contain strings sometimes with spaces within a string but I only want to remove the space in one of the columns while leaving the other columns alone, how can I achieve this?
sed -i 's/ //g' test.csv
Sed will allow a line number as a command prefix, to only work on a single line (or a range of numbers, to work on lines in that range). Try something like this.
sed -e '1s/cat/Feline/' test.csv > test2.csv
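For the part of the question about cutting long column names down to 8 characters, awk is probably easier than sed, because it can loop over the fields of the first line. A rough sketch, assuming the header contains no quoted commas (the sample file is rebuilt here just for the demonstration):

```shell
# Sketch: truncate every header name to at most 8 characters.
# Assumes a plain comma-separated header without quoted commas.
tmpdir=$(mktemp -d)
printf 'cat,dog,hippopotamus,zebra\n1,,3,2\n' > "$tmpdir/test.csv"

# On line 1 only, replace each field with its first 8 characters.
# Assigning to a field rebuilds the line using OFS; the trailing 1 prints every line.
truncated=$(awk -F, -v OFS=, \
    'NR == 1 { for (i = 1; i <= NF; i++) $i = substr($i, 1, 8) } 1' \
    "$tmpdir/test.csv")

printf '%s\n' "$truncated"
rm -rf "$tmpdir"
```

Names that are already 8 characters or shorter pass through unchanged, so no explicit length test is needed.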
CSV files will store an empty field as either a comma at start of line, a comma at end of line, or a comma followed by another comma:
Field1,Field2,Field3
,"<-- empty field1",field3
field1,,"<-- empty field2"
field1,"empty field3-->",
You can use the following sed commands to fix these:
sed -e 's/^,/NA,/;s/,$/,NA/' -e ':loop' -e 's/,,/,NA,/g;tloop' test.csv
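If the loop in sed feels opaque, the same replacement can be sketched in awk by testing each field for emptiness. This is an alternative, not a translation of the sed command, and it shares the same limitation: it assumes no quoted fields that contain commas. The sample data below is made up.

```shell
# Sketch: fill empty CSV fields with NA using awk instead of sed.
# Assumes no quoted fields containing commas.
tmpdir=$(mktemp -d)
printf 'Field1,Field2,Field3\n,b,c\na,,c\na,b,\n' > "$tmpdir/test.csv"

# Visit every field; an empty one is replaced, then the line is printed.
filled=$(awk -F, -v OFS=, \
    '{ for (i = 1; i <= NF; i++) if ($i == "") $i = "NA" } 1' \
    "$tmpdir/test.csv")

printf '%s\n' "$filled"
rm -rf "$tmpdir"
```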
Your solution appears good. Be aware, however, that CSV should have quotes around any string containing a comma. And that's legit. It's also the point where sed stops being a good tool for manipulating CSV files. ;-) One suggestion would be to replace "interior" commas with "%2C", which is the URL (percent) encoding for a comma. That's pretty distinctive, and at least somewhat standard.
sed numbers groups starting from the left-most paren. If your groups match multiple times, you can only get the last match contents, but if an outer group contains the multi-match, the outer group is still valid. (I assume here that you have already replaced the "interior" commas with something else.)
sed -e ':loop' -e 's/^\(\([^,]*,\)\{3\}\)\([^ ,]*\) /\1\3/;tloop'
This will remove the first space in column 4, then loop. It will stop when it finds the comma that ends the column, or end-of-line.
Note that the first part, called \1, is general. You can replace the 3 with whatever field, minus one, and that will get you to the start of the field. The actual work is in the second part, \3, where you can do what you like. (Note that \2 is included within \1, and not particularly useful.)
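As a small illustration of the "field minus one" idea, the repeat count can be computed from the target column. The variable names field and skip and the sample line are assumptions for this sketch; it also assumes interior commas have already been replaced.

```shell
# Sketch: strip spaces from one chosen column, leaving the others alone.
field=4                 # column to clean (1-based)
skip=$((field - 1))     # number of complete fields to step over
line='a b,c d,e f,g h i,j k'

# \1 captures the first $skip fields with their commas; \3 is the part of the
# target field before its first space. Each pass deletes one space, and
# tloop repeats until no space is left in that field.
cleaned=$(printf '%s\n' "$line" |
    sed -e ':loop' -e "s/^\(\([^,]*,\)\{$skip\}\)\([^ ,]*\) /\1\3/;tloop")

printf '%s\n' "$cleaned"
```

Spaces in columns 1 to 3 and in column 5 are left untouched; only "g h i" is collapsed to "ghi".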
I have two text files, each containing more than 50,000 lines. I need to find the words that appear in both text files. I tried the comm command, but I got the message "file 2 is not in sorted order". I tried to sort the file with the sort command, but it doesn't work. I'm working in Windows. It doesn't have to be solved on the command line; it can be solved in some program or something else. Thank you for every idea.
If you want to sort the files, you will have to use some sort of external sort (like merge sort) so you have enough memory. Another way is to go through the first file, find all the words and store them in a hash table, then go through the second file and check each word against the table. If the words are actual words and not gibberish, the second method will work and be easier. Since the files are so large you may not want to use a scripting language, but it might work.
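The hash table idea can be sketched with awk, whose arrays are associative. This assumes a Unix-like shell is available (on Windows, for instance through Cygwin) and that each file holds one word per line; the file names and words are made up.

```shell
# Sketch: find common words with an awk associative array (no sorting needed).
# Assumes one word per line in both files.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry date > "$tmpdir/first.txt"
printf '%s\n' date egg banana fig banana > "$tmpdir/second.txt"

# While reading the first file (NR == FNR), remember each word.
# For the second file, print a word if it was seen and not printed before.
common=$(awk 'NR == FNR { seen[$0]; next } ($0 in seen) && !printed[$0]++' \
    "$tmpdir/first.txt" "$tmpdir/second.txt")

printf '%s\n' "$common"
rm -rf "$tmpdir"
```

The output follows the order of the second file, and the whole first file is held in memory, which should be fine for tens of thousands of lines.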
If the words are not each on their own line, then comm cannot help you.
If you have a set of Unix utilities handy, like Cygwin (you mentioned comm, so you may have others as well), you can do:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | sort > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords
The first two lines split each file into a single word per line and sort the result.
If you're only interested in individual words, you can change sort to sort -u to get the unique set.
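Put together on two tiny made-up files, the pipeline looks like this. Setting LC_ALL=C for each command is an addition of this sketch, so that tr, sort and comm all agree on character classes and ordering; a mismatch in collation is one common cause of the "not in sorted order" complaint.

```shell
# Sketch: words common to two files, using tr, sort -u and comm.
tmpdir=$(mktemp -d)
printf 'apple banana cherry\ndate apple\n' > "$tmpdir/firstFile"
printf 'banana, date; egg\nfig\n' > "$tmpdir/secondFile"

# Turn every run of non-letters into one newline, then sort and deduplicate.
LC_ALL=C tr -cs '[:alpha:]' '\n' < "$tmpdir/firstFile" |
    LC_ALL=C sort -u > "$tmpdir/firstFileWords"
LC_ALL=C tr -cs '[:alpha:]' '\n' < "$tmpdir/secondFile" |
    LC_ALL=C sort -u > "$tmpdir/secondFileWords"

# -12 suppresses the lines unique to either file, leaving the common ones.
commonWords=$(LC_ALL=C comm -12 "$tmpdir/firstFileWords" "$tmpdir/secondFileWords")

printf '%s\n' "$commonWords"
rm -rf "$tmpdir"
```

The comparison is case-sensitive, so "The" and "the" count as different words unless you also fold case.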