I have a CSV document with 47001 lines in it. Yet when I open it in Excel, there are only 31641 lines.
I know that 47001 is the correct number of lines; it's an export of a database table, whose size I know to be 47001. Additionally: wc -l my.csv returns 47001.
So, Excel's parsing fails. I suspect there is some funky control or whitespace character somewhere in this document.
How do I find out the variety of characters used in some document?
For example, consider this input file: ABCAAAaaa\n.
I would expect the alphabet of characters used in the file to be: ABCa\n.
Maybe if we compress it, we can somehow read the Huffman Tree?
I suspect it will be educational to compare the UTF-8 character variety with the ASCII byte variety. For example, Excel may parse a multi-byte UTF-8 character as ASCII bytes, and thus interpret some of those bytes as control codepoints.
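One check I can think of, assuming GNU tools and that my.csv is the file in question: compare the byte count with the character count, and grep for stray control characters.
wc -c my.csv    # byte count
wc -m my.csv    # character count; differs from wc -c if multi-byte UTF-8 is present
grep -nP '[\x00-\x08\x0B\x0C\x0E-\x1F]' my.csv    # control chars other than tab, LF, CR (GNU grep)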
Here's how to do it on Linux (the logic should be the same on other systems, but this is the Linux command):
sed 's/./&\n/g' my.csv | sort -u | tr -d '\n'
What happens:
- First, replace each character with itself followed by "\n" (a newline)
- Then sort all the characters and keep only the unique occurrences
- Finally, remove all the newlines
The input file:
ABCAAAaaa
will become:
A
B
C
A
A
A
a
a
a
After sort:
a
a
a
A
A
A
A
B
C
After uniq:
a
A
B
C
Final output:
aABC
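If your grep supports -o (GNU grep does), an equivalent one-liner avoids the sed step:
grep -o . my.csv | sort -u | tr -d '\n'
(grep -o . prints each character on its own line; note that it never emits the newline characters themselves, so \n will not show up in the alphabet.)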
You can cut a few columns out of the original file that are unlikely to be changed by a cycle of being parsed and written out again, e.g. a pure text column like a name, or a number. Names would be great. Then run this file through the cycle and compare it to the original:
Here's the code:
cut -d, -f3,6,8 my.csv > columns.csv
This assumes that columns 3, 6, and 8 are the name columns and that a comma is the separator. Adjust these values according to your input file. Using a single column is also okay.
Now call Excel, parse the file columns.csv, and write it out again as a CSV file columns2.csv (with the same separator, of course). Then:
diff columns.csv columns2.csv | less
A tool like meld instead of diff might also be handy to analyse the differences.
This will show you which lines were changed by the parse→dump cycle. Hopefully it will affect only the lines you are looking for.
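If you only care about which line numbers changed, GNU diff's line-format options can list them directly (a sketch, assuming GNU diffutils):
diff --unchanged-line-format='' --old-line-format='%dn: %L' --new-line-format='' columns.csv columns2.csv
This suppresses unchanged lines and prints each changed line of columns.csv prefixed with its line number.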
I need help updating a huge CSV file with 3.5 million records. I need to update the 3rd column with the mapping value from another file.
I tried reading the file and updating the 3rd column by searching for the pattern in the mapping file, but since the actual file has 3.5 million records and the mapping file has ~1 million, it seems to run forever.
E.g.
Actual file:
123,123abc,456_def,456_def_ble,adsf,adsafdsa,123234,45645,435,12,42,afda,3435,wfg,34,345,sergf,5t4
234,234abc,5435_defg,345_def_ble,3adsaff,asdfgdsa,165434,456,435,12,42,afda,3435,wfg,34,345,sergf,5t4
Mapping File:
456_def,24_def
5435_defg,48_defg
Output expected:
123,123abc,24_def,456_def_ble,adsf,adsafdsa,123234,45645,435,12,42,afda,3435,wfg,34,345,sergf,5t4
234,234abc,48_defg,345_def_ble,3adsaff,asdfgdsa,165434,456,435,12,42,afda,3435,wfg,34,345,sergf,5t4
Pretty straightforward in awk:
awk 'BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1' mapFile actualFile
produces the output you need:
123,123abc,24_def,456_def_ble,adsf,adsafdsa,123234,45645,435,12,42,afda,3435,wfg,34,345,sergf,5t4
234,234abc,48_defg,345_def_ble,3adsaff,asdfgdsa,165434,456,435,12,42,afda,3435,wfg,34,345,sergf,5t4
To speed things up, you can change the locale setting to use ASCII.
Simply put, the C locale uses the server's base Unix/Linux character set, ASCII. By default your locale will be internationalized and set to UTF-8, which can represent every character in the Unicode character set (currently more than 110,000 unique characters) so that any of the world's writing systems can be displayed. With ASCII, each character is encoded in a single byte, and the character set comprises no more than 128 unique characters. So just do:
LC_ALL=C awk 'BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1' mapFile actualFile
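If you want to verify the speed-up on your own data, time both variants (mapFile and actualFile as in the question):
time awk 'BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1' mapFile actualFile > /dev/null
time LC_ALL=C awk 'BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1' mapFile actualFile > /dev/null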
You can use awk for this:
awk 'BEGIN{FS=OFS=","} # Set field separator as comma
NR==FNR{a[$1]=$2;next} # Store the mapping file into the array a
{if($3 in a) $3=a[$3]} # Check if there is match, and change the column value
1 # Print the whole line
' mapping actualfile
For the format of the x-axis, I am currently using the following command with the pngcairo terminal:
set format x "%.sK"
This handles numbers from 100K to 900K, but when it gets to 1 million it prints "1K" instead of "1000K".
What is the command to automatically set the label to "xK" below 1 million and to "xM" from 1 million on?
This kind of label is controlled by gnuplot's own format specifiers (see the documentation for gprintf):
set format x '%.s%c'
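A minimal self-contained sketch to see the effect (the range is made up for illustration):
set terminal pngcairo
set output 'scaled_axis.png'
set format x '%.s%c'
set xrange [0:2e6]
plot x w l    # tic labels pick the k/M suffix automatically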
I'm tasked with imputing masked genotypes, and I have to mask (hide) 2% of genotypes.
The file I do this in looks like this (genotype.dat):
M rs4911642
M rs9604821
M rs9605903
M rs5746647
M rs5747968
M rs5747999
M rs2070501
M rs11089263
M rs2096537
and to mask it, I simply change M to S2.
Yet, I have to do this for 110 (2%) of 5505 lines, so my strategy of using a random number generator (generate 110 numbers between 1 and 5505, then manually change the corresponding line's M to S2) took almost an hour... (I know, not terribly sophisticated).
I thought about saving the numbers in a separate file (maskedlines.txt) and then telling awk to replace the first character on each of those lines with S2, but I could not find any adaptable example of how to do this.
Anyway, any suggestions of how to tackle this will be deeply appreciated.
Here's one simple way, if you have shuf (it's in GNU coreutils, so if you're on Linux, you almost certainly have it):
sed "$(printf '%ds/M/S2/;' $(shuf -n110 -i1-5505 | sort -n))" \
genotype.dat > genotype.masked
A more sophisticated version wouldn't depend on knowing that you want 110 of 5505 lines masked; you can easily extract the line count with lines=$(wc -l < genotype.dat), and from there you can compute the percentage.
shuf is used to produce a random sample of lines, usually from a file; the -i1-5505 option means to use the integers from 1 to 5505 instead, and -n110 means to produce a random sample of 110 (without repetition). I sorted that for efficiency before using printf to create a sed edit script.
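Following that suggestion, a sketch that computes the sample size from the file itself instead of hard-coding 110 and 5505:
lines=$(wc -l < genotype.dat)
n=$(( lines * 2 / 100 ))    # 2% of the line count (5505 -> 110)
sed "$(printf '%ds/M/S2/;' $(shuf -n"$n" -i1-"$lines" | sort -n))" \
    genotype.dat > genotype.masked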
awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat
How it works
In sum, we first read in maskedlines.txt into an associative array a. This file is assumed to have one number per line and a of that number is set to one. We then read in genotype.dat. If a for that line number is one, we change the first field to S2 to mask it. The line, whether changed or not, is then printed.
In detail:
NR==FNR{a[$1]=1;next;}
In awk, FNR is the number of records (lines) read so far from the current file and NR is the total number of lines read so far. So, when NR==FNR, we are reading the first file (maskedlines.txt). This file contains the line number of lines in genotype.dat that are to be masked. For each of these line numbers, we set a to 1. We then skip the rest of the commands and jump to the next line.
a[FNR]{$1="S2"}
If we get here, we are working on the second file: genotype.dat. For each line in this file, we check to see if its line number, FNR, was mentioned in maskedlines.txt. If it was, we set the first field to S2 to mask this line.
1
This is awk's cryptic shorthand to print the current line.
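For completeness, maskedlines.txt can be generated with the same shuf invocation as in the previous answer (the numbers don't need to be sorted for this approach):
shuf -n110 -i1-5505 > maskedlines.txt
awk 'NR==FNR{a[$1]=1;next;} a[FNR]{$1="S2"} 1' maskedlines.txt genotype.dat > genotype.masked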
I'm trying to plot the 1st and 3rd columns of multiple files, where each file is supposed to be plotted to its own output.png.
My files have the following names:
VIB2--135--398.6241
VIB2--136--408.3192
VIB2--137--411.3725
...
The first number in the file name is an integer, which ranges from 135-162. The second number is just a decimal number and there is no regular spacing between the values.
Basically I want to do something like this
plot for [a=135:162] 'VIB2--'.a.'--*' u 1:3 w l
although this doesn't work, of course, since the '*' is just the placeholder I know from bash, and I don't know if there is something similar in gnuplot.
Furthermore, each of the files should be, as already said above, plotted to its own output.png, where the two numbers should be in the output name, e.g. VIB2--135--398.6241.png.
I tried to come up with a bash script for this, like (edited):
#!/bin/bash
for file in *
do
gnuplot < $file
set xtics 1
set xtics rotate
set terminal png size 1920,1080 enhanced
set output $file.png
plot "$file" u 1:3 w l
done
but I still get
gnuplot> 1 14 -0.05
^
line 0: invalid command
gnuplot> 2 14 0.01
^
line 0: invalid command
...
which are actually the numbers from my input file. So gnuplot thinks that the numbers I want to plot are commands...?? Also, when the end of the file is reached, I get the following error message:
#PLOT 1
plot: an unrecognized command `0x20' was encountered in the input
plot: the input file `VIB2--162--496.0271' could not be parsed
I've seen a few questions similar to mine, but the solutions didn't really work for me and I cannot add a comment, since I do not have the reputation.
Please help me with this.
gnuplot < $file starts gnuplot and feeds it the content of $file as input. That means gnuplot will try to execute the commands in the data file, which doesn't work.
What you want is a "here document":
gnuplot <<EOF
set xtics 1
set xtics rotate
set terminal png size 1920,1080 enhanced
set output $file.png
plot "$file" u 1:3 w l
EOF
What this does is: the shell reads the text up to the line containing only EOF, replaces all variables, puts the result into a temporary file, and then starts gnuplot, feeding it the temporary file as input.
Be careful that the file names don't contain spaces, or set output $file.png will not work. To be safe, you should probably use set output "$file.png" but my gnuplot is a bit rusty.
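Putting the pieces together, a sketch of the corrected script (the VIB2--* glob matches the file names from the question; quoting added as noted above):
#!/bin/bash
for file in VIB2--*
do
gnuplot <<EOF
set xtics 1
set xtics rotate
set terminal png size 1920,1080 enhanced
set output "$file.png"
plot "$file" u 1:3 w l
EOF
done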