Pig Not Making Any Progress - hadoop

I just wrote my first Pig script, and it does not seem to be making any progress. Some background info:
I'm running CDH4.5 on a CentOS 6.4 VM, all installed from Cloudera's yum repo. It is configured to all run in pseudo-distributed mode. Everything is running as a service and appears to be configured correctly (thank heaven!)
Here is my pig script:
A = LOAD '/user/msknapp/county_insurance_pp.txt' AS (fips:int,st:chararray,stfips:int,name:chararray,a:int,b:int,c:int,d:int,e:int,f:int,g:int);
DUMP A;
The input file was taken from data.gov, it's some insurance data. I pre-processed it, here is some useful info:
[msknapp@localhost data]$ cat county_insurance_pp.txt | grep BUTLER
1013 AL 1 BUTLER 54480 129 3287 57895
19023 IA 19 BUTLER 27291 29659 3386 25150 85486
20015 KS 20 BUTLER 233855 10028 456 29278 5759 279376
21031 KY 21 BUTLER 4164 453 4617
29023 MO 29 BUTLER 48240 5217 738 2042 25081 81317
31023 NE 31 BUTLER 4406 153 609 5168
39017 OH 39 BUTLER 856205 103041 3854 38648 203328 19832 1224910
42019 PA 42 BUTLER 1072941 19131 190 60648 68692 50230 1271832
[msknapp@localhost data]$ hadoop fs -cat /user/msknapp/county_insurance_pp.txt | head
1001 AL 1 AUTAUGA 215624 37156 46 130 53237 140420 446614
1003 AL 1 BALDWIN 1060297 95925 3284 31096 99241 200581 1490424
1005 AL 1 BARBOUR 37893 132 246 811 39082
1007 AL 1 BIBB 3127 70 241 34403 37841
1009 AL 1 BLOUNT 32311 135 11884 19392 4200 67922
1011 AL 1 BULLOCK 4301 336 274 186 5098
1013 AL 1 BUTLER 54480 129 3287 57895
1015 AL 1 CALHOUN 469959 92702 5373 2130 17069 532033 1119265
1017 AL 1 CHAMBERS 37238 3189 292 1953 42672
1019 AL 1 CHEROKEE 37984 190 117 1081 1277 40649
cat: Unable to write to output stream.
When I run the pig script on the command line I get a whole bunch of log statements and it looks like it is running, but once it starts, it never makes any progress, no matter how long I wait. These are the last couple lines:
2014-01-05 15:10:41,113 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1388936205793_0006
2014-01-05 15:10:41,511 [JobControl] INFO org.apache.hadoop.yarn.client.YarnClientImpl - Submitted application application_1388936205793_0006 to ResourceManager at /0.0.0.0:8032
2014-01-05 15:10:41,564 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8088/proxy/application_1388936205793_0006/
2014-01-05 15:10:41,653 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
I modified the pig script to point to my local filesystem file, and ran the pig script in local mode, and the job finished successfully in seconds. The local copy of the file is identical to the one hdfs has. I think for some reason pig can't make a solid connection to my HDFS.
Would somebody please tell me what I'm doing wrong?

Maybe try:
A = LOAD '/user/msknapp/county_insurance_pp.txt' USING PigStorage('\t') AS (fips:int,st:chararray,stfips:int,name:chararray,a:int,b:int,c:int,d:int,e:int,f:int,g:int);
DUMP A;
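Before changing the loader, it may help to confirm which delimiter the file actually uses. The following is a minimal Python sketch, not part of the original answer: `field_counts` is a hypothetical helper, and the sample rows are written with tabs as an assumption (the post shows them space-separated, which may only be how tabs were rendered). Note that the rows quoted in the question have between 8 and 11 columns, while the schema declares 11.

```python
from collections import Counter


def field_counts(lines, delimiter="\t"):
    """Count how many lines split into each number of fields."""
    return Counter(len(line.rstrip("\n").split(delimiter)) for line in lines)


# Two rows from the question, written here with tabs (an assumption).
sample = [
    "1013\tAL\t1\tBUTLER\t54480\t129\t3287\t57895",
    "19023\tIA\t19\tBUTLER\t27291\t29659\t3386\t25150\t85486",
]

# If the guessed delimiter is right, lines split into several fields;
# if it is wrong, every line collapses into a single field.
print(field_counts(sample, "\t"))
print(field_counts(sample, ","))
```

Running the same function over the real file (for example, over the lines of a local copy) should show quickly whether `PigStorage('\t')` matches the data.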

Related

Can't generate any alignments in MCScanX

I'm trying to find collinearity between a group of genes from two different species using MCScanX. But I don't know what I could be possibly doing wrong anymore. I've checked both input files countless times (.gff and .blast), and they seem to be in line with what the manual says.
For the first species, I downloaded the gff file from figshare. I already had the fasta file containing only the proteins of interest (which I also got from figshare), so the gene ids matched. Then I downloaded both the gff and the protein fasta file from the coffee genome hub. I used the coffee proteins fasta file as the reference genome in rBLAST to align the first species' genes against it. After blasting (and keeping only the first five best alignments with e-values greater than 1e-10), I filtered both gff files so they only contained genes that matched those in the blast file, and then concatenated them. So the final files look like this:
View (test.blast) #just imagine they're tab separated values
sp1.id1 sp2.id1 44.186 43 20 1 369 411 206 244 0.013 37.4
sp1.id1 sp2.id2 25.203 123 80 4 301 413 542 662 0.00029 43.5
sp1.id1 sp2.id3 27.843 255 130 15 97 333 458 676 1.75e-05 47.8
sp1.id1 sp2.id4 26.667 105 65 3 301 396 329 430 0.004 39.7
sp1.id1 sp2.id5 27.103 107 71 3 301 402 356 460 0.000217 43.5
sp1.id2 sp2.id6 27.368 95 58 2 40 132 54 139 0.41 32
sp1.id2 sp2.id7 27.5 120 82 3 23 138 770 888 0.042 35
sp1.id2 sp2.id8 38.596 57 35 0 21 77 126 182 0.000217 42
sp1.id2 sp2.id9 36.17 94 56 2 39 129 633 725 1.01e-05 46.6
sp1.id2 sp2.id10 37.288 59 34 2 75 133 345 400 0.000105 43.1
sp1.id3 sp2.id11 33.846 65 42 1 449 512 360 424 0.038 37.4
sp1.id3 sp2.id12 40 50 16 2 676 725 672 707 6.7 30
sp1.id3 sp2.id13 31.707 41 25 1 370 410 113 150 2.3 30.4
sp1.id3 sp2.id14 31.081 74 45 1 483 550 1 74 3.3 30
sp1.id3 sp2.id15 35.938 64 39 1 377 438 150 213 0.000185 43.5
View (test.gff) #just imagine they're tab separated values
ex0 sp2.id1 78543527 78548673
ex0 sp2.id2 97152108 97154783
ex1 sp2.id3 16555894 16557150
ex2 sp2.id4 3166320 3168862
ex3 sp2.id5 7206652 7209129
ex4 sp2.id6 5079355 5084496
ex5 sp2.id7 27162800 27167939
ex6 sp2.id8 5584698 5589330
ex6 sp2.id9 7085405 7087405
ex7 sp2.id10 1105021 1109131
ex8 sp2.id11 24426286 24430072
ex9 sp2.id12 2734060 2737246
ex9 sp2.id13 179361 183499
ex10 sp2.id14 893983 899296
ex11 sp2.id15 23731978 23733073
ts1 sp1.id1 5444897 5448367
ts2 sp1.id2 28930274 28935578
ts3 sp1.id3 10716894 10721909
So I moved both files to the test folder inside MCScanX directory and ran MCScan (using Ubuntu 20.04.5 LTS, the WSL feature) with:
../MCScanX ./test
I've also tried
../MCScanX -b 2 ./test
(since "-b 2" is the parameter for inter-species patterns of syntenic blocks)
but all I ever get is
255 matches imported (17 discarded)
85 pairwise comparisons
0 alignments generated
What am I missing????
I should be getting a test.synteny file that, as per the manual's example, looks like this:
## Alignment 0: score=9171.0 e_value=0 N=187 at1&at1 plus
0- 0: AT1G17240 AT1G72300 0
0- 1: AT1G17290 AT1G72330 0
...
0-185: AT1G22330 AT1G78260 1e-63
0-186: AT1G22340 AT1G78270 3e-174
##Alignment 1: score=5084.0 e_value=5.6e-251 N=106 at1&at1 plus

Open txt file inside windows 10 container

I have a simple docker container that is based on windows image:
FROM mcr.microsoft.com/windows:1903
WORKDIR /app1/
ENTRYPOINT powershell.exe
I run it interactively, using:
docker run -it -v c:\app1:c:\app1 test-image:1.0
There is a file called 1.txt inside app1 folder.
When I run:
.\app1\1.txt
I see no notepad.exe process, but instead I can spot OpenWith process:
Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
78 5 1056 4452 0.02 1940 1 CExecSvc
74 5 5360 3792 0.02 1904 1 cmd
81 5 904 1364 0.00 1844 1 CompatTelRunner
156 10 6532 6088 0.00 1728 1 conhost
97 7 1196 4980 0.05 1896 1 conhost
286 13 1836 4976 0.27 984 1 csrss
37 6 1348 3356 0.06 524 1 fontdrvhost
0 0 60 8 0 0 Idle
831 22 4748 13844 0.17 460 1 lsass
546 25 13156 28920 0.17 1952 1 OfficeClickToRun
420 24 7400 28844 0.13 2472 1 OpenWith
376 22 6732 27168 0.13 2536 1 OpenWith
I suspect that some mapping might be missing, even though assoc shows that the .txt extension is associated with notepad.exe:
assoc .txt
.txt=txtfile
ftype txtfile
txtfile=%SystemRoot%\system32\NOTEPAD.EXE %1
What might be the problem here? Am I missing some registry value?

Jmeter + InfluxDB: Response Codes are missing

I have an InfluxDB v1.7.9 installation and my Jmeter v5.2 is correctly sending data to it through the default Backend Listener (org.apache.jmeter.visualizers.backend.influxdb.HttpMetricsSender). I can see the data when querying the database.
Sample here:
time application avg count countError endedT hit max maxAT meanAT min minAT pct10.0 pct90.0 pct95.0 pct99.0 rb responseCode responseMessage sb startedT statut transaction
---- ----------- --- ----- ---------- ------ --- --- ----- ------ --- ----- ------- ------- ------- ------- -- ------------ --------------- -- -------- ------ -----------
1579001235935000000 grafanapoc-14-01-2020-1126 0 0 0 0 0 internal
1579001240085000000 grafanapoc-14-01-2020-1126 0 0 0 0 11 internal
1579001245091000000 grafanapoc-14-01-2020-1126 586.3529411764706 17 0 195 1177 197 246.6 1126.6 1177 1177 6302301 64159 all all
1579001245098000000 grafanapoc-14-01-2020-1126 197 1 197 197 197 197 197 197 10470 633 all GET - Page
1579001245100000000 grafanapoc-14-01-2020-1126 197 1 197 197 197 197 197 197 ok GET - Page
1579001245102000000 grafanapoc-14-01-2020-1126 259 1 259 259 259 259 259 259 9827 643 all GET - Privacy
1579001245102000000 grafanapoc-14-01-2020-1126 259 1 259 259 259 259 259 259 ok GET - Privacy
1579001245104000000 grafanapoc-14-01-2020-1126 710.8333333333334 12 1177 434 452.6 1158.1000000000001 1177 1177 6168994 56448 all GET - Homepage
1579001245106000000 grafanapoc-14-01-2020-1126 710.8333333333334 12 1177 434 452.6 1158.1000000000001 1177 1177 ok GET - Homepage
1579001245107000000 grafanapoc-14-01-2020-1126 327.3333333333333 3 387 273 273 387 387 387 ok GET - Contact
1579001245107000000 grafanapoc-14-01-2020-1126 327.3333333333333 3 387 273 273 387 387 387 113010 6435 all GET - Contact
1579001245109000000 grafanapoc-14-01-2020-1126 0 23 18 12 23 internal
1579001250083000000 grafanapoc-14-01-2020-1126 411.16666666666674 25 0 197 1177 143 179 712.0000000000001 1059.7000000000005 1177 5350040 69699 all all
However, as you can see from this sample, the 'responseCode' column is empty and only displays data when an error occurs (500, 404, Non HTTP response code, etc).
I am interested in recording all the Response Codes, not just errors.
I attempted to amend the jmeter.properties file defaults, without success. Can anyone help me identify the reason why the Response Codes for successful requests are not parsed over?
As per JMeter 5.2 response code and message are stored only for failed samplers:
private void addErrorMetric(String transaction, ErrorMetric err, long count) {
//
tag.append(TAG_RESPONSE_CODE).append(AbstractInfluxdbMetricsSender.tagToStringValue(err.getResponseCode()));
tag.append(TAG_RESPONSE_MESSAGE).append(AbstractInfluxdbMetricsSender.tagToStringValue(err.getResponseMessage()));
//
Unfortunately, this is not something you can control via JMeter properties. If you want to change this behaviour, you need to amend InfluxdbBackendListenerClient and rebuild JMeter from source.

MPICH output not printing

Problem
I'm running the cp2k executable, installed on an HPC cluster with mpich-3.2. Its output is redirected to a file named out. After the first few steps are printed, nothing more appears in that file, yet the job status on the cluster shows that the job is still running.
Script
I'm using the following job script:
#!/bin/bash
#PBS -N test
#PBS -o test.log
#PBS -j oe
#PBS -l nodes=2:ppn=20
#PBS -q mini
#PBS -l walltime=2:00:00
cd $PBS_O_WORKDIR
echo Master process running on `hostname`
echo Directory is `pwd`
echo PBS has allocated the following nodes:
echo `cat $PBS_NODEFILE`
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes
export I_MPI_FABRICS=shm:dapl
export I_MPI_PROVIDER=psm2
export I_MPI_FALLBACK=0
export KMP_AFFINITY=verbose,scatter
export OMP_NUM_THREADS=1
export I_MPI_IFACE=ib0
echo Starting execution at `date`
EXEC="/home/arshil/software/cp2k-5.1.0/exe/local/cp2k.popt"
cp $EXEC ./cp2k
mpiexec -np $NPROCS --machinefile $PBS_NODEFILE ./cp2k -i test.inp >& out
rm cp2k
echo Finished at `date`
Error
The output in the out file:
SCF WAVEFUNCTION OPTIMIZATION
----------------------------------- OT ---------------------------------------
Minimizer : DIIS : direct inversion
in the iterative subspace
using 7 DIIS vectors
safer DIIS on
Preconditioner : FULL_SINGLE_INVERSE : inversion of
H + eS - 2*(Sc)(c^T*H*c+const)(Sc)^T
Precond_solver : DEFAULT
stepsize : 0.08000000 energy_gap : 0.08000000
eps_taylor : 0.10000E-15 max_taylor : 4
----------------------------------- OT ---------------------------------------
Step Update method Time Convergence Total energy Change
------------------------------------------------------------------------------
1 OT DIIS 0.80E-01 21.3 0.00002878 -8797.2068024142 -8.80E+03
2 OT DIIS 0.80E-01 10.9 0.00007114 -8797.2061897209 6.13E-04
3 OT DIIS 0.80E-01 10.8 0.00001688 -8797.2073257531 -1.14E-03
As can be seen, nothing is printed after step 3 in the output file, but the job is still running in the background. Even after the walltime is over, the output file remains the same as above. Where is the output going?
The executable cp2k is used to perform quantum chemical calculations and was installed on the cluster along with mpich-3.2. All CP2K needs is an input file with extension .inp. For my case, test.inp is the input file.
&FORCE_EVAL
METHOD Quickstep
&DFT
BASIS_SET_FILE_NAME GTH_BASIS_SETS
POTENTIAL_FILE_NAME GTH_POTENTIALS
&MGRID
NGRIDS 4
CUTOFF 380
REL_CUTOFF 60
&END MGRID
&QS
METHOD GPW
MAP_CONSISTENT
EXTRAPOLATION ASPC
EXTRAPOLATION_ORDER 3
&END QS
&SCF
MAX_SCF 1000
EPS_SCF 1.0E-5
SCF_GUESS ATOMIC
&OT
PRECONDITIONER FULL_SINGLE_INVERSE
MINIMIZER DIIS
N_DIIS 7
&END OT
&PRINT
&RESTART OFF
&END RESTART
&END PRINT
&END SCF
&XC
&XC_FUNCTIONAL PBE
&END XC_FUNCTIONAL
&vdW_POTENTIAL
DISPERSION_FUNCTIONAL PAIR_POTENTIAL
&PAIR_POTENTIAL
PARAMETER_FILE_NAME dftd3.dat
TYPE DFTD3
REFERENCE_FUNCTIONAL PBE
R_CUTOFF [angstrom] 12.3
&END PAIR_POTENTIAL
&END vdW_POTENTIAL
&END XC
&END DFT
&SUBSYS
&CELL
ABC 24.6904 24.6904 24.6904
PERIODIC XYZ
&END CELL
&KIND C
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q4
&END KIND
&KIND P
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q5
&END KIND
&KIND H
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q1
&END KIND
&KIND O
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q6
&END KIND
&KIND N
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q5
&END KIND
&KIND Mg
BASIS_SET TZV2P-GTH
POTENTIAL GTH-PBE-q10
&END KIND
&COLVAR
&COORDINATION
ATOMS_FROM 41
ATOMS_TO 38
R_0 [bohr] 4.5
NN 6
ND 12
&END COORDINATION
&END COLVAR
&COLVAR
&COORDINATION
ATOMS_FROM 41
ATOMS_TO 42 44 47 50 53 56 59 62 65 68 71 74 77 80 83 86 89 92 95 98 101 104 107 110 113 116 119 122 125 128 131 134 137 140 143 146 149 152 155 158 161 164 167 170 173 176 179 182 185 188 191 194 197 200 203 206 209 212 215 218 221 224 227 230 233 236 239 242 245 248 251 254 257 260 263 266 269 272 275 278 281 284 287 290 293 296 299 302 305 308 311 314 317 320 323 326 329 332 335 338 341 344 347 350 353 356 359 362 365 368 371 374 377 380 383 386 389 392 395 398 401 404 407 410 413 416 419 422 425 428 431 434 437 440 443 446 449 452 455 458 461 464 467 470 473 476 479 482 485 488 491 494 497 500 503 506 509 512 515 518 521 524 527 530 533 536 539 542 545 548 551 554 557 560 563 566 569 572 575 578 581 584 587 590 593 596 599 602 605 608 611 614 617 620 623 626 629 632 635 638 641 644 647 650 653 656 659 662 665 668 671 674 677 680 683 686 689 692 695 698 701 704 707 710 713 716 719 722 725 728 731 734 737 740 743 746 749 752 755 758 761 764 767 770 773 776 779 782 785 788 791 794 797 800 803 806 809 812 815 818 821 824 827 830 833 836 839 842 845 848 851 854 857 860 863 866 869 872 875 878 881 884 887 890 893 896 899 902 905 908 911 914 917 920 923 926 929 932 935 938 941 944 947 950 953 956 959 962 965 968 971 974 977 980 983 986 989 992 995 998 1001 1004 1007 1010 1013 1016 1019 1022 1025 1028 1031 1034 1037 1040 1043 1046 1049 1052 1055 1058 1061 1064 1067 1070 1073 1076 1079 1082 1085 1088 1091 1094 1097 1100 1103 1106 1109 1112 1115 1118 1121 1124 1127 1130 1133 1136 1139 1142 1145 1148 1151 1154 1157 1160 1163 1166 1169 1172 1175 1178 1181 1184 1187 1190 1193 1196 1199 1202 1205 1208 1211 1214 1217 1220 1223 1226 1229 1232 1235 1238 1241 1244 1247 1250 1253 1256 1259 1262 1265 1268 1271 1274 1277 1280 1283 1286 1289 1292 1295 1298 1301 1304 1307 1310 1313 1316 1319 1322 1325 1328 1331 1334 1337 1340 1343 1346 1349 1352 1355 1358 1361 1364 1367 1370 1373 1376 1379 1382 1385 1388 1391 1394 1397 1400 1403 1406 1409 1412 1415 1418 1421 1424 1427 1430 1433 1436 
1439 1442 1445 1448 1451 1454 1457
ATOMS_TO 1460 1463 1466 1469 1472 1475 1478 1481 1484 1487 1490 1493 1496 1499 1502 1505
R_0 [bohr] 4.5
NN 6
ND 12
&END COORDINATION
&END COLVAR
&END SUBSYS
&END FORCE_EVAL
&GLOBAL
PROJECT test
RUN_TYPE MD
PRINT_LEVEL LOW
&END GLOBAL
&MOTION
&MD
ENSEMBLE NVT
STEPS 100000
TIMESTEP 0.5
TEMPERATURE 310
TEMP_TOL 100
&THERMOSTAT
&NOSE
LENGTH 3
YOSHIDA 3
TIMECON 100.0
MTS 2
&END NOSE
&END
&PRINT
&ENERGY
&EACH
MD 10
&END
&END
&PROGRAM_RUN_INFO
&EACH
MD 100
&END
&END
FORCE_LAST
&END PRINT
&END MD
&FREE_ENERGY
&METADYN
DO_HILLS
LAGRANGE .TRUE.
NT_HILLS 40
WW [kcalmol] 1
TEMPERATURE 310
TEMP_TOL 10
&METAVAR
SCALE 0.05
COLVAR 1
MASS 50
LAMBDA 2
&WALL
POSITION 0.0
TYPE QUARTIC
&QUARTIC
DIRECTION WALL_MINUS
K 10.0
&END
&END
&END METAVAR
&METAVAR
SCALE 0.05
COLVAR 2
MASS 50
LAMBDA 2
&WALL
POSITION 0.0
TYPE QUARTIC
&QUARTIC
DIRECTION WALL_MINUS
K 10.0
&END
&END
&END METAVAR
&PRINT
&COLVAR
COMMON_ITERATION_LEVELS 3
&EACH
MD 1
&END
&END
&HILLS
COMMON_ITERATION_LEVELS 3
&EACH
MD 1
&END
&END
&END
&END METADYN
&END
&PRINT
&TRAJECTORY
&EACH
MD 1
&END
&END
&VELOCITIES OFF
&END
&RESTART
&EACH
MD 20
&END
ADD_LAST NUMERIC
&END
&RESTART_HISTORY
&EACH
MD 2000
&END
&END
&END
&END MOTION
&EXT_RESTART
RESTART_FILE_NAME NVT-1.restart
RESTART_COUNTERS .FALSE.
&END
In my opinion, the problem is not with the input file. It has something to do with mpich-3.2. I would really appreciate some help.
Something similar may be going on here, and the same solutions may apply, as in this question: Python "print" not working when embedded into MPI program. It is not a perfect match, since you are not using Python, but it may help.
At a basic level, MPI launches many processes, but only the command that launches them has access to stdio. The redirect at the end of the line starting with mpiexec sends the stdout of mpiexec to a file. The output from your processes is buffered by mpiexec until they end (either they complete or they are stopped).
Where the output is going is a good question, and answering it may require changes in test.inp or some other way of shutting down (you mention you ran out of walltime). I'm trying to solve the same problem and will update this if I find an answer.
Also, the output from different processes started by MPI can arrive in random order. I do not care about this, but if you do, you may need to pass the messages back to a common thread that sorts them.
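The buffering effect described above can be reproduced without MPI. The following Python sketch is only an analogy (CP2K is not a Python program, and `output_seen_while_running` is a hypothetical helper): a child process prints one line to a stdout that is redirected to a file, then keeps running until it is killed. Without an explicit flush, the line sits in the child's buffer and never reaches the file, which resembles an out file that stops growing and stays that way after the job is terminated.

```python
import os
import subprocess
import sys
import tempfile
import time

# Child program: print one line, signal the parent, then keep "computing".
CHILD = """
import sys, time, pathlib
print("step 1", flush={flush})
pathlib.Path(sys.argv[1]).touch()   # tell the parent the line was printed
time.sleep(60)
"""


def output_seen_while_running(flush):
    """Return what the redirected file contains while the child is alive."""
    # Make sure unbuffered mode is not forced from the environment.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "out")
        marker = os.path.join(tmp, "printed")
        with open(out_path, "w") as out:
            child = subprocess.Popen(
                [sys.executable, "-c", CHILD.format(flush=flush), marker],
                stdout=out,
                env=env,
            )
        try:
            # Wait until the child reports that it has called print().
            deadline = time.time() + 30
            while not os.path.exists(marker) and time.time() < deadline:
                time.sleep(0.05)
            with open(out_path) as f:
                return f.read()
        finally:
            child.kill()   # like a job killed at walltime: buffers are lost
            child.wait()


if __name__ == "__main__":
    print(repr(output_seen_while_running(flush=False)))
    print(repr(output_seen_while_running(flush=True)))
```

Whether the same mechanism applies to a given MPI launcher or to CP2K's own I/O is an assumption to verify; the sketch only shows that "still running, nothing in the file" is what ordinary stdout buffering looks like.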

How to calculate the used memory of multiple instances of a single process using PowerShell?

I get the following result when running the PowerShell command below:
PS C:\> Get-Process svchost
Handles NPM(K) PM(K) WS(K) VM(M) CPU(s) Id ProcessName
------- ------ ----- ----- ----- ------ -- -----------
546 34 18528 14884 136 49.76 260 svchost
357 14 4856 4396 47 18.05 600 svchost
314 17 6088 5388 42 12.62 676 svchost
329 17 10044 8780 50 12.98 764 svchost
1515 49 36104 38980 454 232.04 812 svchost
301 33 9736 6428 54 2.90 832 svchost
328 26 8844 9744 52 4.32 856 svchost
247 18 8144 9912 77 37.50 904 svchost
46 5 1504 968 14 0.02 1512 svchost
278 15 4048 5660 43 3.88 2148 svchost
98 14 2536 2460 35 0.66 2504 svchost
Here I'm trying to calculate the total memory size, PM(K), of the processes. I have the following line in my ps1 script file:
get-process svchost | foreach {$mem=("{0:N2}MB " -f ($_.pm/1mb))}
It gives the output in the following format
17.58MB 4.79MB 6.05MB 9.99MB 35.29MB 9.56MB 8.64MB 7.95MB 1.47MB 3.95MB 2.48MB
but I need the total size as a single value, like 107.75MB.
How can I calculate the total memory used by the svchost processes?
Thanks
You can use the Measure-Object cmdlet
$measure = Get-Process svchost | Measure-Object PM -Sum
$mem = ("{0:N2}MB " -f ($measure.sum / 1mb))
Also, you can calculate the total size of the entire collection using the += syntax
$mem = 0
Get-Process svchost | %{$mem += $_.pm}
"{0:N2}MB " -f ($mem/1mb)
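Both approaches do the same arithmetic: sum the raw values first, then format once. As a worked check, here is a small Python sketch (`total_mb_label` is a hypothetical helper) that applies this to the PM(K) column of the table in the question, assuming that column is in kilobytes as its header suggests.

```python
# PM(K) values copied from the Get-Process table in the question.
PM_KB = [18528, 4856, 6088, 10044, 36104, 9736, 8844, 8144, 1504, 4048, 2536]


def total_mb_label(pm_kb):
    """Sum paged-memory values given in KB and format the total in MB."""
    total_kb = sum(pm_kb)          # add the raw numbers first
    return "{:.2f}MB".format(total_kb / 1024)  # convert and format once


print(total_mb_label(PM_KB))  # 107.84MB
```

The result differs slightly from the 107.75MB mentioned in the question; the per-process values listed there do not match the table, so they evidently came from a different snapshot.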
