Saving text file info into clojure data structure

I have the following data in a .txt file:
1|John Smith|123 Here Street|456-4567
2|Sue Jones|43 Rose Court Street|345-7867
3|Fan Yuhong|165 Happy Lane|345-4533
I get the data and convert it to a vector using the following code:
(def custContents (slurp "cust.txt"))
(def custVector (clojure.string/split custContents #"\||\n"))
(def testing (into [] (partition 4 custVector )))
Which gives me the following vector:
[(1 John Smith 123 Here Street 456-4567) (2 Sue Jones 43 Rose Court Street
345-7867) (3 Fan Yuhong 165 Happy Lane 345-4533)]
I would like to convert it into a vector of vectors like this:
[[1 John Smith 123 Here Street 456-4567] [2 Sue Jones 43 Rose Court Street
345-7867] [3 Fan Yuhong 165 Happy Lane 345-4533]]

I would do it slightly differently: first break the input up into lines, then process each line. This also keeps the regex simpler:
(ns tst.demo.core
  (:require
    [clojure.string :as str]))

(def data
  "1|John Smith|123 Here Street|456-4567
2|Sue Jones|43 Rose Court Street|345-7867
3|Fan Yuhong|165 Happy Lane|345-4533")

(let [lines       (str/split-lines data)
      line-vecs-1 (mapv #(str/split % #"\|") lines)
      line-vecs-2 (mapv #(str/split % #"[|]") lines)]
  ...)
with result:
lines => ["1|John Smith|123 Here Street|456-4567"
"2|Sue Jones|43 Rose Court Street|345-7867"
"3|Fan Yuhong|165 Happy Lane|345-4533"]
line-vecs-1 =>
[["1" "John Smith" "123 Here Street" "456-4567"]
["2" "Sue Jones" "43 Rose Court Street" "345-7867"]
["3" "Fan Yuhong" "165 Happy Lane" "345-4533"]]
line-vecs-2 =>
[["1" "John Smith" "123 Here Street" "456-4567"]
["2" "Sue Jones" "43 Rose Court Street" "345-7867"]
["3" "Fan Yuhong" "165 Happy Lane" "345-4533"]]
Note that there are two ways of writing the regex. line-vecs-1 uses a regex where the pipe character is escaped with a backslash (#"\|"). Since escaping rules vary between platforms (e.g. in a Java string literal you would need "\\|"), line-vecs-2 instead uses a character class containing a single pipe (#"[|]"), which sidesteps the need to escape it.
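Putting the pieces together with the file-based workflow from the question (a minimal sketch; cust.txt is the filename used in the question):

(defn parse-customers
  "Read a pipe-delimited file and return a vector of field vectors."
  [path]
  (->> (slurp path)
       (str/split-lines)
       (mapv #(str/split % #"[|]"))))

;; (parse-customers "cust.txt")
;; => [["1" "John Smith" "123 Here Street" "456-4567"] ...]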
Update
Other Clojure Learning Resources:
Brave Clojure
Clojure CheatSheet
ClojureDocs.org
Clojure-Doc.org (similar but different)

Alternatively, keeping your original approach, just convert each partition to a vector:
> (mapv vec testing)
=> [["1" "John Smith" "123 Here Street" "456-4567"]
["2" "Sue Jones" "43 Rose Court Street" "345-7867"]
["3" "Fan Yuhong" "165 Happy Lane" "345-4533"]]

Related

shuf command to extract lines with spaces in CSV file

I have a CSV file with a set of 1000 addresses. I used the shuf command to pick 10 lines at random for a process. Since the addresses contain spaces, all 10 addresses end up in a single element of the array rather than in 10 different elements. Please help me resolve the issue.
Sample CSV
from_address
"303 Co Rd 405, Floresville, TX 78114,US"
"4422 Oakside Dr, Houston, TX 77053,US"
"4218 S 245th Ct, Kent, WA 98032,US"
"1407 Marion Manor Dr, Marion, VA 24354,US"
"7400 Englewood Ave, Yakima, WA 98908,US"
"8012 Burly Wood Way, Hampton, GA 30253,US"
"931 Beacon Square Ct, Gaithersburg, MD 20878,US"
"12 Truval la, Nesconset, NY 11767,US"
"121 Pet Rock Ct, Clayton, NC 27520,US"
"235 Whitaker Rd, Westfield, PA 16950,US"
"13422 NE 133rd St, Kirkland, WA 98034,US"
"1620 27th St NW, Canton, OH 44709,US"
"488 Andrews Rd, Columbus, GA 31903,US"
"4742 Janet Ln, Bethlehem, PA 18017,US"
"2622 Cherokee Ct, West Palm Beach, FL 33406,US"
"111 Westbury Ct, Doylestown, PA 18901,US"
"820 Main St, Belpre, OH 45714,US"
"1307 Stevenson Ln, Towson, MD 21286,US"
"2725 Hartford Rd, East York, PA 17402,US"
"9 Winding Brook Rd, Rhinebeck, NY 12572,US"
"433 Willowbrook Dr, Norristown, PA 19403,US"
"208 N Kayla Dr, Granite Quarry, NC 28146,US"
"931 Pimlico Dr, Centerville, OH 45459,US"
Shell Script
list_=("$(shuf -n 10 sample_addresses.csv)")
echo ${#list_[@]}
Expected Result
10
Actual Result
1
list_=("$(shuf -n 10 sample_addresses.csv)")
That creates an array with a single element, because the quoted command substitution expands to one string.
To read the lines into an array, use the mapfile command:
mapfile -t list_ < <(shuf -n 10 sample_addresses.csv)
A good way to inspect the contents of a variable is
declare -p list_
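A minimal end-to-end sketch of the corrected script (assuming the same sample_addresses.csv from the question):

#!/usr/bin/env bash
# Read 10 random lines into an array, one address per element
mapfile -t list_ < <(shuf -n 10 sample_addresses.csv)

echo "${#list_[@]}"    # now prints 10
declare -p list_       # inspect the array contents

# Iterate over the addresses without splitting on the spaces inside them
for addr in "${list_[@]}"; do
    printf '%s\n' "$addr"
done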

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers

from os import listdir
from os.path import isfile, join
from datasets import load_dataset
from transformers import BertTokenizer

test_files = [join('./test/', f) for f in listdir('./test') if isfile(join('./test', f))]
dataset = load_dataset('json', data_files={"test": test_files}, cache_dir="./.cache_dir")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def encode(batch):
    return tokenizer.encode_plus(batch["abstract"], max_length=32, add_special_tokens=True, pad_to_max_length=True,
                                 return_attention_mask=True, return_token_type_ids=False, return_tensors="pt")

dataset.set_transform(encode)
When I run this code, I get
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Instead of having a list of strings, I have a list of lists of strings. Here is the content of batch["article"]:
[['eleven politicians from 7 parties made comments in letter to a newspaper .', "said dpp alison saunders had ` damaged public confidence ' in justice .", 'ms saunders ruled lord janner unfit to stand trial over child abuse claims .', 'the cps has pursued at least 19 suspected paedophiles with dementia .'], ['an increasing number of surveys claim to reveal what makes us happiest .', 'but are these generic lists really of any use to us ?', 'janet street-porter makes her own list - of things making her unhappy !'], ["author of ` into the wild ' spoke to five rape victims in missoula , montana .", "` missoula : rape and the justice system in a college town ' was released april 21 .", "three of five victims profiled in the book sat down with abc 's nightline wednesday night .", 'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football players .', "huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .", 'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable cause .', 'mr krakauer wrote book after realizing close friend was a rape victim .'], ['tesco announced a record annual loss of £ 6.38 billion yesterday .', 'drop in sales , one-off costs and pensions blamed for financial loss .', 'supermarket giant now under pressure to close 200 stores nationwide .', 'here , retail industry veterans , plus mail writers , identify what went wrong .'], ..., ['snp leader said alex salmond did not field questions over his family .', "said she was not ` moaning ' but also attacked criticism of women 's looks .", 'she made the remarks in latest programme profiling the main party leaders .', 'ms sturgeon also revealed her tv habits and recent image makeover .', 'she said she relaxed by eating steak and chips on a saturday night .']]
How could I fix this issue?
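One possible fix (a hedged sketch, not from the original post, assuming the sentences live in the "abstract" field used by the code above): join each example's list of sentences into a single string before tokenizing, so the tokenizer receives a plain list of strings.

def encode(batch):
    # batch["abstract"] is a list of examples, each a list of sentence strings;
    # join each example's sentences so the tokenizer gets a list of plain strings
    texts = [" ".join(sentences) for sentences in batch["abstract"]]
    return tokenizer(texts, max_length=32, padding="max_length", truncation=True,
                     return_attention_mask=True, return_token_type_ids=False,
                     return_tensors="pt")

dataset.set_transform(encode)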

Assigning people to shopping days by last name for COVID

A question from an optimization mailing list:
With COVID-19 my local authorities have experimented with trying to assign people a day of the week to go shopping. They've simply divided by alphabetical order disregarding the frequency with which a surname is used in my town, with terrible results and
hours of queues. Is there a better way? (The supermarket is open 5 days a week.)
Note: Typically only a list of last names (possibly only the popular last names) is available. Communicating who should shop when in a straightforward fashion is important.
While it's possible to set this up as a mixed-integer programming problem (MIP), this isn't the best idea. Given the number of variables involved, the MIP solver will likely spend a long time proving an optimal answer. Sure, you may be able to cut the solver off after a time limit and get an acceptable approximation, but there are no guarantees.
The best way to solve this problem is via dynamic programming, which finds the optimal answer extremely quickly. First, we note that we want to break the alphabetically ordered list of n names, with weights w(0), ..., w(n-1), into contiguous groups so as to minimize the maximum total weight of any group. Call the best achievable value when the first n names are divided using k split points (that is, into k+1 groups) V(n,k).
Next, we note that for a potential partition point i, the value of making the last split at that point is:
    max( V(i, k-1), w(i) + w(i+1) + ... + w(n-1) )
Therefore, we can write V(n,k) as:
    V(n,k) = min over 0 ≤ i < n of max( V(i, k-1), sum(w[i:n]) )
We have a few edge cases. If n ≤ 0, then we have an empty group; since this is unacceptable, we return ∞. If k = 0, then there is nothing left to split, so we simply return the sum of the subarray. This finally gives us the full functional form of the problem:
    V(n,k) = ∞                                                     if n ≤ 0
    V(n,k) = w(0) + w(1) + ... + w(n-1)                            if k = 0
    V(n,k) = min over 0 ≤ i < n of max( V(i, k-1), sum(w[i:n]) )   otherwise
Now, we gather some data on the distribution of last names, say from the census and implement the recursion in Python, like so:
#!/usr/bin/env python3
import functools
import pandas as pd


class Partitioner:
    def __init__(self):
        self._weights = None

    @functools.lru_cache(maxsize=None)
    def _subarraysum(self, i, n):
        """Memoized sum of the range [i,n)"""
        return sum(self._weights[i:n])

    @functools.lru_cache(maxsize=None)
    def _V(self, n, k):
        """Find the best split for the values n,k
        Returns (Value, [List of split points])
        """
        if n <= 0:   # Must have at least one element in each subarray
            return float("inf"), []
        if k == 0:   # We've placed all the separators
            return self._subarraysum(0, n), []
        # Look through all the split points and find the best one
        val = float("inf"), []
        for i in range(n):
            vali, splitsi = self._V(i, k - 1)           # Best split for [0,i)
            vali = max(vali, self._subarraysum(i, n))   # Max of that and the value of [i,n)
            if vali < val[0]:
                val = vali, splitsi + [i]
        return val

    def run(self, weights, k):
        """Return a (Value, Splits) tuple for the best partitioning of N objects with
        the given weights into k partitions"""
        self._V.cache_clear()
        self._subarraysum.cache_clear()
        self._weights = weights
        # Subtract 1 from k because we calculate using separators
        return self._V(len(weights), k - 1)


def GetShortNameDataFrame(df, k):
    """Return a copy of df in which names are abbreviated to `k` characters long"""
    df = df.copy()
    df['names'] = df['names'].str.slice(0, k)
    return df.groupby('names').agg({"counts": sum}).reset_index()


def PrintSplits(df, splits):
    """Pretty-print the name ranges"""
    p = [0] + splits + [len(df)]
    for i in range(len(p) - 1):
        startname = df['names'][p[i]]
        endname = df['names'][p[i + 1] - 1]
        count = sum(df['counts'][p[i]:p[i + 1]])
        print(f"{startname:>10}-{endname:>10} {count}")


def ExploreNames(namecounts, kmax):
    """Explore shortening names to lengths of [1,kmax] characters"""
    for k in range(1, kmax + 1):
        print("k", k)
        snames = GetShortNameDataFrame(namecounts, k)
        # Find the optimal way to partition the names into 5 bins
        val, splits = Partitioner().run(snames['counts'].tolist(), 5)
        print(f"Max count: {val}")
        PrintSplits(snames, splits)
#Source: https://www.census.gov/topics/population/genealogy/data/2010_surnames.html
names = "SMITH JOHNSON WILLIAMS BROWN JONES GARCIA MILLER DAVIS RODRIGUEZ MARTINEZ HERNANDEZ LOPEZ GONZALEZ WILSON ANDERSON THOMAS TAYLOR MOORE JACKSON MARTIN LEE PEREZ THOMPSON WHITE HARRIS SANCHEZ CLARK RAMIREZ LEWIS ROBINSON WALKER YOUNG ALLEN KING WRIGHT SCOTT TORRES NGUYEN HILL FLORES GREEN ADAMS NELSON BAKER HALL RIVERA CAMPBELL MITCHELL CARTER ROBERTS GOMEZ PHILLIPS EVANS TURNER DIAZ PARKER CRUZ EDWARDS COLLINS REYES STEWART MORRIS MORALES MURPHY COOK ROGERS GUTIERREZ ORTIZ MORGAN COOPER PETERSON BAILEY REED KELLY HOWARD RAMOS KIM COX WARD RICHARDSON WATSON BROOKS CHAVEZ WOOD JAMES BENNETT GRAY MENDOZA RUIZ HUGHES PRICE ALVAREZ CASTILLO SANDERS PATEL MYERS LONG ROSS FOSTER JIMENEZ POWELL JENKINS PERRY RUSSELL SULLIVAN BELL COLEMAN BUTLER HENDERSON BARNES GONZALES FISHER VASQUEZ SIMMONS ROMERO JORDAN PATTERSON ALEXANDER HAMILTON GRAHAM REYNOLDS GRIFFIN WALLACE MORENO WEST COLE HAYES BRYANT HERRERA GIBSON ELLIS TRAN MEDINA AGUILAR STEVENS MURRAY FORD CASTRO MARSHALL OWENS HARRISON FERNANDEZ MCDONALD WOODS WASHINGTON KENNEDY WELLS VARGAS HENRY CHEN FREEMAN WEBB TUCKER GUZMAN BURNS CRAWFORD OLSON SIMPSON PORTER HUNTER GORDON MENDEZ SILVA SHAW SNYDER MASON DIXON MUNOZ HUNT HICKS HOLMES PALMER WAGNER BLACK ROBERTSON BOYD ROSE STONE SALAZAR FOX WARREN MILLS MEYER RICE SCHMIDT GARZA DANIELS FERGUSON NICHOLS STEPHENS SOTO WEAVER RYAN GARDNER PAYNE GRANT DUNN KELLEY SPENCER HAWKINS ARNOLD PIERCE VAZQUEZ HANSEN PETERS SANTOS HART BRADLEY KNIGHT ELLIOTT CUNNINGHAM DUNCAN ARMSTRONG HUDSON CARROLL LANE RILEY ANDREWS ALVARADO RAY DELGADO BERRY PERKINS HOFFMAN JOHNSTON MATTHEWS PENA RICHARDS CONTRERAS WILLIS CARPENTER LAWRENCE SANDOVAL GUERRERO GEORGE CHAPMAN RIOS ESTRADA ORTEGA WATKINS GREENE NUNEZ WHEELER VALDEZ HARPER BURKE LARSON SANTIAGO MALDONADO MORRISON FRANKLIN CARLSON AUSTIN DOMINGUEZ CARR LAWSON JACOBS OBRIEN LYNCH SINGH VEGA BISHOP MONTGOMERY OLIVER JENSEN HARVEY WILLIAMSON GILBERT DEAN SIMS ESPINOZA HOWELL LI WONG REID HANSON LE MCCOY GARRETT BURTON FULLER WANG WEBER WELCH ROJAS LUCAS MARQUEZ FIELDS PARK YANG LITTLE BANKS PADILLA DAY WALSH BOWMAN SCHULTZ LUNA FOWLER MEJIA DAVIDSON ACOSTA BREWER MAY HOLLAND JUAREZ NEWMAN PEARSON CURTIS CORTEZ DOUGLAS SCHNEIDER JOSEPH BARRETT NAVARRO FIGUEROA KELLER AVILA WADE MOLINA STANLEY HOPKINS CAMPOS BARNETT BATES CHAMBERS CALDWELL BECK LAMBERT MIRANDA BYRD CRAIG AYALA LOWE FRAZIER POWERS NEAL LEONARD GREGORY CARRILLO SUTTON FLEMING RHODES SHELTON SCHWARTZ NORRIS JENNINGS WATTS DURAN WALTERS COHEN MCDANIEL MORAN PARKS STEELE VAUGHN BECKER HOLT DELEON BARKER TERRY HALE LEON HAIL BENSON HAYNES HORTON MILES LYONS PHAM GRAVES BUSH THORNTON WOLFE WARNER CABRERA MCKINNEY MANN ZIMMERMAN DAWSON LARA FLETCHER PAGE MCCARTHY LOVE ROBLES CERVANTES SOLIS ERICKSON REEVES CHANG KLEIN SALINAS FUENTES BALDWIN DANIEL SIMON VELASQUEZ HARDY HIGGINS AGUIRRE LIN CUMMINGS CHANDLER SHARP BARBER BOWEN OCHOA DENNIS ROBBINS LIU RAMSEY FRANCIS GRIFFITH PAUL BLAIR OCONNOR CARDENAS PACHECO CROSS CALDERON QUINN MOSS SWANSON CHAN RIVAS KHAN RODGERS SERRANO FITZGERALD ROSALES STEVENSON CHRISTENSEN MANNING GILL CURRY MCLAUGHLIN HARMON MCGEE GROSS DOYLE GARNER NEWTON BURGESS REESE WALTON BLAKE TRUJILLO ADKINS BRADY GOODMAN ROMAN WEBSTER GOODWIN FISCHER HUANG POTTER DELACRUZ MONTOYA TODD WU HINES MULLINS CASTANEDA MALONE CANNON TATE MACK SHERMAN HUBBARD HODGES ZHANG GUERRA WOLF VALENCIA SAUNDERS FRANCO ROWE GALLAGHER FARMER HAMMOND HAMPTON TOWNSEND INGRAM WISE GALLEGOS CLARKE BARTON SCHROEDER MAXWELL WATERS LOGAN CAMACHO STRICKLAND NORMAN PERSON COLON PARSONS FRANK HARRINGTON GLOVER 
OSBORNE BUCHANAN CASEY FLOYD PATTON IBARRA BALL TYLER SUAREZ BOWERS OROZCO SALAS COBB GIBBS ANDRADE BAUER CONNER MOODY ESCOBAR MCGUIRE LLOYD MUELLER HARTMAN FRENCH KRAMER MCBRIDE POPE LINDSEY VELAZQUEZ NORTON MCCORMICK SPARKS FLYNN YATES HOGAN MARSH MACIAS VILLANUEVA ZAMORA PRATT STOKES OWEN BALLARD LANG BROCK VILLARREAL CHARLES DRAKE BARRERA CAIN PATRICK PINEDA BURNETT MERCADO SANTANA SHEPHERD BAUTISTA ALI SHAFFER LAMB TREVINO MCKENZIE HESS BEIL OLSEN COCHRAN MORTON NASH WILKINS PETERSEN BRIGGS SHAH ROTH NICHOLSON HOLLOWAY LOZANO RANGEL FLOWERS HOOVER SHORT ARIAS MORA VALENZUELA BRYAN MEYERS WEISS UNDERWOOD BASS GREER SUMMERS HOUSTON CARSON MORROW CLAYTON WHITAKER DECKER YODER COLLIER ZUNIGA CAREY WILCOX MELENDEZ POOLE ROBERSON LARSEN CONLEY DAVENPORT COPELAND MASSEY LAM HUFF ROCHA CAMERON JEFFERSON HOOD MONROE ANTHONY PITTMAN HUYNH RANDALL SINGLETON KIRK COMBS MATHIS CHRISTIAN SKINNER BRADFORD RICHARD GALVAN WALL BOONE KIRBY WILKINSON BRIDGES BRUCE ATKINSON VELEZ MEZA ROY VINCENT YORK HODGE VILLA ABBOTT ALLISON TAPIA GATES CHASE SOSA SWEENEY FARRELL WYATT DALTON HORN BARRON PHELPS YU DICKERSON HEATH FOLEY ATKINS MATHEWS BONILLA ACEVEDO BENITEZ ZAVALA HENSLEY GLENN CISNEROS HARRELL SHIELDS RUBIO HUFFMAN CHOI BOYER GARRISON ARROYO BOND KANE HANCOCK CALLAHAN DILLON CLINE WIGGINS GRIMES ARELLANO MELTON ONEILL SAVAGE HO BELTRAN PITTS PARRISH PONCE RICH BOOTH KOCH GOLDEN WARE BRENNAN MCDOWELL MARKS CANTU HUMPHREY BAXTER SAWYER CLAY TANNER HUTCHINSON KAUR BERG WILEY GILMORE RUSSO VILLEGAS HOBBS KEITH WILKERSON AHMED BEARD MCCLAIN MONTES MATA ROSARIO VANG WALTER HENSON ONEAL MOSLEY MCCLURE BEASLEY STEPHENSON SNOW HUERTA PRESTON VANCE BARRY JOHNS EATON BLACKWELL DYER PRINCE MACDONALD SOLOMON GUEVARA STAFFORD ENGLISH HURST WOODARD CORTES SHANNON KEMP NOLAN MCCULLOUGH MERRITT MURILLO MOON SALGADO STRONG KLINE CORDOVA BARAJAS ROACH ROSAS WINTERS JACOBSON LESTER KNOX BULLOCK KERR LEACH MEADOWS ORR DAVILA WHITEHEAD PRUITT KENT CONWAY MCKEE BARR DAVID DEJESUS MARIN BERGER MCINTYRE BLANKENSHIP GAINES PALACIOS CUEVAS BARTLETT DURHAM DORSEY MCCALL ODONNELL STEIN BROWNING STOUT LOWERY SLOAN MCLEAN HENDRICKS CALHOUN SEXTON CHUNG GENTRY HULL DUARTE ELLISON NIELSEN GILLESPIE BUCK MIDDLETON SELLERS LEBLANC ESPARZA HARDIN BRADSHAW MCINTOSH HOWE LIVINGSTON FROST GLASS MORSE KNAPP HERMAN STARK BRAVO NOBLE SPEARS WEEKS CORONA FREDERICK BUCKLEY MCFARLAND HEBERT ENRIQUEZ HICKMAN QUINTERO RANDOLPH SCHAEFER WALLS TREJO HOUSE REILLY PENNINGTON MICHAEL CONRAD GILES BENJAMIN CROSBY FITZPATRICK DONOVAN MAYS MAHONEY VALENTINE RAYMOND MEDRANO HAHN MCMILLAN SMALL BENTLEY FELIX PECK LUCERO BOYLE HANNA PACE RUSH HURLEY HARDING MCCONNELL BERNAL NAVA AYERS EVERETT VENTURA AVERY PUGH MAYER BENDER SHEPARD MCMAHON LANDRY CASE SAMPSON MOSES MAGANA BLACKBURN DUNLAP GOULD DUFFY VAUGHAN HERRING MCKAY ESPINOSA RIVERS FARLEY BERNARD ASHLEY FRIEDMAN POTTS TRUONG COSTA CORREA BLEVINS NIXON CLEMENTS FRY DELAROSA BEST BENTON LUGO PORTILLO DOUGHERTY CRANE HALEY PHAN VILLALOBOS BLANCHARD HORNE FINLEY QUINTANA LYNN ESQUIVEL BEAN DODSON MULLEN XIONG HAYDEN CANO LEVY HUBER RICHMOND MOYER LIM FRYE SHEPPARD MCCARTY AVALOS BOOKER WALLER PARRA WOODWARD JARAMILLO KRUEGER RASMUSSEN BRANDT PERALTA DONALDSON STUART FAULKNER MAYNARD GALINDO COFFEY ESTES SANFORD BURCH MADDOX VO OCONNELL VU ANDERSEN SPENCE MCPHERSON CHURCH SCHMITT STANTON LEAL CHERRY COMPTON DUDLEY SIERRA POLLARD ALFARO HESTER PROCTOR LU HINTON NOVAK GOOD MADDEN MCCANN TERRELL JARVIS DICKSON REYNA CANTRELL MAYO BRANCH HENDRIX ROLLINS ROWLAND WHITNEY DUKE ODOM DAUGHERTY TRAVIS TANG ARCHER"
counts = "2_442_977 1_932_812 1_625_252 1_437_026 1_425_470 1_166_120 1_161_437 1_116_357 1_094_924 1_060_159 1_043_281 874_523 841_025 801_882 784_404 756_142 751_209 724_374 708_099 702_625 693_023 681_645 664_644 660_491 624_252 612_752 562_679 557_423 531_781 529_821 523_129 484_447 482_607 465_422 458_980 439_530 437_813 437_645 434_827 433_969 430_182 427_865 424_958 419_586 407_076 391_114 386_157 384_486 376_966 376_774 365_655 360_802 355_593 348_627 347_636 336_221 334_201 332_423 329_770 327_904 324_957 318_884 311_777 308_417 302_589 302_261 293_218 286_899 286_280 280_791 278_297 277_845 277_030 267_394 264_826 263_464 262_352 261_231 260_464 259_798 252_579 251_663 250_898 250_715 249_379 247_599 246_116 242_771 238_234 236_271 235_251 233_983 230_420 230_374 229_973 229_895 229_374 229_368 227_764 227_118 224_874 222_653 221_741 221_558 220_990 220_599 219_070 218_847 218_393 218_241 214_758 214_703 212_781 210_182 208_614 208_403 205_423 204_621 201_746 201_159 200_247 198_406 197_276 196_925 195_818 195_289 194_246 192_773 192_711 190_667 188_968 188_498 188_497 186_512 185_674 184_910 184_832 184_134 183_922 182_719 181_091 180_842 180_497 177_425 177_386 176_865 176_230 173_835 170_964 169_580 169_149 168_878 167_446 167_044 165_925 164_457 164_035 163_181 163_054 162_440 161_833 161_717 161_633 160_400 160_262 160_213 159_480 158_483 158_421 158_320 156_780 156_601 155_795 154_738 153_666 153_469 153_397 153_329 152_703 152_334 152_147 151_942 150_895 149_500 147_034 147_005 146_570 146_426 145_584 144_646 144_451 143_837 143_452 142_894 142_601 142_277 141_427 140_693 139_951 139_751 138_893 138_629 138_322 137_977 137_513 137_232 137_184 136_720 136_713 135_765 135_718 135_187 135_044 134_963 134_317 134_227 133_872 133_799 133_501 133_171 132_985 132_812 131_440 131_401 131_373 131_303 130_776 130_529 130_164 130_152 129_898 129_699 128_948 128_677 128_625 127_939 127_794 127_470 127_256 127_083 126_101 125_350 125_058 124_995 124_461 122_877 122_587 122_212 121_526 121_130 120_621 120_552 119_706 119_304 119_076 119_053 118_614 118_557 117_708 116_749 116_673 116_618 115_953 115_900 115_679 115_662 114_959 114_940 114_030 113_374 112_154 112_041 111_786 111_371 111_360 111_144 110_967 110_744 110_697 110_529 110_116 109_883 109_433 108_987 108_421 107_690 107_533 107_522 106_696 106_033 105_936 105_833 105_365 105_091 105_079 105_007 104_888 104_518 104_515 104_057 103_930 103_418 103_318 103_306 102_538 101_949 101_931 101_836 101_801 101_694 101_458 101_290 100_959 100_104 99_807 98_468 98_268 97_314 97_040 96_979 96_867 96_810 96_111 95_681 95_622 94_988 93_944 93_786 93_678 93_628 92_904 92_507 92_463 92_260 92_152 91_970 91_694 91_475 91_384 91_129 90_964 90_677 90_670 90_517 90_071 89_796 89_700 89_649 89_401 89_376 89_091 88_728 88_615 88_586 88_230 88_060 87_859 87_531 87_414 87_162 87_000 86_618 86_363 86_240 86_081 85_974 85_195 84_942 84_516 84_320 84_179 84_018 83_967 83_928 83_781 83_621 83_616 83_510 83_265 83_182 83_067 83_063 82_992 82_950 82_873 82_458 82_161 82_146 82_085 81_978 81_939 81_471 81_156 81_006 80_742 80_526 80_460 80_364 80_252 79_803 79_517 79_508 79_316 79_186 78_990 78_848 78_822 78_677 78_482 78_381 78_370 78_350 78_327 78_260 78_256 78_026 77_923 77_652 77_642 77_557 77_085 76_986 76_908 76_897 76_664 76_205 76_171 76_095 75_996 75_356 75_185 75_169 75_143 74_949 74_948 74_919 74_816 74_737 74_542 74_503 74_458 74_324 74_092 73_931 73_919 73_854 73_797 73_664 73_599 73_145 73_136 72_918 72_625 72_451 72_357 72_328 72_175 72_109 
71_844 71_759 71_721 71_717 71_646 71_368 71_286 71_085 71_058 71_056 70_502 70_362 70_223 70_125 70_071 70_031 70_000 69_943 69_943 69_879 69_834 69_617 69_515 69_472 69_360 69_345 68_649 68_373 68_281 68_233 67_977 67_961 67_929 67_909 67_893 67_769 67_704 67_411 67_338 67_310 67_304 66_959 66_858 66_827 66_648 66_556 66_454 66_293 66_063 66_059 66_056 66_013 66_003 65_904 65_468 65_125 65_064 65_037 65_004 64_572 64_429 64_403 64_327 64_202 64_191 64_106 63_991 63_936 63_899 63_881 63_760 63_736 63_722 63_649 63_440 63_400 63_254 63_085 62_304 62_227 61_883 61_729 61_671 61_639 61_630 61_625 61_529 61_369 61_355 61_211 61_162 60_998 60_948 60_845 60_820 60_791 60_761 60_667 60_479 60_264 60_002 59_943 59_913 59_882 59_595 59_486 59_463 59_356 59_350 59_213 58_714 58_634 58_480 58_408 58_287 58_278 58_151 58_040 57_779 57_549 57_549 57_497 57_477 57_477 57_464 57_383 57_143 57_127 57_112 57_064 57_044 57_043 56_953 56_900 56_872 56_840 56_638 56_616 56_576 56_410 56_380 56_347 56_322 56_286 56_230 56_226 56_180 55_960 55_917 55_895 55_850 55_595 55_554 55_484 55_251 55_240 55_179 55_174 55_136 55_114 55_021 54_996 54_764 54_621 54_394 54_257 54_217 54_198 54_046 54_015 53_893 53_822 53_794 53_792 53_767 53_739 53_682 53_419 53_376 53_265 53_230 53_159 53_095 53_059 52_920 52_817 52_739 52_701 52_651 52_569 52_481 52_457 52_410 52_321 52_211 52_184 52_138 52_070 52_044 52_035 51_889 51_877 51_865 51_671 51_592 51_475 51_351 51_288 51_153 51_081 51_043 50_920 50_837 50_832 50_788 50_786 50_786 50_742 50_686 50_614 50_610 50_584 50_558 50_524 50_465 50_258 50_247 50_245 50_104 50_069 50_028 49_914 49_817 49_776 49_740 49_733 49_549 49_481 49_402 49_395 49_360 49_316 49_238 49_217 49_177 49_126 49_056 49_033 49_028 48_844 48_813 48_781 48_753 48_746 48_720 48_719 48_696 48_599 48_522 48_487 48_444 48_319 48_207 48_165 48_142 48_120 48_051 48_036 48_024 48_013 47_979 47_963 47_742 47_693 47_641 47_528 47_455 47_367 47_324 47_274 47_246 47_184 47_175 47_170 47_168 46_717 46_534 46_454 46_394 46_393 46_244 46_240 46_229 46_147 46_146 46_054 45_852 45_594 45_558 45_528 45_469 45_432 45_390 45_305 45_153 45_019 44_938 44_914 44_808 44_784 44_742 44_740 44_711 44_581 44_500 44_388 44_388 44_373 44_365 44_325 44_320 44_137 44_130 44_040 44_038 43_904 43_851 43_842 43_830 43_821 43_798 43_701 43_648 43_635 43_631 43_483 43_460 43_389 43_329 43_305 43_278 43_261 43_260 43_197 43_180 43_133 43_110 43_027 43_018 42_983 42_827 42_773 42_693 42_639 42_578 42_577 42_575 42_559 42_469 42_465 42_379 42_265 42_103 42_015 41_802 41_774 41_771 41_750 41_735 41_700 41_667 41_665 41_565 41_553 41_394 41_348 41_300 41_275 41_271 41_163 41_158 41_129 41_063 41_025 41_021 41_000 40_884 40_854 40_736 40_707 40_598 40_590 40_563 40_449 40_410 40_408 40_397 40_395 40_275 40_261 40_250 40_237 40_212 40_193 40_165 40_055 39_986 39_921 39_890 39_879 39_802 39_796 39_787 39_754 39_693 39_670 39_623 39_593 39_580 39_564 39_559 39_555 39_551 39_430 39_411 39_391 39_319 39_277 39_216 39_105 39_097 39_063 38_924 38_835 38_830 38_733 38_681 38_667 38_662 38_528 38_512 38_499 38_374 38_277 38_267 38_265 38_232 38_229 38_147 38_044 38_029 37_932 37_923 37_912 37_903 37_890 37_884 37_870 37_858 37_836 37_754 37_695 37_689 37_672 37_657 37_644 37_578 37_571 37_566 37_502 37_499 37_451 37_368 37_228 37_170 37_053 37_050 37_021 36_973 36_960 36_944 36_922 36_840 36_805 36_765 36_764 36_755 36_743 36_636 36_613 36_585 36_558 36_540 36_466 36_460 36_429 36_423 36_318 36_312 36_269 36_250 36_236 36_194 36_179 36_150 36_129 36_125 
36_072 36_043 35_997 35_958 35_877 35_830 35_781 35_770 35_749 35_725 35_642 35_636 35_628 35_606 35_461 35_446 35_438 35_408 35_408 35_350 35_312 35_291 35_266 35_228 35_225 35_194 35_132 35_121 35_118 35_053 35_020 34_987 34_985 34_961 34_949"
namecounts = pd.DataFrame({
    "names": names.split(),
    "counts": [int(x) for x in counts.split()]
})
namecounts = namecounts.sort_values(by="names").reset_index(drop=True)
ExploreNames(namecounts, 4)
Running this produces the following:
k 1
Max count: 26760957
A- C 22603070
D- H 26468447
I- M 26292942
N- R 18722725
S- Z 26760957
k 2
Max count: 24879502
AB- DA 24470220
DE- HO 23293726
HU- MI 23537487
MO- SA 24667206
SC- ZU 24879502
k 3
Max count: 24291136
ABB- DAV 24281947
DAW- HUG 24186818
HUL- MOO 24055053
MOR- SCH 24033187
SCO- ZUN 24291136
k 4
Max count: 24291136
ABBO- DAVI 24281947
DAWS- HUGH 24186818
HULL- MOOR 24055053
MORA- SCHW 24033187
SCOT- ZUNI 24291136
This shows that, for the top 1000 U.S. surnames, a three-letter prefix is enough to produce an optimal assignment of last names to supermarket shopping days (going to four letters does not improve the maximum count).
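As a quick sanity check of the Partitioner on a toy input (a hypothetical example, not part of the original post), splitting the weights [5, 3, 8, 4] into two groups puts the split just before the 8:

p = Partitioner()
val, splits = p.run([5, 3, 8, 4], 2)
print(val, splits)   # 12 [2]  -> groups [5, 3] (weight 8) and [8, 4] (weight 12)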
They've simply divided by alphabetical order disregarding any surname's initials statistics, with terrible results and hours of queues. Is there a better way?
Probably not.
The first problem is that it has to be easy enough for everyone to figure out and remember. Anything too complex will be a massive disaster; not just because people won't know which day they can go shopping, but also because people at the supermarket won't be able to rapidly determine if a customer is/isn't supposed to be shopping (and if the supermarket doesn't check then customers will ignore the restriction and go shopping whenever they like).
The second problem is that everyone has their own schedule. One person might leave work early on Fridays and normally do their shopping on the way home. Another person might get their pension on Wednesdays so they'll normally do their shopping on Thursday. Another person might teach music lessons on Monday, Wednesday and Friday so they'll want to do their shopping on Tuesday or Thursday. A lot of people might normally do their shopping on Saturday or Sunday because they work Monday to Friday. By assigning everyone a specific day you cause chaos with everyone's normal schedule. For example, all the people who work 9 am to 5 pm on the day they've been assigned will arrive at 8 am or 5:30 pm.
The third problem is that people will worry about using a little too much milk (or meat, or pasta, or toilet paper, or ..) and running out before their next shopping day; and they'll worry about forgetting to buy something (margarine, tomato sauce, sugar, ...) and being without any for 7 whole days. Because of this people will start panic buying and stockpiling, which will result in "abnormally high demand" (and product shortages) while they're building up their stockpile.
In other words, regardless of how you assign people to days, you will get terrible results and hours of queues (especially when it's first introduced).

Ruby ARGF & RegEx: How to split on paragraph carriage return "\r\n" but not end of line "\r\n"

I am trying to pre-process some text using regex in Ruby to feed into a mapper job, and would like to split on the carriage return that denotes a paragraph break.
The text will be coming into the mapper using ARGF.each as part of a Hadoop streaming job:
"\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
"\r\n" # <----- this is where I would like to split
"Precisely such had the paragraph originally stood from the printer's\r\n"
Once I have done this I will chomp the newline/carriage return off each line.
This will look something like this:
ARGF.each do |text|
  paragraph = text.split(INSERT_REGEX_HERE)
  # some more blah will happen beyond here
end
UPDATE:
The desired output then is an array as follows:
[
[0] "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
[1] "Precisely such had the paragraph originally stood from the printer's\r\n"
]
Ultimately what I want is the following array with no carriage returns within the array:
[
[0] "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,"
"daughter of James Stevenson, Esq. of South Park, in the county of"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,"
"1789\""
[1] "Precisely such had the paragraph originally stood from the printer's"
]
Thanks in advance for any insights.
Beware: when you write ARGF.each do |text|, text will be a single line, NOT the whole text block.
You can provide ARGF.each with a custom line separator; it will then return two "lines", which are the two paragraphs in your case.
Try this:
paragraphs = ARGF.each("\r\n\r\n").map{|p| p.gsub("\r\n","")}
This first splits the input into paragraphs, then uses gsub to remove the unwanted line breaks.
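Expanded into the block form of the question's loop (a minimal sketch; what to do with each cleaned paragraph is left to the mapper):

# Iterate paragraph by paragraph instead of line by line
ARGF.each("\r\n\r\n") do |paragraph|
  cleaned = paragraph.gsub("\r\n", "")
  # ... pass `cleaned` on to the mapper here
end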
To split the text use:
result = text.gsub(/(?<!\")\\r\\n|(?<=\\\")\\r\\n/, '').split(/[\r\n]+\"\\r\\n\".*?[\r\n]+/)

using variables in gsub

I have a variable address which for now is a long string containing some unnecessary info, e.g.: "Aboriginal Relations 11th Floor Commerce Place 10155 102 Street Edmonton AB T5J 4G8 Phone 780 427-9658 Fax 780 644-4939 Email gerry.kushlyk#gov.ab.ca"
Aboriginal Relations is in a variable called title, and I'm trying to call address.gsub!(title,''), but it's returning the original string.
I've also tried address.gsub!(/#{title}/,'') and address.gsub!("#{title}",'') but those won't work either. Any ideas?
Sorry, the typo occurred when I typed it into Stack Overflow. Here's the code and the output, copied and pasted:
(this is within a loop, so there will be multiple outputs)
p title
address.gsub!(title,'')
p address
output
"Aboriginal Relations "
"Aboriginal Relations 11th Floor Commerce Place 10155 102 Street Edmonton AB T5J 4G8 Phone 780 427-9658 Fax 780 644-4939 Email gerry.kushlyk#gov.ab.ca"
"Aboriginal Tourism Advisory Council "
"Aboriginal Tourism Advisory Council 5th Floor Terrace Building 9515 107 Street Edmonton AB T5K 2C3 Phone 780 427-9687 Fax 780 422-7235 Email foip.fintprccs#gov.ab.ca"
"Acadia Foundation "
"Acadia Foundation PO Box 96 Oyen AB T0J 2J0 Phone 403 664-3384 Fax 403 664-3316 Email acadiafoundation#telus.net"
"Access Advisory Council "
"Access Advisory Council 12th Floor Centre West Building 10035 108 Street Edmonton AB T5J 3E1 Phone 780 427-2805 Fax 780 422-3204 Email barb.joyner#gov.ab.ca"
"ACCM Benevolent Association "
"ACCM Benevolent Association Suite 100 9403 95 Avenue Edmonton AB T6C 4M7 Phone 780 468-4648 Fax 780 468-4648 Email accmmanor#shaw.ca"
"Acme Municipal Library "
"Acme Municipal Library PO Box 326 Acme AB T0M 0A0 Phone 403 546-3845 Fax 403 546-2248 Email aamlibrary#marigold.ab.ca"
Likewise, if I try address.match(/#{title}/) I get nil.
I'm assuming you're using Ruby 1.9 or higher.
It's possible that the trailing whitespace is a non-breaking space:
p "Relations\u00a0" # looks like a trailing space, but strip won't remove it
to get rid of it:
"Relations\u00a0".gsub!(/^\u00a0|\u00a0$/, '') # => "Relations"
A more generic solution for all unicode whitespace:
"Relations\u00a0".gsub!(/^[[:space:]]|[[:space:]]$/, '') # => "Relations"
To see what the character is in your case:
title[-1].ord # => 160 (example only)
'%x' % title[-1].ord # => "a0" (hex equivalent; example only)
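Applied back to the loop in the question (a minimal sketch, assuming the title and address variables from above):

# Strip leading/trailing Unicode whitespace (including U+00A0) from the title
# before removing it from the address
clean_title = title.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
address.gsub!(clean_title, '')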
title = title[0..-2] seemed to solve it. For some reason strip and chomp wouldn't work.
