sort through text file numerically by numbers in column

i have the following text file:
product whole gi 664313980 , location int { from 1990 , to 2895 , strand minus
product whole gi 664591270 , location int { from 567 , to 1319 , strand minus
product whole gi 664591271 , location int { from 1368 , to 1976 , strand minus
product whole gi 664591272 , location int { from 3032 , to 3574 , strand minus
product whole gi 664591273 , location int { from 3631 , to 4062 , strand minus
product whole gi 664591274 , location int { from 6285 , to 7448 , strand minus
product whole gi 664591275 , location int { from 7445 , to 8353 , strand minus
product whole gi 664591277 , location int { from 8350 , to 9108 , strand minus
product whole gi 664591278 , location int { from 9327 , to 11525
product whole gi 664591280 , location int { from 11697 , to 13364
product whole gi 664591282 , location int { from 13505 , to 14065 , strand minus
product whole gi 664591283 , location int { from 14131 , to 16107 , strand minus
product whole gi 664591285 , location int { from 16276 , to 17493 , strand minus
product whole gi 664591286 , location int { from 17490 , to 18056 , strand minus
product whole gi 664591288 , location int { from 18226 , to 18957
product whole gi 664591290 , location int { from 19138 , to 19584 , strand minus
product whole gi 664591293 , location int { from 19646 , to 22066
product whole gi 664591298 , location int { from 25742 , to 28477 , strand minus
product whole gi 664591299 , location int { from 28998 , to 29885 , strand minus
product whole gi 664591300 , location int { from 29882 , to 30883 , strand minus
product whole gi 664591301 , location int { from 30883 , to 32172 , strand minus
product whole gi 664591302 , location int { from 0 , to 368 , strand minus
product whole gi 739968934 , location int { from 24075 , to 25430
product whole gi 739968936 , location int { from 25352 , to 25648
product whole gi 739968938 , location int { from 4059 , to 4259 , strand minus
product whole gi 739968940 , location int { from 22250 , to 23854
and I want to sort it by the numbers after "from" and before the ",".
I've tried using the Linux sort command, however I have not had much success.
Anyone know how to go about this? Thanks

Try using sort -n -k10 <file>. With the default whitespace field separator, the number after "from" is the 10th field on each line (count: product whole gi <gi> , location int { from <N>), so -k10 selects that field and -n compares it numerically.

Assignment problem with 2 workers per job

Problem setup
Currently we are working on a dispatching problem for a food-tech startup (e-grocery). We have jobs (orders to be delivered) and workers (couriers/packers/universals). The problem is to assign orders to workers efficiently. As a first step we've decided to optimise CTE (click-to-eat: the time between order placement and order delivery).
The problem itself
The problem comes from the fact that sometimes it is efficient to have 2 workers per job rather than a single executor, because a packer may know the store "map" and a courier may have a bicycle, which allows the pair to execute the job faster than either of them separately, even accounting for order transfer costs.
We have researched algorithms and found that our problem looks like the assignment problem, for which there is an algorithmic solution (the Hungarian algorithm). The catch is that the classic problem requires that "each job is assigned to one worker and each worker is assigned one job", while in our case it is sometimes efficient to have 2 workers per job.
What we have tried so far
1. Insert the (packer A + universal B) combination into the matrix of costs. But in this case we cannot also add universal B to the matrix on their own, because as a result universal B could be assigned to 2 jobs (as a separate unit and as part of the combination with packer A).
2. Implement 2 Hungarian algorithms consecutively: first assign packaging, after that assign delivery. This works in the vast majority of cases, but sometimes leads to inefficient solutions. If needed, I will add an example.
The question itself
I've googled a lot, but could not find anything that could direct me to a solution of the problem. If you have any links or ideas that we can use as a clue to a solution, I will be happy to check them.
EDIT: I've added the brute-force solution to my question. Hope that helps to understand the problem better.
import itertools

import numpy as np
import pandas as pd

# constants
delivery_speed = np.array([5, 13])         # km per hour
delivery_distance = np.array([300, 2700])  # meters
flight_distance = np.array([500, 1900])    # meters (approach distance)
positions_amount = np.array([4, 8])        # number of positions in one order
assembly_speed = np.array([2, 3])          # minutes per position
transit_time = 5 * 60                      # sec to transfer an order
number_of_orders = 3                       # number of orders in a batch
number_of_workers = 3

# size of optimization matrix
matrix_size = max(number_of_workers, number_of_orders)

# maximum diagonal length for delivery and flight
max_length = np.sqrt(max(delivery_distance)**2 / 2)
max_flight_length = np.sqrt(max(flight_distance)**2 / 2)

# store positions
A = np.array([delivery_distance[1], delivery_distance[1]])
B = np.array([A[0] + max_length / 2, A[1]])
possible_order_position_x = np.array([-max_length/2, max_length]) + A[0]
possible_order_position_y = np.array([-max_length, max_length]) + A[1]
possible_courier_position_x = np.array([-max_flight_length/2, max_flight_length]) + A[0]
possible_courier_position_y = np.array([-max_flight_length, max_flight_length]) + A[1]

# generate random data
def random_speed(speed_array):
    return np.random.randint(speed_array[0], speed_array[1] + 1)

def location(possible_x, possible_y):
    return np.random.randint([possible_x[0], possible_y[0]],
                             [possible_x[1], possible_y[1]],
                             size=2)

def generate_couriers():
    couriers = {}
    for courier in range(number_of_workers):
        couriers[courier] = {
            'position': location(possible_courier_position_x, possible_courier_position_y),
            'delivery_speed': random_speed(delivery_speed),
            'assembly_speed': random_speed(assembly_speed),
        }
    return couriers

couriers = generate_couriers()
store_location = {0: A, 1: B}

def generate_orders():
    orders = {}
    for order in range(number_of_orders):
        orders[order] = {
            'number_of_positions': random_speed(positions_amount),
            'store_position': store_location[np.random.randint(2)],
            'customer_position': location(possible_order_position_x, possible_order_position_y)
        }
    return orders

orders = generate_orders()

# functions to calculate assembly and delivery times
def travel_time(location_1, location_2, speed):
    # time to get from location_1 to location_2 at the given speed
    distance = np.linalg.norm(location_1 - location_2)
    speed_mps = 1000 / (60 * 60) * speed  # km/h -> meters per second
    return distance / speed_mps           # seconds

def assembly_time(courier, order):
    flight_time = travel_time(courier['position'], order['store_position'], courier['delivery_speed'])
    assembly = courier['assembly_speed'] * order['number_of_positions'] * 60
    return int(flight_time + assembly)

def brute_force_solution():
    best_cte = np.inf
    best_combination = [[], []]
    for first_phase in itertools.permutations(range(number_of_workers), number_of_orders):
        assembly_time_s = pd.Series(index=range(number_of_orders), dtype=float)
        for order, courier in enumerate(first_phase):
            assembly_time_s[order] = assembly_time(couriers[courier], orders[order])
        # start to work with delivery
        for second_phase in itertools.permutations(range(number_of_workers), number_of_orders):
            delivery_time_s = pd.Series(index=range(number_of_orders), dtype=float)
            for order, courier in enumerate(second_phase):
                delivery_time = travel_time(orders[order]['store_position'],
                                            orders[order]['customer_position'],
                                            couriers[courier]['delivery_speed'])
                # different cases for different deliveries
                if courier == first_phase[order]:
                    # the courier assembled the order, so deliver immediately
                    delivery_time_s[order] = delivery_time
                elif courier not in first_phase:
                    # travel during assembly, wait if needed, transfer time, delivery
                    flight_time = travel_time(orders[order]['store_position'],
                                              couriers[courier]['position'],
                                              couriers[courier]['delivery_speed'])
                    wait_time = max(flight_time - assembly_time_s[order], 0)
                    delivery_time_s[order] = transit_time + wait_time + delivery_time
                else:
                    # case when a shopper hands over her own order and moves on to deliver another;
                    # check if the second order is in the same store
                    first_phase_order = first_phase.index(courier)
                    if (orders[first_phase_order]['store_position'] == orders[order]['store_position']).all():
                        # transit time is fixed and happens only once!
                        # time to assemble the courier's own assigned order
                        assembly_own = assembly_time_s[first_phase_order]
                        # wait if the order to deliver is assembled more slowly
                        wait_time = max(assembly_time_s[order] - assembly_own, 0)
                        # delivery time was calculated at the loop start
                        delivery_time_s[order] = transit_time + wait_time + delivery_time
                    else:
                        # hand over own order - travel to the other store - wait if needed -
                        # take over the order - delivery
                        flight_time = travel_time(orders[first_phase_order]['store_position'],
                                                  orders[order]['store_position'],
                                                  couriers[courier]['delivery_speed'])
                        arrival_own = assembly_time_s[first_phase_order] + transit_time + flight_time
                        wait_time = max(assembly_time_s[order] - arrival_own, 0)
                        delivery_time_s[order] = (transit_time * 2) + flight_time + wait_time + delivery_time
            delivery_time_s = delivery_time_s.astype(int)
            # calculate and update the best result, if needed
            cte = (assembly_time_s + delivery_time_s).sum()
            if cte < best_cte:
                best_cte = cte
                best_combination = [list(first_phase), list(second_phase)]
    return best_cte, best_combination

best_cte, best_combination = brute_force_solution()
The Hungarian algorithm is an antiquated solution that remains popular for some inexplicable reason (perhaps because it is conceptually simple).
Use Minimum Cost Flow to model your problem. It is far more flexible and has many efficient algorithms, and it can be shown to solve any problem the Hungarian algorithm can (the proof is simple).
Given the very vague description of your problem, you would want to model the underlying graph G=(V,E) with two layers of nodes V = (O,W), where O are the orders and W are the workers.
Edges can be directed, with each worker having an edge of capacity 1 to every possible order. Connect the source node to each worker with an edge of capacity 1, and connect each order node to the sink node with capacity 2 (or higher, allowing more workers per order).
What I described above is actually a max-flow instance, not an MCF, as it assigns no weights. You can, however, assign costs to any of the edges.
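For illustration, here is a minimal sketch of that network in Python with networkx (my choice of library, not something from the question; the worker/order names and costs below are made up):

import networkx as nx

# hypothetical data: 3 workers, 2 orders, made-up assignment costs
workers = ["w1", "w2", "w3"]
orders = ["o1", "o2"]
cost = {("w1", "o1"): 4, ("w1", "o2"): 7,
        ("w2", "o1"): 6, ("w2", "o2"): 3,
        ("w3", "o1"): 5, ("w3", "o2"): 5}

G = nx.DiGraph()
for w in workers:
    G.add_edge("source", w, capacity=1, weight=0)        # each worker is used at most once
    for o in orders:
        G.add_edge(w, o, capacity=1, weight=cost[w, o])  # worker -> order, cost on the edge
for o in orders:
    G.add_edge(o, "sink", capacity=2, weight=0)          # up to 2 workers per order

flow = nx.max_flow_min_cost(G, "source", "sink")
assignment = [(w, o) for w in workers for o in orders if flow[w].get(o, 0) > 0]
print(assignment)  # e.g. [('w1', 'o1'), ('w2', 'o2'), ('w3', 'o1')]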
Given your problem formulation, I don't understand how this is even an assignment problem. Can you not use a simple first-come-first-assigned (queue) strategy, given you don't seem to have any criterion by which a worker would prefer working on one order over another?
I did a quick-and-dirty test with a model that can handle teams. Just for illustration purposes.
I created two types of workers: type A and type B, and also teams consisting of two workers (one of each type). In addition, I created random cost data.
Here is a partial printout of all the data.
---- 13 SET i workers,teams
a1 , a2 , a3 , a4 , a5 , a6 , a7 , a8 , a9 , a10
b1 , b2 , b3 , b4 , b5 , b6 , b7 , b8 , b9 , b10
team1 , team2 , team3 , team4 , team5 , team6 , team7 , team8 , team9 , team10
team11 , team12 , team13 , team14 , team15 , team16 , team17 , team18 , team19 , team20
team21 , team22 , team23 , team24 , team25 , team26 , team27 , team28 , team29 , team30
team31 , team32 , team33 , team34 , team35 , team36 , team37 , team38 , team39 , team40
team41 , team42 , team43 , team44 , team45 , team46 , team47 , team48 , team49 , team50
team51 , team52 , team53 , team54 , team55 , team56 , team57 , team58 , team59 , team60
team61 , team62 , team63 , team64 , team65 , team66 , team67 , team68 , team69 , team70
team71 , team72 , team73 , team74 , team75 , team76 , team77 , team78 , team79 , team80
team81 , team82 , team83 , team84 , team85 , team86 , team87 , team88 , team89 , team90
team91 , team92 , team93 , team94 , team95 , team96 , team97 , team98 , team99 , team100
---- 13 SET w workers
a1 , a2 , a3 , a4 , a5 , a6 , a7 , a8 , a9 , a10, b1 , b2 , b3 , b4 , b5
b6 , b7 , b8 , b9 , b10
---- 13 SET a a workers
a1 , a2 , a3 , a4 , a5 , a6 , a7 , a8 , a9 , a10
---- 13 SET b b workers
b1 , b2 , b3 , b4 , b5 , b6 , b7 , b8 , b9 , b10
---- 13 SET t teams
team1 , team2 , team3 , team4 , team5 , team6 , team7 , team8 , team9 , team10
team11 , team12 , team13 , team14 , team15 , team16 , team17 , team18 , team19 , team20
team21 , team22 , team23 , team24 , team25 , team26 , team27 , team28 , team29 , team30
team31 , team32 , team33 , team34 , team35 , team36 , team37 , team38 , team39 , team40
team41 , team42 , team43 , team44 , team45 , team46 , team47 , team48 , team49 , team50
team51 , team52 , team53 , team54 , team55 , team56 , team57 , team58 , team59 , team60
team61 , team62 , team63 , team64 , team65 , team66 , team67 , team68 , team69 , team70
team71 , team72 , team73 , team74 , team75 , team76 , team77 , team78 , team79 , team80
team81 , team82 , team83 , team84 , team85 , team86 , team87 , team88 , team89 , team90
team91 , team92 , team93 , team94 , team95 , team96 , team97 , team98 , team99 , team100
---- 13 SET j jobs
job1 , job2 , job3 , job4 , job5 , job6 , job7 , job8 , job9 , job10, job11, job12
job13, job14, job15
---- 23 SET team composition of teams
team1 .a1 , team1 .b1 , team2 .a1 , team2 .b2 , team3 .a1 , team3 .b3 , team4 .a1
team4 .b4 , team5 .a1 , team5 .b5 , team6 .a1 , team6 .b6 , team7 .a1 , team7 .b7
team8 .a1 , team8 .b8 , team9 .a1 , team9 .b9 , team10 .a1 , team10 .b10, team11 .a2
team11 .b1 , team12 .a2 , team12 .b2 , team13 .a2 , team13 .b3 , team14 .a2 , team14 .b4
team15 .a2 , team15 .b5 , team16 .a2 , team16 .b6 , team17 .a2 , team17 .b7 , team18 .a2
team18 .b8 , team19 .a2 , team19 .b9 , team20 .a2 , team20 .b10, team21 .a3 , team21 .b1
team22 .a3 , team22 .b2 , team23 .a3 , team23 .b3 , team24 .a3 , team24 .b4 , team25 .a3
team25 .b5 , team26 .a3 , team26 .b6 , team27 .a3 , team27 .b7 , team28 .a3 , team28 .b8
team29 .a3 , team29 .b9 , team30 .a3 , team30 .b10, team31 .a4 , team31 .b1 , team32 .a4
team32 .b2 , team33 .a4 , team33 .b3 , team34 .a4 , team34 .b4 , team35 .a4 , team35 .b5
team36 .a4 , team36 .b6 , team37 .a4 , team37 .b7 , team38 .a4 , team38 .b8 , team39 .a4
team39 .b9 , team40 .a4 , team40 .b10, team41 .a5 , team41 .b1 , team42 .a5 , team42 .b2
team43 .a5 , team43 .b3 , team44 .a5 , team44 .b4 , team45 .a5 , team45 .b5 , team46 .a5
team46 .b6 , team47 .a5 , team47 .b7 , team48 .a5 , team48 .b8 , team49 .a5 , team49 .b9
team50 .a5 , team50 .b10, team51 .a6 , team51 .b1 , team52 .a6 , team52 .b2 , team53 .a6
team53 .b3 , team54 .a6 , team54 .b4 , team55 .a6 , team55 .b5 , team56 .a6 , team56 .b6
team57 .a6 , team57 .b7 , team58 .a6 , team58 .b8 , team59 .a6 , team59 .b9 , team60 .a6
team60 .b10, team61 .a7 , team61 .b1 , team62 .a7 , team62 .b2 , team63 .a7 , team63 .b3
team64 .a7 , team64 .b4 , team65 .a7 , team65 .b5 , team66 .a7 , team66 .b6 , team67 .a7
team67 .b7 , team68 .a7 , team68 .b8 , team69 .a7 , team69 .b9 , team70 .a7 , team70 .b10
team71 .a8 , team71 .b1 , team72 .a8 , team72 .b2 , team73 .a8 , team73 .b3 , team74 .a8
team74 .b4 , team75 .a8 , team75 .b5 , team76 .a8 , team76 .b6 , team77 .a8 , team77 .b7
team78 .a8 , team78 .b8 , team79 .a8 , team79 .b9 , team80 .a8 , team80 .b10, team81 .a9
team81 .b1 , team82 .a9 , team82 .b2 , team83 .a9 , team83 .b3 , team84 .a9 , team84 .b4
team85 .a9 , team85 .b5 , team86 .a9 , team86 .b6 , team87 .a9 , team87 .b7 , team88 .a9
team88 .b8 , team89 .a9 , team89 .b9 , team90 .a9 , team90 .b10, team91 .a10, team91 .b1
team92 .a10, team92 .b2 , team93 .a10, team93 .b3 , team94 .a10, team94 .b4 , team95 .a10
team95 .b5 , team96 .a10, team96 .b6 , team97 .a10, team97 .b7 , team98 .a10, team98 .b8
team99 .a10, team99 .b9 , team100.a10, team100.b10
---- 28 PARAMETER c cost coefficients
job1 job2 job3 job4 job5 job6 job7 job8 job9
a1 17.175 84.327 55.038 30.114 29.221 22.405 34.983 85.627 6.711
a2 63.972 15.952 25.008 66.893 43.536 35.970 35.144 13.149 15.010
a3 11.049 50.238 16.017 87.246 26.511 28.581 59.396 72.272 62.825
a4 18.210 64.573 56.075 76.996 29.781 66.111 75.582 62.745 28.386
a5 7.277 17.566 52.563 75.021 17.812 3.414 58.513 62.123 38.936
a6 78.340 30.003 12.548 74.887 6.923 20.202 0.507 26.961 49.985
a7 99.360 36.990 37.289 77.198 39.668 91.310 11.958 73.548 5.542
a8 22.575 39.612 27.601 15.237 93.632 42.266 13.466 38.606 37.463
a9 10.169 38.389 32.409 19.213 11.237 59.656 51.145 4.507 78.310
a10 50.659 15.925 65.689 52.388 12.440 98.672 22.812 67.565 77.678
b1 73.497 8.544 15.035 43.419 18.694 69.269 76.297 15.481 38.938
b2 8.712 54.040 12.686 73.400 11.323 48.835 79.560 49.205 53.356
b3 2.463 17.782 6.132 1.664 83.565 60.166 2.702 19.609 95.071
b4 39.334 80.546 54.099 39.072 55.782 93.276 34.877 0.829 94.884
b5 58.032 16.642 64.336 34.431 91.233 90.006 1.626 36.863 66.438
b6 49.662 4.493 77.370 53.297 74.677 72.005 63.160 11.492 97.116
b7 79.070 61.022 5.431 48.518 5.255 69.858 19.478 22.603 81.364
b8 81.953 86.041 21.269 45.679 3.836 32.300 43.988 31.533 13.477
b9 6.441 41.460 34.161 46.829 64.267 64.358 33.761 10.082 90.583
b10 40.419 11.167 75.113 80.340 2.366 48.088 27.859 90.161 1.759
team1 50.421 83.126 60.214 8.225 57.776 59.318 68.377 15.877 33.178
team2 57.624 71.983 68.373 1.985 83.980 71.005 15.551 61.071 66.155
team3 1.252 1.017 95.203 97.668 96.632 85.628 14.161 4.973 55.303
team4 34.968 11.734 58.598 44.553 41.232 91.451 21.378 22.417 54.233
team5 31.014 4.020 82.117 23.096 41.003 30.258 44.492 71.600 59.315
team6 68.639 67.463 33.213 75.994 17.678 68.248 67.299 83.121 51.517
team7 8.469 57.216 2.206 74.204 90.510 56.082 47.283 71.756 51.301
team8 96.552 95.789 89.923 32.755 45.710 59.618 87.862 17.067 63.360
team9 33.626 58.864 57.439 54.342 57.816 97.722 32.147 76.297 96.251
. . .
team98 21.277 8.252 28.341 97.284 47.146 22.196 56.537 89.966 15.708
team99 77.385 12.015 78.861 79.375 83.146 11.379 3.413 72.925 88.689
team100 11.050 20.276 21.448 27.928 15.073 76.671 91.574 94.498 7.094
(cost data for other jobs skipped)
My attempt to model this as a Mixed-Integer Programming model is as follows.
Obviously, this is no longer a pure assignment problem. The first constraint is more complicated than we are used to: it says that each worker w can be assigned, either by him/herself or through any team that has w as a member, at most once.
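Since the formulation itself is not reproduced here, the following is my reconstruction of the model from that description (an assumption, with x_{i,j} = 1 meaning worker-or-team i does job j):

\min\ z = \sum_{i \in W \cup T} \sum_{j \in J} c_{i,j}\, x_{i,j}

\text{s.t.}\quad \sum_{i \in W \cup T} x_{i,j} = 1 \qquad \forall j \in J \quad \text{(each job is done exactly once)}

\qquad \sum_{j \in J} \Big( x_{w,j} + \sum_{t \in T \,\mid\, w \in t} x_{t,j} \Big) \le 1 \qquad \forall w \in W \quad \text{(each worker, solo or in a team, is used at most once)}

\qquad x_{i,j} \in \{0,1\}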
When I solve this without using teams, I get the following solution:
---- 56 VARIABLE x.L assignment
job1 job2 job3 job4 job5 job6 job7 job8 job9
a5 1
a6 1
b1 1
b3 1
b4 1
b6 1
b8 1
b9 1
b10 1
+ job10 job11 job12 job13 job14 job15
a1 1
a4 1
a7 1
b2 1
b5 1
b7 1
---- 56 VARIABLE z.L = 59.379 total cost
This is a standard assignment problem, but I solved it as an LP (so I could use the same tools).
When I allow teams, I get:
---- 65 VARIABLE x.L assignment
job1 job2 job3 job4 job5 job6 job7 job8 job9
a1 1
a5 1
a6 1
b3 1
b4 1
b9 1
b10 1
team17 1
team28 1
+ job10 job11 job12 job13 job14 job15
a4 1
a7 1
b2 1
b5 1
team86 1
team91 1
---- 65 VARIABLE z.L = 40.057 total cost
The objective is better (simply because the model could select teams with low "cost"). Also, note that in the solution no worker is selected twice (either individually or as part of a team). Here is an additional solution report that confirms this:
---- 70 SET sol alternative solution report
job1 job2 job3 job4 job5 job6 job7 job8 job9
team17.a2 YES
team17.b7 YES
team28.a3 YES
team28.b8 YES
- .a1 YES
- .a5 YES
- .a6 YES
- .b3 YES
- .b4 YES
- .b9 YES
- .b10 YES
+ job10 job11 job12 job13 job14 job15
team86.a9 YES
team86.b6 YES
team91.a10 YES
team91.b1 YES
- .a4 YES
- .a7 YES
- .b2 YES
- .b5 YES
Note that the model is not very small:
MODEL STATISTICS
BLOCKS OF EQUATIONS 3 SINGLE EQUATIONS 36
BLOCKS OF VARIABLES 2 SINGLE VARIABLES 1,801
NON ZERO ELEMENTS 6,901 DISCRETE VARIABLES 1,800
However, the MIP model solves quite easily: less than a second.
I did not test the model on large data sets. It is just to show how one could approach a problem like this.
There's an obvious heuristic that you could try (sketched in code below):
1. Solve the classical problem using the Hungarian algorithm.
2. Then, using the pool of unassigned agents, find for each one the combined (team) assignment that gives the biggest improvement.
Certainly not the optimum, but a clear first-order improvement over the Hungarian algorithm alone.
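A minimal sketch of that heuristic in Python, where scipy's linear_sum_assignment plays the role of the Hungarian algorithm; the solo cost matrix and the team_cost helper are assumptions for illustration:

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_plus_teams(cost, team_cost):
    # cost: (n_workers, n_jobs) solo-cost matrix, n_workers >= n_jobs
    # team_cost(spare, assigned, job): hypothetical cost of the pair doing the job
    n_workers, n_jobs = cost.shape
    rows, cols = linear_sum_assignment(cost)  # step 1: classical assignment
    assigned = dict(zip(cols, rows))          # job -> solo worker
    spare = set(range(n_workers)) - set(rows)
    teams = {}
    for w in sorted(spare):
        # step 2: greedily attach each spare worker to the job where a team
        # beats the current solo cost by the most (skipping jobs already paired)
        gains = {j: cost[w0, j] - team_cost(w, w0, j)
                 for j, w0 in assigned.items() if j not in teams}
        if not gains:
            break
        best_j = max(gains, key=gains.get)
        if gains[best_j] > 0:
            teams[best_j] = (assigned[best_j], w)
    return assigned, teams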

Show duplicates in internal table

Each and every item should have a unique SecNo + Drawing combination. Due to mis-entries, some combinations are there two times.
I need to create a report in ABAP which identifies those combinations and leaves out the others.
Item:  SecNo:  Drawing:
121    904     5000      double
122    904     5000      double
123    816     5100
124    813     5200
125    812     4900      double
126    812     4900      double
127    814     5300
How can I solve this? I tried 2 approaches and failed:
1. sorting the data and trying to print out each row whenever its value equals the value of the row above;
2. counting the duplicates and showing all of those whose count is more than one.
Where do I put in the condition? In the loop area?
I tried this:
REPORT duplicates.

DATA: BEGIN OF lt_duplicates OCCURS 0,
        f2(10),
        f3(10),
      END OF lt_duplicates,
      it_test TYPE TABLE OF ztest WITH HEADER LINE,
      i TYPE i.

SELECT DISTINCT f2 f3 FROM ztest INTO TABLE lt_duplicates.

LOOP AT lt_duplicates.
  SELECT * FROM ztest INTO TABLE it_test
    WHERE f2 = lt_duplicates-f2 AND f3 = lt_duplicates-f3.
  i = LINES( it_test ).
  IF i > 1.
    LOOP AT it_test.
      WRITE :/ it_test-f1, it_test-f2, it_test-f3.
    ENDLOOP.
  ENDIF.
ENDLOOP.
From ABAP 7.40, you may use the GROUP BY constructs with the GROUP SIZE words so as to take into account only the groups with at least 2 elements:
* the ABAP statement LOOP AT ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
* the ABAP expression ... VALUE|REDUCE|NEW type|#( FOR GROUPS ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
For both constructs, it's possible to loop at the grouped lines in two ways:
* LOOP AT GROUP ...
* ... FOR ... IN GROUP ...
Assuming the input table:
Line#  Item  SecNo  Drawing
1      121   904    5000     double
2      122   904    5000     double
3      123   816    5100
4      124   813    5200
5      125   812    4900     double
6      126   812    4900     double
7      127   814    5300
You might want to produce the following table containing the duplicates:
SecNo Drawing Lines
904 5000 [1,2]
812 4900 [5,6]
Solution with LOOP AT ... GROUP BY ...:
TYPES: BEGIN OF t_line,
         item    TYPE i,
         secno   TYPE i,
         drawing TYPE i,
       END OF t_line,
       BEGIN OF t_duplicate,
         secno   TYPE i,
         drawing TYPE i,
         num_dup TYPE i, " number of duplicates
         lines   TYPE STANDARD TABLE OF REF TO t_line WITH EMPTY KEY,
       END OF t_duplicate,
       t_lines      TYPE STANDARD TABLE OF t_line WITH EMPTY KEY,
       t_duplicates TYPE STANDARD TABLE OF t_duplicate WITH EMPTY KEY.

DATA(table) = VALUE t_lines(
    ( item = 121 secno = 904 drawing = 5000 )
    ( item = 122 secno = 904 drawing = 5000 )
    ( item = 123 secno = 816 drawing = 5100 )
    ( item = 124 secno = 813 drawing = 5200 )
    ( item = 125 secno = 812 drawing = 4900 )
    ( item = 126 secno = 812 drawing = 4900 )
    ( item = 127 secno = 814 drawing = 5300 ) ).

DATA(expected_duplicates) = VALUE t_duplicates(
    ( secno = 904 drawing = 5000 num_dup = 2 lines = VALUE #( ( REF #( table[ 1 ] ) ) ( REF #( table[ 2 ] ) ) ) )
    ( secno = 812 drawing = 4900 num_dup = 2 lines = VALUE #( ( REF #( table[ 5 ] ) ) ( REF #( table[ 6 ] ) ) ) ) ).

DATA(actual_duplicates) = VALUE t_duplicates( ).

LOOP AT table
    ASSIGNING FIELD-SYMBOL(<line>)
    GROUP BY
      ( secno   = <line>-secno
        drawing = <line>-drawing
        gs      = GROUP SIZE )
    ASSIGNING FIELD-SYMBOL(<group_table>).
  IF <group_table>-gs >= 2.
    actual_duplicates = VALUE #( BASE actual_duplicates
        ( secno   = <group_table>-secno
          drawing = <group_table>-drawing
          num_dup = <group_table>-gs
          lines   = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
  ENDIF.
ENDLOOP.

WRITE : / 'List of duplicates:'.
SKIP 1.
WRITE : / 'Secno Drawing List of concerned items'.
WRITE : / '---------- ---------- ---------------------------------- ...'.
LOOP AT actual_duplicates ASSIGNING FIELD-SYMBOL(<duplicate>).
  WRITE : / <duplicate>-secno, <duplicate>-drawing NO-GROUPING.
  LOOP AT <duplicate>-lines INTO DATA(line).
    WRITE line->*-item.
  ENDLOOP.
ENDLOOP.

ASSERT actual_duplicates = expected_duplicates. " short dump if not equal
Output:
List of duplicates:
Secno Drawing List of concerned items
---------- ---------- ---------------------------------- ...
904 5000 121 122
812 4900 125 126
Solution with ... VALUE type|#( FOR GROUPS ... GROUP BY ...:
DATA(actual_duplicates) = VALUE t_duplicates(
    FOR GROUPS <group_table> OF <line> IN table
    GROUP BY
      ( secno   = <line>-secno
        drawing = <line>-drawing
        gs      = GROUP SIZE )
    ( secno   = <group_table>-secno
      drawing = <group_table>-drawing
      num_dup = <group_table>-gs
      lines   = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
DELETE actual_duplicates WHERE num_dup = 1.
Note: for deleting non-duplicates, instead of using an additional DELETE statement, it can be done inside the VALUE construct by adding a LINES OF COND construct which adds 1 line if group size >= 2, or none otherwise (if group size = 1):
...
gs = GROUP SIZE )
( LINES OF COND #( WHEN <group_table>-gs >= 2 THEN VALUE #( "<== new line
( secno = <group_table>-secno
...
... REF #( <line2> ) ) ) ) ) ) ) ). "<== 3 extra right parentheses
You can use AT...ENDAT for this, provided that you arrange the fields correctly:
TYPES: BEGIN OF t_my_line,
         secno   TYPE foo,
         drawing TYPE bar,
         item    TYPE baz, " this field has to appear AFTER the other ones in the table
       END OF t_my_line.
DATA: lt_my_table   TYPE TABLE OF t_my_line,
      lt_duplicates TYPE TABLE OF t_my_line.
FIELD-SYMBOLS: <ls_line> TYPE t_my_line.

START-OF-WHATEVER.
* ... fill the table ...
  SORT lt_my_table BY secno drawing.
  LOOP AT lt_my_table ASSIGNING <ls_line>.
    AT NEW drawing. " whenever drawing or any field left of it changes...
      FREE lt_duplicates.
    ENDAT.
    APPEND <ls_line> TO lt_duplicates.
    AT END OF drawing.
      IF lines( lt_duplicates ) > 1.
*       congrats, here are your duplicates...
      ENDIF.
    ENDAT.
  ENDLOOP.
I needed simply to report duplicate lines in error based on two fields, so I used the following.
LOOP AT gt_data INTO DATA(gs_data)
    GROUP BY ( columnA = gs_data-columnA
               columnB = gs_data-columnB
               size    = GROUP SIZE
               index   = GROUP INDEX ) ASCENDING
    REFERENCE INTO DATA(group_ref).
  IF group_ref->size > 1.
    PERFORM insert_error USING group_ref->columnA group_ref->columnB.
  ENDIF.
ENDLOOP.
Here is my 2p worth. You could cut some of this out depending on what you want to do, and you should consider the amount of data being processed too: this method is only really for smaller sets.
Personally I like to prevent erroneous records at the source, catching errors during input. But if you do end up in a pickle, there is definitely more than one way to solve the issue.
TYPES: BEGIN OF ty_itab,
         item    TYPE i,
         secno   TYPE i,
         drawing TYPE i,
       END OF ty_itab.
TYPES: itab_tt TYPE STANDARD TABLE OF ty_itab.

DATA: lt_itab  TYPE itab_tt,
      lt_itab2 TYPE itab_tt,
      lt_itab3 TYPE itab_tt.

lt_itab = VALUE #(
    ( item = '121' secno = '904' drawing = '5000' )
    ( item = '122' secno = '904' drawing = '5000' )
    ( item = '123' secno = '816' drawing = '5100' )
    ( item = '124' secno = '813' drawing = '5200' )
    ( item = '125' secno = '812' drawing = '4900' )
    ( item = '126' secno = '812' drawing = '4900' )
    ( item = '127' secno = '814' drawing = '5300' )
).

APPEND LINES OF lt_itab TO lt_itab2.
APPEND LINES OF lt_itab TO lt_itab3.

SORT lt_itab2 BY secno drawing.
DELETE ADJACENT DUPLICATES FROM lt_itab2 COMPARING secno drawing.

* Loop at what is hopefully the smaller itab.
LOOP AT lt_itab2 ASSIGNING FIELD-SYMBOL(<line>).
  DELETE TABLE lt_itab3 FROM <line>.
ENDLOOP.

* lt_itab  has all the originals.
* lt_itab2 has the unique rows.
* lt_itab3 has the duplicates.

How to convert windows-1256 to utf-8 in lua?

I need to convert Arabic text from windows-1256 to UTF-8. How can I do that? Any help?
Thanks
Try lua-iconv, which binds iconv to Lua.
local win2utf_list = [[
0x00 0x0000 #NULL
0x01 0x0001 #START OF HEADING
0x02 0x0002 #START OF TEXT
-- Download full text from
-- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT
0xFD 0x200E #LEFT-TO-RIGHT MARK
0xFE 0x200F #RIGHT-TO-LEFT MARK
0xFF 0x06D2 #ARABIC LETTER YEH BARREE
]]

local win2utf = {}

for w, u in win2utf_list:gmatch'0x(%x%x)%s+0x(%x+)' do
  -- encode code point u as UTF-8 bytes, built least significant byte first
  local c, t, h = tonumber(u,16), {}, 128
  while c >= h do
    t[#t+1] = 128 + c%64
    c = math.floor(c/64)
    h = h > 32 and 32 or h/2
  end
  t[#t+1] = 256 - 2*h + c
  win2utf[w.char(tonumber(w,16))] =
    w.char((table.unpack or unpack)(t)):reverse()
end

local function convert_to_utf8(win_string)
  -- parentheses drop gsub's second return value (the substitution count)
  return (win_string:gsub('.', win2utf))
end
Windows-1256 is one of the character sets designed as an 8-bit overlay of ASCII. It therefore has 256 characters, each encoded as one byte.
UTF-8 is an encoding of the Unicode character set. Being "universal", Unicode works out to be a superset of the Windows-1256 character set, so there is no loss of information and no need for "substitution characters" in place of ones that aren't members of the target character set.
Conversion is a simple matter of transforming the Windows-1256 byte for each character into the corresponding UTF-8 bytes. A lookup table is an easy way to do it.
local encoding = {
-- table maps the one byte Windows-1256 encoding for a character to a Lua string with the UTF-8 encoding for the character
"\000" , "\001" , "\002" , "\003" , "\004" , "\005" , "\006" , "\007" ,
"\008" , "\009" , "\010" , "\011" , "\012" , "\013" , "\014" , "\015" ,
"\016" , "\017" , "\018" , "\019" , "\020" , "\021" , "\022" , "\023" ,
"\024" , "\025" , "\026" , "\027" , "\028" , "\029" , "\030" , "\031" ,
"\032" , "\033" , "\034" , "\035" , "\036" , "\037" , "\038" , "\039" ,
"\040" , "\041" , "\042" , "\043" , "\044" , "\045" , "\046" , "\047" ,
"\048" , "\049" , "\050" , "\051" , "\052" , "\053" , "\054" , "\055" ,
"\056" , "\057" , "\058" , "\059" , "\060" , "\061" , "\062" , "\063" ,
"\064" , "\065" , "\066" , "\067" , "\068" , "\069" , "\070" , "\071" ,
"\072" , "\073" , "\074" , "\075" , "\076" , "\077" , "\078" , "\079" ,
"\080" , "\081" , "\082" , "\083" , "\084" , "\085" , "\086" , "\087" ,
"\088" , "\089" , "\090" , "\091" , "\092" , "\093" , "\094" , "\095" ,
"\096" , "\097" , "\098" , "\099" , "\100" , "\101" , "\102" , "\103" ,
"\104" , "\105" , "\106" , "\107" , "\108" , "\109" , "\110" , "\111" ,
"\112" , "\113" , "\114" , "\115" , "\116" , "\117" , "\118" , "\119" ,
"\120" , "\121" , "\122" , "\123" , "\124" , "\125" , "\126" , "\127" ,
"\226\130\172", "\217\190" , "\226\128\154", "\198\146" , "\226\128\158", "\226\128\166", "\226\128\160", "\226\128\161",
"\203\134" , "\226\128\176", "\217\185" , "\226\128\185", "\197\146" , "\218\134" , "\218\152" , "\218\136" ,
"\218\175" , "\226\128\152", "\226\128\153", "\226\128\156", "\226\128\157", "\226\128\162", "\226\128\147", "\226\128\148",
"\218\169" , "\226\132\162", "\218\145" , "\226\128\186", "\197\147" , "\226\128\140", "\226\128\141", "\218\186" ,
"\194\160" , "\216\140" , "\194\162" , "\194\163" , "\194\164" , "\194\165" , "\194\166" , "\194\167" ,
"\194\168" , "\194\169" , "\218\190" , "\194\171" , "\194\172" , "\194\173" , "\194\174" , "\194\175" ,
"\194\176" , "\194\177" , "\194\178" , "\194\179" , "\194\180" , "\194\181" , "\194\182" , "\194\183" ,
"\194\184" , "\194\185" , "\216\155" , "\194\187" , "\194\188" , "\194\189" , "\194\190" , "\216\159" ,
"\219\129" , "\216\161" , "\216\162" , "\216\163" , "\216\164" , "\216\165" , "\216\166" , "\216\167" ,
"\216\168" , "\216\169" , "\216\170" , "\216\171" , "\216\172" , "\216\173" , "\216\174" , "\216\175" ,
"\216\176" , "\216\177" , "\216\178" , "\216\179" , "\216\180" , "\216\181" , "\216\182" , "\195\151" ,
"\216\183" , "\216\184" , "\216\185" , "\216\186" , "\217\128" , "\217\129" , "\217\130" , "\217\131" ,
"\195\160" , "\217\132" , "\195\162" , "\217\133" , "\217\134" , "\217\135" , "\217\136" , "\195\167" ,
"\195\168" , "\195\169" , "\195\170" , "\195\171" , "\217\137" , "\217\138" , "\195\174" , "\195\175" ,
"\217\139" , "\217\140" , "\217\141" , "\217\142" , "\195\180" , "\217\143" , "\217\144" , "\195\183" ,
"\217\145" , "\195\185" , "\217\146" , "\195\187" , "\195\188" , "\226\128\142", "\226\128\143", "\219\146"
}
--
encoding.convert = function(str)
  assert(type(str) == "string", "Parameter 1 must be a string")
  local result = {}
  for i = 1, string.len(str) do
    -- +1 because the table is 1-based while byte values are 0-255
    table.insert(result, encoding[string.byte(str, i) + 1])
  end
  return table.concat(result)
end
assert(encoding.convert("test1") == "test1", "test1 failed")
Refs:
Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Roberto Ierusalimschy, Creating Strings Piece by Piece
Generally, to convert from one code page (character set) to another, you must use a mapping table.
For example: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1256.TXT, which maps CP1256 to Unicode.
Then convert from Unicode to UTF-8 (the encoding/decoding between Unicode code points and UTF-8 is algorithmic and works without a big map).
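As a rough sketch of those two steps in Lua (assuming Lua 5.3+ for utf8.char, and that the variable cp1256_txt holds the text of that mapping file):

-- Step 1: parse the published mapping file into a byte -> code point table.
local cp1256_to_unicode = {}
for byte, codepoint in cp1256_txt:gmatch('0x(%x%x)%s+0x(%x+)') do
  cp1256_to_unicode[tonumber(byte, 16)] = tonumber(codepoint, 16)
end

-- Step 2: map each input byte to its code point and let utf8.char
-- do the algorithmic code point -> UTF-8 encoding.
local function cp1256_to_utf8(s)
  return (s:gsub('.', function(c)
    return utf8.char(cp1256_to_unicode[c:byte()])
  end))
end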

Pig Changing Schema to required type

I'm a new Pig user.
I have an existing schema which I want to modify. My source data is as follows with 6 columns:
Name Type Date Region Op Value
-----------------------------------------------------
john ab 20130106 D X 20
john ab 20130106 D C 19
john ab 20130106 D T 8
john ab 20130106 E C 854
john ab 20130106 E T 67
john ab 20130106 E X 98
and so on. Each Op value is always C, T or X.
I basically want to split my data in the following way into 7 columns:
Name Type Date Region OpX OpC OpT
----------------------------------------------------------
john ab 20130106 D 20 19 8
john ab 20130106 E 98 854 67
Basically, split the Op column into 3 columns, one for each Op value. Each of these columns should contain the appropriate value from the Value column.
How can I do this in Pig?
One way to achieve the desired result (ordering by op ascending makes the values within each group arrive as C, T, X, which is why bs.$0, bs.$1 and bs.$2 map to OpC, OpT and OpX):
IN = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
     date:int, region:chararray, op:chararray, value:int);
A = order IN by op asc;
B = group A by (name, type, date, region);
C = foreach B {
      bs = STRSPLIT(BagToString(A.value, ','), ',', 3);
      generate flatten(group) as (name, type, date, region),
               bs.$2 as OpX:chararray, bs.$0 as OpC:chararray, bs.$1 as OpT:chararray;
}
describe C;
C: {name: chararray,type: chararray,date: int,region: chararray,OpX:
chararray,OpC: chararray,OpT: chararray}
dump C;
(john,ab,20130106,D,20,19,8)
(john,ab,20130106,E,98,854,67)
Update:
If you want to skip the order by, which adds an additional reduce phase to the computation, you can prefix each value with its corresponding op in tuple v. Then sort the tuple fields with a custom UDF to obtain the desired OpX, OpC, OpT order:
register 'myjar.jar';
A = load 'data.txt' using PigStorage(',') as (name:chararray, type:chararray,
    date:int, region:chararray, op:chararray, value:int);
B = group A by (name, type, date, region);
C = foreach B {
      v = foreach A generate CONCAT(op, (chararray)value);
      bs = STRSPLIT(BagToString(v, ','), ',', 3);
      generate flatten(group) as (name, type, date, region),
               flatten(TupleArrange(bs)) as (OpX:chararray, OpC:chararray, OpT:chararray);
}
where TupleArrange in myjar.jar is something like this:
import java.io.IOException;
import java.util.Arrays;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TupleArrange extends EvalFunc<Tuple> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public Tuple exec(Tuple input) throws IOException {
        try {
            Tuple result = tupleFactory.newTuple(3);
            Tuple inputTuple = (Tuple) input.get(0);
            String[] tupleArr = new String[] {
                (String) inputTuple.get(0),
                (String) inputTuple.get(1),
                (String) inputTuple.get(2)
            };
            Arrays.sort(tupleArr); // ascending: C..., T..., X...
            result.set(0, tupleArr[2].substring(1)); // X value
            result.set(1, tupleArr[0].substring(1)); // C value
            result.set(2, tupleArr[1].substring(1)); // T value
            return result;
        } catch (Exception e) {
            throw new RuntimeException("TupleArrange error", e);
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        return input;
    }
}

Speed up the analysis

I have 2 data frames in R, for example df and dfrefseq.
df <- data.frame(chr = c("chr1","chr1","chr1","chr4"),
                 start = c(843294,4329248,4329423,4932234),
                 stop = c(845294,4329248,4529423,4935234),
                 genenames = c("HTA","OdX","FEA","MGA"))
dfrefseq <- data.frame(chr = c("chr1","chr1","chr1","chr2"),
                       start = c(843294,4329248,4329423,4932234),
                       stop = c(845294,4329248,4529423,4935234),
                       genenames = c("tra","FGE","FFs","FAA"))
I want to check, for each gene in df, which gene in dfrefseq lies closest to the selected df gene.
I first selected "chr1" in both data frames.
Then I calculated, for the first gene in readschr1, the distance between the start-start, start-stop, stop-start and stop-stop sites.
The sum of these calculations says everything about the distance. My question here is: how can I speed up this analysis? Because right now I tested only 1 gene against a data frame, but I need to test 2000 genes.
readschr1 <- subset(df, df[,1]=="chr1")
refseqchr1 <- subset(dfrefseq, dfrefseq[,1]=="chr1")

names <- list()
read_start_start <- list()
read_start_stop <- list()
read_stop_start <- list()
read_stop_stop <- list()

for (i in 1:nrow(refseqchr1)) {
  startstart <- abs(readschr1[1,2] - refseqchr1[i,2])
  startstop  <- abs(readschr1[1,2] - refseqchr1[i,3])
  stopstart  <- abs(readschr1[1,3] - refseqchr1[i,2])
  stopstop   <- abs(readschr1[1,3] - refseqchr1[i,3])
  read_start_start[[i]] <- matrix(startstart)
  read_start_stop[[i]]  <- matrix(startstop)
  read_stop_start[[i]]  <- matrix(stopstart)
  read_stop_stop[[i]]   <- matrix(stopstop)
  names[[i]] <- matrix(refseqchr1[i,4])
}

table <- cbind(names, read_start_start, read_start_stop, read_stop_start, read_stop_stop)
sumtotalcolumns <- as.numeric(table[,2]) + as.numeric(table[,3]) + as.numeric(table[,4]) + as.numeric(table[,5])
test <- cbind(table, sumtotalcolumns)
test1 <- test[order(as.vector(test$sumtotalcolumns)), ]
Thank you!
The Bioconductor package GenomicRanges is designed to work with this type of data
source('http://bioconductor.org/biocLite.R')
biocLite('GenomicRanges') # one-time installation
then
library(GenomicRanges)
gr <- with(df,
           GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
                   IRanges(start, stop), genenames=genenames))
grrefseq <- with(dfrefseq,
                 GRanges(factor(chr, levels=paste("chr", 1:4, sep="")),
                         IRanges(start, stop), genenames=genenames))
and
> nearest(gr, grrefseq)
[1] 1 2 3 NA
You can merge the two separate data.frames together to form one table and then use vectorized operations. The key to merge is to specify the common column(s) between the data.frames and to tell it what to do for cases that do not match. Specifying all = TRUE will return all rows, filling in NAs where there is no match in the other data.frame (chr2 and chr4 in this case). Once the data.frames have been merged, it's a simple exercise in subtracting the different columns from one another and then summing the four columns of interest. I use transform to cut down on the typing needed to do the subtraction.
zz <- merge(df, dfrefseq, by = "chr", all = TRUE)
zz <- transform(zz,
                read_start_start = abs(start.x - start.y),
                read_start_stop  = abs(start.x - stop.y),
                read_stop_start  = abs(stop.x - start.y),
                read_stop_stop   = abs(stop.x - stop.y))
zz <- transform(zz,
                sum_total_columns = read_start_start + read_start_stop +
                                    read_stop_start + read_stop_stop)
Here's one approach to get the row with the minimum distance. I'm assuming you want to do this by chr and genenames. I use the plyr package, but I'm sure there are base solutions if you'd prefer one of those. Maybe someone else will chime in with a base solution.
require(plyr)
ddply(zz, c("chr", "genenames.x"), function(x) x[which.min(x$sum_total_columns) ,])
