ViewVC Help
View File | Revision Log | Show Annotations | Download File
/cvs/Geo-LatLon2Place/bin/geo-latlon2place-makedb
Revision: 1.3
Committed: Tue Mar 15 07:33:40 2022 UTC (2 years, 2 months ago) by root
Branch: MAIN
CVS Tags: rel-0_9
Changes since 1.2: +10 -13 lines
Log Message:
*** empty log message ***

File Contents

# Content
1 #!/opt/bin/perl
2
3 =head1 NAME
4
5 geo-latlon2place-makedb - generate database for use with Geo::LatLon2Place
6
7 =head1 SYNOPSIS
8
9 geo-latlon2place-makedb [OPTION]... inputfile.txt outputfile.cdb
10
11 =head1 DESCRIPTION
12
13 abc
14
15 =head1 OPTIONS AND ARGUMENTS
16
17 geo-latlon2place-makedb requires two arguments: a text file with geo data
18 and the name of the database file to be written.
19
20 By default, the input file is considered to be in geonames gazetteer
21 format, but this can be customized using B<--extract>.
22
23 =over
24
25 =item B<--cellsize> km (default C<20>, or C<10> for geonames-postalcodes)
26
27 The (minimum) size of a single grid cell in km - the data is binned into
28 cells of at least this size. It should not be larger than the search
29 radius.
30
31 =item B<--extract> C<geonames>|C<geonames-postalcodes>|perl...
32
33 The extraction method: the default is C<geonames>, which expects a
34 geonames database (L<https://download.geonames.org/export/dump/>, for
35 example F<DE.txt>, F<cities500.txt> or F<allCountries.txt>) and extracts
36 I<placename, countrycode> strings from it.
37
38 The method C<geonames-postalcodes> does the same, but for a geonames
39 postal code database L<https://download.geonames.org/export/zip>, and
40 extracts C<zip name, countrycopde> strings.
41
42 Lastly, you can specify a perl fragment that implements your own filtering
43 and extraction.
44
45 =back
46
47 =head1 FILTERING AND EXTRACTION
48
49 Record selection and the returned data are not fixed - you can filter your records
50 yourself and associate any (reasonably short, the maximum is a bit above 200 octets)
51 binary blob with a coordinate pair.
52
53 To do this, you have to provide a perl fragment that extracts latitude,
54 longitude, a weight and the associated data blob from an input line stored
55 in C<$_>. The file is opened using the C<:perlio> layer, so if your input
56 file is in UTF-8, so will be C<$_>.
57
58 For example, the following would expect an input file with space separated
59 latitude, longitude, weight and name, where name can contain spaces, which
60 is useful when you want to provide your own input data:
61
62 geo-latlon2place-makedb --extract 'chomp; split / /, 4' input output
63
64 A slighly more verbose example expecting only latitude, longitude and a
65 name would be:
66
67 geo-latlon2place-makedb --extract '
68 chomp;
69 my ($lat, $lon, $name) = split / /, 4;
70 ($lat, $lon, 1, $name)
71 ' input output
72
73 If you want to skip certain lines without adding anything to the database,
74 you can return nothing:
75
76 geo-latlon2place-makedb --extract '
77 chomp;
78 my ($lat, $lon, $name) = split / /;
79 return unless $lat < 0; # only add southern hemisphere points
80 ($lat, $lon, 1, $name)
81 ' input output
82
83 In general, the fragment should return either an empty list, or a
84 four-tuple with decimal latitude, longitude, a weight (integer 0..255)
85 and the binary data to be associated with the coordinate. Other than the
86 weight, these should be self-explaining. The weight is used during search
87 and will be multiplied to the square of the distance, and is used to make
88 larger cities win over small ones when the coordinate is somewhere between
89 them.
90
91 The standard extractors (C<geonames> and C<geonames-postalcodes>) provide
92 a UTF-8-encoded string as blob, but any binary data will do, for example,
93 if you want to associate your coordinate pairs with some short-ish
94 integer codes, you could do this:
95
96 geo-latlon2place-makedb --extract '
97 chomp;
98 my ($lat, $lon, $id) = split / /, 4;
99 ($lat, $lon, 1, pack "w", $id)
100 ' input output
101
102 And later use C<unpack "w"> on the data returned by C<lookup>.
103
104 The C<geonames> filter looks similar to the following fragment, which
105 shows off some more filtering possibilities:
106
107 my ($id, $name, undef, undef, $lat, $lon, $t1, $t2, $cc, undef, $a1, $s2, $a3, $a4, $pop, undef) = split /\t/;
108
109 return if $t1 ne "P"; # only places
110
111 # minimum population 200, or it is a populated place with no known population
112 $pop => 200
113 or ($pop eq "" and $t2 eq "PPL")
114 or return;
115
116 # skip certain places we are not interested in
117 return if $t2 eq "PPLX"; # section of populated place
118 return if $t2 eq "PPLW"; # destroyed populated place
119 return if $t2 eq "PPLH"; # historical populated place
120 return if $t2 eq "PPLQ"; # abandoned populated place
121
122 # geonames has a lot of very long place names which aren't
123 # actually place names, so ignore very long names
124 60 > length $name
125 or return;
126
127 # we estimate a weight by dividing 25 by the radius of the place,
128 # which we get by assuming a fixed population density of 5000 # people
129 # per square km, # which is almost always a considerable over-estimate.
130 # 25 and 5000 are pretty much made-up, feel free to improve and
131 # send me the results.
132 my $w = 25 / (1 + sqrt $pop / 5000);
133
134 # administrative centers get a fixed low weight
135 if ($t2 =~ /^PPLA(\d*)/) {
136 $w = $1 || 1;
137 }
138
139 ($lat, $lon, $w, "$name, $cc")
140
141 =head1 AUTHOR
142
143 Marc Lehmann <schmorp@schmorp.de>
144
145 =cut
146
147 use common::sense;
148 use Getopt::Long;
149 use Geo::LatLon2Place ();
150 use Pod::Usage ();
151
152 GetOptions
153 "help|h" => sub { Pod::Usage::pod2usage -exittval => 0, verbose => 1 },
154 "extract=s" => \my $extract,
155 "cellsize=i" => \my $km,
156 or Pod::Usage::pod2usage
157 -exitval => 1,
158 ;
159
160 @ARGV == 2
161 or die "need exactly two paths: inputfile.txt outputdatabase.cdb\n";
162
163 $extract ||= "geonames";
164
165 if ($extract eq "geonames") {
166 $extract = sub {
167 my ($id, $name, undef, undef, $lat, $lon, $t1, $t2, $cc, undef, $a1, $s2, $a3, $a4, $pop, undef) = split /\t/;
168
169 return if $t1 ne "P"; # only places
170
171 $pop => 200
172 or ($pop eq "" and $t2 eq "PPL")
173 or return;
174
175 return if $t2 eq "PPLX"; # section of populated place
176 return if $t2 eq "PPLW"; # destroyed populated place
177 return if $t2 eq "PPLH"; # historical populated place
178 return if $t2 eq "PPLQ"; # abandoned populated place
179
180 # geonames has a lot of very long place names which aren't
181 # actually place names, so ignore very long names
182 60 > length $name
183 or return;
184
185 my $w = 25 / (1 + sqrt $pop / 5000);
186
187 if ($t2 =~ /^PPLA(\d+)/) {
188 $w = $1 || 1;
189 }
190
191 ($lat, $lon, $w, "$name, $cc")
192 };
193 } elsif ($extract eq "geonames-postalcodes") {
194 $km ||= 10;
195 $extract = sub {
196 my ($cc, $zip, $name, undef, undef, undef, undef, undef, undef, $lat, $lon, undef) = split /\t/;
197
198 ($lat, $lon, 1, "$zip $name, $cc")
199 };
200 } else {
201 $extract = eval "#line 1 \"extract fragment\"\nsub { $extract; }";
202 die "$@" if $@;
203 }
204
205 $km ||= 20;
206
207 my $torad = (atan2 1,0) / 90;
208
209 my $boxes = int 6378 * 2 * 2 * (atan2 1,0) / $km; # equator radius / cell size
210
211 open my $fh, "<:perlio", $ARGV[0];
212
213 my @grid;
214
215 while (<$fh>) {
216 my ($lat, $lon, $w, $payload) = $extract->()
217 or next;
218
219 unless (255 >= length $payload) {
220 $payload =~ s/([^(\x20-\x7e])/sprintf "\\x%02x", ord $1/ge;
221 warn "payload too long, skipping: $payload\n";
222 next;
223 }
224
225 my $blat = int $boxes * cos $lat * $torad; # can be 0, but does not matter
226 my $x = int (($lon + 180) * $blat / 360);
227 my $y = int (($lat + 90) * $boxes / 180);
228
229 # we use 16 bit for lat/lon, 8 bi8t for the weight, BER-id, counted name and CC
230 push @{ $grid[$y][$x] }, pack "s< s< C C/a*", $lat * 32767 / 90, $lon * 32767 / 180, $w, $payload;
231 }
232
233 my ($max, $sum, $cnt);
234
235 open my $cdb, ">", $ARGV[1]
236 or die;
237 Geo::LatLon2Place::cdb_make_start fileno $cdb
238 and die "cdb_make_start failure";
239
240 Geo::LatLon2Place::cdb_make_add "", pack "a4VVV", "SRGL", 2, $km, $boxes;
241
242 for my $y (0 .. $#grid) {
243 my $r = $grid[$y];
244 for my $x (0 .. $#$r) {
245 my $c = $r->[$x]
246 or next;
247
248 # statistics
249 $sum += scalar @$c;
250 $cnt++;
251 $max = [(scalar @$c), $x, $y] if @$c > $max->[0];
252
253 Geo::LatLon2Place::cdb_make_add
254 +(pack "s< s<", $x, $y),
255 +(join "", @$c)
256 and "cdb_make_add failure";
257 }
258 }
259
260 Geo::LatLon2Place::cdb_make_finish
261 and die "cdb_make_finish failed";
262 close $cdb;
263
264 $max->[2] = $max->[2] * 180 / $boxes - 90;
265 my $blat = int $boxes * cos $max->[2] * $torad;
266 $max->[1] = $max->[1] * 360 / $blat - 180;
267 print "cell size: $km km, grid size: $boxes, non-empty cell count: $cnt\naverage cell size: ", int $sum/$cnt, ", maximum cell size: $max->[0] (at $max->[2] $max->[1])\n";
268