site

source files for beau's website
git clone https://git.beauhilton.com/site.git

index.html (10700B)


      1 <!DOCTYPE html>
      2 <html lang="en">
      3  <head>
      4   <link rel="stylesheet" href="/style.css" type="text/css">
      5   <meta charset="utf-8">
      6   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      7   <meta name="viewport" content="width=device-width, initial-scale=1.0">
      9   <link rel="icon" href="data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'%3E%3Cstyle%3E %23m %7B opacity:0; %7D%0A@media (prefers-color-scheme: dark) %7B %23m %7B opacity:1; %7D %23e %7B opacity:0 %7D%0A%7D %3C/style%3E%3Ctext id='m' y='.9em' font-size='90'%3E🏕️%3C/text%3E%3Ctext id='e' y='.9em' font-size='90'%3E🌞%3C/text%3E%3C/svg%3E">
     10   <title>geocheatcode</title>
     11  </head>
     12  <body>
     13   <div id="page-wrapper">
     14    <div id="header" role="banner">
     15     <header class="banner">
     16      <div id="banner-text">
     17       <span class="banner-title"><a href="/">beauhilton</a></span>
     18      </div>
     19     </header>
     20     <nav>
     21      <a href="/about">about</a>
     22 <a href="/now">now</a>
     23 <a class="nav-active" href="/posts">posts</a>
     24 <a href="https://notes.beauhilton.com">notes</a>
     25 <a href="https://talks.beauhilton.com">talks</a>
     26 <a href="https://git.beauhilton.com">git</a>
     27 <a href="/contact">contact</a>
     28 <a href="/feed.xml">rss</a>
     29     </nav>
     30    </div>
     31    <main>
     32     <h1>
     33      geocheatcode
     34     </h1>
     35     <p>
     36      <time id="post-date">2022-04-22</time>
     37     </p>
     38     <p id="post-excerpt">
     39      Here is background and code
     40 for a trick I use to get
     41 Google to give me best-in-class guesses 
     42 for latitude and longitude,
     43 despite goofy and/or downright bad location searches.
     44     </p>
     45     <h2>
     46      Map all the things
     47     </h2>
     48     <p>
     49      I love maps.
     50     </p>
     51     <p>
     52      Several of my projects involve mapping things at scale.
     53     </p>
     54     <p>
     55      When you want to map a few things, you type searches into Google Maps
     56 and get addresses and/or latitudes and longitudes quickly and
     57 reliably.
     58     </p>
     59     <p>
     60      But what if you’d like to map 90,000 things whose locations you don’t
     61 yet know?
     62     </p>
     63     <p>
     64      <a href="https://developers.google.com/maps">Google</a> and <a href="https://www.openstreetmap.org/">OpenStreetMap</a>, as well as
     65 others, provide mapping services you can call programmatically from your
     66 software. You send in some query, such as “VUMC Internal Medicine,” and
     67 they return information relevant to that query, such as street address
     68 and latitude and longitude. Up to a certain number of queries per day or
     69 hour, the services are free, and since my work is academic, rather than
     70 real-time mapping for some for-profit app, I am happy to send in small
     71 batches to stay under the limits in the free tier.
     72     </p>
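    <p>
     As a rough sketch of what “calling programmatically” means, here is
how a request URL for Nominatim (the geocoder behind OpenStreetMap) can
be built. The helper name <code>nominatim_url</code> and the choice of
parameters are illustrative assumptions, not something the original
project necessarily used, and nothing is sent over the network here.
    </p>

```python
from urllib.parse import urlencode

# Public Nominatim search endpoint (the geocoder behind OpenStreetMap).
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def nominatim_url(query: str) -> str:
    """Build a search URL asking for the single best match as JSON.

    Hypothetical helper for illustration; it only builds the URL.
    """
    params = {"q": query, "format": "json", "limit": 1}
    return NOMINATIM_SEARCH + "?" + urlencode(params)


print(nominatim_url("VUMC Internal Medicine"))
```

    <p>
     Fetching that URL should return a JSON list whose entries carry
<code>lat</code> and <code>lon</code> fields. Check the service’s usage
policy before sending anything in bulk.
    </p>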
     73     <p>
     74      I’ve used these services to make large maps, and they work pretty
     75 well.
     76     </p>
     77     <p>
     78      <em>Pretty</em> well.
     79     </p>
     80     <h2>
     81      But mapping is hard
     82     </h2>
     83     <p>
     84      Problems with these services:
     85     </p>
     86     <ol type="1">
     87      <li>
     88       they expect well-formed, reasonable queries
     89      </li>
     90      <li>
     91       if they don’t know the answer, their guesses are often wildly off,
     92 or they refuse to guess at all
     93      </li>
     94     </ol>
     95     <p>
     96      If I’m mapping 90,000 things, I’m going to write some code to go
     97 through each of those 90,000 things and ask the mapping services to
     98 kindly tell me what I want to know. Though I write sanitization code to
     99 clean up the 90,000 things, I’m not going to quality check each of those
    100 90,000 things. Sometimes things among the 90,000 things are kinda nuts
    101 (misspelled, inclusive of extraneous data, oddly formatted), in
    102 idiosyncratic ways that are impossible to completely cover, no matter
    103 how much code I write to catch the weird cases.
    104     </p>
    105     <p>
    106      I would like a solution that is fairly tolerant of weirdnesses, and
    107 makes good guesses.
    108     </p>
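    <p>
     For a sense of what basic cleanup code can and cannot do, here is a
minimal sketch. The function name <code>tidy_query</code> and the
particular rules are assumptions made for illustration. It handles stray
whitespace and punctuation, but it has no answer for misspellings or
extraneous words, which is exactly the gap described above.
    </p>

```python
import re


def tidy_query(raw: str) -> str:
    """Collapse whitespace runs and trim stray punctuation from the ends.

    Hypothetical helper: the rules here are illustrative, not exhaustive.
    """
    collapsed = re.sub(r"\s+", " ", raw)
    return collapsed.strip(" ,;:-")


print(tidy_query("  Vanderbilt University\n   Medical Center ;; "))
```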
    109     <h2>
    110      Google is really good at search
    111     </h2>
    112     <p>
    113      I noticed that when I manually typed things into the Google Maps
    114 search bar, it forgave a myriad of sins and did a great job centering
    115 the map on its best guess. When I copied and pasted some of the weird
    116 things among the 90,000 into the Google Maps search bar (the same things
    117 that made the official mapping services - including Google’s - go all
    118 Poltergeist), <em>voilà!</em>, the right answer appeared, with success rates
    119 nearing 100%.
    120     </p>
    121     <p>
    122      I thought there must be a way to repeat this process with code, in a
    123 scalable way.
    124     </p>
    125     <p>
    126      Turns out there is, and it’s easy.
    127     </p>
    128     <h2>
    129      <code>geocheatcode.py</code>
    130     </h2>
    131     <pre tabindex="0"><code class="language-python">
    132 <span class="hl kwa">from</span> requests_html <span class="hl kwa">import</span> HTMLSession
    133 
    134 session <span class="hl opt">=</span> <span class="hl kwd">HTMLSession</span><span class="hl opt">()</span>
    135 
    136 
    137 <span class="hl kwa">def</span> <span class="hl kwd">google_lat_lon</span><span class="hl opt">(</span>query<span class="hl opt">:</span> <span class="hl kwb">str</span><span class="hl opt">):</span>
    138 
    139     url <span class="hl opt">=</span> <span class="hl sng">"https://www.google.com/maps/search/?api=1"</span>
    140     params <span class="hl opt">= {}</span>
    141     params<span class="hl opt">[</span><span class="hl sng">"query"</span><span class="hl opt">] =</span> query
    142 
    143     r <span class="hl opt">=</span> session<span class="hl opt">.</span><span class="hl kwd">get</span><span class="hl opt">(</span>url<span class="hl opt">,</span> params<span class="hl opt">=</span>params<span class="hl opt">)</span>
    144 
    145     reg <span class="hl opt">=</span> <span class="hl sng">"APP_INITIALIZATION_STATE=[[[{}]"</span>  <span class="hl slc"># a parse-style template, not a regex: {} captures the first bracketed triple</span>
    146     res <span class="hl opt">=</span> r<span class="hl opt">.</span>html<span class="hl opt">.</span><span class="hl kwd">search</span><span class="hl opt">(</span>reg<span class="hl opt">)[</span><span class="hl num">0</span><span class="hl opt">]</span>
    147     lat <span class="hl opt">=</span> res<span class="hl opt">.</span><span class="hl kwd">split</span><span class="hl opt">(</span><span class="hl sng">","</span><span class="hl opt">)[</span><span class="hl num">2</span><span class="hl opt">]</span>  <span class="hl slc"># third value in the triple is the latitude</span>
    148     lon <span class="hl opt">=</span> res<span class="hl opt">.</span><span class="hl kwd">split</span><span class="hl opt">(</span><span class="hl sng">","</span><span class="hl opt">)[</span><span class="hl num">1</span><span class="hl opt">]</span>  <span class="hl slc"># second value is the longitude</span>
    149 
    150     <span class="hl kwa">return</span> lat<span class="hl opt">,</span> lon
    151 
    152 
    153 extraneous <span class="hl opt">=</span> <span class="hl sng">""" something something</span>
    154 <span class="hl sng">                 the earth is banana shaped</span>
    155 <span class="hl sng">                 latitude and longitude </span>
    156 <span class="hl sng">                 wouldn't you like to know, maybe """</span>
    157 
    158 relevant <span class="hl opt">=</span> <span class="hl sng">""" Vanderbilt University Medical Center </span>
    159 <span class="hl sng">               Internal Medicine """</span>
    160 
    161 query <span class="hl opt">=</span> extraneous <span class="hl opt">+</span> relevant
    162 
    163 lat<span class="hl opt">,</span> lon <span class="hl opt">=</span> <span class="hl kwd">google_lat_lon</span><span class="hl opt">(</span>query<span class="hl opt">)</span>
    164 
    165 <span class="hl kwa">print</span><span class="hl opt">(</span> 
    166        <span class="hl sng">"Hello. "</span>
    167        <span class="hl sng">"My name is Google. "</span>
    168        <span class="hl sng">"I am really good at guessing what you meant. "</span>
    169       f<span class="hl sng">"Your query was '</span><span class="hl ipl">{query}</span><span class="hl sng">'. "</span>
    170        <span class="hl sng">"Here are the coordinates you probably wanted. "</span>
    171       f<span class="hl sng">"The latitude is</span> <span class="hl ipl">{lat}</span><span class="hl sng">, and the longitude is</span> <span class="hl ipl">{lon}</span><span class="hl sng">. "</span>
    172        <span class="hl sng">"Don't believe me? "</span>
    173        <span class="hl sng">"Here it is again, "</span>
    174        <span class="hl sng">"in a format you can paste into the search bar:</span> <span class="hl esc">\n</span><span class="hl sng">"</span>
    175       f<span class="hl sng">"</span><span class="hl ipl">{lat}</span><span class="hl sng">,</span> <span class="hl ipl">{lon}</span> <span class="hl sng"></span><span class="hl esc">\n</span><span class="hl sng">"</span>
    176        <span class="hl sng">"Told ya. "</span>
    177 <span class="hl opt">)</span>
    178 </code></pre>
    179     <p>
    180      Despite having all that extra junk in the query, this returns the
    181 right answer. Because Google is many things good and evil, but of these
    182 one is certain: Google is <em>really</em> good at search.
    183     </p>
    184     <h2>
    185      How does the code work?
    186     </h2>
    187     <p>
    188      If you inspect the source HTML on the Google Maps website after you
    189 search for something and it centers the map on its best guess, and you
    190 scroll way on down (or Ctrl-F search for it), you’ll find
    191 <code>APP_INITIALIZATION_STATE</code>, which contains latitude and
    192 longitude for the place the map centered on.
    193     </p>
    194     <ul>
    195      <li>
    196       <a href="https://www.google.com/maps?q=something+whose+latitude+and+longitude+you+would+like+to+know,+maybe+VUMC+Internal+Medicine">example
    197 search</a>
    198      </li>
    199      <li>
    200       <a href="view-source:https://www.google.com/maps/search/something+whose+latitude+and+longitude+you+would+like+to+know,+maybe+VUMC+Internal+Medicine/">example
    201 source</a> (you have to copy and paste this link into a new tab
    202 manually; clicking won’t work)
    203      </li>
    204     </ul>
    205     <p>
    206      I use the lovely <a href="https://docs.python-requests.org/projects/requests-html/en/latest/"><code>requests-html</code></a>
    207 Python library to send the query to Google, receive the response, and
    208 search through the response for the part I want to extract. Then I use a
    209 little standard Python to parse the extracted part and save the
    210 important bits.
    211     </p>
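    <p>
     The parsing step can also be tried offline. The sketch below uses
only the standard library and runs against an invented sample string
shaped like the page source, not real Google output. The function name
<code>parse_center</code> is hypothetical. It assumes the first
bracketed triple holds three comma-separated numbers with longitude
second and latitude third, the same index order
<code>geocheatcode.py</code> relies on. Unlike the original, it returns
<code>None</code> instead of raising an error when the marker is
missing.
    </p>

```python
import re
from typing import Optional, Tuple

# Matches the first bracketed triple after APP_INITIALIZATION_STATE=[[[
_STATE = re.compile(r"APP_INITIALIZATION_STATE=\[\[\[([^\]]*)\]")


def parse_center(page_source: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) as floats, or None if the marker is missing or malformed."""
    match = _STATE.search(page_source)
    if match is None:
        return None
    parts = match.group(1).split(",")
    if len(parts) != 3:
        return None
    try:
        # Same index order as geocheatcode.py: [1] is longitude, [2] is latitude.
        return float(parts[2]), float(parts[1])
    except ValueError:
        return None


# Invented sample, shaped like the page source but not copied from it.
sample = "window.APP_INITIALIZATION_STATE=[[[2500.0,-86.8,36.14],[0,0,0]]];"
print(parse_center(sample))            # (36.14, -86.8)
print(parse_center("no marker here"))  # None
```

    <p>
     Returning <code>None</code> matters at scale: in a run of 90,000
queries, a single page without the marker should not stop the whole
batch.
    </p>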
    212     <h2>
    213      With great power…
    214     </h2>
    215     <p>
    216      Don’t go crazy with this.
    217     </p>
    218     <p>
    219      The trick is good for leisurely automation of location retrieval when
    220 you have squirrelly queries.
    221     </p>
    222     <p>
    223      If you need real-time mapping of many things, you don’t want this
    224 solution. Use the actual APIs, and work instead on formatting the
    225 queries properly before sending them to Google/OSM.
    226     </p>
    227     <p>
    228      Also, if you try to query too much/too quickly, Google will shut you
    229 out after a little while. Put a few seconds of delay between each
    230 request and run it overnight and/or in automated batches.
    231     </p>
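    <p>
     A minimal sketch of that kind of throttled batch follows. The
function name <code>lookup_slowly</code>, the default delay range of 3
to 6 seconds, and the choice to record failures as <code>None</code>
are all assumptions for illustration; “a few seconds” is the only
guidance given above. The lookup function is passed in, so the demo
uses a stand-in rather than contacting Google.
    </p>

```python
import random
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[str, str]


def lookup_slowly(
    queries: Iterable[str],
    lookup: Callable[[str], Coords],
    min_delay: float = 3.0,  # assumed value, tune to taste
    max_delay: float = 6.0,  # assumed value, tune to taste
) -> Dict[str, Optional[Coords]]:
    """Call lookup once per query, pausing a random interval after each call."""
    results: Dict[str, Optional[Coords]] = {}
    for query in queries:
        try:
            results[query] = lookup(query)
        except Exception:
            # A failed or blocked lookup is recorded as None so it can be retried later.
            results[query] = None
        time.sleep(random.uniform(min_delay, max_delay))
    return results


def fake_lookup(query: str) -> Coords:
    """Stand-in for a real lookup; returns fixed made-up coordinates."""
    if not query.strip():
        raise ValueError("empty query")
    return ("36.14", "-86.80")


# Zero delay here only so the demo finishes instantly.
print(lookup_slowly(["VUMC Internal Medicine", "  "], fake_lookup, 0, 0))
```

    <p>
     In real use, something like
<code>lookup_slowly(things, google_lat_lon)</code> with the default
delays would do the overnight run, and the entries left as
<code>None</code> can be retried in a later batch.
    </p>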
    232     <h2>
    233      Know a better way?
    234     </h2>
    235     <p>
    236      I’d love to know. Drop me a line.
    237     </p>
    238    </main>
    239    <div id="footnotes"></div>
    240    <footer></footer>
    241   </div>
    242  </body>
    243 </html>