Lab 02: Georeferencing Location-based Social Media#

In this tutorial, we will learn:

  • How to extract information (e.g., tweets) from location-based social media (e.g., Twitter)

  • How to identify locational information (e.g., place name) from text-based data (e.g., tweets, newspapers)

  • How to georeference the identified location to metric coordinates on the surface of the earth

Several libraries/packages are needed in this tutorial. Use pip or conda to install them:

  • tweepy: the library used to access the Twitter API

  • spaCy: the library used for natural language processing

  • spacy-dbpedia-spotlight: a small library that links entities recognized by spaCy to the corresponding DBpedia entities.
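The installs above can be sketched as a couple of shell commands (package names as published on PyPI; the small English spaCy model shown here is one common choice, not the only one):

```shell
pip install tweepy spacy spacy-dbpedia-spotlight

# spaCy also needs a trained pipeline; download a small English model
python -m spacy download en_core_web_sm
```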

Part 1-1: Retrieving data from location-based social media#

We can retrieve information from many location-based social media platforms, such as Foursquare, Twitter, Google Places, and Yelp. The key technique is the API (Application Programming Interface), which bridges your program and the database these services maintain. In this practice, we learn to access data from two platforms – Foursquare and Twitter (X). Note that since X changed its API policy in 2023 and now charges for data access, this practice only provides the code; you are not required to run it or pay for access yourself. The code is provided for your future reference in case you want to use it in your career.

To use an API in Python, we need to install the requests library. The common workflow is shown below. First, define the endpoint url where the data is served (you can usually find this in the API’s documentation, e.g., Foursquare Place). Then set up the query parameters in the params variable (again, the available parameters are listed in the documentation). Finally, set up the headers of your request, which define the data format of the response (Accept) and your API Authorization. Where do you get the authorization? Different APIs handle this differently; for Foursquare, please check this documentation.

No matter which API you use, the logic and process are mostly the same, as shown below. But the details of setting up the parameters, the url, and the authorization can differ quite a bit. Luckily, most of these APIs provide detailed documentation with examples.

Now, please read the Foursquare Place documentation, follow the instructions to create an account/authorization, and then replicate the following steps.

The following code queries coffee places within 500 meters of the School of Geographic Science building and ranks them by distance.

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation


url = "https://api.foursquare.com/v3/places/search"

params = {
  "query": "coffee",
  "ll": "51.45712270688956,-2.605159893254094",
  "radius": "500",  
  "sort":"DISTANCE"
}

headers = {
    "Accept": "application/json",
    "Authorization": "fsq3o8iIO/yZccIR5QzSI1Jmz0RX3vJ2HXCkuJVDwk9o6rQ=" # please change it to your own Authorization
}

response = requests.request("GET", url, params=params, headers=headers)
print(response.text)
{"results":[{"fsq_id":"4b058836f964a520ceb822e3","categories":[{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13035,"name":"Coffee Shop","short_name":"Coffee Shop","plural_name":"Coffee Shops","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_","suffix":".png"}},{"id":13065,"name":"Restaurant","short_name":"Restaurant","plural_name":"Restaurants","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/default_","suffix":".png"}}],"chains":[],"closed_bucket":"VeryLikelyOpen","distance":231,"geocodes":{"main":{"latitude":51.455219,"longitude":-2.603972},"roof":{"latitude":51.455219,"longitude":-2.603972}},"link":"/v3/places/4b058836f964a520ceb822e3","location":{"address":"75 Park St","admin_region":"England","country":"GB","formatted_address":"75 Park St, Bristol, BS1 5PF","locality":"Bristol","post_town":"Bristol","postcode":"BS1 5PF","region":"Bristol"},"name":"Boston Tea Party","related_places":{},"timezone":"Europe/London"},{"fsq_id":"557d9e15498e3fc4dda1ec92","categories":[{"id":13002,"name":"Bakery","short_name":"Bakery","plural_name":"Bakeries","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/bakery_","suffix":".png"}},{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13065,"name":"Restaurant","short_name":"Restaurant","plural_name":"Restaurants","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/default_","suffix":".png"}}],"chains":[],"closed_bucket":"LikelyOpen","distance":350,"geocodes":{"drop_off":{"latitude":51.455446,"longitude":-2.600694},"main":{"latitude":51.45533,"longitude":-2.600682},"roof":{"latitude":51.45533,"longitude":-2.600682}},"link":"/v3/places/557d9e15498e3fc4dda1ec92","location":{"address":"16 Park 
Row","admin_region":"England","country":"GB","cross_street":"","formatted_address":"16 Park Row, Bristol, BS1 5LJ","locality":"Bristol","post_town":"Bristol","postcode":"BS1 5LJ","region":"Bristol"},"name":"Sotiris Bakery","related_places":{},"timezone":"Europe/London"},{"fsq_id":"517e90a0e4b0f7d859cc9578","categories":[{"id":13036,"name":"Tea Room","short_name":"Tea Room","plural_name":"Tea Rooms","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/tearoom_","suffix":".png"}}],"chains":[],"closed_bucket":"VeryLikelyOpen","distance":360,"geocodes":{"main":{"latitude":51.456274,"longitude":-2.610294},"roof":{"latitude":51.456274,"longitude":-2.610294}},"link":"/v3/places/517e90a0e4b0f7d859cc9578","location":{"country":"GB","cross_street":"","formatted_address":""},"name":"Ross' Tea Corner","related_places":{},"timezone":"Europe/London"},{"fsq_id":"4b6447c0f964a5200ca82ae3","categories":[{"id":12010,"name":"Adult Education","short_name":"Adult Education","plural_name":"Adult Education","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/building/school_","suffix":".png"}}],"chains":[],"closed_bucket":"VeryLikelyOpen","distance":406,"geocodes":{"main":{"latitude":51.45419,"longitude":-2.602153},"roof":{"latitude":51.45419,"longitude":-2.602153}},"link":"/v3/places/4b6447c0f964a5200ca82ae3","location":{"address":"40a Park St","admin_region":"England","country":"GB","cross_street":"","formatted_address":"40a Park St, Bristol, BS1 5JG","locality":"Bristol","post_town":"Bristol","postcode":"BS1 5JG","region":"Bristol"},"name":"Bristol Folk House","related_places":{},"timezone":"Europe/London"},{"fsq_id":"5e6235421381060008b18679","categories":[{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13035,"name":"Coffee Shop","short_name":"Coffee Shop","plural_name":"Coffee 
Shops","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_","suffix":".png"}},{"id":13065,"name":"Restaurant","short_name":"Restaurant","plural_name":"Restaurants","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/default_","suffix":".png"}}],"chains":[],"closed_bucket":"Unsure","distance":428,"geocodes":{"drop_off":{"latitude":51.459369,"longitude":-2.609964},"main":{"latitude":51.459152,"longitude":-2.609873},"roof":{"latitude":51.459152,"longitude":-2.609873}},"link":"/v3/places/5e6235421381060008b18679","location":{"address":"St Pauls Rd","admin_region":"England","country":"GB","cross_street":"","formatted_address":"St Pauls Rd, Bristol, BS8 1LX","locality":"Bristol","post_town":"Bristol","postcode":"BS8 1LX","region":"Bristol"},"name":"Caffe Clifton","related_places":{"parent":{"fsq_id":"4b058834f964a5204bb822e3","categories":[{"id":19014,"name":"Hotel","short_name":"Hotel","plural_name":"Hotels","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/travel/hotel_","suffix":".png"}}],"name":"The Clifton Hotel"}},"timezone":"Europe/London"},{"fsq_id":"4ce3baeea3c6721ed2df11cf","categories":[{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13065,"name":"Restaurant","short_name":"Restaurant","plural_name":"Restaurants","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/default_","suffix":".png"}}],"chains":[],"closed_bucket":"VeryLikelyOpen","distance":440,"geocodes":{"drop_off":{"latitude":51.458888,"longitude":-2.611066},"main":{"latitude":51.458696,"longitude":-2.61096},"roof":{"latitude":51.458696,"longitude":-2.61096}},"link":"/v3/places/4ce3baeea3c6721ed2df11cf","location":{"address":"St Paul's Rd","country":"GB","cross_street":"","formatted_address":"St Paul's Rd, Bristol","locality":"Bristol","post_town":"Bristol","region":"Bristol"},"name":"Caffe 
Clifton","related_places":{},"timezone":"Europe/London"},{"fsq_id":"59bd2120da5ede2141070f99","categories":[{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13040,"name":"Dessert Shop","short_name":"Desserts","plural_name":"Dessert Shops","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/dessert_","suffix":".png"}},{"id":13065,"name":"Restaurant","short_name":"Restaurant","plural_name":"Restaurants","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/default_","suffix":".png"}}],"chains":[],"closed_bucket":"VeryLikelyOpen","distance":458,"geocodes":{"main":{"latitude":51.453793,"longitude":-2.601547},"roof":{"latitude":51.453793,"longitude":-2.601547}},"link":"/v3/places/59bd2120da5ede2141070f99","location":{"address":"20 Park St","admin_region":"England","country":"GB","cross_street":"","formatted_address":"20 Park St, Bristol, BS1 5JA","locality":"Bristol","post_town":"Bristol","postcode":"BS1 5JA","region":"Bristol"},"name":"Mrs. Potts Chocolate House","related_places":{},"timezone":"Europe/London"},{"fsq_id":"4c3469ed3ffc952179da90f5","categories":[{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13035,"name":"Coffee Shop","short_name":"Coffee Shop","plural_name":"Coffee Shops","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_","suffix":".png"}},{"id":13065,"name":"Restaurant","short_name":"Restaurant","plural_name":"Restaurants","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/default_","suffix":".png"}}],"chains":[],"closed_bucket":"VeryLikelyOpen","distance":462,"geocodes":{"main":{"latitude":51.460572,"longitude":-2.60179},"roof":{"latitude":51.460572,"longitude":-2.60179}},"link":"/v3/places/4c3469ed3ffc952179da90f5","location":{"address":"135 St. 
Michaels Hill","admin_region":"England","country":"GB","cross_street":"at Highbury Villas","formatted_address":"135 St. Michaels Hill (at Highbury Villas), Bristol, BS2 8BS","locality":"Bristol","post_town":"Bristol","postcode":"BS2 8BS"},"name":"Twoday Coffee Roasters","related_places":{},"timezone":"Europe/London"},{"fsq_id":"4b058837f964a520deb822e3","categories":[{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13035,"name":"Coffee Shop","short_name":"Coffee Shop","plural_name":"Coffee Shops","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_","suffix":".png"}},{"id":13065,"name":"Restaurant","short_name":"Restaurant","plural_name":"Restaurants","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/default_","suffix":".png"}}],"chains":[],"closed_bucket":"VeryLikelyOpen","distance":464,"geocodes":{"main":{"latitude":51.453685,"longitude":-2.601516},"roof":{"latitude":51.453685,"longitude":-2.601516}},"link":"/v3/places/4b058837f964a520deb822e3","location":{"address":"18 Park St","admin_region":"England","country":"GB","cross_street":"","formatted_address":"18 Park St, Bristol, BS1 5JA","locality":"Bristol","post_town":"Bristol","postcode":"BS1 
5JA","region":"Bristol"},"name":"Woodes","related_places":{},"timezone":"Europe/London"},{"fsq_id":"59ac098a86bc494d39e6effc","categories":[{"id":13034,"name":"Café","short_name":"Café","plural_name":"Cafés","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/food/cafe_","suffix":".png"}},{"id":13003,"name":"Bar","short_name":"Bar","plural_name":"Bars","icon":{"prefix":"https://ss3.4sqi.net/img/categories_v2/nightlife/pub_","suffix":".png"}}],"chains":[],"closed_bucket":"Unsure","distance":469,"geocodes":{"main":{"latitude":51.457739,"longitude":-2.612143},"roof":{"latitude":51.457739,"longitude":-2.612143}},"link":"/v3/places/59ac098a86bc494d39e6effc","location":{"address":"99 Queens Rd","admin_region":"England","country":"GB","cross_street":"","formatted_address":"99 Queens Rd, Bristol, BS8 1LW","locality":"Bristol","post_town":"Bristol","postcode":"BS8 1LW","region":"Bristol"},"name":"99 Queens","related_places":{},"timezone":"Europe/London"}],"context":{"geo_bounds":{"circle":{"center":{"latitude":51.45712270688956,"longitude":-2.605159893254094},"radius":500}}}}

As can be seen, the response is one messy chunk of text that is hard to interpret. We can parse it into a structured form using the json library.

import json
json_data = json.loads(response.text)
json_data
{'results': [{'fsq_id': '4b058836f964a520ceb822e3',
   'categories': [{'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13035,
     'name': 'Coffee Shop',
     'short_name': 'Coffee Shop',
     'plural_name': 'Coffee Shops',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
      'suffix': '.png'}},
    {'id': 13065,
     'name': 'Restaurant',
     'short_name': 'Restaurant',
     'plural_name': 'Restaurants',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'VeryLikelyOpen',
   'distance': 231,
   'geocodes': {'main': {'latitude': 51.455219, 'longitude': -2.603972},
    'roof': {'latitude': 51.455219, 'longitude': -2.603972}},
   'link': '/v3/places/4b058836f964a520ceb822e3',
   'location': {'address': '75 Park St',
    'admin_region': 'England',
    'country': 'GB',
    'formatted_address': '75 Park St, Bristol, BS1 5PF',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS1 5PF',
    'region': 'Bristol'},
   'name': 'Boston Tea Party',
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '557d9e15498e3fc4dda1ec92',
   'categories': [{'id': 13002,
     'name': 'Bakery',
     'short_name': 'Bakery',
     'plural_name': 'Bakeries',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/bakery_',
      'suffix': '.png'}},
    {'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13065,
     'name': 'Restaurant',
     'short_name': 'Restaurant',
     'plural_name': 'Restaurants',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'LikelyOpen',
   'distance': 350,
   'geocodes': {'drop_off': {'latitude': 51.455446, 'longitude': -2.600694},
    'main': {'latitude': 51.45533, 'longitude': -2.600682},
    'roof': {'latitude': 51.45533, 'longitude': -2.600682}},
   'link': '/v3/places/557d9e15498e3fc4dda1ec92',
   'location': {'address': '16 Park Row',
    'admin_region': 'England',
    'country': 'GB',
    'cross_street': '',
    'formatted_address': '16 Park Row, Bristol, BS1 5LJ',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS1 5LJ',
    'region': 'Bristol'},
   'name': 'Sotiris Bakery',
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '517e90a0e4b0f7d859cc9578',
   'categories': [{'id': 13036,
     'name': 'Tea Room',
     'short_name': 'Tea Room',
     'plural_name': 'Tea Rooms',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/tearoom_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'VeryLikelyOpen',
   'distance': 360,
   'geocodes': {'main': {'latitude': 51.456274, 'longitude': -2.610294},
    'roof': {'latitude': 51.456274, 'longitude': -2.610294}},
   'link': '/v3/places/517e90a0e4b0f7d859cc9578',
   'location': {'country': 'GB', 'cross_street': '', 'formatted_address': ''},
   'name': "Ross' Tea Corner",
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '4b6447c0f964a5200ca82ae3',
   'categories': [{'id': 12010,
     'name': 'Adult Education',
     'short_name': 'Adult Education',
     'plural_name': 'Adult Education',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/building/school_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'VeryLikelyOpen',
   'distance': 406,
   'geocodes': {'main': {'latitude': 51.45419, 'longitude': -2.602153},
    'roof': {'latitude': 51.45419, 'longitude': -2.602153}},
   'link': '/v3/places/4b6447c0f964a5200ca82ae3',
   'location': {'address': '40a Park St',
    'admin_region': 'England',
    'country': 'GB',
    'cross_street': '',
    'formatted_address': '40a Park St, Bristol, BS1 5JG',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS1 5JG',
    'region': 'Bristol'},
   'name': 'Bristol Folk House',
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '5e6235421381060008b18679',
   'categories': [{'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13035,
     'name': 'Coffee Shop',
     'short_name': 'Coffee Shop',
     'plural_name': 'Coffee Shops',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
      'suffix': '.png'}},
    {'id': 13065,
     'name': 'Restaurant',
     'short_name': 'Restaurant',
     'plural_name': 'Restaurants',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'Unsure',
   'distance': 428,
   'geocodes': {'drop_off': {'latitude': 51.459369, 'longitude': -2.609964},
    'main': {'latitude': 51.459152, 'longitude': -2.609873},
    'roof': {'latitude': 51.459152, 'longitude': -2.609873}},
   'link': '/v3/places/5e6235421381060008b18679',
   'location': {'address': 'St Pauls Rd',
    'admin_region': 'England',
    'country': 'GB',
    'cross_street': '',
    'formatted_address': 'St Pauls Rd, Bristol, BS8 1LX',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS8 1LX',
    'region': 'Bristol'},
   'name': 'Caffe Clifton',
   'related_places': {'parent': {'fsq_id': '4b058834f964a5204bb822e3',
     'categories': [{'id': 19014,
       'name': 'Hotel',
       'short_name': 'Hotel',
       'plural_name': 'Hotels',
       'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/travel/hotel_',
        'suffix': '.png'}}],
     'name': 'The Clifton Hotel'}},
   'timezone': 'Europe/London'},
  {'fsq_id': '4ce3baeea3c6721ed2df11cf',
   'categories': [{'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13065,
     'name': 'Restaurant',
     'short_name': 'Restaurant',
     'plural_name': 'Restaurants',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'VeryLikelyOpen',
   'distance': 440,
   'geocodes': {'drop_off': {'latitude': 51.458888, 'longitude': -2.611066},
    'main': {'latitude': 51.458696, 'longitude': -2.61096},
    'roof': {'latitude': 51.458696, 'longitude': -2.61096}},
   'link': '/v3/places/4ce3baeea3c6721ed2df11cf',
   'location': {'address': "St Paul's Rd",
    'country': 'GB',
    'cross_street': '',
    'formatted_address': "St Paul's Rd, Bristol",
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'region': 'Bristol'},
   'name': 'Caffe Clifton',
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '59bd2120da5ede2141070f99',
   'categories': [{'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13040,
     'name': 'Dessert Shop',
     'short_name': 'Desserts',
     'plural_name': 'Dessert Shops',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/dessert_',
      'suffix': '.png'}},
    {'id': 13065,
     'name': 'Restaurant',
     'short_name': 'Restaurant',
     'plural_name': 'Restaurants',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'VeryLikelyOpen',
   'distance': 458,
   'geocodes': {'main': {'latitude': 51.453793, 'longitude': -2.601547},
    'roof': {'latitude': 51.453793, 'longitude': -2.601547}},
   'link': '/v3/places/59bd2120da5ede2141070f99',
   'location': {'address': '20 Park St',
    'admin_region': 'England',
    'country': 'GB',
    'cross_street': '',
    'formatted_address': '20 Park St, Bristol, BS1 5JA',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS1 5JA',
    'region': 'Bristol'},
   'name': 'Mrs. Potts Chocolate House',
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '4c3469ed3ffc952179da90f5',
   'categories': [{'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13035,
     'name': 'Coffee Shop',
     'short_name': 'Coffee Shop',
     'plural_name': 'Coffee Shops',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
      'suffix': '.png'}},
    {'id': 13065,
     'name': 'Restaurant',
     'short_name': 'Restaurant',
     'plural_name': 'Restaurants',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'VeryLikelyOpen',
   'distance': 462,
   'geocodes': {'main': {'latitude': 51.460572, 'longitude': -2.60179},
    'roof': {'latitude': 51.460572, 'longitude': -2.60179}},
   'link': '/v3/places/4c3469ed3ffc952179da90f5',
   'location': {'address': '135 St. Michaels Hill',
    'admin_region': 'England',
    'country': 'GB',
    'cross_street': 'at Highbury Villas',
    'formatted_address': '135 St. Michaels Hill (at Highbury Villas), Bristol, BS2 8BS',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS2 8BS'},
   'name': 'Twoday Coffee Roasters',
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '4b058837f964a520deb822e3',
   'categories': [{'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13035,
     'name': 'Coffee Shop',
     'short_name': 'Coffee Shop',
     'plural_name': 'Coffee Shops',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/coffeeshop_',
      'suffix': '.png'}},
    {'id': 13065,
     'name': 'Restaurant',
     'short_name': 'Restaurant',
     'plural_name': 'Restaurants',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'VeryLikelyOpen',
   'distance': 464,
   'geocodes': {'main': {'latitude': 51.453685, 'longitude': -2.601516},
    'roof': {'latitude': 51.453685, 'longitude': -2.601516}},
   'link': '/v3/places/4b058837f964a520deb822e3',
   'location': {'address': '18 Park St',
    'admin_region': 'England',
    'country': 'GB',
    'cross_street': '',
    'formatted_address': '18 Park St, Bristol, BS1 5JA',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS1 5JA',
    'region': 'Bristol'},
   'name': 'Woodes',
   'related_places': {},
   'timezone': 'Europe/London'},
  {'fsq_id': '59ac098a86bc494d39e6effc',
   'categories': [{'id': 13034,
     'name': 'Café',
     'short_name': 'Café',
     'plural_name': 'Cafés',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/cafe_',
      'suffix': '.png'}},
    {'id': 13003,
     'name': 'Bar',
     'short_name': 'Bar',
     'plural_name': 'Bars',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/nightlife/pub_',
      'suffix': '.png'}}],
   'chains': [],
   'closed_bucket': 'Unsure',
   'distance': 469,
   'geocodes': {'main': {'latitude': 51.457739, 'longitude': -2.612143},
    'roof': {'latitude': 51.457739, 'longitude': -2.612143}},
   'link': '/v3/places/59ac098a86bc494d39e6effc',
   'location': {'address': '99 Queens Rd',
    'admin_region': 'England',
    'country': 'GB',
    'cross_street': '',
    'formatted_address': '99 Queens Rd, Bristol, BS8 1LW',
    'locality': 'Bristol',
    'post_town': 'Bristol',
    'postcode': 'BS8 1LW',
    'region': 'Bristol'},
   'name': '99 Queens',
   'related_places': {},
   'timezone': 'Europe/London'}],
 'context': {'geo_bounds': {'circle': {'center': {'latitude': 51.45712270688956,
     'longitude': -2.605159893254094},
    'radius': 500}}}}
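The parsed object is an ordinary Python dictionary, so you can drill into it with normal indexing. A minimal sketch, using a hypothetical miniature payload with the same shape as the response above:

```python
import json

# A tiny made-up payload mirroring the structure of the Foursquare response
raw = ('{"results": [{"fsq_id": "abc123", "name": "Demo Cafe", '
       '"geocodes": {"main": {"latitude": 51.4552, "longitude": -2.604}}}]}')
json_data = json.loads(raw)

first = json_data["results"][0]
print(first["name"])                          # Demo Cafe
print(first["geocodes"]["main"]["latitude"])  # 51.4552
```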

Next, to convert the JSON format into a table (i.e., a DataFrame), we can use a for loop. Note that there are many other ways of executing such a conversion.

In the code below, we only extract the id, name, address, and coordinates of the places.

places = []
for place in json_data['results']:
    fsq_id = place['fsq_id']
    coords = place['geocodes']['main']
    address = place['location']['formatted_address']
    name = place['name']
    places.append((fsq_id, name, address, coords))
    
places_df = pd.DataFrame(places, columns =['ID', 'Name', 'Address', 'Coords'])
places_df
ID Name Address Coords
0 4b058836f964a520ceb822e3 Boston Tea Party 75 Park St, Bristol, BS1 5PF {'latitude': 51.455219, 'longitude': -2.603972}
1 557d9e15498e3fc4dda1ec92 Sotiris Bakery 16 Park Row, Bristol, BS1 5LJ {'latitude': 51.45533, 'longitude': -2.600682}
2 517e90a0e4b0f7d859cc9578 Ross' Tea Corner {'latitude': 51.456274, 'longitude': -2.610294}
3 4b6447c0f964a5200ca82ae3 Bristol Folk House 40a Park St, Bristol, BS1 5JG {'latitude': 51.45419, 'longitude': -2.602153}
4 5e6235421381060008b18679 Caffe Clifton St Pauls Rd, Bristol, BS8 1LX {'latitude': 51.459152, 'longitude': -2.609873}
5 4ce3baeea3c6721ed2df11cf Caffe Clifton St Paul's Rd, Bristol {'latitude': 51.458696, 'longitude': -2.61096}
6 59bd2120da5ede2141070f99 Mrs. Potts Chocolate House 20 Park St, Bristol, BS1 5JA {'latitude': 51.453793, 'longitude': -2.601547}
7 4c3469ed3ffc952179da90f5 Twoday Coffee Roasters 135 St. Michaels Hill (at Highbury Villas), Br... {'latitude': 51.460572, 'longitude': -2.60179}
8 4b058837f964a520deb822e3 Woodes 18 Park St, Bristol, BS1 5JA {'latitude': 51.453685, 'longitude': -2.601516}
9 59ac098a86bc494d39e6effc 99 Queens 99 Queens Rd, Bristol, BS8 1LW {'latitude': 51.457739, 'longitude': -2.612143}
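The Coords column still holds a dictionary per row; for mapping or distance work you will usually want plain latitude/longitude columns. A minimal sketch (the two-row frame below is a stand-in for the full places_df):

```python
import pandas as pd

# Stand-in for places_df: the Coords column holds dicts as returned by the API
places_df = pd.DataFrame({
    "Name": ["Boston Tea Party", "Sotiris Bakery"],
    "Coords": [{"latitude": 51.455219, "longitude": -2.603972},
               {"latitude": 51.45533, "longitude": -2.600682}],
})

# Expand each nested dict into separate numeric columns
places_df["Latitude"] = places_df["Coords"].apply(lambda c: c["latitude"])
places_df["Longitude"] = places_df["Coords"].apply(lambda c: c["longitude"])
print(places_df[["Name", "Latitude", "Longitude"]])
```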

You can further retrieve more information about each place. One example is to extract its description (unstructured text) using the following code. Again, we send a request to the API, this time including the fsq_id in the url and passing the fields parameter.

description_list = [] 

for index, row in places_df.iterrows():
    fsq_id = row['ID']
    url = "https://api.foursquare.com/v3/places/%s" % (fsq_id)
    params ={
        "fields":"description", # you can also get "tips"
    }

    headers = {
        "accept": "application/json",
        "Authorization": "fsq3o8iIO/yZccIR5QzSI1Jmz0RX3vJ2HXCkuJVDwk9o6rQ="
    }

    response = requests.request("GET", url, params=params, headers=headers)
    description = json.loads(response.text)
    try:
        description_list.append(description['description'])
    except KeyError:
        description_list.append(None)
    

One question: why do we write a try: except: block here?
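Hint: not every place has a description; for those, the returned payload lacks the 'description' key entirely, so indexing it raises a KeyError. The try/except keeps the loop going and keeps description_list the same length as the DataFrame. A minimal sketch with two hypothetical payloads:

```python
# Two hypothetical API payloads: one has a description, one does not
payloads = [
    {"description": "Cosy cafe near the park"},  # normal response
    {"message": "field not available"},          # no 'description' key at all
]

descriptions = []
for p in payloads:
    try:
        descriptions.append(p["description"])  # raises KeyError when the key is missing
    except KeyError:
        descriptions.append(None)              # fall back to None so lengths stay aligned

print(descriptions)  # ['Cosy cafe near the park', None]
```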

Finally, we can add the extracted description to the data frame: places_df:

places_df['Description']=description_list
places_df
ID Name Address Coords Description
0 4b058836f964a520ceb822e3 Boston Tea Party 75 Park St, Bristol, BS1 5PF {'latitude': 51.455219, 'longitude': -2.603972} This is our first & flagship café. Opened in 1...
1 557d9e15498e3fc4dda1ec92 Sotiris Bakery 16 Park Row, Bristol, BS1 5LJ {'latitude': 51.45533, 'longitude': -2.600682} Specialists in Greek Pastries and Pies. Specia...
2 517e90a0e4b0f7d859cc9578 Ross' Tea Corner {'latitude': 51.456274, 'longitude': -2.610294} None
3 4b6447c0f964a5200ca82ae3 Bristol Folk House 40a Park St, Bristol, BS1 5JG {'latitude': 51.45419, 'longitude': -2.602153} None
4 5e6235421381060008b18679 Caffe Clifton St Pauls Rd, Bristol, BS8 1LX {'latitude': 51.459152, 'longitude': -2.609873} None
5 4ce3baeea3c6721ed2df11cf Caffe Clifton St Paul's Rd, Bristol {'latitude': 51.458696, 'longitude': -2.61096} None
6 59bd2120da5ede2141070f99 Mrs. Potts Chocolate House 20 Park St, Bristol, BS1 5JA {'latitude': 51.453793, 'longitude': -2.601547} None
7 4c3469ed3ffc952179da90f5 Twoday Coffee Roasters 135 St. Michaels Hill (at Highbury Villas), Br... {'latitude': 51.460572, 'longitude': -2.60179} None
8 4b058837f964a520deb822e3 Woodes 18 Park St, Bristol, BS1 5JA {'latitude': 51.453685, 'longitude': -2.601516} None
9 59ac098a86bc494d39e6effc 99 Queens 99 Queens Rd, Bristol, BS8 1LW {'latitude': 51.457739, 'longitude': -2.612143} None

Part 1-2 (Optional): Extracting (geo)text from Twitter#

Again, note that this part is optional, given that the X API (formerly Twitter) now charges fees to retrieve data. Also worth noting that the code and screenshots may change over time. This tutorial is designed as a reference for your future use.

This part explains how to extract text-based unstructured information from the social media platform Twitter via its API. A similar pipeline can be used to extract information from other types of social media/web services (e.g., Foursquare, Yelp, Flickr, Google Places, etc.).

Twitter is a useful data source for studying the social impact of events and activities. In this part, we are going to learn how to collect Twitter data using its API. Specifically, we are going to focus on geotagged Twitter data.

First, Twitter requires users of the Twitter API to be authenticated by the system. One simple way to obtain such authentication is to register a Twitter account, which is the approach we take in this tutorial.

Go to the website of Twitter: https://twitter.com/ , and click “Sign up” at the upper right corner. You can skip this step if you already have a Twitter account.

After you have registered/logged in to your Twitter account, we are going to obtain the keys that are necessary to use Twitter API. Go to https://apps.twitter.com/ , and sign in using the Twitter account you have just created (sometimes the browser will automatically sign you in).

After you have signed in, click the button “Create New App”. Then fill in the necessary information to create an App. Note that you might need to add your phone number to your Twitter account in order to do so. If you don’t like that, feel free to remove it from your account after you have finished your project.

Then you will be directed to a page (see example below) asking you for a name of your App. Give it a name that you want.

Get Keys from Twitter Developer

Click Get keys. It will then generate an API Key, API Key Secret, and Bearer Token (see below for an example). Make sure you copy and paste them into a safe place (e.g., a text editor). We need these credentials later.

Authentication Example

Next, we also need to obtain the Access Token and its secret. To do so, go to Projects & Apps –> select your App. Then click Keys and tokens, and click Generate on the right of Access Token and Secret (see below). Again, make sure you record them in a safe place; we need them later. Note that if for some reason you lose your tokens and secrets, this page is where you regenerate them.

Access Token Example

Once you have your Twitter app set-up, you are ready to access tweets in Python. Begin by importing the necessary Python libraries.

import os
import tweepy as tw
import pandas as pd

To access the Twitter API, you will need four things from your Twitter App page. These keys are located in your Twitter app settings in the Keys and Access Tokens tab.

  • api key

  • api key seceret

  • access token

  • access token secret

Below I put in my credentials. You should use yours! But remember not to share these with anyone else, because these values are specific to your App.

api_key= 'ADsWr3cgAce13adabpK0aVKh5'
api_key_secret= '9EPHVN4OdcsnizwMIu5zvNarbWfybuXMQou6Mz0WDEZehTynMz'
access_token= '1582847729486684180-FqKFjuAorSXQlLcQ8f8niWK7HyDWMN'
access_token_secret = 'TSjdYCWluaenbiXb2cxSA6rR7AxTkEPzDJ1Znd8J4225q'

With these credentials, we can next build an API object in Python that establishes the connection between this Jupyter program and Twitter:

auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

For example, now we can send tweets using your API access. Note that your tweet needs to be 280 characters or less:

# Post a tweet from Python
api.update_status("Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience")
# Your tweet has been posted!

If you go to your Twitter account and check Profile, you will see the tweet being posted! Congrats for your first post via Python!

Note that if you see errors like “453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve”, it means you need to elevate your access. What you need to do is: (1) go to Products –> Twitter API v2; (2) click the tab “Elevated” (or “Academic Research” if you need it for your dissertation later); (3) click Apply, then fill in the form (you can choose No for many of the questions). See the screenshot below for (1) and (2):

Elevate your access

Next, let’s retrieve (search) some recent tweets about #energycrisis posted in English. There will be many posts returned. To make the results easy to illustrate and to save some requests (note you have a limited number of requests via this API), we only request 5 from the list.

search_words = "#energycrisis"
tweets = tw.Cursor(api.search_tweets,
              q=search_words,
              lang="en").items(5)
tweets

Here, you see an object that you can iterate (i.e., an ItemIterator) or loop over to access the collected data. Each item in the iterator has various attributes that you can access to get information about each tweet, including:

  • the text of the tweet

  • who sent the tweet

  • the date the tweet was sent

and more. The code below loops through the object and saves the time of the tweet, the user who posted it, the text of the tweet, as well as the user location to a pandas DataFrame:

import pandas as pd

# create dataframe
columns = ['Time', 'User', 'Tweet', 'Location']

data = []
for tweet in tweets:
    data.append([tweet.created_at, tweet.user.screen_name, tweet.text, tweet.user.location])

df = pd.DataFrame(data, columns=columns)

We can further save the dataframe to a local csv file (structured data):

df.to_csv('tweets_example.csv')

Note that there is another way of writing the query to the Twitter API, which might be more intuitive to some users. For example, you can replace tweets = tw.Cursor(api.search_tweets,q=search_words,lang="en").items(5) with something like:

tweets2 = api.search_tweets(q=search_words,lang="en", count="5")
data2 = []
for tweet2 in tweets2:
    data2.append([tweet2.created_at, tweet2.user.screen_name, tweet2.text, tweet2.user.location])
df2 = pd.DataFrame(data2, columns=columns)

To learn more about the key function search_tweets(), check its webpage here. Try setting up some other parameters yourself to see what you can get.

Part 2: Basic Natural Language Processing and Geoparsing#

To extract places (or other categories) from text-based (unstructured) data, we need to do some basic Natural Language Processing (NLP), such as tokenization and Part-of-Speech analysis. All these operations can be done through the library spaCy.

Ideally, you can use the tweets you got from Part 1 for this experiment. But since the tweets you get might be very heterogeneous and noisy, here we use a clean example (you could also take one from a long news article online) to show how to use spaCy, so that all important points are covered in one example.

First make sure you have installed and imported spaCy:

import spacy

spaCy comes with pretrained NLP models that can perform most common NLP tasks, such as tokenization, parts of speech (POS) tagging, named entity recognition (NER), transforming to word vectors etc.

If you are dealing with a particular language, you can load the spacy model specific to the language using spacy.load() function. For example, we want to load the English version:

# Load small english model: https://spacy.io/models
# If you see errors like "OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.", it means your system has not installed the corresponding language models yet. 
# Use this command to install it: python -m spacy download en_core_web_sm
nlp=spacy.load("en_core_web_sm")
nlp
<spacy.lang.en.English at 0x16b0a3fb0>

This returns a Language object that comes ready with multiple built-in capabilities.

Now let’s say you have your text data in a string. What can be done to understand the structure of the text?

First, call the loaded nlp object on the text. It should return a processed Doc object.

# Parse text through the `nlp` model
my_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost"""

my_doc = nlp(my_text)
type(my_doc)
spacy.tokens.doc.Doc

Hmmm, it is a Doc object. But wait, what is exactly a Doc object?

It is a sequence of tokens that contains not just the original text but all the results produced by the spaCy model after processing the text. Useful information such as the lemma of the text, whether it is a stop word or not, named entities, the word vector of the text, and so on are pre-computed and readily stored in the Doc object.

So first, what is a token?

As you have learnt from the lecture, tokens are the individual textual entities that make up a text: typically words, punctuation, spaces, etc. Tokenization is the process of converting a text into smaller sub-texts based on certain predefined rules. For example, sentences are tokenized into words (and optionally punctuation), and paragraphs into sentences, depending on the context.
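As a rough illustration of rule-based tokenization (a simplified sketch only, not spaCy's actual algorithm), a single regular expression can already split a sentence into word and punctuation tokens:

```python
import re

def naive_tokenize(text):
    # Match either a run of word characters or a single
    # non-word, non-space character (i.e., punctuation).
    # spaCy's real tokenizer uses per-language rules and
    # exception lists; this regex is only a toy approximation.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_tokenize("The stock market crashed, causing loss of millions.")
print(tokens)
# ['The', 'stock', 'market', 'crashed', ',', 'causing', 'loss', 'of', 'millions', '.']
```

Notice how the comma and full stop become tokens of their own, which is exactly the behaviour you will see from spaCy below.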

Each token in spaCy has different attributes that tell us a great deal of information.

Let’s see the token texts on my_doc. The string, which the token represents, can be accessed through the token.text attribute.

# Printing the tokens of a doc
for token in my_doc:
  print(token.text)
The
economic
situation
of
the
country
is
on
edge
,
as
the
stock


market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment


in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
off


thousands
of
people
to
reduce
labor
cost

The above tokens contain punctuation and common words like “a”, “the”, “was”, etc. These do not add any value to the meaning of your text. They are called stop words. We can clean them up.

The token attributes allow us to clean out noisy tokens such as stop words, punctuation, and spaces. First, we show whether each token is a stop word or punctuation, and then we use this information to remove them.

# Printing tokens and boolean values stored in different attributes
for token in my_doc:
  print(token.text,'--',token.is_stop,'---',token.is_punct)
The -- True --- False
economic -- False --- False
situation -- False --- False
of -- True --- False
the -- True --- False
country -- False --- False
is -- True --- False
on -- True --- False
edge -- False --- False
, -- False --- True
as -- True --- False
the -- True --- False
stock -- False --- False

 -- False --- False
market -- False --- False
crashed -- False --- False
causing -- False --- False
loss -- False --- False
of -- True --- False
millions -- False --- False
. -- False --- True
Citizens -- False --- False
who -- True --- False
had -- True --- False
their -- True --- False
main -- False --- False
investment -- False --- False

 -- False --- False
in -- True --- False
the -- True --- False
share -- False --- False
- -- False --- True
market -- False --- False
are -- True --- False
facing -- False --- False
a -- True --- False
great -- False --- False
loss -- False --- False
. -- False --- True
Many -- True --- False
companies -- False --- False
might -- True --- False
lay -- False --- False
off -- True --- False

 -- False --- False
thousands -- False --- False
of -- True --- False
people -- False --- False
to -- True --- False
reduce -- False --- False
labor -- False --- False
cost -- False --- False
# Removing stop words and punctuations
my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct and not token.is_space]

for token in my_doc_cleaned:
  print(token.text)
economic
situation
country
edge
stock
market
crashed
causing
loss
millions
Citizens
main
investment
share
market
facing
great
loss
companies
lay
thousands
people
reduce
labor
cost

To get the POS tags of your text, use code like:

for token in my_doc_cleaned:
  print(token.text,'---- ',token.pos_)
economic ----  ADJ
situation ----  NOUN
country ----  NOUN
edge ----  NOUN
stock ----  NOUN
market ----  NOUN
crashed ----  VERB
causing ----  VERB
loss ----  NOUN
millions ----  NOUN
Citizens ----  NOUN
main ----  ADJ
investment ----  NOUN
share ----  NOUN
market ----  NOUN
facing ----  VERB
great ----  ADJ
loss ----  NOUN
companies ----  NOUN
lay ----  VERB
thousands ----  NOUN
people ----  NOUN
reduce ----  VERB
labor ----  NOUN
cost ----  NOUN

You will see each word (token) is now associated with a POS tag: a NOUN, an ADJ, a VERB, and so on. POS tags can often help us disambiguate the meaning of words (or places in GIR).

By the way, if you don’t know what “ADJ” means, you can use code like:

spacy.explain('ADJ')
'adjective'

You can also use spaCy to do some Named Entity Recognition (including place name identification or geoparsing). For instance:

text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)

for entity in doc.ents:
    print(entity.text,'--- ',entity.label_)
Tony Stark ---  PERSON
StarkEnterprises ---  ORG
Emily Clark ---  PERSON
Microsoft ---  ORG
Manchester ---  GPE
Bible ---  WORK_OF_ART
French ---  LANGUAGE

What is “GPE”?

spacy.explain('GPE')
'Countries, cities, states'

spaCy also provides a special visualization for NER through displacy. Using the displacy.render() function, you can set style='ent' to visualize the entities. For more interesting visualizations, check the displacy documentation.

# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)
Tony Stark PERSON owns the company StarkEnterprises ORG . Emily Clark PERSON works at Microsoft ORG and lives in Manchester GPE . She loves to read the Bible WORK_OF_ART and learn French LANGUAGE

So far, you have learnt the basics of retrieving information from social media like Twitter, as well as basic NLP operations and named entity recognition (geoparsing is part of it). I suggest you play with what you have learnt so far: use new data (maybe through different APIs) to experiment with these functions, change the parameters of the functions, combine these skills with what you have learnt in Tutorial 1 (e.g., geopandas), etc.

Part 3: Geocoding the recognized place names#

spaCy helps us recognize different categories of tokens in a text, including place names (with the tag GPE or LOC), but it has not yet referenced these place names to geographic locations on the surface of the earth. In this part, we will explore ways of geocoding text-based place names to geographic coordinates. There are several cool libraries/packages out there for us to use directly, some of which we will cover in this tutorial. But before that, let’s develop our own geocoding tool first. We might not use it in the future due to its simplicity, but it will help us understand the fundamentals behind those technologies, which we have highlighted in our lectures.
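Before we build anything with real data, the fundamental idea can be sketched in a few lines of plain Python: a gazetteer is essentially a lookup table from place names to coordinates. The names and coordinates below are illustrative only:

```python
# A toy gazetteer: place name -> (latitude, longitude).
# Values are approximate and for illustration only.
toy_gazetteer = {
    "Bristol": (51.4545, -2.5879),
    "Manchester": (53.4808, -2.2426),
}

def geocode(place_name, gazetteer):
    # Return the coordinates if the name is in the gazetteer, otherwise None.
    return gazetteer.get(place_name)

print(geocode("Bristol", toy_gazetteer))   # (51.4545, -2.5879)
print(geocode("Narnia", toy_gazetteer))    # None
```

Everything we do below with pandas is this same lookup, just at scale and with the messiness of real data (duplicates, ambiguous names, malformed coordinates).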

First, let’s create a variable storing the text that we want to georeference. The text below is copied from the Physical Geography section of the Wikipedia page about the UK. We then use nlp() to convert the text into a spaCy Doc object. Having this object, we can extract place names using the labels LOC and/or GPE. Here we use a for-loop to go through all the entities, keep only those that have the two location-related labels, and save them into a list called locations. Finally, we convert this list into a pandas DataFrame.

import pandas as pd
UK_physicalGeo = "The physical geography of the UK varies greatly. England consists of mostly lowland terrain, with upland or mountainous terrain only found north-west of the Tees-Exe line. The upland areas include the Lake District, the Pennines, North York Moors, Exmoor and Dartmoor. The lowland areas are typically traversed by ranges of low hills, frequently composed of chalk, and flat plains. Scotland is the most mountainous country in the UK and its physical geography is distinguished by the Highland Boundary Fault which traverses the Scottish mainland from Helensburgh to Stonehaven. The faultline separates the two distinctively different regions of the Highlands to the north and west, and the Lowlands to the south and east. The Highlands are predominantly mountainous, containing the majority of Scotland's mountainous landscape, while the Lowlands contain flatter land, especially across the Central Lowlands, with upland and mountainous terrain located at the Southern Uplands. Wales is mostly mountainous, though south Wales is less mountainous than north and mid Wales. Northern Ireland consists of mostly hilly landscape and its geography includes the Mourne Mountains as well as Lough Neagh, at 388 square kilometres (150 sq mi), the largest body of water in the UK.[12]The overall geomorphology of the UK was shaped by a combination of forces including tectonics and climate change, in particular glaciation in northern and western areas. The tallest mountain in the UK (and British Isles) is Ben Nevis, in the Grampian Mountains, Scotland. The longest river is the River Severn which flows from Wales into England. The largest lake by surface area is Lough Neagh in Northern Ireland, though Scotland's Loch Ness has the largest volume."
UK_physicalGeo_doc=nlp(UK_physicalGeo)

locations = [] 

for entity in UK_physicalGeo_doc.ents:
    if entity.label_ in ['LOC', 'GPE']:
        print(entity.text,'--- ',entity.label_)
        locations.append([entity.text, entity.label_])
locations_df = pd.DataFrame(locations, columns = ['Place Name', 'Tag'])
UK ---  GPE
England ---  GPE
the Lake District ---  LOC
North York Moors ---  GPE
Scotland ---  GPE
UK ---  GPE
Helensburgh ---  GPE
south ---  LOC
Scotland ---  GPE
Central Lowlands ---  LOC
the Southern Uplands ---  LOC
Wales ---  GPE
Wales ---  GPE
Northern Ireland ---  GPE
the Mourne Mountains ---  LOC
UK ---  GPE
UK ---  GPE
the Grampian Mountains ---  LOC
Scotland ---  GPE
the River Severn ---  LOC
Wales ---  GPE
England ---  GPE
Northern Ireland ---  GPE
Scotland ---  GPE

Note that we see many duplicates in the list. What we can do is delete those duplicates. pandas provides a very convenient function for this: drop_duplicates():

locations_df = locations_df.drop_duplicates()
locations_df
Place Name Tag
0 UK GPE
1 England GPE
2 the Lake District LOC
3 North York Moors GPE
4 Scotland GPE
6 Helensburgh GPE
7 south LOC
9 Central Lowlands LOC
10 the Southern Uplands LOC
11 Wales GPE
13 Northern Ireland GPE
14 the Mourne Mountains LOC
17 the Grampian Mountains LOC
19 the River Severn LOC

In Python, there are often many ways to achieve the same goal. To use what we have done so far as an example, you can also use the code below to achieve the same (creating the locations list). Try it yourself!

locations.extend([[entity.text, entity.label_] for entity in UK_physicalGeo_doc.ents if entity.label_ in ['LOC', 'GPE']])

After recognizing (or extracting) these place names, we next want to geocode them. First, we want to see how far we can go without using any external geocoding/geoparsing libraries.

As we discussed in the lecture, to do geocoding we need a gazetteer first. Since the example we are using is mostly about the UK, we can use the Gazetteer of British Place Names. Make sure you download the csv file into your local directory, and remember to replace the directory ../../LabData/GBPN_14062021.csv below with yours.

import pandas as pd
import geopandas as gpd

UK_gazetteer_df = pd.read_csv('../../LabData/GBPN_14062021.csv')
UK_gazetteer_df.head()
/var/folders/xg/5n3zc4sn5hlcg8zzz6ysx21m0000gq/T/ipykernel_79890/3736579507.py:4: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  UK_gazetteer_df = pd.read_csv('../../LabData/GBPN_14062021.csv')
GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type
0 1 A' Chill NG2705 57.057719 -6.500908 Argyllshire NaN NaN NaN Highland Highlands and Islands Scotland NaN Settlement
1 2 Ab Kettleby SK7223 52.800049 -0.927993 Leicestershire NaN Leicestershire Melton NaN Leicestershire England NaN Settlement
2 3 Ab Lench SP0151 52.163533 -1.980962 Worcestershire NaN Worcestershire Wychavon NaN West Mercia England NaN Settlement
3 4 Abaty Cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abbey-cwm-hir, Abbeycwmhir Settlement
4 4 Abbey-cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abaty Cwm-hir, Abbeycwmhir Settlement

This is obviously a spatial data set, as it has two columns, Lat and Lng, that represent the geographic coordinates of each entry. Next, we want to convert the data to a geopandas GeoDataFrame. To do so, we will combine the Lat and Lng columns into a new column called geometry. We also need to assign the right coordinate reference system to the data (we have covered all of this in the lectures).

If you start doing so, you will quickly hit an error saying something is wrong with the Lat column. Don’t panic! After you understand what the error indicates, you can go back to the csv file, where you will find a cell value in Lat of 53.20N. This is not a standard way of representing geographic coordinates. What we can do is simply remove that row, and then finish the transformation from DataFrame to GeoDataFrame:

UK_gazetteer_df.drop(UK_gazetteer_df[UK_gazetteer_df['Lat'] == '53.20N '].index, inplace = True)

UK_gazetteer_gpd = gpd.GeoDataFrame(
   UK_gazetteer_df , geometry=gpd.points_from_xy(UK_gazetteer_df.Lng, UK_gazetteer_df.Lat))
#UK_gazetteer_gpd.set_crs('epsg:4326')
UK_gazetteer_gpd.head()
GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type geometry
0 1 A' Chill NG2705 57.057719 -6.500908 Argyllshire NaN NaN NaN Highland Highlands and Islands Scotland NaN Settlement POINT (-6.50091 57.05772)
1 2 Ab Kettleby SK7223 52.800049 -0.927993 Leicestershire NaN Leicestershire Melton NaN Leicestershire England NaN Settlement POINT (-0.92799 52.80005)
2 3 Ab Lench SP0151 52.163533 -1.980962 Worcestershire NaN Worcestershire Wychavon NaN West Mercia England NaN Settlement POINT (-1.98096 52.16353)
3 4 Abaty Cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abbey-cwm-hir, Abbeycwmhir Settlement POINT (-3.38992 52.33102)
4 4 Abbey-cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abaty Cwm-hir, Abbeycwmhir Settlement POINT (-3.38992 52.33102)
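An alternative to deleting the one known bad row is to coerce the coordinate columns to numeric and drop every row that fails to parse; this also catches malformed values we have not spotted yet. A sketch on toy data mimicking the gazetteer:

```python
import pandas as pd

# Toy data with one malformed latitude, like the real gazetteer had.
df = pd.DataFrame({
    "PlaceName": ["A", "B", "C"],
    "Lat": ["57.05", "53.20N ", "52.80"],
    "Lng": ["-6.50", "-1.20", "-0.92"],
})

# errors='coerce' turns unparseable strings into NaN ...
df["Lat"] = pd.to_numeric(df["Lat"], errors="coerce")
df["Lng"] = pd.to_numeric(df["Lng"], errors="coerce")
# ... which we can then drop in one step.
df = df.dropna(subset=["Lat", "Lng"])
print(df["PlaceName"].tolist())  # ['A', 'C']
```

As a bonus, the columns come out as proper floats, which is what gpd.points_from_xy expects anyway.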

Now we have the geoparsed place name list and a gazetteer to extract candidate place names with their coordinates. Let’s now do a lookup matching.

Guess what? We can use the merge() (similar to join()) operations we learned in Tutorial 1 to achieve it:

locations_merged = pd.merge(locations_df, UK_gazetteer_gpd, left_on='Place Name', right_on='PlaceName', how = "left")
locations_merged
Place Name Tag GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type geometry
0 UK GPE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
1 England GPE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
2 the Lake District LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
3 North York Moors GPE 198897.0 North York Moors SE7295 54.347372 -0.886344 Yorkshire North Riding North Yorkshire Ryedale NaN North Yorkshire England NaN Downs, Moorland POINT (-0.88634 54.34737)
4 Scotland GPE 39523.0 Scotland SK3822 52.796412 -1.428229 Leicestershire NaN Leicestershire North West Leicestershire NaN Leicestershire England NaN Settlement POINT (-1.42823 52.79641)
5 Scotland GPE 39524.0 Scotland SP6798 52.57925 -1.000125 Leicestershire NaN Leicestershire Harborough NaN Leicestershire England NaN Settlement POINT (-1.00012 52.57925)
6 Scotland GPE 39525.0 Scotland SU5669 51.417258 -1.196096 Berkshire NaN NaN NaN West Berkshire Thames Valley England NaN Settlement POINT (-1.19610 51.41726)
7 Scotland GPE 39526.0 Scotland TF0030 52.8604 -0.512500 Lincolnshire Parts of Kesteven Lincolnshire South Kesteven NaN Lincolnshire England NaN Settlement POINT (-0.51250 52.86040)
8 Scotland GPE 294460.0 Scotland SE2340 53.857797 -1.641983 Yorkshire West Riding NaN NaN Leeds West Yorkshire England NaN Settlement POINT (-1.64198 53.85780)
9 Helensburgh GPE 21050.0 Helensburgh NS2982 56.003981 -4.733445 Dunbartonshire NaN NaN NaN Argyll and Bute Argyll and West Dunbartonshire Scotland Baile Eilidh Settlement POINT (-4.73344 56.00398)
10 south LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
11 Central Lowlands LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
12 the Southern Uplands LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
13 Wales GPE 47495.0 Wales SK4782 53.341077 -1.284149 Yorkshire West Riding NaN NaN Rotherham South Yorkshire England NaN Settlement POINT (-1.28415 53.34108)
14 Wales GPE 47496.0 Wales ST5824 51.02019 -2.590189 Somerset NaN Somerset South Somerset NaN Avon and Somerset England NaN Settlement POINT (-2.59019 51.02019)
15 Wales GPE 64525.0 Wales SK4782 53.341105 -1.290959 Yorkshire West Riding South Yorkshire Rotherham NaN South Yorkshire England NaN Civil Parish POINT (-1.29096 53.34110)
16 Northern Ireland GPE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
17 the Mourne Mountains LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
18 the Grampian Mountains LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
19 the River Severn LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None

Alright, as you can see, many place names, such as “North York Moors” and “Helensburgh”, are now geocoded. You will also notice that some places, like “Wales” and “Scotland”, are matched to multiple coordinates. This is because our gazetteer contains multiple records for “Wales” and “Scotland”. Note also that these are not the “Wales” and “Scotland” you are thinking of. If you search the gazetteer for rows whose PlaceName is “Wales”, for example (see below), you will find these are Settlements or Civil Parishes in England.

Plus, we also see many place names such as “UK”, “the Lake District”, “Pennines”, etc. do not find a match (their GBPNIDs are all NaN).

UK_gazetteer_gpd.loc[UK_gazetteer_gpd['PlaceName'] == "Wales"]
GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type geometry
47317 47495 Wales SK4782 53.341077 -1.284149 Yorkshire West Riding NaN NaN Rotherham South Yorkshire England NaN Settlement POINT (-1.28415 53.34108)
47318 47496 Wales ST5824 51.02019 -2.590189 Somerset NaN Somerset South Somerset NaN Avon and Somerset England NaN Settlement POINT (-2.59019 51.02019)
62356 64525 Wales SK4782 53.341105 -1.290959 Yorkshire West Riding South Yorkshire Rotherham NaN South Yorkshire England NaN Civil Parish POINT (-1.29096 53.34110)

All of these issues are due to the facts that (1) our imported gazetteer only covers places in Great Britain, and (2) our simple model is incapable of capturing the context of the text. Do you have better ideas to improve our simple geocoding tool?
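One simple improvement is to normalize the place names before matching, e.g., stripping a leading “the ” (which spaCy often includes in LOC spans such as “the Lake District”) so that the name can match a gazetteer entry like “Lake District”. The sketch below uses toy data and a hypothetical normalize() helper; case-insensitive comparison would be another easy extension:

```python
import pandas as pd

def normalize(name):
    # Drop a leading "the " and trim whitespace before matching.
    # Extend with lowercasing, punctuation stripping, etc. as needed.
    name = name.strip()
    if name.lower().startswith("the "):
        name = name[4:]
    return name

# Toy extracted places and a toy gazetteer (illustrative values only).
places = pd.DataFrame({"Place Name": ["the Lake District", "Helensburgh"]})
gazetteer = pd.DataFrame({
    "PlaceName": ["Lake District", "Helensburgh"],
    "Lat": [54.46, 56.00],
})

# Match on the normalized key instead of the raw extracted string.
places["Match Key"] = places["Place Name"].map(normalize)
merged = pd.merge(places, gazetteer, left_on="Match Key",
                  right_on="PlaceName", how="left")
print(merged[["Place Name", "Lat"]])
```

With the raw strings, “the Lake District” would have produced a NaN row, just as it did in our merge above; after normalization both names find a match.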

Now it is the right time to introduce some more “fancy” geocoding/geoparsing libraries. Let’s try spacy_dbpedia_spotlight here. Make sure you have installed it. You can reuse the nlp object from the previous steps, or create a new one as below. Here is the code for using the library:

import spacy_dbpedia_spotlight
import spacy
nlp = spacy.blank('en')
# add the pipeline stage
nlp.add_pipe('dbpedia_spotlight')
# get the document
doc = nlp(UK_physicalGeo)
# see the entities
entities_dbpedia = [(ent.text, ent.label_, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents]
entities_dbpedia[1]
('UK',
 'DBPEDIA_ENT',
 'http://dbpedia.org/resource/United_Kingdom',
 '0.9999999697674871')

What the code does is go through the text and try to find the corresponding entities in DBpedia. The result is a list of tuples; one example tuple is shown in the box above. It includes the text that is parsed; its label (notice how different it is from spaCy’s labels); its id in DBpedia (this is very useful, as all associated information can be further retrieved using this link; we will cover more about it in future lectures and tutorials); and a score from 0 to 1 (the similarity score of the string matching: the higher it is, the more similar the target text is to the candidate).

Similar to what we did before, we can then convert this list into a DataFrame. Note that since we do not have coordinates explicitly listed here, we can simply use pandas’s general DataFrame.

Check Spacy DBpedia Spotlight for more interesting functions and examples.

columns = ['Text', 'DBpedia Label', 'DBpedia URI', 'Similarity Score']

UK_physicalGeo_DBpedia = pd.DataFrame(entities_dbpedia, columns=columns)
UK_physicalGeo_DBpedia = UK_physicalGeo_DBpedia.drop_duplicates()
UK_physicalGeo_DBpedia
Text DBpedia Label DBpedia URI Similarity Score
0 physical geography DBPEDIA_ENT http://dbpedia.org/resource/Physical_geography 1.0
1 UK DBPEDIA_ENT http://dbpedia.org/resource/United_Kingdom 0.9999999697674871
2 England DBPEDIA_ENT http://dbpedia.org/resource/England 0.9999997958868998
3 Lake District DBPEDIA_ENT http://dbpedia.org/resource/Lake_District 1.0
4 Moors DBPEDIA_ENT http://dbpedia.org/resource/Moorland 1.0
5 Exmoor DBPEDIA_ENT http://dbpedia.org/resource/Exmoor 1.0
6 Dartmoor DBPEDIA_ENT http://dbpedia.org/resource/Dartmoor 1.0
7 chalk DBPEDIA_ENT http://dbpedia.org/resource/Chalk 0.999982297417919
8 Scotland DBPEDIA_ENT http://dbpedia.org/resource/Scotland 1.0
11 Highland Boundary Fault DBPEDIA_ENT http://dbpedia.org/resource/Highland_Boundary_... 1.0
12 Scottish DBPEDIA_ENT http://dbpedia.org/resource/Scotland 0.9997720847861457
13 Helensburgh DBPEDIA_ENT http://dbpedia.org/resource/Helensburgh 1.0
14 Stonehaven DBPEDIA_ENT http://dbpedia.org/resource/Stonehaven 1.0
15 faultline DBPEDIA_ENT http://dbpedia.org/resource/Faultline_(musician) 0.9838808091555019
17 Central Lowlands DBPEDIA_ENT http://dbpedia.org/resource/Central_Lowlands 1.0
18 Southern Uplands DBPEDIA_ENT http://dbpedia.org/resource/Southern_Uplands 1.0
19 Wales DBPEDIA_ENT http://dbpedia.org/resource/Wales 0.9999999999998863
22 Northern Ireland DBPEDIA_ENT http://dbpedia.org/resource/Northern_Ireland 0.999999996766519
23 geography DBPEDIA_ENT http://dbpedia.org/resource/Geography 0.9947401620497708
24 Mourne Mountains DBPEDIA_ENT http://dbpedia.org/resource/Mourne_Mountains 1.0
25 Lough Neagh DBPEDIA_ENT http://dbpedia.org/resource/Lough_Neagh 1.0
26 geomorphology DBPEDIA_ENT http://dbpedia.org/resource/Geomorphology 1.0
28 climate change DBPEDIA_ENT http://dbpedia.org/resource/Global_warming 0.9859722723298184
29 glaciation DBPEDIA_ENT http://dbpedia.org/resource/Glacial_period 0.9861545121455497
31 British Isles DBPEDIA_ENT http://dbpedia.org/resource/British_Isles 1.0
32 Nevis DBPEDIA_ENT http://dbpedia.org/resource/River_Nevis 0.9999978752268652
33 Grampian Mountains DBPEDIA_ENT http://dbpedia.org/resource/Grampian_Mountains 1.0
35 River Severn DBPEDIA_ENT http://dbpedia.org/resource/River_Severn 1.0
38 lake DBPEDIA_ENT http://dbpedia.org/resource/Lake 0.999999999890747
42 Loch Ness DBPEDIA_ENT http://dbpedia.org/resource/Loch_Ness 1.0

It seems that the coordinates are not extracted by the tool. But if you open the URL, for example https://dbpedia.org/page/Lake_District, you will see full information about the place, including its geometry and more! See below for instance: DBpedia-Point

If you wish to extract the geometry of the identified geographic entities automatically in Python, below is an example function. Here we use the SPARQLWrapper library to send a SPARQL query (i.e., a request) to an endpoint (i.e., a server; here we use DBpedia’s http://dbpedia.org/sparql as an example), and the endpoint responds by sending the queried information back to your Python program.

Also note that the process is exactly the same as using the SQL query language. The only difference is that SQL is for tabular data, while SPARQL is for knowledge graphs, which we will discuss in the near future. To learn SQL in Python, I suggest this post, and recommend checking out the sqlite3 library.

The syntax of SQL and SPARQL is quite similar. The main reasons to focus on SPARQL here are: (1) SPARQL is more advanced; (2) many knowledge graphs are open, so we can readily access their endpoints. In contrast, we must first build a relational database (using MySQL, Postgres, etc.) in order to use SQL in Python.
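For comparison, here is what the same kind of name-to-coordinates lookup looks like in SQL via the built-in sqlite3 library. This is an in-memory toy database with made-up rows, purely to show the parallel between SQL's SELECT ... WHERE and the SPARQL SELECT ... WHERE used in this section:

```python
import sqlite3

# Build an in-memory database with a tiny gazetteer-like table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gazetteer (name TEXT, lat REAL, lng REAL)")
conn.executemany(
    "INSERT INTO gazetteer VALUES (?, ?, ?)",
    [("Helensburgh", 56.003981, -4.733445),
     ("Stonehaven", 56.964, -2.211)],
)

# SELECT ... WHERE over a table, just as SPARQL selects over a graph.
row = conn.execute(
    "SELECT lat, lng FROM gazetteer WHERE name = ?", ("Helensburgh",)
).fetchone()
print(row)  # (56.003981, -4.733445)
conn.close()
```

The structural difference is what you query over: SQL matches rows against table columns, while SPARQL matches triple patterns (subject, predicate, object) against a graph.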

from SPARQLWrapper import SPARQLWrapper, JSON

def get_coordinates(place_url):
    
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    
    query = """PREFIX db: <http://dbpedia.org/resource/>
    PREFIX georss: <http://www.georss.org/georss/>
    SELECT ?geom WHERE { <%s> georss:point ?geom. } """%place_url

    sparql.setQuery(query) 
    return sparql.query().convert()

example = get_coordinates("http://dbpedia.org/resource/Dartmoor")
example
{'head': {'link': [], 'vars': ['geom']},
 'results': {'distinct': False,
  'ordered': True,
  'bindings': [{'geom': {'type': 'literal',
     'value': '50.56666666666667 -4.0'}}]}}

The result is formatted as JSON (check this documentation if you have not heard of it). We can easily extract the attribute geom from it:

geom = example['results']['bindings'][0]['geom']['value']
geom
'50.56666666666667 -4.0'
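The returned value is just a space-separated string; a small helper (hypothetical, named here for illustration) can split it into floats. Note that georss:point stores latitude first, then longitude:

```python
def parse_georss_point(geom):
    # georss:point is "latitude longitude" separated by whitespace.
    lat_str, lon_str = geom.split()
    return float(lat_str), float(lon_str)

lat, lon = parse_georss_point('50.56666666666667 -4.0')
print(lat, lon)  # 50.56666666666667 -4.0
```

From here you could build a shapely Point (remembering that Point takes x=longitude, y=latitude) and append the geocoded DBpedia entities to a GeoDataFrame, as we did with the gazetteer earlier.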

Finally, even though spacy_dbpedia_spotlight has its advantages for geoparsing, there are still flaws in such a tool. For example, many non-spatial entities are also detected, and coordinates are not explicitly shown (e.g., Moors). Do you have any ideas for addressing these issues? (Hint: maybe check this library - pyDBpedia, which also helps you access the data shown in DBpedia.)

Now, you can also try to use it on your extracted tweets, or other texts you get from the Internet.