Lab 02: Georeferencing Location-based Social Media#

In this tutorial, we will learn:

  • How to extract information (e.g., tweets) from location-based social media (e.g., Twitter)

  • How to identify locational information (e.g., place name) from text-based data (e.g., tweets, newspapers)

  • How to reference the identified locations to metric coordinates on the surface of the earth

Several libraries/packages are needed in this tutorial. Use pip or conda to install them:

  • tweepy: the library used to access the Twitter API

  • spaCy: the library used for natural language processing

  • spacy-dbpedia-spotlight: a small library that links entities recognized by spaCy to DBpedia entities.
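For example, a minimal installation from a notebook cell might look like the sketch below (pip is assumed here; conda equivalents work too, and the small English spaCy model used later is downloaded separately):

# Install the packages used in this tutorial (run once)
!pip install tweepy spacy spacy-dbpedia-spotlight
# Download the small English model used by spaCy later in this tutorial
!python -m spacy download en_core_web_sm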

Part 1: Extracting (geo)text from Twitter#

This part explains how to extract text-based unstructured information from the social media platform Twitter via its API. A similar pipeline can be used to extract information from other types of social media/Web services (e.g., Foursquare, Yelp, Flickr, etc.).

Twitter is a useful data source for studying the social impacts of events and activities. In this part, we are going to learn how to collect Twitter data using its API. Specifically, we are going to focus on geotagged Twitter data.

First, Twitter requires users of the Twitter API to be authenticated by the system. One simple approach to obtaining such authentication is to register a Twitter account. This is the approach we are going to take in this tutorial.

Go to the website of Twitter: https://twitter.com/ , and click “Sign up” at the upper right corner. You can skip this step if you already have a Twitter account.

After you have registered/logged in to your Twitter account, we are going to obtain the keys that are necessary to use Twitter API. Go to https://apps.twitter.com/ , and sign in using the Twitter account you have just created (sometimes the browser will automatically sign you in).

After you have signed in, click the button “Create New App”, then fill in the necessary information to create an App. Note that you might need to add your phone number to your Twitter account in order to do so. If you don’t like that, feel free to remove your phone number from your account after you have finished your project.

Then you will be directed to a page (see example below) asking for a name for your App. Give it any name you want.

Get Keys from Twitter Developer

Click Get keys. It will then generate an API Key, API Key Secret, and Bearer Token (see below for an example). Make sure you copy and paste them into a safe place (e.g., a text editor). We will need these credentials later.

Authentication Example

Next, we also need to obtain the Access Token and its secret. To do so, go to Projects & Apps –> select your App. Then click Keys and tokens, and click Generate to the right of Access Token and Secret (see below). Again, make sure you record them in a safe place; we will need them later. Note that if for some reason you lose your tokens and secrets, this page is where you regenerate them.

Access Token Example

Once you have your Twitter app set up, you are ready to access tweets in Python. Begin by importing the necessary Python libraries.

import os
import tweepy as tw
import pandas as pd

To access the Twitter API, you will need four things from your Twitter App page. These keys are located in your Twitter app settings in the Keys and Access Tokens tab.

  • api key

  • api key secret

  • access token

  • access token secret

Below I put in my credentials. You should use yours! Remember not to share these with anyone else, because these values are specific to your App.

api_key= '5TX6isrDz92kOC1s7qsTFWq5F'
api_key_secret= 'DL4Gw2WLNo2bK538lL5GeNYCtiwlsuYUHOlW8NCSQszK3ac101'
access_token= '1582847729486684180-VH5N9AEb2zyyFyLOj5BuD8I9ca0ils'
access_token_secret = 'e0cvkxJz9AWvu9dq0Fb48r61vkmIaA1JqLLyhEhms5FGt'

With these credentials, we can next build an API object in Python that establishes the connection between this Jupyter program and Twitter:

auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

For example, we can now post a tweet using this API access. Note that your tweet needs to be 280 characters or less:

# Post a tweet from Python
api.update_status("Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience")
# Your tweet has been posted!
Status(_api=<tweepy.api.API object at 0x7fb1787dab20>, _json={'created_at': 'Thu Nov 10 11:50:08 +0000 2022', 'id': 1590673155726929923, 'id_str': '1590673155726929923', 'text': "Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience", 'truncated': False, 'entities': {'hashtags': [{'text': 'DataScience', 'indices': [92, 104]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="https://ruizhugeographer.com/" rel="nofollow">GEOGM0068-Zhu</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1582847729486684180, 'id_str': '1582847729486684180', 'name': 'Richard Chu', 'screen_name': 'GEOGM0068', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 1, 'friends_count': 1, 'listed_count': 0, 'created_at': 'Wed Oct 19 21:35:04 +0000 2022', 'favourites_count': 0, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 4, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': True, 'default_profile': True, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none', 'withheld_in_countries': []}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'favorited': False, 'retweeted': False, 'lang': 'en'}, created_at=datetime.datetime(2022, 11, 10, 11, 50, 8, tzinfo=datetime.timezone.utc), id=1590673155726929923, id_str='1590673155726929923', text="Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. 
#DataScience", truncated=False, entities={'hashtags': [{'text': 'DataScience', 'indices': [92, 104]}], 'symbols': [], 'user_mentions': [], 'urls': []}, source='GEOGM0068-Zhu', source_url='https://ruizhugeographer.com/', in_reply_to_status_id=None, in_reply_to_status_id_str=None, in_reply_to_user_id=None, in_reply_to_user_id_str=None, in_reply_to_screen_name=None, author=User(_api=<tweepy.api.API object at 0x7fb1787dab20>, _json={'id': 1582847729486684180, 'id_str': '1582847729486684180', 'name': 'Richard Chu', 'screen_name': 'GEOGM0068', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 1, 'friends_count': 1, 'listed_count': 0, 'created_at': 'Wed Oct 19 21:35:04 +0000 2022', 'favourites_count': 0, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 4, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': True, 'default_profile': True, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none', 'withheld_in_countries': []}, id=1582847729486684180, id_str='1582847729486684180', name='Richard Chu', screen_name='GEOGM0068', location='', description='', url=None, entities={'description': {'urls': []}}, protected=False, followers_count=1, friends_count=1, listed_count=0, created_at=datetime.datetime(2022, 10, 19, 21, 35, 4, tzinfo=datetime.timezone.utc), favourites_count=0, utc_offset=None, time_zone=None, geo_enabled=False, verified=False, statuses_count=4, lang=None, contributors_enabled=False, is_translator=False, is_translation_enabled=False, profile_background_color='F5F8FA', profile_background_image_url=None, profile_background_image_url_https=None, profile_background_tile=False, profile_image_url='http://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', profile_image_url_https='https://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', profile_link_color='1DA1F2', profile_sidebar_border_color='C0DEED', profile_sidebar_fill_color='DDEEF6', profile_text_color='333333', profile_use_background_image=True, has_extended_profile=True, default_profile=True, default_profile_image=False, following=False, follow_request_sent=False, notifications=False, translator_type='none', withheld_in_countries=[]), user=User(_api=<tweepy.api.API object at 0x7fb1787dab20>, _json={'id': 1582847729486684180, 'id_str': '1582847729486684180', 'name': 'Richard Chu', 'screen_name': 'GEOGM0068', 'location': '', 'description': '', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 1, 'friends_count': 1, 'listed_count': 0, 'created_at': 'Wed Oct 19 21:35:04 +0000 2022', 'favourites_count': 0, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 4, 'lang': None, 'contributors_enabled': False, 
'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': True, 'default_profile': True, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none', 'withheld_in_countries': []}, id=1582847729486684180, id_str='1582847729486684180', name='Richard Chu', screen_name='GEOGM0068', location='', description='', url=None, entities={'description': {'urls': []}}, protected=False, followers_count=1, friends_count=1, listed_count=0, created_at=datetime.datetime(2022, 10, 19, 21, 35, 4, tzinfo=datetime.timezone.utc), favourites_count=0, utc_offset=None, time_zone=None, geo_enabled=False, verified=False, statuses_count=4, lang=None, contributors_enabled=False, is_translator=False, is_translation_enabled=False, profile_background_color='F5F8FA', profile_background_image_url=None, profile_background_image_url_https=None, profile_background_tile=False, profile_image_url='http://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', profile_image_url_https='https://pbs.twimg.com/profile_images/1582847827062972431/-8c67lsJ_normal.png', profile_link_color='1DA1F2', profile_sidebar_border_color='C0DEED', profile_sidebar_fill_color='DDEEF6', profile_text_color='333333', profile_use_background_image=True, has_extended_profile=True, default_profile=True, default_profile_image=False, following=False, follow_request_sent=False, notifications=False, translator_type='none', withheld_in_countries=[]), geo=None, coordinates=None, place=None, contributors=None, is_quote_status=False, retweet_count=0, favorite_count=0, favorited=False, retweeted=False, lang='en')

If you go to your Twitter account and check your Profile, you will see the tweet has been posted. Congrats on your first post via Python!
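Since the 280-character limit applies, it can also be helpful to check the length of your message before posting. Below is a minimal, illustrative sketch (note that Twitter counts some content, such as links, differently, so len() is only an approximation):

# A simple guard against over-long tweets (illustrative only)
message = "Hello again Twitter, this time with a length check. #DataScience"
if len(message) <= 280:
    api.update_status(message)
else:
    print(f"Tweet is too long ({len(message)} characters); please shorten it.")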

Note that if you see an error like “453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve”, it means you need to elevate your access. What you need to do is: (1) go to Products –> Twitter API v2; (2) click the tab “Elevated” (or “Academic Research” if you need it for your dissertation later); (3) click Apply, then fill in the form (you can choose No for many of the questions). See the screenshot below for (1) and (2):

Elevate your access

Next, let’s retrieve (search) some recent tweets about #energycrisis posted in English. Many posts could be returned. To make the illustration easy and to save requests (note you have a limited number of requests via this API), we only request 5 from the list.

search_words = "#energycrisis"
tweets = tw.Cursor(api.search_tweets,
              q=search_words,
              lang="en").items(5)
tweets
<tweepy.cursor.ItemIterator at 0x7fb198ad6400>

Here, you see an object that you can iterate (i.e. an ItemIterator) or loop over to access the data collected. Each item in the iterator has various attributes that you can access to get information about each tweet, including:

  • the text of the tweet

  • who sent the tweet

  • the date the tweet was sent

and more. The code below loops through the object and saves the time of the tweet, the user who posted it, the text of the tweet, as well as the user location to a pandas DataFrame:

import pandas as pd

# create dataframe
columns = ['Time', 'User', 'Tweet', 'Location']

data = []
for tweet in tweets:
    data.append([tweet.created_at, tweet.user.screen_name, tweet.text, tweet.user.location])

df = pd.DataFrame(data, columns=columns)
df
Time User Tweet Location
0 2022-11-10 11:49:20+00:00 shirleyyoung2 RT @secularmac: Isn't this nice?\n#EnergyCrisi... Fife, Scotland
1 2022-11-10 11:48:09+00:00 Q_Review_ How does living in the cold affect your health...
2 2022-11-10 11:46:15+00:00 Bellona_EU RT @jonashelseth: Encouraging!\nEU 🇪🇺 consumer... Brussels, Belgium
3 2022-11-10 11:46:02+00:00 Bellona_EU RT @LoviMarta: Renewables &amp; heat pumps are... Brussels, Belgium
4 2022-11-10 11:44:51+00:00 jonashelseth Encouraging!\nEU 🇪🇺 consumer organisation @BEU... Brussels, Belgium

We can further save the dataframe to a local csv file (structured data):

df.to_csv('tweets_example.csv')

Note that there is another way of writing the query to the Twitter API, which might be more intuitive to some users. For example, you can replace tweets = tw.Cursor(api.search_tweets,q=search_words,lang="en").items(5) with something like:

tweets2 = api.search_tweets(q=search_words,lang="en", count="5")
data2 = []
for tweet2 in tweets2:
    data2.append([tweet2.created_at, tweet2.user.screen_name, tweet2.text, tweet2.user.location])
df2 = pd.DataFrame(data2, columns=columns)
df2
Time User Tweet Location
0 2022-11-10 11:49:20+00:00 shirleyyoung2 RT @secularmac: Isn't this nice?\n#EnergyCrisi... Fife, Scotland
1 2022-11-10 11:48:09+00:00 Q_Review_ How does living in the cold affect your health...
2 2022-11-10 11:46:15+00:00 Bellona_EU RT @jonashelseth: Encouraging!\nEU 🇪🇺 consumer... Brussels, Belgium
3 2022-11-10 11:46:02+00:00 Bellona_EU RT @LoviMarta: Renewables &amp; heat pumps are... Brussels, Belgium
4 2022-11-10 11:44:51+00:00 jonashelseth Encouraging!\nEU 🇪🇺 consumer organisation @BEU... Brussels, Belgium

To learn more about the key function search_tweets(), check its webpage here. Please try setting some other parameters yourself to see what you can get.
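For instance, the sketch below adds a few more options: excluding retweets with the -filter:retweets operator, restricting results to an area with the geocode parameter (lat,long,radius), and requesting the full, untruncated text with tweet_mode="extended". The coordinates here are just an example around Bristol; adjust them to your own area of interest:

# Search recent English tweets about #energycrisis near Bristol, excluding retweets
tweets3 = api.search_tweets(q="#energycrisis -filter:retweets",
                            lang="en",
                            geocode="51.4545,-2.5879,50km",  # lat,long,radius
                            tweet_mode="extended",
                            count=5)
for tweet3 in tweets3:
    # with tweet_mode="extended", the text is stored in full_text rather than text
    print(tweet3.created_at, tweet3.full_text[:80])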

Part 2: Basic Natural Language Processing and Geoparsing#

To extract places (or other categories) from text-based (unstructured) data, we need to do some basic Natural Language Processing (NLP), such as tokenization and Part-of-Speech analysis. All these operations can be done through the library spaCy.

Ideally, you can use the tweets you got from Part 1 for the experiment. But since the tweets you get can be very heterogeneous and noisy, here we use a clean example (you could also take a long news article from the web) to show how to use spaCy, so that all the knowledge points are covered in one example.

First make sure you have installed and imported spaCy:

import spacy

spaCy comes with pretrained NLP models that can perform most common NLP tasks, such as tokenization, parts of speech (POS) tagging, named entity recognition (NER), transforming to word vectors etc.

If you are dealing with a particular language, you can load the spaCy model specific to that language using the spacy.load() function. For example, here we load the small English model:

# Load small english model: https://spacy.io/models
nlp=spacy.load("en_core_web_sm")
nlp
If, instead of a Language object, you see a warning like “[W031] Model 'en_core_web_sm' (3.4.0) requires spaCy v3.4 and is incompatible with the current spaCy version (2.3.8)”, or an error ending in “AttributeError: type object 'Language' has no attribute 'factory'”, your installed spaCy version does not match the downloaded model and/or spacy-dbpedia-spotlight. In that case, upgrade spaCy and reinstall the model (for example, pip install -U spacy spacy-dbpedia-spotlight followed by python -m spacy download en_core_web_sm), restart the kernel, and run the cell again.

This returns a Language object that comes ready with multiple built-in capabilities.

Now let’s say you have your text data in a string. What can be done to understand the structure of the text?

First, call the loaded nlp object on the text. It should return a processed Doc object.

# Parse text through the `nlp` model
my_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost"""

my_doc = nlp(my_text)
type(my_doc)
spacy.tokens.doc.Doc

Hmmm, it is a Doc object. But wait, what exactly is a Doc object?

It is a sequence of tokens that contains not just the original text but all the results produced by the spaCy model after processing the text. Useful information such as the lemma of the text, whether it is a stop word or not, named entities, the word vector of the text and so on are pre-computed and readily stored in the Doc object.

So first, what is a token?

As you have learnt from the lecture, tokens are the individual textual units that make up a text. Typically a token can be a word, a punctuation mark, a space, etc. Tokenization is the process of splitting a text into smaller sub-texts based on certain predefined rules. For example, sentences are tokenized into words (and optionally punctuation), and paragraphs into sentences, depending on the context.

Each token in spaCy has different attributes that tell us a great deal of information.

Let’s look at the token texts of my_doc. The string that a token represents can be accessed through the token.text attribute.

# Printing the tokens of a doc
for token in my_doc:
  print(token.text)
The
economic
situation
of
the
country
is
on
edge
,
as
the
stock


market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment


in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
off


thousands
of
people
to
reduce
labor
cost

The above tokens contain punctuation and common words like “a”, “the”, “was”, etc. These do not add any value to the meaning of your text; they are called stop words. We can clean them up.

The attributes of the tokens allow us to clean out noisy tokens such as stop words, punctuation, and spaces. First, we show whether each token is a stop word or punctuation, and then we use this information to remove them.

# Printing tokens and boolean values stored in different attributes
for token in my_doc:
  print(token.text,'--',token.is_stop,'---',token.is_punct)
The -- True --- False
economic -- False --- False
situation -- False --- False
of -- True --- False
the -- True --- False
country -- False --- False
is -- True --- False
on -- True --- False
edge -- False --- False
, -- False --- True
as -- True --- False
the -- True --- False
stock -- False --- False

 -- False --- False
market -- False --- False
crashed -- False --- False
causing -- False --- False
loss -- False --- False
of -- True --- False
millions -- False --- False
. -- False --- True
Citizens -- False --- False
who -- True --- False
had -- True --- False
their -- True --- False
main -- False --- False
investment -- False --- False

 -- False --- False
in -- True --- False
the -- True --- False
share -- False --- False
- -- False --- True
market -- False --- False
are -- True --- False
facing -- False --- False
a -- True --- False
great -- False --- False
loss -- False --- False
. -- False --- True
Many -- True --- False
companies -- False --- False
might -- True --- False
lay -- False --- False
off -- True --- False

 -- False --- False
thousands -- False --- False
of -- True --- False
people -- False --- False
to -- True --- False
reduce -- False --- False
labor -- False --- False
cost -- False --- False
# Removing stop words, punctuation, and spaces
my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct and not token.is_space]

for token in my_doc_cleaned:
  print(token.text)
economic
situation
country
edge
stock
market
crashed
causing
loss
millions
Citizens
main
investment
share
market
facing
great
loss
companies
lay
thousands
people
reduce
labor
cost

To get the POS tags of your text, you can use code like:

for token in my_doc_cleaned:
  print(token.text,'---- ',token.pos_)
economic ----  ADJ
situation ----  NOUN
country ----  NOUN
edge ----  NOUN
stock ----  NOUN
market ----  NOUN
crashed ----  VERB
causing ----  VERB
loss ----  NOUN
millions ----  NOUN
Citizens ----  NOUN
main ----  ADJ
investment ----  NOUN
share ----  NOUN
market ----  NOUN
facing ----  VERB
great ----  ADJ
loss ----  NOUN
companies ----  NOUN
lay ----  VERB
thousands ----  NOUN
people ----  NOUN
reduce ----  VERB
labor ----  NOUN
cost ----  NOUN

You will see that each word (token) is now associated with a POS tag, whether it is a noun, an adjective, a verb, and so on. POS tags can often help us disambiguate the meaning of words (or place names in GIR).
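As a tiny illustration of this (assuming the same nlp model loaded above), the same surface form can receive different tags depending on its context; here the first “reading” should be tagged as a verb, while the capitalised town name “Reading” should be tagged as a proper noun:

# "reading" as an activity vs. "Reading" the English town
for token in nlp("I was reading a newspaper on the train to Reading"):
  print(token.text, '---- ', token.pos_)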

Btw, if you don’t know what “ADJ” means, you can use code like:

spacy.explain('ADJ')
'adjective'

You can also use spaCy to do some Named Entity Recognition (including place name identification, i.e., geoparsing). For instance:

text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)

for entity in doc.ents:
    print(entity.text,'--- ',entity.label_)
Tony Stark ---  PERSON
StarkEnterprises ---  ORG
Emily Clark ---  PERSON
Microsoft ---  ORG
Manchester ---  GPE
Bible ---  WORK_OF_ART
French ---  NORP

What is “GPE”?

spacy.explain('GPE')
'Countries, cities, states'

spaCy also provides a special visualization for NER through displacy. Using the displacy.render() function, you can set style='ent' to visualize the entities.

# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)
Tony Stark PERSON owns the company StarkEnterprises ORG . Emily Clark PERSON works at Microsoft ORG and lives in Manchester GPE . She loves to read the Bible WORK_OF_ART and learn French NORP

So far, you have learnt the basics of retrieving information from social media like Twitter, as well as basic NLP operations and named entity recognition (geoparsing is part of it). I suggest you play with what you have learnt so far: use new data to experiment with these functions, change the function parameters, combine these skills with what you learnt in Tutorial 1 (e.g., geopandas), and so on.
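For example, one way to combine the two parts is to run the spaCy pipeline over the tweets collected in Part 1 and keep the place-like entities. A rough sketch, assuming the df DataFrame from Part 1 and the nlp model loaded above are still available:

# Extract GPE/LOC entities from each tweet collected in Part 1
def extract_places(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ in ['GPE', 'LOC']]

df['Places'] = df['Tweet'].apply(extract_places)
df[['Tweet', 'Places']]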

Part 3: Geocoding the recognized place names#

spaCy helps us recognize different categories of entities in a text, including place names (with the tag GPE or LOC), but it has not yet referenced these place names to geographic locations on the surface of the earth. In this part, we will explore ways of geocoding text-based place names to geographic coordinates. There are several cool libraries/packages out there for us to use directly, some of which we will cover in this tutorial. But before that, let’s develop our own geocoding tool first. We might not use it in the future due to its simplicity, but it will help us understand the fundamentals behind those technologies, which we have highlighted in our lectures.

First, let’s create a variable storing the text that we want to georeference. The text below is copied from the Physical Geography section of the Wikipedia page about the UK. We then use nlp() to convert the text into a Doc object defined by spaCy. Having this object, we can extract place names using the labels LOC and GPE. Here we use a for-loop to go through all the entities, keep only those that have the two location-related labels, and save them into a list locations. Then we transfer this list into a pandas DataFrame.

UK_physicalGeo = "The physical geography of the UK varies greatly. England consists of mostly lowland terrain, with upland or mountainous terrain only found north-west of the Tees-Exe line. The upland areas include the Lake District, the Pennines, North York Moors, Exmoor and Dartmoor. The lowland areas are typically traversed by ranges of low hills, frequently composed of chalk, and flat plains. Scotland is the most mountainous country in the UK and its physical geography is distinguished by the Highland Boundary Fault which traverses the Scottish mainland from Helensburgh to Stonehaven. The faultline separates the two distinctively different regions of the Highlands to the north and west, and the Lowlands to the south and east. The Highlands are predominantly mountainous, containing the majority of Scotland's mountainous landscape, while the Lowlands contain flatter land, especially across the Central Lowlands, with upland and mountainous terrain located at the Southern Uplands. Wales is mostly mountainous, though south Wales is less mountainous than north and mid Wales. Northern Ireland consists of mostly hilly landscape and its geography includes the Mourne Mountains as well as Lough Neagh, at 388 square kilometres (150 sq mi), the largest body of water in the UK.[12]The overall geomorphology of the UK was shaped by a combination of forces including tectonics and climate change, in particular glaciation in northern and western areas. The tallest mountain in the UK (and British Isles) is Ben Nevis, in the Grampian Mountains, Scotland. The longest river is the River Severn which flows from Wales into England. The largest lake by surface area is Lough Neagh in Northern Ireland, though Scotland's Loch Ness has the largest volume."
UK_physicalGeo_doc=nlp(UK_physicalGeo)

locations = [] 

for entity in UK_physicalGeo_doc.ents:
    if entity.label_ in ['LOC', 'GPE']:
        print(entity.text,'--- ',entity.label_)
        locations.append([entity.text, entity.label_])
locations_df = pd.DataFrame(locations, columns = ['Place Name', 'Tag'])
UK ---  GPE
the Lake District ---  LOC
Pennines ---  LOC
North York Moors ---  GPE
Scotland ---  GPE
UK ---  GPE
Helensburgh ---  GPE
Scotland ---  GPE
Wales ---  GPE
Wales ---  GPE
Northern Ireland ---  GPE
the Mourne Mountains ---  LOC
UK ---  GPE
UK ---  GPE
the Grampian Mountains ---  LOC
Scotland ---  GPE
Wales ---  GPE
England ---  GPE
Northern Ireland ---  GPE
Scotland ---  GPE

Note that we see many duplicates in the list. What we can do is delete those duplicates; pandas provides a very convenient function for this: drop_duplicates():

locations_df = locations_df.drop_duplicates()
locations_df
Place Name Tag
0 UK GPE
1 the Lake District LOC
2 Pennines LOC
3 North York Moors GPE
4 Scotland GPE
6 Helensburgh GPE
8 Wales GPE
10 Northern Ireland GPE
11 the Mourne Mountains LOC
14 the Grampian Mountains LOC
17 England GPE

In Python, there are often many ways to achieve the same goal. Using what we have done so far as an example, you can also use the code below to achieve the same result (creating the locations list). Try it yourself!

locations.extend([[entity.text, entity.label_] for entity in UK_physicalGeo_doc.ents if entity.label_ in ['LOC', 'GPE']])

After recognizing these place names, we next want to geocode them. First, let's see how far we can go without using any external geocoding/geoparsing libraries.

As we discussed in the lecture, to do geocoding we first need a gazetteer. Since the example we are using is mostly about the UK, we can use the Gazetteer of British Place Names. Make sure you have downloaded the csv file into your local directory, and remember to replace the path ../../LabData/GBPN_14062021.csv below with yours.

import pandas as pd
import geopandas as gpd

UK_gazetteer_df = pd.read_csv('../../LabData/GBPN_14062021.csv')
UK_gazetteer_df.head()
/var/folders/xg/5n3zc4sn5hlcg8zzz6ysx21m0000gq/T/ipykernel_3617/3736579507.py:4: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  UK_gazetteer_df = pd.read_csv('../../LabData/GBPN_14062021.csv')
GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type
0 1 A' Chill NG2705 57.057719 -6.500908 Argyllshire NaN NaN NaN Highland Highlands and Islands Scotland NaN Settlement
1 2 Ab Kettleby SK7223 52.800049 -0.927993 Leicestershire NaN Leicestershire Melton NaN Leicestershire England NaN Settlement
2 3 Ab Lench SP0151 52.163533 -1.980962 Worcestershire NaN Worcestershire Wychavon NaN West Mercia England NaN Settlement
3 4 Abaty Cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abbey-cwm-hir, Abbeycwmhir Settlement
4 4 Abbey-cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abaty Cwm-hir, Abbeycwmhir Settlement

This is obviously a spatial data set, so we want to represent it as a GeoDataFrame. To do so, we will convert the Lat and Lng columns into a new geometry column. We also need to assign a coordinate reference system to the data.

If you start doing so, you will quickly run into an error saying something is wrong with the Lat column. Don’t panic! Once you understand what the error indicates, you can go back to the csv file, where you will find a cell in the Lat column with the value 53.20N. This is not a standard way of representing geographic coordinates. What we can do is simply remove that row, then finish the transformation from DataFrame to GeoDataFrame:

UK_gazetteer_df.drop(UK_gazetteer_df[UK_gazetteer_df['Lat'] == '53.20N '].index, inplace = True)

UK_gazetteer_gpd = gpd.GeoDataFrame(
   UK_gazetteer_df, geometry=gpd.points_from_xy(UK_gazetteer_df.Lng, UK_gazetteer_df.Lat))
# set_crs returns a new GeoDataFrame by default, so assign the result back
UK_gazetteer_gpd = UK_gazetteer_gpd.set_crs('epsg:4326')
UK_gazetteer_gpd.head()
GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type geometry
0 1 A' Chill NG2705 57.057719 -6.500908 Argyllshire NaN NaN NaN Highland Highlands and Islands Scotland NaN Settlement POINT (-6.50091 57.05772)
1 2 Ab Kettleby SK7223 52.800049 -0.927993 Leicestershire NaN Leicestershire Melton NaN Leicestershire England NaN Settlement POINT (-0.92799 52.80005)
2 3 Ab Lench SP0151 52.163533 -1.980962 Worcestershire NaN Worcestershire Wychavon NaN West Mercia England NaN Settlement POINT (-1.98096 52.16353)
3 4 Abaty Cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abbey-cwm-hir, Abbeycwmhir Settlement POINT (-3.38992 52.33102)
4 4 Abbey-cwm-hir SO0571 52.331015 -3.389919 Radnorshire NaN NaN NaN Powys Dyfed Powys Wales Abaty Cwm-hir, Abbeycwmhir Settlement POINT (-3.38992 52.33102)

Now we have the list of recognized place names and a gazetteer from which to extract candidate place names with their coordinates. Let’s now do a lookup match.

Guess what? We can use the merge() (similar to join()) operations we learned in Tutorial 1 to achieve it:

locations_merged = pd.merge(locations_df, UK_gazetteer_gpd, left_on='Place Name', right_on='PlaceName', how = "left")
locations_merged
Place Name Tag GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type geometry
0 UK GPE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
1 the Lake District LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
2 Pennines LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
3 North York Moors GPE 198897.0 North York Moors SE7295 54.347372 -0.886344 Yorkshire North Riding North Yorkshire Ryedale NaN North Yorkshire England NaN Downs, Moorland POINT (-0.88634 54.34737)
4 Scotland GPE 39523.0 Scotland SK3822 52.796412 -1.428229 Leicestershire NaN Leicestershire North West Leicestershire NaN Leicestershire England NaN Settlement POINT (-1.42823 52.79641)
5 Scotland GPE 39524.0 Scotland SP6798 52.57925 -1.000125 Leicestershire NaN Leicestershire Harborough NaN Leicestershire England NaN Settlement POINT (-1.00012 52.57925)
6 Scotland GPE 39525.0 Scotland SU5669 51.417258 -1.196096 Berkshire NaN NaN NaN West Berkshire Thames Valley England NaN Settlement POINT (-1.19610 51.41726)
7 Scotland GPE 39526.0 Scotland TF0030 52.8604 -0.512500 Lincolnshire Parts of Kesteven Lincolnshire South Kesteven NaN Lincolnshire England NaN Settlement POINT (-0.51250 52.86040)
8 Scotland GPE 294460.0 Scotland SE2340 53.857797 -1.641983 Yorkshire West Riding NaN NaN Leeds West Yorkshire England NaN Settlement POINT (-1.64198 53.85780)
9 Helensburgh GPE 21050.0 Helensburgh NS2982 56.003981 -4.733445 Dunbartonshire NaN NaN NaN Argyll and Bute Argyll and West Dunbartonshire Scotland Baile Eilidh Settlement POINT (-4.73344 56.00398)
10 Wales GPE 47495.0 Wales SK4782 53.341077 -1.284149 Yorkshire West Riding NaN NaN Rotherham South Yorkshire England NaN Settlement POINT (-1.28415 53.34108)
11 Wales GPE 47496.0 Wales ST5824 51.02019 -2.590189 Somerset NaN Somerset South Somerset NaN Avon and Somerset England NaN Settlement POINT (-2.59019 51.02019)
12 Wales GPE 64525.0 Wales SK4782 53.341105 -1.290959 Yorkshire West Riding South Yorkshire Rotherham NaN South Yorkshire England NaN Civil Parish POINT (-1.29096 53.34110)
13 Northern Ireland GPE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
14 the Mourne Mountains LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
15 the Grampian Mountains LOC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None
16 England GPE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN None

Alright, as you can see, many place names, such as “North York Moors” and “Helensburgh”, are now geocoded. You will also notice that some places, like “Wales” and “Scotland”, are matched to multiple coordinates. This is because our gazetteer has multiple records for “Wales” and “Scotland”. Note also that these are not the “Wales” and “Scotland” you are thinking of. If you check the gazetteer by searching for rows whose PlaceName is “Wales”, for example (see below), you will find that these Wales are either a “Settlement” or a “Civil Parish” in England.

Plus, we also see that many place names, such as “UK”, “the Lake District”, “Pennines”, etc., do not find a match (their GBPNIDs are all NaN).

UK_gazetteer_gpd.loc[UK_gazetteer_gpd['PlaceName'] == "Wales"]
GBPNID PlaceName GridRef Lat Lng HistCounty Division AdCounty District UniAuth Police Region Alternative_Names Type geometry
47317 47495 Wales SK4782 53.341077 -1.284149 Yorkshire West Riding NaN NaN Rotherham South Yorkshire England NaN Settlement POINT (-1.28415 53.34108)
47318 47496 Wales ST5824 51.02019 -2.590189 Somerset NaN Somerset South Somerset NaN Avon and Somerset England NaN Settlement POINT (-2.59019 51.02019)
62356 64525 Wales SK4782 53.341105 -1.290959 Yorkshire West Riding South Yorkshire Rotherham NaN South Yorkshire England NaN Civil Parish POINT (-1.29096 53.34110)

All of these issues arise because (1) our imported gazetteer only covers places in Great Britain and only matches exact name strings, and (2) our simple model is incapable of capturing the context of the text. Do you have better ideas for improving our simple geocoding tool?
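One simple idea, for example, is to normalise the place names before matching (lower-casing them and stripping a leading “the ”) and then to keep only one candidate per name. The sketch below is only a starting point; a proper solution would also need some real disambiguation logic:

# Normalise names on both sides, re-run the lookup, and keep one candidate per name
locations_df['name_norm'] = (locations_df['Place Name']
                             .str.lower()
                             .str.replace(r'^the\s+', '', regex=True))
UK_gazetteer_gpd['name_norm'] = UK_gazetteer_gpd['PlaceName'].str.lower()

locations_merged2 = pd.merge(locations_df, UK_gazetteer_gpd,
                             on='name_norm', how='left')
# Crude disambiguation: keep only the first gazetteer candidate per place name
locations_merged2 = locations_merged2.drop_duplicates(subset='Place Name')
locations_merged2[['Place Name', 'PlaceName', 'Lat', 'Lng', 'Type']]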

Now is a great time to introduce you to some “fancy” geocoding/geoparsing libraries. Let’s try spacy_dbpedia_spotlight next. Make sure you have installed it. You can reuse the nlp object from the previous steps or create a new blank one as below; the code for using the library is:

import spacy_dbpedia_spotlight
import spacy
nlp = spacy.blank('en')
# add the pipeline stage
nlp.add_pipe('dbpedia_spotlight')
# get the document
doc = nlp(UK_physicalGeo)
# see the entities
entities_dbpedia = [(ent.text, ent.label_, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents]
entities_dbpedia[1]
('UK',
 'DBPEDIA_ENT',
 'http://dbpedia.org/resource/United_Kingdom',
 '0.9999999697674871')

What the code does is go through the text and try to find the corresponding entities in DBpedia. The result is a list of tuples; one example tuple is shown in the box above. It includes the text that was parsed, its label (notice how different it is from spaCy’s labels), its identifier (URI) in DBpedia (this is very useful, as all associated information can be further retrieved via this link; we will cover more about this in future lectures and tutorials), and a score from 0 to 1 (the similarity score of the string matching; the higher it is, the more similar the target text is to the candidate).

Similar to what we did before, we can then transfer this list into a DataFrame. Note that since we do not have coordinates explicitly listed here, we can simply use a plain pandas DataFrame.

columns = ['Text', 'DBpedia Label', 'DBpedia URI', 'Similarity Score']

UK_physicalGeo_DBpedia = pd.DataFrame(entities_dbpedia, columns=columns)
UK_physicalGeo_DBpedia = UK_physicalGeo_DBpedia.drop_duplicates()
UK_physicalGeo_DBpedia
Text DBpedia Label DBpedia URI Similarity Score
0 physical geography DBPEDIA_ENT http://dbpedia.org/resource/Physical_geography 1.0
1 UK DBPEDIA_ENT http://dbpedia.org/resource/United_Kingdom 0.9999999697674871
2 England DBPEDIA_ENT http://dbpedia.org/resource/England 0.9999997958868998
3 Lake District DBPEDIA_ENT http://dbpedia.org/resource/Lake_District 1.0
4 Moors DBPEDIA_ENT http://dbpedia.org/resource/Moorland 1.0
5 Exmoor DBPEDIA_ENT http://dbpedia.org/resource/Exmoor 1.0
6 Dartmoor DBPEDIA_ENT http://dbpedia.org/resource/Dartmoor 1.0
7 chalk DBPEDIA_ENT http://dbpedia.org/resource/Chalk 0.999982297417919
8 Scotland DBPEDIA_ENT http://dbpedia.org/resource/Scotland 1.0
11 Highland Boundary Fault DBPEDIA_ENT http://dbpedia.org/resource/Highland_Boundary_... 1.0
12 Scottish DBPEDIA_ENT http://dbpedia.org/resource/Scotland 0.9997720847861457
13 Helensburgh DBPEDIA_ENT http://dbpedia.org/resource/Helensburgh 1.0
14 Stonehaven DBPEDIA_ENT http://dbpedia.org/resource/Stonehaven 1.0
15 faultline DBPEDIA_ENT http://dbpedia.org/resource/Faultline_(musician) 0.9838808091555019
17 Central Lowlands DBPEDIA_ENT http://dbpedia.org/resource/Central_Lowlands 1.0
18 Southern Uplands DBPEDIA_ENT http://dbpedia.org/resource/Southern_Uplands 1.0
19 Wales DBPEDIA_ENT http://dbpedia.org/resource/Wales 0.9999999999998863
22 Northern Ireland DBPEDIA_ENT http://dbpedia.org/resource/Northern_Ireland 0.999999996766519
23 geography DBPEDIA_ENT http://dbpedia.org/resource/Geography 0.9947401620497708
24 Mourne Mountains DBPEDIA_ENT http://dbpedia.org/resource/Mourne_Mountains 1.0
25 Lough Neagh DBPEDIA_ENT http://dbpedia.org/resource/Lough_Neagh 1.0
26 geomorphology DBPEDIA_ENT http://dbpedia.org/resource/Geomorphology 1.0
28 climate change DBPEDIA_ENT http://dbpedia.org/resource/Global_warming 0.9859722723298184
29 glaciation DBPEDIA_ENT http://dbpedia.org/resource/Glacial_period 0.9861545121455497
31 British Isles DBPEDIA_ENT http://dbpedia.org/resource/British_Isles 1.0
32 Nevis DBPEDIA_ENT http://dbpedia.org/resource/River_Nevis 0.9999978752268652
33 Grampian Mountains DBPEDIA_ENT http://dbpedia.org/resource/Grampian_Mountains 1.0
35 River Severn DBPEDIA_ENT http://dbpedia.org/resource/River_Severn 1.0
38 lake DBPEDIA_ENT http://dbpedia.org/resource/Lake 0.999999999890747
42 Loch Ness DBPEDIA_ENT http://dbpedia.org/resource/Loch_Ness 1.0

It seems that coordinates are not extracted by the tool. But if you open the URL, for example https://dbpedia.org/page/Lake_District, you will see full information about the place, including its geometry and more! See below for instance: DBpedia-Point

Again, even though spacy_dbpedia_spotlight has its advantages, as you can see there are still flaws in such a tool. For example, many non-spatial entities are also detected, and coordinates are not explicitly shown. Do you have any ideas for addressing these issues? (Hint: maybe check the pyDBpedia library, which helps you access the data shown in DBpedia.)
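As a rough sketch of one possible direction (here using the requests library directly rather than pyDBpedia, and assuming DBpedia’s usual JSON data endpoint and WGS84 geo vocabulary), you can fetch the data behind a resource URI and read its latitude and longitude when they are available:

import requests

# Fetch the DBpedia data for a resource and read its WGS84 coordinates, if present
uri = 'http://dbpedia.org/resource/Lake_District'
data = requests.get(uri.replace('/resource/', '/data/') + '.json').json()
resource = data.get(uri, {})
lat = resource.get('http://www.w3.org/2003/01/geo/wgs84_pos#lat', [{}])[0].get('value')
lng = resource.get('http://www.w3.org/2003/01/geo/wgs84_pos#long', [{}])[0].get('value')
print(lat, lng)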

Now, you can also try using these tools on your extracted tweets, or on other texts you get from the Internet.