Lab 02: Georeferencing Location-based Social Media#
In this tutorial, we will learn:
How to extract information (e.g., tweets) from location-based social media (e.g., Twitter)
How to identify locational information (e.g., place name) from text-based data (e.g., tweets, newspapers)
How to georeference the identified locations to metric-based coordinates on the surface of the earth
Several libraries/packages are needed in this tutorial. Use pip or conda to install them (an example install command follows the list):
tweepy: a library to access the Twitter API
spaCy: a library for natural language processing
spacy-dbpedia-spotlight: a small library that links entities recognized by spaCy to DBpedia entities
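For example, in a Jupyter notebook you could install them with pip like this (a sketch; the model download in the second line is only needed for the spaCy part in Part 2):
!pip install tweepy spacy spacy-dbpedia-spotlight
!python -m spacy download en_core_web_sm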
Part 1: Extracting (geo)text from Twitter#
This part explains how to extract text-based unstructured information from the social media platform Twitter via its API. A similar pipeline can be used to extract information from other types of social media/web services (e.g., Foursquare, Yelp, Flickr, etc.).
Twitter is a useful data source for studying the social impacts of events and activities. In this part, we are going to learn how to collect Twitter data using its API. Specifically, we are going to focus on geotagged Twitter data.
First, Twitter requires the users of Twitter API to be authenticated by the system. One simple approach to obtain such authentication is by registering a Twitter account. This is the approach we are going to take in this tutorial.
Go to the website of Twitter: https://twitter.com/ , and click “Sign up” at the upper right corner. You can skip this step if you already have a Twitter account.
After you have registered/logged in to your Twitter account, we are going to obtain the keys that are necessary to use Twitter API. Go to https://apps.twitter.com/ , and sign in using the Twitter account you have just created (sometimes the browser will automatically sign you in).
After you have signed in, click the button “Create New App” and fill in the necessary information to create an app. Note that you might need to add your phone number to your Twitter account in order to do so; if you prefer, you can remove it from your account after you have finished your project.
Then you will be directed to a page (see example below) asking for a name for your App. Give it any name you like.
Click Get keys. It will then generate an API Key, API Key Secret, and Bearer Token (see below for an example). Make sure you copy and paste them into a safe place (e.g., a text editor); we need these credentials later.
Next, we also need to obtain the Access Token and its secret. To do so, go to Projects & Apps –> select your App, click Keys and tokens, and then click Generate to the right of Access Token and Secret (see below). Again, make sure you record them in a safe place; we need them later. Note that if for some reason you lose your tokens and secrets, this page is where you regenerate them.
Once you have your Twitter app set up, you are ready to access tweets in Python. Begin by importing the necessary Python libraries.
import os
import tweepy as tw
import pandas as pd
To access the Twitter API, you will need four things from your Twitter App page. These keys are located in your Twitter app settings in the Keys and Access Tokens tab.
api key
api key secret
access token
access token secret
Below I put in my credentials as an example; you should use yours! Remember not to share these with anyone else, because these values are specific to your App.
api_key= '5TX6isrDz92kOC1s7qsTFWq5F'
api_key_secret= 'DL4Gw2WLNo2bK538lL5GeNYCtiwlsuYUHOlW8NCSQszK3ac101'
access_token= '1582847729486684180-VH5N9AEb2zyyFyLOj5BuD8I9ca0ils'
access_token_secret = 'e0cvkxJz9AWvu9dq0Fb48r61vkmIaA1JqLLyhEhms5FGt'
With these credentials, we can next build an API object in Python that establishes the connection between this Jupyter program and Twitter:
auth = tw.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
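As an optional sanity check (not required by the rest of the tutorial), you can ask Twitter to verify the credentials; if the keys and tokens above are valid, this returns your own user object:
# Optional: check that authentication worked
me = api.verify_credentials()
print(me.screen_name)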
For example, we can now post a tweet using this API access. Note that your tweet needs to be 280 characters or fewer:
# Post a tweet from Python
api.update_status("Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience")
# Your tweet has been posted!
Status(_api=<tweepy.api.API object at 0x7fb1787dab20>, _json={'created_at': 'Thu Nov 10 11:50:08 +0000 2022', 'id': 1590673155726929923, 'id_str': '1590673155726929923', 'text': "Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience", 'entities': {'hashtags': [{'text': 'DataScience', 'indices': [92, 104]}], 'symbols': [], 'user_mentions': [], 'urls': []}, 'user': {...}, 'geo': None, 'coordinates': None, 'place': None, 'lang': 'en', ...}, created_at=datetime.datetime(2022, 11, 10, 11, 50, 8, tzinfo=datetime.timezone.utc), id=1590673155726929923, text="Hello Twitter, I'm sending the first message via Python to you! I learnt it from GEOGM0068. #DataScience", ...)
(output truncated)
If you go to your Twitter account and check Profile, you will see the tweet has been posted! Congratulations on your first post via Python!
Note that if you see an error like “453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve”, it means you need to elevate your access. What you need to do is: (1) go to Products –> Twitter API v2; (2) click the tab “Elevated” (or “Academic Research” if you need it for your dissertation later); (3) click Apply, then fill in the form (you can choose No for many of the questions). See the screenshot below for (1) and (2):
Next, let’s retrieve (search) some recent tweets about #energycrisis that are posted in English. There will be many posts returned; to make the illustration easy and to save requests (note you have a limited number of requests via this API), we only request 5 from the list.
search_words = "#energycrisis"
tweets = tw.Cursor(api.search_tweets,
                   q=search_words,
                   lang="en").items(5)
tweets
<tweepy.cursor.ItemIterator at 0x7fb198ad6400>
Here, you see an object that you can iterate (i.e., an ItemIterator) or loop over to access the data collected. Each item in the iterator has various attributes that you can access to get information about each tweet, including:
the text of the tweet
who sent the tweet
the date the tweet was sent
and more. The code below loops through the object and saves the time of the tweet, the user who posted the tweet, the text of the tweet, as well as the user location, to a pandas DataFrame:
import pandas as pd
# create dataframe
columns = ['Time', 'User', 'Tweet', 'Location']
data = []
for tweet in tweets:
    data.append([tweet.created_at, tweet.user.screen_name, tweet.text, tweet.user.location])
df = pd.DataFrame(data, columns=columns)
df
| Time | User | Tweet | Location |
---|---|---|---|---|
0 | 2022-11-10 11:49:20+00:00 | shirleyyoung2 | RT @secularmac: Isn't this nice?\n#EnergyCrisi... | Fife, Scotland |
1 | 2022-11-10 11:48:09+00:00 | Q_Review_ | How does living in the cold affect your health... | |
2 | 2022-11-10 11:46:15+00:00 | Bellona_EU | RT @jonashelseth: Encouraging!\nEU 🇪🇺 consumer... | Brussels, Belgium |
3 | 2022-11-10 11:46:02+00:00 | Bellona_EU | RT @LoviMarta: Renewables & heat pumps are... | Brussels, Belgium |
4 | 2022-11-10 11:44:51+00:00 | jonashelseth | Encouraging!\nEU 🇪🇺 consumer organisation @BEU... | Brussels, Belgium |
We can further save the dataframe to a local csv file (structured data):
df.to_csv('tweets_example.csv')
Note that there is another way of writing the query to the Twitter API, which might be more intuitive to some users. For example, you can replace tweets = tw.Cursor(api.search_tweets,q=search_words,lang="en").items(5) with something like:
tweets2 = api.search_tweets(q=search_words,lang="en", count="5")
data2 = []
for tweet2 in tweets2:
    data2.append([tweet2.created_at, tweet2.user.screen_name, tweet2.text, tweet2.user.location])
df2 = pd.DataFrame(data2, columns=columns)
df2
| Time | User | Tweet | Location |
---|---|---|---|---|
0 | 2022-11-10 11:49:20+00:00 | shirleyyoung2 | RT @secularmac: Isn't this nice?\n#EnergyCrisi... | Fife, Scotland |
1 | 2022-11-10 11:48:09+00:00 | Q_Review_ | How does living in the cold affect your health... | |
2 | 2022-11-10 11:46:15+00:00 | Bellona_EU | RT @jonashelseth: Encouraging!\nEU 🇪🇺 consumer... | Brussels, Belgium |
3 | 2022-11-10 11:46:02+00:00 | Bellona_EU | RT @LoviMarta: Renewables & heat pumps are... | Brussels, Belgium |
4 | 2022-11-10 11:44:51+00:00 | jonashelseth | Encouraging!\nEU 🇪🇺 consumer organisation @BEU... | Brussels, Belgium |
To learn more about the key function search_tweets(), check its webpage here. Please try setting some other parameters yourself to see what you can get.
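For instance, a parameter worth trying for geographic work is geocode, which restricts results to tweets located within a given radius of a point. A minimal sketch is below; the coordinates and radius are illustrative values for Bristol, not part of the original lab:
# Search recent English tweets about #energycrisis near Bristol (illustrative values)
tweets_bristol = api.search_tweets(q=search_words,
                                   lang="en",
                                   geocode="51.4545,-2.5879,50km",
                                   result_type="recent",
                                   count=5)
for t in tweets_bristol:
    print(t.created_at, t.user.screen_name, t.text[:60])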
Part 2: Basic Natural Language Processing and Geoparsing#
To extract places (or other categories) from text-based (unstructured) data, we need to do some basic Natural Language Processing (NLP), such as tokenization and Part-of-Speech analysis. All these operations can be done through the library spaCy.
Ideally, you could use the tweets you got from Part 1 for this experiment. But since the tweets you get can be very heterogeneous and noisy, here we use a clean example (you could also take a long news article found online) to show how to use spaCy, so that all the knowledge points are covered in one example.
First make sure you have installed and imported spaCy:
import spacy
spaCy comes with pretrained NLP models that can perform most common NLP tasks, such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), transforming to word vectors, etc.
If you are dealing with a particular language, you can load the spaCy model specific to that language using the spacy.load() function. For example, to load the English model:
# Load small english model: https://spacy.io/models
nlp=spacy.load("en_core_web_sm")
nlp
Note: when running this, you may see a warning or error like the following if the installed en_core_web_sm model does not match your spaCy version (here, a 3.4.0 model against spaCy 2.3.8):
/Users/gy22808/opt/anaconda3/lib/python3.9/site-packages/spacy/util.py:275: UserWarning: [W031] Model 'en_core_web_sm' (3.4.0) requires spaCy v3.4 and is incompatible with the current spaCy version (2.3.8). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate
...
AttributeError: type object 'Language' has no attribute 'factory'
(full traceback truncated)
If this happens, install versions of spaCy and the model that match each other (the AttributeError above comes from spacy-dbpedia-spotlight, whose pipeline registration uses the spaCy v3 Language.factory API), restart the kernel, and re-run the cell. Running python -m spacy validate lists the compatible models.
Once the model loads successfully, spacy.load() returns a Language object that comes ready with multiple built-in capabilities.
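To see what the loaded model provides, you can inspect its processing pipeline. The exact component names depend on the model version, so the commented output below is only indicative:
# List the pipeline components bundled with the loaded model
print(nlp.pipe_names)   # e.g. ['tok2vec', 'tagger', 'parser', 'ner', ...]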
Now let’s say you have your text data in a string. What can be done to understand the structure of the text?
First, call the loaded nlp object on the text. It should return a processed Doc object.
# Parse text through the `nlp` model
my_text = """The economic situation of the country is on edge , as the stock
market crashed causing loss of millions. Citizens who had their main investment
in the share-market are facing a great loss. Many companies might lay off
thousands of people to reduce labor cost"""
my_doc = nlp(my_text)
type(my_doc)
spacy.tokens.doc.Doc
Hmmm, it is a Doc object. But wait, what exactly is a Doc object?
It is a sequence of tokens that contains not just the original text but also all the results produced by the spaCy model after processing the text. Useful information such as the lemma of each token, whether it is a stop word or not, named entities, the word vector of the text, and so on are pre-computed and readily stored in the Doc object.
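For a quick peek at a few of these pre-computed attributes, here is a small illustrative example printing the text, lemma, and stop-word flag of the first five tokens:
# Inspect a few pre-computed token attributes
for token in my_doc[:5]:
    print(token.text, token.lemma_, token.is_stop)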
So first, what is a token?
As you have learnt from the lecture, tokens are the individual textual entities that make up a text. Typically a token can be a word, a punctuation mark, a space, etc. Tokenization is the process of converting a text into smaller sub-texts based on certain predefined rules: for example, sentences are tokenized into words (and optionally punctuation), and paragraphs into sentences, depending on the context.
Each token in spaCy has different attributes that tell us a great deal of information.
Let’s look at the token texts of my_doc. The string which a token represents can be accessed through the token.text attribute.
# Printing the tokens of a doc
for token in my_doc:
    print(token.text)
The
economic
situation
of
the
country
is
on
edge
,
as
the
stock
market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment
in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
off
thousands
of
people
to
reduce
labor
cost
The tokens above contain punctuation and common words like “a”, “the”, “was”, etc. These do not add any value to the meaning of your text; they are called stop words. We can clean them up.
Token attributes allow us to identify and remove such noisy tokens (stop words, punctuation, and whitespace). First, we print whether each token is a stop word or punctuation, and then we use this information to remove them.
# Printing tokens and boolean values stored in different attributes
for token in my_doc:
    print(token.text,'--',token.is_stop,'---',token.is_punct)
The -- True --- False
economic -- False --- False
situation -- False --- False
of -- True --- False
the -- True --- False
country -- False --- False
is -- True --- False
on -- True --- False
edge -- False --- False
, -- False --- True
as -- True --- False
the -- True --- False
stock -- False --- False
-- False --- False
market -- False --- False
crashed -- False --- False
causing -- False --- False
loss -- False --- False
of -- True --- False
millions -- False --- False
. -- False --- True
Citizens -- False --- False
who -- True --- False
had -- True --- False
their -- True --- False
main -- False --- False
investment -- False --- False
-- False --- False
in -- True --- False
the -- True --- False
share -- False --- False
- -- False --- True
market -- False --- False
are -- True --- False
facing -- False --- False
a -- True --- False
great -- False --- False
loss -- False --- False
. -- False --- True
Many -- True --- False
companies -- False --- False
might -- True --- False
lay -- False --- False
off -- True --- False
-- False --- False
thousands -- False --- False
of -- True --- False
people -- False --- False
to -- True --- False
reduce -- False --- False
labor -- False --- False
cost -- False --- False
# Removing StopWords and punctuations
my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct and not token.is_space]
for token in my_doc_cleaned:
    print(token.text)
economic
situation
country
edge
stock
market
crashed
causing
loss
millions
Citizens
main
investment
share
market
facing
great
loss
companies
lay
thousands
people
reduce
labor
cost
To get the POS tagging of your text, you use code like:
for token in my_doc_cleaned:
    print(token.text,'---- ',token.pos_)
economic ---- ADJ
situation ---- NOUN
country ---- NOUN
edge ---- NOUN
stock ---- NOUN
market ---- NOUN
crashed ---- VERB
causing ---- VERB
loss ---- NOUN
millions ---- NOUN
Citizens ---- NOUN
main ---- ADJ
investment ---- NOUN
share ---- NOUN
market ---- NOUN
facing ---- VERB
great ---- ADJ
loss ---- NOUN
companies ---- NOUN
lay ---- VERB
thousands ---- NOUN
people ---- NOUN
reduce ---- VERB
labor ---- NOUN
cost ---- NOUN
You will see that each word (token) is now associated with a POS tag: whether it is a noun, an adjective, a verb, and so on. POS tags can often help us disambiguate the meaning of words (or place names in GIR).
By the way, if you don’t know what “ADJ” means, you can use code like:
spacy.explain('ADJ')
'adjective'
You can also use spaCy to do some Named Entity Recognition (including place name identification, i.e. geoparsing). For instance:
text='Tony Stark owns the company StarkEnterprises . Emily Clark works at Microsoft and lives in Manchester. She loves to read the Bible and learn French'
doc=nlp(text)
for entity in doc.ents:
    print(entity.text,'--- ',entity.label_)
Tony Stark --- PERSON
StarkEnterprises --- ORG
Emily Clark --- PERSON
Microsoft --- ORG
Manchester --- GPE
Bible --- WORK_OF_ART
French --- NORP
What is “GPE”?
spacy.explain('GPE')
'Countries, cities, states'
spaCy also provides a special visualization for NER through displacy. Using the displacy.render() function, you can set style='ent' to visualize the entities.
# Using displacy for visualizing NER
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)
So far, you have learnt the basics of retrieving information from social media like Twitter, as well as basic NLP operations and named entity recognition (of which geoparsing is a part). I suggest you play with what you have learnt so far: use new data to experiment with these functions, change the parameters of the functions, and combine these skills with what you learnt in Tutorial 1 (e.g., geopandas).
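As a starting point, here is a minimal sketch (assuming the DataFrame df of tweets from Part 1 is still in memory and the nlp model loaded successfully) of running the spaCy pipeline over the collected tweets and collecting any place-like entities:
# Run NER over the tweet texts collected in Part 1 and keep GPE/LOC entities
tweet_places = []
for doc in nlp.pipe(df['Tweet'].astype(str)):
    tweet_places.extend(ent.text for ent in doc.ents if ent.label_ in ['GPE', 'LOC'])
print(tweet_places)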
Part 3: Geocoding the recognized place names#
spaCy helps us recognize different categories of tokens from a text, including place names (with the tag GPE or LOC), but it has not yet referenced these place names to geographic locations on the surface of the earth. In this part, we will explore ways of geocoding text-based place names to geographic coordinates. There are several cool libraries/packages we could use directly, some of which we will cover in this tutorial. But before that, let’s develop our own geocoding tool first. We might not use it in the future due to its simplicity, but it will help us understand the fundamentals behind those technologies, which we have highlighted in our lectures.
First, let’s create a variable storing the text that we want to georeference. The text below is copied from the Physical Geography section of the Wikipedia page about the UK. We then use nlp() to convert the text into a Doc object defined by spaCy. Having this object, we can extract place names using the labels LOC and GPE. Here we use a for-loop to go through all the entities, keep only those that have the two location-related labels, and save them into a list locations. Then we transfer this list into a pandas dataframe.
UK_physicalGeo = "The physical geography of the UK varies greatly. England consists of mostly lowland terrain, with upland or mountainous terrain only found north-west of the Tees-Exe line. The upland areas include the Lake District, the Pennines, North York Moors, Exmoor and Dartmoor. The lowland areas are typically traversed by ranges of low hills, frequently composed of chalk, and flat plains. Scotland is the most mountainous country in the UK and its physical geography is distinguished by the Highland Boundary Fault which traverses the Scottish mainland from Helensburgh to Stonehaven. The faultline separates the two distinctively different regions of the Highlands to the north and west, and the Lowlands to the south and east. The Highlands are predominantly mountainous, containing the majority of Scotland's mountainous landscape, while the Lowlands contain flatter land, especially across the Central Lowlands, with upland and mountainous terrain located at the Southern Uplands. Wales is mostly mountainous, though south Wales is less mountainous than north and mid Wales. Northern Ireland consists of mostly hilly landscape and its geography includes the Mourne Mountains as well as Lough Neagh, at 388 square kilometres (150 sq mi), the largest body of water in the UK.[12]The overall geomorphology of the UK was shaped by a combination of forces including tectonics and climate change, in particular glaciation in northern and western areas. The tallest mountain in the UK (and British Isles) is Ben Nevis, in the Grampian Mountains, Scotland. The longest river is the River Severn which flows from Wales into England. The largest lake by surface area is Lough Neagh in Northern Ireland, though Scotland's Loch Ness has the largest volume."
UK_physicalGeo_doc=nlp(UK_physicalGeo)
locations = []
for entity in UK_physicalGeo_doc.ents:
    if entity.label_ in ['LOC', 'GPE']:
        print(entity.text,'--- ',entity.label_)
        locations.append([entity.text, entity.label_])
locations_df = pd.DataFrame(locations, columns = ['Place Name', 'Tag'])
UK --- GPE
the Lake District --- LOC
Pennines --- LOC
North York Moors --- GPE
Scotland --- GPE
UK --- GPE
Helensburgh --- GPE
Scotland --- GPE
Wales --- GPE
Wales --- GPE
Northern Ireland --- GPE
the Mourne Mountains --- LOC
UK --- GPE
UK --- GPE
the Grampian Mountains --- LOC
Scotland --- GPE
Wales --- GPE
England --- GPE
Northern Ireland --- GPE
Scotland --- GPE
Note that we see many duplicates in the list, so we want to delete them. pandas provides a very easy function to do this: drop_duplicates():
locations_df = locations_df.drop_duplicates()
locations_df
| Place Name | Tag |
---|---|---|
0 | UK | GPE |
1 | the Lake District | LOC |
2 | Pennines | LOC |
3 | North York Moors | GPE |
4 | Scotland | GPE |
6 | Helensburgh | GPE |
8 | Wales | GPE |
10 | Northern Ireland | GPE |
11 | the Mourne Mountains | LOC |
14 | the Grampian Mountains | LOC |
17 | England | GPE |
In Python, there are often many ways to achieve the same goal. Using what we have done so far as an example, you can also use the code below to achieve the same result (creating the locations list). Try it yourself!
locations.extend([[entity.text, entity.label_] for entity in UK_physicalGeo_doc.ents if entity.label_ in ['LOC', 'GPE']])
After recognizing these place names, we next want to geocode them. First, we want to see how far we can go without using any external geocoding/geoparsing libraries.
As we discussed in the lecture, to do geocoding we need a gazetteer first. Since the example we are using is mostly about the UK, we can use the Gazetteer of British Place Names. Make sure you have downloaded the csv file into your local directory, and remember to replace the path ../../LabData/GBPN_14062021.csv below with yours.
import pandas as pd
import geopandas as gpd
UK_gazetteer_df = pd.read_csv('../../LabData/GBPN_14062021.csv')
UK_gazetteer_df.head()
/var/folders/xg/5n3zc4sn5hlcg8zzz6ysx21m0000gq/T/ipykernel_3617/3736579507.py:4: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
UK_gazetteer_df = pd.read_csv('../../LabData/GBPN_14062021.csv')
| GBPNID | PlaceName | GridRef | Lat | Lng | HistCounty | Division | AdCounty | District | UniAuth | Police | Region | Alternative_Names | Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | A' Chill | NG2705 | 57.057719 | -6.500908 | Argyllshire | NaN | NaN | NaN | Highland | Highlands and Islands | Scotland | NaN | Settlement |
1 | 2 | Ab Kettleby | SK7223 | 52.800049 | -0.927993 | Leicestershire | NaN | Leicestershire | Melton | NaN | Leicestershire | England | NaN | Settlement |
2 | 3 | Ab Lench | SP0151 | 52.163533 | -1.980962 | Worcestershire | NaN | Worcestershire | Wychavon | NaN | West Mercia | England | NaN | Settlement |
3 | 4 | Abaty Cwm-hir | SO0571 | 52.331015 | -3.389919 | Radnorshire | NaN | NaN | NaN | Powys | Dyfed Powys | Wales | Abbey-cwm-hir, Abbeycwmhir | Settlement |
4 | 4 | Abbey-cwm-hir | SO0571 | 52.331015 | -3.389919 | Radnorshire | NaN | NaN | NaN | Powys | Dyfed Powys | Wales | Abaty Cwm-hir, Abbeycwmhir | Settlement |
This is obviously a spatial data set, so we want to represent it as a geopandas GeoDataFrame. To do so, we will convert the Lat and Lng columns into a new geometry column. We also need to assign a coordinate reference system to the data.
If you start doing so, you will quickly run into an error saying something is wrong in the Lat column. Don’t panic! Once you understand what the error indicates, you can go back to the csv file, where you will find one cell value in Lat is 53.20N. This is not a standard way of representing geographic coordinates. What we can do is simply remove that row, then finish the transformation from dataframe to geodataframe:
# Drop the row with the malformed latitude value, then build the GeoDataFrame
UK_gazetteer_df.drop(UK_gazetteer_df[UK_gazetteer_df['Lat'] == '53.20N '].index, inplace=True)
UK_gazetteer_gpd = gpd.GeoDataFrame(
    UK_gazetteer_df, geometry=gpd.points_from_xy(UK_gazetteer_df.Lng, UK_gazetteer_df.Lat))
# Assign WGS84 as the coordinate reference system (note the re-assignment: set_crs returns the updated GeoDataFrame)
UK_gazetteer_gpd = UK_gazetteer_gpd.set_crs('epsg:4326')
UK_gazetteer_gpd.head()
| GBPNID | PlaceName | GridRef | Lat | Lng | HistCounty | Division | AdCounty | District | UniAuth | Police | Region | Alternative_Names | Type | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | A' Chill | NG2705 | 57.057719 | -6.500908 | Argyllshire | NaN | NaN | NaN | Highland | Highlands and Islands | Scotland | NaN | Settlement | POINT (-6.50091 57.05772) |
1 | 2 | Ab Kettleby | SK7223 | 52.800049 | -0.927993 | Leicestershire | NaN | Leicestershire | Melton | NaN | Leicestershire | England | NaN | Settlement | POINT (-0.92799 52.80005) |
2 | 3 | Ab Lench | SP0151 | 52.163533 | -1.980962 | Worcestershire | NaN | Worcestershire | Wychavon | NaN | West Mercia | England | NaN | Settlement | POINT (-1.98096 52.16353) |
3 | 4 | Abaty Cwm-hir | SO0571 | 52.331015 | -3.389919 | Radnorshire | NaN | NaN | NaN | Powys | Dyfed Powys | Wales | Abbey-cwm-hir, Abbeycwmhir | Settlement | POINT (-3.38992 52.33102) |
4 | 4 | Abbey-cwm-hir | SO0571 | 52.331015 | -3.389919 | Radnorshire | NaN | NaN | NaN | Powys | Dyfed Powys | Wales | Abaty Cwm-hir, Abbeycwmhir | Settlement | POINT (-3.38992 52.33102) |
Now we have the geoparsed place name list and a gazetteer from which to extract candidate place names with their coordinates. Let’s now do a lookup matching.
Guess what? We can use the merge() (similar to join()) operation we learned in Tutorial 1 to achieve it:
locations_merged = pd.merge(locations_df, UK_gazetteer_gpd, left_on='Place Name', right_on='PlaceName', how = "left")
locations_merged
| Place Name | Tag | GBPNID | PlaceName | GridRef | Lat | Lng | HistCounty | Division | AdCounty | District | UniAuth | Police | Region | Alternative_Names | Type | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | UK | GPE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None |
1 | the Lake District | LOC | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None |
2 | Pennines | LOC | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None |
3 | North York Moors | GPE | 198897.0 | North York Moors | SE7295 | 54.347372 | -0.886344 | Yorkshire | North Riding | North Yorkshire | Ryedale | NaN | North Yorkshire | England | NaN | Downs, Moorland | POINT (-0.88634 54.34737) |
4 | Scotland | GPE | 39523.0 | Scotland | SK3822 | 52.796412 | -1.428229 | Leicestershire | NaN | Leicestershire | North West Leicestershire | NaN | Leicestershire | England | NaN | Settlement | POINT (-1.42823 52.79641) |
5 | Scotland | GPE | 39524.0 | Scotland | SP6798 | 52.57925 | -1.000125 | Leicestershire | NaN | Leicestershire | Harborough | NaN | Leicestershire | England | NaN | Settlement | POINT (-1.00012 52.57925) |
6 | Scotland | GPE | 39525.0 | Scotland | SU5669 | 51.417258 | -1.196096 | Berkshire | NaN | NaN | NaN | West Berkshire | Thames Valley | England | NaN | Settlement | POINT (-1.19610 51.41726) |
7 | Scotland | GPE | 39526.0 | Scotland | TF0030 | 52.8604 | -0.512500 | Lincolnshire | Parts of Kesteven | Lincolnshire | South Kesteven | NaN | Lincolnshire | England | NaN | Settlement | POINT (-0.51250 52.86040) |
8 | Scotland | GPE | 294460.0 | Scotland | SE2340 | 53.857797 | -1.641983 | Yorkshire | West Riding | NaN | NaN | Leeds | West Yorkshire | England | NaN | Settlement | POINT (-1.64198 53.85780) |
9 | Helensburgh | GPE | 21050.0 | Helensburgh | NS2982 | 56.003981 | -4.733445 | Dunbartonshire | NaN | NaN | NaN | Argyll and Bute | Argyll and West Dunbartonshire | Scotland | Baile Eilidh | Settlement | POINT (-4.73344 56.00398) |
10 | Wales | GPE | 47495.0 | Wales | SK4782 | 53.341077 | -1.284149 | Yorkshire | West Riding | NaN | NaN | Rotherham | South Yorkshire | England | NaN | Settlement | POINT (-1.28415 53.34108) |
11 | Wales | GPE | 47496.0 | Wales | ST5824 | 51.02019 | -2.590189 | Somerset | NaN | Somerset | South Somerset | NaN | Avon and Somerset | England | NaN | Settlement | POINT (-2.59019 51.02019) |
12 | Wales | GPE | 64525.0 | Wales | SK4782 | 53.341105 | -1.290959 | Yorkshire | West Riding | South Yorkshire | Rotherham | NaN | South Yorkshire | England | NaN | Civil Parish | POINT (-1.29096 53.34110) |
13 | Northern Ireland | GPE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None |
14 | the Mourne Mountains | LOC | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None |
15 | the Grampian Mountains | LOC | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None |
16 | England | GPE | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | None |
Alright, as you can see, many place names, such as “North York Moors” and “Helensburgh”, are now geocoded. You will also notice that some places, like “Wales” and “Scotland”, are matched to multiple coordinates. This is because our gazetteer contains multiple records for “Wales” and “Scotland”. Note also that these are not the “Wales” and “Scotland” you are thinking of: if you search the gazetteer for rows whose PlaceName is “Wales”, for example (see below), you will find these entries are either “Settlement” or “Civil Parish” records in England.
In addition, many place names such as “UK”, “the Lake District”, “Pennines”, etc. did not find a match (their GBPNIDs are all NaN).
UK_gazetteer_gpd.loc[UK_gazetteer_gpd['PlaceName'] == "Wales"]
| GBPNID | PlaceName | GridRef | Lat | Lng | HistCounty | Division | AdCounty | District | UniAuth | Police | Region | Alternative_Names | Type | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
47317 | 47495 | Wales | SK4782 | 53.341077 | -1.284149 | Yorkshire | West Riding | NaN | NaN | Rotherham | South Yorkshire | England | NaN | Settlement | POINT (-1.28415 53.34108) |
47318 | 47496 | Wales | ST5824 | 51.02019 | -2.590189 | Somerset | NaN | Somerset | South Somerset | NaN | Avon and Somerset | England | NaN | Settlement | POINT (-2.59019 51.02019) |
62356 | 64525 | Wales | SK4782 | 53.341105 | -1.290959 | Yorkshire | West Riding | South Yorkshire | Rotherham | NaN | South Yorkshire | England | NaN | Civil Parish | POINT (-1.29096 53.34110) |
All of these issues are due to the facts that (1) our imported gazetteer only covers certain kinds of places in Great Britain (so names such as “UK” and “Northern Ireland” are missing), and (2) our simple model is unable to capture the context of the text. Do you have better ideas to improve our simple geocoding tool?
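One simple improvement, sketched below, is to normalise names on both sides before matching, e.g. strip a leading “the ” and ignore case, so that “the Lake District” could match a gazetteer entry such as “Lake District” if one exists. The Name_norm column is introduced here purely for illustration:
# Normalise place names before the lookup (illustrative improvement)
locations_df['Name_norm'] = (locations_df['Place Name']
                             .str.replace(r'^the\s+', '', regex=True)
                             .str.lower())
UK_gazetteer_gpd['Name_norm'] = UK_gazetteer_gpd['PlaceName'].str.lower()
locations_merged2 = pd.merge(locations_df, UK_gazetteer_gpd, on='Name_norm', how='left')
locations_merged2.head()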
Now is a great time to introduce you to some “fancy” geocoding/geoparsing libraries. Let’s try spacy_dbpedia_spotlight next. Make sure you have installed it. Assuming we already have the nlp object from the previous steps (or you can create a new one as below), the code for using the library is:
import spacy_dbpedia_spotlight
import spacy
nlp = spacy.blank('en')
# add the pipeline stage
nlp.add_pipe('dbpedia_spotlight')
# get the document
doc = nlp(UK_physicalGeo)
# see the entities
entities_dbpedia = [(ent.text, ent.label_, ent.kb_id_, ent._.dbpedia_raw_result['@similarityScore']) for ent in doc.ents]
entities_dbpedia[1]
('UK',
'DBPEDIA_ENT',
'http://dbpedia.org/resource/United_Kingdom',
'0.9999999697674871')
What the code does is go through the text and try to find the corresponding entities in DBpedia. The result is a list of tuples; one example tuple is shown in the box above. It includes the text that was parsed, its label (notice how different it is from spaCy’s labels), its id in DBpedia (this is very useful, as all associated information can be further retrieved via this link; we will cover more about this in future lectures and tutorials), and a score from 0 to 1 (the similarity score of the string matching; the higher it is, the more similar the target text is to the candidate).
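If you only care about spatial entities, one option is to filter on the DBpedia types returned by Spotlight. This is a sketch; it assumes the raw Spotlight result exposes an '@types' field, which may be missing or empty for some entities:
# Keep only entities whose DBpedia types include 'DBpedia:Place'
place_entities = [(ent.text, ent.kb_id_)
                  for ent in doc.ents
                  if 'DBpedia:Place' in (ent._.dbpedia_raw_result or {}).get('@types', '')]
place_entities[:5]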
Similar to what we did before, we can then transfer this list into a data frame. Note that since we do not have coordinates explicitly listed here, we can simply use a pandas data frame.
columns = ['Text', 'DBpedia Label', 'DBpedia URI', 'Similarity Score']
UK_physicalGeo_DBpedia = pd.DataFrame(entities_dbpedia, columns=columns)
UK_physicalGeo_DBpedia = UK_physicalGeo_DBpedia.drop_duplicates()
UK_physicalGeo_DBpedia
| Text | DBpedia Label | DBpedia URI | Similarity Score |
---|---|---|---|---|
0 | physical geography | DBPEDIA_ENT | http://dbpedia.org/resource/Physical_geography | 1.0 |
1 | UK | DBPEDIA_ENT | http://dbpedia.org/resource/United_Kingdom | 0.9999999697674871 |
2 | England | DBPEDIA_ENT | http://dbpedia.org/resource/England | 0.9999997958868998 |
3 | Lake District | DBPEDIA_ENT | http://dbpedia.org/resource/Lake_District | 1.0 |
4 | Moors | DBPEDIA_ENT | http://dbpedia.org/resource/Moorland | 1.0 |
5 | Exmoor | DBPEDIA_ENT | http://dbpedia.org/resource/Exmoor | 1.0 |
6 | Dartmoor | DBPEDIA_ENT | http://dbpedia.org/resource/Dartmoor | 1.0 |
7 | chalk | DBPEDIA_ENT | http://dbpedia.org/resource/Chalk | 0.999982297417919 |
8 | Scotland | DBPEDIA_ENT | http://dbpedia.org/resource/Scotland | 1.0 |
11 | Highland Boundary Fault | DBPEDIA_ENT | http://dbpedia.org/resource/Highland_Boundary_... | 1.0 |
12 | Scottish | DBPEDIA_ENT | http://dbpedia.org/resource/Scotland | 0.9997720847861457 |
13 | Helensburgh | DBPEDIA_ENT | http://dbpedia.org/resource/Helensburgh | 1.0 |
14 | Stonehaven | DBPEDIA_ENT | http://dbpedia.org/resource/Stonehaven | 1.0 |
15 | faultline | DBPEDIA_ENT | http://dbpedia.org/resource/Faultline_(musician) | 0.9838808091555019 |
17 | Central Lowlands | DBPEDIA_ENT | http://dbpedia.org/resource/Central_Lowlands | 1.0 |
18 | Southern Uplands | DBPEDIA_ENT | http://dbpedia.org/resource/Southern_Uplands | 1.0 |
19 | Wales | DBPEDIA_ENT | http://dbpedia.org/resource/Wales | 0.9999999999998863 |
22 | Northern Ireland | DBPEDIA_ENT | http://dbpedia.org/resource/Northern_Ireland | 0.999999996766519 |
23 | geography | DBPEDIA_ENT | http://dbpedia.org/resource/Geography | 0.9947401620497708 |
24 | Mourne Mountains | DBPEDIA_ENT | http://dbpedia.org/resource/Mourne_Mountains | 1.0 |
25 | Lough Neagh | DBPEDIA_ENT | http://dbpedia.org/resource/Lough_Neagh | 1.0 |
26 | geomorphology | DBPEDIA_ENT | http://dbpedia.org/resource/Geomorphology | 1.0 |
28 | climate change | DBPEDIA_ENT | http://dbpedia.org/resource/Global_warming | 0.9859722723298184 |
29 | glaciation | DBPEDIA_ENT | http://dbpedia.org/resource/Glacial_period | 0.9861545121455497 |
31 | British Isles | DBPEDIA_ENT | http://dbpedia.org/resource/British_Isles | 1.0 |
32 | Nevis | DBPEDIA_ENT | http://dbpedia.org/resource/River_Nevis | 0.9999978752268652 |
33 | Grampian Mountains | DBPEDIA_ENT | http://dbpedia.org/resource/Grampian_Mountains | 1.0 |
35 | River Severn | DBPEDIA_ENT | http://dbpedia.org/resource/River_Severn | 1.0 |
38 | lake | DBPEDIA_ENT | http://dbpedia.org/resource/Lake | 0.999999999890747 |
42 | Loch Ness | DBPEDIA_ENT | http://dbpedia.org/resource/Loch_Ness | 1.0 |
It seems the coordinates are not extracted by the tool. But if you open the URL, for example https://dbpedia.org/page/Lake_District, you will see full information about the place, including its geometry and more! See below for instance:
Again, even though spacy_dbpedia_spotlight has its advantages, as you can see there are still flaws in such a tool. For example, many non-spatial entities are also detected, and coordinates are not explicitly shown. Do you have any ideas for addressing these issues? (Hint: maybe check the library pyDBpedia, which helps you access the data shown in DBpedia.)
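For example, here is a sketch of looking up coordinates directly from DBpedia’s public JSON data endpoint (using requests rather than pyDBpedia; it assumes the resource exposes geo:lat/geo:long properties, which not every DBpedia resource does):
import requests

def dbpedia_coords(uri):
    # Fetch the resource's RDF data as JSON and look for WGS84 lat/long values
    data_url = uri.replace('/resource/', '/data/') + '.json'
    props = requests.get(data_url).json().get(uri, {})
    lat = props.get('http://www.w3.org/2003/01/geo/wgs84_pos#lat')
    lng = props.get('http://www.w3.org/2003/01/geo/wgs84_pos#long')
    if lat and lng:
        return lat[0]['value'], lng[0]['value']
    return None

dbpedia_coords('http://dbpedia.org/resource/Lake_District')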
Now, you can also try to use it on your extracted tweets, or other texts you get from the Internet.