Pulling Data from a Wikipedia Table: A Sample
I was researching some data sets and needed to pull population data from Wikipedia. I looked at several articles on doing this, and they all seemed either very basic or out of date. I had already written an article on the Wikipedia Python package, so I decided to write one on pulling table data out of a Wikipedia page. The packages I used were urllib and Beautiful Soup. I could have done this with other packages (requests comes to mind), but this approach is very simple and it works.
Many Wikipedia pages use tables to display data, and in most cases the table you want is either the only one on the page or has a unique ID. Unfortunately, the "List_of_states_and_territories_of_the_United_States_by_population" page has one table for state rankings and another for rankings by region. So I needed to pull data only from the specific table I was after, which turned out not to be too hard.
So to start, how do you pull data from a Wikipedia page? urllib works for me, so that's what I used.
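A minimal sketch of this step might look like the following. The function name fetch_soup, the URL constant (built from the page name above with the usual en.wikipedia.org/wiki/ prefix), and the choice of Python's built-in html.parser are assumptions for illustration, not the original code.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumption: the page name from the article appended to the standard
# English Wikipedia article prefix.
URL = (
    "https://en.wikipedia.org/wiki/"
    "List_of_states_and_territories_of_the_United_States_by_population"
)


def fetch_soup(url=URL):
    """Download a page and return it parsed as a BeautifulSoup object."""
    # urlopen returns a response object; read() gives the raw HTML bytes.
    with urlopen(url) as response:
        html = response.read()
    # "html.parser" ships with Python, so no extra parser package is needed.
    return BeautifulSoup(html, "html.parser")
```

Calling fetch_soup() with no arguments would download the population page; passing a different URL reuses the same function for another article.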
So we import urlopen from urllib.request and BeautifulSoup from bs4, set up the URL, and then open it and read in the content. The next step is to hand that content to BeautifulSoup. Quick note: I have hardcoded some things and pulled the code out of the class it was in, for simplicity's sake. The next code snippet reads through the Beautiful Soup (BS) object and parses the data for me.
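Here is a rough sketch of that parsing step, run against a small inline HTML sample instead of the live page so it works offline. The sample markup, its three-column layout, and the helper name parse_rows are assumptions; the real Wikipedia table has more columns, so the cell indexes would need adjusting.

```python
from bs4 import BeautifulSoup

# Stand-in for the Wikipedia page: two tables, only the first of which
# carries both the "wikitable" and "sortable" classes.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>State</th><th>2019 estimate</th><th>2010 census</th></tr>
    <tr><td>District of Columbia</td><td>705,749</td><td>601,723</td></tr>
    <tr><td>U.S. Virgin Islands</td><td>106,235</td><td>106,405</td></tr>
  </tbody>
</table>
<table class="wikitable">
  <tbody>
    <tr><td>Some region</td><td>1</td><td>2</td></tr>
  </tbody>
</table>
"""


def parse_rows(soup):
    """Return one list of cell strings per data row of the selected table."""
    # CSS selector: <table> elements with both classes, then their body rows.
    rows = soup.select("table.wikitable.sortable tbody tr")
    results = []
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            results.append(cells)
    return results


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
states = parse_rows(soup)
for name, estimate, census in states:
    print(name, estimate, census)
print("Total States and Territories found:", len(states))
```

The second table in the sample is ignored because it lacks the sortable class, which is the same idea used to single out one table when a page has several.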
The line that calls select is where most of the magic happens. Most tables on Wikipedia use the table classes wikitable and sortable. BS has a great method called select that takes a CSS selector: you start with an HTML tag (table in this case) and then narrow it down to the exact table you want using CSS class selectors, which explains the .wikitable.sortable part. Then we tell BS to filter further using the tbody and tr tags found inside the table we want. The rest is just a simple for loop that goes through the table rows and pulls out the state name and population statistics I needed. The result is the data shown below.
| State | 2019 Population Estimate | 2010 Census Population | Percent of US Population |
| --- | --- | --- | --- |
| District of Columbia | 705,749 | 601,723 | 0.21% |
| U.S. Virgin Islands | 106,235 | 106,405 | 0.03% |
| Northern Mariana Islands | 51,433 | 53,883 | 0.02% |
Total States and Territories found: 56