Gippius Part 1: Creating a Python Parser with Beautiful Soup and MySQL Integration

Anna Chernysheva
6 min read · Nov 21, 2023

Zinaida Gippius is a prominent poet and writer of the Silver Age of Russian Symbolism, known for her remarkable contribution to literature. Her poems explore a wide range of emotional questions and offer insightful answers to them. As a native Russian speaker, I find her work particularly elegant and well suited to my aesthetic preferences.

In this article, we walk through the process of scraping the URL paths of Zinaida Gippius’s poems from the website ollam.ru. Our objective is to build a comprehensive collection of her works for further analysis.

1. About the Web Page

After spending some time with the search engines, we found the website that offers the most comprehensive collection of Gippius’s poems, ollam.ru, and proceeded to parse it.

There’s a list of 448 poem pages.

Additionally, the web page includes internal linking, allowing users to filter pages based on poem length and collections.

To ensure proper data scraping, it’s important to consider pagination. Upon manual counting, we identified a total of 8 pages in the main section of the gippius-zinaida category.
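
Instead of counting by hand, the page count can also be read from the pagination links themselves. Below is a minimal sketch of such a check; it relies on the observation (also used by the retrieve_urls function later in this article) that on ollam.ru the pagination numbers sit in plain <a> tags with an empty class attribute, which is an assumption about the site’s markup.

#count pagination pages from the numbered links on the first page
from bs4 import BeautifulSoup
import requests

html = requests.get('https://ollam.ru/classic/rus/gippius-zinaida/').text
soup = BeautifulSoup(html, 'html.parser')

last_page = 1
for link in soup.find_all('a', class_=''):
    if link.get_text().isdigit():
        last_page = max(last_page, int(link.get_text()))

print(last_page)  #should match the manual count of 8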

2. Scraping URLs

For this project, we will use the Beautiful Soup library in combination with the requests module to perform parsing.

from bs4 import BeautifulSoup
import requests
import pandas as pd

Now, let’s develop an algorithm that scrapes all the poem links from the main pages of the section (leaving the filter pages for later) and saves them to the dataframe poems_url_df.

Step 1. Start by creating a list of pagination:

url_pages = ['https://ollam.ru/classic/rus/gippius-zinaida/']
for number in range(1, 9):
    url_pages.append(f'https://ollam.ru/classic/rus/gippius-zinaida?page={number}')

As we already mentioned, the section is paginated. The code iterates through range(1, 9), which yields the numbers 1 to 8, and appends each number to the URL path as page={number}. Together with the main page URL, this gives a list of nine entries covering all eight pages of the section.
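
A quick check that the list came out as expected:

#inspect the pagination list
print(len(url_pages))  #9 entries: the main page plus ?page=1 to ?page=8
print(url_pages[-1])   #https://ollam.ru/classic/rus/gippius-zinaida?page=8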

Step 2. Retrieve the HTML structure of all these pages and create a soup out of them:

#get htmls
htmls = []
for page in url_pages:
    htmls.append(requests.get(page).text)

#create soup
soups = []
for html in htmls:
    soups.append(BeautifulSoup(html, 'html.parser'))
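
If the site responds slowly or starts throttling, a slightly more defensive version of the same loop can be used instead. This is an optional sketch: it raises on error status codes and pauses between requests, and the one-second delay is an arbitrary choice.

#optional: fetch pages with error checking and a polite pause
import time

htmls = []
for page in url_pages:
    response = requests.get(page)
    response.raise_for_status()  #stop early if the server returns an error
    htmls.append(response.text)
    time.sleep(1)                #arbitrary pause to avoid hammering the site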

Step 3. The next step is to extract all <a href> tags from the soups, keep the poem links, and store them in the poems_urls list, prepending the missing base path ‘https://ollam.ru’.

#collect the <a href> tags from every soup
ahrefs = []
for soup in soups:
    ahrefs.append(soup.find_all('a', href=True))

#create the list
poems_urls = []
for ahref in ahrefs:
    urls = [tag['href'] for tag in ahref if 'classic' in tag['href']]
    for url in urls:
        poems_urls.append(f'https://ollam.ru{url}')

#see the result
print(f"The number of ahref elements: \033[43m{len(poems_urls)}\033[0m")
poems_urls

If you are interested in how to highlight code output in Python and other print() techniques, check it out here.

Step 4. The poems_urls list now contains every URL collected from the main pages of the section.

Our next task is to separate the filter pages and scrape all the poem URLs from them. In the scraped list, the filter links come right after the bio URL and occupy indexes [5:20].

#save filters in the list
filtered = poems_urls[5:20]

#clean poems_urls from everything but poem urls
todrop = poems_urls[:20]
single = [url for url in poems_urls if url not in todrop]
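
A quick sanity check on the split:

#how many filter links and how many poem links did we get?
print(len(filtered))  #15 filter pages (indexes 5 to 19)
print(len(single))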

Step 5. To store the URLs, let’s create a dataframe called poems_url_df with the columns ‘url’ and ‘filter’. Since the single list contains only the main-page poem links, the ‘filter’ column is set to ‘No’ for all of them.

#create dataframe with the main-page poem urls
poems_url_df = pd.DataFrame({'url': single, 'filter': 'No'})
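
A quick look at what we have so far:

#inspect the dataframe
print(poems_url_df.shape)
poems_url_df.head()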

Step 6. Although the main section claims to list all the poems, we still need to make sure that the filter pages do not contain extra ones. For this reason, we define the retrieve_urls function:

#define function
def retrieve_urls(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    ahrefs = soup.find_all('a', href=True)
    h1_elements = soup.find_all('h1')

    #the <h1> heading of the filter page is used as the filter name
    h1_content = [h1.get_text().strip() for h1 in h1_elements][0]
    urls = [tag['href'] for tag in ahrefs if 'classic' in tag.get('href', '')]

    poems_urls = [f'https://ollam.ru{path}' for path in urls]
    poems_urls = poems_urls[5:]

    for poem_url in poems_urls:
        poems_url_df.loc[len(poems_url_df.index)] = [poem_url, h1_content]

    #check if there is pagination
    pagination_links = soup.find_all('a', class_='')

    last_page = 1
    for link in pagination_links:
        if link.get_text().isdigit():
            page = int(link.get_text())
            if page > last_page:
                last_page = page

    #process each page of pagination
    if last_page > 1:
        for page_num in range(2, last_page + 1):
            page_url = f"{url}?page={page_num}"
            page_html = requests.get(page_url).text
            page_soup = BeautifulSoup(page_html, 'html.parser')
            page_ahrefs = page_soup.find_all('a', href=True)
            page_urls = [tag['href'] for tag in page_ahrefs if 'classic' in tag.get('href', '')]

            page_poems_urls = [f'https://ollam.ru{path}' for path in page_urls]
            page_poems_urls = page_poems_urls[5:]

            for poem_url in page_poems_urls:
                poems_url_df.loc[len(poems_url_df.index)] = [poem_url, h1_content]

#call the retrieve_urls function for each URL in the 'filtered' list
for url in filtered:
    retrieve_urls(url)

It scrapes all the <a href> elements from the HTML soups of all URLs from the list of filters, taking into account pagination within the filter sections. The scraped URLs are then saved in the ‘url’ column of the poems_url_df dataframe, with the respective filter names stored in the ‘filter’ column. The filter names are represented by the h1_content values.
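
Since the whole point of crawling the filters is to catch poems missing from the main list, it is worth checking whether they actually added anything new. A small sketch, assuming the dataframe columns are filled exactly as described above:

#urls that appear under a filter but not in the main-page list
main_urls = set(poems_url_df.loc[poems_url_df['filter'] == 'No', 'url'])
filter_urls = set(poems_url_df.loc[poems_url_df['filter'] != 'No', 'url'])

extra = filter_urls - main_urls
print(f"Poems found only through filters: {len(extra)}")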

Step 7. Some of the URLs in the dataframe return a 404 response code. Hence, it is essential to iterate through all of them and drop the ones the server responds to with 404:

#drop 404 function
def drop_404(dataframe):
    for index, row in dataframe.iterrows():
        url = row['url']
        response = requests.get(url)
        if response.status_code == 404:
            dataframe.drop(index, inplace=True)

#call the drop_404 function with the dataframe as an argument
drop_404(poems_url_df)
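
Checking several hundred URLs one by one is slow. An optional variant (the name drop_404_with_session is hypothetical) reuses a single requests.Session so that the connection is not reopened for every request:

#optional: the same check with a shared session for connection reuse
def drop_404_with_session(dataframe):
    with requests.Session() as session:
        for index, row in dataframe.iterrows():
            if session.get(row['url']).status_code == 404:
                dataframe.drop(index, inplace=True)

#can be called instead of drop_404 above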

3. Sending Data to a MySQL Database

The final step of this stage is to store the collected URLs in the database.

Step 1. Establish a connection to the MySQL server:

#connect to MySQL
import mysql.connector

cnx = mysql.connector.connect(
    host='💚',
    user='🧡',
    password='💜',
    database='gippius'
)

cursor = cnx.cursor()

Step 2. Create the ‘poems_url’ table with a ‘url’ column:

#create the 'poems_url' table with a 'url' column
create_table_query = '''
CREATE TABLE poems_url (
    url VARCHAR(255)
)
'''
cursor.execute(create_table_query)
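
To double-check that the table now exists, we can simply list the tables in the gippius database:

#confirm the table was created
cursor.execute("SHOW TABLES")
print(cursor.fetchall())  #should include ('poems_url',)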

Step 3. Insert the ‘url’ column from poems_url_df into the MySQL table.


#define the query
urls = poems_url_df['url']
insert_query = "INSERT INTO poems_url (url) VALUES (%s)"

#iterate over the URLs and execute the INSERT query
for url in urls:
    cursor.execute(insert_query, (url,))

#commit the changes to the database
cnx.commit()

#close the cursor and the database connection
cursor.close()
cnx.close()
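
For a few hundred rows the loop above works fine, but mysql.connector can also send all the rows in a single call with executemany(). A minimal sketch of the same insert, shown as an alternative to the per-row loop (it would run before committing and closing):

#alternative: insert all urls in one call
rows = [(url,) for url in poems_url_df['url']]
cursor.executemany("INSERT INTO poems_url (url) VALUES (%s)", rows)
cnx.commit()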

Now, we have obtained the URLs for all of Zinaida Gippius’s poem pages. In the following part, we will extract all the poems.

Thanks for reading!


Anna Chernysheva

SEO Analyst and Data Scientist. Specializing in linguistic tasks.