Webpage Indexing with NetworkX Library and Beautiful Soup

Have you ever asked yourself how search engines work? How does Google work behind the scenes? You might have some idea about hyperlinks and how Google ranks web pages based on the links pointing to them. These concepts show up in all sorts of applications, from document search to Search Engine Optimization.

With the ease of app development these days, you might want to get your hands dirty with a webpage indexing project where you build a website's relevance based on how other pages link to it. You can do this with a directed graph built from data scraped from real web pages with BeautifulSoup.

In this tutorial, you will learn how to index web pages using the PageRank algorithm from the NetworkX Python library, which also provides the tools you need to build directed graphs. You will also use BeautifulSoup, a web scraping library written in Python, to parse HTML markup and extract the hyperlinks used to build the site index.

How PageRank indexing works

The PageRank algorithm, originally developed by Larry Page and Sergey Brin, co-founders of Google, forms the cornerstone of web page ranking and is fundamental to understanding search engine indexing. This algorithm employs a network graph structure to assess the importance of web pages based on the links connecting them. At its core, PageRank assigns each web page a numerical value, or "rank," which reflects its significance within the web of interconnected pages. The algorithm works by iteratively calculating these ranks based on the incoming links from other web pages. Pages with more incoming links from reputable sources are considered more authoritative and receive higher PageRank scores.
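
Concretely, the standard form of the PageRank formula is:

    PR(p) = (1 - d) / N + d * Σ_{q → p} PR(q) / L(q)

where the sum runs over every page q that links to p, d is a damping factor (NetworkX defaults to 0.85), N is the total number of pages, and L(q) is the number of outgoing links on page q. The scores are recomputed from this formula iteratively until they converge.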
So given three web pages, a.html, b.html, and c.html, where a.html links to b.html, b.html links to c.html, and c.html links back to b.html, we can immediately see that b.html will have the highest ranking because it has the most links pointing to it. We can illustrate the relationship between the pages as a directed graph as follows:

   a.html ------> b.html ------> c.html
                    ^               |
                    |               |
                    +---------------+

This interconnected structure is crucial for the PageRank algorithm, as it evaluates the flow of authority and relevance within the web of hyperlinks to determine the importance of each page. The PageRank algorithm is then run on this directed graph to obtain a list of rankings for each webpage.
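
If you want to see this ranking in action before doing any scraping, here is a minimal sketch that hard-codes the three-page graph above and asks NetworkX to rank it (the page names are just node labels at this point, not real files):

import networkx as nx

# The directed edges from the diagram: a -> b, b -> c, c -> b
graph = nx.DiGraph()
graph.add_edges_from([
    ("a.html", "b.html"),
    ("b.html", "c.html"),
    ("c.html", "b.html"),
])

# b.html should come out with the highest score
print(nx.pagerank(graph))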

Prerequisites

To follow along in this tutorial, you need to have the following prerequisites met:

  1. Python 3.7 or higher. You can install Python with your package manager or download it from the official Python website.

  2. Pip3 installed locally. Pip should come by default with your Python installation.

  3. Python3-venv, which should come with your Python installation. This might not be the case on Linux computers, and you can quickly install it on Debian-based distributions using the command: apt install python3-venv.

  4. A UNIX-like shell. You should be fine if you are on Mac or Linux. On Windows, I recommend you use Windows Subsystem for Linux (WSL) or Git Bash.

Implement the PageRank algorithm in Python with NetworkX

To implement this algorithm, create a new directory for this tutorial's project, change into it, and create a virtual environment by running the command below:

python3 -m venv .venv

After the virtual environment has been created, activate it by running the appropriate command for your OS as shown below:

source .venv/bin/activate      # macOS and Linux
source .venv/Scripts/activate  # Windows (Git Bash)

Now that the environment is activated, install the required packages using the command below. Recent releases of NetworkX implement pagerank on top of SciPy, so it is installed alongside NetworkX and BeautifulSoup:

pip install networkx beautifulsoup4 scipy

After these packages are installed successfully, you can create the required HTML files and the page-ranker script. Run the command below to create the HTML files:

mkdir webpages && \
cd webpages && \
touch a.html b.html c.html && \
cd ..

The command above creates a webpages directory and adds three empty HTML files to it. Next, create the script that will contain the scraper and the PageRank call using the command below:

touch main.py

With main.py created, paste the Python snippet below into it:

import os

import networkx as nx
from bs4 import BeautifulSoup

def simple_pagerank(directory: str) -> dict:
    """
    Extract links from all .html files in `directory`
    and return a dict of PageRank scores for the pages.
    """
    graph = nx.DiGraph()

    for filename in os.listdir(directory):
        # Only parse HTML files
        if not filename.endswith(".html"):
            continue

        # The `with` statement closes the file automatically
        with open(os.path.join(directory, filename)) as htmlfile:
            soup = BeautifulSoup(htmlfile, "html.parser")

        # Every <a href="..."> becomes a directed edge: this page -> linked page
        for anchor in soup.find_all("a", href=True):
            graph.add_edge(filename, anchor["href"])

    return nx.pagerank(graph)

print(simple_pagerank("webpages"))

The script above creates a directed graph using NetworkX's DiGraph class. It scans the directory given by the directory parameter for HTML files, parses each one with BeautifulSoup's "html.parser", and extracts the hyperlinks from the anchor tags. Each hyperlink becomes a directed edge in the graph, pointing from the file being parsed to the page it links to. Finally, the function passes the graph to NetworkX's pagerank method and returns the resulting dictionary of scores.
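
If you want to verify which edges the scraper picked up, you could temporarily add a print statement just before the return (a throwaway debugging line, not part of the final script):

print(list(graph.edges))
# e.g. [('a.html', 'b.html'), ('b.html', 'c.html'), ('c.html', 'b.html')] -- order may vary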

To see this algorithm work, you need to populate the HTML files with content. Open the a.html file and paste the following into it:

<!DOCTYPE html>
<html>
<head>
    <title>a.html</title>
</head>
<body>
    <h1>This is a.html</h1>
    <p><a href="b.html">Go to b.html</a></p>
</body>
</html>

Do the same for the b.html file:

<!DOCTYPE html>
<html>
  <head>
    <title>b.html</title>
  </head>
  <body>
    <h1>This is b.html</h1>
    <p><a href="c.html">Go to c.html</a></p>
  </body>
</html>

And finally, for the c.html file:

<!DOCTYPE html>
<html>
<head>
    <title>c.html</title>
</head>
<body>
    <h1>This is c.html</h1>
    <p><a href="b.html">Go back to b.html</a></p>
</body>
</html>

Now, run the main.py file with python main.py. On a successful run, you should see output like the following (the order of the keys may differ on your machine):

{'b.html': 0.48648582432442095, 'c.html': 0.46351417567557884, 'a.html': 0.05}
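
The returned dict is not guaranteed to be ordered by score. If you would rather print the pages from most to least important, an optional variation of the script's final line could look like this:

ranks = simple_pagerank("webpages")
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.4f}")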

Conclusion

In this tutorial, you learnt how to write a page-ranker script using the NetworkX library and BeautifulSoup. The knowledge you gained from this article can be applied to many other contexts, such as building an in-house document search index, building a custom ranking system for an online game, or building a Google competitor, lol.

If you found this tutorial helpful, please share it with your friends and colleagues. Thank you for reading. Arigato Gozaimasu!
