Clustering blog post titles with unigrams

Published under the Coding category.

I was having a conversation yesterday with a reader about clustering news headlines according to similarity. This had me reflecting on some of my past experiments with clustering and sorting, where I have used word embeddings to find similar documents. Word embeddings encode the semantic similarity between documents, which allows for more nuanced calculations and clustering.

With that said, word embeddings take time to compute. The discussion made me think about what simpler methods could be used to cluster documents. I asked: what statistical methods could I employ to measure similarity?

I had an idea: what if documents were grouped by how many unique words (unigrams) their titles shared? My hypothesis was that titles contain a document's most important words, since they are written to be indicative of the document's contents.

I tested this method on my blog posts.

For the title Designing Aurora, a new static site generator, my algorithm returns:

Moving over to my own static site generator: 3
Adding dark mode to my static Jekyll site: 2
Implementing Incremental Static Regeneration in Aurora: 2

The results are ordered by how many unique words they share with the input title. The most related result shares three words: static site generator.

In this post, I am going to walk through the clustering algorithm I wrote, noting its strengths and limitations. I will then talk about how my initial algorithm, outlined in step #2, could be improved to be significantly faster.

Step #1: Processing input data

Natural language processing projects start with collecting input data, then processing that data to be in a consistent format. To collect my data, I extracted the titles from the markdown files in my blog repository. I then had a list of titles like this:

titles = ['Ask the Website Maker (Me!)', 'The Glint', 'Adding dark mode to my static Jekyll site', ...]
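If you want to do something similar, here is a rough sketch of the extraction step, assuming posts live in a posts/ directory and each file has a YAML title: line in its front matter (your repository layout may differ):

import glob

titles = []

# Assumed layout: markdown posts in posts/ with a "title:" line in their front matter.
for path in glob.glob("posts/*.md"):
    with open(path) as f:
        for line in f:
            if line.startswith("title:"):
                titles.append(line.removeprefix("title:").strip().strip('"'))
                break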

From the above titles, you can see that words appear in different capitalisations, and that some words contain symbols. To count the unique words shared between titles, we want to remove all punctuation and convert all the words to lowercase. This ensures that (Me!) and me are both treated as me, and that Website and website are both treated as website.

I applied the following steps to my input data:

import string

from nltk.corpus import stopwords

# Requires the NLTK stopword list: run nltk.download("stopwords") once if you have not already.
stopwords = set(stopwords.words("english"))

titles = sorted(titles)
original_titles = titles.copy()
titles = [title.lower().translate(str.maketrans("", "", string.punctuation)) for title in titles]
titles = [title.split(" ") for title in titles]
titles = [[word for word in title if word not in stopwords] for title in titles]
titles = [set(title) for title in titles]

The steps are as follows:

  1. Sort all titles alphabetically.
  2. Create a copy of the original titles, for use when we test our algorithm later.
  3. Convert all titles to lowercase and remove punctuation.
  4. Split all titles into a series of words (i.e. [‘ask’, ‘the’, ‘website’, ‘maker’, ‘me’]).
  5. Remove all stop words. These are common words in a language that are less likely to be useful in ascertaining the similarity of documents. For example, two titles should not be considered more similar just because they both contain the word the. The nltk stopword list, which is commonly used in natural language processing tasks, is what we use above.
  6. Convert all lists of title words, with stop words removed, into sets.

By the end of these preprocessing steps, we have documents like this:

[
	{'website', 'maker', 'ask'},
	{'glint'},
	{'dark', 'jekyll', 'mode', 'site', 'static', 'adding'}
]

Converting a list to a set automatically removes all duplicate elements.
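For example:

words = ["coffee", "coffee", "tips"]

print(set(words))  # {'coffee', 'tips'} (order may vary)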

Step #2: Implementing a similarity algorithm

With our documents prepared, we can now implement an algorithm to compute the most similar titles to all the titles in our dataset.

Here is the algorithm:

from collections import defaultdict

similar_titles = defaultdict(defaultdict)

for original_title, title in zip(original_titles, titles):
    for other_title, other_words in zip(original_titles, titles):
        if original_title == other_title:
            continue

        similar_titles[original_title][other_title] = len(title.intersection(other_words))

for title in similar_titles:
    similar_titles[title] = dict(sorted(similar_titles[title].items(), key=lambda x: x[1], reverse=True))

First, we create a defaultdict. This is a dictionary that automatically creates a value of the specified type whenever a key that doesn't exist is accessed. In this case, every time a new key is accessed in similar_titles, a new defaultdict is created as its value.
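A quick illustration of why this is convenient:

from collections import defaultdict

example = defaultdict(defaultdict)

# No KeyError: accessing a missing key creates an empty inner defaultdict automatically.
example["Designing Aurora, a new static site generator"]["Moving over to my own static site generator"] = 3

print(dict(example["Designing Aurora, a new static site generator"]))
# {'Moving over to my own static site generator': 3}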

We then iterate over every title set. Within this loop, we iterate over every title set again. We want to compare every title to every other title. This is called “pairwise comparison”.

For each title set in the outer loop, we calculate the number of words it shares with the title in the inner loop.

For example, suppose the title set in the outer loop is {'10', '2021', 'advent', 'bloggers', 'day'} and the title set in the inner loop is {'11', '2021', 'advent', 'bloggers', 'day'}. Four words are shared between these sets (“2021”, “advent”, “bloggers”, “day”).

We compute the number of shared words between two title sets using the set.intersection method. This returns the items that exist in both sets. We then take the length of the resulting set to find the total number of words shared between the two titles.
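Using the example sets from above:

title = {'10', '2021', 'advent', 'bloggers', 'day'}
other_words = {'11', '2021', 'advent', 'bloggers', 'day'}

shared = title.intersection(other_words)

print(shared)       # {'2021', 'advent', 'bloggers', 'day'} (order may vary)
print(len(shared))  # 4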

After we have found the number of shared words between all titles, we order the results by the number of shared words. The more shared words, the more similar the documents are.

Of note, the algorithmic complexity of the two loops is O(n^2), also called quadratic complexity. This is because we compare every title to every other title; if we add a new title, we need to compare it to every existing title. It took <1s to compare the similarity of 1080 documents on my MacBook, and 40s to compare the similarity of 10800 titles. Thus, while this algorithm works, it will be slow when run on tens of thousands of documents, let alone over 100,000.

When you see quadratic complexity, you should ask “could this be faster?” I will talk about a way the time complexity could be reduced in the next section. I wrote the algorithm above to test out my idea, then I realised the time complexity could be substantially improved.

Step #3: Testing the algorithm

Let’s test the algorithm. To run a test, we need to provide a title in the input data as the key to our similar_titles dictionary. We can then return the top five most similar titles:

results = similar_titles["How I decide what coffee to drink [Diagram]"]

for title, score in list(results.items())[:5]:
    print(f"{title}: {score}")

This code returns:

Breakfast and Coffee: A wiki for sharing food and drink recommendations: 2
Poll: How much coffee you drink in a day?: 2
Poll: How much coffee you drink in a day? [Results]: 2
Why I Drink Speciality Coffee: 2
A Beginner's Introduction to Grinding Coffee at Home: 1

The documents that share the most words with the input title are at the top. The results are similar to our input term!

Improving the algorithm

Reverse indices

To test the algorithm, we need to provide a title that already exists in the training set. This is because we have only computed the similarities between the titles we already have. With that said, there is another approach we could use that would allow greater flexibility and make our code run faster: create a reverse index that contains all of the words in all of the titles, and their associated documents. The index would look like this:

index = {
	"word": [document_name, document_name, document_name]
}
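Building such an index from the preprocessed title sets could look something like this (a rough sketch that reuses the original_titles and titles lists from step #1):

from collections import defaultdict

index = defaultdict(list)

# Map each word to the titles that contain it.
for original_title, words in zip(original_titles, titles):
    for word in words:
        index[word].append(original_title)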

Then, to find similar documents, we could take all words in a title and run them through our reverse index like this:

title = "coffee making tips"

To find documents that contain the same words as the title, we can use:

similar_titles = [index.get(word) for word in title.split(" ")]

From here, we can count how many times each document name appears across the lists for the words in our title; the more lists a document appears in, the more words it shares with the input title. This approach avoids the need to compare every title to every other title. Instead, we only consider the titles that share at least one word with the input title.

I will leave the code above incomplete as a challenge to you, the reader, to finish.

If you would like to learn more about reverse indices, I recommend reading my Build a search index in Python guide.

Using embeddings

With that said, this algorithm does not account for the semantic similarity of documents. Thus, a title like Coffee Making Tips may be scored as similar to Coding Tips because they share a word, despite the two being about different topics. This can be mitigated by using word embeddings, which encode the semantic meaning of texts. Embedding models available through Hugging Face Transformers are a good place to start to learn about using text embeddings.
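As a minimal sketch of what that could look like, assuming the sentence-transformers library (which wraps Hugging Face models) and the all-MiniLM-L6-v2 model as an example:

from sentence_transformers import SentenceTransformer, util

# A small, general-purpose embedding model; other models will give different results.
model = SentenceTransformer("all-MiniLM-L6-v2")

titles = ["Coffee Making Tips", "Coding Tips", "Why I Drink Speciality Coffee"]
embeddings = model.encode(titles)

# Cosine similarity between every pair of titles.
print(util.cos_sim(embeddings, embeddings))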

Conclusion

This was a fun exercise in thinking about the strengths and limits of statistical language processing. For my data, the titles from my blog, the technique above was effective in allowing me to find clusters of content. With that said, there are limits: the algorithm doesn't know the “meaning” of the words, so the same word used in different contexts is treated identically.

If you have any questions about this guide, feel free to send me an email at readers [at] jamesg [dot] blog.
