Extracting Top Keywords from Text using Python and YAKE Library

Apr 24, 2023

Introduction

Keywords play an important role in text analysis, as they help us understand the key themes and topics covered in a piece of text. In this article, I’ll walk you through the process of extracting top keywords from a given input text using Python and the YAKE (Yet Another Keyword Extractor) library. I’ll also explain the code step by step and provide examples to illustrate the process.

Prerequisites

To follow along with this tutorial, you’ll need Python 3 installed on your system, along with the YAKE library. You can install YAKE using the following command (the leading ! is notebook syntax for Jupyter or Colab; drop it when running in a regular terminal):

!pip install yake

Step 1: Importing the Required Libraries

To extract the top keywords from a given input text, we’ll be using the YAKE library. We’ll start by importing the library:

import yake
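If YAKE is not installed, the failure happens at the import line itself, with Python's generic error message. As a minimal sketch, you could wrap the import to get a friendlier hint. The helper name require_module is my own invention for this tutorial, not part of YAKE:

```python
import importlib


def require_module(module_name, install_hint=None):
    """Import a module by name, or raise a clearer error if it is missing.

    require_module is a hypothetical helper for this tutorial, not a YAKE API.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Only rewrite the message when the requested module itself is missing,
        # not when one of its own dependencies failed to import.
        if exc.name != module_name:
            raise
        hint = install_hint or f"pip install {module_name}"
        raise ModuleNotFoundError(
            f"{module_name} is not installed. Try: {hint}"
        ) from exc


# Usage (uncomment once YAKE is installed):
# yake = require_module("yake")
```

A plain import yake is enough for most scripts; a wrapper like this is only worth it when you want the install hint to reach the user.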

Step 2: Defining the Function

We’ll now define the function to extract the top keywords from the input text.

def extract_top_keywords(text, language="en", max_ngram_size=2, deduplication_threshold=0.1,
                         deduplication_algo="seqm", window_size=1, num_of_keywords=5):
    # Reject non-string input early.
    if not isinstance(text, str):
        raise ValueError("Input must be a string")

    # A missing yake install fails at "import yake", so no try/except is needed here.
    custom_kw_extractor = yake.KeywordExtractor(
        lan=language, n=max_ngram_size, dedupLim=deduplication_threshold,
        dedupFunc=deduplication_algo, windowsSize=window_size, top=num_of_keywords,
        features=None,
    )

    # extract_keywords returns (keyword, score) tuples.
    keywords = custom_kw_extractor.extract_keywords(text)

    # Keep only the keyword strings and drop the scores.
    top_keywords = [kw[0] for kw in keywords]

    return top_keywords

Let’s go through the different parameters in the function:

text: This is the input text from which we want to extract the top keywords.
language: This parameter specifies the language of the input text. The default value is set to English (en), but you can change it to any other language supported by the YAKE library.
max_ngram_size: This parameter specifies the maximum size of the n-grams (sequences of words) that will be considered when extracting keywords. The default value is set to 2, which means that the function will consider both single words and two-word phrases (bigrams).
deduplication_threshold: This parameter specifies the similarity threshold used for deduplication. A candidate whose similarity to an already selected keyword exceeds this threshold is dropped, so lower values filter near-duplicates more aggressively. The default value is set to 0.1.
deduplication_algo: This parameter specifies the string-similarity function used for deduplication. The default value is set to seqm, which is short for sequence matcher. YAKE also accepts jaro (Jaro similarity) and leve (Levenshtein similarity).
window_size: This parameter specifies the size of the context window (in words) around each term that YAKE uses when gathering co-occurrence statistics. The default value is set to 1, which means that only the immediately neighboring words are considered.
num_of_keywords: This parameter specifies the number of top keywords to be extracted. The default value is set to 5.
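To build some intuition for max_ngram_size and deduplication_threshold, here is a rough sketch using only the standard library. It is a simplification and not YAKE's actual implementation: the function names candidate_ngrams and deduplicate are my own, stopword handling and scoring are left out, and difflib's SequenceMatcher stands in for the seqm similarity.

```python
from difflib import SequenceMatcher


def candidate_ngrams(text, max_ngram_size=2):
    """List every contiguous word sequence of 1..max_ngram_size words."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_ngram_size + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


def deduplicate(candidates, threshold=0.1):
    """Keep a candidate only if no already-kept one is too similar to it.

    Illustrative only: similarity here is difflib's ratio (0.0 to 1.0).
    """
    kept = []
    for cand in candidates:
        too_similar = any(
            SequenceMatcher(None, cand, k).ratio() > threshold for k in kept
        )
        if not too_similar:
            kept.append(cand)
    return kept


print(candidate_ngrams("web development tools", max_ngram_size=2))
print(deduplicate(["web development", "web developments", "python"], threshold=0.9))
```

With max_ngram_size=2, the first call lists the three single words followed by the two bigrams. In the second call, "web developments" is dropped because it is more than 90% similar to "web development", while "python" is kept. Lowering the threshold makes the filter stricter.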

Step 3: Using the Function

Now that we have defined the function, let’s see how we can use it to extract the top keywords from an input text. Here’s an example:

input_text = "Python is a popular programming language used for a wide range of applications, from web development to data analysis. It is known for its simplicity and readability, making it a great choice for beginners and experts alike."

top_keywords = extract_top_keywords(input_text)

print(top_keywords)

['python', 'programming language', 'web development', 'data analysis', 'readability']

In this example, we have passed a string input_text to the extract_top_keywords function, and it has returned a list of the top 5 keywords in the input text.
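The function above throws away the scores, but they can be handy when you want to see how strong each keyword is. In YAKE, a lower score means a more relevant keyword. The sketch below keeps the scores; keywords_with_scores is a hypothetical helper of mine, and it works with any object that has an extract_keywords method returning (keyword, score) tuples.

```python
def keywords_with_scores(extractor, text):
    """Return (keyword, score) pairs sorted from most to least relevant."""
    pairs = extractor.extract_keywords(text)
    # YAKE assigns lower scores to more relevant keywords.
    return sorted(pairs, key=lambda pair: pair[1])


try:
    import yake
except ImportError:
    yake = None  # the helper still works with any compatible extractor

if yake is not None:
    extractor = yake.KeywordExtractor(lan="en", n=2, top=5)
    sample = (
        "Python is a popular programming language used for a wide range "
        "of applications, from web development to data analysis."
    )
    for keyword, score in keywords_with_scores(extractor, sample):
        print(f"{score:.4f}  {keyword}")
```

The exact scores depend on the text and the extractor settings, so treat them as a relative ranking rather than absolute values.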

Explanation of the Code

Let’s go through the code step by step to understand how it works:

First, we import the yake library.
We define the function extract_top_keywords that takes an input text and various parameters as input, and returns a list of top keywords.
In the function, we check if the input text is a string. If not, we raise a ValueError.
We initialize the custom_kw_extractor object with the specified parameters.
We extract the keywords from the input text using the extract_keywords method of the custom_kw_extractor object.
The extract_keywords method returns a list of (keyword, score) tuples. We keep only the keyword from each tuple and return the resulting list of top keywords.

Conclusion

We’ve learned how to extract top keywords using the YAKE library. I use the same method in the tool at https://lessentext.com.

To see a live demo, go to https://lessentext.com, enter any URL, and hit the Summarize button. You will see the top 5 keywords at the top.

If you like the article, please clap a bunch!