Extracting Top Keywords from Text using Python and YAKE Library
Keywords play an important role in text analysis, as they help us understand the key themes and topics covered in a piece of text. In this article, I’ll walk you through the process of extracting top keywords from a given input text using Python and the YAKE (Yet Another Keyword Extractor) library. I’ll also explain the code step by step and provide examples to illustrate the process.
To follow along with this tutorial, you’ll need Python 3 installed on your system, along with the YAKE library. You can install YAKE using the following command (the leading ! is for notebook cells such as Jupyter or Colab; drop it when running the command in a terminal):
!pip install yake
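If you want to confirm that the installation worked before continuing, a small check like the one below can help. It is a minimal sketch that uses only the standard library; the helper name is_installed is my own and is not part of YAKE.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("yake"):
        print("YAKE is available.")
    else:
        print("YAKE is missing; run `pip install yake` first.")
```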
Step 1: Importing the Required Libraries
To extract the top keywords from a given input text, we’ll be using the YAKE library. We’ll start by importing the library:
import yake
Step 2: Defining the Function
We’ll now define the function to extract the top keywords from the input text.
def extract_top_keywords(text, language="en", max_ngram_size=2, deduplication_threshold=0.1, deduplication_algo='seqm', window_size=1, num_of_keywords=5):
    # Reject anything that is not a string
    if not isinstance(text, str):
        raise ValueError("Input must be a string")
    try:
        custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, dedupFunc=deduplication_algo, windowsSize=window_size, top=num_of_keywords, features=None)
    except NameError:
        # The name yake is undefined if the import in Step 1 was skipped or failed
        raise ModuleNotFoundError("YAKE library not installed")
    # extract_keywords returns (keyword, score) pairs; keep only the keyword strings
    keywords = custom_kw_extractor.extract_keywords(text)
    top_keywords = [kw for kw, score in keywords]
    return top_keywords
Let’s go through the different parameters in the function:
text: This is the input text from which we want to extract the top keywords.
language: This parameter specifies the language of the input text. The default value is English (en), but you can change it to any other language supported by the YAKE library.
max_ngram_size: This parameter specifies the maximum size of the n-grams (sequences of words) that will be considered when extracting keywords. The default value is 2, which means that the function will extract both single words and two-word phrases (bigrams).
deduplication_threshold: This parameter specifies the threshold for deduplication of keywords. A candidate whose similarity to an already selected keyword is higher than this threshold is discarded as a duplicate. The default value is 0.1.
deduplication_algo: This parameter specifies the similarity function used for deduplication. The default value is seqm, short for sequence matcher. YAKE supports other similarity functions as well.
window_size: This parameter specifies the size of the window (in words) used to look at the words surrounding each term when computing its co-occurrence statistics. The default value is 1, which means that only directly adjacent words are considered.
num_of_keywords: This parameter specifies the number of top keywords to be extracted. The default value is 5.
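To make max_ngram_size and deduplication_threshold more concrete, here is a rough illustration using only the standard library. It is not YAKE's actual implementation: candidate_ngrams and deduplicate are hypothetical helper names of my own, and difflib.SequenceMatcher stands in for whatever similarity function YAKE uses internally. The idea is simply that candidates of one up to max_ngram_size words are generated, and a candidate is dropped when it is too similar to one that was already kept.

```python
from difflib import SequenceMatcher


def candidate_ngrams(text, max_ngram_size=2):
    """List every run of 1..max_ngram_size consecutive words (illustration only)."""
    words = text.lower().split()
    candidates = []
    for size in range(1, max_ngram_size + 1):
        for start in range(len(words) - size + 1):
            candidates.append(" ".join(words[start:start + size]))
    return candidates


def deduplicate(candidates, threshold=0.1):
    """Keep a candidate only if its similarity to every kept one is <= threshold."""
    kept = []
    for candidate in candidates:
        too_similar = any(
            SequenceMatcher(None, candidate, other).ratio() > threshold
            for other in kept
        )
        if not too_similar:
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    print(candidate_ngrams("web development with python", max_ngram_size=2))
    ranked = ["data analysis", "data", "analysis", "python"]
    print(deduplicate(ranked, threshold=0.9))  # lenient setting
    print(deduplicate(ranked, threshold=0.1))  # strict setting
```

In this sketch, a threshold of 0.1 drops nearly anything that shares characters with an already kept phrase, while 0.9 only removes near-duplicates. It is worth experimenting with this value on your own text.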
Step 3: Using the Function
Now that we have defined the function, let’s see how we can use it to extract the top keywords from an input text. Here’s an example:
input_text = "Python is a popular programming language used for a wide range of applications, from web development to data analysis. It is known for its simplicity and readability, making it a great choice for beginners and experts alike."
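top_keywords = extract_top_keywords(input_text)
print(top_keywords)
The output will look similar to this (the exact keywords, their order, and their capitalization can vary with the YAKE version):
['python', 'programming language', 'web development', 'data analysis', 'readability']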
In this example, we have passed a string input_text to the extract_top_keywords function, and it has returned a list of the top 5 keywords in the input text.
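One detail worth knowing: YAKE's extract_keywords method returns (keyword, score) pairs rather than bare strings, and a lower score indicates a more relevant keyword. The sketch below shows how such pairs can be turned into a plain list of keywords. The scores are invented for illustration and are not real YAKE output, and keywords_only is a hypothetical helper name.

```python
def keywords_only(scored_keywords, limit=5):
    """Return up to `limit` keyword strings, most relevant (lowest score) first."""
    ranked = sorted(scored_keywords, key=lambda pair: pair[1])
    return [keyword for keyword, _score in ranked[:limit]]


if __name__ == "__main__":
    # Made-up scores, only to show the shape of the data.
    sample = [
        ("web development", 0.12),
        ("python", 0.03),
        ("data analysis", 0.09),
    ]
    print(keywords_only(sample, limit=2))
```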
Explanation of the Code
Let’s go through the code step by step to understand how it works:
- First, we import the yake library.
- We define the function extract_top_keywords, which takes an input text and various parameters, and returns a list of top keywords.
- In the function, we check if the input text is a string. If not, we raise a ValueError.
- We initialize the custom_kw_extractor object with the specified parameters.
- We extract the keywords from the input text using the extract_keywords method of the custom_kw_extractor object.
- We extract the top keywords from the list of keywords returned by the extract_keywords method, and return the list of top keywords.
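Putting these steps together, one possible variant is sketched below. It differs from the function in Step 2 in one respect, which is my own assumption rather than something the original code does: the import itself is guarded, so a missing installation surfaces as the "YAKE library not installed" error when the function is called. The keyword-argument names passed to yake.KeywordExtractor are the ones used in this article; other YAKE releases may name them differently.

```python
try:
    import yake
except ModuleNotFoundError:
    yake = None  # remember that the import failed instead of crashing here


def extract_top_keywords(text, language="en", max_ngram_size=2,
                         deduplication_threshold=0.1,
                         deduplication_algo="seqm",
                         window_size=1, num_of_keywords=5):
    # Step 1: validate the input before doing any work.
    if not isinstance(text, str):
        raise ValueError("Input must be a string")
    # Step 2: fail with a clear message if YAKE could not be imported.
    if yake is None:
        raise ModuleNotFoundError("YAKE library not installed")
    # Step 3: configure the extractor with the chosen parameters.
    extractor = yake.KeywordExtractor(
        lan=language, n=max_ngram_size,
        dedupLim=deduplication_threshold, dedupFunc=deduplication_algo,
        windowsSize=window_size, top=num_of_keywords, features=None,
    )
    # Step 4: extract (keyword, score) pairs and keep only the keywords.
    return [keyword for keyword, _score in extractor.extract_keywords(text)]


if __name__ == "__main__":
    try:
        extract_top_keywords(12345)
    except ValueError as error:
        print(f"Rejected non-string input: {error}")
```

Because the type check comes first, passing a non-string raises ValueError whether or not YAKE is installed.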
We’ve learned how to extract top keywords using the YAKE library. I’ve been using the same method in the tool https://lessentext.com
To see a live demo, go to https://lessentext.com, enter any URL, and hit the Summarize button: you will see the top 5 keywords at the top.