Extracting Top Keywords from Text using Python and YAKE Library
Introduction
Keywords play an important role in text analysis, as they help us understand the key themes and topics covered in a piece of text. In this article, I’ll walk you through the process of extracting top keywords from a given input text using Python and the YAKE (Yet Another Keyword Extractor) library. I’ll also explain the code step by step and provide examples to illustrate the process.
Prerequisites
To follow along with this tutorial, you’ll need Python 3 installed on your system, along with the YAKE library. You can install YAKE with the following command (the leading ! is for Jupyter or Colab notebook cells; drop it when running the command in a terminal):
!pip install yake
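Before moving on, it can help to confirm that the install worked in the environment your code will run in. The snippet below is a minimal sketch that uses only the standard library; the helper name is_installed is my own and is not part of YAKE.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("yake"):
        print("YAKE is available")
    else:
        print("YAKE is missing; install it with: pip install yake")
```

Checking up front like this gives a clearer message than waiting for import yake to fail in the next step.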
Step 1: Importing the Required Libraries
To extract the top keywords from a given input text, we’ll be using the YAKE library. We’ll start by importing the library:
import yake
Step 2: Defining the Function
We’ll now define the function to extract the top keywords from the input text.
def extract_top_keywords(text, language="en", max_ngram_size=2, deduplication_threshold=0.1, deduplication_algo='seqm', window_size=1, num_of_keywords=5):
    # Reject non-string input early with a clear error.
    if not isinstance(text, str):
        raise ValueError("Input must be a string")
    # Configure the extractor; YAKE's own argument names are on the left.
    custom_kw_extractor = yake.KeywordExtractor(
        lan=language,
        n=max_ngram_size,
        dedupLim=deduplication_threshold,
        dedupFunc=deduplication_algo,
        windowsSize=window_size,
        top=num_of_keywords,
        features=None,
    )
    # extract_keywords returns (keyword, score) pairs; keep only the keyword text.
    keywords = custom_kw_extractor.extract_keywords(text)
    top_keywords = [kw[0] for kw in keywords]
    return top_keywords
Let’s go through the different parameters in the function:
- text: The input text from which we want to extract the top keywords.
- language: The language of the input text. The default is English (en), but you can change it to any other language supported by the YAKE library.
- max_ngram_size: The maximum size of the n-grams (sequences of words) considered when extracting keywords. The default is 2, which means the function can return single words and two-word phrases (unigrams and bigrams).
- deduplication_threshold: The similarity threshold for deduplicating keywords. A candidate whose similarity to an already selected keyword exceeds this threshold is dropped. The default is 0.1, a strict setting that keeps near-duplicate phrases out of the results.
- deduplication_algo: The string-similarity function used for deduplication. The default is seqm (sequence matcher). The other values listed in the YAKE documentation are leve (Levenshtein) and jaro.
- window_size: The size, in words, of the context window YAKE uses when looking at the words that appear around each candidate term. The default is 1, which means only the immediately adjacent words are considered.
- num_of_keywords: The number of top keywords to return. The default is 5.
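To make max_ngram_size and deduplication_threshold more concrete, here is a simplified sketch written with the standard library only. It is not YAKE's implementation: the helper names candidate_ngrams and deduplicate are hypothetical, difflib.SequenceMatcher stands in for the seqm option, and YAKE additionally handles stopwords, punctuation and scoring.

```python
from difflib import SequenceMatcher


def candidate_ngrams(words, max_ngram_size):
    """List every run of 1..max_ngram_size consecutive words."""
    ngrams = []
    for n in range(1, max_ngram_size + 1):
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams


def deduplicate(candidates, threshold):
    """Keep a candidate only if it is not too similar to one already kept.

    Candidates are assumed to arrive best-first, so the earlier phrase wins.
    """
    kept = []
    for candidate in candidates:
        too_similar = any(
            SequenceMatcher(None, candidate, earlier).ratio() > threshold
            for earlier in kept
        )
        if not too_similar:
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    words = "web development tools".split()
    print(candidate_ngrams(words, 2))
    ranked = ["data analysis", "data analyst", "web development"]
    print(deduplicate(ranked, threshold=0.8))
```

With a maximum size of 2, the first call lists the three single words followed by the two two-word phrases. In the second call, "data analyst" is dropped because it is too similar to "data analysis", which was kept first. Lowering the threshold makes the filter stricter, which is why a value like 0.1 removes most overlapping phrases.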
Step 3: Using the Function
Now that we have defined the function, let’s see how we can use it to extract the top keywords from an input text. Here’s an example:
input_text = "Python is a popular programming language used for a wide range of applications, from web development to data analysis. It is known for its simplicity and readability, making it a great choice for beginners and experts alike."
top_keywords = extract_top_keywords(input_text)
print(top_keywords)
['python', 'programming language', 'web development', 'data analysis', 'readability']
In this example, we passed the string input_text to the extract_top_keywords function, and it returned a list of the top 5 keywords in the input text.
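If you have more than one text, the same call can be applied in a loop. The sketch below assumes a hypothetical helper, keywords_per_document, that accepts the extraction function as an argument, so it works with extract_top_keywords from Step 2 or with any function of the same shape. The stand-in extractor first_words exists only to keep the snippet runnable without YAKE installed; it is not a keyword algorithm.

```python
def keywords_per_document(documents, extract_fn, **extract_options):
    """Map each document title to the keywords returned by extract_fn."""
    results = {}
    for title, text in documents.items():
        results[title] = extract_fn(text, **extract_options)
    return results


def first_words(text, num_of_keywords=5):
    """Stand-in extractor for the demo: returns the first few lowercased words."""
    return [word.strip(".,").lower() for word in text.split()[:num_of_keywords]]


if __name__ == "__main__":
    docs = {
        "python": "Python is a popular programming language.",
        "yake": "YAKE is a keyword extraction library.",
    }
    # Swap first_words for extract_top_keywords once YAKE is installed.
    print(keywords_per_document(docs, first_words, num_of_keywords=2))
```

Extra keyword arguments such as num_of_keywords or max_ngram_size are passed straight through, so each run can use different settings without changing the helper.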
Explanation of the Code
Let’s go through the code step by step to understand how it works:
- First, we import the yake library.
- We define the function extract_top_keywords, which takes an input text and various parameters, and returns a list of top keywords.
- In the function, we check whether the input text is a string. If not, we raise a ValueError.
- We initialize the custom_kw_extractor object with the specified parameters.
- We extract the keywords from the input text using the extract_keywords method of the custom_kw_extractor object.
- We take the keyword text from each entry in the list returned by the extract_keywords method, and return those as the list of top keywords.
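The last step is easy to miss, so here it is in isolation. The sketch assumes, as the function in Step 2 does, that each entry returned by extract_keywords is a (keyword, score) pair, and that a lower score means a more relevant keyword, which is how the YAKE documentation describes its scores. The sample pairs and their scores are invented for illustration and are not real YAKE output.

```python
# Made-up (keyword, score) pairs in the shape the Step 2 function expects.
sample_keywords = [
    ("programming language", 0.02),
    ("python", 0.05),
    ("web development", 0.09),
]


def keyword_texts(pairs):
    """Drop the scores and keep the keyword strings, as Step 2 does."""
    return [pair[0] for pair in pairs]


def best_first(pairs):
    """Sort pairs so the lowest (assumed most relevant) score comes first."""
    return sorted(pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(keyword_texts(sample_keywords))
    for keyword, score in best_first(sample_keywords):
        print(f"{score:.2f}  {keyword}")
```

If you want to show the scores next to the keywords, returning the pairs themselves instead of only the first element is a small change to the function.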
Conclusion
We’ve learned how to extract top keywords using the YAKE library. I’ve been using the same method in the tool https://lessentext.com.
You can see a live demo at https://lessentext.com: enter any URL and hit the Summarize button, and you will see the top 5 keywords at the top.
If you like the article, please clap a bunch!