by jitender yadav, Jun 30, 2018, 12:07:47 PM | 2 minutes |

How to count the number of words in an HTML string and find the read time in Python 3

In this post we are going to learn how to count the number of words in a string that contains HTML tags, and how to estimate the read time of that string, in Python.

When a blog post or article is written in an HTML text editor, the editor produces a string with embedded HTML tags, and that string is saved in the database as is.

We need to show the reader the read time of a blog post or article, or the number of words in it. We can count the words in a string with HTML tags by first stripping the tags, as follows in Python.

We need HTMLParser from the html.parser module to strip the HTML tags, the math module for rounding, and the re module for regex operations.

from html.parser import HTMLParser
import math
import re

Then we need to create a class that subclasses HTMLParser:

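class MLStripper(HTMLParser):
    """Class for stripping HTML tags."""

    def __init__(self):
        super().__init__()  # initialise the parser's internal state
        self.convert_charrefs = True  # decode entities such as &amp; into text
        self.fed = []

    # HTMLParser calls this for every run of text between tags;
    # we collect each piece in self.fed
    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)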

Now write a function that takes an HTML string as input and returns the clean text without HTML tags:

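def strip_tags(html):
    s = MLStripper()
    s.feed(html)  # parse the HTML; text pieces are collected by handle_data
    s.close()     # flush any text still buffered by the parser
    return s.get_data()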
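
As a quick check of the stripping step, here is a minimal self-contained sketch that repeats the parser from above and runs it on a made-up snippet. The sample string and the name `sample_html` are assumptions for illustration only; they are not taken from a real post.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text content of an HTML string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        # Called by the parser for each run of text between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush anything the parser is still buffering
    return s.get_data()


# Hypothetical sample input
sample_html = "<h1>This is a title</h1>\n<p>Some <b>bold</b> text &amp; more.</p>"
clean_text = strip_tags(sample_html)
print(repr(clean_text))  # 'This is a title\nSome bold text & more.'
```

One thing to keep in mind: the text pieces are joined with no separator, so `<p>one</p><p>two</p>` comes out as `onetwo`. If the stored HTML has no whitespace between block elements, the word count can come out slightly low.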

Now write the functions for the word count and the read time:

def count_words(html_string):
    # html_string = """
    # <h1>This is a title</h1>
    # """
    word_string = strip_tags(html_string)
    count = len(word_string.split())  # with no argument, split() splits on any whitespace
    return count

def get_read_time(html_string):
    count = count_words(html_string)
    read_time_min = math.ceil(count / 200.0)  # assuming a reading speed of 200 wpm
    return int(read_time_min)
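
To make the rounding concrete, here is a small sketch of the same arithmetic applied to a few word counts. The helper `read_time_minutes` and the constant `WORDS_PER_MINUTE` are hypothetical names introduced only for this example; the 200 wpm figure is the same assumption as above.

```python
import math

WORDS_PER_MINUTE = 200  # assumed average reading speed


def read_time_minutes(word_count, wpm=WORDS_PER_MINUTE):
    """Hypothetical helper: round the read time up to whole minutes."""
    return math.ceil(word_count / wpm)


for n in (1, 200, 201, 450):
    print(n, "words ->", read_time_minutes(n), "min")
# 1 words -> 1 min
# 200 words -> 1 min
# 201 words -> 2 min
# 450 words -> 3 min
```

Because the result is rounded up, any post with at least one word shows at least 1 minute. An empty string gives 0, so you may want to handle that case separately when displaying it.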

We can also count words using a regex:

def count_words(html_string):
    # html_string = """
    # <h1>This is a title</h1>
    # """
    word_string = strip_tags(html_string)
    words = re.findall(r'\w+', word_string)  # each run of letters, digits or underscores
    count = len(words)
    return count
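
The two counting approaches do not always agree, because `\w+` treats apostrophes, hyphens and dots as separators while `split()` only breaks on whitespace. The sketch below compares them on a made-up sentence; the variable names are assumptions for illustration only.

```python
import re

# Hypothetical sample text, already stripped of HTML tags
text = "It's a well-known fact: 3.5 stars"

split_words = text.split()              # breaks on whitespace only
regex_words = re.findall(r'\w+', text)  # breaks on any non-word character

print(len(split_words), split_words)
# 6 ["It's", 'a', 'well-known', 'fact:', '3.5', 'stars']
print(len(regex_words), regex_words)
# 9 ['It', 's', 'a', 'well', 'known', 'fact', '3', '5', 'stars']
```

For a read-time estimate the difference is usually small, so either version should do; `split()` tends to be closer to what a reader would call a word.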

Hope this will help.


