by jitender yadav, Jun 30, 2018, 12:07:47 PM | 2 minutes |

How to count the number of words in an HTML string and find the read time in Python 3

In this blog we are going to learn how to count the number of words in a string that contains HTML tags, and how to estimate the read time of that string, in Python.

When you write a blog or article in an HTML text editor, the editor produces a string with embedded HTML tags, and that string is saved in the database as is.
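To see why the tags matter, here is a small sketch with a made-up HTML string (`raw_html` is a hypothetical sample, not output from any particular editor). Splitting the raw string on whitespace counts pieces of markup as words:

```python
# Hypothetical sample of what an HTML editor might store.
raw_html = '<h1 class="title">This is a title</h1>'

# Naive count: split on whitespace without removing the tags.
naive_count = len(raw_html.split())
print(naive_count)  # 5, although the visible text has only 4 words
```

The extra "word" comes from the tag attribute, which is why the tags are stripped before counting.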

We need to show the reader the read time of a blog/article, or the number of words in it. We can count the words in a string with HTML tags by first stripping the tags, as follows in Python:

We need the HTMLParser class from html.parser to strip the HTML tags, the math library for rounding, and re for the regex-related operations.

from html.parser import HTMLParser
import math
import re

Then we need to create a class that extends HTMLParser:

class MLStripper(HTMLParser):
    """Class for stripping HTML tags."""

    def __init__(self):
        super().__init__()  # set up the parser's internal state
        self.strict = False
        self.convert_charrefs = True
        self.fed = []

    # The parser calls this for each run of text between tags;
    # the text is collected in self.fed.
    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

Now write a function that takes an HTML string as input and returns a clean string of words without the HTML tags:

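def strip_tags(html):
    s = MLStripper()
    s.feed(html)  # parse the HTML; handle_data collects the text
    s.close()     # flush any text the parser is still buffering
    return s.get_data()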
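One thing to watch for: because the text chunks are joined with an empty string, text from adjacent elements such as `<p>One</p><p>two</p>` comes out as `Onetwo`, which would later count as a single word. Below is a minimal sketch of a variant that joins the chunks with a space instead. The names `SpaceJoinStripper` and `strip_tags_spaced` are hypothetical and not part of the code above.

```python
from html.parser import HTMLParser


class SpaceJoinStripper(HTMLParser):
    """Hypothetical variant that separates text chunks with spaces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        # Called for each run of text found between tags.
        self.fed.append(d)

    def get_data(self):
        # Join with a space so text from adjacent tags does not fuse together.
        return ' '.join(self.fed)


def strip_tags_spaced(html):
    s = SpaceJoinStripper()
    s.feed(html)
    s.close()
    return s.get_data()


print(strip_tags_spaced("<p>One</p><p>two</p>").split())  # ['One', 'two']
```

The trade-off is that a word split by inline markup, for example `<b>H</b>ello`, would then count as two words, so which join to use depends on the kind of HTML your editor produces.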

Write functions for the word count and the read time:

def count_words(html_string):
    # html_string = """
    # <h1>This is a title</h1>
    # """
    word_string = strip_tags(html_string)
    count = len(word_string.split())  # without any argument, split() splits on whitespace
    return count

def get_read_time(html_string):
    count = count_words(html_string)
    read_time_min = math.ceil(count / 200.0)  # assuming a reading speed of 200 wpm
    return int(read_time_min)
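To make the rounding concrete, here is a small sketch of the same arithmetic applied directly to word counts. The helper `read_time_for` and the constant `WORDS_PER_MINUTE` are hypothetical names used only for this illustration; the 200 wpm figure is the same assumption as above.

```python
import math

WORDS_PER_MINUTE = 200  # assumed average reading speed


def read_time_for(word_count, wpm=WORDS_PER_MINUTE):
    """Return the read time in whole minutes, rounded up."""
    return math.ceil(word_count / wpm)


for n in (0, 1, 200, 201, 450):
    print(n, "words ->", read_time_for(n), "min")
# 0 words -> 0 min
# 1 words -> 1 min
# 200 words -> 1 min
# 201 words -> 2 min
# 450 words -> 3 min
```

Rounding up means any non-empty text shows at least 1 minute, while an empty text shows 0.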

We can also count words using a regex:

def count_words(html_string):
    # html_string = """
    # <h1>This is a title</h1>
    # """
    word_string = strip_tags(html_string)
    words = re.findall(r'\w+', word_string)  # runs of letters, digits and underscores
    count = len(words)
    return count
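The two approaches do not always agree. As a quick sketch (the sample sentence is made up for illustration), `split()` treats anything between whitespace as one word, while `\w+` breaks words at apostrophes and hyphens:

```python
import re

text = "It's a well-known fact."

split_count = len(text.split())               # It's | a | well-known | fact.
regex_count = len(re.findall(r'\w+', text))   # It | s | a | well | known | fact

print(split_count)  # 4
print(regex_count)  # 6
```

For a read-time estimate the difference is usually small, but the `split()` version is closer to what a reader would call a word in text with contractions or hyphenated terms.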

Hope this will help.

