VTT Cleanup

Now that people are working remotely more and more, there are a ton of meetings that conflict with others on my calendar, so I can't attend them all.

What works for me to keep up to date is watching the meeting recordings, which I can play at, say, 1.5x speed to save some time.

Microsoft Stream has a nifty feature where it creates transcriptions. The web page makes the transcript available from settings, and failing that you can use the Network tab of your browser's developer tools (F12) to get the URL of the transcript as it downloads.

The file is a bit messy to read with the naked eye, so today's entry adds to the Python text series by introducing regular expressions!

import argparse
import re

guid_re = re.compile(r'^[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{12}$')  # e.g. 01234567-b626-4ef4-b0d1-917881a3d172
timespan_re = re.compile(r'^[0-9]+:[0-9]+.*-->.*[0-9]+:[0-9]+.*$')  # e.g. 00:00:09.079 --> 00:00:13.291

def main_with_args(args):
  line_count = 0
  content_count = 0
  with open(args.infile) as f:
    for line in f:
      line_count += 1
      line = line.strip()
      # Skip blank lines, the WEBVTT header, NOTE comments, cue GUIDs and timestamps.
      if line == "" or line.startswith("WEBVTT") or line.startswith("NOTE ") or guid_re.match(line) or timespan_re.match(line):
        continue
      content_count += 1
      print(line)
  print("Found {} content lines of {} in {}".format(content_count, line_count, args.infile))

def main():
  parser = argparse.ArgumentParser(description="VTT Cleanup")
  parser.add_argument("infile")
  args = parser.parse_args()
  main_with_args(args)

if __name__ == '__main__':
  main()
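
To see how the two patterns sort lines into noise and content without needing a transcript file, here is a small sketch. The `is_noise` helper and the sample lines are made up for illustration; only the GUID and timestamp strings come from the comments in the script.

```python
import re

# Same patterns as in the script, written as raw strings.
guid_re = re.compile(r'^[0-9a-zA-Z]{8}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{4}-[0-9a-zA-Z]{12}$')
timespan_re = re.compile(r'^[0-9]+:[0-9]+.*-->.*[0-9]+:[0-9]+.*$')

def is_noise(line):
    """Hypothetical helper: True for lines the cleanup script would skip."""
    line = line.strip()
    return (line == "" or line.startswith("WEBVTT") or line.startswith("NOTE ")
            or bool(guid_re.match(line)) or bool(timespan_re.match(line)))

# Made-up sample; a real transcript will have many more cues.
sample = [
    "WEBVTT",
    "",
    "01234567-b626-4ef4-b0d1-917881a3d172",
    "00:00:09.079 --> 00:00:13.291",
    "Thanks everyone for joining.",
]

kept = [line for line in sample if not is_noise(line)]
print(kept)  # ['Thanks everyone for joining.']
```

Pulling the check into its own function also makes it easy to try a suspicious line in the interpreter when something slips through.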

More details on syntax and flags, such as case-insensitive matching, are available in the documentation for Python's re module.
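
As a quick taste of flags, here is a hedged sketch of case-insensitive matching. The `hex_guid_re` pattern is a hypothetical stricter variant of the GUID pattern (hex digits only), not something the script above uses; `re.IGNORECASE` lets the lowercase character class match uppercase letters too.

```python
import re

# Hypothetical variant: hex digits only, with letter case handled by a flag.
hex_guid_re = re.compile(
    r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$',
    re.IGNORECASE,
)

lower = bool(hex_guid_re.match("01234567-b626-4ef4-b0d1-917881a3d172"))
upper = bool(hex_guid_re.match("01234567-B626-4EF4-B0D1-917881A3D172"))
not_hex = bool(hex_guid_re.match("zzzzzzzz-b626-4ef4-b0d1-917881a3d172"))

print(lower, upper, not_hex)  # True True False
```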

Happy text cleanups!

Tags: coding, python
