I’ve got a lot of pdfs with password and with a huge text with my name on every page. To fix both issues, I wrote the small code below:


import pikepdf
import sys
import re

if len(sys.argv) < 2:
    print(f"syntax {sys.argv[0]} input-file.pdf\n")
    exit(1)

to_remove = ['TEXT TO REMOVE']
input_file = sys.argv[1]
output_file = f"clean_{input_file}"

with pikepdf.open(input_file, password='FILE-PASSWORD') as pdf:
    for page in pdf.pages:
        instructions = pikepdf.parse_content_stream(page)
        new_instructions = []
        for x in instructions:
            extracted = ''.join(m.group(1) for m in re.finditer(r'"([^"]+)"', str(x)))
            if not any(s in extracted for s in to_remove):
                new_instructions.append(x)

        new_content_stream = pikepdf.unparse_content_stream(new_instructions)
        page.Contents = pdf.make_stream(new_content_stream)
    pdf.save(output_file)

it worked for everything that I did try. Even multi-line text, just add more fields to the to_remove array.