Working with PDF files in Python

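00 Install the library
The examples below use the legacy PyPDF2 API (PdfFileReader, PdfFileWriter, getPage and so on). PyPDF2 3.0 removed these names in favor of PdfReader, PdfWriter and related names, so install an older release:
pip install "PyPDF2<3.0"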
01 Extract text
import PyPDF2
pdfobj1=open(r'D:\PDF_Samples\数学之美2.pdf','rb')
pdfobj2=open(r'D:\PDF_Samples\p19.pdf','rb')
pdffile1=PyPDF2.PdfFileReader(pdfobj1)
pdffile2=PyPDF2.PdfFileReader(pdfobj2)
print(pdffile1.numPages,pdffile2.numPages) # total page count of each PDF
345 19
pdffile1.getPage(5).extractText() # extract the text of one page (index 5 is the sixth page)
Out[12]: '' # empty result: extractText() currently has trouble with Chinese text
pdffile2.getPage(5).extractText() # the English PDF is extracted successfully
Out[13]: "National Hydro Network, Data Model \n\n \nEdition 1.0\n \n2004\n-\n06\n \nGeoBase®\n \n6\n \nMandatory\n \nOptional\n \nQuantity of phenomenon\n \nNumber of characteristics\n \n \n \nOptional\n \n1\n \nOverview\n \nThe data model can (and must) extend beyond the smallest common denominator obtained with the \npartners. The model must therefore
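The extracted string above comes back with words and numbers scattered across many short lines. A minimal sketch of one way to tidy such output with the standard library follows. The helper name tidy_extracted_text is an assumption for illustration and is not part of PyPDF2; it only normalizes whitespace and does not restore paragraphs or rejoin split tokens.

```python
def tidy_extracted_text(raw):
    """Collapse every run of spaces and newlines into a single space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return ' '.join(raw.split())


# A short excerpt of the output shown above
sample = "National Hydro Network, Data Model \n\n \nEdition 1.0\n \n2004\n-\n06\n \nGeoBase®\n \n6"
print(tidy_extracted_text(sample))
# National Hydro Network, Data Model Edition 1.0 2004 - 06 GeoBase® 6
```

Tokens that the extractor split apart, such as the date, stay separated by spaces; rejoining them would need rules specific to the document.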
02 Rotate a page
Rotate the first page of p1.pdf by 90 degrees clockwise and save the result as pp11.pdf:
import PyPDF2
pdfobj=open(r'D:\PDF_Samples\p1.pdf','rb')
pdffile=PyPDF2.PdfFileReader(pdfobj)
pdfpage=pdffile.getPage(0)
pdfpage.rotateClockwise(90) # rotate 90 degrees clockwise
pdffile2=PyPDF2.PdfFileWriter()
pdffile2.addPage(pdfpage)
pdfobj2=open(r'd:\PDF_Samples\pp11.pdf','wb')
pdffile2.write(pdfobj2)
pdfobj.close() # close the input only after the output has been written
pdfobj2.close()
03 Merge PDF files
Combine two PDF files into a single PDF file.
Here jpeg.pdf and p19.pdf are merged into glue.pdf:
import PyPDF2
pdfobj1=open(r'D:\PDF_Samples\jpeg.pdf','rb')
pdfobj2=open(r'D:\PDF_Samples\p19.pdf','rb')
pdffile1=PyPDF2.PdfFileReader(pdfobj1)
pdffile2=PyPDF2.PdfFileReader(pdfobj2)
pdffile3=PyPDF2.PdfFileWriter()
for i in range(pdffile1.numPages): # copy every page of the first file
    pdfpage=pdffile1.getPage(i)
    pdffile3.addPage(pdfpage)
for j in range(pdffile2.numPages): # then every page of the second file
    pdfpage=pdffile2.getPage(j)
    pdffile3.addPage(pdfpage)
pdfobj3=open(r'D:\PDF_Samples\glue.pdf','wb')
pdffile3.write(pdfobj3)
pdfobj1.close()
pdfobj2.close()
pdfobj3.close()
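For more than two inputs, the page-by-page loops become repetitive. The sketch below uses PdfFileMerger from the same legacy PyPDF2 API (releases before 3.0) to append whole files in order. The function name merge_pdfs is an assumption for illustration, and the import sits inside the function only to keep the helper self-contained.

```python
def merge_pdfs(input_paths, output_path):
    """Concatenate the PDFs in input_paths, in order, into output_path."""
    import PyPDF2  # legacy API, PyPDF2 releases before 3.0

    merger = PyPDF2.PdfFileMerger()
    try:
        for path in input_paths:
            merger.append(path)      # appends every page of that file
        with open(output_path, 'wb') as out:
            merger.write(out)
    finally:
        merger.close()               # releases the input files the merger opened


# Usage with the files from this section:
# merge_pdfs([r'D:\PDF_Samples\jpeg.pdf', r'D:\PDF_Samples\p19.pdf'],
#            r'D:\PDF_Samples\glue.pdf')
```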
04 Overlay pages
Overlay the content of one page, as a watermark, on the first page of another file:
import PyPDF2
pdfobj1=open(r'D:\PDF_Samples\p6.pdf','rb')
pdffile1=PyPDF2.PdfFileReader(pdfobj1)
pdfpage1=pdffile1.getPage(0)
pdfobj2=open(r'D:\PDF_Samples\watermark.pdf','rb')
pdffile2=PyPDF2.PdfFileReader(pdfobj2)
pdfpage2=pdffile2.getPage(0)
pdfpage1.mergePage(pdfpage2) # overlay the watermark page on the first page
pdffile3=PyPDF2.PdfFileWriter()
pdffile3.addPage(pdfpage1)
for i in range(1,pdffile1.numPages): # copy the remaining pages unchanged
    pdffile3.addPage(pdffile1.getPage(i))
pdfobj3=open(r'd:\PDF_Samples\merge.pdf','wb')
pdffile3.write(pdfobj3)
pdfobj1.close()
pdfobj2.close()
pdfobj3.close()
05 Encrypt a PDF file
Call encrypt() on the writer before writing the output file. The password used here is leslie.
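The step described above can be sketched as follows, using the same legacy PyPDF2 calls that appear elsewhere in this article. The function name encrypt_pdf and the input file name in the usage comment are assumptions for illustration; only the password and the output name leslie.pdf come from the article.

```python
def encrypt_pdf(src_path, dst_path, password):
    """Copy src_path to dst_path, protecting the copy with password."""
    import PyPDF2  # legacy API, PyPDF2 releases before 3.0

    with open(src_path, 'rb') as src:
        reader = PyPDF2.PdfFileReader(src)
        writer = PyPDF2.PdfFileWriter()
        for i in range(reader.numPages):
            writer.addPage(reader.getPage(i))
        writer.encrypt(password)       # set the password before write()
        with open(dst_path, 'wb') as dst:
            writer.write(dst)          # the source stays open until the pages are written


# Hypothetical usage; the input file name is an assumption:
# encrypt_pdf(r'D:\PDF_Samples\p6.pdf', r'D:\PDF_Samples\leslie.pdf', 'leslie')
```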
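06 Decrypt a PDF file
Opening an encrypted PDF file the usual way:
import PyPDF2
pdfobj=open(r'D:\PDF_Samples\leslie.pdf','rb')
pdffile1=PyPDF2.PdfFileReader(pdfobj)
pdffile1.getPage(0)
fails with an error, because the file has not been decrypted yet.
To decrypt it:
import PyPDF2
pdfobj=open(r'D:\PDF_Samples\leslie.pdf','rb')
pdffile1=PyPDF2.PdfFileReader(pdfobj)
pdffile1.decrypt('leslie') # supply the password
Out[35]: 1 # a return value of 1 means the password is correct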
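In the legacy PyPDF2 API, the integer returned by decrypt() distinguishes three cases: 0 for a wrong password, 1 for a match on the user password, and 2 for a match on the owner password. A small sketch of turning that code into a readable status follows; the helper name describe_decrypt_result is an assumption for illustration.

```python
def describe_decrypt_result(code):
    """Translate the integer returned by PdfFileReader.decrypt() into text."""
    meanings = {
        0: 'wrong password',
        1: 'user password accepted',
        2: 'owner password accepted',
    }
    return meanings.get(code, 'unexpected return value: %r' % (code,))


print(describe_decrypt_result(1))  # user password accepted
```

Checking for 0 before calling getPage() avoids running into the same error as an undecrypted file.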
07 Batch encryption
import PyPDF2,os
pdffiles=[]
for filename in os.listdir(r'D:\PDF_Samples'): # collect every PDF in the folder
    if filename.endswith('.pdf'):
        pdffiles.append(filename)
os.chdir(r'D:\PDF_Samples')
for pdfname in pdffiles:
    pdfobj=open(pdfname,'rb')
    pdffile1=PyPDF2.PdfFileReader(pdfobj)
    pdffile2=PyPDF2.PdfFileWriter()
    for i in range(pdffile1.numPages):
        page=pdffile1.getPage(i)
        pdffile2.addPage(page)
    pdffile2.encrypt('leslie') # encrypt before writing
    pdfobj2=open(pdfname+'_encrypted.pdf','wb') # e.g. p19.pdf -> p19.pdf_encrypted.pdf
    pdffile2.write(pdfobj2)
    pdfobj2.close()
    pdfobj.close()
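A minimal sketch of one way to name the outputs of a batch run so that the suffix lands before the extension, and to skip files produced by an earlier run. The _encrypted suffix follows the loop above; the helper names encrypted_name and needs_encrypting are assumptions for illustration.

```python
from pathlib import Path

SUFFIX = '_encrypted'


def encrypted_name(pdf_name):
    """Return e.g. 'p19_encrypted.pdf' for 'p19.pdf'."""
    p = Path(pdf_name)
    return str(p.with_name(p.stem + SUFFIX + p.suffix))


def needs_encrypting(pdf_name):
    """True for PDF files that are not themselves the output of an earlier run."""
    p = Path(pdf_name)
    return p.suffix.lower() == '.pdf' and not p.stem.endswith(SUFFIX)


names = ['p19.pdf', 'p19_encrypted.pdf', 'notes.txt', 'Scan.PDF']
for name in names:
    if needs_encrypting(name):
        print(name, '->', encrypted_name(name))
# p19.pdf -> p19_encrypted.pdf
# Scan.PDF -> Scan_encrypted.PDF
```

Comparing the lowercased extension also picks up files named with an uppercase .PDF, which a plain endswith('.pdf') check would miss.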

