In this tutorial, we’re gonna look at way to use python-docx
module to read, write Word docx files in Python program.
Word documents
Word .docx
file has more structures than plain text. With python-docx
module, we have 3 different data types:
– a Document
object for entire document.
– Paragraph
objects for the paragraphs inside Document
object.
– Each Paragraph
object contains a list of Run
objects.
Read/Write Word docx files in Python
Install python-docx module
Open cmd, then run:
pip install python-docx
Once the installation is successful, we can see docx
folder at Python\Python[version]\Lib\site-packages
.
(In this tutorial, we use python-docx 0.8.10
)
Now we can import the module by running import docx
.
Read docx file
Open file
We call docx.Document()
function and pass the filename to open a docx file under a Document
object.
>>> import docx
>>> gkzDoc = docx.Document('ozenero.docx')
Get paragraphs
Document
object has paragraphs
attribute that is a list of Paragraph
objects.
>>> gkzDoc = docx.Document('ozenero.docx')
>>> len(gkzDoc.paragraphs)
4
>>> gkzDoc.paragraphs[0].text
'JavaSampleApproach.com was the predecessor website to ozenero.com.'
>>> gkzDoc.paragraphs[1].text
'In this brandnew site, we don\u2019t only focus on Java & Javascript Technology but also approach to other technologies & frameworks, other fields of computer science such as Machine Learning and Testing. All of them will come to you in simple, feasible, practical and integrative ways. Then you will feel the connection of everything.'
>>> gkzDoc.paragraphs[2].text
'What does ozenero mean?'
>>> gkzDoc.paragraphs[3].text
'Well, ozenero is derived from the words grok and konez.'
Get full-text
To get full-text of the document, we will:
- open the Word document
- loop over all Paragraph
objects and then appends their text
>>> import docx
>>> gkzDoc = docx.Document('ozenero.docx')
>>> fullText = []
>>> for paragraph in doc.paragraphs:
... fullText.append(paragraph.text)
...
>>> fullText
[
'JavaSampleApproach.com was the predecessor website to ozenero.com.',
'In this brandnew site, we don\u2019t only focus on Java & Javascript Technology but also approach to other technologies & frameworks, other fields of computer science such as Machine Learning and Testing. All of them will come to you in simple, feasible, practical and integrative ways. Then you will feel the connection of everything.',
'What does ozenero mean?',
'Well, ozenero is derived from the words grok and konez.'
]
# add 2 lines between paragraphs and merge them all
>>> '\n\n'.join(fullText)
Write docx file
Create Word document
We call docx.Document()
function to get a new, blank Word Document
object.
>>> import docx
>>> gkzDoc = docx.Document()
>>> gkzDoc
Save Word Document
When everything's done, we must use save('filename.docx')
Document's method with filename to save the Document
object to a file.
>>> gkzDoc.save('ozenero.docx')
Add Paragraphs
We use add_paragraph()
Document's method to add a new paragraph and get a reference to this Paragraph
object.
>>> gkzDoc.add_paragraph('ozenero.com for developers!')
We can add text to the end of an existing Paragraph
object using Paragraph’s add_run(text)
method.
>>> para1 = gkzDoc.add_paragraph('1- Python Tutorials')
>>> para2 = gkzDoc.add_paragraph('2- Tensorflow Tutorials')
>>> para1.add_run(' - Basics')
Result in ozenero.docx:
To make text styled, we add Run
with text attributes.
>>> para2.add_run(' - ')
>>> para2.add_run('Machine Learning').bold = True
>>> para2.add_run(' Tutorials').italic = True
# both bold and italic
>>> para3 = gkzDoc.add_paragraph()
>>> runner = para3.add_run('3- Big Data Tutorials')
>>> runner.bold = True
>>> runner.italic = True
>>> gkzDoc.save('ozenero.docx')
Result in ozenero.docx:
These are some text attributes:
- bold
- italic
- underline
- strike
(strikethrough)
- double_strike
(double strikethrough)
- all_caps
(capital letters)
- shadow
- outline
- rtl
(right-to-left)
Add Headings
We use Document's add_heading(heading, i)
method to add a paragraph with heading style with i
argument from 0
to 9
for heading levels.
>>> gkzDoc.add_heading('ozenero 1', 1)
>>> gkzDoc.add_heading('ozenero 2', 2)
>>> gkzDoc.add_heading('ozenero 3', 3)
>>> gkzDoc.add_heading('ozenero 4', 4)
>>> gkzDoc.add_heading('ozenero 5', 5)
>>> gkzDoc.add_heading('ozenero 6', 6)
>>> gkzDoc.add_heading('ozenero 7', 7)
>>> gkzDoc.add_heading('ozenero 8', 8)
>>> gkzDoc.add_heading('ozenero 9', 9)
Add Line Breaks, Page Breaks
Instead of starting a new paragraph, we can add a line break using Run object add_break()
method on the one that we want to have the break appear after.
>>> import docx
>>> gkzDoc = docx.Document()
>>> para = gkzDoc.add_paragraph('ozenero Tutorials')
>>> para.runs[0].add_break()
>>> para.add_run('Python Basics')
>>> gkzDoc.save('ozenero.docx')
>>> para.text
'ozenero Tutorials\nPython Basics'
Result in ozenero.docx:
We can also add a page break with add_break()
method by passing the value docx.enum.text.WD_BREAK.PAGE
as an argument to it.
>>> gkzDoc = docx.Document()
>>> para = gkzDoc.add_paragraph('ozenero Tutorials')
>>> para.runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)
>>> gkzDoc.add_paragraph('Python Basics')
>>> gkzDoc.save('ozenero.docx')
Result in ozenero.docx:
Add Pictures
We can use Document
object's add_picture()
method to add an image to the end of the document.
>>> gkzDoc = docx.Document()
>>> gkzDoc.add_paragraph('ozenero Tutorials')
>>> gkzDoc.add_picture('gkn-logo-sm.png')
>>> gkzDoc.save('ozenero.docx')
Result in ozenero.docx:
add_picture()
method has optional width
and height
arguments.
If we don't use them, the width and height will default to the normal size of the image.
>>> gkzDoc.add_picture('gkn-logo-sm.png', width=docx.shared.Inches(1))
>>> gkzDoc.save('ozenero.docx')
Result in ozenero.docx:
>>> gkzDoc.add_picture('gkn-logo-sm.png', width=docx.shared.Inches(1), height=docx.shared.Cm(3))
>>> gkzDoc.save('ozenero.docx')
Result in ozenero.docx: