How to read/write Word docx files in Python

python-read-write-word-docx-files-feature-image

In this tutorial, we’re gonna look at way to use python-docx module to read, write Word docx files in Python program.

Word documents

Word .docx file has more structures than plain text. With python-docx module, we have 3 different data types:
– a Document object for entire document.
Paragraph objects for the paragraphs inside Document object.
– Each Paragraph object contains a list of Run objects.

read-write-word-docx-files-in-python-docx-module-docx-file

Read/Write Word docx files in Python

Install python-docx module

Open cmd, then run:
pip install python-docx

Once the installation is successful, we can see docx folder at Python\Python[version]\Lib\site-packages.
(In this tutorial, we use python-docx 0.8.10)

Now we can import the module by running import docx.

Read docx file

Open file

We call docx.Document() function and pass the filename to open a docx file under a Document object.


>>> import docx
>>> gkzDoc = docx.Document('ozenero.docx')

Get paragraphs

Document object has paragraphs attribute that is a list of Paragraph objects.


>>> gkzDoc = docx.Document('ozenero.docx')

>>> len(gkzDoc.paragraphs)
4
>>> gkzDoc.paragraphs[0].text
'JavaSampleApproach.com was the predecessor website to ozenero.com.'
>>> gkzDoc.paragraphs[1].text
'In this brandnew site, we don\u2019t only focus on Java & Javascript Technology but also approach to other technologies & frameworks, other fields of computer science such as Machine Learning and Testing. All of them will come to you in simple, feasible, practical and integrative ways. Then you will feel the connection of everything.'
>>> gkzDoc.paragraphs[2].text
'What does ozenero mean?'
>>> gkzDoc.paragraphs[3].text
'Well, ozenero is derived from the words grok and konez.'

Get full-text

To get full-text of the document, we will:
- open the Word document
- loop over all Paragraph objects and then appends their text


>>> import docx
>>> gkzDoc = docx.Document('ozenero.docx')

>>> fullText = []
>>> for paragraph in doc.paragraphs:
...     fullText.append(paragraph.text)
...

>>> fullText
[
'JavaSampleApproach.com was the predecessor website to ozenero.com.',
'In this brandnew site, we don\u2019t only focus on Java & Javascript Technology but also approach to other technologies & frameworks, other fields of computer science such as Machine Learning and Testing. All of them will come to you in simple, feasible, practical and integrative ways. Then you will feel the connection of everything.',
'What does ozenero mean?',
'Well, ozenero is derived from the words grok and konez.'
]

# add 2 lines between paragraphs and merge them all
>>> '\n\n'.join(fullText)

Write docx file

Create Word document

We call docx.Document() function to get a new, blank Word Document object.


>>> import docx
>>> gkzDoc = docx.Document()
>>> gkzDoc

Save Word Document

When everything's done, we must use save('filename.docx') Document's method with filename to save the Document object to a file.


>>> gkzDoc.save('ozenero.docx')

Add Paragraphs

We use add_paragraph() Document's method to add a new paragraph and get a reference to this Paragraph object.


>>> gkzDoc.add_paragraph('ozenero.com for developers!')

We can add text to the end of an existing Paragraph object using Paragraph’s add_run(text) method.


>>> para1 = gkzDoc.add_paragraph('1- Python Tutorials')
>>> para2 = gkzDoc.add_paragraph('2- Tensorflow Tutorials')
>>> para1.add_run(' - Basics')

Result in ozenero.docx:

read-write-word-docx-files-in-python-docx-module-add-paragraphs

To make text styled, we add Run with text attributes.


>>> para2.add_run(' - ')

>>> para2.add_run('Machine Learning').bold = True
>>> para2.add_run(' Tutorials').italic = True

# both bold and italic
>>> para3 = gkzDoc.add_paragraph()
>>> runner = para3.add_run('3- Big Data Tutorials')
>>> runner.bold = True
>>> runner.italic = True

>>> gkzDoc.save('ozenero.docx')

Result in ozenero.docx:

read-write-word-docx-files-in-python-docx-module-add-paragraphs-add-runs

These are some text attributes:
- bold
- italic
- underline
- strike (strikethrough)
- double_strike (double strikethrough)
- all_caps (capital letters)
- shadow
- outline
- rtl (right-to-left)

Add Headings

We use Document's add_heading(heading, i) method to add a paragraph with heading style with i argument from 0 to 9 for heading levels.


>>> gkzDoc.add_heading('ozenero 1', 1)

>>> gkzDoc.add_heading('ozenero 2', 2)

>>> gkzDoc.add_heading('ozenero 3', 3)

>>> gkzDoc.add_heading('ozenero 4', 4)

>>> gkzDoc.add_heading('ozenero 5', 5)

>>> gkzDoc.add_heading('ozenero 6', 6)

>>> gkzDoc.add_heading('ozenero 7', 7)

>>> gkzDoc.add_heading('ozenero 8', 8)

>>> gkzDoc.add_heading('ozenero 9', 9)

read-write-word-docx-files-in-python-docx-module-add-headings

Add Line Breaks, Page Breaks

Instead of starting a new paragraph, we can add a line break using Run object add_break() method on the one that we want to have the break appear after.


>>> import docx
>>> gkzDoc = docx.Document()
>>> para = gkzDoc.add_paragraph('ozenero Tutorials')
>>> para.runs[0].add_break()
>>> para.add_run('Python Basics')

>>> gkzDoc.save('ozenero.docx')
>>> para.text
'ozenero Tutorials\nPython Basics'

Result in ozenero.docx:

read-write-word-docx-files-in-python-docx-module-add-line-break

We can also add a page break with add_break() method by passing the value docx.enum.text.WD_BREAK.PAGE as an argument to it.


>>> gkzDoc = docx.Document()
>>> para = gkzDoc.add_paragraph('ozenero Tutorials')
>>> para.runs[0].add_break(docx.enum.text.WD_BREAK.PAGE)
>>> gkzDoc.add_paragraph('Python Basics')

>>> gkzDoc.save('ozenero.docx')

Result in ozenero.docx:

read-write-word-docx-files-in-python-docx-module-add-page-break

Add Pictures

We can use Document object's add_picture() method to add an image to the end of the document.


>>> gkzDoc = docx.Document()
>>> gkzDoc.add_paragraph('ozenero Tutorials')

>>> gkzDoc.add_picture('gkn-logo-sm.png')

>>> gkzDoc.save('ozenero.docx')

Result in ozenero.docx:

read-write-word-docx-files-in-python-docx-module-add-picture

add_picture() method has optional width and height arguments.
If we don't use them, the width and height will default to the normal size of the image.


>>> gkzDoc.add_picture('gkn-logo-sm.png', width=docx.shared.Inches(1))

>>> gkzDoc.save('ozenero.docx')

Result in ozenero.docx:

read-write-word-docx-files-in-python-docx-module-add-picture-width


>>> gkzDoc.add_picture('gkn-logo-sm.png', width=docx.shared.Inches(1), height=docx.shared.Cm(3))

>>> gkzDoc.save('ozenero.docx')

Result in ozenero.docx:

read-write-word-docx-files-in-python-docx-module-add-picture-width-height

0 0 votes
Article Rating
Subscribe
Notify of
guest
1.1K Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments