Converting Documents

September 7th, 2011 by andylockran

I recently faced an unenviable dilemma when my girlfriend was given a data entry task for her internship. As part of that job, she had to copy and paste data from an OOXML document into a spreadsheet. Having taken a closer look at the XML behind the document, I realised it all followed a fairly straightforward structure:

<hr>
<H2>Date</>
<H2>Title</>
<H2>Type</>
<div>
<p><a>ORIGINAL URL</></>
<p class="person">Speaker/Author</>
<p>Content</> *this may repeat any number of times
<p><b>Speaker/Author</b></>
<p>Content</> *this may repeat any number of times
<p class="column">Reference</> *this may also repeat any number of times
</div>
<hr>
etc.

I had a play with BeautifulSoup, and was able to extract the data for a stanza of this document. The Content does repeat itself and loop through a few times, but I have yet to come up with a solution for iterating through, and defining each section between two ‘hr’ tags as an independent instance.

What I’m trying to do is output each section under the following headings into a spreadsheet:

Date | Title | Speaker | Content

If anyone could give me some guidance on how to do it using BeautifulSoup then it would be much appreciated.  The main things I feel I need to understand are:

  1. Identifying and defining ‘sections’ of a document so as to iterate over them.
  2. What data type to use when processing the sections (should they be in a dict, a list, etc.).
  3. How to output each line of content to a CSV file with the appropriate fields (speaker, topic and date) set on each line of content.
  4. How to handle the unicode strings that BeautifulSoup returns.

If anyone could give me a hand, it would be much appreciated.  The task is for a charity, and as you may have guessed by the content, it is to process a report on parliamentary debates.  The good news is that the data structure is very similar for the other types of reports that I would like to parse, so once I’ve got this one completed I’m pretty sure that I’ll be able to make the minor modifications to get others done.

Thanks in advance!  Andy

 

  • http://withoutatraceroute.com Brendan McCollam

    1) If this XML is coming from a Word OOXML document, then BeautifulSoup (which is designed to deal with HTML) might not be your best bet. I would think a general-purpose XML library like lxml (http://lxml.de/), or perhaps an OOXML-specific library, would be a better fit. I found this after a few minutes of searching: https://github.com/mikemaccana/python-docx

    You can easily split the entire string (before parsing) into sections by using Python’s built-in string split() method, split("<hr>"), which will break it into a list of strings on each <hr> tag. You could then parse each section separately.
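A minimal sketch of that splitting step, assuming the separator is the literal `<hr>` tag shown in the post. The sample markup and the helper name `split_sections` are made up for illustration, and the sample uses ordinary closing tags rather than the `</>` shorthand.

```python
# Hypothetical sample following the structure described in the post.
raw = (
    "<hr>"
    "<h2>2011-09-01</h2><h2>First debate</h2><h2>Oral</h2>"
    "<div><p class='person'>Speaker A</p><p>Some content</p></div>"
    "<hr>"
    "<h2>2011-09-02</h2><h2>Second debate</h2><h2>Written</h2>"
    "<div><p class='person'>Speaker B</p><p>Other content</p></div>"
    "<hr>"
)


def split_sections(text, separator="<hr>"):
    """Return the non-empty chunks of text found between separators."""
    return [chunk.strip() for chunk in text.split(separator) if chunk.strip()]


sections = split_sections(raw)
for number, section in enumerate(sections, start=1):
    # Show the start of each section so the boundaries are visible.
    print(number, section[:40])
```

Note that split() only matches the exact string it is given, so variants such as `<hr/>` or `<HR>` would need to be normalised first.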

    2) If it were me, I would process the sections into dictionaries, with the keys being the eventual column headers (“Date” , “Title”, etc.)
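One way the dictionary idea might look, sketched with the standard library's html.parser rather than BeautifulSoup. It assumes Python 3, well-formed closing tags, that the first two `<h2>` elements are the date and title, and that speakers are marked with class "person". The class name `SectionParser` and the function `section_to_rows` are hypothetical, and links, references and bold speaker lines would need extra cases.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Collect the headings, speakers and content paragraphs of one section."""

    def __init__(self):
        super().__init__()
        self.headings = []   # text of each <h2>, in document order
        self.rows = []       # one dict per content paragraph
        self._kind = None    # "h2", "speaker", "content" or None
        self._speaker = ""   # most recent speaker seen
        self._text = []      # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._kind, self._text = "h2", []
        elif tag == "p":
            is_speaker = dict(attrs).get("class") == "person"
            self._kind = "speaker" if is_speaker else "content"
            self._text = []

    def handle_data(self, data):
        if self._kind is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag not in ("h2", "p") or self._kind is None:
            return
        text = "".join(self._text).strip()
        if self._kind == "h2":
            self.headings.append(text)
        elif self._kind == "speaker":
            self._speaker = text
        elif text:
            # Pad with empty strings in case a heading is missing.
            date, title = (self.headings + ["", ""])[:2]
            self.rows.append({"Date": date, "Title": title,
                              "Speaker": self._speaker, "Content": text})
        self._kind = None


def section_to_rows(section):
    """Parse one section and return a list of row dictionaries."""
    parser = SectionParser()
    parser.feed(section)
    parser.close()
    return parser.rows


section = (
    "<h2>2011-09-01</h2><h2>First debate</h2><h2>Oral</h2>"
    "<div><p class='person'>Speaker A</p><p>First point</p>"
    "<p>Second point</p><p class='person'>Speaker B</p><p>A reply</p></div>"
)
rows = section_to_rows(section)
for row in rows:
    print(row)
```

Producing one dictionary per content paragraph, each carrying the date, title and current speaker, means every row can later become one spreadsheet line without further bookkeeping.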

    3) Python has a very nice CSV module in the standard library. Once you have your dictionaries, you can easily output them with DictWriter: http://docs.python.org/library/csv.html#csv.DictWriter
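A short sketch of the DictWriter step, assuming Python 3 and the column names from the post. The sample rows and the helper name `write_rows` are invented for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

FIELDNAMES = ["Date", "Title", "Speaker", "Content"]


def write_rows(rows, handle):
    """Write a header line followed by one CSV line per row dictionary."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical rows; in practice these would come from the parsed sections.
rows = [
    {"Date": "2011-09-01", "Title": "First debate",
     "Speaker": "Speaker A", "Content": "First point, with a comma"},
    {"Date": "2011-09-01", "Title": "First debate",
     "Speaker": "Speaker B", "Content": "A reply"},
]

buffer = io.StringIO()  # stands in for a file opened for writing
write_rows(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file, the csv documentation recommends opening it with newline="" so the module controls line endings itself.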

    4) In Python, you can use unicode strings exactly like you would use any other string. If you need to encode them in some particular way for the output, this document might help: http://docs.python.org/howto/unicode.html

    But my advice would be to just ignore the unicode-ness and not worry about it unless it causes a problem down the line. If you do need to encode, UTF-8 is usually a safe bet these days.
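For what it is worth, here is a small sketch of the UTF-8 route, assuming Python 3 (where every str is already Unicode and the encoding is chosen when the file is opened). The sample row and the file name are made up.

```python
import csv
import os
import tempfile

# Hypothetical row containing non-ASCII characters.
rows = [{"Speaker": "Zoë Müller", "Content": "Costs rose by £5, a “modest” rise"}]

path = os.path.join(tempfile.mkdtemp(), "debates.csv")

# Encode as UTF-8 on the way out...
with open(path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Speaker", "Content"])
    writer.writeheader()
    writer.writerows(rows)

# ...and decode with the same encoding on the way back in.
with open(path, newline="", encoding="utf-8") as handle:
    read_back = list(csv.DictReader(handle))

print(read_back == rows)
```

The only place the encoding appears is in the two open() calls; the strings themselves are handled like any others.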

    • http://zrmt.com andylockran

      Brendan,

      Thanks for the comprehensive reply.  I’ll be using your advice for my code.  Hopefully will have something to publish in the next couple of days…

  • http://twitter.com/zeth0 zeth0

    I use TEI XML for marking up ancient documents. In these documents you have multiple hierarchies (where you view a text in different ways). To view a document page by page, you grab the content between page break elements. You basically want to use a SAX-based approach.

    However, I currently use lxml. Have a look here:
    http://bazaar.launchpad.net/~zeth0/vmr/trunk/view/head:/content/xmlparsing.py

    Change to , hopefully you see what I mean.
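The linked code is not reproduced here, but the "grab the content between break elements" idea can be sketched with the standard library's xml.sax instead of lxml. This assumes Python 3, well-formed XML, and `hr` as the separator element; the handler name `SectionHandler` and the sample document are hypothetical.

```python
import xml.sax


class SectionHandler(xml.sax.ContentHandler):
    """Collect the text seen between consecutive separator elements."""

    def __init__(self, separator="hr"):
        super().__init__()
        self.separator = separator
        self.sections = []   # finished sections, as plain text
        self._current = []   # text pieces of the section being read

    def startElement(self, name, attrs):
        # A separator closes whatever section was in progress.
        if name == self.separator:
            self._flush()

    def endElement(self, name):
        # Keep text from neighbouring elements apart.
        self._current.append(" ")

    def characters(self, content):
        self._current.append(content)

    def endDocument(self):
        self._flush()

    def _flush(self):
        # Collapse runs of whitespace and skip empty sections.
        text = " ".join("".join(self._current).split())
        if text:
            self.sections.append(text)
        self._current = []


SAMPLE = (
    "<doc><hr/><h2>Date one</h2><p>Content one</p>"
    "<hr/><h2>Date two</h2><p>Content two</p><hr/></doc>"
)

handler = SectionHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
for section in handler.sections:
    print(section)
```

Because the handler reacts to events as the parser streams through the document, the separator does not need to wrap the content it delimits, which is what makes this approach suit milestone-style elements such as page breaks.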

    • http://zrmt.com andylockran

      Zeth,

      Sorry, the comments system broke your XML. Thanks for your input – I’ve come across lxml since starting the snippet, so I’ll go back and take a closer look in line with Brendan’s recommendations.