Converting Documents

September 7th, 2011

I recently found myself in an unenviable dilemma: my girlfriend had been given a data entry task for her internship. As part of that job, she had to copy and paste data from an OOXML document into a spreadsheet. Having taken a closer look at the XML behind the document, I realised it all followed a fairly straightforward structure:

<hr>
<H2>Date</>
<H2>Title</>
<H2>Type</>
<div>
<p><a>ORIGINAL URL</></>
<p class=person>Speaker/Author</>
<p>Content</>  *this may repeat any number of times
<p><b>Speaker/Author</b></>
<p>Content</> *this may repeat any number of times
<p class="column">Reference</> *this may also repeat any number of times
</div>
<hr>
etc.

I had a play with BeautifulSoup and was able to extract the data for a single stanza of this document. The content paragraphs repeat within each stanza, but I have yet to come up with a way of iterating through the document and treating everything between two ‘hr’ tags as an independent section.
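
To make the problem concrete, here is a minimal sketch of one way the splitting might work. It assumes the modern `bs4` package, a real file that uses proper closing tags rather than the `</>` shorthand above, that the first two headings in each section are the date and title, and that speaker paragraphs can be recognised by `class="person"` or a `<b>` tag. `SAMPLE`, `split_sections` and `parse_section` are hypothetical names, and the sample data is invented.

```python
from bs4 import BeautifulSoup, Tag

# Invented sample modelled on the structure sketched above.
SAMPLE = """
<hr>
<h2>1 March 2011</h2>
<h2>Example debate</h2>
<h2>Oral answers</h2>
<div>
<p><a href="http://example.org/1">http://example.org/1</a></p>
<p class="person">Speaker A</p>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p><b>Speaker B</b></p>
<p>A reply.</p>
<p class="column">Column 12</p>
</div>
<hr>
<h2>2 March 2011</h2>
<h2>Another debate</h2>
<h2>Written answers</h2>
<div>
<p class="person">Speaker C</p>
<p>Only paragraph.</p>
</div>
<hr>
"""


def split_sections(html):
    """Return one list of tags for each run of elements between <hr> tags."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for hr in soup.find_all("hr"):
        section = []
        for sibling in hr.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip the whitespace strings between tags
            if sibling.name == "hr":
                break  # the next <hr> closes this section
            section.append(sibling)
        if section:  # the final <hr> has nothing after it
            sections.append(section)
    return sections


def parse_section(section):
    """Turn one section into a list of row dicts, one per content paragraph."""
    headings = [t.get_text(strip=True) for t in section if t.name == "h2"]
    date, title = headings[0], headings[1]  # assumed order: date, title, type
    body = next(t for t in section if t.name == "div")
    rows, speaker = [], None
    for p in body.find_all("p", recursive=False):
        classes = p.get("class") or []
        if "column" in classes or p.find("a"):
            continue  # assumed to be a reference or the original-URL line
        if "person" in classes or p.find("b"):
            speaker = p.get_text(strip=True)  # applies to the paragraphs below
        else:
            rows.append({"Date": date, "Title": title,
                         "Speaker": speaker, "Content": p.get_text(strip=True)})
    return rows


all_rows = [row for s in split_sections(SAMPLE) for row in parse_section(s)]
```

Each content paragraph ends up as its own dict carrying the date, title and most recent speaker, which is the shape the spreadsheet needs. Content paragraphs that happen to contain a link or bold text would be misclassified by these simple checks, so the real data may need stricter tests.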

What I’m trying to do is output each section into a spreadsheet under the following headings:

Date | Title | Speaker | Content

If anyone could give me some guidance on how to do it using BeautifulSoup then it would be much appreciated.  The main things I feel I need to understand are:

  1. Identifying and defining ‘sections’ of a document so as to iterate over them.
  2. What data type to use when processing the sections (should they be in a dict, a list, etc.).
  3. How to output each line of content to a CSV file with the appropriate fields (speaker, title and date) set on each line of content.
  4. How to handle the unicode strings that BeautifulSoup returns.
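
For points 2 to 4, one plausible shape is a list of dicts, one dict per line of content, written out with the standard library's `csv.DictWriter`. The sketch below assumes Python 3, where text is already Unicode `str` and encoding only matters when the file is opened. `FIELDS`, `write_rows` and the sample rows are hypothetical.

```python
import csv
import io

FIELDS = ["Date", "Title", "Speaker", "Content"]  # the headings listed above


def write_rows(rows, out):
    """Write a header plus one CSV line per row dict to a text-mode file."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Invented rows, including non-ASCII text and an embedded comma.
rows = [
    {"Date": "1 March 2011", "Title": "Example debate",
     "Speaker": "Speaker A", "Content": "Funding for the café, as “promised”."},
    {"Date": "1 March 2011", "Title": "Example debate",
     "Speaker": "Speaker B", "Content": "A reply."},
]

# An in-memory buffer stands in for a real file here. For a file on disk:
#   with open("debates.csv", "w", newline="", encoding="utf-8") as f:
#       write_rows(rows, f)
buffer = io.StringIO()
write_rows(rows, buffer)
csv_text = buffer.getvalue()
```

Opening the file with `newline=""` follows the `csv` module's own recommendation, and `encoding="utf-8"` keeps accented characters and curly quotes intact. The writer quotes any field containing a comma, so the content column survives a round trip.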

The task is for a charity and, as you may have guessed from the content, it is to process a report on parliamentary debates. The good news is that the data structure is very similar for the other types of report that I would like to parse, so once I’ve got this one working I’m pretty sure I’ll be able to make the minor modifications needed for the others.

Thanks in advance! Andy