I recently found out that my girlfriend had been given an unenviable data entry task for her internship: she had to copy and paste data from an OOXML document into a spreadsheet. Having taken a closer look at the XML behind the document, I realised it all followed a fairly straightforward structure:
<hr>
<H2>Date</>
<H2>Title</>
<H2>Type</>
<div>
  <p><a>ORIGINAL URL</></>
  <p class=person>Speaker/Author</>
  <p>Content</>  *this may repeat any number of times
  <p><b>Speaker/Author</b>
  <p>Content</>  *this may repeat any number of times
  <p class="column">Reference</>  *this may also repeat any number of times
</div>
I had a play with BeautifulSoup and was able to extract the data for a single section of this document. The Content paragraphs repeat several times within a section, but I have yet to come up with a way of iterating through the document and treating everything between two ‘hr’ tags as an independent section.
What I’m trying to do is output each section into a spreadsheet under the following headings:
Date | Title | Speaker | Content
If anyone could give me some guidance on how to do it using BeautifulSoup then it would be much appreciated. The main things I feel I need to understand are:
- Identifying and defining ‘sections’ of a document so as to iterate over them.
- What data type to use when processing the sections (should they be in a dict, a list, etc.).
- How to output each line of content to a CSV file with the appropriate fields (speaker, title and date) set on each line of content.
- How to handle the Unicode strings that BeautifulSoup returns.
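One possible approach to the four points above is sketched below. It is only a sketch, and it rests on assumptions: that the document has been saved as HTML with properly closed tags, that every `hr` and the headings and `div` that follow it are siblings, and that any paragraph before the first speaker (such as the URL link) can be skipped. The sample markup and the names `SAMPLE`, `FIELDS`, `parse_sections` and `to_csv` are made up for illustration. Each content paragraph becomes one dict, and the list of dicts is written out with `csv.DictWriter`.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample that mimics the structure described above.
SAMPLE = """
<hr>
<h2>2012-01-05</h2><h2>First debate</h2><h2>Oral</h2>
<div>
<p><a href="http://example.com/1">http://example.com/1</a></p>
<p class="person">Alice</p>
<p>Opening remarks.</p>
<p>More remarks.</p>
<p><b>Bob</b></p>
<p>A reply.</p>
<p class="column">Ref 12</p>
</div>
<hr>
<h2>2012-01-06</h2><h2>Second debate</h2><h2>Written</h2>
<div>
<p class="person">Carol</p>
<p>Only statement.</p>
</div>
"""

FIELDS = ["Date", "Title", "Speaker", "Content"]


def parse_sections(html):
    """Return one dict per content paragraph, tagged with date, title and speaker."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for hr in soup.find_all("hr"):
        headings, body = [], None
        # Everything up to the next <hr> belongs to this section.
        for sib in hr.find_next_siblings(True):
            if sib.name == "hr":
                break
            if sib.name == "h2":
                headings.append(sib.get_text(strip=True))
            elif sib.name == "div":
                body = sib
        if len(headings) < 2 or body is None:
            continue  # not a complete section
        date, title = headings[0], headings[1]
        speaker = None
        for p in body.find_all("p"):
            classes = p.get("class") or []
            text = p.get_text(strip=True)
            # Assumption: a speaker line is either class="person" or contains a <b>.
            if "person" in classes or p.find("b") is not None:
                speaker = text
            elif "column" in classes or speaker is None or not text:
                continue  # references, the URL paragraph, empty paragraphs
            else:
                rows.append({"Date": date, "Title": title,
                             "Speaker": speaker, "Content": text})
    return rows


def to_csv(rows, fileobj):
    """Write the rows to an open text file object as CSV with a header line."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

To write a real file, `to_csv(parse_sections(html), open("out.csv", "w", newline="", encoding="utf-8"))` would be one way to call it (the file name is a placeholder). In Python 3 the strings BeautifulSoup returns are ordinary `str` objects, so choosing the encoding when opening the output file should be all the Unicode handling that is needed.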