Writing Scrapers

A state scraper is implemented by providing classes derived from BillScraper, LegislatorScraper, VoteScraper, and CommitteeScraper.

Derived scraper classes should override the scrape() method, which is responsible for creating Bill, Legislator, Vote, and Committee objects as appropriate.

Example state scraper directory structure:

./ex/__init__.py      # metadata for "ex" state scraper
./ex/bills.py         # contains EXBillScraper (also scrapes Votes)
./ex/legislators.py   # contains EXLegislatorScraper
./ex/committees.py    # contains EXCommitteeScraper

billy.scrape

Scraper

The most useful method on the base Scraper class is urlopen(url, method='GET', body=None). Scraper.urlopen opens a URL and returns a string-like object that can then be parsed by a library such as lxml.

This method has advantages over the built-in urlopen functions: the underlying Scraper class can be configured for rate limiting and caching, and it provides robust error handling.
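
As a rough illustration of the fetch-then-parse pattern, the sketch below uses a hypothetical FakeScraper whose urlopen returns canned HTML, so it runs without billy or network access. It parses with the standard library's html.parser rather than lxml purely to stay self-contained. LinkCollector and list_bill_links are illustrative names, not part of billy.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Only text that appears inside an open <a> element is recorded.
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


class FakeScraper:
    """Hypothetical stand-in for Scraper: returns canned HTML, no network."""

    def urlopen(self, url, method="GET", body=None):
        return ('<ul><li><a href="/bill/1">HB 1</a></li>'
                '<li><a href="/bill/2">HB 2</a></li></ul>')


def list_bill_links(scraper, url):
    """Fetch a page through the scraper and return its links."""
    parser = LinkCollector()
    parser.feed(scraper.urlopen(url))
    return parser.links


links = list_bill_links(FakeScraper(), "http://example.com/bills")
```

In a real scraper the same function would receive the Scraper subclass instance (self), so that any configured rate limiting and caching apply to the request.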

Note

For advanced usage, see scrapelib, which provides the basis for billy.scrape.Scraper.

Logging

The base class also configures a Python logger instance and provides several shortcuts for logging at various log levels:

log(msg, *args, **kwargs)
log a message with level logging.INFO
debug(msg, *args, **kwargs)
log a message with level logging.DEBUG
warning(msg, *args, **kwargs)
log a message with level logging.WARNING

Note

It is also possible to access the self.logger object directly.
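
One plausible way such shortcuts can delegate to self.logger is sketched below. LoggingMixin and ListHandler are hypothetical names used only for this illustration; only the standard logging module is real here, and billy's own implementation may differ.

```python
import logging


class LoggingMixin:
    """Hypothetical stand-in: shortcuts that forward to self.logger."""

    def __init__(self, name="billy.example"):
        self.logger = logging.getLogger(name)

    def log(self, msg, *args, **kwargs):
        self.logger.info(msg, *args, **kwargs)

    def debug(self, msg, *args, **kwargs):
        self.logger.debug(msg, *args, **kwargs)

    def warning(self, msg, *args, **kwargs):
        self.logger.warning(msg, *args, **kwargs)


class ListHandler(logging.Handler):
    """Collect (level name, formatted message) pairs for inspection."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append((record.levelname, record.getMessage()))


scraper = LoggingMixin()
handler = ListHandler()
scraper.logger.addHandler(handler)      # direct access to self.logger
scraper.logger.setLevel(logging.DEBUG)  # let DEBUG messages through

scraper.log("scraped %d bills", 12)
scraper.debug("fetching %s", "http://example.com")
scraper.warning("missing sponsor for %s", "HB 1")
```

Passing the format arguments separately, rather than pre-formatting the string, lets the logging module skip the formatting work when a level is disabled.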

class billy.scrape.Scraper(metadata, output_dir=None, strict_validation=None, fastmode=False)

Base class for all Scrapers

Provides several useful methods for retrieving URLs and checking arguments against metadata.

__init__(metadata, output_dir=None, strict_validation=None, fastmode=False)

Create a new Scraper instance.

Parameters:
  • metadata – metadata for this scraper
  • output_dir – the data directory to use
  • strict_validation – exit immediately if validation fails
validate_session(session, latest_only=False)

Check that a session is present in the metadata dictionary.

Raises NoDataForPeriod if session is invalid.

Parameters:
  • session – string representing session to check
validate_term(term, latest_only=False)

Check that a term is present in the metadata dictionary.

Raises NoDataForPeriod if term is invalid.

Parameters:
  • term – string representing term to check
  • latest_only – if True, will raise exception if term is not the current term (default: False)
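
The check described above might look roughly like the following standalone function. The metadata layout (a "terms" list ordered oldest first, each entry a dict with a "name" key) is an assumption made for this sketch, and NoDataForPeriod here is a local stand-in for the billy exception.

```python
class NoDataForPeriod(Exception):
    """Stand-in for billy.scrape.NoDataForPeriod."""

    def __init__(self, period):
        self.period = period
        super().__init__("No data exists for period: %s" % period)


def validate_term(metadata, term, latest_only=False):
    # Assumed layout: metadata["terms"] is an oldest-first list of dicts,
    # each carrying the term's name under the "name" key.
    names = [t["name"] for t in metadata["terms"]]
    if latest_only:
        # Only the most recent term is acceptable.
        if not names or term != names[-1]:
            raise NoDataForPeriod(term)
    elif term not in names:
        raise NoDataForPeriod(term)
    return True


metadata = {"terms": [{"name": "2009-2010"}, {"name": "2011-2012"}]}
validate_term(metadata, "2009-2010")                    # passes
validate_term(metadata, "2011-2012", latest_only=True)  # passes
```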

SourcedObject

class billy.scrape.SourcedObject(_type, **kwargs)

Base object used for data storage.

Base class for Bill, Legislator, Vote, and Committee.

SourcedObjects work like a dictionary. It is possible to add extra data beyond the required fields by assigning to the SourcedObject instance like a dictionary.

add_source(url, **kwargs)

Add a source URL from which data related to this object was scraped.

Parameters:
  • url – the location of the source
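
The dictionary-like behaviour can be pictured with a small stand-in class. This is a minimal sketch and not billy's implementation: the "sources" and "_type" storage keys are assumptions, and fiscal_impact is a made-up extra field.

```python
class SourcedObject(dict):
    """Stand-in: a dict that also keeps a list of source URLs."""

    def __init__(self, _type, **kwargs):
        super().__init__()
        self["_type"] = _type
        self["sources"] = []      # assumed storage key
        self.update(kwargs)

    def add_source(self, url, **kwargs):
        self["sources"].append(dict(url=url, **kwargs))


obj = SourcedObject("bill", title="An example bill")
obj["fiscal_impact"] = "none"     # extra data, assigned like a dict entry
obj.add_source("http://example.com/bills/1")
```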

Exceptions

class billy.scrape.ScrapeError(msg, orig_exception=None)

Base class for scrape errors.

class billy.scrape.NoDataForPeriod(period)

Exception to be raised when no data exists for a given period.

Bills

BillScraper

BillScraper implementations should gather and save Bill objects.

Sometimes it is easiest to gather Vote objects in a BillScraper as well. These can be attached to Bill objects via the add_vote() method.

class billy.scrape.bills.BillScraper(metadata, output_dir=None, strict_validation=None, fastmode=False)
scrape(chamber, session)

Grab all the bills for a given chamber and session. Must be overridden by subclasses.

Should raise a NoDataForPeriod exception if it is not possible to scrape bills for the given session.
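
A minimal sketch of this contract follows. The base class and exception are local stand-ins so the example runs without billy. save_bill is an assumed name for the save step, and available_sessions is a hypothetical attribute describing what an imaginary site publishes.

```python
class NoDataForPeriod(Exception):
    """Stand-in for billy.scrape.NoDataForPeriod."""

    def __init__(self, period):
        self.period = period
        super().__init__("No data exists for period: %s" % period)


class BillScraper:
    """Stand-in for billy.scrape.bills.BillScraper (illustration only)."""

    def __init__(self, metadata, output_dir=None):
        self.metadata = metadata
        self.saved = []           # this sketch keeps saved objects in memory

    def save_bill(self, bill):    # assumed name for the "save" step
        self.saved.append(bill)

    def scrape(self, chamber, session):
        raise NotImplementedError("subclasses must override scrape()")


class EXBillScraper(BillScraper):
    # Hypothetical: the sessions this imaginary site has data for.
    available_sessions = {"2011"}

    def scrape(self, chamber, session):
        if session not in self.available_sessions:
            raise NoDataForPeriod(session)
        # A real scraper would fetch and parse pages here.
        bill = {"session": session, "chamber": chamber,
                "bill_id": "HB 1", "title": "An example bill"}
        self.save_bill(bill)


scraper = EXBillScraper(metadata={})
scraper.scrape("lower", "2011")
```

Raising the exception before any work is done keeps a run over many sessions from silently producing empty output for the ones that cannot be scraped.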

Bill

class billy.scrape.bills.Bill(session, chamber, bill_id, title, **kwargs)

Object representing a piece of legislation.

See SourcedObject for notes on extra attributes/fields.

__init__(session, chamber, bill_id, title, **kwargs)

Create a new Bill.

Parameters:
  • session – the session in which the bill was introduced.
  • chamber – the chamber in which the bill was introduced: either ‘upper’ or ‘lower’
  • bill_id – an identifier assigned to this bill by the legislature (should be unique within the context of this chamber/session) e.g.: ‘HB 1’, ‘S. 102’, ‘H.R. 18’
  • title – a title or short description of this bill provided by the official source

Any additional keyword arguments will be associated with this bill and stored in the database.

add_action(actor, action, date, type=None, committees=None, legislators=None, **kwargs)

Add an action that was performed on this bill.

Parameters:
  • actor – a string representing who performed the action. If the action is associated with one of the chambers this should be ‘upper’ or ‘lower’. Alternatively, this could be the name of a committee, a specific legislator, or an outside actor such as ‘Governor’.
  • action – a string representing the action performed, e.g. ‘Introduced’, ‘Signed by the Governor’, ‘Amended’
  • date – the date/time this action was performed.
  • type – a type classification for this action
  • committees – a committee or list of committees to associate with this action
add_document(name, url, mimetype=None, **kwargs)

Add a document or media item that is related to the bill. Use this method to add documents such as Fiscal Notes, Analyses, Amendments, or public hearing recordings.

Parameters:
  • name – a name given to the document, e.g. ‘Fiscal Note for Amendment LCO 6544’
  • url – link to location of document or file
  • mimetype – MIME type of the document

If multiple formats of a document are provided, a good rule of thumb is to prefer text, followed by html, followed by pdf/word/etc.
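
The rule of thumb can be expressed as a small ranking helper. pick_preferred and FORMAT_RANK are hypothetical names for this sketch. Formats other than plain text and HTML share the last tier, so the first of them in the input wins a tie.

```python
# Lower rank is better; anything unlisted (PDF, Word, ...) shares the last tier.
FORMAT_RANK = {"text/plain": 0, "text/html": 1}


def pick_preferred(candidates):
    """Return the (url, mimetype) pair with the most preferred format."""
    return min(candidates,
               key=lambda c: FORMAT_RANK.get(c[1], len(FORMAT_RANK)))


formats = [
    ("http://example.com/note.pdf", "application/pdf"),
    ("http://example.com/note.html", "text/html"),
]
best = pick_preferred(formats)
```

The selected pair can then be passed on as the url and mimetype arguments of add_document().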

add_source(url, **kwargs)

Add a source URL from which data related to this object was scraped.

Parameters:
  • url – the location of the source
add_sponsor(type, name, **kwargs)

Associate a sponsor with this bill.

Parameters:
  • type – the type of sponsorship, e.g. ‘primary’, ‘cosponsor’
  • name – the name of the sponsor as provided by the official source
add_title(title)

Associate an alternate title with this bill.

add_version(name, url, mimetype=None, on_duplicate='error', **kwargs)

Add a version of the text of this bill.

Parameters:
  • name – a name given to this version of the text, e.g. ‘As Introduced’, ‘Version 2’, ‘As amended’, ‘Enrolled’
  • url – the location of this version on the legislative website.
  • mimetype – MIME type of the document
  • on_duplicate – what to do if a duplicate is seen:
      • error – default option, raises a ValueError
      • ignore – add the document twice (rarely the right choice)
      • use_new – use the new name, removing the old document
      • use_old – use the old name, not adding the new document

If multiple formats are provided, a good rule of thumb is to prefer text, followed by html, followed by pdf/word/etc.
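
The four on_duplicate modes can be illustrated with a standalone function operating on a plain list. This is a sketch of the documented behaviour, not billy's code. It assumes a duplicate means a version whose URL has already been added.

```python
def add_version(versions, name, url, on_duplicate="error"):
    """Append a version to `versions`; a repeated URL counts as a duplicate."""
    if on_duplicate not in ("error", "ignore", "use_new", "use_old"):
        raise ValueError("unknown on_duplicate value: %r" % on_duplicate)

    is_duplicate = any(v["url"] == url for v in versions)
    if is_duplicate:
        if on_duplicate == "error":
            raise ValueError("duplicate version url: %s" % url)
        if on_duplicate == "use_old":
            return                      # keep the old entry, skip the new one
        if on_duplicate == "use_new":
            # Drop the old entry so only the new name remains.
            versions[:] = [v for v in versions if v["url"] != url]
        # "ignore" falls through and appends a second copy.
    versions.append({"name": name, "url": url})


versions = []
add_version(versions, "As Introduced", "http://example.com/v1")
add_version(versions, "Introduced (reprint)", "http://example.com/v1",
            on_duplicate="use_new")
```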

add_vote(vote)

Associate a Vote object with this bill.

Votes

VoteScraper

VoteScraper implementations should gather and save Vote objects.

If a state’s BillScraper gathers votes it is not necessary to provide a VoteScraper implementation.

class billy.scrape.votes.VoteScraper(metadata, output_dir=None, strict_validation=None, fastmode=False)
scrape(chamber, session)

Grab all votes for a given chamber and session. Must be overridden by subclasses.

Should raise a NoDataForPeriod exception if it is not possible to scrape votes for the provided session.

Vote

class billy.scrape.votes.Vote(chamber, date, motion, passed, yes_count, no_count, other_count, type='other', **kwargs)
__init__(chamber, date, motion, passed, yes_count, no_count, other_count, type='other', **kwargs)

Create a new Vote.

Parameters:
  • chamber – the chamber in which the vote was taken, ‘upper’ or ‘lower’
  • date – the date/time when the vote was taken
  • motion – a string representing the motion that was being voted on
  • passed – did the vote pass, True or False
  • yes_count – the number of ‘yes’ votes
  • no_count – the number of ‘no’ votes
  • other_count – the number of abstentions, ‘present’ votes, or anything else not covered by ‘yes’ or ‘no’.
  • type – vote type classification

Any additional keyword arguments will be associated with this vote and stored in the database.

Examples:

Vote('upper', '12/7/08', 'Final passage',
     True, 30, 8, 3)
Vote('lower', '3/4/03 03:40:22', 'Recommend passage',
     True, 12, 1, 0, committee='Finance Committee')
add_source(url, **kwargs)

Add a source URL from which data related to this object was scraped.

Parameters:
  • url – the location of the source
no(legislator)

Indicate that a legislator (given as a string of their name) voted ‘no’.

other(legislator)

Indicate that a legislator (given as a string of their name) abstained, voted ‘present’, or made any other vote not covered by ‘yes’ or ‘no’.

yes(legislator)

Indicate that a legislator (given as a string of their name) voted ‘yes’.

Examples:

vote.yes('Smith')
vote.yes('Alan Hoerth')
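
When recording individual votes, it can be useful to check that the named legislators add up to the counts passed to the constructor. The sketch below uses a stand-in Vote class so it runs without billy. The yes_votes, no_votes, and other_votes storage keys are assumptions, and counts_match is a hypothetical helper.

```python
class Vote(dict):
    """Stand-in for billy.scrape.votes.Vote (illustration only)."""

    def __init__(self, chamber, date, motion, passed,
                 yes_count, no_count, other_count, type="other", **kwargs):
        super().__init__(chamber=chamber, date=date, motion=motion,
                         passed=passed, yes_count=yes_count,
                         no_count=no_count, other_count=other_count,
                         type=type, **kwargs)
        # Assumed storage keys for the per-legislator lists.
        self["yes_votes"] = []
        self["no_votes"] = []
        self["other_votes"] = []

    def yes(self, legislator):
        self["yes_votes"].append(legislator)

    def no(self, legislator):
        self["no_votes"].append(legislator)

    def other(self, legislator):
        self["other_votes"].append(legislator)


def counts_match(vote):
    """True when each list of names is as long as its declared count."""
    return all(len(vote[kind + "_votes"]) == vote[kind + "_count"]
               for kind in ("yes", "no", "other"))


vote = Vote("upper", "12/7/08", "Final passage", True, 2, 1, 0)
vote.yes("Smith")
vote.yes("Alan Hoerth")
incomplete = counts_match(vote)   # the single 'no' voter is still missing
vote.no("Jones")
complete = counts_match(vote)
```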

Legislators

LegislatorScraper implementations should gather and save Legislator objects.

Sometimes it is easiest to also gather committee memberships at the same time as legislators. Committee memberships can be attached to Legislator objects via the add_role() method.

LegislatorScraper

class billy.scrape.legislators.LegislatorScraper(metadata, output_dir=None, strict_validation=None, fastmode=False)
scrape(chamber, term)

Grab all the legislators who served in a given term. Must be overridden by subclasses.

Should raise a NoDataForPeriod exception if it is not possible to scrape legislators for the given term.

Person

class billy.scrape.legislators.Person(full_name, first_name='', last_name='', middle_name='', **kwargs)
__init__(full_name, first_name='', last_name='', middle_name='', **kwargs)

Create a Person.

Note: the Legislator class should be used when dealing with legislators.

Parameters:
  • full_name – the person’s full name
  • first_name – the first name of this person (if specified)
  • last_name – the last name of this person (if specified)
  • middle_name – a middle name or initial of this person (if specified)
add_role(role, term, start_date=None, end_date=None, **kwargs)

Examples:

leg.add_role('member', term='2009', chamber='upper',
             party='Republican', district='10th')

Legislator

class billy.scrape.legislators.Legislator(term, chamber, district, full_name, first_name='', last_name='', middle_name='', party='', **kwargs)
__init__(term, chamber, district, full_name, first_name='', last_name='', middle_name='', party='', **kwargs)

Create a Legislator.

Parameters:
  • term – the term for this legislator
  • chamber – the chamber in which this legislator served, ‘upper’ or ‘lower’
  • district – the district this legislator is representing, as given e.g. ‘District 2’, ‘7th’, ‘District C’.
  • full_name – the full name of this legislator
  • first_name – the first name of this legislator (if specified)
  • last_name – the last name of this legislator (if specified)
  • middle_name – a middle name or initial of this legislator (if specified)
  • party – the party this legislator belongs to (if specified)

Note

Please provide the first_name, middle_name, and last_name parameters only if they are listed on the official web site; do not try to split the legislator’s full name into components yourself.

add_source(url, **kwargs)

Add a source URL from which data related to this object was scraped.

Parameters:
  • url – the location of the source

Committees

CommitteeScraper implementations should gather and save Committee objects.

If a state’s LegislatorScraper gathers committee memberships it is not necessary to provide a CommitteeScraper implementation.

CommitteeScraper

class billy.scrape.committees.CommitteeScraper(metadata, output_dir=None, strict_validation=None, fastmode=False)

Committee

class billy.scrape.committees.Committee(chamber, committee, subcommittee=None, **kwargs)
__init__(chamber, committee, subcommittee=None, **kwargs)

Create a Committee.

Parameters:
  • chamber – the chamber this committee is associated with (‘upper’, ‘lower’, or ‘joint’)
  • committee – the name of the committee
  • subcommittee – the name of the subcommittee (optional)
add_member(legislator, role='member', **kwargs)

Add a member to the committee object.

Parameters:
  • legislator – name of the legislator
  • role – role that the legislator holds in the committee, e.g. ‘chairman’ (default: ‘member’)
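
A short sketch of building a committee and its membership follows. The Committee class here is a stand-in that mirrors the signatures above. The "members" storage key and the shape of each member entry are assumptions for illustration.

```python
class Committee(dict):
    """Stand-in for billy.scrape.committees.Committee (illustration only)."""

    def __init__(self, chamber, committee, subcommittee=None, **kwargs):
        super().__init__(chamber=chamber, committee=committee,
                         subcommittee=subcommittee, members=[], **kwargs)

    def add_member(self, legislator, role="member", **kwargs):
        self["members"].append(dict(name=legislator, role=role, **kwargs))


finance = Committee("upper", "Finance")
finance.add_member("Smith", role="chairman")
finance.add_member("Alan Hoerth")        # role falls back to 'member'
```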

Events

EventScraper implementations should gather and save Event objects.

Relevant bills, documents, and participants can be attached to Event objects via the add_related_bill(), add_document(), and add_participant() methods, respectively.

EventScraper

class billy.scrape.events.EventScraper(metadata, output_dir=None, strict_validation=None, fastmode=False)

Event

class billy.scrape.events.Event(session, when, type, description, location, end=None, **kwargs)