Monday, August 15, 2011

Using "Google Docs" to scrape HTML tables from web pages

One of my students suggested I try this... so I did.  In Google Docs, create a new spreadsheet.  In the first cell, type something of the form:

=ImportHtml("http://the-url-goes-here", "table", 0)

My first attempt was scraping some fantasy football points projections:

=ImportHtml("http://www.fftoolbox.com/football/2011/cheatsheets.cfm?player_pos=QB", "table", 0)

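Bingo.  At least, it worked for me on the 8 pages I tried.  The second argument tells the function to look for a table (the other option is "list"), and the third is the index of the table to import from the page.  I used 0 as the third argument because some web page recommended it.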

I could see using this for data scrapes when only a small number of pages is involved, but for more advanced scrapes that require automation I'll continue to use R.
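For readers curious what an automated version of the same idea might look like, here is a minimal sketch in Python (not the R code I actually use). It pulls the n-th table out of a page's HTML using only the standard library. The names TableExtractor, extract_table and SAMPLE_PAGE are made up for illustration, the sample HTML is invented data, and the parser assumes tables are not nested inside one another.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text.

    Assumption: tables are not nested inside other tables.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>, in page order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _end_cell(self):
        # Store the finished cell's text in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _end_row(self):
        # Store the finished row in the most recent table.
        self._end_cell()
        if self._row is not None and self.tables:
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._end_row()    # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()   # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()


def extract_table(html, index=0):
    """Return the table at position `index` (0-based) as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables[index]


# Invented sample page, standing in for a downloaded cheat sheet.
SAMPLE_PAGE = """
<html><body>
<table><tr><th>Player</th><th>Points</th></tr>
<tr><td>Player A</td><td>300</td></tr></table>
<table><tr><td>other</td></tr></table>
</body></html>
"""

for row in extract_table(SAMPLE_PAGE, 0):
    print(row)
```

To run this over real pages, the HTML for each URL could be fetched with urllib.request.urlopen and passed to extract_table in a loop. Whether 0 is the right index for a given page is something to check by hand first, just as with the spreadsheet formula.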
