This blog entry documents my recent (successful) attempt to use Simon Urbanek's Rserve and FastRWeb for CGI scripting with R. This is a working blog entry and will be updated or replaced as needed (last updated 4:15 PM 10/6/2011).
#### Helpful documentation:
http://rforge.net/FastRWeb/
http://urbanek.info/research/pub/urbanek-iasc08.pdf
http://www.rforge.net/Rserve/
http://cran.r-project.org/web/packages/Rserve/
(Plus personal communications with Simon, the results of which are included in the summary below.)
#### The steps used (your configuration probably varies):
0. Ubuntu Linux, 64-bit, Version 10.04 LTS (plus updates). I did the following steps as root, but will return to security issues below.
1. I did a fresh installation of the apache2 web server. I noted that the default location of the cgi-bin (used later) is /usr/lib/cgi-bin; yours may vary. I confirmed that this was up and running and that I could use the toy CGI script foo.cgi placed in the cgi-bin:
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print "<html>Hello World</html>";
To test this I pointed my browser to http://localhost/cgi-bin/foo.cgi; if there are problems, consult your system administrator or do detective work (probably in the log files, /var/log/apache2 on my system). Do not continue until you have Hello World working!
2. I did a fresh installation of R, version 2.13.2, using the required --enable-R-shlib option to configure.
3. I installed R packages Rserve, Cairo, FastRWeb, and (though not required) XML (this required installing some libxml2... package in Ubuntu, first, but again is NOT required for Rserve/FastRWeb).
4. After installing FastRWeb, I went into the inst directory of the package and ran the install.sh script; this created /var/FastRWeb, used extensively below.
5. I went into /var/FastRWeb/code and examined the files; in a slightly older version of FastRWeb I commented out a few lines, but the current (10/6/2011) version removed that need for me.
6. I fired up R, and per Simon's instructions did the following:
system.file("cgi-bin", package="FastRWeb")
This revealed the location of a binary called Rcgi. I copied this into /usr/lib/cgi-bin, and renamed it R (instead of Rcgi).
7. Finally, I created a file /var/FastRWeb/web.R/foo.png.R:
# foo.png.R:
run <- function(n = 100, ...) {
  n <- as.integer(n)
  p <- WebPlot(800, 600)
  plot(rnorm(n), rnorm(n), pch = 19, col = 2)
  p
}
8. I tested it with the URL: http://localhost/cgi-bin/R/foo.png?n=500
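As far as I understand it, values from the query string reach run() as character strings, which is why the script above calls as.integer(n). A stricter version might look like the sketch below. The helper parse_n, its default, and its upper bound are hypothetical additions of mine, not part of FastRWeb.

```r
# Hypothetical helper (not part of FastRWeb): turn a query-string value into
# a bounded positive integer, falling back to a default on bad input.
parse_n <- function(n, default = 100L, max_n = 10000L) {
  # as.integer() warns and returns NA for non-numeric text; silence the warning
  n <- suppressWarnings(as.integer(n[1]))
  if (length(n) != 1L || is.na(n) || n < 1L) return(default)
  min(n, max_n)
}

# Same script as above, with the stricter parsing swapped in.
# WebPlot() comes from FastRWeb and is only looked up when run() is called.
run <- function(n = 100, ...) {
  n <- parse_n(n)
  p <- WebPlot(800, 600)
  plot(rnorm(n), rnorm(n), pch = 19, col = 2)
  p
}
```

With this in place, a request such as foo.png?n=abc would fall back to the default of 100 points rather than passing NA to rnorm().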
#### Security Issues
I have a feeling that if you have a "trusted machine" without user access, the steps above may not technically pose security risks (even as root); but they do not represent good security practices and *would* introduce security risks on shared servers. For my purposes, I added to the beginning of /var/FastRWeb/code/rserve.conf:
gid 33
uid 33
because www-data (uid and gid 33) is the user that apache2 runs as on my system, and it seemed like a reasonable choice. For good measure, I also changed the ownership of everything in /var/FastRWeb (run from within that directory):
chown www-data:www-data .
chown -R www-data:www-data ./*
Finally, I set
sockmod 0660
umask 0007
based on Simon's recommendation for further security. To stop Rserve and FastRWeb:
killall -INT Rserve
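To double-check the chown step above, something like the following could be run from R. This is only a sketch: not_owned_by is a hypothetical helper of my own, it relies on the uid column that file.info() reports on Unix-alikes, and the path and uid in the example call are simply the ones from my setup.

```r
# Hypothetical helper: return every path under `root` (including `root` itself
# and any dotfiles) whose owner uid differs from `uid`.
not_owned_by <- function(root, uid) {
  paths <- c(root,
             list.files(root, recursive = TRUE, full.names = TRUE,
                        all.files = TRUE, include.dirs = TRUE))
  info <- file.info(paths)  # includes a `uid` column on Unix-alikes
  paths[is.na(info$uid) | info$uid != uid]
}

# Example call, using the directory and uid from this post:
# not_owned_by("/var/FastRWeb", 33)
```

An empty result means everything is owned by the expected user. Since the shell glob ./* skips top-level dotfiles, a check like this is one way to notice anything the chown missed.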
I will try to blog about things that interest me, including data sources and data analysis, travel tips (hot restaurants are good data, right?), and more.
Thursday, October 6, 2011
Monday, September 26, 2011
The Inaugural "Least Interesting Stat" Award
I hereby give the first award to the Yale Daily News for its sports page caption, Monday, September 26, 2011:
"STAT OF THE DAY 4: THE NUMBER OF YEAR SINCE THE FOOTBALL TEAM HAS SCORED 70 POINTS AFTER THE FIRST TWO GAMES OF THE SEASON. The Bulldogs have scored 74 points after two weeks, a total that was last matched in 2007, when Yale put up 79 in what would become a 9-1 season."
For a slightly more invigorating use of statistics and Yale football, see my Yale-Harvard graphical exploration. I need to update it with the last few years of results.
Sunday, September 4, 2011
New York Predictive Analytics Talk
I'll be giving an evening talk at the New York Predictive Analytics World, http://www.predictiveanalyticsworld.com/newyork/2011/. The rough plan:
This talk will touch upon topics in data analysis, statistics, and computing relating to modern massive data challenges. How do classical theories in statistical inference and asymptotics translate into statistical practice in the modern world? What role should complex Bayesian procedures and other cutting-edge methodologies have in the data analyst toolkit? Computationally, how can we manage the data deluge and how is statistical software evolving? What are the implications for the data analyst? What are the dangers posed by addressing these very questions? I'll suggest possible answers to some of these questions, and hope to spur further debate by posing others.
Wednesday, August 17, 2011
Blogs on Trade and the Environment
http://environment.yale.edu/envirocenter/
The blog posts on the Yale Center for Environmental Law & Policy site discuss issues arising from our recent study of linkages between trade and the environment.
Tuesday, August 16, 2011
Fantasy Football 2011
It's that time of year again! Yesterday I scraped some ranking and points projection data from http://fftoolbox.com.
I was interested in how the projected points declined with rank, across the player positions. The plot, below, helps explain why running backs are selected ahead of wide receivers, for example: the decline in production of wide receivers is much more shallow than for running backs. You get hurt less (in expectation) by taking lower-ranked wide receivers than you do by taking lower-ranked running backs. What I'd really like to do is integrate weekly variation into the analysis... but this requires a more substantial data scrape than I had time for.
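One rough way to put a number on "how steeply points decline with rank" is a per-position linear fit. The sketch below assumes a data frame with columns named position, rank, and points. Those column names are hypothetical, not the layout of the scraped data, and a straight line is only a crude summary of the curves in the plot.

```r
# Sketch: slope of projected points against rank, fitted separately for each
# position. Column names `position`, `rank`, `points` are assumptions.
decline_by_position <- function(d) {
  sapply(split(d, d$position), function(g) {
    coef(lm(points ~ rank, data = g))[["rank"]]
  })
}

# Usage, given a data frame `proj` in that layout:
# decline_by_position(proj)
```

A more negative slope for one position than another would correspond to the steeper drop-off described above.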
Monday, August 15, 2011
Using "Google Docs" to scrape HTML tables from web pages
One of my students suggested I try this... so I did. In Google Docs, create a new spreadsheet. In the first cell, type something of the form:
=ImportHtml("http://the-url-goes-here", "table", 0)
My first attempt was scraping some fantasy football points projections:
=ImportHtml("http://www.fftoolbox.com/football/2011/cheatsheets.cfm?player_pos=QB", "table", 0)
Bingo. At least, it worked for me on the 8 pages I tried. I used 0 as the third argument because some web page recommended it.
I could see using this for data scrapes when a small number of pages are involved, but for more advanced scrapes that require automation I'll continue to use R.
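For comparison, the automated R route might look roughly like this, using readHTMLTable() from the XML package. It is a sketch under assumptions: scrape_tables is a hypothetical wrapper of my own, and which table on a page is the right one has to be found by inspection (R counts tables from 1).

```r
# Sketch: pull one HTML table from each of several pages with the XML package.
# `sources` may be URLs, file names, or already-parsed documents.
scrape_tables <- function(sources, which = 1) {
  lapply(sources, function(src) {
    XML::readHTMLTable(src, which = which, stringsAsFactors = FALSE)
  })
}

# Usage with the page from this post (the table index is a guess):
# qb <- scrape_tables(
#   "http://www.fftoolbox.com/football/2011/cheatsheets.cfm?player_pos=QB"
# )[[1]]
```

Passing a vector of URLs instead of a single one is what makes this preferable to the spreadsheet approach once many pages are involved.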