How Long Will it Take to Scrape Every Professor’s Individual Page on RMP?

For the Rate My Professors group project, my group, Cumulus, decided to use a weighting scheme instead of a thresholding method. To weight Easiness and Quality, I needed the number of ratings in each star category (5 star, 4 star, 3 star, 2 star, 1 star), which meant accessing every professor’s page on Rate My Professors in the Big Ten. In class, we were discussing how long it would take to scrape each professor’s page on RMP, and we decided it was somewhere on the order of less than an hour. Well, let’s just say that was a little off…by a factor of 10.
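Our exact weighting formula isn’t spelled out in this post, but to give a flavor of the idea, here is a minimal sketch: the per-star counts produce an average score, and a professor with more ratings earns more weight. All names are hypothetical, the saturation constant is arbitrary, and Quality is computed as a plain mean of Clarity and Helpfulness purely for illustration (the actual calculation is only hinted at in the comments below).

    # Hypothetical sketch -- not the actual Cumulus formula.

    def weighted_average(star_counts):
        """Average rating from per-star counts, e.g. {5: 12, 4: 7, 3: 4, 2: 2, 1: 3}."""
        total = sum(star_counts.values())
        if total == 0:
            return None
        return sum(stars * count for stars, count in star_counts.items()) / total

    def confidence_weight(num_ratings, saturation=20):
        """Weight that approaches 1 as the rating count grows; 20 is arbitrary."""
        return num_ratings / (num_ratings + saturation)

    # Example: Quality from Clarity and Helpfulness, scaled by rating volume.
    clarity_counts = {5: 12, 4: 7, 3: 4, 2: 2, 1: 3}
    helpfulness_counts = {5: 10, 4: 9, 3: 5, 2: 2, 1: 2}
    quality = (weighted_average(clarity_counts) + weighted_average(helpfulness_counts)) / 2
    score = quality * confidence_weight(sum(clarity_counts.values()))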

Right now, my scraper is averaging an hour per university, meaning that scraping the entire Big Ten will take roughly 12 hours. It is currently on its third university and has been running for two hours and 24 minutes. I haven’t taken any classes that discuss the efficiency of code, so there might be a way to scrape the data faster, but I imagine it would still take a long time (my internet connection may play a role in this too, but at roughly 5-10 MB/s on average I would say it’s acceptable). With all this taken into consideration, I don’t think I ever quite understood the scale of our scraper, or the amount of data it was scraping, until I changed the scraper so that it visits every professor’s page.

In addition, the magnitude of a mistake in your code is much larger. On my first iteration, my code ran for roughly 40 minutes and then died with an error because my stem_list.txt file was not in the same directory as the Python file when it was called. Because I wasn’t writing the scraped information to txt files right away and was instead storing it in lists, I wasted 40 minutes of scraping. (A sketch of the write-as-you-go fix is below.)

I know this isn’t the most interesting blog post in the world, but I thought members of our class might find it interesting how long the code ran once it visited each professor’s page rather than 26 pages per university.
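For anyone who hits the same failure, here is a minimal sketch of the fix: flush each professor’s results to disk as soon as they are scraped instead of holding everything in a list. scrape_professor is a stand-in for the real scraping logic, and the file name is made up.

    def scrape_professor(url):
        # Stand-in for the real scraping logic; returns one line of rating counts.
        return url + ",<rating counts>"

    def scrape_all(prof_urls, out_path="ratings.txt"):
        # Append mode, so a restarted run keeps earlier results.
        with open(out_path, "a") as out:
            for url in prof_urls:
                out.write(scrape_professor(url) + "\n")
                out.flush()  # a crash now costs one page, not 40 minutes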


5 Responses to How Long Will it Take to Scrape Every Professor’s Individual Page on RMP?

  1. dkoleanb says:

    Update: Total run time was 11 hours and 10 minutes.

    Accessed: 32,227 pages

  2. blevz says:

    Did you try running it on an EC2 instance? I wonder how much that would speed it up.
    Also, if you could parallelize it, you could run all of the universities at the same time (one way to combine that with throttling is sketched after these comments).

  3. dkoleanb says:

    I thought about parallelizing it, but I was worried that I would set off RMP’s anti-DOS measures if I made too many requests in such a short period of time.

    I actually got this error for the first school I ran:
    IOError: [Errno socket error] [Errno 54] Connection reset by peer

    But, I’m pretty sure it was just a random hiccup–I tried again after a half hour or so, and it worked just fine (see the throttle-and-retry sketch after these comments).

  4. blevz says:

    Did you store the text of the comments or just the individual ratings?

  5. dkoleanb says:

    Unfortunately, I did not store the text of the comments–I have no idea how to write code that analyzes syntax (i.e., I don’t know how I would write something that could tell the difference between “This professor was a bad teacher” and “There is no way this professor was a bad teacher”). So, I just captured individual ratings and passed them through a weighting algorithm (this was slowed down a little because it also had to calculate Quality from the Clarity and Helpfulness ratings before the weighting could begin).
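A note tying the two threads above together: individual requests can be spaced out and retried with a back-off when the connection is reset, while a small worker pool still runs a few universities at once. This is only a sketch of the idea, written in Python 3 (the IOError above suggests the original scraper was Python 2), with made-up delay values and a placeholder URL list.

    import time
    import urllib.error
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    def polite_fetch(url, delay=1.0, retries=3):
        """Fetch a page with a pause before each request and a back-off
        retry when the connection is reset. Delay values are guesses."""
        for attempt in range(retries):
            try:
                time.sleep(delay)  # keep the request rate modest
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except (ConnectionResetError, urllib.error.URLError):
                time.sleep(30 * (attempt + 1))  # wait longer after each failure
        raise RuntimeError("gave up on " + url)

    def scrape_university(prof_urls):
        return [polite_fetch(url) for url in prof_urls]

    # big_ten_prof_urls would hold one list of professor-page URLs per
    # university (hypothetical; the real URLs come from the listing pages).
    big_ten_prof_urls = [[]]  # placeholder
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(scrape_university, big_ten_prof_urls))

Three workers is an arbitrary compromise; since RMP’s actual tolerance for concurrent requests is unknown, any parallelism would need to stay conservative.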
