For the Rate My Professors group project, my group, Cumulus, decided to use a weighting scheme instead of a thresholding method. In order to weight Easiness and Quality, I needed the number of ratings in each category (5 star, 4 star, 3 star, 2 star, 1 star), which meant accessing every professor's page on Rate My Professors in the Big Ten. In class, we discussed how long it would take to scrape each professor's page on RMP, and we decided it was somewhere on the order of less than an hour. Well, let's just say that was a little off…by a factor of 10.
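To give a sense of why the per-star counts matter, here is a minimal sketch of the weighting idea. The function name and the example numbers are hypothetical, not the exact formula our group used; the point is just that the counts feed a weighted score rather than a hard cutoff.

```python
# Hypothetical sketch: turn per-star rating counts into a single weighted score
# instead of applying a threshold. Not the exact formula our group used.

def weighted_score(star_counts):
    """star_counts maps a star value (1-5) to the number of ratings."""
    total = sum(star_counts.values())
    if total == 0:
        return None  # professor has no ratings yet
    return sum(stars * count for stars, count in star_counts.items()) / total

# Example: a professor with mostly 4- and 5-star Quality ratings
quality_counts = {5: 12, 4: 7, 3: 2, 2: 1, 1: 0}
easiness_counts = {5: 3, 4: 6, 3: 8, 2: 4, 1: 1}

print(weighted_score(quality_counts))   # ~4.36
print(weighted_score(easiness_counts))  # ~3.27
```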
Right now, my scraper is averaging an hour per university, meaning that scraping the entire Big Ten will take roughly 12 hours. It is currently on its third university and has been running for two hours and 24 minutes. I haven't taken any classes that discuss the efficiency of code, so there might be a way to scrape the data faster, but I imagine it would still take a long time (my internet connection may play a role in this too, though it's acceptable at roughly 5-10 MB/s on average). With all of this taken into consideration, I don't think I ever quite understood the magnitude of our scraper or the amount of data it was scraping until I changed the scraper so that it visits every professor's page.

In addition, the cost of a mistake in your code is much larger: on my first iteration, my code ran for roughly 40 minutes before I received an error caused by my stem_list.txt file not being in the same directory as the Python file when it was called. Because I wasn't writing the scraped information to txt files right away and was instead storing it in lists, I wasted 40 minutes of scraping. I know this isn't the most interesting blog post in the world, but I thought members of our class might find it interesting how long the code ran once it visited each professor's page rather than 26 pages per university.
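For anyone who wants to avoid my 40-minute mistake, here is a rough sketch of the fix: append each professor's results to a file as soon as they are scraped, so a crash partway through doesn't throw away everything collected so far. The function names are hypothetical placeholders, not my actual scraper code.

```python
# Hypothetical sketch: write each scraped row immediately instead of
# accumulating everything in lists, so an error mid-run only loses one page.

import csv

def scrape_professor(url):
    """Placeholder for the real scraping logic; returns (name, star-count dict)."""
    raise NotImplementedError

def scrape_university(professor_urls, out_path):
    # Open in append mode so a restarted run keeps the rows already written.
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for url in professor_urls:
            name, counts = scrape_professor(url)
            # Write the row right away: name, then 5-star down to 1-star counts.
            writer.writerow([name] + [counts.get(stars, 0) for stars in range(5, 0, -1)])
            f.flush()  # make sure the row actually reaches the disk
```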