Wednesday, July 15, 2015

If I HATE web lists so much - why do I click on them?!

I hate web lists.  I really do.  Especially the ones that list things ONE-AT-A-TIME so you have to load a page that is 90% ads. Irritating!  But I saw one about "America's Top 100 Colleges". I work at a university - I wondered if we made the top 100.  So, I took the bait.

First I am sent to a page telling me what the list is about WITH LOTS OF ADS and a NEXT button.  I click NEXT. OK, number 1: lots of ads and another NEXT button.  This has already gotten old for me.  I was just curious what the top 100 colleges were - can't they just show me a list?!

So, I think to myself, "I wonder if I could just harvest the list from the website with Perl."  I look at the URL - I see what they're doing - simply incrementing the folder by 1 each time.  Easy!  Next I look at the source code: TOO EASY!  The STRONG tag appears only once per page, around the school name.  Cool!  So, two minutes of code and then: Tah-dah!


#!/usr/bin/perl -w
# Remember - this took two minutes, I know it's not perfect

use LWP::Simple;

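my $rank=100;
# Each page of the list lives in a numbered folder under this base URL
my $base="http://www3.forbes.com/forbeswoman/americas-top-100-colleges/";
while ($rank >0) {
    my $url = $base . "/$rank/";
    my $html = get($url);    # get() returns undef if the fetch fails
    # The school name is the only STRONG tag on the page
    (defined $html and $html =~ m{<p><strong>(.*)</strong></p>})
        or die "Cannot find school: $rank\n";
    my $school = $1;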
    print "$rank: $school\n";
    $rank--;
}
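
If you want to poke at the matching step without hammering the site, here is a minimal sketch that pulls the extraction into its own sub.  The sub name extract_school and the sample HTML are made up for illustration (they are not from the real pages), and it assumes the same one-STRONG-tag-per-page layout described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text inside the first
# <p><strong>...</strong></p> pair on one line, or undef if there is
# no such pair (or if the fetch gave us nothing to look at).
sub extract_school {
    my ($html) = @_;
    return unless defined $html;
    return $html =~ m{<p><strong>(.*)</strong></p>} ? $1 : undef;
}

# Made-up sample page, NOT copied from the real site.
my $sample = "<html><body><div>ads</div>"
           . "<p><strong>Example College</strong></p></body></html>";

my $school = extract_school($sample);
print defined $school ? "$school\n" : "no match\n";
```

Returning undef instead of dying lets the caller decide whether a missing name should stop the whole run or just skip that page.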

I run it.  It doesn't work (I think).  It isn't changing the name.  Hmmm.  I keep killing it off before it finishes and examine the code.  I cannot see a problem, so I let it run - this time all the way to the end.  It does NOT change the name until it counts down to 50.  Wild!  I check the site - sure enough - there are no new schools after 50.  This is a list of the (alleged) Top 50 Colleges going under the name Top 100 Colleges.
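
Rather than eyeballing the output for repeats, the check could be done in code.  This is only a rough sketch: the sub name repeated_ranks is hypothetical, and it assumes the input is "rank: name" lines in the same format the script prints.  The sample lines are four lines taken from the script's output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: given "rank: name" lines, return the ranks whose
# name is identical to the name on the line just before them.
sub repeated_ranks {
    my @lines = @_;
    my ($prev, @repeats);
    for my $line (@lines) {
        # Skip anything that is not a "rank: name" line
        my ($rank, $name) = $line =~ /^\s*(\d+):\s*(.*?)\s*$/ or next;
        push @repeats, $rank if defined $prev && $name eq $prev;
        $prev = $name;
    }
    return @repeats;
}

my @sample = (
    "52: Williams College ",
    "51: Williams College ",
    "50: Stanford University ",
    "49: Swarthmore College ",
);

print "Repeated at ranks: ", join(", ", repeated_ranks(@sample)), "\n";
# prints "Repeated at ranks: 51"
```

A long run of repeated ranks is the hint that the site is serving the same page under different numbers.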

Ha!  I guess they never thought anyone would have the patience to wade through all of those ads to see all 100 (I sure didn't).  This makes me wonder: Is it just a scam to get you to click through ads, or did the content creator tell his boss he put all 100 pages up there, knowing NO ONE would ever click through all of that?  If it's the latter - sorry buddy (or lady) for exposing you.

3 comments:

  1. This comment has been removed by the author.

  2. The STRONG tag in the regex got messed up (it was HTML - notice how bold my (.*) is!)

  3. The results:
    slackvm% ./TopColleges.pl
    100: Williams College 
    99: Williams College 
    98: Williams College 
    97: Williams College 
    96: Williams College 
    95: Williams College 
    94: Williams College 
    93: Williams College 
    92: Williams College 
    91: Williams College 
    90: Williams College 
    89: Williams College 
    88: Williams College 
    87: Williams College 
    86: Williams College 
    85: Williams College 
    84: Williams College 
    83: Williams College 
    82: Williams College 
    81: Williams College 
    80: Williams College 
    79: Williams College 
    78: Williams College 
    77: Williams College 
    76: Williams College 
    75: Williams College 
    74: Williams College 
    73: Williams College 
    72: Williams College 
    71: Williams College 
    70: Williams College 
    69: Williams College 
    68: Williams College 
    67: Williams College 
    66: Williams College 
    65: Williams College 
    64: Williams College 
    63: Williams College 
    62: Williams College 
    61: Williams College 
    60: Williams College 
    59: Williams College 
    58: Williams College 
    57: Williams College 
    56: Williams College 
    55: Williams College 
    54: Williams College 
    53: Williams College 
    52: Williams College 
    51: Williams College 
    50: Stanford University 
    49: Swarthmore College 
    48: Princeton University 
    47: Massachusetts Institute of Technology 
    46: Yale University 
    45: Harvard University 
    44: Pomona College 
    43: United States Military Academy 
    42: Amherst College 
    41: Haverford College 
    40: University of Pennsylvania 
    39: Brown University 
    38: Bowdoin College 
    37: Wesleyan University 
    36: Carleton College 
    35: University of Notre Dame 
    34: Dartmouth College
    33: Northwestern University 
    32: Columbia University 
    31: California Institute of Technology 
    30: Davidson College 
    29: Duke University
    28: University of Chicago 
    27: Tufts University 
    26: Vassar College
    25: United States Naval Academy 
    24: Georgetown University 
    23: Wellesley College 
    22: Middlebury College 
    21: Cornell University 
    20: Rice University 
    19: Washington and Lee University 
    18: United States Air Force Academy 
    17: Barnard College 
    16: Boston College 
    15: University of California, Berkeley 
    14: Colgate University 
    13: Colby College 
    12: University of Virginia 
    11: College of William and Mary 
    10: Kenyon College 
    9: Oberlin College 
    8: University of California, Los Angeles 
    7: University of Michigan, Ann Arbor
    6: Reed College 
    5: Whitman College 
    4: Lafayette College 
    3: Smith College 
    2: University of North Carolina, Chapel Hill
    Cannot find school: 1
