Open Data Analytics Example¶

I recently came upon this article depicting a introductory exercise of open data analytics using Python (scraping web, extracting data with regex and visualization). Fun learning indeed. But then I thought how much much time and how many lines of code it would take to do the same thing with SpectX? Let’s find out.

So, here’s the web page containing the annual salaries of Cleveland colleges presidents. Our goal is to answer the question: how much tuition do I pay for five minutes of my college president’s time?

First thing to do is to develop the pattern to extract presidents and their salaries. Open Data Browser, paste in the url and press Prepare Pattern. Quickly scrolling trough data in Data Editor I can see that the table with salaries is not there, likely because it is not within the first 16kB of html page. Well, we don’t need the whole page, all we need is just the list so let’s copy it from page source (cmd+U in FireFox):

<p>Following is total compensation for other presidents at private colleges in Ohio in 2015:</p>
<ul>
<li>Grant Cornwell, College of Wooster (left in 2015): $911,651</li>
<li>Marvin Krislov, Oberlin College (left in 2016): &nbsp;$829,913</li>
<li>Mark Roosevelt, Antioch College, (left in 2015): $507,672</li>
<li>Laurie Joyner, Wittenberg University (left in 2015): $463,504</li>
<li>Richard Giese, University of Mount Union (left in 2015): $453,800</li>
<li>Sean Decatur,Kenyon College: $451,698</li>
<li>Adam Weinberg, Denison University: $435,322</li>
<li>Daniel Dibiasio, Ohio Northern University: $414,716</li>
<li>Denvy Bowman, Capital University (left in 2016): $388,570</li>
<li>Anne Steele, Muskingum University (left in 2016): $384,233</li>
<li>Kathy Krendl, Otterbein University: &nbsp;$378,035</li>
<li>Rockwell Jones, Ohio Wesleyan University: $366,625</li>
<li>Robert Helmer, Baldwin Wallace University: $365,616</li>
<li>Robert Huntington, Heidelberg University: $300,005</li>
<li>Lori Varlotta, Hiram College: $293,336</li>
<li>Joseph Bruno, Marietta College (left in 2016): $288,295</li>
<li>W. Richard Merriman Jr., University of Mount Union (started in June 2015): $221,761</li>
</ul>
<p>The Rev. Robert Niehoff, president of John

This seems to be the only list in this page, therefore we can simply match list items, i.e stuff between <li> and </li> tags:

'<li>'                  //pattern begins with list item opening tag
LD:president ','        //LD matches everything until next matcher in the line (here comma), i.e we get president's
                        //name. Export as 'president'
LD:college              //next is the college name
('(' LD ')')?           /*leaving notes are between parentheses. Not all records have this,
                          therefore enclose it in the sequence group and make it optional.
                          Since we're not really interested in leaving notes let's not export it.*/
':'                     //college name is matched until colon
LD '$'                  //skip everything until $
CDOUBLE:salary          //parse salary as double value from integer value with thousands separator comma
'</li>' EOL             //list item closing tag followed by line feed

Cool, Parse Preview shows all needed list items are matched, Results shows all elements from items are parsed correctly. Now copy the pattern to clipboard and go to Data Browser again. It opens conveniently at last navigated uri, so just press Prepare Query. Now replace the pattern (paste from clipboard):

$pattern = <<<PATTERN   //let's use heredoc syntax for declaring pattern string
'<li>'                  //pattern begins with list item opening tag
LD:president ','        //LD matches everything until next matcher in the line (here comma), i.e we get president's name. Export as 'president'
LD:college              //next is the college name
('(' LD ')')?           /*leaving notes are between parentheses. Not all records have this,
                          therefore enclose it in the sequence group and make it optional.
                          Since we're not really interested in leaving notes let's not export it.*/
':'                     //college name is matched until colon
LD '$'                  //skip everything until $
CDOUBLE:salary          //parse salary as double value from integer value with thousands separator comma
'</li>' EOL             //list item closing tag followed by line feed
PATTERN;

@stream  = PARSE(pattern:$pattern, src:'http://www.cleveland.com/metro/index.ssf/2017/12/case_western_reserve_university_president_barbara_snyders_base_salary_and_bonus_pay_tops_among_private_colleges_in_ohio.html');

@stream
 .select(president, college, salary:salary*1000)    //as we parsed salary as double we need to multiply it by 1000
                                                    //to get correct value
 .sort(salary DESC)                                 //sort results by salary amount in descending order

Executing query script gives us:

president college salary
Grant Cornwell
College of Wooster
911651,0
Marvin Krislov
Oberlin College
829913,0
Mark Roosevelt
Antioch College,
507672,0
Laurie Joyner
Wittenberg University
463504,0
Richard Giese
University of Mount Union
453800,0
Sean Decatur Kenyon College 451698,0
Adam Weinberg
Denison University
435322,0
Daniel Dibiasio
Ohio Northern University
414716,0
Denvy Bowman
Capital University
388570,0
Anne Steele
Muskingum University
384233,0
Kathy Krendl
Otterbein University
378035,0
Rockwell Jones
Ohio Wesleyan University
366625,0
Robert Helmer
Baldwin Wallace University
365616,0
Robert Huntington
Heidelberg University
300005,0
Lori Varlotta
Hiram College
293336,0
Joseph Bruno
Marietta College
288295,0
  1. Richard Merriman Jr.
University of Mount Union
221761,0

Pressing Chart displays nicely bar chart of salaries:

../../../_images/cleveland_college_presidents-1.png

To compute the value of 5 minutes of your president’s time (assuming 2000 working hours per year) we add another select command performing simple arithmetic (divide annual salary by 2000 to get hourly value, then by 12 to get 5 min value, followed by rounding and casting to get clean and nice output):

@stream
 .select(president, college, salary:salary*1000)    //as we parsed salary as double we need to multiply it by 1000
                                                    //to get correct value
 .sort(salary DESC)                                 //sort results by salary amount in descending order
 .select(president ||', '|| college as president_college,
         '$' || STRING(LONG(ROUND(salary/2000)/12)) as president_5min_value)
president_college president_5min_value
Grant Cornwell, College of Wooster $38
Marvin Krislov, Oberlin College $34
Mark Roosevelt, Antioch College, $21
Laurie Joyner, Wittenberg University $19
Richard Giese, University of Mount Union $18
Sean Decatur, Kenyon College $18
Adam Weinberg, Denison University $18
Daniel Dibiasio, Ohio Northern University $17
Denvy Bowman, Capital University $16
Anne Steele, Muskingum University $16
Kathy Krendl, Otterbein University $15
Rockwell Jones, Ohio Wesleyan University $15
Robert Helmer, Baldwin Wallace University $15
Robert Huntington, Heidelberg University $12
Lori Varlotta, Hiram College $12
Joseph Bruno, Marietta College $12
  1. Richard Merriman Jr., University of Mount Union
$9

Full script (for copy, paste and execute):

1
2
3
4
5
6
7
8
9
$pattern = <<<PATTERN
    '<li>' LD:president ',' LD:college ('(' LD ')')? ':' LD '$' CDOUBLE:salary '</li>' EOL
PATTERN;

PARSE(pattern:$pattern, src:'http://www.cleveland.com/metro/index.ssf/2017/12/case_western_reserve_university_president_barbara_snyders_base_salary_and_bonus_pay_tops_among_private_colleges_in_ohio.html')
 .select(president, college, salary:salary*1000)    //as we parsed salary as double we need to multiply it by 1000 to get correct value
 .sort(salary DESC)                                 //sort results by salary amount in descending order
 .select(president ||', '|| college as president_college,
     '$' || STRING(LONG(ROUND(salary/2000)/12)) as president_5min_value)

Time spent on developing was about ~10 minutes (about an hour to write it up). And the whole script can easily fit under 10 lines when we remove comments from pattern and convert it to one line.