Wednesday, June 2, 2010

"Now seems like a good time," I said to myself...

..."to get those rusty programming skills going."

I had found myself wanting to do some analysis in Excel of the price behavior of a large list of stocks.

I glanced at the first few pages of Perl and LWP, and then at the Regular Expressions Pocket Reference; I opened Firebug on the Yahoo Finance "summary page" for a stock I was interested in, so that I could see the raw HTML I was dealing with; and wrote the following:[1]

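use LWP::Simple;

# Expects a list of security symbols on standard input, one per line.

# The quote-page address is not preserved in this listing. $quote_url is a
# placeholder for the Yahoo Finance summary-page URL, assumed here to be a
# prefix to which the symbol is appended.
$quote_url = "";

while ($sym = <>) {
    chomp $sym;
    $summary = get($quote_url . $sym);
    die "Couldn't get Yahoo Finance Quote Summary page for symbol $sym!"
        unless defined $summary;

    $summary =~ m/>Prev Close:<.*?>(\d+\.\d+)</;
    $prevclose = $1;
    $summary =~ m/>Open:<.*?>(\d+\.\d+)</;
    $open = $1;
    $summary =~ m/>Last Trade:<.*?>(\d+\.\d+)</;
    $last = $1;

    # The output line is an assumption: one comma-separated row per symbol.
    print "$sym,$prevclose,$open,$last\n";
}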


It worked the first time—not bad for not having done any programming whatsoever for about five years and nothing of significant size for ten. (Yes, I know it's not very idiomatic Perl—combining the match regexps and doing a few other things would probably cut the line count in half.) That code took a few hours to produce, but subsequent similar programs to web-scrape other pages took much less time, now that I was in the groove.
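As a rough sketch of what "combining the match regexps" might look like, the three nearly identical matches can be driven from a list of labels. Only the three labels and the shape of the pattern come from the script above. The `extract_quote` helper, the `%quote` hash, and the sample HTML string (a made-up stand-in for a fetched page, not real Yahoo Finance markup) are assumptions for illustration.

```perl
use strict;
use warnings;

# Labels taken from the script above; everything else is illustrative.
my @labels = ("Prev Close", "Open", "Last Trade");

# Hypothetical helper: returns label => price pairs for one page.
# A label that is not found maps to undef, rather than silently
# reusing the $1 left over from an earlier successful match.
sub extract_quote {
    my ($html) = @_;
    my %quote;
    for my $label (@labels) {
        # Same pattern shape as the original, with the label interpolated.
        ($quote{$label}) = $html =~ m/>\Q$label\E:<.*?>(\d+\.\d+)</;
    }
    return %quote;
}

# Made-up stand-in for a fetched summary page (not real Yahoo markup).
my $sample = '<th>Prev Close:</th><td>12.34</td>'
           . '<th>Open:</th><td>12.50</td>'
           . '<th>Last Trade:</th><td><b>12.41</b></td>';

my %q = extract_quote($sample);
print join(",", map { defined $q{$_} ? $q{$_} : "" } @labels), "\n";
# prints "12.34,12.50,12.41"
```

In the real script, the string passed to `extract_quote` would be the page returned by `get`. Matching in list context is what makes a failed match yield an undefined value here instead of a stale one.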

For example, more exciting was the following, which expects the same list of symbols:

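use LWP::Simple;

# Expects a list of security symbols on standard input, one per line.
# The chart address is not preserved in this listing; $chart_url is a
# placeholder for the Yahoo Finance chart URL prefix.
$chart_url = "";

print "<html>";

while ($sym = <>) {
    chomp $sym;

    print "<font size=5>$sym</font><br>";

    foreach $period ("1d", "1w", "1m") {
        # Save each chart locally as <symbol><period>.png, then embed it.
        # Any query parameters after "p=e5,e20&" are also not preserved.
        getstore($chart_url .
                 "s=$sym&t=$period&q=l&l=on&z=m&p=e5,e20&",
                 "$sym$period.png");
        print "<img src=\"$sym$period.png\"/>";
    }

    print "<br><br>\n";
}

print "</html>";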

What I'm doing here, if it isn't clear, is scraping a number of security price charts from Yahoo Finance, saving the image files locally, and building a crude but effective web page to make them viewable in one place. Beats looking at each stock by hand for price trends, let me tell you!

Now, all of this may seem like "Hello World" stuff to anyone reading this who's had any programming experience beyond Computer Science 101. But what I think shouldn't be taken for granted is the amazing ability (taking the first script as the simpler example) to suck an entire web page into a string variable, search that string in a complex way, and output the result in a format readable by humans and other programs alike, all in just a few lines of code. We used to want applications to have built-in programming languages; now (and here's the takeaway!) we have programming languages with built-in applications: very-high-level functionality to do things that only applications used to be able to do. And we can do them in a scriptable, redirectable, programmatic way. Admittedly, much of this is due to the straightforward API of the LWP module (Perlspeak for "library"), but I'd argue that that accessibility is a function of the design of Perl: there's obviously stuff going on behind the scenes that would be much harder to write in a language without such integral support for string manipulation (C, say).

I was weaned as a programmer on 1980s consumer 8-bit machines, the multimedia powerhouses of their day, on which even in a high-level language (built-in BASIC), to do anything interesting you had to twiddle bits. And most of the software I've been paid to write has been low-level stuff in C—device drivers and the like. So I'm easily impressed and easily seduced by VHL (very-high-level) languages that let you do so much with so little typing. Of course there's danger inherent in only knowing high-level languages. When you don't understand what's really going on at the machine level, optimization can be much more difficult, for example. Ironically, though, even as undergraduate computer science programs deemphasize C and assembler skills and move their students towards Java, C#, .NET, PHP, and so on—preparing them more effectively for the kind of web-back-end-database-interface work 9 out of 10 of them will face as new programmers—even as this huge and largely unremarked shift in what it means to be a professional computer programmer takes place, hobbyists tinker with microcontrollers, programmed at as low a level as you want, to recapture some of that early-80s frontier-machine-code feeling. Some will call this retrograde or Luddite-ish but the truth is, I think, that controlling hardware directly with one's code fulfills some kind of deep need in the engineering personality to exercise maximum control over one's immediate universe; and there's nothing wrong with the practical experience gained thus: few programmers will ever write an operating system, true, but there will always be lesser software that needs to run "close to the metal." (A $5 pocket calculator will never run a Java interpreter, for instance. I think...)

Getting back to my own programming for my own use and profit, far more complicated and wonderful things will come in time. I'm comfortable using Perl for this kind of stuff, but have never written a program of any serious size in it. What little user-level software development I've done has been fairly strictly object-oriented code in C++. I only know how to use Perl procedurally; understanding the OOP features of the language, which seem to be highly regarded, would be a good thing to have under my belt.

On the other hand, I have had a strong hankering to learn Python, thanks to what seems to me to be a very elegant syntax. And I have the book A Primer on Scientific Programming with Python, which is an excellent tutorial for Python in general besides describing the appropriate libraries in detail. (I'm quite sure that somewhere on CPAN there's a module to support in Perl the same kind of computations I need to do: numerical integration and differentiation, curve fitting, linear regression, etc.) Lastly, the Beautiful Soup library looks like an even cleaner way to do web scraping.

  1. How the heck do you format code nicely (i.e. not just in a non-proportional font but also indented correctly, lines that overrun the margin indicated clearly, and with symbols correctly escaped) in idiomatic HTML these days? Yeah, I know there's the <pre> tag, but it doesn't help you with lines that run past the right edge of your text frame (or wherever your body text is going), and you still have to festoon your code with &whatever-entity tags to escape all the non-alphanumeric characters.
