#Time-stamp: "2001-03-10 23:19:11 MST" -*-Text-*-
# This document contains text in Perl "POD" format.
# Use a POD viewer like perldoc or perlman to render it.

=head1 NAME

HTML::Tree::Scanning -- article: "Scanning HTML"

=head1 SYNOPSIS

  # This is an article, not a module.

=head1 DESCRIPTION

The following article by Sean M. Burke first appeared in I<The Perl
Journal> #19 and is copyright 2000 The Perl Journal. It appears
courtesy of Jon Orwant and The Perl Journal. This document may be
distributed under the same terms as Perl itself.

=head1 Scanning HTML

-- Sean M. Burke

In I<The Perl Journal> issue 17, Ken MacFarlane's article "Parsing
HTML with HTML::Parser" describes how the HTML::Parser module scans
HTML source as a stream of start-tags, end-tags, text, comments, etc.
In TPJ #18, my "Trees" article kicked around the idea of tree-shaped
data structures. Now I'll try to tie it together, in a discussion of
HTML trees.

The CPAN module HTML::TreeBuilder takes the
tags that HTML::Parser picks out, and builds a parse tree -- a
tree-shaped network of objects...

=over

Footnote:
And if you need a quick explanation of objects, see my TPJ17 article "A
User's View of Object-Oriented Modules"; or go whole hog and get Damian
Conway's excellent book I<Object-Oriented Perl>, from Manning
Publications.

=back

...representing the structured content of the HTML document. And once
the document is parsed as a tree, you'll find the common tasks
of extracting data from that HTML document/tree to be quite
straightforward.

=head2 HTML::Parser, HTML::TreeBuilder, and HTML::Element

You use HTML::TreeBuilder to make a parse tree out of an HTML source
file, by simply saying:

  use HTML::TreeBuilder;
  my $tree = HTML::TreeBuilder->new();
  $tree->parse_file('foo.html');

and then C<$tree> contains a parse tree built from the HTML source from
the file "foo.html". The way this parse tree is represented is with a
network of objects -- C<$tree> is the root, an element with tag-name
"html", and its children typically include a "head" and "body" element,
and so on. Elements in the tree are objects of the class
HTML::Element.

So, if you take this source:

  <html><head><title>Doc 1</title></head>
  <body>
  Stuff <hr> 2000-08-17
  </body></html>

and feed it to HTML::TreeBuilder, it'll return a tree of objects that
looks like this:

               html
             /      \
         head        body
        /          /   |  \
     title    "Stuff"  hr  "2000-08-17"
       |
    "Doc 1"

This is a pretty simple document, but if it were any more complex,
it'd be a bit hard to draw in that style, since it'd sprawl left and
right. The same tree can be represented a bit more easily sideways,
with indenting:

  . html
     . head
        . title
           . "Doc 1"
     . body
        . "Stuff"
        . hr
        . "2000-08-17"

Either way expresses the same structure. In that structure, the root
node is an object of the class HTML::Element

=over

Footnote:
Well actually, the root is of the class HTML::TreeBuilder, but that's
just a subclass of HTML::Element, plus the few extra methods like
C<parse_file> that elaborate the tree.

=back

, with the tag name "html", and with two children: HTML::Element
objects whose tag names are "head" and "body". And each of those
elements has children, and so on down. Not all elements (as we'll
call the objects of class HTML::Element) have children -- the "hr"
element doesn't. And not all nodes in the tree are elements -- the
text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.

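
To make the element-versus-string distinction concrete, here is a
minimal sketch (not part of the original article) that walks the sample
tree and rebuilds the indented outline shown above. The sub name
C<outline> is made up for illustration; the only tests it relies on are
C<ref>, which is false for a text node, and the C<tag> and
C<content_list> methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns one outline line per node at or under $node.
sub outline {
  my($node, $depth) = @_;
  $depth ||= 0;
  my $pad = '   ' x $depth;
  # A text node is a plain string, so ref() is false for it.
  return qq{$pad. "$node"} unless ref $node;
  # An element: report its tag-name, then recurse into its children.
  return(
    "$pad. " . $node->tag,
    map { outline($_, $depth + 1) } $node->content_list
  );
}

my $tree = HTML::TreeBuilder->new;
$tree->parse(
  '<html><head><title>Doc 1</title></head>'
  . '<body>Stuff <hr> 2000-08-17</body></html>'
);
$tree->eof;

my @lines = outline($tree);
print "$_\n" for @lines;
$tree->delete;
```

The exact whitespace kept inside the text nodes depends on
TreeBuilder's space-compacting settings, so treat the printed strings
as approximate.
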
Objects of the class HTML::Element each have three noteworthy attributes:

=over

=item C<_tag> -- (best accessed as C<$e-E<gt>tag>)
this element's tag-name, lowercased (e.g., "em" for an "em" element).

=over

Footnote: Yes, this is misnamed. In proper SGML terminology, this is
instead called a "GI", short for "generic identifier"; and the term
"tag" is used for a token of SGML source that represents either
the start of an element (a start-tag like "<em lang='fr'>") or the end
of an element (an end-tag like "</em>"). However, since more people
claim to have been abducted by aliens than to have ever seen the
SGML standard, and since both encounters typically involve a feeling of
"missing time", it's not surprising that the terminology of the SGML
standard is not closely followed.

=back

=item C<_parent> -- (best accessed as C<$e-E<gt>parent>)
the element that is C<$e>'s parent, or undef if this element is the
root of its tree.

=item C<_content> -- (best accessed as C<$e-E<gt>content_list>)
the list of nodes (i.e., elements or text segments) that are C<$e>'s
children.

=back

Moreover, if an element object has any attributes in the SGML sense of
the word, then those are readable as C<$e-E<gt>attr('name')> -- for
example, with the object built from having parsed "E<lt>a
B<id='foo'>E<gt>barE<lt>/aE<gt>", C<$e-E<gt>attr('id')> will return
the string "foo". Likewise, C<$e-E<gt>tag> on that object returns the
string "a", C<$e-E<gt>content_list> returns a list consisting of just
the single scalar "bar", and C<$e-E<gt>parent> returns the object
that's this node's parent -- which may be, for example, a "p" element.

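
The accessors just described can be tried out on exactly that
one-link example. This is only a sketch: the enclosing "p" element is
an assumption added here so that C<parent> has something definite to
return.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(q{<p><a id='foo'>bar</a></p>});
$tree->eof;

# Find the "a" element, then ask it about itself.
my $e          = $tree->look_down('_tag', 'a');
my $tag        = $e->tag;           # the element's tag-name
my $id         = $e->attr('id');    # an attribute in the SGML sense
my @kids       = $e->content_list;  # child nodes; here, one text segment
my $parent_tag = $e->parent->tag;   # tag-name of the enclosing element

print "tag:      $tag\n";
print "id:       $id\n";
print "children: @kids\n";
print "parent:   $parent_tag\n";
$tree->delete;
```
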
And that's all that there is to it -- you throw HTML
source at TreeBuilder, and it returns a tree built of HTML::Element
objects and some text strings.

However, what do you I<do> with a tree of objects? People code
information into HTML trees not for the fun of arranging elements, but
to represent the structure of specific text and images -- some text is
in this "li" element, some other text is in that heading, some
images are in that other table cell that has those attributes, and so on.

Now, it may happen that you're rendering that whole HTML tree into some
layout format. Or you could be trying to make some systematic change to
the HTML tree before dumping it out as HTML source again. But, in my
experience, by far the most common programming task that Perl
programmers face with HTML is in trying to extract some piece
of information from a larger document. Since that's so common (and
also since it involves concepts that are basic to more complex tasks),
that is what the rest of this article will be about.

=head2 Scanning HTML trees

Suppose you have a thousand HTML documents, each of them a press
release. They all start out:

  [...lots of leading images and junk...]
  <h1>ConGlomCo to Open New Corporate Office in Ouagadougou</h1>
  BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
  of world conquest, Rock Feldspar, announced today the opening of a
  new office in Ouagadougou, the capital city of Burkina Faso, gateway
  to the bustling "Silicon Sahara" of Africa...
  [...etc...]

...and what you've got to do is, for each document, copy whatever text
is in the "h1" element, so that you can, for example, make a table of
contents of it. Now, there are three ways to do this:

=over

=item * You can just use a regexp to scan the file for a text pattern.

For many very simple tasks, this will do fine. Many HTML documents are,
in practice, very consistently formatted as far as placement of
linebreaks and whitespace, so you could just get away with scanning the
file like so:

  sub get_heading {
    my $filename = $_[0];
    local *HTML;
    open(HTML, $filename)
      or die "Couldn't open $filename: $!";
    my $heading;
   Line:
    while(<HTML>) {
      if( m{<h1>(.*?)</h1>}i ) {  # match it!
        $heading = $1;
        last Line;
      }
    }
    close(HTML);
    warn "No heading in $filename?"
     unless defined $heading;
    return $heading;
  }

This is quick to write and fast to run, but awfully fragile -- if
there's a newline in the middle of a heading's text, it won't match the
above regexp, and you'll get no heading at all. The regexp will also
fail if the "h1" element's start-tag has any attributes. If you have to
adapt your code to fit more kinds of start-tags, you'll end up basically
reinventing part of HTML::Parser, at which point you should probably
just stop, and use HTML::Parser itself:

=item * You can use HTML::Parser to scan the file for an "h1" start-tag
token, then capture all the text tokens until the "h1" end-tag. This
approach is extensively covered in Ken MacFarlane's TPJ17 article
"Parsing HTML with HTML::Parser". (A variant of this approach is to use
HTML::TokeParser, which presents a different and rather handier
interface to the tokens that HTML::Parser picks out.)

Using HTML::Parser is less fragile than our first approach, since it's
not sensitive to the exact internal formatting of the start-tag (much
less whether it's split across two lines). However, when you need more
information about the context of the "h1" element, or if you're having
to deal with any of the tricky bits of HTML, such as parsing of tables,
you'll find that the flat list of tokens that HTML::Parser returns
isn't immediately useful. To get something useful out of those tokens,
you'll need to write code that knows some things about what elements
take no content (as with "hr" elements), and that "</p>" end-tags
are omissible, so a "<p>" will end any currently
open paragraph -- and you're well on your way to pointlessly
reinventing much of the code in HTML::TreeBuilder

=over

Footnote:
And, as the person who last rewrote that module, I can attest that it
wasn't terribly easy to get right! Never underestimate the perversity
of people coding HTML.

=back

, at which point you should probably just stop, and use
HTML::TreeBuilder itself:

=item * You can use HTML::TreeBuilder, and scan the tree of element
objects that you get back.

=back

The last approach, using HTML::TreeBuilder, is the diametric opposite of
the first approach: The first approach involves just elementary Perl and
one regexp, whereas the TreeBuilder approach involves being at home with
the concept of tree-shaped data structures and modules with
object-oriented interfaces, as well as with the particular interfaces
that HTML::TreeBuilder and HTML::Element provide.

However, what the TreeBuilder approach has going for it is that it's
the most robust, because it involves dealing with HTML in its "native"
format -- it deals with the tree structure that HTML code represents,
without any consideration of how the source is coded and with what
tags omitted.

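
The difference in robustness among the three approaches can be seen by
running each of them over one awkward heading. The following is a
sketch under assumptions, not code from the original article: the
sample HTML (an "h1" with a C<class> attribute and a line break in its
text) is invented, and the token-stream approach is shown with
HTML::TokeParser's C<get_tag> and C<get_trimmed_text> methods rather
than with raw HTML::Parser handlers.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::TreeBuilder;

# A heading whose start-tag has an attribute and whose text spans two lines.
my $html = qq{<h1 class="main">ConGlomCo to Open\nNew Office</h1>\n}
         . qq{<p>BAKERSFIELD...</p>\n};

# 1. Line-by-line regexp, as in the first get_heading.
my $by_regexp;
foreach my $line (split /^/, $html) {
  if( $line =~ m{<h1>(.*?)</h1>}i ) { $by_regexp = $1; last }
}

# 2. Token stream: skip to the "h1" start-tag, then take the text
#    up to the matching end-tag.
my $by_tokens;
my $p = HTML::TokeParser->new(\$html);
$by_tokens = $p->get_trimmed_text('/h1') if $p->get_tag('h1');

# 3. Parse tree: find the "h1" element and ask for its text.
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;
my $h1 = $tree->look_down('_tag', 'h1');
my $by_tree = $h1 ? $h1->as_text : undef;
$tree->delete;

foreach my $pair (
  [ regexp => $by_regexp ], [ tokens => $by_tokens ], [ tree => $by_tree ],
) {
  my($how, $text) = @$pair;
  print "$how: ", defined $text ? $text : '(nothing found)', "\n";
}
```

The regexp finds nothing here, for both of the reasons given above,
while the other two approaches recover the heading text.
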
So, to extract the text from the "h1" elements of an HTML document:

  sub get_heading {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($_[0]);   # !
    my $heading;
    my $h1 = $tree->look_down('_tag', 'h1');  # !
    if($h1) {
      $heading = $h1->as_text;   # !
    } else {
      warn "No heading in $_[0]?";
    }
    $tree->delete; # clear memory!
    return $heading;
  }

This uses some unfamiliar methods that need explaining. The
C<parse_file> method, which we've seen before, builds a tree based on
source from the file given. The C<delete> method is for marking a
tree's contents as available for garbage collection, when you're done
with the tree. The C<as_text> method returns a string that contains
all the text bits that are children (or otherwise descendants) of the
given node -- to get the text content of the C<$h1> object, we could
just say:

  $heading = join '', $h1->content_list;

but that will work only if we're sure that the "h1" element's children
will be only text bits -- if the document contained:

  <h1>Local Man Sees <cite>Blade</cite> Again</h1>

then the sub-tree would be:

  . h1
    . "Local Man Sees "
    . cite
      . "Blade"
    . " Again"

so C<join '', $h1-E<gt>content_list> will be something like:

  Local Man Sees HTML::Element=HASH(0x15424040) Again

whereas C<$h1-E<gt>as_text> would yield:

  Local Man Sees Blade Again

and depending on what you're doing with the heading text, you might
want the C<as_HTML> method instead. It returns the (sub)tree
represented as HTML source. C<$h1-E<gt>as_HTML> would yield:

  <h1>Local Man Sees <cite>Blade</cite> Again</h1>

However, if you wanted the contents of C<$h1> as HTML, but not the
C<$h1> itself, you could say:

  join '',
    map(
      ref($_) ? $_->as_HTML : $_,
      $h1->content_list
    )

This C<map> iterates over the nodes in C<$h1>'s list of children; and
for each node that's just a text bit (as "Local Man Sees " is), it just
passes through that string value, and for each node that's an actual
object (causing C<ref> to be true), C<as_HTML> will be used instead of
the string value of the object itself (which would be something quite
useless, as most object values are). So that C<as_HTML> for the "cite"
element will be the string "<cite>BladeE<lt>/cite>". And then,
finally, C<join> just puts into one string all the strings that the
C<map> returns.

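
That C<join>/C<map> idiom can be wrapped up and run against the
"Local Man" heading. This is a minimal sketch; the sub name
C<inner_html> is hypothetical, not an HTML::Element method. Note that
text segments are passed through exactly as stored, as in the
expression above, so no entity-encoding is applied to them, and some
HTML::Element versions may add a trailing newline to what C<as_HTML>
returns.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: the HTML source of an element's content,
# without the element's own start-tag and end-tag.
sub inner_html {
  my $element = $_[0];
  return join '',
    map { ref($_) ? $_->as_HTML : $_ } $element->content_list;
}

my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Local Man Sees <cite>Blade</cite> Again</h1>');
$tree->eof;

my $h1    = $tree->look_down('_tag', 'h1');
my $inner = inner_html($h1);   # child elements rendered as HTML
my $text  = $h1->as_text;      # all markup dropped

print "inner HTML: $inner\n";
print "text only:  $text\n";
$tree->delete;
```
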
Last but not least, the most important method in our C<get_heading> sub
is the C<look_down> method. This method looks down at the subtree
starting at the given object (here, C<$tree>), looking for elements
that meet criteria you provide.

The criteria are specified in the method's argument list. Each
criterion can consist of two scalars, a key and a value, which express
that you want elements that have that attribute (like "_tag", or
"src") with the given value ("h1"); or the criterion can be a
reference to a subroutine that, when called on the given element,
returns true if that is a node you're looking for. If you specify
several criteria, then that's taken to mean that you want all the
elements that each satisfy I<all> the criteria. (In other words,
there's an "implicit AND".)

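
Here is a small sketch of those kinds of criteria at work, using an
invented three-link paragraph (the URLs and the C<class> value are
assumptions made up for the example). Each call is made in list
context, so it returns every matching element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
  '<p><a href="/a.html" class="nav">One</a> '
  . '<a href="/b.html">Two</a> '
  . '<a name="here">Three</a></p>'
);
$tree->eof;

# One key/value criterion: every "a" element.
my @all_a = $tree->look_down('_tag', 'a');

# Two key/value criteria, ANDed together: "a" elements of class "nav".
my @nav = $tree->look_down('_tag', 'a', 'class', 'nav');

# A key/value criterion plus a sub criterion: "a" elements for which
# the sub returns true, i.e., those that have an href attribute.
my @linked = $tree->look_down(
  '_tag', 'a',
  sub { defined $_[0]->attr('href') },
);

printf "%d a elements, %d of class nav, %d with an href\n",
  scalar(@all_a), scalar(@nav), scalar(@linked);
my @linked_text = map { $_->as_text } @linked;
$tree->delete;
```
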
And finally, there's a bit of an optimization -- if you call the
C<look_down> method in a scalar context, you get just the I<first> node
(or undef if none) -- and, in fact, once C<look_down> finds that first
matching element, it doesn't bother looking any further.

So the example:

  $h1 = $tree->look_down('_tag', 'h1');

returns the first element at-or-under C<$tree> whose C<"_tag">
attribute has the value C<"h1">.

=head2 Complex Criteria in Tree Scanning

Now, the above C<look_down> code looks like a lot of bother, with
barely more benefit than just grepping the file! But consider if your
criteria were more complicated -- suppose you found that some of the
press releases that you were scanning had several "h1" elements,
possibly before or after the one you actually want. For example:

  <h1><center>Visit Our Corporate Partner
   <br><a href="/dyna/clickthru"
     ><img src="/dyna/vend_ad"></a>
  </center></h1>
  <h1><center>ConGlomCo President Schreck to Visit Regional HQ
   <br><a href="/photos/Schreck_visit_large.jpg"
     ><img src="/photos/Schreck_visit.jpg"></a>
  </center></h1>

Here, you want to ignore the first "h1" element because it contains an
ad, and you want the text from the second "h1". The problem is in
formalizing the way you know that it's an ad. Since ad banners are
always entreating you to "visit" the sponsoring site, you could exclude
"h1" elements that contain the word "visit" under them:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      $_[0]->as_text !~ m/\bvisit/i
    }
  );

The first criterion looks for "h1" elements, and the second criterion
limits those to only the ones whose text content doesn't match
C<m/\bvisit/i>. But unfortunately, that won't work for our example,
since the second "h1" mentions "ConGlomCo President Schreck to
I<Visit> Regional HQ".

Instead you could try looking for the first "h1" element that
doesn't contain an image:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      not $_[0]->look_down('_tag', 'img')
    }
  );

This criterion sub might seem a bit odd, since it calls C<look_down>
as part of a larger C<look_down> operation, but that's fine. Note that
when considered as a boolean value, a C<look_down> in a scalar context
returns a false value (specifically, undef) if there's no matching
element at or under the given element; and it returns the first matching
element (which, being a reference and object, is always a true value),
if any matches. So, here,

  sub {
    not $_[0]->look_down('_tag', 'img')
  }

means "return true only if this element has no 'img' elements as
descendants (and isn't an 'img' element itself)."

This correctly filters out the first "h1" that contains the ad, but it
also incorrectly filters out the second "h1" that contains a
non-advertisement photo besides the headline text you want.

There clearly are detectable differences between the first and second
"h1" elements -- only the second one contains the string "Schreck", and
we could just test for that:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      $_[0]->as_text =~ m{Schreck}
    }
  );

And that works fine for this one example, but unless all thousand of
your press releases have "Schreck" in the headline, that's just not a
general solution. However, if all the ads-in-"h1"s that you want to
exclude involve a link whose URL involves "/dyna/", then you can use
that:

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $link = $_[0]->look_down('_tag','a');
      return 1 unless $link;
        # no link means it's fine
      return 0 if $link->attr('href') =~ m{/dyna/};
        # a link to there is bad
      return 1; # otherwise okay
    }
  );

Or you can look at it another way and say that you want the first "h1"
element that either contains no images, or else whose image has a "src"
attribute whose value contains "/photos/":

  my $real_h1 = $tree->look_down(
    '_tag', 'h1',
    sub {
      my $img = $_[0]->look_down('_tag','img');
      return 1 unless $img;
        # no image means it's fine
      return 1 if $img->attr('src') =~ m{/photos/};
        # good if a photo
      return 0; # otherwise bad
    }
  );

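
One way to check criteria like these before turning them loose on a
thousand files is to run each against a known sample and see what
comes back. The sketch below does that for three of the criteria
discussed above, using the two-"h1" sample; the hash of named criteria
is just an arrangement invented for this example, and the
C<|| ''> guard against a missing C<href> is an addition of mine.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<h1><center>Visit Our Corporate Partner
 <br><a href="/dyna/clickthru"
   ><img src="/dyna/vend_ad"></a>
</center></h1>
<h1><center>ConGlomCo President Schreck to Visit Regional HQ
 <br><a href="/photos/Schreck_visit_large.jpg"
   ><img src="/photos/Schreck_visit.jpg"></a>
</center></h1>
END_HTML
$tree->eof;

# Candidate criteria, keyed by a descriptive (made-up) name.
my %criterion = (
  no_visit => sub { $_[0]->as_text !~ m/\bvisit/i },
  no_img   => sub { not $_[0]->look_down('_tag', 'img') },
  no_dyna  => sub {
    my $link = $_[0]->look_down('_tag', 'a');
    return 1 unless $link;                         # no link is fine
    return 0 if ($link->attr('href') || '') =~ m{/dyna/};
    return 1;                                      # otherwise okay
  },
);

# For each criterion, record the text of the first "h1" it accepts.
my %found;
foreach my $name (sort keys %criterion) {
  my $h1 = $tree->look_down('_tag', 'h1', $criterion{$name});
  $found{$name} = $h1 ? $h1->as_text : undef;
  print "$name: ",
    defined $found{$name} ? $found{$name} : '(no match)', "\n";
}
$tree->delete;
```

Only the "/dyna/" test should pick out the Schreck headline; the other
two reject both headings, as the discussion above predicts.
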
Recall that this use of C<look_down> in a scalar context means to return
the first element at or under C<$tree> that matches all the criteria.
But if you notice that you can formulate criteria that'll match several
possible "h1" elements, some of which may be bogus but the I<last> one
of which is always the one you want, then you can use C<look_down> in a
list context, and just use the last element of that list:

  my @h1s = $tree->look_down(
    '_tag', 'h1',
    ...maybe more criteria...
  );
  die "What, no h1s here?" unless @h1s;
  my $real_h1 = $h1s[-1]; # last or only

=head2 A Case Study: Scanning Yahoo News's HTML

The above (somewhat contrived) case involves extracting data from a
bunch of pre-existing HTML files. In that sort of situation, if your
code works for all the files, then you know that the code I<works> --
since the data it's meant to handle won't go changing or growing; and,
typically, once you've used the program, you'll never need to use it
again.

The other kind of situation faced in many data extraction tasks is
where the program is used recurringly to handle new data -- such as
from ever-changing Web pages. As a real-world example of this,
consider a program that you could use (suppose it's crontabbed) to
extract headline-links from subsections of Yahoo News
(C<http://dailynews.yahoo.com/>).

Yahoo News has several subsections:

=over

=item http://dailynews.yahoo.com/h/tc/ for technology news

=item http://dailynews.yahoo.com/h/sc/ for science news

=item http://dailynews.yahoo.com/h/hl/ for health news

=item http://dailynews.yahoo.com/h/wl/ for world news

=item http://dailynews.yahoo.com/h/en/ for entertainment news

=back

and others. All of them are built on the same basic HTML template --
and a scarily complicated template it is, especially when you look at
it with an eye toward making up rules that will select where the real
headline-links are, while screening out all the links to other parts of
Yahoo, other news services, etc. You will need to puzzle
over the HTML source, and scrutinize the output of
C<$tree-E<gt>dump> on the parse tree of that HTML.

Sometimes the only way to pin down what you're after is by position in
the tree. For example, headlines of interest may be in the third
column of the second row of the second table element in a page:

  my $table = ( $tree->look_down('_tag','table') )[1];
  my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
  my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
  ...then do things with $col3...

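
A runnable version of that positional lookup might look like the
following sketch. The two-table page is invented for the example, and
the checks for undef are an addition: a list slice past the end of the
list quietly yields undef, so each step is guarded before calling a
method on its result. Bear in mind that C<look_down> searches all
descendants, so on a page with nested tables, the second "table" found
may be one nested inside the first.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
  '<table><tr><td>nav</td></tr></table>'
  . '<table>'
  . '<tr><td>r1c1</td><td>r1c2</td><td>r1c3</td></tr>'
  . '<tr><td>r2c1</td><td>r2c2</td><td>Headlines here</td></tr>'
  . '</table>'
);
$tree->eof;

# Second table, then its second row, then that row's third cell.
my $table = ( $tree->look_down('_tag', 'table') )[1];
my $row2  = $table ? ( $table->look_down('_tag', 'tr') )[1] : undef;
my $col3  = $row2  ? ( $row2->look_down('_tag', 'td')  )[2] : undef;

my $cell_text = $col3 ? $col3->as_text : undef;
print defined $cell_text ? "Found: $cell_text\n" : "No such cell\n";
$tree->delete;
```
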
Or they may be all the links in a "p" element that has at least three
"br" elements as children:

  my $p = $tree->look_down(
    '_tag', 'p',
    sub {
      2 < grep { ref($_) and $_->tag eq 'br' }
               $_[0]->content_list
    }
  );
  @links = $p->look_down('_tag', 'a');

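
As a sketch of how that criterion behaves, here it is applied to an
invented page with two paragraphs, only the second of which has three
"br" children. The URLs are placeholders, and the guard for a missing
C<$p> is an addition.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
  '<p><a href="/about">About</a><br>Just one break here</p>'
  . '<p><a href="/n/1">First</a><br><a href="/n/2">Second</a><br>'
  . '<a href="/n/3">Third</a><br>More inside</p>'
);
$tree->eof;

# The first "p" with more than two "br" elements among its children.
# In the numeric comparison, grep returns a count of matching children.
my $p = $tree->look_down(
  '_tag', 'p',
  sub {
    2 < grep { ref($_) and $_->tag eq 'br' }
             $_[0]->content_list
  }
);

# Every link at or under that paragraph, if there was one.
my @links = $p ? $p->look_down('_tag', 'a') : ();
my @hrefs = map { $_->attr('href') } @links;
print "$_\n" for @hrefs;
$tree->delete;
```
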
But almost always, you can get away with looking for properties of the
thing itself, rather than just looking for contexts. Now, if
you're lucky, the document you're looking through has clear semantic
tagging, such as is useful in CSS -- note the
class="headlinelink" bit here:

  <a href="...long_news_url..." class="headlinelink">Elvis
  seen in tortilla</a>

If you find anything like that, you could leap right in and select
links with:

  @links = $tree->look_down('class','headlinelink');

Regrettably, your chances of seeing any sort of semantic markup
principles really being followed with actual HTML are pretty thin.

=over

Footnote:
In fact, your chances of finding a page that is simply free of HTML
errors are even thinner. And surprisingly, sites like Amazon or Yahoo
are typically worse as far as quality of code than personal sites
whose entire production cycle involves simply being saved and uploaded
from Netscape Composer.

=back

The code may be sort of "accidentally semantic", however -- for example,
in a set of pages I was scanning recently, I found that looking for
"td" elements with a "width" attribute value of "375" got me exactly
what I wanted. No-one designing that page ever conceived of
"width=375" as I<meaning> "this is a headline", but if you impute it
to mean that, it works.

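
Both kinds of attribute test, the deliberately semantic C<class> and
the accidentally semantic C<width>, are plain key/value criteria. The
sketch below runs them over an invented one-row table; the markup and
URLs are assumptions for illustration, not the pages described above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(
  '<table><tr>'
  . '<td width="120"><a href="/">Home</a></td>'
  . '<td width="375"><a href="/n/1" class="headlinelink">'
  . 'Elvis seen in tortilla</a></td>'
  . '</tr></table>'
);
$tree->eof;

# Deliberately semantic: any element whose class is "headlinelink".
my @by_class = $tree->look_down('class', 'headlinelink');

# Accidentally semantic: "td" elements that happen to be 375 wide.
my @by_width = $tree->look_down('_tag', 'td', 'width', '375');

my @class_tags = map { $_->tag }     @by_class;
my @width_text = map { $_->as_text } @by_width;
print "by class: @class_tags\n";
print "by width: @width_text\n";
$tree->delete;
```
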
An approach like this happens to work for the Yahoo News code, because
the headline-links are distinguished by the fact that they (and they
alone) contain a "b" element:

  <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>

or, diagrammed as a part of the parse tree:

  . a  [href="...long_news_url..."]
    . b
      . "Elvis seen in tortilla"

A rule that matches these can be formalized as "look for any 'a'
element that has only one daughter node, which must be a 'b' element".
And this is what it looks like when cooked up as a C<look_down>
expression and prefaced with a bit of code that retrieves the text of
the given Yahoo News page and feeds it to TreeBuilder:

  use strict;
  use HTML::TreeBuilder 2.97;
  use LWP::UserAgent;
  sub get_headlines {
    my $url = $_[0] || die "What URL?";

    my $response = LWP::UserAgent->new->request(
      HTTP::Request->new( GET => $url )
    );
    unless($response->is_success) {
      warn "Couldn't get $url: ", $response->status_line, "\n";
      return;
    }

    my $tree = HTML::TreeBuilder->new();
    $tree->parse($response->content);
    $tree->eof;

    my @out;
    foreach my $link (
      $tree->look_down(   # !
        '_tag', 'a',
        sub {
          return unless $_[0]->attr('href');
          my @c = $_[0]->content_list;
          @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
        }
      )
    ) {
      push @out, [ $link->attr('href'), $link->as_text ];
    }

    warn "Odd, fewer than 6 stories in $url!" if @out < 6;
    $tree->delete;
    return @out;
  }

...and add a bit of code to actually call that routine and display the
results...

  foreach my $section (qw[tc sc hl wl en]) {
    my @links = get_headlines(
      "http://dailynews.yahoo.com/h/$section/"
    );
    print
      $section, ": ", scalar(@links), " stories\n",
      map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
      "\n";
  }

And we've got our own headline-extractor service! This in and of
itself isn't so amazingly useful (since if you want to see the
headlines, you I<can> just look at the Yahoo News pages), but it could
easily be the basis for quite useful features like filtering the
headlines for certain keywords of interest to you.

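
A keyword filter of that sort needs nothing beyond core Perl. The
following is one possible sketch, not part of the original program:
the sub name C<filter_headlines>, the sample data, and the
C<example.com> URLs are all made up, and the only thing taken from
above is the shape of the data, a list of C<[url, text]> pairs such as
C<get_headlines> returns.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the [url, text] pairs whose headline
# text contains a word starting with one of the keywords,
# compared case-insensitively.
sub filter_headlines {
  my($keywords, @headlines) = @_;
  return () unless @$keywords;
  my $alternation = join '|', map { quotemeta($_) } @$keywords;
  return grep { $_->[1] =~ m/\b(?:$alternation)/i } @headlines;
}

# Sample data in the shape that get_headlines returns.
my @headlines = (
  [ 'http://example.com/1', 'Elvis seen in tortilla' ],
  [ 'http://example.com/2', 'Perl 6 design documents released' ],
  [ 'http://example.com/3', 'New Linux kernel announced' ],
);

my @wanted = filter_headlines([ 'perl', 'linux' ], @headlines);
print "  $_->[0] : $_->[1]\n" for @wanted;
```

In the real program, the sample list would be replaced by the return
value of C<get_headlines> for each section.
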
Now, one of these days, Yahoo News will decide to change its HTML
template. When this happens, this will appear to the above program as
there being no links that meet the given criteria; or, less likely,
dozens of erroneous links will meet the criteria. In either case, the
criteria will have to be changed for the new template; they may just
need adjustment, or you may need to scrap them and start over.

=head2 I<Regardez, duvet!>

It's often quite a challenge to write criteria to match the desired
parts of an HTML parse tree. Very often you I<can> pull it off with a
simple C<$tree-E<gt>look_down('_tag', 'h1')>, but sometimes you do
have to keep adding and refining criteria, until you might end up with
complex filters like what I've shown in this article. The
benefit to learning how to deal with HTML parse trees is that one main
search tool, the C<look_down> method, can do most of the work, making
simple things easy, while still making hard things possible.

B<[end body of article]>

=head2 [Author Credit]

Sean M. Burke (C<[email protected]>) is the current maintainer of
C<HTML::TreeBuilder> and C<HTML::Element>, both originally by
Gisle Aas.

Sean adds: "I'd like to thank the folks who listened to me ramble
incessantly about HTML::TreeBuilder and HTML::Element at this year's Yet
Another Perl Conference and O'Reilly Open Source Software Convention."

=head1 BACK

Return to the L<HTML::Tree|HTML::Tree> docs.

=cut