Hello,
I am interested in collecting sequences of hyperlink labels (the anchor text
of the a href links) for various sub-branches of ODP, where a sequence is the
path from the root of ODP to a leaf (an indexed external webpage). These
sequences are essentially the breadcrumbs at the top of each page in ODP.
For example, the Arts sub-branch contains the following selected sequences:
Arts: Animation: Cartoons: Campaigns and Petitions: url to CybertOOn's Cartoon Campaign
Arts: Animation: Cartoons: Chats and Forums: url to Cartoon World
Arts: Animation: Cartoons: Chats and Forums: url to Cartoons
Arts: Animation: Cartoons: Chats and Forums: url to Toon Zone Forums
Each sequence leads to a unique webpage (URL).
The difficult part is getting the paths that involve crosslinks. For example,
Games: Coin-Op: Jukeboxes: Retailers is one such path, where Jukeboxes
and Retailers are crosslinks (prefixed with an '@').
For example, I'd like to extract the following sequences, involving
crosslinks, from the Arts branch:
"Arts: Antiques: Directories: Art: url to Affordable Antique Art.com"
(when Affordable Antique Art.com really lives in the Recreation sub-branch
under Recreation: Antiques: Directories: Art).
"Arts: Dance: Disabled: url to Adaptive Dancing, Inc."
(when Adaptive Dancing, Inc. really lives in the Society sub-branch
under Arts: Performing Arts: Dance).
I'm interested in collecting such sequences on the
order of thousands in selected sub-branches of ODP.
Can I extract these sequences from the RDF structure dump
with slight modifications to the POD scripts?
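
To make the question concrete, here is a minimal Python sketch of the kind of
traversal I have in mind. It assumes the dump has already been parsed into
three plain dictionaries. The names subcats, crosslinks, pages and
breadcrumb_paths are hypothetical, and the toy data is made up for
illustration (including the guess about which link is the crosslink). None of
it reflects the actual RDF dump format or any existing script.

```python
from typing import Dict, Iterator, List, Tuple

Links = Dict[str, List[Tuple[str, str]]]  # category id -> [(label, target id)]

# Toy data, invented for illustration only.
subcats: Links = {
    "Arts": [("Animation", "Arts/Animation")],
    "Recreation/Antiques": [("Directories", "Recreation/Antiques/Directories")],
    "Recreation/Antiques/Directories": [
        ("Art", "Recreation/Antiques/Directories/Art")
    ],
}
# Assumption: the "Antiques" link under Arts is the '@' crosslink.
crosslinks: Links = {"Arts": [("Antiques", "Recreation/Antiques")]}
pages: Dict[str, List[str]] = {
    "Recreation/Antiques/Directories/Art": ["Affordable Antique Art.com"],
}


def breadcrumb_paths(
    root: str,
    root_label: str,
    subcats: Links,
    crosslinks: Links,
    pages: Dict[str, List[str]],
    max_depth: int = 10,
) -> Iterator[Tuple[List[str], str]]:
    """Yield (labels, page) for each page reachable from root.

    Crosslinks are followed like ordinary subcategory links, so a page is
    reported under the labels that were clicked, not under its home category.
    """
    stack = [(root, [root_label], frozenset([root]))]
    while stack:
        cat, labels, seen = stack.pop()
        for page in pages.get(cat, []):
            yield labels, page
        if len(labels) >= max_depth:
            continue  # crude bound on path length
        for label, target in subcats.get(cat, []) + crosslinks.get(cat, []):
            if target in seen:
                continue  # crosslinks can form cycles; skip revisits on this path
            stack.append((target, labels + [label], seen | {target}))


for labels, page in breadcrumb_paths("Arts", "Arts", subcats, crosslinks, pages):
    print(": ".join(labels) + ": url to " + page)
```

The per-path seen set and the max_depth bound are only there so that
crosslink cycles cannot make the walk run forever. Whether thousands of such
paths can be produced this way depends on how the dump exposes crosslinks,
which is the part I am unsure about.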
Thank you and best regards,
Saverio