Extracting sequences of hyperlink labels from ODP

sperugin · Feb 19, 2004, 09:40 PM

Hello,

I am interested in collecting sequences of hyperlink labels (the anchor text
of the a href links) for various sub-branches of ODP, where a sequence is the
path from the root of ODP to a leaf (an indexed external webpage). These
sequences are essentially the breadcrumbs at the top of each page in ODP.

For example, the Arts sub-branch contains the following selected sequences:

Arts: Animation: Cartoons: Campaigns and Petitions: url to CybertOOn's Cartoon Campaign
Arts: Animation: Cartoons: Chats and Forums: url to Cartoon World
Arts: Animation: Cartoons: Chats and Forums: url to Cartoons
Arts: Animation: Cartoons: Chats and Forums: url to Toon Zone Forums

Each sequence leads to a unique webpage (URL).
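
To make the target output concrete, here is a minimal Python sketch of
enumerating such sequences over a small hand-built category tree. The
dictionaries `children` and `pages` and the function `label_sequences` are
hypothetical stand-ins, not the format of the ODP dump, and the URLs are
placeholders.

```python
# Toy category tree (hypothetical data, not read from the ODP dump).
# children maps a category path to the labels of its sub-categories;
# pages maps a category path to the (link label, URL) pairs listed there.
children = {
    ("Arts",): ["Animation"],
    ("Arts", "Animation"): ["Cartoons"],
    ("Arts", "Animation", "Cartoons"): [
        "Campaigns and Petitions",
        "Chats and Forums",
    ],
}
pages = {
    ("Arts", "Animation", "Cartoons", "Chats and Forums"): [
        ("Cartoon World", "http://example.com/cartoon-world"),  # placeholder URL
        ("Toon Zone Forums", "http://example.com/toon-zone"),   # placeholder URL
    ],
}


def label_sequences(path):
    """Yield (labels, url) for every external page at or below `path`."""
    for label, url in pages.get(path, []):
        yield list(path) + [label], url
    for child in children.get(path, []):
        yield from label_sequences(path + (child,))


sequences = list(label_sequences(("Arts",)))
for labels, url in sequences:
    print(": ".join(labels), "->", url)
```

Without crosslinks the structure is a plain tree, so a depth-first walk like
this visits every leaf exactly once.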

The difficult part is getting the paths that involve crosslinks. For example,
Games: Coin-Op: Jukeboxes: Retailers is one such path, where Jukeboxes
and Retailers are crosslinks (marked with an '@').

For example, I'd like to extract the following sequences, involving
crosslinks, from the Arts branch:

"Arts: Antiques: Directories: Art: url to Affordable Antique Art.com"
(when Affordable Antique Art.com really lives in the Recreation sub-branch
under Recreation: Antiques: Directories: Art).

"Arts: Dance: Disabled: url to Adaptive Dancing, Inc."
(when Adaptive Dancing, Inc. really lives in the Society sub-branch
under Arts: Performing Arts: Dance).
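
One way to picture the crosslink case is as a graph walk rather than a tree
walk. The Python sketch below assumes the dump has already been parsed into
three hypothetical dictionaries (`narrow`, `symbolic`, `pages`); it does not
read the RDF itself, and the category ids and URL are made up for
illustration. A crosslink contributes its own label to the sequence, and a
set of categories already on the current path guards against cycles.

```python
# Hypothetical in-memory graph; the real dump would have to be parsed into
# something like this first. Category ids are slash-separated paths.
narrow = {  # ordinary sub-categories: id -> [(label, child id)]
    "Recreation/Antiques": [
        ("Directories", "Recreation/Antiques/Directories"),
    ],
    "Recreation/Antiques/Directories": [
        ("Art", "Recreation/Antiques/Directories/Art"),
    ],
}
symbolic = {  # crosslinks (shown with '@'): id -> [(label, target id)]
    "Arts": [("Antiques", "Recreation/Antiques")],
    "Recreation/Antiques": [("Arts", "Arts")],  # back-link, creates a cycle
}
pages = {  # id -> [(link label, URL)]; the URL is a placeholder
    "Recreation/Antiques/Directories/Art": [
        ("Affordable Antique Art.com", "http://example.com/antique-art"),
    ],
}


def walk(cat, labels, on_path):
    """Yield (labels, url), following both narrow links and crosslinks."""
    for label, url in pages.get(cat, []):
        yield labels + [label], url
    for label, target in narrow.get(cat, []) + symbolic.get(cat, []):
        if target in on_path:  # already on this path: skip to avoid looping
            continue
        yield from walk(target, labels + [label], on_path | {target})


results = list(walk("Arts", ["Arts"], {"Arts"}))
for labels, url in results:
    print(": ".join(labels), "->", url)
```

Because crosslinks let the same page be reached along many routes, the number
of sequences can grow quickly; a depth limit or a restriction on which
crosslinks to follow would probably be needed on the full directory.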

I'd like to collect thousands of such sequences from selected
sub-branches of ODP. Can I extract them from the RDF structure dump
with only slight modifications to the POD scripts?

Thank you and best regards,
Saverio