
Parsing HTML from a Website

The web is an awesome resource for information and knowledge. If you find a site whose data is valuable but whose user experience could be better, you can extract that data by parsing its HTML and present it in your own format.

Every web page is delivered as HTML so that the browser can render its contents to your screen. That HTML is public: anyone who can load the page can read its source. This is what I did for the site hearthhead.com when I built my deckfinder. They have an extensive collection of user-generated decks, over a hundred thousand. There is a web page for each deck that lists the name and all the cards in the deck. It looks something like this:

<title>Ridiciulously Cheap and Easy Paladin-Decks-Hearthhead</title>
.
.
(intermediate code)
.
.
<div class="deckguide-cards">
  <div class="deckguide-cards-type" data-type="4">
    <h2 class="heading-size-3">Minions (20)</h2>
    <ul>
      <li><a href="/card=1167" class="card q3" data-id="1167">Aldor Peacekeeper</a> x2</li>
      <li><a href="/card=2037" class="card q1" data-id="2037">Antique Healbot</a> x2</li>
      <li><a href="/card=1022" class="card q1" data-id="1022">Argent Protector</a> x2</li>
      <li><a href="/card=801" class="card q3" data-id="801">Crazed Alchemist</a> x2</li>
      <li><a href="/card=582" class="card q1" data-id="582">Darkscale Healer</a> x2</li>
      <li><a href="/card=1651" class="card q1" data-id="1651">Earthen Ring Farseer</a> x2</li>
      <li><a href="/card=2262" class="card q5" data-id="2262">Emperor Thaurissan</a></li>
      <li><a href="/card=1068" class="card q1" data-id="1068">Guardian of Kings</a></li>
      <li><a href="/card=424" class="card q1" data-id="424">Priestess of Elune</a> x2</li>
      <li><a href="/card=1784" class="card q4" data-id="1784">Shade of Naxxramas</a> x2</li>
      <li><a href="/card=132" class="card q0" data-id="132">Voodoo Doctor</a> x2</li>
    </ul>
  </div>
  <div class="deckguide-cards-type" data-type="5">
    <h2 class="heading-size-3">Spells (8)</h2>
    <ul>
      <li><a href="/card=943" class="card q1" data-id="943">Blessing of Kings</a> x2</li>
      <li><a href="/card=1373" class="card q1" data-id="1373">Blessing of Wisdom</a> x2</li>
      <li><a href="/card=250" class="card q0" data-id="250">Hammer of Wrath</a> x2</li>
      <li><a href="/card=727" class="card q0" data-id="727">Hand of Protection</a> x2</li>
    </ul>
  </div>
  <div class="deckguide-cards-type" data-type="7">
    <h2 class="heading-size-3">Weapons (2)</h2>
    <ul>
      <li><a href="/card=847" class="card q1" data-id="847">Truesilver Champion</a> x2</li>
    </ul>
  </div>

Ruby gives you the tools to make an HTTP request and pull out the parts of the response you need with regular expressions, and the sanitize gem strips the HTML tags from what is left. Check out the code I used below:

require 'net/http'
require 'sanitize'

# Returns the deck name taken from the page's <title> tag.
def getDeckName src
  title = /<title>/
  titleEnd = '</title>'
  deckTitle = src[/#{title}(.*?)#{titleEnd}/m, 1]
  return nil if deckTitle.nil?  # no <title> tag in the source
  deckTitle.slice!(/ - Deck - Hearthstone/)  # drop the site suffix, if present
  return deckTitle
end

# Returns the deck's cards as an array of tag-free strings.
def getDeckList src
  indexBegin = /<div class="deckguide-cards-options">/
  indexEnd = /<div class="graph-statistics">/
  deckList = src[/#{indexBegin}(.*?)#{indexEnd}/m, 1]
  return [] if deckList.nil?  # card section not found
  # Each <li>...</li> holds one card: strip its tags and surrounding whitespace.
  return deckList.scan(/<li>.*?<\/li>/m).map { |item| Sanitize.fragment(item).strip }
end
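Both functions above take the page source as a string, so something has to download the page first. Below is a minimal sketch of that step using Net::HTTP from the standard library. The helper name fetch_page_source and the placeholder URL are assumptions for illustration, not necessarily the code the deckfinder ran.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: download a page and return its HTML as a string.
# Raises if the server does not answer with a 2xx status.
def fetch_page_source(url)
  response = Net::HTTP.get_response(URI(url))
  unless response.is_a?(Net::HTTPSuccess)
    raise "Request for #{url} failed with status #{response.code}"
  end
  response.body
end

# Usage (placeholder URL; substitute a real deck page):
#   src = fetch_page_source('http://example.com/deck-page')
#   puts getDeckName(src)
#   puts getDeckList(src)
```

Note that Net::HTTP.get_response does not follow redirects on its own, so a page that answers with a 301 or 302 would raise here rather than return HTML.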

Then I inserted all my extracted data into a PostgreSQL database and added search functionality for finding decks that contain a given set of cards X1, X2, ..., Xn. You can check out the app at deckfinder.heroku.com
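That search boils down to "keep every deck whose card list contains all of the requested cards". The sketch below shows the idea in plain Ruby on in-memory hashes rather than PostgreSQL, so it is a stand-in for the database query, not the app's actual schema or SQL. The helper names parse_card_entry and decks_with_cards, and the second sample deck, are made up for illustration.

```ruby
# Hypothetical helper: split a cleaned entry such as "Aldor Peacekeeper x2"
# into [name, count]. Entries without an "xN" suffix count as one copy.
def parse_card_entry(entry)
  text = entry.strip
  match = text.match(/\A(.*?)\s+x(\d+)\z/)
  match ? [match[1], match[2].to_i] : [text, 1]
end

# Hypothetical helper: names of all decks that contain every wanted card.
# decks maps deck name => { card name => count }.
def decks_with_cards(decks, wanted)
  decks.select { |_name, cards| wanted.all? { |card| cards.key?(card) } }.keys
end

decks = {
  'Ridiciulously Cheap and Easy Paladin' =>
    ['Aldor Peacekeeper x2', 'Emperor Thaurissan', 'Truesilver Champion x2']
      .map { |entry| parse_card_entry(entry) }.to_h,
  # Made-up second deck, only here to give the search something to reject.
  'Example Deck' =>
    ['Voodoo Doctor x2', 'Truesilver Champion x2']
      .map { |entry| parse_card_entry(entry) }.to_h
}

p decks_with_cards(decks, ['Truesilver Champion', 'Emperor Thaurissan'])
```

Only the first deck holds both requested cards, so only its name is printed. With a hundred thousand decks the same containment check is better left to the database, which is presumably why the data went into PostgreSQL.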

