Fetch a page with a proxy using the Go language
For a while now I’ve been playing with
The Go Programming Language
– so far I’ve loved it. I figured I’d push some code snippets from time to time.
Today I spent some time creating a simple website fetcher – not quite a crawler.
The idea is very simple – download a page, run an XPath query on it and spit out the results. I was looking for a decent XPath library for Go and struggled to find one. I tried to use
xmlpath
but it sucks. I couldn’t even run queries like id('product-details')/div[@class='product-price'].
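Roughly the kind of thing I was trying – a minimal sketch, assuming the gopkg.in/xmlpath.v2 package and reading the HTML from stdin just to keep it self-contained; the product query is the same placeholder as above:

package main

import (
    "fmt"
    "log"
    "os"

    "gopkg.in/xmlpath.v2"
)

func main() {
    // xmlpath only handles a subset of XPath, and queries like this one
    // are the ones I could not get to run.
    path, err := xmlpath.Compile("id('product-details')/div[@class='product-price']")
    if err != nil {
        log.Fatalln(err)
    }

    // Parse HTML from stdin and print the first match, if any.
    root, err := xmlpath.ParseHTML(os.Stdin)
    if err != nil {
        log.Fatalln(err)
    }
    if price, ok := path.String(root); ok {
        fmt.Println(price)
    }
}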
Then I found something nicer –
Gokogiri
– which works pretty nicely, but I couldn’t find any examples except this
small article
.
The only problem with running Gokogiri is that it uses libxml2,
which is not a huge problem on Linux-based systems, but on Mac OS X you have to install it via Homebrew:

brew install libxml2
Here is the code:
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "os"

    "github.com/moovweb/gokogiri"
)

func main() {
    body := fetch("http://httpbin.org/html")

    // Parse the fetched HTML into a libxml2-backed document.
    doc, err := gokogiri.ParseHtml(body)
    if err != nil {
        log.Fatalln(err)
    }
    // Gokogiri documents wrap C memory, so free the document explicitly.
    defer doc.Free()

    // Run an XPath query against the document and print the matching nodes.
    html := doc.Root().FirstChild()
    result, err := html.Search("/html/body/h1")
    if err != nil {
        log.Fatalln(err)
    }
    fmt.Println(result)
}

// fetch downloads the given URL through the proxy configured in HTTP_PROXY
// and returns the response body.
func fetch(url string) []byte {
    // The default HTTP transport picks the proxy up from the environment.
    os.Setenv("HTTP_PROXY", "http://x.x.x.x:8080")

    client := &http.Client{}
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatalln(err)
    }
    req.Header.Set("User-Agent", "Golang Spider Bot v. 3.0")

    resp, err := client.Do(req)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }
    return body
}
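Setting HTTP_PROXY works here because the default transport reads the proxy from the environment (http.ProxyFromEnvironment). If you would rather not touch the environment, you can put the proxy straight on the client instead – a minimal sketch with the same placeholder proxy address and a hypothetical fetchViaProxy name; it also needs "net/url" added to the imports:

// fetchViaProxy is a variant of fetch that configures the proxy on the
// client itself instead of via the HTTP_PROXY environment variable.
func fetchViaProxy(pageURL string) []byte {
    // Same placeholder proxy as above, parsed into a *url.URL.
    proxyURL, err := url.Parse("http://x.x.x.x:8080")
    if err != nil {
        log.Fatalln(err)
    }
    client := &http.Client{
        Transport: &http.Transport{
            // Route every request through the fixed proxy.
            Proxy: http.ProxyURL(proxyURL),
        },
    }
    resp, err := client.Get(pageURL)
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Fatalln(err)
    }
    return body
}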