
<html>
<head>
    <title>amazon ASIN crawler (theinfo)</title>
    <style type="text/css">
    .login { text-align: right; }
    .login { color: gray; }
    .login :link, .login :visited { color: gray; text-decoration: underline;}

    .error { color: red; font-weight: bold; }

    .linky {
        color: gray; background-color: transparent;
        border: none; cursor: pointer;
        margin: 0px; padding: 0px; vertical-align: bottom;
        display: inline;
        text-align: right;
        text-decoration: underline;
        font-size: 1em;
    }
    form th { text-align: right; color: gray; font-weight: normal; text-transform: lowercase; }
    form .wrong { color: red; font-weight: bold; }

.header { background-color: #00557C; padding: .2em;}
.header h1 {   text-align: center;  }
.header h1 :link, .header h1 :visited { 
  font-family: MPW, monospace; 
  color: #fff;  /*BFEBFF */
  font-weight: 300%;
  margin-bottom: 0;
}
.subtitle {
  text-align: center; margin-top: 0;
  font-family: MPW, monospace;
  color: #BFEBFF; /*ACC6D8*/
}

body { font-family: "Lucida Grande";}

.body, .page {
  border: 1px solid #ccc; margin: 1.5em; padding: 0 1em;
  background-color: #eee;
}

.frontpage { 
  width: 100%;
  border-spacing: .6em 0;
}
.frontpage td { vertical-align: top; padding: 0; width: 33%; padding-bottom: 1em; }
.frontpage p { padding-left: 1em; padding-right: 1em;}
.frontpage h2 { text-align: center; margin-bottom: 0; font-family: MPW, monospace; padding-top: .5em;}
.frontpage .subhed { margin-top: .5em; text-align: center; color: #777; padding: 0;}


.r1 { background-color: #fdd;}
.r2 { background-color: #dfd;}
.r3 { background-color: #ddf;}

.status { text-align: right; color: gray }
.footer { clear: both}
.nav { text-align: center; }
    </style>



</head>
<body>
<div class="header">
<h1><a href="http://theinfo.anandology.com/">theinfo.org</a></h1>
<p class="subtitle">for people with large data sets</p>
</div>


<h1>amazon ASIN crawler</h1>








<div class="page">

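<p>Here's some code I wrote to generate a list of ASINs (Amazon Standard Identification Numbers) by crawling Amazon's similarity service. Starting from a single seed ASIN, it looks up similar items, saves the responses to <code>data.xml.gz</code>, records each queried ASIN in <code>tried</code>, and then repeats the lookup for every ASIN it hasn't tried yet:
</p>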
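<pre><code>import gzip
import re
import urllib.request

AWS_ID = '[put your AWS ID here]'

def qurl(asin):
    # Build the SimilarityLookup request URL for one ASIN.
    return ('http://ecs.amazonaws.com/onca/xml?Service=AWSECommerceService'
            '&amp;AWSAccessKeyId=%s&amp;Operation=SimilarityLookup&amp;ItemId=%s'
            % (AWS_ID, asin))

data_r = re.compile('(&lt;Items&gt;.*?&lt;/Items&gt;)', re.S)
asin_r = re.compile('&lt;ASIN&gt;(.*?)&lt;/ASIN&gt;', re.S)

# Both files are opened in append mode so an interrupted crawl can resume.
tried_fh = open('tried', 'a')
data_fh = gzip.open('data.xml.gz', 'at', encoding='utf-8')

def handleone(asin):
    # Query one ASIN, save the raw XML, and return the ASINs it points to.
    myqurl = qurl(asin)
    text = urllib.request.urlopen(myqurl).read().decode('utf-8')
    for x in data_r.findall(text):
        data_fh.write(x)
    data_fh.flush()
    tried_fh.write(asin + '\n')
    tried_fh.flush()
    return [x.strip() for x in asin_r.findall(text)]

patient_0 = '0805063897'  # seed ASIN

def crawler():
    # Rebuild state from earlier runs: ASINs already queried (done)
    # and ASINs seen in saved responses but not yet queried (todo).
    with open('tried') as fh:
        done = set(x.strip() for x in fh)
    todo = set()
    with gzip.open('data.xml.gz', 'rt', encoding='utf-8') as fh:
        for line in fh:
            for asin in asin_r.findall(line):
                asin = asin.strip()
                if asin not in done: todo.add(asin)
    if patient_0 not in done: todo.add(patient_0)

    # Stop once nothing is left to try; set.pop() raises KeyError on an empty set.
    while todo:
        asin = todo.pop()
        pointers = handleone(asin)
        done.add(asin)
        for p in pointers:
            if p not in done:
                todo.add(p)
        print(asin, '(%s todo; %s done)' % (len(todo), len(done)))

crawler()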
</code></pre>
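
<p>The bookkeeping with the <code>done</code> and <code>todo</code> sets is easier to see without the network and file handling. The following is a minimal sketch in Python 3, not part of the crawler itself: <code>crawl</code>, <code>neighbors</code>, and the toy <code>graph</code> with its made-up IDs are hypothetical names introduced only for illustration, and the sketch stops as soon as the <code>todo</code> set is empty.</p>

```python
def crawl(seed, neighbors, done=None):
    """Visit every ID reachable from seed, querying each one exactly once.

    neighbors is a hypothetical stand-in for the network call: it takes
    an ID and returns the IDs related to it.
    """
    done = set(done or ())   # IDs already queried
    todo = set()             # IDs seen but not yet queried
    if seed not in done:
        todo.add(seed)
    order = []               # IDs in the order they were queried
    while todo:              # stop once the frontier is empty
        item = todo.pop()
        done.add(item)
        order.append(item)
        for p in neighbors(item):
            if p not in done:
                todo.add(p)
    return order


# Toy similarity graph standing in for the web service (made-up IDs).
graph = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['D'], 'D': []}
visited = crawl('A', lambda item: graph.get(item, []))
print(sorted(visited))
```

<p>Because <code>todo</code> is a set, an ID that several items point to is queued only once, and the order in which IDs are visited is arbitrary.</p>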

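<p>The request URL can also be assembled from its parameters instead of a format string, so that each value is percent-encoded. This is only a sketch in Python 3: <code>build_url</code> and the placeholder ID <code>EXAMPLE-ID</code> are assumptions made for illustration, while the endpoint and parameter names are the ones used in the listing above.</p>

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the listing above.
BASE = 'http://ecs.amazonaws.com/onca/xml'


def build_url(aws_id, asin):
    """Return a SimilarityLookup request URL for one ASIN (hypothetical helper)."""
    params = [
        ('Service', 'AWSECommerceService'),
        ('AWSAccessKeyId', aws_id),
        ('Operation', 'SimilarityLookup'),
        ('ItemId', asin),
    ]
    # urlencode percent-encodes each value and joins the pairs.
    return BASE + '?' + urlencode(params)


# 'EXAMPLE-ID' is a placeholder, not a real access key.
url = build_url('EXAMPLE-ID', '0805063897')
print(url)
```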

</div>



<p class="status">last modified August 6, 2012</p>


<p class="footer"><code>theinfo.org</code> is a community site; if you want to help run it, join the <a href="http://groups.google.com/group/theinfo/">mailing list</a>. (It was originally started by <a href="http://www.aaronsw.com/">Aaron Swartz</a> and is powered by <a href="http://infogami.org/">infogami</a>.)</p>

</body>
</html>

