Ruby: fetch a page and extract every URL it contains

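# -*- coding: ISO-8859-1 -*-

require 'open-uri'
require 'uri'

# Read at most 2,000,000 bytes of the page.
# URI.open needs Ruby 2.5+; a bare open() no longer fetches URLs as of Ruby 3.0.
html = URI.open('****265**/').read(2000000)

# Punctuation that tends to stay attached to the end of an extracted URL.
END_CHARS = %{.,'?!:;}
# Keep only http URLs and chop one trailing punctuation character if present.
puts URI.extract(html, ['http']).collect { |u| END_CHARS.index(u[-1]) ? u.chop : u }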


The result:

****w3**/TR/xhtml1/DTD/xhtml1-transitional.dtd
****w3**/1999/xhtml
****google**.hk/search?hl=zh-CN&q=%E7%94%B5%E5%BD%B1&btnG=Google+%E6%90%9C%E7%B4%A2&meta=&aq=f&oq=&client=aff-avalanche
**google**.hk/webhp?hl=zh-cn&tab=Aw
****google.cn/imghp?hl=zh-cn&tab=Ai
**video.google.cn/?hl=zh-cn&tab=Av
**ditu.google.cn/maps?hl=zh-cn&tab=Al
**news.google.cn/nwshp?hl=zh-cn&tab=An
****google.cn/music?hl=zh-cn&tab=AU
****google.cn/prdhp?hl=zh-cn&tab=Af
****google**.hk/intl/zh-CN/options/
****google.cn/music?client=aff-avalanche&channel=textlink
****265**/images/googlemusic.gif
****google.cn/search?hl=zh-CN&client=aff-avalanche&forid=1&channel=logo&q=
****google**.hk/search?hl=zh-CN&q=
****365rili**/
****boc.cn/cn/common/whpj.html
****google**.hk/finance?client=aff-a&hl=zh-CN
**jipiao.kuxun.cn/?fromid=Kgbw-S1134381-T1137511
**huoche.kuxun.cn/?fromid=Kgbw-S1134381-T1076521
****lottery.gov.cn/
****zhcw**/
****google**.hk/search?hl=zh-CN&q=%E7%94%B5%E5%BD%B1&btnG=Google+%E6%90%9C%E7%B4%A2&meta=&aq=f&oq=&client=aff-avalanche
****google**.hk/dictionary
****google**.hk/intl/zh-CN/help/features.html
****google**.hk/intl/zh-CN/mobile/maps/
****chashouji**/
**product.cheshi**/price.html
**fund.eastmoney**/fund.html
****google**.hk/search?hl=zh-CN&q=
****google**.hk/search?hl=zh-CN&q=
**download.tech.qq**/soft/17/21/index.shtml
****google**.hk/search?hl=zh-CN&newwindow=1&safe=strict&q=%E8%B0%B7%E6%AD%8C%E9%87%91%E5%B1%B1%E8%AF%8D%E9%9C%B8&aq=f&aqi=&aql=&oq=&gs_rfai=
****uusee.tv/
**earth.google**/intl/zh-CN/
****360.cn
****kuaikuai.cn/h/down/kk_game/?channel=57_1
****flashget**/cn/download.htm
****ppstream**/download.html
**pstatic.xunlei**/about/product/down_xl5.htm
****google**/ime/pinyin/
**toolbar.google**/T4/intl/zh-CN/
****kuwo.cn/
****rising**.cn/
****maxthon.cn/
****stockstar**/cfg
****QVOD**
**feixin.10086.cn/download/
****gw**.cn/download/level1/index.htm
****google**.hk/search?q=%E6%9A%B4%E9%A3%8E%E5%BD%B1%E9%9F%B33.6+%E6%99%BA%E8%83%BD%E9%AB%98%E6%B8%85%E7%89%88&hl=zh-CN&source=billboard_site&cd=12200098&cad=5&ie=utf8
****google**/chrome?hl=zh-cn
****mosh.cn/beijing
****wooha**/?WT.mc_id=new265
****qunar**/
****abang**/
****yihaodian**/product/index.do?tracker_type=9&tracker_u=1949358
****izhufu**/
****docin**
**bai.sohu**
****daodao**/?m=12347
****gong1chuang1**
****mpdaogou**/
****aibaimm.cn/
****bankrate**.cn/
****5925car**/
****miui**/
****trends**.cn/
****m18**/
****0-6**/
****chilema.cn/default.aspx
****51mike**/
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#609
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#352
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1134
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1371
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#963
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1372
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1373
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1437
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1374
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#355
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1438
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#957
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1417
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1376
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#958
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1378
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1135
****265**/Youxi_Wangyou/Remen_Youxizhuanti_All.html#1377
****265**/Youxi_Wangyou/index.html#196
****265**/Youxi_Wangyou/index.html#195
****google**.hk/webhp?client=aff-avalanche
****baidu**/
****qq**/
****163**/
**cn.msn**/
****tom**/
****china**/
****sina**.cn/
****sohu**/
**cn.yahoo**/
****51job**/
****renren**/
****mop**/
****chinamobile**/
****ifeng**/
****eastmoney**/
****people**.cn/
****xinhuanet**/
****ctrip**/
****360.cn/
****soufun**/
****cctv**/
****zaobao**/
****xunlei**/
**58**/
****icbc**.cn/
****kaixin001**/
****4399**/
****jrj**.cn/
****hexun**/
****gov.cn/
**china.alibaba**/?cosite=265dh&location=alihomepage_dh
****taobao**/
****zol**.cn
****zhcw**/
****autohome**.cn/
****youku**
****tudou**/
****jiayuan**/
****dangdang**/
****tianya.cn/
****51**/?fromsite=265_1
**news.google**.hk/news/search?pz=1&cf=all&ned=cn&hl=zh-CN&q=%E6%B2%BB%E5%A0%B5&source=265**',
**news.google**.hk/news/more?pz=1&cf=all&ned=cn&cf=all&ncl=dVZ221HiynUgGMMdPA7YVmSj8eZRM&source=265**',
**news.google**.hk/news/more?pz=1&cf=all&ned=cn&cf=all&ncl=dlDTKEYmgUS-GKMcA1iendnuqffNM&source=265**',
**blogsearch.google**.hk/blogsearch/story?hl=zh-cn&num=10&topic=e&bcid=1651226638&bc_lang=zh-CN&source=265**',
**blogsearch.google**.hk/blogsearch/story?hl=zh-cn&topic=fe&bcid=1651764651&bc_lang=zh-CN&source=265**',
**blogsearch.google**.hk/blogsearch/story?hl=zh-cn&num=10&topic=fe&bcid=1651435361&bc_lang=zh-CN&source=265**',
**blogsearch.google**.hk/blogsearch/story?hl=zh-cn&num=10&topic=fe&bcid=1650340552&bc_lang=zh-CN&source=265**',
**blogsearch.google**.hk/blogsearch/story?hl=zh-cn&num=10&topic=fe&bcid=1651458816&bc_lang=zh-CN&source=265**',
**blogsearch.google**.hk/blogsearch/story?hl=zh-cn&topic=l&bcid=1651766213&bc_lang=zh-CN&source=265**',
**news.google**.hk/news/more?pz=1&cf=all&ned=cn&cf=all&ncl=dZkub0JsW0461gMDeDdtdN4qp-pMM&source=265**',
**news.google**.hk/news/search?pz=1&cf=all&ned=cn&hl=zh-CN&q=%E5%85%AC%E5%8A%A1%E5%91%98%E5%B7%A5%E8%B5%84&source=265**',
**news.google**.hk/news/more?pz=1&cf=all&ned=cn&cf=all&ncl=dQhOS-yAc2hsqbMejJ2S82D9XA0CM&source=265**',
**news.google**.hk/news/search?pz=1&cf=all&ned=cn&hl=zh-CN&q=%E5%BC%A0%E9%9B%A8%E7%BB%AE&source=265**',
****17173**
**games.sina**.cn/
****duowan**/
**game.zol**.cn
**tl.sohu**/
**aion.sdo**/project/index/index.htm
**t.mop**
****onlinedown**/
****skycn**/
****crsky**
**xiazai.zol**.cn
**tech.sina**.cn/down/
****mydown**/
****qidian**/
****readnovel**/
****xxsy**/
****hongxiu**/
****jjwxc**/
**book.sina**.cn/
****xs8.cn/
**military.china**/zh_cn/
**mil.news.sina**.cn/
****tiexue**/
**mil.huanqiu**/
****xinhuanet**/mil/
**junshi.xilu**/
**sports.sina**.cn/
**sports.sohu**/
**sports.163**/
**sports.tom**/
**china.nba**/
****hoopchina**/
****zhibo8**/
****google.cn/music/homepage?client=aff-avalanche
****1ting**/
****haoting**/
****qq163**/
****aiting**
****520music**/
**youku**
**tudou**
****ku6**/
****6.cn
****joy.cn/
**video.sina**.cn/
**tv.sohu**/
****google**.hk/finance?client=aff-a&hl=zh-CN
****jrj**.cn/
**eastmoney**/
****stockstar**/
****cnfol**/
**hexun**/
**finance.qq**/
**people**.cn/
**xinhuanet**/
****chinanews**.cn/
**news.sina**.cn/
**news.google**.hk/news?ned=cn&client=aff-avalanche
****china**.cn
**news.ifeng**
****tianya.cn
****qihoo**
****xici**
****daqi**
****dianping**
**dzh.mop**
**laiba.tianya.cn
**blog.sina**.cn/
****bokee**/
**blog.163**/
**blog.sohu**/
**hi.baidu**
**qzone.qq**/
****jiayuan**/
****marry5**/
****baihe**/
****zhenai**/
****ipart.cn/
****wealink**/
****myspace.cn/
****lol99**/
****4399**/
****7k7k**/
****51mole**/
****yx007**/
****3839**/
****xiaoyouxi**/
****91wan**/
**mail.163**/
**mail.cn.yahoo**
**mail.sohu**
****hotmail**/
****gmail**/
****126**/
**mail.sina**.cn/
****39**/
****xywy**/
****haodf**/
**health.sohu**/
****fx120**/
****fh21**.cn/
****jxdyf**/
****pclady**.cn
**eladies.sina**.cn/
**women.sohu**/
**lady.163**/
****lady8844**/
****zdface**/
****yoka**/
****icbc**.cn/index.jsp
****boc.cn/
****ccb**/portal/cn/home/index.html
****abchina**/
****bankcomm**/
****cmbchina**/
****cib**.cn/
**finance.sina**.cn/fund/
**fund.jrj**.cn/
****1234567**.cn/
****chinaamc**/
****jsfund.cn/
****gffunds**.cn/
**autohome**.cn/
****pcauto**.cn/
****xcar**.cn/
**/**chinacars**
****cheshi**/
****bitauto**/
****che168**/
****51job**/
**zhaopin**/
**chinahr**/
****yingjiesheng**/
****cjol**/
****528**.cn/
**taobao**/
**union.360buy**/union_default.aspx?union_Id=163
****amazon.cn/
****dangdang**/
****google**.hk/products?client=aff-avalanche
****newegg**.cn?cm_mmc=265com-_-265com-_-265com-_-1110
**58**
****55bbs**/
****anjuke**/
****koubei**/
****ci123**/
****yaolan**/
****aibangtuan**/
**house.sina**.cn/
****google**.hk/webhp?client=aff-avalanche
****baidu**/
**one.cn.yahoo**/
**ditu.google**/maps?client=aff-avalanche
****google**.hk/products?client=aff-avalanche
****google.cn/music?client=aff-avalanche
**video.google**.hk/?client=aff-avalanche
**images.google**.hk/images?client=aff-avalanche
**news.google**.hk/nwshp?client=aff-avalanche
****265**/');return(false)
****265**
****google.cn/intl/zh-CN/about.html
****265**
****miibeian.gov.cn/
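
The list above still contains repeated entries, plus a few URLs with leftover markup such as `',` or `');return(false)`, because the script chops at most one trailing character. Below is a minimal sketch of a cleanup pass. It assumes that no wanted URL contains a quote character; `clean_urls`, `TRAILING_JUNK`, and the sample addresses are hypothetical and not part of the original script.

```ruby
# Hypothetical cleanup pass for an already extracted list of URL strings.
# Assumption: a quote character never belongs to a URL we want to keep,
# so everything from the first quote onward is treated as markup residue.
TRAILING_JUNK = /[.,?!:;)]+\z/

def clean_urls(urls)
  urls.map { |u|
    u.sub(/['"].*\z/m, '')      # cut at the first quote, if any
     .sub(TRAILING_JUNK, '')    # then strip any run of trailing punctuation
  }.uniq                        # drop duplicates, keeping first-seen order
end

# Made-up sample input standing in for the extracted list.
sample = [
  "http://example.com/a',",
  "http://example.com/a",
  "http://example.com/b.",
  "http://example.com/');return(false)",
  "http://example.com/c?x=1"
]

puts clean_urls(sample)
```

With the script above, the same pass could be applied to the array returned by `URI.extract` before printing. Cutting at the first quote is a heuristic for text scraped out of HTML and inline JavaScript; it would damage a URL that legitimately contains an apostrophe.
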
Tags: qq, google, transitional dtd, zh, http www google, www google com, http www google com, e5, e6, mobile maps, url content, e7, ditu, btng, hk search, webhp, channel logo, boc, zhcw, textlink
Category: Ruby
Date: 2010-12-26