Python Web Scraping: Basics of the lxml Parsing Module (with an Introduction to XPath and Parsers)


admin

2023-07-04 15:44:21


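Introduction:

I have recently been learning web scraping with Python, and these are my study notes on lxml, a module for parsing scraped data.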

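Introduction to lxml, XPath, and parsers:

lxml is a Python parsing library that handles both HTML and XML and supports XPath queries. It parses efficiently because it is built on the C libraries libxml2 and libxslt. XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works equally well for searching HTML documents.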
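
As a first taste of the workflow described above, here is a minimal sketch that parses an HTML string and runs two XPath queries on it. The snippet and the variable names are made up for illustration only.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration.
snippet = '<div><p class="intro">Hello, <b>lxml</b>!</p><a href="/docs">Docs</a></div>'

# etree.HTML uses the lenient HTML parser and returns the <html> root element.
root = etree.HTML(snippet)

# text() selects text nodes, @href selects attribute values.
bold_text = root.xpath('//p[@class="intro"]/b/text()')
hrefs = root.xpath('//a/@href')

print(bold_text)   # ['lxml']
print(hrefs)       # ['/docs']
```

Every call to xpath() returns a list, even when only one node matches, so the result usually has to be indexed or looped over.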

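Node relationships in XML/HTML documents:
Parent: the node directly above a node
Children: the nodes directly below a node
Sibling: nodes that share the same parent
Ancestor: a node's parent, the parent's parent, and so on
Descendant: a node's children, the children's children, and so on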
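
The relationships listed above can be explored directly on lxml elements. The following is a small sketch on a made-up bookstore document (the document and variable names are assumptions for illustration), using the element methods getparent(), getnext(), iterancestors() and iterdescendants(), plus one XPath axis.

```python
from lxml import etree

# Hypothetical document, used only to illustrate the relationships.
root = etree.fromstring(
    '<bookstore>'
    '<book><title>A</title><price>10</price></book>'
    '<book><title>B</title><price>20</price></book>'
    '</bookstore>'
)

first_book = root[0]      # elements behave like lists of their children
title = first_book[0]

parent = title.getparent().tag                                # parent of <title>
children = [child.tag for child in first_book]                # children of the first <book>
sibling = title.getnext().tag                                 # next sibling of <title>
ancestors = [node.tag for node in title.iterancestors()]      # nearest ancestor first
descendants = [node.tag for node in root.iterdescendants()]   # document order

# The same sibling lookup expressed as an XPath axis.
following = [node.tag for node in title.xpath('following-sibling::*')]

print(parent)       # book
print(children)     # ['title', 'price']
print(sibling)      # price
print(ancestors)    # ['book', 'bookstore']
print(descendants)  # ['book', 'title', 'price', 'book', 'title', 'price']
print(following)    # ['price']
```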

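XPath syntax:
nodename    Selects the child nodes of the current node that are named nodename
//          Selects matching nodes at any depth (anywhere in the document when it starts the path)
/           Selects from the root node
.           Selects the current node
..          Selects the parent of the current node
@           Selects attributes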
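
To see each expression from the table in action, the sketch below applies them one by one to a small made-up document. The bookstore data and the variable names are assumptions for illustration.

```python
from lxml import etree

# Hypothetical document, used only for illustration.
root = etree.fromstring(
    '<bookstore>'
    '<book lang="en"><title>A</title></book>'
    '<book lang="zh"><title>B</title></book>'
    '</bookstore>'
)

by_name = root.xpath('book')                            # nodename: children named "book"
from_root = root.xpath('/bookstore/book/title/text()')  # /: absolute path from the root
anywhere = root.xpath('//title/text()')                 # //: any depth

first_title = root.xpath('//title')[0]
current = first_title.xpath('.')                        # .: the current node
parent_tag = first_title.xpath('..')[0].tag             # ..: the parent node
langs = root.xpath('//book/@lang')                      # @: attribute values

print(len(by_name))     # 2
print(from_root)        # ['A', 'B']
print(anywhere)         # ['A', 'B']
print(current[0].tag)   # title
print(parent_tag)       # book
print(langs)            # ['en', 'zh']
```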

Parser comparison:
Parser          Speed     Difficulty
re              Fastest   Hard
BeautifulSoup   Slow      Very easy
lxml            Fast      Easy
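
To make the difficulty column a little more concrete, here is a rough sketch that extracts the same links once with re and once with lxml. The page string is made up, the regular expression is only one of many possible patterns, and BeautifulSoup is left out to keep the dependencies down. No timing is measured, so this says nothing about the speed column.

```python
import re
from lxml import etree

# Hypothetical page fragment, used only for illustration.
page = (
    '<p class="story">'
    '<a href="http://example.com/a">A</a> and '
    '<a class="x" href="http://example.com/b">B</a>'
    '</p>'
)

# re: the pattern has to anticipate attribute order and quoting by hand.
links_re = re.findall(r'<a\b[^>]*\bhref="([^"]*)"', page)

# lxml: the query describes the structure and the parser handles the markup.
links_lxml = etree.HTML(page).xpath('//p[@class="story"]/a/@href')

print(links_re)     # ['http://example.com/a', 'http://example.com/b']
print(links_lxml)   # ['http://example.com/a', 'http://example.com/b']
```

The regular expression works here, but it silently breaks if the page switches to single-quoted attributes, whereas the XPath query keeps working.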

Study notes:

# -*- coding: utf-8 -*-

from lxml import etree

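# The tags of this sample were lost in the original listing. The text matches
# the well-known sample from the Beautiful Soup documentation, so the markup
# (including the example.com links) is restored from that sample.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""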

selector = etree.HTML(html_doc)   # parse the string into an element tree

# Extract the href of every link inside <p class="story">
links = selector.xpath('//p[@class="story"]/a/@href')
for link in links:
    print(link)

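# The markup of this sample was lost in the original listing. The names used by
# the XPath queries below (notebook, user, name, address, age, @lang, @category,
# @class) are restored from those queries. <sex>, <contact>, <email>, <phone>
# and all attribute values are placeholder assumptions.
xml_test = """
<notebook>
    <user category="cb" class="dba">
        <name>lizibin</name>
        <sex>m</sex>
        <address lang="zh">sjz</address>
        <age>28</age>
        <contact>
            <email>konigerwin@163.com</email>
            <phone>135......</phone>
        </contact>
    </user>
    <user category="cb" class="dev">
        <name>wsq</name>
        <sex>f</sex>
        <address lang="zh">shanghai</address>
        <age>25</age>
        <contact>
            <email>konigerwiner@163.com</email>
            <phone>135......</phone>
        </contact>
    </user>
    <user category="web" class="dba dev">
        <name>liqian</name>
        <sex>f</sex>
        <address lang="en">SH</address>
        <age>28</age>
        <contact>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </contact>
    </user>
    <user category="web" class="dev">
        <name>qiangli</name>
        <sex>f</sex>
        <address lang="en">SH</address>
        <age>29</age>
        <contact>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </contact>
    </user>
    <user category="web" class="dev">
        <name>buzhidao</name>
        <sex>f</sex>
        <address lang="en">SH</address>
        <age>999</age>
        <contact>
            <email>konigerwinarry@163.com</email>
            <phone>135......</phone>
        </contact>
    </user>
</notebook>
"""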

# r = requests.get('http://xxx.com/abc.xml')   # an XML file on a remote server can be fetched the same way
# etree.HTML(r.text.encode('utf-8'))
xml_code = etree.HTML(xml_test)     # parse the string into an element tree

# Select every name node in the document (prints Element objects, not their text)
print(xml_code.xpath('//name'))

# Select the text of every name node (the data itself)
print(xml_code.xpath('//name/text()'))
print('')

# Select all notebook nodes, to use the first one as a starting point
notebook = xml_code.xpath('//notebook')

# Text of the first name node under notebook
print(notebook[0].xpath('.//name/text()')[0])

name_node = notebook[0].xpath('.//name')[0]
# Text of the address node that is a sibling of that name node
print(name_node.xpath('../address/text()'))

# Select an attribute value
print(name_node.xpath('../address/@lang'))

# Name of the first user under notebook
print(xml_code.xpath('//notebook/user[1]/name/text()'))

# Name of the last user under notebook
print(xml_code.xpath('//notebook/user[last()]/name/text()'))

# Name of the second-to-last user under notebook
print(xml_code.xpath('//notebook/user[last()-1]/name/text()'))

# Addresses of the first two users under notebook
print(xml_code.xpath('//notebook/user[position()<3]/address/text()'))

# Names of all users whose category attribute is "cb"
print(xml_code.xpath('//notebook/user[@category="cb"]/name/text()'))

# Names of all users younger than 30
print(xml_code.xpath('//notebook/user[age<30]/name/text()'))

# For users whose class attribute contains "dba": the class values, then the names
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/@class'))
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/name/text()'))
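
The notes above feed XML to etree.HTML, which works because every query starts with //. As a side note, here is a small sketch of how the HTML parser and the XML parser (etree.fromstring) treat the same input. The one-user document is a shortened, made-up version of the sample data.

```python
from lxml import etree

# Shortened, hypothetical version of the notebook sample.
xml_text = '<notebook><user><name>lizibin</name><age>28</age></user></notebook>'

as_html = etree.HTML(xml_text)        # HTML parser: wraps the content in <html><body>
as_xml = etree.fromstring(xml_text)   # XML parser: keeps <notebook> as the root

print(as_html.tag)   # html
print(as_xml.tag)    # notebook

# Queries starting with // find the nodes in both trees.
html_names = as_html.xpath('//user[age<30]/name/text()')
xml_names = as_xml.xpath('//user[age<30]/name/text()')

# An absolute path only matches when the root really is <notebook>.
html_abs = as_html.xpath('/notebook/user')
xml_abs = as_xml.xpath('/notebook/user')

print(html_names, xml_names)          # ['lizibin'] ['lizibin']
print(len(html_abs), len(xml_abs))    # 0 1
```

For well-formed XML the XML parser is usually the safer choice, since it is strict about errors and does not lowercase tag names or add wrapper elements.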
