Using and, or, not, contains, starts-with, and string(.) in XPath locators

I have set up my own blog, so I may post less often on CSDN from now on: https://www.qingdujun.com/


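This post summarises the XPath features I use most often: text(), and, or, not, and contains. Related functions such as position(), last(), and starts-with() work along the same lines. Note that ends-with() exists only in XPath 2.0 and later, so it is not available in XPath 1.0 engines such as lxml.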

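# and: both conditions must hold
data1 = selector.xpath("//input[@type='submit' and @name='fuck']")
# or: either condition is enough
data2 = selector.xpath("//input[@type='submit' or @name='fuck']")
# not(...): negate a condition
data3 = selector.xpath("//input[@type='submit' and not(contains(@name,'fuck'))]")
# starts-with: the attribute value begins with the given string
data4 = selector.xpath("//input[starts-with(@id,'fuck')]")
# ends-with() is XPath 2.0 only; lxml (XPath 1.0) needs the substring() form
data5 = selector.xpath("//input[substring(@id, string-length(@id) - string-length('fuck') + 1) = 'fuck']")
# contains: the attribute value includes the given string anywhere
data6 = selector.xpath("//input[contains(@id,'fuck')]")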
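
To see what each of these predicates actually selects, here is a minimal, self-contained sketch using lxml. The form markup and the helper `ids` are made up for illustration and do not come from any real page. The last expression is the usual XPath 1.0 stand-in for ends-with(), which lxml does not provide.

```python
from lxml import etree

# Hypothetical form markup, invented purely for this illustration.
HTML = """
<form>
  <input id="user_name" type="text" name="user">
  <input id="user_mail" type="text" name="mail">
  <input id="btn_save" type="submit" name="save">
  <input id="btn_cancel" type="button" name="cancel">
</form>
"""

selector = etree.HTML(HTML)


def ids(expr):
    """Return the id attribute of every element matched by expr."""
    return [el.get("id") for el in selector.xpath(expr)]


# and: both conditions must hold
both = ids("//input[@type='text' and @name='user']")
# or: either condition is enough
either = ids("//input[@type='submit' or @name='cancel']")
# not(contains(...)): exclude elements whose name contains 'mail'
negated = ids("//input[@type='text' and not(contains(@name, 'mail'))]")
# starts-with / contains on the id attribute
prefix = ids("//input[starts-with(@id, 'btn_')]")
substr = ids("//input[contains(@id, 'er_')]")
# XPath 1.0 has no ends-with(); compare the trailing substring instead
suffix = ids(
    "//input[substring(@id, string-length(@id) - string-length('_name') + 1)"
    " = '_name']"
)

for name, value in [("both", both), ("either", either), ("negated", negated),
                    ("prefix", prefix), ("substr", substr), ("suffix", suffix)]:
    print(name, value)
```

Results come back in document order, so `either` lists the submit button before the cancel button.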

The following example shows how string(.) works:

<div id="test3">
    我左青龙,
    <span id="tiger">
        右白虎,
        <ul>上朱雀,
            <li>下玄武。</li>
        </ul>
        老牛在腰间,
    </span>
    龙头在胸前。
</div>

Note that selector.xpath returns a list. Because an id is supposed to be unique within a page, the list indexed with [0] below holds at most one element.

data = selector.xpath('//div[@id="test3"]')[0]
info = data.xpath('string(.)')

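info now holds all the text inside the div, including the text of nested elements, concatenated in document order. Once the line breaks and indentation between the text nodes are stripped, it reads “我左青龙,右白虎,上朱雀,下玄武。老牛在腰间,龙头在胸前。”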
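
The difference between text() and string(.) is easier to see side by side. The sketch below reuses the sample div from this post. The variable names and the whitespace clean-up step are my own additions, not part of the original snippet.

```python
from lxml import etree

HTML = """
<div id="test3">
    我左青龙,
    <span id="tiger">
        右白虎,
        <ul>上朱雀,
            <li>下玄武。</li>
        </ul>
        老牛在腰间,
    </span>
    龙头在胸前。
</div>
"""

selector = etree.HTML(HTML)

# xpath() returns a list, so guard against an empty result before indexing.
matches = selector.xpath('//div[@id="test3"]')
div = matches[0] if matches else None

# text() yields only the text nodes that are direct children of the div.
direct = div.xpath('text()')

# string(.) concatenates every descendant text node, whitespace included.
full = div.xpath('string(.)')

# Drop all whitespace to get the sentence on a single line.
compact = ''.join(full.split())

# normalize-space(.) is the XPath-side alternative: it trims the ends and
# collapses each run of whitespace into a single space.
spaced = div.xpath('normalize-space(.)')

print(direct)
print(compact)
print(spaced)
```

Use `compact` when the whitespace carries no meaning, as in this Chinese sentence. For space-separated languages `normalize-space(.)` is usually the better fit, because it keeps one space between words.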

A small combined example:

# -*- coding: utf-8 -*-
import requests
from lxml import etree


class Bugzilla(object):
    def __init__(self):
        self.base_url = 'https://bugs.winehq.org/show_bug.cgi?id='
        self.user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}

    def getPage(self, url):
        # Download the page and decode it as UTF-8.
        html = requests.get(url, headers=self.user_agent)
        html.encoding = 'utf-8'
        return html.text

    def getSelector(self, page):
        # Parse the HTML text into an element tree that supports XPath.
        return etree.HTML(page)

    def getUrl(self, number):
        # Build the URL of a bug from its number.
        return self.base_url + str(number)

    def getDescription(self, selector):
        # The description is the first comment (id="c0"); return "" if it is missing.
        description = selector.xpath('//*[@id="c0"]/pre/text()')
        if not description:
            return ""
        return description[0]

    def getComments(self, selector):
        # Earlier attempts, kept for reference:
        #comments = selector.xpath('//*[@id="comments"]/table/tbody/tr/td/div[position() > 1]/pre/text()')
        #comments = selector.xpath('//*[@class="bz_comment" or not(contains(@id,"c0"))]/pre/text()')
        data = selector.xpath('//*[@class="bz_comment"]/pre')
        # string(.) gathers all text inside each <pre>, including nested tags.
        return [comms.xpath('string(.)') for comms in data]


if __name__ == '__main__':
    bug = Bugzilla()
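
The comment-extraction logic can be tried without touching the network. The sketch below runs similar XPath expressions against a hypothetical snippet that only loosely imitates a Bugzilla page, so the real markup may well differ. It also shows position() and last(), and why string(.) is preferable to text() when a comment contains nested tags.

```python
from lxml import etree

# Hypothetical markup loosely modelled on a bug page; not copied from a real one.
HTML = """
<div id="comments">
  <div id="c0" class="bz_comment"><pre>Crash on startup.</pre></div>
  <div id="c1" class="bz_comment"><pre>Confirmed in <b>4.0</b>.</pre></div>
  <div id="c2" class="bz_comment"><pre>Fixed by commit abc.</pre></div>
</div>
"""

selector = etree.HTML(HTML)


def texts(expr):
    """Return string(.) of every node matched by expr."""
    return [node.xpath('string(.)') for node in selector.xpath(expr)]


# Every comment, including the description (c0).
all_comments = texts('//*[@class="bz_comment"]/pre')

# Skip the description by combining the class test with not(...) using "and".
replies = texts('//*[@class="bz_comment" and not(@id="c0")]/pre')

# The same selection by position: every comment div after the first one.
by_position = texts('//div[@id="comments"]/div[position() > 1]/pre')

# last() picks the final comment div.
newest = texts('//div[@id="comments"]/div[last()]/pre')

# text() splits a comment wherever a nested tag interrupts it.
fragments = selector.xpath('//div[@id="c1"]/pre/text()')

print(all_comments)
print(replies)
print(newest)
print(fragments)
```

Joining the two conditions with `or` instead of `and` would match every element whose id does not contain "c0", which is most of the page. That may be why the second commented-out expression in getComments was abandoned.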



©qingdujun
2019-2-19, Haidian, Beijing


