How do I scrape pages with dynamic content using Python?

I am trying to scrape a flight search page to compare an airline's prices across a given month. This is what I have so far:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')
options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36')


# Dates: every day of April 2023, formatted as YYYY-MM-DD
dates = [f'2023-04-{day:02d}' for day in range(1, 31)]
prices = {}

for date in dates:
    # Start a fresh headless browser for each date and load that day's results page
    driver = webdriver.Chrome(options=options)
    url = f'https://www.latamairlines.com/co/es/ofertas-vuelos?origin=BOG&inbound=null&outbound={date}T17%3A00%3A00.000Z&destination=BGA&adt=1&chd=0&inf=0&trip=OW&cabin=Economy&redemption=false&sort=RECOMMENDED'
    driver.get(url)
    html = driver.page_source
    driver.quit()

    # Collect the unique price strings found in the rendered HTML
    soup = BeautifulSoup(html, 'html.parser')
    spans = soup.find_all('span', {'class': 'display-currencystyle__CurrencyAmount-sc__sc-19mlo29-2 fMjBKP'})
    prices[date] = []
    for span in spans:
        texto = span.text.strip()
        if texto not in prices[date]:
            prices[date].append(texto)

print(prices)

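The idea is to go through the results page for every day of the month, since the URL changes with each date. I plan to apply the same approach to other sites, changing the URL and the HTML tags to search for accordingly. However, I have run into pages such as https://www.easyfly.com.co/ where, after setting the date and destination, I do not get a URL with parameters; instead, it appears to be an application that loads its content dynamically.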

What should I do in these cases? Which URL should I send the request to? Thanks.

Can you give an example of a site where the URL changes with the date?

Yes, the page in the code example changes with the date: latamairlines.com/co/es/...

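As you have already noticed, there is no specific URL you can go to. You could watch the network log while clicking on various dates and try to replicate those requests; but since you are already using selenium, I think it would be easier to just click the necessary elements to load the page. Sections 3-5 of the documentation explain how to locate and interact with elements and how to wait for them to load.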
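As a rough sketch of the click-and-wait approach, something along these lines might work. Every CSS selector here (`input#departure-date`, `button#search`, `span.price`) is a hypothetical placeholder, not taken from the real easyfly page; you would have to inspect the page in your browser's dev tools and substitute the actual ones. The function names are also my own.

```python
def unique_texts(texts):
    """Strip each string and drop blanks and duplicates, keeping first-seen order."""
    seen = []
    for text in texts:
        text = text.strip()
        if text and text not in seen:
            seen.append(text)
    return seen


def prices_for_date(driver, date, timeout=15):
    """Fill in the search form for one date and return the price strings shown.

    ASSUMPTION: all selectors below are made-up placeholders; replace them
    with the ones found by inspecting the real page.
    """
    # Imported here so unique_texts() can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    driver.get('https://www.easyfly.com.co/')

    # Wait until the (hypothetical) date field is clickable, then type the date.
    date_input = wait.until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'input#departure-date'))
    )
    date_input.clear()
    date_input.send_keys(date)

    # Click the (hypothetical) search button once it is clickable.
    wait.until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'button#search'))
    ).click()

    # Wait for the dynamically loaded results instead of reading the page at once.
    price_elements = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'span.price'))
    )
    return unique_texts(element.text for element in price_elements)


def scrape_month(dates):
    """Reuse one headless Chrome session for every date and return {date: prices}."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        return {date: prices_for_date(driver, date) for date in dates}
    finally:
        # Always close the browser, even if a wait times out.
        driver.quit()
```

With the real selectors filled in, `scrape_month(dates)` would return the same kind of dictionary as your current loop. The explicit waits are the important part: on a dynamically loaded page, reading the HTML right after `driver.get()` can easily happen before the prices are there.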
