Blogspot source #40

Closed
wants to merge 4 commits into from

@markbaas markbaas commented Dec 10, 2016

  • fixes proxydb
  • adds support for proxy lists in blogspot.

@chill117
Owner

chill117 commented Dec 12, 2016

  • Looks like a test is failing because you need to update the sample page HTML for proxydb.
  • Some of the lists that you added are already included; see proxies24. These could be combined into a single "blogspot" source.

@chill117
Owner

I've included the proxydb fixes from this PR. If you'd like to continue with the blogspot source, let me know. If I have time, I'll look at adding it as well.

@markbaas
Author

Okay, thanks! I'm a little short on time at the moment. How would you like to combine the blogspot sources? I ported the code from the Python ProxyBroker, which simply pattern-matches IPs out of the DOM. I see the proxies24 plugin is a bit more extensive.


var reg = /\d+\.\d+\.\d+\.\d+\:\d+/g
var matches = [], found;
while (found = reg.exec(listHtml)) {
Owner

I don't think it's a good idea to use regular expressions on large chunks of text. Could this use an HTML parser (cheerio) or RSS/XML parser instead?
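One concern with the pattern quoted above is that `\d+` accepts strings that are not valid addresses, such as `999.1.1.1:80`. The sketch below is not the cheerio-based approach suggested in this comment; it is a dependency-free illustration of validating each candidate (octet and port ranges) instead of trusting a loose match. The function name `extractProxies` is hypothetical, and the tag stripping is deliberately naive.

```javascript
// Hypothetical helper, not part of this PR: collects unique "ip:port"
// strings from an HTML fragment and validates each candidate.
function extractProxies(html) {
  var seen = {};
  var proxies = [];
  // Naive tag stripping: replace anything that looks like a tag with a space,
  // then split the remaining text into whitespace-delimited tokens.
  var tokens = html.replace(/<[^>]*>/g, ' ').split(/\s+/);
  tokens.forEach(function (token) {
    var parts = token.split(':');
    if (parts.length !== 2) return;
    var octets = parts[0].split('.');
    if (octets.length !== 4) return;
    // Each octet must be 1-3 digits and no larger than 255.
    var validIp = octets.every(function (octet) {
      return /^\d{1,3}$/.test(octet) && Number(octet) <= 255;
    });
    // The port must be 1-5 digits and fall within 1..65535.
    var port = Number(parts[1]);
    var validPort = /^\d{1,5}$/.test(parts[1]) && port >= 1 && port <= 65535;
    if (validIp && validPort && !seen[token]) {
      seen[token] = true;
      proxies.push(token);
    }
  });
  return proxies;
}
```

With a real HTML or RSS parser, the same validation could be applied to the text of each post body rather than to the whole page.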

'newfreshproxies24.blogspot.com', 'irc-proxies24.blogspot.com',
'freeschoolproxy.blogspot.com', 'googleproxies24.blogspot.com',
'getdailyfreshproxy.blogspot.com']
},
Owner

Looks like some of these URLs are no longer active blogs.

@chill117
Owner

chill117 commented Dec 23, 2016

It looks like you already have some of the proxies24 URLs. So as long as all the existing URLs from that source are included, I think that covers it. So to summarize:

  • The proxies24 source can be removed
  • A new blogspot source will include the blog URLs from the proxies24 source as well as the working URLs you already have here. Some of the URLs in your PR are no longer working.
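The merge summarized above could look roughly like the following sketch. The helper name `mergeBlogs` is hypothetical, and the hostnames in the usage example are placeholders: this thread does not say which blogs the proxies24 source contains or which URLs have stopped working.

```javascript
// Hypothetical helper: merges blog hostnames from an existing source with
// newly proposed ones, dropping duplicates and hostnames known to be inactive.
function mergeBlogs(existing, added, inactive) {
  var skip = new Set(inactive.map(function (host) {
    return host.toLowerCase();
  }));
  var merged = [];
  existing.concat(added).forEach(function (host) {
    var key = host.toLowerCase();
    // Keep the first occurrence only, and skip inactive blogs.
    if (!skip.has(key) && merged.indexOf(key) === -1) {
      merged.push(key);
    }
  });
  return merged;
}

// Placeholder hostnames, for illustration only.
var blogs = mergeBlogs(
  ['alpha-example.blogspot.com', 'beta-example.blogspot.com'],
  ['beta-example.blogspot.com', 'gamma-example.blogspot.com', 'dead-example.blogspot.com'],
  ['dead-example.blogspot.com']
);
```

Keeping the existing hostnames first preserves their order, so the consolidated list starts with everything the old source already covered.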

@chill117 chill117 closed this Feb 15, 2017
@chill117
Owner

chill117 commented Feb 15, 2017

The idea of this PR is good. There are some sources hosted on blogspot that could all be scraped in the same way, because the HTML structures of the sources are similar enough.

@markbaas
Author

I have been testing this for a while, and the regexes are too heavy. You have already included the most important sources; however, I think you are still missing some IPs. It would be better to fix the existing source so that it reads the RSS feed instead of the HTML posts.

@chill117
Owner

chill117 commented Feb 16, 2017

I can re-visit this to create a blogspot source that consolidates all the current proxy sources that come from blogspot pages. It will make it easy in the future to scrape additional blogspot pages.
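A minimal sketch of what such a consolidated source might look like, assuming each blog exposes Blogger's standard feed path (`/feeds/posts/default?alt=rss`). The object shape and the names `blogspotSource`, `feedUrl`, and `feedUrls` are hypothetical and not the project's actual source API; the two hostnames are taken from the diff quoted earlier and may no longer be active.

```javascript
// Hypothetical shape for a consolidated blogspot source.
var blogspotSource = {
  // Adding another blogspot page would be a one-line change to this list.
  blogs: [
    'newfreshproxies24.blogspot.com',
    'googleproxies24.blogspot.com'
  ],
  // Builds the RSS feed URL for one blog (assumed standard Blogger path).
  feedUrl: function (host) {
    return 'http://' + host + '/feeds/posts/default?alt=rss';
  },
  // Returns the feed URLs for every configured blog.
  feedUrls: function () {
    return this.blogs.map(this.feedUrl);
  }
};
```

Because every blog shares the same URL builder, a single parsing routine could be applied to each feed in turn.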
