Create functions for web scraping
I built a web scraper to pull a list of jobs on Facebook and other websites, and I want to break the code up into functions that I can reuse for other sites. The structure below works, but I think it could be cleaner and more efficient with functions; I'm getting stuck on how to structure them. For testing, it only pulls two pages.
from time import time, sleep
from requests import get
from random import randint
from IPython.core.display import clear_output
from warnings import warn
from bs4 import BeautifulSoup
import csv

# Range of only 2 pages
pages = [str(i) for i in range(1, 3)]
cities = ["Menlo%20Park%2C%20CA",
          "Fremont%2C%20CA",
          "Los%20Angeles%2C%20CA",
          "Mountain%20View%2C%20CA",
          "Northridge%2CCA",
          "Redmond%2C%20WA",
          "San%20Francisco%2C%20CA",
          "Santa%20Clara%2C%20CA",
          "Seattle%2C%20WA",
          "Woodland%20Hills%2C%20CA"]

# Preparing the monitoring of the loop
start_time = time()
requests = 0

with open('facebook_job_list.csv', 'w', newline='') as f:
    header = csv.writer(f)
    header.writerow(["Website", "Title", "Location", "Job URL"])

for page in pages:
    for c in cities:
        # Requests the html page
        response = get("https://www.facebook.com/careers/jobs/?page=" + page +
                       "&results_per_page=100&locations[0]=" + c)

        # Pauses the loop between 8 and 15 seconds
        sleep(randint(8, 15))

        # Monitor the frequency of requests
        requests += 1
        elapsed_time = time() - start_time
        print("Request: {}; Frequency: {} request/s".format(requests, requests / elapsed_time))
        clear_output(wait=True)

        # Throw a warning for non-200 status codes
        if response.status_code != 200:
            warn("Request: {}; Status code: {}".format(requests, response.status_code))

        # Break the loop if number of requests is greater than expected
        if requests > 2:
            warn("Number of requests was greater than expected.")
            break

        # Parse the content of the request with BeautifulSoup
        page_soup = BeautifulSoup(response.text, 'html.parser')
        job_containers = page_soup.find_all("a", "_69jm")

        # Select all 100 jobs containers from a single page
        for container in job_containers:
            site = page_soup.find("title").text
            title = container.find("div", "_69jo").text
            location = container.find("div", "_1n-z _6hy- _21-h").text
            link = container.get("href")
            job_link = "https://www.facebook.com" + link

            with open('facebook_job_list.csv', 'a', newline='') as f:
                rows = csv.writer(f)
                rows.writerow([site, title, location, job_link])
Tags: python, functional-programming, web-scraping
asked by iron502 (new contributor)
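Since the question is how to split this into reusable functions, here is a minimal sketch of one possible decomposition. The function names (fetch_page, parse_jobs, scrape) and the params-based URL building are illustrative, not from the original code; the sketch assumes the same Facebook URL parameters and CSS class names as above.

import csv
from random import randint
from time import sleep
from warnings import warn

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.facebook.com/careers/jobs/"


def fetch_page(page, city, min_delay=8, max_delay=15):
    """Fetch one results page and return its HTML, or None on a bad status."""
    # requests percent-encodes the query string, so a plain
    # "Menlo Park, CA" can be passed instead of a pre-encoded value
    params = {"page": page, "results_per_page": 100, "locations[0]": city}
    response = requests.get(BASE_URL, params=params)
    sleep(randint(min_delay, max_delay))  # stay polite between requests
    if response.status_code != 200:
        warn("Status code {} for {}".format(response.status_code, response.url))
        return None
    return response.text


def parse_jobs(html):
    """Yield [site, title, location, url] rows from one page of results."""
    soup = BeautifulSoup(html, "html.parser")
    site = soup.find("title").text
    for container in soup.find_all("a", "_69jm"):
        yield [site,
               container.find("div", "_69jo").text,
               container.find("div", "_1n-z _6hy- _21-h").text,
               "https://www.facebook.com" + container.get("href")]


def scrape(pages, cities, out_path):
    """Scrape every page/city combination into a single CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Website", "Title", "Location", "Job URL"])
        for page in pages:
            for city in cities:
                html = fetch_page(page, city)
                if html is not None:
                    writer.writerows(parse_jobs(html))


if __name__ == "__main__":
    scrape(range(1, 3), ["Menlo Park, CA", "Seattle, WA"], "facebook_job_list.csv")

The site-specific pieces (BASE_URL, the query parameters, the CSS class names) are confined to fetch_page and parse_jobs, so reusing this for another job board would mostly mean swapping in a different parse function and base URL while keeping scrape unchanged.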