Data Collection [30 points]
Select a website of your interest with diverse data. You can collect data from as many websites as you want, but the combination must make sense. Choosing a good data source is just as important to data science as the code! A good data source has accurate, consistent, and ample data to scrape, with few gaps. In the comments of your code, describe why you chose this data source and what makes it a good source.

Web Crawling [15 points]
You must use scrapy or a similar web crawling library to develop a Python script that navigates through multiple pages and gathers an extensive dataset. Ensure the script efficiently handles dynamic content and follows ethical scraping practices.

Web Scraping [15 points] Completed
Once you've collected your URLs from your crawler, you must use a web scraping library to extract the HTML from each URL. To get a response from a website, you can use requests or selenium. To parse the HTML, you can use BeautifulSoup. You can also use any library of your choice for this step. Scrape a minimum of 50 lines of data. However, depending on your project, you might need a lot more data, so don't let this requirement limit you. As shown in class, we can scrape 1000+ data points in under 5 minutes!

Regular Expressions [15 points] Completed
Apply regular expressions to find specific patterns and extract information from the raw HTML data you've scraped. While there are lots of ways to parse data, you must use at least 3 different regular expressions with the Python re library.

Data Cleaning [20 points] Completed
After the data is parsed from HTML into legible strings, numbers, JSON, etc., use pandas to clean up your data. You must elegantly handle missing values, duplicates, and anomalies. What you do in these cases depends on what you're planning to do with your data. Choose a strategy that makes sense for your project!
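The handling of missing values, duplicates, and anomalies described above could be sketched as follows. This is only an illustration under stated assumptions: the sample rows, the column names, the function name clean_restaurants, and the choice of "Unknown" as a fill value are all hypothetical, and the rule that valid ratings fall between 1 and 5 is an assumption about the data source, not something the assignment specifies.

```python
import pandas as pd

# Hypothetical sample data standing in for scraped restaurant records.
raw = pd.DataFrame({
    "name": ["Cafe A", "Cafe A", "Diner B", "Grill C", "Bistro D"],
    "rating": [4.5, 4.5, None, 7.2, 3.8],
    "price": ["$$", "$$", "$", None, "$$$"],
    "cuisine": ["Italian", "Italian", "American", "", "French"],
})


def clean_restaurants(df):
    """Return a cleaned copy: no duplicate rows, anomalies and gaps handled."""
    # Duplicates: drop rows that are identical in every column.
    out = df.drop_duplicates().copy()

    # Anomalies: coerce ratings to numbers and blank out values outside
    # the assumed valid range of 1 to 5 (missing ratings stay missing).
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce")
    out.loc[~out["rating"].between(1, 5), "rating"] = float("nan")

    # Missing values: label absent prices and cuisines explicitly so they
    # still show up as their own group when counting.
    out["price"] = out["price"].fillna("Unknown")
    out["cuisine"] = out["cuisine"].replace("", "Unknown").fillna("Unknown")

    return out.reset_index(drop=True)


cleaned = clean_restaurants(raw)
print(cleaned)
```

Whether to fill, drop, or flag a bad value depends on the question being asked. For a count of restaurants per cuisine, keeping an explicit "Unknown" group avoids silently shrinking the dataset, while an out-of-range rating is safer left as missing than clipped to a plausible number.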
Your data must be well formatted and organized in a pandas DataFrame object with appropriately named columns. Anyone looking at this dataset should easily be able to tell what each column represents. Display a subset of your data in your presentation and in your code block. Describe your data cleaning process and what choices you made to deal with incomplete or bad data.

Pandas Powered Analysis [20 points]
Utilize pandas for in-depth analysis, data manipulation, and statistical insights. Create clear visualizations (plots, graphs, etc.) to convey findings effectively. You can optionally use libraries such as scipy, matplotlib, or plotly to help you display your data; however, pandas alone is extremely powerful.

This must be done in Google Colab.

My question: How many restaurants of each cuisine are in NY?

What I have so far:

    # Colab shell command
    !pip install scrapy

    import re
    import datetime
    from collections import Counter

    import requests
    import pandas as pd
    import matplotlib.pyplot as plt
    from tabulate import tabulate

    today = datetime.date.today()
    print(today)

    # Placeholder: keep the real key out of shared code
    api_key = 'YOUR_YELP_API_KEY'
    url = 'https://api.yelp.com/v3/businesses/search'
    params = {
        'location': 'nyc',
        'sort_by': 'best_match',
        'limit': 50
    }
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {api_key}"
    }


    def first_category(business, key):
        # Return a field of the business's first category, or '' if it has none.
        # Indexing [0] directly raises IndexError when 'categories' is an empty list.
        categories = business.get('categories') or [{}]
        return categories[0].get(key, '')


    response = requests.get(url, params=params, headers=headers)

    if response.status_code == 200:
        data = response.json()
        businesses = data.get('businesses', [])
        print(businesses)

        names_and_locations = [
            (
                business.get('name', ''),
                business.get('location', {}).get('address1', ''),
                first_category(business, 'alias'),
                first_category(business, 'title'),  # Cuisine title
            )
            for business in businesses
        ]

        print("Names, Locations, and Cuisine:")
        for name, location, category_alias, category_title in names_and_locations:
            print(f"{name}: {location}, Category Alias: {category_alias}, Cuisine: {category_title}")

        # Regular expressions applied to the raw response text
        rating_pattern = re.compile(r'"rating":\s*([\d.]+)')
        ratings = rating_pattern.findall(response.text)
        print(f"Ratings: {ratings}")

        price_pattern = re.compile(r'"price":\s*"([^"]+)"')
        prices = price_pattern.findall(response.text)
        print(f"Prices: {prices}")

        cuisines = [first_category(business, 'title') for business in businesses]
        print(f"Cuisines: {cuisines}")

        # Count occurrences of each cuisine
        cuisine_counts = Counter(cuisines)

        table_data = []
        for i, business in enumerate(businesses, start=1):
            table_data.append({
                '#': i,
                'Name': business.get('name', ''),
                'Rating': business.get('rating', ''),
                'Review Count': business.get('review_count', ''),
                'Price': business.get('price', ''),
                'Cuisine': first_category(business, 'title'),
            })

        df = pd.DataFrame(table_data)
        print(tabulate(df, headers='keys', tablefmt='pretty', showindex=False))

        summary_stats = df.describe()

        # Plot the bar graph
        plt.figure(figsize=(12, 6))
        plt.bar(cuisine_counts.keys(), cuisine_counts.values())
        plt.title('Cuisine Distribution')
        plt.xlabel('Cuisine')
        plt.ylabel('Number of Occurrences')
        plt.xticks(rotation=45, ha='right')
        plt.show()
    else:
        print(f'Failed to retrieve data. Status code: {response.status_code}')
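The code above makes a single request, so it counts cuisines over at most 50 businesses. One way to gather more results for the question is to page through the search endpoint and count categories over the combined list. The sketch below is a hedged illustration, not the required approach: collect_businesses, count_cuisines, fake_fetch_page, and FAKE_RESULTS are hypothetical names, and the page fetcher is passed in as a function so the logic can run without network access. In the real script, the fetcher would presumably call requests.get with the search endpoint's offset and limit parameters.

```python
from collections import Counter


def collect_businesses(fetch_page, page_size=50, max_results=200):
    """Page through results, skipping businesses whose id was already seen.

    fetch_page(offset, limit) is assumed to return a list of business dicts.
    """
    seen_ids = set()
    collected = []
    offset = 0
    while offset < max_results:
        page = fetch_page(offset, page_size)
        if not page:
            break  # no more results
        for business in page:
            business_id = business.get("id")
            if business_id is not None and business_id in seen_ids:
                continue  # duplicate across pages
            seen_ids.add(business_id)
            collected.append(business)
        offset += page_size
    return collected


def count_cuisines(businesses):
    """Count every category title of every business (not only the first)."""
    counts = Counter()
    for business in businesses:
        for category in business.get("categories") or []:
            title = category.get("title")
            if title:
                counts[title] += 1
    return counts


# Hypothetical stand-in for API responses, used only to exercise the logic.
FAKE_RESULTS = [
    {"id": "a", "categories": [{"title": "Pizza"}, {"title": "Italian"}]},
    {"id": "b", "categories": [{"title": "Italian"}]},
    {"id": "c", "categories": []},
    {"id": "a", "categories": [{"title": "Pizza"}, {"title": "Italian"}]},
    {"id": "d", "categories": [{"title": "Thai"}]},
]


def fake_fetch_page(offset, limit):
    return FAKE_RESULTS[offset:offset + limit]


businesses = collect_businesses(fake_fetch_page, page_size=2, max_results=10)
counts = count_cuisines(businesses)
print(counts.most_common())
```

Counting every category rather than only the first is a design choice: a restaurant listed as both Pizza and Italian then contributes to both groups, so the totals can exceed the number of restaurants. Counting only the first category keeps one cuisine per restaurant but depends on how the source orders its categories.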

