{"id":5,"date":"2023-07-23T18:50:18","date_gmt":"2023-07-23T18:50:18","guid":{"rendered":"https:\/\/zahiralam.com\/blog\/?p=5"},"modified":"2024-09-05T10:48:44","modified_gmt":"2024-09-05T10:48:44","slug":"data-scraping-with-python-a-step-by-step-guide-with-real-life-example","status":"publish","type":"post","link":"https:\/\/zahiralam.com\/blog\/data-scraping-with-python-a-step-by-step-guide-with-real-life-example\/","title":{"rendered":"Data Scraping with Python: A Step-by-Step Guide with Real-Life Example"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"><em>Introduction<\/em><\/h2>\n\n\n\n<p>Data scraping, also known as web scraping, is the process of automatically extracting information from websites. Python offers several libraries that simplify the task. In this post, we walk through a step-by-step guide to scraping data with Python using two popular libraries, Requests and Beautiful Soup. As a real-life example, we scrape movie details from a movie review website.\n\n\n\n<p><strong>Step 1: Install the Required Libraries<\/strong>\n\n\n\n<p>First, make sure Python is installed on your system. Then install the two required libraries, Requests and Beautiful Soup:\n\n\n\n<pre class=\"wp-block-syntaxhighlighter-code\">pip install requests beautifulsoup4<\/pre>\n\n\n\n<p><strong>Step 2: Import the Libraries<\/strong>\n\n\n\n<p>Now, import both libraries into your Python script:\n\n\n\n<pre class=\"wp-block-syntaxhighlighter-code\">import requests\nfrom bs4 import BeautifulSoup<\/pre>\n\n\n\n<p><strong>Step 3: Choose a Website and Inspect the Structure<\/strong>\n\n\n\n<p>For our example, let&#8217;s scrape movie details from a popular movie review website such as IMDb. 
Before proceeding, inspect the site&#8217;s HTML structure to identify the elements you want to extract.\n\n\n\n<p>For demonstration purposes, let&#8217;s extract the movie titles, release years, and IMDb ratings from the &#8220;Top Rated Movies&#8221; page.\n\n\n\n<p><strong>Step 4: Send a GET Request<\/strong>\n\n\n\n<p>To start scraping, use the <code>requests<\/code> library to send a GET request and fetch the page&#8217;s HTML content:\n\n\n\n<pre class=\"wp-block-syntaxhighlighter-code\">url = \"https:\/\/www.imdb.com\/chart\/top\/\"\n# A timeout keeps the script from hanging on an unresponsive server\nresponse = requests.get(url, timeout=10)\nresponse.raise_for_status()  # raise an exception on a 4xx or 5xx response<\/pre>\n\n\n\n<p><strong>Step 5: Parse the HTML Content<\/strong>\n\n\n\n<p>Parse the HTML content using Beautiful Soup:\n\n\n\n<pre class=\"wp-block-syntaxhighlighter-code\">soup = BeautifulSoup(response.content, \"html.parser\")<\/pre>\n\n\n\n<p><strong>Step 6: Locate the Relevant Data<\/strong>\n\n\n\n<p>Using the browser&#8217;s inspector tool, identify the HTML elements that contain the movie titles, release years, and IMDb ratings. 
For our example, we find that the movie titles are inside <code>&lt;td class=\"titleColumn\"&gt;<\/code>, the release years are inside <code>&lt;span class=\"secondaryInfo\"&gt;<\/code>, and the IMDb ratings are inside <code>&lt;td class=\"ratingColumn imdbRating\"&gt;<\/code>. Class names like these change when a site is redesigned, so confirm them in the inspector before running the code.\n\n\n\n<p><strong>Step 7: Extract the Data<\/strong>\n\n\n\n<p>Now, extract the required data using Beautiful Soup&#8217;s CSS selector methods:\n\n\n\n<pre class=\"wp-block-syntaxhighlighter-code\">titles = []\nrelease_years = []\nratings = []\n\n# Each title cell holds the movie link and the release year\nmovie_rows = soup.select(\"td.titleColumn\")\nfor row in movie_rows:\n    titles.append(row.a.text)\n    # The year is wrapped in parentheses, so strip them\n    release_years.append(row.span.text.strip(\"()\"))\n\n# Each rating sits in its own table cell\nrating_rows = soup.select(\"td.ratingColumn.imdbRating\")\nfor row in rating_rows:\n    ratings.append(row.strong.text)\n<\/pre>\n\n\n\n<p><strong>Step 8: Display the Results<\/strong>\n\n\n\n<p>Finally, display the scraped data:\n\n\n\n<pre class=\"wp-block-syntaxhighlighter-code\"># zip pairs up the three lists and stops at the shortest one\nfor title, year, rating in zip(titles, release_years, ratings):\n    print(f\"{title} ({year}), Rating: {rating}\")\n<\/pre>\n\n\n\n<p><strong>Step 9: Run the Script and Enjoy the Results<\/strong>\n\n\n\n<p>Run the Python script to print a list of top-rated movies along with their release years and IMDb ratings.\n\n\n\n<p><em><strong>Conclusion<\/strong><\/em>\n\n\n\n<p>Data scraping with Python is a powerful way to extract information from websites. In this post, we walked through a step-by-step guide to scraping movie details from a movie review website using Python&#8217;s Requests and Beautiful Soup libraries. Always review the website&#8217;s terms of service, and avoid overwhelming its servers with too many requests. Happy scraping!\n\n\n\n<p>(Note: In practice, always respect the website&#8217;s terms of use and be considerate of the load on its servers. 
Some websites may have scraping restrictions or require consent for web scraping activities.)\n","protected":false},"excerpt":{"rendered":"<p>Introduction Data scraping, also known as web scraping, is the process of extracting valuable information from websites automatically. Python has a plethora of libraries and [&#8230;]<\/p>\n","protected":false},"author":1,"featured_media":1092,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-5","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-scraping"],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/posts\/5","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/comments?post=5"}],"version-history":[{"count":4,"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/posts\/5\/revisions"}],"predecessor-version":[{"id":23,"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/posts\/5\/revisions\/23"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/media\/1092"}],"wp:attachment":[{"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/media?parent=5"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/categories?post=5"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/zahiralam.com\/blog\/wp-json\/wp\/v2\/tags?post=5"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}