How to Download ANY Video from the Internet using Python | The Easy Way

Arnav Bonigala
Apr 8, 2023


Photo by Wahid Khene on Unsplash

Disclaimer: this tutorial is purely for educational purposes and should not be used to pirate any content available on the internet. I assume no responsibility for the usage of the code published in this article and on GitHub.

Have you ever gone on a trip (perhaps a road trip or a flight) and really wished you had downloaded highlights from the latest sports games or movies from (not endorsed) pirated sites? But wait, these highlights aren’t on YouTube or any site that has a publicly available MP4 converter.

How would you go about downloading the video contents? The first thought that comes to mind is screen recording the video, but this is time consuming and loses some of the original quality. The next thought would be to use inspect element and find the video file that the webpage is playing from. This is a crude method and does not work for sites that use buffering (which is a large majority), so we cannot use it.

Photo by Mike van den Bos on Unsplash

What is buffering, and why does it prevent us from directly downloading the video file?

In modern, online video playing systems, we use a technology called “buffering” to essentially load the video while the user is playing it. This sounds a bit complicated, but let me show you the reality of it.

The webpage downloads the first segment of video (usually about 10–25 seconds) from some remote location in a “blob” format. Once the viewer watches about 6 or 7 seconds, assuming the segment is 10 seconds long, the webpage downloads the next segment of video (of the same length) and seamlessly stitches it on to the end of the segment already playing.

The reason we have to do this is that a video played from a single file in a standard HTML video tag arrives as one large download, and in the worst case the browser has to fetch most of that file before it can be viewed. For videos that are short (preferably less than one minute), this isn’t an issue. But as videos get longer, this becomes very time consuming and uses too many resources.

By only downloading short segments at an interval, we reduce the initial download time before playback can begin, and greatly reduce the storage load on the machine.
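To make the timing concrete, here is a toy model of the schedule described above. It is only an illustration: the function name, the 10-second segment length, and the 65 percent trigger point are assumptions chosen to match the numbers in this article, not a description of how any particular player works.

```python
import math

def segment_schedule(video_seconds, segment_seconds=10, trigger=0.65):
    """Return (segment_index, playback_time_when_requested) pairs.

    Segment 0 is requested at time 0. Each later segment is requested once
    the viewer has watched `trigger` of the segment before it.
    """
    count = math.ceil(video_seconds / segment_seconds)
    schedule = [(0, 0.0)]
    for index in range(1, count):
        start_of_previous = (index - 1) * segment_seconds
        schedule.append((index, start_of_previous + trigger * segment_seconds))
    return schedule

# A 35-second video split into 10-second segments needs four segments
for index, requested_at in segment_schedule(35):
    print(f"segment {index} requested at {requested_at:.1f}s of playback")
```

At no point does the browser hold more than a couple of segments ahead of the viewer, which is where the savings come from.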

So if you’ve been paying attention, you should have the following questions:

  1. How does the browser know what segment of the video comes next?
  2. How can we take a list of segments, download each one, and stitch them together?
  3. Is this the best way? (always question the implementation to see if you can figure out something better)

I’ll answer the first question right now, then we’ll dive into the implementation and see how exactly we can automate this entire process using Python. Then, I’ll talk about other methods and why I believe this method to be the best.

When you open any video player that uses this buffering technology, the underlying JavaScript supporting the webpage downloads a very specific file from the webpage’s backend. This file is typically called “master.m3u8” (the m3u8 extension marks it as a playlist in the HTTP Live Streaming format) and stores information about the locations and order of all segments of video that will be played.

If we download this file, search through it for the links to the segments, and then stitch those segments together using a tool like FFmpeg, we will have successfully downloaded a buffered video.
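For orientation, here is roughly what a master playlist looks like and how the links can be pulled out of it with plain Python. The sample text and its URLs are made up for illustration, and list_variant_urls is a hypothetical helper; real files differ from site to site, and the m3u8 library used later in this article handles far more of the format than this sketch does.

```python
# Invented example of a master playlist with two quality levels
SAMPLE_MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
https://example.com/video/360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2400000,RESOLUTION=1280x720
https://example.com/video/720p/index.m3u8
"""

def list_variant_urls(master_text):
    """Return the playlist URL that follows each #EXT-X-STREAM-INF line."""
    lines = [line.strip() for line in master_text.splitlines() if line.strip()]
    urls = []
    for tag, following in zip(lines, lines[1:]):
        if tag.startswith("#EXT-X-STREAM-INF") and not following.startswith("#"):
            urls.append(following)
    return urls

print(list_variant_urls(SAMPLE_MASTER))
```

Each of those URLs points to a second playlist, and that second playlist is the one that lists the individual segments in order.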

To do this, we’ll use the popular automated testing software, Selenium. Selenium allows us to control a real Chrome browser from code. To install it, use the following command.

pip install selenium

With that, you should be able to use Selenium’s extensive toolkit for automation (assuming you have chromedriver installed, which I will not cover).

Let’s import the necessary Selenium modules for this project.

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

import requests

If the site you are downloading from uses Cloudflare protection, you can use https://pypi.org/project/undetected-chromedriver/ as a replacement for Selenium.

Next, we’ll import the Selenium webdriver and create an instance of it.

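from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# replace with path to chromedriver
# (Selenium 4 takes the driver path through a Service object; newer releases no longer accept executable_path)
driver = webdriver.Chrome(service=Service(r"C:\path\to\chromedriver.exe"))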

My code seeks to download from https://animetake.tv, so I will have the driver load that webpage. Make sure to replace this URL with whatever site you want to download from. Use the link that contains the video, not the home page of the website.

driver.get("https://animetake.tv") # get the video source

One thing to note here: if you are trying to download from a pirated site, you may notice that the video is being played from an iframe. You’ll need to navigate to the source of the iframe before this will work.

To do this, you can use some of my code from previous projects that I’ll embed right below here.

WebDriverWait(driver, timeout=5).until(EC.presence_of_element_located((By.ID, 'videowrapper_mplayer')))

# select the videos parent tag
videowrapper = driver.find_element(By.ID, 'videowrapper_mplayer')

# get the iframe displaying the video (this source iframes from another source)
iframe = videowrapper.find_element(By.TAG_NAME, 'iframe')
src = iframe.get_attribute("src")

# redirect to that iframe
driver.get(src)

# find another iframe that leads to ACTUAL source
iframe = driver.find_element(By.TAG_NAME, 'iframe')
link = iframe.get_attribute("src")

# redirect to that link
driver.get(link)

Now, with Selenium and JavaScript, we’ll need to check the network requests and find the “master.m3u8” file that controls the buffering technology.

JS_get_network_requests = "var performance = window.performance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;"
network_requests = driver.execute_script(JS_get_network_requests)

We use JavaScript here to easily interface with the network requests. Trying to do this within Python itself would require a more complicated approach. Take a moment to look through the JavaScript and understand what exactly it is doing (it simply returns the page’s network requests, which Selenium hands back to Python as a list of dictionaries).

Once the Selenium browser executes this JavaScript code, we can loop through the network_requests object and search for the “master.m3u8” file.

for n in network_requests:
    if "master.m3u8" in n["name"]:
        url = n["name"]

If this code doesn’t work, make sure that you are indeed trying to download from a buffered video and not from a <video> tag. If you are sure of that, go through the network requests for the video player site and check that the m3u8 file that is being downloaded is called master. If it is not, change the name in your code.
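If the playlist is not called master, one option is to list every network entry whose URL mentions .m3u8 and pick from those. The helper below is a hypothetical addition, not part of the original code; it assumes network_requests is the list of dictionaries returned by the JavaScript above, each with a "name" key holding the request URL. The sample entries are invented.

```python
def find_m3u8_urls(network_requests):
    """Return every request URL that contains '.m3u8', in request order."""
    return [
        entry["name"]
        for entry in network_requests
        if ".m3u8" in entry.get("name", "")
    ]

# Made-up entries standing in for what the browser would return
sample_requests = [
    {"name": "https://example.com/player.js"},
    {"name": "https://example.com/stream/index.m3u8?token=abc"},
    {"name": "https://example.com/stream/seg-1.ts"},
]
print(find_m3u8_urls(sample_requests))
```

Matching on ".m3u8" rather than a full file name also catches URLs that carry a query string after the extension.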

Now that we’ve gotten the URL for the master.m3u8 file, we need a Python library that can read the file so we can search it for the video segments.

pip install m3u8

import m3u8

r = requests.get(url) # get the master.m3u8 file
m3u8_master = m3u8.loads(r.text) # convert to a readable format
playlist_url = m3u8_master.data["playlists"][0]['uri'] # get the URL of the playlist containing all video segments
r = requests.get(playlist_url) # get the playlist file
playlist = m3u8.loads(r.text) # load into readable format

As you can see here, the m3u8 file is divided into playlists that we can look through. In my case, the m3u8 file contained a playlist for audio, and a playlist for video. I need to reuse this code for both. If your m3u8 file doesn’t have two playlists, you will be fine just doing it once.

We’re going to use the m3u8 library to load the request text into a Python-readable format (basically a dictionary) and look for the different playlists. Each playlist contains a link to another m3u8 file, which we fetch with requests and then load with m3u8 again.
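One caveat: the code in this article assumes every 'uri' value is a complete URL. Some playlists use relative paths instead, in which case requests cannot fetch them as written. A minimal sketch of resolving them against the URL of the playlist they came from, using only the standard library (resolve_uri is a hypothetical helper and the example URLs are invented):

```python
from urllib.parse import urljoin

def resolve_uri(playlist_url, uri):
    """Turn a possibly relative playlist entry into an absolute URL."""
    return urljoin(playlist_url, uri)

master_url = "https://example.com/stream/master.m3u8"

# A relative entry is resolved against the folder of the master playlist
print(resolve_uri(master_url, "720p/index.m3u8"))

# An entry that is already absolute comes back unchanged
print(resolve_uri(master_url, "https://cdn.example.com/720p/index.m3u8"))
```

The same call works one level down: segment entries would be resolved against the URL of the playlist that lists them.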

We are going to use requests to get this playlist_url, then write a .ts file that combines each segment in the playlist.

r = requests.get(playlist.data['segments'][0]['uri']) # request the first segment (optional; the loop below downloads every segment again)

with open('video.ts', 'wb') as f:
    for segment in playlist.data['segments']: # go through each segment and write it to the file
        url = segment['uri'] # URL of this segment
        r = requests.get(url) # download the segment

        f.write(r.content) # append its bytes to video.ts

Each entry in the playlist’s list of segments carries the URL of one piece of the video. We loop through the segments, request each URL, and append the downloaded bytes to the file.

When you run this, you’ll notice a new file show up in your directory that should contain the downloaded video. If your m3u8 file only has one playlist, you should have successfully downloaded the video and audio in one file.

If your m3u8 file has two playlists, continue along this tutorial and we’ll repeat the same steps to get the audio.

for n in network_requests: # reassign url to the master.m3u8 url
    if "master.m3u8" in n["name"]:
        url = n["name"]

r = requests.get(url)
m3u8_master = m3u8.loads(r.text)
playlist_url = m3u8_master.data["media"][1]['uri'] # get audio playlist url
r = requests.get(playlist_url)
playlist = m3u8.loads(r.text)

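r = requests.get(playlist.data['segments'][0]['uri']) # request the first segment (optional; the loop below downloads every segment again)

with open('audio.ts', 'wb') as f:
    for segment in playlist.data['segments']: # write each segment to a new file
        url = segment['uri'] # URL of this segment
        r = requests.get(url) # download the segment

        f.write(r.content) # append its bytes to audio.ts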

This repeated chunk of code can be extracted into a function, but for the sake of the tutorial, I will not be doing that.
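For readers who do want a function, one possible shape is sketched below. The names download_segments, fetch_bytes, and the fetch parameter are hypothetical choices for illustration. The default fetcher uses the standard library’s urllib rather than requests so that the block runs on its own; passing fetch=lambda u: requests.get(u).content would match the code above.

```python
from urllib.request import urlopen

def fetch_bytes(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()

def download_segments(segment_urls, out_path, fetch=fetch_bytes):
    """Write every segment, in order, to a single file.

    Returns the number of segments written.
    """
    written = 0
    with open(out_path, "wb") as f:
        for url in segment_urls:
            f.write(fetch(url))  # append this segment's bytes
            written += 1
    return written
```

With the variables from this article, the video call would look like download_segments([s['uri'] for s in playlist.data['segments']], 'video.ts'), and the same again with 'audio.ts' once playlist holds the audio playlist.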

Now we have a problem: we have two files that we need to merge. The best solution for this is an FFmpeg command (no need to reinvent the wheel).

Download FFmpeg and add it to PATH (or your virtual environment), so we can utilize it in our code.

import subprocess

# subprocess will allow us to execute a command

file_name = "my_video" # name of the final file, without extension (placeholder; pick your own)

subprocess.run(['ffmpeg', '-i', 'video.ts', '-i', 'audio.ts', '-c', 'copy', 'output.ts'])
subprocess.run(['ffmpeg', '-i', 'output.ts', f'{file_name}.mp4'])

This chunk of code merges our two .ts files and then converts that output into an mp4 file.
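The two steps can often be collapsed into one, because FFmpeg can copy the streams straight into an MP4 container, which also avoids re-encoding the video in the second command. This is a sketch under that assumption and may not suit every stream; if the direct copy fails, the two-step version above is the fallback. build_merge_command is a hypothetical helper name, and the -y flag (overwrite the output without asking) is an addition not present in the original commands.

```python
import os
import shutil
import subprocess

def build_merge_command(video_path, audio_path, output_path):
    """Build an FFmpeg command that combines video and audio without re-encoding."""
    return [
        "ffmpeg", "-y",      # -y: overwrite output_path if it already exists
        "-i", video_path,
        "-i", audio_path,
        "-c", "copy",        # copy both streams as they are
        output_path,
    ]

command = build_merge_command("video.ts", "audio.ts", "output.mp4")
print(" ".join(command))

# Only run it when FFmpeg is on PATH and both input files exist
if shutil.which("ffmpeg") and os.path.exists("video.ts") and os.path.exists("audio.ts"):
    subprocess.run(command, check=True)
```

Passing check=True makes Python raise an error if FFmpeg exits with a failure, rather than silently leaving a missing or broken output file.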

Let’s clean up and delete the unnecessary files we have downloaded.

import os

os.remove("video.ts")
os.remove("audio.ts")
os.remove("output.ts")
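Note that os.remove raises FileNotFoundError if a file is missing, for example when the stream had no separate audio playlist and audio.ts was never written. A small variation that skips missing files, assuming Python 3.8 or newer for missing_ok (remove_temp_files is a hypothetical helper name):

```python
from pathlib import Path

def remove_temp_files(paths):
    """Delete each path if it exists; return the names that were removed."""
    removed = []
    for name in paths:
        path = Path(name)
        if path.exists():
            path.unlink(missing_ok=True)  # tolerate the file vanishing in between
            removed.append(name)
    return removed
```

Calling remove_temp_files(["video.ts", "audio.ts", "output.ts"]) would then clean up whichever of the three files exist.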

With that, our code is complete and should work for downloading videos from websites that stream through an m3u8 playlist in the way described above!

I’ll now answer the third question, which is why this is the best implementation.

If we were to screen record videos, we would be unable to use the machine for the entire duration of the video. With this method, the code takes a few minutes to download the entire video, and you can put Selenium into headless mode, allowing you to keep using the machine while the code downloads in the background.

By using code, we also open ourselves up to new ways of optimization and automation. We can recycle the code and use it for other projects, as well as improve it constantly.

With that, I leave you to whatever you came here for. Happy coding!
