{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this exercise you will download the audio of a YouTube video, convert it to the WAV format, and then use the Google Cloud API to transcribe its content.\n",
    "\n",
    "The lines below import some libraries that make this quite simple. In summary, they are:\n",
    "\n",
    " * **pafy**, a library to download YouTube video and audio.\n",
    " * **pydub**, a library to convert audio, for example from MP3 to WAV.\n",
    " * **google api**, which contains a lot of functionality, in particular audio transcription using the Speech API. The transcription is done on a Google server: you send it audio and get a transcription back. This way Google can keep improving its machine learning algorithms and serve the results to you.\n",
    "\n",
    "\n",
    "This exercise uses reasonably complicated syntax and concepts you have not properly learned yet. Do not worry too much about that; just try to understand what is happening. We only ask you to insert small and simple pieces of code.\n",
    "\n",
    "Select the 'cell' below and press CTRL+ENTER or SHIFT+ENTER to run the code inside it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install pafy pydub youtube-dl google-cloud google-cloud-speech google-api-python-client\n",
    "\n",
    "# import sys\n",
    "# !conda install --yes --prefix {sys.prefix} -c conda-forge ffmpeg"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "# Point the Google client libraries at the credentials file\n",
    "os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'credentials.json'\n",
    "# Add the local 'bin' directory to PATH, using the separator of the current OS\n",
    "os.environ['PATH'] += os.pathsep + os.path.join(os.path.abspath(os.curdir), 'bin')\n",
    "\n",
    "import io\n",
    "import pafy\n",
    "from pydub import AudioSegment\n",
    "\n",
    "from google.cloud import speech\n",
    "from google.cloud.speech import enums\n",
    "from google.cloud.speech import types\n",
    "\n",
    "from googleapiclient import discovery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First use `pafy` to get some information about a video.\n",
    "\n",
    "**Exercise:** What are the types of `url`, `video` and `video.length`? You can use the function `type( ... )` to find out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# YOUR CODE HERE, ~ 3 lines\n",
    "None\n",
    "None\n",
    "None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "url = 'https://www.youtube.com/watch?v=yY-P3D63Z18'\n",
    "\n",
    "video = pafy.new(url)\n",
    "\n",
    "print(\"Url:\", url)\n",
    "print(\"Title:\", video.title)\n",
    "print(\"Author:\", video.author)\n",
    "print(\"Description:\", video.description)\n",
    "print(\"Length:\", video.length)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's actually download the audio of the video and save it. The first line selects the best available audio format (YouTube provides multiple formats and encodings of video and audio).\n",
    "\n",
    "The remaining lines delete any earlier download and then save the audio under the name stored in `audio.filename`.\n",
    "\n",
    "Run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "audio = video.getbestaudio()\n",
    "\n",
    "# remove the file if it already exists\n",
    "if os.path.exists(audio.filename):\n",
    "    os.unlink(audio.filename)\n",
    "\n",
    "filename = audio.download(filepath=audio.filename)\n",
    "print(\"Filename:\", filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Google's API is easier to use with a WAV file than with a WEBM file (even though WebM is their own format). Moreover, the audio cannot be more than 60 seconds long. Longer audio needs a different API call (asynchronous or streaming recognition), which is a bit harder to use.\n",
    "\n",
    "Let's keep it simple. The recipe below is, line by line:\n",
    "\n",
    " * open the WEBM file\n",
    " * save it as WAV\n",
    " * show the audio\n",
    "\n",
    "Run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sound = AudioSegment.from_file(audio.filename)\n",
    "\n",
    "sound.export(\"audio.wav\", format=\"wav\", bitrate=\"128k\")\n",
    "\n",
    "sound"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The variable `sound` contains an `AudioSegment`, an object with a `frame_count()` method and a `frame_rate` attribute.\n",
    "\n",
    "As you may know, sound is a wave. For computers to store such waves, an audio file is composed of many frames. Each frame specifies the amplitude of the wave at a specific time. `frame_count()` calculates how many frames there are in total, and `frame_rate` states how many frames are played per second.\n",
    "\n",
    "**Exercise:** Divide `frame_count()` by `frame_rate` to find out exactly how many seconds of audio the file contains. Use both `/` and `//` for the division. What is the difference?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(\"type(sound): \", type(sound))\n",
    "\n",
    "# YOUR CODE HERE, ~ 2 lines, use both / and //\n",
    "audio_length_s = None # Use /\n",
    "audio_length_s_ = None # Use //\n",
    "# END OF YOUR CODE\n",
    "\n",
    "print(\"audio_length_s: \", audio_length_s)\n",
    "print(\"audio_length_s_: \", audio_length_s_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have a WAV file, we use the tools Google provides to load it into a data type Google likes. We also specify a configuration which states that we want English transcriptions. This improves the transcription quality, as the system now knows that **Je t'adore** is less likely to occur than **Shut the door**.\n",
    "\n",
    "The first two lines open the audio file and place its data in memory.\n",
    "\n",
    "The third line converts the data to a format that Google can handle.\n",
    "\n",
    "The last line creates a configuration for the transcription task (in which the language is also specified).\n",
    "\n",
    "Run the cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open('audio.wav', 'rb') as audio_file:\n",
    "    content = audio_file.read()\n",
    "\n",
    "audio = types.RecognitionAudio(content=content)\n",
    "config = types.RecognitionConfig(language_code='en-US')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start transcribing. The first line creates a client, which is a sort of telephone that handles the communication with Google.\n",
    "\n",
    "The second line does the hard work, or at least asks Google to do so.\n",
    "\n",
    "Run the cell (**it will result in an error**).\n",
    "\n",
    "Read the message, however cryptic it may seem. What does this error mean?\n",
    "\n",
    " * A) You need to pay for this Google service\n",
    " * B) You need to make an appointment (rendezvous) with a service agent from Google\n",
    " * C) The audio was too long, and this API only accepts smaller files\n",
    " * D) Google speech does not support audio files in stereo\n",
    " * E) The transcription service was permanently terminated in 2016, Google now only offers web search and e-mail"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "client = speech.SpeechClient()\n",
    "\n",
    "# Keyword arguments make explicit which object is the config and which is the audio\n",
    "response = client.recognize(config=config, audio=audio)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An important part of programming is failing. No code was ever right the first time and most code isn't even right in the final version. `github.com` is a place to distribute software, but also to register software bugs and manage their fixes. Browse some of the issues there and you'll get the point (or do not browse the site, and trust me on this).\n",
    "\n",
    "The issue here is the length of the audio: it is too long, as Google only allows 60 seconds. So let's extract **the first 30 seconds** and try again.\n",
    "\n",
    "To extract a time slice, the syntax is `[ <start_in_milliseconds> : <end_in_milliseconds> ]`. If you leave the start out, the slice starts at the beginning. If you leave the end out, it runs to the end.\n",
    "\n",
    "**Exercise:** Correct the error. For that, choose the correct line out of the 5 commented lines.\n",
    "\n",
    "All text after a hash sign (`#`) is commented, that is, <u>ignored by the computer</u>. This way you can disable code or add comments. You can uncomment a piece of code by deleting the hash sign (`#`).\n",
    "\n",
    "*TIP*: You can test each option by uncommenting one at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sound = AudioSegment.from_file(filename)\n",
    "\n",
    "### Select one of these 5 options by removing the # and the space:\n",
    "# sound = sound[:]\n",
    "# sound = sound[1000:]\n",
    "# sound = sound.get_sample_slice(0, 5*44000)\n",
    "# sound = sound.split_to_mono()[0]\n",
    "# sound = sound[103]\n",
    "\n",
    "\n",
    "sound.export(\"audio.wav\", format=\"wav\", bitrate=\"128k\")\n",
    "\n",
    "sound"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with open('audio.wav', 'rb') as audio_file:\n",
    "    content = audio_file.read()\n",
    "\n",
    "audio = types.RecognitionAudio(content=content)\n",
    "config = types.RecognitionConfig(language_code='en-US')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "client = speech.SpeechClient()\n",
    "\n",
    "# Keyword arguments make explicit which object is the config and which is the audio\n",
    "response = client.recognize(config=config, audio=audio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`response` is an object, but autocomplete won't work on it. However, the output suggests that there is a `results` attribute.\n",
    "\n",
    "**Exercise:** Get the `results` attribute of `response`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# YOUR CODE, ~1 line\n",
    "None\n",
    "# END OF YOUR CODE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To fully unravel this nested object, we need syntax not yet properly discussed in this course, namely `[0]`. It indicates that we want the first (zero-th) result and the first (zero-th) alternative. This time, the API only gives one result and one alternative.\n",
    "\n",
    "**Note:** The Google Speech Recognition API returns more results if we give it more audio clips to transcribe at once, and more alternatives if it is not too sure about what was said and comes up with several options. Confidence is a number between `0` and `1` that indicates how ... confident Google is about its answer.\n",
    "\n",
    "This is how we unravel `response` to the transcript text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "response.results[0].alternatives[0].transcript"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have seen how to use existing software to do a reasonably complicated task. Possibly you have never programmed before, yet using code from others allowed you to transcribe audio into text without knowing how this was done. The artificial intelligence behind this transcription was black-boxed, and is a field of its own.\n",
    "\n",
    "**Take home**\n",
    "\n",
    " * Using existing software, for example from PyPI or GitHub, can help you get something done.\n",
    " * You need to figure out how to tie different pieces of software together into something that gets your job done.\n",
    " * Using software from others has benefits:\n",
    "    - If you write it yourself, you will do things wrong and have bugs messing up your system from time to time. Others may have more time for their specific problem and solve those bugs with updates.\n",
    "    - It forces you to write modular code, since you are not allowed to change the library. This often results in more maintainable code.\n",
    "    - Friends who work on your code can read the documentation of the libraries you used, and get an idea of what is going on. Imagine all the work of explaining or documenting it yourself!\n",
    "    - You can, if you really need to, change the code of libraries, but it is something you should try to avoid.\n",
    "      <br>\n",
    "      **NEVER (NEVER!!) DO THIS VIA THE `site-packages` (OR `dist-packages`) DIRECTORIES OF THE PYTHON PATH.\n",
    "      <br>\n",
    "      Those files are deleted and replaced on each update.**\n",
    " * A lot is already out there!\n",
    " * Not everything, though :("
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}