A robust way to generate a corpus and other metadata for a specified artist using Genius.com
Hey!
Glad you asked. While performing a corpus-based analysis on the artist Bob Dylan, I quickly noticed that there wasn't a single, up-to-date file containing all the lyrics.
In comes CorpusGenius, a robust solution for generating a corpus of lyrics (along with the other meta files listed below) for a user-specified artist, scraped from Genius.com using the Genius API and John W. Miller's lyricsgenius wrapper.
Rather than skipping directly to the lyrics for a song, it follows a waterfall model, going from:
All albums -> All songs by album + unreleased/EP/miscellaneous songs -> Lyrics for each song -> Final corpus.
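The waterfall can be pictured with a small sketch. This is not the script's actual code: the in-memory `CATALOG` and all function names below are hypothetical stand-ins for the Genius.com lookups, used only to show how each stage feeds the next.

```python
# Toy in-memory stand-in for Genius.com; every name and value here is hypothetical.
CATALOG = {
    "albums": {"Album One": ["Song A", "Song B"]},
    "standalone": ["Song C"],  # songs not released on an album
    "lyrics": {"Song A": "first lyric", "Song B": "second lyric", "Song C": "third lyric"},
}

def all_albums(catalog):
    """Stage 1: list every album."""
    return list(catalog["albums"])

def all_songs(catalog, albums):
    """Stage 2: album tracks first, then songs with no album (marked N/A)."""
    songs = [(album, title) for album in albums for title in catalog["albums"][album]]
    songs += [("N/A", title) for title in catalog["standalone"]]
    return songs

def lyrics_for(catalog, songs):
    """Stage 3: fetch the lyrics of each song."""
    return [(title, catalog["lyrics"][title]) for _album, title in songs]

def build_corpus(lyrics):
    """Stage 4: join all lyrics into one corpus string."""
    return " ".join(text for _title, text in lyrics)

albums = all_albums(CATALOG)
songs = all_songs(CATALOG, albums)
corpus = build_corpus(lyrics_for(CATALOG, songs))
print(corpus)
```

Going album by album first is what lets the later CSV files carry album and year information alongside each song.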
Thus, by using this Python script you'll be able to download:
1) A CSV file containing all albums released by the artist
2) A CSV file containing all tracks, both album tracks and individual songs not released on albums, such as EPs, demos, bootlegs, singles, live performances, specials, etc.
3) A CSV file containing songs sung/released/performed but NOT written by the specified artist, along with their original songwriters and the album each song appears on for the specified artist
4) A CSV file containing the lyrics of all songs
5) A CSV file containing the lyrics of all songs released, by year
6) A CSV file containing a single corpus of all songs stored in one single cell
Uses modules lyricsgenius, pandas, requests, unidecode, colorama etc. See requirements.txt for details.
Before you can start, you'll have to set up a Genius API client.
Start by cloning the repository:
git clone https://github.com/jatanjay/CorpusGenius.git
Then, from the directory that contains requirements.txt:
pip install -r requirements.txt
Run CorpusGenius!
python corpusgenius.py
It took 134.9138 minutes to generate corpus along with all other meta data for artist Bob Dylan
Details for Bob Dylan (albums : 63 , tracks : 1923 )
From this we can approximate that it takes roughly 4.2 seconds to generate all data for a song.
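The per-song estimate can be checked with a couple of lines of Python, using only the figures reported for the Bob Dylan run above:

```python
# Figures reported for the Bob Dylan run
total_minutes = 134.9138
tracks = 1923

# Convert minutes to seconds, then divide by the number of tracks
seconds_per_track = total_minutes * 60 / tracks
print(f"{seconds_per_track:.2f} seconds per track")
```

The result depends on network conditions and on the size of the artist's catalog, so treat it as a rough guide rather than a guarantee.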
If the specified artist is a band, take extra care when entering member names, since small changes can alter the final corpus.
For example, if your specified artist is The Beatles, you'd enter the individual band member names as:
John Lennon, Paul McCartney, George Harrison, Ringo Starr, Lennon-McCartney
Note: Lennon-McCartney is added because genius.com sometimes attributes songwriter credits to Lennon-McCartney rather than to John Lennon and Paul McCartney.
For example, consider entering just the band members' names (excluding Lennon-McCartney). It will work fine, skipping songs not written by "The Beatles + John Lennon, Paul McCartney, George Harrison, Ringo Starr".
But the moment it reaches a song that is stored on genius.com with songwriter credits for "Lennon-McCartney", that song is skipped as well.
It does exactly what we told it to, which in this case results in a less accurate final corpus.
Thus, for your specified artist, let the project run for a while, and if at any moment you feel songs are being erroneously skipped, note down the name the credit is stored under and re-run CorpusGenius, this time adding it along with the earlier names. Since it's impossible to know in advance under what names songwriters are credited, a little trial and error is required.
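To see why a missing name matters, here is a minimal sketch of name-based filtering. It assumes a song is kept only when at least one credited writer exactly matches an entered name (ignoring case); the function `is_written_by` is hypothetical, and the script's real matching logic may differ.

```python
def is_written_by(credited_writers, accepted_names):
    """Return True if any credited writer matches an accepted name (case-insensitive)."""
    accepted = {name.strip().lower() for name in accepted_names}
    return any(writer.strip().lower() in accepted for writer in credited_writers)

members = ["The Beatles", "John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr"]
credits = ["Lennon-McCartney"]  # how genius.com sometimes stores the credit

kept_without_alias = is_written_by(credits, members)
kept_with_alias = is_written_by(credits, members + ["Lennon-McCartney"])

print(kept_without_alias)  # the song would be skipped
print(kept_with_alias)     # the song is kept once the alias is entered
```

Under this assumption, "Lennon-McCartney" is just another string, which is why it has to be entered explicitly.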
Note: If you use Excel as your CSV reader and your corpus is huge, words may erroneously appear in random cells, because Excel cannot hold more than 32,767 characters in a single cell. If that happens, open the file with Notepad or a similar text editor.
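If you would rather inspect a large CSV from Python than from a text editor, the following sketch may help. Note that Python's own csv module also rejects very large fields by default (over 131,072 characters), so the limit is raised first. The file name `corpus_demo.csv` and the helper `longest_cell` are made up for this demo.

```python
import csv
import os
import tempfile

EXCEL_CELL_LIMIT = 32_767

# The csv module refuses fields above 131072 characters unless the limit is raised.
csv.field_size_limit(10**9)

def longest_cell(path):
    """Return the length of the longest cell in the CSV file at `path`."""
    with open(path, newline="", encoding="utf-8") as handle:
        return max((len(cell) for row in csv.reader(handle) for cell in row), default=0)

# Demo with a throwaway file standing in for a real corpus CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "corpus_demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(["la " * 50_000])  # 150,000 characters in one cell
    length = longest_cell(demo_path)

fits_in_excel = length <= EXCEL_CELL_LIMIT
print(length, "characters; fits in one Excel cell:", fits_in_excel)
```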
Jatan J. Pandya (jpandya) © 2020 / https://github.com/jatanjay/
Say we are interested in generating a corpus for the artist Bob Dylan.
The CSV file thus generated will list the albums in this fashion:
year, album title, album id
1962, Bob Dylan, 26515
1963, The Freewheelin' Bob Dylan, 17327
1964, The Times They Are A-Changin', 28249
1964, Another Side of Bob Dylan, 25519
1965, Highway 61 Revisited, 13573
. . .
. . .
. . .
2019, The Rolling Thunder Revue: The 1975 Live Recordings (Sampler), 648356
--------------------------------------snip--------------------------------------------
A further note:
Albums that have no release info on Genius.com will have their year set to "N/A" (Not available):
--------------------------------
year, album title, album id
N/A, xxxxxxxxxxxx, xxxxxxxx
-------------snip---------------
You'll notice that, along with studio albums, the CSV also contains various bootlegs, alternate albums, and albums that are compilations of live performances, outtakes, special releases, etc.
Granted, these albums will contain more or less the same songs and could be thought of as duplicates. They are included because, more often than not:
- Bootlegs, demos, etc. are often unfinished versions of the final songs. Lyrically, they are a rich source of alternate lyrics, so excluding them would affect the final corpus.
- For the same reason, outtakes and live performances are not excluded, as artists usually change lyrics on the fly. Excluding them would also affect the final corpus.
Lastly, Genius.com is an ever-changing website. A single word change in a live version of a song makes it a unique song.
First, it will find all songs on EACH album released by the artist, including box sets, alternate albums, specials, bootlegs, live albums, etc., like this:
album title, song title, song id
Under the Red Sky, 10,000 Men, 200681
Under the Red Sky, 2 X 2, 200682
Blonde on Blonde, 4th Time Around, 105774
Dylan (1973), A Fool Such as I, 199634
"The Bootleg Series, Vol. 9: The Witmark Demos: 1962-1964", A Hard Rain's ..., 105186
Bob Dylan's Greatest Hits Vol. II, A Hard Rain's ..., 105186
-----------------------------------------------------snip--------------------------------------------
For example, after considering the edge case, the above list of songs by Bob Dylan will look something like this (along with their "years", not shown here):
album title, song title, song id
N/A, "10,000 Men", 200681
Under the Red Sky, "10,000 Men", 200681
Under the Red Sky, 2 X 2, 200682
N/A, 2 X 2, 200682
N/A, 32-20 Blues, 1686914
N/A, 4th Time Around, 105774
Blonde on Blonde, 4th Time Around, 105774
Dylan (1973), A Fool Such as I, 199634
N/A, A Fool Such as I, 199634
N/A, 900 Miles from My Home, 1994655
-----------------------------------------------------snip--------------------------------------------
Songs that have no album info on Genius.com will have their album set to "N/A" (Not available).
The CSV file thus generated will contain the lyrics of each song in this fashion:
song title, lyrics
----------------------------------------snip-----------------------------------
. .
. .
A Hard Rain's A-Gonna Fall [Gaslight 1962], {
Oh, where have you been, my
blue-eyed son? . . .
I've stepped in the middle of seven
sad forests I've been out in front
of a dozen dead oceans I've been
ten thousand miles in the mouth of
a graveyard . . .
}
. .
. .
----------------------------------------snip-----------------------------------
Songs that are repeated will be added to the adjacent cell. This is because, even when songs share the same title, their lyrics can differ. As we saw, artists change lyrics in live performances, so it's necessary to keep the versions that share a title side by side. Also, since genius.com is an ever-changing website, any time a song's lyrics are edited the result is a new set of lyrics for that song, and hence a different corpus in the end! If two appended versions are exactly the same, the duplicate is discarded.
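The rule above (keep versions whose lyrics differ, drop exact duplicates) can be sketched as follows. The function `collect_versions` and the placeholder strings are hypothetical; this illustrates the idea rather than the script's own code.

```python
from collections import defaultdict

def collect_versions(songs):
    """Group lyrics by title, keeping distinct versions and dropping exact duplicates."""
    versions = defaultdict(list)
    for title, lyrics in songs:
        if lyrics not in versions[title]:
            versions[title].append(lyrics)
    return dict(versions)

# Placeholder strings stand in for real lyrics.
songs = [
    ("Song A", "studio version text"),
    ("Song A", "live version text"),    # same title, different lyrics: kept
    ("Song A", "studio version text"),  # exact duplicate: dropped
    ("Song B", "only version text"),
]

grouped = collect_versions(songs)
for title, texts in grouped.items():
    # One row per title, one adjacent cell per distinct version
    print(title, texts)
```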
This CSV file contains the lyrics of the album tracks of each album, grouped by year. For example, considering Bob Dylan's discography:
year, all_lyrics
-----------------------------------------------------snip--------------------------------------------
. .
. .
1966, {'Well, your railroad gate, you know I just cant jump it Sometimes
it gets so hard, you see I just sitting here beating on my trumpet
With all these promises you left for me But where are you tonight,
sweet Marie? Well, I waited for you when I was half sick Yes, I
waited for you when you hated me. Well, I waited for you inside of
the frozen traffic Yeah, when you knew I had some other place to be Now,
where are you tonight sweet Marie? Well, anybody can be just like me,
obviously But then, now again, not too many can be like you, fortunately
Well, six white horses that you . . .
". . . and lyrics of all other songs released in the year 1966"
}
.
2020, {
.
.
}
-----------------------------------------------------snip-------------------------------------------
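Grouping lyrics by release year, as in the sample above, might look like the sketch below. It assumes each track is available as a (year, lyrics) pair with the year stored as a string; the function name `lyrics_by_year` and the sample data are illustrative only.

```python
from collections import defaultdict

def lyrics_by_year(tracks):
    """Concatenate the lyrics of all tracks released in the same year."""
    grouped = defaultdict(list)
    for year, lyrics in tracks:
        grouped[year].append(lyrics)
    # Sort by year so the rows come out in chronological order
    return {year: " ".join(parts) for year, parts in sorted(grouped.items())}

# Placeholder strings stand in for real lyrics.
tracks = [
    ("1966", "first song text"),
    ("2020", "later song text"),
    ("1966", "second song text"),
]

by_year = lyrics_by_year(tracks)
for year, all_lyrics in by_year.items():
    print(year, "->", all_lyrics)
```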
This CSV file contains exactly the subset of songs that are NOT written by the artist. Along with the title of the song, the CSV file also contains the original songwriter and the album the song appears on for the artist in question.
song title, album title & original songwriter (if available)
Mr. Bojangles, "['N/A', {'Jerry Jeff Walker'}]","['Dylan (1973)', {'Jerry Jeff Walker'}]"
-----------------------------------------------------snip-------------------------------------------
Thus, here the song "Mr. Bojangles":
Is written by Jerry Jeff Walker and not Bob Dylan.
But is recorded nonetheless on the album Dylan (1973).
It is listed twice because the song "Mr. Bojangles" appears twice in the final song CSV file.
There is also a version of Mr. Bojangles on genius.com that doesn't have the required info, hence it is set to "N/A".
Songs that have no album info or songwriter info on Genius.com will have those fields set to "N/A" (Not available).
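The cells in the sample row look like Python literals (a list holding an album title and a set of writers), so one way to read them back is `ast.literal_eval`. This sketch assumes the real file uses exactly the layout of the sample row shown above; the helper `parse_row` is hypothetical.

```python
import ast
import csv
import io

# The sample row from above, as it would appear in the CSV file.
sample = '''Mr. Bojangles, "['N/A', {'Jerry Jeff Walker'}]","['Dylan (1973)', {'Jerry Jeff Walker'}]"\n'''

def parse_row(row):
    """Split a row into the song title and a list of (album, writers) pairs."""
    title, *cells = row
    entries = [ast.literal_eval(cell) for cell in cells]
    return title, [(album, sorted(writers)) for album, writers in entries]

# skipinitialspace lets the quoted cell that follows ", " parse as one field.
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
parsed = [parse_row(row) for row in reader]

for title, appearances in parsed:
    for album, writers in appearances:
        print(f"{title} | album: {album} | written by: {', '.join(writers)}")
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for cells read from a file.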
Finally, this CSV file contains the lyrics of all the songs joined back to back and stored in a single cell.
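As a rough sketch of the idea, the lyrics can be joined into one string and written as a one-cell CSV row. Collapsing line breaks into single spaces is an assumption made here for readability, and `build_corpus` is a hypothetical name; the script's own formatting may differ.

```python
import csv
import io

def build_corpus(lyrics_list):
    """Join all lyrics back to back, collapsing whitespace and skipping empty entries."""
    return " ".join(" ".join(text.split()) for text in lyrics_list if text.strip())

# Placeholder strings stand in for real lyrics.
lyrics = ["first line\nsecond line", "   ", "another song\n\nits last line"]
corpus = build_corpus(lyrics)

# Write the corpus as a single cell; an in-memory buffer stands in for the output file.
buffer = io.StringIO()
csv.writer(buffer).writerow([corpus])

print(corpus)
```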
If I have seen further it is by standing on the shoulders of Giants. -Isaac Newton
John W. Miller for his excellent lyricsgenius wrapper.
The authors and countless contributors of the various modules used in the project.
Please see LICENSE.md for more details.
By using this tool you agree to use the lyrics for personal purposes only. If not, please see the lyricsgenius docs and the Genius API Terms of Service. As a reminder, CorpusGenius is not responsible for your usage of the tool.