CorpusGenius

A robust way to generate a corpus and other metadata for a specified artist using Genius.com 🎼🎤🎶

View the Project on GitHub jatanjay/CorpusGenius


https://jatanjay.github.io/CorpusGenius/



What is CorpusGenius?

Hey! 👋

Glad you asked. While performing a corpus-based analysis of the artist Bob Dylan, I quickly noticed that there wasn't a single, up-to-date file containing all the lyrics.
In comes CorpusGenius, a robust solution for generating a corpus (along with the other metadata files listed below) of lyrics by a user-specified artist, scraped from Genius.com using Genius's API and John W. Miller's lyricsgenius wrapper.

Rather than skipping directly to the lyrics for a song, it follows a waterfall model, going from:

All albums --> All songs by album + unreleased/EPs/miscellaneous songs --> Lyrics for each song --> Final corpus.

Thus, by using this Python script you'll be able to download:

1) A CSV file containing all albums released by the artist
2) A CSV file containing all tracks, both those on albums and individual songs not released on albums, such as EPs, demos, bootlegs, singles, live performances, specials, etc.
3) A CSV file containing songs sung, released, or performed but NOT written by the specified artist, along with their original songwriters and the album each one appears on for the artist in question
4) A CSV file containing the lyrics of all songs
5) A CSV file containing the lyrics of all songs, grouped by release year
6) A CSV file containing a single corpus of all songs, stored in one single cell
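The waterfall described above can be sketched roughly as follows. This is only an illustration of the flow, not the script's actual code: the three callables `fetch_albums`, `fetch_tracks` and `fetch_lyrics` are hypothetical stand-ins for whatever Genius API calls the script makes, and the step that adds unreleased or miscellaneous songs is left out for brevity.

```python
from typing import Callable, Dict, Iterable, Tuple


def build_corpus(
    artist: str,
    fetch_albums: Callable[[str], Iterable[Tuple[int, str]]],
    fetch_tracks: Callable[[int], Iterable[Tuple[int, str]]],
    fetch_lyrics: Callable[[int], str],
) -> str:
    """Walk albums -> tracks -> lyrics and join everything into one corpus.

    The fetch_* callables are hypothetical placeholders, not lyricsgenius
    functions; they are injected so the flow can run without network access.
    """
    lyrics_by_song: Dict[int, str] = {}
    for album_id, _album_title in fetch_albums(artist):
        for song_id, _song_title in fetch_tracks(album_id):
            # A song id can sit on several albums; fetch its lyrics only once.
            if song_id not in lyrics_by_song:
                lyrics_by_song[song_id] = fetch_lyrics(song_id)
    return "\n".join(lyrics_by_song.values())


# Toy in-memory data standing in for Genius.com responses (made-up ids).
ALBUMS = {"Bob Dylan": [(1, "Album A"), (2, "Album B")]}
TRACKS = {1: [(10, "Song X"), (11, "Song Y")], 2: [(10, "Song X")]}
LYRICS = {10: "lyrics of X", 11: "lyrics of Y"}

corpus = build_corpus(
    "Bob Dylan", ALBUMS.__getitem__, TRACKS.__getitem__, LYRICS.__getitem__
)
print(corpus)
```

Passing the fetchers in as arguments keeps each stage of the waterfall separate, so any one of them could be swapped for a real API call.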


Requirements

Uses the modules lyricsgenius, pandas, requests, unidecode, colorama, etc. See requirements.txt for details.


Getting Started

Before you can start, you'll have to set up a Genius API client.


Running

Start by cloning the repository:

git clone https://github.com/jatanjay/CorpusGenius.git

Then, from inside the cloned CorpusGenius directory, install the dependencies:

pip install -r requirements.txt

Run CorpusGenius!

python corpusgenius.py

All files will be stored in your current working directory.


Time

It took 134.9138 minutes to generate the corpus along with all the other metadata for the artist Bob Dylan.

Details for Bob Dylan (albums: 63, tracks: 1923)

From this we can approximate that it takes about 4.2 seconds to generate all the data for one song.
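The per-song figure comes from dividing the total run time by the number of tracks. A small sketch of that arithmetic, using only the numbers reported above; the helper `estimate_minutes` is a hypothetical convenience that assumes the per-track cost stays constant for other artists, which may not hold in practice.

```python
total_minutes = 134.9138  # reported run time for Bob Dylan
tracks = 1923             # reported number of tracks

seconds_per_track = total_minutes * 60 / tracks
print(f"{seconds_per_track:.2f} seconds per track")


def estimate_minutes(n_tracks: int, per_track: float = seconds_per_track) -> float:
    """Extrapolate a run time in minutes, assuming a constant per-track cost."""
    return n_tracks * per_track / 60


# Rough guess for a hypothetical artist with 500 tracks.
print(f"{estimate_minutes(500):.1f} minutes")
```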

Tips

Tips when your artist is a band:

If the specified artist is a band, take extra care when entering the member names, since small changes can alter the final corpus. For example, if your specified artist is The Beatles, you'd enter the individual band member names as: John Lennon, Paul McCartney, George Harrison, Ringo Starr, Lennon-McCartney

Note: Lennon-McCartney is added because Genius.com sometimes attributes songwriting credits to Lennon-McCartney rather than to John Lennon and Paul McCartney.

For example, consider entering only the band members' names (excluding Lennon-McCartney). Granted, it will work fine at skipping songs not written by "The Beatles + John Lennon, Paul McCartney, George Harrison, Ringo Starr".

[Image: Original Beatles members]

But the moment it reaches a song that is stored on Genius.com with songwriter credits for "Lennon-McCartney", that song is skipped too.

[Image: Exception!]

We see that it does exactly what we told it to, which in this case results in a less accurate final corpus.

So, what should I do?

So, for your specified artist, let the project run for a while, and if at any moment you feel songs are being erroneously skipped, note down the name the songwriter is stored under and re-run CorpusGenius, this time adding it to the earlier names. Since it's impossible to know in advance under what names songwriters are credited, a little trial and error is required 😀

Tips for viewing the final corpus:
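The effect of a missing credit name can be illustrated with a small sketch. It assumes the check is an exact string match between the credited songwriters and the names you entered; the script's real matching rule is not documented here, and `is_written_by` is a hypothetical helper.

```python
from typing import Iterable


def is_written_by(credited: Iterable[str], accepted: Iterable[str]) -> bool:
    """Return True if at least one credited songwriter is among the accepted names.

    Assumes exact string matching, so "Lennon-McCartney" does not match
    "John Lennon" or "Paul McCartney".
    """
    return not set(credited).isdisjoint(accepted)


members = ["The Beatles", "John Lennon", "Paul McCartney", "George Harrison", "Ringo Starr"]
credit = ["Lennon-McCartney"]

print(is_written_by(credit, members))                         # song would be skipped
print(is_written_by(credit, members + ["Lennon-McCartney"]))  # song would be kept
```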

Note: If you use Excel as your CSV reader and your corpus is huge, Excel may erroneously spread words across random cells, because it cannot hold more than 32,767 characters in a single cell. If that happens, open the file with Notepad or a similar text editor.
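To check in advance whether a file will trip over Excel's cell limit, something like the following sketch could be used. It relies only on Python's standard csv module; the sample data is a toy stand-in, and `oversized_cells` is a hypothetical helper, not part of CorpusGenius.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # maximum characters Excel holds in one cell

# Raise the csv module's own per-field limit (131072 by default) so that a
# very large corpus cell can be read at all.
csv.field_size_limit(10**9)


def oversized_cells(csv_file):
    """Yield (row, column, length) for every cell longer than Excel's limit."""
    for row_no, row in enumerate(csv.reader(csv_file), start=1):
        for col_no, cell in enumerate(row, start=1):
            if len(cell) > EXCEL_CELL_LIMIT:
                yield row_no, col_no, len(cell)


# Toy stand-in for a corpus file: a header row and one very long cell.
sample = io.StringIO("corpus\n" + "la " * 20000 + "\n")
print(list(oversized_cells(sample)))
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.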


Author

Jatan J. Pandya (jpandya) © 2020 / https://github.com/jatanjay/


FAQs or Why the CSVs are the way they are

Say we are interested in generating a corpus for the artist Bob Dylan.

  1. "artist_name"_albums.csv:

    The generated CSV file will list albums in this fashion:

    year,   album title,                                                            album id
    1962,   Bob Dylan,                                                              26515
    1963,   The Freewheelin' Bob Dylan,                                             17327
    1964,   The Times They Are A-Changin',                                          28249
    1964,   Another Side of Bob Dylan,                                              25519
    1965,   Highway 61 Revisited,                                                   13573
    .                   .                                                            .
    .                   .                                                            .
    .                   .                                                            .
    2019,   The Rolling Thunder Revue: The 1975 Live Recordings (Sampler),          648356
    --------------------------------------snip--------------------------------------------
    

    A further note:

    Albums that have no release info on Genius.com will have their year set to "N/A" (not available):

                                        --------------------------------
                                        year,   album title,    album id
                                        N/A     xxxxxxxxxxxx    xxxxxxxx
                                        -------------snip---------------

    You'll notice that, along with studio albums, the CSV also contains various bootlegs, alternate albums, and albums that are compilations of live performances, outtakes, special releases, etc.

    Granted, these albums will contain more or less the same songs, which could be thought of as duplicates. They are included for the following reasons. Bootlegs, demos, etc. are often unfinished versions of the final songs, and lyrically they are a rich source of alternate lyrics. Outtakes and live performances are kept for the same reason, since artists often change lyrics on the fly. Excluding any of these would affect the final corpus. Lastly, Genius.com is an ever-changing website: a single changed word in a live version of a song makes it a unique song.

  2. "artist_name"_tracks.csv:

    First, it finds all songs on EACH album released by the artist, including box sets, alternate albums, specials, bootlegs, live albums, etc., like this:

    album title,                                                       song title,        	song id
    Under the Red Sky,                                                 10,000 Men,        	200681
    Under the Red Sky,                                                 2 X 2,              	200682
    Blonde on Blonde,                                                  4th Time Around,    	105774
    Dylan (1973),                                                      A Fool Such as I,   	199634
    "The Bootleg Series, Vol. 9: The Witmark Demos: 1962-1964",        A Hard Rain's ...,	105186
    Bob Dylan's Greatest Hits Vol. II,                                 A Hard Rain's ...,	105186
    -----------------------------------------------------snip--------------------------------------------
    

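    Next, it accounts for the edge case of songs that are listed without an album (unreleased songs, miscellaneous songs, and so on). For example, after considering this edge case, the list of songs by Bob Dylan above will look something like this (each row also has a year, not shown here):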

    album title,                                         song title,                song id
    N/A,                                                 "10,000 Men",               200681
    Under the Red Sky,                                   "10,000 Men",               200681
    Under the Red Sky,                                    2 X 2,                     200682
    N/A,                                                  2 X 2,                     200682
    N/A,                                                  32-20 Blues,               1686914
    N/A,                                                  4th Time Around,           105774
    Blonde on Blonde,                                     4th Time Around,           105774
    Dylan (1973),                                         A Fool Such as I,          199634
    N/A,                                                  A Fool Such as I,          199634
    N/A,                                                 900 Miles from My Home,     1994655
    -----------------------------------------------------snip--------------------------------------------
    

    Songs that have no album info on Genius.com will have their album title set to "N/A" (not available).
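Because the same song id can appear on several rows, once per album plus possibly an "N/A" row, it can help to group the rows by song. A minimal sketch with the standard csv module follows; the header names are assumed from the layout shown above and may differ in the real file, and `albums_by_song` is a hypothetical helper.

```python
import csv
import io

# Toy excerpt in the layout shown above; the header names are an assumption.
SAMPLE = '''album title,song title,song id
N/A,"10,000 Men",200681
Under the Red Sky,"10,000 Men",200681
Under the Red Sky,2 X 2,200682
N/A,32-20 Blues,1686914
'''


def albums_by_song(csv_file):
    """Map each song id to the album titles it is listed under, ignoring N/A rows."""
    albums = {}
    for row in csv.DictReader(csv_file):
        # setdefault keeps songs that only have N/A rows, with an empty list.
        titles = albums.setdefault(int(row["song id"]), [])
        if row["album title"] != "N/A":
            titles.append(row["album title"])
    return albums


print(albums_by_song(io.StringIO(SAMPLE)))
```

Songs that end up with an empty list are the ones Genius.com lists without any album.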

  3. "artist_name"_lyrics.csv:

    The generated CSV file will contain the lyrics for each song in this fashion:

    song title                                        lyrics
    ----------------------------------------snip-----------------------------------
    .                                                                     .
    .                                                                     .                                     
    A Hard Rain's A-Gonna Fall [Gaslight 1962],        {
                                                       Oh, where have you been, my
                                                       blue-eyed son? . . .
                                                       I've stepped in the middle of seven 
                                                       sad forests I've been out in front 
                                                       of a dozen dead oceans I've been 
                                                       ten thousand miles in the mouth of 
                                                       a graveyard . . .
                                                       }                                                  
    .                                                                      .
    .                                                                      .
    ----------------------------------------snip-----------------------------------
    

    Songs that are repeated will have their lyrics added to the adjacent cell. This is because, even if two songs share the same title, their lyrics can differ. As we saw, artists change lyrics in live performances, so it is necessary to keep two songs with the same title even when they look similar. Also, since Genius.com is an ever-changing website, any time a song's lyrics are edited the result is a new set of lyrics for that song, and hence a different corpus in the end! If two appended versions are exactly the same, the duplicate is discarded.
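The rule described above (keep every distinct version of a title, drop exact repeats) could be sketched like this. It assumes that "completely same" means exact string equality; how the script actually compares lyrics is not stated, and `collect_versions` is a hypothetical helper.

```python
def collect_versions(songs):
    """Group lyrics by song title, keeping each distinct version exactly once.

    `songs` is an iterable of (title, lyrics) pairs. Versions are compared by
    exact string equality, which is an assumption of this sketch.
    """
    versions = {}
    for title, lyrics in songs:
        kept = versions.setdefault(title, [])
        if lyrics not in kept:  # an identical text is a true duplicate
            kept.append(lyrics)
    return versions


songs = [
    ("Song X", "studio words"),
    ("Song X", "live words"),    # same title, different lyrics: kept
    ("Song X", "studio words"),  # exact repeat: discarded
]
print(collect_versions(songs))
```

Each list in the result would correspond to one row of the CSV, with every extra version going into the next adjacent cell.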

  4. "artist_name"_lyrics_by_years.csv:

    This CSV file contains the lyrics of all album tracks, grouped by release year. For example, considering Bob Dylan's discography:

    year,            all_lyrics
    -----------------------------------------------------snip--------------------------------------------
    .                         .
    .                         .
    1966,        {'Well, your railroad gate, you know I just cant jump it Sometimes
    			it gets so hard, you see I just sitting here beating on my trumpet 
    			With all these promises you left for me But where are you tonight, 
    			sweet Marie?  Well, I waited for you when I was half sick Yes, I 
    			waited for you when you hated me. Well, I waited for you inside of 
    			the frozen traffic Yeah, when you knew I had some other place to be Now, 
    			where are you tonight sweet Marie? Well, anybody can be just like me, 
    			obviously But then, now again, not too many can be like you, fortunately  
    			Well, six white horses that you . . . 
    			". . . and lyrics of all other songs released in the year 1966"
                }
    .
    2020,        { 			   
    								.
    								.
                 }
    -----------------------------------------------------snip-------------------------------------------
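Grouping lyrics by year amounts to collecting every song's text under its release year. A minimal sketch follows, assuming years are kept as strings so that "N/A" can sit alongside real years, and that lyrics are joined with a single space; both choices are assumptions, and `lyrics_by_year` is a hypothetical helper.

```python
from collections import defaultdict


def lyrics_by_year(rows):
    """Return {year: all lyrics of that year joined together}, ordered by year.

    `rows` is an iterable of (year, lyrics) pairs with the year as a string.
    """
    grouped = defaultdict(list)
    for year, lyrics in rows:
        grouped[year].append(lyrics)
    return {year: " ".join(grouped[year]) for year in sorted(grouped)}


rows = [
    ("2020", "words of a 2020 song"),
    ("1966", "words of song one"),
    ("1966", "words of song two"),
    ("N/A", "words of an undated song"),
]
print(lyrics_by_year(rows))
```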
    
  5. songs_not_by_"artist_name".csv

    This CSV file contains exactly the subset of songs that are NOT written by the artist. Along with the song title, it also contains the original songwriter and the album the song appears on for the artist in question.

    song title                  album title & original song writer (if available)
    Mr. Bojangles,              "['N/A', {'Jerry Jeff Walker'}]","['Dylan (1973)', {'Jerry Jeff Walker'}]"
    -----------------------------------------------------snip-------------------------------------------
    

    Thus, here the song "Mr. Bojangles":

    1. Is written by Jerry Jeff Walker and not Bob Dylan.

    2. Was nonetheless recorded on the album Dylan (1973).

    3. Is listed twice, since the song "Mr. Bojangles" appears twice in the final tracks CSV file.

    4. Also has a version on Genius.com that lacks the required album info, which is therefore set to "N/A".

      Songs that have no album info or songwriter info on Genius.com will have the missing value set to "N/A" (not available).
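The cells in this file look like Python literals (a list holding an album title and a set of songwriter names), so they can be read back with `ast.literal_eval` from the standard library. This sketch assumes every cell follows the form shown in the example row; `parse_credit_cell` is a hypothetical helper.

```python
import ast


def parse_credit_cell(cell: str):
    """Turn a cell such as "['N/A', {'Jerry Jeff Walker'}]" into (album, writers).

    Returns the album as None when it is "N/A", and the writers as a sorted list.
    """
    album, writers = ast.literal_eval(cell)
    if isinstance(writers, str):
        # Assumed form for missing songwriter info, e.g. the string 'N/A'.
        writers = []
    return (None if album == "N/A" else album), sorted(writers)


print(parse_credit_cell("['Dylan (1973)', {'Jerry Jeff Walker'}]"))
print(parse_credit_cell("['N/A', {'Jerry Jeff Walker'}]"))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text scraped from the web.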

  6. "artist_name"_corpus.csv

    Finally, this CSV file contains the lyrics of all the songs joined back to back and stored in a single cell.
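Storing the whole corpus in one cell works because the csv module quotes a field that contains commas or line breaks. A small sketch of the idea, assuming lyrics are joined with a single space (the separator the script really uses is not stated); `write_corpus` is a hypothetical helper.

```python
import csv
import io


def write_corpus(lyrics_list, out_file):
    """Join all lyrics back to back and write them as one single CSV cell."""
    corpus = " ".join(text.strip() for text in lyrics_list)
    csv.writer(out_file).writerow([corpus])
    return corpus


buffer = io.StringIO()
write_corpus(["first song, line one\nline two", "second song"], buffer)

# Reading it back gives one row with one cell, commas and newlines intact.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(len(rows), len(rows[0]))
```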


Acknowledgments

If I have seen further it is by standing on the shoulders of Giants. -Isaac Newton

John W. Miller for his excellent lyricsgenius wrapper.

The authors and countless contributors of the various modules used in the project.


License

Please see LICENSE.md for more details.

By using this tool you agree to use the lyrics for personal purposes only. Otherwise, please see the lyricsgenius docs and the Genius API Terms of Service. As a reminder, CorpusGenius is not responsible for your usage of the tool.