Categories

[Python] First steps in data analysis

You are here:
  • Main
  • Linux
  • [Python] First steps in data analysis
< All Topics

Hello everybody,

today I’m going to talk about my very first steps in data analyzing, I am trying to get a deeper understand of a topic (Magic The Gathering Legacy meta). I had a csv file with some information that weren’t exhaustive, so I did some data manipulation, data gathering, and lastly some data analysis.

Here you can find the environment

  • Windows 11 Home
  • PyCharm Community Edition 2021.3.2

Installing dependencies

First of all, we need to install some fine python packages, in order to load csv and get some data from the web.

pip install pandas
pip install requests
pip install json

The first one is a strong pacakge for data analysis and manipulation, it works with a structure called dataframe, and you can easily slice and dice data, add or drop column, search multiple rows, as if it were a sql table.

The second one let you to contact the web in order to get information, in this case we’re using it to collect additional information from an external API.

Lastly, we have the package that let you manipulate and use json in python.

Load CSV

Now, we need to load the csv and get only the column that we need. Below you can find the header of csv and some rows. The columns that we need are: Qty, Card, Sideboard, Player_username, then we need to sum the Qty by Card, Sideboard and Player_username

def load_of_csv(file_name):
    col_list = ['Card', 'Sideboard', 'Player_Username', 'Qty']
    df = pd.read_csv(file_name, usecols=col_list)
    card_played = df.groupby(['Card', 'Sideboard', 'Player_Username']).sum()
    number_of_decks = len(pd.unique(df['Player_Username']))
    return card_played, number_of_decks

Add a column, retrieving information

Now, in the previous dataframe, we miss a really important information for Magic player, which type of card is. So we’re going to ask an external API (https://api.scryfall.com/cards/named?exact=) for information about the card giving the exact name, then extract card type information and add it to the dataframe.

def number_and_type(card_played):
    most_played_cards = card_played.groupby(['Card']).sum()
    index_card = most_played_cards.index.tolist()
    base_url = 'https://api.scryfall.com/cards/named?exact='
    card_type = [None] * len(index_card)
    ct = 0
    for card in index_card:
        card_wo_space = card.replace(" ", "")
        url = base_url + card_wo_space
        r = requests.get(url)
        data = json.loads(r.text)
        card_type[ct] = data['type_line']
        ct = ct + 1
    most_played_cards['CardType'] = card_type
    return most_played_cards

Get more info, querying data

Now, for each card I want to know how many decks are running it, then filter for card types and get the top 10 most played card by type:

def elaborate_data(card_played,most_played_cards):
    played_card_by_player = card_played.groupby(['Card', 'Player_Username']).sum()
    top_ten = most_played_cards.sort_values(by=['Qty'], ascending=False)[['Qty', 'CardType']]
    i = 0
    number_in_decks_played = [None] * len(top_ten.index.tolist())
    for item in top_ten.index.tolist():
        decks = list(zip(*played_card_by_player.index.tolist()))[0].count(item)
        number_in_decks_played[i] = decks
        i = i+1
    top_ten['NumberOfDecks'] = number_in_decks_played
    top_played_lands = top_ten[top_ten['CardType'].str.contains("Land")].iloc[0:10]
    print(top_played_lands)
    top_played_creatures = top_ten[top_ten['CardType'].str.contains("Creature")].iloc[0:10]
    print(top_played_creatures)
    top_played_sorc = top_ten[top_ten['CardType'].str.contains("Instant|Sorcery")].iloc[0:10]
    print(top_played_sorc)
    top_played_pw = top_ten[top_ten['CardType'].str.contains("Planeswalker")].iloc[0:10]
    print(top_played_pw)

As you can see, there are some main steps in that function:

  • with the first 2 commands, we’re getting information about how many cards each player is running and sorting a dataframe
  • in the while loop, we’re creating additional column and we’re deriving information about how many decks are running that card
  • in the last block, we’re querying dataframe and getting the information that we wanted

Putting all together

Now we have to create the main function in order to load the previously said csv:

def main():
    card_played, number_of_decks = load_of_csv('ManaTraders_SeriesLegacy_February_2022.csv')
    most_played_cards = number_and_type(card_played)
    elaborate_data(card_played, most_played_cards)

if __name__ == "__main__":
    main()

Final Thoughts

That’s it, we’ve loaded a csv, added some information to the dataframe from the web and queried it.

With less than 100 rows of code you did really powerful things!

Regards

Vito

Table of Contents