[Python] First steps in data analysis
today I’m going to talk about my very first steps in data analyzing, I am trying to get a deeper understand of a topic (Magic The Gathering Legacy meta). I had a csv file with some information that weren’t exhaustive, so I did some data manipulation, data gathering, and lastly some data analysis.
Here you can find the environment
- Windows 11 Home
- PyCharm Community Edition 2021.3.2
First of all, we need to install some fine python packages, in order to load csv and get some data from the web.
pip install pandas pip install requests pip install json
The first one is a strong pacakge for data analysis and manipulation, it works with a structure called dataframe, and you can easily slice and dice data, add or drop column, search multiple rows, as if it were a sql table.
The second one let you to contact the web in order to get information, in this case we’re using it to collect additional information from an external API.
Lastly, we have the package that let you manipulate and use json in python.
Now, we need to load the csv and get only the column that we need. Below you can find the header of csv and some rows. The columns that we need are: Qty, Card, Sideboard, Player_username, then we need to sum the Qty by Card, Sideboard and Player_username
def load_of_csv(file_name): col_list = ['Card', 'Sideboard', 'Player_Username', 'Qty'] df = pd.read_csv(file_name, usecols=col_list) card_played = df.groupby(['Card', 'Sideboard', 'Player_Username']).sum() number_of_decks = len(pd.unique(df['Player_Username'])) return card_played, number_of_decks
Add a column, retrieving information
Now, in the previous dataframe, we miss a really important information for Magic player, which type of card is. So we’re going to ask an external API (https://api.scryfall.com/cards/named?exact=) for information about the card giving the exact name, then extract card type information and add it to the dataframe.
def number_and_type(card_played): most_played_cards = card_played.groupby(['Card']).sum() index_card = most_played_cards.index.tolist() base_url = 'https://api.scryfall.com/cards/named?exact=' card_type = [None] * len(index_card) ct = 0 for card in index_card: card_wo_space = card.replace(" ", "") url = base_url + card_wo_space r = requests.get(url) data = json.loads(r.text) card_type[ct] = data['type_line'] ct = ct + 1 most_played_cards['CardType'] = card_type return most_played_cards
Get more info, querying data
Now, for each card I want to know how many decks are running it, then filter for card types and get the top 10 most played card by type:
def elaborate_data(card_played,most_played_cards): played_card_by_player = card_played.groupby(['Card', 'Player_Username']).sum() top_ten = most_played_cards.sort_values(by=['Qty'], ascending=False)[['Qty', 'CardType']] i = 0 number_in_decks_played = [None] * len(top_ten.index.tolist()) for item in top_ten.index.tolist(): decks = list(zip(*played_card_by_player.index.tolist())).count(item) number_in_decks_played[i] = decks i = i+1 top_ten['NumberOfDecks'] = number_in_decks_played top_played_lands = top_ten[top_ten['CardType'].str.contains("Land")].iloc[0:10] print(top_played_lands) top_played_creatures = top_ten[top_ten['CardType'].str.contains("Creature")].iloc[0:10] print(top_played_creatures) top_played_sorc = top_ten[top_ten['CardType'].str.contains("Instant|Sorcery")].iloc[0:10] print(top_played_sorc) top_played_pw = top_ten[top_ten['CardType'].str.contains("Planeswalker")].iloc[0:10] print(top_played_pw)
As you can see, there are some main steps in that function:
- with the first 2 commands, we’re getting information about how many cards each player is running and sorting a dataframe
- in the while loop, we’re creating additional column and we’re deriving information about how many decks are running that card
- in the last block, we’re querying dataframe and getting the information that we wanted
Putting all together
Now we have to create the main function in order to load the previously said csv:
def main(): card_played, number_of_decks = load_of_csv('ManaTraders_SeriesLegacy_February_2022.csv') most_played_cards = number_and_type(card_played) elaborate_data(card_played, most_played_cards) if __name__ == "__main__": main()
That’s it, we’ve loaded a csv, added some information to the dataframe from the web and queried it.
With less than 100 rows of code you did really powerful things!