Data Input and Files#
Reading and writing files is essential for almost any problem that involves data of some type (scientific, configuration, etc.). Fortunately, Python is well supported with libraries to read and write a wide variety of data formats.
Input from the keyboard#
You can get input from the keyboard with the input() function.
answer = input('What is your name?')
print(f'Hello {answer}!')
What is your name? Santa Claus
Hello Santa Claus!
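One thing to keep in mind is that input() always returns a string, even if the user types a number, so you must convert it yourself before doing arithmetic. A minimal sketch, with a fixed string standing in for the interactive answer:

```python
# input() always returns a str, so convert explicitly before arithmetic
answer = '42'  # stands in for: answer = input('How old are you? ')
age = int(answer)
print(f'Next year you will be {age + 1}.')
```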
(Structured) Text files#
Simple text files are useful and very common for small/medium-sized datasets, configuration files, etc. Text-based formats can also easily be inspected in a text editor. Such files can be raw text, or structured in some fashion like CSV, YAML or JSON. (HTML and CSS files on the web are also examples, though less useful for data-science/scientific purposes.)
You can often find text files as the standard option for downloading data from various online databases, e.g.
NSF Award database - offers options to download as CSV files
HEPData - an online open-access database of data associated with HEP publications; traditionally the data from plots and tables in CSV, JSON (and ROOT) formats, but more recently also things like likelihoods
Opening such files is very simple in Python, either with the core language (simple text files) or via standard libraries (CSV, YAML, JSON). The standard way to do this in Python is the ‘with…as’ syntax, using a so-called “context manager”. The file will be opened and all of the associated indented code will be executed. You do not need to “close” the file as you may have done in other programming contexts; the context manager will do that automatically.
Here we open a simple text file for writing with ‘w’. Alternatively we could open a file for reading with ‘r’, but the more interesting cases for that involve structured data (CSV, YAML, JSON) as described below.
with open('example.txt', 'w') as f:
    f.write('Hello World!')
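To read the contents back, open the file with ‘r’ (the default mode); f.read() returns the whole file as a single string:

```python
# Write a short file, then read its full contents back as a string
with open('example.txt', 'w') as f:
    f.write('Hello World!')

with open('example.txt', 'r') as f:
    contents = f.read()

print(contents)  # → Hello World!
```

You can also iterate over the open file object directly to process it line by line.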
CSV files#
CSV files are very common in the data science world as a structured (text) data format.
The basic structure involves “rows” and “columns”, where each row represents an entry of interest and the corresponding columns contain different attributes associated with that entry. The first row is usually a “header” with labels for the columns, and the first column sometimes contains an “index”. These files can be opened and examined row by row. For example, a CSV (with a header, but no index) for the Seven Dwarfs would be:
!cat assets/seven_dwarfs.csv
Name,Hair Color,Beard,Height (cm)
Doc,Gray,True,120
Grumpy,White,True,118
Happy,Brown,True,115
Sleepy,Blonde,True,117
Bashful,Black,True,116
Sneezy,Red,True,114
Dopey,Bald,False,113
import csv
with open('assets/seven_dwarfs.csv', mode='r') as file:
    csvFile = csv.reader(file)
    for lines in csvFile:
        print(lines)
['Name', 'Hair Color', 'Beard', 'Height (cm)']
['Doc', 'Gray', 'True', '120']
['Grumpy', 'White', 'True', '118']
['Happy', 'Brown', 'True', '115']
['Sleepy', 'Blonde', 'True', '117']
['Bashful', 'Black', 'True', '116']
['Sneezy', 'Red', 'True', '114']
['Dopey', 'Bald', 'False', '113']
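Note that csv.reader hands back every value as a string. The csv module can also return each row as a dict keyed by the header, via csv.DictReader, which is often more readable than indexing columns by position. A small self-contained sketch, using io.StringIO in place of the file on disk:

```python
import csv
import io

# A short stand-in for assets/seven_dwarfs.csv
text = """Name,Hair Color,Beard,Height (cm)
Doc,Gray,True,120
Dopey,Bald,False,113
"""

# DictReader uses the header row as the keys of each row dict
reader = csv.DictReader(io.StringIO(text))
rows = list(reader)
print(rows[0]['Name'], rows[0]['Height (cm)'])  # → Doc 120
```

Every value is still a string (e.g. '120', not 120); converting types is up to you.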
However, the most common way to open and access data from CSV files is to read them into a Pandas DataFrame using the Pandas library.
import pandas as pd
df = pd.read_csv('assets/seven_dwarfs.csv')
df
|   | Name    | Hair Color | Beard | Height (cm) |
|---|---------|------------|-------|-------------|
| 0 | Doc     | Gray       | True  | 120         |
| 1 | Grumpy  | White      | True  | 118         |
| 2 | Happy   | Brown      | True  | 115         |
| 3 | Sleepy  | Blonde     | True  | 117         |
| 4 | Bashful | Black      | True  | 116         |
| 5 | Sneezy  | Red        | True  | 114         |
| 6 | Dopey   | Bald       | False | 113         |
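Unlike csv.reader, Pandas infers a type for each column (here int for the heights and bool for Beard), and columns can be selected and filtered directly. A small self-contained sketch, again using io.StringIO in place of the file on disk:

```python
import io
import pandas as pd

# A short stand-in for assets/seven_dwarfs.csv
text = """Name,Hair Color,Beard,Height (cm)
Doc,Gray,True,120
Grumpy,White,True,118
Dopey,Bald,False,113
"""

df = pd.read_csv(io.StringIO(text))

# Pandas has inferred column types (str, bool, int)
print(df.dtypes)

# Select a column and filter rows with a boolean mask
tall = df[df['Height (cm)'] > 115]
print(tall['Name'].tolist())  # → ['Doc', 'Grumpy']
```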
YAML files#
YAML files are simple structured data files, most often used for configuration information, including web configuration (e.g. the “frontmatter” in Jekyll markdown files). YAML is simple to read and write for basic structured data, but should be avoided for complicated or large data structures.
In the following example, the file will be loaded as a Python dict with just one key (“dwarfs”). That key accesses a list of dicts, one per dwarf.
!cat assets/seven_dwarfs.yml
dwarfs:
  - name: Doc
    role: Leader
    beard: true
    height_cm: 120
    favorite_color: blue
  - name: Grumpy
    role: Complainer
    beard: true
    height_cm: 118
    favorite_color: red
  - name: Happy
    role: Cheerful
    beard: true
    height_cm: 115
    favorite_color: yellow
  - name: Sleepy
    role: Tired
    beard: true
    height_cm: 117
    favorite_color: gray
  - name: Bashful
    role: Shy
    beard: true
    height_cm: 116
    favorite_color: pink
  - name: Sneezy
    role: Allergic
    beard: true
    height_cm: 114
    favorite_color: green
  - name: Dopey
    role: Silent
    beard: false
    height_cm: 113
    favorite_color: purple
import yaml
# Load in the data
with open("assets/seven_dwarfs.yml", "r") as file:
    data = yaml.safe_load(file)

# Print something about each dwarf
for dwarf in data["dwarfs"]:
    print(f"{dwarf['name']} is the {dwarf['role']} one.")
Doc is the Leader one.
Grumpy is the Complainer one.
Happy is the Cheerful one.
Sleepy is the Tired one.
Bashful is the Shy one.
Sneezy is the Allergic one.
Dopey is the Silent one.
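Writing YAML is just as easy. A minimal sketch using the same PyYAML package as above, round-tripping a small structure through a string with yaml.safe_dump and yaml.safe_load (yaml.safe_dump can also write directly to an open file object):

```python
import yaml

# A small structure in the same shape as the dwarfs file
data = {'dwarfs': [{'name': 'Doc', 'beard': True, 'height_cm': 120}]}

# Serialize to YAML text, then parse it back
text = yaml.safe_dump(data)
print(text)

loaded = yaml.safe_load(text)
print(loaded == data)  # → True
```

The safe_* variants are preferred over yaml.load/yaml.dump because they only handle plain data types and cannot execute arbitrary Python objects.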
JSON files#
JSON (JavaScript Object Notation) files grew out of the need to exchange large amounts of data in a web (server, client/browser) environment. JSON files can handle jagged and/or varying length data.
!cat assets/seven_dwarfs.json
{
  "dwarfs": [
    {
      "name": "Doc",
      "role": "Leader",
      "beard": true,
      "height_cm": 120,
      "favorite_color": "blue"
    },
    {
      "name": "Grumpy",
      "role": "Complainer",
      "beard": true,
      "height_cm": 118,
      "favorite_color": "red"
    },
    {
      "name": "Happy",
      "role": "Cheerful",
      "beard": true,
      "height_cm": 115,
      "favorite_color": "yellow"
    },
    {
      "name": "Sleepy",
      "role": "Tired",
      "beard": true,
      "height_cm": 117,
      "favorite_color": "gray"
    },
    {
      "name": "Bashful",
      "role": "Shy",
      "beard": true,
      "height_cm": 116,
      "favorite_color": "pink"
    },
    {
      "name": "Sneezy",
      "role": "Allergic",
      "beard": true,
      "height_cm": 114,
      "favorite_color": "green"
    },
    {
      "name": "Dopey",
      "role": "Silent",
      "beard": false,
      "height_cm": 113,
      "favorite_color": "purple"
    }
  ]
}
import json
# Load in the data
with open('assets/seven_dwarfs.json', 'r') as f:
    data = json.load(f)

# Print whether each dwarf has a beard
for dwarf in data["dwarfs"]:
    if dwarf['beard']:
        print(f'{dwarf["name"]} has a beard')
    else:
        print(f'{dwarf["name"]} does not have a beard')
Doc has a beard
Grumpy has a beard
Happy has a beard
Sleepy has a beard
Bashful has a beard
Sneezy has a beard
Dopey does not have a beard
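Writing JSON mirrors reading it: json.dump writes to a file, while json.dumps produces a string. A minimal round-trip sketch:

```python
import json

# A small structure in the same shape as the dwarfs file
data = {'dwarfs': [{'name': 'Dopey', 'beard': False, 'height_cm': 113}]}

# Serialize to a JSON string (indent makes it human-readable),
# then parse it back
text = json.dumps(data, indent=2)
print(text)

loaded = json.loads(text)
print(loaded == data)  # → True
```

Note that JSON keys are always strings, so a dict with non-string keys will not survive a round trip unchanged.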
Binary files#
In data-intensive fields where data volumes can get large and I/O speed becomes important, text files are limiting, so other (non-text) “binary” formats are often used, sometimes together with various (customizable) data compression algorithms. Some examples include:
ROOT files - This is a very common format (from the ROOT data analysis framework) used widely in particle, nuclear and astroparticle physics. They can be opened and manipulated in Python using the uproot library. Uproot is often used with Awkward Array to read and access jagged arrays. We will be discussing uproot, ROOT files and Awkward Array in the other HEP-specific parts of this workshop.
Parquet files - as the larger data science community outgrew CSV files, they also needed a binary space-efficient data storage format. Apache Parquet has been developed to fill that niche and is well integrated with (for example) Pandas.
HDF5 files - Hierarchical Data Format (HDF) files were originally developed within the high performance computing community (at NCSA); the latest version, HDF5, has been adopted by a number of other scientific domains and has also garnered some interest within the high energy physics community. The h5py package provides a pythonic interface to HDF5 files.