site

files for beauhilton.com
git clone https://git.beauhilton.com/site.git
Log | Files | Refs

2099-06-11-lightgbm-multiclass.md (9403B)


      1 ---
      2 layout: post
      3 title: "Python, Write My Paper"
      4 toc: false
      5 image: /images/unsplash-grey-flowerbuds.jpg
      6 categories:
      7   - AI for MDs
      8 tags:
      9   - coding
     10   - python
     11   - fstring
     12   - laziness
     13 ---
     14 
     15 ![laziness is prioritization]({{ site.url }}/images/pixabay-lazy-1.png){:class="img-responsive"}
     16 
     17 Computers are good at doing tedious things.[^1]
     18 
     19 Many of the early advances in computing were accomplished to help people do tedious things they didn't want to do, like the million tiny equations that make up a calculus problem.[^2] It has also been said, and repeated, and I agree, that one of the three virtues of a good programmer is laziness.[^3]
     20 
     21 One of the most tedious parts of my job is writing paragraphs containing the results of lots of math relating to some biomedical research project. To make this way easier, I use a core Python utility called the `f-string`, in addition to some other tools I may write about at a later date.[^4]
     22 
     23 ## The problem
     24 
     25 First, here's an example of the kinds of sentences that are tedious to type out, error prone, and have to be fixed every time something changes on the back end (--> more tedium, more room for errors).
     26 
     27 "In the study period there were 1,485,880 hospitalizations for 708,089 unique patients, 439,696 (62%) of whom had only one hospitalization recorded.
     28 The median number of hospitalizations per patient was 1 (range 1-176, [1.0 , 2.0])."
     29 
     30 The first paragraph of a results section of a typical medical paper is chock-full of this stuff. If we find an error in how we calculated any of this, or find that there was a mistake in the database that needs fixing (and this happens woefully often), all of the numbers need replaced. It's a pain.
     31 How might we automate the writing of this paragraph?[^5]
     32 
     33 ## The solution
     34 
     35 First, we're going to do the math (which we were doing anyway), and assign each math-y bit a unique name. Then we're going to plug in the results of these calculations to our sentences.
     36 If you're not familiar with Python or Pandas, don't worry - just walk through the names and glance at the stuff after the equals sign, but don't get hung up on it.
     37 The basic syntax is:
     38 
     39 ```python
     40 some_descriptive_name = some_dataset["some_column_in_that_dataset"].some_mathy_bit()
     41 ```
     42 
     43 After we generate the numbers we want, we write the sentence, insert the code, and then use some tricks to get the numbers in the format we want.
     44 
     45 In most programming languages, "string" means "text, not code or numbers." So an `f-string` is a `formatted-string`, and allows us to insert code into blocks of normal words using an easy, intuitive syntax.
     46 
     47 Here's an example:
     48 
     49 ```python
     50 name_of_word_block = f"""Some words with some {code} we want Python to evaluate,
     51 maybe with some extra formatting thrown in for fun,
     52 such as commas to make long numbers more readable ({long_number:,}),
     53 or a number of decimal places to round to
     54 ({number_with_stuff_after_the_decimal_but_we_only_want_two_places:.2f},
     55 or a conversion from a decimal to a percentage and get rid of everything after the '.'
     56 {some_number_divided_by/some_other_number*100:.0f}%)."""
     57 ```
     58 
     59 First, declare the name of the block of words. Then write an `f`, which will tell Python we want it to insert the results of some code into the following string, which we start and end with single or triple quotes (triple quotes let you break strings into multiple lines).
     60 Add in the code within curly brackets, `{code}`, add some optional formatting after a colon, `{code:formatting_options}`, and prosper.
     61 
     62 As you can see from the last clause, you can do additional math or any operation you want within the `{code}` block. I typically like to do the math outside of the strings to keep them cleaner looking, but for simple stuff it can be nice to just throw the math in the f-string itself.
     63 
     64 Here's the actual code I used to make those first two sentences from earlier. First the example again, then the math, then the f-strings.[^7]
     65 
     66 "In the study period there were 1,485,880 hospitalizations for 708,089 unique patients, 439,696 (62%) of whom had only one hospitalization recorded.
     67 The median number of hospitalizations per patient was 1 (range 1-176, [1.0 , 2.0])."
     68 
     69 ```python
     70 n_encs = data["encounterid"].nunique()
     71 n_pts = data["patientid"].nunique()
     72 
     73 pts_one_encounter = df[df["encounternum"] == 1].nunique()
     74 min_enc_per_pt = df["encounternum"].min()
     75 q1_enc_per_pt = df["encounternum"].quantile(0.25)
     76 median_enc_per_pt = df["encounternum"].median()
     77 q3_enc_per_pt = df["encounternum"].quantile(0.75)
     78 max_enc_per_pt = df["encounternum"].max()
     79 
     80 sentence01 = f"In the study period there were {n_encs:,} hospitalizations for {n_pts:,} unique patients, {pts_one_encounter:,} ({pts_one_encounter/n_pts*100:.0f}%) of whom had only one hospitalization recorded. "
     81 sentence02 = f"The median number of hospitalizations per patient was {median_enc_per_pt:.0f} (range {min_enc_per_pt:.0f}-{max_enc_per_pt:.0f}, [{q1_enc_per_pt} , {q3_enc_per_pt}]). "
     82 ```
     83 
     84 If you want to get real ~~lazy~~ ~~fancy~~ lazy, you can combine these sentences into a paragraph, save that paragraph to a text file, and then automatically include this text file in your final document.
     85 
     86 ```python
     87 paragraph01 = sentence01 + sentence02
     88 results_text_file = "results_paragraphs.txt"
     89 with open(results_text_file, "w") as text_file:
     90     print(paragraph01, file=text_file)
     91 ```
     92 
     93 To automatically include the text file in your document, you'll have to figure out some tool appropriate to your writing environment. I think there's a way to source text files in Microsoft Word, though I'm less familiar with Word than other document preparation tools such as LaTeX. If you know how to do it in Word, let me know (or I'll look into it and update this post).
     94 
     95 Here's how to do it in LaTeX. Just put `\input` and the path to your text file at the appropriate place in your document:[^6]
     96 
     97 ```latex
     98 \input{"results_paragraphs_latex.txt"}
     99 ```
    100 
    101 With this workflow, I can run the entire analysis, have all the mathy bits translated into paragraphs that include the right numbers, and have those paragraphs inserted into my text in the right spots.
    102 
    103 I should note that there are other ways to do this. There are ways of weaving actual Python and R code into LaTeX documents, and RMarkdown is a cool way of using the simple syntax of Markdown with input from R. I like the modular approach outlined here, as it lets me just tag on a bit to the end of the Python code I was writing anyway, and integrate it into the LaTeX I was writing anyway. I plan to use this approach for the foreseeable future, but if you have strong arguments for why I should switch to another method, I would love to hear it, especially if it might better suit my laziness.
    104 
    105 Addendum: As I was writing this, I found a similar treatment of the same subject. It's great, with examples in R and Python. [Check it out](https://jabranham.com/blog/2018/05/reporting-statistics-in-latex/).
    106 
    107 [^1]: _Automate the Boring Stuff_ by Al Sweigart is a great introduction to programming in general, and is available for free as a [hypertext book](https://automatetheboringstuff.com/). It teaches exactly what its name denotes, in an interactive and easy-to-understand combination of code and explanation.
    108 
    109 [^2]: I'm revisiting [Walter Isaacson's _The Innovators"](https://en.wikipedia.org/wiki/The_Innovators_(book)), which I first listened to before I got deeply into programming, and on this go-through I am vibing much harder with the repeated (and repeated) (and again repeated) impetus for building the first and subsequent computing machines: tedious things are tedious.
    110 
    111 [^3]: The other two are impatience and hubris. Here is one of the [most lovely websites on the internet](http://threevirtues.com/)
    112 
    113 [^4]: For example, TableOne, which makes the (_incredibly_ tedious) task of making that classic first table in any biomedical research paper _so much easier_. Here's a link to [TableOne's project page](https://github.com/tompollard/tableone), which also includes links out to examples and their academic paper on the software.
    114 
    115 [^5]: Assign it to a resident, of course.
    116 
    117 [^6]: You may have noticed that the name of this file is "results_paragraphs_latex.txt" rather than "results_paragraphs.txt," and that's because LaTeX needs a little special treatment if you're going to use the percentage symbol. LaTeX uses the percentage symbol as a comment sign, meaning that anything after the symbol is ignored and left out of the document. You have to "escape" the percentage symbol with a slash, like this: `\%`. I have this simple bit of code that converts the normal text file into a LaTeX-friendly version:
    118 
    119     ```python
    120     # make a LaTeX-friendly version (escape the % symbols with \)
    121     # Read in the file
    122     with open(results_text_file, "r") as file:
    123         filedata = file.read()
    124     # Replace the target string
    125     filedata = filedata.replace("%", "\%")
    126     # Write the file
    127     text_file_latex = "results_paragraphs_latex.txt"
    128     with open(text_file_latex, "w") as file:
    129         file.write(filedata)
    130     ```
    131 
    132 [^7]: You may have noticed there are two datasets I'm pulling from for this, "data," which includes everything on the basis of _hospitalizations_, and "df," short for "dataframe," which is a subset of "data" that only includes each _patient_ once (rather than a new entry for every hospitalization), along with a few other alterations that allow me to do patient-wise calculations.