1) Case Data

2) Patient Data

3) Time Series Data

4) Additional Data

Before the Start

Levels of administrative divisions in South Korea

Upper Level (Provincial-level divisions)

Lower Level (Municipal-level divisions)

List of cities in South Korea

List of counties of South Korea

List of districts in South Korea

Sources

1) Case

Data of COVID-19 infection cases in South Korea

  1. case_id: the ID of the infection case

    • case_id(7) = region_code(5) + case_number(2)
    • You can check the region_code in 'Region.csv'

2) PatientInfo

Epidemiological data of COVID-19 patients in South Korea

  1. patient_id: the ID of the patient

    • patient_id(10) = region_code(5) + patient_number(5)
    • You can check the region_code in 'Region.csv'
    • There are two types of patient_number:

      1) local_num: The number given by the local government.  
      2) global_num: The number given by the KCDC

3) Time

Time series data of COVID-19 status in South Korea

-       A test is a diagnosis of an infection.

4) TimeAge

Time series data of COVID-19 status in terms of the age in South Korea

-       The status in terms of the age has been presented since March 2nd.

5) TimeGender

Time series data of COVID-19 status in terms of the gender in South Korea

-       The status in terms of the gender has been presented since March 2nd.

6) TimeProvince

Time series data of COVID-19 status in terms of the Province in South Korea

-     The confirmed status by province has been provided since February 21st.
-     Values before February 21st may differ.


7) Region

Location and statistical data of the regions in South Korea

Source of the statistic: KOSTAT (Statistics Korea)

8) Weather

Data of the weather in the regions of South Korea

Source of the weather data: KMA (Korea Meteorological Administration)

9) SearchTrend

Trend data of keywords searched in NAVER, one of the largest web portals in South Korea

-       The unit is a relative value, scaled so that the highest search volume in the period equals 100.

Source of the data: NAVER DataLab

10) SeoulFloating

Data of floating population in Seoul, South Korea (from SK Telecom Big Data Hub)

Source of the data: SKT Big Data Hub

11) Policy

Data of the government policy for COVID-19 in South Korea

1) Case DataFrame:

Initial plan:

Let's check if there are any missing values.

After checking for missing values in the Case dataframe, we can see that there are no missing values, which is great.

Next, we check the correctness of the case_id format, which follows the formula: case_id(7) = region_code(5) + case_number(2)

That is, 5 characters plus 2 characters should result in 7 characters. For example, region_code = 12345 and case_number = 24 combine into the 7-character case_id 1234524.
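The format check can be sketched as follows; the inline sample is hypothetical and stands in for reading 'Case.csv':

```python
import pandas as pd

# Hypothetical sample; in the notebook this comes from pd.read_csv('Case.csv')
case = pd.DataFrame({"case_id": ["1000001", "1000002", "6100001"]})

# case_id(7) = region_code(5) + case_number(2)
case["case_id"] = case["case_id"].astype(str).str.strip()
bad_length = case[case["case_id"].str.len() != 7]  # rows violating the 7-char format

# Split the ID into its two components for later validation
region_code = case["case_id"].str[:5]
case_number = case["case_id"].str[5:]
```

If bad_length is empty, every ID respects the 5 + 2 structure.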

Clear whitespaces, and then check for duplicates.

We fixed spacing errors, checked for duplicates, continued with feature engineering, and used 'case_number' as a reference instead of the index.

Although the default index would have served the purpose, the 'case_number' column offers a more straightforward and convenient way to identify specific cases.

Now, let's move on to validating the case_id logic by utilizing the 'region.csv' and 'case.csv' tables.

The first five digits represent the region code, which is determined based on the province. The initial sequence of five unique digits corresponds to the province, followed by the case number.

After successfully identifying the unique province codes, we can proceed to decipher the case_id logic.

We write code that aims to identify and correct any mismatches between the 'case_id' column in the 'case' table and the expected format for case IDs.

It does this by creating a dictionary to map provinces to codes, adding region codes to the 'case' table, converting columns to strings, padding case numbers with zeros, creating a 'correct_id' column, identifying rows with mismatched case IDs, and checking for mismatched rows and printing results.
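The mapping-and-validation logic can be sketched like this; the two-row sample and the province codes are hypothetical stand-ins for the real 'Case.csv' and 'Region.csv' contents:

```python
import pandas as pd

# Hypothetical sample standing in for Case.csv
case = pd.DataFrame({
    "case_id": ["1000001", "6100002"],
    "province": ["Seoul", "Daegu"],
    "case_number": [1, 2],
})

# Dictionary mapping provinces to their 5-digit region codes
province_codes = {"Seoul": "10000", "Daegu": "61000"}
case["region_code"] = case["province"].map(province_codes)

# Zero-pad the case number to 2 digits and rebuild the expected ID
case["correct_id"] = case["region_code"] + case["case_number"].astype(str).str.zfill(2)

# Rows where the stored ID disagrees with the reconstructed one
mismatched = case[case["case_id"] != case["correct_id"]]
```

An empty mismatched frame means every case_id matches its province's region code.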

The output indicates that all case IDs match their respective provinces, and no outliers exist.

With that in mind, it is time to check which regions and cities are the most affected.

The top five affected regions are displayed, indicating that Daegu is the province most severely impacted by COVID-19. Let's delve into the specific cities within Daegu that are affected.

For better clarity, we should rename the '-' to 'unknown'.
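A minimal sketch of the rename, on a hypothetical city column:

```python
import pandas as pd

city = pd.Series(["Nam-gu", "-", "Dalseong-gun", "-"])

# Replace the '-' placeholder with a readable label
city = city.replace("-", "unknown")
```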

We can see that Nam-gu, a district of Daegu, is the most affected area, followed by entries with unidentified names, which we will label 'Other city' for visualization purposes.

It is evident that Nam-gu leads in case counts compared to the other, unidentified cities.

Time to identify outliers in latitude and longitude and then draw a map of cases.

Since there appear to be no outliers, we can proceed with creating the map and identifying the 'Other city' entries based on their coordinates (latitude and longitude).

Now that we can visually see which areas have higher numbers of COVID-19 cases, it would be beneficial to examine the 'infection_case' column, again looking at the top five, for a closer look at case origins.

But before that, a quick reminder of the infection_case column:

The first category, "etc.," signifies cases under investigation and can be set aside for the time being.

Among the remaining four categories making up the top five, most cases stem from overseas travel and have surged rapidly in terms of confirmed cases.

Contact with infected individuals is another substantial factor, but crowded social gathering venues like churches and clubs also contribute to the high case numbers.

In light of this, we can investigate the proportion of cases that represent group infections.

We observe that there are 124 group cases, so it is time to visualize the percentage of cases that are grouped versus not grouped.

The majority of cases are grouped ones, indicating that COVID-19 spreads readily within groups; on the other hand, such clusters are easier to trace and contain.

With that in mind, let's identify the patient profile from the patientinfo dataframe.

2) PatientInfo DataFrame:

Initial plan:

  1. What is the distribution of patients by age and sex?
  2. How many cases are related to overseas inflow?
  3. Explore the timeline of patient states (isolated, released, deceased).

We observe that the 'age' column has 1380 missing values. Since the 'age' column is categorical, we can impute the missing values with the mode, unlike numerical columns where the median would be used.

We can do the same for the 'sex' column.

For the 'infection_case' column, we can fill the missing values with 'Unknown' to improve clarity.
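The three imputations can be sketched as follows, with a hypothetical sample in place of the real PatientInfo data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real frame comes from PatientInfo.csv
patient = pd.DataFrame({
    "age": ["20s", "20s", np.nan, "50s"],
    "sex": ["female", np.nan, "male", "female"],
    "infection_case": [np.nan, "overseas inflow", np.nan, "contact with patient"],
})

# Categorical columns: impute missing values with the mode
patient["age"] = patient["age"].fillna(patient["age"].mode()[0])
patient["sex"] = patient["sex"].fillna(patient["sex"].mode()[0])

# For infection_case, an explicit 'Unknown' label is clearer than the mode
patient["infection_case"] = patient["infection_case"].fillna("Unknown")
```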

Having addressed the missing values, we can proceed to check for duplicates.

Upon checking, no duplicate rows were found. Next, we will proceed to change the data types for the date columns.

Having concluded the initial data cleaning phase, we can now proceed to address questions that are relevant to our dataframe.

First, we can check the distribution of patients by age and sex.

We can see that the dominant category is female, but in this case, it is slightly skewed as we had to assign the mode of the gender to the missing values.

Considering this, we can check how many cases are related to the overseas inflow.

We can see that 840 cases are attributed to overseas inflow, while 703 are still to be determined, which indicates that the number could fluctuate considerably. In light of this, we can visualize the data to gain a clearer understanding of the significance of the figures.

The vast majority of infection cases are attributed to contact with infected individuals. This indicates that the virus spreads readily through close contact.

The other sources of infection, such as unknown sources, overseas inflow, and Itaewon clubs, play a relatively smaller role in the overall caseload.

Now that we have identified this, we can begin to investigate the patient's state timeline (isolated, released, deceased). Let's first identify and address any rows that have missing values in any of the following columns: symptom_onset_date, confirmed_date, released_date, deceased_date.

Having observed that all of the columns contain missing values, the best approach would be to address the 'deceased_date' column specifically.

Therefore, one approach to address this issue is to introduce a new column 'deceased' where you can replace the missing 'deceased_date' values with a specific placeholder (such as 'Not Deceased') and fill the non-missing 'deceased_date' values with 'Deceased'.
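This can be sketched in one step with np.where, again on a hypothetical sample:

```python
import numpy as np
import pandas as pd

patient = pd.DataFrame({"deceased_date": [np.nan, "2020-02-19", np.nan]})

# A missing deceased_date means the patient did not die
patient["deceased"] = np.where(
    patient["deceased_date"].isna(), "Not Deceased", "Deceased"
)
```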

We can see that the death rate was high in February. This could be because healthcare systems were not yet prepared to handle the situation effectively at the early stages of the outbreak. This could have led to a higher number of deaths in the beginning.

In addition, at the beginning of an outbreak, there is often a lack of information about the disease, its symptoms, and how to treat it. This could also contribute to a higher mortality rate.

Finally, we can see that the number of released patients was high in March. The reported number of released patients could have been influenced by the reporting practices. For example, if a large number of recoveries were reported at once in March, it could lead to a spike in the number of released patients.

With that noted, it is time to move to the Time dataframe.

3) Time DataFrame:

Initial plan:

First, let's ensure the accuracy of date and time formats.
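A minimal sketch of the conversion, assuming the Time frame has a 'date' column stored as strings:

```python
import pandas as pd

# Hypothetical sample; the real frame comes from Time.csv
time_df = pd.DataFrame({"date": ["2020-01-20", "2020-01-21"], "test": [1, 1]})

# Parse the date strings into proper datetime64 values
time_df["date"] = pd.to_datetime(time_df["date"])
```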

Since the format has been changed, we can check for missing values.

Since there are no missing values, which is a positive indicator, we can proceed to a short analysis of the Time dataframe.

We will start by examining the trend of COVID-19 tests over time.

We can see that the number of tests has been increasing over time, indicating an expansion in testing capacity during this period.

Additionally, there are noticeable jumps in the line, possibly due to changes in testing policy or capacity.

Now we can move to exploring the daily changes in confirmed, released, and deceased cases.

The disease had a significant number of confirmed cases in early 2020, with a peak in February. After the peak, the number of confirmed cases started to decrease.

The number of released cases has been steadily increasing, indicating effective recovery and management of the disease.

The number of deceased cases remained relatively stable, which could suggest that the disease’s fatality rate did not significantly increase during this period.

4) TimeAge DataFrame:

Initial plan:

First, we can check for the missing values.

Upon examining the dataset, we discovered that there are no missing values in the 'deceased' column.

Lastly, we standardized the date format and extracted age group information, converting the age ranges from '10s', '20s' to their corresponding integer values.
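The age conversion can be sketched as follows ('10s' becomes 10, '20s' becomes 20, and so on):

```python
import pandas as pd

time_age = pd.DataFrame({"age": ["0s", "10s", "20s"]})

# Strip the trailing 's' and cast the remainder to an integer
time_age["age"] = time_age["age"].str.rstrip("s").astype(int)
```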

After completing the initial data cleaning, we can analyze how the age distribution of confirmed cases has changed over time.

We can see that the number of confirmed cases for all age groups has increased over time.

Additionally, the 20s age group has the highest number of confirmed cases, followed by the 30s age group.

Lastly, the 0s age group has the lowest number of confirmed cases.

Now we can move to explore the number of deceased cases by age group.

We can see that the older age groups are more affected, which is a significant observation.

Which indicates that older individuals are at a higher risk of severe illness or death from this disease.

With that noted, we can move to the TimeGender dataframe.

5) TimeGender DataFrame:

Initial plan:

We can see that there are no missing values; therefore, we can verify the correctness of the date and gender formats.

Since the size of the dataset is not that large, we will retain the original labels 'male' and 'female' for clarity and consistency. Otherwise, we would simply replace them with the abbreviations 'M' and 'F' to save memory.

In any case, we can ensure the datetime format.

The chart shows that confirmed cases for both genders have been rising steadily, with a higher number of cases recorded among females than males.

This trend may stem from a combination of social and biological factors.

Therefore, we can move on to analyzing deceased cases between these two groups.

Given the higher number of confirmed cases among females, it might seem reasonable to expect that the number of deceased cases would also be higher for them.

However, this is not the case: males have a higher death rate.

Men are more likely to have health conditions that make COVID-19 worse, such as smoking, heart disease, and diabetes.

These conditions can make the virus stronger and cause more deaths among men.

With that knowledge, we can proceed to analyze the TimeProvince dataframe

6) TimeProvince DataFrame:

Initial plan:

We can start the analysis of the time_province dataframe by examining for missing values.

We can observe that there are no missing values, which is a positive indicator. Now, we can proceed to confirm that the date format is accurate and the province entries are valid.

We verified that the date column has a correct format and confirmed that the province entries are valid.

Therefore, we are going to explore the confirmed, released, and deceased cases by province over time.

Our analysis has consistently indicated that Daegu is the hotspot for the COVID-19 pandemic.

This is supported by the fact that it has the highest number of confirmed and released cases.

Additionally, the low death rate is a positive indicator.

Now, we can move to Region dataframe, for more extensive analysis.

7) Region DataFrame:

Initial plan:

First, as in any dataframe, we check for missing values.

Great, we can see that there are no missing values in the Region dataframe. Now, we can proceed to validate the formatting of lat and lon.

First we can start with the geographical distribution.

We can see that COVID-19 is spreading across all South Korea's provinces, with some provinces having more cases than others.

Moving forward, we can examine the distribution of the elderly population ratio across different provinces.

We can see that some of the provinces have a higher proportion of elderly people than others, and this variation is clearly illustrated by the box plots.

The length of each box represents the range of elderly population ratios within each province, indicating the spread of this proportion within a particular region.

With this information, we can now move on to examining how the weather dataframe can inform us about the impact of weather conditions on COVID-19 cases.

8) Weather DataFrame:

Initial plan:

As we identified previously, the weather data contains records from 2016, which is not relevant to our analysis.

Therefore, it would be more appropriate to merge the weather and time dataframes to ensure consistency in the dates.

Additionally, we merge the dataframes once more to combine the COVID-19 case data with the weather data, grouping them based on the provinces.

Subsequently, we examine the correlation between the average temperature and confirmed cases.
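The merge-then-correlate step can be sketched as follows; the daily values are hypothetical, and the column names ('avg_temp', 'confirmed') are assumptions about the merged frame:

```python
import pandas as pd

weather = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"]),
    "avg_temp": [1.5, 3.0, -0.5],
})
cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-02-02", "2020-02-03"]),
    "confirmed": [11, 12, 15],
})

# Inner merge keeps only the dates present in both frames
merged = weather.merge(cases, on="date", how="inner")

# Pearson correlation between temperature and confirmed cases
corr = merged["avg_temp"].corr(merged["confirmed"])
```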

A correlation coefficient of approximately 0.016 is very small and indicates a very weak positive correlation between average temperature and confirmed COVID-19 cases.

This means that there is a very slight tendency for confirmed COVID-19 cases to increase as average temperature increases.

So, one possible explanation is that as temperatures rise, people may spend more time outdoors, increasing the likelihood of transmission of the virus.

Additionally, warmer weather may lead to increased humidity, which can make it more difficult for respiratory droplets to evaporate, potentially increasing the transmission of the virus.

With this, we can check what is the search trend in the SearchTrend dataframe.

9) SearchTrend DataFrame:

Initial plan:

Because there are no missing values, we can immediately remove any dates before 2020, as the dataset should only contain dates from the beginning of 2020 onward.
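The filter can be sketched as follows, with hypothetical rows standing in for SearchTrend.csv:

```python
import pandas as pd

trend = pd.DataFrame({
    "date": pd.to_datetime(["2019-12-31", "2020-01-01", "2020-03-01"]),
    "coronavirus": [0.1, 5.2, 90.0],
})

# Keep only observations from 2020 onward
trend = trend[trend["date"].dt.year >= 2020].reset_index(drop=True)
```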

We take a similar merging approach as with the previous dataframe.

After successfully merging, we can examine the search trend over time.

Looking at search trends gives us important information about what people were interested in and knew about the Coronavirus pandemic.

The data shows that people started showing a lot of interest in “Coronavirus” in January 2020. This was probably because of the first news reports about the virus at the end of 2019.

The most interest was in March 2020, when the virus was spreading around the world and people were getting more worried.

On the other hand, search trends for other sicknesses like “Cold,” “Flu,” and “Pneumonia” stayed pretty much the same from February to June 2020.

This means that people’s interest in these sicknesses didn’t really change much during the start of the pandemic.

The high search trend for “Coronavirus” shows that a lot of people were trying to find information and keep up-to-date about the virus when it first started spreading.

With this, we move to SeoulFloating dataframe.

10) SeoulFloating DataFrame:

Initial plan:

Great, we can see that there are no missing values; the date structure has been handled, and the floating population column has the correct type as well.

We can quickly explore the fluctuation in the floating population in Seoul.

The spike around the end of February could be related to regional outbreaks of COVID-19.

For example, if there was a significant outbreak in the Seoul Metropolitan Region around this time, it could have led to a temporary increase in the floating population as people moved to or from the area. The same reasoning applies to other days. Additionally, the spike may simply be an outlier, which we should check.

Having identified outliers whose status as genuine or erroneous remains uncertain, we can apply a logarithmic transformation to the floating population data and visualize the resulting distribution.
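A minimal sketch of the transformation; the column name 'fp_num' is an assumption about the SeoulFloating frame:

```python
import numpy as np
import pandas as pd

floating = pd.DataFrame({"fp_num": [100, 1000, 10000, 1000000]})

# log1p compresses extreme values while handling zeros safely
floating["fp_log"] = np.log1p(floating["fp_num"])
```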

This technique can help reduce the impact of extreme values, whether they are genuine observations or outliers caused by data errors or irregularities.

The chart shows how Seoul's floating population changed over time. We used log transformation to make the data easier to understand by reducing the impact of extreme values.

Even with this transformation, there are still some noticeable outliers, especially around the end of February.

This means these points differ a lot from the rest, even when viewed on a logarithmic scale.

We can see that the floating population fluctuates over time, and this could be due to different factors like social events, holidays, or changes in travel habits.

With this in mind, we can proceed to analyze the types and impacts of government policies over time.

11) Policy DataFrame:

Initial plan:

For the final dataframe, we first address the missing values, then ensure the correct data formats for the 'start_date' and 'end_date' columns, and finally create dummy variables for the policy type column.

We can get rid of the missing values and proceed further with the cleaning.
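The cleaning steps can be sketched as follows; the sample rows are hypothetical and the 'type' column name is an assumption about Policy.csv:

```python
import pandas as pd

policy = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "type": ["Alert", "Education", "Alert"],
    "start_date": ["2020-01-03", "2020-03-02", "2020-02-23"],
})

# Parse the date strings into datetime values
policy["start_date"] = pd.to_datetime(policy["start_date"])

# One-hot encode the policy type
policy = pd.concat([policy, pd.get_dummies(policy["type"], prefix="type")], axis=1)
```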

Now we can analyze the types and impacts of government policies over time.

These policies related to COVID-19 should aim to reduce the spread of the virus, protect public health, and support individuals and businesses affected by the pandemic.

Education policies could include measures to ensure the safety of students and teachers, such as social distancing and remote learning.

Alert policies could include measures to track and contain outbreaks of the virus.

Social policies could include measures to support vulnerable populations, such as the elderly and low-income families.

Administrative policies could include measures to ensure the continuity of government services and operations.

Transformation policies could include measures to promote innovation and adaptability in response to the pandemic.

Health policies could include measures to increase access to healthcare and medical supplies.

Therefore, we need to check gov_policy to get a better understanding.

In general, government policies related to COVID-19 should aim to reduce the spread of the virus, protect public health, and support individuals and businesses affected by the pandemic.

The policies listed in the chart could be analyzed to determine how they contribute to these goals.

For example, policies related to school openings could be evaluated based on their effectiveness in reducing the spread of the virus among students and teachers.

Policies related to infectious disease alert levels could be evaluated based on their effectiveness in tracking and containing outbreaks of the virus.

Policies related to social distancing campaigns could be evaluated based on their effectiveness in reducing the spread of the virus in public spaces.

But based on today's news, we can say that these policies helped.

Summary of COVID-19 Data Analysis Project in South Korea

The project involved the analysis of various dataframes related to COVID-19 in South Korea, focusing on different aspects of the pandemic. Here is a summarized overview of the findings:

Case DataFrame:

PatientInfo DataFrame: