Surveys with repeated prediction quizzes

Participants specify some demographics (approximate age, gender, etc.) in a survey before they start the prediction quiz. Each survey should then contain exactly one prediction quiz, in which the participant provides their pronunciation of 19 words.

In practice this holds only partly. Every survey has at least one prediction quiz attached, with 19 pronunciation responses, but some surveys have more than one quiz attached. In every quiz, all 19 words were answered.
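
The completeness claim can be checked by collecting the set of answered words per quiz and comparing it with the full word list. The following is only a minimal sketch on made-up rows, using the standard library instead of pandas; the ids, the three-word list and the `variant-*` answers are placeholders, not values from the database.

```python
from collections import defaultdict

# Hypothetical rows: (prediction_quiz_id, question_text, answer_text).
# Three words stand in for the 19 words of the real quiz.
rows = [
    (10, "kaas", "variant-a"),
    (10, "dag", "variant-a"),
    (10, "oog", "variant-b"),
    (20, "kaas", "variant-b"),
    (20, "dag", "variant-a"),
    (20, "oog", "variant-a"),
]

# Every word that occurs anywhere in the data.
all_words = {question for _, question, _ in rows}

# Words answered within each individual quiz.
words_per_quiz = defaultdict(set)
for quiz_id, question, _answer in rows:
    words_per_quiz[quiz_id].add(question)

# A quiz is incomplete when it misses at least one word.
incomplete_quizzes = sorted(
    quiz_id for quiz_id, words in words_per_quiz.items() if words != all_words
)
print("incomplete quizzes:", incomplete_quizzes)
```

An empty list means every quiz covers the whole word list, which is the situation described above.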

In [1]:
# Enable port forwarding from local port 3307 to port 3306 on the stimmen database machine
# ssh -L 3307:127.0.0.1:3306 stimmen.housing.rug.nl

import pandas
import MySQLdb
from collections import defaultdict, Counter
from ipy_table import make_table, set_row_style
from IPython.display import display

from getpass import getpass

if 'mysql_password' not in globals():
    mysql_password = getpass()
try:
    db = MySQLdb.connect(host='127.0.0.1', port=3307, user='stimmen', passwd=mysql_password, db='stimmen', charset='utf8')
except MySQLdb.OperationalError:
    # Forget the cached password so that re-running this cell prompts for it again
    globals().pop('mysql_password')
    raise
········
In [2]:
answers = pandas.read_sql('''
SELECT
    survey.id AS survey_id, 
    prediction_quiz_id,
    user_lat, user_lng,
    question_text, answer_text
FROM       core_surveyresult as survey
INNER JOIN core_predictionquizresult as result ON survey.id = result.survey_result_id
INNER JOIN core_predictionquizresultquestionanswer as answer
    ON result.id = answer.prediction_quiz_id
WHERE
    survey.submitted_at >= '2017-09-17'
    AND result.submitted_at >= '2017-09-17'
''', db)

Words

Words for which prediction quiz participants provided pronunciations.

In [3]:
questions = answers['question_text'].unique()
print(' - ' + '\n - '.join(questions))
 - "borst" (*lichaamsdeel)
 - "gegaan"
 - "gezet"
 - "geel"
 - "bij" (*insect)
 - "avond"
 - "kaas"
 - "deurtje"
 - "koken"
 - "dag"
 - "heel"
 - "blad" (aan een boom)
 - "armen" (*lichaamsdeel)
 - "trein"
 - "oog"
 - "zaterdag"
 - "sprak (toe)"
 - "vis"
 - "tand"

Repeated quizzes

Some surveys have more than one prediction quiz attached.

In [4]:
survey_counts = answers.groupby('survey_id').count()['answer_text']

print('# surveys w/ repeated pronunciations', (survey_counts != 19).sum())
print('# surveys                           ', len(survey_counts))

repeat_survey_ids = set(survey_counts[survey_counts != 19].index)
# surveys w/ repeated pronunciations 197
# surveys                            3104
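
The cell above infers repeats from the number of answers per survey (anything other than 19). An alternative is to count the distinct `prediction_quiz_id` values per survey directly. The sketch below compares both criteria on invented (survey_id, prediction_quiz_id) pairs; `WORDS_PER_QUIZ` and all ids are assumptions chosen to keep the example small.

```python
from collections import Counter

WORDS_PER_QUIZ = 3  # 19 in the real data; 3 keeps the toy example short

# Hypothetical (survey_id, prediction_quiz_id) pairs, one pair per answer row.
answer_rows = [(1, 10)] * 3 + [(2, 20)] * 3 + [(2, 21)] * 3 + [(3, 30)] * 3

# Criterion used in the notebook: number of answers per survey.
answers_per_survey = Counter(survey_id for survey_id, _ in answer_rows)
by_answer_count = {
    survey_id for survey_id, n in answers_per_survey.items() if n != WORDS_PER_QUIZ
}

# Alternative criterion: number of distinct quizzes per survey.
quizzes_per_survey = Counter(survey_id for survey_id, _ in set(answer_rows))
by_quiz_count = {survey_id for survey_id, n in quizzes_per_survey.items() if n > 1}

print("repeats by answer count:", sorted(by_answer_count))
print("repeats by quiz count:  ", sorted(by_quiz_count))
```

The two criteria agree whenever every quiz is complete; they would differ for a survey whose only quiz has missing answers.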
In [5]:
# answers of all surveys with repeated quizzes

repeat_survey_answers = answers[answers['survey_id'].isin(repeat_survey_ids)]
In [6]:
## Sanity check (is it safe to use '|' as a token):
assert not any('|' in x for x in answers['answer_text'])

## Per survey and word: how many pronunciations were reported, and which distinct ones
question_counts = repeat_survey_answers.groupby(['survey_id', 'question_text']).agg({
    'prediction_quiz_id': len,
    'answer_text': lambda x: '|'.join(set(x))    
})
In [7]:
print('regarding surveys with repeated quizzes')

print('# reported pronunciations, counting repeats', len(repeat_survey_answers))
print('# reported pronunciations                  ', len(question_counts))
regarding surveys with repeated quizzes
# reported pronunciations, counting repeats 8398
# reported pronunciations                   3743
In [8]:
# Sanity check: within each survey with repeated quizzes, is every word repeated the same number of times?
# Conclusion: yes (which is good), since nothing was printed
for survey_id, rows in repeat_survey_answers.groupby('survey_id'):
    if len(set(
        rows.groupby('question_text').count()['survey_id']
    )) != 1:
        print('for survey', survey_id, ', different words were provided a pronunciation a different number of times')
In [9]:
print('how often different pronunciations for a word were given within one survey:', sum(
    '|' in x  # '|' was used as the pronunciation separator a few cells up, so its presence means more than one pronunciation
    for x in question_counts['answer_text']
))
how often different pronunciations for a word were given within one survey: 1203
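
The count above relies on `'|'` never occurring in an answer. A separator-free way to get the same kind of number is to keep the distinct answers per (survey, word) pair in a set and count the pairs with more than one entry. This is only a sketch on invented rows with the standard library; the survey id and the `variant-*` answers are placeholders.

```python
from collections import defaultdict

# Hypothetical (survey_id, question_text, answer_text) rows from one survey
# in which the quiz was taken twice.
rows = [
    (2, "kaas", "variant-a"),
    (2, "kaas", "variant-b"),  # pronunciation changed between quizzes
    (2, "dag", "variant-a"),
    (2, "dag", "variant-a"),   # same pronunciation both times
]

# Distinct answers per (survey, word) pair.
distinct_answers = defaultdict(set)
for survey_id, question, answer in rows:
    distinct_answers[(survey_id, question)].add(answer)

# Pairs for which more than one pronunciation was reported.
changed = sorted(key for key, answers in distinct_answers.items() if len(answers) > 1)
print(len(changed), "of", len(distinct_answers), "words got more than one pronunciation")
```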