Surveys with repeated prediction quizzes
Participants specify some demographics (approximate age, gender, etc.) in a survey before they start the prediction quiz. Each survey should then contain one prediction quiz, in which the participant provides their pronunciation of 19 words.
In conclusion, this is only partly true: every survey has at least one prediction quiz attached, with 19 pronunciation responses, but some surveys have more than one quiz attached. In all of the quizzes, all 19 words were answered.
In [1]:
# Enable port forwarding from 3307 locally to 3306 on the stimmen database machine
# ssh -L 3307:127.0.0.1:3306 stimmen.housing.rug.nl
import pandas
import MySQLdb
from collections import defaultdict, Counter
from ipy_table import make_table, set_row_style
from IPython.display import display
from getpass import getpass

# Ask for the password only once per session
if 'mysql_password' not in globals():
    mysql_password = getpass()
try:
    db = MySQLdb.connect(host='127.0.0.1', port=3307, user='stimmen',
                         passwd=mysql_password, db='stimmen', charset='utf8')
except MySQLdb.OperationalError as e:
    # Forget the password so it is asked again on the next run
    globals().pop('mysql_password')
    raise
········
In [2]:
answers = pandas.read_sql('''
SELECT survey.id AS survey_id, prediction_quiz_id, user_lat, user_lng, question_text, answer_text
  FROM core_surveyresult as survey
 INNER JOIN core_predictionquizresult as result ON survey.id = result.survey_result_id
 INNER JOIN core_predictionquizresultquestionanswer as answer ON result.id = answer.prediction_quiz_id
 WHERE survey.submitted_at >= '2017-09-17' AND result.submitted_at >= '2017-09-17'
''', db)
Words
Words for which prediction quiz participants provided pronunciations.
In [3]:
questions = answers['question_text'].unique()
print(' - ' + '\n - '.join(questions))
 - "borst" (*lichaamsdeel)
 - "gegaan"
 - "gezet"
 - "geel"
 - "bij" (*insect)
 - "avond"
 - "kaas"
 - "deurtje"
 - "koken"
 - "dag"
 - "heel"
 - "blad" (aan een boom)
 - "armen" (*lichaamsdeel)
 - "trein"
 - "oog"
 - "zaterdag"
 - "sprak (toe)"
 - "vis"
 - "tand"
Repeated quizzes
Some surveys have more than one prediction quiz attached.
In [4]:
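# Number of answers per survey; exactly 19 means a single, complete quiz
survey_counts = answers.groupby('survey_id').count()['answer_text']
# '\\w' keeps the printed text as '\w' without relying on an invalid escape sequence
print('# surveys \\w repeated pronunciations', (survey_counts != 19).sum())
print('# surveys ', len(survey_counts))
repeat_survey_ids = set(survey_counts[survey_counts != 19].index)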
# surveys \w repeated pronunciations 197
# surveys 3104
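The cell above detects repeats indirectly, through an answer count that differs from 19. A more direct check would count the distinct quiz ids per survey. The following is a minimal sketch on a small made-up frame: only the column names `survey_id` and `prediction_quiz_id` come from the query above, while the data and the names `toy_answers`, `quizzes_per_survey` and `toy_repeat_survey_ids` are hypothetical.

```python
import pandas

# Toy data (made up for illustration): survey 1 has one quiz, survey 2 has two.
toy_answers = pandas.DataFrame({
    'survey_id':          [1, 1, 2, 2, 2, 2],
    'prediction_quiz_id': [10, 10, 20, 20, 21, 21],
    'question_text':      ['"oog"', '"vis"', '"oog"', '"vis"', '"oog"', '"vis"'],
    'answer_text':        ['a', 'b', 'c', 'd', 'c', 'e'],
})

# Number of distinct quizzes attached to each survey
quizzes_per_survey = toy_answers.groupby('survey_id')['prediction_quiz_id'].nunique()

# Surveys with more than one quiz attached
toy_repeat_survey_ids = set(quizzes_per_survey[quizzes_per_survey > 1].index)
print(toy_repeat_survey_ids)
```

Unlike the `!= 19` test, this variant would not flag a survey whose single quiz is incomplete, so the two checks answer slightly different questions.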
In [5]:
# answers of all surveys with repeated quizzes
repeat_survey_answers = answers[[
    survey_id in repeat_survey_ids
    for survey_id in answers['survey_id']
]]
In [6]:
## Sanity check (is it safe to use '|' as a token):
assert not any('|' in x for x in answers['answer_text'])

## How often one word has a reported pronunciation within one survey
question_counts = repeat_survey_answers.groupby(['survey_id', 'question_text']).agg({
    'prediction_quiz_id': len,
    'answer_text': lambda x: '|'.join(set(x))
})
In [7]:
print('regarding surveys with repeated quizzes')
print('# reported pronunciations, counting repeats', len(repeat_survey_answers))
print('# reported pronunciations ', len(question_counts))
regarding surveys with repeated quizzes
# reported pronunciations, counting repeats 8398
# reported pronunciations 3743
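The second number is what one would expect if every repeated survey covers all words: 197 surveys times 19 words gives 3743 (survey, word) pairs. How the grouping collapses repeated rows into one row per pair can be sketched on made-up data. The frame `toy` and its values are hypothetical, and `sorted` is added here only to make the joined string deterministic; the notebook cell itself joins an unordered `set`.

```python
import pandas

# Toy data (made up): one survey with two quizzes, each covering two words.
toy = pandas.DataFrame({
    'survey_id':          [2, 2, 2, 2],
    'prediction_quiz_id': [20, 21, 20, 21],
    'question_text':      ['"oog"', '"oog"', '"vis"', '"vis"'],
    'answer_text':        ['c', 'c', 'd', 'e'],
})

# One row per (survey, word): how often the word was answered,
# and the distinct answers joined with '|'
toy_counts = toy.groupby(['survey_id', 'question_text']).agg({
    'prediction_quiz_id': len,
    'answer_text': lambda x: '|'.join(sorted(set(x))),
})

print(len(toy), 'rows collapse to', len(toy_counts), '(survey, word) pairs')
print(toy_counts)
```

In this toy frame, "oog" received the same answer twice and therefore keeps a single value, while "vis" received two different answers and ends up with a `|` in its joined string.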
In [8]:
# Sanity check: within each survey with repeated quizzes, is every word repeated equally often?
# Conclusion: they are (which is good), since nothing was printed
for survey_id, rows in repeat_survey_answers.groupby('survey_id'):
    if len(set(
        rows.groupby('question_text').count()['survey_id']
    )) != 1:
        print('for survey', survey_id, ', different words were provided a pronunciation a different number of times')
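The same sanity check could also be written without an explicit loop. This is a sketch on made-up data, not the notebook's own method: the frame `toy` and the names `per_word`, `distinct_counts` and `uneven` are hypothetical.

```python
import pandas

# Toy data (made up): in survey 2 every word is answered twice,
# in survey 3 "oog" is answered twice but "vis" only once.
toy = pandas.DataFrame({
    'survey_id':     [2, 2, 2, 2, 3, 3, 3],
    'question_text': ['"oog"', '"oog"', '"vis"', '"vis"', '"oog"', '"oog"', '"vis"'],
    'answer_text':   ['c', 'c', 'd', 'e', 'f', 'f', 'g'],
})

# How often each word was answered within each survey
per_word = toy.groupby(['survey_id', 'question_text']).size()

# Number of distinct repeat counts per survey; 1 means all words are repeated equally often
distinct_counts = per_word.groupby(level='survey_id').nunique()

# Surveys in which words were answered a different number of times
uneven = list(distinct_counts[distinct_counts != 1].index)
print(uneven)
```

Like the loop above, this only compares words that appear at least once in a survey; a word that is missing entirely would go unnoticed by either version.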
In [9]:
print('how often different pronunciations for a word were given within one survey:', sum(
    '|' in x  # '|' was used as the pronunciation separator a few cells up, so its presence means more than one pronunciation
    for x in question_counts['answer_text']
))
how often different pronunciations for a word were given within one survey: 1203
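Set against the 3743 (survey, word) pairs counted earlier, 1203 is roughly 32%. The same count could be obtained without the separator trick by counting distinct answers per pair directly. A minimal sketch on made-up data follows; the frame `toy` and the names `distinct_answers` and `n_conflicting` are hypothetical.

```python
import pandas

# Toy data (made up): two surveys, each answering some words twice.
toy = pandas.DataFrame({
    'survey_id':     [2, 2, 2, 2, 3, 3],
    'question_text': ['"oog"', '"oog"', '"vis"', '"vis"', '"oog"', '"oog"'],
    'answer_text':   ['c', 'c', 'd', 'e', 'f', 'g'],
})

# Number of distinct pronunciations per (survey, word) pair
distinct_answers = toy.groupby(['survey_id', 'question_text'])['answer_text'].nunique()

# Pairs for which more than one pronunciation was reported
n_conflicting = int((distinct_answers > 1).sum())
print(n_conflicting)
```

This avoids having to verify that the separator character never occurs in an answer.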