
Showing posts from March 5, 2019

Royal Shakespeare Company

The Royal Shakespeare Company (RSC) is a major British theatre company, based in Stratford-upon-Avon, Warwickshire, England. The company employs over 1,000 staff and produces around 20 productions a year. The RSC plays regularly in London, Newcastle upon Tyne and on tour across the UK and internationally. The company's home is in Stratford-upon-Avon, where it has recently redeveloped its Royal Shakespeare and Swan theatres as part of a £112.8-million "Transformation" project. The theatres re-opened in November 2010, having closed in 2007. The new buildings attracted 18,000 visitors within the first week and received a positive media response both upon opening and following the first full Shakespeare performances. Performances in Stratford-upon-Avon continued throughout the Transformation project at the temporary Courtyard Theatre. As well as the plays of Shakespeare and his contemporaries, the RSC produces new wo

Partial De-duplication in R based on string value match

I have a dataframe named 'reviews' like this:

       score_phrase title                                                score release_year release_month release_day
    1  Amazing      LittleBigPlanet PS Vita                              9     2012         9             12
    2  Amazing      LittleBigPlanet PS Vita -- Marvel Super Hero Edition 9     2012         9             12
    3  Great        Splice: Tree of Life                                 8.5   2012         9             12
    4  Great        NHL 13                                               8.5   2012         9             11
    5  Great        NHL 13                                               8.5   2012         9             11
    6  Good         Total War Battles: Shogun                            7     2012         9             11
    7  Awful        Double Dragon: Neon                                  3     2012         9             11
    8  Amazing      Guild Wars 2                                         9     2012         9             11
    9  Awful        Double Dragon: Neon                                  3     2012         9             11
    10 Good         Total War Battles: Shogun                            7     2012         9             11

Objective: Slight mismatches or typos in column values cause duplicate records. Here Row 1 and Row 2 are duplicates, and Row 2 should be dropped after de-duplication.

I used the dedup() function of the 'scrubr' package to perform de-duplication, but on a large dataset I get an incorrect number of duplicates when I change the tolerance level for string matching. For example:

    partial_dup_data <- reviews[1:100,] %>% dedup(tolerance = 0.7
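
For comparison, here is a minimal base R sketch of one way to approach the objective above without dedup(). It is not how 'scrubr' works internally. The helper name dedup_titles, its max_rel_dist threshold of 0.2, and the rule that a later title which merely extends an earlier one counts as a duplicate are all assumptions made for illustration. The sample data is a subset of the rows shown in the question, and the sketch assumes no missing titles.

```r
# Sample rows taken from the question (rows 1 to 5 only).
reviews <- data.frame(
  score_phrase  = c("Amazing", "Amazing", "Great", "Great", "Great"),
  title         = c("LittleBigPlanet PS Vita",
                    "LittleBigPlanet PS Vita -- Marvel Super Hero Edition",
                    "Splice: Tree of Life", "NHL 13", "NHL 13"),
  score         = c(9, 9, 8.5, 8.5, 8.5),
  release_year  = 2012,
  release_month = 9,
  release_day   = c(12, 12, 12, 11, 11),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of any package): keeps the first row of each
# group of near-identical titles and drops the later ones.
dedup_titles <- function(df, col = "title", max_rel_dist = 0.2) {
  titles <- tolower(trimws(df[[col]]))
  keep <- rep(TRUE, nrow(df))
  for (i in seq_len(nrow(df))) {
    if (!keep[i]) next
    later <- which(keep & seq_len(nrow(df)) > i)
    if (length(later) == 0) next
    # Levenshtein distance scaled by the longer title's length (0 = identical).
    d <- as.vector(adist(titles[i], titles[later])) /
      pmax(nchar(titles[i]), nchar(titles[later]))
    # Assumption: a later title that only extends an earlier one is a duplicate.
    is_ext <- startsWith(titles[later], titles[i])
    keep[later[d <= max_rel_dist | is_ext]] <- FALSE
  }
  df[keep, , drop = FALSE]
}

deduped <- dedup_titles(reviews)
print(deduped$title)
```

Note that this compares titles only. On real data it would probably be safer to also require the other columns (score and release date) to match before dropping a row, and the threshold needs tuning: short titles such as "NHL 13" and "NHL 12" differ by a single character and would be merged at 0.2.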