Last week, I posted some R code that downloads the user and timestamp of tweets that contain a given hashtag going back as far as Twitter search will allow. As I noted in the post, the text of these tweets isn’t stored because of encoding issues with R and its JSON packages. A few people emailed asking for a version of the code that can archive the tweet text as well, and so I cleaned up my Python code for the task. The code, as posted below the break and on GitHub, supports resuming downloads and only uses standard Python libraries. You should be able to copy the methods and start downloading with just a call like doSearch("#ff") or doSearch("#feb17").
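To give a feel for the paging and resume logic, here is a minimal sketch in Python 3 using only the standard library. It is not the script itself. The endpoint and the q, rpp and max_id parameters are taken from the log output quoted in the comments; the names build_query_url, archive and fetch_page are hypothetical, and the sketch assumes that max_id is inclusive and that each tweet carries an integer "id".

```python
from urllib.parse import urlencode

# Endpoint as it appears in the log output quoted in the comments (2011-era API).
SEARCH_URL = "http://search.twitter.com/search.json"


def build_query_url(query, max_id=None, rpp=100):
    """Build a search URL; max_id, when given, asks for that ID or older."""
    params = {"q": query, "rpp": rpp}
    if max_id is not None:
        params["max_id"] = max_id
    return SEARCH_URL + "?" + urlencode(params)


def archive(query, fetch_page, max_id=None):
    """Walk backwards through the results, starting at max_id when resuming.

    fetch_page is a hypothetical callable: it takes a URL and returns a list
    of tweet dicts, newest first, each with an integer "id" key.
    """
    collected = []
    while True:
        tweets = fetch_page(build_query_url(query, max_id))
        # Assumption: max_id is inclusive, so drop the tweet we already hold.
        fresh = [t for t in tweets if max_id is None or t["id"] < max_id]
        if not fresh:
            break
        collected.extend(fresh)
        # The oldest ID seen so far becomes the starting point of the next page.
        max_id = min(t["id"] for t in fresh)
    return collected
```

Resuming then amounts to passing the smallest ID already on disk as max_id. Injecting fetch_page keeps the loop independent of how a page is actually downloaded and parsed.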
10 comments on “Archiving Tweets with Python”
1 Pings/Trackbacks for "Archiving Tweets with Python"
[...] the #march11 and #saudi tags in particular, leaving out the #march20 for now. After running my Twitter historical archiving script and `sort -n | uniq > sample.csv` on the output, I ran the following commands in [...]

Michael,
I looked for an RSS feed on your site and couldn’t seem to find one. Do you have one?
(Not interested in email subscription – just something I can see in a reader.)
Thanx much, Glenn (@glenn_ferrell)
Hi mjbommar!
Thank you for this fantastic piece of code. It’s really helpful.
Cheers,
QuantTrader
Thanks a lot for this piece of code! It's really helpful!
Hi, I have a little problem with the code, and I hope you can help me. The first time I ran it, it archived a lot of interesting data, and very quickly, which was amazing. I then scheduled the code to run every day on a Windows Python 2.7 installation, but I get no additional data…
When I check the search directly on Twitter, it returns this message: {"results":[],"max_id":71789121567334401,"since_id":69250581910388736,"refresh_url":"?since_id=71789121567334401&q=EXCEL","results_per_page":100,"page":1,"completed_in":0.017846,"warning":"adjusted since_id to 69250581910388736 (), requested since_id was older than allowed","since_id_str":"69250581910388736","max_id_str":"71789121567334401","query":"EXCEL"}
Can you help me?
P.S. Sorry for my English…
Sorry, I forgot to mention:
When I run the code in IDLE to test it, I see this output:
doSearch: !nextPage, maxID=71789121567334401
{'q': 'EXCEL', 'rpp': 100, 'max_id': 71789121567334401L}
doQuery: Fetching http://search.twitter.com/search.json?q=EXCEL&rpp=100&max_id=71789121567334401
len(tweets) = 1 => breaking.
Great post. Is there similar functionality for Facebook data, to see whether people are mentioning a term in their status? I’ve visualised some Twitter data using your technique here: http://www.tips-for-excel.com/twitter-data/
Would be great if I could add Facebook data and then monitor trends across both.
Thanks for the code.
Just wondering, how would I keep the data as JSON and not export it to Excel?
Hi Mark,
The doQuery() method contains most of the code needed to store only the JSON. Take everything in it up to the json.load call.
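A minimal sketch of that idea in Python 3, not the original doQuery(): fetch the response body and write it to disk as JSON instead of flattening it into columns. The names fetch_raw and append_raw_json are hypothetical, and the one-response-per-line layout is an assumption chosen so the file can be appended to across runs.

```python
import json
from urllib.request import urlopen


def fetch_raw(url):
    """Return the response body as text, without parsing it."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def append_raw_json(raw_text, path):
    """Append one JSON response per line to the file at path."""
    # Round-trip through json so a malformed response fails here, not later,
    # and so each record is guaranteed to occupy exactly one line.
    record = json.loads(raw_text)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Each line of the resulting file can later be read back with json.loads, so nothing in the response is lost to a fixed column layout.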
Hi Michael, great post. It seems that you used the ggplot2 library to make the graph, so I wonder how you ordered the users by increasing frequency. I’ve tried to do it, but it doesn’t work.
Sorry for my English.
Hi George,
I’m sure there are many ways, but here’s how I did it:
library(plyr)  # provides arrange(); dplyr offers an equivalent
# Now build the table of most frequent tweeters
numTop <- 30
userFrequency <- arrange(as.data.frame(table(tweets$user)), -Freq)
names(userFrequency) <- c("Name", "Freq")
userFrequencyTop <- userFrequency[1:numTop, ]
userFrequencyTop$Name <- factor(userFrequencyTop$Name, levels=userFrequencyTop$Name)