Tuesday, 10 September 2013

Structuring a program for Data Collection

As part of a wider project, I'm required to use the Twitter REST API to
collect data about a user's network, so that their network may later be
graphed in 3D space.
Previously, I hadn't given the rate limits any thought, and wrote a
JavaScript program that (through codebird.js) grabs the user's Twitter ID,
finds their friends' and followers' IDs, and recurses outward, expanding
the network in local runtime memory.
This fell far short of collecting a reasonable network depth, because of
the very restrictive rate limits Twitter imposes on API calls (15 calls
per 15 minutes! That's 1 user added to my network model a minute).
Reference: Twitter API 1.1 Rate Limits
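To put numbers on that limit, here is a back-of-the-envelope sketch in Python. The figure of 100 connections per user is an assumption chosen purely for illustration, and the sketch assumes one API call per expanded user; neither is a measured value.

```python
# Rate limit quoted above: 15 calls per 15-minute window.
CALLS_PER_WINDOW = 15
WINDOW_MINUTES = 15

# Assumption for illustration only: every user has 100 connections.
ASSUMED_CONNECTIONS_PER_USER = 100


def minutes_to_expand(depth: int) -> float:
    """Minutes needed to expand every user fewer than `depth` hops from
    the root, assuming one API call per expanded user."""
    # Users expanded at hop 0, 1, ..., depth - 1: 1 + f + f^2 + ...
    users_to_expand = sum(
        ASSUMED_CONNECTIONS_PER_USER ** hop for hop in range(depth)
    )
    calls_per_minute = CALLS_PER_WINDOW / WINDOW_MINUTES
    return users_to_expand / calls_per_minute


for depth in (1, 2, 3):
    hours = minutes_to_expand(depth) / 60
    print(f"depth {depth}: about {hours:.1f} hours")
```

Under these assumptions the time grows by roughly a factor of 100 with each extra hop, which is why collecting the data while the user waits is not an option.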
I decided I'd have to collect the user data over time (before the user
tries to graph their network), on a remote server (so that the user
doesn't have to wait for their data to be collected; that could take
hours), using PHP and a MySQL database.
After developing a database model and local cache models, I've hit a few
more snags:
First, my model requires asynchronous, periodic serving of queues (sending
data between the client and the API while staying within the rate limits),
which is fairly difficult in PHP (I'm running my script on a free web host
on which I can't configure pthread).
Second, my model requires huge execution times (hours) for the PHP script
to wait out the rate limits, but my web host will terminate my script
after 15 seconds.
My current model is (very simply):
userQueue = []
friendsQueue = []
followersQueue = []

registerUser(user):
    userQueue.add(user)
    friendsQueue.add(user)
    followersQueue.add(user)

serveUserQueue():
    serve the first 100 elements of userQueue, and send to the API as one batch
    add the API responses to the database

serveFriendsQueue():
    serve the first friendsQueue element, and send to the API
    for every user in the API response list:
        add database link between the sent user and the iterated user
        registerUser(user)

serveFollowersQueue():
    serve the first followersQueue element, and send to the API
    for every user in the API response list:
        add database link between the sent user and the iterated user
        registerUser(user)

// if only this were possible
call every 12 minutes (serveUserQueue())
call every 1 minute (serveFriendsQueue())
call every 1 minute (serveFollowersQueue())

Note this is a very, very simplified version that leaves out all the
recursion base cases and the duplicate-data prevention, but it shows the
general algorithm and the need for asynchronous calls.
It's been problem after problem on this project, and I feel like I'm
making no progress in what should be a very simple interaction with the
Twitter API.
Help?!
How should I approach this? Should I restructure my code, or my entire
approach?
Should I develop a completely PHP-friendly solution (strictly sequential
execution, though I currently just don't see how I can do this)?
I feel like 'cron jobs' are going to be a tempting but inappropriate
answer, since I lose all the state of my recursive execution when the
script ends (and I'd really like an elegant solution).
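To make the concern concrete: the queues in my model live only in memory, so they vanish when the script ends. Below is a minimal sketch of what keeping one queue in a database table might look like, so that each short run picks up where the last one stopped. It is in Python with SQLite standing in for MySQL; the table name `friends_queue`, its columns, and the `fetch_friend_ids` callback are all hypothetical names chosen for illustration, not part of my actual schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) queue table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS friends_queue ("
        " position INTEGER PRIMARY KEY AUTOINCREMENT,"
        " user_id INTEGER UNIQUE NOT NULL,"
        " served INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def enqueue(conn, user_id):
    """Queue a user; the UNIQUE constraint silently drops repeats."""
    conn.execute(
        "INSERT OR IGNORE INTO friends_queue (user_id) VALUES (?)",
        (user_id,),
    )
    conn.commit()


def serve_next(conn, fetch_friend_ids):
    """Serve the oldest unserved user; return its id, or None if done.

    `fetch_friend_ids` stands in for the single rate-limited API call.
    """
    row = conn.execute(
        "SELECT position, user_id FROM friends_queue"
        " WHERE served = 0 ORDER BY position LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    position, user_id = row
    for friend_id in fetch_friend_ids(user_id):
        conn.execute(
            "INSERT OR IGNORE INTO friends_queue (user_id) VALUES (?)",
            (friend_id,),
        )
    conn.execute(
        "UPDATE friends_queue SET served = 1 WHERE position = ?",
        (position,),
    )
    conn.commit()
    return user_id
```

With this layout a run that lasts only a few seconds could call `serve_next` once and exit, since the position in the traversal is stored in the table rather than in the call stack. Whether that counts as elegant is a separate question.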
Any direction / advice at all is appreciated.
Thanks!